Effective Concurrency: Measuring Parallel Performance — Optimizing a Concurrent Queue

This month’s Effective Concurrency column is special — it turned into a feature-length article. (I don’t know whether it’ll officially be called a “feature” or a “column” in the print issue.) “Measuring Parallel Performance: Optimizing a Concurrent Queue” just went live on DDJ’s site, and will also appear in the print magazine.

From the article:

How would you write a fast, internally synchronized queue, one that callers can use without any explicit external locking or other synchronization? Let us count the ways…or four of them, at least, and compare their performance. We’ll start with a baseline program and then successively apply three optimization techniques, each time stopping to measure each change’s relative performance for queue items of different sizes to see how much each trick really bought us.

The goal of the article is to see how to measure and understand our code’s parallel performance and measure the actual effect of specific optimizations. Disclaimer: The goal of this article is not to write the fastest possible queue in the world (though it’s pretty good). I’ve already had plenty of email on recent queue-related columns from people who sent me their “faster” implementations; writing lock-free queues seems to be a popular indoor sport. Interestingly, for well of half of the ones I received, a 30-second glance at the code was enough to determine that the code had to be incorrect. Why? Because if it doesn’t do any synchronization on the shared variables — if there aren’t any locks, atomics, fences or other synchronization in the code — then it has races, which will manifest in practice even on forgiving platforms like x86/x64, and there’s no need to look further. (For more details, see the September 2008 column, Lock-Free Code: A False Sense of Security. Even some code submissions I received in response to that very article were broken for the same reasons shown in that article.)

I hope you enjoy it. Finally, here are links to previous Effective Concurrency columns and the DDJ print magazine issue in which they first appeared:

The Pillars of Concurrency (Aug 2007)

How Much Scalability Do You Have or Need? (Sep 2007)

Use Critical Sections (Preferably Locks) to Eliminate Races (Oct 2007)

Apply Critical Sections Consistently (Nov 2007)

Avoid Calling Unknown Code While Inside a Critical Section (Dec 2007)

Use Lock Hierarchies to Avoid Deadlock (Jan 2008)

Break Amdahl’s Law! (Feb 2008)

Going Superlinear (Mar 2008)

Super Linearity and the Bigger Machine (Apr 2008)

Interrupt Politely (May 2008)

Maximize Locality, Minimize Contention (Jun 2008)

Choose Concurrency-Friendly Data Structures (Jul 2008)

The Many Faces of Deadlock (Aug 2008)

Lock-Free Code: A False Sense of Security (Sep 2008)

Writing Lock-Free Code: A Corrected Queue (Oct 2008)

Writing a Generalized Concurrent Queue (Nov 2008)

Understanding Parallel Performance (Dec 2008)

Measuring Parallel Performance: Optimizing a Concurrent Queue (Jan 2009)

5 thoughts on “Effective Concurrency: Measuring Parallel Performance — Optimizing a Concurrent Queue

  1. Just a constructive critique: Would you consider replacing those dotted charts with 3D ones ? I believe from a proper perspective they would be much more illustrative, could even use coloring like matlab’s mesh. After all you are displaying just that kind of data, with y dependent on x1,x2.

  2. ad Ex.2 [q](…) performing an extra heap allocation might seem like a bad idea at first, because heap allocations are notorious scalability busters on many of today’s memory allocators. But experience is no substitute for measurement: It turns out that, even on a system with a nonscalable allocator, the benefits often outweigh the advantages. (…)[/q]

    It seems that as the pointers/heap allocations are purely used internally to the Queue it would be an interesting exercise to see if this could further profit from a custom allocator (e.g. using an extra Heap for the queue).

    Just a thought.

  3. Okay..

    Not only there is still no reference code to all the articles mentioned, something odd for DDJ and ftp, and critique of others is still presented on every blog entry.. all of this while new books and materials are sold, the content itself is :

    a) still buggy
    b) still sub-optimal
    c) still relying on a feature that both Alexandrescu and Herb were tought about some years back on C++ lang list.
    d) very inefficient

    Put simply, MS’s own compiler team has work to do to even compete because all the .NET nonsense ate all the resources. The internal compiler is not a secret anymore either.

    More importantly, content is still doing no justice, while the author is lecturing ‘how would you do this’, ‘do that as exercise’ etc. This type of research material and speculative securitization of future work is common and has been out there for 10 years and still hasn’t produced anything or optimal.

    Finally, I would suggest getting heavily tested code out there ( btw C++0x is by NO means required) and only then writing an article or pushing for another book, conference, lecture or other sale.

    Sincerely,
    M

  4. Hi Herb,

    Interesting series of articles. One thing I’ve noticed in your concurrent code is that the while loops used to acquire the locks are empty. Would there be a performance difference from using YieldProcessor() on Windows or whatever the POSIX equivalent is to generate the PAUSE opcode? Also, might there be a measurable difference from using a locked exchange to clear the lock (this would only be an advantage if the contention were really high)?

  5. Hello Herb,

    Nice series of articles. Per our previous email discussion I would like to suggest that a RING buffer is faster, simpler and gives you control over memory usage.

    In my testing, I ran a series of data object sizes (1, 10, 100, 200, 500, 1000) and graphed both the transactions per second and bytes per second.

    I posted my code to;
    http://home.comcast.net/~lang.dennis/code/

Comments are closed.