Effective Concurrency: Measuring Parallel Performance — Optimizing a Concurrent Queue

This month’s Effective Concurrency column is special — it turned into a feature-length article. (I don’t know whether it’ll officially be called a “feature” or a “column” in the print issue.) “Measuring Parallel Performance: Optimizing a Concurrent Queue” just went live on DDJ’s site, and will also appear in the print magazine.

From the article:

How would you write a fast, internally synchronized queue, one that callers can use without any explicit external locking or other synchronization? Let us count the ways…or four of them, at least, and compare their performance. We’ll start with a baseline program and then successively apply three optimization techniques, each time stopping to measure each change’s relative performance for queue items of different sizes to see how much each trick really bought us.

The goal of the article is to see how to measure and understand our code’s parallel performance and measure the actual effect of specific optimizations. Disclaimer: The goal of this article is not to write the fastest possible queue in the world (though it’s pretty good). I’ve already had plenty of email on recent queue-related columns from people who sent me their “faster” implementations; writing lock-free queues seems to be a popular indoor sport. Interestingly, for well of half of the ones I received, a 30-second glance at the code was enough to determine that the code had to be incorrect. Why? Because if it doesn’t do any synchronization on the shared variables — if there aren’t any locks, atomics, fences or other synchronization in the code — then it has races, which will manifest in practice even on forgiving platforms like x86/x64, and there’s no need to look further. (For more details, see the September 2008 column, Lock-Free Code: A False Sense of Security. Even some code submissions I received in response to that very article were broken for the same reasons shown in that article.)

I hope you enjoy it. Finally, here are links to previous Effective Concurrency columns and the DDJ print magazine issue in which they first appeared:

The Pillars of Concurrency (Aug 2007)

How Much Scalability Do You Have or Need? (Sep 2007)

Use Critical Sections (Preferably Locks) to Eliminate Races (Oct 2007)

Apply Critical Sections Consistently (Nov 2007)

Avoid Calling Unknown Code While Inside a Critical Section (Dec 2007)

Use Lock Hierarchies to Avoid Deadlock (Jan 2008)

Break Amdahl’s Law! (Feb 2008)

Going Superlinear (Mar 2008)

Super Linearity and the Bigger Machine (Apr 2008)

Interrupt Politely (May 2008)

Maximize Locality, Minimize Contention (Jun 2008)

Choose Concurrency-Friendly Data Structures (Jul 2008)

The Many Faces of Deadlock (Aug 2008)

Lock-Free Code: A False Sense of Security (Sep 2008)

Writing Lock-Free Code: A Corrected Queue (Oct 2008)

Writing a Generalized Concurrent Queue (Nov 2008)

Understanding Parallel Performance (Dec 2008)

Measuring Parallel Performance: Optimizing a Concurrent Queue (Jan 2009)

(out of order) Effective Concurrency: Writing Lock-Free Code — A Corrected Queue

Oops, I just noticed that I forgot to blog about one recent Effective Concurrency column: “Writing Lock-Free Code: A Corrected Queue” which also appeared in the October 2008 print issue of Dr. Dobb’s Journal.

From the article:

As we saw last month [1], lock-free coding is hard even for experts. There, I dissected a published lock-free queue implementation [2] and examined why the code was quite broken. This month, let’s see how to do it right.

Here is the complete-as-of-this-writing set of links to the published Effective Concurrency columns. As always, the months reflect the magazine print issue dates; they usually hit the web a bit sooner:

August 2007: The Pillars of Concurrency

September 2007: How Much Scalability Do You Have or Need?

October 2007: Use Critical Sections (Preferably Locks) to Eliminate Races

November 2007: Apply Critical Sections Consistently

December 2007: Avoid Calling Unknown Code While Inside a Critical Section

January 2007: Use Lock Hierarchies to Avoid Deadlock

February 2008: Break Amdahl’s Law!

March 2008: Going Superlinear

April 2008: Super Linearity and the Bigger Machine

May 2008: Interrupt Politely

June 2008: Maximize Locality, Minimize Contention

July 2008: Choose Concurrency-Friendly Data Structures

August 2008: The Many Faces of Deadlock

September 2008: Lock-Free Code: A False Sense of Security

October 2008: Writing Lock-Free Code: A Corrected Queue

November 2008: Writing a Generalized Concurrent Queue

December 2008: Understanding Parallel Performance

Effective Concurrency: Understanding Parallel Performance

Wow, DDJ just posted the previous one a few days ago, and already the next Effective Concurrency column is available: “Understanding Parallel Performance” just went live, and will also appear in the print magazine.

From the article:

Let’s say that we’ve slickly written our code to apply divide-and-conquer algorithms and concurrent data structures and parallel traversals and all our other cool tricks that make our code wonderfully scalable in theory. Question: How do we know how well we’ve actually succeeded? Do we really know, or did we just try a couple of tests on a quad-core that looked reasonable and call it good? What key factors must we measure to understand our code’s performance, and answer not only whether our code scales, but quantify how well under different circumstances and workloads? What costs of concurrency do we have to take into account?

This month, I’ll summarize some key issues we need to keep in mind to accurately analyze the real performance of our parallel code. I’ll list some basic considerations, and then some common costs. Next month, I have a treat in store: We’ll take some real code and apply these techniques to analyze its performance in detail as we successively apply a number of optimizations and measure how much each one actually buys us, under what conditions and in what directions, and why.

I hope you enjoy it. Finally, here are links to previous Effective Concurrency columns (based on the magazine print issue dates):

August 2007: The Pillars of Concurrency

September 2007: How Much Scalability Do You Have or Need?

October 2007: Use Critical Sections (Preferably Locks) to Eliminate Races

November 2007: Apply Critical Sections Consistently

December 2007: Avoid Calling Unknown Code While Inside a Critical Section

January 2007: Use Lock Hierarchies to Avoid Deadlock

February 2008: Break Amdahl’s Law!

March 2008: Going Superlinear

April 2008: Super Linearity and the Bigger Machine

May 2008: Interrupt Politely

June 2008: Maximize Locality, Minimize Contention

July 2008: Choose Concurrency-Friendly Data Structures

August 2008: The Many Faces of Deadlock

September 2008: Lock-Free Code: A False Sense of Security

October 2008: Writing Lock-Free Code: A Corrected Queue

November 2008: Writing a Generalized Concurrent Queue

December 2008: Understanding Parallel Performance

Effective Concurrency: Writing a Generalized Concurrent Queue

The next Effective Concurrency column, “Writing a Generalized Concurrent Queue”, just went live on DDJ’s site, and also appears in the print magazine.

From the article:

Last month [1], I showed code for a lock-free queue that supported the limited case of exactly two threads—one producer, and one consumer. That’s useful, but maybe not as exciting now that our first rush of lock-free coding glee has worn off. This month, let’s tackle the general problem of supporting multiple producers and multiple consumers with as much concurrency as possible. The code in this article uses four main design techniques: …

I hope you enjoy it. Finally, here are links to previous Effective Concurrency columns (based on the magazine print issue dates):

August 2007: The Pillars of Concurrency

September 2007: How Much Scalability Do You Have or Need?

October 2007: Use Critical Sections (Preferably Locks) to Eliminate Races

November 2007: Apply Critical Sections Consistently

December 2007: Avoid Calling Unknown Code While Inside a Critical Section

January 2007: Use Lock Hierarchies to Avoid Deadlock

February 2008: Break Amdahl’s Law!

March 2008: Going Superlinear

April 2008: Super Linearity and the Bigger Machine

May 2008: Interrupt Politely

June 2008: Maximize Locality, Minimize Contention

July 2008: Choose Concurrency-Friendly Data Structures

August 2008: The Many Faces of Deadlock

September 2008: Lock-Free Code: A False Sense of Security

October 2008: Writing Lock-Free Code: A Corrected Queue

November 2008: Writing a Generalized Concurrent Queue

Ralph Johnson on Parallel Programming Patterns

A few days ago at UIUC, Ralph Johnson gave a very nice talk on “Parallel Programming Patterns.” It’s now online, and here’s the abstract:

Parallel programming is hard. One proposed solution is to provide a standard set of patterns. Learning the patterns would help people to become expert parallel programmers. The patterns would provide a vocabulary that would let programmers think about their programs at a higher level than the programming language. The patterns could steer programmers away from common errors and towards good design principles.

There have been a number of papers about parallel patterns, and one book Patterns for Parallel Programming. None of them have become popular. I think the problem is that parallel programming is diverse and requires more design expertise than traditional software design. Thus, parallel programming experts use more patterns than parallel programming expert. I’ll critique the existing patterns and explain what I think should be done to make a set of patterns that can be as effective for parallel programming as patterns have been for object-oriented design.

If Johnson’s name sounds familiar, it should: He’s one of the “Gang of Four” authors of the seminal book Design Patterns.

Recommended viewing.

Effective Concurrency Course: Sep 22-24, 2008

The first offering of the three-day Effective Concurrency course in May went very well. We’re doing it again later this month — this will be the last offering this year.

Here’s the brief information (more details below):

3-Day Seminar: Effective Concurrency

September 22-4, 2008
Bellevue, WA, USA
Developed and taught by Herb Sutter

This course covers the fundamental tools that software developers need to write effective concurrent software for both single-core and multi-core/many-core machines. To use concurrency effectively, we must identify and solve four key challenges:

  • Leverage the ability to perform and manage work asynchronously
  • Build applications that naturally run faster on new hardware having more and more cores
  • Manage shared objects in memory effectively to avoid races and deadlocks
  • Engineer specifically for high performance

This seminar will equip attendees to reason correctly about concurrency requirements and tradeoffs, to migrate existing code bases to be concurrency-enabled, and to achieve key success factors for a concurrent programming project. Most code examples in the course can be directly translated to popular platforms and concurrency libraries, including Linux, Windows, Java, .NET, pthreads, and the forthcoming ISO C++0x standard.

Here’s a summary of what we’ll cover during the three days:

Fundamentals

  • Define basic concurrency goals and requirements
  • Understand applications’ scalability needs
  • Key concurrency patterns

Isolation: Keep Work Separate

  • Running tasks in isolation and communicate via async messages
  • Integrating multiple messaging systems, including GUIs and sockets
  • Building responsive applications using background workers
  • Threads vs. thread pools

Scalability: Re-enable the Free Lunch

  • When and how to use more cores 
  • Exploiting parallelism in algorithms 
  • Exploiting parallelism in data structures 
  • Breaking the scalability barrier

Consistency: Don’t Corrupt Shared State

  • The many pitfalls of locks–deadlock, convoys, etc.
  • Locking best practices
  • Reducing the need for locking shared data
  • Safe lock-free coding patterns
  • Avoiding the pitfalls of general lock-free coding
  • Races and race-related effects

Migrating Existing Code Bases to Use Concurrency

Near-Future Tools and Features

High Performance Concurrency

  • Machine architecture and concurrency
  • Costs of fundamental operations, including locks, context switches, and system calls
  • Memory and cache effects
  • Data structures that support and undermine concurrency
  • Enabling linear and superlinear scaling

I hope to get to meet some of you in the Seattle area!

Embedded Multicore Development Webinar with Lee, Reinders, and Truchard

Last month, I was privileged to be part of a panel in a webinar on Embedded Multicore Development moderated by Richard Nass, Editor-in-Chief of Embedded Systems Design, Embedded.com, and the Embedded Systems Conferences. It’s online and available on demand.

I say “privileged” especially because of the stature of the other panelists. These distinguished gentlemen were:

It was a fun discussion, and I hope informative.

Effective Concurrency: Lock-Free Code — A False Sense of Security

DDJ posted the next Effective Concurrency column a couple of weeks earlier than usual. You can find it here: “Lock-Free Code: A False Sense of Security”, just went live on DDJ’s site, and also appears in the print magazine.
 
This is a special column in a way, because I rarely critique someone else’s published code. However, mere weeks ago DDJ itself published an article with fundamentally broken code that tried to show how to write a simplified lock-free queue. I corresponded with the author, Petru Marginean, and his main reviewer, Andrei Alexandrescu, to discuss the problems, and they have patched the code somewhat and added a disclaimer to the article. But the issues need to be addressed, and so Petru kindly let me vivisect his code in public in this column (and the next, not yet available, which will show how to do it right).
 
From the article:

[Lock-free code is] hard even for experts. It’s easy to write lock-free code that appears to work, but it’s very difficult to write lock-free code that is correct and performs well. Even good magazines and refereed journals have published a substantial amount of lock-free code that was actually broken in subtle ways and needed correction.

To illustrate, let’s dissect some peer-reviewed lock-free code that was published here in DDJ just two months ago …

I hope you enjoy it. (Note: Yes, the title is a riff on Tom Cargill’s classic article “Exception Handling: A False Sense of Security”, one of Scott Meyers’ five “Most Important C++ Non-Book Publications…Ever”.)
 
Finally, here are links to previous Effective Concurrency columns (based on the magazine print issue dates):
August 2007 The Pillars of Concurrency
September 2007 How Much Scalability Do You Have or Need?
October 2007 Use Critical Sections (Preferably Locks) to Eliminate Races
November 2007 Apply Critical Sections Consistently
December 2007 Avoid Calling Unknown Code While Inside a Critical Section
January 2007 Use Lock Hierarchies to Avoid Deadlock
February 2008 Break Amdahl’s Law!
March 2008 Going Superlinear
April 2008 Super Linearity and the Bigger Machine
May 2008 Interrupt Politely
June 2008 Maximize Locality, Minimize Contention
July 2008 Choose Concurrency-Friendly Data Structures
August 2008 The Many Faces of Deadlock
September 2008 “Lock-Free Code: A False Sense of Security”

Server Concurrency != Client Concurrency

Today I received an email that asked:

I have recently come across your excellent articles on concurrency and the changes in software writing paradigm. They make a lot of sense, but I am having trouble translating them to my world of Telecom oriented web services, where practically everything is run through a DBMS. It seems to me we get everything “free”, simply by using an inherently concurrent multi-everything beast such as that :-) .

Could you please share your thoughts on the issue in one of your coming blog entries? It seems to me nowadays most complex systems would take advantage of a DBMS, certainly any application that is internet based, telecom oriented, or enterprise level. Be it in C++, Java, or PHP and its ilk, using a DBMS – often as a sort of message queue – is one of the best practices that ensures parallelism.

imageSure. At right is a slide I give in talks that summarizes the answer to this question, and I’ve addressed this and other similar issues in an ACM Queue article.

The problem with taking advantage of multicore/manycore hardware isn’t (as much) on the server, it’s on the client. When experienced people say things like, “but the concurrency problem is already solved, we’ve been building scalable software for years,” that’s server or niche-client application people talking. That kind of laid-back sound bite sure isn’t coming from mainstream client application developers.

Typical Server Workloads

On the server:

  • Workloads typically already have lots of inherent concurrency (000s or 000,000s of incoming requests for web/dbms/etc. operations), and it’s easy to launch independent requests concurrently.
  • Shared data is typically inside highly structured relational databases where we have decades of experience with automatic concurrency control. The DBMS itself knows how to optimistically run transactions in parallel and control conflicts by escalating row locks to page locks, index locks to table locks, and so on.
  • The programming model is typically transactions, and all the programmer has to know about is to write “begin transaction; /* … read/write whatever stuff you feel the need to, then … */ end transaction;” which is about as sweet as it gets.

We already know how to build somewhat scalable server apps. Sure, it’s still rocket science and takes expert knowledge to do well. But we generally know the rocket science, have experts who can implement it with repeatable success, and have regularly scheduled missions to the “scalable servers” space station. With some care, we can say that the “concurrency problem is already solved” here.

Typical Client Workloads

The world is very different for typical mainstream client applications (i.e., I’m not talking about Photoshop and a handful of others), where:

  • Workloads don’t have lots of inherent concurrency. The user clicks one button, and we have to figure out how to divide the work and recombine it in order to get the answer faster on many cores.
  • Shared data is typically in unstructured pointer-chasing graphs of objects in shared memory that require explicit concurrency control. Note that “unstructured” doesn’t mean there’s no structure — of course there’s some — but it’s a gloriously diverse pile of objects and containers, and more like an organically growing shantytown with unplanned twisty little alleys and passages, than the nice rectangular downtown city street plan of a nice rectangular database table.
  • The programming model to protect shared data is to use error-prone explicit locks. You have to remember which locks protect which data, and not to acquire them in inconsistent and deadlock-prone orders.

We’re still discovering and productizing the rocket science here. You could say that the tools like OpenMP that we do have now are still at the V-2 stage — they have limited applicability, are somewhat fussy, and don’t always land where you aim them.

But we’re working on it. Up-and-coming tools like Threading Building Blocks are like the Mercury and Venera missions, setting out to reach successively higher goals and repeatability… and we’re starting to see what are perhaps Apollo– and ISS-class missions in the form of PLINQ, the Task Parallel Library, and one for native C++ we’ll be announcing in October at PDC. In part, these tools are trying to see how much we can make client workloads look more like server workloads, notably in providing a transaction-oriented programming model. For example, transactional memory is an area of active research that would let us write “begin transaction; /* … read/write whatever memory variables you feel the need to, then … */ end transaction;”, and if successful it could eventually replace many or even most existing uses of locks.

We have rightly celebrated some successful ‘manned’ flights with client products like Photoshop (parallel rendering) and Excel (parallel recalc) that scale to a number of cores. We’re on the road to, but still working toward, establishing the infrastructure and technology base to enable regularly scheduled commercial flights/shipments of scalable client applications that “light up” on multicore/manycore machines.

Effective Concurrency: The Many Faces of Deadlock

The latest Effective Concurrency column, “The Many Faces of Deadlock”, just went live on DDJ’s site, and also appears in the print magazine. From the article:

… That’s the classic deadlock example from college. Of course, two isn’t a magic number. An improved definition of deadlock is: “When N threads enter a locking cycle where each tries to take a lock the next already holds.”

“But wait,” someone might say. “I once had a deadlock just like the code you just showed, but it didn’t involve locks at all—it involved messages.” …
I hope you enjoy it.
 
Finally, here are links to previous Effective Concurrency columns (based on the magazine print issue dates):
August 2007 The Pillars of Concurrency
September 2007 How Much Scalability Do You Have or Need?
October 2007 Use Critical Sections (Preferably Locks) to Eliminate Races
November 2007 Apply Critical Sections Consistently
December 2007 Avoid Calling Unknown Code While Inside a Critical Section
January 2007 Use Lock Hierarchies to Avoid Deadlock
February 2008 Break Amdahl’s Law!
March 2008 Going Superlinear
April 2008 Super Linearity and the Bigger Machine
May 2008 Interrupt Politely
June 2008 Maximize Locality, Minimize Contention
July 2008 Choose Concurrency-Friendly Data Structures
August 2008 The Many Faces of Deadlock