Embedded Multicore Development Webinar with Lee, Reinders, and Truchard

Last month, I was privileged to be part of a panel in a webinar on Embedded Multicore Development moderated by Richard Nass, Editor-in-Chief of Embedded Systems Design, Embedded.com, and the Embedded Systems Conferences. It’s online and available on demand.

I say “privileged” especially because of the stature of the other panelists: Edward Lee, James Reinders, and James Truchard.

It was a fun discussion, and I hope an informative one.

Effective Concurrency: Lock-Free Code — A False Sense of Security

DDJ posted the next Effective Concurrency column a couple of weeks earlier than usual. “Lock-Free Code: A False Sense of Security” just went live on DDJ’s site, and it also appears in the print magazine.
 
This is a special column in a way, because I rarely critique someone else’s published code. However, mere weeks ago DDJ itself published an article with fundamentally broken code that tried to show how to write a simplified lock-free queue. I corresponded with the author, Petru Marginean, and his main reviewer, Andrei Alexandrescu, to discuss the problems, and they have patched the code somewhat and added a disclaimer to the article. But the issues need to be addressed, and so Petru kindly let me vivisect his code in public in this column (and the next, not yet available, which will show how to do it right).
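To give a flavor of why this is hard, here is a minimal sketch of my own (not the code dissected in the column) showing the publish/consume handoff at the heart of a single-producer, single-consumer queue. I’ve written it with C++11 atomics, which postdate the column; the subtle part is the release/acquire pairing. With a plain bool instead, the consumer could observe the “ready” flag before the payload it guards.

```cpp
#include <atomic>

// A minimal sketch (my illustration, not the column's code): the
// producer/consumer handoff inside a single-producer, single-consumer
// lock-free queue. The subtlety is the memory ordering: 'ready' must be
// a release/acquire pair, or the consumer may see ready == true before
// it sees the fully written payload.
struct Handoff {
    int payload = 0;
    std::atomic<bool> ready{false};

    void produce(int value) {                          // producer thread only
        payload = value;                               // 1. write the data...
        ready.store(true, std::memory_order_release);  // 2. ...then publish it
    }

    bool try_consume(int& out) {                       // consumer thread only
        if (!ready.load(std::memory_order_acquire))    // pairs with the release
            return false;
        out = payload;                                 // safe: we saw the publish
        ready.store(false, std::memory_order_relaxed); // hand the slot back
        return true;
    }
};
```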
 
From the article:

[Lock-free code is] hard even for experts. It’s easy to write lock-free code that appears to work, but it’s very difficult to write lock-free code that is correct and performs well. Even good magazines and refereed journals have published a substantial amount of lock-free code that was actually broken in subtle ways and needed correction.

To illustrate, let’s dissect some peer-reviewed lock-free code that was published here in DDJ just two months ago …

I hope you enjoy it. (Note: Yes, the title is a riff on Tom Cargill’s classic article “Exception Handling: A False Sense of Security”, one of Scott Meyers’ five “Most Important C++ Non-Book Publications…Ever”.)
 
Finally, here are links to previous Effective Concurrency columns (based on the magazine print issue dates):
August 2007 The Pillars of Concurrency
September 2007 How Much Scalability Do You Have or Need?
October 2007 Use Critical Sections (Preferably Locks) to Eliminate Races
November 2007 Apply Critical Sections Consistently
December 2007 Avoid Calling Unknown Code While Inside a Critical Section
January 2008 Use Lock Hierarchies to Avoid Deadlock
February 2008 Break Amdahl’s Law!
March 2008 Going Superlinear
April 2008 Super Linearity and the Bigger Machine
May 2008 Interrupt Politely
June 2008 Maximize Locality, Minimize Contention
July 2008 Choose Concurrency-Friendly Data Structures
August 2008 The Many Faces of Deadlock
September 2008 Lock-Free Code: A False Sense of Security

Server Concurrency != Client Concurrency

Today I received an email that asked:

I have recently come across your excellent articles on concurrency and the changes in software writing paradigm. They make a lot of sense, but I am having trouble translating them to my world of Telecom oriented web services, where practically everything is run through a DBMS. It seems to me we get everything “free”, simply by using an inherently concurrent multi-everything beast such as that :-) .

Could you please share your thoughts on the issue in one of your coming blog entries? It seems to me nowadays most complex systems would take advantage of a DBMS, certainly any application that is internet based, telecom oriented, or enterprise level. Be it in C++, Java, or PHP and its ilk, using a DBMS – often as a sort of message queue – is one of the best practices that ensures parallelism.

Sure. At right is a slide I show in talks that summarizes the answer to this question, and I’ve addressed this and other similar issues in an ACM Queue article.

The problem with taking advantage of multicore/manycore hardware isn’t (as much) on the server; it’s on the client. When experienced people say things like, “but the concurrency problem is already solved, we’ve been building scalable software for years,” that’s server or niche-client application people talking. That kind of laid-back sound bite sure isn’t coming from mainstream client application developers.

Typical Server Workloads

On the server:

  • Workloads typically already have lots of inherent concurrency (thousands or millions of incoming requests for web/DBMS/etc. operations), and it’s easy to launch independent requests concurrently.
  • Shared data is typically inside highly structured relational databases where we have decades of experience with automatic concurrency control. The DBMS itself knows how to optimistically run transactions in parallel and control conflicts by escalating row locks to page locks, index locks to table locks, and so on.
  • The programming model is typically transactions, and all the programmer has to know is to write “begin transaction; /* … read/write whatever stuff you feel the need to, then … */ end transaction;” which is about as sweet as it gets (see the sketch just after this list).
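To make that sweetness concrete, here is a minimal sketch of the transaction model, using SQLite’s C API purely as a stand-in for whatever DBMS a real server talks to (the “accounts” table is hypothetical). The programmer brackets the work; the engine does the locking, conflict detection, and rollback.

```cpp
#include <sqlite3.h>

// Minimal sketch of the server-side transaction model. SQLite is just a
// concrete stand-in for any DBMS; the 'accounts' table is illustrative.
// Everything between BEGIN and COMMIT is one atomic, isolated unit.
int transfer(sqlite3* db) {
    if (sqlite3_exec(db, "BEGIN TRANSACTION;",
                     nullptr, nullptr, nullptr) != SQLITE_OK)
        return 1;

    // ... read/write whatever rows you feel the need to, e.g.:
    sqlite3_exec(db, "UPDATE accounts SET balance = balance - 100 WHERE id = 1;",
                 nullptr, nullptr, nullptr);
    sqlite3_exec(db, "UPDATE accounts SET balance = balance + 100 WHERE id = 2;",
                 nullptr, nullptr, nullptr);

    // Commit; on failure, roll back (and typically retry).
    if (sqlite3_exec(db, "COMMIT;", nullptr, nullptr, nullptr) != SQLITE_OK) {
        sqlite3_exec(db, "ROLLBACK;", nullptr, nullptr, nullptr);
        return 1;
    }
    return 0;
}
```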

We already know how to build somewhat scalable server apps. Sure, it’s still rocket science and takes expert knowledge to do well. But we generally know the rocket science, have experts who can implement it with repeatable success, and have regularly scheduled missions to the “scalable servers” space station. With some care, we can say that the “concurrency problem is already solved” here.

Typical Client Workloads

The world is very different for typical mainstream client applications (i.e., I’m not talking about Photoshop and a handful of others), where:

  • Workloads don’t have lots of inherent concurrency. The user clicks one button, and we have to figure out how to divide the work and recombine it in order to get the answer faster on many cores.
  • Shared data is typically in unstructured pointer-chasing graphs of objects in shared memory that require explicit concurrency control. Note that “unstructured” doesn’t mean there’s no structure — of course there’s some — but it’s a gloriously diverse pile of objects and containers, more like an organically grown shantytown with unplanned twisty little alleys and passages than the nice rectangular street plan of a downtown, or of a database table.
  • The programming model to protect shared data is error-prone explicit locks. You have to remember which locks protect which data, and not to acquire them in inconsistent, deadlock-prone orders (see the sketch just after this list).
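Here is the promised sketch of that last bullet, in modern C++. std::scoped_lock (C++17, well after this post was written) bakes deadlock avoidance into a helper so you don’t have to remember a global lock order; without something like it, you do.

```cpp
#include <mutex>

// Two objects, each protected by its own mutex. If thread A locks a then b
// while thread B locks b then a, the program can deadlock. The programmer
// must either impose a global lock order by convention, or use a helper
// like std::scoped_lock, which acquires all its locks with a built-in
// deadlock-avoidance algorithm.
struct Account {
    std::mutex m;
    int balance = 0;
};

void transfer(Account& from, Account& to, int amount) {
    // Assumes from and to are distinct objects.
    std::scoped_lock guard(from.m, to.m);  // locks both, in a safe order
    from.balance -= amount;
    to.balance   += amount;
}
```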

We’re still discovering and productizing the rocket science here. You could say that the tools like OpenMP that we do have now are still at the V-2 stage — they have limited applicability, are somewhat fussy, and don’t always land where you aim them.
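To be concrete about “limited applicability,” here is roughly the shape of loop that classic OpenMP flies well (a minimal sketch of my own). Regular, data-parallel loops over arrays fit; the irregular, pointer-chasing workloads typical of clients mostly don’t.

```cpp
#include <cmath>
#include <vector>

// The sweet spot of classic OpenMP: a regular loop over an array, split
// across cores with one pragma. Workloads outside this mold get little help.
// (Compile with OpenMP enabled, e.g. -fopenmp; otherwise the pragma is
// simply ignored and the loop runs sequentially.)
void sqrt_all(std::vector<double>& v) {
    #pragma omp parallel for
    for (long i = 0; i < static_cast<long>(v.size()); ++i)
        v[i] = std::sqrt(v[i]);
}
```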

But we’re working on it. Up-and-coming tools like Threading Building Blocks are like the Mercury and Venera missions, setting out to reach successively higher goals and repeatability… and we’re starting to see what are perhaps Apollo– and ISS-class missions in the form of PLINQ, the Task Parallel Library, and one for native C++ we’ll be announcing in October at PDC. In part, these tools are trying to see how much we can make client workloads look more like server workloads, notably in providing a transaction-oriented programming model. For example, transactional memory is an area of active research that would let us write “begin transaction; /* … read/write whatever memory variables you feel the need to, then … */ end transaction;”, and if successful it could eventually replace many or even most existing uses of locks.
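As a small taste of the Mercury-class tooling, here is a minimal Threading Building Blocks sketch (using the lambda-based API of today’s TBB rather than its 2008-era interface): the programmer states what may run in parallel, and the runtime decides how to split and schedule it across cores.

```cpp
#include <tbb/blocked_range.h>
#include <tbb/parallel_for.h>
#include <vector>

// Threading Building Blocks sketch: express the available parallelism and
// let the library split the range across cores and balance the load. This
// is the "make client code look more like server code" direction: the
// programmer states intent, the runtime does the scheduling.
void double_all(std::vector<double>& v) {
    tbb::parallel_for(
        tbb::blocked_range<std::size_t>(0, v.size()),
        [&](const tbb::blocked_range<std::size_t>& r) {
            for (std::size_t i = r.begin(); i != r.end(); ++i)
                v[i] *= 2.0;
        });
}
```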

We have rightly celebrated some successful ‘manned’ flights with client products like Photoshop (parallel rendering) and Excel (parallel recalc) that scale to a number of cores. We’re on the road to, but still working toward, establishing the infrastructure and technology base to enable regularly scheduled commercial flights/shipments of scalable client applications that “light up” on multicore/manycore machines.