How parallelism demos are useful

In "Break Amdahl’s Law!", I described ways to enable scalable applications, and wrote in part:

But don’t show me ray-traced bouncing balls or Mandelbrot graphics or the other usual embarrassingly parallel but niche (or downright useless) clichés—what we’re looking for are real ideas of real software we could imagine real kids and grandmothers using that could become possible on manycore machines. Here’s a quick potential example: Researchers know how to do speech recognition with near-human-quality accuracy in the lab, which is astonishingly good and would enable breakthrough user interfaces if it could be done that reliably in real time. The only trouble is that the software takes a week to run…on a single core. Can it be parallelized, and if so how many cores would we need to get a useful answer in a useful time? Bunches of smart people (and the smart money behind them) are investing hard work not only to find out the answer to that question, but also to find more questions like it.

Just to be clear, in the first sentence above I didn’t mean to say that the standard demos are useless — far from it (see below). This was intended to be a challenging call to action to not be satisfied with demos alone, but for us as an industry to imagine and develop compelling mainstream end applications that are multicore- and manycore-scalable. (To make that clearer, I’m going to ditch and rewrite the first sentence above for the Effective Concurrency book.)

The standard demos are indeed important — not only as proofs of concept that the technology really does enable scalable parallel code, but at least as importantly as tools that help us understand how a given parallel technology or runtime works.

To understand concurrency mechanics/characteristics

For example, consider a standard Mandelbrot-rendering demo, but with the twist that each worker thread (core) renders its portion of the work in a different color. On a traditional runtime with static scheduling, workers with easy-to-compute sections finish early and wait idly while the other workers finish their harder-to-compute sections; visually, each colored section is the same size, and some colored sections appear faster than others. But on a runtime with dynamic scheduling, and especially one that supports Cilk-style work stealing, we get efficient load balancing: workers that are assigned "easy" sections and finish early can contribute to remaining work in harder-to-compute areas. Visually, some sections fill in with one color, and then the same color starts to add to other yet-unfinished sections. The bands of color let us see which worker did what work and helped out in what other areas, and the overall visual progress of the whole image lets us see that the system as a whole is doing useful work the whole time. So the colored Mandelbrot demo is a very useful tool that lets us understand what’s going on quickly and clearly, in a way that presenting the results in a numerical table can’t.
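To make the contrast concrete, here is a minimal Python sketch of the colored-Mandelbrot idea (the code and all names in it are my own illustration, not from any particular demo): each worker records its id as the "color" of every row it renders, under either static banding or dynamic queue-based scheduling.

```python
# Sketch: static vs. dynamic scheduling of Mandelbrot rows.
# Each worker "colors" the rows it computes with its own id, mimicking the demo.
from queue import Queue, Empty
from threading import Thread

def mandelbrot_row(y, width=64, height=64, max_iter=50):
    """Iteration counts for one row of the Mandelbrot set."""
    row = []
    for x in range(width):
        c = complex(-2.0 + 3.0 * x / width, -1.5 + 3.0 * y / height)
        z, n = 0j, 0
        while abs(z) <= 2 and n < max_iter:
            z = z * z + c
            n += 1
        row.append(n)
    return row

def render(num_workers=4, height=64, dynamic=True):
    """Return {row_index: worker_id} showing which worker rendered which row."""
    colors = {}
    if dynamic:
        # Dynamic scheduling: workers pull the next unfinished row from a
        # shared queue, so early finishers keep helping until all rows are done.
        work = Queue()
        for y in range(height):
            work.put(y)
        def worker(wid):
            while True:
                try:
                    y = work.get_nowait()
                except Empty:
                    return
                mandelbrot_row(y)
                colors[y] = wid
    else:
        # Static scheduling: each worker gets a fixed contiguous band up front.
        band = height // num_workers
        def worker(wid):
            for y in range(wid * band, (wid + 1) * band):
                mandelbrot_row(y)
                colors[y] = wid
    threads = [Thread(target=worker, args=(w,)) for w in range(num_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return colors
```

With static banding each worker’s rows form one contiguous band; with the queue, faster workers simply pull more rows, which is the load-balancing effect the colored demo makes visible. (Real work stealing uses per-worker deques rather than one shared queue; this sketch only approximates the scheduling behavior.)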

To illustrate a path to future applications

Similarly, ray-tracing may well make multicore and manycore CPUs the future of photorealistic graphics in a way that may not be applicable to standard GPUs (time will tell). As shown in the blogs below, ray-tracing makes a qualitative difference in the nature of lighting models. But can’t we do this already with GPUs? Interestingly, not necessarily; ray-tracing seems to be an algorithm that is hard to accelerate on GPUs, which have limited ability to do the fine-grained scheduling that runtimes based on techniques like work stealing are well suited for. Some links:

Yes, demos are useful

My point in the original quote above (which I see I could have stated more clearly) was simply this: Once we’ve achieved the demos, we shouldn’t sit back and declare victory. The demos aren’t the end goal; we still need the applications.

Concurrency demos are useful to help prove a technology can scale and to understand how it works, and some of them show potentially fruitful and exciting paths to real and compelling manycore-exploiting applications, but it’s still up to us as an industry to continue to imagine and build those applications. I believe that we can and will.

Effective Concurrency: Going Superlinear

The latest Effective Concurrency column, "Going Superlinear", just went live on DDJ’s site, and will also appear in the print magazine. From the article:

We spend most of our scalability lives inside a triangular box, shown in Figure 1. It reminds me of the early days of flight: We try to lift ourselves away from the rough ground of zero scalability and fly as close as possible to the cloud ceiling of linear speedup. Normally, the Holy Grail of parallel scalability is to linearly use P processors or cores to complete some work almost P times faster, up to some hopefully high number of cores before our code becomes bound on memory or I/O or something else that means diminishing returns for adding more cores. As Figure 1 illustrates, the traditional shape of our "success" curve lies inside the triangle.

Sometimes, however, we can equip our performance plane with extra tools and safely break through the linear ceiling into the superlinear stratosphere. So the question is: "Under what circumstances can we use P cores to do work more than P times faster?" There are two main ways to enter that rarefied realm:

  • Do disproportionately less work.
  • Harness disproportionately more resources.

This month and next, we’ll consider situations and techniques that fall into one or both of these categories. …
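To illustrate the first bullet with a toy example of my own (not from the article): a sequential search must examine every element before the match, while P workers scanning disjoint chunks can get lucky and find the match almost immediately, doing disproportionately less work.

```python
# Sketch: superlinear speedup by doing disproportionately less work.
# A sequential scan examines every element up to the match; splitting the
# range across P workers lets one worker hit the match almost immediately.

def sequential_search(data, target):
    """Return (index, elements_examined) for a left-to-right scan."""
    for i, v in enumerate(data):
        if v == target:
            return i, i + 1
    return -1, len(data)

def chunked_search(data, target, workers=4):
    """Simulate P workers scanning equal chunks in parallel; report the work
    done by the worker that finds the match (the others would be cancelled,
    and in a real run they execute concurrently, not before it)."""
    chunk = len(data) // workers
    for w in range(workers):
        start = w * chunk
        for steps, i in enumerate(range(start, min(start + chunk, len(data))), 1):
            if data[i] == target:
                return i, steps
    return -1, len(data)

data = list(range(1000))
target = 750            # happens to sit at the very start of worker 3's chunk
_, seq_work = sequential_search(data, target)
_, par_work = chunked_search(data, target, workers=4)
# The sequential scan examines 751 elements; the finding worker examines 1,
# so 4 workers finish far more than 4x sooner -- for this lucky input.
```

Of course, an unlucky target position gives no such win; superlinear effects like this depend on the workload.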

I hope you enjoy it.
Finally, here are links to previous Effective Concurrency columns (based on the dates they hit the web, not the magazine print issue dates):
July 2007 The Pillars of Concurrency
August 2007 How Much Scalability Do You Have or Need?
September 2007 Use Critical Sections (Preferably Locks) to Eliminate Races
October 2007 Apply Critical Sections Consistently
November 2007 Avoid Calling Unknown Code While Inside a Critical Section
December 2007 Use Lock Hierarchies to Avoid Deadlock
January 2008 Break Amdahl’s Law!
February 2008 Going Superlinear

Effective Concurrency: Break Amdahl’s Law!

The latest Effective Concurrency column, "Break Amdahl’s Law!", just went live on DDJ’s site, and will also appear in the print magazine. From the article:

Back in 1967, Gene Amdahl famously pointed out what seemed like a fundamental limit to how fast you can make your concurrent code: Some amount of a program’s processing is fully "O(N)" parallelizable (call this portion p), and only that portion can scale directly on machines having more and more processor cores. The rest of the program’s work is "O(1)" sequential (s). [1,2] Assuming perfect use of all available cores and no parallelization overhead, Amdahl’s Law says that the best possible speedup of that program workload on a machine with N cores is given by:

speedup(N) = (s + p) / (s + p/N)

Note that, as N increases to infinity, the best speedup we can ever get is (s+p)/s. Figure 1 illustrates why a program that is half scalably parallelizable and half not won’t scale beyond a factor of two even given infinite cores. Some people find this depressing. They conclude that Amdahl’s Law means there’s no point in trying to write multicore- and manycore-exploiting applications except for a few "embarrassingly parallel" patterns and problem domains with essentially no sequential work at all.

Fortunately, they’re wrong. If Amdahl’s Game is rigged, well then, to paraphrase a line from the movie WarGames: The only way to win is not to play. …
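To make the asymptote concrete, here is a small Python sketch of the formula (my own illustration, not from the article):

```python
# Sketch: Amdahl's Law, speedup(N) = (s + p) / (s + p/N), for a program that
# is half sequential (s = 0.5) and half scalably parallelizable (p = 0.5).

def amdahl_speedup(s, p, n):
    """Best-case speedup on n cores: sequential fraction s, parallel fraction p."""
    return (s + p) / (s + p / n)

s, p = 0.5, 0.5
# Even with an enormous number of cores, speedup approaches (s + p) / s = 2:
# amdahl_speedup(s, p, 2) gives ~1.33, and amdahl_speedup(s, p, 10**9) gives ~2.
```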

I hope you enjoy it.
Finally, here are links to previous Effective Concurrency columns (based on the dates they hit the web, not the magazine print issue dates):
July 2007 The Pillars of Concurrency
August 2007 How Much Scalability Do You Have or Need?
September 2007 Use Critical Sections (Preferably Locks) to Eliminate Races
October 2007 Apply Critical Sections Consistently
November 2007 Avoid Calling Unknown Code While Inside a Critical Section
December 2007 Use Lock Hierarchies to Avoid Deadlock

Effective Concurrency: Use Lock Hierarchies to Avoid Deadlock

The latest Effective Concurrency column, "Use Lock Hierarchies to Avoid Deadlock", just went live on DDJ’s site, and will also appear in the print magazine. From the article:

… The only way to eliminate such a potential deadlock is to make sure that all mutexes ever held at the same time are acquired in a consistent order. But how can we ensure this in a way that will be both usable and correct? For example, we could try to figure out which groups of mutexes might ever be held at the same time, and then try to define pairwise ordering rules that cover each possible combination. But that approach by itself is prone to accidentally missing unexpected combinations of locks; and even if we did it perfectly, the result would still be at best "DAG spaghetti"—a directed acyclic graph (DAG) that nobody could comprehend as a whole. And every time we want to add a new mutex to the system, we would have to find a way to fit it into the DAG without creating any cycles.

We can do better by directly exploiting the knowledge we already have about the structure of the program to regularize the mess and make it understandable. …

I hope you enjoy it.
Finally, here are links to previous Effective Concurrency columns (based on the dates they hit the web, not the magazine print issue dates):
July 2007 The Pillars of Concurrency
August 2007 How Much Scalability Do You Have or Need?
September 2007 Use Critical Sections (Preferably Locks) to Eliminate Races
October 2007 Apply Critical Sections Consistently
November 2007 Avoid Calling Unknown Code While Inside a Critical Section

Parallel Computing Releases at Microsoft

For those of you who may be interested in concurrency for Microsoft platforms, and .NET in particular, I’m happy to report some fresh announcements:

PFX is for .NET concurrency development, but of course we are (and personally I am) working on native development tools, including for C++ specifically, that are at least as exciting. Stay tuned here for future news. In the meantime, enjoy these bits and the resources and discussion on the Parallel Computing developer center and forums.

The Concurrency Land Rush: 2007-20??

Every time that we experience a "wave" in which the industry takes a programming paradigm that’s been growing in a niche, and brings it to the mainstream, we go through similar phases:

  • A land rush phase during which vendors try to stake out their turf. The market sees an explosive proliferation of products trying to enable mainstream developers to work in the new paradigm and at the same time to differentiate themselves from each other. This phase always takes multiple years, and multiple major releases of many products, consumed at first mainly by early adopters and then, as products mature, broadly across the marketplace.
  • A shakeout phase as, eventually, the market sifts through the products, selects major winners, and realigns itself around them.

We (the software industry) have done it all before for technologies like OO and GUIs, and now we’re going to do it for concurrency.

Starting in 2005, companies started to talk concurrency and to launch internal efforts. But only this year, in 2007, has our industry formally entered the concurrency wave’s land rush phase by starting to field real products and preview releases. We’re starting to see initial steps toward candidate mainstream-accessible products in releases like Intel’s Threading Building Blocks and Software Transactional Memory prototype compiler, and Microsoft’s just-announced and downloadable .NET Parallel Extensions (PFX) including CTP-level support for Parallel LINQ (PLINQ) and the Task Parallel Library.

True, the concurrency landscape has lots of existing tools. So did OO (e.g., Simula) and GUIs (e.g., the Win16 C API) long before they became widely accessible to developers in general. Concurrency’s "Simulas and Win16s" range from Thread classes and MPI for doing work asynchronously, to OpenMP for fast parallel computation, to some pre-fab lock-free container classes, to limited race detectors and very basic thread profilers. But today’s tools are still what you’d expect to see in a niche market: good enough for highly trained experts to deliver real products, but not easy enough or integrated enough across the software stack and tool chains to be broadly adopted by mainstream developers.

Just as the OO mainstream land rush was fueled by new and updated tools like Smalltalk and Objective-C and C++ and Java offerings, and the GUI land rush was fueled by MFC and OWL and Cocoa and Visual Basic and Delphi, we’re starting to see new and existing companies bringing lots of new and updated offerings across the entire software stack and ecosystem. Expect to see lots of new and updated/extended offerings of:

  • Languages and frameworks
  • Runtimes and operating systems
  • Debuggers and profilers and program analysis tools
  • Component wizards and IDE integration
  • Books and courses and conferences

These are just the beginning. Much more will come, from those same companies and others. Expect at least dozens of major product announcements and releases across the industry before the toolset expansion phase is fully underway and approaching some maturity. We the industry have undertaken to bring concurrency to the mainstream, and as with OO and GUIs it will take multiple years, and multiple major releases, across the industry on all platforms.

As I write this, Seattle’s first snowfall of the season is gently descending outside my window. It readily puts one in mind of Frost: We have promises to keep, and miles to go before we sleep.

Concurrency Town Hall: On the web, this Monday, December 3

Earlier this fall, James Reinders and I each gave a concurrency webcast as part of Intel’s fall webcast series. On Monday, James and I are going to follow that up with a virtual Town Hall panel discussion on concurrency, with Dr. Dobb’s editor Jon Erickson as our moderator. Should be cool. And you are invited to attend.

Here are the coordinates:

Date: Monday, December 3, 2007

Time: 9:00-10:00am U.S. Pacific Time

Registration: http://www.life20.net/intel1reg.php

Web simulcast (doesn’t require Second Life): http://www.slcn.tv

I expect that most people will be watching the event via the live simulcast on Second Life Cable Network. Use that if you’re not a Second Life member, or if you are a member but can’t get in there because the SL servers are expected to reach their maximum capacity for a single event.

There’s also a network breakfast before the town hall portion. Here’s the full description, as it appears on Intel’s site:

SOFTWARE AND THE CONCURRENCY REVOLUTION: DESIGNING FOR MULTIPLE CORES

A virtual Town Hall held in Second Life by Intel Software Network and Dr. Dobb’s Journal

December 3, 2007 8:00 AM – 11:00 AM Pacific Time (11:00 AM – 1:00 PM ET) on Intel and Dr Dobb’s Islands.

Meet Intel and Industry experts, learn about parallel programming and concurrency trends, let your voice be heard.

This half day event will begin with a networking breakfast on Dr. Dobbs Island where you will have the ability to meet some of Intel’s best engineers. Tim Mattson (Intel Principal Engineer, designer of the 80-core test chip parallel application, and one of the creators of OpenMP*) will give a brief introduction on our transition to a many-core world.

Following breakfast, attendees will adjourn to Intel Island for the virtual Town Hall. This event will be moderated by Jonathan Erickson (Editorial Director, Dr. Dobb’s) and will feature Herb Sutter (Architect, Microsoft, and chair of the ISO C++ Standards committee) and James Reinders (Chief Evangelist and Threading Guru for Intel Software Products), who will debate and discuss issues involved in realizing the promise of parallelization. As in any Town Hall, your input will be vital.

After the Town Hall, Peretz Stine of Intel Island and Rissa Maidstone of Dr. Dobbs will lead tours of their respective islands.

Schedule for December 3, 2007:
• 8:00-9:00 AM SLT – Networking Breakfast with Tim Mattson
• 9:00-10:00 AM SLT – Town Hall Meeting with Jon Erickson, James Reinders and Herb Sutter
• 10:00-11:00 AM SLT – Tour of Intel and Dr. Dobbs Islands

You must register to reserve a place at this event: http://www.life20.net/intel1reg.php
This event will be simulcast live on Second Life Cable Network: http://www.slcn.tv

Effective Concurrency: Avoid Calling Unknown Code While Inside a Critical Section

The latest Effective Concurrency column, “Avoid Calling Unknown Code While Inside a Critical Section”, just went live on DDJ’s site, and will also appear in the print magazine. Here’s a teaser from the article’s intro:

Don’t walk blindly into a deadly embrace…

Our world is built on modular, composable software. We routinely expect that we don’t need to know about the internal implementation details of a library or plug-in to be able to use it correctly.

The problem is that locks, and most other kinds of synchronization we use today, are not modular or composable. That is, given two separately authored modules, each of which is provably correct but uses a lock or similar synchronization internally, we generally can’t combine them and know that the result is still correct and will not deadlock. There are only two ways to combine such a pair of lock-using modules safely: …

I hope you enjoy it.

2022-11 Update: DDJ.com has bitrotted the article, so here’s a quick rough copy-paste of the article contents.

Our world is built on modular, composable software. We routinely expect that we don’t need to know about the internal implementation details of a library or plug-in in order to be able to use it correctly.

The problem is that locks, and most other kinds of synchronization we use today, are not modular or composable. That is, given two separately authored modules each of which is provably correct but uses a lock or similar synchronization internally, we generally can’t combine them and know that the result is still correct and will not deadlock. There are only two ways to combine such a pair of lock-using modules safely:

  • Option 1 (generally impossible for code you don’t control): Each module must know about the complete internal implementation of any function it calls in the other module.
  • Option 2: Each module must be careful not to call into the other module while the calling code is inside a critical section (e.g., while holding a lock).

Let’s examine why Option 2 is generally the only viable choice, and what consequences it has for how we write concurrent code. For convenience and familiarity, I’m going to cast the problem in terms of locks, but the general case arises whenever:

  • The caller is inside one critical section.
  • The callee directly or indirectly tries to enter another critical section, or performs another blocking call.
  • Some other thread could try to enter the two critical sections in the opposite order.

Quick Recap: Deadlock

A deadlock (aka deadly embrace) can happen anytime two different threads can try to acquire the same two locks (or, more generally, acquire exclusive use of the resources protected by the same two synchronization objects) in opposite orders. Therefore anytime you write code where a thread holds one lock L1 and tries to acquire another lock L2, that code is liable to be one arm of a potential deadly embrace—unless you can eliminate the possibility that some other thread might try to acquire the locks in the other order.

We can use techniques such as lock hierarchies to guarantee that locks are taken in order. Unfortunately, those techniques do not compose either. For example, you can use a lock hierarchy and guarantee that the code you control follows the discipline, but you can’t guarantee that other people’s code will know anything about your lock levels, much less follow them correctly.
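One way to make the discipline concrete, at least within code you control, is to check it at runtime. Here is a hypothetical Python sketch (the HierarchicalLock name and API are my invention, not from the article): each lock carries a level, and acquiring a lock at a level equal to or higher than one already held fails fast instead of deadlocking some day in the field.

```python
# Sketch of a lock-hierarchy discipline: every lock gets a level, and a
# thread may only acquire locks in strictly decreasing level order.
# Violations raise immediately rather than manifesting as rare deadlocks.
import threading

class HierarchicalLock:
    _held = threading.local()               # per-thread stack of held levels

    def __init__(self, level):
        self.level = level
        self._lock = threading.Lock()

    def acquire(self):
        held = getattr(HierarchicalLock._held, "levels", [])
        if held and held[-1] <= self.level:
            raise RuntimeError(
                f"lock order violation: level {self.level} after {held[-1]}")
        self._lock.acquire()
        HierarchicalLock._held.levels = held + [self.level]

    def release(self):
        # Sketch assumes LIFO release order; a real implementation would
        # verify which lock is being released.
        self._lock.release()
        HierarchicalLock._held.levels = HierarchicalLock._held.levels[:-1]

mutHigh = HierarchicalLock(level=100)       # e.g. the browser's page lock
mutLow = HierarchicalLock(level=10)         # e.g. a plug-in's internal lock
```

Acquiring mutHigh and then mutLow succeeds; acquiring mutLow and then mutHigh raises at the point of the mistake. As the text notes, though, this only helps for locks whose code you control.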

Example: Two Modules, One Lock Each

One fine day, you decide to write a new web browser that allows users to write plug-ins to customize the behavior or rendering of individual page elements.

Consider the following possible code, where we simply protect all the data structures representing elements on a given page using a single mutex mutPage:[3]

// Example 1: Thread 1 of a potential deadly embrace
//
class CoreBrowser {
  … other methods …

  private void RenderElements() {
    mutPage.lock();                // acquire exclusion on the page elements
    try {
      for( each PageElement e on the page ) {
        DoRender( e );             // do our own default processing
        plugin.OnRender( e );      // let the plug-in have a crack at it
      }
    } finally {
      mutPage.unlock();            // and then release it
    }
  }
}

Do you see the potential for deadlock? The trouble is that, if inside the call to plugin.OnRender the plug-in might acquire some internal lock of its own, then this could be one arm of a potential deadly embrace. For example, consider this plug-in implementation that just does some basic instrumentation of how many times certain actions have been performed, and it protects its internal data with a single mutex mutMyData:

class MyPlugin {
  … other methods …

  public void OnRender( PageElement e ) {
    mutMyData.lock();              // acquire exclusion on some internal shared data
    try {
      renderCount[e]++;            // update #times e has been rendered
    } finally {
      mutMyData.unlock();          // and then release it
    }
  }
}

Thread 1 can therefore acquire mutPage and mutMyData in that order. Thread 1 is potential deadlock-bait, but the trouble will only actually manifest if one fine day some other Thread 2 that could run concurrently with the above performs something like the following:

// Example 2: Thread 2 of a potential deadly embrace
//
class MyPlugin {
  … other methods …

  public void RefreshDisplay( PageElement e ) {
    mutMyData.lock();              // acquire exclusion on some internal shared data
    try {                          // display stuff in a debug window
      for( each element e we’ve counted ) {
        listRenderedCount.Add( e.Name(), renderCount[e] );
      }
      textHiddenCount = browser.CountHiddenElements();
    } finally {
      mutMyData.unlock();          // and then release it
    }
  }
}

Notice how the plugin calls code unknown to it, namely browser.CountHiddenElements? You can probably see the trouble coming on like a steamroller:

class CoreBrowser {
  … other methods …

  public int CountHiddenElements() {
    mutPage.lock();                // acquire exclusion on the page elements
    try {
      int count = 0;
      for( each PageElement e on the page ) {
        if( e.Hidden() ) count++;
      }
      return count;
    } finally {
      mutPage.unlock();            // and then release it
    }
  }
}

Threads 1 and 2 can therefore acquire mutPage and mutMyData in the opposite order, and so this is a deadlock waiting to happen if Threads 1 and 2 can ever run concurrently. For added fun, note that each mutex is purely an internal implementation detail of its module that is never exposed in the interface; neither module knows anything about the internal lock being used within the other. (Nor, in a better programming world than the one we now inhabit, should it have to.)

Note: Both CoreBrowser and MyPlugin violate the rule. For CoreBrowser, see below for workarounds. The plug-in, for its part, can easily move the call to browser.CountHiddenElements (which is code external to the plug-in) to before or after the critical section – it does need a lock for some of the work in the try block, but it doesn’t need the lock around all of that work, and especially not around the call out to unknown code.

Example: Two Modules, but Only One Has Locks

This kind of problem can arise even if both locks are in the same module, but control flow passes through another module so that you don’t know what locks are taken.

Consider the following modification, where the browser protects each page element using a separate mutex, which can be desirable to allow different parts of the page to be rendered concurrently:

// Example 3: Thread 1 of an alternative potential deadly embrace
//
class CoreBrowser {
  … other methods …

  private void RenderElements() {
    for( each PageElement e on the page ) {
      e.mut.lock();                // acquire exclusion on this page element
      try {
        DoRender( e );             // do our own default processing
        plugin.OnRender( e );      // let the plug-in have a crack at it
      } finally {
        e.mut.unlock();            // and then release it
      }
    }
  }
}

And consider a plug-in that does no locking of its own at all:

class MyPlugin {
  … other methods …

  public void OnRender( PageElement e ) {
    GuiCoord cPrev = browser.GetElemPosition( e.Previous() );
    GuiCoord cNext = browser.GetElemPosition( e.Next() );
    Use( e, cPrev, cNext );        // do something with the coordinates
  }
}

But which calls back into:

class CoreBrowser {
  … other methods …

  public GuiCoord GetElemPosition( PageElement e2 ) {
    e2.mut.lock();                 // acquire exclusion on this page element
    try {
      return FigureOutPositionFor( e2 );
    } finally {
      e2.mut.unlock();             // and then release it
    }
  }
}

The order of mutex acquisition is:

for each element e
  acquire e.mut
    acquire e.Previous().mut
    release e.Previous().mut
    acquire e.Next().mut
    release e.Next().mut
  release e.mut

Perhaps the most obvious issue is that any pair of locks on adjacent elements can be taken in either order by Thread 1; so this cannot possibly be part of a correct lock hierarchy discipline.

Because of the interference of the plug-in code, which does not even have any locks of its own, this code will have a latent deadlock if any other concurrently running thread (including perhaps another instance of Thread 1 itself) can take any two adjacent elements’ locks in any order. The deadlock-proneness is inherent in the design, which fails to guarantee a rigorous ordering of locks. In this respect, the original Example 1 was better, even though its locking was coarse-grained and less friendly to concurrent rendering of different page elements.
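A design flaw like Example 3’s can also be caught after the fact: record which locks are held whenever a lock is acquired, and look for a cycle in the resulting "acquired-after" graph. Here is a minimal Python sketch of that idea (my own illustration; real potential-deadlock detectors are far more sophisticated):

```python
# Sketch: detect potential deadlock from lock-acquisition traces.
# Record an edge A -> B whenever B is acquired while A is held; a cycle in
# this "acquired-after" graph means some interleaving could deadlock.

def order_edges(traces):
    """traces: list of acquisition sequences, each a list of lock names
    acquired in nested (LIFO) fashion. Returns the set of (A, B) edges."""
    edges = set()
    for trace in traces:
        for i, outer in enumerate(trace):
            for inner in trace[i + 1:]:
                edges.add((outer, inner))
    return edges

def has_cycle(edges):
    """Simple DFS cycle check over the directed graph."""
    graph = {}
    for a, b in edges:
        graph.setdefault(a, []).append(b)
    def visit(node, path):
        if node in path:
            return True
        return any(visit(nxt, path | {node}) for nxt in graph.get(node, []))
    return any(visit(n, frozenset()) for n in graph)

# Thread 1 in Example 3 locks e.mut and then e.Previous().mut, so for two
# adjacent elements e1 and e2, one loop iteration yields the edge
# e2.mut -> e1.mut and the next iteration yields e1.mut -> e2.mut: a cycle.
traces = [["e2.mut", "e1.mut"], ["e1.mut", "e2.mut"]]
```

A trace set that always respects one global order (a DAG) has no cycle, which is exactly the lock-hierarchy property the previous column described.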

Consequences: What Is “Unknown Code”?

It’s one thing to say “avoid calling unknown code while holding a lock” or while inside a similar kind of critical section. It’s another to do it, because there are so many ways to get into “someone else’s code.” Let’s consider a few.

While inside a critical section, including while holding a lock:

  • Avoid calling library functions. A library function is the classic case of “somebody else’s code.” Unless the library function is documented to not take any locks, deadlock problems can arise.
  • Avoid calling plug-ins. Clearly a plug-in is “somebody else’s code.”
  • Avoid calling other callbacks, function pointers, functors, delegates, etc. C function pointers, C++ functors, C# delegates, and the like can also fairly obviously lead to “somebody else’s code.” Sometimes you know that a function pointer, functor, or delegate will always lead to your own code, and calling it is safe; but if you don’t know that for certain, avoid calling it from inside a critical section.
  • Avoid calling virtual methods. This may be less obvious and quite surprising, even Draconian; after all, virtual functions are common and pervasive. But every virtual function is an extensibility point that can lead to executing code that doesn’t even exist today. Unless you control all derived classes (for example, the virtual function is internal to your module and not exposed for others to derive from), or you can somehow enforce or document that overrides must not take locks, deadlock problems can arise if it is called while inside a critical section.
  • Avoid calling generic methods, or methods on a generic type. When writing a C++ template or a Java or .NET generic, we have yet another way to parameterize and intentionally let “other people’s code” be plugged into our own. Remember that anything you call on a generic type or method can be provided by anyone, and unless you know and control all the types with which your template or generic can be instantiated (for example, your template or generic is internal to your module and so cannot be externally instantiated by someone else), avoid calling something generic from within a critical section.

Some of these restrictions may be obvious to you; others may be surprising at first.

Avoidance: Non-Critical Calls

So you want to remove a call to unknown code from a critical section. But how? What can you do? Four options are: (a) move the call out of the critical section if you didn’t need the exclusion anyway; (b) make copies of data inside the critical section and later pass the copies; (c) reduce the granularity or power of the critical section being held at the point of the call; or (d) instruct the callee sternly and hope for the best. Let’s look at each option in turn.

We can apply the first approach directly to Example 2. There is no reason the plugin needs to call browser.CountHiddenElements() while holding its internal lock. That call should simply be moved to before or after the critical section.
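In Python-ish terms, the repaired RefreshDisplay might look like the following sketch (my own illustration, with hypothetical names): take the lock only long enough to snapshot the internal data, and make the call into the browser with no lock held.

```python
# Sketch of the fix to Example 2: snapshot the shared data under the lock,
# then call into unknown code (the browser) with no lock held.
import threading

class MyPlugin:
    def __init__(self, browser):
        self.browser = browser
        self.mutMyData = threading.Lock()
        self.renderCount = {}

    def refresh_display(self):
        with self.mutMyData:                 # critical section: copy only
            snapshot = dict(self.renderCount)
        # Outside the critical section: now it is safe to call unknown code.
        hidden = self.browser.CountHiddenElements()
        return snapshot, hidden
```

Because refresh_display never holds mutMyData while calling into the browser, the opposite-order arm of the deadly embrace disappears, no matter what locks the browser takes internally.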

The second approach is to avoid the need for taking the lock by avoiding the use of mutable shared data. The two general techniques are: first, avoid shared data by passing copies of the data, which solves the correctness problem by trading off space and some performance (variants of this approach include passing a subset of the data, and passing the copies via messages to run the callee asynchronously); and second, avoid mutable data by using immutable snapshots of the state.

To improve Example 1, for instance, it might be appropriate to change the RenderElements method to hold the lock only long enough to take copies of the necessary shared information into a local container, then do the processing outside the lock, passing the copied elements. However, this could be inappropriate if the data is very expensive to copy, or if the callee needs to work on the real data. Alternatively, perhaps the callee doesn’t really need all the information it would get from direct access to the protected object, and it would be both sufficient and more efficient to pass copies of just the parts of the data the callee does need.

The third option is to reduce the power or granularity of the critical section. This implicitly trades off ease of use, because making your synchronization finer-tuned and/or finer-grained also makes it harder to code correctly. One example of reducing the power of a critical section is to replace a mutex with a reader-writer mutex, so that multiple concurrent readers are allowed; if the only potential deadlocks arise among threads that only read the protected resources, this can be a valid solution, because those threads can take a read-only lock instead of a read-write lock. An example of making a critical section finer-grained is to replace a single mutex protecting a large data structure with mutexes protecting parts of the structure; if the only potential deadlocks arise among threads that use different parts of the structure, this can be a valid solution (note that Example 1 is not such a case).
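As a concrete illustration of reducing the power of a critical section, here is a minimal (deliberately simplistic, writer-starvation-prone) reader-writer lock sketch in Python; the RWLock name and design are my own illustration, not a recommendation for production use:

```python
# Sketch: a reader-writer lock lets read-only threads share the critical
# section instead of excluding each other. Minimal and writer-starvation-prone.
import threading

class RWLock:
    def __init__(self):
        self._mutex = threading.Lock()      # protects the reader count
        self._writer = threading.Lock()     # held by a writer, or by readers as a group

    def __init_readers(self):
        pass

    def acquire_read(self):
        with self._mutex:
            self._readers = getattr(self, "_readers", 0) + 1
            if self._readers == 1:
                self._writer.acquire()      # first reader locks writers out

    def release_read(self):
        with self._mutex:
            self._readers -= 1
            if self._readers == 0:
                self._writer.release()      # last reader lets writers in

    def acquire_write(self):
        self._writer.acquire()

    def release_write(self):
        self._writer.release()
```

Multiple readers can hold the lock at once while a writer is blocked out; when the last reader leaves, a writer can get in. If the threads that could deadlock against each other are all readers, they no longer contend at all.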

The fourth option is to tell the callee not to block, which trades off enforceability. In particular, if you have the power to impose requirements on the callee (as you do with plug-ins to your software, but not with simple calls into existing third-party libraries), you can require that it not take locks or otherwise perform blocking actions. Alas, such requirements are typically limited to documentation and are not enforceable automatically, but in practice it can often be sufficient to rely on programmer discipline to follow the rule. Tell the callee what (not) to do, and hope he follows the yellow brick road.

Summary

Be aware of the many opportunities modern languages give us to call “other people’s code,” and eliminate external opportunities for deadlock by not calling unknown code from within a critical section. If you additionally eliminate internal opportunities for deadlock by applying a lock hierarchy discipline within the code you control, your use of locks will be highly likely to be correct… and then we can move on to making it efficient, which we’ll consider in future articles.

Notes

[1] H. Sutter. “The Trouble With Locks” (C/C++ Users Journal, 23(3), March 2005). Available at http://www.gotw.ca/publications/mill36.htm.

[2] Note that calling unknown code from within a critical section is a problem even in single-threaded code, because while our code is inside a critical section we typically have broken invariants. For example, an object may be partway through the process of changing its values from one valid state to another, or we may be partway through a debit-credit transaction where one account has had money removed but the other account has not yet received it. If we call into unknown code while in such a situation, there’s the (usually remote) possibility that the code we call may in turn call other code, which in turn calls other code, which in turn calls into the original module and sees the broken invariant. Techniques like layering can minimize this problem by ensuring code at lower levels can’t “call back up” into higher-level code. The problem is greatly magnified, however, in the presence of concurrency, and we need the same kinds of “layering” tools, only now applied to locks and other ways to express critical sections, to ensure similar problems don’t occur. Unfortunately, those tools only work for code you control, which doesn’t help with modularity.


Finally, here are links to previous Effective Concurrency columns (based on the dates they hit the web, not the magazine print issue dates):
July 2007 The Pillars of Concurrency
August 2007 How Much Scalability Do You Have or Need?
September 2007 Use Critical Sections (Preferably Locks) to Eliminate Races
October 2007 Apply Critical Sections Consistently

Effective Concurrency: Apply Critical Sections Consistently

The latest Effective Concurrency column, "Apply Critical Sections Consistently", just went live on DDJ’s site, and will also appear in the print magazine. Here’s a teaser from the article’s intro:

The critical section is our One True Tool for guaranteeing mutual exclusion on mutable shared variables. …

[But critical sections] must be applied consistently, and with the intended meanings. A program must ensure that every use of a mutable shared object is properly protected using exactly one of these mechanisms at any given point in its lifetime. Chaos can erupt if code tries to avoid or invert these meanings (e.g., trying to abuse taking a lock or reading an atomic variable as a "critical section exit" operation; see Example 3), or tries to use inconsistent synchronization techniques at the same time.

I hope you enjoy it.
Next month’s article is already in post-production. It delves into what I think is a really interesting rule, both because it’s widely ignored and because it’s full of surprises: "Avoid Calling Unknown Code While Inside a Critical Section." I’ll blog here when it hits the web about a month from now.
Finally, here are links to previous Effective Concurrency columns (based on the dates they hit the web, not the magazine print issue dates):
July 2007 The Pillars of Concurrency
August 2007 How Much Scalability Do You Have or Need?
September 2007 Use Critical Sections (Preferably Locks) to Eliminate Races

Effective Concurrency: Use Critical Sections (Preferably Locks) to Eliminate Races

"In a race, no one can hear you scream."
That’s my tagline for the third Effective Concurrency column, "Use Critical Sections (Preferably Locks) to Eliminate Races." It just went live on DDJ’s site, and will also appear in the print magazine.
This article focuses on two main things:
  • The detailed facts of life about why most or all bets are off if you have a data race. You really, really, really don’t want to have any races in your code.
  • The commonality that unifies all synchronization constructs you’ve ever used or will use, from locking to lock-free styles to fences to transactional memory.

Here’s the article’s intro:

Everyone knows the basics of how to use locks:

  mut.lock(); // acquire lock on x
  … read/write x …
  mut.unlock(); // release lock on x

But why do locks, lock-free styles, and other synchronization techniques work at all, never mind interoperate well with each other and with aggressive optimizers that transform and reorder your program to make it run faster? Because every synchronization technique you’ve ever heard of must express, and every optimization that may ever be performed must respect and uphold, the common fundamental concept of a critical section. …
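As an aside, the idiomatic modern C++ rendering of the intro’s snippet uses RAII rather than explicit lock()/unlock() calls. A minimal sketch, using std::lock_guard (which postdates this article) and a hypothetical shared counter for x:

```cpp
#include <mutex>

std::mutex mut;   // protects x
int x = 0;        // the mutable shared variable

void Increment() {
    std::lock_guard<std::mutex> hold(mut);  // acquire lock on x
    ++x;                                    // read/write x
}                                           // release lock on x, even if
                                            // the body throws an exception
```

The scoped object guarantees the unlock happens on every path out of the critical section, which is exactly the kind of consistent application of critical sections the articles argue for.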

I hope you enjoy it.
Next month’s article is already in post-production. It follows directly from this one, and will be titled "Apply Critical Sections Consistently." I’ll blog here when it hits the web about a month from now.
Finally, here are links to previous Effective Concurrency columns (based on the dates they hit the web, not the magazine print issue dates):