Effective Concurrency: Eliminate False Sharing

This month’s Effective Concurrency column, “Eliminate False Sharing”, is now live on DDJ’s website.

People keep writing to ask me about my previous mentions of false sharing, some even debating whether it’s really a problem. So this month I decided to treat it in depth, including:

  • A compelling and realistic example where just changing a couple of lines to remove false sharing takes an algorithm from zero scaling to perfect scaling – even when many threads are merely doing reads. Hopefully after this nobody will argue that false sharing isn’t a problem. :-)
  • How your performance monitoring and analysis tools do and/or don’t help you uncover the problem, and how to use them effectively to identify the culprit. Short answer: CPU activity monitors aren’t very helpful, but cycles-per-instruction (CPI) and cache miss rate measurements attributed to specific lines of source code are your friend.
  • The two ways to correct the code: Reduce the frequency of writes to the too-popular cache line, or add padding to move other data off the line.
  • Reusable code in C++ and C#, and a note about Java, that you can use to apply padding (and alignment, where available) to put frequently updated objects on their own cache lines.

From the article:

In two previous articles I pointed out the performance issue of false sharing (aka cache line ping-ponging), where threads use different objects but those objects happen to be close enough in memory that they fall on the same cache line, and the cache system treats them as a single lump that is effectively protected by a hardware write lock that only one core can hold at a time. … It’s easy to see why the problem arises when multiple cores are writing to different parts of the same cache line… In practice, however, it can be even more common to encounter a reader thread using what it thinks is read-only data still getting throttled by a writer thread updating a different but nearby memory location…

A number of readers have asked for more information and examples on where false sharing arises and how to deal with it. … This month, let’s consider a concrete example that shows an algorithm in extremis due to false sharing distress, how to use tools to analyze the problem, and the two coding techniques we can use to eliminate false sharing trouble. …

I hope you enjoy it. Finally, here are links to previous Effective Concurrency columns:

The Pillars of Concurrency (Aug 2007)

How Much Scalability Do You Have or Need? (Sep 2007)

Use Critical Sections (Preferably Locks) to Eliminate Races (Oct 2007)

Apply Critical Sections Consistently (Nov 2007)

Avoid Calling Unknown Code While Inside a Critical Section (Dec 2007)

Use Lock Hierarchies to Avoid Deadlock (Jan 2008)

Break Amdahl’s Law! (Feb 2008)

Going Superlinear (Mar 2008)

Super Linearity and the Bigger Machine (Apr 2008)

Interrupt Politely (May 2008)

Maximize Locality, Minimize Contention (Jun 2008)

Choose Concurrency-Friendly Data Structures (Jul 2008)

The Many Faces of Deadlock (Aug 2008)

Lock-Free Code: A False Sense of Security (Sep 2008)

Writing Lock-Free Code: A Corrected Queue (Oct 2008)

Writing a Generalized Concurrent Queue (Nov 2008)

Understanding Parallel Performance (Dec 2008)

Measuring Parallel Performance: Optimizing a Concurrent Queue (Jan 2009)

volatile vs. volatile (Feb 2009)

Sharing Is the Root of All Contention (Mar 2009)

Use Threads Correctly = Isolation + Asynchronous Messages (Apr 2009)

Use Thread Pools Correctly: Keep Tasks Short and Nonblocking (Apr 2009)

Eliminate False Sharing (May 2009)

10 thoughts on “Effective Concurrency: Eliminate False Sharing”

  1. Hi Herb,

    Just a quick note to say thank you for all your great articles. They have been enlightening for me in my work and I typically share them with the Drizzle contributor community as required reading. Thanks, and keep up the great work!

    Jay Pipes

  2. Hi Herb,

    In your article you mention the new standard C++ attribute syntax, [[align(32)]], but I’ve heard that Microsoft is refusing to implement it. Could you comment on this?

  3. Very nice article, Herb. One of the best ones!
    Does the example written (using lambdas) run in a current C# language/library release, or do we need to use the Visual Studio CTP? Because if we have to use the CTP, we will not be able to use so many cores on a virtual PC…

  4. Hi Herb,

    Is there an effort to do the padding automatically at runtime to avoid false sharing? If not, why not?

  5. In all three examples the lambda object captures the variable ‘p’ by reference due to [&]. This is a serious bug. It should have been [&,p] { … } instead.

  6. Rob: No, we didn’t mean to say that.

    Henrik: Thanks to Intel for their 4-proc x 6-core test machine. :-)

    Ghita: Thanks. The article uses C++0x lambda syntax, which will be in Visual C++ 2010 (whose beta just went live incidentally, maybe I should blog about that [edit: ok, here it is]). You can do similar things in C# lambda syntax. In C or Java or a C++ implementation that doesn’t support lambdas, just write an out-of-line helper function or Runnable class or functor, respectively (they’re pretty much the same thing).

    Michael: There have been experiments about automatically detecting false sharing and adding padding to remove the contention, particularly in managed environments that have more latitude to play with layout. But it’s problematic so far and I don’t know of a real-world product that can do this automatically. Does anyone else know of one?

    Pizer: Excellent point. Score +1 bug report. I’ll see if they can fix it on the website, but it’ll be fixed in the book. The problem didn’t arise in my testing version because the test code was a little different.

  7. There is a problem with the C# sample class. The runtime does not allow you to instantiate a generic type with explicit layout.

    If the type is not generic, everything works fine, e.g.:

    [StructLayout(LayoutKind.Explicit, Size = 2 * CACHE_LINE_SIZE)]
    public struct CacheLineStorage
    {
        const int CACHE_LINE_SIZE = 512;

        [FieldOffset(CACHE_LINE_SIZE)]
        public int data;
    }

    CacheLineStorage test = default(CacheLineStorage);
    int size = Marshal.SizeOf(test);

    The size variable will be equal to 1024.

  8. The final example in “Eliminate False Sharing” is a little unclear on the intent. The text says “we arbitrarily force half the workers to only perform reads from their result[p]” but it appears the code does a read on about half the threads when an element in the matrix is odd (good) and a read and write, “++result[p]”, for every thread when an element in the matrix is even (doh).

  9. It appears that the ‘performance analyzer tool’ that could measure CPI and Cache miss rate provided critical clues to help find the cause.

    Can you tell us which tool you used?

Comments are closed.