VS2010 Beta 1 Now Available

For those of you who are interested in using or trying Microsoft development tools, I’m happy to report that Visual Studio 2010 Beta 1 is now available.

If you’re interested in:

Remember this is just a beta and not intended for production use, but there’s a lot of cool stuff to play around with and it should run fine side-by-side with VS2008. Feedback via forums or filing a bug/suggestion is always appreciated. Enjoy!

Effective Concurrency: Eliminate False Sharing

This month’s Effective Concurrency column, “Eliminate False Sharing”, is now live on DDJ’s website.

People keep writing asking me about my previous mentions of false sharing, even debating whether it’s really a problem. So this month I decided to treat it in depth, including:

  • A compelling and realistic example where just changing a couple of lines to remove false sharing takes an algorithm from zero scaling to perfect scaling – even when many threads are merely doing reads. Hopefully after this nobody will argue that false sharing isn’t a problem. :-)
  • How your performance monitoring and analysis tools do and/or don’t help you uncover the problem, and how to use them effectively to identify the culprit. Short answer: CPU activity monitors aren’t very helpful, but cycles-per-instruction (CPI) and cache miss rate measurements attributed to specific lines of source code are your friend.
  • The two ways to correct the code: Reduce the frequency of writes to the too-popular cache line, or add padding to move other data off the line.
  • Reusable code in C++ and C#, and a note about Java, that you can use to use padding (and alignment if available) to put frequently-updated objects on their own cache lines.

From the article:

In two previous articles I pointed out the performance issue of false sharing (aka cache line ping-ponging), where threads use different objects but those objects happen to be close enough in memory that they fall on the same cache line, and the cache system treats them as a single lump that is effectively protected by a hardware write lock that only one core can hold at a time. … It’s easy to see why the problem arises when multiple cores are writing to different parts of the same cache line… In practice, however, it can be even more common to encounter a reader thread using what it thinks is read-only data still getting throttled by a writer thread updating a different but nearby memory location…

A number of readers have asked for more information and examples on where false sharing arises and how to deal with it. … This month, let’s consider a concrete example that shows an algorithm in extremis due to false sharing distress, how to use tools to analyze the problem, and the two coding techniques we can use to eliminate false sharing trouble. …

I hope you enjoy it. Finally, here are links to previous Effective Concurrency columns:

The Pillars of Concurrency (Aug 2007)

How Much Scalability Do You Have or Need? (Sep 2007)

Use Critical Sections (Preferably Locks) to Eliminate Races (Oct 2007)

Apply Critical Sections Consistently (Nov 2007)

Avoid Calling Unknown Code While Inside a Critical Section (Dec 2007)

Use Lock Hierarchies to Avoid Deadlock (Jan 2008)

Break Amdahl’s Law! (Feb 2008)

Going Superlinear (Mar 2008)

Super Linearity and the Bigger Machine (Apr 2008)

Interrupt Politely (May 2008)

Maximize Locality, Minimize Contention (Jun 2008)

Choose Concurrency-Friendly Data Structures (Jul 2008)

The Many Faces of Deadlock (Aug 2008)

Lock-Free Code: A False Sense of Security (Sep 2008)

Writing Lock-Free Code: A Corrected Queue (Oct 2008)

Writing a Generalized Concurrent Queue (Nov 2008)

Understanding Parallel Performance (Dec 2008)

Measuring Parallel Performance: Optimizing a Concurrent Queue (Jan 2009)

volatile vs. volatile (Feb 2009)

Sharing Is the Root of All Contention (Mar 2009)

Use Threads Correctly = Isolation + Asynchronous Messages (Apr 2009)

“Use Thread Pools Correctly: Keep Tasks Short and Nonblocking” (Apr 2009)

“Eliminate False Sharing” (May 2009)

Effective Concurrency: Use Thread Pools Correctly – Keep Tasks Short and Nonblocking

This month’s Effective Concurrency column, “Use Thread Pools Correctly: Keep Tasks Short and Nonblocking”, is now live on DDJ’s website.

From the article:

… But the thread pool is a leaky abstraction. That is, the pool hides a lot of details from us, but to use it effectively we do need to be aware of some things a pool does under the covers so that we can avoid inadvertently hitting performance and correctness pitfalls. Here’s the summary up front:

1. Tasks should be small, but not too small, otherwise performance overheads will dominate.

2. Tasks should avoid blocking (waiting idly for other events, including inbound messages or contested locks), otherwise the pool won’t consistently utilize the hardware well — and, in the extreme worst case, the pool could even deadlock.

Let’s see why. …

I hope you enjoy it. Finally, here are links to previous Effective Concurrency columns:

The Pillars of Concurrency (Aug 2007)

How Much Scalability Do You Have or Need? (Sep 2007)

Use Critical Sections (Preferably Locks) to Eliminate Races (Oct 2007)

Apply Critical Sections Consistently (Nov 2007)

Avoid Calling Unknown Code While Inside a Critical Section (Dec 2007)

Use Lock Hierarchies to Avoid Deadlock (Jan 2008)

Break Amdahl’s Law! (Feb 2008)

Going Superlinear (Mar 2008)

Super Linearity and the Bigger Machine (Apr 2008)

Interrupt Politely (May 2008)

Maximize Locality, Minimize Contention (Jun 2008)

Choose Concurrency-Friendly Data Structures (Jul 2008)

The Many Faces of Deadlock (Aug 2008)

Lock-Free Code: A False Sense of Security (Sep 2008)

Writing Lock-Free Code: A Corrected Queue (Oct 2008)

Writing a Generalized Concurrent Queue (Nov 2008)

Understanding Parallel Performance (Dec 2008)

Measuring Parallel Performance: Optimizing a Concurrent Queue (Jan 2009)

volatile vs. volatile (Feb 2009)

Sharing Is the Root of All Contention (Mar 2009)

Use Threads Correctly = Isolation + Asynchronous Messages (Apr 2009)

“Use Thread Pools Correctly: Keep Tasks Short and Nonblocking” (Apr 2009)

New Dates for Effective Concurrency Seminar in Europe: May 27-29, Stockholm, Sweden

Now that I’m over the icky flu that forced me to postpone the seminar two weeks ago, I’m happy to say that we have new dates: Effective Concurrency (Europe) will be held on May 27-29, 2009, in Stockholm, Sweden. I’ll cover the following topics:

  • Fundamentals: Define basic concurrency goals and requirements • Understand applications’ scalability needs • Key concurrency patterns
  • Isolation — Keep work separate: Running tasks in isolation and communicate via async messages • Integrating multiple messaging systems, including GUIs and sockets • Building responsive applications using background workers • Threads vs. thread pools
  • Scalability — Re-enable the Free Lunch: When and how to use more cores • Exploiting parallelism in algorithms • Exploiting parallelism in data structures • Breaking the scalability barrier
  • Consistency — Don’t Corrupt Shared State: The many pitfalls of locks–deadlock, convoys, etc. • Locking best practices • Reducing the need for locking shared data • Safe lock-free coding patterns • Avoiding the pitfalls of general lock-free coding • Races and race-related effects
  • High Performance Concurrency: Machine architecture and concurrency • Costs of fundamental operations, including locks, context switches, and system calls • Memory and cache effects • Data structures that support and undermine concurrency • Enabling linear and superlinear scaling
  • Migrating Existing Code Bases to Use Concurrency
  • Near-Future Tools and Features

I hope to see some of you there!

Effective Concurrency: Use Threads Correctly = Isolation + Asynchronous Messages

This month’s Effective Concurrency column, “Use Threads Correctly = Isolation + Asynchronous Messages”, is now live on DDJ’s website.

From the article:

Explicit threads are undisciplined. They need some structure to keep them in line. In this column, we’re going to see what that structure is, as we motivate and illustrate best practices for using threads — techniques that will make our concurrent code easier to write correctly and to reason about with confidence.

I hope you enjoy it. Finally, here are links to previous Effective Concurrency columns and the DDJ print magazine issue in which they first appeared:

The Pillars of Concurrency (Aug 2007)

How Much Scalability Do You Have or Need? (Sep 2007)

Use Critical Sections (Preferably Locks) to Eliminate Races (Oct 2007)

Apply Critical Sections Consistently (Nov 2007)

Avoid Calling Unknown Code While Inside a Critical Section (Dec 2007)

Use Lock Hierarchies to Avoid Deadlock (Jan 2008)

Break Amdahl’s Law! (Feb 2008)

Going Superlinear (Mar 2008)

Super Linearity and the Bigger Machine (Apr 2008)

Interrupt Politely (May 2008)

Maximize Locality, Minimize Contention (Jun 2008)

Choose Concurrency-Friendly Data Structures (Jul 2008)

The Many Faces of Deadlock (Aug 2008)

Lock-Free Code: A False Sense of Security (Sep 2008)

Writing Lock-Free Code: A Corrected Queue (Oct 2008)

Writing a Generalized Concurrent Queue (Nov 2008)

Understanding Parallel Performance (Dec 2008)

Measuring Parallel Performance: Optimizing a Concurrent Queue (Jan 2009)

volatile vs. volatile (Feb 2009)

Sharing Is the Root of All Contention (Mar 2009)

Use Threads Correctly = Isolation + Asynchronous Messages (Apr 2009)

Free Training For Laid-Off Developers

Like many areas in the United States, Seattle has recently been hit with layoffs and downsizing in our industry. So it’s quite timely that Steve McConnell’s company Construx, in the Seattle area, is offering free training for laid-off software workers:

After listening to doom and gloom economic reports for the past few months, we decided we would try to do something to brighten our little corner of the world. Here’s our official press release about it:

Construx Software has designated 25% of its public seminar seats free of charge to software workers who have been laid off. Construx seminars help software professionals improve their technical and managerial skills. Seminar attendees will be more effective when they reenter the workforce. Construx hopes this program will help laid-off software workers reenter the workforce more quickly.

Please read the full article for more details.

I’m very pleased to report that this program will include my next U.S. Effective Concurrency seminar which will be held in Bellevue, WA on July 27-30, 2009. I’ll blog about that seminar again as we get closer, but it’s on the Construx calendar now.

I hope that this will be helpful to some people, and I look forward to seeing some of you in Bellevue.

Aside: Please Be Professional

One problem with offering things for free is that people don’t always value them. I was surprised to learn that a surprising number of people who’ve asked for and received this free admission at Construx have been no-shows.

The following should go without saying, but here it is: As a courtesy to Construx and to other attendees, please don’t sign up unless you intend to come, and if something comes up that prevents you from attending please notify Construx. Otherwise, you may be taking the place of someone who would have liked to attend in your place (seating is limited), and Construx is out the cost of food (they provide breakfast and lunch) and printed color binders and such that go to waste.

Other Free Materials

Most of you know that virtually all the articles and columns I’ve ever written have always been available for free on my own website or via magazine websites (modulo some link rot, sigh). These include all the Effective Concurrency articles, and my C++-focused “Guru of the Week” and magazine articles.

The only things I’ve written that aren’t legally freely available are the final texts of my books, including translations, to which the publishing company owns the copyright; but the books are based on the same freely-available English articles, and Addison-Wesley/Pearson has never had any issue with those staying available, so the basic material is all there.

But there’s news on the “final book contents” front, too, at least in English: Once my current book(s) on Effective Concurrency is done, Jim Hyslop and I plan to compile a book of the C++ Conversations articles we coauthored in C++ Report and C/C++ Users Journal (thanks again for your patience, Jim!) and I also plain to write a second edition of Exceptional C++ – and both books will be updated for C++0x, which is now feature-complete and undergoing its first round of international review. It’s quite interesting just how often using C++0x language and library features (from lambdas to shared_ptrs and concurrency) really help solve the old problems in even more elegant and robust ways… and I’m sure we’ll throw in a couple of new problems and solutions too. My goal is to post these books’ draft Items, perhaps in the form of blog entries, as we write them. But I’m also working with the publisher to see if perhaps we could even post the very final Items in that format with their permission. I’ll post more news about that as there’s something to say.

Effective Concurrency: Sharing Is the Root of All Contention

This month’s Effective Concurrency column, “Sharing Is the Root of All Contention”, is now live on DDJ’s website.

This article aims to address the root cause behind some frequently made assertions: Statements like “locks kill scalability” and “CAS kills scalability” are mostly true but focus on symptoms rather than causes; and others such as “reader/writer mutexes are great for scalability” at minimum need some heavy qualification.

From the article:

Quick: What fundamental principle do all of the following have in common?

  • Two children drawing with crayons.
  • Three customers at Starbucks adding cream to their coffee.
  • Four processes running on a CPU core.
  • Five threads updating an object protected by a mutex.
  • Six threads running on different processors updating the same lock-free queue.
  • Seven modules using the same reference-counted shared object.

Answer: In each case, multiple users share a resource that requires coordination because not everyone can use it simultaneously — and such sharing is the root cause of all resulting contention that will degrade everyone’s performance. (Note: The only kind of shared resource that doesn’t create contention is one that inherently requires no synchronization because all users can use it simultaneously without restriction, which means it doesn’t fall into the description above. For example, an immutable object can have its unchanging bits be read by any number of threads or processors at any time without synchronization.)

In this article, I’ll deliberately focus most of the examples on one important case, namely mutable (writable) shared objects in memory, which are an inherent bottleneck to scalability on multicore systems. But please don’t lose sight of the key point, namely that "sharing causes contention" is a general principle that is not limited to shared memory or even to computing.

The Inherent Costs of Sharing

Here’s the issue in one sentence: Sharing fundamentally requires waiting and demands answers to expensive questions. …

Aside: This will be in the first electronic-only Dr. Dobb’s Report, and I’m going to have to get used to Dobb’s being electronic-only. One consequence of this change that I just discovered today is that an article can go online hours after I correct the proofs. That may seem like it shouldn’t be surprising in the days of blogs, but a magazine isn’t (or at least wasn’t) like a blog. When I wrote for magazines like C++ Report in the late 1990s, the time between submitting the final column manuscript and seeing it in print was about two to three months, depending on the magazine. Since the mid-2000s with Dobb’s, it’s been only one month from proofs to print — and Dobb’s could have posted the online versions even sooner, but held them back to be semi-synced with the print magazine. Now that the print constraint has disappeared, apparently I can (and, today, did) start work on a Friday morning to find the Dobb’s proofs I corrected the night before already on the web. Nifty. (And thanks again, Deirdre!)

I hope you enjoy it. Finally, here are links to previous Effective Concurrency columns and the DDJ print magazine issue in which they first appeared:

The Pillars of Concurrency (Aug 2007)

How Much Scalability Do You Have or Need? (Sep 2007)

Use Critical Sections (Preferably Locks) to Eliminate Races (Oct 2007)

Apply Critical Sections Consistently (Nov 2007)

Avoid Calling Unknown Code While Inside a Critical Section (Dec 2007)

Use Lock Hierarchies to Avoid Deadlock (Jan 2008)

Break Amdahl’s Law! (Feb 2008)

Going Superlinear (Mar 2008)

Super Linearity and the Bigger Machine (Apr 2008)

Interrupt Politely (May 2008)

Maximize Locality, Minimize Contention (Jun 2008)

Choose Concurrency-Friendly Data Structures (Jul 2008)

The Many Faces of Deadlock (Aug 2008)

Lock-Free Code: A False Sense of Security (Sep 2008)

Writing Lock-Free Code: A Corrected Queue (Oct 2008)

Writing a Generalized Concurrent Queue (Nov 2008)

Understanding Parallel Performance (Dec 2008)

Measuring Parallel Performance: Optimizing a Concurrent Queue (Jan 2009)

volatile vs. volatile (Feb 2009)

Sharing Is the Root of All Contention (Mar 2009)

Effective Concurrency Seminar in Europe: March 16-18, Stockholm, Sweden

A number of people have asked whether I will be teaching my Effective Concurrency seminar in Europe. The answer is yes:

Effective Concurrency (Europe) will be held on March 16-18, 2009, in Stockholm, Sweden. This is my only public European seminar in 2009. I’ll cover the following topics:

  • Fundamentals: Define basic concurrency goals and requirements • Understand applications’ scalability needs • Key concurrency patterns
  • Isolation — Keep work separate: Running tasks in isolation and communicate via async messages • Integrating multiple messaging systems, including GUIs and sockets • Building responsive applications using background workers • Threads vs. thread pools
  • Scalability — Re-enable the Free Lunch: When and how to use more cores • Exploiting parallelism in algorithms • Exploiting parallelism in data structures • Breaking the scalability barrier
  • Consistency — Don’t Corrupt Shared State: The many pitfalls of locks–deadlock, convoys, etc. • Locking best practices • Reducing the need for locking shared data • Safe lock-free coding patterns • Avoiding the pitfalls of general lock-free coding • Races and race-related effects
  • High Performance Concurrency: Machine architecture and concurrency • Costs of fundamental operations, including locks, context switches, and system calls • Memory and cache effects • Data structures that support and undermine concurrency • Enabling linear and superlinear scaling
  • Migrating Existing Code Bases to Use Concurrency
  • Near-Future Tools and Features

I hope to see some of you there!

Effective Concurrency: volatile vs. volatile

This month’s Effective Concurrency column, “volatile vs. volatile”, is now live on DDJ’s website and also appears in the print magazine. (As a historical note, it’s DDJ’s final print issue, as I mentioned previously.)

This article aims to answer the frequently asked question: “What does volatile mean?” The short answer: “It depends, do you mean Java/.NET volatile or C/C++ volatile?” From the article:

What does the volatile keyword mean? How should you use it? Confusingly, there are two common answers, because depending on the language you use volatile supports one or the other of two different programming techniques: lock-free programming, and dealing with ‘unusual’ memory.

Adding to the confusion, these two different uses have overlapping requirements and impose overlapping restrictions, which makes them appear more similar than they are. Let’s define and understand them clearly, and see how to spell them correctly in C, C++, Java and C# — and not always as volatile. …

I hope you enjoy it. Finally, here are links to previous Effective Concurrency columns and the DDJ print magazine issue in which they first appeared:

The Pillars of Concurrency (Aug 2007)

How Much Scalability Do You Have or Need? (Sep 2007)

Use Critical Sections (Preferably Locks) to Eliminate Races (Oct 2007)

Apply Critical Sections Consistently (Nov 2007)

Avoid Calling Unknown Code While Inside a Critical Section (Dec 2007)

Use Lock Hierarchies to Avoid Deadlock (Jan 2008)

Break Amdahl’s Law! (Feb 2008)

Going Superlinear (Mar 2008)

Super Linearity and the Bigger Machine (Apr 2008)

Interrupt Politely (May 2008)

Maximize Locality, Minimize Contention (Jun 2008)

Choose Concurrency-Friendly Data Structures (Jul 2008)

The Many Faces of Deadlock (Aug 2008)

Lock-Free Code: A False Sense of Security (Sep 2008)

Writing Lock-Free Code: A Corrected Queue (Oct 2008)

Writing a Generalized Concurrent Queue (Nov 2008)

Understanding Parallel Performance (Dec 2008)

Measuring Parallel Performance: Optimizing a Concurrent Queue (Jan 2009)

volatile vs. volatile (Feb 2009)

TRS-80 vs. Alpha, and Parallel Optimization

Lest people get the wrong idea, I enjoy reading Jeff Atwood’s blog and agree with much of what he writes so entertainingly and provocatively. So far I’ve only responded when I strongly felt differently about something, which has been a grand total of twice now.

So let me also offer an example of something I wholeheartedly agree with. Yesterday, Jeff cited what is also my own favorite Programming Pearls figure:

 

Despite the enduring wonder of the yearly parade of newer, better hardware, we’d also do well to remember my all time favorite graph from Programming Pearls:

trs80vsalpha

Everything is fast for small n.

Spot on. If you’re a professional programmer and haven’t read Programming Pearls yet, “run don’t walk” to your bookstore of choice.

Incidentally, just to tie this in to parallel computing as well, Jeff’s article also cites a nice graph of optimizations that improved NDepend:

Patrick Smacchia’s lessons learned from a real-world focus on performance is a great case study in optimization.

ndepend optimization graph

Patrick was able to improve nDepend analysis performance fourfold, and cut memory consumption in half. As predicted, most of this improvement was algorithmic in nature, but at least half of the overall improvement came from a variety of different optimization techniques.

As I’ve said many times, measure twice, optimize once: Know when and where to optimize. Profilers are your friend. As Patrick writes:

When it comes to enhancing performance there is only one way to do things right: measure and focus your work on the part of the code that really takes the bulk of time, not the one that you think takes the bulk of time.

And of course, in our ever-more-multicore world, the contribution of parallelization gain will continue to grow and dominate the optimization of CPU-bound code. But as Patrick also notes, realizing that gain is not always trivial:

While we get the 15% gain from between 1 and 2 processors, the gain is almost zero between 2 and 4 processors. We identified some potential IO contentions and memory issues that will require more attention in the future. This leads to another lesson: Don’t expect that scaling on many processors will be something easy, even if you don’t share states and don’t use synchronization.