Effective Concurrency: Choose Concurrency-Friendly Data Structures

The latest Effective Concurrency column, “Choose Concurrency-Friendly Data Structures”, just went live on DDJ’s site, and also appears in the print magazine. From the article:

What is a high-performance data structure? To answer that question, we’re used to applying normal considerations like Big-Oh complexity, memory overhead, locality, and traversal order. All of those apply to both sequential and concurrent software.

But in concurrent code, we need to consider two additional things to help us pick a data structure that is also sufficiently concurrency-friendly:

  • In parallel code, your performance needs likely include the ability to allow multiple threads to use the data at the same time. If this is (or may become) a high-contention data structure, does it allow for concurrent readers and/or writers in different parts of the data structure at the same time? If the answer is, “No,” then you may be designing an inherent bottleneck into your system and be just asking for lock convoys as threads wait, only one being able to use the data structure at a time.
  • On parallel hardware, you may also care about minimizing the cost of memory synchronization. When one thread updates one part of the data structure, how much memory needs to be moved to make the change visible to another thread? If the answer is, “More than just the part that has ostensibly changed,” then again you’re asking for a potential performance penalty, this time due to cache sloshing as more data has to move from the core that performed the update to the core that is reading the result.

It turns out that both of these answers are directly influenced by whether the data structure allows truly localized updates. …

I hope you enjoy it.
 
Finally, here are links to previous Effective Concurrency columns (based on the magazine print issue dates):
August 2007 The Pillars of Concurrency
September 2007 How Much Scalability Do You Have or Need?
October 2007 Use Critical Sections (Preferably Locks) to Eliminate Races
November 2007 Apply Critical Sections Consistently
December 2007 Avoid Calling Unknown Code While Inside a Critical Section
January 2008 Use Lock Hierarchies to Avoid Deadlock
February 2008 Break Amdahl’s Law!
March 2008 Going Superlinear
April 2008 Super Linearity and the Bigger Machine
May 2008 Interrupt Politely
June 2008 Maximize Locality, Minimize Contention
July 2008 Choose Concurrency-Friendly Data Structures

Stroustrup & Sutter on C++: The Interviews

While Bjarne and I were at SD for S&S, we took time out to do an interview together with Ted Neward for InformIT. I just got word that it went live… here are the links.

On the OnSoftware – Video (RSS):

On the OnSoftware – Audio (RSS):

Enjoy!

Memory Model talk at Gamefest 2008

I’ll be giving a memory model talk at Gamefest in Seattle next month. Here’s a quick summary:

Memory Models: Foundational Knowledge for Concurrent Code
July 22-23, 2008
Gamefest 2008
Seattle, WA, USA

A memory model defines a contract between the programmer and the execution environment that trades off:

  • programmability via stronger guarantees for programmers, vs.
  • performance via greater flexibility for reordering program memory operations.

The “execution environment” includes everything from the compiler and optimizer on down to the CPU and cache hardware, and it really wants to help you by reordering your program to make it run faster. You, on the other hand, want it not to help you excessively in ways that will break the meaning of your code. In this talk, we’ll consider why a memory model is important, how to achieve a reasonable balance, detailed considerations on current and future PC and Xbox platforms, and some best practices for writing solid concurrent code.

Effective Concurrency: Maximize Locality, Minimize Contention

The latest Effective Concurrency column, “Maximize Locality, Minimize Contention”, just went live on DDJ’s site, and also appears in the print magazine. From the article:

Want to kill your parallel application’s scalability? Easy: Just add a dash of contention.

Locality is no longer just about fitting well into cache and RAM, but also about avoiding scalability busters by keeping tightly coupled data physically close together and separately used data far, far apart. …
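As a small illustration of the "far, far apart" half (my sketch, not the article’s): when several threads update adjacent counters, the counters share a cache line and every update drags that line between cores. Padding each counter out to a 64-byte line (an assumed line size) keeps separately used data physically apart:

```cpp
#include <atomic>
#include <thread>
#include <vector>

// Sketch: per-thread counters padded to an assumed 64-byte cache line so
// that each core updates its own line and no line ping-pongs ("sloshes")
// between caches. Without alignas(64), the counters would be packed into
// one or two lines and contend even though they are logically independent.
struct alignas(64) padded_counter {
    std::atomic<long> value{0};
};

long count_in_parallel(int num_threads, long per_thread) {
    std::vector<padded_counter> counters(num_threads);
    std::vector<std::thread> workers;
    for (int t = 0; t < num_threads; ++t)
        workers.emplace_back([&counters, t, per_thread] {
            for (long i = 0; i < per_thread; ++i)
                counters[t].value.fetch_add(1, std::memory_order_relaxed);
        });
    for (auto& w : workers) w.join();
    long total = 0;
    for (auto& c : counters) total += c.value.load();
    return total;
}
```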

I hope you enjoy it.
 
Finally, here are links to previous Effective Concurrency columns (based on the dates they hit the web, not the magazine print issue dates):
July 2007 The Pillars of Concurrency
August 2007 How Much Scalability Do You Have or Need?
September 2007 Use Critical Sections (Preferably Locks) to Eliminate Races
October 2007 Apply Critical Sections Consistently
November 2007 Avoid Calling Unknown Code While Inside a Critical Section
December 2007 Use Lock Hierarchies to Avoid Deadlock
January 2008 Break Amdahl’s Law!
February 2008 Going Superlinear
March 2008 Super Linearity and the Bigger Machine
April 2008 Interrupt Politely
May 2008 Maximize Locality, Minimize Contention

Part 2 of concurrency interview with DevX

Part 2 of DevX’s interview with me about concurrency just went live on the web. From the article’s blurb:

What does the future hold for concurrency? What will happen to the tools and techniques around concurrent programming? In part two of our series, concurrency guru Herb Sutter talks about these issues and what developers need to be reading to understand concurrency. 

… In this final installment he looks into his crystal ball with an eye towards the future and gives developers hints for the resources they need to be better concurrent programmers.

This part touches on a variety of topics, from right-now items like delivering parallelism internally inside libraries to shield the programmer from knowing about concurrency and where to look for further reading, to future topics like transactional memory and upcoming homogeneous vs. heterogeneous manycore CPUs. I hope you enjoy it.

(March’s part 1 is here.)

Quad-core a "waste of electricity"?

Jeff Atwood wrote:

In my opinion, quad-core CPUs are still a waste of electricity unless you’re putting them in a server. Four cores on the desktop is great for bragging rights and mathematical superiority (yep, 4 > 2), but those four cores provide almost no benchmarkable improvement in the type of applications most people use. Including software development tools.

Really? You must not be using the right tools. :-) For example, here are three I’m familiar with:

Visual C++ 2008’s /MP flag tells the compiler to compile files in the same project in parallel. I typically get linear speedups on the compile phase. The link phase is still sequential, but on most projects compilation dominates.
Since Visual Studio 2005 we’ve supported parallel project builds in Batch Build mode, where you can build multiple subprojects in parallel (e.g., compile your release and debug builds in parallel), though that feature didn’t let you compile multiple files in the same project in parallel. (As I’ve blogged about before, Visual C++ 2005 actually already shipped with the /MP feature, but it was undocumented.)
Excel 2007 does parallel recalculation. Assuming the spreadsheet is large and doesn’t just contain sequential dependencies between cells, it usually scales linearly up to at least 8 cores (the most I heard that was tested before shipping). I’m told that customers who are working on big financial spreadsheets love it.
And need I mention games? (This is just a snarky comment… Jeff already correctly noted that “rendering, encoding, or scientific applications” are often scalable today.)

And of course, even if you’re having a terrible day and not a single one of your applications can use more than one core, you can still see real improvement on CPU-intensive multi-application workloads on a multicore machine today, such as by being able to run other foreground applications at full speed while encoding a movie in the background.

Granted, as I’ve said before, we do need to see examples of mainstream applications (e.g., something your dad might use) that exploit manycore (e.g., >10 cores). But it’s overreaching to claim that there are no applications at all that exploit multicore (e.g., <10 cores), not even development tools. We may not yet have achieved the mainstream manycore killer app, but it isn’t like we have nothing to show at all. We have started out on the road that will take us there.

Effective Concurrency: Interrupt Politely

The latest Effective Concurrency column, “Interrupt Politely”, just went live on DDJ’s site, and will also appear in the print magazine. From the article:

Violence isn’t the answer.

We want to be able to stop a running thread or task when we discover that we no longer need or want to finish it. As we saw in the last two columns, in a simple parallel search we can stop other workers once one finds a match, and when speculatively running two alternative algorithms to compute the same result we can stop the longer-running one once the first finds a result. [1,2] Stopping threads or tasks lets us reclaim their resources, including locks, and apply them to other work.

But how do you stop a thread or task you no longer need or want? Table 1 summarizes the four main ways, and how they are supported on several major platforms. Let’s consider them in turn. …
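The "polite" option the column recommends boils down to cooperative cancellation; here is a minimal sketch of mine (in C++11 syntax, with a made-up flag name): the task periodically checks a shared flag and exits cleanly at a safe point, releasing its resources, rather than being killed from outside.

```cpp
#include <atomic>

// Sketch: cooperative cancellation. The worker checks stop_requested at
// each safe point and exits voluntarily, so destructors run and locks
// are released, unlike a forced thread kill.
std::atomic<bool> stop_requested{false};

long worker_loop() {
    long iterations = 0;
    while (!stop_requested.load(std::memory_order_relaxed)) {
        ++iterations;                      // one unit of interruptible work
        if (iterations >= 1000000) break;  // safety bound for the demo
    }
    return iterations;  // cleanup (locks, buffers) happens on this normal exit
}
```

If the flag is already set when the worker checks it, the loop body never runs and the task unwinds normally.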

I hope you enjoy it.
 
Finally, here are links to previous Effective Concurrency columns (based on the dates they hit the web, not the magazine print issue dates):
July 2007 The Pillars of Concurrency
August 2007 How Much Scalability Do You Have or Need?
September 2007 Use Critical Sections (Preferably Locks) to Eliminate Races
October 2007 Apply Critical Sections Consistently
November 2007 Avoid Calling Unknown Code While Inside a Critical Section
December 2007 Use Lock Hierarchies to Avoid Deadlock
January 2008 Break Amdahl’s Law!
February 2008 Going Superlinear
March 2008 Super Linearity and the Bigger Machine
April 2008 Interrupt Politely

Concurrency Interview with DevX

I recently spent an hour on the phone to talk concurrency with DevX’s Alexa Weber Morales. Part 1 of that interview just went live on the web, and focuses mostly on what concurrency and parallelism are, how to take advantage of multicore chips, and whether concurrency will ever be really accessible to mainstream developers. The site seems to be having intermittent problems displaying the pages; just hit the link a few more times if it doesn’t work right away.

Disclaimer: I am not responsible for the article title (yikes! “whisperer”?! my goodness gracious) and Alexa’s intro blurb is way too kind. But it is true that it’s important for us-the-industry to bring concurrency to the mainstream in a grokkable way, as we have already successfully done with OO and GUIs in the past.

New Course Available: Effective Concurrency

Many of you have kindly sent mail about my Effective Concurrency columns and asking when there’ll be a course. Well, I’m happy to announce that the answer is: May 19-21, 2008.

Here’s the brief information (more details below):

3-Day Seminar: Effective Concurrency

May 19-21, 2008
Bellevue, WA, USA
Developed and taught by Herb Sutter

This course covers the fundamental tools that software developers need to write effective concurrent software for both single-core and multi-core/many-core machines. To use concurrency effectively, we must identify and solve four key challenges:

  • Leverage the ability to perform and manage work asynchronously
  • Build applications that naturally run faster on new hardware having more and more cores
  • Manage shared objects in memory effectively to avoid races and deadlocks
  • Engineer specifically for high performance

This seminar will equip attendees to reason correctly about concurrency requirements and tradeoffs, to migrate existing code bases to be concurrency-enabled, and to achieve key success factors for a concurrent programming project. Most code examples in the course can be directly translated to popular platforms and concurrency libraries, including Linux, Windows, Java, .NET, pthreads, and the forthcoming ISO C++0x standard.

Note on class size limit and possible waitlist: There is a hard limit on attendance at this first one (really). But if the registration site says you’ll get waitlisted, don’t give up: Go ahead and sign up anyway because we may be able to put together a second installment of the seminar a week or two later if there’s enough interest.

Finally, here’s a summary of what we’ll cover during the three days.

Fundamentals

  • Define basic concurrency goals and requirements
  • Understand applications’ scalability needs
  • Key concurrency patterns

Isolation: Keep Work Separate

  • Running tasks in isolation and communicating via async messages
  • Integrating multiple messaging systems, including GUIs and sockets
  • Building responsive applications using background workers
  • Threads vs. thread pools

Scalability: Re-enable the Free Lunch

  • When and how to use more cores 
  • Exploiting parallelism in algorithms 
  • Exploiting parallelism in data structures 
  • Breaking the scalability barrier

Consistency: Don’t Corrupt Shared State

  • The many pitfalls of locks: deadlock, convoys, etc.
  • Locking best practices
  • Reducing the need for locking shared data
  • Safe lock-free coding patterns
  • Avoiding the pitfalls of general lock-free coding
  • Races and race-related effects

Migrating Existing Code Bases to Use Concurrency

Near-Future Tools and Features

High Performance Concurrency

  • Machine architecture and concurrency
  • Costs of fundamental operations, including locks, context switches, and system calls
  • Memory and cache effects
  • Data structures that support and undermine concurrency
  • Enabling linear and superlinear scaling

I hope to get to meet some of you here in the Seattle area!

Effective Concurrency: Super Linearity and the Bigger Machine

The latest Effective Concurrency column, "Super Linearity and the Bigger Machine", just went live on DDJ’s site, and will also appear in the print magazine. From the article:

There are two main ways to achieve superlinear scalability, or to use P processors to compute an answer more than P times faster…:

  • Do disproportionately less work.
  • Harness disproportionately more resources.

Last month, we focused on the first point by illustrating parallel search and how it naturally achieves superlinear speedups when matches are not distributed evenly because some workers get "rich" subranges and will find a match faster, which benefits the whole search because we can stop as soon as any worker finds a match.

This month, we’ll conclude examining the first point with a few more examples, and then consider how to achieve superlinear speedups by harnessing more resources—quite literally, running on a bigger machine without any change in the hardware. …
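The parallel-search idea from last month’s column can be sketched in a few lines (my illustration, in C++11 syntax): workers scan disjoint subranges and set a shared flag as soon as any of them finds a match, so the others stop early; a worker with a "rich" subrange finds a match quickly and spares everyone the rest of the scan.

```cpp
#include <algorithm>
#include <atomic>
#include <cstddef>
#include <thread>
#include <vector>

// Sketch: parallel search with early exit. Each worker scans its own
// subrange and polls the shared `found` flag; whichever worker finds a
// match records the index and signals the rest to stop scanning.
int parallel_find(const std::vector<int>& data, int target, int num_workers) {
    std::atomic<bool> found{false};
    std::atomic<int> result{-1};
    std::vector<std::thread> workers;
    std::size_t chunk = (data.size() + num_workers - 1) / num_workers;
    for (int w = 0; w < num_workers; ++w)
        workers.emplace_back([&, w] {
            std::size_t begin = w * chunk;
            std::size_t end = std::min(begin + chunk, data.size());
            for (std::size_t i = begin; i < end && !found.load(); ++i)
                if (data[i] == target) {
                    result.store(static_cast<int>(i));
                    found.store(true);  // tell the other workers to stop
                    return;
                }
        });
    for (auto& t : workers) t.join();
    return result.load();
}
```

This is exactly the "do disproportionately less work" effect: total work done shrinks as soon as any worker succeeds, which is how uneven match distributions produce superlinear speedups.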

I hope you enjoy it.

Finally, here are links to previous Effective Concurrency columns (based on the dates they hit the web, not the magazine print issue dates):
July 2007 The Pillars of Concurrency
August 2007 How Much Scalability Do You Have or Need?
September 2007 Use Critical Sections (Preferably Locks) to Eliminate Races
October 2007 Apply Critical Sections Consistently
November 2007 Avoid Calling Unknown Code While Inside a Critical Section
December 2007 Use Lock Hierarchies to Avoid Deadlock
January 2008 Break Amdahl’s Law!
February 2008 Going Superlinear
March 2008 Super Linearity and the Bigger Machine