Two More C&B Sessions: C++0x Memory Model (Scott) and Exceptional C++0x (me)

Scott Meyers, Andrei Alexandrescu and I are continuing to craft and announce the technical program for C++ and Beyond (C&B) 2011, and two more sessions are now posted. All talks are brand-new material created specifically for C&B 2011. Here are short blurbs; follow the links for longer descriptions.

  • Scott will give a great new talk on “The C++0x Memory Model and Why You Care” that will cover topics of interest to anybody who cares about concurrency and parallel programming under C++0x: everything from compiler optimizations and memory access reorderings, to “sequenced before” and “happens before” relations, to atomic types and memory consistency models, and how they all relate to both correctness and performance. This is stuff that in a perfect world nobody should ever have to know, but in our actual world every modern C++ developer who cares about correct high-performance code has to understand it thoroughly.
  • I’ll be giving a brand-new talk, “Exceptional C++0x (aka C++11),” that shows how the new features in C++0x change the way we solve problems, our C++ coding style, and even the way we think about our code. I’ll demonstrate with code that works today on existing compilers, using selected familiar examples from my Exceptional C++ books. This is not rehashed material: I’ll assume you’re already familiar with the pre-C++0x solutions (I’ll provide links to read as refreshers before the course), and then we’ll re-analyze and re-solve the problems entirely the 21st-century C++ way, and see why C++0x feels like a fresh new language that leads to different approaches, new and changed guidelines, and even better solutions. As Bjarne put it: “Surprisingly, C++0x feels like a new language: The pieces just fit together better than they used to and I find a higher-level style of programming more natural than before and as efficient as ever.” This talk will show why — deeply, madly, and truly.

The other two talks, which I announced last week, are the following:

  • Andrei will be giving an in-depth talk on “BIG: C++ Strategies, Data Structures, and Algorithms Aimed at Scalability.” Briefly, it’s about writing high-performance C++ code for highly distributed architectures, focusing on translating C++’s strong modeling capabilities directly to great scaling and/or great savings, and finding the right but non-intuitive C++ techniques and data structures to get there.
  • I’ll be giving a brand-new talk on “C++ and the GPU… and Beyond.” I’ll cover the state of the art for using C++ (not just C) for general-purpose computation on graphics processing units (GPGPU). The first half of the talk discusses the most important issues and techniques to consider when using GPUs for high-performance computation, especially where we have to change our traditional advice for doing the same computation on the CPU. The second half focuses on upcoming C++ language and library extensions that bring key abstractions for GPGPU — and in time considerably more — directly into C++.

I hope to see many of you there this August. Last year’s event sold out during the early-bird period, and although we’ve increased the attendance cap this year to make room for more, if you’re interested in coming you may want to register soon to reserve a place.

Keynote at the AMD Fusion Developer Summit

In a couple of months, I’ll be giving a keynote at the AMD Fusion Developer Summit, to be held June 13-16, 2011, in Bellevue, WA, USA.

Here’s my talk’s description as it appears on the conference website:

AFDS Keynote: “Heterogeneous Parallelism at Microsoft”
Herb Sutter, Microsoft Principal Architect, Native Languages

Parallelism is not just in full bloom, but increasingly in full variety. We know that getting full computational performance out of most machines—nearly all desktops and laptops, most game consoles, and the newest smartphones—already means harnessing local parallel hardware, mainly in the form of multicore CPU processing. This is the commoditization of the supercomputer.

More and more, however, getting that full performance can also mean using increasingly heterogeneous processing, from local GPGPU and Accelerated Processing Unit (APU) flavors to “often-on” remote parallel computing power in the form of elastic compute clouds. This is the generalization of the heterogeneous cluster in all its NUMA glory, and it’s appearing at all scales from on-die to on-machine to on-cloud.

In this talk, Microsoft’s chief native languages architect shares a vision of what this will mean for native software on Microsoft platforms from servers to devices, and showcases upcoming innovations that bring access to increasingly heterogeneous compute resources — from vector units and multicore, to GPGPU and APU, to elastic cloud — directly into the world’s most popular native languages.

If you’re interested in high performance code for GPUs, APUs, and other high-performance TLAs, I hope to see you there.

Note: This talk is related to, but different from, the GPU talk I’ll be presenting in August at C++ and Beyond 2011 (aka C&B). You can expect the above keynote to be, well, keynote-y… oriented toward software product features and of course AMD’s hardware, with plenty of forward-looking industry-vision material. My August C&B technical talk will be just that — an in-depth, performance-oriented, and sometimes-gritty technical session that will also mention product-related and hardware-specific material, but is primarily about heterogeneous hardware, with a more pragmatic, near-term focus.

Links I enjoyed reading this week

Concurrency-related (more or less directly)

Samples updated for ConcRT, PPL and Agents (Microsoft Parallel Programming blog)
Update to the samples for the Visual Studio 2010 Release Candidate. Hmm, I suppose I should include a link to that too:

Intel’s Core i7-980X Extreme processor (The Tech Report)
Desktop part with 12 hardware threads (6 cores x 2 threads/core), 32nm process, >1.1B transistors.

General information/amusement

Application compatibility layers are there for the customer, not for the program (Raymond Chen)
You wouldn’t believe the backward-compatibility hoops we need to hold in the right place for older apps to jump through — and then the app developers ask for more…

Jetpack to be commercially available soon? (Gizmag)
Yes, we see a story like this every few years. This one actually gets the flight time beyond just one minute. Now if we could only take the expected price down by one order of magnitude, and safety up by an order of magnitude or three…

Igor Ostrovsky and the Seven Cache Effects

My colleague Igor Ostrovsky has written a useful summary of seven cache memory effects that every advanced developer should know about because of their performance impact, particularly as we strive to keep invisible bottlenecks out of parallel code.

I’ve covered variations of Igor’s examples #1, #2, #3, and #6 in my Machine Architecture talk and several of my articles. His article provides a crisp and concise summary of these and three more kinds of cache effects along with simple and clear sample code and intriguing measurements (for example, see the detail in the graph for #5 and its analysis).

Recommended.

Effective Concurrency: Design for Manycore Systems

This month’s Effective Concurrency column, “Design for Manycore Systems”, is now live on DDJ’s website.

From the article:

Why worry about “manycore” today?

Dual- and quad-core computers are obviously here to stay for mainstream desktops and notebooks. But do we really need to think about “many-core” systems if we’re building a typical mainstream application right now? I find that, to many developers, “many-core” systems still feel fairly remote, and not an immediate issue to think about as they’re working on their current product.

This column is about why it’s time right now for most of us to think about systems with lots of cores. In short: Software is the (only) gating factor; as that gate falls, hardware parallelism is coming more and sooner than many people yet believe. …

I hope you enjoy it. Finally, here are links to previous Effective Concurrency columns:

The Pillars of Concurrency (Aug 2007)

How Much Scalability Do You Have or Need? (Sep 2007)

Use Critical Sections (Preferably Locks) to Eliminate Races (Oct 2007)

Apply Critical Sections Consistently (Nov 2007)

Avoid Calling Unknown Code While Inside a Critical Section (Dec 2007)

Use Lock Hierarchies to Avoid Deadlock (Jan 2008)

Break Amdahl’s Law! (Feb 2008)

Going Superlinear (Mar 2008)

Super Linearity and the Bigger Machine (Apr 2008)

Interrupt Politely (May 2008)

Maximize Locality, Minimize Contention (Jun 2008)

Choose Concurrency-Friendly Data Structures (Jul 2008)

The Many Faces of Deadlock (Aug 2008)

Lock-Free Code: A False Sense of Security (Sep 2008)

Writing Lock-Free Code: A Corrected Queue (Oct 2008)

Writing a Generalized Concurrent Queue (Nov 2008)

Understanding Parallel Performance (Dec 2008)

Measuring Parallel Performance: Optimizing a Concurrent Queue (Jan 2009)

volatile vs. volatile (Feb 2009)

Sharing Is the Root of All Contention (Mar 2009)

Use Threads Correctly = Isolation + Asynchronous Messages (Apr 2009)

Use Thread Pools Correctly: Keep Tasks Short and Nonblocking (Apr 2009)

Eliminate False Sharing (May 2009)

Break Up and Interleave Work to Keep Threads Responsive (Jun 2009)

The Power of “In Progress” (Jul 2009)

Design for Manycore Systems (Aug 2009)

Effective Concurrency: Eliminate False Sharing

This month’s Effective Concurrency column, “Eliminate False Sharing”, is now live on DDJ’s website.

People keep writing asking me about my previous mentions of false sharing, even debating whether it’s really a problem. So this month I decided to treat it in depth, including:

  • A compelling and realistic example where just changing a couple of lines to remove false sharing takes an algorithm from zero scaling to perfect scaling – even when many threads are merely doing reads. Hopefully after this nobody will argue that false sharing isn’t a problem. :-)
  • How your performance monitoring and analysis tools do and/or don’t help you uncover the problem, and how to use them effectively to identify the culprit. Short answer: CPU activity monitors aren’t very helpful, but cycles-per-instruction (CPI) and cache miss rate measurements attributed to specific lines of source code are your friend.
  • The two ways to correct the code: Reduce the frequency of writes to the too-popular cache line, or add padding to move other data off the line.
  • Reusable code in C++ and C#, and a note about Java, that you can use to apply padding (and alignment, where available) to put frequently updated objects on their own cache lines.

From the article:

In two previous articles I pointed out the performance issue of false sharing (aka cache line ping-ponging), where threads use different objects but those objects happen to be close enough in memory that they fall on the same cache line, and the cache system treats them as a single lump that is effectively protected by a hardware write lock that only one core can hold at a time. … It’s easy to see why the problem arises when multiple cores are writing to different parts of the same cache line… In practice, however, it can be even more common to encounter a reader thread using what it thinks is read-only data still getting throttled by a writer thread updating a different but nearby memory location…

A number of readers have asked for more information and examples on where false sharing arises and how to deal with it. … This month, let’s consider a concrete example that shows an algorithm in extremis due to false sharing distress, how to use tools to analyze the problem, and the two coding techniques we can use to eliminate false sharing trouble. …

I hope you enjoy it. Finally, here are links to previous Effective Concurrency columns:

The Pillars of Concurrency (Aug 2007)

How Much Scalability Do You Have or Need? (Sep 2007)

Use Critical Sections (Preferably Locks) to Eliminate Races (Oct 2007)

Apply Critical Sections Consistently (Nov 2007)

Avoid Calling Unknown Code While Inside a Critical Section (Dec 2007)

Use Lock Hierarchies to Avoid Deadlock (Jan 2008)

Break Amdahl’s Law! (Feb 2008)

Going Superlinear (Mar 2008)

Super Linearity and the Bigger Machine (Apr 2008)

Interrupt Politely (May 2008)

Maximize Locality, Minimize Contention (Jun 2008)

Choose Concurrency-Friendly Data Structures (Jul 2008)

The Many Faces of Deadlock (Aug 2008)

Lock-Free Code: A False Sense of Security (Sep 2008)

Writing Lock-Free Code: A Corrected Queue (Oct 2008)

Writing a Generalized Concurrent Queue (Nov 2008)

Understanding Parallel Performance (Dec 2008)

Measuring Parallel Performance: Optimizing a Concurrent Queue (Jan 2009)

volatile vs. volatile (Feb 2009)

Sharing Is the Root of All Contention (Mar 2009)

Use Threads Correctly = Isolation + Asynchronous Messages (Apr 2009)

Use Thread Pools Correctly: Keep Tasks Short and Nonblocking (Apr 2009)

Eliminate False Sharing (May 2009)

Answer to “16 Technologies”: Engelbart and the Mother of All Demos

A few days ago I posted a challenge to name the researcher/team and approximate year each of the following 16 important technologies was first demonstrated. In brief, they were:

  • The personal computer for dedicated individual use all day long.
  • The mouse.
  • Internetworks.
  • Network service discovery.
  • Live collaboration and desktop/app sharing.
  • Hierarchical structure within a file system and within a document.
  • Cut/copy/paste, with drag-and-drop.
  • Paper metaphor for word processing.
  • Advanced pattern search and macro search.
  • Keyword search and multiple weighted keyword search.
  • Catalog-based information retrieval.
  • Flexible interactive formatting and line drawing.
  • Hyperlinks within a document and across documents.
  • Tagging graphics, and parts of graphics, as hyperlinks.
  • Shared workgroup document collaboration with annotations etc.
  • Live shared workgroup collaboration with live audio/video teleconference in a window.

A single answer to all of the above: Doug Engelbart and his ARC team, in what is now known as “The Mother of All Demos”, on Monday, December 9, 1968.

Last month, we marked the 40th anniversary of the famous Engelbart Demo, a truly unique “Eureka!” moment in the history of computing. 40 years ago, Engelbart and his visionary team foresaw — and prototyped and demonstrated — many essential details of what we take for granted as our commonplace computing environment today, including all of the above-listed technologies, most of them demonstrated for the first time in that talk.

This talk would be noteworthy and historic just for being the first time a “mouse” was shown and called by that name. Yet the mouse was just one of over a dozen important innovations to be compellingly presented with working prototype implementations.

Note: Yes, some of the individual technologies have earlier theoretical roots. I deliberately phrased the question to focus on implementations because it’s great to imagine a new idea, but it isn’t engineering until we prove it can work by actually building it. For example, consider hypertext: Vannevar Bush’s Memex, vintage 1945, was a theoretical “proto-hypertext” system, but it unfortunately remained theoretical, understandably so given the nascent state of computers at the time. Project Xanadu, started in 1960, pursued similar ideas but wasn’t demonstrated until 1972. The Engelbart Demo was the first time that hypertext was publicly shown in a working form, together with a slew of other important working innovations that combined to deliver an unprecedented tour de force. What made it compelling wasn’t just the individual ideas, but the working demonstrations to show that the ideas worked and how they could combine and interact in wonderful ways.

Recommended viewing

You can watch the 100-minute talk here (Stanford University) in sections with commentary, and here (Google Video) all in one go.

16 Important Technologies: Who demonstrated each one first?

We enjoy such an abundance of computing riches that it’s easy to take wonderful technological ideas for granted. Yet so many of the pieces of our modern computing experience that we consider routine today were at one time unimaginable. After all, back in the early days of computing, we were still discovering what these newfangled room-filling gadgets might eventually become capable of — who could have known then what using computers would be like today?

Of course, we have these technologies today because some visionaries did know, did imagine them… and, best of all, built and demonstrated them.

Hence today’s challenge:

Quiz: For each of the following 16 technologies that have become commonplace in our modern computing experience, give the researcher/team and approximate year that a working prototype was first demonstrated. How many can you answer without a web search?

  • The personal computer for dedicated individual use, that one person can have at their disposal all day long. (Hint: Before the Altair in 1975 and Apple I in 1976.)
  • Mouse input with a graphical pointer. (Hint: Before the Xerox Alto at Xerox PARC in 1973.)
  • Internetworks across campuses and cities. (Hint: Before Ethernet at Xerox PARC (again) in 1973.)
  • Discovery of ‘who’s got what service’ in an internetwork.
  • Using internetworks for live collaboration, not just file sharing. (Hint: Before RDP and others.)
  • Hierarchical structure within a file system and within a document. (Hint: Before Unix.)
  • Cut/copy/paste, with drag-and-drop.
  • Paper metaphor for word processing, starting with a blank piece of paper and then applying formatting and navigating levels in the structure of the text.
  • Advanced pattern search and macro search within documents. (Hint: Before MIT’s Emacs.)
  • Keyword search and multiple weighted keyword search. (Hint: Long before Google (alternate link).)
  • Information retrieval through indirect construction of a catalog.
  • Flexible interactive formatting and line drawing.
  • Hyperlinks within a document and across documents, and “jumping on a link” to navigate. (Hint: Before Tim Berners-Lee invented the World Wide Web in 1989-1990.) (Hint’: Yes, before HyperCard too.)
  • Tagging graphics, and parts of graphics, as hyperlinks. (Hint: Before Flickr.)
  • Workgroup collaboration on a document, including collaborative annotations, allowing members of a group to use and modify a document. (Hint: Before Lotus Notes and Ward Cunningham’s Wikis.)
  • The next step up from that: Live collaboration on a document with screen sharing on the two writers’ computers, so each can see what the other is doing — with live audio/video teleconference in a window at the same time. (Hint: Not Skype or LiveMeeting.)

Research Firms Are Good At Research, Not Technology Predictions

This story has been picked up semi-widely since last night. I’m sure this Steven Prentice they quote is a fine (Gartner) Fellow, but really:

The computer mouse is set to die out in the next five years and will be usurped by touch screens and facial recognition, analysts believe.

Seriously, does anyone who uses computers daily really believe this kind of prediction just because someone at Gartner says so? Dude, sanity check: 1. What functions do you use your mouse for? 2. How many of those functions can be done by pointing at your screen or smiling at the camera: a) at all; and b) with equivalent high precision and low arm fatigue? Of course the mouse, including direct equivalents like the touchpad/trackpad, will be replaced someday. But to notice that people like to turn and shake their Wii controllers and iPhones and then make the leap to conclude that this will replace mice outright in the short term seems pretty thin even for Gartner.

When you read a report from Gartner, Forrester, IDC and their brethren research firms, remember that you’re either getting real-world data (aka research) or a single analyst’s personal predictions (aka crystal-ball gazing). Research firms are good at what they’re good at, namely research:

  • They’re “decent” at compiling current industry market data. Grade: A.
  • They’re “pretty okay” when they limit themselves to simple short-term extrapolation of that data, such as two-year projections of cost changes of high-speed networking in Canada or cell phone penetration in India. Grade: A-.

But when they try bigger technology movement predictions like “X will replace Y in Z years” they average somewhere around “spotty,” and on their off days they dip down into “I think you forgot to sanity check that sound bite” territory. It’s a pity that some venture capitalists take the research analysts’ word as gospel. Reliability of technology shift predictions: D+.