Welcome to the Jungle

With so much happening in the computing world, now seemed like the right time to write “Welcome to the Jungle” – a sequel to my earlier “The Free Lunch Is Over” essay. Here’s the introduction:


Welcome to the Jungle

In the twilight of Moore’s Law, the transitions to multicore processors, GPU computing, and HaaS cloud computing are not separate trends, but aspects of a single trend – mainstream computers from desktops to ‘smartphones’ are being permanently transformed into heterogeneous supercomputer clusters. Henceforth, a single compute-intensive application will need to harness different kinds of cores, in immense numbers, to get its job done.

The free lunch is over. Now welcome to the hardware jungle.


From 1975 to 2005, our industry accomplished a phenomenal mission: In 30 years, we put a personal computer on every desk, in every home, and in every pocket.

In 2005, however, mainstream computing hit a wall. In “The Free Lunch Is Over” (December 2004), I described the reasons for the then-upcoming industry transition from single-core to multi-core CPUs in mainstream machines, why it would require changes throughout the software stack from operating systems to languages to tools, and why it would permanently affect the way we as software developers have to write our code if we want our applications to continue exploiting Moore’s transistor dividend.

In 2005, our industry undertook a new mission: to put a personal parallel supercomputer on every desk, in every home, and in every pocket. 2011 was special: it’s the year that we completed the transition to parallel computing in all mainstream form factors, with the arrival of multicore tablets (e.g., iPad 2, Playbook, Kindle Fire, Nook Tablet) and smartphones (e.g., Galaxy S II, Droid X2, iPhone 4S). 2012 will see us continue to build out multicore with mainstream quad- and eight-core tablets (as Windows 8 brings a modern tablet experience to x86 as well as ARM), image_thumb99and the last single-core gaming console holdout will go multicore (as Nintendo’s Wii U replaces Wii).

This time it took us just six years to deliver mainstream parallel computing in all popular form factors. And we know the transition to multicore is permanent, because multicore delivers compute performance that single-core cannot and there will always be mainstream applications that run better on a multi-core machine. There’s no going back.

For the first time in the history of computing, mainstream hardware is no longer a single-processor von Neumann machine, and never will be again.

That was the first act.  . . .


I hope you enjoy it.

Daniel Moth’s C++ AMP session is now online

In my keynote on Wednesday, I highlighted just the top two important features in the C++ AMP programming model. That afternoon, my coding colleague and demo demigod Daniel Moth gave a 45-minute session covering the entire C++ AMP programming model that walked through all the features with more examples. Daniel’s talk is now also online at Channel 9. I hope you enjoy it.

Note: The PDF slides link is small but important — the screen isn’t easy to see in the video itself.

C++ AMP keynote is online

Yesterday I had the privilege of talking about some of the work we’ve been doing to support massive parallelism on GPUs in the next version of Visual C++. The video of my talk announcing C++ AMP is now available on Channel 9. (Update: Here’s an alternate link; it seems to be posted twice.)

The first 20 minutes has nothing to do with C++ in particular or any platform in particular, but tries to make the case that the right way to view the “trends” of multicore computing, GPU computing, and cloud computing (HaaS) is that they are not three trends at all, but merely facets of the same single trend — heterogeneous parallel computing.

If they are, then one programming model should be able to address them all. We think we’ve found one.

The main reasons we decided to build a new model is that we believe there needs to be a single model that has all of the following attributes:

  • C++, not C: It should leverage C++’s power for strong abstraction without sacrificing performance, not just be a dialect of C.
  • Mainstream: It should be programmable by millions of developers, not just by a priesthood. Litmus test: Is the Hello World parallel GPU program a page and half, or a couple of lines?
  • Minimal: It adds just one general-purpose language extension that addresses not only the immediate problem (dealing with cores that can’t support full C++) but many others. With the right general-purpose extension, the rest can be done as just a library.
  • Portable: It allows shipping a single EXE that can use any combination of GPU vendors’ hardware. The initial implementation uses DirectCompute and supports all devices that are DX11 capable; DirectCompute is just an implementation detail of the first release, and the model can (and I expect will) be implemented to directly talk to any interesting hardware.
  • General and future-proof: The initial release will focus on GPU computing, but it’s intended to enable people to write code for the GPU in a way that in the future we can recompile with few or no changes to spread across any and all accessible compute cores, including ones in the cloud.
  • Open: I mentioned that Microsoft intends to make the C++ AMP specification open, and encourages its implementation on other C++ compilers for any hardware or OS target. AMD announced that they will implement C++ AMP in their FSA reference compiler. NVidia also announced support.

We’re really excited about this, and I hope you find the information in the talk to be useful. A prerelease implementation in Visual C++ that runs on Windows will be available later this year. More to come…

Two More C&B Sessions: C++0x Memory Model (Scott) and Exceptional C++0x (me)

Scott Meyers, Andrei Alexandrescu and I are continuing to craft and announce the technical program for C++ and Beyond (C&B) 2011, and two more sessions are now posted. All talks are brand-new material created specifically for C&B 2011. Here are short blurbs; follow the links for longer descriptions.

  • Scott will give a great new talk on “The C++0x Memory Model and Why You Care” that will cover topics of interest to anybody who cares about concurrency and parallel programming under C++0x: everything from compiler optimizations and memory access reorderings, to “sequenced before” and “happens before” relations, to atomic types and memory consistency models, and how they all relate to both correctness and performance. This is stuff that in a perfect world nobody should ever have to know, but in our actual world every modern C++ developer who cares about correct high-performance code has to understand it thoroughly.
  • I’ll be giving a brand-new talk  “Exceptional C++0x (aka C++11)” that shows how the new features in C++0x change the way we solve problems, our C++ coding style, and even the way we think about our code. I’ll demonstrate that with code that works today on existing compilers, using selected familiar examples from my Exceptional C++ books. This is not rehashed material, as I’ll assume you’re already familiar with the pre-C++0x solutions (I’ll provide links to read as refreshers before the course), and then we’ll analyze and solve them entirely the 21st-century C++ way and see why C++0x feels like a whole new fresh language that leads to different approaches, new and changed guidelines, and even better solutions. As Bjarne put it: “Surprisingly, C++0x feels like a new language: The pieces just fit together better than they used to and I find a higher-level style of programming more natural than before and as efficient as ever.” This talk will show why — deeply, madly, and truly.

The other two talks already announced are the following, which I previously reported last week:

  • Andrei will be giving an in-depth talk on “BIG: C++ Strategies, Data Structures, and Algorithms Aimed at Scalability.” Briefly, it’s about writing high-performance C++ code for  highly distributed architectures, focusing on translating C++’s strong modeling capabilities directly to great scaling and/or great savings, and finding the right but non-intuitive C++ techniques and data structures to get there.
  • I’ll be giving a brand-new talk on “C++ and the GPU… and Beyond.” I’ll cover the state of the art for using C++ (not just C) for general-purpose computation on graphics processing units (GPGPU). The first half of the talk discusses the most important issues and techniques to consider when using GPUs for high-performance computation, especially where we have to change our traditional advice for doing the same computation on the CPU. The second half focuses on upcoming C++ language and library extensions that bring key abstractions for GPGPU — and in time considerably more — directly into C++.

I hope to see many of you there this August. Last year’s event sold out during the early-bird period, and although we’ve increased the attendance cap this year to make room for more, if you’re interested in coming you may want to register soon to reserve a place.

Keynote at the AMD Fusion Developer Summit

In a couple of months, I’ll be giving a keynote at the AMD Fusion Developer’s Summit, which will be held on June 13-16, 2011, in Bellevue, WA, USA.

Here’s my talk’s description as it appears on the conference website:

AFDS Keynote: “Heterogeneous Parallelism at Microsoft”
Herb Sutter, Microsoft Principal Architect, Native Languages

Parallelism is not just in full bloom, but increasingly in full variety. We know that getting full computational performance out of most machines—nearly all desktops and laptops, most game consoles, and the newest smartphones—already means harnessing local parallel hardware, mainly in the form of multicore CPU processing. This is the commoditization of the supercomputer.

More and more, however, getting that full performance can also mean using gradually ever-more-heterogeneous processing, from local GPGPU and Accelerated Processing Unit (APU) flavors to “often-on” remote parallel computing power in the form of elastic compute clouds. This is the generalization of the heterogeneous cluster in all its NUMA glory, and it’s appearing at all scales from on-die to on-machine to on-cloud.

In this talk, Microsoft’s chief native languages architect shares a vision of what this will mean for native software on Microsoft platforms from servers to devices, and showcases upcoming innovations that bring access to increasingly heterogeneous compute resources — from vector units and multicore, to GPGPU and APU, to elastic cloud — directly into the world’s most popular native languages.

If you’re interested in high performance code for GPUs, APUs, and other high-performance TLAs, I hope to see you there.

Note: This talk is related to, but different from, the GPU talk I’ll be presenting in August at C++ and Beyond 2011 (aka C&B). You can expect the above keynote to be, well, keynote-y… oriented toward software product features and of course AMD’s hardware, with plenty of forward-looking industry vision style material. My August C&B technical talk will be just that, an in-depth performance-oriented and sometimes-gritty technical session that will also mention product-related and hardware-specific stuff but is primarily about heterogeneous hardware, with a more pragmatically focused forward-looking eye.

Links I enjoyed reading this week

Concurrency-related (more or less directly)

Samples updated for ConcRT, PPL and Agents (Microsoft Parallel Programming blog)
Update to the samples for the Visual Studio 2010 Release Candidate. Hmm, I suppose I should include a link to that too:

Intel’s Core i7-980X Extreme processor (The Tech Report)
Desktop part with 12 hardware threads (6 cores x 2 threads/core), 32nm process, >1.1B transistors.

General information/amusement

Application compatibility layers are there for the customer, not for the program (Raymond Chen)
You wouldn’t believe the backward-compatibility hoops we all need to jump through hold in the right place for older apps to jump through, and then the app developer asks for more…

Jetpack to be commercially available soon? (Gizmag)
Yes, we see a story like this every few years. This one actually gets the flight time beyond just one minute. Now if we could only take the expected price down by one order of magnitude, and safety up by an order of magnitude or three…