Talk Video: Welcome to the Jungle (60 min version + Q&A)

While visiting Facebook earlier this month, I gave a shorter version of my “Welcome to the Jungle” talk, based on the eponymous WttJ article. They made a nice recording and it’s now available online here:

Facebook Engineering

Title: Herb Sutter: Welcome to the Jungle

In the twilight of Moore’s Law, the transitions to multicore processors, GPU computing, and HaaS cloud computing are not separate trends, but aspects of a single trend—mainstream computers from desktops to ‘smartphones’ are being permanently transformed into heterogeneous supercomputer clusters. Henceforth, a single compute-intensive application will need to harness different kinds of cores, in immense numbers, to get its job done. — The free lunch is over. Now welcome to the hardware jungle.

The slides are available here. (There doesn’t seem to be a link to the slides on the page itself as I write this.)

For those interested in a longer version, in April I gave a 105-minute + Q&A version of this talk in Kansas City at Perceptive, also available online where I posted before.

A word about “cluster in a box”

I should have remembered that describing a PC as a “heterogeneous cluster in a box” is a big red button for people, in particular because “cluster” implies “parts can fail and program should continue.” So in the Q&A, one commenter made the point that I should have mentioned reliability is an issue.

As I answered there, I half agree – it’s true but it’s only half the story, and it doesn’t affect the programming model (see more below). One of the slides I omitted to shorten this version of the talk highlighted that there are actually two issues when you go from “Disjoint (tightly coupled)” to “Disjoint (loosely coupled)”: reliability and latency, and both are important. (I also mentioned this in the original WttJ article this is based on; just search for “reliability.”)

Even after the talk, I still got strong resistance along the lines that, ‘no, you obviously don’t get it, latency isn’t a significant issue at all, reliability is the central issue and it kills your argument because it makes the model fundamentally different.’ Paraphrasing subsequent email:

‘A fundamental difference between distributed computing and single-box multiprocessing is that in the former case you don’t know whether a failure was a communication failure (i.e. the task was completed but communication failed) or a genuine failure to carry the task. (Hence all complicated two-phase commit protocols etc.) In contrast, in a single-box scenario you can know the box you’re on is working.’

Let me respond further to this here, because clearly these guys know more about distributed systems than I do and I’m always happy to be educated, but I also think we have a disconnect on three things asserted above: It is not my understanding that reliability is more important than latency, or that apps have to distinguish comms failures from app exceptions, or that N-phase commit enters the picture.

First, I don’t agree with the assertion that reliability alone is what’s important, or that it’s more important than latency, for the following reason:

You can build reliable transports on top of unreliable ones. You do it through techniques like sequencing, redundancy, and retry. A classic example is TCP, which delivers reliable communications over notoriously- and deliberately-unreliable IP which can drop and reorder packets as network nodes and communications paths keep madly appearing and reappearing like a herd of crazed Cheshire cats. We can and do build secure reliable global banking systems on that.
Once you do that, you have turned a reliability issue into a performance (specifically latency) issue. Both reliability and latency are key issues when moving to loosely-coupled systems, but because you can turn the first into the second, it’s latency that is actually the more fundamental and important one – and the only one the developer needs to deal with.

For example, to use compute clouds like Azure and AWS, you usually start with two basic pieces:

the queue(s), which you use to push the work items out/around and results back/around; and
an elastic set of compute nodes, each of which pulls work items from the queue and processes them.

What happens when you encounter a reliability problem? A node can pull a work item but fail to complete it, for example if the node crashes or the system encounters a partial network outage or other communication problem.

Many modern systems already automatically recover and have another node re-pull the same work item to make sure each work item gets done even in the face of partial failures. From the app’s point of view, such failures just manifest as degraded performance (higher latency or time-to-solution) and therefore mainly affect the granularity of parallel work items – they have to be big enough to be worth sending elsewhere and so minimum size is directly proportional to latency so that the overheads do not dominate. They do not manifest as app-visible failures.

Yes, the elastic cloud implementation has to deal with things like network failures and retries. But no, this isn’t your problem; it’s not supposed to be your job to implement the elastic cloud, it’s supposed to be your job just to implement each node’s local logic and to create whatever queues you want and push your work item data into them.

Aside: Of course, as with any retry-based model, you have to make sure that a partly-executed work item doesn’t expose any partial side effects it shouldn’t, and normally you prevent that by doing the work in a transaction and rolling it back on failure, or in the extreme (not generally recommended but sometimes okay) resorting to compensating writes to back out partial work.

That covers everything except the comment about two-phase commit: Citing that struck me as odd because I haven’t heard much us of that kind of coupled approach in years. Perhaps I’m misinformed, but my impression of 2- or N-phase commit protocols was that they have some serious problems:

They are inherently nonscalable.
They increase rather than decrease interdependencies in the system – even with heroic efforts like majority voting and such schemes that try to allow for subsets of nodes being unavailable, which always seemed fragile to me.
Also, I seem to remember that NPC is a blocking protocol, which if so is inherently anti-concurrency. One of the big realizations in modern mainstream concurrency in the past few years is that Blocking Is Nearly Always Evil. (I’m looking at you, future.get(), and this is why the committee is now considering adding the nonblocking future.then() as well.)

So my impression is that these were primarily of historical interest – if they are still current in modern datacenters, I would appreciate learning more about it and seeing if I’m overly jaded about N-phase commit.

Facebook Folly – OSS C++ Libraries

I’ve been beating the drum this year (see the last section of the talk) that the biggest problem facing C++ today is the lack of a large set of de jure and de facto standard libraries. My team at Microsoft just recently announced Casablanca, a cloud-oriented C++ library and that we intend to open source, and we’re making other even bigger efforts that I hope will bear fruit and I’ll be able to share soon. But it can’t be just one company – many companies have already shared great open libraries on the past, but still more of us have to step up.

In that vein, I was extremely happy to see another effort come to fruition today. A few hours ago I was at Facebook to speak at their C++ day, and I got to be in the room when Andrei Alexandrescu dropped his arm to officially launch Folly:

Folly: The Facebook Open Source Library

by Jordan DeLong on Saturday, June 2, 2012 at 2:59pm ·

Facebook is built on open source from top to bottom, and could not exist without it. As engineers here, we use, contribute to, and release a lot of open source software, including pieces of our core infrastructure such as HipHop and Thrift.

But in our C++ services code, one clear bottleneck to releasing more work has been that any open sourced project needed to break dependencies on unreleased internal library code. To help solve that problem, today we open sourced an initial release of Folly, a collection of reusable C++ library artifacts developed and used at Facebook. This announcement was made at our C++ conference at Facebook in Menlo Park, CA.

Our primary aim with this ‘foolishness’ is to create a solution that allows us to continue open sourcing parts of our stack without resorting to reinventing some of our internal wheels. And because Folly’s components typically perform significantly faster than counterparts available elsewhere, are easy to use, and complement existing libraries, we think C++ developers might find parts of this library interesting in their own right.

Andrei announced that Folly stood for the Facebook Open Llibrary. He claimed it was a Welsh name, which is even funnier when said by a charming Romanian man.

Microsoft, and I personally, would like to congratulate Facebook on this release! Getting more high-quality open C++ libraries is a Good Thing, and I think it is safe to say that Casablanca and Folly are just the beginning. There’s a lot more coming across our industry this year. Stay tuned.

Two Sessions: C++ Concurrency and Parallelism – 2012 State of the Art (and Standard)

It’s time for, not one, but two brand-new, up-to-date talks on the state of the art of concurrency and parallelism in C++. I’m going to put them together especially and only for C++ and Beyond 2012, and I’ll be giving them nowhere else this year:

C++ Concurrency – 2012 State of the Art (and Standard)
C++ Parallelism – 2012 State of the Art (and Standard)

And there’s a lot to tell. 2012 has already been a busy year for the pushing the boundaries of both “shipping-and-practical” and “proto-standard” concurrency and parallelism in C++:

In February, the spring ISO C++ standards meeting saw record attendance at 73 experts (normal is 50-55), and spent the full week primarily on new language and library proposals, with notable emphasis on the area of concurrency and parallelism. There was so much interest that I formed four Study Groups and appointed chairs: the largest on concurrency and parallelism (SG1, Hans Boehm), and three others on modules (SG2, Doug Gregor), filesystem (SG3, Beman Dawes), and networking (SG4, Kyle Kloepper).
Three weeks ago, we hosted another three-day face-to-face meeting for SG1 and SG4 – and at nearly 40 people the SG1 attendance rivaled that of a normal full ISO C++ meeting, with a who’s-who of the world’s concurrency and parallelism experts in attendance and further proposal presentations from companies like IBM, Intel, and Microsoft. There was so much interest that I had to form a new Study Group 5 for Transactional Memory (SG5), and appointed Michael Wong of IBM as chair.
Over the summer, we’ll all be working on updated proposals for the October ISO C++ meeting in Portland.

Things are heating up, and we’re narrowing down which areas to focus on.

I’ve spoken and written on these topics before. Here’s what’s different about these talks:

Brand new: This material goes beyond what I’ve written and taught about before in my Effective Concurrency articles and courses.
Cutting-edge current: It covers the best-practices state of the art techniques and shipping tools, and what parts of that are standardized in C++11 already (the answer to that one may surprise you!) and what’s en route to near-term standardization and why, with coverage of the latest discussions.
Mainstream hardware – many kinds of parallelism: What’s the relationship among multi-core CPUs, hardware threads, SIMD vector units (Intel SSE and AVX, ARM Neon), and GPGPU (general-purpose computation on GPUs, which I covered at C++ and Beyond 2011)? Which are most interesting, what technologies are available now, and what’s being considered for near-term standardization?
Blocking vs. non-blocking: What’s the difference between blocking and non-blocking styles, why on earth would you care, which kinds does C++11 support, and how are we looking at rounding it out in C++1y?
Task and data parallelism: What’s the difference between task parallelism and data parallelism, which kind of of hardware does each allow you to exploit, and why?
Work stealing: What’s the difference between thread pools and work stealing, what are the major flavors of work stealing, which of these (if any) does C++11 already support and is already shipping on some advanced commercial C++ compilers today (this answer will likely surprise you), and what needs to be done in the next round for a complete state-of-the-art parallelism story in C++1y?

The answers all matter to you – even the ones not yet in the C++ standard – because they are real, available in shipping products, and affect how you design your software today.

This will be a broad and deep dive. At C++ and Beyond 2011, the attendees (audience!) included some of the world’s leading experts on parallelism and compilers. At these sessions of C&B 2012, I expect anyone who wasn’t personally at the SG1 meeting this month, even world-class experts, will learn something new in these talks. I certainly did, and that’s why I’m motivated to turn the information into talks and share. This isn’t just cool stuff – it’s important and useful in production code today.

I hope to see many of you at C&B 2012. I’m excited about these topics, and about Scott’s and Andrei’s new material – you just can’t get this stuff anywhere else.

Asheville is going to be blast. I can’t wait.

Herb

P.S.: I haven’t seen this much attention and investment in C++ since last century – C++ conferences at record numbers, C++ compiler investments by the biggest companies in the industry (e.g., Clang), and much more that we’ve seen already…

… and a little bird tells me there’s a lot more major C++ news coming this year. Stay tuned, and fasten your seat belts. 2012 ain’t done yet, not by a long shot, and I’ll be able to say more about C++ as a whole (besides the specific topics mentioned above) for the first time at C&B in August. I hope to see you there.

FYI, C&B is already over 60% full, and early bird registration ends this Friday, June 1 – so register today.

C++ Libraries: Casablanca

At GoingNative in February, I emphasized the need for more modern and portable C++ libraries, including for things like RESTful web/cloud services, HTTP, JSON, and more. The goal is to find or develop modern C++ libraries that leverage C++11 features, and then submit the best for standardization.

Microsoft wants to do its part, and here’s a step in that direction.

Today I’m pleased to see Soma’s post about “C++, Cloud Services, and You” announcing the DevLabs release of Casablanca, a set of C++ libraries for Visual C++ users that start to bring the same modern conveniences already enjoyed by .NET and Node.js and Erlang users also to C++ developers on our local and cloud platforms, including modern C++ libraries for REST, HTTP, and JSON. From Soma’s announcement, adding my own emphasis and minor code edits:

Historically, we’ve lacked such simple tools for developers using C++. While there are multiple relevant native networking APIs (e.g. WinINet, WinHTTP, IXMLHttpRequest2, HTTP Server API), these are not optimized from a productivity perspective for consuming and implementing RESTful cloud services using modern C++. They don’t compose particularly well with code based on the standard C++ libraries, and they don’t take advantage of modern C++ language features and practices in their programming models.

This is where “Casablanca” comes in. “Casablanca” is a set of libraries for C++ developers, taking advantage of some recent standard language features already available through Visual Studio.

“Casablanca” aims to make it significantly easier for C++ coders to consume and implement RESTful services. It builds on lessons from .NET, from Node.js, from Erlang, and from other influencers to create a modern model that is meant to be easy to program while still being scalable, composable, and flexible.

As an example, here’s a snippet that uses the client HTTP library to search Bing for my name and output the results to the console:
    http_client bing( L"http://www.bing.com/search" );

    bing.request( methods::GET, L"?q=S.Somasegar" )
        .then( []( http_response response ) {
            cout << "HTML SOURCE:" << endl << response.to_string() << endl; })
        .wait();
and here’s a simple web listener hosted in a console application:
    listener::create( argv[1], []( http_request req ) {
            req.reply( status_codes::OK, "Namaste!" ); })
        .listen( []{ fgetc( stdin ); } )
        .wait();
For those of you looking to build Azure services in C++, “Casablanca” comes with a Visual Studio wizard to set up everything up correctly. You can target both Web and Worker roles, and you can access Azure storage using the built-in C++ library bindings. […] Taking C++ to the cloud with “Casablanca” is another exciting step in that journey.

Today’s Casablanca release is as a DevLabs project, to get usability feedback and to eventually support these features in the full Visual C++ product. If you’re interested in using C++ to consume and implement cloud services, and sharing what kind of support you want and whether you think Casablanca is on the right track, please let us know in the forums.

Looking beyond Visual C++, one piece of Casablanca is already being proposed for standardization, namely the “future.then” nonblocking continuations that are required to be able to write highly responsive composable libraries – you really want all async libraries to talk about their work using the same type, and std::future already gives half of what we need (blocking synchronization) and just needs the non-blocking part too. Also being proposed as an optional layer on top of “future.then” is an “await” style of language support to make the async operations as easy to express and use in C++ as in any of the best languages in the world.

Note that there are other C++ libraries too for several of these facilities. Repeating what I said at GoingNative, we (Microsoft) don’t care whose libraries get standardized – whether ones we contribute or someone else’s. We care most that there be standard C++ libraries for these modern uses, starting with the most basic “future.then” support, and to encourage all companies and groups who have libraries in these important spaces to contribute them and take the best and make them available to C++ developers on all platforms. This is a small step in that process.

Talk Video: Welcome to the Jungle

Last month in Kansas City I gave a talk on “Welcome to the Jungle,” based on my recent essay of the same name (sequel to “The Free Lunch Is Over”) concerning the turn to mainstream heterogeneous distributed computing and the end of Moore’s Law.

Perceptive Software has now made the talk available online [EOA: the talk itself starts six minutes in]:

Welcome to the Jungle

In the twilight of Moore’s Law, the transitions to multicore processors, GPU computing, and HaaS cloud computing are not separate trends, but aspects of a single trend – mainstream computers from desktops to ‘smartphones’ are being permanently transformed into heterogeneous supercomputer clusters. Henceforth, a single compute-intensive application will need to harness different kinds of cores, in immense numbers, to get its job done. – The free lunch is over. Now welcome to the hardware jungle.

I hope you enjoy it.

Warning: It’s two hours (with Q&A) because of the broad and deep material. There’s a nice pause point between major sections at the one-hour mark that makes it convenient to split it into two one-hour lunchtime brownbag viewings.

We want await! A C# talk that’s applicable to C++

A nice talk by Mads Torgersen just went live on Channel 9 about C#’s non-blocking Task<T>.ContinueWith() library feature and await language feature, which are a big hit in C# (and Visual Basic) for writing highly concurrent code that looks pretty much just like sequential code. Mads is one of the designers of await.

If you’re a C++ programmer, you may be interested in this because I’ve worked to have these very features be offered as proposals for ISO C++, just with a few naming tweaks like renaming Task<T>.ContinueWith() to std::future<T>::then(). They were initially presented at the recent Kona meeting in February, and we’ll dig deeper next month at the special ISO C++ study group meeting on concurrency and parallelism.

Here’s the talk link and abstract:

Language Support for Asynchronous Programming

Mads Torgersen

Asynchronous programming is what the doctor usually orders for unresponsive client apps and for services with thread-scaling issues. This usually means a bleak departure from the imperative programming constructs we know and love into a spaghetti hell of callbacks and signups. C# and VB are putting an end to that, reinstating all your tried-and-true control structures on top of a future-based model of asynchrony.

While we were chatting after the talk, I managed to gently twist Mads’ arm and he has graciously agreed to come to the May 7-9 ISO C++ parallelism study group special meeting to present this to the committee members in detail and answer questions about await’s design and C# users’ experience with it in production code, which will help the committee decide whether or not this is something they want to pursue for ISO C++.

I hope you enjoy the talk. While at Lang.NEXT, I also participated in a panel and gave a C++ talk but those sessions aren’t live yet; I’ll post links once they are.

Trivia: If you noticed the Romanian accent in the first question from the audience, it’s because it came from Andrei Alexandrescu, who was sitting beside Walter Bright, both of whom were two of the other speakers at the conference. It was fun to be in a room full of language designers and implementers sharing notes about each other’s languages and experience.

“Welcome to the Jungle” in Kansas City – March 20, 2012

Thanks to Perceptive Software who are bringing me to Kansas City in two weeks to give a free talk on “Welcome to the Jungle.”

The talk will be based on my recent essay of the same name (sequel to ”The Free Lunch Is Over”) concerning the turn to mainstream heterogeneous distributed computing and the end of Moore’s Law, with ample time for Q&A and discussion.

Here are the coordinates:

Computing Trends with Herb Sutter: Welcome to the Jungle

When: Tuesday, March 20 at 1:00 – 3:00 P.M.

Where: Boulevard Brewery, 2501 Southwest Blvd, Kansas City, MO, USA 64108

Abstract: In the twilight of Moore’s Law, the transitions to multicore processors, GPU computing, and HaaS cloud computing are not separate trends, but aspects of a single trend – mainstream computers from desktops to ‘smartphones’ are being permanently transformed into heterogeneous supercomputer clusters. Henceforth, a single compute-intensive application will need to harness different kinds of cores, in immense numbers, to get its job done. – The free lunch is over. Now welcome to the hardware jungle.

This is a free lecture; all are invited, but you should register to make sure you’ll have a seat. Note that this talk is live only and is not being recorded or webcast.

I look forward to meeting many of you there in person.

Welcome to the Jungle

With so much happening in the computing world, now seemed like the right time to write “Welcome to the Jungle” – a sequel to my earlier “The Free Lunch Is Over” essay. Here’s the introduction:

Welcome to the Jungle

In the twilight of Moore’s Law, the transitions to multicore processors, GPU computing, and HaaS cloud computing are not separate trends, but aspects of a single trend – mainstream computers from desktops to ‘smartphones’ are being permanently transformed into heterogeneous supercomputer clusters. Henceforth, a single compute-intensive application will need to harness different kinds of cores, in immense numbers, to get its job done.

The free lunch is over. Now welcome to the hardware jungle.

From 1975 to 2005, our industry accomplished a phenomenal mission: In 30 years, we put a personal computer on every desk, in every home, and in every pocket.

In 2005, however, mainstream computing hit a wall. In “The Free Lunch Is Over” (December 2004), I described the reasons for the then-upcoming industry transition from single-core to multi-core CPUs in mainstream machines, why it would require changes throughout the software stack from operating systems to languages to tools, and why it would permanently affect the way we as software developers have to write our code if we want our applications to continue exploiting Moore’s transistor dividend.

In 2005, our industry undertook a new mission: to put a personal parallel supercomputer on every desk, in every home, and in every pocket. 2011 was special: it’s the year that we completed the transition to parallel computing in all mainstream form factors, with the arrival of multicore tablets (e.g., iPad 2, Playbook, Kindle Fire, Nook Tablet) and smartphones (e.g., Galaxy S II, Droid X2, iPhone 4S). 2012 will see us continue to build out multicore with mainstream quad- and eight-core tablets (as Windows 8 brings a modern tablet experience to x86 as well as ARM), and the last single-core gaming console holdout will go multicore (as Nintendo’s Wii U replaces Wii).

This time it took us just six years to deliver mainstream parallel computing in all popular form factors. And we know the transition to multicore is permanent, because multicore delivers compute performance that single-core cannot and there will always be mainstream applications that run better on a multi-core machine. There’s no going back.

For the first time in the history of computing, mainstream hardware is no longer a single-processor von Neumann machine, and never will be again.

That was the first act. . . .

I hope you enjoy it.

Daniel Moth’s C++ AMP session is now online

In my keynote on Wednesday, I highlighted just the top two important features in the C++ AMP programming model. That afternoon, my coding colleague and demo demigod Daniel Moth gave a 45-minute session covering the entire C++ AMP programming model that walked through all the features with more examples. Daniel’s talk is now also online at Channel 9. I hope you enjoy it.

Note: The PDF slides link is small but important — the screen isn’t easy to see in the video itself.

C++ AMP keynote is online

Yesterday I had the privilege of talking about some of the work we’ve been doing to support massive parallelism on GPUs in the next version of Visual C++. The video of my talk announcing C++ AMP is now available on Channel 9. (Update: Here’s an alternate link; it seems to be posted twice.)

The first 20 minutes has nothing to do with C++ in particular or any platform in particular, but tries to make the case that the right way to view the “trends” of multicore computing, GPU computing, and cloud computing (HaaS) is that they are not three trends at all, but merely facets of the same single trend — heterogeneous parallel computing.

If they are, then one programming model should be able to address them all. We think we’ve found one.

The main reasons we decided to build a new model is that we believe there needs to be a single model that has all of the following attributes:

C++, not C: It should leverage C++’s power for strong abstraction without sacrificing performance, not just be a dialect of C.
Mainstream: It should be programmable by millions of developers, not just by a priesthood. Litmus test: Is the Hello World parallel GPU program a page and half, or a couple of lines?
Minimal: It adds just one general-purpose language extension that addresses not only the immediate problem (dealing with cores that can’t support full C++) but many others. With the right general-purpose extension, the rest can be done as just a library.
Portable: It allows shipping a single EXE that can use any combination of GPU vendors’ hardware. The initial implementation uses DirectCompute and supports all devices that are DX11 capable; DirectCompute is just an implementation detail of the first release, and the model can (and I expect will) be implemented to directly talk to any interesting hardware.
General and future-proof: The initial release will focus on GPU computing, but it’s intended to enable people to write code for the GPU in a way that in the future we can recompile with few or no changes to spread across any and all accessible compute cores, including ones in the cloud.
Open: I mentioned that Microsoft intends to make the C++ AMP specification open, and encourages its implementation on other C++ compilers for any hardware or OS target. AMD announced that they will implement C++ AMP in their FSA reference compiler. NVidia also announced support.

We’re really excited about this, and I hope you find the information in the talk to be useful. A prerelease implementation in Visual C++ that runs on Windows will be available later this year. More to come…