Andrei Alexandrescu, Scott Meyers, Herb Sutter
Channel 9 was invited to this year’s C++ and Beyond to film some sessions (that will appear on C9 over the coming months!)…
At the end of day 2, Andrei, Herb and Scott graciously agreed to spend some time discussing various modern C++ topics and, even better, answering questions from the community. In fact, the questions from Niners (and a conversation on reddit/r/cpp) drove the conversation.
Here’s what happened…
In Welcome to the Jungle, I predicted that “weak” hardware memory models will disappear. This is true, and it’s happening before our eyes:
- x86 has always been considered a “strong” hardware memory model that supports sequentially consistent atomics efficiently.
- The other major architecture, ARM, recently announced that they are now adding strong memory ordering in ARMv8 with the new sequentially consistent ldar and stlr instructions, as I predicted they would. (Actually, Hans Boehm and I influenced ARM in this direction, so it was an ever-so-slightly disingenuous prediction…)
However, at least two people have been confused by what I meant by “weak” hardware memory models, so let me clarify what “weak” means – it means something different for hardware memory models and software memory models, so perhaps those aren’t the clearest terms to use.
By “weak (hardware) memory model” CPUs I mean specifically ones that do not natively support efficient sequentially consistent (SC) atomics, because on the software side programming languages have converged on “sequential consistency for data-race-free programs” (SC-DRF, roughly aka DRF0 or RCsc) as the default (C11, C++11) or only (Java 5+) supported memory model. POWER and ARMv7 notoriously do not support SC atomics efficiently.
Hardware that supports only memory models weaker than SC-DRF, meaning that it cannot support SC-DRF efficiently, is permanently disadvantaged and will either become stronger or atrophy. As I mentioned specifically in the article, the two main current hardware architectures with what I called “weak” memory models were current ARM (ARMv7) and POWER:
- ARM recently announced ARMv8 which, as I predicted, is upgrading to SC acquire/release by adding new SC acquire/release instructions ldar and stlr that are mandatory in both 32-bit and 64-bit mode. In fact, this is something of an industry first: ARMv8 is the first major CPU architecture to support SC acquire/release instructions directly like this. (Note: That’s for CPUs, but the roadmap for ARM GPUs is similar. ARM GPUs currently have a stronger memory model, namely fully SC; ARM has announced their GPU future roadmap has the GPUs fully coherent with the CPUs, and will likely add “SC load acquire” and “SC store release” to GPUs as well.)
- It remains to be seen whether POWER will adapt similarly, or die out.
Note that I’ve seen some people call x86 “weak”, but x86 has always been the poster child for a strong (hardware) memory model in all of our software memory model discussions for Java, C, and C++ during the 2000s. Therefore perhaps “weak” and “strong” are not useful terms if they mean different things to some people, and I’ve updated the WttJ text to make this clearer.
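To make “SC atomics” concrete, here is a minimal sketch of the classic store-buffering litmus test, written with C++11 std::atomic and its default memory_order_seq_cst. The function name count_forbidden_outcomes is made up for this illustration. Under the C++11 memory model, the outcome where both threads read 0 is not allowed for SC atomics; how cheaply a given CPU can enforce that is what the “strong” versus “weak” hardware distinction above is about.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// Store-buffering litmus test (illustrative helper, not from the article).
// Each thread stores to one variable and then loads the other. With the
// default memory_order_seq_cst, the outcome r1 == 0 && r2 == 0 is forbidden.
// Returns how many of `iterations` runs observed that forbidden outcome.
int count_forbidden_outcomes(int iterations) {
    int forbidden = 0;
    for (int i = 0; i < iterations; ++i) {
        std::atomic<int> x{0}, y{0};
        int r1 = -1, r2 = -1;

        std::thread t1([&] { x.store(1); r1 = y.load(); });  // seq_cst by default
        std::thread t2([&] { y.store(1); r2 = x.load(); });  // seq_cst by default
        t1.join();
        t2.join();  // join() makes r1 and r2 safe to read here

        if (r1 == 0 && r2 == 0) ++forbidden;
    }
    return forbidden;
}
```

Calling count_forbidden_outcomes(200) should return 0 on any conforming implementation. If the stores and loads were weakened to memory_order_relaxed, the both-zero outcome would become permitted, and it can actually be observed on real hardware.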
I will be discussing this in detail in my atomic<> Weapons talk at C&B next week, which I hope to make freely available online in the near future (as I do most of my talks). I’ll post a link on this blog when I can make it available online.
At the end of the Monday afternoon session, I will be making a special announcement related to Standard C++ on all platforms. Be there to hear the details, and to receive an extra perk that’s being reserved for C&B 2012 attendees only.
- Note: We sometimes record sessions and make them freely available online via Channel 9, and we intend to do that again this year for some selected sessions. However, this session is for C&B attendees only and will not be recorded.
Registration is open until Wednesday and the event is pretty full but a few spaces are still available. I’m looking forward to seeing many of you there for a top-notch C++ conference full of fresh new current material – I’ve seen Andrei’s and Scott’s talk slides too, and I think this C&B is going to be the best one yet.
You’ll leave exhausted, but with a full brain and quite likely a big silly grin as you think about all the ways to use the material right away on your current project back home.
Here’s another deep session for C&B 2012 on August 5-8 – if you haven’t registered yet, register soon. We got a bigger venue this time, but as I write this the event is currently almost 75% full with five weeks to go.
I know, I’ve already posted three sessions and a panel. But there’s just so much about C++11 to cover, so here’s a fourth brand-new session I’ll do at C&B 2012 that goes deeper on its topic than I’ve ever been willing to go before.
This session in one word: Deep.
It’s a session that includes topics I’ve publicly said for years are Stuff You Shouldn’t Need To Know and I Just Won’t Teach, but it’s becoming achingly clear that people do need to know about it. Achingly, heartbreakingly clear, because some hardware incents you to pull out the big guns to achieve top performance, and C++ programmers just are so addicted to full performance that they’ll reach for the big red levers with the flashing warning lights. Since we can’t keep people from pulling the big red levers, we’d better document the A to Z of what the levers actually do, so that people don’t SCRAM unless they really, really, really meant to.
This session covers:
- The facts: The C++11 memory model and what it requires you to do to make sure your code is correct and stays correct. We’ll include clear answers to several FAQs: “how do the compiler and hardware cooperate to remember how to respect these rules?”, “what is a race condition?”, and the ageless one-hand-clapping question “how is a race condition like a debugger?”
- The tools: The deep interrelationships and fundamental tradeoffs among mutexes, atomics, and fences/barriers. I’ll try to convince you why standalone memory barriers are bad, and why barriers should always be associated with a specific load or store.
- The unspeakables: I’ll grudgingly and reluctantly talk about the Thing I Said I’d Never Teach That Programmers Should Never Need To Know: relaxed atomics. Don’t use them! If you can avoid it. But here’s what you need to know, even though it would be nice if you didn’t need to know it.
- The rapidly-changing hardware reality: How locks and atomics map to hardware instructions on ARM and x86/x64, and throw in POWER and Itanium for good measure – and I’ll cover how and why the answers are actually different last year and this year, and how they will likely be different again a few years from now. We’ll cover how the latest CPU and GPU hardware memory models are rapidly evolving, and how this directly affects C++ programmers.
- Coda: Volatile and “compiler-only” memory barriers. It’s important to understand exactly what atomic and volatile are and aren’t for. I’ll show both why they’re both utterly unrelated (they have exactly zero overlapping uses, really) and yet are fundamentally related when viewed from the perspective of talking about the memory model. Also, people keep seeing and asking about “compiler-only” memory barriers and when to use them – they do have a valid-though-rare use, but it’s not the use that most people are trying to use them for, so beware!
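As a rough illustration of two items in the list above, the sketch below shows (a) ordering attached to a specific store and load, using release/acquire on a flag rather than a standalone fence, and (b) one of the few uses where relaxed atomics are commonly considered acceptable, a plain event counter that is only read after all threads have joined. All names (payload, ready, publish, consume, run_message_passing, count_events) are invented for the example and are not from the session material.

```cpp
#include <atomic>
#include <cassert>
#include <thread>
#include <vector>

int payload = 0;                 // ordinary (non-atomic) data
std::atomic<bool> ready{false};  // flag that publishes the payload

// Producer: the release ordering is attached to this specific store.
void publish(int value) {
    payload = value;
    ready.store(true, std::memory_order_release);
}

// Consumer: the acquire ordering is attached to this specific load.
int consume() {
    while (!ready.load(std::memory_order_acquire)) {
        std::this_thread::yield();
    }
    return payload;  // the acquire load that saw `true` makes this read safe
}

// Runs one producer and one consumer; returns what the consumer observed.
int run_message_passing() {
    payload = 0;
    ready.store(false);
    int seen = -1;
    std::thread consumer([&] { seen = consume(); });
    std::thread producer([] { publish(42); });
    producer.join();
    consumer.join();
    return seen;
}

// Relaxed atomics: increments stay atomic, but they impose no ordering on
// other memory operations. The total is only read after join().
long count_events(int threads, int per_thread) {
    std::atomic<long> hits{0};
    std::vector<std::thread> pool;
    for (int t = 0; t < threads; ++t) {
        pool.emplace_back([&hits, per_thread] {
            for (int i = 0; i < per_thread; ++i)
                hits.fetch_add(1, std::memory_order_relaxed);
        });
    }
    for (auto& th : pool) th.join();
    return hits.load(std::memory_order_relaxed);
}
```

Here run_message_passing() returns 42 and count_events(4, 10000) returns 40000. Using relaxed ordering on the ready flag instead would make the read of payload a data race, which is the kind of mistake the “don’t use them if you can avoid it” advice is aimed at.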
For me, this is going to be the deepest and most fun C&B yet. At previous C&Bs I’ve spoken about not only code, but also meta topics like design and C++’s role in the marketplace. This time it looks like all my talks will be back to Just Code. Fun times!
Here’s a snapshot of the list of C&B 2012 sessions so far:
Universal References in C++11 (Scott)
You Don’t Know [keyword] and [keyword] (Herb)
Convincing Your Colleagues (Panel)
Initial Thoughts on Effective C++11 (Scott)
Modern C++ = Clean, Safe, and Faster Than Ever (Panel)
Error Resilience in C++11 (Andrei)
C++ Concurrency – 2012 State of the Art (and Standard) (Herb)
C++ Parallelism – 2012 State of the Art (and Standard) (Herb)
Secrets of the C++11 Threading API (Scott)
atomic<> Weapons: The C++11 Memory Model and Modern Hardware (Herb)
It’ll be a blast. I hope to see many of you there. Register soon.
While visiting Facebook earlier this month, I gave a shorter version of my “Welcome to the Jungle” talk, based on the eponymous WttJ article. They made a nice recording and it’s now available online here:
In the twilight of Moore’s Law, the transitions to multicore processors, GPU computing, and HaaS cloud computing are not separate trends, but aspects of a single trend—mainstream computers from desktops to ‘smartphones’ are being permanently transformed into heterogeneous supercomputer clusters. Henceforth, a single compute-intensive application will need to harness different kinds of cores, in immense numbers, to get its job done. — The free lunch is over. Now welcome to the hardware jungle.
The slides are available here. (There doesn’t seem to be a link to the slides on the page itself as I write this.)
For those interested in a longer version, in April I gave a 105-minute + Q&A version of this talk in Kansas City at Perceptive, also available online where I posted before.
A word about “cluster in a box”
I should have remembered that describing a PC as a “heterogeneous cluster in a box” is a big red button for people, in particular because “cluster” implies “parts can fail and program should continue.” So in the Q&A, one commenter made the point that I should have mentioned reliability is an issue.
As I answered there, I half agree – it’s true but it’s only half the story, and it doesn’t affect the programming model (see more below). One of the slides I omitted to shorten this version of the talk highlighted that there are actually two issues when you go from “Disjoint (tightly coupled)” to “Disjoint (loosely coupled)”: reliability and latency, and both are important. (I also mentioned this in the original WttJ article this is based on; just search for “reliability.”)
Even after the talk, I still got strong resistance along the lines that, ‘no, you obviously don’t get it, latency isn’t a significant issue at all, reliability is the central issue and it kills your argument because it makes the model fundamentally different.’ Paraphrasing subsequent email:
‘A fundamental difference between distributed computing and single-box multiprocessing is that in the former case you don’t know whether a failure was a communication failure (i.e. the task was completed but communication failed) or a genuine failure to carry out the task. (Hence all complicated two-phase commit protocols etc.) In contrast, in a single-box scenario you can know the box you’re on is working.’
Let me respond further to this here, because clearly these guys know more about distributed systems than I do and I’m always happy to be educated, but I also think we have a disconnect on three things asserted above: It is not my understanding that reliability is more important than latency, or that apps have to distinguish comms failures from app exceptions, or that N-phase commit enters the picture.
First, I don’t agree with the assertion that reliability alone is what’s important, or that it’s more important than latency, for the following reason:
- You can build reliable transports on top of unreliable ones. You do it through techniques like sequencing, redundancy, and retry. A classic example is TCP, which delivers reliable communications over notoriously- and deliberately-unreliable IP which can drop and reorder packets as network nodes and communications paths keep madly appearing and reappearing like a herd of crazed Cheshire cats. We can and do build secure reliable global banking systems on that.
- Once you do that, you have turned a reliability issue into a performance (specifically latency) issue. Both reliability and latency are key issues when moving to loosely-coupled systems, but because you can turn the first into the second, it’s latency that is actually the more fundamental and important one – and the only one the developer needs to deal with.
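The two points above can be sketched as a toy stop-and-wait protocol. This is a single-process simulation under stated assumptions, not how TCP is implemented: LossyLink, Receiver, TransferStats, and send_reliably are hypothetical names, packet loss is simulated with a seeded random number generator, and the loss rate is assumed to be below 1.0. Sequence numbers let the receiver discard duplicates, and retry covers dropped packets and dropped acknowledgements. Every message arrives exactly once and in order; the unreliability shows up only as extra transmissions, that is, as latency.

```cpp
#include <cassert>
#include <cstdint>
#include <random>
#include <vector>

// Hypothetical lossy link: each packet is dropped with probability `loss`.
struct LossyLink {
    std::mt19937 rng;
    double loss;
    LossyLink(unsigned seed, double loss_rate) : rng(seed), loss(loss_rate) {}
    bool deliver() {
        return std::uniform_real_distribution<double>(0.0, 1.0)(rng) >= loss;
    }
};

// Receiver side: accepts only the next expected sequence number.
struct Receiver {
    std::uint32_t expected = 0;
    std::vector<int> delivered;
    // Returns true (an ack) if `seq` has been received, now or previously.
    bool on_packet(std::uint32_t seq, int data) {
        if (seq == expected) {       // new, in-order packet
            delivered.push_back(data);
            ++expected;
        }
        return seq < expected;       // duplicates are acked but not re-delivered
    }
};

struct TransferStats {
    std::vector<int> received;
    int transmissions;
};

// Sender side: retransmit each message until its ack gets through.
TransferStats send_reliably(const std::vector<int>& messages, LossyLink& link) {
    Receiver rx;
    int transmissions = 0;
    for (std::uint32_t seq = 0; seq < messages.size(); ++seq) {
        bool acked = false;
        while (!acked) {
            ++transmissions;
            if (!link.deliver()) continue;          // data packet lost: retry
            bool ack = rx.on_packet(seq, messages[seq]);
            if (!link.deliver()) continue;          // ack lost: retry
            acked = ack;
        }
    }
    return {rx.delivered, transmissions};
}
```

With LossyLink link(1234, 0.3) and five messages, send_reliably returns the same five values in order, and transmissions is at least 5. The application sees a correct result that simply took longer to arrive.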
For example, to use compute clouds like Azure and AWS, you usually start with two basic pieces:
- the queue(s), which you use to push the work items out/around and results back/around; and
- an elastic set of compute nodes, each of which pulls work items from the queue and processes them.
What happens when you encounter a reliability problem? A node can pull a work item but fail to complete it, for example if the node crashes or the system encounters a partial network outage or other communication problem.
Many modern systems already automatically recover and have another node re-pull the same work item to make sure each work item gets done even in the face of partial failures. From the app’s point of view, such failures just manifest as degraded performance (higher latency or time-to-solution) and therefore mainly affect the granularity of parallel work items: each item has to be big enough to be worth sending elsewhere, so the minimum size is directly proportional to latency to keep the overheads from dominating. They do not manifest as app-visible failures.
Yes, the elastic cloud implementation has to deal with things like network failures and retries. But no, this isn’t your problem; it’s not supposed to be your job to implement the elastic cloud, it’s supposed to be your job just to implement each node’s local logic and to create whatever queues you want and push your work item data into them.
Aside: Of course, as with any retry-based model, you have to make sure that a partly-executed work item doesn’t expose any partial side effects it shouldn’t, and normally you prevent that by doing the work in a transaction and rolling it back on failure, or in the extreme (not generally recommended but sometimes okay) resorting to compensating writes to back out partial work.
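Here is a minimal single-process sketch of the queue-plus-nodes pattern described above, including the “no partial side effects” rule from the aside. It is not the Azure or AWS API: WorkItem, ElasticPool, and the fail_every failure injection are all hypothetical, and a real system would use leases or visibility timeouts rather than a try/catch. Each attempt computes into a local staged value and publishes it only on success. A failed attempt publishes nothing and the item is simply pulled again later.

```cpp
#include <cassert>
#include <deque>
#include <map>
#include <stdexcept>

struct WorkItem {
    int id;
    int input;
};

// Toy stand-in for "queue + elastic set of nodes" (hypothetical names).
class ElasticPool {
public:
    // Every `fail_every`-th pull simulates a node crash. Use 0 for no
    // failures. A value of 1 would fail every pull and never terminate.
    explicit ElasticPool(int fail_every) : fail_every_(fail_every) {}

    void push(const WorkItem& item) { queue_.push_back(item); }

    // Pulls items until the queue is empty; returns the number of attempts.
    int drain() {
        int attempts = 0;
        while (!queue_.empty()) {
            WorkItem item = queue_.front();
            queue_.pop_front();
            ++attempts;
            try {
                int staged = process(item);   // work on a private value
                results_[item.id] = staged;   // "commit": publish on success only
            } catch (const std::runtime_error&) {
                queue_.push_back(item);       // "rollback": nothing published, re-pull later
            }
        }
        return attempts;
    }

    const std::map<int, int>& results() const { return results_; }

private:
    int process(const WorkItem& item) {
        ++pulls_;
        if (fail_every_ > 0 && pulls_ % fail_every_ == 0)
            throw std::runtime_error("simulated node crash");
        return item.input * item.input;
    }

    int fail_every_;
    int pulls_ = 0;
    std::deque<WorkItem> queue_;
    std::map<int, int> results_;
};
```

With ElasticPool pool(3) and four items pushed, drain() takes five attempts instead of four: the third pull fails, the item goes back on the queue, and it succeeds on a later pull. The caller sees four correct results and one extra attempt’s worth of latency, not a failure.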
That covers everything except the comment about two-phase commit: Citing that struck me as odd because I haven’t heard much use of that kind of coupled approach in years. Perhaps I’m misinformed, but my impression of 2- or N-phase commit protocols was that they have some serious problems:
- They are inherently nonscalable.
- They increase rather than decrease interdependencies in the system – even with heroic efforts like majority voting and such schemes that try to allow for subsets of nodes being unavailable, which always seemed fragile to me.
- Also, I seem to remember that NPC is a blocking protocol, which if so is inherently anti-concurrency. One of the big realizations in modern mainstream concurrency in the past few years is that Blocking Is Nearly Always Evil. (I’m looking at you, future.get(), and this is why the committee is now considering adding the nonblocking future.then() as well.)
So my impression is that these were primarily of historical interest – if they are still current in modern datacenters, I would appreciate learning more about it and seeing if I’m overly jaded about N-phase commit.
It’s time for, not one, but two brand-new, up-to-date talks on the state of the art of concurrency and parallelism in C++. I’m going to put them together especially and only for C++ and Beyond 2012, and I’ll be giving them nowhere else this year:
- C++ Concurrency – 2012 State of the Art (and Standard)
- C++ Parallelism – 2012 State of the Art (and Standard)
And there’s a lot to tell. 2012 has already been a busy year for pushing the boundaries of both “shipping-and-practical” and “proto-standard” concurrency and parallelism in C++:
- In February, the spring ISO C++ standards meeting saw record attendance at 73 experts (normal is 50-55), and spent the full week primarily on new language and library proposals, with notable emphasis on the area of concurrency and parallelism. There was so much interest that I formed four Study Groups and appointed chairs: the largest on concurrency and parallelism (SG1, Hans Boehm), and three others on modules (SG2, Doug Gregor), filesystem (SG3, Beman Dawes), and networking (SG4, Kyle Kloepper).
- Three weeks ago, we hosted another three-day face-to-face meeting for SG1 and SG4 – and at nearly 40 people the SG1 attendance rivaled that of a normal full ISO C++ meeting, with a who’s-who of the world’s concurrency and parallelism experts in attendance and further proposal presentations from companies like IBM, Intel, and Microsoft. There was so much interest that I had to form a new Study Group 5 for Transactional Memory (SG5), and appointed Michael Wong of IBM as chair.
- Over the summer, we’ll all be working on updated proposals for the October ISO C++ meeting in Portland.
Things are heating up, and we’re narrowing down which areas to focus on.
I’ve spoken and written on these topics before. Here’s what’s different about these talks:
- Brand new: This material goes beyond what I’ve written and taught about before in my Effective Concurrency articles and courses.
- Cutting-edge current: It covers the best-practices state of the art techniques and shipping tools, and what parts of that are standardized in C++11 already (the answer to that one may surprise you!) and what’s en route to near-term standardization and why, with coverage of the latest discussions.
- Mainstream hardware – many kinds of parallelism: What’s the relationship among multi-core CPUs, hardware threads, SIMD vector units (Intel SSE and AVX, ARM Neon), and GPGPU (general-purpose computation on GPUs, which I covered at C++ and Beyond 2011)? Which are most interesting, what technologies are available now, and what’s being considered for near-term standardization?
- Blocking vs. non-blocking: What’s the difference between blocking and non-blocking styles, why on earth would you care, which kinds does C++11 support, and how are we looking at rounding it out in C++1y?
- Task and data parallelism: What’s the difference between task parallelism and data parallelism, which kind of hardware does each allow you to exploit, and why?
- Work stealing: What’s the difference between thread pools and work stealing, what are the major flavors of work stealing, which of these (if any) does C++11 already support and is already shipping on some advanced commercial C++ compilers today (this answer will likely surprise you), and what needs to be done in the next round for a complete state-of-the-art parallelism story in C++1y?
The answers all matter to you – even the ones not yet in the C++ standard – because they are real, available in shipping products, and affect how you design your software today.
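For the blocking versus non-blocking question in the list above, a small sketch may help. C++11 ships std::async and the blocking future::get(), but no continuation facility, so the then_async helper below is purely an assumption for illustration and is not a standard or proposed API. It only approximates a real continuation: the caller does not block, but the helper still parks one extra thread inside get() until the first result is ready, which is exactly the cost a true non-blocking .then() would avoid.

```cpp
#include <cassert>
#include <chrono>
#include <future>
#include <thread>
#include <utility>

// Stand-in for some slow operation (hypothetical).
int slow_square(int x) {
    std::this_thread::sleep_for(std::chrono::milliseconds(10));
    return x * x;
}

// Blocking style: the calling thread waits inside get().
int blocking_style(int x) {
    std::future<int> f = std::async(std::launch::async, slow_square, x);
    return f.get() + 1;  // caller is stuck here until the result is ready
}

// Hypothetical helper approximating a continuation in plain C++11.
// The caller returns immediately; a helper task waits for `f` and then
// runs `func` on the result.
template <typename T, typename F>
auto then_async(std::future<T> f, F func)
    -> std::future<decltype(func(std::declval<const T&>()))> {
    std::shared_future<T> shared = f.share();  // copyable, so a C++11 lambda can capture it
    return std::async(std::launch::async, [shared, func]() mutable {
        return func(shared.get());
    });
}

// Continuation style: describe what happens next instead of waiting for it.
std::future<int> continuation_style(int x) {
    return then_async(std::async(std::launch::async, slow_square, x),
                      [](int v) { return v + 1; });
}
```

Both blocking_style(6) and continuation_style(6).get() produce 37. The difference is where the waiting happens: in the first case on the caller’s thread, in the second on a helper task, leaving the caller free until it actually needs the value.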
This will be a broad and deep dive. At C++ and Beyond 2011, the attendees (audience!) included some of the world’s leading experts on parallelism and compilers. At these sessions of C&B 2012, I expect anyone who wasn’t personally at the SG1 meeting this month, even world-class experts, will learn something new in these talks. I certainly did, and that’s why I’m motivated to turn the information into talks and share. This isn’t just cool stuff – it’s important and useful in production code today.
I hope to see many of you at C&B 2012. I’m excited about these topics, and about Scott’s and Andrei’s new material – you just can’t get this stuff anywhere else.
Asheville is going to be a blast. I can’t wait.
P.S.: I haven’t seen this much attention and investment in C++ since last century – C++ conferences at record numbers, C++ compiler investments by the biggest companies in the industry (e.g., Clang), and much more that we’ve seen already…
… and a little bird tells me there’s a lot more major C++ news coming this year. Stay tuned, and fasten your seat belts. 2012 ain’t done yet, not by a long shot, and I’ll be able to say more about C++ as a whole (besides the specific topics mentioned above) for the first time at C&B in August. I hope to see you there.
FYI, C&B is already over 60% full, and early bird registration ends this Friday, June 1 – so register today.
Want to know how to write cool tablet apps using Visual C++?
On May 18, Microsoft is hosting a one-day free technical event for developers who want to write Metro apps for Windows 8 using Visual C++. I’m giving the opening talk, and the rest of the day is full of useful technical information on everything from XAML and DirectX to networking and VC++ compiler flags.
From the page:
Join the Microsoft Visual C++ and Windows teams in Redmond on May 18, 2012 for a free, all-day event focused on building Windows 8 Metro style apps with C++.
We will have pragmatic advice for every developer writing Metro style apps and games with XAML and/or DirectX and C++.
- Visual C++ for Windows 8, Keynote by Herb Sutter
- Building Windows 8 apps with XAML and C++
- Building Windows 8 games with DirectX and C++
- Introduction to the Windows Runtime Library (WRL)
- Writing Connected apps: Writing networking code with C++
- Combining XAML & DirectX in Metro style apps
- Writing WinRT components to be consumed from any language
- VC11 compiler flags for getting the most out of C++
All sessions will be recorded and available for on demand viewing on C9.
I wish I’d blogged about it right away when it was announced a week or so ago, because registration filled almost immediately (I think on the first day), and when the room was expanded it filled again before I could blog about it. Then I procrastinated for a few days. You can still register here for the waitlist to see it in person, but I have good news…
All sessions will be broadcast livestream and then available for viewing on demand. If you’re halfway around the world, or just halfway across the country, it’s hard to fly somewhere for a one-day event anyway; thanks to livestream and on-demand, the Internet is our friend. I look forward to seeing and e-seeing many of you there.