atomic<> Weapons: The C++ Memory Model and Modern Hardware

[ETA: Updated OneDrive slides link]

Most of the talks I gave at C++ and Beyond 2012 last summer are already online at Channel 9. Here are two more.

This is a two-part talk that covers the C++ memory model, how locks and atomics and fences interact and map to hardware, and more. Even though we’re talking about C++, much of this is also applicable to Java and .NET, which have similar memory models but not all the features of C++ (such as relaxed atomics).

Note: This is about the basic structure and tools, not how to write lock-free algorithms using atomics. That next-level topic may be on deck for this year’s C++ and Beyond in December; we’ll see…

atomic<> Weapons: The C++ Memory Model and Modern Hardware

This session in one word: Deep.

It’s a session that includes topics I’ve publicly said for years are Stuff You Shouldn’t Need To Know and I Just Won’t Teach, but it’s becoming achingly clear that people do need to know about it. Achingly, heartbreakingly clear, because some hardware incents you to pull out the big guns to achieve top performance, and C++ programmers are just so addicted to full performance that they’ll reach for the big red levers with the flashing warning lights. Since we can’t keep people from pulling the big red levers, we’d better document the A to Z of what the levers actually do, so that people don’t SCRAM unless they really, really, really meant to.

Topics Covered:

  • The facts: The C++11 memory model and what it requires you to do to make sure your code is correct and stays correct. We’ll include clear answers to several FAQs: “how do the compiler and hardware cooperate to respect these rules?”, “what is a race condition?”, and the ageless one-hand-clapping question “how is a race condition like a debugger?”
  • The tools: The deep interrelationships and fundamental tradeoffs among mutexes, atomics, and fences/barriers. I’ll try to convince you why standalone memory barriers are bad, and why barriers should always be associated with a specific load or store.
  • The unspeakables: I’ll grudgingly and reluctantly talk about the Thing I Said I’d Never Teach That Programmers Should Never Need To Know: relaxed atomics. Don’t use them! If you can avoid it. But here’s what you need to know, even though it would be nice if you didn’t need to know it.
  • The rapidly-changing hardware reality: How locks and atomics map to hardware instructions on ARM and x86/x64, with POWER and Itanium thrown in for good measure – and I’ll cover how and why the answers were different last year than they are this year, and how they will likely be different again a few years from now. We’ll cover how the latest CPU and GPU hardware memory models are rapidly evolving, and how this directly affects C++ programmers.

16 thoughts on “atomic<> Weapons: The C++ Memory Model and Modern Hardware”

  1. Incredibly good talk, especially part 2. Any chance the slides will be released?

  2. Really liked the talk Herb, I think I get pretty much everything a lot better than before, but I’m a bit confused about this now:
    all the positive ordering effects of atomics are introduced in part 1 in terms of how Acquire/Release form the packaged deal that guarantees the induction of a “happens-before” relationship.
    But the default memory ordering for atomic operations is std::memory_order_seq_cst (as further explained in part 2), right? So, aren’t all the things introduced in part 1 (related to Acquire/Release) specific to the cases where you request std::memory_order_acquire, std::memory_order_release or std::memory_order_acq_rel?
    IOW: if using the default memory order for atomics, aren’t all the things explained in part 1 “over-guaranteed”, because the default std::memory_order_seq_cst is the strongest and hence does all that Acquire/Release do and more? I’m asking this because after watching part 1, I got the idea that reading atomics would use std::memory_order_acquire and writing them would use std::memory_order_release, which I knew wasn’t true and which you confirm in part 2, the default being std::memory_order_seq_cst…
    Hope I was able to explain my doubt…

    Thanks,
    Andrea.

  3. @Michael: The slides are now linked from the video page. Here’s the link for convenience: https://1drv.ms/b/s!Aq0V7yDPsIZOgcI0y2P8R-VifbnTtw

    @Andrea: See the part where I mentioned two kinds of acquire-release: “plain acq/rel” and “sequentially consistent acq/rel.” The only difference is that the latter forbids reordering a release followed by an acquire. The standard’s memory_order_seq_cst default means “sequentially consistent acquire/release” — loads are by default “SC acquire” and stores are by default “SC release.” See the slide “Enter the memory_order_*” (page 45 of the handout link) which summarizes these rules.
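    In code, the reply above amounts to the following small sketch (my illustration, not a slide from the talk): the defaulted operations and the explicit seq_cst ones mean the same thing, while the plain acquire/release orderings are the strictly weaker variants.

```cpp
#include <atomic>

std::atomic<int> x{0};

void stores() {
    x.store(1);                             // default: memory_order_seq_cst ("SC release")
    x.store(2, std::memory_order_seq_cst);  // same meaning, spelled out explicitly
    x.store(3, std::memory_order_release);  // plain release: weaker; a subsequent
                                            // acquire may be reordered before it
}

int loads() {
    int a = x.load();                           // default: memory_order_seq_cst ("SC acquire")
    int b = x.load(std::memory_order_acquire);  // plain acquire
    return a + b;
}
```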

  4. Herb, congratulations on a pretty good attempt to push arcane knowledge into the mainstream. It has been around for at least 17 years (I am referring to the wonderful but hard-to-read paper ‘Memory Consistency Models for Shared-Memory Multiprocessors’, which can be found in many places, e.g. http://www.hpl.hp.com/techreports/Compaq-DEC/WRL-95-9.pdf)
    Soon ‘masters of arcane’ will have to invent something even more twisted (to maintain their status). :-)

  5. First off: excellent presentation(s)! I’d have loved to have had this available ten years ago when I was first blundering through these concepts. I’ve done similar ones for our internal developers, but liked what you did much better!

    In the second presentation, you aren’t using lwsync for PowerPC; instead you’re using the heavyweight sync, which is what we’ve used in our code (for example, our mutex implementation) since it was available. Is this because you want the memory model to default to not allowing a rel/acq to be reordered?

  6. @peeterjoot: Yes, the instruction sequences shown map each atomic operation in the source conservatively, in isolation, so that it’s correct no matter what else is going on nearby. That’s a correct baseline code generation strategy, and often sufficient. If your optimizer can see more of what’s going on in the surrounding context, however, it can further refine the code generation. The example you give is one such case. See also the POWER code gen slide on page 37 of the handouts, where it notes “you can almost get away with an lwsync here” (you can indeed get away with it if you know a little more about other nearby atomic loads and stores in the same thread).

  7. Very interesting talk! You mention that .NET is not fully SC. Do you know if that is still true, or did they fix it in .NET 4.5? You also mention this in your “volatile vs volatile” paper, and that they would fix it in VS2010, but I take it that didn’t happen?

  8. What if, in the example where you’re talking about benign races, reading x could generate a hardware error (a page fault, say) due to x being in an unmapped page? Wouldn’t reading x speculatively (by the compiler) wrongly cause the page fault?

  9. Could I ask for some clarification? In the “relaxed” section of the talk, you have a “stop” variable and a relaxed load from it to check whether a thread should end, paired with an SC store “stop = true;” in the main thread. Is it actually guaranteed that the store operation propagates and becomes visible eventually? Would such a guarantee depend on the existence of an acquire-release-pair? (I appreciate that actual synchronization isn’t required in the example and that we don’t need a release sequence, but I don’t see how eventual propagation is guaranteed.)

    Thanks!

  10. Hi Herb, do you have any C/C++ code examples that illustrate how the new ldar and stlr instructions in ARMv8 are advantageous compared to x86? I am looking for a few practical illustrations/exercises for students who might run code on an ARMv8 simulator or board vs. their x86 laptop. Thank you, Jim

  11. Hi Herb,
    First of all, thank you for the great talk and for making it available online.
    A section of the slides that you couldn’t cover in the talk was that on the Double-Checked locking pattern (pages 53-56). More specifically, I have a question on page 54, where you propose to use:
    if (!create.exchange_explicit(true, memory_order_relaxed)) {…}
    In my opinion, this should be fine based on the standard’s 29.3/12 (“Atomic RMW operations shall always read the last value (in the modification order) written before the write associated with the RMW operation”).
    But you wrote on the second slide, as the answer to the initial question: “No; e.g. could do some widget creation even if CAS fails – and worse”.
    This doesn’t look accurate to me, as you are using a plain exchange() and not a compare_exchange_weak(); note the standard allows only the latter to fail.
    Am I misinterpreting your proposed implementation?
    PS I’m assuming this code should be able to run in either a weak or strong hardware memory model

Comments are closed.