In Welcome to the Jungle, I predicted that “weak” hardware memory models will disappear. That prediction is now coming true before our eyes:
- x86 has always been considered a “strong” hardware memory model that supports sequentially consistent atomics efficiently.
- The other major architecture, ARM, recently announced that they are now adding strong memory ordering in ARMv8 with the new sequentially consistent ldar and stlr instructions, as I predicted they would. (Actually, Hans Boehm and I influenced ARM in this direction, so it was an ever-so-slightly disingenuous prediction…)
However, at least two people have been confused by what I meant by “weak” hardware memory models, so let me clarify: “weak” means something different for hardware memory models than for software memory models, so perhaps these aren’t the clearest terms to use.
By “weak (hardware) memory model” CPUs I mean specifically ones that do not natively support efficient sequentially consistent (SC) atomics, because on the software side programming languages have converged on “sequential consistency for data-race-free programs” (SC-DRF, roughly aka DRF0 or RCsc) as the default (C11, C++11) or only (Java 5+) supported memory model for software. POWER and ARMv7 notoriously do not support SC atomics efficiently.
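For illustration, here is a minimal C++11 sketch (mine, not from the original article) showing that std::atomic operations default to memory_order_seq_cst, which is exactly the SC-DRF promise: if the program has no data races, it behaves as if sequentially consistent.

```cpp
#include <atomic>
#include <thread>

std::atomic<bool> ready{false};   // atomics default to memory_order_seq_cst
int payload = 0;                  // ordinary data, published via the atomic flag

void producer() {
    payload = 42;                 // ordinary write
    ready.store(true);            // SC store: same as ready.store(true, std::memory_order_seq_cst)
}

void consumer() {
    while (!ready.load()) { }     // SC load: same as ready.load(std::memory_order_seq_cst)
    // The program is data-race-free, so SC-DRF guarantees payload == 42 is visible here.
}

int main() {
    std::thread t1(producer), t2(consumer);
    t1.join(); t2.join();
}
```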
Hardware that supports only memory models weaker than SC-DRF, meaning that it cannot support SC-DRF efficiently, is permanently disadvantaged and will either become stronger or atrophy. As I mentioned specifically in the article, the two main current hardware architectures with what I called “weak” memory models were ARM (ARMv7) and POWER:
- ARM recently announced ARMv8 which, as I predicted, is upgrading to SC acquire/release by adding new SC acquire/release instructions ldar and stlr that are mandatory in both 32-bit and 64-bit mode (a rough sketch of the compiler mapping follows this list). In fact, this is something of an industry first: ARMv8 is the first major CPU architecture to support SC acquire/release instructions directly like this. (Note: That’s for CPUs, but the roadmap for ARM GPUs is similar. ARM GPUs currently have a stronger memory model, namely fully SC; ARM has announced that their GPU roadmap has the GPUs becoming fully coherent with the CPUs, and they will likely add “SC load acquire” and “SC store release” instructions to the GPUs as well.)
- It remains to be seen whether POWER will adapt similarly, or die out.
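As a rough illustration of why the new instructions matter (my sketch of the commonly cited compiler mappings, not ARM’s documentation; exact code generation varies by compiler and version), here is what a sequentially consistent load and store in C++11 turn into on ARMv7 versus ARMv8:

```cpp
#include <atomic>

std::atomic<int> x{0};

int  sc_load()       { return x.load(); }   // memory_order_seq_cst by default
void sc_store(int v) { x.store(v); }        // memory_order_seq_cst by default

// Commonly cited mappings (approximate; real codegen varies):
//
//   ARMv7:  sc_load  ->  ldr; dmb            (plain load plus a full barrier)
//           sc_store ->  dmb; str; dmb       (full barriers around a plain store)
//
//   ARMv8:  sc_load  ->  ldar                (single SC load-acquire instruction)
//           sc_store ->  stlr                (single SC store-release instruction)
```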
Note that I’ve seen some people call x86 “weak”, but x86 has always been the poster child for a strong (hardware) memory model in all of our software memory model discussions for Java, C, and C++ during the 2000s. Therefore perhaps “weak” and “strong” are not useful terms if they mean different things to some people, and I’ve updated the WttJ text to make this clearer.
I will be discussing this in detail in my atomic<> Weapons talk at C&B next week, which I hope to make freely available online in the near future (as I do most of my talks). I’ll post a link on this blog when I can make it available online.
@Ragavendra: If I were writing a formal reference in a paper, it would be: “Jem Davies and Richard Grisenthwaite, personal communication, June 2012.” Jem and Richard are The Guys (the primary ARM processor and memory model architects), and what I wrote and presented in the talk was the summary of the ARMv8 CPU and GPU memory models that ARM approved for me to report publicly at that time for this talk. Specifically, as of that writing, ARM GPUs had no equivalent of load-acquire/store-release instructions because they didn’t need them.
“ARM GPUs currently have a stronger memory model, namely fully SC”.
I’ve been reading about ARM GPU memory models and cannot find a reference that states they have an SC model. Can you please provide a reference for this?
@Jon: This was the terminology we used during the C++ and C memory model discussions (and the Java ones, which I wasn’t involved with but which involved many of the same people): We talked about x86 as the poster child for strong because it has the strongest guarantees of mainstream hardware (not all of which are necessary or desirable, as I’ll mention in my talk on Tuesday), and especially because it efficiently supports SC program semantics, whereas ARM(v7) and POWER require heavy sync operations to get SC program semantics. I say “SC program semantics” because it’s not only about atomics, though that is part of it.
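To make the cost difference concrete, here is a rough sketch of the commonly cited mappings for a sequentially consistent store (approximate and compiler-dependent, not authoritative codegen):

```cpp
#include <atomic>

std::atomic<int> g{0};

void sc_store(int v) { g.store(v); }   // memory_order_seq_cst by default

// Commonly cited mappings (approximate; real codegen varies):
//
//   x86:    mov [g], v; mfence   (or a single xchg) -- cheap, because x86's
//                                 ordering already provides most of what SC needs
//   POWER:  hwsync; st            -- a heavyweight sync before every SC store
//   ARMv7:  dmb; str; dmb         -- full barriers around the store
```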
Of course, no mainstream hardware is truly SC any more and will never be again, so “pure SC hardware” is of historical interest only. However, giving programs SC semantics is a very strong guarantee, even with the C++11/C11/Java qualifier “if you don’t write races,” and it’s important because it’s the only thing that anyone has been able to show mainstream developers can reason about successfully.
@Dave: Synchronizing using mutexes/cvs is what we prefer to teach people to use by default. Programming using SC atomics is indeed experts-only difficulty, but unfortunately it’s also justified more often than I would like (e.g., double-checked locking (DCL), reference counting, and several other common patterns, some of which should be wrapped in types but can’t always be), and so it turns out that we still have to teach SC atomics techniques to advanced-but-mainstream C++ programmers. However, programming using weaker-than-SC atomics is yet another major level of difficulty beyond that, and I don’t know if there are 100 people in the world who can reliably use those directly; I’m still trying to discourage resorting to them, although there currently are performance reasons to reach for them on ARMv7 and POWER. There is a difference of opinion among experts as to whether weaker-than-SC atomics are fully going away or not (e.g., they linger in 1000+ core supercomputing applications), but with ARMv8 in particular the industry momentum in mainstream processors is toward a shrinking, and eventually disappearing, performance carrot for resorting to weaker-than-SC models.
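To illustrate the kind of pattern I mean, here is a bare-bones sketch of DCL written with C++11 SC atomics (simplified for exposition, with placeholder names; a production version needs more care):

```cpp
#include <atomic>
#include <mutex>

class widget {
public:
    static widget& instance() {
        widget* p = ptr.load();              // SC load (memory_order_seq_cst by default)
        if (!p) {                            // first check, without the lock
            std::lock_guard<std::mutex> lock(mtx);
            p = ptr.load();                  // second check, under the lock
            if (!p) {
                p = new widget;
                ptr.store(p);                // SC store publishes the fully constructed object
            }
        }
        return *p;
    }
private:
    static std::atomic<widget*> ptr;
    static std::mutex mtx;
};

std::atomic<widget*> widget::ptr{nullptr};
std::mutex widget::mtx;
```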
“x86 has always been the poster child for a strong (hardware) memory model”
FWIW, memory model researchers at the University of Cambridge regard x86 as having a weak memory model:
(link: cacm.pdf)
And memory model researchers at the University of Oxford also regard x86 as having a weak memory model:
(link: aplas11.pdf)
The reason is that they regard a strong model as one where all memory operations are sequentially consistent. That is obviously not true of x86, because x86 allows a store to be reordered with a later load to a different location (store buffering), so x86 is a weak memory model by that definition.
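For instance, the classic store-buffering litmus test shows the non-SC behaviour (a rough sketch; relaxed atomics are used here so that the compiler emits plain loads and stores on x86):

```cpp
#include <atomic>
#include <thread>

// Store-buffering litmus test. With relaxed atomics (plain mov on x86), the
// outcome r1 == 0 && r2 == 0 is observable on real x86 hardware: each core's
// store can sit in its store buffer while its subsequent load executes.
std::atomic<int> x{0}, y{0};
int r1 = -1, r2 = -1;

void thread1() {
    x.store(1, std::memory_order_relaxed);
    r1 = y.load(std::memory_order_relaxed);
}

void thread2() {
    y.store(1, std::memory_order_relaxed);
    r2 = x.load(std::memory_order_relaxed);
}

int main() {
    std::thread a(thread1), b(thread2);
    a.join(); b.join();
    // Under pure SC, at least one of r1, r2 would be 1.
    // On x86 (and weaker machines), r1 == 0 && r2 == 0 can and does occur.
}
```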
I had never seen your definitions before but your previous statements make sense in their context. You were saying that only atomics will be sequentially consistent in future architectures.
I’m still not pretending to be an expert in this area, but to the best of my understanding (and please correct me if I’m wrong),
• SC-DRF has been efficiently implemented on PowerPC for higher-level threading primitives like mutexes, locks, and condition variables.
• PowerPC’s weaker atomics are also sometimes less expensive.
• In some lock-free algorithms you don’t need sequential consistency for every operation, and at those times the PPC model can be more efficient (see the reference-counting sketch below).
• It can be a lot harder to write some lock-free algorithms without SC-DRF atomics.
If that’s the case, I don’t understand the evolutionary pressure you imply exists toward SC-DRF atomics. After all, lock-free programming is still (and probably always will be) considered an experts-only domain, and everybody else is supposed to use the higher-level primitives, which all have the SC-DRF property.
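Here is the reference-counting sketch I referred to above (my rough understanding of the commonly given advice, so please correct me if the orderings are wrong):

```cpp
#include <atomic>

// Intrusive reference counting: the classic "weaker-than-SC is sometimes
// enough" example. On strongly ordered hardware this costs about the same as
// SC; on POWER/ARMv7 it avoids heavyweight syncs.
struct ref_counted {
    std::atomic<int> refs{1};

    void add_ref() {
        // The increment needs atomicity but no ordering at all.
        refs.fetch_add(1, std::memory_order_relaxed);
    }

    void release() {
        // The decrement must order all prior uses of the object before the
        // deletion; acquire-release suffices, full SC is not required.
        if (refs.fetch_sub(1, std::memory_order_acq_rel) == 1) {
            delete this;
        }
    }
};
```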
I’m happy to know that “strong” memory models are taking a stronghold! The definitions of “weak” and “strong” make sense to me for a single core, but for many-core processors I think the memory models should be defined more broadly. For instance, Intel SCC (http://communities.intel.com/community/marc) has 48 Pentium P54C cores but provides no mechanism for cache coherence between the cores (http://communities.intel.com/docs/DOC-5512). I believe the individual cores provide sequential consistency, but I would not call the chip’s overall memory model “strong” due to the lack of hardware-supported cache coherence. Perhaps it is further evidence that “weak” and “strong” is not the right way of thinking about it. Is cache coherence between cores orthogonal to the “weak” and “strong” models you are talking about? More importantly, do you think such cache-incoherent architectures will be common in the future?
Alex: No, I’m not talking about lwarx+stwcx, which are not scalable but certainly do not go on forever (assuming no user error). There are special store instructions for BQC that do store-with-increment, etc., as a single instruction that is applied in the L2 itself. The documentation will eventually be at https://bgq.anl-external.org/wiki/index.php/Main_Page when we have time to upload it.
@Jeff Hammond:
Can you please clarify your point about hardware atomics on PowerPC?
As far as I know, to perform an atomic operation you have to take a reservation on the L2 cache line, do the operation, and then check whether it succeeded. If it didn’t, you have to do it again, so in principle this kind of cycle can go on forever.
Potentially it’s more powerful, since you can do all the operations you want and then try to “commit” the whole cache line. But in practice it’s very difficult to use.
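In portable C++ terms, the kind of retry loop I mean looks like this (a rough sketch; my understanding is that on PowerPC compare_exchange_weak is typically implemented with an lwarx/stwcx. pair):

```cpp
#include <atomic>

std::atomic<int> counter{0};

// LL/SC-style retry loop. On PowerPC, compare_exchange_weak typically compiles
// to lwarx (load-and-reserve) plus stwcx. (store-conditional): if another core
// touches the cache line between the two, the store fails and we retry. There
// is no bound on how many retries can happen under contention.
void add_ten() {
    int old = counter.load(std::memory_order_relaxed);
    while (!counter.compare_exchange_weak(old, old + 10)) {
        // 'old' has been refreshed with the current value; just go around again.
    }
}
```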
Your definition of strong and weak memory models is somewhat inconsistent with every other one I’ve seen, which talk about the ordering of conventional loads and stores, not the presence or absence of single-instruction atomics.
The Blue Gene/Q compute chip (BQC), which is PowerPC with extensions, is very much weak in the traditional PowerPC sense, but it also provides hardware atomics in the L2 cache that are probably superior to anything you can get from x86.
The BQC also has hardware transactional memory, which allows one to implement far richer lock-free algorithms than x86 does.
Hi Herb,
I didn’t understand everything you explained, even after having watched the video that you mention here beforehand. However, I think it’s a great opportunity for me to learn about this, so thanks.
By the way, here is a suggestion related to one of the GotW articles (we can’t comment on those pages): In GotW #100 and #101 you suggest using std::unique_ptr in the implementation of the class exposing the interface, with this pointer holding the implementation. I think it would be very useful to add to the article the case where your class is the interface of a shared library that is supposed to be accessible from other compilers, other compiler versions, or binaries built with different settings, because std::unique_ptr is a class template. Visual Studio’s compiler notifies the programmer through a warning, so it is obvious when you try, but it might be good to note in such a great article that the technique can’t be used as-is in that case. There seems to be a way to explicitly instantiate the specific template instance used and then expose it in the interface of the shared library, but it is not very clear to me yet.
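To illustrate roughly what I mean, here is a sketch of the situation (the class names and export macro are placeholders of mine, not from the GotW articles):

```cpp
// widget.h -- sketch of an exported pimpl class (placeholder names)
#include <memory>

#ifdef BUILDING_WIDGET_DLL
  #define WIDGET_API __declspec(dllexport)
#else
  #define WIDGET_API __declspec(dllimport)
#endif

class WIDGET_API widget {          // exported across the shared-library boundary
public:
    widget();
    ~widget();                     // must be defined where 'impl' is complete
    void draw();
private:
    class impl;                    // defined only in widget.cpp
    std::unique_ptr<impl> pimpl_;  // Visual C++ typically warns here (C4251,
                                   // "needs to have dll-interface"): the
                                   // std::unique_ptr instantiation becomes part
                                   // of the exported class layout and may not be
                                   // binary-compatible across compilers/runtimes.
};
```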