Living in the future: Using C++26 at work

I recently contributed a guest post on my employer’s blog about the importance of the almost-feature-complete C++26 draft standard:

“Sea change in C++: Why opportunities abound”

It starts by summarizing a talk I gave recently, about how C++26 is poised to do what C++11 did over a decade ago: usher in a new era of C++ code style. Just as we can now glance at 10 or 20 lines and quickly recognize “modern C++” as C++11-era code, soon we’ll be able to do the same with code that uses std::execution, contracts, memory safety improvements, and (fingers crossed) reflection, and quickly see it’s “modern” C++26-era code. A second wave of modernization, marked by a visibly refreshed language and style.

One thing I’ve appreciated here at Citadel is how aggressively the key advances are being adopted in our live trading systems. We already use C++26’s std::execution  in production for an entire asset class, and as the foundation of our new messaging infrastructure. That’s possible because we’ve had our own in-house implementation running for several years now—thanks, Gašper and Bronek! Next, we’ll be pushing hard to adopt draft C++26’s new hardened standard library and contracts in our production systems, which were just voted into C++26 at our latest WG21 meeting in February.

This doesn’t mean we throw caution to the wind or reach for just any shiny new feature the moment it appears. But when key features are ready and delivering real value today, and we can get them with a little extra effort, there’s no reason to wait until tomorrow to use them. One of the nice things about that model is we get to “live in the future” by using those key features in production early, to get the benefits sooner and also to start building experience now with the cool things everyone will be using routinely everywhere soon.

The future’s not quite here. But for some features, we can already write production code as if it is. It’s illuminating; personally, I’m learning more about std::execution now that I’m in an environment where it’s being used for real. Fun times for C++!

My little New Year’s Week project (and maybe one for you?)

[Updates: Clarified that an intrusive discriminator would be far beyond what most people mean by “C++ ABI break.” Mentioned unique addresses and common initial sequences. Added “unknown” state for passing to opaque functions.]

Here is my little New Year’s Week project: Trying to write a small library to enable compiler support for automatic raw union member access checking.

The problem, and what’s needed

During 2024, I started thinking: What would it take to make C/C++ union accesses type-checked? Obviously, the ideal is to change naked union types to something safe.(*) But because it will take time and effort for the world to adopt any solution that requires making source code changes, I wondered how much of the safety we might be able to get, at what overhead cost, just by recompiling existing code in a way that instruments ordinary union objects?

Note: I describe this in my C++26 Profiles proposal, P3081R0 section 3.7. The following experiment is trying to validate/invalidate the hypothesis that this can be done efficiently enough to warrant including in an ISO C++ opt-in type safety profile. Also, I’m sure this has been tried before; if you know of a recent (last 10 years?) similar attempt that measured its results, please share it in the comments.

What do we need? Obviously, an extra discriminator field to track the currently active member of each C/C++ union object. But we can’t just add a discriminator field intrusively inside each C/C++ union object, because that would change the size and layout of the object and be a massive link/ABI incompatibility even with C compilers and C code on the same platform which would all need to be identically updated at the same time, and it would break most OSes whose link compatibility (existing apps, device drivers, …) rely on C ABIs and APIs and use unions in stable interfaces; breaking that is much more than people usually mean by “C++ taking an ABI break” which is more about evolving C++ standard library types.

So we have to store it… extrinsically? … as-if in a global internally-synchronized map<void* /*union obj address*/, uintNN_t /*discriminator*/>…? But that sounds stupid scary: global thread safety lions, data locality tigers, and even some branches bears, O my! Could such extrinsic storage and additional checking possibly be efficient enough?

My little experiment

I didn’t know, so earlier this year I wrote some code to find out, and this week I cleaned it up and it’s now posted here:

The workhorse is extrinsic_storage<Data>, a fast and scalable lock-free data structure to nonintrusively store additional Data for each pointer key. It’s wait-free for nearly all operations (not just lock-free!), and I’ve never written memory_order_relaxed this often in my life. It’s designed to be cache- and prefetcher-friendly, such as using SOA to store keys separately so that default hash buckets contain 4 contiguous cache lines of keys. Here I use it for union discriminators, but it’s a general tool that could be considered for any situation where a type needs to store additional data members but can’t store them internally.

If you’re looking for a little New Year’s experiment…

If you’re looking for a little project over the next few days to start off the year, may I suggest one of these:

  • Little Project Suggestion #1: Find a bug or improvement in my little lock-free data structure! I’d be happy to learn how to make it better, fire away! Extra points for showing how to fix the bug or make it run better, such as in a PR or your cloned repo.

  • Little Project Suggestion #2: Minimally extend a C++ compiler (Clang and GCC are open source) as described below, so that every construction/access/destruction of a union type injects a call to my little library’s union_registry<>:: functions which will automatically flag type-unsafe accesses. If you try this, please let me know in the comments what happens when you use the modified compiler on some real world source! I’m curious whether you find true positive union violations in the union-violations.log file – of course it will also contain false positives, because real code does sometimes use unions to do type punning on purpose, but you should be able to eliminate batches of those at a time by their similar text in the log file.

To make #2 easier, here’s a simple API I’ve provided as union_registry<>, which wraps the above in a compiler-intgration-targeted API. I’ll paste the comment documentation here:

//  For an object U of union type that
//  has a unique address, when              Inject a call to this (zero-based alternative #s)
//
//    U is created initialized                on_set_alternative(&U,0) = the first alternative# is active
//
//    U is created uninitialized              on_set_alternative(&U,invalid)
//
//    U.A = xxx (alt A is assigned to)        on_set_alternative(&U,#A)
//
//    U or U.A is passed to a function by     on_set_alternative(&U,unknown)
//      pointer/reference to non-const
//      and we don't know the function
//      is compiled in this mode
//
//    U.A (alt A is otherwise used)           on_get_alternative(&U,#A)
//      and A is not a common initial
//      sequence
//
//    U is destroyed / goes out of scope      on_destroy(&U)
//
//  That's it. Here's an example:
//    {
//      union Test { int a; double b; };
//      Test t = {42};                        union_registry<>::on_set_alternative(&u,0);
//      std::cout << t.a;                     union_registry<>::on_get_alternative(&u,0);
//      t.b = 3.14159;                        union_registry<>::on_set_alternative(&u,1);
//      std::cout << t.b;                     union_registry<>::on_get_alternative(&u,1);
//    }                                       union_registry<>::on_destroy(&u);
//
//  For all unions with up to 254 alternatives, use union_registry<>
//  For all unions with between 255 and 16k-2 alternatives, use union_registry<uint16_t>
//  If you find a union with >16k-2 alternatives, email me the story and use union_registry<uint32_t>

Rough initial microbenchmark performance

My test environment:

  • CPU: 2.60 GHz i9-13900H (14 physical cores, 20 logical cores)
  • OSes: Windows 11, running MSVC natively and GCC and Clang via Fedora in WSL2

My test harness provided here:

  • 14 test runs: Each successively uses { 1, 2, 4, 8, 16 32, 64, 1, 2, 4, 8, 16, 32, 64 } threads
    • Each run tests 1 million union objects, 10,000 at a time, 10 operations on each union; the test type is union Union { char alt0; int alt1; long double alt2; };
    • Each run injects 1 deliberate “type error” failure to trigger detection, which results in a line of text written to union-violations.log that records the bad union access including the source line that committed it (so there’s a little file I/O here too)
  • Totals:
    • 14 million union objects created/destroyed
    • 140 million union object accesses (10 per object, includes construct/set/get/destroy)

On my machine, here is total the run-time overhead (“total checked” time using this checking, minus “total raw” time using only ordinary raw union access), for a typical run of the whole 140M unit accesses:

Compiler total raw (ms) total checked (ms) total overhead (ms) Notes
MSVC 19.40 -O2 ~190 ~1020 ~830 Compared to -O2, -Ox checked was the same or very slightly slower, and -Os checked was 3x slower
GCC 14 -O3 ~170 ~800 ~630 Compared to -O3, -O2 overall was only slightly slower
Clang 18 -O3 ~170 ~510 ~340 Compared to -O3, -O2 checked was about 40% slower

Dividing that by 140 million accesses, the per-access overhead is:

Compiler total overhead (ns) / total accesses average overhead / access (ns)
MSVC 830M ns / 140M accesses 5.9 ns / access
GCC (midpoint) 630M ns / 140M accesses 4.5 ns / access
Clang 340M ns / 140M accesses 2.4 ns / access

Finally, recall we’re running on a 2.6 GHhz processor = 2.6 clock cycles per ns, so in CPU clock cycles the per-access overhead is:

Compiler average overhead / access (cycles)
MSVC 15 cycles / access
GCC 11.7 cycles / access
Clang 6.2 cycles / access

This… seems too good to be true. I may well be making a silly error (or several) but I’ll post anyway so we can all have fun correcting them! Maybe there’s a silly bug in my code, or I moved a decimal point, or I converted units wrong, but I invite everyone to have fun pointing out the flaw(s) in my New Year’s Day code and/or math – please fire away in the comments.

Elaborating on why this seems too good to be true: Recall that one “access” means to check the global hash table to create/find/destroy the union object’s discriminator tag (using std::atomics liberally) and then also set or check either the tag (if setting or using one of the union’s members) and/or the key (if constructing or destroying the union object). But even a single L2 cache access is usually around 10-14 cycles! This would mean this microbenchmark is hitting L1 cache almost always, even while iterating over 10,000 active unions at a time, often with more hot threads than there are physical or logical cores pounding on the same global data structure, and occasionally doing a little file I/O to report violations.

Even if I didn’t make any coding/calculation errors, one explanation is that this microbenchmark has great L1 cache locality because the program isn’t doing any other work, and in a real whole program it won’t get to run hot in L1 that often – that’s a valid possibility and concern, and that’s exactly why I’m suggesting Little Project #2, above, if anyone would like to give that little project a try.

In any event, thank you all for all your interest and support for C++ and its evolution and standardization, and I wish all of you and your families a happier and more peaceful 2025!


(*) Today we have std::variant which safely throws an exception if you access the wrong alternative, but variant isn’t as easy to use as union today, and not as type-safe in some ways. For example, the variant members are anonymous so you have to access them by index or by type; and every variant<int,string> in the program is also anonymous == the same type, so we can’t distinguish/overload unrelated variants that happen to have similar alternatives. I think the ideal answer – and it looks like ISO C++ is just 1-2 years from being powerful enough to do this! – will be something like the safe union metaclass using reflection that I’ve implemented in cppfront, which is as easy to use as union and as safe as variant – see my CppCon 2023 keynote starting at 39:16 for a 4-minute discussion of union vs. variant vs a safe union metafunction that uses reflection.

Effective Concurrency course & upcoming talks

With the winter ISO meeting behind us, it’s onward into spring conference season!

ACCU Conference 2024. On April 17, I’ll be giving a talk on C++’s current and future evolution, where I plan to talk about safety based on my recent essay “C++ safety, in context,” and progress updates on cppfront. I’m also looking forward to these three keynoters:

  • Laura Savino, who you may recall gave an outstanding keynote at CppCon 2023 just a few months ago. Thanks again for that great talk, Laura!
  • Björn Fahller, who not only develops useful libraries but is great at naming them (Trompeloeil, I’m looking at you! [sic]).
  • Inbal Levi, who chairs one of the two largest subgroups in the ISO C++ committee (Library Evolution Working Group, responsible for the design of the C++ standard library) and is involved with organizing and running many other C++ conferences.

Effective Concurrency online course. On April 22-25, I’ll be giving a live online public course for four half-days, on the topic of high-performance low-latency coding in C++ (see link for the course syllabus). The times of 14.00 to 18.00 CEST daily are intended to be friendly to the home time zones of attendees anywhere in EMEA and also to early risers in the Americas. If you live in a part of the world where these times can’t work for you, and you’d like another offering of the course that is friendlier to your home time zone, please email Alfasoft to let them know! If those times work for you and you’re interested in high performance and low latency coding, and how to achieve them on modern hardware architectures with C++17, 20, and 23, you can register now.

Beyond April, later this year I’ll be giving talks in person at these events:

Details for the November conferences will be available on their websites soon.

I look forward to chatting with many of you in person or online this year!

Reader Q&A: Book recommendations

Vigen Isayan emailed me today to ask:

What book(s) you would recommend for learning

1. design patterns, and

2. concurrency programming.

Off the top of my head:

1. For Design Patterns, the greatest is still the original “Gang of Four” Design Patterns book. The design patterns are mostly universal, and the implementations happen to focus on OO techniques (dynamic run-time virtual dispatch etc.) but you can also express many of the patterns using generic programming (static compile-time templates etc.) – for example Observer becomes really simple. The book is almost two decades old, has inspired huge numbers of follow-on patterns books, and there’s still none better of its kind that I know of.

2. For concurrency (and parallelism), check out my 33 articles on Effective Concurrency. I hope to assemble them into a book in the near future. In the meantime they’re all out there online freely available.

If you have additional recommendations, please post them in the comments. Thanks.

On a personal note

Speaking of the Gang of Four, here’s a personal story I don’t know if I’ve shared on the blog before:

At the time it first came out, I was working in downtown Toronto. We had a really excellent local bookstore that specialized in programming books and magazines (alas, it’s long gone now). On lunch breaks, I would go there to browse and get new books to absorb. Note that this was before I was a published author myself, and I had no idea about major new books coming out, so I had no warning of what was about to happen.

On one day that at first seemed like any other, at lunchtime I was browsing the shelves as usual, and I saw copies of a white and blue hardcover book I’d never seen before. Curious, I picked up a copy. It had an unusual title, Design Patterns. I had never heard of the book or its authors before, knew nothing about it at all, and so was quite unsuspecting when I opened it and something happened that I had never experienced before and haven’t experienced since (so far):

I opened that book for the first time, stood there paging through it for less than five minutes, and knew immediately and profoundly that I was holding a classic in my hands. The realization was so unexpected and surprising, it hit me almost physically. At first sight, it was as easy to recognize this book as an important leap forward as the first you see an airplane flying in the sky. Scales fell from my eyes: Cataloguing OO design problems and their known solutions! Imagine!

I had already learned a few of the patterns on my own here and there, some through hard work and experience, others by combining ideas from various articles, many by working with experienced colleagues. But suddenly I found myself standing there in the bookstore aisle just absorbing one problem and solution after another after another, and feeling my understanding begin expanding. Even the patterns I sort of knew about were explained with concise clarity: Motivation, the problem and the specific constraints on the solution which are so important because they affect the design. Known solutions. Variations with tradeoffs.

Bliss! No, more than bliss — treasure.

This must be what explorers and hunters feel like when one day unexpectedly they discover an unopened and unransacked tomb of a young Egyptian Pharaoh, a sunken treasure galleon filled with precious cargo and artifacts, or a cave near the West Bank containing well-preserved two-thousand-year-old scrolls of important literature.

And Design Patterns was language-agnostic, to boot.

Nineteen years later, it’s still at the top of my list of design books to recommend.

Disclaimer: Something this profound inspires a whole new subculture. Later “pattern” wannabe books often went way, way, way overboard, some of them almost to the border of quasi-mysticism. Ignore most (not all, but most) of the follow-on pattern books. Only a few are worth your time; read the reviews carefully first.

Reader Q&A: Acquire/release and sequential consistency

Reader Ernie Cohen emailed me this morning to ask a question about one slide in my atomic<> Weapons talk from last year’s C++ and Beyond:

In your atomic weapons talk (part 1) (updated 2/15/2013) ,page 18, titled “Sc > Acq/Rel Alone: Some examples”, the first example listed “transitivity/causality”:

T0: g = 1; x = 1;

T1: if (x == 1) y = 1;

T2: if (y == 1) assert(g == 1);

I understood you to mean that the assertion might fail if the loads were simple C++11 acquires and the stores were simple C++ releases. But this works just fine with the weaker memory order; the operations in each thread are related by sequenced-before, the communications between the threads create happens-before, and without consumes happens-before is transitive, so there is a happens-before edge from g = 1 to the assertion. Am I missing something?

[Note: g is an ordinary variable, x and y are std::atomic, and all initially zero as usual.] The motivation behind this example, and the other example on the same slide, was to show that when we specified the C++ memory model and atomics, we had to consider more than individual acquire-release pairs in isolation, but also provide additional guarantees to ensure that the whole program was sequentially consistent (SC).

In the above example, yes, we guarantee that the assertion cannot fail with C++ acquire and release semantics, and making sure the memory model required this transitivity is exactly one of the two key points of this example. As you point out, it requires getting the “right” answer when combining sequenced-before and happens-before.

The second point illustrated here is that it was essential to support cases where the programmer could depend on reasoning based on tests of whether a particular write was read and then making SC assumptions based on the outcome of the test, otherwise the whole program wouldn’t be SC.

For completeness, the other example on the slide showed an additional case where individual pairwise acquire/release alone was insufficient to guarantee SC outcomes unless we added requirements. Here is the example, with x and y std::atomic and initially zero:

T1: x = 1;

T2: y = 1;

T3: if( x == 1 && y == 0 ) print( “x first” );

T4: if( y == 1 && x == 0 ) print( “y first” );

This illustrates the total store order requirement: It must be impossible to print both messages, else the result wouldn’t be SC.

Note that in most cases using (non-SC) memory_order_acquire and memory_order_release explicitly happens to give you SC results, except when they don’t (e.g., Dekker’s fails, and I think the second example above fails as well). And of course other relaxed atomics can allow non-SC results at the drop of a hat.

atomic Weapons: The C++ Memory Model and Modern Hardware

[ETA: Updated OneDrive slides link]

Most of the talks I gave at C++ and Beyond 2012 last summer are already online at Channel 9. Here are two more.

This is a two-part talk that covers the C++ memory model, how locks and atomics and fences interact and map to hardware, and more. Even though we’re talking about C++, much of this is also applicable to Java and .NET which have similar memory models, but not all the features of C++ (such as relaxed atomics).

Note: This is about the basic structure and tools, not how to write lock-free algorithms using atomics. That next-level topic may be on deck for this year’s C++ and Beyond in December, we’ll see…

atomic<> Weapons: The C++ Memory Model and Modern Hardware

This session in one word: Deep.

It’s a session that includes topics I’ve publicly said for years is Stuff You Shouldn’t Need To Know and I Just Won’t Teach, but it’s becoming achingly clear that people do need to know about it. Achingly, heartbreakingly clear, because some hardware incents you to pull out the big guns to achieve top performance, and C++ programmers just are so addicted to full performance that they’ll reach for the big red levers with the flashing warning lights. Since we can’t keep people from pulling the big red levers, we’d better document the A to Z of what the levers actually do, so that people don’t SCRAM unless they really, really, really meant to.

Topics Covered:

  • The facts: The C++11 memory model and what it requires you to do to make sure your code is correct and stays correct. We’ll include clear answers to several FAQs: “how do the compiler and hardware cooperate to remember how to respect these rules?”, “what is a race condition?”, and the ageless one-hand-clapping question “how is a race condition like a debugger?”
  • The tools: The deep interrelationships and fundamental tradeoffs among mutexes, atomics, and fences/barriers. I’ll try to convince you why standalone memory barriers are bad, and why barriers should always be associated with a specific load or store.
  • The unspeakables: I’ll grudgingly and reluctantly talk about the Thing I Said I’d Never Teach That Programmers Should Never Need To Now: relaxed atomics. Don’t use them! If you can avoid it. But here’s what you need to know, even though it would be nice if you didn’t need to know it.
  • The rapidly-changing hardware reality: How locks and atomics map to hardware instructions on ARM and x86/x64, and throw in POWER and Itanium for good measure – and I’ll cover how and why the answers are actually different last year and this year, and how they will likely be different again a few years from now. We’ll cover how the latest CPU and GPU hardware memory models are rapidly evolving, and how this directly affects C++ programmers.

“256 cores by 2013”?

I just saw a tweet that’s worth commenting on:

image

Almost right, and we have already reached that.

I said something similar to the above, but with two important differences:

  1. I said hardware “threads,” not only hardware “cores” – it was about the amount of hardware parallelism available on a mainstream system.
  2. What I gave was a min/max range estimate of roughly 16 to 256 (the latter being threads) under different sets of assumptions.

So: Was I was right about 2013 estimates?

Yes, pretty much, and in fact we already reached or exceeded that in 2011 and 2012:

  • Lower estimate line: In 2011 and 2012 parts, Intel Core i7 Sandy Bridge and Ivy Bridge are delivering almost the expected lower baseline, and offering 8-way and 12-way parallelism = 4-6 cores x 2 hardware threads per core.
  • Upper estimate line: In 2012, as mentioned in the article (which called it Larrabee, now known as MIC or Xeon Phi) is delivering 200-way to 256-way parallelism = 50-64 cores x 4 hardware threads per core. Also, in 2011 and 2012, GPUs have since emerged into more mainstream use for computation (GPGPU), and likewise offer massive compute parallelism, such as 1,536-way parallelism on a machine having a single NVidia Tesla card.

Yes, mainstream machines do in fact have examples of both ends of the “16 to 256 way parallelism” range. And beyond the upper end of the range, in fact, for those with higher-end graphics cards.

For more on these various kinds of compute cores and threads, see also my article Welcome to the Jungle.

 

Longer answer follows:

Here’s the main part from article, “Design for Manycore Systems” (August 11, 2009). Remember this was written over three years ago – in the Time Before iPad, when Android was under a year old:

How Much Scalability Does Your Application Need?

So how much parallel scalability should you aim to support in the application you‘re working on today, assuming that it’s compute-bound already or you can add killer features that are compute-bound and also amenable to parallel execution? The answer is that you want to match your application’s scalability to the amount of hardware parallelism in the target hardware that will be available during your application’s expected production or shelf lifetime. As shown in Figure 4, that equates to the number of hardware threads you expect to have on your end users’ machines.

Figure 4: How much concurrency does your program need in order to exploit given hardware?

Let’s say that YourCurrentApplication 1.0 will ship next year (mid-2010), and you expect that it’ll be another 18 months until you ship the 2.0 release (early 2012) and probably another 18 months after that before most users will have upgraded (mid-2013). Then you’d be interested in judging what will be the likely mainstream hardware target up to mid-2013.

If we stick with "just more of the same" as in Figure 2’s extrapolation, we’d expect aggressive early hardware adopters to be running 16-core machines (possibly double that if they’re aggressive enough to run dual-CPU workstations with two sockets), and we’d likely expect most general mainstream users to have 4-, 8- or maybe a smattering of 16-core machines (accounting for the time for new chips to be adopted in the marketplace). [[Note: I often get lazy and say “core” to mean all hardware parallelism. In context above and below, it’s clear we’re talking about “cores and threads.”]]

But what if the gating factor, parallel-ready software, goes away? Then CPU vendors would be free to take advantage of options like the one-time 16-fold hardware parallelism jump illustrated in Figure 3, and we get an envelope like that shown in Figure 5.

Figure 5: Extrapolation of “more of the same big cores” and “possible one-time switch to 4x smaller cores plus 4x threads per core” (not counting some transistors being used for other things like on-chip GPUs).

First, let’s look at the lower baseline, ‘most general mainstream users to have [4-16 way parallelism] machines in 2013’? So where are were in 2012 today for mainstream CPU hardware parallelism? Well, Intel Core i7 (e.g., Sandy Bridge, Ivy Bridge) are typically in the 4 to 6 core range – which, with hyperthreading == hardware threads, means 8 to 12 hardware threads.

Second, what about the higher potential line for 2013? As noted above:

  • Intel’s Xeon Phi (then Larrabee) is now delivering 50-64 cores x 4 threads = 200 to 256-way parallelism. That’s no surprise, because this article’s upper line was based on exactly the Larrabee data point (see quote below).
  • GPUs already blow the 256 upper bound away – any machine with a two-year-old Tesla has 1,536-way parallelism for programs (including mainstream programs like DVD encoders) that can harness the GPU.

So not only did we already reach the 2013 upper line early, in 2012, but we already exceeded it for applications that can harness the GPU for computation.

As I said in the article:

I don’t believe either the bottom line or the top line is the exact truth, but as long as sufficient parallel-capable software comes along, the truth will probably be somewhere in between, especially if we have processors that offer a mix of large- and small-core chips, or that use some chip real estate to bring GPUs or other devices on-die. That’s more hardware parallelism, and sooner, than most mainstream developers I’ve encountered expect.

Interestingly, though, we already noted two current examples: Sun’s Niagara, and Intel’s Larrabee, already provide double-digit parallelism in mainstream hardware via smaller cores with four or eight hardware threads each. "Manycore" chips, or perhaps more correctly "manythread" chips, are just waiting to enter the mainstream. Intel could have built a nice 100-core part in 2006. The gating factor is the software that can exploit the hardware parallelism; that is, the gating factor is you and me.

Reader Q&A: How to write a CAS loop using std::atomics

The following is not intended to be a complete treatise on atomics, but just an answer to a specific question.

A colleague asked:

How should one write the following “conditional interlocked” function in the new C++ atomic<> style?

// if (*plValue >= 0) *plValue += lAdd  ; return the original value

LONG MpInterlockedAddNonNegative(__inout LONG volatile* plValue,  __in  LONG const  lAdd) 
{ 
    LONG lValue = 0; 
    for (;;)  {

        lValue = *plValue; // volatile plValue suppress compile optimizations in which

 

                           // lValue is optimized out hence MT correctness is broken

        if (lValue < 0)   break;

        if (lValue == InterlockedCompareExchange(plValue, lValue + lAdd, lValue)) { 
            break; 
        } 
    }

    return lValue; 
}

Note: ISO C/C++ volatile is not for inter-thread communication,[*] but this is legacy code that predates std::atomics and was using a combination of platform-specific volatile semantics and Windows InterlockedXxx APIs.

The answer is to use a CAS loop (see code at top), which for std::atomics is spelled compare_exchange:

  • Use compare_exchange_weak by default when looping on this which generally naturally tolerates spurious failures.
  • Use compare_exchange_strong for single tests when you generally don’t want spurious failures.
  • Usage note: In the code at top we save an explicit reload from ‘a’ in the loop because compare_exchange helpfully (or “helpfully” – this took me a while to discover and remember) stores the actual value in the ‘expected’ value slot on failure. This actually makes loops simpler, though some of us are still have different feelings on different days about whether this subtlety was a good idea… anyway, it’s in the standard.

For the std::atomic version, roughly (compiling in my head), and generalizing to any numeric type just because I’m in the habit, and renaming for symmetry with atomic<T>::fetch_add(), I think this is what you want:

template<typename T>
T fetch_add_if_nonnegative( std::atomic<T>& a,  T val ) {
    T old = a;
    while( old >= 0 && !a.compare_exchange_weak( old, old+val ) )
        { }
    return old;
}

Because the only test in your loop was to break on negative values, it naturally migrated into the loop condition. If you want to do more work, then follow the general pattern which is the following (pasting from the standard, 29.6.5/23 – and note that the explicit “.load()” is unnecessary but some people including the author of this clause of the standard prefer to be pedantically explicit :) ):

[ Example: the expected use of the compare-and-exchange operations is as follows.

The compare-and-exchange operations will update expected when another iteration of the loop is needed.

expected = current.load();

do {

desired = function(expected);

} while (!current.compare_exchange_weak(expected, desired));

—end example ]

So the direct implementation of your function in the general pattern would be:

T old = a; 
do { 
    if( old < 0 ) break; 
} while(!a.compare_exchange_weak( old, old+val ) )


but since that easily moves into the loop test I just did this instead in the code at top:

T old = a; 
while( old >= 0 && !a.compare_exchange_weak( old, old+val ) ) 
    { }

and hoping that no one will discover and point out that I’ve somehow written a subtle bug by trying to make the code cuter just before leaving for a holiday weekend.

 

[*] Here’s the difference between ISO C/C++ volatile vs. std::atomic<T>/atomic_T: ISO C/C++ volatile is intended to be used only for things like hardware access and setjmp/longjmp safety, to express that the variable is in storage that is not guaranteed to follow the C++11 memory model (e.g., the compiler can’t make any assumptions about it). It has nothing to do with inter-thread communication – the proper tool for that is std::atomic<T> which for C compatibility can also be spelled atomic_T (note that in Java and C# this is called volatile which adds to the confusion). For more, see my article “volatile vs. volatile” and Hans Boehm’s ISO C++ paper “Should volatile Acquire Atomicity and Thread Visibility Semantics?”.

C&B Panel: Alexandrescu, Meyers, Sutter on Static If, C++11, and Metaprogramming

The first panel from C++ and Beyond 2012 is now available on Channel 9:

On Static If, C++11 in 2012, Modern Libraries, and Metaprogramming

Andrei Alexandrescu, Scott Meyers, Herb Sutter

Channel 9 was invited to this year’s C++ and Beyond to film some sessions (that will appear on C9 over the coming months!)…

At the end of day 2, Andrei, Herb and Scott graciously agreed to spend some time discussing various modern C++ topics and, even better, answering questions from the community. In fact, the questions from Niners (and a conversation on reddit/r/cpp) drove the conversation.

Here’s what happened…

[more]

“Strong” and “weak” hardware memory models

In Welcome to the Jungle, I predicted that “weak” hardware memory models will disappear. This is true, and it’s happening before our eyes:

  • x86 has always been considered a “strong” hardware memory model that supports sequentially consistent atomics efficiently.
  • The other major architecture, ARM, recently announced that they are now adding strong memory ordering in ARMv8 with the new sequentially consistent ldra and strl instructions, as I predicted they would. (Actually, Hans Boehm and I influenced ARM in this direction, so it was an ever-so-slightly disingenuous prediction…)

However, at least two people have been confused by what I meant by “weak” hardware memory models, so let me clarify what “weak” means – it means something different for hardware memory models and software memory models, so perhaps those aren’t the clearest terms to use.

By “weak (hardware) memory model” CPUs I mean specifically ones that do not natively support efficient sequentially consistent (SC) atomics, because on the software side programming languages have converged on “sequential consistency for data-race-free programs” (SC-DRF, roughly aka DRF0 or RCsc) as the default (C11, C++11) or only (Java 5+) supported software memory model for software. POWER and ARMv7 notoriously do not support SC atomics efficiently.

Hardware that supports only hardware memory models weaker than SC-DRF, meaning that they do not support SC-DRF efficiently, are permanently disadvantaged and will either become stronger or atrophy. As I mentioned specifically in the article, the two main current hardware architectures with what I called “weak” memory models were current ARM (ARMv7) and POWER:

  • ARM recently announced ARMv8 which, as I predicted, is upgrading to SC acquire/release by adding new SC acquire/release instructions ldra and strl that are mandatory in both 32-bit and 64-bit mode. In fact, this is something of an industry first — ARMv8 is the first major CPU architecture to support SC acquire/release instructions directly like this. (Note: That’s for CPUs, but the roadmap for ARM GPUs is similar. ARM GPUs currently have a stronger memory model, namely fully SC; ARM has announced their GPU future roadmap has the GPUs fully coherent with the CPUs, and will likely add “SC load acquire” and “SC store release” to GPUs as well.)
  • It remains to be seen whether POWER will adapt similarly, or die out.

Note that I’ve seen some people call x86 “weak”, but x86 has always been the poster child for a strong (hardware) memory model in all of our software memory model discussions for Java, C, and C++ during the 2000s. Therefore perhaps “weak” and “strong” are not useful terms if they mean different things to some people, and I’ve updated the WttJ text to make this clearer.

I will be discussing this in detail in my atomic<> Weapons talk at C&B next week, which I hope to make freely available online in the near future (as I do most of my talks). I’ll post a link on this blog when I can make it available online.