My little New Year’s Week project (and maybe one for you?)

[Updates: Clarified that an intrusive discriminator would be far beyond what most people mean by “C++ ABI break.” Mentioned unique addresses and common initial sequences. Added “unknown” state for passing to opaque functions.]

Here is my little New Year’s Week project: Trying to write a small library to enable compiler support for automatic raw union member access checking.

The problem, and what’s needed

During 2024, I started thinking: What would it take to make C/C++ union accesses type-checked? Obviously, the ideal is to change naked union types to something safe.(*) But because it will take time and effort for the world to adopt any solution that requires making source code changes, I wondered how much of the safety we might be able to get, at what overhead cost, just by recompiling existing code in a way that instruments ordinary union objects?

Note: I describe this in my C++26 Profiles proposal, P3081R0 section 3.7. The following experiment is trying to validate/invalidate the hypothesis that this can be done efficiently enough to warrant including in an ISO C++ opt-in type safety profile. Also, I’m sure this has been tried before; if you know of a recent (last 10 years?) similar attempt that measured its results, please share it in the comments.

What do we need? Obviously, an extra discriminator field to track the currently active member of each C/C++ union object. But we can’t just add that discriminator intrusively inside each union object, because that would change the object’s size and layout. That would be a massive link/ABI incompatibility, even with C compilers and C code on the same platform, which would all need to be updated identically at the same time; and it would break most OSes, whose link compatibility (existing apps, device drivers, …) relies on C ABIs and APIs that use unions in stable interfaces. Breaking that goes far beyond what people usually mean by “C++ taking an ABI break,” which is mostly about evolving C++ standard library types.

So we have to store it… extrinsically? … as-if in a global internally-synchronized map<void* /*union obj address*/, uintNN_t /*discriminator*/>…? But that sounds stupid scary: global thread safety lions, data locality tigers, and even some branches bears, O my! Could such extrinsic storage and additional checking possibly be efficient enough?

My little experiment

I didn’t know, so earlier this year I wrote some code to find out, and this week I cleaned it up and it’s now posted here:

The workhorse is extrinsic_storage<Data>, a fast and scalable lock-free data structure to nonintrusively store additional Data for each pointer key. It’s wait-free for nearly all operations (not just lock-free!), and I’ve never written memory_order_relaxed this often in my life. It’s designed to be cache- and prefetcher-friendly, such as using SoA (structure-of-arrays) to store keys separately so that default hash buckets contain 4 contiguous cache lines of keys. Here I use it for union discriminators, but it’s a general tool that could be considered for any situation where a type needs to store additional data members but can’t store them internally.
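To illustrate the SoA idea (this is an assumed shape with invented names, not the repo’s actual layout), the point is that a bucket’s keys live densely together, so a probe scans a few contiguous cache lines without touching the stored values:

```cpp
#include <array>
#include <atomic>
#include <cassert>
#include <cstdint>

// Illustrative SoA bucket chunk (details assumed, not the library's actual code):
// keys are stored contiguously, separate from values, so scanning a bucket
// touches only dense cache lines of keys.
struct chunk {
    static constexpr int N = 32;  // 32 * 8-byte keys = 4 cache lines of keys (assumed)
    std::array<std::atomic<const void*>, N> keys{};      // nullptr == empty slot
    std::array<std::atomic<std::uint8_t>, N> values{};   // discriminators, stored apart

    // Linear scan of the contiguous key array; returns slot index or -1.
    int find(const void* key) const {
        for (int i = 0; i < N; ++i)
            if (keys[i].load(std::memory_order_acquire) == key) return i;
        return -1;
    }
    // Claim the first empty slot with compare_exchange: lock-free insertion.
    // (A production version would publish the value before the key; kept
    // simple here for illustration.)
    int insert(const void* key, std::uint8_t v) {
        for (int i = 0; i < N; ++i) {
            const void* expected = nullptr;
            if (keys[i].compare_exchange_strong(expected, key,
                                                std::memory_order_acq_rel)) {
                values[i].store(v, std::memory_order_release);
                return i;
            }
            if (expected == key) return i;  // another thread inserted our key
        }
        return -1;  // chunk full: the real structure would link a new chunk
    }
};
```

The payoff is that a miss costs a predictable streaming scan of key-only cache lines, which is what keeps the hardware prefetcher happy.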

If you’re looking for a little New Year’s experiment…

If you’re looking for a little project over the next few days to start off the year, may I suggest one of these:

  • Little Project Suggestion #1: Find a bug or improvement in my little lock-free data structure! I’d be happy to learn how to make it better, fire away! Extra points for showing how to fix the bug or make it run better, such as in a PR or your cloned repo.

  • Little Project Suggestion #2: Minimally extend a C++ compiler (Clang and GCC are open source) as described below, so that every construction/access/destruction of a union type injects a call to my little library’s union_registry<>:: functions which will automatically flag type-unsafe accesses. If you try this, please let me know in the comments what happens when you use the modified compiler on some real world source! I’m curious whether you find true positive union violations in the union-violations.log file – of course it will also contain false positives, because real code does sometimes use unions to do type punning on purpose, but you should be able to eliminate batches of those at a time by their similar text in the log file.

To make #2 easier, here’s union_registry<>, a simple wrapper I’ve provided around the above, targeted at compiler integration. I’ll paste the comment documentation here:

//  For an object U of union type that
//  has a unique address, when              Inject a call to this (zero-based alternative #s)
//
//    U is created initialized                on_set_alternative(&U,0) = the first alternative# is active
//
//    U is created uninitialized              on_set_alternative(&U,invalid)
//
//    U.A = xxx (alt A is assigned to)        on_set_alternative(&U,#A)
//
//    U or U.A is passed to a function by     on_set_alternative(&U,unknown)
//      pointer/reference to non-const
//      and we don't know the function
//      is compiled in this mode
//
//    U.A (alt A is otherwise used)           on_get_alternative(&U,#A)
//      and A is not a common initial
//      sequence
//
//    U is destroyed / goes out of scope      on_destroy(&U)
//
//  That's it. Here's an example:
//    {
//      union Test { int a; double b; };
//      Test t = {42};                        union_registry<>::on_set_alternative(&t,0);
//      std::cout << t.a;                     union_registry<>::on_get_alternative(&t,0);
//      t.b = 3.14159;                        union_registry<>::on_set_alternative(&t,1);
//      std::cout << t.b;                     union_registry<>::on_get_alternative(&t,1);
//    }                                       union_registry<>::on_destroy(&t);
//
//  For all unions with up to 254 alternatives, use union_registry<>
//  For all unions with between 255 and 16k-2 alternatives, use union_registry<uint16_t>
//  If you find a union with >16k-2 alternatives, email me the story and use union_registry<uint32_t>
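For anyone attempting Project #2, here is a toy single-threaded mock of the documented call protocol. The function names follow the documentation above, but the map, the violation list, and the numeric values chosen for the invalid/unknown states are my assumptions, not the real lock-free library:

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <unordered_map>
#include <vector>

// Toy mock of the documented protocol (NOT the real library; single-threaded).
// A discriminator is a zero-based alternative #, or one of two special states
// (the specific reserved values 254/255 are assumptions for this sketch).
struct toy_union_registry {
    static constexpr std::uint8_t invalid = 254;  // created uninitialized
    static constexpr std::uint8_t unknown = 255;  // escaped to un-instrumented code

    std::unordered_map<const void*, std::uint8_t> discr;
    std::vector<std::string> violations;          // stands in for union-violations.log

    void on_set_alternative(const void* u, std::uint8_t alt) { discr[u] = alt; }
    void on_get_alternative(const void* u, std::uint8_t alt) {
        auto it = discr.find(u);
        if (it == discr.end() || it->second == unknown)
            return;                               // can't check: stay silent
        if (it->second == invalid || it->second != alt)
            violations.push_back("bad union access");
    }
    void on_destroy(const void* u) { discr.erase(u); }
};
```

Note how “key not present” and “unknown” both mean the access goes unchecked rather than reported, matching the updates described at the top of the post for unions that pass through code not compiled in this mode.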

Rough initial microbenchmark performance

My test environment:

  • CPU: 2.60 GHz i9-13900H (14 physical cores, 20 logical cores)
  • OSes: Windows 11, running MSVC natively and GCC and Clang via Fedora in WSL2

My test harness provided here:

  • 14 test runs: Each successively uses { 1, 2, 4, 8, 16, 32, 64, 1, 2, 4, 8, 16, 32, 64 } threads
    • Each run tests 1 million union objects, 10,000 at a time, 10 operations on each union; the test type is union Union { char alt0; int alt1; long double alt2; };
    • Each run injects 1 deliberate “type error” failure to trigger detection, which results in a line of text written to union-violations.log that records the bad union access including the source line that committed it (so there’s a little file I/O here too)
  • Totals:
    • 14 million union objects created/destroyed
    • 140 million union object accesses (10 per object, includes construct/set/get/destroy)

On my machine, here is the total run-time overhead (“total checked” time using this checking, minus “total raw” time using only ordinary raw union access) for a typical run of the whole 140M union accesses:

| Compiler | total raw (ms) | total checked (ms) | total overhead (ms) | Notes |
|---|---|---|---|---|
| MSVC 19.40 -O2 | ~190 | ~1020 | ~830 | Compared to -O2, -Ox checked was the same or very slightly slower, and -Os checked was 3x slower |
| GCC 14 -O3 | ~170 | ~800 | ~630 | Compared to -O3, -O2 overall was only slightly slower |
| Clang 18 -O3 | ~170 | ~510 | ~340 | Compared to -O3, -O2 checked was about 40% slower |

Dividing that by 140 million accesses, the per-access overhead is:

| Compiler | total overhead (ns) / total accesses | average overhead / access (ns) |
|---|---|---|
| MSVC | 830M ns / 140M accesses | 5.9 ns / access |
| GCC (midpoint) | 630M ns / 140M accesses | 4.5 ns / access |
| Clang | 340M ns / 140M accesses | 2.4 ns / access |

Finally, recall we’re running on a 2.6 GHz processor = 2.6 clock cycles per ns, so in CPU clock cycles the per-access overhead is:

| Compiler | average overhead / access (cycles) |
|---|---|
| MSVC | 15 cycles / access |
| GCC | 11.7 cycles / access |
| Clang | 6.2 cycles / access |

This… seems too good to be true. I may well be making a silly error (or several) but I’ll post anyway so we can all have fun correcting them! Maybe there’s a silly bug in my code, or I moved a decimal point, or I converted units wrong, but I invite everyone to have fun pointing out the flaw(s) in my New Year’s Day code and/or math – please fire away in the comments.

Elaborating on why this seems too good to be true: Recall that one “access” means checking the global hash table to create/find/destroy the union object’s discriminator tag (using std::atomic liberally), and then also setting or checking the tag (when setting or using one of the union’s members) and/or the key (when constructing or destroying the union object). But even a single L2 cache access is usually around 10-14 cycles! This would mean this microbenchmark is hitting L1 cache almost always, even while iterating over 10,000 active unions at a time, often with more hot threads than there are physical or logical cores pounding on the same global data structure, and occasionally doing a little file I/O to report violations.

Even if I didn’t make any coding/calculation errors, one explanation is that this microbenchmark has great L1 cache locality because the program isn’t doing any other work, and in a real whole program it won’t get to run hot in L1 that often – that’s a valid possibility and concern, and that’s exactly why I’m suggesting Little Project #2, above, if anyone would like to give that little project a try.

In any event, thank you all for all your interest and support for C++ and its evolution and standardization, and I wish all of you and your families a happier and more peaceful 2025!


(*) Today we have std::variant which safely throws an exception if you access the wrong alternative, but variant isn’t as easy to use as union today, and not as type-safe in some ways. For example, the variant members are anonymous so you have to access them by index or by type; and every variant<int,string> in the program is also anonymous == the same type, so we can’t distinguish/overload unrelated variants that happen to have similar alternatives. I think the ideal answer – and it looks like ISO C++ is just 1-2 years from being powerful enough to do this! – will be something like the safe union metaclass using reflection that I’ve implemented in cppfront, which is as easy to use as union and as safe as variant – see my CppCon 2023 keynote starting at 39:16 for a 4-minute discussion of union vs. variant vs a safe union metafunction that uses reflection.

16 thoughts on “My little New Year’s Week project (and maybe one for you?)”

  1. @SirSN

    You can find all WG21 papers here: https://open-std.org/jtc1/sc22/wg21/docs/papers/. It’s the full list, but it’s fairly easy to search for “profiles” and “SG23” there.

    In the week of Jan 13 it will have a 2025 folder that contains my latest profiles paper P3081R1.

    I think Little Project Suggestion #2 is relevant and I look forward to getting more data with real-world workloads.

    There are no SG23-specific public discussion areas, sorry, but anyone is welcome to join as a guest! See https://isocpp.org/std/meetings-and-participation for more information about being a guest.

  2. Herb, is there a good place or website to see “past” and current papers on profiles, and to see what people are currently working on and what remains to be worked upon? P3447R0 encourages people to help in its last section. Does SG23, the study group which (I believe) is focused on Profiles, have a website?

    Is Little Project Suggestion #2 still relevant? The proposal is daunting to me. But, not having to add any annotations and yet still get memory-safety, would be a large gain, and fit with the spirit of avoiding a high frequency of annotations.

    I have some thoughts and many questions on Profiles, but me spamming your blog’s comment section is probably not the best way for sharing them 🙂 . Is there an online discussion forum for Profiles? Discord server, subreddit, other kinds of forums, mailing lists, etc.? /r/cpp/ has issues.

  3. Can something similar be done for use-after-free detection? Something like a reachability mask that marks if a certain address is accessible and the type associated within that address space.

    I get it that the canonical way of solving lifetime safety bugs is to introduce P1179/P3390 style static analyzer, but runtime checks should be much easier to deliver (as it should have much lower false-positive rate for existing codes) and be used as a temporary strategy in the meantime, at the cost of some performance penalty.

    Besides, static analyzers can have bugs (like https://github.com/rust-lang/rust/issues/25860) and runtime checks can be used as an extra defense-in-depth in this case, and I won’t mind trading some performance outside the hot path for software correctness.

  4. @Phil wrote:

    I feel like this should use non-throwing allocation for the chunk and have some appropriate fallback for when memory allocation fails. … I feel like constructing a union on the stack should never result in std::bad_alloc being thrown.

    Thanks! That’s a good point. In principle, we do not want to change the behavior of a correct program, so we should not throw. In practice, a bad_alloc for 296 bytes (the current chunk size) pretty much means the program is about to die an ugly death anyway.

    That said, a related factor is that I was already just updating the code to allow the case of a check for the discriminator of a union that was never entered into the discriminator table (e.g., for the case if the union was created in code that wasn’t compiled with this system, so we understand we can’t check its state). So it now already handles “key not present,” and I could switch to use non-throwing allocation and just not add a new chunk if there isn’t room. — Done, thanks. It doesn’t materially affect the timings.

  5. I feel like this should use non-throwing allocation for the chunk and have some appropriate fallback for when memory allocation fails. I know there is a school of thought that memory allocation failure for small allocations should be treated as fatal, but that’s not the world we live in today. I feel like constructing a union on the stack should never result in std::bad_alloc being thrown.

  6. This may already be part of the proposal for unions and Profiles, but trivially safe unions could be made mandatory for compilers to accept. One example that was mentioned by someone else (excluding the array part) is `union { struct { float x, y, z, w; }; struct { float r, g, b, a; }; };`, since the common initial sequence in this example includes all members of all structs in the union.

  7. @gmit771bb0de048

    it is not clear how the tracking stops when union goes to the C code?

    Yes, I have a to-do to add an ‘unknown’ state, for use when unions are exposed to code that isn’t compiled with this discriminator.

    Update: Done.

  8. Since unions cannot be recursive, the unique address problem could possibly be solved by bundling union type index (decorated name hash?) with address, and using that as a key.

    Unrelated – to me, it is not clear how the tracking stops when union goes to the C code?

  9. I apologize if I’m saying something stupid or too obvious, but this wouldn’t work if you had a union of unions, as multiple unions would have the same address.

  10. The wild animals like global thread safety lions, data locality tigers, and branches bears, and other animals, are too scary for me in this specific case with unions. People on Reddit have mentioned unions inside unions as one possibly problematic case, and unions in C++ are allowed to be inspected for a common initial sequence:

    If two union members are standard-layout types, it’s well-defined to examine their common initial sequence on any compiler.

    An argument for simply requiring [[suppress(type_safety)]] whenever unions are used

    As I understand it, “Reject” in C++ Profiles is allowed to have false positives, and [[suppress(type_safety)]] is used to opt out. “Reject” as I understand it is not allowed to have false negatives.

    If my understanding is correct, then using “Reject” for unions may be an option, even if it leads to a lot of instances of [[suppress(type_safety)]]. My reasoning is that some contemporary languages do exactly this. Rust supports C-style unions, but only in its unsafe subset. And unsafe Rust is harder than C++ according to Armin Ronacher (2022):

    I made the case on Twitter a few days ago that writing unsafe Rust is harder than C or C++, so I figured it might be good to explain what I mean by that.

    Armin Ronacher has held speeches at several conferences, including about Rust. And many others have argued the same.

    And unsafe Rust is not memory safe. Making C-style unions memory safe in C++ would therefore go beyond what some other contemporary languages offer. Rust usage of C-style unions can involve a lot of “unsafe” blocks, which would be similar to C++ usage of C-style unions involving a lot of [[suppress(type_safety)]] blocks.

    Rust does offer memory safe alternatives for similar functionality, namely tagged unions through enum, which are more akin to C++’s std::variant. Rust’s enum is more advanced than C++’s std::variant, since Rust’s enum is based on the tagged union abstraction found in languages like ML, OCaml, Haskell, F# and Scala, and Rust also has pattern matching functionality. Pattern matching, or maybe more ergonomic std::variant support, would IMHO be nice to have in C++, though your work with safe union metaclasses and the future possibilities of C++ reflection sound interesting. Pattern matching can also be a difficult feature to implement well in a language.

    Static checking algorithms

    Question: Is “Reject” allowed to do a “best effort” checking dependent on compiler and compiler version, meaning that different compilers may reject different things depending on what they can figure out is safe or not? Maybe with some upper bound on what false positives a compiler is allowed to reject? Might be a bad idea, since what compiles on one compiler and compiler version might not compile on another compiler. An alternative could be that each new C++ version could require fewer and fewer false positives as analysis is figured out and evolved, enabling more and more code to be handled without needing [[suppress(type_safety)]] or similar opt-out attributes.

    Question: If “Reject” may have false positives, will there be “official” abstract algorithms, or requirements on what “Reject” is obligated to reject and on what “Reject” is obligated to not reject, that compilers must implement?

    To decrease the number of [[suppress(type_safety)]] blocks, the compilers could be mandated to do light analysis checking algorithms to not require [[suppress(type_safety)]] for basic usages of unions, within the constraints that P3081R0 prescribes on compilation times.

    One static checking algorithm may be to mandate that the compiler does light flow control checking. For instance https://godbolt.org/z/v6Ea7vsW8 and https://godbolt.org/z/61vjherzM would be rejected by the compiler using flow control checking, while https://godbolt.org/z/6YWhfj8v8 would be mandatory for the compiler to accept without having to use [[suppress(type_safety)]]. While more complex examples, that are beyond what the compiler can check feasibly without spending too much time during compilation or that would require too advanced algorithms, would be rejected, even if it is a false positive and the code happens to be correct (that case would then require adding [[suppress(type_safety)]] ). This would be a bit similar to the flow typing checking of Typescript, except Typescript can in some cases allow false negatives, while this would not allow false negatives.

    An additional static checking algorithm could be mandating that the compiler does basic safe guessing. A very common pattern is for the first field in any of the union members to be the same type and to basically be the discriminator, storing the discriminator intrinsically. Compilers could be mandated to guess for very specific and basic cases that the first member is the discriminator, and then check accordingly at compile-time that this holds, and be allowed to bail out with rejection on anything non-basic. Another pattern is to have a structure with the discriminator as the first field and the union as the second field, and the compilers could be mandated to guess on this as well.

    Ultimately, my general proposal of static checking algorithms, as well as my specific proposals, may be poor. Going the Rust-approach of just adding “unsafe”/[[suppress(type_safety)]] everywhere unions are used, and always have false positives, may be sufficient. It is not uncommon for Rust codebases to have lots of “unsafe”. This would run counter to the goal of recompiling with few or no changes, however.

    Conclusion

    Also I would like to wish you a happy new year as well.

  11. @Dave:

    Isn’t the call to make_unique if the storage needs to grow considered locking?
    auto pnew = std::make_unique<chunk>();

    I know this will use operator new but isn’t the default implementation to call malloc which will at best contain a lock on the heap but at worst call to the OS for a new page?

    Are you trying to say that other concurrent calls won’t lock whilst a call is allocating?

    Exactly — it’s lock-free because there is no locking synchronization on this shared data structure that blocks other threads trying to use the data structure.

    Any thread may always slow down or suspend for various reasons, even if it doesn’t perform a waiting operation; for example, it could end its quantum and be suspended by the OS, it could just be running on a slow core, it could even be doing a simple memory access that turns out to be a page fault (much worse than an allocation!), etc. The meaning of lock-free here is that regardless of whether any of those things happens to a thread A that’s trying to access this shared data structure, it doesn’t cause a thread B that wants to do something else with that data structure to block, not even if B is also inserting.

    Example: Let’s say threads A and B and C each just constructed a new union object, and each of them is simultaneously trying to add a new data entry for their union object. And let’s say that threads A and B happen to be using unions whose addresses hash to the same bucket, and that that bucket is full so that each simultaneously discovers it needs to allocate a new bucket chunk, and finally that this all happens at a time when there’s a spike of contention on the global allocator so that threads A and B actually have to block to acquire the allocator’s internal lock (or similar). Then even A and B are lock-free with respect to each other, because even if they simultaneously try to compare_exchange to add their own new chunk to the bucket, the compare_exchange loop will make one succeed and then the other succeed, they do not wait for each other. And thread C, which is inserting but has a free null-key slot in its bucket and so just uses that and doesn’t need to allocate at all, continues to be wait-free = not affected by A’s or B’s being blocked. And threads D and E that are each looking up their own existing union object’s data will continue to be wait-free = not affected by A or B or C or each other.
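That compare_exchange race between A and B can be sketched roughly like this (an assumed shape with invented names, not the repo’s actual code):

```cpp
#include <atomic>
#include <cassert>
#include <memory>

// Illustrative sketch of lock-free chunk append (assumed shape, not the repo's
// code). Each thread allocates its own candidate chunk, then races to link it
// with one compare_exchange; the loser simply adopts the winner's chunk instead
// of blocking, so no thread ever waits on another for this data structure.
struct chunk {
    std::atomic<chunk*> next{nullptr};
    // ... key/value slots would live here ...
};

chunk* append_or_reuse(chunk* tail) {
    auto candidate = std::make_unique<chunk>();  // may block inside the allocator,
                                                 // but never blocks *other threads*
    chunk* expected = nullptr;
    if (tail->next.compare_exchange_strong(expected, candidate.get(),
                                           std::memory_order_acq_rel)) {
        return candidate.release();  // we won: our chunk is now linked in
    }
    return expected;                 // we lost: free ours, use the winner's chunk
}
```

The loser’s allocation is simply discarded; nobody spins on a lock, which is the lock-freedom being claimed.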

  12. I really like the idea of this, thanks Herb.

    One minor point is on the documentation:
    Lock-free for insertion if additional storage is needed

    Isn’t the call to make_unique if the storage needs to grow considered locking?
    auto pnew = std::make_unique<chunk>();

    I know this will use operator new but isn’t the default implementation to call malloc which will at best contain a lock on the heap but at worst call to the OS for a new page?

    Are you trying to say that other concurrent calls won’t lock whilst a call is allocating?

    From my perspective I do a lot of real-time audio code and if something is advertised as “lock-free” I would assume it would be non-blocking. But perhaps the concurrency guarantees are more subtle here?

  13. @Tom: In this test, it pretty much stays in L1 cache, but Project #2 would test what happens in real-world code that’s also doing other work. Also check out the instrumentation in the code, if you compile and run with `constexpr auto debug_instrumentation = true;`.

    @Ivan: This actually isn’t so much about what people generally mean by “C++ breaking ABI,” it’s actually more fundamental… for example, changing a union layout would make a compiler incompatible with a C compiler (and C code) on the same platform, which would be a massive breaking change, not to mention that most OS APIs and ABIs are C. I’ll add a note to try to make that clearer.
