I may well be missing something (it wouldn't be the first time!) but I have read and reread almost every reputable article I can find on reordering, SMP, and AMD's guidance, and there is still a lack of concrete clarity here.
Nobody has been able to tell me how VS 2005 and its updated semantics for ‘volatile’ can actually help prevent hardware reordering. Unless one uses LOCK-prefixed instructions, or LFENCE, MFENCE, etc., it seems (even AMD say as much) that hardware reordering may occur in an SMP setting.
Clearly the update to ‘volatile’ in VS 2005 is something that has been circulated within a small circle of experts, because these few experts (AMD etc.) seem to have an understanding of it that differs from what is published by MS.
As someone who is very knowledgeable about concurrency, threading, etc., are you able to assist with a general problem?
Basically, Microsoft makes passing references in several places to "as of VS 2005, volatile now implements acquire/release semantics".
Unfortunately, none of the MSDN pages that actually document ‘volatile’ say anything about this. Everything I see seems vague and lacks an official, detailed explanation, almost as if it were rumor or second-hand information available to whichever writer has written an article.
Furthermore, examining the assembly code generated by VS 2005 for x64 compilations that use ‘volatile’ reveals absolutely no evidence that any barrier operations are being inserted by the compiler, which is bewildering.
I am engaged right now in some challenging C library development for concurrent use in a Windows 64-bit world (XP, Server 2003 and Vista) and frankly I am uncomfortable with the lack of precision in my knowledge here. I have spent days scouring the web, reading as much as I can on this area, but the vagueness remains!
This basically boils down to a few questions; perhaps Microsoft can close this gap and write some definitive article that leaves developers in no doubt:
1. If we use ‘volatile’ in C or C++ code compiled under VS 2005, do we or don't we need to use MemoryBarrier()?
2. If we do need to use MemoryBarrier(), what actually changed about ‘volatile’ in VS 2005?
3. If we don't need to use MemoryBarrier(), can MS explain why x64 assembly code contains no indication that barrier/fence operations are being inserted for references to volatile data items?
4. Can Microsoft consider writing a single, comprehensive article that explains when one needs to use
f) Any special cases of the above based on OS and CPU
Including why one should use one and not another, and which of these affect only the compiler versus which affect the hardware (caches etc.)?
As I say, these items are discussed here and there, but given the need for increased technical excellence when developing concurrent code for various Windows platforms, coupled with the somewhat isolated manner in which these mechanisms are documented, don't you think all of the macros, language extensions, etc. should be discussed and explained with examples in one comprehensive article?
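For what it's worth, the acquire/release behavior that Microsoft attributes to VS 2005 ‘volatile’ corresponds to what ISO C++11 later standardized as load-acquire/store-release on std::atomic. The sketch below shows the pairing in portable terms (names are illustrative, not a Microsoft API): a release store publishes ordinary data, and an acquire load on the other thread is guaranteed to see it.

```cpp
#include <atomic>
#include <thread>

static int payload = 0;                 // ordinary, non-atomic data
static std::atomic<bool> ready{false};  // plays the role of the "volatile" flag

void producer() {
    payload = 42;                                  // ordinary store
    ready.store(true, std::memory_order_release);  // "volatile write" = release
}

int consumer() {
    while (!ready.load(std::memory_order_acquire)) // "volatile read" = acquire
        ;                                          // spin until published
    return payload;  // guaranteed to see 42: acquire synchronizes with release
}

int run_demo() {
    std::thread t(producer);
    int v = consumer();
    t.join();
    return v;
}
```

Note that on x86/x64 the compiler typically needs no fence instruction to implement acquire/release, only a restriction on its own reordering, which may explain why no barrier opcodes show up in the generated assembly.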
Finally, an unrelated question:
Is there ANY difference between multi-core hardware and multi-processor hardware that would lead you to select one in order to perform rigorous concurrency testing?
In other words, can one envisage bugs that would be masked on multi-core but visible on multi-processor, or vice versa?
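One concrete way to probe this is the classic "store buffer" litmus test: with relaxed ordering, both threads can observe the other's flag as 0, an outcome impossible under sequential consistency. How often (or whether) the relaxed outcome shows up can differ between a single multi-core chip and a multi-socket machine, which is exactly why stress-testing on more than one topology is worthwhile. A minimal sketch (all names illustrative):

```cpp
#include <atomic>
#include <thread>

struct Result { int r1, r2; };

// One trial of the store-buffer litmus test.
Result one_trial() {
    std::atomic<int> x{0}, y{0};
    int r1 = -1, r2 = -1;
    std::thread a([&] {
        x.store(1, std::memory_order_relaxed);
        r1 = y.load(std::memory_order_relaxed);
    });
    std::thread b([&] {
        y.store(1, std::memory_order_relaxed);
        r2 = x.load(std::memory_order_relaxed);
    });
    a.join(); b.join();
    return {r1, r2};
}

// Count how often the outcome r1 == 0 && r2 == 0 (forbidden under
// sequential consistency) actually appears on this machine.
int count_relaxed_outcomes(int trials) {
    int n = 0;
    for (int i = 0; i < trials; ++i) {
        Result r = one_trial();
        if (r.r1 == 0 && r.r2 == 0) ++n;
    }
    return n;
}
```

The count is machine- and load-dependent; a zero count on one box proves nothing about another, which is the heart of the testing question above.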
Herb, I was wondering what your thoughts were on languages like Haskell. I see you’ve already mentioned STM, which has a very good implementation in the Glasgow Haskell Compiler, and as far as I can tell solves the "elephant in the room" problem (IO isn’t transactional, so shouldn’t be allowed in an atomic block, and in Haskell the type system ensures this).
I seem to remember you mentioning, in that Channel9 video, the need to be able to mark up parts of code which don't have side effects. This too seems to be solved already in Haskell. A modern Haskell program has many "layers":
* IO: impure, anything goes
* STM: transactional references
* Purely functional code: here's the bulk of the program
* ST: mutable state that can be encapsulated into little "nuggets" (by calling runST) which are statically ensured not to "leak" side effects and can therefore be used by pure code.
The final two layers can call each other, but other than that a layer can only call "down" in this list, which ensures that e.g. a function which doesn’t return an IO action won’t go off and call "launch_nukes()" in a transaction (which then gets rerun).
Now I personally don't think Haskell goes all the way toward solving all of this. For example, I make games for a living, and I find that laziness has a very real performance cost (though something like lenient evaluation may strike the right balance between modularity/expressiveness and speed). It also makes any possibility of an implicitly parallelising runtime much less likely.

The point I'd like to make is this: it seems to me that the "right" way to go about this is to add mutable variables etc. in a safe, "encapsulateable" way on top of a pure layer (see the ST monad), rather than trying to graft a pure layer on top of impure code (the result of which will likely be that nobody uses the pure layer). Haskell proves this concept to my satisfaction. I think the ST monad is a bit clunky to work with, but there's no reason why a language couldn't provide some nice syntactic support for writing impure code with mutable state, which just compiles into something like the ST monad underneath, providing static guarantees that no side effects leak outside the ST type, and that you can therefore use algorithms with mutable state from pure code (important!). The same syntax would work for transactions and IO actions as well.
The way I see it, Haskell gets it almost 100% right. I have a number of gripes with Haskell (laziness, the module system, lack of lightweight records, clunkiness of writing monadic code, etc.), but the main good ideas from Haskell (purity, strong static typing, ST monad for impure code wrapped up inside pure code, STM for when you really need low level shared state threading, etc.) could be borrowed and put inside a language which looks and feels a bit more like C++ (to avoid scaring people off too much!). That would, for me, be pretty much the ultimate language for the coming many-core future.
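Since the comment above asks for these ideas in a language that feels more like C++, here is a rough C++ analogy for the ST-monad pattern: a function that is observationally pure (same input, same output, no visible side effects) yet is implemented with local mutable state. What C++ cannot do is statically *enforce* that the mutation never leaks, which is precisely the guarantee runST adds in Haskell; this is only a discipline-by-convention sketch.

```cpp
#include <algorithm>
#include <vector>

// Observationally pure: callers cannot detect the internal mutation,
// because the argument is a private by-value copy and only the result
// escapes. Internally, it's an ordinary in-place mutable algorithm.
std::vector<int> sorted_copy(std::vector<int> v) {
    std::sort(v.begin(), v.end());  // in-place mutation, invisible outside
    return v;
}
```

In Haskell, the type system would reject any version of this that let a mutable reference escape; in C++ the programmer has to uphold that property by hand.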
If I am reading things right, pthreads (or Win32 threads for that matter) carry the heavy ball-and-chain of spawning a heavyweight kernel-scheduled thread for each thread. Is there anything that could be done for C++ to improve threading performance, or is counting on more cores/processors to bail us out the only strategy?

A good question. Short answer: We won't stay stuck with heavyweight threads, and it's not that hard to beat pthreads. :-) We will definitely see a movement toward future lightweight/user-mode thread runtimes, including very efficient ones based on work stealing (pioneered via Cilk). There's nothing in these implementations that precludes C++, or for that matter any other given language, and I know of C++-fronted prototype implementations being developed within various companies as we speak.
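To make the cost model concrete: a minimal sketch of the lightweight-task direction, where a small fixed pool of OS threads pulls many cheap tasks from a shared queue. (A real work-stealing runtime like Cilk uses per-worker deques and steals from the tails; this single-queue version only illustrates the "few threads, many tasks" idea. All names here are illustrative.)

```cpp
#include <atomic>
#include <condition_variable>
#include <functional>
#include <mutex>
#include <queue>
#include <thread>
#include <vector>

class TinyPool {
    std::queue<std::function<void()>> q;
    std::mutex m;
    std::condition_variable cv;
    std::vector<std::thread> workers;
    bool done = false;
public:
    explicit TinyPool(unsigned n) {
        for (unsigned i = 0; i < n; ++i)
            workers.emplace_back([this] {
                for (;;) {
                    std::function<void()> task;
                    {
                        std::unique_lock<std::mutex> lk(m);
                        cv.wait(lk, [this] { return done || !q.empty(); });
                        if (q.empty()) return;   // shut down once drained
                        task = std::move(q.front());
                        q.pop();
                    }
                    task();                      // run outside the lock
                }
            });
    }
    void submit(std::function<void()> f) {
        { std::lock_guard<std::mutex> lk(m); q.push(std::move(f)); }
        cv.notify_one();
    }
    ~TinyPool() {                                // drain queue, then join
        { std::lock_guard<std::mutex> lk(m); done = true; }
        cv.notify_all();
        for (auto& w : workers) w.join();
    }
};

// Usage: a thousand tasks, but only four OS threads ever exist.
int sum_with_pool() {
    std::atomic<int> sum{0};
    {
        TinyPool pool(4);
        for (int i = 1; i <= 1000; ++i)
            pool.submit([&sum, i] { sum += i; });
    }  // destructor waits for all tasks
    return sum.load();
}
```

Submitting a task here is a queue push, not a kernel thread creation, which is the essence of the performance win Herb alludes to.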
Sebastian and Hugh: Excellent questions. I’ll try to blog some longer answers to these in the future.
In the meantime, here are some very short answers.
Re memory model: Yes, and I’m working to define and document a standard memory model across all Microsoft platforms, and regularize some of the, ah, irregular facilities out there right now. This is being done in conjunction with the ISO C++0x memory model work; see my previous blog posts, or websearch for "Prism" and my name to see drafts.
Re Haskell: For a few thoughts, see the comments on functional languages in this paper I co-wrote with Jim Larus for ACM Queue.