Effective Concurrency: Lock-Free Code — A False Sense of Security

DDJ posted the next Effective Concurrency column a couple of weeks earlier than usual: “Lock-Free Code: A False Sense of Security” just went live on DDJ’s site, and it also appears in the print magazine.
 
This is a special column in a way, because I rarely critique someone else’s published code. However, mere weeks ago DDJ itself published an article with fundamentally broken code that tried to show how to write a simplified lock-free queue. I corresponded with the author, Petru Marginean, and his main reviewer, Andrei Alexandrescu, to discuss the problems, and they have patched the code somewhat and added a disclaimer to the article. But the issues need to be addressed, and so Petru kindly let me vivisect his code in public in this column (and the next, not yet available, which will show how to do it right).
 
From the article:

[Lock-free code is] hard even for experts. It’s easy to write lock-free code that appears to work, but it’s very difficult to write lock-free code that is correct and performs well. Even good magazines and refereed journals have published a substantial amount of lock-free code that was actually broken in subtle ways and needed correction.

To illustrate, let’s dissect some peer-reviewed lock-free code that was published here in DDJ just two months ago …

I hope you enjoy it. (Note: Yes, the title is a riff on Tom Cargill’s classic article “Exception Handling: A False Sense of Security”, one of Scott Meyers’ five “Most Important C++ Non-Book Publications…Ever”.)
 
Finally, here are links to previous Effective Concurrency columns (based on the magazine print issue dates):
August 2007 The Pillars of Concurrency
September 2007 How Much Scalability Do You Have or Need?
October 2007 Use Critical Sections (Preferably Locks) to Eliminate Races
November 2007 Apply Critical Sections Consistently
December 2007 Avoid Calling Unknown Code While Inside a Critical Section
January 2008 Use Lock Hierarchies to Avoid Deadlock
February 2008 Break Amdahl’s Law!
March 2008 Going Superlinear
April 2008 Super Linearity and the Bigger Machine
May 2008 Interrupt Politely
June 2008 Maximize Locality, Minimize Contention
July 2008 Choose Concurrency-Friendly Data Structures
August 2008 The Many Faces of Deadlock
September 2008 “Lock-Free Code: A False Sense of Security”

13 thoughts on “Effective Concurrency: Lock-Free Code — A False Sense of Security”

  1. Hi Herb,

    Critique is good. Not long ago, I remember when you, Alexandrescu, and friends got pretty much slaughtered on a C++ list for not figuring out atomic semantics, barriers, and ordering problems, in a very similar fashion.

    It all kicked off with a reference-counted string implementation or similar. Source was even provided, and compilers could in fact have been adapted to support it. And no one ever did anything about that?

    It is common knowledge these days that a few barriers eliminate the problem in single-consumer, single-producer scenarios. I guess that is what is taking over a month to publish as people eagerly await a common and trivial solution. At least it has been common since 2001.

    So, critique is good, and no, you are no different from anyone else out there regarding parallel programming.

    It is all trickery against the moving target that is hardware. Hence working on VMs and C++/CLI is really a waste of time. They even go against the stack as the main culprit, which is symptomatic of going the bloated, slow framework route (i.e., doomed, in terms of performance and the time it will take to be adopted).

    Regards,
    Majka

  2. I think that the topic of this month’s column has already been covered by an enlightening paper by Scott Meyers and Andrei Alexandrescu, which appeared in Dr. Dobb’s Journal four years ago:

    C++ and the Perils of Double-Checked Locking
    http://www.aristeia.com/Papers/DDJ_Jul_Aug_2004_revised.pdf
    http://www.nwcpp.org/Downloads/2004/DCLP_notes.pdf

    In a nutshell, all concurrent software we have written in C++98 up to now that does not rely on threading libraries with operating system support (which includes numerous beloved non-blocking algorithms employing volatile) works by accident. In fact, volatile with its optimization-prevention semantics is often sufficient for single-processor parallelism (by accident, however), but that does not mean that optimization prevention using volatile is the key to thread safety under all circumstances, including multi-processor parallelism with weak memory consistency. The following part of the Linux kernel documentation offers a good treatment of volatile and its abuse for achieving thread safety:

    Why the “volatile” type class should not be used
    http://www.mjmwired.net/kernel/Documentation/volatile-considered-harmful.txt

    Following Scott and Andrei, the only way to guarantee thread safety using C++98 is the use of platform-specific memory barriers (in other words: there is no portable solution in C++98), which are essentially also used in platform-specific implementations of threading libraries. So if you roll your own data structures, there’s no way around these operations.

    However, in the realm of embedded systems the world can look quite different: many low-cost controllers do not confront the programmer with topics like memory consistency, so you mostly get along without barriers or the like. Your code works, but — from the language lawyer’s perspective — only by accident.

    Regarding Fred’s ring implementation: when we assume that reading and writing the index variables is done atomically, the protocols described in Scott’s slides (DCLP_notes, slides 29-32) can be employed: ring_get needs an Acquire between reading write_index and accessing buffer, and ring_put needs a Release between writing to buffer and updating write_index.
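
    For concreteness, here is one way that protocol could be sketched with C++11 std::atomic (hypothetical code, not from the article; the names Ring, ring_get, and ring_put are mine, the Acquire/Release fences come from the atomics, and it assumes exactly one producer and one consumer):

```cpp
#include <atomic>
#include <cstddef>
#include <cstdint>

// Hypothetical sketch of Fred's ring with the Acquire/Release protocol
// made explicit via C++11 std::atomic. Single producer, single consumer.
struct Ring {
    std::atomic<std::size_t> read_index{0};
    std::atomic<std::size_t> write_index{0};
    std::size_t buffer_size;
    std::uint8_t *buffer;
};

bool ring_get(Ring *r, std::uint8_t *byte_ptr) {
    std::size_t read_index = r->read_index.load(std::memory_order_relaxed);
    // Acquire: the producer's buffer write must be visible
    // before we read the slot.
    if (read_index == r->write_index.load(std::memory_order_acquire))
        return false; /* ring is empty */
    *byte_ptr = r->buffer[read_index];
    if (++read_index == r->buffer_size)
        read_index = 0;
    // Release: the slot is consumed before the producer may reuse it.
    r->read_index.store(read_index, std::memory_order_release);
    return true;
}

bool ring_put(Ring *r, std::uint8_t byte) {
    std::size_t write_index = r->write_index.load(std::memory_order_relaxed);
    std::size_t next = write_index + 1;
    if (next == r->buffer_size)
        next = 0;
    // Acquire pairs with the consumer's Release of read_index.
    if (next == r->read_index.load(std::memory_order_acquire))
        return false; /* ring is full (holds buffer_size - 1 bytes) */
    r->buffer[write_index] = byte;
    // Release: the byte is in the buffer before write_index advances.
    r->write_index.store(next, std::memory_order_release);
    return true;
}
```

    (The full/empty tests are restructured compared with Fred’s version below, but the capacity is the same: buffer_size - 1 bytes.)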

    By the way: During my lectures I discovered Petru Marginean’s name in the Acknowledgements section of Scott’s and Andrei’s paper. Now the circle closes…

    As others have mentioned, I have a hard time believing that we have to worry about some of the transformations and reorderings mentioned near the end of the article. Or, stated a different way, I have trouble believing it’s possible to write any valid lock-free program if such transformations and reorderings are really possible. This bothers me a lot because I’ve worked with a lot of embedded systems code (including the source code for more than a few of the proprietary and very popular embedded operating systems). I’ve had to verify and correct the operation of a lot of this code for some pretty safety-critical applications. I now wonder, “Is the code I thought was safe really safe?”

    I’d sort of like to get an opinion about one bit of C code that I have seen used over and over to do “lock free” “single reader – single writer” communication. I’ve seen this used on many of the processors out there with lots of compilers and never known there to be a problem. But if processors can insert false writes and totally ignore sequential consistency then maybe this code isn’t valid.

    typedef struct {
        volatile size_t read_index;  /* index of next data to get/read */
        volatile size_t write_index; /* index of next place to put/write data */
        size_t buffer_size;          /* size of ring 'buffer' in bytes */
        uint8_t *buffer;             /* pointer to start of buffer */
    } RING_T;

    typedef RING_T *RING_ID;

    /**
     * Get one byte from a ring buffer.
     *
     * @Param ring_id Ring buffer to manipulate
     * @Param byte_ptr Where to store returned byte
     *
     * @Return true if there was a byte to return, false otherwise
     */
    bool
    ring_get(RING_ID ring_id, uint8_t *byte_ptr)
    {
        size_t read_index;

        read_index = ring_id->read_index;
        if (read_index == ring_id->write_index) {
            /* Ring is empty */
            return false;
        } else {
            /* Pull byte out of ring and advance ring index */
            *byte_ptr = ring_id->buffer[read_index];
            if (++read_index == ring_id->buffer_size) {
                read_index = 0;
            }
            ring_id->read_index = read_index;
            return true;
        }
    }

    /**
     * Initialize ring buffer to be empty.
     *
     * Create a ring buffer of size 'nbytes', and initialize
     * it. Note, a ring buffer can store a
     * maximum of one byte less than 'nbytes'.
     *
     * @Param ring_id Preallocated ring buffer control structure to initialize
     * @Param buffer Buffer that ring buffer data to be stored in
     * @Param nbytes Size of 'buffer' in bytes
     */
    void
    ring_init(RING_ID ring_id, uint8_t *buffer, size_t nbytes)
    {
        /* Save the buffer info in the ring control structure. */
        ring_id->buffer = buffer;
        ring_id->buffer_size = nbytes;

        /* Make sure ring buffer starts out empty */
        ring_id->read_index = 0;
        ring_id->write_index = 0;
    }

    /**
     * Put one byte into a ring buffer.
     *
     * @Param ring_id Ring buffer to manipulate
     * @Param byte Byte to put into ring
     *
     * @Return true if there was space for byte, false otherwise
     */
    bool
    ring_put(RING_ID ring_id, uint8_t byte)
    {
        size_t write_index;

        write_index = ring_id->write_index;
        if (write_index == ring_id->read_index - 1) {
            /* Ring is full so can't write any more */
            return false;
        } else {
            if (write_index == ring_id->buffer_size - 1) {
                if (ring_id->read_index == 0) {
                    /* Ring is full so can't write any more */
                    return false;
                } else {
                    ring_id->buffer[write_index] = byte;
                    ring_id->write_index = 0;
                    return true;
                }
            } else {
                ring_id->buffer[write_index] = byte;
                ring_id->write_index = write_index + 1;
                return true;
            }
        }
    }

  4. To Jaroslav Sevcik:

    Thanks for the explanation! And I see the larger issue underlying the “why not make “data” volatile” part of it, too; it may only be at very specific points in a program that one cares about the volatility of a variable, and prohibitively expensive to make it volatile throughout.

    I would have thought that it would generally be possible to encapsulate the relevant portions of the code in a function and only declare data to be volatile within that function, though. I gather that that isn’t as general as I think?

  5. To Brooks Moses:

    The problem with the C++ volatile is that it is too weak for safe multi-threading, because it only prevents optimizations on the volatile variable itself, but it does not prevent interfering optimization on other variables, such as the ‘payload’ data in the queue.

    While I do not have an example showing where a real compiler does something wrong with this particular implementation of the queue, I can show you an example where C++ volatile is just not enough:

    int data = 0;

    volatile bool requestReady = false;
    volatile bool responseReady = false;

    void thread1() {
        data = 1;
        requestReady = true;

        if (responseReady)
            std::cout << data;
    }

    void thread2() {
        if (requestReady) {
            data = 2;
            responseReady = true;
        }
    }

    Note that in any sequentially consistent execution this program can never print 1. However, compilers will happily replace ‘std::cout << data’ with ‘std::cout << 1’. Suddenly, your program can print 1!

    In Java, this program is data race free (conflicting accesses to volatile variables do not count as data races, and the volatile flags order the accesses to data). In C++0x the program would be data race free if requestReady and responseReady were atomic variables. Both Java and C++0x guarantee sequentially consistent behaviours for data-race-free programs. That means that Java and C++0x guarantee sane behaviours, i.e., the program can only print 2.

    You might suggest that programmers should make the ‘data’ variable volatile, too. However, the ‘data’ variable might be a pointer to some big data structure that is defined in a library, so you cannot make all fields of the data structure volatile. And you certainly do not want to create a ‘volatile’ copy of the entire structure just because you want to send a pointer…

    Below is the real example that prints 1 (compile with gcc -O and pthreads):

    #include <iostream>
    #include <pthread.h>

    int data = 0;

    volatile bool requestReady = false;
    volatile bool responseReady = false;

    void *thread_proc(void *arg) {
        // Wait for request
        for (volatile int i = 0; i < 500000000; i++) ;

        if (requestReady) {
            data = 2;
            responseReady = true;
        }
        return NULL;
    }

    int main() {
        pthread_t t1;
        pthread_create(&t1, NULL, thread_proc, NULL);

        data = 1;
        requestReady = true;

        // Wait for response
        for (volatile int i = 0; i < 1000000000; i++) ;

        if (responseReady)
            std::cout << data << std::endl;
        return 0;
    }
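
    To show the contrast, here is a hypothetical C++11 sketch of the fix (not Jaroslav’s code: std::atomic flags replace the volatiles, std::thread replaces pthreads, and spin-waits plus a join replace the timing loops so the outcome is deterministic):

```cpp
#include <atomic>
#include <thread>

// Hypothetical sketch: the same exchange, but with std::atomic flags.
// The release store / acquire load pairing (the default seq_cst orders
// are at least that strong) forbids the compiler from replacing 'data'
// with 1 and makes the responder's data = 2 visible to the requester.
int data = 0;
std::atomic<bool> requestReady(false);
std::atomic<bool> responseReady(false);

void responder() {
    while (!requestReady.load()) {}  // wait for the request
    data = 2;                        // ordered before the store below
    responseReady.store(true);       // publish the response
}

int run_requester() {
    std::thread t(responder);
    data = 1;
    requestReady.store(true);        // publish data = 1
    while (!responseReady.load()) {} // responder's data = 2 now visible
    int seen = data;                 // guaranteed 2, never 1
    t.join();
    return seen;
}
```

    There is no data race here: the requester’s write of data happens-before the responder’s write via requestReady, and the responder’s write happens-before the final read via responseReady, so the only possible result is 2.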

  6. I’m curious about some of the compiler code-reorganization that you refer to in the article — why isn’t the solution to this problem to declare the relevant variables as “volatile”? After all, the whole point of volatile is to tell the compiler that the variable in question may be changed or read by things outside the scope of the code being compiled, while that code is running, and thus that optimizations which would be broken by that (such as reordering the reads to be after other function calls, or inserting additional writes) should be avoided.

    (See, for example, this DDJ article about the subject: http://www.ddj.com/cpp/184403766.)

  7. If we look closely at the queue constructor and Produce code:

    LockFreeQueue() {
        list.push_back(T());
        iHead = list.begin();
        iTail = list.end(); // <— here
    }

    void Produce(const T& t) {
        list.push_back(t);
        iTail = list.end(); // <— and here
        list.erase(list.begin(), iHead);
    }

    In a typical list implementation, list::end() always returns the very same iterator regardless of prior insertions and removals. This is no surprise, because a straightforward implementation would use a dummy list node to maintain the list: dummy_node.next is the return value of list::begin(), &dummy_node is the return value of list::end(), and dummy_node.prev is the last node (if any) of the list. When the list is empty, &dummy_node == dummy_node.next, so that list::begin() == list::end().

    In other words, the necessity of the iTail member is implementation-dependent and rather questionable.
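
    For what it’s worth, the stability of end() under std::list insertions and erasures is easy to check (a small sketch; it relies on the guarantee that list modifications invalidate only iterators to erased elements):

```cpp
#include <list>

// Sketch: std::list insertions never invalidate iterators, and erase
// invalidates only iterators to the erased elements. So an iterator
// saved from end() keeps comparing equal to end(), which is why caching
// it in an iTail member buys little in a typical implementation.
bool end_is_stable() {
    std::list<int> l;
    std::list<int>::iterator e = l.end(); // end() of the empty list
    l.push_back(1);
    l.push_back(2);
    if (e != l.end()) return false;       // still the same position
    l.erase(l.begin());
    return e == l.end();
}
```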

  8. Thanks for the nice article. While I agree that the code you show is flawed, I am somewhat skeptical that compilers/processors perform the transformations you show on the last page.

    As for the insertion of write: Is there any processor that can introduce new writes of a different value from the ones that are already in the execution? As far as I know, Intel IA-32, AMD and Sun TSO cannot introduce new writes. Intel Itanium sort of can but only with the same value as the existing ones. As for compilers, .NET cannot introduce writes at all, and Java cannot introduce writes out-of-thin-air, which is the case you show. C++0x could insert the write you show, but I am somewhat doubtful that any implementation would do that.

    The speculation on the outcome of the if-statement seems even more suspicious, because it violates sequential consistency for correctly synchronized programs.

    For example, consider this program:

    LockFreeQueue q;
    main() { start_thread(t2); int r; if (q.Consume(r)) printf("!"); }
    t2() { int r; if (q.Consume(r)) printf("!"); }

    The program does not have any data races, because after initialization it only reads. Moreover, it cannot print in any sequentially consistent execution, because q is always an empty queue. However, after your speculation transformation, there is an interleaving where the program can print in a sequentially consistent execution (e.g., if the first thread performs iNext=iHead; iNext++; __temp=iHead; iHead=iNext; then the second thread goes iNext=iHead; iNext++; __temp=iHead; iHead=iNext; if (iNext==iTail) … { t=*iHead; return true; }). This implies that neither Java, nor C#, nor C++ can perform such a transformation, because they guarantee that you only get sequentially consistent behaviors from data-race-free programs.

    Am I wrong somewhere? Do you know any compiler/hardware that performs these transformations? I am sorry for the long comment.

    I suppose I should have scanned your recent articles and read the one about deadlock before posting — I am sure you will correct me and explain that deadlock *can* exist in lock-free designs. :) I think it’s still the case that lock-based designs force programmers to contend with lots of error-prone issues like lock granularity, ordering, etc. Maybe lock-free approaches will have lots of issues of their own, but I think they’re at least worth a shot, and I think people should understand that lock-free data structures are analogous to mutexes themselves: expert-built primitives that most programmers use rather than write.

  10. There’s no doubt that lock-free data structures are extremely difficult to write and get right. But I disagree with the subtext that I perceive from this blog entry and your article that says “lock-free == risky, locks == safe.” To me that seems like the wrong message to be sending.

    There’s no doubt that only the experts should be writing lock-free data structures. But the same is true of locks themselves: only experts who are familiar with the architecture (memory visibility model, barriers, etc) should be writing them. Everyone else just uses the work that the experts do. For example, it doesn’t take an expert to *use* a lock-free queue to coordinate a bunch of threads that are acting in a producer/consumer relationship.

    From the perspective of the average programmer, *using* lock-free data structures shouldn’t be significantly more error-prone than using locks. In fact it could be less so: lock-based programming forces you to deal with deadlock, priority inversion, etc., none of which arise in lock-free designs.

    Good, robust lock-free data structures aren’t ubiquitous like locks are today, but to me that just means that the field is immature, not that lock-free approaches are inherently risky.

Comments are closed.