Effective Concurrency: Maximize Locality, Minimize Contention

The latest Effective Concurrency column, “Maximize Locality, Minimize Contention”, just went live on DDJ’s site, and also appears in the print magazine. From the article:

Want to kill your parallel application’s scalability? Easy: Just add a dash of contention.

Locality is no longer just about fitting well into cache and RAM, but also about avoiding scalability busters by keeping tightly coupled data physically close together and separately used data far, far apart. …

I hope you enjoy it.
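
To give a flavor of the advice before you read the article: the classic way to keep separately used data apart is to pad each thread’s data out to its own cache line. A minimal sketch (not code from the article; it assumes standard C++ <thread> and <atomic> and a 64-byte cache line):

    #include <atomic>
    #include <cstddef>
    #include <cstdio>
    #include <thread>
    #include <vector>

    // Assumed cache line size for current x86 hardware.
    constexpr std::size_t kCacheLine = 64;

    // One counter per thread, aligned (and therefore padded) to a full cache
    // line so that no two threads ever write to the same line.
    struct alignas(kCacheLine) PaddedCounter {
        std::atomic<unsigned long> value{0};
    };

    PaddedCounter counters[4];   // a global array honors the alignment request

    int main() {
        const int threads = 4;
        const unsigned long iterations = 100000000UL;

        std::vector<std::thread> pool;
        for (int t = 0; t < threads; ++t)
            pool.emplace_back([t, iterations] {
                for (unsigned long i = 0; i < iterations; ++i)
                    counters[t].value.fetch_add(1, std::memory_order_relaxed);
            });
        for (std::thread& th : pool) th.join();

        unsigned long total = 0;
        for (const PaddedCounter& c : counters) total += c.value.load();
        std::printf("total = %lu\n", total);
        return 0;
    }

Remove the alignas padding and pack the four counters into one plain array, and the threads will fight over the same cache line on every increment.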
 
Finally, here are links to previous Effective Concurrency columns (based on the dates they hit the web, not the magazine print issue dates):
July 2007 The Pillars of Concurrency
August 2007 How Much Scalability Do You Have or Need?
September 2007 Use Critical Sections (Preferably Locks) to Eliminate Races
October 2007 Apply Critical Sections Consistently
November 2007 Avoid Calling Unknown Code While Inside a Critical Section
December 2007 Use Lock Hierarchies to Avoid Deadlock
January 2008 Break Amdahl’s Law!
February 2008 Going Superlinear
March 2008 Super Linearity and the Bigger Machine
April 2008 Interrupt Politely
May 2008 Maximize Locality, Minimize Contention

11 thoughts on “Effective Concurrency: Maximize Locality, Minimize Contention”

  1. I am still curious whether you or your contacts can explain why changing the distance in my test program causes a few spikes (takes longer).

    I would expect only three possible timing values versus distance:

    1. distance == 0 (slowest, per your article)
    2. distance > 0 but small enough that the counters still share a cache line
    3. distance greater than the L1 and L2 cache line size, so no cache line hit

    But the test program shows many more timing variations.

  2. Note: The sample program will only be available on our FTP site for a short time. I would recommend you pull it and place it on your wiki or some other source if you think it helps the discussion.

    I would also suggest you post an editorial update to DDJ.

  3. At the top of the test program, I include the results from three different computers.

    HP E5440
    Dell 670
    and an AMD box.

    All three have similar patterns:

    1. At distance 0, the counter is semi-shared.
    2. At distances 1 to ~15, the speed increases with increasing distance.
    3. At one or more specific distances < ~15, the speed decreases (in the sample above, the magic distances are 6 and 11).
    On the Dell 670, the magic distances are 6 and 11. On the AMD box, they are 9 and 14.

    I would suggest you run your own tests on several boxes and post the results.

  4. Dennis,

    One immediate thing with your program: there is no guarantee that the threads run on the same core, or on different cores (depending on what you want) – this would skew the results, since we don’t know how the cache is being hit (see the pinning sketch at the end of this comment).

    See: http://msdn.microsoft.com/en-us/library/ms686247(VS.85).aspx
    and:
    http://www.ibm.com/developerworks/linux/library/l-affinity.html

    Although I must admit I have no idea how the E5440 cache is structured. All I can see on the Intel website is that it has 12 MB of cache. I can’t figure out whether it is the same 2x6 MB that I saw on Wikipedia, whether there is an L1 data cache involved, or anything else…

    Also see:
    http://softwarecommunity.intel.com/isn/Community/en-US/forums/thread/30251847.aspx

    Phew!
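
    For what it’s worth, pinning each thread to a fixed core removes that variable. A minimal Windows sketch of the idea (the worker body is elided; on Linux, pthread_setaffinity_np from the second link plays the same role):

        #include <windows.h>
        #include <process.h>

        // Worker body elided; only the pinning step matters here.
        unsigned __stdcall worker(void*) {
            // ... increment a counter as in the test program ...
            return 0;
        }

        int main() {
            const int kThreads = 4;
            HANDLE handles[kThreads];
            for (int t = 0; t < kThreads; ++t) {
                // Start suspended so the affinity is in place before the thread runs.
                handles[t] = (HANDLE)_beginthreadex(nullptr, 0, worker, nullptr,
                                                    CREATE_SUSPENDED, nullptr);
                // Restrict thread t to logical processor t, so every run sees the
                // same core-to-thread mapping (and therefore the same cache sharing).
                SetThreadAffinityMask(handles[t], DWORD_PTR(1) << t);
                ResumeThread(handles[t]);
            }
            WaitForMultipleObjects(kThreads, handles, TRUE, INFINITE);
            for (int t = 0; t < kThreads; ++t) CloseHandle(handles[t]);
            return 0;
        }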

  5. First, the easy answer: I allocate memory with a call to new.
    I then round the pointer up to the nearest 4-byte boundary.
    The counters start at this new offset and are 4-byte unsigned longs.
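
    Concretely, the rounding looks something like this (a simplified sketch, not the exact code from thread_locality.cpp):

        #include <cstdint>
        #include <new>

        // Round a raw pointer up to the next 'alignment'-byte boundary
        // (alignment must be a power of two; 4 in the description above).
        unsigned long* align_up(void* raw, std::uintptr_t alignment) {
            std::uintptr_t p = reinterpret_cast<std::uintptr_t>(raw);
            p = (p + alignment - 1) & ~(alignment - 1);
            return reinterpret_cast<unsigned long*>(p);
        }

        int main() {
            // Over-allocate a little so the rounded-up pointer still has room
            // for the counters that follow it (unsigned long is 4 bytes on the
            // 32-bit Windows build described here).
            void* raw = ::operator new(64 * sizeof(unsigned long) + 4);
            unsigned long* counters = align_up(raw, 4);   // counters[0] starts here
            counters[0] = 0;
            ::operator delete(raw);
            return 0;
        }

    Note that a 4-byte boundary says nothing about cache lines: counters a small distance apart can still land on the same 64-byte line, which is exactly what varying the distance probes.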

    I will post the code to ftp://wsiftp.wsicorp.com/pub/wxdev/dlang/thread_locality.cpp

    As for the cache lock, we did discover that Visual Studio’s compiler translates i++ to a load, add and store sequence. The Intel compiler generates an ‘inc i’ instruction.

    The G++ compiler under Linux generates inc as well.

    At the assembly level, ‘inc’ is about as close to an atomic operation as there is. I do agree that at the micro-code level it is probably multiple transactions.

  6. Dennis, I think the cache line lock doesn’t work the way you think it does. Incrementing a variable is a multi-step operation:

    1. Read from cache into register

    2. Increment

    3. Write back.

    So even with such a lock, the line is only held for the duration of steps 1 and 3, not across step 2 in between, so you still get race conditions.

    As for the timings, have you considered memory alignment as a factor? What size are the counters and what is the distance measured in?
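
    A quick way to see the lost updates is to race a plain counter against an atomic one (a sketch; strictly speaking the unsynchronized increment is undefined behavior, but in practice the plain total comes up short while the atomic total does not):

        #include <atomic>
        #include <cstdio>
        #include <thread>
        #include <vector>

        int main() {
            const int threads = 4;
            const unsigned long iterations = 1000000UL;

            unsigned long plain = 0;                     // ++ is load / add / store
            std::atomic<unsigned long> atomic_count{0};  // one indivisible read-modify-write

            std::vector<std::thread> pool;
            for (int t = 0; t < threads; ++t)
                pool.emplace_back([&] {
                    for (unsigned long i = 0; i < iterations; ++i) {
                        ++plain;                         // racy: concurrent updates get lost
                        atomic_count.fetch_add(1, std::memory_order_relaxed);
                    }
                });
            for (std::thread& th : pool) th.join();

            std::printf("plain:  %lu (expected %lu)\n", plain,
                        (unsigned long)threads * iterations);
            std::printf("atomic: %lu\n", atomic_count.load());
            return 0;
        }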

  7. I question your theory and decided to write a program to verify whether ‘cache line’ locking and locality affect results and performance. The program is more than 40 lines (because it compiles on both Linux and Windows) and I don’t think you want me to paste it directly into this comment block. I will be happy to provide it; just let me know how/where to post it.

    I disagree that there is any real cache line lock. If such a condition existed, we would never have a race condition on a variable. That is partly why mutexes were invented.

    The test program runs ‘t’ threads, each of which iterates ‘i’ times, incrementing a counter that is distance ‘d’ from the next thread’s counter.

    When you run with distance set to ‘0’, all the threads share the exact same counter, and after ‘i’ iterations the result is not ‘t * i’ as you would expect if there were a cache line lock. The value is also not ‘i’. It is somewhere between ‘i’ and ‘t * i’.

    When the distance is set to ‘1’ or larger, each thread completes with its counter set to ‘i’ (as expected). The surprise to me was that the threads do get faster as you increase the distance, but the speedup is not linear and it quickly maxes out.

    Sample run on Windows XP Pro/SP3 on an HP E5440 quad-core Intel Xeon at 2.83 GHz:

    Total time: 11.657619 seconds, distance:0, iterations:400000000, threads:4
    Total time: 11.608698 seconds, distance:1, iterations:400000000, threads:4
    Total time: 20.127997 seconds, distance:2, iterations:400000000, threads:4
    Total time: 15.124538 seconds, distance:3, iterations:400000000, threads:4
    Total time: 5.492766 seconds, distance:4, iterations:400000000, threads:4
    Total time: 5.555509 seconds, distance:5, iterations:400000000, threads:4
    Total time: 14.724832 seconds, distance:6, iterations:400000000, threads:4
    Total time: 7.718066 seconds, distance:7, iterations:400000000, threads:4
    Total time: 3.820296 seconds, distance:8, iterations:400000000, threads:4
    Total time: 3.884410 seconds, distance:9, iterations:400000000, threads:4
    Total time: 4.164538 seconds, distance:10, iterations:400000000, threads:4
    Total time: 11.788299 seconds, distance:11, iterations:400000000, threads:4
    Total time: 11.884536 seconds, distance:11, iterations:400000000, threads:4
    Total time: 4.778139 seconds, distance:12, iterations:400000000, threads:4
    Total time: 1.046083 seconds, distance:13, iterations:400000000, threads:4
    Total time: 1.046171 seconds, distance:15, iterations:400000000, threads:4
    Total time: 1.046150 seconds, distance:16, iterations:400000000, threads:4
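
    In outline, the program looks roughly like this (a simplified sketch rather than the full thread_locality.cpp, which also compiles on both Linux and Windows):

        #include <chrono>
        #include <cstdio>
        #include <cstdlib>
        #include <thread>
        #include <vector>

        // 't' threads each increment their own counter 'i' times; the counters
        // are spaced 'd' unsigned longs apart.  d == 0 makes every thread hit
        // the same counter (deliberately racy, as discussed above); a small
        // d > 0 keeps separate counters on a shared cache line (false sharing);
        // a large d pushes them onto different lines.
        int main(int argc, char** argv) {
            const int threads = 4;
            const unsigned long iterations = 400000000UL;
            const int distance = (argc > 1) ? std::atoi(argv[1]) : 0;

            std::vector<unsigned long> counters(threads * distance + 1, 0);

            auto start = std::chrono::steady_clock::now();
            std::vector<std::thread> pool;
            for (int t = 0; t < threads; ++t) {
                // volatile so every increment really goes through the cache
                // instead of being collapsed into one add by the optimizer.
                volatile unsigned long* counter = &counters[t * distance];
                pool.emplace_back([counter, iterations] {
                    for (unsigned long i = 0; i < iterations; ++i)
                        ++*counter;
                });
            }
            for (std::thread& th : pool) th.join();
            const double secs = std::chrono::duration<double>(
                std::chrono::steady_clock::now() - start).count();

            std::printf("Total time: %f seconds, distance:%d, iterations:%lu, threads:%d\n",
                        secs, distance, iterations, threads);
            return 0;
        }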

  8. Good article, thanks for sharing. I’ve thought of it this way:

    You don’t want your threads working in very different localities, because that will stress the L2 (or L3) cache and cause evictions and cache misses. Also, you don’t want your threads working on memory that’s too close, because then you risk sharing (and race conditions) and false sharing.

    A solution other than padding for cache lines? Consider using Structures of Arrays instead of Arrays of Structures (AOS is what we are taught in CS classes). Looks like that’s the sort of idea that will be discussed next.
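
    Roughly, the difference looks like this (a generic sketch with made-up field names):

        #include <cstddef>
        #include <cstdio>
        #include <vector>

        // Array of Structures: the fields of one particle sit next to each other,
        // so a pass that only reads x also drags y, z and mass through the cache.
        struct ParticleAoS { float x, y, z, mass; };

        // Structure of Arrays: each field is contiguous, so a pass over x streams
        // through memory with no wasted cache-line space, and threads working on
        // different fields do not share cache lines.
        struct ParticlesSoA {
            std::vector<float> x, y, z, mass;
        };

        int main() {
            const std::size_t n = 100000;
            std::vector<ParticleAoS> aos(n);
            ParticlesSoA soa{std::vector<float>(n), std::vector<float>(n),
                             std::vector<float>(n), std::vector<float>(n)};

            float sum_aos = 0, sum_soa = 0;
            for (const ParticleAoS& p : aos) sum_aos += p.x;   // strided reads
            for (float v : soa.x) sum_soa += v;                // contiguous reads
            std::printf("%f %f\n", sum_aos, sum_soa);
            return 0;
        }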

    A question: if consumers bought chips based on the memory architecture rather than frequency or core count, would our memory subsystem still look the way it does? I guess we will never know, since a memory subsystem is much harder to explain to your average CEO than 4>2>1 or 3.0>2.0>1.0.

  9. Nice article, but I can’t help but think that just as we managed to make memory management almost invisible to the programmer, here it comes back in an even more insidious form.

    Not only are the points you specify hard to achieve in a language with managed memory (Java, C#); even if you write C, you have no idea where malloc will put your stuff.

    Also, your tip about keeping hot and cold data separate even when they are part of the same object means that we either have to break the C++ ABI and let objects be laid out in “pieces”, or twist the whole architecture of a system into hot and cold parts.
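
    Short of new language support, the usual hand-rolled workaround is to keep the cold fields behind a pointer so the hot part of the object stays compact (a sketch with made-up types):

        #include <memory>
        #include <string>

        // Rarely used fields live behind a pointer, so the hot part of the object
        // stays small and cache-friendly while the class keeps a single, ordinary
        // layout (no ABI change required).
        struct OrderCold {
            std::string customer_notes;
            std::string audit_trail;
        };

        struct Order {
            // Hot: touched on every processing pass.
            int id = 0;
            double price = 0.0;
            int quantity = 0;

            // Cold: touched only on rare reporting paths.
            std::unique_ptr<OrderCold> cold;

            OrderCold& cold_part() {
                if (!cold) cold.reset(new OrderCold);
                return *cold;
            }
        };

        int main() {
            Order o;
            o.price = 9.99;
            o.cold_part().audit_trail = "created";
            return 0;
        }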

    Is there any research being done in language extensions for this?
