TRS-80 vs. Alpha, and Parallel Optimization

Lest people get the wrong idea, I enjoy reading Jeff Atwood’s blog and agree with much of what he writes so entertainingly and provocatively. So far I’ve only responded when I strongly felt differently about something, which has been a grand total of twice now.

So let me also offer an example of something I wholeheartedly agree with. Yesterday, Jeff cited what is also my own favorite Programming Pearls figure:

 

Despite the enduring wonder of the yearly parade of newer, better hardware, we’d also do well to remember my all time favorite graph from Programming Pearls:

trs80vsalpha

Everything is fast for small n.

Spot on. If you’re a professional programmer and haven’t read Programming Pearls yet, “run don’t walk” to your bookstore of choice.

Incidentally, just to tie this in to parallel computing as well, Jeff’s article also cites a nice graph of optimizations that improved NDepend:

Patrick Smacchia’s lessons learned from a real-world focus on performance is a great case study in optimization.

ndepend optimization graph

Patrick was able to improve nDepend analysis performance fourfold, and cut memory consumption in half. As predicted, most of this improvement was algorithmic in nature, but at least half of the overall improvement came from a variety of different optimization techniques.

As I’ve said many times, measure twice, optimize once: Know when and where to optimize. Profilers are your friend. As Patrick writes:

When it comes to enhancing performance there is only one way to do things right: measure and focus your work on the part of the code that really takes the bulk of time, not the one that you think takes the bulk of time.

And of course, in our ever-more-multicore world, the contribution of parallelization gain will continue to grow and dominate the optimization of CPU-bound code. But as Patrick also notes, realizing that gain is not always trivial:

While we get the 15% gain from between 1 and 2 processors, the gain is almost zero between 2 and 4 processors. We identified some potential IO contentions and memory issues that will require more attention in the future. This leads to another lesson: Don’t expect that scaling on many processors will be something easy, even if you don’t share states and don’t use synchronization.