I just saw a tweet that’s worth commenting on:
Almost right, and we have already reached that.
I said something similar to the above, but with two important differences:
- I said hardware “threads,” not only hardware “cores” – it was about the amount of hardware parallelism available on a mainstream system.
- What I gave was a min/max range estimate of roughly 16 to 256 (the latter being threads) under different sets of assumptions.
So: Was I was right about 2013 estimates?
Yes, pretty much, and in fact we already reached or exceeded that in 2011 and 2012:
- Lower estimate line: In 2011 and 2012 parts, Intel Core i7 Sandy Bridge and Ivy Bridge are delivering almost the expected lower baseline, and offering 8-way and 12-way parallelism = 4-6 cores x 2 hardware threads per core.
- Upper estimate line: In 2012, as mentioned in the article (which called it Larrabee, now known as MIC or Xeon Phi) is delivering 200-way to 256-way parallelism = 50-64 cores x 4 hardware threads per core. Also, in 2011 and 2012, GPUs have since emerged into more mainstream use for computation (GPGPU), and likewise offer massive compute parallelism, such as 1,536-way parallelism on a machine having a single NVidia Tesla card.
Yes, mainstream machines do in fact have examples of both ends of the “16 to 256 way parallelism” range. And beyond the upper end of the range, in fact, for those with higher-end graphics cards.
For more on these various kinds of compute cores and threads, see also my article Welcome to the Jungle.
Longer answer follows:
Here’s the main part from article, “Design for Manycore Systems” (August 11, 2009). Remember this was written over three years ago – in the Time Before iPad, when Android was under a year old:
How Much Scalability Does Your Application Need?
So how much parallel scalability should you aim to support in the application you‘re working on today, assuming that it’s compute-bound already or you can add killer features that are compute-bound and also amenable to parallel execution? The answer is that you want to match your application’s scalability to the amount of hardware parallelism in the target hardware that will be available during your application’s expected production or shelf lifetime. As shown in Figure 4, that equates to the number of hardware threads you expect to have on your end users’ machines.
Figure 4: How much concurrency does your program need in order to exploit given hardware?
Let’s say that YourCurrentApplication 1.0 will ship next year (mid-2010), and you expect that it’ll be another 18 months until you ship the 2.0 release (early 2012) and probably another 18 months after that before most users will have upgraded (mid-2013). Then you’d be interested in judging what will be the likely mainstream hardware target up to mid-2013.
If we stick with "just more of the same" as in Figure 2′s extrapolation, we’d expect aggressive early hardware adopters to be running 16-core machines (possibly double that if they’re aggressive enough to run dual-CPU workstations with two sockets), and we’d likely expect most general mainstream users to have 4-, 8- or maybe a smattering of 16-core machines (accounting for the time for new chips to be adopted in the marketplace). [[Note: I often get lazy and say “core” to mean all hardware parallelism. In context above and below, it’s clear we’re talking about “cores and threads.”]]
But what if the gating factor, parallel-ready software, goes away? Then CPU vendors would be free to take advantage of options like the one-time 16-fold hardware parallelism jump illustrated in Figure 3, and we get an envelope like that shown in Figure 5.
Figure 5: Extrapolation of “more of the same big cores” and “possible one-time switch to 4x smaller cores plus 4x threads per core” (not counting some transistors being used for other things like on-chip GPUs).
First, let’s look at the lower baseline, ‘most general mainstream users to have [4-16 way parallelism] machines in 2013’? So where are were in 2012 today for mainstream CPU hardware parallelism? Well, Intel Core i7 (e.g., Sandy Bridge, Ivy Bridge) are typically in the 4 to 6 core range – which, with hyperthreading == hardware threads, means 8 to 12 hardware threads.
Second, what about the higher potential line for 2013? As noted above:
- Intel’s Xeon Phi (then Larrabee) is now delivering 50-64 cores x 4 threads = 200 to 256-way parallelism. That’s no surprise, because this article’s upper line was based on exactly the Larrabee data point (see quote below).
- GPUs already blow the 256 upper bound away – any machine with a two-year-old Tesla has 1,536-way parallelism for programs (including mainstream programs like DVD encoders) that can harness the GPU.
So not only did we already reach the 2013 upper line early, in 2012, but we already exceeded it for applications that can harness the GPU for computation.
As I said in the article:
I don’t believe either the bottom line or the top line is the exact truth, but as long as sufficient parallel-capable software comes along, the truth will probably be somewhere in between, especially if we have processors that offer a mix of large- and small-core chips, or that use some chip real estate to bring GPUs or other devices on-die. That’s more hardware parallelism, and sooner, than most mainstream developers I’ve encountered expect.
Interestingly, though, we already noted two current examples: Sun’s Niagara, and Intel’s Larrabee, already provide double-digit parallelism in mainstream hardware via smaller cores with four or eight hardware threads each. "Manycore" chips, or perhaps more correctly "manythread" chips, are just waiting to enter the mainstream. Intel could have built a nice 100-core part in 2006. The gating factor is the software that can exploit the hardware parallelism; that is, the gating factor is you and me.
Read Full Post »