What will the computer of the future look like?
That may seem like a bizarre question, but it is more important than you think. If Moore’s Law does end, we will reach a point where improvements are far more limited than in the past.
Previously, I wrote that the end result would somewhat resemble automobiles: there are still improvements each generation, but only single-digit ones. Arguably, single-threaded CPU performance has already reached that stage, and it is likely that in a couple of generations multi-threaded tasks will as well, along with GPUs, which are inherently parallel. The computer we see today may remain in its current configuration for decades to come, unless something truly revolutionary replaces it.
I think that, as a society, we have become so accustomed to massive computer progress each generation that it will be a shock in many ways when computers no longer improve exponentially. Some projects, like Exascale by 2020, may be delayed to much longer time scales. In many ways, computers are becoming like every other innovation: there is improvement, but not exponential improvement. It is not just gaming or the consumer end; we need the extra computing power for things like weather simulations, scientific computing, and data centers. There is much that society could gain from extra computing power.
So what will the individual parts look like?
Some parts will see much bigger improvements than others, and many will be largely unchanged.
I think that if Moore’s Law ends, it may make a lot of sense to buy the “best of the best,” so to speak, knowing that next year’s model won’t be much better. In some ways, your computer will become an appliance: it will be replaced when it dies. For non-enthusiasts, it has arguably already reached that point.
Central Processing Unit
The end of single threaded scaling
The CPU is probably the most famous part of any computer; alas, it will be very difficult to improve it for the desktop.
It has become increasingly difficult to find real improvements in single-threaded performance. Until recently, Intel used a Tick-Tock model: a die shrink of the existing architecture one year (the “tick”), followed by a new architecture on that node the next year (the “tock”). At 14nm, that cadence broke, and it appears it will not return, as advancing to new nodes has become ever harder (they are battling the laws of physics here to make it work). Now we seem to be heading towards a new die shrink every three years.
Typically, though, it is the new architectures that see single-threaded gains on the order of 10% over the old architecture, while the die shrinks see much smaller gains (typically as little as 3–4%, and shrinking further with each new node). Note that the 10% is an average: applications that use newer instruction sets (like AVX2 for Intel’s Haswell architecture) saw huge gains (sometimes as large as 30%), while applications that did not often saw much smaller improvements – sometimes as low as 5%. Strictly as a system builder, you are generally best off buying at the new architecture’s introduction or, perhaps increasingly, at the refresh.
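To put those per-generation figures in perspective, here is a minimal Python sketch that compounds them over three tick-tock cycles. The 10% and 3% figures are the rough averages quoted above, not measurements:

```python
def compound_gain(gains):
    """Multiply a list of per-generation speedups into one cumulative factor."""
    total = 1.0
    for g in gains:
        total *= 1.0 + g
    return total

# One cycle: new architecture (~+10%) followed by a die shrink (~+3%).
cycle = [0.10, 0.03]
print(f"After six generations: {compound_gain(cycle * 3):.2f}x")
```

Three full cycles only compound to roughly 1.45x, which is why single-digit generational gains feel so slow compared with the old doubling cadence.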
We saw roughly 10% gains from Nehalem over the older Core (Conroe and its quad-core Kentsfield variant), 10% again with Sandy Bridge, and, although it brought no new instruction sets for the desktop, approximately 10% again with Skylake. Regarding Skylake in particular, it will be fascinating to see whether we get any gains with AVX3 on Skylake-E, assuming we even get the full 512-bit instructions on the HEDT desktops.
It does seem that super-wide architectures might also not deliver the kind of scaling we want – Haswell brought wider registers, improved branch prediction, and other advances, with very modest desktop gains, although, as noted, anything using the new instruction sets fared better. If the Purley leaks are anything to go by, as of 2016 Intel seems to be pushing down this path for the server market.
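Amdahl’s law is a reasonable way to see why wider vector units help some applications far more than others: the overall gain depends on what fraction of the runtime is actually vectorizable. A minimal sketch, with purely illustrative fractions (not measured from any real workload):

```python
def amdahl(vector_fraction, vector_speedup):
    """Overall speedup when only vector_fraction of the runtime
    is accelerated by a factor of vector_speedup (Amdahl's law)."""
    return 1.0 / ((1.0 - vector_fraction) + vector_fraction / vector_speedup)

# Hypothetical numbers: doubling vector throughput gives ~33% overall
# when half the runtime is vectorized, but only ~5% when a tenth is.
print(amdahl(0.5, 2.0))  # heavily vectorized application
print(amdahl(0.1, 2.0))  # mostly scalar application
```

This mirrors the spread quoted above: AVX2-heavy applications seeing ~30% gains on Haswell while scalar-bound ones saw ~5%.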
I think that a future design would look like the following:
- L1 cache: Small and fast, per core, and largely unchanged from the past
- L2 cache: Larger and slower, but mostly unchanged from the past
- L3 cache: This is where it gets interesting. With more cores, and perhaps heterogeneous cores, it would have to be shared between the different core types: large, but still fast
- Finally, if we do see an L4 cache, I think that it will be off die, with either High Bandwidth Memory (HBM) or Hybrid Memory Cube (HMC) or some other stacked, slow-but-wide memory technology.
Broadwell, with its eDRAM acting as an L4 cache, saw some very impressive gains. The claim with HMC is that its bandwidth could be as good as a CPU’s L3 cache. I suspect the latency will not be as good, simply because it will not be on die. The end result will need a controller that can handle the HMC (or whatever stacked RAM is used), and perhaps a second one for the DRAM (unless the hope is to replace DRAM altogether with this technology). Perhaps it would have to be built in a manner not dissimilar to AMD’s Fiji with its HBM memory, through the use of TSVs, or to the Xeon Phi.
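A back-of-the-envelope average-memory-access-time (AMAT) model shows why an off-die L4 can still pay off even if its latency is far worse than on-die L3: it only has to beat DRAM. All cycle counts and hit rates below are invented for illustration, not measured figures:

```python
def amat(levels, dram_latency):
    """Average memory access time for a cache hierarchy.

    levels: list of (hit_time_cycles, hit_rate) from L1 downward.
    Each access pays the hit time of every level it reaches;
    misses at the last level fall through to DRAM.
    """
    time = 0.0
    reach = 1.0  # fraction of accesses that get this far down the hierarchy
    for hit_time, hit_rate in levels:
        time += reach * hit_time
        reach *= 1.0 - hit_rate
    return time + reach * dram_latency

base = [(4, 0.90), (12, 0.80), (40, 0.70)]  # hypothetical L1/L2/L3
with_l4 = base + [(60, 0.90)]               # slow-but-wide stacked L4
print(amat(base, 200))     # no L4: every L3 miss goes to DRAM
print(amat(with_l4, 200))  # L4 absorbs most remaining DRAM trips
```

Even with an L4 half as slow again as L3 in this toy model, average access time drops, because the expensive trips to DRAM become rare.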
There are numerous challenges, of course, to building this design, apart from the sheer cost. I think there will nonetheless be a market among high-end server and workstation CPUs. For more price-sensitive applications, perhaps not as much.
In my next post, we will explore graphics processing units.