Most of our software development economy is based on a simple promise: the computing power (or “performance”) of a common computer will double every two years. This promise has accompanied us for 40 years now, a time during which our computers got monitors, acquired hard disks and provided RAM beyond the 640 kB that was enough for nobody. In more recent years, we don’t operate systems with one CPU, but with four, eight or even twelve of them. So it came as a great irritation when, ten years ago, Herb Sutter predicted that “the free lunch is over”, and even Gordon Moore, the originator of the Moore’s Law that forms the basis of our simple promise, said that it would only hold true for ten to fifteen more years. Or, in other words, until today.
Irritation
That’s a bit unsettling, to say the least, and should be motivation enough to have a good look at everything we are doing. Intel, the biggest manufacturer of CPUs for computers, indicated earlier this year that Moore’s Law can no longer be fulfilled. So, the free lunch is really over. And it turns out to have some hidden costs. One cost is a certain complacency, the conviction that things will continue to be as they were and that coding styles chiseled over years and decades hold an inherent value of experience.
Complacency
Don’t get me wrong – there is great value in experience, but not all knowledge of the past is helpful for the future. Sometimes, fundamental things change. Just as the tables eventually turn for every optimization trick, we need to reevaluate some axioms of our stance towards performance. Let me reiterate some common knowledge:
There are two types of performance inherently baked into your source code: Theoretical and practical performance.
Performance
The first type is theoretical performance, measured in O(n), O(n²) or even O(n!) and mostly determined by the complexity class of the algorithm you are using. It will translate into runtime behaviour (in the case of O(n!), your software is already dead, you just don’t know it yet), but isn’t concerned with the details of your implementation. Not using an unnecessarily high complexity class for a given problem will continue to be a valuable skill that every developer should master.
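To make the difference between complexity classes tangible, here is a small sketch in Java (class and method names are made up for illustration): both methods answer the same question, but the first one compares every element with every other element, while the second one gets by with a single pass.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class DuplicateCheck {

    // O(n^2): compares every element with every other element.
    static boolean hasDuplicatesQuadratic(List<String> items) {
        for (int i = 0; i < items.size(); i++) {
            for (int j = i + 1; j < items.size(); j++) {
                if (items.get(i).equals(items.get(j))) {
                    return true;
                }
            }
        }
        return false;
    }

    // O(n): a single pass that remembers what it has already seen.
    static boolean hasDuplicatesLinear(List<String> items) {
        Set<String> seen = new HashSet<>();
        for (String item : items) {
            if (!seen.add(item)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        List<String> names = List.of("ada", "grace", "alan", "ada");
        System.out.println(hasDuplicatesQuadratic(names)); // true
        System.out.println(hasDuplicatesLinear(names));    // true
    }
}
```

On a list of a few thousand entries the difference is barely noticeable; on a few million entries, the quadratic version is already dead in the sense described above.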
On the other hand, practical performance is measured in milliseconds (or nanoseconds, if you are into micro-benchmarks and can pull off measuring them correctly) and can depend heavily on just a few lines in your source code. Practical performance is the observable runtime behaviour of your software on a given hardware. There are two subtypes of practical performance:
- Throughput (How many operations are computed by the system in a given unit of time?)
- Latency (How long does it take one operation to be computed by the system?)
If you run a service, throughput is your main metric for performance. If you use a service, latency is your main concern. Let me explain this with the metaphor of a breakfast egg. If you want to eat your breakfast at a hotel buffet and the eggs have run out, your main concern is how fast you will get your freshly boiled egg (latency). But if you run the hotel kitchen, you probably want to cook a lot of eggs at once (throughput), even if that means that one particular egg might boil slightly longer than if you’d boiled each of them individually.
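As a rough sketch of how the two subtypes could be measured, here is a naive Java example built around System.nanoTime(). It is not a rigorous micro-benchmark (a harness like JMH would be the proper tool for that); it only illustrates that latency is the time per operation, while throughput is the number of operations per unit of time.

```java
public class LatencyVsThroughput {

    static long sink; // consume results so the JIT cannot discard the work entirely

    // A stand-in for "one operation" - boiling one egg, so to speak.
    static long operation() {
        long sum = 0;
        for (int i = 0; i < 100_000; i++) {
            sum += i;
        }
        return sum;
    }

    public static void main(String[] args) {
        // Latency: how long does a single operation take?
        long start = System.nanoTime();
        sink += operation();
        long latencyNanos = System.nanoTime() - start;
        System.out.println("latency of one operation: " + latencyNanos + " ns");

        // Throughput: how many operations complete within one second?
        long operations = 0;
        long deadline = System.nanoTime() + 1_000_000_000L;
        while (System.nanoTime() < deadline) {
            sink += operation();
            operations++;
        }
        System.out.println("throughput: " + operations + " operations/second");
    }
}
```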
Latency
Those two subtypes are not entirely independent of each other, but the main concern in most performance-related work done by developers is latency. It is relatively easy to measure and to reason about. If you work on latency-based performance issues, you should know the latency numbers every programmer should know, either in visual form or translated to a more human time scale. Let’s go through some of the numbers and their scaled counterparts here:
- 1 CPU cycle (0.3 ns): 1 second
- Level 1 cache access (0.9 ns): 3 seconds
- Branch mispredict (2.5 ns): 8 seconds
- Level 2 cache access (2.8 ns): 9 seconds
- Level 3 cache access (12.9 ns): 43 seconds
- Main memory access (120 ns): 6 minutes
- Solid-state disk I/O (50-150 μs): 2-6 days
- Rotational disk I/O (1-10 ms): 1-12 months
We can discuss any number in detail, but the overall message stands out nonetheless: CPUs are lightning fast and caches are the only system components that can somewhat keep up. As soon as your program hits the RAM, your peak performance is lost. This brings us to the main concept of latency optimization:
Your program’s latency is ultimately decided by your ability to decrease cache misses.
You can save CPU cycles by performing clever hacks, but if you are able to always read your data (and code) from the cache, you’ll be 360 times faster than if your program constantly has to read from RAM. Your source code doesn’t have to change at all for this to happen. A good compiler and/or optimizing runtime can work wonders if you adhere to your programming language’s memory model. In reality, you’ll probably have to rearrange your instructions and align your data structures. That’s the performance optimization of today, not the old-fashioned skimping on CPU cycles. The big challenge is that none of these aspects are visible at the source code level of your program. We currently have to develop our programs somewhat blindfolded.
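A minimal Java sketch of what “rearranging your instructions and aligning your data structures” can look like: both loops below compute the same sum over the same matrix, but one walks memory sequentially while the other jumps between rows on every access, which typically causes far more cache misses. The matrix size and the exact timings are, of course, machine-dependent.

```java
public class TraversalOrder {

    static final int SIZE = 4_096;

    public static void main(String[] args) {
        int[][] matrix = new int[SIZE][SIZE];

        // Row-wise: walks memory sequentially, so most accesses hit the cache.
        long start = System.nanoTime();
        long sum = 0;
        for (int row = 0; row < SIZE; row++) {
            for (int col = 0; col < SIZE; col++) {
                sum += matrix[row][col];
            }
        }
        System.out.println("row-wise:    " + (System.nanoTime() - start) / 1_000_000 + " ms");

        // Column-wise: same logic at the source level, but every access lands
        // in a different row array, defeating the cache.
        start = System.nanoTime();
        sum = 0;
        for (int col = 0; col < SIZE; col++) {
            for (int row = 0; row < SIZE; row++) {
                sum += matrix[row][col];
            }
        }
        System.out.println("column-wise: " + (System.nanoTime() - start) / 1_000_000 + " ms");
    }
}
```

Nothing in the source code hints at the difference; only the access pattern relative to the memory layout changes, which is exactly the blindfold mentioned above.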
Concurrency
One way we’ve held up Moore’s Law over the last ten years was the introduction of multiprocessor computing into normal computers. If you cram two CPUs onto the die, the number of transistors on it doubles. A single-threaded program doesn’t run any faster, though. So we need to look at concurrent programming to unlock the full power of our systems. Basically, there are two types of concurrent programming: deliberate and mechanical.
- Deliberate concurrent programming means that you as the developer actively introduce threads, fibers or similar concepts into your source code to control parallel computation (see the sketch after this list).
- Mechanical concurrent programming means that your source code can be parallelized by the compiler and/or runtime and/or the hardware (e.g. hyper-threading) without changing the correctness of your program.
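Here is a hedged sketch in Java of what deliberate concurrent programming can look like, splitting a summation across an ExecutorService. The plain loop at the end stands in for mechanical parallelization, which may happen in the JIT compiler (e.g. auto-vectorization) or the hardware without any change to the source.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class DeliberateConcurrency {

    public static void main(String[] args) throws InterruptedException, ExecutionException {
        int[] data = new int[10_000_000];

        // Deliberate: we explicitly split the work across a thread pool.
        int threads = Runtime.getRuntime().availableProcessors();
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        int chunk = data.length / threads;
        List<Callable<Long>> tasks = new ArrayList<>();
        for (int t = 0; t < threads; t++) {
            int from = t * chunk;
            int to = (t == threads - 1) ? data.length : from + chunk;
            tasks.add(() -> {
                long partial = 0;
                for (int i = from; i < to; i++) {
                    partial += data[i];
                }
                return partial;
            });
        }
        long total = 0;
        for (Future<Long> result : pool.invokeAll(tasks)) {
            total += result.get();
        }
        pool.shutdown();
        System.out.println("deliberate sum: " + total);

        // Mechanical: a plain loop like this may be auto-vectorized by the
        // JIT compiler without any change to the source code.
        long plain = 0;
        for (int value : data) {
            plain += value;
        }
        System.out.println("plain sum: " + plain);
    }
}
```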
In both types of concurrent programming, you need to be aware of the constraints and limitations of correct concurrency. It doesn’t matter if your program is blazingly fast and utilizes all cores if the result is wrong or only occasionally correct. Once again, the memory model of your programming language is a useful set of rules and abstractions to guide you. Most higher-level concurrency models like actors narrow your possibilities even further, with functional programming being one of the strictest (and most powerful) ones.
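A classic illustration of why the memory model matters, assuming Java as the language: without the volatile keyword, nothing guarantees that the worker thread ever sees the main thread’s write to the flag, so the program may never terminate even though it looks obviously correct.

```java
public class StopFlag {

    // Without 'volatile', the Java Memory Model does not guarantee that the
    // worker thread ever sees the write from the main thread - the program
    // might utilize its core at full speed and still never stop.
    private static volatile boolean stopRequested = false;

    public static void main(String[] args) throws InterruptedException {
        Thread worker = new Thread(() -> {
            long iterations = 0;
            while (!stopRequested) {
                iterations++;
            }
            System.out.println("stopped after " + iterations + " iterations");
        });
        worker.start();

        Thread.sleep(1000);
        stopRequested = true; // this write is now guaranteed to become visible
        worker.join();
    }
}
```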
In the field of software development, we are theoretically well-prepared to take on the task of pervasive concurrent programming. But we need to forget about the good old times of single-core predictability and embrace the chaotic world of raw computing power, the world of cores and caches.
Cores ‘n’ Caches
This is our life now: we rely on Cores ‘n’ Caches to feed us performance, but the lunch isn’t free anymore. We have to adapt and adjust, to rethink our core axioms and let go of those parts of our experience that are now hindrances. Just like Rock ‘n’ Roll changed the rules in the music business, our new motto changes ours.
Let’s rock.
By using a machine model that has a layered memory architecture, you can actually handle caches in the theoretical performance abstraction. You usually get the cache size as another variable in your Big-O analysis, so this does get a little more complex.
But for some problems, you can find great algorithms regardless of the cache size: https://en.wikipedia.org/wiki/Cache-oblivious_algorithm
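As a hedged sketch of such a cache-oblivious algorithm in Java: a recursive matrix transpose that keeps splitting the longer dimension until the blocks are small enough to fit in whatever cache happens to be present, without ever using the cache size as a parameter (the base-case constant below only limits recursion overhead and is not tied to any particular cache).

```java
public class CacheObliviousTranspose {

    private static final int BASE_CASE = 16; // cuts recursion overhead, not cache-specific

    // Transposes the block src[rowStart..rowEnd)[colStart..colEnd) into dest,
    // recursively splitting the longer side until the block is tiny.
    static void transpose(double[][] src, double[][] dest,
                          int rowStart, int rowEnd, int colStart, int colEnd) {
        int rows = rowEnd - rowStart;
        int cols = colEnd - colStart;
        if (rows <= BASE_CASE && cols <= BASE_CASE) {
            for (int i = rowStart; i < rowEnd; i++) {
                for (int j = colStart; j < colEnd; j++) {
                    dest[j][i] = src[i][j];
                }
            }
        } else if (rows >= cols) {
            int mid = rowStart + rows / 2;
            transpose(src, dest, rowStart, mid, colStart, colEnd);
            transpose(src, dest, mid, rowEnd, colStart, colEnd);
        } else {
            int mid = colStart + cols / 2;
            transpose(src, dest, rowStart, rowEnd, colStart, mid);
            transpose(src, dest, rowStart, rowEnd, mid, colEnd);
        }
    }

    public static void main(String[] args) {
        int n = 1_000;
        double[][] matrix = new double[n][n];
        double[][] transposed = new double[n][n];
        matrix[2][7] = 42.0;
        transpose(matrix, transposed, 0, n, 0, n);
        System.out.println(transposed[7][2]); // 42.0
    }
}
```

However small the blocks become, they eventually fit into whichever cache level is available, which is exactly the point of the cache-oblivious approach.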