Grails Domain update optimisation

As many readers may know, we have been developing and maintaining some Grails applications for more than 10 years now. One of the main selling points of Grails is its domain model and its object-relational mapper (ORM) called GORM.

In general, ORMs are useful for easy and convenient development, at the cost of a bit of performance and flexibility. One of the best features of GORM is the availability of several flexible APIs for use cases where dynamic finders are not enough. Let us look at a real-world example.

The performance problem

In one part of our application we have personal messages that are marked as read after viewing. For some users there can be quite a lot of messages, so we implemented a “mark all as read” feature. The naive implementation looks like this:

def markAllAsRead() {
    def user = securityService.loggedInUser
    def messages = Messages.findAllByAddresseeAndRead(user, false)
    messages.each { message ->
        message.read = true
        message.save()
    }
    Messages.withSession { session -> session.flush()}
 }

While this is both correct and simple, it only works well for a limited number of messages per user. Performance degrades because all the matching rows are loaded into domain objects, modified and saved one by one to the session. Finally, the session is flushed to the database. In our use case this could take several seconds, which is much too long for a good user experience.

DetachedCriteria to the rescue

GORM offers a much better solution for such use cases that does not sacrifice expressiveness: a succinct API called “Where Queries” that creates a DetachedCriteria and supports batch updates.

def markAllAsRead() {
    def user = securityService.loggedInUser
    def messages = Messages.where {
        read == false
        addressee == user
    }
    messages.updateAll(read: true)
}

This implementation takes only a few milliseconds to execute on the same dataset as above, which is essentially native SQL performance.

Conclusion

Before cursing GORM for bad performance, take a deeper look at the alternative querying APIs like Where Queries, Criteria, DetachedCriteria, SQL projections and restrictions to enhance your ORM toolbox. Compared to dynamic finders and GORM methods on domain objects, they offer better composability and performance without resorting to HQL or plain SQL.

Partitioning in Oracle Database: Because Who Wants to Search an Endless Table?

As data volumes continue to grow, managing large database tables and indexes can become a challenge. This is where partitioning comes in. Partitioning is a feature of database systems that allows you to divide large tables and indexes into smaller, more manageable parts, known as partitions. This can improve the performance and manageability of your database. Aside from performance considerations, maintenance operations, such as backups and index rebuilds, can become easier by allowing them to be performed on smaller subsets of data.

This is achieved by reducing the amount of data that needs to be scanned during query execution. When a query is executed, the database can use the partitioning information to skip over partitions that do not contain the relevant data, instead of having to scan the entire table. This reduces the amount of I/O required to execute the query, which can result in significant performance gains, especially for large tables.

There are several types of partitioning available in Oracle Database, including range partitioning, hash partitioning, list partitioning, and composite partitioning. Each type of partitioning is suited to different use cases and can be used to optimize the performance of your database in different ways. In this blog post we will look at range partitioning.

Range partitioning

Here is an example of range-based partitioning in Oracle:

CREATE TABLE books (
  id NUMBER,
  title VARCHAR2(200),
  publication_year NUMBER
)
PARTITION BY RANGE (publication_year) (
  PARTITION p_before_2000 VALUES LESS THAN (2000),
  PARTITION p_2000s VALUES LESS THAN (2010),
  PARTITION p_2010s VALUES LESS THAN (2020),
  PARTITION p_after_2020 VALUES LESS THAN (MAXVALUE)
);

In this example, we have created a table called books that stores book titles, partitioned by the year of publication. We have defined four partitions, p_before_2000, p_2000s, p_2010s, and p_after_2020.

Now, when we insert data into the books table, it will automatically be placed in the appropriate partition based on the year of publication:

INSERT INTO books (id, title, publication_year)
  VALUES (1, 'Nineteen Eighty-Four', 1949);

This book will be inserted into partition p_before_2000, as the year of publication is before 2000. The following book will be placed into partition p_2000s:

INSERT INTO books (id, title, publication_year)
  VALUES (2, 'The Hunger Games', 2008);

When we query the books table, the database will only access the partitions that contain the data we need. For example, if we want to retrieve data for books published in 2015 and 2016, the database will only access partition p_2010s.

SELECT * FROM books WHERE publication_year >= 2015 AND publication_year <= 2016;

However, you should be aware that while partitioning can improve query performance for some types of queries, it can also negatively impact query performance for others, especially if the partitioning scheme does not align well with the query patterns. Therefore, you should tailor the partitioning to your needs and check if it brings the desired effect.

Understanding, identifying and fixing the N+1 query problem

One of the most common performance pitfalls for applications accessing data from databases is the so-called “N+1 query problem”, sometimes also called the “N+1 selects problem”. It is the first thing you should look for when an application has performance issues related to database access. It is especially easy to run into with object-relational mappers (ORMs).

The problem

The problem typically arises when your entity-relationship model has a 1:n or n:m association. It exists when application code executes one query to get objects of one entity and then executes another query for each of these objects to get the objects of an associated entity. An example would be a blog application that executes one query to fetch all authors whose names start with the letter ‘B’, and then another query for each of these authors to fetch their articles. In pseudocode:

# The 1 query
authors = sql("SELECT * FROM author WHERE name LIKE 'B%'");

# The N queries
articles = []
FOR EACH author IN authors:
    articles += sql("SELECT * FROM article WHERE author_id=:aid", aid: author.id)

The first query is the “1” in “N+1”, the following queries in the loop are the “N”.

Of course, to anybody who knows SQL this is a very naive way to get the desired result. However, OR mappers often seduce their users into writing inefficient database access code by hiding the SQL queries and letting them reach for the normal tools of their favorite programming language, like loops or collection operations such as map. A lot of popular web application frameworks come with OR mappers: Rails with Active Record, Grails with GORM (Hibernate based), Laravel with Eloquent.

How to detect

The easiest way to detect the problem in an application is to log the database queries. Virtually all ORMs have a configuration option to enable query logging.

For Grails/GORM the logging can be enabled per data source in the application.yml config file:

dataSource:
    logSql: true
    formatSql: true

For Rails/ActiveRecord query logging is automatically enabled in the development environment. Since Rails 5.2 the Verbose Query Logs format is enabled by default; in earlier versions you had to enable it explicitly.

For Laravel/Eloquent you can enable and access the query log with these two methods/functions:

DB::connection()->enableQueryLog();
DB::getQueryLog();

Once query logging is enabled you will quickly see if the same query is executed over and over again, usually indicating the presence of the N+1 problem.

How to fix

The goal is to replace the N+1 queries with a single query. In SQL this means joining. The example above would be written as a single query:

SELECT article.*
FROM article
JOIN author
  ON article.author_id=author.id
WHERE author.name LIKE 'B%'

The query interface of ORMs usually allows you to write joins as well. Here is the example in ActiveRecord:

Article.joins(:author).where("authors.name LIKE ?", "B%")

Another option when using ORMs is to enable eager loading for associations. In GORM this can be enabled via the fetchMode static property:

class Author {
    static hasMany = [articles: Article]
    static fetchMode = [articles: 'eager']
}

REST APIs

The problem isn’t limited to SQL databases and SQL queries. For REST APIs it’s the “N+1 requests problem”, describing the situation where a client application has to call the server N+1 times to fetch one collection resource plus N child resources. Here the REST API has to be extended or modified to serve the client’s use cases with a single request. Another option is to offer a GraphQL API instead of a REST API. GraphQL is a query language for HTTP APIs that allows complex queries, so the client application can specify exactly which resources it needs in a single request.

Hacking one‘s wetware by sleeping slantwise

Over the turn of the year I took some weeks off in order to work on some private projects, finish some books that I had left halfway-finished for far too long, and of course, to reflect on what concepts like New Year could actually mean. As usual, the short answer is… not much per se, but one can seize the occasion to contrive a few goals for the year. You know, not these mundane, marginal resolutions like “on February 20th I will definitely go for a run”, or ambiguous abstracts like “in 2021 I’m finally gonna be a people person!”, but a more profound search for something new, a kind of evaluation of an undiscovered instrument in the toolkit of one’s being, the juicy stuff.

Now we’ve seen for quite some years that the world of self-improvement likes to border on the superstitious. From particularly fine-tuned compositions of one’s diet plan to the sheer religious belief in certain routines, there’s no shortage of shady suggestions that draw their data from unique success stories, which makes me wonder: even if there was a kind of biological truth to the underlying claims, how could I survive the cringing of my heart that I would experience by reading these articles?

A special downside of solutions of the “collection of very intricate details” kind is that they aim to intervene in your life at a very incessant level. That is, you have to think about them all day for fear of breaking the patterns, i.e. you not only distract yourself from the stuff you actually want to do, but you also don’t get clear feedback: if something appears to help, you don’t know what exactly it is that helps, and if there’s no effect to be noticed, you don’t know which detail might be done wrong and blasting away all the other efforts. I guess I would only resort to such methods if I had the impression that something’s seriously wrong with me, and I currently don’t notice people telling me this more than once a week, so it’s fine.

Then there are the kinds of solutions that I consider just minor variations of the stuff one already sort of knows. E.g. I don’t consider “doing some physical movement once in a while is quite ok” a real form of “bio-hack”, as that is just common knowledge. Ok, you still have to actually apply it, fair enough, but from an intellectual standpoint it’s boring.

So anyway, I’m still convinced there ought to be some ways of modifying your lifestyle at quite a beginner-suitable level, ones that can easily be opted into with little risk or investment, and that’s where I usually start listening.

A few years back, for instance, that was the discovery of intermittent fasting, which in my case happened to show nearly instantaneous effects, mostly positive (e.g. for my subjective impressions of attention and overall well-being). This is something I’d at least easily recommend trying out a few times, but as for me, I’m not really inclined to implement it right now.

One can probably be a bit ambivalent about all these over-the-(online-)counter supplement prescriptions that are listed everywhere. I’m not going to recommend the regular use of any substance here, but there’s plenty of articles about nootropics and other cognitive enhancers; some of these articles also border on the quasi-religious realm, others appear more scientific (the usage of coffee lives in the same domain, and I like that a lot), so I’ll leave it to the reader to form their own opinion.

Having said that, I can now finally reveal what this post is all about, as I seem to have found another intriguing way which I’ve been trying out for a few weeks now. I found the claim that it is advantageous to sleep on an inclined bed. That certainly was outside my current scope; the underlying claims are about an improved flow of your glymphatic system, as well as pressure regulation. Be that as it may, it doesn’t sound harmful and it’s easily obtainable: raise your bed’s head end by about 10-15 cm with some suitable, stable risers, available for under 20 €. I found it only weird for the first night and seemed to wake up in a better mood. Together with such a nice add-on as a wake-up light (if you have one), this certainly qualifies as a “why didn’t anyone tell me earlier?” moment for me.

So, if you have more intriguing bio-hacks that you consider definitely on the non-quacky side, I’m interested 🙂

Transposition as a programming technique

If you have been programming for a while, you will probably, and hopefully, agree that it is preferable to have a sequence of functions as opposed to the same number of functions nested. In other words, call-graph breadth is better than depth. Among other reasons, a “linear” set of instructions is often easier to follow, which is better for humans, and also tends not to go haywire with what memory it touches, which is better for computers.

However, deep call hierarchies occur much more often than I would like. I have seen call stacks well beyond 200 functions deep. But this need not be: one can often be turned into the other by transposition. Transposition derives from the Latin transponere, which roughly means “to put across”. With matrices, it means swapping rows and columns. Similarly, we can swap call-hierarchy depth for breadth.

The example

A couple of months ago, I was tasked with programming a standing-wave display of power-line voltage curves. As you might know, the signal is roughly sine-wave shaped at about 50Hz. The signal is captured in time windows of 200ms, i.e. there’s a new packet of data 5 times a second with 10 sine cycles in it. However, the frequency value jitters just enough to make the signal drift a bit in the 200ms window, i.e. the wave moves forwards and backwards a little bit. The standing-wave feature tries to remove that drift and make it seemingly stationary in our fixed time window, so changes in amplitude become more visible.

Algorithm 1

The idea seems simple enough for just one signal:

  1. In the previous wave, search backwards to find the spot where the wave crosses from positive to negative.
  2. Take the previous wave from that point on and stitch it together with the current one, and cut that off at 200ms of data.

But there is not just one signal, there can be hundreds. And they should all be aligned to one designated “master” signal. So now we add extra steps:

  3. For all other signals, find the wave packets overlapping (in time) with our new stitched wave packet.
  4. Order them, and stitch them to a new wave packet covering exactly the same time window.

Now even in this version, finding the right packets for a time interval can be trickier than it seems, because the values for the signals come in irregularly and can be shifted significantly. So you can just buffer the last N (5?) packets for each signal and search in there. Still, one more requirement remains. For the display of archived data, the algorithm should work on batches of waves, i.e. many seconds worth, which made step 3 harder by extending the search space. So add:

  0. For each previous and current wave pair in a given time interval:

Now the whole thing was pretty much implemented with steps 0 to 4 being functions calling into the next step, with major loops on the 0th and 3rd step. The wave data flows through these implementation layers vertically, i.e. from step 0 to step 4 and back, but the control flow of the program does not. It flows perpendicular to it, horizontally, solely controlled by the outer-most loop. It is intuitive to write it this way – after all, the control flow follows the flow of time in the data we are processing – but the code was not particularly easy to follow, especially with the search in step 3 becoming unnecessarily complex.

Algorithm 2

Now let us try transposing this, and match the flow of data with our control flow:

  1. Gather all relevant signals for the time interval and sort their packets.
  2. Extract all the “stitching” time codes from the master signal.
  3. For all signals, traverse pairs together with the time codes and stitch accordingly.

The whole process becomes more digestible, and processing the data in stages made it obvious that sorted data makes using a “merge” type algorithm very easy.
Both algorithms use the same data, but the second makes it explicit, while the first just passes it through the call-stack in chunks.
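To show just the shape of that change, here is a deliberately tiny, self-contained sketch (not the wave code; the two stages scale and offset are made up): the same computation written once as a per-element call chain and once transposed into stages over the whole collection.

#include <cstdio>
#include <vector>

double scale(double x)  { return x * 2.0; }
double offset(double x) { return x + 1.0; }

int main()
{
  std::vector<double> data{1.0, 2.0, 3.0};

  // "Deep": the control flow follows each element through every step.
  std::vector<double> deep;
  for (double x : data)
    deep.push_back(offset(scale(x)));

  // "Transposed": the control flow follows the data through one stage at a time.
  std::vector<double> wide = data;
  for (double& x : wide) x = scale(x);   // stage 1 over the whole collection
  for (double& x : wide) x = offset(x);  // stage 2 over the whole collection

  std::printf("%.1f %.1f\n", deep.back(), wide.back());
  return 0;
}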

Conclusion

I have since used this idea of “transposition” a few times to clean up and simplify my designs. It seems especially helpful when trying to decouple messaging from bulk processing.
The idea of looking at the data flow and adapting the control flow to match it is central to data-oriented design. I argue that while this can be used to optimize programs, transposition is mainly a tool to make programs simpler, which can then lead to optimization. Separating processing into stages is also very similar to loop fission.
Have you used a technique like this before? Do you, perhaps, know it by another name? Let me know!

Game Optimization Resolved

In my last blog post, I explained a performance problem in my game abstractanks but not how I solved it.

I had not done any optimization work in a while, so the first thing I did turned out to be an error. And not only in hindsight: I actually knew how to tackle a problem like that, I just temporarily forgot it at that point.

Going down the rabbit hole

Where we left off, my profiler showed FriendlyUnitOccupies as the culprit. That function basically does circle/circle collision detection using a quad-tree as the spatial acceleration structure. Looking at the samples from my profiler, I could see that it was descending into the tree quite deeply. Like all tree structures, a quad-tree involves pointer chasing, which is very bad for modern CPUs. So I figured I should look at how to optimize that. The data structure was implemented in a hurry, so there seemed to be plenty to do:

  • Instead of recursing into each node, use tail-call optimization and early culling to speed up traversal.
  • Pre-cache the query with the max-search radius and the other requirements on the units, e.g. not dead, same team, etc., and then use that to build a new tree for the actual queries.

Because the data structure was pretty non-generic, I started to basically rewrite it to fit this scenario. While I was about half way through with that, it dawned on me that I was barking up the wrong tree.

Taking a step back

The excellent book Video Game Optimization has some great advice on which level to attack an optimization problem.

  1. System-level. Can you change the system to do something differently and still solve your problem?
  2. Algorithm-level. Are you using the most efficient algorithm for the data you have?
  3. Micro-level. Are you not wasting any processing power on the lower levels?

I was already on the algorithm level. So I went back to the system level: What if the AI did not try to change the target position that often, maybe just every few seconds? That effectively meant lowering the AI’s APM. It’s not a bad solution, especially since it makes the AI behave more like a human. But on the other hand, real-time games, as the name implies, have a soft real-time requirement. So you generally want to avoid huge workloads that go over your frame budget. With how slow the algorithm was, that could easily be the case. The solution is then to do the work concurrently, either by splitting it up or doing it in the background. Both options seemed difficult, since the AI code does not currently allow for easy concurrency. So that idea was out.

What if the parking positions were cached? Subsequent calls to get parking positions could probably reuse a lot of the positions that were computed in previous frames, given that the target point only moves by a little bit each frame. I figured that might work, but it requires more housekeeping and data dependencies: the result of the previous query needs to be used for the next. That seemed complex and therefore brittle.

A Solution?

Temporal coherency was a pretty good idea though, but the scale was too big this time. What if I exploited it within a single frame? Now the original code did obscure this, but maybe it gets a little clearer if I write it like this:

optional<v2> GameWorld::FindFreePosition(v2 Center, std::vector<v2> const& Occupied)
{
  auto CheckPosition = [&](v2 Candidate)
  {
    if (!IsPassable(Candidate))
      return false;

    if (OverlapsWith(Candidate, Occupied))
      return false;

    return !FriendlyUnitOccupies(Candidate);
  };
  auto Samples = SampledPositions(Center, SomeRandomness());
  auto Found = std::find_if(Samples.begin(), Samples.end(), CheckPosition);

  return (Found != Samples.end()) ? *Found : none;
}

Now as I explained in the previous post, this was called in a loop for each unit to be parked.

std::vector<v2> GameWorld::FindParkingPositions(v2 Center, std::size_t N)
{
  std::vector<v2> Results;
  for (std::size_t i = 0; i < N; ++i)
  {
    auto MaybePosition = FindFreePosition(Center, Results);
    if (!MaybePosition) // No more free space?
      break;
    Results.push_back(*MaybePosition);
  }
  return Results;
}

Easy to see: counting the number of CheckPosition calls, this algorithm is O(n) in the number of sampled positions. The number of sampled positions depends linearly on the number of units to be parked, because more units obviously need more parking positions, essentially making this O(n²) in the unit count! But the positions get resampled for each unit, with the only change being the little bit of randomness that is injected every time. In other words, each call would just test false for sampled positions roughly corresponding to the units that are already placed.

So what I did was a very small change: only inject the randomness once and merge the loops:

auto Samples = SampledPositions(Center, SomeRandomness());
std::vector<v2> Results;

for (auto const& Sample : Samples)
{
  if (CheckPosition(Sample))
    Results.push_back(Sample);

  if (Results.size() >= N)
    break;
}
return Results;

And this did the trick! The algorithm’s run-time went below the 1ms range, and the smaller variation in randomness is not really visible.

Conclusions

I was thrown off track by the false conclusion that CheckPosition was too slow when it was in fact just called too often. Context is key! Always approach these things outside-in.
Using less-than-optimal abstractions obscured from me the opportunity to hoist out the sample generation. Iteration is always a separate concern, even when it is not on containers!

Keeping connections alive with libcurl

libcurl is quite a comfortable option to transfer files across a variety of network protocols, e.g. HTTP, FTP and SFTP.

It’s really easy to get started: downloading a single file via HTTP or FTP takes only a couple of lines.

Drip, drip..

But as with most powerful abstractions, it is a bit leaky. While it does an excellent job of hiding such steps as name resolution and authentication, these steps still “leak out” by increasing the overall run-time.

In our case, we had five dozen FTP servers and we needed to repeatedly download small files from all of them. To make matters worse, we only had a small time window of 200ms for each transfer.

Now FTP is not the simplest protocol. Essentially, it requires the client to establish a TCP control connection, which it uses to negotiate a second data connection and to initiate file transfers.

This initial setup phase needs a lot of back and forth between server and client. Naturally, this is quite slow. Ideally, you would want to do the connection setup once and keep both the control and the data connection open for subsequent transfers.

libcurl does not explicitly expose the concept of an active connection, hence you cannot tell the library not to disconnect it. In a naive implementation, you would download multiple files by simply creating an easy session object for each file transfer:

for (auto file : FILE_LIST)
{
  std::vector<uint8_t> buffer;
  auto curl = curl_easy_init();
  if (!curl)
    return -1;
  auto url = (SERVER+file);
  curl_easy_setopt(curl, CURLOPT_URL,
    url.c_str());
  curl_easy_setopt(curl, CURLOPT_WRITEFUNCTION,
    appendToVector);
  curl_easy_setopt(curl, CURLOPT_WRITEDATA,
    &buffer);
  if (curl_easy_perform(curl) != CURLE_OK)
    return -1;

  process(buffer);
  curl_easy_cleanup(curl);
}

That does indeed reset the connection for every single file.

Re-use!

However, libcurl can actually keep the connection open as part of a connection re-use mechanism in the session object. This is documented with the function curl_easy_perform. If you simply hoist the easy session object out of the loop, it will no longer disconnect between file transfers:

auto curl = curl_easy_init();
if (!curl)
  return -1;

for (auto file : FILE_LIST)
{
  std::vector<uint8_t> buffer;
  auto url = (SERVER+file);
  curl_easy_setopt(curl, CURLOPT_URL, 
    url.c_str());
  curl_easy_setopt(curl, CURLOPT_WRITEFUNCTION, 
    appendToVector);
  curl_easy_setopt(curl, CURLOPT_WRITEDATA, 
    &buffer);
  if (curl_easy_perform(curl) != CURLE_OK)
    return -1;

  process(buffer);
}
curl_easy_cleanup(curl);

libcurl will now cache the active connection in the session object, provided the files are actually on the same server. This improved the download timings of our bulk transfers from 130ms-260ms down to 30ms-40ms, quite the enormous gain. The timings now fit into our 200ms time window comfortably.
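If you want to check that the re-use actually happens in your setup, libcurl’s verbose mode is an easy way to watch the connection handling (an optional debugging aid, not needed for the caching itself):

// Optional: log libcurl's connection handling to stderr. The verbose output
// shows whether a transfer re-uses the cached connection or reconnects.
curl_easy_setopt(curl, CURLOPT_VERBOSE, 1L);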

It’s only Cores and Caches but I like it

The programming game changed dramatically in the past ten years. We are playing CPU cores and caches now, but without proper visibility.

Most of our software development economy is based on a simple promise: the computing power (or “performance”) of a common computer will double every two years. This promise has accompanied us for 40 years now, a time during which our computers got monitors, acquired harddisks and provided RAM beyond the 640 kB that was enough for nobody. In more recent years, we don’t operate systems with one CPU, but four, eight or even twelve of them. So it came as a great irritation when, ten years ago, Herb Sutter predicted that “The free lunch is over”, and even Gordon Moore, the originator of Moore’s Law that forms the basis of our simple promise, said that it will only hold true for ten to fifteen more years. Or, in other words, until today.

Irritation

That’s a bit unsettling, to say the least, and should be motivation enough to have a good look at everything we are doing. Intel, the biggest manufacturer of CPUs for computers, has indicated earlier this year that Moore’s Law cannot be fulfilled any longer. So, the free lunch is really over. And it turns out to have some hidden costs. One cost is a certain complacency, the conviction that things will continue to be as they were and that coding styles chiseled over years and decades hold an inherent value of experience.

Complacency

Don’t get me wrong – there is great value in experience, but not all knowledge of the past is helpful for the future. Sometimes, fundamental things change. Just as the tables will eventually turn for every optimization trick, we need to reevaluate some axioms of our stance towards performance. Let me reiterate some common knowledge:

There are two types of performance inherently baked into your source code: Theoretical and practical performance.

Performance

The first type is theoretical performance, measured in O(n), O(n²) or even O(n!) and mostly influenced by the complexity class of the algorithm you are using. It will translate into runtime behaviour (in the case of O(n!), your software is already dead, you just don’t know it yet), but isn’t concerned with the details of your implementation. Not using an unnecessarily high complexity class for a given problem will continue to be a valuable skill that every developer should master.

On the other hand, practical performance is measured in milliseconds (or nanoseconds if you are into micro-benchmarks and can pull off measuring them correctly) and can heavily depend on just a few lines in your source code. Practical performance is the observable runtime behaviour of your software on given hardware. There are two subtypes of practical performance:

  • Throughput (How many operations are computed by the system in a given unit of time?)
  • Latency (How long does it take one operation to be computed by the system?)

If you run a service, throughput is your main metric for performance. If you use a service, latency is your main concern. Let me explain this with the metaphor of a breakfast egg. If you want to eat your breakfast at a hotel buffet and the eggs are empty, your main concern is how fast you will get your freshly boiled egg (latency). But if you run the hotel kitchen, you probably want to cook a lot of eggs at once (throughput), even if that means that one particular egg might boil slightly longer than if you’d boiled each of them individually.

Latency

Those two subtypes are not entirely independent of each other. But the main concern for most performance work done by developers is latency. It is relatively easy to measure and to reason about. If you work on latency-based performance issues, you should know about the latency numbers every programmer should know, either in visual form or translated to a more human time scale. Let’s iterate some of the numbers and their scaled counterparts here:

  • 1 CPU cycle (0.3 ns): 1 second
  • Level 1 cache access (0.9 ns): 3 seconds
  • Branch mispredict (2.5 ns): 8 seconds
  • Level 2 cache access (2.8 ns): 9 seconds
  • Level 3 cache access (12.9 ns): 43 seconds
  • Main memory access (120 ns): 6 minutes
  • Solid-state disk I/O (50-150 μs): 2-6 days
  • Rotational disk I/O (1-10 ms): 1-12 months

We can discuss any number in detail, but the overall message stands out nonetheless: CPUs are lightning fast and caches are the only system components that can somewhat keep up. As soon as your program hits the RAM, your peak performance is lost. This brings us to the main concept of latency optimization:

Your program’s latency is ultimately decided by your ability to decrease cache misses.

You can save CPU cycles by performing clever hacks, but if you are able to always read your data (and code) from the cache, you’ll be 360 times faster than if your program constantly has to read from RAM. Your source code doesn’t have to change at all for this to happen. A good compiler and/or optimizing runtime can work wonders if you adhere to your programming language’s memory model. In reality, you probably have to rearrange your instructions and align your data structures. That’s the performance optimization of today, not the old cycle stinting. The big challenge is that none of these aspects are visible at the source code level of your program. We currently have to develop our programs kind of blindfolded.
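To see how big the effect of access patterns alone can be, here is a toy example (not a rigorous benchmark, and the exact factor depends on your hardware): the same summation over the same data, once walking along the memory layout and once striding across it.

#include <chrono>
#include <cstdio>
#include <vector>

int main()
{
  const std::size_t n = 4096;
  std::vector<int> matrix(n * n, 1);   // 64 MB, far larger than any cache

  // Walks memory sequentially: the prefetcher and the caches can keep up.
  auto sumRowMajor = [&] {
    long sum = 0;
    for (std::size_t row = 0; row < n; ++row)
      for (std::size_t col = 0; col < n; ++col)
        sum += matrix[row * n + col];
    return sum;
  };

  // Jumps 16 kB between consecutive accesses: mostly cache misses.
  auto sumColumnMajor = [&] {
    long sum = 0;
    for (std::size_t col = 0; col < n; ++col)
      for (std::size_t row = 0; row < n; ++row)
        sum += matrix[row * n + col];
    return sum;
  };

  auto timeIt = [](auto compute) {
    auto start = std::chrono::steady_clock::now();
    long result = compute();
    auto elapsed = std::chrono::duration_cast<std::chrono::milliseconds>(
        std::chrono::steady_clock::now() - start);
    std::printf("sum=%ld in %lld ms\n", result, (long long)elapsed.count());
  };

  timeIt(sumRowMajor);
  timeIt(sumColumnMajor);
  return 0;
}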

Concurrency

One way we’ve held up Moore’s Law in the last ten years was the introduction of multiprocessor computing into normal computers. If you cram two CPUs onto the die, the number of transistors on it has doubled. A single-threaded program doesn’t run any faster, though. So we need to look at concurrent programming to unlock the full power of our systems. Basically, there are two types of concurrent programming: deliberate and mechanical.

  • Deliberate concurrent programming means that you as the developer actively introduce threads, fibers or similar concepts into your source code to control parallel computation.
  • Mechanical concurrent programming means that your source code can be parallelized by the compiler and/or runtime and/or the hardware (e.g. hyper-threading) without changing the correctness of your program.

In both types of concurrent programming, you need to be aware of the constraints and limitations of correct concurrency. It doesn’t matter if your program is blazingly fast and utilizes all cores if the result is wrong or only occasionally correct. Once again, the memory model of your programming language is a useful set of rules and abstractions to guide you. Most higher-level concurrency models like actors narrow your possibilities even further, with functional programming being one of the strictest (and most powerful) ones.
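To make the deliberate flavour a bit more tangible, here is a minimal, self-contained sketch (a toy, not a recommendation for any particular workload): the developer explicitly splits a summation into two tasks and gets a result that is correct no matter how the tasks are scheduled.

#include <cstdio>
#include <future>
#include <numeric>
#include <vector>

int main()
{
  std::vector<long> data(1000000, 1);
  auto middle = data.begin() + data.size() / 2;

  // Deliberate concurrency: we explicitly create a second task that may run
  // on another core while the current thread handles the other half.
  auto lowerHalf = std::async(std::launch::async,
                              [&] { return std::accumulate(data.begin(), middle, 0L); });
  long upperHalf = std::accumulate(middle, data.end(), 0L);

  // The result is correct regardless of how the two tasks are interleaved.
  std::printf("sum = %ld\n", lowerHalf.get() + upperHalf);
  return 0;
}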

In the field of software development, we are theoretically well-prepared to take on the task of pervasive concurrent programming. But we need to forget about the good old times of single-core confirmability and embrace the chaotic world of raw computing power, the world of cores and caches.

Cores ‘n’ Caches

This is our life now: we rely on Cores ‘n’ Caches to feed us performance, but the lunch isn’t free anymore. We have to adapt and adjust, to rethink our core axioms and let go of those parts of our experience that are now hindrances. Just like Rock ‘n’ Roll changed the rules in the music business, our new motto changes ours.

Let’s rock.

The tables will eventually turn for every optimization

Nearly all performance optimizations turn sour sooner or later. Here’s a story that illustrates the concept and gives a simple ruleset to avoid the effect.

An inconvenient truth

One of the things every software developer has to learn the hard way is that performance optimizations are a bad thing most of the time. The lesson is counter-intuitive because it is in conflict with several fundamental motivations of our artist/engineer attitude:

  • We want our code to be as fast as possible or even faster. (We really like the prospect of making the impossible happen!)
  • We don’t want to waste resources if it can be avoided. (Digital resources have always been scarce. Development in abundance wasn’t taught!)
  • We want to be clever and one-up everybody with the latest trick/hack. (This ego boost driven development attitude is a major problem on its own!)

But we can’t deny that clever guys have existed since forever and that their wisdom should be considered. Here’s one clever guy, forty years ago:

premature optimization is the root of all evil.

said Donald Knuth in 1974. A typical computer in those days had RAM in the lower kilobyte range and was clocked at around a megahertz. The Intel 8080 was introduced that year. No sign of abundance anywhere.

Coming from this insight, I teach the “three rules of performance optimization” to my students:

  1. Don’t
  2. Not Yet
  3. Measure

A little disclaimer: performance optimizations are measured in milliseconds. You are still responsible for complexity optimizations and should engage in them. Complexity is measured in O(n).

I blogged about the rules some time ago, so I won’t repeat the whole meaning behind them. Today, I want to tell a story related to rule three (“Measure”) and why it’s generally a good idea to stay clear of optimizations if not absolutely necessary.

An accidental observation


In one of our long-running projects, we have been storing data in a 24/7 manner for more than ten years. The project started on Pentium IV boxes with 120 GB harddisks. Soon enough, the available disk space vanished rapidly. Our customer wanted to optimize the storage efficiency. We told him that we can always trade storage space for computation time or the other way around (the infamous space/time tradeoff), but if we compress the data in the system’s archive to save space, it will result in slower archive access. Our customer understood the tradeoff and decided he wanted storage efficiency over access speed for the archive.

So we added a compression step when certain data types were stored in the archive. Because the data was text based (XML and other formats), the compression rate was at 90% and even higher, meaning that we could fit ten times more data on the disk than before. We certainly met the customer’s goal of storage efficiency. But what about the access speed? We needed to add the decompression step for certain data types at the place where our system loads from the archive. After that, we hoped that the speed didn’t suffer too badly. We measured the access times before and after the change and couldn’t believe our eyes: the access time was nearly ten times faster than before. There was no tradeoff, we actually shrank storage consumption and computation time at once, and in the same ballpark figure. We felt like heroes.

Discovering false premises

Why did this happen? Our explanation is that at some specific point in time, the CPUs became too fast for a tradeoff. In the good old days, there really was a tradeoff: if you needed to compress parts of your harddisk to save space, loading these parts became slower. Because the CPU had to perform extra work on top of the loading, you had to wait longer. Loading more data directly from disk was faster than loading less data from disk and decompressing it. When the CPU became fast enough to actually decompress during the I/O delay of the harddisk, that’s when the raw amount of data that needed to be transferred from disk determined the loading speed. And since uncompressed data is larger, it now took longer to load than compressed data.

At some specific point in time, the old wisdom that compressed data is slower became a lie. Nobody told us. We weren’t heroes, we just lived on false premises.

Our customer was pleased and continued to be pleased for years. Every few years, the CPUs of the target machines would become even faster, but the harddisk performance would hardly improve. The new wisdom was truer than ever: Less data is faster, regardless of what the CPU has to do with it. This was a performance optimization that could be used everywhere. You want to increase I/O speed? Compress the data.

The tables begin to turn

Fast forward to modern days. A new technology promised to improve harddisk performance by orders of magnitude: the solid state disk (SSD) delivers impressive amounts of data per second, but more importantly, gets rid of the initial seek time that magnetic spindle disks had (those few milliseconds to locate the data on the disk that felt like ages for modern CPUs). We started our migration to SSDs in 2010 and were SSD-exclusive at the end of 2013. Our customer was a bit more hesitant, but the latest generation of target machines runs on SSDs, too. So how did this affect our performance optimization?

As you can probably deduce by now, the decompression work of the CPU re-emerged in the archive access time. It’s by no means as bad as in ancient times, but the days of “no tradeoff” are gone again. The performance optimization isn’t an optimization anymore. We still have it in the system, because it was never meant to improve performance, but storage efficiency, and it still does that perfectly. And it needs to, because SSDs are still expensive and don’t offer storage space in abundance. But the old wisdom is back (partially): compressed data is slower, if not by much.

What do we learn from this?

Over the course of a few years, a specific feature (transparent storage compression) in our system was a performance burden, then a performance booster and a burden again. We didn’t change the code, we just changed the hardware circumstances. The best lesson to be learnt is that no performance optimization lasts forever. Premises will change. Effects will be negated. Bottlenecks will be shifted. It’s best to know when that happens. And then it’s best to be able to revoke the optimization. Or, if all this sounds way too much trouble for such a tiny performance gain, remember the first rule of performance optimization (Don’t) and just leave it alone. You can always tell yourself that you’ve just optimized your code for the machine it will run on in ten years.

Recap of the Schneide Dev Brunch 2015-08-09

If you couldn’t attend the Schneide Dev Brunch at 9th of August 2015, here is a summary of the main topics.

Two weeks ago, we held another Schneide Dev Brunch, a regular brunch on the second Sunday of every other (even) month, only that all attendees want to talk about software development and various other topics. So if you bring a software-related topic along with your food, everyone has something to share. The brunch was well attended this time, with enough stuff to talk about. As usual, a lot of topics and chatter were exchanged. This recapitulation tries to highlight the main topics of the brunch, but cannot reiterate everything that was spoken. If you were there, you probably find this list inconclusive:

News on Docker

Docker is the hottest topic among developers and operators in 2015. No wonder we started chatting about it the minute we sat down. There are currently two interesting platform projects that provide runtime services for docker: Tutum (commercial) and Rancher (open source). We all noted the names and will check them out. The next interesting fact was that Docker is programmed in the Go language. The team probably one day decided to give it a go.

Air Conditioning

We all experienced the hot spell this summer and observed that work in the traditional sense is impossible beyond 30° Celsius. Why there are still so few air-conditioned offices in our region is beyond our grasp. Especially since it’s possible to power the air conditioning system with green electricity and let sun-power deal with the problem that, well, the sun brought us. In 2015 alone, at least two work weeks were lost to the heat. The productivity gain from cooling should outweigh the costs.

License Management

We talked about how different organisations deal with the challenge of software license management. Nearly every big company has a tool that does essentially the same license management but has its own cool name. Other than that, bad license management is such a great productivity killer that even air conditioning wouldn’t offset it.

Windows 10

Even if we are largely operating system agnostic, the release of Windows 10 is hot news. A few of our participants already tried it and concluded that “it’s another Windows”. A rather confusing aspect is the split system settings. And you have to forgo the Cortana assistant if you want to avoid the data gathering.

Patch Management

A rather depressing topic was the discussion about security patches. I’ll just repeat two highlights: a substantial number of servers on the internet are still vulnerable to the Heartbleed attack. And if a car manufacturer starts a big recall campaign with cost-free replacements, less than 10 percent of the entitled cars are actually fixed on average. This explicitly includes safety-critical issues. That shouldn’t excuse us as an industry for our own shortcomings, and it’s not reassuring to see that other industries face the same problems.

Self-Driving Cars

We digressed onto the future hype topic of self-driving cars. I can’t reiterate the complete discussion, but we agreed that those cars will hit the streets within the next ten years. The first use case will be freight transport, because the cargo doesn’t mind if the driver is absent and efficiency matters a lot in logistics. Plus, machines don’t need breaks. Ok, those were enough puns on the topic. Sorry.

Tests on Interfaces

An interesting question was how to build tests that can ensure a class or interface contract. Much like regression tests for recently broken functionality, compatibility tests should deal with backward compatibility issues in the interface. Turns out, the Eclipse foundation gave the topic some thoughts and came up with an exhaustive list of aspects to check. There are even some tools that might come in handy if you want to compare two versions of an API.

API Design

When the topic of API Design came up, some veterans of the Schneide Events immediately mentioned the API Design Fest we held in November 2013 to get our noses bloody on API design. Well, bleed we did. The most important take-away from the Fest was that if you plan to publish an API that can endure some years in production while being enhanced and improved, you just shouldn’t do it. Really, don’t do it, it’s probably a bad idea and you lack the required skill without even knowing it. If you want to know, participate or even host an API Design Fest.

And if you happen to design a web-based API, you might abandon backward compatibility by offering several distinct “versions” of APIs of a service. The version is included in the API URL, and acts more like a name than a version. This will ease your burden a bit. A nice reference resource might also be the PayPal API style guide.

Let’s just agree that API design is really hard and should not be done until it’s clear you don’t suffer from Dunning-Kruger effect symptoms too much.

Performance Tests

We talked about the most effective setup of performance tests. There were a lot of ideas and we cornerstoned the topic around this:

  • There was a nearly heroic effort from the Eclipse development team to measure their IDE performance, especially to compare different versions of the IDE. The Eclipse Test & Performance Tools Platform (TPTP) was (as in: discontinued) a toolkit of interesting approaches to the topic. The IDE itself was measured by performance fingerprints like this example from 2011. As far as we know, all those things ceased to exist.
  • At the last Java Forum Stuttgart, there was a talk about performance testing from an experienced tester who loved to give specific advice. The slides can be viewed online in German (well, not really, but the talk was).
  • The book Release It! has a lot of insights to this topic. It’s one of the bigger books on the pragmatic bookshelf.
  • The engineers at NetFlix actually did a lot of thinking about the topic. They came up with Hystrix, a resilience library, aimed to make it easier to prevent complete system blackouts. They also came up with Chaos Monkey, a service that makes it easier to have a complete system blackout. If we can say anything about NetFlix, it is that they definitely approach their problems from the right angle.

Company Culture

Leaking over from the previous topic about effective performance-related measures, we talked about different company cultures, especially in regard to centralized human resources departments and works councils (German: Betriebsrat). We agreed that it is very difficult to maintain a certain culture during continued growth. We also agreed that culture trickles down from top management.

OpenGL

The last topic of this Dev Brunch was the rendering of text or single characters in OpenGL. By using signed distance fields, you can render text more crisply while still only using cheap computation instructions. There is a paper from Valve on the topic that highlights the benefits and gives a list of additional reading. It’s always cool to learn about something simple that actually improves things.

Epilogue

As usual, the Dev Brunch contained a lot more chatter and talk than listed here. The number of attendees makes for a unique experience every time. We are looking forward to the next Dev Brunch at the Softwareschneiderei. And as always, we are open for guests and future regulars. Just drop us a notice and we’ll invite you over next time.