Understanding, identifying and fixing the N+1 query problem

One of the most common performance pitfalls for applications that access data in a database is the so-called “N+1 query problem”, sometimes also called the “N+1 selects problem”. It is the first thing you should look for when an application has performance issues related to database access. It is especially easy to run into with object-relational mappers (ORMs).

The problem

The problem typically arises when your entity-relationship model has a 1:n or n:m association. It exists when application code executes one query to get objects of one entity and then executes another query for each of these objects to get the objects of an associated entity. An example would be a blog application that executes one query to fetch all authors whose names start with the letter ‘B’, and then another query for each of these authors to fetch their articles. In pseudocode:

# The 1 query
authors = sql("SELECT * FROM author WHERE name LIKE 'B%'")

# The N queries
articles = []
FOR EACH author IN authors:
    articles += sql("SELECT * FROM article WHERE author_id=:aid", aid: author.id)

The first query is the “1” in “N+1”, the following queries in the loop are the “N”.

Of course, to anybody who knows SQL this is a very naive way to get the desired result. However, ORMs often seduce their users into writing inefficient database access code by hiding the SQL queries and encouraging them to reach for the familiar tools of their programming language, such as loops or collection operations like map. Many popular web application frameworks ship with an ORM: Rails with Active Record, Grails with GORM (Hibernate based), Laravel with Eloquent.
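
With an ORM, the same N+1 pattern usually hides behind an innocent-looking loop over a lazily loaded association. A minimal sketch of how this might look in ActiveRecord, assuming hypothetical Author and Article models where Author has_many :articles:

# 1 query for the authors ...
authors = Author.where("name LIKE ?", "B%")

# ... plus N queries, because each access to `author.articles` lazily
# loads that author's articles with its own SELECT
authors.each do |author|
  author.articles.each do |article|
    puts article.title
  end
end

The loop looks perfectly idiomatic, which is exactly why the problem is so easy to miss.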

How to detect

The easiest way to detect the problem in an application is to log the database queries. Virtually all ORMs have a configuration option to enable query logging.

For Grails/GORM the logging can be enabled per data source in the application.yml config file:

dataSource:
    logSql: true
    formatSql: true

For Rails/ActiveRecord query logging is automatically enabled in the development environment. Since Rails 5.2 the Verbose Query Logs format, which you had to enable explicitly in earlier versions, is on by default.

For Laravel/Eloquent you can enable and access the query log with these two methods/functions:

DB::connection()->enableQueryLog();
DB::getQueryLog();

Once query logging is enabled you will quickly see if the same query is executed over and over again, usually indicating the presence of the N+1 problem.

How to fix

The goal is to replace the N+1 queries with a single query. In SQL this means joining. The example above would be written as a single query:

SELECT article.*
FROM article
JOIN author
  ON article.author_id=author.id
WHERE author.name LIKE 'B%'

The query interface of ORMs usually allows you to write joins as well. Here is the example in ActiveRecord:

Article.joins(:author).where("authors.name LIKE ?", "B%")

Another option when using ORMs is to enable eager loading for associations. In GORM this can be enabled via the fetchMode static property:

class Author {
    static hasMany = [articles: Article]
    static fetchMode = [articles: 'eager']
}
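
ActiveRecord offers the same idea on a per-query basis through eager loading with includes. A sketch, again using the hypothetical Author/Article models from above; the articles of all matching authors are fetched with one additional query instead of N:

# 2 queries in total: one for the authors, one for all their articles
authors = Author.includes(:articles).where("name LIKE ?", "B%")

authors.each do |author|
  author.articles.each { |article| puts article.title }
end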

REST APIs

The problem isn’t limited to SQL databases and SQL queries. For REST APIs it becomes the “N+1 requests problem”, describing the situation where a client application has to call the server N+1 times to fetch one collection resource plus N child resources. Here the REST API has to be extended or modified to serve the client’s use cases with a single request. Another option is to offer a GraphQL API instead of a REST API. GraphQL is a query language for HTTP APIs that allows complex queries, so the client application can specify exactly which resources it needs in a single request.

Game Optimization Resolved

In my last blog post, I explained a performance problem in my game abstractanks but not how I solved it.

I had not done any optimization work in a while, and the first thing I did turned out to be an error. Not only in hindsight: I actually knew how to tackle a problem like that, I had just temporarily forgotten it at that point.

Going down the rabbit hole

Where we left off, my profiler showed FriendlyUnitOccupies as the culprit. That function basically does circle/circle collision detection using a quad-tree as the spatial acceleration structure. Looking at the samples from my profiler, I could see that it was descending quite deeply into the tree. Like all tree structures, a quad-tree does a lot of pointer chasing, which is very bad for modern CPUs. So I figured I should look at how to optimize that. The data structure was implemented in a hurry, so there seemed to be plenty to do:

  • Instead of recursing into each node, use tail-call optimization and early culling to speed up traversal.
  • Pre-cache the query with the max search radius and the other requirements on the units (e.g. not dead, same team) and then use that to build a new tree for the actual queries.

Because the data structure was pretty non-generic, I started to basically rewrite it for this scenario. When I was about halfway through with that, it dawned on me that I was barking up the wrong tree.

Taking a step back

The excellent book Video Game Optimization has some great advice on which level to attack an optimization problem.

  1. System-level. Can you change the system to do something differently and still solve your problem?
  2. Algorithm-level. Are you using the most efficient algorithm for the data you have?
  3. Micro-level. Are you wasting any processing power at the lower levels?

I was already on the algorithm level, so I went back to the system level: what if the AI did not try to change the target position that often, maybe just every few seconds? That effectively means lowering the AI’s APM. It is not a bad solution, especially since it makes the AI behave more like a human. On the other hand, real-time games, as the name implies, have a soft real-time requirement, so you generally want to avoid huge workloads that blow your frame budget. With how slow the algorithm was, that could easily have been the case. The usual answer is then to do the work concurrently, either by splitting it up or by doing it in the background. Both options seemed difficult, since the AI code currently does not allow for easy concurrency. So that idea was out.

What if the parking positions were cached? Subsequent calls to get parking positions could probably reuse a lot of the positions that were computed in previous frames, given that the target point only moves a little bit each frame. I figured that might work, but it requires more housekeeping and introduces data dependencies: the result of the previous query needs to feed into the next. That seemed complex and therefore brittle.

A Solution?

Temporal coherence was a pretty good idea, though, just on too big a time scale. What if I exploited it within a single frame? The original code obscured this, but maybe it becomes a little clearer if I write it like this:

optional<v2> GameWorld::FindFreePosition(v2 Center, std::vector<v2> const& Occupied)
{
  auto CheckPosition = [&](v2 Candidate)
  {
    if (!IsPassable(Candidate))
      return false;

    // Does the candidate overlap any already-occupied position?
    if (OverlapsWith(Candidate, Occupied))
      return false;

    return !FriendlyUnitOccupies(Candidate);
  };
  auto Samples = SampledPositions(Center, SomeRandomness());
  auto Found = std::find_if(Samples.begin(), Samples.end(), CheckPosition);

  return (Found != Samples.end()) ? *Found : none;
}

Now as I explained in the previous post, this was called in a loop for each unit to be parked.

std::vector<v2> GameWorld::FindParkingPositions(v2 Center, std::size_t N)
{
  std::vector<v2> Results;
  for (std::size_t i = 0; i < N; ++i)
  {
    auto MaybePosition = FindFreePosition(Center, Results);
    if (!MaybePosition) // No more free space?
      break;
    Results.push_back(*MaybePosition);
  }
  return Results;
}

Easy to see: counting the number of CheckPosition calls, this algorithm is O(n) in the number of sampled positions. The number of sampled positions depends linearly on the number of units to be parked, because more units obviously need more parking positions, essentially making this O(n²) in the unit count! But the positions get resampled for each unit, with the only change being the little bit of randomness that is injected every time. In other words, each call just tests false for the sampled positions that roughly correspond to the units that were already placed.

So what I did was a very small change: only inject the randomness once and merge the loops:

auto Samples = SampledPositions(Center, SomeRandomness());
std::vector<v2> Results;

for (auto const& Sample : Samples)
{
  if (CheckPosition(Sample))
    Results.push_back(Sample);

  if (Results.size() >= N)
    break;
}
return Results;

And this did the trick! The algorithm’s run-time went below the 1 ms range, and the smaller variation in randomness is not really visible.

Conclusions

I was thrown off track by the false conclusion that CheckPosition was too slow, when it was in fact just called too often. Context is key! Always approach these things outside-in.
Using less-than-optimal abstractions obscured the opportunity to hoist the sample generation out of the loop. Iteration is always a separate concern, even when it is not on containers!

A Game Optimization War Story

As our customers surely know, I’m not working here on Fridays. This is because that’s the time I allocate to my side project, an arcade real-time strategy game called abstractanks. It is a passion project above all else, but of course I am also learning a lot, much of which I can apply to my “day job” here as well. Today I want to share the story of how I optimized a critical bit of code in that game.

The Big Slowdown

While working on scripted missions, one main element I am using is to make a group of units attack when you enter an area (a.k.a. a zone-trigger). This seems easy enough, but was causing massive slowdowns as soon as the enemy group started moving. My average logic frame-time jumped from 0.3 ms to more than 1500 ms, which essentially makes the game unplayable. When seeing a performance problem, your first instinct should always be to profile it. So I booted up WPR/WPA and did just that. Once I had the profile, I followed the most-sampled path in the stack and found my way to the supposed culprit: the parking algorithm.

Context

When optimizing, you need as much context as possible to find the best course of action. So let me explain how that algorithm fits into the broader picture.

Parking

My main game mechanic is moving your units around. You do this by selecting a group and then clicking somewhere on the map to issue the move order. In addition to the path-finding process, this also runs an algorithm I call park-planning (as in parking a car). It makes sure that the units know to position themselves around the target point in a roughly circular shape once they arrive. It is essential to the interaction of this mechanic with the capturing of objectives, which are circular as well. Before this was implemented, the units would just decelerate after passing the target point. This caused them to “overshoot” and miss the objectives, which was frustrating to the players: they clicked in the right place, but the units would not stop there, only slightly behind it. To make things worse, units arriving later would bump into those that were already there, pushing them even further away and clumping up.

AI Moving

In my particular case, the AI enemy was repeatedly issuing move orders to close in on the intruder, i.e. the player. Since the player group usually also moved, the AI was trying to adapt by changing the move order every frame (effectively operating at around 2000 APM).

Diving into the code

My park-planning implementation is divided into two steps: finding enough parking spots, and then assigning units to them. The profiler showed that the first part was the problem, while the assignment was negligible in terms of run-time. Historically, the first step reused and extended some code I had originally written for spawning units, which worked like this:

optional<v2> GameWorld::FindFreePosition(v2 Center, std::vector<v2> const& Occupied)
{
  auto CheckPosition = [&](v2 Candidate)
  {
    if (!IsPassable(Candidate))
      return false;

    if (OverlapsWith(Candidate, Occupied))
      return false;

    return !FriendlyUnitOccupies(Candidate);   
  };

  if (CheckPosition(Center))
    return Center;

  auto Radius = UNIT_SIZE;
  while (Radius < MAX_SEARCH_RADIUS)
  {
    // Roll a random starting angle
    auto AngleOffset = RandomAngle();
    auto Angle = 0.f;
    while (Angle < 2*Pi)
    {
      auto Candidate = Center + AngleVector(Angle + AngleOffset)*Radius;
      if (CheckPosition(Candidate))
        return Candidate;

      // Move along this circle in steps of roughly one unit size
      Angle += UNIT_SIZE / (Radius * OVERSAMPLING_FACTOR);
    }

    // Increase the Radius
    Radius += UNIT_SIZE;
  }
  return none;  
}

Note that all the functions called in the CheckPosition lambda are “size aware” and respect the UNIT_SIZE, so they are slightly more complex than this pseudo-code would have you believe.
The Occupied parameter was added for the parking-position search: the caller successively fills the std::vector with the positions found so far and uses them once it has found enough.

Back to the profiling results: they showed that most of the time was spent in FriendlyUnitOccupies, followed by IsPassable and then OverlapsWith. FriendlyUnitOccupies dominated, taking about eight times as long as the rest. That function uses a quad-tree to accelerate spatial queries for other units.

Next steps

Obviously, this code takes a pretty simplistic approach to the problem, basically just brute-forcing it. But that is good news: it means there are many different paths to take and many optimization opportunities. My eventual fix was a relatively simple change that got the frame time back down below 1 ms, but before I made it, I considered and tested a few other approaches. I will talk about them in detail in my next post. How would you approach this?

The tables will eventually turn for every optimization

Nearly all performance optimizations turn sour sooner or later. Here’s a story that illustrates the concept and gives a simple ruleset to avoid the effect.

An inconvenient truth

One of the things every software developer has to learn the hard way is that performance optimizations are a bad thing most of the time. The lesson is counter-intuitive because it is in conflict with several fundamental motivations of our artist/engineer attitude:

  • We want our code to be as fast as possible or even faster. (We really like the prospect of making the impossible happen!)
  • We don’t want to waste resources if it can be avoided. (Digital resources have always been scarce. Development in abundance wasn’t taught!)
  • We want to be clever and one-up everybody with the latest trick/hack. (This ego boost driven development attitude is a major problem on its own!)

But we can’t deny that clever guys have existed forever and that their wisdom should be considered. Here is one clever guy, forty years ago:

premature optimization is the root of all evil.

said Donald Knuth in 1974. A typical computer in those days had RAM in the lower kilobyte range and was clocked at around one megahertz. The Intel 8080 was introduced that year. No sign of abundance anywhere.

Coming from this insight, I teach the “three rules of performance optimization” to my students:

  1. Don’t
  2. Not Yet
  3. Measure

A little disclaimer: performance optimizations are measured in milliseconds. You are still responsible for complexity optimizations and should pursue them. Complexity is measured in big-O notation, such as O(n).

I blogged about the rules some time ago, so I won’t repeat the whole meaning behind them. Today, I want to tell a story related to rule three (“Measure”) and why it’s generally a good idea to stay clear of optimizations if not absolutely necessary.

An accidental observation

In one of our long-running projects, we have been storing data 24/7 for more than ten years. The project started on Pentium IV boxes with 120 GB harddisks. Soon enough, the available disk space vanished rapidly. Our customer wanted to optimize the storage efficiency. We told him that we could always trade storage space for computation time or the other way around (the infamous space/time tradeoff), but if we compressed the data in the system’s archive to save space, archive access would become slower. Our customer understood the tradeoff and decided he wanted storage efficiency over access speed for the archive.

So we added a compression step when certain data types were stored in the archive. Because the data was text based (XML and other formats), the compression rate was 90% and higher, meaning that we could fit ten times more data on the disk than before. We certainly met the customer’s goal of storage efficiency. But what about the access speed? We had to add a decompression step for those data types at the place where our system loads from the archive. After that, we hoped that the speed didn’t suffer too badly. We measured the access times before and after the change and couldn’t believe our eyes: the access time was nearly ten times faster than before. There was no tradeoff; we had shrunk storage consumption and computation time at once, and in the same ballpark. We felt like heroes.

Discovering false premises

Why did this happen? Our explanation is that at some specific point in time, the CPUs became too fast for a tradeoff. In the good old days, there really was a tradeoff: if you needed to compress parts of your harddisk to save space, loading these parts became slower. Because the CPU had to perform extra work on top of the loading, you had to wait longer. Loading more data directly from disk was faster than loading less data from disk and decompressing it. When the CPU became fast enough to actually decompress during the I/O delay of the harddisk, that’s when the raw amount of data that needed to be transferred from disk determined the loading speed. And since uncompressed data is larger, it now took longer to load than compressed data.
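
To make the effect concrete with deliberately made-up round numbers: if the harddisk delivers 100 MB per second, loading 1 GB of uncompressed XML takes about ten seconds. Compressed at 10:1, the same data is only 100 MB, roughly one second of I/O, and a CPU that decompresses at several hundred MB per second can do its work while the next chunk is still being read from disk. The decompression effectively hides inside the I/O wait, so the compressed path wins by roughly the compression factor.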

At some specific point in time, the old wisdom that compressed data is slower to load became a lie. Nobody told us. We weren’t heroes, we had just lived on false premises.

Our customer was pleased and continued to be pleased for years. Every few years, the CPUs of the target machines would become even faster, but the harddisk performance would hardly improve. The new wisdom was truer than ever: Less data is faster, regardless of what the CPU has to do with it. This was a performance optimization that could be used everywhere. You want to increase I/O speed? Compress the data.

The tables begin to turn

Fast forward to modern days. A new technology promised to improve harddisk performance by orders of magnitude: the solid state disk (SSD) delivers impressive amounts of data per second, but more importantly, gets rid of the initial seek time that magnetic spindle disks had (those few milliseconds to locate the data on the disk that feel like ages to a modern CPU). We started our migration to SSDs in 2010 and were SSD-exclusive at the end of 2013. Our customer was a bit more hesitant, but the latest generation of target machines runs on SSDs, too. So how did this affect our performance optimization?

As you can probably deduce by now, the decompression work of the CPU re-emerged in the archive access time. It is by no means as bad as in ancient times, but the days of “no tradeoff” are gone again. The performance optimization isn’t an optimization anymore. We still have it in the system, because it was never meant to improve performance but storage efficiency, and it still does that perfectly. And it needs to, because SSDs are still expensive and don’t offer storage space in abundance. But the old wisdom is back (partially): compressed data is slower, if not by much.

What do we learn from this?

Over the course of a few years, a specific feature (transparent storage compression) in our system was a performance burden, then a performance booster and a burden again. We didn’t change the code, we just changed the hardware circumstances. The best lesson to be learnt is that no performance optimization lasts forever. Premises will change. Effects will be negated. Bottlenecks will be shifted. It’s best to know when that happens. And then it’s best to be able to revoke the optimization. Or, if all this sounds way too much trouble for such a tiny performance gain, remember the first rule of performance optimization (Don’t) and just leave it alone. You can always tell yourself that you’ve just optimized your code for the machine it will run on in ten years.

How to avoid premature optimization

Three simple rules to develop by if you really want to avoid falling into the trap of premature performance optimization.

A common quote linked with Donald E. Knuth of TeX fame is “premature optimization is the root of all evil”. While this might sound a bit harsh, it holds a lot of truth.

Performance as an asset

If you consider software performance as an asset, you can determine its characteristics and derive your decisions about whether to work on it from them. For example, you will discover that while good performance is paramount, there is a certain threshold beyond which further optimizations are worthless from the asset’s point of view. If you happen to develop a game, you only need to draw as many frames as the monitor can display. If you process sensor data in real time, you only need to be done before the next data packet arrives; computers don’t grow tired, so any additional headroom buys you nothing.
If you treat performance as an asset, you can also assign a worth to every optimization you want to make and contrast it with the cost of the work you expect to invest. This divides the possible optimizations into a group of lucrative investments and a (probably larger) group of unprofitable ones.

Simple rules

Treating performance as an asset gives you the mental tools to make well-founded decisions about when and what to optimize. But there are also three simple rules you can apply if you don’t want to write a business plan every time you think “if I just change this line, the code will run much smoother”.

First rule: Don’t

The first rule of performance optimization, with a tendency to avoid premature optimization, is to just not care. Are you asking yourself whether a LinkedList is faster than an ArrayList for a given use case? The short (and ignorant) answer is: both will be fast enough. Is it better to explicitly set all references to null after usage? Why bother, when the garbage collector won’t slow you down anyway. Following this rule, you deliberately act dumber than you are, with the goal of delaying action.

There is a disclaimer, though. There are two different kinds of performance optimization. The first one was referenced in the examples above and deals with actual, but rather local code changes. The second and more important type of performance consideration deals with complexity theory (the one with the big-O notation) and isn’t measured in milliseconds, but in scalability. You don’t want to be ignorant of the latter type: if you implement an exponential or even factorial algorithm, it will ruin your runtime behaviour regardless of any optimization of the former type. You can be ignorant of “real performance tuning”, but you should always be aware of the complexity category your algorithm lives in.

Second rule: Not Yet

There will be a moment when you clearly see an opportunity to improve the runtime performance of your code with just one very small (and very clever) modification. This is when you are ready to break the first rule. Now you should adhere to the second rule: if the cost is as marginal as you say and the gain is profound, go for it – but not now. Performance tuning isn’t a time-limited sale that is offered to you right now or never. You can make the same change and reap the same advantages next week or next month. You doubt that you will remember the details? Write an issue or insert a code comment about it. You probably have another task on your todo list that is more important than speeding up the functionality at hand.

The goal of the first rule was to delay action, and that’s the goal of the second rule, too. You’ve probably guessed it already: you avoid premature optimization best by not optimizing at all or at least not optimizing too early. You need to be sure about the value of an optimization before you implement it. As a result of the second rule, your code will be enriched with possibilities for performance improvement. And if you actually need to improve your performance, you can orient yourself along these possibilities or find them then. You want to invest in the tuning business as late as possible, for it is highly speculative.

Third rule: Measure

If you cannot hold on to the first two rules, for example when a real performance issue is reported, you need to take action. But as you are going to invest work into performance optimization, you might as well invest it efficiently. In most applications there is a 90/10 rule in effect, stating that 90 percent of the runtime is spent in just 10 percent of the code. If you don’t know exactly where your performance bottleneck is, find it using a profiler and remember the 90/10 rule. It is neither efficient nor effective to improve the 90 percent of your code that doesn’t matter in regard to performance.

If you have identified the piece of code that most likely slows your application down, you should remember the second part of the third rule: never make performance optimizations without a meaningful benchmark that you can run beforehand and afterwards. All too often, the clever performance trick you remember from long ago is actually hurting your performance now. A meaningful benchmark will tell you whether you did well. To make a benchmark “meaningful”, you really need to read up on benchmarking on your target platform. In Java, for example, you need to know about proper warm-up of the VM and perform enough cycles to keep one-time effects out of your numbers. If you’ve written such a benchmark, keep it! Try to fully automate it and let it be the cornerstone of your growing performance test suite. There might come the day when this test/benchmark tells you that your formerly clever optimization is now obsolete due to internal platform changes.
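
To illustrate the shape of such a before/after benchmark, here is a minimal sketch using Ruby’s standard Benchmark module with two hypothetical implementations of the same hot code path. bmbm runs a rehearsal pass before the measured pass, which is a crude form of warm-up; on the JVM you would reach for a dedicated harness like JMH instead.

require 'benchmark'

# Hypothetical old and new implementations of the same hot code path
def transform_old(data)
  data.map { |x| x.to_s }.join(',')
end

def transform_new(data)
  data.join(',')  # Array#join converts elements with to_s itself, so the map step is redundant
end

data = Array.new(100_000) { rand(1_000) }

# Rehearsal pass first, then the measured pass
Benchmark.bmbm do |bench|
  bench.report('old') { 10.times { transform_old(data) } }
  bench.report('new') { 10.times { transform_new(data) } }
end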

Conclusion

If you follow these three simple rules, you won’t automatically write high-performance software. But you will spend your valuable time fixing real performance issues instead of tinkering with your code to no effect. You definitely won’t optimize prematurely, and you will steer clear of this “root of all evil”.