Game Optimization Resolved

In my last blog post, I explained a performance problem in my game abstractanks but not how I solved it.

So I had not done any optimization work in a while, so the first thing I did turned out to be an error. And not only in hindsight – I actually knew how to tackle a problem like that – I just temporarily forgot at that point.

Going down the rabbit hole

Where we left off, my profiler showed FriendlyUnitOccupies as the culprit. That function basically does circle/circle collision detection using a quad-tree as the spatial acceleration structure. Looking at the samples from my profiler, I could see that that it was descending into the tree quite deeply. Like all tree structures, a quad-tree does pointer-chasing which is very bad for modern CPUs. So I figured I should look at how to optimize that. The data structure was implemented in a hurry, so there seemed plenty to do:

  • Instead of recursing into each node, use tail-call optimization and early culling to speed up traversal.
  • Pre-cache the query with the max-search radius and the other requirements to the units, e.g. not dead, same team, etc.. and then use that to build a new tree for the actual queries.

Because the data structure was pretty non-generic, I started to basically rewrite it to use it in this scenario. While I was about half way through with that, it dawned on me that I was barking at the wrong tree.

Taking a step back

The excellent book Video Game Optimization has some great advice on which level to attack an optimization problem.

  1. System-level. Can you change the system to do something differently and still solve your problem?
  2. Algorithm-level. Are you using the most efficient right algorithm for the data you have?
  3. Micro-level. Are you not wasting any processing power on the lower levels?

I was already on the algorithm level. So I went back to the systemic level: What if the AI did not try to change the target position that often, maybe just every few seconds? That effectively meant lowering the AIs APM. It’s not a bad solution, especially since that makes the AI behave more human. But on the other hand, real-time games, as the name implies, have a soft real-time requirement. So you generally like to avoid huge workloads that go over your frame budget. With how slow the algorithm was, that could easily be the case. The solution is then to do the work concurrently, either by splitting it up or doing it in the background. Both solutions seemed difficult, since the AI code does currently not allow for easy concurrency. So that idea was out.

What if the parking-positions where cached? Subsequent calls to get parking positions could probably reuse a lot of the positions that were computed in previous frames, given that the target point only moves by a little bit each frame. I figured that might work, but it requires more housekeeping and data-dependencies – the result of the previous query needs to be used for the next. That seemed complex and therefore brittle.

A Solution?

Temporal coherency was a pretty good idea though, but not the scale was to big this time. What if I exploited it within a single frame? Now the original code did obscure this, but maybe it gets a little more clear if I write it like this:

optional<v2> GameWorld::FindFreePosition(v2 Center, std::vector<v2> const& Occupied)
{
  auto CheckPosition = [&](v2 Candiate)
  {
    if (!IsPassable(Candidate))
      return false;

    if (OverlapsWith(Occupied))
      return false;

    return !FriendlyUnitOccupies(Candidate);  
  };
  auto Samples = SampledPositions(Center, SomeRandomness());
  auto Found = find_if(SampledPositions.begin(), SampledPositions.end(), CheckPosition(Position));
  
  return (Found != SampledPositions.end()) ? *Found : none;
}

Now as I explained in the previous post, this was called in a loop for each unit to be parked.

std::vector<v2> GameWorld::FindParkingPositions(v2 Center, std::size_t N)
{
  std::vector<v2> Results;
  for (std::size_t i = 0; i < N; ++i)
  {
    auto MaybePosition = FindFreePosition(Center, Results);
    if (!MaybePosition) // No more free space?
      break;
    Results.push_back(*MaybePosition);
  }
  return Results;
}

Easy to see: counting the number of CheckPosition calls, this algorithm is O(n) in number of sampled positions. The number of sampled positions depends linearly on the number of units to be parked, because more units obviously need more parking positions, essentially making this O(n²) for the unit count! But the positions get resampled for each unit – with the only change being the little bit of randomness that is injected everytime. In other words, each call would just test false for sampled positions roughly corresponding to the units that are already placed.

So what I did was a very small change: only inject the randomness once and merge the loops:

auto Samples = SampledPositions(Center, SomeRandomness());
std::vector<v2> Results;

for (auto const& Sample : Samples)
{
  if (CheckPosition(Sample))
    Results.push_back(Sample);

  if (Result.size() >= N)
    break;
}
return Results;

And this did the trick! The algorithm’s run-time when below the 1ms range, and the smaller variation in randomness is not really visible.

Conslusions

I was thrown off-track be the false conclusion that CheckPositions was too slow when it was in fact just called too often. Context is key! Always approach these things outside-in.
Using less-than-optimal abstractions obscured the opportunity to hoist out the sample generation from me. Iteration is always a separate concern, even when it is not on containers!

A Game Optimization War Story

As our customers surely know, I’m not working here on fridays. This is because that’s the time I allocate to my side project, an arcade real-time strategy game called abstractanks. It is a passion project above all else, but of course, I am also learning a lot, much of which I can apply to my “day job” here as well. Today I want to share the story of how I optimized a critical bit of code in that game.

The Big Slowdown

While working on scripted missions, one main element I am using is to make a group of units attack when you enter an area (a.k.a. a zone-trigger). This seems easy enough, but was causing massive slowdowns as soon as the enemy group started moving. My average logic frame-time jumped from 0.3 ms to more than 1500 ms, which essentially makes the game unplayable. When seeing a performance problem, your first instinct should always be to profile it. So I booted up WPR/WPA and did just that. Once I had the profile, I followed the most-sampled path in the stack and found my way to the supposed culprit: the parking algorithm.

Context

When optimizing, you need as much context as you possible to find the best possible course of action. So let me explain how that algorithm fits into the broader picture.

Parking

My main game-mechanic is moving around your units. You do this by selecting a group and then clicking somewhere on the map to issue the move-order. In addition to path-finding process, this also runs an algorithm I call park-planning (as in parking a car). It makes sure that the units know to position themselves around the target point in a roughly circular shape once they arrive. It is essential to the interaction of this mechanic with the capturing of objectives, which are circular as well. Before this was implemented, the units would just decelerate after passing the target point. This caused them to “overshot” and miss the objectives, which was frustrating to the players: they clicked in the right place, but the units would not stop there, but slightly behind it. To make things worse, units arriving later, would bump into those that were already there, further pushing them away and clumping up.

AI Moving

In my particular case, the AI enemy was repeatedly issuing move-orders to close in on the intruder – the player. Since the player group usually also moved, the AI was trying to adapt by changing the move order every frame (effectively working at around 2000 APMs).

Diving into the code

My park-planning implementation is divided into two steps: finding enough parking spots, and then assigning units to it. The profiler was showing that the first part was the problem while the assignment was negligible in terms of run-time. Historically, the first step was reusing and extending some code I first wrote for spawning units, which worked like this:

optional<v2> GameWorld::FindFreePosition(v2 Center, std::vector<v2> const& Occupied)
{
  auto CheckPosition = [&](v2 Candiate)
  {
    if (!IsPassable(Candidate))
      return false;

    if (OverlapsWith(Occupied))
      return false;

    return !FriendlyUnitOccupies(Candidate);   
  };

  if (CheckPosition(Center))
    return Center;

  auto Radius = UNIT_SIZE;
  while (Radius < MAX_SEARCH_RADIUS)
  {
    // Roll a random starting angle
    auto AngleOffset = RandomAngle();
    auto Angle = 0.f;
    while (Angle < 2*Pi)
    {
      auto Candidate = Center + AngleVector(Angle + AngleOffset)*Radius;
      if (CheckPosition(Candidate))
        return Candidate;

      // Move along this circle
      Angle += 2*Pi*Radius / UNIT_SIZE / OVERSAMPLING_FACTOR;
    }

    // Increase the Radius
    Radius += UNIT_SIZE;
  }
  return none;  
}

Note that all the functions in the CheckPosition lambda are “size aware” and respect the UNIT_SIZE – so they are slightly more complex than what the pseudo-code here would have you believe.
The occupied parameter was added for the parking-position finding. It successively fills up the std::vector with positions and uses them once it found enough.

Back to the profiling results: They were showing that most of the time was spent in the FriendlyUnitOccupies, followed by IsPassable and and then OverlapsWith. FriendlyUnitOccupies dominated the time by about 8x times the rest. That function uses a quad-tree to accelerate spatial queries for other units.

Next steps

Obviously, this code uses pretty simplistic approach to the problem – basically just brute-forcing it. But that’s good now there are many different paths to take, many optimization opportunities. My approach was a relatively simple change that got the frame time back down below 1 ms, but before I did that, I considered many and tested a few other different approaches. I will talk about that in detail in my next post. How would you approach this?

Oversimplified C++ Project FAQ 2018

If you are starting a new C++ project, you’re faced with a few difficult decisions. C++ is not a ‘batteries-included’ language, so you need to pick a few technologies before you can start.
Yet worse, the answer to most of the pressing questions is often ‘it depends’ and changing one of the choices mid-project can be very expensive.
Therefore, I have compiled this list to give totally biased and oversimplified to the most important questions. If you want more nuanced answers, feel free to do your own research.
This is meant to be a somewhat amusing starting point.

FAQ

1. Which OS should I pick?

Linux

Rationale

Usually, not a choice you can make yourself – but if you do: dependency management is easier with a package manager, and it seems to be the most dominant OS in the C++ community. Hence you will get the best support and easiest access to technologies.

2. Which build system should I use?

CMake

Rationale

This is what everyone else is using, and those that are not are a real pain. For better or worse, the market is locked in. With target based properties in modern CMake, it’s not even that bad.

3. Which IDE should I choose?

Visual Studio 2017 on Windows, CLion everywhere else.

Rationale

CLion is getting more robust and feature rich with every release. Native CMake support and really cool refactoring capabilities finally make this a valid contender to Visual Studio’s crown. However, the VS debugger is still the best in the game, so VS still comes out on top on Windows – tho not by a huge margin.

4. Which Language version should you use?

C++14

Rationale

C++17 is not quite there yet with library, tool and platform support. Also, people do not really know how to use it well yet. C++14 builds on the now well-established C++11, which a few rather important “fixes” – and support is ubiquitous.

5. Which GUI toolkit should you use?

Qt

Rationale

No other toolkit comes close in maturity. Qt’s signal/slot system almost seamlessly integrates with C++11 lambdas, making the precompile step needed for SLOTs a non-issue. Barring the license costs for closed-source projects, there is really no reason not to use it.

6. Should you use Boost?

No

Rationale

Boost is a huge and clunky dependency that will explode your build times as soon as you even touch it. And it’s ‘viral’ enough that you can distinguish a Boost project from a non-Boost project. Boost.Optional, Boost.Variant and Boost.Filesystem prepare you for a smooth transition to C++17, but there are other more lightweight alternatives available.

Closing thoughts

There you have my totally biased opinion but hopefully entertaining. YMWV, but I think this is a good starting point if you don’t want to exeriment too much.

C++17: The two line visitor explained

If you have ever used an “idiomatic” C++ variant datatype like Boost.Variant or the new C++17 std::variant, you probably wished you could assemble a visitor to dispatch on the type by assembling a couple of lambda expressions like this:

auto my_visitor = visitor{
  [&](int value) { /* ... */ },
  [&](std::string const& value) { /* ... */ },
};

The code in question

While reading through the code for lager I stumbled upon a curious way to to make this happen. And it is just two lines of code! Wow, that is cool.

template<class... Ts> struct visitor: Ts... { using Ts::operator()...; };
template<class... Ts> visitor(Ts...) -> visitor<Ts...>;

A comment in the code indicated that the code was copied from cppreference.com where I quickly found the source on the page for std::visit, albeit with the different name “overloaded”. There were, however, no comments as to how this code worked.

Multiple inheritance to the rescue

Lambda expressions in C++ are just syntactic sugar for callables, pretty much like a struct with an operator(). As such, you can derive from them. which is what the first line does.
It uses variadic templates and multiple inheritance to assemble the types of the lambdas into one type. Without the content in the struct body, an instantiation with our example would be roughly equivalent to this:

struct int_visitor {
  void operator()(int value)
  {/* ... */}
};

struct string_visitor {
  void operator()(std::string const& value)
  {/* ... */}
};

struct visitor : int_visitor, string_visitor {
};

Using all of it

Now this cannot yet be called, as overload resolution (by design) does not work across different types. Hence the using in the structs body. It pulls the operator() implementations into the visitor type where overload resolution can work across all of them.
With it, our hypothetical instantiation becomes:

struct visitor : int_visitor, string_visitor {
  using int_visitor::operator();
  using string_visitor::operator();
};

Now an instance of that type can actually be called with both our types, which is what the interface for, e.g. std::visit demands.

Don’t go without a guide

The second line intruiged me. It looks a bit like a function declaration but that is not what it is. The fact that I had to ask in the (very helpful!) C++ slack made me realize that I did not keep up with the new features in C++17 as much as I would have liked.
This is, in fact, a class template argument deducation (CTAD) guide. It is a new feature in C++17 that allows you do deduce template arguments for a type based on constructor parameters. In a way, it supercedes the Object Generator idiom of old.
The syntax is really quite straight-forward. Given a list of constructor parameter types, resolve to a specific template instance based on those.

Constructing

The last piece of the puzzle is how the visitor gets initialized. The real advantage of using lambdas instead of writing the struct yourself is that you can capture variables from your context. Therefore, you cannot just default-initialize most lambdas – you need to transport its values, its bound context.
In our example, this uses another new C++17 feature: extended aggregate initialization. Aggregate initialization is how you initialized structs way back in C with curly-brackets. Previously, it was forbidden to do this with structs that have a base class. The C++17 extension now lifts this restriction, thus making it possible to initialize this visitor with curly brackets.

Edit 2018/04/16: The people on r/cpp rightfully pointed out that using the “other name” in the code snippet was confusing – so the visitor is now called “visitor”.

4 Tips for better CMake

We are doing one of those list posts again! This time, I will share some tips and insights on better CMake. Number four will surprise you! Let’s hop right in:

Tip #1

model dependencies with target_link_libraries

I have written about this before, and this is still my number one tip on CMake. In short: Do not use the old functions that force properties down the file hierarchy such as include_directories. Instead set properties on the targets via target_link_libraries and its siblings target_compile_definitions, target_include_directories and target_compile_options and “inherit” those properties via target_link_libraries from different modules.

Tip #2

always use find_package with REQUIRED

Sure, having optional dependencies is nice, but skipping on REQUIRED is not the way you want to do it. In the worst case, some of your features will just not work if those packages are not found, with no explanation whatsoever. Instead, use explicit feature-toggles (e.g. using option()) that either skip the find_package call or use it with REQUIRED, so the user will know that another lib is needed for this feature.

Tip #3

follow the physical project structure

You want your build setup to be as straight forward as possible. One way to simplify it is to follow the file system and and the artifact structure of your code. That way, you only have one structure to maintain. Use one “top level” file that does your global configuration, e.g. find_package calls and CPack configuration, and then only defers to subdirectories via add_subdirectory. Only for direct subdirectories though: if you need extra levels, those levels should have their own CMake files. Then build exactly one artifact (e.g. add_executable or add_library) per leaf folder.

Tip #4

make install() an option()

It is often desirable to include other libraries directly into your build process. For example, we usually do this with googletest for our unit test. However, if you do that and use your install target, it will also install the googletest headers. That is usually not what you want! Some libraries handle this automagically by only doing the install() calls when they are the top level project. Similar to the find_package tip above, I like to do this with an option() for explicit user control!

Generating done

That is it for today! I hope this is helps and we will all see better CMake code in the future.

A Tale of Two Languages

Recently, I presented my mysteriously titled talk “A Tale of Two Languages” at our local C++ user group. Before the talk, I was not really sure whether it would resonate with my audience. But it did, and helped to engage people in a healthy discussion about how to use C++.

Essentially, the talk was about how I am using two different modes or dialects of C++ to write and maintain applications. The title suggests two languages – and it sure can be thought of that way, but for now I’m using the word “modes” to distinguish it from the term programming languages.

You write the application in one mode, while keeping the style relatively easy. In the other mode, you make sure that you can write easy and efficient code in the other, while leveraging the full power of C++.

I call the first application mode and the second library mode.

Library mode

As I said before, this the power mode. One of C++’s design paradigms is self-extension. You are extending the language from the language itself. It’s a very powerful mechanism, the same one that drives the standard library. It’s also why C++ does not have the need for a built-in string type.

This power is a bit of a double-edged sword. On the one hand, it allows you to adapt the language specifically for your needs, for example with application specific value types. For a 3D application, a well designed 2d vector or point type will make your code easier and probably faster. On the other hand, a badly designed type on this level will haunt your application for years to come. I have seen both.

That’s a simple example though. More powerful primitives, such as domain-specific-language like constructs also belong into this mode. In general, things in this mode are less discoverable and less maintainable, but they strive to improve discoverability, efficiency and maintainability on the other side. As a consequence, this code needs more stringent documentation and specification.

Application mode

This is the mode that you use to write the majority of your application. Application mode is all about agility and leveraging opportunities. You intentionally restrict yourself in order to keep your development speed up. Simplicity trumps most other qualities in this mode. If you need another quality to be the defining factor, for example because you need some code to be a little more complex in order to run faster, you should put it into library mode instead.

Unlike code in library mode, this code needs to speak for itself. Therefore, documentation is usually nothing but a duplication.

One important aspect is that this code should be devoid of all subtleties.

Parameter passing and its consequences

That last bit is especially uncommon in C++, where most decisions are really a catch-22. Hence the resulting code hints at the struggles endured while writing it.

For example, to write an efficient function in C++, you need to decide whether to pass each parameter by value, or by reference, or by a pointer. The decision on which to use depends largely on your implementation, i.e. what you are doing with the parameter after it was passed to the function. That usually couples your implementation too tightly to its interface and degrades programmer productivity by giving too many options.

Using a shared-ownership smart-pointer such as std::shared_ptr by default is a good middle ground here. It does the right thing most of the time and is not to far off at most other times. Many other mainstream languages, such as python, go this route. Some frameworks, such as Qt, use that semantic as well.
Like const-correctness, passing all parameters in a std::shared_ptr is viral. Object thus passed need to be created on a the stack, preferably with std::make_shared. You will also store those smart pointers in other objects, so shared_ptr will have quite a lot of screen space. Therefore I usually make an alias:

template using Ptr = std::shared_ptr;

If it’s going to be the default, it should not clutter your code. Since objects are transported in a Ptr by default, they usually do not even need a copy constructor or other “value-like” semantics. These objects are less about maintaining invariants, and more about implementing abstract interfaces and bundling functionality in maintainable chunks. I usually use boost::noncopyable to mark them, though Herb Sutter’s Metaclasses proposal could make this even nicer.
Note that you can still promote them to value types in library mode, should the need arise. But they will become more costly to maintain.

Other simplifications

There are plenty of other things to avoid in application mode. Writing templated types makes your code inherently non-local and dependent on a type that can be anywhere. Note that instantiation of templates from library mode is fine – at that point, all the facts are known.

Another thing that makes your code non-local, and therefore unfitting for application mode, is overloading. Especially in the presence of ADL. For example, which functions are in your actual overload set depends on which headers you include and which using-directives and declarations are active. Sometimes, that is desirable. But not in application mode.

Resolution

Since using this “two modes” approach, I have found that my productivity is much higher – even in older code that went through a lot of evolution. The code does not actually get a lot slower, even with all the smart pointers. In fact, I am sure that I could only optimize a few cases because the design in application mode is a lot more flexible, and the structure more visible.

C++ modules example

Two weeks back, I blogged about C++ modules, and why we so desperately need them. Now I have actually played with the implementation in Visual Studio 2017, and I want to share my findings.

The Files

My example consists of four files in two “components”, i.e. one library and one executable. The executable only has one file, main.cpp:

import pets;
import std.core;
import std.memory;

int main()
{
  std::unique_ptr<Pet> pet = std::make_unique<Dog>();
  std::cout << "Pet says: "
    << pet->pet() << std::endl;
}

The library consists of three files. First is pet.cpp, which contains the abstract base class for all pets:

import std.core;
module pets.pet;

export class Pet
{
public:
  virtual char const* pet() = 0;
};

Then there is dog.cpp – our only concrete implementation of that base class (yes, I’m not a cat person).

module pets.dog;
import std.core;
import pets.pet;

export class Dog : public Pet
{
public:
  char const* pet() override;
};

char const* Dog::pet()
{
  return "Woof!";
}

Notice they each define their own submodule. Finally, there is interface.cpp, which just cobbles those submodules together into one single “parent” module:

module pets;

export module pets.pet;
export module pets.dog;

You can get the full source code including the CMake setup at our github repository. I was not able to get the standard library path setup automated so far, so you probably want to adjust that.

Discussion

There are no headers at all, which was one of my goals of laying it out like this. I think that alone means an enormous increase in productivity.

The information that was previously in the header files is now in .ifc files that the microsoft compiler generates automatically from the module definitions.
When trying this out, a couple of things stood out to me:

  • Intellisense does not work with the new keywords yet.
  • The way I used it, interface.cpp needs to be compiled after pet.cpp and dog.cpp, so that the appropriate .ifc file exists. Having an order dependency like that within a single library is a new challenge.
  • I was not able to use the standard lib in the library. That would compile, but not link.
  • Not having to duplicate the function declaration feels very strange in C++.
  • There are a lot of paradigm changes required. For example, include paths are a thing of the past – you will need to configure correct module search paths in the future.
  • We will need to get the naming straight: right now, “modules” is already used as a “distinct software component”. The new meaning is similar, but still competes with it. since the granularity is no longer so flexible. I already started using “components” as a new word for the former.

What are your experiences with modules so far? Do you have another way of composing modules? I really like to hear about it! I think the biggest challenge right now is how to use these new possibilities to improve the design of bigger C++ projects.