The “parameter self-destruction” bug

A few days ago, I got a bug report for a C++ program about a weird exception involving invalid characters in a JSON format. Now getting weird stuff back from a web backend is not something totally unexpected, so my first instinct was to check whether any calls to the parser did not deal with exceptions correctly. To my surprise, they all did. So I did what I should have done right away: just try to use the feature were the client found the bug. It crashed after a couple of seconds. And what I found was a really interesting problem. It was actually the JSON encoder trying to encode a corrupted string. But how did it get corrupted?

Tick, tick, boom..

The code in question logs into a web-service and then periodically sends a keep-alive signal with the same information. Let me start by showing you some support code:


class ticker_service
{
public:
  using callable_type = std::function<void()>;
  using handle = std::shared_ptr<callable_type>;

  handle insert(callable_type fn)
  {
    auto result = std::make_shared<callable_type>(
      std::move(fn));
    callables_.push_back(result);
    return result;
  }

  void remove(handle const& fn_ptr)
  {
    if (fn_ptr == nullptr)
      return;

    // just invalidate the function
    *fn_ptr = {};
  }

  void tick()
  {
    auto callable_invalid =
      [](handle const& fn_ptr) -> bool
    {
      return !*fn_ptr;
    };

    // erase all the 'remove()d' functions
    auto new_end = std::remove_if(
      callables_.begin(),
      callables_.end(),
      callable_invalid);

    callables_.erase(new_end, callables_.end());

    // call the remainder
    for (auto const& each : callables_)
      (*each)();
  }

private:
  std::vector<handle> callables_;
};

This is dumbed down from the real thing, but enough to demonstrate the problem. In the real code, this only runs the functions after a specific time has elapsed, and they are all in a queue. Invalidating the std::function serves basically as “marking for deletion”, which is a common pattern for allowing deletion in queue or heap-like data structure. In this case, it just allows to mark a function for deletion in constant time, while the actual element shifting is “bundled” in the tick() function.

Now for the code that uses this “ticker service”:

class announcer_service
{
public:
  explicit announcer_service(ticker_service& ticker)
  : ticker_(ticker)
  {
  }

  void update_presence(std::string info)
  {
    // Make sure no jobs are running
    ticker_.remove(job_);

    if (!send_web_request(info))
      return;

    // reinsert the job
    job_ = ticker_.insert(
      [=] {
        update_presence(info);
    });
  }
private:
  ticker_service& ticker_;
  ticker_service::handle job_;
};

The announcer service runs

ticker_service ticker;
announcer_service announcer(ticker);

announcer.update_presence(
  "hello world! this is a longer text.");
ticker.tick();

A subtle change

You might be wondering where the bug is. To the best of my knowledge, there is none. And the real code corresponding to this worked like a charm for years. And I did not make any significant changes to it lately either. Or so I thought.
If I open that code in CLion, Clang-Tidy is telling me that the parameter “info” to update_presence is only used as a reference, and I should consider turning it into one. Well, Clang-Tidy, that’s bad advice. Because that’s pretty much the change I made:

void update_presence(std::string const& info) // <--

And this makes it go boom on the second call to update_presence(), the one from tick(). Whew. But why?

What is happening?

It turns out, even though we are capturing everything by value, the lambda is still at fault here. Or rather, using values that are captured by the lambda after the lambda has been destroyed. And in this case, the lambda actually destroys itself in the call to ticker_service::remove(). In the first call to update_presence(), the job_ handle is still nullptr, turning remove() into a no-op. On the second call however, remove() overwrites the std::function that is currently on the stack, calling into update_presence, with a default-constructed value. This effectively deletes the lambda that was put there by the last iteration of update_presence, thereby also destroying the captured info string. Now if info was copied into update_presence, this is not a problem, but if you’re still referencing the value stored in the lambda, this is a typical use-after-free. Ooops. I guess C++ can be tricky sometimes, even if you are using automatic memory management.

How to avoid this

This bug is not unlike changing your container when changing it while iterating over it. Java people know this error from the ConcurrentModificationException. Yes, this is possible, if you are really really careful. But in general, you better solve this bug by defering your container modification to a later point after you’re done iterating. Likewise, in this example, the std::function that is currently executing is being modified while it is executing.
A good solution is to defer the deletion until after the execution. So I argue the bug is actually in the ticker_service, which is not as safe as it can be. It should make sure that the lambda survives the complete duration of the call. An easy, albeit somewhat inefficient, approach would be copying the std::function before calling it. Luckily, in the real code, the functions are all just executed once, so I could std::move them to a local variable before executing.

The best of both worlds: scoped_flags

C++11 introduced a pretty nice change to enum types in C++, the scoped enumeration. They mostly supersede the old unscoped enumeration, which was inherited from C and had a few shortcomings. For example, the names in the enumeration where added to its parent scope. This means that given an enum colors {red, green blue}; you can simply say auto my_color = red;. This can, of course, lead to ambiguities and people using some weird workarounds like putting the enums in namespaces or prefixing all elements á la hungarian-notation. Also, unscoped enumerations are not particularly type-safe: they can be converted to integer types and back without any special consideration, so you can write things like int x = red; without the compiler complaining.
Scoped enumerations improves both theses aspects: with enum class colors {red, green, blue};, you have to use auto my_color = colors::red; and int x = colors::red; will simply not compile.
To get the second part to compile, you need to insert a static_cast: int x = static_cast(colors::red); which is purposefully a lot more verbose. Now this is a bit of a blessing and a curse. Of course, this is a lot more type-safe, but it make one really common usage pattern with enums very cumbersome: bit flags.

Did this get worse?

While you could previously use the bit operators to combine different bitmasks defined as enums, scoped enumerations will only let you do that if you cast them first. In other words, type-safety prevents us from combining flags because the result might, of course, no longer be a valid enum.
However, we can still get the convenience and compactness of bit flags with a type that represents combinations bitmasks from a specific enum type. Oh, this reeks of a template. I give you scoped_flags, which you can use like this:

enum class window_flags
{
  has_border = 1 << 0,
  has_caption = 1 << 1,
  is_child = 1 << 2,
  /* ... */
};
void create_window(scoped_flags<window_flags> flags);

void main()
{
  create_window({window_flags::has_border, window_flags::has_caption});
}

scoped_flags<window_flags> something = /* ... */

// Check a flag
bool is_set = something.test(window_flags::is_child);

// Remove a flag
auto no_border = something.without(window_flags::has_border);

// Add a flag
auto with_border = something.with(window_flags::has_border);

Current implementation

You can find my current implementation on this github gist. Even in its current state, I find it a niftly little utility class that makes unscoped enumerations all but legacy code.
I opted not to replicate the bitwise operator syntax, because &~ for “without” is so ugly, and ~ alone makes little sense. A non-explicit single-argument constructor makes usage with a single flag as convenient as the old C-style variant, while the list construction is just a tiny bit more complicated.
The implementation is not complete or final yet; for example without is missing an overload that gets a list of flags. After my previous adventures with initializer_lists, I’m also not entirely sure whether std::initializer_list should be used anywhere but in the c’tor. And maybe CTAD could make it more comfortable? Of course, everything here can be constexpr‘fied. Do you think this is a useful abstraction? Any ideas for improvements? Do tell!

std::initializer_list considered evil

I am so disappointed in you, std::initializer_list. You are just not what I thought you were.

Lights out

While on the train to Meeting C++ this year, I was working on the lighting subsystem of the 3D renderer for my game abstractanks. Everything was looking fine, until I switched to the release build. Suddenly, my sun light went out. All the smaller lights were still there, it just looked like night instead of day.
Now stuff working in Debug and not working in Release used to be quite common and happens when you’re not correctly initializing built-in variables. So I went digging, but it was not as easy as I had thought. Several hours later, I tracked the problem down to my global light’s uniform buffer initialization code. This is a buffer that is sent to the GPU so the shaders can read all the lighting information. It looked like a fairly innocent for-loop doing byte-copies of matrices and vectors to a buffer:

using Pair = std::pair;
auto Mapping = std::initializer_list{
  {ShadowMatrix.ptr(), MATRIX_BYTE_SIZE},
  {LightDirection.ptr(), VECTOR4_BYTE_SIZE},
  {ColorAndAmbient.ptr(), VECTOR4_BYTE_SIZE}
};

std::size_t Offset = 0;
for (auto const& Each : Mapping)
{
  mUniformBuffer.SetSubData(GL_UNIFORM_BUFFER, Each.second, Offset, Each.first);
  Offset += Each.second;
}

The Culprit

After mistakenly blaming alignment issues for a while, I finally tried looking at the values of Each.second and Each.first. To my surprise, they were bogus. Now what is going on there? It turns out not writing this in almost-always-auto style, i.e. using direct- instead of copy-initialization fixes the problem, so there’s definitely a lifetime issue here.

Looking at the docs, it became apparent that std::initializer_list is indeed a reference-type that automatically creates a value-type (the backing array) internally and keeps it alive exactly as binding a reference to that array would. For the common cases, i.e. when std::initializer_list is used as a parameter, this is fine, because the original list lives for the whole function-call expression. For the direct-initialization case, this is also fine, since the reference-like lifetime-extension kicks in. But for copy-initialization, the right-hand-side is done after the std::initializer_list is copied. So the backing array is destroyed. Oops.

Conclusion and alternatives

Do not use std::initializer_list unless as a function parameter. It works well for that, and is surprising for everything else. In my case, a naive “extract variable” refactoring of for (auto const& each : {a, b, c}) { /* ... */ } led me down this rabbit hole.
My current alternative is stupidly simple: a built-in array on the stack:

using Pair = std::pair;
Pair Mapping[]{
  {ShadowMatrix.ptr(), MATRIX_BYTE_SIZE},
  {LightDirection.ptr(), VECTOR4_BYTE_SIZE},
  {ColorAndAmbient.ptr(), VECTOR4_BYTE_SIZE}
};

It does the same thing as the “correct” version of the std::initializer_list, and if you try to use it AAA-style, at least clang will give you this nice warning: warning: temporary whose address is used as value of local variable 'Mapping' will be destroyed at the end of the full-expression [-Wdangling]

Transposition as a programming technique

If you have been programming for a while, you will probably, and hopefully, agree that it is preferable to have a sequence of functions as opposed to the same number of functions nested. In other words, call-graph breadth is better than depth. Among other reasons, a “linear” set of instructions is often easier to follow, which is better for humans, and also tends to not go haywire with what memory it touches, which is better for computers.
However, deep call hierarchies occur much more than I would like. I have seen call stacks well beyond 200 functions deep. But this need not be – one can often be turned into the other by transposition. Transposition derives from the latin transponere, which roughly means “to put across”. With matrices, it means swapping rows and columns. Similarly, we can swap call-hierarchy depth for breath.

The example

A couple of months ago, I was tasked with programming a standing-wave display of power-line voltage curves. As you might know, the signal is roughly sine-wave shaped at about 50Hz. The signal is captured in time windows of 200ms, i.e. there’s a new packet of data 5 times a second with 10 sine cycles in it. However, the frequency value jitters just enough to make the signal drift a bit in the 200ms window, i.e. the wave moves forwards and backwards a little bit. The standing-wave feature tries to remove that drift and make it seemingly stationary in our fixed time window, so changes in amplitude become more visible.

Algorithm 1

The idea seems simple enough for just one signal:

  1. In the previous wave, search backwards to find the spot where the wave crosses from positive to negative.
  2. Take the previous wave from that point on and stitch it together with the current one, and cut that off at 200ms of data.

But there is not just one signal, there can be hundreds. And they should all be aligned to one designated “master” signal. So now we add extra steps:

  1. For all other signals, find the wave packets overlapping (in time) with our new stiched wave packet.
  2. Order them, and stitch them to a new wave packet covering exactly the same time window.

Now even in this version, finding the right packets for a time interval can be more tricky than it seems, because the values for the signals come in irregularly and can be shifted significantly. So you can just buffer of the last N (5?) packets for each signal and search in there. Still, one more requirement remains. For the display of archived data, the algorithm should work on batches of waves, i.e. many seconds worth, which made step 3 harder by extending the search space. So add:

  1. For each previous and current pair in a given time-interval:

Now the whole thing was pretty much implemented with steps 0 to 4 being functions calling into the next step, with major loops on the 0th and 3rd step. The wave data flows through these implementation layers vertically, i.e. from step 0 to step 4 and back, but the control flow of the program does not. It flows perpendicular to it, horizontally, solely controlled by the outer-most loop. It is intuitive to write it this way – after all, the control flow follows the flow of time in the data we are processing, but the code was not particularly easy, especially with the search in step 3 becoming unnecessarily complex.

Algorithm 2

Now let us try transposing this, and match the flow of data with our control flow:

  1. Gather all relevant signals for the time interval and sort their packets.
  2. Extract all the “stitching” time codes from the master signal.
  3. For all signals, traverse pairs together with the time codes and stitch accordingly.

The whole process becomes more digestible, and processing the data in stages made it obvious that sorted data makes using a “merge” type algorithm very easy.
Both algorithms use the same data, but the second makes it explicit, while the first just passes it through the call-stack in chunks.

Conclusion

I have since used this idea of “transposition” a few times to clean up and simplify my designs. It seems especially helpful when trying to decouple messaging from bulk processing.
The idea of looking at the data flow and adapting the control flow to match it, is central to data-oriented design. I argue that while this can be used to optimized programs, transposition is mainly a tool to make programs simpler, which can then lead to optimization. Separating processing into stages is also very similar to loop-fission.
Have you used a technique like this before? Do you, perhaps, know it by another name? Let me know!

xBase gotchas

For the last year, I have regularly worked on a legacy project written in xBase / Clipper, which is a dialect of the dBase programming language. The language as a whole is surprisingly comfortable for its age, being 40 years old now. But there were a few quite infuriating things I stumbled upon:

Trailing commas in arrays

The language allows you to define array contents inline like this:

myArray := { 1, "foo", .t., 4.0 }

It is also dynamically typed, as you can see in this example – where the array holds numbers, strings and boolean (logical) values. Now sometimes, you want to spread-out an array initialization over multiple lines. xBase lets you do that – using the semi-colon quite contrary to many other languages – to continue a line:

myArray := {;
  "foo",;
  "bar";
}

Now you might be temped, as I was, to write it like this instead:

myArray := {;
  "foo",;
  "bar",;
}

With a trailing comma. But, gotcha, that does not do what you probably expect. This actually creates an array with 3 elements, where the last element is nil.

Unequal is not not-equals

xBase has both the == and the != operators. But imagine my suprise when and if a != b was not entered, with what had to be unequal strings. I added some logging (yea, no debugger) and saw that I indeed had the strings "" and "foo". Still, a != b seemed to evaluate to false. Because I could not believe that, I changed it to !(a == b) – and it worked!

It turns out that the != operator is not the opposite of the == operator, but rather of the = operator (xBase uses := for assignment). The = operator, however, implements some kind of “fuzzy” equals by default, also evaluating true if the left operand is a prefix of the right operand. This is a behaviour that can be changed globally by “SET EXACT on”. Wow.

Have unregistered classes throw with the unity DI container

The unity container (not to be confused with game engine) is one of the most popular dependency injection tools for C#.
However, by default the unity container will try to Resolve() all classes that it can. If you do not register a class, it will will often just succeed anyways.
I much prefer explicitly registering classes, and resolution just throwing and exception if I try to resolve something I did not register.
There’s a viable solution for that on stackoverflow, but it fails to throw when trying to resolve a class that was only registered via its interface.
Here’s our fixed version:

public class UnityRegistrationTracking : UnityContainerExtension
{
  private readonly ConcurrentDictionary<NamedTypeBuildKey, bool> registrations =
    new ConcurrentDictionary<NamedTypeBuildKey, bool>();

  protected override void Initialize()
  {
    base.Context.Registering += Context_Registering;
    base.Context.Strategies.Add(
        new ValidateRegistrationStrategy(this.registrations), UnityBuildStage.Setup);
  }

  private void Context_Registering(object sender, RegisterEventArgs e)
  {
    var buildKey = new NamedTypeBuildKey(e.TypeFrom, e.Name);
    this.registrations.AddOrUpdate(buildKey, true,
      (key, oldValue) => true);
  }

  public class ValidateRegistrationStrategy : BuilderStrategy
  {
    private ConcurrentDictionary<NamedTypeBuildKey, bool> registrations;

    public ValidateRegistrationStrategy(ConcurrentDictionary<NamedTypeBuildKey, bool> registrations)
    {
      this.registrations = registrations;
    }

    public override void PreBuildUp(ref BuilderContext context)
    {
        var buildKey = new NamedTypeBuildKey(context.RegistrationType, context.Name);
        if (!this.registrations.ContainsKey(buildKey))
        {
          throw new ResolutionFailedException(buildKey.Type, buildKey.Name,
            string.Format("Type {0} was not explicitly registered in the container.", buildKey.Type.Name));
        }
    }
  }
}

We hook into two parts of the unity API here:

  1. The registration, which is called when you call Unity.RegisterType
  2. The resolution process, which is called when unity tries to resolve a specific instance.

The first part happens in Context_Registering. We just store the registration in dictionary for later. It is important to use TypeFrom as a key, since we want to refer to objects by the interfaces they are registered with, not their concrete implementations.
The second part is the ValidateRegistrationStrategy. All registered BuilderStrategy objects go in a list that is processed when an object is built. The UnityBuildStage.Setup acts as a sorting key, to make sure that this strategy is executed as early as possible.
In the strategy, we check whether the requested type was previously registered, and throw an exception if it was not. It is important to use context.RegistrationType here, since context.Type will again contain the concrete type, and not the interface.

Non-determinism in C++

A deterministic program, when given the same input, will always result in the same output. This intuitive, albeit quite fuzzily defined, property is often times pretty important for correct program. Sources of non-determinism can be quite subtle – and once they creep into your program, they can propagate and amplify and have enormous consequences. It is pretty much the well-known butterfly effect.

When discussing this problem, it is important to know what exactly makes up the input and the output of the program. For example, when logging times to a logfile, and considering this an actual output, no two runs will ever be the same – so this is usually not considered an output relevant for determinism. Which brings us to the first common source of non-determinism:

Time

If any part of your program depends on the time it is run at, it will be easily be non-deterministic. Common cases are using the time to initializing some variable depending on the time, or using the time for some kind of numerical integration, like computing a value over time. Also, using execution time as an output respective for determinism is hopeless on a normal desktop computer – but can be crucial for a real-time system.

Random number generation

Random number generation seems like an obvious candidate, yet most random number generators are not really random, but only pseudo-random. For example, std::mersenne_twister_engine will generate the same sequence of values every time, when initialized with the same seed. So do not initialize this with a non-deterministic input like the time, and it will be predictable. However, std::random_device might not share this property and give you fresh non-deterministic input. As a weird middle ground, std::default_random_engine will probably give you the same results when compiled with the same compiler/standard-lib, but on another compiler version or OS, it will not. Subtle.

The allocator

Another source of non-determinism that is pretty tricky is the allocator. For example, consider the following piece of code:

template <class T>
T sum(std::set<Thingy*> const& set)
{
  T result{};
  for (auto const& each : set)
    result += each->value();
  return result;
}

Is this deterministic or not? It depends. Now let’s assume that all the Thingys were allocated using standard new. In that case, the actual pointers, Thingy* are non-deterministic, and hence the order of the Thingy*s in the set is random. But does this matter? Well if T is std::uint32_t, it does not. Order in addition does not matter for unsigned integers, even with overflows. However, if T is float, then it does matter and the whole result becomes unpredictable, at least in the general case (it will even be predictable, if e.g. all the numbers in the computation are integers that are exactly representable as floats). Other languages have “insertion-ordered” containers to get around this problem. A sensible approximation in C++ is to use the (unordered_)set and (unordered_)map containers together with another list to iterate on.

The thread scheduler

When you cannot really control the order of instructions, which is really the whole point of threading, you will have a harder time making things deterministic. Like the allocator problem, this is usually also paired with floating-point arithmetic. The workaround here is to make sure that the order of computation does not influence the final result. One common way around this is to sort the output by a unique criteria. For example, if you use multiple threads to report the intersections of a bunch of line segments, you can later sort them by their position in space.

There’s of course the honorable mention for uninitialized variables, but I’m sure your static analyzer will complain about it. Any interaction with the “outside” of your program, any side-effect, be it filesystems, user input, output or cosmic radiation can lead to non-determinism, so be sure to know the context well enough and plan accordingly to your determinism requirements.