The “parameter self-destruction” bug

A few days ago, I got a bug report for a C++ program about a weird exception involving invalid characters in a JSON format. Now getting weird stuff back from a web backend is not something totally unexpected, so my first instinct was to check whether any calls to the parser did not deal with exceptions correctly. To my surprise, they all did. So I did what I should have done right away: just try to use the feature were the client found the bug. It crashed after a couple of seconds. And what I found was a really interesting problem. It was actually the JSON encoder trying to encode a corrupted string. But how did it get corrupted?

Tick, tick, boom..

The code in question logs into a web-service and then periodically sends a keep-alive signal with the same information. Let me start by showing you some support code:


class ticker_service
{
public:
  using callable_type = std::function<void()>;
  using handle = std::shared_ptr<callable_type>;

  handle insert(callable_type fn)
  {
    auto result = std::make_shared<callable_type>(
      std::move(fn));
    callables_.push_back(result);
    return result;
  }

  void remove(handle const& fn_ptr)
  {
    if (fn_ptr == nullptr)
      return;

    // just invalidate the function
    *fn_ptr = {};
  }

  void tick()
  {
    auto callable_invalid =
      [](handle const& fn_ptr) -> bool
    {
      return !*fn_ptr;
    };

    // erase all the 'remove()d' functions
    auto new_end = std::remove_if(
      callables_.begin(),
      callables_.end(),
      callable_invalid);

    callables_.erase(new_end, callables_.end());

    // call the remainder
    for (auto const& each : callables_)
      (*each)();
  }

private:
  std::vector<handle> callables_;
};

This is dumbed down from the real thing, but enough to demonstrate the problem. In the real code, this only runs the functions after a specific time has elapsed, and they are all in a queue. Invalidating the std::function serves basically as “marking for deletion”, which is a common pattern for allowing deletion in queue or heap-like data structure. In this case, it just allows to mark a function for deletion in constant time, while the actual element shifting is “bundled” in the tick() function.

Now for the code that uses this “ticker service”:

class announcer_service
{
public:
  explicit announcer_service(ticker_service& ticker)
  : ticker_(ticker)
  {
  }

  void update_presence(std::string info)
  {
    // Make sure no jobs are running
    ticker_.remove(job_);

    if (!send_web_request(info))
      return;

    // reinsert the job
    job_ = ticker_.insert(
      [=] {
        update_presence(info);
    });
  }
private:
  ticker_service& ticker_;
  ticker_service::handle job_;
};

The announcer service runs

ticker_service ticker;
announcer_service announcer(ticker);

announcer.update_presence(
  "hello world! this is a longer text.");
ticker.tick();

A subtle change

You might be wondering where the bug is. To the best of my knowledge, there is none. And the real code corresponding to this worked like a charm for years. And I did not make any significant changes to it lately either. Or so I thought.
If I open that code in CLion, Clang-Tidy is telling me that the parameter “info” to update_presence is only used as a reference, and I should consider turning it into one. Well, Clang-Tidy, that’s bad advice. Because that’s pretty much the change I made:

void update_presence(std::string const& info) // <--

And this makes it go boom on the second call to update_presence(), the one from tick(). Whew. But why?

What is happening?

It turns out, even though we are capturing everything by value, the lambda is still at fault here. Or rather, using values that are captured by the lambda after the lambda has been destroyed. And in this case, the lambda actually destroys itself in the call to ticker_service::remove(). In the first call to update_presence(), the job_ handle is still nullptr, turning remove() into a no-op. On the second call however, remove() overwrites the std::function that is currently on the stack, calling into update_presence, with a default-constructed value. This effectively deletes the lambda that was put there by the last iteration of update_presence, thereby also destroying the captured info string. Now if info was copied into update_presence, this is not a problem, but if you’re still referencing the value stored in the lambda, this is a typical use-after-free. Ooops. I guess C++ can be tricky sometimes, even if you are using automatic memory management.

How to avoid this

This bug is not unlike changing your container when changing it while iterating over it. Java people know this error from the ConcurrentModificationException. Yes, this is possible, if you are really really careful. But in general, you better solve this bug by defering your container modification to a later point after you’re done iterating. Likewise, in this example, the std::function that is currently executing is being modified while it is executing.
A good solution is to defer the deletion until after the execution. So I argue the bug is actually in the ticker_service, which is not as safe as it can be. It should make sure that the lambda survives the complete duration of the call. An easy, albeit somewhat inefficient, approach would be copying the std::function before calling it. Luckily, in the real code, the functions are all just executed once, so I could std::move them to a local variable before executing.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s

This site uses Akismet to reduce spam. Learn how your comment data is processed.