A few days ago, I got a bug report for a C++ program about a weird exception involving invalid characters in a JSON format. Now getting weird stuff back from a web backend is not something totally unexpected, so my first instinct was to check whether any calls to the parser did not deal with exceptions correctly. To my surprise, they all did. So I did what I should have done right away: just try to use the feature were the client found the bug. It crashed after a couple of seconds. And what I found was a really interesting problem. It was actually the JSON encoder trying to encode a corrupted string. But how did it get corrupted?
Tick, tick, boom..
The code in question logs into a web-service and then periodically sends a keep-alive signal with the same information. Let me start by showing you some support code:
class ticker_service { public: using callable_type = std::function<void()>; using handle = std::shared_ptr<callable_type>; handle insert(callable_type fn) { auto result = std::make_shared<callable_type>( std::move(fn)); callables_.push_back(result); return result; } void remove(handle const& fn_ptr) { if (fn_ptr == nullptr) return; // just invalidate the function *fn_ptr = {}; } void tick() { auto callable_invalid = [](handle const& fn_ptr) -> bool { return !*fn_ptr; }; // erase all the 'remove()d' functions auto new_end = std::remove_if( callables_.begin(), callables_.end(), callable_invalid); callables_.erase(new_end, callables_.end()); // call the remainder for (auto const& each : callables_) (*each)(); } private: std::vector<handle> callables_; };
This is dumbed down from the real thing, but enough to demonstrate the problem. In the real code, this only runs the functions after a specific time has elapsed, and they are all in a queue. Invalidating the std::function
serves basically as “marking for deletion”, which is a common pattern for allowing deletion in queue or heap-like data structure. In this case, it just allows to mark a function for deletion in constant time, while the actual element shifting is “bundled” in the tick() function.
Now for the code that uses this “ticker service”:
class announcer_service { public: explicit announcer_service(ticker_service& ticker) : ticker_(ticker) { } void update_presence(std::string info) { // Make sure no jobs are running ticker_.remove(job_); if (!send_web_request(info)) return; // reinsert the job job_ = ticker_.insert( [=] { update_presence(info); }); } private: ticker_service& ticker_; ticker_service::handle job_; };
The announcer service runs
ticker_service ticker; announcer_service announcer(ticker); announcer.update_presence( "hello world! this is a longer text."); ticker.tick();
A subtle change
You might be wondering where the bug is. To the best of my knowledge, there is none. And the real code corresponding to this worked like a charm for years. And I did not make any significant changes to it lately either. Or so I thought.
If I open that code in CLion, Clang-Tidy is telling me that the parameter “info” to update_presence
is only used as a reference, and I should consider turning it into one. Well, Clang-Tidy, that’s bad advice. Because that’s pretty much the change I made:
void update_presence(std::string const& info) // <--
And this makes it go boom on the second call to update_presence()
, the one from tick()
. Whew. But why?
What is happening?
It turns out, even though we are capturing everything by value, the lambda is still at fault here. Or rather, using values that are captured by the lambda after the lambda has been destroyed. And in this case, the lambda actually destroys itself in the call to ticker_service::remove()
. In the first call to update_presence()
, the job_
handle is still nullptr
, turning remove()
into a no-op. On the second call however, remove()
overwrites the std::function
that is currently on the stack, calling into update_presence
, with a default-constructed value. This effectively delete
s the lambda that was put there by the last iteration of update_presence
, thereby also destroying the captured info string. Now if info
was copied into update_presence, this is not a problem, but if you’re still referencing the value stored in the lambda, this is a typical use-after-free. Ooops. I guess C++ can be tricky sometimes, even if you are using automatic memory management.
How to avoid this
This bug is not unlike changing your container when changing it while iterating over it. Java people know this error from the ConcurrentModificationException
. Yes, this is possible, if you are really really careful. But in general, you better solve this bug by defering your container modification to a later point after you’re done iterating. Likewise, in this example, the std::function
that is currently executing is being modified while it is executing.
A good solution is to defer the deletion until after the execution. So I argue the bug is actually in the ticker_service
, which is not as safe as it can be. It should make sure that the lambda survives the complete duration of the call. An easy, albeit somewhat inefficient, approach would be copying the std::function
before calling it. Luckily, in the real code, the functions are all just executed once, so I could std::move
them to a local variable before executing.