The joy of being a student assistant

Lately I heard a lecture by Daniel Lindner about error codes and why you should avoid using them. I had to smile because it reminded me of my time as a student assistant, when I worked with some people that had a slightly different opinion on that point. Maybe they enjoyed torturing student assistants, but it seems the most likely to me that they just did not know any better. But let‘s start at the beginning.

One day the leader of my research group sent me an excel sheet with patient data and asked me to perform some statistical calculations with the programming language R, that is perfectly suitable for such a task. Therefore I did not expect it to take much time – but it soon turned out that I was terribly wrong. Transforming that excel sheet into something R could work with gave me a really hard time and so I decided to write down some basic rules you should consider when recording such data in the hope that at least some future student assistants won‘t have to deal with the problems I had again.

First I told my program to read in the excel sheet with the patient data, which worked as expected. But when I started to perform some simple operations like calculating the average age of the patients, my computer soon told me things like that:

Warning message: In mean.default(age): argument is not numeric or logical: returning NA

I was a little confused then, because I knew that the age of something or someone is a numeric value. But no matter how often I tried to explain that to my computer: He was absolutly sure that I was wrong. So I had no choice but to have a look at the excel sheet with about 2000 rows and 30 columns. After hours of searching (at least it felt like hours) I found a cell with the following content: Died last week. That is indeed no numeric value, it‘s a comment that was made for humans to read. So here‘s my rule number one:

1. Don‘t use comments in data files

There is one simple reason for that: The computer, who has to work with that data, does not understand it. And (as sad as it is) he does not care about the death of a patient. The only thing he wants to do is to calculate a mean value. And he needs numeric values for it. If you still want to have that comment, just save it somewhere else in another column for comments or in a separate file the computer does not have to deal with when performing statistical calculations.

So I removed that comment (and some others I found) and tried to calculate the mean value again. This time my computer did not complain, the warning message disappeard and for one moment I felt relieved. In the next moment I saw the result of the calculation. The average age of the patients was 459.76. And again I told my computer that this is not possible and again he was sure that I was wrong and again I had to take a look at the excel sheet with the data. Did I mention that the file contained about 2000 rows and 30 columns? However, after a little searching I found a cell with the value -999999. It was immediately clear to me that this was not the real age of the patient, but I wasn‘t able to find out by myself what that value meant. It could have been a typo, however the leader of my research group told me that some people use -999999 as an error code. It could mean something like: „I don‘t know the age of the patient.“ Or: „That patient also died.“ But that was only a guess. So here is my rule number 2:

2. Document your error codes

If there would have been some documentation I would maybe have known what to do with that value. Instead I secretly deleted it, hoping that it was not important to anyone, because unfortunately to my computer -999999 is just a numeric value, not better or worse than any other. So I had to tell him not to use it. But that was only the beginning.

I learned from my previous mistakes and before performing any other statistical calculations I had a look at the whole excel sheet. And it was even more horrible than I expected it to be. If every person who worked with that table would have used the same error code, I could just have written a script that eliminated all -999999 from the sheet and it would have been done. But it seemed that everyone had his own favorite (undocumented) error code. Or if at least there would have been some documentation about the value ranges of each column, I could have told my computer to ignore all values that are not in that range. For something like the age of a patient this is easy, but for other medical data a computer science student like me does not know that can be hard. Is 0 a valid value or does it mean that there is no value? What about 999? So in any case: I had to read the whole table (again: 2000×30 values!) and manually guess for each value if it really was a value or an error code and then tell my computer to ignore it, so he could calculate the right means. I don‘t know exactly how much time that cost, but I‘m sure in the same time I could have read all nine books of The Histories by Herodotus twice, watch every single episode of Gilmore Girls including the four episodes of A Year in the Life and learn Japanese. So finally here‘s my rule number 3 (and the good part about this one is that you can immediately forget about rule number 2):

3. Don‘t use error codes in data files

Really. Don‘t. The student assistants of the future will thank you.

Some strings are more equal before your Oracle database

When working with customer code based on ADO.net, I was surprised by the following error message:

The german message just tells us that some UpdateCommand had an effect on “0” instead of the expected “1” rows of a DataTable. This happened on writing some changes to a table using an OracleDataAdapter. What really surprised me at this point was that there certainly was no other thread writing to the database during my update attempt. Even more confusing was, that my method of changing DataTables and using the OracleDataAdapter to write changes had worked pretty well so far.

In this case, the title “DBConcurrencyExceptionturned out to be quite misleading. The text message was absolutely correct, though.

The explanation

The UpdateCommand is a prepared statement generated by the OracleDataAdapter. It may be used to write the changes a DataTable keeps track of to a database. To update a row, the UpdateCommand identifies the row with a WHERE-clause that matches all original values of the row and writes the updates to the row. So if we have a table with two rows, a primary id and a number, the update statement would essentially look like this:

UPDATE EXAMPLE_TABLE
  SET ROW_ID =:current_ROW_ID, 
      NUMBER_COLUMN =:current_NUMBER_COLUMN
WHERE
      ROW_ID =:old_ROW_ID 
  AND NUMBER_COLUMN =:old_NUMBER_COLUMN

In my case, the problem turned out to be caused by string-valued columns and was due to some oracle-weirdness that was already discussed on this blog (https://schneide.blog/2010/07/12/an-oracle-story-null-empty-or-what/): On writing, empty strings (more precisely: empty VARCHAR2s) are transformed to a DBNull. Note however, that the following are not equivalent:

WHERE TEXT_COLUMN = ''
WHERE TEXT_COLUMN is null

The first will just never match… (at least with Oracle 11g). So saying that null and empty strings are the same would not be an accurate description.

The WHERE-clause of the generated UpdateCommands look more complicated for (nullable) columns of type VARCHAR2. But instead of trying to understand the generated code, I just guessed that the problem was a bug or inconsistency in the OracleDataAdapter that caused the exception. And in fact, it turned out that the problem occured whenever I tried to write an empty string to a column that was DBNull before. Which would explain the message of the DBConcurrencyException, since the DataTable thinks there is a difference between empty strings and DBNulls but due to the conversion there will be no difference when the corrensponding row is updated. So once understood, the problem was easily fixed by transforming all empty strings to null prior to invoking the UpdateCommand.

The “parameter self-destruction” bug

A few days ago, I got a bug report for a C++ program about a weird exception involving invalid characters in a JSON format. Now getting weird stuff back from a web backend is not something totally unexpected, so my first instinct was to check whether any calls to the parser did not deal with exceptions correctly. To my surprise, they all did. So I did what I should have done right away: just try to use the feature were the client found the bug. It crashed after a couple of seconds. And what I found was a really interesting problem. It was actually the JSON encoder trying to encode a corrupted string. But how did it get corrupted?

Tick, tick, boom..

The code in question logs into a web-service and then periodically sends a keep-alive signal with the same information. Let me start by showing you some support code:


class ticker_service
{
public:
  using callable_type = std::function<void()>;
  using handle = std::shared_ptr<callable_type>;

  handle insert(callable_type fn)
  {
    auto result = std::make_shared<callable_type>(
      std::move(fn));
    callables_.push_back(result);
    return result;
  }

  void remove(handle const& fn_ptr)
  {
    if (fn_ptr == nullptr)
      return;

    // just invalidate the function
    *fn_ptr = {};
  }

  void tick()
  {
    auto callable_invalid =
      [](handle const& fn_ptr) -> bool
    {
      return !*fn_ptr;
    };

    // erase all the 'remove()d' functions
    auto new_end = std::remove_if(
      callables_.begin(),
      callables_.end(),
      callable_invalid);

    callables_.erase(new_end, callables_.end());

    // call the remainder
    for (auto const& each : callables_)
      (*each)();
  }

private:
  std::vector<handle> callables_;
};

This is dumbed down from the real thing, but enough to demonstrate the problem. In the real code, this only runs the functions after a specific time has elapsed, and they are all in a queue. Invalidating the std::function serves basically as “marking for deletion”, which is a common pattern for allowing deletion in queue or heap-like data structure. In this case, it just allows to mark a function for deletion in constant time, while the actual element shifting is “bundled” in the tick() function.

Now for the code that uses this “ticker service”:

class announcer_service
{
public:
  explicit announcer_service(ticker_service& ticker)
  : ticker_(ticker)
  {
  }

  void update_presence(std::string info)
  {
    // Make sure no jobs are running
    ticker_.remove(job_);

    if (!send_web_request(info))
      return;

    // reinsert the job
    job_ = ticker_.insert(
      [=] {
        update_presence(info);
    });
  }
private:
  ticker_service& ticker_;
  ticker_service::handle job_;
};

The announcer service runs

ticker_service ticker;
announcer_service announcer(ticker);

announcer.update_presence(
  "hello world! this is a longer text.");
ticker.tick();

A subtle change

You might be wondering where the bug is. To the best of my knowledge, there is none. And the real code corresponding to this worked like a charm for years. And I did not make any significant changes to it lately either. Or so I thought.
If I open that code in CLion, Clang-Tidy is telling me that the parameter “info” to update_presence is only used as a reference, and I should consider turning it into one. Well, Clang-Tidy, that’s bad advice. Because that’s pretty much the change I made:

void update_presence(std::string const& info) // <--

And this makes it go boom on the second call to update_presence(), the one from tick(). Whew. But why?

What is happening?

It turns out, even though we are capturing everything by value, the lambda is still at fault here. Or rather, using values that are captured by the lambda after the lambda has been destroyed. And in this case, the lambda actually destroys itself in the call to ticker_service::remove(). In the first call to update_presence(), the job_ handle is still nullptr, turning remove() into a no-op. On the second call however, remove() overwrites the std::function that is currently on the stack, calling into update_presence, with a default-constructed value. This effectively deletes the lambda that was put there by the last iteration of update_presence, thereby also destroying the captured info string. Now if info was copied into update_presence, this is not a problem, but if you’re still referencing the value stored in the lambda, this is a typical use-after-free. Ooops. I guess C++ can be tricky sometimes, even if you are using automatic memory management.

How to avoid this

This bug is not unlike changing your container when changing it while iterating over it. Java people know this error from the ConcurrentModificationException. Yes, this is possible, if you are really really careful. But in general, you better solve this bug by defering your container modification to a later point after you’re done iterating. Likewise, in this example, the std::function that is currently executing is being modified while it is executing.
A good solution is to defer the deletion until after the execution. So I argue the bug is actually in the ticker_service, which is not as safe as it can be. It should make sure that the lambda survives the complete duration of the call. An easy, albeit somewhat inefficient, approach would be copying the std::function before calling it. Luckily, in the real code, the functions are all just executed once, so I could std::move them to a local variable before executing.

24 hour time format: Difference between JodaTime and java.time

We have been using JodaTime in many projects since before Java got better date and time support with Java 8. We update projects to the newer java.time classes whenever we work on them, but some still use JodaTime. One of these was a utility that imports time series from CSV files. The format for the time stamps is flexible and the user can configure it with a format string like “yyyyMMdd HHmmss”. Recently a user tried to import time series with timestamps like this:

20200101 234500
20200101 240000
20200102 001500

As you can see this is a 24-hour format. However, the first hour of the day is represented as the 24th hour of the previous day if the minutes and seconds are zero, and it is represented as “00” otherwise. When the user tried to import this with the “yyyyMMdd HHmmss” format the application failed with an internal exception:

org.joda.time.IllegalFieldValueException:
Cannot parse "20200101 240000": Value 24 for
hourOfDay must be in the range [0,23]

Then he tried “yyyyMMdd kkmmss”, which uses the “kk” format for hours. This format allows the string “24” as hour. But now “20200101 240000” was parsed as 2020-01-01T00:00:00 and not as 2020-01-02T00:00:00, as intended.

I tried to help and find a format string that supported this mixed 24-hour format, but I did not find one, at least not for JodaTime. However, I found out that with java.time the import would work with the “yyyyMMdd HHmmss” format, even though the documentation for “H” simply says “hour-of-day (0-23)”, without mentioning 24.

The import tool was finally updated to java.time and the user was able to import the time series file.

Running a for-loop in a docker container

Docker is a great tool for running services or deployments in a defined and clean environment. Operations just has to provide a host for running the containers and everything else is up to the developers. They can forge their own environment and setup all the prerequisites appropriately for their task. No need to beg the admins to install some tools and configure server machines to fit the needs of a certain project. The developers just define their needs in a Dockerfile.

The Dockerfile contains instructions to setup a container in a domain specific language (DSL). This language consists only of a couple commands and is really simple. Like every language out there, it has its own quirks though. I would like to show a solution to one I encountered when trying to deploy several items to a target machine.

The task at hand

We are developing a distributed system for data acquisition, storage and real-time-display for one of our clients. We deliver the different parts of the system as deb-packages for the target machines running at the customer’s site. Our customer hosts her own debian repository using an Artifactory server. That all seems simple enough, because artifactory tells you how to upload new artifacts using curl. So we built a simple container to perform the upload using curl. We tried to supply the bash shell script required to the CMD instruction of the Dockerfile but ran into issues with our first attempts. Here is the naive, dysfunctional Dockerfile:

FROM debian:stretch
RUN DEBIAN_FRONTEND=noninteractive apt-get update &amp;&amp; apt-get -y dist-upgrade
RUN DEBIAN_FRONTEND=noninteractive apt-get update &amp;&amp; apt-get -y install dpkg curl

# Setup work dir, $PROJECT_ROOT must be mounted as a volume under /elsa
WORKDIR /packages

# Publish the deb-packages to clients artifactory
CMD for package in *.deb; do\n\
  ARCH=`dpkg --info $package | grep "Architecture" | sed "s/Architecture:\ \([[:alnum:]]*\).*/\1/g" | tr -d [:space:]`\n\
  curl -H "X-JFrog-Art-Api:${API_KEY}" -XPUT "${REPOSITORY_URL}/${package};deb.distribution=${DISTRIBUTION};deb.component=non-free;deb.architecture=$ARCH" -T ${package} \n\
  done

The command fails because the for-shell built-in instruction does not count as a command and the shell used to execute the script is sh by default and not bash.

The solution

After some unsuccessfull attempts to set the shell to /bin/bash using dockers’ SHELL instruction we finally came up with the solution for an inline shell script in the CMD instruction of a Dockerfile:

FROM debian:stretch
RUN DEBIAN_FRONTEND=noninteractive apt-get update && apt-get -y dist-upgrade
RUN DEBIAN_FRONTEND=noninteractive apt-get update && apt-get -y install dpkg curl

# Setup work dir, $PROJECT_ROOT must be mounted as a volume under /elsa
WORKDIR /packages

# Publish the deb-packages to clients artifactory
CMD /bin/bash -c 'for package in *.deb;\
do ARCH=`dpkg --info $package | grep "Architecture" | sed "s/Architecture:\ \([[:alnum:]]*\).*/\1/g" | tr -d [:space:]`;\
  curl -H "X-JFrog-Art-Api:${API_KEY}" -XPUT "${REPOSITORY_URL}/${package};deb.distribution=${DISTRIBUTION};deb.component=non-free;deb.architecture=$ARCH" -T ${package};\
done'

The trick here is to call bash directly and supplying the shell script using the -c parameter. An alternative would have been to extract the script into an own file and call that in the CMD instruction like so:

# Publish the deb-packages to clients artifactory
CMD ["deploy.sh", "${API_KEY}", "${REPOSITORY_URL}", "${DISTRIBUTION}"]

In the above case I prefer the inline solution because of the short and simple script, no need for an additional external file and worrying about how to pass the parameters to the script.

Last night, an end-to-end test saved my live

but not with a song, only with an assertion failure. Okay, I’ll admit, this is not the only hyperbole in the blog title. Let’s set things right: The story didn’t happen last night, it didn’t even happen at night (because I’m not working night hours). In fact, it was a well-lit sunny day. And the test didn’t save my live. It did save me from some embarrassment, though. And spared a customer some trouble and hotfixing sessions with me.

The true parts of the blog title are: An end-to-end test reported an assertion failure that helped me to fix a bug before it got released. In other words: An end-to-end test did its job and I felt fine. And that’s enough of the obscure song references from the 1980s, hopefully.

Let me explain the setting. We develop a big system that runs in production 24/7-style and is perpertually adjusted and improved. The system should run unattended and only require human attention when a real incident happens, either in the data or the system itself. If you look at the system from space, it looks like this:

A data source (in fact, lots of data sources) occasionally transmits data to the system that gets transformed and sent to a third-party system that does all kind of elaborate analysis with it. Of course, there is more to the real system than that, but for our story, it is sufficient to know that there are several third-party systems, each with their own data formats. We’ll call them third-party system A (TPS-A) and TPS-B in this story.

Our system is secured by lots and lots of unit tests, a good number of integration tests and a handful of end-to-end tests. All tests run green all the time. The end-to-end tests (E2ET) try to replicate the production environment as close as possible by running on the target operating system and using data transfer means like in the real case. The test boots up the complete system and simulates the data source and the third-party systems. By configuring the system so that it transfers back to the E2ET, we can write assertions in the kind of “given a specific input data, we expect this particular output data from the system, however it chooses to produce it”. This kind of broad-brush test is invalueable because it doesn’t care for specifics in the system. It only cares for output from the system.

This kind of E2ET is also difficult to write, complicated to run, prone to brittleness and obscure in its failure statement. If you write a three-line unit test, it is straight-forward, fast and probably pretty specific when it breaks: “This one thing that I’m testing is wrong now”. An E2ET as described here is like the Delphi Oracle:

  • E2ET: “Somewhere in your system, something doesn’t work the way I like it anymore.”
  • Developer: “Can you be more specific?”
  • E2ET: “No, but here are 2000 lines of debug output that might help you on your quest for the cause.”
  • Developer: “But I didn’t change anything.”
  • E2ET: “Oh, in that case, it could as well be the weather or a slow network. Mind if I run again?”

This kind of E2ET is also rather slow and takes its sweet time resetting the workspace state, booting the system again and using prolonged timeouts for every step. So you wait several minutes for the E2ET to fail again:

  • E2ET: “Yup, there is still something fishy with your current code.”
  • Developer: “It’s the same code you let pass yesterday.”
  • E2ET: “Well, today I don’t like it!”
  • Developer: “Okay, let me dive into the debug output, then.”

Ten to twenty minutes pass.

  • Developer: “Is it possible that you just choke on a non-free network port?”
  • E2ET: “Yes, I’m supposed to wait for the system’s output on this port and cannot bind to it.”
  • Developer: “You cannot bind to it because an instance of yourself is stuck in the background and holding onto the port.”
  • E2ET: “See? I’ve told you there is something wrong with your system!”
  • Developer: “This isn’t a problem with the system’s code. This is a problem with your code.”
  • E2ET: “Hey! Don’t blame me! I’m here to help. You can always chose to ignore or delete me if I’m too much of a burden to you.”

This kind of E2ET cries wolf lots of times for problems in the test code itself, for too optimistic timeouts or just oddities in the environment. If such an E2ET fails, it’s common practice to run it again to see if the problem persists.

How many false alarms can you tolerate before you stop listing?

In my case, I choose to reduce the amount of E2ET to the essential minimum and listen to them every time. If the test raises a false alarm, invest the time to come up with a way to make it more robust without reducing its sensitivity. But listen to the test every time.

Because, in my story, the test insisted that third-party system A and B both didn’t receive half of their data. I could rule out stray effects in the network or on the harddisk pretty fast, because the other half of the data arrived just fine. So I invested considerable time in understanding the debug output. This led me nowhere – it seemed that the missing data wasn’t sent to the system in the first place. Checking the E2ET disproved this hypothesis. The test sent as much data to the system as ever. But half of it went missing underway. Now I reviewed all changes I made in the last few days. Some of them affected the data export to the TPS. But all unit and integration tests in this area insisted that everything worked as intended.

It is important to have this kind of “multiple witnesses”. Unit tests are like traces in a criminal investigation: They report one very specific information about one very specific part of the code and are only useful in larger numbers. Integration tests testify on the possibility to combine several code parts. In my case, this meant that the unit tests guaranteed that all parts involved in the data export work correct on their own (in isolation). The integration tests vouched that, given the right combination of data export parts, they will work as intended.

And this left one area of the code as the main suspect: Something in the combination of export parts must go wrong in the real case. The code that wires the parts together is the only code not tested by unit and integration tests, but E2ET. Both other test types base their work on the hypothesis of “given that everything is wired together like this”. If this hypothesis doesn’t hold true in the real case, no test but the E2ET finds a problem, because the E2ET bases on another hypothesis: “given that the system is started for real”.

I went through all the changes of the wiring code and there it was: A glorious TODO comment, written by myself on a late friday afternoon, stating: “TODO: Register format X export again after you’ve finished issue Y”. I totally forgot about it over the weekend. I only added the comment into the code, not as an issue comment. Friday afternoon me wasn’t cooperative with future or current me. In effect, I totally played myself. I don’t remember removing the registering code or writing the comment. It probably was a minor side change for an unimportant aspect of the main issue. The whole thing was surely done in a few seconds and promptly forgotten. And the only guardian that caught it was the E2ET, using the most nonspecific words for it (“somewhere, something went wrong”).

If I had chosen to ignore or modify the E2ET, the bug would have made it to production. It would have caused a loss of current data for the TPS. Maybe they would have detected it. But maybe, they would have just chosen to operate on the most recent data – data that doesn’t get updated anymore. In short, we don’t want to find out.

So what is the moral of the story?
Always have an end-to-end test for your most important functionality in place. Let it deal with your system from the outside. And listen to it, regardless of the effort it takes.

And what can you do if your E2ET has saved you?

  • Give your test a medal. No, seriously, leave a comment in the test code that it was helpful.
  • Write an unit or integration test that replicates the specific cause. You want to know immediately if you make the same error again in the future. Your E2ET is your last line of defense. Don’t let it be your only one. You’ve been shown a weakness in your test setup, remediate it.
  • Blame past you for everything. Assure future you that you are a better developer than this.
  • Feel good that you were able to neutralize your faux pas in time. That’s impressive!

When was the last time a test has saved you? Leave the story or a link to it in the comments. I can’t be the only one.

Oh, and if you find the random links to 80s music videos weird: At least I didn’t rickroll you. The song itself is from 1987 and would fit my selection.

The best of both worlds: scoped_flags

C++11 introduced a pretty nice change to enum types in C++, the scoped enumeration. They mostly supersede the old unscoped enumeration, which was inherited from C and had a few shortcomings. For example, the names in the enumeration where added to its parent scope. This means that given an enum colors {red, green blue}; you can simply say auto my_color = red;. This can, of course, lead to ambiguities and people using some weird workarounds like putting the enums in namespaces or prefixing all elements á la hungarian-notation. Also, unscoped enumerations are not particularly type-safe: they can be converted to integer types and back without any special consideration, so you can write things like int x = red; without the compiler complaining.
Scoped enumerations improves both theses aspects: with enum class colors {red, green, blue};, you have to use auto my_color = colors::red; and int x = colors::red; will simply not compile.
To get the second part to compile, you need to insert a static_cast: int x = static_cast(colors::red); which is purposefully a lot more verbose. Now this is a bit of a blessing and a curse. Of course, this is a lot more type-safe, but it make one really common usage pattern with enums very cumbersome: bit flags.

Did this get worse?

While you could previously use the bit operators to combine different bitmasks defined as enums, scoped enumerations will only let you do that if you cast them first. In other words, type-safety prevents us from combining flags because the result might, of course, no longer be a valid enum.
However, we can still get the convenience and compactness of bit flags with a type that represents combinations bitmasks from a specific enum type. Oh, this reeks of a template. I give you scoped_flags, which you can use like this:

enum class window_flags
{
  has_border = 1 << 0,
  has_caption = 1 << 1,
  is_child = 1 << 2,
  /* ... */
};
void create_window(scoped_flags<window_flags> flags);

void main()
{
  create_window({window_flags::has_border, window_flags::has_caption});
}

scoped_flags<window_flags> something = /* ... */

// Check a flag
bool is_set = something.test(window_flags::is_child);

// Remove a flag
auto no_border = something.without(window_flags::has_border);

// Add a flag
auto with_border = something.with(window_flags::has_border);

Current implementation

You can find my current implementation on this github gist. Even in its current state, I find it a niftly little utility class that makes unscoped enumerations all but legacy code.
I opted not to replicate the bitwise operator syntax, because &~ for “without” is so ugly, and ~ alone makes little sense. A non-explicit single-argument constructor makes usage with a single flag as convenient as the old C-style variant, while the list construction is just a tiny bit more complicated.
The implementation is not complete or final yet; for example without is missing an overload that gets a list of flags. After my previous adventures with initializer_lists, I’m also not entirely sure whether std::initializer_list should be used anywhere but in the c’tor. And maybe CTAD could make it more comfortable? Of course, everything here can be constexpr‘fied. Do you think this is a useful abstraction? Any ideas for improvements? Do tell!

Code duplication is not always evil

Before you start getting mad at me first a disclaimer: I really think you should adhere to the DRY (don’t repeat yourself) principle. But in my opinion the term “code duplication” is too weak and blurry and should be rephrased.

Let me start with a real life story from a few weeks ago that lead to a fruitful discussion with some fellow colleagues and my claims.

The story

We are developing a system using C#/.NET Core for managing network devices like computers, printers, IP cameras and so on in a complex network infrastructure. My colleague was working on a feature to sync these network devices with another system. So his idea was to populate our carefully modelled domain entities using the JSON-data from the other system and compare them with the entities in our system. As this was far from trivial we decided to do a pair-programming session.

We wrote unit tests and fixed one problem after another, refactored the code that was getting messing and happily chugged along. In this process it became more and more apparent that the type system was not helping us and we required quite some special handling like custom IEqualityComparers and the like.

The problem was that certain concepts like AddressPools that we had in our domain model were missing in the other system. Our domain handles subnets whereas the other system talks about ranges. In our system the entities are persistent and have a database id while the other system does not expose ids. And so on…

By using the same domain model for the other system we introduced friction and disabled benefits of C#’s type system and made the code harder to understand: There were several occasions where methods would take two IEnumerables of NetworkedDevices or Subnets and you needed to pay attention which one is from our system and which from the other.

The whole situation reminded me of a blog post I read quite a while ago:

https://www.sandimetz.com/blog/2016/1/20/the-wrong-abstraction

Obviously, we were using the wrong abstraction for the entities we obtained from the other system. We found ourselves somewhere around point 6. in Sandy’s sequence of events. In our effort to reuse existing code and avoid code duplication we went down a costly and unpleasant path.

Illustration by example

If code duplication is on the method level we may often simply extract and delegate like Uncle Bob demonstrates in this article. In our story that would not have been possible. Consider the following model of Price and Discount e-commerce system:

public class Price {
    public final BigDecimal amount;
    public final Currency currency;

    public Price(BigDecimal amount, Currency currency) {
        this.amount = amount;
        this.currency = currency;
    }

    // more methods like add(Price)
}

public class Discount {
    public final BigDecimal amount;
    public final Currency currency;

    public Discount(BigDecimal amount, Currency currency) {
        this.amount = amount;
        this.currency = currency;
    }

    // more methods like add(Discount<span 				data-mce-type="bookmark" 				id="mce_SELREST_start" 				data-mce-style="overflow:hidden;line-height:0" 				style="overflow:hidden;line-height:0" 			></span>)
}

The initial domain entities for price and discount may be implemented in the completely same way but they are completely different abstractions. Depending on your domain it may be ok or not to add two discounts. Discounts could be modelled in a relative fashion like “30 % off” using a base price and so. Coupling them early on by using one entity for different purposes in order to avoid code duplication would be a costly error as you will likely need to disentangle them at some later point.

Another example could be the initial model of a name. In your system Persons, countries and a lot of other things could have a name entity attached which may look identical at first. As you flesh out your domain it becomes apparent that the names are different things really: person names should not be internationalized and sometimes obey certain rules. Country names in contrast may very well be translated.

Modified code duplication claim

Duplicated code is the root of all evil in software design.

— Robert C. Martin

I would like to reduce the temptation of eliminating code duplication for different abstractions by modifying the well known claim of Uncle Bob to be a bit more precise:

Duplicated code for the same abstraction is the root of all evil in software design.

If you introduce coupling of independent concepts by eliminating code duplication you open up a new possibility for errors and maintenance drag. And these new problems tend to be harder to spot and to resolve than real code duplication.

Duplication allows code to evolve independently. I think it is important to add these two concepts to your thinking.

Containers allot responsibilities anew

Earlier this year, we experienced a strange bug with our invoices. We often add time tables of our work to the invoices and generate them from our time tracking tool. Suddenly, from one invoice to the other, the dates were wrong. Instead of Monday, the entry was listed as Sunday. Every day was shifted one day “to the left”. But we didn’t release a new version of any of the participating tools for quite some time.

What we did since the last invoice generation though was to dockerize the invoice generation tool. We deployed the same version of the tool into a docker container instead of its own virtual machine. This reduced the footprint of the tool and lowered our machine count, which is a strategic goal of our administrators.

By dockerizing the tool, we also unknowingly decoupled the timezone setting of the container and tool from the timezone setting of the host machine. The host machine is set to the correct timezone, but the docker container was set to UTC, being one hour behind the local timezone. This meant that the time table generation tool didn’t land at midnight of the correct day, but at 23 o’clock of the day before. Side note: If the granularity of your domain data is “days”, it is not advisable to use 00:00 o’clock as the reference time for your technical data. Use something like 12:00 o’clock or adjust your technical data to match the domain and remove the time aspect from your dates.

We needed to adjust the timezone of the docker container by installing the tzdata package and editing some configuration files. This was no big deal once we knew where the bug originated from. But it shows perfectly that docker (as a representative of the container technology) rearranges the responsibilities of developers and operators/administrators and partitions them in a clear-cut way. Before the dockerization, the timezone information was provided by the host and maintained by the administrator. Afterwards, it is provided by the container and therefore maintained by the developers. If containers are immutable service units, their creators need to accomodate for all the operation parameters that were part of the “environment” beforehands. And the environment is provided by the operators.

So we see one thing clearly: Docker and container technology per se partitions the responsibilities between developers and operators in a new way, but with a clear distinction: Everything is developer responsibility as long as the operators provide ports and volumes (network and persistent storage). Volume backup remains the responsibility of operations, but formatting and upgrading the volume’s content is a developer task all of a sudden. In a containerized world, the operators don’t know you are using a NoSQL database and they really don’t care anymore. It’s just one container more in the zoo.

I like this new partitioning of responsibilities. It assigns them for technical reasons, so you don’t have to find an answer in each organization anew. It hides a lot of detail from the operators who can concentrate on their core responsibilities. Developers don’t need to ask lots of questions about their target environment, they can define and deliver their target environment themselves. This reduces friction between the two parties, even if developers are now burdened with more decisions.

In my example from the beginning, the classic way of communication would have been that the developers ask the administrator/operator to fix the timezone on the production system because they have it right on all their developer machines. The new way of communication is that the timezone settings are developer responsibility and now the operator asks the developers to fix it in their container creation process. And, by the way, every developer could have seen the bug during development because the developer environment matches the production environment by definition.

This new partition reduces the gray area between the two responsibility zones of developers and operators and makes communication and coordination between them easier. And that is the most positive aspect of container technology in my eyes.

std::initializer_list considered evil

I am so disappointed in you, std::initializer_list. You are just not what I thought you were.

Lights out

While on the train to Meeting C++ this year, I was working on the lighting subsystem of the 3D renderer for my game abstractanks. Everything was looking fine, until I switched to the release build. Suddenly, my sun light went out. All the smaller lights were still there, it just looked like night instead of day.
Now stuff working in Debug and not working in Release used to be quite common and happens when you’re not correctly initializing built-in variables. So I went digging, but it was not as easy as I had thought. Several hours later, I tracked the problem down to my global light’s uniform buffer initialization code. This is a buffer that is sent to the GPU so the shaders can read all the lighting information. It looked like a fairly innocent for-loop doing byte-copies of matrices and vectors to a buffer:

using Pair = std::pair;
auto Mapping = std::initializer_list{
  {ShadowMatrix.ptr(), MATRIX_BYTE_SIZE},
  {LightDirection.ptr(), VECTOR4_BYTE_SIZE},
  {ColorAndAmbient.ptr(), VECTOR4_BYTE_SIZE}
};

std::size_t Offset = 0;
for (auto const& Each : Mapping)
{
  mUniformBuffer.SetSubData(GL_UNIFORM_BUFFER, Each.second, Offset, Each.first);
  Offset += Each.second;
}

The Culprit

After mistakenly blaming alignment issues for a while, I finally tried looking at the values of Each.second and Each.first. To my surprise, they were bogus. Now what is going on there? It turns out not writing this in almost-always-auto style, i.e. using direct- instead of copy-initialization fixes the problem, so there’s definitely a lifetime issue here.

Looking at the docs, it became apparent that std::initializer_list is indeed a reference-type that automatically creates a value-type (the backing array) internally and keeps it alive exactly as binding a reference to that array would. For the common cases, i.e. when std::initializer_list is used as a parameter, this is fine, because the original list lives for the whole function-call expression. For the direct-initialization case, this is also fine, since the reference-like lifetime-extension kicks in. But for copy-initialization, the right-hand-side is done after the std::initializer_list is copied. So the backing array is destroyed. Oops.

Conclusion and alternatives

Do not use std::initializer_list unless as a function parameter. It works well for that, and is surprising for everything else. In my case, a naive “extract variable” refactoring of for (auto const& each : {a, b, c}) { /* ... */ } led me down this rabbit hole.
My current alternative is stupidly simple: a built-in array on the stack:

using Pair = std::pair;
Pair Mapping[]{
  {ShadowMatrix.ptr(), MATRIX_BYTE_SIZE},
  {LightDirection.ptr(), VECTOR4_BYTE_SIZE},
  {ColorAndAmbient.ptr(), VECTOR4_BYTE_SIZE}
};

It does the same thing as the “correct” version of the std::initializer_list, and if you try to use it AAA-style, at least clang will give you this nice warning: warning: temporary whose address is used as value of local variable 'Mapping' will be destroyed at the end of the full-expression [-Wdangling]