bug – Schneide Blog

Naming is hard and Java Enums don’t help

This is a short blog post about a bug in my code that stumped me for some moments. I try to tell it in a manner where you can follow the story and try to find the solution before I reveal it. You can also just read along and learn something about Java Enums and my coding style.

A code structure that I use sometimes is the Enum type that implements an interface:

public enum BuiltinTopic implements Topic {

    administration("Administration"),
    userStatistics("User Statistics"),
    ;
	
    private final String denotation;

    private BuiltinTopic(String denotation) {
        this.denotation = denotation;
    }
	
    @Override
    public String denotation() {
        return this.denotation;
    }
}

The Topic interface is nothing special in this example. It serves as a decoupling layer for the (often large) part of client code that doesn’t need to know about any specifics that stem from the Enum type. It helps with writing tests that aren’t coupled to locked-down types like Enums. It is just some lines of code:

public interface Topic {

    String denotation();
}

Right now, everything is fine. The problems start when I discovered that the denotation text is suited for the user interface, but not for the configuration. In order to be used in the configuration section of the application, it must not contain spaces. Ok, so let’s introduce a name concept and derive it from the denotation:

public interface Topic {

    String denotation();
	
    default String name() {
        return Without.spaces(denotation());
    }
}

I’ve chosen a default method in the interface so that all subclasses have the same behaviour. The Without.spaces() method does exactly what the name implies.

The new method works well in tests:

@Test
public void name_contains_no_spaces() {
    Topic target = () -> "User Statistics";
    assertEquals(
       "UserStatistics",
       target.name()
    );
}

The perplexing thing was that it didn’t work in production. The names that were used to look up the configuration entries didn’t match the expected ones. The capitalization was wrong!

To illustrate the effect, take a look at the following test:

@Test
public void name_contains_no_spaces() {
    Topic target = BuiltinTopic.userStatistics;
    assertEquals(
        "userStatistics",
        target.name()
    );
}

You can probably spot the difference in the assertion. It is “userStatistics” instead of “UserStatistics”. For a computer, that’s a whole different text. Why does the capitalization of the name change from testing to production?

The answer lies in the initialization of the test’s target variable:

In the first test, I use an ad-hoc subtype of Topic.

In the second test and in production, I use an object of type BuiltinTopic. This object is an instance of an Enum.

In Java, Enum classes and Enum objects are enriched with automatically generated methods. One of these methods equip Enum instances with a name() method that has a default implementation to return the Enum instances’ variable/constant name. Which in my case is “userStatistics”, the same string I expect, minus the correct capitalization of the first character.

If I had named the Enum instance “UserStatistics”, everything would have worked out until somebody changes the name or adds another instance with a slight difference in naming.

If I had named my Enum instance something totally different like “topic2”, it would have been obvious. But in this case, with only the minor deviation, I was compelled to search for problems elsewhere.

The problem is that the auto-generated name() method overwrites my default method, but only in cases of real Enum instances.

So I thought hard about the name of the name() method and decided that I don’t really want a name(), I want an identifier(). And that made the problem go away:

public interface Topic {

    String denotation();
	
    default String identifier() {
        return Without.spaces(denotation());
    }
}

Because the configuration code only refers to the Topic type, it cannot see the name() method anymore and only uses the identifier() that creates the correct strings.

I don’t see any (sane) way to prohibit the Java Enum from automatically overwriting my methods when the signature matches. So it feels natural to sidestep the problem by changing names.

Which shows once more that naming is hard. And soft-restricting certain names like Java Enums do doesn’t lighten the burden for the programmer.

When laziness broke my code

I was just integrating a new task-graph system for a C# machine control system when my tests started to go red. Note that the tasks I refer to are not the same as the C# Task implementation, but the broader concept. Task-graphs are well known to be DAGs, because otherwise the tasks cannot be finished. The general algorithm to execute a task-graph like this is called topological sorting, and it goes like this:

Find the number of dependencies (incoming edges) for each task
Find the tasks that have zero dependencies and start them
For any finished tasks, decrement the follow-up tasks dependency count by one and start them if they reach zero.

The graph that was failed looked like the one below. Task A was immediately followed by a task B that was followed by a few more tasks.

I quickly figured out that the reason that the tests were failing was that node B was executed twice. Looking at the call-stack for both executions, I could see that the first time B was executed was when A was completed. This is correct as per step 3 in the algorithm. However, the second time it was started was directly from the initial Run method that does the work from step 2: Starting the initial tasks that are not being started recursively. I was definitely not calling Run twice, so how did that happen?

public void Run()
{
    var ready = tasks
        .Where(x => x.DependencyCount == 0);

    StartGroup(ready);
}

Can you see it? It is important to note that many of the tasks in this graph are asynchronous. Their completion is triggered by an IObserver, a C# Task completing or some other event. When the event is processed, StartGroup is used to start all tasks that have no more dependencies. However, A was no such task, it was synchronous, so the StartGroup({B}) call happened while Run was still on the stack.

Now what happened was that when A (instantly!) completed, it set the DependencyCount of B to 0. Since ready in the code snippet is lazily evaluated from within StartGroup, the ‘contents’ actually change while StartGroup is running.

The fix was adding a .ToList after the .Where, a unit test that checked that this specifically would not happen again, and a mental note that lazy evaluation can be deceiving.

My own little Y2K22 bug

Ever since the year 2000 (or Y2K), software developers dread the start of a new year. You’ll never know which arbitrary limit will affect the fitness of your projects. Sometimes, it isn’t even the new year (see the year 2038 problem that will manifest itself in late January). But more often than not, the first day of a new year is a risky time.

Welcome, 2022!

The year 2022 started with Microsoft Exchange quarantining lots of e-mails for no apparent reason other than it is no longer 2021. I was amused about this “other people’s problem” until my phone rang.

A customer reported that one of my applications doesn’t start anymore, when it ran perfectly a few days ago – in 2021. My mind began to race:

The application in question wasn’t updated recently. It has to be something in the code that parses a current date with an unfortunate date/time format. My search for all format strings (my search term was “MMddHH” without the quotes) in the application source code brought some expected instances like “yyyyMMddHHmmss” and one of a very suspicious kind: “yyMMddHHmm”.

The place where this suspicious format was used took a version information file and reported a version number, some other data and a build number. The build number was defined as an integer (32 bit). Let me explain why this could be a problem:

2G should be enough for everyone!

A 32-bit integer has an arbitrary value limit of 2³¹=2.147.483.648. If you represent the last minute of 2021 in the format above, you get 2.112.312.359 which is beneath the limit, but quite close.

If you add one minute and count up the year, you’ll be at 2.201.010.000 which is clearly above the value limit and result in either an integer overflow ending in a very negative number or an arithmetic exception.

In my case, it was the arithmetic exception which halted the program in its very first steps while figuring out what, where and when it is.

This is a rookie mistake that can only be explained by “it evolved that way”. The mistake is in the source code since the year 2004. I wrote it myself, so it is my mistake. But I didn’t just think about a weird date format that won’t spark joy 18 years later. I started with a build number from continuous integration. The first build of the project is “build 1”, the next is “build 2”, and so on. You really have to commit early, commit often (and trigger builds) to reach the integer limit that way. This is true for a linear series of builds. But what if you decide to use feature branches? The branches can happen in parallel and each have their own build number series. So “build 17” could be the 17th build of your main branch and go in production or it could be a fleeting build result on a feature branch that gets merged and deleted a few days later. If you want to use the build number as a chronological ordering, perhaps to look for updates, you cannot rely on the CI build numbering. Why not use time for your chronological ordering?

Time as an integer

And how do you capture time in an integer? You invent a clever format that captures the essence of “now” in a string that can be parsed as an integer. The infamous “yyMMddHHmm” is born. The year 2022 is a long time down the road if you apply a quick and clever fix in 2004.

But why did the application crash in 2022 without any update? The build number had to be from 2021 and would still pass the conversion. Well, it turned out that this specific application had no build number set, because we changed our build system and deemed this information not important for this application. So the string in the version file was empty. How is an empty string interpreted as today?

Well, there was another clever code by another developer from 2008 that took a string being null or empty and replaced it with the current date/time. The commit message says “Quickfix for new version format”.

Combined cleverness

Combine these three things and you have the perfect timebomb:

A clever way to store a date/time as an integer
A clever way to intepret missing settings
A lazy way to intriduce a new build process

The problem described above was present in a total of five applications. Four applications had fixed build numbers/dates and would have broken with the next version in 2022 or later. The fifth application had an empty build number and failed exactly as programmed after the 01.01.2022.

Lessons learnt

What can we learn from this incident?

First: clever code or a quick fix is always a bad idea.

Second: cleverness doesn’t stack. One clever workaround can neutralize another clever hack even if both “solutions” would work on their own.

Third: If your solution relies on a certain limit to never be reached, it is only a temporary solution. The limit will be reached eventually. At least leave an automated test that warns about this restriction.

Fourth: Don’t mitigate a hack with another hack. You only make your situation worse in the long run.

The fourth take-away is important. You could fix the problem described above in at least two ways:

Replace the integer with a long (64 bit) and hope that your software isn’t in production anymore when the long wraps around. Replace the date/time format with the usual “yyyyMMddHHmmss”.
Leave the integer in place and change the date/time format to “yyDDDHHmm” with “DDD” being the day of the year. With this approach, you shorten the string by one digit and keep it below the limit. You also make the build number even less readable and leave a timebomb for the year 2100.

You can probably guess which route I took, even if it was a lot more work than expected. The next blog entry about this particular code can be expected at 01.01.10000.

It’s not a bug, it’s a missing test

Recently, I changed some wording when I talk about code and code problems. In my opinion, the new words have more correlation with the root cause of the problem, while the old words used to be “nearer” to the symptoms.

Let me explain the changes with a small code example that serves no other purpose than to contain a few problems:

public class InMemoryItemRepository {
	private final Map<String, Item> items;
	
	public InMemoryItemRepository() {
		this.items = new HashMap<>();
	}
	
	public Optional<Item> itemFor(String itemNumber) {
		return Optional.of(this.items.get(itemNumber));
	}
}

This class is a concrete implementation of an “item repository” that stores items by their “item number”. You can retrieve items by calling the itemFor method and giving it a valid itemNumber. Otherwise, you’ll end up with an empty Optional.

The first problem

The first problem in this code is a code smell. The data structure used to store the items is an HashMap. In Java, the HashMap implementation is not thread-safe:

Note that this implementation is not synchronized. If multiple threads access a hash map concurrently, and at least one of the threads modifies the map structurally, it must be synchronized externally.
https://docs.oracle.com/en/java/javase/11/docs/api/java.base/java/util/HashMap.html

With our current implementation, we leak this limitation onto our clients, which will be totally unaware, because nothing in the class design hints towards an HashMap. In most production environments, our in-memory implementation might even be swapped out with a database-based one. And the database implementation hopefully is thread-safe.

A code smell is a bug

My change in wording calls this type of code problem a bug (instead of a code smell). It might be possible that right now, this class is only used in a single-threaded manner and the implementation flaw doesn’t matter. In that case, it’s a bug in hibernation. Right now, it sleeps, but the day will come when it wakes up. That doesn’t change its bug-like features, just its immediate damage potential.

If you label a code smell as a bug, you highlight the damage potential instead of the current “nice to have” prioritization.

The second problem

Let’s return to the code example. There is another problem with this implementation: If you ask for an item number that is not stored in the repository, you don’t get an empty Optional, you’ll get a nice and unexpected NullPointerException.

A small unit test that asks for any item number on an empty repository reveals the problem:

	@Test
	void returnsEmptyIfNotStored() {
		var target = new InMemoryItemRepository();
		assertThat(
			target.itemFor("non-existent")
		).isEmpty();
	}

This test will run red with a NullPointerException, pointing to this line in the implementation:

		return Optional.of(this.items.get(itemNumber));

Of course, this is “just a typo” in the way we construct the Optional. Because our HashMap returns null if a given key is not stored in it, we need to call Optional.ofNullable() and have it convert null to Optional.empty:

		return Optional.ofNullable(this.items.get(itemNumber));

A bug is a missing test

This “little fix” makes our unit test pass and the repository useful. My change in wording calls every bug a “missing test”. This points out that the test you write as you fix the bug could have prevented the bug’s damage altogether.

If you were surprised by my assumption that you write tests when you fix bugs, please consider adopting this as a habit. It is the last possible moment to improve your test coverage at a place that has proven to be important enough to have tests and undertested enough to exhibit bugs. If you want your project to be antifragile (and there are good reasons why it should be), start by strengthening it at every place it breaks.

Tests for every code smell?

The combination of both changes would make every code smell a missing test. Right now, I’m not sure if this needs to be an automatism. The existence of a code smell hints towards a more fragile part of the system, but I’m not sure if “potential fragility” is a valid indicator for additional tests.

Of course, if you program completely test driven, these kind of thoughts probably seem pointless to you.

What are your thoughts on this topic and specifically on compulsory testing for every code smell?

Last night, an end-to-end test saved my live

but not with a song, only with an assertion failure. Okay, I’ll admit, this is not the only hyperbole in the blog title. Let’s set things right: The story didn’t happen last night, it didn’t even happen at night (because I’m not working night hours). In fact, it was a well-lit sunny day. And the test didn’t save my live. It did save me from some embarrassment, though. And spared a customer some trouble and hotfixing sessions with me.

The true parts of the blog title are: An end-to-end test reported an assertion failure that helped me to fix a bug before it got released. In other words: An end-to-end test did its job and I felt fine. And that’s enough of the obscure song references from the 1980s, hopefully.

Let me explain the setting. We develop a big system that runs in production 24/7-style and is perpertually adjusted and improved. The system should run unattended and only require human attention when a real incident happens, either in the data or the system itself. If you look at the system from space, it looks like this:

A data source (in fact, lots of data sources) occasionally transmits data to the system that gets transformed and sent to a third-party system that does all kind of elaborate analysis with it. Of course, there is more to the real system than that, but for our story, it is sufficient to know that there are several third-party systems, each with their own data formats. We’ll call them third-party system A (TPS-A) and TPS-B in this story.

Our system is secured by lots and lots of unit tests, a good number of integration tests and a handful of end-to-end tests. All tests run green all the time. The end-to-end tests (E2ET) try to replicate the production environment as close as possible by running on the target operating system and using data transfer means like in the real case. The test boots up the complete system and simulates the data source and the third-party systems. By configuring the system so that it transfers back to the E2ET, we can write assertions in the kind of “given a specific input data, we expect this particular output data from the system, however it chooses to produce it”. This kind of broad-brush test is invalueable because it doesn’t care for specifics in the system. It only cares for output from the system.

This kind of E2ET is also difficult to write, complicated to run, prone to brittleness and obscure in its failure statement. If you write a three-line unit test, it is straight-forward, fast and probably pretty specific when it breaks: “This one thing that I’m testing is wrong now”. An E2ET as described here is like the Delphi Oracle:

E2ET: “Somewhere in your system, something doesn’t work the way I like it anymore.”
Developer: “Can you be more specific?”
E2ET: “No, but here are 2000 lines of debug output that might help you on your quest for the cause.”
Developer: “But I didn’t change anything.”
E2ET: “Oh, in that case, it could as well be the weather or a slow network. Mind if I run again?”

This kind of E2ET is also rather slow and takes its sweet time resetting the workspace state, booting the system again and using prolonged timeouts for every step. So you wait several minutes for the E2ET to fail again:

E2ET: “Yup, there is still something fishy with your current code.”
Developer: “It’s the same code you let pass yesterday.”
E2ET: “Well, today I don’t like it!”
Developer: “Okay, let me dive into the debug output, then.”

Ten to twenty minutes pass.

Developer: “Is it possible that you just choke on a non-free network port?”
E2ET: “Yes, I’m supposed to wait for the system’s output on this port and cannot bind to it.”
Developer: “You cannot bind to it because an instance of yourself is stuck in the background and holding onto the port.”
E2ET: “See? I’ve told you there is something wrong with your system!”
Developer: “This isn’t a problem with the system’s code. This is a problem with your code.”
E2ET: “Hey! Don’t blame me! I’m here to help. You can always chose to ignore or delete me if I’m too much of a burden to you.”

This kind of E2ET cries wolf lots of times for problems in the test code itself, for too optimistic timeouts or just oddities in the environment. If such an E2ET fails, it’s common practice to run it again to see if the problem persists.

How many false alarms can you tolerate before you stop listing?

In my case, I choose to reduce the amount of E2ET to the essential minimum and listen to them every time. If the test raises a false alarm, invest the time to come up with a way to make it more robust without reducing its sensitivity. But listen to the test every time.

Because, in my story, the test insisted that third-party system A and B both didn’t receive half of their data. I could rule out stray effects in the network or on the harddisk pretty fast, because the other half of the data arrived just fine. So I invested considerable time in understanding the debug output. This led me nowhere – it seemed that the missing data wasn’t sent to the system in the first place. Checking the E2ET disproved this hypothesis. The test sent as much data to the system as ever. But half of it went missing underway. Now I reviewed all changes I made in the last few days. Some of them affected the data export to the TPS. But all unit and integration tests in this area insisted that everything worked as intended.

It is important to have this kind of “multiple witnesses”. Unit tests are like traces in a criminal investigation: They report one very specific information about one very specific part of the code and are only useful in larger numbers. Integration tests testify on the possibility to combine several code parts. In my case, this meant that the unit tests guaranteed that all parts involved in the data export work correct on their own (in isolation). The integration tests vouched that, given the right combination of data export parts, they will work as intended.

And this left one area of the code as the main suspect: Something in the combination of export parts must go wrong in the real case. The code that wires the parts together is the only code not tested by unit and integration tests, but E2ET. Both other test types base their work on the hypothesis of “given that everything is wired together like this”. If this hypothesis doesn’t hold true in the real case, no test but the E2ET finds a problem, because the E2ET bases on another hypothesis: “given that the system is started for real”.

I went through all the changes of the wiring code and there it was: A glorious TODO comment, written by myself on a late friday afternoon, stating: “TODO: Register format X export again after you’ve finished issue Y”. I totally forgot about it over the weekend. I only added the comment into the code, not as an issue comment. Friday afternoon me wasn’t cooperative with future or current me. In effect, I totally played myself. I don’t remember removing the registering code or writing the comment. It probably was a minor side change for an unimportant aspect of the main issue. The whole thing was surely done in a few seconds and promptly forgotten. And the only guardian that caught it was the E2ET, using the most nonspecific words for it (“somewhere, something went wrong”).

If I had chosen to ignore or modify the E2ET, the bug would have made it to production. It would have caused a loss of current data for the TPS. Maybe they would have detected it. But maybe, they would have just chosen to operate on the most recent data – data that doesn’t get updated anymore. In short, we don’t want to find out.

So what is the moral of the story?
Always have an end-to-end test for your most important functionality in place. Let it deal with your system from the outside. And listen to it, regardless of the effort it takes.

And what can you do if your E2ET has saved you?

Give your test a medal. No, seriously, leave a comment in the test code that it was helpful.
Write an unit or integration test that replicates the specific cause. You want to know immediately if you make the same error again in the future. Your E2ET is your last line of defense. Don’t let it be your only one. You’ve been shown a weakness in your test setup, remediate it.
Blame past you for everything. Assure future you that you are a better developer than this.
Feel good that you were able to neutralize your faux pas in time. That’s impressive!

When was the last time a test has saved you? Leave the story or a link to it in the comments. I can’t be the only one.

Oh, and if you find the random links to 80s music videos weird: At least I didn’t rickroll you. The song itself is from 1987 and would fit my selection.

Twelve years a bug

Recently, a friend recommended the talk “Love your bugs” by Allison Kaptur to me. It’s a good talk with a powerful message: Every bug is a chance to learn. The only prerequisite for this chance: You need to be aware of the bug. You need to see it and then understand and fix it.

What if you don’t see a bug for over a decade?

You can still learn from it – a lot. Here is a story about a bug that lingered in my code for twelve years and had impact on the software results without anybody, including me, noticing.

Let’s say the software is a measurement system that records physical measurement values that cannot be reenacted easily. The samples are measured in rapid succession and then stored away. The measurement results act as a quality quantifier as in “the sample was this good at that time”. The sample’s quality diminishes over time, even with perfect storage conditions. And just to be sure that the measurement is accurate, it isn’t performed once, but twice in quick succession: The measurement device traverses the sample in one direction, recording the values and uses the return path for a second measurement. This results in two measurement streaks that should, in theory, be very similar.

The raw measurement values are aggregated in different ways. In the final report, the maximum value over both measurement streaks is the most prominent and most important aspect. There is nothing exciting or error-prone about finding the maximum value in a value series, even twelve years ago. But I tested the code nonetheless. The whole value aggregation code was under test and had a good test coverage. All tests showed their thumbs up.

But twelve years ago, at the end of a workweek, on a Friday at 17 o’clock, I made a small and easy refactoring to the code. I know this in such detail because of the wonders of version control. Without version control, I probably would have learnt a lot less from this bug.

The code before the refactoring contained two nearly identical sections for the measurement streaks, making it duplicated code. There were only two differences: The first section of code counted from 0 to 99 and stored the values in the first streak’s array. The second section counted from 99 to 0, because the measurement device travels backwards over the sample, and stored the values into the second array.

My refactoring brought both pieces of code together: A new method with two parameters was introduced and called at both places. The first parameter specified a series of positions (0 to 99 or backwards), the second parameter specified the array to store into. All automated tests approved the changes. My manual testing showed no differences. The refactoring was going live.

Twelve years later, while working on a new engine for the measurement device, the customer took the raw values from the journal and performed some manual calculations. The maximum value calculation was wrong. It wasn’t wrong all of the time, but also not correct for all measurements. When it went wrong, it selected the second greatest value or, seldom, the third greatest value as the maximum value.

All the tests still insisted that everything is correct – as it was for twelve long years. There was no other change to the code that could affect the aggregation in any way.

There are two different ways that lead me to find the bug’s origin. The first way was to inspect the trail of commits in the area of code that performs the aggregation. My findings were that the abovementioned refactoring was the most likely culprit. The second way was to write more tests to ensure that yes, the maximum value calculation was indeed correct if given the correct values. Using the examples my customer had examined, I could prove with enough certainty that, given the input of all 200 values, the correct maximum value was returned in all cases. The aggregation therefore wasn’t given all values!

And that lead directly to the bug: The first measurement streak used the same array to store the values as the second streak. During the refactoring, I must have copied and pasted the call to the new method and forgotten to change the second parameter. Of 200 measured values, only 100 got stored permanently. The first 100 values were stored and promptly overwritten as the measurement device returned to its park position. Change the calls to store in both arrays and everything works like intended and like before the refactoring.

How did no test, automated or manual, catch this bug? It turns out that most automated tests were too focussed to indicate a problem. All unit tests that secured the maximum value calculation used given sets of values, they didn’t care about the origin of these value sets. The integration tests that covered the whole measurement process should have raised objections to the refactoring. But they used given sets of measurement values, too. And by chance, the given maximum value was in the second measurement streak. In fact, the given set of measurement values was produced by a loop that just increased a value. The greatest value was always the last one.

The manual tests had the same problem: The simulated measurement device produced measurement values that were either fixed or random. If your whole measurement uses the same fixed values, you don’t see it if half of them went missing. And if your measurement uses random values, you’ll have to pay close attention to a detail that isn’t in focus because it is unchanged. Except that one time when it was changed recently. Remember the change date? Friday afternoon, only minutes before the weekend? Not the best time to manually test a change that is a simple standard refactoring, after all.

So, what have I learnt? First: automated tests, even with great test coverage, aren’t enough. There is so much leeway in the setup of these tests that they will have blind spots without your knowledge. Second: A code review by another human (or even by the same human, some days later) might have caught the bug. It was painfully obvious in hindsight. The problem? “Might have” is an heuristics, just like your automated tests are.

My guess is that mutation testing would have shown the blind spot of the existing tests – among several hundred others. The heuristics is now your trained eye that sifts through the results and separates false positives from true positives.

Right now, I’ve fixed the bug, kept all additional tests and added one more: An integration test that performs a measurement with fixed values and checks nothing else but if all 200 values are stored at the end of the measurement. It’s oddly specific, but conveys this story in an automated fashion.

Oh, and I made another refactoring: I’ve replaced the arrays with collections. Hopefully, I won’t regret this one twelve years in the future. I’ve made it on a Monday.

Domain-aligned bugs

Frank C. Müller [CC BY-SA 4.0 (https://creativecommons.org/licenses/by-sa/4.0) ]

Imagine that you are an user of a typical enterprise software that handles commercial products and their prices. There are different prices in the software that are somehow related to each other. There is the purchase price that indicates your cost if you buy the product. There is the retail price that gets listed in your price lists and is paid by your customers, should they buy the product. You probably already figured out that the retail price should never be lower than the purchase price, because that would mean you lose money with every successful sale.

Let’s say that the enterprise software not only handles products, but also parts. Several parts combined, with some manufacturing effort, result in a product. Each part has a purchase price, the resulting product has a retail price. The retail price of the product should be higher than the sum of purchase prices of the parts. If it isn’t, you lose the costs of the manufacturing effort and some extra money with every successful sale.

If for any reason you cannot clearly estimate your manufacturing effort, the enterprise software has another input field for an amount of money that you can add to the sum of the parts’ costs. We call this field the “sales bonus”. So, if you sell a product made up of parts, your customer has to pay a price that consists at least of the retail prices of the parts and the sales bonus. Of course, your customer has an individual discount percentage that needs to be subtracted from the total price. Are you still following?

You are now thinking in the domain of price determination and financial mathematics. If you were the user of said enterprise software, you’d probably expect some bugs like these:

It is possible to enter a retail price lower than the purchase price
The price of products manufactured from parts isn’t calculated correctly
It is possible to enter a negative sales bonus
The total price with discount could be lower than the sum of purchase prices of the parts without a warning

All of them are bugs in the domain. All of them can be explained to a domain expert or a user with terms and concepts from the domain.

But what about the bug when you sell a product that consists of three parts, each with a retail price of 10 €, and a sales bonus of 5 €. You want to create a quote for your customer and the price shows up as 34,99999999998 €. You are a bit bewildered and try to countervail the apparent rounding error by changing the sales bonus to 5,00000000002 €. After this change you get another crazy total price and your prices in the database are different from what you entered, too. Everything seems to destabilize and deviate further and further from clear cut prices.

As a programmer, you know what happened. You know what caused this effect of numerical instability. Somebody stored monetary values in a floating point number. You know that is a bad idea and you’d never do this. But this blog post isn’t about you or what you should do or not do. It is about the user, expert in his domain, that stumbles over the bug as described and has to make some decision on how to fix it. This user cannot use any knowledge from the domain to even understand the mechanics of the bug. You, as the programmer, cannot explain this bug in terms of things the user already knows. You need to be vague (“the software doesn’t store the exact values, just approximations”) or introduce additional complexity (“we store this value by splitting it into a significand and multiply it with a factor consisting of a fixed base and an exponent. We can omit the base and just store the significand and the exponent and express a very large numerical range in just a few bits. Think about how cool that is!”).

Read the last explanation again, from the viewpoint of a salesman. We want to add some prices in the range of a few €, slap a moderate discount on top and call it a day. We don’t care about bits or exponential formulas. That is not part of our domain and it shouldn’t affect our domain or software that works in our domain. Confronting us with technical details reflects negatively on your ability to solve our problems. You seem to burden us with your problems in exchange.

As domain experts, we want only domain-aligned bugs.

Timestamps make horrible identifiers

If you think about using a timestamp or date as an identifier for some kind of entity, object or data record – think again. They are horribly ill-equipped to be identifiers due to their dynamic resolution. Here’s the story how we got to this conclusion.

Not long ago, I’ve struggled with a system that uses timestamps as entity identifiers. What can I say? Timestamps aren’t meant to identify anything else than a specific point in time. Don’t use them as entity identifiers, ever. If you want to know why, I invite you to read on. The blog post is written in Freytag’s dramatic structure for added effect.

Exposition

We’ve designed a system that runs on multiple instances that communicate in all sorts of way. A central archive instance stores all data related to measurements. The whole network revolves around the notion of measurement. Measurement data is the most precious data and all instances will either produce or consume data based on these measurements.

Most important for human operators is an instance that lets you view all existing measurement data. Let’s call it the viewer. The viewer displays an overview list of all measurements in a given context and lets the operator choose to view ever more details of any of them. To be able to provide the overview list as fast as possible, we added a cache that holds the information.

Rising action

This measurement list cache was the source of all kinds of peculiar behaviour of the system. Most, but not all measurement data was incomplete. The list cache entries were assembled from different sources that were available at different times, so it seemed that while one part of the data got written to the cache, another part couldn’t be written for whatever reasons. The operator could load detailed data of some few measurements, but the majority just produced an error message that the data couldn’t be found (despite it being present).
The most obviously broken functionality left the following trace in the log files (paraphrased):

- storing measurement at 2016-02-28T13:25:55.189+01:00 into the list cache
- measurement stored
[...]
- loading measurement at 2016-02-28T13:25:55.189+01:00 from the list cache
- error: measurement not found in list cache

So, the system is essentially telling me that it can’t load some data it just stored. As you can imagine, this may lead to some questions about the sanity of the database product underneath.

Climax

After some investigation and fruitless integration testing, it dawned me: The problem wasn’t timing or the database. All the bugs could be explained with only one circumstance: Measurements were ultimately identified by their timestamp, the moment the measurement was made. There’s also a location, type and some other information in the identifier for each measurement, but only the timestamp changes between two measurements in the same narrow context. And the timestamp was stored in different precisions, depending on the origin of the measurement identifier. Most identifiers were create at the measurement producing system instances (let’s call them measurers) and had millisecond precision. As soon as they got stored in the production database (but not our development database), they lost the milliseconds. And some of the most important measurement data got exported to third-party systems, using a minute-based precision. So we had one measurement identifier in the system, but with three different types, each mostly incompatible to each other.

Falling action

That’s why the log excerpt above never occurred in development, but in production: The measurement is stored in the database, the used identifier gets passed around in the software, but a query on the exact same identifier in the database yields no result because the timestamps now differ in the millisecond range. And the strange effects that sometimes, everything worked just fine? That’s when the milliseconds are zero by chance. Given that most actions in the system are scheduled and performed automatically exactly on the zero mark, the zero milliseconds case happened more often than it would in an even distribution.

Our system dealt with three types of measurement identifiers: Millisecond-precise identifiers produced by the measurers, second-precise identifiers used by the measurement list cache and minute-precise identifiers used (and sometimes fed back into the system) by the data export. These identifiers were incompatible even for the same measurement most of the time, but not always. In unit tests, the timestamps were made up and didn’t reveal the problem properly (who thinks about odd milliseconds when making up a timestamp?).

My solution was to pull this incompatibility up into the type system. Instead of one measurement identifier, there are now three measurement identifiers: MillisecondPreciseIdentifier, SecondPreciseIdentifier and MinutePreciseIdentifier. An identifier of higher precision can be converted to an identifier of lower precision, but not the other way around. Everytime a measurement identifier is created, it needs to explicitely state its precision of the timestamp. This made the compiler highlight the problematic usages clearly as type conflicts and therefore dealing with the problem much easier.

Revelation

Choosing a timestamp as a vital part of a (measurement) identifier was a mistake from the beginning. The greater problem was the omission of the timestamp’s precision. Timestamps perform more like floating-point numbers and less like integers, even if every timestamp can be represented by a long. As soon as I made the precision of each timestamp clear to the compiler, the bugs revealed themselves. The annoying difference between developer and production database would have been detected much sooner, because a millisecond-precise timestamp will now warn in the log files if its millisecond part is zero. As soon as this log entry is seen very often, its clear that something is wrong. The new datatypes not only serve as a clearer API contract definition tool, but also as a runtime sanity check.

If you don’t want to repeat this mistake, keep in mind that each timestamp, date or whatever time-related data type you use will inherently have a maximum precision. As soon as you mix different precisions into the same data type, you’re going to have a bad time. Explicitely state the required precision in your type system and your compiler will keep an eye on it, too.

The Story of a Multithreading Sin

The story of a bug that was caused by a common multithreading pitfall, the dreaded liquid lock.

In my last blog entry, I wrote about multithreading pitfalls (in Java), and ironically, this was the week when we got a strange bug report from one of our customers. This blog entry tells the story of the bug and adds another multithreading pitfall to the five I’ve already listed in my blog entry “When it comes to multithreading, better be safe than sorry”.

The premise

We developed a software that runs on several geographically distant independent “stations” that collect a multitude of environmental measurement data. This data is preprocessed and stuffed into data packages, which are periodically transferred to a control center. The software of this control center, also developed by us, receives the data packages, stores them on disk and in a huge database and extracts the overall state of the measurement network from raw data. If you describe the main task of the network on this level, it sounds nearly trivial. But the real functionality requirements are manifold and the project grew large.

We kept the whole system as modular as necessary to maintain an overall grasp of what is going on where in the system and installed a sufficient automatic test coverage for the most important parts. The system is still under active development, but the main parts of the network are in production usage without real changes for years now.

The symptoms

This might explain that we were very surprised when our customer told us that the control center had lost some data packages. Very soon, it turned out that the control center would randomly enter a state of “denial”. In this state, it would still accept data packages from the stations and even acknowledge their arrival (so the stations wouldn’t retry the transmission), but only write parts of the package or nothing at all to the disk and database. When the control center entered this state, it would never recover from it. But when we restarted the software manually, everything would run perfectly fine for several days and then revert back into denial without apparent trigger.

We monitored the control center with every means on our disposal, but its memory consumption, CPU footprint and threading behaviour was without noticeable problem even when the instance was in its degraded state. There was no exception or uncommon entry logged in the logfiles. As the symptom happened randomly, without external cause and with no chance of reversal once it happened, we soon suspected some kind of threading issue.

The bug

The problem with a threading issue is that you can’t just reproduce the bug with an unit or system test. We performed several code reviews until we finally had a trace. When a data package arrives, a global data processing lock is acquired (so that no two data packages can be processed in parallel) and the content of the package is inspected. This might trigger several network status changes. These change events are propagated through the system with classic observer/listener structures, using synchronous calls (normal delegation). The overall status of the network is translated in a human readable status message and again forwarded to a group of status message listeners. This is a synchronous call again. One of the status message listeners was the software driver for a LED ticker display. This module was a recent addition to the control center’s hardware outfit and used to display the status message prominently to the operators. Inside this LED software driver, some bytes are written to a socket stream and then the driver awaits an answer of the hardware device. To avoid the situation that two messages are sent to the device at the same time, a lock is acquired just before the message is sent. This code attracted our attention. Lets have a look at it:

private Message lastMessage = new Message();

public void show(Message message) {
    synchronized (this.lastMessage) {
        writeCommandAndWaitForResponse(Command.SHOW_TEXT, message.asBytes());
        this.lastMessage = message;
    }
}

The main problem here is the object the lock is acquired upon: the reference of lastMessage is mutable! We call this a liquid lock, because the lock isn’t as solid as it should be. It’s one of the more hideous multithreading pitfalls as it looks like everything’s fine at first glance. But this lock doesn’t have a complete “locking” effect because each caller may acquire the lock of a different instance. And a lock with a flawed locking behaviour is guaranteed to fail (in production). The liquid lock is like the bigger brother of the local lock. It isn’t local, but its mutability cause the same problems.

The bug finally turned out to be caused by the liquid lock in the LED display driver that got notified of system message changes when a data package arrived. But only if multiple messages were sent at once to the device, discarding some of the necessary answers in this circumstance or if the connection to the LED hardware would fail in the midst of a transmission, the system would not return from the write attempt. If one thread wouldn’t return to the data package processor, the global data processing lock would not be freed (read the start of this chapter again, this is the most important lock in the system!). And while the data processing lock was still held, all other data packages would be received, but piling up to obtain the lock. But the lock would never be returned from the thread waiting on an answer from a hardware device that had no intention to send another answer. This was when the control center appeared to be healthy but didn’t process any data packages anymore.

The conclusion

If you want to avoid the category of liquid lock multithreading bugs, make sure that all your lock instance references are immutable. Being final is an important property of lock instance references. Avoid to retrieve your locks from notoriously muteable data structures like collections or arrays. The best thing you can do to avoid liquid locks is to “freeze” all your lock instances.

Another insight from this story is that software modules have to be separated threadwise, too. It was a major design flaw to let the data processing thread, while holding the main processing lock, descend down into the deep ends of the LED driver, eventually getting stuck there for infinity. Some simple mechanisms like asynchronous listener notification or producer/consumer queues for pending transmission requests would have helped to confine the effects of the liquid lock bug inside the LED module. Without proper thread separation, it took down the whole software instance.

	Writing Integration… on Every Unit Test Is a Stage Pla…
	mariuselvert on C# is very strict about modify…
	Anonymous on C# is very strict about modify…
	Anonymous on Cache configuration with WildF…
	Miq on Nested queries like N+1 in pra…