Communication through Tests – a larger experiment

We evaluated our ability to communicate through tests in a two-day experiment and gathered some interesting results.

triangulatorFor us, automated tests are the hallmark of professional software development. That doesn’t mean that we buy into every testing fad that comes along or consider ourselves testing experts just because we write some tests alongside our code. We put our money where our mouth is and evaluate our abilities in writing effective tests.

One way to measure the effectiveness of tests is to try to “communicate through tests”. One developer/team writes code and tests for a given specification. Another team picks up the tests only and tries to recreate the production code and infer the specification. The only communication between the two teams happens through the tests.

We performed a small experiment with two teams and one day for both phases and blogged about it. The results of this evaluation was that unit tests are a good medium to transport specification details. But we got a hint that problems might be bigger when the code was less arithmetic and more complex. As most of our development tasks are rather complex and driven by business rules instead of clean mathematical algorithms, we wanted to inspect further.

Our larger experiment

So we organized a bigger experiment with a broader scope. Instead of two teams, we had three teams. We ran the phases for eight instead of two hours, essentially increasing the resulting code size by a factor of 3. The assignments weren’t static, but versioned – and the team only knew the rules of the current version. When a team would reach a certain milestone, more rules would be revealed, partly contradicting the previous ruleset. This should emulate changing customer requirements. And to provide the ability to retrospect on the reconstruction phase, we recorded this phase with a screencast software (we used the commercial product Debut Video Capture), capturing both inputs and conversation by using headsets for every developer.

The first part of this experiment happened in late January of 2013, where all teams had one day to produce production and test code. This was a day of loud buzz in our development department. The second part for the reconstruction phase was scheduled for the middle of February 2013. We had to be a bit more quiet this time to increase the audio recording quality, but the developers were humming nonetheless.

Here are some numbers of what was produced in the first session:

  • Team 1: 400 lines of production code, 530 lines of test code. 8 production classes, 54 tests. Test coverage of 90.6%.
  • Team 2: 576 lines of production code, 655 lines of test code. 17 production classes, 59 tests. Test coverage of 98.2%.
  • Team 3: 442 lines of production code, 429 lines of test code. 18 production classes, 37 tests. Test coverage of 97.0%.

The reconstruction phase was finished in less than five hours, partly because we stuck very close to the actual tests with little guesswork. When the tests didn’t enforce a functionality, it wasn’t implemented to reveal the holes in the test coverage. This reduced the amount of production code that had to be written. On the flipside, every team got lost once on the way, loosing the better part of an hour without noticeable progress.

The results

After all the talk about the event itself, let’s have a look at our results of the experiment:

  • The recording of the reconstruction phase was a huge gain in understanding the detailed problems. We even discussed recording the construction phase too to capture the original design decisions.
  • Every decision on unclear terms from the original team lead to “blurry” tests that didn’t guide the reconstruction team as good as the “razor-sharp” tests did.
  • You could definitely tell the TDD tests from the “test first” tests or even the tests written “immediately after”. More on this aspect later, but this was our biggest overall take-away: The quality of the tests in terms of being a specification differed greatly. This wasn’t bound to teams – as soon as a team lost the TDD “drive”, the tests lost guidance power.
  • Test coverage (in terms of line coverage or conditional coverage) means nothing. You can have 100% test coverage and still suffer from severe plot holes in your tests. Blurry tests tend to increase the coverage, but not the accountability of tests.
  • In general, we were surprised how little guidance and coverage most tests offered. The assignments included some obvious “testing problems” like dealing with randomness and every team dealt with them deliberately. Still, these were the major pain points during the reconstruction phase. This result puts our first small experiment a bit into perspective. What works well with small code bases might be disproportionally harder to achieve when the code size scales up. So while TDD/tests might work sufficiently easy on a small task, it needs more attention for a larger task.

The biggest problem

When talking about “plot holes” from the tests, let me give you a detailed example of what I mean. The more useless tests suffered from a lack of triangulation. In geometry, triangulation is the process of determining the location of a point by measuring several angles to it from known points. When writing tests, triangulation is the effort to “pinpoint” or specify the implementation with a set of different inputs and required outputs. You specify enough different tests of the same functionality to require it being “real” instead of a dummy implementation. Let’s look at this test:

@Test
public void parsesUserInput() {
  assertThat(new InputParser().parse("1 3 5"), hasItems(1, 3, 5));
}

Well, the test tells us that we need to convert a given string into a bunch of integers. It specifies the necessary class and method for this task, but gives us great freedom in the actual implementation. This makes the test green:

public Iterable<Integer> parse(String input) {
  return Arrays.asList(1, 3, 5);
}

As far as the tests are concerned, this is a concise and correct implementation of the required functionality. And while it is obvious in our example that this will never be sufficient, it oftentimes isn’t so obvious when the problem domain isn’t as familiar as parsing strings to numbers. But to complete my explanation of test triangulation, let’s consider a more elaborate implementation of this test that needs a lot more work on the implementation side (especially when developed in accordance with the Transformation Priority Premise by Uncle Bob and without obvious duplication):

@Test
public void parsesUserInput() {
  assertThat(new InputParser().parse("1 3 5"), hasItems(1, 3, 5));
  assertThat(new InputParser().parse("1 2"), hasItems(1, 2));
  assertThat(new InputParser().parse("1 2 3 4 5"), hasItems(1, 2, 3, 4, 5));
  assertThat(new InputParser().parse("1 4 5 3 2"), hasItems(1, 2, 3, 4, 5));
  assertThat(new InputParser().parse("5 4"), hasItems(4, 5));
  assertThat(new InputParser().parse("5 3"), hasItems(3, 5));
}

Maybe not all assertions are required and maybe they should live in different tests giving more hints in their names, but you get the idea: Making this test green is way “harder” than the initial test. Writing properly triangulated tests is one of the immediate benefits of Test Driven Development (TDD), as for example outlined nicely by Ray Sinnema on his blog entry about test-driving a code kata.
Our tests that were written “after the fact” often lacked the proper amount of triangulation, making it easier to “fake it” in the reconstruction phase. In a real project setting, these tests would allow for too much implementation deviation to act as a specification. They act more as usage examples and happy path “smoke” tests.

Our benefits

While this experiment doesn’t fulfill rigid academic requirements on gathering data, it already paid off greatly for us. We’ve examined our ability to express our implementations through tests and gathered insight on our real capabilities to use test-driven methodologies. Being able to judge relatively objectively on the quality of your own tests (by watching the reconstruction phase’s screencast) was very helpful. We now know better what skills to improve and what to focus on during training.

Where to go from here?

We plan to repeat this experiment with interested participants as a spare-time event later this year. For now and ourselves, we have gathered enough impressions to act on them. If you are interested in more details, drop us a note. We could publish only the tests (for reconstruction), the complete code or even the screencasts (albeit they are somewhat long-running). Our participants could elaborate their impressions in the comment section, if you ask them.
We are very interested in your results when performing similar events, like Tomasz Borek did this month in Krakow, Poland. We found his blog entry about the event to be very interesting. We definitely lacked the surprise element for the teams during the event.

A small test saves the day

You think a method is too trivial to write a test for it? Think again if the method is mission-critical!

Just recently, I had to write a connection between an existing application and a new hardware unit. This is a fairly common job for our company, even considering the circumstances that I’d never even seen the hardware, let alone being able to connect to it. The hardware unit itself was rather big and it was installed in a security sensitive area with restricted access. So, I only got a specification of the protocol to use and a description of the hardware’s features.

Our common procedure to include hardware dependent modules into an application is to write two implementations of the module: One implementation is the real deal and interacts with the hardware over ethernet, USB, serial port or whatever proprietary communication device is used. This version of the module can only work as intended if the hardware is present. The other implementation acts as an emulation of the hardware, without any dependencies. If you are familiar with unit tests, think of a big test mock. The emulation version is used during development to test and run the application without requirements about the hardware. There are a lot of subtle pitfalls to consider and avoid, but on a bird-view level of abstraction, these interchangeable implementations of a module enable us to develop software with hardware dependencies without need for the actual hardware.

The first piece of code that’s used of a module is a factory/builder class that chooses between the available implementations, based on some configuration entry (or hardware availability, etc.). A typical implementation of the responsible method might look like this:


public HardwareModule createFor(ModuleConfiguration configuration) {
  if (configuration.isHardwarePresent()) {
    new RealHardwareModule();
  }
  return new EmulatedHardwareModule();
}

If the configuration object says that the hardware is present, the real implementation is used, subsequentially opening a connection to the hardware and talking the client side of the given protocol. Otherwise, the emulation is created and returned, maybe opening a debug GUI window to display certain internal states and values and providing controls to mess with the application during development.

The method itself looks very innocent and meager. There is not much going on, so what could possibly go wrong?

I’m not the most eager test-driven developer in the world, I have to admit. But I see the value of tests (and unit tests in particular) and adhere to the A-TRIP rules defined by Andy Hunt and (pragmatic) Dave Thomas:

  • Automatic
  • Thorough
  • Repeatable
  • Independent
  • Professional

For a complete definition of the rules, read the linked blog entry or, even better, buy the book. It’s small and cheap, but contains a lot of profound basic knowledge about unit testing.

The “Thorough” rule is more of a rule of thumb than a hard scientific formula for good unit tests: Always write a test if you’ve found a bug or if the code you’re writing is mission-critical. This was when my gut feeling told me that while the method above might seem trivial, it is definitely essential for the hardware module. So I wrote a test:

  @Test
  public void providesEmulationIfUnspecified() {
    HardwareModuleFactory factory = new HardwareModuleFactory();
    HardwareModule hardware = factory.createFor(configuration(""));
    assertEquals("not the hardware emulation", EmulatedHardwareModule.class, hardware.getClass());
  }

  @Test
  public void providesEmulationIfHardwareAbsent() {
    HardwareModuleFactory factory = new HardwareModuleFactory();
    HardwareModule hardware = factory.createFor(configuration("hardware.present=false"));
    assertEquals("not the hardware emulation", EmulatedHardwareModule.class, hardware.getClass());
  }

  @Test
  public void providesRealImplementationIfHardwarePresent() {
    HardwareModuleFactory factory = new HardwareModuleFactory();
    HardwareModule hardware = factory.createFor(configuration("hardware.present=true"));
    assertEquals("not the real hardware implementation", RealHardwareModule.class, hardware.getClass());
  }

To my surprise, the test immediately went red for the third test method. After double-checking the test code, I was certain that the test was correct. The test discovered a bug in the production code. And being a mostly independent unit test, it pointed to the problematic lines right away: the method implementation above. The helper method named configuration() spared in the code sample was very unlikely to contain a bug.

After a short moment of reading the code again, I corrected it (note the added return statement in line 3):


public HardwareModule createFor(ModuleConfiguration configuration) {
  if (configuration.isHardwarePresent()) {
    return new RealHardwareModule();
  }
  return new EmulatedHardwareModule();
}

This might not seem like the most disastrous bug ever, but it would have made for a nasty start when I finally would have tried the application with the real hardware. There is nothing more valueable than to be able to keep your cool “in the wild” and work on the real problems like faulty protocol specifications or unexpected/undocumented hardware behaviour. So, my gut feeling (and the Thorough rule) were right and my brain, telling me “skip this petty test” longer than I like to admit, was wrong. A small test for a small method paid off immediately and saved the day, at least for me.

A blind spot of Continuous Integration

Imagine you haven’t changed a continuously integrated project for months when suddenly a unit test breaks. Here’s why CI has a blind spot with flawed tests.

In the early days of April 2008, we updated our hudson continuous integration (CI) server to a new version. This was no unusual action, as there was a new version every day back then, bringing new features in a rapid rate. What was unusual after the upgrade was that one of the surveilled projects failed to build all of a sudden.

Sudden (test) failure without a change

The build was started manually, without a code change. The project itself was inactive back then, meaning that no changes were made for months. And suddenly, a unit test broke. The test was in the project for two whole years without ever going off. What happened?

Good unit tests

There are rules for good unit tests. A basic set are the A-TRIP rules formulated in the excellent beginner’s book “Pragmatic Unit Testing” by Andy Hunt and Dave Thomas. The failed test clearly disobeyed the “repeatable” rule (the R in A-TRIP): It didn’t result in the same result as before while the code under test didn’t change.

Write repeatable tests or your CI will be blind (partially)

The cause of the failed test was putting the clocks back because the daylight-saving time part of the year began. The unit test secured some date calculations by taking the current date and comparing it to future and past dates that got calculated. The calculation went wrong when the daylight-saving mode changes in the calculated period, which was a bug. Repeatability of the unit test was lost when “the current date” entered the code – whether on the unit test or productive code side.

Two years of blindness

How could this bug survive two years without being noticed? The project was under CI surveillance since the beginning, the unit test to detect the bug was present along with the bugged code. The answer is: We never programmed for this project around the weeks of the year when the clock is adjusted and the bug occurs. This was a coincidence influenced by the customer’s schedule. So every time CI (or we) ran the unit test, it passed. Until that day right after putting the clocks back.

How to avoid this blind spot

There are two things you can do to avoid this scenario:

  • Always inject a fixed “the current date” into your code when dealing with date calculations. Only use absolute dates in your unit tests. Time isn’t a healer for your tests, it’s a beast to be tamed.
  • Set up a nightly build for your project that runs once a day even when no changes have been made. It would have caught this bug one and a half year earlier.

To sum it up:

CI only spots bugs when they move (aka the code is changed). Nightly builds provide a (fuzzy) security layer against non-repeatable unit tests. And unit tests with flaws provide only delusory security.

Additional background information

After fully understanding the circumstances, we were curious why the customer didn’t notice the bug and asked him about it. The answer was delightful: “Our computers don’t adjust their clocks. Daylight-saving time only causes trouble.” What a wise decision!

For a good comparison of CI vs. Nightly Builds see this blog entry.