For us, automated tests are the hallmark of professional software development. That doesn’t mean that we buy into every testing fad that comes along or consider ourselves testing experts just because we write some tests alongside our code. We put our money where our mouth is and evaluate our abilities in writing effective tests.
One way to measure the effectiveness of tests is to try to “communicate through tests”. One developer/team writes code and tests for a given specification. Another team picks up the tests only and tries to recreate the production code and infer the specification. The only communication between the two teams happens through the tests.
We performed a small experiment with two teams and one day for both phases and blogged about it. The results of this evaluation was that unit tests are a good medium to transport specification details. But we got a hint that problems might be bigger when the code was less arithmetic and more complex. As most of our development tasks are rather complex and driven by business rules instead of clean mathematical algorithms, we wanted to inspect further.
Our larger experiment
So we organized a bigger experiment with a broader scope. Instead of two teams, we had three teams. We ran the phases for eight instead of two hours, essentially increasing the resulting code size by a factor of 3. The assignments weren’t static, but versioned – and the team only knew the rules of the current version. When a team would reach a certain milestone, more rules would be revealed, partly contradicting the previous ruleset. This should emulate changing customer requirements. And to provide the ability to retrospect on the reconstruction phase, we recorded this phase with a screencast software (we used the commercial product Debut Video Capture), capturing both inputs and conversation by using headsets for every developer.
The first part of this experiment happened in late January of 2013, where all teams had one day to produce production and test code. This was a day of loud buzz in our development department. The second part for the reconstruction phase was scheduled for the middle of February 2013. We had to be a bit more quiet this time to increase the audio recording quality, but the developers were humming nonetheless.
Here are some numbers of what was produced in the first session:
- Team 1: 400 lines of production code, 530 lines of test code. 8 production classes, 54 tests. Test coverage of 90.6%.
- Team 2: 576 lines of production code, 655 lines of test code. 17 production classes, 59 tests. Test coverage of 98.2%.
- Team 3: 442 lines of production code, 429 lines of test code. 18 production classes, 37 tests. Test coverage of 97.0%.
The reconstruction phase was finished in less than five hours, partly because we stuck very close to the actual tests with little guesswork. When the tests didn’t enforce a functionality, it wasn’t implemented to reveal the holes in the test coverage. This reduced the amount of production code that had to be written. On the flipside, every team got lost once on the way, loosing the better part of an hour without noticeable progress.
The results
After all the talk about the event itself, let’s have a look at our results of the experiment:
- The recording of the reconstruction phase was a huge gain in understanding the detailed problems. We even discussed recording the construction phase too to capture the original design decisions.
- Every decision on unclear terms from the original team lead to “blurry” tests that didn’t guide the reconstruction team as good as the “razor-sharp” tests did.
- You could definitely tell the TDD tests from the “test first” tests or even the tests written “immediately after”. More on this aspect later, but this was our biggest overall take-away: The quality of the tests in terms of being a specification differed greatly. This wasn’t bound to teams – as soon as a team lost the TDD “drive”, the tests lost guidance power.
- Test coverage (in terms of line coverage or conditional coverage) means nothing. You can have 100% test coverage and still suffer from severe plot holes in your tests. Blurry tests tend to increase the coverage, but not the accountability of tests.
- In general, we were surprised how little guidance and coverage most tests offered. The assignments included some obvious “testing problems” like dealing with randomness and every team dealt with them deliberately. Still, these were the major pain points during the reconstruction phase. This result puts our first small experiment a bit into perspective. What works well with small code bases might be disproportionally harder to achieve when the code size scales up. So while TDD/tests might work sufficiently easy on a small task, it needs more attention for a larger task.
The biggest problem
When talking about “plot holes” from the tests, let me give you a detailed example of what I mean. The more useless tests suffered from a lack of triangulation. In geometry, triangulation is the process of determining the location of a point by measuring several angles to it from known points. When writing tests, triangulation is the effort to “pinpoint” or specify the implementation with a set of different inputs and required outputs. You specify enough different tests of the same functionality to require it being “real” instead of a dummy implementation. Let’s look at this test:
@Test public void parsesUserInput() { assertThat(new InputParser().parse("1 3 5"), hasItems(1, 3, 5)); }
Well, the test tells us that we need to convert a given string into a bunch of integers. It specifies the necessary class and method for this task, but gives us great freedom in the actual implementation. This makes the test green:
public Iterable<Integer> parse(String input) { return Arrays.asList(1, 3, 5); }
As far as the tests are concerned, this is a concise and correct implementation of the required functionality. And while it is obvious in our example that this will never be sufficient, it oftentimes isn’t so obvious when the problem domain isn’t as familiar as parsing strings to numbers. But to complete my explanation of test triangulation, let’s consider a more elaborate implementation of this test that needs a lot more work on the implementation side (especially when developed in accordance with the Transformation Priority Premise by Uncle Bob and without obvious duplication):
@Test public void parsesUserInput() { assertThat(new InputParser().parse("1 3 5"), hasItems(1, 3, 5)); assertThat(new InputParser().parse("1 2"), hasItems(1, 2)); assertThat(new InputParser().parse("1 2 3 4 5"), hasItems(1, 2, 3, 4, 5)); assertThat(new InputParser().parse("1 4 5 3 2"), hasItems(1, 2, 3, 4, 5)); assertThat(new InputParser().parse("5 4"), hasItems(4, 5)); assertThat(new InputParser().parse("5 3"), hasItems(3, 5)); }
Maybe not all assertions are required and maybe they should live in different tests giving more hints in their names, but you get the idea: Making this test green is way “harder” than the initial test. Writing properly triangulated tests is one of the immediate benefits of Test Driven Development (TDD), as for example outlined nicely by Ray Sinnema on his blog entry about test-driving a code kata.
Our tests that were written “after the fact” often lacked the proper amount of triangulation, making it easier to “fake it” in the reconstruction phase. In a real project setting, these tests would allow for too much implementation deviation to act as a specification. They act more as usage examples and happy path “smoke” tests.
Our benefits
While this experiment doesn’t fulfill rigid academic requirements on gathering data, it already paid off greatly for us. We’ve examined our ability to express our implementations through tests and gathered insight on our real capabilities to use test-driven methodologies. Being able to judge relatively objectively on the quality of your own tests (by watching the reconstruction phase’s screencast) was very helpful. We now know better what skills to improve and what to focus on during training.
Where to go from here?
We plan to repeat this experiment with interested participants as a spare-time event later this year. For now and ourselves, we have gathered enough impressions to act on them. If you are interested in more details, drop us a note. We could publish only the tests (for reconstruction), the complete code or even the screencasts (albeit they are somewhat long-running). Our participants could elaborate their impressions in the comment section, if you ask them.
We are very interested in your results when performing similar events, like Tomasz Borek did this month in Krakow, Poland. We found his blog entry about the event to be very interesting. We definitely lacked the surprise element for the teams during the event.
4 thoughts on “Communication through Tests – a larger experiment”