An experiment about communication through tests

How effectively does our test code communicate? We wanted to know whether we could recreate a piece of software from its tests alone. The experiment gave us some worthwhile insights.

Recently, we conducted a little experiment to determine our ability to communicate effectively using only automated tests. We wanted to know whether the tests we write are sufficient to recreate the entire production code from them and understand the original requirements. We were inspired by a similar experiment performed by the Softwerkskammer Karlsruhe in July 2012.

The rules

We chose a “game master” and two teams of two developers each, named “Team A” and “Team B”. The game master secretly picked two coding exercises of comparable skill and effort and briefed each team on one of them. The other team wasn’t supposed to know the original assignment beforehand, so the briefings were held in isolation. Then the implementation phase began. The teams were instructed to write extensive tests, be it unit or integration tests, before or after the production code. The teams knew how the tests would be used later. After about two hours of implementation time, we stopped development and held a little recreation break.

Then the complete test code of each implementation was handed over to the other team, while all production code was withheld for later comparison. So Team A started with all the tests of Team B and had to recreate the complete missing production code to fulfill Team B’s assignment without knowing exactly what it was. Team B had to do the same with the production code and assignment of Team A, using only their test code. After the “reengineering phase”, as we called it, we compared the solutions and discussed problems and impressions, essentially holding a retrospective on the experiment.

The assignments

The two coding exercises were taken from the Kata Catalogue and adapted to exhibit slightly different rules:

  • Compare Poker Hands: Given two hands of five poker cards, determine which hand has a higher rank and wins the round.
  • Automatic Yahtzee Player: Given five dice and our local Yahtzee rules, decide which dice should be rerolled.

There was no obligation to complete the exercise, only to develop from a reasonable starting point in a comprehensible direction. The code should be correct and compilable virtually all the time. Test coverage should be close to 100%, even though test-driven development or test-first wasn’t explicitly required. The emphasis should be on the test code, not on the production code.
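To give a flavour of the kind of test-heavy code this setup produced, here is a minimal sketch of the poker-hands exercise. This is not the original team code; all names and the coarse ranking rule are invented for illustration:

```java
// Hypothetical sketch, not the original team code: a tiny slice of the
// "Compare Poker Hands" kata, pinned down by the kind of checks we asked for.
import java.util.Arrays;

public class PokerHandSketch {

    // Rank a five-card hand very coarsely: the size of the largest group
    // of equal values (1 = high card, 2 = pair, 3 = three of a kind, ...).
    static int largestGroup(String... cards) {
        int[] counts = new int[15];
        for (String card : cards) {
            counts[valueOf(card)]++;
        }
        return Arrays.stream(counts).max().getAsInt();
    }

    // "2".."9", "T", "J", "Q", "K", "A" -> 2..14
    static int valueOf(String card) {
        char c = card.charAt(0);
        switch (c) {
            case 'T': return 10;
            case 'J': return 11;
            case 'Q': return 12;
            case 'K': return 13;
            case 'A': return 14;
            default:  return c - '0';
        }
    }

    public static void main(String[] args) {
        // The "tests": a pair ranks above a high card.
        if (largestGroup("2D", "2S", "5H", "9C", "KD") != 2) throw new AssertionError();
        if (largestGroup("2D", "3S", "5H", "9C", "KD") != 1) throw new AssertionError();
    }
}
```

A real solution would of course distinguish straights, flushes and kickers; the point here is only the test-first shape of the work.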

The implementation

Both teams understood their assignment immediately and had their own “natural” way to develop the code. Both teams chose Java as their programming language. The game master alternated between the teams to answer minor questions and gather impressions. After about two hours, we decided to end the phase and stop coding with the next passing test. Neither team completed its assignment, but the resulting code was very similar in size and other key figures:

  • Team A: 217 lines of production code, 198 lines of test code. 5 production classes, 17 tests. Test coverage of 94.1%
  • Team B: 199 lines of production code, 166 lines of test code. 7 production classes, 17 tests. Test coverage of 94.1%

In summary, each team produced half a dozen production classes with a total of ~200 lines of code. 17 tests with a total of ~180 lines of code covered more than 90% of the production code.

The reengineering

After a short break, the teams started with all the test code of the other team, but no production code. The first step was to let the IDE create the missing classes and methods to get the tests to compile. Then the teams chose basic unit tests to build up the initial production code base. This succeeded very quickly and turned a lot of tests green. Both teams struggled later on when the tests (and production code) increased in complexity. Both teams introduced new classes to the code base even when the tests didn’t suggest them, justifying the decision with “better code design” and “ease of implementation”. After about 90 minutes (and nearly simultaneously), both teams had implemented enough production code to turn all tests green. Both teams were confident that they understood the initial assignment and had implemented a solution equivalent to the original production code base.
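The first step can be illustrated like this (class and method names are invented for this sketch): the inherited test is all a team has, the IDE generates a compiling skeleton from it, and the body is then filled in one test at a time:

```java
// Hypothetical sketch of the reengineering starting point. The assertions
// in main() stand in for the inherited test; the forSymbol method is what
// the IDE first stubs out and the team then fills in to turn tests green.
public class ReengineeringSketch {

    enum Value {
        _2, _10, ACE;

        // Initially an IDE-generated stub that only threw an exception;
        // the mapping below was added step by step, test by test.
        static Value forSymbol(String symbol) {
            switch (symbol) {
                case "2": return _2;
                case "T": return _10;
                case "A": return ACE;
                default:  throw new IllegalArgumentException(symbol);
            }
        }
    }

    public static void main(String[] args) {
        // The inherited test, reduced to plain assertions here.
        if (Value.forSymbol("2") != Value._2) throw new AssertionError();
        if (Value.forSymbol("T") != Value._10) throw new AssertionError();
        if (Value.forSymbol("A") != Value.ACE) throw new AssertionError();
    }
}
```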

The examination

We gathered for the examination and found that both teams had met their requirements: the recreated code bases were correct in terms of the original solution and the assignment. We had shown that communication through test code alone is possible for us. But that wasn’t the deepest insight we took from the experiment. Here are a few insights we gathered during the retrospective:

  • Both teams had trouble distinguishing effectively between requirements from the assignment and implementation decisions made by the other team. The tests didn’t convey this distinction well enough. See an example below.
  • The recreated production code turned out to be slightly more precise and concise than the original code. This surprised us a bit and is a huge hint that test driven development, if applied with the “right state of mind”, might improve code quality (at least for this problem domain and these developers).
  • The classes that were introduced during the reengineering phase were present in the original code, too. They just didn’t explicitly show up in the test code.
  • The test code alone wasn’t really helpful in several cases, like:
    • Deciding if a class was/should be an Enum or a normal class
    • Figuring out the meaning of arguments with primitive values. A language with named parameter support would alleviate this problem. In Java, you might consider using Code Squiggles if you want to prepare for this scenario.
  • The original team would greatly benefit from watching the reengineering team during their coding. The reengineering team would not benefit from interference by the original team. For a solution to this problem, see below.
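The primitive-argument problem from the list above can be sketched as follows (all names are invented): a call with bare ints tells the reengineering team nothing, while tiny wrapper types make the same call site self-explanatory:

```java
// Hypothetical sketch of the primitive-argument problem. The names are
// invented; the point is what a reengineering team can (not) read from a call.
public class NamedArgumentSketch {

    // Opaque version: what do the two ints mean? Die value? Index? Count?
    static boolean shouldReroll(int value, int count) {
        return count < 3 && value < 5;
    }

    // Self-describing version: tiny wrapper types stand in for the
    // named parameters that Java lacks.
    static class DieValue { final int value; DieValue(int v) { value = v; } }
    static class Occurrences { final int count; Occurrences(int c) { count = c; } }

    static boolean shouldReroll(DieValue value, Occurrences occurrences) {
        return shouldReroll(value.value, occurrences.count);
    }

    public static void main(String[] args) {
        // Opaque at the call site:
        boolean a = shouldReroll(4, 2);
        // Readable at the call site:
        boolean b = shouldReroll(new DieValue(4), new Occurrences(2));
        if (a != b || !a) throw new AssertionError();
    }
}
```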

The revelation

One revelation we can directly apply to our test code was how to help with the distinction between a requirement (“has to be this way”) and an implementer’s choice (“incidentally is this way”). Let’s look at an example:

In the poker hands coding exercise, every card is represented by two characters, like “2D” for a two of diamonds or “AS” for an ace of spades. The encoding is straightforward, except for the ten, which is represented by a “T” rather than “10”: “TH” is the ten of hearts. This is a requirement; the implementer cannot choose another encoding. The test for the encoding looks like this:

@Test
public void parseValueForSymbol() {
  assertEquals(Value._2, Value.forSymbol("2"));
  assertEquals(Value._10, Value.forSymbol("T"));
  assertEquals(Value.ACE, Value.forSymbol("A"));
}

If you write the test like this, there is a clear definition of the encoding, but not of the underlying decision for it. Let’s rewrite the test to communicate that the “T” for ten isn’t an arbitrary choice:

@Test
public void parseValueForSymbol() {
  assertEquals(Value._2, Value.forSymbol("2"));
  assertEquals(Value.ACE, Value.forSymbol("A"));
}

@Test
public void tenIsRequiredToBeRepresentedByT() {
  assertEquals(Value._10, Value.forSymbol("T"));
}
Just by extracting this encoding into a dedicated test case, you emphasize that you are aware of the “inconsistency”. Through the test name, you state that it wasn’t your choice to encode it this way.

The improvement

We definitely want to repeat this experiment in the future, but with some improvements. One would be to record the reengineering phases with screencast software, so that the original teams can watch the steps in detail and listen to the discussions without any possibility to interact or influence. Both original teams had great interest in the details of the recreation process and the problems with their tests. The other improvement might be a more relaxed schedule: with recorded reengineering phases, there would be no need for direct observation by a game master or even for a concurrent performance. The tasks could be bigger and the pace a bit more relaxed.

In short: It was fun, challenging, informative and reaffirming. A great experience!

13 thoughts on “An experiment about communication through tests”

  1. Thank you, this was really interesting.

    I have one question. From the description of the exercises (poker hands & yahtzee player) I can imagine that you can test both of them using solely state-based tests (at least the majority of tests could have been of this kind). Was this the case?
    What I’m wondering about is what the reengineering phase would look like with a huge percentage/amount of interaction tests (you know – mock behaviour verification). I think it would be much harder then, because you would have many more implementation details carved in stone (which is inevitable even if you write really loosely-coupled interaction tests).

    Could you comment on this please?

    1. Hi Tomek,
      thank you for your interest. Your question is a very good one (and the reason why we don’t want to over-generalize the validity of the results). Most tests were very focused input → black box → expected output unit tests with a computational background. The only test that needed some kind of mocks/stubs was a real time-killer (the team reported a 20%/80% ratio for this single test vs. all(!) other unit tests). So I agree with your assumption and will try to come up with some more interaction-heavy assignments next time.
      The test mentioned above was a multi-roll test for the yahtzee player. Perhaps an example with really convoluted business logic would be interesting next time.
      Thank you for your thoughts. I hope I could give you a useful answer. Feel free to ask more.
      We can even provide you with the unit tests and reference implementations if you want to repeat the experiment. Only the exact wording of the assignments is lost, as it wasn’t written down.

      1. I would like to repeat the experiment. Maybe you can share the code on GitHub so I can fork it.

  2. Hi Daniel,

    thank you for your response! Now I have a full picture.

    No, I do not plan to repeat the experiment. 🙂

    Thank you once again for this very interesting post!


  3. I finished the first assignment. I didn’t set myself a time limit and tried to get all tests to pass. I didn’t know the rules of Kniffel before I started the assignment.

    My thoughts so far:
    – It doesn’t make sense to adhere to the TDD dogma “fix one test after another” once you have understood the purpose of the tests.

    – You can understand how the code should behave, but it’s hard to derive “domain knowledge” from it.

    – I think it will be much more interesting with a partner, so I will try to find someone to finish the second assignment.

  4. @Daniel
    Great many thanks for this entry, as well as your detailed answers and assignments!

    I’ve just finished my first replication of the experiment, as we talked. I’m tired now, but have compiled hot notes, and will transform them into blog post later. Code will land on GitHub.


    Pair programming changes A LOT. When I prepared for this experiment, I spent some time coding the assignments. Then I watched others code them in pairs. The difference in time spent on a task and in how the code looks is often fairly large.

    So, your last point is quite on the spot.

    Your second to last point IMO depends on number of tests and your place on ‘dreyfussian’ ladder.

    1. I assume you meant point 1&2 in your last paragraph.

      1: Yes, you should only fix multiple tests if you know what you’re doing. Much like reduction in math. If you’re unsure, write it out. Clarify up front which tests are going to be green and which ones will still be red.

      2: My assumptions were based on the assignment. More tests would’ve helped me grasp the domain better. But also the style of the tests can tell you more or less about the domain. Think of BDD for instance.

      I also finished the second assignment (poker) a while ago and finally pushed it to GitHub.
      I programmed it with a friend who mainly uses C#. Most of the time he commented that LINQ would have enabled an easier way to fix the tests. When you deal with data in Java, much of the time you write code that you’ve written thousands of times before. If you don’t use a library for common tasks like getting the first item of a list or filtering, you probably end up with a lot of hidden and unnecessary code duplication.
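A sketch of the duplication meant here (illustrative, not the commenter’s code): a hand-rolled filter loop versus the same intent as a one-liner. In 2012 Guava’s Iterables.filter filled this gap in Java; since Java 8, the standard streams do:

```java
// Illustrative sketch of collection boilerplate vs. a reusable one-liner.
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

public class CollectionBoilerplateSketch {

    // Written "thousands of times before": filter a card list by hand.
    static List<String> facesByHand(List<String> cards) {
        List<String> result = new ArrayList<>();
        for (String card : cards) {
            if ("JQK".contains(card.substring(0, 1))) {
                result.add(card);
            }
        }
        return result;
    }

    // The same intent, expressed once with the standard stream API.
    static List<String> facesWithStreams(List<String> cards) {
        return cards.stream()
                    .filter(card -> "JQK".contains(card.substring(0, 1)))
                    .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String> hand = Arrays.asList("2D", "KH", "TS", "QD", "9C");
        if (!facesByHand(hand).equals(facesWithStreams(hand))) throw new AssertionError();
        if (!facesByHand(hand).equals(Arrays.asList("KH", "QD"))) throw new AssertionError();
    }
}
```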

      1. In my pair-programming remark I was just verbosely agreeing with your point on that you will try another assignment in pair. 🙂

        As for Java verbosity – again yes. Without a library that deals with collection boilerplate code, C# wins – I’ve tried my hand at it during Code Retreats and found it much better at getting data in and out of collections in a way that nicely tells those reading your code what you were after. Have you tried Guava?

        Thanks to your reply I realized I miscalculated! When I wrote “your second to last point” I actually meant one before that one. Sorry, confusion NOT intended.

        I wanted to say that with a fairly large number of tests in place, it’s easy to miss something if you go like that, unless you have the skill. Which is essentially your point 1 just above.

        I (again! :D) fully agree about more tests and their style. More tests will almost always give better, more complete documentation and a safety net, and therefore domain knowledge. And the style of the tests can be helpful or detrimental.

        I’ve just finished deconstructing the first code from Saturday and am brimming with insights from it. 🙂 I will try to post about it today, though I’m not sure how long the writing will take me – it has taken a few hours already. 🙂 This was also a poker hands assignment.

    2. I can’t reply to your recent comment (the maximum reply depth prevents it) so I’ll reply to this one.

      Daniel told me that I wouldn’t get “the full experience” if I tried the assignments alone. It turned out that you and Daniel were absolutely right. Pair programming gave me way more impressions than the assignment I solved single-handedly. They were also easier to relate to my day-to-day work.

      Actually, I found the remark about the Dreyfus model really thoughtful as it made me consider whether my decision to fix multiple tests at once was too bold.

      I have yet to try Guava. I’m looking forward to it.
