The difference between Test First and Test Driven Development

If you tend to fall into the “one-two-everything”-trap while doing TDD, this blog post might give you some perspective on what really happens and how to avoid the trap by decomposing the problem.

The concept of Test First (“TF”, write a failing test first and make it green by writing exactly enough production code to do so) was always very appealing to me. But I’ve never experienced the guiding effect that is described for Test Driven Development (“TDD”, apply a Test First approach to a problem using baby steps, letting the tests drive the production code). This lead to quite some frustration and scepticism on my side. After a lot of attempts and training sessions with experienced TDD practioners, I concluded that while I grasped Test First and could apply it to everyday tasks, I wouldn’t be able to incorporate TDD into my process toolbox. My biggest grievance was that I couldn’t even tell why TDD failed for me.

The bad news is that TDD still lies outside my normal toolbox. The good news is that I can pinpoint a specific area where I need training in order to learn TDD properly. This blog post is the story about my revelation. I hope that you can gather some ideas for your own progress, implied that you’re no TDD master, too.

A simple training session

In order to learn TDD, I always look for fitting problems to apply it to. While developing a repository difference tracker, the Diffibrillator, there was a neat little task to order the entries of several lists of commits into a single, chronologically ordered list. I delayed the implementation of the needed algorithm for a TDD session in a relaxed environment. My mind began to spawn background processes about possible solutions. When I finally had a chance to start my session, one solution had already crystallized in my imagination:

An elegant solution

Given several input lists of commits, now used as queues, and one result list that is initially empty, repeat the following step until no more commits are pending in any input queue: Compare the head commits of all input queues by their commit date and remove the oldest one, adding it to the result list.
I nick-named this approach the “PEZ algorithm” because each commit list acts like the old PEZ candy dispensers of my childhood, always giving out the topmost sherbet shred when asked for.

A Test First approach

Trying to break the problem down into baby-stepped unit tests, I fell into the “one-two-everything”-trap once again. See for yourself what tests I wrote:

@Test
public void emptyIteratorWhenNoBranchesGiven() throws Exception {
  Iterable<ProjectBranch> noBranches = new EmptyIterable<>();
  Iterable<Commit> commits = CombineCommits.from(noBranches).byCommitDate();
  assertThat(commits, is(emptyIterable()));
}

The first test only prepares the classes’ interface, naming the methods and trying to establish a fluent coding style.

@Test
public void commitsOfBranchIfOnlyOneGiven() throws Exception {
  final Commit firstCommit = commitAt(10L);
  final Commit secondCommit = commitAt(20L);
  final ProjectBranch branch = branchWith(secondCommit, firstCommit);
  Iterable<Commit> commits = CombineCommits.from(branch).byCommitDate();
  assertThat(commits, contains(secondCommit, firstCommit));
}

The second test was the inevitable “simple and dumb” starting point for a journey led by the tests (hopefully). It didn’t lead to any meaningful production code. Obviously, a bigger test scenario was needed:

@Test
public void commitsOfSeveralBranchesInChronologicalOrder() throws Exception {
  final Commit commitA_1 = commitAt(10L);
  final Commit commitB_2 = commitAt(20L);
  final Commit commitA_3 = commitAt(30L);
  final Commit commitA_4 = commitAt(40L);
  final Commit commitB_5 = commitAt(50L);
  final Commit commitA_6 = commitAt(60L);
  final ProjectBranch branchA = branchWith(commitA_6, commitA_4, commitA_3, commitA_1);
  final ProjectBranch branchB = branchWith(commitB_5, commitB_2);
  Iterable<Commit> commits = CombineCommits.from(branchA, branchB).byCommitDate();
  assertThat(commits, contains(commitA_6, commitB_5, commitA_4, commitA_3, commitB_2, commitA_1));
}

Now we are talking! If you give the CombineCommits class two branches with intertwined commit dates, the result will be a chronologically ordered collection. The only problem with this test? It needed the complete 100 lines of algorithm code to be green again. There it is: the “one-two-everything”-trap. The first two tests are merely finger exercises that don’t assert very much. Usually the third test is the last one to be written for a long time, because it requires a lot of work on the production side of code. After this test, the implementation is mostly completed, with 130 lines of production code and a line coverage of nearly 98%. There wasn’t much guidance from the tests, it was more of a “holding back until a test allows for the whole thing to be written”. Emotionally, the tests only hindered me from jotting down the algorithm I already envisioned and when I finally got permission to “show off”, I dived into the production code and only returned when the whole thing was finished. A lot of ego filled in the lines, but I didn’t realize it right away.

But wait, there is a detail left out from the test above that needs to be explicitely specified: If two commmits happen at the same time, there should be a defined behaviour for the combiner. I declare that the order of the input queues is used as a secondary ordering criterium:

@Test
public void decidesForFirstBranchIfCommitsAtSameDate() throws Exception {
  final Commit commitA_1 = commitAt(10L);
  final Commit commitB_2 = commitAt(10L);
  final Commit commitA_3 = commitAt(20L);
  final ProjectBranch branchA = branchWith(commitA_3, commitA_1);
  final ProjectBranch branchB = branchWith(commitB_2);
  Iterable<Commit> commits = CombineCommits.from(branchA, branchB).byCommitDate();
  assertThat(commits, contains(commitA_3, commitA_1, commitB_2));
}

This test didn’t improve the line coverage and was green right from the start, because the implementation already acted as required. There was no guidance in this test, only assurance.

And that was my session: The four unit tests cover the anticipated algorithm completely, but didn’t provide any guidance that I could grasp. I was very disappointed, because the “one-two-everything”-trap is a well-known anti-pattern for my TDD experiences and I still fell right into it.

A second approach using TDD

I decided to remove my code again and pair with my co-worker Jens, who formulated a theory about finding the next test by only changing one facet of the problem for each new test. Sounds interesting? It is! Let’s see where it got us:

@Test
public void noBranchesResultsInEmptyTrail() throws Exception {
  CommitCombiner combiner = new CommitCombiner();
  Iterable<Commit> trail = combiner.getTrail();
  assertThat(trail, is(emptyIterable()));
}

The first test starts as no big surprise, it only sets “the mood”. Notice how we decided to keep the CommitCombiner class simple and plain in its interface as long as the tests don’t get cumbersome.

@Test
public void emptyBranchesResultInEmptyTrail() throws Exception {
  ProjectBranch branchA = branchFor();
  CommitCombiner combiner = new CommitCombiner(branchA);
  assertThat(combiner.getTrail(), is(emptyIterable()));
}

The second test asserts only one thing more than the initial test: If the combiner is given empty commit queues (“branches”) instead of none like in the first test, it still returns an empty result collection (the commit “trail”).

With the single-facet approach, we can only change our tested scenario in one “domain dimension” and only the smallest possible amount of it. So we formulate a test that still uses one branch only, but with one commit in it:

@Test
public void branchWithCommitResultsInEqualTrail() throws Exception {
  Commit commitA1 = commitAt(10L);
  ProjectBranch branchA = branchFor(commitA1);
  CommitCombiner combiner = new CommitCombiner(branchA);
  assertThat(combiner.getTrail(), Matchers.contains(commitA1));
}

With this test, there was the first meaningful appearance of production code. We kept it very simple and trusted our future tests to guide the way to a more complex version.

The next test introduces the central piece of domain knowledge to the production code, just by changing the amount of commits on the only given branch from “one” to “many” (three):

@Test
public void branchWithCommitsAreReturnedInOrder() throws Exception {
  Commit commitA1 = commitAt(10L);
  Commit commitA2 = commitAt(20L);
  Commit commitA3 = commitAt(30L);
  ProjectBranch branchA = branchFor(commitA3, commitA2, commitA1);
  CommitCombiner combiner = new CommitCombiner(branchA);
  assertThat(combiner.getTrail(), Matchers.contains(commitA3, commitA2, commitA1));
}

Notice how this requires the production code to come up with the notion of comparable commit dates that needs to be ordered. We haven’t even introduced a second branch into the scenario yet but are already asserting that the topmost mission critical functionality works: commit ordering.

Now we need to advance to another requirement: The ability to combine branches. But whatever we develop in the future, it can never break the most important aspect of our implementation.

@Test
public void twoBranchesWithOnlyOneCommit() throws Exception {
  Commit commitA1 = commitAt(10L);
  ProjectBranch branchA = branchFor(commitA1);
  ProjectBranch branchB = branchFor();
  CommitCombiner combiner = new CommitCombiner(branchA, branchB);
  assertThat(combiner.getTrail(), Matchers.contains(commitA1));
}

You might say that we knew about this behaviour of the production code before, when we added the test named “branchWithCommitResultsInEqualTrail”, but it really is the assurance that things don’t change just because the amount of branches changes.

Our production code had no need to advance as far as we could already anticipate, so there is the need for another test dealing with multiple branches:

@Test
public void allBranchesAreUsed() throws Exception {
  Commit commitA1 = commitAt(10L);
  ProjectBranch branchA = branchFor(commitA1);
  ProjectBranch branchB = branchFor();
  CommitCombiner combiner = new CommitCombiner(branchB, branchA);
  assertThat(combiner.getTrail(), Matchers.contains(commitA1));
}

Note that the only thing that’s different is the order in which the branches are given to the CommitCombiner. With this simple test, there needs to be some important improvements in the production code. Try it for yourself to see the effect!

Finally, it is time to formulate a test that brings the two facets of our algorithm together. We tested the facets separately for so long now that this test feels like the first “real” test, asserting a “real” use case:

@Test
public void twoBranchesWithOneCommitEach() throws Exception {
  Commit commitA1 = commitAt(10L);
  Commit commitB1 = commitAt(20L);
  ProjectBranch branchA = branchFor(commitA1);
  ProjectBranch branchB = branchFor(commitB1);
  CommitCombiner combiner = new CommitCombiner(branchA, branchB);
  assertThat(combiner.getTrail(), Matchers.contains(commitB1, commitA1));
}

If you compare this “full” test case to the third test case in my first approach, you’ll see that it lacks all the mingled complexity of the first try. The test can be clear and concise in its scenario because it can rely on the assurances of the previous tests. The third test in the first approach couldn’t rely on any meaningful single-faceted “support” test. That’s the main difference! This is my error in the first approach: Trying to cramp more than one new facet in the next test, even putting all required facets in there at once. No wonder that the production code needed “everything” when the test requires it. No wonder there’s no guidance from the tests when I wanted to reach all my goals at once. Decomposing the problem at hand into independent “features” or facets is the most essential step to learn in order to advance from Test First to Test Driven Development. Finding a suitable “dramatic composition” for the tests is another important ability, but it can only be applied after the decomposition is done.

But wait, there is a fourth test in my first approach that needs to be tested here, too:

@Test
public void twoBranchesWithCommitsAtSameTime() throws Exception {
  Commit commitA1 = commitAt(10L);
  Commit commitB1 = commitAt(10L);
  ProjectBranch branchA = branchFor(commitA1);
  ProjectBranch branchB = branchFor(commitB1);
  CommitCombiner combiner = new CommitCombiner(branchA, branchB);
  assertThat(combiner.getTrail(), Matchers.contains(commitA1, commitB1));
}

Thankfully, the implementation already provided this feature. We are done! And in this moment, my ego showed up again: “That implementation is an insult to my developer honour!” I shouted. Keep in mind that I just threw away a beautiful 130-lines piece of algorithm for this alternate implementation:

public class CommitCombiner {
  private final ProjectBranch[] branches;

  public CommitCombiner(ProjectBranch... branches) {
    this.branches = branches;
  }

  public Iterable<Commit> getTrail() {
    final List<Commit> result = new ArrayList<>();
    for (ProjectBranch each : this.branches) {
      CollectionUtil.addAll(result, each.commits());
    }
    return sortedWithBranchOrderPreserved(result);
  }

  private Iterable<Commit> sortedWithBranchOrderPreserved(List<Commit> result) {
    Collections.sort(result, antichronologically());
    return result;
  }

  private <D extends Dated> Comparator<D> antichronologically() {
    return new Comparator<D>() {
      @Override
      public int compare(D o1, D o2) {
        return o2.getDate().compareTo(o1.getDate());
      }
    };
  }
}

The final and complete second implementation, guided to by the tests, is merely six lines of active code with some boiler-plate! Well, what did I expect? TDD doesn’t lead to particularly elegant solutions, it leads to the simplest thing that could possibly work and assures you that it will work in the realm of your specification. There’s no place for the programmer’s ego between these lines and that’s a good thing.

Conclusion

Thank you for reading until here! I’ve learnt an important lesson that day (thank you, Jens!). And being able to pinpoint the main hindrance on my way to fully embracing TDD enabled me to further improve my skills even on my own. It felt like opening an ever-closed door for the first time. I hope you’ve extracted some insights from this write-up, too. Feel free to share them!

Does Refactoring turn unit test of TDD to integration tests?

We really value automated tests and do experiments regarding test driven development (TDD) and tests in general from time to time. In the retrospective of our lastest experiment this question struck me: Does refactoring turn the unit tests of TDD to integration tests over time?

Let me elaborate this a bit further. When you start out with your tests you have some unit of functionality – usually a class – in mind. As you add test after test your implementation slowly fleshes out. You are repeating the TDD cycle “Write a failing test – Make test pass – Refactor” as you are adding features. The refactoring step is crucial in the whole process because it keeps the code clean and evolvable. But this step is also the cause leading to my observation: As you add new features you may extract new classes when refactoring to obey the single responsibility principle (SRP) and keep your design sane. It is very easy to forget or just ignore refactoring the tests. They still pass. You still have the same code coverage. But your tests now test the combination of several units. And what’s worse: You have units without direct tests.

This happened even in relatively small experiments on “Communication through tests” where the recontructing team could sometimes only guess that some class existed and either went on without it or created the class out of neccessity. The problem with this is that there are no obvious and clear indicators that your unit tests are not real unit tests anymore.

Conclusion

I neither have any solution nor am I completely sure how big the problem is in practice. It may help to state the TDD cycle more explicitly like “Write a failing test – Make test pass – Refactor implementation and tests” although that is no 100% remedy. One could implement a simple, checkstyle-like tool which lists all units without associated test class. I will keep an eye on the phenomenom and try to analyse it further. I would love to hear you view and experience on the matter.

Communication through Tests – a larger experiment

We evaluated our ability to communicate through tests in a two-day experiment and gathered some interesting results.

triangulatorFor us, automated tests are the hallmark of professional software development. That doesn’t mean that we buy into every testing fad that comes along or consider ourselves testing experts just because we write some tests alongside our code. We put our money where our mouth is and evaluate our abilities in writing effective tests.

One way to measure the effectiveness of tests is to try to “communicate through tests”. One developer/team writes code and tests for a given specification. Another team picks up the tests only and tries to recreate the production code and infer the specification. The only communication between the two teams happens through the tests.

We performed a small experiment with two teams and one day for both phases and blogged about it. The results of this evaluation was that unit tests are a good medium to transport specification details. But we got a hint that problems might be bigger when the code was less arithmetic and more complex. As most of our development tasks are rather complex and driven by business rules instead of clean mathematical algorithms, we wanted to inspect further.

Our larger experiment

So we organized a bigger experiment with a broader scope. Instead of two teams, we had three teams. We ran the phases for eight instead of two hours, essentially increasing the resulting code size by a factor of 3. The assignments weren’t static, but versioned – and the team only knew the rules of the current version. When a team would reach a certain milestone, more rules would be revealed, partly contradicting the previous ruleset. This should emulate changing customer requirements. And to provide the ability to retrospect on the reconstruction phase, we recorded this phase with a screencast software (we used the commercial product Debut Video Capture), capturing both inputs and conversation by using headsets for every developer.

The first part of this experiment happened in late January of 2013, where all teams had one day to produce production and test code. This was a day of loud buzz in our development department. The second part for the reconstruction phase was scheduled for the middle of February 2013. We had to be a bit more quiet this time to increase the audio recording quality, but the developers were humming nonetheless.

Here are some numbers of what was produced in the first session:

  • Team 1: 400 lines of production code, 530 lines of test code. 8 production classes, 54 tests. Test coverage of 90.6%.
  • Team 2: 576 lines of production code, 655 lines of test code. 17 production classes, 59 tests. Test coverage of 98.2%.
  • Team 3: 442 lines of production code, 429 lines of test code. 18 production classes, 37 tests. Test coverage of 97.0%.

The reconstruction phase was finished in less than five hours, partly because we stuck very close to the actual tests with little guesswork. When the tests didn’t enforce a functionality, it wasn’t implemented to reveal the holes in the test coverage. This reduced the amount of production code that had to be written. On the flipside, every team got lost once on the way, loosing the better part of an hour without noticeable progress.

The results

After all the talk about the event itself, let’s have a look at our results of the experiment:

  • The recording of the reconstruction phase was a huge gain in understanding the detailed problems. We even discussed recording the construction phase too to capture the original design decisions.
  • Every decision on unclear terms from the original team lead to “blurry” tests that didn’t guide the reconstruction team as good as the “razor-sharp” tests did.
  • You could definitely tell the TDD tests from the “test first” tests or even the tests written “immediately after”. More on this aspect later, but this was our biggest overall take-away: The quality of the tests in terms of being a specification differed greatly. This wasn’t bound to teams – as soon as a team lost the TDD “drive”, the tests lost guidance power.
  • Test coverage (in terms of line coverage or conditional coverage) means nothing. You can have 100% test coverage and still suffer from severe plot holes in your tests. Blurry tests tend to increase the coverage, but not the accountability of tests.
  • In general, we were surprised how little guidance and coverage most tests offered. The assignments included some obvious “testing problems” like dealing with randomness and every team dealt with them deliberately. Still, these were the major pain points during the reconstruction phase. This result puts our first small experiment a bit into perspective. What works well with small code bases might be disproportionally harder to achieve when the code size scales up. So while TDD/tests might work sufficiently easy on a small task, it needs more attention for a larger task.

The biggest problem

When talking about “plot holes” from the tests, let me give you a detailed example of what I mean. The more useless tests suffered from a lack of triangulation. In geometry, triangulation is the process of determining the location of a point by measuring several angles to it from known points. When writing tests, triangulation is the effort to “pinpoint” or specify the implementation with a set of different inputs and required outputs. You specify enough different tests of the same functionality to require it being “real” instead of a dummy implementation. Let’s look at this test:

@Test
public void parsesUserInput() {
  assertThat(new InputParser().parse("1 3 5"), hasItems(1, 3, 5));
}

Well, the test tells us that we need to convert a given string into a bunch of integers. It specifies the necessary class and method for this task, but gives us great freedom in the actual implementation. This makes the test green:

public Iterable<Integer> parse(String input) {
  return Arrays.asList(1, 3, 5);
}

As far as the tests are concerned, this is a concise and correct implementation of the required functionality. And while it is obvious in our example that this will never be sufficient, it oftentimes isn’t so obvious when the problem domain isn’t as familiar as parsing strings to numbers. But to complete my explanation of test triangulation, let’s consider a more elaborate implementation of this test that needs a lot more work on the implementation side (especially when developed in accordance with the Transformation Priority Premise by Uncle Bob and without obvious duplication):

@Test
public void parsesUserInput() {
  assertThat(new InputParser().parse("1 3 5"), hasItems(1, 3, 5));
  assertThat(new InputParser().parse("1 2"), hasItems(1, 2));
  assertThat(new InputParser().parse("1 2 3 4 5"), hasItems(1, 2, 3, 4, 5));
  assertThat(new InputParser().parse("1 4 5 3 2"), hasItems(1, 2, 3, 4, 5));
  assertThat(new InputParser().parse("5 4"), hasItems(4, 5));
  assertThat(new InputParser().parse("5 3"), hasItems(3, 5));
}

Maybe not all assertions are required and maybe they should live in different tests giving more hints in their names, but you get the idea: Making this test green is way “harder” than the initial test. Writing properly triangulated tests is one of the immediate benefits of Test Driven Development (TDD), as for example outlined nicely by Ray Sinnema on his blog entry about test-driving a code kata.
Our tests that were written “after the fact” often lacked the proper amount of triangulation, making it easier to “fake it” in the reconstruction phase. In a real project setting, these tests would allow for too much implementation deviation to act as a specification. They act more as usage examples and happy path “smoke” tests.

Our benefits

While this experiment doesn’t fulfill rigid academic requirements on gathering data, it already paid off greatly for us. We’ve examined our ability to express our implementations through tests and gathered insight on our real capabilities to use test-driven methodologies. Being able to judge relatively objectively on the quality of your own tests (by watching the reconstruction phase’s screencast) was very helpful. We now know better what skills to improve and what to focus on during training.

Where to go from here?

We plan to repeat this experiment with interested participants as a spare-time event later this year. For now and ourselves, we have gathered enough impressions to act on them. If you are interested in more details, drop us a note. We could publish only the tests (for reconstruction), the complete code or even the screencasts (albeit they are somewhat long-running). Our participants could elaborate their impressions in the comment section, if you ask them.
We are very interested in your results when performing similar events, like Tomasz Borek did this month in Krakow, Poland. We found his blog entry about the event to be very interesting. We definitely lacked the surprise element for the teams during the event.

An experiment about communication through tests

How effectively communicates our test code? We wanted to know if we were able to recreate a software from its tests alone. The experiment gained us some worthwile insights.

lrg-668-wuerfelRecently, we conducted a little experiment to determine our ability to communicate effectively by only using automatic tests. We wanted to know if the tests we write are sufficient to recreate the entire production code from them and understand the original requirements. We were inspired by a similar experiment performed by the Softwerkskammer Karlsruhe in July 2012.

The rules

We chose a “game master” and two teams of two developers each, named “Team A” and “Team B”. The game master secretly picked two coding exercises with comparable skill and effort and briefed every team to one of them. The other team shouldn’t know the original assignment beforehands, so the briefings were held in isolation. Then, the implementation phase began. The teams were instructed to write extensive tests, be it unit or integration tests, before or after the production code. The teams knew about the further utilization of the tests. After about two hours of implementation time, we stopped development and held a little recreation break. Then, the complete test code of each implementation was transferred to the other team, but all production code was kept back for comparison. So, Team A started with all tests of Team B and had to recreate the complete missing production code to fulfill the assignment of Team B without knowing exactly what it was. Team B had to do the same with the production code and assignment of Team A, using only their test code, too. After the “reengineering phase”, as we called it, we compared the solutions and discussed problems and impressions, essentially performing a retrospective on the experiment.

The assignments

The two coding exercises were taken from the Kata Catalogue and adapted to exhibit slightly different rules:

  • Compare Poker Hands: Given two hands of five poker cards, determine which hand has a higher rank and wins the round.
  • Automatic Yahtzee Player: Given five dice and our local Yahtzee rules, determine a strategy which dice should be rerolled.

There was no obligation to complete the exercise, only to develop from a reasonable starting point in a comprehensible direction. The code should be correct and compileable virtually all the time. The test coverage should be near to 100%, even if test driven development or test first wasn’t explicitely required. The emphasis of effort should be on the test code, not on the production code.

The implementation

Both teams understood the assignment immediately and had their “natural” way to develop the code. Programming language of choice was Java for both teams. The game master oscillated between the teams to answer minor questions and gather impressions. After about two hours, we decided to end the phase and stop coding with the next passing test. No team completed their assignment, but the resulting code was very similar in size and other key figures:

  • Team A: 217 lines production code, 198 lines test code. 5 production classes, 17 tests. Test coverage of 94,1%
  • Team B: 199 lines production code, 166 lines test code. 7 production classes, 17 tests. Test coverage of 94,1%

In summary, each team produced half a dozen production classes with a total of ~200 lines of code. 17 tests with a total of ~180 lines of code covered more than 90% of the production code.

The reengineering

After a short break, the teams started with all the test code of the other team, but no production code. The first step was to let the IDE create the missing classes and methods to get the tests to compile. Then, the teams chose basic unit tests to build up the initial production code base. This succeeded very quickly and turned a lot of tests to green. Both teams struggled later on when the tests (and production code) increased in complexity. Both teams introduced new classes to the codebase even when the tests didn’t suggest to do so. Both teams justified their decision with a “better code design” and “ease of implementation”. After about 90 minutes (and nearly simultaneous), both teams had implemented enough production code to turn all tests to green. Both teams were confident to understand the initial assignment and to have implemented a solution equal to the original production code base.

The examination

We gathered for the examination and found that both teams met their requirements: The recreated code bases were correct in terms of the original solution and the assignment. We have shown that communication through only test code is possible for us. But that wasn’t the deepest insight we got from the experiment. Here are a few insights we gathered during the retrospective:

  • Both teams had trouble to effectively distinguish between requirements from the assignment and implementation decisions made by the other team. The tests didn’t transport this aspect good enough. See an example below.
  • The recreated production code turned out to be slightly more precise and concise than the original code. This surprised us a bit and is a huge hint that test driven development, if applied with the “right state of mind”, might improve code quality (at least for this problem domain and these developers).
  • The classes that were introduced during the reengineering phase were present in the original code, too. They just didn’t explicitely show up in the test code.
  • The test code alone wasn’t really helpful in several cases, like:
    • Deciding if a class was/should be an Enum or a normal class
    • Figuring out the meaning of arguments with primitive values. A language with named parameter support would alleviate this problem. In Java, you might consider to use Code Squiggles if you want to prepare for this scenario.
  • The original team would greatly benefit from watching the reengineering team during their coding. The reengineering team would not benefit from interference by the original team. For a solution to this problem, see below.

The revelation

One revelation we can directly apply to our test code was how to help with the distinction between a requirement (“has to be this way”) and implementator’s choice (“incidentally is this way”). Let’s look at an example:

In the poker hands coding exercise, every card is represented by two characters, like “2D” for a two of diamonds or “AS” for an ace of spades. The encoding is straight-forward, except for the 10, it is represented by a “T” and not a “10”: “TH” is a ten of hearts. This is a requirement, the implementator cannot choose another encoding. The test for the encoding looks like this:


@Test
public void parseValueForSymbol() {
  assertEquals(Value._2, Value.forSymbol("2"));
  [...]
  assertEquals(Value._10, Value.forSymbol("T"));
  [...]
  assertEquals(Value.ACE, Value.forSymbol("A"));
}

If you write the test like this, there is a clear definition of the encoding, but not of the underlying decision for it. Let’s rewrite the test to communicate that the “T” for ten isn’t an arbitrary choice:


@Test
public void parseValueForSymbol() {
  assertEquals(Value._2, Value.forSymbol("2"));
  [...]
  assertEquals(Value.ACE, Value.forSymbol("A"));
}

@Test
public void tenIsRequiredToBeRepresentedByT() {
  assertEquals(Value._10, Value.forSymbol("T"));
}

Just by extracting this encoding to a special test case, you emphasize that you are aware of the “inconsistency”. By the test name, you state that it wasn’t your choice to encode it this way.

The improvement

We definitely want to repeat this experiment again in the future, but with some improvements. One would be that the reengineering phases should be recorded with a screencast software to be able to watch the steps in detail and listen to the discussions without the possibility to interact or influence. Both original teams had great interest in the details of the recreation process and the problems with their tests. The other improvement might be an easing on the time axis, as with the recorded implementation phases, there would be no need for a direct observation by a game master or even a concurrent performance. The tasks could be bigger and a bit more relaxed.

In short: It was fun, challenging, informative and reaffirming. A great experience!