Writing Integration Tests – Heuristics of What to Aim For

When software has grown over some years, or has split into multiple moving parts, or both, chances are that a customer reports erroneous behaviour that cannot be represented in unit tests. You might then need an integration test, and while these have the advantage of more closely resembling the story of what your customer wanted to do and expected to see, they also come with a larger degree of freedom – or rather, a larger question mark – about what they actually should know or assume.

And when you are not the original author of said code base, it can be especially tricky to strike the right chords; so with this post, I try to lay out a few heuristics for finding the right scope, the right aim, for such a test.

Now, first of all, try not to aim needlessly large. As in: when possible, prefer unit tests, and if you can’t, be intentional about writing an integration test. As Daniel laid out in a series of blog posts some years ago, you can think of a unit test as a theater play, and using these metaphors helps in making your test clearly convey what it is about. The largest impact lies in identifying what the “target” is. It has to contain the code under test (also “CUT”), and it should be very clear who that is (think of a simple stage play, not some avantgarde David Lynch mystery thriller). To perpetuate that idea: unit tests are the ones that define the character of your target – who is our hero, in their inner core, regardless of the context?

As you knew beforehand, totally™ – writing the test in advance gives you a clearer overall understanding of your actual problem. Any kind of test. If that did not happen, for whatever reason, you might either switch to a fresh branch now and do so – or, at least, treat any test code as if it could have been written before.
I stress this because it can be too tempting to focus on specific implementation details, but if some internals are rightfully encapsulated, hidden from other production code, then your tests should not see these things either.

Do not change any field or method visibility in order to make your test work.
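
To make that rule concrete, here is a minimal sketch (the Invoice class and its methods are made up for illustration): instead of widening the visibility of an internal detail, let the test observe its effect through the public interface.

// Tempting, but wrong: making a private rounding detail accessible
// just so the test can assert it directly. Instead, assert the
// observable behaviour that the detail produces:
@Test
public void rounds_line_totals_to_full_cents() {
    Invoice target = new Invoice();
    target.addLine("consulting", new BigDecimal("10.005"));
    assertThat(target.total()).isEqualByComparingTo(new BigDecimal("10.01"));
}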

More “freedom” is deceptive. Integration tests describe the behaviour of your target-hero within a given backstory, but how much backstory do you need? And can you keep up the storytelling? Imagine yourself as a cliché Gen Z TikTok user who ran short of Methylphenidate – would you still follow the story after a description of that one tree over here, these cars over there, who owns them, and the general idea of the planet where all that takes place?

But in said codebase grown over years, it might still be hard to aim right. A “quick” way out (quick in terms of the time spent writing the test) would be to mock away everything you need to construct your target; but you shouldn’t do so at first sight.

For every argument, every dependency that your situation needs, get at least a coarse understanding of what it is used for, and of why a version of that entity could not exist without that dependency. Take that time. This can feel like a detour, but it either reveals to you how the backdrop actually looks, or it might even give you a hint of needless coupling in the existing code base.
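
To sketch what that understanding buys you (all names here are hypothetical, and I assume Mockito as the mocking library): once you know that a dependency is only used for one narrow purpose in your scene, a mock becomes an honest, clearly scoped stand-in instead of a blind guess.

// Understood after reading the code: the gateway is only used to announce
// archived projects, so a mock that records that one call is all the
// backstory this scene needs.
MailGateway gateway = mock(MailGateway.class);
ProjectService target = new ProjectService(gateway);

target.archive("legacy project");

verify(gateway).notifyArchived("legacy project");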

Either way, does every single line of the test reveal its purpose?
(You should apply that question to any line of production code as well, but the cases where it’s just not possible are more numerous there than they should be within tests.)

And then there’s the point of deduplication. As stated, tests may come with a bit of scaffolding. It’s not illegal. But consider that any behaviour covered by an integration test probably has some variations – stuff happens in a different order – and when several tests share the same setup, this encourages extracting a “setup helper” method. You should even do so (imo) if you clearly want to say “in this scene, everything is the same as yesterday, BUT THEN…”. If you do, keep one thing in mind: that setup function needs to show the same clarity that is required of the tests themselves.
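
A minimal sketch of such a helper (the Order class and its methods are made up for illustration):

// The shared backstory of all these scenes, named after what it provides:
private Order orderWithThreeItemsReadyForShipping() {
    Order order = new Order("typical order");
    order.add("item A");
    order.add("item B");
    order.add("item C");
    order.markReadyForShipping();
    return order;
}

@Test
public void cancelling_a_ready_order_releases_its_items() {
    Order target = orderWithThreeItemsReadyForShipping();
    // …BUT THEN the customer cancels:
    target.cancel();
    assertThat(target.releasedItems()).hasSize(3);
}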

If your setup functions exceed trivial logic to the point where it feels like you would need an extra test just to prove that your test setup works, stop and rethink.

This list surely is incomplete, and as stated, these are rough “heuristics”.
But if you have a nontrivial case to test in a project that is not completely in your brain right now, and you want to somehow draw the line between “what should I (or the test) need to know” and “what is to be mocked or otherwise left out” – maybe these thoughts can help you too.

And if you disagree, I will be very glad to consider your stage plays.

Don’t test details from a distance

The concept described in this blog entry has evoked a lot of different metaphors and descriptions from our team when we discussed it. So don’t take my words or thoughts on it as the one true way to talk about it – the concept of the “testing gap” or the distance between the code under test and the test’s vantage point.

Before I describe my metaphor for it with some weird visuals, let’s look at some code:

public Budget(String denotation, int maximumHours) {
    this.denotation = denotation;
    this.maximumHours = maximumHours;
    // new budgets always start with all hours available:
    this.currentHours = maximumHours;
}

This is the constructor for an entity, a domain class that represents a budget of work hours that gets slowly used up as you work on the customer’s project. There is not much going on in this code except one little detail of the domain: new budgets always start fully “filled up”, in that the currentHours are set to the maximumHours. You can’t create a budget that is already half empty with this code.

Such a domain concept or “business rule” requires a test that ensures it is still in place:

@Test
public void has_initially_current_hours_set_to_maximum() {
    Budget target = new Budget(
        "current is maxed",
        100
    );
    assertThat(target.getMaximumHours()).isEqualTo(100);
    assertThat(target.getCurrentHours()).isEqualTo(100);
}

This is a fairly boring unit test that ensures that freshly created budgets have all their working hours still available.

In our example, the entity lives in the core of a web application that provides an endpoint to create new budgets. We have a test for the endpoint, of course:

@Test
public void stores_new_budget() throws Exception {
    this.web.perform(
        post("/budgets")
        .contentType(MediaType.APPLICATION_JSON)
        .content("{\"denotation\": \"new budget\", \"maximumHours\": 300}")
    )
    .andExpect(status().isOk())
    .andExpect(content().json("{\"denotation\": \"new budget\", \"maximumHours\": 300, \"currentHours\": 300}"));
}

You can shudder at the code formatting or the necessity to escape your JSON data into inscrutability. At least the second problem more or less disappears with current Java versions.
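
To illustrate (as a small aside): since Java 15, a text block can hold the same request body without any escaping:

String requestBody = """
    {"denotation": "new budget", "maximumHours": 300}
    """;

But that’s not the point today. The point is that this is effectively the same test as above, but with a gap in between.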

If you wrote just the second test, your code coverage metrics would probably not decrease. Your business rule would still be tested. So why write the first test if it adds nothing to the safety net?

This can be explained with the idea that there is a considerable “testing gap” between the second test and the business rule. It covers the entity’s constructor code and states explicitly that the currentHours property should be set to the same value as the maximumHours property. But it also defines the communication protocol as being HTTP, the data format as being JSON, and travels through code that finds an “endpoint” for the given URL, maps the given JSON to a constructor call and serializes the resulting object as JSON back to the requester. That is a lot of padding just to test the constructor’s third line.

The first test has virtually no testing gap. It knows nothing about the web, data formats or whatever else the application consists of. It just looks at the entity and its behaviour in isolation.

There are perfectly valid reasons to write the second test, but it should not be the only test that ensures the business rule in our example. The second test “sees too much” from its vantage point to pay attention to a little detail like the business rule.

In case you didn’t quite get the concept of the testing gap yet, here is how I imagine it in my head: if your code under test is a mystery box (really try to picture a shoebox made of cardboard that rattles when you move it), then your test is a big floating eye that uses little cracks and holes in the box to get a quick peek inside. If the state is exhibited via getter methods like in our example, the eye checks the internal state of the box by looking at the gauges that are placed on its sides.

If your testing gap is small, the eye hovers up close to the box. It doesn’t see anything else, but it notices every detail of the box.

If you have some testing gap in your test code, the eye is placed at a considerable distance from the mystery box. There are other important things between them. The gauges aren’t directly readable. The eye uses indirect clues and reflections to gather its information. Every time something in the gap’s setup changes, the testing eye needs to adjust its gaze.

Which brings us to the conclusion of this metaphor: if you want things to be looked at in detail, write tests without a testing gap. Otherwise, your tests will have increased execution times, exhibit a strange imprecision in their message (“something in these dozens of classes has changed and it might not even be relevant”) and require frequent adjustments that are not related to their testing story.

Or, said with the words of my imagination: place your testing eye directly at the entrance of your target’s hideout.

You’ve probably thought about this concept already, in your own terms and metaphors. Can you try to describe it in a comment? Just for the name, we discussed “testing distance”, “testing height”, “testing gap” and others. Perhaps we like your description even better.

Look at the automated tests to diagnose a project’s ailments

If you have to evaluate a new software project, draw the outline of the existing test types. If it doesn’t look like a pyramid, you might be in trouble. Here are some common non-pyramid outlines and what they mean.

A cornerstone of modern software development is developer testing. That means that developers are the primary authors of automated test code. In theory, that is a good thing and might look like the quality assurance department will soon be out of work. In practice, we as a profession have tried for nearly twenty years to install a culture of developer testing in our work and still end up with software projects that feature no automated tests at all (side note: JUnit 1.0 was released in February of 1998).

What we know about automated tests

One piece of common understanding about developer testing is the test pyramid. Let’s quickly recap what we know about it. There are different kinds of automated tests, and the test pyramid differentiates three of them:

  • Acceptance tests or UI tests are the heaviest type of automated test. They operate on the software from the outside, with the means of a real user and try to assert that real use cases are accomplishable.
  • Integration tests often use several parts of the system in a test scenario that asserts the correct collaboration of the parts. Integration tests may take some time to come to a conclusion and utilize real hardware like network or disks.
  • Unit tests tend to be small and quick and focus on a particular aspect of a “unit” like a class or entity aggregate. Their reach into the system should be short and might be forcefully restricted by employing mocks.

These three types, the A, I and U of automated tests, should come in different numbers. A good rule of thumb is that for every acceptance test, there might be up to one thousand unit tests. If you draw the quantities as areas, they appear in the form of a pyramid: a small top of acceptance tests rests on a broader seating of integration tests, which in turn relies on a groundwork of many unit tests. A healthy test pyramid looks like this:

Take this picture as a guideline, not as an absolute scale. But be sure to count your different test types from time to time.

Outlining the tests

This is actually one of the first things I do when I get introduced to a new and unknown code base. This happens quite often when I do consulting work for existing development teams. Have a look at the automated tests, determine their types and count their numbers. If the result resembles anything close to the test pyramid, you’ve got a chance. If the resulting shape looks different, you might find this blog entry useful:

The Tower

If you have a hard time finding any tests (because there are none), or you find only some half-assed attempts at a meaningful automated test suite, you are looking at a tower project. The tower is rather small in diameter; in the case of absent tests, it is nothing more than a thin vertical line (the “stick”). If, on the other hand, you find a solid number of tests for every type, you’ve found a “block” project. Block projects usually don’t have a problem, just a history of test effort migration, either from unit to acceptance tests or, more commonly, in the other direction. If you find a block, you are fine.

The tower, though, is a case of neglect. The project team might have started serious efforts to automate their tests, but got demotivated by intrinsic or extrinsic influences and abandoned the tests soon after their creation. Nobody has looked after them since, and the only reason they still pass green is that they didn’t really test anything to begin with, or only cover an area of the system that is as finished as it is boring. Topics like user management or utility classes are usually the first and only things that get tests in a tower scenario.

Don’t get me wrong, the tower indicates the absence of tests, but not the absence of willingness to write automated tests, unless the tower is really a stick. A team willing to invest in automated tests may only lack knowledge and coaching about the topic. Be sure to lead them bottom-up (unit tests first), though.

The Egg

If you’ve categorized and counted the tests and couldn’t find many acceptance or unit tests, you’ve found an egg. The egg consists mostly of integration tests that may lean into unit testing territory by asserting the smallest bits of functionality here and there (often embedded in an overarching test storyline) or dip their toes into GUI-based testing by asserting presentation-specific properties of widget objects. While these tests provide ample coverage for the system, they also tie application logic and presentation details together and don’t help to separate domain code from the use cases.

The project team is probably proud of its test coverage and doesn’t see any value in differentiating the automated test types, because “every test improves the situation”. This blindness to test types is the core problem. It may be cured with training and coaching (I’ve found the ATRIP rules to be particularly effective in distinguishing integration and unit tests), but the symptoms, especially the lack of separation of concerns, have to be mitigated soon, too.

One way to start there is to break the tests down into their integration and their unit test parts. You can work from assertion to assertion and ask: is this necessary to ensure the current use case? If not, extract a new unit test focussed on only this one assertion.
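
Echoing the Budget example from the previous section, such an extraction could look like this sketch: the use-case-level expectations stay in the integration test, while the domain detail gets its own focussed unit test.

// Extracted from an endpoint test that asserted currentHours only in
// passing – now the business rule has its own dedicated spotlight:
@Test
public void has_initially_current_hours_set_to_maximum() {
    Budget target = new Budget("new budget", 300);
    assertThat(target.getCurrentHours()).isEqualTo(300);
}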

As soon as you add a pedestal of unit tests to your egg, you are well on your way to a healthy test pyramid.

The Ice Cream Cone

This is the most fearsome automated test outline in existence, even more dramatic than the stick. Usually, the project team is really enthusiastic about writing tests, or at least follows orders to do so, but cannot test parts of the application in isolation. A really tragic case was a complex system so entangled with its database, through countless stored procedures that contributed to the application logic, that it was hopeless to think about tests without the database. And because every automated test had to start the whole system, including the database, there was really no need to differentiate between application logic and presentation logic. It all became a Gordian knot of dependencies that enforced the habit of writing elaborate automated GUI-based tests to test the smallest logic bits deep inside the core. It felt like eating single rice grains with overly long, flimsy wooden chopsticks that would break often.

The ice cream cone is problematic because the project team needs to realize that their effort was misguided and that the tests are all telling the bitter truth: the system’s architecture isn’t fit for proper automated tests. It’s not the tests, it’s you (or your architecture)! Nobody wants to hear that, and even more so, nobody wants to untangle the mess (without the help of a proper safety net consisting of automated tests). Pinning tests are probably helpful in this scenario.
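
For illustration: a pinning test is nothing more than a test that nails down whatever the system does today, right or wrong, so that a refactoring becomes observable (the LegacyPriceEngine here is a made-up example):

@Test
public void pins_current_discount_calculation() {
    LegacyPriceEngine target = new LegacyPriceEngine();
    // The expected value is simply what the engine returns today – this
    // test guards against unintended change, not against wrongness:
    assertThat(target.discountFor("VIP", 12)).isEqualTo(17.5);
}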

But you need to turn the test pyramid around, or the project team will suffocate under the overly costly test tax while technical debt keeps increasing.

Epilogue

Please keep in mind that it’s not a problem in itself that your project doesn’t have a normal test pyramid. It’s great that you have automated tests at all! But your current test type distribution might not be as effective as possible, might be more expensive than necessary, and might not be the right automated test setup for your development goals.

What are your stories with automated test setups? Care to share them with us in the comments?