Make yourself comfortable

If you have to add a new feature to an existing code base, you’ve likely already experienced an uncomfortable truth: Nobody has thought about your use case. Nothing in the existing code base fits your goals. This isn’t because everybody wanted you to fail, but because your new feature is in fact brand new to the software and responsible software developers stop working on a code as soon as their work is done (while obeying the project’s definition of done).

So you try to shoehorn your functionality into the existing code. It’s not neat, but you get it working. Are you done? In my opionion, you haven’t even started yet. Your first attempt to combine your idea of the best implementation of your functionality and the existing system will be cumbersome and painful. Use it as a learning experience how the system behaves and throw it away once you get a working prototype. Yes, you’ve read that right. Throw it away as in undo your commits or changes. This was the learning/exploration phase of your implementation. You’ve applied your idea to reality. It didn’t work well. Now is the time to apply reality to your idea. Commence your second attempt.

For your second attempt, you should make use of your refactoring skills on the existing code. Bend it to your anticipated and tried needs. And once the code base is ready, drop your new feature into the new code “socket”. Your work doesn’t need to be cumbersome and painful. Make yourself comfortable, then make it work.

Here is a example, based on a real case:

An existing system was in development for many years and worked with a lot of domain objects. One domain object was a price tag that looked something like this:

interface PriceTag {
    PriceCategory category();
    TaxGroup taxGroup();
    Euro nettoAmount();
    Product forProduct();
}

Well, it was a normal domain object giving back other normal domain objects. The new feature should be an audio module that could read price tags out loud. The team used a text-to-speech synthesizing library that takes a string and outputs an audio stream. No big deal and pretty independent from the already existing code base.

But the code that takes a price tag and converts it into a string, aka the connection point between the unbound library code and the existing system, was ugly and undecipherable:

String priceTagToText(PriceTag price) {
    return price.forProduct().getDenotation()
        + " for only "
        + CurrencyFormatter.format(price.nettoAmount())
        + " with "
        + String.valueOf(price.taxGroup().percentage())
        + " % VAT in the "
        + price.category().getDenotation()
        + " section.";
}

This is how it looks if somebody tries to combine two building blocks that aren’t meant for each other. To test this method, you’ll have to mock deep into the domain objects.

If two building blocks aren’t matching naturally, maybe it’s an idea to add some lubrication code between them. This code isn’t exactly doing anything newfound, but adds a requirement seam that point towards the existing system:

interface ReadablePriceTag {
    String denotation();
    String netto();
    String vatPercentage();
    String category();
}

You can probably already see where this is heading. Just in case you cannot, I will take you through all parts of the code.

First we can write a priceTagToText() method that reads a lot nicer:

String priceTagToText(ReadablePriceTag price) {
    return price.denotation()
        + " for only "
        + price.netto()
        + " with "
        + price.vatPercentage()
        + " VAT in the "
        + price.category()
        + " section.";
}

The second and complementary part is the implementation of the ReadablePriceTag interface that is given a PriceTag object and translates the data for the new methods:

class PriceTagBasedReadablePriceTag {
    private final PriceTag price;

    PriceTagBasedReadablePriceTag(PriceTag price) {
        this.price = price;
    }

    @Override
    public String denotation() {
        return this.price.forProduct().getDenotation();
    }

    @Override
    public String netto() {
        return CurrencyFormatter.format(this.price.nettoAmount());
    }

    @Override
    public String vatPercentage() {
        return String.valueOf(
                this.price.taxGroup().percentage()) + " %";
    }

    @Override
    public String category() {
        return this.price.category().getDenotation();
    }
}

Basically, you have a lot of existing code that is using PriceTag objects and some new code that wants to use ReadablePriceTag objects. The PriceTagBasedReadablePriceTag class is the connector between both worlds (at least in one direction). We can definitely argue about the name, but that’s a detail, not the main point. The main points of all this effort are two things:

  1. The new code does not suffer in quality and readability from decisions made at a different time in a different context.
  2. The code clearly models these contexts. If you are aware of Domain Driven Design, you probably see the “Bounded Context” border that crosses right between PriceTag and ReadablePriceTag. The PriceTagBasedReadablePriceTag class is the bridge across that border.

If you express your context borders explicitely like in this example, your code reads fine on any side of the border. There is no notion of “old and fitting” and “new and awkward” code. It seems like additional work and it surely is, but is pays off in the long run because you can play this game indefinitely. A code base that gets more muddied and forced with time will reach a breaking point after which any effort needs knowledge in archeology and cryptoanalysis.

So, my advice boils down to one thing: Make yourself comfortable when adding new code to an existing code base.
And then, think about your type names. PriceTagBasedReadablePriceTag is most likely not the best name for it. But that’s a topic for another blog post. What would be your name for this class?

The Bird in the Jungle

When I was a young boy and attending school, there were art classes that I kind of liked. You either were told cool stories about lunatic painters or painted cool pictures by yourself. I still remember one particular assignment that sticks out to me: draw a bird!

In case you ask what this blog entry is all about: It is about project management and expectation management. It will stray a little bit into process territory. But mostly, it’s about a picture of a bird in a jungle.

When the teacher told us the task for the next weeks, my imagination ran wild. Draw a bird! I envisioned a stork-like bird with long legs and a long beak, fishing by a river in a tropical setting. The river would flow through a full-blown jungle with lots of plants and insects and things to discover. The bird would be in the center of the picture, drawing your attention to its eyes as it hovered over a fish in the water, ready for the strike that would provide its supper. But if your eyes follow the water, you would see that the bird is just a piece in a complex ecosystem with lots of untold stories. My pen would tell some of those stories!

We had rougly two hours of art class, so I had to move fast. Because the bird was the essential piece of my picture, I had to draw it last, on top of the background jungle and aligned with all the secondary motives. In the first hour, the jungle grew on my sheet of paper, following the outlined river. If I grew weary of drawing plant leaves, I added a bug or a little snake. Somewhere, a happy little jungle squirrel peaked through the green (Ok, that’s not true. I didn’t know about Bob Ross back then and my squirrel was probably not happy).

The thing was, the art class was suddenly over. My jungle wasn’t completed yet. Worse, the entire river, fish and bird were still missing. The teacher asked me if I misunderstood the assignment. What a joke! Of course I understand, my bird will be majestic – once it is finished next week.

I peeked at the pictures of my classmates. Most of them had some bird-like outlines, a beak or a foot. Nobody had finished their drawing. As far as I was concerned, I was well within expectations. Many classmates took their picture home to work on it in the afternoon. I waited until next week, I was certain about my plan.

Next week came and a sudden realization followed: Most of my classmates had indeed worked on their bird at home. Their drawings were energetic, strong and defined. They looked like black-and-white copies of bird photographies and nothing like the awkward lines from last week. They had traced an image through their sheet and called it a day. Two classmates had even traced the same image and produced essentially identical birds. Those cheaters. Not me!

When half the class had already turned in the assignment, the pressure began to rise for the strugglers. My plans were in jeopardy and I skipped several trees and mountains in the outskirts of the picture to save time. This also meant I had to fade out the picture at the edges to make it seem deliberately.

This art class went over so fast, I still hadn’t drawn my bird when the bell rang. The teacher wanted to collect our pictures, but I had to bargain for a week of work at home. With that much time at my disposal, I could implement my plans. I drew six evenings and half the weekend. My jungle was never right. My river wasn’t as dynamic as envisioned. But most of all, my bird was nowhere near majestic. It was a stork-like bird, but way too small. It didn’t hover over the fish, I had to make it dive into the water to even be near its prey. In the end, I had spend around 20 hours on a two-hour assignment that produced a drawing of a jungle that happened to have a hungry, desperate bird in the middle. Somewhere in the outskirts, a squirrel laughed.

The teacher gave me a mediocre grade out of pity. It was clear I didn’t copy from anywhere. But it was also clear as the water of the river in my jungle that I didn’t follow the instructions at all.

Looking back now, I can smile about the naive boy that set out to tackle a task that nobody asked for while getting lost and receiving an average grade for an enormous effort. You can probably already see that the return on investment was awful.

The interesting thing for me to see clearly now is that the project was doomed from the start. Mostly because I didn’t work on the teacher’s assignment, I worked on my own assignment. And my own assignment was much bigger than the original one, but the time budget was not. If you don’t see it as clearly, just count the words in the teacher’s assignment (“draw a bird” – 3 words) and just the nouns in my plan. Calculate one hour for every noun and you can see how this would never have worked. So this is the first take-away: Always make sure that you work on your customer’s project and not your own. If the customer didn’t use a specific noun in the mission statement, why did we include it in our project plan? In my story, the teacher never even mentioned a “background”. Just a bird. Draw it and be done. Go home and draw the jungle in your spare time.

The next thing that sticks out now for me is that I delayed the essential work (the bird) as much as possible. This is exactly the wrong approach if you want to act at least a little bit agile. Cover the essential things first and fill in the details later. In my today’s work, we call it the “risk first approach”. The most critical part comes first. It might not be pretty, it might not be complete, but it already works. Applying this to my story: If I had drawn the bird first (big and majestic in the middle), I could have called it quits anytime afterwards. Nobody asked for more and my teacher wouldn’t grade the background – only the bird.

The third thing is about expectation management: The teacher expected us to try, fail and cheat at home. Those clearly traced pictures matched the assignment better than my work. I just refused to fail and start over. I didn’t set out to match the expectations of my teacher, I wanted to fulfill my own expectations that were not grounded in reality. I even ignored the grading scheme of my teacher (my customer, but I couldn’t see it that way at that time) and got upset when all my effort just yielded a mediocre grade.

Today, I see all these things and can handle them successfully. As a young boy, those were foreign concepts for me. Between the two points in time, I acquired concepts, terms and words to reflect on the process. But I’m still learning. Perhaps you see something in the story that I haven’t seen yet? Tell me in a comment below!

Nevertheless, I often think about my bird in the jungle as a reminder to keep on track.

Your most precious resource

When I was a young student, living on my own for the first time, my most scarce resource seemed to be money. Money’s too tight to mention was (and probably still is) a motto that every student could understand. So we traded our time for money and participated in experiments and underpaid student assistant jobs.

Soon after I graduated, money began to accumulate. I have a rather frugal lifestyle, so my expenses didn’t suddenly surge. Instead, my perception of time and money shifted: Money isn’t the bottleneck (anymore), time is. Suddenly, time was much more valueable than money and I would gladly pay money if that meant some hours of additional leisure time or one less problem to tend to. It seems that time is the most precious thing there is.

The traditional economic wisdom supports this idea: “Time is money” is true, but the reverse is not: “money is time” doesn’t cut it. The richest man on earth still only has as much time available to him as anybody else.

If time is the most precious resource, the drive to automation as a time-saving effort can be understood directly. Automation also reduces learning costs if you scale horizontally by parallelizing production.

But soon after I had enough money to optimize my time, I hit another resource bottleneck. Suddenly, I had more time on hand than attention to spare. It turns out that attention is the most valueable resource you can spend. It is just entangled enough with time that its hard to distinguish which runs out first. If you reflect a bit, it becomes obvious. The term “to pay attention” is pretty spot on.

Let me take up the thought of automation again, this time in the domain of software development, in the form of automated tests. Here, automation is not in its most profitable shape. You don’t gain much from scaling your tests horizontally. If you don’t change the code, it doesn’t matter if your tests run once or a thousand times in parallel, the result will be the same (except if you run hardware-dependent tests, but even then you probably don’t gain much after covering all hardware variations).

You also don’t gain much from scaling your tests vertically by making them run faster and faster. It sure helps to have them run continuously in the background (think of a user-modded compiler – look into Continuous Test Runners if you are interested), but after one test run per meaningful change, the profit hits a limit.

So why else is automated testing an economically sound practice? My take on it is delegated attention. You write a small software (your test) that augments your attention area onto code that probably fades from your own attention pretty soon. Automated tests provide automated attention in a sustainable manner (except for those tests that cry wolf for no good reason, those are attention sinks and should be removed from your portfolio). Because of the automation, this delegated attention never fades – even after many years, the test has a close eye on “its” code.

If you are a developer, you have automation and zero-cost copying (aka parallelization without upfront costs) intrinsically in your solution portfolio. Look for ways how to make money with those super-powers. Or even better, look for ways to save time. But if you want the best return on investment for your efforts, you should look for ways to expand your attention area.

Do you agree that attention is our most precious resource? What do you do to lower your attention expenses? Perhaps you have experience in the Ops/DevOps area that resonates with this thoughts? Share your opinion by commenting below or writing your own blog entry!

We will pay attention to you.

The ALARA principle in software engineering

The ALARA principle originates in radiation protection and means “As Low As Reasonably Achievable”. It means that you have to weigh the purpose of an action dealing with radiation against its disadvantages, like radiation damage or long-term risks. The word “reasonably” means that while some disadvantages are not avoidable, a practicable amount of protection should be in place to lower them. The principle calls for a balancing act: Not without safety measures, but don’t overextend your means by trying to achieve a safety level that isn’t helpful anymore.
To put the ALARA principle in practice with an example: You shouldn’t need a X-ray every time you go to the dentist, but given enough time since the last one and reasonable doubt about a tooth, the X-ray examination will benefit your dental (and overall) health more than if you deny it. It isn’t healthy by itself, but the information gained by it will be used to improve your health.

I learnt about the ALARA principle when my father (a nuclear physicist by heart) explained it to me in context of the current corona pandemic: Use protection like face masks and distance, but don’t stress yourself too much over that one time when you grabbed a pen in the postal office. While preparation and watchfulness is helpful, fear is detrimental to your mental health. And even the most resilient mind has bounded resources that can better be spent on constructive things instead of fear.

A fun fact from radiation protection is that at least three of the four main rules of protection can be applied to corona, too:

  • Distance yourself from the radiation source
  • Use appropriate protection gear
  • Avoid incorporation (keep the thing outside your body)
  • Limit your exposure time (this doesn’t fit as nicely, because the virus is probably not cumulative)

But how can we apply the ALARA principle to software engineering? I was instantly reminded about the “Thorough” rule of unit testing. In the book “Pragmatic Unit Testing” by Andy Hunt and Dave Thomas, the two original Pragmatic Programmers, good unit tests have to follow the ATRIP-rules. The T stands for “Thorough” which is often misinterpreted as “test everything at least twice”. In reality, the rule states that:

  • all mission critical functionality needs to be tested
  • for every occuring bug, there needs to be an additional test that ensures that the bug cannot happen again

The first thing that meets the eye is that the rule doesn’t define a bug as a failure of your testing effort. It takes a bug that probably happened in production and caused some damage as a motivation to strengthen your test coverage in that particular area. The second part of the rule calls for directed, well-aimed testing effort. It is easy to follow because it has a clear trigger: A bug happened, now you have to write a test.

The first part of the rule is more complicated: What is mission critical functionality? And what means “is tested thoroughly” in this context? And here, the ALARA principle can help us. The bug rate in the important parts of your code should be as low as reasonably achievable. “Reasonably achievable” is defined by the resources at your disposal (like time to market), your expertise in testing and the potential damage that could happen if something in your code goes wrong.

If the potential damage is high or even life-threatening, your reasonable effort should be much higher than if the most critical thing that happens is a 15 minute downtime while you restart the server. There are use cases where even 15 minutes mean subsequent damage, but most software is written for a more relaxed context.

I’ve always found the “Thorough” rule of good unit tests pleasant and comforting: If you made reasonable effort to test your most important code and write a test for every bug you or your users encounter, you can say that your bug rate is ALARA – “As Low As Reasonably Achievable”. And that is good enough for most cases.

What was your first thought when you heard about the ALARA principle? Tell us in the comment section!

Contain me if you can – the art of tunneling

This blog post briefly explains the basics of SSH tunneling and dives into the detail of reverse tunneling. If you are already familiar with all these concepts, this might be a boring read. If you heard “reverse tunneling” for the first time now, you’re about to discover the dark side of a network administrator’s might.

SSH basics

SSH (secure shell) is a tool to connect to a remote computer. In this blog post, the remote computer is always “the server” and your local computer is “the notebook”. That’s just nomenclature, in reality, each computer (or “endpoint”, which is a misleading term, as we will see) can be any device you want, as long as “the server” runs a SSH server daemon and “the notebook” can run a SSH client program. Both computers can live in different local area networks (LAN). As long as it is possible to send network packets to the specific port the SSH server daemon listens to and you have some means to authenticate on the server, you have free reign over both LANs by utilizing SSH tunnels.

The trick to bypass the remote LAN firewall is to have a port forwarded to the server. This allows you to access the server’s port 22 by sending packets to the internet gateway’s worldwide address on a port of your choice, maybe 333. So the forwarding rule is:

gateway:333 -> server:22

If you can comprehend this, you’ve basically grasped the whole foundation of SSH tunneling, even if it had nothing to do with SSH at all.

SSH tunneling basics

Now that we have a connection to the server, we can enjoy the fun of forward tunneling. You didn’t just connect to the server, you opened up every computer in the remote LAN for direct access. By using the SSH connection, you can define a SSH tunnel to connect to another computer in the remote LAN by saying:

localhost:8000 -> another:80

This is what you give your SSH client, but what actually happens is:

localhost:8000 -> gateway:333 -> server:22 -> another:80

You open a local network port (8000) on your notebook that uses your SSH connection over the gateway and the server to send packets to the “another” computer in the remote LAN. Every byte you send to your port 8000 will be sent to port 80 of the “another” computer. Every byte it answers will be relayed all the way back to your client, speaking to your local port 8000. For your client, it will look as if port 80 of “another” and port 8000 of your notebook are the same thing (if you don’t mind a small delay). And the best thing: Your client can be anywhere in your local LAN. Given the right configuration, “notebook2” in your LAN can speak with port 80 of “another” in the remote LAN by using your SSH tunnel. If you’ve ever played the computer game “Portal”, SSH tunneling is like wielding the portal gun in networks, but with unlimited portals at the same time. Yes, you can have many SSH tunnels open at once. That’s pretty neat and powerful, but it’s essentially still only half of the story.

SSH reverse tunneling basics

What you can also do is to open a network port on the server that acts like a portal in the other direction. You probably noticed the arrows in the tunnel descriptions above. The arrows really give a direction. If your tunnel is

localhost:8000 -> another:80

you can send packets from you to another (and receive their answer), but it doesn’t work the other way around. The software listening at another:80 could not initiate a data transfer to localhost:8000 on its own. So SSH tunnels always have a direction, even if the answers flow back. It’s like a door phone: You, inside the building, can initiate a call to the entrance door phone at any time. The guest in front of the door can speak back, but is not able to “call you back” once you hang up. He can not strike up a conversation with you over the phone. (In reality he still has the doorbell to annoy you. Luckily, the doorbell is missing in networking.)

SSH reverse tunnels change the direction of your tunnels to do exactly that: Give somebody on the remote LAN the ability to send network packets to the server that will get “teleported” into your local LAN. Let’s try this:

localhost:8070 <- server:8080

If you set up a reverse tunnel, any computer that can send packets to the port 8080 on the server will in fact speak with your notebook on port 8070. This is true for the server itself, any computer in the remote LAN and any computer with a SSH tunnel to port 8080 on the server. Yes, you’ve just discovered that you can send a data packet from your LAN to the remote LAN and have it reverse tunneled to a third LAN. And if you haven’t thought about it yet: Every “endpoint” in your three-LAN hop can be a portal to yet another hop, with or without your knowledge. This is why “endpoint” is misleading: It implies an “end” to the data transfer. That might just be your perception of the situation. Every endpoint can be another portal until it’s portals all the way down.

SSH reverse tunneling fun

You can use this back and forth teleporting of network traffic to create all kinds of complexity.

But in our case, we want to use our new might for something fun. Let’s imagine the following scenario:

  • On your notebook, you can start a SSH connection to the server. You can open forward and reverse tunnels.
  • On your notebook, you run a VNC (Virtual Network Computing) server. VNC is basically a simple remote desktop technology that lets you see the display content of a remote computer and control its keyboard and mouse. You running the server means somebody else can connect to your notebook using a VNC client and control it as if it was local (if you can ignore a small delay).
  • Somebody else (let’s call her Eve), on her notebook, can also start a SSH connection to the server. She can open forward and reverse tunnels.
  • On her notebook, she has a VNC client software installed. She can connect and control VNC server machines.

And now the fun begins:

  • VNC servers typically listen on network port 5900. In the client, this is known as “display 0”, which means that a client connects to the port “display + 5900” on the server machine.
  • Using this knowledge, you create a reverse tunnel for your SSH connection to the server: localhost:5900 <- server:5901. Notice how we changed the port. This is not necessary, but possible.
  • Now, anybody with direct access to your network port 5900 (in your LAN) or access to the network port 5901 on the server (in the remote LAN) can remote control your desktop.
  • Eve creates a forward tunnel for her SSH connection to the server or any other machine in the remote LAN: localhost:5902 -> server:5901. Notice the port change again. Also just a possibility, not a requirement. Notice also that Eve doesn’t need to connect to the same server as you. Her server just needs to be able to initiate a data transfer with the server. This might be through multiple portals that don’t matter in our story.
  • Now, Eve starts her VNC client and lets it connect to localhost:2 (2 is the display number that gets resolved to local port 5902).
  • Eve can now see your screen, control your inputs and even transfer files (depending on the capabilities of your VNC setup).

This is the point where LAN containment doesn’t really matter anymore. Sure, you cannot reign freely over any machine anywhere, but you can always establish a combination of reverse and forward tunnels to connect two computers as if they were connected by a cable. Which is, in fact, a good analogy to think about SSH tunnels: cables with two different connector types that span between computers.

Wrap up

You are probably aware about the immediate security risks that follow all that complexity. If you want to mitigate some of the risks, have a look at stunnel. It is effectively a more secure (and constrainable) way to create portals between LANs.

And if you just want to have fun: Please be aware that most of the actions here are probably not legal if done without consent of all participants. Stay safe!

The emoji checksum

This blog article is a story about an idea, not an actual project report. If it were a movie, it would feature the “based on real events” disclaimer.

The warehouse

Imagine a warehouse of a medium sized company. You would expect a medium sized warehouse, but in reality, the amount of items in this warehouse is nearly as big as in a big company. The difference might be the storage count of each item, but the item count is a big number. So big that each item has its own “item ID”, which is also used as the location identifier in the warehouse. Let’s see three (contrived) examples:

  • 211 725: Retaining screw, 8 mm
  • 413 114: Power transformer, 5 A
  • 413 115: Power transformer, 10 A

As you can see, different item groups have numbers with a large numerical distance while similar items are numerically close. This makes sense for the engineers using these numbers by muscle memory and for the warehouse navigation. If you read the first three digits, you already know where to turn to in the large hall. If you’ve arrived in the general area, the next three digits lead you to the exact storage space.

The operators

But that’s not how it works. The warehouse workers cannot read. Yes, you’ve read that right. The warehouse is operated by humans and the workers are not familiar with digits and numbers. They decipher each digit on their own and cannot cross-check with the article name. They navigate the warehouse with a best-effort approach. The difference between item 413114 and item 413115 is negligible for them. It’s the same thing anyway – unless you can read (and understand) that one of them blows up above 5 Ampere and the other one doesn’t. And this is a problem for the engineers. The difference between a “Power transformer able to take 10 Ampere” and a “Power transformer (5 A), aka molten copper lump” is a successful or a failed project.

So what can you do? Teach the warehouse workers how to read and deal with numbers? Would be a good approach if the turnover rate among them wasn’t so high. What else can we do? We can abstract the problem at hand, apply a suitable solution approach and see if it works.

The abstraction

If you think about the situation in abstract terms, you deal with an unreliable data transmission. You send your item list to the warehouse and receive a collection of loosely related items. That’s similar to sending data over a faulty cable. To mitigate transmission errors, we’ve invented checksums. Each suitable part of the transmission is validated (or invalidated) by a checksum.

In our case, the “suitable part of the transmission” is each single item. We should add a checksum to the item list! Instead of requesting item 413114, we request 413114/7, while item 413115 is requested as 413115/1. Now, we have a clear indicator for wrong or right. But it is still an indicator in a foreign alphabet. If you ignore the difference between 4 and 5, why not also ignore the difference between 7 and 1?

The emojification

But what if we don’t rely on numbers or characters, but on something every human can understand, regardless of literacy level? What if we transpose the numbers into an emoji alphabet? Let 413114 be πŸ˜„πŸŒ΅β˜οΈπŸŒ΅πŸŒ΅πŸ˜„ and 413115 is written as πŸ˜„πŸŒ΅β˜οΈπŸŒ΅πŸŒ΅πŸ . But more important: The checksum is in emoji, too:

πŸ˜„πŸŒ΅β˜οΈπŸŒ΅πŸŒ΅πŸ˜„ (413114)

πŸš— (7)
vs.
πŸ˜„πŸŒ΅β˜οΈπŸŒ΅πŸŒ΅πŸ  (413115)

🌡 (1)

Even if you only glance at the emoji series (and fail to notice the difference at the end), you still have to acknowledge that your checksum doesn’t fit. A cactus is no car, regardless of your literacy.

This transposition of numbers into the iconographic realm plays right into every human’s built-in ability to distinguish concrete objects. Numbers, digits and characters are (more) abstract concepts and objects, but a cloud is recognizable as a cloud even if you draw it by hand and without care. The transposition is reversable quiet easily – you only have to remember ten number/emoji pairs (or eleven, if your checksum has an extra character). And nobody stops you from printing both on the item list and warehouse storage boxes:

And the best thing? You don’t even have to invent the transposition yourself. Just use the existing work of others by checking out emojisum by Vincent Batts or ecoji by Keith Turner.

The only thing that is stopping you is that ancient dot matrix printer that prints the item lists on continuous paper.

Last night, an end-to-end test saved my live

but not with a song, only with an assertion failure. Okay, I’ll admit, this is not the only hyperbole in the blog title. Let’s set things right: The story didn’t happen last night, it didn’t even happen at night (because I’m not working night hours). In fact, it was a well-lit sunny day. And the test didn’t save my live. It did save me from some embarrassment, though. And spared a customer some trouble and hotfixing sessions with me.

The true parts of the blog title are: An end-to-end test reported an assertion failure that helped me to fix a bug before it got released. In other words: An end-to-end test did its job and I felt fine. And that’s enough of the obscure song references from the 1980s, hopefully.

Let me explain the setting. We develop a big system that runs in production 24/7-style and is perpertually adjusted and improved. The system should run unattended and only require human attention when a real incident happens, either in the data or the system itself. If you look at the system from space, it looks like this:

A data source (in fact, lots of data sources) occasionally transmits data to the system that gets transformed and sent to a third-party system that does all kind of elaborate analysis with it. Of course, there is more to the real system than that, but for our story, it is sufficient to know that there are several third-party systems, each with their own data formats. We’ll call them third-party system A (TPS-A) and TPS-B in this story.

Our system is secured by lots and lots of unit tests, a good number of integration tests and a handful of end-to-end tests. All tests run green all the time. The end-to-end tests (E2ET) try to replicate the production environment as close as possible by running on the target operating system and using data transfer means like in the real case. The test boots up the complete system and simulates the data source and the third-party systems. By configuring the system so that it transfers back to the E2ET, we can write assertions in the kind of “given a specific input data, we expect this particular output data from the system, however it chooses to produce it”. This kind of broad-brush test is invalueable because it doesn’t care for specifics in the system. It only cares for output from the system.

This kind of E2ET is also difficult to write, complicated to run, prone to brittleness and obscure in its failure statement. If you write a three-line unit test, it is straight-forward, fast and probably pretty specific when it breaks: “This one thing that I’m testing is wrong now”. An E2ET as described here is like the Delphi Oracle:

  • E2ET: “Somewhere in your system, something doesn’t work the way I like it anymore.”
  • Developer: “Can you be more specific?”
  • E2ET: “No, but here are 2000 lines of debug output that might help you on your quest for the cause.”
  • Developer: “But I didn’t change anything.”
  • E2ET: “Oh, in that case, it could as well be the weather or a slow network. Mind if I run again?”

This kind of E2ET is also rather slow and takes its sweet time resetting the workspace state, booting the system again and using prolonged timeouts for every step. So you wait several minutes for the E2ET to fail again:

  • E2ET: “Yup, there is still something fishy with your current code.”
  • Developer: “It’s the same code you let pass yesterday.”
  • E2ET: “Well, today I don’t like it!”
  • Developer: “Okay, let me dive into the debug output, then.”

Ten to twenty minutes pass.

  • Developer: “Is it possible that you just choke on a non-free network port?”
  • E2ET: “Yes, I’m supposed to wait for the system’s output on this port and cannot bind to it.”
  • Developer: “You cannot bind to it because an instance of yourself is stuck in the background and holding onto the port.”
  • E2ET: “See? I’ve told you there is something wrong with your system!”
  • Developer: “This isn’t a problem with the system’s code. This is a problem with your code.”
  • E2ET: “Hey! Don’t blame me! I’m here to help. You can always chose to ignore or delete me if I’m too much of a burden to you.”

This kind of E2ET cries wolf lots of times for problems in the test code itself, for too optimistic timeouts or just oddities in the environment. If such an E2ET fails, it’s common practice to run it again to see if the problem persists.

How many false alarms can you tolerate before you stop listing?

In my case, I choose to reduce the amount of E2ET to the essential minimum and listen to them every time. If the test raises a false alarm, invest the time to come up with a way to make it more robust without reducing its sensitivity. But listen to the test every time.

Because, in my story, the test insisted that third-party system A and B both didn’t receive half of their data. I could rule out stray effects in the network or on the harddisk pretty fast, because the other half of the data arrived just fine. So I invested considerable time in understanding the debug output. This led me nowhere – it seemed that the missing data wasn’t sent to the system in the first place. Checking the E2ET disproved this hypothesis. The test sent as much data to the system as ever. But half of it went missing underway. Now I reviewed all changes I made in the last few days. Some of them affected the data export to the TPS. But all unit and integration tests in this area insisted that everything worked as intended.

It is important to have this kind of “multiple witnesses”. Unit tests are like traces in a criminal investigation: They report one very specific information about one very specific part of the code and are only useful in larger numbers. Integration tests testify on the possibility to combine several code parts. In my case, this meant that the unit tests guaranteed that all parts involved in the data export work correct on their own (in isolation). The integration tests vouched that, given the right combination of data export parts, they will work as intended.

And this left one area of the code as the main suspect: Something in the combination of export parts must go wrong in the real case. The code that wires the parts together is the only code not tested by unit and integration tests, but E2ET. Both other test types base their work on the hypothesis of “given that everything is wired together like this”. If this hypothesis doesn’t hold true in the real case, no test but the E2ET finds a problem, because the E2ET bases on another hypothesis: “given that the system is started for real”.

I went through all the changes of the wiring code and there it was: A glorious TODO comment, written by myself on a late friday afternoon, stating: “TODO: Register format X export again after you’ve finished issue Y”. I totally forgot about it over the weekend. I only added the comment into the code, not as an issue comment. Friday afternoon me wasn’t cooperative with future or current me. In effect, I totally played myself. I don’t remember removing the registering code or writing the comment. It probably was a minor side change for an unimportant aspect of the main issue. The whole thing was surely done in a few seconds and promptly forgotten. And the only guardian that caught it was the E2ET, using the most nonspecific words for it (“somewhere, something went wrong”).

If I had chosen to ignore or modify the E2ET, the bug would have made it to production. It would have caused a loss of current data for the TPS. Maybe they would have detected it. But maybe, they would have just chosen to operate on the most recent data – data that doesn’t get updated anymore. In short, we don’t want to find out.

So what is the moral of the story?
Always have an end-to-end test for your most important functionality in place. Let it deal with your system from the outside. And listen to it, regardless of the effort it takes.

And what can you do if your E2ET has saved you?

  • Give your test a medal. No, seriously, leave a comment in the test code that it was helpful.
  • Write an unit or integration test that replicates the specific cause. You want to know immediately if you make the same error again in the future. Your E2ET is your last line of defense. Don’t let it be your only one. You’ve been shown a weakness in your test setup, remediate it.
  • Blame past you for everything. Assure future you that you are a better developer than this.
  • Feel good that you were able to neutralize your faux pas in time. That’s impressive!

When was the last time a test has saved you? Leave the story or a link to it in the comments. I can’t be the only one.

Oh, and if you find the random links to 80s music videos weird: At least I didn’t rickroll you. The song itself is from 1987 and would fit my selection.

Zero Interaction Tools

Some time ago, a customer called us for a delicate task: To develop a little tool in a very tight budget that aggregates pictures in a specific way. The pictures were from the medical domain and comprised of sets with thousands of pictures that needed to be combined into one large picture with the help of some mathematics. Developing the algorithm was not a problem, the huge data sizes neither, but the budget was a challenge. We could program everything and test it on some sample picture sets (that arrived on several blue-ray discs) within the budget, but an elaborate graphical user interface (GUI) would be out of scope. On the other hand, the anticipated users of the tool weren’t computer affine enough to handle a CLI (command line interface). The tool needed to be simple, fast and cheap. And it only needed to do one thing.

Traditional usage of software tools

In the traditional way, a software tool comes with an installer that entrenches the tool onto the target computer and provides a start menu entry, a desktop icon and perhaps even a boot launcher with a fancy tray icon and some balloon tooltips that inform the user from time to time that the tool is still installed and wants some attention. If you click on the tool’s icon, a graphical user interface appears, perhaps in the form of an application window or just a tray pop-up menu. You need to interact with the user interface, then. Let’s say you want to combine several thousand pictures into one, then you need to specify directories or collections of files through some file dialogs with “browse” buttons and complex “ingredients” lists. You’ve probably seen this type of user interface while burning a blue-ray disc or uploading files into the cloud. After you’ve selected all your input pictures, you have to say where to write to (another file dialog) and what name your target file should have. After that, you’ll stare at a progress bar and wait for it to reach the right hand side of the widget. And then, the tool will beep and proudly present a little message box that informs you that everything has worked out just fine and you can find your result right were you wanted to. Afterwards, the tool will sit there on your screen in anticipation of your next move. What will you do? Do it all again because you love the progress bars and beeps? Combine another several thousand pictures? Shutdown the tool? Oh, come on! Are you sure you want to quit now?

None of this could be developed in the budget our customer gave us. The tool didn’t need the self-marketing aspects like a tray icon or launcher, because the customer would only use it internally. But even the rest of the user interface was too much work: the future users would not get a traditional software tool.

Zero interaction tool

So we thought about the minimal user interface that the picture aggregation tool needed to have. And came to the conclusion that no user interface was needed, because it really only needed to do one thing. At least, if certain assumptions hold true:

  • The tool is fast enough to produce no significant delay
  • The input directory holds all pictures that should be aggregated
  • The input directory can be the output directory as well
  • The name of the resulting picture file can contain a timestamp to distinguish between several tool runs

We consulted our customer and checked that the latter three assumptions were valid. So, given we can make the first assumption a reality, the tool could work without any form of user interaction by being copied into the picture directory and then started.

Yes, you’ve read this right. The tool would not be installed, but copied for every usage. This is a highly unusual usage scenario for a program, because it means that every picture directory that should be aggregated holds an identical copy of the program. But if we can make some more assumptions valid, it is a viable way to empower the users:

  • The tool must run on all target machines without additional preparation
  • The tool must only consist of one executable file, no DLLs or configuration files
  • The tool must be small in size, like one megabyte at most

We confirmed with a quick and dirty spike (an embarrasingly inchoate prototype) that we can produce a program that conforms to all three new assumptions/requirements. The only remaining problem was the very first assumption: No harddrive was fast enough to provide the pixel data of thousands of pictures in less than a second. Even if we could aggregate the pixels fast enough (given enough cores, this would be possible), we couldn’t get hold of them fast enough. We needed some kind of progress bar.

Use your information channels

We thought about the information channels our tool would have towards the user. Let’s repeat the scenario: The user navigates to the directory containing the pictures that should be aggregated, copies the executable program into it and double-clicks to start the tool. There are many possibilities to inform the user about progress:

  • Audio (Sound): We can play a little tune or some sound that changes frequency to indicate progress. This is highly unusual and we can’t be sure that the speakers aren’t muted (usage on a notebook was part of the domain analysis results). No sounds, that is.
  • Animation (Graphics): In the most boring case, this would be a little window with a progress bar that runs from left to right and disappears when the work is done. Possible, but boring. Perhaps we can think of something more in tune with the rest of the usage scenario.
  • Text: Well, this was the idea. We produce a result file and give it a name. We don’t need to keep the name static as long as we are working and things change inside the file, anyways. We just update the file name to reflect our progress.

So our tool creates a new result file in the picture directory that is named result_0_percent or something and runs up to result_100_percent and then gets renamed to result_timestamp with the current timestamp. You can just watch your file explorer to keep up with the tool’s completion. This is a bit unusual at first, but all pilot users grasped the concept immediately and were pleased with it.

The result

And this is the story when we developed a highly specialized tool within a very small budget without any graphical or otherwise traditional user interface. The user brings the tool to the data (by copying it into the same directory) and lets it perform its work by simply starting it. The tool reports its progress back via the result file name. As soon as the result file contains a timestamp (and the notebook air fans cease to go beserk), the user can copy it into the next tool in the tool chain, probably a picture viewer or a printer driver. The users loved the tool for its speed and simplicity.

One funny side-note remains to be told: Because thousands of pictures aggregated into one produces a picture with a lot of details, the result file was not too big (about 20-30 megabytes), but could take out any printer for several hours if printed. The tool got informally renamed to “printer-reaper.exe”.

Containers allot responsibilities anew

Earlier this year, we experienced a strange bug with our invoices. We often add time tables of our work to the invoices and generate them from our time tracking tool. Suddenly, from one invoice to the other, the dates were wrong. Instead of Monday, the entry was listed as Sunday. Every day was shifted one day “to the left”. But we didn’t release a new version of any of the participating tools for quite some time.

What we did since the last invoice generation though was to dockerize the invoice generation tool. We deployed the same version of the tool into a docker container instead of its own virtual machine. This reduced the footprint of the tool and lowered our machine count, which is a strategic goal of our administrators.

By dockerizing the tool, we also unknowingly decoupled the timezone setting of the container and tool from the timezone setting of the host machine. The host machine is set to the correct timezone, but the docker container was set to UTC, being one hour behind the local timezone. This meant that the time table generation tool didn’t land at midnight of the correct day, but at 23 o’clock of the day before. Side note: If the granularity of your domain data is “days”, it is not advisable to use 00:00 o’clock as the reference time for your technical data. Use something like 12:00 o’clock or adjust your technical data to match the domain and remove the time aspect from your dates.

We needed to adjust the timezone of the docker container by installing the tzdata package and editing some configuration files. This was no big deal once we knew where the bug originated from. But it shows perfectly that docker (as a representative of the container technology) rearranges the responsibilities of developers and operators/administrators and partitions them in a clear-cut way. Before the dockerization, the timezone information was provided by the host and maintained by the administrator. Afterwards, it is provided by the container and therefore maintained by the developers. If containers are immutable service units, their creators need to accomodate for all the operation parameters that were part of the “environment” beforehands. And the environment is provided by the operators.

So we see one thing clearly: Docker and container technology per se partitions the responsibilities between developers and operators in a new way, but with a clear distinction: Everything is developer responsibility as long as the operators provide ports and volumes (network and persistent storage). Volume backup remains the responsibility of operations, but formatting and upgrading the volume’s content is a developer task all of a sudden. In a containerized world, the operators don’t know you are using a NoSQL database and they really don’t care anymore. It’s just one container more in the zoo.

I like this new partitioning of responsibilities. It assigns them for technical reasons, so you don’t have to find an answer in each organization anew. It hides a lot of detail from the operators who can concentrate on their core responsibilities. Developers don’t need to ask lots of questions about their target environment, they can define and deliver their target environment themselves. This reduces friction between the two parties, even if developers are now burdened with more decisions.

In my example from the beginning, the classic way of communication would have been that the developers ask the administrator/operator to fix the timezone on the production system because they have it right on all their developer machines. The new way of communication is that the timezone settings are developer responsibility and now the operator asks the developers to fix it in their container creation process. And, by the way, every developer could have seen the bug during development because the developer environment matches the production environment by definition.

This new partition reduces the gray area between the two responsibility zones of developers and operators and makes communication and coordination between them easier. And that is the most positive aspect of container technology in my eyes.

Think of your code as a maintenance minefield

Most of the cost, effort and time of a software project is spent on the maintenance phase, the modification of a software product after delivery. If you think about all these resources as “negative investments” or debt settlement and try to associate your spendings with specific code areas or even single lines of code, you’ll probably find that the maintenance cost per line is not equally distributed. There are lots of lines of code that outlast the test of time without any maintenance work at all, a fair amount of lines that require moderate attention and some lines that seem to require constant and excessive developer care.

If you transfer this image to another metaphor, your code presents itself like a minefield for maintenance effort: Most of the area is harmless and safe to travel. But there are some positions that will just blow up once touched. The difference is that as a software developer, you don’t tread on the minefield, but you catch the flak if something happens.

You should try to deliver your code free of maintenance mines.

Spotting a maintenance mine

Identifying a line of code as a maintenance mine after the fact is easy. You probably already recognize the familiar code as “troublesome” because you’ve spent hours trying to understand and fix it. The commit history of your version control system can show you the “hottest” lines in your code – the areas that were modified most often. If you add tests for each new bug, you’ll find that the code is probably tested really well, with tests motivated by different bug issues. In hindsight, you can clearly distinguish low-effort code from high maintenance code.

But before delivery, all code looks the same. Or does it?

An example of a maintenance mine

Let’s look at an example. Our system monitors critical business data and sends out alerts if certain conditions are met. One implementation of the part sending the alerts is a simple e-mail sender. The code is given here:


public class SendEmailService {

  public void sendTo(
                Person person,
                String subject,
                String body) {
    execCmd(
         buildCmd(
               person.email(), subject, body));
  }

  private String buildCmd(String recipientMailAdress, String subject, String body){
    return "'/usr/bin/mutt -t " + recipientMailAdress + " -u " + subject + " -m " + body + "'";
  }

  private int execCmd(String command) throws IOException{
    return Runtime.getRuntime()
                  .exec(command).exitValue();
  }
}

This code has two interesting problems:

  • The first problem is that it is written in Java, a platform agnostic programming language, but depends on being run on a linux (or sufficiently similar unixoid) operating system. The system it runs on needs to supply the /usr/bin/mutt program and have the e-mail sending settings properly configured or else every try to run the send command will result in an error. This implicit dependency on the configuration of the production system isn’t the best way to deal with the situation, but it’s probably a one-time pain. The problem clearly presents itself and once the system is set up in the right way, it is gone (until somebody tampers with the settings again). And my impression is that this code separates two concerns between development and operations rather nicely: Development provides software that can send specific e-mails if operations provides a system that is capable of sending e-mails. No need to configure the system for e-mail sending and doing it again for the software on said system.
  • The second problem looks like a maintenance mine. In the line where the code passes the command line to the operating system (by calling Runtime.getRuntime().exec()), a Process object is returned that is only asked for its exitValue(), implicating a wait for the termination of the system command. The line looks straight and to the point. No need to store and handle intermediate objects if you aren’t interested in them. But perhaps, you should care:

By default, the created process does not have its own terminal or console. All its standard I/O (i.e. stdin, stdout, stderr) operations will be redirected to the parent process, where they can be accessed via the streams obtained using the methods getOutputStream(), getInputStream(), and getErrorStream(). The parent process uses these streams to feed input to and get output from the process. Because some native platforms only provide limited buffer size for standard input and output streams, failure to promptly write the input stream or read the output stream of the process may cause the process to block, or even deadlock.

Emphasize by me, see also: https://docs.oracle.com/en/java/javase/11/docs/api/java.base/java/lang/Process.html

This means that the Process object’s stdout and stderr outputs are stored in buffers of unknown (and system dependent) size. If one of these buffers fills up, the execution of the command just stops, as if somebody had paused it indefinitely. So, depending on your call’s talkativeness, your alert e-mail will not be sent, your system will appear to have failed to recognize the condition and you’ll never see a stacktrace or error exit value. All other e-mails (with less chatter) will go through just fine. This is a guaranteed source of frantic telephone calls, headaches and lost trust in your system and your ability to resolve issues.

And all the problems originate from one line of code. This is a maintenance mine with a stdout fuse.

The fix for this line might lie in the use of the ProcessBuilder class or your own utility code to drain the buffers. But how would you discover the mine before you deliver it?

Mines often lie at borders

One thing that stands out in this line of code is that it passes control to the “outside”. It acts as a transit point to the underlying operating system and therefor has a lot of baggage to check. There are no safety checks implemented, so the transit must be regarded as unsafe. If you look out for transit points in your code (like passing control to the file system, the network, a database or another external system), make sure you’ve read the instructions and requirements thoroughly. The problems of a maintenance mine aren’t apparent in your code and only manifest themselves during the interaction with the external system. And this is a situation that happens disproportionately often in production and comparably seldom during development.

So, think of your code as a maintenance minefield and be careful around its borders.

What is your minesweeper story? Drop us a comment.