Contain me if you can – the art of tunneling

This blog post briefly explains the basics of SSH tunneling and dives into the details of reverse tunneling. If you are already familiar with all these concepts, this might be a boring read. If you’ve just heard “reverse tunneling” for the first time, you’re about to discover the dark side of a network administrator’s might.

SSH basics

SSH (secure shell) is a tool to connect to a remote computer. In this blog post, the remote computer is always “the server” and your local computer is “the notebook”. That’s just nomenclature; in reality, each computer (or “endpoint”, which is a misleading term, as we will see) can be any device you want, as long as “the server” runs an SSH server daemon and “the notebook” can run an SSH client program. Both computers can live in different local area networks (LAN). As long as it is possible to send network packets to the specific port the SSH server daemon listens on and you have some means to authenticate on the server, you have free rein over both LANs by utilizing SSH tunnels.

The trick to bypass the remote LAN firewall is to have a port forwarded to the server. This allows you to access the server’s port 22 by sending packets to the internet gateway’s public address on a port of your choice, say 333. So the forwarding rule is:

gateway:333 -> server:22

If you can comprehend this, you’ve basically grasped the whole foundation of SSH tunneling, even though it has nothing to do with SSH itself at all.
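To make this concrete in code, here is a minimal sketch using the JSch SSH library for Java (my choice for illustration – any SSH client does the job, on the command line it would simply be ssh -p 333 user@gateway). Host name, user and password are placeholders.

import com.jcraft.jsch.JSch;
import com.jcraft.jsch.Session;

// Sketch only: connect to the server by talking to the gateway's forwarded port 333.
public class ConnectThroughGateway {
  public static void main(String[] args) throws Exception {
    Session session = new JSch().getSession("user", "gateway.example.com", 333);
    session.setPassword("secret"); // or configure public key authentication instead
    session.setConfig("StrictHostKeyChecking", "no"); // convenient for a sketch, not for production
    session.connect(); // the gateway forwards this to server:22
    System.out.println("connected: " + session.isConnected());
    session.disconnect();
  }
}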

SSH tunneling basics

Now that we have a connection to the server, we can enjoy the fun of forward tunneling. You didn’t just connect to the server, you opened up every computer in the remote LAN for direct access. Using the SSH connection, you can define an SSH tunnel to connect to another computer in the remote LAN by saying:

localhost:8000 -> another:80

This is what you give your SSH client, but what actually happens is:

localhost:8000 -> gateway:333 -> server:22 -> another:80

You open a local network port (8000) on your notebook that uses your SSH connection over the gateway and the server to send packets to the “another” computer in the remote LAN. Every byte you send to your port 8000 will be sent to port 80 of the “another” computer. Every byte it answers will be relayed all the way back to your client, speaking to your local port 8000. For your client, it will look as if port 80 of “another” and port 8000 of your notebook are the same thing (if you don’t mind a small delay). And the best thing: Your client can be anywhere in your local LAN. Given the right configuration, “notebook2” in your LAN can speak with port 80 of “another” in the remote LAN by using your SSH tunnel. If you’ve ever played the computer game “Portal”, SSH tunneling is like wielding the portal gun in networks, but with unlimited portals at the same time. Yes, you can have many SSH tunnels open at once. That’s pretty neat and powerful, but it’s essentially still only half of the story.
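If you want to see this tunnel as code, here is a sketch, again assuming the JSch library and placeholder names (the OpenSSH command line equivalent is ssh -p 333 -L 8000:another:80 user@gateway):

import com.jcraft.jsch.JSch;
import com.jcraft.jsch.Session;

// Sketch only: the forward tunnel localhost:8000 -> another:80 over the connection to gateway:333.
public class ForwardTunnel {
  public static void main(String[] args) throws Exception {
    Session session = new JSch().getSession("user", "gateway.example.com", 333);
    session.setPassword("secret");
    session.setConfig("StrictHostKeyChecking", "no");
    session.connect();

    // Local port 8000 now leads to port 80 of "another" inside the remote LAN.
    session.setPortForwardingL(8000, "another", 80);

    // Keep the tunnel open while you point a browser or client at localhost:8000.
    Thread.sleep(Long.MAX_VALUE);
  }
}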

SSH reverse tunneling basics

What you can also do is to open a network port on the server that acts like a portal in the other direction. You probably noticed the arrows in the tunnel descriptions above. The arrows really give a direction. If your tunnel is

localhost:8000 -> another:80

you can send packets from your notebook to “another” (and receive its answers), but it doesn’t work the other way around. The software listening at another:80 could not initiate a data transfer to localhost:8000 on its own. So SSH tunnels always have a direction, even if the answers flow back. It’s like a door phone: You, inside the building, can initiate a call to the entrance door phone at any time. The guest in front of the door can speak back, but is not able to “call you back” once you hang up. He cannot strike up a conversation with you over the phone. (In reality, he still has the doorbell to annoy you. Luckily, the doorbell is missing in networking.)

SSH reverse tunnels change the direction of your tunnels to do exactly that: Give somebody on the remote LAN the ability to send network packets to the server that will get “teleported” into your local LAN. Let’s try this:

localhost:8070 <- server:8080

If you set up a reverse tunnel, any computer that can send packets to port 8080 on the server will in fact speak with your notebook on port 8070. This is true for the server itself, any computer in the remote LAN (provided the SSH daemon is configured to bind remote forwards to non-loopback addresses, see the GatewayPorts option) and any computer with an SSH tunnel to port 8080 on the server. Yes, you’ve just discovered that you can send a data packet from your LAN to the remote LAN and have it reverse tunneled to a third LAN. And if you haven’t thought about it yet: Every “endpoint” in your three-LAN hop can be a portal to yet another hop, with or without your knowledge. This is why “endpoint” is misleading: It implies an “end” to the data transfer. That might just be your perception of the situation. Every endpoint can be another portal until it’s portals all the way down.
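Again as a sketch in code, under the same assumptions as above (JSch, placeholder names; the OpenSSH command line equivalent is ssh -p 333 -R 8080:localhost:8070 user@gateway):

import com.jcraft.jsch.JSch;
import com.jcraft.jsch.Session;

// Sketch only: the reverse tunnel localhost:8070 <- server:8080.
public class ReverseTunnel {
  public static void main(String[] args) throws Exception {
    Session session = new JSch().getSession("user", "gateway.example.com", 333);
    session.setPassword("secret");
    session.setConfig("StrictHostKeyChecking", "no");
    session.connect();

    // Port 8080 on the server now leads back to port 8070 on this notebook.
    session.setPortForwardingR(8080, "localhost", 8070);

    Thread.sleep(Long.MAX_VALUE); // keep the tunnel open
  }
}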

SSH reverse tunneling fun

You can use this back and forth teleporting of network traffic to create all kinds of complexity.

But in our case, we want to use our new might for something fun. Let’s imagine the following scenario:

  • On your notebook, you can start a SSH connection to the server. You can open forward and reverse tunnels.
  • On your notebook, you run a VNC (Virtual Network Computing) server. VNC is basically a simple remote desktop technology that lets you see the display content of a remote computer and control its keyboard and mouse. You running the server means somebody else can connect to your notebook using a VNC client and control it as if it was local (if you can ignore a small delay).
  • Somebody else (let’s call her Eve), on her notebook, can also start a SSH connection to the server. She can open forward and reverse tunnels.
  • On her notebook, she has a VNC client software installed. She can connect and control VNC server machines.

And now the fun begins:

  • VNC servers typically listen on network port 5900. In the client, this is known as “display 0”: a VNC client connects to port 5900 + display number on the server machine.
  • Using this knowledge, you create a reverse tunnel for your SSH connection to the server: localhost:5900 <- server:5901. Notice how we changed the port. This is not necessary, but possible.
  • Now, anybody with direct access to your network port 5900 (in your LAN) or access to the network port 5901 on the server (in the remote LAN) can remote control your desktop.
  • Eve creates a forward tunnel for her SSH connection to the server or any other machine in the remote LAN: localhost:5902 -> server:5901 (see the sketch after this list). Notice the port change again. Also just a possibility, not a requirement. Notice also that Eve doesn’t need to connect to the same server as you. The server she connects to just needs to be able to initiate a data transfer with your server. This might be through multiple portals that don’t matter in our story.
  • Now, Eve starts her VNC client and lets it connect to localhost:2 (2 is the display number that gets resolved to local port 5902).
  • Eve can now see your screen, control your inputs and even transfer files (depending on the capabilities of your VNC setup).
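Here is a sketch of both sides of the scenario, under the same assumptions as the earlier snippets (JSch, placeholder names and credentials). On the command line, your side would be ssh -R 5901:localhost:5900 you@server, Eve’s side ssh -L 5902:localhost:5901 eve@server, followed by e.g. vncviewer localhost:2.

import com.jcraft.jsch.JSch;
import com.jcraft.jsch.Session;

// Sketch only: the two tunnels that connect Eve's VNC client to your VNC server.
public class VncOverTunnels {

  // Runs on your notebook: expose your local VNC server (port 5900) as port 5901 on the server.
  // Reach the server however you can, e.g. through the gateway's forwarded port from earlier.
  static void yourSide() throws Exception {
    Session session = new JSch().getSession("you", "server.example.com", 22);
    session.setPassword("secret");
    session.setConfig("StrictHostKeyChecking", "no");
    session.connect();
    session.setPortForwardingR(5901, "localhost", 5900); // localhost:5900 <- server:5901
    // keep the session connected for as long as the tunnel should stay open
  }

  // Runs on Eve's notebook: pull the server's port 5901 over to her local port 5902.
  static void evesSide() throws Exception {
    Session session = new JSch().getSession("eve", "server.example.com", 22);
    session.setPassword("secret");
    session.setConfig("StrictHostKeyChecking", "no");
    session.connect();
    session.setPortForwardingL(5902, "localhost", 5901); // localhost:5902 -> server:5901
    // Eve now starts her VNC client and connects to display 2, i.e. localhost:5902.
  }
}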

This is the point where LAN containment doesn’t really matter anymore. Sure, you cannot reign freely over any machine anywhere, but you can always establish a combination of reverse and forward tunnels to connect two computers as if they were connected by a cable. Which is, in fact, a good analogy to think about SSH tunnels: cables with two different connector types that span between computers.

Wrap up

You are probably aware of the immediate security risks that follow from all this complexity. If you want to mitigate some of the risks, have a look at stunnel. It is effectively a more secure (and constrainable) way to create portals between LANs.

And if you just want to have fun: Please be aware that most of the actions here are probably not legal if done without consent of all participants. Stay safe!

The emoji checksum

This blog article is a story about an idea, not an actual project report. If it were a movie, it would feature the “based on real events” disclaimer.

The warehouse

Imagine the warehouse of a medium-sized company. You would expect a medium-sized warehouse, but in reality, the number of distinct items in this warehouse is nearly as big as in a big company. The difference might be the stock count of each item, but the item count is a big number. So big that each item has its own “item ID”, which is also used as the location identifier in the warehouse. Let’s see three (contrived) examples:

  • 211 725: Retaining screw, 8 mm
  • 413 114: Power transformer, 5 A
  • 413 115: Power transformer, 10 A

As you can see, different item groups have numbers with a large numerical distance, while similar items are numerically close. This makes sense for the engineers, who use these numbers from muscle memory, and for navigating the warehouse. If you read the first three digits, you already know where to turn in the large hall. Once you’ve arrived in the general area, the next three digits lead you to the exact storage space.

The operators

But that’s not how it works. The warehouse workers cannot read. Yes, you’ve read that right. The warehouse is operated by humans, and the workers are not familiar with digits and numbers. They decipher each digit on its own and cannot cross-check with the article name. They navigate the warehouse with a best-effort approach. The difference between item 413114 and item 413115 is negligible for them. It’s the same thing anyway – unless you can read (and understand) that one of them blows up above 5 amperes and the other one doesn’t. And this is a problem for the engineers. The difference between a “power transformer able to take 10 amperes” and a “power transformer (5 A), a.k.a. molten copper lump” is a successful or a failed project.

So what can you do? Teach the warehouse workers how to read and deal with numbers? That would be a good approach if the turnover rate among them weren’t so high. What else can we do? We can abstract the problem at hand, apply a suitable solution approach and see if it works.

The abstraction

If you think about the situation in abstract terms, you deal with an unreliable data transmission. You send your item list to the warehouse and receive a collection of loosely related items. That’s similar to sending data over a faulty cable. To mitigate transmission errors, we’ve invented checksums. Each suitable part of the transmission is validated (or invalidated) by a checksum.

In our case, the “suitable part of the transmission” is each single item. We should add a checksum to the item list! Instead of requesting item 413114, we request 413114/7, while item 413115 is requested as 413115/1. Now, we have a clear indicator for wrong or right. But it is still an indicator in a foreign alphabet. If you ignore the difference between 4 and 5, why not also ignore the difference between 7 and 1?

The emojification

But what if we don’t rely on numbers or characters, but on something every human can understand, regardless of literacy level? What if we transpose the numbers into an emoji alphabet? Let 413114 be 😄🌵☁️🌵🌵😄 and let 413115 be written as 😄🌵☁️🌵🌵🏠. But more importantly: The checksum is in emoji, too:

😄🌵☁️🌵🌵😄 (item 413114) with checksum 🚗 (7)

vs.

😄🌵☁️🌵🌵🏠 (item 413115) with checksum 🌵 (1)

Even if you only glance at the emoji series (and fail to notice the difference at the end), you still have to acknowledge that your checksum doesn’t fit. A cactus is no car, regardless of your literacy.

This transposition of numbers into the iconographic realm plays right into every human’s built-in ability to distinguish concrete objects. Numbers, digits and characters are (more) abstract concepts, but a cloud is recognizable as a cloud even if you draw it by hand and without care. The transposition is also easily reversible – you only have to remember ten number/emoji pairs (or eleven, if your checksum has an extra symbol). And nobody stops you from printing both on the item list and on the warehouse storage boxes.
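As a small sketch in code: the digit-to-emoji table below is made up for this example (only the pairs that appear in the text above are taken from it), and the check digit is assumed to be given – how it is computed is a separate decision.

// Sketch only: transpose an item number and its (given) check digit into emoji.
public class EmojiChecksum {

  // index = digit; an arbitrary table, matching the examples above for 1, 3, 4, 5 and 7
  private static final String[] EMOJI =
      {"⭐", "🌵", "🐟", "☁️", "😄", "🏠", "🍎", "🚗", "🔑", "🌙"};

  static String emojify(String digits) {
    StringBuilder result = new StringBuilder();
    for (char digit : digits.toCharArray()) {
      result.append(EMOJI[digit - '0']);
    }
    return result.toString();
  }

  public static void main(String[] args) {
    System.out.println("item:     " + emojify("413114")); // 😄🌵☁️🌵🌵😄
    System.out.println("checksum: " + emojify("7"));      // 🚗
  }
}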

And the best thing? You don’t even have to invent the transposition yourself. Just use the existing work of others by checking out emojisum by Vincent Batts or ecoji by Keith Turner.

The only thing that is stopping you is that ancient dot matrix printer that prints the item lists on continuous paper.

Last night, an end-to-end test saved my life

but not with a song, only with an assertion failure. Okay, I’ll admit, this is not the only hyperbole in the blog title. Let’s set things right: The story didn’t happen last night, it didn’t even happen at night (because I don’t work night hours). In fact, it was a well-lit, sunny day. And the test didn’t save my life. It did save me from some embarrassment, though. And spared a customer some trouble and hotfixing sessions with me.

The true parts of the blog title are: An end-to-end test reported an assertion failure that helped me to fix a bug before it got released. In other words: An end-to-end test did its job and I felt fine. And that’s enough of the obscure song references from the 1980s, hopefully.

Let me explain the setting. We develop a big system that runs in production 24/7 and is perpetually adjusted and improved. The system should run unattended and only require human attention when a real incident happens, either in the data or in the system itself. If you look at the system from space, it looks like this:

A data source (in fact, lots of data sources) occasionally transmits data to the system, which gets transformed and sent to a third-party system that does all kinds of elaborate analysis with it. Of course, there is more to the real system than that, but for our story, it is sufficient to know that there are several third-party systems, each with their own data formats. We’ll call them third-party system A (TPS-A) and TPS-B in this story.

Our system is secured by lots and lots of unit tests, a good number of integration tests and a handful of end-to-end tests. All tests run green all the time. The end-to-end tests (E2ET) try to replicate the production environment as closely as possible by running on the target operating system and using the same data transfer mechanisms as the real case. The test boots up the complete system and simulates the data source and the third-party systems. By configuring the system so that it transfers its output back to the E2ET, we can write assertions of the kind “given this specific input data, we expect this particular output data from the system, however it chooses to produce it”. This kind of broad-brush test is invaluable because it doesn’t care about specifics inside the system. It only cares about output from the system.

This kind of E2ET is also difficult to write, complicated to run, prone to brittleness and obscure in its failure statement. If you write a three-line unit test, it is straightforward, fast and probably pretty specific when it breaks: “This one thing that I’m testing is wrong now”. An E2ET as described here is like the Oracle of Delphi:

  • E2ET: “Somewhere in your system, something doesn’t work the way I like it anymore.”
  • Developer: “Can you be more specific?”
  • E2ET: “No, but here are 2000 lines of debug output that might help you on your quest for the cause.”
  • Developer: “But I didn’t change anything.”
  • E2ET: “Oh, in that case, it could as well be the weather or a slow network. Mind if I run again?”

This kind of E2ET is also rather slow and takes its sweet time resetting the workspace state, booting the system again and using prolonged timeouts for every step. So you wait several minutes for the E2ET to fail again:

  • E2ET: “Yup, there is still something fishy with your current code.”
  • Developer: “It’s the same code you let pass yesterday.”
  • E2ET: “Well, today I don’t like it!”
  • Developer: “Okay, let me dive into the debug output, then.”

Ten to twenty minutes pass.

  • Developer: “Is it possible that you just choke on a non-free network port?”
  • E2ET: “Yes, I’m supposed to wait for the system’s output on this port and cannot bind to it.”
  • Developer: “You cannot bind to it because an instance of yourself is stuck in the background and holding onto the port.”
  • E2ET: “See? I’ve told you there is something wrong with your system!”
  • Developer: “This isn’t a problem with the system’s code. This is a problem with your code.”
  • E2ET: “Hey! Don’t blame me! I’m here to help. You can always choose to ignore or delete me if I’m too much of a burden to you.”

This kind of E2ET cries wolf lots of times for problems in the test code itself, for too optimistic timeouts or just oddities in the environment. If such an E2ET fails, it’s common practice to run it again to see if the problem persists.

How many false alarms can you tolerate before you stop listening?

In my case, I choose to reduce the number of E2ETs to the essential minimum and to listen to them every time. If a test raises a false alarm, invest the time to come up with a way to make it more robust without reducing its sensitivity. But listen to the test every time.

Because, in my story, the test insisted that third-party systems A and B both didn’t receive half of their data. I could rule out stray effects in the network or on the hard disk pretty fast, because the other half of the data arrived just fine. So I invested considerable time in understanding the debug output. This led me nowhere – it seemed that the missing data wasn’t sent to the system in the first place. Checking the E2ET disproved this hypothesis. The test sent as much data to the system as ever. But half of it went missing along the way. Now I reviewed all the changes I had made in the last few days. Some of them affected the data export to the TPS. But all unit and integration tests in this area insisted that everything worked as intended.

It is important to have this kind of “multiple witnesses”. Unit tests are like traces in a criminal investigation: They report one very specific piece of information about one very specific part of the code and are only useful in larger numbers. Integration tests testify to the possibility of combining several code parts. In my case, this meant that the unit tests guaranteed that all parts involved in the data export work correctly on their own (in isolation). The integration tests vouched that, given the right combination of data export parts, they would work as intended.

And this left one area of the code as the main suspect: Something in the combination of export parts must go wrong in the real case. The code that wires the parts together is the only code not covered by unit and integration tests, only by the E2ET. Both other test types base their work on the hypothesis “given that everything is wired together like this”. If this hypothesis doesn’t hold true in the real case, no test but the E2ET finds the problem, because the E2ET rests on another hypothesis: “given that the system is started for real”.

I went through all the changes of the wiring code and there it was: A glorious TODO comment, written by myself on a late Friday afternoon, stating: “TODO: Register format X export again after you’ve finished issue Y”. I totally forgot about it over the weekend. I only added the comment to the code, not as an issue comment. Friday-afternoon me wasn’t cooperative with future or current me. In effect, I totally played myself. I don’t remember removing the registering code or writing the comment. It probably was a minor side change for an unimportant aspect of the main issue. The whole thing was surely done in a few seconds and promptly forgotten. And the only guardian that caught it was the E2ET, using the most nonspecific words for it (“somewhere, something went wrong”).

If I had chosen to ignore or modify the E2ET, the bug would have made it to production. It would have caused a loss of current data for the TPS. Maybe they would have detected it. But maybe, they would have just chosen to operate on the most recent data – data that doesn’t get updated anymore. In short, we don’t want to find out.

So what is the moral of the story?
Always have an end-to-end test for your most important functionality in place. Let it deal with your system from the outside. And listen to it, regardless of the effort it takes.

And what can you do if your E2ET has saved you?

  • Give your test a medal. No, seriously, leave a comment in the test code that it was helpful.
  • Write a unit or integration test that replicates the specific cause. You want to know immediately if you make the same error again in the future. Your E2ET is your last line of defense. Don’t let it be your only one. You’ve been shown a weakness in your test setup – remediate it.
  • Blame past you for everything. Assure future you that you are a better developer than this.
  • Feel good that you were able to neutralize your faux pas in time. That’s impressive!

When was the last time a test saved you? Leave the story or a link to it in the comments. I can’t be the only one.

Oh, and if you find the random links to 80s music videos weird: At least I didn’t rickroll you. The song itself is from 1987 and would fit my selection.

Zero Interaction Tools

Some time ago, a customer called us with a delicate task: to develop a little tool on a very tight budget that aggregates pictures in a specific way. The pictures were from the medical domain and came in sets of thousands of pictures that needed to be combined into one large picture with the help of some mathematics. Developing the algorithm was not a problem, and neither were the huge data sizes, but the budget was a challenge. We could program everything and test it on some sample picture sets (that arrived on several Blu-ray discs) within the budget, but an elaborate graphical user interface (GUI) would be out of scope. On the other hand, the anticipated users of the tool weren’t computer-savvy enough to handle a CLI (command line interface). The tool needed to be simple, fast and cheap. And it only needed to do one thing.

Traditional usage of software tools

In the traditional way, a software tool comes with an installer that entrenches the tool onto the target computer and provides a start menu entry, a desktop icon and perhaps even a boot launcher with a fancy tray icon and some balloon tooltips that inform the user from time to time that the tool is still installed and wants some attention. If you click on the tool’s icon, a graphical user interface appears, perhaps in the form of an application window or just a tray pop-up menu. You need to interact with the user interface, then. Let’s say you want to combine several thousand pictures into one; then you need to specify directories or collections of files through some file dialogs with “browse” buttons and complex “ingredients” lists. You’ve probably seen this type of user interface while burning a Blu-ray disc or uploading files into the cloud. After you’ve selected all your input pictures, you have to say where to write to (another file dialog) and what name your target file should have. After that, you’ll stare at a progress bar and wait for it to reach the right-hand side of the widget. And then, the tool will beep and proudly present a little message box that informs you that everything has worked out just fine and you can find your result right where you wanted it. Afterwards, the tool will sit there on your screen in anticipation of your next move. What will you do? Do it all again because you love the progress bars and beeps? Combine another several thousand pictures? Shut down the tool? Oh, come on! Are you sure you want to quit now?

None of this could be developed in the budget our customer gave us. The tool didn’t need the self-marketing aspects like a tray icon or launcher, because the customer would only use it internally. But even the rest of the user interface was too much work: the future users would not get a traditional software tool.

Zero interaction tool

So we thought about the minimal user interface that the picture aggregation tool needed to have. And came to the conclusion that no user interface was needed, because it really only needed to do one thing. At least, if certain assumptions hold true:

  • The tool is fast enough to produce no significant delay
  • The input directory holds all pictures that should be aggregated
  • The input directory can be the output directory as well
  • The name of the resulting picture file can contain a timestamp to distinguish between several tool runs

We consulted our customer and checked that the latter three assumptions were valid. So, given we can make the first assumption a reality, the tool could work without any form of user interaction by being copied into the picture directory and then started.

Yes, you’ve read this right. The tool would not be installed, but copied for every usage. This is a highly unusual usage scenario for a program, because it means that every picture directory that should be aggregated holds an identical copy of the program. But if we can make some more assumptions valid, it is a viable way to empower the users:

  • The tool must run on all target machines without additional preparation
  • The tool must only consist of one executable file, no DLLs or configuration files
  • The tool must be small in size, like one megabyte at most

We confirmed with a quick and dirty spike (an embarrassingly inchoate prototype) that we could produce a program that conforms to all three new assumptions/requirements. The only remaining problem was the very first assumption: No hard drive was fast enough to provide the pixel data of thousands of pictures in less than a second. Even if we could aggregate the pixels fast enough (given enough cores, this would be possible), we couldn’t get hold of them fast enough. We needed some kind of progress bar.

Use your information channels

We thought about the information channels our tool would have towards the user. Let’s repeat the scenario: The user navigates to the directory containing the pictures that should be aggregated, copies the executable program into it and double-clicks to start the tool. There are many possibilities to inform the user about progress:

  • Audio (Sound): We can play a little tune or some sound that changes frequency to indicate progress. This is highly unusual and we can’t be sure that the speakers aren’t muted (usage on a notebook was part of the domain analysis results). No sounds, that is.
  • Animation (Graphics): In the most boring case, this would be a little window with a progress bar that runs from left to right and disappears when the work is done. Possible, but boring. Perhaps we can think of something more in tune with the rest of the usage scenario.
  • Text: Well, this was the idea. We produce a result file and give it a name. We don’t need to keep the name static as long as we are working and things change inside the file anyway. We just update the file name to reflect our progress.

So our tool creates a new result file in the picture directory that is named something like result_0_percent, runs up to result_100_percent and then gets renamed to result_timestamp with the current timestamp. You can just watch your file explorer to keep up with the tool’s progress. This is a bit unusual at first, but all pilot users grasped the concept immediately and were pleased with it.
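A sketch of the “file name as progress bar” idea in code (the aggregation itself is left out, and the file names are made up along the scheme described above):

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;

// Sketch only: report progress by renaming the result file in the picture directory.
public class ResultFileProgress {

  public static void main(String[] args) throws IOException {
    Path directory = Paths.get(".");
    Path result = Files.createFile(directory.resolve("result_0_percent"));

    for (int percent = 10; percent <= 100; percent += 10) {
      // ... aggregate the next batch of pictures into the result file ...
      result = Files.move(result, directory.resolve("result_" + percent + "_percent"));
    }

    String timestamp = LocalDateTime.now().format(DateTimeFormatter.ofPattern("yyyyMMdd-HHmmss"));
    Files.move(result, directory.resolve("result_" + timestamp));
  }
}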

The result

And this is the story of how we developed a highly specialized tool within a very small budget, without any graphical or otherwise traditional user interface. The user brings the tool to the data (by copying it into the same directory) and lets it perform its work by simply starting it. The tool reports its progress back via the result file name. As soon as the result file contains a timestamp (and the notebook fans cease to go berserk), the user can feed it into the next tool in the tool chain, probably a picture viewer or a printer driver. The users loved the tool for its speed and simplicity.

One funny side-note remains to be told: Because thousands of pictures aggregated into one produces a picture with a lot of details, the result file was not too big (about 20-30 megabytes), but could take out any printer for several hours if printed. The tool got informally renamed to “printer-reaper.exe”.

Containers allot responsibilities anew

Earlier this year, we experienced a strange bug with our invoices. We often add time tables of our work to the invoices and generate them from our time tracking tool. Suddenly, from one invoice to the next, the dates were wrong. Instead of Monday, the entry was listed as Sunday. Every day was shifted one day “to the left”. But we hadn’t released a new version of any of the participating tools for quite some time.

What we had done since the last invoice generation, though, was to dockerize the invoice generation tool. We deployed the same version of the tool into a Docker container instead of its own virtual machine. This reduced the footprint of the tool and lowered our machine count, which is a strategic goal of our administrators.

By dockerizing the tool, we also unknowingly decoupled the timezone setting of the container and tool from the timezone setting of the host machine. The host machine is set to the correct timezone, but the Docker container was set to UTC, one hour behind the local timezone. This meant that the time table generation tool didn’t land at midnight of the correct day, but at 23:00 of the day before. Side note: If the granularity of your domain data is “days”, it is not advisable to use 00:00 as the reference time for your technical data. Use something like 12:00 or adjust your technical data to match the domain and remove the time aspect from your dates.
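The effect can be reproduced with a few lines of Java (the concrete date and the Europe/Berlin timezone are just examples): midnight in the local timezone is still the evening of the previous day in UTC.

import java.time.LocalDate;
import java.time.ZoneId;
import java.time.ZoneOffset;
import java.time.ZonedDateTime;

// Sketch only: "midnight of Monday" in the local timezone is "Sunday evening" in a UTC container.
public class TimezoneShift {
  public static void main(String[] args) {
    ZonedDateTime mondayMidnight =
        LocalDate.of(2019, 3, 4).atStartOfDay(ZoneId.of("Europe/Berlin"));

    System.out.println(mondayMidnight.getDayOfWeek());                                     // MONDAY
    System.out.println(mondayMidnight.withZoneSameInstant(ZoneOffset.UTC).getDayOfWeek()); // SUNDAY
  }
}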

We needed to adjust the timezone of the Docker container by installing the tzdata package and editing some configuration files. This was no big deal once we knew where the bug originated. But it shows perfectly that Docker (as a representative of container technology) rearranges the responsibilities of developers and operators/administrators and partitions them in a clear-cut way. Before the dockerization, the timezone information was provided by the host and maintained by the administrator. Afterwards, it is provided by the container and therefore maintained by the developers. If containers are immutable service units, their creators need to accommodate all the operation parameters that were part of the “environment” beforehand. And the environment is provided by the operators.

So we see one thing clearly: Docker and container technology per se partitions the responsibilities between developers and operators in a new way, but with a clear distinction: Everything is developer responsibility as long as the operators provide ports and volumes (network and persistent storage). Volume backup remains the responsibility of operations, but formatting and upgrading the volume’s content is a developer task all of a sudden. In a containerized world, the operators don’t know you are using a NoSQL database and they really don’t care anymore. It’s just one container more in the zoo.

I like this new partitioning of responsibilities. It assigns them for technical reasons, so you don’t have to find an answer in each organization anew. It hides a lot of detail from the operators who can concentrate on their core responsibilities. Developers don’t need to ask lots of questions about their target environment, they can define and deliver their target environment themselves. This reduces friction between the two parties, even if developers are now burdened with more decisions.

In my example from the beginning, the classic way of communication would have been that the developers ask the administrator/operator to fix the timezone on the production system because they have it right on all their developer machines. The new way of communication is that the timezone settings are developer responsibility and now the operator asks the developers to fix it in their container creation process. And, by the way, every developer could have seen the bug during development because the developer environment matches the production environment by definition.

This new partition reduces the gray area between the two responsibility zones of developers and operators and makes communication and coordination between them easier. And that is the most positive aspect of container technology in my eyes.

Think of your code as a maintenance minefield

Most of the cost, effort and time of a software project is spent on the maintenance phase, the modification of a software product after delivery. If you think about all these resources as “negative investments” or debt settlement and try to associate your spendings with specific code areas or even single lines of code, you’ll probably find that the maintenance cost per line is not equally distributed. There are lots of lines of code that stand the test of time without any maintenance work at all, a fair amount of lines that require moderate attention and some lines that seem to require constant and excessive developer care.

If you transfer this image to another metaphor, your code presents itself like a minefield for maintenance effort: Most of the area is harmless and safe to travel. But there are some positions that will just blow up once touched. The difference is that as a software developer, you don’t tread on the minefield, but you catch the flak if something happens.

You should try to deliver your code free of maintenance mines.

Spotting a maintenance mine

Identifying a line of code as a maintenance mine after the fact is easy. You probably already recognize the familiar code as “troublesome” because you’ve spent hours trying to understand and fix it. The commit history of your version control system can show you the “hottest” lines in your code – the areas that were modified most often. If you add tests for each new bug, you’ll find that the code is probably tested really well, with tests motivated by different bug issues. In hindsight, you can clearly distinguish low-effort code from high maintenance code.

But before delivery, all code looks the same. Or does it?

An example of a maintenance mine

Let’s look at an example. Our system monitors critical business data and sends out alerts if certain conditions are met. One implementation of the part sending the alerts is a simple e-mail sender. The code is given here:


import java.io.IOException;

public class SendEmailService {

  public void sendTo(Person person, String subject, String body)
      throws IOException, InterruptedException {
    execCmd(buildCmd(person.email(), subject, body));
  }

  private String buildCmd(String recipientMailAddress, String subject, String body) {
    return "/usr/bin/mutt -t " + recipientMailAddress + " -u " + subject + " -m " + body;
  }

  private int execCmd(String command) throws IOException, InterruptedException {
    // starts the command and waits for its termination, returning the exit value
    return Runtime.getRuntime().exec(command).waitFor();
  }
}

This code has two interesting problems:

  • The first problem is that it is written in Java, a platform-agnostic programming language, but depends on being run on a Linux (or sufficiently similar Unix-like) operating system. The system it runs on needs to supply the /usr/bin/mutt program and have the e-mail sending settings properly configured, or else every attempt to run the send command will result in an error. This implicit dependency on the configuration of the production system isn’t the best way to deal with the situation, but it’s probably a one-time pain. The problem clearly presents itself and once the system is set up in the right way, it is gone (until somebody tampers with the settings again). And my impression is that this code separates two concerns between development and operations rather nicely: Development provides software that can send specific e-mails if operations provides a system that is capable of sending e-mails. No need to configure the system for e-mail sending and do it again for the software on said system.
  • The second problem looks like a maintenance mine. In the line where the code passes the command line to the operating system (by calling Runtime.getRuntime().exec()), a Process object is returned that is only asked to waitFor() its termination and report the exit value. The line looks straight and to the point. No need to store and handle intermediate objects if you aren’t interested in them. But perhaps you should care:

By default, the created process does not have its own terminal or console. All its standard I/O (i.e. stdin, stdout, stderr) operations will be redirected to the parent process, where they can be accessed via the streams obtained using the methods getOutputStream(), getInputStream(), and getErrorStream(). The parent process uses these streams to feed input to and get output from the process. Because some native platforms only provide limited buffer size for standard input and output streams, failure to promptly write the input stream or read the output stream of the process may cause the process to block, or even deadlock.

Emphasis mine; see also: https://docs.oracle.com/en/java/javase/11/docs/api/java.base/java/lang/Process.html

This means that the Process object’s stdout and stderr outputs are stored in buffers of unknown (and system-dependent) size. If one of these buffers fills up, the execution of the command just stops, as if somebody had paused it indefinitely. So, depending on how talkative the called command is, your alert e-mail will not be sent, your system will appear to have failed to recognize the condition and you’ll never see a stack trace or error exit value. All other e-mails (with less chatter) will go through just fine. This is a guaranteed source of frantic telephone calls, headaches and lost trust in your system and your ability to resolve issues.

And all the problems originate from one line of code. This is a maintenance mine with a stdout fuse.

The fix for this line might lie in the use of the ProcessBuilder class or your own utility code to drain the buffers. But how would you discover the mine before you deliver it?
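Before we get to that question, here is a sketch of one possible shape of such a fix: ProcessBuilder takes the command as separate arguments instead of one long string and lets you redirect the child’s output, so its buffers can never fill up and block the process. The class and method names below are made up for illustration.

import java.io.IOException;

// Sketch only: run an external command without the risk of blocking on full output buffers.
public class SafeCommandRunner {

  static int execCmd(String... command) throws IOException, InterruptedException {
    ProcessBuilder builder = new ProcessBuilder(command);
    builder.redirectErrorStream(true);                       // merge stderr into stdout
    builder.redirectOutput(ProcessBuilder.Redirect.INHERIT); // forward output to the parent process (or log it to a file)
    return builder.start().waitFor();
  }

  public static void main(String[] args) throws Exception {
    // Harmless usage example; in the service above, the mutt command and its
    // arguments would be passed as separate strings.
    System.out.println("exit value: " + execCmd("ls", "-l"));
  }
}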

Mines often lie at borders

One thing that stands out in this line of code is that it passes control to the “outside”. It acts as a transit point to the underlying operating system and therefore has a lot of baggage to check. There are no safety checks implemented, so the transit must be regarded as unsafe. If you look out for transit points in your code (like passing control to the file system, the network, a database or another external system), make sure you’ve read the instructions and requirements thoroughly. The problems of a maintenance mine aren’t apparent in your code and only manifest themselves during the interaction with the external system. And this is a situation that happens disproportionately often in production and comparably seldom during development.

So, think of your code as a maintenance minefield and be careful around its borders.

What is your minesweeper story? Drop us a comment.

Book review: Developer Hegemony

At the end of 2018, I searched for new software development books to read and came across a list that piqued my interest. My impulse was to buy and read all five books. I’ve bought them all, but have only read four of them yet. You can read another book review from this list in a previous blog post.

The book I was most sceptical about came with a black cover and the menacing title “Developer Hegemony” by Erik Dietrich. It’s not a book about software development, it is a book about the industry of software development and why it is fundamentally different from “traditional” industries. And it is a book that promises an outlook on “the future of labor”, at least for us developers. Spoiler: It’s not about taking over the world, as the cover image suggests. It’s about finding your way in an industry that is in very high demand and mostly consists of players that play by the rules of an entirely different game: industrial manufacturing.

Let’s have a short overview of the content: My impression of the book was that it consists of three parts, even if there are five distinct parts in the table of contents:

  • The first part takes a good hard look at the current situation and identifies the losers and “winners” of the game. It introduces a taxonomy of industry employees that all give up on something in exchange for some personal gain. What that something is depends on the worker type. This is like setting up a chess board. You get to know the pieces and their characteristics and realize that they are mostly pawns – cannon fodder.
  • The second part puts the taxonomy in action and describes the carnage unfolding on the chess board. The grim message is that the only winning move is to own the board itself and make up the rules, but never participate in the game. And if you find yourself on the board, keep moving sideways like the bishop and change your color often. Don’t associate with any team and don’t engage in any stalemate situations. The author describes the “delivery game” very vividly: If you are responsible for delivering something, you might succeed, but you can also fail. If you are only responsible for advising on a delivery, you can attach yourself to success and detach from failure more easily. Be the bishop and evade delivery responsibility with an elegant sidestep. This part is especially gruesome because it describes in detail how technical expertise in software development is a recipe for remaining at the center of the delivery game. It makes every passionate developer’s heart ache.
  • The third and last part shows an alternative to the “own the board or be a pawn” dichotomy. Emotionally, it rescues the developer enthusiast. The message is soothing: You can continue to develop software, but you have to step up and own the results of your development, too. This means effectively being self-employed and acting like a business entity. Yes, I’ve said it wrong: I meant being self-employed and being a business entity. You can probably count on being in high demand for years to come, so the step from “developer” to “entrepreneur” is not as big as you might fear. And you don’t need to be strictly alone: Find partners and associate with them. But don’t stop being your own business entity. Don’t shed your self-confidence: You are the world’s most sought-after expert.

This book left me speechless. I founded my company nearly twenty years ago without Erik Dietrich’s experience, just based on beliefs that I couldn’t even articulate. And now he has spelled them out for me in detail. Don’t get me wrong: My company is different from Erik Dietrich’s ideal of an “efficiencer company” in many details, but at the root of the matter, this book describes my strategic business alignment and my reasons for it perfectly.

But even setting aside my own affection for the topic, the book provides a crystal-clear view of the software industry and a lot of food for thought. Even if you don’t plan to leave your corporate job anytime soon, you should at least be clear about the mechanics of the game you participate in. The rules might differ from company to company, but the mechanics stay the same.

Do yourself a favor and read this book. You don’t have to change anything about your job situation, but you are invited to think about your stance and position in this industry. For my last sentence in this blog entry, let me give away one main difference between our industry and others: If all you need to develop first-class software is a decent notebook and some coffee, why are you still depending on your employer to provide both for you in exchange for all the surplus of your work? That’s the world’s most expensive coffee.