How to Lay a Minefield

… in Minesweeper, of course.

One of the basic building blocks of my hands-on software development workshops is the Coding Kata. Minesweeper, the classic game that every Microsoft Windows user has known since 1992, is one of the “medium everything” Katas: medium scope, medium complexity, medium demand. One advantage is that virtually nobody needs any more explanation than “program a Minesweeper”.

The first milestone when programming a Minesweeper game is to lay the mines on the field in a random fashion. And to my continual surprise, this seems to be a hard task for most participants. Not hard in the sense of giving up, but hard in the sense that the solution lacks suitability or even correctness.

So in order to have a reference point I can use in future workshops and to discuss the usual approaches, here is how you lay a minefield in an adequate way.

The task is to fill a grid of tiles or cells with a predefined number of mines. Let’s say you have a grid of ten rows and ten columns and want to place ten mines randomly.

Spoiler alert: If you want to think about this problem on your own, don’t read any further and develop your solution first!

Task analysis

If you pause for a moment and think about the ingredients that you need for the task, you’ll find three things:

  • A die (in the form of a random number generator)
  • A number of mines to place (maybe just represented by a counter variable)
  • A number of tiles (stored in a data structure)

Each solution I’ll present uses one of these things as the primary focus of interest. My language for this effect is that the solution “takes the thing into the hands”, while the other two things “lie on the table”. That’s because if you simulate how the solution works with real objects like paper and dice, you’ll really have one thing in your hands and the others on the table most of the time.

Solution #1: The probability approach

One way to think about the task of placing 10 mines somewhere on 100 tiles is to calculate the probability of a mine being on a tile and just roll the dice for every tile.

Here’s some code that shows what this approach might look like in Java:

private static final int fieldWidth = 10;
private static final int fieldHeigth = 10;

public static Set<Point> placeMinesOn(Set<Point> field) {
	Random rng = new Random();
	final double probability = 0.1D;
	for (int column = 0; column < fieldWidth; column++) {
		for (int row = 0; row < fieldHeigth; row++) {
			if (rng.nextDouble() < probability) {
				field.add(new Point(row, column));
			}
		}
	}
	return field;
}

Before we discuss the effects of this code, let’s have a little talk about the data structure that I use for the tiles:

The most common approach to model a two-dimensional grid of tiles in software is to use a two-dimensional array. There is nothing wrong with it, it’s just not the most practical approach. In reality, it is really cumbersome compared to its alternatives (yes, there are several). My approach is to separate the aspect of “two-dimensionalness” from my data structure. That’s why I use a set of points. The set (like a HashSet) is a one-dimensional data structure that more or less can only say “yes, I know this point” or “no, I never heard of this point”. To determine if a certain point (or tile at this coordinate) contains a mine, it just needs to be included in the set. An empty set represents an empty field. If you remove “cleared” mines from the set, its size is the number of remaining mines. With a two-dimensional array, you probably write several loop-in-loop structures, one of them just to count non-cleared mines.
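To make the set-of-points idea a bit more tangible, here is a minimal sketch of a typical Minesweeper query on such a set: counting the mines around a tile. It assumes that Point is java.awt.Point (the code in this post doesn’t say which Point type it uses), and the class and method names are only illustrative.

```java
import java.awt.Point;
import java.util.HashSet;
import java.util.Set;

class MineSet {

    // Counts the mines on the (up to eight) tiles around the given tile.
    // No bounds checks are needed: a point outside of the grid is simply
    // never contained in the set.
    static int adjacentMines(Set<Point> mines, Point tile) {
        int count = 0;
        for (int dx = -1; dx <= 1; dx++) {
            for (int dy = -1; dy <= 1; dy++) {
                if (dx == 0 && dy == 0) {
                    continue; // the tile itself is not its own neighbour
                }
                if (mines.contains(new Point(tile.x + dx, tile.y + dy))) {
                    count++;
                }
            }
        }
        return count;
    }

    public static void main(String[] args) {
        Set<Point> mines = new HashSet<>();
        mines.add(new Point(0, 0));
        mines.add(new Point(1, 1));
        System.out.println("Remaining mines: " + mines.size());
        System.out.println("Mines around (0,1): " + adjacentMines(mines, new Point(0, 1)));
    }
}
```

Note how the border of the field needs no special treatment here, which is one of the places where a two-dimensional array usually requires extra index checks.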

Ok, now back to the solution. It is the approach that holds the die in the hands and uses it for every tile. The problem with it is that our customer didn’t ask for a probability of 10 percent for a mine, he/she asked for 10 mines. And the code above might place 10 mines, or 9, or 11, or even 14. In fact, the code places somewhere between 0 and 100 mines on the field.
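How often does the probability approach hit the requested number by chance? The mine count follows a binomial distribution, so it can be calculated. The following sketch (class and method names are made up for this illustration) computes the probability of getting exactly k mines on n tiles. For 10 mines on 100 tiles with a probability of 0.1, the result is only about 13 percent.

```java
class MineCountDistribution {

    // Probability of getting exactly k mines when each of n tiles
    // independently receives a mine with probability p (binomial distribution).
    static double probabilityOfExactly(int k, int n, double p) {
        // Builds the binomial coefficient "n choose k" factor by factor.
        double binomialCoefficient = 1.0;
        for (int i = 1; i <= k; i++) {
            binomialCoefficient *= (double) (n - k + i) / i;
        }
        return binomialCoefficient * Math.pow(p, k) * Math.pow(1.0 - p, n - k);
    }

    public static void main(String[] args) {
        System.out.printf("P(exactly 10 mines) = %.4f%n", probabilityOfExactly(10, 100, 0.1));
        System.out.printf("P(no mines at all)  = %.6f%n", probabilityOfExactly(0, 100, 0.1));
    }
}
```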

The one thing this solution has going for it is the guaranteed runtime. We roll the dice 100 times and that’s it.

So we can categorize this solution as follows:

  • Correctness: not guaranteed
  • Runtime: guaranteed

If I were the customer, I would reject the solution because it doesn’t produce the outcome I require. A minesweeper contest based on this code would end in a riot.

Solution #2: Sampling with replacement

If you don’t take up the die, but the mines, and try to dispense them on the field, you implement our second solution. As long as you have mines on hand, you choose a tile at random and place a mine there. The only exception is that you can’t place a mine on top of another mine, so you have to check for the presence of a mine first.

Here’s the code for this solution in Java:

public static Set<Point> placeMinesOn(Set<Point> field) {
	Random rng = new Random();
	int remainingMines = 10;
	while (remainingMines > 0) {
		Point randomTile = new Point(
			rng.nextInt(fieldHeigth),
			rng.nextInt(fieldWidth)
		);
		if (field.contains(randomTile)) {
			continue;
		}
		field.add(randomTile);
		remainingMines--;
	}
	return field;
}

This solution works better than the previous one for the correctness category. There will always be 10 mines on the field once we are finished. The problem is that we can’t guarantee that we are able to place the mines in time before our player gets bored. Taking it to the extreme means that this code might run forever, especially if your random number generator isn’t up to standards.

So, the participants of your minesweeper contest can no longer protest an arbitrary number of mines on their field, but maybe only because they are still waiting for their 10 mines to be dispersed.

  • Correctness: guaranteed
  • Runtime: not guaranteed

This solution will probably work alright in reality. I just don’t see the need to utilize it when there is a clearly superior solution at hand.
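To put a number on “probably alright”: assuming a fair random number generator, the expected number of draws can be calculated. The sketch below (names are illustrative) sums up the average number of attempts per mine. For 10 mines on 100 tiles, it takes about 10.5 draws on average. The picture changes when the field gets crowded: for 99 mines on 100 tiles, the average is already above 400 draws, and there is still no upper limit.

```java
class ExpectedDraws {

    // Expected number of random draws until `mines` distinct tiles were hit
    // on a field of `tiles` tiles. When k mines are already placed, a draw
    // hits a free tile with probability (tiles - k) / tiles, so placing the
    // next mine takes tiles / (tiles - k) draws on average.
    static double expectedDraws(int tiles, int mines) {
        double expected = 0.0;
        for (int placed = 0; placed < mines; placed++) {
            expected += (double) tiles / (tiles - placed);
        }
        return expected;
    }

    public static void main(String[] args) {
        System.out.printf("10 mines on 100 tiles: %.2f draws%n", expectedDraws(100, 10));
        System.out.printf("99 mines on 100 tiles: %.2f draws%n", expectedDraws(100, 99));
    }
}
```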

Solution #3: Sampling without replacement

So far, we picked up the die and the mines. But what if we pick up the tiles? That’s the third solution. In short, we put all tiles in a bag and pick one at random. We place a mine there and don’t put the tile back into the bag. After we’ve picked 10 tiles and placed 10 mines, we have solved the problem.

Here’s code that implements this solution in Java:

public static Set<Point> placeMinesOn(Set<Point> field) {
	List<Point> allTiles = new LinkedList<Point>();
	for (int column = 0; column < fieldWidth; column++) {
		for (int row = 0; row < fieldHeigth; row++) {
			allTiles.add(new Point(row, column));
		}
	}
	
	Collections.shuffle(allTiles);
	
	int mines = 10;
	for (int i = 0; i < mines; i++) {
		Point tile = allTiles.remove(0);
		field.add(tile);
	}
	return field;
}

The cool thing about this solution is that it excels in both correctness and runtime. Maybe we use some additional memory for our bag and some runtime for the shuffling, but both can be predefined to an upper limit.
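If the runtime for shuffling the whole bag is a concern, one possible variation is a partial Fisher-Yates shuffle that stops after the required number of picks. This is not the code from above, just a sketch of the same “bag” idea with exactly one random number per mine. It assumes java.awt.Point, and the method signature is made up for this example.

```java
import java.awt.Point;
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Random;
import java.util.Set;

class PartialShuffle {

    static Set<Point> placeMines(int width, int height, int mines, Random rng) {
        // Put all tiles into the bag.
        List<Point> bag = new ArrayList<>(width * height);
        for (int column = 0; column < width; column++) {
            for (int row = 0; row < height; row++) {
                bag.add(new Point(row, column));
            }
        }
        if (mines > bag.size()) {
            throw new IllegalArgumentException("more mines than tiles");
        }
        Set<Point> field = new HashSet<>();
        for (int i = 0; i < mines; i++) {
            // Everything below index i is already picked. Choose one of the
            // remaining tiles and move it to position i.
            int pick = i + rng.nextInt(bag.size() - i);
            Collections.swap(bag, i, pick);
            field.add(bag.get(i));
        }
        return field;
    }

    public static void main(String[] args) {
        Set<Point> field = placeMines(10, 10, 10, new Random());
        System.out.println("Placed mines: " + field.size());
    }
}
```

The loop runs exactly as many times as there are mines, so both correctness and runtime stay guaranteed.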

Yet, I rarely see this solution in my workshops. It’s only after I challenge their first solution that people come up with it. I’m biased, of course, because I’ve seen too many approaches (successful and failed) and thought about the problem way longer than usual. But you, my reader, are probably an impartial mind on this topic and can give some thoughts in the comments. I would appreciate it!

So, let’s categorize this approach:

  • Correctness: guaranteed
  • Runtime: guaranteed

If I were the customer, this is the solution I would expect. My minesweeper contest would go unchallenged as long as nobody finds a flaw in the shuffle algorithm.

Summary

It is surprisingly hard to solve simple tasks like “distribute N elements on an X*Y grid” in an adequate way. My approach to deconstruct and analyze these tasks is to visualize myself doing them “in reality”. This is how I came up with the “thing in hands” metaphor that helps me to imagine new solutions. I hope this helps you sometime in the future.

How do you lay a minefield and how do you find your solutions? Write a blog post or leave a comment!

Subtle Effects of Real Hardware

One key aspect of my work is writing software that interacts directly with hardware that consists of sensors and actuators. A typical hardware setting is a machine that moves big steel barrels (or “drums”) around.

In order to be able to develop my code without physically sitting right beside the machine, which might include being in a loud, hazardous or plain dangerous environment, my software architecture consists of “hardware components” that can be the real thing or a simulation that acts as real as possible.

I’ve been writing this kind of software for over twenty years now. But regardless of how many simulations of real hardware I’ve written, there is always a catch or at least a surprise with every new hardware.

For this story, we need to imagine a machine that can lift and rotate steel barrels on command. The machine interface consists of several status bits and some command flags. Two status bits are of importance:

  • isMoving: Indicates if the machine is changing positions or standing still.
  • isInPosition: Because the machine’s movement is bounded by physical limit switches, this flag indicates if the machine has triggered a limit switch and stopped.

I wrote the simulation for this machine and developed the application code that performs a series of movements by waiting for the location to be at a limit switch and then issuing the next movement command. Right before the command is sent, the following condition is checked:

boolean commandCanBeSent = !isMoving && isInPosition;

My application worked perfectly with the simulated hardware. But when we switched to the real hardware, the series of movements worked most of the time, but not always. After investigating a lot of possible error sources, we boiled it down to the condition above. The condition evaluated to true most of the time, but resulted in false every time the series of movements got stuck.

Expanding the logging capabilities of the code revealed that in the error cases, the signals showed isInPosition as true and at the same time, isMoving as true, too. This is a peculiar machine state: It is at the limit switch, but still moving around?

The explanation originates from the modularity of the machine. The isInPosition flag is controlled by the physical limit switches. If one of them is in contact with the moving part, the flag evaluates to true. The isMoving flag is controlled by the engine activity. As long as there is substantial engine power consumption, the flag evaluates to true. The crucial aspect of this signal is that a negative engine power consumption (or engine power generation) is still considered a deviation from zero and results in isMoving as true. Which is kind of correct, because in both cases, there will be a translocation.
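Expressed in code, the signal logic described above might look like the following sketch. The real flag is computed inside the machine’s control, so the threshold value and all names here are purely hypothetical.

```java
class MovementSignal {

    // Hypothetical threshold for "substantial" engine power, in watts.
    static final double SUBSTANTIAL_POWER_WATTS = 50.0;

    // The sign of the power value is ignored: consumption (positive) and
    // generation (negative) both count as a deviation from zero.
    static boolean isMoving(double enginePowerWatts) {
        return Math.abs(enginePowerWatts) > SUBSTANTIAL_POWER_WATTS;
    }

    public static void main(String[] args) {
        System.out.println(isMoving(400.0));  // engine drives the load
        System.out.println(isMoving(-120.0)); // load drives the engine
        System.out.println(isMoving(3.0));    // standing still
    }
}
```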

But why is the engine sometimes indicating movement after it was stopped by the limit switch? The answer lies in the mass of the steel barrel. If the machine was tested empty (without a barrel), everything worked fine. But with a heavy barrel, the stopping wasn’t as instant as before. The deceleration of the engine took longer and converted the electrical engine into a generator. The mass of the barrel produced energy in the electrical engine when stopped, and it did so long enough to see the combination isInPosition=true and isMoving=true.

My simulation of an engine with limit switches had not included the mass of the moved object until now. In my simulation, the limit switch stopped the engine instantaneously, without any residual effects.

The bugfix was only a small change: When deciding if the next movement command can be sent, my application now waits for a short duration for isMoving to switch to false when isInPosition is already true.
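A minimal sketch of such a waiting condition could look like this. It is not the actual application code: the signal access through BooleanSupplier, the polling interval and the timeout parameter are assumptions made for the sake of the example.

```java
import java.time.Duration;
import java.util.function.BooleanSupplier;

class StandstillCheck {

    // Returns true if the machine is in position and reports standstill
    // within the given settle timeout.
    static boolean commandCanBeSent(
            BooleanSupplier isMoving,
            BooleanSupplier isInPosition,
            Duration settleTimeout) {
        if (!isInPosition.getAsBoolean()) {
            return false;
        }
        long deadline = System.nanoTime() + settleTimeout.toNanos();
        // In position, but maybe still "moving" because of the residual
        // energy of the load: give the signal some time to settle.
        while (isMoving.getAsBoolean()) {
            if (System.nanoTime() - deadline >= 0) {
                return false;
            }
            try {
                Thread.sleep(10); // hypothetical polling interval
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return false;
            }
        }
        return true;
    }
}
```

The timeout keeps the check from waiting forever if the machine really is still moving.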

This kind of “dirty signal” is prevalent when dealing with real hardware. The dependence on the barrel mass for the effect to show up was a new one for me. Maybe previous PLC programmers at other machines had filtered them out in their control interfaces. Maybe other signals didn’t rely on engine power consumption or ignored negative consumption. Regardless, I will be more careful when simulating signals that indicate moving masses.

Optional polymorphism by delegation

A code design pattern I’ve used a lot in recent times is the “optional-based polymorphism” that looks like a delegation to another type that might not be available. It might be an implementation of the FCoI-principle (Favour Composition over Inheritance).

Let’s look at an example: An application has several different engines that move stuff around. Some engines are based on limit switches. They move until they are stopped by a physical switch. The application can make these engines move from one predefined position to the next, but not anywhere in between. Another type of engine is based on a relative position. You give the engine the new target position and it positions itself there, without any limit switches or predefined positions.

Traditional approach

A typical implementation using inheritance would be a common supertype “Engine” that provides the functionality both engine types exhibit. From there, we would define two subtypes that extend the functionality in their desired way. One subtype would be the “LimitSwitchEngine”, the other one the “PositionableEngine”.

Our client code that wants to use a particular engine has two possibilities: It only requires the common functionality of an engine and can work with the supertype. Or it needs to perform a downcast after checking the actual type of the engine.
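For comparison, the traditional downcast might look like the following sketch. The methods moveToNextSwitch() and moveTo() are hypothetical stand-ins for the specific functionality of each subtype.

```java
class DowncastExample {

    interface Engine {
        boolean isMoving();
    }

    static class LimitSwitchEngine implements Engine {
        public boolean isMoving() {
            return false;
        }

        // Hypothetical operation that only this subtype offers.
        String moveToNextSwitch() {
            return "moving to next limit switch";
        }
    }

    static class PositionableEngine implements Engine {
        public boolean isMoving() {
            return false;
        }

        // Hypothetical operation that only this subtype offers.
        String moveTo(double position) {
            return "moving to " + position;
        }
    }

    // The client checks the actual type and performs a downcast.
    static String drive(Engine engine) {
        if (engine instanceof LimitSwitchEngine) {
            return ((LimitSwitchEngine) engine).moveToNextSwitch();
        }
        if (engine instanceof PositionableEngine) {
            return ((PositionableEngine) engine).moveTo(42.0);
        }
        return "only common functionality available";
    }
}
```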

Cast methods

The optional-based polymorphism guides the client code towards the specific subtype by providing all possibilities in the common interface:

public interface Engine {

	/* Common functionality */
	
	boolean isMoving();
	
	void emergencyStop();
	
	/* optional-based polymorphism */
	
	Optional<LimitSwitchEngine> boundToLimitSwitches();
	
	Optional<PositionableEngine> freelyPositionable();
}

The client code uses the Engine’s interface only as a stepping stone for the specific engine that is required for your use case. If the engine object cannot provide that functionality, you’ll get an empty Optional. Else you retrieve your reference to the specific type and work with it.
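Here is a sketch of how client code might use these methods. The Hoist class and the specific operations moveToNextSwitch() and moveTo() are invented for this example, only the Engine interface follows the listing above.

```java
import java.util.Optional;

class OptionalPolymorphismExample {

    interface LimitSwitchEngine {
        String moveToNextSwitch(); // hypothetical specific operation
    }

    interface PositionableEngine {
        String moveTo(double position); // hypothetical specific operation
    }

    interface Engine {
        boolean isMoving();

        void emergencyStop();

        Optional<LimitSwitchEngine> boundToLimitSwitches();

        Optional<PositionableEngine> freelyPositionable();
    }

    // Hypothetical engine that is bound to limit switches.
    static class Hoist implements Engine, LimitSwitchEngine {
        public boolean isMoving() {
            return false;
        }

        public void emergencyStop() {
            // nothing to stop in this sketch
        }

        public Optional<LimitSwitchEngine> boundToLimitSwitches() {
            return Optional.of(this);
        }

        public Optional<PositionableEngine> freelyPositionable() {
            return Optional.empty();
        }

        public String moveToNextSwitch() {
            return "moving to next limit switch";
        }
    }

    // The client asks for the specific type and never needs a downcast.
    static String drive(Engine engine) {
        return engine.freelyPositionable()
                .map(positionable -> positionable.moveTo(42.0))
                .or(() -> engine.boundToLimitSwitches().map(LimitSwitchEngine::moveToNextSwitch))
                .orElse("only common functionality available");
    }
}
```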

Disadvantages

One disadvantage of this approach is the fact that the supertype is aware of and even dependent on the different subtypes. You limit the scope of your type hierarchy to the types offered in the “entrance interface”. You can still use the traditional downcast way as described in the introduction for all other types, but that separates them into “featured” and “non-featured” subtypes. So this approach will violate the Open/Closed principle by not being open to extension without modification.

Another disadvantage is that your typical navigation in the IDE doesn’t work as well anymore. If you want to know about all the different types of engines in the system, you can’t just look at the type hierarchy of the Engine type anymore. This is because of the first advantage this pattern brings:

Advantages

Not only does this style get rid of the downcast, it frees your type system up in two different dimensions: The LimitSwitchEngine and PositionableEngine don’t need to be subtypes of Engine. They can be totally independent types with no real connection to the Engine. And they can be different instances. Of course, there is no need to use any of these freedoms. You can still inherit PositionableEngine from Engine and implement both types in the same object. But it isn’t mandatory anymore.
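Both freedoms can be illustrated with a small sketch: Here, the LimitSwitchEngine is a standalone type that is held as a separate delegate instance. The constructor and the moveToNextSwitch() method are assumptions for this example.

```java
import java.util.Optional;

class DelegationExample {

    // A standalone type: it neither extends nor implements Engine.
    static final class LimitSwitchEngine {
        String moveToNextSwitch() {
            return "moving to next limit switch";
        }
    }

    static final class Engine {
        // Separate delegate instance, null if this engine has no limit switches.
        private final LimitSwitchEngine limitSwitches;

        Engine(LimitSwitchEngine limitSwitches) {
            this.limitSwitches = limitSwitches;
        }

        Optional<LimitSwitchEngine> boundToLimitSwitches() {
            return Optional.ofNullable(limitSwitches);
        }
    }
}
```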

Another advantage is discoverability. Your typical type hierarchy lookup in the IDE is replaced with code completion lookup. If you get the names right, this pattern feels like writing code on rails, because your code completion proposals will lead you to the correct place.

Your opinion

What is your opinion on this pattern? What would you expect from a code design that provides those “casting” methods? Tell us in the comments!

The Fragmented Sources of Truth for a Software Project

For each software application, there only exists one single, authoritative source of truth: The source code. If something isn’t in the code, it doesn’t exist. This source of truth is so important that we invented version control (or source control) that allows us to:

  • travel backwards in time
  • create alternative realities
  • progress multiple realities concurrently

That’s pretty awesome and something that not many professions can rely on. It is a “hidden superpower” of software development.

But when you look at a software project and not just the application, there is a lot more “truths” or information available than what fits into the source code. Let’s have a look at a few of them:

The ticket system

The ticket system or issue tracker or bug tracker or whatever you call it is a glorified to-do list that tells people what is lacking in the source code.

One view on the ticket system could be that of a health record system. Each ticket represents an ailment that the software application has. If it’s a bug, it is clearly in an undesirable state that needs to be “healed”. If it’s a new feature, the medical metaphor doesn’t fit perfectly, but we can view our development work as some kind of plastic surgery that makes the software more appealing to the customer.

Either way, the ticket system holds episodical wisdom. It explains the state of our application in hundreds or thousands of more or less independent short stories that are worked on in isolation. To gain a complete vision about the software project from the ticket system alone is possible, but cumbersome.

If you think about it, there is a clear connection between a short story (ticket) and one alternative reality in which the application is told about the story.

The wiki

For each project, there is a lot of information that is fluid, but not episodical. The current state of truth is valid until it gets replaced by a newer truth. Attempting to capture this information in tickets would result in an awkward lack of oversight.

Luckily, there is a tool that was invented specifically for this type of information: The wiki or the editable website graph. Your project can claim an area in this graph and fragment the information according to the mental model of the project team. Every time some outdated information is found, it can be updated in place. Every time some information is not found, it can be added in the place where it was anticipated.

The file storage

Every software project that I know has a lot of accompanying documents that are important for the project, but maybe not so much for the actual day-to-day development. Depending on who you ask, these documents may very well constitute “the project” and everything “below” them is merely a necessary technicality.

The nature of a document is that it exists forever once created. There are lots of attempts to bring version control to the document world, but a typical question in this area is: “Is my document still valid?”

Because documents are represented by files (in the digital and the analog world), a file storage is the least we need to manage them. If your file storage entices you to name your documents “_latest”, “_newer”, “_version2” or something like that, you probably want to step up your document versioning game. Document management systems (DMS) might be what you are looking for. For small teams, a central instance of a Nextcloud might already be sufficient.

The derived documentation

If you happen to develop a software product, you need to provide a user manual and additional technical documentation. These documents need to be in eventual synchronisation with your first and central source of truth: The source code. And because your source code adapts, these documents need to adapt constantly, too.

This is the area where our company has the most “room for improvement”. I’m not diving into details here because I know our approaches are not sustainable.

Single source of truth?

The problem with this fragmented approach to capture the whole of a project is that you need to study all the different places and combine the information in your head. And not only you, every team member has to do this.

You can try to combine different sources into one:

If you squint really hard, you might think that a wiki can replace a ticket system, because each ticket can be represented by a graph node and the linking might resemble a grouping mechanism. My uninformed guess would be that this replaces software specialization with the need for human discipline. But maybe it can work and I just don’t know about the proper tooling yet?

One rather obvious integration might be to put the wiki alongside the code. I haven’t seen a good solution for merge conflicts yet, but maybe it is possible somehow.

Putting the file storage into your source repository makes it bigger and unwieldy, but it would be a natural step towards single sourcing – until you want to give your code repository away without revealing your company’s contracts. Suddenly, separate storage areas become important.

The one thing I struggle to integrate into the source repository is the derived documentation. I can think about storing the documents alongside the code and even requiring to update them before a merge request of a feature branch is accepted, but I shudder to think about the inevitable merge conflicts that need to be resolved.

Maybe there is a suitable solution out there that I’m missing? Leave a hint in the comments!

Naming is hard and Java Enums don’t help

This is a short blog post about a bug in my code that stumped me for some moments. I try to tell it in a manner where you can follow the story and try to find the solution before I reveal it. You can also just read along and learn something about Java Enums and my coding style.

A code structure that I use sometimes is the Enum type that implements an interface:

public enum BuiltinTopic implements Topic {

    administration("Administration"),
    userStatistics("User Statistics"),
    ;
	
    private final String denotation;

    private BuiltinTopic(String denotation) {
        this.denotation = denotation;
    }
	
    @Override
    public String denotation() {
        return this.denotation;
    }
}

The Topic interface is nothing special in this example. It serves as a decoupling layer for the (often large) part of client code that doesn’t need to know about any specifics that stem from the Enum type. It helps with writing tests that aren’t coupled to locked-down types like Enums. It is just some lines of code:

public interface Topic {

    String denotation();
}

Right now, everything is fine. The problems started when I discovered that the denotation text is suited for the user interface, but not for the configuration. In order to be used in the configuration section of the application, it must not contain spaces. Ok, so let’s introduce a name concept and derive it from the denotation:

public interface Topic {

    String denotation();
	
    default String name() {
        return Without.spaces(denotation());
    }
}

I’ve chosen a default method in the interface so that all implementations have the same behaviour. The Without.spaces() method does exactly what the name implies.

The new method works well in tests:

@Test
public void name_contains_no_spaces() {
    Topic target = () -> "User Statistics";
    assertEquals(
       "UserStatistics",
       target.name()
    );
}

The perplexing thing was that it didn’t work in production. The names that were used to look up the configuration entries didn’t match the expected ones. The capitalization was wrong!

To illustrate the effect, take a look at the following test:

@Test
public void name_contains_no_spaces() {
    Topic target = BuiltinTopic.userStatistics;
    assertEquals(
        "userStatistics",
        target.name()
    );
}

You can probably spot the difference in the assertion. It is “userStatistics” instead of “UserStatistics”. For a computer, that’s a whole different text. Why does the capitalization of the name change from testing to production?

The answer lies in the initialization of the test’s target variable:

In the first test, I use an ad-hoc subtype of Topic.

In the second test and in production, I use an object of type BuiltinTopic. This object is an instance of an Enum.

In Java, Enum classes are enriched with automatically generated methods like values() and valueOf(), and every Enum instance inherits further methods from java.lang.Enum. One of these inherited methods is name(), a final method that returns the Enum instance’s constant name. Which in my case is “userStatistics”, the same string I expect, minus the correct capitalization of the first character.

If I had named the Enum instance “UserStatistics”, everything would have worked out until somebody changes the name or adds another instance with a slight difference in naming.

If I had named my Enum instance something totally different like “topic2”, it would have been obvious. But in this case, with only the minor deviation, I was compelled to search for problems elsewhere.

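The problem is that the name() method of the Enum takes precedence over my default method, because in Java a method from the class hierarchy always wins over an interface default method with the same signature. But this happens only in the case of real Enum instances.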
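The effect can be reproduced in a few lines. The following sketch condenses the types from this post into one class and replaces the Without.spaces() helper with a plain String.replace() call, which is an assumption about what the helper does.

```java
class EnumNameExample {

    interface Topic {
        String denotation();

        default String name() {
            // Stand-in for Without.spaces(denotation())
            return denotation().replace(" ", "");
        }
    }

    enum BuiltinTopic implements Topic {
        userStatistics("User Statistics");

        private final String denotation;

        BuiltinTopic(String denotation) {
            this.denotation = denotation;
        }

        @Override
        public String denotation() {
            return this.denotation;
        }
    }

    public static void main(String[] args) {
        Topic adHoc = () -> "User Statistics";
        Topic builtin = BuiltinTopic.userStatistics;
        // Uses the default method of the interface.
        System.out.println(adHoc.name());   // prints "UserStatistics"
        // Uses Enum.name(), the default method is never called.
        System.out.println(builtin.name()); // prints "userStatistics"
    }
}
```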

So I thought hard about the name of the name() method and decided that I don’t really want a name(), I want an identifier(). And that made the problem go away:

public interface Topic {

    String denotation();
	
    default String identifier() {
        return Without.spaces(denotation());
    }
}

Because the configuration code only refers to the Topic type, it cannot see the name() method anymore and only uses the identifier() that creates the correct strings.

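I don’t see any (sane) way to prevent the Java Enum from automatically overriding my default methods when the signature matches. Because Enum.name() is declared final, I cannot even override it in my Enum to restore the intended behaviour. So it feels natural to sidestep the problem by changing names.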

Which shows once more that naming is hard. And soft-restricting certain names like Java Enums do doesn’t lighten the burden for the programmer.

My biggest decision as a business owner (yet)

This week, a very fortunate event will take place at our company: We all come together to have a summer party in person. This will be the first time in nearly three and a half years that we all spend time in the same room. It will be the conclusion of a decision that I call the “biggest one” that I had to come to. This is the very shortened story of that decision.

The end of an era

Our company was founded and set up as a place for direct interaction and short communication distances. We favored office workplaces and open space room plans and often visited customers at their location.

In March 2020, this setup appeared to be the exact opposite of what was advised. I remember the week from the 9th to the 13th of March, when every day and every hour, things got worse and more restricted due to the SARS-CoV-2 pandemic. On Friday, the 13th of March 2020, I was in a phone call with an employee that lasted 30 minutes. When we began to speak, one federal state had closed the schools. When we stopped, every school was closed in the whole country.

During the weekend, I tried to approach the situation with plans and lists. A list of endangered projects, a list of endangered customers, a list of endangered employees, a list of critical tasks, a plan to stay ahead of circumstances. I came up with a scheme to assess the risk and derive actions, but spent the whole Sunday talking with my employees just to gather some of the information necessary to base any decision on more than fear and hope. I am very grateful that my employees all picked up the phone and went through my questions with me. It helped me to realize that no matter how fitting the lists, how clever the plan, I wouldn’t be able to process the information with the required speed.

Some employees offered to go on holiday to take moving parts out of the equation, but it was still overwhelming. If you know the feeling in a roller coaster when a certain “feel-good” speed limit is exceeded and real fear takes hold of your heart and head, you can imagine how these days felt for me.

The beginning of a different era

And then, on Monday morning, I knew exactly what to do. The situation necessitated that we change everything at the company at once. We needed to go “virtual”, to retreat into home offices that didn’t exist yet.

Monday, 16th of March was the last day that several people were in our office simultaneously for a long time.

Nothing the company was used to doing worked anymore. We had to buy new hardware, new furniture, new chairs and everything else that was needed in the home offices. We had to examine every business process and partition it into “on site” and “remote” work steps. We had to introduce new means of communication in the company and with our customers. We had to continue with our project work while transforming everything in our professional and our private lives. We had to keep up our spirits while experiencing isolation and uncertainty.

And just like that, we replaced the “pre-covid” company with the “during-covid” company. Nobody could say that it would work. Nobody knew how long it would be required to work. Nobody could anticipate how much it would cost us.

The decision

The only thing I was certain of was that if we needed to change, we would do it wholeheartedly. I was sure that even if the pandemic suddenly disappeared, I didn’t want to look back at that time and think of it as a makeshift solution.

My decision was to embrace the uncertainty and let go of any remnant of a masterplan that I might have left. I “jumped into the fog”.

For me, it felt as if I placed a wager on the existence of the whole company: “I bet we can do what we did for twenty years, but totally different and in a time of crisis. And we can start right now and keep going for an indeterminate period of time”.

The outcome

Since then, a long time has passed. The fog has cleared and we have survived. And not only that. The “gamble” has paid off:

We resumed our project work within two days and steadily improved our situation day by day and week by week. Our revenue went up, our productivity went up, our profits went up. New customers called us, new projects were started. Today, we are in a much better place than before.

But that’s not all: We have established new means of collaboration and communication, regardless of workplace. Every employee has a full-fledged home office with as many monitors as are physically possible, fitting furniture, a good webcam, good audio equipment, a powerful notebook or desktop computer and all the accessories that make the difference between “a workplace” and “my own workplace”. So we are fully equipped for any future isolation event that hopefully never comes.

Making the decision, trusting my employees and providing them with the equipment to master the challenge yielded the best outcome I could have hoped for. The whole experience humbled me: I lost any control over the situation early on and it didn’t really matter. What mattered was to keep innovating, investing and improving. And that is a group effort, not the vision of a single mind.

The future

So, here we are, at the natural end of the story. If this was a movie, the credits would begin to roll when we raise our glasses to celebrate our success. To me, it seems that lots of companies operate like this. “The temporarily embarrassing loss of control of upper management is past, now return to the office and commence the old rituals. And don’t forget to bring in that notebook that we lent you for your kitchen table home office.”

I’ve seen the potential of this transformation way too clearly to go back. There is nothing gained by reverting to the old ways. We will continue as a “hybrid” company with an attractive office and equally attractive home offices. We will continue to find ways to collaborate with each other and our customers that we didn’t think of before. We will continue to spend time, effort and money to improve our work reality. It might cost 15k euros to equip one workplace in the office and 15k euros more to do it again at home, but that money is the best investment I can think of. The return on investment is amazing.

I witnessed it firsthand.

Using the File System as an Interaction Device

In a recent project, my job was to build a scientific data processing pipeline for a new algorithm that wasn’t set in stone yet. Part of my work would be to explore different mathematical formulas interactively with the customer.

My usual approach to projects is a “risk first” strategy. I try to identify the riskiest or most demanding part of the project and deal with it first. This approach essentially resembles the “fail fast” mindset, just that we haven’t failed yet.

In the case of the calculation pipeline, the riskiest part, and at the same time the functionality that matters most to the customer, was the pipeline itself. If we were able to implement a system that can transform the given entry data into the desired results, we would have an end-to-end prototype and the means to explore different mathematical approaches.

The pipeline consists of several steps, each of which can be described as a complex transformation. The first step/transformation takes a proprietary data format file and converts it into a big JSON file. The main effort of this step is a deep physical analysis of the data contained in the proprietary format. This analysis requires a lot of thought, exploration and work, but can be seen as a black box that the data traverses on its way from proprietary format to JSON.

The next step takes the JSON input and extracts the necessary information required by the following step. It is essentially a data reduction operation.

The third step feeds the analyzed, reduced data into the formulas and stores the calculation result.

The fourth step aggregates the calculation results into a daily time series report in a format that can be read by a spreadsheet application. This report is the end product of the pipeline and will be used to make decisions and to rule out certain environmental hazards.

The main difference between this project and virtually every project before it is that I didn’t write any user interface code. The application’s main window is still blank. The whole interaction of the system with other systems that provide the entry data, of the pipeline steps among each other and with the human user is based on files in the file system.

The system periodically checks for the existence of new entry data. If some is found, it is copied into the “inbox” directory of the first step. The first step periodically checks for the existence of files in its inbox and processes them into its “outbox” that conveniently serves as the inbox of the second step. You probably get the idea by now. All the steps in the system, including the upstream data fetching routine, are actors in a file-based actor model. The files serve as messages from one actor to another. The file system with its directory structure is the common communication channel that passes the messages around.

Each processing step is an actor node with input and output storages

One advantage of this approach is that the file system viewer application of the operating system can be used as the (graphical) user interface. By opening the appropriate directories and viewing their content, the user can supervise the operating state of the system. The system can report problems by moving the incoming message not into the step’s “done” directory, but into its “failed” or “problem” directory. If several directories are on display at once, the user can follow a specific piece of data through the pipeline and view the intermediate results. For domain-specific reasons, the actors in this project also have the result directory “omitted” for data that will not be processed any further because some domain rules have determined a cancellation.
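A single step of such a pipeline could be sketched as follows. This is a minimal illustration under assumptions, not the project's actual code: the class name FileActor, the directory names and the text-in, text-out transformation are all hypothetical, and scheduling the periodic check is left to the caller.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.function.UnaryOperator;

/** One pipeline step: reads messages from its inbox and files them by outcome. */
class FileActor {
    private final Path inbox;
    private final Path outbox;
    private final Path done;
    private final Path failed;
    private final UnaryOperator<String> transformation;

    /** The outbox is passed in so it can be the inbox of the next step. */
    FileActor(Path base, Path outbox, UnaryOperator<String> transformation) throws IOException {
        this.inbox = Files.createDirectories(base.resolve("inbox"));
        this.done = Files.createDirectories(base.resolve("done"));
        this.failed = Files.createDirectories(base.resolve("failed"));
        this.outbox = Files.createDirectories(outbox);
        this.transformation = transformation;
    }

    /** Processes every file currently in the inbox; meant to be called periodically. */
    int pollOnce() throws IOException {
        int handled = 0;
        try (DirectoryStream<Path> messages = Files.newDirectoryStream(inbox)) {
            for (Path message : messages) {
                Path target = done;
                try {
                    String result = transformation.apply(Files.readString(message));
                    Files.writeString(outbox.resolve(message.getFileName()), result);
                } catch (RuntimeException | IOException e) {
                    // The problem becomes visible in the file system viewer.
                    target = failed;
                }
                Files.move(message, target.resolve(message.getFileName()),
                        StandardCopyOption.REPLACE_EXISTING);
                handled++;
            }
        }
        return handled;
    }
}
```

Copying a file from “done” back into “inbox” makes the next call to pollOnce() process it again, which is all the recalculation feature needs.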

A user can even manipulate the data’s flow by moving files out of or into a specific directory. Let’s say that we want to calculate a certain amount of data again: we can just copy the files from the “done” directory of the first step into its “inbox” and the system will process them again.

Because the analysis step takes some time while the calculation step is surprisingly fast, we can perform just the calculation again by not moving the initial data files, but the analyzed and reduced entry files for the calculation step. Using this approach, we can try different mathematical formulas by stopping the system, swapping the calculation step with a new version, starting the system again and moving the desired entry files into its inbox.

Using the file system as an interaction device for the user and the system’s parts has many immediate advantages, but some drawbacks, too. One drawback is performance. Using the hard disk for data transfer is one of the slowest ways to bring data from step X to step X+1. If your system is required to have high throughput or low latency, this approach isn’t suitable. My project has a low, predictable throughput and a latency requirement that is measured in minutes or seconds, but not in milliseconds or even nanoseconds. It can spend some time in the file system, because the first step alone takes several seconds for each file.

Another drawback is a certain fragility of the communication medium, the file system. You have to account for concurrent reads, writes or even deletes. The target platform of my system (Microsoft Windows) exhibits signs of exhaustion if the number of files in one directory grows too large. This means that your file selection, already a costly operation, becomes more costly if the system is put under pressure. If your throughput is usually steady, which is the case in my project, this won’t be a problem. Until you manually copy 100k files into an inbox for swift recalculation and discover that the file copy process alone takes several minutes.
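One common precaution against a reader picking up a half-written file is to write the message somewhere else first and then move it into the inbox in one step. The following is a sketch of that general technique, not necessarily what my system does; the class name InboxWriter and the “.staging” directory are assumptions made for illustration.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

class InboxWriter {
    /** Delivers content so that a polling reader never sees a partially written file. */
    static Path deliver(Path inbox, String fileName, String content) throws IOException {
        Files.createDirectories(inbox);
        // Stage next to the inbox so that the final move stays on the same volume.
        Path staging = Files.createDirectories(
                inbox.resolveSibling(inbox.getFileName() + ".staging"));
        Path partial = staging.resolve(fileName);
        Files.writeString(partial, content);
        // ATOMIC_MOVE throws AtomicMoveNotSupportedException if the platform
        // cannot move the file atomically, instead of silently degrading.
        return Files.move(partial, inbox.resolve(fileName), StandardCopyOption.ATOMIC_MOVE);
    }
}
```

Whether an atomic move replaces an existing file of the same name depends on the platform, so unique file names are the safer choice.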

Of course, the system cannot operate without a graphical user interface forever. But some basic interactions with the system will probably just result in some files being copied from one directory to another one in the background.

Useful background metrics: Distance to Disaster

This blog post would not have happened without my wife, who, upon learning that I use this metric in my everyday life, urged me to write about it.

I often categorize events that happen in my life. Due to my nature, I analyze detrimental events more thoroughly than things that “worked as intended”. One tool for my analysis is a measurement that I call “Distance to Disaster” (DtD). It indicates the “distance” or “bad faith work” or “bad decisions” that need to be invested in order for disaster to happen. Let me explain:

If we wait for a train, we can stand in the middle of the platform and maximize the physical distance to the tracks before and behind us. Or we can stand right at the edge and minimize the physical distance to one track. If the track we chose for our position is the one where our train will arrive, we have a very low distance to disaster. We can lose our balance and fall onto the tracks. We can misjudge the physical dimensions of the train and get hit by something. In short: Nobody wants to wait for a train with a minimized (physical) distance to disaster.

Another measurement unit for the metric is “bad faith work”. Let’s assume you want to steal my most prized possession. That would be a disaster for me. You need to gain access to my home (step 1), then open the safe (step 2) and then find the key to the safe deposit box at my bank (no-brainer, not a step on its own). Afterwards, you need to gain access to the bank room before I notice my loss (step 3) and open the box that has a two-lock system (step 4). It is probably easier to come up with a plan to circumvent some steps and attack the bank directly. If you just succeeded with step 1, my most prized possession is probably still very secure because a DtD of 3 is rather high.

And then, there are “bad decisions”. Let’s say you write code and accidentally hit “load” instead of “save”. If you are me in the early nineties, you just overwrote your code with an empty file. I still remember that day and it didn’t help that “save” was bound to F5 and “load” to F6. One bad decision led to disaster.

Now imagine you still use the same shitty IDE (it was the GWBasic editor), but with modern version control. You commit early and often. You accidentally hit “load” instead of “save” and lose your last few minutes of work. Sad, but not a disaster. Even if you delete the whole file, you can restore your last commit as often as you want. Using version control adds +1 to your “bad decision distance” to disaster.

You probably understand the concept by now. You can specify what a “disaster” is and then measure your current distance to it by trying to come up with the fewest steps that lead to it.
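If you like to think in code, “the fewest steps that lead to it” is a shortest-path question. The following sketch is only one possible reading of the metric: it assumes that situations can be named with strings and that you can list which situations are one bad step away from each other. The class name, the method name and the example situations are made up for illustration.

```java
import java.util.ArrayDeque;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Queue;

class DistanceToDisaster {
    /** Returns the fewest steps from start to disaster, or -1 if it is unreachable. */
    static int measure(Map<String, List<String>> oneStepAway, String start, String disaster) {
        Map<String, Integer> distance = new HashMap<>();
        Queue<String> frontier = new ArrayDeque<>();
        distance.put(start, 0);
        frontier.add(start);
        // Breadth-first search visits situations in order of increasing step count.
        while (!frontier.isEmpty()) {
            String current = frontier.remove();
            if (current.equals(disaster)) {
                return distance.get(current);
            }
            for (String next : oneStepAway.getOrDefault(current, List.of())) {
                if (distance.putIfAbsent(next, distance.get(current) + 1) == null) {
                    frontier.add(next);
                }
            }
        }
        return -1;
    }
}
```

For the editor story, a map with the single entry “editing” leading to “work lost” gives a distance of 1. With version control, “editing” only leads to “file overwritten”, and it takes another bad step from there to “work lost”, so the distance is 2.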

In our normal everyday life, we are surprisingly often only one step away from disaster, but it never happens. That’s a reassuring reality, but shouldn’t keep us from thinking about how to increase the step count without much effort.

One typical implementation of this approach is a modest backup strategy for all data that you intend to keep. Another one is to have spare parts for crucial devices in stock (the “hardware backup”).

Don’t get me wrong: It’s not about maximizing the DtD. It’s about recognizing the cheap and easy opportunity to add one more step to the distance.

And it’s not about “disaster” in the meaning of life-altering, stop-the-world events. A “disaster” can be everything you don’t want to happen. Try to bring a reasonable distance between you and this thing if possible.

Now that you know about the concept, can you find examples of cheap and easy DtD improvements in software development? Let us know in the comments!

Addendum for my co-workers: Our ETOD metric is the DtD metric applied to financial resources.

And another addendum: I find a lot of similarities to the field and mindset of accident prevention. For example, airplane cockpits are designed in a way that dangerous actions require the actuation of two control elements like switches or buttons that are located on different sides of the room. Making it two buttons instead of one adds “bad decision” distance. Placing the buttons in different directions adds “intent distance”.

In software user interaction designs, we try to replicate the second button with a confirmation dialog (“Are you sure?”). It adds to the “bad decision” distance but often lacks in the “intent distance” dimension. I don’t want to be responsible for cumbersome “maximized mouse distance” dialogs, though.

What else can we do?

A common code structure to implement a decision is the if-statement, or in its complete form, the if-else-statement:
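As a minimal illustration in Java (the shipping rule and all names are hypothetical and only serve as an example):

```java
class Shipping {
    /** Hypothetical rule: orders of 50 euros or more ship for free. */
    static int costInCents(int orderTotalInCents) {
        int cost;
        if (orderTotalInCents >= 5000) {
            // runs only when the condition holds
            cost = 0;
        } else {
            // runs only when the condition does not hold
            cost = 490;
        }
        return cost;
    }
}
```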

By using the explicit if-else-statement, you essentially partition a part of your code into two “execution lanes” that are mutually exclusive. Instead of writing them one below the other, we could, if our code editors supported it, write them side by side:

There are some graphical code editors that tried this tabular approach. It certainly looks unfamiliar to the eye trained on the first notation, but it makes one thing clear: The code flow will go through only one of the columns, not both.

Dependence on explicit conditionals

Using the if-else-statement became so second-nature to most developers that they acted confused and helpless when presented with a simple restriction:

“Don’t use the else keyword”

Jeff Bay, Object Calisthenics, 2008

The restriction is imposed as the second of nine rules from the object calisthenics by Jeff Bay. In the explanation of the rule, he stated that the rule should act as a first step towards implicit conditional statements. Paraphrased: There are 99 ways to express an else statement without using the keyword, but the average developer knows none of them.

In my opinion, the rule is merely the warm-up phase to a bigger challenge, as stated by the “anti-if campaign”: To get rid of if-statements (and else-statements for that matter) in all contexts where alternatives prove more effective.

In order to decide when not to use if-statements, we should learn about the alternatives. There are plenty to choose from! (refer to slide #4)
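To give an impression, here are three widely known alternatives sketched in Java. They are not taken from the slides, and all names are invented for the example: a conditional expression, a lookup table with a default value, and polymorphism.

```java
import java.util.Map;

class WithoutElse {
    /** Conditional expression: both lanes in one expression, no else keyword. */
    static String parity(int number) {
        return number % 2 == 0 ? "even" : "odd";
    }

    /** Lookup table: the decision becomes data. */
    private static final Map<String, Integer> LEGS = Map.of("cow", 4, "duck", 2);

    static int legsOf(String animal) {
        // The default value plays the role of the else branch.
        return LEGS.getOrDefault(animal, 0);
    }
}

/** Polymorphism: each implementation carries its own lane. */
interface Discount {
    int applyTo(int priceInCents);
}

class NoDiscount implements Discount {
    @Override
    public int applyTo(int priceInCents) {
        return priceInCents;
    }
}

class HalfPrice implements Discount {
    @Override
    public int applyTo(int priceInCents) {
        return priceInCents / 2;
    }
}
```

With the polymorphic variant, the decision is made once, when the Discount object is chosen, and the calling code no longer needs a conditional at all.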

But we should also learn about the if-statement itself. The goal isn’t to abandon it, but to use it when appropriate and then use it to its full potential.

An interesting thought about the “else”

We already know everything about the if and else? I had the opportunity to learn something new not long ago. The hint came from Kevlin Henney in one of his talks (Non-Functional Coding):

The talk is fairly recent and has some traditional “Kevlin parts” in it. The part I highlighted is unusually aggressive for him. The reasoning is sound, but the nearly personal attack towards the audience (to “piss them off”) is uncalled for.

But, the “volume up to 200 %”-style works more often than not and the bit got me thinking. The culprit in question is this code:

According to Kevlin, this style “is just wrong”. Let’s try to find out why.

There is one principle that is mentioned by Kevlin in passing: The “Single Level of Abstraction” principle that states that you should not mix different levels of abstraction in one block of code (the principle talks about methods). It is a foundation for the first rule in the object calisthenics: “Only one level of indentation per method”.

If you look at the if-code and else-code, they operate on the same level of abstraction. Maybe not on the same level of probability, but they deal with the same topic. Elevating one part by eliminating the else-block in favor of an early return means that this part is more important. It also designates the if-code and in fact the whole if-statement to be a guard clause. Guard clauses typically deal with invalid state and don’t complement the desired functionality. They act as gatekeepers and prevent invalid state from entering the method’s main body. As a metaphor: The bouncers in front of a club are like guard clauses. To say that being denied entry by a bouncer is as much fun as being in the club is probably not a widespread opinion.
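The distinction can be sketched in Java. This is my reading of the argument with invented names, not the code shown in the talk:

```java
class Club {
    /** Guard clause: invalid state is rejected before the main body starts. */
    static String admit(String guestName) {
        if (guestName == null || guestName.isBlank()) {
            throw new IllegalArgumentException("guest name required");
        }
        return "Welcome, " + guestName;
    }

    /** Two equal alternatives: the else keeps both on the same level. */
    static String entranceFee(boolean isMember) {
        if (isMember) {
            return "free";
        } else {
            return "10 euros";
        }
    }

    /** Same behavior, but the missing else makes the member case look like a guard. */
    static String entranceFeeWithEarlyReturn(boolean isMember) {
        if (isMember) {
            return "free";
        }
        return "10 euros";
    }
}
```

The two fee methods always return the same result. The difference lies only in what the structure tells the reader about the relative importance of the two cases.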

Unfinished reflection

I still reflect on other clues that are name-dropped by Kevlin, like the stated reduction of refactoring opportunities, but that’s probably because I don’t have enough comparison material.

There is one thing that I haven’t got a proper hold on yet and that’s the term “control state”. My Google kung-fu is not mighty enough to reach past some obscure ASP.NET concepts from ten years ago. I haven’t heard the term in books – at least I don’t remember it.

So here is my call for help: Can you provide some source or explanation about what Kevlin Henney means by “control state”?

And what else do you think about the whole discussion?

The Optional Wildcast

This blog post presents a particular programming technique that I happen to use more often in recent months. It doesn’t state that this technique is superior or more feasible than others. It’s just a story about a different solution to an old programming problem.

Let’s program a class hierarchy for animals, in particular for mammals and birds. You probably know where this leads up to, but let’s start with a common solution.

Both mammals and birds behave like animals, so they are subclasses of it. Birds have the additional behaviour of laying eggs for reproduction. We indicate this feature by implementing the Egglaying interface.

Mammals feed their offspring by giving them milk. There are two mammals in our system, a cow and the platypus. The cow behaves like the typical mammal and gives a lot of milk. The platypus also feeds its young with milk, but only after they have hatched from their eggs. Yes, the platypus is a rare exception in that it is both a mammal and egglaying. We indicate this odd fact by implementing the Egglaying interface, too.
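In Java, this common solution might look like the following sketch. Only the type names and the methods lactate() and breed() come from this post. The return types and the method bodies are placeholders I made up so that the example compiles on its own.

```java
/** Additional behaviour of animals that reproduce by laying eggs. */
interface Egglaying {
    String breed();
}

abstract class Animal {
}

class Bird extends Animal implements Egglaying {
    @Override
    public String breed() {
        return "sits on its eggs";
    }
}

abstract class Mammal extends Animal {
    String lactate() {
        return "gives milk";
    }
}

class Cow extends Mammal {
    @Override
    String lactate() {
        return "gives a lot of milk";
    }
}

/** The exception: a mammal that is also egglaying. */
class Platypus extends Mammal implements Egglaying {
    @Override
    public String breed() {
        return "lays eggs and feeds the hatchlings with milk";
    }
}
```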

If our code wants to access the additional methods of the Egglaying interface, it has to check if the given object implements it and then cast it. I call this type of cast a “wildcast” because it seems to appear out of nowhere when reading the code and seemingly doesn’t lead up or down the typical type hierarchy. Why would a mammal lay eggs?

One of my approaches that I happen to use more often recently is to indicate the existence of a real wildcast with an Optional return type. In theory, you can wildcast from anywhere to anyplace you want. But only some of these jumps have a purpose in the domain. And an explicit casting method is a good way to highlight this purpose:

import java.util.Optional;

public abstract class Mammal {
	// By default, a mammal has no egglaying side to offer.
	public Optional<Egglaying> asEgglaying() {
		return Optional.empty();
	}
}

The “asEgglaying()” method might return an Egglaying object, or it might not. As you can see, by default, it returns only an empty Optional. This means that no cow, horse, cat or dog has to think about laying eggs, they just aren’t into it by default.

public class Platypus extends Mammal implements Egglaying {
	@Override
	public Optional<Egglaying> asEgglaying() {
		return Optional.of(this);
	}
}

The platypus is another story. It is the exception to the rule and knows it. The code “Optional.of(this)” is typical for this coding technique.

A client that iterates over a collection of mammals can now incorporate the special case with more grace:

for (Mammal each : List.of(mammals())) {
	each.lactate();
	each.asEgglaying().ifPresent(Egglaying::breed);
}

Compare this code with a more classic approach using a wildcast:

for (Mammal each : List.of(mammals())) {
	each.lactate();
	if (each instanceof Egglaying) {
		((Egglaying) each).breed();
	}
}

My biggest gripe with the classic approach is that the instanceof is necessary for the functionality, but not guided by the domain model. It comes as a surprise and has no connection to the Mammal type. In the Optional wildcast version, you can look up the callers of “asEgglaying()” and see all the special code that is written for the small number of mammals that lay eggs. In the classic approach, you need to search for conditional casts or distinguish between code for birds and special mammal code when looking up the callers.

In my real-world projects, this “optional wildcast” style facilitates domain discovery by code completion and seems to lead me to more segregated type systems. These impressions are personal and probably biased, so I would like to hear about your experiences or at least opinions in the comments.