Mastering programming like a martial artist

Gichin Funakoshi, who is sometimes called the “father of modern karate”, issued a list of twenty guiding principles for his students, called the “Shōtōkan nijū kun”. While these principles as a whole are directed at karate practitioners, many of them are very useful for other disciplines as well.
In my understanding, lifelong learning is a fundamental pillar of both karate and programming, and many of those principles focus on “learning” as a more fundamental action. I’d like to focus on a particular one now:

Formal stances are for beginners; later, one stands naturally.

While, at first, this seems to focus on stances, the more important concept is progression, and how it relates to formalities.

Shuhari

It is a variation of the concept of Shuhari, the three stages of mastery in martial arts. I think they map rather beautifully to mastery in programming too.

The first stage, shu, is about learning traditions and movements, and how to apply them strictly and faithfully. For programming, this is learning to write your first programs with strict rules, like coding conventions, programming patterns and all the processes needed to release your programs to the world. There is no room for innovation, this stage is about absorbing what knowledge and practices already exist. While learning these rules may seem like a burden, the restrictions are also a gift. Because it is always clear what is right and what is wrong, and decisions are easy.

The second stage, ha, is about breaking away from these rules and traditions. Coding conventions, programming-patterns etc. are still followed. But more and more, exceptions are allowed, and encouraged, when they serve a greater purpose. A hack might no longer seem so bad, when you consider how much time it saved. Technical dept is no longer just avoided, but instead managed. Decisions are a little harder here, but there’s always the established conventions to fall back to.

In the final stage, ri, is about transcendence. Rules lose their inherent meaning to purpose. Coding conventions, best-practices, and patterns can still be observed, but they are seen for what they are: merely tools to achieve a goal. As thus, all conventions are subject to scrutiny here. They can be ignored, changed or even abandoned completely if necessary. This is the stages for true innovation, but also for mastery. To make decisions on this level, a lot of practice and knowledge and a bit of wisdom are certainly required.

How to use this for teaching

When I am teaching programming, I try to find out what stage my student is in, and adapt my style appropriately (although I am not always successful in this).

Are they beginners? Then it is better to teach rigid concepts. Do not leave room for options, do not try to explore alternatives or trade-offs. Instead, take away some of the complexity and propose concrete solutions. Setup rigid guidelines, how to code, how to use the IDEs, how to use tools, how to communicate. Explain exactly how they are to fulfill all their tasks. Taking decisions away will make things a lot easier for them.

Students in the second, or even in the final stage, are much more receptive to these freedoms. While students on the second stage will still need guidance in the form of rules and conventions, those in the final stage will naturally adapt or reject them. It is much more useful to talk about goals and purpose with advanced students.

Building the right software

When we talk about software development a lot of the discussion revolves around programming languages, frameworks and the latest in technology.

While all the above and also the knowledge and skill of the developers certainly matter a great deal regarding the success of a software project the interaction between the involved individual is highly undervalued in my opion. Some weeks ago I watched a great talk connecting air plane crashes and interaction in a software development team. The golden quote for me was certainly this one:

Building software takes technical skill, but building the right software take human interaction and lots of it”

Nickolas Means (“How to crash an airplane”, The Lead Developer UK 2016)

I could not word it better and it matches my personal experience. Many, if not most of the problems in software projects are about human communication, values, feelings and opinions and not technical.

In his talk Nickolas Means focuses on internal team communication and I completely agree with him. My focus as a team lead shifted a lot from technical to fostering diversity, opinions and communication within the team. I am less strict in enforcing certain rules and styles in a project. I think this leads to more freedom and better opportunities for experimentation and exploration of ways to approach a problem.

Extend it to your customer

As we work on projects in different domains with a variety of customers we are really working hard to understand our customers. Building up open, trustworthy and stable communications is key in forming a fruitful and productive collaborative partnership in a (software) project. It will help you to produce a great software that does meet the customers needs instead of just a great software. It may also help you in situation where you mess up or technical problems plague the project.

The aspect of human interaction in software projects has its place rightfully in the agile software development manifesto:

Through this work we have come to value:

Individuals and interactions over processes and tools

The Authors of the Agile Manifesto

Almost 20 years later this is still undervalued and many software developers are still way too much on the technical side. We are striving to steadily improve our skills on the human interaction side and think it proves fruitful everytime we succeed.

I hope that more and more software developers will grasp the value of this shifted view point and that it will increase quality and value of the software solutions provided to all users.

Maybe it will make working in this field friendlier for not so tech-savvy people and allow for more of much needed diversity in tech.

The ALARA principle in software engineering

The ALARA principle originates in radiation protection and means “As Low As Reasonably Achievable”. It means that you have to weigh the purpose of an action dealing with radiation against its disadvantages, like radiation damage or long-term risks. The word “reasonably” means that while some disadvantages are not avoidable, a practicable amount of protection should be in place to lower them. The principle calls for a balancing act: Not without safety measures, but don’t overextend your means by trying to achieve a safety level that isn’t helpful anymore.
To put the ALARA principle in practice with an example: You shouldn’t need a X-ray every time you go to the dentist, but given enough time since the last one and reasonable doubt about a tooth, the X-ray examination will benefit your dental (and overall) health more than if you deny it. It isn’t healthy by itself, but the information gained by it will be used to improve your health.

I learnt about the ALARA principle when my father (a nuclear physicist by heart) explained it to me in context of the current corona pandemic: Use protection like face masks and distance, but don’t stress yourself too much over that one time when you grabbed a pen in the postal office. While preparation and watchfulness is helpful, fear is detrimental to your mental health. And even the most resilient mind has bounded resources that can better be spent on constructive things instead of fear.

A fun fact from radiation protection is that at least three of the four main rules of protection can be applied to corona, too:

  • Distance yourself from the radiation source
  • Use appropriate protection gear
  • Avoid incorporation (keep the thing outside your body)
  • Limit your exposure time (this doesn’t fit as nicely, because the virus is probably not cumulative)

But how can we apply the ALARA principle to software engineering? I was instantly reminded about the “Thorough” rule of unit testing. In the book “Pragmatic Unit Testing” by Andy Hunt and Dave Thomas, the two original Pragmatic Programmers, good unit tests have to follow the ATRIP-rules. The T stands for “Thorough” which is often misinterpreted as “test everything at least twice”. In reality, the rule states that:

  • all mission critical functionality needs to be tested
  • for every occuring bug, there needs to be an additional test that ensures that the bug cannot happen again

The first thing that meets the eye is that the rule doesn’t define a bug as a failure of your testing effort. It takes a bug that probably happened in production and caused some damage as a motivation to strengthen your test coverage in that particular area. The second part of the rule calls for directed, well-aimed testing effort. It is easy to follow because it has a clear trigger: A bug happened, now you have to write a test.

The first part of the rule is more complicated: What is mission critical functionality? And what means “is tested thoroughly” in this context? And here, the ALARA principle can help us. The bug rate in the important parts of your code should be as low as reasonably achievable. “Reasonably achievable” is defined by the resources at your disposal (like time to market), your expertise in testing and the potential damage that could happen if something in your code goes wrong.

If the potential damage is high or even life-threatening, your reasonable effort should be much higher than if the most critical thing that happens is a 15 minute downtime while you restart the server. There are use cases where even 15 minutes mean subsequent damage, but most software is written for a more relaxed context.

I’ve always found the “Thorough” rule of good unit tests pleasant and comforting: If you made reasonable effort to test your most important code and write a test for every bug you or your users encounter, you can say that your bug rate is ALARA – “As Low As Reasonably Achievable”. And that is good enough for most cases.

What was your first thought when you heard about the ALARA principle? Tell us in the comment section!

Math development practices

As a mathematician that recently switched to almost full-time software developing, I often compare the two fields. During the last years of my mathematics career I was in the rather unique position of doing both at once – developing software of some sort and research in pure mathematics. This is due to a quite new mathematical discipline called Homotopy Type Theory, which uses a different foundation as the mathematics you might have learned at a university. While it has been a possibility for quite some time to check formalized mathematics using computers, the usual way to do this entails a crazy amount of work if you want to use it for recent mathematics. By some lucky coincidences this was different for my area of work and I was able to write down my math research notes in the functional language Agda and have them checked for correctness.

As a disclaimer, I should mention that what I mean with “math” in this post, is very far from applied mathematics and very little of the kind of math I talk about is implement in computer algebra systems. So this post is about looking at pure, abstract math, as if it were a software project. Of course, this comparison is a bit off from the start, since there is no compiler for the math written in articles, but it is a common believe, that it should always be possible to translate correct math to a common foundation like the Zermelo Fraenkel Set Theory and that’s at least something we can type check with software (e.g. isabelle).

Refactoring is not well supported in math

In mathematics, you want to refactor what you write from time to time pretty much the same way and for the same reasons as you would while developing software. The problem is, you do not have tools which tell you immediately if your change introduces bugs, like automated tests and compilers checking your types.

Most of the time, this does not cause problems, judging from my experiences with refactoring software, most of the time a refactoring breaks something detected by a test or the compiler, it is just about adjusting some details. And, in fact, I would conjecture that almost all math articles have exactly those kind of errors – which is no problem at all, since the mathematicians reading those articles can fix them or won’t even notice.

As with refactoring in software development, what does matter are the rare cases where it is crucial that some easy to overlook details need to match exactly. And this is a real issue in math – sometimes a statements gets reformulated while proving it and the changes are so subtle that you do not even realize you have to check if what you prove still matches your original problem. The lack of tools that help you to catch those bugs is something that could really help math – but it has to be formalized to have tools like that and that’s not feasible so far for most math.

Retrospectively, being able to refactor my math research was the biggest advantage of having fully formal research notes in Agda. There is no powerful IDE like they are used in mainstream software development, just a good emacs-mode. But being able to make a change and check afterwards if things still compile, was already enough to enable me to do things I would not have done in pen and paper math.

Not being able to refactor might also be the root cause of other problems in math. One wich would be really horrible for a software project, is that sometimes important articles do not contain working versions of the theorems used in some field of study and you essentially need to find some expert in the field to tell you things like that. So in software project, that would mean, you have to find someone who allegedly made the code base run some time in the past by applying lots of patches which are not in the repository and wich he hopefully is still able to find.

The point I wanted to make so far is: In some respects, this comparison looks pretty bad for math and it becomes surprising that it works in spite of these deficiencies. So the remainder of this post is about the things on the upside, that make math check out almost all the time.

Math spent person-centuries on designing its datatypes

This might be exaggerated, but it is probably not that far off. When I started studying math, one of my lecturers said “inventing good definitions is not less important as proving new results”. Today, I could not agree more, immense work went into the definitions in pure math and they allowed me to solve problems I would be too dumb to even think about otherwise. One analogue in programming is finding the ‘correct’ datatypes, which, if achieved can make your algorithms a lot easier. Another analogue is using good libraries.

Math certainly reaps a great benefit from its well-thought-through definitions, but I must also admit, that the comparison is pretty unfair, since pure mathematicians usually take the freedom to chose nice things to reason about. But this is a point to consider when analysing why math still works, even if some of its practices should doom a software project.

I chose to speak about ‘datatypes’ instead of, say ‘interfaces’, since I think that mathematics does not make that much use of polymorhpism like I learned it in school around 2000. Instead, I think, in this respect mathematical practice is more in line with a data-oriented approach (as we saw last week here on this blog), in math, if you want your X to behave like a Y, you usually give a map, that turns your X into a Y, and then you use Y.

All code is reviewed

Obviously like everywhere in science, there is a peer-review processs if you want to publish an article. But there are actually more instances of things that can be called a review of your math research. Possibly surprising to outsiders, mathematicians talk a lot about their ideas to each other and these kind of talks can be even closer to code reviews than the actual peer reviews. This might also be comparable to pair programming. Also, these review processes are used to determine success in math. Or, more to the point, your math only counts if you managed to communicate your ideas successfully and convince your audience that they work.

So having the same processes in software development would mean that you have to explain your code to your customer, which would be a software developer as yourself, and he would pay you for every convincing implementaion idea. While there is a lot of nonsense in that thought, please note that in a world like that, you cannot get payed for a working 300-line block code function that nobody understands. On the other hand, you could get payed for understanding the problem your software is supposed to solve even if your code fails to compile. And in total, the interesting things here for me is, that this shift in incentives and emphasis on practices that force you to understand your code by communicating it to others can save a very large project with some quite bad circumstances.

Data-Oriented Design: Using data as interfaces

A Code Centric World

In main-stream OOP, polymorphism is achieved by virtual functions. To reuse some code, you simply need one implementation of a specific “virtual” interface. Bigger programs are composed by some functions calling other functions calling yet other functions. Virtual functions introduce a flexibility here to that allow parts of the call tree to be replaced, allowing calling functions to be reused by running on different, but homogenuous, callees. This is a very “code centric” view of a program. The data is merely used as context for functions calling each other.

Duality

Let us, for the moment, assume that all the functions and objects that such a program runs on, are pure. They never have any side effects, and communicate solely via parameters to and return values from the function. Now that’s not traditional OOP, and a more functional-programming way of doing things, but it is surely possible to structure (at least large parts of) traditional OOP programs that way. This premise helps understanding how data oriented design is in fact dual to the traditional “code centric” view of a program: Instead of looking at the functions calling each other, we can also look at how the data is being transformed by each step in the program because that is exactly what goes into, and comes out of each function. IS-A becomes “produces/consumes compatible data”.

Cooking without functions

I am using C# in the example, because LINQ, or any nice map/reduce implementation, makes this really staight-forward. But the principle applies to many languages. I have been using the technique in C++, C#, Java and even dBase.
Let’s say we have a recipe of sorts that has a few ingredients encoded in a simple class:

class Ingredient
{
  public string Name { get; set; }
  public decimal Amount { get; set; }
}

We store them in a simple List and have a nice function that can compute the percentage of each ingredient:

public static IReadOnlyList<(string, decimal)> 
    Percentages(IEnumerable<Ingredient> incredients)
{
  var sum = incredients.Sum(x => x.Amount);
    return incredients
      .Select(x => (x.Name, x.Amount / sum))
      .ToList();
}

Now things change, and just to make it difficult, we need a new ingredient type that is just a little more complicated:

class IngredientInfo
{
  public string Name { get; set; }
  /* other useful stuff */
}

class ComplicatedIngredient
{
  public IngredientInfo Info { get; set; }
  public decimal Amount { get; set; }
}

And we definitely want to use the old, simple one, as well. But we need our percentage function to work for recipes that have both Ingredients and also ComplicatedIngredients. Now the go-to OOP approach would be to introduce a common interface that is implemented by both classes, like this:

interface IIngredient
{
  string GetName();
  string GetAmount();
}

That is trivial to implement for both classes, but adds quite a bunch of boilerplate, just about doubling the size of our program. Then we just replace IReadOnlyList<Ingredient> by IReadOnlyList<IIngredient> in the Percentage function. That last bit is just so violating the Open/Closed principle, but just because we did not use the interface right away (Who thought YAGNI was a good idea?). Also, the new interface is quite the opposite of the Tell, don’t ask principle, but there’s no easy way around that because the “Percentage” function only has meaning on a List<> of them.

Cooking with data

But what if we just use data as the interface? In this case, it so happens that we can easiely turn a ComplicatedIngredient into an Ingredient for our purposes. In C#’s LINQ, a simple Select() will do nicely:

var simplified = complicated
  .Select(x => new Ingredient
   { 
     Name = x.Info.Name,
     Amount = x.Amount
   });

Now that can easiely be passed into the Percentages function, without even touching it. Great!

In this case, one object could neatly be converted into the other, which is often not the case in practice. However, there’s often a “common denominator class” that can be found pretty much the same way as extracting a common interface would. Just look at the info you can retrieve from that imaginary interface. In this case, that was the same as the original Ingredients class.

Further thoughts

To apply this, you sometimes have to restructure your programs a little bit, which often means going wide instead of deep. For example, you might have to convert your data to a homogenuous form in a preprocessing step instead of accessing different objects homogenuously directly in your algorithms, or use postprocessing afterwards.
In languages like C++, this can even net you a huge performance win, which is often cited as the greatest thing about data-oriented design. But, first and foremost, I find that this leads to programs that are easier to understand for both machine and people. I have found myself using this data-centric form of code reuse a lot more lately.

Are you using something like this as well or are you still firmly on the override train, and why? Tell me in the comments!

The joy of being a student assistant

Lately I heard a lecture by Daniel Lindner about error codes and why you should avoid using them. I had to smile because it reminded me of my time as a student assistant, when I worked with some people that had a slightly different opinion on that point. Maybe they enjoyed torturing student assistants, but it seems the most likely to me that they just did not know any better. But let‘s start at the beginning.

One day the leader of my research group sent me an excel sheet with patient data and asked me to perform some statistical calculations with the programming language R, that is perfectly suitable for such a task. Therefore I did not expect it to take much time – but it soon turned out that I was terribly wrong. Transforming that excel sheet into something R could work with gave me a really hard time and so I decided to write down some basic rules you should consider when recording such data in the hope that at least some future student assistants won‘t have to deal with the problems I had again.

First I told my program to read in the excel sheet with the patient data, which worked as expected. But when I started to perform some simple operations like calculating the average age of the patients, my computer soon told me things like that:

Warning message: In mean.default(age): argument is not numeric or logical: returning NA

I was a little confused then, because I knew that the age of something or someone is a numeric value. But no matter how often I tried to explain that to my computer: He was absolutly sure that I was wrong. So I had no choice but to have a look at the excel sheet with about 2000 rows and 30 columns. After hours of searching (at least it felt like hours) I found a cell with the following content: Died last week. That is indeed no numeric value, it‘s a comment that was made for humans to read. So here‘s my rule number one:

1. Don‘t use comments in data files

There is one simple reason for that: The computer, who has to work with that data, does not understand it. And (as sad as it is) he does not care about the death of a patient. The only thing he wants to do is to calculate a mean value. And he needs numeric values for it. If you still want to have that comment, just save it somewhere else in another column for comments or in a separate file the computer does not have to deal with when performing statistical calculations.

So I removed that comment (and some others I found) and tried to calculate the mean value again. This time my computer did not complain, the warning message disappeard and for one moment I felt relieved. In the next moment I saw the result of the calculation. The average age of the patients was 459.76. And again I told my computer that this is not possible and again he was sure that I was wrong and again I had to take a look at the excel sheet with the data. Did I mention that the file contained about 2000 rows and 30 columns? However, after a little searching I found a cell with the value -999999. It was immediately clear to me that this was not the real age of the patient, but I wasn‘t able to find out by myself what that value meant. It could have been a typo, however the leader of my research group told me that some people use -999999 as an error code. It could mean something like: „I don‘t know the age of the patient.“ Or: „That patient also died.“ But that was only a guess. So here is my rule number 2:

2. Document your error codes

If there would have been some documentation I would maybe have known what to do with that value. Instead I secretly deleted it, hoping that it was not important to anyone, because unfortunately to my computer -999999 is just a numeric value, not better or worse than any other. So I had to tell him not to use it. But that was only the beginning.

I learned from my previous mistakes and before performing any other statistical calculations I had a look at the whole excel sheet. And it was even more horrible than I expected it to be. If every person who worked with that table would have used the same error code, I could just have written a script that eliminated all -999999 from the sheet and it would have been done. But it seemed that everyone had his own favorite (undocumented) error code. Or if at least there would have been some documentation about the value ranges of each column, I could have told my computer to ignore all values that are not in that range. For something like the age of a patient this is easy, but for other medical data a computer science student like me does not know that can be hard. Is 0 a valid value or does it mean that there is no value? What about 999? So in any case: I had to read the whole table (again: 2000×30 values!) and manually guess for each value if it really was a value or an error code and then tell my computer to ignore it, so he could calculate the right means. I don‘t know exactly how much time that cost, but I‘m sure in the same time I could have read all nine books of The Histories by Herodotus twice, watch every single episode of Gilmore Girls including the four episodes of A Year in the Life and learn Japanese. So finally here‘s my rule number 3 (and the good part about this one is that you can immediately forget about rule number 2):

3. Don‘t use error codes in data files

Really. Don‘t. The student assistants of the future will thank you.

Code duplication is not always evil

Before you start getting mad at me first a disclaimer: I really think you should adhere to the DRY (don’t repeat yourself) principle. But in my opinion the term “code duplication” is too weak and blurry and should be rephrased.

Let me start with a real life story from a few weeks ago that lead to a fruitful discussion with some fellow colleagues and my claims.

The story

We are developing a system using C#/.NET Core for managing network devices like computers, printers, IP cameras and so on in a complex network infrastructure. My colleague was working on a feature to sync these network devices with another system. So his idea was to populate our carefully modelled domain entities using the JSON-data from the other system and compare them with the entities in our system. As this was far from trivial we decided to do a pair-programming session.

We wrote unit tests and fixed one problem after another, refactored the code that was getting messing and happily chugged along. In this process it became more and more apparent that the type system was not helping us and we required quite some special handling like custom IEqualityComparers and the like.

The problem was that certain concepts like AddressPools that we had in our domain model were missing in the other system. Our domain handles subnets whereas the other system talks about ranges. In our system the entities are persistent and have a database id while the other system does not expose ids. And so on…

By using the same domain model for the other system we introduced friction and disabled benefits of C#’s type system and made the code harder to understand: There were several occasions where methods would take two IEnumerables of NetworkedDevices or Subnets and you needed to pay attention which one is from our system and which from the other.

The whole situation reminded me of a blog post I read quite a while ago:

https://www.sandimetz.com/blog/2016/1/20/the-wrong-abstraction

Obviously, we were using the wrong abstraction for the entities we obtained from the other system. We found ourselves somewhere around point 6. in Sandy’s sequence of events. In our effort to reuse existing code and avoid code duplication we went down a costly and unpleasant path.

Illustration by example

If code duplication is on the method level we may often simply extract and delegate like Uncle Bob demonstrates in this article. In our story that would not have been possible. Consider the following model of Price and Discount e-commerce system:

public class Price {
    public final BigDecimal amount;
    public final Currency currency;

    public Price(BigDecimal amount, Currency currency) {
        this.amount = amount;
        this.currency = currency;
    }

    // more methods like add(Price)
}

public class Discount {
    public final BigDecimal amount;
    public final Currency currency;

    public Discount(BigDecimal amount, Currency currency) {
        this.amount = amount;
        this.currency = currency;
    }

    // more methods like add(Discount<span 				data-mce-type="bookmark" 				id="mce_SELREST_start" 				data-mce-style="overflow:hidden;line-height:0" 				style="overflow:hidden;line-height:0" 			></span>)
}

The initial domain entities for price and discount may be implemented in the completely same way but they are completely different abstractions. Depending on your domain it may be ok or not to add two discounts. Discounts could be modelled in a relative fashion like “30 % off” using a base price and so. Coupling them early on by using one entity for different purposes in order to avoid code duplication would be a costly error as you will likely need to disentangle them at some later point.

Another example could be the initial model of a name. In your system Persons, countries and a lot of other things could have a name entity attached which may look identical at first. As you flesh out your domain it becomes apparent that the names are different things really: person names should not be internationalized and sometimes obey certain rules. Country names in contrast may very well be translated.

Modified code duplication claim

Duplicated code is the root of all evil in software design.

— Robert C. Martin

I would like to reduce the temptation of eliminating code duplication for different abstractions by modifying the well known claim of Uncle Bob to be a bit more precise:

Duplicated code for the same abstraction is the root of all evil in software design.

If you introduce coupling of independent concepts by eliminating code duplication you open up a new possibility for errors and maintenance drag. And these new problems tend to be harder to spot and to resolve than real code duplication.

Duplication allows code to evolve independently. I think it is important to add these two concepts to your thinking.

Ignoring YAGNI – 12 years later

Fourteen years ago, we started to build a distributed system to gather environmental data in an automated 24/7 fashion. Our development process was agile and made heavy use of short iterations (at least that was what they were then, today they are normal-sized). So the system grew with many small new features and improvements, giving the customer immediate business value.

One part of the system was the task scheduler. Because the system had to run 24/7 and be mostly independent of human interaction, the task scheduler’s job was to launch different measurement processes at the right time. We had done extensive domain crunching and figured out that all tasks follow a rigid time regime like “start every 10 minutes” or “start every hour”, regardless of the processes’ runtime. This made the scheduler rather easy to develop. You should keep it simple, after all.

But another result of the domain crunching bothered us: The schedule of all tasks originated from the previous software system, built 30 years ago and definitely unfit for the modern software world. The schedules weren’t really rooted in the domain, they all had technical explanations like “the recording of the values is done sequentially and takes up to 8 minutes, we can’t record them more often than that”. For our project, the measurement hardware was changed, so our recording took a couple of milliseconds. We could store and display the values continuously, if the need arises.

So we discussed the required simpleness or complexity of the task scheduler with the customer and they seemed pleased with all the new possibilities. But they decided that the current schedules were sufficient and didn’t need to be changed. We could go ahead and build our simple task scheduler.

And this is when we decided to abandon KISS and make the task scheduler more powerful than needed. “But you ain’t going to need it!” was the enemy. Because we knew that the customer will inevitably come around and make use of their new possibilities. We knew that if we build the system with more complexity, we would be the heroes in a future time, wearing a smug smile and telling the customer: “We’ve already built this, you can use it right away”. Oh how glorious this prospect of the future shone! Just a few more thoughts going into the code and we’re set for a bright future.

Let me tell you a few details about the “few more thoughts” with the example of an “every hour” task schedule. Instead of hard-coding the schedule, we added a configuration file with a cron-like expression for the schedule. You could now leverage the power of cron expressions to design your schedule as you see fit. If you wanted to change the schedule from “every hour” to “every odd minute and when the pale moon rises”, you could do so. The task scheduler had to interpret the configuration file and make sure that tasks don’t pile up: If you schedule a task to run “every minute”, but it takes two minutes to process, you’ve essentially built a time-bomb for your system load. This must not be feasible.

But it doesn’t stop there. A lot of functionality, most of which wasn’t even present or outlined at the time of our decision, relies implicitly on that schedule. Two examples: There are manual operations that must not be performed during the execution of the task. The system goes into a “protected state” around the task execution. It disables these operations a few minutes before the scheduled execution and even some time afterwards. If you had a fixed schedule of “every hour”, you could even hard-code the protected timespan. With a possible dynamic schedule, you have to calculate your timespan based on the current schedule and warn your operator if it isn’t possible anymore to find a time slot to even perform the manual operation.
The second example is a functionality that supervises the completeness of the recorded data. The problem is: This functionality is on another computer (it’s a distributed system, remember?) that doesn’t know about the configuration files. To be able to scan the data archive and say “everything that should be there, is there”, the second computer needs to know about all the schedules of the first computers (there are many of them, recording their data on their own schedules and transferring it to the second computer). And if a schedule changes, the second computer needs to take the change into account and scan the data archive for two areas: one area with the old schedule and one area with the new schedule. Otherwise, there would be false alarms.

You can probably see that the one decision to make the task scheduler a little more complex and configurable as required had quite some impact on the complexity of other parts of the system. But this investment will be worth it as soon as the customer changes the schedule! The whole system is programmed, tested and documented to facilitate schedule changes. We are ready!

It’s been over twelve years since we wrote the first line of code for the more complex implementation (I’ve checked the source control logs). The customer hasn’t changed a single bit of the schedule yet. There are over twenty “first computers” and they all still run the same task schedule as initially planned. Our decision did nothing but to add accidental complexity to the system. It probably introduced some bugs along the way, too. It certainly increased our required level of awareness (“hurdle of understanding”) during the development of features that are somewhat coupled with the task schedule.

In short: It’s been a disaster. The smug smile we thought we’d wear has been replaced by a deep frown. Who wrote all that mess? And why? It wasn’t the customer, it was us. We will never be going to need it.

Domain-aligned bugs

Frank C. Müller [CC BY-SA 4.0 (https://creativecommons.org/licenses/by-sa/4.0) ]

Imagine that you are an user of a typical enterprise software that handles commercial products and their prices. There are different prices in the software that are somehow related to each other. There is the purchase price that indicates your cost if you buy the product. There is the retail price that gets listed in your price lists and is paid by your customers, should they buy the product. You probably already figured out that the retail price should never be lower than the purchase price, because that would mean you lose money with every successful sale.

Let’s say that the enterprise software not only handles products, but also parts. Several parts combined, with some manufacturing effort, result in a product. Each part has a purchase price, the resulting product has a retail price. The retail price of the product should be higher than the sum of purchase prices of the parts. If it isn’t, you lose the costs of the manufacturing effort and some extra money with every successful sale.

If for any reason you cannot clearly estimate your manufacturing effort, the enterprise software has another input field for an amount of money that you can add to the sum of the parts’ costs. We call this field the “sales bonus”. So, if you sell a product made up of parts, your customer has to pay a price that consists at least of the retail prices of the parts and the sales bonus. Of course, your customer has an individual discount percentage that needs to be subtracted from the total price. Are you still following?

You are now thinking in the domain of price determination and financial mathematics. If you were the user of said enterprise software, you’d probably expect some bugs like these:

  • It is possible to enter a retail price lower than the purchase price
  • The price of products manufactured from parts isn’t calculated correctly
  • It is possible to enter a negative sales bonus
  • The total price with discount could be lower than the sum of purchase prices of the parts without a warning

All of them are bugs in the domain. All of them can be explained to a domain expert or a user with terms and concepts from the domain.

But what about the bug when you sell a product that consists of three parts, each with a retail price of 10 €, and a sales bonus of 5 €. You want to create a quote for your customer and the price shows up as 34,99999999998 €. You are a bit bewildered and try to countervail the apparent rounding error by changing the sales bonus to 5,00000000002 €. After this change you get another crazy total price and your prices in the database are different from what you entered, too. Everything seems to destabilize and deviate further and further from clear cut prices.

As a programmer, you know what happened. You know what caused this effect of numerical instability. Somebody stored monetary values in a floating point number. You know that is a bad idea and you’d never do this. But this blog post isn’t about you or what you should do or not do. It is about the user, expert in his domain, that stumbles over the bug as described and has to make some decision on how to fix it. This user cannot use any knowledge from the domain to even understand the mechanics of the bug. You, as the programmer, cannot explain this bug in terms of things the user already knows. You need to be vague (“the software doesn’t store the exact values, just approximations”) or introduce additional complexity (“we store this value by splitting it into a significand and multiply it with a factor consisting of a fixed base and an exponent. We can omit the base and just store the significand and the exponent and express a very large numerical range in just a few bits. Think about how cool that is!”).

Read the last explanation again, from the viewpoint of a salesman. We want to add some prices in the range of a few €, slap a moderate discount on top and call it a day. We don’t care about bits or exponential formulas. That is not part of our domain and it shouldn’t affect our domain or software that works in our domain. Confronting us with technical details reflects negatively on your ability to solve our problems. You seem to burden us with your problems in exchange.

As domain experts, we want only domain-aligned bugs.

Book review: A Philosophy of Software Design

This blog entry is structured in two main parts: The prologue sets the tone, but may be irritating because it doesn’t talk about the book itself. If you get irritated or know the topic well enough to skip it, you can jump to the second part when I talk about the book. It is indicated by a TL;DR summary of the prologue.

Prologue

Imagine a world where the last 25 years of computer game development didn’t happen. A world where we get the power of 5 GHz octacore computers and 128 GB of RAM, but nobody thought about 3D graphics or interaction design. The graphics of computer games is so rudimentary, it consists of ASCII art and color. In this world, two brothers develop a game that simulates a whole fantasy world with all details, in three dimensions. The game is an instant blockbuster hit and spawns multiple cinematic adaptions.
This world never happened. The only thing that seems to be from this world is the game itself: Dwarf Fortress. An ASCII art sandbox simulation of a bunch of dwarves that dig into the (three-dimensional) mountains and inevitably discover the fun in magma. Dwarf Fortress is a game told by stories, not graphics. It burdens the player to micro-manage a whole settlement down to the individual sock – Yes, no plural. There are left socks and right socks and they are different entities with a different story. Dwarves can literally go mad because they miss their favorite left sock and you didn’t notice in time. And you have to control all aspects of the settlement not by direct order, but by giving hints and suggestions through an user interface that is a game of riddles on its own.
Dwarf Fortress is an impossible game. It seems so out of time and touch with current gaming reality that you can only shake your head on first contact. But, it is incredibly deep and well-designed and, most surprising, provides the kind player with endless fun. This game actually works!

TL;DR: Just because something seems odd at first contact doesn’t mean it cannot work. Go and play Dwarf Fortress!

The book

John Ousterhout is a professor teaching software design at the Stanford university and writes software for decades now. In 1988, he invented the Tcl programming language. He got a lot of awards, including the Grace Murray Hopper Award. You can say that he knows what he’s doing and what he talks about. In 2018, he wrote a book with the title “A Philosophy of Software Design”. This book is a peculiar gem besides titles with a similar topic.

Imagine a world where the last 20 years of software development books didn’t happen. One man creates software for his whole life and writes down his thoughts and insights, structured in tactical advices, strategic approaches and an overarching philosophy. He has to invent some new vocabulary to express his ideas. He talks about how he performs programming – and it is nothing like today’s mainstream. In fact, it is sometimes the exact opposite of today’s best practices. But, it is incredibly insightful and well-structured and, most surprising, provides the kind developer with endless fun. Okay, I admit, the latter part of the previous sentence was speculative.

This is a book that seems a bit out of touch with today’s mainstream doctrine – and that’s a good thing. The book begins by defining some vocabulary, like the notion of complexity or the concept of deepness. That is rare by itself, most books just use established words to deliver a message. If you think about the definitions, they will probably enrich your perception of software design. They enriched mine, and I talk about software design to students for nearly twenty years now.

The most obvious thing that is different from other books with similar content: Most other books talk about behaviours, best practices and advices. Then they throw a buch of prohibitions in the mix. This isn’t wrong, but it’s “just” anecdotal knowledge. It is your job as the reader to discern between things that may have worked in the past, but are outdated and things that will continue to work in the future. The real question is left unanswered: Why is it so?

“A Philosophy of Software Design” begins by answering the “why” question. If you want to build an hierarchy of book wisdom depth, this might serve as one:

  • Tactical wisdom: What should be done? Most beginner’s books work on this level. They show exactly what goes on, but go easy on the bigger questions.
  • Strategic wisdom: How should it be done? This is the level that the majority of good software design books work on. They give insights about your work ethics and principles you should abide by.
  • Philosphical wisdom: Why should it be done? The reviewed book begins on this level. It explains the aspects of software and sourcecode that work against human perception and understanding and shows ways to avoid or at least diminish those aspects.

The book doesn’t stay on the philosophical level for long and dives deep into the “how” and “what” areas later on. But it does so with the background of an established “why”. And that’s a great reminder that even if you disagree with a specific “what” (or “how”), you should think about the root cause of your disagreement, not just anecdotes.

The author and the book aren’t as out-of-touch with current software development reality as you might think. There is a whole chapter addressed to current “software trends” like agile development and unit tests. It has a total page count of six pages and doesn’t go into details. But it at least mentions the things it doesn’t talk about.

Conclusion

My biggest learning point from the book for my personal habits as a developer is to write more code comments in the way the book proposes. Yes, you’ve read that right. The book urges you to write more comments – but good ones. It talks about why you should write more comments. It gives you extensive guidelines as to how good comments are written and some examples what these comments look like. After two decades of “write more (unit) tests!”, the message of “write more comments!” is unique and noteworthy. Perhaps we can improve our tools to better support comments in the same way they improved support for tests in the last years.

Perhaps we cannot solve our problems with the sourcecode by writing more sourcecode (unit tests). Perhaps we need to rely on something different. I will give it a try.

You might want to give the book “A Philosophy of Software Design” a try. It’s worth your time and thoughts.