Revisiting the bus factor concept

The concept of a “bus factor” is both grim and very useful for managing project risks. It originates from project management and is sometimes called the “truck number” or (to give it a more positive spin) the “lottery factor”.

It tries to pinpoint the number of people in a project who would have to drop out suddenly and without warning for the project’s success to be jeopardized. The “bus” or “truck” is just the conceptual tool that enforces the drop-out. A big lottery win might produce the same outcome, but in a friendlier way.

The bus factor number alone is often helpful to make lurking project risks visible. Especially a bus factor of 1, the most nerve-wracking number, should be avoided. It means that the project’s success is directly coupled to the health (or gambling luck) of one specific person.

But even a higher bus factor, let’s say 3, is no complete relief. What if those three people hop into the same car to drive to a customer meeting and have an accident? The only way to mitigate such “cluster risks” is to plan separate routes and means of travel. Most people would regard those measures as “overly paranoid”, and they rob the three people of the chance to talk directly before and after the meeting.

You can explore a project’s risks with more sophisticated tools than a single number. Setting up and filling out a RACI matrix (or one of its many variants) is a good way to make things visible.

But in this blog post, I want to highlight another detail of the bus factor that I learned the hard way: the “bus factor risk” of different people can vary a lot. By “bus factor risk” I mean the individual probability that a given person actually drops out.

Let’s take the lottery as an example: Your project has two key players who keep the project afloat. One of them never buys a lottery ticket, the other plays regularly. Their “lottery factor risk” is not equal. Given the low probability of winning the lottery, the difference is tiny, but it is there.

Now imagine one person who often pursues high-risk spare time activities. I don’t want to single out one specific activity, but think of free climbing, maybe. The other person mostly stays at home and cooks delicious meals without using sharp knives or hot water. Okay, this comparison sounds a bit contrived, but you get the message:

Two projects, each with a bus factor of 2, can vary a lot in their actual risk, because all four people have their own individual drop-out probability.
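To make that concrete with toy numbers (the probabilities are made up for illustration only; this is not a suggestion to actually rate your colleagues):

```python
# Assumed annual probability that each key person drops out unexpectedly
# (made-up numbers, purely illustrative).
project_a = [0.01, 0.01]  # two cautious homebodies
project_b = [0.01, 0.05]  # one homebody, one free climber

def stall_probability(drop_out_risks):
    # With a bus factor of 2, the project is only jeopardized if both
    # key people drop out (assuming independent risks, itself a simplification).
    p = 1.0
    for risk in drop_out_risks:
        p *= risk
    return p

print(f"project A: {stall_probability(project_a):.4%}")  # 0.0100%
print(f"project B: {stall_probability(project_b):.4%}")  # 0.0500%
```

Same bus factor on paper, a fivefold difference in the actual number.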

It doesn’t have to be spare time activities, by the way. Every person carries an individual health risk that can only be reduced to a certain degree. And every person simply has “luck” or “misfortune” and can’t do anything about it.

My message is simply that a bus factor of 2 might not be “half the risk” of a bus factor of 1, and that two bus factors with the same value don’t necessarily denote equal risk.

I don’t think that it is useful to try to quantify the individual “bus factor risk” of a person. Way too many factors come into play, and most of them should not be the employer’s concern (like a medical history or spare time activities). What might be useful is to be aware that equal numbers don’t mean equal actual risk.

Useful background metric: Distance to Disaster

This blog post would not have happened without my wife, who, upon learning that I use this metric in my everyday life, urged me to write about it.

I often categorize events that happen in my life. Due to my nature, I analyze detrimental events more thoroughly than things that “worked as intended”. One tool for my analysis is a metric that I call “Distance to Disaster” (DtD). It indicates how much “distance”, “bad faith work” or how many “bad decisions” need to be invested in order for disaster to happen. Let me explain:

If we wait for a train, we can stand in the middle of the platform and maximize the physical distance to the tracks in front of and behind us. Or we can stand right at the edge and minimize the physical distance to one track. If the track we chose for our position is the one where our train will arrive, we have a very low distance to disaster. We can lose our balance and fall onto the tracks. We can misjudge the physical dimensions of the train and get hit by something. In short: Nobody wants to wait for a train with a minimized (physical) distance to disaster.

Another measurement unit for the metric is “bad faith work”. Let’s assume you want to steal my most prized possession. That would be a disaster for me. You need to gain access to my home (step 1), then open the safe (step 2) and then find the key to the safe deposit box at my bank inside it (a no-brainer, not a step on its own). Afterwards, you need to gain access to the safe deposit room at my bank before I notice my loss (step 3) and open the box, which has a two-lock system (step 4). It is probably easier to come up with a plan that circumvents some steps and attacks the bank directly. If you have just succeeded with step 1, my most prized possession is probably still very secure, because a remaining DtD of 3 is rather high.

And then there are “bad decisions”. Let’s say you write code and accidentally hit “load” instead of “save”. If you were me in the early nineties, you just overwrote your code with an empty file. I still remember that day, and it didn’t help that “save” was bound to F5 and “load” to F6. One bad decision led to disaster.

Now imagine you still use the same shitty IDE (it was the GWBasic editor), but with modern version control. You commit early and often. You accidentally hit “load” instead of “save” and lose your last few minutes of work. Sad, but not a disaster. Even if you delete the whole file, you can restore your last commit as often as you want. Using version control adds +1 to your “bad decision distance” to disaster.
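To spell out what “commit early and often” buys you, here is a minimal sketch (assuming an existing git repository and calling git from Python; the helper names are made up):

```python
import subprocess

def commit_checkpoint(message: str = "checkpoint") -> None:
    # Every commit is one more step between a slip of the finger and lost work.
    # (The commit call fails if there is nothing new to record.)
    subprocess.run(["git", "add", "-A"], check=True)
    subprocess.run(["git", "commit", "-m", message], check=True)

def undo_botched_save(path: str) -> None:
    # Throw away the botched working copy and return to the last committed state.
    subprocess.run(["git", "restore", path], check=True)
```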

You probably understand the concept by now. You can define what a “disaster” is for you and then measure your current distance to it by trying to come up with the fewest steps that lead to it.

In our normal everyday life, we are surprisingly often only one step away from disaster, and yet it almost never happens. That’s a reassuring reality, but it shouldn’t keep us from thinking about how to increase the step count without much effort.

One typical implementation of this approach is a modest backup strategy for all data that you intend to keep. Another one is to have spare parts for crucial devices in stock (the “hardware backup”).
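A modest backup can start as small as this sketch (with made-up paths, assuming you run it regularly, e.g. as a scheduled task):

```python
import shutil
from datetime import datetime
from pathlib import Path

# Made-up example paths -- adjust to the data you actually intend to keep.
SOURCE = Path.home() / "documents"
BACKUP_ROOT = Path.home() / "backups"

def backup_once() -> Path:
    # One dated copy per run; every copy is one more step between
    # a fat-fingered delete and actual data loss.
    target = BACKUP_ROOT / datetime.now().strftime("documents-%Y-%m-%d")
    shutil.copytree(SOURCE, target, dirs_exist_ok=True)
    return target

if __name__ == "__main__":
    print(f"backed up to {backup_once()}")
```

A copy on the same disk only adds one small step; a copy on another machine or medium adds a few more.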

Don’t get me wrong: It’s not about maximizing the DtD. It’s about recognizing the cheap and easy opportunity to add one more step to the distance.

And it’s not about “disaster” in the sense of life-altering, stop-the-world events. A “disaster” can be anything you don’t want to happen. Try to put a reasonable distance between you and that thing, if possible.

Now that you know about the concept, can you find examples of cheap and easy DtD improvements in software development? Let us know in the comments!

Addendum for my co-workers: Our ETOD metric is the DtD metric applied to financial resources.

And another addendum: I find a lot of similarities in the field and mindset of accident prevention. For example, airplane cockpits are designed in a way that dangerous actions require the actuation of two control elements, like switches or buttons, that are located on different sides of the cockpit. Making it two buttons instead of one adds “bad decision” distance. Placing the buttons far apart adds “intent distance”.

In software user interface design, we try to replicate the second button with a confirmation dialog (“Are you sure?”). It adds to the “bad decision” distance but often falls short in the “intent distance” dimension. I don’t want to be responsible for cumbersome “maximized mouse distance” dialogs, though.
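One cheap way to add intent distance without mouse gymnastics is the familiar “type the name to confirm” pattern. A minimal sketch (the resource name and prompts are made up):

```python
def confirm_deletion(resource_name: str) -> bool:
    # Step 1: the classic "Are you sure?" adds bad-decision distance.
    answer = input(f"Really delete '{resource_name}'? [y/N] ")
    if answer.strip().lower() != "y":
        return False
    # Step 2: typing the exact name adds intent distance --
    # you cannot do it by accident or on autopilot.
    typed = input(f"Type the name '{resource_name}' to confirm: ")
    return typed.strip() == resource_name

if __name__ == "__main__":
    if confirm_deletion("production-database"):
        print("deleting...")
    else:
        print("aborted -- distance to disaster preserved")
```

You cannot type “production-database” on autopilot, which is exactly the point.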