Ignoring YAGNI – 12 years later

Fourteen years ago, we started to build a distributed system to gather environmental data in an automated 24/7 fashion. Our development process was agile and made heavy use of short iterations (at least that was what they were then, today they are normal-sized). So the system grew with many small new features and improvements, giving the customer immediate business value.

One part of the system was the task scheduler. Because the system had to run 24/7 and be mostly independent of human interaction, the task scheduler’s job was to launch different measurement processes at the right time. We had done extensive domain crunching and figured out that all tasks follow a rigid time regime like “start every 10 minutes” or “start every hour”, regardless of the processes’ runtime. This made the scheduler rather easy to develop. You should keep it simple, after all.

But another result of the domain crunching bothered us: The schedule of all tasks originated from the previous software system, built 30 years ago and definitely unfit for the modern software world. The schedules weren’t really rooted in the domain, they all had technical explanations like “the recording of the values is done sequentially and takes up to 8 minutes, we can’t record them more often than that”. For our project, the measurement hardware was changed, so our recording took a couple of milliseconds. We could store and display the values continuously, if the need arises.

So we discussed the required simpleness or complexity of the task scheduler with the customer and they seemed pleased with all the new possibilities. But they decided that the current schedules were sufficient and didn’t need to be changed. We could go ahead and build our simple task scheduler.

And this is when we decided to abandon KISS and make the task scheduler more powerful than needed. “But you ain’t going to need it!” was the enemy. Because we knew that the customer will inevitably come around and make use of their new possibilities. We knew that if we build the system with more complexity, we would be the heroes in a future time, wearing a smug smile and telling the customer: “We’ve already built this, you can use it right away”. Oh how glorious this prospect of the future shone! Just a few more thoughts going into the code and we’re set for a bright future.

Let me tell you a few details about the “few more thoughts” with the example of an “every hour” task schedule. Instead of hard-coding the schedule, we added a configuration file with a cron-like expression for the schedule. You could now leverage the power of cron expressions to design your schedule as you see fit. If you wanted to change the schedule from “every hour” to “every odd minute and when the pale moon rises”, you could do so. The task scheduler had to interpret the configuration file and make sure that tasks don’t pile up: If you schedule a task to run “every minute”, but it takes two minutes to process, you’ve essentially built a time-bomb for your system load. This must not be feasible.

But it doesn’t stop there. A lot of functionality, most of which wasn’t even present or outlined at the time of our decision, relies implicitly on that schedule. Two examples: There are manual operations that must not be performed during the execution of the task. The system goes into a “protected state” around the task execution. It disables these operations a few minutes before the scheduled execution and even some time afterwards. If you had a fixed schedule of “every hour”, you could even hard-code the protected timespan. With a possible dynamic schedule, you have to calculate your timespan based on the current schedule and warn your operator if it isn’t possible anymore to find a time slot to even perform the manual operation.
The second example is a functionality that supervises the completeness of the recorded data. The problem is: This functionality is on another computer (it’s a distributed system, remember?) that doesn’t know about the configuration files. To be able to scan the data archive and say “everything that should be there, is there”, the second computer needs to know about all the schedules of the first computers (there are many of them, recording their data on their own schedules and transferring it to the second computer). And if a schedule changes, the second computer needs to take the change into account and scan the data archive for two areas: one area with the old schedule and one area with the new schedule. Otherwise, there would be false alarms.

You can probably see that the one decision to make the task scheduler a little more complex and configurable as required had quite some impact on the complexity of other parts of the system. But this investment will be worth it as soon as the customer changes the schedule! The whole system is programmed, tested and documented to facilitate schedule changes. We are ready!

It’s been over twelve years since we wrote the first line of code for the more complex implementation (I’ve checked the source control logs). The customer hasn’t changed a single bit of the schedule yet. There are over twenty “first computers” and they all still run the same task schedule as initially planned. Our decision did nothing but to add accidental complexity to the system. It probably introduced some bugs along the way, too. It certainly increased our required level of awareness (“hurdle of understanding”) during the development of features that are somewhat coupled with the task schedule.

In short: It’s been a disaster. The smug smile we thought we’d wear has been replaced by a deep frown. Who wrote all that mess? And why? It wasn’t the customer, it was us. We will never be going to need it.

Keeping it simple is hard indeed

A true story I was reminded of when I read Chad LaVigne’s essay on simplicity from the book “97 Things Every Software Architect Should Know”.

When I read the great book (or collection of essays) “97 Things Every Software Architect Should Know”, there was a story I had experienced first-hand for nearly every chapter. But it was only after I read about it in the well-placed words of somebody with greater in-depth experience than me that it appeared clear to me. The essence of my own experience was written down so I could iterate over it. Here is a story about simplicity that I hope will help somebody out there iterate over the same thought sometimes.

The problem

In the early days, we were asked to provide custom software for a nearly robot-like machine that performs measurements. Our software had to control some engines that could move sensors and actuators around. The whole machine was built by another company that only had eyes for the mechanical and electrical aspects. As a result, the communication protocol between the microcontrollers of the machine and our software was horribly awkward and unpleasant in regard of the software engineering side of the project.

One engine was the so-called “vertical engine”, because it could move an array of sensors up and down. There were four actual positions where the engine would stop automatically upon contact. It was the job of our software to send the correct commands in the correct order at the correct time to reach the destination. Effectively, we would “hop” from position to position, either moving the sensors up or down. If only the communication protocol would’ve allowed that. What we got was two commands: moving one position down and moving up to the topmost position. The movement options are depicted here:

The first solution

But what would a developer’s life be without a few challenges? Our team quickly came up with several possible solutions, from finite state machines to object graphs of “position” instances that offered several “movement options” to an agent-like object travelling inside the graph. In the brainstorming session, even more sophisticated possibilities were discussed. The solution we finally implemented was complex in terms of clarity. We needed quite a few unit and integration tests to proof the whole thing correct. During all this process, nobody ever questioned the complexity of the problem.

The essay

The chapter I read in the book “97 Things Every Software Architect Should Know” when this story came back to me was “Make Sure the Simple Stuff Is Simple” written by Chad LaVigne. Let me cite some sentences to show the message of his essay:

“People who design software are smart – really smart. The simple problem-complex solution trap can be an easy one to fall into because we like to demonstrate our knowledge. If you find yourself designing a solution so clever that it may become self-aware, stop and think. Does the solution fit the problem?”

The essay hit the nail on the head for me. Our solutions for this “vertical engine” were all clever and solved the problem, but a much more complex problem than the one that was really needed.

The second solution

Our first version of the software went into production and worked a treat. Years later, the hardware (specifically the microcontrollers) fell apart and was replaced by more sophisticated electronics. We needed to rewrite parts of our software, especially the engine controls. But this time, we had real movement patterns of all engines, collected in the logfiles. Some regular expression magic later (remember, we are smart people and like to show that!), it was clear that the vertical engine had never used two of the possible movement options. The real usage pattern of the engine is depicted in the picture below:

The real problem we should have solved from the beginning is much simpler than the original one. Every engine position has only one successor, there is no need to choose from several options or to calculate the shortest path to the destination. It’s the most boring finite state machine you’ve ever seen and there is no need for recursive structures.

It was our own cleverness that made the problem appear more complex than it really was, just because we could handle it. Some investigation on the problem domain instead of the solution domain would have brought us to the same conclusion that the log file analysis finally revealed.

The lesson learnt

The whole story isn’t about failure. Both solutions worked for the customer. In fact, he never noticed a difference when the second solution was going live. This story is about accidental complexity and about being smarter than the requirement. It’s in our nature to accept a problem as “given” and come up with an elegant solution. It might be much more effective to question the problem instead.

Thank you, Chad, for your great essay!