Wear parts in software

I want to preface my thoughts with the story that originally sparked them (and yes, I oftentimes think about software development when unrelated things happen in the real world).

I don’t own a car myself, but I’m a non-hesistant user of rental cars and car sharing services. So when I have to drive long distances, I use many different models of cars. One model family is the Opel Corsa compact cars, where I’ve driven the models A to C and in the story, model D.

It was on the way back, on the highway, when darkness settled in. I switched on the headlamps and noticed that one of them was not working. In germany, this means that your car is unfit for travel and you should stop. You cannot stop on the highway, so I continued driving towards the next gas and service station.

Inside the station, I headed to the shelf with car spare parts and searched for a lightbulb for a Corsa model D. Finding the lightbulbs for A, B and C was easy, but the bulbs for D weren’t there. In fact, there wasn’t even a place for them on the shelf. I asked the clerk for help and he laughed. They didn’t sell lightbulbs for the Corsa model D because changing them wasn’t possible for the layman.

To change a lightbulb in my car, you have to remove the engine block, exchange the lightbulb and install the engine block again. You need to perform this process in a repair shop and be attentive to accidental leakage and connector damage.

Let me summarize the process: To replace an ordinary wear part, you have to perform delicate expert work.

This design paradigm seems to be on the rise with consumer products. If you know how to change the battery on your smartphone or laptop, you probably explicitly chose the device because of this feature.

Interestingly, the trend is reversed for software development. Our architectures and design efforts try to separate between primary code and wear part code. Development principles like SRP (Single Responsibility Principle) or OCP (Open/Closed Principle) have the “wear part code” metaphor in mind, even if it isn’t communicated in such clarity.

On the architecture field, a microservice paradigm maps a complex mechanism onto several small and isolated parts. The isolation aspect is crucial because it promotes replaceability – you don’t need to remove and reinstall a central microservice if you want to replace a more peripheral one. And even the notion of “central and peripheral” services indicates the existence and consideration of an abrasion effect.

For a single application, the clean, hexagonal or onion architecture layout makes the “wear part code” metaphor the central aspect of your code positioning. The goal is to prepare for the inevitable technology replacement and don’t act surprised if the thing you chose as your baseplate turns out to behave like rotting wood.

A good product design (at least for the customer/user) facilitates maintainability by making simple upkeep tasks easy.

We software developers weren’t expected to produce good products because the technological environment moved faster than the wear and nobody but ourselves could inspect the product anyway.

If a field moves faster than the abrasion can occur, longevity of a product is not a primary concern. Your smartphone will be outdated and replaced long before the battery is worn out. There is simply no need to choose wear parts that live longer than the main product. My postulation is that software development as a field has slowed down enough to make the major abrasive factors and areas discernable.

If nobody can inspect the software product and evaluate its sustainability, at least the original developer can, right? You can check for yourself with a simple experiment. Print the source code of your software (or parts of it), take two text markers (my favorite colors for this kind of approach are green and blue) and mark the code you deem primary with the first text marker. Any code you consider a wear part gets colored with the second marker. If you find it difficult to make the distinction or if the colors are mingled all over the place, this might be an indication that you could improve things.

What is a wear part in software? I would love to hear your thoughts and definitions in the comment section! My description, with no claim to be complete, would be any code that has a high probability to change because of one of the following reasons:

  • The customer/user is forced to make a change request by external forces like legal regulation
  • Another software/system/service changes, forcing your software to adjust its understanding of its surrounding
  • The technical field moved, changing your perception of the code

If you plan for maintainability in software development, you always plan for obsolescence and replacement. Our wear parts are different from mechanical ones in their uniqueness – we don’t replace a lightbulb with the same model, we replace unique code with different, but also unique code. But the concept of wear parts is the same:

Things that are likely to be replaced are designed for easy replacement.

Arbitrary limits in IT

When I was a young teenager, a game captured my attention for months: Sid Meyer’s Railroad Tycoon. The essence of the game was to build railways between cities and transport as much passengers and freight goods between them as possible.

I tell this story because it was the first time I discovered a software bug and exploited it. In a way, it was the start of my “hacking” career. Don’t worry, I’m a software developer now, not a hacker. I produce the bugs, I don’t search and exploit them.

The bug in Railroad Tycoon had to do with buying industry. You could not only build tracks and buy trains, you could also acquire industrial complexes like petroleum plants and factories. And while you couldn’t build more tracks once your money ran out, you could accrue debt by buying industry. If you did this extensively, your debt suddenly turned into a small fortune. I didn’t know about the exact mechanics back then, but it was a classic 16-bit signed integer overflow bug. If your debt exceeded 32.768 dollars, the sign turned positive. That was a lot of money in the game and you had a lot of industry, too. The english wikipedia article seems to be a bit inaccurate.

If you are accustomed with IT, there a some numbers that you immediately recognize as problematic, like 255, 32.767, 65.535 or 2.147.483.647. If anything unusual or bad happens around these numbers, you know what’s up. It’s usually related to an integer overflow or (in the case of Railroad Tycoon) underflow.

But then, there are problematic numbers that just seem random. If you want to name a table in an older Oracle database, you couldn’t name it longer than 30 characters. Not 32 or something that could somehow be related to a technical cause, but 30. Certain text values couldn’t be longer than 2000 characters. Not 2048 (or 2047 with a terminating zero character), but straight 2000. These numbers look more “usual” for a normal human, but they appear just as arbitrary to the IT professional’s eye as 2048 might seem to others.

Recently, I introduced a new developer to an internal project of ours. We set up the development environment and let the program run a few times. Then, we introduced some observable changes to the code to explain the different parts. But suddenly, a console output didn’t appear. All we did was to introduce a line of code in the form of:

System.out.println(output);

And the output just didn’t show up. We checked that the program executed the code beforehands and even fired up a debugger (something I’m not really fond of) to see that the output string is really filled.

We changed the line of code to:

System.out.println(output.length());

And got our result: 32.483 characters.

As you can see, the number is somewhat near the 32k danger zone. But in IT, these danger zones are really small. You can be directly besides it and don’t notice anything, but one step more and you’re in trouble. In a way, they act like minefields.

There should be nothing wrong with 32.483 characters printed on a console. Well, unless you use Eclipse and Windows. If you do, there is a new danger zone, starting with 32.000 characters. And this zone isn’t small. In fact, it affects any text with more than 32.000 characters that should be printed in an Eclipse console on Windows:

https://bugs.eclipse.org/bugs/show_bug.cgi?id=23406

‘ScriptShape’ WINAPI fails with E_FAIL for OpenType fonts when the length of the string is over 32000 (0x7d00) characters

Notes: 32000 is hardcoded in ‘gdi32full!GenericEngineGetGlyphs’ Windows function.

https://bugs.eclipse.org/bugs/show_bug.cgi?id=23406#c31

There is nothing special about the number 32.000. My guess is that some developer at Microsoft in the nineties had to impose a limit and just thought that “32.000 characters ought to be enough for anybody”. This is a common mistake made by Microsoft and the whole IT industry.

The problem is that now, 20 or even 30 years later, this limit is still in place. Our processing power grew by a factor of 1.000 (yes, one thousand times more power), the amount of available memory even by a factor of 16.000 and we are still limited to 32.000 characters for a line of text. If the limit would grow accordingly, you could now fit up to 32.000.000 characters in that string and it would just work.

So, what is the moral of this story? IT and software development are minefields where you can step on a mine that was hidden 20+ years ago at any turn. But even more important: If you write code, please be aware that every limit you introduce into your solution will cause trouble in the future. Some limits can be explained by other limits, but others are just arbitrary. Make the arbitrary limits visible and maybe even configurable!

The do-it-yourself rickroll

This is a funny story from a while ago when we were tasked to play audio content in a web application and we used the opportunity to rickroll our web frontend developer. Well, we didn’t exactly rickroll him, we made him rickroll himself.

Our application architecture consisted of a serverside API that could answer a broad range of requests and a client side web application that sends requests to this API. This architecture was sufficient for previous requirements that mostly consisted of data delivery and display on behalf of the user. But it didn’t cut it for the new requirement that needed audio messages that were played to alert the operators on site to be send through the web and played in the browser application, preferably without noticeable delay.

The audio messages were created by text-to-speech synthesis and contained various warnings and alerts that informed the operators of important incidents happening in their system. Because the existing system played the alerts “on site” and all operators suddenly had to work from home (you can probably guess the date range of this story now), the alerts had to follow them to their new main platform, the web application.

We introduced a web socket channel from the server to each connected client application and sent update “news” through the socket. One type of news should be the “audio alert” that contains a payload of a Base64-encoded wave file. We wanted the new functionality to be up and usable on short notice. So we developed the server side first, emitting faked audio alerts on a 30 seconds trigger.

The only problem was that we didn’t have a realistic payload at hand, so we created one. It was a lengthy Base64 string that could be transported to the client application without problem. The frontend developer printed it to the browser console and went on to transform it back to waveform and play it as sound.

Just some moments later, we got some irreproducible messages in the team chat. The transformation succeeded on the first try. Our developer heard the original audio content. This is what he heard, every 30 seconds, again and again:

Yes, you’ve probably recognized the URL right away. But there was no URL in our case. Even if you are paranoid enough to recognize the wave bytes, they were Base64 encoded. Nobody expects a rickroll in Base64!

Our frontend developer had developed the ingredients for his own rickroll and didn’t suspect a thing until it was too late.

This confirmed that our new feature worked. Everybody was happy, maybe a little bit too happy for the occasion. But the days back then lacked some funny moments, so we appreciated it even more.

There are two things that I want to explain in more detail:

The tradition of rickrolling is a strange internet culture thing. Typically, it consists of a published link and an irritated overhasty link clicker. There are some instances were the prank is more elaborate, but oftentimes, it relies on the reputation of the link publisher. To have the “victim” assemble the prank by himself is quite hilarious if you already find “normal” rickrolls funny.

Our first attempt to deliver the whole wave file in one big Base64 string got rejected really fast by the customer organization’s proxy server. We had to make the final implementation even more complex: The server sends an “audio alert” news with a unique token that the client can use to request the Base64 content from the classic API. The system works with this architecture, but nobody ever dared to try what the server returns for the token “dQw4w9WgXcQ” until this day…

Don’t shoot your messengers

Writing small, focused tests, often called unit tests, is one of the things that look easy at the outset but turn out to be more delicate than anticipated. Writing a three-lines-of-code unit test in the triple-A structure soon became second nature to me, but there were lots of cases that resisted easy testing.

Using mock objects is the typical next step to accommodate this resistance and make the test code more complex. This leads to 5 to 10 lines of test code for easy mock-based tests and up to thirty or even fifty lines of test code where a lot of moving parts are mocked and chained together to test one single method.

So, the first reaction for a more complicated testing scenario is to make the test more complicated.

But even with the powerful combination of mock objects and dependency injection, there are situations where writing suitable tests seems impossible. In the past, I regarded these code blocks as “untestable” and omitted the tests because their economic viability seemed debatable.

I wrote small tests for easy code, long tests for complicated code and no tests for defiant code. The problem always seemed to be the tests that just didn’t cut it.

Until I could recognize my approach in a new light: I was encumbering the messenger. If the message was too harsh, I would outright shoot him.

The tests tried to tell me something about my production code. But I always saw the problem with them, not the code.

Today, I can see that the tests I never wrote because the “test story” at hand was too complicated for my abilities were already telling me something important.

The test you decide not to write because it’s too much of a hassle tells you that your code structure needs improvement. They already deliver their message to you, even before they exist.

With this insight, I can oftentimes fix the problem where it is caused: In the production code. The test coverage increases and the tests become simpler.

Let’s look at a small example that tries to show the line of thinking without being too extensive:

We developed a class in Java that represents a counter that gets triggered and imposes a wait period on every tenth trigger impulse:

public class CountAndWait {
	private int triggered;
	
	public CountAndWait() {
		this.triggered = 0;
	}
	
	public void trigger() {
		this.triggered++;
		if (this.triggered == 10) {
			try {
				Thread.sleep(1000L);
			} catch (InterruptedException e) {
				Thread.currentThread().interrupt();
			}
			this.triggered = 0;
		}
	}
}

There is a lot going on in the code for such a simple functionality. Especially the try-catch block catches my eye and makes me worried when thinking about tests. Why is it even there? Well, here is a starter link for an explanation.

But even without advanced threading issues, the normal functionality of our code is worrisome enough. How many lines of code will a test contain that covers the sleep? Should I really use a loop in my test code? Will the test really have a runtime of one second? That’s the same amount of time several hundred other unit tests require for much more coverage. Is this an economically sound testing approach?

The test doesn’t even exist and already sends a message: Your production code should be structured differently. If you focus on the “test story”, perhaps a better structure emerges?

The “story of the test” is the description of the production code path that is covered and asserted by the test. In our example, I want the story to be:

“When a counter object is triggered for the tenth time, it should impose a wait. Afterwards, the cycle should repeat.”

Nothing in the story of this test talks about interruption or exceptions, so if this code gets in the way, I should restructure it to eliminate it from my story. The new production code might look like this:

public class CountAndWait {
	private final Runnable waiting;
	private int triggered;
	
	public static CountAndWait forOneSecond() {
		return new CountAndWait(() -> {
			try {
				Thread.sleep(1000L);
			} catch (InterruptedException e) {
				Thread.currentThread().interrupt();
			}			
		});
	}
	
	public CountAndWait(Runnable waiting) {
		this.waiting = waiting;
		this.triggered = 0;
	}
	
	public void trigger() {
		this.triggered++;
		if (this.triggered == 10) {
			this.waiting.run();
			this.triggered = 0;
		}
	}
}

That’s a lot more code than before, but we can concentrate on the latter half. We can now inject a mock object that attests to how often it was run. This mock object doesn’t need to sleep for any amount of time, so the unit test is fast again.

Instead of making the test more complex, we introduced additional structure (and complexity) into the production code. The resulting unit test is rather easy to write:

class CountAndWaitTest {
	@Test
	@DisplayName("Waits after 10 triggers and resets")
	void wait_after_10_triggers_and_reset() {
		Runnable simulatedWait = mock(Runnable.class);
		CountAndWait target = new CountAndWait(simulatedWait);
		
		// no wait for the first 9 triggers
		Repeat.times(9).call(target::trigger);
		verifyNoInteractions(simulatedWait);
		
		// wait at the 10th trigger
		target.trigger();
		verify(simulatedWait, times(1)).run();
		
		// reset happened, no wait for another 9 triggers
		Repeat.times(9).call(target::trigger);
		verify(simulatedWait, times(1)).run();
	}
}

It’s still different from a simple 3-liner test, but the “and” in the test story hints at a more complex story than “get y for x”, so that might be ok. We could probably simplify the test even more if we got access to the internal trigger count and verify the reset directly.

I hope the example was clear enough. For me, the revelation that test problems more often than not have their root cause in production code is a clear message to improve my ability on writing code that facilitates testing instead of obstructing it.

I don’t shoot/omit my messengers anymore even if their message means more work for me.

It’s not a bug, it’s a missing test

Recently, I changed some wording when I talk about code and code problems. In my opinion, the new words have more correlation with the root cause of the problem, while the old words used to be “nearer” to the symptoms.

Let me explain the changes with a small code example that serves no other purpose than to contain a few problems:

public class InMemoryItemRepository {
	private final Map<String, Item> items;
	
	public InMemoryItemRepository() {
		this.items = new HashMap<>();
	}
	
	public Optional<Item> itemFor(String itemNumber) {
		return Optional.of(this.items.get(itemNumber));
	}
}

This class is a concrete implementation of an “item repository” that stores items by their “item number”. You can retrieve items by calling the itemFor method and giving it a valid itemNumber. Otherwise, you’ll end up with an empty Optional.

The first problem

The first problem in this code is a code smell. The data structure used to store the items is an HashMap. In Java, the HashMap implementation is not thread-safe:

Note that this implementation is not synchronized. If multiple threads access a hash map concurrently, and at least one of the threads modifies the map structurally, it must be synchronized externally.

https://docs.oracle.com/en/java/javase/11/docs/api/java.base/java/util/HashMap.html

With our current implementation, we leak this limitation onto our clients, which will be totally unaware, because nothing in the class design hints towards an HashMap. In most production environments, our in-memory implementation might even be swapped out with a database-based one. And the database implementation hopefully is thread-safe.

A code smell is a bug

My change in wording calls this type of code problem a bug (instead of a code smell). It might be possible that right now, this class is only used in a single-threaded manner and the implementation flaw doesn’t matter. In that case, it’s a bug in hibernation. Right now, it sleeps, but the day will come when it wakes up. That doesn’t change its bug-like features, just its immediate damage potential.

If you label a code smell as a bug, you highlight the damage potential instead of the current “nice to have” prioritization.

The second problem

Let’s return to the code example. There is another problem with this implementation: If you ask for an item number that is not stored in the repository, you don’t get an empty Optional, you’ll get a nice and unexpected NullPointerException.

A small unit test that asks for any item number on an empty repository reveals the problem:

	@Test
	void returnsEmptyIfNotStored() {
		var target = new InMemoryItemRepository();
		assertThat(
			target.itemFor("non-existent")
		).isEmpty();
	}

This test will run red with a NullPointerException, pointing to this line in the implementation:

		return Optional.of(this.items.get(itemNumber));

Of course, this is “just a typo” in the way we construct the Optional. Because our HashMap returns null if a given key is not stored in it, we need to call Optional.ofNullable() and have it convert null to Optional.empty:

		return Optional.ofNullable(this.items.get(itemNumber));

A bug is a missing test

This “little fix” makes our unit test pass and the repository useful. My change in wording calls every bug a “missing test”. This points out that the test you write as you fix the bug could have prevented the bug’s damage altogether.

If you were surprised by my assumption that you write tests when you fix bugs, please consider adopting this as a habit. It is the last possible moment to improve your test coverage at a place that has proven to be important enough to have tests and undertested enough to exhibit bugs. If you want your project to be antifragile (and there are good reasons why it should be), start by strengthening it at every place it breaks.

Tests for every code smell?

The combination of both changes would make every code smell a missing test. Right now, I’m not sure if this needs to be an automatism. The existence of a code smell hints towards a more fragile part of the system, but I’m not sure if “potential fragility” is a valid indicator for additional tests.

Of course, if you program completely test driven, these kind of thoughts probably seem pointless to you.

What are your thoughts on this topic and specifically on compulsory testing for every code smell?

Characteristics of a Good Merge Request

If you happen to use Merge Requests as a basic mechanics of your software development workflow (and there are compelling arguments that you should), you’ve probably encountered some merge requests that could be handled in a straight-forward manner and some that are just painful to work with. Some merge requests “spark joy” and others elicit nothing but displeasure.

What differentiates the good, joyful merge requests from the dreadful? We’ve found six (or seven, based on your categorization) characteristics that good merge requests exhibit, while bad ones don’t.

Good merge requests are:

  • Atomic – Only one topic
  • Minimal – Only necessary changes
  • Curated – Double-checked by the developer
  • Finished – Free of deactivated code, debug code, etc.
  • Explained – Provided with commentary in the merge request (not the code)
  • Manageable – Small in changeset size
  • Separated – Either domain-based or technology-based, not intermingled

Bad merge requests lack in some or even all these categories. The first three aspects are ignored the most in my experience. This is why they are at the top of the list. A lot of trouble vanishes simply by staying on course, being succinct and caring about the next pair of eyes as a developer.

Let’s look at each characteristic in some depth:

Atomic

Let’s say you watch a romantic comedy movie and halfway through, all of a sudden, it changes into a war movie with gruesome scenes. You would feel betrayed. Ok, we don’t work in entertainment, our work is serious. Let’s say you read a newspaper article about the latest stock market movement and in the third paragraph, it plainly changes into a car advertisement. This is called “bait and switch” and is a diversionary tactics or a feint. If you do it unwittingly, you might mean no harm, but still cause irritation.

An atomic merge request tells only one story which means it contains changes for only one issue. If you happen to make “just a quick fix” at the wayside of your task, you break the atomicity property of your changeset and force your reviewer to follow your train of thought, but without the context.

There is nothing to say against a quick fix or a small refactoring, a little improvement here and there. But it’s not part of your story, so make it a story of its own. Bundling the tiny change with the overarching story makes it harder for the reviewer to differentiate topical changes from opportunitic ones and difficult to accept your main changes while rejecting your secondary ones. If you open up a second merge request for your peripheral changes, you retain the atomicity and the goodwill of your reviewer.

Minimal

A good merge request contains only changes that are essential to the story and consciously modified by the developer. Because we use a plethora of development tools that all want to store their metadata somewhere in our project, we often end up with involuntary modifications in files we don’t know, never looked at and don’t feel responsible for.

The minimalistic aspect of a good merge request puts you, the developer, into the position of responsibility of all content in your changeset. It doesn’t matter that “a tool made that change”. The tool acted on your command. It is not the responsibility of the reviewer to figure out what that change means. It is your duty. Explain the change to your reviewer and if you can’t, revert the change. If the rest of your changes work without it, it wasn’t necessary and shouldn’t have been included anyway. If your changes don’t work anymore, you are about to learn the inner workings of your development tools.

A minimal changeset also implies a reduced number of modified files. That’s true, but you shouldn’t design your system to have an artificial low number of files (or parts). If you don’t space out your code according to domain structures, you’ll end up with small changesets, but lots of merge conflicts and an ongoing dissonance between the domain model and your software model.

Curated

If you can follow the concept of a merge request telling a story, then you are the storyteller. You need to take care of your narration. In a good story, only necessary bits are told. If you muddy the water by introducing meaningless details and “red herrings”, you essentially irritate your listeners for no good reason.

At least have a second look at the changeset of your merge request. Is there any change present by mishap? That happens often and is not a sign of bad development. It is a sign of neglected care for your teammates if you let it show up in the changeset review, though.

Bring yourself in a position where you are fully aware of your story and the bits that tell it.

Finished

Your merge request should tell a complete story, not a fragment of it. It also shouldn’t show the auxiliary constructs you employ while developing your changes. These constructs might be debug output statements, commented out code, comments in the code that serve as a reminder for yourself (TODO comments play a big role in this regard) or extra code that ultimately proved to be unnecessary.

Your merge request should not contain any of that. After the merge, the resulting code base is regarded as “finished”. Any temporary content will stay forever.

Explained

Your story is probably very clear and recognizable. If not or if parts of it are more obscure than you hoped it would be, it is good practice to provide an explanation. It is your decision to explain yourself in the code (in form of an inline comment) or in the merge request itself (in form of a merge request comment). My preference tends to be the latter, because the comment will not be part of the “eternal” code base. The comment is a tool to help the reviewer grasp the details. Your approach may vary, but there needs to be some form of extra explanation from your side if your code isn’t crystal clear.

Manageable

The best stories are short. Humans aren’t good at keeping large numbers of details in their head and your reviewer is no exception. Keep your merge request small to make the review bearable. A good rule of thumb is a merge request containing no more than 10 files or 250 lines of changed code. While these numbers are arbitrary, they serve the purpose to give you a threshold you can check. You can’t check the real threshold of your reviewer, so try to stay below it.

If you find yourself in a position where a lot more files are touched and/or the changed lines of code are way above 250, you should think about what lead you there. These changes didn’t happen on their own. Your work produced them. Can you adjust your workflow so that smaller milestones are possible? What was the developer action that produced most of these changes? Are we looking at a shotgun surgery?

It is easier to review several small merge requests than one big one. Big, unmanageable merge requests may happen, but they should be the (painful) exception. Learn to work in episodes.

Separated

One aspect of software development is that we tend to mix two types of development into one stream of actions: There is development that produces new domain functionality and is inherently domain-based. And then, there is development that is required by technology but probably can’t be explained to the domain expert because it has nothing to do with the domain. Both types of code work together to provide the domain functionality.

But if you look at it from a story-telling aspect, then domain code is a novel while the technical code is more like a manual. Your domain code is probably unique while your technical code very much looks the same as in the next project. You should try to separate both types of code in your code base. And you should definitely separate them in your merge requests.

Keeping your domain related changes separate from your technical changes gives your reviewer a chance to ask the right questions. With domain code, some aspects are that way because the expert says so. There is no “universally correct way” to do these things. It might look strange or even wrong, but that’s the rule of the domain.

Technical changes, on the other hand, tend to look the same, no matter the domain. You can apply technical experience to these kind of changes regardless of context. Refactorings are the typical member of these changes. Renaming a variable or method will inevitably lead to a lot of changes in a lot of files. But it won’t affect the domain functionality (that’s the definition of a refactoring!). Keep your refactorings out of domain changesets to keep your reviewers happy.

What next?

There is a lot of adaption to be done from a “normal” development practice to one that tends to produce merge requests with these characteristics. These changes in behaviour don’t happen over night.

My proposal would be to start with one aspect and focus on it for one month. Starting with “Atomic” will have the most effect, but you can choose your starting point according to your preference. Just keep it for one month. After that, you can choose to focus on another aspect for a month or add another aspect to your focus group and now keep two aspects in mind. The main goal of this approach is to form a habit that enables you to produce better merge requests without really thinking about it. It might take some time, lets say a year or even two to incorporate all aspects in your day-to-day working style.

After that, you’ll probably look back on your earlier merge requests and smile because you see proof that you can do better now.

The IT architect, Part III: Improve your environment

If you happen to work on a system that scales to the size of an IT landscape, your worst bet is to let it evolve by circumstances. You want to have a plan and act upon that plan. The base for your plan could be a landscape map, which we talked about in the first part of this series. Upon drawing the map, you want to interpret it in order to find the strong points and weak spots. We’ve talked about assessing the map in the second part of this series.

In this blog article, we look at ways to improve our IT landscape towards the goal of overall stability.

Our mission statement

If we want to improve things, we need to know in what aspect the improvement should occur. At the scale of an IT landscape, overall stability is a commonly desired trait. This doesn’t mean rigidity, where you cannot change a thing in the landscape lest the whole thing breaks. It also doesn’t mean that ever part of our landscape needs to be stable itself. Overall stability means that even with the inevitable outage or replacement of a part, the whole system still works. The system is resilient to change and failure, at least resilient enough for the organization working with the system.

If our mission is to improve towards overall stability, we need to work on the relationships between our services (or assets, as we called them earlier, because “service” is a greatly overloaded term) more than we need to work on the service itself.

This doesn’t mean that individual stability of an asset isn’t important. It certainly is, but more often than not, you cannot improve this single value that much. What you can iterate on with recognizable effect is limiting the consequences of lacking individual stability.

Our mantra

The fundamental rule that brings overall stability is the “dependency rule” of the clean architecture that is meant for internal software application architecture. But if we see our IT landscape as one big application (software or not might make less of a difference than thought), we can apply the rule without modification:

All dependencies point towards the center (inside) and never in the opposite direction (outside).

That’s it. You define a center of your map and have all dependencies point towards it. This results in a structure of “rings” around the center that denote different levels of stability. The dependency rule can be rewritten as such:

All dependencies point from the less stable asset to the more stable asset and never in the opposite direction.

If you think of stability only in terms of “service availability” that tells you the percentage of time you can utilize the service without degradation, you’re thinking too short. Stability also means stable interface and stable implementation. You can have a really rock solid ISDN internet connection at the center of your IT landscape map, if your ISP discontinues the technology, the lack of implementation stability will force you to change the asset and hope that all dependent assets (basically your whole map) are not affected by the change.

Planning for obsolescence

Trying to bring the relationships between your assets in congruence with their significance for your IT landscape is the central work of an IT architect. The main question is always: What happens if this asset needs to be replaced?

In IT, there is no such thing as an “eternally working asset”. I’m not well-versed in more physical domains like mechanical engineering to say that this an univeral invariant, but in my field of speciality, everything changes eventually.

If you create an IT landscape where every asset can be replaced with manageable effort and predictable consequences, you’ve created an overall stable system. You can probably improve the availability of parts of it, but you won’t need to overhaul the whole thing over and over again. Your IT landscape is ready to grow, evolve and change, but it does so in a controlled manner and without compromising the mission.

Anti-obsolescence patterns

On your way from your current map to your anticipated one, you’ll recognize recurring patterns that you employ to solve dependency problems or improve the longevity of overall structures. Here are three patterns that have helped me in my endeavours:

Protected variation

If you have more than one implementation for basically the same service (like the example of an internet connection), you probably want the rest of the map to not know about the multiplicity. In this case, you introduce an additional asset that acts as a router between the implementations. Think of the router (or service interface) as a guarding wall for your service implementations. It acts as a “portal” to the real service and can be paper-thin (at least for now). If you want to improve the runtime availability of your service, the router can also act as a load balancer and a circuit breaker. The important rule is that all outside relationships only point to the router, not the actual implementations.

Opinionated interface

If you find an asset that has a lot of incoming dependencies, you’ve found a change risk. If you swap the service for a newer version with a similar, but not quite equivalent interface, you’ll find that you have to adjust lots of dependent assets in oftentimes surprising fashion. You can reduce your surprise by introducing a “portal” interface, just like you did in the protected variations, but without the variations. The portal or “opinionated service” interface offers everything your other assets require of the original service, but nothing more. It captures the “opinion” of your organization towards the service. When you introduce such a portal, it is nothing more than a forwarding service that maybe handles authentication itself. If you plan to swap the service, the portal becomes your requirement list and its new implementation will convert data back and forth.

If you find that your portal gets to big, you could think about multiple portals with their own separated “opinion” about the service, all forwarding to the same “source of truth”:

Now you need to maintain several interface services, but they separate different concerns or contexts into separate assets, which might help with future migrations. Chances are, if there are separate concerns, that they will be provided by separate assets in the future.

Circular portals

The most nasty thing to occur on your map is the ring (see part II of the series). In its smallest form, its just two assets requiring each other:

There are no easy ways out of this situation. But we can make some steps in the right direction and see where it takes us. The first step is introducing buffer assets that act as stand-ins for the real asset:

This doesn’t break the ring yet, but it gives us a chance to do so later. The service interfaces are opinionated and maybe even tailored specifically to the service using it. This reduces the “area of dependency” to its minimum. With a little luck, we find that Service A requires things of Service B that, if isolated, don’t require things of Service A themselves. If that’s the case, we can work on splitting Service B in two parts: One dependent on Service A, but not required by it and one indepedent of Service A, but needed by it. This would break the ring and give us a long chain that is much easier to work with. The problem is: None of this is guaranteed. In the worst case, you’ll still end up with your circular dependency with extra steps and nothing can be done about it.

Conclusion

When you begin to work with your IT landscape map, you begin to transform your assets from what they provide to what others actually require from them. Minimizing the relationships between assets, if not by number then at least by scope, is an appreciable improvement that gives you leeway to make changes in your actual IT setup without compromising the overall structure.

If you accompany your journey towards the best-fitting IT landscape with your map, you always have a plan at hands that you can show to people to form a shared understanding of the current state and the desired outcome. And if you keep old versions of your map in the archives, you can sometimes look back and see how far you’ve come yet.

The IT architect, Part II: Assess your situation

If you want to work on the scale of an IT landscape, you need to have a plan in the form of a map. In the first part of this series, we talked about creating such a map. This blog entry will give you the basic tools to make sense of all the things on it and how to convey meaning to other people while using the map.

The third part will talk about actionable steps that are a result of our interpretation of the map.

Making sense of the map

You’ve drawn the map of all your IT assets and given all the boxes names that you find useful. You’ve asked around to find relationships between your assets, represented by arrows between the boxes. You’ve moved the boxes around a bit to reduce arrow intersections. The map seems to be as “clean” as it can get at the moment.

Now is the time to apply meaning to the structures you see.

Interpreting loners

The first thing you want to look for are boxes without any relationships. These entities don’t interact with other things on your map and are not required by anything, too. Let’s think of them as independent value sources. If this asset brings your organization a describable and current advantage, you’ve found the ideal asset.

An example could be the blue box “L” in our example map. It isn’t coupled to any other asset. Let’s say it is a “customer relationship management” (CRM) system. Remember, boxes are not labeled by their actual implementation (in this case, maybe a vTiger or SugarCRM), but by the value they provide for the organization. If your organization needs a CRM (or benefits from its presence), then you have a “loner”, which is a good thing.

If the CRM stops working, the humans in the organization will be unhappy about it, but the outage itself will be limited to the CRM and not spread to other parts of your IT landscape (given that your map reflects the reality). If the outage lasts longer, your employees will adapt their work processes to circumvent the pothole in your IT. There will be a lot of post-it notes, at least for some time.

If the CRM is updated to a new version, you need to train your employees, but it won’t require other IT entities in your organization to match that update. The CRM can run on ancient hardware and software, as long as the human requirements are met. A loner on your map is a good thing.

Interpreting relicts

If you find a lonely box without a current use case, you’ve found a relict. Be glad that you’ve found it, because relicts tend to remain hidden and not show up on architect maps. If you can make sure that the relict serves no purpose for the organization anymore, you can eliminate it. Removing an asset from the map (and your real IT infrastructure) is a good thing, because you reduce complexity, costs and risks. There is no IT asset without associated costs and risks.

If, for example, the yellow box “P” represents a computer that provides a service that nobody uses anymore, the computer itself is still present in the network and can be used as a stepping stone for malicious itents. Let’s say the computer is a Raspberry Pi that isn’t included in the first tier of workhorse computers, its operating system might be outdated and susceptible to attacks. It doesn’t provide value for the organization anymore, but it increases the organization’s risk.

Revealing this kind of “dead weight” in your IT landscape is a real advantage, because you can cut it out rather easy.

Interpreting rings

A typical structure on your map could be a circular dependency. In its smallest form, it is just two boxes that both depend on each other. The more elaborate ring consists of several boxes that are connected without a clear start and end. This is the worst thing to find.

A ring in your entities means that you have to consider all elements in the ring as one big entity. You cannot modify them independently, neither on the technological level nor on in the temporal dimension. A ring is basically a mexican standoff situation for all included entities. You can also call it a deadlock. Whatever you call it, it is bad news. You probably want to break the ring as soon as possible.

Breaking a ring would warrant its own blog post altogether. A basic starting point might be the Acyclic dependencies principle of software design. You probably need to split at least one of your entities into smaller parts or introduce a new entity. The least favorable move would be to merge all entities into one bigger entity, creating a monolith. You will regret this move when the inevitable modernization pressure rises.

Interpreting chains

If your entities form “deep” dependency lines where A depends on B, B depends on C, C depends on D and so forth, you have discovered a chain. This structure is less troublesome compared to the ring, but worth a worry nonetheless. In terms of operational risk, the chain creates a meta-system with a failure rate that is the sum of the failure rates of the chain elements. To make a long story short, you’ll never get a reliable infrastructure with long chains.

The longer your chains are, the more ripple effects an outage will have on your IT landscape. Remember that a chain always breaks at its weakest link, but this link will bring down the whole line.

You can reduce the length of a chain of entities in your IT landscape by inserting buffer elements like read-only copies of central data sources. But more important is to think (and talk) about why the dependencies are there in the first place. Maybe your data storage strategy is too decentralized and you would gain some favorable dependency structures by pooling data together (essentially creating a data monolith if you overdo it).

Introducing zones

Recognizing the basic shapes on your map is important, but you also have to look at the forest and not only the trees. The basic layout of your boxes already tell you a lot about your IT landscape zones.

A zone on your map is a region of boxes that you can encircle and give it a superordinate name. The basic rule of a zone is that all entities in it should share a common property. The less technology-based this property is, the better is your zoning. A zone for “java web services” or “metal computers” is eventually useful, but won’t stand the test of time. Sooner or later, some java services are replaced by other programming languages and some real machines get virtualized. Do you move them to other zones on your map? What really changed for the users of your IT landscape?

If you concentrate on your users, you might be able to come up with properties that really affect them. Look at this example that takes our initial example and separates it into three zones:

And now, we find a user-oriented name for each zone. In our example, we’ve grouped the entities by user role and are now able to label our zones:

This grouping has the added advantage that the target audience for each modification to the map can be identified nearly immediately. It makes it easier to anticipate the effects of outages or problems and to identify non-cohesive usage of the same tool/entity.

In our example, each box in the “Both” zone is essential to the functioning of the organization. But just because a specific service is used by both other groups doesn’t mean they have overlapping requirements. Maybe it is better for everybody involved to actually divide an entity into two separate boxes in the respective zones, even if both boxes are implemented with the exact same tool/technology at the moment.

Identifying the zones takes your map to the next level. You end up with fewer, but bigger boxes and their dependencies. It’s the same IT landscape, but with less detail. Now you can start your discovery process again.

Conclusion

Your IT landscape map can be interpreted by looking for common structures (like loners, rings and chains) and by defining zones. This allows us to gather a list of problem points that we want to improve. It also allows to evaluate the expectable ramifications of changes to entities in our IT landscape. And there will be changes. The one (and probably only) constant in IT is that all things change.

In the next part of this series, we look at ways to transform the map from the current state towards a better one. Stay tuned!

Upgrade with a twist

A few weeks ago, I heard a nice story about the hidden cost of new features. Imagine a website, driven by a content management system, consisting of text, pictures and fancy styling. When the content management system gets an update, the website developer takes a look at the release notes and finds that a lot of new and cool features are included that you’ll get for free once you update.

So he updates the site, tries it out and publishes it onto the web. A few days later, the customer and owner of the website sends a bug report of some arbitrarily flipped images. There are just short of a hundred images on the website and a handful of them now show up upside down.

Who would update a website and randomly rotate some images?

Why would a content management system decide that excatly these images need a spin?

The answer is not as obvious as one might think.

The latest cause of the effect was a change of the imaging library the content management system uses to deliver the image content. It got upgraded to a new engine that essentially does the same thing as the old one: Take the image file content and put it on the web. But, it does it more thoroughly.

One feature of JPEG images are the EXIF metadata properties. Examples of useful properties are the photography time, the geolocation or the camera model. Some cameras add even more information into the metadata, like exposure time or the camera’s orientation (rotation) during the photographing process. There are cameras that notice if you hold them upside down and store this circumstance into the picture.

Then, there are imaging libraries that just take the pixels and put them on the screen. And there are libraries that know about their domain and read the EXIF metadata, interpret the rotation data and accomodate for that fact. Because, who would like to look at pictures that are displayed totally wrong?

The first version of the content management system’s imaging library didn’t care much about metadata. The new version takes rotation into account.

So, the cause of the suddenly rotated pictures originates with the photographer that happened to work during a workout session or in australia. This fact was registered and stored by the camera and promply ignored by the picture editing software and the earlier content management system. It was rediscovered only when the new version went live.

For the customer, this is a random regression. It worked just fine all those years! For the developer, this is a minefield. Every picture could contain an evil rotation information that gets applied someday.

For a security engineer, this is a harmless but perfect example of a persistence attack. You embed malicious payloads into data that do nothing for a long time, but are activated suddenly, without outside intervention, by an unrelated change of system parts towards a “lucky” constellation.

Guess what you can embed into EXIF metadata, too? Javascript or any other form of executable code. And then you wait.

To end this blog entry on a light note, sometimes the payload may just happen to be your last name – True!

The IT architect, Part I: Map your assets

When I’m tasked with commenting on a software architecture, my first step is to request or draw a map of all distinguishable elements of the software system and give them relationships to each other. This inevitably results in a boxes-and-arrows type of diagram that serves as a base for all future communication about the subject. Having a shared representation about a system is a great way to pinpoint discussions and focus on a particular area without forgetting the rest completely.

When I’m tasked with commenting on IT infrastructure, my first step is to request or draw a map of all distinguishable elements of the IT architecture (or “IT landscape”, a term that I actually prefer because it conveys better that a lot of things on this scale happen unplanned) and give them relationships to each other. Once again, we are drawing boxes and connecting them with arrows.

Being able to rely on this map is an essential base for all communication about IT architecture. And if you know how to read the map, it directs your efforts of consolidating your IT architecture nearly intuitively.

In this blog entry, we talk about drawing the map. The second part goes into interpretation of the map, the third part emphasizes actionable steps based on the map and our interpretation of it. Based on questions and discussion, there might even be a fourth part, but that’s not planned yet.

Your initial boxes

Beginning the IT architecture map is easy: Draw a box and give it a name. The name should correspond to an element of your work environment that is distinguishable from other elements. Note how I don’t say “system” or “service” or “server”. For an IT architecture, these words describe an implementation, a particular manifestation of the architecture. They don’t belong on the map (or this map). If you cannot see the difference yet, think about the floor plan of a house. It doesn’t tell you about the material the house is made of and you can use the same floor plan for a wooden cabin or a marble mansion (barring some pesky statics limitations that I don’t have a clue of). In our IT architecture map, each box represents a “thing” that will at the latest get a name the minute it stops working.

There will be boxes in your IT architecture map that don’t relate to anything else. That’s fine and not a problem, as long as the box relates to humans. If you cannot find a meaningful relationship between the box and humans or other boxes, you’ve found a relict. This is in fact one of the hardest tasks in IT architecture analysis, so congratulations!

Adding relationships

Every other box interacts with its environment in some manner. Again, the concrete implementation of that interaction is not important for our map. For our current view on the landscape, it makes no difference if a software system uses HTTP calls to a server or a computer tranfers bytes over RS232 wire to an appliance box. The fact that one box relies on the availability of another box is all that matters. That’s the essence of our arrows: Box 1 requires box 2 to be “online” in order to perform its duties. Without box 2, the functionality offered by box 1 will be limited, down to a point where it is no longer useful to others. Our arrows denote dependencies between boxes. If you happen to be a software developer: we don’t talk about code dependencies here. Also, even if closely related, we don’t mean format or protocol dependencies. We just state that if box 2 “goes down”, box 1 will follow closely.

This is the base for a rule of thumb about dependency arrows: Don’t draw them bidirectionally. Each arrow has one clear direction (like box 1 –> box 2). If you find that box 2 also depends on box 1, you should draw two arrows in opposite directions. As a preview for the interpretation step: This dependency cycle is a sore spot in your current architecture. It means that your two boxes appear as one to the outside. It means that you cannot replace one part without the other. The replaceability of single boxes is an important aspect of your landscape’s health.

Making it readable

When you’ve placed your boxes and drawn the arrows, it’s time to improve on the map’s layout. A guideline for the layout is that arrows shouldn’t intersect each other. Another guideline is that boxes that are semantically related should be near each other on the map. These two requirements alone often result in a lot of movement and experiments. You might want to use a software that allows for these experiments without much effort.

You’ll recognize a fitting layout when you see it. The map corresponds to your internal landscape representation enough to be useful in discussions. It might look like this real example:

First thing you’ll notice it that the names are replaced by denotations with zero meaning. In a real map, the box “C” might be named “time tracking” and box “D” could be labelled “issue tracking”. The name should indicate the responsibility of the element/box. You can also add the current implementation of that responsibility, if that makes things clearer. In our example, box “D”, indicating “issue tracking” might have “(JIRA)” added to the description. Just be aware that your organization probably needs another issue tracking system in that place even if JIRA falls out of favour. Following your arrows backwards, you’ll know which other elements of your landscape will be affected by this replacement. More on that in the next part about interpreting the map.

Evolving the map

Another thing you probably scoffed at is the intersecting arrows in the example. The map’s author came up with this layout as the best representation when the map had fewer boxes. With each subsequently added box, another arrow or two tried to reach the “center”. The intersections are a direct consequence of the emergence of a “center”. This is an important finding of your map: Being able to identify your map’s center and deduce meaning from it. To spoiler a bit: If your center is “time tracking” and “issue tracking”, you probably charge money per hour to solve other people’s problems.

Conclusion

You’ve probably seen how drawing an IT landscape map can benefit your organization and your discussions about its present and future. One thing you should keep in mind is that the map should reflect the current state and not your desired state of your organization’s IT architecture. That’s what will be addressed in part 3 of this series. Stay tuned!

Want to read more? Head over to part II of this series.