Always apply the Principle Of Least Astonishment to yourself, too

Great principles have the property that while they can be stated in a concise form, they have far-reaching consequences that one can only fully appreciate after many years of encountering them.

One of these is what is known as the Principle of Least Astonishment, or Principle of Least Surprise. In the context of user interface design, its upshot is “Never surprise the user!”. Within that context, it is easily understandable for everyone who has ever used any piece of software: not once were you glad that the software didn’t work as suggested. Or did you ever feel that way?

Surprise is a tool for willful suspension, for entertainment, a tool of unnecessary complication; exactly what you do not want in the things that are supposed to make your job easy.

Now we can all agree about that and go home. Right? But of course, there’s a large difference between grasping a concept in its most superficial manifestation and grasping its elusive, underlying sense.

Consider any software project that cannot be reduced to a mere single-purpose module with a clear progression – in other words, anything that is more than just a script. You might have a backend component with loads of requirements, some database, some caching functionality; then you want a new frontend in some fancy fresh web technology, and there’s going to be some conflict of interest in your developer team.

There will be some rather smart ways of accomplishing something and there will be rather not-so-smart ways. How do you know which will be which? Follow the principle: Never surprise anyone. Not only your end user. Do not surprise any other team member with something “clever”. In most situations,

  1. it’s probably not clever at all
  2. the team member being fooled by you is yourself

Collaboration is a good tool to let that conflict naturally arise. I mean the good kind of conflict, not the mistrust, denial of competency, “Ctrl+A and Delete everything you ever wrote!”-kind of conflict. Just the one where someone would tell you “hm. that behaviour is… astonishing.”

But you don’t have a team member in every small project you do. So just remember to account for the factor of surprise in everything you leave behind. Do not think “as of right now, I understand this thing, ergo this is not of any surprise to anyone, ever”. Think, “when I leave this code for two months and return, will there be anything… of surprise?”

This principle has many manifestations. As one of Jakob Nielsen’s usability heuristics, it’s called “Recognition rather than recall”. In a more universal way of improving human performance and clarity, it’s called “Reduce Cognitive Load”. It has a wide range of applicability, from user interfaces to state management, database structures, or general software architecture. I like the focus on “surprise”, because it should be rather easy for you to admit feeling surprised, even by your own doing.

Wear parts in software

I want to preface my thoughts with the story that originally sparked them (and yes, I oftentimes think about software development when unrelated things happen in the real world).

I don’t own a car myself, but I’m a non-hesitant user of rental cars and car sharing services. So when I have to drive long distances, I use many different models of cars. One model family is the Opel Corsa line of compact cars, of which I’ve driven the models A to C and, in this story, model D.

It was on the way back, on the highway, when darkness settled in. I switched on the headlamps and noticed that one of them was not working. In Germany, this means that your car is unfit for travel and you should stop. You cannot stop on the highway, so I continued driving towards the next gas and service station.

Inside the station, I headed to the shelf with car spare parts and searched for a lightbulb for a Corsa model D. Finding the lightbulbs for A, B and C was easy, but the bulbs for D weren’t there. In fact, there wasn’t even a place for them on the shelf. I asked the clerk for help and he laughed. They didn’t sell lightbulbs for the Corsa model D because changing them wasn’t possible for the layman.

To change a lightbulb in my car, you have to remove the engine block, exchange the lightbulb and install the engine block again. You need to perform this process in a repair shop and be attentive to accidental leakage and connector damage.

Let me summarize the process: To replace an ordinary wear part, you have to perform delicate expert work.

This design paradigm seems to be on the rise with consumer products. If you know how to change the battery on your smartphone or laptop, you probably explicitly chose the device because of this feature.

Interestingly, the trend is reversed for software development. Our architectures and design efforts try to separate primary code from wear part code. Development principles like SRP (Single Responsibility Principle) or OCP (Open/Closed Principle) have the “wear part code” metaphor in mind, even if it isn’t communicated with such clarity.

In the field of architecture, the microservice paradigm maps a complex mechanism onto several small and isolated parts. The isolation aspect is crucial because it promotes replaceability – you don’t need to remove and reinstall a central microservice if you want to replace a more peripheral one. And even the notion of “central and peripheral” services indicates the existence and consideration of an abrasion effect.

For a single application, the clean, hexagonal or onion architecture layout makes the “wear part code” metaphor the central aspect of your code positioning. The goal is to prepare for the inevitable technology replacement and not to act surprised if the thing you chose as your baseplate turns out to behave like rotting wood.
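To make the metaphor a bit more tangible, here is a minimal sketch in JavaScript (all names are made up for illustration): the primary code owns a small port, while the concrete storage technology sits behind an adapter – the designated wear part.

///////////////////////////////
// core/reportService.js – primary code, knows nothing about the storage technology
///////////////////////////////
export function createReportService(reportStore) {
    return {
        async monthlyTotal(month) {
            const entries = await reportStore.findByMonth(month);
            return entries.reduce((sum, entry) => sum + entry.amount, 0);
        },
    };
}

///////////////////////////////
// adapters/postgresReportStore.js – wear part, replaced when the technology rots
///////////////////////////////
export function createPostgresReportStore(db) {
    return {
        // assumes some database client with a query() method
        findByMonth: (month) =>
            db.query("SELECT amount FROM entries WHERE month = $1", [month]),
    };
}

Swapping the adapter for a different database (or an in-memory fake in a test) never touches the primary code – the green and blue text markers from the experiment described below would stay neatly separated.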

A good product design (at least for the customer/user) facilitates maintainability by making simple upkeep tasks easy.

We software developers weren’t expected to produce good products because the technological environment moved faster than the wear and nobody but ourselves could inspect the product anyway.

If a field moves faster than the abrasion can occur, longevity of a product is not a primary concern. Your smartphone will be outdated and replaced long before the battery is worn out. There is simply no need to choose wear parts that live longer than the main product. My postulation is that software development as a field has slowed down enough to make the major abrasive factors and areas discernible.

If nobody can inspect the software product and evaluate its sustainability, at least the original developer can, right? You can check for yourself with a simple experiment. Print the source code of your software (or parts of it), take two text markers (my favorite colors for this kind of approach are green and blue) and mark the code you deem primary with the first text marker. Any code you consider a wear part gets colored with the second marker. If you find it difficult to make the distinction or if the colors are mingled all over the place, this might be an indication that you could improve things.

What is a wear part in software? I would love to hear your thoughts and definitions in the comment section! My description, with no claim to be complete, would be any code that has a high probability to change because of one of the following reasons:

  • The customer/user is forced to make a change request by external forces like legal regulation
  • Another software/system/service changes, forcing your software to adjust its understanding of its surroundings
  • The technical field moved, changing your perception of the code

If you plan for maintainability in software development, you always plan for obsolescence and replacement. Our wear parts are different from mechanical ones in their uniqueness – we don’t replace a lightbulb with the same model, we replace unique code with different, but also unique code. But the concept of wear parts is the same:

Things that are likely to be replaced are designed for easy replacement.

Redux-Toolkit & Solving “ReferenceError: Access lexical declaration … before initialization”

Last week, I had a really annoying error in one of our React-Redux applications. It started with a believed-to-be-minor cleanup in our code, culminated in four developers staring at our code in disbelief and quite some research, and resulted in some rather feasible solutions that, in hindsight, look quite obvious (as is usually the case).

The tech landscape we are talking about here is a React webapp that employs state management via Redux-Toolkit / RTK, the abstraction layer above Redux to simplify the majority of standard use cases one has to deal with in current-day applications. Personally, I happen to find that useful, because it means a perceptible reduction of boilerplate Redux code (and some dependencies that you would use all the time anyway, like redux-thunk) while maintaining compatibility with the really useful Redux DevTools, and not requiring many new concepts. As our application makes good use of URL routing in order to display very different subparts, we implemented our own middleware that does the data fetching upfront in a major step (sometimes called “hydration”).
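Just to illustrate the idea (a heavily simplified sketch, not our actual middleware – the route action type is made up, and loadPage is the thunk that appears further below): the middleware lets the routing action pass through and then dispatches the upfront data fetching for the new subpart.

///////////////////////////////
// hydrationMiddleware.js (simplified sketch)
///////////////////////////////
import {loadPage} from "./state/loadPage.js";

export const hydrationMiddleware = (storeAPI) => (next) => (action) => {
    // let the routing action update the state first...
    const result = next(action);

    // ...then fetch everything the new route needs upfront ("hydration")
    if (action.type === "router/locationChanged") {
        storeAPI.dispatch(loadPage(action.payload));
    }

    return result;
};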

One of the basic ideas in Redux-Toolkit is the management of your state in substates called slices that aim to unify the handling of actions, action creators and reducers – essentially what was previously described as the Ducks pattern.
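In case you haven’t seen a slice yet, a minimal one looks roughly like this (a toy example, the names are made up):

///////////////////////////////
// userSlice.js (minimal example)
///////////////////////////////
import {createSlice} from "@reduxjs/toolkit";

const userSlice = createSlice({
    name: "user",
    initialState: {name: null},
    reducers: {
        // action creator and reducer are generated from this single entry
        setName: (state, action) => {
            state.name = action.payload;
        },
    },
});

export const {setName} = userSlice.actions;
export default userSlice.reducer;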

We provide unit tests with the jest framework, and generally speaking, it is more productive to test general logic instead of React components or Redux state updates (although we sometimes make use of that, too). Jest is very modular in the sense that you can add tests for any JavaScript function from wherever in your testing codebase; the only requirement, of course, is that these functions are exported from their respective files. This means that a single jest test only needs to resolve the imports that it depends on, recursively (i.e. the dependency tree), not the full application.

Now my case was as follows: I wrote a test that essentially was just testing a small switch/case decision function. I noticed there was something fishy when this test resulted in errors of the kind

  • Target container is not a DOM element. (pointing to ReactDOM.render)
  • No reducer provided for key “user” (pointing to node_modules redux/lib/redux.js)
  • Store does not have a valid reducer. Make sure the argument passed to combineReducers is an object whose values are reducers. (also …/redux.js)

This meant there was too much going on. My unit test should neither know of React nor Redux, and as the culprit, I found that one of the imports in the test file used another import that marginally depended on a slice definition, i.e.

///////////////////////////////
// test.js
///////////////////////////////
import {helper} from "./Helpers.js"
...

///////////////////////////////
// Helpers.js
///////////////////////////////
import {SOME_CONSTANT} from "./state/generalSlice.js"
...

Now I only needed some constant located in generalSlice, so one could easily move this to some “./const.js”. Or so I thought.

When I removed the generalSlice.js dependency from Helpers.js, the React application broke. That is, in a place totally unrelated:

ReferenceError: can't access lexical declaration 'loadPage' before initialization

./src/state/loadPage.js/</<
http:/.../static/js/main.chunk.js:11198:100
./src/state/topicSlice.js/<
C:/.../src/state/topicSlice.js:140
> [loadPage.pending]: (state, action) => {...}

From my past failures, I instantly recalled: This is a problem with circular dependencies.

Alas, topicSlice.js imports loadPage.js and loadPage.js imports topicSlice.js, and while some cases allow such a circle to be handled by webpack or similar bundlers, in general, such import loops can cause problems. And while I knew that before, this case was just difficult for me, because of the very nature of RTK.

So this is a thing with the RTK way of organizing files:

  • Every action that clearly belongs to one specific slice can be defined directly in that state file as a property of the “reducers” in createSlice().
  • Every action that is shared across files or consumed in more than one reducer (in more than one slice) can be defined as one of the “extraReducers” in that call.
  • Async logic like our loadPage is defined in thunks via createAsyncThunk(), which gives you a place suited for data fetching etc. and always comes with three action creators: loadPage.pending, loadPage.fulfilled and loadPage.rejected.
  • This looks like:
///////////////////////////////
// topicSlice.js
///////////////////////////////
import {createSlice} from '@reduxjs/toolkit';
import {loadPage} from './loadPage.js';

const topicSlice = createSlice({
    name: 'topic',
    initialState,
    reducers: {
        setTopic: (state, action) => {
            state.topic = action.payload;
        },
        ...
    },
    extraReducers: {
        [loadPage.pending]: (state, action) => {
              state.topic = initialState.topic;
        },
        ...
    },
});

export const { setTopic, ... } = topicSlice.actions;

And loadPage itself was a rather complex action creator (thunk) that could cause state dispatches itself; in simplified form, it was built as:

///////////////////////////////
// loadPage.js
///////////////////////////////
import {createAsyncThunk} from '@reduxjs/toolkit';
import {setTopic} from './topicSlice.js';

export const loadPage = createAsyncThunk('loadPage', async (args, thunkAPI) => {
    const response = await fetchAllOurData();

    if (someCondition(response)) {
        await thunkAPI.dispatch(setTopic(SOME_TOPIC));
    }

    return response;
});

You clearly see that import loop: loadPage needs setTopic from topicSlice.js, topicSlice needs loadPage from loadPage.js. This was rather old code that worked before, so it appeared to me that this is no problem per se – but solving that completely different dependency in Helpers.js (SOME_CONSTANT from generalSlice.js) made something break.

That was quite weird. It looked like this not-really-required import of SOME_CONSTANT made ./generalSlice.js load first, and along with it a certain set of imports including some of the dependencies of either loadPage.js or topicSlice.js, so that by the time those modules were loaded, there was no import loop left to resolve. However, it did not appear advisable to trace that to its core, because the application has already grown quite a bit. We needed a solution.

As I said before, it required the brainstorming of multiple developers to find a way of dealing with this. After all, RTK appeared mature enough for me to dismiss “that thing just isn’t fully thought through yet”. Still, splitting your code across several files is such a basic thing that one would expect some answer to that. What we came up with was:

  1. One could address the action creators like loadPage.pending (which is created as a result of RTK’s createAsyncThunk) by their string equivalent, i.e. ["loadPage/pending"] instead of [loadPage.pending] as key in the extraReducers of topicSlice. This will be a problem if one ever renames the action from “loadPage” to whatever (and your IDE and linter can’t help you as much with errors), which could be solved by writing one’s own action name factory that can be stashed away in a file with no imports of its own (see the sketch after this list).
  2. One could re-think the idea that setTopic should be among the normal reducers in topicSlice, i.e. created automatically. Instead, it can be created in its own file and then be referenced by loadPage.js and topicSlice.js in a non-circular manner as export const setTopic = createAction('setTopic'); and then accessed in extraReducers as [setTopic]: ... .
  3. One could think hard about the construction of loadPage. This whole thing is actually a hint that loadPage does too many things on too many different levels (i.e. it violates at least the principles of Single Responsibility and Single Level of Abstraction).
    1. One fix would be to at least do away with the automatically created loadPage.pending / loadPage.fulfilled / loadPage.rejected actions and instead define custom actions via createAction("loadPage.whatever"), with whatever describes your intention best, and put all of these in your own file (as in idea 2).
    2. Another fix would be splitting the parts of loadPage into separate thunks, and then reacting to each of their automatically created pending / fulfilled / rejected actions.
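To sketch idea 1 (a hypothetical helper, not what we ended up shipping): createAsyncThunk('loadPage', …) generates the action types "loadPage/pending", "loadPage/fulfilled" and "loadPage/rejected", so a tiny action name factory in a file without any imports of its own can never become part of an import cycle.

///////////////////////////////
// actionNames.js – no imports of its own, so it can never join an import cycle
///////////////////////////////
export const LOAD_PAGE = "loadPage";
export const pending = (name) => `${name}/pending`;
export const fulfilled = (name) => `${name}/fulfilled`;
export const rejected = (name) => `${name}/rejected`;

///////////////////////////////
// topicSlice.js – usage inside createSlice(), instead of [loadPage.pending]
///////////////////////////////
import {LOAD_PAGE, pending} from "./actionNames.js";
// ...
//    extraReducers: {
//        [pending(LOAD_PAGE)]: (state, action) => {
//            state.topic = initialState.topic;
//        },
//    },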

I chose idea 2 because it was the quickest, while allowing myself to let idea 3.1 rest a bit. I guess that next time, I should favor that because it makes the developer’s intention (as in… mine) more clear and the Redux DevTools output even more descriptive.

In case you’re still lost, my solution looks like this:

///////////////////////////////
// sharedTopicActions.js
///////////////////////////////
import {createAction} from "@reduxjs/toolkit";
export const setTopic = createAction('topic/set');
//...

///////////////////////////////
// topicSlice.js
///////////////////////////////
import {createSlice} from "@reduxjs/toolkit";
import {loadPage} from "./loadPage.js";
import {setTopic} from "./sharedTopicActions";
const topicSlice = createSlice({
    name: 'topic',
    initialState,
    reducers: {
        ...
    },
    extraReducers: {
        [setTopic]: (state, action) => {
            state.topic = action.payload;
        },

        [loadPage.pending]: (state, action) => {
              state.topic = initialState.topic;
        },
        ...
    },
});

///////////////////////////////
// loadPage.js, only change in this line:
///////////////////////////////
import {setTopic} from "./sharedTopicActions";
// ... Rest unchanged

So there’s a simple tool to break circular dependencies in more complex Redux-Toolkit slice structures. It was weird that it never occurred to me before, i.e. up until this day, I was always able to solve circular dependencies by shuffling other parts of the imports.

My problem is fixed. The application works as expected and now all the tests work as they should, everything is modular enough, and the required change was not a major structural redesign. It required some hard thinking but had a rather simple solution. I have trust in RTK again, and one can be safe again in the assumption that JavaScript imports are at least deterministic. Although I will never do the work to analyse what it actually was about my SOME_CONSTANT import that unknowingly fixed the problem beforehand.

Is there any reason to disfavor idea 3.1, though? Feel free to comment your own thoughts on that issue 🙂

Pattern Matching and the SLAP

You probably know that effect: One starts writing a lot of code in a new language, after a while gains a decent appreciation of this or that goodness and a decent annoyance about this or that oddity… Then you do some other project in other languages (the Softwareschneiderei projects are quite diverse in this respect), and each time you switch languages there’s that small moment of pondering certain design decisions.

Then after a while, sometimes there’s that feeling of a deep “ooooooh”, and you get an understanding of a fundamental mindset that probably must have been influential there. This is always interesting, because you can start to try using similar patterns in other languages, just to find out whether they are generally useful or not.

Now, as I’ve been writing quite some Rust code lately, I somehow started to like the way of pattern matching that it proposes. Suppose you have a structure that is some composition of multiple decisions that somehow belong together to a sufficient degree that you don’t split it up into multiple pieces. That might be the state of some file handling that was passed to your program, as an example. Such a structure could look like this (u32 is Rust’s unsigned 32-bit integer format):

enum InputState {
    Uninitialized,
    PlainText(String),
    ImageData {
        width: u32,
        height: u32,
        base64Data: String,
    },
    Error {
        code: u32,
        message: String
    }
}

Now, in order to read such a thing in a context, there’s the “match” statement, a kind of “switch/select with superpowers”, in that it allows you to simultaneously destructure its content to reduce unnecessary typing. This might look like

fn process_input(state: InputState) {
    match state {
        InputState::Uninitialized => {
            println!("Uninitialized. Nothing to do!");
            std::process::exit(0);
        },
        InputState::PlainText(str) => {
            display_string(str);
        },
        InputState::ImageData { base64Data, .. } => {
            println!("This seems to be an image and is now to be displayed");
            display_image(base64Data);
        }
        _ => (),
    }
}

(I probably forgot several &references in writing that example, but that’s Rust and not my point here :D). Anyway. At first, I was quite irritated with that – why does Rust want me to always include the placeholders (the _)? Why can’t it just let me take care of the stuff I want to take care of right now? Why does the compiler complain instead of always assuming that _ => (), i.e. if nothing else matches, do nothing? But I eventually found out.

To make my point, here is a comparison with a more loosely typed environment like JavaScript, where (as a quick sketch) one could have written that as

// inputState might be null, or {message: "bla"},
// or {width: ..., height:..., base64Data}, or {code: ..., message: "bla"}...

if (!inputState) { return; }
if (inputState.code) { /* Error case */ }
else if (inputState.base64Data) { /* Image case */ }
else if (inputState.message) { /* probably the PlainText case */ }

/* but are you sure you forgot nothing? */

Now my point is that these are not merely translations of the same idea between different languages. These are structurally different.

The latter example is a very microscopic view. I have a kind of squishy looking, alien thing called inputState that lies in the center of my operating table, I take the scalpel and dissect – what do we have here? what color is that? does this have bones? … Without further ado, you reach into the interior of whatever it is, and you’d better hope that you’re not in some kind of sci-fi horror movie… :O

The former, the pattern matching, however, is an approach more true to the original question. It stops you with your scalpel right at the beginning.

We first want to know all eventualities. Then cut where it makes sense, and free your mind from the first decision.

We just wanted to distinguish our proceeding depending on the general nature of our object of interest. We do not want to look into details at that moment, we just want to know where we are.

In the language of Clean Code, this is the Single Level of Abstraction Principle (SLAP). It states that each method you write should explicitly concern itself with a single, constant level of looking at a certain problem. E.g. low-level mathematical calculations like milliseconds vs. system time conversions should not appear next to high-level server initialization, for the simple reason of understandability – switching levels of abstraction is quite irritating for the human brain (i.e. for everyone who didn’t write that code). It breaks your line of thinking, especially when you are trying to find a bug or worse.

From my experience, I know that I myself would argue “well that’s not true; I can indeed hold multiple levels of abstraction in my head simultaneously!” and this isn’t really a lie. But I still see embarrassing mistakes later, of the type “of course there’s also that case. I should have known.”

Another metaphor: Imagine you just ask your friend about the time. She then directly initiates a very detailed lecture about the movement of Sun and Earth, paired with some epistemological considerations about the Heisenberg uncertainty principle, not to neglect the role of time in the concept of entropy. Would you rather respond with “Thanks, that helped” or “… are you on drugs?”?

With pattern matching, this is the same. Look at a single decision, and then first lay all the options open: “What is this?” – then go on to another method. “What to do about it?” is another level. “How?” goes deeper, “And how, actually, if you consider these or those additional conditions?” even deeper. And at the bottom there is something like components that just take one input and mangle those bytes without regard for higher morals.

It is a very helpful principle that still sometimes needs a little reminder. For me, I was just happy to see that kind-of-compulsive approach embodied by the Rust match operator.

So, how far do you think one should take it with SLAP? Do you manage to always follow it, or does it work differently for you?

PS: Of course, one could introduce the same caution by defining a similar control flow in JavaScript as well. There’s no need to break the SLAP. But it makes you the one responsible for keeping track, while in my Rust example you have the annoying linter / compiler do that for you.
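As a quick sketch of what that could look like in JavaScript (all handler names are made up): one switch over the general nature of the state, one tiny function per case, and a default branch that loudly reports the case you forgot – the discipline that the Rust compiler gives you for free.

// hypothetical per-case handlers, each one abstraction level below the switch
const handleUninitialized = () => console.log("Uninitialized. Nothing to do!");
const displayString = (text) => console.log(text);
const displayImage = (base64Data) => console.log("image with " + base64Data.length + " bytes");
const reportError = (code, message) => console.error(code + ": " + message);

function processInput(inputState) {
    // the top level only decides *what* we have, never *how* to handle it
    switch (inputState.kind) {
        case "uninitialized": return handleUninitialized();
        case "plainText":     return displayString(inputState.text);
        case "imageData":     return displayImage(inputState.base64Data);
        case "error":         return reportError(inputState.code, inputState.message);
        default:
            // the part the Rust compiler would have complained about for us
            throw new Error("Unhandled input state: " + inputState.kind);
    }
}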

The IT architect, Part III: Improve your environment

If you happen to work on a system that scales to the size of an IT landscape, your worst bet is to let it evolve by circumstances. You want to have a plan and act upon that plan. The base for your plan could be a landscape map, which we talked about in the first part of this series. Upon drawing the map, you want to interpret it in order to find the strong points and weak spots. We’ve talked about assessing the map in the second part of this series.

In this blog article, we look at ways to improve our IT landscape towards the goal of overall stability.

Our mission statement

If we want to improve things, we need to know in what aspect the improvement should occur. At the scale of an IT landscape, overall stability is a commonly desired trait. This doesn’t mean rigidity, where you cannot change a thing in the landscape lest the whole thing breaks. It also doesn’t mean that every part of our landscape needs to be stable itself. Overall stability means that even with the inevitable outage or replacement of a part, the whole system still works. The system is resilient to change and failure, at least resilient enough for the organization working with the system.

If our mission is to improve towards overall stability, we need to work on the relationships between our services (or assets, as we called them earlier, because “service” is a greatly overloaded term) more than we need to work on the services themselves.

This doesn’t mean that individual stability of an asset isn’t important. It certainly is, but more often than not, you cannot improve this single value that much. What you can iterate on with recognizable effect is limiting the consequences of lacking individual stability.

Our mantra

The fundamental rule that brings overall stability is the “dependency rule” of the clean architecture, which was originally meant for internal software application architecture. But if we see our IT landscape as one big application (software or not might make less of a difference than one would think), we can apply the rule without modification:

All dependencies point towards the center (inside) and never in the opposite direction (outside).

That’s it. You define a center of your map and have all dependencies point towards it. This results in a structure of “rings” around the center that denote different levels of stability. The dependency rule can be rewritten as such:

All dependencies point from the less stable asset to the more stable asset and never in the opposite direction.

If you think of stability only in terms of “service availability”, the percentage of time you can utilize the service without degradation, you’re not thinking far enough. Stability also means a stable interface and a stable implementation. You can have a really rock-solid ISDN internet connection at the center of your IT landscape map; if your ISP discontinues the technology, the lack of implementation stability will force you to change the asset and hope that all dependent assets (basically your whole map) are not affected by the change.

Planning for obsolescence

Trying to bring the relationships between your assets in congruence with their significance for your IT landscape is the central work of an IT architect. The main question is always: What happens if this asset needs to be replaced?

In IT, there is no such thing as an “eternally working asset”. I’m not well-versed enough in more physical domains like mechanical engineering to say whether this is a universal invariant, but in my field of speciality, everything changes eventually.

If you create an IT landscape where every asset can be replaced with manageable effort and predictable consequences, you’ve created an overall stable system. You can probably improve the availability of parts of it, but you won’t need to overhaul the whole thing over and over again. Your IT landscape is ready to grow, evolve and change, but it does so in a controlled manner and without compromising the mission.

Anti-obsolescence patterns

On your way from your current map to your anticipated one, you’ll recognize recurring patterns that you employ to solve dependency problems or improve the longevity of overall structures. Here are three patterns that have helped me in my endeavours:

Protected variation

If you have more than one implementation for basically the same service (like the example of an internet connection), you probably want the rest of the map to not know about the multiplicity. In this case, you introduce an additional asset that acts as a router between the implementations. Think of the router (or service interface) as a guarding wall for your service implementations. It acts as a “portal” to the real service and can be paper-thin (at least for now). If you want to improve the runtime availability of your service, the router can also act as a load balancer and a circuit breaker. The important rule is that all outside relationships only point to the router, not the actual implementations.
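Translated into a small code sketch (plain JavaScript, every name made up – in a real landscape the router might just as well be an HTTP gateway or a DNS entry): the rest of the map only ever talks to the router, which picks an implementation and is the natural place for load balancing or circuit breaking later on.

// two interchangeable implementations of "the internet connection"
const dslUplink = { send: (data) => console.log("via DSL:", data) };
const lteUplink = { send: (data) => console.log("via LTE:", data) };

// the router / service interface: the only asset others may depend on
function createUplinkRouter(implementations) {
    let current = 0;
    return {
        send(data) {
            // paper-thin for now; load balancing or a circuit breaker
            // can be added here without the callers noticing
            const uplink = implementations[current % implementations.length];
            current += 1;
            return uplink.send(data);
        },
    };
}

const uplink = createUplinkRouter([dslUplink, lteUplink]);
uplink.send("hello"); // callers never learn which implementation served them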

Opinionated interface

If you find an asset that has a lot of incoming dependencies, you’ve found a change risk. If you swap the service for a newer version with a similar, but not quite equivalent interface, you’ll find that you have to adjust lots of dependent assets in oftentimes surprising fashion. You can reduce your surprise by introducing a “portal” interface, just like you did in the protected variations, but without the variations. The portal or “opinionated service” interface offers everything your other assets require of the original service, but nothing more. It captures the “opinion” of your organization towards the service. When you introduce such a portal, it is nothing more than a forwarding service that maybe handles authentication itself. If you plan to swap the service, the portal becomes your requirement list and its new implementation will convert data back and forth.
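A small sketch of such a portal (plain JavaScript, the service and its URL are made up): it exposes exactly the two operations our other assets require and merely forwards them – if the underlying service is ever swapped, only this file has to convert data back and forth.

// opinionated portal in front of a hypothetical time-tracking service
const SERVICE_URL = "https://timetracking.example.internal";

export async function bookHours(employeeId, projectId, hours) {
    // today: plain forwarding (plus authentication, if needed)
    return fetch(SERVICE_URL + "/api/bookings", {
        method: "POST",
        headers: {"Content-Type": "application/json"},
        body: JSON.stringify({employeeId, projectId, hours}),
    });
}

export async function hoursPerProject(projectId) {
    const response = await fetch(SERVICE_URL + "/api/projects/" + projectId + "/hours");
    return response.json();
}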

If you find that your portal gets too big, you could think about multiple portals with their own separated “opinion” about the service, all forwarding to the same “source of truth”:

Now you need to maintain several interface services, but they separate different concerns or contexts into separate assets, which might help with future migrations. Chances are, if there are separate concerns, that they will be provided by separate assets in the future.

Circular portals

The nastiest thing to occur on your map is the ring (see part II of the series). In its smallest form, it’s just two assets requiring each other:

There are no easy ways out of this situation. But we can make some steps in the right direction and see where it takes us. The first step is introducing buffer assets that act as stand-ins for the real asset:

This doesn’t break the ring yet, but it gives us a chance to do so later. The service interfaces are opinionated and maybe even tailored specifically to the service using them. This reduces the “area of dependency” to its minimum. With a little luck, we find that Service A requires things of Service B that, if isolated, don’t require things of Service A themselves. If that’s the case, we can work on splitting Service B in two parts: one dependent on Service A but not required by it, and one independent of Service A but needed by it. This would break the ring and give us a long chain that is much easier to work with. The problem is: none of this is guaranteed. In the worst case, you’ll still end up with your circular dependency with extra steps and nothing can be done about it.

Conclusion

When you begin to work with your IT landscape map, you begin to transform your assets from what they provide to what others actually require from them. Minimizing the relationships between assets, if not by number then at least by scope, is an appreciable improvement that gives you leeway to make changes in your actual IT setup without compromising the overall structure.

If you accompany your journey towards the best-fitting IT landscape with your map, you always have a plan at hand that you can show to people to form a shared understanding of the current state and the desired outcome. And if you keep old versions of your map in the archives, you can sometimes look back and see how far you’ve come.

The IT architect, Part II: Assess your situation

If you want to work on the scale of an IT landscape, you need to have a plan in the form of a map. In the first part of this series, we talked about creating such a map. This blog entry will give you the basic tools to make sense of all the things on it and how to convey meaning to other people while using the map.

The third part will talk about actionable steps that are a result of our interpretation of the map.

Making sense of the map

You’ve drawn the map of all your IT assets and given all the boxes names that you find useful. You’ve asked around to find relationships between your assets, represented by arrows between the boxes. You’ve moved the boxes around a bit to reduce arrow intersections. The map seems to be as “clean” as it can get at the moment.

Now is the time to apply meaning to the structures you see.

Interpreting loners

The first thing you want to look for are boxes without any relationships. These entities don’t interact with other things on your map and are not required by anything else, either. Let’s think of them as independent value sources. If this asset brings your organization a describable and current advantage, you’ve found the ideal asset.

An example could be the blue box “L” in our example map. It isn’t coupled to any other asset. Let’s say it is a “customer relationship management” (CRM) system. Remember, boxes are not labeled by their actual implementation (in this case, maybe a vTiger or SugarCRM), but by the value they provide for the organization. If your organization needs a CRM (or benefits from its presence), then you have a “loner”, which is a good thing.

If the CRM stops working, the humans in the organization will be unhappy about it, but the outage itself will be limited to the CRM and not spread to other parts of your IT landscape (given that your map reflects the reality). If the outage lasts longer, your employees will adapt their work processes to circumvent the pothole in your IT. There will be a lot of post-it notes, at least for some time.

If the CRM is updated to a new version, you need to train your employees, but it won’t require other IT entities in your organization to match that update. The CRM can run on ancient hardware and software, as long as the human requirements are met. A loner on your map is a good thing.

Interpreting relicts

If you find a lonely box without a current use case, you’ve found a relict. Be glad that you’ve found it, because relicts tend to remain hidden and not show up on architect maps. If you can make sure that the relict serves no purpose for the organization anymore, you can eliminate it. Removing an asset from the map (and your real IT infrastructure) is a good thing, because you reduce complexity, costs and risks. There is no IT asset without associated costs and risks.

If, for example, the yellow box “P” represents a computer that provides a service that nobody uses anymore, the computer itself is still present in the network and can be used as a stepping stone for malicious intents. Let’s say the computer is a Raspberry Pi that isn’t included in the first tier of workhorse computers; its operating system might be outdated and susceptible to attacks. It doesn’t provide value for the organization anymore, but it increases the organization’s risk.

Revealing this kind of “dead weight” in your IT landscape is a real advantage, because you can cut it out rather easily.

Interpreting rings

A typical structure on your map could be a circular dependency. In its smallest form, it is just two boxes that both depend on each other. The more elaborate ring consists of several boxes that are connected without a clear start and end. This is the worst thing to find.

A ring in your entities means that you have to consider all elements in the ring as one big entity. You cannot modify them independently, neither on the technological level nor in the temporal dimension. A ring is basically a Mexican standoff situation for all included entities. You can also call it a deadlock. Whatever you call it, it is bad news. You probably want to break the ring as soon as possible.

Breaking a ring would warrant its own blog post altogether. A basic starting point might be the Acyclic dependencies principle of software design. You probably need to split at least one of your entities into smaller parts or introduce a new entity. The least favorable move would be to merge all entities into one bigger entity, creating a monolith. You will regret this move when the inevitable modernization pressure rises.

Interpreting chains

If your entities form “deep” dependency lines where A depends on B, B depends on C, C depends on D and so forth, you have discovered a chain. This structure is less troublesome compared to the ring, but worth a worry nonetheless. In terms of operational risk, the chain creates a meta-system whose failure rate is roughly the sum of the failure rates of the chain elements – or, put differently, whose availability is the product of the individual availabilities: three links with 99% availability each already leave you with only about 97% for the whole chain. To make a long story short, you’ll never get a reliable infrastructure with long chains.

The longer your chains are, the more ripple effects an outage will have on your IT landscape. Remember that a chain always breaks at its weakest link, but this link will bring down the whole line.

You can reduce the length of a chain of entities in your IT landscape by inserting buffer elements like read-only copies of central data sources. But more important is to think (and talk) about why the dependencies are there in the first place. Maybe your data storage strategy is too decentralized and you would gain some favorable dependency structures by pooling data together (essentially creating a data monolith if you overdo it).

Introducing zones

Recognizing the basic shapes on your map is important, but you also have to look at the forest and not only at the trees. The basic layout of your boxes already tells you a lot about your IT landscape zones.

A zone on your map is a region of boxes that you can encircle and give a superordinate name. The basic rule of a zone is that all entities in it should share a common property. The less technology-based this property is, the better your zoning. A zone for “java web services” or “metal computers” might be useful for a while, but won’t stand the test of time. Sooner or later, some Java services are replaced by other programming languages and some real machines get virtualized. Do you move them to other zones on your map? What really changed for the users of your IT landscape?

If you concentrate on your users, you might be able to come up with properties that really affect them. Look at this example that takes our initial example and separates it into three zones:

And now, we find a user-oriented name for each zone. In our example, we’ve grouped the entities by user role and are now able to label our zones:

This grouping has the added advantage that the target audience for each modification to the map can be identified nearly immediately. It makes it easier to anticipate the effects of outages or problems and to identify non-cohesive usage of the same tool/entity.

In our example, each box in the “Both” zone is essential to the functioning of the organization. But just because a specific service is used by both other groups doesn’t mean they have overlapping requirements. Maybe it is better for everybody involved to actually divide an entity into two separate boxes in the respective zones, even if both boxes are implemented with the exact same tool/technology at the moment.

Identifying the zones takes your map to the next level. You end up with fewer, but bigger boxes and their dependencies. It’s the same IT landscape, but with less detail. Now you can start your discovery process again.

Conclusion

Your IT landscape map can be interpreted by looking for common structures (like loners, rings and chains) and by defining zones. This allows us to gather a list of problem points that we want to improve. It also allows us to evaluate the expectable ramifications of changes to entities in our IT landscape. And there will be changes. The one (and probably only) constant in IT is that all things change.

In the next part of this series, we look at ways to transform the map from the current state towards a better one. Stay tuned!

The IT architect, Part I: Map your assets

When I’m tasked with commenting on a software architecture, my first step is to request or draw a map of all distinguishable elements of the software system and give them relationships to each other. This inevitably results in a boxes-and-arrows type of diagram that serves as a base for all future communication about the subject. Having a shared representation about a system is a great way to pinpoint discussions and focus on a particular area without forgetting the rest completely.

When I’m tasked with commenting on IT infrastructure, my first step is to request or draw a map of all distinguishable elements of the IT architecture (or “IT landscape”, a term that I actually prefer because it conveys better that a lot of things on this scale happen unplanned) and give them relationships to each other. Once again, we are drawing boxes and connecting them with arrows.

Being able to rely on this map is an essential base for all communication about IT architecture. And if you know how to read the map, it directs your efforts of consolidating your IT architecture nearly intuitively.

In this blog entry, we talk about drawing the map. The second part goes into interpretation of the map, the third part emphasizes actionable steps based on the map and our interpretation of it. Based on questions and discussion, there might even be a fourth part, but that’s not planned yet.

Your initial boxes

Beginning the IT architecture map is easy: Draw a box and give it a name. The name should correspond to an element of your work environment that is distinguishable from other elements. Note how I don’t say “system” or “service” or “server”. For an IT architecture, these words describe an implementation, a particular manifestation of the architecture. They don’t belong on the map (or at least not on this map). If you cannot see the difference yet, think about the floor plan of a house. It doesn’t tell you about the material the house is made of, and you can use the same floor plan for a wooden cabin or a marble mansion (barring some pesky statics limitations that I don’t have a clue about). In our IT architecture map, each box represents a “thing” that will, at the latest, get a name the minute it stops working.

There will be boxes in your IT architecture map that don’t relate to anything else. That’s fine and not a problem, as long as the box relates to humans. If you cannot find a meaningful relationship between the box and humans or other boxes, you’ve found a relict. This is in fact one of the hardest tasks in IT architecture analysis, so congratulations!

Adding relationships

Every other box interacts with its environment in some manner. Again, the concrete implementation of that interaction is not important for our map. For our current view on the landscape, it makes no difference if a software system uses HTTP calls to a server or a computer transfers bytes over an RS232 wire to an appliance box. The fact that one box relies on the availability of another box is all that matters. That’s the essence of our arrows: Box 1 requires box 2 to be “online” in order to perform its duties. Without box 2, the functionality offered by box 1 will be limited, down to a point where it is no longer useful to others. Our arrows denote dependencies between boxes. If you happen to be a software developer: we don’t talk about code dependencies here. Also, even if closely related, we don’t mean format or protocol dependencies. We just state that if box 2 “goes down”, box 1 will follow closely.

This is the base for a rule of thumb about dependency arrows: Don’t draw them bidirectionally. Each arrow has one clear direction (like box 1 –> box 2). If you find that box 2 also depends on box 1, you should draw two arrows in opposite directions. As a preview for the interpretation step: This dependency cycle is a sore spot in your current architecture. It means that your two boxes appear as one to the outside. It means that you cannot replace one part without the other. The replaceability of single boxes is an important aspect of your landscape’s health.

Making it readable

When you’ve placed your boxes and drawn the arrows, it’s time to improve on the map’s layout. A guideline for the layout is that arrows shouldn’t intersect each other. Another guideline is that boxes that are semantically related should be near each other on the map. These two requirements alone often result in a lot of movement and experiments. You might want to use a software that allows for these experiments without much effort.

You’ll recognize a fitting layout when you see it. The map corresponds to your internal landscape representation enough to be useful in discussions. It might look like this real example:

The first thing you’ll notice is that the names are replaced by denotations with zero meaning. In a real map, the box “C” might be named “time tracking” and box “D” could be labelled “issue tracking”. The name should indicate the responsibility of the element/box. You can also add the current implementation of that responsibility, if that makes things clearer. In our example, box “D”, indicating “issue tracking”, might have “(JIRA)” added to the description. Just be aware that your organization probably needs another issue tracking system in that place even if JIRA falls out of favour. Following your arrows backwards, you’ll know which other elements of your landscape will be affected by this replacement. More on that in the next part about interpreting the map.

Evolving the map

Another thing you probably scoffed at is the intersecting arrows in the example. The map’s author came up with this layout as the best representation when the map had fewer boxes. With each subsequently added box, another arrow or two tried to reach the “center”. The intersections are a direct consequence of the emergence of a “center”. This is an important finding of your map: being able to identify your map’s center and deduce meaning from it. To give a bit of a spoiler: If your center is “time tracking” and “issue tracking”, you probably charge money per hour to solve other people’s problems.

Conclusion

You’ve probably seen how drawing an IT landscape map can benefit your organization and your discussions about its present and future. One thing you should keep in mind is that the map should reflect the current state and not your desired state of your organization’s IT architecture. That’s what will be addressed in part 3 of this series. Stay tuned!

Want to read more? Head over to part II of this series.