A procedure to deal with large amounts of email

How can you survive the daily email flood and still keep track of your work? Here is one personal ruleset that adapts real-life habits to virtual message management.

The problem

You probably know the problem already: a day with fewer than 500 emails feels like your internet connection might be down. The number of emails you receive can accurately predict the time of day. In my case, I’ll always receive my 300th email of the day right before lunch. Imagine I had spent one minute on each message: I would have done nothing but read email until then. And by the time I return from lunch, more emails have found their way into my inbox. My job description is not “email reader”; it is actually one of the lesser prioritized activities of my job. But I keep my response times low and always know the content of my inbox. You’ll seldom hear “sorry, I haven’t seen your email yet” from me. How I keep the email flood in check is the topic of this blog entry. It’s my personal procedure, so nothing fancy with a big name, but you might recognize some influence from well-known approaches like “Getting Things Done” by David Allen.

The disclaimer

Disclaimer: You might entirely disagree with my approach. That’s totally acceptable. But keep in mind that it has worked for at least one person for a long time now, even if it doesn’t fit your style. Email processing seems to be a delicate topic, so please keep your comments constructive. By the way, I’d love to hear about your approach. I’m always eager to learn and improve.

The analogy

Let me start with a common analogy: your email inbox is like your mailbox. All letters you receive go through your mailbox. All emails you receive go through your inbox. That’s where this analogy ends, and it was never useful to begin with. Your postman won’t show up every five minutes to stuff more commercial mailings, letters, postcards and post-it notes into your mailbox (raising that little flag again that indicates the presence of mail). He also won’t announce himself by ringing your door bell (every five minutes, mind you) and proclaiming the first line of three arbitrarily chosen letters. Also, I’ve rarely seen mailboxes that contain hundreds if not thousands of letters, some read, some years old, in different states of decay. It’s a common sight for inboxes whose owners gave up on keeping up. I’ve seen high stacks of unanswered correspondence, but never in the mailbox. And this brings me to my new analogy for your email inbox: your email inbox is like your desk. The stacks of decaying letters and magazines? Always on desks (and around them in extreme cases). The letters you answer directly? You bring them to your desk first. Your desk is usually clear of pending work documents, and this should be the case for your inbox, too.


The rules

My procedure to deal with the continuous flood of emails is based on three rules:

  • The inbox is the only queue of emails that needs attention. It is only filled with new emails (which require activity from my side) or emails that require my attention in the foreseeable future. The inbox is therefore only filled with pending work.
  • Email processing is done manually. I look at each email once and hopefully only once. There are no automatic filters that sort emails into different queues before I’ve seen them.
  • Every interaction with an email results in an activity or decision on my side. No email gets “left there”.

Let me explain the context of the rules in a bit more detail:

My email account has lots and lots of folders to store all emails until eternity. The folders are organized in a hierarchy, but that doesn’t really matter, because every folder can hold emails. The hierarchy of folders isn’t pre-planned; it emerges from the urge to group emails together. It’s possible that I move specific emails from one folder to another because the hierarchy has changed. I will use automatic filters to process emails I’ve already read, but I will mark every email as read myself and not move unread emails around automatically. This narrows the search for new emails down to one place: the inbox. Every other folder is only for archiving, not for processing.

The sweeps

The amount of unread emails in my inbox is the amount of work I need to do to return to the “only pending work” state. Let’s say I open my email reader and it shows 50 new messages in the inbox. Now I’ll have to process and archive these 50 emails to be in the same state as before I opened it. I usually do three separate processing steps (a rough sketch in code follows the list):

  • The first sweep filters out any spam messages by their immediate tell-tale signs. This is the only automatic filter that I’ll allow: the junk filter. To train it, I mark any remaining junk mail as spam and let the filter deal with it. I’m still not sure if the junk filter really makes things easier for me, because I need to scan through the junk regularly to “rescue” false positives (legitimate mails that were wrongfully sorted out), but the junk filter in combination with my fast spam sweep lowers the message count significantly. In our example, we now have 30 mails left.
  • The second sweep picks every email that is for information purposes only. Usually those mails are sent by software tools like issue trackers, wikis, code review tools or others. Machines don’t feel the effort of writing an email, so they’ll write a lot. Most of the time, the message content is only a few lines of text. I grab each of these mails and drag it into its corresponding folder. While I’m dragging, I read (and memorize) the content of the email. The problem with this kind of information is that it’s a lot of very small chunks of data for a lot of different contexts. In order to understand those messages, you have to switch your mental context in a matter of seconds. You can do it, but only if you aren’t interrupted by different mental states. So ignore any email that requires more than a few seconds of focused attention from you. Let them sit in your inbox along with the emails that require an answer. The only activity for mails included in your second sweep is “drag to folder & memorize”. Because machines write often, we now have 10 mails left in our example.
  • The third sweep attends to each remaining mail individually. Here, the three-minute rule applies: if it takes less than three minutes to reply to the email right now, then do it right now and archive the mail in a suitable folder (you might even create a new folder for it). Remember: if you’ve processed an email, it leaves the inbox. If it takes more than three minutes, you need to schedule an attention slot for this particular message on your todo list. This is the only time an email remains in the inbox, because it’s a signal of pending work. In our example, 7 emails could be answered with short replies, but 3 require deep concentration or some more text for the answer.
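
For the programmatically inclined, here is a rough sketch of the three sweeps in Java. The Email and TodoList types and all their methods are made up purely for illustration; they only model the decisions described above.

import java.util.ArrayList;
import java.util.List;

interface Email {
    boolean looksLikeSpam();
    void markAsSpam();
    boolean isInformationOnly();
    String suggestedFolder();
    void moveToFolder(String folder);
    int estimatedReplyMinutes();
    void replyNow();
}

interface TodoList {
    void add(Email pendingWork);
}

public class InboxSweeper {

    public void processInbox(List<Email> inbox, TodoList todoList) {
        // First sweep: get rid of obvious junk and train the filter.
        inbox.removeIf(email -> {
            boolean junk = email.looksLikeSpam();
            if (junk) {
                email.markAsSpam();
            }
            return junk;
        });

        // Second sweep: archive information-only mails while memorizing them.
        List<Email> remaining = new ArrayList<>();
        for (Email email : inbox) {
            if (email.isInformationOnly()) {
                email.moveToFolder(email.suggestedFolder());
            } else {
                remaining.add(email);
            }
        }

        // Third sweep: answer quick mails now, schedule the rest.
        for (Email email : remaining) {
            if (email.estimatedReplyMinutes() < 3) {
                email.replyNow();
                email.moveToFolder(email.suggestedFolder());
            } else {
                todoList.add(email); // stays in the inbox as pending work
            }
        }
    }
}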

After the three sweeps, only emails that indicate pending work remain in the inbox. They only leave the inbox after I’ve dealt with them. I need to schedule a timebox to work on them, but after that, they’ll find themselves out of the inbox and in a folder. As soon as an email is in a folder, I forget about its existence. I need to remember the information that was in it, but not the message itself.

The effects

While dealing with each email manually sounds painfully slow at first, it becomes routine after only a short while. The three sweeps usually take less than five minutes for 50 emails, excluding the three major correspondence tasks that end up as individual items on my work schedule. Depending on your ratio of spam to information messages to real correspondence, your results may vary.

The big advantage of dealing with each email manually is that I’ve seen every message with my own eyes. Every email that an automatic filter grabs and hides before you can see it should not have been sent in the first place – it simulates a pull notification scheme (you decide when to receive it by opening the folder) rather than leveraging the push notification scheme (you need to deal with it right now, not later) that email inherently is. Things like timelines, activity streams or message boards are pull-oriented presentations of presumably the same information; perhaps that’s what you should replace your automatically hidden emails with.

You’ll have an ever-growing archive with lots of folders for different things (think of a shelf full of document files in real life), but you’ll never look into them as long as you don’t desperately search for “that one mail from 3 months ago”. You’ll also have a clean inbox, preferably in “blank slate” condition or at least with only emails that require actions from you. So you’ll have a clear overview of your pending work (the things in your inbox) and the work already done (the things in your folders).

The epilogue

That’s when you discover that part of your work can be described as “look at each email and move it to a folder”, as if you were an official in charge of virtual paper. We virtualized our paperwork, letters, desks, shelves and document files. But the procedures to deal with them are still the same.

Your turn now

How do you process your emails? What are your rules and habits? What are your experiences with folders vs. tags? I would love to hear from you – in a comment, not an email.

Recap of the Schneide Dev Brunch 2016-08-14

If you couldn’t attend the Schneide Dev Brunch on the 14th of August 2016, here is a summary of the main topics.

Two weeks ago, on a Sunday, we held another Schneide Dev Brunch, a regular brunch on the second Sunday of every other (even) month, only that all attendees want to talk about software development and various other topics. This brunch had its first half on the roof terrace of our company, but it got so sunny that we couldn’t view a presentation one of our attendees had prepared, so we went inside. As usual, the main theme was that if you bring a software-related topic along with your food, everyone has something to share. We were quite a lot of developers this time, so we had enough stuff to talk about. As usual, a lot of topics and chatter were exchanged. This recapitulation tries to highlight the main topics of the brunch, but cannot reiterate everything that was spoken. If you were there, you’ll probably find this list incomplete:

Open-Space offices

There are some new office buildings in town that feature the classic open-space office plan in combination with modern features like room-wide active noise cancellation. In theory, you still see your 40 to 50 colleagues, but you don’t necessarily hear them. You don’t have walls and a door around you, but are still separated by modern technology. In practice, that doesn’t work. The noise cancellation induces a faint chirping in the background that causes headaches. The noise isn’t cancelled completely; especially those attention-grabbing one-sided telephone calls get through. Without noise cancellation, the room or hall is way too noisy and feels like working in a subway station.

We discussed how something like this can happen in 2016, with years and years of empirical experience with work settings. The simple truth: Everybody has individual preferences, there is no golden rule. The simple conclusion would be to provide everybody with their preferred work environment. Office plans like the combi office or the flexspace office try to provide exactly that.

Retrospective on the Git internal presentation

One of our attendees gave a conference talk about the internals of git, and sure enough, the first question from the audience was: if git relies exclusively on SHA-1 hashes and two hashes collide in the same repository, what happens? The first answer won’t impress any analytical mind based on logic: it’s so incredibly improbable for two SHA-1 hashes to collide that you might rather prepare yourself for being attacked by wolves and lightning at the same time, because that’s more likely. But what if it happens regardless? Well, one man went out and explored the consequences. The sad result: it depends. It depends on which two git elements collide in which order. The consequences range from invisible warnings without any action, through silently progressing repository decay, to immediate data self-destruction. The consequences are so bitter that we already researched the savageness of the local wolf population and keep an eye on the thunderstorm app.

Helpful and funny tools

A part of our chatter contained information about new or noteworthy tools that make software development more fun. One tool is the elastic tabstop project by Nick Gravgaard. Another, maybe less helpful but more entertaining tool is the lolcommits app that takes a mugshot – oh sorry, we call that an “aided selfie” now – every time you commit code. That smug smile when you just wrote your most clever code ever? It will haunt you during a git blame session two years later while trying to find that nasty heisenbug.

Anonymous internet communication

We invested a lot of time in a topic that I will only describe in broad terms. We discussed possibilities to communicate anonymously over a compromised network. It is possible to send hidden messages from A to B using encryption and steganography, but a compromised network will still be able to determine that a communication has occurred between A and B. In order to communicate anonymously, the network must not be able to determine whether a communication between A and B has happened at all, regardless of the content.

A promising approach was presented and discussed, with lots of references to existing projects like https://github.com/cjdelisle/cjdns and https://hyperboria.net/. The usual suspects like the Tor project were examined as well, but couldn’t live up to our requirements. At last, we wanted to know how hard it is to found a new internet service provider (ISP). It’s surprisingly simple and well-documented.

Web technology to single you out

We ended our brunch with a rather grim look at the possibilities to identify and track every single user on the internet. Using completely exotic means of surfing is not helpful, as explained in this xkcd comic. When using a stock browser to surf, your best practice should be to not change the initial browser window size – but just see for yourself if you think it makes a difference. The site “What Web Can Do Today” shows everything that can be used to identify and track you. It’s so extensive, it’s really scary, but on the other hand quite useful if you happen to develop a “good” app on the web.

Epilogue

As usual, the Dev Brunch contained a lot more chatter and talk than listed here. The number of attendees makes for a unique experience every time. We are looking forward to the next Dev Brunch at the Softwareschneiderei. And as always, we are open to guests and future regulars. Just drop us a notice and we’ll invite you over next time.

For the gamers: Schneide Game Nights

Another ongoing series of events that we established at the Softwareschneiderei is the Schneide Game Nights, which take place on an irregular schedule. Each Schneide Game Night is a Saturday night dedicated to a new or unknown computer game that is presented by a volunteer moderator. The moderator introduces the guests to the game, walks them through the initial impressions and explains the game mechanics. If suitable, the moderator plays a certain amount of time to show more advanced game concepts and gives hints and tips without spoiling too many surprises. Then it’s up to the audience to take turns trying the single player game or to fire up the notebooks and join a multiplayer session.

We already had Game Nights for the following games:

  • Kerbal Space Program: A simulator for everyone who thinks that space travel surely isn’t rocket science.
  • Dwarf Fortress: A simulator for everyone who is in danger of growing attached to legendary ASCII socks (if that doesn’t make much sense now, let’s try: a simulator for everyone who loves to dig his own grave).
  • Minecraft: A simulator for everyone who never grew out of the LEGO phase and is still scared in the dark. Also, the floor is lava.
  • TIS-100: A simulator (sort of) for everyone who thinks programming in assembly is fun. Might soon be an olympic discipline.
  • Faster Than Light: A roguelike for everyone who wants more space combat action than Kerbal Space Program can provide and nearly as much text as in Dwarf Fortress.
  • Don’t Starve: A brutal survival game in a cute comic style for everyone who isn’t scared in the dark and likes to hunt Gobblers.
  • Papers, Please: A brutal survival game about a bureaucratic hero in his border guard booth. Avoid if you like to follow the rules.
  • This War of Mine: A brutal survival game about civilians in a warzone, trying not to simultaneously lose their lives and humanity.
  • Crypt of the Necrodancer: A roguelike for everyone who wants to literally play the vibes, trying to defeat hordes of monsters without skipping a beat.
  • Undertale: An 8-bit adventure for everyone who fancies silly jokes and weird storytelling. You’ll feel at home if you’ve played the NES.

The Schneide Game Nights are scheduled over the same mailing list as the Dev Brunches and feature the traditional pizza break with nearly as much chatter as the brunches. The next Game Night will be about:

  • Factorio: A simulator that puts automation first. Massive automation. Like, don’t even think about doing something yourself, let the robots do it a million times for you.

If you are interested in joining, let us know.

Three natural resources of information technology

There are quite a few commodities in IT that we use every day to gather and process the natural resources of IT. But what are the resources of IT? And are there more than the three I found?

Disclaimer: English is not my native language, so it is possible that my terminology is a little skewed in this blog entry. If you have a suggestion for better words, please let me know.

IT Currencies

Everybody working in the vast field of information technology knows about the three constraints in the project management triangle, namely

  • Cost
  • Scope
  • Schedule

We can translate these constraints into the three currencies of business:

  • Money
  • Effort
  • Time

You can achieve virtually anything in IT if you are willing to spend lots of these three currencies (except solving the Halting problem and similar undecidable problems).

IT Commodities

You surely also know about the three upscaling commodities of our profession:

  • processing power (think CPU)
  • memory (think RAM or HDD)
  • bandwidth (think network throughput)

If you are willing to invest more business currency, you’ll get more of these commodities. You’ll invest mostly money or time, although Moore’s Law seems to be fading, so spending time, as in waiting for the next generation of computers, is not the superior deal it used to be.

There are three downscaling commodities, too:

  • latency (think caches or parallelism)
  • physical size (think USB sticks the size of a fingernail)
  • energy consumption (think Raspberry Pis that are powered over USB)

These commodities are getting reduced with every new generation of computers. What once was a supercomputer is now a $30 mini PC. I vividly remember my university announcing its latest piece of technology during my first year of study: a computer with a 1 GHz CPU, 1 GB RAM and 1 TB HDD. This machine was used by all students concurrently. Today, my phone provides more power and fits into my pocket.

IT natural resources

Now, with currency and commodity defined, let me introduce you to the concept of natural resources of IT. Natural resources are things that have value to the business and need to be harvested instead of simply being bought. While you shouldn’t envision material natural resources now, they help to introduce the concept: you may buy a whole mountain, but the iron ore in it (a major natural resource for industry) still needs to be mined. Raw iron ore is the starting point for many processing steps, each one refining the input material and producing output material of higher value. There are countless different materials in the world, but only a handful of major natural resources. Whoever was the first to drill a hole in the ground and get crude oil back was a rich man. Keep this imagery in mind when we talk about the natural resources of IT, but please forget about the aspect of mass. Our natural resources don’t have mass. They do have a location, though.

Data or information: You’ve already guessed this. The oil of IT is data. Data can be labeled as crude oil; information, then, is refined data. Just imagine you drill a hole in the ground and it spits out random facts. You could just record the facts and build a knowledge database out of them. If you want to give your hole in the ground a name, you could name it Facebook, Twitter or something alike. Data is processed and turned into information, and information is combined and aggregated to give us more valuable information, just like in the iron ore example beforehand. In the old days, data was provided by human effort (aka typing). In the era of the internet of things, most data is provided by sensors all over the world. And while data still maintains a location (an increasingly fuzzy one in the era of cloud computing), it has no mass. This means it can be copied without cost, a feature no material natural resource can offer. Data as a natural resource of IT is so widely known, it even gave IT its first name: electronic data processing.

Source code: The fabric all software is made of is another natural resource of IT. You could argue that code is just data, but I think its whole processing pipeline is so remarkably different from data that it should be discussed separately. Source code is still harvested by hand, by humans typing words into a text editor like it’s 1980. Source code is refined into running software programs, a process that got fully automated in recent years. The software is then used to gather data, distill information out of data or, well, entertain us. Source code is a rare natural resource, because it needs to be harvested by highly skilled workers in a delicate process called programming. The number of programmers worldwide doubles every five years, but the demand for software rises even faster. All the while, we still haven’t figured out how to maintain an acceptable quality level. If source code is the equivalent of gold (rare, valuable, sought-after), it most often comes mixed with all kinds of scrap metal.

Random numbers: The raw material of anything cryptographic is random numbers. They might be seen as data, too, but again, I think their unique properties require a separate examination. Random numbers need to be truly random. The higher the randomness, the higher the quality of this natural resource. A lot of the random numbers we use (or consume) today are really just pseudorandom numbers, obtained from an ultimately deterministic generator. We rely on this second-grade material because the harvesting speed for real random numbers is pitifully slow and cannot satisfy our need for random numbers. Imagine again that you drill a hole in the ground and it spits out random numbers. You’re going to get rich, because random numbers are the crude oil of cryptography and therefore of every serious data transfer today. If you think about sources of randomness, radioactive decay or cosmic radiation are very high on the list. The RANDOM.ORG service uses atmospheric noise, as if the weather in Ireland would provide much noise – it will rain tomorrow, too. A speciality of random numbers is that they can only be used once to provide their full value. Nobody wants to use second-hand random numbers because they lose their randomness once they are known (much like you can’t bet on last week’s sports events). So while we can still say that random numbers have no mass, they are more similar to their material counterparts in that they can’t be copied and are affected by decay over time.
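
The difference between the second-grade and the first-grade material is visible in a few lines of Java. This is only a minimal sketch of the idea, using two standard library classes:

import java.security.SecureRandom;
import java.util.Random;

public class RandomnessQuality {
    public static void main(String[] args) {
        // Pseudorandom numbers: cheap, fast and fully deterministic.
        // The same seed always yields the same "random" sequence.
        Random pseudo = new Random(42L);
        System.out.println(pseudo.nextInt()); // identical on every run

        // SecureRandom seeds itself from entropy sources of the operating
        // system - much closer to harvesting the natural resource, and
        // accordingly slower to obtain in large quantities.
        Random secure = new SecureRandom();
        System.out.println(secure.nextInt()); // differs on every run
    }
}

The pseudorandom sequence is perfectly reproducible from its seed, which is exactly the property that makes it second-grade material for cryptography.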

What now?

This blog post was meant to inspire and to share a question: are those all the natural resources in the field of IT? I thought long about it but could only find derived products like blockchain blocks, which ultimately rely on brute-forcing one-way functions like hashes in the wrong direction. To mine a bitcoin, for example, the most prominent implementation of a blockchain, you only need commodities like processing power and some time. There is nothing inherently “unique” about a blockchain block. Nice choice of terminology with “mining bitcoins”, though.

So, the question goes to you: Can you think of another natural resource of IT? Please leave a comment if you do.

Recap of the Schneide Dev Brunch 2016-06-12

If you couldn’t attend the Schneide Dev Brunch on the 12th of June 2016, here is a summary of the main topics.

Last Sunday, we held another Schneide Dev Brunch, a regular brunch on the second Sunday of every other (even) month, only that all attendees want to talk about software development and various other topics. This brunch was a little different because it had a schedule for the first half. That didn’t change much of the outcome, though. As usual, the main theme was that if you bring a software-related topic along with your food, everyone has something to share. We were quite a lot of developers this time, so we had enough stuff to talk about. As usual, a lot of topics and chatter were exchanged. This recapitulation tries to highlight the main topics of the brunch, but cannot reiterate everything that was spoken. If you were there, you’ll probably find this list incomplete:

The internals of git

Git is a version control system that has, in just a few years, taken the place of nearly every previous tool. It’s the tool that every developer uses day in, day out, but hardly anybody can explain its internals, the “plumbing”. Well, some can, and one of our attendees did. In preparation for a conference talk with a live demonstration, he gave the talk to us and told us everything about the fundamental basics of git. We even created our own repository from scratch, using only a text editor and some arcane commands. If you visited the Karlsruhe Entwicklertag, you could hear the gold version of the talk; we got the release candidate.

The talk introduced us to the basic building blocks of a git repository. These elements and the associated commands are called the “plumbing” of git, just like the user-oriented commands are called the “porcelain”. The metaphor was clearly conceived while staring at the wall in a bathroom. Normal people only get to see the porcelain, while the plumber handles all the pipework and machinery.
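
To give a taste of the plumbing: every object in a git repository is addressed by the SHA-1 hash of a short type header followed by its content. Here is a minimal Java sketch of how a blob’s object ID comes about, using only standard JDK classes and made-up example content:

import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;

public class GitBlobId {
    public static void main(String[] args) throws Exception {
        byte[] content = "hello world\n".getBytes(StandardCharsets.UTF_8);
        // Git hashes "blob <size>\0" followed by the raw file content.
        byte[] header = ("blob " + content.length + "\0").getBytes(StandardCharsets.UTF_8);

        MessageDigest sha1 = MessageDigest.getInstance("SHA-1");
        sha1.update(header);
        sha1.update(content);

        StringBuilder objectId = new StringBuilder();
        for (byte b : sha1.digest()) {
            objectId.append(String.format("%02x", b));
        }
        // Prints the same ID that `git hash-object` would compute
        // for a file containing "hello world\n".
        System.out.println(objectId);
    }
}

Trees and commits are content-addressed in exactly the same way, so the object ID of a commit indirectly pins down every file and directory it contains.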

Code reviews

After the talk about git and a constructive criticism phase, we moved on to the next topic: code reviews. We are all interested in or practicing different tools, approaches and styles of code review, so we needed to get an overview. There is one company called SmartBear that got its public relations moves right by publishing an ebook about code reviews (Best Kept Secrets of Code Review). The one trick that really stands out is having the original author add preliminary comments about the code to ease the reviewer’s job. It’s like a pre-review of your own code.

We talked about different practices like the “30 minutes, no less” rule (I don’t seem to find the source, have to edit it in later, sorry!) and soon came to the most delicate point: the programmer’s ego. A review isn’t always as constructive as our criticism of the talk, so sometimes an ego will get bruised or just appear to be bruised. This is the moment emotions enter the room and make everything more complicated. The best thing to keep in mind and soul is the egoless programming manifesto and, while we are at it, the egoless code review. If all else fails, your process should put a website between the author and the reviewer.

That’s when tools make their appearance. You don’t need a specific tool for code reviews, but maybe they are helpful. Some tools dictate a certain workflow while others are more lenient. We concentrated on the non-opinionated tools out there. Of course, Review Ninja is the first tool that got mentioned. Several of our regular attendees worked on it already, some are working with it. There are some first generation tools like Barkeep or Review Board. Then, there’s the old gold league like Crucible. These tools feel a bit dated and expensive. A popular newcomer is Upsource, the code review tool from JetBrains. This is just a summary, but there are a lot of tools out there. Maybe one day, a third generation tool will take this market over like git did with version control.

Oh, and you can read all kinds of aspects from reviewed code (but be sure to review the publishing date).

New university for IT professionals

In the German city of Köln (Cologne), a new type of university is being founded right now: https://code.university/ The concept includes a modern approach to teaching and learning. What’s really cool is that students work on their own projects from day one. That’s a lot like how we started our company during our studies.

Various chatter

After that, we discussed a lot of topics that won’t make it into this summary. We drifted into ethics and social problems around IT. We explored some standards like the infamous ISO 26262 for functional safety. We laughed, chatted and generally had a good time.

Economics of software development

At last, we talked about statistical analysis and economic viewpoints of software development. That would actually be a very interesting topic if it weren’t largely about huge spreadsheets filled with numbers, printed on neverending pages, referenced by endless lists of topics grouped into numerous chapters. Yes, you’ve already anticipated it: I’m talking about the books of Capers Jones. Don’t get me wrong, I really like them:

There are some others, but start with these two to get used to hard facts instead of easy tales. In the same light, you might enjoy the talk and work of Greg Wilson.

Epilogue

As usual, the Dev Brunch contained a lot more chatter and talk than listed here. The number of attendees makes for a unique experience every time. We are looking forward to the next Dev Brunch at the Softwareschneiderei. And as always, we are open to guests and future regulars. Just drop us a notice and we’ll invite you over next time.

Every time you write a getter, a function dies

This blog post explores the difference between classic and Tell, don’t Ask-style code, mostly without source code examples.

Don’t be too alarmed by the title. Functions are immortal concepts and there’s nothing wrong with a getter method. Except when you write code under the rules of the Object Calisthenics (rule number 9 directly forbids getter and setter methods). Or when you try to adhere to the ideal of encapsulation, a cornerstone of object-oriented programming. Or when your code would really benefit from some other design choices. So, most of the time, basically. Nobody dies if you write a getter method, but you should make a conscious decision for it, not just write it out of old habit.

One thing the Object Calisthenics can teach you is the immediate effect of different design choices. The rules are strict enough to place a lot of burden on your programming, so you’ll feel the pain of every trade-off. In most of your day-to-day programming, you also make these decisions, but you don’t feel the consequences right away, so you get used to certain patterns (habits) that work well for the moment and might or might not work in the long run. You should have an alternative right at hand for every pattern you use. Otherwise, it’s not a pattern, it’s a trap.

Some alternatives

Here is an incomplete list of common alternatives to common patterns or structures that you might already be aware of:

  • if-else statement (explicit conditional): You can replace most explicit conditionals with implicit ones. In object-oriented programming, calling polymorphic methods is a common alternative. Instead of writing if and else, you call a method that is overridden in two different fashions. A polymorphic method call can be seen as an implicit switch-case over the object type.
  • else statement: In the Object Calisthenics, rule 2 directly forbids the usage of else. A common alternative is an early return in the then-block. This might require you to extract the if-statement to its own method, but that’s probably a good idea anyway.
  • for-loop: Loops are one of the basic building blocks of every higher-level programming language. These explicit iterations are so common that most programmers forget their implicit counterpart. Yeah, I’m talking about recursion here. You can replace every explicit loop with an implicit loop using recursion and vice versa. Your only limit is the size of your stack – if you are bound to one. Recursion is an early brain-teaser in every computer science curriculum, but not part of the average programmer’s toolbox. I’m not sure if that’s a bad thing, but it’s an alternative nonetheless.
  • setter method: The first and foremost alternative to a state-altering operation is the immutable object (see the sketch after this list). You can’t alter the state of an immutable, so you have to create a series of edited copies. Syntactic sugar like fluent interfaces fits perfectly in this scenario. You can probably imagine that you’ll need to change all the code dealing with the immutables, but you’ll be surprised how simple things can be once you let go of mutable state, a bad conscience about “wasteful” heap usage and any premature thought about “performance”.
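
As a taste of that last alternative, here is a minimal sketch of an immutable value object with a fluent “edited copy” method instead of a setter. The class and its methods are made up for illustration:

public final class Temperature {
    private final double celsius;

    public Temperature(double celsius) {
        this.celsius = celsius;
    }

    // Instead of a setter, we hand out an edited copy and
    // leave the original object untouched.
    public Temperature raisedBy(double delta) {
        return new Temperature(this.celsius + delta);
    }

    @Override
    public String toString() {
        return celsius + " °C";
    }
}

A caller writes Temperature warmer = ambient.raisedBy(2.5); and can keep both the old and the new value around without any defensive copying.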

Keep in mind that most alternatives aren’t really “better”, they are just different. There is no silver bullet; every approach has its own advantages and drawbacks, both in the short term and in the long run. Your job as a competent programmer is to choose the right approach for each situation. You should make a deliberate choice and probably document your rationale somewhere (a project-related blog, wiki or issue tracker comes to mind). To be able to make that choice, you need to know about the pros and cons of as many alternatives as you can handle. The two lamest rationales are “I’ve always done it this way” and “I don’t know any other way”.

An alternative for get

In this blog post, you’ll learn one possible alternative to getter methods. It might not be the best or even a fitting one for your specific task, but it’s worth evaluating. The underlying principle is called “Tell, don’t Ask”. You convert the getter (aka asking the object about some value) into a method that applies a function to the value (aka telling the object to work with the value). But what does “applying” mean, and what’s a function?

A function is defined as a conversion of some input into some output, preferably without any side effects. We might also call it a mapping, because we map every possible input to a certain output. In programming, every method that takes a parameter (or several of them) and returns something (isn’t void) can be viewed as a function, as long as the internal state of the method’s object isn’t modified. So you’ve probably programmed a lot of functions already, most of the time without realizing it.

In Java 8 and other modern object-oriented programming languages, the notion of a function is an important part of the toolbox. But you have been able to work with functions in Java since the earliest days, just not as conveniently. Let’s talk about an example. I won’t use any code you can look at just yet, so you’ll have to use your imagination for this. So you have a collection of student objects (imagine a group of students standing around). We want to print a list of all these students onto the console. Each student object can say its name and matriculation number if asked by plain old getters. Damn! Somebody already made the design choice for us that these are our duties:

  • Iterate over all student objects in our collection. (If you don’t want to use a loop for this, you know an alternative!)
  • Ask each student object about its name and matriculation number.
  • Carry the data over to the console object and tell the console to print both pieces of information.

But because this is only in our imagination, we can go back in imagined time and eliminate the imagined choice for getters. We want to write our student objects without getters, so let’s get rid of them! Instead, each student object knows its name and matriculation number, but cannot be asked for them directly. But you can tell the student object to supply this information to the only (or a specific) method of an object that you give to it. Read the previous sentence again (if you haven’t already done so). That’s the whole trick. Our “function” is an object with only one method that happens to have exactly the parameters that can be provided by the student object. This method might return a formatted string that we can take to the console object, or it might use the console itself (this would result in no return value and a side effect, but why not?). We create this function object and tell each student object to use it. We don’t ask the student object for data, we tell it to do work (Tell, don’t Ask).

In this example, the result is the same. But our first approach centers the action around our “main” algorithm by gathering all the data and then acting on it. We don’t feel pain using this approach, but we were forced to use it by the absence of a function-accepting method and the presence of getters on the student objects. Our second approach prepares the action by creating the function object and then delegates the work to the objects holding the data. We were able to use it because of the presence of a function-accepting method on the student objects. The absence of getters in the second approach is a by-product; they simply aren’t necessary anymore. Why write getters that nobody uses?

We can observe the following characteristics: In a “traditional”, imperative style with getters, the data flows (gets asked) and the functionality stays in place. In a Tell, don’t Ask style with functions, the data tends to stay in place while the functionality gets passed around (“flows”).
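
Now that you’ve built the second approach in your imagination, here is a minimal Java sketch of one way it could look. All names are made up, and the function object is modeled as a small functional interface:

import java.util.Arrays;
import java.util.List;

// The "function object": one method whose parameters match exactly
// what a student object can supply.
interface StudentPrinter {
    void accept(String name, int matriculationNumber);
}

class Student {
    private final String name;
    private final int matriculationNumber;

    Student(String name, int matriculationNumber) {
        this.name = name;
        this.matriculationNumber = matriculationNumber;
    }

    // No getters: we tell the student to supply its data to the function.
    void provideTo(StudentPrinter printer) {
        printer.accept(this.name, this.matriculationNumber);
    }
}

public class TellDontAskExample {
    public static void main(String[] args) {
        List<Student> students = Arrays.asList(
                new Student("Ada Lovelace", 1815),
                new Student("Alan Turing", 1912));

        // Prepare the action once, then delegate the work to the data holders.
        StudentPrinter toConsole =
                (name, number) -> System.out.println(number + ": " + name);

        for (Student student : students) {
            student.provideTo(toConsole);
        }
    }
}

Whether the function returns a formatted string or prints directly, as it does here, is exactly the side-effect decision mentioned above.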

Weighing the options

This is just one alternative to the common “imperative getter” style. As stated, it isn’t “better”, just maybe better suited for a particular situation. In my opinion, the “functional operation” style is not straightforward and doesn’t pay off immediately, but can be very rewarding in the long run. It opens the door to a whole paradigm of writing source code that can reveal inherent or underlying concepts in your solution domain a lot more clearly than the imperative style. By eliminating the getter methods, you force this paradigm on your readers and fellow developers. But maybe you don’t really need to get rid of the getters, just reduce their usage to the hard cases.

So the title of this blog post is a bit misleading. Every time you write a getter, you’ve probably considered all alternatives and made the informed decision that a getter method is the best way forward. Every time you want to change that decision afterwards, you can add the function-accepting method right alongside the getters. No need to be pure or exclusive, just use the best of two worlds. Just don’t let the functions die (or never be born) because you “didn’t know about them” or found the style “unfamiliar”. Those are mere temporary problems. And one of them is solved right now. Happy coding!

Recap of the Schneide Dev Brunch 2016-04-10

If you couldn’t attend the Schneide Dev Brunch on the 10th of April 2016, here is a summary of the main topics.

Last Sunday, we held another Schneide Dev Brunch, a regular brunch on the second Sunday of every other (even) month, only that all attendees want to talk about software development and various other topics. In case you missed the recap article about the February brunch: it didn’t happen. We all took a break, but are back on track now. So if you bring a software-related topic along with your food, everyone has something to share. We were quite a lot of developers this time, so we had enough stuff to talk about. As usual, a lot of topics and chatter were exchanged. This recapitulation tries to highlight the main topics of the brunch, but cannot reiterate everything that was spoken. If you were there, you’ll probably find this list incomplete:

Why software development conferences?

We began with a curious question: Why are there even conferences about software development? You can read most of the content for free on the internet and even watch the talks afterwards. So why attend one for a lot of money? We discussed the topic a bit and came up with an analysis:
There are (at least) four different interested groups in a conference:

  • The organizer or commercial host is mostly interested in positive revenue. As long as there’s a possibility for some net gain, somebody will host a conference. The actual topic is a secondary matter for them (this might explain some of the weirder conferences out there, like the Boring Conference).
  • The developers that actually attend a conference are a small subset of all developers. They all have their own personal motives to pay money, invest time and put up with inconveniences to be there in person. Some might rely on the quality filter of a conference board, some are looking forward to meeting their peers in an annual ritual. There might be those that learn best if somebody talk-feeds them the topic. Whatever the reason, a lot of developers enjoy participating in conferences. If it happens to be paid by the employer and booked as work time, who wouldn’t?
  • Then there are the speakers. They have the additional burden of convincing a committee of their topic, preparing a talk of high quality and being able to perform on stage (something that is harder than it looks). The speakers seek reputation and credible proof of expertise. Their resumes will probably profit, too.
  • And at last, the companies that sponsor the conference, maintain a booth with big roll-ups and smiling employees and give their developers a chance to attend are in the game to represent, to recruit and to build their brand. A lot of traditional marketing effort goes into trade fairs, so why not treat the developer market like any other and be present at the developer fairs?

We can conclude that software development conferences can provide value for every associated stakeholder. As long as this sentence holds true, conferences will be held.
The question didn’t come out of the blue: one of our attendees got accepted as a speaker at the Karlsruher Entwicklertag 2016 and wanted to learn about the different expectations he needs to address. He will give his talk at the next Dev Brunch to practice the flow and to face the hardest critics. The topic: git internals. We are thrilled!

Stratagems and strategies

The next topic contained another talk, not at a conference, but in the context of a “general topics” series at a local university (the Duale Hochschule in Karlsruhe). The talk introduces the audience to the concept of the 36 stratagems and to modern strategies. We talked a bit about the concept itself and found that the list of logical fallacies is somewhat similar. We even found an application of the stratagems in local history (sorry, only a German source was found): The Bretten’s Hundle
The talk itself is this Monday, so you’ll need to hurry if you want to attend.

Psychology of deception

As so often during the Dev Brunch, one topic led to the other, and we soon talked about morals and ethics. The concept of micro-expressions that reveal the hidden agenda of others came up, as well as the TV series “Lie to Me”, which is inspired by the work of Paul Ekman, a professor of psychology. There is even a commercial training program to improve your skill of “spotting the liar”.

Games with moral aspects

Well, we are nerds. While crime investigation is thrilling, there is the even more enthralling topic of games with psychological and moral aspects. We soon exchanged our experiences with games like “Haze” or “Spec Ops: The Line”. But it doesn’t stop at shooter games; you can have similar insights by playing “Papers, Please” (a strong favorite for one of our next Schneide Game Nights) or “This War Of Mine”. You can even try some multiplayer games specifically designed for social insights, like “The Ship: Murder Party”.
And if you haven’t got much time but still want to learn something about yourself, little games like “60 Seconds!” are a great start.
This topic led to some ideas for upcoming Schneide Game Nights in 2016.

Book review: A tour of C++

One attendee of the brunch provided a summary of the book “A Tour of C++” by Bjarne Stroustrup, which was recently updated to cover the language features of C++11. In his words, the book is a rather incomplete introduction to the language, with way too many aspects described in way too short a manner. It’s more of a reading list for really grasping the concepts, so it may serve as a source of inspiration. For example, the notion of “move semantics” is explained, but discovering the consequences is left to the developer. The part about template programming is well done, and every chapter ends with a suitable list of advice in the tradition of “Effective XYZ”. So it’s not a bad book, but too short to be satisfying. It’s like a tourist’s tour around C++11, so the title keeps its promise.

The left-pad incident

When we finished the “official” agenda, the topic of the recent left-pad incident came up and left us laughing. We really live in glorious times when the happiness of the (JavaScript) world depends on a few lines of code. Not that this couldn’t happen in any other ecosystem.

Epilogue

As usual, the Dev Brunch contained a lot more chatter and talk than listed here. The number of attendees makes for a unique experience every time. We are looking forward to the next Dev Brunch at the Softwareschneiderei. And as always, we are open to guests and future regulars. Just drop us a notice and we’ll invite you over next time.

Timestamps make horrible identifiers

If you think about using a timestamp or date as an identifier for some kind of entity, object or data record – think again. They are horribly ill-equipped to be identifiers because of their varying precision. Here’s the story of how we came to this conclusion.

Not long ago, I struggled with a system that uses timestamps as entity identifiers. What can I say? Timestamps aren’t meant to identify anything other than a specific point in time. Don’t use them as entity identifiers, ever. If you want to know why, I invite you to read on. The blog post is written in Freytag’s dramatic structure for added effect.

Exposition

We’ve designed a system that runs on multiple instances that communicate in all sorts of ways. A central archive instance stores all data related to measurements. The whole network revolves around the notion of a measurement. Measurement data is the most precious data, and all instances either produce or consume data based on these measurements.

Most important for human operators is an instance that lets you view all existing measurement data. Let’s call it the viewer. The viewer displays an overview list of all measurements in a given context and lets the operator choose to view ever more details of any of them. To be able to provide the overview list as fast as possible, we added a cache that holds the information.

Rising action

This measurement list cache was the source of all kinds of peculiar behaviour in the system. Most, but not all, measurement data was incomplete. The list cache entries were assembled from different sources that were available at different times, so it seemed that while one part of the data got written to the cache, another part couldn’t be written for whatever reason. The operator could load detailed data for a few measurements, but the majority just produced an error message that the data couldn’t be found (despite it being present).
The most obviously broken functionality left the following trace in the log files (paraphrased):

- storing measurement at 2016-02-28T13:25:55.189+01:00 into the list cache
- measurement stored
[...]
- loading measurement at 2016-02-28T13:25:55.189+01:00 from the list cache
- error: measurement not found in list cache

So, the system is essentially telling me that it can’t load some data it just stored. As you can imagine, this may lead to some questions about the sanity of the database product underneath.

Climax

After some investigation and fruitless integration testing, it dawned on me: the problem wasn’t timing or the database. All the bugs could be explained by one single circumstance: measurements were ultimately identified by their timestamp, the moment the measurement was made. There’s also a location, type and some other information in the identifier for each measurement, but only the timestamp changes between two measurements in the same narrow context. And the timestamp was stored in different precisions, depending on the origin of the measurement identifier. Most identifiers were created at the measurement-producing system instances (let’s call them measurers) and had millisecond precision. As soon as they got stored in the production database (but not our development database), they lost the milliseconds. And some of the most important measurement data got exported to third-party systems using minute-based precision. So we had one measurement identifier in the system, but with three different precisions that were mostly incompatible with each other.

Falling action

That’s why the log excerpt above never occurred in development, but did in production: the measurement is stored in the database, the identifier used gets passed around in the software, but a query for the exact same identifier in the database yields no result because the timestamps now differ in the millisecond range. And the strange effect that sometimes everything worked just fine? That’s when the milliseconds were zero by chance. Given that most actions in the system are scheduled and performed automatically exactly on the zero mark, the zero-millisecond case happened more often than it would in an even distribution.

Our system dealt with three types of measurement identifiers: Millisecond-precise identifiers produced by the measurers, second-precise identifiers used by the measurement list cache and minute-precise identifiers used (and sometimes fed back into the system) by the data export. These identifiers were incompatible even for the same measurement most of the time, but not always. In unit tests, the timestamps were made up and didn’t reveal the problem properly (who thinks about odd milliseconds when making up a timestamp?).

My solution was to pull this incompatibility up into the type system. Instead of one measurement identifier, there are now three measurement identifiers: MillisecondPreciseIdentifier, SecondPreciseIdentifier and MinutePreciseIdentifier. An identifier of higher precision can be converted to an identifier of lower precision, but not the other way around. Every time a measurement identifier is created, it needs to explicitly state the precision of its timestamp. This made the compiler highlight the problematic usages clearly as type conflicts and therefore made dealing with the problem much easier.
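
A minimal sketch of what such a type hierarchy could look like. The class names follow the ones mentioned above; everything else (the Instant field, the conversion methods, the zero-millisecond warning) is my assumption about one possible implementation:

import java.time.Instant;
import java.time.temporal.ChronoUnit;

abstract class MeasurementIdentifier {
    protected final Instant timestamp;

    protected MeasurementIdentifier(Instant timestamp) {
        this.timestamp = timestamp;
    }
}

class MinutePreciseIdentifier extends MeasurementIdentifier {
    MinutePreciseIdentifier(Instant timestamp) {
        super(timestamp.truncatedTo(ChronoUnit.MINUTES));
    }
}

class SecondPreciseIdentifier extends MeasurementIdentifier {
    SecondPreciseIdentifier(Instant timestamp) {
        super(timestamp.truncatedTo(ChronoUnit.SECONDS));
    }

    // Conversion is only offered towards lower precision, never back.
    MinutePreciseIdentifier toMinutePrecision() {
        return new MinutePreciseIdentifier(timestamp);
    }
}

class MillisecondPreciseIdentifier extends MeasurementIdentifier {
    MillisecondPreciseIdentifier(Instant timestamp) {
        super(timestamp.truncatedTo(ChronoUnit.MILLIS));
        // Runtime sanity check: a suspiciously "round" timestamp hints
        // at a precision that was silently lost somewhere on the way.
        if (timestamp.toEpochMilli() % 1000 == 0) {
            System.err.println("Warning: millisecond part is zero for " + timestamp);
        }
    }

    SecondPreciseIdentifier toSecondPrecision() {
        return new SecondPreciseIdentifier(timestamp);
    }
}

Passing a SecondPreciseIdentifier where a MillisecondPreciseIdentifier is expected now fails at compile time instead of silently producing an empty query result at runtime.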

Revelation

Choosing a timestamp as a vital part of a (measurement) identifier was a mistake from the beginning. The greater problem was the omission of the timestamp’s precision. Timestamps behave more like floating-point numbers and less like integers, even if every timestamp can be represented by a long. As soon as I made the precision of each timestamp clear to the compiler, the bugs revealed themselves. The annoying difference between the development and production database would have been detected much sooner, because a millisecond-precise timestamp will now warn in the log files if its millisecond part is zero. As soon as this log entry shows up very often, it’s clear that something is wrong. The new data types not only serve as a clearer API contract definition tool, but also as a runtime sanity check.

If you don’t want to repeat this mistake, keep in mind that each timestamp, date or whatever time-related data type you use will inherently have a maximum precision. As soon as you mix different precisions into the same data type, you’re going to have a bad time. Explicitly state the required precision in your type system and your compiler will keep an eye on it, too.

Our five types of configuration

Over the years, we came up with a strategy to handle configurable aspects of our software applications. Using five different types of configuration, we are able to provide high customizability with modest effort.

Configurable aspects of software are the magical parts with which you can achieve higher customer satisfaction with relatively modest investment, if done right. Your application would be perfect if only this particular factor had the value three instead of two, as it does now. No problem – a little tweak in the configuration files and everything is right. No additional development cost, no compile/build cycle. You can add or increase business value with a simple text editor when things are configurable.

The first problem of this approach is the developer’s decision about what to make configurable. Every configurable and therefore variable value of a software system requires some sort of indirection and additional infrastructure. It suddenly counts as user input and needs to be validated and sanitized. If your application environment requires an identifier like a key, the developer needs to come up with a good one, consistent with the existing keys and meaningful enough to make sense to an unsuspecting user. In short, making something configurable is additional and hard work that every developer tries to avoid in the face of tight deadlines and long feature lists.

Over-configurability

Our first approach to configurable content in our applications led to a situation where everything could be configured, even the name and location of the configuration files themselves. You had to jump through so many hoops to get from the code to the actual value that it was a nightmare to maintain. And it provided virtually no business value at all. No customer ever changed the location of even one configuration file. All they did was change values inside the configuration files once in a while. Usually, the values were adjusted once at first installation and once some time later, when the benefit of the change could be anticipated. The possibility of the second adjustment is what usually brings the customer satisfaction.

Under-configurability

So we tried to narrow down the range of configurable aspects by asking our customers for constant values. We are fortunate to have direct customer access and to develop a lot of software based on physics and chemistry, science fields with a high number of constants. But the attempt to embed natural constants directly in the code failed, too. Soon after we installed the first software of this kind, an important constant related to neutron backscattering was changed – just a bit, but enough to make a difference. Putting important domain values in the code just doesn’t cut it, even if they are labeled constants and haven’t changed for decades.

The five types

A good configurable software application finds the sweet spot between being completely configurable and totally rigid. To help you with this balancing, here are the five types of configuration we identified along our way:

Resources

The section containing the resources of the application isn’t meant to be introspected or edited by the user. It contains mostly binary data like images or media formats and static content like translations. Most resources are even bundled into archive files, so they don’t present themselves as files. All resources are overwritten with every new version of the application, so changing for example an icon is possible, but only has a short-term effect unless it is fed back into the code base. If the resources were deleted, the application would probably boot up, but lack all kind of icons, images and media. Most language content would be replaced by internationalization keys. In short, the application would be usable, but ugly.

(Manufacturer) Settings

We call every configurable option that is definitely predefined by the developer a setting. We group these options into a section called settings. Like resources, settings are overwritten with every new version of the application, so changes should be rare and need to be reported back into development. Settings are configurable if the urgent need arises, but are ultimately owned by the developers and not by the users. The most delicate decision for a developer is to distinguish between a setting and an option. Settings are owned by the developer, options are owned by the user. If the settings were deleted, the application would most likely not boot up or use hard-coded defaults that might not be suited for the given use case. We use settings mostly for feature toggles or dynamically loaded content like menu definitions or team credits.
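
To make the distinction a bit more tangible, here is a minimal sketch of how such a feature toggle in the settings section might be read. The file name, the key and the default are made up for illustration; the post doesn’t prescribe any particular format:

```java
import java.io.FileInputStream;
import java.io.IOException;
import java.util.Properties;

public class FeatureToggles {
    public static void main(String[] args) throws IOException {
        Properties settings = new Properties();
        // "settings/application.properties" and the key below are hypothetical names.
        try (FileInputStream in = new FileInputStream("settings/application.properties")) {
            settings.load(in);
        }
        // The default lives in the code; the settings file only overrides it if the urgent need arises.
        boolean experimentalImport = Boolean.parseBoolean(
                settings.getProperty("feature.experimentalImport", "false"));
        System.out.println("experimental import enabled: " + experimentalImport);
    }
}
```

The important part is that the value is ultimately owned by the developers: if the toggle is changed in the field, the change needs to be reported back into development, because the next version overwrites the file again.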

Options

This is the most interesting type of configurable in terms of user-centered business value. Every little bit of information in the options section is only deployed once. As soon as it can be edited by the user, it belongs to the user. We deliver nearly every property, config or ini file as an option. We fill them with nondestructive defaults and adjust the values during the initial deployment, but after that, the user is free to change the files as he likes. This has three important implications for the developer:

  • You can’t rely on the presence of any option entry. Each option entry needs to have a hardcoded fallback value that takes over if the entry is missing in the files.
  • Every new option entry needs to be optional (no pun intended). Since we can’t redeploy the option files, any new entry won’t exist in an existing installation and we can’t force the user to add it. If you can’t find a sensible way to make your option optional, you’re going to have a hard time.
  • If you need to make changes to existing option files, you need to automate it because the number of installations might be huge. We’ve developed our own small domain specific language for update scripts that perform these changes while maintaining readability. Update scripts are the most fragile part of an update deployment and should be avoided whenever possible.

The options are what makes each installation unique, so we take every measure to avoid data loss. All options live in one specific directory tree and can be backed up with a simple copy and paste. Our deliverables don’t contain option files, so they can’t be overwritten by manual copy or extract actions. If the options were deleted, the application would boot up and recreate the initial options with our default values, thereby losing its uniqueness.
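
The first implication above, the hard-coded fallback value, could look like the following sketch. The file location, the key and the default are hypothetical; the point is only that a missing file or missing entry never breaks the application:

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Properties;

public class ArchiveOptions {
    // Hard-coded fallback that takes over when the entry is missing in the option file.
    private static final int DEFAULT_RETENTION_DAYS = 30;

    public static int retentionDays() {
        Properties options = new Properties();
        Path optionFile = Path.of("options/archive.properties"); // made-up location
        if (Files.exists(optionFile)) {
            try (InputStream in = Files.newInputStream(optionFile)) {
                options.load(in);
            } catch (IOException e) {
                // A broken option file must not break the application either.
            }
        }
        // Both a missing file and a missing entry fall back to the default.
        return Integer.parseInt(options.getProperty(
                "archive.retentionDays", Integer.toString(DEFAULT_RETENTION_DAYS)));
    }

    public static void main(String[] args) {
        System.out.println("retention days: " + retentionDays());
    }
}
```

With this pattern, a new option entry is automatically optional: existing installations simply keep running on the fallback until somebody adds the entry to their file.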

(Mutable) Data

The data section is filled with mutable information that gets created by the application itself. It’s more of a database implemented in files than real configuration. The user isn’t encouraged to even look into this section, let alone required to edit anything by hand. If this section were deleted, the application would lose parts of its current state, like lists of pending tasks, but not the carefully adjusted configurables. The application would boot up into a pristine state, but with a suitable configuration.

Archive

The last type isn’t really configuration, but a place for the application to store the documents it produces as part of its user-related functionality. Only the application writes to the archive, and only in a one-time fashion. Existing content is never altered and rarely deleted. The archive is the place to look for results like measurement data or analysis reports. It’s very important to keep the archive free of any kind of mutable data. If the archive were deleted, all previously produced result documents would be lost, but the application would work just fine.
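
The write-once nature of the archive can even be enforced in code. Here is a small sketch of the idea with hypothetical paths; an existing result document is never overwritten:

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class Archive {
    // Writes a result document exactly once; existing content is never altered.
    public static void storeReport(Path archiveDir, String fileName, String content) throws IOException {
        Files.createDirectories(archiveDir);
        Path target = archiveDir.resolve(fileName);
        // CREATE_NEW fails with FileAlreadyExistsException instead of overwriting.
        Files.write(target, content.getBytes(StandardCharsets.UTF_8),
                StandardOpenOption.CREATE_NEW);
    }

    public static void main(String[] args) throws IOException {
        storeReport(Path.of("archive/reports"), "analysis-2016-03.txt", "example report content");
    }
}
```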

Summary

As you’ve seen, we differentiate between five types of configurables, but only two of them are “real” configuration: the settings belong to the developer, while the options belong to the user. We’ve built over a dozen successful applications using this strategy; they are praised for their configurability, while our maintenance effort remains rather low.

Let us know if you have a similar or totally different concept for configurables by dropping a comment.

The whole company under version control

One of our secrets is that we’ve put the whole company under version control. You can see every change to our business data and undo every mistake.

by Sashkin / fotolia

A minor fact about the Softwareschneiderei that always evokes surprised reactions is that everything we do is under version control. This should be no surprise for our software development work, as version control has been a best practice there for about twenty years now. If you aren’t a software developer or are unfamiliar with the concept of version control for whatever reason, here’s a short explanation of its main features:

Summary of version control

Version control systems are used to track the change history of a file or a bunch of files in a way that makes it possible to restore previous versions if needed. Each noteworthy change of a file (or a bunch of files) is stored as a commit, a new savepoint that can be restored later. Each commit can be provided with a change note, a short comment that describes the changes made. This results in a timeline of noteworthy changes for each file. All committed changes are immutable, so you get revision safety of your data at almost no cost.

Usual work style for developers

In software development, each source code file has to be “in a repository”, the repository being the central database for the version control system. The repository is accessible over the network and holds the commits for the project. One of the first lessons a developer has to learn is that source code that isn’t committed to a version control system just doesn’t exist. You have to commit early and you have to commit often. In modern development, commit cycles of a few minutes are usual and necessary. Each development step results in a commit.

What we’ve done is adopt this work style for our whole company. Every document that we process is stored under version control. If we write you a quote or an invoice, it is stored in our company data repository. If we send you a letter, it is first committed to the repository. Every business analysis spreadsheet, all lists and inventories, everything is stored in a repository.

Examples of usage scenarios

Let me show you two examples:

We have a digital list of all the invoices we sent. It’s nothing but a spreadsheet with the most important data for each invoice. Every time we write an invoice, it is another digital document with all the necessary text and an additional line in the list of invoices. Both changes, the new invoice document and the extended list, are included in one commit with a comment that references the invoice number and the project number. These changes are now part of the ever-growing timeline of our company data.
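
If the repository happens to be Git-based (the post doesn’t name a particular version control system), such a combined commit could even be scripted, for example with the JGit library. The repository path, file names and commit message below are made up:

```java
import java.io.File;
import org.eclipse.jgit.api.Git;

public class CommitInvoice {
    public static void main(String[] args) throws Exception {
        // Open the existing company data repository (path is hypothetical).
        try (Git git = Git.open(new File("/data/company-repository"))) {
            // Stage the new invoice document and the updated invoice list together ...
            git.add()
               .addFilepattern("invoices/2016/invoice-0815.odt")
               .addFilepattern("invoices/invoice-list.ods")
               .call();
            // ... and record both changes as one commit with a meaningful comment.
            git.commit()
               .setMessage("Invoice 0815 for project P-4711: document and list entry")
               .call();
        }
    }
}
```

Committing both files in a single commit keeps the invoice document and its list entry tied together in one savepoint, exactly as described above.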

We also have a liquidity analysis spreadsheet that needs to be updated often. Every time somebody makes a change to the spreadsheet, it’s a new commit with a comment about what was updated. If the update was wrong for whatever reason, we can always go back to the spreadsheet content right before that faulty commit and try again. We don’t just have the spreadsheet, we have the whole history of how it was filled out, by whom and when.

Advantages of version controlled files

Before we switched to a version controlled work style, we had network shares as the place to store all company data. This is probably the de facto standard for how important files are handled in many organizations. Adding version control brings some advantages:

  • While working with network shares, everybody works on the same file. Most programs show a warning that another user has write access to a file and then open it in read-only mode. But not every program does that, and that’s where edit collisions occur without anybody noticing. With version control, you work on a local copy of the file. You can always change the file, but you will get a “merge conflict” when another user has altered the file in the repository after your last synchronization. These merge conflicts are usually minor inconveniences with source code, but a major pain with binary file formats like spreadsheets. So you’ll know about edit collisions and you’ll try to avoid them. How do you avoid them? By planning and communicating your work better. Version control emphasizes the collaborative work setting we all live in.
  • Version controlled data is always traceable. You can pinpoint exactly who did what at which time and why (as stated in the commit comment). There is no doubt about any number in a spreadsheet or any file in your repository. This might sound like a surveillance nightmare, but it’s more of a protection against mishaps and honest errors.
  • Version control lets you review your edits. Every time you commit your work, you’ll see a list of the files that you’ve changed. If there is a file in that list that you didn’t know you’d changed, the version control system just saved your ass. You can undo the erroneous change with a simple click. Had you worked with network shares, this change would have gone unnoticed. With version control, you get to double-check your work.
  • There are no accidental deletions with version control. Because you have every file stored in the repository, you can always undo every delete operation. With network shares, every file lives in the constant fear of the delete key. With version control, you catch your mishap in the commit step and just restore the file.

Summary of the adoption

When we switched to version control for all our company data, we just committed our network shares into the repository and started. The work style is a bit inconvenient at first, because it means additional work and frequent breaks for the commits, but everybody got used to it very quickly. Soon, the advantages began to outweigh the inconvenience, and now working with our company data is free of fear because we have the safety net of version control.

You want to know more about version control? Feel free to ask!

The tables will eventually turn for every optimization

Nearly all performance optimizations turn sour sooner or later. Here’s a story that illustrates the concept and gives a simple ruleset to avoid the effect.

An inconvenient truth

One of the things every software developer has to learn the hard way is that performance optimizations are a bad thing most of the time. The lesson is counter-intuitive because it is in conflict with several fundamental motivations of our artist/engineer attitude:

  • We want our code to be as fast as possible or even faster. (We really like the prospect of making the impossible happen!)
  • We don’t want to waste resources if it can be avoided. (Digital resources have always been scarce. Development in abundance wasn’t taught!)
  • We want to be clever and one-up everybody with the latest trick/hack. (This ego boost driven development attitude is a major problem on its own!)

But we can’t deny that clever people have been around forever and that their wisdom should be considered. Here’s one clever guy, forty years ago:

premature optimization is the root of all evil.

said Donald Knuth in 1974. A typical computer in those days had RAM in the lower kilobyte range and was clocked at around one megahertz. The Intel 8080 was introduced that very year. No sign of abundance anywhere.

Based on this insight, I teach the “three rules of performance optimization” to my students:

  1. Don’t
  2. Not Yet
  3. Measure

A little disclaimer: performance optimizations are about constant factors, measured in milliseconds. You are still responsible for complexity optimizations and should pursue them. Complexity is measured in Big-O notation, like O(n).

I blogged about the rules some time ago, so I won’t repeat the whole meaning behind them. Today, I want to tell a story related to rule three (“Measure”) and why it’s generally a good idea to stay clear of optimizations if not absolutely necessary.

An accidental observation

by ortodoxfoto / fotolia

In one of our long-running projects, we have been storing data 24/7 for more than ten years. The project started on Pentium IV boxes with 120 GB hard disks. Soon enough, the available disk space vanished rapidly. Our customer wanted to optimize the storage efficiency. We told him that we could always trade storage space for computation time or the other way around (the infamous space/time tradeoff), but that compressing the data in the system’s archive to save space would result in slower archive access. Our customer understood the tradeoff and decided he wanted storage efficiency over access speed for the archive.

So we added a compression step when certain data types were stored in the archive. Because the data was text-based (XML and other formats), the compression rate was 90% and higher, meaning that we could fit ten times more data on the disk than before. We certainly met the customer’s goal of storage efficiency. But what about the access speed? We added the corresponding decompression step at the place where our system loads from the archive and hoped that the speed wouldn’t suffer too badly. We measured the access times before and after the change and couldn’t believe our eyes: access was nearly ten times faster than before. There was no tradeoff; we actually shrank storage consumption and computation time at once, and in the same ballpark figure. We felt like heroes.
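
For illustration, such a compression step can be as simple as wrapping the archive data in compressing streams. This is a generic sketch using Java’s built-in GZIP support, not the project’s actual code:

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

public class ArchiveCompression {
    // Compress text-based data (XML and similar) before it is written to the archive.
    public static byte[] compress(String text) throws IOException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (GZIPOutputStream gzip = new GZIPOutputStream(buffer)) {
            gzip.write(text.getBytes(StandardCharsets.UTF_8));
        }
        return buffer.toByteArray();
    }

    // Decompress when the data is loaded from the archive again.
    public static String decompress(byte[] compressed) throws IOException {
        try (GZIPInputStream gzip = new GZIPInputStream(new ByteArrayInputStream(compressed))) {
            return new String(gzip.readAllBytes(), StandardCharsets.UTF_8);
        }
    }

    public static void main(String[] args) throws IOException {
        String xml = "<measurement><value>42</value></measurement>".repeat(1000);
        byte[] packed = compress(xml);
        System.out.printf("raw: %d bytes, compressed: %d bytes%n", xml.length(), packed.length);
        System.out.println("roundtrip ok: " + xml.equals(decompress(packed)));
    }
}
```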

Discovering false premises

Why did this happen? Our explanation is that at some specific point in time, the CPUs became too fast for the tradeoff to matter. In the good old days, there really was a tradeoff: if you compressed parts of your hard disk to save space, loading these parts became slower. Because the CPU had to perform extra work on top of the loading, you had to wait longer. Loading more data directly from disk was faster than loading less data from disk and decompressing it. But once the CPU became fast enough to decompress faster than the hard disk could deliver data, the raw amount of data that had to be transferred from disk determined the loading speed. And since uncompressed data is larger, it now took longer to load than compressed data.
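
A back-of-envelope calculation shows the effect. The throughput figures below are purely illustrative, not measurements from the project:

```java
public class SpaceTimeTradeoff {
    public static void main(String[] args) {
        // Illustrative numbers only: a spinning disk and a CPU that decompresses
        // faster than the disk can deliver data.
        double diskMBperSec = 100;        // sequential read speed of the hard disk
        double decompressMBperSec = 400;  // rate at which the CPU produces decompressed output
        double rawMB = 100;               // uncompressed payload size
        double compressedMB = 10;         // ~90% compression rate, as in the archive example

        double uncompressedLoad = rawMB / diskMBperSec;       // 1.00 s of pure I/O
        double compressedLoad = compressedMB / diskMBperSec   // 0.10 s of I/O
                              + rawMB / decompressMBperSec;   // 0.25 s of CPU work
        // With streaming, I/O and decompression overlap, so the sum is only an upper bound.
        System.out.printf("uncompressed: %.2f s, compressed: <= %.2f s%n",
                uncompressedLoad, compressedLoad);
    }
}
```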

At some specific point in time, the old wisdom that compressed data is slower to access became a lie. Nobody told us. We weren’t heroes, we had just lived on false premises.

Our customer was pleased and continued to be pleased for years. Every few years, the CPUs of the target machines would become even faster, but the hard disk performance would hardly improve. The new wisdom was truer than ever: less data is faster, regardless of what the CPU has to do with it. This was a performance optimization that could be used everywhere. You want to increase I/O speed? Compress the data.

The tables begin to turn

Fast forward to modern days. A new technology promised to improve hard disk performance by orders of magnitude: the solid state disk (SSD) delivers impressive amounts of data per second and, more importantly, gets rid of the initial seek time of magnetic spindle disks (those few milliseconds to locate the data on the disk that feel like ages to modern CPUs). We started our migration to SSDs in 2010 and were SSD-exclusive at the end of 2013. Our customer was a bit more hesitant, but the latest generation of target machines runs on SSDs, too. So how did this affect our performance optimization?

As you can probably deduce by now, the decompression work of the CPU re-emerged in the archive access time. It’s by no means as bad as in ancient times, but the days of “no tradeoff” are gone again. The performance optimization isn’t an optimization anymore. We still have it in the system, because it was never meant to improve performance, but storage efficiency, and it still does that perfectly. And it needs to, because SSDs are still expensive and don’t offer storage space in abundance. But the old wisdom is (partially) back: compressed data is slower, even if not by much.

What do we learn from this?

Over the course of a few years, a specific feature (transparent storage compression) in our system was a performance burden, then a performance booster and a burden again. We didn’t change the code, we just changed the hardware circumstances. The best lesson to be learnt is that no performance optimization lasts forever. Premises will change. Effects will be negated. Bottlenecks will be shifted. It’s best to know when that happens. And then it’s best to be able to revoke the optimization. Or, if all this sounds way too much trouble for such a tiny performance gain, remember the first rule of performance optimization (Don’t) and just leave it alone. You can always tell yourself that you’ve just optimized your code for the machine it will run on in ten years.