A procedure to deal with big amounts of email

How can you survive the daily email flood and still keep track of your work? Here is one personal ruleset that adapts real-life habits to virtual message management.

The problem

You probably know the problem already: A day with less than 500 emails feels like your internet connection might be lost. The amount of emails you receive can accurately predict the time of day. In my case, I’ll always receive my 300th email each day right before lunch. Imagine that I’ve spent one minute for each message, then I would have done nothing but reading emails yet. And by the time I return from lunch, more emails have found their way into my inbox. My job description is not “email reader”. It actually is one of the lesser prioritized activities of my job. But I keep most answering times low and always know the content of my inbox. You’ll seldom hear “sorry, I haven’t seen your email yet” from me. How I keep the email flood in check is the topic of this blog entry. It’s my personal procedure, so nothing fancy with a big name, but you might recognize some influence from well-known approaches like “Getting Things Done” by David Allen.

The disclaimer

Disclaimer: You might entirely disagree with my approach. That’s totally acceptable. But keep in mind that it works for at least one person for a long time now, even if it doesn’t fit your style. Email processing seems to be a delicate topic, please keep your comments constructive. By the way, I’d love to hear about your approach. I’m always eager to learn and improve.

The analogy

Let me start with a common analogy: Your email inbox is like your mailbox. All letters you receive go through your mailbox. All emails you receive go through your inbox. That’s where this analogy ends and it was never useful to begin with. Your postman won’t show up every five minutes and stuff more commercial mailings, letters, postcards and post-it notes into your mailbox (raising that little flag again that indicates the presence of mail). He also won’t announce himself by ringing your door bell (every five minutes, mind you) and proclaiming the first line of three arbirtrarily chosen letters. Also, I’ve rarely seen mailboxes that contain hundreds if not thousands of letters, some read, some years old, in different states of decay. It’s a common sight for inboxes whose owners gave up on keeping up. I’ve seen high stacks of unanswered correspondence, but never in the mailbox. And this brings me to my new analogy for your email inbox: Your email inbox is like your desk. The stacks of decaying letters and magazines? Always on desks (and around it in extreme cases). The letters you answer directly? You bring them to your desk first. Your desk is usually clear of pending work documents and this should be the case for your inbox, too.

The rules

My procedure to deal with the continuous flood of e-amils is based on three rules:

The inbox is the only queue of emails that needs attention. It is only filled with new emails (which require activity from my side) or emails that require my attention in the foreseeable future. The inbox is therefor only filled with pending work.
Email processing is done manually. I look at each email once and hopefully only once. There are no automatic filters that sort emails into different queues before I’ve seen them.
For every email, interaction results in an activity or decision on my side. No email gets “left there”.

Let me explain the context of the rules in a bit more detail:

My email account has lots and lots of folders to store all emails until eternity. The folders are organized in a hierarchy, but that doesn’t really matter, because every folder can hold emails. The hierarchy of folders isn’t pre-planned, it emerges from the urge to group emails together. It’s possible that I move specific emails from one folder to another because the hierarchy has changed. I will use automatic filters to process emails I’ve already read. But I will mark every email as read by myself and not move unread emails around automatically. This narrows the place to look for new emails to one place: the inbox. Every other folder is only for archivation, not for processing.

The sweeps

The amount of unread emails in my inbox is the amount of work I need to do to return to the “only pending work” state. Let’s say that I opened my email reader and it shows 50 new messages in the inbox. Now I’ll have to process and archivate these 50 emails to be in the same state as before I had opened it. I usually do three separate processing steps:

The first sweep is to filter out any spam messages by immediate tell-tale signs. This is the only automatic filter that I’ll allow: the junk filter. To train it, I mark any remaining junk mail as spam and let the filter deal with it. I’m still not sure if the junk filter really makes it easier for me, because I need to scan through the junk regularly to “rescue” false positives (legitimate mails that were wrongfully sorted out), but the junk filter in combination with my fast spam sweep will lower the message count significally. In our example, we now have 30 mails left.
The second sweep picks every email that is for information purpose only. Usually those mails are sent by software tools like issue trackers, wikis, code review tools or others. Machines don’t feel the effort of writing an email, so they’ll write a lot. Most of the time, the message content is only a few lines of text. I grab each of these mails and drag them into their corresponding folder. While I’m dragging, I read (and memorize) the content of the email. The problem with this kind of information is that it’s a lot of very small chunks of data for a lot of different contexts. In order to understand those messages, you have to switch your mental context in a matter of seconds. You can do it, but only if you aren’t interrupted by different mental states. So ignore any email that requires more than a few seconds of focused attention from you. Let them sit into your inbox along with the emails that require an answer. The only activity for mails included in your second sweep is “drag to folder & memorize”. Because machines write often, we now have 10 mails left in our example.
The third sweep now attends to each remaining mail independently. Here, the three-minutes rule applies: If it takes less than three minutes to reply to the email right now, then do it right now and archivate the mail in a suitable folder (you might even create a new folder for it). Remember: if you’ve processed an email, it leaves the inbox. If it takes more than three minutes, you need to schedule an attention slot for this particular message on your todo-list. This is the only time the email remains in the inbox, because it’s a signal of pending work. In our example, 7 emails could be answered with short replies, but 3 require deep concentration or some more text for the answer.

After the three sweeps, only emails that indicate pending work remain in the inbox. They only leave the inbox after I’ve dealt with them. I need to schedule a timebox to work on them, but after that, they’ll find themselves out of the inbox and in a folder. As soon as an email is in a folder, I forget about its existence. I need to remember the information that were in it, but not the message itself.

The effects

While dealing with each email manually sounds painfully slow at first, it becomes routine after only a short while. The three sweeps usually take less than five minutes for 50 emails, excluding the three major correspondence tasks that make their way as individual items on my work schedule. Depending on your ratio of spam to information messages to real correspondence, your results may vary.

The big advantage of dealing manually with each email is that I’ve seen each message with my own eyes. Every email that an automatic filter grabs and hides before you can see it should not have been sent in the first place – it’s simulating a pull notification scheme (you decide when to receive it by opening the folder) rather than leveraging the push notification scheme (you need to deal with it right now, not later) that emails are inherently. Things like timelines, activity streams or message boards are pull-oriented presentations of presumably the same information, perhaps that’s what you should replace your automatically hidden emails with.

You’ll have an ever-growing archive with lots of folders for different things (think of a shelf full of document files in real life), but you’ll never look into them as long as you don’t desperately search “that one mail from 3 months ago”. You’ll also have a clean inbox, preferably in “blank slate” condition or at least with only emails that require actions from you. So you’ll have a clear overview of your pending work (the things in your inbox) and the work already done (the things in your folders).

The epilogue

That’s when you discover that part of your work can be described as “look at each email and move it to a folder” as if you were an official in charge for virtual paper. We virtualized our paperwork, letters, desks, shelves and document files. But the procedures to deal with them is still the same.

Your turn now

How do you process your emails? What are your rules and habits? What are your experiences with folders vs. tags? I would love to hear from you – in a comment, not an email.

C++ Coroutines on Windows with the Fiber API

Last week, I had the chance to try out coroutines as a way to cooperatively interleave long tasks with event-processing. Unlike threads, where you can have interaction between between threads at any time, coroutines need to yield control explicitly, whic arguably makes synchronisation a little simpler. Especially in (legacy) systems that are not designed for concurrency. Of course, since coroutines do not run at the same time, you do not get the perks from concurrency either.

If you don’t know coroutines, think of them as functions that can be paused and resumed.

Unlike many other languages, C++ does not have built-in support for coroutines just yet. There are, however, several alternatives. On Windows, you can use the Fiber API to implement coroutines easily.

Here’s some example code of how that works:

auto coroutine=make_shared<FiberCoroutine>();
coroutine->setup([](Coroutine::Yield yield)
{
  for (int i=0; i<3; ++i)
  {
    cout << "Coroutine " 
         << i << std::endl;
    yield();
  }
});

int stepCount = 0;
while (coroutine->step())
{
  cout << "Main " 
       << stepCount++ << std::endl;
}

Somewhat surprisingly, at least if you have never seen coroutines, this will output the two outputs alternatingly:

Coroutine 0
Main 0
Coroutine 1
Main 1
Coroutine 2
Main 2

Interface

Since fibers are not the only way to implement coroutines and since we want to keep our client code nicely insulated from the windows API, there’s a pure-virtual base class as an interface:

class Coroutine
{
public:
  using Yield = std::function<void()>;
  using Run = std::function<void(Yield)>;

  virtual ~Coroutine() = default;
  virtual void setup(Run f) = 0;
  virtual bool step() = 0;
};

Typically, creation of a Coroutine type allocates all the resources it needs, while setup “primes” it with an inner “Run” function that can use an implementation-specific “Yield” function to pass control back to the caller, which is whoever calls step.

Implementation

The implementation using fibers is fairly straight-forward:

class FiberCoroutine
  : public Coroutine
{
public:
  FiberCoroutine()
  : mCurrent(nullptr), mRunning(false)
  {
  }

  ~FiberCoroutine()
  {
    if (mCurrent)
      DeleteFiber(mCurrent);
  }

  void setup(Run f) override
  {
    if (!mMain)
    {
      mMain = ConvertThreadToFiber(NULL);
    }
    mRunning = true;
    mFunction = std::move(f);

    if (!mCurrent)
    {
      mCurrent = CreateFiber(0,
        &FiberCoroutine::proc, this);
    }
  }

  bool step() override
  {
    SwitchToFiber(mCurrent);
    return mRunning;
  }

  void yield()
  {
    SwitchToFiber(mMain);
  }

private:
  void run()
  {
    while (true)
    {
      mFunction([this]
                { yield(); });
      mRunning = false;
      yield();
    }
  }

  static VOID WINAPI proc(LPVOID data)
  {
    reinterpret_cast<FiberCoroutine*>(data)->run();
  }

  static LPVOID mMain;
  LPVOID mCurrent;
  bool mRunning;
  Run mFunction;
};

LPVOID FiberCoroutine::mMain = nullptr;

The idea here is that the caller and the callee are both fibers: lightweight threads without concurrency. Running the coroutine switches to the callee’s, the run function’s, fiber. Yielding switches back to the caller. Note that it is currently assumed that all callers are from the same thread, since each thread that participates in the switching needs to be converted to a fiber initially, even the caller. The current version only keeps the fiber for the initial thread in a single static variable. However, it should be possible to support this by replacing the single static fiber pointer with a map that maps each thread to its associated fiber.

Note that you cannot return from the fiberproc – that will just terminate the whole thread! Instead, just yield back to the caller and either re-use or destroy the fiber.

Assessment

Fiber-based coroutines are a nice and efficient way to model non-linear control-flow explicitly, but they do not come without downsides. For example, while this example worked flawlessly when compiled with visual studio, Cygwin just terminates without even an error. If you’re used to working with the visual studio debugger, it may surprise you that the caller gets hidden completely while you’re in the run function. The run functions stack completely replaces the callers stack until you call yield(). This means that you cannot find out who called step(). On the other hand, if you’re actually doing a lot of processing in the run function, this is quite nice for profiling, as the “processing” call tree seemingly has its own root in the call-tree.

I just wish the visual studio debugger had a way to view the states of the different fibers like it has for threads.

Alternatives

On Linux, you can use the ucontext.

Visual Studio 2015 also has another, newer, implementation.

Coroutines can be implemented using threads and condition-variables.

There’s also Boost.Coroutine, if you need an independent implementation of the concept. From what I gather, they only use Fibers optionally, and otherwise do the required “trickery” themselves. Maybe this even keeps the caller-stack visible – it is certainly worth exploring.

Using passwords with Jenkins CI server

For many of our projects the Jenkins continuous integration (CI) server is one important cornerstone. The well known “works on my machine” means nothing in our company. Only code in repositories and built, tested and packaged by our CI servers counts. In addition to building, testing, analyzing and packaging our projects we use CI jobs for deployment and supervision, too. In such jobs you often need some sort of credentials like username/password or public/private keys.

If you are using username/password they do not only appear in the job configuration but also in the console build logs. In most cases this is undesirable but luckily there is an easy way around it: using the Environment Injector Plugin.

In the plugin you can “inject passwords to the build as environment variables” for use in your commands and scripts.

The nice thing about this is that the passwords are not only masked in the job configuration (like above) but also in the console logs of the builds!

Another alternative doing mostly the same is the Credentials Binding Plugin.

There is a lot more to explore when it comes to authentication and credential management in Jenkins as you can define credentials at the global level, use public/private key pairs and ssh agents, connect to a LDAP database and much more. Just do not sit back and provide security related stuff plaintext in job configurations or your deployments scripts!

A zine for the modern developer

A new zine for the modern developer, this time about CQRS

Martin Fowler once said:

Test for value of an article isn’t novelty of its ideas, but how many who could benefit who haven’t considered them, or are using them badly

So I thought that I should share some things I am learning and have learned as a software developer. Maybe one time I will write a book but this requires a huge effort. Luckily Julia Evans (thank you!) just wrote a post having the same dilemma. She decided to create and publish a lean version of a book: a zine.
This pushed me to do my own zine. I believe that a modern software developer is confronted with many topics besides programming and so the topics will range from software architecture over user experience (UX) to marketing.
My first zine is about CQRS, an architecture I am excited about. Enjoy (PDF download).

Recap of the Schneide Dev Brunch 2016-08-14

If you couldn’t attend the Schneide Dev Brunch at 14th of August 2016, here is a summary of the main topics.

Two weeks ago at sunday, we held another Schneide Dev Brunch, a regular brunch on the second sunday of every other (even) month, only that all attendees want to talk about software development and various other topics. This brunch had its first half on the sun roof of our company, but it got so sunny that we couldn’t view a presentation that one of our attendees had prepared and we went inside. As usual, the main theme was that if you bring a software-related topic along with your food, everyone has something to share. We were quite a lot of developers this time, so we had enough stuff to talk about. As usual, a lot of topics and chatter were exchanged. This recapitulation tries to highlight the main topics of the brunch, but cannot reiterate everything that was spoken. If you were there, you probably find this list inconclusive:

Open-Space offices

There are some new office buildings in town that feature the classic open-space office plan in combination with modern features like room-wide active noise cancellation. In theory, you still see your 40 to 50 collegues, but you don’t necessarily hear them. You don’t have walls and a door around you but are still separated by modern technology. In practice, that doesn’t work. The noise cancellation induces a faint cheeping in the background that causes headaches. The noise isn’t cancelled completely, especially those attention-grabbing one-sided telephone calls get through. Without noise cancellation, the room or hall is way too noisy and feels like working in a subway station.

We discussed how something like this can happen in 2016, with years and years of empirical experience with work settings. The simple truth: Everybody has individual preferences, there is no golden rule. The simple conclusion would be to provide everybody with their preferred work environment. Office plans like the combi office or the flexspace office try to provide exactly that.

Retrospective on the Git internal presentation

One of our attendees gave a conference talk about the internals of git, and sure enough, the first question of the audience was: If git relies exclusively on SHA-1 hashes and two hashes collide in the same repository, what happens? The first answer doesn’t impress any analytical mind based on logic: It’s so incredibly improbable for two SHA-1 hashes to collide that you might rather prepare yourself for the attack of wolves and lightning at the same time, because it’s more likely. But what if it happens regardless? Well, one man went out and explored the consequences. The sad result: It depends. It depends on which two git elements collide in which order. The consequences range from invisible warnings without action over silently progressing repository decay to immediate data self-destruction. The consequences are so bitter that we already researched about the savageness of the local wolve population and keep an eye on the thunderstorm app.

Helpful and funny tools

A part of our chatter contained information about new or noteworthy tools to make software development more fun. One tool is the elastic tabstop project by Nick Gravgaard. Another, maybe less helpful but more entertaining tool is the lolcommits app that takes a mugshot – oh sorry, we call that “aided selfie” now – everytime you commit code. That smug smile when you just wrote your most clever code ever? It will haunt you during a git blame session two years later while trying to find that nasty heisenbug.

Anonymous internet communication

We invested a lot of time on a topic that I will only decribe in broad terms. We discussed possibilities to communicate anonymously over a compromised network. It is possible to send hidden messages from A to B using encryption and steganography, but a compromised network will still be able to determine that a communication has occured between A and B. In order to communicate anonymously, the network must not be able to determine if a communication between A and B has happened or not, regardless of the content.

A promising approach was presented and discussed, with lots of references to existing projects like https://github.com/cjdelisle/cjdns and https://hyperboria.net/. The usual suspects like the TOR project were examined as well, but couldn’t hold up to our requirements. At last, we wanted to know how hard it is to found a new internet service provider (ISP). It’s surprisingly simple and well-documented.

Web technology to single you out

We ended our brunch with a rather grim inspection about the possibilities to identify and track every single user in the internet. To use completely exotic means of surfing is not helpful, as explained in this xkcd comic. When using a stock browser to surf, your best practice should be to not change the initial browser window size – but just see for yourself if you think it makes a difference. Here is everything What Web Can Do Today to identify and track you. It’s so extensive, it’s really scary, but on the other hand quite useful if you happen to develop a “good” app on the web.

Epilogue

As usual, the Dev Brunch contained a lot more chatter and talk than listed here. The number of attendees makes for an unique experience every time. We are looking forward to the next Dev Brunch at the Softwareschneiderei. And as always, we are open for guests and future regulars. Just drop us a notice and we’ll invite you over next time.

For the gamers: Schneide Game Nights

Another ongoing series of events that we established at Softwareschneiderei are the Schneide Game Nights that take place at an irregular schedule. Each Schneide Game Night is a saturday night dedicated to a new or unknown computer game that is presented by a volunteer moderator. The moderator introduces the guests to the game, walks them through the initial impressions and explains the game mechanics. If suitable, the moderator plays a certain amount of time to show more advanced game concepts and gives hints and tipps without spoiling too much suprises. Then it’s up to the audience to take turns while trying the single player game or to fire up the notebooks and join a multiplayer session.

We already had Game Nights for the following games:

Kerbal Space Program: A simulator for everyone who thinks that space travel surely isn’t rocket science.
Dwarf Fortress: A simulator for everyone who is in danger to grow attached to legendary ASCII socks (if that doesn’t make much sense now, lets try: A simulator for everyone who loves to dig his own grave).
Minecraft: A simulator for everyone who never grew out of the LEGO phase and is still scared in the dark. Also, the floor is lava.
TIS-100: A simulator (sort of) for everyone who thinks programming in Assembler is fun. Might soon be an olympic discipline.
Faster Than Light: A roguelike for everyone who wants more space combat action than Kerbal Space Program can provide and nearly as much text as in Dwarf Fortress.
Don’t Starve: A brutal survival game in a cute comic style for everyone who isn’t scared in the dark and likes to hunt Gobblers.
Papers, Please: A brutal survival game about a bureaucratic hero in his border guard booth. Avoid if you like to follow the rules.
This War of Mine: A brutal survival game about civilians in a warzone, trying not to simultaneously lose their lives and humanity.
Crypt of the Necrodancer: A roguelike for everyone who wants to literally play the vibes, trying to defeat hordes of monsters without skipping a beat.
Undertale: A 8-bit adventure for everyone who fancies silly jokes and weird storytelling. You’ll feel at home if you’ve played the NES.

The Schneide Game Nights are scheduled over the same mailing list as the Dev Brunches and feature the traditional pizza break with nearly as much chatter as the brunches. The next Game Night will be about:

Factorio: A simulator that puts automation first. Massive automation. Like, don’t even think about doing something yourself, let the robots do it a million times for you.

If you are interested in joining, let us know.

Generating a spherified cube in C++

In my last post, I showed how to generate an icosphere, a subdivided icosahedron, without any fancy data-structures like the half-edge data-structure. Someone in the reddit discussion on my post mentioned that a spherified cube is also nice, especially since it naturally lends itself to a relatively nice UV-map.

The old algorithm

The exact same algorithm from my last post can easily be adapted to generate a spherified cube, just by starting on different data.

After 3 steps of subdivision with the old algorithm, that cube will be transformed into this:

Slightly adapted

If you look closely, you will see that the triangles in this mesh are a bit uneven. The vertical lines in the yellow-side seem to curve around a bit. This is because unlike in the icosahedron, the triangles in the initial box mesh are far from equilateral. The four-way split does not work very well with this.

One way to improve the situation is to use an adaptive two-way split instead:

Instead of splitting all three edges, we’ll only split one. The adaptive part here is that the edge we’ll split is always the longest that appears in the triangle, therefore avoiding very long edges.

Here’s the code for that. The only tricky part is the modulo-counting to get the indices right. The vertex_for_edge function does the same thing as last time: providing a vertex for subdivision while keeping the mesh connected in its index structure.

TriangleList
subdivide_2(ColorVertexList& vertices,
  TriangleList triangles)
{
  Lookup lookup;
  TriangleList result;

  for (auto&& each:triangles)
  {
    auto edge=longest_edge(vertices, each);
    Index mid=vertex_for_edge(lookup, vertices,
      each.vertex[edge], each.vertex[(edge+1)%3]);

    result.push_back({each.vertex[edge],
      mid, each.vertex[(edge+2)%3]});

    result.push_back({each.vertex[(edge+2)%3],
      mid, each.vertex[(edge+1)%3]});
  }

  return result;
}

Now the result looks a lot more even:

Note that this algorithm only doubles the triangle count per iteration, so you might want to execute it twice as often as the four-way split.

Alternatives

Instead of using this generic of triangle-based subdivision, it is also possible to generate the six sides as subdivided patches, as suggested in this article. This approach works naturally if you want to have seams between your six sides. However, that approach is more specialized towards this special geometry and will require extra “stitching” if you don’t want seams.

Code

The code for both the icosphere and the spherified cube is now on github: github.com/softwareschneiderei/meshing-samples.

My favorite Unix tool

Awk is a little language designed for the processing of lines of text. It is available on every Unix (since V3) or Linux system. The name is an acronym of the names of its creators: Aho, Weinberger and Kernighan.

Since I spent a couple of minutes to learn awk I have found it quite useful during my daily work. It is my favorite tool in the base set of Unix tools due to its simplicity and versatility.

Typical use cases for awk scripts are log file analysis and the processing of character separated value (CSV) formats. Awk allows you to easily filter, transform and aggregate lines of text.

The idea of awk is very simple. An awk script consists of a number of patterns, each associated with a block of code that gets executed for an input line if the pattern matches:

pattern_1 {
    # code to execute if pattern matches line
}

pattern_2 {
    # code to execute if pattern matches line
}

# ...

pattern_n {
    # code to execute if pattern matches line
}

Patterns and blocks

The patterns are usually regular expressions:

/error|warning/ {
    # executed for each line, which contains
    # the word "error" or "warning"
}

/^Exception/ {
    # executed for each line starting
    # with "Exception"
}

There are some special patterns, namely the empty pattern, which matches every line …

{
    # executed for every line
}

… and the BEGIN and END patterns. Their blocks are executed before and after the processing of the input, respectively:

BEGIN {
    # executed before any input is processed,
    # often used to initialize variables
}

END {
    # executed after all input has been processed,
    # often used to output an aggregation of
    # collected values or a summary
}

Output and variables

The most common operation within a block is the print statement. The following awk script outputs each line containing the string “error”:

/error/ { print }

This is basically the functionality of the Unix grep command, which is filtering. It gets more interesting with variables. Awk provides a couple of useful built-in variables. Here are some of them:

$0 represents the entire current line
$1 … $n represent the 1…n-th field of the current line
NF holds the number of fields in the current line
NR holds the number of the current line (“record”)

By default awk interprets whitespace sequences (spaces and tabs) as field separators. However, this can be changed by setting the FS variable (“field separator”).

The following script outputs the second field for each line:

{ print $2 }

Input:

John 32 male
Jane 45 female
Richard 73 male

Output:

32
45
73

And this script calculates the sum and the average of the second fields:

{
    sum += $2
}

END {
    print "sum: " sum ", average: " sum/NR
}

Output:

sum: 150, average: 50

The language

The language that can be used within a block of code is based on C syntax without types and is very similar to JavaScript. All the familiar control structures like if/else, for, while, do and operators like =, ==, >, &&, ||, ++, +=, … are there.

Semicolons at the end of statements are optional, like in JavaScript. Comments start with a #, not with //.

Variables do not have to be declared before usage (no ‘var’ or type). You can simply assign a value to a variable and it comes into existence.

String concatenation does not have an explicit operator like “+”. Strings and variables are concatenated by placing them next to each other:

"Hello " name ", how are you?"
# This is wrong: "Hello" + name + ", how are you?"

print is a statement, not a function. Parentheses around its parameter list are optional.

Functions

Awk provides a small set of built-in functions. Some of them are:

length(string), substr(string, index, count), index(string, substring), tolower(string), toupper(string), match(string, regexp).

User-defined functions look like JavaScript functions:

function min(number1, number2) {
    if (number1 < number2) {
        return number1
    }
    return number2
}

In fact, JavaScript adopted the function keyword from awk. User-defined functions can be placed outside of pattern blocks.

Command-line invocation

An awk script can be either read from a script file with the -f option:

$ awk -f myscript.awk data.txt

… or it can be supplied in-line within single quotes:

$ awk '{sum+=$2} END {print "sum: " sum " avg: " sum/NR}' data.txt

Conclusion

I hope this short introduction helped you add awk to your toolbox if you weren’t familiar with awk yet. Awk is a neat alternative to full-blown scripting languages like Python and Perl for simple text processing tasks.

Introduction to Debian packaging

In former posts I wrote about packaging your software as RPM packages for a variety of use cases. The other big binary packaging system on Linux systems is DEB for Debian, Ubuntu and friends. Both serve the purpose of convenient distribution, installation and update of binary software artifacts. They define their dependencies and describe what the package provides.

How do you provide your software as DEB packages?

The master guide to debian packaging can be found at https://www.debian.org/doc/manuals/maint-guide/. It is a very comprehensive guide spanning many pages and providing loads of information. Building a usable package of your software for your clients can be a matter of minutes if you know what to do. So I want to show you the basic steps, for refinement you will need to consult the guide or other resources specific to the software you want to package. For example there are several guides specific to packaging of software written in Python. I find it hard to determine what the current and recommended way to debian packages for python is because there are differing guides over the last 10 years or so. You may of course say “Just use pip for python” 🙂

Basic tools

The basic tools for building debian packages are debhelper and dh-make. To check the resulting package you will need lintian in addition. You can install them and basic software build tools using:

  sudo apt-get install build-essential dh-make debhelper lintian

The python packaging system uses make and several scripts under the hood to help building the binary packages out of source tarballs.

Your first package

First you need a tar-archive of the software you want to package. With python setuptools you could use python setup.py sdist to generate the tarball. Then you run dh_make on it to generate metadata and the package build environment for your package. Now you have to edit the metadata files, namely control, copyright, changelog. Finally you run dpkg-buildpackage to generate the package itself. Here is an example of the necessary commands:

mkdir hello-deb-1.0
cd hello-deb-1.0
dh_make -f ../hello-deb-1.0.tar.gz
# edit deb metadata files
vi debian/control
vi debian/copyright
vi debian/changelog
dpkg-buildpackage -us -uc
lintian -i -I --show-overrides hello-deb_1.0-1_amd64.changes

The control file roughly resembles RPMs SPEC file. Package name, description, version and dependency information belong there. Note that debian is very strict when it comes to naming of packages, so make sure you use the pattern ${name}-${version}.tar.gz for the archive and that it extracts into a corresponding directory without the extension, e.g. ${name}-${version}.

If everything went ok several files were generated in your base directory:

The package itself as .deb file
A file containing the changelog and checksums between package versions and revisions ending with .changes
A file with the package description ending with .dsc
A tarball with the original sources renamed according to debian convention hello-deb_1.0.orig.tar.gz (note the underscore!)

Going from here

Of course there is a lot more to the tooling and workflow when maintaining debian packages. In future posts I will explore additional means for improving and updating your packages like the quilt patch management tool, signing the package, symlinking, scripts for pre- and post-installation and so forth.

Getting better at programming without coding

Almost two decades ago one of the programming books was published that had a big impact on my thinking as a software engineer: the pragmatic programmer. Most of the tips and practices are still fundamental to my work. If you haven’t read it, give it a try.
Over the years I refined some practices and began to get a renewed focus on additional topics. One of the most important topics of the original tips and of my profession is to care and think about my craft.
In this post I collected a list of tips and practices which helped and still help me in my daily work.

Think about production

Since I develop software to be used, thinking early about the production environment is key.

Deploy as early as possible
Deployment should be a non event. Create an automatic deployment process to keep it that way and deploy as early as possible to remove the risk from unpleasant surprises.

Master should always be deployable
Whether you use master or another branch, you need a branch which could always be deployed without risk.

Self containment
Package (as many as possible of) your dependencies into your deployment. Keep the surprises of missing repositories or dependencies to a minimum of none.

Use real data in development
Real data has characteristics, gaps and inconsistencies you cannot imagine. During development use real data to experience problems before they get into production.

No data loss
Deploying should not result in a loss of data. Your application should shutdown gracefully. Often deployment deletes the directory or uses a fresh place. No files or state in memory should be used as persistence by the application. Applications should be stateless processes.

Rollback
If anything goes wrong or the new deployed application has a serious bug you need to revert it to the last version. Make rollback a requirement.

No user interruption
Users work with your application. Even if they do not lose data or their current work when you deploy, they do not like surprises.

Separate one off tasks
Software should be running and available to the user. Do not delay startup with one off admin tasks like migration, cache warm-up or search index creation. Make your application start in seconds.

Manage your runs
Problems, performance degradation and bugs should be visible. Monitor your key metrics, log important things and detect problems in the application’s data. Make it easy to combine, search and graph your recordings.

Make it easy to reproduce
When a bug occurs or your user has a problem, you need to follow the steps how the system arrived at its current state. Store their actions so that they can be easily replayed.

Think about users

Software is used by people. In order to craft successful applications I have to consider what these people need.

No requirements, just jobs
Users use the software to get stuff done. Features and requirements confuse solutions with problems. Understand in what situation the user is and what he needs to get to his goal. This is the job you need to support.

Work with the user
In order to help the user with your software I need to relate to his situation. Observing, listening, talking to and working along him helps you see his struggles and where software can help.

Speak their language
Users think and speak in their domain. Not in the domain of software. When you want to help them, you and the user interface of your software needs to speak like a user, not like a software.

Value does not come from effort
The most important things your software does are not the ones which need the most effort. Your users value things which help them the most. Find these.

Think about modeling

A model is at the core of your software. Every system has a state. How you divide and manage this state is crucial to evolving and understanding your creation.

Use the language of the domain
In your core you model concepts from the user’s domain. Name them accordingly and reasoning about them and with the users is easier.

Everything has one purpose
Divide your model by the purpose of its parts.

Separate read from write
You won’t get the model right from the start. It is easier to evolve the model if read and write operations have their own model. You can even have different read models for different use cases. (see also CQRS and Turning the database inside out)

Different parts evolve at different speeds
Not all parts of a model are equal. Some stand still, some change frequently. Some are specified, about some others you learn step by step. Some need to be constant, some need to be experimented with. Separating parts by its changing speed will help you deal with change.

Favor immutability
State is hard. State is needed. Isolating state helps you understand a running system. Isolating state helps you remove coupling.

Keep it small
Reasoning about a large system is complicated. Keep effects at bay and models small. Separating and isolating things gives you a chance to overview the whole system.

Think about approaches

Getting to all this is a journey.

When thinking use all three dimensions
Constraining yourself to a computer screen for thinking deprives you of one of your best thinking tools: spatial reasoning. Use whiteboards, walls, paper and more to remove the boundaries from your thoughts.

Crazy 8
Usually you think in your old ways. Getting out of your (mental) box is not easy. Crazy 8 is a method to create 8 solutions (sketches for UI) in a very short time frame.

Suspend judgement
As a programmer you are fast to assess proposals and solutions. Don’t do that. Learn to suspend your judgement. Some good ideas are not so obvious, you will kill them with your judgement.

Get out
Thinking long and hard about a problem can put you into blindfold mode. After stating the problem, get out. Take a walk. Do not actively think or talk about the problem. This simulates the “shower effect”: getting the best ideas when you do not actively think about the problem.

Assume nothing
Assumptions bear risks. They can make your project fail. Approach your project with what is certain. Choose your direction to explore and find your assumptions. Each assumption is an obstacle, an question that needs an answer. Ask your users. Design hypotheses and experiments to proof them. (see From agile to UX for a detailed approach)

Pre-mortem
Another way to find blind spots in your thinking is to frame for failure. Construct a scenario in which your project is failed. Then reason about what made it fail. Where are your biggest risks? (see How to map your fears for details)

MVA – Minimum, valuable action
Every step, every experiment should be as lightweight as possible. Do not craft a beautiful prototype if a sketch would suffice. Choose the most efficient method to get further to your goal.

Put it into a time box
When you need to experiment, constrain it. Define a time in which you want to have an answer. You do not need to go the whole way to get an impression.

Three natural resources of information technology

There are quite a few commodities in IT that we use every day to gather and process the natural resources of IT. But what are the resources of IT? And are there more than the three I found?

Disclaimer: English is not my native language, so it is possible that my terminology is a little skewed in this blog entry. If you have a suggestion for better words, please let me know.

IT Currencies

Everybody working in the vast field of information technology knows about the three constraints in the project management triangle, namely

Cost
Scope
Schedule

We can translate these constraints in the three currencies of business:

Money
Effort
Time

You can virtually achieve anything in IT if you are willing to spend lots of these three currencies (except solving the Halting problem and similar decision problems).

IT Commodities

You surely also know about the three upscaling commodities of our profession:

processing power (think CPU)
memory (think RAM or HDD)
bandwidth (think network throughput)

If you are willing to invest more business currency, you’ll get more of these commodities. You’ll invest mostly money or time, albeit Moore’s Law seems to dwindle, so spending time, as in waiting for the next generation of computers, is not the superior deal it used to be.

There are three downscaling commodities, too:

latency (think caches or parallelism)
physical size (think USB sticks the size of a fingernail)
energy consumption (think Rasperry PCs that are powered over USB)

These commmodities are getting reduced with every new generation of computers. What once was a super-computer is now a 30$ Mini-PC. I vividly remember my university announcing their latest piece of technology during my first year of study: a computer with 1 GHz CPU, 1 GB RAM and 1 TB HDD. This machine was used by all students concurrently. Today, my phone provides more power and fits into my pocket.

IT natural resources

Now, with currency and commodity defined, let me introduce you to the concept of natural resources of IT. Natural resources are things that have value to the business and need to be harvested instead of being just bought. While you shouldn’t envision material natural resources now, it helps to introduce the concept of a natural resource: You may buy a whole mountain, but the iron ore in it (a major natural resource for the industry) still needs to be mined. Raw iron ore is the starting point for many processing steps, each one refining the input material and producing output material of higher value. There are countless different materials in the world, but only a handful of major natural resources. Whoever was the first to drill a hole in the ground and get crude oil back was a rich man. Keep this imagery in mind when we talk about the natural resources of IT, but please forget about the aspect of mass. Our natural resources don’t have a mass. They do keep a location, though.

Data or information: You’ve already guessed this. The oil of IT is data. Data can be labeled as crude oil, information is refined data then. Just imagine you drill a hole in the ground and it spits out random facts. You could just record the facts and build a knowledge database out of it. If you want to give your hole in the ground a name, you could name it Facebook, Twitter or something alike. Data is processed and turned into information, information is combined and aggregated to give us more valueable information, just like the iron ore example beforehands. In the old days, data was provided by human effort (aka typing). In the era of the internet of things, most data is provided by sensors all over the world. And while data still maintains a location (but increasingly fuzzy in the era of cloud computing), it has no mass. This means it can be copied without cost, a feature no material natural resource can offer. Data as a natural resource of IT is so widely known, it even gave IT its first name: electronic data processing.

Source code: The fabric all software is made of is another natural resource of IT. You could argue that code is just data, but I think its whole processing pipeline is so remarkely different from data, it should be discussed seperately. Source code is still harvested from hand, by humans typing words into a text editor like it’s 1980. Source code is refined to running software programs, a process that got fully automated in the recent years. The software is then used to gather data, distill information out of data or, well, entertain us. Source code is a rare natural resource, because it needs to be harvested by highly skilled workers in a delicate process called programming. The number of programmers worldwide doubles every five years, but the demand for software rises even faster. All the while, we still haven’t figured out to maintain an acceptable quality level. If source code is the equivalent to gold (rare, valueable, sought-after), it most often comes mixed with all kinds of scrap metal.

Random numbers: The raw material of anything cryptographic are random numbers. They might be seen as data, too, but again, I think their unique properties require a separate examination. Random numbers need to be truly random. The higher the randomness, the higher the quality of this natural resource. A lot of random numbers we use (or consume) today are really just pseudorandom numbers, obtained from an ultimately deterministic generator. We rely on this second-grade material because the harvesting speed for real random numbers is pitiful slow and cannot satisfy our need of random numbers. Imagine again that you drill a hole in the ground and it spits out random numbers. You’re going to get rich, because random numbers are the crude oil of cryptography and therefore of every serious data transfer today. If you think about sources of randomness, radioactive decay or cosmic radiation are very high on the list. The RANDOM.ORG service uses atmospheric noise, as if the weather in Ireland would provide much noise – it will rain tomorrow, too. A speciality of random numbers is that they can only be used once to provide their full value. Nobody wants to use second-hand random numbers because they lose their randomness once they are known (much like you can’t bet on last week’s sport events). So while we can still say that random numbers have no mass, they are more similar to their material counterparts in that they can’t be copied and are affected by decay over time.

What now?

This blog post was meant to inspire and to share a question: are those all natural resources in the field of IT? I thought long about it but could only find derived products like blockchain blocks that ultimately rely on brute-forcing one-way functions like hashes in the wrong way. To mine a bitcoin, for example, the most prominent implementation of a blockchain, you only need commodities like processing power and some time. There is nothing inherently “unique” about a blockchain block. Nice choice of terminology with “mining bitcoins”, though.

So, the question goes to you: Can you think of another natural resource of IT? Please leave a comment if you do.

	Writing Integration… on Every Unit Test Is a Stage Pla…
	mariuselvert on C# is very strict about modify…
	Anonymous on C# is very strict about modify…
	Anonymous on Cache configuration with WildF…
	Miq on Nested queries like N+1 in pra…