My favorite Unix tool

Awk is a little language designed for the processing of lines of text. It is available on every Unix (since V3) or Linux system. The name is an acronym of the names of its creators: Aho, Weinberger and Kernighan.

Since I spent a couple of minutes to learn awk I have found it quite useful during my daily work. It is my favorite tool in the base set of Unix tools due to its simplicity and versatility.

Typical use cases for awk scripts are log file analysis and the processing of character separated value (CSV) formats. Awk allows you to easily filter, transform and aggregate lines of text.

The idea of awk is very simple. An awk script consists of a number of patterns, each associated with a block of code that gets executed for an input line if the pattern matches:

pattern_1 {
    # code to execute if pattern matches line
}

pattern_2 {
    # code to execute if pattern matches line
}

# ...

pattern_n {
    # code to execute if pattern matches line
}

Patterns and blocks

The patterns are usually regular expressions:

/error|warning/ {
    # executed for each line, which contains
    # the word "error" or "warning"
}

/^Exception/ {
    # executed for each line starting
    # with "Exception"
}

There are some special patterns, namely the empty pattern, which matches every line …

{
    # executed for every line
}

… and the BEGIN and END patterns. Their blocks are executed before and after the processing of the input, respectively:

BEGIN {
    # executed before any input is processed,
    # often used to initialize variables
}

END {
    # executed after all input has been processed,
    # often used to output an aggregation of
    # collected values or a summary
}

Output and variables

The most common operation within a block is the print statement. The following awk script outputs each line containing the string “error”:

/error/ { print }

This is basically the functionality of the Unix grep command, which is filtering. It gets more interesting with variables. Awk provides a couple of useful built-in variables. Here are some of them:

  • $0 represents the entire current line
  • $1$n represent the 1…n-th field of the current line
  • NF holds the number of fields in the current line
  • NR holds the number of the current line (“record”)

By default awk interprets whitespace sequences (spaces and tabs) as field separators. However, this can be changed by setting the FS variable (“field separator”).

The following script outputs the second field for each line:

{ print $2 }

Input:

John 32 male
Jane 45 female
Richard 73 male

Output:

32
45
73

And this script calculates the sum and the average of the second fields:

{
    sum += $2
}

END {
    print "sum: " sum ", average: " sum/NR
}

Output:

sum: 150, average: 50

The language

The language that can be used within a block of code is based on C syntax without types and is very similar to JavaScript. All the familiar control structures like if/else, for, while, do and operators like =, ==, >, &&, ||, ++, +=, … are there.

Semicolons at the end of statements are optional, like in JavaScript. Comments start with a #, not with //.

Variables do not have to be declared before usage (no ‘var’ or type). You can simply assign a value to a variable and it comes into existence.

String concatenation does not have an explicit operator like “+”. Strings and variables are concatenated by placing them next to each other:

"Hello " name ", how are you?"
# This is wrong: "Hello" + name + ", how are you?"

print is a statement, not a function. Parentheses around its parameter list are optional.

Functions

Awk provides a small set of built-in functions. Some of them are:

length(string), substr(string, index, count), index(string, substring), tolower(string), toupper(string), match(string, regexp).

User-defined functions look like JavaScript functions:

function min(number1, number2) {
    if (number1 < number2) {
        return number1
    }
    return number2
}

In fact, JavaScript adopted the function keyword from awk. User-defined functions can be placed outside of pattern blocks.

Command-line invocation

An awk script can be either read from a script file with the -f option:

$ awk -f myscript.awk data.txt

… or it can be supplied in-line within single quotes:

$ awk '{sum+=$2} END {print "sum: " sum " avg: " sum/NR}' data.txt

Conclusion

I hope this short introduction helped you add awk to your toolbox if you weren’t familiar with awk yet. Awk is a neat alternative to full-blown scripting languages like Python and Perl for simple text processing tasks.

Introduction to Debian packaging

In former posts I wrote about packaging your software as RPM packages for a variety of use cases. The other big binary packaging system on Linux systems is DEB for Debian, Ubuntu and friends. Both serve the purpose of convenient distribution, installation and update of binary software artifacts. They define their dependencies and describe what the package provides.

How do you provide your software as DEB packages?

The master guide to debian packaging can be found at https://www.debian.org/doc/manuals/maint-guide/. It is a very comprehensive guide spanning many pages and providing loads of information. Building a usable package of your software for your clients can be a matter of minutes if you know what to do. So I want to show you the basic steps, for refinement you will need to consult the guide or other resources specific to the software you want to package. For example there are several guides specific to packaging of software written in Python. I find it hard to determine what the current and recommended way to debian packages for python is because there are differing guides over the last 10 years or so. You may of course say “Just use pip for python” 🙂

Basic tools

The basic tools for building debian packages are debhelper and dh-make. To check the resulting package you will need lintian in addition. You can install them and basic software build tools using:

  sudo apt-get install build-essential dh-make debhelper lintian

The python packaging system uses make and several scripts under the hood to help building the binary packages out of source tarballs.

Your first package

First you need a tar-archive of the software you want to package. With python setuptools you could use python setup.py sdist to generate the tarball. Then you run dh_make on it to generate metadata and the package build environment for your package. Now you have to edit the metadata files, namely control, copyright, changelog. Finally you run dpkg-buildpackage to generate the package itself.  Here is an example of the necessary commands:

mkdir hello-deb-1.0
cd hello-deb-1.0
dh_make -f ../hello-deb-1.0.tar.gz
# edit deb metadata files
vi debian/control
vi debian/copyright
vi debian/changelog
dpkg-buildpackage -us -uc
lintian -i -I --show-overrides hello-deb_1.0-1_amd64.changes

The control file roughly resembles RPMs SPEC file. Package name, description, version and dependency information belong there. Note that debian is very strict when it comes to naming of packages, so make sure you use the pattern ${name}-${version}.tar.gz for the archive and that it extracts into a corresponding directory without the extension, e.g. ${name}-${version}.

If everything went ok several files were generated in your base directory:

  • The package itself as .deb file
  • A file containing the changelog and checksums between package versions and revisions ending with .changes
  • A file with the package description ending with .dsc
  • A tarball with the original sources renamed according to debian convention hello-deb_1.0.orig.tar.gz (note the underscore!)

Going from here

Of course there is a lot more to the tooling and workflow when maintaining debian packages. In future posts I will explore additional means for improving and updating your packages like the quilt patch management tool, signing the package, symlinking, scripts for pre- and post-installation and so forth.

Getting better at programming without coding

Almost two decades ago one of the programming books was published that had a big impact on my thinking as a software engineer: the pragmatic programmer. Most of the tips and practices are still fundamental to my work. If you haven’t read it, give it a try.
Over the years I refined some practices and began to get a renewed focus on additional topics. One of the most important topics of the original tips and of my profession is to care and think about my craft.
In this post I collected a list of tips and practices which helped and still help me in my daily work.

Think about production

Since I develop software to be used, thinking early about the production environment is key.

Deploy as early as possible
Deployment should be a non event. Create an automatic deployment process to keep it that way and deploy as early as possible to remove the risk from unpleasant surprises.

Master should always be deployable
Whether you use master or another branch, you need a branch which could always be deployed without risk.

Self containment
Package (as many as possible of) your dependencies into your deployment. Keep the surprises of missing repositories or dependencies to a minimum of none.

Use real data in development
Real data has characteristics, gaps and inconsistencies you cannot imagine. During development use real data to experience problems before they get into production.

No data loss
Deploying should not result in a loss of data. Your application should shutdown gracefully. Often deployment deletes the directory or uses a fresh place. No files or state in memory should be used as persistence by the application. Applications should be stateless processes.

Rollback
If anything goes wrong or the new deployed application has a serious bug you need to revert it to the last version. Make rollback a requirement.

No user interruption
Users work with your application. Even if they do not lose data or their current work when you deploy, they do not like surprises.

Separate one off tasks
Software should be running and available to the user. Do not delay startup with one off admin tasks like migration, cache warm-up or search index creation. Make your application start in seconds.

Manage your runs
Problems, performance degradation and bugs should be visible. Monitor your key metrics, log important things and detect problems in the application’s data. Make it easy to combine, search and graph your recordings.

Make it easy to reproduce
When a bug occurs or your user has a problem, you need to follow the steps how the system arrived at its current state. Store their actions so that they can be easily replayed.

Think about users

Software is used by people. In order to craft successful applications I have to consider what these people need.

No requirements, just jobs
Users use the software to get stuff done. Features and requirements confuse solutions with problems. Understand in what situation the user is and what he needs to get to his goal. This is the job you need to support.

Work with the user
In order to help the user with your software I need to relate to his situation. Observing, listening, talking to and working along him helps you see his struggles and where software can help.

Speak their language
Users think and speak in their domain. Not in the domain of software. When you want to help them, you and the user interface of your software needs to speak like a user, not like a software.

Value does not come from effort
The most important things your software does are not the ones which need the most effort. Your users value things which help them the most. Find these.

Think about modeling

A model is at the core of your software. Every system has a state. How you divide and manage this state is crucial to evolving and understanding your creation.

Use the language of the domain
In your core you model concepts from the user’s domain. Name them accordingly and reasoning about them and with the users is easier.

Everything has one purpose
Divide your model by the purpose of its parts.

Separate read from write
You won’t get the model right from the start. It is easier to evolve the model if read and write operations have their own model. You can even have different read models for different use cases. (see also CQRS and Turning the database inside out)

Different parts evolve at different speeds
Not all parts of a model are equal. Some stand still, some change frequently. Some are specified, about some others you learn step by step. Some need to be constant, some need to be experimented with. Separating parts by its changing speed will help you deal with change.

Favor immutability
State is hard. State is needed. Isolating state helps you understand a running system. Isolating state helps you remove coupling.

Keep it small
Reasoning about a large system is complicated. Keep effects at bay and models small. Separating and isolating things gives you a chance to overview the whole system.

Think about approaches

Getting to all this is a journey.

When thinking use all three dimensions
Constraining yourself to a computer screen for thinking deprives you of one of your best thinking tools: spatial reasoning. Use whiteboards, walls, paper and more to remove the boundaries from your thoughts.

Crazy 8
Usually you think in your old ways. Getting out of your (mental) box is not easy. Crazy 8 is a method to create 8 solutions (sketches for UI) in a very short time frame.

Suspend judgement
As a programmer you are fast to assess proposals and solutions. Don’t do that. Learn to suspend your judgement. Some good ideas are not so obvious, you will kill them with your judgement.

Get out
Thinking long and hard about a problem can put you into blindfold mode. After stating the problem, get out. Take a walk. Do not actively think or talk about the problem. This simulates the “shower effect”: getting the best ideas when you do not actively think about the problem.

Assume nothing
Assumptions bear risks. They can make your project fail. Approach your project with what is certain. Choose your direction to explore and find your assumptions. Each assumption is an obstacle, an question that needs an answer. Ask your users. Design hypotheses and experiments to proof them. (see From agile to UX for a detailed approach)

Pre-mortem
Another way to find blind spots in your thinking is to frame for failure. Construct a scenario in which your project is failed. Then reason about what made it fail. Where are your biggest risks? (see How to map your fears for details)

MVA – Minimum, valuable action
Every step, every experiment should be as lightweight as possible. Do not craft a beautiful prototype if a sketch would suffice. Choose the most efficient method to get further to your goal.

Put it into a time box
When you need to experiment, constrain it. Define a time in which you want to have an answer. You do not need to go the whole way to get an impression.

Generating an Icosphere in C++

If you want to render a sphere in 3D, for example in OpenGL or DirectX, it is often a good idea to use a subdivided icosahedron. That often works better than the “UVSphere”, which means simply tesselating a sphere by longitude and latitude. The triangles in an icosphere are a lot more evenly distributed over the final sphere. Unfortunately, the easiest way, it seems, is to generate such a sphere is to do that in a 3D editing program. But to load that into your application requires a 3D file format parser. That’s a lot of overhead if you really need just the sphere, so doing it programmatically is preferable.

At this point, many people will just settle for the UVSphere since it is easy to generate programmatically. Especially since generating the sphere as an indexed mesh without vertex-duplicates further complicates the problem. But it is actually not much harder to generate the icosphere!
Here I’ll show some C++ code that does just that.

C++ Implementation

We start with a hard-coded indexed-mesh representation of the icosahedron:

struct Triangle
{
  Index vertex[3];
};

using TriangleList=std::vector<Triangle>;
using VertexList=std::vector<v3>;

namespace icosahedron
{
const float X=.525731112119133606f;
const float Z=.850650808352039932f;
const float N=0.f;

static const VertexList vertices=
{
  {-X,N,Z}, {X,N,Z}, {-X,N,-Z}, {X,N,-Z},
  {N,Z,X}, {N,Z,-X}, {N,-Z,X}, {N,-Z,-X},
  {Z,X,N}, {-Z,X, N}, {Z,-X,N}, {-Z,-X, N}
};

static const TriangleList triangles=
{
  {0,4,1},{0,9,4},{9,5,4},{4,5,8},{4,8,1},
  {8,10,1},{8,3,10},{5,3,8},{5,2,3},{2,7,3},
  {7,10,3},{7,6,10},{7,11,6},{11,0,6},{0,1,6},
  {6,1,10},{9,0,11},{9,11,2},{9,2,5},{7,2,11}
};
}

icosahedron
Now we iteratively replace each triangle in this icosahedron by four new triangles:

subdivision

Each edge in the old model is subdivided and the resulting vertex is moved on to the unit sphere by normalization. The key here is to not duplicate the newly created vertices. This is done by keeping a lookup of the edge to the new vertex it generates. Note that the orientation of the edge does not matter here, so we need to normalize the edge direction for the lookup. We do this by forcing the lower index first. Here’s the code that either creates or reused the vertex for a single edge:

using Lookup=std::map<std::pair<Index, Index>, Index>;

Index vertex_for_edge(Lookup& lookup,
  VertexList& vertices, Index first, Index second)
{
  Lookup::key_type key(first, second);
  if (key.first>key.second)
    std::swap(key.first, key.second);

  auto inserted=lookup.insert({key, vertices.size()});
  if (inserted.second)
  {
    auto& edge0=vertices[first];
    auto& edge1=vertices[second];
    auto point=normalize(edge0+edge1);
    vertices.push_back(point);
  }

  return inserted.first->second;
}

Now you just need to do this for all the edges of all the triangles in the model from the previous interation:

TriangleList subdivide(VertexList& vertices,
  TriangleList triangles)
{
  Lookup lookup;
  TriangleList result;

  for (auto&& each:triangles)
  {
    std::array<Index, 3> mid;
    for (int edge=0; edge<3; ++edge)
    {
      mid[edge]=vertex_for_edge(lookup, vertices,
        each.vertex[edge], each.vertex[(edge+1)%3]);
    }

    result.push_back({each.vertex[0], mid[0], mid[2]});
    result.push_back({each.vertex[1], mid[1], mid[0]});
    result.push_back({each.vertex[2], mid[2], mid[1]});
    result.push_back({mid[0], mid[1], mid[2]});
  }

  return result;
}

using IndexedMesh=std::pair<VertexList, TriangleList>;

IndexedMesh make_icosphere(int subdivisions)
{
  VertexList vertices=icosahedron::vertices;
  TriangleList triangles=icosahedron::triangles;

  for (int i=0; i<subdivisions; ++i)
  {
    triangles=subdivide(vertices, triangles);
  }

  return{vertices, triangles};
}

There you go, a customly subdivided icosphere!
icosphere

Performance

Of course, this implementation is not the most runtime-efficient way to get the icosphere. But it is decent and very simple. Its performance depends mainly on the type of lookup used. I used a map instead of an unordered_map here for brevity, only because there’s no premade hash function for a std::pair of indices. In pratice, you would almost always use a hash-map or some kind of spatial structure, such as a grid, which makes this method a lot tougher to compete with. And certainly feasible for most applications!

The general pattern

The lookup-or-create pattern used in this code is very useful when creating indexed-meshes programmatically. I’m certainly not the only one who discovered it, but I think it needs to be more widely known. For example, I’ve used it when extracting voxel-membranes and isosurfaces from volumes. It works very well whenever you are creating your vertices from some well-defined parameters. Usually, it’s some tuple that describes the edge you are creating the vertex on. This is the case with marching cubes or marching tetrahedrons. It can, however, also be grid coordinates if you sparsely generate vertices on a grid, for example when meshing heightmaps.

ECMAScript 6 is coming

ECMAScript is the standardized specification of the scripting language that is the basis for JavaScript implementations in web browsers.

ECMAScript 5, which was released in 2009, is in widespread use today. Six years later in 2015 ECMAScript 6 was released, which introduced a lot of new features, especially additions to the syntax.

However, browser support for ES 6 needed some time to catch up.
If you look at the compatibility table for ES 6 support in modern browsers you can see that the current versions of the main browsers now support ES 6. This means that soon you will be able to write ES 6 code without the necessity of a ES6-to-ES5 cross-compiler like Babel.

The two features of ES 6 that will change the way JavaScript code will be written are in my opinion: the new class syntax and the new lambda syntax with lexical scope for ‘this’.

Class syntax

Up to now, if you wanted to define a new type with a constructor and methods on it, you would write something like this:

var MyClass = function(a, b) {
	this.a = a;
	this.b = b;
};
MyClass.prototype.methodA = function() {
	// ...
};
MyClass.prototype.methodB = function() {
	// ...
};
MyClass.prototype.methodC = function() {
	// ...
};

ES 6 introduces syntax support for this pattern with the class keyword, which makes class definitions shorter and more similar to class definitions in other object-oriented programming languages:

class MyClass {
	constructor(a, b) {
		this.a = a;
		this.b = b;
	}
	methodA() {
		// ...
	}
	methodB() {
		// ...
	}
	methodC() {
		// ...
	}
}

Arrow functions and lexical this

The biggest annoyance in JavaScript were the scoping rules of this. You always had to remember to rebind this if you wanted to use it inside an anonymous function:

var self = this;
this.numbers.forEach(function(n) {
    self.doSomething(n);
});

An alternative was to use the bind method on an anonymous function:

this.numbers.forEach(function(n) {
    this.doSomething(n);
}.bind(this));

ES 6 introduces a new syntax for lambdas via the “fat arrow” syntax:

this.numbers.forEach((n) => this.doSomething(n));

These arrow functions do not only preserve the binding of this but are also shorter. Especially if the function body consists of only a single expression. In this case the curly braces and and the return keyword can be omitted:

// ES 5
numbers.filter(function(n) { return isPrime(n); });
// ES 6
numbers.filter((n) => isPrime(n));

Wide-spread support for ECMAScript 6 is just around the corner, it’s time to familiarize yourself with it if you haven’t done so yet. A nice overview of the new features can be found at es6-features.org.

4 questions you need to ask yourself constantly while programming

Most of today’s general purpose progamming languagues come with plethora of features. Often there are different levels of abstractions and intended use cases. Some features are primarily for library designers, others ease implementation of domain specific languages and application developers use mostly another feature set.

Some language communities are discussing “language profiles / levels” to ban certain potentionally harmful constructs. The typical audience like application programmers does not need them but removing them from the language would limit its usefulness in other cases. Examples are Scala levels (a bit dated), the Google C++ Style Guide or Profiles in the C++ Core Guidelines.

In the wild

When reading other peoples code I often see novice code dealing with low-level threading. Or they go over board with templates, reflection or meta programming.

I have even seen custom ClassLoaders in Java written by normal application programmers. People are using threads when workers, tasks, actors or other more high-level abstractions would fit much better.

Especially novices seem to be unable to recognize their limits and to stay off of inappropriate and potentially dangerous features.

How do you decide what is appropriate in your situation?

Well, that is a difficult question. If you find the task at hand seems hard you should probably take a step back because:

There are two hard things in computer science: cache invalidation, naming things, and off-by-one errors.

-Jeff Atwood

Then ask yourself some simple questions:

  1. Someone must have done it before. Have I searched thoroughly for hints or solutions?
  2. Is there a (better) library, data structure or abstraction?
  3. Do I really have to do this? There must be a better/easier way!
  4. What do I gain using feature/library/tool X and what are its costs? What about the alternatives?

Conclusion

You need some experience to recognize that you are on the wrong path, solving problems you would not even have if doing the right thing in the first place.

Experience is what you got by not having it when you needed it.

-Author Unknown

Try to know and admit your limits – there is nothing wrong with struggling to get things working but it helps to frequently check your direction by taking a step back and reflecting.

From Agile to UX: a change in perspective

Usually a project starts with a client sending us a list of requirements in varying levels of detail. In my early days I started with finding the most efficient way to fulfill these requirements with written software.

Over time and with increased experience I broke down the large requirements into smaller ones. With every step I tried to get feedback from the client if my solution matched his imagination.

Step by step I refined this iterative process by developing more efficiently, getting earlier feedback, testing and asking questions for more detail about the constraints of the solution. Sometimes I identified some parts that weren’t needed in the solution.

In the journey to getting more agile I even re-framed the initial question from ‘how can I get more efficiently to a satisfying solution’ to ‘which minimal solution brings the most value to the customer’.

This was the first change of perspective from the process of solving to a process of value. But a problem still persisted: the solution was based on my assumptions of what I believe that brings value to the customer.
The only feedback I would get was that the customer accepted the solution or stated some improvements. But for this feedback I had to implement the whole solution in software.

The clash of worlds: agile and UX

Diving into the UX and product management world my view of software development is questioned at its foundation. Agile software development and development projects in general start with a fatal assumption: the goal is to bring value through software that fulfills the requirements. But value is not created by software or by satisfying any requirements. For the user value is created by helping him getting his jobs done, helping him solving his problems in his context.

This might sound harsh but software is not an end in itself but rather one way to help users achieve their goals.
On top of that requirements lack the reasons why the user needs them (which jobs they help the user do) and in which situation the user might need them (the context).
In order to account this I need to change my focus away from defining the solution (the requirements) to exploring the users, their problems and their context.

The problem with finding problems

As a software developer I am skilled in finding solutions. Because of this I have some difficulties in finding the problems without proposing (even subconsciously) a solution right away. If you are like me while talking with a client or user I tend to imagine the possible solutions in my head. On top of that missing details that are filled by my experience or my assumptions. The problem is that assumptions are difficult to spot. One way is to look at the language. A repeatable way is to use a process for finding them.

The product kata

Recently I stumbled upon an excellent blog post by Melissa Perri, a product manager and UX designer. In this post she describes a way named ‘the product kata’.

The product kata starts with defining the direction: the problem, job or task I want to address.
After a listening tour with the different stakeholders (including clients and users), the requirements list and a contextual observation of the current solution I can at least give a rough description of the problem.

These steps help me to get a basic understanding of the domain and the current situation. In my old way of doing things I would now rush towards the solution. I would identify the next step(s) bearing the most value for the client and along the way remove the shortcomings of the current solution. But wait. The product kata proposes a different way.

A different way

Let’s use an example from a real project. Say the client needs a way to check incoming values measured by some sensors. The old solution plots these values over time in a chart. It lacks important contextual information, has no notion of what ‘checking the values’ mean and therefore cannot document the check result. Since the process of checking the values is central to the business we need to put it first and foremost. Following the product kata I define the direction as ‘check the sensor values’.

Direction: check the sensor values

To see if I reached the goal the kata needs a target condition which I define as ‘the user should be able to check the sensor values and record the check result’.

Target condition: the user should be able to check the sensor values and record the check result

Currently the user isn’t able to check anything. So the next step of the kata is to look at the current condition. If the current condition matches the target condition I am done. The current condition in my example is that the user cannot check the sensor values in the right way.

Current condition: the user cannot check the values in the right way

The first obstacle to achieving the target condition is that I don’t know what the right way is. Since the old solution lacks some important information to fulfill the check my first obstacle I want to address is to find out what information does the user need.

Obstacle: what additional information (besides the values themselves) does the user need

Since the product kata originates from lean product management I need to find an efficient step which addresses this obstacle. In my case I choose to make a simple paper sketch of a chart and interview the user about which data they needed in the past.

Step: paper sketch of chart (to frame the discussion) and interview about information needed in the past

I expect to collect a list with all the information needed.

Expected: a list of past data sources which helped the user in his check process

After doing this I learned what information should be displayed and which information (from the old solution) was not needed in the past.

Learned: two list of things: what was needed, what not

Now I repeat the kata from the start. The current condition still not matches the target condition. My next obstacle is that we do not know from the vast resources of information that is needed and the possible actions during the check which are related, form a group or are the most important ones. So my next step is to do a card sort with the users to take a peek into their mental model. I expect to find out about the priorities and grouping of the possible information and actions.

After I gathered and condensed the information from the card sorts, my next obstacle is to find out more about the journey of the user during the check and the struggles he has. From my earlier contextual observation I have the current user journey and the struggles along the way. Armed with the insights from the earlier steps I can now create a design which maps the user journey and addresses the struggles with the data and the actions according to the mental model of the user.
For this I develop a (prototypical) implementation in software and test them with a group of users. I expect to verify or find problems regarding the match of the mental model of the user and my solution.
This process of the product kata is repeated until the current condition meets the target condition.

Why this is needed

What changed for me is that I do not rush towards solving the problem but first build a solid understanding by learning more about the users, their jobs and contexts in a directed way. The product kata helps me to frame my thoughts towards the users and away from the solution and my assumptions about it. It helps me structure my discovery process and the progress of the project.
My old way started from requirements and old solutions improving what was done before. The problem with this approach is that assumptions and decisions that were made in the past are not questioned. The old way of doing things may be an inspiration but it is not a good starting point.
Requirements by the client are predefined solutions: they frame the solution space. If the client is a very good project manager this might work but if I am responsible for leading the project this can lead to a disaster.
The agile way of developing software is great at executing. But without guidance and a way of learning about the users and their problems, agile software development is lost.

The rule of additive changes

Change is in the nature of software development. Most difficult aspects of the craft revolve around dealing with change. How does one keep software extensible? How do you adapt to new business requirements?

With experience comes the intuition that some kind of changes are more volatile than other changes. For example, it is often safer to add a new function or type to an application than change an existing one.

This is because adding something new means that it is not already strongly connected to the rest of the application. Or at least that’s the assumption. You have yet to decide how the new component interacts with the rest of the application. Usually this is done by a, preferably small, incision in the innards of your software. The first change, the adding, should not break anything. If anything, the small incision should be the only dangerous aspect of the change.

This is as very important concept: adding should not break things! This is so important, I want to give it a name:

The Rule of Additive Changes

Adding something to a well-designed software system should not break existing functionality. Exceptions should be thoroughly documented and communicated.

Systems should always be designed and tought so that the rule of additive changes holds. Failure to do so will lead to confusing surprises in the best cases, and well hidden bugs in worse cases.

The rule is nothing new, however: it’s a foundation, an axiom, to many other rules, such as the Liskov Substitution Principle:

Inheritance

Quoting from Wikipedia:

“If S is a subtype of T, then objects of type T in a program may be replaced with objects of type S without altering any of the desirable properties of that program”

This relies on subtyping as an additive change: S works at least as good as any T, so it is an extension, an addition. You should therefore design your systems in a way that the Liskov Substition Principle, and therefore the rule of additive changes, both hold: An addition of a new type in a hierarchy cannot break anything.

Whitelists vs. Blacklists

Blacklists will often violate the rule of additive changes. Once you add a new element to the domain, the domain behind the blacklist will change as well, while the domain behind a whitelist will be unaffected. Ultimately, both can be what you want, but usually, the more contained change will break less – and you can still change the whitelist explicitly later!

Note that systems that filter classes from a hierarchy via RTTI or, even more subtle, via ask-interfaces, are blacklists. Those systems can break easily when new types are introduces to a hierarchy. Extra care needs to be taken to make sure the rule of addition holds for these systems.

Introspection and Reflection

Without introspection and reflection, programs cannot know when you are adding a new type or a new function. However, with introspection, they can. Any additive change can also be an incision point. Therefore, you need to be extra careful when designing systems that use introspection: They should not break existing functionality for adding something.

For example, adding a function to enable a specific new functionality is okay. A common case of this would be to adding a function to a controller in a web-framework to add a new action. This will not inferfere with existing functionality, so it is fine.

On the other hand, adding a member to a controller should not disable or change functionality. Adding a special member for “filtering” or some kind of security setting falls into this category. You think you’re merely adding something, but in fact you are modifying. A system that relies on such behavior therefore violates the rule of additive changes. Decorating the member is a much better alternative, as that makes it clear that you are indeed modifying something, which might break existing functionality.

Not all languages or frameworks provide this possibility though. In that case, the only alternative is good communication and documentation!

Refactoring

Many engineers implicitly assume that the rule of additive changes holds. In his book “Working Effectively With Legacy Code”, Micheal Feathers proposes the sprout and wrap techniques to change legacy software. The underlying technique is the same for both: formulating a potentially breaking change as mostly additive, with only a small incision point. In the presence of systems that do not follow the rule of additive changes, such risk minimization does not work at all. For example, adding additional function can break a system that relies heavily on introspection – which goes against all intuition.

Conclusion

This rule is not a new concept. It is something that many programmers have in their head already, but possibly fractured into lots of smaller guidelines. But it is one overarching concept and it needs a name to be accessible as such. For me, that makes things a lot clearer when reasoning about systems at large.

Monitoring data integrity with health checks

An important aspect for systems, which are backed by a database storage, is to maintain data integrity. Most relational databases offer the possibility to define constraints in order to maintain data integrity, usually referential integrity and entity integrity. Typical constraints are foreign key constraints, not-null constraints, unique constraints and primary key constraints.

SQL also provides the CHECK constraint, which allows you to specify a condition on each row in a table:

ALTER TABLE table_name ADD CONSTRAINT
   constraint_name CHECK ( predicate )

For example:

CHECK (AGE >= 18)

However, these check constraints are limited. They can’t be defined on views, they can’t refer to columns in other tables and they can’t include subqueries.

Health checks

In order to monitor data integrity on a higher level that is closer to the business rules of the domain, we have deployed a technique that we call health checks in some of our applications.

These health checks are database queries, which check that certain constraints are met in accordance with the business rules. The queries are usually designed to return an empty result set on success and to return the faulty data records otherwise.

The health checks are run periodically. For example, we use a Jenkins job to trigger the health checks of one of our web applications every couple of hours. In this case we don’t directly query the database, but the application does and returns the success or failure states of the health checks in the response of a HTTP GET request.

This way we can detect problems in the stored data in a timely manner and take countermeasures. Of course, if the application is bug free these health checks should never fail, and in fact they rarely do. We mostly use the health checks as an addition to regression tests after a bug fix, to ensure and monitor that the unwanted state in the data will never happen again in the future.

Getting Shibboleth SSO attributes securely to your application

Accounts and user data are a matter of trust. Single sign-on (SSO) can improve the user experience (UX), convenience and security especially if you are offering several web applications often used by the same user. If you do not want to force your users to big vendors offering SSO like google or facebook or do not trust them you can implement SSO for your offerings with open-source software (OSS) like shibboleth. With shibboleth it may be even feasible to join an existing federation like SWITCH, DFN or InCommon thus enabling logins for thousands of users without creating new accounts and login data.

If you are implementing you SSO with shibboleth you usually have to enable your web applications to deal with shibboleth attributes. Shibboleth attributes are information about the authenticated user provided by the SSO infrastructure, e.g. the apache web server and mod_shib in conjunction with associated identity providers (IDP). In general there are two options for access of these attributes:

  1. HTTP request headers
  2. Request environment variables (not to confuse with system environment variables!)

Using request headers should be avoided as it is less secure and prone to spoofing. Access to the request environment depends on the framework your web application is using.

Shibboleth attributes in Java Servlet-based apps

In Java Servlet-based applications like Grails or Java EE access to the shibboleth attributes is really easy as they are provided as request attributes. So simply calling request.getAttribute("AJP_eppn") will provide you the value of the eppn (“EduPrincipalPersonName”) attribute set by shibboleth if a user is authenticated and the attribute is made available. There are 2 caveats though:

  1. Request attributes are prefixed by default with AJP_ if you are using mod_proxy_ajp to connect apache with your servlet container.
  2. Shibboleth attributes are not contained in request.getAttributeNames()! You have to directly access them knowing their name.

Shibboleth attributes in WSGI-based apps

If you are using a WSGI-compatible python web framework for your application you can get the shibboleth attributes from the wsgi.environ dictionary that is part of the request. In CherryPy for example you can use the following code to obtain the eppn:

eppn = cherrypy.request.wsgi_environ['eppn']

I did not find the name of the WSGI environment dictionary clearly documented in my efforts to make shibboleth work with my CherryPy application but after that everything was a bliss.

Conclusion

Accessing shibboleth attributes in a safe manner is straightforward in web environments like Java servlets and Python WSGI applications. Nevertheless, you have to know the above aspects regarding naming and visibility or you will be puzzled by the behaviour of the shibboleth service provider.