Our voyage to service separation – Part II

Recap of the situation

In the first part of this blog series, we introduced you to our evolutionary grown IT landscape. We had a room full of snow- flaked servers and no overall concept how to use them. We wanted our services to be self-contained and separated. So we chose the approach of virtualization to host one VM per service on a uniform platform. We chose VirtualBox, Vagrant and Ansible to help us along the way.
This blog entry tells you about the way and our experiences and insights.

The migration

In order to migrate every service you use to its own virtual machine (VM), you’ll need a list or map of your services first. We gathered our list, compared it to reality, adjusted it, reiterated everything, added the forgotten services, drew the map, compared again, drew again and even then missed some services that are painfully obvious in hindsight, like DNS or SMTP. We identified more than 15 distinct services and estimated their resource profile. Then we planned the VM layouts and estimated the required computation power to host all of them. Then we bought the servers.

We started with three powerful hosting servers but soon saw that there is a group of “alpha VMs” with elevated requirements on availability and bought a fourth hosting server with emphasis on redundancy. If some seldom used backoffice service goes down, that’s one thing. The most important services of our company should not go down because of a harddisk failure or such.

Four nearly identical hosting servers to run 15+ VMs on required a repeatable process to set things up. This is where the first tip comes into play:

  • Document everything. Document all the details. Have your Wiki ready and write a step-by-step tutorial for every task you perform. It’s really tedious and probably a bit cumbersome at first, but it will pay of sooner and better than you’d imagine.

We started the migration process with the least important services to get a feeling for the required steps. It turned out later that these services were also the most time-consuming ones. The most essential and seemingly complex services took the least time. We essentially experienced the pareto effect but in reverse: We started with the lowest benefit for the highest cost. But we can give two tips from this experience:

  • Go the extra mile. Just forget about the pareto effect and migrate all services. It’s so much more fun to have a clean IT landscape map than one where most things are tidy but there’s an area marked “here be dragons”.
  • Migration effort and service importance aren’t linked. Our most important service was migrated in about half an hour. Our least important service needed nearly three days. It’s all about the system architecture of the service and if it values self-containment.

The migration took place over the course of a few months with frequent address changes of our tools and an awful lot of communication for cutoff dates. If you need to migrate a service, be very open about the process and make sure that the old service address won’t work after the switch. I cannot count the amount of e-mails I wrote with the subject prefix “IMPORTANT!”. But the transition went smooth and without problems, so we probably added some extra caution that might not have been necessary.

After the migration

When we had migrated our last service in its own VM, there were a lot of old servers without any purpose anymore. We switched them off and got rid of them. Now we had nearly two dozen new servers to care for. One insight we had right after the start of our journey is that virtualized servers require the same amount of administration as physical ones. Just using our old approaches for the new IT landscape wouldn’t cut it. So we invested heavily in automation and scripted everything. Want to set up a new CI build slave? Just add its address into Ansible’s inventory and run the script (“playbook”). All servers need security updates? Just one command and a little wait.

Gears by Pete BirkinshawLearning to automate the administrative tasks in the right way had a steep curve, but it’s the only feasible way. We benefit heavily from the simple fact that we forced ourselves to do it by making it impossible to handle the tasks manually. It’s a “burned bridges” approach, but upon reaching the goal, it really pays off. So another tip:

  • Automate everything. Even if you think you’ll perform this task just a few times – that’s exactly the scenario to automate it to never have to bother with the details again. Automation is key if you want to scale your IT landscape to reasonable sizes.

Reaping the profit

We’ve done the migration and have a fully virtualized setup now. This would not be very beneficial in itself, but opens the door for another level of capabilities we simply couldn’t leverage before. Let me just describe two of them:

  • Rethink your backup strategy. With virtual machines, you can now backup your services on an appliance level. If you wanted to perform this with a real server, you would need to buy the exact same hardware, make exact copies of the harddisks and store this “clone machine” somewhere safe. Creating an appliance level backup means to stop the VM, export it and restart it. You’ll have some downtime, but everything else is just a (big) file.
  • Rethink your service maintainance strategy. We often performed test upgrades to newer versions of our important services on test machines. If the upgrade went well, we would perform it again on the live server and hope for the best. With virtual machines and appliance backups, you can try the upgrade on an exact copy of the live server over and over again. And if you are happy with the result, you just swap your copy with the live server and everything’s fine. No need for duplicated procedures, you always work with the real deal – well, an indistinguishable copy of it.


We’ve migrated our IT landscape from evoluationary to a planned virtualized state in just about a year. We’ve invested weeks of work in it, just to have the same services available as before. From a naive viewpoint, nothing much has changed. So – was it worth it?

The answer is short and clear: Absolutely yes. Even in the short time after the migration, the whole setup performs smoother and more in a planned way than just by chance. The layout can be communicated clearer and on different levels. And every virtual machine has its own use case, to the point and dedicated. We now have an IT landscape that obeys our rules and responds to our needs, whereas before we often needed to make hard compromises.

The positive effects of documentation and automation alone are worth the journey, even if they are mere side effects of the main goal. +1, would migrate again.

Our voyage to service separation – Part I

What you need to know

We are a small software development company with a home-grown IT infrastructure. The euphemism for such a state is “evolutionary grown”, denoting a process that was shaped by the most elementary forces, often implicitely. One such implicit force is laziness: If there is a quick way to do things, it will be done this way. Why invest effort if everything works just fine?
During an internal safety review, we identified our IT landscape as a risk factor. It was designed to meet yesterday’s and perhaps today’s demands, but in no way aligned to our strategic vision. We decided to invest in our IT to bring it to a planned state that we are confident will sustain our demands of tomorrow – or be easily adaptable.

Where we started

Our starting point was a room full of servers that were bought at some point to serve a specific need like “be the build box”. Every server started with a good reason to exist and evolved from there. Some gathered more and more services, some were repurposed and some idled along. We identified only two servers that were essential for the company: one was the continuous integration server master and one hosted nearly all mission-critical services at once. The latter server was also our oldest machine in production usage. It was secured against data loss, but not against outages. So everytime this server went down, our company essentially came to a stop because all services were offline. Luckily, it went down very infrequently, but it still identified as a clear single point of failure.

Where we wanted to go

containersIn April 2014, the heartbleed vulnerability was published. We luckily weren’t affected on a large scale, but took it as a wake-up call to review our IT setup and to develop a strategy to mitigate the effects of disasters similar in scope to heartbleed while we still have time. We wanted to have our IT in a condition where we actually choose which risks we take instead of just hoping for the best. So we sat before a whiteboard and outlined the goals: We wanted to separate and self-contain every essential service, so that the compromising or outage of one service doesn’t affect the others. That means one machine (or container) per service. To gain flexibility, we also planned to separate our IT landscape into two layers: The “metal layer” provides the computation power, while the “appliance layer” realizes the services. We wanted to be able to implement the appliance layer nearly independtly to the metal layer, which means to use some sort of virtualization. In modern words, we wanted to have a “cloud platform” to deploy our service applications on. We just don’t wanted it out on the internet but in our computer center. To sum up, we wanted to separate hardware and software and move every service in its own compartment.

What technologies we chose

We thought about fitting technology for a long time but settled for a small-scale, bottom-up approach: Start with just a few metal machines (hosts) and use a familiar virtualization product. In our case, this meant two standard servers, Linux and Oracle’s VirtualBox to run the virtual computers. There sure are more professional and powerful virtualization products out there, but we had years of experience (and sometimes frustration) with VirtualBox and didn’t want to rely on an unknown technology. It’s not exciting, but works well enough for our use case – and we knew that beforehands.
We decided against any fancy cloud or grid software to combine the hosts to a pool and just planned the hosting of the virtual machines (VMs) statically by hand. This might mean that one host gets bored while another host cannot handle the pressure anymore. It will be our responsibility to take that problem into account. This approach primarily achieves one thing: it keeps everything rather simple. Each host has a list of VMs and that’s it. If we want to migrate a VM to another host, we have to do it manually.
To create the VMs, we used Vagrant, which turned out fine for three-quarters of our machines, but proved toxic for the remaining ones. Vagrant is a very handy tool for developers to quickly launch a VM, but it makes a lot of assumptions that might not match your specific requirements. We essentially abandoned Vagrant after the initial phase.
During the migration phase of our services, we adopted another tool to solve the problem of scaling effects in maintainance. It’s another story to maintain 20+ servers instead of the handful we had beforehands. Luckily, Ansible proved useful to automate most of our normal administration tasks. This transition from manual to automated administration wasn’t part of the original plan, but is one of the biggest payoffs. But that’s stuff for the next blog entry.

What’s next?

In this first part of our story to regain control of our IT landscape, we described the starting point, the plan and the tools. In the next part, you’ll hear about the migration and where we ended up. We will also point out our experiences along the way and hopefully give some useful tips if you think of reshaping your services, too: Click here to read part two of the series.

Creating a GPS network service using a Raspberry Pi – Part 2

In the last article we learnt how to install and access a GPS module in a Raspberry Pi. Next, we want to write a network service that extracts the current location data – latitude, longitude and altitude – from the serial port.


We use Perl to write a CGI-script running within an Apache 2; both should be installed on the Raspberry Pi. To access the serial port from Perl, we need to include the module Device::SerialPort. Besides, we use the module JSON to generate the HTTP response.

use strict;
use warnings;
use Device::SerialPort;
use JSON;
use CGI::Carp qw(fatalsToBrowser);

Interacting with the serial port

To interact with the serial port in Perl, we instantiate Device::SerialPort and configure it according to our hardware. Then, we can read the data sent by our hardware via device->read(…), for example as follows:

my $device = Device::SerialPort->new('...') or die "Can't open serial port!";
($count, $result) = $device->read(255);

For the Sparqee GPSv1.0 module, the device can be configured as shown below:

our $device = '/dev/ttyAMA0';
our $baudrate = 9600;

sub GetGPSDevice {
 my $gps = Device::SerialPort->new($device) or return (1, "Can't open serial port '$device'!");
    $gps->write_settings or return (1, 'Could not write settings for serial port device!');
    return (0, $gps);

Finding the location line

As described in the previous blog post, the GPS module sends a continuous stream of GPS data; here is an explanation for the single components.


We are only interested in the information about latitude, longitude and altitude, which is part of the line starting with $GPGGA. Assuming that the first parameter contains a correctly configured device, the following subroutine reads the data stream sent by the GPS module, extracts the relevant line and returns it. In detail, it searches for the string $GPGGA in the data stream, buffers all data sent afterwards until the next line starts, and returns the buffer content.

# timeout in seconds
our $timeout = 10;

sub ExtractLocationLine {
    my $gps = $_[0];
    my $count;
    my $result;
    my $buffering = 0;
    my $buffer = '';
    my $limit = time + $timeout;
    while (1) {
        if (time >= $limit) {
           return '';
        ($count, $result) = $gps->read(255);
        if ($count <= 0) {
        if ($result =~ /^\$GPGGA/) {
            $buffering = 1;
        if ($buffering) {
            my $part = (split /\n/, $result)[0];
            $buffer .= $part;
        if ($buffering and ($result =~ m/\n/g)) {
            return $buffer;

Parsing the location line

The $GPGGA-line contains more information than we need. With regular expressions, we can extract the relevant data: $1 is the latitude, $2 is the longitude and $3 is the altitude.

sub ExtractGPSData {
    $_[0] =~ m/\$GPGGA,\d+\.\d+,(\d+\.\d+,[NS]),(\d+\.\d+,[WE]),\d,\d+,\d+\.\d+,(\d+\.\d+,M),.*/;
    return ($1, $2, $3);

Putting everything together

Finally, we convert the found data to JSON and print it to the standard output stream in order to write the HTTP response of the CGI script.

sub GetGPSData {
    my ($error, $gps) = GetGPSDevice;
    if ($error) {
        return ToError($gps);
    my $location = ExtractLocationLine($gps);
    if (not $location) {
        return ToError("Timeout: Could not obtain GPS data within $timeout seconds.");
    my ($latitude, $longitude, $altitude) = ExtractGPSData($location);
    if (not ($latitude and $longitude and $altitude)) {
        return ToError("Error extracting GPS data, maybe no lock attained?\n$location");
    return to_json({
        'latitude' => $latitude,
        'longitude' => $longitude,
        'altitude' => $altitude

sub ToError {
    return to_json({'error' => $_[0]});

binmode(STDOUT, ":utf8");
print "Content-type: application/json; charset=utf-8\n\n".GetGPSData."\n";


To execute the Perl script with a HTTP request, we have to place it in the cgi-bin directory; in our case we saved the file at /usr/lib/cgi-bin/gps.pl. Before accessing it, you can ensure that the Apache is configured correctly by checking the file /etc/apache2/sites-available/default; it should contain the following section:

ScriptAlias /cgi-bin/ /usr/lib/cgi-bin/
<Directory "/usr/lib/cgi-bin">
    AllowOverride None
    Options +ExecCGI -MultiViews +SymLinksIfOwnerMatch
    Order allow,deny
    Allow from all

Furthermore, the permissions of the script file have to be adjusted, otherwise the Apache user will not be able to execute it:

sudo chown www-data:www-data /usr/lib/cgi-bin/gps.pl
sudo chmod 0755 /usr/lib/cgi-bin/gps.pl

We also have to add the Apache user to the user group dialout, otherwise it cannot read from the serial port. For this change to come into effect the Raspberry Pi has to be rebooted.

sudo adduser www-data dialout
sudo reboot

Finally, we can check if the script is working by accessing the page <IP address>/cgi-bin/gps.pl. If the Raspberry Pi has no GPS reception, you should see the following output:

{"error":"Error extracting GPS data, maybe no lock attained?\n$GPGGA,121330.326,,,,,0,00,,,M,0.0,M,,0000*53\r"}

When the Raspberry Pi receives GPS data, they should be given in the browser:


Last, if you see the following message, you should check whether the Apache user was correctly added to the group dialout.

{"error":"Can't open serial port '/dev/ttyAMA0'!"}


In the last article, we focused on the hardware and its installation. In this part, we learnt how to access the serial port via Perl, wrote a CGI script that extracts and delivers the location information and used the Apache web server to make the data available via network.

Creating a GPS network service using a Raspberry Pi – Part 1

Using sensors is a task we often face in our company. This article series consisting of two parts will show how to install a GPS module in a Raspberry Pi and to provide access to the GPS data over ethernet. This guide is based on a Raspberry Pi Model B Revision 2 and the GPS shield “Sparqee GPSv1.0”. In the first part, we will demonstrate the setup of the hardware and the retrieval of GPS data within the Raspberry Pi.

Hardware configuration

The GPS shield can be connected to the Raspberry Pi by using the pins in the top left corner of the board.

Raspberry Pi B Rev. 2 (Source: Wikipedia)

Raspberry Pi B Rev. 2 (Source: Wikipedia)

The Sparqee GPS shield possesses five pins whose purpose can be found on the product page:

Pin Function Voltage I/O
GND Ground connection 0 I
RX Receive 2.5-6V I
TX Transmit 2.5-6V O
2.5-6V Power input 2.5-6V I
EN Enable power module 2.5-6V I
Sparqee GPSv1.0

Sparqee GPSv1.0

We used the following pin configuration for connecting the GPS shield:

GPS Shield Raspberry Pi Pin-Nummer
2.5-6V +3V3 OUT 1
EN +3V3 OUT 17

You can see the corresponding pin numbers on the Raspberry board in the graphic below. A more detailed guide for the functionality of the different pins can be found here.

Relevant pins of the Raspberry Pi

Relevant pins of the Raspberry Pi

After attaching the GPS module, our Raspberry Pi looks like this:

Attaching the GPS shield to the Raspberry

Attaching the GPS shield to the Raspberry


GPS data retrieval

The Raspberry GPS communicates with the Sparqee GPS shield over the serial port UART0. However, in Raspbian this port is usually used as serial console, which is why we cannot directly access the GPS shield. To turn this feature off and activate the module, you have to follow these steps:

  1. Edit the file /boot/cmdline.txt and delete all parameters containing the key ttyAMA0:
    console=ttyAMA0,115200 kgdboc=ttyAMA0,115200

    Afterwards, our file contains this text:

    dwc_otg.lpm_enable=0 console=tty1 root=/dev/mmcblk0p2 rootfstype=ext4 elevator=deadline rootwait
  2. Edit the file /etc/inittab and comment the following line out:
    T0:23:respawn:/sbin/getty -L ttyAMA0 115200 vt100

    Comments are identified by the hash sign; the result should look as follows:

    #T0:23:respawn:/sbin/getty -L ttyAMA0 115200 vt100
  3. Next, we have to reboot the Raspberry Pi:
    sudo reboot
  4. Finally, we can test the GPS module with Minicom. The baud rate is 9600 and the device name is /dev/ttyAMA0:
    sudo minicom -b 9600 -D /dev/ttyAMA0 -o

    If necessary, you can install Minicom using APT:

    sudo apt-get install minicom

    You can quit Minicom with the key combination strg+a followed by z.

If you succeed, Minicom will continually output a stream of GPS data. Depending on wether the GPS module attains a lock, that is, wether it receives GPS data by a satellite, the output changes. While no data is received, the output remains mostly empty.


Once the GPS module starts receiving a signal, Minicom will display more data as in the example below. If you encounter problems in attaining a GPS lock, it might help to place the Raspberry Pi outside.


A detailed description of the GPS format emitted by the Sparqee GPSv1.0 can be found here. Probably the most important information, the GPS coordinates, is contained by the line starting with $GPGGA: In this case, the module was located at 33° 55.3471′ Latitude North and 117° 41.7128′ Longitude West at an altitude of 112.2 meters above mean sea level.


We demonstrated how to connect a Sparqee GPS shield to a Raspberry Pi and how to display the GPS data via Minicom. In the next part, we will write a network service that extracts and delivers the GPS data from the serial port.

How we distribute our backups geographically

We are a software development company, so all of our most valueable assets are constantly endangered by hardware failure. We regularly do risk assessments in regard to data security and over the years created a fine-tuned system of duplication and doubled duplication to prevent data loss. Those assessments aren’t really complicated, you basically sit down, relax and think about your deepest fears on a certain topic. Then you write them down and act on their avoidance or circumvention. Here’s an example of some results:

  • No data transfer over unsecured internet connections
  • No single point of failure
  • No single area of failure

The last result is of particular interest today: We want to prevent data loss in case of “area-based desaster”, like a whole-building fire or meteorite impact. Well, to be clear on the meteorite scenario, it is both highly improbable and dangerous. If the meteorite happens to be just a bit bigger than average, we won’t worry about backups anymore because we all live in a perimeter around our company. Yes, worst-case scenarios are always morbid.

Stages of data-loss prevention

We have several measures in effect to prevent data-loss in place. Technologies like RAID drives and processes like daily backups and several copies of that backups make sure that we always have at least one copy of all important data even in the most drastic locally confined desaster. But to adhere to the first rule that no data transfer can happen over unsecured internet connections and to make sure that an internet connection isn’t a single point of failure that may compromise data security, we had to come up with a way to distribute our backups in a physical manner without much effort.

The backup export disks

Our system relies on three facts:

  • Small and resilient hard drives with high capacity are affordable
  • Every home of our employees can be an unique backup storage location
  • If we take turns, the effort is low for everybody, but high enough to be effective

So we bought an “backup export disk” for every employee. It’s an 2,5″ USB-powered hard drive with enough storage capacity to keep our most important data. All export disks are registered at the backup distribution system that can, upon connect, provide them with the most current backup. And a little “backup export token” that gets passed from employee to employee in a predetermined order. The token is just a piece of cardboard that says “tag, you are it!”.

Our backup export process

So what do you have to do when you find the “backup export token” on your desk? Just five easy steps:

  • Bring your backup export disk next day (this is the hardest part: remembering to bag the disk at home)
  • Plug it into the backup distribution system (a specific computer in off-state with an USB-cable) and switch it on
  • Wait for the system to do its job. This will take a while, but you’ll get an e-mail at completion, so just wait for the e-mail to arrive
  • Unplug the backup export disk and take it back home (store it in a dry and safe place)
  • Forward the backup export token to the next employee in line

That’s all there is to the obvious process. Some more things happen behind the scenes, but the process mostly relies on the effect of repetition by several operators.

Simple and effective

This process ensures that our backup gets “exported” at least thrice a week to different locations. All in all, we store our backup in at least five locations with a maximum age of two weeks. The system can scale up (or down) without limitation, so it won’t change even if we double or triple the location count or the export frequency. And any individual disk cannot be compromised as the data is secured by strong encryption, so there is no need to restrict physical access to it on the storage locations (like using a safe) or fret if a disk would get lost.

Decentralized, but supervised

Every time a backup export disk is connected to the backup distribution system, the disk’s health figures and remaining space is reported to the administrators. Using this information, we can also reconstruct the distribution history and fetch the most current disk in an emergency case. If a disk shows its age, it gets replaced by a new one without effort. We only need to tell the backup distribution system about it and associate it with an employee so that the e-mail is sent to the right person.


By assigning our employees with the core mechanics of keeping the backups distributed and automating the rest, we reached a level of data security that even protects against area effect scenarios.

TANGO – Making equipment remotely controllable

Usually hardwareTango_logo vendors ship some end user application for Microsoft Windows and drivers for their hardware. Sometimes there are generic application like coriander for firewire cameras. While this is often enough most of these solutions are not remotely controllable. Some of our clients use multiple devices and equipment to conduct their experiments which must be orchestrated to achieve the desired results. This is where TANGO – an open source software (OSS) control system framework – comes into play.

Most of the time hardware also can be controlled using a standardized or proprietary protocol and/or a vendor library. TANGO makes it easy to expose the desired functionality of the hardware through a well-defined and explorable interface consisting of attributes and commands. Such an interface to hardware –  or a logical piece of equipment completely realised in software – is called a device in TANGO terms.

Devices are available over the (intra)net and can be controlled manually or using various scripting systems. Integrating your hardware as TANGO devices into the control system opens up a lot of possibilites in using and monitoring your equipment efficiently and comfortably using TANGO clients. There are a lot of bindings for TANGO devices if you do not want to program your own TANGO client in C++, Java or Python, for example LabVIEW, Matlab, IGOR pro, Panorama and WinCC OA.

So if you have the need to control several pieces of hardware at once have a look at the TANGO framework. It features

  • network transparency
  • platform-indepence (Windows, Linux, Mac OS X etc.) and -interoperability
  • cross-language support(C++, Java and Python)
  • a rich set of tools and frameworks

There is a vivid community around TANGO and many drivers for different types of equipment already exist as open source projects for different types of cameras, a plethora of motion controllers and so on. I will provide a deeper look at the concepts with code examples and guidelines building for TANGO devices in future posts.

Snowflakes are a bad sign

snowflakeFirst, allow me a bad joke: If you enter your server room and find real snowflakes, it might be a sign that your air conditioning is over-ambitious. But even if you just enter your server room, you probably see some snowflakes, but in the metaphorical sense.

Snowflake servers

Snowflakes are servers with an unique layout. I cannot say it better than Martin Fowler two years ago in his Bliki posting SnowflakeServer, but I’m trying to add some insights and more current tools. The term probably originates in the motto that everybody is a “precious unique snowflake”. This holds true for humans and animals, but not for machines. Let’s examine how a snowflake is born. Imagine that in the beginning, all servers are the same: standard hardware, a default operating system and nothing more. You pick one server to host a special application and adjust the hardware accordingly. Now you already have an hardware snowflake – not the worst thing, but you better document your rationale behind the adjustment in an accessible way – a wiki page specifically for that server perhaps. Because sooner or later, that machine will fail (or become hopelessly obsolete) and needs to be replaced – with adequate hardware. Without your documentation, you’ll have to remember why the old machine had that specific layout – and if it was sufficient. I’ve seen the “ancient server” anti-pattern much too often: A dusted machine, buzzing like an asthmatic pensioner in the last corner of the server room, and nobody was allowed near. Because there are no spare parts (VESA local bus isn’t supported anymore), if one part fails, the whole system is doomed – operating system and software included. Entire organizations rely on the readiness for duty of one hardware assembly – and almost always a crude one.

Server as cattle

The ancient server happens more likely when you treat your servers like pets. This is the crucial mental switch you’ll have to make: servers are cattle, not pets. They have numbers, not names. They can be monitored, upgraded and fostered, but at the end of the day, they serve a clearly defined business case and deserve no emotional investment of the owner. If a pet gets hurt, you take it to the veterinary and cure it. If cattle gets sick, you call the veterinary to make sure it’s not contagious and then replace the affected individuals – to cure them would be more expensive. Pets live as long as they can, cattle has a dacattlete of expiry. And our cattle (servers) really isn’t sentient, so stop treating it like pets.

Strategies to run a ranch

Our current answer to make the transition from pet zoo to cattle ranch without significantly increasing the amount of metal in our server room can be boiled down to three strategies:

  • Virtualize the logical machines. Instead of working on “real metal machines”, more and more of our services run inside virtual machines. This allows for a clearer separation of concerns (one duty per machine) and keeps the emotional commitment towards the machine low. Currently, we use VirtualBox and Docker for this task. Both are easy to set up and fulfill their task well.
  • Remove the names from real metal machines. We really number our real machines now. Giving clever names to virtual machines is still possible, but not necessary: they are probably only accessed using DNS aliases that specify their use, like “projectX-database” or “projectY-webserver”. We even choose the computer cases for our machines accordingly to separate the pets (unique cases) from cattle (uniform cases).
  • Specify the machine. The virtualized hardware must be described and explained (e.g. why this particular machine needs twice the normal RAM ration). Currently, we use Vagrant to specify the hardware and operating system of our virtual machines. The specifications are stored in a version controlled repository, so there is a place where most of our server infrastructure is described in a deployable fashion. Even more, all necessary third-party software products are specified, too. Imagine a todo list of what to install and prepare, like the one you’ve handed over to your admin in the past, but automatically executable. We currently use Ansible for our configuration management because it has very low requirements for the target platform itself and has a low learning curve.

Applying these three strategies, every (logical) machine in our server room should be reproduceable. They are still individuals, specifically tailored for their jobs, but completely specified and virtualized. The real metal machines only run the bare minimum of software necessary to host the logical machines. None of the machines promote emotional attachment – they are tools for their job.

Data is snow

One important insight is that persistent data will turn your machine into a snowflake over time (we use the term as a verb: “data will snowflake your machine”). You will become emotionally and financially attached to this data – otherwise, there is no need to persist it in the first place. We don’t have a panacea here yet. You probably want to use a database and a sophisticated backup strategy here. Just make sure that the presence of precious data on it doesn’t obscure your stance towards the machine. You want to keep the data and still be able to throw the machine away.

Don’t stop at machines

We are software developers, so we cannot deny that the concept of snowflaking is very helpful for our own projects, too. Every dependency that we can bring with us during deployment (called “self-containment” or “batteries included” in our slang) is one less thing of “snowflaking” the target machine. Every piece of infrastructure (real, virtualized or purely conceptual) we implicitly rely on (like valid certificates, SSH keys or passwords and database locations) will snowflake the target machine and should be treated accordingly: documented, specified and automated. If you hot-fix a production server, it’s definitely a huge snowflaking action that needs to be at least carefully documented. You can’t avoid snowflaking completely, but strive to mimize the manual amount of it and then sanitize the automated part.

Snowflaking is a concept

We’ve found the term of “snowflaking” very useful to transport the necessity and value in documenting, specifying and automating everything that doesn’t happen on a developer machine (and even there, the build process is fully automated). Snowflaked enviroments tend to be expensive in maintainance and brittle in operations. The effort to mitigate the effects of snowflaking pays off very soon and is highly reuseable. But even more powerful is the change in the mindset as soon as the concept of “snowflaking” is understood. It’s a short term for a broad range of strategies and values/beliefs. It’s a powerful and scalable concept.

We’d love to hear your experiences

You’ve probably experimented with various tools and concepts to manage your servers, too. What were your experiences and insights? Add a comment below, we are looking forward to your input.

A small story about outsourcing

Let me tell you a story about human labor and automization, cost efficiency and the result of local optimization. The story itself is true, but nearly all details are changed to protect the innocent.

An opportunity

Once, there was a company that produced sensoric equipment with a large portion of electronic circuitry. The whole device was manufactured at the company’s main factory and admired for its outstanding rigidity. Then, one day, the opportunity offered itself to outsource the assembly of the electronics to a country in the asian region. The company boss immediately recognized the business value in this change: The same parts would be produced by the company, shipped to the asian contract manufacturerer, assembled and promptly returned. Then, the company’s engineers will provide the firmware and software for the final product. By outsourcing the most generic step in the production line, the production costs could be lowered significantly.

A detail

There is one little detail that needs to be told: The sensors relied on some very specific and fragile parts only the company itself could produce. These parts were especially sensible to the atmosphere they were assembled in. A very important aspect of the production process was a special purpose machine that could assemble the parts while sustaining the necessary gas mixture and pressure. Upon closer inspection, one could say that the essence of this product’s secret ingredience weren’t the parts itself, but the specifically tailored production process.

The special purpose machine had to be transferred to the contract manufacturer in asia, otherwise, the sensors could not be assembled. This was a minor inconvenience compared to the large profits that could be realized once the outsourcing was completed.

A success

The machine was transported to the contractor, installed and tested. A special crew of workers of the contractor’s staff was trained to operate the machine properly and within the necessary conditions. After a while, the production line began its work. The first sensors assembled offshore returned home. They all worked as intended. The local engineers couldn’t tell the difference but by looking at the serial number. The company management was pleased, the profitability was increased.

A failure

Everything went well for a while. Then, the local engineers noticed a slightly higher number of faulty sensors. Not long after, the quality assurance reported decreasing performance numbers of the devices. The rigidity of the device, the unique selling point, slowly deteriorated. The company management was worried and established a task force to indentify the root cause for this change to the worse.

A mystery

The task force inspected the reported problems and couldn’t make much sense of the numbers. It wasn’t a problem of whole faulty batches (indicating incidents like transport damage), but also not of individual faulty pieces. Instead, they found that if a piece was faulty, the next few pieces from that series were also faulty. Then, there were long intervals with perfectly good pieces until another group of clearly faulty pieces occurred. Something had to go wrong during the assembly process at the contractor.

A revelation

When the task force arrived in the contractor’s factory and inspected the special purpose machine, they found that the atmosphere regulator was damaged. This automatic part of the machine takes care of the mixture and pressure of the gas in the machine during operation and keeps it in the necessary range by applying or draining specific gas. The contractor didn’t bother to replace the rather expensive part when cheap human labor is readily available. They had hired a worker to perform the atmosphere regulation manually. Some lowly paid worker had to watch the pressure numbers and provide more or less gas, just as needed. This was nearly as good as the automatic regulation and still good enough to produce quality devices.

An explanation

But, the contractor only hired one worker per shift. This worker had to go to the toilet sometimes during the work day. When he was away from the machine, it went along unregulated, soon to be misadjusted to the point of only producing junk. Once the worker returned, he would balance the numbers and bring the machine in the OK state again. This situation occurred periodically, but not too often to taint whole batches. Only during his absence, the series of faulty devices would be produced.

A conclusion

I don’t want to add much moral to this story. Perhaps one thing should be considered when recapitulating: Both the company and the contractor “optimized” their costs locally by making cost efficient decisions that turned out to be expensive in the long run. The company chose between expensive, but controlled local production and cheap outsourced assembly, arguably the most delicate step in the whole production process. The contractor chose between a high one-time investment in an automatism and the low ongoing cost of cheap human labor. Both decisions are comprehensible on their own, but lead to a situation that would never have occurred in the original setting.

SSD? Don’t think! Just Buy!

My personal experience with SSDs began with an Intel X25M that I built into a Lenovo Thinkpad R61. It replaced a Seagate 160 GB 5400rpm which in combination with Windows Vista … well, let’s just say, it wasn’t that fast.

The SSD changed everything. It was not just faster, it was downright awesome! As if I had a completely new computer.

With that in mind I thought about my desktop PC. It’s a little more than 2 year old Windows XP box, Intel Core2Duo 2.7 GHz, 4GB RAM, with a not so slow Samsung HDD. I use it mainly for programming, which is most of the time Grails programming under IntelliJ IDEA.

And let me tell you, the Grails + IDEA combination can get dog slow at times. The start-up time of IDEA alone gives you time to skim over the first three pages of Hacker News and read the latest XKCD.

So the plan was to put an extra SSD into the Windows box and put only programming related stuff on it. This would save me the potential hassle of moving my whole system but would still give me development speed-up.

I had to be a little careful because the standard settings for IDEA’s so-called “system path” and “config path” is in the user’s home directory. (Btw, this settings can be changed in file “idea.properties” which resides in “IDEA_INSTALLATION_DIR\bin”, e.g.: c:\Progam Files\JetBrains\IntelliJ IDEA 9.0.4\bin)

I think you already guessed the result. Three words: fast, faster, SSD. It’s just amazing! IDEA start-up is so fast now, I barely have time for a quick look at the newest headlines on InfoQ.

The next step is of course to put the whole system on SSD but that will probably have to wait until we upgrade the whole company to Win7. Can’t wait… 🙂

About PrintStream and Exceptions

Several of our projects deal with sensor hardware of different types often connected via the good old™ serial port. That is fine most of the time because most protocols are simple and RXTX provides a nice cross-platform library for most of your serial port needs. But many new computers do not feature the old RS232 serial ports anymore or other contraints prevent the use of a plain RS232 serial port. Here come serial converters like the Advantech ADAM 4570 (serial-to ethernet) or usb-to-serial converters into play. Usually this works fine.

Now one of our customers had a test system using an unreliable converter with sensor hardware. The hardware problems uncovered a robustness issue in our software which crashed the JVM when the virtual serial port of the converter disappeared and our app tried to write to it. Despite the faulty hardware our software had to be robust because it manages many more devices other than just that one sensor over serial. Looking at the problem we discovered that the crash occurred somewhere in the native part of RXTX. So we decided to scratch our own itch (and the one of the customer) and set out to fix the issue in RXTX at a Open Source Love Day (OSLD) . So we fixed the problem and submitted the patch to the bugtracker of the RXTX project. Our sample program now worked flawlessly and threw an IOException when the serial port failed in some way.

Happy to have fixed the problem we incorporated the patch RXTX in our production software but it still crashed and no IOException appeared anywhere in the logs. After another bughunting session we spotted the subtle difference of sample and production program: the use of OutputStream insted of PrintStream. PrintStream silently swallows all exceptions which proved fatal in our use case with the unreliable stream carrier. So the final fix was essentially replacing our PrintStream code

RXTXPort port = new RXTXPort("COM6");
PrintStream p = new PrintStream(port.getOutputStream(), true, "iso8859");

with using OutputStream directly:

RXTXPort port = new RXTXPort("COM6");
OutputStream o = port.getOutputStream();


Be careful when using PrintStream with unreliable stream carriers it swallows exceptions! That may shadow problems which you may want to know of. Often PrintStreams behaviour will not be a problem but in certain cases like the one depicted above it causes a lot of headaches.