What you need to know
We are a small software development company with a home-grown IT infrastructure. The euphemism for such a state is “evolutionary grown”, denoting a process that was shaped by the most elementary forces, often implicitely. One such implicit force is laziness: If there is a quick way to do things, it will be done this way. Why invest effort if everything works just fine?
During an internal safety review, we identified our IT landscape as a risk factor. It was designed to meet yesterday’s and perhaps today’s demands, but in no way aligned to our strategic vision. We decided to invest in our IT to bring it to a planned state that we are confident will sustain our demands of tomorrow – or be easily adaptable.
Where we started
Our starting point was a room full of servers that were bought at some point to serve a specific need like “be the build box”. Every server started with a good reason to exist and evolved from there. Some gathered more and more services, some were repurposed and some idled along. We identified only two servers that were essential for the company: one was the continuous integration server master and one hosted nearly all mission-critical services at once. The latter server was also our oldest machine in production usage. It was secured against data loss, but not against outages. So everytime this server went down, our company essentially came to a stop because all services were offline. Luckily, it went down very infrequently, but it still identified as a clear single point of failure.
Where we wanted to go
In April 2014, the heartbleed vulnerability was published. We luckily weren’t affected on a large scale, but took it as a wake-up call to review our IT setup and to develop a strategy to mitigate the effects of disasters similar in scope to heartbleed while we still have time. We wanted to have our IT in a condition where we actually choose which risks we take instead of just hoping for the best. So we sat before a whiteboard and outlined the goals: We wanted to separate and self-contain every essential service, so that the compromising or outage of one service doesn’t affect the others. That means one machine (or container) per service. To gain flexibility, we also planned to separate our IT landscape into two layers: The “metal layer” provides the computation power, while the “appliance layer” realizes the services. We wanted to be able to implement the appliance layer nearly independtly to the metal layer, which means to use some sort of virtualization. In modern words, we wanted to have a “cloud platform” to deploy our service applications on. We just don’t wanted it out on the internet but in our computer center. To sum up, we wanted to separate hardware and software and move every service in its own compartment.
What technologies we chose
We thought about fitting technology for a long time but settled for a small-scale, bottom-up approach: Start with just a few metal machines (hosts) and use a familiar virtualization product. In our case, this meant two standard servers, Linux and Oracle’s VirtualBox to run the virtual computers. There sure are more professional and powerful virtualization products out there, but we had years of experience (and sometimes frustration) with VirtualBox and didn’t want to rely on an unknown technology. It’s not exciting, but works well enough for our use case – and we knew that beforehands.
We decided against any fancy cloud or grid software to combine the hosts to a pool and just planned the hosting of the virtual machines (VMs) statically by hand. This might mean that one host gets bored while another host cannot handle the pressure anymore. It will be our responsibility to take that problem into account. This approach primarily achieves one thing: it keeps everything rather simple. Each host has a list of VMs and that’s it. If we want to migrate a VM to another host, we have to do it manually.
To create the VMs, we used Vagrant, which turned out fine for three-quarters of our machines, but proved toxic for the remaining ones. Vagrant is a very handy tool for developers to quickly launch a VM, but it makes a lot of assumptions that might not match your specific requirements. We essentially abandoned Vagrant after the initial phase.
During the migration phase of our services, we adopted another tool to solve the problem of scaling effects in maintainance. It’s another story to maintain 20+ servers instead of the handful we had beforehands. Luckily, Ansible proved useful to automate most of our normal administration tasks. This transition from manual to automated administration wasn’t part of the original plan, but is one of the biggest payoffs. But that’s stuff for the next blog entry.
What’s next?
In this first part of our story to regain control of our IT landscape, we described the starting point, the plan and the tools. In the next part, you’ll hear about the migration and where we ended up. We will also point out our experiences along the way and hopefully give some useful tips if you think of reshaping your services, too: Click here to read part two of the series.