Basic business service: Sunzu, the list generator

This might be the start of a new blog post series about building blocks for an effective business IT landscape.

We are a small company that strives for a high level of automation and traceability, the latter often implemented in the form of documentation. This has the amusing effect that we often automate the creation of documentation, or at least the creation of reports. For a company of fewer than ten people working mostly in software development, we have lots of little services and software tools that perform tasks for us. In fact, we work with 53 different internal projects (this is what the blog post series could cover).

Helpful spirits

Some of them are rather voluminous or at least too big to replace easily. Others are just a few lines of script code that perform one particular task and could be completely rewritten in less than an hour.

They all share one goal: To make common or tedious tasks that we have to do regularly easier, faster, less error-prone or just more enjoyable. And we discover new possibilities for additional services everywhere, once we’ve learnt how to reflect on our work in this regard.

Let me take you through the motions of discovering and developing such a “basic business service” with a recent example.

A fateful Friday

The work that led to the discovery started abruptly on Friday, 10th December 2021, when a zero-day vulnerability with the number CVE-2021-44228 was publicly disclosed. It had a severity rating of 10 (on a scale from 0 to, well, 10) and was promptly nicknamed “Log4Shell”. From one minute to the next, we had to scan all of our customer projects, our internal projects and products that we use, evaluate the risk and decide on actions that could mean disabling a system in live usage until the problem is properly understood and fixed.

Because we don’t only perform work but also document it (remember the traceability!), we created a spreadsheet with all of our projects and a criteria matrix to decide which projects needed our attention the most and what actions to take. An example of this process would look like this:

  • Project A: Is the project at least in parts programmed in Java? No -> No attention required
  • Project B: Is the project at least in parts programmed in Java? Yes -> Is log4j used in this project? Yes -> Is the log4j version affected by the vulnerability? No -> No immediate attention required

Our information situation changed from hour to hour as the whole world did two things in parallel: The white hats gathered information about possible breaches and unaffected versions while the black hats tried to find and exploit vulnerable systems. This process happened so fast that we found ourselves lagging behind because we couldn’t effectively triage all of our projects.

One bottleneck was the creation of the spreadsheet. Even just the process of compiling a list of all projects and ruling out the ones that are obviously not affected by the problem was time-consuming and not easily distributable.

Post mortem

After the dust settled, we had switched off one project (which turned out to be not vulnerable on closer inspection) and confirmed that all other projects (and products) weren’t affected. We fended off one of the scariest vulnerabilities in recent times with barely a scratch. We could celebrate our success!

But as happy as we were, the post mortem of our approach revealed a weak point in our ability to quickly create spreadsheets about typical business/domain entities for our company, like project repositories. If we could automate this job, we would have had a complete list of all projects in a few seconds and could have worked from there.

This was the birth hour of our list generator tool (we called it “sunzu” because – well, that would require the explanation of a German word play). It is a simple tool: You press a button, the tool generates a new page with a giant table in the wiki and forwards you to it. Now you can work with that table, remove columns you don’t need, add additional ones that are helpful for your mission and fill out the cells that are empty. But the first step, a complete list of all entities with hyperlinks to their details, is a no-effort task from now on.
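To make the mechanism tangible, here is a minimal Python sketch of such a list generator. It is not the actual sunzu code; the data source and the wiki endpoint (list_projects and WIKI_API below) are hypothetical stand-ins, but the overall shape – gather the entities, render a table, create a wiki page – is the same idea:

import requests  # assumption: the wiki offers a simple JSON API for creating pages

WIKI_API = "https://wiki.example.com/api/pages"  # hypothetical endpoint

def list_projects():
    """Stand-in for the real data source, e.g. the API of your repository manager."""
    return [
        {"name": "Project A", "url": "https://repo.example.com/project-a", "language": "Python"},
        {"name": "Project B", "url": "https://repo.example.com/project-b", "language": "Java"},
    ]

def render_table(projects):
    """Render the entities as wiki table markup, one row per project with a hyperlink."""
    header = "|| Project || Language ||"
    rows = [f"| [{p['name']}|{p['url']}] | {p['language']} |" for p in projects]
    return "\n".join([header, *rows])

def generate_list_page(title):
    """Create the wiki page and return its address so the user can be forwarded to it."""
    response = requests.post(WIKI_API, json={"title": title, "body": render_table(list_projects())})
    response.raise_for_status()
    return response.json()["url"]

if __name__ == "__main__":
    print(generate_list_page("All projects (generated)"))

In practice, the data source would be whatever system already knows about your entities, for example your repository manager or issue tracker.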

No-effort chores

If Log4Shell happened today, we would still have to scan all projects and decide for each one. We would still have to document our evaluation results and our decisions. But we would start with a list of all projects, a column that lists their programming languages and other data. We would be certain that the list is complete. We would be certain that the information is up-to-date and accurate. We would start with the actual work and not with the preparation for it. The precious minutes at the beginning of a time-critical task would be available and not bound to infrastructure setup.

Since its first job of generating a spreadsheet of all projects, the list generator tool has accumulated additional entity types that it can list for our company. For some, it was easy to collect the data. Others require more effort. There are some that don’t justify the investment (yet). But it had another effect: It is a central place for “list desires”. Any time we create a list manually now, we pose the important question: Can this list be generated automatically?

Basic business building blocks

In conclusion, our “sunzu” list generator is a basic business service that might be valuable for every organization. Its only purpose is to create elaborate spreadsheets about the most important business entities and present them in an editable manner. Whether the spreadsheet is created as an Excel file, as an editable website like tabble or as a wiki page (as in our case) is secondary.

The crucial effect is that you can think “hmm, I need a list of these things that are important to me right now” and just press a button to get it.

Sunzu is a web service written in Python, with a total of less than 400 lines of code. It could probably be rewritten from scratch on one focussed workday. If you work in an organization that relies on lists or spreadsheets (and which organization doesn’t?), think about which data sources you tap into to collect the lists. If a human can do it, you can probably teach it to a computer.

What are entities/things in your domain or organization that you would like to have a complete list/spreadsheet generated automatically? Tell us in the comments!

The four stages of automation – Part II

One of the core concepts of software development and IT in general is “automation”. By delegating work to machines, we hope to reduce costs and save time while maintaining the quality of results. But automation is not an all-or-nothing endeavor; there are at least four different stages of automation that can be distinguished.

In the first part of this blog series, we looked at the first two stages, namely “documentation” and “recurring reminders”. Both approaches are low-tech, but high-impact. Machines only played a minor role – this will change in this part of the series. Let’s look at the remaining two stages of automation:

Stage 3: Semi-automatic

If you have a process that is properly documented and you are reminded in a regular fashion, like once a month, you’ll soon find that some steps of the process could be done by a machine, while you as the “human in duty” still pull all the strings that orchestrate the whole thing.

You might know the term “semi-automatic” from firearms: a semi-automatic firearm doesn’t aim or shoot itself, it just reloads automatically after each shot. The shooter still has to pull (and release) the trigger for each single shot. The shooter is in full control of the weapon; it just automates the mundane and repetitive task of chambering the next round.

This is the kind of automation we are talking about for stage 3. It is the most common type of automation. We know it from our cars, our coffee machines and other consumer electronics. The car manages a lot of different tasks under the hood while we are still in control of the overall task of driving from A to B.

What does this look like for business processes? One class of stage 3 utilities are reporting tools that gather and aggregate data from different sources and present the result in a suitable manner. In our company, these tools make up the majority of stage 3 services. There are reporting tools for the most important numbers (the key performance indicators – KPIs) and even some for less important, but cumbersome-to-acquire data. Most tools just present a nice website with the latest results while others send e-mails or create pages in our wiki. If you need a report, just press a button or visit a URL and the machine comes up with the answer. I tend to call this class of tools “sensors”, because they acquire data and process it, but don’t decide on the results.
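As an illustration, such a “sensor” can be sketched in a few lines of Python. The data sources below are hypothetical placeholders; the point is the shape – gather, aggregate, present, but never decide:

import json
from urllib.request import urlopen

# hypothetical data sources – replace with whatever systems hold your numbers
SOURCES = {
    "open_invoices": "https://erp.example.com/api/invoices/open",
    "open_issues": "https://tracker.example.com/api/issues/open",
}

def gather():
    """Fetch the raw numbers from each source (assumes each endpoint returns a JSON list)."""
    report = {}
    for name, url in SOURCES.items():
        with urlopen(url) as response:
            report[name] = len(json.load(response))
    return report

def present(report):
    """Render the aggregated data; a real sensor might write a wiki page or send an e-mail instead."""
    return "\n".join(f"{name}: {count}" for name, count in sorted(report.items()))

if __name__ == "__main__":
    print(present(gather()))  # the human still decides what to do with the numbers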

The other class of stage 3 utilities that are common are “actuators” in the sense that they perform tasks on command. We have scripts in place to shut down whole clusters of computers, clear wiki spaces or reset custom fields on important data objects, but those scripts are only triggered by humans.

A stage 3 actuator could even be something as small as a mailto link. Let’s say you have to send a standardized e-mail to a known recipient as part of a monthly process. Sure, you can save a draft in your e-mail application, but you can also prepare the whole mail in a URL directly in the documentation of the process:

mailto:nobody@softwareschneiderei.de?subject=The%20schneide%20blog%20rocks!&body=I%20read%20your%20blog%20post%20about%20automation%20and%20tried%20the%20mailto%20link.%20This%20thing%20is%20awesome%2C%20thank%20you!

If you click the link above, your e-mail application will prompt you to send an e-mail to us. You don’t need to follow through – we won’t read it on that address.

You can read about the format of mailto links here, but you probably want to create working mailto links right away, which is possible with this nifty stage 3 service utility written by Michael McKeever (buy him a beer!).
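If you would rather build such links yourself, a few lines of Python around the standard library’s urllib.parse.quote do the job. This sketch just reassembles the example link from above (recipient, subject and body are the same example values):

from urllib.parse import quote

def build_mailto(recipient: str, subject: str, body: str) -> str:
    """Assemble a mailto link with percent-encoded subject and body."""
    return f"mailto:{recipient}?subject={quote(subject)}&body={quote(body)}"

link = build_mailto(
    "nobody@softwareschneiderei.de",
    "The schneide blog rocks!",
    "I read your blog post about automation and tried the mailto link. This thing is awesome, thank you!",
)
print(link)  # paste the result into the documentation of your process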

Be aware that this is a classic example of chaining stage 3 tools together: You use a tool to create the mailto link that you use subsequently to write, but not send, e-mails. You, as the human coordinator, decide when to write the e-mail, if you want to adapt it to current circumstances and when to send it. The tools only speed you up, but don’t act or decide on their own.

An important aspect of this type of automation is the human duty of orchestration (which service does its thing when) and the possibility of inspection and adaption. The mailto link doesn’t send an e-mail, it just prepares it for you to send. You have the final word on the things that happen.

If you require this level of control, stage 3 automation is where your automation journey ends. It still needs the competent human operator (what, when, why) – but given a decent documentation (as outlined in stage 1), this competence can be delegated quickly. It is also the first automation stage that enables higher effectiveness through speedup and error reduction. The speedup is capped by the maximum speed of the human operator, though.

Stage 4: Full automation

The last stage of automation is “full automation”, which means that a machine gathers the data on its own, comes up with a decision based on the data and acts on its own. This is a powerful tool, but a dangerous one, too.

It is powerful because you just employed an additional worker. Not a human worker, but a machine. It doesn’t go on holiday, it doesn’t lose interest and won’t ask for a bonus.

It is dangerous, because your additional worker does exactly what it is told (programmed) to do, even if it doesn’t make sense or needs just the slightest adaption to circumstances.

Another peril lies in the fact that the investment to reach the fully automated stage is maximized. As with nearly everything related to IT, there is a relevant xkcd comic for this:

https://xkcd.com/1319/

The problem is that machines are not aware of their context. They don’t deal well with slight deviations (like “1,02” instead of “1.02”) and cannot weigh the consequences of task failure. All these things are done by a competent human operator, even without specific training. You need to train a machine for every eventuality, down to the dots.
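To illustrate how small such a deviation can be and how much explicit handling it still needs, here is a tiny Python sketch (not taken from any of our tools) that tolerates a decimal comma and reports a failure instead of silently guessing:

def parse_measurement(raw: str) -> float:
    """Parse a decimal number, tolerating a comma as decimal separator."""
    normalized = raw.strip().replace(",", ".")
    try:
        return float(normalized)
    except ValueError as error:
        # a human operator would just read the value; the machine has to
        # detect the failure and report it explicitly
        raise ValueError(f"cannot interpret measurement {raw!r}") from error

print(parse_measurement("1.02"))  # 1.02
print(parse_measurement("1,02"))  # 1.02 – a plain float("1,02") would raise a ValueError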

This means that you can’t just program the happy path, as you do in stage 3, when a human operator will notice the error and act accordingly. You have to implement special case behaviour, failure detection, failure handling and problem reporting. You have to adapt the program to changes in the environment in a timely manner (this work is also present in stage 3, but can be delayed more often).

If the process contains mostly routine and is recurring often enough to warrant full automation, it is a rewarding investment that pays off quickly. It will lift your human-based work to a new level: designing and maintaining an automation platform that is cost-efficient, scalable and adjustable. The main problem will be time-critical adjustments and their overall effects on the whole system. You don’t need routine workers anymore, but you’ll need competent technicians on stand-by.

Examples of fully automated processes in our company are data backups, operating system upgrades, server monitoring and the recurring reminder system that creates the issues for our stage 2 automation. All of these processes have increased reporting capabilities that highlight problems or just anomalies in a direct manner. They all have one thing in common: They are small, work on only one thing and try to do so with minimal dependencies and interaction.

Conclusion

There are four distinguishable stages of automation:

  1. documentation
  2. recurring reminders
  3. semi-automatic
  4. full automation

The amount of human work for the actual process decreases with each stage, while the amount of human work for the automation increases. For most processes in an organization, there will be a sweet spot between process cost and automation cost somewhere on that spectrum. Our job as automaters is to find that sweet spot and not apply too much automation.

If you have a good story about not enough automation or too much automation or even about automation being just right – tell us in the comments!

The four stages of automation – Part I

One of the core concepts of software development and IT in general is “automation”, the “creation and application of technologies to produce and deliver goods and services with minimal human intervention” (definition from techopedia).

The problem is that “minimal human intervention” is often misunderstood as “no human intervention”, which is the most laborious and expensive stage of automation that might not have the most economic return on investment. It might be more efficient to have some degree of intervention left while investing only a fraction of the automation work and duration.

In order to decide “how much” automation is the most profitable for the foreseeable future, I’ve established a model with four stages of automation that I can quickly check against the circumstances. In this blog post, I describe the first two stages and give some ideas on how to implement them.

Stage 1: Documentation

The first step to automation is to just describe the process in a manner that can be repeated. The documentation itself does nothing, but it enables repetition and scalability, two fundamental aspects of automation.

Think about baking a pie. If you just mix some ingredients and put it in the oven for an arbitrary amount of time, you might produce the most delicious pie ever, but you cannot do it again if you don’t remember all details and, even more tragic, nobody else can bake your pie. In order to give others the secret to your special pie, you have to give them the recipe – the documentation of its production process. Once the recipe is written down (and published), it can be read by many bakers in parallel and enables all of them to recreate your invention (to some degree at least, there are probably still some tricks and secrets left out of the recipe).

While the pie baking process still needs human intervention (the bakers that read the recipe and transform it into a series of actions), it is automated in the sense that it can be repeated with roughly the same result and these repetitions, given enough bakers and ovens, can be performed in parallel.

The economic evaluation of documentation shows that it is really easy to create, fast to change and, given some quality of content, nearly universally understood. If you don’t want to invest a lot of time and money, documenting your processes is the first and most important step towards automation. For a lot of your processes, it will also be the last possible stage of automation, at least until artificial intelligence learns your tricks and interpretations.

Documenting your processes is (no surprises here) the foundation of most quality assurance standards. But it is surprisingly hard to start with. This is not a matter of tools – pen and paper will do in the beginning. It is a shift in your mindset. The goal is no longer to bake a pie. It is to write a recipe while you bake the pie as a reference piece for it. If you want to start documenting your processes, here are three tips that might help you:

  • Choose a digital tool that doesn’t obstruct you. It should be digital because this facilitates distribution and collaboration. It should not hinder you because every time you need to think about the tool, you lose the focus on your process. I’m using a Wiki that lets me type the things I want to say without interference. In my case, that’s Confluence, but Obsidian or other tools are just as good.
  • Try to adopt a narrative structure to describe your processes. Think about the established structure of a baking recipe. For example, there is an ingredients list separate from the preparation instructions. If you find a structure that works for you, repeat and evolve it. It helps you and your readers to stay on track and not scatter the information all over the place. In my case, the structure consists of four paragraphs:
    1. Event/Trigger – The circumstance(s) that should be present at the beginning of the process
    2. Actions/Steps – The things you have to do, described in the necessary details for the target audience. This is often the paragraph with the most content.
    3. Result – Description of the circumstance(s) that should be present once you’ve done all steps. In recipes, this is often a photo of the meal/pastry. For first-time performers, this description is important to be able to declare success.
    4. Report – Who needs to be informed? This paragraph is often missing in descriptions, but crucial for collaboration. If nobody knows there is a fresh and delicious pie in the kitchen, it will not be eaten. Ok, that’s a bad example: Pies in the oven announce themselves with their smell. Digital products often have no smell – inform your peers!
  • Iterate over your documentation any chance you get. It is easy to bake your signature pie from memory. But is the recipe still accurate? Are there details that are important, but missing from the description? Your digital tool probably allows immediate modification of your documentation and maybe even informs interested readers about your update. Unchanged documentation is dead documentation. In my case, I always open the process description on a secondary monitor whenever I perform the process. Sometimes, I invite others to perform the process for me to review the accuracy and fidelity of the documentation.

If you can open the process description of many of your routine tasks, you have reached the first stage of automation for your work. Of course, there will be lots of things you do that are not “routine” – yet. With good documentation, you can even think about delegation – the art of maximizing the amount of work done by others – without sacrificing essential quality.

In later stages, the delegation target (the “others”) will be machines.

Stage 2: Recurring reminders

If you’ve documented a process with a structure similar to mine, you specified a trigger or event that requires the process to be performed. Perhaps it’s the first day of the month and you need to update your timesheet or send out the appointment overview for the next weeks. Maybe your office plants silently thirst for some water. Whatever it is, if your process is recurrent, you might think about recurring reminders.

This will not automate the performance of the process, but unburden you of thinking about the triggering event. The machines will now remind you about certain tasks. This can be a simple series of reminders in your scheduler app or, like in my case, the automated creation of issues (or todo items, tickets) in your work planning application.

For example, once every few weeks, a friendly machine creates an issue for me to write a blog entry on this blog. It does the same for my colleagues and even sets a “due date” (The due date for this post is today). With this simple construct, some discipline and coordination, we’ve managed to write one blog post every week for more than ten years now.
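Such a reminder machine does not need to be sophisticated. The following Python sketch shows the idea; the issue tracker endpoint and the payload fields are hypothetical stand-ins for whatever work planning application you use, and the script itself would be triggered by cron or a similar scheduler:

from datetime import date, timedelta

import requests  # assumption: the work planning application offers a REST API

TRACKER_API = "https://tracker.example.com/api/issues"  # hypothetical endpoint

def create_blog_reminder(author: str) -> None:
    """Create the recurring 'write a blog entry' issue with a due date."""
    payload = {
        "title": f"Write a blog entry ({author})",
        "assignee": author,
        "due_date": (date.today() + timedelta(weeks=2)).isoformat(),
    }
    requests.post(TRACKER_API, json=payload, timeout=10).raise_for_status()

if __name__ == "__main__":
    # triggered every few weeks by cron or a similar scheduler; the machine
    # only creates the issue, it never supervises its progress
    create_blog_reminder("your.name")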

The machine that creates the issues doesn’t check them. It doesn’t supervise their progress and isn’t offended if we “won’t fix” issues because we are on holiday or the plants are still wet. It will just create the next issue according to the rhythm. It is our duty as humans to check if that rhythm fits or if it should be sped up or slowed down.

If you want to employ really elaborate triggers for your reminders, a platform like “If this then that (IFTTT)” might be the right choice. Just keep in mind that with complexity, there often comes rigidity, which isn’t always desired.

By automating the aspect of reminding us about the routine tasks, we can concentrate on doing them. We don’t forget to write blog posts or to water the plants because the machine doesn’t forget. Another improvement is that this clearly distinguishes between routine (has a recurring reminder) and anomaly. If the special one-time task occurs again, we give it a recurring reminder and adopt it as a new routine task. If a reminder about a routine task is “won’t fixed” often enough without any inclination that it will be required again, we delete the reminder.

Conclusion for part I

If you combine automated recurring reminders with structured documentation, you already gain a lot of advantages and can free your mind from the mundane details and intervals of your routine tasks. You haven’t automated any aspect of your real work yet, which means that these two stages can be applied to most if not all workplaces.

In the next part of this series, we will look at the two stages that become integrated with your actual work. Stay tuned!

Your most precious resource

When I was a young student, living on my own for the first time, my most scarce resource seemed to be money. Money’s too tight to mention was (and probably still is) a motto that every student could understand. So we traded our time for money and participated in experiments and underpaid student assistant jobs.

Soon after I graduated, money began to accumulate. I have a rather frugal lifestyle, so my expenses didn’t suddenly surge. Instead, my perception of time and money shifted: Money isn’t the bottleneck (anymore), time is. Suddenly, time was much more valuable than money and I would gladly pay money if that meant some hours of additional leisure time or one less problem to tend to. It seems that time is the most precious thing there is.

The traditional economic wisdom supports this idea: “Time is money” is true, but the reverse is not: “money is time” doesn’t cut it. The richest man on earth still only has as much time available to him as anybody else.

If time is the most precious resource, the drive to automation as a time-saving effort can be understood directly. Automation also reduces learning costs if you scale horizontally by parallelizing production.

But soon after I had enough money to optimize my time, I hit another resource bottleneck. Suddenly, I had more time on hand than attention to spare. It turns out that attention is the most valuable resource you can spend. It is just entangled enough with time that it’s hard to distinguish which runs out first. If you reflect a bit, it becomes obvious. The term “to pay attention” is pretty spot on.

Let me take up the thought of automation again, this time in the domain of software development, in the form of automated tests. Here, automation is not in its most profitable shape. You don’t gain much from scaling your tests horizontally. If you don’t change the code, it doesn’t matter if your tests run once or a thousand times in parallel, the result will be the same (except if you run hardware-dependent tests, but even then you probably don’t gain much after covering all hardware variations).

You also don’t gain much from scaling your tests vertically by making them run faster and faster. It sure helps to have them run continuously in the background (think of a user-modded compiler – look into Continuous Test Runners if you are interested), but after one test run per meaningful change, the profit hits a limit.

So why else is automated testing an economically sound practice? My take on it is delegated attention. You write a small software (your test) that augments your attention area onto code that probably fades from your own attention pretty soon. Automated tests provide automated attention in a sustainable manner (except for those tests that cry wolf for no good reason, those are attention sinks and should be removed from your portfolio). Because of the automation, this delegated attention never fades – even after many years, the test has a close eye on “its” code.

If you are a developer, you have automation and zero-cost copying (aka parallelization without upfront costs) intrinsically in your solution portfolio. Look for ways how to make money with those super-powers. Or even better, look for ways to save time. But if you want the best return on investment for your efforts, you should look for ways to expand your attention area.

Do you agree that attention is our most precious resource? What do you do to lower your attention expenses? Perhaps you have experience in the Ops/DevOps area that resonates with these thoughts? Share your opinion by commenting below or writing your own blog entry!

We will pay attention to you.

Using Ansible vault for sensitive data

We like using Ansible for our automation because it has minimal requirements for the target machines and the surrounding infrastructure. You need nothing more than ssh and Python with some libraries. In contrast to alternatives like Puppet and Chef you do not need special server and client programs running all the time and communicating with each other.

The problem

When setting up remote machines and deploying software systems for your customers you will often have to use sensitive data like private keys, passwords and maybe machine or account names. On the one hand you want to put your automation scripts and their data under version control and use them from your continuous integration infrastructure. On the other hand you do not want to spread the secrets of your customers all around your infrastructure and definitely never ever in your source code repository.

The solution

Ansible supports encrypting sensitive data and using them in playbooks with the concept of vaults and the accompanying commands. Setting it up requires some work but then usage is straight forward and works seamlessly.

The high-level conversion process is the following:

  1. create a directory for the data to substitute on a host or group basis
  2. extract all sensitive variables into vars.yml
  3. copy vars.yml to vault.yml
  4. prefix variables in vault.yml with vault_
  5. use vault variables in vars.yml

Then you can encrypt vault.yml using the ansible-vault command providing a password.

All you have to do subsequently is to provide the vault password along with your usual playbook commands. Decryption for playbook execution is done transparently on-the-fly for you, so you do not need to care about decryption and encryption of your vault unless you need to update the data in there.

The step-by-step guide

Suppose we want to work on a target machine run by our customer, who provides us access via ssh. We do not want to store the ssh user name and password in our repository but want to be able to run the automation scripts unattended, e.g. from a Jenkins job. Let us call the target machine ceres.

So first we set up the directory structure by creating a directory for the target machine called $ansible_script_root$/host_vars/ceres.

To log into the machine we need two sensitive variables: ansible_user and ansible_ssh_pass. We put them into a file called $ansible_script_root$/host_vars/ceres/vars.yml:

ansible_user: our_customer_ssh_account
ansible_ssh_pass: our_target_machine_pwd

Then we copy vars.yml to vault.yml and prefix the variables with vault_ resulting in $ansible_script_root$/host_vars/ceres/vault.yml with content of:

vault_ansible_user: our_customer_ssh_account
vault_ansible_ssh_pass: our_target_machine_pwd

Now we use these new variables in our vars.yml like this:

ansible_user: "{{ vault_ansible_user }}"
ansible_ssh_pass: "{{ vault_ansible_ssh_pass }}"

Now it is time to encrypt the vault using the command

ansible-vault encrypt host_vars/ceres/vault.yml

which will prompt you for the vault password (alternatively, you can supply it non-interactively with --vault-password-file). The result is an encrypted vault that can be put into source control. It looks something like this:

$ANSIBLE_VAULT;1.1;AES256
35323233613539343135363737353931636263653063666535643766326566623461636166343963
3834323363633837373437626532366166366338653963320a663732633361323264316339356435
33633861316565653461666230386663323536616535363639383666613431663765643639383666
3739356261353566650a383035656266303135656233343437373835313639613865636436343865
63353631313766633535646263613564333965343163343434343530626361663430613264336130
63383862316361363237373039663131363231616338646365316236336362376566376236323339
30376166623739643261306363643962353534376232663631663033323163386135326463656530
33316561376363303339383365333235353931623837356362393961356433313739653232326638
3036

Using your playbook looks similar to before, you just need to provide the vault password using one of several options like specifying a password file, an environment variable pointing to a password file, or interactive input. In our example we set the environment variable inline, pointing it to a local file that contains the vault password:

ANSIBLE_VAULT_PASSWORD_FILE=vault_password_file ansible-playbook -i inventory work-on-customer-machines.yml

After setting up your environment appropriately with a password file and the ANSIBLE_VAULT_PASSWORD_FILE environment variable, your playbook commands are exactly the same as without using a vault.

Conclusion

The Ansible vault feature allows you to safely store and use sensitive data in your infrastructure without changing much about the way you use your automation scripts.

Our voyage to service separation – Part II

We transformed our evolutionary grown IT landscape to a planned setup. Here is what we learned on the way (part 2/2).

Recap of the situation

In the first part of this blog series, we introduced you to our evolutionary grown IT landscape. We had a room full of snowflake servers and no overall concept of how to use them. We wanted our services to be self-contained and separated. So we chose the approach of virtualization to host one VM per service on a uniform platform. We chose VirtualBox, Vagrant and Ansible to help us along the way.
This blog entry tells you about the way and our experiences and insights.

The migration

In order to migrate every service you use to its own virtual machine (VM), you’ll need a list or map of your services first. We gathered our list, compared it to reality, adjusted it, reiterated everything, added the forgotten services, drew the map, compared again, drew again and even then missed some services that are painfully obvious in hindsight, like DNS or SMTP. We identified more than 15 distinct services and estimated their resource profile. Then we planned the VM layouts and estimated the required computation power to host all of them. Then we bought the servers.

We started with three powerful hosting servers but soon saw that there is a group of “alpha VMs” with elevated requirements on availability and bought a fourth hosting server with emphasis on redundancy. If some seldom used backoffice service goes down, that’s one thing. The most important services of our company should not go down because of a harddisk failure or such.

Four nearly identical hosting servers to run 15+ VMs on required a repeatable process to set things up. This is where the first tip comes into play:

  • Document everything. Document all the details. Have your Wiki ready and write a step-by-step tutorial for every task you perform. It’s really tedious and probably a bit cumbersome at first, but it will pay off sooner and better than you’d imagine.

We started the migration process with the least important services to get a feeling for the required steps. It turned out later that these services were also the most time-consuming ones. The most essential and seemingly complex services took the least time. We essentially experienced the Pareto effect, but in reverse: We started with the lowest benefit for the highest cost. But we can give two tips from this experience:

  • Go the extra mile. Just forget about the Pareto effect and migrate all services. It’s so much more fun to have a clean IT landscape map than one where most things are tidy but there’s an area marked “here be dragons”.
  • Migration effort and service importance aren’t linked. Our most important service was migrated in about half an hour. Our least important service needed nearly three days. It’s all about the system architecture of the service and if it values self-containment.

The migration took place over the course of a few months with frequent address changes of our tools and an awful lot of communication for cutoff dates. If you need to migrate a service, be very open about the process and make sure that the old service address won’t work after the switch. I cannot count the number of e-mails I wrote with the subject prefix “IMPORTANT!”. But the transition went smoothly and without problems, so we probably added some extra caution that might not have been necessary.

After the migration

When we had migrated our last service into its own VM, there were a lot of old servers without any purpose anymore. We switched them off and got rid of them. Now we had nearly two dozen new servers to care for. One insight we had right after the start of our journey is that virtualized servers require the same amount of administration as physical ones. Just using our old approaches for the new IT landscape wouldn’t cut it. So we invested heavily in automation and scripted everything. Want to set up a new CI build slave? Just add its address into Ansible’s inventory and run the script (“playbook”). All servers need security updates? Just one command and a little wait.

Learning to automate the administrative tasks in the right way had a steep learning curve, but it’s the only feasible way. We benefit heavily from the simple fact that we forced ourselves to do it by making it impossible to handle the tasks manually. It’s a “burned bridges” approach, but upon reaching the goal, it really pays off. So another tip:

  • Automate everything. Even if you think you’ll perform this task just a few times – that’s exactly the scenario to automate it to never have to bother with the details again. Automation is key if you want to scale your IT landscape to reasonable sizes.

Reaping the profit

We’ve done the migration and have a fully virtualized setup now. This would not be very beneficial in itself, but opens the door for another level of capabilities we simply couldn’t leverage before. Let me just describe two of them:

  • Rethink your backup strategy. With virtual machines, you can now back up your services on an appliance level. If you wanted to perform this with a real server, you would need to buy the exact same hardware, make exact copies of the harddisks and store this “clone machine” somewhere safe. Creating an appliance-level backup means to stop the VM, export it and restart it. You’ll have some downtime, but everything else is just a (big) file (see the sketch after this list).
  • Rethink your service maintenance strategy. We often performed test upgrades to newer versions of our important services on test machines. If the upgrade went well, we would perform it again on the live server and hope for the best. With virtual machines and appliance backups, you can try the upgrade on an exact copy of the live server over and over again. And if you are happy with the result, you just swap your copy with the live server and everything’s fine. No need for duplicated procedures, you always work with the real deal – well, an indistinguishable copy of it.
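Here is the sketch mentioned above: a simplified appliance-level backup for VirtualBox, scripted in Python around the standard VBoxManage commands. The VM name and backup path are examples, and a real script would wait for the machine to actually power off before exporting:

import subprocess
from datetime import date

def backup_vm(vm_name: str, backup_dir: str) -> str:
    """Stop a VirtualBox VM, export it as an appliance and start it again."""
    target = f"{backup_dir}/{vm_name}-{date.today().isoformat()}.ova"
    # ask the guest to shut down; a real script would poll the VM state
    # until it reports "poweroff" before exporting
    subprocess.run(["VBoxManage", "controlvm", vm_name, "acpipowerbutton"], check=True)
    subprocess.run(["VBoxManage", "export", vm_name, "--output", target], check=True)
    subprocess.run(["VBoxManage", "startvm", vm_name, "--type", "headless"], check=True)
    return target

if __name__ == "__main__":
    print(backup_vm("wiki-vm", "/srv/backups"))  # example VM name and backup path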

Conclusion

We’ve migrated our IT landscape from an evolutionary grown to a planned, virtualized state in just about a year. We’ve invested weeks of work in it, just to have the same services available as before. From a naive viewpoint, nothing much has changed. So – was it worth it?

The answer is short and clear: Absolutely yes. Even in the short time after the migration, the whole setup performs smoother and more in a planned way than just by chance. The layout can be communicated clearer and on different levels. And every virtual machine has its own use case, to the point and dedicated. We now have an IT landscape that obeys our rules and responds to our needs, whereas before we often needed to make hard compromises.

The positive effects of documentation and automation alone are worth the journey, even if they are mere side effects of the main goal. +1, would migrate again.

Automatic deployment of (Grails) applications

What was your most embarrassing moment in your career as a software engineer? Mine was when I deployed an application to production and it didn’t even start. Stop using manual deployment and learn how to automate your (Grails) deployment.

Early in my career, deploying an application usually involved a fair number of manual steps: logging in to a remote server via ssh and executing various commands. After a while, repetitive steps were bundled into shell scripts. But mistakes happened. That’s normal. The solution is to automate as much as we can. So here are the steps to automatic deployment happiness.

Build

One of the oldest requirements for software development mentioned in The Joel Test is that you can build your app in one step. With Grails that’s easy: just create a build file (we use Apache Ant here, but others will do) in which you call grails clean, grails test and then grails war:

<project name="my_project" default="test" basedir=".">
  <property name="grails" value="${grails.home}/bin/grails"/>
  
  <target name="-call-grails">
    <chmod file="${grails}" perm="u+x"/>
    <exec dir="${basedir}" executable="${grails}" failonerror="true">
      <arg value="${grails.task}"/><arg value="${grails.file.path}"/>
      <env key="GRAILS_HOME" value="${grails.home}"/>
    </exec>
  </target>
  
  <target name="-call-grails-without-filepath">
    <chmod file="${grails}" perm="u+x"/>
    <exec dir="${basedir}" executable="${grails}" failonerror="true">
      <arg value="${grails.task}"/><env key="GRAILS_HOME" value="${grails.home}"/>
    </exec>
  </target>

  <target name="clean" description="--> Cleans a Grails application">
    <antcall target="-call-grails-without-filepath">
      <param name="grails.task" value="clean"/>
    </antcall>
  </target>
  
  <target name="test" description="--> Run a Grails applications tests">
    <chmod file="${grails}" perm="u+x"/>
    <exec dir="${basedir}" executable="${grails}" failonerror="true">
      <arg value="test-app"/>
      <arg value="-echoOut"/>
      <arg value="-echoErr"/>
      <arg value="unit:"/>
      <arg value="integration:"/>
      <env key="GRAILS_HOME" value="${grails.home}"/>
    </exec>
  </target>

  <target name="war" description="--> Creates a WAR of a Grails application">
    <property name="build.for" value="production"/>
    <property name="build.war" value="${artifact.name}"/>
    <chmod file="${grails}" perm="u+x"/>
    <exec dir="${basedir}" executable="${grails}" failonerror="true">
      <arg value="-Dgrails.env=${build.for}"/><arg value="war"/><arg value="${target.directory}/${build.war}"/>
      <env key="GRAILS_HOME" value="${grails.home}"/>
    </exec>
  </target>
  
</project>

Here we call Grails via its shell script, but you can also use the Grails Ant task and generate a starting build file with

grails integrate-with --ant

and modify it accordingly.

Note that we specify the environment for building the war because we want to build two wars: one for production and one for our functional tests. The environment for the functional tests mimics the deployment environment as closely as possible, but in practice there are small differences, like having no database cluster or no SMTP server.
Now we can put all this into our continuous integration tool Jenkins, and every time a checkin is made, our Grails application is built.

Test

Unit and integration tests are already run when building and packaging. But we also have functional tests which deploy to a local Tomcat and test against it. Here we fetch the test war of the last successful build from our CI:

<target name="functional-test" description="--> Run functional tests">
  <mkdir dir="${target.base.directory}"/>
  <antcall target="-fetch-file">
    <param name="fetch.from" value="${jenkins.base.url}/job/${jenkins.job.name}/lastSuccessfulBuild/artifact/_artifacts/${test.artifact.name}"/>
    <param name="fetch.to" value="${target.base.directory}/${test.artifact.name}"/>
  </antcall>
  <antcall target="-run-tomcat">
    <param name="tomcat.command.option" value="stop"/>
  </antcall>
  <copy file="${target.base.directory}/${test.artifact.name}" tofile="${tomcat.webapp.dir}/${artifact.name}"/>
  <antcall target="-run-tomcat">
    <param name="tomcat.command.option" value="start"/>
  </antcall>
  <chmod file="${grails}" perm="u+x"/>
  <exec dir="${basedir}" executable="${grails}" failonerror="true">
    <arg value="-Dselenium.url=http://localhost:8080/${product.name}/"/>
    <arg value="test-app"/>
    <arg value="-functional"/>
    <arg value="-baseUrl=http://localhost:8080/${product.name}/"/>
    <env key="GRAILS_HOME" value="${grails.home}"/>
  </exec>
</target>

Stopping and starting Tomcat and deploying our application war in between fixes the perm gen space errors which are thrown after a few hot deployments. The baseUrl and selenium.url parameters tell the functional test plugin to run against an externally running Tomcat. When you omit them, the plugin starts Tomcat and the Grails application itself in its own process.

Release

Now all tests passed and you are ready to deploy. So you fetch the last build … but wait! What happens if you have to redeploy and in between new builds happened in the CI? To prevent this we introduce a step before deployment: a release. This step just copies the artifacts from the last build and gives them the correct version. It also fetches the list of issues fixed in this version from our issue tracker (Jira) as a PDF. This list can be sent to the customer after a successful deployment.
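Such a release step needs very little machinery. The following Python sketch is illustrative rather than our actual CI job: it copies the artifact of the last successful build into a release directory under its final version number (the paths and the version are example values); fetching the fixed-issue list from Jira would be an additional call against its REST API and is left out here:

import shutil
from pathlib import Path

def release(last_build_war: str, release_dir: str, version: str) -> Path:
    """Copy the tested artifact and freeze it under its release version."""
    target = Path(release_dir) / f"my_project-{version}.war"
    target.parent.mkdir(parents=True, exist_ok=True)
    shutil.copy2(last_build_war, target)
    return target

if __name__ == "__main__":
    # example invocation; in our setup this runs as a separate CI job
    print(release("builds/last_successful/my_project.war", "releases", "1.4.2"))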

Deploy

After releasing we can now deploy. This means fetching the war from the release job in our CI server and copying it to the target server. Then the procedure is similar to the functional test one with some slight but important differences. First we make a backup of the old war in case anything goes wrong and we have to roll back. Second we also copy the context.xml file which Tomcat needs for the JNDI configuration. Note that we don’t need to copy over local data files like PDF reports or search indexes which were produced by our application. These lie outside our web application root.

<target name="deploy">
  <antcall target="-fetch-artifacts"/>

  <scp file="${production.war}" todir="${target.server.username}@${target.server}:${target.server.dir}" trust="true"/>
  <scp file="${target.server}/context.xml" todir="${target.server.username}@${target.server}:${target.server.dir}/${production.config}" trust="true"/>

  <antcall target="-run-tomcat-remotely"><param name="tomcat.command.option" value="stop"/></antcall>

  <antcall target="-copy-file-remotely">
    <param name="remote.file" value="${tomcat.webapps.dir}/${production.war}"/>
    <param name="remote.tofile" value="${tomcat.webapps.dir}/${production.war}.bak"/>
  </antcall>
  <antcall target="-copy-file-remotely">
    <param name="remote.file" value="${target.server.dir}/${production.war}"/>
    <param name="remote.tofile" value="${tomcat.webapps.dir}/${production.war}"/>
  </antcall>
  <antcall target="-copy-file-remotely">
    <param name="remote.file" value="${target.server.dir}/${production.config}"/>
    <param name="remote.tofile" value="${tomcat.conf.dir}/Catalina/localhost/${production.config}"/>
  </antcall>

  <antcall target="-run-tomcat-remotely"><param name="tomcat.command.option" value="start"/></antcall>
</target>

Different Environments: Staging and Production

If you look closely at the deployment script you notice that it uses the context.xml file from a directory named after the target server. In practice you have multiple deployment targets, not just one. At the very least you have what we call a staging server. This server is used for testing the deployment and the deployed application before interrupting or corrupting the production system. It can even be used to publish a pre-release version for the customer to try. We use a separate job in our CI server for this. We separate the configurations needed for the different environments into directories named after the target server. What you shouldn’t do is include all those configurations in your development configuration. You don’t want to corrupt a production application when using the staging one, when your tests run or even while you are developing. So keep the configurations needed for each deployment environment separate from your development setup and from each other.

Celebrate

Now you can deploy over and over again with just one click. This is something to celebrate. No more headaches, no more bitten fingernails. But nevertheless you should take care when you access a production system, even if it happens automatically. Something you didn’t foresee in your process could go wrong or you could make a mistake when you try out the application via the browser. Since we need to be aware of this responsibility, everybody who interacts with a production system has to wear our cowboy hats. This is a conscious step to remind oneself to take care, and it also reminds everybody else that you shouldn’t disturb someone interacting with a production system. So don’t mess with the cowboy!