Setting Grails session timeout in production

Grails 3 was a great update to the framework and kept it up-to-date with modern requirement in web development. Modularization, profiles, revamped build system and configuration were all great changes that made working with grails more productive and fun again.
I quite like the choice of YAML for the configuration settings because you can easily describe sections and hierarchies without much syntactic noise.

Unfortunately, there are some caveats. One of them went live and caused a (minor) irritation for our customer:

The session timeout was back to the 30 minutes default and not prolongued to the one hour we all agreed upon some years (!) ago.

Investigating the cause

Our configuration in application.yml was correctly set to the desired one hour timeout and in development everything was working as expected. But the thing is that the setting server.session.timeout is only applied to the embedded tomcat. If your application is deployed to a standalone servlet container this setting is ignored. Unfortunately it is far from obvious which settings in application.yml are used in what situation.

In the case of a standalone servlet container you would just edit your applications web.xml and the container would use the setting there. While this would work, it is not very nice because you have two locations for one setting. In software development we call that duplication. What makes things worse is, that there is no web.xml in our case! So what now?

The solution

We have two problems here

  1. Providing the functionality our customer desires
  2. Removing the code duplication so that development and production work the same way

Our solution is to apply the setting from application.yml to the HTTP-Session of the request using an interceptor:

class SessionInterceptor {
    int order = -1000

    SessionInterceptor() {
        matchAll()
    }

    boolean before() {
        int sessionTimeout = grailsApplication.config.getProperty('server.session.timeout') as int
        log.info("Configured session timeout is: ${sessionTimeout}")
        request.session?.setMaxInactiveInterval(sessionTimeout)
        true
    }
}

That way we use a single source of truth, namely the configuration in application.yml, both in development and production.

 

Ignoring YAGNI – 12 years later

Fourteen years ago, we started to build a distributed system to gather environmental data in an automated 24/7 fashion. Our development process was agile and made heavy use of short iterations (at least that was what they were then, today they are normal-sized). So the system grew with many small new features and improvements, giving the customer immediate business value.

One part of the system was the task scheduler. Because the system had to run 24/7 and be mostly independent of human interaction, the task scheduler’s job was to launch different measurement processes at the right time. We had done extensive domain crunching and figured out that all tasks follow a rigid time regime like “start every 10 minutes” or “start every hour”, regardless of the processes’ runtime. This made the scheduler rather easy to develop. You should keep it simple, after all.

But another result of the domain crunching bothered us: The schedule of all tasks originated from the previous software system, built 30 years ago and definitely unfit for the modern software world. The schedules weren’t really rooted in the domain, they all had technical explanations like “the recording of the values is done sequentially and takes up to 8 minutes, we can’t record them more often than that”. For our project, the measurement hardware was changed, so our recording took a couple of milliseconds. We could store and display the values continuously, if the need arises.

So we discussed the required simpleness or complexity of the task scheduler with the customer and they seemed pleased with all the new possibilities. But they decided that the current schedules were sufficient and didn’t need to be changed. We could go ahead and build our simple task scheduler.

And this is when we decided to abandon KISS and make the task scheduler more powerful than needed. “But you ain’t going to need it!” was the enemy. Because we knew that the customer will inevitably come around and make use of their new possibilities. We knew that if we build the system with more complexity, we would be the heroes in a future time, wearing a smug smile and telling the customer: “We’ve already built this, you can use it right away”. Oh how glorious this prospect of the future shone! Just a few more thoughts going into the code and we’re set for a bright future.

Let me tell you a few details about the “few more thoughts” with the example of an “every hour” task schedule. Instead of hard-coding the schedule, we added a configuration file with a cron-like expression for the schedule. You could now leverage the power of cron expressions to design your schedule as you see fit. If you wanted to change the schedule from “every hour” to “every odd minute and when the pale moon rises”, you could do so. The task scheduler had to interpret the configuration file and make sure that tasks don’t pile up: If you schedule a task to run “every minute”, but it takes two minutes to process, you’ve essentially built a time-bomb for your system load. This must not be feasible.

But it doesn’t stop there. A lot of functionality, most of which wasn’t even present or outlined at the time of our decision, relies implicitly on that schedule. Two examples: There are manual operations that must not be performed during the execution of the task. The system goes into a “protected state” around the task execution. It disables these operations a few minutes before the scheduled execution and even some time afterwards. If you had a fixed schedule of “every hour”, you could even hard-code the protected timespan. With a possible dynamic schedule, you have to calculate your timespan based on the current schedule and warn your operator if it isn’t possible anymore to find a time slot to even perform the manual operation.
The second example is a functionality that supervises the completeness of the recorded data. The problem is: This functionality is on another computer (it’s a distributed system, remember?) that doesn’t know about the configuration files. To be able to scan the data archive and say “everything that should be there, is there”, the second computer needs to know about all the schedules of the first computers (there are many of them, recording their data on their own schedules and transferring it to the second computer). And if a schedule changes, the second computer needs to take the change into account and scan the data archive for two areas: one area with the old schedule and one area with the new schedule. Otherwise, there would be false alarms.

You can probably see that the one decision to make the task scheduler a little more complex and configurable as required had quite some impact on the complexity of other parts of the system. But this investment will be worth it as soon as the customer changes the schedule! The whole system is programmed, tested and documented to facilitate schedule changes. We are ready!

It’s been over twelve years since we wrote the first line of code for the more complex implementation (I’ve checked the source control logs). The customer hasn’t changed a single bit of the schedule yet. There are over twenty “first computers” and they all still run the same task schedule as initially planned. Our decision did nothing but to add accidental complexity to the system. It probably introduced some bugs along the way, too. It certainly increased our required level of awareness (“hurdle of understanding”) during the development of features that are somewhat coupled with the task schedule.

In short: It’s been a disaster. The smug smile we thought we’d wear has been replaced by a deep frown. Who wrote all that mess? And why? It wasn’t the customer, it was us. We will never be going to need it.

Oracle database date and time literals

For one of our projects I work with time series data stored in an Oracle database, so I write a lot of SQL queries involving dates and timestamps. Most Oracle SQL queries I came across online use the TO_DATE function to specify date and time literals within queries:

SELECT * FROM events WHERE created >= TO_DATE('2012-04-23 16:30:00', 'YYYY-MM-DD HH24:MI:SS')

So this is what I started using as well. Of course, this is very flexible, because you can specify exactly the format you want to use. On the other hand it is very verbose.

From other database systems like PostgreSQL databases I was used to specify dates in queries as simple string literals:

SELECT * FROM events WHERE created BETWEEN '2018-01-01' AND '2018-01-31'

This doesn’t work in Oracle, but I was happy find out that Oracle supports short date and timestamp literals in another form:

SELECT * FROM events WHERE created BETWEEN DATE'2018-01-01' AND DATE'2018-01-31'

SELECT * FROM events WHERE created > TIMESTAMP'2012-04-23 16:30:00'

These date/time literals where introduced in Oracle 9i, which isn’t extremely recent. However, since most online tutorials and examples seem to use the TO_DATE function, you may be happy to find out about this little convenience just like me.

Using OpenVPN in an automated deployment

In my previous post I showed how to use Ansible in a Docker container to deploy a software release to some remote server and how to trigger the deployment using a Jenkins job.

But what can we do if the target server is not directly reachable over SSH?

Many organizations use a virtual private network (VPN) infrastructure to provide external parties access to their internal services. If there is such an infrastructure in place we can extend our deployment process to use OpenVPN and still work in an unattended fashion like before.

Adding OpenVPN to our container

To be able to use OpenVPN non-interactively in our container we need to add several elements:

  1. Install OpenVPN:
    RUN DEBIAN_FRONTEND=noninteractive apt-get update && apt-get -y install openvpn
  2. Create a file with the credentials using environment variables:
    CMD echo -e "${VPN_USER}\n${VPN_PASSWORD}" > /tmp/openvpn.access
  3. Connect with OpenVPN using an appropriate VPN configuration and wait for OpenVPN to establish the connection:
    openvpn --config /deployment/our-client-config.ovpn --auth-user-pass /tmp/openvpn.access --daemon && sleep 10

Putting it together our extended Dockerfile may look like this:

FROM ubuntu:18.04
RUN DEBIAN_FRONTEND=noninteractive apt-get update
RUN DEBIAN_FRONTEND=noninteractive apt-get -y dist-upgrade
RUN DEBIAN_FRONTEND=noninteractive apt-get -y install software-properties-common
RUN DEBIAN_FRONTEND=noninteractive apt-get update && apt-get -y install ansible ssh
RUN DEBIAN_FRONTEND=noninteractive apt-get update && apt-get -y install openvpn

SHELL ["/bin/bash", "-c"]
# A place for our vault password file
RUN mkdir /ansible/

# Setup work dir
WORKDIR /deployment

COPY deployment/${TARGET_HOST} .

# Deploy the proposal submission system using openvpn and ansible
CMD echo -e "${VAULT_PASSWORD}" > /ansible/vault.access && \
echo -e "${VPN_USER}\n${VPN_PASSWORD}" > /tmp/openvpn.access && \
ansible-vault decrypt --vault-password-file=/ansible/vault.access ${TARGET_HOST}/credentials/deployer && \
openvpn --config /deployment/our-client-config.ovpn --auth-user-pass /tmp/openvpn.access --daemon && \
sleep 10 && \
ansible-playbook -i ${TARGET_HOST}, -u root --key-file ${TARGET_HOST}/credentials/deployer ansible/deploy.yml && \
ansible-vault encrypt --vault-password-file=/ansible/vault.access ${TARGET_HOST}/credentials/deployer && \
rm -rf /ansible && \
rm /tmp/openvpn.access

As you can see the CMD got quite long and messy by now. In production we put the whole process of executing the Ansible and OpenVPN commands into a shell script as part of our deployment infrastructure. That way we separate the preparations of the environment from the deployment steps themselves. The CMD looks a bit more friendly then:

CMD echo -e "${VPN_USER}\n${VPN_PASSWORD}" > /vpn/openvpn.access && \
echo -e "${VAULT_PASSWORD}" > /ansible/vault.access && \
chmod +x deploy.sh && \
./deploy.sh

As another aside you may have to mess around with nameserver configuration depending on the OpenVPN configuration and other infrastructure details. I left that out because it seems specific to the setup with our customer.

Fortunately, there is nothing to do on the ansible side as the whole VPN stuff should be completely transparent for the tools using the network connection to do their job. However we need some additional settings in our Jenkins job.

Adjusting the Jenkins Job

The Jenkins job needs the VPN credentials added and some additional parameters to docker run for the network tunneling to work in the container. The credentials are simply another injected password we may call JOB_VPN_PASSWORD and the full script may now look like follows:

docker build -t app-deploy -f deployment/docker/Dockerfile .
docker run --rm \
--cap-add=NET_ADMIN \
--device=/dev/net/tun \
-e VPN_USER=our-client-vpn-user \
-e VPN_PASSWORD=${JOB_VPN_PASSWORD} \-e VAULT_PASSWORD=${JOB_VAULT_PASSWORD} \
-e TARGET_HOST=${JOB_TARGET_HOST} \
-e ANSIBLE_HOST_KEY_CHECKING=False \
-v `pwd`/artifact:/artifact \
app-deploy

DOCKER_RUN_RESULT=`echo $?`
exit $DOCKER_RUN_RESULT

Conclusion

Adding VPN-support to your automated deployment is not that hard but there are some details to be aware of:

  • OpenVPN needs the credentials in a file – similar to Ansible – to be able to run non-interactively
  • OpenVPN either stays in the foreground or daemonizes right away if you tell it to, not only after the connection was successful. So you have to wait a sufficiently long time before proceeding with your deployment process
  • For OpenVPN to work inside a container docker needs some additional flags to allow proper tunneling of network connections
  • DNS-resolution can be tricky depending on the actual infrastructure. If you experience problems either tune your setup by adjusting name resolution in the container or access the target machines using IPs…

After ironing out the gory details depicted above we have a secure and convenient deployment process that saves a lot of time and nerves in the long run because you can update the deployment at the press of a button.

Domain-aligned bugs

Frank C. Müller [CC BY-SA 4.0 (https://creativecommons.org/licenses/by-sa/4.0) ]
Imagine that you are an user of a typical enterprise software that handles commercial products and their prices. There are different prices in the software that are somehow related to each other. There is the purchase price that indicates your cost if you buy the product. There is the retail price that gets listed in your price lists and is paid by your customers, should they buy the product. You probably already figured out that the retail price should never be lower than the purchase price, because that would mean you lose money with every successful sale.

Let’s say that the enterprise software not only handles products, but also parts. Several parts combined, with some manufacturing effort, result in a product. Each part has a purchase price, the resulting product has a retail price. The retail price of the product should be higher than the sum of purchase prices of the parts. If it isn’t, you lose the costs of the manufacturing effort and some extra money with every successful sale.

If for any reason you cannot clearly estimate your manufacturing effort, the enterprise software has another input field for an amount of money that you can add to the sum of the parts’ costs. We call this field the “sales bonus”. So, if you sell a product made up of parts, your customer has to pay a price that consists at least of the retail prices of the parts and the sales bonus. Of course, your customer has an individual discount percentage that needs to be subtracted from the total price. Are you still following?

You are now thinking in the domain of price determination and financial mathematics. If you were the user of said enterprise software, you’d probably expect some bugs like these:

  • It is possible to enter a retail price lower than the purchase price
  • The price of products manufactured from parts isn’t calculated correctly
  • It is possible to enter a negative sales bonus
  • The total price with discount could be lower than the sum of purchase prices of the parts without a warning

All of them are bugs in the domain. All of them can be explained to a domain expert or a user with terms and concepts from the domain.

But what about the bug when you sell a product that consists of three parts, each with a retail price of 10 €, and a sales bonus of 5 €. You want to create a quote for your customer and the price shows up as 34,99999999998 €. You are a bit bewildered and try to countervail the apparent rounding error by changing the sales bonus to 5,00000000002 €. After this change you get another crazy total price and your prices in the database are different from what you entered, too. Everything seems to destabilize and deviate further and further from clear cut prices.

As a programmer, you know what happened. You know what caused this effect of numerical instability. Somebody stored monetary values in a floating point number. You know that is a bad idea and you’d never do this. But this blog post isn’t about you or what you should do or not do. It is about the user, expert in his domain, that stumbles over the bug as described and has to make some decision on how to fix it. This user cannot use any knowledge from the domain to even understand the mechanics of the bug. You, as the programmer, cannot explain this bug in terms of things the user already knows. You need to be vague (“the software doesn’t store the exact values, just approximations”) or introduce additional complexity (“we store this value by splitting it into a significand and multiply it with a factor consisting of a fixed base and an exponent. We can omit the base and just store the significand and the exponent and express a very large numerical range in just a few bits. Think about how cool that is!”).

Read the last explanation again, from the viewpoint of a salesman. We want to add some prices in the range of a few €, slap a moderate discount on top and call it a day. We don’t care about bits or exponential formulas. That is not part of our domain and it shouldn’t affect our domain or software that works in our domain. Confronting us with technical details reflects negatively on your ability to solve our problems. You seem to burden us with your problems in exchange.

As domain experts, we want only domain-aligned bugs.

Using protobuf with conan and CMake

In my last post, I showed how I got my feet wet while migrating the dependencies of my existing code-base to conan. The first major hurdle I saw coming when I started was adding something with a “special” build step, e.g. something like source-preprocessing. In my case, this was protobuf, where a special build-step converts .proto files to sources and headers.

In my previous solution, my devenv build scripts would install the protobuf converter binary to my devenv’s bin/ folder, which I then used to run my preprocessing. At first, it was not obvious how to do this with conan. It turns out that the lovely people and bincrafters made this pretty comfortable. conan_basic_setup() will add all required package paths to your CMAKE_MODULE_PATH, which you can use to include() some bundled CMake scripts that will either let you execute the protobuf-compiler via a target or run protobuf_generate to automagically handle the preprocessing. It’s probably worth noting, that this really depends on how the package is made. Conan does not really have an official way on how to handle this.

Let’s start with some sample code – Person.proto, like the sample from the protobuf website:

message Person {
  required string name = 1;
  required int32 id = 2;
  optional string email = 3;
}

And some sample code that uses it:

#include "Person.pb.h"

int main(int argn, char** argv)
{
  Person message;
  message.set_name("Hello Protobuf");
  std::cout << message.name() << std::endl;
}

Again, we’re using the bincrafters repository for our dependencies in a conanfile.txt:

[requires]
protobuf/3.6.1@bincrafters/stable
protoc_installer/3.6.1@bincrafters/stable

[options]

[generators]
cmake

Now we just need to wire it all up in the CMakeLists.txt

cmake_minimum_required(VERSION 3.0)
project(ProtobufTest)

include(${CMAKE_BINARY_DIR}/conanbuildinfo.cmake)
conan_basic_setup(TARGETS KEEP_RPATHS)

# This loads the cmake/protoc-config.cmake file
# from the protoc_installer dependency
include(cmake/protoc-config)

set(TARGET_NAME ProtobufSample)

# Just add the .proto files to the target
add_executable(${TARGET_NAME}
  Person.proto
  ProtobufTest.cpp
)

# Let this function to the magic
protobuf_generate(TARGET ${TARGET_NAME})

# Need to use protobuf, of course
target_link_libraries(${TARGET_NAME}
  PUBLIC CONAN_PKG::protobuf
)

# Make sure we can find the generated headers
target_include_directories(${TARGET_NAME}
  PUBLIC ${CMAKE_CURRENT_BINARY_DIR}
)

There you have it! Pretty neat, and all without a brittle find_package call.

Database table naming conventions

Naming things well is an important part of writing maintainable software, and renaming things once their names have become established in a code base can be tedious work. This is true as well for the names of an application database schema, where a schema change usually requires a database migration script. That’s why you should take some time beforehand to set up a naming convention.

Many applications use object-relational mappers (ORM), which have a default naming convention to map class and property names to table and column names. But if you’re not using an ORM, you should set up conventions as well. Here are some tips:

  • Be consistent. For example, choose either only plural or singular for table names, e.g. “books” or “book”, and stick to it. Many sources recommend singular for table names.
  • On abbreviations: Some database systems like Oracle have a character limit for names. The limit for Oracle database table names is 30 characters, which means abbreviations are almost inevitable. If you introduce abbreviations be consistent and document them in a glossary, for example in the project Wiki.
  • Separate word boundaries with underscores and form a hierarchy like “namespace_entity_subentity”, e.g. “blog_post_author”. This way you can sort the tables by name and have them grouped by topic.
  • Avoid unnecessary type markers. A table is still a table if you don’t prefix it with “tbl_”, and adding a “_s” postfix to a column of type string doesn’t really add useful information that couldn’t be seen in the schema browser of any database tool. This is similar to Hungarian notation, which has fallen out of use in today’s software development. If you still want to mark special database objects, for example materialised views, then you should prefer a postfix, e.g. “_mv”, over a prefix, because a prefix would mess up the lexicographic hierarchy established by the previous tip.

And the final advice: Document your conventions so that other team members are aware of them and make them mandatory.

Ansible for deployment with docker

We are using Jenkins not only for our continuous integration needs but also for running deployments at the push of a button. Back in the dark times™ this often meant using Apache Ant in combination with JSch to copy the projects artifacts to some target machine and execute some remote commands over ssh.
After gathering some experience with Ansible and docker we took a new shot at the task and ended up with one-shot docker containers executing some ansible scripts. The general process stayed the same but the new solution is more robust in general and easier to maintain.
In addition to the project- and deployment-process-specific simplifications we can now use any jenkins slave with a docker installation and network access. No need for an ant installation and a java runtime anymore.
However there are some quirks you have to overcome to get the whole thing running. Let us see how we can accomplish non-interactive (read: fully automated) deployments using the tools mentioned above.

The general process

The deployment job in Jenkins has to perform the following steps:

  1. Fetch the artifacts to deploy from somewhere, e.g. a release job
  2. Build the docker image provisioned with ansible
  3. Run a container of the image providing credentials and artifacts through the environment and/or volumes

The first step can easily be implemented using the Copy Artifact Plugin.

The Dockerfile

You can use mostly any linux base image for your deployment needs as long as the distribution comes with ansible. I chose Ubunt 18.04 LTS because it is sufficiently up-to-date and we are quite familiar with it. In addition to ssh and ansible we only need the command for running the playbook. In our case the the playbook and ssh-credentials reside in the deployment folder of our projects, of course encrypted in an ansible vault. To be able to connect to the target machine we need the vault password to decrypt the ssh password or private key. Therefore we inject the vault password as an environment variable when running the container.

A simple Dockerfile may look as follows:

FROM ubuntu:18.04
RUN DEBIAN_FRONTEND=noninteractive apt-get update
RUN DEBIAN_FRONTEND=noninteractive apt-get -y dist-upgrade
RUN DEBIAN_FRONTEND=noninteractive apt-get -y install software-properties-common
RUN DEBIAN_FRONTEND=noninteractive apt-get update && apt-get -y install ansible ssh

SHELL ["/bin/bash", "-c"]
# A place for our vault password file
RUN mkdir /ansible/

# Setup work dir
WORKDIR /deployment

COPY deployment/${TARGET_HOST} .

# Deploy the proposal submission system using openvpn and ansible
CMD echo -e "${VAULT_PASSWORD}" > /ansible/vault.access && \
ansible-vault decrypt --vault-password-file=/ansible/vault.access ${TARGET_HOST}/credentials/deployer && \
ansible-playbook -i ${TARGET_HOST}, -u root --key-file ${TARGET_HOST}/credentials/deployer ansible/deploy.yml && \
ansible-vault encrypt --vault-password-file=/ansible/vault.access ${TARGET_HOST}/credentials/deployer && \
rm -rf /ansible

Note that we create a temporary vault password file using the ${VAULT_PASSWORD} environment variable for ansible vault decryption. A second noteworthy detail is the ad-hoc inventory using the ${TARGET_HOST} environment variable terminated with comma. Using this variable we also reference the correct credential files and other host-specific data files. That way the image will not contain any plain text credentials.

The Ansible playbook

The playbook can be quite simple depending on you needs. In our case we deploy a WAR-archive to a tomcat container and make a backup copy of the old application first:

---
- hosts: all
  vars:
    artifact_name: app.war
    webapp_directory: /var/lib/tomcat8/webapps
  become: true

  tasks:
  - name: copy files to target machine
    copy: src={{ item }} dest=/tmp
    with_fileglob:
    - "/artifact/{{ artifact_name }}"
    - "../{{ inventory_hostname }}/context.xml"

  - name: stop our servlet container
    systemd:
      name: tomcat8
      state: stopped

  - name: make backup of old artifact
    command: mv "{{ webapp_directory }}/{{ artifact_name }}" "{{ webapp_directory }}/{{ artifact_name }}.bak"

  - name: delete exploded webapp
    file:
      path: "{{ webapp_directory }}/app"
      state: absent

  - name: mv artifact to webapp directory
    copy: src="/tmp/{{ artifact_name }}" dest="{{ webapp_directory }}" remote_src=yes
    notify: restart app-server

  - name: mv context file to container context
    copy: src=/tmp/context.xml dest=/etc/tomcat8/Catalina/localhost/app.xml remote_src=yes
    notify: restart app-server

  handlers:
  - name: restart app-server
    systemd:
      name: tomcat8
      state: started

The declarative syntax and the use of handlers make the process quite straight-forward. Of note here is the - hosts: all line that is used with our command line ad-hoc inventory. That way we do not need an inventory file in our project repository or somewhere else. The host can be provided by the deployment job. That brings us to the Jenkins-Job that puts everything together.

The Jenkins job

As you probably already noticed in the parts above there are several inputs required by the container and the ansible script:

  1. A volume with the artifacts to deploy
  2. A volume with host specific credentials and data files
  3. Vault credentials
  4. The target host(s)

We inject passwords using the EnvInject Plugin which stored them encrypted and masks them in all the logs. I like to prefix such environmental settings in Jenkins with JOB_ do denote their origin. After the copying build step which puts the artifacts into the artifact directory we can use a small shell script to build the image and run the container with all the needed stuff wired up. The script may look like the following:

docker build -t app-deploy -f deployment/docker/Dockerfile .
docker run --rm \
-e VAULT_PASSWORD=${JOB_VAULT_PASSWORD} \
-e TARGET_HOST=${JOB_TARGET_HOST} \
-e ANSIBLE_HOST_KEY_CHECKING=False \
-v `pwd`/artifact:/artifact \
app-deploy

DOCKER_RUN_RESULT=`echo $?`
exit $DOCKER_RUN_RESULT

We run the container with the --rm flag to delete it right after execution leaving nothing behind. We also disable host key checking for ansible to keep the build non-interactive. The last two lines let the jenkins job report the result to the container execution.

Wrapping it up

Essentially we have a deployment consisting of three parts. A docker image definition of the required environment, an ansible playbook performing the real work and a jenkins job with a small shell script wiring it all together so that it can be executed at the push of a button. We do not store any clear text credentials persistently at neither location, so that we are reasonably secure for running the deployment.

Book review: A Philosophy of Software Design

This blog entry is structured in two main parts: The prologue sets the tone, but may be irritating because it doesn’t talk about the book itself. If you get irritated or know the topic well enough to skip it, you can jump to the second part when I talk about the book. It is indicated by a TL;DR summary of the prologue.

Prologue

Imagine a world where the last 25 years of computer game development didn’t happen. A world where we get the power of 5 GHz octacore computers and 128 GB of RAM, but nobody thought about 3D graphics or interaction design. The graphics of computer games is so rudimentary, it consists of ASCII art and color. In this world, two brothers develop a game that simulates a whole fantasy world with all details, in three dimensions. The game is an instant blockbuster hit and spawns multiple cinematic adaptions.
This world never happened. The only thing that seems to be from this world is the game itself: Dwarf Fortress. An ASCII art sandbox simulation of a bunch of dwarves that dig into the (three-dimensional) mountains and inevitably discover the fun in magma. Dwarf Fortress is a game told by stories, not graphics. It burdens the player to micro-manage a whole settlement down to the individual sock – Yes, no plural. There are left socks and right socks and they are different entities with a different story. Dwarves can literally go mad because they miss their favorite left sock and you didn’t notice in time. And you have to control all aspects of the settlement not by direct order, but by giving hints and suggestions through an user interface that is a game of riddles on its own.
Dwarf Fortress is an impossible game. It seems so out of time and touch with current gaming reality that you can only shake your head on first contact. But, it is incredibly deep and well-designed and, most surprising, provides the kind player with endless fun. This game actually works!

TL;DR: Just because something seems odd at first contact doesn’t mean it cannot work. Go and play Dwarf Fortress!

The book

John Ousterhout is a professor teaching software design at the Stanford university and writes software for decades now. In 1988, he invented the Tcl programming language. He got a lot of awards, including the Grace Murray Hopper Award. You can say that he knows what he’s doing and what he talks about. In 2018, he wrote a book with the title “A Philosophy of Software Design”. This book is a peculiar gem besides titles with a similar topic.

Imagine a world where the last 20 years of software development books didn’t happen. One man creates software for his whole life and writes down his thoughts and insights, structured in tactical advices, strategic approaches and an overarching philosophy. He has to invent some new vocabulary to express his ideas. He talks about how he performs programming – and it is nothing like today’s mainstream. In fact, it is sometimes the exact opposite of today’s best practices. But, it is incredibly insightful and well-structured and, most surprising, provides the kind developer with endless fun. Okay, I admit, the latter part of the previous sentence was speculative.

This is a book that seems a bit out of touch with today’s mainstream doctrine – and that’s a good thing. The book begins by defining some vocabulary, like the notion of complexity or the concept of deepness. That is rare by itself, most books just use established words to deliver a message. If you think about the definitions, they will probably enrich your perception of software design. They enriched mine, and I talk about software design to students for nearly twenty years now.

The most obvious thing that is different from other books with similar content: Most other books talk about behaviours, best practices and advices. Then they throw a buch of prohibitions in the mix. This isn’t wrong, but it’s “just” anecdotal knowledge. It is your job as the reader to discern between things that may have worked in the past, but are outdated and things that will continue to work in the future. The real question is left unanswered: Why is it so?

“A Philosophy of Software Design” begins by answering the “why” question. If you want to build an hierarchy of book wisdom depth, this might serve as one:

  • Tactical wisdom: What should be done? Most beginner’s books work on this level. They show exactly what goes on, but go easy on the bigger questions.
  • Strategic wisdom: How should it be done? This is the level that the majority of good software design books work on. They give insights about your work ethics and principles you should abide by.
  • Philosphical wisdom: Why should it be done? The reviewed book begins on this level. It explains the aspects of software and sourcecode that work against human perception and understanding and shows ways to avoid or at least diminish those aspects.

The book doesn’t stay on the philosophical level for long and dives deep into the “how” and “what” areas later on. But it does so with the background of an established “why”. And that’s a great reminder that even if you disagree with a specific “what” (or “how”), you should think about the root cause of your disagreement, not just anecdotes.

The author and the book aren’t as out-of-touch with current software development reality as you might think. There is a whole chapter addressed to current “software trends” like agile development and unit tests. It has a total page count of six pages and doesn’t go into details. But it at least mentions the things it doesn’t talk about.

Conclusion

My biggest learning point from the book for my personal habits as a developer is to write more code comments in the way the book proposes. Yes, you’ve read that right. The book urges you to write more comments – but good ones. It talks about why you should write more comments. It gives you extensive guidelines as to how good comments are written and some examples what these comments look like. After two decades of “write more (unit) tests!”, the message of “write more comments!” is unique and noteworthy. Perhaps we can improve our tools to better support comments in the same way they improved support for tests in the last years.

Perhaps we cannot solve our problems with the sourcecode by writing more sourcecode (unit tests). Perhaps we need to rely on something different. I will give it a try.

You might want to give the book “A Philosophy of Software Design” a try. It’s worth your time and thoughts.

Migrating an existing C++ codebase to conan

This is a bit of a battle report of migrating the dependencies in my C++ projects to use the conan package manager.
In the past weeks I have started to use conan in half a dozen both work and personal projects. Here’s my experiences so far.

Before

The first real project I started with was my personal game project. The “before” setup used a mixture if techniques to handle dependencies and uses CMake to do most of the heavy lifting.
Most dependencies reside in the “devenv”, which is a separate CMake project that I use to build and bundle the dependencies in a specific installation folder. It uses ExternalProject_Add for most parts (e.g. Boost, SDL, Lua, curl and OpenSSL), add_subdirectory for a few others (pugixml and lz4) and just install(FILES...) for a few header only libs like JSON for Modern C++, Catch2 and spdlog. It should be noted that there are relatively few interdependencies between the projects in there.
Because it is more convenient to update, I keep a few dependencies that I control myself directly in the source tree, either as git externals or just copies of the source files.
I try to keep usage of system dependencies to a minimum so that the resulting binary is more portable to the average gamer who does not want to know about libraries and dependencies and such nonsense. This setup has been has been mostly painless and working for my three platforms Windows, Linux and Mac – at least as long as I did not try to change it significantly.

Baby steps

Since not all my dependencies are available on conan and small iterations are usually more successful, I decided to proceed by changing only a single dependency to conan. For this dependency, it’s a good idea to pick something that does not have many compile-time options and is more or less platform agnostic. So I opted for boost over, e.g. SDL or wxWidgets. Boost was also one of the most painful dependencies to build, if only for the insane amount of files it produces and the time it takes to copy those ten-thousands of files to the install location.

Getting started..

There are currently two popular variants of boost available through conan. The “normal” variant on conan’s main repository/remote “conan-center” and a modular version that splits boost into its component libraries on the bincrafters remote, e.g. Boost.Filesystem. The modular version is more appealing conceptually, and I also had a better time getting it to work in my first tests, so I picked that. I did a quick grep for #include <boost/ through my code for an initial guess which boost libraries I needed to get and created a corresponding conanfile.txt in my project root.

[requires]
boost_filesystem/1.69.0@bincrafters/stable
boost_math/1.69.0@bincrafters/stable
boost_random/1.69.0@bincrafters/stable
boost_property_tree/1.69.0@bincrafters/stable
boost_assign/1.69.0@bincrafters/stable
boost_heap/1.69.0@bincrafters/stable
boost_optional/1.69.0@bincrafters/stable
boost_program_options/1.69.0@bincrafters/stable
boost_iostreams/1.69.0@bincrafters/stable
boost_system/1.69.0@bincrafters/stable

[options]
boost:shared=False

[generators]
cmake

Now conan plays really nice with “single configuration generators” like the new CMake/Ninja support in VS2017 and onward. Basically, just cd into your build dir and call something like conan install -s build_type=Debug -s build_type=x86 whenever you want to update dependencies. More info can be found in the official documentation. The workflow for CLion is essentially the same.

Using it in your build

After the last command, conan will download (or build) the dependencies and generate a file with all the corresponding paths.
To use it, include it from cmake like this:

include(${CMAKE_BINARY_DIR}/conanbuildinfo.cmake)
conan_basic_setup(TARGETS KEEP_RPATHS)

It will then provide targets for all the requested boost libraries that you can link to like this:

target_link_libraries(myTarget
  PUBLIC CONAN_PKG::boost_filesystem
)

I wanted to make sure that the compiler build using the new boost files and not the old ones. Because I have a generic include into my devenv that was still going to be in my compilers include-paths for all the other dependencies, so I just renamed boost’s header include folder on disc. After my first successful compile I felt confident enough to delete them.

First problems

There was one major problem: some of my in-source dependencies had their own claim on using boost via passed CMake variables, Boost_LIBRARY_DIRS and Boost_INCLUDE_DIR. I adapted their CMakeLists.txt to allow for injecting appropriate targets instead. Not the cleanest solution, but it got my builds green again fast.

There’s a still a lot to cover on this: The other platforms had their own quirks and I migrated way more than just this first project. Also, there is still ways to go for a full migration with my game project. But more on that in my next blog post…