When everything’s an issue

Years ago, I read the novel “Manna” by Marshall Brain. It’s a science fiction story about the robotic takeover, and it features “Manna”, an (artificially) intelligent work management system that replaces human managers and runs the shop. The story starts with “Manna” and goes on to explore the implications of such a system for mankind. It’s a good read and contains a lot of thoughts about what kind of labor we want to do.

The idea that really captivated me was the company that runs itself. Don’t get me wrong: Most organizations are so big that the individual employee cannot see the big picture anymore. Those organizations seem to “run themselves” to the untrained eye, but it is still humans who manage the workload. And like all humans, they make mistakes and, perhaps very subtly, infuse their own selfish goals into the process. But an organization that knows its goals and instructs its workers (humans and machines alike) directly is an interesting thought to me.

It also is totally unrealistic with today’s technology and probably contains some risks that should be explored carefully before implementing such a system in the wild.

But what about a more down-to-earth approach that achieves the core advancement of “Manna” without most or even all of the risks? What if the organization doesn’t instruct, but makes its needs visible and relies on humans to interpret, schedule and fulfill those needs? In essence, a “Manna” system without the sensors and decision-making and certainly without the creepy snooping tendencies. Built with today’s technology, that’s called an automated work scheduler.

And that is what we’ve built at our company. We use an issue tracking system to manage and schedule our project work already. We extended its usage to manage and schedule our administrative work, too. Now, every work unit in the company is (or could be) accompanied by an issue in the issue tracker. And just like software developers don’t change code without an issue, we don’t change our company’s data or decisions without an issue that also provides a place for documentation related to the process. We’ve come to the conclusion that most of those administrative issues are recurring. So we automated their creation.

Our very early stage “Manna” system is called “issue scheduler”, a highly creative name on its own. It is a system that basically contains a lot of glorified cron expressions and just enough data to create a meaningful issue in the issue tracker, should a cron expression fire. So basically, our company creates issues for us on a fixed schedule. Let’s look at some examples:

  • We add a new article to our developer blog (you’re reading it right now) every week. This means that every week, our “issue scheduler” creates a blog issue and assigns it to the next author in line. This is done some time in advance to give the author enough time to prepare and possibly trade with other authors. Our developer blog has the “need” for one article each week, but it doesn’t require a particular topic or author. This need is made visible by the automatic blog issues and it is our duty to fulfill this need. On a side note: Maybe you’ve noticed that I wrote two blog articles in direct succession. There is definitely some issue trading going on behind the scenes right now!
  • We tend to have many plants in our office. To look at something green and living adds to our comfort. But those plants have needs, too. They probably make their needs pretty visible, but we aren’t expert plant caregivers. So we gave the “issue scheduler” some entries to inform us about the regular watering and fertilization duties for our office plants. A detailed description of the actual work exists in our company wiki and a link to it gives the caregiver of the week all the information that’s needed.
  • Every month, we are required to file a sales tax summary report. This is a need of the German government agencies that we incorporate into our company’s needs. To work on this issue, you need more information and security clearances than fit on a wiki page, but the process is documented nonetheless. So once a month, our company automatically creates an issue that says “do your taxes now!” and assigns it to our administrative employees.

These are three examples of recurring tasks that are covered by our poor man’s “Manna” system. To give you a perspective on the scale of this system for a small company like ours: we currently have about 140 distinct rules for recurring issues. Some of them fire almost every day, some of them sleep for years and wake up just in time to express a certain need of the company that would otherwise surely be forgotten or rediscovered after the fact.

This approach relieves us of the burden of remembering all the tasks and their schedules and lets us concentrate on completing them. And our system, in contrast to “Manna” in the story, isn’t judging or controlling. If you don’t think the plants need any more water, just resolve the issue with “won’t fix”. Perhaps you can explain your decision in a short comment for other humans, but our “issue scheduler” won’t notice.

This isn’t the robotic takeover, after all. It’s just automated scheduling of recurring work. And it works great.

Twelve years a bug

Recently, a friend recommended the talk “Love your bugs” by Allison Kaptur to me. It’s a good talk with a powerful message: Every bug is a chance to learn. The only prerequisite for this chance: You need to be aware of the bug. You need to see it and then understand and fix it.

What if you don’t see a bug for over a decade?

You can still learn from it – a lot. Here is a story about a bug that lingered in my code for twelve years and had an impact on the software’s results without anybody, including me, noticing.

Let’s say the software is a measurement system that records physical measurement values that cannot be reproduced easily. The samples are measured in rapid succession and then stored away. The measurement results act as a quality quantifier, as in “the sample was this good at that time”. The sample’s quality diminishes over time, even with perfect storage conditions. And just to be sure that the measurement is accurate, it isn’t performed once, but twice in quick succession: The measurement device traverses the sample in one direction, recording the values, and uses the return path for a second measurement. This results in two measurement streaks that should, in theory, be very similar.

The raw measurement values are aggregated in different ways. In the final report, the maximum value over both measurement streaks is the most prominent and most important aspect. There is nothing exciting or error-prone about finding the maximum value in a value series, even twelve years ago. But I tested the code nonetheless. The whole value aggregation code was under test and had good test coverage. All tests gave their thumbs up.

But twelve years ago, at the end of a workweek, on a Friday at 5 pm, I made a small and easy refactoring to the code. I know this in such detail because of the wonders of version control. Without version control, I probably would have learnt a lot less from this bug.

The code before the refactoring contained two nearly identical sections for the measurement streaks, making it duplicated code. There were only two differences: The first section of code counted from 0 to 99 and stored the values in the first streak’s array. The second section counted from 99 to 0, because the measurement device travels backwards over the sample, and stored the values into the second array.

My refactoring brought both pieces of code together: A new method with two parameters was introduced and called at both places. The first parameter specified a series of positions (0 to 99 or backwards), the second parameter specified the array to store into. All automated tests approved the changes. My manual testing showed no differences. The refactoring went live.

Twelve years later, while working on a new engine for the measurement device, the customer took the raw values from the journal and performed some manual calculations. The maximum value calculation was wrong. It wasn’t wrong all the time, but it wasn’t correct for all measurements either. When it went wrong, it selected the second greatest value or, more rarely, the third greatest value as the maximum value.

All the tests still insisted that everything was correct – as they had for twelve long years. There was no other change to the code that could affect the aggregation in any way.

There were two different ways that led me to the bug’s origin. The first way was to inspect the trail of commits in the area of code that performs the aggregation. My finding was that the refactoring mentioned above was the most likely culprit. The second way was to write more tests to ensure that yes, the maximum value calculation was indeed correct if given the correct values. Using the examples my customer had examined, I could prove with enough certainty that, given the input of all 200 values, the correct maximum value was returned in all cases. The aggregation therefore wasn’t given all the values!

And that led directly to the bug: The first measurement streak used the same array to store its values as the second streak. During the refactoring, I must have copied and pasted the call to the new method and forgotten to change the second parameter. Of 200 measured values, only 100 got stored permanently. The first 100 values were stored and promptly overwritten as the measurement device returned to its park position. Change the calls to store into both arrays and everything works as intended and as it did before the refactoring.
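The original code isn’t public, so here is a minimal C++ sketch of the shape of the bug – the class name, the array size and the position handling are made up, only the copy-and-paste mistake in the second call is the point:

#include <array>

// Hypothetical reconstruction of the refactoring, not the original code.
class Measurement {
public:
  void measure() {
    record(/*from*/ 0, /*to*/ 99, forwardValues);
    record(/*from*/ 99, /*to*/ 0, forwardValues); // bug: should store into backwardValues
  }

private:
  // The refactoring replaced two nearly identical loops with this single method.
  void record(int from, int to, std::array<double, 100>& target) {
    int const step = (from <= to) ? 1 : -1;
    for (int position = from, i = 0; i < 100; position += step, ++i)
      target[i] = readValueAt(position);
  }

  double readValueAt(int position) { return 0.1 * position; } // stand-in for the real device

  std::array<double, 100> forwardValues{};
  std::array<double, 100> backwardValues{};
};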

How did no test, automated or manual, catch this bug? It turns out that most automated tests were too focused to indicate a problem. All unit tests that secured the maximum value calculation used given sets of values; they didn’t care about the origin of these value sets. The integration tests that covered the whole measurement process should have raised objections to the refactoring. But they used given sets of measurement values, too. And by chance, the given maximum value was in the second measurement streak. In fact, the given set of measurement values was produced by a loop that just increased a value. The greatest value was always the last one.

The manual tests had the same problem: The simulated measurement device produced measurement values that were either fixed or random. If your whole measurement uses the same fixed values, you don’t notice when half of them go missing. And if your measurement uses random values, you’d have to pay close attention to a detail that isn’t in focus because it is supposedly unchanged. Except that one time when it was changed recently. Remember the change date? Friday afternoon, only minutes before the weekend? Not the best time to manually test a change that is a simple standard refactoring, after all.

So, what have I learnt? First: Automated tests, even with great test coverage, aren’t enough. There is so much leeway in the setup of these tests that they will have blind spots without your knowledge. Second: A code review by another human (or even by the same human, some days later) might have caught the bug. It was painfully obvious in hindsight. The problem? “Might have” is a heuristic, just like your automated tests are.

My guess is that mutation testing would have shown the blind spot of the existing tests – among several hundred other findings. The heuristic is then your trained eye that sifts through the results and separates false positives from true positives.

Right now, I’ve fixed the bug, kept all the additional tests and added one more: an integration test that performs a measurement with fixed values and checks nothing but whether all 200 values are stored at the end of the measurement. It’s oddly specific, but it conveys this story in an automated fashion.

Oh, and I made another refactoring: I’ve replaced the arrays with collections. Hopefully, I won’t regret this one twelve years in the future. I made it on a Monday.

Non-determinism in C++

A deterministic program, when given the same input, will always produce the same output. This intuitive, albeit quite fuzzily defined, property is oftentimes pretty important for a correct program. Sources of non-determinism can be quite subtle – and once they creep into your program, they can propagate, amplify and have enormous consequences. It is pretty much the well-known butterfly effect.

When discussing this problem, it is important to know what exactly makes up the input and the output of the program. For example, if you log timestamps to a logfile and consider the logfile an actual output, no two runs will ever be the same – so this is usually not considered an output relevant for determinism. Which brings us to the first common source of non-determinism:

Time

If any part of your program depends on the time it is run at, it can easily become non-deterministic. Common cases are using the current time to initialize some variable, or using the time for some kind of numerical integration, like accumulating a value over time. Also, treating execution time as an output relevant for determinism is hopeless on a normal desktop computer – but can be crucial for a real-time system.
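Two minimal sketches of how time sneaks in – seeding a value from the clock and accumulating over wall-clock time (both hypothetical, not taken from a real code base):

#include <chrono>
#include <random>

// Seeding from the clock makes every run start differently.
std::mt19937 generator{static_cast<std::mt19937::result_type>(
    std::chrono::steady_clock::now().time_since_epoch().count())};

// Accumulating over wall-clock time couples the result to scheduling and CPU load.
double accumulate(double ratePerSecond, std::chrono::steady_clock::time_point lastUpdate)
{
  auto const now = std::chrono::steady_clock::now();
  double const dt = std::chrono::duration<double>(now - lastUpdate).count();
  return ratePerSecond * dt; // depends on when and how fast the program ran
}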

Random number generation

Random number generation seems like an obvious candidate, yet most random number generators are not really random, but only pseudo-random. For example, std::mersenne_twister_engine will generate the same sequence of values every time when initialized with the same seed. So as long as you do not initialize it with a non-deterministic input like the time, it will be predictable. However, std::random_device might not share this property and give you fresh non-deterministic input. As a weird middle ground, std::default_random_engine will probably give you the same results when compiled with the same compiler/standard library, but on another compiler version or OS, it will not. Subtle.
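A small sketch of the difference: the explicitly seeded engine is reproducible across runs, while an engine seeded from std::random_device (usually) is not:

#include <iostream>
#include <random>

int main()
{
  std::mt19937 seeded{42};        // same seed, same sequence on every run
  std::cout << seeded() << '\n';  // prints the same number every time

  std::random_device device;      // may draw from a non-deterministic source
  std::mt19937 fresh{device()};   // this sequence will (usually) differ between runs
  std::cout << fresh() << '\n';
}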

The allocator

Another source of non-determinism that is pretty tricky is the allocator. For example, consider the following piece of code:

#include <set>

// Thingy is assumed to be some heap-allocated object with a value() accessor.
template <class T>
T sum(std::set<Thingy*> const& set)
{
  T result{};
  for (auto const& each : set)
    result += each->value();
  return result;
}

Is this deterministic or not? It depends. Let’s assume that all the Thingys were allocated with plain new. In that case, the actual pointer values are non-deterministic, and hence the iteration order of the Thingy*s in the set is unpredictable. But does this matter? Well, if T is std::uint32_t, it does not: the order of addition does not matter for unsigned integers, even with overflow. However, if T is float, it does matter and the whole result becomes unpredictable, at least in the general case (it will still be predictable if, e.g., all the numbers in the computation are integers that are exactly representable as floats). Other languages have “insertion-ordered” containers to get around this problem. A sensible approximation in C++ is to use the (unordered_)set and (unordered_)map containers together with another list to iterate on, as sketched below.
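A sketch of that approximation, reusing the Thingy type from the snippet above: keep the set for uniqueness checks, but iterate over a vector that preserves insertion order.

#include <set>
#include <vector>

// Sketch: pair the set (fast lookup, uniqueness) with a vector that fixes the iteration order.
struct ThingyCollection {
  void add(Thingy* thingy) {
    if (lookup.insert(thingy).second)      // only remember each Thingy once
      insertionOrder.push_back(thingy);
  }

  float sum() const {
    float result{};
    for (auto const* each : insertionOrder)  // deterministic, insertion-ordered iteration
      result += each->value();
    return result;
  }

  std::set<Thingy*> lookup;
  std::vector<Thingy*> insertionOrder;
};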

The thread scheduler

When you cannot really control the order of instructions – which is really the whole point of threading – you will have a harder time making things deterministic. Like the allocator problem, this is usually also paired with floating-point arithmetic. The workaround here is to make sure that the order of computation does not influence the final result. One common way to achieve this is to sort the output by a unique criterion. For example, if you use multiple threads to report the intersections of a bunch of line segments, you can later sort them by their position in space.
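A sketch of that workaround, assuming each worker reports intersection points with a position in space:

#include <algorithm>
#include <mutex>
#include <thread>
#include <tuple>
#include <vector>

struct Intersection {
  double x;
  double y;
};

// Sketch: collect results in whatever order the threads finish,
// then impose a deterministic order before treating them as output.
std::vector<Intersection> findIntersectionsInParallel()
{
  std::vector<Intersection> results;
  std::mutex resultsMutex;

  auto worker = [&](double offset) {
    Intersection found{offset, offset * 2.0};  // stand-in for the real computation
    std::lock_guard<std::mutex> lock{resultsMutex};
    results.push_back(found);
  };

  std::thread first{worker, 1.0};
  std::thread second{worker, 2.0};
  first.join();
  second.join();

  // The push_back order depends on the scheduler; sorting by position restores determinism.
  std::sort(results.begin(), results.end(), [](Intersection const& lhs, Intersection const& rhs) {
    return std::tie(lhs.x, lhs.y) < std::tie(rhs.x, rhs.y);
  });
  return results;
}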

There’s of course the honorable mention of uninitialized variables, but I’m sure your static analyzer will complain about those. Any interaction with the “outside” of your program, any side effect – be it filesystems, user input, output or cosmic radiation – can lead to non-determinism, so be sure to know the context well enough and plan according to your determinism requirements.

Debugging Web Pages for iOS

Web developers use browser tools like the Web Inspector in Chrome and Safari or the Developer Tools in Firefox to develop, debug and test web pages. In Safari you have to enable the Develop menu first: Safari -> Preferences… -> Advanced -> Show Develop menu in menu bar

All these tools offer modes where you can display the page layout at various screen sizes. In Safari this is called the Responsive Design Mode and can be found in the Develop menu. This is essential for checking the page layout for mobile devices. There are, however, some differences in behaviour that can only be tested on real devices or in a simulator. For example, dropdown menus can trigger a wheel selector on mobile devices, while the desktop browser renders them as regular dropdown menus, even in responsive design mode.

Here are some tips for debugging web pages for iOS devices in the simulator:

Using Web Inspector with the iOS Simulator

Within the mobile Safari browser you can’t simply open the Web Inspector console as you would do when developing a web page using a desktop browser. But you can connect the Web Inspector of your desktop Safari to the mobile Safari browser instance running in the iOS simulator:

  • Start the iOS simulator from Xcode: Xcode -> Open Developer Tool -> Simulator
  • Select the desired device: Hardware -> Device -> e.g. iOS 12.1 -> iPhone SE
  • Open the web page in Safari within the simulator
  • Open the desktop version of Safari

In Safari’s Develop menu the simulator now shows up as a device, e.g. “Simulator – iPhone SE – iOS 12.1 (16B91)”. The web page you opened in the simulator should be listed as a submenu item. If you click this menu item, the Web Inspector opens. It is now connected to the simulated Safari instance and you can debug the mobile variant of your web page.

Workaround for Clearing the Cache

When using a desktop web browser you can easily bypass the local browser cache when reloading a web page by holding the shift key while pressing the reload button. Sometimes this is necessary to see changes take effect while developing a web application. However, this doesn’t work in Safari running within the iOS simulator. There’s a little workaround: You can open the web page in a private (incognito) tab, which means the cache is cleared each time you close the tab and open the page again in a new private tab.

Setting Grails session timeout in production

Grails 3 was a great update to the framework and kept it up-to-date with modern requirements in web development. Modularization, profiles, a revamped build system and configuration were all great changes that made working with Grails more productive and fun again.
I quite like the choice of YAML for the configuration settings because you can easily describe sections and hierarchies without much syntactic noise.

Unfortunately, there are some caveats. One of them went live and caused a (minor) irritation for our customer:

The session timeout was back to the 30 minutes default and not prolonged to the one hour we all agreed upon some years (!) ago.

Investigating the cause

Our configuration in application.yml was correctly set to the desired one hour timeout and in development everything was working as expected. But the thing is that the setting server.session.timeout is only applied to the embedded Tomcat. If your application is deployed to a standalone servlet container, this setting is ignored. Unfortunately, it is far from obvious which settings in application.yml are used in which situation.

In the case of a standalone servlet container you would just edit your application’s web.xml and the container would use the setting there. While this would work, it is not very nice because you have two locations for one setting. In software development we call that duplication. What makes things worse is that there is no web.xml in our case! So what now?

The solution

We have two problems here:

  1. Providing the functionality our customer desires
  2. Removing the code duplication so that development and production work the same way

Our solution is to apply the setting from application.yml to the HTTP-Session of the request using an interceptor:

class SessionInterceptor {
    int order = -1000

    SessionInterceptor() {
        matchAll()
    }

    boolean before() {
        int sessionTimeout = grailsApplication.config.getProperty('server.session.timeout') as int
        log.info("Configured session timeout is: ${sessionTimeout}")
        request.session?.setMaxInactiveInterval(sessionTimeout)
        true
    }
}

That way we use a single source of truth, namely the configuration in application.yml, both in development and production.

 

Ignoring YAGNI – 12 years later

Fourteen years ago, we started to build a distributed system to gather environmental data in an automated 24/7 fashion. Our development process was agile and made heavy use of short iterations (at least that is what they were back then – today they would be considered normal-sized). So the system grew with many small new features and improvements, giving the customer immediate business value.

One part of the system was the task scheduler. Because the system had to run 24/7 and be mostly independent of human interaction, the task scheduler’s job was to launch different measurement processes at the right time. We had done extensive domain crunching and figured out that all tasks follow a rigid time regime like “start every 10 minutes” or “start every hour”, regardless of the processes’ runtime. This made the scheduler rather easy to develop. You should keep it simple, after all.

But another result of the domain crunching bothered us: The schedule of all tasks originated from the previous software system, built 30 years ago and definitely unfit for the modern software world. The schedules weren’t really rooted in the domain; they all had technical explanations like “the recording of the values is done sequentially and takes up to 8 minutes, we can’t record them more often than that”. For our project, the measurement hardware was changed, so our recording took a couple of milliseconds. We could store and display the values continuously, should the need arise.

So we discussed the required simplicity or complexity of the task scheduler with the customer and they seemed pleased with all the new possibilities. But they decided that the current schedules were sufficient and didn’t need to be changed. We could go ahead and build our simple task scheduler.

And this is when we decided to abandon KISS and make the task scheduler more powerful than needed. “But you ain’t going to need it!” was the enemy. Because we knew that the customer would inevitably come around and make use of their new possibilities. We knew that if we built the system with more complexity, we would be the heroes in a future time, wearing a smug smile and telling the customer: “We’ve already built this, you can use it right away”. Oh, how glorious this prospect of the future shone! Just a few more thoughts going into the code and we were set for a bright future.

Let me tell you a few details about the “few more thoughts”, using the example of an “every hour” task schedule. Instead of hard-coding the schedule, we added a configuration file with a cron-like expression for the schedule. You could now leverage the power of cron expressions to design your schedule as you see fit. If you wanted to change the schedule from “every hour” to “every odd minute and when the pale moon rises”, you could do so. The task scheduler had to interpret the configuration file and make sure that tasks don’t pile up: If you schedule a task to run “every minute”, but it takes two minutes to process, you’ve essentially built a time bomb for your system load. This must not be possible.
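This is not our actual scheduler code, but a C++ sketch of the kind of guard such a requirement forces on you – every scheduled task needs to know whether its previous run is still busy and drop triggers that would pile up:

#include <atomic>

// Sketch: a per-task guard against overlapping executions.
class ScheduledTask {
public:
  void onTrigger() {  // called whenever the cron-like expression fires
    bool expected = false;
    if (!running.compare_exchange_strong(expected, true))
      return;         // previous run still busy: drop this trigger instead of queueing it
    runTask();
    running = false;
  }

private:
  void runTask() { /* start the actual measurement process */ }

  std::atomic<bool> running{false};
};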

But it doesn’t stop there. A lot of functionality, most of which wasn’t even present or outlined at the time of our decision, relies implicitly on that schedule. Two examples: There are manual operations that must not be performed during the execution of the task. The system goes into a “protected state” around the task execution. It disables these operations a few minutes before the scheduled execution and even some time afterwards. If you had a fixed schedule of “every hour”, you could even hard-code the protected timespan. With a possible dynamic schedule, you have to calculate your timespan based on the current schedule and warn your operator if it isn’t possible anymore to find a time slot to even perform the manual operation.
The second example is a functionality that supervises the completeness of the recorded data. The problem is: This functionality is on another computer (it’s a distributed system, remember?) that doesn’t know about the configuration files. To be able to scan the data archive and say “everything that should be there, is there”, the second computer needs to know about all the schedules of the first computers (there are many of them, recording their data on their own schedules and transferring it to the second computer). And if a schedule changes, the second computer needs to take the change into account and scan the data archive for two areas: one area with the old schedule and one area with the new schedule. Otherwise, there would be false alarms.

You can probably see that the one decision to make the task scheduler a little more complex and configurable than required had quite some impact on the complexity of other parts of the system. But this investment will be worth it as soon as the customer changes the schedule! The whole system is programmed, tested and documented to facilitate schedule changes. We are ready!

It’s been over twelve years since we wrote the first line of code for the more complex implementation (I’ve checked the source control logs). The customer hasn’t changed a single bit of the schedule yet. There are over twenty “first computers” and they all still run the same task schedule as initially planned. Our decision did nothing but add accidental complexity to the system. It probably introduced some bugs along the way, too. It certainly increased the required level of awareness (the “hurdle of understanding”) during the development of features that are somewhat coupled to the task schedule.

In short: It’s been a disaster. The smug smile we thought we’d wear has been replaced by a deep frown. Who wrote all that mess? And why? It wasn’t the customer, it was us. We were never going to need it.

Oracle database date and time literals

For one of our projects I work with time series data stored in an Oracle database, so I write a lot of SQL queries involving dates and timestamps. Most Oracle SQL queries I came across online use the TO_DATE function to specify date and time literals within queries:

SELECT * FROM events WHERE created >= TO_DATE('2012-04-23 16:30:00', 'YYYY-MM-DD HH24:MI:SS')

So this is what I started using as well. Of course, this is very flexible, because you can specify exactly the format you want to use. On the other hand it is very verbose.

From other database systems like PostgreSQL I was used to specifying dates in queries as simple string literals:

SELECT * FROM events WHERE created BETWEEN '2018-01-01' AND '2018-01-31'

This doesn’t work in Oracle, but I was happy to find out that Oracle supports short date and timestamp literals in another form:

SELECT * FROM events WHERE created BETWEEN DATE'2018-01-01' AND DATE'2018-01-31'

SELECT * FROM events WHERE created > TIMESTAMP'2012-04-23 16:30:00'

These date/time literals were introduced in Oracle 9i, which isn’t exactly recent. However, since most online tutorials and examples seem to use the TO_DATE function, you may be happy to find out about this little convenience, just like me.

Using OpenVPN in an automated deployment

In my previous post I showed how to use Ansible in a Docker container to deploy a software release to some remote server and how to trigger the deployment using a Jenkins job.

But what can we do if the target server is not directly reachable over SSH?

Many organizations use a virtual private network (VPN) infrastructure to provide external parties access to their internal services. If there is such an infrastructure in place we can extend our deployment process to use OpenVPN and still work in an unattended fashion like before.

Adding OpenVPN to our container

To be able to use OpenVPN non-interactively in our container we need to add several elements:

  1. Install OpenVPN:
    RUN DEBIAN_FRONTEND=noninteractive apt-get update && apt-get -y install openvpn
  2. Create a file with the credentials using environment variables:
    CMD echo -e "${VPN_USER}\n${VPN_PASSWORD}" > /tmp/openvpn.access
  3. Connect with OpenVPN using an appropriate VPN configuration and wait for OpenVPN to establish the connection:
    openvpn --config /deployment/our-client-config.ovpn --auth-user-pass /tmp/openvpn.access --daemon && sleep 10

Putting it all together, our extended Dockerfile may look like this:

FROM ubuntu:18.04
RUN DEBIAN_FRONTEND=noninteractive apt-get update
RUN DEBIAN_FRONTEND=noninteractive apt-get -y dist-upgrade
RUN DEBIAN_FRONTEND=noninteractive apt-get -y install software-properties-common
RUN DEBIAN_FRONTEND=noninteractive apt-get update && apt-get -y install ansible ssh
RUN DEBIAN_FRONTEND=noninteractive apt-get update && apt-get -y install openvpn

SHELL ["/bin/bash", "-c"]
# A place for our vault password file
RUN mkdir /ansible/

# Setup work dir
WORKDIR /deployment

COPY deployment/${TARGET_HOST} .

# Deploy the proposal submission system using openvpn and ansible
CMD echo -e "${VAULT_PASSWORD}" > /ansible/vault.access && \
echo -e "${VPN_USER}\n${VPN_PASSWORD}" > /tmp/openvpn.access && \
ansible-vault decrypt --vault-password-file=/ansible/vault.access ${TARGET_HOST}/credentials/deployer && \
openvpn --config /deployment/our-client-config.ovpn --auth-user-pass /tmp/openvpn.access --daemon && \
sleep 10 && \
ansible-playbook -i ${TARGET_HOST}, -u root --key-file ${TARGET_HOST}/credentials/deployer ansible/deploy.yml && \
ansible-vault encrypt --vault-password-file=/ansible/vault.access ${TARGET_HOST}/credentials/deployer && \
rm -rf /ansible && \
rm /tmp/openvpn.access

As you can see, the CMD has become quite long and messy by now. In production we put the whole process of executing the Ansible and OpenVPN commands into a shell script as part of our deployment infrastructure. That way we separate the preparation of the environment from the deployment steps themselves. The CMD then looks a bit friendlier:

CMD echo -e "${VPN_USER}\n${VPN_PASSWORD}" > /vpn/openvpn.access && \
echo -e "${VAULT_PASSWORD}" > /ansible/vault.access && \
chmod +x deploy.sh && \
./deploy.sh

As another aside, you may have to mess around with the nameserver configuration depending on the OpenVPN configuration and other infrastructure details. I left that out because it seems specific to the setup with our customer.

Fortunately, there is nothing to do on the Ansible side, as the whole VPN stuff should be completely transparent to the tools using the network connection to do their job. However, we need some additional settings in our Jenkins job.

Adjusting the Jenkins Job

The Jenkins job needs the VPN credentials added and some additional parameters to docker run for the network tunneling to work in the container. The credentials are simply another injected password, which we may call JOB_VPN_PASSWORD, and the full script may now look as follows:

docker build -t app-deploy -f deployment/docker/Dockerfile .
docker run --rm \
--cap-add=NET_ADMIN \
--device=/dev/net/tun \
-e VPN_USER=our-client-vpn-user \
-e VPN_PASSWORD=${JOB_VPN_PASSWORD} \
-e VAULT_PASSWORD=${JOB_VAULT_PASSWORD} \
-e TARGET_HOST=${JOB_TARGET_HOST} \
-e ANSIBLE_HOST_KEY_CHECKING=False \
-v `pwd`/artifact:/artifact \
app-deploy

DOCKER_RUN_RESULT=`echo $?`
exit $DOCKER_RUN_RESULT

Conclusion

Adding VPN-support to your automated deployment is not that hard but there are some details to be aware of:

  • OpenVPN needs the credentials in a file – similar to Ansible – to be able to run non-interactively
  • OpenVPN either stays in the foreground or, if you tell it to daemonize, goes into the background right away – not only after the connection has been established. So you have to wait a sufficiently long time before proceeding with your deployment process
  • For OpenVPN to work inside a container, Docker needs some additional flags to allow proper tunneling of network connections
  • DNS resolution can be tricky depending on the actual infrastructure. If you experience problems, either tune your setup by adjusting name resolution in the container or access the target machines using IPs…

After ironing out the gory details depicted above, we have a secure and convenient deployment process that saves a lot of time and nerves in the long run, because you can update the deployment at the press of a button.

Domain-aligned bugs

Image: Frank C. Müller [CC BY-SA 4.0 (https://creativecommons.org/licenses/by-sa/4.0)]
Imagine that you are a user of a typical enterprise software system that handles commercial products and their prices. There are different prices in the software that are somehow related to each other. There is the purchase price that indicates your cost if you buy the product. There is the retail price that gets listed in your price lists and is paid by your customers, should they buy the product. You probably already figured out that the retail price should never be lower than the purchase price, because that would mean you lose money with every successful sale.

Let’s say that the enterprise software not only handles products, but also parts. Several parts combined, with some manufacturing effort, result in a product. Each part has a purchase price, the resulting product has a retail price. The retail price of the product should be higher than the sum of purchase prices of the parts. If it isn’t, you lose the costs of the manufacturing effort and some extra money with every successful sale.

If for any reason you cannot clearly estimate your manufacturing effort, the enterprise software has another input field for an amount of money that you can add to the sum of the parts’ costs. We call this field the “sales bonus”. So, if you sell a product made up of parts, your customer has to pay a price that consists at least of the retail prices of the parts and the sales bonus. Of course, your customer has an individual discount percentage that needs to be subtracted from the total price. Are you still following?

You are now thinking in the domain of price determination and financial mathematics. If you were the user of said enterprise software, you’d probably expect some bugs like these:

  • It is possible to enter a retail price lower than the purchase price
  • The price of products manufactured from parts isn’t calculated correctly
  • It is possible to enter a negative sales bonus
  • The total price with discount could be lower than the sum of purchase prices of the parts without a warning

All of them are bugs in the domain. All of them can be explained to a domain expert or a user with terms and concepts from the domain.

But what about this bug: You sell a product that consists of three parts, each with a retail price of 10 €, and a sales bonus of 5 €. You want to create a quote for your customer and the price shows up as 34,99999999998 €. You are a bit bewildered and try to counteract the apparent rounding error by changing the sales bonus to 5,00000000002 €. After this change you get another crazy total price, and your prices in the database are different from what you entered, too. Everything seems to destabilize and deviate further and further from clear-cut prices.
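The underlying effect is easy to reproduce in a few lines; here is a minimal C++ sketch of the same class of error (the numbers are illustrative, not the exact prices from the story):

#include <iostream>

int main()
{
  double total = 0.0;
  for (int i = 0; i < 10; ++i)
    total += 0.1;                // ten times "10 cents"

  std::cout.precision(17);
  std::cout << total << '\n';    // prints something like 0.99999999999999989 instead of 1
}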

As a programmer, you know what happened. You know what caused this effect of numerical instability. Somebody stored monetary values in a floating point number. You know that is a bad idea and you’d never do it. But this blog post isn’t about you or what you should or shouldn’t do. It is about the user, an expert in their domain, who stumbles over the bug as described and has to make some decision on how to fix it. This user cannot use any knowledge from the domain to even understand the mechanics of the bug. You, as the programmer, cannot explain this bug in terms of things the user already knows. You need to be vague (“the software doesn’t store the exact values, just approximations”) or introduce additional complexity (“we store this value by splitting it into a significand and multiplying it by a factor consisting of a fixed base and an exponent. We can omit the base and just store the significand and the exponent and express a very large numerical range in just a few bits. Think about how cool that is!”).

Read the last explanation again, from the viewpoint of a salesman. We want to add some prices in the range of a few €, slap a moderate discount on top and call it a day. We don’t care about bits or exponential formulas. That is not part of our domain and it shouldn’t affect our domain or software that works in our domain. Confronting us with technical details reflects negatively on your ability to solve our problems. You seem to burden us with your problems in exchange.

As domain experts, we want only domain-aligned bugs.

Using protobuf with conan and CMake

In my last post, I showed how I got my feet wet while migrating the dependencies of my existing code-base to conan. The first major hurdle I saw coming when I started was adding something with a “special” build step, e.g. something like source-preprocessing. In my case, this was protobuf, where a special build-step converts .proto files to sources and headers.

In my previous solution, my devenv build scripts would install the protobuf compiler binary into my devenv’s bin/ folder, which I then used to run my preprocessing. At first, it was not obvious how to do this with conan. It turns out that the lovely people at bincrafters made this pretty comfortable. conan_basic_setup() will add all required package paths to your CMAKE_MODULE_PATH, which you can use to include() some bundled CMake scripts that will either let you execute the protobuf compiler via a target or run protobuf_generate to automagically handle the preprocessing. It’s probably worth noting that this really depends on how the package is made; conan does not really have an official way to handle this.

Let’s start with some sample code – Person.proto, like the sample from the protobuf website:

// required/optional fields mean this is proto2 syntax
syntax = "proto2";

message Person {
  required string name = 1;
  required int32 id = 2;
  optional string email = 3;
}

And some sample code that uses it:

#include "Person.pb.h"

int main(int argn, char** argv)
{
  Person message;
  message.set_name("Hello Protobuf");
  std::cout << message.name() << std::endl;
}

Again, we’re using the bincrafters repository for our dependencies in a conanfile.txt:

[requires]
protobuf/3.6.1@bincrafters/stable
protoc_installer/3.6.1@bincrafters/stable

[options]

[generators]
cmake

Now we just need to wire it all up in the CMakeLists.txt:

cmake_minimum_required(VERSION 3.0)
project(ProtobufTest)

include(${CMAKE_BINARY_DIR}/conanbuildinfo.cmake)
conan_basic_setup(TARGETS KEEP_RPATHS)

# This loads the cmake/protoc-config.cmake file
# from the protoc_installer dependency
include(cmake/protoc-config)

set(TARGET_NAME ProtobufSample)

# Just add the .proto files to the target
add_executable(${TARGET_NAME}
  Person.proto
  ProtobufTest.cpp
)

# Let this function do the magic
protobuf_generate(TARGET ${TARGET_NAME})

# Need to use protobuf, of course
target_link_libraries(${TARGET_NAME}
  PUBLIC CONAN_PKG::protobuf
)

# Make sure we can find the generated headers
target_include_directories(${TARGET_NAME}
  PUBLIC ${CMAKE_CURRENT_BINARY_DIR}
)

There you have it! Pretty neat, and all without a brittle find_package call.