Rounding numbers is not that easy

For many computer programs it is necessary to round numbers. For example an invoice amount should only have two decimal places and a tool for time management often does not have to be accurate to the millisecond. Fortunately you don‘t have to write a method for that yourself. In Java or JavaScript you can use Math.round, Python has a built-in function for rounding and the Kotlin Standard Library also contains a method for this purpose. Anyway some of these functions have a few surprises in store and violate the principle of least astonishment. The principle of least astonishment was first formulated by Geoffrey James in his book The Tao of Programming. It states that a program should always behave in the way the user expects it to, but it can also be applied to source code. Thus a method or a class should have a name that describes its behavior in a proper way.

So, what would you expect a method with the name round to do? The most common way to round numbers is the so called round half up method. It means that half-way values are always rounded up. For example 4.5 gets rounded to 5 and 3.5 gets rounded to 4. Negative numbers get rounded in the same way, for example -4.5 gets rounded to -4. In fact the Math.round functions in Java and JavaScript use this kind of rounding and thus behave in a way most people would expect.

But in other programming languages this can be different. Actually I used the Python built-in rounding function for some time without recognizing it does not always round half-way values up. For example round(3.5) results in 4 as you would expect, but round(4.5) also returns 4. That‘s because Python uses the so called round half to even method for rounding values. This means that half-way values are always rounded to the nearest even number. The advantage in this kind of rounding is that if you add mulitple rounded values the error gets minimized, so it can be beneficial for statistical calculations. If you still want to round half-way values up in Python, you can implement your own rounding function:

def round_half_up(number, decimals: int):
	rounded_value = int(number * (10**decimals) + 0.5) / (10**decimals)

	if rounded_value % 1 == 0:
		rounded_value = int(rounded_value)

	return rounded_value

round_half_up(4.5, decimals=0)    # results in 5

A different way in Python to round half-way values up is to use the decimal module, which contains different rounding modes:

from decimal import *

Decimal("4.5").quantize(Decimal("1"), rounding=ROUND_HALF_UP)    # results in 5

It should be noted that the ROUND_HALF_UP mode in this module does actually not use the round half up method as explained above, but the also very common round half away from zero method. So for positive numbers the results are the same, but -4.5 does not get rounded to -4, but -5.

Python is by the way not the only programming language that uses the round half to even method. For example Kotlin and R round half-way values to the nearest even number, too. However for Kotlin there are several easy ways to round half-way values up: you could use the methods roundToInt or roundToLong from the standard library or the Math.round method from Java instead of the method round.

It should also be noted that the explained methods for rounding are not the only ones. Instead of rounding half-way values up you could also use the round half down method, so rounding 3.5 would result in 3. And instead of rounding half to even you could use the round half to odd method and 4.5 would get rounded to 5, as would 5.5. There are some more methods and everyone of them has its use case, so you should always choose carefully.

To sum it up, rounding is not as easy as it seems. Although most programming languages have a method for rounding in their standard library you should always take a closer look and check if the rounding function you want to use behaves in the way you expect and want it to.

Sometimes you will be surprised.

The joy of being a student assistant

Lately I heard a lecture by Daniel Lindner about error codes and why you should avoid using them. I had to smile because it reminded me of my time as a student assistant, when I worked with some people that had a slightly different opinion on that point. Maybe they enjoyed torturing student assistants, but it seems the most likely to me that they just did not know any better. But let‘s start at the beginning.

One day the leader of my research group sent me an excel sheet with patient data and asked me to perform some statistical calculations with the programming language R, that is perfectly suitable for such a task. Therefore I did not expect it to take much time – but it soon turned out that I was terribly wrong. Transforming that excel sheet into something R could work with gave me a really hard time and so I decided to write down some basic rules you should consider when recording such data in the hope that at least some future student assistants won‘t have to deal with the problems I had again.

First I told my program to read in the excel sheet with the patient data, which worked as expected. But when I started to perform some simple operations like calculating the average age of the patients, my computer soon told me things like that:

Warning message: In mean.default(age): argument is not numeric or logical: returning NA

I was a little confused then, because I knew that the age of something or someone is a numeric value. But no matter how often I tried to explain that to my computer: He was absolutly sure that I was wrong. So I had no choice but to have a look at the excel sheet with about 2000 rows and 30 columns. After hours of searching (at least it felt like hours) I found a cell with the following content: Died last week. That is indeed no numeric value, it‘s a comment that was made for humans to read. So here‘s my rule number one:

1. Don‘t use comments in data files

There is one simple reason for that: The computer, who has to work with that data, does not understand it. And (as sad as it is) he does not care about the death of a patient. The only thing he wants to do is to calculate a mean value. And he needs numeric values for it. If you still want to have that comment, just save it somewhere else in another column for comments or in a separate file the computer does not have to deal with when performing statistical calculations.

So I removed that comment (and some others I found) and tried to calculate the mean value again. This time my computer did not complain, the warning message disappeard and for one moment I felt relieved. In the next moment I saw the result of the calculation. The average age of the patients was 459.76. And again I told my computer that this is not possible and again he was sure that I was wrong and again I had to take a look at the excel sheet with the data. Did I mention that the file contained about 2000 rows and 30 columns? However, after a little searching I found a cell with the value -999999. It was immediately clear to me that this was not the real age of the patient, but I wasn‘t able to find out by myself what that value meant. It could have been a typo, however the leader of my research group told me that some people use -999999 as an error code. It could mean something like: „I don‘t know the age of the patient.“ Or: „That patient also died.“ But that was only a guess. So here is my rule number 2:

2. Document your error codes

If there would have been some documentation I would maybe have known what to do with that value. Instead I secretly deleted it, hoping that it was not important to anyone, because unfortunately to my computer -999999 is just a numeric value, not better or worse than any other. So I had to tell him not to use it. But that was only the beginning.

I learned from my previous mistakes and before performing any other statistical calculations I had a look at the whole excel sheet. And it was even more horrible than I expected it to be. If every person who worked with that table would have used the same error code, I could just have written a script that eliminated all -999999 from the sheet and it would have been done. But it seemed that everyone had his own favorite (undocumented) error code. Or if at least there would have been some documentation about the value ranges of each column, I could have told my computer to ignore all values that are not in that range. For something like the age of a patient this is easy, but for other medical data a computer science student like me does not know that can be hard. Is 0 a valid value or does it mean that there is no value? What about 999? So in any case: I had to read the whole table (again: 2000×30 values!) and manually guess for each value if it really was a value or an error code and then tell my computer to ignore it, so he could calculate the right means. I don‘t know exactly how much time that cost, but I‘m sure in the same time I could have read all nine books of The Histories by Herodotus twice, watch every single episode of Gilmore Girls including the four episodes of A Year in the Life and learn Japanese. So finally here‘s my rule number 3 (and the good part about this one is that you can immediately forget about rule number 2):

3. Don‘t use error codes in data files

Really. Don‘t. The student assistants of the future will thank you.