I recently heard a lecture by Daniel Lindner about error codes and why you should avoid using them. I had to smile because it reminded me of my time as a student assistant, when I worked with some people who had a slightly different opinion on that point. Maybe they enjoyed torturing student assistants, but it seems most likely to me that they just did not know any better. But let’s start at the beginning.
One day the leader of my research group sent me an Excel sheet with patient data and asked me to perform some statistical calculations in the programming language R, which is perfectly suited to such a task. So I did not expect it to take much time, but it soon turned out that I was terribly wrong. Transforming that Excel sheet into something R could work with gave me a really hard time, so I decided to write down some basic rules you should consider when recording such data, in the hope that at least some future student assistants won’t have to deal with the problems I had.
First I told my program to read in the Excel sheet with the patient data, which worked as expected. But when I started to perform some simple operations like calculating the average age of the patients, my computer soon told me things like this:
Warning message:
In mean.default(age) : argument is not numeric or logical: returning NA
I was a little confused, because I knew that the age of something or someone is a numeric value. But no matter how often I tried to explain that to my computer, he was absolutely sure that I was wrong. So I had no choice but to look through the Excel sheet with its roughly 2000 rows and 30 columns. After hours of searching (at least it felt like hours) I found a cell with the following content: Died last week. That is indeed not a numeric value; it’s a comment that was made for humans to read. So here’s my rule number one:
1. Don’t use comments in data files
There is one simple reason for that: the computer, who has to work with that data, does not understand it. And (as sad as it is) he does not care about the death of a patient. The only thing he wants to do is calculate a mean value, and for that he needs numeric values. If you still want to keep the comment, save it somewhere else: in a separate column for comments, or in a separate file the computer does not have to deal with when performing statistical calculations.
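A check like the following would have saved me most of that search. It is only a sketch: the tiny data frame and the column names id, age and comment are made up for illustration, and it assumes the age column arrived in R as text, which is what typically happens when a single cell contains a comment.

```r
# Made-up example data; the real sheet had about 2000 rows.
patients <- data.frame(
  id  = 1:4,
  age = c("34", "Died last week", "71", "58"),
  stringsAsFactors = FALSE
)

# Try to convert every cell to a number; cells that fail become NA.
age_numeric <- suppressWarnings(as.numeric(patients$age))

# Rows where the original cell was filled in but is not a number.
bad_rows <- which(is.na(age_numeric) & !is.na(patients$age) & patients$age != "")
print(patients[bad_rows, ])

# Move the comments into their own column and keep only numbers in age.
patients$comment <- NA_character_
patients$comment[bad_rows] <- patients$age[bad_rows]
patients$age <- age_numeric

# The mean can now be calculated; na.rm = TRUE skips the missing age.
print(mean(patients$age, na.rm = TRUE))
```

The point of the extra column is that nothing gets lost: humans can still read the comment, and the computer only ever sees numbers in the age column.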
So I removed that comment (and some others I found) and tried to calculate the mean value again. This time my computer did not complain, the warning message disappeared, and for one moment I felt relieved. In the next moment I saw the result of the calculation. The average age of the patients was -459.76. Again I told my computer that this was not possible, again he was sure that I was wrong, and again I had to take a look at the Excel sheet with the data. Did I mention that the file contained about 2000 rows and 30 columns? After a little searching, however, I found a cell with the value -999999. It was immediately clear to me that this was not the real age of the patient, but I wasn’t able to find out by myself what that value meant. It could have been a typo, but the leader of my research group told me that some people use -999999 as an error code. It could mean something like “I don’t know the age of the patient.” Or “That patient also died.” But that was only a guess. So here is my rule number 2:
2. Document your error codes
If there had been some documentation, I might have known what to do with that value. Instead I secretly deleted it, hoping that it was not important to anyone, because unfortunately, to my computer -999999 is just a numeric value, no better or worse than any other. So I had to tell him not to use it. But that was only the beginning.
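In R, telling the computer not to use a value usually means turning it into NA, R's marker for a missing value. Here is a minimal sketch with made-up ages, assuming -999999 is the only error code in the column:

```r
# Made-up ages; -999999 stands in for an unknown age.
age <- c(34, 71, -999999, 58)

# The error code is treated like any other number and ruins the mean.
print(mean(age))

# Replace the error code with NA so R knows the value is missing.
age[age == -999999] <- NA

# na.rm = TRUE makes mean() skip the missing values.
print(mean(age, na.rm = TRUE))
```

If the sheet is exported to CSV first, read.csv can do the same replacement while reading the file through its na.strings argument, for example na.strings = c("NA", "-999999"). Either way, this only works once you know which values are error codes.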
I had learned from my previous mistakes, so before performing any other statistical calculations I had a look at the whole Excel sheet. And it was even more horrible than I had expected. If every person who worked with that table had used the same error code, I could have just written a script that eliminated every -999999 from the sheet and been done with it. But it seemed that everyone had their own favorite (undocumented) error code. Or if there had at least been some documentation about the value ranges of each column, I could have told my computer to ignore all values outside that range. For something like the age of a patient this is easy, but for other medical data that a computer science student like me is not familiar with, it can be hard. Is 0 a valid value or does it mean that there is no value? What about 999? So in any case I had to read the whole table (again: 2000×30 values!), manually guess for each value whether it really was a value or an error code, and then tell my computer to ignore it so he could calculate the right means. I don’t know exactly how much time that cost, but I’m sure that in the same time I could have read all nine books of The Histories by Herodotus twice, watched every single episode of Gilmore Girls including the four episodes of A Year in the Life, and learned Japanese. So finally here’s my rule number 3 (and the good part about this one is that you can immediately forget about rule number 2):
3. Don’t use error codes in data files
Really. Don’t. The student assistants of the future will thank you.
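And if a value really is unknown, the simplest thing to do is to leave the cell empty. As a sketch, assuming the sheet is exported to CSV and read with read.csv (the three-row table here is made up), R turns an empty cell in a numeric column into NA all by itself:

```r
# Made-up CSV content, one line per element.
# Patient 2 has no recorded age, so the cell is simply left empty.
csv_lines <- c(
  "id,age",
  "1,34",
  "2,",
  "3,71"
)

patients <- read.csv(text = csv_lines)

# The empty cell is read as NA, not as a number or a text.
print(is.na(patients$age))

# The mean of the two known ages, with the missing one skipped.
print(mean(patients$age, na.rm = TRUE))
```

No guessing, no documentation to hunt for, and no cleanup script.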