# The joy of being a student assistant

Recently I attended a lecture by Daniel Lindner about error codes and why you should avoid using them. I had to smile because it reminded me of my time as a student assistant, when I worked with some people who had a slightly different opinion on that point. Maybe they enjoyed torturing student assistants, but it seems most likely to me that they just did not know any better. But let's start at the beginning.

One day the leader of my research group sent me an Excel sheet with patient data and asked me to perform some statistical calculations in the programming language R, which is perfectly suited to such a task. So I did not expect it to take much time, but it soon turned out that I was terribly wrong. Transforming that Excel sheet into something R could work with gave me a really hard time, so I decided to write down some basic rules to consider when recording such data, in the hope that at least some future student assistants won't have to deal with the same problems I had.

First I told my program to read in the Excel sheet with the patient data, which worked as expected. But when I started to perform some simple operations like calculating the average age of the patients, my computer soon told me things like this:

`Warning message: In mean.default(age): argument is not numeric or logical: returning NA`

I was a little confused, because I knew that the age of something or someone is a numeric value. But no matter how often I tried to explain that to my computer, he was absolutely sure that I was wrong. So I had no choice but to look through the Excel sheet with its roughly 2000 rows and 30 columns. After hours of searching (at least it felt like hours) I found a cell with the following content: Died last week. That is indeed not a numeric value; it's a comment that was written for humans to read. So here's my rule number one:

1. Don't use comments in data files

There is one simple reason for that: the computer, who has to work with that data, does not understand it. And (as sad as it is) he does not care about the death of a patient. The only thing he wants to do is calculate a mean value, and for that he needs numeric values. If you still want to keep the comment, save it somewhere else: in a separate column for comments, or in a separate file the computer does not have to deal with when performing statistical calculations.
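Instead of scanning the sheet by eye, such cells can also be located and moved programmatically. The following is only a minimal sketch in R, not the code I used back then: the data frame `patients`, its columns and its values are made up for illustration, and it assumes the age column was read in as text because of a stray comment.

```r
# Hypothetical data: the age column is text because one cell holds a comment.
patients <- data.frame(
  id  = 1:4,
  age = c("34", "71", "Died last week", "58"),
  stringsAsFactors = FALSE
)

# as.numeric() turns everything that is not a number into NA. It also emits
# a warning, suppressed here because those NAs are exactly what we look for.
age_numeric <- suppressWarnings(as.numeric(patients$age))
not_numeric <- is.na(age_numeric) & !is.na(patients$age)

# Row numbers of the offending cells.
bad_rows <- which(not_numeric)

# Move the free text into its own comment column; keep only numbers in age.
patients$comment <- ifelse(not_numeric, patients$age, NA_character_)
patients$age <- age_numeric

# The mean now works; rows without a numeric age are skipped explicitly.
mean_age <- mean(patients$age, na.rm = TRUE)
```

Here `bad_rows` points at the row that needs a human decision, and the comment survives in its own column instead of being deleted.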

So I removed that comment (and some others I found) and tried to calculate the mean value again. This time my computer did not complain, the warning message disappeared, and for one moment I felt relieved. In the next moment I saw the result of the calculation. The average age of the patients was 459.76. Again I told my computer that this was not possible, again he was sure that I was wrong, and again I had to look through the Excel sheet with the data. Did I mention that the file contained about 2000 rows and 30 columns? Anyway, after a little searching I found a cell with the value -999999. It was immediately clear to me that this was not the real age of the patient, but I wasn't able to find out by myself what that value meant. It could have been a typo, but the leader of my research group told me that some people use -999999 as an error code. It could mean something like "I don't know the age of the patient." Or: "That patient also died." But that was only a guess. So here is my rule number two: