In a recent project, my job was to build a scientific data processing pipeline for a new algorithm that wasn’t set in stone yet. Part of my work would be to explore different mathematical formulas interactively with the customer.
My usual approach to projects is a “risk first” strategy. I try to identify the riskiest or most demanding part of the project and deal with it first. This approach essentially resembles the “fail fast” mindset, just that we haven’t failed yet.
In the case of the calculation pipeline, the riskiest part and at the same time the functionality that matters to the customer most, was the pipeline itself. If we were able to implement a system that can transform the given entry data into the desired results, we had an end-to-end prototype and the means to explore different mathematical approaches.
The pipeline consists of different steps that can be described as a complex transformation each. The first step/transformation takes a proprietary data format file and converts it into a big JSON file. The main effort of this step is a deep physical analysis of the data contained in the proprietary format. This analysis requires a lot of thought, exploration and work, but can be seen as a black box that the data traverses on its way from proprietary format to JSON.
The next step takes the JSON input and extracts the necessary information required by the following step. It is essentially a data reduction operation.
The third step feeds the analyzed, reduced data into the formulas and stores the calculation result.
The fourth step aggregates the calculation results into a daily time series report in a format that can be read by a spreadsheet application. This report is the end product of the pipeline and will be used to make decisions and to rule out certain environmental hazards.
The main difference of this project to virtually every project before is that I didn’t write any user interface code. The application’s main window is still blank. The whole interaction of the system with other systems that provide the entry data, of the pipeline steps among each other and with the human user is based on files in the file system.
The system periodically checks for the existence of new entry data. If some is found, it is copied in the “inbox” directory of the first step. The first step periodically checks for the existence of files in its inbox and processes them into its “outbox” that conveniently serves as the inbox of the second step. You probably get the idea by now. All the steps in the system, including the upstream data fetching routine, are actors in an file-based actor model. The files serve as messages from one actor to another. The file system and its directory structure is the common communication channel that passes the messages around.
One advantage of this approach is that the file system viewer application of the operating system can be used as the (graphical) user interface. By opening the appropriate directories and viewing their content, the user can supervise the operating state of the system. The system can report problems by moving the incoming message not in the step’s “done” directory , but into its “failed” or “problem” directory. If several directories are on display at once, the user can follow a specific piece of data through the pipeline and view the intermediate results. For domain specific reasons, the actors in this project also have the result directory “omitted” for data that will not be processed any further because some domain rules have determined a cancellation.
An user can even manipulate the data’s flow by moving files away or into a specific directory. Let’s say that we want to calculate a certain amount of data again, we can just copy the files from the “done” directory of the first step into its “inbox” and the system will process it again.
Because the analysis step takes some time while the calculation step is surprisingly fast, we can perform just the calculation again by not moving the initial data files, but the analyzed and reduced entry files for the calculation step. Using this approach, we can try different mathematical formulas by stopping the system, swapping the calculation step with a new version, starting the system again and moving the desired entry files into its inbox.
Using the file system as an interaction device for the user and the system’s parts has many immediate advantages, but some drawbacks, too. One drawback is performance. Using the harddisk for data transfer is the slowest possible way to bring data from step X to step X+1. If your system is required to have high throughput or low latency, this approach isn’t suitable. My project has a low, forecastable throughput and a latency requirement that is measured in minutes or seconds, but not in milliseconds or even nanoseconds. It can spend some time in the filesystem, because the first step alone takes several seconds for each file.
Another drawback is a certain fragility of the communication medium, the file system. You have to account for concurrent reads, writes or even deletes. The target platform of my system (Microsoft Windows) exhibits signs of exhaustion if the amount of files in one directory grows too large. This means that your file selection, already a costly operation, becomes more costly if the systems is put under pressure. If your throughput is usually steady, which is the case in my project, this won’t be a problem. Until you manually copy 100k files in an inbox for swift recalculation and discover that the file copy process alone takes several minutes.
Of course, the system cannot operate without a graphical user interface forever. But some basic interactions with the system will probably just result in some files being copied from one directory to another one in the background.