Today, almost everything produces data periodically: a car, cloud-connected smart-home solutions, your “smart” refrigerator, and you yourself, via a fitness tracker. This data has to be stored in a way that keeps it usable: we want fast access, beautiful charts and different aggregations over time.
There are several options, both free and commercial, for storing such time-series data, like InfluxDB, Prometheus or TimescaleDB. They are optimised for this kind of data and take advantage of several time-series properties:
- Storing is (mostly) append-only
- Data points have a (strict) ordering in time
- Data points often have fixed and variable metadata in addition to the data value itself
In one of our projects we have to store large quantities of data every 200 ms. Knowing PostgreSQL as a powerful relational database management system, we opted for TimescaleDB, which promises time-series features on top of an SQL database.
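A minimal sketch of such a setup could look like the following; the table, column names (`measurements`, `start_time`, `end_time`, `value`) and the chunk interval are made up for illustration and are not our actual schema:

```sql
-- Hypothetical measurement table; one row per 200 ms reading.
CREATE TABLE measurements (
    device_id  INTEGER          NOT NULL,
    start_time TIMESTAMPTZ      NOT NULL,  -- partition column of the hypertable
    end_time   TIMESTAMPTZ      NOT NULL,
    value      DOUBLE PRECISION NOT NULL
);

-- Turn the plain table into a TimescaleDB hypertable,
-- chunked on start_time (chunk interval chosen for illustration).
SELECT create_hypertable('measurements', 'start_time',
                         chunk_time_interval => INTERVAL '1 day');
```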
Ingestion of the data, probably the biggest problem for traditional (SQL) databases, worked as expected from the beginning. We were able to insert new data at a constant rate without degrading performance.
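Ingesting into a hypertable is plain SQL; with the hypothetical table sketched above, batching a few readings per statement might look like this:

```sql
-- Batched insert of several 200 ms readings in one statement (illustrative values).
INSERT INTO measurements (device_id, start_time, end_time, value) VALUES
    (1, '2023-05-01 12:00:00.000+00', '2023-05-01 12:00:00.200+00', 42.0),
    (1, '2023-05-01 12:00:00.200+00', '2023-05-01 12:00:00.400+00', 42.3),
    (1, '2023-05-01 12:00:00.400+00', '2023-05-01 12:00:00.600+00', 41.8);
```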
The problem
However, after some time we ran into severe performance problems when querying data…
So one of the key selling points of a time-series database strangely did not work for us… Fortunately, since Timescale is only an extension to PostgreSQL, we could simply run EXPLAIN/EXPLAIN ANALYZE on our slow queries to find out what was going on.
It showed us right away that *all* chunks of our hypertable were queried instead of only the ones containing the data we were looking for. Checking our setup and hypertable definition showed that Timescale itself was working as expected and chunking the data correctly on the time axis.
After a bit more analysis and thought it was clear: the problem was our query! We used something like `start_time <= to AND from <= end_time` in our WHERE clause to denote the time interval containing the requested data. Here `start_time` is the partition column of our hypertable.
This way we were forcing the database to look at every chunk from the very beginning of the data up to our end timestamp.
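Using the hypothetical table from above, the problematic pattern looked roughly like this (literal timestamps stand in for our interval bounds):

```sql
-- Problematic pattern: there is no lower bound on the partition column
-- start_time, so the planner cannot exclude chunks lying before the interval.
EXPLAIN ANALYZE
SELECT *
FROM   measurements
WHERE  start_time <= '2023-05-01 13:00:00+00'   -- "to"
AND    end_time   >= '2023-05-01 12:00:00+00';  -- "from"
-- The resulting plan showed scans over all chunks of the hypertable.
```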
The solution
Ok, so we have to reformulate our WHERE clause so that only the relevant chunks are queried. Timescale can easily do this when we write something like `start_time >= from AND start_time < to`, where `from` and `to` are the start and end of our desired interval. That way usually only one or a few chunks of the hypertable are searched and everything is lightning-fast, even with billions of rows in our hypertable.
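With the same hypothetical table, the rewritten query constrains the partition column from both sides, so chunk exclusion can kick in:

```sql
-- Both bounds reference the partition column start_time,
-- so only the chunks overlapping the interval are scanned.
EXPLAIN ANALYZE
SELECT *
FROM   measurements
WHERE  start_time >= '2023-05-01 12:00:00+00'   -- "from"
AND    start_time <  '2023-05-01 13:00:00+00';  -- "to"
```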
Of course the results of our two queries are not 100% semantically identical. But you can easily achieve the desired results by correctly calculating the start and end of the interval to fetch, as sketched below.
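One possible way to do that, assuming you know an upper bound on how long a single data point's interval can be (a hypothetical 200 ms here), is to widen the range on the partition column by that maximum duration and keep an explicit overlap condition. Chunk exclusion still works because both range bounds are on `start_time`:

```sql
-- Fetch rows overlapping [from, to): widen the start_time range by the maximum
-- row duration (assumed 200 ms here) so no overlapping row is missed,
-- then filter precisely on end_time.
SELECT *
FROM   measurements
WHERE  start_time >= TIMESTAMPTZ '2023-05-01 12:00:00+00' - INTERVAL '200 milliseconds'  -- "from" minus max duration
AND    start_time <  TIMESTAMPTZ '2023-05-01 13:00:00+00'                                -- "to"
AND    end_time   >  TIMESTAMPTZ '2023-05-01 12:00:00+00';                               -- keep only true overlaps
```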
Conclusion
Time-series databases are powerful tools for today's data requirements. As with other tools, you need some understanding of the underlying concepts, and sometimes even of the implementation, to not accidentally defeat the purpose and features of the tool.
Used correctly, they enable you to work with large amounts of data with the performance today's users expect. We have had similar experiences with full-text search engines like Solr and Elasticsearch, too.