orm: verbosity rears its ugly head
preface
This post was originally named quick and dirty time-series with postgresql arrays
and aimed to show how leveraging PostgreSQL ARRAYs may improve the performance of your time-series queries. As i began writing the post, it soon became apparent that the solution did not offer the anticipated gains. Rather, the benchmarking highlighted the overhead associated with ORM and the potential pitfalls of blindly building objects. As i believe the gist of the contents remains valid (and i’m lazy), i’ve decided to keep the post as is. Please see benchmarks if that’s all you care about.
start
As a contributor to Gnocchi, an open-source time-series database, one of the strengths i’ve come to appreciate is its ability to scale across different workloads through horizontal scaling or deployment strategies. With that said, if i were to offer one criticism, it is that, like many open-source projects, it is not the easiest to pick up and install.
With Gnocchi, there is no simple dnf install gnocchi and systemctl start gnocchi. Instead, you need to be aware of the architectural design of Gnocchi, the role of APIs, metricd workers, and the underlying storage services. Granted, this is fairly petty criticism; in reality, some blame falls on the user for being too lazy to read and learn about the service they’re adopting, but i needed an introduction and the preceding is the result.
To help out those who don’t need the scale Gnocchi provides, don’t want to learn a new technology, or want to keep it real simple, an alternative for storing time-series is to leverage PostgreSQL.
arrays
To begin, let’s first recognise that time-series are arrays; they are two contiguous arrays of timestamps and values. In Gnocchi, they are stored as a structured numpy array:
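Something along these lines, where the exact dtype and field names are an illustrative sketch rather than Gnocchi’s precise definition:

```python
import numpy

# A minimal sketch of a structured time-series array; the field names and
# dtype are illustrative, not necessarily Gnocchi's exact definition.
series = numpy.array(
    [
        (numpy.datetime64("2020-01-01T00:00:00"), 1.0),
        (numpy.datetime64("2020-01-01T00:01:00"), 2.5),
        (numpy.datetime64("2020-01-01T00:02:00"), 3.1),
    ],
    dtype=[("timestamps", "datetime64[ns]"), ("values", "<d")],
)

print(series["timestamps"])  # one contiguous array of timestamps
print(series["values"])      # one contiguous array of values
```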
They can also be captured as a single array of timestamp-value tuples, which is what numpy yields if you iterate through the series created above:
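For instance, looping over a structured array like the sketch above yields one record per datapoint:

```python
import numpy

# Same illustrative dtype as the sketch above.
series = numpy.array(
    [
        (numpy.datetime64("2020-01-01T00:00:00"), 1.0),
        (numpy.datetime64("2020-01-01T00:01:00"), 2.5),
    ],
    dtype=[("timestamps", "datetime64[ns]"), ("values", "<d")],
)

# Iterating the structured array yields one record per datapoint,
# i.e. a sequence of (timestamp, value) pairs.
for row in series:
    print(row["timestamps"], row["values"])
```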
Time-series are not dictionaries or hashmaps. While it is possible to store a series as {<time>: <value>, <time>: <value>, ...} or [{'time': <time>, 'value': <value>}, {'time': <time>, 'value': <value>}, ...], doing so adds unnecessary overhead and discards the sequential nature of time-series by scattering the data and requiring lookups for the simplest of tasks.
postgresql arrays
PostgreSQL needs no introduction, so to get right into it: one of the neat features it provides beyond the SQL standard is the ability to store complex data structures such as JSON or, for this post, arrays.
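As a quick, hedged illustration (the connection string is hypothetical and psycopg2 is just one possible client), arrays are first-class values that can be built and indexed directly in SQL:

```python
import psycopg2  # one possible client; any PostgreSQL driver works

conn = psycopg2.connect("dbname=tsdemo")  # hypothetical connection string
with conn, conn.cursor() as cur:
    # Array literals can be constructed and indexed (1-based) in plain SQL.
    cur.execute("SELECT (ARRAY[1.5, 2.5, 3.5])[2]")
    print(cur.fetchone()[0])  # 2.5
```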
How PostgreSQL handles and stores arrays is beyond my expertise, but information on the topic can be found in the documentation or in forums.
time-series in postgresql
As this post aims to provide a quick and easy solution for time-series, the simplest way to store the data is as you would store most data in SQL: one datapoint per row.
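A minimal sketch of such a table (the table and column names, and the connection string, are illustrative):

```python
import psycopg2  # one possible client; any PostgreSQL driver works

conn = psycopg2.connect("dbname=tsdemo")  # hypothetical connection string
with conn, conn.cursor() as cur:
    # Plain, portable SQL: one datapoint per row.
    cur.execute("""
        CREATE TABLE IF NOT EXISTS datapoint (
            series_id INTEGER NOT NULL,
            time      TIMESTAMP WITH TIME ZONE NOT NULL,
            value     DOUBLE PRECISION NOT NULL,
            PRIMARY KEY (series_id, time)
        )
    """)
```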
At this point, we are not leveraging ARRAYs, which is not ideal, but this is the simplest approach and it uses standard SQL, meaning it can be ported to other SQL vendors. Additionally, it keeps datapoints independent, which avoids having to lock, read, and update existing rows when writing data.
Alternatively, data can be stored as ARRAYs of times and values:
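For example (again, the table and column names and the connection string are illustrative):

```python
import psycopg2  # one possible client; any PostgreSQL driver works

conn = psycopg2.connect("dbname=tsdemo")  # hypothetical connection string
with conn, conn.cursor() as cur:
    # One row per series: parallel arrays of timestamps and values.
    cur.execute("""
        CREATE TABLE IF NOT EXISTS series_arrays (
            series_id INTEGER PRIMARY KEY,
            times     TIMESTAMPTZ[] NOT NULL,
            vals      DOUBLE PRECISION[] NOT NULL
        )
    """)
```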
or a 2D array of DOUBLE PRECISION values:
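A sketch of that variant, storing timestamps as epoch seconds so both dimensions share one element type (names once more illustrative):

```python
import psycopg2  # one possible client; any PostgreSQL driver works

conn = psycopg2.connect("dbname=tsdemo")  # hypothetical connection string
with conn, conn.cursor() as cur:
    # One row per series: a 2D array where each element is
    # [epoch_seconds, value], so both columns share one element type.
    cur.execute("""
        CREATE TABLE IF NOT EXISTS series_matrix (
            series_id INTEGER PRIMARY KEY,
            points    DOUBLE PRECISION[][] NOT NULL
        )
    """)
    # Writing a series: each inner pair is [epoch_seconds, value].
    cur.execute(
        "INSERT INTO series_matrix (series_id, points) VALUES (%s, %s)",
        (1, [[1577836800.0, 1.0], [1577836860.0, 2.5]]),
    )
```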
Doing so will optimise your queries and enable better storage efficiency. It will also add extra complexity, as you need to consider your series length. In cases where you expect to store data beyond the maximum size of a PostgreSQL field, you’ll need to consider how to split the data appropriately, whether by year or by a more novel hash key. There are interesting posts on this topic, but at that point it might actually be best to use a time-series database that handles this complexity for you.
Keeping it simple, we’ll use PostgreSQL to return ARRAYs rather than storing them as ARRAYs. This will most likely result in slower queries, as it builds the data structures on demand. To do so, we leverage ARRAY_AGG, an aggregate function in PostgreSQL that computes a single result from a set of input rows:
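Against the one-row-per-datapoint table sketched earlier (same illustrative names and connection string), the query looks roughly like this:

```python
import psycopg2  # one possible client; any PostgreSQL driver works

conn = psycopg2.connect("dbname=tsdemo")  # hypothetical connection string
with conn, conn.cursor() as cur:
    # Collapse each series into a single row of two parallel arrays.
    cur.execute("""
        SELECT series_id,
               ARRAY_AGG(time ORDER BY time)  AS times,
               ARRAY_AGG(value ORDER BY time) AS vals
        FROM datapoint
        GROUP BY series_id
    """)
    for series_id, times, vals in cur.fetchall():
        print(series_id, len(times), len(vals))
```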
So that’s it. With a simple query, you can return data as a row per series, rather than many hundreds/thousands of rows of single values.
benchmarks
The question that remains is whether this is worth it, as it’s not difficult to have Python loop through the results and build the series itself. To verify this, i’ve timed three scenarios (sketched after the list below), each involving a basic time-series table joined to another table describing the series, using Django:
- Using standard SQL to retrieve individual rows and build series with Python
- Using ARRAY_AGG to make PostgreSQL build series
- Using ORM to retrieve individual rows and build series with Python
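The actual benchmark’s models and schema aren’t shown in this post, so the following is only a rough sketch of the three approaches with invented names (a hypothetical DataPoint model and the datapoint table from earlier), assuming a configured Django project:

```python
# Rough, hypothetical sketches of the three timed approaches; model,
# app and table names are invented for illustration.
from collections import defaultdict

from django.db import connection

from myapp.models import DataPoint  # hypothetical model: series, time, value


def raw_sql_series():
    # 1. Raw SQL: fetch plain tuples and build the series in Python.
    series = defaultdict(list)
    with connection.cursor() as cur:
        cur.execute("SELECT series_id, time, value FROM datapoint ORDER BY time")
        for series_id, time, value in cur.fetchall():
            series[series_id].append((time, value))
    return series


def array_agg_series():
    # 2. ARRAY_AGG: let PostgreSQL build one row of arrays per series.
    with connection.cursor() as cur:
        cur.execute("""
            SELECT series_id,
                   ARRAY_AGG(time ORDER BY time),
                   ARRAY_AGG(value ORDER BY time)
            FROM datapoint
            GROUP BY series_id
        """)
        return {sid: list(zip(times, vals)) for sid, times, vals in cur.fetchall()}


def orm_series():
    # 3. ORM: every row becomes a model instance before being discarded.
    series = defaultdict(list)
    for point in DataPoint.objects.select_related("series").all():
        series[point.series_id].append((point.time, point.value))
    return series
```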
and the results… when testing against ~200 datapoints/rows:
Using arrays in PostgreSQL was not faster than letting Python generate the series. ORM was the slowest, but only by tens of milliseconds, which is insignificant.
When testing against ~40K rows:
Again, raw SQL performed similarly whether ARRAY_AGG was used or not. ORM was significantly slower than the other options.
Lastly, when testing against ~800K rows:
In a ‘bigger’ dataset, ARRAY_AGG performed slightly better but not significantly. ORM sucked.
conclusion
This wasn’t the conclusion i hoped for, but there is no quick and easy way to use ARRAYs in PostgreSQL. Using ARRAY_AGG does not really offer much unless your queries touch millions of rows. In that case, it might just be better to use a real TSDB which does all the formatting/computing pre-storage, or to store the data as an ARRAY in PostgreSQL.
What can be concluded is that my disdain for ORM is somewhat justified: if the model you build is an intermediary step to your end result, the overhead of building objects that will be thrown away can have a significant impact on the performance of your system.
I’ll need to do some proper profiling of ORM and its effects but that’s for another post.