In addition to storing time-based data efficiently, one of the main functions a time-series database provides is the ability to query over temporal data and compute statistics. In both cases, this involves summarising vast amounts of data either for storage or to provide a new view.
In Gnocchi – an open-source, time-series database that leverages cloud-based storage – the ability to group data and provide new statistical views on the grouped data is a critical path, alongside the ability to read and write data.
At its conception, Gnocchi leveraged Pandas to supply the logic behind grouping and aggregating data. Pandas is a fantastic Python toolkit that provides various data structures and functions to manipulate datasets and is ubiquitous among data scientists. With that said, over time we realised that Pandas was overkill in the context of Gnocchi’s aggregation workflow.
In moving away from Pandas, we identified two alternatives: SciPy and NumPy.
At its core, NumPy provides the ability to construct multidimensional arrays. On top of that, it supports fast, vectorised array operations, something used heavily in machine learning tools such as TensorFlow but also in Gnocchi. (yes, i’m very good at adding in hypewords that are completely unrelated… cough blockchain, autonomous vehicles, cloud cough)
SciPy, alternatively, extends NumPy with a collection of helper functions and tools often used in signal and image processing, as well as the ability to compute statistics over an array.
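For a taste of what each provides, here's a trivial, standalone example (not Gnocchi code):

```python
import numpy as np
from scipy import stats

data = np.random.random(1000)            # 1000 random samples
doubled = data * 2                       # vectorised: no explicit Python loop
p95 = stats.scoreatpercentile(data, 95)  # a SciPy statistic over the array
print(doubled[:3], p95)
```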
The following will highlight the gains Gnocchi achieved by moving towards a more tailored solution rather than using an all-purpose toolkit.
performance
To test the performance difference in handling time-series statistics across NumPy, SciPy, and Pandas, the following setup was used:
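The exact benchmark lives in the Gnocchi repository; a minimal sketch of the dataset it operates on looks roughly like this: 5760 points at 15-second intervals, starting at noon so that grouping by day later yields two calendar days.

```python
import numpy as np
import pandas as pd

# 24 hours of 15-second samples = 5760 time-value pairs.
timestamps = np.arange(np.datetime64('2018-01-01T12:00:00'),
                       np.datetime64('2018-01-02T12:00:00'),
                       np.timedelta64(15, 's'))
values = np.random.random(timestamps.size)
```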
As a note, carbonara handles the time-series structure and logic in Gnocchi. I’m using pandas==0.22.0, scipy==1.0.0, and numpy==1.13.3, with Gnocchi4.1 to test statistics with SciPy and master (2018.01.10) to test NumPy. The implementation code can be found on GitHub for exact details.
initialisation
We’ll start by timing how long it takes to initialise the required data structure:
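A sketch of the timing, using the timestamps and values arrays from above; the structured (timestamp, value) dtype is in the spirit of carbonara's internal format, though the exact benchmark differs:

```python
import timeit

def init_pandas():
    return pd.Series(values, index=timestamps)

def init_numpy():
    # One structured array holding (timestamp, value) pairs.
    series = np.empty(values.size, dtype=[('timestamps', '<datetime64[ns]'),
                                          ('values', '<d')])
    series['timestamps'] = timestamps
    series['values'] = values
    return series

print(timeit.timeit(init_pandas, number=1000))
print(timeit.timeit(init_numpy, number=1000))
```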
Right off the bat, representing a basic time-series in NumPy is almost 5x more performant.
grouping
Next, to test grouping performance, we’ll group the 5760 time-value pairs three ways: by minute, which creates 1440 groups of 4 points each; by hour, which creates 24 groups of 240 points; and by day, which creates 2 groups of 2880 points.
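For the by-minute case, the two approaches look roughly like this (a sketch; the real carbonara code differs):

```python
# Pandas: group the Series by the minute each timestamp falls into.
pandas_series = pd.Series(values, index=timestamps)
by_minute_pd = pandas_series.groupby(pandas_series.index.floor('1min'))

# NumPy: truncate the (already sorted) timestamps to the minute, then find
# where each group starts; these start indexes drive every aggregation later.
minutes = timestamps.astype('datetime64[m]')
keys, indexes = np.unique(minutes, return_index=True)  # 1440 groups
```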
Similarly, for the other granularities, only the truncation unit changes:
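In the NumPy sketch:

```python
# Hour: 24 groups of 240 points.
hour_keys, hour_idx = np.unique(timestamps.astype('datetime64[h]'),
                                return_index=True)
# Day: 2 groups of 2880 points (the series starts at noon).
day_keys, day_idx = np.unique(timestamps.astype('datetime64[D]'),
                              return_index=True)
```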
In this scenario, both Pandas and NumPy perform consistently regardless of grouping granularity, and NumPy is roughly 100x faster across the board. It should be noted that Pandas is probably doing a lot more than NumPy to provide additional functionality; keeping that in mind, the above benchmark should not be considered a 1:1 comparison.
aggregation
Now let’s compare how each solution performs when computing statistical values for the groups.
mean
In all three solutions, the performance is relatively stable, with the pure NumPy solution performing ~2.4x faster.
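To illustrate where that speed comes from: given the group-start indexes from the grouping sketch, the per-group mean collapses into a single vectorised reduceat call (again, a sketch rather than the exact carbonara code):

```python
sums = np.add.reduceat(values, indexes)            # per-group sums
counts = np.diff(np.append(indexes, values.size))  # per-group sizes
means = sums / counts
```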
last
NumPy shines when computing the last value of each group. Possibly because of NumPy’s indexing functionality, it returns over 20x faster.
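That indexing functionality makes last nearly free: the final element of each group sits just before the next group begins (sketch, reusing the same indexes):

```python
ends = np.append(indexes[1:], values.size) - 1  # last position in each group
lasts = values[ends]                            # one fancy-indexing pass
```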
percentile
When computing the percentile of very few groups, SciPy performs the best. NumPy, on the other hand, performs consistently regardless of the number of groups and can perform over 400x and 70x better than Pandas and SciPy respectively.
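A simple per-group sketch of the three approaches (Gnocchi's real implementation is more heavily vectorised; see the repository for details):

```python
from scipy import stats

groups = np.split(values, indexes[1:])  # split at each group boundary
p90_scipy = [stats.scoreatpercentile(g, 90) for g in groups]
p90_numpy = [np.percentile(g, 90) for g in groups]
p90_pandas = pandas_series.groupby(
    pandas_series.index.floor('1min')).quantile(0.90)
```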
max
Max aggregate computation results were surprising, as Pandas appears to do some black magic. Similarly, for min aggregation, Pandas outperformed NumPy and SciPy by a clean margin. Whatever Pandas is doing, we need to port it to Gnocchi.
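For reference, the NumPy side of min/max is another one-liner over the group-start indexes (sketch):

```python
maxs = np.maximum.reduceat(values, indexes)  # per-group maxima in one call
mins = np.minimum.reduceat(values, indexes)  # likewise for minima
```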
count
I’ll start off by saying this comparison is misleading af. In our carbonara group structure, count is computed on initialisation, so the NumPy solution is doing zero computation. With that said, it’s up to 55x faster. \o/
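For completeness, the counts fall straight out of the boundary indexes computed at initialisation, which is why nothing happens at aggregation time (sketch):

```python
counts = np.diff(np.append(indexes, values.size))  # size of each group
```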
multi-series aggregation
In this scenario, we’ll test how well each solution can aggregate across multiple independent time-series. We’ll use Gnocchi4.0 code to validate the Pandas solution (which may not be the best Pandas implementation) and master (2018.01.10) for NumPy (which also may not be the best NumPy implementation).
Without going too much into the implementation details of Gnocchi, the Pandas solution takes multiple AggregatedTimeSerie objects (a Pandas series wrapped with supporting functions) and builds a Pandas DataFrame to do aggregation across the time-series collection. For the NumPy solution, we create similar AggregatedTimeSerie objects (a NumPy series wrapped with supporting functions) but, rather than build a DataFrame, we build an A×B matrix where A is the number of series and B is the number of unique timestamps across all the series. The datasets are created as follows:
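As a simplified stand-in for that setup, we can reuse the structured-array series from the initialisation sketch (the real AggregatedTimeSerie carries more state than this):

```python
# Three identical series sharing the same 5760 timestamps.
series_list = [init_numpy() for _ in range(3)]
```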
Using this dataset, we’ll aggregate across the 3 identical series, filling in missing values with 0 (although there are no missing values in this case):
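A sketch of the matrix approach with a fill value of 0, written as if the timestamps could differ between series (here they don't):

```python
# One row per series, one column per unique timestamp across all series.
unique_ts = np.unique(np.concatenate([s['timestamps'] for s in series_list]))
matrix = np.zeros((len(series_list), unique_ts.size))  # fill value: 0
for i, s in enumerate(series_list):
    matrix[i, np.searchsorted(unique_ts, s['timestamps'])] = s['values']
# Aggregating across series is now a vectorised column-wise reduction.
aggregated = matrix.mean(axis=0)
```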
This is repeated with the same 3 series, except the holes are not filled. Note that the Pandas path will be much slower here, as significant parts of this logic live not in Pandas but in plain Python:
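Without filling, one way to sketch it is to mark the holes as NaN and lean on NumPy's NaN-aware reductions (again, not necessarily what Gnocchi does internally):

```python
matrix = np.full((len(series_list), unique_ts.size), np.nan)
for i, s in enumerate(series_list):
    matrix[i, np.searchsorted(unique_ts, s['timestamps'])] = s['values']
aggregated = np.nanmean(matrix, axis=0)  # holes are simply ignored
```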
Reviewing the performance: by building the series in NumPy and keeping the vast majority of logic and operations in NumPy, we are able to achieve up to 1000x performance gains in the above use case. This does not even factor in the flexibility to perform vectorised mathematical operations across the series.
There are more potential aggregates, but they are consistent with the above results. To be honest, I was actually expecting a greater performance gain compared to Pandas, so I have to give props to the Pandas team for the continual improvements they’ve made! (or my benchmarks are wrong… or the code sucks)
memory
Personally, another reason for swapping out Pandas was to lower the memory requirements of each service, so Gnocchi could run anywhere easily and reserve its memory for real work rather than for simply running Gnocchi. Pandas requires a non-trivial (for Python) amount of memory when loaded:
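A rough, Linux-only way to see the import cost for yourself (a sketch, not the exact method behind the numbers below):

```python
import resource

before = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
import pandas  # noqa: E402
after = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
print('pandas import: ~%d KB of additional peak RSS' % (after - before))
```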
Depending on the packages loaded, Pandas alone can account for more than 55MB. By swapping out Pandas for SciPy, and then SciPy for NumPy, memory usage for Gnocchi’s processing workers drops more than 40%.
Figure: Gnocchi3 memory usage
Figure: (almost) Gnocchi4.1 memory usage
Figure: (potential) Gnocchi4.2 memory usage
end thought
By switching from Pandas to NumPy in Gnocchi, we were able to increase metric processing throughput and decrease memory usage. With all that said, Gnocchi does still require significant CPU, but that’s maths for you.
revisions
2018-01-12: fixed some english; added more details to multi-series aggregate