Time series DB?

sle · Jan 15, 2018

Upon some analysis (with the help of a few former co-workers), we are moving towards the suggestion by temnik - kdb+ in the long run. In the ultra-short run, I have switched to storing everything in duplicate as a pickle to improve read speeds from Python.
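
For readers who want to try the pickle duplicate idea, here is a minimal sketch. It assumes the data fits in memory as a list of (timestamp, price) tuples; the function names and file layout are hypothetical and not necessarily what is described above.

```python
import pickle
from pathlib import Path


def save_snapshot(path, rows):
    """Write a duplicate of the data as a pickle for fast reads from Python."""
    # HIGHEST_PROTOCOL selects the newest binary pickle format available.
    with Path(path).open("wb") as f:
        pickle.dump(rows, f, protocol=pickle.HIGHEST_PROTOCOL)


def load_snapshot(path):
    """Read the whole snapshot back in a single call."""
    with Path(path).open("rb") as f:
        return pickle.load(f)
```

Only unpickle files you wrote yourself, since loading a pickle can execute arbitrary code.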

It truly would be nice if there was an open source product that was as good as kdb+ and it's possible there is, but I just don't have the time to experiment.

temnik · Jan 16, 2018

@sle - you're going to love the speed and expressiveness of kdb. It feels like Lisp yet looks like Assembly... Does your fund have kdb consultants give periodic classes?

@T0pH4t - so far my biggest complaint about influxdb is RAM requirements and poor implementation of as-of joins. At what version did you give it up? Those other databases are not time oriented - so they have no concept of previous/next or even group-by-time. How do you deal with that?
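
To make the as-of join concept concrete: for each row on the left (say, trades), pick the most recent row on the right (say, quotes) at or before that time. A minimal pure-Python sketch, assuming both timestamp lists are sorted; the function and variable names are hypothetical.

```python
from bisect import bisect_right


def asof_join(left_times, right_times, right_values):
    """For each left timestamp, return the latest right value at or before it.

    Both timestamp lists must be sorted ascending. None means no right row
    exists yet at that time.
    """
    out = []
    for t in left_times:
        i = bisect_right(right_times, t)  # number of right rows with time <= t
        out.append(right_values[i - 1] if i > 0 else None)
    return out


trade_times = [5, 10, 17]
quote_times = [1, 8, 10, 20]
quote_bids = [99.0, 99.5, 100.0, 100.5]
print(asof_join(trade_times, quote_times, quote_bids))  # [99.0, 100.0, 100.0]
```

A store with no notion of previous/next leaves this lookup to the application, which is the gap the question above is pointing at.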

T0pH4t · Jan 16, 2018

So kerf is a time series database, so I will assume you are talking about the others. At the end of the day, most if not all databases are built on a handful of core data structures. They are either row based (Oracle, Microsoft, MySQL...) or column based (KDB+...); stores like Cassandra and MongoDB are often lumped in with the column stores, though they are really wide-column and document stores. They also generally use an on-disk structure based on a B-tree variant or an LSM tree (BigTable, HBase, LevelDB, RocksDB...). Time series DBs then add optimizations (like delta-of-delta compression) on top of these structures, taking advantage of the fact that timestamps are an ordered integer series, often at regular intervals, with a known start and end. The query language can then be structured around the properties of time series data. The databases I suggested are just simple key-value stores, meaning they give you a base layer on which you can start building a time series database (which is what I did). They will not give you an out-of-the-box experience like InfluxDB. Most databases could become time series oriented with certain techniques, and some will be better than others. Your data access patterns should drive the choice of underlying structure.
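
As a rough illustration of building on a key-value base layer, here is a minimal sketch. It assumes an ordered store with range scans, which is what LSM-based stores typically offer; `OrderedKV` is a hypothetical in-memory stand-in, and the key layout (series name, a zero byte, then a big-endian timestamp) is one possible choice, not the scheme actually used above.

```python
import struct
from bisect import bisect_left, insort


class OrderedKV:
    """In-memory stand-in for an ordered key-value store (hypothetical)."""

    def __init__(self):
        self._keys = []  # sorted list of byte keys
        self._data = {}  # key -> value

    def put(self, key, value):
        if key not in self._data:
            insort(self._keys, key)
        self._data[key] = value

    def scan(self, start, end):
        """Yield (key, value) for start <= key < end, in key order."""
        i = bisect_left(self._keys, start)
        while i < len(self._keys) and self._keys[i] < end:
            k = self._keys[i]
            yield k, self._data[k]
            i += 1


def make_key(series, ts):
    # Big-endian unsigned 64-bit timestamp: byte order matches numeric order.
    # Assumes series names never contain a zero byte.
    return series.encode() + b"\x00" + struct.pack(">Q", ts)


def write_point(kv, series, ts, value):
    kv.put(make_key(series, ts), struct.pack(">d", value))


def read_range(kv, series, t0, t1):
    """Return [(ts, value)] for t0 <= ts < t1, in time order."""
    out = []
    for k, v in kv.scan(make_key(series, t0), make_key(series, t1)):
        ts = struct.unpack(">Q", k[-8:])[0]
        out.append((ts, struct.unpack(">d", v)[0]))
    return out
```

Because the keys sort by series and then by time, a time-range query becomes a single contiguous scan.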

Facebook put out an interesting white paper on their time series database used for metrics, called Gorilla. Beringei is their open source (in-memory) time series database based on the paper.

https://github.com/facebookincubator/beringei
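
The Gorilla paper compresses timestamps with the delta-of-delta idea mentioned above. The sketch below shows only the arithmetic, assuming sorted integer timestamps; the variable-length bit packing that gives the real space savings is left out, and the function names are hypothetical.

```python
def delta_of_delta_encode(timestamps):
    """Encode timestamps as: first value, first delta, then changes in delta."""
    if not timestamps:
        return []
    out = [timestamps[0]]
    prev_delta = 0  # so the second output element is the first delta itself
    for prev, cur in zip(timestamps, timestamps[1:]):
        delta = cur - prev
        out.append(delta - prev_delta)
        prev_delta = delta
    return out


def delta_of_delta_decode(encoded):
    """Invert delta_of_delta_encode."""
    if not encoded:
        return []
    out = [encoded[0]]
    delta = 0
    for dod in encoded[1:]:
        delta += dod
        out.append(out[-1] + delta)
    return out


stamps = [1000, 1060, 1120, 1180, 1241]
print(delta_of_delta_encode(stamps))  # [1000, 60, 0, 0, 1]
```

With regularly spaced samples almost every encoded value is zero or close to it, which is what makes the later bit-level encoding so compact.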

T0pH4t · Jan 16, 2018

I should mention that kdb+ and kerf are both based on APL to an extent, which heavily leverages CPU vector instructions (and in some cases GPUs). For time series/numerical data this can be a huge advantage. It's why other databases can have such a hard time beating them in the financial area.

Simples · Jan 16, 2018

Going this route (though overkill for my personal project), it is worth considering the write and read access patterns separately, and possibly using a separate API/technology for each.

Experience and history suggest that these access patterns are very distinct, so it may be worth abstracting them from each other and optimizing each on its own. Besides the possible optimizations, this also gives more flexibility and freedom to change the underlying platform.

Traditional solutions tend to tie the write and read access patterns together in the same technology/API. That gives a compromise that serves neither particularly well, but may be simpler to get initial development started with.

It may be easier to start with such a pattern if one sees clear benefits in these principles.
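
A minimal sketch of what separating the two concerns could look like, assuming an append-only write side and a sorted, per-symbol read side. The class names and the batch hand-over via `drain()` are hypothetical, chosen only to illustrate the split.

```python
from collections import defaultdict


class WriteLog:
    """Write side: append-only, optimized for cheap inserts."""

    def __init__(self):
        self._rows = []

    def append(self, symbol, ts, price):
        self._rows.append((symbol, ts, price))

    def drain(self):
        """Hand over all pending rows and clear the log."""
        rows, self._rows = self._rows, []
        return rows


class ReadStore:
    """Read side: per-symbol points kept sorted by time for fast scans."""

    def __init__(self):
        self._series = defaultdict(list)  # symbol -> [(ts, price), ...]

    def load(self, rows):
        touched = set()
        for symbol, ts, price in rows:
            self._series[symbol].append((ts, price))
            touched.add(symbol)
        for symbol in touched:
            self._series[symbol].sort()  # restore time order after each batch

    def prices(self, symbol):
        return [price for _, price in self._series.get(symbol, [])]


log = WriteLog()
log.append("X", 2, 10.0)
log.append("X", 1, 9.0)
store = ReadStore()
store.load(log.drain())
print(store.prices("X"))  # [9.0, 10.0]
```

Either side can then be swapped for a different technology without touching the other, as long as the hand-over format stays the same.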

temnik · Jan 16, 2018

@T0pH4t - that's a bit too heavy for my back... I can barely keep up with the research and trading, without having to worry about being a database engine programmer.

T0pH4t · Jan 16, 2018

That is fair, it's definitely not something most people should do. I just happen to do this type of work for my day job.

sle · Jan 17, 2018

temnik said:
@sle - you're going to love speed and expressiveness of kdb. It feels like Lisp yet looks like Assembly...

Yeah, I am not sure that's an endorsement in my book

djames · Jan 17, 2018

Too lazy to see if this has been mentioned already, but
https://github.com/manahl/arctic is excellent

trendmomentum · Jan 18, 2018

@sle kdb+ is an excellent choice for a time-series db implementation.

However, the real advantage is in the q (and k) language framework itself. To truly get the best out of it, you will need to master the language and design your system so that most of the heavy (pre/post)processing of data is done within a set of dedicated q servers. Only this way will you be able to fully utilise the memory and speed optimisation capabilities of the kdb+ framework.

Any other front-end clients should just use the data results for display only, for example.
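
The split between heavy server-side processing and display-only clients can be sketched as follows. This is plain Python standing in for a q server, not q code, and both function names are hypothetical: the server reduces raw ticks to one OHLC bar per time bucket, and the client only formats the small result.

```python
def server_ohlc(ticks, bucket_seconds):
    """Server side: reduce raw (ts, price) ticks to one OHLC bar per bucket."""
    bars = {}
    # Stable sort by timestamp keeps arrival order for ticks with equal times.
    for ts, price in sorted(ticks, key=lambda t: t[0]):
        start = ts - ts % bucket_seconds  # start of the bucket this tick is in
        if start not in bars:
            bars[start] = {"open": price, "high": price, "low": price, "close": price}
        else:
            bar = bars[start]
            bar["high"] = max(bar["high"], price)
            bar["low"] = min(bar["low"], price)
            bar["close"] = price
    return bars


def client_render(bars):
    """Client side: display only, no heavy computation."""
    return [
        f"{start}: O={b['open']} H={b['high']} L={b['low']} C={b['close']}"
        for start, b in sorted(bars.items())
    ]
```

The client receives one row per bucket rather than every tick, so the data crossing the wire stays small however large the raw history is.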