Time series DB?

2rosy · Dec 27, 2017

sle said:
I store both T/Q ticks (since I don’t play UHF games, no book updates thank God) and resample to 1 second and 1 min. Why would Bcolz be better? Is it faster?

The issues with HDF5 were size, threading, and corrupted files. The team moved from HDF5 to kdb. As someone mentioned, it's columnar database vs. row: find a free or cheap column DB and just load into it. Forget all these individual files.
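To make the row-vs-column distinction concrete, here is a minimal sketch using only the Python standard library. The tick values and the `vwap` helper are made up for illustration and do not reflect how kdb or any particular column store lays out data internally; the point is only that a column layout lets a scan touch the fields it needs and nothing else.

```python
from array import array

# Hypothetical trade ticks: (epoch seconds, price, size).
rows = [
    (1514383200, 100.0, 5),
    (1514383201, 100.5, 2),
    (1514383202, 99.5, 7),
]

# Row layout: each record is stored together, so reading one field
# still walks every whole record.
row_total = sum(size for _ts, _price, size in rows)

# Column layout: each field is one contiguous typed array, so a scan
# over sizes never touches timestamps or prices.
columns = {
    "ts": array("q", (r[0] for r in rows)),
    "price": array("d", (r[1] for r in rows)),
    "size": array("q", (r[2] for r in rows)),
}
col_total = sum(columns["size"])


def vwap(cols):
    """Volume-weighted average price computed from two columns only."""
    notional = sum(p * s for p, s in zip(cols["price"], cols["size"]))
    return notional / sum(cols["size"])
```

Both layouts give the same total size; the difference is how much data has to be read to get it, which is what matters once the tick history no longer fits in cache or memory.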

sle · Dec 27, 2017

2rosy said:
The issues with HDF5 were size, threading, and corrupted files. The team moved from HDF5 to kdb. As someone mentioned, it's columnar database vs. row: find a free or cheap column DB and just load into it. Forget all these individual files.

Kdb+ or a similar SunGard product are relatively easy choices, since the firm uses them already. Cost-wise, I’d be piggybacking on firm-wide licenses and there are experts in house. Of course, the flip side is that I have to learn a brand new product and programming language (I have touched kdb but never used it seriously).

temnik · Dec 27, 2017

sle said:
Kdb+ or a similar SunGard product are relatively easy choices, since the firm uses them already. Cost-wise, I’d be piggybacking on firm-wide licenses and there are experts in house. Of course, the flip side is that I have to learn a brand new product and programming language (I have touched kdb but never used it seriously).

The firm uses kdb already? Get on board! Kdb's language, q, hurts a lot at first, like you are getting a lobotomy, but there's definitely a state of bliss afterwards.

Is the "firm" a bank or a prop?

sle · Dec 27, 2017

temnik said:
Kdb's language, q, hurts a lot at first, like you are getting a lobotomy, but there's definitely a state of bliss afterwards.

That was my main concern, to be honest.

temnik said:
Is the "firm" a bank or a prop?

The “firm” is a hedge fund.

CME Observer · Dec 28, 2017

CME Observer said:
Okay, cool. You're basically looking at an event sourcing problem, right? Then to me the question becomes not "what database goes really fast?" but more "how can I get really high throughput to transform my immutable log into a stream?"

So if you ask me (and don't because I'm a few months out of undergrad) I'd consider trying a little different approach. What if you:
1. went with a solution like Hadoop as persistence and query that by time window and contract. I imagine you'll want realtime read and perhaps write depending on your data source as well. Check out HBase.
2. once you've extracted the relevant range of ticks, feed it into a Kafka producer.
3. Implement whatever data processing you intend to do as a consumer and spin up a cluster as necessary as a means of achieving parallelism. Seems very applicable to backtesting and optimization.

Might be an interesting way to get really high throughput. Might also totally suck and get wrecked by MySQL InnoDB on a dual core laptop. Just how I'd try to do it first. The "check off as many buzzwords as you can" approach.

Would hate to threadjack @sle, so let's not get off on a huge tangent, but since so many knowledgeable members have commented, I wanted to ask if anyone would be willing to weigh in on the viability of the approach I suggested. Keep in mind this is only applicable to sequential replay of messages, and especially so for horizontally scaling to many market data handlers (Kafka consumers) and feeding them the same data concurrently. A parameter sweep, for example.

I realize it's not going to be ideal for querying like "SELECT * FROM trades WHERE size < 5." I believe that would be referred to as "random seek" in this context (please feel free to correct me).

The idea also of course hinges on having a continuous delivery pipeline so that you can deploy Kafka consumers which implement your market data handler.

How do you think this would perform vs using one really beefy host and a local timeseries database? Am I just stringing fancy technologies together hoping they amount to some gestalt?
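For anyone weighing the idea, the replay pattern itself can be sketched without any of the infrastructure. This is a minimal single-process stand-in, not Kafka: a tuple plays the role of the immutable log, and `handler`, `sweep`, and the `min_size` parameter are hypothetical names invented for the example. Each worker replays the same ticks from the start with its own parameter value, much as independent consumers would each read a topic from offset 0.

```python
from concurrent.futures import ThreadPoolExecutor

# Immutable, time-ordered log of (timestamp, price, size) ticks.
# In the proposal this range would be extracted from HBase and published
# to a Kafka topic; a tuple stands in for it here.
TICK_LOG = (
    (1, 100.0, 5),
    (2, 100.5, 2),
    (3, 99.5, 7),
    (4, 101.0, 1),
)


def handler(min_size):
    """Hypothetical market data handler: replays the whole log in order
    and sums the notional of trades at or above min_size."""
    notional = 0.0
    for _ts, price, size in TICK_LOG:  # sequential replay from the start
        if size >= min_size:
            notional += price * size
    return min_size, notional


def sweep(params, workers=4):
    """Run one independent replay per parameter value."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return dict(pool.map(handler, params))


results = sweep([1, 5, 8])
```

Threads are used only to keep the sketch short; in CPython they will not give CPU parallelism for work like this. Whether the distributed version beats one beefy host depends on whether the handlers or the data delivery are the bottleneck, which this sketch does not measure.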

Simples · Dec 28, 2017

CME Observer said:
Would hate to threadjack @sle, so let's not get off on a huge tangent, but since so many knowledgeable members have commented, I wanted to ask if anyone would be willing to weigh in on the viability of the approach I suggested. Keep in mind this is only applicable to sequential replay of messages, and especially so for horizontally scaling to many market data handlers (Kafka consumers) and feeding them the same data concurrently. A parameter sweep, for example.

I realize it's not going to be ideal for querying like "SELECT * FROM trades WHERE size < 5." I believe that would be referred to as "random seek" in this context (please feel free to correct me).

The idea also of course hinges on having a continuous delivery pipeline so that you can deploy Kafka consumers which implement your market data handler.

How do you think this would perform vs using one really beefy host and a local timeseries database? Am I just stringing fancy technologies together hoping they amount to some gestalt?

It sounds pretty straightforward and doable, however:
1) I personally would like to avoid Java if I can, so I would only consider Kafka for its intended usage: configurable high throughput and/or high parallelism, maxing out network bandwidth. If it's all on one box, Kafka is clearly overkill. The network won't be a bottleneck on a single machine.
2) HBase, while neat and probably very fast for its intended use, could also be overkill for this job.
3) It's one thing to try it out, which you can do just to learn from it. It's another thing to maintain it for years and adapt it as you keep developing your own solution. Both can require a lot of time, work, and money.

Some of it depends on who will do what and what qualities the solution should have.
Delaying choices and keeping options open, while not sexy-sounding, is often better than trying to figure it all out when you know the least: at the start.
Though if you know you're going to need such scalability, it starts making sense to test its capabilities early.

T0pH4t · Jan 14, 2018

My 2cents:
I was using InfluxDB for my market data a while back and switched off it for performance reasons. It has an awesome query language for what I needed, but its ingest/read times were not meeting my requirements. I then switched over to RocksDB and it's been solid (an entire order of magnitude improvement in performance). RocksDB is not the solution for everyone, though, since it's just a key/value store: you have to write the query code yourself.

I guess it really comes down to what your SLA is on read/write latency, as well as how much you don't mind building yourself.
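As a rough idea of what "writing the query code yourself" can look like on an ordered key/value store, here is a minimal sketch. It does not use RocksDB: `SortedKV` is an in-memory stand-in built on `bisect`, and the key layout (symbol, a zero byte, then a big-endian timestamp) and the `make_key` and `query` helpers are assumptions made for the example. The trick is that big-endian timestamps sort byte-wise in time order, so a time-range query becomes one seek plus a forward scan.

```python
import bisect
import struct


def make_key(symbol, ts):
    """Symbol prefix plus big-endian timestamp: byte order == time order."""
    return symbol.encode() + b"\x00" + struct.pack(">Q", ts)


class SortedKV:
    """In-memory stand-in for an ordered key/value store."""

    def __init__(self):
        self._keys = []
        self._vals = []

    def put(self, key, value):
        i = bisect.bisect_left(self._keys, key)
        if i < len(self._keys) and self._keys[i] == key:
            self._vals[i] = value  # overwrite an existing key
        else:
            self._keys.insert(i, key)
            self._vals.insert(i, value)

    def scan(self, start, end):
        """Yield (key, value) for start <= key < end, in key order."""
        i = bisect.bisect_left(self._keys, start)
        while i < len(self._keys) and self._keys[i] < end:
            yield self._keys[i], self._vals[i]
            i += 1


def query(db, symbol, t0, t1):
    """Hand-written range query: ticks for symbol with t0 <= ts < t1."""
    out = []
    for key, value in db.scan(make_key(symbol, t0), make_key(symbol, t1)):
        ts = struct.unpack(">Q", key[-8:])[0]
        price, size = struct.unpack(">dI", value)
        out.append((ts, price, size))
    return out


db = SortedKV()
for sym, ts, price, size in [("ES", 30, 2690.25, 3), ("ES", 10, 2690.0, 1),
                             ("NQ", 20, 6450.5, 2), ("ES", 20, 2690.5, 4)]:
    db.put(make_key(sym, ts), struct.pack(">dI", price, size))
```

Calling `query(db, "ES", 10, 30)` returns the two ES ticks at timestamps 10 and 20 in time order, regardless of insertion order. Anything beyond range scans (filters, aggregation, joins across symbols) is more code on top, which is the trade-off being described.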

T0pH4t · Jan 14, 2018

@sle Not sure if this thread is still relevant for you (that is, if you're still evaluating options), but there were two other databases on my radar (I haven't used either).

http://www.kerfsoftware.com/ (supposedly next gen timeseries db, proprietary).
https://upscaledb.com/index.html

i960 · Jan 14, 2018

Honestly, time series DBs aren't really all that special. Either they're columnar or row-based and offer a time series specific query language or something generic. At a more specific level they may be oriented towards financial data. All may or may not support replication, transactions, clustering, etc.

We won't really get anything that truly covers everyone's bases in a robust fashion until we get a full opensource project with multiple contributors. Until then it's going to be commercial entities trying to lock things down to their specific products.

Also, most benchmarks are rigged or not general purpose as the problem domain really isn't anything new - yet developers keep acting as if they've rediscovered the wheel.

Simples · Jan 15, 2018

i960 said:
Honestly, time series DBs aren't really all that special. Either they're columnar or row-based and offer a time series specific query language or something generic. At a more specific level they may be oriented towards financial data. All may or may not support replication, transactions, clustering, etc.

We won't really get anything that truly covers everyone's bases in a robust fashion until we get a full opensource project with multiple contributors. Until then it's going to be commercial entities trying to lock things down to their specific products.

Also, most benchmarks are rigged or not general purpose as the problem domain really isn't anything new - yet developers keep acting as if they've rediscovered the wheel.

"The wheel" is uninteresting, really. Commercial / enterprise software looks good until you're outside its intended scope. You can program a "wheel" in a couple of days for simple usage, and often that is much better for R&D efforts, or for tackling new kinds of problems, than shoehorning fresh ideas into old garments. About the only times "the wheel" becomes truly valuable on its own are when it's been a success for a long time and there's not much need for fresh starts and innovation, or when it truly fulfills such a role already (rare). So you can make any software, commercial or open source, but it can't truly bend the law of gravity: it is bound by its scope and by a limited set of requirements, features, qualities and complexity.

It's possible to give up and just make what you need yourself, freeing yourself of others' ideas and implementations. It's a long road, but you get what you make yourself, and you at least have a chance to test out your dreams.

I can understand building lots of DBs and scaling in any direction just to pay for annoyances and limitations to go away, but they're never a necessary starting point for custom coding, and they introduce accidental complexity too early. It also depends on the use case: who will build it, use it, and maintain or evolve it further. So this is just to illustrate the natural way, not the only way.

People never started great works by building cathedrals right off the bat.
Someone's cathedral might be another's dinosaur.