Discussion in 'Programming' started by sle, Dec 24, 2017.
I'm out of my depth, excuse me if this is stupid...
This looks very good, actually.
PS. bcolz looks very straightforward; this looks more "enterprise-ready". Need to think about the various aspects for a tiny team.
Relational databases are not very good at handling large amounts of columnar data at high speed (at least, most are not - apparently, there is a brand-new TSDB that is optimized for it). They are very good for other things, though.
There are kind of two use cases here, and I think it's worth spelling them out, because I have what I'd consider a sensible idea about how to tackle one, while for the other I won't do much better than posting GitHub links - many of which are already here and worth looking at.
How much of your usage is:
1. Querying for ticks that meet criteria of interest.
2. Replaying a sequential log of ticks to rebuild the book or backtest.
Yeah, most RDBMS aren't so great when it comes to tick data because, as you correctly point out, you end up doing a lot of expensive joins. A column-oriented RDBMS will require fewer joins, provided your queries are sensible. There seems to be a lot more going on in the columnar open-source space these days than when kdb+ was the end-all-be-all. MariaDB looks like it might be cool in this regard. Some very mature but not so cheap options:
- Sybase IQ
- SQL Server
Pretty much everything I do with this data is case 2 in one or another form.
Okay, cool. You're basically looking at an event-sourcing problem, right? Then to me the question becomes not "what database goes really fast?" but more "how can I get really high throughput transforming my immutable log into a stream?"
So if you ask me (and don't because I'm a few months out of undergrad) I'd consider trying a little different approach. What if you:
1. Went with a solution like Hadoop for persistence, and queried it by time window and contract. I imagine you'll want real-time read, and perhaps write, depending on your data source as well. Check out HBase.
2. Once you've extracted the relevant range of ticks, fed it into a Kafka producer.
3. Implemented whatever data processing you intend to do as a consumer, and spun up a cluster as necessary as a means of achieving parallelism. Seems very applicable to backtesting and optimization.
Might be an interesting way to get really high throughput. Might also totally suck and get wrecked by MySQL InnoDB on a dual-core laptop. Just how I'd try to do it first - the "check off as many buzzwords as you can" approach.
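The extract → produce → consume flow above can be sketched without an actual Kafka cluster. Here's a rough pure-Python stand-in (a thread-safe queue in place of a Kafka topic, hypothetical tick tuples, made-up per-symbol processing) just to show the shape of the pipeline, not a real deployment:

```python
import queue
import threading

# Stand-in for a Kafka topic: a thread-safe queue. In a real setup you'd
# swap this for an actual producer/consumer pair talking to a broker.
topic = queue.Queue()
SENTINEL = None  # signals the end of the replayed tick range

def producer(ticks):
    """Step 2: feed the extracted tick range into the 'topic'."""
    for tick in ticks:
        topic.put(tick)
    topic.put(SENTINEL)

def consumer(results):
    """Step 3: processing as a consumer - here just a running trade count
    and last price per symbol (hypothetical logic, stands in for a backtest)."""
    while True:
        tick = topic.get()
        if tick is SENTINEL:
            break
        symbol, ts, price, size = tick
        count, _ = results.get(symbol, (0, None))
        results[symbol] = (count + 1, price)

# Hypothetical ticks extracted in step 1: (symbol, timestamp, price, size)
ticks = [("ES", 1, 2673.25, 2), ("ES", 2, 2673.50, 1), ("CL", 1, 58.12, 3)]
results = {}
worker = threading.Thread(target=consumer, args=(results,))
worker.start()
producer(ticks)
worker.join()
print(results)  # {'ES': (2, 2673.5), 'CL': (1, 58.12)}
```

Parallelism in step 3 would come from partitioning the topic by symbol and running one consumer per partition; the queue version above is single-consumer for clarity.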
I guess I have to do some real cost/benefit analysis before I pick a direction. The inputs are as follows:
- very small team, one techy young monkey and one useless old fart who can barely boot up his MacBook; so we can't support a heavy technology stack
- only a small number of assets require tick-data work at this point in time (probably a few hundred symbols, only 10-20 used concurrently); so organizational aspects are not crucial, and I can probably use a file-system-based structure
- at some point, I might move in the direction of doing more of the latency sensitive stuff; so flexibility is important
- main requirements are rapid reading and writing of rather large blocks of data (intraday, we dump ticks into a text file) for research and backtesting
Is there a reason I don’t want to go with bcolz given the above?
I have experience with this from my own custom solutions built on open-source software.
1. How much data are we talking about, and at what resolution/update interval?
2. Do you need transactionality on updates?
3. How will you query the data? Ad-hoc random access for backtesting and the like, or bulk reading/writing? Both?
4. Network access or host/local access only?
5. I presume your key is symbol+interval and your data is OHLC and maybe V and OI or other minor stuff. Is this accurate?
This is a very deep rabbit hole, and a lot depends on number 1. The natural (naive) response is to reach for an SQL-accessed database. These are generic solutions that are actually quite decent and will perform well if intelligently indexed. Where they break down is space efficiency, due to their generic approach.

On the other hand, an embedded database like BerkeleyDB lets you use a generic Btree database with whatever key/value structure you come up with - but you'll have to write the network side of things yourself if you need network-level access. Both are typically row-structured, not column-structured. This honestly might not make a massive difference depending on your access patterns and how much data you actually want to store.
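To make the embedded key/value idea concrete, here's a minimal sketch of the part that actually matters: designing a symbol+timestamp key whose byte encoding sorts the way a Btree traverses. A sorted Python list stands in for the Btree here (with BerkeleyDB you'd get the ordered cursor for free); the symbols and values are hypothetical:

```python
import struct
from bisect import bisect_left

def make_key(symbol: str, ts: int) -> bytes:
    # Fixed-width symbol + big-endian timestamp, so a plain byte-wise
    # (Btree) sort orders records by symbol, then by time.
    return symbol.encode().ljust(8, b"\x00") + struct.pack(">Q", ts)

def make_value(o, h, l, c, v) -> bytes:
    # Pack OHLC as big-endian doubles plus an 8-byte volume.
    return struct.pack(">ddddQ", o, h, l, c, v)

# A dict plus a sorted key list stands in for the Btree.
store = {}
for ts in (100, 200, 300):
    store[make_key("ES", ts)] = make_value(1.0, 2.0, 0.5, 1.5, 10)
store[make_key("CL", 100)] = make_value(58.0, 58.5, 57.9, 58.1, 3)
keys = sorted(store)

# Range scan: all "ES" records with 100 <= ts < 300. A BerkeleyDB
# cursor positioned with a range lookup would walk the same span.
lo, hi = make_key("ES", 100), make_key("ES", 300)
hits = keys[bisect_left(keys, lo):bisect_left(keys, hi)]
print([struct.unpack(">Q", k[8:])[0] for k in hits])  # [100, 200]
```

The big-endian packing is the design choice worth noting: little-endian timestamps would not sort correctly byte-wise, and variable-width symbols would need a separator that sorts below all symbol characters.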
If we're talking tick-level resolution on 10,000+ instruments, you're going to need something industrial-grade, a shit ton of storage, and potentially multiple nodes. If we're talking orders of magnitude less, then you can get away with a much smaller footprint. Like I said: #1 is the big question.
Update: I just saw that you posted 10-20 tick level instruments. How much history are we talking? It might be better to just get some raw numbers out there because any solution you use is going to be massively dependent on storage and *how* you structure the data.
Are you the old fart here?
I'll say this: if you're still at the point where a filesystem-based structure continues to scale, then you're probably not at the point where you need an industrial-grade solution. This isn't a bad thing; it's actually a good thing, because it means your dataset (or, hopefully, your projected one) is likely to stay reasonable.
Unless you've got something novel, model-wise, don't you think you're probably outgunned from the start here? My mind would be blown if commercial entities in the latency-sensitive space were using any interpreted languages whatsoever at runtime (and Python is an interpreted language, and a slow one at that). You'd have to be calculating something they're just not even remotely interested in in order to win that war - and who knows, that might just be the case. I'd say if you can avoid this entire space for as long as possible, it's probably for the better.
What instruments are you guys able to dump ticks into text files every day and have those not be massive? On top of that, how are you preserving floating-point precision if you're emitting ASCII data at some point? Just straight dumping doubles via printf or something along those lines? Also, how many days of intraday data? The text-file dump of ticks won't scale for anything highly liquid encompassing months' worth of data. If it's just 24h worth of data then it'll probably work, but I don't know how you'd backtest anything with just that.
There's a lot of talk these days about "columnar" data stores and the like, typically with the vibe that it's some kind of silver bullet that will make all your numeric processing somehow better than a row-oriented data store. In reality it is not a new idea whatsoever - just as "NoSQL" isn't new whatsoever. What matters is your data access patterns and how your models/backtests use the actual data you store. If you're not accessing the data from the context of a single column, then it's not really going to do anything for you.

For instance, let's say you use symbol as key and OHLC as columns, and your backtesting logic consults close for every single tick. A column-oriented DBMS will perform better there, because locality of reference is high for successive close values (hence they can be bulk-retrieved faster than in a row-oriented approach). However, say you want open+close: now that's going more in the direction of a row-oriented access pattern, and a column-oriented structure, while it will still perform decently, shows less of an advantage.

BTW: the typical approach for something like this with a standard row-oriented store is to index those columns, which produces your own column-oriented table (the index). If you were constantly backtesting open+close but stored OHLC data per row, then it behooves you to create an index on open+close for that table. More background: https://en.wikipedia.org/wiki/Column-oriented_DBMS
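A toy illustration of the layout difference, with hypothetical OHLC bars (the `array` module gives contiguous storage, which is what the locality argument is about):

```python
from array import array

# Row-oriented: one tuple per bar (hypothetical values).
rows = [(1.0, 2.0, 0.5, 1.5), (1.5, 2.5, 1.0, 2.0), (2.0, 3.0, 1.5, 2.5)]

# Column-oriented: one contiguous array per field.
cols = {
    "open":  array("d", (r[0] for r in rows)),
    "high":  array("d", (r[1] for r in rows)),
    "low":   array("d", (r[2] for r in rows)),
    "close": array("d", (r[3] for r in rows)),
}

# A backtest that only consults close: the columnar layout scans one
# contiguous array; the row layout has to touch every tuple to pick
# out one field.
closes_row = [r[3] for r in rows]
closes_col = list(cols["close"])
print(closes_row == closes_col)  # True

# open+close drifts back toward a row-oriented access pattern: two
# separate arrays now have to be walked in lockstep.
oc = list(zip(cols["open"], cols["close"]))
print(oc[0])  # (1.0, 1.5)
```

Same data either way; only which bytes end up adjacent changes, and that is the whole columnar advantage.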
Also, if you're compressing data to save space you *will* take a hit in access time. It may not be much of a hit if the access patterns are streaming/bulk-reading but the second anything resembles random seek then compression will kill you (as each block will need to be uncompressed in order to read the individual columns or rows within a block - and some of that data may be unrelated).
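The block-decompression cost is easy to see with zlib: to read a single record out of a compressed block, the whole block has to be inflated first. A sketch with made-up fixed-size tick records:

```python
import struct
import zlib

RECORD = struct.Struct(">Qd")   # (timestamp, price), 16 bytes per record
BLOCK_RECORDS = 1024            # records packed into one compressed block

# Build one block of hypothetical ticks and compress it as a unit.
raw = b"".join(RECORD.pack(ts, 100.0 + ts) for ts in range(BLOCK_RECORDS))
block = zlib.compress(raw)

def read_record(compressed_block: bytes, i: int):
    # Random access: the *entire* 16 KiB block must be decompressed to
    # read record i, even though we only want 16 of its 16384 bytes.
    data = zlib.decompress(compressed_block)
    return RECORD.unpack_from(data, i * RECORD.size)

print(read_record(block, 500))  # (500, 600.0)
```

For streaming/bulk reads that decompression cost is amortized across every record in the block; for random seeks it's paid in full on every lookup, which is the hit described above.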
bcolz may work fine for you; I'm just skeptical of the scalability of any database implemented on top of an interpreted language, because they simply don't perform or scale. They eventually all hit bottlenecks requiring underlying components written in native languages in order to keep scaling. Generally they're good for scaffolding or prototyping, but they all eventually hit a wall (with, unfortunately, too many people throwing hardware at the problem rather than changing the algorithm [in this context, the implementation language]).