IMHO there isn't anything that comes close to the performance of kdb+, but as you also know it will cost an arm and a leg. Unless you're into sub-ms market making (which makes expensive tech a requirement anyway), it is not necessary. I quote across a variety of products and exchanges, and I found that the bottleneck is most often not the database but the feed handler.

If you have to ingest the data, do some complex calculations just to arrive at the price/size you want to quote, and then send the orders out, it makes sense to separate the execution algorithm from the algorithm that calculates its parameters. Execution is very dumb but extremely fast; the model is very smart but slow. Send changes in parameters from the model to execution on every update, or only when a certain threshold is reached. This way your execution stays very fast because it isn't slowed down by the constant recalculation of your model parameters. I found that the model parameters change a lot, but I don't care about changes in the 6th digit because they don't update the price/size of my orders.

Some guys go to the extreme and program the execution algorithm onto an ASIC and colocate that; the parameters are calculated on a different machine and sent to the chip on update. Not quite where I want/need to go, but just FYI. I'm working with MongoDB and it's more than sufficient for market making unless you're getting into queue-priority calculations on tier-1 products... but in that case your NDA should prohibit posting on ET anyway^^
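Roughly, the model-to-execution handoff looks something like this (just a sketch in Python; names like ModelParams and send_to_execution are made up for illustration, not actual production code):

```python
from dataclasses import dataclass

@dataclass
class ModelParams:
    fair_price: float
    quote_size: float

class ParamPublisher:
    def __init__(self, price_threshold, size_threshold, send_to_execution):
        self.price_threshold = price_threshold      # e.g. one tick
        self.size_threshold = size_threshold        # e.g. one lot
        self.send_to_execution = send_to_execution  # the fast path, e.g. a socket write
        self.last_sent = None

    def on_model_update(self, params: ModelParams):
        # Drop updates that would not actually change the quoted price/size,
        # e.g. noise in the 6th digit.
        if self.last_sent is not None \
           and abs(params.fair_price - self.last_sent.fair_price) < self.price_threshold \
           and abs(params.quote_size - self.last_sent.quote_size) < self.size_threshold:
            return
        self.send_to_execution(params)
        self.last_sent = params
```

The slow model can recalculate as often as it likes; the fast path only ever sees the updates that matter.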
Very good points, and I agree with you on everything. Just to clarify, I am not using the database for any execution-related work. My algorithms cache any historical data that the model might need, so no overhead there. The database is used primarily to feed a backtest engine, for me to perform some TCA in Python, and even to do parts of my tax accounting (in Canada, for any foreign-denominated assets or an FX base that differs from CAD, the then-prevailing rate has to be pulled from the system and applied to the transaction, both open and close). I also use it to run some statistics on the fly that I might be interested in, such as the average bid/ask spread over the past month in certain symbols across specific time buckets, during RTH for equities and the LDN/NY trading hours for cash FX.

I am not asking for the purposes of performance improvements; I am more than happy with the performance. I am asking about the database structure for columnar databases per se. For what it's worth, I recently compared performance between my ClickHouse server and the kdb+ instance of a buddy of mine who works for an HFT house. A sample query ran multiple times faster on ClickHouse than on kdb+: I got 68ms on a query over 35 million rows, and his query took over 400ms. Though, to be fair, his kdb+ instance is 32-bit and ClickHouse makes use of multiple threads when needed. We are still investigating this because kdb+ reports the entire turnaround time, not just the time from when the query hits the DB engine.
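To give an idea of the kind of on-the-fly statistic I mean, here is a rough sketch of a spread-by-hour query run against ClickHouse from Python; the table and column names (quotes, symbol, ts, bid, ask) are just placeholders, not my actual schema:

```python
from clickhouse_driver import Client  # assumes the clickhouse-driver package

client = Client(host="localhost")

sql = """
    SELECT
        symbol,
        toHour(ts)     AS hour_bucket,
        avg(ask - bid) AS avg_spread
    FROM quotes
    WHERE symbol IN %(symbols)s
      AND ts >= now() - INTERVAL 1 MONTH
    GROUP BY symbol, hour_bucket
    ORDER BY symbol, hour_bucket
"""

# Average spread per hour-of-day bucket over the past month for a couple of symbols;
# an RTH or session filter would just be another predicate on ts.
rows = client.execute(sql, {"symbols": ("EURUSD", "USDCAD")})
```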
Hm... honestly, I don't know about ClickHouse, nor am I a kdb+ expert. Your buddy should know a lot more; also, he should be embarrassed that a tool costing five figures a year minimum is slower than a retail program XD

Database structure heavily depends on what you want to do with it. For trading purposes the queries are rather simple, so all you need is timestamp, bid/offer and last for an orderbook. Also, for trading it makes sense to use separate tables per symbol, as you oftentimes store the entire orderbook over 10+ levels in the DB, which results in a lot more than a couple of million entries per year. If it's all about statistics and taxes, do whatever is easiest for you to implement, since the DB doesn't feed any critical applications.
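Something along these lines, as a rough sketch of a per-symbol top-of-book table in ClickHouse (table name and engine settings are just an example; a deeper book would add one bid/ask price/size pair per level):

```python
from clickhouse_driver import Client  # assumes the clickhouse-driver package

client = Client(host="localhost")

client.execute("""
    CREATE TABLE IF NOT EXISTS quotes_EURUSD
    (
        ts       DateTime64(6),  -- capture/exchange timestamp, microsecond precision
        bid      Float64,
        bid_size Float64,
        ask      Float64,
        ask_size Float64,
        last     Float64
    )
    ENGINE = MergeTree
    ORDER BY ts
""")
```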
The same could be said about Linux, developed and improved in an open-source community ecosystem and giving Microsoft a hard time ;-) ClickHouse is not a retail program. It is an open-source project and is used by a number of large corporations as well. It has not yet caught on at a larger scale because it's relatively new and immature. Again, to be clear, I don't have difficulties with the DB itself. It's performing well, the documentation is clear and comprehensive, and it has API hooks targeting most popular languages. I am looking for advice on the schema and architectural side of things.
Memory-mapped files are fine, and if you structure your files so that the important stuff doesn't require traversing them, you'll be good. But with every solution there is a "what are you doing with this" requirement. For me, I needed to use it for research (which meant slicing and dicing) as well as real-time decisions. Note that I have since switched to manual trading, so what I say should be taken with a grain of salt anyway, though I still use it for research, and I can load ticks for thousands of symbols quickly enough for that.
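As a rough illustration of what I mean by structuring the files: fixed-width records let you jump straight to the Nth tick by offset instead of scanning the file (the record layout here is just an example):

```python
import mmap
import struct

# One tick = int64 epoch-ns timestamp, bid, ask, last (32 bytes, little-endian)
RECORD = struct.Struct("<q d d d")

def append_tick(path: str, ts_ns: int, bid: float, ask: float, last: float):
    with open(path, "ab") as f:
        f.write(RECORD.pack(ts_ns, bid, ask, last))

def read_tick(path: str, index: int):
    # Random access by record index: no traversal, just an offset calculation.
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            return RECORD.unpack_from(mm, index * RECORD.size)
```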
So this is only used for research? Nothing to do with real-time trading? If so, then all that discussion about execution is not needed. ClickHouse is fine. Better would be to just buy tick data with a good SDK. People always think they need to build their own stuff and neglect to have a profitable strategy.
Well, I would not want to wait minutes or hours to download tick-based data through an SDK for backtesting. I can specify symbols, a start and end time, and a compression rate, and get my results from my DB in less than a second.
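Roughly like this, as a sketch: a downsampling query where the bucket size plays the role of the compression rate (table and column names are placeholders, not my actual schema):

```python
from clickhouse_driver import Client  # assumes the clickhouse-driver package

client = Client(host="localhost")

def load_bars(symbols, start, end, bucket_seconds=60):
    # Compress raw ticks into fixed time buckets on the server side.
    sql = """
        SELECT
            symbol,
            toStartOfInterval(ts, INTERVAL %(bucket)s SECOND) AS bucket,
            argMin(last, ts) AS open,
            max(last)        AS high,
            min(last)        AS low,
            argMax(last, ts) AS close
        FROM ticks
        WHERE symbol IN %(symbols)s
          AND ts BETWEEN %(start)s AND %(end)s
        GROUP BY symbol, bucket
        ORDER BY symbol, bucket
    """
    return client.execute(sql, {
        "symbols": tuple(symbols),
        "start": start,
        "end": end,
        "bucket": bucket_seconds,
    })
```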
This is what I do. Over time I have found it phenomenally useful. If needed, you can easily write a script to put it in whatever database and/or format you want. But I have removed all databases from my code.
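For example, a minimal script along these lines would move flat CSV tick files into SQLite (the file layout and column names are just an illustration):

```python
import csv
import sqlite3

def load_csv_to_sqlite(csv_path: str, db_path: str = "ticks.db"):
    conn = sqlite3.connect(db_path)
    conn.execute("""
        CREATE TABLE IF NOT EXISTS ticks (
            symbol TEXT, ts TEXT, bid REAL, ask REAL, last REAL
        )
    """)
    with open(csv_path, newline="") as f:
        # Expects a header row: symbol, ts, bid, ask, last
        rows = ((r["symbol"], r["ts"], float(r["bid"]), float(r["ask"]), float(r["last"]))
                for r in csv.DictReader(f))
        conn.executemany("INSERT INTO ticks VALUES (?, ?, ?, ?, ?)", rows)
    conn.commit()
    conn.close()
```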
Yes, that's the best way while the algo is profitable, and when a new algo comes along you just reformat the data again to fit the new algo.