IMHO there isn't anything that comes close to the performance of kdb+, but as you also know it will cost an arm and a leg. Unless you're into sub-ms market making (which makes expensive tech a requirement anyway), it is not necessary. I quote across a variety of products and exchanges, and I found that the bottleneck is most often not the database but the feed handler.

If you have to ingest the data, do some complex calculations just to arrive at the price/size you want to quote, and then send the orders out, it makes sense to separate the execution algorithm from the algorithm that calculates its parameters. Execution is very dumb but extremely fast; the model is very smart but slow. Send changes in parameters from the model to execution on every update, or only when a certain threshold is reached. This way your execution stays very fast because it isn't slowed down by the constant recalculation of your model parameters. I found that the model parameters change a lot, but I don't care about changes in the 6th digit because they don't update the price/size of my orders.

Some guys go to the extreme and program the execution algorithm onto an ASIC and colocate that; the parameters are calculated on a different machine and sent to the chip on update. Not quite where I want/need to go, but just FYI. I'm working with MongoDB and it's more than sufficient for market making unless you're getting into queue-priority calculations on tier-1 products... but in that case your NDA should prohibit posting on ET anyway^^
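Roughly, the model-to-execution handoff looks something like this (just a sketch in Python; names like ModelParams and send_to_execution are made up for illustration, not actual production code):

```python
from dataclasses import dataclass

@dataclass
class ModelParams:
    fair_price: float
    quote_size: float

class ParamPublisher:
    def __init__(self, price_threshold, size_threshold, send_to_execution):
        self.price_threshold = price_threshold      # e.g. one tick
        self.size_threshold = size_threshold        # e.g. one lot
        self.send_to_execution = send_to_execution  # the fast path, e.g. a socket write
        self.last_sent = None

    def on_model_update(self, params: ModelParams):
        # Drop updates that would not actually change the quoted price/size,
        # e.g. noise in the 6th digit.
        if self.last_sent is not None \
           and abs(params.fair_price - self.last_sent.fair_price) < self.price_threshold \
           and abs(params.quote_size - self.last_sent.quote_size) < self.size_threshold:
            return
        self.send_to_execution(params)
        self.last_sent = params
```

The slow model can recalculate as often as it likes; the fast path only ever sees the updates that matter.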
Very good points, and I agree with you on everything. Just to clarify, I am not using the database for any execution-related work. My algorithms cache any historical data that the model might need, so no overhead there. The database is used primarily to feed a backtest engine, for me to perform some TCA in Python, and even to do parts of my tax accounting (in Canada, for any foreign-denominated assets or an FX base that differs from CAD, the then-prevailing rate has to be pulled from the system and applied to the transaction, both open and close). I also use it to run some statistics on the fly that I might be interested in, such as the average bid/ask spread over the past month in certain symbols across specific time buckets, during RTH for equities and the LDN/NY trading hours for cash FX.

I am not asking for the purposes of performance improvements; I am more than happy with the performance. I am asking about the database structure for columnar databases per se. For what it's worth, I recently compared performance between my ClickHouse server and the kdb+ instance of a buddy of mine who works for an HFT house. A sample query ran multiple times faster on ClickHouse than on kdb+: I got 68ms on a query over 35 million rows, and his query took over 400ms. Though, to be fair, his kdb+ instance is 32-bit and ClickHouse makes use of multiple threads when needed. We are still investigating this because kdb+ reports the entire turnaround time, not just the time from when the query hits the DB engine.
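To give an idea of the kind of on-the-fly statistic I mean, here is a rough sketch of a spread-by-hour query run against ClickHouse from Python; the table and column names (quotes, symbol, ts, bid, ask) are just placeholders, not my actual schema:

```python
from clickhouse_driver import Client  # assumes the clickhouse-driver package

client = Client(host="localhost")

sql = """
    SELECT
        symbol,
        toHour(ts)     AS hour_bucket,
        avg(ask - bid) AS avg_spread
    FROM quotes
    WHERE symbol IN %(symbols)s
      AND ts >= now() - INTERVAL 1 MONTH
    GROUP BY symbol, hour_bucket
    ORDER BY symbol, hour_bucket
"""

# Average spread per hour-of-day bucket over the past month for a couple of symbols;
# an RTH or session filter would just be another predicate on ts.
rows = client.execute(sql, {"symbols": ("EURUSD", "USDCAD")})
```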
Hm... honestly, I don't know about ClickHouse, nor am I a kdb+ expert. Your buddy should know a lot more; also, he should be embarrassed that a tool costing five figures a year minimum is slower than a retail program XD

Database structure heavily depends on what you want to do with it. For trading purposes the queries are rather simple, so all you need is timestamp, bid/offer and last for an orderbook. Also, for trading it makes sense to use separate tables per symbol, as you oftentimes store the entire orderbook over 10+ levels in the DB, which results in a lot more than a couple of million entries per year. If it's all about statistics and taxes, do whatever is easiest for you to implement, since the DB doesn't feed any critical applications.
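Something along these lines, as a rough sketch of a per-symbol top-of-book table in ClickHouse (table name and engine settings are just an example; a deeper book would add one bid/ask price/size pair per level):

```python
from clickhouse_driver import Client  # assumes the clickhouse-driver package

client = Client(host="localhost")

client.execute("""
    CREATE TABLE IF NOT EXISTS quotes_EURUSD
    (
        ts       DateTime64(6),  -- capture/exchange timestamp, microsecond precision
        bid      Float64,
        bid_size Float64,
        ask      Float64,
        ask_size Float64,
        last     Float64
    )
    ENGINE = MergeTree
    ORDER BY ts
""")
```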
The same could be said about Linux, developed and improved in an open-source community ecosystem and giving Microsoft a hard time ;-) ClickHouse is not a retail program. It is an open-source project and is used by a number of large corporations as well. It has not yet caught on at a larger scale because it's relatively new and immature. Again, to be clear, I don't have difficulties with the DB itself. It's performing well, the documentation is clear and comprehensive, and it has API hooks targeting most popular languages. I am looking for advice on the schema and architectural side of things.
Memory-mapped files are fine, and if you structure your files so that the important stuff doesn't require traversing them, you'll be good. But with every solution there is a "what are you doing with this" requirement. For me, I needed to use it for research (which meant slicing and dicing) as well as real-time decisions. Note that I have since switched to manual trading, so what I say should be taken with a grain of salt anyway, though I still use it for research, and I can load ticks for thousands of symbols quickly enough for that.
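As a rough illustration of what I mean by structuring the files: fixed-width records let you jump straight to the Nth tick by offset instead of scanning the file (the record layout here is just an example):

```python
import mmap
import struct

# One tick = int64 epoch-ns timestamp, bid, ask, last (32 bytes, little-endian)
RECORD = struct.Struct("<q d d d")

def append_tick(path: str, ts_ns: int, bid: float, ask: float, last: float):
    with open(path, "ab") as f:
        f.write(RECORD.pack(ts_ns, bid, ask, last))

def read_tick(path: str, index: int):
    # Random access by record index: no traversal, just an offset calculation.
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            return RECORD.unpack_from(mm, index * RECORD.size)
```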
So this is only used for research? Nothing to do with real-time trading? If so, then all that discussion about execution is not needed. ClickHouse is fine. Better would be to just buy tick data with a good SDK. People always think they need to build their own stuff and neglect to have a profitable strategy.
Well, I would not want to wait minutes or hours to download tick-based data through an SDK for backtesting. I can specify symbols, a start and end time, and a compression rate, and get my results from my DB in less than a second.
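Roughly like this, as a sketch: a downsampling query where the bucket size plays the role of the compression rate (table and column names are placeholders, not my actual schema):

```python
from clickhouse_driver import Client  # assumes the clickhouse-driver package

client = Client(host="localhost")

def load_bars(symbols, start, end, bucket_seconds=60):
    # Compress raw ticks into fixed time buckets on the server side.
    sql = """
        SELECT
            symbol,
            toStartOfInterval(ts, INTERVAL %(bucket)s SECOND) AS bucket,
            argMin(last, ts) AS open,
            max(last)        AS high,
            min(last)        AS low,
            argMax(last, ts) AS close
        FROM ticks
        WHERE symbol IN %(symbols)s
          AND ts BETWEEN %(start)s AND %(end)s
        GROUP BY symbol, bucket
        ORDER BY symbol, bucket
    """
    return client.execute(sql, {
        "symbols": tuple(symbols),
        "start": start,
        "end": end,
        "bucket": bucket_seconds,
    })
```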
This is what I do. Over time I have found it phenomenally useful. If needed, you can easily write a script to put it in whatever database and/or format you want. But I have removed all databases from my code.
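For example, a minimal script along these lines would move flat CSV tick files into SQLite (the file layout and column names are just an illustration):

```python
import csv
import sqlite3

def load_csv_to_sqlite(csv_path: str, db_path: str = "ticks.db"):
    conn = sqlite3.connect(db_path)
    conn.execute("""
        CREATE TABLE IF NOT EXISTS ticks (
            symbol TEXT, ts TEXT, bid REAL, ask REAL, last REAL
        )
    """)
    with open(csv_path, newline="") as f:
        # Expects a header row: symbol, ts, bid, ask, last
        rows = ((r["symbol"], r["ts"], float(r["bid"]), float(r["ask"]), float(r["last"]))
                for r in csv.DictReader(f))
        conn.executemany("INSERT INTO ticks VALUES (?, ?, ?, ?, ?)", rows)
    conn.commit()
    conn.close()
```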
Yes, that's the best way while the algo is profitable, and when a new algo comes along you just reformat the data again to fit the new algo.