Datastore design recommendations

Discussion in 'Automated Trading' started by Clark Bruno, Jan 24, 2022.

  1. Hello everyone, my first post.

    I am an algorithmic trader and have so far stored my temporal sequence data in my own binary file datastore. I am in the midst of changing parts of my architecture. I came across Clickhouse, an open-source, highly performant time-series database, and have one question, as I have never stored the tick data I use for backtesting in a conventional database:

    I could import my currency tick data for one symbol to experiment with performance optimizations. The schema is simply timestamp (long), bid (float64) and ask (float64), altogether three columns in the table. Would you recommend creating separate tables for each symbol, or would you store all symbols of an identical asset class in the same table? One symbol alone is around 370 million records. Adding 30 or more symbols to cover all my currencies is not a big deal performance-wise; the database can deal with many billions of records. But would this be advisable?
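
    For concreteness, here is roughly what that single-symbol layout looks like as a Clickhouse table, as a minimal sketch via the clickhouse-driver Python package (the table name, the MergeTree settings and the Int64 timestamp encoding are illustrative, not prescriptive):

        # Minimal sketch: one table per symbol, three columns, created through
        # the clickhouse-driver package against a local server (assumptions).
        from clickhouse_driver import Client

        client = Client(host="localhost")
        client.execute("""
            CREATE TABLE IF NOT EXISTS eurusd_ticks (
                ts  Int64,     -- epoch timestamp, matching the 'long' above
                bid Float64,
                ask Float64
            ) ENGINE = MergeTree
            ORDER BY ts
        """)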

    What do you guys do, particularly those who work with columnar databases?
     
    shuraver likes this.
  2. What I did was store blocks of pandas dataframes in SQL, something like:

    block date start, block date end, instrument id, pandas dataframe
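
    A minimal sketch of that kind of layout, here using SQLite and zlib-compressed pickles purely for illustration (table and column names are made up):

        # Store each pandas DataFrame block as a compressed blob keyed by its
        # date range and instrument id. SQLite is used here only as a stand-in
        # for whatever SQL database actually sits behind this.
        import pickle
        import sqlite3
        import zlib

        import pandas as pd

        conn = sqlite3.connect("blocks.db")
        conn.execute(
            "CREATE TABLE IF NOT EXISTS df_blocks ("
            " block_start TEXT, block_end TEXT, instrument_id TEXT, payload BLOB)"
        )

        def store_block(df: pd.DataFrame, instrument_id: str) -> None:
            # Assumes the DataFrame is indexed by timestamp.
            payload = zlib.compress(pickle.dumps(df))
            conn.execute(
                "INSERT INTO df_blocks VALUES (?, ?, ?, ?)",
                (str(df.index.min()), str(df.index.max()), instrument_id, payload),
            )
            conn.commit()

        def load_blocks(instrument_id: str, start: str, end: str) -> pd.DataFrame:
            rows = conn.execute(
                "SELECT payload FROM df_blocks WHERE instrument_id = ?"
                " AND block_end >= ? AND block_start <= ?",
                (instrument_id, start, end),
            ).fetchall()
            return pd.concat(pickle.loads(zlib.decompress(r[0])) for r in rows)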
     
    Baron likes this.
  3. Got it, and that may make sense for your use case. What I worry about is that if I added an additional column denoting the instrument ID, it would require a lot of additional space.

     
  4. Why not try it and find out? The above was what I found after much testing.
     
  5. BTW, I tried the popular file formats as well, Parquet and whatever else. The above was the best. Note that the pandas dataframe was compressed.
     
  6. That idea does not work for me because I use multiple language bindings for different projects, such as one in C#, and pandas dataframes can't be consumed efficiently from C#. And we are probably talking about vastly different performance requirements.

    The following metric takes around 600 ms across 370 million tick-based bids and offers: "select avg(10000*(ask - bid)) from fx_table". This is without caching any data and without optimizing queries, schema, or database engine. To my knowledge only kdb comes close, not even Vertica. I sometimes need to run more complex queries across multiple symbols, which takes a very long time with other time-series databases. Going the pandas route is also too slow for my backtest engine.
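
    For reference, that is literally this query run through the clickhouse-driver Python client with a simple wall-clock timing around it (timings will of course vary with hardware and storage):

        # Time the spread metric from the post. fx_table is the table name
        # quoted above; the host and client library are assumptions.
        import time

        from clickhouse_driver import Client

        client = Client(host="localhost")

        start = time.perf_counter()
        rows = client.execute("SELECT avg(10000 * (ask - bid)) FROM fx_table")
        elapsed_ms = (time.perf_counter() - start) * 1000

        print(f"avg spread (pips): {rows[0][0]:.4f}, query time: {elapsed_ms:.0f} ms")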

    But back to my main question: how does adding a symbol-identifier column perform, computationally as well as storage-wise, in any of the time-series databases others use?

     
    Last edited: Jan 24, 2022
  7. I'm not sure anyone will have that specific answer; I don't, anyway.
     
  8. ValeryN

    You sound like a smart dude. Don't overthink it.

    PL is not in the DB structure.
     
    persistence likes this.
  9. jackfx

    I would suggest you stick with text files (human readable). My systems handle hundreds of symbols, text files, billions of records. Of course it requires skill in C#; SQL was too slow, but it all depends on programming skills too. I use C# and C++. Learn memory-mapped files for speed. They are FASTER THAN all kinds of SQL.
    One symbol, one file; one day, one file. Binary or text doesn't matter. Only those who can't code will say SQL is faster. No license, no big memory, lightning speed.
    Other issues are backup, redundancy, shared storage, future changes to the algo, etc.; all are easy to solve with flat files. Good luck.
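
    For anyone not working in C#, a minimal Python sketch of the same memory-mapped idea, assuming one binary file per symbol per day laid out as fixed-width (timestamp, bid, ask) records; the file name and record layout are illustrative:

        # Map the day's tick file into memory; pages are faulted in lazily,
        # so scanning large files does not require reading them up front.
        import numpy as np

        tick_dtype = np.dtype([("ts", "<i8"), ("bid", "<f8"), ("ask", "<f8")])
        ticks = np.memmap("EURUSD_2022-01-24.bin", dtype=tick_dtype, mode="r")

        # Same spread metric as earlier in the thread, computed over the mapped file.
        avg_spread_pips = 10000.0 * (ticks["ask"] - ticks["bid"]).mean()
        print(avg_spread_pips)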
     
    Last edited: Jan 25, 2022
    Sprout, bitcuration, yoriz and 4 others like this.
  10. SQL is not the underlying engine; it is just the query language. Clickhouse, for example, offers SQL-like query capabilities but has nothing whatsoever to do with relational databases: it is a columnar datastore. By the way, my existing file-based datastore consists of binary files; they are not human readable. Memory-mapped-file approaches certainly work, but I did not consider them for my architecture for multiple reasons. And I am pretty sure that even simple queries such as the one I posted in an earlier message will not be as performant on a memory-mapped system as on an optimized columnar DB such as Clickhouse. I do not want to delve into a long discussion about different ways to store and query data; each one has its own merits and disadvantages. Thanks, though, for your suggestions.

    I would like to focus on the single question of whether users of columnar databases store all symbols of a given structure in one table with an identifier column, or whether they set up separate tables. I am also eager to hear about completely different approaches, as long as they deal with columnar database storage.

     
    Last edited: Jan 25, 2022
    d08, shuraver and bitcuration like this.