Datastore design recommendations

Discussion in 'Automated Trading' started by Clark Bruno, Jan 24, 2022.

  1. Databento

    Databento Sponsor

    @jackfx I'll answer once to close the loop, but I agree that further questions should be taken to another venue to keep this thread on topic.

    We are open sourcing the code under an MIT license; you can copy it, and all of it stays up for free on GitHub even if our company shuts down. The binary format is also extremely simple to maintain and extend even without our developers: at its heart, it is just a C header file with 181 lines of code and 2 pages of documentation describing how it works. This is a deliberate design decision to make each record fit into a single cache line. And of course we're also one of the most funded data startups right now.
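
    To give a flavor of what "simple" means here (purely an illustration, not our actual schema; the field names and layout below are made up), a fixed-width 64-byte record can be parsed from any language with a handful of lines. In Python, for example:

    ```python
    # Illustration only: a hypothetical fixed-width record that packs into
    # one 64-byte cache line. Not the actual format.
    import struct

    RECORD = struct.Struct(
        "<"   # little-endian, standard sizes, no implicit padding
        "Q"   # ts_event      (ns since epoch)
        "Q"   # ts_recv       (ns since epoch)
        "I"   # instrument_id
        "I"   # sequence
        "q"   # bid_px        (fixed-point integer price)
        "q"   # ask_px
        "I"   # bid_sz
        "I"   # ask_sz
        "q"   # last_px
        "Q"   # flags
    )
    assert RECORD.size == 64  # exactly one cache line per record

    def read_records(path):
        """Stream records out of a flat binary file."""
        with open(path, "rb") as f:
            while chunk := f.read(RECORD.size):
                if len(chunk) < RECORD.size:
                    break  # ignore a truncated trailing record
                yield RECORD.unpack(chunk)
    ```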

    Typical security practices: we use SSL for your API requests, and account IDs are anonymized, with the identifying information encrypted. We obscure implementation details of our internal infrastructure as part of our security policy, so I cannot disclose other elements. At the end of the day, there's a limited blast radius: only so much proprietary strategy information can be gleaned from data queries.
     
    #31     Jan 26, 2022
  2. jackfx

    Thanks for the info, Databento! I just erased my questions (on Hackers) and bookmarked Databento for its attractive pricing plans. It looks like Databento is heading toward high-end analytics (SIMD, tensors, etc.).
     
    #32     Jan 26, 2022
  3. rb7

    You never mentioned what operations you'll do with your time series.
    Would it be sequential reads on one symbol at a time?
    No relation between symbols?
    If this is the case, then switching to a database (RDBMS) might not be the best solution.

    Also, you never mentioned why you want to change your current way of doing it.
     
    #33     Jan 26, 2022
  4. I mentioned some of this already, but it's worth elaborating:

    * streaming mixed symbols, ordered by timestamp, into a backtest engine over a specified time span, potentially adjusting the streamed bid/ask prices to manipulate spreads for different fill assumptions

    * windowing of data as a preprocessing step for tensorflow/pytorch DNN training (see the sketch after this list)

    * TCA (transaction cost analysis), including top-of-book vs. fill prices, and average spreads within the same symbol over different time buckets throughout RTH, or across symbols (normalized)

    * quick ad-hoc queries that cannot be prepared because the questions that need answering are different each time

    * quickly loading specific subsets of symbols and timespans into Python pandas dataframes, C# array structures, or R data frames for custom statistical analysis
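
    On the windowing point above, a minimal sketch of the kind of preprocessing I mean (window length and feature layout are placeholders, not my actual pipeline):

    ```python
    # Sketch of the windowing step: turn a (T, F) feature array into
    # overlapping (T - W + 1, W, F) training samples. Numbers are placeholders.
    import numpy as np
    from numpy.lib.stride_tricks import sliding_window_view

    def make_windows(features: np.ndarray, window: int = 128) -> np.ndarray:
        # sliding_window_view builds overlapping views without copying;
        # the window axis comes out last, so move it next to the batch axis
        return sliding_window_view(features, window, axis=0).transpose(0, 2, 1)

    # e.g. 10,000 ticks with 4 features (bid_px, ask_px, bid_sz, ask_sz)
    ticks = np.random.rand(10_000, 4).astype(np.float32)
    samples = make_windows(ticks, window=128)   # shape (9873, 128, 4)
    ```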

    Hope this helps to better explain my use case. I guess you get the idea: I need a data source that can be queried in a highly performant manner from multiple platforms. If I dealt with binary flat files (which I have for years) or text files, I would have to write a lot of code in multiple languages to get even a basic set of API functions or queries going. Text files are just way too slow to parse for some of my use cases. Clickhouse is a highly performant columnar database, specialized for time series, and the query from whatever platform is 100% identical. The APIs already exist and are tested by a large open-source community. What's not to like?
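
    To make that concrete, the kind of call I have in mind looks roughly like this (a sketch using the open-source clickhouse-driver package; the table and column names are placeholders, not my actual schema):

    ```python
    # Sketch only: pull a symbol/timespan subset straight into pandas.
    # Table and column names are placeholders.
    from clickhouse_driver import Client
    import pandas as pd

    client = Client(host="localhost")

    rows, cols = client.execute(
        """
        SELECT ts, symbol, bid_px, ask_px, bid_sz, ask_sz
        FROM quotes
        WHERE symbol IN ('ESH2', 'NQH2')
          AND ts BETWEEN '2022-01-03 09:30:00' AND '2022-01-03 16:00:00'
        ORDER BY ts
        """,
        with_column_types=True,
    )
    df = pd.DataFrame(rows, columns=[name for name, _ in cols])
    ```

    The same SELECT text runs unchanged from the C# or R client libraries, which is the whole appeal.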

    Again, my question was never whether CH is the right choice for my use case (it is), but how users who work specifically with financial trading and risk management algorithms, and who specifically use a time series database, structure their data with regard to symbols, exchanges, ...

    Note: CH is NOT an RDBMS, it is a columnar database


     
    #34     Jan 26, 2022
    d08, Databento, shuraver and 2 others like this.
  5. Databento

    Databento Sponsor

    There's one more point worth mentioning.

    Most naive binary flat file implementations are record-oriented, which makes it harder for a generic compression algorithm to squeeze out a high compression ratio. It takes significantly more work to write a binary flat file design that employs a column-oriented layout.
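
    A toy way to see the effect (illustrative only, with synthetic data): interleaving fields record by record breaks up the runs of similar bytes that a generic compressor feeds on, while laying each field out contiguously keeps them together.

    ```python
    # Toy comparison of row-oriented vs column-oriented layout under a
    # generic compressor (zlib). Synthetic data, purely illustrative.
    import zlib
    import numpy as np

    n = 1_000_000
    ts = np.arange(n, dtype=np.int64) * 1_000                               # monotonic timestamps
    px = 1_000_000 + np.random.randint(-2, 3, n, dtype=np.int64).cumsum()   # random-walk price
    sz = np.random.randint(1, 10, n, dtype=np.int64)                        # small trade sizes

    row_layout = np.column_stack([ts, px, sz]).tobytes()     # record after record
    col_layout = ts.tobytes() + px.tobytes() + sz.tobytes()  # field after field

    print("row-oriented   :", len(zlib.compress(row_layout)))
    print("column-oriented:", len(zlib.compress(col_layout)))
    ```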

    Clickhouse is fairly efficient at compression. In our own internal testing, a 7 GB CSV ends up being 6.7 GB in MySQL and 670 MB in Clickhouse.

    The reduced storage requirements from the additional compression could be enough to pay for the marginal cost of switching to an all-NVMe setup on the same hardware, hosting, and power budget.

    On the flip side, we don't put all of our data on Clickhouse, for various reasons that start to manifest when you run larger clusters like ours. One is that we like to decouple compute from storage and scale them independently.
     
    #35     Jan 31, 2022
    shuraver and swinging tick like this.
  6. Very good point. I have ignored compression in the past because I had large amounts of fast SSD cluster storage available, but I am now looking much more closely at compression as my data requirements increase.

    Your latter point makes sense. I also separate those concerns: my data storage layer only stores and retrieves partitions, and all compute tasks are done in my data processing and query logic, not on the server.

     
    #36     Jan 31, 2022
  7. Try something like https://www.hdfgroup.org/solutions/hdf5; various users use it to define their schema and store their data in there. There are API calls for compression on a columnar basis (useful where values change little from one record to another). I have created a C++ wrapper to save Quotes, Trade, Greeks, MarketDepth, .. by symbol, by day, and by trading session. I can then run custom selectors across the data to pull values and time series for use in backtesting. There are Python wrappers for it as well.
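
    If you go the Python route, a minimal h5py sketch of that by-symbol / by-session layout might look like this (group and dataset names here are just examples):

    ```python
    # Minimal h5py sketch: one group per symbol per session, one compressed
    # dataset per column. Names and values are examples only.
    import h5py
    import numpy as np

    with h5py.File("quotes.h5", "a") as f:
        grp = f.require_group("ES/2022-04-18")          # /<symbol>/<session>
        ts  = np.arange(1_000, dtype=np.int64)
        bid = np.full(1_000, 4490.25)
        ask = bid + 0.25
        for name, data in [("ts", ts), ("bid", bid), ("ask", ask)]:
            if name not in grp:
                grp.create_dataset(name, data=data, compression="gzip", shuffle=True)

    # Later, a custom selector for backtesting just slices the datasets:
    with h5py.File("quotes.h5", "r") as f:
        day = f["ES/2022-04-18"]
        mid = (day["bid"][:] + day["ask"][:]) / 2
    ```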
     
    #37     Apr 18, 2022
  8. Databento

    Databento Sponsor

    By the way, we recently tested our Clickhouse cluster for a new use case and were able to run multiple queries with `sum` and `in` operations at over 1.04B rows per second on a single client. This is faster than, say, the numbers advertised by Man AHL for Arctic, with much less hardware thrown at it.
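
    (For context, the queries were of this general shape; the table and column names below are illustrative, not the actual schema.)

    ```python
    # Illustrative only: the general shape of a sum/in aggregation,
    # not the actual benchmark schema or query.
    from clickhouse_driver import Client

    client = Client(host="localhost")
    (total_size,) = client.execute(
        "SELECT sum(size) FROM trades WHERE instrument_id IN (1, 2, 3)"
    )[0]
    ```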

    Probably the biggest challenge is the complexity around expanding an existing cluster. It's easiest if you can build-and-forget or do a power law expansion, e.g. build the next cluster twice as big and have it subsume the old one.
     
    #38     Jul 26, 2022
    shuraver and M.W. like this.
  9. M.W.

    I remember we had an exchange re Clickhouse in another thread. Thanks for your update. I migrated to Clickhouse a while ago and store over 140 TB of raw data, which sits compressed at around 15 TB in Clickhouse. Queries for my use case are very performant and I could not be happier. My only qualm is that my analytics engines and applications are all written in C# on a Windows box, so I need to run a separate Linux VM to use Clickhouse. But it's not a big issue.

    One downside of Clickhouse, imo, is that its backup technology is not yet very solid. I found the features regarding backups quite limited.


     
    #39     Jul 29, 2022
    shuraver, YuriWerewolf and Databento like this.
  10. M.W.

    Update to my previous post:

    I now run my entire AI-related workflow on a WSL2 instance running Ubuntu on a Windows box. The Clickhouse server runs on Ubuntu in WSL2 too. (The neat thing is that I could configure the storage of the Clickhouse server's databases and tables on a separate NVMe-based RAID array --> extremely fast parallel data retrieval.) Newer tensorflow versions no longer support GPUs on a Windows machine. My deployed deep learning models all run on several nearby Linux-based dedicated servers.

     
    #40     Apr 18, 2023
    Databento likes this.