Datastore design recommendations

Discussion in 'Automated Trading' started by Clark Bruno, Jan 24, 2022.

  1. Databento

    Databento Sponsor

    @jackfx I'll answer once to close the loop, but I agree that further questions should be taken to another venue to keep this thread on topic.

    We are open sourcing the code under an MIT license; you can copy it, and all of it stays up for free on GitHub even if our company shuts down. The binary format is also extremely simple to maintain and extend even without our developers: at its heart, it is just a C header file with 181 lines of code and 2 pages of documentation describing how it works. This is a deliberate design decision to make each record fit into a single cache line. And of course we're also one of the most funded data startups right now.
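
    To give a flavor of what "simple" means here (purely an illustration, not our actual schema; the field names and layout below are made up), a fixed-width 64-byte record can be parsed from any language with a handful of lines. In Python, for example:

    ```python
    # Illustration only: a hypothetical fixed-width record that packs into
    # one 64-byte cache line. Not the actual format.
    import struct

    RECORD = struct.Struct(
        "<"   # little-endian, standard sizes, no implicit padding
        "Q"   # ts_event      (ns since epoch)
        "Q"   # ts_recv       (ns since epoch)
        "I"   # instrument_id
        "I"   # sequence
        "q"   # bid_px        (fixed-point integer price)
        "q"   # ask_px
        "I"   # bid_sz
        "I"   # ask_sz
        "q"   # last_px
        "Q"   # flags
    )
    assert RECORD.size == 64  # exactly one cache line per record

    def read_records(path):
        """Stream records out of a flat binary file."""
        with open(path, "rb") as f:
            while chunk := f.read(RECORD.size):
                if len(chunk) < RECORD.size:
                    break  # ignore a truncated trailing record
                yield RECORD.unpack(chunk)
    ```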

    Typical security practices: we use SSL for your API requests, and account IDs are anonymized, with the identifying information encrypted. We obscure implementation details of our internal infrastructure as part of our security policy, so I cannot disclose other elements. At the end of the day, there's a limited blast radius: only so much proprietary strategy information can be gleaned from data queries.
     
    #31     Jan 26, 2022
  2. jackfx

    Thanks for the info, Databento! I just erased my questions (on Hackers) and bookmarked Databento for its attractive pricing plans. It looks like Databento is heading toward high-end analytics (SIMD, tensors, etc.).
     
    #32     Jan 26, 2022
  3. rb7

    You never mentioned what operations you'll do with your time series.
    Would it be sequential reads on one symbol at a time?
    No relation between symbols?
    If this is the case, then switching to a database (RDBMS) might not be the best solution.

    Also, you never mentioned why you want to change your current way of doing it.
     
    #33     Jan 26, 2022
  4. I mentioned some of this already, but it's worth elaborating:

    * streaming mixed symbols, ordered by timestamp, into a backtest engine over a specified time span, potentially adjusting the streamed bid/ask prices to manipulate spreads for different fill assumptions

    * windowing of data as a preprocessing step for tensorflow/pytorch DNN training (see the sketch after this list)

    * TCA (transaction cost analysis), including top-of-book vs. fill prices, and average spreads within the same symbol over different time buckets throughout RTH, or across symbols (normalized)

    * quick ad-hoc queries that cannot be prepared because the questions that need answering are different each time

    * quickly loading specific subsets of symbols and timespans into Python pandas dataframes, C# array structures, or R data frames for custom statistical analysis
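
    On the windowing point above, a minimal sketch of the kind of preprocessing I mean (window length and feature layout are placeholders, not my actual pipeline):

    ```python
    # Sketch of the windowing step: turn a (T, F) feature array into
    # overlapping (T - W + 1, W, F) training samples. Numbers are placeholders.
    import numpy as np
    from numpy.lib.stride_tricks import sliding_window_view

    def make_windows(features: np.ndarray, window: int = 128) -> np.ndarray:
        # sliding_window_view builds overlapping views without copying;
        # the window axis comes out last, so move it next to the batch axis
        return sliding_window_view(features, window, axis=0).transpose(0, 2, 1)

    # e.g. 10,000 ticks with 4 features (bid_px, ask_px, bid_sz, ask_sz)
    ticks = np.random.rand(10_000, 4).astype(np.float32)
    samples = make_windows(ticks, window=128)   # shape (9873, 128, 4)
    ```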

    Hope this helps to better explain my use case. I guess you get the idea: I need a data source that can be queried in a highly performant manner from multiple platforms. If I dealt with binary flat files (which I have for years) or text files, I would have to write a lot of code in multiple languages to get even a basic set of API functions or queries going. Text files are just way too slow to parse for some of my use cases. Clickhouse is a highly performant columnar database, specialized for time series, and the query from whatever platform is 100% identical. The APIs already exist and are tested by a large open-source community. What's not to like?
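
    To make that concrete, the kind of call I have in mind looks roughly like this (a sketch using the open-source clickhouse-driver package; the table and column names are placeholders, not my actual schema):

    ```python
    # Sketch only: pull a symbol/timespan subset straight into pandas.
    # Table and column names are placeholders.
    from clickhouse_driver import Client
    import pandas as pd

    client = Client(host="localhost")

    rows, cols = client.execute(
        """
        SELECT ts, symbol, bid_px, ask_px, bid_sz, ask_sz
        FROM quotes
        WHERE symbol IN ('ESH2', 'NQH2')
          AND ts BETWEEN '2022-01-03 09:30:00' AND '2022-01-03 16:00:00'
        ORDER BY ts
        """,
        with_column_types=True,
    )
    df = pd.DataFrame(rows, columns=[name for name, _ in cols])
    ```

    The same SELECT text runs unchanged from the C# or R client libraries, which is the whole appeal.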

    Again, my question was never whether CH is the right choice for my use case (it is), but how users who work specifically with financial trading and risk management algorithms, and who specifically use a time series database, structure their data with regard to symbols, exchanges, ...

    Note: CH is NOT an RDBMS, it is a columnar database


     
    #34     Jan 26, 2022
    d08, Databento, shuraver and 2 others like this.
  5. Databento

    Databento Sponsor

    There's one more point worth mentioning.

    Most naive binary flat file implementations are record-oriented, which makes it harder for a generic compression algorithm to squeeze out a high compression ratio. It takes significantly more work to write a binary flat file design that employs a column-oriented layout.
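
    A toy way to see the effect (illustrative only, with synthetic data): interleaving fields record by record breaks up the runs of similar bytes that a generic compressor feeds on, while laying each field out contiguously keeps them together.

    ```python
    # Toy comparison of row-oriented vs column-oriented layout under a
    # generic compressor (zlib). Synthetic data, purely illustrative.
    import zlib
    import numpy as np

    n = 1_000_000
    ts = np.arange(n, dtype=np.int64) * 1_000                               # monotonic timestamps
    px = 1_000_000 + np.random.randint(-2, 3, n, dtype=np.int64).cumsum()   # random-walk price
    sz = np.random.randint(1, 10, n, dtype=np.int64)                        # small trade sizes

    row_layout = np.column_stack([ts, px, sz]).tobytes()     # record after record
    col_layout = ts.tobytes() + px.tobytes() + sz.tobytes()  # field after field

    print("row-oriented   :", len(zlib.compress(row_layout)))
    print("column-oriented:", len(zlib.compress(col_layout)))
    ```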

    Clickhouse is fairly efficient at compression. In our own internal testing, a 7 GB CSV ends up being 6.7 GB in MySQL and 670 MB in Clickhouse.

    The reduced storage requirements from the additional compression could be enough to pay for the marginal cost of switching to an all-NVMe setup on the same hardware, hosting, and power budget.

    On the flip side, we don't put all of our data on Clickhouse, for various reasons that start to manifest when you run larger clusters like ours. One is that we like to decouple compute from storage and scale them independently.
     
    #35     Jan 31, 2022
    shuraver and swinging tick like this.
  6. Very good point. I have ignored compression in the past because I had large amounts of fast SSD cluster storage available, but I am now looking much more closely at compression as my data requirements increase.

    Your latter point makes sense. I also separate those concerns: my data storage layer only stores and retrieves partitions, and all compute tasks are done in my data processing and query logic, not on the server.

     
    #36     Jan 31, 2022
  7. Try something like https://www.hdfgroup.org/solutions/hdf5; various users use it to define their schema and store their data in there. There are API calls for compression on a columnar basis (useful where values change little from one record to another). I have created a C++ wrapper to save Quotes, Trade, Greeks, MarketDepth, .. by symbol, by day, and by trading session. I can then run custom selectors across the data to pull values and time series for use in backtesting. There are Python wrappers for it as well.
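
    If you go the Python route, a minimal h5py sketch of that by-symbol / by-session layout might look like this (group and dataset names here are just examples):

    ```python
    # Minimal h5py sketch: one group per symbol per session, one compressed
    # dataset per column. Names and values are examples only.
    import h5py
    import numpy as np

    with h5py.File("quotes.h5", "a") as f:
        grp = f.require_group("ES/2022-04-18")          # /<symbol>/<session>
        ts  = np.arange(1_000, dtype=np.int64)
        bid = np.full(1_000, 4490.25)
        ask = bid + 0.25
        for name, data in [("ts", ts), ("bid", bid), ("ask", ask)]:
            if name not in grp:
                grp.create_dataset(name, data=data, compression="gzip", shuffle=True)

    # Later, a custom selector for backtesting just slices the datasets:
    with h5py.File("quotes.h5", "r") as f:
        day = f["ES/2022-04-18"]
        mid = (day["bid"][:] + day["ask"][:]) / 2
    ```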
     
    #37     Apr 18, 2022
  8. Databento

    Databento Sponsor

    By the way, we recently tested our Clickhouse cluster for a new use case and were able to run multiple queries with `sum` and `in` operations at over 1.04B rows per second on a single client. This is faster than, say, the numbers advertised by Man AHL for Arctic, with much less hardware thrown at it.
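
    (For context, the queries were of this general shape; the table and column names below are illustrative, not the actual schema.)

    ```python
    # Illustrative only: the general shape of a sum/in aggregation,
    # not the actual benchmark schema or query.
    from clickhouse_driver import Client

    client = Client(host="localhost")
    (total_size,) = client.execute(
        "SELECT sum(size) FROM trades WHERE instrument_id IN (1, 2, 3)"
    )[0]
    ```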

    Probably the biggest challenge is the complexity around expanding an existing cluster. It's easiest if you can build-and-forget or do a power law expansion, e.g. build the next cluster twice as big and have it subsume the old one.
     
    #38     Jul 26, 2022
    shuraver and M.W. like this.
  9. M.W.

    I remember we had an exchange re Clickhouse in another thread. Thanks for your update. I migrated to Clickhouse a while ago and store over 140 TB of raw data, which sits compressed at around 15 TB in Clickhouse. Queries for my use case are very performant and I could not be happier. My only qualm is that my analytics engines and applications are all written in C# on a Windows box, so I need to run a separate Linux VM to use Clickhouse. But it's not a big issue.

    One downside of Clickhouse, imo, is that its backup technology is not yet very solid. I found the features regarding backups quite limited.


     
    #39     Jul 29, 2022
    shuraver, YuriWerewolf and Databento like this.
  10. M.W.

    Update to my previous post:

    I now run my entire AI-related workflow on a WSL2 instance running Ubuntu on a Windows box. The Clickhouse server runs on Ubuntu in WSL2 too. (The neat thing is that I could configure the storage of the Clickhouse server's databases and tables on a separate NVMe-based RAID array --> extremely fast parallel data retrieval.) Newer tensorflow versions no longer support GPUs on a Windows machine. My deployed deep learning models all run on several nearby Linux-based dedicated servers.

     
    #40     Apr 18, 2023
    Databento likes this.