Datastore design recommendations

Discussion in 'Automated Trading' started by Clark Bruno, Jan 24, 2022.

  1. M.W.

    M.W.

    @Databento, question for you guys: do you offer access to either the raw or curated Twitter feeds via API? I am currently tinkering with the GPT-4 API and see a lot of potential to use GPT-4 to speed up interpretation and analysis of Fed and corporate news releases, which I might want to base several trading algorithms on. Twitter itself does not seem to offer full access to the entire raw feed to end users.
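
    A minimal sketch of the kind of GPT-4 call meant here, assuming the `openai` Python package (>=1.0); the prompt, labels, and example statement are purely illustrative:

```python
# Illustrative only: classify a macro release as hawkish/dovish/neutral.
# Assumes OPENAI_API_KEY is set in the environment.
from openai import OpenAI

client = OpenAI()

statement = (
    "The Committee decided to maintain the target range for the "
    "federal funds rate..."
)

resp = client.chat.completions.create(
    model="gpt-4",
    temperature=0,
    messages=[
        {"role": "system",
         "content": "Label the following release as hawkish, dovish, or "
                    "neutral, with one sentence of reasoning."},
        {"role": "user", "content": statement},
    ],
)
print(resp.choices[0].message.content)
```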
     
    #41     Apr 18, 2023
  2. Databento

    Databento Sponsor

    Nice, I'm glad it worked out for you, and thanks for the update.

    Unfortunately, we don't.

    Fed releases and other macro releases are difficult because there are premium feeds like AlphaFlash specifically optimized to monetize those events and deliver them from the embargo facility, e.g. K St in DC, to the trading PoPs in the least amount of time. This sort of long-haul wireless connectivity is not our expertise. And the companies who subscribe to these do have sophisticated NLP modeling capability and the ability to put on size.

    There's probably a lot you could try with EDGAR filings, though. That's on our roadmap, but you can honestly scrape EDGAR with open-source tools pretty easily. We know several customers using RavenPack for something like this.
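
    As a rough illustration of the "open-source tools" route, a hedged sketch against the public EDGAR submissions endpoint; the CIK, User-Agent, and field names reflect the documented layout at the time of writing and should be verified against sec.gov:

```python
# Illustrative sketch: list a company's recent 8-K/10-Q/10-K filings from EDGAR.
import requests

CIK = "0000320193"  # example CIK (Apple), zero-padded to 10 digits
HEADERS = {"User-Agent": "your-name your-email@example.com"}  # SEC asks for contact info

url = f"https://data.sec.gov/submissions/CIK{CIK}.json"
data = requests.get(url, headers=HEADERS, timeout=30).json()

recent = data["filings"]["recent"]
for form, date, accession, doc in zip(
    recent["form"], recent["filingDate"],
    recent["accessionNumber"], recent["primaryDocument"],
):
    if form in ("8-K", "10-Q", "10-K"):
        acc = accession.replace("-", "")
        print(date, form,
              f"https://www.sec.gov/Archives/edgar/data/{int(CIK)}/{acc}/{doc}")
```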
     
    #42     Apr 18, 2023
    YuriWerewolf likes this.
  3. M.W.

    M.W.

    Got it, thanks for the prompt reply.

     
    #43     Apr 18, 2023
    Databento likes this.
  4. @Clark Bruno, thank you so much for starting up such an interesting thread.

    Would you please update us on the final schema you settled on, and what other infrastructure you are currently using for querying your bid-ask tick data?

    And suppose you need to store not just the bid-ask but also the order book market depth, for at least 20 levels or even the full depth of the book: would you still use the same schema, or would you need to modify it in that case?
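
    For illustration only, one common way to keep a fixed 20-level book queryable is a wide row per update, with one column per side, field, and level (a "long" table with one row per level is the usual alternative); the table and column names below are made up:

```python
# Hypothetical wide-format schema for 20 levels of depth in DuckDB.
import duckdb

LEVELS = 20
depth_cols = ",\n    ".join(
    f"{side}_{field}_{lvl:02d} DOUBLE"
    for lvl in range(LEVELS)
    for side in ("bid", "ask")
    for field in ("px", "sz")
)

con = duckdb.connect("orderbook.duckdb")
con.execute(f"""
    CREATE TABLE IF NOT EXISTS book_snapshots (
        ts_event TIMESTAMP,
        symbol   VARCHAR,
        {depth_cols}
    )
""")
```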

    I was looking for an efficient solution for storing and analyzing many TB of order book data for many different symbols in a single table, and then querying it to calculate a few additional columns of aggregated data for further analysis and visualization, when I stumbled upon your thread, which has been very informative. So I am eager to hear about your final approach in this regard.

    Thanks and regards
     
    #44     Jun 12, 2023
  5. M.W.

    M.W.

    Hello, I re-read your post and came across your point about wanting to segregate storage and compute. I am currently working on some design changes across my entire data acquisition, analytics, research, and algo development pipeline.

    Have you ever come across or used TimeBase? It integrates quite neatly with ClickHouse and takes care of data management (such as order book build-up) after acquisition, while outsourcing storage to ClickHouse (or other DBs). It's open source, and I wonder what your take on TimeBase is.

    I am in particular looking for a solution that can handle both the management of real-time multi-symbol data and the replay of stored time series (ideally, rebuilding the data stream in chronological order across different symbols)...
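
    The replay part is essentially a k-way merge by timestamp, independent of TimeBase or ClickHouse; a minimal sketch, with an illustrative record shape:

```python
# Merge several per-symbol streams (each already sorted by timestamp)
# into one chronological stream, e.g. for backtest replay.
import heapq
from typing import Iterable, Iterator, Tuple

Record = Tuple[int, str, dict]  # (ts_nanos, symbol, payload) -- illustrative

def replay(streams: Iterable[Iterator[Record]]) -> Iterator[Record]:
    """Yield records from all streams in global timestamp order."""
    return heapq.merge(*streams, key=lambda rec: rec[0])

# Usage: each iterator could be a ClickHouse cursor, a DBN reader, etc.
es = iter([(1, "ES", {"px": 4500.00}), (3, "ES", {"px": 4500.25})])
nq = iter([(2, "NQ", {"px": 15500.0}), (4, "NQ", {"px": 15501.0})])
for ts, sym, payload in replay([es, nq]):
    print(ts, sym, payload)
```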

     
    #45     Jul 10, 2023
  6. Databento

    Databento Sponsor

    We've not used TimeBase, but it looks like it's proprietary software by Deltix. Deltix itself is a reasonably good company, but we don't have data points beyond that.

    We just use our own open-source file format (Databento Binary Encoding) for exactly the purpose you've described—it serves as an efficient multi-symbol, time-ordered file storage format while also acting as a message format for real-time data.
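
    A minimal sketch of reading such a file with the `databento` Python client; the file name is a placeholder:

```python
# Sketch: the same DBN file serves as time-ordered, multi-symbol storage
# and can be replayed record-by-record or loaded columnar.
import databento as db

store = db.DBNStore.from_file("glbx-mdp3-20230710.mbp-1.dbn.zst")  # placeholder path

# Records come back in the order they were written (time-ordered across symbols).
for record in store:
    print(record)
    break

df = store.to_df()  # or work with the whole file as a DataFrame
print(df.head())
```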
     
    #46     Jul 10, 2023
    M.W. and shuraver like this.
  7. d08

    d08

    I was recently looking into this, and while ClickHouse seemed decent, I ended up going with DuckDB for persistence and Polars for in-memory operations. In recent tests DuckDB has been showing better numbers than CH, but that's not all that matters; I find CH harder to work with, and I like that DDB is single-file instead of client/server.
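
    A small sketch of that split, assuming the `duckdb` and `polars` packages; the table and column names are illustrative:

```python
# DuckDB as the single-file store, Polars for in-memory work.
import duckdb
import polars as pl

con = duckdb.connect("ticks.duckdb")  # one file on disk, no server process

con.execute("""
    CREATE TABLE IF NOT EXISTS quotes (
        ts TIMESTAMP, symbol VARCHAR, bid DOUBLE, ask DOUBLE
    )
""")

# Pull a filtered slice straight into a Polars DataFrame.
df: pl.DataFrame = con.sql(
    "SELECT ts, symbol, bid, ask FROM quotes WHERE symbol = 'ESU3' ORDER BY ts"
).pl()

# In-memory feature engineering stays in Polars.
df = df.with_columns(((pl.col("bid") + pl.col("ask")) / 2).alias("mid"))
```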
     
    #47     Jul 11, 2023
    Databento and shuraver like this.
  8. Databento

    Databento Sponsor

    Interesting, thanks for sharing.

    It seems the single-file limitation is the primary trade-off here. In our case we need horizontal scale-out, but a single server/file is plenty for most users and use cases.

    Polars is nice. We're one of their official sponsors. ;)
     
    #48     Jul 11, 2023
    peechu and d08 like this.
  9. M.W.

    M.W.

    What tests did you run that made you feel DuckDB is faster than CH? I did not get the same results at all, but it heavily depends on what you test. I used the NY taxi DB to run analytics, and CH was always faster than DuckDB, sometimes orders of magnitude faster. CH also performed better for data streams as well as for reads and writes of large time series. But I did spin up multiple clients, multithreaded; it scales exceptionally well.

    Another advantage is that CH interfaces with many visualization tools, such as Grafana. I'm not sure whether DuckDB does, too.

    I managed to use CH as backend storage for TimeBase. I now have a single solution across the board that handles live data streams, off which I can run trading algorithms and/or persist the incoming data to disk/DB, and I can use the very same platform to request time series from CH that are streamed through TimeBase for backtests. I am migrating to an integrated platform that lets me use the exact same technology whether I profile new ideas, test them, or implement and trade algorithms, all on the same architecture. Plus I can do so via Python, C++, or C# without changing a thing in terms of data storage and event processing, and I can quickly load any filtered ticks/bars/whatever into Polars dataframes. No more wasted time with static and pricing data.

    On top of that, TimeBase comes from a highly respected house that specializes in high-performance financial trading and risk management solutions, and their code generator saves a lot of time creating POCOs and other boilerplate code.
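
    As a hedged sketch of the "filtered ticks straight into Polars" step, assuming the `clickhouse-connect` driver and an illustrative `ticks` table, with the TimeBase layer left out:

```python
# Query ClickHouse into Arrow, then into Polars for further analysis.
import clickhouse_connect
import polars as pl

client = clickhouse_connect.get_client(host="localhost", port=8123)

arrow_table = client.query_arrow(
    "SELECT ts, symbol, price, size FROM ticks "
    "WHERE symbol = 'ES' AND ts >= '2023-07-10' ORDER BY ts"
)
df = pl.from_arrow(arrow_table)

# Example aggregation: per-symbol VWAP over the filtered slice.
vwap = df.group_by("symbol").agg(
    ((pl.col("price") * pl.col("size")).sum() / pl.col("size").sum()).alias("vwap")
)
print(vwap)
```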

    That makes me strongly favor TimeBase/CH over other solutions in the open-source space. Time will tell whether this particular solution eventually makes it into my production architecture, but I am quite optimistic atm.

     
    Last edited: Jul 11, 2023
    #49     Jul 11, 2023
    shuraver and Databento like this.
  10. d08

    d08

    https://duckdb.org/2023/04/14/h2oai.html

    You can question the objectivity, but it's a standardised test.
    DuckDB used to be much slower, but that has changed in the last year or so, comparing with the same test from 2021. TimeBase seems proprietary, and for that reason I will avoid it like the plague. I might choose to go commercial with what I'm working on, so locking myself into a proprietary platform is a no-go.

    I will only need to interface using Python; the backends of the packages I use are in C, C++, and Rust, so there's not much of a speed issue.

    Another tool to keep in mind is Ibis. They intend to standardise working with any backend DB system, so whether it's DuckDB or ClickHouse, switching wouldn't mean changing any part of the code outside of spinning up the server / library imports.
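
    A sketch of that idea, assuming the `ibis-framework` package with the DuckDB (or ClickHouse) backend installed; the table and column names are illustrative, and only the connect line changes between backends:

```python
# The same expression runs against DuckDB or ClickHouse.
import ibis

con = ibis.duckdb.connect("ticks.duckdb")
# con = ibis.clickhouse.connect(host="localhost")  # swap backends here only

trades = con.table("trades")
summary = trades.group_by("symbol").aggregate(
    volume=trades["size"].sum(),
    vwap=(trades["price"] * trades["size"]).sum() / trades["size"].sum(),
)
print(summary.to_pandas())
```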
     
    #50     Jul 11, 2023