Datastore design recommendations

Discussion in 'Automated Trading' started by Clark Bruno, Jan 24, 2022.

  1. jackfx

    When your algos/strategies are still changing, use whatever tools get you to profit. If you use SQL, create indexes to speed up your queries. Cloud storage offers SQL, but I'm not sure it will automatically create indexes based on your query patterns. Text, Excel, NoSQL, AI/ML, exotic, cloud - all are fine while algos/strategies are changing. Any tools that surface anomalies are welcome.
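
    (As an aside, a minimal sketch of the "create an index to speed up queries" point, using SQLite through Python's standard library; the table and column names are made up for illustration:)

    ```python
    # Hypothetical local tick table: without the composite index, a
    # symbol + time-range filter has to scan the whole table.
    import sqlite3

    con = sqlite3.connect("ticks.db")
    con.execute("""
        CREATE TABLE IF NOT EXISTS ticks (
            symbol TEXT    NOT NULL,
            ts     INTEGER NOT NULL,   -- epoch nanoseconds
            price  REAL    NOT NULL,
            size   INTEGER NOT NULL
        )
    """)
    con.execute("CREATE INDEX IF NOT EXISTS idx_ticks_symbol_ts ON ticks (symbol, ts)")

    rows = con.execute(
        "SELECT ts, price, size FROM ticks WHERE symbol = ? AND ts BETWEEN ? AND ?",
        ("ESH2", 1643155200000000000, 1643241600000000000),
    ).fetchall()
    ```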
     
    Last edited by a moderator: Jan 26, 2022
    #21     Jan 26, 2022
  2. jackfx

    Short answer: one symbol per table, and one algo/strategy/feature/exotic per table. So basic ticks go in one table and ticks with your secret sauce go in another, which means [1 + (number of secrets)] tables. If a cloud service is smart enough, it can watch your SQL query patterns and build indexes behind the scenes, so the user isn't aware of it but is amazed by the performance. Personally, I think that's just a cloud vendor's selling point; behind it all it's still indexes and flat tables (with proprietary storage management). Let me know if such an engine exists that is free of charge and runs on local storage - I'd be quite amazed.
    In the research stage, just use whatever tool is fastest at producing views/graphs/anomalies. Only worry about storage vs. query performance once you're profitable. Unfortunately, an algo may not last long, or may keep evolving (for some), so you end up needing both fast storage and fast queries. You could look for a cloud "smart engine" that watches your query patterns, but then your SECRET SAUCE risks being STOLEN by hackers (if the engine runs in the cloud).
     
    Last edited: Jan 26, 2022
    #22     Jan 26, 2022
    bitcuration likes this.
  3. I suspect that if you tell the SQLite guys their database isn't as fast as the cloud one because the cloud one builds indexes in the background based on access patterns, they'll implement that functionality. Someone just needs to figure out how to make Dr. Hipp jealous that the cloud does querying better.
     
    #23     Jan 26, 2022
  4. Databento (Sponsor)

    There's a fair bit of misinformation here. We maintain a small, 9-node Clickhouse cluster (40 TB) here at Databento for analytics. We also have experience with kdb, and a fair amount of experience storing order book and tick data (7+ PB raw and growing).

    Clickhouse will perform very well against other column-oriented databases such as Vertica and kdb when compared on the same single node, because Clickhouse exploits many of the same hardware codesign principles, e.g. SIMD, instruction pipelining, linear memory traversal patterns, inline compression, page-size-optimized index structures, etc. It also has the distinct advantage of being free, so you can spend your budget on clustering more hardware for more performance. For our target writes and queries, our Clickhouse cluster outperforms a kdb setup of the same cost.

    Some selling points of kdb in this situation would be (a) a more flexible architecture if you want to support multiple use cases besides backtesting and real-time capture, and (b) a more powerful query language. You can solve least squares on the server side over a massive design matrix in a single line of q, but you can't do that with Clickhouse.
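
    (For contrast, the client-side route would mean pulling the design matrix back over the wire and solving it locally - roughly the following, as a sketch in Python/numpy with made-up shapes:)

    ```python
    # Client-side least squares after fetching X and y from the database,
    # for contrast with kdb's one-line server-side solve. Shapes are made up;
    # the real cost is shipping the full design matrix to the client first.
    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.standard_normal((1_000_000, 20))          # stand-in for the fetched design matrix
    y = X @ rng.standard_normal(20) + rng.standard_normal(1_000_000)

    beta, *_ = np.linalg.lstsq(X, y, rcond=None)      # coefficients only
    ```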

    Before asking whether you should store the data in one large table, first ask whether you have a good reason to store your data in a column-oriented database. Do you need:
    - The flexibility to write arbitrary queries (e.g. online queries, summary statistics, etc.)?
    - To push compute to the data? As you've pointed out, your workflow could be slower on the client side than on the server side.
    - To typically run queries whose qualifying results are smaller than the source data?

    The last criterion is probably the most important here in answering your original question. A column-oriented database is a good idea when you have patterns where you filter and/or aggregate on columns, largely because it exploits linear scan on the filtered column and late materialization and deferred tuple construction on the (intermediate) results. This partially answers your question: if you already have a good reason to store the data in a column-oriented database, then you will probably have a good reason to store all of the symbols in one single large table to exploit the things it is good at - filtering on the symbol column, range selection on the time column and aggregating on other columns. Backtesting usually fails this criterion, because you're likely just seeking through contiguous data from start to end without any qualification.
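
    To make the filter/aggregate pattern concrete, here is a sketch of the kind of query a column store is good at, assuming a single large ticks table with symbol, ts, price and size columns (illustrative names, sent here through the clickhouse-driver Python client):

    ```python
    # Filter on symbol, range-select on time, aggregate other columns -
    # the access pattern that favours a column-oriented store. Table and
    # column names are hypothetical.
    from clickhouse_driver import Client

    client = Client(host="localhost")
    rows = client.execute(
        """
        SELECT
            toStartOfMinute(ts)           AS minute,
            sum(price * size) / sum(size) AS vwap,
            count()                       AS trades
        FROM ticks
        WHERE symbol = %(symbol)s
          AND ts >= %(start)s AND ts < %(end)s
        GROUP BY minute
        ORDER BY minute
        """,
        {"symbol": "ESH2", "start": "2022-01-24 00:00:00", "end": "2022-01-25 00:00:00"},
    )
    ```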

    The other strong reason to store all of your symbols in one large table is an operational one: adding a table each time you add an instrument is not very maintainable, joining across them is expensive, and prematurely paying a higher maintenance cost before you've found a repetitive use case is usually a bad idea.
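
    As a sketch, the single large table might look something like this (the partition and sort keys below are illustrative assumptions, not a prescription):

    ```python
    # One table for all symbols rather than one table per instrument.
    # The PARTITION BY / ORDER BY choices here are assumptions for illustration.
    from clickhouse_driver import Client

    client = Client(host="localhost")
    client.execute(
        """
        CREATE TABLE IF NOT EXISTS ticks (
            symbol LowCardinality(String),
            ts     DateTime64(9),
            price  Float64,
            size   UInt32
        )
        ENGINE = MergeTree
        PARTITION BY toYYYYMM(ts)
        ORDER BY (symbol, ts)
        """
    )
    ```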

    You may also find these popular posts by another one of our engineers useful:
    - What's the most efficient way to store options and time series data for backtesting?
    - Alternatives to RDBMS for options backtesting
     
    #24     Jan 26, 2022
    Sprout, shuraver, lariati and 6 others like this.
  5. Best post in this thread so far. And I agree, most posts spout misinformation or are perhaps just a reflection of a lack of knowledge of how to deal with large datasets. I have outlined my reasons why I need a columnar data store and the type of queries I run.

    But at the end of the day the data has to be stored and queried somehow, so the most important question to answer first is what the RELATIVELY best and most optimal way to store such data is. Text files are completely out for simple, logical reasons: they are just not feasible for any queries. They don't even perform well for backtesting over large amounts of data; streaming and parsing from strings is far too slow to achieve the ingest rate of millions of data points per second required to run large backtests within a reasonable amount of time. (CH reads entire partitions, and does so multi-threaded, which still lets me stream the data at over 5 million ticks per second.)
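
    (For what it's worth, a rough sketch of what that streaming read looks like from Python with clickhouse-driver; the ticks table and column names are hypothetical, and throughput obviously depends on hardware and schema:)

    ```python
    # Stream rows out of ClickHouse block by block instead of materializing
    # the whole result set in memory, then feed them to the backtest engine.
    from clickhouse_driver import Client

    client = Client(host="localhost", settings={"max_block_size": 100_000})
    ticks = client.execute_iter(
        "SELECT symbol, ts, price, size FROM ticks "
        "WHERE ts >= '2022-01-01' AND ts < '2022-02-01' "
        "ORDER BY ts"
    )
    for symbol, ts, price, size in ticks:
        pass  # hand each tick to the backtest engine here
    ```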

    Memory mapping is, from my experience, a misconceived concept. Either the data is on disk or it's in memory; you can't have the benefits of both at the same time.

    Relational databases are not designed for the queries I intend to run and do not provide the performance necessary for temporal sequence data.

    This leaves me with binary flat file storage which I have used for years or with a columnar database.

    In the end I decided to switch over to a columnar database because the particular one (CH) is already widely adopted enough that it provides APIs for the languages I use most often. With binary flat files, the serialization/deserialization mechanisms differ from one language to another, endianness is an issue, and the way the data is parsed into arrays or POCOs differs too. I just found it far easier to go with a well-developed open-source TSDB than to keep going down the road of binary flat files.
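
    (To illustrate the endianness/serialization point, here is a sketch of a hand-rolled binary tick record using Python's struct module; the record layout is made up, and this is exactly the kind of detail you have to keep consistent across every language yourself:)

    ```python
    # The '<' prefix pins little-endian byte order explicitly; leave it to the
    # platform default in one language's reader and the records can decode wrong.
    # The 28-byte record layout here is hypothetical.
    import struct

    TICK = struct.Struct("<q8sdI")  # int64 ts_ns, 8-byte symbol, float64 price, uint32 size

    def write_ticks(path, ticks):
        with open(path, "wb") as f:
            for ts_ns, symbol, price, size in ticks:
                f.write(TICK.pack(ts_ns, symbol.encode(), price, size))

    def read_ticks(path):
        with open(path, "rb") as f:
            while chunk := f.read(TICK.size):
                ts_ns, symbol, price, size = TICK.unpack(chunk)
                yield ts_ns, symbol.rstrip(b"\0").decode(), price, size
    ```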

    You shared some valuable insights, thanks for that. I was doubtful that anyone would set up thousands of tables for the US equity universe alone. It makes no sense, and cross-join queries would kill all performance.

     
    Last edited: Jan 26, 2022
    #25     Jan 26, 2022
  6. MrMuppet

    A sponsor that actually provides valuable information instead of just shilling. Thumbs up
     
    #26     Jan 26, 2022
    d08, Sprout, M.W. and 2 others like this.
  7. Databento (Sponsor)

    No problem.

    Well, these days we see most of the top trading firms using a binary flat file format on POSIX storage for their backtesting workflows and scaling their IO with a distributed file system like Weka, VAST or Lustre. But as you've pointed out, that takes quite a bit of work. You could consider Parquet for the position you're in.
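
    (A sketch of the Parquet route in Python with pyarrow, using the same hypothetical tick columns as above; partitioning by date is just one common choice, not a recommendation:)

    ```python
    # Write tick data as a date-partitioned Parquet dataset, then read back
    # only the rows matching a filter. Column names are illustrative.
    import pyarrow as pa
    import pyarrow.parquet as pq
    import pyarrow.dataset as ds

    table = pa.table({
        "symbol": ["ESH2", "ESH2"],
        "ts": [1643155200000000000, 1643155200000001000],
        "price": [4400.25, 4400.50],
        "size": [3, 1],
        "date": ["2022-01-26", "2022-01-26"],
    })

    pq.write_to_dataset(table, root_path="ticks_parquet", partition_cols=["date"])

    # Predicate pushdown: only matching row groups/partitions are read.
    subset = ds.dataset("ticks_parquet", format="parquet").to_table(
        filter=(ds.field("symbol") == "ESH2")
    )
    ```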

    But otherwise - to shill a little - we are open-sourcing our binary flat file format, along with our Python client library and documentation, in about a month's time (with a C++ library to follow); it will do a fair bit of that work for you. It is very similar to the internal formats used at top prop firms. Tested briefly with 4,010,830 entries of MBO data on my 2020 MacBook Air:

    - Writes at 1.3 GB/s.
    - Reads at 3.5 GB/s (200+ million events per second).
    - Takes up 71 MB on disk. Compared to 81 MB (HDF5, gzip, complevel=7), 223 MB (HDF5, gzip, complevel=0), 97 MB (parquet, snappy).
    - Peak memory use at 192.7 MB. Compared to 321.5 MB for pandas/pyarrow.
    - We include much of the serialization/deserialization for you in the client libraries. In CPython:
      - Takes about 6.436 seconds to convert to a pandas DataFrame.
      - Also converts to a list type about as fast.
      - Full order book replay at 154k events per second dispatching a pass-through callback. (And much faster in our internal C libraries and on our storage cluster.)
    - The same data structure is used in our real-time and historical data services, which also transcode from binary to CSV or JSON (and Parquet later) on the server side. And we do have coverage of FX order book and tick data.
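
    (Not our format, just a generic illustration of why binary flat files can reach read rates like these: fixed-width records can be pulled in with one vectorized read via a numpy structured dtype and handed straight to pandas. The 28-byte layout here is hypothetical:)

    ```python
    # Bulk-read fixed-width binary records with a structured dtype, then wrap
    # them in a DataFrame. Record layout is made up for illustration.
    import numpy as np
    import pandas as pd

    record = np.dtype([
        ("ts_ns",  "<i8"),
        ("symbol", "S8"),
        ("price",  "<f8"),
        ("size",   "<u4"),
    ])

    events = np.fromfile("ticks.bin", dtype=record)  # single vectorized read
    df = pd.DataFrame(events)                        # one column per field
    ```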
     
    Last edited: Jan 26, 2022
    #27     Jan 26, 2022
  8. Those are quite impressive numbers. Though from my experience, most backtest engines become the bottleneck at around 10 million data points per second, even when the data feed could stream the data a lot faster. The more complex the risk checks and the algorithm itself, the more this weighs on performance beyond 10 million/sec. Then again, HFT algorithms are fairly straightforward in nature, so I would not be surprised if their testing harnesses performed better. I don't play in that space... so it's hard for me to judge.

     
    #28     Jan 26, 2022
  9. Would you please stop polluting my thread with your completely unrelated and random comments?

     
    #29     Jan 26, 2022
  10. That's pretty sick.
     
    #30     Jan 26, 2022