Tick data storage

Discussion in 'Data Sets and Feeds' started by 931, Jun 29, 2020 at 11:32 AM.

  1. 931

    931

    Lately I have been optimizing data storage, and it appears that time-series data holds enormous potential for custom compression methods that deliver very small formats while still retaining decent decompression speed.
    Initially I was using LZ4, but it does not appear to be the best choice for time-series data, although a modified LZ4 variant named Delta4C is claimed to achieve more.
    However, those products and sources are not yet released; it is a work in progress... https://blog.quasardb.net/introduci...d-adaptive-lossless-compressor-for-timeseries

    With more complex compression algorithms optimized specifically for time series, it is possible to reach at least ~94-98% compression on common market data, i.e. 100 TB shrinking to 2-6 TB.
    But if a disk failure corrupts the data, it is not as restorable as with simpler methods, so I would choose a more easily restorable format.
    Let's say 1% of a file getting corrupted, with 2% lost after data restoration, is acceptable.
    Also, the best (compression ratio / decompression speed) trade-off might be too hard to achieve with higher-complexity algorithms offering high compression ratios.


    My goal is to find the best lossless compression algorithm in terms of the (compression ratio / decompression speed) ratio, to compare against my current implementations.
    Surely some HFT or data-focused firm has better formats, but those would not be obtainable on the web.

    The algorithm should be usable both on files and for in-memory compression.

    What are the best compression algorithms that specifically target time-series data and high decompression speed?
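    To make the kind of win I am seeing concrete, here is a minimal Python sketch of delta encoding in front of a general-purpose compressor. The synthetic random-walk data and the zlib back end are illustrative assumptions, not my actual format; real implementations would also use variable-length/zigzag packing of the deltas.

```python
# Minimal sketch: delta-encode a slowly changing integer series before a
# general-purpose compressor. Synthetic data; zlib stands in for the real
# back end, and variable-length/zigzag packing is omitted for clarity.
import random
import struct
import zlib

def delta_encode(values):
    # Keep the first value as-is; replace each later value with its
    # difference from the previous one.
    return [values[0]] + [b - a for a, b in zip(values, values[1:])]

def delta_decode(deltas):
    # Running sum restores the original series exactly.
    out = [deltas[0]]
    for d in deltas[1:]:
        out.append(out[-1] + d)
    return out

# Synthetic tick prices: a random walk in integer ticks.
random.seed(0)
prices = [10000]
for _ in range(9999):
    prices.append(prices[-1] + random.choice((-1, 0, 1)))

raw = struct.pack(f"<{len(prices)}q", *prices)
packed = struct.pack(f"<{len(prices)}q", *delta_encode(prices))

plain = zlib.compress(raw, 9)
delta_compressed = zlib.compress(packed, 9)

assert delta_decode(delta_encode(prices)) == prices
assert len(delta_compressed) < len(plain)
print(len(raw), len(plain), len(delta_compressed))
```

    The deltas of a slowly changing series come from a tiny alphabet, so the compressor does far better on them than on the raw 64-bit values.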
     
    Last edited: Jun 29, 2020 at 12:18 PM
  2. 931

    931

    To add more context: the algorithm will be used in a custom database that buffers blocks of data from disk to RAM on some threads while other threads work on the data.
    It utilizes both in-memory compression and disk compression; at the moment both use the same format.

    From my initial findings it appeared that, with some relatively simple compression configurations, it is possible to load time-series files at less than half the size and more than double the effective disk speed compared to regular formats, while keeping the compression format "middle out" and the decompression operations extremely cheap.

    Forgot to mention: the buffered data is floating point, and the initial disk-loaded data format is text based, compressed to <0.5 bytes per character.

    The current plan is to use high-capacity M.2 drives rated at 5000 MB/s sequential read and see whether the CPU becomes the bottleneck or the format effectively doubles the disk speed.
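    To illustrate the sub-byte text density mentioned above: one way to hit 0.5 bytes per character is 4-bit "nibble" packing over a 16-symbol alphabet. This is a minimal sketch; the alphabet and line format here are illustrative assumptions, not my actual encoding.

```python
# Minimal sketch of 4-bit "nibble" packing for digit-heavy tick text:
# a 16-symbol alphabet gives exactly 0.5 bytes per character before any
# further compression. The alphabet and line format are illustrative.
ALPHABET = "0123456789.,:- \n"           # 16 symbols -> 4 bits each
CODE = {ch: i for i, ch in enumerate(ALPHABET)}

def pack(text):
    # Two 4-bit codes per output byte; odd-length input is padded with a
    # trailing space, which the caller must strip after unpacking.
    if len(text) % 2:
        text += " "
    return bytes(CODE[a] << 4 | CODE[b] for a, b in zip(text[::2], text[1::2]))

def unpack(data):
    # Each byte expands back into two characters.
    return "".join(ALPHABET[b >> 4] + ALPHABET[b & 0x0F] for b in data)

line = "09:31:00.125,101.25,500\n"       # 24 characters
packed = pack(line)
assert len(packed) == len(line) // 2     # exactly 0.5 bytes per character
assert unpack(packed) == line
```

    Running a general-purpose compressor over the packed bytes is what pushes the density below 0.5 bytes per character.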
     
    Last edited: Jun 29, 2020 at 12:22 PM
  3. Not an answer to your question, but some independent confirmation of your results that compression increases speed as well as obviously disk usage.

    https://code.kx.com/q/wp/compress/

    GAT
     
  4. 2rosy

    2rosy

    What are you compressing? ASCII?
    Take a look at this; it's free and probably better than what you're doing:
    https://parquet.apache.org/
     
  5. 931

    931

    An ASCII-like human-readable format for long-term data storage; floating point for buffering.
     
  6. I use https://github.com/man-group/arctic but it probably won't suit if you want to keep the ASCII format.

    GAT
     
  7. 931

    931

    You probably meant reduces usage.
     
  8. Slaps head....

    GAT
     
  9. 931

    931

    It's not using an ASCII-based character table, but I have a custom text viewer with a decoder to display it as ASCII, UTF-8, Unicode, or whatever Qt apps show text as when a QString gets displayed.
     
  10. 931

    931


    After reading this article I started looking at various compression algorithms and got interested.
    https://github.com/VictoriaMetrics/VictoriaMetrics
    This database software could probably have been written in a lower-level language.

    More importantly, I'm looking for floating-point compression. The text-based format is just for long-term storage and reliability.
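    For floating-point series, a reasonable starting point is the XOR scheme popularized by Facebook's Gorilla paper, which several time-series databases build on. A minimal sketch, with synthetic data and zlib standing in for the real bit-packing stage (both are my assumptions, not any particular database's format):

```python
# Minimal sketch of the XOR trick from Facebook's Gorilla paper for
# floating-point time series. Real implementations bit-pack the leading/
# trailing zero runs of each residual; here zlib stands in for that stage.
# Data and names are illustrative.
import random
import struct
import zlib

def to_bits(v):
    # Reinterpret an IEEE-754 double as a 64-bit unsigned integer.
    return struct.unpack("<Q", struct.pack("<d", v))[0]

def from_bits(b):
    return struct.unpack("<d", struct.pack("<Q", b))[0]

def xor_residuals(values):
    # XOR each double's bit pattern with its predecessor's; close values
    # share sign, exponent, and high mantissa bits, so residuals carry
    # long runs of zero bits.
    bits = [to_bits(v) for v in values]
    return [bits[0]] + [a ^ b for a, b in zip(bits, bits[1:])]

def undo_xor(residuals):
    # XOR is its own inverse, so the round trip is bit-exact (lossless).
    bits = [residuals[0]]
    for r in residuals[1:]:
        bits.append(bits[-1] ^ r)
    return [from_bits(b) for b in bits]

# Synthetic slowly drifting price series.
random.seed(1)
prices = [101.25]
for _ in range(9999):
    prices.append(prices[-1] + random.choice((-0.01, 0.0, 0.01)))

raw = struct.pack(f"<{len(prices)}d", *prices)
xored = struct.pack(f"<{len(prices)}Q", *xor_residuals(prices))

assert undo_xor(xor_residuals(prices)) == prices  # lossless round trip
print(len(zlib.compress(raw, 9)), len(zlib.compress(xored, 9)))
```

    Both encode and decode are a single XOR per value, which keeps decompression in the "extremely cheap" territory I am after.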