How do you guys store tick data?

Discussion in 'Data Sets and Feeds' started by mizhael, Jun 10, 2010.

  1. SQLite vs. KDB?
     
    #41     Jun 14, 2010
  2. januson

    No, they will not; binary is as low-level as you can go. A DB has to read the whole file stream from disk into memory and aggregate on top of that, so it does exactly the same work as if you created your own aggregation over the stream. The difference is that your own solution is specialized and the DB isn't, which means overhead from features you don't actually need (see the stream-aggregation sketch below).
    As a second reason, and maybe the best one: when storing tick data in files (another poster suggested the same), you can partition the files by a time scale, hours, days, etc.; you have to decide that when you write your specification.
    Writing your own solution WILL always perform better/faster, but maybe not be as versatile, and that is exactly what your decision should be based on :p
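
    To make the specialization point concrete, here is a minimal sketch of such a stream aggregator in Python: one pass over a flat binary tick file, rolled up into 1-minute bars. The 16-byte record layout is purely my assumption for illustration, not a format anyone in this thread specified.

    ```python
    import struct
    from collections import namedtuple

    # Assumed record layout: epoch seconds (uint32), price (float64),
    # volume (uint32); 16 bytes per tick, little-endian, no padding.
    TICK = struct.Struct("<IdI")
    Bar = namedtuple("Bar", "minute open high low close volume")

    def one_minute_bars(path):
        """Stream ticks from a flat binary file, yielding 1-minute OHLCV bars."""
        bar = None
        with open(path, "rb") as f:
            while True:
                chunk = f.read(TICK.size)
                if len(chunk) < TICK.size:
                    break
                ts, price, vol = TICK.unpack(chunk)
                minute = ts - ts % 60
                if bar is None or bar.minute != minute:
                    if bar is not None:
                        yield bar  # previous bar is complete
                    bar = Bar(minute, price, price, price, price, vol)
                else:
                    bar = bar._replace(high=max(bar.high, price),
                                       low=min(bar.low, price),
                                       close=price,
                                       volume=bar.volume + vol)
        if bar is not None:
            yield bar
    ```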
     
    #42     Jun 14, 2010
  3. ^^^^ This. I'm building out three years of tick data for the Russell 3000 in a modified Backblaze (http://blog.backblaze.com/2009/09/01/petabytes-on-a-budget-how-to-build-cheap-cloud-storage/).

    All Binary, all accessed via a custom application & backtest engine.
     
    #43     Jun 15, 2010
  4. I use single files to store tick data. From the name of the symbol I create a folder which contains all the files for that symbol. For each trading day I create a new file in this folder. The date is the filename, so it is very easy and fast to access it. The data is zlib-compressed to reduce the time to read the quotes from the file, since decompression in memory is faster than reading from a disk. I store trades/bid/ask, each with price, volume and timestamp at millisecond resolution. From this base I can create all needed timeframes for charting, and you can also do backtesting with ‘look inside bars’ based on bid/ask rather than trades (see the sketch below).
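
    A minimal sketch of that layout in Python, as I read it; the exact record format, the event-type byte and the folder root are my assumptions, not Daniel's spec:

    ```python
    import os
    import struct
    import zlib
    from datetime import date

    # One record per event: kind (1 byte: 0=trade, 1=bid, 2=ask),
    # milliseconds since midnight (uint32), price (float64), volume (uint32).
    REC = struct.Struct("<BIdI")

    def day_path(root, symbol, day):
        # e.g. ticks/ES/2010-06-15.bin.z  (folder per symbol, file per day)
        return os.path.join(root, symbol, day.isoformat() + ".bin.z")

    def write_day(root, symbol, day, events):
        """events: iterable of (kind, ms, price, volume) for one trading day."""
        path = day_path(root, symbol, day)
        os.makedirs(os.path.dirname(path), exist_ok=True)
        raw = b"".join(REC.pack(*e) for e in events)
        with open(path, "wb") as f:
            f.write(zlib.compress(raw))  # decompressing in RAM beats extra disk I/O

    def read_day(root, symbol, day):
        with open(day_path(root, symbol, day), "rb") as f:
            raw = zlib.decompress(f.read())
        return [REC.unpack_from(raw, off) for off in range(0, len(raw), REC.size)]

    # write_day("ticks", "ES", date(2010, 6, 15), [(0, 34200000, 1089.25, 10)])
    ```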

    Daniel
     
    #44     Jun 15, 2010
    Axon likes this.
  5. januson

    #45     Jun 15, 2010
  6. #46     Jun 18, 2010
  7. Ummm, because it costs a lot of money?
     
    #47     Jun 18, 2010
  8. I store it as binary data. Here's how:

    one file per day

    HEADER:

    first 3 bytes are the date (one byte each: the year as an offset from 1900, the month, and the day)

    then 3 bytes (unsigned) are the time of the first tick of the day (time is measured in seconds since 00:00)

    then 4 bytes (unsigned) for the price of the first tick
    and 2 bytes (unsigned) for volume of the tick

    THE REST:

    For every tick {

    3 bytes (unsigned) for the change in time since the last tick, in seconds. Usually 0 or +1

    2 bytes (signed) for the change in price since the last tick, in ticks, which is usually 0, +1 or -1 (that would represent a price change of 0, +0.25 or -0.25 for ES)

    2 bytes (unsigned) for volume of the tick

    }



    The byte array is then zipped (compressed). Compression is very effective since most of the stored values are 0, +1 or -1.
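
    A rough Python rendering of this layout; the big-endian byte order and the price-to-tick scaling are my assumptions, while the field widths are the poster's:

    ```python
    import zlib

    def encode_day(year, month, day, ticks, tick_size=0.25):
        """ticks: time-ordered list of (seconds_since_midnight, price, volume).
        Prices are stored as integer multiples of tick_size (0.25 for ES)."""
        sec0, px0, vol0 = ticks[0]
        units0 = round(px0 / tick_size)
        out = bytearray()
        # HEADER: 3-byte date, 3-byte first-tick time, 4-byte price, 2-byte volume
        out += bytes((year - 1900, month, day))
        out += sec0.to_bytes(3, "big")
        out += units0.to_bytes(4, "big")
        out += vol0.to_bytes(2, "big")
        # THE REST: per-tick deltas; mostly 0/+1/-1, which is why zlib bites so well
        prev_sec, prev_units = sec0, units0
        for sec, px, vol in ticks[1:]:
            units = round(px / tick_size)
            out += (sec - prev_sec).to_bytes(3, "big")
            out += (units - prev_units).to_bytes(2, "big", signed=True)
            out += vol.to_bytes(2, "big")
            prev_sec, prev_units = sec, units
        return zlib.compress(bytes(out))
    ```

    A decoder just runs the steps in reverse: decompress, read the header, then accumulate the deltas tick by tick.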

    Storing data in non-binary format (like CSV) is very inefficient space-wise and performance-wise. All textual data have to be parsed. I experimented with binary vs. text-based data and found that parsing text data is about 8 times slower than reading in binary data. And the data size differences are very obvious. Binary+compressed data takes up about 20 times less hard disk space than CSV data.
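
    A quick, self-contained way to reproduce that kind of comparison (my sketch; the exact ratios will vary with the record layout, the parser and the hardware):

    ```python
    import random
    import struct
    import time
    import zlib

    N = 1_000_000
    ticks = [(i, 1000 + random.randint(-2, 2), random.randint(1, 50))
             for i in range(N)]

    # Same data twice: CSV text vs. packed 10-byte binary records
    csv_blob = "\n".join(f"{t},{p},{v}" for t, p, v in ticks).encode()
    rec = struct.Struct("<IiH")
    bin_blob = b"".join(rec.pack(*t) for t in ticks)

    t0 = time.perf_counter()
    parsed = [tuple(map(int, line.split(b","))) for line in csv_blob.splitlines()]
    t1 = time.perf_counter()
    unpacked = list(rec.iter_unpack(bin_blob))
    t2 = time.perf_counter()

    print(f"text parse : {t1 - t0:.2f}s")
    print(f"binary read: {t2 - t1:.2f}s")
    print(f"CSV bytes={len(csv_blob)}  binary+zlib bytes={len(zlib.compress(bin_blob))}")
    ```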

    I haven't experimented with databases. But in my case using a DB would be a huge and unnecessary overhead.
     
    #48     Jun 19, 2010
  9. I don't store tick data, but if I did I'd store it in SQL.
    I store 800 million quotes in MySQL (daily, weekly and 15-second quotes) and run various tests against them.

    A relational database can never be faster than a CSV or binary format when it comes to raw data processing. However, the way my backtesting algorithms work, they take advantage of some of the conveniences of a relational database, and I'd hate to duplicate that myself by processing a CSV/binary file.

    Most of my actual backtesting is done in memory, so there's not much difference once the structure is loaded. It will take longer to load the content from a MySQL database, but then again, I can easily compare a stock's gaps with gaps on the S&P 500 without having to build that myself; I can just use standard (ANSI) SQL, as in the sketch below. I guess it depends on what your trading timeframe is, and whether or not you want to correlate your strategy with index performance to optimize it ...
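
    For illustration, a hypothetical gap-vs-index query along those lines; the schema, table and column names are made up, and the previous-day join ignores weekends and holidays for brevity:

    ```python
    import mysql.connector  # assumes the mysql-connector-python package

    # Assumed schema: daily_quotes(symbol, day, open, close).
    # Each stock's opening gap next to the same-day S&P 500 gap, in one query.
    GAP_SQL = """
    SELECT q.symbol, q.day,
           (q.open - p.close) / p.close    AS stock_gap,
           (iq.open - ip.close) / ip.close AS index_gap
    FROM daily_quotes q
    JOIN daily_quotes p  ON p.symbol = q.symbol AND p.day  = q.day - INTERVAL 1 DAY
    JOIN daily_quotes iq ON iq.symbol = 'SPX'   AND iq.day = q.day
    JOIN daily_quotes ip ON ip.symbol = 'SPX'   AND ip.day = p.day
    WHERE q.day >= %s
    """

    conn = mysql.connector.connect(user="bt", database="quotes")
    cur = conn.cursor()
    cur.execute(GAP_SQL, ("2010-01-01",))
    for symbol, day, stock_gap, index_gap in cur:
        ...  # feed the backtest; no hand-rolled gap matching needed
    ```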

    For my type of trading, I'd gain very little by processing the data as binary/CSV.
     
    #49     Jun 19, 2010

  10. PostgreSQL for the persistent store, but I frequently cache data in memory, just as arrays, though you could use b-trees, tries, hash maps, etc., depending on your needs.
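
    A minimal sketch of that pattern, assuming the psycopg2 driver and a hypothetical tick table:

    ```python
    import psycopg2  # table and column names below are made up

    def load_ticks(symbol, day):
        """Pull one day of ticks out of PostgreSQL into a plain in-memory list."""
        conn = psycopg2.connect(dbname="marketdata")
        try:
            with conn.cursor() as cur:
                cur.execute(
                    "SELECT ts, price, volume FROM tick "
                    "WHERE symbol = %s AND ts::date = %s ORDER BY ts",
                    (symbol, day),
                )
                return cur.fetchall()  # list of (ts, price, volume) tuples
        finally:
            conn.close()

    _cache = {}  # (symbol, day) -> tick array; repeat runs skip the DB round trip

    def ticks(symbol, day):
        key = (symbol, day)
        if key not in _cache:
            _cache[key] = load_ticks(symbol, day)
        return _cache[key]
    ```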

    Note: even with PostgreSQL, which I've had zero problems with, I make backups.
     
    #50     Jun 19, 2010