How do you guys store tick data?

Discussion in 'Data Sets and Feeds' started by mizhael, Jun 10, 2010.

  1. Yep, but you can't query binary files quickly, right?

    What if you need mean/max/min prices aggregated over 5-millisecond windows?
     
    #21     Jun 10, 2010
  2. Not to mention aligning multiple series at the millisecond level...
     
    #22     Jun 10, 2010
  3. I'm showing around 300 MB for a month of the tick data as you described (I specifically checked QQQQ). Let's say that six months, then, is around 2.5-3 gigs (I'm padding the estimate a bit). Then yes, R can store it quite easily, assuming you have the RAM/swap available. I've got 8 GB on this machine, so it isn't a problem.
     
    #23     Jun 10, 2010
  4. promagma

    You won't be able to use SQL, but do you really want to rely on a database engine to aggregate data? Stick with reading in the raw tick data and aggregating it in your C++ code. It's not hard, and there is no faster way to do it.
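
    Here is a rough sketch of that aggregation loop, which also covers the 5 ms mean/max/min question above. The Tick layout is made up for illustration - match it to however your file was actually written:

    Code:
    #include <cstdint>
    #include <vector>

    // Hypothetical fixed-width record; match this to your file format.
    #pragma pack(push, 1)
    struct Tick {
        int64_t ts_ms;  // timestamp, milliseconds since epoch
        double  price;
        int32_t size;
    };
    #pragma pack(pop)

    struct Bar {
        int64_t start_ms;        // bucket start time
        double  mean, max, min;
        int64_t count;
    };

    // One pass over ticks already sorted by timestamp: bin into 5 ms
    // buckets, tracking sum/max/min, and emit a Bar on each rollover.
    std::vector<Bar> aggregate5ms(const std::vector<Tick>& ticks) {
        std::vector<Bar> bars;
        const int64_t width = 5;  // bucket width in ms
        int64_t bucket = -1, n = 0;
        double sum = 0, mx = 0, mn = 0;
        for (const Tick& t : ticks) {
            int64_t b = t.ts_ms / width;
            if (b != bucket) {
                if (n > 0) bars.push_back({bucket * width, sum / n, mx, mn, n});
                bucket = b; sum = 0; n = 0; mx = mn = t.price;
            }
            sum += t.price; ++n;
            if (t.price > mx) mx = t.price;
            if (t.price < mn) mn = t.price;
        }
        if (n > 0) bars.push_back({bucket * width, sum / n, mx, mn, n});
        return bars;
    }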
     
    #24     Jun 10, 2010
  5. promagma

    Your data should be pre-ordered, as in a binary file or HDF5. Then aligning the data is easy. Again, you can probably do this faster with your own multi-threaded code than by relying on a database engine.

    EDIT: I just saw that you can get your hands on KDB, which is pretty awesome. I'm guessing KDB is built to do exactly this kind of thing, so I don't want to imply that you can beat it.
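
    For the alignment piece, once both series are pre-sorted a plain two-pointer pass gives you a millisecond-level "as-of" join (the same idea KDB's aj implements). Quote and its field names here are placeholders:

    Code:
    #include <cstdint>
    #include <vector>

    struct Quote { int64_t ts_ms; double price; };  // assumed minimal record

    struct Aligned { int64_t ts_ms; double a_price, b_price; };

    // For each tick in `a`, pair it with the most recent tick in `b`
    // whose timestamp is <= the `a` timestamp (forward fill). Both
    // inputs must already be sorted by ts_ms; one linear pass total.
    std::vector<Aligned> asofJoin(const std::vector<Quote>& a,
                                  const std::vector<Quote>& b) {
        std::vector<Aligned> out;
        out.reserve(a.size());
        size_t j = 0;
        for (const Quote& qa : a) {
            while (j + 1 < b.size() && b[j + 1].ts_ms <= qa.ts_ms) ++j;
            if (!b.empty() && b[j].ts_ms <= qa.ts_ms)
                out.push_back({qa.ts_ms, qa.price, b[j].price});
        }
        return out;
    }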
     
    #25     Jun 10, 2010
  6. Okay, so you are saying R has no memory limitation (it is bounded only by hardware RAM) while Matlab does?
     
    #26     Jun 10, 2010
  7. Yeah, I used to write low-level C++ code to process tick timestamps, etc. Debugging is painful. It's also not that fast - sorting timestamps in 1 GB of data takes quite a while. [For sorting alone, structured software such as SAS is much faster, which made me think raw C++ data processing is not the most cost-efficient route.]

    So I came around to the idea of KDB. Does anybody have experience with KDB vs. C++ for tick data processing?
     
    #27     Jun 10, 2010
  8. I can't speak for all platforms, but R on 64-bit Linux will only hit address space limitations imposed by the system architecture, as far as I know.
     
    #28     Jun 10, 2010
  9. promagma

    I'm trying to say that if you (or the database engine) are sorting data, you are already in big trouble. I have 100 GB of data, so I've been down that road!

    The recommendation for a binary file or HDF5 is that you can store the data already sorted. Then reading in your stream of data is just a straight shot from the disk. The only bottleneck is I/O, and if your data is compressed, even that can be minimized.
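
    To illustrate the "straight shot" point: with a fixed-width, pre-sorted record (same hypothetical Tick layout as in the earlier sketch), the whole load is one sequential read with no parsing and no sorting:

    Code:
    #include <cstdint>
    #include <cstdio>
    #include <vector>

    #pragma pack(push, 1)
    struct Tick { int64_t ts_ms; double price; int32_t size; };
    #pragma pack(pop)

    // Load a pre-sorted binary tick file in one sequential read.
    // The only cost is I/O; the data is usable the moment it lands.
    std::vector<Tick> loadTicks(const char* path) {
        std::vector<Tick> ticks;
        FILE* f = std::fopen(path, "rb");
        if (!f) return ticks;
        std::fseek(f, 0, SEEK_END);
        long bytes = std::ftell(f);
        std::fseek(f, 0, SEEK_SET);
        ticks.resize(bytes / sizeof(Tick));
        size_t got = std::fread(ticks.data(), sizeof(Tick), ticks.size(), f);
        ticks.resize(got);  // trim in case of a short read
        std::fclose(f);
        return ticks;
    }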
     
    #29     Jun 10, 2010
  10. dinn13

    KDB+ is a decent way to go. I never liked using Q, though, and when I used it before, the KDB database was shared, and people would always get on my case when I ran heavy queries on it. So in the end I just used it as a backup for tick data.

    Nowadays, when I want to access tick data directly, I use FastTick (now a Reuters product, http://quant.thomsonreuters.com/), which has its own proprietary database for accessing TAQ data. It's really fast.

    Otherwise I'll create CSV files or binary files (stored as serialized Java objects) that aggregate the data in any number of different ways, and then run my backtesting apps over those. I've found this to be the fastest way when optimized.
     
    #30     Jun 11, 2010