dataset for performance testing a time series implementation

Discussion in 'Data Sets and Feeds' started by tolland, May 22, 2009.

  1. tolland

    Hi Elite traders,

    This is my first thread. I am posting here because I found an epic thread on "Tick Database Implementations" with lots of intelligent replies:
    http://www.elitetrader.com/vb/showthread.php?s=&threadid=81345&perpage=6&pagenumber=1

    Amongst others, MonetDB, kdb+ and HDF5 were mentioned, along with StreamSQL and CEP/Esper-style analysis tools.

    However, I think the proof is in the testing, so I am looking for a dataset oriented towards financial price streams that I could download, import and run my analysis against to get a metric. (Obviously I would only be testing the performance of retrieval and analysis, not of storing the ticks.)

    I had a look at the TPC site, http://www.tpc.org/, but I couldn't find anything relevant. I also found some academic papers describing the sort of thing that would be useful, but none of them supplied datasets.

    So I would like to suggest taking a dataset such as OHLC data for the S&P for the last x years, to get it up to something sizable, together with a set of analyses that would test the implementation.

    Has anyone got any suggestions on such a beast, and indeed would anyone like to compare against my system with their own HDF5 or kdb implementation?

    Cheers,

    T

    And indeed there are a couple of threads on the RMetrics and ActiveQuant lists:
    http://www.nabble.com/tick-data-database-td23340204.html
    http://www.nabble.com/Thoughts-about-handling-huge-amounts-of-ticks-in-db-backend-td22999165.html
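
    As a rough illustration of the benchmark proposed in this post (a sizable OHLC dataset plus a set of analyses to time against it), here is a minimal Python sketch. Everything in it is an assumption made for illustration: the bars are synthetic, generated from a seeded random walk rather than taken from real S&P data, and the function names (make_ohlc, simple_moving_average, time_analysis), the 20-bar window and the bar count are hypothetical. A real comparison would swap the generator for a read from the store under test.

```python
import random
import time


def make_ohlc(n, seed=0):
    """Generate n synthetic OHLC bars from a seeded random walk (not real market data)."""
    rng = random.Random(seed)
    bars = []
    price = 100.0
    for _ in range(n):
        open_ = price
        close = max(open_ + rng.uniform(-1.0, 1.0), 0.01)
        high = max(open_, close) + rng.uniform(0.0, 0.5)
        low = max(min(open_, close) - rng.uniform(0.0, 0.5), 0.005)
        bars.append((open_, high, low, close))
        price = close
    return bars


def simple_moving_average(closes, window):
    """Rolling mean via a running sum; returns len(closes) - window + 1 values."""
    if window <= 0 or window > len(closes):
        return []
    total = sum(closes[:window])
    out = [total / window]
    for i in range(window, len(closes)):
        total += closes[i] - closes[i - window]
        out.append(total / window)
    return out


def time_analysis(fn, *args, repeats=3):
    """Run fn(*args) several times; return (best wall-clock seconds, last result)."""
    best = float("inf")
    result = None
    for _ in range(repeats):
        start = time.perf_counter()
        result = fn(*args)
        best = min(best, time.perf_counter() - start)
    return best, result


if __name__ == "__main__":
    bars = make_ohlc(250_000)            # hypothetical dataset size
    closes = [bar[3] for bar in bars]    # column 3 is the close
    seconds, sma = time_analysis(simple_moving_average, closes, 20)
    print(f"SMA(20) over {len(closes)} bars: {len(sma)} values, best of 3 = {seconds:.4f}s")
```

    Taking the best of several runs is one common way to reduce noise from caching and other processes; it measures the analysis only, not the time needed to load the data.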
     
  2. the most useful metric for this type of thing imo is just simple msgs per second. i'm not sure a common dataset is even necessary as most data products have similar msg lengths. also, i'm not sure an 'analysis' is necessary either as what's being compared is the speed at which the db is read not the speed of the analysis algo, unless i'm not understanding the scope of what you want to compare. i think it's also important to specify whether you're trying to simulate processing of a real-time series or processing of historic db. kx and hdf5 do both and performance can be markedly different.
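
     The messages-per-second metric described in this reply can be measured with a few lines of Python. This is only a sketch under stated assumptions: synthetic_quotes is a hypothetical stand-in for whatever reader the database under test provides (a kdb+ query, an HDF5 dataset iterator, and so on), and the tuple layout and message count are made up for illustration.

```python
import time


def measure_throughput(messages):
    """Consume an iterable of messages; return (count, elapsed_seconds, msgs_per_second)."""
    start = time.perf_counter()
    count = 0
    for _ in messages:
        count += 1
    elapsed = time.perf_counter() - start
    rate = count / elapsed if elapsed > 0 else float("inf")
    return count, elapsed, rate


def synthetic_quotes(n):
    """Hypothetical stand-in for a database reader: yields n fake (seq, bid, ask) tuples."""
    for i in range(n):
        bid = 100.0 + (i % 7) * 0.01
        yield (i, bid, bid + 0.01)


if __name__ == "__main__":
    count, elapsed, rate = measure_throughput(synthetic_quotes(1_000_000))
    print(f"{count} msgs in {elapsed:.3f}s -> {rate:,.0f} msgs/s")
```

     Because the loop body does nothing except count, the figure reflects read speed rather than the speed of any analysis, which is the distinction drawn above. Replaying a historic file and consuming a live feed would need separate runs, since the two can perform very differently.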
     
  3. maxdama

    Tolland,

    A while ago I shared some high-frequency data here. It's 1 day of intraday QQQQ quotes, weighing in at 71.962 MB (6 MB to download as a compressed zip file), which should be a good test. It's over 1 million observations, i.e. sub-second resolution. Go for the quotes dataset, not the trades dataset, which is much smaller. I'd like to hear your benchmarking results. Feel free to PM me if you'd like.

    Regards,
    Max