Data storage for backtesting

Discussion in 'Programming' started by calhawk01, Aug 24, 2012.

  1. Hi, looking for some opinions on the most efficient way to store 1 min data in a database and then fetch it with a script. The script should also be able to recompile the 1 min data to 5 min etc. The backtesting will be done on a website. Backtesting period = 10+ years

    What database?
    What language should the script be written in?
  2. 2rosy


    for 1 minute data...

    makes no difference
    makes no difference
  3. Don't you think it would make a difference depending on what language we use? I'm talking about how much time it would take to recompile the 1 min data to 5 min etc., and then backtest the variables.
  4. 2rosy


    you're a consultant's wet dream. language makes no difference; you're not recompiling anything. :eek:
  5. Any SQL database can aggregate 1 min data to 5 min data, either on the fly using a view or precomputed with a stored procedure.
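    A minimal sketch of that on-the-fly aggregation, using SQLite for portability (the table layout and column names here are assumptions; the same GROUP BY pattern works in MySQL or any other SQL engine):

```python
import sqlite3

# Hypothetical schema: one row per 1-min bar, timestamp in epoch seconds.
con = sqlite3.connect(":memory:")
con.execute("""CREATE TABLE bars_1m (
    ts INTEGER PRIMARY KEY,
    open REAL, high REAL, low REAL, close REAL, volume INTEGER)""")

# Ten synthetic 1-min bars starting at 09:30 (ts = 34200 for simplicity).
rows = [(34200 + 60 * i, 100.0 + i, 101.0 + i, 99.0 + i, 100.5 + i, 1000)
        for i in range(10)]
con.executemany("INSERT INTO bars_1m VALUES (?,?,?,?,?,?)", rows)

# A view that groups 1-min bars into 5-min (300 s) buckets.  high/low/volume
# are plain aggregates; open/close come from joining back to the first and
# last 1-min bar in each bucket.
con.execute("""
CREATE VIEW bars_5m AS
SELECT g.bucket, o.open, g.high, g.low, c.close, g.volume
FROM (SELECT ts / 300 * 300 AS bucket,
             MIN(ts)     AS first_ts,
             MAX(ts)     AS last_ts,
             MAX(high)   AS high,
             MIN(low)    AS low,
             SUM(volume) AS volume
      FROM bars_1m
      GROUP BY bucket) g
JOIN bars_1m o ON o.ts = g.first_ts
JOIN bars_1m c ON c.ts = g.last_ts""")

for row in con.execute("SELECT * FROM bars_5m ORDER BY bucket"):
    print(row)
```

    Because it is a view, the 5 min (or 15 min, 60 min, ...) series is always derived from the 1 min master and never stored twice; swap 300 for any other bucket width.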
  6. We store 1 minute bars, 1 second bars and 25ms bars for many markets from 2000 to present.

    Our one minute bars for equities are only about 600 GB stored and indexed in SQL DBs. Obviously the second and millisecond bars are substantially larger data sets (50+ TB and growing).

    You're I/O bound when dealing with these types of data volumes and structures. Just copying 1 TB of data is time consuming and taxes SATA3 limits. Traditional fault tolerance and recovery are not realistic options. A traditional big server with multi-TB drive arrays does not service the load well, nor does it scale.

    Our solution was building out a farm of SQL appliances and feed handlers with InfiniBand and breaking the historic data sets up into 500 GB containers. The data containers are replicated across a minimum of 3 appliances, and the collective pool of appliances maintains a cache of 10% of the repository in memory. Kind of our own Hadoop / MapReduce, but for SQL tick data.

    This redundancy not only protects the data but provides 3 to N times the I/O.
    Queries can be processed in parallel, different indexes can be maintained based on purpose, and different views and schemas can be managed without impacting the repository. Our attempt at a self-healing, self-updating data vault.

    Your 600 GB or so of historic 1 minute bars will quickly occupy 10x the raw space, based on replication and managing different views and schemas.

    For example, suppose you want to maintain a portfolio view of the S&P 500 and its constituents, all adjusted for splits and dividends during RTH.

    A subset of optimized tables is created from the repository master and maintained by triggers. The indexes are different, the views are custom, and the I/O distribution is optimized for feeding MATLAB.

    MATLAB is optimized to use GPUs (400+ cores), accessing an in-memory SQL DB that is also optimized to use GPUs for virtualizing its opcodes and queries. As a result this specialized portfolio application can run in real time, with 25 ms precision against both real-time market data and its historic data.

    This is a huge undertaking to do right, both in infrastructure expense and in all the coding and data management needed to get down to tick precision.

    One minute bars should be much lighter and easier, but you'll inevitably want to query at higher precision.
  7. I import 1 minute data into my own data files, converting it to all the timeframes that interest me. The 1 minute data goes back to 2001.

    I toyed with databases, but finally realized they added more complexity than they were worth. My binary data file implementation is simple and very fast.
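    A minimal sketch of that kind of flat binary store (the poster's actual format is unknown, so the record layout below is an assumption). Fixed-width records mean bar N sits at byte N * record_size, so reading any range is one seek plus a sequential read, and converting 1 min to 5 min is a single pass over the records:

```python
import struct

# Hypothetical 44-byte record: epoch seconds (I), OHLC (4 doubles), volume (q).
REC = struct.Struct("<Iddddq")

def write_bars(path, bars):
    """Append-style writer; each bar is a (ts, o, h, l, c, vol) tuple."""
    with open(path, "wb") as f:
        for bar in bars:
            f.write(REC.pack(*bar))

def read_bars(path, start=0, count=None):
    """Read `count` bars beginning at record index `start` (all if None)."""
    with open(path, "rb") as f:
        f.seek(start * REC.size)                 # O(1) jump to any bar
        data = f.read(REC.size * count if count else -1)
    return [REC.unpack_from(data, i) for i in range(0, len(data), REC.size)]

def resample(bars, n):
    """Collapse consecutive groups of n bars into one (e.g. 1 min -> 5 min)."""
    out = []
    for i in range(0, len(bars) - n + 1, n):
        grp = bars[i:i + n]
        out.append((grp[0][0],                   # timestamp of first bar
                    grp[0][1],                   # open of first bar
                    max(b[2] for b in grp),      # high of group
                    min(b[3] for b in grp),      # low of group
                    grp[-1][4],                  # close of last bar
                    sum(b[5] for b in grp)))     # total volume
    return out
```

    One file per symbol per timeframe keeps lookups trivial; the resampled series can either be written out once or rebuilt on demand, since the pass is sequential and fast.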
  8. My advice: clear text files.

    Why not a database? A database is GREAT for storage, but for backtests you want something FAST and REPEATABLE. We always test against a specific data set, then some others, then another one, then forward test. Repeatability is critical, plus the ability to get the data fast and move it to some external systems (external to the storage system) in order to run tests on an external backtest machine ;)

    Which, btw, has no Xeons - not price efficient; I go for highly overclocked high-end desktop machines ;)
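    One way that plain-text approach might look, assuming a one-CSV-per-symbol layout with a simple header (the filename convention and columns are illustrative, not the poster's actual layout); text files are grep-able, diff-able and trivially copied to an external backtest box:

```python
import csv, io

# Hypothetical layout: one CSV per symbol per year, e.g. SPY_2012.csv.
SAMPLE = """date,time,open,high,low,close,volume
2012-08-24,09:30,100,101,99,100.5,1000
2012-08-24,09:31,101,102,100,101.5,1200
"""

def load_bars(f):
    """Parse a bar CSV into a list of dicts with numeric OHLCV fields."""
    out = []
    for row in csv.DictReader(f):
        for col in ("open", "high", "low", "close"):
            row[col] = float(row[col])
        row["volume"] = int(row["volume"])
        out.append(row)
    return out

# In practice f would be open("SPY_2012.csv"); StringIO keeps this runnable.
bars = load_bars(io.StringIO(SAMPLE))
```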
  9. vicirek


    I do the same. A dedicated binary file data structure is faster because of its simplicity.
  10. Thanks everyone for replies.

    To add some color to this request: I'm obviously new to this, so I would really appreciate comments that include references so that I can do this all on my own. I don't have the resources to test out different methods with programmers. I am asking this question on this forum because I want to get the most optimized, cost-effective structure down the first time around and then ask for a programmer's help (btw, if anyone is interested in this project, PM me).

    What methods do these backtesting packages use? You say MySQL will be able to recompile my 1 min data to different timeframes on the "fly." Really? So if I uploaded 10 years of 1 min data into MySQL, then wanted to backtest a 5 min timeframe and test a semi-complex formula... MySQL will be able to do this within a few minutes??

    Another thing I was thinking was: maybe I should create a library of historical values for all of my indicators and store them in MySQL as well. For example:

    date, time, price, indicator 1, indicator 2

    So if my buy is when indicator 1 > X and my exit is when indicator 2 > X, the backtesting script will easily be able to "fetch" the data.

    I'm not sure if this would be the most effective/quickest way to do it, because I have not tested it on all the data that I have (this could be extremely time consuming, since I would have to manually create all the indicator values in Excel for all the stock data that I have).
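    For what it's worth, the indicator-library idea can be sketched in a few lines of code rather than by hand in Excel; `sma3` below is a stand-in for whatever indicator 1 / indicator 2 actually are. Compute each indicator once, store it alongside the price, and the entry rule becomes a simple filter over the stored rows:

```python
def sma(closes, n):
    """Simple moving average; None until n closes are available."""
    out = []
    for i in range(len(closes)):
        out.append(sum(closes[i - n + 1:i + 1]) / n if i >= n - 1 else None)
    return out

closes = [100.5, 101.5, 101.0, 102.5, 103.0, 102.0]

# Rows shaped like the proposed table: (time index, price, indicator column).
rows = list(zip(range(len(closes)), closes, sma(closes, 3)))

# "Buy when price > sma3" expressed as a filter over precomputed rows.
signals = [t for t, price, sma3 in rows if sma3 is not None and price > sma3]
```

    The same rows could be bulk-loaded into the MySQL table described above, so the backtest query never recomputes the indicators.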

    All thoughts are appreciated!
    #10     Aug 26, 2012