Software/database/method to work with 100-200 million row/3-5 column of data?

Discussion in 'Data Sets and Feeds' started by DSpline, Apr 28, 2010.

  1. DSpline

    DSpline

    Hi folks.

    Looking for a software/database/method to accurately & easily work with upwards of 200-million row x 3-5 parameter column data set.

    Looking to essentially build data set of *empty* tick bars (just open, close price). To build empty 1500 tick equates to pulling out and save every 1500th/1501th row. This results in a 200 million row x 3 column raw tick data set being reduced to ~266,666 row x 3 column.

    Example:

    Row, Date, Time, Price
    Row1 4/1/2010 08:30:05 1202.25 <- Save
    Row2 4/1/2010 08:30:05 1202.50 <- Remove
    Row3 4/1/2010 08:30:05 1202.25 <- Remove
    ..... <- Remove
    ..... <- Remove
    Row1500 4/1/2010 08:31:02 1201.50 <- Save
    Row1501 4/1/2010 08:31:02 1201.50 <- Save
    ......
    ......
    Row3000 4/1/2010 08:31:58 1201.25
    Row3001 4/1/2010 08:31:58 1201.00
    .....
    .....
    Row4500
    Row4501
    .....
    ..... etc
    Row150,000,000
    Row150,000,001


    Would like final manipulation as follows:

    Date, Time (Bar Close), Open, Close
    4/1/2010 08:31:58 1201.50 (Row1501 Price) 1201.25 (Row3000 Price)

    ^This format will eventually make its way to Matlab.

    Can Tickdata's Tickwrite script help here?

    Want to be able to easily switch to a removal criteria such that I can build empty 750,1000,2500 etc tick bars from raw tick.

    Data is either in ascii format or txt file from original source.

    Any input much appreciated.
     
  2. DSpline

    DSpline

    Should I be looking into MySQL or SQL Server 2008?
     
  3. Whatever you look at, make sure your hardware is up to the task - not memory wise (normal 8gb to 16gb should be plenty) but IO wise. Lots of FAST discs, or you will hae to wait.

    I personally use SQL Server for that stuff. But I am a SQL Server specialist in another life and use the .NET stack for my programming, so it is a naturaly extension of that ;)
     
  4. Just write a bit of code to do the job in the language of your choice. An RDB seems a bit of an overkill for this. You could have the job done in less time than it takes frigging around with a database and SQL.
     
  5. Syprik

    Syprik

    TickWrite 7 will do PRECISELY this for you, but it only imports their data format (.tdi). Specify any tick, trade, time, daily, weekly, monthly bar you wish. You then manually select your columns of choice (date, time, open, close, high, low, vol, ...) and it will spit it out to ASCII. Personally use it to frequently ferry vol, tick, & time bar OHLCV to R, Matlab, and PASW. It's rather quick interacting with their proprietary compressed format. Perhaps shoot them an email to see if they have a version of utility for sale that will work with any raw tick ascii, regardless of source?

    If you are still going down the RDBS route, I've found MySQL (free open-source) with the WorkBench quite straight-forward for storing and performing basic manipulation of my data. Consider O'Reillys "Learning SQL" 2nd edition if you want a quick and decent intro. There is also SQL 2008 R2 Express that is free, but only 10GB capacity ceiling. My trading partner has preference for BerkeleyDB.

    MS Access another option, as long as it's sub 2-GB per DB.

    Good luck.
     
  6. Syprik

    Syprik