Store OHLCV data in MySQL database.

Discussion in 'Data Sets and Feeds' started by mengtrader, Apr 13, 2010.

  1. This is not a product pitch... just sharing some new technology that I have been beta testing that seems very promising for storing tick data, back testing, etc.

    cloudcel.com is worth taking a look at and has opened up a public beta.

    Basically this is a cloud service tightly integrated with Excel that can store and process billions of records of market data... run simulations across hundreds of processors and return just the results back into your spreadsheet or database.

    I was pretty impressed... Testing processes that took months to set up and program, and hours upon hours to test, literally turned into a 15-minute job.

    Beyond having an Excel worksheet with literally unlimited rows and columns, the main speed advantage is being able to parallel-process data streams in the cloud. Effectively one huge in-memory database of all of your high-resolution market tick data, with hundreds of cores available for processing, generating thousands of result streams.

    They have public data streams loaded with approximately 3000 stock symbols of sub-second tick data. It takes a little time to adjust to the concept of processing in-cloud data streams and feeding result streams to other in-cloud parallel processors... but once you get the hang of it, you'll instantly see that the power you have available at your fingertips is nothing like anything you've had before.

    Basically, in the context of this thread: OHLCV data plus timestamp and symbol are streamed into the cloud as raw high-resolution data.

    They have an API so you can have your price feed go directly into the cloud, or you can upload CSV, Excel, MySQL, SQL Server, etc. and push the data up (a one-time load for historical data, and either a daily batch update or real-time capture for new data).

    Once in the cloud, you use their Excel add-in to create a really simple worksheet that processes your "precalculated historical results", i.e. 1 minute, 5 minutes, 15 minutes, 1 hour, 4 hours, 1 day, 1 week, 1 month, 1 quarter, 1 year, 2 years, 3 years, 4 years, 5 years.

    For you DB guys... group by symbol, group by time, first, max, min, last... If you have 3000 symbols, you set one field in Excel to tell it to run on 3000 processors, and all the data is churned and calculated concurrently, generating 30,000 result streams.
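    To put that in plain SQL terms, rolling raw ticks up into 1-minute bars is roughly the query below (assuming a hypothetical tick table with symbol, ts, price and size columns - MySQL has no built-in first/last aggregates, so the GROUP_CONCAT/SUBSTRING_INDEX trick stands in for them):

    -- MySQL-flavoured sketch: roll raw ticks up into 1-minute OHLCV bars.
    SELECT
        symbol,
        DATE_FORMAT(ts, '%Y-%m-%d %H:%i:00')                          AS bar_start,
        SUBSTRING_INDEX(GROUP_CONCAT(price ORDER BY ts ASC), ',', 1)  AS open_price,
        MAX(price)                                                    AS high_price,
        MIN(price)                                                    AS low_price,
        SUBSTRING_INDEX(GROUP_CONCAT(price ORDER BY ts DESC), ',', 1) AS close_price,
        SUM(size)                                                     AS volume,
        COUNT(*)                                                      AS tick_count
    FROM tick
    GROUP BY symbol, bar_start
    ORDER BY symbol, bar_start;

    Run one of those per timeframe and per symbol and you have the same "precalculated historical results" idea, just without the parallelism.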

    The part that impressed me was that all of these result streams persist and remain available, so you can use them for back testing... You can make them public, share them, etc. You can stream the results back down into Excel or to another database... (but why?) You can back-test 10 years of tick data for thousands of symbols and have all of your results in 10 minutes...

    I know they use Amazon EC2 for the public beta, which has tens of thousands of processors. Their concept of data streams takes a little time to adjust to, but it's damn powerful. Think of it as persistent helper columns... All you really want to do is offload all of the heavy lifting and processing into the cloud.

    Half the battle CME DataMine and other data providers face is delivering 20+ terabytes of historical data to subscribers, and the subscribers have to be able to clean, load and process the data.

    If CME Group or any of these data providers loaded and maintained tick-by-tick historical Level II data in this type of cloud system and provided templates to download, back-test or query the data streams, it would change the way we process market data.

    Check them out... it seems like a particularly good tool for storing OHLCV-type data and an even better tool for running simulations.
     
    #11     Apr 13, 2010
  2. For those complaining about this message: I don't feel it violates the terms of service, so I will not be deleting it. As far as I know, PocketChange does not work for this company and is not trying to spam.

     
    #12     Apr 13, 2010
  3. Pretty much. I also neither use MySQL (SQL Server here) nor have a crappy server (the virtual instance running SQL Server has 8 GB RAM and 4 allocated processor cores, is prepared to go to 12, and has about 10 physical disks directly attached).

    Data generation is not done in stored procedures (they suck for complex things - grouping by time is great, but Renko charts, P&F, etc. are hard to do with them).

    The exact schema is not something I can put up here at the moment (missing the tools), but if you ask again in about a week I may be able to (I am upgrading my development environment at the moment, now that Visual Studio 2010 is out).
     
    #13     Apr 14, 2010
  4. If you could, can you just share the relationship between the two tables you mentioned? Not the E-R diagram, just the SQL create table script or something like that to show how you link those two tables.

    Thanks a lot!
    Andrew
     
    #14     Apr 14, 2010
  5. Actually it is three tables.

    * MarketAggregateCode
    - id
    - Code (unique)
    ...the code is the string I mentioned

    * MarketAggregateSeries
    - id
    - SymbolREF (to the Symbol definition table - I store more than just the ticker there)
    - FeedREF (allows me to handle a symbol in multiple data feeds)
    - CodeREF (to MarketAggregateCode)

    * MarketAggregate
    - SeriesREF (to MarketAggregateSeries)
    (then the payload fields - From, To timestamps, OHLC, volume, tickcount etc.)

    This allows me to keep the SeriesREF small (2-4 bytes) and still handle multiple series of the same symbol. I am thinking of adding a SourceSeriesREF for the input (assuming a generator can work based on another generator).

    Hope that helps ;)

    Per SQL guidelines, every table has a unique Id field that is as small as possible - that makes foreign key relations a LOT faster. I also have a separate symbol store that I am currently working on ;)
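    In rough CREATE TABLE form it boils down to something like this (SQL Server flavour; the column types and sizes here are just placeholders, not my actual script):

    -- Rough sketch only; types and sizes are placeholders.
    CREATE TABLE MarketAggregateCode (
        Id      SMALLINT     NOT NULL PRIMARY KEY,
        Code    VARCHAR(32)  NOT NULL UNIQUE       -- the aggregation code string, e.g. '1 minute'
    );

    CREATE TABLE MarketAggregateSeries (
        Id        INT        NOT NULL PRIMARY KEY,
        SymbolREF INT        NOT NULL,              -- FK to the symbol definition table
        FeedREF   SMALLINT   NOT NULL,              -- FK to the data feed table
        CodeREF   SMALLINT   NOT NULL REFERENCES MarketAggregateCode (Id)
    );

    CREATE TABLE MarketAggregate (
        SeriesREF   INT           NOT NULL REFERENCES MarketAggregateSeries (Id),
        [From]      DATETIME      NOT NULL,
        [To]        DATETIME      NOT NULL,
        [Open]      DECIMAL(18,6) NOT NULL,
        High        DECIMAL(18,6) NOT NULL,
        Low         DECIMAL(18,6) NOT NULL,
        [Close]     DECIMAL(18,6) NOT NULL,
        Volume      BIGINT        NOT NULL,
        TickCount   INT           NOT NULL
    );

    The point is that each MarketAggregate row only carries one small SeriesREF instead of repeating symbol, feed and timeframe on every bar.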
     
    #15     Apr 14, 2010
  6. One more question: what is the PK in the MarketAggregate table? Are the From and To timestamps part of the PK?

    Thanks a lot for sharing!
     
    #16     Apr 14, 2010
  7. ONLY the Id - I never have a PK that spans more than one field.

    Besides that, picking a natural key there is really hard.

    I would say (Series, From) - plus a unique index on To.

    The problem with a PK there is that it does not solve the issue of uniqueness well enough anyway (as in: what if the end is slightly different? Overlaps are not allowed at all, but that cannot be expressed in a PK).

    Right now I have no PK there, treating it simply as streaming data. Unique indices on (SeriesRef, From) and (SeriesRef, To) allow me fast ordering etc.
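    In other words, roughly this (index names are placeholders):

    -- Two unique indices instead of a composite PK on the bar table.
    CREATE UNIQUE INDEX IX_MarketAggregate_SeriesFrom ON MarketAggregate (SeriesREF, [From]);
    CREATE UNIQUE INDEX IX_MarketAggregate_SeriesTo   ON MarketAggregate (SeriesREF, [To]);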
     
    #17     Apr 14, 2010