PocketChange
Registered: Jul 2008
Posts: 2036
08-25-12 03:48 PM
We store 1-minute bars, 1-second bars and 25 ms bars for many markets from 2000 to present.
Our 1-minute bars for equities are only about 600 GB, stored and indexed in SQL databases. Obviously the second and millisecond bars are substantially larger data sets (50+ TB and growing).
You're I/O bound when dealing with these types of data volumes and structures. Just copying 1 TB of data is time consuming and taxes SATA3 limits. Traditional fault tolerance and recovery are not realistic options. Traditional big-server / multi-TB drive arrays neither service the load well nor scale.
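To put a rough number on the copy problem, here is a back-of-envelope sketch. It assumes the SATA3 best case of about 600 MB/s usable bandwidth (6 Gbit/s line rate with 8b/10b encoding) and decimal terabytes; the function name is just for illustration, and real drives usually sustain less.

```python
# Back-of-envelope: how long does it take just to stream a dataset once?
# Assumption: SATA3 best case of ~600 MB/s usable bandwidth
# (6 Gbit/s line rate, 8b/10b encoding). Real drives are often slower.

SATA3_MB_PER_S = 600


def copy_hours(size_tb: float, mb_per_s: float = SATA3_MB_PER_S) -> float:
    """Hours needed to stream size_tb (decimal TB) at mb_per_s megabytes/second."""
    size_mb = size_tb * 1_000_000  # 1 TB = 1,000,000 MB (decimal units)
    return size_mb / mb_per_s / 3600


if __name__ == "__main__":
    for label, tb in [("1 TB", 1), ("50 TB", 50)]:
        print(f"{label}: {copy_hours(tb):.2f} h at the SATA3 best case")
```

Even in the best case a single 1 TB copy is close to half an hour over one link, and a 50 TB set is roughly a full day, before any indexing or verification.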
Our solution was building out a farm of SQL appliances and feed handlers connected with InfiniBand, and breaking up the historic data sets into 500 GB containers. Each data container is replicated across a minimum of 3 appliances, and the collective pool of appliances maintains a cache of 10% of the repository in memory. Kind of our own Hadoop / MapReduce, but for SQL tick data.
This redundancy not only protects the data but also provides 3x to Nx the I/O.
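A minimal sketch of the container / replica idea, to show why replication also multiplies read I/O. The round-robin placement and the least-loaded read routing are assumptions made for illustration, not a description of our actual scheduler, and all names here are hypothetical.

```python
from collections import defaultdict

CONTAINER_GB = 500   # container size from the post
MIN_REPLICAS = 3     # minimum replication factor from the post


def split_into_containers(total_gb: int, container_gb: int = CONTAINER_GB) -> int:
    """Number of containers needed for total_gb (ceiling division)."""
    return -(-total_gb // container_gb)


def place_replicas(n_containers, appliances, replicas=MIN_REPLICAS):
    """Assign each container to `replicas` distinct appliances.

    Hypothetical policy: simple round-robin over the appliance list.
    """
    if replicas > len(appliances):
        raise ValueError("need at least as many appliances as replicas")
    return {
        c: [appliances[(c + r) % len(appliances)] for r in range(replicas)]
        for c in range(n_containers)
    }


def pick_reader(container_id, placement, load):
    """Route a read to the least-loaded appliance holding the container."""
    target = min(placement[container_id], key=lambda a: load[a])
    load[target] += 1
    return target


if __name__ == "__main__":
    appliances = [f"appliance{i}" for i in range(5)]
    placement = place_replicas(split_into_containers(600), appliances)
    load = defaultdict(int)
    # Three concurrent reads of container 0 land on three different boxes.
    print([pick_reader(0, placement, load) for _ in range(3)])
```

Because any replica can serve a read, concurrent queries against the same container spread across at least three appliances instead of queueing on one.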
Queries can be processed in parallel. Different indexes can be maintained based on purpose. Different views and schemas can be managed without impacting the repository. It is our attempt at a self-healing, self-updating data vault.
Your 600 GB or so of historic 1-minute bars will quickly occupy 10x the raw space once you account for replication and for managing different views and schemas.
For example, suppose you want to maintain a portfolio view of the S&P 500 and its constituents, all adjusted for splits and dividends, during regular trading hours (RTH).
A subset of optimized tables is created from the repository master and maintained by triggers. The indexes are different, the views are custom, and the I/O distribution is optimized for feeding MATLAB.
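To make the derived-view idea concrete, here is a small sketch of the transformation such a table encodes: back-adjust prices for splits, then keep only RTH bars. In our setup this lives in SQL tables maintained by triggers; the Python below only illustrates the logic. The 09:30 to 16:00 session window, exchange-local timestamps, and the function names are assumptions, and dividend adjustment is left out for brevity.

```python
from datetime import datetime, time

# Assumption: US equity regular session, timestamps in exchange-local time.
RTH_OPEN, RTH_CLOSE = time(9, 30), time(16, 0)


def in_rth(ts: datetime) -> bool:
    """True if the bar timestamp falls inside regular trading hours."""
    return RTH_OPEN <= ts.time() < RTH_CLOSE


def split_adjust(bars, splits):
    """Back-adjust closes for splits.

    bars:   list of (timestamp, close)
    splits: list of (effective_timestamp, ratio), e.g. ratio 2.0 for a 2-for-1.
    Every bar before a split's effective time is divided by that split's ratio.
    """
    adjusted = []
    for ts, close in bars:
        factor = 1.0
        for effective, ratio in splits:
            if ts < effective:
                factor *= ratio
        adjusted.append((ts, close / factor))
    return adjusted


def rth_adjusted_view(bars, splits):
    """Split-adjusted bars restricted to regular trading hours."""
    return [(ts, px) for ts, px in split_adjust(bars, splits) if in_rth(ts)]


if __name__ == "__main__":
    bars = [
        (datetime(2012, 8, 23, 9, 29), 100.0),  # pre-market, dropped
        (datetime(2012, 8, 23, 10, 0), 100.0),  # before the split, adjusted
        (datetime(2012, 8, 24, 10, 0), 50.0),   # after the split, unchanged
    ]
    splits = [(datetime(2012, 8, 24, 0, 0), 2.0)]
    for ts, px in rth_adjusted_view(bars, splits):
        print(ts, px)
```

Materializing this once per purpose, rather than recomputing it per query, is what drives the extra storage for views and schemas.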
MATLAB is optimized to use GPUs (400+ cores), accessing an in-memory SQL db that is also optimized to use GPUs for virtualizing its opcodes and queries. As a result, this specialized portfolio application can run in real time with 25 ms precision against both real-time market data and its historic data.
This is a huge undertaking to do right, both in infrastructure expense and in all the coding and data management needed to get down to tick precision.
One-minute bars should be much lighter and easier, but you'll inevitably want to query at higher precision.