Flat binary files with a fixed, appropriately sized record length. That is, don't waste space and bandwidth on 8-byte doubles unless that precision is needed. Data representable in 1- or 2-byte fields can really fly off the disk and into the CPU. Fixed-length binary formats also permit rapid seeking to blocks of data.
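For concreteness, here is a minimal sketch of the kind of layout described above: a packed fixed-length record that uses 2-byte fields where the precision allows, plus a direct seek to the Nth record. The field names, the tick-delta scaling, and the file name ES_ticks.bin are illustrative assumptions, not anything specified in the thread.

/* Sketch: fixed-length binary records with compact field types.
   Field names, scaling and file name are hypothetical. */
#include <stdio.h>
#include <stdint.h>

#pragma pack(push, 1)
typedef struct {
    uint32_t time;      /* seconds since some epoch           */
    int16_t  price;     /* price delta in ticks, fits 2 bytes */
    uint16_t size;      /* trade size                         */
} TickRec;              /* 8 bytes per record, fixed length   */
#pragma pack(pop)

/* Seek directly to record i: no parsing, no index lookup. */
int read_record(FILE *f, long i, TickRec *out)
{
    if (fseek(f, i * (long)sizeof(TickRec), SEEK_SET) != 0)
        return -1;
    return fread(out, sizeof(TickRec), 1, f) == 1 ? 0 : -1;
}

int main(void)
{
    FILE *f = fopen("ES_ticks.bin", "rb");   /* hypothetical file name */
    if (!f) return 1;
    TickRec r;
    if (read_record(f, 1000000L, &r) == 0)   /* jump straight to record 1,000,000 */
        printf("t=%u dp=%d sz=%u\n", (unsigned)r.time, (int)r.price, (unsigned)r.size);
    fclose(f);
    return 0;
}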
Bad advice! What you describe is very inefficient memory-wise and does not scale. It is better to load parts of the data set into memory, do as much processing as possible with cache-efficient code, accumulate results in memory or on disk, repeat.... A basic understanding of disk<->memory<->L* cache<->CPU bandwidths and latencies is necessary, as is the desire to experiment with code and use a profiler.
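To make the load-process-accumulate loop above concrete, here is a rough sketch that reuses the hypothetical 8-byte tick record from the previous sketch: read a chunk of records, make one sequential (cache-friendly) pass over it, keep a running accumulator in memory, and repeat until the file is exhausted. The chunk size and the statistic computed are illustrative assumptions.

/* Sketch of the chunked pattern: read a block of records, process it
   while it is hot in cache, accumulate results, repeat. */
#include <stdio.h>
#include <stdint.h>
#include <stdlib.h>

#pragma pack(push, 1)
typedef struct { uint32_t time; int16_t price; uint16_t size; } TickRec;
#pragma pack(pop)

#define CHUNK 65536   /* ~512 KB of records: small enough to stay cache/RAM friendly */

int main(void)
{
    FILE *f = fopen("ES_ticks.bin", "rb");   /* hypothetical file name */
    if (!f) return 1;

    TickRec *buf = malloc(CHUNK * sizeof(TickRec));
    if (!buf) { fclose(f); return 1; }

    double sum = 0.0;          /* running accumulator kept in memory */
    long   n   = 0;
    size_t got;

    while ((got = fread(buf, sizeof(TickRec), CHUNK, f)) > 0) {
        for (size_t i = 0; i < got; i++) {   /* sequential pass: cache friendly */
            sum += buf[i].price;
            n++;
        }
    }
    if (n) printf("mean tick delta = %f over %ld ticks\n", sum / n, n);

    free(buf);
    fclose(f);
    return 0;
}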
There's no free lunch. You are taking on computational and algorithmic challenges in exchange for fewer opportunities on the design side.
Wrong in general. The problem must be approached differently. Ticks are no less efficient than an equal number of bars, or any data points for that matter. If you are referring to the increased number of ticks versus bars over time, then yes, there may be more computations. However, not all processing must touch every tick. The first layer of processing can operate on every tick and cache its results to memory or disk. The next layer can process those results every 10 or 100 ticks, or every minute, etc. You can optimize the second layer while keeping the results of the first layer in cache. You could actually end up with less computational load by virtue of the greater design opportunities that higher-frequency data offers.

So computational efficiency doesn't matter? That's wrong. They do calculate such matrices, derived from price changes and other technicals or statistics, on many time frames.

Not statistically valid? That judgement depends purely on how the data is used. I have personally found success with 5000-stock * 1000-day covariance/correlation matrices of daily price changes and other daily statistics. Anyone who complains this isn't statistically significant enough simply hasn't found strong enough designs, signals, or filters to achieve the desired profitability or statistical significance.

Keep the huge data set. Allow some automatic walk-forward adaptation. Problem solved. Attempting to fit a fixed-rule system over too much data can be an overwhelming challenge and a waste of time.
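A minimal sketch of the two-layer idea described above, under assumed record and summary layouts: layer 1 touches every tick exactly once and caches one compact summary per N ticks; layer 2, the part that gets re-optimized many times, only ever reads the cached summaries. The summary fields (a VWAP and a volume), N=100, and the synthetic data are illustrative choices, not anything from the thread.

#include <stdio.h>
#include <stdint.h>

typedef struct { uint32_t time; int16_t price; uint16_t size; } Tick;
typedef struct { uint32_t time; double vwap; uint32_t volume; } Summary;

#define N 100   /* one cached summary per 100 ticks (assumed granularity) */

/* Layer 1: touches every tick exactly once and emits compact summaries. */
size_t layer1(const Tick *t, size_t nticks, Summary *out)
{
    size_t nsum = 0;
    double pv = 0.0;
    uint32_t vol = 0;
    for (size_t i = 0; i < nticks; i++) {
        pv  += (double)t[i].price * t[i].size;
        vol += t[i].size;
        if ((i + 1) % N == 0) {
            out[nsum].time   = t[i].time;
            out[nsum].vwap   = vol ? pv / vol : 0.0;
            out[nsum].volume = vol;
            nsum++;
            pv = 0.0;
            vol = 0;
        }
    }
    return nsum;
}

/* Layer 2: the part that gets tuned and re-run many times, but it only
   ever reads the cached summaries, never the raw ticks. */
double layer2(const Summary *s, size_t n, double threshold)
{
    double score = 0.0;
    for (size_t i = 1; i < n; i++) {
        double move = s[i].vwap - s[i - 1].vwap;
        if (move > threshold)
            score += move;
    }
    return score;
}

int main(void)
{
    enum { NT = 1000 };
    static Tick ticks[NT];
    static Summary cache[NT / N];

    /* Synthetic ticks, just to make the sketch runnable. */
    for (size_t i = 0; i < NT; i++) {
        ticks[i].time  = (uint32_t)i;
        ticks[i].price = (int16_t)(100 + (i % 7) - 3);
        ticks[i].size  = (uint16_t)(1 + i % 5);
    }

    size_t n = layer1(ticks, NT, cache);          /* run layer 1 once */
    for (double th = 0.0; th < 1.0; th += 0.25)   /* re-run layer 2 cheaply */
        printf("threshold %.2f -> score %.3f\n", th, layer2(cache, n, th));
    return 0;
}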
Absolutely: 2000 iterations with 3 variables on 81,000 records in 10 seconds (on a 2.7 GHz machine). If one wants speed, one has to get a Formula 1 car, not a truck. And speaking of real-time financial apps, SQL setups on PCs are trucks; they are only suitable on mainframes and/or with parallel computing.
2000 combinations * 81,000 records / 10 seconds = 16.2M records*combinations per second. Not bad.

In Matlab and C on a 2.5 GHz Pentium 4 it is possible to do: 6.7M ES ticks * 17 system variations * 1014 per-tick stop-loss formula combinations / 16 minutes = 120M ticks*combinations per second.

I am currently doing operations like this on a dual Opteron 242 (1.6 GHz): 10.2M NQ ticks * 750K system combinations / 10 hours, per processor = 208M ticks*combinations per second per CPU. Memory and VM usage is 500MB per process. Required disk bandwidth is very low over the 10-hour period: it reads a few MB of ticks every few minutes, then iterates through all the system combinations, then repeats.

Most of the optimizations I use to achieve this are listed in this previous post: http://www.elitetrader.com/vb/showthread.php?s=&postid=594912#post594912
That's funny, I can get 850K records out of my database of 500M records in less than a second. Of course, I run Oracle.
I'm not sure how database query speed can be accurately compared here. It depends on the physical record size and internal organization. Does the query require a transpose or many drive-head seeks? Is the data stored in a contiguous block? One advantage of flat-file formats is that you can choose the physical organization to optimize the most time-consuming operation, be it reads or writes. Need to efficiently append to the data files as new market data comes in? You can design for that too.
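As a sketch of the "design for appends" point: a write-optimized flat file can simply be opened in append mode and have new fixed-length records written sequentially as they arrive. The TickRec layout and file name are the same hypothetical ones used in the earlier sketches.

#include <stdio.h>
#include <stdint.h>

#pragma pack(push, 1)
typedef struct { uint32_t time; int16_t price; uint16_t size; } TickRec;
#pragma pack(pop)

int append_tick(const char *path, const TickRec *r)
{
    FILE *f = fopen(path, "ab");       /* append-only: purely sequential writes */
    if (!f) return -1;
    int ok = fwrite(r, sizeof(TickRec), 1, f) == 1;
    fclose(f);
    return ok ? 0 : -1;
}

int main(void)
{
    TickRec r = { 1234567u, 3, 10 };   /* synthetic tick for illustration */
    return append_tick("ES_ticks.bin", &r) == 0 ? 0 : 1;
}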
Tell me that after calculating the covariance of two sets of tick data. The problem is not just the increasing amount of data, it is the lack of a consistent time scale. The algorithms are more complex, and if you want to address the problem efficiently you need more complex data structures. The aggregation and caching techniques you describe just prove my point that tick data is harder to deal with than bar data. Obviously there are both benefits and costs to using tick data; whether that matters depends on the application. In my application the only use I have for tick data is estimating trading costs.

Certainly not as much as good design, simplicity, and maintainability. "Premature optimization is the root of all evil." -- Donald Knuth

Of course it depends on how the data is used. But the poster asserted that institutions don't do large covariance matrices for computational reasons. The real reason is significance, not computational complexity. With the right methods you can get significant correlation data across large numbers of stocks; J.P. Morgan's RiskMetrics work is a classic example.

Martin
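To illustrate the "no consistent time scale" problem raised above, here is a rough sketch of what a tick-data covariance involves: first resample each irregular series onto a common time grid (taking the last observed price at or before each grid point), and only then compute the covariance. The grid spacing, struct fields, and synthetic data are assumptions for illustration.

#include <stdio.h>
#include <stdint.h>
#include <stddef.h>

typedef struct { uint32_t time; double price; } Tick;

/* Fill grid_p[] with the last price observed at or before each grid time. */
static void resample(const Tick *t, size_t n, const uint32_t *grid_t,
                     double *grid_p, size_t m)
{
    size_t j = 0;
    double last = n ? t[0].price : 0.0;
    for (size_t i = 0; i < m; i++) {
        while (j < n && t[j].time <= grid_t[i]) { last = t[j].price; j++; }
        grid_p[i] = last;
    }
}

static double covariance(const double *x, const double *y, size_t n)
{
    double mx = 0, my = 0;
    for (size_t i = 0; i < n; i++) { mx += x[i]; my += y[i]; }
    mx /= n; my /= n;
    double c = 0;
    for (size_t i = 0; i < n; i++) c += (x[i] - mx) * (y[i] - my);
    return c / (n - 1);
}

int main(void)
{
    /* Two irregular (synthetic) tick series: different counts, different times. */
    Tick a[] = { {1, 100.0}, {4, 100.5}, {9, 101.0} };
    Tick b[] = { {2,  50.0}, {3,  50.2}, {8,  50.6}, {10, 50.9} };
    uint32_t grid_t[] = { 2, 4, 6, 8, 10 };   /* common 2-second grid */
    double pa[5], pb[5];

    resample(a, 3, grid_t, pa, 5);
    resample(b, 4, grid_t, pb, 5);
    printf("cov = %f\n", covariance(pa, pb, 5));
    return 0;
}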