Tick-delimited time scales are not usually appropriate for calculating covariances. It is better to index the ticks by some time interval, accumulating or averaging the tick-delimited information into time-delimited bins, and then compute the covariance. You're also not tied to fixed-length time bins. Use whatever temporal stepping you choose. Skip bars. Use volume-weighted bar lengths. I like a tick-time hybrid stepping. The point is that ticks allow you to look at any stepping, and you can natively analyze per-tick fundamentals that would get lost with typical time-delimited data.

It's an easy problem to solve: take your tick data, calculate the time indices of your choice, index each tick into a second array, accumulating or averaging if you allow multiple ticks per index, and then process the second array instead. This is a very fast and simple process. Aggregation and caching are completely independent of the tick data; use the same techniques you would for bar data if you want. These are merely efficiency enhancements, not a crutch for processing tick data because tick data is somehow so difficult to analyze. Yes, it is very important to accurately simulate executions. Too many system traders constrain themselves to bar data out of fear of complexity, when they could really use inter-bar execution simulation, not to mention tick-based analysis, which has unlimited possibilities.
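For what it's worth, here is a minimal C++ sketch of that two-array binning step. The Tick/Bin layout, the field names, and the volume-weighted averaging are my own assumptions for illustration, not anything from the post above; the only point is that each tick is indexed into a second, time-delimited array and accumulated whenever several ticks share an index.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

struct Tick {
    int64_t time_ms;   // timestamp in milliseconds (assumed ascending)
    double  price;
    double  size;
};

struct Bin {
    double sum_px_size = 0.0;  // accumulate price * size for a volume-weighted average
    double sum_size    = 0.0;
    int    tick_count  = 0;
};

// Index each tick into a second array keyed by time interval; ticks that share
// an index are accumulated, and empty intervals simply stay empty.
std::vector<Bin> bin_ticks(const std::vector<Tick>& ticks, int64_t bin_ms)
{
    if (ticks.empty()) return {};
    const int64_t t0   = ticks.front().time_ms;
    const int64_t span = ticks.back().time_ms - t0;
    std::vector<Bin> bins(static_cast<std::size_t>(span / bin_ms) + 1);

    for (const Tick& t : ticks) {
        Bin& b = bins[static_cast<std::size_t>((t.time_ms - t0) / bin_ms)];
        b.sum_px_size += t.price * t.size;
        b.sum_size    += t.size;
        ++b.tick_count;
    }
    return bins;  // average price of bin i = sum_px_size / sum_size (when sum_size > 0)
}
```

Swapping `bin_ms` for a volume- or tick-count-based index is the same one-pass loop; only the index calculation changes.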
marist89> ...I can get 850K records out of my database of 500M records in less than a second. Of course, I run Oracle.

That sounds rather good. May I ask:

1) Is your timing of this query fresh after the database is started? Or is it after some of these 850K rows have been accessed by other queries, such that many are still cached in the disk buffer? You're not running this query multiple times and then reporting the later, faster result?
2) You're not by chance pinning anything in memory in the KEEP pool?
3) What is the datatype of the column(s) in the WHERE clause?
4) Are you using a hash or bitmap index on the column(s) referenced in the WHERE clause?
5) Do you have a composite index exactly matching the conditions in your WHERE clause, or individual indexes on each separate column referenced there?
6) Are you using table partitioning? If so, do(es) the condition(s) in your WHERE clause relate to the column(s) used to specify the partition boundaries?
7) Is it safe to assume that you are not using an ORDER BY clause? If so, how do you ensure the rows come out in the same order on every query? Just because they are inserted in a particular order is no guarantee they will come out in the same order.
8) Do you avoid UPDATEs, or INSERTs after DELETEs, on this table? These can trigger automatic free-space management in the form of coalescing or row migration, which can change the order of rows.
Do you even realize what you are saying? Computational efficiency does matter when you are talking about improvements of one to four orders of magnitude. Who wouldn't like the equivalent of 10 to 10,000 times as much computing power? In the end, what matters most is how fast systems are designed and deployed to make money at acceptable risk, before the market evolves and makes them obsolete. We use computational tools to find profitable system designs because it is faster than doing the math by hand, so why limit yourself out of convenience? The faster you search, the faster you find. I would never rank simplicity higher than testing performance. Good design is very important, but it needs to be defined. Maintainability is a questionable priority: often the most efficient designs are only good for one purpose, or have ugly, hard-to-maintain code, so there are always trade-offs. All things being equal, if you rank efficiency too low you may severely reduce your chances of success in a reasonable period of time.
Sorry for the confusion. I realized my mistake but missed the 60-minute post-editing window. I was only criticizing the point about lack of main memory and the advice to stuff all the data into memory. I agree with the part about using flat files. Sorry!
I preselect 20-30K ticks, which fits nicely into a 256 KB L2 cache. All integer arithmetic (no floats), hand-optimized.
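For anyone curious what that looks like in practice, here is a minimal C++ sketch of a cache-sized, integer-only tick buffer. The struct layout, tick count, and fixed-point scaling are my assumptions, not the poster's actual code; the point is simply that 20-30K small integer records fit within a 256 KB L2 cache and the hot loop never touches floating point.

```cpp
#include <cstdint>

// Prices held as integer ticks (e.g. hundredths of a point), sizes as whole
// lots: no floating point anywhere in the hot loop.
struct PackedTick {
    int32_t price_ticks;
    int32_t size_lots;
};  // 8 bytes per element

constexpr int kMaxTicks = 24 * 1024;   // ~24K ticks * 8 bytes ~= 192 KB,
static PackedTick g_ticks[kMaxTicks];  // comfortably inside a 256 KB L2 cache

// Example hot loop: a volume-weighted price sum done entirely in integers,
// widened to 64 bits so the products cannot overflow.
int64_t weighted_price_sum(int n)
{
    int64_t acc = 0;
    for (int i = 0; i < n; ++i)
        acc += static_cast<int64_t>(g_ticks[i].price_ticks) * g_ticks[i].size_lots;
    return acc;
}
```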