Registered: Jul 2009
07-27-11 03:57 PM
Quote from Maverick1:
Question related to data mining and backtesting: what's the best container to use with the typical CSV (comma-delimited) index futures data?
Let's assume that the basic data components, i.e., open, high, low, close, price, volume are part of a structure or public class. Then should it be a vector of structures? Or a set, or a map?
I'm wondering especially about how to deal with date ranges and times when doing basic analysis. Typically the date is stored as a string in the CSV file. So does it need to be converted before analysis is possible? Or is there a workaround? Say, for example, I wanted to read in data from a CSV file and then find the low price over a given range of dates. That sort of thing.
First off, you should convert the dates into numeric timestamps when manipulating them in C (e.g., seconds or nanoseconds since the epoch). This will simplify your life a great deal.
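As a sketch of that conversion, here's one way to turn a CSV date string into an epoch timestamp in C++; the date format string is an assumption, so adjust it to whatever your data vendor actually uses:

```cpp
#include <cassert>
#include <ctime>
#include <iomanip>
#include <sstream>
#include <string>

// Convert a date string like "2011-07-27 15:30:00" into seconds since
// the Unix epoch. Assumes the format below; mktime interprets the
// fields in the local time zone.
long parseTimestamp(const std::string& s) {
    std::tm tm = {};
    std::istringstream ss(s);
    ss >> std::get_time(&tm, "%Y-%m-%d %H:%M:%S");
    tm.tm_isdst = -1;  // let mktime decide whether DST applies
    return static_cast<long>(std::mktime(&tm));
}
```

Once every row carries a numeric timestamp, range queries and comparisons become plain integer arithmetic instead of string handling.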
How are you accessing the data? Do you just go through the data sequentially? Do you look for particular timestamps? Or do you look for particular values?
Probably the most efficient way to store the data is to keep all the basic fields in a packed C struct, held in one massive array, and then create maps with pointers to the individual pieces of data. For example (pseudocode):
struct data { long timestamp; double open, high, low, close; long volume; };
std::vector<data> dataArray;          // one packed struct per bar, contiguous in memory
std::map<long, data*> timestampMap;   // timestamp -> pointer into dataArray
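To tie this back to the original question (lowest low over a date range), here's a self-contained sketch of a range query against such a timestamp index; the `Bar` struct and `lowestLow` function are illustrative names, not anything standard:

```cpp
#include <cassert>
#include <cfloat>
#include <map>
#include <vector>

struct Bar { long timestamp; double open, high, low, close; long volume; };

// Lowest low over the inclusive timestamp range [from, to], using
// std::map's ordered iteration: lower_bound jumps to the first bar at
// or after 'from', then we walk forward until we pass 'to'.
double lowestLow(const std::map<long, const Bar*>& index, long from, long to) {
    double best = DBL_MAX;
    for (auto it = index.lower_bound(from); it != index.end() && it->first <= to; ++it)
        if (it->second->low < best) best = it->second->low;
    return best;
}
```

One caveat: the map holds pointers into the array, so build the index only after the array has stopped growing (a `std::vector` reallocation would invalidate every pointer).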
The backtesting software I wrote accesses hundreds of megabytes up to about 1-2 GB of data at a time. I use in-memory arrays because they're the most efficient in terms of memory size, and I know that I look at each piece of data once and then throw it away.
The other advantage of having it as an in-memory array is that I can persist the array directly to a file, and then just read it back next time without having to parse the CSV all over again.
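A minimal sketch of that persistence trick, reusing the same hypothetical `Bar` struct: dump the packed array as raw bytes and read it straight back. Note the file is only portable across builds with the same struct layout, padding, and endianness.

```cpp
#include <cassert>
#include <cstdio>
#include <vector>

struct Bar { long timestamp; double open, high, low, close; long volume; };

// Write the packed array as raw bytes in one fwrite call.
bool saveBars(const char* path, const std::vector<Bar>& bars) {
    FILE* f = std::fopen(path, "wb");
    if (!f) return false;
    size_t n = std::fwrite(bars.data(), sizeof(Bar), bars.size(), f);
    std::fclose(f);
    return n == bars.size();
}

// Read the file back into memory, one struct at a time; no CSV
// parsing needed on subsequent runs.
std::vector<Bar> loadBars(const char* path) {
    std::vector<Bar> bars;
    FILE* f = std::fopen(path, "rb");
    if (!f) return bars;
    Bar b;
    while (std::fread(&b, sizeof(Bar), 1, f) == 1) bars.push_back(b);
    std::fclose(f);
    return bars;
}
```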