Calling all C++ programmers

Maverick1 · Jul 26, 2011

Question related to data mining and backtesting: what's the best container to use with the typical csv/comma delineated index futures?

Let's assume that the basic data components, i.e., open, high, low, close, price, volume are part of a structure or public class. Then should it be a vector of structures? Or a set, or a map?

I'm wondering especially about how to deal with date ranges and times when doing basic analysis. Typically the date is of type string in the csv file. So does it need to be converted for analysis to be possible? Or is there a workaround? Say, for example, I wanted to read in data from a csv file and then find the low price over a given range of dates. That sort of thing.

rosy2 · Jul 26, 2011

list of bar objects

and it's "delimited", not "delineated"

Craig66 · Jul 26, 2011

Depends on what you want to do with the objects. If you want to index them by time, then you're going to have to stick them in a map; however, this will be more expensive than an unordered container.
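As a rough illustration of the map idea, here is a minimal sketch that keys bars by a numeric timestamp and answers the "lowest price in a date range" question. The Bar layout, the BarsByTime typedef and the lowestLowInRange name are all assumptions made for the example, and it presumes the date strings have already been converted to time_t.

```cpp
#include <ctime>
#include <limits>
#include <map>

// Hypothetical bar layout; the field names are assumptions for illustration.
struct Bar {
    double open, high, low, close;
    long   volume;
};

// Bars keyed by timestamp; std::map keeps its keys in sorted order.
typedef std::map<std::time_t, Bar> BarsByTime;

// Lowest low among bars with from <= time <= to.
// Returns +infinity when no bar falls inside the range.
double lowestLowInRange(const BarsByTime& bars, std::time_t from, std::time_t to) {
    double lowest = std::numeric_limits<double>::infinity();
    if (from > to) return lowest;                              // empty range

    BarsByTime::const_iterator it  = bars.lower_bound(from);   // first key >= from
    BarsByTime::const_iterator end = bars.upper_bound(to);     // first key >  to
    for (; it != end; ++it)
        if (it->second.low < lowest) lowest = it->second.low;
    return lowest;
}
```

Both bound lookups are logarithmic in the number of bars, so the cost is dominated by walking the bars inside the range.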

A list is a good idea if you're going to be doing a lot of adds and removes from the middle of the container, but I'm not sure why you would want to do that.

The best option IMO is either a vector or a deque. These containers let you index the collection by numerical index and add objects cheaply at the end (vector) or at both ends (deque). Standard min/max functions will work over any container as long as you supply an appropriate functor.
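A minimal sketch of the functor point, assuming a hypothetical Bar struct and using std::min_element as the "standard min function". The names Bar, LowerLow and lowestLow are made up for the example.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Hypothetical bar layout; the field names are assumptions for illustration.
struct Bar {
    long   time;                     // e.g. seconds since the epoch
    double open, high, low, close;
    long   volume;
};

// Functor that orders bars by their low price.
struct LowerLow {
    bool operator()(const Bar& a, const Bar& b) const { return a.low < b.low; }
};

// Lowest low over bars[first, last), addressed by numerical index.
// Assumes first < last <= bars.size().
double lowestLow(const std::vector<Bar>& bars, std::size_t first, std::size_t last) {
    return std::min_element(bars.begin() + first, bars.begin() + last, LowerLow())->low;
}
```

Swapping std::vector for std::deque here should only require changing the parameter type, since both provide random-access iterators.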

keyser1 · Jul 26, 2011

have you thought of using a database? sql server express or ms access?

'find me the lowest price within this date range' is easy to do in sql.
It's also relatively easy to do in C# .NET using LINQ. That's a bunch of technologies to learn, but speed of development will be much faster in .NET; execution speed will be slower than C++, but that can be solved by having a faster machine.

The answer to your question really depends on
1. How fast do queries have to be
2. How much data is there
3. How often are you adding data
4. How often are you querying data

gtor514 · Jul 26, 2011

Most likely you will use one of the sequence containers (list, deque, vector) to store your time, open, high, low, close, volume object. If you don't need to do a lot of inserting, use the vector or deque. If you need to do a lot of accessing, use the vector. Just set up a test application to profile each of the containers. You can change the container used in your test app in just a few lines of code. That's the beauty of C++.

I store my data objects in a vector because I do a lot of accessing. As for the time stamp from the .csv file, I read the time string into my own custom time class, which stores the date component and the time component as two integers. As a starting point, you could store the time in a single long variable holding the Unix time (seconds since the 1970 epoch).

1.) parse the string of the time stamp from the csv file into the date/time components (year, month, day, hour, min, sec)
2.) create a tm struct (see <ctime> / time.h) from the time components
3.) use the mktime function to convert the tm struct to a time_t value, which on most platforms is just an integer count of seconds.
4.) use the time_t as the time in your object.
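The four steps above could be sketched roughly as follows. The "YYYY-MM-DD HH:MM:SS" layout and the parseTimestamp name are assumptions; adjust the sscanf pattern to whatever the csv file actually contains. Note that mktime interprets the fields as local time.

```cpp
#include <cstdio>
#include <ctime>

// Parses "YYYY-MM-DD HH:MM:SS" (an assumed format) into a time_t.
// Returns false if the text does not match or mktime cannot represent it.
bool parseTimestamp(const char* text, std::time_t& out) {
    // Step 1: split the string into its date/time components.
    int year, month, day, hour, min, sec;
    if (std::sscanf(text, "%d-%d-%d %d:%d:%d",
                    &year, &month, &day, &hour, &min, &sec) != 6)
        return false;

    // Step 2: fill a tm struct (years count from 1900, months from 0).
    std::tm parts = std::tm();
    parts.tm_year  = year - 1900;
    parts.tm_mon   = month - 1;
    parts.tm_mday  = day;
    parts.tm_hour  = hour;
    parts.tm_min   = min;
    parts.tm_sec   = sec;
    parts.tm_isdst = -1;             // let the library work out daylight saving

    // Step 3: convert to time_t, interpreted in the local time zone.
    std::time_t t = std::mktime(&parts);
    if (t == static_cast<std::time_t>(-1))
        return false;

    // Step 4: hand the value back for storage in the bar object.
    out = t;
    return true;
}
```

Once every bar carries a time_t, date-range checks become plain numeric comparisons.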

Maverick1 · Jul 27, 2011

Quote from Craig66:

Depends on what you want to do with the objects. If you want to index them by time, then you're going to have to stick them in a map; however, this will be more expensive than an unordered container.

A list is a good idea if you're going to be doing a lot of adds and removes from the middle of the container, but I'm not sure why you would want to do that.

The best option IMO is either a vector or a deque. These containers let you index the collection by numerical index and add objects cheaply at the end (vector) or at both ends (deque). Standard min/max functions will work over any container as long as you supply an appropriate functor.

Agree with your thought on the list. I use a vector of structure objects with my 1 min data file (spans a year). Call the object 'Bar' for example, containing: date time open high low close volume, where date and time are strings.

From there, for sorting, I've tried using a set and sorting it using a simple functor. I'm running into some trouble however when I try to select a date range because the dates are stored in the structure object as a string.

Is there a way to iterate over the vector using the date only? Do I have to build a separate index to do this like mentioned above?
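One possible approach to this question, sketched under two assumptions: the date strings have been converted to a numeric time_t, and the vector is kept sorted by time. In that case a binary search can find the range without a separate index. The Bar, TimeLess and barsInRange names are hypothetical.

```cpp
#include <algorithm>
#include <ctime>
#include <utility>
#include <vector>

// Hypothetical bar with a numeric timestamp instead of date/time strings.
struct Bar {
    std::time_t time;
    double open, high, low, close;
    long   volume;
};

// Compares a bar against a bare timestamp, in both argument orders,
// so the same functor works for lower_bound and upper_bound.
struct TimeLess {
    bool operator()(const Bar& bar, std::time_t t) const { return bar.time < t; }
    bool operator()(std::time_t t, const Bar& bar) const { return t < bar.time; }
};

typedef std::vector<Bar>::const_iterator BarIt;

// Returns the iterator pair [first, last) covering bars with from <= time <= to.
// Precondition: bars is sorted by ascending time.
std::pair<BarIt, BarIt> barsInRange(const std::vector<Bar>& bars,
                                    std::time_t from, std::time_t to) {
    BarIt first = std::lower_bound(bars.begin(), bars.end(), from, TimeLess());
    BarIt last  = std::upper_bound(first, bars.end(), to, TimeLess());
    return std::make_pair(first, last);
}
```

The returned pair can be walked with an ordinary for loop or passed straight to an algorithm such as std::min_element.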

Maverick1 · Jul 27, 2011

Quote from keyser1:

have you thought of using a database? sql server express or ms access?

'find me the lowest price within this date range' is easy to do in sql.
It's also relatively easy to do in C# .NET using LINQ. That's a bunch of technologies to learn, but speed of development will be much faster in .NET; execution speed will be slower than C++, but that can be solved by having a faster machine.

The answer to your question really depends on
1. How fast do queries have to be
2. How much data is there
3. How often are you adding data
4. How often are you querying data

I think you're probably right. I was surprised to find out that the basic containers (vectors, sets, etc.) don't offer a straightforward solution to the issue of dealing with price-related data... If you know of any good SQL tutorials or books, please let me know. Thanks.

Hook N. Sinker · Jul 27, 2011

I prefer to use arrays.

Maverick1 · Jul 27, 2011

Quote from gtor514:

Most likely you will use one of the sequence containers (list, deque, vector) to store your time, open, high, low, close, volume object. If you don't need to do a lot of inserting, use the vector or deque. If you need to do a lot of accessing, use the vector. Just set up a test application to profile each of the containers. You can change the container used in your test app in just a few lines of code. That's the beauty of C++.

I store my data objects in a vector because I do a lot of accessing. As for the time stamp from the .csv file, I read the time string into my own custom time class, which stores the date component and the time component as two integers. As a starting point, you could store the time in a single long variable holding the Unix time (seconds since the 1970 epoch).

1.) parse the string of the time stamp from the csv file into the date/time components (year, month, day, hour, min, sec)
2.) create a tm struct (see <ctime> / time.h) from the time components
3.) use the mktime function to convert the tm struct to a time_t value, which on most platforms is just an integer count of seconds.
4.) use the time_t as the time in your object.

That's some nifty processing, Gtor, pretty cool stuff. Re step 3, is the mktime function something from boost? I'll send you a pm too.

jedwards · Jul 27, 2011

Quote from Maverick1:

Question related to data mining and backtesting: what's the best container to use with the typical csv/comma delineated index futures?

Let's assume that the basic data components, i.e., open, high, low, close, price, volume are part of a structure or public class. Then should it be a vector of structures? Or a set, or a map?

I'm wondering especially about how to deal with date ranges and times when doing basic analysis. Typically the date is of type string in the csv file. So does it need to be converted for analysis to be possible? Or is there a workaround? Say, for example, I wanted to read in data from a csv file and then find the low price over a given range of dates. That sort of thing.

First off, you should convert the dates into numeric timestamps (seconds, nanoseconds, etc.) when manipulating them in C. This will simplify your life a great deal.

How are you accessing the data? Do you just go through the data sequentially? Do you look for particular timestamps? Or do you look for particular values?

Probably the most efficient way to store the data is to keep all the basic data in a packed C-struct, in a massive array, and then create maps with pointers to individual pieces of data. For example (a rough sketch):

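#include <map>
#include <vector>

struct data
{
    int timestamp;                   // e.g. seconds since the epoch
    double open, high, low, close;
    int volume;
};

int main()
{
    // One contiguous block of bars; do not resize it once pointers are taken.
    std::vector<data> dataArray(100000);

    // Index from timestamp to the bar's location in the array.
    std::map<int, data*> timestampMap;
    timestampMap[dataArray[0].timestamp] = &dataArray[0];
}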

The backtesting software I wrote accesses anywhere from hundreds of megabytes up to about 1-2 GB of data at a time. I use in-memory arrays because they're the most efficient in terms of memory size, and I know that I just look at one piece of data at a time and then throw it away.

The other advantage of having it as a memory array is that I can directly persist this array as a file, and then just read it next time without having to parse it all over again.
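A minimal sketch of that persistence idea, assuming a plain struct with no pointers or strings inside it. The Bar layout and the saveBars/loadBars names are made up for the example. The resulting file mirrors the in-memory layout, so it is only safe to read back with a program built for the same platform, compiler settings and struct definition.

```cpp
#include <cstddef>
#include <cstdio>
#include <vector>

// Hypothetical plain-data bar; it must not contain pointers or std::string.
struct Bar {
    long   time;
    double open, high, low, close;
    long   volume;
};

// Writes the raw bytes of the whole array to a binary file.
bool saveBars(const char* path, const std::vector<Bar>& bars) {
    std::FILE* f = std::fopen(path, "wb");
    if (!f) return false;
    std::size_t written =
        bars.empty() ? 0 : std::fwrite(&bars[0], sizeof(Bar), bars.size(), f);
    bool closed = (std::fclose(f) == 0);
    return closed && written == bars.size();
}

// Reads the file back in fixed-size chunks, with no text parsing involved.
bool loadBars(const char* path, std::vector<Bar>& bars) {
    std::FILE* f = std::fopen(path, "rb");
    if (!f) return false;
    bars.clear();
    Bar chunk[1024];
    std::size_t n;
    while ((n = std::fread(chunk, sizeof(Bar), 1024, f)) > 0)
        bars.insert(bars.end(), chunk, chunk + n);
    bool ok = (std::ferror(f) == 0);
    std::fclose(f);
    return ok;
}
```

A typical flow would be to parse the csv once, call saveBars, and on later runs call loadBars instead of parsing again.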