Forums > Technically Speaking > Programming > Calling all C++ programmers

Thread Tools
Old Jul 27th, 2011, 10:24 AM   #7
Join Date: Jun 2002
Posts: 528
Quote from keyser1:

have you thought of using a database? sql server express or ms access?

'find me the lowest price within this date range' is easy to do in sql.
It's also relatively easy to do in C# .NET using LINQ -- a bunch of technologies to learn, but development speed will be much faster in .NET; execution speed will be slower than C++, but that can be offset by a faster machine.

The answer to your question really depends on
1. How fast do queries have to be
2. How much data is there
3. How often are you adding data
4. How often are you querying data
I think you're probably right. I was surprised to find out that the basic containers (vectors, sets, etc.) don't offer a straightforward solution to the issue of dealing with price-related data... If you know of any good SQL tutorials or books, please let me know. Thanks.
Old Jul 27th, 2011, 10:33 AM   #8
Hook N. Sinker
Join Date: Apr 2005
Posts: 1,592
I prefer to use arrays.
Old Jul 27th, 2011, 10:39 AM   #9
Join Date: Jun 2002
Posts: 528
Quote from gtor514:

Most likely you will use one of the sequence containers (list, deque, vector) to store your time,open,high,low,close,volume object. If you don't need to do a lot of inserting, use vector or deque. If you need to do a lot of accessing, use vector. Just set up a test application from which to do some "profiling" of each of the containers. You can change the container used in your test app in just a few lines of code. That's the beauty of C++.

I store my data objects in a vector because I do a lot of accessing. As for the time stamp from the .csv file: I read the time string into my own custom time class, which stores the date component and the time component in two integers. As a starting point, you could store the time in a single long variable, which could be the Unix time (seconds since the 1970 epoch).

1.) Parse the timestamp string from the CSV file into its date/time components (year, month, day, hour, min, sec).
2.) Create a tm struct (see <ctime> / time.h) from the time components.
3.) Use the mktime function to convert the tm struct to a time_t value, which is just a long integer.
4.) Use the time_t as the time in your object.
That's some nifty processing, Gtor, pretty cool stuff. Re step 3, is the mktime function something from boost? I'll send you a pm too.
Old Jul 27th, 2011, 10:57 AM   #10
Join Date: Jul 2009
Posts: 313
Quote from Maverick1:

Question related to data mining and backtesting: what's the best container to use with typical csv/comma-delimited index futures data?

Let's assume that the basic data components, i.e., open, high, low, close, price, volume are part of a structure or public class. Then should it be a vector of structures? Or a set, or a map?

I'm wondering especially about how to deal with date ranges and times when doing basic analysis. Typically the date is of type string in the csv file. So does it need to be converted for analysis to be possible? Or is there a workaround? Say, for example, I wanted to read in data from a csv file and then find the low price over a given range of dates. That sort of thing.
First off, you should convert the dates into numeric timestamps when manipulating them in C -- whether seconds, nanoseconds, etc. This will simplify your life a great deal.

How are you accessing the data? Do you just go through the data sequentially? Do you look for particular timestamps? Or do you look for particular values?

Probably the most efficient way to store the data is to keep all the basic data in a packed C struct, in one massive array, and then create maps with pointers to the individual records. For example:

struct data {
    int timestamp;
    double open, high, low, close;
    int volume;
};

data *dataArray = new data[100000];

std::map<int, data *> timestampMap;
timestampMap[dataArray[0].timestamp] = &dataArray[0];

The backtesting software I wrote accesses hundreds of megabytes, up to about 1-2 GB of data at a time. I use in-memory arrays because they're the most efficient in terms of memory size, and I know that I look at each piece of data once and then throw it away.

The other advantage of having it as a memory array is that I can directly persist this array as a file, and then just read it next time without having to parse it all over again.
Old Jul 27th, 2011, 11:08 AM   #11
Join Date: Mar 2005
Posts: 217
Quote from Maverick1:

That's some nifty processing, Gtor, pretty cool stuff. Re step 3, is the mktime function something from boost? I'll send you a pm too.
No boost needed. See...

As an alternative, if you're using Windows, there is a COleDateTime class that stores the date/time as a double. You would use the same steps to parse the date/time string into a COleDateTime object, and from there write the double value into your "chart objects (time, op, hi, lo, cl, vol)".
Old Jul 27th, 2011, 11:16 AM   #12
Join Date: Feb 2008
Posts: 1,764
Quote from gtor514:

As a starting point, you could store the time into a single long variable, which could be the Unix times (seconds past the 1970 epoch).
That's the most efficient solution!

Alternatively, you can use the OLE date-time format, where the date-time is a floating-point number and an increase of 1 means exactly one day (24 hours) later. The best part of the OLE format is that the dates and times in a CSV file can be read, understood, and saved by Excel.

Regarding the best format: as others suggested, a vector of OHLC structures is better than four separate vectors.

I assume you won't be adding new dates -- instead you'll just read the input file from the start. So I would avoid list containers like the plague.

A few ideas:
1) If you want to use strings for dates and times anyway, use a sortable format: year, month, day, hour, minute with leading zeros, something like yyyy-MM-dd hh:mm or yy/MM/dd hh:mm. Then the strings will sort in the same order as the corresponding dates.
2) Consider adding to the OHLC structure a boolean flag meaning "price exists". Then a bar will be present for every minute and there will be no need to search for the right time: you will know that the bar 60 minutes later is exactly 60 positions later in the vector/array.
Note that if you implement typical indicators like a moving average in a naive way, this storage method comes with a performance penalty. Say, for a 30-period moving average you need 30 valid prices, so you will have to iterate back through any gaps in the price series until you find 30 bars where the price is not missing.
3) To handle bank holidays and week-ends better, consider storing a list of days when the security is traded with intraday data arranged as arrays. This way each day's data can start at market open and not midnight... and it still will be very easy to find the right time every day.
4) If you want to find the right date in an ordered array of days the binary search method is much faster than iterating through all days.

If you want speed at the cost of some flexibility, consider implementing backtesting as matrix operations. This way you can use one of the popular linear algebra implementations such as ATLAS, MKL, or GotoBLAS. For a case study of how linear algebra subroutines can be used to massively speed up backtesting, see AmiBroker and its AmiBroker Formula Language (AFL).
Copyright © 2014, Elite Trader. All rights reserved.