How do I handle historical data for my C# program?

SpaceCuddle · Jan 25, 2013

I want to handle historical data in an easier way than I do now for my automated trading and research system. I haven't begun live trading yet so I still have a very bad way of handling the data.

It's worth mentioning that the lowest time frame I currently use is M1, and I will not need tick data in the near future.

Currently I simply create arrays in C# that are at least as large as the historical data stored in a .csv file. I then load the data into the arrays and choose start and end index points.

This becomes harder when I want to use multiple time frames, since I then need to sync them up and live-splice the current bar of each time frame. Also (really bad) I link them up manually LOL; that is to say, I manually choose the index for every time frame so they match up. Then each time the iterator data changes (the lowest time frame in use) I make it check whether it's on a new day, week, month, etc., and update the indexes of the larger bars of data accordingly. Messy!

What would be a better way to handle data? All advice is greatly appreciated! I have no perspective on how professional traders or programmers do this. I have only recently learned to program.

murrica · Jan 25, 2013

This is a great question, something I will need to work through myself.

Speaking freely:

Given the overlap between time-based bars (second, minute, hourly, daily), I would think that instead of simple integer-based array indexing, one might use time-based indexing. What about Unix time? It can be expressed as an integer, but now the index carries some intelligence about the start time of a particular bar. For our purposes, assuming we use Unix time, a signed 32-bit value is enough, as we intend to retire this mechanism before 2038. This seems heavy, though, so maybe we would normalize 0 to the start time of our data sets instead of 1970-01-01.

With this, we can maintain multiple data structures that share the same indexing scheme and can therefore be synchronized.
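
A rough sketch of that idea in C#, under a big assumption: every bar has the same length and the series has no gaps (real market data has nights, weekends and holidays, so this only holds for a gap-filled series). The names `TimeIndex`, `ToUnix` and `ToIndex` are made up for illustration.

```csharp
using System;

// Hypothetical helper: maps a bar's start time to an array index,
// with index 0 at the first bar of the data set rather than at 1970-01-01.
// Assumes fixed-length bars and no gaps in the series.
public sealed class TimeIndex
{
    private readonly long _startUnix;   // Unix time (seconds) of the first bar
    private readonly int _barSeconds;   // bar length in seconds, e.g. 60 for M1

    public TimeIndex(DateTime firstBarUtc, int barSeconds)
    {
        _startUnix = ToUnix(firstBarUtc);
        _barSeconds = barSeconds;
    }

    // Seconds elapsed since 1970-01-01 00:00:00 UTC.
    public static long ToUnix(DateTime utc)
    {
        var epoch = new DateTime(1970, 1, 1, 0, 0, 0, DateTimeKind.Utc);
        return (long)(utc - epoch).TotalSeconds;
    }

    // Index of the bar containing the given instant.
    // Assumes the instant is not earlier than the first bar.
    public int ToIndex(DateTime utc)
    {
        return (int)((ToUnix(utc) - _startUnix) / _barSeconds);
    }
}
```

With one `TimeIndex` per time frame sharing the same start time, the same timestamp gives the matching position in each array: 03:25 on the first day is index 205 in the M1 array and index 3 in the H1 array.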

There's surely more to this, but just wanted to kick off an idea.

SpaceCuddle · Jan 25, 2013

Good idea!

Accessing the data by time would most likely be the best way, however it is implemented.

But I would like to split our options into two categories: handling the data inside the program, or outside of it in some kind of database.

I bring up databases because I'm pretty sure people use them for this kind of thing but I don't know much about the positives or negatives of the approach, including how fast a database would be.

I would love to hear from high frequency traders. I have heard of some special kinds of databases, but I can't find them right now. How are they used and why?

dom993 · Jan 25, 2013

How about using the free version of NinjaTrader? It will do anything and everything for you, except placing actual trades.

When you are ready to go live, just buy a live-trading license.

Then, I would suggest you take at least a 1-month subscription to Kinetick Real-Time ... that will give you 4 years of historical minute data on any instrument you would like.

Of course, you can also import your minute-data, if you prefer.

Anyway, there is really no point in trying to re-invent the (trading) wheel.

murrica · Jan 25, 2013

If you have lots of data then I would recommend flat serialized files, and possibly looking into some type of memory mapping.
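
To make that concrete, here is a minimal sketch using .NET's `BinaryWriter` and `MemoryMappedFile`. The record layout (an 8-byte timestamp plus four doubles) and the names `BarFile`, `Write` and `ReadClose` are assumptions for illustration, not a standard format.

```csharp
using System;
using System.IO;
using System.IO.MemoryMappedFiles;

// Hypothetical fixed-size record layout: 8-byte Unix time followed by four
// 8-byte doubles (open, high, low, close), i.e. 40 bytes per bar.
public static class BarFile
{
    public const int RecordSize = 8 + 4 * 8;

    // Writes bars as fixed-size binary records, so bar i starts at byte i * RecordSize.
    public static void Write(string path, long[] times, double[][] ohlc)
    {
        using (var writer = new BinaryWriter(File.Create(path)))
        {
            for (int i = 0; i < times.Length; i++)
            {
                writer.Write(times[i]);
                for (int k = 0; k < 4; k++) writer.Write(ohlc[i][k]);
            }
        }
    }

    // Reads the close of bar `index` through a memory-mapped view,
    // without loading the rest of the file into an array.
    // Assumes a little-endian machine, matching BinaryWriter's byte order.
    public static double ReadClose(string path, int index)
    {
        using (var mmf = MemoryMappedFile.CreateFromFile(path, FileMode.Open))
        using (var view = mmf.CreateViewAccessor())
        {
            long offset = (long)index * RecordSize;
            return view.ReadDouble(offset + 8 + 3 * 8);   // skip time, open, high, low
        }
    }
}
```

In a real program you would probably open the mapping once and keep it around instead of reopening it on every read; the sketch reopens it only to stay short.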

A properly indexed DB would be fine if your dataset is not huge.

Yes, there are other options but if you have never used a DB (and if you do not have a substantial amount of data) I totally recommend learning a relational DB like Postgres first.

hft_boy · Jan 25, 2013

First of all, you should be using dynamically sized arrays and not allocating them yourself. Second, databases suck and there is no point in using them for these kinds of data sets. Third, C# is overkill for this; just use R and its xts package. It is designed for easy merging and manipulation of these bars; if you use it correctly it will just do everything for you.

SteveH · Jan 25, 2013

Since you're programming in C#, you might want to look into ways to use LINQ to query your in-memory data. If you have large datasets where the work can be split up and parallelized onto multiple cores, then look into using Parallel LINQ as well.
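
As one possible illustration, the query below builds daily bars from minute bars with LINQ's `GroupBy`. The `Bar` class and `Resampler.ToDaily` are hypothetical names, and the sketch assumes the input is already sorted by time.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical bar type for illustration.
public sealed class Bar
{
    public DateTime Time;   // bar start time
    public double Open, High, Low, Close;
}

public static class Resampler
{
    // Groups minute bars by calendar date and folds each group into one daily bar.
    // Assumes the input is sorted by time, so First()/Last() are the day's
    // first and last minute bars.
    public static List<Bar> ToDaily(IEnumerable<Bar> minuteBars)
    {
        return minuteBars
            .GroupBy(b => b.Time.Date)
            .Select(g => new Bar
            {
                Time = g.Key,
                Open = g.First().Open,
                High = g.Max(b => b.High),
                Low = g.Min(b => b.Low),
                Close = g.Last().Close
            })
            .ToList();
    }
}
```

Since the larger bars are derived from the smallest time frame, the time frames cannot drift out of sync. If you move this to Parallel LINQ with `AsParallel()`, keep in mind that ordering is not preserved by default, so the `First()`/`Last()` calls would need care.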

bluematrix · Jan 26, 2013

Hi,

Why don't you use lists? E.g. List<double> priceClose = new List<double>();

This way you can add or remove items on the go using priceClose.Add().

Lists are more appropriate for time series data analysis. Actually, I don't know anyone who still uses arrays, unless for something very specific.

You can then store the datetime of each price alongside the price, either in a multi-dimensional list or in a class that has datetime, price, etc., with instances of that class stored in a list. That way you can look up matching values by datetime across instruments.
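
A minimal sketch of the class-in-a-list approach, with a binary search to find the bar in force at a given time. `PriceBar`, `BarLookup` and `IndexAtOrBefore` are hypothetical names, and the list is assumed to be sorted by time.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical record type: one price with its timestamp.
public sealed class PriceBar
{
    public DateTime Time;
    public double Close;
}

public static class BarLookup
{
    // Returns the index of the last bar whose Time is <= t, or -1 if there is none.
    // Assumes `bars` is sorted by Time in ascending order.
    public static int IndexAtOrBefore(List<PriceBar> bars, DateTime t)
    {
        int lo = 0, hi = bars.Count - 1, result = -1;
        while (lo <= hi)
        {
            int mid = lo + (hi - lo) / 2;
            if (bars[mid].Time <= t)
            {
                result = mid;     // candidate; look for a later one
                lo = mid + 1;
            }
            else
            {
                hi = mid - 1;
            }
        }
        return result;
    }
}
```

Calling this on a list of daily bars with the timestamp of the current M1 bar returns the matching daily bar, and the same lookup works across instruments as long as each list is sorted.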

I highly recommend storing your data in a database; SQL Server is great. Those who are against it simply don't know how to use it. Most firms use it and it works great. You can have 20 million records and retrieve the data you need in less than a second if the table is indexed correctly.

When it comes to programming languages nothing is perfect, but C# is really the leading language given that it can do most things in the nicest way (in less time than most), borrows a lot of concepts from the best other languages, and is fast (fast enough for production). Having said that, the best programming language is the one you know best. The truth, though, is that if you really want to be well set up you must combine them, since certain things go a lot more smoothly in other languages. There is not a single firm I know of that uses only one programming language.

R is fantastic. You can do a nice analysis in 5 minutes if you have the data, where C++ takes an hour or days to give the same output. However, back-test millions of data points and R will take at least 30 minutes, while C++ will return the result in a few seconds. There is a cost-reward trade-off with everything. Making marginal adjustments in C++, again, is very costly. Python has a lot of supporters because it is elegant in that sense: it is less verbose in achieving the same goal, but again very, very slow, although there are C libraries to speed it up.

So, in short, it takes more than a simple programmer to really get things right here. You need to be a good thinker from the outset and use the right tools. It's like being Ford on his production line: researching the right tools and knowing what works best. You can combine most languages nowadays. C++ can be called from R, R can be used from C#, the same goes for Python, etc.

NetTecture · Jan 26, 2013

Oh my. Whatever happened to programming and the stuff that was standard 20 years ago?

Lists are slow - DAMN SLOW. If you add / remove it is a lot of operations in the background.

Use circular buffers that implement IList. This is programming for people who are not total beginners; 20 years ago, at least, data structures were standard knowledge.
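
A short sketch of such a circular buffer follows. To keep it brief it implements `IReadOnlyList<T>` rather than the full `IList<T>` interface, and the name `RingBuffer<T>` is made up for illustration.

```csharp
using System;
using System.Collections;
using System.Collections.Generic;

// Fixed-capacity ring buffer: once full, adding overwrites the oldest item.
// Nothing is shifted or reallocated on Add.
public sealed class RingBuffer<T> : IReadOnlyList<T>
{
    private readonly T[] _items;
    private int _head;    // position of the oldest item
    private int _count;

    public RingBuffer(int capacity)
    {
        if (capacity <= 0) throw new ArgumentOutOfRangeException("capacity");
        _items = new T[capacity];
    }

    public int Count { get { return _count; } }

    // Index 0 is the oldest item, Count - 1 the newest.
    public T this[int index]
    {
        get
        {
            if (index < 0 || index >= _count) throw new ArgumentOutOfRangeException("index");
            return _items[(_head + index) % _items.Length];
        }
    }

    public void Add(T item)
    {
        if (_count < _items.Length)
        {
            _items[(_head + _count) % _items.Length] = item;
            _count++;
        }
        else
        {
            _items[_head] = item;                   // overwrite the oldest item
            _head = (_head + 1) % _items.Length;    // the next one becomes the oldest
        }
    }

    public IEnumerator<T> GetEnumerator()
    {
        for (int i = 0; i < _count; i++) yield return this[i];
    }

    IEnumerator IEnumerable.GetEnumerator() { return GetEnumerator(); }
}
```

This fits a rolling window of the most recent N bars, where each new bar pushes out the oldest one.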

SpaceCuddle · Jan 26, 2013

Thanks for your reply, bluematrix, and to others as well.

The reason I don't use lists is that I have a lot of historical data and need to be able to access any part of it reasonably fast. I would think a list would be very slow unless I need all the data starting from element one, because a list has to iterate over itself to reach any given element.
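
For reference, .NET's `List<T>` is backed by an array, so reading an element by position does not walk the list; the type that needs traversal is `LinkedList<T>`. A minimal sketch (`ListDemo.ReadAt` is a made-up name):

```csharp
using System;
using System.Collections.Generic;

public static class ListDemo
{
    // Loads `count` values into a List<double> and reads one back by position.
    public static double ReadAt(int count, int index)
    {
        // Passing the capacity up front avoids repeated resizing of the internal array.
        var closes = new List<double>(count);
        for (int i = 0; i < count; i++) closes.Add(i * 0.5);

        // The indexer reads straight from the backing array, with no traversal.
        return closes[index];
    }
}
```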

Additionally I never have to add or remove anything so I guess I'm doing things differently from what you have in mind.

About storing the data in a database. Originally I had thought that one would repeatedly ask the database for more data once it would be needed, but I now realize that would be incredibly stupid and slow. Of course one should (as I do) have internal storage for the data like in an array or list. One just loads the data in first via a database, which would be nice. So I might do that.

I'm not a fan of Python in the slightest. But I'm looking at R and I'm taking 2 Coursera classes on R and data analysis. It's just to get a feel for different industries to learn from their preferred methods.

Anyway, I'm not sure where I'm going with this and I don't think I need more help. There is no magic solution; I'm just going to remove all the error-prone manual steps and error-prone code and implement something better that matches time frames for me.

Thanks everyone!