Cleaning your data

Discussion in 'Automated Trading' started by runtrader, Aug 18, 2015.

  1. runtrader

    I'm a software developer and I've created automated trading tools (in Java) that download, store, and analyse historical data from various data vendors. Before this data hits my systematic trading system, I run it through a cleaning process that removes or fixes dubious data: outliers, missing days, and so on. What amazes me is how much the quality varies: some of this data is pretty good and some of it is amazingly bad. What amazes me even more is that a lot of people out there don't clean their data at all; they just assume it is correct!

    I'm interested to hear how others clean data before using it in their trading systems.
     
    Last edited: Aug 18, 2015
  2. runtrader

    My own research and analysis tools (Java-based) require access to historical data for thousands of instruments across multiple asset types. I need to normalize this data to ensure I'm comparing apples to apples; any inconsistencies can skew the results widely.

    For example, consider comparing the performance of a US stock versus a UK stock. Since the exchange holidays differ, each series has days that the other is missing. How does one compensate for this? I just duplicate the last good price and use it for the missing day. Just wondering if anyone has any better ideas?
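
    My real tools are in Java, but in pandas the idea is roughly this (just a sketch with placeholder names, not production code):

        import pandas as pd

        def align_and_ffill(us_close: pd.Series, uk_close: pd.Series) -> pd.DataFrame:
            """Align two daily close series on the union of their trading dates,
            then carry the last good price forward over each market's holidays."""
            combined = pd.concat({"US": us_close, "UK": uk_close}, axis=1)  # outer join = both calendars
            return combined.ffill()  # duplicate the last good price on missing days

    The forward-fill means a holiday in one market just repeats that market's previous close, so both series stay the same length.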

    What about data spikes? I take the ATR, and if a price is more than 3 standard deviations out I flag it and manually check it against multiple sources to see whether it's a real price. This requires a bit of manual intervention and is time consuming. Does anyone have better ideas for identifying and cleaning spikes?
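
    My Java code is ATR-based; a rough pandas equivalent using a rolling z-score on daily returns instead (the window and the 3-sigma threshold are arbitrary) would look something like:

        import pandas as pd

        def flag_spikes(close: pd.Series, window: int = 20, n_std: float = 3.0) -> pd.Series:
            """Return a boolean Series marking closes whose daily return is more than
            n_std rolling standard deviations away from the rolling mean return."""
            returns = close.pct_change()
            mean = returns.rolling(window).mean()
            std = returns.rolling(window).std()
            return (returns - mean).abs() > n_std * std

    Anything flagged still gets checked by hand against another source before I touch it.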

    If you run an automated/systematic trading system and don't check and clean your data: why not?
     
    Last edited: Aug 18, 2015
  3. 2rosy

    Python with blaze, pandas, sklearn.preprocessing, statsmodels.

    I don't really think about it; it's 1 or 2 lines of code.
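
    Something in this spirit (file name and thresholds are placeholders, not my actual pipeline):

        import pandas as pd

        # prices: DataFrame of daily closes, one column per instrument (placeholder file)
        prices = pd.read_csv("prices.csv", index_col=0, parse_dates=True)

        # drop days with missing prices, then winsorize each column at its 0.1/99.9 percentiles
        clean = prices.dropna().apply(lambda c: c.clip(c.quantile(0.001), c.quantile(0.999)))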
     
  4. runtrader

    I assume that this means you leave it to your software to automatically 'munge' your data. What rules does it use to determine if a price is correct? What if the software drops a valid price? Do you know what it is doing under the hood?

    I suppose what counts as clean data depends on the type of trading your system is doing; one size doesn't fit all. I, for one, need clean data to ensure my analysis is correct.
     
  5. spacewiz

    runtrader, what sources do you use for your data? I'm assuming they are available to retail traders. Could you provide more info on which ones are good versus which ones are problematic? Thanks!
     
  6. 2rosy

    As long as the data is not N standard deviations away, it's fine for historical backtests. If you are nitpicking every data point over the last 20 years or so, you'll never get to the analysis.
    I wrote what's under the hood.
     
  7. aqtrader

    That is exactly what I've seen. I've asked data-cleaning questions but never actually got any good responses. Even just with historical EOD data from many sources, bad prints are everywhere. Many of them can be detected easily, such as obvious out-of-band prints, but a lot are still hard to pick out. In any case, running some data checks and trying to fix what you find is a good idea.
     
  8. aqtrader

    Missing data on some days when comparing multiple historical time series is a common issue. What I do is simply normalize the time series by forcing them to use the same time array, either by adding the missing daily data or by removing unwanted days from some of the series.
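
    In pandas terms that amounts to roughly the following (just a sketch, names are placeholders):

        import pandas as pd

        def align_on_common_dates(series: dict, how: str = "union") -> pd.DataFrame:
            """Force several daily series onto one common time array.

            how="union":        keep every date any series has, forward-filling the gaps
            how="intersection": keep only dates present in all series, dropping the rest
            """
            df = pd.concat(series, axis=1)  # outer join = union of all dates
            if how == "union":
                return df.ffill()
            return df.dropna()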
     
  9. runtrader

    I mainly use historical data for US, European and UK stocks and ETFs. I've found that free Yahoo data is full of gaps and spikes. I now use historical data from Datalink (Metastock), which is sourced from Thomson Reuters, and I've found this to be a much better source. There are still holes and spikes, but far fewer.
     
  10. runtrader

    I detect data gaps and outliers and fix most of them automatically; some I flag for manual intervention. As 2rosy mentioned, if you try to manually flag every single issue over 20 years of historical data it'll take forever, especially if you have many instruments.

    The real question is: when is the data 'good enough'? I suppose it depends on what the data is used for. Back-tests of short-term mean-reversion strategies are much more sensitive to data spikes than long-term trend-following strategies.

    I currently compare my paid data source against the free one manually. The next stage would be to automate this and grab missing data from either source, rather than automatically interpolating across gaps and spikes.
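
    Roughly what that next stage might look like, as a pandas sketch (the 5% disagreement threshold and the names are placeholders; I haven't built this yet):

        import pandas as pd

        def fill_from_backup(primary: pd.Series, backup: pd.Series,
                             tolerance: float = 0.05) -> pd.Series:
            """Fill gaps in the paid (primary) series from the free (backup) source,
            and report dates where the two sources disagree by more than `tolerance`."""
            merged = primary.combine_first(backup)  # backup used only where primary is missing
            both = pd.concat({"primary": primary, "backup": backup}, axis=1).dropna()
            disagree = (both["primary"] / both["backup"] - 1).abs() > tolerance
            if disagree.any():
                print("Check against a third source:", list(both.index[disagree]))
            return merged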

    Perhaps I'm being too pedantic about cleaning my data!
     