Cleaning your data

Discussion in 'Automated Trading' started by runtrader, Aug 18, 2015.

  1. runtrader

    I'm a software developer and I've created automated trading tools (in Java) that download, store, and analyse historical data from various data vendors. Before this data hits my systematic trading system, I run it through a cleaning process that removes or fixes dubious data: outliers, missing days, and so on. What amazes me is how much the quality varies: some of this data is pretty good and some of it is amazingly bad. What amazes me even more is that a lot of people out there don't clean their data at all; they just assume it is correct!

    I'm interested to hear how others clean data before using it in their trading systems.
     
    Last edited: Aug 18, 2015
  2. runtrader

    My own research and analysis tools (Java-based) require access to historical data for thousands of instruments across multiple asset types. I need to normalize this data to ensure I'm comparing apples to apples; any inconsistencies can skew the results widely.

    For example, consider comparing the performance of a US stock versus a UK stock. Since the exchange holidays differ, each series has days that the other is missing. How does one compensate for this? I just duplicate the last good price and use it for the missing day. Just wondering if anyone has any better ideas?
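
    My real tools are in Java, but in pandas the idea is roughly this (just a sketch with placeholder names, not production code):

        import pandas as pd

        def align_and_ffill(us_close: pd.Series, uk_close: pd.Series) -> pd.DataFrame:
            """Align two daily close series on the union of their trading dates,
            then carry the last good price forward over each market's holidays."""
            combined = pd.concat({"US": us_close, "UK": uk_close}, axis=1)  # outer join = both calendars
            return combined.ffill()  # duplicate the last good price on missing days

    The forward-fill means a holiday in one market just repeats that market's previous close, so both series stay the same length.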

    What about data spikes? I take the ATR, and if a price is more than 3 standard deviations out I flag it and manually check it against multiple sources to see whether it's a real price. This requires a bit of manual intervention and is time consuming. Does anyone have better ideas for identifying and cleaning spikes?
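
    My Java code is ATR-based; a rough pandas equivalent using a rolling z-score on daily returns instead (the window and the 3-sigma threshold are arbitrary) would look something like:

        import pandas as pd

        def flag_spikes(close: pd.Series, window: int = 20, n_std: float = 3.0) -> pd.Series:
            """Return a boolean Series marking closes whose daily return is more than
            n_std rolling standard deviations away from the rolling mean return."""
            returns = close.pct_change()
            mean = returns.rolling(window).mean()
            std = returns.rolling(window).std()
            return (returns - mean).abs() > n_std * std

    Anything flagged still gets checked by hand against another source before I touch it.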

    If you run an automated/systematic trading system and don't check and clean your data: why not?
     
    Last edited: Aug 18, 2015
  3. 2rosy

    Python with blaze, pandas, sklearn.preprocessing, statsmodels.

    I don't really think about it; it's 1 or 2 lines of code.
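
    Something in this spirit (file name and thresholds are placeholders, not my actual pipeline):

        import pandas as pd

        # prices: DataFrame of daily closes, one column per instrument (placeholder file)
        prices = pd.read_csv("prices.csv", index_col=0, parse_dates=True)

        # drop days with missing prices, then winsorize each column at its 0.1/99.9 percentiles
        clean = prices.dropna().apply(lambda c: c.clip(c.quantile(0.001), c.quantile(0.999)))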
     
  4. runtrader

    I assume that this means you leave it to your software to automatically 'munge' your data. What rules does it use to determine if a price is correct? What if the software drops a valid price? Do you know what it is doing under the hood?

    I suppose what counts as clean data depends on the type of trading your system is doing; one size doesn't fit all. I, for one, need clean data to ensure my analysis is correct.
     
  5. spacewiz

    runtrader, what sources do you use for your data? I'm assuming they are available to retail traders. Could you provide more info on which ones are good versus which ones are problematic? Thanks!
     
  6. 2rosy

    As long as the data is not N standard deviations away, it's fine for historical backtests. If you are nitpicking every data point over the last 20 years or so, you'll never get to the analysis.
    I wrote what's under the hood.
     
  7. aqtrader

    That is exactly what I've seen. I've asked data-cleaning questions but never actually got any good responses. Even just with historical EOD data from many sources, bad prints are everywhere. Many of them can be detected easily, such as obvious out-of-band prints, but a lot are still hard to pick out. In any case, running some data checks and trying to fix what you find is a good idea.
     
  8. aqtrader

    Missing data on some days when comparing multiple historical time series is a common issue. What I do is simply normalize the time series by forcing them to use the same time array, either by adding the missing daily data or by removing unwanted days from some of the series.
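
    In pandas terms that amounts to roughly the following (just a sketch, names are placeholders):

        import pandas as pd

        def align_on_common_dates(series: dict, how: str = "union") -> pd.DataFrame:
            """Force several daily series onto one common time array.

            how="union":        keep every date any series has, forward-filling the gaps
            how="intersection": keep only dates present in all series, dropping the rest
            """
            df = pd.concat(series, axis=1)  # outer join = union of all dates
            if how == "union":
                return df.ffill()
            return df.dropna()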
     
  9. runtrader

    I mainly use historical data for US, European and UK stocks and ETFs. I've found that free Yahoo data is full of gaps and spikes. I now use historical data from Datalink (Metastock), which is sourced from Thomson Reuters, and I've found this to be a much better source. There are still holes and spikes, but far fewer.
     
  10. runtrader

    I detect data gaps and outliers and fix most of them automatically; some I flag for manual intervention. As 2rosy mentioned, if you try to manually flag every single issue over 20 years of historical data it'll take forever, especially if you have many instruments.

    The real question is: when is the data 'good enough'? I suppose it depends on what the data is used for. Back-tests of short-term mean-reversion strategies are much more sensitive to data spikes than long-term trend-following strategies.

    I currently compare my paid data source against the free one manually. The next stage would be to automate this and grab missing data from either source, rather than automatically interpolating across gaps and spikes.
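
    Roughly what that next stage might look like, as a pandas sketch (the 5% disagreement threshold and the names are placeholders; I haven't built this yet):

        import pandas as pd

        def fill_from_backup(primary: pd.Series, backup: pd.Series,
                             tolerance: float = 0.05) -> pd.Series:
            """Fill gaps in the paid (primary) series from the free (backup) source,
            and report dates where the two sources disagree by more than `tolerance`."""
            merged = primary.combine_first(backup)  # backup used only where primary is missing
            both = pd.concat({"primary": primary, "backup": backup}, axis=1).dropna()
            disagree = (both["primary"] / both["backup"] - 1).abs() > tolerance
            if disagree.any():
                print("Check against a third source:", list(both.index[disagree]))
            return merged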

    Perhaps I'm being too pedantic about cleaning my data!
     