Cleaning your data

Discussion in 'Automated Trading' started by runtrader, Aug 18, 2015.

  1. Correctly cleaning historical data comes down to using the right process. With the right process you will be successful, and it will even help you determine what success means in the first place. Here is the process:
    • Clean in short, specific stages
    • Have easily accessible data characterization metrics reported for each stage
    • Be able to easily change the order of each stage
    • Be able to use the data from each stage in your application (in this case, a backtest), and additionally compare multiple cleaned products built from the same initial raw data
    Hopefully you can see how this is helpful in determining when you are done. This process will also help you discover subtleties in your data and optimize your cleaning procedure.
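
    For illustration, a minimal pandas sketch of this kind of staged pipeline with a characterization report after every stage. The stage names, column names (open/high/low/close/volume), and metric choices are just examples, not a prescription:

    import pandas as pd

    def characterize(df: pd.DataFrame) -> dict:
        # Cheap summary metrics used to compare the data before/after each stage.
        return {
            "rows": len(df),
            "nan_cells": int(df.isna().sum().sum()),
            "zero_volume_days": int((df["volume"] == 0).sum()),
            "min_close": float(df["close"].min()),
            "max_close": float(df["close"].max()),
        }

    def drop_zero_prices(df: pd.DataFrame) -> pd.DataFrame:
        return df[(df[["open", "high", "low", "close"]] > 0).all(axis=1)]

    def drop_inverted_bars(df: pd.DataFrame) -> pd.DataFrame:
        return df[df["low"] <= df["high"]]

    stages = [drop_zero_prices, drop_inverted_bars]   # short, specific, easy to reorder

    df = pd.read_csv("raw_eod.csv")                   # hypothetical raw input file
    snapshots = {"raw": (df.copy(), characterize(df))}
    for stage in stages:
        df = stage(df)
        snapshots[stage.__name__] = (df.copy(), characterize(df))
        # every stage's output stays available for the backtest and for comparisons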

    Summary:

    Good situation-- a configuration input file for your cleaning software which specifies the cleaning steps, the raw data input file, and the output paths for the clean data file and characterization metrics

    Bad situation-- your backtesting software has some clever routines in it to filter data
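
    To make the good situation concrete, here is a small sketch of a config-driven cleaning run. The config file layout, field names, and stage names are my own assumptions for illustration only:

    # clean_config.json might look like:
    # {"raw_input": "data/AAPL_raw.csv",
    #  "stages": ["drop_zero_prices", "drop_inverted_bars"],
    #  "clean_output": "data/AAPL_clean.csv",
    #  "metrics_output": "data/AAPL_metrics.json"}
    import json
    import pandas as pd

    STAGES = {  # each stage is short and specific; the run order comes from the config
        "drop_zero_prices": lambda df: df[(df[["open", "high", "low", "close"]] > 0).all(axis=1)],
        "drop_inverted_bars": lambda df: df[df["low"] <= df["high"]],
    }

    def run(config_path: str) -> None:
        with open(config_path) as f:
            cfg = json.load(f)
        df = pd.read_csv(cfg["raw_input"])
        metrics = {"raw": len(df)}
        for name in cfg["stages"]:          # reordering stages = editing the config file
            df = STAGES[name](df)
            metrics[name] = len(df)         # row count after each stage, as a crude check
        df.to_csv(cfg["clean_output"], index=False)
        with open(cfg["metrics_output"], "w") as f:
            json.dump(metrics, f, indent=2)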
     
    #11     Aug 19, 2015
  2. aqtrader

    Here are more specifics about the EOD data checking and cleaning I do in my program, covering both easy and hard bad prints. I have actually seen all of the following bad prints in Yahoo and Google historical data, and in other sources. All of the following can be automated:
    1. bad zero volume - zero volume is normal for many ETFs and other symbols on days when they do not trade, but there are cases that do not follow this rule;
    2. bad zero price - there are obvious bad prints of zero prices (open, high, low, close), but a penny stock trading below the smallest displayable value also shows as zero, which we do not want in our database;
    3. bad low and high - obviously wrong if the low price is higher than the daily high;
    4. out-of-range open and close prices - an open or close outside the low-high range;
    5. missing daily EOD data for some days - these bars need to be added back (even with zero volume) to simplify data processing;
    6. out-of-band prints - a sudden huge jump/fall lasting one or two days;
    7. missing stock splits - these need to be added back with a smart approach that minimizes false positives;
    8. missing dividends - a large missing dividend shows up as a strange gap down in the daily chart.
    The above steps handle most bad prints. But when they all fail, you need to manually review the data, fix it, and tell the program what to do next time (the program keeps a historical false-positive database), so the cleaning program keeps getting better.
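
    As a rough illustration (not my exact program), checks 2, 3, 4, and 6 can be automated with a few lines of pandas; the column names and the 50% jump threshold below are just placeholders:

    import pandas as pd

    def flag_bad_prints(df: pd.DataFrame, jump_threshold: float = 0.5) -> pd.DataFrame:
        # Returns one boolean column per check; any True means the bar needs review/fixing.
        flags = pd.DataFrame(index=df.index)
        px = df[["open", "high", "low", "close"]]
        flags["zero_price"] = (px <= 0).any(axis=1)                      # check 2
        flags["low_above_high"] = df["low"] > df["high"]                 # check 3
        flags["open_close_out_of_range"] = (                             # check 4
            (df["open"] > df["high"]) | (df["open"] < df["low"]) |
            (df["close"] > df["high"]) | (df["close"] < df["low"])
        )
        flags["out_of_band"] = df["close"].pct_change().abs() > jump_threshold   # check 6
        return flags

    # usage: bars with any flag set go to auto-fix or manual review
    # flags = flag_bad_prints(df); suspects = df[flags.any(axis=1)]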
     
    #12     Aug 19, 2015
    runtrader likes this.
  3. runtrader

    PettaDollar & aqtrader - Thank you for your detailed replies, they contain invaluable practical information. I totally agree with each of the automated steps that should be performed. I'm going to incorporate these into my error detection and cleaning algorithms.

    aqtrader - Can you please elaborate on the historical false-positive database? Do you mean you keep a history of all prices that were flagged as bad but weren't actually bad, and then use these to determine ranges for detecting bad prices?
     
    #13     Aug 20, 2015
  4. aqtrader

    The reason to have a false-positive database (good data that was flagged as bad) is that it is almost impossible to detect all bad cases 100% correctly. So the scanning needs to be a little more aggressive for cases like suspicious out-of-band prints and suspicious splits or dividends, and that produces more false positives, which we do not want either. It helps a lot to have your program check the false-positive database before reporting a bad print for auto-fix. It does not help with false negatives directly, but it lets you keep the scanning aggressive, at the cost of more (manual) work to maintain the false-positive database.
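
    A minimal sketch of that idea: before auto-fixing a flagged print, look it up in the manually maintained false-positive store. The storage format here (a CSV keyed by symbol, date, and check name) is just an assumption:

    import pandas as pd

    # Hypothetical false-positive store: bars a human has reviewed and confirmed to be
    # good data even though the scanner flagged them.
    fp_db = pd.read_csv("false_positives.csv")            # columns: symbol, date, check
    fp_keys = set(zip(fp_db["symbol"], fp_db["date"], fp_db["check"]))

    def should_autofix(symbol: str, date: str, check: str) -> bool:
        # Only auto-fix a flagged bar if it is not a known false positive.
        return (symbol, date, check) not in fp_keys

    # This lets the out-of-band scan stay aggressive: known good-but-unusual bars are
    # skipped here instead of being "fixed" again on every run.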
     
    #14     Aug 20, 2015
    runtrader likes this.
  5. runtrader

    aqtrader - yes, it is a little more work to manually maintain the false-positive database, but I can see it quickly becoming beneficial for detecting future data issues. Thanks for the explanation.
     
    #15     Aug 24, 2015
  6. #16     Aug 24, 2015
  7. aqtrader

    Thanks for pointing out the book by Olsen, etc. I cannot find any more information about how he did data cleaning, only one slide deck by Olsen on data cleaning. It looks like he considered only very basic data cleaning, not much use to me at least:
    "Olsen FX Data
    • Data cleaning prior to computation of RV measures:
    – 5-minute return data is restricted to eliminate nontrading periods, weekends, holidays, and lapses of the Reuters data feed.
    – The slow weekend period from Friday 21:05 GMT until Sunday 21:00 GMT is eliminated from the sample.
    – Holidays removed: Christmas (December 24-26), New Year's (December 31 - January 2), July 4th, Good Friday, Easter Monday, Memorial Day, Labor Day, and Thanksgiving and the day after.
    – Days that contain long strings of zero or constant returns (caused by data feed problems) are eliminated."
    --- from http://faculty.washington.edu/ezivot/econ589/econ512realizedVariancePowerpoint.pdf
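
    For what it is worth, the weekend rule in those slides is easy to apply mechanically. A small pandas sketch, assuming a 5-minute return series with a DatetimeIndex in GMT (the holiday and zero-return rules would be handled separately):

    import pandas as pd

    def drop_fx_weekend(returns: pd.Series) -> pd.Series:
        # Remove the slow weekend period: Friday 21:05 GMT through Sunday 21:00 GMT.
        idx = returns.index
        minutes = idx.hour * 60 + idx.minute
        weekend = (
            ((idx.dayofweek == 4) & (minutes >= 21 * 60 + 5))   # Friday from 21:05
            | (idx.dayofweek == 5)                              # all of Saturday
            | ((idx.dayofweek == 6) & (minutes < 21 * 60))      # Sunday before 21:00
        )
        return returns[~weekend]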
     
    #17     Aug 24, 2015
  8. Did you read the book? I'm not following what you're saying. The book goes into data-cleaning in detail.
     
    #18     Aug 25, 2015