Cleaning your data

Discussion in 'Automated Trading' started by runtrader, Aug 18, 2015.

  1. Correctly cleaning historical data comes down to using the right process. With the right process you will be successful, and it will even help you determine what success means in the first place. Here is the process:
    • Clean in short, specific stages
    • Have easily accessible data characterization metrics reported for each stage
    • Be able to easily change the order of each stage
    • Be able to use the data from each stage in your application (in this case, a backtest), and additionally compare multiple cleaned products built from the same initial raw data
    Hopefully you can see how this is helpful in determining when you are done. This process will also help you discover subtleties in your data and optimize your cleaning procedure.
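
    For illustration, a minimal pandas sketch of this kind of staged pipeline with a characterization report after every stage. The stage names, column names (open/high/low/close/volume), and metric choices are just examples, not a prescription:

    import pandas as pd

    def characterize(df: pd.DataFrame) -> dict:
        # Cheap summary metrics used to compare the data before/after each stage.
        return {
            "rows": len(df),
            "nan_cells": int(df.isna().sum().sum()),
            "zero_volume_days": int((df["volume"] == 0).sum()),
            "min_close": float(df["close"].min()),
            "max_close": float(df["close"].max()),
        }

    def drop_zero_prices(df: pd.DataFrame) -> pd.DataFrame:
        return df[(df[["open", "high", "low", "close"]] > 0).all(axis=1)]

    def drop_inverted_bars(df: pd.DataFrame) -> pd.DataFrame:
        return df[df["low"] <= df["high"]]

    stages = [drop_zero_prices, drop_inverted_bars]   # short, specific, easy to reorder

    df = pd.read_csv("raw_eod.csv")                   # hypothetical raw input file
    snapshots = {"raw": (df.copy(), characterize(df))}
    for stage in stages:
        df = stage(df)
        snapshots[stage.__name__] = (df.copy(), characterize(df))
        # every stage's output stays available for the backtest and for comparisons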

    Summary:

    Good situation-- a configuration input file for your cleaning software which specifies the cleaning steps, the raw data input file, and the output paths for the clean data file and characterization metrics

    Bad situation-- your backtesting software has some clever routines in it to filter data
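
    To make the good situation concrete, here is a small sketch of a config-driven cleaning run. The config file layout, field names, and stage names are my own assumptions for illustration only:

    # clean_config.json might look like:
    # {"raw_input": "data/AAPL_raw.csv",
    #  "stages": ["drop_zero_prices", "drop_inverted_bars"],
    #  "clean_output": "data/AAPL_clean.csv",
    #  "metrics_output": "data/AAPL_metrics.json"}
    import json
    import pandas as pd

    STAGES = {  # each stage is short and specific; the run order comes from the config
        "drop_zero_prices": lambda df: df[(df[["open", "high", "low", "close"]] > 0).all(axis=1)],
        "drop_inverted_bars": lambda df: df[df["low"] <= df["high"]],
    }

    def run(config_path: str) -> None:
        with open(config_path) as f:
            cfg = json.load(f)
        df = pd.read_csv(cfg["raw_input"])
        metrics = {"raw": len(df)}
        for name in cfg["stages"]:          # reordering stages = editing the config file
            df = STAGES[name](df)
            metrics[name] = len(df)         # row count after each stage, as a crude check
        df.to_csv(cfg["clean_output"], index=False)
        with open(cfg["metrics_output"], "w") as f:
            json.dump(metrics, f, indent=2)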
     
    #11     Aug 19, 2015
  2. aqtrader

    Here are more specifics about the EOD data checking and cleaning I do in my program, covering both easy and hard bad prints. I have actually seen all of the following bad prints in Yahoo and Google historical data, and in other sources. All of the following can be automated:
    1. bad zero volume - zero volume is normal for many ETFs and other symbols on days when they do not trade, but there are cases that do not follow this rule;
    2. bad zero price - there are obvious bad prints of zero prices (open, high, low, close), but a penny stock trading below the smallest displayable value also shows as zero, which we do not want in our database;
    3. bad low and high - obviously wrong if the low price is higher than the daily high;
    4. out-of-range open and close prices - an open or close outside the low-high range;
    5. missing daily EOD data for some days - these bars need to be added back (even with zero volume) to simplify data processing;
    6. out-of-band prints - a sudden huge jump/fall lasting one or two days;
    7. missing stock splits - these need to be added back with a smart approach that minimizes false positives;
    8. missing dividends - a large missing dividend shows up as a strange gap down in the daily chart.
    The above steps handle most bad prints. But when they all fail, you need to manually review the data, fix it, and tell the program what to do next time (the program keeps a historical false-positive database), so the cleaning program keeps getting better.
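
    As a rough illustration (not my exact program), checks 2, 3, 4, and 6 can be automated with a few lines of pandas; the column names and the 50% jump threshold below are just placeholders:

    import pandas as pd

    def flag_bad_prints(df: pd.DataFrame, jump_threshold: float = 0.5) -> pd.DataFrame:
        # Returns one boolean column per check; any True means the bar needs review/fixing.
        flags = pd.DataFrame(index=df.index)
        px = df[["open", "high", "low", "close"]]
        flags["zero_price"] = (px <= 0).any(axis=1)                      # check 2
        flags["low_above_high"] = df["low"] > df["high"]                 # check 3
        flags["open_close_out_of_range"] = (                             # check 4
            (df["open"] > df["high"]) | (df["open"] < df["low"]) |
            (df["close"] > df["high"]) | (df["close"] < df["low"])
        )
        flags["out_of_band"] = df["close"].pct_change().abs() > jump_threshold   # check 6
        return flags

    # usage: bars with any flag set go to auto-fix or manual review
    # flags = flag_bad_prints(df); suspects = df[flags.any(axis=1)]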
     
    #12     Aug 19, 2015
    runtrader likes this.
  3. runtrader

    PettaDollar & aqtrader - Thank you for your detailed replies, they contain invaluable practical information. I totally agree with each of the automated steps that should be performed. I'm going to incorporate these into my error detection and cleaning algorithms.

    aqtrader - Can you please elaborate on the historical false-positive database? Do you mean you keep a history of all prices that were flagged as bad but weren't actually bad, and then use these to determine ranges for detecting bad prices?
     
    #13     Aug 20, 2015
  4. aqtrader

    The reason to have a false-positive database (good data that was flagged as bad) is that it is almost impossible to detect all bad cases 100% correctly. So the scanning needs to be a little more aggressive for cases like suspicious out-of-band prints and suspicious splits or dividends, and that produces more false positives, which we do not want either. It helps a lot to have your program check the false-positive database before reporting a bad print for auto-fix. It does not help with false negatives directly, but it lets you keep the scanning aggressive, at the cost of more (manual) work to maintain the false-positive database.
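
    A minimal sketch of that idea: before auto-fixing a flagged print, look it up in the manually maintained false-positive store. The storage format here (a CSV keyed by symbol, date, and check name) is just an assumption:

    import pandas as pd

    # Hypothetical false-positive store: bars a human has reviewed and confirmed to be
    # good data even though the scanner flagged them.
    fp_db = pd.read_csv("false_positives.csv")            # columns: symbol, date, check
    fp_keys = set(zip(fp_db["symbol"], fp_db["date"], fp_db["check"]))

    def should_autofix(symbol: str, date: str, check: str) -> bool:
        # Only auto-fix a flagged bar if it is not a known false positive.
        return (symbol, date, check) not in fp_keys

    # This lets the out-of-band scan stay aggressive: known good-but-unusual bars are
    # skipped here instead of being "fixed" again on every run.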
     
    #14     Aug 20, 2015
    runtrader likes this.
  5. runtrader

    aqtrader - yes, it is a little more work to manually maintain the false-positive database, but I can see it quickly becoming beneficial for detecting future data issues. Thanks for the explanation.
     
    #15     Aug 24, 2015
  6. #16     Aug 24, 2015
  7. aqtrader

    Thanks for pointing out the book by Olsen, etc. I cannot find any more information about how he did data cleaning, only one slide deck by Olsen on data cleaning. It looks like he considered only very basic data cleaning, not much use to me at least:
    "Olsen FX Data
    • Data cleaning prior to computation of RV measures:
    – 5-minute return data is restricted to eliminate nontrading periods, weekends, holidays, and lapses of the Reuters data feed.
    – The slow weekend period from Friday 21:05 GMT until Sunday 21:00 GMT is eliminated from the sample.
    – Holidays removed: Christmas (December 24-26), New Year's (December 31 - January 2), July 4th, Good Friday, Easter Monday, Memorial Day, Labor Day, and Thanksgiving and the day after.
    – Days that contain long strings of zero or constant returns (caused by data feed problems) are eliminated."
    --- from http://faculty.washington.edu/ezivot/econ589/econ512realizedVariancePowerpoint.pdf
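
    For what it is worth, the weekend rule in those slides is easy to apply mechanically. A small pandas sketch, assuming a 5-minute return series with a DatetimeIndex in GMT (the holiday and zero-return rules would be handled separately):

    import pandas as pd

    def drop_fx_weekend(returns: pd.Series) -> pd.Series:
        # Remove the slow weekend period: Friday 21:05 GMT through Sunday 21:00 GMT.
        idx = returns.index
        minutes = idx.hour * 60 + idx.minute
        weekend = (
            ((idx.dayofweek == 4) & (minutes >= 21 * 60 + 5))   # Friday from 21:05
            | (idx.dayofweek == 5)                              # all of Saturday
            | ((idx.dayofweek == 6) & (minutes < 21 * 60))      # Sunday before 21:00
        )
        return returns[~weekend]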
     
    #17     Aug 24, 2015
  8. Did you read the book? I'm not following what you're saying. The book goes into data-cleaning in detail.
     
    #18     Aug 25, 2015