Evaluating Quality of Test Data

Discussion in 'Data Sets and Feeds' started by fundjunkie, Oct 13, 2006.

  1. Hi All,
    I'm in the early stages of creating a trading system. What I'm looking into right now is how I can validate my data. I've heard about data checking utilities that can run logic checks on large quantities of data. Does anyone here know of one they can recommend to me?

    Also, is there an established methodology for data validation? If anyone can make suggestions I'll greatly appreciate it.

  2. You may be surprised to learn that it is a human process, not an automated computerised process. Computer software directs your attention to "anomalies" and then you, the human decision-maker, decide whether that anomaly is "real" or "erroneous". If erroneous, you decide how to fix it.

    For example: in some tradeable instruments (pit-traded commodity futures), it really does happen that the open or the close is below the Low of the day. Your software may flag this as an "anomaly". What do you do? It's real, it actually happened, but it may foul up your subsequent analyses terribly. Do you:
    • Leave it alone
    • Adjust the Low downwards, so that Low=Close
    • Adjust the Close upwards, so that Close=Low
    • Throw out that day entirely, pretend it was a holiday
    The whole area is a balancing act: you don't want ANY undetected real errors, but you also don't want "too many" false positives, i.e., detected non-errors, because you have to investigate each one manually and they slow you down.
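
As an illustration of the kind of logic check described above, here is a minimal Python sketch. The `Bar` record, the `flag_anomalies` function, and the sample prices are all hypothetical, made up for this example. The code only flags bars for review; deciding whether a flagged bar is real or erroneous stays with the human.

```python
from dataclasses import dataclass


@dataclass
class Bar:
    """One daily bar (hypothetical record layout for this example)."""
    date: str
    open: float
    high: float
    low: float
    close: float


def flag_anomalies(bars):
    """Return (date, reason) pairs for bars whose prices look inconsistent.

    Nothing is corrected or dropped here; the bars are only flagged.
    """
    flagged = []
    for bar in bars:
        if bar.high < bar.low:
            flagged.append((bar.date, "high below low"))
        for name in ("open", "close"):
            price = getattr(bar, name)
            if price < bar.low:
                flagged.append((bar.date, f"{name} below low"))
            elif price > bar.high:
                flagged.append((bar.date, f"{name} above high"))
    return flagged


# Made-up sample data: the second bar closes below its own low.
bars = [
    Bar("2006-10-11", 10.0, 10.5, 9.8, 10.2),
    Bar("2006-10-12", 10.2, 10.6, 10.1, 10.0),
]
flags = flag_anomalies(bars)
for date, reason in flags:
    print(date, reason)
```

Keeping detection separate from correction means any of the four choices listed above can still be applied by hand to each flagged bar.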
  3. I compare data files with Microsoft Excel. I take data covering the exact same time range from each source, open the files in Excel, and check that they have the same number of rows. If the row counts differ, then one of the files is missing data.
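
The same row-count comparison can be done in a script instead of a spreadsheet. The sketch below assumes each source is CSV text with a `timestamp` column; that column name, the `compare_sources` function, and the sample rows are assumptions made for illustration. It also lists which timestamps are missing from each side, because two files can have the same number of rows and still not contain the same rows.

```python
import csv
import io


def timestamps(csv_text, column="timestamp"):
    """Read one column of timestamps from CSV text."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [row[column] for row in reader]


def compare_sources(text_a, text_b):
    """Compare row counts and report timestamps present in only one source."""
    a, b = timestamps(text_a), timestamps(text_b)
    return {
        "rows_a": len(a),
        "rows_b": len(b),
        "missing_from_a": sorted(set(b) - set(a)),
        "missing_from_b": sorted(set(a) - set(b)),
    }


# Made-up sample data: source B lacks the 09:31 bar.
source_a = "timestamp,close\n09:30,10.0\n09:31,10.1\n09:32,10.2\n"
source_b = "timestamp,close\n09:30,10.0\n09:32,10.2\n"
report = compare_sources(source_a, source_b)
print(report)
```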
  4. Quite right. However, I have been led to believe there is more to it than that. I am patiently gathering tick data right now and have realised that although the OHLC figures for each minute bar of the day may be correct, the tick volume at bid/ask/trade could be completely wrong.

    So, that got me thinking. With high-frequency data (tick data), the task of validating data seems, to me, to be somewhat more complex. Now and again I've heard of people running some kind of statistical analysis of a sample of data against a set of control data, and other similar methods. This is done with software, and the results are then analyzed.

    This leads me to wonder: from the trader's perspective, and in order to properly develop a meaningful trading system, what is the correct methodology for validating your data...
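
One possible form of the comparison against control data mentioned above, offered as a sketch and not as an established methodology: compare per-bar volume from the feed being checked with a second feed treated as the control, and flag bars that differ by more than a tolerance. This is a plain relative-difference check, not a formal statistical test. The `volume_discrepancies` function, the 5% tolerance, and the sample volumes are all assumptions made for illustration.

```python
def volume_discrepancies(sample, control, tolerance=0.05):
    """Return {timestamp: relative difference} for bars present in both feeds
    whose volume differs from the control by more than `tolerance`.

    `sample` and `control` map a timestamp to the volume for that bar.
    """
    flagged = {}
    for ts in sorted(sample.keys() & control.keys()):
        ref = control[ts]
        if ref == 0:
            # Avoid dividing by zero: any non-zero sample volume is flagged.
            rel = 0.0 if sample[ts] == 0 else float("inf")
        else:
            rel = abs(sample[ts] - ref) / ref
        if rel > tolerance:
            flagged[ts] = rel
    return flagged


# Made-up volumes: the 09:32 bar is half of what the control feed reports.
sample = {"09:30": 1200, "09:31": 950, "09:32": 400}
control = {"09:30": 1210, "09:31": 940, "09:32": 800}
flags = volume_discrepancies(sample, control)
print(flags)
```

As with the price checks, a flagged bar is only a candidate for manual review; the control feed can be the one that is wrong.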