Evaluating Quality of test Data

fundjunkie · Oct 13, 2006

Hi All,
I'm in the early stages of creating a trading system. What I'm looking into right now is how I can validate my data. I've heard about data checking utilities that can run logic checks on large quantities of data. Does anyone here know of one they can recommend to me?

Also, is there an established methodology for data validation? If anyone can make suggestions I'll greatly appreciate it.

Thx
D

inefficient · Oct 14, 2006

You may be surprised to learn that it is a human process, not an automated computerised process. Computer software directs your attention to "anomalies" and then you, the human decision-maker, decide whether that anomaly is "real" or "erroneous". If erroneous, you decide how to fix it.

For example: in some tradeable instruments (pit-traded commodity futures, for instance), it really does happen in real life that the open or the close is below the Low of the day. Your software may flag this as an "anomaly". What do you do? It's real, it actually happened, but it may foul up your subsequent analyses terribly. Do you:

Leave it alone

Adjust the Low downwards, so that Low=Close

Adjust the Close upwards, so that Close=Low

Throw out that day entirely, pretend it was a holiday

The whole area is a balancing act: you don't want ANY undetected real errors, but you also don't want "too many" false positives, i.e., detected non-errors, because you have to investigate each one manually and they slow you down.
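To make the "software flags, human decides" split concrete, here is a minimal sketch in Python of the flagging half. The `Bar` record and the `flag_anomalies` helper are hypothetical names made up for illustration, and the sample prices are invented. The code only reports bars whose open or close falls outside the High/Low range; it deliberately changes nothing, because the fix is the human's call.

```python
from dataclasses import dataclass


@dataclass
class Bar:
    """One daily bar (hypothetical layout, for illustration only)."""
    date: str
    open: float
    high: float
    low: float
    close: float


def flag_anomalies(bars):
    """Return (date, reason) pairs for bars that need a human look."""
    flagged = []
    for bar in bars:
        if bar.high < bar.low:
            flagged.append((bar.date, "high below low"))
            continue  # the range itself is suspect, skip open/close checks
        for name in ("open", "close"):
            price = getattr(bar, name)
            if price < bar.low:
                flagged.append((bar.date, f"{name} below low"))
            elif price > bar.high:
                flagged.append((bar.date, f"{name} above high"))
    return flagged


# Invented sample data: the second bar closes below its Low.
bars = [
    Bar("2006-10-11", 61.0, 62.5, 60.5, 62.0),
    Bar("2006-10-12", 62.0, 63.0, 61.5, 61.25),
]

for date, reason in flag_anomalies(bars):
    print(date, reason)
```

With the sample data above this prints `2006-10-12 close below low`. Whether that bar is left alone, adjusted, or dropped is still one of the four choices listed earlier.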

Hook N. Sinker · Oct 14, 2006

I compare data files with Microsoft Excel. I take files covering the exact same time range, open them in Excel, and check that they have the same number of rows. If the row counts differ, then one of the files is missing data.
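The same row-count check can be scripted instead of done by eye. The sketch below is only an illustration and assumes things the post does not state: that both files are CSV, that they have a header row, and that the first column is a timestamp. The `read_timestamps` and `compare_sources` helpers are hypothetical names, and the sample data is invented. It goes one small step beyond counting rows by listing which timestamps are missing from each file.

```python
import csv
import io


def read_timestamps(csv_text):
    """Return the first-column values of each data row, skipping the header."""
    reader = csv.reader(io.StringIO(csv_text))
    next(reader, None)  # skip the header row (assumed to exist)
    return [row[0] for row in reader if row]


def compare_sources(text_a, text_b):
    """Compare row counts and report timestamps present in only one file."""
    a = read_timestamps(text_a)
    b = read_timestamps(text_b)
    return {
        "rows_a": len(a),
        "rows_b": len(b),
        "missing_from_a": sorted(set(b) - set(a)),
        "missing_from_b": sorted(set(a) - set(b)),
    }


# Invented sample data: file B lacks the 09:31 bar.
source_a = (
    "timestamp,close\n"
    "2006-10-11 09:30,61.0\n"
    "2006-10-11 09:31,61.1\n"
    "2006-10-11 09:32,61.2\n"
)
source_b = (
    "timestamp,close\n"
    "2006-10-11 09:30,61.0\n"
    "2006-10-11 09:32,61.2\n"
)

report = compare_sources(source_a, source_b)
print(report)
```

For real files, the text would come from `open(path).read()` rather than from string literals. Note that equal row counts do not prove the files match; two files can each be missing a different bar, which is why the sketch also compares the timestamps themselves.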

fundjunkie · Oct 16, 2006

Quote from inefficient:

You may be surprised to learn that it is a human process, not an automated computerised process. Computer software directs your attention to "anomalies" and then you, the human decision-maker, decide whether that anomaly is "real" or "erroneous". If erroneous, you decide how to fix it.

Quite right. However, I have been led to believe there is more to it than that. I am patiently gathering tick data right now and have realised that even though the OHLC figures for each minute bar of the day may be correct, the tick volume at bid/ask/trade could be completely wrong.

So, that got me thinking. With high frequency data (tick data) the task of validating data seems, to me, to be somewhat more complex. Now and again I've heard of people running some kind of statistical analysis of a sample of data against a set of control data, and other similar methods. This is done with software, and the results are then analyzed.
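I don't know exactly what those analyses look like, but the simplest version of "sample against control" that I can picture is sketched below in Python. It assumes a second, trusted source of per-minute volumes exists, which may not be the case. The `volume_mismatches` helper, the 5% tolerance, and the sample numbers are all made up for illustration, not taken from any established methodology.

```python
def volume_mismatches(sample, control, tolerance=0.05):
    """Compare per-bar volumes keyed by timestamp.

    Returns (timestamp, sample_volume, control_volume) for every bar
    present in both sources whose volumes differ by more than
    `tolerance`, measured as a fraction of the control volume.
    """
    mismatches = []
    for ts in sorted(sample.keys() & control.keys()):
        s, c = sample[ts], control[ts]
        if c == 0:
            differs = s != 0  # avoid dividing by zero
        else:
            differs = abs(s - c) / c > tolerance
        if differs:
            mismatches.append((ts, s, c))
    return mismatches


# Invented numbers: the 09:31 bar has half the control volume.
sample = {"09:30": 1200, "09:31": 450, "09:32": 980}
control = {"09:30": 1210, "09:31": 900, "09:32": 980}

print(volume_mismatches(sample, control))
```

With these numbers only the 09:31 bar is reported. Bars that appear in just one of the two sources are ignored here; they would need a separate missing-data check.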

This leads me to wonder: from the trader's perspective, what is the correct methodology for validating your data in order to properly develop a meaningful trading system?

Regards,
D
