Controlled data-mining?

Discussion in 'Data Sets and Feeds' started by mizhael, Mar 13, 2011.

  1. There was an academic paper -

    - Market does better on sunny days ...

    - Earn 16% per year buying when it's sunny in NY and selling when it's cloudy ...

    This is obviously data-mining; is there a way to put this sort of data-mining into a controlled framework so the downside is limited while we let the upside run?
  2. IMO, before investigating any such claim further (e.g. “market does better on sunny days”), first try to establish a sound, rational interpretation, consistent with your own beliefs about how markets work, that would allow the claim to be true.

    If you can’t, look no further. Why? If you can’t understand the underlying market structure mechanics that would make the system work, you are at a disadvantage:

    a) because you won’t be able to judge effectively whether the first factor actually affects the second (e.g. do “sunny days” cause “better markets”?), and

    b) even if you are lucky enough to find factors that are usefully related, you won’t be able to judge when market conditions change so that the prior relationship breaks down.

    Alternatively (now with a Devil’s Advocate hat on... and I expect to be flamed mercilessly for this), if the opportunity just seems far too juicy to miss (i.e. you have something that looks wonderful, but you can’t for the life of you figure out what makes it work), I would proceed as follows:

    a) Establish, through statistically significant back-testing, the performance characteristics of the strategy (these will be your “go/no go” metrics).

    b) Then proceed (first through forward testing, then live trading) to trade the system; at all stages compare actual performance with the performance from your statistically significant back-tests.

    c) You continue in this way through each stage until you get actual performance results worse than - and outside reasonable expectations compared to - those obtained in the back-tests. This is the difficult bit; to continue until the numbers tell you to stop, and then to have the discipline to stop.
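
    Steps a) to c) could be sketched roughly as below. This is only a minimal illustration, not a tested trading rule: the function name `should_stop`, the use of a simple z-score on mean per-trade return, and the -2.0 cut-off are all assumptions made for the example. Pick whatever "go/no go" metrics and limits your own back-tests justify.

```python
import math
import statistics


def should_stop(backtest_returns, live_returns, z_limit=-2.0):
    """Compare live per-trade returns with the back-test (hypothetical helper).

    Returns (stop, z). stop is True when the live mean is worse than the
    back-test mean by more than the assumed z_limit.
    """
    mu = statistics.mean(backtest_returns)       # expected return per trade
    sigma = statistics.stdev(backtest_returns)   # per-trade volatility
    n = len(live_returns)
    live_mean = statistics.mean(live_returns)
    # Standard error of the mean over n live trades, assuming the
    # back-test distribution still holds.
    z = (live_mean - mu) / (sigma / math.sqrt(n))
    return z < z_limit, z


backtest = [0.01, -0.005, 0.02, 0.0, 0.015, -0.01, 0.01, 0.005]
live_ok = list(backtest)      # live results in line with the back-test
live_bad = [-0.02] * 8        # live results far worse than the back-test

print(should_stop(backtest, live_ok))
print(should_stop(backtest, live_bad))
```

    The one-sided test reflects the point in c): you only stop when results are worse than, and outside reasonable expectations of, the back-test.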
  3. that's called spurious correlation :)

    or more accurately in this case, nonsense correlation :))
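
    To see how easily such correlations turn up, here is a small simulation sketch. Everything in it is made up for illustration: the returns are random noise, the "weather" signals are coin flips, and the |t| > 2 cut-off and the names `welch_t` and `count_false_discoveries` are assumptions of the example. None of the signals has any real effect, yet a few percent of them will typically still look "significant".

```python
import math
import random


def mean_var(xs):
    """Sample mean and unbiased sample variance."""
    n = len(xs)
    m = sum(xs) / n
    v = sum((x - m) ** 2 for x in xs) / (n - 1)
    return m, v


def welch_t(a, b):
    """Welch t-statistic for the difference in means of two samples."""
    ma, va = mean_var(a)
    mb, vb = mean_var(b)
    return (ma - mb) / math.sqrt(va / len(a) + vb / len(b))


def count_false_discoveries(n_signals=1000, n_days=500, t_limit=2.0, seed=0):
    """Count purely random signals that look 'significant' by chance."""
    rng = random.Random(seed)
    # Daily returns that are pure noise: no signal can truly predict them.
    returns = [rng.gauss(0.0, 0.01) for _ in range(n_days)]
    hits = 0
    for _ in range(n_signals):
        # A coin-flip "sunny/cloudy" indicator, unrelated to the returns.
        sunny = [rng.random() < 0.5 for _ in range(n_days)]
        on = [r for r, s in zip(returns, sunny) if s]
        off = [r for r, s in zip(returns, sunny) if not s]
        if abs(welch_t(on, off)) > t_limit:
            hits += 1
    return hits


print(count_false_discoveries(), "of 1000 random signals passed |t| > 2")
```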


  4. What do you mean "this is obviously data-mining"?

    It can be astrology too. Data mining is a powerful method for discovering hidden order in chaos. If someone is stupid enough to consider spurious correlations, then "this is obviously data-mining" is not the right phrase. "This is obviously stupidity" is the right phrase.