Is Walk-Forward (out of sample) testing simply an illusion?

Discussion in 'Strategy Development' started by pursuit, Oct 17, 2017.

  1. tommcginnis


    Suppose there are three periods A,B,C
    SegA runs 1.0→2.0
    SegB runs 2.0→1.0, and
    SegC runs 1.0→1.0

    Models performing best in
    SegA would show positive (Cartesian) linear behavior,
    SegB would show negative linear behavior, while
    SegC would exhibit a graceful curvilinear arc and do best with a second-order parameter of negative sign, to degrade (and eventually overwhelm) a positive first-order parameter.

    1) The border matters.
    2) W.R.T. subset performance, SegA+SegB =/= SegC.
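
    A rough sketch of the point, under assumptions that are mine and not stated above: each of SegA and SegB spans one unit on the x-axis, both are perfectly straight, and SegC is simply SegA followed by SegB. The `polyfit` helper is a hypothetical hand-rolled least-squares fit, written out only so the snippet needs nothing beyond plain Python.

```python
def polyfit(xs, ys, degree):
    """Least-squares polynomial fit via the normal equations.

    Returns coefficients ordered from the constant term upward.
    """
    n = degree + 1
    # Augmented normal-equation matrix [X^T X | X^T y].
    rows = [
        [sum(x ** (i + j) for x in xs) for j in range(n)]
        + [sum(y * x ** i for x, y in zip(xs, ys))]
        for i in range(n)
    ]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(rows[r][col]))
        rows[col], rows[pivot] = rows[pivot], rows[col]
        for r in range(n):
            if r != col:
                factor = rows[r][col] / rows[col][col]
                rows[r] = [a - factor * b for a, b in zip(rows[r], rows[col])]
    return [rows[i][n] / rows[i][i] for i in range(n)]


# Assumed shapes: SegA rises 1.0 -> 2.0 on x in [0, 1], SegB falls 2.0 -> 1.0 on x in [1, 2].
xs_a = [i / 50 for i in range(51)]
ys_a = [1.0 + x for x in xs_a]
xs_b = [1.0 + i / 50 for i in range(51)]
ys_b = [3.0 - x for x in xs_b]
# SegC is taken to be SegA followed by SegB (shared border point counted once).
xs_c = xs_a + xs_b[1:]
ys_c = ys_a + ys_b[1:]

slope_a = polyfit(xs_a, ys_a, 1)[1]   # best linear model on SegA
slope_b = polyfit(xs_b, ys_b, 1)[1]   # best linear model on SegB
slope_c = polyfit(xs_c, ys_c, 1)[1]   # best linear model on SegC
quad_c = polyfit(xs_c, ys_c, 2)       # [constant, first-order, second-order] on SegC

print(f"SegA slope {slope_a:+.3f}, SegB slope {slope_b:+.3f}, SegC slope {slope_c:+.3f}")
print(f"SegC quadratic: first-order {quad_c[1]:+.3f}, second-order {quad_c[2]:+.3f}")
```

    With these assumed shapes the linear fit on SegC comes out flat, while the quadratic fit has a positive first-order and a negative second-order coefficient. Move the border between the segments and the fitted parameters change with it.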
    #21     Oct 22, 2017
  2. pursuit


    The end result is the same though! By keeping only models that look pretty on a segment 2 test we have in fact manually optimized the system to segment 2. We might as well optimize on the whole segment 3. The end selection of systems will be the same.
    #22     Oct 22, 2017

  3. It's very easy to show that the two classrooms will not necessarily end up with the same selection of models. Just imagine that model_n did very poorly on the first segment and fantastically well on the second; so well, in fact, that on the combined set model_n beat all of the other models. If the other classroom only looks at the entire concatenated set of data, they will choose model_n. Class one, however, already threw that model out in the first-segment selection step, when it filtered the models that pass to segment 2.
    Hopefully you can see that the sets of models at the final step are not equivalent.
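
    Here is the scenario as a toy sketch. The growth factors are made-up numbers, the "keep anything that made money on segment 1" rule is just one assumed way of doing the first-stage filter, and terminal wealth is used as the single ranking metric.

```python
# Hypothetical growth factors (terminal wealth per unit invested) on each segment.
# These numbers are invented purely to illustrate the argument.
growth = {
    "model_a": (1.10, 1.05),
    "model_b": (1.08, 1.12),
    "model_n": (0.70, 2.00),  # terrible on segment 1, spectacular on segment 2
}


def pick_two_stage(growth, min_seg1=1.0):
    """Class one: drop models that fail on segment 1, then take the best on segment 2."""
    survivors = {name: g for name, g in growth.items() if g[0] > min_seg1}
    return max(survivors, key=lambda name: survivors[name][1])


def pick_combined(growth):
    """Class two: rank on terminal wealth over the concatenated segment only."""
    return max(growth, key=lambda name: growth[name][0] * growth[name][1])


two_stage_choice = pick_two_stage(growth)
combined_choice = pick_combined(growth)
print(two_stage_choice, combined_choice)  # prints "model_b model_n"
```

    Class one never gets to see model_n on segment 2, so the two procedures end up with different picks.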

    But that point is sort of trivial. What's more important is that both methods of picking the best of the distribution suffer from the same selection bias. With two segments you can devise a different way of choosing models, one that gives you a better likelihood of performing well on a new, unseen segment; you cannot do that with only one combined segment.

    If you are really interested in understanding it from a more modern statistical perspective, you can look into the bias/variance tradeoff. You will find very little literature applying it to your application, though; that's up to you to figure out.
    #23     Oct 22, 2017
  4. userque


    You guys are failing to consider the x-axis:

    You guys are trying to apply the model as though the domains are equal for all three segments.

    I agree with @pursuit

    Hopefully, I don't have to expound. :)
    #24     Oct 22, 2017
  5. pursuit


    No, model_n will not be selected by the class testing on segment 3. They will see that it did shitty on the segment 1 part of segment 3 and will discard it. They're looking for a pretty graph throughout segment 3. That is impossible if the segment 1 part performed horribly.
    #25     Oct 22, 2017
  6. That really depends on how you define 'shitty' or 'good'. Notice you never did define it. 'Looking' at a graph characteristic in segments is not the same as using a single quantitative metric to determine the outcome(s). In my case I simply used terminal wealth as a proxy (which isn't that uncommon). Step it up to a trillion models, and you really won't have the foresight to know if a model is good or bad by your reasoning. Using terminal wealth, for example, my scenario shows your hypothesis is not conclusive. If you are going to make a blanket statement, it has to cover all cases; a single counterexample disproves it.

    Anyways, not here to argue. If you don't get anything from it, no need to post more.
    #26     Oct 22, 2017
  7. Yeah, you're probably right.

    The main question then would be how long of a segment to use for backtesting.
    #27     Oct 22, 2017
  8. pursuit


    We all know what shitty and pretty equity curves look like. There is no issue with quantifying it. We can use Sharpe ratio, Sortino, R-squared or any other accepted measure of smoothness. Does not change the validity of my point one bit.
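
    For reference, a minimal sketch of two such measures next to terminal wealth. The return series are invented, the Sharpe ratio here is per-period with a zero risk-free rate (not annualized), and R-squared is taken as the squared correlation between the equity curve and a straight line in time; all of those are assumptions made for the sketch.

```python
from statistics import mean, stdev


def equity_curve(returns, start=1.0):
    """Compound per-period returns into an equity curve that begins at `start`."""
    curve = [start]
    for r in returns:
        curve.append(curve[-1] * (1.0 + r))
    return curve


def sharpe(returns):
    """Per-period Sharpe ratio, zero risk-free rate, not annualized."""
    return mean(returns) / stdev(returns)


def r_squared(curve):
    """Squared correlation between the equity curve and a straight line in time."""
    t = list(range(len(curve)))
    t_bar, y_bar = mean(t), mean(curve)
    cov = sum((ti - t_bar) * (yi - y_bar) for ti, yi in zip(t, curve))
    var_t = sum((ti - t_bar) ** 2 for ti in t)
    var_y = sum((yi - y_bar) ** 2 for yi in curve)
    return cov * cov / (var_t * var_y)


# Invented return series: one steady, one that loses early and wins big late.
steady = [0.010, 0.012, 0.009, 0.011, 0.010, 0.010]
lumpy = [-0.05, -0.06, -0.04, 0.12, 0.15, 0.10]

for name, rets in (("steady", steady), ("lumpy", lumpy)):
    curve = equity_curve(rets)
    print(f"{name}: wealth {curve[-1]:.3f}, Sharpe {sharpe(rets):.2f}, R^2 {r_squared(curve):.3f}")
```

    On these made-up numbers the lumpy series ends with more wealth, while the steady one scores higher on both Sharpe and R-squared, so which metric is used decides which model looks "pretty".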
    #28     Oct 23, 2017
  9. pursuit


    If using just segment 3 is as good as the combo of 1 and 2, then it would be reasonable to use a segment 3 of maximum length so our model can experience a wide variety of market conditions in the test.
    #29     Oct 23, 2017
  10. userque


    Your original query is whether using segment 3 was the same as using segment 3 ... after splitting it into two segments (1 and 2), when comparing *pre-built* models. It obviously (to some) is.

    This is not the same as saying, in effect, if not in these exact words, "Hey! I should *build* my models on all possible data, and not leave some data out for testing/validating!"
    #30     Oct 23, 2017