Is Walk-Forward (out of sample) testing simply an illusion?

  1. If we test a bunch of strategies on data segment 1 and then on data segment 2, and then keep the ones that do well on both...

    Isn't that the same as testing them on segment 3 which is a combination of 1 and 2 and keeping "the good ones"?

    We'll arrive at the same choice of strats in both cases, no?
     
  2. Not sure about 'illusion,' but yes, it's overrated.
     
  3. Nope.
    You have to remember the border:
    If you imagine the market (or whatever) as a perfect wave, and your border of segment1→segment2 was the very top of the wave, your segment1 strats would be biased "long". {etc etc etc.}

    I am totally forgetting the test condition for a fair conclusion of 'same data' comparisons, but it's readily obtainable, and should be a snap to input. {Sorry. Market's open....Gotta go.}

    But the idea is, that if it's possible for the border to play a role in success/failure, then *where* the border is {I'm sliding it back and forth in my mind} matters. And so, Seg1 + Seg2 =/= Seg 3.
     
  4. Doesn't make sense. Doesn't matter where the border is. We're looking for "nice equity curve" before and after the border regardless of the border's location.
     
  5. If you mean data 1 as in sample and data 2 as out of sample, you raise a good question. It goes to the heart of what knowledge is.
    Backtesting is rooted in the scientific method, which uses statistics to test a hypothesis.
    The investigation can only tell you that the hypothesis is true (in this case that the strategy works) with
    a certain degree of confidence so there is always the risk of a false positive.
    Formulating the hypothesis before looking at the (out of sample) data is but one of the many things to be aware of (this is strictly so the numbers make any kind of sense; it is very easy to not be aware that you looked at out of sample data if your analysis was inspired by third parties' insights that were in turn aware of said data).
    The size of the out of sample data is also important. Say you get an out of sample p-value of 0.1 that the information
    coefficient of your strategy is above 1 or 2 or whatever your psych profile is comfortable running: there is still
    a 1 in 10 chance you are being fooled. So if you try 10 different things, even though you did not look at the out of sample data
    when cogitating, it is more likely than not that you are going to be fooled at least once.
    However, none of this deals with what I consider to be most important, namely the law of ever changing hypotheses.
    By the time you find the key, the lock has probably changed. But this dynamic is a discussion for another thread.
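
The "1 in 10" arithmetic in this post can be made concrete. The sketch below assumes the 10 attempts are independent and that each has the same 0.1 chance of passing by luck alone; neither assumption is stated in the post, and the function name is made up for illustration.

```python
def prob_at_least_one_false_positive(p_value: float, n_trials: int) -> float:
    """Chance that at least one of n independent tests passes by luck alone."""
    # Probability that every test correctly fails is (1 - p)^n; take the complement.
    return 1.0 - (1.0 - p_value) ** n_trials


for n in (1, 10, 50):
    print(n, round(prob_at_least_one_false_positive(0.1, n), 3))
```

Under those assumptions, 10 tries at p = 0.1 give roughly a 65% chance of at least one false positive, and the figure keeps climbing as more ideas are tried.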
     
  6. As a pure math question, testing on 1 and then on 2 is more comprehensive than a test on 3.
     
  7. 3 is all in-sample.
    2 is out-of-sample only the first time around.
    Both are no guarantee your search won't yield false positives. The more you search, the more false positives.
     
  8. I think tom, the analyst is right that the best fits of 1 and 2 may not be a best fit of 1 + 2. It is not difficult to construct a case where both 1 and 2 are trending, but if they are separated by a choppy period, the curve-fit formula may not be the best fit of 1 + 2. This is especially true if the fit is over-specified.

    I am not smart enough to give a mathematical proof. Maybe someone else can.
     
  9. this
     
  10. No way.
     
  11. Thank you for your deeply illuminating contribution to the discussion.
     
  12. Sorry, I'm not here to educate. There are many resources online that explain why your initial premise is wrong and how testing on Segment 3 is more prone to overfitting.

    The only time you would test over segment 3 is if you were planning to then isolate each segment (1 & 2) and run the likes of a linear regression through each to test for correlation. This could actually be a more effective technique than testing on segment 1 (in sample) and then segment 2 (out of sample).
     
  13. Is Walk-Forward (out of sample) testing simply an illusion?

    If price movements evolve as a random walk, then yes, it is an illusion. It seems the best models of market price movement suggest random walk is the best fit. However all models are wrong to one level of precision or another. How wrong is always up for debate, therefore TA (and out of sample testing) may have some benefit after all.
     
  14. Just because the random walk model is a good fit is no proof price changes are random (whatever that means even).
    It just means price can vary a lot and is not constrained too much!
     
  15. So if you're not here to discuss and explain the reasoning for your opinions, I'm just curious what are you here for?
     
  16. It's been proven that most financial time series are not a random walk. Anyway if we believed that I don't think any of us would be on this forum.
     
  17. Highly overrated.
     
  18. Walk forward and segmenting folds of data is immensely useful if you try to understand the modelling process from a complexity/generalization perspective. Just imagine that you have a very finely optimized model on data segment 1 that performs spectacularly, then (the same fitted model) performs terribly on segment 2. That tells you something right there that a (single) concatenated set will not.
     
  19. Suppose we have 1000 models. Two classrooms in two different rooms get the same models and the same data.

    The first classroom is full of geniuses who know wassup and will test the "clever way", meaning they test all on seg1, then test pretty ones on seg2 and keep the ones that test pretty on seg2. That is their end selection.

    The second classroom is full of idiots and they test the "dumb way". They run all models on the combined seg3 which is a simple continuous combo of seg1 and seg2 and keep the pretty ones for their end selection.

    Question: don't you think the classrooms will end up with the same end selection of models? ;)

    Bonus question: don't you think at least some of the models in the end selection are there by random chance because we tested a lot of models and contain no alpha?
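
The bonus question can be checked with a quick simulation. The sketch below is a toy under loud assumptions: all 1000 models have zero alpha (their daily returns are pure noise), "pretty" is taken to mean nothing more than a positive total return on the segment, and the segment length and volatility are made-up numbers. The thread never fixes a definition of "pretty", so this is only one possible reading.

```python
import random

random.seed(0)  # fixed seed so the run is repeatable

N_MODELS = 1000
DAYS_PER_SEGMENT = 250  # assumed segment length


def segment_return(days: int) -> float:
    """Total return of a zero-alpha model: a sum of zero-mean daily returns."""
    return sum(random.gauss(0.0, 0.01) for _ in range(days))


# Each model gets one return for seg1 and one for seg2; none has any edge.
models = [(segment_return(DAYS_PER_SEGMENT), segment_return(DAYS_PER_SEGMENT))
          for _ in range(N_MODELS)]

# "Clever" classroom: keep models that are profitable on seg1 and then on seg2.
two_stage = {i for i, (r1, r2) in enumerate(models) if r1 > 0 and r2 > 0}

# "Dumb" classroom: keep models that are profitable on the combined seg3.
combined = {i for i, (r1, r2) in enumerate(models) if r1 + r2 > 0}

print("two-stage survivors:", len(two_stage))
print("combined survivors: ", len(combined))
print("two-stage is a subset of combined:", two_stage <= combined)
```

Under this definition, roughly a quarter of the no-alpha models should survive the two-stage filter and roughly half the combined one, so both end selections contain models that are there by chance. The two selections are also not identical here, although a stricter "pretty throughout" criterion on seg3 would narrow the gap.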
     
  20. Of course you're right that optimization can produce overfitted models with no alpha.

    However, WFA is much more than simply dividing your data into two segments.

    You optimize your strategy on segment 1 and then test on segment 2. You are not allowed to optimize using data from segment 2. That is the difference between the two situations that you are describing.

    Obviously, you will have a data mining bias after you perform WFA. Algo traders like Pardo and Eckhardt consider their WFA methods to be major trade secrets that they have refused to elaborate on in interviews.
     
  21. Suppose there are three periods A,B,C
    SegA runs 1.0→2.0
    SegB runs 2.0→1.0, and
    SegC runs 1.0→1.0

    Models performing best in
    SegA would show positive (Cartesian) linear behavior,
    SegB would show negative linear behavior, while
    SegC would exhibit a graceful curvilinear arc and do best with a second-order parameter of negative sign, to degrade (and eventually overwhelm) a positive first-order parameter.

    So....
    1) The border matters.
    2) W.R.T. subset performance, SegA+SegB =/= SegC.
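
A minimal numeric sketch of this point, assuming SegC is simply SegA followed by SegB and that "linear behavior" means the slope of a least-squares line. The price paths and the helper `ols_slope` are made up for illustration; the quadratic fit mentioned in the post is not attempted, only a simple check that the combined path bulges upward in the middle.

```python
def ols_slope(xs, ys):
    """Slope of the least-squares straight line through (xs, ys)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    return cov / var


# Hypothetical prices: SegA rises 1.0 -> 2.0, SegB falls 2.0 -> 1.0.
seg_a = [1.0 + i / 10 for i in range(11)]
seg_b = [2.0 - i / 10 for i in range(1, 11)]
seg_c = seg_a + seg_b  # the combined path, 1.0 -> 2.0 -> 1.0

slope_a = ols_slope(list(range(len(seg_a))), seg_a)
slope_b = ols_slope(list(range(len(seg_b))), seg_b)
slope_c = ols_slope(list(range(len(seg_c))), seg_c)

# The midpoint sits above the straight line joining the endpoints,
# which is what a negative second-order term would capture.
arcs_upward = seg_c[len(seg_c) // 2] > (seg_c[0] + seg_c[-1]) / 2

print(round(slope_a, 3), round(slope_b, 3), round(slope_c, 3), arcs_upward)
```

The best straight line is rising on SegA, falling on SegB, and essentially flat on the combined path, so the best linear description of each half says little about the best description of the whole.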
     
  22. The end result is the same though! By keeping only models that look pretty on a segment 2 test we have in fact manually optimized the system to segment 2. We might as well optimize on the whole segment 3. The end selection of systems will be the same.
     

  23. It's very easy to show that both classrooms will not necessarily end up with the same selection of models. All you need to do is imagine that model_n did very poorly on the first segment, and did fantastically well on the second segment. So well, in fact, that the model_n performance on the combined set did better than all of the other models. If the other classroom only looks at the entire concatenated set of data, they will choose model_n, however, class one already threw out that model in the first segment selection step by filtering the sets that pass to segment 2.
    Hopefully you can see the set of models at the final step are not equivalent.

    But that point is sort of trivial. What's more important is that both of the methods of filtering the best of the distributions suffer from the same selection bias. You can devise a different way to use the two segments to choose models, that can give you a better likelihood of performing well on a new unseen segment, that you cannot do with only one combined segment.

    If you are really interested to understand it from a more modern statistical perspective, you can look into topics on bias/variance tradeoff. But you will find very little literature applying it to your application, that's up to you to figure out.
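
The model_n counterexample can be written out with numbers. The sketch below uses terminal wealth as the single metric (as a later post in the thread also does) and three made-up models; every return figure is hypothetical, and "pass seg1" is assumed to mean a positive seg1 return.

```python
# Hypothetical per-segment returns (seg1, seg2); every figure is made up.
models = {
    "model_a": (0.05, 0.05),
    "model_b": (0.08, 0.02),
    "model_n": (-0.10, 1.00),  # poor on seg1, spectacular on seg2
}


def terminal_wealth(r1: float, r2: float) -> float:
    """Wealth after both segments, starting from 1.0."""
    return (1.0 + r1) * (1.0 + r2)


# Classroom one: filter on seg1 first, then pick the best seg2 performer.
passed_seg1 = {name: r for name, r in models.items() if r[0] > 0}
class_one_pick = max(passed_seg1, key=lambda name: passed_seg1[name][1])

# Classroom two: pick the best terminal wealth on the concatenated data.
class_two_pick = max(models, key=lambda name: terminal_wealth(*models[name]))

print("classroom one picks:", class_one_pick)
print("classroom two picks:", class_two_pick)
```

Classroom one never sees model_n on seg2 because it was discarded after seg1, while classroom two ranks it first on terminal wealth. Whether this matters in practice depends on the metric chosen, which is exactly what the following posts argue about.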
     
  24. You guys are failing to consider the x-axis:
    [image not preserved]

    You guys are trying to apply the model as though the domains are equal for all three segments.

    I agree with @pursuit

    Hopefully, I don't have to expound. :)
     
  25. No, model_n will not be selected by the class testing on segment 3. They will see that it did shitty on segment 1 part of segment 3 and will discard it. They're looking for a pretty graph throughout segment 3. That is impossible if segment 1 performed horribly.
     
  26. That really depends on how you define what 'shitty' or 'good' is. Notice you never did define it. 'Looking' at a graph characteristic in segments, is not the same as using a single quantitative metric to determine the outcome(s). In my case I simply used terminal wealth as a proxy (which isn't that uncommon). Step it up to a trillion models, and you really won't have the foresight to know if it is good or bad by your reasoning. Using terminal wealth for example, my scenario shows your hypothesis is not conclusive. If you are going to make a blanket statement, it has to cover all cases, any one case that disproves it, disproves your statement.

    Anyways, not here to argue. If you don't get anything from it, no need to post more.
     
  27. Yeah, you're probably right.

    The main question then would be how long of a segment to use for backtesting.
     
  28. We all know what shitty and pretty equity curves look like. There is no issue with quantifying it. We can use Sharpe ratio, Sortino, R-squared or any other accepted measure of smoothness. Does not change the validity of my point one bit.
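
For readers who want the "pretty curve" idea in numbers, here is a minimal sketch of two of the measures named above. It assumes a zero risk-free rate, does not annualize the Sharpe ratio, and takes "R-squared" to mean the fit of the equity curve to a straight line over time; the two return series are invented for illustration.

```python
from statistics import mean, stdev


def sharpe_ratio(returns):
    """Per-period Sharpe ratio, assuming a zero risk-free rate (not annualized)."""
    return mean(returns) / stdev(returns)


def equity_curve(returns, start=1.0):
    """Compound a list of per-period returns into an equity curve."""
    curve = [start]
    for r in returns:
        curve.append(curve[-1] * (1.0 + r))
    return curve


def r_squared_vs_line(curve):
    """R-squared of the equity curve against its least-squares straight line."""
    n = len(curve)
    xs = range(n)
    mean_x = sum(xs) / n
    mean_y = sum(curve) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, curve))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in curve)
    return (sxy * sxy) / (sxx * syy)


# Two made-up return series with the same average return per period.
smooth = [0.01, 0.012, 0.009, 0.011, 0.01, 0.008]
choppy = [0.10, -0.08, 0.12, -0.09, 0.07, -0.06]

for name, rets in (("smooth", smooth), ("choppy", choppy)):
    print(name, round(sharpe_ratio(rets), 2),
          round(r_squared_vs_line(equity_curve(rets)), 3))
```

Both series average the same return per period, yet the smooth one scores far higher on both measures, which is the sense in which smoothness can be quantified.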
     
  29. If using just segment 3 is as good as the combo of 1 and 2 then it would be reasonable to use segment 3 of maximum length so our model can experience a wide variety of market conditions in the test.
     
  30. Your original query is whether using segment 3 was the same as using segment 3 ... after splitting it into two segments (1 and 2), when comparing *pre-built* models. It obviously (to some) is.

    This is not the same as saying, in effect, if not in these exact words, "Hey! I should *build* my models on all possible data, and not leave some data out for testing/validating!"
     
  31. Not only has no one written that the domain between sub-segments be the same, but I have repeatedly referenced it being variable as a relevant factor.:rolleyes:, :wtf:, :banghead:, :cool:

    That said, great exhibits (labeling aside).:thumbsup:
     
  32. Simple models and simple changes to such can yield vastly different results. A tool that may help is how many reasons do you have for your solutions not to be overfit? Doesn't matter what they are, but how you establish them matter greatly. These reasons may even be superior to out of sample and forward testing, because if they're right, they should work regardless of these tests, though they could still act as a tool for model validation.

    Complex models on the other hand, may be overfit already, simply because of how they became so complex in the first place (in order to fit the data perhaps?). They're often characterized by lack of robustness and fickle dependencies (ie. bad data quality).

    It's a mindbender and topic of exploration that may take lifetimes.
     
  33. Lol...yeah...coulda did a much better job with just a little bit more effort.
     
  34. I'm here for trading related entertainment.

    -Segment 1 (70% of data) turns out to be based on a strong bull market,
    -Segment 2 (30% of data) turns out to be based on a rapid decline.
    -Segment 3 (100% of data)

    *we have a long only strategy
    *we are blinded and have no idea what the data in segment 2 looks like

    A) If we tested strategies based only on segment 1, then the equity curves could significantly under-perform on segment 2, making the strategies no longer viable. If some still performed as expected (even after a regime change), then we know what to investigate further.

    B) If we were unblinded and tested strategies across all data 1+2 (Segment 3), our strategy design could have already compensated for the decline seen in segment 2 (in fact we might have decided that a long only strategy was no longer a viable option). Either way, we have opened ourselves up to curve fitting, or at least increased the likelihood.

    When Segment 1 contains vastly different characteristics to Segment 2, the strategies we arrived at in (B) are going to be different from the strategies we arrived at in (A). Even though the strategies that performed well in (A) will still perform the same in (B), they could easily get overlooked for better performing strategies derived from only (B). Therefore we will not arrive at the same choice of strategies in both cases.
     
  35. You are missing that the x-axis values are different in segment 1 vs segment 2.
    Hypothetically, the best strategy for segment 1 can also be the best strategy for segment 2.

    [image not preserved]
     
  36. As a matter of fact, if
    - a strategy is built on a good prior hypothesis
    - the effect has good statistical significance
    - and the number of free parameters is low (preferably none)
    it's a perfectly OK thing to do. In fact, you would be better served building a collection of simple strategies this way vs going in circles optimizing something complex.
     
  37. Oh, of course. I agree. I was merely pointing out that that's not the conclusion that can be drawn from this particular hypothetical.
     
  38. What are you talking about? Hypothetically, sure, the best strategy for segment 1 can also be the best strategy for segment 2. However, it can also not be the best strategy as well.
     
  39. I wanted to expound, but had to stop my analysis of your post. (See below).

    I know right.

    Ok.

    This is not what the OP says. The OP says that we pick one of the available strategies that also does well in segment 2 as well as segment 1. So I guess I must stop here since your hypothetical requires something different.

     
  40. Optimizing on seg1 and then picking only strats that look pretty on seg2 will result in a similar selection of strats as optimizing on the whole seg3. If we are testing a non-optimized strat - same thing. We end up with a similar selection regardless of whether we explicitly optimize some parameters or not. By selecting only pretty equity curves we are "optimizing".

    It's really not that hard to understand (or I guess it is for some people judging from some of the replies on the thread). The out of sample thing is a fallacy and great for marketing, especially to retail traders.

    It proves nothing and does nothing to increase the likelihood of success live. Other tests of robustness must be implemented.
     
  41. Have you actually tested this out for yourself? Or are just putting forward a hypothesis?
     
  42. Not an illusion at all if done properly with close to zero degrees of freedom.

    First, before doing any WFA, a trader needs to understand why, and how their model(s) gives them any kind of competitive advantage in the marketplace. Also make sure they're executable in real-time before moving into the analysis phase. Assuming doesn't cut it. Real money needs to be put on the line.

    Second, unless I'm marketing to investors, I don't give a shit about Sharpe, Sortino, Treynor and whatever metrics so-called quants use, and I don't need R, MatLab, etc... to do effective WFA. A custom built Excel sheet works just fine. As a lone wolf, I can only keep track of a limited number of variations/models.
     
  43. What you've described is not walk forward analysis/optimization.

    True WFA is robust. I know I'll have to explain/simplify...so I'll just get to it:

    1. Optimize over days 1-50.
    2. See how it worked on OOS day 51 (out of sample, because day 51 didn't exist during the analysis/optimization of days 1-50).
    3. Day 51 closes.
    4. Optimize over days 2-51.
    5. See how it worked on OOS day 52.
    6. Repeat.

    What you did say, and I agree with, is that optimizing by holding data out for *validation* is, many times, not much different than optimizing over *all* the data.

    Let me simplify further:

    WFA *tests* on *true* OOS data. In the above example, our hero optimizes *after* the close of the dow/nasdaq/etc. markets (4 pm ET)...but before the next market open.

    Then, the next day--after the next market close, our hero sees how well his forecast did.

    Repeat.

    It doesn't get more robust than proper WFA.
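
The six steps above can be sketched as a rolling loop. Everything specific in the sketch is an assumption made for illustration: the toy trading rule (long when the last few returns sum to a positive number, otherwise flat), the candidate lookbacks, the synthetic price data, and the function names. Only the shape of the loop (optimize on 50 days, trade the next day, slide forward by one) comes from the post.

```python
import random


def rule_position(returns, day, lookback):
    """Hypothetical rule: long (1) if the last `lookback` returns sum > 0, else flat (0)."""
    return 1 if sum(returns[day - lookback:day]) > 0 else 0


def in_sample_score(returns, start, end, lookback):
    """Total return of the rule on days [start + lookback, end), using only window data."""
    return sum(rule_position(returns, d, lookback) * returns[d]
               for d in range(start + lookback, end))


def walk_forward(returns, window=50, lookbacks=(1, 2, 3, 5)):
    """Optimize on each `window`-day block, then trade the single next day out of sample."""
    oos = []
    for test_day in range(window, len(returns)):
        start = test_day - window
        # Steps 1 and 4: pick the parameter that did best inside the window.
        best = max(lookbacks,
                   key=lambda k: in_sample_score(returns, start, test_day, k))
        # Steps 2 and 5: apply it to the next, unseen day and record the result.
        oos.append(rule_position(returns, test_day, best) * returns[test_day])
    return oos


random.seed(1)
daily = [random.gauss(0.0, 0.01) for _ in range(120)]  # synthetic daily returns
oos_returns = walk_forward(daily)
print(len(oos_returns), "out-of-sample days, total return", round(sum(oos_returns), 4))
```

Each out-of-sample figure is produced by a parameter chosen strictly from earlier data, which is the property the post is pointing at. On pure noise like this synthetic series, no out-of-sample edge is to be expected.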
     
  44. There was a very nice study done by someone at DB (I think) that shows how your "perceived" Sharpe grows over a number of optimization passes. It looked pretty scary, IMHO.

    My approach is "hypothesis -> study -> first pass strategy -> live trading in small size -> improvement based on real results". In most cases, the causes for failure are aspects that cannot be accounted for in paper trading (fills, borrow, information delays).

    Also, I am almost never ready to deploy a strategy unless I have a solid fundamental hypothesis regarding the source of alpha. There are no free lunches, only cheap lunches or stolen lunches. If it's the latter, I'd like to know what I am paying and if it's the former, I'd like to know who I am stealing it from.
     
  45. As you correctly note, my point is about traditional optimization/backtesting, not "WFA". As you describe it, WFA is basically "rolling window" optimization.
     
  46. Yes...also known as "sliding window" optimization.
     
  47. Well... then it's SOLVED. All we need is to use "sliding window optimization" and our system will always work live and will work forever. /s
     
  48. You're welcome /s
     
  49. How have your systems tested with WFA fared live? Was performance in line with testing? If so, for how long?
     
  50. Note: I can't be manipulated into revealing information I wouldn't normally reveal. I'm a pretty good Texas Hold'em player.

    This is your thread. I haven't started a thread revealing my data. If you want me to compile data for you, I can quote you a price.
     
  51. Haha. This is why this forum is shit. There is no incentive to share absolutely anything of value.
     
  52. As a side question, do you believe that your back-testing/optimization methods are a part of your alpha? I have definitely found it not to be the case for myself and most people I’ve spoken to.
     
  53. In my case, they are.
     
  54. is this forum about choosing the best broker and data feeds
     
  55. It depends what you're optimising.
     
  56. Interesting. The only professional traders/PMs I know that have been fairly secretive about that process (rightly so) are HFT guys - it's clearly a tricky topic for them and smart ways can contribute a fair bit. Definitely my strategy pipeline is not a part of my alpha, it's merely a tool.

    Well, let's not touch execution-type stuff (that's a separate question) but things like hysteresis bounds etc.
     
  57. You'll get out of this forum what you put into it.

    That is your first nickel-worth of free advice. You have two more in queue. Do not squander them.
     
  58. I'm not HFT...at most two trades per day. The algo will hold if it thinks the instrument is in a trend.

    Not as secretive as I probably should be. :) It's not that I believe I've found the holy grail, but I certainly won't entertain trolls/manipulators/asholes--even if it were concerning information I have no problem discussing.

    I actually wish I could talk about it completely...but of course, that wouldn't be smart.
     
  59. You have an algo that thinks? Fascinating.
     
  60. :)
     
  61. Oh, so you lied because you thought I'm "manipulating" you. LOL you need a psychiatrist, bud.
     
  62. I find that there are two separate degrees of discussion.

    There is theoretical “what if and how about” which most professionals engage in all the time without holding back much. A chat along the lines of "in my experience, cost of risk is a better way of smoothing out your transaction frequency than hysteresis, especially in strategies with multiple legs".

    Then there are detailed "how do I make money" type of conversations that you could only have with the members of your team. Anything that involves your specific factors, instruments you trade or anything that can really dilute your alpha falls in that category.
     
  63. Yes, I understand what you are saying, but even with this example...generally speaking...it still just depends, imo. Just for hypothetical...

    You quote, "in my experience..." in your first example.

    If that experience was the result of lots of research, computer time, coding, testing, thinking, money, blood, sweat, tears, lost family time, lost social time, and led to some supposedly unique edge (or edges); then I doubt "most professionals" would willy-nilly reveal the results of 'their experience.'

    So, I see what you're saying, but 'it depends' upon the exact facts/circumstances, imo.
     
  64. Well, I have not and would never bleed or even cry for any of my employers :) However, I can speak from decades of professional experience in quant trading, both as a book runner on a dealer side and a PM at a couple funds. If there is one thing you learn is that there is nothing new under the sun. My portfolio has over 30 active strategies at the moment and I can’t name anything “unique”.

    By no means would I share the actual details of my alpha generation, but I have had in-depth conversations with coworkers or friends regarding more general ideas like risk metrics, measures of significance or various “how-tos” that are not specific to my own strategies. That's my experience.
     
  65. Neither would I. I never referred to an employer.? :)

    There're always new things...or new ways to combine old things.
    https://en.wikipedia.org/wiki/Timeline_of_mathematics

    I agree. I never said I wouldn't discuss anything at all. I said "it depends"...same thing you seem to be saying. :)
     
  66. Your skepticism is understandable.
    However, if done correctly, out of sample testing is useful.
    Sometimes it is easier to get a feel for things if they are exaggerated.
    Say two people come to you and claim they have a method to play the lottery.
    The first guy shows you his correct predictions for last week's lottery.
    The second guy predicts the numbers for tomorrow and they turn out to be true.
    Which model will you use to play the lottery?
     
  67. I think that out of sample/forward testing guarantees nothing. It can provide an added level of confidence to those who are about to deploy a new trading system but it cannot guarantee that a future sample will generate greater alpha than a historic sample. One subject that does not get talked about nearly enough here is how to define edge death/erosion. I know plenty of people here might just say "historic max draw down peak to valley will decide if my system is active or not".

    Step #1. Find a profitable historic "edge" and hope that some semblance of it persists into the future.
    Step #2. MAKE MONEY or at least break even. Don't blow up!
    Step #3. Did you exit this system gracefully and with some money, or did it blow up and take all of your profits with it? Yeah, I know any quant reading this would just say "run a bunch of non-correlated systems at the same time and who cares if any one individual system blows up?" I'm writing this post with the retail trader in mind, someone who probably runs one or two trading systems max.
     
  68. Can someone kindly explain to me what is walk forward and out of sample testing? I don't understand the terminology.

    Back testing segment 1 and 2 vs back testing 1+2 combined segment, so what is out of sample, same data set if they are consecutive?

    Thanks.
     
  69. So, there is really no hope for us small retail traders? If we cannot find new strategies how can we beat the professionals with vast resources with the same strategy they use?

    As for the professionals, if they all employ similar strategies, how can they make money? Unless they take it from us? By this logic, should we retail traders give up trading?
     
  70. I think this is something that deserves a separate thread (or even a whole book).
     
  71. Not really. If retail were to implement the same strategies in the same way, they'd perform at the same level. But they don't follow the same strategies. Nor do they implement them in the same way. If nothing else, fear prevents them from performing rationally.
     
  72. Just curious, are you retail or professional PM?

    I wonder whether it is greed and ignorance rather than fear? Most retail traders I know bet way too much on each trade instead of too little (me included, as I had no idea of Kelly and risk of ruin until recently).
     
  73. You know that’s not true. In most cases, a retail trader should and can do better than an institutional market participant. There are multiple reasons for it - lower liquidity needs, broader product universes, lack of negative selection imposed by irrational risk parameters etc.

    Additionally, there are specific areas where the only people that could thrive are smart retail traders.

    PS. In a few cases, retail traders would indeed do worse or not be able to participate at all. Examples are strategies where they don’t have the same market access or are unable to afford technological investments.
     
  74. Can and should but doesn't. The typical retail trader is undercapitalized, undisciplined, ignorant of market structure and demand-supply dynamics, and generally lazy. As for smart, of course; but many retail traders just aren't smart enough. Theoretically the "smart retail trader" ought to be able to out-maneuver the "institutional market participant". This is the sort of thing that's put out there by the industry to keep retail in the game. And losing (not "loosing") money. But too many retail traders approach trading the way too many people approach time-shares. So the only people who are making consistent profits are those who are selling courses and software and dvds and newsletters and alert services and, with few exceptions, books. Then there are all the articles and blogs and so forth written by people who aren't any more qualified to write about the subject than the typical retail trader.

    Picks and shovels.
     
  75. Actually, from my interactions on this forum, there are plenty of diligent and intelligent people. What these people lack is knowledge and experience. Given the right ideas they would be able to make money (after being spanked by the market a few times).
     
  76. Without giving away your own strategies, do you have any recommendations for a retail investor on what knowledge or ideas to learn?
     
  77. Out of Sample:

    Optimize using segment 1. Test how it works using segment 2.

    Walk Forward:

    Walk Forward can refer to a particular type of optimization. Each day, new stock etc. data is generated. Re-optimizing on this newly generated data would be Walk Forward Optimization.

    https://en.wikipedia.org/wiki/Walk_forward_optimization

     
  78. The aim being to provide adaption to new data so as to track varying market behaviour.

    And if anyone manages to set up a model of that process the next problem is to determine the period of the in-sample data which optimises performance of the adaption process over past data.
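
One way to approach that last problem is to rerun the walk-forward loop for several in-sample lengths and compare the out-of-sample error. The sketch below is a toy under stated assumptions: the "model" is just the mean of the previous `window` points, the series is synthetic, and the candidate lengths are arbitrary. All lengths are scored on the same test points so the comparison is like for like.

```python
import random


def walk_forward_mse(series, window, first_test):
    """Mean squared one-step-ahead error, forecasting each point as the mean of the prior `window` points."""
    errors = []
    for t in range(first_test, len(series)):
        forecast = sum(series[t - window:t]) / window
        errors.append((series[t] - forecast) ** 2)
    return sum(errors) / len(errors)


random.seed(2)
# Synthetic series: a slowly drifting level plus noise (made up for illustration).
level, series = 0.0, []
for _ in range(500):
    level += random.gauss(0.0, 0.05)
    series.append(level + random.gauss(0.0, 1.0))

candidates = (5, 20, 60, 120)  # assumed in-sample lengths to compare
first_test = max(candidates)   # every candidate is scored on the same test points
scores = {w: walk_forward_mse(series, w, first_test) for w in candidates}
best_window = min(scores, key=scores.get)

for w in candidates:
    print(w, round(scores[w], 4))
print("best in-sample length on past data:", best_window)
```

Choosing the window this way is itself another layer of optimization over past data, so the false-positive concerns raised earlier in the thread apply to it as well.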