There is plenty of conventional knowledge that is perceived to be true but is in reality trivial (only situationally correct). We had a lot of these in discretionary trading, like:

1. Trend-related: "The trend is your friend." "Cut your losses short, let your profits run." Why it's trivial: mean reversion + martingale, statistical arbitrage, scalping, and intraday swing trading all run counter to it, and it reflects the historical conditions of the markets back then, unlike the markets now.
2. Technical analysis: "The market is always right." Why it's trivial: its usefulness depends entirely on the trader using it, ending up as a self-fulfilling prophecy. And there's plenty of statistical confirmation that general usage provides no significance (actually negative).
3. The psychology thing... it's trivial. Etc., etc.

There are plenty of these in systematic trading too:

1. Testing with large data sets.
2. Out-of-sample / forward testing.
3. Robustness / non-curve-fit models.
4. Using stops.
5. The Sharpe ratio.
6. Pretty much most of the trading rules/lexicon carried over from discretionary trading.
7. The coin-flipping analogy.

I tend to see way too many people caught up in outdated and factless information.

What is wrong with in-sample / out-of-sample testing and the Sharpe ratio? I use those. Why not "whatever works for you"? I have no quarrel with "the trend is your friend" and "cut your losses" - it does not work for me - my dominant strategy is mean reversion - but surely there is more than one way to skin a cat.

First... testing with large sets of data. I took six profitable models which I trade. I picked these 6 models because I was about to run a parametric optimization and each has quantifiable sustainability after each test run. I took 40 years of US equities, commodity futures and a cash index (trivially, it's cash, but I needed the data). I had the computer randomly pick a start date and a symbol for each iteration and run the models for a set amount of time: 1, 3, 5, 7, 10, 15 and 20 years of historical data.

1. The first set of results was the average annualized return over all the test iterations. This pretty much serves as the baseline figure for the test.
2. The second set is the average performance of the models after the test done above. In other words, after a single optimization pass each model takes the best-performing parameters and trades them for another year... So... a 1-year test sample for Model 1 returns an average of 185%. The average return 1 year after the test is around 25%. What can be concluded from the tests is that a large data set doesn't really help the models sustain their performance for another year of trading.
3. So... I dug a bit deeper: take the best parameters from set 1 and run an out-of-sample test as in 2. I would then pick out the parameter sets for which both Test 1 and Test 2 were positive (so Test 1 is the main test, and Test 2 acts as an out-of-sample... forward test). There's really no pattern or viable confirmation that lets me conclude it works. Either way, I don't see viable confirmation that running over a long span of data and out-of-sample testing has an advantage.
4. Finally, the fourth set of data is the % of the models/parameters that were actually positive after the given data size and a 1-year out-of-sample test.
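The loop described above - random start date and symbol, optimize in-sample, then trade the best parameters for one more year - can be sketched roughly like this. Everything here is a hypothetical stand-in (synthetic prices and a toy one-parameter moving-average model), since the real models are proprietary:

```python
import random

random.seed(0)
DAYS_PER_YEAR = 252

# Synthetic daily prices standing in for 40 years of market data
# (hypothetical; the original test used US equities, commodity
# futures and a cash index).
prices = [100.0]
for _ in range(40 * DAYS_PER_YEAR):
    prices.append(prices[-1] * (1 + random.gauss(0.0003, 0.01)))
cum = [0.0]
for p in prices:
    cum.append(cum[-1] + p)  # cumulative sum for O(1) moving averages

def model_return(start, days, lookback):
    """Net return of a toy long-only MA model over prices[start:start+days]."""
    equity = 1.0
    for t in range(start + lookback, start + days):
        ma = (cum[t] - cum[t - lookback]) / lookback
        if prices[t - 1] > ma:  # long while price sits above its MA
            equity *= prices[t] / prices[t - 1]
    return equity - 1.0

def one_iteration(in_sample_years, param_grid=range(5, 55, 5)):
    """Optimize the lookback in-sample, then check whether the best
    parameter is still profitable one year out of sample."""
    n_in = in_sample_years * DAYS_PER_YEAR
    start = random.randrange(len(prices) - n_in - DAYS_PER_YEAR - 1)
    best = max(param_grid, key=lambda p: model_return(start, n_in, p))
    return model_return(start + n_in, DAYS_PER_YEAR, best) > 0

results = {}
for years in (1, 3, 5):
    hits = sum(one_iteration(years) for _ in range(30))
    results[years] = hits / 30
    print(years, "years in-sample, fraction profitable out of sample:", results[years])
```

On synthetic random-walk data like this, the out-of-sample hit rate hovers near chance regardless of the in-sample length, which is the same shape of result reported below.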
1 year of data + 1 year out-of-sample (fraction of models profitable): 0.593210268
3 years of data + 1 year out-of-sample: 0.51020825
5 years of data + 1 year out-of-sample: 0.529947475
7 years of data + 1 year out-of-sample: 0.529165128
10 years of data + 1 year out-of-sample: 0.56812038
15 years of data + 1 year out-of-sample: 0.565841906
20 years of data + 1 year out-of-sample: 0.585818864

So the profitability is very close to 50/50. I'm starting with profitable models to begin with, so the fractions are all above 1/2. What's more, there's no clean curve relative to the sample size. So I conclude that testing on a large data set is trivial, and that forward testing / out-of-sample testing is trivial.

Adding: I ran genetic-optimization-like tests without carrying over information from the previous generation. The test was 100,000 generations over 1,000 genes. The fitness function was net profit, because it keeps the tests very simple and I don't have to worry about the viability of the function affecting the selection process. I TEST EVERYTHING.
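For reference, a genetic-style optimizer that carries nothing between generations degenerates into pure random search: each generation just draws fresh genes and keeps the best fitness seen so far. A minimal sketch, where the fitness function is a made-up stand-in for a backtest's net profit (the original ran 100,000 generations over 1,000 genes; the numbers here are scaled down):

```python
import random

random.seed(1)

def net_profit(genes):
    """Hypothetical fitness: a noisy stand-in for a backtest's
    net profit, peaked at gene values of 0.3."""
    return -sum((g - 0.3) ** 2 for g in genes) + random.gauss(0, 0.1)

def random_search(n_generations=1000, n_genes=10):
    """A GA-like search with no information carried between
    generations, i.e. pure random search over the gene space."""
    best_genes, best_fit = None, float("-inf")
    for _ in range(n_generations):
        genes = [random.random() for _ in range(n_genes)]
        fit = net_profit(genes)
        if fit > best_fit:
            best_genes, best_fit = genes, fit
    return best_genes, best_fit

best_genes, best_fit = random_search()
print("best fitness found:", round(best_fit, 3))
```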

Here's another one: I tested the well-known fitness measures that you typically see in retail-level testing platforms and checked whether they have any significance for evaluating a model. In a bit more detail:

1. I have 12 models total. 6 of the models are tested on EOD data and the rest intraday (tick). The initial optimization for the EOD models is done with 5 years of data; the tick models are run with 1 year of data. Within each group, 3 of the models are trend-following in nature and the other 3 are the opposite, counter-trend in nature. All 12 are different models with a "relatively different" tendency.
2. Similar to the previous test, I randomly pick a start date and a market. From each set of iterations, I take the best-performing value for each measure and check whether the model is profitable out of sample (3 years for EOD... 6 months for intraday).

Here's a breakdown of the results (% of models profitable out of sample when selected by each measure):

% Drawdown (min): 29.55%
% Profitability: 35.06%
Avg Trade: 30.99%
Efficiency: 39.41%
MAR Ratio: 44.62%
Net Profit: 29.83%
Profit Factor: 41.23%
Risk/Reward Ratio: 28.96%
Sharpe Ratio: 40.36%
Sortino Ratio: 40.31%

Interestingly, the conventional measures give the models a less-than-50% chance of succeeding in the future. The better ones are MAR, along with PF/Efficiency/Sharpe/Sortino at around the same 40-something %... Of course, I wouldn't want to end there and potentially start a debate that backtesting doesn't work, so I added another row on top with the values for the optimal measure customized for each model, for reference purposes. Hopefully, this is enough to get the conversation started. And... not to mention... I TEST EVERYTHING.
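For concreteness, here is roughly how those measures are computed from a series of per-period returns. These are illustrative textbook definitions; testing platforms differ in details such as annualization, risk-free rate handling, and whether trades or bars are used:

```python
import math

def fitness_measures(returns, periods_per_year=252):
    """Common fitness measures from a series of per-period returns
    (illustrative definitions; platforms vary in the details)."""
    n = len(returns)
    mean = sum(returns) / n
    wins = [r for r in returns if r > 0]
    losses = [r for r in returns if r < 0]
    var = sum((r - mean) ** 2 for r in returns) / n
    dvar = sum(min(r, 0.0) ** 2 for r in returns) / n  # downside variance

    # Equity curve and maximum drawdown, needed for the MAR ratio.
    equity, peak, max_dd = 1.0, 1.0, 0.0
    for r in returns:
        equity *= 1 + r
        peak = max(peak, equity)
        max_dd = max(max_dd, 1 - equity / peak)

    ann_return = (1 + mean) ** periods_per_year - 1
    return {
        "net_profit": equity - 1,
        "pct_profitable": len(wins) / n,
        "avg_trade": mean,
        "profit_factor": sum(wins) / -sum(losses) if losses else float("inf"),
        "sharpe": mean / math.sqrt(var) * math.sqrt(periods_per_year) if var else 0.0,
        "sortino": mean / math.sqrt(dvar) * math.sqrt(periods_per_year) if dvar else 0.0,
        "mar": ann_return / max_dd if max_dd else float("inf"),
        "max_drawdown": max_dd,
    }

print(fitness_measures([0.01, -0.005, 0.02, -0.01, 0.015]))
```

The test described above then amounts to ranking parameter sets by one of these keys and checking whether the top-ranked set stays profitable out of sample.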

It is hard for me (or any other outsider, I assume) to explain what is going on in your datasets and optimizations. But if you have something that trades 10 times a year, then optimizing over one year is not going to do any good. And if you have something with very little exposure (mostly flat rather than holding an actual position), then optimization is also likely to "train in noise". I also always use some kind of risk-adjusted performance measure (fitness function), usually the Sharpe ratio or return to maximum drawdown, based on sampling the "net liquidation" of the position rather than waiting until the position is realized. It seems to be easier to give a reliable/valid estimate for expected variance (risk) than for expected return.
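The "sampling net liquidation" idea means marking the open position to market every day and computing the risk statistic from those daily changes, rather than from realized trade P&L. A minimal sketch, assuming daily sampling (function and variable names are mine, not from any particular platform):

```python
import math

def mark_to_market_sharpe(prices, positions, periods_per_year=252):
    """Sharpe ratio from daily changes in net liquidation value,
    i.e. the open position is marked to market each day instead of
    waiting for trades to close. positions[t] is the number of
    units held over the interval from day t to day t+1."""
    nav = [0.0]  # cumulative P&L, sampled daily
    for t in range(len(prices) - 1):
        nav.append(nav[-1] + positions[t] * (prices[t + 1] - prices[t]))
    changes = [b - a for a, b in zip(nav, nav[1:])]
    mean = sum(changes) / len(changes)
    var = sum((c - mean) ** 2 for c in changes) / len(changes)
    return mean / math.sqrt(var) * math.sqrt(periods_per_year) if var else 0.0

print(mark_to_market_sharpe([100, 101, 100.5, 102, 103], [1, 1, 0, 1]))
```

The point of sampling this way is that long-held positions contribute their interim variance to the estimate, instead of showing up as a single realized number at exit.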

Of course, I understand what you mean. And since you mentioned Sharpe in your initial reply, I pulled out another set of test results. In terms of the viability of the tests... I understand that they're trivial in a lot of ways, such as the models I used. But it's a test, and I think it speaks more than the conventional beliefs people hold. And I'll be more than willing to see tests done by others. Let's just say it's one instance that I base my "opinion" on.

Maybe it would be easier to discuss if you took a trading model which wasn't secret and presented the full results? (Say: buy/sell on two moving averages and a short-term RSI.) Exactly what explains your results is anyone's guess. What parameters are you optimizing? I like the in-sample/out-of-sample methodology because it makes you:

1. Test your models and look at actual historical data.
2. Beware the effect of (over-)optimization.

And I like the risk-adjusted performance measures because:

1. Risk is important.
2. Risk (variance) generalizes better / is easier to estimate than returns.

This is not an easy game - you have got to go with the little tools you have.

I've had that in mind already...

1. I would have to start off with a profitable model or the whole test is going to be perceived as useless. "A bad model = bad" no matter what, at least in the context of these tests. Obviously, I am not going to be happy providing the source for that.
2. I agree... risk is important. I'll probably re-post on that in the next few days...

Anyway, I'm not attacking anyone's way of developing. People should do what keeps them happy. It's just one incident to keep in mind when they develop and assess models. Equally, I'd like to hear how they decide on the sample size and fitness measures when they assess... If you haven't noticed... the sample size IS static. Plus... I never mentioned Walk-Forward.

imo, we can't be "Breaking the conventional knowledge..." based on our "conventional knowledge", can we? Think about that seriously!

Does walk-forward testing/optimization give better results with your models/data? (I would find that odd.) I have no strict rules for data sets and optimization methods/targets. But if I optimize, I like to see my models generate hundreds of trades in the training data set. And my target for in-production performance is a Sharpe ratio of 2. I found that the Sharpe ratio, Sortino ratio, and return to max drawdown all gave approximately the same results. I like the Sharpe ratio because it is easy to understand and corresponds to my trading target. As for a model to test - something like:

1. Buy if: Moving Average(10 weeks) > Moving Average(40 weeks) and RSI(2 weeks) < 30
2. Sell if: Moving Average(10 weeks) < Moving Average(40 weeks) and RSI(2 weeks) > 70

is mentioned in a million places and gave a better risk-adjusted return (i.e. Sharpe ratio) than buy-and-hold the last time I tested it (many years ago).
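That rule set is simple enough to state in a few lines. A sketch, assuming weekly closes and a simple (Cutler-style, non-Wilder-smoothed) RSI:

```python
def sma(closes, n, t):
    """Simple moving average of the n closes ending at index t."""
    return sum(closes[t - n + 1:t + 1]) / n

def rsi(closes, n, t):
    """Cutler-style RSI over the n price changes ending at index t
    (simple averages rather than Wilder's smoothing)."""
    gains = losses = 0.0
    for i in range(t - n + 1, t + 1):
        change = closes[i] - closes[i - 1]
        if change > 0:
            gains += change
        else:
            losses -= change
    if gains + losses == 0:
        return 50.0
    return 100.0 * gains / (gains + losses)

def signals(closes):
    """Yield (week_index, action) pairs for the two-MA + RSI(2) rules:
    buy on MA(10) > MA(40) with RSI(2) < 30, sell on the opposite."""
    for t in range(40, len(closes)):
        fast, slow, r = sma(closes, 10, t), sma(closes, 40, t), rsi(closes, 2, t)
        if fast > slow and r < 30:
            yield t, "buy"
        elif fast < slow and r > 70:
            yield t, "sell"
```

In words: only buy dips (short-term RSI oversold) while the long-term trend is up, and only sell rallies while it is down.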