Modeling question - genetic programming

Discussion in 'Automated Trading' started by smili, Sep 22, 2007.

  1. smili


    Hi all, I've lurked here for several weeks. This is my first post.

    I have a question about modeling time series data. The suggested steps in the manual (Discipulus software) say that for the training and validation data sets I should

    "Take the first two-thirds of your examples (in sequence) and
    randomize them. Then split these data into two equal size data
    sets--one for training and one for validation."

    I don't understand the significance of randomizing the order of the training and validation data observations. Can anyone explain this? thanks in advance,

  2. What do you know about in sample and out of sample statistics?
  3. smili


    This is my first set of data I'm running through the program. Just seeing how it works and learning the options.

    It's daily SPY data going back to mid-1993, about 3400 observations in total, broken into thirds for the training, validation, and applied datasets. (Applied is the last third; training and validation are the first two-thirds of the data, randomized.) The applied data is not used in any way in forming the model; it is used only to test the model on data it has never seen.
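For anyone following along, the split described above can be sketched in a few lines of Python. This is only an illustration of the quoted manual passage, not what Discipulus does internally; the function name `split_thirds`, the `seed` parameter, and the integer stand-in data are all made up for the example.

```python
import random

def split_thirds(rows, seed=0):
    """Split time-ordered rows: shuffled first two-thirds, untouched last third."""
    n_applied = len(rows) // 3                 # last third, kept in time order
    cut = len(rows) - n_applied
    head = list(rows[:cut])                    # first two-thirds, in sequence
    applied = list(rows[cut:])                 # never shuffled, never trained on
    rng = random.Random(seed)                  # fixed seed so the split repeats
    rng.shuffle(head)                          # randomize only the first two-thirds
    half = len(head) // 2
    training, validation = head[:half], head[half:]
    return training, validation, applied

rows = list(range(3400))  # stand-in for ~3400 daily observations (hypothetical)
training, validation, applied = split_thirds(rows)
```

Because only the first two-thirds are shuffled, both training and validation draw from the whole early period, while the applied set stays strictly later in time. When the first two-thirds has an odd count, the two halves differ by one row.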

    The output (forecast) is the future SPY percent change over 1 day, 3 days, 5 days, 10 days, etc.
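A forward-looking target like this could be computed roughly as follows. This is a sketch under my own assumptions: the helper name `forward_pct_change` is hypothetical, the prices are made up, and the change is measured close to close, which the post does not specify.

```python
def forward_pct_change(closes, horizon):
    """Percent change from day t to day t + horizon; None where the future is unknown."""
    out = []
    for t in range(len(closes)):
        if t + horizon < len(closes):
            out.append((closes[t + horizon] / closes[t] - 1.0) * 100.0)
        else:
            out.append(None)  # not enough future data for this row
    return out

closes = [100.0, 101.0, 99.0, 102.0, 104.0]  # made-up prices, not real SPY data
targets_1d = forward_pct_change(closes, 1)
targets_3d = forward_pct_change(closes, 3)
```

The last `horizon` rows have no target, so they would need to be dropped before the data goes into any of the three sets.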

    For input variables I've calculated several moving averages, channels, gaps, prior day changes, daily range, some other indicators. Just inputs to let it chew on to help project the output.

    I'm thinking the randomizing of the training and validation data is to remove systematic biases in the data so that it's not training on just 93-98, validating on 99-03, but I'm not really sure why it needs the data broken up as above in the first place. In college I enjoyed my econometrics projects, but I didn't have to break the data down in 3 ways like this, so not sure.

    Part of me wonders if I'm using too long a time period, and if I should include more recent data in the training and validation set, but the manual actually recommends breaking data into thirds as mentioned above.

    thanks for the reply.
  4. Ok,

    It sounds like you have a solid approach - what manual are you using? The only times I've used the "breaking up of the data" into pieces was to do a walk forward optimization, i.e. you keep rolling forward the optimization process:

    n=1: Optimize on data from 1/1/2004-1/1/2005,
    test on data for 1/1/2005 - 2/1/2005.

    n=2: Optimize on data from 2/1/2004-2/1/2005,
    test on data for 2/1/2005 - 3/1/2005.

    n = j: etc...

    This is called a rolling optimization or a walk forward optimization. The utility of this is often misunderstood - my conclusion is that this type of optimization adds another "fit" parameter to your model, hence providing yet another way to over-fit your data set. In my experience, this type of optimization process isn't overwhelmingly sensitive to how you structure the segregation, i.e. it doesn't seem to matter if you use 1 year or 1 month lengths - but the ratio of optimization length to test length does matter. I've used 8-1 or greater, usually in years, with some success.

    In general, understand you are entering a dangerous area here - provided you designed your models correctly (with proper statistical research practices), the type of optimization you are doing can actually hurt your future results severely. Also, once you test the model on the future "unseen" data, you've introduced a serious bias into the model - that may or may not be a good thing.
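The rolling windows listed above (n=1, n=2, ...) can be generated with a short sketch like this one. The helper names `add_months` and `walk_forward_windows` are hypothetical, the dates are assumed to be first-of-month, and each end date is treated as exclusive, which is my assumption rather than anything stated in the post.

```python
from datetime import date

def add_months(d, months):
    """Shift a first-of-month date by a whole number of months."""
    idx = d.year * 12 + (d.month - 1) + months
    return date(idx // 12, idx % 12 + 1, 1)

def walk_forward_windows(start, train_months, test_months, steps):
    """Yield (train_start, train_end, test_start, test_end) for each step."""
    for n in range(steps):
        train_start = add_months(start, n * test_months)   # roll forward each step
        train_end = add_months(train_start, train_months)
        test_end = add_months(train_end, test_months)
        yield train_start, train_end, train_end, test_end  # test starts where training ends

# 12 months of optimization, 1 month of testing, first 3 steps
windows = list(walk_forward_windows(date(2004, 1, 1), 12, 1, 3))
```

With `train_months=12` and `test_months=1` this reproduces the n=1 and n=2 date ranges from the post. Changing the two lengths is how you would try a different ratio such as 8-1.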

  5. smili


  6. How many fields in your bit masks for testing?

    What size byte are you running (field population) as information set being tested?

    How frequently do you sweep the bytes with the bit masks in milliseconds?

    I'm not interested in forecasting; I'm just interested in how you do a process with both information and test criteria to produce an output at some frequency.
  7. smili


    I don't think I know enough about the software to answer your questions. Still trying to figure out what it's doing.
    - The few initial runs I've used included about 10-12 input fields
    - The program modifies some of the parameters as it sees fit as it's running, but the current run has a population of 543 with maximum size of 512K
  8. plodder


    Jack, I'd imagine that most models run at a higher level of abstraction than what you're mentioning here. Bit masks?
  9. During the 60 Minutes interview of Alan Greenspan on Sept. 16th, he said that economists are no better or worse at forecasting today than they were 50 years ago. I understand that predicting the future with better than 50/50 odds would be a very useful thing, but it will never, repeat never, happen no matter how much computational power we bring to bear.

    Of course predicting the future at better than 50/50 may not be required if your trading system has highly positive expectancy.
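To make the expectancy point concrete, here is a small sketch using the usual definition (win rate times average win, minus loss rate times average loss). The function name `expectancy` and the numbers are made up for illustration and are not from any real system.

```python
def expectancy(win_rate, avg_win, avg_loss):
    """Average profit per trade; avg_loss is given as a positive number."""
    return win_rate * avg_win - (1.0 - win_rate) * avg_loss

# Hypothetical system: right only 40% of the time, but winners are 3x the losers
per_trade = expectancy(0.40, 3.0, 1.0)
```

Here the system is wrong more often than it is right, yet the per-trade expectancy is still positive (about 0.6 units) because the average win is three times the average loss.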
  10. Thanks for your response.

    I leafed through the manual and the binary part appealed to me. It will not take True but I could use 1's and 0's. Pages 169 and 170 looked a little redundant.
    #10     Sep 23, 2007