Modeling question - genetic programming

Discussion in 'Automated Trading' started by smili, Sep 22, 2007.

  1. You should break the data into two sets, one for training and one for validation. Train the model on the first set, then validate (or test) it on the second. Compare the validation result against the training result to make sure the model is still valid.

    Then apply the model to future data, always comparing its results against the training performance to confirm that the model is still applicable.
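
    A minimal sketch of that split (the data and the fit/evaluate steps are placeholder assumptions, not anything specific from this thread):

    ```python
    # Toy train/validation split. The observations below are synthetic;
    # replace them with your own (features, target) pairs.
    import random

    random.seed(0)
    observations = [(i, random.gauss(0.0, 1.0)) for i in range(1000)]

    def split_data(data, train_fraction=0.7):
        """First portion trains the model; the remainder validates it."""
        cut = int(len(data) * train_fraction)
        return data[:cut], data[cut:]

    train_set, validation_set = split_data(observations)
    # Train on train_set only, then score both sets. If the validation
    # score is far below the training score, the model no longer
    # generalizes and should not be applied to future data.
    ```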
     
    #11     Sep 24, 2007
  2. A model with binary output might not be a profitable model. Even assuming the model is accurate (say, 90% correct about whether the market goes up or down), the profit on the winning days might be small while the losses on the wrong days might be huge.

    When you use any training software, the first step is to determine the right fitness criterion. Even the most popular choice, MSE, is often not a good one.
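
    A quick worked example of the point (the payoff numbers are invented purely for illustration):

    ```python
    # Expected P&L of a hypothetical 90%-accurate binary (up/down) model.
    win_rate = 0.90
    avg_win = 10.0    # small gain on each correct call
    avg_loss = 120.0  # large loss on each wrong call

    expected_pnl = win_rate * avg_win - (1 - win_rate) * avg_loss
    print(expected_pnl)  # -3.0: negative despite 90% accuracy
    ```

    This is also why MSE can mislead: it weights all errors equally, while a trading criterion should weight errors by what they cost.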
     
    #12     Sep 24, 2007
  3. Corey

    The reason you randomize two-thirds of the input is to guarantee you are not training on a time series, but rather on individual points. It also helps ensure the model is more robust. If you train in time order, the model is often overtrained toward the present and forgets the past (making it less robust, but often more accurate over short time periods).

    Well, for neural networks at least. "Genetic programming" is a generic buzzword that basically means the system evolves solutions in a way that parallels biological evolution, typically optimizing over a solution set using fitness functions.

    So why would randomizing input be important for genetic programming? For pretty much the same reason as above: to generate a more robust model. If you only ever compare genes generated during the same time period, you won't form a good long-term model. You are choosing long-term robustness over short-term accuracy.
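
    A rough sketch of that randomized split, assuming each observation is already a self-contained (features, target) example:

    ```python
    # Shuffle first, then take 2/3 for training, so no contiguous time
    # block dominates the training set.
    import random

    def randomized_split(observations, train_fraction=2 / 3, seed=42):
        shuffled = observations[:]  # copy; leave the original in time order
        random.Random(seed).shuffle(shuffled)
        cut = int(len(shuffled) * train_fraction)
        return shuffled[:cut], shuffled[cut:]

    data = list(range(300))  # stand-in for 300 dated observations
    train_set, holdout_set = randomized_split(data)
    # The training set now mixes points from the whole history, so the
    # fitness evaluation cannot overfit to the most recent period alone.
    ```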
     
    #13     Sep 24, 2007
  4. Kohanz

    Just a question out of curiosity on this topic, for the experts.

    My instinct tells me that if you are testing on a time series and you randomize the samples in it (I think the term "observations" was used earlier), then, assuming the time series is a single realization of a random process (possibly my mistake?), aren't you irreparably altering the statistics of that series and making it an invalid realization of the underlying random process?

    I'm thinking of this from an engineering background, but basically, if you take a signal and randomize its samples, you change the frequency content and characteristics of the data, and the new data set will be nothing like the old one. So although the initial data set is a good example of a financial time series, I don't see how a randomized version of that same series could still be a valid, representative financial time series.
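
    To make the concern concrete, here is a toy check (an AR(1)-style series invented for illustration) showing that shuffling wipes out serial structure:

    ```python
    # Shuffling a strongly autocorrelated toy series destroys its
    # lag-1 autocorrelation, i.e. its time structure.
    import random

    random.seed(1)
    x = [0.0]
    for _ in range(999):
        x.append(0.9 * x[-1] + random.gauss(0.0, 1.0))

    def lag1_autocorr(series):
        n = len(series)
        mean = sum(series) / n
        num = sum((series[i] - mean) * (series[i + 1] - mean) for i in range(n - 1))
        den = sum((v - mean) ** 2 for v in series)
        return num / den

    shuffled = x[:]
    random.shuffle(shuffled)
    print(lag1_autocorr(x))         # roughly 0.9
    print(lag1_autocorr(shuffled))  # roughly 0.0: time structure is gone
    ```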

    Any input would be appreciated, I'm just curious to understand this.
     
    #14     Sep 24, 2007
  5. Corey

    Your assumption would be correct ... if you WERE looking at it as a time series. Rather, most neural networks and genetic algorithms use single time snapshots whose data points already encode the time-series change. So instead of feeding in five separate data points in a row and getting a result, you feed in one data point that might comprise the rate of change over that period.

    By doing it this way, you avoid over-optimizing to a particular time period. You randomize over a long time span, but each input carries variables that describe the period it came from.
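
    A minimal sketch of that snapshot idea (the window size and the single rate-of-change feature are arbitrary choices for illustration):

    ```python
    # Each training example is one row whose features summarize recent
    # history, so the rows can be shuffled without losing time information.
    def make_snapshots(prices, lookback=5):
        rows = []
        for t in range(lookback, len(prices) - 1):
            window = prices[t - lookback:t + 1]
            rate_of_change = (window[-1] - window[0]) / window[0]
            target = 1 if prices[t + 1] > prices[t] else 0  # next-bar direction
            rows.append(([rate_of_change], target))
        return rows

    prices = [100 + i * 0.1 for i in range(50)]  # toy price series
    snapshots = make_snapshots(prices)
    # 'snapshots' can be shuffled freely: each row already carries the
    # time-series change that produced it.
    ```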

    But take my knowledge as opinion -- I am certainly no 'expert.'
     
    #15     Sep 24, 2007