Forecasting, a case study.

Discussion in 'Strategy Building' started by TheBigShort, Nov 25, 2024.

  1. TheBigShort

    Hello all, I'm looking for some help with forecasting. I've created a dataset of historical snapshots covering the past 9 years. I'm keeping the features and the target variable anonymous: the columns are labelled "feature1" through "feature10", with "target variable" as the final column.

    My goal is to learn statistical techniques from the group here and ideally build a decent model to predict the target. I can easily throw the data set into a multiple linear regression and get a very high R^2 right now, but I would like to improve my techniques and understand the data better.

    About the data set:
    The target variable seems to be a multiple of "feature2", so I've created a column that is simply target/feature2. This multiple is labelled "feature1" and has historically stayed between 3.64 and 5.89. I have a strong belief that going forward, no matter what the forecast is, the target variable should land between 3.5x feature2 and 6x feature2, so some shrinkage toward that band may be needed.
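
    A minimal sketch of that shrinkage idea in R (raw_pred is just a placeholder for whatever the unconstrained model predicts):

    Code:
    # clip an unconstrained forecast into the believed 3.5x-6x feature2 band
    shrink_to_band <- function(raw_pred, feature2, lo = 3.5, hi = 6) {
      pmin(pmax(raw_pred, lo * feature2), hi * feature2)
    }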

    Features 5, 6, and 7 are moving averages of other variables in the data set, so there is strong autocorrelation plus NA values. As well, feature8 and feature9 are two different measurements of volatility, so those will be highly correlated.

    All of the features were hand-picked by me because I thought they would have some correlation with the target variable, so you will find that the correlation matrix against the TV has mostly high positive values.

    Lastly, all of this data is for one single stock.

    One idea I had was to forecast the "multiple" (feature1) separately, i.e. have two models.

    Tomorrow I will post a more in-depth EDA. Any input from you would be helpful as I work through it. Thank you.
    Attached is the data set.
     
    spy, Sekiyo and Baron like this.
  2. Kevin Schmit

    Code:
    > summary(lm(dv[1:35] ~ 0 + as.matrix(dfET2[1:35,-c(1,8,3,6,4,11,9,5)])))
    
    Call:
    lm(formula = dv[1:35] ~ 0 + as.matrix(dfET2[1:35, -c(1, 8, 3,
      6, 4, 11, 9, 5)]))
    
    Residuals:
      Min  1Q  Median  3Q  Max
    -25.415  -9.510  -1.980  9.224  37.178
    
    Coefficients:
      Estimate Std. Error t value Pr(>|t|)
    as.matrix(dfET2[1:35, -c(1, 8, 3, 6, 4, 11, 9, 5)])feature2  3.3575  0.1544  21.745  < 2e-16 ***
    as.matrix(dfET2[1:35, -c(1, 8, 3, 6, 4, 11, 9, 5)])feature7  602.3034  72.6602  8.289 6.74e-09 ***
    as.matrix(dfET2[1:35, -c(1, 8, 3, 6, 4, 11, 9, 5)])feature10  5.3506  2.9152  1.835  0.0775 .
    ---
    Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
    
    Residual standard error: 15.68 on 27 degrees of freedom
      (5 observations deleted due to missingness)
    Multiple R-squared:  0.995,  Adjusted R-squared:  0.9945
    F-statistic:  1808 on 3 and 27 DF,  p-value: < 2.2e-16
    
    R^2 of .99+ and F-stat of 1808! Too good to be true. You've got some sort of dependent variable leakage into your design matrix.


    You've got 35 rows, that would be 8.75 years of quarterly data. Is that right?

    See above, you've already got a near perfect model using Features 2, 7, and 10, no intercept. If you use Little & Rubin (see attached png) for the missing rows in Feature7, it is literally perfect -- R^2 of 1. So there is probably some leakage from lhs to rhs. You need to see if you can find that leakage.

    You can't do that. Feature1 is just target/feature2, so putting it on the rhs moves your dv into the design matrix. Discard Feature1.

    Feature4 is just your DV lagged by one period. Including lags of the dependent variable (dv) alongside the other regressors in your design matrix will bias the coefficients; model the autocorrelation in the dv separately. Besides, in a model with the other regressors it adds no predictive power. Discard Feature4.
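
    If you want that autocorrelation handled without putting lags of the dv on the rhs, one standard route is a regression with ARMA errors, e.g. (sketch only; X stands for a numeric matrix of whichever exogenous features you keep):

    Code:
    # regression on the exogenous features with AR(1) errors,
    # instead of a lagged dv inside the design matrix
    fit <- arima(dv, order = c(1, 0, 0), xreg = X)
    fit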

    Feature5 is a 4-period MA of Feature4. Discard it for the same reasons you discarded Feature4.

    Features 6 and 7 are moving averages of Feature2, I think. F6 is definitely a 6-period MA of F2. F7 is 95% correlated with F6, so it is probably something similar. They are so correlated that you only need one of them in your model at this exploratory stage, since keeping both will make the standardized betas harder to interpret. Keep F7 and discard F6.
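
    That collinearity is easy to confirm directly, ignoring the NA rows the moving averages create (same dfET2 object as in the summary above):

    Code:
    # correlation between the two moving-average columns, NA rows dropped pairwise
    cor(dfET2$feature6, dfET2$feature7, use = "pairwise.complete.obs")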

    Features 8 and 9, the two volatility measures, are reasonable to try, as even quarterly data respond to vol regimes. But they don't contribute to the fit, so discard them.

    Not sure what Feature10 is, but it improves the fit, so keep it.

    Your biggest problem is that the data predict your dv too well. I suspect there is some contamination in Feature 2. A univariate model with no intercept has an R^2 of .98 -- unrealistically high.
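
    One quick way to hunt for that contamination is to fit the dv against each numeric column on its own and see which single features already give an implausibly high fit (a sketch using the same dv and dfET2 objects as above):

    Code:
    # univariate no-intercept fits; any single column with R^2 near 1 is a leakage suspect
    sapply(names(Filter(is.numeric, dfET2)), function(col)
      summary(lm(dv ~ 0 + dfET2[[col]]))$r.squared)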



    Little, R. J. A. & Rubin, D. B., Statistical Analysis with Missing Data
    1st ed.: pp. 112-119
    2nd ed.: pp. 148-152
    3rd ed.: pp. 166-169, 275-276

    Sweep_for_Monotone_Missing.png
     
    -Orion and poopy like this.
  3. TheBigShort

    feature1 x feature2 is equal to the target, so that is why R^2 is practically 1. I put feature1 in there as a shrinkage factor, i.e. if our forecast falls outside the band implied by feature1 x feature2, shrink the prediction into the band.
    Event vol is my target, FWIW. These are quarterly data snapshots taken pre-event.

    I lagged the DV by 1 as a predictor (feature4) because of the autocorrelation. I was getting 3-4 lagged periods as significant across multiple tickers, which is why I also added the SMA4 of the lagged event vols. I'm doing this across multiple stocks to create a scanner, so do you still think I need to model the autocorrelation separately?

    Yes f7 is an EMA while f6 is an SMA.

    f2 is the non-event vol on the 30-day. I subtract the event vol from the total iv30d using this formula:

    Code:
    # IM is the 1-day implied move, ivol is the total vol on the 30-day, and dte is set to 30.
    # e.g. 0.10, 0.50, 30 could be inputs
    nonEVvol <- function(IM, ivol, dte){
      # E|move| = sigma * sqrt(1/365) * sqrt(2/pi), so IM/constant annualizes the event vol
      constant <- sqrt(1/365) * sqrt(2/pi)
      # subtract the event variance, spread over dte days, from the total variance
      v1 <- sqrt(ivol^2 - (IM/constant)^2 / dte)
      return(v1)
    }
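
    With the example inputs from the comment, this gives roughly:

    Code:
    nonEVvol(0.10, 0.50, 30)   # ~0.243: the 30-day vol net of a 10% one-day implied move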
    
    
    
    
    So I think that is likely why we see a high correlation between the two variables. When someone buys an option they are buying both the event and the non-event vol, and only a few participants are trying to separate the two by trading calendars. What do you think about that?

    I spent time analyzing the data with ChatGPT yesterday and also came to the conclusion of removing f6 (SMA) and keeping f7 (EMA). I also dropped the rvol columns, as they are not nearly as predictive as the non-event ivol (f2). f1 was also dropped.

    f10 is the sign of the last move (+/-).
    I got up to a 75% R^2 with a linear model, still unrealistically high, but I also don't think it should be too difficult to forecast what event vol should be on the day prior to the event, given how the market has historically priced these.

    ChatGPT somehow got an R^2 of 90% with a random forest, which I couldn't replicate in R even with tuning. I also don't think a random forest is ideal here with only 36 rows (is that true?). Still, I do want to capture some of the non-linear relationships in the data set.
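
    For what it's worth, the out-of-bag fit from the randomForest package is one way to sanity-check that 90% figure on so few rows (a sketch; it assumes the cleaned frame is called ETdata with TargetVariable as the dv):

    Code:
    library(randomForest)
    set.seed(1)
    # fit on complete rows only; with ~36 rows the OOB figure will bounce around a lot
    rf <- randomForest(TargetVariable ~ ., data = na.omit(ETdata), ntree = 500)
    tail(rf$rsq, 1)   # out-of-bag pseudo R-squared after the final tree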

    Kevin, I'm going to post an updated data set with the variables you mentioned removed, and I'll PM you the column headers as well. I just have to run out right now.
     
    spy, -Orion and poopy like this.
  4. TheBigShort

    @Kevin Schmit here is the new data frame. I've PM'd you the column headers. Note: the two dataframes are not for the same ticker; I couldn't recall which ticker I used for the one I sent.
     
    Last edited: Nov 27, 2024
  5. TheBigShort

    Why is it that when I set the intercept to 0, my R^2 shoots up from .65 to .99? Looking at the residuals of the model lm(DV ~ 0 + ., data = ETdata), there is still quite a bit of variance, with errors ranging from +30 to -30. That doesn't seem consistent with an R^2 of .99. I'm certain there is no spillover from other columns into the DV; the features that are moving averages are built from lagged variables, so there is no contamination with the corresponding DV.

    kev.png
     
    Last edited: Nov 27, 2024
  6. Kevin Schmit

    That is an excellent question! I wondered that myself but didn't have time to investigate. It does seem very strange. I will look into this and see if I can get to the bottom of it.

    Edit:

    This stack exchange post explains it:

    https://stats.stackexchange.com/que...pt-term-increases-r2-in-linear-mo/26205#26205

    See the first answer in that post.

    I think that I missed this all these years working with lm because I am almost always using centered or close to centered dv's and iv's -- where the intercept is very close to zero anyway.
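
    In short, with no intercept R computes R^2 against zero rather than against the mean of the dv, so a dv that sits far from zero inflates the figure. A small sketch on made-up data (nothing here uses the actual file) reproduces the jump:

    Code:
    # without an intercept, R uses 1 - SSE/sum(y^2) instead of 1 - SSE/sum((y - mean(y))^2),
    # so a dv far from zero makes the denominator, and hence R^2, balloon
    set.seed(42)
    x <- rnorm(35, mean = 50, sd = 5)       # regressor with a large mean
    y <- 100 + 2 * x + rnorm(35, sd = 20)   # dv far from zero, noisy signal
    summary(lm(y ~ x))$r.squared            # modest fit measured around mean(y)
    summary(lm(y ~ 0 + x))$r.squared        # jumps toward 1, same underlying noise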

    In light of the information in the SE post referenced above, I will re-do my original answer in this thread, but with your latest posted file.
     
    Last edited: Nov 27, 2024
    poopy likes this.
  7. ph1l

    One way to do this is through genetic programming. For example, I ran the second ETdata.csv you posted in a genetic programming rules generator to forecast TargetVariable. The mean squared error was 156.497, and the mean absolute error was 9.53354. The rules generated were:
    Code:
    R0  = 9.35595 * feature9
    R0  = feature9 + R0
    if  defined ( feature7 )    R0  = R0 + feature6
    R0  = R0 + feature2
    R1  = 7.24216 * feature5
    R0  = R0 ** 1.02118
    R0  = R0 / 0.957456
    R0  = R0 - feature1
    R1  = R1 / 0.137156
    R0  = R0 + R1
    R0  = R0 + R1
    R0  = R0 + feature4
    R0  = R0 + feature4
    R0  = R0 + feature2
    R0  = feature9 + R0
    R0  = R0 - feature1
    R0  = R0 + feature2
    if  ! defined ( feature8 )    R0  = 3.33795 * feature3
    R0  = R0 - feature1
    
    with the predicted value left in R0 after the instructions run, using a row with values feature1 through feature9 as input.
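
    For anyone who wants to run those instructions directly, here is a literal R translation offered as a sketch (it reads ** as exponentiation, treats "defined" as "not NA", and takes one row of the data frame with columns feature1 through feature9):

    Code:
    # literal translation of the generated rule list above
    gp_predict <- function(row) {
      with(row, {
        R0 <- 9.35595 * feature9
        R0 <- feature9 + R0
        if (!is.na(feature7)) R0 <- R0 + feature6
        R0 <- R0 + feature2
        R1 <- 7.24216 * feature5
        R0 <- R0 ^ 1.02118
        R0 <- R0 / 0.957456
        R0 <- R0 - feature1
        R1 <- R1 / 0.137156
        R0 <- R0 + R1
        R0 <- R0 + R1
        R0 <- R0 + feature4
        R0 <- R0 + feature4
        R0 <- R0 + feature2
        R0 <- feature9 + R0
        R0 <- R0 - feature1
        R0 <- R0 + feature2
        if (is.na(feature8)) R0 <- 3.33795 * feature3
        R0 <- R0 - feature1
        R0   # predicted TargetVariable
      })
    }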
     
    TheBigShort likes this.
  8. TheBigShort

    Thanks for taking the time, Kevin. I think it makes sense to set the intercept to 0 in this case, since when the predictors are 0 we would expect the DV to be 0 as well.
     
  9. TheBigShort

    Kevin, after thinking a bit more about the data, I added a new feature, nonEvol - spyATMiv, to get a sort of idiosyncratic vol for the stock. The correlation between nonEvol and the new feature is 70%, but they are both highly correlated with the DV.
    When I take the average of these two variables, the R^2 is better than either of them individually.
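
    For concreteness, the construction described above looks something like this (a sketch; df, nonEvol, and spyATMiv stand in for the actual frame and column headers):

    Code:
    # idiosyncratic vol proxy and the simple average of the two correlated features
    df$idioVol <- df$nonEvol - df$spyATMiv
    df$combo   <- (df$nonEvol + df$idioVol) / 2
    summary(lm(TargetVariable ~ combo, data = df))$r.squared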

    My question is: when you have two important features with high correlation, should I just combine them into a new variable and drop the two individual ones?
     
    spy likes this.
  10. spy

    I'm still working on these examples one by one. Can I get back to you when I'm done?
     
    #10     Nov 29, 2024