Forecasting, a case study.

Discussion in 'Strategy Building' started by TheBigShort, Nov 25, 2024.

  1. TheBigShort

    Hello all, I'm looking for some help with forecasting. I've created a dataset of historical snapshots covering the past 9 years. I'm keeping the features and the target variable anonymous: the columns are labelled "feature1" through "feature10", with "target variable" as the final column.

    My goal is to learn statistical techniques from the group here and ideally build a decent model to predict the target. I can easily throw the data set into a multiple linear regression and get a very high R^2 right now, but I would like to improve my techniques and understand the data better.

    About the data set:
    The target variable seems to be a multiple of "feature2", so I've created a column that is simply target/feature2. This multiple is labelled "feature1" and has historically stayed between 3.64 and 5.89. I have a strong belief that going forward, no matter what the forecast is, the target variable should land between 3.5x feature2 and 6x feature2, so some shrinkage toward that band may be needed.
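
    A minimal sketch of that shrinkage idea in R (raw_pred is just a placeholder for whatever the unconstrained model predicts):

    Code:
    # clip an unconstrained forecast into the believed 3.5x-6x feature2 band
    shrink_to_band <- function(raw_pred, feature2, lo = 3.5, hi = 6) {
      pmin(pmax(raw_pred, lo * feature2), hi * feature2)
    }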

    Features 5, 6, and 7 are moving averages of other variables in the data set, so there is strong autocorrelation plus NA values. As well, feature8 and feature9 are two different measurements of volatility, so those will be highly correlated.

    All of the features were hand-picked by me because I thought they would have some correlation with the target variable, so you will find that the correlation matrix against the TV has mostly high positive values.

    Lastly, all of this data is for one single stock.

    One idea I had was to forecast the "multiple" (feature1) separately, i.e. have two models.

    Tomorrow I will post a more in-depth EDA. Any input from you would be helpful as I work through it. Thank you.
    Attached is the data set.
     
    spy, Sekiyo and Baron like this.
  2. Kevin Schmit

    Code:
    > summary(lm(dv[1:35] ~ 0 + as.matrix(dfET2[1:35,-c(1,8,3,6,4,11,9,5)])))
    
    Call:
    lm(formula = dv[1:35] ~ 0 + as.matrix(dfET2[1:35, -c(1, 8, 3,
      6, 4, 11, 9, 5)]))
    
    Residuals:
      Min  1Q  Median  3Q  Max
    -25.415  -9.510  -1.980  9.224  37.178
    
    Coefficients:
      Estimate Std. Error t value Pr(>|t|)
    as.matrix(dfET2[1:35, -c(1, 8, 3, 6, 4, 11, 9, 5)])feature2  3.3575  0.1544  21.745  < 2e-16 ***
    as.matrix(dfET2[1:35, -c(1, 8, 3, 6, 4, 11, 9, 5)])feature7  602.3034  72.6602  8.289 6.74e-09 ***
    as.matrix(dfET2[1:35, -c(1, 8, 3, 6, 4, 11, 9, 5)])feature10  5.3506  2.9152  1.835  0.0775 .
    ---
    Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
    
    Residual standard error: 15.68 on 27 degrees of freedom
      (5 observations deleted due to missingness)
    Multiple R-squared:  0.995,  Adjusted R-squared:  0.9945
    F-statistic:  1808 on 3 and 27 DF,  p-value: < 2.2e-16
    
    R^2 of .99+ and F-stat of 1808! Too good to be true. You've got some sort of dependent variable leakage into your design matrix.


    You've got 35 rows, that would be 8.75 years of quarterly data. Is that right?

    See above, you've already got a near perfect model using Features 2, 7, and 10, no intercept. If you use Little & Rubin (see attached png) for the missing rows in Feature7, it is literally perfect -- R^2 of 1. So there is probably some leakage from lhs to rhs. You need to see if you can find that leakage.

    You can't do that. Feature1 is just target/feature2, so putting it on the rhs moves your dv into the design matrix. Discard Feature1.

    Feature4 is just your DV lagged by one period. Including lags of the dependent variable (dv) alongside the other regressors in your design matrix will bias the coefficients; model the autocorrelation in the dv separately. Besides, in a model with the other regressors it adds no predictive power. Discard Feature4.
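
    If you want that autocorrelation handled without putting lags of the dv on the rhs, one standard route is a regression with ARMA errors, e.g. (sketch only; X stands for a numeric matrix of whichever exogenous features you keep):

    Code:
    # regression on the exogenous features with AR(1) errors,
    # instead of a lagged dv inside the design matrix
    fit <- arima(dv, order = c(1, 0, 0), xreg = X)
    fit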

    Feature5 is a 4-period MA of Feature4. Discard it for the same reasons you discarded Feature4.

    Features 6 and 7 are moving averages of Feature2, I think. F6 is definitely a 6-period MA of F2. F7 is 95% correlated with F6, so it is probably something similar. They are so correlated that you only need one of them in your model at this exploratory stage, since keeping both will make the standardized betas harder to interpret. Keep F7 and discard F6.
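
    That collinearity is easy to confirm directly, ignoring the NA rows the moving averages create (same dfET2 object as in the summary above):

    Code:
    # correlation between the two moving-average columns, NA rows dropped pairwise
    cor(dfET2$feature6, dfET2$feature7, use = "pairwise.complete.obs")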

    Features 8 and 9, the two volatility measures, are reasonable to try, as even quarterly data respond to vol regimes. But they don't contribute to the fit, so discard them.

    Not sure what Feature10 is, but it improves the fit, so keep it.

    Your biggest problem is that the data predict your dv too well. I suspect there is some contamination in Feature 2. A univariate model with no intercept has an R^2 of .98 -- unrealistically high.
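
    One quick way to hunt for that contamination is to fit the dv against each numeric column on its own and see which single features already give an implausibly high fit (a sketch using the same dv and dfET2 objects as above):

    Code:
    # univariate no-intercept fits; any single column with R^2 near 1 is a leakage suspect
    sapply(names(Filter(is.numeric, dfET2)), function(col)
      summary(lm(dv ~ 0 + dfET2[[col]]))$r.squared)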



    Little, R. J. A. & Rubin, D. B., Statistical Analysis with Missing Data
    1st ed.: pp. 112-119
    2nd ed.: pp. 148-152
    3rd ed.: pp. 166-169, 275-276

    Sweep_for_Monotone_Missing.png
     
    -Orion and poopy like this.
  3. TheBigShort

    feature1 x feature2 is equal to the target, so that is why R^2 is practically 1. I put feature1 in there as a shrinkage factor, i.e. if our forecast falls outside the band implied by feature1 x feature2, shrink the prediction into the band.
    Event vol is my target, FWIW. These are quarterly data snapshots taken pre-event.

    I lagged the DV by 1 as a predictor (feature4) because of the autocorrelation. I was getting 3-4 lagged periods as significant across multiple tickers, which is why I also added the SMA4 of the lagged event vols. I'm doing this across multiple stocks to create a scanner, so do you still think I need to model the autocorrelation separately?

    Yes f7 is an EMA while f6 is an SMA.

    f2 is the non-event vol on the 30-day. I subtract the event vol from the total iv30d using this formula:

    Code:
    # IM is the 1-day implied move, ivol is the total vol on the 30-day, and dte is set to 30.
    # e.g. 0.10, 0.50, 30 could be inputs
    nonEVvol <- function(IM, ivol, dte){
      # E|move| = sigma * sqrt(1/365) * sqrt(2/pi), so IM/constant annualizes the event vol
      constant <- sqrt(1/365) * sqrt(2/pi)
      # subtract the event variance, spread over dte days, from the total variance
      v1 <- sqrt(ivol^2 - (IM/constant)^2 / dte)
      return(v1)
    }
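
    With the example inputs from the comment, this gives roughly:

    Code:
    nonEVvol(0.10, 0.50, 30)   # ~0.243: the 30-day vol net of a 10% one-day implied move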
    
    
    
    
    So I think that is likely why we see a high correlation between the two variables. When someone buys an option they are buying both the event and the non-event vol, and only a few participants are trying to separate the two by trading calendars. What do you think about that?

    I spent time analyzing the data with ChatGPT yesterday and also came to the conclusion of removing f6 (SMA) and keeping f7 (EMA). I also dropped the rvol columns, as they are not nearly as predictive as the non-event ivol (f2). f1 was also dropped.

    f10 is the sign of the last move (+/-).
    I got up to a 75% R^2 with a linear model, still unrealistically high, but I also don't think it should be too difficult to forecast what event vol should be on the day prior to the event, given how the market has historically priced these.

    ChatGPT somehow got an R^2 of 90% with a random forest, which I couldn't replicate in R even with tuning. I also don't think a random forest is ideal here with only 36 rows (is that true?). Still, I do want to capture some of the non-linear relationships in the data set.
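
    For what it's worth, the out-of-bag fit from the randomForest package is one way to sanity-check that 90% figure on so few rows (a sketch; it assumes the cleaned frame is called ETdata with TargetVariable as the dv):

    Code:
    library(randomForest)
    set.seed(1)
    # fit on complete rows only; with ~36 rows the OOB figure will bounce around a lot
    rf <- randomForest(TargetVariable ~ ., data = na.omit(ETdata), ntree = 500)
    tail(rf$rsq, 1)   # out-of-bag pseudo R-squared after the final tree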

    Kevin, I'm going to post an updated data set with the variables you mentioned removed, and I'll PM you the column headers as well. I just have to run out right now.
     
    spy, -Orion and poopy like this.
  4. TheBigShort

    @Kevin Schmit here is the new data frame. I've PM'd you the column headers. Note: the two dataframes are not for the same ticker; I couldn't recall which ticker I used for the one I sent.
     
    Last edited: Nov 27, 2024
  5. TheBigShort

    Why is it that when I set the intercept to 0, my R^2 shoots up from .65 to .99? Looking at the residuals of the model lm(DV ~ 0 + ., data = ETdata), there is still quite a bit of variance, with errors ranging from +30 to -30. That doesn't seem consistent with an R^2 of .99. I'm certain there is no spillover from other columns into the DV; the features that are moving averages are built from lagged variables, so there is no contamination with the corresponding DV.

    kev.png
     
    Last edited: Nov 27, 2024
  6. Kevin Schmit

    That is an excellent question! I wondered that myself but didn't have time to investigate. It does seem very strange. I will look into this and see if I can get to the bottom of it.

    Edit:

    This stack exchange post explains it:

    https://stats.stackexchange.com/que...pt-term-increases-r2-in-linear-mo/26205#26205

    See the first answer in that post.

    I think that I missed this all these years working with lm because I am almost always using centered or close to centered dv's and iv's -- where the intercept is very close to zero anyway.
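
    In short, with no intercept R computes R^2 against zero rather than against the mean of the dv, so a dv that sits far from zero inflates the figure. A small sketch on made-up data (nothing here uses the actual file) reproduces the jump:

    Code:
    # without an intercept, R uses 1 - SSE/sum(y^2) instead of 1 - SSE/sum((y - mean(y))^2),
    # so a dv far from zero makes the denominator, and hence R^2, balloon
    set.seed(42)
    x <- rnorm(35, mean = 50, sd = 5)       # regressor with a large mean
    y <- 100 + 2 * x + rnorm(35, sd = 20)   # dv far from zero, noisy signal
    summary(lm(y ~ x))$r.squared            # modest fit measured around mean(y)
    summary(lm(y ~ 0 + x))$r.squared        # jumps toward 1, same underlying noise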

    In light of the information in the SE post referenced above, I will re-do my original answer in this thread, but with your latest posted file.
     
    Last edited: Nov 27, 2024
    poopy likes this.
  7. ph1l

    One way to do this is through genetic programming. For example, I ran the second ETdata.csv you posted in a genetic programming rules generator to forecast TargetVariable. The mean squared error was 156.497, and the mean absolute error was 9.53354. The rules generated were:
    Code:
    R0  = 9.35595 * feature9
    R0  = feature9 + R0
    if  defined ( feature7 )    R0  = R0 + feature6
    R0  = R0 + feature2
    R1  = 7.24216 * feature5
    R0  = R0 ** 1.02118
    R0  = R0 / 0.957456
    R0  = R0 - feature1
    R1  = R1 / 0.137156
    R0  = R0 + R1
    R0  = R0 + R1
    R0  = R0 + feature4
    R0  = R0 + feature4
    R0  = R0 + feature2
    R0  = feature9 + R0
    R0  = R0 - feature1
    R0  = R0 + feature2
    if  ! defined ( feature8 )    R0  = 3.33795 * feature3
    R0  = R0 - feature1
    
    with the predicted value left in R0 after the instructions run, using a row with values feature1 through feature9 as input.
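
    For anyone who wants to run those instructions directly, here is a literal R translation offered as a sketch (it reads ** as exponentiation, treats "defined" as "not NA", and takes one row of the data frame with columns feature1 through feature9):

    Code:
    # literal translation of the generated rule list above
    gp_predict <- function(row) {
      with(row, {
        R0 <- 9.35595 * feature9
        R0 <- feature9 + R0
        if (!is.na(feature7)) R0 <- R0 + feature6
        R0 <- R0 + feature2
        R1 <- 7.24216 * feature5
        R0 <- R0 ^ 1.02118
        R0 <- R0 / 0.957456
        R0 <- R0 - feature1
        R1 <- R1 / 0.137156
        R0 <- R0 + R1
        R0 <- R0 + R1
        R0 <- R0 + feature4
        R0 <- R0 + feature4
        R0 <- R0 + feature2
        R0 <- feature9 + R0
        R0 <- R0 - feature1
        R0 <- R0 + feature2
        if (is.na(feature8)) R0 <- 3.33795 * feature3
        R0 <- R0 - feature1
        R0   # predicted TargetVariable
      })
    }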
     
    TheBigShort likes this.
  8. TheBigShort

    Thanks for taking the time, Kevin. I think it makes sense to set the intercept to 0 in this case, since when the predictors are 0 we would expect the DV to be 0 as well.
     
  9. TheBigShort

    Kevin, after thinking a bit more about the data, I added a new feature, nonEvol - spyATMiv, to get a sort of idiosyncratic vol for the stock. The correlation between nonEvol and the new feature is 70%, but they are both highly correlated with the DV.
    When I take the average of these two variables, the R^2 is better than either of them individually.
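
    For concreteness, the construction described above looks something like this (a sketch; df, nonEvol, and spyATMiv stand in for the actual frame and column headers):

    Code:
    # idiosyncratic vol proxy and the simple average of the two correlated features
    df$idioVol <- df$nonEvol - df$spyATMiv
    df$combo   <- (df$nonEvol + df$idioVol) / 2
    summary(lm(TargetVariable ~ combo, data = df))$r.squared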

    My question is: when you have two important features with high correlation, should I just combine them into a new variable and drop the two individual ones?
     
    spy likes this.
  10. spy

    I'm still working on these examples one by one. Can I get back to you when I'm done?
     
    #10     Nov 29, 2024