Here are some xgboost examples. Looks like feature 9 is impactful. Seems to be popular. next stop $town

Code:
import pandas as pd
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
import xgboost as xgb

df = pd.read_csv('etdata.csv')
tv = df['TargetVariable']
dv = df[['feature1', 'feature2', 'feature3', 'feature4', 'feature5',
         'feature6', 'feature7', 'feature8', 'feature9']]

# xgboost regression
X_train, X_test, y_train, y_test = train_test_split(dv, tv, test_size=0.2)
my_model = xgb.XGBRegressor(n_estimators=1000, learning_rate=0.05)
my_model.fit(X_train, y_train, eval_set=[(X_test, y_test)], verbose=False)
predictions = my_model.predict(X_test)
mean_absolute_error(y_test, predictions)
# 17.57801

my_model.feature_importances_
# array([0.0062262 , 0.00931156, 0.06587803, 0.11172969, 0.00268821,
#        0.11810362, 0.04209794, 0.00188226, 0.6420825 ], dtype=float32)

# xgboost random forest
rfmy_model = xgb.XGBRFRegressor(n_estimators=1000, learning_rate=0.05)
rfmy_model.fit(X_train, y_train)
rfpredictions = rfmy_model.predict(X_test)
mean_absolute_error(y_test, rfpredictions)
# 27.888
You'd expect the intercept to be zero when both the IVs and the DV are centered -- that is, when the means are all zero or near zero. The intercept can, by chance, be zero when neither is centered, but that will be rare. It would be a good idea to center all your variables, both left and right hand sides. Put them on the same scale as well, since this will make interpreting the beta coefficients much easier. If you normalize all your variables (subtract the expected mean and divide by the expected standard deviation), your covariance matrix will be approximately the same as your correlation matrix, which will make things much easier.

Your dependent variable shows a strong up trend. Any IV that also has a trend, either up or down, will predict it. Conversely, your IVs, as they are now, will predict any trend. You can verify this by substituting the vector 1:36 into your regression in place of your DV. I recommend pseudo-differencing (Cochrane in "Asset Pricing") your DV and also feature9. This will reduce your R^2, but most of the apparent R^2 is bogus -- the model appears to be predicting the DV but it is really just predicting the row number.

Here is your best model after the differencing; it includes features 2, 4, 5, and 9:

Code:
lm(formula = dv2 ~ mET[, -c(1, 3, 6, 7, 8)])

                                  Estimate Std. Error t value Pr(>|t|)
(Intercept)                       -58.9421    17.4689  -3.374  0.00206 **
mET[, -c(1, 3, 6, 7, 8)]feature2    4.0045     2.2424   1.786  0.08425 .
mET[, -c(1, 3, 6, 7, 8)]feature4    1.5586     0.6099   2.555  0.01592 *
mET[, -c(1, 3, 6, 7, 8)]feature5  112.0528    44.4609   2.520  0.01728 *
mET[, -c(1, 3, 6, 7, 8)]feature9   -0.2499     0.1604  -1.558  0.12980

Multiple R-squared: 0.3719, Adjusted R-squared: 0.2881
F-statistic: 4.44 on 4 and 30 DF, p-value: 0.006148

The adjusted R^2 works out to a cor of .53675 and an estimated percent wins of cor2FSS(.53675) = .6834 (assuming elliptical). Full cross validation is suspect with only 35 rows, but 35 folds of leave-3-out CV yields the following:

Code:
frcstMtrcs(YHB(dv2[-1], cbind(1, mET[-1, c(2, 4, 5, 9)]), wlen = 3), dv2[-1])

       Shp       ShpD       ShpB     ShpT10      WinPC  WinPCT10    WinPCT5    WinPCT2        Cor        RSq
 5.0959652  5.2744793  5.2191296 16.4759000  0.7143000 0.7500000  1.0000000  1.0000000  0.4312000  0.1859334

R^2 has dropped to .186, but that is excellent for "out of sample" results. Also note that the percent wins at .714 is higher than the predicted .637. Also note the Sharpe in the 10% tails.

Here is Cochrane's pseudoDiff function (as posted, the original allocated X with a single column and regressed the same vector on every loop pass; the column indexing below fixes that):

Code:
pseudoDiff <- function(x, lorder = 1, transform = "none") {
    x <- as.matrix(x)
    n <- nrow(x)
    X <- matrix(NA, nrow = n - lorder, ncol = ncol(x))
    if (transform == "log") {
        x <- log(x)
    } else if (transform == "rlog") {
        x <- rlog(x)
    } else if (transform == "ihs") {
        x <- ihs(x)
    }
    if (lorder == 1) {
        xl <- x[1:(n - 1), ]
    } else {
        xl <- getLags(x, lorder)
    }
    for (i in 1:ncol(x)) {
        lmod <- lm(x[(lorder + 1):n, i] ~ xl)
        X[, i] <- lmod$resid
    }
    return(X)
}
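If you want to try the pseudo-differencing idea in Python instead of R, here's a minimal sketch (my own code, not the R function above; the function name and details are assumptions): regress each series on its own lag, with an intercept, and keep the OLS residuals. That strips out the trend that makes the in-sample R^2 look better than it is.

```python
import numpy as np

def pseudo_diff(x, lorder=1):
    """Residuals of x[t] regressed (with intercept) on x[t - lorder], per column.

    Instead of a plain difference x[t] - x[t-1], this subtracts the
    fitted a + rho * x[t - lorder], so a near-unit-root trend is removed
    without forcing rho = 1.
    """
    x = np.asarray(x, dtype=float)
    if x.ndim == 1:
        x = x[:, None]
    n, k = x.shape
    out = np.empty((n - lorder, k))
    for i in range(k):
        y = x[lorder:, i]
        lagged = np.column_stack([np.ones(n - lorder), x[:-lorder, i]])
        beta, *_ = np.linalg.lstsq(lagged, y, rcond=None)
        out[:, i] = y - lagged @ beta
    return out

# A trending 36-row series, like the DV discussed above:
t = np.arange(36.0)
trended = 3.0 * t + np.sin(t)
resid = pseudo_diff(trended)
print(resid.shape)  # (35, 1) -- one row lost to the lag
```

Residuals from an OLS fit with an intercept are mean-zero by construction, which is a quick sanity check that the trend component is gone.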
Depends on what you are using the regression for: realtime forecasting or exploratory data analysis.

For realtime forecasting, keep both, so that if one is missing the other can take up the slack -- sweep (Goodnight) out the missing column to adjust the coefficients on the others. If the two reliably appear with high offsetting loadings in an eigenvector with very low DV loadings, adjust your design matrix accordingly.

For exploratory data analysis: set the two variables on the same approximate scale and approximately equal means. Then replace the two by (a + b) / 2 and (a - b). Rescale. If including (a - b) in your model increases adjusted R^2, keep it. Otherwise discard it.
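To make the (a + b) / 2 and (a - b) recipe concrete, here's a small numpy sketch on synthetic data (the adj_r2 helper and all the names are mine, not from the thread): build the rotated pair, standardize, and keep the difference term only if it pays for its degree of freedom in adjusted R^2.

```python
import numpy as np

def adj_r2(X, y):
    """Adjusted R^2 of an OLS fit of y on X plus an intercept."""
    n = len(y)
    Xd = np.column_stack([np.ones(n), X])
    beta, *_ = np.linalg.lstsq(Xd, y, rcond=None)
    resid = y - Xd @ beta
    r2 = 1.0 - (resid @ resid) / ((y - y.mean()) ** 2).sum()
    p = Xd.shape[1] - 1  # predictors, excluding the intercept
    return 1.0 - (1.0 - r2) * (n - 1) / (n - p - 1)

rng = np.random.default_rng(0)
n = 200
a = rng.normal(size=n)
b = a + 0.1 * rng.normal(size=n)   # b is nearly collinear with a
y = a + b + rng.normal(size=n)

# Rotate (a, b) into an average and a difference, then rescale.
mean_ab = (a + b) / 2.0
diff_ab = a - b
mean_ab = (mean_ab - mean_ab.mean()) / mean_ab.std()
diff_ab = (diff_ab - diff_ab.mean()) / diff_ab.std()

with_diff = adj_r2(np.column_stack([mean_ab, diff_ab]), y)
without_diff = adj_r2(mean_ab[:, None], y)
keep_diff = with_diff > without_diff  # keep (a - b) only if it helps
print(with_diff, without_diff, keep_diff)
```

With this construction almost all of the signal sits in the average, so the difference term usually fails the adjusted-R^2 test and gets discarded, which is the point of the recipe.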
If R/Rscript is not your native tongue and you want to play along, at least w.r.t. post 6061224, this setup code might help:

Code:
#!/usr/bin/Rscript
dfET2 <- read.csv("./ETdata.csv")
## lm(formula = dv[1:35] ~ 0 + as.matrix(dfET2[1:35, -c(1, 8, 3, 6, 4, 11, 9, 5)]))
lm(formula = dfET2$TargetVariable ~ 0 + as.matrix(dfET2[1:35, -c(1, 8, 3, 6, 4, 11, 9, 5)]))

As far as post 6062198 is concerned, unless I'm overlooking something, there seem to be dependencies completely missing (perhaps they're in a DM or a .Rprofile somewhere); e.g., the function "frcstMtrcs" (presumably forecast matrices) is nowhere to be found. In addition, it's obviously difficult to have this kind of conversation in public without accidentally leaking or surreptitiously omitting information that is bordering on proprietary. Therefore... IDK if @Kevin Schmit cares to fill in some blanks, or if @TheBigShort fears leaking more of his model (apparently features 6 and 7 were already made public), but... I'm willing to bet your readers wouldn't mind much.

Anyway, it's been a very entertaining thread so far. GJ guys! And I'd like to remind you all that I, myself, have been benevolent enough to upload all my source code and trades for the last two years here. It would be really nice if everyone would show similar kindness.

Edit: fixed reference to Kev's 2nd post.
The input was the one in this post. I wrote the rules generator in C++ and OpenCL. It is mostly based on "Linear Genetic Programming" by Markus F. Brameier and Wolfgang Banzhaf, with a little from "Dynamics and Performance of a Linear Genetic Programming System" by Frank D. Francone, the basis of the commercial product Trading System Lab http://www.tradingsystemlab.com (see attached linear_genetic_programming_system_francone.pdf). Here is the result of another run on the same input, but this time allowing addition, subtraction, multiplication, and division to not have at least one constant operand (e.g., R1 = feature5 * R0 multiplies an input by a register).

Code:
R0 = 0.817385 * feature4
if defined ( feature7 ) R0 = feature9 * feature6
R0 = R0 - -5.4066
R0 = R0 ** 1.02235
R1 = feature5 * R0
R0 = feature4 + R0
R0 = R0 + R1
R0 = 9.66694 + R0
R0 = R0 + feature4
R0 = 9.99542 + R0
R0 = R0 ** 1.02235
R0 = R0 + R1
R0 = R0 - feature1
R0 = R0 + feature2
if ! defined ( feature5 ) R0 = feature3 ** 1.34554
R0 = feature2 + R0
R0 = feature2 + R0
R0 = R0 + feature2

This allowed a closer fit, with a mean squared error of 139.034 and a mean absolute error of 8.43054. That doesn't necessarily mean these rules would be better at predicting, of course.
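For anyone who wants to see how a rule list like that executes, here is a toy register-machine interpreter in Python. It's my own from-scratch sketch of the instruction flavor shown above (registers R0/R1, constants, feature inputs, and "if defined" guards), not the actual C++/OpenCL generator.

```python
def run_program(program, features):
    """Interpret a tiny linear-GP program over registers R0 and R1.

    Each instruction is (guard, negate, target, fn): guard is None or a
    feature name; negate flips the guard to "if ! defined"; target is
    the register index to write; fn maps (registers, features) to the
    new register value. R0 holds the prediction at the end.
    """
    regs = [0.0, 0.0]
    for guard, negate, target, fn in program:
        if guard is None:
            run = True
        elif negate:
            run = features.get(guard) is None   # if ! defined ( guard )
        else:
            run = features.get(guard) is not None  # if defined ( guard )
        if run:
            regs[target] = fn(regs, features)
    return regs[0]

# The first few rules from the run above, transcribed by hand:
program = [
    (None, False, 0, lambda r, f: 0.817385 * f["feature4"]),
    ("feature7", False, 0, lambda r, f: f["feature9"] * f["feature6"]),
    (None, False, 0, lambda r, f: r[0] - -5.4066),
    (None, False, 0, lambda r, f: r[0] ** 1.02235),
    (None, False, 1, lambda r, f: f["feature5"] * r[0]),
    (None, False, 0, lambda r, f: f["feature4"] + r[0]),
    (None, False, 0, lambda r, f: r[0] + r[1]),
]

# feature7 is missing here, so the guarded instruction is skipped.
features = {"feature4": 2.0, "feature5": 1.0, "feature6": 3.0,
            "feature7": None, "feature9": 0.5}
print(run_program(program, features))
```

One appeal of the linear encoding is visible here: each instruction touches a fixed set of registers, so a whole population of programs can be evaluated with simple flat loops, which maps onto a GPU more naturally than walking trees.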
Your own generator? I see. Comparable to something open-source, in case some of us want to tinker? Presumably something like ECJ will be good enough. Unless you can recommend something else.

Edit: added references to external resources.
ECJ has tree-based genetic programming. I used linear genetic programming, which I chose because I thought tree-based genetic programming would be harder to implement with a lot of the computation in a GPU. https://www.perplexity.ai/search/what-are-the-main-differences-CYNJkiQJQlWlTwHmqCpfww has some differences between the two methods. https://www.quora.com/What-are-some-good-genetic-programming-libraries-in-Python lists some genetic programming libraries in Python and C++. https://geneticprogramming.com/software/ lists assorted genetic programming libraries. https://www.perplexity.ai/search/what-are-some-open-source-line-mBQgf6SLRbaneNPQvHeDsA lists some linear genetic programming libraries. The first one sounds close to what I implemented.
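To make the tree vs. linear distinction concrete, here's a toy illustration (my own, not from ECJ or any of the linked libraries) encoding the same small expression both ways:

```python
# Tree-based GP represents a program as a nested expression tree:
tree = ("add", ("mul", "feature4", 0.817385), ("sub", "feature9", 5.0))

def eval_tree(node, feat):
    """Recursively evaluate a nested-tuple expression tree."""
    if isinstance(node, tuple):
        op, a, b = node
        x, y = eval_tree(a, feat), eval_tree(b, feat)
        return x + y if op == "add" else x * y if op == "mul" else x - y
    return feat.get(node, node)  # leaf: a feature name or a constant

# Linear GP represents the same computation as a flat list of
# register instructions, evaluated top to bottom:
def eval_linear(feat):
    r0 = feat["feature4"] * 0.817385   # R0 = feature4 * 0.817385
    r1 = feat["feature9"] - 5.0        # R1 = feature9 - 5.0
    return r0 + r1                     # R0 = R0 + R1

feat = {"feature4": 2.0, "feature9": 7.0}
print(eval_tree(tree, feat), eval_linear(feat))  # same value both ways
```

The tree form needs recursion (or an explicit stack) to evaluate and mutates by swapping subtrees; the linear form is a fixed-shape instruction stream that mutates by editing instructions, which is part of why it fits GPU evaluation more easily.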
I'll respond later this week; the model actually got me into a long APP Dec/Feb calendar and I haven't been able to focus on code haha.