Here are some xgboost examples. Looks like feature 9 is impactful. Seems to be popular. next stop $town

Code:
import pandas as pd
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
import xgboost as xgb

df = pd.read_csv('etdata.csv')
tv = df['TargetVariable']
dv = df[['feature1', 'feature2', 'feature3', 'feature4', 'feature5',
         'feature6', 'feature7', 'feature8', 'feature9']]

# xgboost regression
X_train, X_test, y_train, y_test = train_test_split(dv, tv, test_size=0.2)
my_model = xgb.XGBRegressor(n_estimators=1000, learning_rate=0.05)
my_model.fit(X_train, y_train, eval_set=[(X_test, y_test)], verbose=False)
predictions = my_model.predict(X_test)
mean_absolute_error(y_test, predictions)
# 17.57801

my_model.feature_importances_
# array([0.0062262 , 0.00931156, 0.06587803, 0.11172969, 0.00268821,
#        0.11810362, 0.04209794, 0.00188226, 0.6420825 ], dtype=float32)

# xgboost random forest
rfmy_model = xgb.XGBRFRegressor(n_estimators=1000, learning_rate=0.05)
rfmy_model.fit(X_train, y_train)
rfpredictions = rfmy_model.predict(X_test)
mean_absolute_error(y_test, rfpredictions)
# 27.888
You'd expect the intercept to be zero when both the IVs and the DV are centered -- that is, when the means are all zero or near zero. The intercept can, by chance, be zero when neither is centered, but that will be rare. It would be a good idea to center all your variables, both left and right hand sides. Put them on the same scale as well, since this will make interpreting the beta coefficients much easier. If you normalize all your variables (subtract the expected mean and divide by the expected standard deviation), your covariance matrix will be approximately the same as your correlation matrix, which will make things much easier.

Your dependent variable shows a strong up trend. Any IV that also has a trend, either up or down, will predict it. Conversely, your IVs, as they are now, will predict any trend. You can verify this by substituting the vector 1:36 into your regression in place of your DV. I recommend pseudo-differencing (Cochrane in "Asset Pricing") your DV and also feature9. This will reduce your R^2, but most of the apparent R^2 is bogus -- the model appears to be predicting the DV but it is really just predicting the row number.

Here is your best model after the differencing; it includes features 2, 4, 5, and 9:

Code:
lm(formula = dv2 ~ mET[, -c(1, 3, 6, 7, 8)])

                                  Estimate Std. Error t value Pr(>|t|)
(Intercept)                       -58.9421    17.4689  -3.374  0.00206 **
mET[, -c(1, 3, 6, 7, 8)]feature2    4.0045     2.2424   1.786  0.08425 .
mET[, -c(1, 3, 6, 7, 8)]feature4    1.5586     0.6099   2.555  0.01592 *
mET[, -c(1, 3, 6, 7, 8)]feature5  112.0528    44.4609   2.520  0.01728 *
mET[, -c(1, 3, 6, 7, 8)]feature9   -0.2499     0.1604  -1.558  0.12980

Multiple R-squared: 0.3719, Adjusted R-squared: 0.2881
F-statistic: 4.44 on 4 and 30 DF, p-value: 0.006148

The adjusted R^2 works out to a cor of .53675 and an estimated percent wins of cor2FSS(.53675) = .6834 (assuming elliptical). Full cross validation is suspect with only 35 rows, but 35 folds of leave-3-out CV yields the following:

Code:
frcstMtrcs(YHB(dv2[-1], cbind(1, mET[-1, c(2, 4, 5, 9)]), wlen = 3), dv2[-1])

       Shp       ShpD       ShpB     ShpT10      WinPC  WinPCT10    WinPCT5    WinPCT2        Cor        RSq
 5.0959652  5.2744793  5.2191296 16.4759000  0.7143000 0.7500000  1.0000000  1.0000000  0.4312000  0.1859334

R^2 has dropped to .186, but that is excellent for "out of sample" results. Also note that the percent wins at .714 is higher than the predicted .637. Also note the Sharpe in the 10% tails.

Here is Cochrane's pseudoDiff function (as posted, the original allocated X with a single column and regressed the same vector on every loop pass; the column indexing below fixes that):

Code:
pseudoDiff <- function(x, lorder = 1, transform = "none") {
    x <- as.matrix(x)
    n <- nrow(x)
    X <- matrix(NA, nrow = n - lorder, ncol = ncol(x))
    if (transform == "log") {
        x <- log(x)
    } else if (transform == "rlog") {
        x <- rlog(x)
    } else if (transform == "ihs") {
        x <- ihs(x)
    }
    if (lorder == 1) {
        xl <- x[1:(n - 1), ]
    } else {
        xl <- getLags(x, lorder)
    }
    for (i in 1:ncol(x)) {
        lmod <- lm(x[(lorder + 1):n, i] ~ xl)
        X[, i] <- lmod$resid
    }
    return(X)
}
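If you want to try the pseudo-differencing idea in Python instead of R, here's a minimal sketch (my own code, not the R function above; the function name and details are assumptions): regress each series on its own lag, with an intercept, and keep the OLS residuals. That strips out the trend that makes the in-sample R^2 look better than it is.

```python
import numpy as np

def pseudo_diff(x, lorder=1):
    """Residuals of x[t] regressed (with intercept) on x[t - lorder], per column.

    Instead of a plain difference x[t] - x[t-1], this subtracts the
    fitted a + rho * x[t - lorder], so a near-unit-root trend is removed
    without forcing rho = 1.
    """
    x = np.asarray(x, dtype=float)
    if x.ndim == 1:
        x = x[:, None]
    n, k = x.shape
    out = np.empty((n - lorder, k))
    for i in range(k):
        y = x[lorder:, i]
        lagged = np.column_stack([np.ones(n - lorder), x[:-lorder, i]])
        beta, *_ = np.linalg.lstsq(lagged, y, rcond=None)
        out[:, i] = y - lagged @ beta
    return out

# A trending 36-row series, like the DV discussed above:
t = np.arange(36.0)
trended = 3.0 * t + np.sin(t)
resid = pseudo_diff(trended)
print(resid.shape)  # (35, 1) -- one row lost to the lag
```

Residuals from an OLS fit with an intercept are mean-zero by construction, which is a quick sanity check that the trend component is gone.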
Depends on what you are using the regression for: realtime forecasting or exploratory data analysis.

For realtime forecasting, keep both, so that if one is missing the other can take up the slack -- sweep (Goodnight) out the missing column to adjust the coefficients on the others. If the two reliably appear with high offsetting loadings in an eigenvector with very low DV loadings, adjust your design matrix accordingly.

For exploratory data analysis: set the two variables on the same approximate scale and approximately equal means. Then replace the two by (a + b) / 2 and (a - b). Rescale. If including (a - b) in your model increases adjusted R^2, keep it. Otherwise discard it.
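To make the (a + b) / 2 and (a - b) recipe concrete, here's a small numpy sketch on synthetic data (the adj_r2 helper and all the names are mine, not from the thread): build the rotated pair, standardize, and keep the difference term only if it pays for its degree of freedom in adjusted R^2.

```python
import numpy as np

def adj_r2(X, y):
    """Adjusted R^2 of an OLS fit of y on X plus an intercept."""
    n = len(y)
    Xd = np.column_stack([np.ones(n), X])
    beta, *_ = np.linalg.lstsq(Xd, y, rcond=None)
    resid = y - Xd @ beta
    r2 = 1.0 - (resid @ resid) / ((y - y.mean()) ** 2).sum()
    p = Xd.shape[1] - 1  # predictors, excluding the intercept
    return 1.0 - (1.0 - r2) * (n - 1) / (n - p - 1)

rng = np.random.default_rng(0)
n = 200
a = rng.normal(size=n)
b = a + 0.1 * rng.normal(size=n)   # b is nearly collinear with a
y = a + b + rng.normal(size=n)

# Rotate (a, b) into an average and a difference, then rescale.
mean_ab = (a + b) / 2.0
diff_ab = a - b
mean_ab = (mean_ab - mean_ab.mean()) / mean_ab.std()
diff_ab = (diff_ab - diff_ab.mean()) / diff_ab.std()

with_diff = adj_r2(np.column_stack([mean_ab, diff_ab]), y)
without_diff = adj_r2(mean_ab[:, None], y)
keep_diff = with_diff > without_diff  # keep (a - b) only if it helps
print(with_diff, without_diff, keep_diff)
```

With this construction almost all of the signal sits in the average, so the difference term usually fails the adjusted-R^2 test and gets discarded, which is the point of the recipe.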
If R/Rscript is not your native tongue and you want to play along, at least w.r.t. post 6061224, this setup code might help:

Code:
#!/usr/bin/Rscript
dfET2 <- read.csv("./ETdata.csv")
## lm(formula = dv[1:35] ~ 0 + as.matrix(dfET2[1:35, -c(1, 8, 3, 6, 4, 11, 9, 5)]))
lm(formula = dfET2$TargetVariable ~ 0 + as.matrix(dfET2[1:35, -c(1, 8, 3, 6, 4, 11, 9, 5)]))

As far as post 6062198 is concerned, unless I'm overlooking something, there seem to be dependencies completely missing (perhaps they're in a DM or a .Rprofile somewhere); e.g., the function "frcstMtrcs" (presumably forecast matrices) is nowhere to be found. In addition, it's obviously difficult to have this kind of conversation in public without accidentally leaking or surreptitiously omitting information that is bordering on proprietary. Therefore... IDK if @Kevin Schmit cares to fill in some blanks, or if @TheBigShort fears leaking more of his model (apparently features 6 and 7 were already made public), but... I'm willing to bet your readers wouldn't mind much.

Anyway, it's been a very entertaining thread so far. GJ guys! And I'd like to remind you all that I, myself, have been benevolent enough to upload all my source code and trades for the last two years here. It would be really nice if everyone would show similar kindness.

Edit: fixed reference to Kev's 2nd post.
The input was the one in this post. I wrote the rules generator in C++ and OpenCL. It is mostly based on "Linear Genetic Programming" by Markus F. Brameier and Wolfgang Banzhaf, with a little from "Dynamics and Performance of a Linear Genetic Programming System" by Frank D. Francone, the basis of the commercial product Trading System Lab http://www.tradingsystemlab.com (see attached linear_genetic_programming_system_francone.pdf). Here is the result of another run on the same input, but this time allowing addition, subtraction, multiplication, and division to not have at least one constant operand (e.g., R1 = feature5 * R0 multiplies an input by a register).

Code:
R0 = 0.817385 * feature4
if defined ( feature7 ) R0 = feature9 * feature6
R0 = R0 - -5.4066
R0 = R0 ** 1.02235
R1 = feature5 * R0
R0 = feature4 + R0
R0 = R0 + R1
R0 = 9.66694 + R0
R0 = R0 + feature4
R0 = 9.99542 + R0
R0 = R0 ** 1.02235
R0 = R0 + R1
R0 = R0 - feature1
R0 = R0 + feature2
if ! defined ( feature5 ) R0 = feature3 ** 1.34554
R0 = feature2 + R0
R0 = feature2 + R0
R0 = R0 + feature2

This allowed a closer fit, with a mean squared error of 139.034 and a mean absolute error of 8.43054. That doesn't necessarily mean these rules would be better at predicting, of course.
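For anyone who wants to see how a rule list like that executes, here is a toy register-machine interpreter in Python. It's my own from-scratch sketch of the instruction flavor shown above (registers R0/R1, constants, feature inputs, and "if defined" guards), not the actual C++/OpenCL generator.

```python
def run_program(program, features):
    """Interpret a tiny linear-GP program over registers R0 and R1.

    Each instruction is (guard, negate, target, fn): guard is None or a
    feature name; negate flips the guard to "if ! defined"; target is
    the register index to write; fn maps (registers, features) to the
    new register value. R0 holds the prediction at the end.
    """
    regs = [0.0, 0.0]
    for guard, negate, target, fn in program:
        if guard is None:
            run = True
        elif negate:
            run = features.get(guard) is None   # if ! defined ( guard )
        else:
            run = features.get(guard) is not None  # if defined ( guard )
        if run:
            regs[target] = fn(regs, features)
    return regs[0]

# The first few rules from the run above, transcribed by hand:
program = [
    (None, False, 0, lambda r, f: 0.817385 * f["feature4"]),
    ("feature7", False, 0, lambda r, f: f["feature9"] * f["feature6"]),
    (None, False, 0, lambda r, f: r[0] - -5.4066),
    (None, False, 0, lambda r, f: r[0] ** 1.02235),
    (None, False, 1, lambda r, f: f["feature5"] * r[0]),
    (None, False, 0, lambda r, f: f["feature4"] + r[0]),
    (None, False, 0, lambda r, f: r[0] + r[1]),
]

# feature7 is missing here, so the guarded instruction is skipped.
features = {"feature4": 2.0, "feature5": 1.0, "feature6": 3.0,
            "feature7": None, "feature9": 0.5}
print(run_program(program, features))
```

One appeal of the linear encoding is visible here: each instruction touches a fixed set of registers, so a whole population of programs can be evaluated with simple flat loops, which maps onto a GPU more naturally than walking trees.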
Your own generator? I see. Comparable to something open-source, in case some of us want to tinker? Presumably something like ECJ will be good enough. Unless you can recommend something else.

Edit: added references to external resources.
ECJ has tree-based genetic programming. I used linear genetic programming, which I chose because I thought tree-based genetic programming would be harder to implement with a lot of the computation in a GPU. https://www.perplexity.ai/search/what-are-the-main-differences-CYNJkiQJQlWlTwHmqCpfww has some differences between the two methods. https://www.quora.com/What-are-some-good-genetic-programming-libraries-in-Python lists some genetic programming libraries in Python and C++. https://geneticprogramming.com/software/ lists assorted genetic programming libraries. https://www.perplexity.ai/search/what-are-some-open-source-line-mBQgf6SLRbaneNPQvHeDsA lists some linear genetic programming libraries. The first one sounds close to what I implemented.
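To make the tree vs. linear distinction concrete, here's a toy illustration (my own, not from ECJ or any of the linked libraries) encoding the same small expression both ways:

```python
# Tree-based GP represents a program as a nested expression tree:
tree = ("add", ("mul", "feature4", 0.817385), ("sub", "feature9", 5.0))

def eval_tree(node, feat):
    """Recursively evaluate a nested-tuple expression tree."""
    if isinstance(node, tuple):
        op, a, b = node
        x, y = eval_tree(a, feat), eval_tree(b, feat)
        return x + y if op == "add" else x * y if op == "mul" else x - y
    return feat.get(node, node)  # leaf: a feature name or a constant

# Linear GP represents the same computation as a flat list of
# register instructions, evaluated top to bottom:
def eval_linear(feat):
    r0 = feat["feature4"] * 0.817385   # R0 = feature4 * 0.817385
    r1 = feat["feature9"] - 5.0        # R1 = feature9 - 5.0
    return r0 + r1                     # R0 = R0 + R1

feat = {"feature4": 2.0, "feature9": 7.0}
print(eval_tree(tree, feat), eval_linear(feat))  # same value both ways
```

The tree form needs recursion (or an explicit stack) to evaluate and mutates by swapping subtrees; the linear form is a fixed-shape instruction stream that mutates by editing instructions, which is part of why it fits GPU evaluation more easily.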
I'll respond later this week; the model actually got me into a long APP Dec/Feb calendar and I haven't been able to focus on code haha.