This example shows how the k-nearest neighbor algorithm and a corresponding histogram can be used to make trading decisions. A short summary of the method is:

1. Find three value swings of past data for the input assets.
2. Fit a curve with a mathematical function to the value swings.
3. Standardize the function so its outputs are easier to compare with outputs for other assets and/or other time frames.
4. Use the k-nearest neighbor algorithm to find outputs of the standardized function for the current asset that are similar to outputs from standardized functions for historical assets.
5. Build a histogram with returns for simulated trades corresponding to the similar past outputs.
6. Trade the current asset based on the shape of the histogram.

Here are the details:

Defining Input Value Swings

For each date in a symbol's data, the software finds data for three value swings with this algorithm:

1. The swing direction starts as undefined.
2. The lookback period is the past N bars or back to the bar just past the previous value swing, whichever is larger.
3. If no close value in the lookback period is higher than the current close value, the swing direction is up.
4. If no close value in the lookback period is lower than the current close value, the swing direction is down.
5. If the swing direction is up and down on the same bar (e.g., the value hasn't changed since the last swing), the swing direction does not change.
6. If the swing direction changes from not up to up, a low swing has been detected at the oldest bar with the lowest value in the lookback period.
7. If the swing direction changes from not down to down, a high swing has been detected at the oldest bar with the highest value in the lookback period.
8. The first swing found is ignored.

Modeling Input Value Swings

The software fits a curve to the data from the current bar back to the bar after the value swing that came before the three value swings was detected. The fitted curve is a least squares regression line added to zero or more skewed cosine waves.
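The swing rules in "Defining Input Value Swings" can be sketched roughly as below. This is only an illustration, not the software used for the results in this post. The function name `find_swings`, the choice to exclude the current bar from the lookback window, and starting the lookback one bar after the previous swing bar are all assumptions on my part.

```python
def find_swings(closes, n):
    """Return (kind, swing_bar, detection_bar) tuples for a list of closes.

    Assumptions (not from the original software): the lookback window
    excludes the current bar, and "just past the previous swing" means
    the bar right after the previous swing bar.
    """
    direction = None      # undefined until the first up/down signal
    last_swing = None     # bar index of the most recent swing
    first_found = False   # the first swing found is ignored
    swings = []
    for i in range(n, len(closes)):
        start = i - n
        if last_swing is not None:
            # use whichever lookback is longer
            start = min(start, last_swing + 1)
        window = closes[start:i]
        cur = closes[i]
        up = all(c <= cur for c in window)    # nothing higher than current
        down = all(c >= cur for c in window)  # nothing lower than current
        if up and down:
            continue                          # flat: direction unchanged
        if up and direction != "up":
            direction = "up"
            # list.index returns the first (oldest) occurrence
            bar = start + window.index(min(window))
            kind = "low"
        elif down and direction != "down":
            direction = "down"
            bar = start + window.index(max(window))
            kind = "high"
        else:
            continue
        last_swing = bar
        if first_found:
            swings.append((kind, bar, i))
        first_found = True
    return swings


closes = [5, 4, 3, 2, 1, 2, 3, 4, 5, 4, 3, 2, 1, 2, 3, 4, 5]
print(find_swings(closes, 3))
```

On this toy zig-zag series the initial high at bar 0 is the ignored first swing, so the reported swings start with the low at bar 4.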
A skewed cosine wave starts as an ordinary cosine wave with amplitude, period, and phase. The skew is the proportion of a period for the relative position of the trough of the wave between two peaks.

For example, here is a cosine wave with amplitude 3, period 20, phase 2.3333 (0 <= phase <= two*pi), and skew 0.5 (no skew).

Here is a cosine wave with the same amplitude, period, and phase but with skew 0.8.

Here is a cosine wave with the same amplitude, period, and phase but with skew 0.2.

An example of the formula for a fitted function is:

Code:
DIA_20240306_37_opy =
  391.993896484375 + 5.76630401611328 * -0.0800857560803043 * x
  + 2.83467149734497 * (
    0.290414988994598 * skewed_cos(twopi / 10.7259641207956, 2.21912574768066, 0.461418867111206, x)
    + 0.224823564291 * skewed_cos(twopi / 32.8402920009586, 2.60685586929321, 0.426971435546875, x)
    + 0.205139502882957 * skewed_cos(twopi / 4.69478300125808, 2.65043544769287, 0.473819971084595, x)
    + 0.167520850896835 * skewed_cos(twopi / 18.5536670608188, 4.66447830200195, 0.53011691570282, x)
    + 0.152913108468056 * skewed_cos(twopi / 5.76414163479297, 4.79805469512939, 0.562792778015137, x)
  ) ;

The arguments to the skewed_cos functions are frequency, phase (when x == 0), skew, and x value. This fitted curve is for the open prices (adjusted for dividends) for the 37 trading days ending 20240306 for DIA (SPDR Dow Jones Industrial Average ETF Trust). x == 0 is for 20240306; x < 0 is for trading days before 20240306; x > 0 is for (predicted) trading days after 20240306.

The Goertzel algorithm discovers the amplitudes, frequencies, and phases. Other software finds the skews for a better fit.
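The post does not give the exact formula behind skewed_cos, so here is one possible sketch. It assumes a piecewise-linear warp of the phase that moves the trough to the fraction `skew` of the way between two peaks, which matches the description above (skew 0.5 gives an ordinary cosine). The warp itself is my assumption, not necessarily what the software does.

```python
import math

TWOPI = 2.0 * math.pi


def skewed_cos(freq, phase, skew, x):
    """Cosine whose trough sits at fraction `skew` of each period.

    Arguments follow the order in the post: frequency, phase at x == 0,
    skew (assumed to satisfy 0 < skew < 1), and x.
    """
    theta = (freq * x + phase) % TWOPI   # position within the period
    u = theta / TWOPI                    # same position as a 0..1 fraction
    if u < skew:
        # peak-to-trough leg stretched or squeezed to cover `skew`
        warped = math.pi * u / skew
    else:
        # trough-to-peak leg covers the remaining 1 - skew
        warped = math.pi + math.pi * (u - skew) / (1.0 - skew)
    return math.cos(warped)


# Period 20, phase 0, skew 0.8: the trough lands 16 bars after the peak.
print(skewed_cos(TWOPI / 20, 0.0, 0.8, 16))
```

With skew 0.5 both legs are half a period long, so the function reduces to math.cos(freq * x + phase).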
Standardizing Fitted Functions

To make curve values comparable with those for assets with different price ranges and/or time frames, the software standardizes them by removing:

1. the y intercept from the least squares regression line; e.g., 391.993896484375
2. the estimated standard deviation of the price data from the least squares regression line; e.g., 5.76630401611328
3. the estimated standard deviation of the differences of price values to the least squares regression line; e.g., 2.83467149734497

An example of a standardized fitted curve formula is:

Code:
DIA_20240306_37_opy =
  -0.0800857560803043 * x
  + (
    0.290414988994598 * skewed_cos(twopi / 10.7259641207956, 2.21912574768066, 0.461418867111206, x)
    + 0.224823564291 * skewed_cos(twopi / 32.8402920009586, 2.60685586929321, 0.426971435546875, x)
    + 0.205139502882957 * skewed_cos(twopi / 4.69478300125808, 2.65043544769287, 0.473819971084595, x)
    + 0.167520850896835 * skewed_cos(twopi / 18.5536670608188, 4.66447830200195, 0.53011691570282, x)
    + 0.152913108468056 * skewed_cos(twopi / 5.76414163479297, 4.79805469512939, 0.562792778015137, x)
  ) ;

The outputs of two fitted curves are more comparable when standardized. For example, here are standardized curves for open prices of DIA (SPDR Dow Jones Industrial Average ETF Trust) at 20240305 for 37 trading days and EWW (iShares MSCI Mexico ETF) at 20100423 for 59 trading days, where x == 0 is the current trading day for each curve (20240305 and 20100423).

Running K-Nearest Neighbor Algorithm

In this example, each input has the standardized fitted curves for ETF daily open prices (adjusted for splits and dividends) starting 19930325 and ending 20240221, simulating long trades entering at the next trading day's open and exiting at the following trading day's close. The software calculates the standardized fitted curve values eight trading days before the current day through eight trading days after the current trading day (17 trading days total).
The attached, tab-separated knn_etfs.csv lists the 68 ETFs used as inputs. The inputs for testing are split 70% for training (294,996 records) and 30% for evaluation (126,427 records). For each evaluation record, the software uses dynamic time warping to measure the distance to each training record and records a simulated trade result as log(next_next_trading_day_open / next_trading_day_open) * 100 for the closest 294 training records (a little less than 0.1% of the training records).

Histogram Construction and Interpretation

The software builds a histogram with the simulated results using the Freedman–Diaconis rule to calculate the bin width. For example, the histogram of DIA (SPDR Dow Jones Industrial Average ETF Trust) at 20240305 is:

The middle marked bin is the mode, and the surrounding marked bins show the first bins surrounding the mode bin that cover at least 68.26% of the 294 values rounded to the nearest whole number (201 values). The 68.26% is the proportion of values within one standard deviation from the mean of a normal distribution.

A rule to determine whether to enter a long trade at the next trading day's open (and exit at the following day's open) is:

Code:
mode > 0
and (hi_mark - mode) >= (mode - lo_mark)
and number_of_results_between_mode_and_hi_mark_inclusive <= number_of_results_between_lo_mark_and_mode_inclusive

My theory for this rule is that a new result between the mode and high mark (inclusive) is more likely to happen because it creates a more symmetrical histogram (closer to a normal distribution) or keeps the histogram symmetrical.

Simulated Results

For the evaluation data using the above rule, the mean trade result is a gain of 0.1679%, mean win 0.9408%, mean loss 0.8544%, and win rate 57.17%. For the evaluation data using buy and hold, the mean one-day simulated trade result is a gain of 0.0396%, mean win 0.8824%, mean loss 0.9362%, and win rate 53.88%.
On a per-trade basis, the method was better than buy and hold on a large evaluation input.
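For readers unfamiliar with the nearest-neighbor step, here is a minimal sketch of dynamic time warping plus picking the k closest training records. It uses the textbook DTW recurrence with an absolute-difference cost. The post does not say which DTW variant, cost, or window the software uses, so treat those choices and the names `dtw_distance` and `nearest_results` as assumptions.

```python
def dtw_distance(a, b):
    """Textbook dynamic time warping distance between two sequences."""
    inf = float("inf")
    n, m = len(a), len(b)
    # cost[i][j] = best cumulative cost aligning a[:i] with b[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])
            cost[i][j] = d + min(cost[i - 1][j],      # repeat b[j-1]
                                 cost[i][j - 1],      # repeat a[i-1]
                                 cost[i - 1][j - 1])  # advance both
    return cost[n][m]


def nearest_results(query, training, k):
    """Return trade results of the k training records closest to query.

    `training` is a list of (curve_values, trade_result) pairs.
    """
    ranked = sorted(training, key=lambda rec: dtw_distance(query, rec[0]))
    return [result for _, result in ranked[:k]]


training = [([0, 0, 0], -1.0), ([1, 2, 3], 0.5), ([1, 2, 4], 0.7)]
print(nearest_results([1, 2, 3], training, 2))
```

In the post, each curve has 17 standardized values and k is 294. The returned trade results are what would feed the histogram.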
This example shows prices fitted with a least squares regression line and four asymmetric triangle waves, with simulated long trades when the regression line and each asymmetric triangle wave point upward from the current bar through the next two bars.

The software models curves for each sequence of daily open, high, low, and close prices similar to the above post, except it uses asymmetric triangle waves instead of skewed cosine waves. For example, the fitted functions for FHLC (Fidelity MSCI Health Care Index ETF) for the most recent 101 trading days are:

Code:
FHLC_20240322_101_opy =
  70.6176452636719 + 4.56533193588257 * -0.026622928546473 * x
  + 0.720556080341339 * (
    atri(1.11319243907928, 93.4989013671875, 0.00286221504211426, 0.548264622688293, x)
    + atri(0.677243173122406, 35.7667388916016, 0.929093837738037, 0.488233029842377, x)
    + atri(0.291959315538406, 10.6060361862183, 0.689995527267456, 0.498603940010071, x)
    + atri(0.264432728290558, 8.69358348846436, 0.825272798538208, 0.532109916210175, x)
  ) ;
FHLC_20240322_101_hiy =
  70.8897933959961 + 4.77217102050781 * -0.0252707564487616 * x
  + 0.753896296024323 * (
    atri(1.12747776508331, 93.9770126342773, 0.00061333179473877, 0.550458431243896, x)
    + atri(0.709456622600555, 35.8265037536621, 0.926209568977356, 0.4846431016922, x)
    + atri(0.283884853124619, 14.3711767196655, 0.951511383056641, 0.547267377376556, x)
    + atri(0.281728148460388, 46.9426307678223, 0.997644662857056, 0.446151077747345, x)
  ) ;
FHLC_20240322_101_loy =
  70.3250579833984 + 4.33053064346313 * -0.0279434899663403 * x
  + 0.762261092662811 * (
    atri(1.0895619392395, 94.8137130737305, 0.999907374382019, 0.53909033536911, x)
    + atri(0.65017956495285, 34.8105125427246, 0.907422304153442, 0.453330934047699, x)
    + atri(0.334258705377579, 10.6658000946045, 0.675299227237701, 0.478061556816101, x)
    + atri(0.307076930999756, 44.4325370788574, 0.978124618530273, 0.495811760425568, x)
  ) ;
FHLC_20240322_101_cly =
  70.6441497802734 + 4.49145746231079 * -0.0268064757732261 * x
  + 0.887121498584747 * (
    atri(0.990466058254242, 94.3953628540039, 0.997954547405243, 0.548464059829712, x)
    + atri(0.557023644447327, 35.4081535339355, 0.911576449871063, 0.504587173461914, x)
    + atri(0.24662446975708, 47.3609809875488, 0.00467079877853394, 0.447946041822433, x)
    + atri(0.245282217860222, 20.7061748504639, 0.991685390472412, 0.502991616725922, x)
  ) ;

The curve points backwards in time, so x == 0 is for 20240322, x == -1 is for 20240325, and x == -2 is for 20240326. For a regression line, pointing upward means the slope (such as -0.0268064757732261) is less than zero. For an asymmetric triangle wave, pointing upward means the wave value at x == -1 is greater than at x == 0, and the wave value at x == -2 is greater than at x == -1.

An example that has atri(0.677243173122406, 35.7667388916016, 0.929093837738037, 0.488233029842377, x) and atri(0.557023644447327, 35.4081535339355, 0.911576449871063, 0.504587173461914, x) from the functions for opy and cly above is:

For this example, the software used ETF daily open, high, low, and close prices (adjusted for splits and dividends) starting 19930325 and ending 20240221, simulating long trades entering at the next trading day's open and exiting at the following trading day's close. The ETFs used were the ones from the knn_etfs.csv file in the above post. Unlike in that post, there was no data splitting into training and evaluation data because the trading rule was manually created (i.e., not trained with any type of machine learning).

For the data using the fitted curve components pointing up rule, the mean trade result is a gain of 0.1542%, mean win 1.0317%, mean loss 0.9426%, and win rate 55.80%. For the data using buy and hold, the mean one-day simulated trade result is a gain of 0.0274%, mean win 1.0038%, mean loss 1.0425%, and win rate 52.54%.

On a per-trade basis, the method was better than buy and hold on a large input.
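The "pointing upward" test can be sketched without knowing the internals of atri, because it only compares each component at x == 0, -1, and -2. In this sketch each wave is passed in as a plain function of x. The names `points_up` and `take_long` and the stand-in waves are hypothetical and are not fitted components from the post.

```python
def points_up(wave):
    """True if the wave rises from x == 0 to x == -1 to x == -2.

    The fitted curves in the post point backwards in time, so more
    negative x means a later trading day.
    """
    return wave(-1) > wave(0) and wave(-2) > wave(-1)


def take_long(slope, waves):
    """Rule from the post: regression line and every wave point upward."""
    # a negative slope points upward because x decreases going forward
    return slope < 0 and all(points_up(w) for w in waves)


# Stand-in components for illustration only (not fitted atri waves).
rising = lambda x: -0.5 * x
falling = lambda x: 0.5 * x

print(take_long(-0.0268064757732261, [rising, rising]))   # True
print(take_long(-0.0268064757732261, [rising, falling]))  # False
```

To apply it to a fitted function, each atri(...) term would be wrapped as a function of x and passed in the list.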
This example combines standardized linear regression plus standardized skewed cosine waves and standardized asymmetric triangle waves from the previous two posts in this thread. The software calculates the standardized fitted curve values three trading days before the current day through three trading days after the current trading day (7 trading days total). Then, genetic programming software creates rules relating the fitted curve values to each other or constants. The rules target 1-2% of the training data and simulate long trades entering at the next trading day's open and exiting at the following trading day's close. Pseudocode for 400 generated rules is in the attached scosatri_gprules.txt. An input with a name Aop_BeforeStdz3 means the value was from standardized linear regression plus standardized asymmetric triangle waves using a time series of open prices (adjusted for splits and dividends) for 3 trading days before the current trading day. Similarly, an input with a name Sop_AfterStdz1 means the value was from standardized linear regression plus standardized skewed cosine waves using a time series of open prices (also adjusted for splits and dividends) for one trading day after the current trading day. Other inputs follow the same conventions. Using 400 created rules trained and evaluated on the same data in the previous two posts in this thread, the simulated trade would be taken when at least 144 rules (36%) pass (return 1 in the above pseudocode). The results on the out-of-sample evaluation data have mean gain 0.3275%, mean win 1.2520%, mean loss 0.9696%, and win rate 58.65%. The results on the evaluation data for buy and hold have a mean gain of 0.0396%, mean win 0.8829%, mean loss 0.9366%, and win rate 53.88%. On a per-trade basis, the method was better than buy and hold on a large evaluation input (and better than the methods from the previous two posts in this thread).
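The vote over the generated rules is simple enough to sketch directly. Here each rule's output is assumed to already be available as 0 or 1, and the function name `should_trade` is hypothetical. The threshold of 144 out of 400 comes from the post.

```python
def should_trade(rule_outputs, min_pass=144):
    """Take the simulated trade when enough rules pass (return 1)."""
    passed = sum(1 for r in rule_outputs if r == 1)
    return passed >= min_pass


outputs = [1] * 150 + [0] * 250   # 400 rule results, 150 passing
print(should_trade(outputs))      # True: 150 >= 144
```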
Here is a method of extending spline-joined wave points for timing swing trades.

1. Find the most recent eight price swings (aka wave points) from daily close prices as in "Defining Input Value Swings" here.
2. Join the wave points with a linear spline.
3. Model the linear spline with a function of a least squares regression line added to skewed cosine waves as in "Modeling Input Value Swings" also here.
4. If a wave point was just detected, project the function into the future from that wave point to find the next two wave points.
5. If the function finds two predicted wave points with at least one after the detection point for the most recent wave, trade based on the positions of the predicted wave points relative to the detection point.

In these examples, I refer to the wave points as hw7 (history wave 7) for the oldest wave point through hw0 for the most recent wave point. pw0 is the next predicted wave point, and pw1 is the wave point predicted after that.

The charts show price bars and other data from hw7 through pw1. Each wave point is either a trough (swing bottom) or a peak (swing top). The black, zig-zagging curve is the linear spline from hw7 through hw0. It is horizontal after hw0 because splines cannot be extrapolated. The orange curve is the output of the function modeling the linear spline. The gray, straight, left-to-right line is the least squares linear regression of the linear spline from hw7 through hw0. The gray dotted lines represent the entry and exit bars for a simulated long trade. The square on a close price is the bar the most recent wave was detected from.

Example: pw0 is a trough at or before the hw0 detection bar.

Example: pw0 is a peak after the hw0 detection bar.

Example: pw0 is a trough after the hw0 detection bar.
Using the same parameters on 46 assorted ETFs with up to 24 years of input price data adjusted for dividends and splits, a backtest summary for simulated long trades on each ETF is in the attached, semicolon-separated extending_spline_joined_wave_points_perf.csv. For an individual symbol, the simulated trades can overlap with prior simulated trades. The simulated trades enter long at the open of the entry bar and exit at the open of the exit bar. Transaction costs were not included. The method was profitable on all 46 ETFs tested.
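As a small illustration of the "join the wave points with a linear spline" step, here is a sketch that interpolates linearly between wave points and holds the last value flat after the most recent point, matching the horizontal segment described for the charts. The function name `linear_spline` and the (bar_index, price) layout are assumptions for this sketch.

```python
def linear_spline(points, x):
    """Piecewise-linear value at bar x through (bar_index, price) points.

    `points` must be sorted by bar_index. Outside the covered range the
    nearest endpoint value is held flat (no extrapolation).
    """
    if x <= points[0][0]:
        return points[0][1]
    if x >= points[-1][0]:
        return points[-1][1]
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        if x0 <= x <= x1:
            return y0 + (y1 - y0) * (x - x0) / (x1 - x0)


wave_points = [(0, 10.0), (4, 14.0), (6, 11.0)]   # toy trough, peak, trough
print([linear_spline(wave_points, x) for x in range(0, 9)])
```

The resulting per-bar values are what the regression line plus skewed cosine waves would then be fitted to.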
%% Astronomy-based could help with seasonals; most calendars still use moon phases. I use the numbers [days, months] on a calendar more. Sunrise is a good start, or pre-sunrise.
One of the issues with post #15 in this thread is overlap by trades lasting a long time (e.g., more than one year). To make this less of an issue, I tried something roughly similar, with the main differences being that detrending is always between wave points (not just a straight line) and that it uses linear prediction of Maximum Entropy Spectral Analysis (MESA) coefficients instead of a sum of waves based on the Goertzel algorithm.

1. Find the most recent eight price swings (aka wave points) from daily open prices as in "Defining Input Value Swings" here. This example uses five or more bars as the lookback period, down from eight or more before. It also uses open prices instead of close prices.
2. If a wave point was just detected, join the wave points with a spline that has one half period of a cosine wave (trough to peak or peak to trough) for each leg of the waves.
3. Find points in the middle of each of those half-period cosine waves, including a cosine wave one price swing older than the oldest price swing above.
4. Join those points with a spline that has one half period of a cosine wave (trough to peak or peak to trough) for each point between the wave points. This is a trend curve for the spline that goes through the eight wave points.
5. Extend the trend curve into the future with the endpoint value.
6. Detrend by subtracting the two splines at trading day points.
7. Calculate Maximum Entropy Spectral Analysis (MESA) coefficients on the detrended data using the method in the attached linear_prediction_and_maximum_entropy_spectral_analysis_for_radar_applications_bowling.pdf (source code in the document: SUBROUTINE COEFF).
8. Extrapolate from the most recent wave point to find the next two wave points (source code in the document: SUBROUTINE LNPRED).
9. If the extrapolation finds two predicted wave points with at least one after the detection point for the most recent wave, trade based on the positions of the predicted wave points relative to the detection point.
10. Ignore trades longer than ~one month (more than 22 trading days including entry and exit day).

Example of the spline through the wave points plus the detrending curve:

The trend curve is above the trough wave points and below the peak wave points.

After detrending:

In theory, this should make it easier to model with oscillations whose peaks and troughs are above and below zero respectively.

Contrast that with a linear detrend line:

The trend line is below some wave troughs and above a wave peak.

After linear detrending:

In these examples, I refer to the wave points as hw7 (history wave 7) for the oldest wave point through hw0 for the most recent wave point. pw0 is the next predicted wave point, and pw1 is the wave point predicted after that.

The charts show price bars and other data from hw7 through pw1. Each wave point is either a trough (swing bottom) or a peak (swing top). The black curve is the wave points from hw7 through hw0 joined by a cosine for each leg. The orange curve is the extrapolated output of linear prediction of the MESA coefficients from hw0 through pw1. The grey curve is the midpoints hw0 to hw1, hw1 to hw2, ... hw7 to hw8 joined by a cosine for each leg. It's horizontal after the midpoint hw0 to hw1. The black square is at hw0. The brown square is at the close of the bar that detected hw0. The grey dotted lines represent the entry and exit bars for a simulated long trade. The label on the bottom has the symbol, hw7 date, hw0 date, pw0 date, pw1 date, entry date, and exit date.

Example: pw0 is a trough at or before the hw0 detection bar.

Example: pw0 is a peak after the hw0 detection bar.

Example: pw0 is a trough after the hw0 detection bar.

Using the same parameters on 46 assorted ETFs with up to 24 years of input price data adjusted for dividends and splits, a backtest summary for simulated long trades on each ETF is in the attached, semicolon-separated extending_spline_joined_wave_points_perf.csv.
For an individual symbol, the simulated trades can overlap with prior simulated trades. The simulated trades enter long at the open of the entry bar and exit at the open of the exit bar. Transaction costs were not included. The method was profitable on 44 of 46 ETFs tested.
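The half-period cosine spline and the midpoint trend curve from the steps above can be sketched as follows. This is a rough illustration under my own assumptions: wave points are (bar_index, price) pairs, the "middle" of a leg is taken as the average of its two endpoints (which is where a half-period cosine leg sits at its halfway bar), and both curves are held flat outside their range. It does not cover the MESA coefficient or linear prediction steps.

```python
import math


def cosine_spline(points, x):
    """Join sorted (bar_index, price) points with half-period cosine legs."""
    if x <= points[0][0]:
        return points[0][1]
    if x >= points[-1][0]:
        return points[-1][1]          # extend with the endpoint value
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        if x0 <= x <= x1:
            t = (x - x0) / (x1 - x0)  # 0 at the leg start, 1 at the end
            return y0 + (y1 - y0) * (1.0 - math.cos(math.pi * t)) / 2.0


def midpoints(points):
    """Middle of each leg: halfway in time and halfway in price."""
    return [((x0 + x1) / 2.0, (y0 + y1) / 2.0)
            for (x0, y0), (x1, y1) in zip(points, points[1:])]


def detrended(points, x):
    """Wave spline minus the trend curve through the leg midpoints."""
    return cosine_spline(points, x) - cosine_spline(midpoints(points), x)


# Toy trough, peak, trough, peak sequence.
waves = [(0, 10.0), (4, 14.0), (8, 11.0), (12, 15.0)]
print(detrended(waves, 4), detrended(waves, 8))
```

On this toy data the peak at bar 4 detrends to a positive value and the trough at bar 8 to a negative one, which is the property the post is after.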