Maybe this is a noob-in-stats question, and sorry if it is, but: I have written a random series generator and I'd like to be able to control *precisely* the relative percentage of occurrences of each possible value over the total number of samples generated. So far I'm _almost_ there, except that I almost always end up with the final 10-20 "random" values being the same (because of the way I've written the code for this RNG). For instance, say I have a set of 3 possible values: A, B and C. I would like to be able to generate large random series (e.g. B-C-C-A-B-A-A-B etc.) where I can control exactly how many A's, B's and C's will be hit (e.g. 30%, 15% and 55%), while avoiding ending the series with, say, 25 A's in a row: A-A-A-A-A-A-A-... (not so "random" ^^) like it does now. Thanks in advance for any help; it will save me some time to maybe work on some of the interesting stuff posted here and deliver some results...
One simple suggestion:
1. Build an array of tuples containing the exact number of A's, B's and C's you want in your series: {A,_} {A,_} ... (required number of A's), {B,_} ... (required number of B's), {C,_} ... (required number of C's).
2. Go over the array and replace each tuple's _ with a random int (use a proper RNG, for example Java's SecureRandom).
3. Sort the array by _ et voila...
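Here is a minimal Java sketch of that recipe, assuming the 30%/15%/55% split from the question over 100 samples (the counts and the values A/B/C are just placeholders; SecureRandom as suggested):

import java.security.SecureRandom;
import java.util.Arrays;
import java.util.Comparator;

public class ExactMixSeries {
    public static void main(String[] args) {
        SecureRandom rng = new SecureRandom();

        // Hypothetical target mix: 30% A, 15% B, 55% C over 100 samples.
        int countA = 30, countB = 15, countC = 55;
        int n = countA + countB + countC;

        // Step 1: exactly the requested number of each value.
        char[] values = new char[n];
        int pos = 0;
        for (int i = 0; i < countA; i++) values[pos++] = 'A';
        for (int i = 0; i < countB; i++) values[pos++] = 'B';
        for (int i = 0; i < countC; i++) values[pos++] = 'C';

        // Step 2: give each entry a random sort key.
        int[] keys = new int[n];
        Integer[] order = new Integer[n];
        for (int i = 0; i < n; i++) {
            keys[i] = rng.nextInt();
            order[i] = i;
        }

        // Step 3: sort by the random key; reading the values in that order
        // yields a shuffled series with the exact requested counts.
        Arrays.sort(order, Comparator.comparingInt(i -> keys[i]));

        StringBuilder sb = new StringBuilder();
        for (int i : order) sb.append(values[i]).append('-');
        System.out.println(sb.substring(0, sb.length() - 1));
    }
}

Since sorting by a random key is equivalent to a shuffle, Collections.shuffle on a List<Character> with a SecureRandom source would collapse steps 2 and 3 into a single call.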
Generally, you can alter the starting number with a set-seed command on each trial if you want. This should randomize your sets of series better (if you are getting repeated series). To set the frequency of outcomes for each variable, you can simply use case statements with a uniform RNG, i.e. if the random number is <= 30% then A, etc...
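For comparison, a sketch of that case-statement/threshold idea, again assuming the hypothetical 30%/15%/55% split and an explicit per-trial seed. Note that this only hits the target mix on average over many draws, not exactly:

import java.util.Random;

public class ThresholdSeries {
    // Map one uniform draw in [0,1) to A/B/C with cumulative thresholds
    // (30% A, 15% B, 55% C), i.e. the case-statement idea above.
    static char draw(Random rng) {
        double u = rng.nextDouble();
        if (u < 0.30) return 'A';
        if (u < 0.45) return 'B';   // 0.30 + 0.15
        return 'C';                 // remaining 0.55
    }

    public static void main(String[] args) {
        Random rng = new Random(12345L);  // explicit seed, set per trial as suggested
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < 20; i++) sb.append(draw(rng)).append('-');
        System.out.println(sb.substring(0, sb.length() - 1));
    }
}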
cajw1, not bad for a 1st post. Thanks pal, sounds like a good idea; I'll have to see how it pans out, but it should work. dtrader98, the algo I'm using operates in a "self-seeding" type of mode. In fact the problem I'm referring to doesn't always happen, but often enough that I need to fix it. With that said, I might want to spend some time later on improving the seeding of the RNG and see how it goes... Right now I'm writing code to determine which value in a data set has the longest streak of all. Way too much fun really...
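For the longest-streak part, a bare-bones sketch, assuming the data set is just a sequence of values (the series below is made up):

public class LongestStreak {
    // Finds the value with the longest run of consecutive repeats
    // and prints the run length as well.
    public static void main(String[] args) {
        char[] data = "BCCABAAABCCCCAB".toCharArray();  // hypothetical series

        char bestValue = data[0];
        int bestLen = 1, runLen = 1;
        for (int i = 1; i < data.length; i++) {
            runLen = (data[i] == data[i - 1]) ? runLen + 1 : 1;
            if (runLen > bestLen) {
                bestLen = runLen;
                bestValue = data[i];
            }
        }
        System.out.println("Longest streak: " + bestValue + " x " + bestLen);
    }
}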
Regarding the average mean deviation: if you discretize each directional change to one of two binary values (much like a coin toss), the answer is that a stock sequence will very much resemble the average absolute deviation predicted by the sqrt(2n/pi) estimate, further evidence of its random nature. Although it predicts the central expectation of a large ensemble of cumsum random-walk trials pretty well, that is the central expectation of hundreds or thousands of trials; each individual random walk (of a long run) will vary wildly from that estimate.

If you use literal raw points from an untouched S&P 500 series, however, the average distance will not match the MAD estimate, since the value of each step is not binary to start with (this can be confirmed empirically). Also, note that the equation is an absolute-value predictor of the expected cumsum n points out; since the random walk of a time series can move in both polarities relative to the start point, you will have a bimodal expectation, as well as the common bounded random-walk envelope with a sqrt shape.

Maestro, was this your expected answer? If so, I'm interested to hear how you are using it to an advantage. I'd like to experiment with the logic. thks dt.

P.S. Hope you don't mind me chiming in, I'm hoping to learn something new here as well.
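A quick way to check the sqrt(2n/pi) claim is to average |S_n| of a coin-toss walk over many trials; a sketch (the trial count, horizon and seed are arbitrary):

import java.util.Random;

public class AbsDevCheck {
    // Average |S_n| of a +/-1 coin-toss walk over many trials, compared with
    // the estimate sqrt(2n/pi). Individual walks vary wildly; only the
    // ensemble average tracks the formula.
    public static void main(String[] args) {
        int trials = 20000, n = 200;
        Random rng = new Random(42L);
        double[] sumAbs = new double[n + 1];

        for (int t = 0; t < trials; t++) {
            int s = 0;
            for (int i = 1; i <= n; i++) {
                s += rng.nextBoolean() ? 1 : -1;
                sumAbs[i] += Math.abs(s);
            }
        }
        for (int i = 50; i <= n; i += 50) {
            double empirical = sumAbs[i] / trials;
            double theory = Math.sqrt(2.0 * i / Math.PI);
            System.out.printf("n=%3d  empirical=%.3f  sqrt(2n/pi)=%.3f%n", i, empirical, theory);
        }
    }
}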
Thanks Maestro, I'll verify the distributions over the weekend and then test some simple rules, out of sample on daily SPY data. Will post findings. Tom
I was expecting you to chime in! Your posts are always valuable! Please participate in this discussion as much as you can. If you create a step function (let's say a 5-point step) and convert ES data into this function, you should see that the number of steps vs. the absolute move over the long run (I did it over 4 years) matches the formula quoted above exactly. To me it is solid proof that the price data, if plotted not on a price/time scale but on a price/step scale, is almost 100% Gaussian. It leads to many profound conclusions, including the utilization of interpolators (such as splines) on the price/step plane. Tools such as options and other existing instruments become very efficient at harvesting the uncovered price distribution.
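A minimal sketch of how I read the step-function conversion: record a new level only when price has moved a full step away from the last recorded level. The 5-point step and the short price array are illustrative, not actual ES data:

import java.util.ArrayList;
import java.util.List;

public class StepSeries {
    // Converts a price series to a price/step series: a new level is recorded
    // only when price has moved at least one full step from the last level.
    static List<Double> toSteps(double[] prices, double step) {
        List<Double> levels = new ArrayList<>();
        double last = prices[0];
        levels.add(last);
        for (double p : prices) {
            while (p >= last + step) { last += step; levels.add(last); }
            while (p <= last - step) { last -= step; levels.add(last); }
        }
        return levels;
    }

    public static void main(String[] args) {
        double[] es = {1300, 1303, 1307, 1311, 1306, 1298, 1295, 1301, 1312};  // hypothetical prices
        System.out.println(toSteps(es, 5.0));
    }
}

The resulting level changes are all plus or minus one step, so they can be fed straight into the same abs-dev-vs-n comparison as the coin-toss walk above.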
Not certain if I am interpreting your rules identically, BUT: if I use a constant scaling factor of about 14 for raw S&P 500 daily data (SP/14), then yes, the expected abs dev vs. n matches the theoretical shape almost perfectly and only slightly starts to diverge after about 50 steps or so; ignoring the scaling factor, the abs dev vs. n shape maps in an identical fashion to the random-walk shape. Ran it from '98 to '03, ~15k data pts (expect other samples to match). Very similar to the conclusions I mentioned using the discretized binomial walk.

I prefer to use S&P daily, as everyone has equal access to the free data (to compare), and it should be a good approximation to any derived S&P 500 data (barring scale factors).

I found similar conclusions to just about everything you've mentioned so far via my own personal research over the years. Looking forward to learning more about new and different logical rules to profit off this knowledge.
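For anyone wanting to reproduce the measurement, a sketch of the abs dev vs. n calculation on a scaled series; the loader below just generates a stand-in walk so the sketch runs, and would be swapped for your own scaled (e.g. SP/14) daily closes:

public class AbsDevVsN {
    // For a (scaled) price series, average |x[t+n] - x[t]| over all start
    // points t and compare with sqrt(2n/pi), the random-walk estimate.
    static double avgAbsDev(double[] x, int n) {
        double sum = 0;
        int count = 0;
        for (int t = 0; t + n < x.length; t++) {
            sum += Math.abs(x[t + n] - x[t]);
            count++;
        }
        return sum / count;
    }

    public static void main(String[] args) {
        double[] scaled = loadScaledCloses();  // placeholder; use real scaled closes here
        for (int n = 5; n <= 50; n += 5) {
            System.out.printf("n=%2d  measured=%.3f  sqrt(2n/pi)=%.3f%n",
                    n, avgAbsDev(scaled, n), Math.sqrt(2.0 * n / Math.PI));
        }
    }

    // Stand-in data so the sketch compiles and runs: a +/-1 walk.
    static double[] loadScaledCloses() {
        java.util.Random rng = new java.util.Random(7L);
        double[] x = new double[1000];
        for (int i = 1; i < x.length; i++) x[i] = x[i - 1] + (rng.nextBoolean() ? 1 : -1);
        return x;
    }
}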
Maestro, thanks for the insight. But I am still uncomfortable with the assumption of spline analysis that *all* possible information influencing the short-term future movements of the S&P 500 is embedded in its own historical prices. There have to be many social/economic factors in play, even for the short term. I was able to build Bayesian networks with distribution tables to show me the best predictors, for use in logit regressions (no more linear regression and no more blind combination of predictors). Prediction accuracy went up 10% across the board. Bayesian networks study predictor dependencies. Do you know of even better machine-learning methods when it comes to analyzing financial time series? AI doesn't work very well, I heard. Thanks again.
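Since logit regressions came up, a bare-bones sketch of one fit by gradient descent; the two features and six labels below are placeholders, not the predictors a Bayesian network would select:

public class LogitSketch {
    // Minimal logistic regression fit by gradient ascent on the log-likelihood.
    // In practice x would hold the selected predictors and y an up/down outcome.
    public static void main(String[] args) {
        double[][] x = { {0.2, 1.1}, {1.5, 0.3}, {0.8, 0.9}, {2.0, 1.7}, {0.1, 0.4}, {1.2, 2.1} };
        int[] y = { 0, 1, 0, 1, 0, 1 };

        double[] w = new double[x[0].length];
        double b = 0.0, rate = 0.1;

        for (int epoch = 0; epoch < 5000; epoch++) {
            for (int i = 0; i < x.length; i++) {
                double z = b;
                for (int j = 0; j < w.length; j++) z += w[j] * x[i][j];
                double p = 1.0 / (1.0 + Math.exp(-z));   // predicted probability of y=1
                double err = y[i] - p;                   // gradient term per sample
                for (int j = 0; j < w.length; j++) w[j] += rate * err * x[i][j];
                b += rate * err;
            }
        }
        System.out.printf("w = [%.3f, %.3f], b = %.3f%n", w[0], w[1], b);
    }
}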
Very interesting discussion here. Some food for thought: if the market can be extremely accurately modeled using a random-walk process that is auto-regressive with both heteroscedastic mu (trend) and sigma (volatility), plus some sort of jump process (maybe QGARCH with a jump process), does this still preclude TA from working?

Let us assume that both the trend and the volatility are heteroscedastic, i.e. not stable. But let us also assume that they are auto-regressive, implying that when a new regime occurs, it is stable for a small portion of time; i.e. 'trends' in delta and volatility, though occurring randomly, exist.

When I draw 'support' and 'resistance,' I think of it as measuring the current supply and demand pressure. These pressures are random (and thus unpredictable), but I can recognize when a REGIME shift has occurred by when my line is broken. Perhaps the mu in my QGARCH process has changed, which would indicate that the process will no longer behave in the same manner. Are my trend-lines a perfect model (they stink of linear regression)? Absolutely not. But they do the trick as long as I am cognizant of what I am trying to achieve.

So does "TA" work? I think many parts of TA work, but for the wrong reasons. The crazier patterns and theories (Fibonacci numbers and candlestick patterns come to mind)? I don't remember who said it in this thread, but to me, they are just shapes in the clouds.
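A toy simulation of the kind of process described above, with slowly autoregressive mu and sigma and occasional jumps; the parameters are made up for illustration, not a fitted QGARCH model:

import java.util.Random;

public class RegimeWalk {
    // Toy random walk whose drift (mu) and volatility (sigma) follow their own
    // slow AR(1) processes, with rare jumps, so regimes persist for a while.
    public static void main(String[] args) {
        Random rng = new Random(1L);
        int n = 1000;
        double price = 100.0, mu = 0.0, sigma = 1.0;

        for (int t = 0; t < n; t++) {
            // Autoregressive drift and volatility: new regimes arrive randomly
            // but decay slowly, i.e. 'trends' in delta and volatility exist.
            mu = 0.98 * mu + 0.02 * 0.5 * rng.nextGaussian();
            sigma = Math.max(0.2, 0.95 * sigma + 0.05 * (1.0 + rng.nextGaussian()));

            // Rare jump component on top of the diffusion step.
            double jump = (rng.nextDouble() < 0.01) ? 10.0 * rng.nextGaussian() : 0.0;
            price += mu + sigma * rng.nextGaussian() + jump;

            if (t % 100 == 0) {
                System.out.printf("t=%4d  price=%.2f  mu=%.3f  sigma=%.3f%n", t, price, mu, sigma);
            }
        }
    }
}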