What is the correct way to analyze...

nitro · Oct 10, 2010

There is no general math section on ET, so I put it here since it seems this is where lots of math talk happens.

Say you have a finite number n of events, x1, x2, ..., xn. On any given day, any combination of these events can occur: x1 & x3 & x7, or x1 & x3 with no x7, or x2 & x3 & x6, or just x10. Each x, by the way, is a structure that contains statistical data, and every x has identical fields.

Say you are pretty certain that the variable you are trying to predict, call it Y, is influenced by these events, but to what degree you don't know. What is the right way to analyse this situation statistically? Factor analysis, PCA, bootstrap, etc.? The real question, imo, is how to aggregate the events.

If I just measure/analyse the Xs individually, I am ignoring the fact that other xs occurred on that day (conditional probability), and I may be assigning too high or too low a probability to this event. Also, it seems as if the Xs should be joined into one overall X per day, OAX = {the sum of all x events that occurred on that day}, and I would then run statistics (whatever hypotheses I am testing) on this aggregate. The problem is, who is to say that means, std, or any other statistical technique is valid on such an aggregate? I could just sum the fields I suppose, but what if the scales are different in each?

I guess I could take the moments of each, and then divide the key numbers by the second moment, STD, giving me a dimensionless number I can then add for each event within the day. I am just not sure if I am GIGO...
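A rough sketch of the scaling idea just described, for anyone who wants to try it. Everything here is made up for illustration: the `days` data, the single field per event, and the function names. It also subtracts each event's mean before dividing by its standard deviation (a plain z-score), which is an extra assumption on top of what the post says.

```python
from statistics import mean, pstdev

# Hypothetical data: each day maps event name -> one field value.
# Events that did not occur on a day are simply not listed.
days = [
    {"x1": 10.0, "x3": 200.0},
    {"x1": 14.0},
    {"x2": 0.5, "x3": 260.0},
    {"x1": 12.0, "x2": 0.7, "x3": 230.0},
]

def zscore_params(days):
    """Mean and standard deviation of each event's value, over the days it occurred."""
    values = {}
    for day in days:
        for name, v in day.items():
            values.setdefault(name, []).append(v)
    return {name: (mean(vs), pstdev(vs)) for name, vs in values.items()}

def daily_aggregate(days):
    """Sum of the standardized (dimensionless) values of the events present each day."""
    params = zscore_params(days)
    totals = []
    for day in days:
        total = 0.0
        for name, v in day.items():
            mu, sigma = params[name]
            # Guard against an event whose value never varies.
            total += (v - mu) / sigma if sigma > 0 else 0.0
        totals.append(total)
    return totals

oax = daily_aggregate(days)
for day, total in zip(days, oax):
    print(sorted(day), round(total, 3))
```

Note that a plain sum throws away which events occurred and tends to grow with how many occurred, so it does not by itself address the conditional-probability worry.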

I hope this is clear. I realize this may require more explanation...

jack hershey · Oct 10, 2010

A fun thing to do is construct a pinwheel.

You can use three lengths of MLR's. 40, 20 and 10 work.

Repeat this for each kind of "x" in your terms. One kind could be price another could be volume. Use separate display panes.

To make it more fun use the OHLC of price to make four separate pinwheels. Superimpose these four pinwheels.

Before charts were available in real time, I used to plot 30 minute charts by hand. I did two of these by overlapping two 30 minute charts offset by 15 minutes. I used four colors for obvious reasons (bars have two ends). I used velocity instead of just displacement values.

Having a plot of the market that was graphed in the future was very helpful.

The general idea of data processing is to create additional degrees of freedom from raw data.

The general drift is to take about 7 degrees of freedom and create 70 degrees and use the 70 degrees selectively as time passes. At any time a subset of 6 to 8 degrees of freedom is all that is needed to "extract" the market's offer fully.

Do NOT use any statistics, nor any induction, ever. Only use deduction and the null hypothesis.

The reason ET does not have a maths forum is that probabilistic mathematics does not work to outperform the market's offer (this is a humorous comment). Dealing with supposed anomalies does not lead to profitability over time.

Mike805 · Oct 10, 2010

Have you looked into something like a Standard Additive Model (SAM)?

Fuzzy logic and the SAM are designed specifically for these types of situations. The utility, however, lies in the initial rule set. If you choose good initial rules you won't run into the curse of dimensionality, and when you finally do achieve a representative model for Y, you can run rule-trimming/rule-removal techniques to further refine the relationships.

intradaybill · Oct 11, 2010

Vague suggestions to vague problems.

The estimation method used depends on the nature of the problem. You are trying to generalize the problem because you do not want to reveal the details of what you are doing - I can guess it by the way - and you are left with an abstract mathematical structure for which there is no specific method to model.

nitro · Oct 11, 2010

Thanks. I am not sure I follow, but I am very close to understanding. If you could elaborate a little more...

(In reply to jack hershey's post above.)

nitro · Oct 11, 2010

Thanks, no I had never heard of them.

But just to be clear, I can apply all sorts of statistics to the numbers, to be sure. What I am worried about is how not to miss the conditional property of the data (which I am certain exists), and how best to make inferences on data where some days x1 & x2 are present, some days only x1, some days x1 & x2 & x3, and some days none of them.

Most statistical tools I have seen assume the matrix is not sparse. They want a value for x1 even if I have to enter a zero manually. It is this problem that I am really asking about: how to "massage" the data correctly to make sure I am not just fooling myself by applying statistics incorrectly.

(In reply to Mike805's post above.)

nitro · Oct 11, 2010

Thanks. I realize that not directly talking about the numbers would cause general responses that aren't helpful, but there is probably a theory out there with a specific example...

It is not the actual model I am after, but how to prepare the data for a model, since the matrix is sparse or jagged.
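One common way to prepare jagged data like this is to give every event two columns: a 0/1 "occurred" flag and the value itself, with a placeholder when the event is absent. A minimal sketch follows, with made-up data and a hypothetical helper name `to_design_rows`. It is one possible encoding, not necessarily the right one for any particular model.

```python
# Hypothetical jagged input: one dict per day, event name -> measured value.
days = [
    {"x1": 1.5, "x2": -0.3},
    {"x1": 0.8},
    {"x1": 2.1, "x2": 0.4, "x3": 1.0},
    {},  # a day on which none of the events occurred
]
events = ["x1", "x2", "x3"]

def to_design_rows(days, events):
    """Return one fixed-width row per day: [z_x1, v_x1, z_x2, v_x2, ...].

    z is 1 if the event occurred that day and 0 otherwise; v is the event's
    value, or 0.0 as a placeholder when the event is absent. Because z is
    also in the row, a model can tell "absent" apart from "present with
    value 0".
    """
    rows = []
    for day in days:
        row = []
        for name in events:
            present = name in day
            row.append(1 if present else 0)
            row.append(day[name] if present else 0.0)
        rows.append(row)
    return rows

rows = to_design_rows(days, events)
for row in rows:
    print(row)
```

The resulting matrix is rectangular, so it can be fed to tools that refuse jagged input, while the flag columns keep the "did it happen at all" information that a bare zero would hide.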

(In reply to intradaybill's post above.)

nitro · Oct 11, 2010

I could do a t-test for missing values when they are not there, and fill in that number. I don't know if this is better than entering a zero for the "missing" values....

nitro · Oct 11, 2010

Maybe something like this is best, but it assumes additivity...

http://www.ruf.rice.edu/~branton/interaction/faqintro.htm
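For two events, the interaction idea can be sketched directly from cell averages. With 0/1 indicators z1 and z2, the model Y = b0 + b1*z1 + b2*z2 + b12*z1*z2 has as many coefficients as there are presence patterns, so its least-squares fit reproduces the four cell means exactly. The observations and function names below are invented for illustration.

```python
from statistics import mean

# Hypothetical observations: (x1 occurred?, x2 occurred?, observed Y).
obs = [
    (0, 0, 0.1), (0, 0, -0.1),
    (1, 0, 1.0), (1, 0, 1.2),
    (0, 1, 0.4), (0, 1, 0.6),
    (1, 1, 3.0), (1, 1, 3.2),
]

def cell_mean(obs, z1, z2):
    """Average Y over the days with exactly this presence pattern."""
    return mean(y for a, b, y in obs if (a, b) == (z1, z2))

def two_way_effects(obs):
    """Coefficients of Y = b0 + b1*z1 + b2*z2 + b12*z1*z2 from the four cell means."""
    m00 = cell_mean(obs, 0, 0)
    m10 = cell_mean(obs, 1, 0)
    m01 = cell_mean(obs, 0, 1)
    m11 = cell_mean(obs, 1, 1)
    return {
        "b0": m00,                    # baseline: neither event
        "b1": m10 - m00,              # effect of x1 alone
        "b2": m01 - m00,              # effect of x2 alone
        "b12": m11 - m10 - m01 + m00, # extra effect when both occur together
    }

effects = two_way_effects(obs)
print(effects)
```

If b12 comes out near zero, the two events look additive; a large b12 means that analysing x1 and x2 separately would misstate their joint effect. With many events the number of interaction terms grows quickly, so in practice only a few would be included.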

kut2k2 · Oct 26, 2010

(In reply to nitro's original post above.)

What you want is called an ARIMAX model (AutoRegressive Integrated Moving Average with eXogenous inputs):

Y = ARIMA (p,d,q) + c1*Z1 + c2*Z2 + ... + cn*Zn ,

where Zi = 1 when Xi is present and Zi = 0 when Xi is absent.

I think you need a minimum of 50 data points to make an ARIMA model, so you'll probably want at least 50 + n for ARIMAX.
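To make the dummy-variable idea concrete, here is a much-reduced relative of that model: an AR(1) term on Y plus the 0/1 inputs, fitted by ordinary least squares. It has no differencing and no moving-average part, so it is not a full ARIMAX fit (a statistics package would be the tool for that). The series is synthetic and noise-free so the fit can be checked against known coefficients; all names and numbers are assumptions for the sketch.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - s) / m[r][r]
    return x

def ols(rows, y):
    """Least-squares coefficients via the normal equations (X'X) beta = X'y."""
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * t for r, t in zip(rows, y)) for i in range(k)]
    return solve(xtx, xty)

# Synthetic example: Z1 and Z2 flag the days on which events x1 and x2 occur.
n_days = 60
z1 = [1 if t % 3 == 0 else 0 for t in range(n_days)]
z2 = [1 if t % 5 < 2 else 0 for t in range(n_days)]

# Known generating process: Y_t = 0.2 + 0.5*Y_{t-1} + 1.0*Z1_t - 0.5*Z2_t
y = [0.0]
for t in range(1, n_days):
    y.append(0.2 + 0.5 * y[t - 1] + 1.0 * z1[t] - 0.5 * z2[t])

# Regress Y_t on [1, Y_{t-1}, Z1_t, Z2_t].
rows = [[1.0, y[t - 1], z1[t], z2[t]] for t in range(1, n_days)]
const, phi, c1, c2 = ols(rows, y[1:])
print(round(const, 4), round(phi, 4), round(c1, 4), round(c2, 4))
```

On real data the coefficients would come with estimation error, and the absent-event days contribute through Zi = 0 rather than through a made-up value.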