what test for 3-4 variables predicting an outcome of 0 or 1

zedDoubleNaught · Aug 7, 2011

I think the target math is some form of multiple regression, like a linear regression but fitting a line to n-dimensions instead of only 2 dimensions. Hoping there may be a name for it, or a name for a better process.

case:
- time series, in days
- an event occurs or not for a given day, so gets a value of 0 or 1
- several other variables (indicator values) that can have any value, which are measured in the early part of the day

goal:
- which of the other variables show some relationship to the event occurring or not?
- is there a formula to get a prediction value?

If the case were one variable predicting the event, I could do something simple, like the variable is higher or lower than a level. In my case, I've currently got 3 variables, and may add more. It seems too simple to say higher or lower for each individually, which is what I can do with backtesting software. Is there math, a stat test, or a process to figure out which variables are important, or if there is some effect from combining them?

I have Eureqa, but usually I don't have much luck when the outcome is 0 or 1. I also have Octave, which may have better functions for this type of task.

thanks in advance -- stat test names or wikipedia articles to get me started are very helpful, and sorry if my description does not make much sense, my statistics knowledge is not very deep, I'll try to study more

kickout · Aug 7, 2011

i think what you're looking for is discriminant analysis. it's a multivariate statistical method, where the predictor variable may be quantitative or qualitative.

not quite sure on how to incorporate time series, but i assume it'd be straightforward

black diamond · Aug 8, 2011

You probably want to try a logit or probit model.

Joman · Aug 8, 2011

I would use dummy variables:

http://en.wikipedia.org/wiki/Dummy_variable_(statistics)

zedDoubleNaught · Aug 8, 2011

thanks for the leads -- these sound like models or tests close to my case.

zedDoubleNaught · Aug 8, 2011

If anyone's interested, here's what I've found so far:
QTOctave has a "regress" function, which does a multiple regression. But I'll have to study some more to make sense of the values it returns. It looks like it gives an F score, to help determine if the multiple regression is significant at a specified p level. I'm not quite sure yet where the correlation coefficients are.

Before that, it's also important to get the cross correlations -- the predicted value (also called the "criterion", according to my old stat book) could be predicted by just one of the variables. Or, some of the variables may correlate with each other, in which case there is no need to include one of them, since it does not add any new info.

In Octave, load your data in a table. Then do "corrcoef(yourDataTable)" and it will return the cross correlations. If your criterion is its own vector variable, you can do "corrcoef(criterionTable, dataTable)", and looks like it gives the correlations of the criterion to each other variable individually.
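For anyone without Octave handy, here is a rough Python sketch of the same cross-correlation step. The data is made up purely for illustration, and the names (event, ind_a, ind_b, ind_c) are hypothetical. It is not Octave's corrcoef, just the plain Pearson formula applied to every pair of columns.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

def correlation_matrix(columns):
    """Correlation of every column with every other column."""
    return [[pearson(a, b) for b in columns] for a in columns]

# Made-up illustration data: one list per variable, one entry per day.
event = [0, 1, 0, 1, 1, 0, 1, 0]                    # the 0/1 outcome (criterion)
ind_a = [1.0, 2.5, 0.8, 3.1, 2.9, 1.2, 2.2, 0.5]
ind_b = [2.1, 5.0, 1.5, 6.3, 5.9, 2.3, 4.5, 1.1]    # roughly 2 * ind_a, so redundant
ind_c = [7.0, 3.0, 5.0, 6.0, 2.0, 4.0, 8.0, 1.0]

matrix = correlation_matrix([event, ind_a, ind_b, ind_c])
for row in matrix:
    print(" ".join(f"{value:6.2f}" for value in row))
```

The first row (or column) shows how each indicator correlates with the event by itself. A large off-diagonal value between two indicators, like ind_a and ind_b here, suggests one of them adds little new info.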

I'm still not sure how to deal with the outcome being 0 or 1, I think it will affect the regression. Will look into dummy variables more. Or I may be able to characterize my event on a 10-point scale, so the outcome has more variance.

black diamond · Aug 9, 2011

I still think a logit regression is your best bet. The problem with a normal regression is you will probably get predicted values < 0 or > 1. A logit (or probit, they are similar) doesn't have this issue - predicted values are all between 0 and 1. And the output can be interpreted as the probability of observing a 1 as a function of the inputs.
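To make the difference concrete, here is a minimal Python sketch comparing the two on one made-up predictor. Everything in it is an assumption for illustration: the data, the function names, and the use of plain gradient ascent to fit the logit (real packages use faster methods such as iteratively reweighted least squares).

```python
import math

def sigmoid(z):
    """Logistic function: maps any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def fit_linear(x, y):
    """Ordinary least squares for one predictor: y ~ a + b*x."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    b = sxy / sxx
    return mean_y - b * mean_x, b

def fit_logit(x, y, rate=0.1, steps=5000):
    """Logistic regression for one predictor, fitted by gradient ascent
    on the average log-likelihood."""
    a, b = 0.0, 0.0
    n = len(x)
    for _ in range(steps):
        errors = [yi - sigmoid(a + b * xi) for xi, yi in zip(x, y)]
        a += rate * sum(errors) / n
        b += rate * sum(e * xi for e, xi in zip(errors, x)) / n
    return a, b

# Made-up data: indicator value per day, and whether the event occurred.
x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [0, 0, 0, 1, 0, 1, 1, 1]

a_lin, b_lin = fit_linear(x, y)
a_log, b_log = fit_logit(x, y)

for value in (0, 10):
    linear_pred = a_lin + b_lin * value
    logit_pred = sigmoid(a_log + b_log * value)
    print(f"x={value}: linear={linear_pred:.3f}  logit={logit_pred:.3f}")
```

For indicator values outside the observed range, the straight line drifts below 0 or above 1, while the logit prediction always stays strictly between them.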

I am not familiar with Octave so I don't know how hard it is, but it is pretty straightforward in SAS once you know how to do a linear regression.

zedDoubleNaught · Aug 9, 2011

Hi black_diamond, thanks for logit regression, that sounds like the best one to use, based on summary from:
http://en.wikipedia.org/wiki/Logit_regression
I didn't mean to disregard your post -- just that I spent yesterday starting from the beginning, studying multiple regression. The logit looks like an extension to the case of predicting an outcome happening (1) or not (0), just like my case. It will take me a few days to study up on this to post back. I hope Octave has a logit function; if I find it, I'll post for everyone.

I'll take a look at the suggestion for discriminant analysis too, but that looks like quite a few steps beyond my current level, so unfortunately it may take quite a bit of time for me to understand it well enough.

zedDoubleNaught · Aug 9, 2011

Wow check this out -- no need for octave, use this page instead:

http://statpages.org/logistic.html

just copy and paste your data into a box, and it generates a bunch of stats for you. Much easier than figuring out how to do it, but drawback is extra effort required in choosing what data and getting data into it. Now to see if I can find the variables that are significant predictors above chance ....

Note to self -- always google search for "(desired stat) test calculator" before trying to figure out how to code it or do it in Excel.

One of the descriptive stats it gave me was:

Overall Model Fit...
Chi Square= 1.9892; df=3; p= 0.5747

So, I think with the current 3 predictor variables I have, the chi-square test at 3 degrees of freedom says a fit this good would show up about 57.47% of the time even if the variables had no real relationship to the event ... hmm, that sounds pretty high, I think I need to search for different variables.
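As a sanity check, the reported p-value can be reproduced from the chi-square statistic. The sketch below is an assumption-laden shortcut: it uses a closed-form upper-tail formula that only holds for 3 degrees of freedom, and the function name chi2_sf_df3 is made up. For other df a stats library would be needed.

```python
import math

def chi2_sf_df3(x):
    """Upper-tail probability P(X >= x) for a chi-square variable
    with exactly 3 degrees of freedom (closed form, df = 3 only)."""
    return (math.erfc(math.sqrt(x / 2.0))
            + math.sqrt(2.0 * x / math.pi) * math.exp(-x / 2.0))

# Chi-square statistic reported by the calculator page for the 3-predictor model.
p_value = chi2_sf_df3(1.9892)
print(f"p = {p_value:.4f}")
```

A p-value this far above the usual 0.05 cutoff means the three predictors together do not fit the events noticeably better than a model with no predictors at all.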