Hi all, Seemingly simple question. For multivariate (3+) inputs, how many samples are required? According to one source, if 9 samples (observations) are required with one variable (parameter), 9^2 (81) are required with two variables and 9^3 (729) with three variables. So. How does one determine the original number (9). And is the exponential relationship (squared, cubed) always the case? Wouldn't this depend on the variance or covariance or some other factor(s)? Is there an equation for "minimum number of samples needed"? http://www.google.com/search?q=curse+dimensionality+number+of+samples+needed Thanks! Stox

Stox, The original #, ex 9, is determined by how flexible/curvy the object you're trying to identify is. For example you need fewer points to fit a linear regression model than a neural network model. The neural network is more flexible/curvy in general. You're right that it depends on the covariance of the factors but it's hard to say anything more specific than that since estimating the covariance itself is subject to the curse of dimensionality... That's why standard mean-variance/Markowitz portfolio optimization on a large universe of stocks performs so badly out of sample. So you have to impose some assumptions of smoothness on your space to reduce the effective dimensionality- ie. regularization/shrinkage. This 'curse' is an unsolved issue and much of modern machine learning research is focused on dealing with it. The rise of support vector machines vs neural nets is driven by it. Which is to say that there's not a great solution (so don't calculate too many indicators, etc). You need to develop a heuristic feel for the number of samples needed. I posted a better explanation of the problem from a great textbook, The Elements of Statistical Learning, on my blog, in case you're interesting in learning more: http://www.maxdama.com/2008/12/weirdness-is-curse-of-dimensionality.html Regards, Max

Nice to hear you still post around here, Max. I would argue that in most cases, the effort is not so much to increase the sample size to overcome this problem, but to find methods to minimize the input attributes (PCA, dimensionality reduction, AIC, etc). It also improves computational efficiency to deal with the problem by feature reduction. Also, a good rule of thumb from basic statistics is to use a bare minimum of 30 samples for each attribute (in trading data, you should have far more available).

The 9 samples in a single dimension is just used to illustrate the point. The minimum required samples for constructing a predictive model depends on the modeling algorithm being used, the nature of the data itself, the attribute selection process, what performance is the minimum accepted, etc. Note, too, that the exponential progression from 9 to 27 to 81 assumes that the input variables are completely uncorrelated with each other, which would be unusual in practice. More likely, the input variables are at least somewhat correlated, implying that less samples would be needed, for instance, 9 - 16 - 35, etc.

The only specific number I can think of comes from the Central Limit Theorem. As to the actual problem, as I understand it, there's a variety of ways to circumvent/work around the problem, but no actual solution. Personally, PCA/ICA is something I find useful.

Thanks for the responses! So far "dimensionality reduction" and "minimum statistical sample size" have yielded decent google results. Here are some online calculators similar to what I thought I was looking for: http://www.raosoft.com/samplesize.html http://www.wessa.net/rwasp_sample.wasp This thread has pointed me in the right direction of reducing dimensions rather than increasing sample size. Rather than leaving number of dimensions (input variables) the same and trying to increase number of samples, I'd be better off decreasing the number of dimensions until reasonable margin of error is achieved...

Have you looked at the variety of Standard Additive Models? http://sipi.usc.edu/~kosko/FuzzyUniversalApprox.pdf http://ieeexplore.ieee.org/Xplore/l...434342.pdf?arnumber=4434342&authDecision=-203 An interesting property of these models (SAMs for shorts) is they can weigh rules and remove "correlated" or redundant ones. Depending on your training algo, the curse of dimensionality can also be dealt with (up to a certain point of course). Mike

Anything that supports nonlinear model I'm considering. Also I will be using WEKA and/or R. So yeah nonlinear additive, nonlinear PCA, other suggestions welcome... Search for non-linear dimensionality reduction: http://www.google.com/search?hl=en&q=non-linear+dimensionality+reduction

If you are using Weka, there is a built in GA algorithm and a few other dim. reduction techniques built in. http://intelligenttradingtech.blogspot.com/ I have a lot of Weka examples posted here. There are also many feature reduction techniques in R, incl. PCA. GL