http://www.secondmoment.org/articles/ann.php

ANNs: A Little Knowledge Can Be A Dangerous Thing
Posted by Dr. Halbert White

Introduction

When it comes to analytics, neural networks have to be one of the hottest commodities going. As of this writing, they are being used by everyone from MasterCard and American Express to Wal-Mart and KayBee Toys for everything from fraud detection and medical diagnosis to product marketing and stock forecasting. Still, while neural networks are undoubtedly a powerful tool, their use is laden with pitfalls, and as Dr. Halbert White points out, like all statistical techniques, they must be applied with both knowledge and care.

White, a professor of economics at the University of California, San Diego and a Senior Partner of Bates, White, and Ballentine, a business consulting firm, is one of the world's foremost experts on neural networks. He is also a member of the advisory board of Stone Analytics, sponsor of Second Moment, which recently sat down with Dr. White to discuss the current state of the art.

A Little Background

Neural networks, or, more precisely, artificial neural networks (ANNs), are a type of parallel-processing computing system based on a mathematical abstraction of the manner in which biological nervous systems function. Human and other animal nervous systems are composed of neurons, or nerve cells, each of which is connected by thousands of different synaptic inputs and outputs to other neurons. Likewise, ANNs, though several orders of magnitude less complex than their biological counterparts, are based on collections of individual processing units united in highly interconnected networks: networks capable of learning, memorizing, and actually recognizing relationships among data.

The beauty and significance of ANNs is clear: like actual biological nervous systems, they possess the potential to perform three very important tasks. They can learn by adapting their "synaptic" weights to changes in the environment; they can manage imprecise, probabilistic information; and they can generalize from known tasks to unknown ones. And, indeed, as statistical techniques go, they can be uncommonly good at dealing with a variety of real-world situations, such as handling nonlinearities and noisy data. They are also excellent at working with large numbers of variables or factors and, perhaps most important, at ascertaining the interactions among variables. All that being said, however, they are no silver bullet, and misapplied they can produce results of little or no meaning.

"Neural networks have a wonderful property, which is their universal approximation capability. By that I mean, using a single hidden layer feedforward network, you can approximate any function arbitrarily well," explains White. "Even with ones that contain a fair amount of noise, given sufficient examples, you can achieve optimal approximation using standard statistical estimation techniques. These go by various names depending on what discipline you're in. One name is back propagation, which is a standard algorithm for training the network. Statisticians often call it nonlinear least squares. In either case, this ability to approximate just about any function you want means that the relationship between the dependent variable (the target) and the explanatory variables (the input) can be nonlinear. There can be interaction of an arbitrary degree, and the neural network is capable of extracting those relationships and approximating them."

The Bad News

"The real drawback to using artificial neural networks," White continues, "is that to do the estimation, you must train the network, which requires optimizing a nonconvex function, something that people in the field know is not an easy problem.
Typically what this means is that you can fairly easily arrive at a local optimum, often a good local optimum, but depending on where you start, the results can differ. In other words, if you start at two different points in the network weight space and go through the training exercise or the optimization routine, you might end up with two different sets of estimated coefficients or trained network weights.

"One means for dealing with this is the so-called multistart method, which is a provably effective way to arrive at a global optimum. Basically what you do is begin with multiple starting points and then let the thing go and see which ones converge and which ones don't. Of the ones that do converge, you pick the best one, or as Leo Breiman (of the University of California, Berkeley) has suggested, you can combine them. The idea is to take a whole bunch of neural networks that you've trained, maybe 500 or 1000, and average the results so that in the end what you have is something much more reliable and robust than the result of training just one network. Still, while this is a workable solution, in my mind it's not very appealing because what you've done is taken a problem that may take two or three hours to solve and multiplied it by a thousand, and that's just computing time. There's also the time to do the optimization itself, which typically involves a great deal of tweaking, not to mention false starts."

Over Fitting

Another means for dealing with the problem of local versus global optima is to perturb the point in question and to see if the function continues to return to that same point. But here again, White sees that as extremely time intensive. Moreover, as he goes on to point out, "the nature of the optima on a very fine scale can be quite irregular. You can find yourself hopping in and out of very small local optima, like a saw tooth. However fine the resolution, it can still persist. So what does it mean in this case for it to be an optimum?
The right answer is probably some smoothing of the objective function, which is perhaps what you should care about."

"The thing about neural networks is that they are a great data mining tool if you have the time and patience. So you can turn out a whole bunch of different models. For instance, you might try different tweaks for the training parameters, or different preprocessing steps, or give the learning algorithm different sets of inputs to play with. As you go through the process you are generating models that work well or not so well. If they do not work so well you can keep going until you get something that does work well. The danger, though, is that ultimately you will end up with a network that is basically fitting the noise. Even cross validation is not by itself a guarantee against this, because once you go back and revisit the cross validation set over and over again, you're going to be fitting the noise that it contains. So there is this pitfall."

<font color=red> "Let me give you an illustration. Let's suppose you want to trade the S&P 500 on a daily basis. Basically what you want to know is your average continuously compounded return from one day to the next, where each daily return is calculated as the logarithm of the net asset value on day t divided by the net asset value on day t minus 1. You want to maximize this average over a particular time frame, which tells you what your target is, and then to some degree you are going to be focusing on being able to predict when that's large or small. One simple way that people attempt to do this is with what is called a moving average indicator. A moving average is an average of prices over a period of n days, and by comparing the moving average over n days with a moving average for a smaller number of days you supposedly get an indication of whether the market is headed up or down. If the short moving average is above the long moving average, it means that the market is moving up and you want to buy.
If the short moving average is below the long moving average, it means the market is moving down, and you want to sell. So, for example, you can pick two days for the short moving average and 10 days for the long moving average and buy or sell whenever the short moving average crosses the long moving average. This will create a series of buy and sell signals, which will determine your portfolio position over time, which in turn will determine what your net asset values are, which in turn will determine what your performance measure is. Next you can look over all the different combinations of numbers of days for the short and long moving averages to see which one gives you the best performance. The ultimate issue, though, is whether the apparently good performance you ultimately find is real or merely in line with the random variation you would expect to find, given all the scenarios you've looked at. The sad answer is usually that it's in line with the random variation." </font>

Gold or Fool's Gold?

"There are entire literatures and industries built around so-called technical trading indicators, and a lot of new versions of things that you can buy to put into your program that will crank out these indicators and that will calculate the profits you would have made if only you had done that ahead of time. And it's not simply stock market predictions. Whether it's credit card fraud, mortgage fraud, or CRM (Customer Relationship Management), the danger with neural networks is that if there is something there, you will find it, and if there is nothing there you will also find something. So while there's no question that neural networks are a powerful tool for discovering relationships within a collection of data, their very power makes them dangerous. Great care has to be taken to ensure that what one finds is truly gold and not simply fool's gold."
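
The moving-average illustration can be made concrete with a small simulation. The Python sketch below is not from the interview. It assumes simulated prices that follow a driftless random walk (so no rule has any real edge), it reads "buy" as holding the index and "sell" as moving to cash, and the helper names (moving_average, mean_log_return, best_rule) are hypothetical. It searches every short/long window pair on the first half of the data, then checks the winning pair on the second half.

```python
import math
import random


def moving_average(prices, n):
    """Trailing n-day simple moving average; the first n-1 entries are None."""
    out = [None] * len(prices)
    window_sum = 0.0
    for i, p in enumerate(prices):
        window_sum += p
        if i >= n:
            window_sum -= prices[i - n]   # drop the price that left the window
        if i >= n - 1:
            out[i] = window_sum / n
    return out


def mean_log_return(prices, short_n, long_n):
    """Average daily log return of a long-or-cash crossover rule.

    Assumption: hold the index on day t if the short average closed above
    the long average on day t-1; otherwise hold cash (zero return).
    """
    ma_short = moving_average(prices, short_n)
    ma_long = moving_average(prices, long_n)
    returns = []
    for t in range(long_n, len(prices)):
        held = ma_short[t - 1] > ma_long[t - 1]      # signal known at t-1
        r_t = math.log(prices[t] / prices[t - 1])    # r_t = log(P_t / P_{t-1})
        returns.append(r_t if held else 0.0)
    return sum(returns) / len(returns)


def best_rule(prices, max_long=30):
    """Score every (short, long) pair with short < long; return the best pair."""
    results = {
        (s, l): mean_log_return(prices, s, l)
        for l in range(2, max_long + 1)
        for s in range(1, l)
    }
    return max(results, key=results.get), results


# Pure-noise prices: a driftless random walk in logs (an assumption made
# here so that any "good" rule is known to be an artifact of the search).
rng = random.Random(0)
prices = [100.0]
for _ in range(1999):
    prices.append(prices[-1] * math.exp(rng.gauss(0.0, 0.01)))
train, test = prices[:1000], prices[1000:]

(best_s, best_l), in_sample = best_rule(train)
best_in = in_sample[(best_s, best_l)]
best_out = mean_log_return(test, best_s, best_l)

print(f"rules searched: {len(in_sample)}")
print(f"best rule in-sample: short={best_s}, long={best_l}")
print(f"in-sample mean log return:     {best_in:.5f}")
print(f"out-of-sample mean log return: {best_out:.5f}")
```

Because the simulated prices are noise by construction, whichever rule wins in-sample owes its apparent performance to the search over many rules, and there is no reason to expect the out-of-sample figure to match it. Holding back data that the search never touches is one simple check; it does not replace a formal correction for the number of rules examined.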