Also, depending on how many data streams you have, you may want to consider coarse-grained parallelism. If you are processing multiple independent streams, you might be better off building a cluster of cheap AMD machines and running each stream's FFT on its own machine, since the FFT of one stream needs nothing from the others. On the other hand, if you mainly want to speed up the FFT of a single data stream, then the other suggestions here are probably your best bet.
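To make the "one stream per worker" idea concrete, here is a rough sketch in Python. It is only an illustration under a few assumptions: worker processes on a single machine stand in for the separate cluster machines, the names `fft` and `fft_per_stream` are made up for this example, and the FFT is a plain textbook radix-2 routine rather than an optimized library.

```python
import cmath
from concurrent.futures import ProcessPoolExecutor


def fft(samples):
    """Textbook recursive radix-2 FFT (illustrative, not optimized).

    The input length must be a power of two.
    """
    n = len(samples)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a nonzero power of two")
    if n == 1:
        return [complex(samples[0])]
    even = fft(samples[0::2])   # transform of even-indexed samples
    odd = fft(samples[1::2])    # transform of odd-indexed samples
    half = n // 2
    out = [0j] * n
    for k in range(half):
        twiddle = cmath.exp(-2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + twiddle
        out[k + half] = even[k] - twiddle
    return out


def fft_per_stream(streams, workers=None):
    """Transform each stream independently, one task per stream.

    Hypothetical helper: workers=1 runs serially; otherwise the streams
    are handed out to a pool of worker processes (a stand-in for the
    separate machines in a cluster).
    """
    if workers == 1:
        return [fft(s) for s in streams]
    with ProcessPoolExecutor(max_workers=workers) as pool:
        # Each stream is a separate task; tasks never talk to each other.
        return list(pool.map(fft, streams))
```

You would call it as something like `fft_per_stream([stream_a, stream_b, stream_c])`, ideally from under an `if __name__ == "__main__":` guard so it also works on platforms that spawn fresh worker processes. Note that this only helps throughput across streams; each individual FFT runs no faster than before.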