http://www.raritan.com/products/power-distribution/intelligent-rack-pdus?utm_campaign=70150000000iCnh&utm_medium=Blog&utm_source=Basic vs. Intelligent&utm_content=&utm_term=

One problem I see is how to design a PDU that can connect thousands of PiZeros to it. Having to deal with all those transformers is a huge pain in the ass. You don't want to deal with thousands of these guys in the lower left-hand corner. You just want the cable in the lower right-hand corner connecting directly from the PiZero to a PDU/USB hub. In essence, you want an intelligent, high-quality powered USB hub in a rack PDU form factor that can handle large volumes of connections, but just for the power. If someone would design a PDU that took just the cable from the PiZero with no transformer, then, given that these computers have such low power ratings, it could have, say, 128+ power receptacles. The goal is a 1U chassis with a bunch of PiZeros in it that can be racked easily, cooled easily, maintained easily, and deliver something like 128 to 256 cores per 1U for a fraction of the price/power of an equivalent cluster built from traditional computers. Dreaming a little more, wireless power would RULE. Even something like what you do to power your phone at Starbucks, simply laying it on a charging pad, would be ideal in this case.
Here's a review of the Zero: https://www.phoronix.com/scan.php?page=article&item=raspberry-pi-zero&num=1
So that people don't think this is just fun and games, consider one of my systems. I am able to analyse 3000 symbols (for now; the number can easily double). Say I want to analyse a year's worth of tick data. Say a day has an average of 10,000 Bid/Ask quotes per symbol. Say I am able to push about 10 B/A quotes a second through the entire backtesting framework. Here is the problem:

3000 symbols * 10,000 quotes = 30 million quotes a day
30,000,000 * 250 trading days a year = 7,500,000,000 Bid/Ask quotes a year
I can do 10 a second, so 7,500,000,000 / 10 = 750,000,000 seconds to finish
750,000,000 / 60 = 12,500,000 minutes to finish
12,500,000 / 60 = 208,333 hours to finish
208,333 / 24 = 8,680 days to finish
8,680 / 365 = about 24 years to finish

Now imagine I had a one-thousand-node cluster. Since this is an embarrassingly parallel computation, I can divide 8,680 / 1,000 to get just about 9 days to push an entire year's worth of tick data on 3000 symbols through the system end to end. That I can live with. The program is already cluster aware. I just need the nodes. AWS/EC2 is very expensive.
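For anyone who wants to plug in their own numbers, here is the same back-of-envelope calculation as a tiny C++ program. The constants are just the figures above; change them to match your own setup:

```cpp
#include <cstdio>

int main() {
    // Figures from the post above -- adjust to taste.
    const double symbols        = 3000;    // instruments to analyse
    const double quotes_per_day = 10000;   // avg Bid/Ask quotes per symbol per day
    const double trading_days   = 250;     // trading days per year
    const double quotes_per_sec = 10;      // end-to-end backtest throughput
    const double nodes          = 1000;    // cluster size (embarrassingly parallel)

    const double total_quotes = symbols * quotes_per_day * trading_days;  // 7.5e9
    const double seconds      = total_quotes / quotes_per_sec;
    const double days_single  = seconds / 86400.0;                        // ~8,680 days
    const double days_cluster = days_single / nodes;                      // ~8.7 days

    std::printf("Total quotes/year : %.0f\n", total_quotes);
    std::printf("Single node       : %.0f days (~%.1f years)\n",
                days_single, days_single / 365.0);
    std::printf("%d-node cluster   : %.1f days\n", (int)nodes, days_cluster);
    return 0;
}
```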
10 per second is very slow. Are you using asynchronous/parallel coding practices? You should review the code and see if there are any areas that need algorithm improvement and code optimization. As suggested earlier, GPU computing is very often a better choice than cluster computing. Before spending time and money on hardware, use a smaller sample that is manageable on standard equipment. If your method does not show promising results on a small sample, then running the same thing on a massive data set is not going to make a difference. I would suggest focusing on algorithm and program optimization first, rather than on hardware.
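To make the parallel/async suggestion concrete, here is a minimal sketch of fanning the per-symbol runs out over threads on a single machine. backtest_symbol and the symbol list are hypothetical stand-ins for whatever the real system does; the only assumption is that different symbols share no mutable state:

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdio>
#include <string>
#include <thread>
#include <vector>

// Hypothetical stand-in for the real per-symbol backtest.
void backtest_symbol(const std::string& symbol) {
    std::printf("backtesting %s\n", symbol.c_str());
}

void backtest_all(const std::vector<std::string>& symbols) {
    const unsigned n_workers = std::max(1u, std::thread::hardware_concurrency());
    std::vector<std::thread> workers;

    for (unsigned w = 0; w < n_workers; ++w) {
        workers.emplace_back([&symbols, w, n_workers] {
            // Each worker takes every n_workers-th symbol (simple static partition).
            for (std::size_t i = w; i < symbols.size(); i += n_workers)
                backtest_symbol(symbols[i]);
        });
    }
    for (auto& t : workers) t.join();
}

int main() {
    backtest_all({"AAPL", "MSFT", "SPY"});   // placeholder symbol list
    return 0;
}
```

This only buys a factor of roughly the core count per box, which is the 3x-30x the original poster mentions, not the 1000x he needs, but it is the same partitioning idea that a cluster or GPU would use.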
10 a second is slow, but trust me, it has to be this way. I might be able to get it to 30 a second, but what good would that do me? I need it to be 1000x faster, not 3x. As far as running on a GPU, it is possible, but so is going to Mars. The amount of work to port to a GPU is probably intense. The system is extremely promising. If it were just promising, I wouldn't waste the effort trying to test it at the intensity described here. Take note: this is running the exact same algorithm on different data, so it is embarrassingly parallel.
Embarrassingly parallel sounds like a good candidate for the GPU, because that is what a GPU is all about. The only issue is moving 15-30 GB back and forth between CPU and GPU quickly. Programming a GPU is much easier today with CUDA or C++ AMP, for example, since it is C/C++ code with a few extensions, but it can also be done in many other languages using different GPU bindings. Most of the time algorithms use hybrid CPU/GPU computation, since not all computations are best suited for the GPU. I just added these remarks for completeness. The theoretical single-GPU speedup is definitely not 1000x, but rather 100x, maybe more (theoretical! it all depends on the algorithm type), and a single computer can have multiple GPUs working in parallel. In your scenario, cluster computing might be better suited to solve the problem. On a technological note, Intel produces the Xeon Phi with around 50 cores (real cores, as opposed to streamlined GPU cores) for parallel computing, but the cost might be prohibitive. A GPU is a much cheaper supercomputer than anything else.
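To give a feel for what "C/C++ code with a few extensions" means, here is a toy CUDA kernel. The quote arrays and the per-quote mid-price computation are made up purely for illustration; a real backtest would be far more involved and, as noted, would have to worry about shipping those 15-30 GB to the card:

```cpp
#include <cuda_runtime.h>
#include <cstdio>

// Toy per-quote computation -- stands in for whatever the backtest does per Bid/Ask pair.
__global__ void mid_price(const float* bid, const float* ask, float* mid, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;   // one thread per quote
    if (i < n)
        mid[i] = 0.5f * (bid[i] + ask[i]);
}

int main() {
    const int n = 1 << 20;                 // ~1M quotes for the demo
    const size_t bytes = n * sizeof(float);

    float *bid, *ask, *mid;
    cudaMallocManaged(&bid, bytes);        // unified memory keeps the demo short
    cudaMallocManaged(&ask, bytes);
    cudaMallocManaged(&mid, bytes);
    for (int i = 0; i < n; ++i) { bid[i] = 100.0f; ask[i] = 100.5f; }

    const int threads = 256;
    const int blocks  = (n + threads - 1) / threads;
    mid_price<<<blocks, threads>>>(bid, ask, mid, n);
    cudaDeviceSynchronize();

    std::printf("mid[0] = %.2f\n", mid[0]);
    cudaFree(bid); cudaFree(ask); cudaFree(mid);
    return 0;
}
```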
I don't know anything about the Phi, but I am guessing developing for it can't be the same as developing for a multicore CPU. If it were, why not just put 50 cores into a Xeon? As far as cost, I estimate that a 1000-core PiZero cluster would cost around $15,000. I doubt a Phi costs that much. On the other hand, it is [distributed] 1000 cores. I have no idea where I would find 1000 outlets though. LOL. That is why the PDU has to be designed for this sort of thing first.