http://www.raritan.com/products/power-distribution/intelligent-rack-pdus?utm_campaign=70150000000iCnh&utm_medium=Blog&utm_source=Basic vs. Intelligent&utm_content=&utm_term=

One problem I see is how to design a PDU that can connect thousands of PiZeros to it. Having to deal with all those transformers is a huge pain in the ass. You don't want to deal with thousands of these guys in the lower left-hand corner. You just want the cable in the lower right-hand corner connecting directly from the PiZero to a PDU/USB hub. In essence, you want an intelligent, high-quality powered USB hub in a rack PDU form factor that can handle large volumes of connections, but just for the power. If someone would design a PDU that took just the cable from the PiZero with no transformer, then, given that these computers have such low power ratings, it could have, say, 128+ power receptacles. The goal is a 1U chassis with a bunch of PiZeros in it that can be racked easily, cooled easily, maintained easily, and deliver something like 128 to 256 cores per 1U for a fraction of the price/power of an equivalent cluster built from traditional computers. Dreaming a little more, wireless power would RULE. Even something like what you do to power your phone at Starbucks, simply laying it on a charging pad, would be ideal in this case.
Here's a review of the Zero: https://www.phoronix.com/scan.php?page=article&item=raspberry-pi-zero&num=1
So that people don't think this is just fun and games, consider one of my systems. I am able to analyse 3000 symbols (for now; the number can easily double). Say I want to analyse a year's worth of tick data. Say a day has an average of 10,000 Bid/Ask quotes per symbol. Say I am able to push about 10 B/A quotes a second through the entire backtesting framework. Here is the problem:

3000 symbols * 10,000 quotes = 30 million quotes a day
30,000,000 * 250 trading days a year = 7,500,000,000 Bid/Ask quotes a year
I can do 10 a second, so 7,500,000,000 / 10 = 750,000,000 seconds to finish
750,000,000 / 60 = 12,500,000 minutes to finish
12,500,000 / 60 = 208,333 hours to finish
208,333 / 24 = 8,680 days to finish
8,680 / 365 = about 24 years to finish

Now imagine I had a one-thousand-node cluster. Since this is an embarrassingly parallel computation, I can divide 8,680 / 1,000 to get just about 9 days to push an entire year's worth of tick data on 3000 symbols through the system end to end. That I can live with. The program is already cluster aware. I just need the nodes. AWS/EC2 is very expensive.
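For anyone who wants to plug in their own numbers, here is the same back-of-envelope calculation as a tiny C++ program. The constants are just the figures above; change them to match your own setup:

```cpp
#include <cstdio>

int main() {
    // Figures from the post above -- adjust to taste.
    const double symbols        = 3000;    // instruments to analyse
    const double quotes_per_day = 10000;   // avg Bid/Ask quotes per symbol per day
    const double trading_days   = 250;     // trading days per year
    const double quotes_per_sec = 10;      // end-to-end backtest throughput
    const double nodes          = 1000;    // cluster size (embarrassingly parallel)

    const double total_quotes = symbols * quotes_per_day * trading_days;  // 7.5e9
    const double seconds      = total_quotes / quotes_per_sec;
    const double days_single  = seconds / 86400.0;                        // ~8,680 days
    const double days_cluster = days_single / nodes;                      // ~8.7 days

    std::printf("Total quotes/year : %.0f\n", total_quotes);
    std::printf("Single node       : %.0f days (~%.1f years)\n",
                days_single, days_single / 365.0);
    std::printf("%d-node cluster   : %.1f days\n", (int)nodes, days_cluster);
    return 0;
}
```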
10 per second is very slow. Are you using asynchronous/parallel coding practices? You should review the code and see if there are any areas that need algorithm improvement and code optimization. As suggested earlier, GPU computing is very often a better choice than cluster computing. Before spending time and money on hardware, use a smaller sample that is manageable on standard equipment. If your method does not show promising results on a small sample, then running the same thing on a massive data set is not going to make a difference. I would suggest focusing on algorithm and program optimization first, rather than on hardware.
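To make the parallel/async suggestion concrete, here is a minimal sketch of fanning the per-symbol runs out over threads on a single machine. backtest_symbol and the symbol list are hypothetical stand-ins for whatever the real system does; the only assumption is that different symbols share no mutable state:

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdio>
#include <string>
#include <thread>
#include <vector>

// Hypothetical stand-in for the real per-symbol backtest.
void backtest_symbol(const std::string& symbol) {
    std::printf("backtesting %s\n", symbol.c_str());
}

void backtest_all(const std::vector<std::string>& symbols) {
    const unsigned n_workers = std::max(1u, std::thread::hardware_concurrency());
    std::vector<std::thread> workers;

    for (unsigned w = 0; w < n_workers; ++w) {
        workers.emplace_back([&symbols, w, n_workers] {
            // Each worker takes every n_workers-th symbol (simple static partition).
            for (std::size_t i = w; i < symbols.size(); i += n_workers)
                backtest_symbol(symbols[i]);
        });
    }
    for (auto& t : workers) t.join();
}

int main() {
    backtest_all({"AAPL", "MSFT", "SPY"});   // placeholder symbol list
    return 0;
}
```

This only buys a factor of roughly the core count per box, which is the 3x-30x the original poster mentions, not the 1000x he needs, but it is the same partitioning idea that a cluster or GPU would use.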
10 a second is slow, but trust me, it has to be this way. I might be able to get it to 30 a second, but what good would that do me? I need it to be 1000x faster, not 3x. As far as running on a GPU, it is possible, but so is going to Mars. The amount of work to port to a GPU is probably intense. The system is extremely promising. If it were just promising, I wouldn't waste the effort trying to test it at the intensity described here. Take note: this is running the exact same algorithm on different data, so it is embarrassingly parallel.
Embarrassingly parallel sounds like a good candidate for the GPU, because that is what a GPU is all about. The only issue is moving 15-30 GB back and forth between CPU and GPU quickly. Programming a GPU is much easier today with CUDA or C++ AMP, for example, since it is C/C++ code with a few extensions, but it can also be done in many other languages using different GPU bindings. Most of the time algorithms use hybrid CPU/GPU computation, since not all computations are best suited for the GPU. I just added these remarks for completeness. The theoretical single-GPU speedup is definitely not 1000x, but rather 100x, maybe more (theoretical! it all depends on the algorithm type), and a single computer can have multiple GPUs working in parallel. In your scenario, cluster computing might be better suited to solve the problem. On a technological note, Intel produces the Xeon Phi with around 50 cores (real cores, as opposed to streamlined GPU cores) for parallel computing, but the cost might be prohibitive. A GPU is a much cheaper supercomputer than anything else.
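To give a feel for what "C/C++ code with a few extensions" means, here is a toy CUDA kernel. The quote arrays and the per-quote mid-price computation are made up purely for illustration; a real backtest would be far more involved and, as noted, would have to worry about shipping those 15-30 GB to the card:

```cpp
#include <cuda_runtime.h>
#include <cstdio>

// Toy per-quote computation -- stands in for whatever the backtest does per Bid/Ask pair.
__global__ void mid_price(const float* bid, const float* ask, float* mid, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;   // one thread per quote
    if (i < n)
        mid[i] = 0.5f * (bid[i] + ask[i]);
}

int main() {
    const int n = 1 << 20;                 // ~1M quotes for the demo
    const size_t bytes = n * sizeof(float);

    float *bid, *ask, *mid;
    cudaMallocManaged(&bid, bytes);        // unified memory keeps the demo short
    cudaMallocManaged(&ask, bytes);
    cudaMallocManaged(&mid, bytes);
    for (int i = 0; i < n; ++i) { bid[i] = 100.0f; ask[i] = 100.5f; }

    const int threads = 256;
    const int blocks  = (n + threads - 1) / threads;
    mid_price<<<blocks, threads>>>(bid, ask, mid, n);
    cudaDeviceSynchronize();

    std::printf("mid[0] = %.2f\n", mid[0]);
    cudaFree(bid); cudaFree(ask); cudaFree(mid);
    return 0;
}
```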
I don't know anything about the Phi, but I am guessing developing for it can't be the same as developing for a multicore CPU. If it were, why not just put 50 cores into a Xeon? As far as cost, I estimate that a 1000-core PiZero cluster would cost around $15,000. I doubt a Phi costs that much. On the other hand, it is [distributed] 1000 cores. I have no idea where I would find 1000 outlets though. LOL. That is why the PDU has to be designed for this sort of thing first.