Cell processors

Discussion in 'Automated Trading' started by WayneWunder, Oct 15, 2009.

  1. Hi,
    Has anyone had experience developing ATS platforms for Cell processors, and are there a lot of traders/financial firms using this technology?

    Also, which is better in terms of processing power and multithreading capabilities: Cell CPUs or GPUs?

  2. Corey


    No direct experience (no PS3 to play on), but from what I have read, most parallel algorithms carry over easily from a conceptual point of view, but not from an implementation point of view. To get the true power from the Cell CPU, you need to get your hands dirty with issues like loop unrolling, bad branch prediction, and memory coalescing (a GPU issue as well). You really have to make sure your code is appropriately vectorized. You also need to watch out for the hard memory limit.

    Basically, you're going to need to redesign your algorithms to fit into the constraints of the processor, but also to make sure that they are taking full advantage of the processor's capabilities: job management and whatnot.
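
    As a rough illustration of redesigning an algorithm around a hard memory limit: the sketch below computes a simple moving average over a long price series while holding only one small tile plus a window of carry-over state at a time, roughly the way SPU code streams tiles through its 256 KB local store. It is plain Python for readability, not Cell code; `TILE_ITEMS`, `read_tiles`, and `streaming_sma` are hypothetical names, and the tile size is an arbitrary assumption.

```python
from collections import deque
from typing import Iterable, Iterator, Sequence

# Assumption: how many prices fit in the working buffer. On an SPU this would
# be derived from the 256 KB local store minus code, stack, and DMA buffers.
TILE_ITEMS = 4096


def read_tiles(prices: Sequence[float],
               tile_items: int = TILE_ITEMS) -> Iterator[Sequence[float]]:
    """Yield the series in fixed-size tiles (a stand-in for DMA transfers)."""
    for start in range(0, len(prices), tile_items):
        yield prices[start:start + tile_items]


def streaming_sma(tiles: Iterable[Sequence[float]],
                  window: int) -> Iterator[float]:
    """Simple moving average computed tile by tile.

    Only the last `window` prices are carried across tile boundaries, so the
    working set stays bounded no matter how long the full series is.
    """
    if window <= 0:
        raise ValueError("window must be positive")
    carry = deque(maxlen=window)  # state that survives between tiles
    running_sum = 0.0
    for tile in tiles:
        for price in tile:
            if len(carry) == window:
                running_sum -= carry[0]  # oldest price is about to be evicted
            carry.append(price)
            running_sum += price
            if len(carry) == window:
                yield running_sum / window
```

    The result does not depend on the tile size, which is the property that lets the tile be shrunk to whatever the memory limit allows.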

    From a Stack Overflow thread, issues you need to deal with when handling an SPU:

    * Atomic operations (lock-free try-discard style).
    * Strong distinction between memory areas. You have to know which pointer is pointing to which memory area or you'll screw everything up.
    * No enforced hardware distinction between data and code. This is actually a fun thing: you can set up dynamic code loading and essentially stream subroutines in and out. Self-modifying code is possible but not necessarily practical on the SPU.
    * Lack of hardware debugging aids.
    * Limited memory size.
    * Fast memory access.
    * Instruction set balanced toward SIMD operations.
    * Floating point "gotchas".
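
    To make the floating point "gotchas" item concrete: single precision is the fast path on the SPU, and its single-precision mode is documented as not fully IEEE 754 compliant (for example, it truncates instead of rounding to nearest). The sketch below does not emulate the SPU; it only uses standard IEEE float32 rounding, via `struct`, to show how quickly a naive single-precision running sum drifts compared with double precision. `to_f32` and `accumulate` are hypothetical helper names.

```python
import struct


def to_f32(x: float) -> float:
    """Round a Python float (IEEE double) to the nearest IEEE single."""
    return struct.unpack("f", struct.pack("f", x))[0]


def accumulate(value: float, count: int, single: bool) -> float:
    """Add `value` to a running total `count` times.

    With single=True every intermediate result is rounded to float32,
    which is what a single-precision accumulator would hold.
    """
    total = 0.0
    step = to_f32(value) if single else value
    for _ in range(count):
        total = total + step
        if single:
            total = to_f32(total)
    return total


if __name__ == "__main__":
    n = 100_000
    exact = n / 100  # adding 0.01 one hundred thousand times
    print("float32 error:", abs(accumulate(0.01, n, single=True) - exact))
    print("float64 error:", abs(accumulate(0.01, n, single=False) - exact))
```

    For something like a running P&L or a cumulative volume figure, this kind of drift is worth checking before moving a calculation to single precision.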

    My experience with GPUs is basically that getting a naturally parallel algorithm over is pretty simple, and as long as the memory transfer isn't your bottleneck (i.e. you are giving it enough to work with), you get a free speed-up. The basics are easy. However, really getting nitty gritty can be difficult, and getting the MOST out of your GPU is quite difficult. I've mainly used it for back-testing and that sort of stuff.
  3. nitro


  4. maxpi


    I wonder if any of these complex multi-processor setups can beat software that is compact, compiled to native Intel code, and runs entirely in the processor cache on a single [multi-core] processor.
  5. nitro


    If the application is embarrassingly parallel, and the cost of marshalling the data and code to and from the GPU is negligible compared to the computation involved, the answer is that it is not even close: it can be 100x faster to run on a GPU.

    The problem with these things for realtime use is the system bus, since (PCI) bus latency/speed is terrible compared to how fast things run once the data is in an L1 cache near the CPU. That's why I want a direct-connect Ethernet port on nVidia cards.
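
    The trade-off in this post can be put into a back-of-the-envelope model: offloading wins only when the compute saved outweighs the bus round trip. The sketch below is such a model, not a benchmark; `offload_seconds`, `effective_speedup`, and every number in the usage example (kernel speedup, bus bandwidth, per-transfer latency) are assumptions chosen for illustration.

```python
def offload_seconds(cpu_seconds: float,
                    kernel_speedup: float,
                    bytes_moved: float,
                    bus_bytes_per_second: float,
                    bus_latency_seconds: float,
                    transfers: int = 2) -> float:
    """Estimated wall time when the work is shipped to an accelerator.

    cost = per-transfer latency + bytes over the bus + accelerated compute.
    `transfers` defaults to 2: one copy to the device, one copy back.
    """
    transfer = (transfers * bus_latency_seconds
                + bytes_moved / bus_bytes_per_second)
    return transfer + cpu_seconds / kernel_speedup


def effective_speedup(cpu_seconds: float, **offload_kwargs) -> float:
    """Speedup actually observed: CPU time divided by offloaded time."""
    return cpu_seconds / offload_seconds(cpu_seconds, **offload_kwargs)


if __name__ == "__main__":
    # Assumed hardware figures, for illustration only.
    hw = dict(kernel_speedup=100.0,
              bus_bytes_per_second=4e9,
              bus_latency_seconds=20e-6)

    # Large back-test: 60 s of CPU work over 1 GB of data.
    print("back-test speedup:", effective_speedup(60.0, bytes_moved=1e9, **hw))

    # Single tick: 5 microseconds of CPU work over 1 KB of data.
    print("per-tick speedup:", effective_speedup(5e-6, bytes_moved=1e3, **hw))
```

    With these assumed figures the back-test keeps a large share of the kernel speedup, while the single tick comes out slower than staying on the CPU, because the fixed bus latency dominates the tiny amount of work.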
  6. Thanks for the replies.

    With regard to overcoming bottlenecks such as PCI bandwidth, what mechanisms/technologies can be used to get around these issues?
  7. None you have access to. Or do you design your own motherboards?

    In theory, Nvidia could put out a special graphics card that connects with InfiniBand ;) Problem solved.
  8. nitro


    I would recommend against InfiniBand. First, you have to program to the InfiniBand stack, and that is not trivial. AFAIK, you can't just take a program that works on Ethernet and expect it to work automatically on IB.

    Two 1 Gb ports would be fine.