Good luck. I am completely uninterested in matrices/factorization and so on; it is not an area I care about beyond knowing how I would approach it if needed. And for that I need the technical detail of the hardware, which directly translates into how you write your software. Don't be scared of technicalities. I was under the impression that it cannot be stated more simply on an ET board than this, and regardless of the algorithm, GPU programming really is programming to the hardware, which is different from CPU programming. I was just making the point that it can be done, since you did not state the size of your large matrix.
You understand it's not a question of technicality, right? It's possible (turns out not to be the case, of course) that matrix factorization simply cannot be parallelized efficiently. If that's the case (again, it turns out not to be), hardware doesn't change a single thing. To summarize: I asked a question about an algorithm. You threw back a condescending response about hardware details.
This is the ET general board, by the way. I think you know the answer and are trying to get at a question that is at the core of parallel programming: what is trivial to do in sequential programming becomes a major issue when you try to apply a highly parallel algorithm to a problem that is not inherently parallel. You can still take advantage of parallelism in those situations. So what do you do? Exactly what programming and algorithms are supposed to do: you split your problem into the pieces that can run in parallel and those that cannot. The specific implementation is dictated by your hardware/software, and often the computation is hybrid and asynchronous across both the CPU and the GPU. As I said, I cannot help you with this particular problem because I do not have any applications that would make use of it.
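To make the hybrid/asynchronous point concrete, here is a minimal CUDA sketch (the kernel, sizes and names are made up purely for illustration): the GPU chews on the data-parallel chunk in a stream while the CPU handles a serial chunk at the same time, and the two are synchronized only at the end.

```
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Parallel chunk: trivially data-parallel work goes to the GPU.
__global__ void scale(float *x, float a, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= a;
}

// Serial chunk: dependent, sequential work stays on the CPU.
float serial_part(const float *y, int n) {
    float acc = 0.0f;
    for (int i = 0; i < n; ++i) acc += y[i];
    return acc;
}

int main() {
    const int n = 1 << 20;
    float *h = (float *)malloc(n * sizeof(float));
    for (int i = 0; i < n; ++i) h[i] = 1.0f;

    float *d;
    cudaMalloc(&d, n * sizeof(float));
    cudaMemcpy(d, h, n * sizeof(float), cudaMemcpyHostToDevice);

    cudaStream_t s;
    cudaStreamCreate(&s);

    // Launch the parallel chunk asynchronously on the GPU ...
    scale<<<(n + 255) / 256, 256, 0, s>>>(d, 2.0f, n);

    // ... while the CPU overlaps the serial chunk.
    float cpu_result = serial_part(h, n);

    // Synchronize before combining the two halves of the computation.
    cudaStreamSynchronize(s);
    cudaMemcpy(h, d, n * sizeof(float), cudaMemcpyDeviceToHost);

    printf("cpu part = %f, gpu sample = %f\n", cpu_result, h[0]);
    cudaFree(d); free(h); cudaStreamDestroy(s);
    return 0;
}
```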
Very true about the last statement in particular. I see myself having to acquaint myself with Node.js, JavaScript in general, and HTML5 just to get a couple of nice-looking chart scripts based on D3 running on my website (I make performance numbers accessible to registered clients, and they can chart and calculate risk and return metrics any way they want over any chosen time frame). Such libraries are only available in JS/HTML5, and they are so easy to bind to that it would be a waste to spend hundreds of dollars on a Silverlight library and have to delve into its intricacies.
Some toolchains for programming FPGAs use C variants, but they all need to be compiled down to an HDL in the end. Still, lots of guys feel more comfortable writing such libraries in C or C++ rather than Verilog or VHDL. Just pointing out one application where C is still used.
GPUs are heavily used on the buy side: risk management (valuation) and derivatives pricing, to name just two areas. When I say heavily, I really mean it; you won't find a single large fund trading exotics that does not use GPUs. And of course matrix manipulations can be parallelized, no question. That is why matrix operations are one of the main domains of GPU offloading.
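To show why matrix work maps so naturally to GPUs, here is a deliberately naive CUDA sketch (names and the launch configuration are just for illustration; production code would call cuBLAS): every element of the result is independent, so one thread can own one element.

```
#include <cuda_runtime.h>

// Naive dense multiply C = A * B for n x n row-major matrices.
// Each thread computes one independent element of C, which is the
// embarrassingly parallel structure GPUs exploit (real code uses cuBLAS).
__global__ void matmul(const float *A, const float *B, float *C, int n) {
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < n && col < n) {
        float acc = 0.0f;
        for (int k = 0; k < n; ++k)
            acc += A[row * n + k] * B[k * n + col];
        C[row * n + col] = acc;
    }
}

// Launch sketch: 16x16 threads per block, enough blocks to cover n x n.
// dim3 block(16, 16), grid((n + 15) / 16, (n + 15) / 16);
// matmul<<<grid, block>>>(dA, dB, dC, n);
```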
Thanks. Yes, I'm aware of those cases, except that I generally classify valuation and derivatives pricing as sell-side activity; I understand there are overlaps that may be firm-specific. I've tried eliminating the buffer copy from CPU to GPU, but as far as my benchmarks go, it is still too slow except for a few use cases, e.g. heavy PDE work.
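For reference, one common way to cut the explicit staging copy is pinned, mapped ("zero-copy") host memory that the kernel reads directly over PCIe. This is only a sketch of that technique, not a claim about your benchmarks; whether it wins depends entirely on the access pattern and on how often the data is reused on the device.

```
#include <cstdio>
#include <cuda_runtime.h>

__global__ void scale(float *x, float a, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= a;
}

int main() {
    const int n = 1 << 20;
    cudaSetDeviceFlags(cudaDeviceMapHost);   // allow mapped pinned allocations

    // Pinned, mapped host buffer: the GPU can access it directly,
    // so there is no explicit cudaMemcpy staging step.
    float *h;
    cudaHostAlloc(&h, n * sizeof(float), cudaHostAllocMapped);
    for (int i = 0; i < n; ++i) h[i] = 1.0f;

    float *d;
    cudaHostGetDevicePointer(&d, h, 0);      // device-side alias of the host buffer

    scale<<<(n + 255) / 256, 256>>>(d, 2.0f, n);
    cudaDeviceSynchronize();

    printf("%f\n", h[0]);                    // result is visible directly in host memory
    cudaFreeHost(h);
    return 0;
}
```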
Well, because C is a fast and resource-efficient language compared to interpreted or VM-based languages like Ruby, Python, Java, C#, etc. Python and Ruby run around 5-10 times slower than C, and that does mean longer waits for the end result. If the main code is written in C, then you may also want the support code around the core to be written in C too. Modern C compilers are very good at optimization and can auto-vectorize your code to some degree, automatically using all those nice instruction sets you have access to: SSE, AVX, AVX2, XOP, FMA4, etc.
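As a rough illustration of what auto-vectorization means in practice, here is a plain C loop (the file name and flags are just an example) that gcc or clang can turn into SSE/AVX code on their own when optimization is enabled.

```
// saxpy.c -- compile with e.g.: gcc -O3 -march=native -c saxpy.c
// The restrict qualifiers tell the compiler the arrays do not overlap,
// which makes it easier for it to emit SSE/AVX vector instructions.
void saxpy(int n, float a, const float *restrict x, float *restrict y) {
    for (int i = 0; i < n; ++i)
        y[i] = a * x[i] + y[i];
}
```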
I've seen open-source OpenCL code for Black-Scholes, plus QuantLib seems to have OpenCL extensions as well. There are a number of financial algorithms that lend themselves to GPU acceleration.
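The code I've seen is OpenCL, but the structure is easy to show in a few lines of CUDA as well (parameter names are just illustrative): each option in a batch is priced independently by one thread, which is why Black-Scholes is such a natural fit.

```
#include <cuda_runtime.h>

// Black-Scholes European call, one option per thread.
// S = spot, K = strike, T = time to expiry in years, r = rate, v = volatility.
// normcdff() is CUDA's single-precision standard normal CDF.
__global__ void bs_call(const float *S, const float *K, const float *T,
                        float r, float v, float *price, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    float sqrtT = sqrtf(T[i]);
    float d1 = (logf(S[i] / K[i]) + (r + 0.5f * v * v) * T[i]) / (v * sqrtT);
    float d2 = d1 - v * sqrtT;
    price[i] = S[i] * normcdff(d1) - K[i] * expf(-r * T[i]) * normcdff(d2);
}

// Launch sketch: bs_call<<<(n + 255) / 256, 256>>>(dS, dK, dT, 0.01f, 0.2f, dP, n);
```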
There are almost no real-life algorithms that are immediately 100% parallel-ready. Most of the programming effort goes into converting such an algorithm to take the maximum possible advantage of parallelism, through either hardware or software acceleration. Usually there are some chunks of code that cannot be processed in parallel, but there are ways to synchronize those operations with some degree of performance penalty, and the end result is still a significant speedup of the entire application.
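The ceiling on that speedup is just Amdahl's law: if a fraction p of the run time can be parallelized over N workers, the overall speedup is 1 / ((1 - p) + p / N). For example, if 95% of the work parallelizes perfectly, the best you can ever do is 1 / 0.05 = 20x, no matter how many cores or GPU threads you throw at it, which is why shrinking the serial chunk matters as much as accelerating the parallel one.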