I agree. But it still does what I want even with the hack, so I don't care how it's done. I don't know that much about parallel computing anyway. Maybe my knowledge is too limited on this subject, but I don't see how the example I gave is embarrassingly parallel when the evaluation of one variable depends on the value of another variable. Can you explain how to make something like the example I gave you run in multiple threads? As far as APS, I understand that searching for patterns in 2 data files can run in parallel, but the question is whether a single search can run in parallel processing mode. In fact, all I do is a single search at a time.
That is just what Intel's Larrabee is about: http://www.ddj.com/architect/216402188 This could be very interesting.
No, your example is pretty straightforward, and it can be parallelized. But it's easier to think of all this in terms of atomic functions. Let's start with:

a = f(b)
c = f(d)

Since the dependent variables, 'a' and 'c', are independent of each other, their functions can be parallelized. On the other hand, the sequence:

a = f(b)
c = f(a)

cannot be parallelized, since 'a' has to be computed first because 'c' is now dependent on 'a'. (And no, 'y' in your example isn't dependent on 'x' in the same way as here, since the value of 'x' is constant during the entire iteration of the inner loop.)

From a functional perspective, why would the input to a function that searches for a cup-with-handle pattern be dependent on the output from a function that searches for a head & shoulders pattern? The only possibility is poor program design, with the use of global variables almost always at the top of that list.
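To make the independent case concrete, here's a minimal Python sketch of a = f(b), c = f(d) on separate threads. The function f and the worker wrapper are placeholders for illustration, not from any real library:

```python
import threading

def f(v):
    # Stand-in for an expensive pure function.
    return v * v

results = {}

def worker(key, arg):
    # Each thread writes to its own key, so no locking is needed.
    results[key] = f(arg)

# a = f(b) and c = f(d) share no data, so they can run concurrently.
t1 = threading.Thread(target=worker, args=("a", 3))
t2 = threading.Thread(target=worker, args=("c", 5))
t1.start(); t2.start()
t1.join(); t2.join()

print(results["a"], results["c"])  # 9 25
```

Because the two calls touch disjoint data, no ordering between the threads matters; that is exactly what makes the pattern embarrassingly parallel.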
WTF you bozo retard, wiki freak. Look at his example carefully:

x = 0
y = 0
for i = 0 to 100
    x = x+i
    for j = 1 to 1000
        y = x+2j
    end
end

This translates to:

a = f(b)    // b = i
c = g(a, d) // d = j

Calculation of c is dependent on a. This cannot be parallelized (easily). Bozo...
Haven't done any real work with CUDA yet, but I have read the documentation as well as visited their forum. To gain a large performance boost you do need to parallelize your algorithm, but unfortunately that is not enough. Equally important are the memory access patterns of your algorithm. The GPU reads data in blocks, and if your problem does not map to that access pattern it will need to serialize the memory accesses, which slows things down a lot. For complex algorithms it seems this could become harder than making the algorithm parallel. /Hugin
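A rough way to see why access patterns matter: count how many block-sized memory transactions one warp of 32 threads triggers under coalesced vs. strided addressing. The 128-byte transaction size and the 32-thread warp are assumptions modeled loosely on typical CUDA hardware, and the sketch is plain Python, not actual GPU code:

```python
def transactions(addresses, block_bytes=128):
    # Number of distinct memory blocks touched by one warp-wide access.
    return len({addr // block_bytes for addr in addresses})

WARP = 32
ELEM = 4  # 4-byte floats

# Coalesced: thread k reads element k, so addresses are contiguous.
coalesced = [k * ELEM for k in range(WARP)]

# Strided: thread k reads element k*32, so addresses are spread out.
strided = [k * 32 * ELEM for k in range(WARP)]

print(transactions(coalesced), transactions(strided))  # 1 32
```

One contiguous warp read fits in a single block transaction, while the strided read touches 32 blocks; the hardware has to issue those one after another, which is the serialization cost described above.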
Execute 101,000 threads in parallel with the following kernel:

y = (i*(i+1)/2) + 2j

Where: i is an input from 0 to 100 and j is an input from 1 to 1000.

Analogous to pixel rendering for each x, y coordinate on the screen, where the screen is 101 by 1000 pixels in dimensions, i.e. perfectly suited to parallelism.
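The closed form can be checked against the original nested loop: after the outer iteration for a given i, x equals 0+1+...+i = i(i+1)/2, so every (i, j) cell can be computed independently. A quick Python sketch of the check (pure illustration, not actual GPU code):

```python
def kernel(i, j):
    # One "thread": computes y at (i, j) with no shared state.
    return i * (i + 1) // 2 + 2 * j

# Sequential reference, mirroring the original nested loop.
expected = {}
x = 0
for i in range(0, 101):
    x = x + i
    for j in range(1, 1001):
        expected[(i, j)] = x + 2 * j

# Every one of the 101,000 cells matches the closed form.
assert all(kernel(i, j) == v for (i, j), v in expected.items())
print(kernel(100, 1000))  # 7050
```

Since each cell depends only on its own (i, j), the whole grid can be handed to independent threads, exactly like shading a 101-by-1000 pixel screen.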
I did, but since you insist...

Code:
main() {
    x = 0
    y = array[101]
    for i = 0 to 100
        x = x+i
        fork_thread(i, proc_x(y, i, x))
    end
    wait_threads(101)
    print(x, y[100])
}

proc_x(array y, int i, int x) {
    for j = 1 to 1000
        y[i] = x+2j
    end
    return
}

Both fork_thread() and wait_threads() are OS dependent. A decent treatment can be found on wiki, but you don't seem open to that. So, here's one of the books that I've got, about 10 years old by now: http://www.amazon.com/Win32-Multith...=sr_1_8?ie=UTF8&s=books&qid=1242728857&sr=8-8 Dipstick...
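The pseudocode above maps directly onto real threading primitives. Here is one possible rendering in Python, with threading.Thread standing in for fork_thread/wait_thread; it assumes the x = x+i form of the original example, and x is passed by value so each thread sees the constant snapshot it was forked with:

```python
import threading

def proc_x(y, i, x):
    # Inner loop: x was passed by value, so it is constant here.
    for j in range(1, 1001):
        y[i] = x + 2 * j

y = [0] * 101
threads = []
x = 0
for i in range(0, 101):
    x = x + i
    t = threading.Thread(target=proc_x, args=(y, i, x))
    t.start()
    threads.append(t)

for t in threads:  # wait_thread() equivalent
    t.join()

print(x, y[100])  # 5050 7050
```

Each thread owns its own slot y[i], so there is no shared mutable state between threads and no locks are needed; the outer accumulation of x stays sequential, but the 101 inner loops all run concurrently.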