Tesla Personal Supercomputer by Nvidia

intradaybill · May 18, 2009

Quote from Elitist Trader:

When you have TradeStation running your testing, bring up the Windows Task Manager, click the Performance tab, and check out the CPU Usage History. You should see two graphs (one for the load on each CPU), and if only one of the processors is under full load, then you know that TradeStation is only using one processor.

Another thing to note is that you need a better CPU; your 4GB of RAM won't be doing anything to help make things faster.

Most programs use only 1 processor because the task of rewriting code for multiple processors is both difficult and expensive. Most sequential types of optimization and backtesting algorithms are not exactly suitable for multiple processing.

There is a neat trick you can do to circumvent that and Michael Harris has done it with his APS software. Basically, you run multiple instances of a program concurrently and each instance does a portion of the task. I was one of the first to request multiple CPU processing for APS and I got this solution in response. It works nicely. Maybe you should ask the TradeStation people to do the same. Running the program for days to get results is not very productive.

jprad · May 18, 2009

Quote from intradaybill:

Most programs use only 1 processor because the task of rewriting code for multiple processors is both difficult and expensive. Most sequential types of optimization and backtesting algorithms are not exactly suitable for multiple processing.

Writing a multi-threaded application is not that hard nor is it all that new. Windows NT, back in '95 had SMP:

http://msdn.microsoft.com/en-us/library/ms810434.aspx

There is a neat trick you can do to circumvent that and Michael Harris has done it with his APS software. Basically, you run multiple instances of a program concurrently and each instance does a portion of the task. I was one of the first to request multiple CPU processing for APS and I got this solution in response. It works nicely.

He's making you buy multiple licenses? That's not a solution, that's a rip-off.

BTW, you might want to check into the scalability of SMP on Windows with the Intel CPU architecture. Additional CPUs do not give you a linear performance gain.

Since you have to buy multiple licenses, you'd be better off running APS on separate machines.
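
One common way to reason about the non-linear scaling mentioned above is Amdahl's law. The post does not cite it, but it is a standard model: if a fraction p of the run time can be parallelized, the ideal speed-up on n CPUs is 1 / ((1 - p) + p / n). A minimal Python sketch, where the 90% parallel fraction is purely an illustrative assumption and not a measurement of APS or TradeStation:

```python
def amdahl_speedup(parallel_fraction: float, cpus: int) -> float:
    """Ideal speed-up under Amdahl's law (ignores memory and bus contention)."""
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must be between 0 and 1")
    if cpus < 1:
        raise ValueError("cpus must be at least 1")
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / cpus)


if __name__ == "__main__":
    # Assumed workload: 90% of the run time parallelizes perfectly.
    for n in (1, 2, 4, 8):
        print(n, "CPUs ->", round(amdahl_speedup(0.9, n), 2), "x")
```

Under that assumption, even 8 CPUs give less than a 5x speed-up, because the serial 10% never gets faster.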

byteme · May 18, 2009

Quote from intradaybill:

Most programs use only 1 processor because the task of rewriting code for multiple processors is both difficult and expensive. Most sequential types of optimization and backtesting algorithms are not exactly suitable for multiple processing.

There is a neat trick you can do to circumvent that and Michael Harris has done it with his APS software. Basically, you run multiple instances of a program concurrently and each instance does a portion of the task. I was one of the first to request multiple CPU processing for APS and I got this solution in response. It works nicely. Maybe you should ask the TradeStation people to do the same. Running the program for days to get results is not very productive.

You have to buy multiple licenses to make use of multiple CPUs??!!!

[edited: sorry, someone beat me to the punch above]

"Most sequential types of optimization and backtesting algorithms are not exactly suitable for multiple processing."

Could you elaborate on that for me? I'd like to hear of some use cases that wouldn't be suitable.

Masterchanger · May 18, 2009

Quote from Elitist Trader:

Another thing to note is that you need a better CPU; your 4GB of RAM won't be doing anything to help make things faster.

I know a 2.66 dual-core Pentium isn't a 3.0 or better or an i7, but can't my processor use up to 4GB of RAM with WinXP? I'm avoiding Vista right now.

What exactly are you saying, that I need a quad-core or what? Given that TradeStation doesn't utilize multithreading, how much of an advantage would there be with a quad-core, or with going from a 2.66 to a 3.0 dual-core?

jprad · May 18, 2009

Quote from Masterchanger:

I know a 2.66 dual-core Pentium isn't a 3.0 or better or an i7, but can't my processor use up to 4GB of RAM with WinXP? I'm avoiding Vista right now.

While the total address space of a 32-bit CPU is 4GB, the most RAM you can use is about 3.5GB, because I/O (video, disk, network, etc.) is mapped into the remaining 512MB or so.

You need to move over to a 64-bit CPU and OS in order to use more than 3.5GB of RAM.
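
The arithmetic behind those figures can be checked in a few lines of Python. The 512MB reserved for memory-mapped I/O is the figure from the post; the actual amount varies with the motherboard and installed devices, so treat it as an assumption.

```python
GIB = 2 ** 30  # bytes in one gibibyte
MIB = 2 ** 20  # bytes in one mebibyte

address_space = 2 ** 32        # bytes addressable with 32-bit addresses
mmio_reserved = 512 * MIB      # figure quoted in the post; varies by system
usable_ram = address_space - mmio_reserved

print(address_space / GIB)     # 4.0
print(usable_ram / GIB)        # 3.5
```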

buzzy2 · May 18, 2009

Actually, some advanced techniques, like Monte Carlo and the bootstrap, parallelize easily.
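
To illustrate why the bootstrap parallelizes easily: each resample is independent of the others, so batches of resamples can be handed to separate workers and simply concatenated afterwards. The following is a minimal Python sketch, not anything from TradeStation or APS. The function names and the trade returns are made up for illustration. ThreadPoolExecutor is used so the sketch runs anywhere; in standard CPython, passing ProcessPoolExecutor as executor_cls is what would actually spread CPU-bound work across cores.

```python
import random
from concurrent.futures import ThreadPoolExecutor
from statistics import mean


def bootstrap_means(returns, n_resamples, seed):
    """Draw n_resamples bootstrap samples and return the mean of each."""
    rng = random.Random(seed)  # private generator, so workers share no state
    size = len(returns)
    return [mean(rng.choices(returns, k=size)) for _ in range(n_resamples)]


def parallel_bootstrap(returns, n_resamples, workers=4,
                       executor_cls=ThreadPoolExecutor):
    """Split the resamples into one batch per worker and merge the results."""
    base, extra = divmod(n_resamples, workers)
    batch_sizes = [base + (1 if w < extra else 0) for w in range(workers)]
    with executor_cls(max_workers=workers) as pool:
        # Each batch gets its own seed so the batches are not identical.
        futures = [pool.submit(bootstrap_means, returns, batch, seed)
                   for seed, batch in enumerate(batch_sizes)]
        return [m for f in futures for m in f.result()]


if __name__ == "__main__":
    trade_returns = [0.8, -0.5, 1.2, -0.3, 0.4, -1.1, 0.9, 0.2]  # made-up data
    means = sorted(parallel_bootstrap(trade_returns, 1000))
    print("approx. 90% interval for the mean:", means[50], means[949])
```

No worker needs anything from another worker until the final merge, which is what makes this kind of job a good fit for multiple CPUs.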

intradaybill · May 18, 2009

Quote from jprad:

Writing a multi-threaded application is not that hard nor is it all that new. Windows NT, back in '95 had SMP:

http://msdn.microsoft.com/en-us/library/ms810434.aspx

He's making you buy multiple licenses? That's not a solution, that's a rip-off.

BTW, you might want to check into the scalability of SMP on Windows with the Intel CPU architecture. Additional CPUs do not give you a linear performance gain.

Since you have to buy multiple licenses, you'd be better off running APS on separate machines.

That won't work since the execution time will remain the same for a certain task for each of the machines.

Writing multi-threaded applications is not that hard but it is not enough for improving execution time.

You will either pay for the cost of the programming or for multiple instances of a program. Actually, I paid just an additional 20% to get the second license.

Maybe you would like to explain to us how it is done. Consider the trivial nested loop example below:

x = 0
y = 0
for i = 0 to 100
    x = x + i
    for j = 1 to 1000
        y = x + 2*j
    end
end
print(x, y)

How do you make use of multiple CPUs to get the result faster? Maybe there is a way, I don't know. I am not a good programmer myself.

On the other hand, for the following example it is much easier:

x = 0
y = 0
for i = 0 to 100
    x = x + i
end
for j = 1 to 1000
    y = y + 2*j
end
print(x, y)

You can do it as follows I suppose:

Thread 1:

x = 0
for i = 0 to 100
    x = x + i
end

Thread 2:

y = 0
for j = 1 to 1000
    y = y + 2*j
end

Once both threads have finished:

print(x, y)

Now you have gained the execution time for thread 1, assuming thread 2 takes longer.
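
For reference, the two-thread split above could be written roughly as follows in Python. This is only a sketch of the structure: the function names and the shared results dictionary are assumptions made for illustration, and in standard CPython the global interpreter lock keeps CPU-bound threads from running at the same time, so a process pool would be needed for an actual speed-up.

```python
import threading

results = {}  # each thread writes its own key, so no lock is needed


def sum_x():
    # Thread 1: x = 0 + 1 + ... + 100
    x = 0
    for i in range(0, 101):
        x += i
    results["x"] = x


def sum_y():
    # Thread 2: y = 2*1 + 2*2 + ... + 2*1000
    y = 0
    for j in range(1, 1001):
        y += 2 * j
    results["y"] = y


threads = [threading.Thread(target=sum_x), threading.Thread(target=sum_y)]
for t in threads:
    t.start()
for t in threads:
    t.join()  # wait for both before reading the results

print(results["x"], results["y"])  # 5050 1001000
```

The split works because neither loop reads a value the other one writes. In the nested example, the inner loop reads x, which the outer loop updates, so the same trick does not apply directly.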

jprad · May 18, 2009

Quote from intradaybill:

That won't work since the execution time will remain the same for a certain task for each of the machines.

Implicit is the ability to split the data set or segment the problem space across machines. If so, there certainly will be an improvement in total clock time to complete the entire job.
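
A rough Python sketch of what splitting the problem space could look like for a parameter sweep. Everything here is hypothetical: score() is a stand-in for a single backtest run, and the chunking scheme is just one simple choice. The same structure applies whether the chunks go to threads, processes, or separate machines; ThreadPoolExecutor is used here so the sketch runs anywhere, while ProcessPoolExecutor would be the choice for real CPU-bound work in CPython.

```python
from concurrent.futures import ThreadPoolExecutor


def score(period):
    """Stand-in for one backtest run; a real one would replay price data."""
    return -(period - 37) ** 2  # hypothetical objective, best at period 37


def best_in_chunk(periods):
    """Evaluate one slice of the parameter range and return (score, period)."""
    return max((score(p), p) for p in periods)


def split(items, parts):
    """Deal items round-robin into `parts` roughly equal chunks."""
    return [items[k::parts] for k in range(parts)]


def sweep(periods, workers=4, executor_cls=ThreadPoolExecutor):
    """Search each chunk independently, then combine the per-chunk winners."""
    chunks = [c for c in split(list(periods), workers) if c]
    with executor_cls(max_workers=workers) as pool:
        partial_bests = list(pool.map(best_in_chunk, chunks))
    return max(partial_bests)


if __name__ == "__main__":
    print(sweep(range(5, 201)))  # (0, 37)
```

Each chunk is searched independently and only the small per-chunk results are combined at the end, so total clock time drops roughly with the number of workers, as long as the individual runs do not depend on one another.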

You will either pay for the cost of the programming or for multiple instances of a program. Actually, I paid just an additional 20% to get the second license.

Uh, that's my point. An application that iterates multiple pattern matches across multiple symbols and costs $1,500 should already be multi-threaded.

Charging you extra for what amounts to an amateurish hack shouldn't be tolerated in something that's advertised for trading professionals.

Maybe you would like to explain to us how it is done. Consider the trivial nested loop example below:

Actually, there's a special term for this sort of problem and it's treated fairly well here:

http://en.wikipedia.org/wiki/Embarrassingly_parallel

vikana · May 18, 2009

The biggest issue/problem with the Tesla architecture is that you have to re-design your software around their APIs. For some that's easy, but for many, it's probably not a good fit.

If your software is already highly distributed and parallel without lots of locking, CUDA might fit. Otherwise, it's a big project to support it.

Personally, I'd rather see a board with 100 386-instruction-set CPUs, where normal software would have an easier time exploiting the parallelism.

Edit: actually, 386 is not that important. What's key, IMO, is a general-purpose CPU with good compilers, preferably a well-known architecture such as 386, MIPS, SPARC or the like, and a solid kernel on it.

jprad · May 18, 2009

Quote from vikana:

The biggest issue/problem with the Tesla architecture is that you have to re-design your software around their APIs. For some that's easy, but for many, it's probably not a good fit.

I agree. Nvidia labeling the Tesla as "general purpose" is misleading at best.

Personally, I'd rather see a board with 100 386-instruction-set CPUs, where normal software would have an easier time exploiting the parallelism.

It's not really general purpose, but here's a 64-core MPP you can get on a single board:

http://www.tilera.com/products/processors.php

But, something like this, if it sees the light of day, would be much more economical power-wise and give you a 12-way cluster:

http://www.theregister.co.uk/2009/05/15/dell_does_via_nano/