Automated Trading for a Living

Discussion in 'Automated Trading' started by Allistah, Apr 30, 2013.

  1. garachen

    garachen

    Google doesn't use RAID
     
    #41     May 3, 2013
  2. vicirek

    vicirek

     I am less dependent on hardware limitations because I design and program my own data storage; I do not use commercial databases because they are too slow.

     In the beginning I was also using RAID, until I found out it did not offer me any real protection or better data storage.

    I agree if you have no choice then go with RAID.
     
    #42     May 3, 2013
  3. ofthomas

    ofthomas

     they still do... they are just doing it with storage nodes... lastly, more and more file systems have the protection built in now... so it has moved more to the software side and become object-based, rather than sitting at the physical layer dealing with blocks...
     
    #43     May 3, 2013
  4. ofthomas

    ofthomas

     exactly, you understood what your requirements and your app needs were, and then accounted for the protection within your db design... some apps will actually suffer performance-wise if they get any type of RAID; they prefer to be given raw disks, and the volume manager, at the OS or app layer, handles the "consolidation and presentation" of the LUN group to the app... think ZFS, ASM (Oracle), and the likes...
     
    #44     May 3, 2013
  5. garachen

    garachen


    Doesn't sound like they think they use RAID - since they contrast GFS with it. But sure, if you want to call that RAID then I'm using RAID too.

    http://static.googleusercontent.com...rch.google.com/en/us/archive/gfs-sosp2003.pdf
     
    #45     May 3, 2013
  6. ofthomas

    ofthomas

     you've actually deployed GFS? gutsy... :) ... and yes... still RAID... as I said, what has transpired is that more and more vendors have moved to implementing it on the software side, where the physical blocks have become objects; all the other tech remains the same, but now you can basically grow a FS to PBs and are able to lose many disks at a time, given the objects can be quickly rebuilt... as the data sizes grew, actual physical RAID became a pain in the rear... EMC/HDS/HP/IBM addressed it within their frames by virtualizing the LUNs, so you basically RAID within an array at times... (can be wasteful as the sizes grew) ... but it is more efficient to do it with objects (3PAR/ZFS/GFS/etc.)...

    RAID = Redundant Array of Inexpensive Disks...

     and not to bore you... but this is basically what I had said, it is handled by storage nodes (they just call it chunklets)... actually, as I re-read the spec doc, it reminds me of EMC Centeras back in the day... now those have advanced much further and are more reliable... we used them primarily for compliance, when data has to be guaranteed not to have been altered and has to follow a given expiration schedule for compliance... or can't be deleted due to legal holds..

    anyhow... see, RAID... just object based.. :)

    A GFS cluster consists of a single master and multiple
    chunkservers and is accessed by multiple clients, as shown
    in Figure 1. Each of these is typically a commodity Linux
    machine running a user-level server process. It is easy to run
    both a chunkserver and a client on the same machine, as
    long as machine resources permit and the lower reliability
    caused by running possibly flaky application code is acceptable.

     Files are divided into fixed-size chunks. Each chunk is
     identified by an immutable and globally unique 64 bit chunk
     handle assigned by the master at the time of chunk creation.
     Chunkservers store chunks on local disks as Linux files and
     read or write chunk data specified by a chunk handle and
     byte range. For reliability, each chunk is replicated on multiple
     chunkservers. By default, we store three replicas, though
     users can designate different replication levels for different
     regions of the file namespace.

    The master maintains all file system metadata. This includes
    the namespace, access control information, the mapping
    from files to chunks, and the current locations of chunks.
     It also controls system-wide activities such as chunk lease
     management, garbage collection of orphaned chunks, and
     chunk migration between chunkservers. The master
    periodically communicates with each chunkserver in HeartBeat
    messages to give it instructions and collect its state.
     
    #46     May 3, 2013
  7. gmst

    gmst

    Many Thanks Net!!

     This showing off from you is quite a learning experience for many of us. So please keep doing great things and keep showing off :) Since I am in the process of rolling out my own infrastructure (excel/vba/vb.net) to augment what I do with MC, your experience and design are quite valuable to me, especially because you have similarly expanded from NT.

     So, a few more questions:
     1) You are currently doing 15 markets. I assume it's all FX and futures. You mentioned your backtesting setup is quite scalable. If you move to stocks (of which there are thousands), will your infrastructure successfully scale as it is currently designed? Or will you be forced to make major architectural modifications? If yes, what will they be like?
     2) Not directly related to the thread - but who is your data source, and why are you getting data in chunks (like you have 29 months and you are going to get 35 more months)?
     3) What do you mean by event stream? Do you mean streaming real-time data? If yes, then how can you backtest anything on streaming data, since you are fetching the historical data from files on your computer?
     
    #47     May 3, 2013
  8. Let me start from the end:

     I get the NANEX MF (CME GROUP) real-time tape - I started in December 2010. When it became clear we would go for our own infrastructure long term, I decided to get a good data provider that has no problems with downtime - Nanex delivers real-time tape files that you can archive and that never have a gap. If you are down, once you reconnect you get the missing pieces. I started getting them in December 2010 and let them gather for a year... ;) Then started working - that one year was essentially lost (thanks NinjaTrader). So, the 29 months is December 2010 to April ;) Collection is ongoing, but at the moment we move data to the backtest archive monthly, though that will change to daily exports.

     We do CME futures only at the moment. Stocks - you would need to seriously push the node count, probably to a couple of hundred. I would tweak node responsiveness - right now they call back to the SQL Server every second looking for work; that could go to every 6 seconds (10 times per minute), also when they have no work (they check internally every second). The load of all the nodes on the database server would be high. That needs to be a decent higher-range machine, likely with a LOT more discs.
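
     Roughly, that node-side poll could look like the following - a minimal C# sketch only, where the table and column names (JobQueue, JobId, Status, NodeId, LastCallIn), the connection string and the 6-second idle back-off are made up for illustration, not the actual schema:

     // Node-side work poll: atomically claim one scheduled job, run it, repeat.
     using System;
     using System.Data.SqlClient;
     using System.Threading;

     class WorkerNode
     {
         // TOP (1) + READPAST lets many nodes poll concurrently without blocking each other.
         const string ClaimSql = @"
             UPDATE TOP (1) JobQueue WITH (ROWLOCK, READPAST)
             SET Status = 'Working', NodeId = @node, LastCallIn = SYSUTCDATETIME()
             OUTPUT inserted.JobId
             WHERE Status = 'Scheduled';";

         static void Main()
         {
             using (var conn = new SqlConnection("Server=gridmaster;Database=Backtest;Integrated Security=true"))
             {
                 conn.Open();
                 while (true)
                 {
                     using (var cmd = new SqlCommand(ClaimSql, conn))
                     {
                         cmd.Parameters.AddWithValue("@node", Environment.MachineName);
                         object jobId = cmd.ExecuteScalar();        // null when no job is available
                         if (jobId == null)
                         {
                             Thread.Sleep(TimeSpan.FromSeconds(6)); // idle back-off, as described
                             continue;
                         }
                         RunBacktestJob(Convert.ToInt64(jobId));    // placeholder for the real work
                     }
                 }
             }
         }

         static void RunBacktestJob(long jobId) { /* run the simulation, then mark the job done */ }
     }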

     Otherwise - not a lot we do not plan to do anyway. We are moving (this summer, parts are ordered) to a 10G-based file server setup, with an iSCSI SAN to boot the nodes from a central three-computer highly available virtual SAN. We are moving the tape storage to a new Adaptec 7805Q controller with SSD as cache, and we will move the SQL Server to a new hardware layout that has the capability to handle 72 discs. Doing spreads/correlations will require more work, but that is not grid-related and is planned anyway.

     The whole concept, though, lives and falls around having cut-off points where you know we are flat - in our case those are weekends. We do only intraday trading, but the 15-minute break of the CME means we have no guaranteed off time during the week, especially as we plan to possibly integrate Forex at some point. Weekends are always flat - that is important because a job must be independent and not wonder whether it has a position from the last week or not.
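
     To illustrate the cut-off idea only (not the actual scheduler code): slicing a date range into weekend-bounded jobs, Saturday to Saturday, so each job starts and ends while the account is known to be flat.

     // Split a date range into trading-week slices bounded by weekends.
     using System;
     using System.Collections.Generic;

     static class WeekSlicer
     {
         public static IEnumerable<(DateTime From, DateTime To)> TradingWeeks(DateTime start, DateTime end)
         {
             // Roll back to the Saturday before 'start' so the first slice begins flat.
             DateTime cursor = start.Date;
             while (cursor.DayOfWeek != DayOfWeek.Saturday)
                 cursor = cursor.AddDays(-1);

             for (; cursor < end; cursor = cursor.AddDays(7))
             {
                 DateTime to = cursor.AddDays(7);
                 yield return (cursor, to < end ? to : end);  // each slice becomes one independent job
             }
         }
     }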

     But the main principle stands. Seriously. HPC queueing is well researched. I probably would invest more time into making the node system more failsafe - right now we reset working jobs in case of a reset of a node, but that requires the node to come back up and say "hey, I am here and I just started" so all work assigned to it is reset to scheduled. If the node dies, we have to do that manually. It may be worth adding a timeout there, so that work at nodes that have not called in within 10x their call-in interval is automatically reset. When you go to 100 nodes that becomes more likely. But anyway, that is minor - it is not a change of principle.
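
     Such a timeout could be a small sweep on the scheduler side - again only a sketch, reusing the made-up JobQueue columns from the earlier snippet plus an assumed per-job CallInSeconds column:

     // Scheduler-side sweep: reschedule jobs whose node has gone silent for 10x its call-in interval.
     using System;
     using System.Data.SqlClient;
     using System.Threading;

     class StaleJobSweeper
     {
         const string ResetSql = @"
             UPDATE JobQueue
             SET Status = 'Scheduled', NodeId = NULL
             WHERE Status = 'Working'
               AND LastCallIn < DATEADD(second, -10 * CallInSeconds, SYSUTCDATETIME());";

         static void Main()
         {
             using (var conn = new SqlConnection("Server=gridmaster;Database=Backtest;Integrated Security=true"))
             {
                 conn.Open();
                 while (true)
                 {
                     using (var cmd = new SqlCommand(ResetSql, conn))
                     {
                         int reset = cmd.ExecuteNonQuery();
                         if (reset > 0)
                             Console.WriteLine("Rescheduled " + reset + " job(s) from silent nodes.");
                     }
                     Thread.Sleep(TimeSpan.FromMinutes(1));   // sweep once a minute
                 }
             }
         }
     }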

     We record all market events in proper order and play them back during backtesting. Ninja, for example, has no bid/ask during backtests - the MarketUpdate callback is never called. That makes certain things complicated to test. Not so for us - our backtest is identical in code to real trading, just faster, and you get every market event you would get during trading. Our simulator reconstructs the order book based on bid and ask updates. It then executes the fill based on this order book. We are now adding the possibility to execute the fill based on a larger size (for example, you get the worst fills of 3x your order size - that means no slippage in a stable market, slippage if the bid/ask is thin). Ninja is a serious pain in certain aspects because in backtests you get only bars, nothing else. That partially means multiple code paths. No event stream, sadly.
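
     A very simplified sketch of those two ideas (book reconstruction from the recorded updates plus the 3x-size fill rule) - the names are made up, only the ask side is shown, and this is not the actual simulator code:

     // Rebuild the ask side of the book from depth updates, then fill a market buy
     // at the worst price an order of 3x the requested size would have reached:
     // no slippage when the book is deep, slippage when it is thin.
     using System;
     using System.Collections.Generic;

     class BookSimulator
     {
         readonly SortedDictionary<double, long> asks = new SortedDictionary<double, long>();

         // Apply an ask update from the recorded event stream (size 0 removes the level).
         public void OnAskUpdate(double price, long size)
         {
             if (size <= 0) asks.Remove(price);
             else asks[price] = size;
         }

         // Price at which a market buy of 'qty' is assumed to fill.
         public double FillMarketBuy(long qty, int sizeMultiple = 3)
         {
             long needed = qty * sizeMultiple;
             double worst = double.NaN;
             foreach (var level in asks)          // SortedDictionary iterates best (lowest) ask first
             {
                 worst = level.Key;
                 needed -= level.Value;
                 if (needed <= 0) break;
             }
             if (double.IsNaN(worst))
                 throw new InvalidOperationException("empty book");
             return worst;                        // whole order assumed filled at the worst touched level
         }
     }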
     
    #48     May 4, 2013
     Probably sufficient for most, but I still don't think it can match Tesla's 2048 cores, which are tailored for floating-point arithmetic...
     
    #49     May 4, 2013
     Most likely. The main problem is still that the Tesla requires specific programming AND is bad on latency (AMD is fixing this now in their next-generation processors). That means writing strategies 2 times, which means debugging them 2 times.

     Oh, we are faster than I thought... 747 million simulated trades now. Seems all the optimizations done over the last 2 weeks are really starting to pay off now ;)
     
    #50     May 4, 2013