How to build an automated system

Discussion in 'App Development' started by hft_boy, Feb 4, 2013.

  1. hft_boy

    Hey,
    I figure I'll spend half an hour to an hour a week writing about various aspects of building the infrastructure behind a trading system. Why? I guess I'm just kind of bored, and there seems to be precious little written on the subject by people who know what they are doing. And I feel the need to thrust my views on the world :).

    I guess my first post will be about my general philosophy/approach towards programming.

    First, don't optimize until it is needed. Focus on the logic of the code first, get it right, get it simple, make it easy to understand. Then, when optimizing, know your patterns of usage, and use the correct hardware/language/data structure for the job. It seems like every other day somebody recommends an SSD for accessing files faster. Well, for files which are read sequentially (e.g. data files), there is basically no difference between spinny drives and NAND drives because the bottleneck is in SATA transfer speeds, not seek time! Yeah there is like a millisecond difference in seeking to the start of the file but unless you are hammering the drive with a thousand requests a second, there is not going to be a noticeable difference.

    Don't use too much abstraction, and don't use too little abstraction. Einstein is credited with saying that "everything should be as simple as it can be, but not simpler." I tend to agree. Don't make too many classes. Don't make too few. (Don't use C++, herp derp). It comes down to patterns of usage and not prematurely optimizing. There is no point in abstracting away five lines of code. More to the point, it can actually be dangerous -- too much abstraction kills the ability to know what can be assumed about the code, which is a serious problem come testing time. This brings me to my next point.

    Write code which is easy to test for correctness. It's really hard to get this right, and even experienced programmers mess up all the time. What I try to do is write code in such a way that there are 'logical bottlenecks', so that the number of assumptions that have to be made about each section is limited, and then assert the crap out of it so that when it breaks it isn't subtle. Test pre-conditions, post-conditions and invariants. Push as many of these checks as possible to compile time (e.g. const-correctness).
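    For example, here is a minimal sketch of what I mean (C++; the rolling_mean function and its inputs are made up purely for illustration):

    #include <cassert>
    #include <cstddef>
    #include <vector>

    // Hypothetical example: mean of the last 'window' prices.
    // Pre-conditions are asserted so a violation blows up loudly in a
    // debug build instead of producing subtly wrong numbers.
    // Taking 'prices' by const reference makes the "this function does
    // not modify its input" assumption a compile-time guarantee.
    double rolling_mean(const std::vector<double>& prices, std::size_t window) {
        assert(window > 0);               // pre-condition
        assert(prices.size() >= window);  // pre-condition

        double sum = 0.0;
        for (std::size_t i = prices.size() - window; i < prices.size(); ++i)
            sum += prices[i];

        double mean = sum / static_cast<double>(window);
        assert(mean == mean);             // post-condition: result is not NaN
        return mean;
    }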

    This problem of actually knowing what various pieces of code are doing is one of the reasons I don't like using really high level abstractions/languages and try to keep external library usage to a minimum -- unless the documentation is very good, and you read it carefully, you won't know what assumptions you can make about the code (e.g. did you know that java.lang.Math.round is different from C's "math.h" round? #omgmymindwasblown). Even if you read the documentation, you should just go and read the source code anyway to double check that it actually does what it claims to do. IMHO, not being aware of what assumptions can be made about code which executes behind the scenes, especially as it pertains to the outcome of your own code, is seriously sloppy work and is not acceptable for production systems.
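    If you're curious about that rounding difference, it shows up on the halfway cases (a tiny sketch in C++, using the C library's round; the comments state what java.lang.Math.round returns):

    #include <cmath>
    #include <cstdio>

    int main() {
        // C's round() rounds halfway cases away from zero.
        std::printf("%.1f\n", std::round(-2.5));  // prints -3.0
        std::printf("%.1f\n", std::round(2.5));   // prints 3.0
        // java.lang.Math.round(-2.5) returns -2, because Java rounds
        // halfway cases toward positive infinity (floor(x + 0.5)).
        return 0;
    }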

    Well, anyways, that's about it for the week. Hope that this was entertaining or that maybe you even learned something from it!
     
  2. I've been reading a lot of threads in the programming section lately...
    I don't really understand how SSDs aren't any faster... I mean, my level of understanding isn't great, but when I use an HDD it would take an idiot not to see the difference in speed when I go to install a program, start a program, etc.
    I have this:
    http://www.newegg.com/Product/Product.aspx?Item=N82E16822136296

    and this

    http://www.newegg.com/Product/Product.aspx?Item=N82E16820227715

    So you're saying the entire difference is literally between SATA 2 and SATA 3?

    Seems like what you're saying about programming is common sense, all of which is seemingly counterintuitive as you actually do it, haha. Really appreciate your input. I'm just starting out, no formal education, just some web development and a lot of reading about programming concepts...
     
  3. hft_boy

    Not really. For starting a program or computer, especially if the filesystem is fragmented, you need to make a bunch of different random reads from disk. So it makes a big difference, and in this case an SSD is much faster. But for something like tick data which can be a flat file on disk, you're not really going to see much of a difference because the whole access is sequential.

    EDIT: BTW, you don't have to believe me. Write a test program which reads from a file on the two drives, see what the difference is, if any.
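    Something along these lines (a rough sketch; point it at a big file on each drive, and make the file larger than RAM or drop the OS page cache between runs so you're timing the drive and not the cache):

    #include <chrono>
    #include <cstddef>
    #include <cstdio>
    #include <vector>

    int main(int argc, char** argv) {
        if (argc < 2) { std::printf("usage: %s <file>\n", argv[0]); return 1; }

        std::FILE* f = std::fopen(argv[1], "rb");
        if (!f) { std::perror("fopen"); return 1; }

        std::vector<char> buf(1 << 20);   // read sequentially in 1 MB chunks
        std::size_t total = 0, n = 0;

        auto start = std::chrono::steady_clock::now();
        while ((n = std::fread(buf.data(), 1, buf.size(), f)) > 0)
            total += n;
        auto stop = std::chrono::steady_clock::now();
        std::fclose(f);

        double secs = std::chrono::duration<double>(stop - start).count();
        std::printf("read %zu bytes in %.3f s (%.1f MB/s)\n",
                    total, secs, total / 1e6 / secs);
        return 0;
    }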
     
  4. That explanation makes complete sense... but it's the navigation in and out of programs that builds up frustration with me over time as well, so it makes sense. And nowadays with the dollar-a-gig thing going on, it's hard not to buy them. There is no quantification of reliability per dollar, though; I'm sure it's not good, at least in my experience. The thing is, when an SSD goes down, it goes down. A platter can be recovered more easily, at least relatively speaking. It is my understanding that the controller on the SSD goes bad and could technically be replaced and your data recovered, but idk...

    You went on about DBs as well... as I remember you were talking about efficiency in terms of reads/writes, but aren't you missing out on the efficiencies of a DB as opposed to a flat file? Unless you have an XML document, how would you navigate through the flat file with a script as effectively as you navigate with SQL? You can cache queries, build inner joins, and in essence build virtual spreadsheets; wouldn't grabbing a bunch of time series data off flat files and combining them yourself be much more complicated, no?
     
  5. hft_boy

    Probably missing out on benefits of a DB ;). But really, what do you use a tick data file for? You don't need to cache queries, build inner joins and spreadsheets and whatnot, because in the end you are going to analyze it just the same way you analyze market data in real time -- sequentially, one tick at a time. I guess I just don't really understand what problem a database solves in terms of backtesting.
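    In other words, backtest reads end up looking something like this (a sketch; the Tick layout and filename are made up, the real format is whatever your capture writes):

    #include <cstdint>
    #include <cstdio>

    // Illustrative fixed-size record -- the real layout depends on your feed.
    struct Tick {
        int64_t timestamp_ns;
        double  price;
        int32_t size;
        int32_t flags;
    };

    int main() {
        std::FILE* f = std::fopen("ES_2013-02-04.ticks", "rb");  // made-up filename
        if (!f) { std::perror("fopen"); return 1; }

        Tick t;
        while (std::fread(&t, sizeof(t), 1, f) == 1) {
            // feed the tick to the strategy exactly as you would live data,
            // e.g. strategy.on_tick(t);
        }
        std::fclose(f);
        return 0;
    }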
     
  6. You do "join" slices of time series together to see what impact events on one product have on another, although this is more stat-arby.

    Just curious: In your line of work/trading, are you mostly a taker or a maker? I'm assuming you have some real speed at your disposal, with your forum name and all.
     
  7. kut2k2

    So given your name, is your focus on hft or will your advice apply to traders at a slower pace (e.g., one-minute timeframe) as well?

    And what exactly is wrong with C++? Thanks.
     
  8. hftvol

    Regarding your statement about SSDs vs. physical drives, I cannot confirm that. Reading a binary data file (a custom binary format representing fx tick data) from an SSD is about an order of magnitude faster than reading it from an HDD. Also, you are making the simplifying assumption that you only read a single file at any time in a sequential fashion. The sequential assumption is fine, but most of the time the engine needs to read several files concurrently (any strategy that uses more than one time series). Having said that, an SSD really kicks in when you read several files concurrently.

    I store each symbol in its own file, which is pretty much the recommended and standard way of doing it. I then read time-synchronized portions of all requested symbols into memory (I cannot read the whole data set from requested start to end because that would kill even a 32 or 64 GB machine with 20 or more concurrent symbols); the data in memory is then merge-sorted and fed to the strategy engine. This is where an SSD or memory really shines over a traditional HDD.
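    The merge step is roughly this (a simplified sketch, assuming each symbol's chunk is already sorted by timestamp; the Tick fields are illustrative only):

    #include <cstddef>
    #include <cstdint>
    #include <functional>
    #include <queue>
    #include <tuple>
    #include <vector>

    struct Tick {                 // illustrative record; the real layout differs
        int64_t timestamp_ns;
        double  price;
        int     symbol_id;
    };

    // Merge N per-symbol chunks (each already sorted by time) into one
    // time-ordered stream using a min-heap keyed on the next timestamp.
    std::vector<Tick> merge_chunks(const std::vector<std::vector<Tick>>& chunks) {
        // (timestamp, chunk index, position within chunk)
        using Cursor = std::tuple<int64_t, std::size_t, std::size_t>;
        std::priority_queue<Cursor, std::vector<Cursor>, std::greater<Cursor>> heap;

        for (std::size_t c = 0; c < chunks.size(); ++c)
            if (!chunks[c].empty())
                heap.push({chunks[c][0].timestamp_ns, c, 0});

        std::vector<Tick> merged;
        while (!heap.empty()) {
            auto [ts, c, i] = heap.top();
            heap.pop();
            (void)ts;  // the timestamp was only needed for heap ordering
            merged.push_back(chunks[c][i]);
            if (i + 1 < chunks[c].size())
                heap.push({chunks[c][i + 1].timestamp_ns, c, i + 1});
        }
        return merged;
    }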

     
  9. Theoretically, from what I've seen, a lot of guys are using R to do statistical studies on data sets. But having used and studied databases before, I think of them as a typical intellectual hurdle one doesn't want to jump. If you're doing some market-wide correlation stat arb search, having a searchable DB and building some logic into your queries would be a lot faster than asking flat files the same questions. But I'm sure that at lower levels of use a DB is more 'shit' to keep up with, for what it's worth. Think of a query as a virtual spreadsheet constructed with the constraints and logic you require: you can build a lot of the logic into the SQL, or you can use your native language to loop through the query results. Obviously optimizing commonly used queries by caching them makes sense. But idk, I'm so new it's hard for me to feel too confident about my ideas. I have doubts about flat file scalability, though: you always start out with a few symbols and 10 years of ticks, but then you end up with a huge library. And I realize refactoring is OK to do just to keep the project going, but the barrier to entry with DBs is so low that I don't see the need. Stack Overflow has helped me a lot in past research; they pose a good argument here:

    http://stackoverflow.com/questions/...e-some-technical-reasons-for-choosing-one-ove
     
  10. Great thread! Thanks for starting ...

    About "Don't make too many classes. Don't make too few": what rules of thumb are there? What's too many? What's too few?

    There's a lot in what you've written above that I'd appreciate understanding better. Could you expand on each of these points, or give examples?

    Many thanks again ...
     