Why even use a database?
There's been some discussion of databases lately, so I'd like to pose the question: why even use a database? From the research I've done, a lot of traders seem to prefer flat binary files for storing price/quote data, apparently because binary files are much faster to read and write.
For instance, if my purpose was to have an efficient means of storing gigs of tick data, and this data was the backbone of my backtesting engine, whereby strategies are run over different time periods etc., why would I need a database? Why not read the pertinent data and hold it in memory? From my research, DBs excel at query-based searches... unless your backtesting analysis needs that type of functionality, again, why would you use a database?
Background: I'm currently developing a backtesting/trading engine and have been looking at the different ways large amounts of data can be stored; I have no experience with either method of storage. Here's a great discussion by experienced mechanical traders on the merits of binary data and the pitfalls of DBs. It's where most of my questions come from, and it's definitely worth a read:
Would love to hear your thoughts on this. Examples would be great... I've been reading a lot of the theory and it's difficult to base decisions on it. Concrete examples of how someone has used the different methods, and how they compare, always trump theory.
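For concreteness, here is a minimal sketch in Python of the flat-binary approach traders in that discussion describe: ticks packed into fixed-width records so the whole file can be read straight into memory. The record layout (timestamp, price, size) and field widths are my own assumptions for illustration, not anything from the thread.

```python
import io
import struct

# Assumed fixed-width tick record: timestamp (uint64), price (double), size (uint32).
TICK = struct.Struct("<QdI")  # 20 bytes per record, little-endian

def write_ticks(buf, ticks):
    """Append ticks as packed fixed-width binary records."""
    for ts, price, size in ticks:
        buf.write(TICK.pack(ts, price, size))

def read_ticks(buf):
    """One sequential read of the whole file, then unpack record by record."""
    data = buf.getvalue() if isinstance(buf, io.BytesIO) else buf.read()
    return [TICK.unpack_from(data, i) for i in range(0, len(data), TICK.size)]

buf = io.BytesIO()  # stands in for a file on disk
write_ticks(buf, [(1, 100.25, 300), (2, 100.26, 150)])
ticks = read_ticks(buf)
print(ticks[0])  # (1, 100.25, 300)
```

Because every record is the same size, seeking to record N is just an offset multiplication, and loading a whole series is a single sequential read - which is where the speed claims come from.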
Depends somewhat on your specific application and needs, but in general I'd have to say that using a DB to store price data is not inherently detrimental to performance.
One instance I can think of off the top of my head where streamed binary files could be better: if your implementation requires routinely reloading large amounts of data (i.e., tens of thousands of records) for one security at a time for some kind of processing, then moving on and doing the same for the next security, and so on. But so many analytic applications don't involve that pattern that not using a DB is probably more the exception than the rule.
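That access pattern - load one security's entire history, process it, move on - can be sketched as below. This is a hypothetical illustration, assuming one binary file per symbol under a `ticks/` directory and a made-up 20-byte record of (timestamp, price, size):

```python
import struct
from pathlib import Path

# Hypothetical fixed-width record: timestamp (uint64), price (double), size (uint32).
TICK = struct.Struct("<QdI")

def vwap(path):
    """Load one security's whole tick file in a single sequential read,
    then compute its volume-weighted average price."""
    data = Path(path).read_bytes()
    notional = volume = 0
    for ts, price, size in TICK.iter_unpack(data):
        notional += price * size
        volume += size
    return notional / volume if volume else None

# One big sequential read per security, then on to the next file.
for f in sorted(Path("ticks").glob("*.bin")):  # assumed layout: ticks/<symbol>.bin
    print(f.stem, vwap(f))
```

A DB would answer the same question with tens of thousands of row fetches per symbol; the flat file answers it with one read per symbol, which is why this particular workload favors binary files.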
In my experience, the typical reason for performance problems when using a DB (and this extends to most applications, not just trading/analysis) is poor database design (sometimes the logical design, sometimes the physical design) and/or poor design of the access/processing.
Consider that most datafeeds store their large datasets in some form of DB, that the realtime systems at the major exchanges all use DBs (I'm specifically familiar with NYSE, NYMEX, and CME, but would guess the other exchanges do similarly), and that I'm also very familiar with numerous other realtime/high-volume systems that all use integral DBs. It's fair to generalize that well-designed and well-implemented systems can use DBs without a performance impact.
Re: Why even use a database?
Also, modern database systems are designed to help ensure data integrity: they let you apply sets of rules that guard against the data being modified in any undesirable way.
Also, modern database systems have mechanisms for backup and, most importantly, recovery to a point in time. That is, you can roll back changes in a database to a particular point in time.
So, if you need many people (or even just a few) asking set-based questions of larger amounts of data, and the integrity of the data is paramount, then something like MySQL, SQL Server, Oracle, Sybase, or DB2 would be desirable.
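The two features above - declarative integrity rules and the ability to undo changes - can be sketched with SQLite (bundled with Python); the tick schema here is illustrative, not from the thread:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("""
    CREATE TABLE ticks (
        symbol TEXT    NOT NULL,
        ts     INTEGER NOT NULL,
        price  REAL    NOT NULL CHECK (price > 0),  -- integrity rule
        size   INTEGER NOT NULL CHECK (size > 0)
    )""")

# The CHECK constraint rejects bad data at the door.
try:
    con.execute("INSERT INTO ticks VALUES ('IBM', 1, -5.0, 100)")
except sqlite3.IntegrityError:
    print("negative price rejected")

# Transactions let you undo uncommitted changes.
con.execute("INSERT INTO ticks VALUES ('IBM', 1, 95.5, 100)")
con.rollback()
print(con.execute("SELECT COUNT(*) FROM ticks").fetchone()[0])  # 0
```

With flat binary files, both of these guarantees are yours to build and test by hand; with a DB they come with the engine. (Full point-in-time recovery, as opposed to transaction rollback, is a feature of the bigger server products' log-based backup tooling.)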
You really just need to compute the costs to decide if a flat file or a relational database system is of value to your project: my guess is that it is.
There is no one-size-fits-all.
I've run many gigs of tick data on big multi-CPU Unix servers with many gigs of RAM and hundreds of gigs of Fibre Channel disks organized as RAID arrays. I have also used awk & Perl on a laptop against gigs of tick data stored in simple text files.
Depending upon the question to be answered, either one could deliver a result faster. Faster sometimes means run-time performance of the h/w & s/w, and sometimes it means the amount of time it takes the user to code up a query.
When most people say "database s/w" they usually mean "relational database." Relational databases are optimized for OLTP (OnLine Transaction Processing), with a lot of small read-write activity. Only in more recent years have read-mostly data-mining, data-warehousing & OLAP functions been grafted onto a foundation which is still inherently biased towards OLTP. And these issues are separate from which platform makes it easiest to express the question you want answered.
For working on individual series of tick data, plain files read by a script or custom program are probably fastest. For studying relations between a small number of data series, something like SAS might be useful. For cross-tabulations between dozens or hundreds of data series, a relational database may make it easier to write the query.
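The "plain files read by a script" case is the kind of one-pass scan awk or Perl does well; here is the same idea as a short Python sketch, assuming a simple comma-separated tick format (timestamp, price, size - my assumption, the thread doesn't specify one):

```python
import csv
import io

# Stand-in for a plain-text tick file: timestamp,price,size per line.
raw = io.StringIO(
    "09:30:01,100.10,200\n"
    "09:30:02,100.35,50\n"
    "09:30:02,99.90,300\n"
)

# One sequential pass computes the session high, low, and total volume.
high, low, volume = float("-inf"), float("inf"), 0
for ts, price, size in csv.reader(raw):
    p = float(price)
    high, low = max(high, p), min(low, p)
    volume += int(size)

print(high, low, volume)  # 100.35 99.9 550
```

For a single series this beats a DB on both run-time and coding time: there's no schema, no loading step, and no query planner between you and the bytes.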
And these generalizations may not apply to your specific circumstances.
Re: There is no one-size-fits-all.
I've got an eclectic background. Started programming in high school over 20 years ago writing futures trading systems. Had a bit of fame in my early twenties as a trader and then left for 15 years to start a few software companies.
One of them sold an embedded database which was the number one product on the Macintosh. I worked on the internals, disk access, etc. as well as the query optimization.
There is a huge difference in read time between a database and a binary file unless the database has been specifically optimized for large binary data storage (known as BLOBs in the business).
The reason is simple: even in a database with an efficient caching mechanism, large data sets generally involve multiple reads from the disk, because the data is split up into chunks. Every separate read takes a while, because on average the disk must make half a rotation before the data comes under the read heads and the read can start.
Unlike almost every other aspect of computing, disk speeds have not followed Moore's Law. Disks are maybe 30 to 100 times faster than they were 20 years ago while computers are 10,000 times faster.
Even a 10,000 RPM disk takes 6 milliseconds to rotate. So you only get 167 rotations per second. That's a lot of time when computers are doing billions of instructions per second.
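The arithmetic is easy to check (a quick Python sketch; the 1-GHz, one-instruction-per-cycle CPU is an assumed round number for comparison):

```python
rpm = 10_000
rotations_per_sec = rpm / 60                  # ~167 rotations per second
ms_per_rotation = 1000 / rotations_per_sec    # 6.0 ms per full rotation
avg_latency_ms = ms_per_rotation / 2          # average wait: half a rotation

instr_per_sec = 1e9  # assumed: modest 1-GHz CPU, one instruction per cycle
instr_per_seek = instr_per_sec * (avg_latency_ms / 1000)
print(ms_per_rotation, avg_latency_ms, int(instr_per_seek))  # 6.0 3.0 3000000
```

So each extra random read costs the CPU on the order of millions of instructions' worth of idle time - which is why a layout that turns many small reads into one sequential read wins.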
For tick data analysis the speed of reading the data is the determining factor for the speed of testing unless you have very inefficient code or are doing esoteric analysis.
So I suggest storing information about your data in a database but storing the physical data on the disk in raw binary files.
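That hybrid can be sketched as follows - a small catalog in SQLite pointing at per-symbol binary files on disk. The schema, file layout, and record format are all assumptions made for illustration:

```python
import sqlite3
import struct
from pathlib import Path

# Hypothetical fixed-width record: timestamp (uint64), price (double), size (uint32).
TICK = struct.Struct("<QdI")

def store(datadir, catalog, symbol, ticks):
    """Append ticks to a per-symbol binary file; index it in the catalog DB."""
    path = Path(datadir) / f"{symbol}.bin"
    with open(path, "ab") as f:
        for t in ticks:
            f.write(TICK.pack(*t))
    catalog.execute(
        "INSERT OR REPLACE INTO series VALUES (?, ?, ?)",
        (symbol, str(path), path.stat().st_size // TICK.size))
    catalog.commit()

def load(catalog, symbol):
    """Query the catalog for the file, then read it in one sequential pass."""
    path, = catalog.execute(
        "SELECT path FROM series WHERE symbol = ?", (symbol,)).fetchone()
    return list(TICK.iter_unpack(Path(path).read_bytes()))

catalog = sqlite3.connect(":memory:")
catalog.execute(
    "CREATE TABLE series (symbol TEXT PRIMARY KEY, path TEXT, n INTEGER)")
```

The DB does what it's good at (small indexed lookups over metadata: which symbols exist, where their files live, how many records each holds), while the bulk tick data stays in raw binary files that can be read sequentially at full disk speed.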