Database engine: any ideas?

Discussion in 'Data Sets and Feeds' started by NoWorries, Apr 5, 2006.

  1. segv

    Just reaffirming - MySQL is a great database. We use it exclusively.

    -segv
     
    #11     May 26, 2006
  2. MySQL, Linux 2.6.15 kernel, Athlon 2800+, ordinary IDE drive:

    7.24 seconds to select * from a 700K row table of tick data.
     
    #12     May 27, 2006
  3. Paccc

    Do you use MyISAM or InnoDB? Is there a performance benefit of one over the other? Thanks.

    -- Paccc
     
    #13     May 27, 2006
    I've seen this question answered many times before in various forums, and the experienced programmers always respond the same way: don't do it.

    The ratio of memory-access speed to disk-access speed has grown over the years: disks have become slower and slower relative to RAM.

    Disk access is still measured in thousandths of a second, because you have to wait for the platter to spin around so that the disk head can read it. Random-access reads are barely hundreds of times faster than they were when I started doing futures research more than 20 years ago.

    Disk size and linear disk throughput have grown over the years at the pace of Moore's law or even slightly faster. That's why we have 500 gig drives now when we had 5 meg drives in 1982.

    You can read in a large chunk of data very fast if it is stored contiguously. Unfortunately, databases tend to store their data in relatively small pieces, so a typical read of a large chunk of test data through a database requires many separate disk reads, each of which costs several milliseconds. At, say, 8 ms apiece, ten thousand scattered reads is 80 seconds of pure seek time, while a single 32 MB sequential read finishes in well under a second.

    This can slow a large test down tremendously.

    If you are testing using intraday data, your tests will take a long time due to the sheer number of bars as compared to tests with EOD data. The last thing you want is to slow things down even further.

    The solution is fairly simple: store your data in flat files with fixed-size records. You can then compute exactly where the data for your next read lives and use simple data-access routines to read it (a sketch follows this post).

    For example, if your data is stored in 32-byte records, you can read in some reasonable multiple of 32 bytes. You might read 32 * 1024 * 1024 bytes (about a million records) at a time. This will be much, much faster than trying to do the same with a database, perhaps 10 to 100 times faster depending on the database and its specific settings.

    If you want to get fancy, you can even do your reading asynchronously, so the disk is pulling in the next chunk of data while you are processing the current one. That way the disk reads cost essentially no additional time.

    In short, the problem is simple enough that a database is overkill and will typically slow you down by an order of magnitude. Even a fast database will be much slower than simple linear disk access.

    - Curtis

    P.S. I should note that specialized databases designed and optimized for this particular case could theoretically match the performance of the simple mechanism I describe above. However, I have not found any commonly available database whose performance comes close to that theoretical ideal.

    P.P.S. The memory-mapped-files suggestion by 21Centtrader above is another way to handle this problem. It is more dependent on the specific OS (Windows vs. Linux, etc.) and gives you less control, but it may achieve performance similar to what I propose above.
     
    #14     May 28, 2006
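
    A minimal sketch of the fixed-record approach described above, including the asynchronous prefetch from the follow-up paragraph. The 32-byte Tick layout, the file name ticks.dat, and the 32 MB chunk size are illustrative assumptions, not details from the thread:

        // Read a flat file of fixed-size records in large chunks, prefetching
        // the next chunk asynchronously while the current one is processed.
        #include <cstddef>
        #include <cstdint>
        #include <cstdio>
        #include <future>
        #include <vector>

        #pragma pack(push, 1)
        struct Tick {                 // exactly 32 bytes per record
            int64_t time_us;          // timestamp, microseconds since epoch
            double  price;
            int32_t size;
            int32_t flags;
            char    pad[8];           // pad out to a fixed 32-byte record
        };
        #pragma pack(pop)
        static_assert(sizeof(Tick) == 32, "records must be fixed size");

        // Record i lives at byte offset i * sizeof(Tick), so any read position
        // can be computed directly; no index structure is needed.
        static std::vector<Tick> read_chunk(std::FILE* f, long first, std::size_t count) {
            std::vector<Tick> buf(count);
            std::fseek(f, first * (long)sizeof(Tick), SEEK_SET);
            buf.resize(std::fread(buf.data(), sizeof(Tick), count, f));
            return buf;
        }

        int main() {
            const std::size_t kChunk = 1024 * 1024;   // 1M records = 32 MB per read
            std::FILE* f = std::fopen("ticks.dat", "rb");
            if (!f) return 1;

            long next = 0;
            auto pending = std::async(std::launch::async, read_chunk, f, next, kChunk);
            for (;;) {
                std::vector<Tick> chunk = pending.get();   // wait for the chunk in flight
                if (chunk.empty()) break;                  // past end of file
                next += (long)chunk.size();
                // Start reading the next chunk while we process this one.
                pending = std::async(std::launch::async, read_chunk, f, next, kChunk);
                for (const Tick& t : chunk) {
                    (void)t;   // run the backtest / indicator update here
                }
            }
            std::fclose(f);
        }

    Only one read is ever in flight, so sharing the single FILE* is safe here; std::async is simply the least-code way to overlap the next disk read with processing of the current chunk.
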
  5. MyISAM. My knowledge of RDBs is slight, so I'm not really the best person to be handing out recommendations.
     
    #15     May 28, 2006
    mmap should outperform anything else (all things being equal) because it avoids the extra buffer copies from kernel to user space (a sketch follows this post). Besides C or C++, you should be able to use Java's java.nio package for memory-mapped I/O, and that should be portable. I haven't tried it, but it supposedly improves I/O performance.

    One advantage of SQL is that various applications understand it and ready-made tools are available for maintaining your data. If you roll your own format, you will have to write that tooling yourself. If the performance is sufficient, an RDB might still be a good choice.
     
    #16     May 28, 2006
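
    For concreteness, the memory-mapped variant might look like this with the POSIX calls available on Linux (Windows would use CreateFileMapping/MapViewOfFile instead). The file name and the 32-byte record layout are the same illustrative assumptions as in the earlier sketch:

        // Map the whole tick file read-only; the kernel pages data in on demand
        // with no extra copy into a user-space buffer.
        #include <fcntl.h>
        #include <sys/mman.h>
        #include <sys/stat.h>
        #include <unistd.h>
        #include <cstdio>

        #pragma pack(push, 1)
        struct Tick {              // same fixed 32-byte record as the earlier sketch
            long long time_us;
            double    price;
            int       size;
            int       flags;
            char      pad[8];
        };
        #pragma pack(pop)

        int main() {
            int fd = open("ticks.dat", O_RDONLY);
            if (fd < 0) return 1;
            struct stat st;
            if (fstat(fd, &st) != 0) return 1;
            size_t bytes = (size_t)st.st_size;
            size_t n = bytes / sizeof(Tick);

            void* p = mmap(nullptr, bytes, PROT_READ, MAP_PRIVATE, fd, 0);
            if (p == MAP_FAILED) return 1;
            madvise(p, bytes, MADV_SEQUENTIAL);   // hint: we will scan in order

            const Tick* ticks = (const Tick*)p;   // the file now reads like an array
            double sum = 0;
            for (size_t i = 0; i < n; ++i) sum += ticks[i].price;
            std::printf("%zu ticks, mean price %.4f\n", n, n ? sum / n : 0.0);

            munmap(p, bytes);
            close(fd);
        }

    Once mapped, the file is addressable as an ordinary array and the kernel's page cache does all the buffering, which is exactly the user-space copy that mmap saves.
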
    Time-series data is generally not well suited to being stored in an RDBMS or accessed through SQL. In the vast majority of cases the data is read sequentially ("give me every bar after yyyy-mm-dd"), and the overhead imposed by the database adds no value. I am not aware of any commercial systems that store market data in a relational database. Most modern operating systems offer some flavour of memory-mapped files, and I believe you will get far better performance storing the data in flat files (a sketch of this kind of date lookup follows this post).
     
    #17     May 28, 2006
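
    A sketch of that "give me every bar after yyyy-mm-dd" query against a flat file: fixed-size records sorted by time behave like a sorted array, so finding the start point is a binary search and everything after it is one sequential read. The Bar layout, file name, and cutoff timestamp are hypothetical:

        #include <algorithm>
        #include <cstdint>
        #include <cstdio>
        #include <vector>

        #pragma pack(push, 1)
        struct Bar {               // fixed 40-byte record, sorted by time
            int64_t time;          // seconds since epoch
            double  open, high, low, close;
        };
        #pragma pack(pop)

        // Load the whole file; fine for EOD-sized data. For intraday sizes you
        // would mmap or chunk the file, but the search-then-scan shape is the same.
        static std::vector<Bar> load(const char* path) {
            std::vector<Bar> bars;
            if (std::FILE* f = std::fopen(path, "rb")) {
                Bar b;
                while (std::fread(&b, sizeof b, 1, f) == 1) bars.push_back(b);
                std::fclose(f);
            }
            return bars;
        }

        int main() {
            std::vector<Bar> bars = load("bars.dat");
            int64_t cutoff = 1149724800;   // hypothetical: 2006-06-08 00:00:00 UTC

            // O(log n) search for the first bar strictly after the cutoff.
            auto it = std::lower_bound(bars.begin(), bars.end(), cutoff,
                                       [](const Bar& b, int64_t t) { return b.time <= t; });
            for (; it != bars.end(); ++it) {
                // every bar after yyyy-mm-dd, in one linear pass
            }
        }
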
  8. fsm

    Recently came across FastDB / GigaBase - both are open source.

    Haven't evaluated the DBs yet, but the documentation indicates that FastDB stores its data in main memory (RAM), while GigaBase uses memory-mapped files.
     
    #18     Jun 28, 2006