Tick Database Implementations

Discussion in 'Data Sets and Feeds' started by fatrat, Nov 25, 2006.

  1. slacker

    I am using SQLite (www.sqlite.org) to queue data into and out of MySQL. The result is better than I expected.

    SQLite is an embedded database that supports a simple subset of SQL. An SQLite database can be memory-resident or disk-based, and compiled into a C++ program it adds only about 100 KB. The SQLite benchmarks show it being at least twice as fast as MySQL or PostgreSQL.

    The speed benefit is that I can put every event I am following from IB TWS into SQLite and then copy the tables into MySQL every few seconds inside a single BEGIN/COMMIT. This way I am not hammering the drive the way I was before, when every IB event was a standalone INSERT.

    I do the opposite for my charting and fill SQLite from MySQL with only the table data I need to refresh the current views. (Working on this now.) In both cases there is no 'SQLite database server'; the DB is embedded in my application and resident in RAM. It is temporary storage only.

    This approach is cheap (free) and easy to adopt if you are already using an SQL database. I had been trying to keep a lot of data in RAM using flat files, but this approach is fine for what I am doing. Commercial packages or HDF5 seemed like overkill. You can get this up and running in a few hours and see if it is an alternative you can use.
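
    Here is a rough sketch of the buffering side using the SQLite C API (the table and column names are just examples, and the copy into MySQL is left out):

        // Buffer incoming IB TWS events in an in-memory SQLite database,
        // batching the inserts inside one BEGIN/COMMIT.
        #include <sqlite3.h>

        int main() {
            sqlite3* db = nullptr;
            if (sqlite3_open(":memory:", &db) != SQLITE_OK) return 1;

            sqlite3_exec(db,
                "CREATE TABLE ticks (sym TEXT, ts INTEGER, price REAL, size INTEGER);",
                nullptr, nullptr, nullptr);

            // One prepared statement reused for every incoming event.
            sqlite3_stmt* ins = nullptr;
            sqlite3_prepare_v2(db,
                "INSERT INTO ticks VALUES (?1, ?2, ?3, ?4);", -1, &ins, nullptr);

            // Wrap a batch of inserts in one transaction. On the in-memory
            // buffer this costs little either way; the real win is doing the
            // same batching when the rows are copied into MySQL, so the disk
            // is hit once per batch instead of once per tick.
            sqlite3_exec(db, "BEGIN;", nullptr, nullptr, nullptr);
            for (int i = 0; i < 1000; ++i) {            // stand-in for the TWS callback
                sqlite3_bind_text(ins, 1, "QQQQ", -1, SQLITE_STATIC);
                sqlite3_bind_int64(ins, 2, 1167300000000LL + i);   // ms timestamp
                sqlite3_bind_double(ins, 3, 43.21);
                sqlite3_bind_int(ins, 4, 100);
                sqlite3_step(ins);
                sqlite3_reset(ins);
            }
            sqlite3_exec(db, "COMMIT;", nullptr, nullptr, nullptr);

            // Every few seconds: SELECT the buffered rows, bulk-insert them
            // into MySQL inside one transaction, then DELETE them here.
            sqlite3_finalize(ins);
            sqlite3_close(db);
            return 0;
        }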

    Something to consider.

    Great threads fatrat and others....

    Happy New Year!
     
    #21     Dec 28, 2006
  2. Bowgett

    I would go with flat binary files. You can use memory-mapped files so the OS can take care of caching parts of the file in memory. Do not forget to specify the right flags when you open the file, like sequential read access, for example. Make sure you run defrag on that drive regularly too.
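
    A rough POSIX sketch of the memory-mapped flat file with the sequential-read hint (the file name is made up; on Windows the rough equivalent is CreateFile with FILE_FLAG_SEQUENTIAL_SCAN plus CreateFileMapping/MapViewOfFile):

        #include <fcntl.h>
        #include <sys/mman.h>
        #include <sys/stat.h>
        #include <unistd.h>

        int main() {
            int fd = open("ticks.bin", O_RDONLY);
            if (fd < 0) return 1;

            struct stat st;
            fstat(fd, &st);

            // Map the whole file; the OS pages it in and out, and we just
            // see one big byte array.
            void* data = mmap(nullptr, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
            if (data == MAP_FAILED) return 1;

            // Tell the kernel we will read front to back so it can read
            // ahead aggressively and drop pages behind us.
            madvise(data, st.st_size, MADV_SEQUENTIAL);

            // ... scan the ticks ...

            munmap(data, st.st_size);
            close(fd);
            return 0;
        }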

    Compress ticks as much as possible. If, for example, the price of QQQQ fluctuates between, say, $30 and $60, that $30 range at penny resolution is only 3,000 distinct values, which fits in 12 bits instead of a full double. Store your base of $30 once and just the offset above it in each tick.
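
    A rough sketch of what a packed tick could look like in C++ (field names and widths are just an illustration):

        #include <cstdint>

        // One tick in 8 bytes (on common compilers) instead of a full
        // (double, double, int) record. Prices are stored as penny offsets
        // above a per-symbol base price such as $30.00.
        struct PackedTick {
            uint64_t ms_since_open : 27;  // up to ~37 hours of milliseconds
            uint64_t price_offset  : 14;  // pennies above the base, up to $163.83
            uint64_t size_lots     : 16;  // trade size in round lots
            uint64_t flags         : 7;   // bid/ask/trade, exchange, etc.
        };

        // Recover the dollar price given the stored base (e.g. 30.00).
        inline double tick_price(const PackedTick& t, double base_price) {
            return base_price + t.price_offset * 0.01;
        }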
     
    #22     Dec 28, 2006
  3. Modulus, can you elaborate on "triangulate" when working on what is essentially a large byte array?
     
    #23     Dec 30, 2006
  4. Probably binary chop.
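
     i.e. binary search over that big byte array, assuming fixed-width records sorted by timestamp. A rough sketch (the record layout is made up):

        #include <cstddef>
        #include <cstdint>
        #include <cstring>

        struct TickRecord {            // every record is exactly this size
            int64_t ts;                // epoch milliseconds, ascending in the file
            double  price;
            int32_t size;
            int32_t flags;
        };

        // Index of the first record with ts >= target (a lower bound),
        // found by repeatedly halving the search range.
        size_t find_first_at_or_after(const uint8_t* base, size_t count, int64_t target) {
            size_t lo = 0, hi = count;                 // search [lo, hi)
            while (lo < hi) {
                size_t mid = lo + (hi - lo) / 2;
                TickRecord rec;
                std::memcpy(&rec, base + mid * sizeof(TickRecord), sizeof(rec));
                if (rec.ts < target) lo = mid + 1;     // answer is in the upper half
                else                 hi = mid;         // answer is at mid or below
            }
            return lo;
        }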
     
    #24     Dec 30, 2006
  5. Great thread.. it just happens that I've been thinking about and reading up on this a lot myself..


    I think, to begin with, we should note that we're dealing with time-series data, so all those relational databases are out the window.

    The main problem is this: effectively we need a very long array (vector), which is tedious to manage by hand - hence the need for a database. Given the nature of time-series requirements, typical DBs like MySQL are of little use. (Of course it all depends on your requirements, but most likely such databases have many limitations..)

    This is crucial: to build a meaningful time-series database, you need interaction between in-memory and stored (in-DB) records - and this must happen in real time. Other requirements are speed and the ability to store an unlimited amount of data (limited only by the drive space you have available)..

    Taking these into consideration, it also really depends who you are and what you NEED. If you're only interested in a few instruments and not high volume, then most DBs (e.g. MS SQL) should meet your needs. But if you're serious about recording, and about carrying out real-time in-memory vs. recorded time-series analysis, you need something more advanced. I highly recommend kdb+ for this area of work, built on the k language.. very sexy.. kx.com

    I was actually reading the following site before I visited the forum (it looks very interesting):
    http://www.xenomorph.com/products/systems/xdb/

    Some other related products you may like to look at:

    http://www.sungard.com/products_and_services/sdms/fame_timeseries/

    http://www.mightyware.com/commodity_server.htm
     
    #25     Dec 30, 2006
  6. nitro

    I have enjoyed reading this thread. Let me give my $0.02:

    IMO the key to databasing isn't the storing or retrieving of the data. It is the interfaces, and the implementation of those interfaces, to the data for later analysis/data mining.

    The key to me is being able to operate on many time series in parallel. In other words, it is the data mining of time series that is hard and computationally expensive (not O(n) or some other linear measure). Storage/retrieval is [some] linear function of the number of time series and is not where the hard problem of real-time time-series computation/analysis lies.

    For example (a dumb one, but it illustrates the problem), here is a query I want to perform in 2 seconds on 500 symbols: using tick data, give me all the stocks that are x% above their 20-day moving average. You see, the retrieval of the data from disk to memory probably takes well over two seconds. But is that what anyone really cares about? No, it is the computation that happens to the data once it is in memory. If you try to run a stored procedure to do this, it will take "forever". What about more complex queries, like the correlation of 500 symbols to each other?
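
    For instance, once daily closes for the 500 symbols are already resident in memory, the screen itself is trivial; a rough sketch (names made up, and assuming the tick data has already been rolled up to one close per day per symbol):

        #include <numeric>
        #include <string>
        #include <vector>

        struct DailySeries {
            std::string symbol;
            std::vector<double> closes;   // oldest .. newest
        };

        // Symbols whose latest price is more than pct percent above their
        // 20-day moving average.
        std::vector<std::string> screen(const std::vector<DailySeries>& universe, double pct) {
            std::vector<std::string> hits;
            for (const auto& s : universe) {
                if (s.closes.size() < 20) continue;
                double ma20 = std::accumulate(s.closes.end() - 20, s.closes.end(), 0.0) / 20.0;
                if (s.closes.back() > ma20 * (1.0 + pct / 100.0)) hits.push_back(s.symbol);
            }
            return hits;
        }

    The simple screen is cheap once the data is in memory; it is the heavier queries, like correlating all 500 symbols against each other, where the choice of in-memory strategy makes the orders-of-magnitude difference.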

    This is the real problem we want to solve, IMO, not the storage and retrieval of data, which is a function of bus speeds, disk speeds, cache sizes, etc. Going from one database strategy discussed in this thread to the next will probably give, say, 1x to 4x speedups. But once the data is in memory (outside the domain of database storage and retrieval and in the domain of our analysis), that is where using the right strategy can make one or two orders of magnitude of difference. IMO this is where the gold lies...

    My advice? Choose a good storage layer like HDF5 (and probably run it on a Linux kernel > 2.6), and worry about speeding up the data mining, which is 90% certain not to happen inside the database for real-time queries over many, many symbols...

    I remember seeing an interview with D.E. Shaw on Bloomberg. They asked him if there were any edges left. He said that he looked for the edges that are very "expensive" for others to find...

    nitro
     
    #26     Dec 30, 2006
  7. A couple of points I'd like to make on this, and some disagreements.

    Firstly, the example you gave was very, very bad; I didn't get the point you were trying to make.. anyone interested in carrying out that analysis does not demand, or even strictly wish, for it to be performed in under 2 seconds. And EVEN IF IT IS CRITICAL (so critical that it is demanded in less than 2 seconds), then the obvious answer is to add a column to the time-series table logging the 20-day MA, in which case it can be accessed instantly.. still a bad example..
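
    For example, a rough sketch of keeping that column up to date incrementally as each day's record is written (the names here are just illustrative):

        #include <deque>

        // Rolling 20-day moving average, updated once per new daily record.
        class Rolling20DayMA {
        public:
            // Feed the day's close; returns the current average of the last
            // 20 closes (or of however many are available so far).
            double update(double close) {
                window_.push_back(close);
                sum_ += close;
                if (window_.size() > 20) { sum_ -= window_.front(); window_.pop_front(); }
                return sum_ / window_.size();
            }
        private:
            std::deque<double> window_;
            double sum_ = 0.0;
        };
        // When inserting the day's row, also write update(close) into the ma20 column.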

    Correlation between 500 symbols is also an arbitrary example, but nothing extraordinary.. and not impossible.. it can be done very quickly using a good database, but it is more relevant for portfolio managers, who certainly don't make decisions in less than 2 seconds..

    I also totally disagree that the storage method etc. is not an issue.. the management of time-series data, or dealing with high-frequency quotes, can be very complex.. e.g. how do we even sample it, for a start? Per unit of time or per tick move? The bid, the offer, or the mid? Do we filter out busted trades (prints way off the market - a serious issue)? And someone already mentioned corporate actions! Even the retrieval of the data can be difficult.. when you work with a time series in a forecasting model, for example, retrieving the right t-1 record is critical..

    In short, it really depends how you want to use the data.. if there are critical functions or measures, you may actually include them in the table as the data is recorded - so you can access them instantly.. the most important thing is definitely being able to work with in-memory and in-database data at the same time.. kdb+tick is a very good example of a time-series analyst's dream:
    http://www.kx.com/products/kdb+tick.php

    As far as plain storage goes, LMT also provides a good solution with useful filters:
    http://www.lmt-expo.com/lmt/products/TICKBASE.asp
     
    #27     Dec 30, 2006
  8. I love the Kx and XDB and similar suggestions. I would love them even more if those solutions were accessible to mere mortals.

    For the rest of us there's MonetDB:

    http://monetdb.cwi.nl/
     
    #28     Dec 30, 2006
  9. hahahaha...

    JUST IMAGINE your software programmed in k, linked to kdb+, with ksql queries!!

    P.S. I hear they give out free k interpreters, and if you contact them with nice enquiries for personal use they distribute free packages.. but be aware, you need a serious multi-CPU system even to set up something like kdb+tick..
     
    #29     Dec 30, 2006
  10. nitro

    #30     Jan 6, 2007