More stats: Splits: 31996, Dividends: 167156, Symbol Changes: 102489, Trading Halts: 67408, Historic Symbols: 102489
With all due respect -- what is the point of having a database and being able to locate random ticks quickly? First, 10ms per lookup is not fast -- that's only 100 random reads a second. Second, most backtesting is done over sequential ticks, so what you really need is fast sequential access, not fast random access. Flat binary files are blazing fast on modern hardware: I can 'parse' (i.e. walk through without actually doing any computation on the data) 7 GB in about 7 seconds on a 2 GHz CPU. That's roughly 30 ns per message. So why use databases? They're bulky, slow, and unless you can give me a good reason, really not the right tool for most trading applications.
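For what it's worth, here is a minimal C++ sketch of that kind of sequential pass over a flat binary file of fixed-size records. The Tick layout, file name, and buffer size are made-up assumptions for illustration, not anyone's actual format.

// Sequential scan of a flat binary file of fixed-size tick records.
// The Tick layout and "ticks.bin" are illustrative assumptions.
#include <cstdint>
#include <cstdio>
#include <vector>

#pragma pack(push, 1)
struct Tick {                 // hypothetical 32-byte record
    int64_t  exch_time_ns;    // exchange timestamp
    int32_t  symbol_id;
    int32_t  size;
    double   price;
    uint64_t flags;
};
#pragma pack(pop)

int main() {
    std::FILE* f = std::fopen("ticks.bin", "rb");
    if (!f) return 1;

    std::vector<Tick> buf(1 << 16);           // read 64K records at a time
    double last = 0.0;
    std::size_t total = 0;

    for (;;) {
        std::size_t n = std::fread(buf.data(), sizeof(Tick), buf.size(), f);
        if (n == 0) break;
        for (std::size_t i = 0; i < n; ++i)    // "parse": touch each record
            last = buf[i].price;
        total += n;
    }
    std::fclose(f);
    std::printf("read %zu ticks, last price %.4f\n", total, last);
    return 0;
}

Reading in large buffered blocks like this keeps the disk streaming sequentially, which is where the ~1 GB/s figure quoted above comes from.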
I would assume some trading systems can get a bit more complex. Then factors other than disk access speed become important. I use Oracle for several reasons.

Data Cache: Once data is in memory, the database can be tuned to keep the needed data in memory without having to build custom binaries.

Read consistency: Multiple threads or jobs can access and change model state data without worrying about deadlocks and missed updates.

PL/SQL: To me, a much easier, higher-level access language than any other. It's slower than some, but the functionality and capability of a full-blown Oracle DB with its functions, schemas, and memory structures is powerful.

Client access: Sharing data across various applications/jobs without writing custom access code for every client.

Advanced Queues: First-in, first-out queues ideal for dumping in large batches of inserts so the client process does not get stuck waiting on processing. AQs are also good for having multiple threads access and process the same data, and for replicating across environments.

High Availability: Simply put, very few other solutions offer as many HA options. If you're building something on your desktop and it has a system board error or disk failure in the middle of the trading day, now what? Go to last night's Carbonite backup or USB drive and start over? What did you just miss? Can you at least recapture the data feed for historical reasons, even if you can't trade because you're sweating bricks rebuilding? I'm not talking about a large data center application; this is here at my house. So no, a DB is not for a laptop-based C++ file parser, but that doesn't mean a DB is not the right tool for trading applications.
Locate and load streams of consolidated records for any instrument, starting at any point in time, in 10ms. Our data structures reduce the record count to an average of 3% of the underlying FIX messages. Queries are optimized to return result sets of just the actionable events you're interested in -- bid/ask changes, correlation triggers, etc. If you currently have a message handler capable of processing 5 million messages/second, you can filter out the fluff, skip to any point in time, and feed your handler a stream of actionable events. Using the same backtest code/platform, your analysis should complete in roughly 3% of the time just from the reduction in data. Obviously everyone's use case is different, but these structures were built out to serve our trading requirements: mostly pairs and cross-exchange arbs, plus auditing executions.
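The general idea -- pre-filtering a raw stream down to actionable events and handing only those to an existing handler -- can be sketched in a few lines of C++. The Msg type, fields, and filtering rule here are illustrative assumptions, not their actual schema.

// Pre-filter a raw message stream down to "actionable" events
// (here, top-of-book changes) and feed only those to a handler.
#include <cstdint>
#include <cstdio>
#include <functional>
#include <vector>

struct Msg {
    int64_t time_ns;
    double  bid, ask;
    bool    is_quote;        // false => trade or other event we always keep
};

using Handler = std::function<void(const Msg&)>;

// Forward only messages that change the best bid or ask; drop the fluff.
void feed_actionable(const std::vector<Msg>& raw, const Handler& handle) {
    double last_bid = 0.0, last_ask = 0.0;
    for (const Msg& m : raw) {
        if (!m.is_quote) { handle(m); continue; }
        if (m.bid != last_bid || m.ask != last_ask) {
            last_bid = m.bid;
            last_ask = m.ask;
            handle(m);
        }
    }
}

int main() {
    std::vector<Msg> raw = {
        {1, 99.0, 99.5, true}, {2, 99.0, 99.5, true},   // duplicate quote dropped
        {3, 99.1, 99.5, true}, {4, 0.0, 0.0, false},
    };
    std::size_t count = 0;
    feed_actionable(raw, [&](const Msg&) { ++count; });
    std::printf("actionable events: %zu of %zu\n", count, raw.size());
    return 0;
}

The same handler code then runs against a much smaller stream, which is where the claimed 3% runtime comes from.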
25 ms? so, i'm guessing this was done using nanex as the data source? they're the only ones i know of who use that arbitrary timeslice. if you're not using them, i'm curious why you chose it. so, what do you record on the trigger? the midpoint? if so, most illiquid things quote garbage a lot of the time (around the open especially, and all the time if it's really illiquid), so mids/quotes are near useless. how do you get around this? anyway, kudos. i'm sure this took a bit of work.
I see. Well, I take it back. That is ****ing fantastic. I guess our 'use cases' are totally different.
Our feed handlers are synced using Meinberg NTP servers, which also sync with CME's NTP servers. Our clocks are synced and adjusted every minute, but we still experience milliseconds of clock drift. 25ms is the best interval we can reliably coordinate and sync time across machines at different colos. There are also a few ms of order-matching lag at the exchange, 3 - 8 ms on Globex. Our objective was an accurate consolidated representation of market trading conditions -- specifically ask/bid liquidity and when an instrument spreads out. We always have the option to rerun the message streams, but it's generally too slow.

We database 3 streams of pricing data for each instrument at 25ms slices, along with supplemental calculated info:

Ask, AskHigh, AskLow, AskSize
Bid, BidHigh, BidLow, BidSize
Trade, TradeHigh, TradeLow, Volume

This resolution of data makes little difference at minute intervals, with maybe illiquid instruments reporting more bars reflecting market price quotes. The real power is at the second and subsecond level, e.g. at 10:31:17.275 does Ask hit 127.32 before Bid hits 126.92, or where is the price at 3:55pm? We developed a data structure that contains the forward path and back path of price movement inside the 10:31:17.275 record. Consequently the answer can be derived instantly without any further queries or message processing.
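The thread doesn't show the actual layout, but a 25ms slice record carrying those three streams plus an embedded intra-slice price path might look roughly like this C++ sketch. All field names, the fixed-size path array, and the microsecond offsets are assumptions for illustration.

// Sketch of a 25ms consolidated slice record along the lines described above.
#include <cstdint>

constexpr int kMaxPathPoints = 16;   // assumed cap on intra-slice path points

struct PricePathPoint {
    uint16_t offset_us;              // microseconds from the slice start
    double   price;
};

struct Slice25ms {
    int64_t  slice_start_ns;         // e.g. 10:31:17.275 as epoch nanoseconds
    int32_t  instrument_id;

    // Ask stream
    double   ask, ask_high, ask_low;
    int32_t  ask_size;

    // Bid stream
    double   bid, bid_high, bid_low;
    int32_t  bid_size;

    // Trade stream
    double   trade, trade_high, trade_low;
    int64_t  volume;

    // Intra-slice movement, so "did Ask hit X before Bid hit Y?" can be
    // answered from the record itself, without replaying messages.
    uint8_t        n_forward, n_back;
    PricePathPoint forward_path[kMaxPathPoints];
    PricePathPoint back_path[kMaxPathPoints];
};

Keeping the path inside the record is what lets the ordering question be answered from a single read rather than another query.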
Custom flat files are 10 to 100 times faster than databases... and can be serialized in RAM for even greater speed. You need a very high level of complexity in your data analysis... to justify using SQL databases... which probably means you are overfitting. Keep it simple... it's all about execution anyway.
Agreed, I have had good experience with binary files on SSD; I use them like arrays:

seek(a*x + b*y + 0); b1 = read.decimal
seek(a*x + b*y + 1); b2 = read.decimal

and hey, there is my bar.
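A minimal C++ version of that array-style access might look like the sketch below, assuming fixed-size bar records laid out symbol-major and reading the whole record rather than field by field. The Bar layout, file name, and stride are illustrative assumptions.

// Array-style random access into a flat binary file of fixed-size bars,
// roughly the seek(a*x + b*y) idea above.
#include <cstdint>
#include <cstdio>

#pragma pack(push, 1)
struct Bar {
    int64_t time_ns;
    double  open, high, low, close;
    int64_t volume;
};
#pragma pack(pop)

// Read bar y of symbol x from a file holding bars_per_symbol bars per symbol.
bool read_bar(std::FILE* f, long x, long y, long bars_per_symbol, Bar* out) {
    long index = x * bars_per_symbol + y;              // "a*x + b*y", one record per bar
    if (std::fseek(f, index * (long)sizeof(Bar), SEEK_SET) != 0) return false;
    return std::fread(out, sizeof(Bar), 1, f) == 1;
}

int main() {
    std::FILE* f = std::fopen("bars.bin", "rb");
    if (!f) return 1;
    Bar bar{};
    if (read_bar(f, /*x=*/3, /*y=*/120, /*bars_per_symbol=*/390, &bar))
        std::printf("close = %.4f\n", bar.close);
    std::fclose(f);
    return 0;
}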