Why use a database?

Discussion in 'Data Sets and Feeds' started by onelot, Oct 9, 2004.

  1. More stats:

    Splits: 31996
    Dividends: 167156
    Symbol Changes: 102489
    Trading Halts: 67408
    Historic Symbols: 102489
     
    #121     Dec 9, 2012
  2. hft_boy

    With all due respect -- what is the point of having a database and being able to locate random ticks quickly? First, 10 ms is not fast -- that's only 100 ticks a second. Second, most backtesting is done over sequential ticks, so you don't really need fast random access; you need fast sequential access.

    Using flat binary files is blazing fast on modern computers. I can 'parse' (i.e. not actually do any computations with the data) 7 GB in about 7 seconds on a 2 GHz CPU. That's about 30 ns a message. So why use databases? They're bulky and slow, and unless you can give me a good reason, really not the right type of tool for most trading applications.
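
    For scale, a bare-bones version of that kind of sequential pass might look like this; the Tick layout and file name are made up for illustration, not any particular vendor's format:

    #include <cstdio>
    #include <cstdint>
    #include <vector>

    // Hypothetical fixed-width tick record; real layouts vary by vendor.
    #pragma pack(push, 1)
    struct Tick {
        int64_t ts_ns;        // exchange timestamp, ns since epoch
        int32_t price_ticks;  // price as integer ticks
        int32_t size;         // trade/quote size
    };
    #pragma pack(pop)

    int main() {
        std::FILE* f = std::fopen("ticks.bin", "rb");
        if (!f) return 1;
        std::vector<Tick> buf(1 << 16);  // read in big chunks, no per-tick work
        uint64_t count = 0;
        size_t n;
        while ((n = std::fread(buf.data(), sizeof(Tick), buf.size(), f)) > 0)
            count += n;                  // 'parse' only, as above: just count
        std::fclose(f);
        std::printf("%llu ticks\n", (unsigned long long)count);
        return 0;
    }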
     
    #122     Dec 10, 2012
  3. dreturns

    I would assume some trading systems can get a bit more complex; then factors other than disk access speed become important. I use Oracle for several reasons.

    Data cache: once data is in memory, the database can be tuned to keep needed data there without having to build custom binaries.

    Read consistency: multiple threads or jobs can access and change model state data without worrying about deadlocks and missed updates.

    PL/SQL: to me, a much easier high-level access language than any other. It's slower than some, but the functionality and capability of a full-blown Oracle DB, with its functions, schemas, and memory structures, is powerful.

    Client access: data is shared across various applications and jobs without custom access code for every client.

    Advanced Queues: first-in, first-out queues, ideal for dumping in large amounts of inserts so the client process does not get stuck waiting on processing (the sketch below shows the decoupling idea). AQs are also good for having multiple threads access and process the same data, and for replicating across environments.

    High availability: simply, very few other solutions offer as many HA options. If you're building something on your desktop and it has a system-board error or disk failure in the middle of the trading day, now what? Go to last night's Carbonite backup or USB drive and start over? What did you just miss? Can you at least recapture the data feed for historical reasons, even if you can't trade because you're sweating bricks rebuilding? I'm not talking about large data-center applications; this is here at my house.

    So no, a DB is not for a laptop-based C++ file parser, but that doesn't mean a DB is not the right tool for trading applications.
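
    (Not Oracle AQ itself, but the decoupling it buys can be sketched with a plain in-process queue; everything below is illustrative and not Oracle's API.)

    #include <condition_variable>
    #include <cstdio>
    #include <mutex>
    #include <queue>
    #include <string>
    #include <thread>

    // Toy stand-in for AQ-style decoupling: the producer dumps inserts and
    // moves on; a worker drains the queue at its own pace.
    class InsertQueue {
        std::queue<std::string> q_;
        std::mutex m_;
        std::condition_variable cv_;
    public:
        void enqueue(std::string row) {
            { std::lock_guard<std::mutex> lk(m_); q_.push(std::move(row)); }
            cv_.notify_one();            // producer never waits on processing
        }
        std::string dequeue() {
            std::unique_lock<std::mutex> lk(m_);
            cv_.wait(lk, [this] { return !q_.empty(); });
            std::string row = std::move(q_.front());
            q_.pop();
            return row;
        }
    };

    int main() {
        InsertQueue q;
        std::thread worker([&q] {
            for (int i = 0; i < 3; ++i)  // pretend to apply each insert
                std::printf("applied: %s\n", q.dequeue().c_str());
        });
        q.enqueue("tick 1");
        q.enqueue("tick 2");
        q.enqueue("tick 3");
        worker.join();
        return 0;
    }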
     
    #123     Dec 10, 2012
  4. Locate and load streams of consolidated records for any instrument, starting at any point in time, in 10 ms.

    Our data structures reduce the record count to an average of 3% of the FIX message count. Queries are optimized to return result sets of just the actionable events you're interested in: bid/ask changes, correlation triggers, etc.

    If you currently have a message handler capable of processing 5 million messages/second, you can filter out the fluff, skip to any point in time, and feed your handler a stream of actionable events (see the sketch below). Using the same backtest code/platform, your analysis should complete in 3% of the time just from the reduction in data.

    Obviously everyone's use case is different, but these structures were built out to serve our trading requirements: mostly pairs and cross-exchange arbs, plus auditing executions.
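
    None of the below is their actual schema, but the "skip to a point in time, then stream actionable events into your existing handler" pattern could look roughly like this over timestamp-sorted consolidated records (Event layout and names are assumptions):

    #include <algorithm>
    #include <cstdint>
    #include <cstdio>
    #include <functional>
    #include <vector>

    // Assumed consolidated record: one actionable event (bid/ask change etc.).
    struct Event {
        int64_t ts_ns;     // event timestamp
        double  bid, ask;  // consolidated top of book
    };

    // Binary-search a timestamp-sorted vector (stand-in for the on-disk
    // index), then replay everything from that instant into the handler.
    void replay_from(const std::vector<Event>& events, int64_t start_ns,
                     const std::function<void(const Event&)>& handler) {
        auto it = std::lower_bound(
            events.begin(), events.end(), start_ns,
            [](const Event& e, int64_t t) { return e.ts_ns < t; });
        for (; it != events.end(); ++it)
            handler(*it);
    }

    int main() {
        std::vector<Event> events = {
            {1000, 99.9, 100.1}, {2000, 100.0, 100.2}, {3000, 100.1, 100.3}};
        replay_from(events, 2000, [](const Event& e) {
            std::printf("%lld bid=%.2f ask=%.2f\n",
                        (long long)e.ts_ns, e.bid, e.ask);
        });
        return 0;
    }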

     
    #124     Dec 10, 2012
  5. 25 ms? So I'm guessing this was done using Nanex as the data source? They're the only ones I know of who use that arbitrary timeslice. If you're not using them, I'm curious why you chose it.

    So what do you record on the trigger, then? Midpoint? If so, most illiquid instruments quote garbage a lot of the time (around the open especially, and all the time if they're really illiquid), so mids/quotes are near useless. How do you get around this?

    Anyway, kudos. I'm sure this took a bit of work.
     
    #125     Dec 10, 2012
  6. hft_boy

    I see. Well I take it back. That is ****ing fantastic. I guess our 'use cases' are totally different :).
     
    #126     Dec 10, 2012
  7. Our feed handlers are synced using Meinberg NTP servers, which also sync with CME NTP servers. Our clocks are synced and adjusted every minute, but we still experience milliseconds of clock drift.

    25 ms is the best interval we can reliably coordinate and sync time between machines at different colos. There are also a few ms of order-matching lag at the exchange: 3-8 ms on Globex.

    Our objective was an accurate consolidated representation of market trading conditions, specifically ask/bid liquidity and when an instrument spreads out. We always have the option to rerun the message streams, but it's generally too slow.

    We database 3 streams of pricing data for each instrument at 25 ms slices, along with supplemental calculated info:
    Ask, AskHigh, AskLow, AskSize
    Bid, BidHigh, BidLow, BidSize
    Trade, TradeHigh, TradeLow, Volume

    This resolution makes little difference at minute intervals, apart from illiquid instruments perhaps reporting more bars that reflect quotes rather than trades.

    The real power is at the second and sub-second level, e.g. at 10:31:17.275, does the ask hit 127.32 before the bid hits 126.92? Or where is the price at 3:55 pm?

    We developed a data structure that contains the forward path and back path of price movement inside the 10:31:17.275 record (a sketch of the idea follows below).

    Consequently the answer can be derived instantly, without any further queries or message processing.
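
    (A plausible shape for such a 25 ms slice record, with field names mirroring the list above; the path encoding is the vendor's own design, so the representation below is only a guess.)

    #include <cstdint>
    #include <vector>

    // One 25 ms consolidated slice per instrument, mirroring the three
    // streams listed above. The path encoding is guessed at here as the
    // ordered intra-slice price touches.
    struct Slice25ms {
        int64_t slice_start_ns;               // e.g. 10:31:17.275

        double  ask, ask_high, ask_low;       // ask stream
        int32_t ask_size;
        double  bid, bid_high, bid_low;       // bid stream
        int32_t bid_size;
        double  trade, trade_high, trade_low; // trade stream
        int64_t volume;

        // Intra-slice price movement stored inline, so "does the ask hit
        // 127.32 before the bid hits 126.92?" is a scan over a few doubles
        // rather than another query or a message replay.
        std::vector<double> forward_path;     // touches, first to last
        std::vector<double> back_path;        // touches, walked from the end
    };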
     
    #127     Dec 10, 2012
  8. Custom flat files are 10 to 100 times faster than databases...
    And can be serialized in RAM for even greater speed (see the sketch below).

    You need a very high level of complexity in your data analysis...
    To justify using SQL databases...
    Which probably means you are overfitting.

    Keep it simple... it's all about execution anyway.
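
    ("Serialized in RAM" presumably means something like memory-mapping the file so reads come straight from the page cache; a minimal POSIX sketch, file name assumed.)

    #include <fcntl.h>
    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <unistd.h>
    #include <cstdio>

    int main() {
        int fd = open("ticks.bin", O_RDONLY);
        if (fd < 0) return 1;
        struct stat st;
        if (fstat(fd, &st) != 0) return 1;
        // Map the whole file; pages land in RAM on first touch and stay
        // in the page cache for later passes.
        void* p = mmap(nullptr, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
        if (p == MAP_FAILED) return 1;
        const char* data = static_cast<const char*>(p);
        long long sum = 0;
        for (off_t i = 0; i < st.st_size; ++i)  // touch every byte
            sum += data[i];
        std::printf("checksum %lld\n", sum);
        munmap(p, st.st_size);
        close(fd);
        return 0;
    }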
     
    #128     Dec 10, 2012
  9. Gizzz

    Agreed, I have had good experience with binary files on SSD; I use them like arrays (in the snippet below, f is an open FILE*, x and y index the bar, and a and b are strides in doubles):


    double b1, b2;
    fseek(f, (a * x + b * y + 0) * sizeof(double), SEEK_SET);  // byte offset = element index * sizeof(double)
    fread(&b1, sizeof(double), 1, f);
    fseek(f, (a * x + b * y + 1) * sizeof(double), SEEK_SET);  // next field of the same bar
    fread(&b2, sizeof(double), 1, f);


    and hey, there is my bar
     
    #129     Dec 10, 2012
  10. +1 (still good after all these years!)
     
    #130     Dec 12, 2012