Adding a database...

Discussion in 'Data Sets and Feeds' started by TheGoonior, Nov 20, 2009.

  1. nitro

    The problem I have with SQL Server is that, AFAIK, it does not allow for custom engines. I have no choice but to use MySQL for some of my work because it requires one of these custom engines.

    I wish MSFT would do two things with SQL Server:

    1) Allow third-party storage engines
    2) Include a column-oriented store.
     
    #11     Nov 21, 2009
  2. rosy2

    I use MySQL for EOD data. For tick data I use HDF5 with PyTables, but am looking at kdb.
     
    #12     Nov 22, 2009
  3. ....I currently let the data be dumped into a text file per exchange per hour. Need to think about storage soonish.

    My idea is to go with blobs. Storing market updates (ticks) row by row in a table is wasteful like hell. Timestamp, order within the timestamp, all fields very long - ridiculous wastage.

    My idea is:
    * One table.
    * One entry PER INSTRUMENT PER TIMESPAN.
    The timespan has to be determined. I would possibly not go further than an hour, and may go as low as 15 minutes (which keeps the ES off-hours in one complete entry) - that allows one to store hourly / 15-minute totals in the table and derive longer durations from there.

    Tick data can be stored extremely efficiently if one uses a simple "store as ticks, not floats" and "store deltas only" approach. I can get a trade down to about two bytes if all is right (same timestamp, same volume and price as the last trade - that is not THAT rare), 3 bytes if the trade comes "fast after the last", and 4 bytes if the volume is a little different. Most bid/ask updates run into 4 bytes. Extraction is also extremely fast with this approach. The only problem is if one asks for "the last 3 minutes" (the whole blob must be read and the excess discarded), but seriously, how often does that happen? Mostly charts are created X hours/days back, and there the approach works well.
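
    Roughly, the encoder behind this could look like the following sketch (one blob per instrument per timespan, as in the table layout above). The class and flag layout are made up for illustration - not my actual format - but they show the "integer ticks plus deltas plus varints" idea:

        using System;
        using System.IO;

        public sealed class DeltaTickWriter
        {
            private readonly BinaryWriter _out;
            private readonly double _tickSize;
            private long _lastTimestamp;   // e.g. microseconds since blob start
            private long _lastPriceTicks;  // price expressed in integer ticks
            private int  _lastVolume;

            public DeltaTickWriter(Stream stream, double tickSize)
            {
                _out = new BinaryWriter(stream);
                _tickSize = tickSize;
            }

            public void WriteTrade(long timestamp, double price, int volume)
            {
                // Convert the float price to integer ticks, then take deltas.
                long priceTicks = (long)Math.Round(price / _tickSize);
                long dt = timestamp - _lastTimestamp;
                long dp = priceTicks - _lastPriceTicks;

                // One flag byte says which fields changed since the last trade;
                // a trade identical to the previous one costs just this byte.
                byte flags = 0;
                if (dt != 0) flags |= 1;
                if (dp != 0) flags |= 2;
                if (volume != _lastVolume) flags |= 4;
                _out.Write(flags);

                // Only the changed fields are written, as variable-length ints.
                if ((flags & 1) != 0) WriteVarInt(dt);
                if ((flags & 2) != 0) WriteVarInt(dp);
                if ((flags & 4) != 0) WriteVarInt(volume);

                _lastTimestamp = timestamp;
                _lastPriceTicks = priceTicks;
                _lastVolume = volume;
            }

            // Base-128 varint with zig-zag, so small +/- deltas stay at one byte.
            private void WriteVarInt(long value)
            {
                ulong v = (ulong)((value << 1) ^ (value >> 63));
                while (v >= 0x80) { _out.Write((byte)(v | 0x80)); v >>= 7; }
                _out.Write((byte)v);
            }
        }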

    With this approach (15 minutes) there is also a good amount of data in every blob. This reduces the reading overhead significantly - as well as disk storage (SQL Server stores blobs in 8 KB pages, which is wasteful if the data is just some hundred bytes). During the night etc. one may even safely go up to about an hour or more and still keep full data.

    Oh, note - I don't store only trades, I store everything (bid, ask, best bid and ask). The text files I currently use are about 25 MB/hour for just 4 instruments ;)

    I considered the approach of storing ticks as table rows, but it is wasteful. I get data timestamped in microseconds (millionths of a second) from Zen-Fire, and I often get a lot of data per microsecond, so I need another field for the order within the timestamp. Then SQL Server has no efficient way to store delta prices, which means a lot of wasted space and wasted network bandwidth when retrieving - all eating up time like mad.
     
    #13     Nov 22, 2009
  4. Why be frugal when storage is cheap?

    Updating and appending records to an in-memory SQL DB takes microseconds. Archive the in-memory DB data to a server-based SQL DB every minute. The performance hit is minimal and your session archive is complete.
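
    A minimal sketch of that pattern, assuming SQLite as the in-memory engine (the post does not name one, and the schema here is made up for illustration):

        using System;
        using System.Threading;
        using Microsoft.Data.Sqlite;

        public sealed class TickArchiver : IDisposable
        {
            private readonly SqliteConnection _mem =
                new SqliteConnection("Data Source=:memory:");
            private readonly Timer _flushTimer;

            public TickArchiver()
            {
                _mem.Open();
                using (var cmd = _mem.CreateCommand())
                {
                    cmd.CommandText =
                        "CREATE TABLE ticks (ts INTEGER, symbol TEXT, price REAL, size INTEGER)";
                    cmd.ExecuteNonQuery();
                }
                // Copy the buffer to the server-based archive once per minute.
                _flushTimer = new Timer(_ => FlushToServer(), null,
                                        TimeSpan.FromMinutes(1), TimeSpan.FromMinutes(1));
            }

            // Hot path: an INSERT into the in-memory table, microseconds per call.
            public void Append(long ts, string symbol, double price, int size)
            {
                using (var cmd = _mem.CreateCommand())
                {
                    cmd.CommandText = "INSERT INTO ticks VALUES ($ts, $sym, $px, $sz)";
                    cmd.Parameters.AddWithValue("$ts", ts);
                    cmd.Parameters.AddWithValue("$sym", symbol);
                    cmd.Parameters.AddWithValue("$px", price);
                    cmd.Parameters.AddWithValue("$sz", size);
                    cmd.ExecuteNonQuery();
                }
            }

            private void FlushToServer()
            {
                // Read everything buffered so far, bulk-insert it into the server
                // DB (e.g. SqlBulkCopy for SQL Server), then DELETE FROM ticks.
                // Omitted here - the point is that the hot path never touches disk.
            }

            public void Dispose() { _flushTimer.Dispose(); _mem.Dispose(); }
        }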
     
    #14     Nov 22, 2009
    Storage is not exactly that cheap if you need a lot of it AND you need high-performance IO. The problem I see, and have experienced, is that it simply takes a lot of time to load some days of tick data at those dimensions. Ignore the fact that it is a load on the IO subsystem - unless I one day get an InfiniBand connection in place between the SQL Server and the analysis systems, it simply takes a lot of time just to TRANSFER the data over a 1 Gbit link ;)
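
    To put rough numbers on that (an estimate based on my own figures above, ~25 MB/hour for 4 instruments): a 23-hour futures session is already around 575 MB, and a few days across more instruments quickly reaches many gigabytes. A 1 Gbit link moves at best ~110-120 MB/s, so a multi-gigabyte load means tens of seconds of pure transfer time before the analysis code sees a single tick.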

    On top of that, I plan to keep my data for a long time. Storage space is not that cheap anymore when you talk SAN ;) Sure, it is if you go with 1 TB+ consumer discs, but then they are sort of slow ;)
     
    #15     Nov 22, 2009
    Good stuff, guys, thanks.
    I don't think I need anything super fancy, but from what I've seen, the code required to implement the database is pretty straightforward, and the data binding looks like it would take away a bunch of the back-end grunt work.
    Thanks again.
     
    #16     Nov 22, 2009
    Just be careful with databinding. In pre-WPF times the mechanism sucked majorly ;) It is only with WPF that it got flexible enough to be of proper use.

    Even then, though, databinding may eat up your performance majorly. Especially bid/ask/trade info may come in a LOT - a thousand and more updates per second. With databinding these are normally ALL painted, which will eat up your CPU.

    You have to get "out of the box" here, with a proper business object / view model that stores the data being bound and throws databinding events only X times per second at most.

    Instead of:
    * throw an event on every value change
    you basically:
    * set a boolean marker on value change
    * X times per second, check the marker, fire the binding event, reset the marker (sketch below).
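
    In code, the pattern is roughly the following (a simplified sketch - class and member names are made up for illustration). The feed thread only writes fields and sets the flag; a DispatcherTimer on the UI thread raises PropertyChanged at a fixed rate, so WPF repaints at most ~10 times per second no matter how fast quotes arrive:

        using System;
        using System.ComponentModel;
        using System.Windows.Threading;

        public sealed class QuoteViewModel : INotifyPropertyChanged
        {
            private volatile bool _dirty;    // the boolean marker
            private double _lastPrice;

            public event PropertyChangedEventHandler PropertyChanged;
            public double LastPrice => _lastPrice;

            public QuoteViewModel()
            {
                var timer = new DispatcherTimer
                {
                    Interval = TimeSpan.FromMilliseconds(100) // max ~10 binding events/s
                };
                timer.Tick += (s, e) =>
                {
                    if (!_dirty) return;
                    _dirty = false;          // reset the marker...
                    PropertyChanged?.Invoke(this,
                        new PropertyChangedEventArgs(nameof(LastPrice))); // ...and fire once
                };
                timer.Start();
            }

            // Called from the feed thread on every tick - no UI work here, just a flag.
            public void OnTrade(double price)
            {
                _lastPrice = price;
                _dirty = true;
            }
        }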

    This way you scale up a lot better. And yes, it really is that bad. The amount of data moving through the system with a proper datafeed (Zen-Fire/Rithmic/NxCore etc., NOT IB, which only snapshots x times per second) is tremendous in fast markets.
     
    #17     Nov 23, 2009
    Currently I'm using the IB datafeed, but I would like to build in provision to switch at a later date, just to keep things flexible. For a "proper" data feed, is the amount of data for your standard futures contracts going to kill you, or is it just the active stocks (QQQQ, etc.)? I'm not familiar with "real" data feeds, as I've always just used the standard IB feed. But I don't trade during the standard news events (NFP, FOMC, etc.), although obviously I can't predict the unknown events that occur from time to time. Is the hardcore feed you mention a necessity for everybody, or just for the microsecond scalpers?
    Perhaps a good compromise would be to databind the actual trades (i.e. last) only, since bids/asks/sizes jump all over (as you mentioned).
    Do you typically log all events (i.e. to allow you to replay the day in simulation) or just trades (for charting purposes)?
     
    #18     Nov 23, 2009
    CME-certified systems are required to process 70 transactions per second.

    Sales and tick data are published every 14 ms = approx. 70 TPS. NYSE recently upgraded their systems, going from 25 ms to 5 ms, or approx. 200 TPS.

    IB publishes 6 updates per second.

    Even though the numbers above are per instrument, the feeds are efficient: if you subscribe to 100 CME symbols, you should still receive at most 70 batched updates per second.
     
    #19     Nov 23, 2009
    Well, the problem is mostly that you get a lot of changes in bid and ask - a lot more than with trades only. IB works around that by delivering only snapshots.

    Whether or not that is necessary is not a matter of scalping ONLY. I have some strategies here that generically build on tick charts... so I'd rather have ticks ;)

    What you generally have to do is reduce the feed to something similar to what IB delivers. At least once you hit the user interface, there is no need to show every update - the user cannot see it anyway. Delivering 10-30 updates per second is absolutely enough. This is essentially what IB does.

    My approach to this is already explained - basically do not databind automatically, but use manual triggers. Using a proper MVVM pattern, the VM (ViewModel) can manage its change events in such a way that they don't trigger on every update... with the result that you don't lose too much performance.

    The problem is not one of a "normal" market. Even the ES does not play hell ALL of the time. It does so sometimes, though - fast markets. Then... things happen. Ninja starts lagging seriously and on some machines just freezes; my own system just lags like hell. Market opening is another such time - lots of trades execute in the first minutes, with tremendous changes to bid, ask and volume.

    I saw my own visualization jump from "ok" to "10 seconds behind" within basically 10 seconds, then recover over some seconds as the trades slowed down and the graphics caught up. CPU usage was horrendous, using up multiple cores (my own UI uses one thread per window, so multiple windows can use multiple cores). Repeating this with a profiler showed the culprit was indeed in the rendering routines - though, given WPF 4.0 and bitmap caching, my TimeAndSales window held up a LOT better than Ninja did. Throttling the output events as described totally eliminates the load with no visual change - the human eye cannot really follow a fast market anyway. Next week I will work on doing the same for the running numbers in the time and sales ;)
     
    #20     Nov 24, 2009