Tick Database, Now Want to Run SQL

Discussion in 'Data Sets and Feeds' started by bscully27, Jun 28, 2012.

  1. Thanks for all those valuable comments. I think I pretty much decided on the following:

    Strictly splitting business logic between a) backtesting/trading-strategy-related data management needs and b) data supply to run independent analytics, feed other unrelated apps, etc.

    a) I decided to stick with my own custom binary time series data storage for backtesting purposes. All the backtests need is random access to a time series specified by start and end timestamp. I keep each symbol's tick data in a separate file, and despite those being in the gigabyte range I can access the time series at very fast speeds, faster than with any database I have tried so far. Any analytical data output by the strategy (trades, TCA, risk, indicators, statistics) are stored in an in-memory database so that such data can be queried for research purposes after the raw-data iteration in the strategy modules. The query process is manual. So there is a clear segregation between time-series data on one side and more relational data on the other.
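    For context, such a per-symbol binary store can be sketched very simply: fixed-size records sorted by time, with a binary search on the timestamp for random access. This is only an illustration (in Python rather than the poster's C#, with an assumed record layout of int64 epoch-ms timestamp plus float64 price; the actual format is unknown):

    ```python
    import os
    import struct

    # Hypothetical record layout: little-endian int64 epoch-ms timestamp + float64 price
    RECORD = struct.Struct("<qd")

    def write_ticks(path, ticks):
        """Append (timestamp_ms, price) tuples as fixed-size binary records, time-sorted."""
        with open(path, "wb") as f:
            for ts, px in ticks:
                f.write(RECORD.pack(ts, px))

    def read_range(path, start_ms, end_ms):
        """Random access: binary-search the first record >= start_ms, then scan to end_ms."""
        n = os.path.getsize(path) // RECORD.size
        with open(path, "rb") as f:
            lo, hi = 0, n
            while lo < hi:                       # leftmost record with ts >= start_ms
                mid = (lo + hi) // 2
                f.seek(mid * RECORD.size)
                ts, _ = RECORD.unpack(f.read(RECORD.size))
                if ts < start_ms:
                    lo = mid + 1
                else:
                    hi = mid
            f.seek(lo * RECORD.size)
            out = []
            while lo < n:
                ts, px = RECORD.unpack(f.read(RECORD.size))
                if ts > end_ms:
                    break
                out.append((ts, px))
                lo += 1
            return out
    ```

    Because every record is the same size, the seek position is a pure function of the record index, which is what makes gigabyte files cheap to query by time range.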

    b) I will spend more time on this later as I have not fully decided, but currently I err on the side of a relational database. I do not need low-latency access, only to load a couple hundred bars or 1000 ticks per symbol into some other applications or Excel. I'd like to store tick and bar data for about 600-700 symbols in such a database. What database would you recommend for this purpose? I want to access it through Excel and .Net applications. As I need bars in Excel, would you recommend a database that I can query with SQL and aggregate bars from ticks on the fly through an SQL query, or should I store ticks and 1-minute bars in the database and aggregate coarser bars from the 1-minute bars, given I do not need finer-granularity bars? I would really appreciate a specific database recommendation for the following example:

    Data storage specs: 600-700 symbols, tick data (fx, stocks, futures, indexes). Can store 1-minute bars as well. Storing at most about 1 year of history.

    Read specs: No low-latency requirements; want to read tick data between start and end timestamp for a given symbol; want to retrieve bars of any requested compression for a given symbol between start and end timestamp (would it be fast enough to compress tick data to bars on the fly through a query, or will that be very slow?); want to query the average price of a given set of symbols at a specified timestamp.
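    As a concrete illustration of the on-the-fly aggregation question, here is a small sketch using SQLite (table and column names are made up for the example); the same GROUP BY-on-a-time-bucket pattern works in most RDBMSs:

    ```python
    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE ticks (symbol TEXT, ts INTEGER, price REAL)")  # ts = epoch ms
    conn.executemany("INSERT INTO ticks VALUES (?, ?, ?)", [
        ("EURUSD",  1_000, 1.25),
        ("EURUSD", 30_000, 1.26),
        ("EURUSD", 61_000, 1.24),
    ])

    # On-the-fly 1-minute bars: bucket ticks by integer division of the timestamp.
    # (Open/close would additionally need window functions or self-joins, which is
    # where on-the-fly aggregation starts to get expensive.)
    BAR_SQL = """
    SELECT symbol,
           (ts / 60000) * 60000 AS bar_start,
           MIN(price)           AS low,
           MAX(price)           AS high,
           COUNT(*)             AS n_ticks
    FROM ticks
    WHERE symbol = ? AND ts BETWEEN ? AND ?
    GROUP BY symbol, ts / 60000
    ORDER BY bar_start
    """
    bars = conn.execute(BAR_SQL, ("EURUSD", 0, 120_000)).fetchall()
    ```

    Whether this is fast enough over 600-700 symbols and a year of ticks depends mostly on having an index on (symbol, ts); storing precomputed 1-minute bars alongside the ticks keeps the query trivial at the cost of extra storage.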

    Write specs: Need access to the previously closed 1-minute bar at the very least, so need to store either bars every minute (when the symbol in question is currently traded) or ticks in batches every minute.

    Which database can match or exceed those specs? It does not sound extremely demanding to me, though writing tick data in batches every minute across 600-700 symbols could possibly be an issue for an RDBMS? Any ideas or suggestions? I emphasize access through Excel (thus SQL queries, I guess) because it's the lowest common denominator.

    Thanks in advance for any recommendations.

     
    #41     Jul 12, 2012
  2. Makis

    Makis

    I do not have much experience with tick data on relational databases, but since the message rate is not that big and query latency is not critical, I think any respectable RDBMS will do fine. You can look at MySQL in-memory engines and TimesTen, which can work in a hybrid memory/disk mode.
     
    #42     Jul 13, 2012
  3. Thanks. I am currently playing with TimesTen in my spare time and will report back once I get around to running some performance tests.

     
    #43     Jul 14, 2012
  4. DevBrian

    DevBrian

    I'm curious about your implementation. I too have written a "custom binary time series data storage for backtesting purpose". All in C#.

    I did a few optimizations that reduced the storage size of the file, but they hurt the read speed. I'm content with this, because any of the strategies we run can only handle about 200-300 messages per second, just from invoking a few private methods, interfaces, updating collections, etc.

    So we don't receive any benefit if our data stream could deliver higher rates than that. Is this your experience as well?

    - Thanks
     
    #44     Jul 24, 2012
  5. That is too little. I suggest you get rid of most of the floating-point arithmetic and replace it with integers. Then learn how to make fast divisions (hint: multiply and shift instead of division).

    You should be able to pull about 4-5 million ticks per second into your strategies, and I seriously fail to see how that goes down to 200-300 messages per second outside of REALLY crappy programming. 200,000 to 300,000 with a HEAVY strategy, OK, but 200-300? That is FACTORS too low.

    Put a profiler on it and find where 90% or more of your time is being wasted.
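    The multiply-and-shift hint can be sketched like this: precompute a "magic" multiplier for a constant divisor so that each division becomes one multiply plus one shift. This is a round-up magic-number construction in the style of Granlund-Montgomery, shown in Python for clarity (the payoff is in compiled code, where an integer multiply is far cheaper than a divide):

    ```python
    def magic(d, bits=16):
        """Find (m, s) such that (x * m) >> s == x // d for all 0 <= x < 2**bits.

        Round-up method: s = bits + ceil(log2(d)), m = floor(2**s / d) + 1.
        The error term x*(m*d - 2**s)/(d*2**s) stays below 1/d for x < 2**bits,
        so the truncated quotient is exact over that range.
        """
        s = bits + (d - 1).bit_length()   # ceil(log2(d)) extra bits of precision
        m = (1 << s) // d + 1
        return m, s

    def fast_div(x, m, s):
        return (x * m) >> s               # one multiply + one shift, no divide

    m, s = magic(10)
    q = fast_div(12345, m, s)             # == 12345 // 10 == 1234
    ```

    In C# or C you would bake (m, s) in as constants per divisor; the compiler often does exactly this for division by a compile-time constant, so the manual trick matters most for divisors chosen at startup.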
     
    #45     Jul 24, 2012
  6. DevBrian

    DevBrian

    I'm sorry. I did mean 200k-300k, not 200-300.

    I was able to strip things down to where I could pump 5 million messages per second. A few things I did:

    - Minimized the number of private method calls.
    - Removed interface calls.
    - No locks/no multi-threading/no arithmetic
    - No maintaining of any "Last Quote" type collections

    With a strategy that does absolutely nothing, and without the above, it can do 5 million messages a second. If I add the above back, it drops to 200k-300k messages per second.

    I can live without the above. But any strategy will likely reintroduce this level of overhead, and then I'm back to 200k-300k messages per second.
     
    #46     Jul 24, 2012
  7. DevBrian

    DevBrian

    I'm sorry, I meant to say 200k-300k.
     
    #47     Jul 24, 2012
  8. libitum

    libitum

    I read the first post in this thread and only quickly browsed the rest, so sorry if someone already answered this.

    Don't use SQL. It is a dead end. SQL (and the like) is meant for queries where speed is not an issue. You should use an in-memory array for storing all the data you want. Reading from RAM is the standard approach when access speed is critical.
     
    #48     Jul 24, 2012
  9. DevBrian

    DevBrian

    Sorry, I meant to say 200k-300k messages per second.

    ps - if I'm repeating myself, ET isn't showing my posts, so I'm trying again.
     
    #49     Jul 24, 2012
  10. Yes, obviously your weakest link will be your bottleneck. No matter how fast you can stream the data at rate y, if your strategies can only consume at rate x, with x < y, then you are constrained by x.

     
    #50     Jul 25, 2012