Time series DB?

Discussion in 'Programming' started by sle, Dec 24, 2017.

  1. Okay, maybe my solution is technically a bit ambitious, given that you won't have unlimited capacity/patience for IT on your team. I think the hosted pub/sub + cluster-of-subscribers approach is sexy but might be overkill given the above. It's certainly untested.

    Personally, I can't comment on bcolz, but it does seem interesting. I'll also say that, given you really have a log-replay rather than an ad hoc query requirement, I think the proper abstractions above a file system of binary tick-data files are a viable, low-headache approach. Obviously you'd want to write an API that abstracts away all the headache of text files. I'd say test bcolz though. Worst case, you've probably just made migration to whatever DB you try next easier than it would be with plain text files. If I were in your shoes I'd probably give MariaDB some thought as well.
     
    Last edited: Dec 25, 2017
    #21     Dec 25, 2017
  2. 2rosy

    given your requirement bcolz works. I used it previously due to memory limits with pandas.
     
    #22     Dec 25, 2017
  3. Simples

    To do the data querying, processing and output, how many machines will be needed? I.e., will you now need a distributed event-processing platform? With enough CPU and parallelism, your network becomes your bottleneck, and something like Kafka will be useful for moving/splitting data streams and even doing real-time processing while handling most of the network side for you. Choosing that route will require some initial investment, but it will scale up to what your network can handle, and also provide some flexibility for querying and processing (unlike traditional queues).

    If it's 1-3 boxes for the obvious parts, you could get away with the filesystem, especially if it's mainly used for replay, not complex querying and reordering. On Linux, you can allow lots of open files. Alas, more concurrency will require more and more seeks in between. Still, a simple approach like this can really scale if you manage the complexity yourself, keep it simple, and it's not too much data. For just reading/streaming, in-memory is really overkill, as the network is often the most limiting factor nowadays, not hard drives.
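
    On the "lots of open files" point: a minimal sketch, assuming a Unix-like system, of raising the per-process open-file limit from inside Python with the standard `resource` module. The helper name `raise_open_file_limit` is made up for illustration; system-wide limits are configured elsewhere and are not touched here.

```python
import resource


def raise_open_file_limit():
    """Raise the soft RLIMIT_NOFILE to the hard limit where permitted.

    Returns the (soft, hard) pair in effect after the attempt.
    """
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    # Leave things alone if the hard limit is reported as "infinite",
    # since some platforms reject that value for the soft limit.
    if hard != resource.RLIM_INFINITY and soft != hard:
        try:
            resource.setrlimit(resource.RLIMIT_NOFILE, (hard, hard))
        except (ValueError, OSError):
            pass  # not permitted in this environment; keep the old limit
    return resource.getrlimit(resource.RLIMIT_NOFILE)


before = resource.getrlimit(resource.RLIMIT_NOFILE)
after = raise_open_file_limit()
print("soft open-file limit:", before[0], "->", after[0])
```

    An unprivileged process can only raise its soft limit up to the hard limit, so this just removes the conservative default.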

    Most start with a DB, especially an RDBMS if they can get their hands on one, but it really limits your design from then on, so choose wisely and don't be in a rush to settle. "It can always be added later" is often the best architectural decision you can make now, because you never know what you'll need in the future anyway. If what you need now screams at you, you don't really need to ask, do you?
     
    Last edited: Dec 25, 2017
    #23     Dec 25, 2017
  4. sle

    Probably talking 100GB of data at the very, very most, and that's if I add the other strategies I am planning to add.


    Yup, unfortunately

    It is "novel" in the sense that it's high-effort from a modeling perspective. In any case, when I say "latency-sensitive" I don't mean UHF, I simply mean strategies that hold positions intraday. The run-time engine is written in C++, but I do use Python for testing and research (also, real-time model inputs that are not latency-sensitive are generated in Python too).

    Well, I dump them into a typical date-organized directory tree, so the biggest files are 24 hours for a single symbol. Since I have such a small number of concurrent symbols, I have a separate process handling recording for each symbol. I don't think we ever had any problems with it being too heavy.
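
    A minimal sketch of what a thin replay API over such a date-organized tree could look like. Everything specific here is an assumption for illustration, not the setup described above: the `root/YYYY/MM/DD/SYMBOL.bin` layout, the fixed-width record (int64 nanosecond timestamp, float64 price, int32 size), the symbol `ESH8`, and the helper names `tick_path`, `append_ticks` and `replay`.

```python
import struct
import tempfile
from datetime import date
from pathlib import Path
from typing import Iterable, Iterator, NamedTuple

# Hypothetical record layout: little-endian int64, float64, int32 (20 bytes).
RECORD = struct.Struct("<qdi")


class Tick(NamedTuple):
    ts_ns: int
    price: float
    size: int


def tick_path(root: Path, day: date, symbol: str) -> Path:
    """Map (day, symbol) to root/YYYY/MM/DD/SYMBOL.bin."""
    return root / f"{day:%Y}" / f"{day:%m}" / f"{day:%d}" / f"{symbol}.bin"


def append_ticks(root: Path, day: date, symbol: str, ticks: Iterable[Tick]) -> None:
    """Append ticks to the day file for one symbol, creating directories as needed."""
    path = tick_path(root, day, symbol)
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("ab") as fh:
        for tick in ticks:
            fh.write(RECORD.pack(*tick))


def replay(root: Path, day: date, symbol: str) -> Iterator[Tick]:
    """Yield ticks for one symbol and day in the order they were recorded."""
    with tick_path(root, day, symbol).open("rb") as fh:
        while True:
            chunk = fh.read(RECORD.size)
            if len(chunk) < RECORD.size:
                break  # end of file (or a truncated trailing record)
            yield Tick(*RECORD.unpack(chunk))


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    day = date(2017, 12, 22)
    append_ticks(root, day, "ESH8", [Tick(1, 2680.25, 3), Tick(2, 2680.50, 1)])
    replayed = list(replay(root, day, "ESH8"))
```

    Callers only ever see `replay(root, day, symbol)`, so the storage behind it could later be swapped for bcolz, HDF5 or a database without touching the research code.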

    It seems (I have not read up on it in detail yet) that bcolz has most of its logic written in C/Cython, which should be pretty fast. We are already going to throw some hardware at it :), my minion is supposed to decide whether we want to buy our own hyper-box or find a virtual solution.
     
    Last edited: Dec 25, 2017
    #24     Dec 25, 2017
  5. sle

    Makes sense. The added advantage of figuring out an open source solution is that should I decide to move to another firm (or should it be decided for me, for some reason :() I can quickly re-deploy the same technology stack.

    So far, given how small the team is and how few resources we are willing to spend on maintaining this bit, we have been going with a single box. Hardware is so cheap these days that we can get a machine that fits our entire dataset into memory for under 10k (I think).

    Right. I wonder if that can be added later - at the moment, I just want a simple solution that would be easy to deploy and maintain.

    PS. Someone just pointed out that since all of my data sets fit into memory, there are standard formats that pandas supports that are blindingly fast.
    PPS. Apparently, the feather and HDF5 formats are my best choices if I want to go down the binary-file route. In both cases, no extra support is needed (as they're fully integrated into pandas), plus HDF5 has very good C++ support.
     
    Last edited: Dec 25, 2017
    #25     Dec 25, 2017
  6. temnik

    It's great to finally find a sane thread on this website...

    Personally, I'm using influxdb. It's not ideal - especially compared to something like kdb. But it's good enough for now. You can take a look at http://community.influxdata.com to see what kind of real problems real people have to deal with.

    What I don't like about blosc/bcolz off the bat is that it ties you into the Python ecosystem. Yes, Python is #1 in machine learning right now, but I like to keep my options open.
     
    Last edited: Dec 25, 2017
    #26     Dec 25, 2017
  7. sle

    It's on my radar, but I don't know if I can support yet another stand-alone product when I can get away with a simpler solution. What made you pick this vs., say, any of the other time series databases (arctic looks pretty impressive, I'd say)?

    Yeah, that's for sure - upon some thought, I am leaning towards HDF5 instead of bcolz. It's a common scientific format, supported across pretty much every language and has a pretty broad base in and outside of finance.
     
    #27     Dec 25, 2017
  8. 2rosy

    people tend to go from hdf5 to bcolz (if not kdb). is your data ticks/events or is it already normalized somehow? if it's events, you might want to look at queues (e.g. kafka) for base storage, then something else when analyzing.
     
    #28     Dec 26, 2017
  9. wmli

    I have had great success with PyTables, a Python library for HDF5. I store options tick data and load it into pandas/numpy.
    http://www.pytables.org/usersguide/tutorials.html
     
    #29     Dec 26, 2017
  10. sle

    I store both T/Q ticks (since I don’t play UHF games, no book updates thank God) and resample to 1 second and 1 min. Why would Bcolz be better? Is it faster?
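
    For reference, the 1-second and 1-minute resampling step can be sketched in pandas roughly as follows. The column names `price` and `size`, the sample values, and the helper `resample_trades` are assumptions for illustration, not the poster's actual pipeline; quote ticks would need a different aggregation (e.g. last bid/ask per interval).

```python
import pandas as pd


def resample_trades(trades: pd.DataFrame, rule: str) -> pd.DataFrame:
    """Aggregate trade ticks into OHLC bars plus traded volume."""
    bars = trades["price"].resample(rule).ohlc()
    bars["volume"] = trades["size"].resample(rule).sum()
    # Drop intervals in which no trade occurred.
    return bars.dropna(subset=["open"])


# Made-up trade ticks indexed by timestamp.
ticks = pd.DataFrame(
    {"price": [100.0, 100.5, 99.5, 101.0], "size": [10, 5, 20, 1]},
    index=pd.to_datetime(
        [
            "2017-12-26 09:30:00.100",
            "2017-12-26 09:30:00.900",
            "2017-12-26 09:30:01.200",
            "2017-12-26 09:31:05.000",
        ]
    ),
)

bars_1s = resample_trades(ticks, "1s")
bars_1min = resample_trades(ticks, "1min")
print(bars_1min)
```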
     
    #30     Dec 26, 2017