Time series DB?

Discussion in 'Programming' started by sle, Dec 24, 2017.

  1. i960

    HDF5 is basically a portable B-tree format that's optimized for interchangeability. If you don't need to send the data to other people but want the benefits of a B-tree, you should use something like BerkeleyDB, which will outperform HDF5 and also automatically maintain its B-tree space.

    There's a lot of cargo cult behavior in the Python space, where devs will use a given library not because it's the best but because it's what everyone else is using and/or they already have an existing library for it. Heck, you have people storing data in MySQL and then dumping to HDF5 purely to be able to interface with Python code that handles HDF5. That is dumb.

    Choose a solution that is not programming language specific if you want it to scale and/or last from one implementation to another.
     
    #31     Dec 26, 2017
  2. sle

    “Cargo cult” - have not heard this one since college :)

    Valid points.
    - Well, if HDF5 is a standard, why would I not want to sacrifice a little performance for flexibility? There is an active code base to use it from Python, R and C++, and, if needed, I can find people (e.g. consultants) that have experience with it.
    - I am, obviously, trying to find a solution that would require the minimum amount of work. I even, for a moment, considered simply adding hardware instead of fixing the underlying problem.
     
    #32     Dec 26, 2017
  3. i960

    Also, consider overhead per row. You really need to take this into account to keep your storage requirements in check. I'm talking about things like this:

    https://dev.mysql.com/doc/refman/5.5/en/innodb-physical-record.html
    https://docs.oracle.com/cd/E17276_01/html/programmer_reference/am_misc_diskspace.html
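
    To make the per-row overhead point concrete, here is a minimal back-of-the-envelope sketch. The row layout, the 20-byte overhead figure and the row count are all hypothetical placeholders; the real overhead depends on the storage engine and should be taken from documentation such as the pages linked above.

```python
def estimate_table_bytes(rows, payload_bytes, overhead_bytes):
    """Total bytes if every row carries a fixed storage overhead."""
    return rows * (payload_bytes + overhead_bytes)


# Hypothetical tick row: 8-byte timestamp + 8-byte price + 4-byte size.
PAYLOAD = 8 + 8 + 4
# Hypothetical per-row overhead (row header, pointers, etc.); not a real engine figure.
OVERHEAD = 20
ROWS = 100_000_000

raw = estimate_table_bytes(ROWS, PAYLOAD, 0)
stored = estimate_table_bytes(ROWS, PAYLOAD, OVERHEAD)
overhead_share = (stored - raw) / stored

print(f"raw: {raw / 1e9:.1f} GB, stored: {stored / 1e9:.1f} GB, "
      f"overhead share: {overhead_share:.0%}")
```

    With narrow rows like these, the overhead can be as large as the data itself, which is why it matters more for tick data than for wide records.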

    This also might be useful to you when it comes to HDF5:

    http://cyrille.rossant.net/moving-away-hdf5/

    HDF5 is just a file system structure within a flat (binary) file that happens to have APIs written for it. That doesn't necessarily mean it's the optimal solution or the end-all solution just because there's already existing code out there. Honestly, you could probably find more consultants who know how to work with an actual database rather than HDF5. It's just a container format with very coarse-grained locking and basically no optimizations whatsoever for concurrent access (at least for writing) that tries to provide a file system without any of the actual optimizations modern file systems provide.

    Well, you know that isn't going to scale: text files do not offer random access (they have no structured index, unlike B-trees [which basically all modern databases use]) and are not storage efficient unless compressed.

    If you go with a generic SQL or NoSQL database you'll at least have something that isn't hard-tied to a specific language but will be more performance/concurrency oriented than something like HDF5.
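
    As a rough illustration of what an indexed store gives you over a text file, here is a minimal sketch using SQLite through Python's standard `sqlite3` module, purely as a stand-in for "a generic SQL database". The `ticks` table, its columns, the sample rows and the `load_range` helper are hypothetical. The composite primary key gives SQLite a B-tree index on (symbol, ts), so a time-range read for one symbol does not have to scan the whole data set.

```python
import sqlite3

# In-memory database for the example; pass a file path for persistent storage.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE ticks (
        symbol TEXT    NOT NULL,
        ts     INTEGER NOT NULL,
        price  REAL    NOT NULL,
        PRIMARY KEY (symbol, ts)
    )
    """
)

# Hypothetical sample data: (symbol, epoch seconds, price).
rows = [
    ("ABC", 1514246400, 10.0),
    ("ABC", 1514246460, 10.5),
    ("ABC", 1514246520, 10.2),
    ("XYZ", 1514246400, 99.0),
]
conn.executemany("INSERT INTO ticks VALUES (?, ?, ?)", rows)
conn.commit()


def load_range(conn, symbol, start_ts, end_ts):
    """Return (ts, price) rows for one symbol with start_ts <= ts < end_ts."""
    cur = conn.execute(
        "SELECT ts, price FROM ticks "
        "WHERE symbol = ? AND ts >= ? AND ts < ? ORDER BY ts",
        (symbol, start_ts, end_ts),
    )
    return cur.fetchall()


window = load_range(conn, "ABC", 1514246460, 1514246580)
print(window)
```

    The same table can be read from R, C++ or anything else with a SQLite driver, which is the language-independence point made above.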

     
    Last edited: Dec 26, 2017
    #33     Dec 26, 2017
  4. sle

    - If I go with a stand-alone software product, I might as well go with a proper time series database (kdb or maybe try that Arctic thing)?
    - I already have an instance of Mongo storing things like security details, so Arctic makes sense from a continuity-of-support perspective, no?
    - I am sure that I am not aware of all limitations of any product that I’ll pick, so there is no surprise there :(
    - While I understand that it is a bit of a religious topic (shit, what isn’t these days?), I am simply trying to find the right gun to shoot myself in the foot
     
    #34     Dec 26, 2017
  5. i960

    I guess this is the point I was trying to make earlier. There is no such thing as a "proper time series database", only people's ideas of what "proper" is for their needs, plus the typical marketing stuff that shows up as usual. I'm sure you know how this works. A lot of people believe that column stores and time-series databases somehow go hand in hand, but it all depends on how you access the data. For the amount of data you actually have right now I honestly think either would work, but it wouldn't hurt to try out multiple options.

    I'd read this for starters:

    https://www.percona.com/blog/2016/12/14/row-store-and-column-store-databases/
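
    A toy in-memory sketch of why the access pattern decides between row and column layouts. This is not how any particular database stores data; the records and function names are hypothetical. Aggregating one field favors the column layout, while fetching a whole record favors the row layout.

```python
# Row-oriented: all fields of a record are stored together.
row_store = [
    {"ts": 1, "symbol": "ABC", "price": 10.0, "size": 100},
    {"ts": 2, "symbol": "ABC", "price": 10.5, "size": 200},
    {"ts": 3, "symbol": "ABC", "price": 10.2, "size": 150},
]

# Column-oriented: all values of a field are stored together.
column_store = {
    "ts": [1, 2, 3],
    "symbol": ["ABC", "ABC", "ABC"],
    "price": [10.0, 10.5, 10.2],
    "size": [100, 200, 150],
}


def mean_price_rows(rows):
    # Has to visit every record even though only one field is needed.
    return sum(r["price"] for r in rows) / len(rows)


def mean_price_columns(cols):
    # Reads a single column and nothing else.
    prices = cols["price"]
    return sum(prices) / len(prices)


def fetch_record_rows(rows, i):
    # One lookup returns the whole record.
    return rows[i]


def fetch_record_columns(cols, i):
    # Has to gather one value from every column to rebuild the record.
    return {name: values[i] for name, values in cols.items()}


print(mean_price_columns(column_store))
print(fetch_record_columns(column_store, 1))
```

    A backtest that scans one price series over a long window looks like the first pattern; looking up everything known about a single event looks like the second.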

    Be aware too that there are actually 2 different meanings for the term "column store" these days that basically have nothing to do with each other at all:

    http://dbmsmusings.blogspot.com/2010/03/distinguishing-two-major-types-of_29.html

    I'd say so, yep. Not sure how it actually performs, but given that it's apparently been in use at Man AHL it's probably at least decent. In reality it looks like it's mainly dependent on MongoDB anyway. KDB isn't free either. Sure, give it a try - just be aware that it's a Python implementation on top of MongoDB, so it's not really language independent, but it may work for you.

    Well, just go with something - you're probably going to end up throwing it out after you figure out what you really want anyway.
     
    #35     Dec 26, 2017
  6. All this sounds Chinese to me. Forgive my ignorance, but how does a time series DB differ from just stored historical tick data? I'm confused
     
    #36     Dec 26, 2017
  7. temnik

    Arctic is a bunch of, again, Python-specific scaffolding around MongoDB. I knew I was not getting anywhere near what I was used to (kdb), so I decided to go with a more flexible solution.

    InfluxDB is "getting there"... maybe by the time they get from 1.4 to 1.8... So far I'm most upset with how backwards their support for as-of joins is. On the other hand, I'm impressed by how I can almost invisibly switch data providers on the backend, as long as I maintain the same symbology.
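
    For anyone unfamiliar with the term, an as-of join attaches to each row on the left the most recent row on the right whose timestamp is not later than the left one (for example, the prevailing quote at the time of each trade). The following is a plain-Python sketch of that idea, not InfluxDB or kdb syntax; the `asof_join` function and the sample data are hypothetical.

```python
from bisect import bisect_right


def asof_join(left, right):
    """For each (ts, value) in left, attach the latest right value with ts <= left ts.

    Both inputs are lists of (timestamp, value) tuples sorted by timestamp.
    Rows with no earlier match get None.
    """
    right_ts = [ts for ts, _ in right]
    joined = []
    for ts, value in left:
        # Number of right rows whose timestamp is <= ts.
        i = bisect_right(right_ts, ts)
        match = right[i - 1][1] if i > 0 else None
        joined.append((ts, value, match))
    return joined


trades = [(5, "buy 100"), (10, "sell 50"), (20, "buy 10")]
quotes = [(7, 100.0), (10, 100.5), (15, 101.0)]

result = asof_join(trades, quotes)
for row in result:
    print(row)
```

    The binary search keeps each lookup logarithmic in the number of quotes, which is roughly what a time-ordered index buys you inside a database.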

    Been there, done that. I do not ever want to maintain directory trees of HDF5 files, with pytables or without.
     
    #37     Dec 26, 2017
  8. sle

    Out of curiosity, how does the cost compare to kdb+?

    I got a chance to think about it today and discuss it a little. At this stage of the game, we don't have enough dense data to worry about "a proper solution", it's more about slightly more optimized reads/writes for backtests and such. However, at some point the data (and the strategies) will grow (it always does) and I want to be prepared for that moment.

    I suspect your asset universe is much broader then - I am getting away with text files for now, with minor difficulty.
     
    #38     Dec 26, 2017
  9. sle

    #39     Dec 26, 2017
  10. sle

    Well, let me 'splain to you what's goin' on here. I have gotten to the point where the old way of doing things (namely, storing data in text files) is getting a bit creaky. Storing data in a DB allows for more flexibility at the expense of added work for me and my junior. I was not sure which exact route to take, and good people here are explaining to me the ways of the righteous.
     
    #40     Dec 26, 2017