Database organization

Discussion in 'App Development' started by cjbuckley4, Oct 18, 2014.

  1. cjbuckley4

    cjbuckley4

    I'm looking for some advice on what the best practices are for storing options and futures data. How do you folks store these instruments? I've been using data that I (quite literally with a USB flash drive) fetch from school, but that's not optimal, so I've decided to build a database. Here's my idea for the design:

    1. Use the IQFeed symbol lookup from their C++ API to get all options for a specific security/contracts on a specific future.
    ....e.g. search for "index/equity options" on "SPY"
    2. From the list that's returned, I check to see if each a table for each result on the list exists in my database. If not, create the table.
    3. Loop over the list and store the data in the right table.

    Concerns:
    1. Relying on IQFeed's symbol lookup to catch every option or future. There's obviously some algorithm that determines when new futures or options are created and what their strikes/expirations/deliveries are. Where can I find these algorithms so that I can code them up and know what is created when instead of relying on the IQFeed symbol lookup?
    2. Is it best to create a table for each option/future? Throw them all together in one table? Throw all of it in one big table? I imagine what I've designed will probably make the queries fastest of those three options, but I'm here to learn so tell me if I'm wrong.
    3. The size of my database. This thing is obviously going to become HUGE. I figure I can't really know how huge until I try or someone tells me. I've been reading about different compression options from my database of choice and believe I'm using the right tools here, but maybe I will still need to trim back the amount of data I'm looking at. To be clear, I'm looking to database TRADES (not quotes) on S&P 500 equities and options, a few futures, and a few ETF/P/Ns. I really have no idea how big this is gonna get, so I'm just gonna try it and see. I'm prepared to invest in 4 or 5 TB hard drives every once in a while, but I don't have the means or the expertise to run a legitimate data server (I don't think I do at least, don't even know what the cost would be...if anyone has any estimates of what it will cost to scale this approach I'm all ears).
    4. Backups. Simply put, I want to make sure all this data I'm working on capturing is safe. I'm prepared to pay a good amount for backup space, but again I worry it could become unmanageable as I scale.
    5. Choice of databases. Currently kdb+ 32bit edition, I love it. I've heard anecdotally that you can address more RAM if you use multiple kdb+ processes? If anyone knows anything about that I'd love to hear. I've used other (SQL and noSQL) databases in the past but this is the fastest I've played with. I know the general consensus on EliteTrader is to store binary files, but I don't really know how to go about querying a bunch of flat files in the most efficient way, so either I learn to do that or stick with kdb+ 32 bit. Again, I'm open to advice there.

    Thanks for your ideas and feedback.
     
  2. 1245

    1245

    I'm not a programmer. But if I were looking to build a data base of option contracts I would look to see what the OCC offers for free and what they offer for a fee.Then I would build it to their specs.

    http://theocc.com/market-data/
     
  3. Baron

    Baron ET Founder

    Right off the top, if your database is as big as you think it's going to get (and subsequent memory requirements), you do realize that the 32-bit free version isn't going to cut it, right? A basic 2-core setup of the 64-bit kdb+ runs $25k a year. That doesn't sound very doable for you since you are in school. And given that you're asking how to set up a table structure, it sounds like to me that you're putting the cart before the horse anyway because you need to learn more about database design before you start thinking about these big scale problems.
     
  4. cjbuckley4

    cjbuckley4

    @1245 I've benefited from a ton of your posts on here in my years of lurking, what's your profile picture all about? I love it. Thanks for the lead on a method of organizing this.

    @Baron Haha yes, the carriage is definitely coming before the horse here, I realize that. As you noted, I'm in school. This is completely an academic pursuit for me, I don't have the capital for leased/direct access brokers, colo, direct feeds, kdb+ 64, or any of the fancy stuff that actual HFT requires. My rationale behind using q/kdb+ is that it is in demand in the industry and that's where I hope to be soon. Also, I just find it cool.

    That said, I know the professional application of kdb+ is generally to load a day+ worth of tick data into memory at a time and use it for realtime analysis. My current backtesting application doesn't do that, so I was operating under the assumption that it would be possible to make it work with the 4GB of memory limitation. If that's wrong let me know now so I can change gears now haha. I didn't want to get too technical with my original post, but my actual plan was to further split my tables into whatever increments ended up making the average table size closest to 1GB (kdb+ documentation recommends have about 3x as much free memory as your dataset available to work with your data).

    Table structure might be a basic question, but how would I know the right way to do this otherwise? I mean one of the reasons I initially thought the smaller table size would be better is because of my memory limitations. If I have to completely forgo kdb+ and use binary files I can do that too, it's just going to take a little more engineering on my part. If I go down the binary file route then I'll probably end up storing one day worth of ticks per symbol per file, just because that seems like the most prudent and low overhead way to organize it. I guess I just create a directory structure for the binaries and write my own C++ code to navigate it based on queries? It seems simple enough to implement, but I just though I'd use kdb+ for the aforementioned reasons. As I said earlier, if kdb+ isn't gonna be scalable, I'd rather know now.

    While I have an experienced trader though, do you have any ballpark idea on how quickly this will grow? Because no matter if I stay with kdb+ or go to something else, the biggest limiting factor here is probably going to be whether or not I can even afford to store this much data.

    There are so many unknowns about this project that I figure there's no way I'll know how best to do it without doing it, so I'm kinda just jumping right in with the knowledge that this may not work and that I'll inevitably hit some snags along the way.
     
    Last edited: Oct 18, 2014
  5. cjbuckley4

    cjbuckley4

  6. 1245

    1245

    cjbuckley4:

    My profile picture is a painting from an artist. His name is Jason Oliva. I knew him when he was a member of the American Stock Exchange many years ago. I really like his art and this painting is one of my favorites. I don't own it, but maybe one day. Hard to convince my wife that I want a purple skull in my home.

    http://jasonoliva.com/
     
  7. cjbuckley4

    cjbuckley4

    Nice, I see his studio is in Tribeca. Maybe I'll swing through after school one day to try to impress my girlfriend with my vast knowledge of art. I'm not exactly on a fine art budget at the moment though haha, so more power to you if your wife is the limiting factor here.
     
  8. Baron

    Baron ET Founder

    You should probably just give them a call and describe the nature of your project and just ask them what limitations you might run into with their free version. You'll probably learn more in 5 minutes with them on the phone you would in 5 weeks of trial and error on your own.
     
  9. cjbuckley4

    cjbuckley4

    That's probably a prudent suggestion. Better to know now than later, and I would assume part of the point of having a free version is to encourage students to play with it.
     
    Baron likes this.
  10. vicirek

    vicirek

    You can also consider NoSql or document databases like MongoDB, Hadoop and others. Such type of more flexible store than a table may be better suited for options and futures data. Many of them are open source and can be accessed from multiple programming languages.
     
    #10     Oct 20, 2014
    eusdaiki, Ghost_of_Blotto and Baron like this.