Designing an efficient database

Discussion in 'App Development' started by TheBigShort, Feb 9, 2019.

  1. Is pricing data required for option data?
     
    #11     Feb 10, 2019
  2. What is the essential question you are trying to capture? That's a lot of rows. Most of those names have sparse liquidity, I would assume; even if there is some edge in XYZ, if the bid/offer is $1 wide there is nothing to do. Also, if you are trying to capture term structure, find some heuristic position that can alert you at a high level that the term structure is 'off', then go granular on the few. In equities, since there is a single price, maybe take the ATM option in three different months and calculate the fly value; if one of those months is off, you will see it in the fly. You should also look into a semantic form of storing info. It is far more robust and flexible as your needs grow. Just my 2 cents :)
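
     A minimal sketch of that ATM fly check in R; the vols and the cutoff below are invented, purely to show the arithmetic:

       # Hypothetical ATM implied vols for three consecutive expirations
       atm_iv <- c(front = 0.42, mid = 0.36, back = 0.34)

       # "Fly" across the term structure: front - 2 * mid + back
       fly <- unname(atm_iv["front"] - 2 * atm_iv["mid"] + atm_iv["back"])

       # Alert at a high level first; only then go granular on the names that trip it
       cutoff <- 0.05   # arbitrary
       if (abs(fly) > cutoff) {
         message(sprintf("Term structure looks off: fly = %.3f", fly))
       }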
     
    #12     Feb 10, 2019
  3. tiddlywinks

    So your opening post... is incorrect.
    This is really an options and price data project!

    Take a look at this link, which is inside the link @nooby_mcnoob offered.... https://www.quantstart.com/articles/Securities-Master-Database-with-MySQL-and-Python

    It is similar in that it includes pricing data. In addition to giving ideas for table structure, importantly, it mentions (but glosses over) the problems associated with gathering price data in general.
     
    #13     Feb 10, 2019
  4. ensemble

     I would suggest considering a SQL-on-Hadoop MPP query engine like Cloudera Impala or Apache Drill. Both are open source and free to use commercially. A traditional RDBMS (especially MySQL) won't scale to that kind of quote volume.

     I am running an Impala cluster in an NJ colo and persist about 14 million quotes a day (over 2 billion rows on 79k+ options contracts), stored as Parquet on HDFS.

    AWS Redshift has a similar architecture but the TCO is substantially higher than self-hosting.
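
     For illustration, a rough sketch of what a Parquet-backed quote table could look like if created from R through DBI/odbc; the driver name, host, table, and columns below are placeholders, not the actual cluster setup:

       library(DBI)

       # Assumed ODBC connection details -- adjust driver/host for your own cluster
       con <- dbConnect(odbc::odbc(),
                        driver = "Cloudera ODBC Driver for Impala",
                        host   = "impala.example.local",
                        port   = 21050)

       # Partitioning by quote date keeps daily loads and date-range scans cheap
       dbExecute(con, "
         CREATE TABLE IF NOT EXISTS option_quotes (
           symbol    STRING,
           expiry    STRING,
           strike    DOUBLE,
           put_call  STRING,
           bid       DOUBLE,
           ask       DOUBLE,
           iv        DOUBLE,
           quote_ts  TIMESTAMP
         )
         PARTITIONED BY (quote_date STRING)
         STORED AS PARQUET")

       dbDisconnect(con)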
     
    #14     Feb 10, 2019
    nooby_mcnoob and fan27 like this.
  5. TheBigShort

     Thanks for the comments everyone, I am pleasantly surprised by the support. I was a little busy over the weekend and was unable to write a thorough response. Currently I have created a spreadsheet on Bloomberg that does the basics of what I am looking to do; I will post it below. The spreadsheet is done in Excel and accesses Bloomberg's database to extract the data needed. The problem with this solution is that it is not very customizable, and some of the predetermined formulas calculated by Bloomberg are, in my opinion, wrong (an example is Bloomberg's predetermined event vol calculation). So the objective is to create a similar scanner but with my own calculations plus advanced features (GARCH, event vol incorporating the SPX term structure, skew, standard deviation of the implied move before earnings, etc.).

    I will update it every hour.

     I have read a few articles saying that relational databases are fine if I am storing less than 500 million rows of data. At the rate I am uploading data, it would take many years to get to that number, so I don't think I need to go down the route you went. Eventually, when I get into Asia/Europe, I will consider switching over to a NoSQL database.

     Thanks for the advice kev, I will add the multiple columns for IV calculations. In regards to normalization, do you think I could just have it all in one table (a CSV stored on some platform that I query using R; you mentioned this to me before) and still be able to have remote access to the table?

     I am doing all my analysis in R and the presentation in Shiny. Here is a super quick and rough example of some of the graphs I would like to have in Shiny. I posted this before in the earnings journal but maybe you missed it: https://financialtrading.shinyapps.io/elitetrader/ This is static data on UAL pulled from a CSV on my local hard drive (the code was done in 45 minutes, so it's not great, just a template to show you an example). It's very easy for me to do this from a CSV; however, making the data dynamic (updating every hour) is why I am going the database route. The "DT" package allows me to easily query/scan a CSV and show it in a nice, robust JavaScript table, as shown at the bottom of the page (the table shown on the website does not have the scanning feature embedded).
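
     As a hedged sketch of that dynamic version: the same DT table, but fed by a DBI query against the database instead of a local CSV. The backend (MariaDB/MySQL), connection details, and table name are assumptions for illustration, not the actual setup:

       library(shiny)
       library(DBI)
       library(DT)

       ui <- fluidPage(
         DTOutput("scanner")
       )

       server <- function(input, output, session) {
         # Placeholder connection -- credentials omitted
         con <- dbConnect(RMariaDB::MariaDB(), dbname = "earnings",
                          host = "db.example.local")
         onStop(function() dbDisconnect(con))

         # Re-query on a timer so the table picks up the hourly uploads
         scanner_data <- reactivePoll(
           intervalMillis = 60 * 60 * 1000,      # once an hour
           session = session,
           checkFunc = function() Sys.time(),    # crude: treat every poll as "changed"
           valueFunc = function() dbGetQuery(con, "SELECT * FROM iv_scanner")
         )

         output$scanner <- renderDT(datatable(scanner_data(), filter = "top"))
       }

       shinyApp(ui, server)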

     Below is the Bloomberg scanner I created that I would like to somewhat replicate (the objective of the database).

    scanner.JPG
     
    #15     Feb 11, 2019
  6. I suggest you replicate it in a spreadsheet first; it's much easier to prototype. That's what I usually do first.
     
    #16     Feb 11, 2019
  7. Just a quick observation on your schema. All of the tables except Stock have a date field, so I guess you are saving these points as a time series. On the Stock table you have market cap but no date. Market cap changes over time, so don't you need a separate table for stock code, market cap, and date? You can then keep your Stock table as just a reference for the name.
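
     A sketch of that split, assuming a MySQL-style backend; the table and column names (including the Stock reference table) are made up for illustration, not the actual schema:

       library(DBI)

       con <- dbConnect(RMariaDB::MariaDB(), dbname = "earnings")   # placeholder connection

       # Keep Stock as a pure reference table; market cap moves into a dated history table
       dbExecute(con, "
         CREATE TABLE stock_market_cap (
           stock_code  VARCHAR(16) NOT NULL,
           as_of_date  DATE        NOT NULL,
           market_cap  DECIMAL(20,2),
           PRIMARY KEY (stock_code, as_of_date)
         )")

       # The current cap per name then comes from a query instead of a column that goes stale
       latest <- dbGetQuery(con, "
         SELECT s.stock_code, s.name, m.market_cap, m.as_of_date
         FROM stock s
         JOIN stock_market_cap m ON m.stock_code = s.stock_code
         WHERE m.as_of_date = (SELECT MAX(m2.as_of_date)
                               FROM stock_market_cap m2
                               WHERE m2.stock_code = s.stock_code)")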
     
    #17     Feb 21, 2019
  8. Sig

     Except that it can be far more efficient to do a lot of the analysis through the SQL call if it's related to finding related data. You know this, I know, but it's just to keep him from going down the road of building a massive array in his code that he iterates through, when he could have done it in a fraction of a second with a SQL call. I say this because his description of what he's doing from a CSV leads me to believe that his code is designed that way now.
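
     A quick sketch of the difference being pointed at, with invented table and column names; the commented CSV lines are only there for contrast:

       library(DBI)

       con <- dbConnect(RMariaDB::MariaDB(), dbname = "earnings")   # placeholder connection

       # Instead of reading the whole history into R and filtering the big array ...
       # all_rows <- read.csv("option_iv.csv")
       # ual      <- all_rows[all_rows$symbol == "UAL" & all_rows$quote_date >= "2019-01-01", ]

       # ... push the "finding" into the SQL call and only ship back the rows you need
       ual <- dbGetQuery(con,
         "SELECT quote_date, expiry, strike, iv
          FROM option_iv
          WHERE symbol = ? AND quote_date >= ?",
         params = list("UAL", "2019-01-01"))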
     
    #18     Feb 21, 2019
  9. I'm a little late to the party, but instead of a stock table why don't you use an instrument table? It's more flexible. You can then have a type column with Options/Futures/Stock. I would also have a parent instrument reference in that table for cases where there is an underlying.
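
     A sketch of that instrument table, again assuming a MySQL-style backend and placeholder names:

       library(DBI)

       con <- dbConnect(RMariaDB::MariaDB(), dbname = "earnings")   # placeholder connection

       # One table for every tradable thing; options/futures point back at their underlying
       dbExecute(con, "
         CREATE TABLE instrument (
           instrument_id         INT NOT NULL AUTO_INCREMENT PRIMARY KEY,
           symbol                VARCHAR(32) NOT NULL,
           instrument_type       ENUM('Stock','Option','Future') NOT NULL,
           parent_instrument_id  INT NULL,
           FOREIGN KEY (parent_instrument_id) REFERENCES instrument (instrument_id)
         )")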
     
    #20     Sep 26, 2019