Designing an efficient database

Discussion in 'App Development' started by TheBigShort, Feb 9, 2019.

  1. TheBigShort

    TheBigShort

    I am trying to build a remote relational database that contains earnings information on roughly 1500 US equities and 100 CAD equities with about 8000 new rows added everyday. I am having a few problems when it comes to database design and implementation. I was wondering if there is anyone on this board who has experience working with a database and has some time to help answer a few questions or partner-up to build a production level database. I am currently learning SQL and playing around with LucidCharts to help design a 3rd normal form DB.
    My programming knowledge in R is efficient for my purposes but not at the level required for this task.

    The goal is to be able to scan the universe listed above for certain criteria given the information we have stored as well as plot some graphs (currently using shiny).

    If you know of any course/forums (besides stack overflow) that will help me along this journey that would be great! I am getting bits and pieces from different sources/courses but none of them are time series related (would be nice).

    I am hitting some road blocks and any advice would be greatly appreciated! If anyone is interested, I will post my current schematic design from lucidcharts.

    Ps. Currently using AWS to host my database and I am using MySQL as the RDMS.
    Data Sources: Interactie Brokers, Questrade, Bloomberg, Nasdaq.com(scraping), theocc.com(scraping), cboe.com(scraping).
     
    blueraincap likes this.
  2. fan27

    fan27

    I am not familiar with LucidCharts but have worked with MySql. Sure...post your schematic designs and describe the problems you are having.
     
  3. tiddlywinks

    tiddlywinks

    Ditto here.

    I'm guessing you are building your design as-you-go so maybe no schematic, nothing in stone, that's OK. But I'm questioning your description...
    A total of roughly 1600 equities. With 8000 rows added daily. huh? IMO, your (row) design should be keyed on 2 columns, Symbol (ABCD) and Quarter+Year (Q119). And a column for each associated earnings data you are collecting. With this type of design, your 8000 rows (added daily!?!) would be over one full year (4 quarters) of earnings information for all 1600 symbols. What are you collecting row by row?
     
  4. I go here to be inspired. It's a little excessive to use from scratch, but has some great ideas.

    By the way, I am also going to be building an earnings database. Twinsies!
     
    srinir and TheBigShort like this.
  5. TheBigShort

    TheBigShort

    The problem I am having is creating an efficient layout. This is what I have right now which is only in 1st level form. I would like to get it to 3rd normal form. Did you directly code in SQL to interact with your database? I am trying to create a script where R(or whatever language would be best for this) uploads the data straight into the database sorted to the proper tables. Currently I am saving data into a csv and then calling the LOAD DATA INFILE.

    Hi tiddlywinks, I will be uploading data every hour(or six times a day)for the listed data fields in the OPTION DATA and PRICE DATA tables.


    PK and FK are my primary and foreign keys
     
  6. TheBigShort

    TheBigShort

    Screen Shot 2019-02-09 at 11.41.20 PM.png
     
  7. fan27

    fan27

    I have directly interacted with MYSQL in both PHP and Javascript. You should be able to find some boiler plate code on the net for your language of choice.

    Regarding your diagram, for Earnings data, what data will Day and Earnings Announce Time hold?
     
  8. carrer

    carrer

    How frequent do you update the "Price Data" table?
    Is it on a daily basis?
     
  9. Your schema looks more or less fine as is. My only caveat is that you have far too few columns in the Option Data table. And don't keep IV1,Ex1 and IV2,Ex2 on the same row, make them separate rows with the same date,time,ticker... then if you insist on treating a table like a spreadsheet within the database, pivot them in a view or mview. If you're calculating things like IV1 and IV2 yourself, you'll need another table of the underlying information, as you may want to go back later and change your calculations.

    I wouldn't worry about normalization, in all my years in the industry, I've never seen a 3rd normal form database in actual everyday use.

    "Normalize until it hurts, denormalize until it works"

    In practice, you can skip the first part of that quote and just go direct to denormalized. Cobb and Date were right about flattening, wrong about normalization.

    Learning a little SQL is easy, becoming an expert is difficult. It is much harder to reach expert level in SQL than in R, Python, or even C++. Since you are a one man show, I suggest triage. Keep the SQL as simple as possible and leverage your experise in R. By that I mean use the database for storage and do essentially all your analysis and presentation in R.
     
    domesticllama and fan27 like this.
  10. ZBZB

    ZBZB

    #10     Feb 10, 2019