Designing an efficient database

Discussion in 'App Development' started by TheBigShort, Feb 9, 2019.

  1. Is pricing data required for option data?
     
    #11     Feb 10, 2019
  2. What is the essential question you are trying to capture? That's a lot of rows. Most of those names have sparse liquidity, I would assume; even if there is some edge in XYZ, if the bid/offer is $1 wide there is nothing to do. Also, if you are trying to capture term structure, find some heuristic position that can alert you at a high level that the term structure is 'off', then go granular on the few. In equities, since there is a single price, maybe take the ATM option in three different months and calculate the fly value; if one of those months is off, you will see it in the fly. You should also look into a semantic form of storing info. It is far more robust and flexible as your needs grow. Just my 2 cents :)
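
     A minimal sketch of that ATM fly check in R; the vols and the cutoff below are invented, purely to show the arithmetic:

       # Hypothetical ATM implied vols for three consecutive expirations
       atm_iv <- c(front = 0.42, mid = 0.36, back = 0.34)

       # "Fly" across the term structure: front - 2 * mid + back
       fly <- unname(atm_iv["front"] - 2 * atm_iv["mid"] + atm_iv["back"])

       # Alert at a high level first; only then go granular on the names that trip it
       cutoff <- 0.05   # arbitrary
       if (abs(fly) > cutoff) {
         message(sprintf("Term structure looks off: fly = %.3f", fly))
       }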
     
    #12     Feb 10, 2019
  3. tiddlywinks

    So your opening post... is incorrect.
    This is really an options and price data project!

    Take a look at this link, which is inside the link @nooby_mcnoob offered.... https://www.quantstart.com/articles/Securities-Master-Database-with-MySQL-and-Python

    It is similar in that it includes pricing data. In addition to giving ideas for table structure, importantly, it mentions (but glosses over) the problems associated with gathering price data in general.
     
    #13     Feb 10, 2019
  4. ensemble

     I would suggest considering a SQL-on-Hadoop MPP query engine like Cloudera Impala or Apache Drill. Both are open source and free to use commercially. A traditional RDBMS (especially MySQL) won't scale to that kind of quote volume.

     I am running an Impala cluster in an NJ colo and persist about 14 million quotes a day (over 2 billion rows on 79k+ options contracts), stored as Parquet on HDFS.

    AWS Redshift has a similar architecture but the TCO is substantially higher than self-hosting.
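
     For illustration, a rough sketch of what a Parquet-backed quote table could look like if created from R through DBI/odbc; the driver name, host, table, and columns below are placeholders, not the actual cluster setup:

       library(DBI)

       # Assumed ODBC connection details -- adjust driver/host for your own cluster
       con <- dbConnect(odbc::odbc(),
                        driver = "Cloudera ODBC Driver for Impala",
                        host   = "impala.example.local",
                        port   = 21050)

       # Partitioning by quote date keeps daily loads and date-range scans cheap
       dbExecute(con, "
         CREATE TABLE IF NOT EXISTS option_quotes (
           symbol    STRING,
           expiry    STRING,
           strike    DOUBLE,
           put_call  STRING,
           bid       DOUBLE,
           ask       DOUBLE,
           iv        DOUBLE,
           quote_ts  TIMESTAMP
         )
         PARTITIONED BY (quote_date STRING)
         STORED AS PARQUET")

       dbDisconnect(con)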
     
    #14     Feb 10, 2019
    nooby_mcnoob and fan27 like this.
  5. TheBigShort

     Thanks for the comments everyone, I am pleasantly surprised by the support. I was a little busy over the weekend and was unable to write a thorough response. Currently I have created a spreadsheet on Bloomberg that does the basics of what I am looking to do; I will post it below. The spreadsheet is done in Excel and accesses Bloomberg's database to extract the data needed. The problem with this solution is that it is not very customizable, and some of the predetermined formulas calculated by Bloomberg are, in my opinion, wrong (an example is Bloomberg's predetermined event vol calculation). So the objective is to create a similar scanner but with my own calculations plus advanced features (GARCH, event vol incorporating the SPX term structure, skew, standard deviation of the implied move before earnings, etc.).

    I will update it every hour.

     I have read a few articles saying that relational databases are fine if I am storing less than 500 million rows of data. At the rate I am uploading data, it would take many years to get to that number, so I don't think I need to go down the route you went. Eventually, when I get into Asia/Europe, I will consider switching over to a NoSQL database.

     Thanks for the advice kev, I will add the multiple columns for IV calculations. In regards to normalization, do you think I could just have it all in one table (a CSV stored on some platform that I query using R; you mentioned this to me before) and still be able to have remote access to the table?

     I am doing all my analysis in R and the presentation in Shiny. Here is a super quick and rough example of some of the graphs I would like to have in Shiny. I posted this before in the earnings journal but maybe you missed it: https://financialtrading.shinyapps.io/elitetrader/ This is static data on UAL pulled from a CSV on my local hard drive (the code was done in 45 minutes, so it's not great, just a template to show you an example). It's very easy for me to do this from a CSV; however, making the data dynamic (updating every hour) is why I am going the database route. The "DT" package allows me to easily query/scan a CSV and show it in a nice, robust JavaScript table, as shown at the bottom of the page (the table shown on the website does not have the scanning feature embedded).
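
     As a hedged sketch of that dynamic version: the same DT table, but fed by a DBI query against the database instead of a local CSV. The backend (MariaDB/MySQL), connection details, and table name are assumptions for illustration, not the actual setup:

       library(shiny)
       library(DBI)
       library(DT)

       ui <- fluidPage(
         DTOutput("scanner")
       )

       server <- function(input, output, session) {
         # Placeholder connection -- credentials omitted
         con <- dbConnect(RMariaDB::MariaDB(), dbname = "earnings",
                          host = "db.example.local")
         onStop(function() dbDisconnect(con))

         # Re-query on a timer so the table picks up the hourly uploads
         scanner_data <- reactivePoll(
           intervalMillis = 60 * 60 * 1000,      # once an hour
           session = session,
           checkFunc = function() Sys.time(),    # crude: treat every poll as "changed"
           valueFunc = function() dbGetQuery(con, "SELECT * FROM iv_scanner")
         )

         output$scanner <- renderDT(datatable(scanner_data(), filter = "top"))
       }

       shinyApp(ui, server)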

     Below is the Bloomberg scanner I created that I would like to somewhat replicate (the objective of the database).

    scanner.JPG
     
    #15     Feb 11, 2019
  6. I suggest you replicate it in a spreadsheet first; it's much easier to prototype. That's what I usually do first.
     
    #16     Feb 11, 2019
  7. Just a quick observation on your schema. All of the tables except Stock have a date field, so I guess you are saving these points as a time series. On the Stock table you have market cap but no date. Market cap changes over time, so don't you need a separate table for stock code, market cap, and date? You can then keep your Stock table as just a reference for the name.
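
     A sketch of that split, assuming a MySQL-style backend; the table and column names (including the Stock reference table) are made up for illustration, not the actual schema:

       library(DBI)

       con <- dbConnect(RMariaDB::MariaDB(), dbname = "earnings")   # placeholder connection

       # Keep Stock as a pure reference table; market cap moves into a dated history table
       dbExecute(con, "
         CREATE TABLE stock_market_cap (
           stock_code  VARCHAR(16) NOT NULL,
           as_of_date  DATE        NOT NULL,
           market_cap  DECIMAL(20,2),
           PRIMARY KEY (stock_code, as_of_date)
         )")

       # The current cap per name then comes from a query instead of a column that goes stale
       latest <- dbGetQuery(con, "
         SELECT s.stock_code, s.name, m.market_cap, m.as_of_date
         FROM stock s
         JOIN stock_market_cap m ON m.stock_code = s.stock_code
         WHERE m.as_of_date = (SELECT MAX(m2.as_of_date)
                               FROM stock_market_cap m2
                               WHERE m2.stock_code = s.stock_code)")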
     
    #17     Feb 21, 2019
  8. Sig

     Except that it can be far more efficient to do a lot of the analysis through the SQL call if it's related to finding related data. You know this, I know, but it's just to keep him from going down the road of building a massive array in his code that he iterates through, when he could have done it in a fraction of a second with a SQL call. I say this because his description of what he's doing from a CSV leads me to believe that his code is designed that way now.
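
     A quick sketch of the difference being pointed at, with invented table and column names; the commented CSV lines are only there for contrast:

       library(DBI)

       con <- dbConnect(RMariaDB::MariaDB(), dbname = "earnings")   # placeholder connection

       # Instead of reading the whole history into R and filtering the big array ...
       # all_rows <- read.csv("option_iv.csv")
       # ual      <- all_rows[all_rows$symbol == "UAL" & all_rows$quote_date >= "2019-01-01", ]

       # ... push the "finding" into the SQL call and only ship back the rows you need
       ual <- dbGetQuery(con,
         "SELECT quote_date, expiry, strike, iv
          FROM option_iv
          WHERE symbol = ? AND quote_date >= ?",
         params = list("UAL", "2019-01-01"))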
     
    #18     Feb 21, 2019
  9. I'm a little late to the party, but instead of a stock table why don't you use an instrument table? It's more flexible. You can then have a type column with Options/Futures/Stock. I would also have a parent instrument reference in that table for cases where there is an underlying.
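
     A sketch of that instrument table, again assuming a MySQL-style backend and placeholder names:

       library(DBI)

       con <- dbConnect(RMariaDB::MariaDB(), dbname = "earnings")   # placeholder connection

       # One table for every tradable thing; options/futures point back at their underlying
       dbExecute(con, "
         CREATE TABLE instrument (
           instrument_id         INT NOT NULL AUTO_INCREMENT PRIMARY KEY,
           symbol                VARCHAR(32) NOT NULL,
           instrument_type       ENUM('Stock','Option','Future') NOT NULL,
           parent_instrument_id  INT NULL,
           FOREIGN KEY (parent_instrument_id) REFERENCES instrument (instrument_id)
         )")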
     
    #20     Sep 26, 2019