Hey everyone, I thought ET might do well with a stats/betting thread. Although Cross Validated is very helpful with answering my questions, non-financial data scientists sometimes have a hard time connecting the two fields. We are lucky enough to have some very smart people on this forum who I and many others would love to learn from. For obvious reasons, I will be hiding some of the variables used in my models going forward.
A couple of days ago I came across an interesting variable that did a decent job of predicting 1-month implied vol relative to what the market actually realized over the following 30 days (SPX), i.e. log(IV t0/RV t1). The variable is the interest rate swap vol (SRVIX). Here is the first model we have. The data runs from 2012 to yesterday.
iv_rv = log(IV t0/RV t1)
TenYearVol = SRVIX.Index
iv_rv ~ TenYearVol
Plot1 = regular graph
Plot2 = residuals
Plot3 = QQ plot
Plot4 = summary
From looking at the residuals, we can see lots of heteroskedasticity, and the QQ plot tells us that we have some heavy tails, so the distribution is not normal. So I tried two different approaches: the first was a Box-Cox transformation, the second was a generalized linear model with a gamma distribution. The gamma distribution was a better fit, so I ended up going with that. Here are the stats. We also have a quasi R^2 of 0.20 (1 - residual deviance/null deviance).
What do you guys think? Is this tradeable? Maybe not enough data?
For the interested, I added a dummy variable, where 1 = SPX was above SMA50 and 0 = SPX was below SMA50; it only marginally increased the R^2.

Gonna have to dust off my "Math for Data Scientists" lecture notes from my postgrad program, or you can just wait for sle, dest, TomM and a couple more quants who are so capable of answering this... One thing I would suggest initially, though, is to try this on a less liquid, less efficient ticker that you think can also be influenced by your variable and see if you get somewhat similar results... then it gets interesting.

While this is a bit beyond my pay grade... I do like posts like this. (Thank you!) Curious whether some additional clarity could be extracted if the data were segregated into periods where the IV clearly missed the event and "more normal" periods where log(IV/RV) was "relatively well behaved". Initially, ignore the periods where IV underestimated (perhaps only consider periods of contango as an approximation, or only times with a positive value of log(IV/RV)). My assumption with this separation of the data is that the IV cannot predict the unknown, so remove the large unknowns from the equation.

Almost didn't recognize you with the new display!! The end goal is to find predictors for less liquid underlyings. However, I thought I would get some statistical advice on a more liquid underlying first, where the data is much cleaner. If I remove the outliers the data looks much cleaner, but I am not too sure it's right to remove them, as it significantly changes the slope of the line (expected value). I'll post a photo of it later this evening!!!

I am trying to de-jump earnings in implied vol for backtesting purposes. Any easy ways to do this? I have the rolling 30 day implied vol and the rolling 60 day implied vol + all the earnings dates.
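
One common way to think about it, sketched here under strong assumptions: treat total implied variance to a tenor T as diffusive variance times T plus a one-off earnings jump variance whenever an earnings date falls inside the window, and assume the diffusive vol is flat across the two tenors. With the 30 and 60 day vols that gives two equations in two unknowns. The function name, the 365-day year and the single-event-per-window handling are all assumptions, not an established recipe.

```python
import math

T30, T60 = 30 / 365, 60 / 365  # tenors in years (calendar-day convention assumed)

def dejump(iv30, iv60, days_to_earnings):
    """Return (diffusive vol, earnings jump variance), assuming flat diffusive vol.

    days_to_earnings: calendar days to the next earnings date, or None if unknown.
    Assumes at most one earnings date inside the 60-day window.
    """
    w30, w60 = iv30 ** 2 * T30, iv60 ** 2 * T60  # total implied variances
    if days_to_earnings is None or days_to_earnings > 60:
        return iv30, 0.0  # no event in either window, nothing to strip out
    if days_to_earnings <= 30:
        # event sits in both windows, so the jump variance cancels in the difference
        diff_var = (w60 - w30) / (T60 - T30)
        jump_var = w30 - diff_var * T30
    else:
        # event sits only in the 60-day window, so the 30-day vol is already clean
        diff_var = iv30 ** 2
        jump_var = w60 - diff_var * T60
    # clamp, since noisy quotes can push either estimate below zero
    return math.sqrt(max(diff_var, 0.0)), max(jump_var, 0.0)

# Round-trip check on made-up numbers: 20% diffusive vol, 5% earnings move (1 sd)
true_vol, jump_sd = 0.20, 0.05
iv30 = math.sqrt(true_vol ** 2 + jump_sd ** 2 / T30)
iv60 = math.sqrt(true_vol ** 2 + jump_sd ** 2 / T60)
clean_vol, jump_var = dejump(iv30, iv60, days_to_earnings=10)
print(f"iv30={iv30:.4f} iv60={iv60:.4f} clean={clean_vol:.4f} jump_sd={math.sqrt(jump_var):.4f}")
```

The flat diffusive term structure is the weak point: any real slope between the 30 and 60 day vols gets attributed to the earnings jump, so the output is best treated as a first pass.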