If you are serious about vol forecasting, I would encourage you to drop GARCH in favor of the HAR-RV model of Corsi. I'll post a link below, but there are many websites that describe the model, possibly in a more user-friendly way. The problem with GARCH is that it typically relies on daily returns. They are easy to get, which is nice, but they are a very noisy proxy for true volatility. HAR-RV uses the volatility of intraday returns, typically 5-minute returns. These are much more informative about true volatility and only slightly harder to obtain. I can provide more detail if anyone is interested. http://statmath.wu.ac.at/~hauser/LV...s/Corsi2009JFinEtrics_LMmodelRealizedVola.pdf
The starting point is that daily returns are not informative enough about vol. If a stock price went way up and then came back to where it started, the daily return could be zero. But the price path suggests there was significant volatility. If you looked at the price path at higher frequencies, say every 5 minutes, you would see that on that day there were a lot of positive returns followed by a lot of negative returns, i.e. plenty of volatility. So there's a benefit to looking at higher frequencies. 5 minutes is the common choice, but I'm not saying it's the best one in all cases. If you have a dataset containing the price of a stock every 5 minutes, you can compute the volatility of 5-minute returns for each day in your dataset. This is what we'll use to compute the "X" variables in a regression.
There are 3 problems (at least) with 5-minute volatilities. One is that you can only compute them while the markets are active. The second is that they are probably contaminated by bid-ask spreads, short-run liquidity effects, etc. The third is that volatility is mean reverting, so if yesterday's volatility is unusually high or low, tomorrow's will most likely be less so. So while the 5-minute volatilities will be informative, they will probably be biased. Thus, we would not want to assume, for instance, that the volatility of 5-minute returns from yesterday tells you the volatility of today's close-to-close return.
What HAR-RV does is remove these biases by running a regression. The simplest version would be to take the 5-minute volatility from yesterday and transform it in two ways. First, square it so that it becomes a variance instead of a volatility. Second, and this has no effect other than interpretability, multiply it by the number of 5-minute intervals in the day. This transformed variable is your "X". Your "Y" variable is the squared close-to-close return on the next day. So it's a predictive regression.
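To make the two transformations concrete, here is a minimal Python sketch of the "X" variable for a single day. It assumes log returns and the population variance, neither of which is specified above; `daily_x_variable` and the toy prices are hypothetical and only for illustration.

```python
import math
from statistics import pvariance

def daily_x_variable(prices):
    """Turn one day's 5-minute prices into the regression's X variable.

    prices: prices sampled every 5 minutes within a single day, in order.
    Returns n * variance of the 5-minute log returns, where n is the
    number of 5-minute returns in the day.
    """
    # Log returns between consecutive 5-minute prices (an assumption;
    # simple returns would give nearly the same numbers at this frequency).
    returns = [math.log(b / a) for a, b in zip(prices, prices[1:])]
    n = len(returns)
    # Squared 5-minute volatility, scaled up to a full day.
    return n * pvariance(returns)

# Toy day that goes up and comes back: the open-to-close return is zero,
# but the 5-minute path shows plenty of movement.
round_trip = [100.0, 101.0, 102.0, 103.0, 102.0, 101.0, 100.0]
# Toy day where the price never moves.
flat = [100.0] * 7

x_round_trip = daily_x_variable(round_trip)
x_flat = daily_x_variable(flat)
```

The round-trip day gets a positive X even though its open-to-close return is zero, which is the extra information that daily returns miss.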
A variance you see today is predicting a squared return you see tomorrow. At a 1-day horizon, expected returns tend to be small, so the expected squared return is approximately the variance of the return. So if the regression is telling you that E[Y] = a + b*X, then a + b*X is your variance forecast and sqrt(a + b*X) is your volatility forecast. This is simpler than GARCH because no numerical optimization is required, and more accurate because it uses 5-minute returns. You can add additional X variables to capture different forms of mean reversion. A single lagged variance is probably simpler than you would want.
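A rough Python sketch of that single-regressor version follows. The closed-form least-squares fit is standard, but the function names, the toy numbers (made up, not real data), and the floor at zero inside the square root are my own assumptions for illustration.

```python
import math

def fit_ols(x, y):
    """Least-squares fit of y = a + b*x with one regressor (closed form)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    b = sxy / sxx
    a = mean_y - b * mean_x
    return a, b

def vol_forecast(a, b, x_today):
    """Volatility forecast sqrt(a + b*X).

    The variance forecast is floored at zero (an assumption added here,
    since a linear fit can in principle go negative).
    """
    return math.sqrt(max(a + b * x_today, 0.0))

# Made-up inputs, one entry per trading day, in date order:
#   x_var[t]  = transformed 5-minute variance on day t
#   sq_ret[t] = squared close-to-close return on day t
x_var = [1.0e-4, 2.5e-4, 1.8e-4, 0.9e-4, 3.0e-4, 1.2e-4]
sq_ret = [0.5e-4, 1.4e-4, 2.9e-4, 2.0e-4, 1.1e-4, 3.3e-4]

# Predictive alignment: today's X is paired with tomorrow's Y.
a, b = fit_ols(x_var[:-1], sq_ret[1:])

# Volatility forecast for the day after the last observation.
sigma_next = vol_forecast(a, b, x_var[-1])
```

Corsi's HAR-RV adds further regressors of the same kind, namely averages of the daily measure over roughly the past week (5 trading days) and month (22 trading days); with more than one X you would swap the closed form for a general least-squares routine.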
Thank you for the detailed explanation! But what would happen if the time variable were removed? So far we have assumed daily returns or 5-minute volatilities and quantities derived from them. But let's say you have no time-based OHLC bars, only volume-based ones, e.g. one bar per 5000 lots traded. The time dimension is gone, but there is still volatility within those bars (their OHLC depends not on time but on a constant trading-volume threshold of 5000 lots, i.e. so-called constant volume bars). How would you use the HAR-RV model in that case, please?
You are in effect then just using an alternative measure of "time" which is calculated in units of volume rather than units of real time. Just like a day is 78ish 5-minute periods, you could define a "volume day" as a certain number of 5000-lot bars. I have no idea how well that would work. The main thing is that the result would not be a volatility forecast for a future return over a calendar time interval, but rather a volatility forecast over a future "volume day." Now, if you at least know the day of each volume bar, it may be possible to do the same calculation that I suggested originally but using the 5000-lot bar returns in place of 5-minute returns. The important thing is that you have a good number of these on each day. Otherwise your data are not really high frequency. Again, I don't know how well that would work, but I don't think it is any less theoretically motivated than the original HAR-RV. If it works, it works.
Well, the sequence of 78ish 5-minute periods is constant because the sessions are well defined, and you can also have high price volatility on low volume (e.g. during economic reports). At the same time, the number of volume bars in a calendar day is not constant; it changes every day. Aside from that, it is clear. Thank you.
I agree with you. It would be pretty abstract, but what I'm saying is that you could define a "volume day" as 78 (for example) consecutive 5000-lot bars regardless of how many actual calendar days those bars covered.
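For what it's worth, the grouping could be sketched in Python like this. It is untested as a forecasting approach; `volume_day_variances`, the use of bar closes and log returns, and the toy numbers are all assumptions for illustration.

```python
import math
from statistics import pvariance

def volume_day_variances(bar_closes, bars_per_day=78):
    """Group consecutive constant-volume bars into "volume days".

    bar_closes: closing prices of consecutive 5000-lot bars, in order,
    regardless of which calendar day each bar falls on.
    Returns, for each complete block of `bars_per_day` bar-to-bar log
    returns, bars_per_day * variance of those returns. A trailing
    partial block is dropped.
    """
    returns = [math.log(b / a) for a, b in zip(bar_closes, bar_closes[1:])]
    n_days = len(returns) // bars_per_day
    out = []
    for d in range(n_days):
        block = returns[d * bars_per_day:(d + 1) * bars_per_day]
        out.append(bars_per_day * pvariance(block))
    return out

# Toy example with 3 bars per "volume day" instead of 78:
# 8 closes give 7 returns, i.e. 2 complete volume days and 1 leftover return.
closes = [100.0, 101.0, 100.0, 101.0, 102.0, 102.0, 102.0, 103.0]
x_by_volume_day = volume_day_variances(closes, bars_per_day=3)
```

Each entry would then play the role of "X" for predicting the squared return over the next volume day rather than over the next calendar day.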