Hi everyone,

I am trying to replicate an ETF using as few components as possible. There are really two parts to my question: 1) how to explain the most variance, and 2) how to size the positions.*

The idea is that, if I am trying to replicate QQQ, buying the top 5 holdings may not be the best bet, since AAPL is very similar to MSFT. There might also be a stock way down the list with a vol of 100% that explains variance in QQQ and is not correlated with AAPL, MSFT, etc.

I was thinking of doing a PCA regression: grab all the components of QQQ and see which loadings I should use. The issue is that there are too many loadings, so the problem is still not solved.

Does anyone have ideas or links for modern-day dispersion trading?

Note* I am looking at this through a vol lens, not D1.

For a case study, I have attached a data.frame for QQQ and its components over the last 2 years. I also naively zeroed out large outliers in the dataset.

Thank you for your time.

Code:
# A tibble: 6 x 101
       QQQ     AAPL     MSFT     AMZN     TSLA     GOOG       FB    GOOGL     NVDA     PYPL    CMCSA     INTC
     <dbl>    <dbl>    <dbl>    <dbl>    <dbl>    <dbl>    <dbl>    <dbl>    <dbl>    <dbl>    <dbl>    <dbl>
1 -0.00434 -6.51e-3 -1.31e-2 -0.00560   0.0431 -0.00468 -0.00259 -0.00580  1.51e-2  -0.0112 -1.77e-2 -0.00414
2   0.0159  1.24e-2  2.13e-2   0.0324   0.0448   0.0196   0.0153   0.0198 -9.82e-4   0.0206  1.50e-2   0.0237
3 -0.00612 -1.54e-2 -5.82e-3 -0.00607  0.00122  0.00337 -0.00813  0.00329 -1.73e-2 -0.00991  2.31e-4 -0.00418
4  -0.0195 -2.70e-2 -2.05e-2  -0.0151  -0.0324  -0.0129  -0.0212  -0.0122 -3.75e-2  -0.0171 -1.25e-2  -0.0144
5 -0.00252  1.97e-4 -7.95e-5 -0.00168 -0.00899 -0.00667 -0.00121 -0.00685  4.68e-3  0.00110 -4.91e-3  -0.0246
6 -0.00538 -1.07e-2 -7.96e-5 -0.00933  -0.0117 -0.00334 -0.00470 -0.00240 -2.14e-2  0.00605  8.70e-3  -0.0532
The easiest way is:

Group your 100 names into 5 non-overlapping factors, sectors, or clusters. Just use kmeans or densclust, or really any clustering method. Discard all but a few names in each cluster (for example, keep the top four in each cluster by semi-partial corr with the index). Then try OLS fits for all possible combinations of one name from each cluster, and choose the one with the lowest w'Aw, where A is your 5x5 corr matrix and w is the vector of the five weights. This metric is a similarity-adjusted Herfindahl stat. Strictly speaking, A should be a similarity matrix with all entries between zero and one, but since the negative entries in your corr matrix are very few and small, you can just use the corr matrix.

That is the simple way. The complex way is to spend a considerable amount of time and effort estimating the forward or instantaneous covar matrix, then run a lasso with an additional penalty term involving the adjusted Herfindahl stat.

This is an interesting question, and I am surprised no one has answered it yet. It is interesting not just for dispersion trading, but also because an index and its sparse replicating portfolio often make excellent pairs trades. There is also some evidence that sparse replicators with low pairwise corr (small Herfindahl stat) are more stably cointegrated with the index.
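For anyone who wants to try the simple way, here is a rough sketch in Python with NumPy. It is only an illustration under several assumptions, not a tested strategy: the k-means is a small hand-rolled version (any clustering method would do), names are shortlisted by plain correlation with the index rather than semi-partial corr, and the OLS weights are rescaled so their absolute values sum to one before computing w'Aw (so the score behaves like a Herfindahl-style concentration measure). The function names `kmeans_labels` and `sparse_replicator`, and the synthetic data at the bottom, are made up for the example.

```python
import itertools
import numpy as np


def kmeans_labels(X, k, n_iter=50):
    """Plain Lloyd k-means on the rows of X, with a deterministic farthest-point start."""
    centers = [X[0]]
    for _ in range(1, k):
        d = np.min([((X - c) ** 2).sum(axis=1) for c in centers], axis=0)
        centers.append(X[np.argmax(d)])
    centers = np.array(centers)
    for _ in range(n_iter):
        dist = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dist.argmin(axis=1)
        new = np.array([X[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
                        for j in range(k)])
        if np.allclose(new, centers):
            break
        centers = new
    return labels


def sparse_replicator(index_ret, comp_ret, names, k=5, keep=4):
    """Pick one name per cluster by minimising the normalised w'Aw score."""
    # Cluster names on their standardised return series (one row per name).
    Z = (comp_ret - comp_ret.mean(axis=0)) / comp_ret.std(axis=0)
    labels = kmeans_labels(Z.T, k)

    # Assumption: shortlist by plain correlation with the index, not semi-partial corr.
    corr_idx = np.array([np.corrcoef(index_ret, comp_ret[:, j])[0, 1]
                         for j in range(comp_ret.shape[1])])
    shortlists = []
    for c in range(k):
        members = np.flatnonzero(labels == c)
        if members.size:
            shortlists.append(members[np.argsort(-corr_idx[members])][:keep])

    best = None
    for combo in itertools.product(*shortlists):
        cols = list(combo)
        X = np.column_stack([np.ones(len(index_ret)), comp_ret[:, cols]])
        beta, *_ = np.linalg.lstsq(X, index_ret, rcond=None)  # OLS with intercept
        w = beta[1:]
        s = w / np.abs(w).sum()  # assumption: rescale weights before scoring
        A = np.atleast_2d(np.corrcoef(comp_ret[:, cols], rowvar=False))
        score = float(s @ A @ s)  # similarity-adjusted Herfindahl-style stat
        if best is None or score < best["score"]:
            best = {"names": [names[j] for j in cols], "weights": w, "score": score}
    return best


# Synthetic stand-in for the QQQ data: 5 latent factors, 4 names per factor.
rng = np.random.default_rng(0)
factors = rng.normal(0, 0.01, size=(500, 5))
comp = np.repeat(factors, 4, axis=1) + rng.normal(0, 0.005, size=(500, 20))
tickers = [f"S{j}" for j in range(comp.shape[1])]
index = comp.mean(axis=1)

best = sparse_replicator(index, comp, tickers)
print(best["names"], best["weights"].round(3), round(best["score"], 3))
```

With the real data you would pass the QQQ column as `index_ret` and the other 100 columns as `comp_ret`. With `keep=4` and five clusters that is at most 4^5 = 1024 small regressions, so brute force is cheap.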
Correlation is about past performance and past performance is not an indication of future results.

The simpler the better:
- the five biggest companies (represent 41% of the index - replacing GOOGL by GOOG)
- the ten biggest companies (represent 54% of the index)
...