How to explain the most variance in an ETF?

Discussion in 'Options' started by TheBigShort, May 10, 2021.

  1. TheBigShort

    Hi everyone,

    I am trying to replicate an ETF using as few components as possible. There are really two parts to my question: 1) how to explain the most variance, and 2) how to size the position*

    The idea is, if I am trying to replicate QQQ, buying the top 5 holdings may not be the best bet, since AAPL is very similar to MSFT. There might also be a stock way down the list with a vol of 100% that explains variance in QQQ and is not correlated to AAPL, MSFT, etc...

    I was thinking of doing a PCA regression - grab all the components of QQQ and see which loadings I should use. The issue is that there are too many loadings, so the problem is still not solved.
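
    Roughly what I had in mind, assuming the attached data is loaded as a data.frame called rets with QQQ in the first column:

    Code:
    X   <- as.matrix(rets[, -1])          # component daily returns
    pca <- prcomp(X, scale. = TRUE)

    # how much variance the first few PCs explain
    summary(pca)$importance["Cumulative Proportion", 1:5]

    # regress QQQ on the leading PCs, then see which names load heaviest on PC1
    fit <- lm(rets$QQQ ~ pca$x[, 1:3])
    head(sort(abs(pca$rotation[, 1]), decreasing = TRUE), 10)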

    Does anyone have ideas or links for modern day dispersion trading?

    Note*: I am looking at this through a vol lens, not D1 (delta one).

    For a case study, I have attached a data.frame for QQQ and the components over the last 2 years. I also naively zeroed out large outliers in the dataset.

    Thank you for your time

    Code:
    # A tibble: 6 x 101
           QQQ     AAPL     MSFT     AMZN     TSLA     GOOG       FB    GOOGL     NVDA     PYPL    CMCSA     INTC
         <dbl>    <dbl>    <dbl>    <dbl>    <dbl>    <dbl>    <dbl>    <dbl>    <dbl>    <dbl>    <dbl>    <dbl>
    1 -0.00434 -6.51e-3 -1.31e-2 -0.00560  0.0431  -0.00468 -0.00259 -0.00580  1.51e-2 -0.0112  -1.77e-2 -0.00414
    2  0.0159   1.24e-2  2.13e-2  0.0324   0.0448   0.0196   0.0153   0.0198  -9.82e-4  0.0206   1.50e-2  0.0237
    3 -0.00612 -1.54e-2 -5.82e-3 -0.00607  0.00122  0.00337 -0.00813  0.00329 -1.73e-2 -0.00991  2.31e-4 -0.00418
    4 -0.0195  -2.70e-2 -2.05e-2 -0.0151  -0.0324  -0.0129  -0.0212  -0.0122  -3.75e-2 -0.0171  -1.25e-2 -0.0144
    5 -0.00252  1.97e-4 -7.95e-5 -0.00168 -0.00899 -0.00667 -0.00121 -0.00685  4.68e-3  0.00110 -4.91e-3 -0.0246
    6 -0.00538 -1.07e-2 -7.96e-5 -0.00933 -0.0117  -0.00334 -0.00470 -0.00240 -2.14e-2  0.00605  8.70e-3 -0.0532 
     
  2. The easiest way is:

    Group your 100 names into 5 non-overlapping Factors, Sectors, or clusters. Just use kmeans or densclust or really any clustering method. Discard all but a few in each cluster (for example, keep the top four in each cluster by semi-partial correlation with the index). Try OLS fitting all possibilities of one name from each cluster, and choose the one with the lowest w'Aw, where A is your 5x5 corr matrix and w is a vector of the five weights. This metric is a similarity-adjusted Herfindahl stat. Strictly speaking, A should be a similarity matrix with all entries between zero and one, but since the negative entries in your corr matrix are very few and small, you can just use the corr matrix.
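
    A rough sketch of that recipe in R, assuming the data is loaded as a data.frame called rets with QQQ in the first column (plain correlation with the index stands in for the semi-partial correlation to keep it short):

    Code:
    X   <- as.matrix(rets[, -1])
    qqq <- rets$QQQ

    # cluster the names on their standardized return series (k = 5 clusters)
    km <- kmeans(t(scale(X)), centers = 5, nstart = 25)

    # keep the top 4 names per cluster by absolute correlation with the index
    keep <- unlist(lapply(split(colnames(X), km$cluster), function(nms) {
      cors <- sapply(nms, function(nm) abs(cor(X[, nm], qqq)))
      names(sort(cors, decreasing = TRUE))[seq_len(min(4, length(nms)))]
    }))

    # one survivor from each cluster, OLS fit, score each basket by w'Aw
    combos <- expand.grid(split(keep, km$cluster[keep]), stringsAsFactors = FALSE)
    scores <- apply(combos, 1, function(nms) {
      nms <- as.character(nms)
      w   <- coef(lm(qqq ~ X[, nms] - 1))
      w   <- abs(w) / sum(abs(w))      # one way to normalise the weights
      A   <- cor(X[, nms])             # 5x5 corr matrix as the similarity matrix
      as.numeric(t(w) %*% A %*% w)     # similarity-adjusted Herfindahl
    })
    combos[which.min(scores), ]        # the five-name basket to keep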


    That is the simple way. The complex way is to spend a considerable amount of time and effort to estimate the forward or instantaneous covariance matrix, then run a lasso with an additional penalty term involving the adjusted Herfindahl stat.
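
    A bare-bones sketch of that penalized fit, done by proximal gradient since a plain lasso solver will not take the extra quadratic term (the lambdas, the iteration count, and the use of the historical corr matrix in place of a forward estimate are all placeholder choices):

    Code:
    # minimise  0.5*||y - Xw||^2 + lambda1*||w||_1 + lambda2 * w'Aw
    sparse_replicate <- function(y, X, lambda1 = 0.01, lambda2 = 0.1, iters = 5000) {
      A    <- cor(X)                    # stand-in for the forward similarity matrix
      L    <- max(eigen(crossprod(X) + 2 * lambda2 * A, symmetric = TRUE,
                        only.values = TRUE)$values)
      step <- 1 / L                     # safe step size for the smooth part
      soft <- function(z, k) sign(z) * pmax(abs(z) - k, 0)
      w    <- rep(0, ncol(X))
      for (i in seq_len(iters)) {
        grad <- -crossprod(X, y - X %*% w) + 2 * lambda2 * A %*% w
        w    <- soft(w - step * grad, step * lambda1)
      }
      setNames(as.numeric(w), colnames(X))
    }

    w <- sparse_replicate(rets$QQQ, as.matrix(rets[, -1]))
    names(w)[w != 0]                    # the names the penalized fit keeps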


    This is an interesting question. I am surprised no one has answered it yet. It is interesting not just for dispersion trading, but also because an index and its sparse replicating portfolio often make excellent pairs trades. And there is some evidence that sparse replicators with low pairwise corr (small Herfindahl stat) are more stably cointegrated with the index.
     
  3. Correlation is about past performance and past performance is not an indication of future results.

    The simpler the better:
    - the five biggest companies (representing 41% of the index, replacing GOOGL with GOOG)
    - the ten biggest companies (representing 54% of the index)
    ...