Use of allele scores as instrumental variables for Mendelian randomization

Background An allele score is a single variable summarizing multiple genetic variants associated with a risk factor. It is calculated as the total number of risk factor-increasing alleles for an individual (unweighted score), or the sum of weights for each allele corresponding to estimated genetic effect sizes (weighted score). An allele score can be used in a Mendelian randomization analysis to estimate the causal effect of the risk factor on an outcome.

Methods Data were simulated to investigate the use of allele scores in Mendelian randomization where conventional instrumental variable techniques using multiple genetic variants demonstrate ‘weak instrument’ bias. The robustness of estimates using the allele score to misspecification (for example non-linearity, effect modification) and to violations of the instrumental variable assumptions was assessed.

Results Causal estimates using a correctly specified allele score were unbiased with appropriate coverage levels. The estimates were generally robust to misspecification of the allele score, but not to instrumental variable violations, even if the majority of variants in the allele score were valid instruments. Using a weighted rather than an unweighted allele score increased power, but the increase was small when genetic variants had similar effect sizes. Naive use of the data under analysis to choose which variants to include in an allele score, or for deriving weights, resulted in substantial biases.

Conclusions Allele scores enable valid causal estimates with large numbers of genetic variants. The stringency of criteria for genetic variants in Mendelian randomization should be maintained for all variants in an allele score.
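The unweighted and weighted scores defined in the Background can be illustrated with a minimal sketch. The function name `allele_scores`, the toy genotype matrix and the example weights are hypothetical and chosen only for illustration; the sketch assumes each genotype is coded as the number of risk factor-increasing alleles (0, 1 or 2).

```python
def allele_scores(genotypes, weights=None):
    """Return one allele score per individual.

    genotypes: list of rows, one per individual, each holding the count
               (0, 1 or 2) of risk factor-increasing alleles per variant.
    weights:   per-variant weights; None gives the unweighted score.
    """
    if weights is None:
        # Unweighted score: every allele counts equally.
        weights = [1.0] * len(genotypes[0])
    return [sum(w * g for w, g in zip(weights, row)) for row in genotypes]


# Toy data: 3 individuals, 3 variants (illustrative values only).
genotypes = [[0, 1, 2], [2, 2, 0], [1, 0, 0]]
unweighted = allele_scores(genotypes)
weighted = allele_scores(genotypes, weights=[0.10, 0.06, 0.03])
```

In practice the weights would be taken from an independent data source, since the Results indicate that deriving them naively from the data under analysis leads to bias.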

1. Unequal variants: Genetic effect sizes α Gj for each genetic variant j drawn from a normal distribution with mean (0.1, 0.06, 0.03) and standard deviation (0.03, 0.018, 0.009). Independent weights are generated from a normal distribution with standard deviation of 0.04 and 0.01.
6. Interactions between a genetic variant and a covariate: Genetic effect sizes α G = (0.1, 0.06, 0.03), effects α Gj3 drawn from a mixture distribution taking the value zero with probability 0.5 and a random value from a normal distribution with mean 0 and standard deviation (0.06, 0.036, 0.018) with probability 0.5. With (9, 25, 100) genetic variants, in each simulated dataset there will be an average of (4.5, 12.5, 50) interactions between a genetic variant and the covariate.
7. Invalid variants: Parameters as in initial analysis.
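The parameter draws described for Scenarios 1 and 6 can be sketched as follows. This is an illustration rather than the simulation code used for the paper. It assumes that the triples of values correspond, in order, to the settings with (9, 25, 100) genetic variants; the names `PARAMS`, `draw_unequal_effects` and `draw_interaction_effects` are hypothetical.

```python
import random

# Assumed mapping: number of variants ->
# (mean of main effects, SD of main effects, SD of interaction effects).
PARAMS = {
    9: (0.10, 0.03, 0.06),
    25: (0.06, 0.018, 0.036),
    100: (0.03, 0.009, 0.018),
}


def draw_unequal_effects(n_variants, rng):
    """Scenario 1: one normally distributed effect size per variant."""
    mean, sd, _ = PARAMS[n_variants]
    return [rng.gauss(mean, sd) for _ in range(n_variants)]


def draw_interaction_effects(n_variants, rng):
    """Scenario 6: zero with probability 0.5, otherwise N(0, sd^2)."""
    _, _, sd_int = PARAMS[n_variants]
    return [
        rng.gauss(0.0, sd_int) if rng.random() < 0.5 else 0.0
        for _ in range(n_variants)
    ]


rng = random.Random(1)  # fixed seed so the sketch is reproducible
effects = draw_unequal_effects(25, rng)
interactions = draw_interaction_effects(25, rng)
n_nonzero = sum(1 for a in interactions if a != 0.0)
```

Averaged over many simulated datasets, `n_nonzero` should be close to half the number of variants, in line with the expected (4.5, 12.5, 50) interactions stated above.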
The results with 9 and 100 genetic variants are given in Web Tables A1 and A2.

Web Table A1: Instrumental variable estimates in a range of scenarios from allele score analysis and multivariable analyses using two-stage least squares (2SLS) and limited information maximum likelihood (LIML) methods in data-generating model with 9 genetic variants: mean F statistic from regression of risk factor on the instrument (F stat), median estimate across simulations, interquartile range (IQR) of estimates, coverage (Cov %) and power (%)

Web Table A2: Instrumental variable estimates in a range of scenarios from allele score analysis and multivariable analyses using two-stage least squares (2SLS) and limited information maximum likelihood (LIML) methods in data-generating model with 100 genetic variants: mean F statistic from regression of risk factor on the instrument (F stat), median estimate across simulations, interquartile range (IQR) of estimates, coverage (Cov %) and power (%)

A.2 Changing the sample size
In response to concern from a reviewer that the findings of this paper may only apply in small sample settings, we repeated the simulations for 25 variants with a sample size of 30 000. Results are given in Web Table A3 and show no substantial differences from those previously presented with a sample size of 3000, with the exception of Scenario 3, where imposing a p-value threshold for variants no longer resulted in substantial bias. This is because, with the increased sample size, all 25 variants (p < 0.05) or at least 24 of the 25 variants (p < 0.01) were chosen in over 90% of simulated datasets. While the findings of this paper are limited by their reliance on the results of simulation analyses, we have no reason to suspect that the paper's recommendations are sensitive to the sample size.
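A rough calculation suggests why a p-value threshold selects almost every variant at the larger sample size. The sketch below is not taken from the paper. It assumes a per-allele effect of 0.06 (taken to be the value for 25 variants), a hypothetical allele frequency of 0.3, a residual standard deviation of 1 for the risk factor, and independent variants; the function name `selection_power` is likewise hypothetical.

```python
import math
from statistics import NormalDist


def selection_power(alpha, n, maf=0.3, resid_sd=1.0, p_threshold=0.05):
    """Approximate probability that one variant passes a two-sided
    p-value threshold in a regression of the risk factor on the variant.

    maf and resid_sd are illustrative assumptions, not values from the paper.
    """
    std_normal = NormalDist()
    # Variance of an additive genotype under Hardy-Weinberg: 2 * maf * (1 - maf).
    se = resid_sd / math.sqrt(n * 2 * maf * (1 - maf))
    z_crit = std_normal.inv_cdf(1 - p_threshold / 2)
    z = alpha / se
    return std_normal.cdf(z - z_crit) + std_normal.cdf(-z - z_crit)


for n in (3000, 30000):
    power = selection_power(alpha=0.06, n=n)
    # power ** 25 approximates the chance that all 25 independent variants pass.
    print(n, round(power, 3), round(power ** 25, 3))
```

Under these assumptions, a single variant is selected only part of the time with 3000 individuals but almost always with 30 000. Selection then depends far less on chance variation in the estimated associations.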
Web Table A3: Instrumental variable estimates in a range of scenarios from allele score analysis and multivariable analyses using two-stage least squares (2SLS) and limited information maximum likelihood (LIML) methods in data-generating model with 25 genetic variants and large sample size (30 000 individuals): mean F statistic from regression of risk factor on the instrument (F stat), median estimate across simulations, interquartile range (IQR) of estimates, coverage (Cov %) and power (%)