An efficient and robust approach to Mendelian randomization with measured pleiotropic effects in a high-dimensional setting

Summary Valid estimation of a causal effect using instrumental variables requires that all of the instruments are independent of the outcome conditional on the risk factor of interest and any confounders. In Mendelian randomization studies with large numbers of genetic variants used as instruments, it is unlikely that this condition will be met. Any given genetic variant could be associated with a large number of traits, all of which represent potential pathways to the outcome which bypass the risk factor of interest. Such pleiotropy can be accounted for using standard multivariable Mendelian randomization with all possible pleiotropic traits included as covariates. However, the estimator obtained in this way will be inefficient if some of the covariates do not truly sit on pleiotropic pathways to the outcome. We present a method that uses regularization to identify which out of a set of potential covariates need to be accounted for in a Mendelian randomization analysis in order to produce an efficient and robust estimator of a causal effect. The method can be used in the case where individual-level data are not available and the analysis must rely on summary-level data only. It can be used where there are any number of potential pleiotropic covariates up to the number of genetic variants less one. We show the results of simulation studies that demonstrate the performance of the proposed regularization method in realistic settings. We also illustrate the method in an applied example which looks at the causal effect of urate plasma concentration on coronary heart disease.


Introduction
Instrumental variables can be used to estimate the causal effect of an exposure (also called a risk factor) on an outcome from observational data.A variable is a valid instrument if it is: associated Figure 1: Directed acyclic graph showing the associations between the genetic variants (G 1 , . . ., G p ), the risk factor (X), measured covariates which potentially give rise to pleiotropy (W 1 , . . ., W k ), potentially unknown and unmeasured confounders (U ) and the outcome (Y ). with the risk factor; independent of any confounders of the association between the risk factor and the outcome; and independent of the outcome conditional on the risk factor and confounders.These are the three instrumental variables assumptions (Greenland, 2000).
In Mendelian randomization studies, genetic variants are used as instrumental variables (Katan, 1986;Davey Smith and Ebrahim, 2003;Lawlor et al., 2008).Although genetic variants have many properties which make them attractive candidates for instruments, one disadvantage is that a single variant typically explains only a small amount of the variation in a risk factor.It is therefore advantageous to combine information from a number of genetic variants.Given the proliferation of genome-wide association studies (GWAS) in recent years, there is data available linking genetic variants across the entire human genome to an enormous number of traits.Standard instrumental variables and meta-analysis techniques allow us to combine the individual estimates given by each of these genetic variants (Thompson and Sharp, 1999;Palmer et al., 2012).However, the more genetic variants that are added to an analysis, the more likely that at least one of them will be an invalid instrument.In particular, any given genetic variant could associate with a number of traits other than the risk factor of interest.If any of these traits, which we refer to as covariates, associate with the outcome via pathways which bypass the risk factor, then the third instrumental variables assumption is violated and estimates of the causal effect will be biased.This is known as pleiotropy.This scenario is illustrated via the directed acyclic graph in Figure 1.
There are many methods for estimating a causal effect in the presence of pleiotropy.However, such methods typically require at least some of the genetic variants to be valid instruments.These include median-based estimators (Bowden et al., 2016), mode-based estimators (Hartwig et al., 2017a;Guo et al., 2018) and the contamination mixture method (Burgess et al., 2019).The MR-Egger method (Bowden et al., 2015) consistently estimates the causal effect without requiring the assumption of no pleiotropy.However, the method relies on a different assumption that pleiotropic effects are independent of the genetic variant-risk factor associations.This assumption is almost as strong as the one it replaces and hypothesis testing based on the MR-Egger method often has low power.Regularization methods proposed by Kang et al. (2016), Windmeijer et al. (2019) and Rees et al. (2019) use ℓ 1 penalization of the least squares equation to down-weight, and possibly remove, invalid instruments.These methods implicitly assume that at least some of the instruments are valid, and Kang et al. (2016) show that consistent estimation requires a majority of instruments to be valid.Jiang et al. (2019) proposed a constrained optimization approach to construct a weighting scheme for the genetic variants that balances the pleiotropic effects.The weighting scheme can be thought of as weights used to construct an allele score (Burgess and Thompson, 2013;Burgess et al., 2016), which can then be used as an instrument in place of the genetic variants themselves.In Section 3 we demonstrate that, in the case where the number of genetic variants is greater than the number of covariates, the estimator obtained in this way is equivalent to that obtained using standard multivariable Mendelian randomization (Burgess and Thompson, 2015).The interpretation of this is that we can account for covariates which give rise to pleiotropy by including them in a multivariable Mendelian randomization model.
In practice, it will very often be the case that only a relatively small number of potential covariates need to be included in a multivariable analysis in order to balance pleiotropy.That is, only some of the traits will actually sit on pathways to the outcome which bypass the risk factor.If this is the case, then the estimator of the causal effect obtained by a multivariable Mendelian randomization analysis with all potential covariates included will be inefficient.Furthermore, if there are more covariates than genetic variants, multivariable Mendelian randomization cannot be performed, since this is akin to fitting a regression model with more predictor variables than observations.
In this paper we propose a method for estimating a causal effect where any number of the instruments are invalid due to measured pleiotropy.The method identifies the covariates, among a set of potential covariates, which are on causal pathways from the genetic variants to the outcome, and which therefore should be accounted for in a multivariable Mendelian randomization analysis.Our approach is to fit a multivariable model which applies an ℓ 1 penalty on the coefficients of the genetic variant-covariate associations without applying penalization on the coefficient of the genetic variant-risk factor association.The coefficients of the genetic variant associations with the covariates which have little or no pleiotropic effects will be shrunk towards zero.We thus obtain a more efficient estimator of the causal effect than we would by controlling for all covariates, and a less biased estimate than we would by ignoring the covariates.There are existing methods which apply regularization to covariates (see, for example, Caner, 2009 andFan andLiao, 2014).The method of Lin et al. (2015) is in fact a two stage procedure which regularizes both the covariates and the instruments.Our situation differs from these, however, in that we do not wish to apply penalization to the coefficient of the risk factor, but only those of the covariates.That is, not all coefficients in the model are penalized, which is a non-standard scenario.The approach is developed for use with summarized data, that is, when the only data available are estimates of the genetic associations with the risk factor, covariates and outcome, and their standard errors.This is typically the way in which GWAS data are made available.It does not require any of the genetic variants to be free of pleiotropy and, furthermore, since it is based on regularization, the method can be used in the case where there are more covariates under consideration than instruments.
In Section 2 we formally define the model under consideration and demonstrate the inconsistency of standard instrumental variables estimators if pleiotropy is ignored.In Section 3 we show how to construct covariate balancing allele scores based on the constrained optimization approach using summarized data, and demonstrate the equivalence of this approach to standard multivariable Mendelian randomization.In Section 4 we describe our method for estimating the causal effect by applying regularization to the coefficients of the genetic variant-covariate associations.In Section 5 we present the results of simulation studies which examine the performance of the method.In Section 6 we demonstrate the approach with an applied example which looks at the effect of plasma urate concentration on coronary heart disease.

The model
For individual i, let Y i be the outcome, X i be the risk factor, be covariates potentially on the causal pathway between each of the genetic variants and the outcome.The model we consider is given by where β X is a p × 1 vector of regression coefficients representing the associations between the p genetic variants and the risk factor and β W j is a p × 1 matrix of regression coefficients representing the associations between the p genetic variants and the jth covariate.The variable U i represents confounders of the associations between the risk factor, covariates and outcome.The parameters θ, γ X , γ W 1 , . . ., γ W k and γ Y are scalars and It is assumed that the noise terms, ε Xi , ε W i1 , . . ., ε W ik , and ε Y i are independent of U i and the genetic variants.It is also assumed that the genetic variants are independent of each other (that is, no linkage disequilibrium) and independent of U i .By assuming that U i is independent of the G i , the second instrumental variables assumption is satisfied and so the only violation of the assumptions is that of pleiotropy.We let βXi , βW ij and βY i be the estimates of the associations between the ith genetic variant and the risk factor, the jth covariate and the outcome, respectively.We denote by βX and βY the p × 1 vectors with ith elements βXi and βY i , respectively, and βW the p × k matrix with (i, j)th element βW ij .While instrumental variable analyses can be performed using individual-level data, often in practice only summarized data in the form of these regression coefficients and their standard errors are available to investigators.To aid applicability of the method, our method is formulated using these summarized data only.
If the three instrumental variables assumptions are met, the causal effect parameter θ can be consistently estimated using the two stage least squares method.In the first stage, the risk factor is regressed on the genetic variants.In the second stage, the outcome is regressed on the fitted values from the first stage.The regression estimate from the second stage is the estimate of the causal effect.When only summarized data is available, the same estimator can be obtained by using the inverse-variance weighted method (Burgess et al., 2013), which fits the regression model where ε j is assumed to be normally distributed with mean zero and variance equal to the variance of βY j , denoted se 2 βY j .That is, the inverse-variance weighted estimator is , where S is the p × p diagonal matrix with (j, j)th element se −2 βY j .Under the model considered here, θIV where → p denotes convergence in probability, with (i, j)th element the covariance of the ith and jth genetic variant.Thus, θIV W is an inconsistent estimator of the causal effect if β W δ = 0, that is, if pleiotropy is present.
are balanced out.Such a weighting scheme, however, will tend to reduce the strength of the association between the risk factor and the weighted genetic variants.A constrained optimization approach was therefore proposed, which aims to maximise the covariance between the weighted genetic variants and the risk factor under the constraint that the covariances between the weighted genetic variants and each of the covariates are zero.
We can adapt the constrained optimization approach to the summarized data case as follows.Letting α be a p × 1 vector of weights, cov (Gα, X) = α ′ Σ G β X and cov (Gα, W ) = α ′ Σ G β W .We thus wish to maximise with respect to α the objective function α ′ Σ G βX , subject to α ′ Σ G βW = 0 and α ′ Σ G α = 1.The second constraint is a normalising condition so that a unique solution is possible.When p > k, this can be solved in closed form by where The causal effect is then estimated by α′ Σ G βY α′ Σ G βX .
In practice, Σ G is unknown.However, since S is approximately proportional to Σ G , we can replace Σ G by S in (4), ( 5) and ( 6).It is shown in the Appendix that this estimator is the same as that obtained by fitting the weighted linear regression model where ε j is normally distributed with mean zero and variance se 2 βY j .This is the multivariable inverse-variance weighted method (Burgess et al., 2015).Thus, we obtain an estimator of the causal effect which controls for measured pleiotropy by using a standard multivariable Mendelian randomization approach.However, as noted above, this estimator will be inefficient if any of the covariates do not sit on pathways between the genetic variants and the outcome which bypass the risk factor.
4 The regularization method 4.1 Estimating the causal effect Suppose we believe that not all k covariates have pleiotropic effects.That is, that some of the δ j 's are zero.We can induce sparsity in δ by including an ℓ 1 penalty term in the least squares equation used for estimating the parameters in (7).That is, the parameter estimators are given by arg min where λ > 0 is a tuning parameter.This is not a standard Lasso problem, since we are not penalizing all the parameters in the model.It is analogous to the some valid, some invalid IV estimator (sisVIVE) of Kang et al. (2016), which also minimizes a sum of squares function with all but one parameter subject to penalization.In the sisVIVE setup, invalid instruments are identified by applying penalization on direct effects between the instruments and the outcome, but the causal effect is not subject to penalization.Our case is different in that we do not seek to identify valid instruments, but rather to identify pleiotropic covariates using summarized data.
Following a similar procedure to that of the proof of Theorem 3 in Kang et al. (2016), it is shown in the Appendix that the estimator of θ obtained by ( 11), for a given value of λ, is equivalent to that given by the following two step procedure. where .
The first step is now a standard Lasso problem.It induces shrinkage on the elements of δ, but not on θ.Some of the elements of δ will be shrunk to zero, and the corresponding covariates are effectively removed from the analysis.The second step can be interpreted as estimating θ by a weighted regression of βY − βW δλ on βX .An alternative estimator is obtained by dropping the covariates that are assigned a zero coefficient by the above procedure and then performing a standard multivariable analysis including the remaining covariates.That is, the two step procedure is effectively used as a model selection technique.This is along the lines of, for example, the post-Lasso estimators of Belloni et al. (2012) and Windmeijer et al. (2019), and the LARS-OLS hybrid estimator of Efron et al. (2004).The main argument for using such post-regularization estimators is that they avoid potential bias that may arise from the shrinkage of some of the regression coefficients.The cost is some loss in efficiency.

The choice of tuning parameter
An important aspect of the method is the choice of tuning parameter, λ, which controls the level of sparsity.A common approach to choosing λ is K-fold cross-validation.The set of genetic variants is split into K folds, and the estimation procedure is performed, over a range of λ values, holding out each fold in turn.The λ chosen is that which minimizes the mean, across each fold, of a particular target function.A natural choice for the cross-validation target function is the mean squared error, that is An alternative is to make the choice of λ in Step 1, independent of βX .That is, the cross-validation target function is The use of ( 9) as target function will give the smallest test mean squared error and would be expected to give the more precise estimation.The use of (10) will tend to select more covariates, since any covariate-outcome effects which are mediated through the risk factor will not be discounted.It will tend to therefore be more conservative in the sense that the standard deviation of the estimates will be larger.

Two sample Mendelian randomization
An advantage of using summarized data is the possibility of using a two sample design for Mendelian randomization.Under this design the genetic variant-risk factor associations and genetic-variantoutcome associations are obtained from separate studies, assumed to be non-overlapping and with similar underlying populations (Hartwig et al., 2017b).This allows for a large range combinations of risk factors and outcomes to be considered, since we do not require each trait to have been included in the same study.It also helps to mitigate against the so-called "winner's curse" (Taylor et al., 2014), which causes effect estimates to tend to be overestimated in single sample designs.
In the multivariable setting, a two sample approach may in fact involve many samples, with up to one extra sample for each covariate.Again, this is a very flexible design in that it allows for any trait that has been included in published GWAS data to be considered as a potential pleiotropic covariate.It is a valid approach as long as each sample is non-overlapping with the genetic variantoutcome sample and is drawn from a similar underlying population.In practice, these conditions may be somewhat restrictive, particularly in a high-dimensional setting where there are many covariates chosen from a number of GWAS datasets.Some studies are included in the datasets of multiple GWAS consortia, and so there may be overlap with the genetic variant-outcome sample.
The extent to which any overlap exists should be checked to ensure it is not substantial.Note that these issues are potential limitations of multivariable Mendelian randomization generally.

Inference
Having estimated the causal effect, it is natural to wish to then perform inference, for example, via producing confidence intervals.The post-regularization method will produce a standard error for the causal effect, however the uncertainty is likely to be underestimated since it does not take in to account the model selection event.The fundamental problem is that the same data is being used to both select the covariates to be analysed and to do the analysis itself.A simple and pragmatic approach to get around the problem is to use data splitting, which is a practice that goes (at least as far) back as Cox (1975) (see also, for example, Fithian et al., 2014, for further discussion).The idea is to randomly split a dataset into two.One set is used for model selection, the other for inference.The obvious drawback is a loss of power, since the sample size is effectively halved.In our setting, since we are using summarized data, data splitting is not an option.However, using the same logic, we can propose a three sample study design.Here, an independent set of genetic associations is used to perform the regularization method to identify the covariates that should be accounted for.A standard two sample multivariable Mendelian randomization analysis is then performed using separate datasets which contain genetic variant associations with the identified covariates, risk factor and outcome.The independent dataset used for covariate selection should be from a sample which is non-overlapping with those in the analysis datasets and from a similar underlying population.
There is a growing literature on methods for performing inference post model selection without requiring independent samples.Berk et al. (2013) (see also Bachoc et al., 2019) propose controlling the family wise error rate across all possible models.In this way, correct coverage of confidence intervals is guaranteed.The same data can be used for both model selection and inference, and furthermore any selection technique can be used, even post-hoc, non-data driven ones.It is, however, very conservative.Furthermore, it is computationally intensive to compute the critical values (the authors note that it begins to be infeasible with m > 20).
Another strain of literature proposes the selective inference approach (Fithian et al., 2014;Lee et al., 2016;Tibshirani et al., 2016;Taylor and Tibshirani, 2018), where inference is performed conditional on a particular model being chosen.Lee et al. (2016) present a method for computing confidence intervals specifically for the case where the model has been chosen using Lasso.They show that the distribution of the parameter estimators conditional on the model selection event is a truncated normal.The confidence intervals can be very wide when the parameter estimate is close to the boundaries of the truncated normal, which will tend to occur when the signal is weak.It could be expected that this may the case in our Mendelian randomization setting when instruments are typically weak and the number of instruments is moderate.Furthermore, the method is derived for fixed λ, and so is not valid if the tuning parameter is computed using the data under analysis, for example using cross-validation.
Another approach is to use a double estimation procedure (Belloni et al., 2014).Under this approach, two model selections are performed using standard Lasso.The first selects covariates in the model that regresses βX on βW .The second selects covariates in the model that regresses βY on βW .The set of covariates used in the final model is the union of the two individual sets.
The procedure was developed for the scenario where the covariates are determinants of both the risk factor and the outcome.Although this is not the case in the model described in Section 2, in practice there may be associations between the covariates and the risk factor, in which case this method would account for those.In any case, it should provide more conservative confidence intervals than the two sample post-regularization approach.

Simulations
Data on 20,000 individuals were generated from the model given in ( 1), ( 2) and ( 3), with Four scenarios were considered, with different combinations of the number of genetic instruments (p) and the number of covariates (k): p = 10, with k either 8 (scenario 1) or 12 (scenario 2); and p = 80, with k either 70 (scenario 3) or 90 (scenario 4).We set The elements of β X were simulated uniformly on the interval (0.15, 0.3) (scenarios 1 and 2) or (0.05, 0.12) (scenarios 3 and 4).The elements of the β W j 's were simulated uniformly on the interval (−0.2, 0.4) (scenarios 1 and 2) or (0.05, 0.12) (scenarios 3 and 4).These values give average R 2 statistics (that is, the proportion of the variance in the risk factor explained by the genetic variants) of 10.0% (scenarios 1 and 2) and 11.7% (scenarios 3 and 4).The number of covariates representing pleiotropic pathways (that is, the number of δ j 's not equal to zero) was either 1, 2 or 4 in scenarios 1 and 2, and either 7, 21, or 35 in scenarios 3 and 4. The non-zero δ j 's were simulated uniformly on the interval (−0.2, 0.3).Note that all instruments in this setting are invalid and the pleiotropy is unbalanced.The causal effect was either θ = 0.2 or θ = 0.
For each scenario and combination of parameters, two independent datasets were generated.In order to produce the summarized data, the genetic variant-risk factor / outcome associations were estimated using simple linear regression on each genetic variant in turn using the first dataset.The estimates of the genetic variant-outcome associations, and their standard errors, were produced in the same way using the second dataset.For each of 1 000 replications the causal effect was estimated using the following methods.
3. The multivariable inverse-variance weighted method including only the covariates given a non-zero coefficient by the two step regularization procedure (Post-reg).
4. The multivariable inverse-variance weighted method with all covariates included (MV-All, scenarios 1 and 3 only).
5. The multivariable inverse-variance weighted method with only truly pleiotropic covariates included (Oracle).
When using the regularization procedure (that is, in methods b and c), the Lasso component of Step 1 was performed using the glmnet package in R (Friedman et al., 2010).The set of λ values used for cross-validation was the set generated by that package with the number of values set at 100.The target function for cross-validation was the mean squared error, given by ( 9), and number of folds was K = 10.In scenarios 2 and 4, when more than p − 2 covariates were identified, the tuning parameter was increased to the smallest value such that only p − 2 covariates were selected.
The inverse-variance weighted method, and the multivariable inverse-variance weighted methods using the relevant set of covariates (that is, as used in methods c, d, and e), were performed using the MendelianRandomization package in R (Yavorska and Burgess, 2017).The mean and standard deviations of the estimates are shown in Table 1. Figure 2 (a)-(b) plots the mean squared error for each scenario and method.
In each case, both the Reg and Post-reg estimators are less biased than IVW and have lower standard deviations.The regularized estimators also have lower standard deviations than the full multivariable estimator, and typically performed at least as well in terms of bias.The mean squared error plots show that the regularized estimators, across all scenarios, sit below the IVW and full multivariable estimators and above the oracle estimator.The simulations described above represent scenarios where each of the genetic variants are associated with each covariate, but where only some of the covariates have an association with the outcome (that is, sparsity in the covariate effects on the outcome).In practice, it will often be that all covariates under consideration are associated with the outcome, but only some of them are associated with the genetic instruments (that is, sparsity in the genetic variant effects on the covariates).In order to demonstrate that our method is an adequate proxy for this situation, the simulations were repeated where all elements of δ were non-zero and covariates were removed in the true model by setting columns of β W to zero.The mean and standard deviations of the estimates from these simulations are shown in Table 2, and the mean squared error for each scenario and method are shown in Figure 2 (c)-(d).The results are in line with the previous ones.
We next show the results of performing inference using methods discussed in Section 4.4.Using the same set of simulations as above (where sparsity is in the covariate effects on the outcome), confidence intervals were computed by performing the multivariable inverse-variance weighted method using sets of covariates which were chosen as follows.
3. The two step regularization procedure where cross-validation was performed independent of the genetic variant-risk factor associations, that is, using (10) as target function (2 sample(b)).
4. The two step regularization procedure using an independent sample and the mean squared error, given by ( 9), in cross-validation (3 sample(a)).
5. The two step regularization procedure using an independent sample and where cross-validation was performed independent of the genetic variant-risk factor associations, that is, using (10) as target function (3 sample(b)).

Only truly pleiotropic covariates included (Oracle).
In each case, the model was fitted using the MendelianRandomization package in R with random effects (that is, allowing over-dispersion, see Thompson andSharp, 1999 andBurgess andThompson, 2017) and 95% confidence intervals derived using the normal distribution.The means of the standard errors, coverage (that is, the proportion of confidence intervals containing the true causal effect) and power (that is, the proportion of confidence intervals not containing zero) are shown in Table 3 (for θ = 0.2) and Table 4 (for θ = 0).Note that in the θ = 0 case, power in fact refers to the Type I error rate.
The results show that using the same data to do both covariate selection and inference results in under-coverage, as expected.The three sample approach gives coverage close to the nominal level of 0.95, particularly in scenarios 3 and 4 with larger numbers of instruments.The use of (10) in cross-validation tends to give coverage closer to 0.95 than the use of (9), although not uniformly.It also gives wider confidence intervals, suggesting that it is more conservative in covariate selection (that is, tends to give lower levels of sparsity).The double estimation method, while producing better coverage than the two sample approach, in most cases did not reach the 0.95 level.As expected, the IVW method always had the lowest coverage and the full multivariable method (for scenarios 1 and 3) always had the lowest power.

Investigating the causal effect of urate concentration on coronary heart disease
We consider the study of White et al. (2016)    at a genome-wide significance level of 5 × 10 −8 .Summarized associations between these genetic variants and urate concentration were produced from a combination of published meta-analyses.
The associations between the genetic variants and coronary heart disease were obtained from the CARDIoGRAMplusC4D study.In addition, eight potential pleiotropic covariates were identified: Fasting glucose, BMI, type 2 diabetes, HDL cholesterol, LDL cholesterol, triglycerides, systolic blood pressure and diastolic blood pressure.These covariates were chosen as risk factors which have been shown observationally to be associated with increased urate concentration, and are also known risk factors for coronary heart disease.By examining the associations between the 31 genetic variants and the covariates, White et al. (2016) concluded that four of them were potential sources of pleiotropy: HDL cholesterol, triglycerides, systolic blood pressure and diastolic blood pressure.
A Mendelian randomization analysis ignoring covariates suggests that urate concentration has a causal effect on coronary heart disease.However, when including the covariates in the model, the results suggest that there is no causal effect.This is supported by Bowden et al. (2017), who analysed the same data using the MR-Egger method.
We re-analysed the causal effect of urate concentration on coronary heart disease using our regularization procedure.Details of each of the data sources for the genetic variant associations are given in the Appendix (noting there are some differences in the data sources used here to those used by White et al., 2016).Figure 3 shows the values of the coefficients for the genetic variantrisk factor and genetic-variant-covariate associations produced by the regularization procedure for increasing values of λ.The value of λ used in the final model was chosen by performing 10-fold cross-validation 100 times and taking the mean minimizer of the mean squared error.This value is indicated in Figure 3 by the vertical dashed line.The procedure identified two covariates that should be included in the analysis: diastolic blood pressure and BMI.This suggests that pleiotropy is being caused by these two covariates only.Interestingly, BMI was not identified by White et al. (2016) as a covariate to be included in the model.Furthermore, HDL cholesterol was the final covariate to be included using the regularization method, whereas it was one of the four chosen by White et al. (2016).We performed multivariable Mendelian randomization analyses using five sets of covariates: no covariates; all covariates; the four covariates identified by White et al. (2016); diastolic blood pressure only; and diastolic blood pressure and BMI.Table 5 shows the estimates of the log causal odds ratio for each model, as well as their standard errors and 95% confidence intervals (computed using a random effects model and the normal distribution).In agreement with the previous studies, the results suggest that urate concentration has a causal effect on coronary heart disease when ignoring covariates.When covariates are included, the results suggest that there is no causal effect.The causal effect estimate when only diastolic blood pressure (0.036) or diastolic blood pressure and BMI (0.034) were included are close to the estimates obtained by including all covariates (0.036) or the set of covariates chosen by White et al. (2016) (0.038).We use a covariate balancing plot which shows the correlation between the genetic variantrisk factor / covariate associations and the residuals obtained after regressing the genetic variantoutcome associations on the genetic variant-covariate associations for each set of covariates considered.If there is no pleiotropy exerted by the covariates, or there is pleiotropy but the model has accounted for it, the correlations with the genetic variant-covariate associations will be close to zero.The plot thus demonstrates two things: the strength of the association between the instruments and the risk factor when controlling for the different sets of covariates (shown by the size of the correlation with urate concentration), and how well each model has balanced the pleiotropic effects (shown by the size of the correlations with the covariates).Figure 4 shows that, when all covariates are ignored, the genetic variant-risk factor correlation is the strongest, but there are also strong correlations with all other covariates except for glucose fasting.When all covariates are included, the genetic variant-risk factor strength is somewhat weaker, but all covariates are close to uncorrelated with the residuals.When the four covariates of White et al. (2016) are included, the covariates are reasonably balanced except for BMI, and a similar pattern is seen when diastolic blood pressure only is included.When diastolic blood pressure and BMI are included, a similar pattern of instrument-exposure correlation and covariate balance is seen as in the case of the full multivariable model.
It should be noted that the analysis here has been performed in a two-sample framework, implying that the confidence interval from the model chosen by the regularization procedure could be too narrow.This would be the case if further covariates should be included to account for pleiotropy.However, the covariate balancing plot suggests that the inclusion of more covariates is not needed.Furthermore, since we have a finding of no causal effect, widening the confidence interval will only strengthen the evidence behind this finding.

Discussion
In this paper, we have presented a method for estimating a causal effect of a risk factor on an outcome in a Mendelian randomization setting in the face of pleiotropy.The method does not  require any of the genetic variants to be valid instruments and can be performed using summarized data only.By controlling for covariates that have pathways to the outcome which bypass the risk factor we remove the bias that arises due to these covariates.By not controlling for further covariates unnecessarily, we gain a more precise estimate than we would from a full multivariable model.Furthermore, we can consider cases where there are more covariates than instruments, which cannot be handled by standard multivariable Mendelian randomization techniques.We also discussed different ways of constructing confidence intervals for the causal effect.Simulations suggest that our proposed three sample approach produces valid confidence intervals and can be used to infer the presence of a causal effect.
The regularization method can be used as a model selection tool which identifies mechanisms by which sets of genetic variants influence an outcome.In the study looking at the effect of urate concentration on coronary heart disease, the method suggested that pleiotropy was being caused by diastolic blood pressure and BMI.This is in contrast to the previous finding suggesting that HDL cholesterol, triglycerides and systolic blood pressure, but not BMI, should be accounted for.
Although the method was derived for the case where there are no causal pathways between the risk factor and covariates, the method will still work when such pathways exist.In practice, there may be many pathways among the covariates if the set of potential covariates is chosen by taking all plausible traits from a GWAS database, many of which may be highly correlated.Two possible scenarios relating to causal pathways between the risk factor and the covariates are illustrated via the directed acyclic graphs in Figure 5.The first is where covariates are predictors of both the risk factor and the outcome.The second is where covariates are mediators of the causal effect of the risk factor on the outcome (also known as vertical pleiotropy).In the latter case, care needs to be taken over the interpretation of the causal effect estimate.The estimate in this case will be of the direct causal effect of the risk factor on the outcome, and will not include the indirect effect via the mediator(s) (Burgess et al., 2017;Sanderson et al., 2018).Hence an alternative interpretation of the results of the applied example is that the covariates (in particular, diastolic blood pressure) are mediators of the effect of urate concentration on coronary heart disease.
We note that we are assuming linearity in the relationships between the genetic variants and the risk factor, covariates and the outcome.Although we have derived the method as though we have a continuous outcome, an advantage of using summarized data is that it allows us to also consider binary outcomes.In this case, the βY j 's represent estimates of log odds ratios obtained by fitting logistic regression models.Note, however, that, due to the non-collapsibility of the odds ratio, causal effect estimates will tend toward the null (Vansteelandt et al., 2011;Bowden et al., 2017).In summary, the method we have developed provides a causal effect estimator in a Mendelian randomization setting where potentially all genetic instruments are invalid due to pleiotropy via a possibly high-dimensional set of covariates.The estimator is robust to measured pleiotropy and The second and third terms of (13) are independent of θ.The first term of ( 13) can be set to zero for any value of δ = δ * by putting Thus, (12) can be solved by minimising the second and third terms of (13) with respect to δ, then setting θ according to (14).This is the two step procedure.

A.3 Data sources for the applied analysis
The associations between the genetic variants and urate concentration were taken from White et al. (2016).Note that, although the singificance level for inclusion of a genetic variant was 5 × 10 −8 , one variant (rs164009) which had a p-value larger than 5 × 10 −8 , and less than 5 × 10 −7 , was also included on the basis of a known biological role in urate metabolism.The associations between the genetic variants and coronary heart disease as well as the covariates were taken from GWAS data as summarized in Table 6, and accessed using PhenoScanner (Staley et al., 2016;Kamat et al., 2019).
Note that the analysis by White et al. (2016) used the 2013 CARDIoGRAMplusC4D dataset, whereas here we use the 2015 dataset.Similarly, White et al. (2016) used the 2012 dataset from DIAGRAM and the 2010 dataset from GIANT, whereas here we use the 2017 and 2015 datasets, respectively.Finally, White et al. (2016) obtained genetic variant associations with the blood pressure traits from the ICBP consortium, whereas here we use the 2017 results from the analysis of UK

Figure 2 :
Figure2: Logarithm of the mean squared errors for each scenario (S1-S4) and number of truly pleiotropic covariates (01-35).Plots (a) and (b), where θ = 0.2 and θ = 0, respectively, show the results from simulations where there is sparsity in the covariate effects on the outcome.Plots (c) and (d), where θ = 0.2 and θ = 0, respectively, show the results from simulations where there is sparsity in the genetic variant effects on the covariates.

Figure 3 :
Figure 3: Estimates of regression coefficients for the genetic variant-risk factor association and the eight genetic variant-covariate associations for different values of λ.The dashed vertical line indicates the value of λ chosen by cross-validation.

Figure 4 :
Figure 4: Correlation between the genetic variant-risk factor association and genetic variantcovariate associations, and the residuals obtained after regressing the genetic variant-outcome association on each set of genetic variant-covariate associations.

Figure 5 :
Figure 5: Directed acyclic graphs showing the cases where W has a causal effect on both X and Y (left graph) and when W is a mediator of the causal effect of X on Y (right graph).Here, W represents one or more, possibly correlated, covariates.

Table 1 :
looking at the effect of plasma urate concentration on coronary heart disease.The study identified 31 genetic variants associated with urate concentration Mean and standard deviation (SD) of estimates from the various methods when there is sparsity in the covariate effects on the outcome.Scenarios 1 and 2 have either 1, 2 or 4 truly pleiotropic covariates and scenarios 3 and 4 have either 7, 21 or 35 truly pleiotropic covariates.

Table 2 :
Mean and standard deviation (SD) of estimates from the various methods when there is sparsity in the genetic variant effects on the covariates.Scenarios 1 and 2 have either 1, 2 or 4 truly pleiotropic covariates and scenarios 3 and 4 have either 7, 21 or 35 truly pleiotropic covariates.

Table 3 :
Mean, standard deviation (SD), mean standard errors (SE), coverage (Cov) and power (Pow) of estimates from the various methods (when there is sparsity in the covariate effects on the outcome) with θ = 0.2.

Table 4 :
Mean, standard deviation (SD), mean standard errors (SE), coverage (Cov) and power (Pow) of estimates from the various methods (when there is sparsity in the covariate effects on the outcome) with θ = 0.

Table 5 :
Estimates, standard errors (SE) and 95% confidence intervals of the log causal odds ratio for coronary heart disease per one standard deviation increase in plasma urate concentration levels.