A Bayesian approach to Mendelian randomization with multiple pleiotropic variants

Summary We propose a Bayesian approach to Mendelian randomization (MR), where instruments are allowed to exert pleiotropic (i.e. not mediated by the exposure) effects on the outcome. By having these effects represented in the model by unknown parameters, and by imposing a shrinkage prior distribution that assumes an unspecified subset of the effects to be zero, we obtain a proper posterior distribution for the causal effect of interest. This posterior can be sampled via Markov chain Monte Carlo methods of inference to obtain point and interval estimates. The model priors require a minimal input from the user. We explore the performance of our method by means of a simulation experiment. Our results show that the method is reasonably robust to the presence of directional pleiotropy and moderate correlation between the instruments. One section of the article elaborates the model to deal with two exposures, and illustrates the possibility of using MR to estimate direct and indirect effects in this situation. A main objective of the article is to create a basis for developments in MR that exploit the potential offered by a Bayesian approach to the problem, in relation with the possibility of incorporating external information in the prior, handling multiple sources of uncertainty, and flexibly elaborating the basic model.


INTRODUCTION
Many statistical studies aim to assess the causal effect of a phenotype or exposure (X ) on an outcome (Y ). In many such studies, an experimental design is unfeasible, and the only remaining option is to work on the basis of observational data. Unfortunately in this case, no matter how impeccable is the study design, how accurate are the observations and smart the inference algorithm, there is no guarantee that the result will not be biased due to unobserved confounding or reverse causation. A useful approach to this situation is Mendelian randomization (MR) (Katan, 1986;Davey Smith and Ebrahim, 2003;Lawlor and others, 2008). The bare bones of the idea are that, under certain assumptions, for a phenotype X to be a causal influence on an outcome Y , we expect a genetic variant Z that modulates X to likewise affect Y . Information about Z can then be used as an instrument to assess the causal effect of X on Y , despite confounding.
The potential impact of MR on science cannot be underestimated (Robinson and others, 2017). In various occasions, an MR study based on observational data has predicted the outcome of a clinical trial, thereby supporting or casting doubt on the motivating causal hypothesis (Voight and others, 2012). Furthermore, MR can help biologists reconstruct a disease process from its molecular causes to its phenotypic manifestations, and unravel causal relationships of pharmacological relevance through the analysis of biobank data.
Early implementations of MR used a single or a handful of instrumental variants, under the untestable assumption that these variants are not pleiotropic, i.e. that they affect the outcome only through the changes they induce in the exposure. Recent developments, mostly in the frequentist realm, have focused on methods that use multiple instruments, while allowing for "cryptic" pleiotropy, i.e. allowing an unspecified subset of the instruments to affect the outcome directly. Examples of multi-instrument Mendelian randomization (MIMR) methods that allow for cryptic pleiotropy are the Egger regression (ER) and the weighted median estimator (WME) method of Bowden and others (2016).
The existing frequentist approaches to MR do not coherently account for important sources of uncertainty, such as the uncertainty arising from the estimation of the instrument-exposure (i-e) associations. Hence our concern that these methods may yield over-optimistic results. We attempt to remedy this by proposing a Bayesian approach to MR (see Sheehan, 2007, 2008;Thompson, 2010, 2012;Jones and others, 2012), which deals with cryptic pleiotropy and can in principle acknowledge uncertainty at all levels of the model. Our method allows the user to shape the prior on the basis of external information, for stronger and more accurate inferences, although it will work with vague prior specifications, too. It combines the power of Bayesian analysis with that of Markov chain Monte Carlo (MCMC) inference, for an exceptional freedom in elaborating the basic model. Extensions of our method to deal with non-linear dependencies and model uncertainty are under investigation, but remain outside the scope of this article. We restrict our attention to continuous X and Y variables.
In Section 2, we review past related work, and place our method in that context. In Section 3, we introduce our MIMR framework in a simple setting. The idea here is that the pleiotropic effects are represented in the model by unknown parameters, with an independence sparsity prior that assumes an unspecified subset of these parameters to be zero. Incorporating this prior yields a proper posterior for the causal effect, which we MCMC-sample to obtain point and interval estimates.Also discussed in this section is the use of external information to shape an informative prior. In Section 4, we assess the performance of our method in relation to the number of instruments, the amount and direction of pleiotropy, and the degree of linkage disequilibrium (LD) between the instrumental variants, taking the performance of the WME method as a reference. Thanks to the explicit modeling of the direct instrument-outcome (i-o) effects, our approach bears relationships with mediation analysis. This connection is explored in depth in Section 5, where we consider a problem involving two exposures (instead of one), and use our method to estimate direct and indirect effects. This is further illustrated in Section 6 with the aid of a study in metabolomics. This article is based on the decision-theoretic causality framework proposed by Dawid (2000) and described in Chapter 4 of Berzuini and others (2012).
Conditional independence graph representations of a Mendelian randomization problem. In (a) the graph represents a set of conditional independence assumptions that do not violate Conditions 1-3 of Section 2. In (b), the arrow from Z to Y violates Condition 3. In (c) the graph represents a class of problems with multiple instruments, where Conditions 1 and 2 are not violated.

BACKGROUND
Let U denote a set of imperfectly observed exposure-outcome (e-o) confounders, responsible for the correlation between X and Y being not totally attributable to a causal relationship. In order for a scalar variable Z to qualify as an instrument for estimating the causal effect of X on Y , we generally require it to satisfy the following three conditions, where we use the notation A ⊥ ⊥ B | C for "A is independent of B given C" (Dawid, 1979), and A ⊥ ⊥ B | C for the negation of the same sentence: Condition 1 (marginal relevance) Z is associated with the exposure, formally Z ⊥ ⊥ X .
Condition 2 (confounder independence) Z is independent of the e-o confounders, formally Z ⊥ ⊥ U .
Condition 3 (exclusion restriction) Z is independent of Y , given U and X , formally Z ⊥ ⊥ Y | (U , X ).
The last two conditions are not testable on the basis of the usually available (X , Y , Z) data. Three examples of MR problem are graphically represented in Figure 1, where the X → Y arrow represents the causal effect of inferential interest, the Z → Y arrow a pleiotropic effect, and a Z → X arrow an i-e association, which none of the methods discussed assumes to be causal. We regard the graphs of Figure 1 as expressing sets of conditional independence relationships, which can be read off them with the aid of the d-separation criterion of Geiger and others (1990). Conditions 1-3 are satisfied in Figure 1a. Condition 3 is violated in Figure 1b by the presence of the Z → Y arrow. With reference to Figure 1a, if we assume linear additive dependencies between the variables in the graph, and letβ Y andβ X denote the estimated slopes in the regressions of Y on Z and X on Z, respectively, then the instrumental variable (IV) estimator of the causal effect of X on Y isβ Y /β X . A small sample size and/or weak i-e associations may cause the data to deviate from Condition 2 (Nelson and Startz, 1990), and consequently the IV estimate to be affected by the so-called weak instruments bias.
Existing frequentist methods admit a collection of independent instruments, Z 1 , . . . , Z J , and they require Conditions 1 and 2 to hold for all instruments, formally Z j ⊥ ⊥ X and Z j ⊥ ⊥ U , for j = 1, . . . , J , as in Figure  1c. In these methods, each jth instrument contributes a separate IV estimateβ Yj /β Xj of the causal effect of X on Y . When the IV estimates of several instruments show reasonable concordance, it would appear that a causal conclusion is defensible, pleiotropy notwithstanding. This idea is developed by Egger and others (1997), who suggest that concordance can be tested by regressingβ Yj onβ Xj . Under the assumption that the i-e associations (or instrument strengths) are independent of the direct effects (pleiotropic associations), usually referred to as the INSIDE assumption (Kolesar and others, 2014), evidence of a linear relationship betweenβ Yj andβ Xj will support (and provide an estimate of) the causal effect of interest, whether or not the instruments satisfy Condition 3. For finite numbers of instruments, the frequentist interpretation of the INSIDE is that the correlation between pleiotropic and i-e associations is zero. This is an untestable property, although some indirect empirical evidence (Pickrell and others, 2015) can be summoned in its support. The Egger method requires the instrumental SNPs to be recoded to ensure that the i-e associations have the same sign, although, unfortunately, INSIDE is sensitive to changes such. Moreover, by treating the {β Xj } as fixed quantities, the Egger method ignores the imprecision introduced by their estimation.
Another popular approach to MIMR is the median estimator. If Conditions 1 and 2 are valid, the instruments are independent and at least half of them satisfy Condition 3, then the median of their corresponding IV estimates will be a consistent estimate of the causal effect (Han, 2008). Bowden and others (2016) proposed a widely used weighted version of this estimator-the WME of the causal effect.
In this article, we propose a Bayesian approach to MR that allows an unspecified subset of the instruments to be pleiotropic, provided that Condition 2 and a Bayesian version of the INSIDE assumption (see Condition 4 in the next section) are satisfied. The proposed approach has the following distinguishing features. It allows for (moderate) instrument-instrument correlation, and does not require the signs of the instrument effects to be manipulated. It treats the i-e associations as random quantities, which we can learn about via prior-to-posterior updating. Once the posterior distributions (e.g. for the i-e associations and for the causal effect, etc.) have been calculated, they can be used as priors in future studies, in what can be regarded as a sequential learning process. Finally, while the aforementioned frequentist methods emphasize the construction of estimators for specific situations, our combined use of Bayesian inference and MCMC computation allows the researcher to focus on model choice, to better explore the possibility of tackling elaborated versions of the basic model.
We conclude this section with a note on the decision-theoretic formulation of causality proposed by Dawid (2000), and on the corresponding definition of causal effect, which we adopt in the present work. In accord with this formulation, we define the causal effect of X on Y as the difference between the expected values of Y under a (hypothetical) intervention that imposes on X a reference value x 0 and another intervention that imposes a generic value x. To express this, let the symbol F X label the regime under which the value of X is generated, with F X = a indicating that X is fixed to value a by an intervention of the relevant type, and F X = ∅ denoting the observational regime under which the data have actually been obtained. Then the average causal effect (ACE) of X on the continuous outcome Y is defined by , that the conditional distribution of Y given X = x in the generic individual characterized by a specific value of U , depends on x, but not further on whether the value x has arisen by passive observation or through the intervention of interest. The implications of this condition in a MR context, and, more in general, in the context of IV analysis, are examined in Chapter 4 of Berzuini and others (2012) 3. METHODS We shall now introduce our approach to MR with reference to a one-sample setting, where each individual is characterized by a complete set of observed values for X , Y and Z 1 , . . . , Z J . We assume linear additive dependencies and write . The model is not completely identifiable, in the sense that the information contained in the observed covariances does not lead to a unique solution for or any subset of containing the parameter of inferential interest, θ . In fact, parameters (ω X , α, τ X ) are identified by the J + 2 conditions provided by equalities E(X | Z 1 , . . . , Z J ) = ω X + J j=1 α j Z j and var(X | Z 1 , . . . , Z J ) = τ 2 X . Unfortunately, the remaining J + 4 parameters, including the causal effect of interest, θ , remain unidentified. This is because ) provides additional J + 2 conditions, and a further condition is obtained from the equation , for a total of additional J + 3 conditions, which are not sufficient to identify J + 4 parameters. From a Bayesian point of view, non-identifiability can be negotiated by using a scientifically plausible prior that induces a proper posterior on θ . Formally, if D denotes data, the posterior can always be written in the product form: Because the last term above is the conditional posterior of an unidentifiable parameter, it reduces to the conditional prior: from which it follows that we may make the full posterior distribution proper by allowing the last term of the above product to take the form of a proper distribution. To proceed, we introduce the following Bayesian interpretation and generalization of INSIDE: Condition 4 (instrument effects orthogonality (IEO)) Each component of β is a priori independent of the parameters of the exposure model, P(β | ω X , α, τ X ) = P(β), and we specify a proper and scientifically plausible prior P(β). One option is to impose β = 0, as in a standard IV analysis, which however will often be unrealistic. A second option is to impose that the effect exerted by each instrument on the outcome through the mediation of X is greater in magnitude than the corresponding pleiotropic (unmediated) effect. We use none of these. In the following section, we construct P(β) from our belief that some of the components of β are zero.

The prior
We shall now discuss the prior specifications for model (3.1-3.3). In many applications, it will be reasonable to assume that some components of vector β are zero, i.e. that an unspecified subset of the set of instruments have no pleiotropic effect. This justifies imposing on β a shrinkage prior, e.g. by taking each β j to be a priori independently drawn from a Laplace (double exponential) distribution with mean 0 and unknown variance 2τ 2 , with τ distributed a priori as Cauchy + (0, 1), where Cauchy + (0, a) denotes the half-Cauchy density on the positive reals, with scale parameter a. An alternative choice is to impose on each jth component of β the horseshoe shrinkage prior proposed by Carvalho and others (2010), which has the hierarchical structure where the degree of shrinking of each jth component of β is controlled by an unknown parameter φ j . A high value of φ j corresponds to a near-zero value of the shrinkage weight, κ j = 1/(1 + φ 2 j ), in which case this prior leaves the magnitude of β j almost unaffected. In contrast, a near-zero value of φ j corresponds to a nearunit shrinkage weight, which will result in the estimate of β j being heavily shrunk towards zero. Under the horseshoe prior, each β j is mixed over its own φ j , with φ j drawn from a Cauchy + (0, γ ) distribution governed by an unknown parameter γ . Both the φ j parameters, which are in charge of controlling the local degrees of shrinking, and parameter γ , which controls the global degree of shrinking, are inferred from the data, with minimal input from the user. With γ = 1, the horseshoe specifications induce on κ j a horseshoe-shaped Beta(.5, .5) distribution with one peak at κ j = 0 and another at κ j = 1. The two peaks may be interpreted in terms of the horseshoe prior inducing sparsity in a selective fashion. The lower peak of the distribution of κ j accounts for the small components of β, which our model recognizes as noise and heavily shrinks towards zero. The upper peak of the distribution accounts for the large components of β, which our model recognizes as pleiotropic, and leaves almost unaffected, thereby reducing the influence of the pleiotropic instruments on the estimate of θ.
In our experience, assigning the remaining parameters uniform priors does not cause numerical problems, thanks to the ability of the Stan toolbox (STAN Development Team, 2014) to determine a sensible bounding of the search space via variational algorithms. However, we shall often wish to make our priors informative, for stronger inferences. In future studies, we speculate that it will be possible to shape informative priors on the basis of data collected in previous studies (provided these satisfy the necessary conditions). For example, (X , Z) data from past studies can be used to construct a prior for (α, ω x , τ x ), in such a way to reduce the weak instruments bias.
Consider also that mathematical relationships between parameters may be used to derive sensible local priors. For example, parameters δ X and σ X are not identifiable, but the model links them to τ X through the identity τ 2 X = δ 2 X + σ 2 X , which justifies the inequalities δ X ≤ τ X and σ X ≤ τ X . Because we are able to learn about τ X from external data, we can use this information, in conjunction with the above inequalities, to derive joint prior bounds for δ X and σ X (not illustrated in this article). Alternatively, we may establish an upper bound for τ 2 X , denoted by u τ 2 X , and impose the prior bound δ 2 X + σ 2 X ≤ u τ 2 X . In some situations, a posterior distribution for the causal effect might become available from previous studies, and be used, under assumptions, as our prior for θ. Prior information about β might become available with the development of web repositories containing lists of instruments for specific exposures. Finally, in certain situations it might be reasonable to assume a priori that each direct effect β j is smaller in magnitude than the corresponding indirect effect, α j θ .

C. BERZUINI AND OTHERS
In our analyses of real and simulated data, we assigned σ X and σ Y uniform prior distributions with positive support. We assigned θ, δ X , δ Y , ω X and ω Y independent uniform priors, and we took each α j , for j = 1, . . . , J , to be independently drawn from a normal N (μ α , σ 2 α ) prior, with hyperparameters μ α and σ α subject to uniform priors.

SIMULATION EXPERIMENT
We performed a simulation experiment to evaluate our model's performance in relation to the number of instruments and individuals, the direction and amount of pleiotropy, and the degree of correlation between the instruments. Although performance comparisons are not a primary objective of this article, we shall compare our method's performance with the WME in terms of bias, coverage and power. Our simulations were based on sequences of SNPs of real individuals, with each SNP expressed on an interval scale as an allele dose (0,1,2). We considered the 21 simulation scenarios described in Table 1. In each of these, we simulated 800 datasets with the causal effect θ set to zero, and further 800 datasets with θ set to 0.35, which allowed us to assess each method's performance under the null and under the alternative hypothesis. The SNP sequences changed from one individual to the next, but they were kept fixed across scenarios and simulations, except for scenarios 14 to 21, where they changed from one scenario to the next to represent different degrees of LD between the SNPs.
Each of Scenarios 1 to 13 uses independent SNPs, and is characterized by (i) the sample size reported in column 2 of Table 1, (ii) the value of μ β , the mean pleiotropic effect, reported in column 3, and (iii) the value of σ α reported in column 4, which controls the variability of the strength, α, from one instrument to the next. Note that by varying μ β , we explore different types of pleiotropy: balanced (μ β = 0), negative (μ β < 0) and positive (μ β > 0). In particular, by allowing μ β to take values ±0.012, we have included situations where the pleiotropic component of the effect of the instrument on the outcome is on average stronger than the component mediated by the exposure (indirect component). At each new simulation, new values for the model parameters were generated. In particular, in Scenarios 1 to 13, each component of α was independently drawn from N (−0.07, σ 2 α ). A randomly selected subset (40%) of the components of β were independently drawn from N (μ β , 0.05 2 ), the remaining components being set to 0. The proportion of instruments with a significant (P < 0.05) marginal association with the exposure varied between 70% and 100% across the simulations. Also, parameters ω X and ω Y were drawn from N (3.3, 0.2 2 ) and N (0.9, 0.2 2 ), respectively, and δ X and δ Y were independently drawn from N (−0.1, 0.02 2 ), so as to have a positive average correlation between X -errors and Y -errors. Parameters σ X and σ Y were sampled from sharp inverse-gamma distributions with means 0.1 and 0.3, respectively. Conditional on the generated parameter values, at each new simulation we generated values for variables U , X , Y , for each individual, on the basis of Equations (3.1-3.3) and in conformity with the IEO condition.
Scenarios 14 to 21 involve instrumental SNPs with increasing degrees of mutual correlation (average R 2 reported in column 6). These scenarios were generated in the same way as the previous ones, except for vector α. After the elements of this vector were simulated, the majority of them were set to zero, so as to mimic the situation where only a small number of instruments have a non-null causal or conditional effect on the exposure. Also, the components of β were independently drawn from N (φ, 0.05 2 ), with φ uniformly distributed between ±0.012, so as to embrace situations where the pleiotropic effect is on average stronger than the i-o indirect effect.
Each simulated dataset was analyzed via WME to obtain a point and a bootstrapped 95% confidence interval for θ, and then via our model (with a horseshoe prior for β) to obtain a posterior mean and a 95% credible interval for θ. On the basis of these results, we assessed performance in terms of bias, coverage and power. The analysis with our model was performed by using the Hamiltonian MCMC methods (Metropolis and others, 1953;Neal, 2011) provided by the program Stan (STAN Development Team, 2014). Stan employs a combination of variational (Wainwright and Jordan, 2008) and MCMC Table 1. Comparative assessment of the proposed method (with a horseshoe prior for β) and of the WME, in relation to the mean pleiotropy, the number of instruments, the degree of linkage disequilibrium (R 2 ) between instruments and the dispersion of the α instrument-exposure associations (column 4)  We shall now briefly discuss the results of the simulations. Scenarios 1 to 8 were based on independent instruments. Table 1 tells us that, in these scenarios, (i) in both methods an increase in the number of instruments corresponds to an increase in power, (ii) in both methods an increase in the number of instruments corresponds to a drop in coverage under the null, the drop being modulated by the amount of directional pleiotropy, and (iii) in both methods, positive pleiotropy reduces power. In our case, a positive pleiotropy corresponds to the direct and indirect effects of the instruments' effects on the outcome having on average the opposite sign.
A comparison between the results of Scenarios 1-5 and Scenarios 9-13, all of which involve independent instruments, suggests that in both methods a higher value of σ α , which means a higher number of strong instruments, improves power and coverage under the null. In our method, this was sufficient to bring coverage under the null into the nominal range. This did not happen with WME, although in Scenarios 9-13 WME slightly outperforms our method in terms of coverage under the alternative.
In Scenarios 14 to 17, and in both methods, the progressively increasing degree of LD between SNPs causes a marked drop in coverage and a slight drop in power. In the presence of LD, the gap in performance between the two methods is dramatic. This is unsurprising, because WME was developed with independent instruments in mind. This pattern is confirmed in Scenarios 18 to 21, where, in addition, we observe the effect of reducing the number of individuals from 500 to 300. The reduction makes power more vulnerable to presence of LD between the instruments.
Our method appears to outperform WME in terms of coverage under the null (in all scenarios), and in terms of power (in all scenarios with independent instruments).

INCORPORATING MEDIATION
This section extends our approach to deal with two (instead of one) exposures or intermediate phenotypes, X 1 and X 2 . Within this more general setting, we shall use our method to estimate direct and indirect effects, and to combine, albeit under strong parametric assumptions, the capabilities of MR and mediation analysis. Figure 2a portrays a problem where X 1 is a putative cause of X 2 . Suppose, we accept the assumptions represented in the figure. Suppose, we are interested in the direct causal effect of X 1 on Y (controlling for X 2 ), and in the indirect effect of X 1 on Y (via X 2 ). Suppose, we are also interested in the causal effects of X 1 on X 2 and of X 2 on Y . Let the set of instruments, Z 1 , . . . , Z J , consist of two non-overlapping subsets, I 1 and I 2 , with I 1 ≡ (Z 1 , . . . , Z J 1 ) and I 2 ≡ (Z J 1 +1 , . . . , Z J 1 +J 2 ), with J = J 1 + J 2 . Assume for simplicity that I 1 ⊥ ⊥ I 2 . Let Z −j ≡ (Z 1 , . . . , Z j−1 , Z j+1 , . . . , Z J ). We elaborate (3.1-3.3) into: with The causal effect of X 1 on X 2 is represented by parameter θ 2 , whereas the direct causal effect of X 1 on Y (controlling for X 2 ) is represented by parameter θ 1 , and the indirect causal effect of X 1 on Y is represented by θ 2 θ 3 . The model equations satisfy Condition 2. When all components of α 1 and α 2 differ from zero, they satisfy also: Condition 5 (sequential relevance) Each component of I 1 is associated with X 1 , conditional on the remaining instruments, and each component of I 2 is associated with X 2 , conditional on X 1 and the remaining instruments. This condition is formally expressed by Z j ⊥ ⊥ X 1 | Z −j for j = 1, . . . , J 1 , and Z j ⊥ ⊥ X 2 | (Z −j , X 1 ), for j = J 1 + 1, . . . , J .
In the following, we show that, under the above conditions, and in the special case where β 1 = β 2 = β 3 = 0, all the parameters of model (5.1) and, in particular, the causal effects (θ 1 , θ 2 , θ 3 ), are identified. First, we need to introduce the concept of "unblocked" path of a causal diagram. A path (= sequence of adjacent edges) in a causal diagram is said to be unblocked if it involves one or more colliders (Geiger and others, 1990), i.e., if at least one pair of arrows point to a common node, → • ← (not the most general definition, but sufficient for our purposes).
We are now ready to show that (θ 1 , θ 2 , θ 3 ) are identifiable, provided (i) α 1 = 0, (ii) α 2 = 0 and (iii) β 1 = β 2 = β 3 = 0. To see this, consider Figure 2a in the simple case where I 1 and I 2 are scalar. Assume all variables represented in the graph have zero mean. Then the correlation ρ AB between two nodes of the graph, A and B say, is given by a sum of terms over all the unblocked paths that connect A and B, with each term of the sum consisting of the product of the effects along the path (Wright, 1934). By using Figure 2a and Condition (iii), we obtain ρ X 1 I 1 = α 1 and ρ X 1 I 2 = α 3 . The two equalities uniquely identify parameters α 1 and α 3 , conditional on which we may then consider the system formed by equations ρ I 1 X 2 = α 1 θ 2 and ρ I 2 X 2 = α 2 + α 3 θ 2 , which can be solved for (α 2 , θ 2 ) by virtue of Condition (i), as its determinant does not vanish. This means that causal parameter θ 2 is identified. Next, note that, under Condition (iii), nodes Y and I 1 are connected via two unblocked paths, and Y and I 2 via further three unblocked paths, which leads to the two equations ρ I 1 Y = α 1 θ 1 + α 1 θ 2 θ 3 and ρ I 2 Y = α 3 θ 1 + (α 3 θ 2 + α 2 ) θ 3 . These can be solved for θ 1 and θ 3 , conditional on the identifiable parameters (α 1 , α 2 , α 3 , θ 2 ), because Conditions (i) and (ii) prevent the determinant of the system, α 1 α 2 , from vanishing, which completes the proof.
We deal with the more general situation where the β-vectors depart from zero in the same way as in Section 3.1, i.e. by imposing on each of these vectors a sparsity prior that makes the posterior distribution of the causal effects proper. Simple averages of the MCMC samples generated from this posterior will give simulation-consistent point and interval estimates for any function of interest of parameters (θ 1 , θ 2 , θ 3 ), such as, for example, the indirect effect θ 2 θ 3 exerted by X 1 on Y . This is illustrated in the next section.

ILLUSTRATIVE APPLICATION
Past decades have witnessed an unprecedented worldwide rise in obesity. Excess body fat, as measured by body mass index (BMI = weight in kilograms divided by the square of the height in meters) is a major risk factor for cardiovascular disease (CVD), among other disorders. The increased incidence of CVD associated with adiposity is believed to be mediated both by abnormalities in carbohydrate metabolism and by an increase in blood pressure. As far as the latter is concerned, various authors have found evidence of BMI being a causal factor for hypertension, and in this section, we shall corroborate this hypothesis by applying our MR approach to data from the general population, by using a recently proposed measure of blood pressure burden defined as the sum of the diastolic and systolic arterial pressures, hereafter, denoted as PRES (Nair, 2016). Part of our analysis is motivated by recent metabolite profiling studies, that have highlighted deviations in molecular signatures of BMI. Many of these studies compared small groups of individuals with large differences in adiposity, and it remains unclear whether those deviations are also observed in the general population. One putative molecular signature of obesity is the α-aminoacid phenylalanine (PHE) (Jones, 1996;Droyvold and others, 2005;Shah and others, 2012;Moore and others, 2014;Wuertz and others, 2015;Hao and others, 2016). Recent research also highlights PHE as a putative mediator of the causal effect of body fat on blood pressure.
In the following analysis, we shall put these hypotheses under scrutiny by using our MR approach. We shall first use MR to assess the putative causal effect of BMI on PHE. In a subsequent stage of the analysis, we shall assess the causal effect of BMI on PRES, in terms of a direct causal effect, and of an indirect causal effect mediated by PHE.
We analyzed a dataset of 520 unrelated individuals (aged 25-74) from a population-based Finnish cohort-the DILGOM (Dietary, Lifestyle and Genetic determinants of Obesity and Metabolic Syndrome) study (Inouye and others, 2010). Each individual in this study had serum metabonome information, a genome-wide genetic profile and measures of BMI, blood pressure and sex. The eighty instruments used in the analysis, Z 1 , . . . , Z 80 , were SNPs with a significant (P ≤ 10 −5 ) BMI marginal association, and in negligible LD (R 2 < 0.05). These SNPs we treated as counts (0, 1, 2) of minor alleles at the corresponding locus. Let X (the exposure) represent the logarithm of BMI. Let Y represent the log-concentration of PHE, and W take value 0 for female and 1 for male. WME gave an estimated causal effect of log BMI on log PHE of 0.25, with a 95% confidence interval of 0.18-0.31. Our analysis based on model (3.1-3.3) gave a posterior mean of 0.3, with a 95% credible interval of 0.19-0.42, representing a higher degree of uncertainty about the causal effect with respect to the WME estimate.
A number of studies (Kaplan and others, 2014, see) stress the differential prognostic significance of BMI across genders. This motivated our interest in incorporating an interaction between the effects of sex and BMI on the outcome. Recall that sex is denoted as W , with W = 0 indicating female, and W = 1 male. For purposes of illustration, we made the following simplifying assumptions. First, we assumed that sex is independent of the confounders U . Second, we assumed the effect of sex on either BMI or PHE Fig. 3. With reference to our analysis of Section 6, each jth instrument is represented in each of these plots by a black dot with co-ordinates (β Xj ,β Yj ) (see Section 2 for a definition of these symbols), as obtained from an analysis of the male (left plot) and female (right plot) individuals in the sample. The slope of the regression line is the Egger regression estimate of the causal effect. not to interact with the effect of the instrumental variants on the same variable. The latter assumption is delicate, which invites caution in the interpretation of the results. To include the interaction, we extended model (3.1-3.3) as follows: with U ∼ N (0, 1). The causal effects of log BMI on log PHE are represented in the model equations by θ (in the females) and θ + ψ YXW (in the males), with ψ YXW representing the interaction between sex and BMI. We used a horseshoe prior for β, and uniform priors for the remaining parameters. We ran 10 000 iterations of a Markov chain, and used the values generated during the second half of the chain to compute the estimates. Parameter ψ YXW had a posterior mean of −0.14 and a 95% credible interval of −0.28 to −0.0062, representing fair evidence of an interaction between BMI and sex in their causal effects on PHE. The causal effect of log BMI on log PHE had a posterior mean of 0.34 with a 95% credible interval of 0.21-0.47 in the females, and a posterior mean of 0.2 with a 95% credible interval of 0.098-0.3 in the males.
In the scatter diagrams of Figure 3, each instrumental SNP is represented by a black dot with xcoordinate (respectively, y-coordinate) given by the coefficient of the least-squares regression of log BMI (respectively, log PHE) on that SNP, as obtained from an analysis of the male (left plot) and female (right plot) subsamples. The linearity of the relationship in both plots provides visual evidence of a causal effect of BMI on PHE, whereas the difference between the two slopes provides evidence of that causal effect interacting with sex.
The second stage of our analysis embraced variables BMI, PHE, and PRES. Our assumptions in this analysis are depicted in Figure 2b, where the effect of BMI on PRES has two putative components: a direct one and an indirect component mediated by PHE. We analyzed the data by using model (5.1), with X 1 , X 2 and Y representing log BMI, log PHE, and PRES, respectively. We used a set of 98 instruments, Z 1 , . . . , Z 98 , partitioned into a subset I 1 ≡ (Z 1 , . . . , Z 80 ) consisting of 80 BMI-associated instrumental SNPs (the same as in the preceding part of the analysis), and a subset I 2 ≡ (Z 81 , . . . , Z 98 ), consisting of 18 instruments associated with PHE but not BMI. We assumed almost all the parameters to be a priori  Figure 2b. Parameter θ 1 represents the controlled direct effect of BMI on PRES, controlling for PHE. Parameter θ 2 represents the direct effect of BMI on PHE, and θ 3 represents the causal effect of PHE on PRES. Also included are the posterior distributions for two nonlinear functions of the above parameters, namely θ 2 θ 3 , which represents the indirect component of the effect of BMI on PRES, mediated by PHE, and θ TOT ≡ θ 1 + θ 2 θ 3 , which represents the total effect of BMI on PRES. From a substantive point of view, these results can be interpreted to provide evidence of a causal effect of the body mass index on blood pressure and phenylalanine concentration, but no evidence that this latter influences blood pressure. uniformly distributed. We sampled the model posterior distribution by running six Markov chains, of 100 000 iterations each, with initial values spanning the approximate 95% confidence intervals for θ 2 and for the quantity θ 1 + θ 2 θ 3 , as obtained by a traditional MR analysis. We checked convergence of the six chains to the same posterior. The second half of each chain was used to approximate the posterior means and credible intervals for the parameters of interest. Figure 4 shows the marginal posterior distributions for the main quantities of interest. One of the plots shows the posterior distribution for the total causal effect of log BMI on PRES, θ TOT ≡ θ 1 + θ 2 θ 3 . Figure 4 suggests that BMI might exert a causal effect on both PHE and PRES, although there appears to be little evidence of an effect of PHE on PRES. These results discredit the hypothesis of PHE acting as a mediator of the deleterious effect of body mass on blood pressure. The total effect of log BMI on PRES, represented by parameter θ TOT , was re-estimated in the traditional way, by using the instruments contained in I 1 . This yielded an estimated total effect of 32.3, and a 95% confidence interval of 19.1-46.6, which corresponds to a lower uncertainty relative to the estimate obtained by our method.
In consideration of the relatively small size of the sample, and of the cross-sectional nature of the study, the results of our analysis deserve future independent validation.

DISCUSSION
Thanks to its holistic approach to uncertainty, a Bayesian approach to MR may represent a safeguard from over-optimistic conclusions. The results of our simulation study are consistent with this expectation, while also suggesting that our method behaves well in the presence of moderate LD between the variants-a welcome feature when the choice of the instruments is confined to a narrow region of DNA.
Much work remains to be done. It might be interesting to assess the extent to which our approach can mimic existing frequentist methods, such as the one proposed by Kang and others (2016) and further elaborated by Windmeijer and others (2016), where LASSO-type procedures are used to identify the valid instruments from within a set of candidate variables.
A variety of future developments of the approach are envisaged. One of these is to incorporate advances in Bayesian sparsity modeling, for example, in relation to the design of shrinkage priors that deal with high-dimensional vectors of possibly correlated instruments. Of equal importance is to extend the method to deal with nonlinearities and selection effects, and perhaps to incorporate principles of Bayesian model averaging. Such efforts will encounter theoretical difficulties, such as problems of collapsibility of the causal effect parameters. Finally, we may use our framework in a simulation mode, for generating extended datasets from limited data, for purposes of power calculation.

SOFTWARE
R software to implement analyses by means of the proposed method is available from Github (https://github.com/carloberzuini/BMR).