MR-Clust: clustering of genetic variants in Mendelian randomization with similar causal estimates

Abstract Motivation Mendelian randomization is an epidemiological technique that uses genetic variants as instrumental variables to estimate the causal effect of a risk factor on an outcome. We consider a scenario in which causal estimates based on each variant in turn differ more strongly than expected by chance alone, but the variants can be divided into distinct clusters, such that all variants in the cluster have similar causal estimates. This scenario is likely to occur when there are several distinct causal mechanisms by which a risk factor influences an outcome with different magnitudes of causal effect. We have developed an algorithm MR-Clust that finds such clusters of variants, and so can identify variants that reflect distinct causal mechanisms. Two features of our clustering algorithm are that it accounts for differential uncertainty in the causal estimates, and it includes ‘null’ and ‘junk’ clusters, to provide protection against the detection of spurious clusters. Results Our algorithm correctly detected the number of clusters in a simulation analysis, outperforming methods that either do not account for uncertainty or do not include null and junk clusters. In an applied example considering the effect of blood pressure on coronary artery disease risk, the method detected four clusters of genetic variants. A post hoc hypothesis-generating search suggested that variants in the cluster with a negative effect of blood pressure on coronary artery disease risk were more strongly related to trunk fat percentage and other adiposity measures than variants not in this cluster. Availability and implementation MR-Clust can be downloaded from https://github.com/cnfoley/mrclust. Supplementary information Supplementary data are available at Bioinformatics online.

A Clustered heterogeneity under linearity and homogeneity assumptions We consider the scenario in which there are linear and homogeneous relationships between genetic variants G 1 , . . . , G J , a risk factor X, an outcome Y , and (initially) a single risk factoroutcome confounder U . We assume the following linear structural models: where bold face represents a vector and the epsilon terms represent error in each variable. We assume the φ and ξ are non-zero, and consider ratio estimates for genetic variants with different values of β j , η j and δ j . The expected value of the ratio estimate (the ratio estimand) using the jth variant is: We consider different values of β j , η j and δ j , and the resulting ratio estimands. There are eight possible scenarios for these parameters equalling zero or differing from zero:  Table A1: Exhaustive representation of combinations of possible causal effects of genetic variant j on the risk factor (present if β j = 0), confounder (present if η j = 0), and outcome (present if δ j = 0). Only two possibilities lead to the ratio estimand being a constant function of β j , η j , and δ j .
While it is possible for ratio estimands to exactly coincide for other values of β j , η j and δ j due to chance, this is vanishingly unlikely.
If we generalize further to a scenario with multiple risk factor-outcome confounders, similar considerations show that ratio estimands will coincide exactly when the genetic variants influence the risk factor only, or one confounder only. This suggests that clustered heterogeneity in the linear and homogeneous scenario corresponds to the situation where genetic A1 variants influence the outcome via a single causal mediator: either the nominated risk factor, or a confounder of the risk factor and outcome. In both situations, there is a common causal pathway from variants in the cluster to the outcome. This corresponds to the diagram of Figure 2.

B Specification of the junk distribution
To ensure that the distribution of ratio estimates in the junk cluster is near constant across the range of observations from a given sample, whilst also accounting for uncertainty in the ratio estimatesθ j via their standard errorsσ j , we set the scale parameter ψ in the generalised t-distribution for the junk cluster tô in our applications, whereθ max is the maximum of the ratio estimates,θ min is the minimum of the ratio estimates, andσ max is the maximum of the standard errors of the ratio estimates. This means that the density of the junk cluster will vary between samples, automatically accounting for differences in the range and precision of the ratio estimates. The junk cluster density is a proper density that is approximately uniform across the range of plausible values of the ratio estimates.
C Testing for associations with variants in a given cluster As part of the exploratory post hoc investigation of whether genetic variants in a cluster associate preferentially with a given trait, we perform a statistical test. We compare the number of variants in the cluster that associate with the trait at a given p-value threshold (n 11 ), the number of variants in the cluster that do not associate with the trait at a given p-value threshold (n 12 ), the number of variants not in the cluster that associate with the trait at a given p-value threshold (n 21 ), and the number of variants not in the cluster that do not associate with the trait at a given p-value threshold (n 22 ). We consider the number of ways in which n 11 associations are observed for n 11 +n 12 variants as a proportion of the number of ways that n 11 + n 21 associations could arise for n total variants (where n = n 11 + n 12 + n 21 + n 22 ). This is given by the hypergeometric distribution as: n 11 +n 21 n 11 n 12 +n 22 n 12 n n 11 +n 12 . (A3) The relevant p-value is given by summing up these probabilities for n 11 traits associating with variants in the cluster and n 21 associating with variants not in the cluster, n 11 + 1 traits associating with variants in the cluster and n 21 −1 associating with variants not in the cluster, and so on up to n 11 + n 21 traits associating with variants in the cluster and zero associating with variants not in the cluster.       Iteration Log−likelihood Figure A1: Convergence to the maximum likelihood estimate across iterations for six initializations for a specifically-chosen dataset from scenario 4 that demonstrates failure to converge to the same estimates for all initializations. In our experience, this was not common, but convergence should be assessed carefully in practice.
Clustering results for applied example using MR-Clust, TAGM, and Mclust methods.   Figure A3: Log-likelihood from contamination mixture method as a function of causal estimate for applied example of systolic blood pressure and coronary artery disease risk.

A9
Clustering results for additional applied example of HDL-cholesterol and coronary artery disease risk.