Genetic fine-mapping from summary data using a nonlocal prior improves the detection of multiple causal variants

Abstract Motivation Genome-wide association studies (GWAS) have been successful in identifying genomic loci associated with complex traits. Genetic fine-mapping aims to detect independent causal variants from the GWAS-identified loci, adjusting for linkage disequilibrium patterns. Results We present "FiniMOM" (fine-mapping using a product inverse-moment prior), a novel Bayesian fine-mapping method for summarized genetic associations. For causal effects, the method uses a nonlocal inverse-moment prior, which is a natural prior distribution for modelling non-null effects in finite samples. A beta-binomial prior is set for the number of causal variants, with a parameterization that can be used to control for potential misspecification of the linkage disequilibrium reference. Results of simulation studies designed to mimic a typical GWAS on circulating protein levels show improved credible set coverage and power of the proposed method over the current state-of-the-art fine-mapping method SuSiE, especially in the case of multiple causal variants within a locus. Availability and implementation https://vkarhune.github.io/finimom/.


Theoretical considerations of non-local priors
In the Bayesian fine-mapping literature, the most common prior for the causal effects of individual variants is the Gaussian distribution (Hutchinson, Asimit, and Wallace 2020). The literature on Bayesian polygenic modelling (including prediction modelling) of quantitative traits has presented numerous other suggestions for the effect size distribution, such as the t-distribution, the double exponential, and their various mixtures (de los Campos et al. 2013; Zhou, Carbonetto, and Stephens 2013). Other propositions include the gamma distribution with different shape parameter values (Otto and Jones 2000; Xu 2003).
Our proposition is to use the product inverse-moment (piMOM) distribution as the prior for the causal effects.
The prior originates from the inverse-moment priors presented in Johnson and Rossell (2010), where the authors highlight the benefits of the non-local formulation particularly for variable selection. The core idea is to distinguish the estimated effects that strongly favor the 'alternative hypothesis', or conversely, the effects that distinctly differ from the parameter's null value. In scenarios with multiple causal variants per locus, piMOM achieves strong model selection due to the product formulation of the prior (Johnson and Rossell 2012): if any one of the effects is at zero, the full prior density shrinks to zero. In comparison with the earlier literature, the piMOM prior bears the most resemblance to the gamma prior, as the functional form of the iMOM prior is related to the inverse-gamma distribution (Johnson and Rossell 2010).
It is worth highlighting that we are mainly interested in variable selection to identify the correct causal configuration. Therefore, the effect sizes themselves are not of interest, as they are integrated out in calculating the marginal likelihood for the data, given a specific model (main text equation (5)).

Selecting a suitable value for τ based on sample size
Let $\hat{z}_j = \hat{\beta}_j/\mathrm{SE}(\hat{\beta}_j)$, $j = 1, \ldots, P$. Assuming a large sample size $N$, then under the null, approximately $\hat{z}_j \sim N(0, 1)$, and if the genotype data have been standardized to mean zero and unit variance, then $\mathrm{SE}(\hat{\beta}_j) \approx N^{-1/2}$, so that under the null, approximately

$$\hat{\beta}_j \sim N(0, N^{-1}).$$

This directly shows that the marginal distribution of each $\hat{\beta}_j$ depends on the sample size.
The inverse-moment (iMOM) prior for the marginal effect sizes $\beta_j$ is given by

$$\pi(\beta_j \mid \tau, r) = \frac{\tau^{r/2}}{\Gamma(r/2)}\, |\beta_j|^{-(r+1)} \exp\left(-\frac{\tau}{\beta_j^2}\right),$$

where $r$ regulates the tail behavior and $\tau$ controls the spread of the prior. Throughout, we set $r = 1$. The value of $\tau$ can be set such that the absolute values of the marginal effect sizes $\beta_j$ are larger than a specific threshold value $\beta_q$ with a given probability $1 - q$, that is,

$$P(|\beta_j| > \beta_q) = 1 - q.$$

This can be solved using the known relationship between the iMOM prior and the (inverse-)gamma distribution (Johnson and Rossell 2010): under the iMOM prior, $\tau/\beta_j^2$ follows a Gamma$(r/2, 1)$ distribution. In the context of genetic analysis, Sanyal et al. (2019) suggest $\beta_q = 0.01$ and $q = 0.01$, yielding $\tau = 0.0083$. We propose to let $\tau$ depend on the sample size by setting $\beta_q = N^{-1/2} z_q$, leading to

$$\tau = N^{-1} z_q^2\, g_{1-q}, \quad (S1)$$

where $z_q$ is a specific quantile from a Gaussian distribution and $g_{1-q}$ is the $(1-q)$ quantile of the Gamma$(1/2, 1)$ distribution. This can be interpreted such that we assign a probability of $1 - q$ for a causal variant to have the observed $|\hat{z}|$ larger than $z_q$. Importantly, we have made explicit in equation (S1) that the detectable causal effect sizes $\beta_j$ depend on the sample size, such that a larger sample size allows smaller effect sizes to be detected. In fact, similar ideas were implicitly suggested in Otto and Jones (2000).
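As a numerical check of this choice, the gamma-quantile relationship above can be evaluated directly. The sketch below is a plain Python/SciPy illustration, not the finimom implementation, and the function name is ours; with the settings later used for the synthetic HAPNEST data (N = 50,000, z_q = 3.29, q = 0.05), it reproduces τ ≈ 0.000416.

```python
from scipy.stats import gamma

def tau_from_sample_size(N, z_q, q, r=1):
    """Solve P(|beta_j| > N**-0.5 * z_q) = 1 - q for tau under the iMOM
    prior: tau / beta_j**2 follows a Gamma(r/2, 1) distribution, so tau
    equals (z_q**2 / N) times the (1 - q) quantile of Gamma(r/2, 1).
    (Function name is ours; this mirrors equation (S1) for r = 1.)
    """
    return (z_q ** 2 / N) * gamma.ppf(1 - q, a=r / 2)

# Settings used for the synthetic HAPNEST simulations:
print(round(tau_from_sample_size(50_000, 3.29, 0.05), 6))  # -> 0.000416
```

Note that τ shrinks proportionally to 1/N, making smaller effects detectable as the sample size grows.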
Equivalently, we can use equation (S1) to assign a probability of $1 - q$ for a causal variant to have its observed absolute z-score exceed $z_q$.

Of note, the prior distribution of the detectable causal effects depends on the sample size, which is not ideal for accurate effect estimation. However, as the aim of our method is model selection, there is a trade-off to be made between the detection (presence or absence) and the interpretability (magnitude) of the effects. The original argument for using non-local prior distributions is their ability to distinguish the effects that strongly favour the 'alternative hypothesis' (Johnson and Rossell 2010). In this spirit, we believe the distribution for the detectable causal effects should also take into account the distribution under the null, that is, the distribution for the non-causal effects, which depends on the sample size. This approach is more explicit in various empirical Bayes approaches for multiple testing correction (Efron 2008; Muralidharan 2010). Our method is somewhat related to the empirical Bayes methods, with the distinction that the null distribution is derived analytically rather than calculated from the data. An explicit empirical Bayes approach would indeed be a possible alternative for selecting $\tau$ in our model.

Supplementary Table: Hyperparameters of the beta-binomial prior for the model dimension.

a — Parameter in the beta-binomial distribution for the model dimension. Set at a = 1 to provide a right-skewed prior for the model dimension with large variance, to prioritize smaller models.
b — Parameter in the beta-binomial distribution for the model dimension. Set at P^u, u > 1 (see next row).
u — Parameter to control the model dimension. Controls the beta-binomial prior, with larger values of u providing stronger priors towards smaller dimensions.

Approximate Laplace's method
Consider the integral

$$\pi_m(D) = \int \exp\{-f(\beta_m)\}\, d\beta_m,$$

where $f$ denotes the negative logarithm of the likelihood times the prior under model $m$. Throughout, we condition on the model $m$ of dimension $d$, and therefore we drop the subscript $m$ from $\beta_m$ for simplicity. Assuming $f$ is twice differentiable, it can be approximated by a second-order Taylor expansion around a point $\beta_0$:

$$f(\beta) \approx f(\beta_0) + g_{\beta_0}^T (\beta - \beta_0) + \tfrac{1}{2} (\beta - \beta_0)^T H_{\beta_0} (\beta - \beta_0), \quad (S2)$$

where $g_\beta$ and $H_\beta$ are the gradient and Hessian of $f$, respectively, evaluated at $\beta$.
Laplace's method to approximate $\pi_m(D)$ (Tierney and Kadane 1986) is based on equation (S2) at $\hat{\beta} = \mathrm{argmin}_\beta f(\beta)$, where $g_{\hat{\beta}} = 0$. Moreover, the value of the following Gaussian integral is known:

$$\int \exp\{-\tfrac{1}{2}(\beta - \hat{\beta})^T H_{\hat{\beta}} (\beta - \hat{\beta})\}\, d\beta = (2\pi)^{d/2} |H_{\hat{\beta}}|^{-1/2}.$$

This leads to the following conventional Laplace approximation (Tierney and Kadane 1986; Johnson and Rossell 2012):

$$\pi_m(D) \approx (2\pi)^{d/2} |H_{\hat{\beta}}|^{-1/2} \exp\{-f(\hat{\beta})\}. \quad (S3)$$

The value of $\hat{\beta}$ can be obtained using numerical optimization methods. While Laplace's method turns the non-trivial integration problem into an optimization problem, the numerical optimization needs to be carried out at every iteration of our Markov chain Monte Carlo algorithm, and it is therefore the main computational bottleneck.
As a rapid alternative, Rossell, Abril, and Bhattacharya (2021) proposed an approximate version of Laplace's method, in which $f$ is evaluated at a point $\beta_0$ where the gradient does not vanish. Denoting $\tilde{\beta} = \beta_0 - H_{\beta_0}^{-1} g_{\beta_0}$ and completing the square in the exponential of the integrand, we obtain

$$\pi_m(D) \approx \exp\{-f(\beta_0) + \tfrac{1}{2} g_{\beta_0}^T H_{\beta_0}^{-1} g_{\beta_0}\} \int \exp\{-\tfrac{1}{2}(\beta - \tilde{\beta})^T H_{\beta_0} (\beta - \tilde{\beta})\}\, d\beta. \quad (S4)$$

In (S4), the Gaussian integral is again known, yielding the approximate Laplace approximation

$$\pi_m(D) \approx (2\pi)^{d/2} |H_{\beta_0}|^{-1/2} \exp\{-f(\beta_0) + \tfrac{1}{2} g_{\beta_0}^T H_{\beta_0}^{-1} g_{\beta_0}\}. \quad (S5)$$

Notably, equation (S5) requires no numerical optimization, as $f$, its gradient and its Hessian are evaluated only at the fixed point $\beta_0$. It should be noted that the approximate Laplace's method requires both $H_{\beta_0}$ and $R_m$ to be invertible. We also observed that for a near-singular $R_m$, the value of $\tilde{\beta}$ becomes unstable. Therefore, in cases where either $H_{\beta_0}$ or $R_m$ is not invertible, or the maximum non-diagonal absolute value of $R_m$ exceeds 0.99, the conventional Laplace's method of equation (S3) is applied. Importantly, equation (S3) does not involve calculating the inverse of $\hat{R}$, and therefore our proposed method remains applicable even in the case of non-invertible LD matrices (see also Zou et al. (2022)). For further considerations of the accuracy of the approximations, see Tierney and Kadane (1986) for the conventional Laplace's method and Rossell, Abril, and Bhattacharya (2021) for the approximate Laplace's method.
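To illustrate the difference between (S3) and (S5), the following sketch (a hypothetical one-dimensional toy density, not our actual model) compares the two approximations against numerical quadrature:

```python
import numpy as np
from scipy.integrate import quad
from scipy.optimize import minimize_scalar

# Toy one-dimensional negative log-density (d = 1), deliberately non-quadratic.
def f(b): return 25 * (b - 0.3) ** 2 + b ** 4
def g(b): return 50 * (b - 0.3) + 4 * b ** 3     # gradient of f
def h(b): return 50 + 12 * b ** 2                # Hessian of f

# Reference value of the log integral by numerical quadrature.
log_exact = np.log(quad(lambda b: np.exp(-f(b)), -np.inf, np.inf)[0])

# Conventional Laplace approximation (S3): optimize first, expand at the mode.
bhat = minimize_scalar(f).x
log_laplace = 0.5 * np.log(2 * np.pi / h(bhat)) - f(bhat)

# Approximate Laplace approximation (S5): expand at a fixed b0, no optimization.
b0 = 0.0
log_ala = 0.5 * np.log(2 * np.pi / h(b0)) - f(b0) + 0.5 * g(b0) ** 2 / h(b0)
```

For this toy function, both approximations agree with the quadrature result to well within 0.1 on the log scale, while (S5) avoids the optimization step entirely.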

Clumping extremely highly correlated variants
Extremely highly correlated variants, such as those with pairwise r² > 0.99, cannot be easily distinguished by any fine-mapping method. Therefore, we have added an option to clump such variants into one 'cluster' in the posterior sampling, with the lead variant (i.e. the one with the lowest marginal p-value) representing the group of variants in the sampling; this also yields a speed improvement as a by-product.
Consequently, any credible set that would include the lead variant from a cluster of clumped variants would also include all other variants from the corresponding cluster.
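A minimal sketch of such clumping (our own illustrative greedy implementation, not the package code; the function name is hypothetical) could look like:

```python
import numpy as np

def clump_variants(R, pvals, r2_threshold=0.99):
    """Greedy clumping: variants with pairwise r^2 above the threshold are
    grouped into one cluster, represented by the lead variant (lowest
    marginal p-value, listed first)."""
    order = np.argsort(pvals)            # most significant first
    assigned, clusters = set(), []
    for lead in order:
        if lead in assigned:
            continue
        members = [int(lead)] + [int(j) for j in order
                                 if j != lead and j not in assigned
                                 and R[lead, j] ** 2 > r2_threshold]
        assigned.update(members)
        clusters.append(members)
    return clusters

# Toy example: variants 0 and 1 are near-perfectly correlated.
R = np.array([[1.0, 0.999, 0.1],
              [0.999, 1.0, 0.1],
              [0.1, 0.1, 1.0]])
print(clump_variants(R, np.array([1e-8, 1e-6, 0.2])))  # -> [[0, 1], [2]]
```

Any credible set containing a cluster's lead variant then expands to the whole cluster, as described above.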

Consistency check for out-of-sample LD reference
When using an out-of-sample LD reference, there is a possibility of inconsistencies in that the observed test statistics do not correspond to the estimated LD between the variants; Zou et al. (2022) give a toy example of such a scenario with two variants, with $\hat{z} = (6\ 7)^T$ and $\hat{R} = \begin{pmatrix} 1 & 1 \\ 1 & 1 \end{pmatrix}$.
We present an option to detect such variants. Considering a cluster of variants in extremely high LD, we assume the variant C with the lowest p-value to be the true causal variant. As per Hormozdiari et al. (2014), we assume that the expected test statistic for any non-causal variant N in this cluster depends solely on its correlation $r$ with C. The joint distribution of the marginal z-scores for C and N is therefore a bivariate Gaussian:

$$\begin{pmatrix} \hat{z}_C \\ \hat{z}_N \end{pmatrix} \sim N\left( \begin{pmatrix} \lambda \\ r\lambda \end{pmatrix}, \begin{pmatrix} 1 & r \\ r & 1 \end{pmatrix} \right),$$

where $\lambda$ is the non-centrality parameter of the causal variant. For testing the equality of the observed and expected z-scores for a non-causal variant, we obtain the test statistic

$$X^2 = \frac{(\hat{z}_N - r\hat{z}_C)^2}{1 - r^2}.$$

In the denominator, we approximate $r$ with $r_T$, the threshold used to define extremely highly correlated variants. If the observed $X^2 > q_{\chi^2_1}$, the LD structure is not considered consistent with the observed z-scores, and the variants are assigned to separate clusters. By default, we use $q_{\chi^2_1} \approx 0.45$, the median of the $\chi^2_1$ distribution. This provides additional robustness to the clumping and serves as a check for LD misspecification of tightly linked variants. We highlight that none of the variants are excluded from fine-mapping; they are simply treated as separate, non-clumped clusters in the sampling phase. In the toy example, our method would detect the misspecified LD structure and treat the variants separately. Similar to SuSiE (Zou et al. 2022), our method then strongly prefers the second variant to be the causal one.
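The consistency check can be sketched as follows (illustrative Python with a hypothetical function name; r_T = 0.995 corresponds to the r² > 0.99 clumping threshold):

```python
from scipy.stats import chi2

def ld_consistent(z_lead, z_other, r_t=0.995, q=None):
    """Test whether the z-score of a clumped variant is consistent with
    its (near-perfect) LD with the lead variant.

    X^2 = (z_other - r_T * z_lead)^2 / (1 - r_T^2), compared against the
    median of the chi-squared distribution with 1 df (about 0.45).
    """
    if q is None:
        q = chi2.median(1)
    x2 = (z_other - r_t * z_lead) ** 2 / (1 - r_t ** 2)
    return bool(x2 <= q)

# Toy example of Zou et al. (2022): z = (6, 7) with estimated r = 1.
print(ld_consistent(7.0, 6.0))  # -> False: variants go to separate clusters
```

In the toy example the test statistic is roughly 93, far above the threshold, so the two variants are treated as separate clusters.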

Credible sets
We define a credible set at level α as a set of one or more variants that contain a true causal variant with probability larger than α. More than one credible set is allowed; the posterior distribution of model dimension d can be used as the posterior distribution for the number of credible sets.
Our credible set definition is similar to the one described in Wang et al. (2020), and we provide a similar toy example here. Consider 20 genetic variants $G_1, \ldots, G_{20}$, of which two variants, $G_1$ and $G_2$, are causal on a phenotype of interest. Additionally, $\mathrm{cor}(G_1, G_{11}) = 1$ and $\mathrm{cor}(G_2, G_{12}) = 1$. No statistical method can determine whether $G_1$ or $G_{11}$ (or, respectively, $G_2$ or $G_{12}$) is the true causal variant.
The conventional way to construct the credible set would be to consider a single set of variants that contains all causal variants with a specified probability (The Wellcome Trust Case Control Consortium et al. 2012; Hormozdiari et al. 2014). In the toy example, this definition would give $G_1, G_2, G_{11}, G_{12}$ as the credible set. An alternative way, applied in our method, is to explicitly recognize the two different signals via two credible sets: one of $G_1$ and $G_{11}$, and the other of $G_2$ and $G_{12}$. These sets imply that the possible causal configurations are (1) $G_1$ and $G_2$, (2) $G_1$ and $G_{12}$, (3) $G_2$ and $G_{11}$, or (4) $G_{11}$ and $G_{12}$; however, $G_1$ and $G_{11}$ (or $G_2$ and $G_{12}$) cannot simultaneously be causal.
To illustrate the creation of the credible sets, consider the following posterior probabilities for the models ($PPM$): $PPM(G_1, G_2) = 0.27$, $PPM(G_{11}, G_2) = 0.26$, $PPM(G_1, G_{12}) = 0.23$, $PPM(G_{11}, G_{12}) = 0.23$, $PPM(G_1, G_3) = 0.005$ and $PPM(G_1, G_2, G_3) = 0.005$. The model with the highest $PPM$, denoted by $m_{HPPM}$, is the one with $G_1$ and $G_2$. As it consists of two variants, two credible sets are created. To create a 95% credible set, we consider models in which variants can replace those in $m_{HPPM}$ until the total posterior probability, conditioned on the number of variants in $m_{HPPM}$, exceeds 0.95. Now, $G_1$ can be replaced by $G_{11}$ (in the model $G_{11}, G_2$), $G_2$ can be replaced by $G_{12}$ (in the model $G_1, G_{12}$), and both can be replaced simultaneously (model $G_{11}, G_{12}$). This accumulates the posterior probability, conditioned on two variants, to $(0.27 + 0.26 + 0.23 + 0.23)/(0.27 + 0.26 + 0.23 + 0.23 + 0.005) = 0.995$. Therefore, the credible sets are (1) $G_1$ and $G_{11}$, and (2) $G_2$ and $G_{12}$. $G_3$ is not considered for the credible sets, as the accumulated posterior probability already exceeds the desired credible set level of 0.95.
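The accumulation in this toy example can be sketched in plain Python (illustrative only; in the actual method, the models arise from the MCMC output, and ambiguous multi-replacement models are resolved using the LD structure, which is omitted here):

```python
# Posterior model probabilities (PPM) from the toy example.
ppm = {
    frozenset({"G1", "G2"}): 0.27,
    frozenset({"G11", "G2"}): 0.26,
    frozenset({"G1", "G12"}): 0.23,
    frozenset({"G11", "G12"}): 0.23,
    frozenset({"G1", "G3"}): 0.005,
    frozenset({"G1", "G2", "G3"}): 0.005,
}

hppm = max(ppm, key=ppm.get)                       # highest-PPM model
d = len(hppm)                                      # number of credible sets
same_dim = {m: p for m, p in ppm.items() if len(m) == d}
total = sum(same_dim.values())                     # denominator: 0.995

# Accumulate models of the same dimension, in decreasing PPM order,
# until the conditional posterior probability exceeds the 0.95 level.
selected, acc = [], 0.0
for m, p in sorted(same_dim.items(), key=lambda kv: -kv[1]):
    selected.append(m)
    acc += p
    if acc / total > 0.95:
        break

# Group single-variant replacements into per-signal credible sets.
sets = {v: {v} for v in hppm}
for m in selected:
    replaced, replacement = hppm - m, m - hppm
    if len(replaced) == 1 and len(replacement) == 1:
        sets[next(iter(replaced))] |= replacement
```

Running this reproduces the toy example: the conditional mass of the four selected models is 0.995, and the per-signal sets are {G1, G11} and {G2, G12}.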
In summary, our proposed way of constructing credible sets takes two distinct types of uncertainty into account simultaneously: the number of signals (i.e. the number of credible sets) and the uncertainty within each signal (i.e. the particular variants in each set). Of note, even though credible sets need to be constructed at a selected level α, we do not necessarily expect the nominal coverage (say, 95%), as the fine-mapped loci are not a random sample of loci harboring causal variants, but only of those with the largest effect sizes (Hutchinson, Watson, and Wallace 2020).
Simulation studies using NFBC1966 and NFBC1986 genotype data

Genotyping and quality control for both datasets are detailed elsewhere (Kaakinen 2012; Karhunen et al. 2021). After sample quality control procedures and consent withdrawals, genomic data were available for 5400 and 3743 individuals in NFBC1966 and NFBC1986, respectively. Both datasets were imputed to the Haplotype Reference Consortium panel, and the autosomal, biallelic markers were filtered for minor allele frequency > 0.01, imputation quality R² > 0.5 and Hardy-Weinberg equilibrium p-value > 10⁻¹². Due to its larger sample size, the genotype data from NFBC1966 were used for the phenotype simulations to create the summary-level data, and the genotype data from NFBC1986 for estimating the out-of-sample linkage disequilibrium (LD) matrix. These data come from the same geographical region with no overlapping individuals; therefore, the two datasets provide ideal out-of-sample LD references for each other.
Using NFBC1966 genotype data, we simulated a quantitative trait under a model with C causal variants explaining a total variance of φ in the phenotype, with C ∈ {1, 2, 5} and φ ∈ {0.015, 0.03}, as follows:

1. Select the indices c of the causal variants uniformly from {1, ..., P}, with the constraint that the maximum pairwise correlation between the causal variants is 0.95.
2. Draw the effect sizes with the constraint that the power to detect each individual causal effect is > 0.8 at significance level 0.05:
   a. Draw C effect sizes b_c from N(0, 1).
   b. Scale the b_c to give the desired total variance explained, by simulating an unscaled phenotype Y′ from the causal genotypes and the b_c, and then rescaling the effects to β_c such that the causal variants jointly explain a proportion φ of the phenotypic variance.
   c. Return to (a) if any of the β_c does not achieve the desired power.
The simulated phenotype was regressed separately on all variants (standardized to mean = 0 and variance = 1) within the locus assuming an additive model. The resulting summary statistics (i.e. association estimates and their standard errors) and an LD matrix were given as inputs to the considered fine-mapping methods.
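A minimal version of this simulation scheme can be sketched as follows (simplified assumptions: independent synthetic genotypes instead of NFBC1966 data, and the power-based rejection step of (c) omitted):

```python
import numpy as np

rng = np.random.default_rng(42)
N, P, C, phi = 5000, 50, 2, 0.03

# Independent synthetic genotypes, standardized to mean 0 and variance 1.
G = rng.binomial(2, 0.3, size=(N, P)).astype(float)
G = (G - G.mean(axis=0)) / G.std(axis=0)

# Choose causal indices, draw raw effects, rescale so that the causal
# variants jointly explain a fraction phi of the phenotypic variance.
causal = rng.choice(P, size=C, replace=False)
b = rng.normal(size=C)
b *= np.sqrt(phi / (G[:, causal] @ b).var())
y = G[:, causal] @ b + rng.normal(scale=np.sqrt(1 - phi), size=N)

# Marginal (per-variant) regressions yield the summary statistics that,
# together with the LD matrix, are passed to the fine-mapping methods.
beta_hat = G.T @ y / N
se = np.sqrt(np.array([np.var(y - beta_hat[j] * G[:, j]) for j in range(P)]) / N)
z = beta_hat / se
ld = np.corrcoef(G, rowvar=False)                # P x P LD matrix
```

With standardized genotypes and phenotype, the marginal effect of variant j reduces to the inner product G_j·y/N, which keeps the sketch free of explicit regression calls.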

Synthetic dataset
The main simulations using NFBC1966 genotype data use parameters typical of GWASs on protein levels. For complex traits, however, the heritabilities of individual loci can be considerably smaller, and their detection therefore requires larger sample sizes. To examine the influence of lower heritability and larger sample size on the performance of our proposed method, we used the synthetic HAPNEST genotype data created by Wharrie et al. (2022) in additional simulations. We also used the synthetic data to compare the speed of our method with that of SuSiE as a function of the number of genetic variants considered in fine-mapping.
The synthetic data generation is based on a stochastic model that simulates genotypes from existing reference genomes while simultaneously taking genetic coalescence, recombination and mutation into account (Wharrie et al. 2022). The full synthetic dataset consists of over 6.8 million single-nucleotide polymorphisms for 1,008,000 individuals, categorised into six different ancestries, and is available at https://www.ebi.ac.uk/biostudies/studies/S-BSST936. We extracted 50,000 individuals of simulated African ancestry, together with their genotype data from five separate loci on chromosome 22 containing 500, 1000, 2000, 5000 and 10,000 variants. The heritabilities considered were 0.001 and 0.015, with 1, 2 or 5 causal variants. The simulations for the lowest heritability of 0.001 were conducted on the synthetic data only, as the sample size of 5400 in our real genotype data is far too small compared with a typical sample size in GWAS summary statistics for complex traits.
The simulation strategy for the synthetic dataset was exactly the same as for the NFBC1966 dataset.
We calculated the reference LD for the synthetic data based on a separate sample of 10,000 HAPNEST observations of simulated African ancestry. To determine a suitable value for the hyperparameter τ, we solved $P(|\beta_j| > N^{-1/2} z_q) = 1 - q$ with $z_q = 3.29$ and $q = 0.05$, based on the simulation results using NFBC1966 genotype data, leading to τ = 0.000416.