PALM: a powerful and adaptive latent model for prioritizing risk variants with functional annotations

Abstract Motivation The findings from genome-wide association studies (GWASs) have greatly helped us to understand the genetic basis of human complex traits and diseases. Despite the tremendous progress, much effects are still needed to address several major challenges arising in GWAS. First, most GWAS hits are located in the non-coding region of human genome, and thus their biological functions largely remain unknown. Second, due to the polygenicity of human complex traits and diseases, many genetic risk variants with weak or moderate effects have not been identified yet. Results To address the above challenges, we propose a powerful and adaptive latent model (PALM) to integrate cell-type/tissue-specific functional annotations with GWAS summary statistics. Unlike existing methods, which are mainly based on linear models, PALM leverages a tree ensemble to adaptively characterize non-linear relationship between functional annotations and the association status of genetic variants. To make PALM scalable to millions of variants and hundreds of functional annotations, we develop a functional gradient-based expectation–maximization algorithm, to fit the tree-based non-linear model in a stable manner. Through comprehensive simulation studies, we show that PALM not only controls false discovery rate well, but also improves statistical power of identifying risk variants. We also apply PALM to integrate summary statistics of 30 GWASs with 127 cell type/tissue-specific functional annotations. The results indicate that PALM can identify more risk variants as well as rank the importance of functional annotations, yielding better interpretation of GWAS results. Availability and implementation The source code is available at https://github.com/YangLabHKUST/PALM. Supplementary information Supplementary data are available at Bioinformatics online.

1 More simulation results

Simulations when z-values are from other distributions
In the main simulation, we generate the z-score of each SNP by z j ∼ N (µ j , 1), j = 1, . . . , M , where µ j follows a bimodal distribution. We conduct additional simulations to compare the performances of PALM and other methods under alternative distributions of µ.
Scenario Distribution big-normal N (0, 4 2 ) near-normal   1.2 Simulations when z-values are generated from a linear model In GWAS, the z-scores of SNPs are typically calculated from a linear model with individual data. Specifically, with an assumed heritability h 2 and number of non-null SNPs N nonnull , the common linear model takes the form: where y is the phenotype vector, X is the standardised genotype matrix, β ∼ N (0, h 2 N nonnull ) and ϵ ∼ N (0, 1 − h 2 ). The z-score of each SNP can be calculated by performing marginal linear regression. Here we investigate the performance of PALM and compared methods with z-scores generated in this way rather than directly from a prespecified distribution. Since PALM assumes the independence of z-scores and mainly focus on evaluating its performance on the computed z-scores in this section, we will generate genotype matrix with uncorrelated SNPs. For the influence of LD, please refer to supplementary section 1.6.
We consider a realistic setting with heritability h 2 = 0.2, sample size n = 20, 000, total number of SNPs M = 10, 000 and number of annotation L = 50. First, for individual i, we randomly sample M minor allele frequencies (MAFs) from f j ∼ U [0.05, 0.5], j = 1, ..., M and generate genotype g ij of individual i from g i ∼ Binomial(2, f j ), j = 1, ..., M then standardize g j to have mean 0 and variance 1 (the standardised genotype is denoted as x ij ). Second, following the simulation in the paper, we generate annotation matrix A of shape M × L. Third, the prior probability of each SNP for being in non-null group π j1 is calculated by Eq. (9) in the paper and the association status is generated by Z j ∼ Bernoulli(π j1 ), j = 1, ..., M . With Z j , we know the true non-null SNP indices and thus the number of non-null SNPs N nonnull . For the null SNPs j ∈ {k|Z k = 0}, their effect sizes are set to be zero; for the non-null SNPs, their effect sizes are randomly sampled by β j ∼ N (0, h 2 N nonnull ), j ∈ {k|Z k = 1}. Then the phenotype of each individual i is generated by y i = X i β + ϵ i , i = 1, ..., n. Finally, the estimates of effect size of each SNPβ j and its standard error se(β j ) can be calculated by regressing y onto X j . Thus z-score z j can be obtained by z j =β j se(β j ) . After the summary statistics have been generated, we can perform and evaluate all the methods just as the simulation in the paper. Fig. S5 shows the FDR and power for all the methods. Except for GPA-Tree, all the other methods can provide satisfactory FDR control (the following discussion will exclude GPA-Tree). For case (A), the power of methods integrating annotations (LSMM, PALM-D1 and PALM-D2) is of the same level with methods only using summary statistics (BH, TGM-zval and TGM-Pval); for case (B), LSMM, PALM-D1 and PALM-D2 have similarly higher power than methods without using annotations; for case (C)(D)(E), PALM can model nonlinear relationship between annotation and association status thus having significant gain in power compared with other methods. In summary, the results validate the effectiveness of PALM with z-scores generated from a linear model.

Quantification of interaction effects between two annotations
In this section, we discuss how to quantify the interaction effects between two annotations using PALM. Friedman's H-statistic is introduced to quantify interaction effects [10]. Consider a function F of several features x l , l = 1, ..., L, if two features x j and x k don't interact, we can decompose the 2-way partial dependence function [9] in the following way: where P D jk (x j , x k ) is the 2-way partial dependence function of both features, and P D j (x j ) and P D k (x k ) are the partial dependence functions of the single features. If the two features interact, Eq.
(1) no longer holds and the difference between the observed 2-way partial dependence and the decomposed one reflects the interaction. Following this idea, the H-statistic measures the interaction between feature x j and x k as: where n is the sample size for fitting the model. Theoretically, with centered partial dependence functions, H-statistic is in [0, 1] and a larger H 2 jk means a stronger interaction between x j and x k . In the context of our paper, the 1-way and 2-way partial dependence functions can be calculated by: where M is the SNP number,F is the fitted boosted trees, \j and \jk represent features except j and j, k, respectively. We implemented the calculation of H-statistic with PALM-D2 (PALM-D1 cannot model interaction).

The influence of missing annotations on the performance of PALM on simulated data
As PALM is based on the gradient boosting decision trees, it is capable of handling missing values in functional annotations A. We conduct simulations to gauge the influence of missing value rate of annotation matrix A on the performance of PALM on simulated data. In details, we consider four values of missing rates, i.e., mrate ∈ {0.05, 0.1, 0.2, 0.4}, and use 2-fold cross-validation to select the optimal number of trees. The missing elements in the synthetic annotation matrix A are randomly chosed. 1.5 The influence of missing annotations on the performance of PALM on real data We also conduct simulations to gauge the influence of missing value rate of annotation matrix A on the performance of PALM on real data. In details, we consider four values of missing rates, i.e., mrate ∈ {0.05, 0.1, 0.2, 0.4}, and use 2-fold cross-validation to select the optimal number of trees. The missing elements in the real annotation matrix A are randomly chosen. In general, the number of prioritized SNPs will decrease as the missing rate of annotation matrix increases. The two exceptions are Alzheimer's disease and HIV, whose numbers of prioritized SNPs under mrate = 0.2, 0.4 are slightly greater than those of without missing values. The figure below therefore excludes these two GWASs for the sake of visualization.

The influence of cross-validation folds on the performance of PALM
In the main simulation, the optimal number of trees is selected by 5-fold cross validation. We conduct simulations to gauge the influence of cross-validation (CV) folds on the performance of PALM. Specifically, we apply PALM with 2-fold CV and 5-fold CV to the same simulated data then evaluate FDR and power.   1(a) with M = 20, 000 and L = 50. From Figure S15(a), we observe that the magnitude of shrinkage parameter has minor impact on FDR control and power: for PALM-D1, the estimated FDR and power with different ν under the five cases are almost the same. For PALM-D2, FDR with ν = 0.8 in case (C) is slightly higher than other ν's but still in the tolerable range; power with ν = 0.01 in case (D) is slightly lower than other ν's. Apart from the two exceptions, the differences on FDR and power between different shrinkage parameters are inapparent. Thus, the performance of PALM is insensitive to the choice of shrinkage parameters, as long as the number of trees is determined by cross-validation. Although shrinkage parameter shows little influence on risk SNP prioritization, it has some impact on the number of trees of the final model after cross validation. As displayed in Figure  S15, with a very small shrinkage parameter (e.g ν = 0.01, 0.05), the number of trees in the final model increases, i.e., the EM algorithm needs more iterations, thus more time-consuming.

The influence of tree depth on PALM
We have extensively explored PALM with tree depth 1 and 2. Here, we performance PALM with tree depth 3 and 4 (PALM-D3 and PALM-D4 for short) under the same simulation settings in paper with SNP number M = 20, 000 and annotation number L = 50. Figure  S16(a) implies that PALM-D3 can still control FDR while PALM-D4 shows a little inflation on FDR in case (B) and case (C). In terms of statistical power, PALM-D3 and PALM-D4 show no further improvement compared with PALM-D2 in case (C) and case (D). Although the optimal numbers of optimal trees of PALM-D3 and PALM-D4 are smaller than PALM-D1 and PALM-D2, the total computational times of PALM-D3 and PALM-D4 are notably higher since a larger individual tree costs more time to be fitted.
In summary, if we aim to identify more risk SNPs and the computation cost is not a concern, PALM-D2 is recommended; if we want to be more conservative and speed up the prioritization, then PALM-D1 is a great choice. We don't recommend to use trees with depth greater than 2 for risk SNP prioritization.

The influence of LD effects on PALM
Following LSMM, we study the influence of LD effects on PALM using real genotype data from The Wellcome Trust Case Control Consortium (WTCCC). We consider 23170 SNPs in chromosome 1 after quality control and randomly select 23 SNPs as true causal SNPs. We assume the 23 causal SNPs explain 5% phenotypic variance. We use GCTA to simulate phenotypes and PLINK to calculate p-values for all SNPs. We simulate 110 annotations: SNPs within 1Mb of causal SNPs are annotated by the first 10 annotations and by 20 of the last 100 annotations with a probability of 60%. All the other SNPs are annotated with a probability of 10%. We apply two-groups model of p-values (TGM-Pval), PALM-D1 and PALM-D2 to prioritize risk SNPs.
Because of the LD effects, it is unrealistic to detect the true causal SNPs. However, we can expect to identify the regions containing causal SNPs. We set different distance threshold to define the regions around causal SNPs. Any prioritized risk SNP within the regions will be counted as true positive. As shown in Figure S17, without integrating annotations, two-groups model can stably control FDR even under the smallest distance threshold (100kb). Since SNPs within 1Mb of causal SNPs are more likely to be annotated, resulting in the increasing probability of being prioritized, it is difficult for PALM to well control FDR under a small distance threshold. As the distance threshold increases, both PALM-D1 and PALM-D2 become more conservative. We observe that FDR can be controlled at the nominal level 0.1 with a distance threshold of 500kb and 800kb for PALM-D1 and PALM-D2, respectively. 2 More about real data analysis 2.1 Discussion on the performance of GPA-Tree GPA-Tree [12] extends the two-groups model (TGM) by modeling the prior of SNP association status with a single tree. For the algorithm of GPA-Tree, on the first stage, parameter α of the two-groups model is estimated by fitting a linear regression in the M step; on the second stage, with α fixed at its estimation, a full tree is fitted then pruned with a complexity parameter cp in each iteration until the incomplete likelihood no longer increases even with other cp choices. In the simulation study, GPA-Tree did not well control FDR under several cases. In the real data analysis, GPA-Tree did not provide a satisfactory prioritization of risk SNPs: for quite a few traits, it identified even fewer SNPs than TGM; but it identified much more SNPs than other related methods on several traits. It seems that GPA-tree suffers from a stability issue.
To figure out the problem with GPA-Tree, we investigate the model fitting process of GPA-Tree on real data. Figure S18 shows the incomplete log-likelihoods on the first and second stage obtained by GPA-Tree. We observe that for more than half of the GWASs, the log-likelihood at the end of second stage is even smaller than that at the end of first stage; while for several traits, namely HIV, lupus and rheumatoid arthritis, the log-likelihood has a jump on the second stage (these three traits are exactly the ones with much more SNPs prioritized by GPA-Tree). The cause of these phenomena can be attributed to the algorithm design and limitation of the model. In order to prevent overfitting, GPA-Tree pruned the fitted full tree with the complexity parameter cp. An aggressive cp might leads to uncontrolled FDR while a conservative cp can limit the power of risk SNP prioritization. With a common cp (e.g 0.005), it is hard to achieve both good power and FDR control for all GWASs. Hence, for some traits, the log-likelihood using the pruned tree increases compared with that using a linear regression model; but for many other traits, the log-likelihood degrades on the second stage, indicating that the use of pruned tree is not more effective than a linear model; for several other traits, because another smaller cp in the set of candidate cp's has been found, the newly pruned tree is dramatically enlarged, resulting in a sudden increase in log-likelihood. Since there is no stable regularization in GPA-Tree, it cannot credibly and consistently improve risk SNP prioritization.
Besides the instability, we also notice that, for almost all traits without a jump in loglikelihood, the tree structure does not change throughout the second stage of GPA-Tree. Figure  S19 shows the tree depths along the iterations on the second stage. Except for HIV, lupus and rheumatoid arthritis, the tree depths of all the other traits remain unchanged during EM iterations. Indeed, not only the tree depths keep the same but also the split nodes barely change throughout the second stage. For instance, the tree structure of BMI only changed once during EM iterations and the final tree only involves two annotations; the tree structure of type 1 diabetes keeps unchanged in all iterations ( Figure S20(a)(b)). In other words, only the leaf values are updated in most of the iterations but at the cost of fitting a huge full size tree and pruning, suggesting the low efficiency of GPA-Tree's model fitting process.
We also plot the trees fitted by GPA-Tree of traits with abnormally large number of prioritized SNPs ( Figure S20(c)(d)(e) for rheumatoid arthritis, HIV and lupus) and two traits with prioritized SNPs fewer than TGM ( Figure S20(f)(g) for multiple sclerosis and type 2 diabetes). Notice that the annotations used to split nodes in rheumatoid arthritis, HIV and lupus before the jump are in accordance with the important annotations ranked by PALM. Because of the unstable algorithm, however, GPA-Tree fits much larger trees in the following iterations and causes inflation. For multiple sclerosis and type 2 diabetes, although the complexity of the final trees is controlled, it does not fully make use of the annotations. Compared with PALM, the tree of multiple sclerosis does not use the annotation primary B cells from peripheral blood which is evaluated to have nonnegligible importance; the tree of type 2 diabetes does not include multiple useful annotations such as monocytes-CD14+ RO01746 primary cells, K562 leukemia cells and fetal hearts. On the contrary, PALM uses a shallow tree to fit the residual in each EM iteration and ensemble these trees. Therefore, each individual tree in PALM can contribute to fitting a complex function in different aspects by choosing several annotations, and the aggregation of trees leads to a strong and stable model.

Performance of PALM with tree depth greater than 2
Continuing the discussion on tree depth in section 1.8, we perform PALM-D3 and PALM-D4 on real data and add the results to Figure 2 as shwon in Figure S21. The performance of PALM-D3 and PALM-D4 on real data verify the simulation result in S16: with the increase of tree depth, more SNPs can often be identified but PALM-D3 and PALM-D4 may not provide satisfactory FDR control.

Genic category of prioritized SNPs by PALM
We use ANNOVAR [26] to obtain the genic categories of SNPs prioritized by PALM. We found that most prioritized SNPs are indeed in the non-coding regions ( Figure S22), consistent with the fact that most GWAS hits are in the non-coding regions.