- Split View
-
Views
-
Cite
Cite
John Ferguson, Joseph Chang, An empirical Bayesian ranking method, with applications to high throughput biology, Bioinformatics, Volume 36, Issue 1, January 2020, Pages 177–185, https://doi.org/10.1093/bioinformatics/btz471
- Share Icon Share
Abstract
In bioinformatics, genome-wide experiments look for important biological differences between two groups at a large number of locations in the genome. Often, the final analysis focuses on a P-value-based ranking of locations which might then be investigated further in follow-up experiments. However, this strategy may result in small effect sizes, with low P-values, being ranked more favorably than larger more scientifically important effects. Bayesian ranking techniques may offer a solution to this problem provided a good prior distribution for the collective distribution of effect sizes is available.
We develop an Empirical Bayes ranking algorithm, using the marginal distribution of the data over all locations to estimate an appropriate prior. In simulations and analysis using real datasets, we demonstrate favorable performance compared to ordering P-values and a number of other competing ranking methods. The algorithm is computationally efficient and can be used to rank the entirety of genomic locations or to rank a subset of locations, pre-selected via traditional FWER/FDR methods in a 2-stage analysis.
An R-package, EBrank, implementing the ranking algorithm is available on CRAN.
Supplementary data are available at Bioinformatics online.
1 Introduction
Traditional statistical hypothesis testing was developed based on the ideas of P-values and levels of significance and is still in wide-spread use today, despite well-known caveats regarding its usage (Amrhein et al., 2019; Nickerson, 2000). A recent example is the statistical analysis of large genomic datasets, where perhaps hundreds of thousands of genetic locations are simultaneously tested for systematic differences between two groups. Different tests need to be used depending on the experimental technology that is being used, and a surprising degree of statistical ingenuity has been spent developing appropriate test statistics and P-values. Whereas significance testing should do reasonably well at identifying locations where real (although potentially small) biological differences exist, there is no guarantee that small P-values correspond to scientifically interesting differences. Nevertheless, P-values are frequently used as a ranking criterion to produce lists of the most interesting tests/locations to follow up.
As an example, consider Genome-Wide Association Studies (GWAS). These are large scale observational studies where a set of single nucleotide polymorphisms (SNPs) are genotyped for a group of control individuals, and typically a group of individuals with some disease (although continuous traits such as height and body mass index have also been investigated). The genotype for a given individual at a given SNP can be coded as 0, 1 or 2 depending on the number of copies of the variant allele at that locus for the individual. The associative effect of each SNP on disease risk is measured using the population log-odds ratio (OR) between SNP genotype and disease status. While we would ideally want any list of ‘significant’ SNPs to concentrate on the largest population ORs, the variant allele for some SNPs will be rare, meaning that most individuals have genotype 0, and relatively few have genotypes 1 and 2. As a result, there is low power to detect a disease/control difference in the genotypes at these rare SNPs. For example, suppose that SNP A has a disease/control OR of 1.1 and a minor allele frequency (MAF) in controls of 0.4, whereas SNP B has an OR of 1.5 but the MAF (again in controls) is only 0.01. Assuming the disease is rare; this implies that having the variant allele for SNP A is associated with a 10% increase in the probability of disease, with a more important looking increase of 50% for SNP B. However, in practice the respective P-values corresponding to these two SNPs may not reflect this information. Indeed, calculations show that the rare SNP with the higher OR will have the lower P-value for a sample of 1000 cases and 1000 controls only 48% of the time, assuming independence of the genotypes at the two SNPs (this independence assumption is referred to as ‘linkage equilibrium’ in the genetics literature).
Bayesian ranking methods provide an alternative approach to constructing lists of interesting hypotheses, or in the cases considered here DNA locations, which de-emphasizes the influence of varying experimental standard errors in comparison with a P-value based ranking. Our work here is motivated by a number of previous authors as far back as 1989, where Laird and Louis (1989) used Empirical Bayes ranking techniques to assess relative educational effectiveness of a number of secondary schools. In this manuscript, we develop a general Bayesian ranking method that is applicable to a wide range of genomic datasets. Our method is related to the recent papers demonstrating the use of Bayesian ranking for differential expression in microarrays (Noma et al., 2010; Noma and Matsui, 2013); however, we use a differing Bayesian model, and demonstrate the use of Bayesian ranking in a wider set of scenarios including determining differential expression from RNA-Sequence data and identifying disease associations in GWAS. Simulations and real-data analysis are used to compare the properties of our method against multiple competing approaches. To facilitate the use of our method, we have developed an R-package, EBrank, implementing these methods which is downloadable from the CRAN repository.
The rest of the paper is organized as follows. In Section 2, we discuss ranking problems from a general point of view, including a discussion of appropriate loss functions and Bayesian techniques to produce loss-minimizing rankings. In Section 3, we describe a novel non-parametric Empirical Bayes ranking algorithm that is particularly suited to examples from bioinformatics. In Section 4, we examine the performance of the algorithm using a series of simulated and real-data examples, and compare to other approaches for ranking effect sizes. To conclude, we discuss potential limitations of our methodology, when we would expect it to work well, and suggest some potential extensions.
2 Background
As alluded to in the previous section, using P-values to rank experiments in massive parallel testing situations may give unsatisfactory results. P-values and Bayesian ranking techniques typically optimize different criteria, with the criterion corresponding to P-values not as suited for ranking. To make these criteria and distinctions between them more tangible, it is helpful to begin with a brief discussion of loss functions that can be used for ranking problems. We conclude the section with a description of how Bayesian ranking methods are applied in practice, given the choice of such a loss function.
2.1 Loss functions
2.2 Finding to minimize loss
Finding rank vectors that optimize loss criteria such as (3) within the Frequentist paradigm is usually difficult, although some progress regarding minimax-optimal solutions has been achieved for special loss functions (Bansal et al., 1997). Bayesian modeling offers a computationally viable and principled alternative, and is the approach that we consider in this article. First consider a prior distribution, for the parameter vector γ being a set of hyper-parameters specifying the exact prior chosen from a class of plausible priors. A fully hierarchical approach incorporates an additional hyper-prior for γ. Alternatively, Empirical Bayes first specifies a data-driven estimate for γ and subsequently uses the non-hierarchical prior distribution: . In either case, we can then derive the posterior distribution, for given the data X using simulation or analytic techniques using existing procedures. Sometimes the variance of the posterior distribution may be adjusted upwards to account for uncertainty due to error in estimating γ (Berger, 1985), although the Empirical Bayes algorithm described in Section 3 is ‘naive’ in the sense that this extra source of variation is not considered.
3 An empirical Bayes ranking algorithm
We next develop an Empirical Bayes implementation for the ranking techniques described in the previous section. The described algorithm might be considered an approximation to a fully Bayesian approach, with a clear advantage in computational feasibility. An R-package implementing the algorithm can be downloaded from the CRAN repository. The algorithm is designed to estimate the ranks for the absolute values of N unknown univariate parameters based on noisy estimates . This is a commonly observed set up in bioinformatics, and examples from GWAS and RNA-Sequencing will be described in the following section. While the algorithm can be used in an ad-hoc fashion given any set of estimates, a few relative weak assumptions are necessary to justify the likelihood model that is implicitly used in Empirical Bayes estimation. These assumptions are as follows:
The estimates (,…, are conditionally independent, given
is asymptotically normally distribution in that as sample size .
Estimated standard errors: are available and satisfy a kind of ‘log-consistency’, i.e.
and are asymptotically independent.
The weakness of these assumptions supports the use of the algorithm in a variety of settings. It should be pointed out that the algorithm learns an appropriate prior using Empirical Bayes technique, using aggregate information in the N random variables . An implicit assumption is that N is sufficiently large to estimate this prior reasonably accurately. When N is small, the learned prior may be a poor reflection of how are distributed in reality, and ranking performance may suffer.
3.1 Derivation of an approximate conditional likelihood
3.2 Prior and posterior for μ
3.3 Empirical Bayes estimation of prior for μ
We utilize a constrained EM algorithm adapted from Muralidharan et al. (2010) to estimate the parameters: and by maximizing the marginal likelihood . The default algorithm constrains the mean and standard deviation for component 0, i.e. m0 and s0, to equal 0 and 1, respectively, whereas the variances are constrained to be at least 1 for components 1,…, J. Finally, for the examples in this manuscript we have forced the restrictions and . Here, s1 is constrained so that the algorithm can detect a low-probability mixing component representing isolated extremely large parameters, as sometimes may be observed in genome-wide association applications. If such a component is not included, the estimated effect sizes for these extreme parameters might be aggressively and incorrectly shrunk to an overly-conservative prior. This restriction mostly impacts estimated posterior means rather than estimated ranks and can be removed in EBrank if so desired. See Supplementary Section 3 for further discussion. is selected by minimizing BIC over the integers . The estimated mixing probabilities, , and component means, , are found directly from the algorithm—while the estimated variances are found using where is the vector of standard deviations found via the algorithm.
3.4 Estimated posterior for θi
3.5 Estimating ranks
While it is possible to calculate mean posterior ranks directly from the posterior via the identity , it is easier to simulate from the posterior (9), and then transform the simulated values into ranks using (6). Our R-package EBrank uses a default of 10 000 draws from the posterior. Estimated posterior-loss-minimizing rank vectors are then found by replacing the population solutions by their obvious sample analogs. For instance, is estimated by
4 Examples
4.1 RNA sequence example simulations
RNA-sequencing has emerged as a competitor to microarrays for quantifying and comparing gene expression under different conditions, e.g. Mortazavi et al. (2008). The data are the number of short pieces (reads) of cDNA that have been sequenced, lying in each gene or transcript. Gene expression is indirectly measured from the number of reads that map, or align, to a particular gene, normalized by the gene’s length (in nucleotides) and the total count of reads over all genes (known as the library size). Genes are typically identified as being differentially expressed using RNA-seq data when their counts for one condition, normalized only by library size and not by gene lengths, are systematically different than the other. Several competing statistical approaches have been developed to identify genes displaying differentially expressed genes in RNA-Seq data [see for instance Anders and Huber (2010), Robinson et al. (2010) and Hardcastle (2015)]. While these methods assume that the counts over the different samples follow a negative-binomial distribution, the test statistics used to identify differentially expressed genes are generally asymptotically normal, at least in the case of large library sizes, supporting the use of our model. Here, we have used the package the R-package, v1.18 on default settings, to measure gene-wise fold-changes and standard errors from an RNA-Seq count matrix.
Data simulation
Each simulation scenario required fixing , and average effect size, σ . Eight independent repeats of each individual scenario were simulated. In Figure 1, the distribution of for Bayesian Ranking and P-values is displayed for and , setting PDE = 0.05 and averaging across the settings for n. In all but one scenario, estimated overlap is highest according to Bayesian Ranking in Section 3. Results from alternative parameter settings are presented in the Supplementary Material, and show similar conclusions. In the Supplementary Material, we also investigate the effect of a number of other factors that might influence the quality of ranking, including pre-selection of genome-wide significant experiments before the ranking step, loss-function choice (including the Jewett loss and the ‘r-values’ approach detailed in Section 2) and the effect of Monte Carlo error.
4.2 Genome-wide association data
To investigate the potential benefits of applying the ranking algorithm in GWAS settings, we downloaded some publicly available SNP data from the DbGAP website (Mailman et al., 2007) for four different diseases. The number of cases and controls and original number of genotyped SNPs for each disease are described in Table 1. For each SNP, disease/control ORs were estimated from the data and shrunk using an Empirical Bayes approach (Ferguson et al., 2013b). Supplementary Figure S2.3 shows both the before and after log-ORs. For each disease, MAF for ∼95% of the genotyped SNPs were found using the UCSC genome browser website by matching rsid. The shrunken log-ORs for the Nall SNPs where the MAF could be recovered represent true ORs behind each simulation.
Disease . | # SNPs . | # Cases . | # Controls . | dbGap study accession # . |
---|---|---|---|---|
Parkinson’s | 463 185 | 1713 | 3978 | phs000501.v1.p1 |
Crohn’s | 298 391 | 968 | 995 | phs000130.v1.p1 |
Schizophrenia | 700 490 | 1351 | 1378 | phs000021.v1.p1 |
MS | 551 642 | 978 | 883 | phs000171.v1.p1 |
Disease . | # SNPs . | # Cases . | # Controls . | dbGap study accession # . |
---|---|---|---|---|
Parkinson’s | 463 185 | 1713 | 3978 | phs000501.v1.p1 |
Crohn’s | 298 391 | 968 | 995 | phs000130.v1.p1 |
Schizophrenia | 700 490 | 1351 | 1378 | phs000021.v1.p1 |
MS | 551 642 | 978 | 883 | phs000171.v1.p1 |
Note: Results for Crohn’s disease are displayed in Figure 2. See the Supplementary Material for results pertaining to the other diseases.
Disease . | # SNPs . | # Cases . | # Controls . | dbGap study accession # . |
---|---|---|---|---|
Parkinson’s | 463 185 | 1713 | 3978 | phs000501.v1.p1 |
Crohn’s | 298 391 | 968 | 995 | phs000130.v1.p1 |
Schizophrenia | 700 490 | 1351 | 1378 | phs000021.v1.p1 |
MS | 551 642 | 978 | 883 | phs000171.v1.p1 |
Disease . | # SNPs . | # Cases . | # Controls . | dbGap study accession # . |
---|---|---|---|---|
Parkinson’s | 463 185 | 1713 | 3978 | phs000501.v1.p1 |
Crohn’s | 298 391 | 968 | 995 | phs000130.v1.p1 |
Schizophrenia | 700 490 | 1351 | 1378 | phs000021.v1.p1 |
MS | 551 642 | 978 | 883 | phs000171.v1.p1 |
Note: Results for Crohn’s disease are displayed in Figure 2. See the Supplementary Material for results pertaining to the other diseases.
Simulation of data
Two scenarios were investigated: ‘weak’ signals (n = 2000 cases, n = 2000 controls) and ‘strong’ signals (n = 10 000 cases, n = 10 000 controls) via applying the ranking algorithm to 24 independent simulated datasets based on the ‘true’ above. Each iteration involved simulating sample ORs and associated P-values for each of the Nall SNPs, using an asymptotic approximation for the distribution of the logistic-regression estimated log-OR, given the true MAFs in cases and controls, true μi for that SNP and value for n. As a pre-filtering step to ease computational difficulties, we then selected only the subset of SNPs that had significant P-values after adjusting for a 90% false discovery rate threshold (Benjamini and Hochberg, 1995) for subsequent ranking. This subset of SNPs was ordered based on their P-values and according the Bayesian ranking procedure described in Section 3. However, note that while only a subset of N SNPs was ranked, all Nall SNPs were used to calculate the Empirical Bayes informed prior. In the subsequent discussion, we relabel the subset of N parameters that are chosen to be ranked for each simulation as .
Results
Two criteria were used to evaluate the quality of the estimated ranks in recovering the ordering of :
Overlap of top K SNPs—i.e. the % of the top K ranked SNPs (according to the method either Bayes or P-value) which are in the top K true absolute-value log-ORs (according to the Empirical Bayes shrinkage), given by (12).
Percentile Rank of top K SNPs. This is the average percentile of the top K ranked SNPs within the total list of absolute-value logORs as given by: .
Values of K =10 and 100 were considered for both criteria. Boxplots, based on the 24 simulations for each scenario, are displayed in Figure 2 for Crohn’s disease. Each pane of the figure corresponds to a particular scenario (either weak or strong signals) and criterion function. The boxplot on the center and left correspond to Bayesian and P-value based rankings. The center boxplot: ‘Overall’ shows the distribution over these eight simulations, where Px is the value of the criterion function for a particular simulation and ranking method x. These simulations indicate that typically there is a benefit in producing ordered lists of P-values using Bayesian ranking rather than using P-values, sometimes substantially so. For instance, from the 10 SNPs with highest true OR, Bayesian ranking picks out either 9 or 10 of these in the top 10 estimated SNPs compared to only seven when P-values are used. Similar plots for the three other diseases are given in the Supplementary Material.
4.3 Real-data applications
4.3.1 GWAS
In GWAS, the usual practice was to focus mostly on SNPs that were genome-wide significant, i.e. SNPs that have P-values <0.05 after a Bonferoni adjustment, for follow-up experiments. The reason for this stringent selection was a concern that the selected SNPs might demonstrate significant associations in follow-up studies. As described in Section 4.2, Bayesian ranking can be applied just to a selected group of SNPs, although the Empirical Bayes estimate of the prior must be deduced from the entire set of SNPs. To illustrate this idea, we selected ‘genome-wide significant’ SNPs for the Crohn’s disease data reported in Table 1, which consisted of 968 cases and 995 controls. Our Empirical Bayes ranking procedure, as described in Section 3, was then applied to these 15 SNPs (all SNPs being used to estimate the Empirical Bayes prior, but only the top 15 used in the ranking procedure). Table 2 shows the genome-wide significant SNPs, their ORs, standard errors and ranking according to P-values and estimated Bayesian rankings from the procedure. Note that the SNP ranked number 1 according to Bayesian ranking had the largest absolute-value log-OR from the 15 SNPs; however, the standard error (0.166) is larger than most of the SNPs on the list and as a result, several SNPs have smaller P-values. Note that Bayesian ranking gives almost identical results to ranking by the absolute values of the estimated raw log-ORs in this example, which is an expected feature of the procedure for a group of experiments with low-standard errors. In this case, the Bayesian ranking procedure gives results that are verifiably superior to a P-value based ranking. To show this, we matched SNP ids with the most significant signals from a more recent and comprehensive GWAS of Crohn’s disease involving independent discovery and replication datasets of 6333 cases and 15 056 controls and 15 694 cases and 14 026 controls (Franke et al., 2010). While all 15 significant SNPs from the initial study, shown in Table 2, were involved in one of three significant regions in the follow-up study, only rs2076756 and rs11209026 were exact matches with the 71 ‘top’ SNPs reported in Franke et al. (2010), with rs11209026 (which was ranked 1 by Bayesian ranking, but only 4 using P-values) having the larger absolute-value log OR.
SNP . | . | SE . | P-value . | Rank . | Rank . | Match . |
---|---|---|---|---|---|---|
. | . | . | . | (p) . | (Bayes) . | . |
rs2076756 | 0.55 | 0.07 | 1.26 | 1 | 4 | ** |
rs7517847 | −0.50 | 0.07 | 2.99 | 2 | 6 | — |
rs2066843 | 0.51 | 0.07 | 7.87 | 3 | 5 | — |
rs1343151 | −0.48 | 0.07 | 1.63 | 4 | 7 | — |
rs11209026 | −1.09 | 0.17 | 4.59 | 5 | 1 | ** |
rs10489629 | −0.43 | 0.07 | 6.79 | 6 | 10 | — |
rs10889677 | 0.44 | 0.07 | 9.04 | 7 | 8 | — |
rs2201841 | 0.43 | 0.07 | 3.57 | 8 | 11 | — |
rs11465804 | −0.96 | 0.15 | 3.74 | 9 | 2 | — |
rs11209032 | 0.42 | 0.07 | 8.64 | 10 | 12 | — |
rs1004819 | 0.41 | 0.07 | 1.50 | 11 | 13 | — |
rs8054797 | −0.67 | 0.12 | 1.11 | 12 | 3 | — |
rs2241880 | −0.36 | 0.07 | 4.40 | 13 | 14 | — |
rs5743289 | 0.44 | 0.08 | 4.65 | 14 | 9 | — |
rs7194886 | −0.35 | 0.07 | 1.41 | 15 | 15 | — |
SNP . | . | SE . | P-value . | Rank . | Rank . | Match . |
---|---|---|---|---|---|---|
. | . | . | . | (p) . | (Bayes) . | . |
rs2076756 | 0.55 | 0.07 | 1.26 | 1 | 4 | ** |
rs7517847 | −0.50 | 0.07 | 2.99 | 2 | 6 | — |
rs2066843 | 0.51 | 0.07 | 7.87 | 3 | 5 | — |
rs1343151 | −0.48 | 0.07 | 1.63 | 4 | 7 | — |
rs11209026 | −1.09 | 0.17 | 4.59 | 5 | 1 | ** |
rs10489629 | −0.43 | 0.07 | 6.79 | 6 | 10 | — |
rs10889677 | 0.44 | 0.07 | 9.04 | 7 | 8 | — |
rs2201841 | 0.43 | 0.07 | 3.57 | 8 | 11 | — |
rs11465804 | −0.96 | 0.15 | 3.74 | 9 | 2 | — |
rs11209032 | 0.42 | 0.07 | 8.64 | 10 | 12 | — |
rs1004819 | 0.41 | 0.07 | 1.50 | 11 | 13 | — |
rs8054797 | −0.67 | 0.12 | 1.11 | 12 | 3 | — |
rs2241880 | −0.36 | 0.07 | 4.40 | 13 | 14 | — |
rs5743289 | 0.44 | 0.08 | 4.65 | 14 | 9 | — |
rs7194886 | −0.35 | 0.07 | 1.41 | 15 | 15 | — |
Note: The SNPs that matched rsids with genome-wide significant SNPs in a much larger follow-up study are marked with asterisks.
SNP . | . | SE . | P-value . | Rank . | Rank . | Match . |
---|---|---|---|---|---|---|
. | . | . | . | (p) . | (Bayes) . | . |
rs2076756 | 0.55 | 0.07 | 1.26 | 1 | 4 | ** |
rs7517847 | −0.50 | 0.07 | 2.99 | 2 | 6 | — |
rs2066843 | 0.51 | 0.07 | 7.87 | 3 | 5 | — |
rs1343151 | −0.48 | 0.07 | 1.63 | 4 | 7 | — |
rs11209026 | −1.09 | 0.17 | 4.59 | 5 | 1 | ** |
rs10489629 | −0.43 | 0.07 | 6.79 | 6 | 10 | — |
rs10889677 | 0.44 | 0.07 | 9.04 | 7 | 8 | — |
rs2201841 | 0.43 | 0.07 | 3.57 | 8 | 11 | — |
rs11465804 | −0.96 | 0.15 | 3.74 | 9 | 2 | — |
rs11209032 | 0.42 | 0.07 | 8.64 | 10 | 12 | — |
rs1004819 | 0.41 | 0.07 | 1.50 | 11 | 13 | — |
rs8054797 | −0.67 | 0.12 | 1.11 | 12 | 3 | — |
rs2241880 | −0.36 | 0.07 | 4.40 | 13 | 14 | — |
rs5743289 | 0.44 | 0.08 | 4.65 | 14 | 9 | — |
rs7194886 | −0.35 | 0.07 | 1.41 | 15 | 15 | — |
SNP . | . | SE . | P-value . | Rank . | Rank . | Match . |
---|---|---|---|---|---|---|
. | . | . | . | (p) . | (Bayes) . | . |
rs2076756 | 0.55 | 0.07 | 1.26 | 1 | 4 | ** |
rs7517847 | −0.50 | 0.07 | 2.99 | 2 | 6 | — |
rs2066843 | 0.51 | 0.07 | 7.87 | 3 | 5 | — |
rs1343151 | −0.48 | 0.07 | 1.63 | 4 | 7 | — |
rs11209026 | −1.09 | 0.17 | 4.59 | 5 | 1 | ** |
rs10489629 | −0.43 | 0.07 | 6.79 | 6 | 10 | — |
rs10889677 | 0.44 | 0.07 | 9.04 | 7 | 8 | — |
rs2201841 | 0.43 | 0.07 | 3.57 | 8 | 11 | — |
rs11465804 | −0.96 | 0.15 | 3.74 | 9 | 2 | — |
rs11209032 | 0.42 | 0.07 | 8.64 | 10 | 12 | — |
rs1004819 | 0.41 | 0.07 | 1.50 | 11 | 13 | — |
rs8054797 | −0.67 | 0.12 | 1.11 | 12 | 3 | — |
rs2241880 | −0.36 | 0.07 | 4.40 | 13 | 14 | — |
rs5743289 | 0.44 | 0.08 | 4.65 | 14 | 9 | — |
rs7194886 | −0.35 | 0.07 | 1.41 | 15 | 15 | — |
Note: The SNPs that matched rsids with genome-wide significant SNPs in a much larger follow-up study are marked with asterisks.
4.3.2 RNA-Seq
To investigate how well the described Empirical Bayes ranking method deals with ranking the most highly differently expressed genes in real data, we obtained RNA-Seq read-counts of lymphoblastoid cell lines for 89 Yoruban and 91 Central European individuals, sequenced as part of the 1000 Genomes project (1000 Genomes Project Consortium et al., 2012; Lappalainen et al., 2013). The counts were processed from the raw RNA-Seq files using the Recount2 pipeline, and downloaded from the Recount2 repository (Collado-Torres et al., 2017). We applied the DESeq2 method (without shrinkage of fold-changes) to estimate fold-changes and standard errors for 58 037 genes in the original sample. Subsequently, we removed 37 597 genes where fewer than 3 samples had read-counts per million sequenced reads exceeding 1, resulting in 20 440 genes remaining. These genes were ranked based on their estimated fold-changes and the resulting ranks were considered as a gold standard in subsequent analyses.
Next, we randomly sampled n (n = 5, 20 or 50) individuals from both the Yoruban and the Central European groups. For each random sample of individuals, we ran DESeq2 as described before to estimate fold-changes and standard errors for the 20 440 genes under investigation. Subsequently, we employed the same ranking procedures described in Section 4.1 using the associated subsampled RNA-sequence data, fold-changes and standard errors; namely our Empirical Bayes algorithm (from Section 3), P-values calculated via DE-Seq2, non-negative matrix factorization (Jia et al., 2015), FCROS (Dembélé and Kastner, 2014) and Empirical Bayes ranking where the prior was estimated using smoothing by roughening (Noma and Matsui, 2013). Both Empirical Bayes methods used 1000 samples from the estimated posterior to estimate mean posterior ranks. This procedure of randomly sampling n Yoruban and Central European individuals from the respective groups, using DESeq2 to calculate fold-changes and standard errors and ranking using the five separate approaches, was repeated independently 40 times for each sceanrio. For each of the 40 simulations, identical overlap statistics to those calculated in Section 4.1 were reported, i.e. the percentage match between the geneids corresponding to the top 10 and 100 estimated ranks and the true ranks that are described in the previous paragraph.
The results, displayed as boxplots in Figure 3, indicate an improvement in the performance of the both Empirical Bayes methods as the sample size (n) increases. This relative improvement with increasing n might be expected as these Empirical Bayes estimates assume the standard error is known, which is only approximately true when n gets large, and rely on accurate estimation of the Empirical Bayes prior, which again is easier for larger n. Overall, the Empirical Bayes methods seem to do best in this example when the sample size is large, but the results are here sensitive to the software that creates the gold standard ranking. In the Supplementary Material, we recreate this figure, but instead using Limma to create the gold standard ranking of fold-change from the original 89 versus 91 comparisons. In this case, non-negative matrix factorization has superior performance, particular at n = 5 and n = 20. However, we observe a similar pattern of improvement of the Bayesian methods with increasing sample size.
5 Discussion
Bayesian ranking often represents a more appropriate tradeoff between effect size and standard error than do traditional P-values when sorting the scientific-importance of differing results in massive multiple hypothesis testing problems. Genome-wide scans for genetic or genomic differences between diseased and healthy individuals (as seen in GWAS and RNA-Seq experiments) are typical examples of such problems. In general the method will work well and be superior to a P-value based ranking so long as the standard error of the test statistic varies significantly over differing tests and the proportion of true alternatives (for the algorithm described in this paper, these would be ) is large enough. This second condition is necessary so that the learned prior distribution is an effective estimator of the true distribution of parameters over the different tests.
Bayesian Ranking techniques are not a new idea in the statistics literature, an early reference being Laird and Louis (1989), but have only recently be considered as a technique in genomics (Noma et al., 2010). Our work is quite similar to Noma and Matsui (2013) who considered a non-parametric Empirical Bayes ranking method, that implements the smoothing by roughening approach first considered by Shen and Louis (1999), and applied it to microarray data. In this manuscript, we consider a new approach, and extend consideration to other problems in Genomics such as ranking SNPs for disease association and fold-change ranking in RNA-sequence data. In comparison with smoothing by roughening, our approach fared slightly better when ranking the extent of differential expression in RNA-Sequence data. A plausible explanation for this is requires estimating separate parameters for each point on a grid covering the parameter space. In contrast, our approach uses BIC to adaptively choose the number of mixing components necessary to represent the prior. While normal mixture modeling can suffer with identifiability issues, it requires estimating far fewer parameters than smoothing by roughening, and is less likely to give an overfitted estimate of the prior. We implemented smoothing by roughening using the function in the R-package , which uses 200 grid points for the approximation.
In general, we have demonstrated that Bayesian ranking can be helpful in identifying the most important differentially expressed genes (for particular diseases) and SNPs in GWAS to existing procedures. In the case of GWAS, it has been hypothesized that some of the genetic disease heritability, unaccounted for by the accumulation of disease associated SNPs identified by GWAS, is a result of rare variants (Manolio et al., 2009) which tend to go undiscovered due to lower power and the stringent multiple testing corrections. When these variants are included on GWAS chips, they are typically ranked too low to be discovered by conventional methods, although statistics that pool together several rare variants within the same gene to increase power can help with this problem (Ferguson et al., 2013a; Ionita-Laza et al., 2011). Running the algorithm suggested in this manuscript may also help in this regard as it will up-weight the ranking of rare but potentially important SNPs that have large ORs compared to a P-value-based ranking. More recently, exon-sequencing and whole genome sequencing studies survey perhaps tens of millions of variant positions in the genome, many of which will be uncommon, heightening the necessity to develop methods that are more flexible in identifying important rare variants. When employing the algorithm to rank SNP locations a couple of points need to be kept in mind. First, we assume that conditional on the true parameter vector , the estimated parameters are independent. Technically, SNPs need to be LD-pruned before running the algorithm to approximately satisfy this condition. The algorithm can in theory be modified to run with non-independent effects, provided covariance matrices are known. Second, rankings will likely be more biologically appealing if the input SNP/disease log-ORs are adjusted for other covariates such as age, gender or principle component loadings. Finally, memory and computational issues might prohibit simultaneous genome-wide ranking of all SNPs in GWAS studies. A better strategy is use all SNPs in the Empirical Bayes estimation of the prior, but to pre-select significant SNPs (either based on FDR or FWER rules) before simulation of ranks from the posterior. The actual run time of the algorithm will vary depending on the number of significant SNPs and the number of simulations from the posterior, but using an FDR threshold of 5% and a GWAS of 1 000 000 SNPs, 1000 of which are non-null, one would expect the total run time to be <10 min on a typical desktop PC.
An advantage of Bayesian approaches, in comparison with targeted methods for ranking fold-change parameters such as non-negative matrix factorization and FCROS, is that we can analyze any function of the parameter, θi, rather than the parameter itself if we so wish. For example, if we sample posterior iterates of θi first, we can then transform these iterates using f, and finally rank experiments, , according to their mean value of from the simulations. As a concrete example, in the case of GWAS, population attributable risk (PAR) is a method that can quantify disease impact of a genetic variant on a population level. More specifically, for a given SNP, PAR is the proportion of disease current disease cases that would hypothetically be healthy if everybody had two copies of the major allele at that locus. Provided the disease is rare, PAR is related to the OR, and MAF of the SNP (for cases), in a simple way as , assuming Hardy Weinberg equilibrium in cases (Claus et al., 1996). So, by transforming the simulated true ORs into values into PARs in this way, we can use the procedure to rank PARs.
An R-package, EBrank, to run the ranking procedures described in this manuscript can be downloaded from the CRAN repository.
Funding
Dr Ferguson is supported the grant: EIA-2017-017 awarded by the Health Research Board, Ireland.
Conflict of Interest: none declared.
References
1000 Genomes Project Consortium et al. (