Tristan Mary-Huard, Sarmistha Das, Indranil Mukhopadhyay, Stéphane Robin, Querying multiple sets of P-values through composed hypothesis testing, Bioinformatics, Volume 38, Issue 1, January 2022, Pages 141–148, https://doi.org/10.1093/bioinformatics/btab592
Abstract
Combining the results of different experiments to exhibit complex patterns or to improve statistical power is a typical aim of data integration. The starting point of the statistical analysis often comes as a set of P-values resulting from previous analyses, which need to be combined flexibly to explore complex hypotheses, while guaranteeing a low proportion of false discoveries.
We introduce the generic concept of composed hypothesis, which corresponds to an arbitrary complex combination of simple hypotheses. We rephrase the problem of testing a composed hypothesis as a classification task and show that finding items for which the composed null hypothesis is rejected boils down to fitting a mixture model and classifying the items according to their posterior probabilities. We show that inference can be efficiently performed and provide a thresholding rule to control for type I error. The performance and the usefulness of the approach are illustrated in simulations and on two different applications. The method is scalable, does not require any parameter tuning, and provides valuable biological insight in the application cases considered.
The QCH methodology is available in the qch package hosted on CRAN. Additionally, R codes to reproduce the Einkorn example are available on the personal webpage of the first author: https://www6.inrae.fr/mia-paris/Equipes/Membres/Tristan-Mary-Huard.
Supplementary data are available at Bioinformatics online.
1 Introduction
1.1 Combining analyses
Since the beginning of the omics era, it has been a common practice to compare and intersect lists of P-values, where all lists describe the same items (say genes) whose differential activity was tested and compared in different conditions or using different omics technologies (Irizarry et al., 2005; Natarajan et al., 2012). One may e.g. consider (i) genes whose expression has been measured in Q different types of tissues (and compared to a reference), (ii) genes whose expression has been measured in the same tissue at Q different timepoints (compared to baseline), or (iii) genes whose activity has been investigated through Q omics, such as expression, methylation, and copy number. The goal of the post-analysis is then to identify items that have been perturbed in either all or in a predefined subset of conditions. Finding such genes by integrating information from multiple data sources may be a first step towards understanding the underlying process at stake in the response to treatment, or towards the inference of the underlying regulatory network (Das et al., 2019; Xiong et al., 2012).
1.2 List crossing
One simple way to cross sets of P-values is to apply a multiple testing procedure separately to each list, identify the subsets of items for which the H0 hypothesis is rejected, and then intersect these subsets. The graphical counterpart of this strategy is the popular Venn diagram that can be found in numerous articles (see e.g. Conway et al., 2017 and references therein). Alternatively, a lot of attention has been dedicated to the consensus ranking problem, where one aims at combining multiple ranked lists into a single one that is expected to be more reliable; see Li et al. (2019) for an extensive review. However, neither intersecting nor consensus ranking relies on an explicit biological hypothesis. Moreover, apart from rare exceptions, such as Natarajan et al. (2012), in most cases the final set of identified items comes with no statistical guarantee regarding false positive control.
1.3 Composed hypotheses
The primary objective of the present article is to provide a statistical framework suitable for answering complex queries, hereafter called composed hypotheses, such as ‘which genes are expressed in a (pre-defined) subset of conditions?’. Such queries are complex in that they cannot be answered through standard testing procedures. A classical example of a composed hypothesis is the so-called Intersection-Union Test (IUT) (Berger and Hsu, 1996), where one aims at finding the items for which all the Q tested hypotheses should be rejected, e.g. genes declared differentially expressed in all treatment comparisons.
Testing composed hypotheses on a large set of items requires (i) the identification of a proper test statistic to rank the items based on their P-value profiles and (ii) a thresholding rule that provides guarantees about type I error rate control. Although different methods have been proposed to build the test statistic and to control the type I error rate in the specific context of the IUT problem (Deng et al., 2008; Tuke et al., 2008; Van Deun et al., 2009), no generic framework has emerged to easily handle any arbitrary composed hypothesis so far.
1.4 Contribution
We propose to formulate the composed hypothesis testing problem as an inference task in a mixture model, where the observations correspond to items and the components correspond to classes characterized by a combination of H0 and H1 hypotheses. The strategy we propose is efficient in many ways. First, it comes with a natural way to rank the items and control the type I error rate through their posterior probabilities and the local False Discovery Rate (FDR) interpretation of these probabilities. Second, we show that, under mild conditions on the conditional distributions, inference can be efficiently performed on a large collection of items (up to several million in our simulations) within seconds, making the method amenable to applications such as meta-analysis of genome-wide association studies. The method consists of three steps: first, fit a non-parametric mixture model on each marginal distribution; then estimate the proportions of the joint mixture model; and finally query the composed hypothesis. Importantly, the first two fitting steps are performed only once and can then be used to query any number of composed hypotheses without additional computational burden. Lastly, using both simulated and real genomic data, we illustrate the performance of the strategy (in terms of FDR control and detection power) and its richness in terms of applications. In particular, for the classical IUT problem, the detection power is shown to increase with the number of P-value sets, reaching almost 100% in some cases where a classical list crossing would be almost inefficient.
2 Model
2.1 Composed hypothesis
For a given item, Q hypotheses are tested, and a configuration is a binary vector c = (c_1, …, c_Q) ∈ {0, 1}^Q whose qth coordinate indicates whether the null hypothesis H_0^q (c_q = 0) or its alternative H_1^q (c_q = 1) holds. Considering Q = 3 tests, the configuration c = (1, 0, 0) states that H_1^1 holds, but neither H_1^2 nor H_1^3. A composed hypothesis is then specified by splitting the set of all configurations into a null subset C_0 and an alternative subset C_1.
As an example, in the case where Q = 3 the Intersection Union Test corresponds to the case where C_1 reduces to the configuration (1, 1, 1) and C_0 to the union of all others: C_0 = {0, 1}^3 \ {(1, 1, 1)}. Alternatively, if one aims at detecting items such that hypothesis H_0^1 is rejected, then one can define C_1 = {c ∈ {0, 1}^3 : c_1 = 1}.
In the sequel, we consider an experiment involving n items (e.g. genes or markers), for which one wants to test H_0: c ∈ C_0 versus H_1: c ∈ C_1. We will denote by H_{0i} the composed null hypothesis for item i (i = 1, …, n), and similarly for H_{1i} and so on.
2.2 Joint mixture model
Assume that Q tests have been performed on each of the n items and denote by p_{iq} the P-value obtained for test q on item i. We note P_i = (p_{i1}, …, p_{iQ}) the P-value profile of item i. Let us further define the binary variable Z_i^q, being 0 if H_0^q holds for item i and 1 if H_1^q holds. A vector Z_i = (Z_i^1, …, Z_i^Q) ∈ {0, 1}^Q is hence associated with each item. Assuming that the items are independent, each P-value profile arises from a mixture model with C = 2^Q components defined as follows:
the vectors Z_i are drawn independently within {0, 1}^Q with probabilities w_c = Pr(Z_i = c),
the P-value profiles are drawn independently conditionally on the Z_i's with distribution ψ_c: P_i | {Z_i = c} ∼ ψ_c. (1)
2.3 A classification point of view
2.4 About intersection-union
In the particular case of the IUT, a major difference between the classical approach and the one presented here is that the natural criterion to select the items for which H_{1i} is likely to hold is the posterior probability τ_i = Pr(Z_i ∈ C_1 | P_i), and not the maximal P-value max_q p_{iq}. This of course changes the ranking of the items in terms of significance (see Section 3.4). As will be shown in Sections 4 and 5, this modified ranking impacts the power of the overall procedure. As mentioned earlier, the formalism we adopt enables one to consider more complex composed hypotheses than the IUT, and the same beneficial ranking strategy applies whatever the tested composed hypothesis.
2.5 Modeling the ψ_c
In the sequel, each joint distribution ψ_c is modeled as a product over the Q tests of marginal null or alternative densities, i.e. the P-values are assumed to be independent conditionally on the configuration. Still, the procedure presented here relies on this conditional independence assumption, and its application to cases where the assumption does not hold may lead to spurious results. In particular, the FDR may not be satisfactorily controlled when strong conditional correlations exist between sets of P-values (see Supplementary Table S6). As an example, combining very similar test statistics computed on the same data would result in such conditional correlations; applying the proposed procedure would then constitute an improper use of the methodology.
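Under this assumption, each ψ_c is just a product of marginal null and alternative densities, so the full n × 2^Q density matrix needed for inference can be assembled from Q marginal fits. A minimal R sketch follows; the function name and the placeholder densities are illustrative, and in the actual procedure the set-specific estimates f_1^q obtained in Section 3 would be plugged in:

```r
# Assemble the n x 2^Q matrix of component densities
# psi[i, k] = prod_q f_{c_q}(p_iq), under conditional independence.
build_psi <- function(P, f0, f1) {
  n <- nrow(P); Q <- ncol(P)
  configs <- as.matrix(expand.grid(rep(list(0:1), Q)))  # all 2^Q configurations
  F0 <- apply(P, 2, f0)   # n x Q null marginal densities
  F1 <- apply(P, 2, f1)   # n x Q alternative marginal densities
  psi <- matrix(1, n, nrow(configs))
  for (k in seq_len(nrow(configs))) {
    for (q in seq_len(Q)) {
      psi[, k] <- psi[, k] * (if (configs[k, q] == 1) F1[, q] else F0[, q])
    }
  }
  list(psi = psi, configs = configs)
}

# Toy usage with placeholder densities (illustrative only): a uniform null
# and a Beta(0.3, 1) alternative on the raw P-value scale.
P   <- matrix(runif(300), ncol = 3)
out <- build_psi(P, f0 = dunif, f1 = function(p) dbeta(p, 0.3, 1))
dim(out$psi)   # 100 x 8 when Q = 3
```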
3 Inference
We named our approach the QCH (Query of Composed Hypotheses) procedure. It can be summarized in three steps:
1. Fit a mixture model on each set of (probit-transformed) P-values to get an estimate of each alternative marginal distribution f_1^q;
2. Estimate the proportion w_c of each configuration c using an EM algorithm and deduce the estimates of the conditional probabilities of interest τ_i;
3. Rank the items according to the τ_i and compute an estimate of the false discovery rate to control for multiple testing.
3.1 Marginal distributions
3.2 Fitting marginal mixture models
Robin et al. (2007) showed that there exists a unique set of weights satisfying both (5) and (6), which can be found using a fixed-point algorithm.
In practice, one needs to choose both the kernel function and its bandwidth h. In this article, we used a Gaussian kernel function whose bandwidth can be tuned adaptively from the data using cross-validation (Chacón and Duong, 2018). Both the Gaussian kernel density estimation and the cross-validation tuning are implemented in the kde function of R package ks (Duong, 2007).
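As an illustration of this marginal step, here is a minimal sketch; the simulated data and the choice of the hlscv least-squares cross-validation selector from ks are ours, and the weighting scheme of Robin et al. (2007) is omitted for brevity:

```r
library(ks)

set.seed(1)
# Simulated P-values: 80% uniform under H0, 20% stochastically small under H1
p <- c(runif(8000), pnorm(rnorm(2000, mean = 2), lower.tail = FALSE))
x <- qnorm(1 - p)                      # probit transform of the P-values

h    <- hlscv(x)                       # cross-validated bandwidth
fhat <- kde(x = x, h = h)              # Gaussian kernel density estimate
f    <- function(t) predict(fhat, x = t)
f(c(0, 2, 4))                          # density evaluated at new points
```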
3.3 Configuration proportions
Observe that the log-likelihood of the corresponding mixture model is strictly concave with respect to the weights wc (as a direct consequence of Jensen’s inequality), therefore the EM algorithm will converge to the unique global maximum.
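A minimal sketch of this EM step, taking as input an n × C matrix psi of component densities ψ_c(P_i) (as assembled in the sketch of Section 2.5); the stopping rule is illustrative:

```r
# EM for the configuration proportions w_c, the component densities being
# held fixed: the E-step computes the posteriors tau, the M-step averages them.
fit_weights <- function(psi, tol = 1e-8, max_iter = 1000) {
  C <- ncol(psi)
  w <- rep(1 / C, C)                    # uniform starting point
  for (iter in seq_len(max_iter)) {
    tau   <- sweep(psi, 2, w, `*`)      # E-step: unnormalized posteriors
    tau   <- tau / rowSums(tau)
    w_new <- colMeans(tau)              # M-step: update the proportions
    delta <- max(abs(w_new - w))
    w <- w_new
    if (delta < tol) break
  }
  list(w = w, tau = tau)
}
```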
3.4 Ranking and error control
We emphasize that QCH already has two attractive features. First, because inference steps 1 and 2 do not require the knowledge of the composed hypothesis to be tested, once the model is fitted one can query any number of composed hypotheses without any additional computational burden. Second, the procedure does not require any model selection step for the tuning of the number of components in mixture model (1) as it corresponds to the number of configurations C, which is directly deduced from the number of test sets Q.
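A sketch of step 3, assuming tau is the n × C posterior matrix returned by the EM above and C0 indexes the configurations belonging to the composed null; the thresholding rule shown (the running mean of the sorted local FDRs) is the standard local-FDR-based FDR estimate, assumed here to match the rule implemented in the package:

```r
# Rank items by their posterior probability of the composed null (their
# local FDR) and reject while the estimated FDR stays below alpha.
qch_reject <- function(tau, C0, alpha = 0.05) {
  lfdr <- rowSums(tau[, C0, drop = FALSE])     # Pr(composed H0 | profile)
  ord  <- order(lfdr)
  fdr  <- cumsum(lfdr[ord]) / seq_along(ord)   # estimated FDR along the ranking
  ord[fdr <= alpha]                            # indices of rejected items
}
```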
4 Simulations
In this section the performance of the QCH method is evaluated and compared to those of two procedures previously considered for the IUT:
The Pmax procedure consists in considering, for each item i, its maximum P-value p_i^max = max_q p_{iq} as both the test statistic, for ranking, and the associated P-value, for false positive (FP) control. Once p_i^max is computed, a multiple testing procedure is applied; here we applied the Benjamini–Hochberg (BH, Benjamini and Hochberg, 1995) procedure for FDR control. This procedure corresponds to the IUT procedure described in Zhong et al. (2019).
The FDR set intersection procedure (hereafter called IntersectFDR) consists in applying an FDR control procedure (namely, the BH procedure) to each of the Q P-value sets. The Q lists of items for which H_0^q has been rejected are then intersected. This corresponds to the ‘list crossing strategy’ presented in the Introduction section (see e.g. Conway et al., 2017). Both baselines are sketched in R below.
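A minimal sketch of the two baselines; P is an n × Q P-value matrix, and the toy data are illustrative:

```r
alpha <- 0.05
P <- matrix(runif(4000), ncol = 4)             # toy n x Q P-value matrix

# Pmax: maximum P-value as both statistic and P-value, followed by BH
p_max    <- apply(P, 1, max)
sel_pmax <- which(p.adjust(p_max, method = "BH") <= alpha)

# IntersectFDR: BH within each of the Q sets, then intersect the lists
lists     <- lapply(seq_len(ncol(P)),
                    function(q) which(p.adjust(P[, q], method = "BH") <= alpha))
sel_inter <- Reduce(intersect, lists)
```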
In the IUT setting, the set C_1 of the corresponding QCH procedure reduces to the configuration c_max = (1, …, 1) that consists only of 1’s. All the analyses presented in this section were performed with a type-I error rate nominal level fixed at α = 5%.
We first consider a scenario where Q sets of P-values are generated as follows. First, the proportion of H0 hypotheses in each set is drawn according to a Dirichlet distribution. The proportion of each configuration c is then obtained by multiplying the proportions of the corresponding H0 and H1 hypotheses. Once the configuration proportions are obtained, the one corresponding to the full H1 configuration (i.e. all hypotheses of the configuration are H1) is re-weighted upward to ensure a minimal representation of this configuration (see the simulation code provided as Supplementary Material). Note that imposing a minimal proportion for the full H1 configuration also yields non-independent P-value series, as the probability of being under any configuration is no longer equal to the product of the corresponding marginal probabilities. The configuration memberships Z_i are then independently drawn according to Model (1). The test statistics X_{iq} are drawn independently as X_{iq} ∼ N(μ_{iq}, 1), where μ_{iq} = 0 if H0 holds for item i in condition q, and μ_{iq} = δ otherwise, for a deviation δ > 0 common to all sets. Lastly, these test statistics are translated into P-values. Several values of n (10^4, 10^5, 10^6) and Q (2, 4, 8) are considered, and for a given parameter setting (n, Q) 100 P-value matrices were generated. This scenario is called ‘Equal Power’ hereafter since the deviation μ_{iq} under H1 is the same for all P-value sets.
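A condensed sketch of this generator; the Dirichlet parameters, the deviation delta and the minimal full-H1 proportion are illustrative placeholders for the settings of the Supplementary simulation code:

```r
# "Equal Power" generator: per-set H0 proportions, configuration weights,
# re-weighted full-H1 configuration, Gaussian test statistics, P-values.
simulate_equal_power <- function(n = 1e4, Q = 4, delta = 2, min_full = 0.02) {
  configs <- as.matrix(expand.grid(rep(list(0:1), Q)))     # 2^Q configurations
  pi0 <- rbeta(Q, 1, 1)                    # H0 proportion per set (illustrative prior)
  w <- apply(configs, 1, function(c) prod(ifelse(c == 1, 1 - pi0, pi0)))
  full    <- which(rowSums(configs) == Q)  # full-H1 configuration
  w[full] <- max(w[full], min_full)        # enforce a minimal representation
  w <- w / sum(w)
  z  <- sample(nrow(configs), n, replace = TRUE, prob = w)  # memberships
  mu <- configs[z, , drop = FALSE] * delta                  # 0 under H0, delta under H1
  X  <- matrix(rnorm(n * Q), n, Q) + mu
  P  <- pnorm(X, lower.tail = FALSE)       # one-sided P-values, for simplicity
  list(P = P, Z = configs[z, , drop = FALSE])
}

sim <- simulate_equal_power(n = 1e4, Q = 4)
```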
Figure 1 displays the ROC curves of the Pmax and QCH procedures, computed on a single simulated dataset with Q = 2 or 8. The x-axis (respectively the y-axis) corresponds to the FP rate (respectively the TP = true positive rate) computed for all possible cutoffs on either p_i^max or the posterior probabilities τ_i. When Q = 2 the two procedures are roughly equivalent in terms of ranking, with a slight advantage for QCH. When Q = 8 QCH significantly outperforms Pmax and reaches an AUC close to 1. This behavior is even more pronounced for larger values of n (not shown) and shows that using p_i^max as a test statistic is only relevant for small values of Q (typically 2 or 3). The ranking based on the QCH posterior probabilities always performs better than the ranking based on Pmax, and the higher Q, the higher the gap in terms of AUC. Also, note that one cannot easily derive a ROC curve for IntersectFDR comparable to those of Pmax and QCH, since IntersectFDR is based on list crossing rather than on selecting items based on an explicit test statistic.
Fig. 1. Left: ROC curves for the Pmax (blue) and QCH (red) procedures, when Q = 2. The x-axis displays the FP rate and the y-axis the TP rate. Right: same graph with Q = 8.
Table 1 provides additional insight into the comparison between methods. First, one can observe that all procedures control the type-I error rate, i.e. the actual error rate is lower than the nominal level whatever the simulation setting. Importantly, whatever Q, the power of Pmax is always close to 0 whereas the power of QCH is not. Since there is little difference between Pmax and QCH in terms of ranking when Q = 2, this means that the multiple testing procedure for Pmax is conservative, i.e. the maximum P-value should not be used as the P-value of the test. One can also observe that the power of the two baseline methods decreases as Q increases, whereas QCH displays the opposite behavior.
Table 1. Performance of the three procedures for the ‘Equal Power’ scenario (IUT composed hypothesis)

| n | Q | Pmax FDR | Pmax power | IntersectFDR FDR | IntersectFDR power | QCH FDR | QCH power |
|---|---|---|---|---|---|---|---|
| 10^4 | 2 | 0 (0) | 0 (0.001) | 0.059 (0.089) | 0.034 (0.025) | 0.054 (0.1) | 0.031 (0.022) |
| 10^4 | 4 | 0 (0) | 0 (0) | 0 (0) | 0.001 (0.002) | 0.024 (0.038) | 0.095 (0.086) |
| 10^4 | 8 | 0 (0) | 0 (0) | 0 (0) | 0 (0) | 0.004 (0.007) | 0.363 (0.134) |
| 10^5 | 2 | 0 (0) | 0 (0) | 0.059 (0.031) | 0.031 (0.023) | 0.044 (0.025) | 0.027 (0.018) |
| 10^5 | 4 | 0 (0) | 0 (0) | 0.012 (0.049) | 0.001 (0.001) | 0.032 (0.022) | 0.09 (0.082) |
| 10^5 | 8 | 0 (0) | 0 (0) | 0 (0) | 0 (0) | 0.004 (0.002) | 0.364 (0.112) |
| 10^6 | 2 | 0 (0) | 0 (0) | 0.058 (0.022) | 0.032 (0.023) | 0.047 (0.008) | 0.027 (0.02) |
| 10^6 | 4 | 0 (0) | 0 (0) | 0.009 (0.018) | 0.001 (0.001) | 0.03 (0.007) | 0.094 (0.08) |
| 10^6 | 8 | 0 (0) | 0 (0) | 0 (0) | 0 (0) | 0.003 (0.001) | 0.349 (0.105) |

Note. For each procedure, the FDR and power (averaged over 100 runs) are displayed for different settings. Numbers in brackets correspond to standard errors.
In a second scenario called ‘Linear Power’, P-value sets were generated the same way as in the previous case, except that the deviation parameter μ_{iq} is 0 if H0 holds for item i in condition q, and μ_q otherwise, where μ_q increases with q, i.e. the statistical power associated with set q increases with q. The performance of the three procedures is displayed in Table 2. One can observe that the Pmax and IntersectFDR procedures get little benefit from the fact that the distinction between H0 and H1 is easy for some P-value sets. On the contrary, QCH successfully exploits this information, consistently achieving a power higher than 60% for Q ≥ 4 when the power of Pmax is 0 and that of IntersectFDR is lower than 15%.
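The generator sketched above needs only its deviation line changed so that μ depends on the set index q; the increasing sequence below is purely illustrative:

```r
# "Linear Power" variant: set-specific deviations mu_q increasing with q
Q    <- 4
mu_q <- seq(1, 3, length.out = Q)        # illustrative increasing sequence
# In simulate_equal_power(), replace `mu <- configs[z, , drop = FALSE] * delta` by:
# mu <- sweep(configs[z, , drop = FALSE], 2, mu_q, `*`)
```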
Table 2. Performance of the three procedures for the ‘Linear Power’ scenario (IUT composed hypothesis)

| n | Q | Pmax FDR | Pmax power | IntersectFDR FDR | IntersectFDR power | QCH FDR | QCH power |
|---|---|---|---|---|---|---|---|
| 10^4 | 2 | 0 (0) | 0.001 (0.003) | 0.059 (0.039) | 0.134 (0.063) | 0.05 (0.044) | 0.138 (0.069) |
| 10^4 | 4 | 0 (0) | 0 (0) | 0.01 (0.02) | 0.127 (0.06) | 0.04 (0.016) | 0.713 (0.179) |
| 10^4 | 8 | 0 (0) | 0 (0.001) | 0 (0) | 0.133 (0.065) | 0.028 (0.015) | 0.995 (0.021) |
| 10^5 | 2 | 0 (0) | 0 (0) | 0.056 (0.025) | 0.127 (0.056) | 0.048 (0.009) | 0.135 (0.052) |
| 10^5 | 4 | 0 (0) | 0 (0) | 0.009 (0.01) | 0.124 (0.049) | 0.037 (0.008) | 0.696 (0.163) |
| 10^5 | 8 | 0 (0) | 0 (0) | 0 (0.001) | 0.136 (0.054) | 0.024 (0.007) | 1 (0.001) |
| 10^6 | 2 | 0.001 (0.013) | 0 (0.002) | 0.062 (0.019) | 0.137 (0.059) | 0.05 (0.003) | 0.137 (0.058) |
| 10^6 | 4 | 0 (0) | 0 (0) | 0.009 (0.011) | 0.126 (0.058) | 0.038 (0.007) | 0.705 (0.172) |
| 10^6 | 8 | 0 (0) | 0 (0) | 0 (0) | 0.129 (0.053) | 0.023 (0.003) | 1 (0) |

Note. For each procedure, the FDR and power (averaged over 100 runs) are displayed for different settings. Numbers in brackets correspond to standard errors.
We now consider the composed hypothesis ‘at least Q − 1 of the Q null hypotheses are rejected’. In the case where Q = 2, the set C_1 of the corresponding QCH procedure is {(0, 1), (1, 0), (1, 1)}. The two procedures Pmax and IntersectFDR can be empirically modified to handle such a composed hypothesis as follows:
rather than computing Pmax as the maximum P-value, one can take the second-largest P-value as the test statistic,
rather than intersecting all Q FDR lists, one can consider all the combinations of Q − 1 intersected lists among Q, and take the union of the resulting lists. Both adaptations are sketched in R below.
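A sketch of these two ad hoc adaptations, in the same toy setting as before:

```r
alpha <- 0.05
P <- matrix(runif(4000), ncol = 4)
Q <- ncol(P)

# Adapted Pmax: second-largest P-value per item as the test statistic, then BH
p_2nd  <- apply(P, 1, function(p) sort(p, decreasing = TRUE)[2])
sel_p2 <- which(p.adjust(p_2nd, method = "BH") <= alpha)

# Adapted IntersectFDR: union over all (Q-1)-wise intersections of the Q lists
lists   <- lapply(seq_len(Q), function(q) which(p.adjust(P[, q], method = "BH") <= alpha))
subsets <- combn(Q, Q - 1, simplify = FALSE)
sel_int <- Reduce(union, lapply(subsets, function(s) Reduce(intersect, lists[s])))
```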
Note that these adaptations are feasible for some composed hypotheses, but a general composed hypothesis usually requires a complete redefinition of the testing procedure, which can become quite cumbersome, whereas no additional fitting is required for QCH. Indeed, for each new composed hypothesis to be tested, only the posterior probabilities corresponding to the new hypothesis have to be computed. This comes without any additional computational cost, since one only needs to sum up the configuration posterior probabilities corresponding to the new hypothesis, which does not require refitting Model (1). The results are provided in Table 3 for the ‘Linear Power’ scenario. In terms of performance, the detection problem becomes easier for all procedures (since the proportion of H1 items is now much bigger). Still, QCH consistently achieves the best performance and significantly outperforms its competitors as soon as Q ≥ 4, being close to a 100% detection rate when Q = 8, while efficiently controlling the FDR at its nominal level. Additional results on alternative simulation settings can be found in Supplementary Material.
Table 3. Performance of the three procedures for the ‘Linear Power’ scenario and the ‘at least Q − 1’ composed hypothesis

| n | Q | Pmax FDR | Pmax power | IntersectFDR FDR | IntersectFDR power | QCH FDR | QCH power |
|---|---|---|---|---|---|---|---|
| 10^4 | 2 | 0.06 (0.012) | 0.624 (0.111) | 0.031 (0.006) | 0.519 (0.123) | 0.05 (0.004) | 0.619 (0.126) |
| 10^4 | 4 | 0.005 (0.004) | 0.262 (0.029) | 0.022 (0.015) | 0.411 (0.048) | 0.049 (0.007) | 0.733 (0.093) |
| 10^4 | 8 | 0 (0) | 0.201 (0.018) | 0 (0.001) | 0.353 (0.037) | 0.036 (0.007) | 1 (0) |
| 10^5 | 2 | 0.063 (0.013) | 0.608 (0.116) | 0.032 (0.007) | 0.502 (0.127) | 0.05 (0.002) | 0.598 (0.133) |
| 10^5 | 4 | 0.005 (0.004) | 0.265 (0.011) | 0.021 (0.013) | 0.412 (0.04) | 0.047 (0.003) | 0.727 (0.092) |
| 10^5 | 8 | 0 (0) | 0.199 (0.005) | 0 (0) | 0.365 (0.038) | 0.029 (0.004) | 1 (0.001) |
| 10^6 | 2 | 0.062 (0.012) | 0.613 (0.11) | 0.032 (0.006) | 0.504 (0.122) | 0.05 (0.001) | 0.601 (0.128) |
| 10^6 | 4 | 0.005 (0.003) | 0.264 (0.009) | 0.022 (0.013) | 0.413 (0.035) | 0.047 (0.002) | 0.724 (0.088) |
| 10^6 | 8 | 0 (0) | 0.199 (0.002) | 0 (0) | 0.356 (0.032) | 0.029 (0.004) | 1 (0.001) |

Note. For each procedure, the FDR and power (averaged over 100 runs) are displayed for different settings. Numbers in brackets correspond to standard errors.
5 Illustrations
5.1 TSC dataset
In our first application, we consider the Tuberous Sclerosis Complex (TSC) dataset obtained from the Database of Genotypes and Phenotypes website (dbGaP accession code phs001357.v1.p1). The dataset consists of 34 TSC and seven control tissue samples, for which gene expression and/or methylation were quantified. Differential analysis was performed separately on the two types of omics data, leading to the characterization of 7222 genes for expression and 273 215 CpG sites for methylation. To combine the corresponding P-value sets, we considered pairs between a gene and a CpG site, the pairs being constituted according to one of the following two strategies:
(a) pair a gene with the methylation sites directly located at the gene position; this strategy resulted in 128 879 pairs (some CpG sites not being directly linked to any gene) and is hereafter called ‘Direct’ (for direct vicinity),
(b) pair a gene with any methylation site that is physically close to that gene due to chromatin interactions; this strategy, hereafter called ‘HiC’ as a reference to the chromosome conformation capture technique employed, resulted in 6 532 368 pairs.
Depending on the strategy, the same gene could be paired with several methylation sites and vice versa. Strategy (b) requires additional information about chromatin interactions, which was obtained from HiC data available on Gene Expression Omnibus (GEO accession code GSM455133).
The purpose of the data integration analysis is then to identify pairs that exhibit a significant differential effect on both expression and methylation. In terms of composed hypothesis, this boils down to performing an IUT: with Q = 2, the set C_1 reduces to the configuration (1, 1) and C_0 is the union of all others, {(0, 0), (1, 0), (0, 1)}. In addition to QCH, we evaluated the Pmax procedure associated with a Benjamini–Hochberg correction.
The results are summarized in Table 4. Using the Direct strategy combined with the QCH procedure, 4030 genes were identified as significant. As in the simulation studies, the QCH procedure detected a larger number of pairs and genes than Pmax. Applying the HiC strategy resulted in a significantly higher number of tested pairs but a lower number of identified genes. Interestingly, the list of genes detected with the HiC strategy contains many candidates that were not detected using the Direct strategy. Among these candidates found with the HiC strategy only are TSC1 and TSC2, whose mutations are well known to be associated with the TSC disease. The QCH procedure also identified three additional genes (again with the HiC strategy only) that were not detected using Pmax, and whose combinations with methylation sites in the TSC1 gene were significant. These genes are LRP1B, PEX14, and CHD7. LRP1B is associated with renal angiomyolipoma (AML), and AML is observed in 75% of patients with TSC (Wang et al., 2020). PEX14 is known to interact with PEX5 for importing proteins into the peroxisomes (Neuhaus et al., 2014). Mutations in CHD7 have been found to increase the risk of idiopathic autism (O’Roak et al., 2012), which could suggest a link between monogenic disorders like TSC and idiopathic autism due to complex inheritance (Gamsiz et al., 2015). It has indeed been reported that TSC patients may develop abnormal neurons and cellular enlargement, making it difficult to differentiate between TSC and autistic symptoms in the setting of severe intellectual disability (Takei and Nawa, 2014). In conclusion, the QCH procedure combined with the HiC information detected genes previously reported in different studies to have a functional association with the disease. These genes may reveal more insight into the genetic architecture of TSC.
Table 4. Number of genes identified through the different pairing strategies and procedures (Pmax or QCH)

| Strategy | Total Pmax | Total QCH | Extra Pmax | Extra QCH |
|---|---|---|---|---|
| Direct | 3624 | 4030 | 4 | 410 |
| HiC | 3501 | 3875 | 0 | 374 |

Note. The ‘Extra’ columns give the number of genes identified by one procedure but not by the other.
5.2 Einkorn dataset
We consider the Einkorn study of Bonnot et al. (2020), in which the grain transcriptome was investigated under four nutritional treatments at different time points during grain development (see the original reference for a complete description of the experimental design). For each combination of a treatment and a timepoint, three einkorn grains were harvested and gene expression was quantified through RNAseq. The results of the differential analyses are publicly available at forgemia.inra.fr/GNet/einkorn_grain_transcriptome/-/tree/master/. Here we focus on the comparison between the NmSm (control) treatment and the NpSm (nitrate-enriched) treatment, compared at Q = 4 different timepoints (T300, T400, T500, and T600). The differential analysis involved n = 12,327 transcripts and was performed at each timepoint. We extracted the results, which were summarized into an n × 4 P-value matrix.
Only the results of QCH are reported, since there is no clear way to adapt the Pmax or IntersectFDR procedures to the present setting. The results are displayed in Figure 2 and Table 5. A total of 249 transcripts were declared significant for the composed alternative hypothesis at an FDR nominal level of 5%. These transcripts were further classified into a configuration according to their maximal posterior, resulting in the four-class classification presented in Table 5. The table displays, for each configuration, the median value of the −log10(P-value) at the different timepoints over the transcripts belonging to the class. One can first observe that several configurations, such as 1100 or 1110, are missing, i.e. no transcripts are differentially expressed at early times but not later. Alternatively, configurations 0110 and 0011 are well represented, by 12 and 30 transcripts, respectively. These classes can be clearly observed in Figure 2, which displays the (−log10-transformed) P-value profiles of the 249 transcripts over time, classified according to their configuration (color level on the left side). One can observe the time-shifted effect of the transcripts belonging to these two classes. Identifying these transcripts provides valuable insight into the kinetics of the response to nitrate enrichment that could be incorporated in, or confirmed by, the network reconstruction to follow.
Fig. 2. P-value profiles of the transcripts over time, after −log10 transformation. Colors on the left side correspond to the configurations (black = 0110, red = 0011, green = 0111, and blue = 1111).
Table 5. Description of the four H-config classes identified from the NmSm–NpSm comparison

| Config | NbTrspt | T300 | T400 | T500 | T600 |
|---|---|---|---|---|---|
| 0110 | 12 | 0.2225031 | 3.8242946 | 3.296769 | 0.3776435 |
| 0011 | 30 | 0.2200245 | 0.1256215 | 4.479422 | 4.4642771 |
| 0111 | 204 | 0.2982042 | 2.4155116 | 2.851227 | 3.1877051 |
| 1111 | 3 | 11.9843371 | 14.9638330 | 15.251248 | 17.2187644 |

Note. For each class (in row) and each timepoint (in column), the median value of the −log10(P-value) is reported, along with the number of detected transcripts (column NbTrspt).
6 Discussion
6.1 Summary
We introduced a versatile approach to assess composed hypotheses, that is, any set of configurations of null and alternative hypotheses among Q tested hypotheses. The approach relies on a multivariate mixture model that is fitted efficiently, so that very large omic datasets can be handled. The classification point of view we adopted yields a natural ranking of the items based on their posterior probabilities. These probabilities can also be used as local FDR estimates to compute (and control) the FDR. Note that in the present work we considered FDR control, but one could directly control the local FDR (Efron, 2008), or alternatively control more refined error rates, such as the multiple FDR developed in the context of multi-class mixture models (Mary-Huard et al., 2013).
The Simulation section illustrated the gap between Pmax and QCH in terms of power. The poor performance of the Pmax procedure is due to the use of Pmax as both a test statistic (i.e. for ranking the items) and a P-value (i.e. the P-value associated with Pmax is assumed to be Pmax itself). Although this corresponds to what has been applied in practice (Zhong et al., 2019), one can observe that Pmax cannot be considered as a P-value since it is not uniformly distributed under the composed null hypothesis. Consequently, a direct application of multiple testing correction procedures to Pmax automatically leads to a conservative testing procedure. Although it may be feasible to improve on current practice, note that (i) finding the null distribution of Pmax and (ii) extending the Pmax procedure to the more general question of testing any composed hypothesis are two difficult tasks. The QCH methodology presented here solves these two problems, with good power, FDR control, and computational efficiency.
From a more general perspective, we emphasize that, like the whole theory of statistical testing, the proposed methodology is valid (i.e. the error rate is controlled) only when the composed hypotheses H_0 and H_1 are specified a priori. More precisely, valid conclusions can be drawn if and only if the set of null configurations C_0 (and, consequently, C_1) is fixed beforehand, that is, before having any look at the data. The flexibility of the QCH procedure should not be used as a tool to artificially magnify statistical significance.
6.2 Future works
The proposed methodology also provides information about the joint distribution of the latent variables Z_i^1, …, Z_i^Q, that is, the probability that, say, both H_1^1 and H_1^2 hold, but not H_1^3. This distribution encodes the dependency structure between the tests. This information deserves careful further investigation, as it provides insights into the way items (e.g. genes) respond to each combination of tested treatments.
Although the proposed model allows for dependency between the P-values through the distribution of the latent variables Z_i, conditional independence (i.e. independence within each configuration) is assumed to alleviate the computational burden. This assumption could be relaxed in various ways. A parametric form, such as a (probit-transformed) multivariate Gaussian with a constrained covariance matrix, could be considered for each joint distribution ψ_c. Alternatively, a non-parametric form could be preserved, using copulas to encode the conditional dependency structure.
Acknowledgements
SD and IM acknowledge the submitters of the TSC data to the dbGaP repository (https://www.ncbi.nlm.nih.gov/projects/gap/cgi-bin/study.cgi?study_id=phs001357.v1.p1).
Funding
This work has been supported by the Indo-French Center for Applied Mathematics (IFCAM), the ‘Investissement d’Avenir’ project (Amaizing, ANR-10-BTBR-0001), and the Department of Biotechnology, Govt. of India, which partially supported this study through SyMeC.
Conflict of Interest: none declared.