Genome-wide multimediator analyses using the generalized Berk–Jones statistics with the composite test

Abstract

Motivation: Mediation analysis is performed to evaluate the effects of a hypothetical causal mechanism that marks the progression from an exposure, through mediators, to an outcome. In the age of high-throughput technologies, it has become routine to assess numerous potential mechanisms at the genome or proteome scale, and alongside this, the necessity to address multiple testing has also arisen. In a sparse scenario where only a few genes or proteins are causally involved, conventional methods for assessing mediation effects lose statistical power because they fail to capture the null distribution under the composite null hypothesis. This power loss in turn reduces the number of true mechanisms identified after multiple testing corrections. To provide p-values that are uniformly distributed under the composite null, Huang (Genome-wide analyses of sparse mediation effects under composite null hypotheses. Ann Appl Stat 2019a;13:60-84; AoAS) proposed the composite test for single-mediator analyses.

Results: Our contribution is to extend the method to multimediator analyses, which are commonly encountered in genomic studies and flexible enough to accommodate various biological interests. Using the generalized Berk–Jones statistics with the composite test, we propose a multivariate approach that favors dense and diverse mediation effects, a decorrelation approach that favors sparse and consistent effects, and a hybrid approach that captures the advantages of both. Our analysis suite has been implemented as an R package, MACtest. The utility is demonstrated by analyzing the lung adenocarcinoma datasets from The Cancer Genome Atlas and the Clinical Proteomic Tumor Analysis Consortium. We further investigate the genes and networks whose expression may be regulated by smoking-induced epigenetic aberrations.

Availability and implementation: The R package MACtest is available at https://github.com/roqe/MACtest.


Introduction
Composite null hypothesis. Studies [1,2] have described two important issues in this scenario. First, the null hypothesis for each mechanism, H0: αS = 0 ∪ βM = 0, is a composite null of three mutually exclusive simple nulls (both coefficients zero, only αS zero, or only βM zero):

H0 = H00 ∪ H0α ∪ H0β. (1)

If we cannot determine the proportions of the simple nulls, then deriving the null distribution of any test statistic is challenging. Second, the re-expression of the null hypothesis, H0: αS βM = 0, reveals that the test statistic involves the product of the coefficient estimators, α̂S β̂M. If the sample size is sufficient, then the estimators of the regression coefficients, α̂S and β̂M, behave as two normally distributed random variables, and their product α̂S β̂M follows a normal product distribution. Because the composite null consists of three simple nulls, the underlying null distribution is also a mixture of three types of normal product distributions, as illustrated in Figure S1(a). Notably, the density function under H00 goes to infinity at zero. In other words, a large portion of the statistics are centered at zero when H00 makes up the majority of H0, and a conservative conclusion is reached when a normal approximation is used to obtain p-values. As shown in Figure S1(b), both the normal approximation (Sobel) test and the joint significance test are conservative; only the composite test provides uniform p-values under H0.
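The conservativeness of the Sobel test under the complete null can be seen in a small Monte Carlo sketch. All settings here are illustrative assumptions, not the paper's simulation design: under H00 both standardized estimators are taken as independent standard normals, and the Sobel statistic then rejects far less often than the nominal level.

```python
import numpy as np
from math import erfc, sqrt

rng = np.random.default_rng(0)
n_tests = 20000
# Illustrative assumption: under the complete null H00 (alpha = beta = 0),
# the standardized coefficient estimators are independent standard normals.
a = rng.normal(size=n_tests)   # plays the role of alpha-hat / se(alpha-hat)
b = rng.normal(size=n_tests)   # plays the role of beta-hat / se(beta-hat)

# Sobel statistic for the product, referred to a N(0, 1) approximation.
z = a * b / np.sqrt(a**2 + b**2)
p = np.array([erfc(abs(zi) / sqrt(2)) for zi in z])  # two-sided normal p-value

type1 = np.mean(p < 0.05)  # empirical rejection rate at nominal 0.05
```

Because z is bounded by half the radial part of (a, b), the empirical type I error sits orders of magnitude below 0.05, matching the conservative behavior described above.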
Causal assumptions. We define Y(s, m) as the counterfactual outcome of Y; the random variable Y(s, m) fully describes the behavior of Y in a possible world in which the exposure S has been set to s and the mediators M have been set to m. Similarly, M(s) is defined as the counterfactual mediators in a possible world in which S has been set to s. We list the identifiability assumptions for mediation effects as follows: (A1) Y(s, m) ⊥ S | X: no unmeasured confounding between the exposure and the outcome.
(A2) Y(s, m) ⊥ M | S, X: no unmeasured confounding between the mediators and the outcome. (A3) M(s) ⊥ S | X: no unmeasured confounding between the exposure and the mediators. (A4) Y(s, m) ⊥ M(s*) | X: no mediator-outcome confounder that is itself affected by the exposure. Under assumptions A1-A4, the mediation effects in models (1) and (2) are identifiable. Assumption A4 suggests the absence of exposure-induced confounding. The confounding illustrated with solid edges can be adjusted for in the model, but that illustrated with dashed edges should be avoided.
This null hypothesis can distinguish the mediation effects in Figure S3(a) from the null in Figure S3(b). However, a special case exists in which the effect from the exposure to the mediator is not directly linked to the effect from the mediator to the outcome; this is known as the disjoint effect, as depicted in Figure S4. A case involving the disjoint effect is difficult to interpret because the causal relation between the mediators remains unverified, and the null hypothesis may be rejected in this scenario [3] because it is compatible with an authentic mediation effect, as indicated in Figure S4(b). Figure S4 illustrates the possible causal interpretations of correlated mediators. Notably, if the mediators are independent, then the disjoint effect case should be regarded as null (no effect). Let Ŝij = Ĉov(Mi, Mj) denote the estimator of the covariance between mediators; it has zero mean and variance ξ under independence. We assume that the mediators within a predefined mechanism are all correlated in practice, since Ŝij is a continuous random variable and hence P(Ŝij = 0) = 0 for all ξ > 0; therefore, no mechanism with independent mediators and a disjoint effect is involved. This assumption is also biologically acceptable, since the biomolecules within a mechanism (e.g., co-regulation of the same gene) usually work together. Another special case is the perfect cancellation scenario depicted in Figure S3(d), related to the causal faithfulness assumption. Because the test statistic for Gi takes into account the effects of all the mediators within the mechanism, a path with a positive effect may exist whose magnitude is identical to that of a path with a negative effect, which results in no marginal mediation effect.
We assumed that no perfect cancellation exists, because the focus of this study was the evaluation of any mediation effect at an element-wise level. This assumption is also acceptable in practice, since exact cancellation occurs with probability zero [4]. We apply this assumption only to the decorrelation approach, not to the multivariate approach.
Figure S3: Two-mediator model. If no edge exists between two variables, then no association exists between them. A solid edge indicates an association, and a dashed edge indicates a causal effect that may or may not exist. The effects of the purple path and the emerald path are of identical strength but in opposite directions; thus, the total mediation effect is perfectly canceled.
Figure S5: Independence at the mechanism level. Panel (b) shows an experiment involving multiple-mediator models. Under the null hypothesis, the composite test proposed by Huang provides uniformly distributed p-values for each mechanism in (a). The aim of the present study was to generalize this idea to the multiple-mediator models in (b), enabling the evaluation of complicated mechanisms in practice. Notably, the composite test proposed by Huang requires independence at the mechanism level or, more precisely, independence between all coefficient estimators in (a).

Method
Decorrelation of mediators. We use Pi = (Pij) to denote the vector of decorrelated mediators, Pi = ui Mi, where ui is obtained from the eigendecomposition of the estimated covariance matrix Σ̂m. The tilde coefficients are linear combinations of the original coefficients. For example, a two-mediator model allows for the construction of a 2×2 orthonormal matrix u = [[p, q], [r, s]] that satisfies uΣuᵀ = diag(σ1², σ2²) and P ~ N(0, uΣuᵀ), with α̃S = uαS and β̃M = uβM. If the original model carries a disjoint effect (α1S ≠ 0, α2S = 0, β1M = 0, β2M ≠ 0), then the effects of the decorrelated model become weighted averages of the original coefficients (i.e., α̃1S = p·α1S, α̃2S = r·α1S, β̃1M = q·β2M, β̃2M = s·β2M), and none of them are zero-valued coefficients, as illustrated in Figure S3(c). The decorrelation procedure can preserve the truth value of the null hypothesis and identify disjoint effects as mediation effects.
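A minimal numerical sketch of this decorrelation step, with illustrative model sizes, coefficients, and covariance (none of these values come from the paper): eigendecomposition of the mediators' residual covariance yields the orthonormal matrix u, and under a disjoint effect the transformed coefficients u·α̂ are generically all nonzero.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 5000
# Hypothetical two-mediator model with a disjoint effect:
# S -> M1 only (alpha = (0.5, 0)) while only M2 would affect Y (beta = (0, 0.7)).
S = rng.normal(size=n)
Sigma = np.array([[1.0, 0.6], [0.6, 1.0]])        # correlated mediator errors
M = np.outer(S, [0.5, 0.0]) + rng.multivariate_normal([0, 0], Sigma, size=n)

# Estimate alpha, then the residual covariance of the mediators given S.
alpha_hat = (S @ M) / (S @ S)
resid = M - np.outer(S, alpha_hat)
w, v = np.linalg.eigh(np.cov(resid.T))
u = v.T                                            # orthonormal: u Sigma-hat u^T is diagonal
P = M @ u.T                                        # decorrelated mediators P = u M

alpha_tilde = u @ alpha_hat                        # tilde coefficients = u alpha-hat
offdiag = np.cov((resid @ u.T).T)[0, 1]            # residual covariance is diagonalized
```

Both components of `alpha_tilde` are nonzero even though the original `alpha_hat` has one (near-)zero entry, which is how the decorrelated model turns a disjoint effect into a detectable mediation signal.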

Proof of a_i →d N(0, 1) under the null in the multivariate approach. For the GBJ statistic (or any multivariate statistic) in the multivariate approach, say T_α^i, we prove that the score statistic a_i = sign(α̂_S^i) Φ⁻¹(1 − p_α^i/2) converges to the standard normal distribution if the following three statements hold:
1. sign(α̂_S^i) takes the values −1 and 1 with equal probabilities.
2. Φ⁻¹(1 − p_α^i/2) converges to the absolute value of a standard normal random variable.
3. sign(α̂_S^i) and Φ⁻¹(1 − p_α^i/2) are asymptotically independent.
Part 1. The function sign(α̂_S^i) = 1 − 2·I(Σ_j α̂_S^{ij} < 0) represents the collective direction of the components α̂_S^{ij}. Under the null, the maximum likelihood estimator √n α̂_S^i converges to a zero-mean multivariate normal distribution with a bounded covariance; consequently, √n Σ_j α̂_S^{ij} = √n 1ᵀα̂_S^i converges to a zero-mean normal distribution with a finite variance, where 1 is a vector of ones. Since √n 1ᵀα̂_S^i is centered at 0 and symmetric, it takes positive and negative values with equal probabilities. Therefore, 1 − 2·I(Σ_j α̂_S^{ij} < 0) takes the values −1 and 1 with equal probabilities.

Part 2. Since p_α^i is the p-value of T_α^i, under the null it converges to a random variable following a uniform distribution on (0, 1). It follows that Φ⁻¹(1 − p_α^i/2) converges to the absolute value of a standard normal random variable U, with probability density function f_U(u) = 2(2π)^{−1/2} e^{−u²/2} and support u ∈ [0, ∞). We denote the test statistic a_i as V = sign(α̂_S^i) U; given the asymptotic independence established below, f_V(v) = (1/2) f_U(|v|) = (2π)^{−1/2} e^{−v²/2}, the standard normal density. Huang [5] has demonstrated that the probability of the score of α_S^i converges to that of the estimator α̂_S^i, i.e., P(n⁻¹Q_α < q) → P(n‖α̂_S^i‖² < q). We also know from Part 1 that the event sign(α̂_S^i) = −1 is the event √n 1ᵀα̂_S^i < 0, so next we show that √n 1ᵀα̂_S^i < 0 and n‖α̂_S^i‖² are asymptotically independent. Since √n α̂_S^i converges to a random variable ζ_α following a multivariate normal distribution, the event n‖α̂_S^i‖² < q corresponds to the integral of ζ_α's pdf over the ball bounded by the hypersphere n‖α̂_S^i‖² = q. Because ζ_α's pdf is a symmetric function centered at the origin, the hyperplane √n 1ᵀα̂_S^i = 0 splits ζ_α's pdf into two symmetric half-spaces through the origin (a demonstrative figure can be found in [5], Figure S6). Therefore, P(n‖α̂_S^i‖² < q, √n 1ᵀα̂_S^i < 0) → (1/2) P(n‖α̂_S^i‖² < q), which establishes the independence of n‖α̂_S^i‖² and the event √n 1ᵀα̂_S^i < 0. Since sign(α̂_S^i) and Φ⁻¹(1 − p_α^i/2) are functions of √n 1ᵀα̂_S^i < 0 and n‖α̂_S^i‖², respectively, they are asymptotically independent.
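The geometric argument above can be checked numerically. The dimension, covariance, and sample counts below are arbitrary assumptions for illustration: for a mean-zero multivariate normal Z, the squared norm ‖Z‖² has the same distribution on either side of the hyperplane 1ᵀZ = 0, because the density is symmetric about the origin.

```python
import numpy as np

rng = np.random.default_rng(2)
m, reps = 5, 200000
# Illustrative stand-in for the limit zeta_alpha: a mean-zero multivariate
# normal with an arbitrary (non-diagonal) covariance matrix.
A = rng.normal(size=(m, m))
cov = A @ A.T + m * np.eye(m)
Z = rng.multivariate_normal(np.zeros(m), cov, size=reps)

sign = np.where(Z.sum(axis=1) < 0, -1.0, 1.0)   # sign(1^T Z): which half-space
norm2 = (Z ** 2).sum(axis=1)                     # ||Z||^2: distance from origin

# Symmetry of the density about the origin implies ||Z||^2 is distributed
# identically on both sides of the hyperplane 1^T Z = 0.
mean_pos = norm2[sign > 0].mean()
mean_neg = norm2[sign < 0].mean()
```

The two conditional means agree up to Monte Carlo error, and the sign itself is balanced, mirroring statements 1 and 3 of the proof.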
A similar result can be obtained for b_i.

Simulation: Type I errors
We demonstrate that our p-value calculations are sufficiently accurate for controlling the type I error rate. Under the null scenario, the p-values should follow a uniform distribution. We simulated three types of null scenarios, as described in Table S1(a). The first was the complete null scenario, which consisted only of H00 and therefore yields the poorest normal-product approximation. The second was the one-sided null scenario, in which some mechanisms (10% of the 10000 mechanisms) have effects only in the direction from the exposure to the mediators or from the mediators to the outcome. The third was the disjoint null scenario, which refers to settings omitted from our model assumptions, namely those with disjoint effects but no correlation between the mediators within one mechanism. As mentioned, we omitted these settings because they would be recognized as alternative cases by the multivariate approaches. The type I error was controlled when using the multivariate approaches or the decorrelation approaches with a decorrelation process (i.e., approaches using P as mediators, as described in Equation (7)). Adopting a decorrelation approach without a decorrelation process (i.e., one in which M is used as the mediators, as described in Equation (1)) may lead to an inflated type I error. We used the significance level α = 0.0001, since the analysis is intended to be applied at the gene level. The results using the composite test are summarized in Table S2, and the scenario with nominal α = 0.0001 is illustrated in Figure S6 to compare the three tests for mediation assessment.
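As a quick illustration of why the joint significance test is conservative under the complete null (purely a toy sketch with invented sizes, not the simulation design of Table S1): taking p = max(p_α, p_β) with both components uniform gives P(p < t) = t², so the realized type I error is roughly the square of the nominal level.

```python
import numpy as np

rng = np.random.default_rng(5)
reps = 200000
# Illustrative complete null: both component p-values are Uniform(0, 1).
p_alpha = rng.uniform(size=reps)
p_beta = rng.uniform(size=reps)

# Joint significance test: reject only when both paths are significant,
# which is equivalent to using p = max(p_alpha, p_beta).
p_js = np.maximum(p_alpha, p_beta)
alpha_level = 0.01
type1 = np.mean(p_js < alpha_level)  # close to alpha_level**2 under this null
```

The empirical rate lands near 10⁻⁴ rather than the nominal 10⁻², the same order-of-magnitude conservativeness that motivates the composite test.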

Simulation: Power
Different mediator dependency settings and signal patterns. Figure S7 indicates that the decorrelation approaches skipping the decorrelation process outperformed the multivariate approaches for the alternative cases with independent mediators (first row), and they could distinguish the disjoint cases from the connected cases (the first row vs. the fourth row). This result is attributable to the fact that the decorrelation approach can preserve the relationships between the coefficients of each mediator, whereas the multivariate approaches regard the mediators as a group. However, the decorrelation process based on the sample covariance disturbs these relationships; consequently, identifying disjoint cases is equally as tricky with decorrelation approaches involving a decorrelation process as it is with multivariate approaches (the second row vs. the fifth row).
We also found that the scenario with random correlation coefficients yielded lower power than the fixed scenario; this result suggests that higher mediator dependency may decrease the power of the tests (the second row vs. the third row).
Different signal strengths and sparsity levels. Sun and Lin suggested that the best performance is achieved by minP and GHC in cases with extremely sparse signals (h ≤ 3; i.e., in our settings with m = 100), by GBJ in cases with moderately sparse signals (4 ≤ h ≤ 10), and by VCT in cases with dense signals (µ = 0.05, v = 0, and 100% active mechanisms in Figure S8 is the setting most similar to that adopted by Sun and Lin).
Notably, Sun and Lin only provided statistics for evaluating one mechanism and did not consider the genome-wide background; thus, their simulation assumed scenarios with 100% active mechanisms. As discussed, a simulation in which 100% of mechanisms are assumed to have signals is unrealistic, and the conclusions drawn from such a setting may be inapplicable to real data analysis. Moreover, the scenario considered by Sun and Lin is not a mediation design, and the boundaries of their dominant regimes also change; therefore, in practice, the highest-performing statistic is difficult to determine conclusively. However, following Figure S7 and Figure S8, general conclusions can be drawn: 1. The decorrelation approach performs better when signals are consistent, and similarly the multivariate approach performs better when signals are diverse (VCT and TSQ do not have corresponding decorrelation approaches, so this conclusion is based on the other three tests).
2. GHC and the global minP test achieved the best performance in the extremely sparse region; TSQ is generally the most powerful in the moderately sparse region; VCT and the decorrelation approaches outperformed the GBJ, GHC, and global minP tests when the signals were consistent.
3. The hybrid approach is able to capture the better result of the decorrelation and multivariate approaches without raising further multiple-comparison issues.
To summarize our simulations, the preferences of the test statistics are presented in Figure S9.

Table S2: Type I error using the composite test. In general, we controlled the type I error at the significance level α = 10^-5 using the multivariate approaches and the decorrelation approaches with decorrelation, with the following exceptions: (1) disjoint cases are recognized as alternative scenarios (red cells) for both the global (multivariate) approaches and the decorrelation approaches with decorrelation; (2) an insufficient sample size (blue cells) also induces an inflated type I error; the T^2-statistic is the most vulnerable and VCT is the most robust; (3) for the multivariate approaches, the one-sided null in our setting becomes a strong signal when α < 10^-4, so the type I error converges to α more slowly than under the complete null; (4) the type I error of GHC tends to inflate in the setting of correlated mediators, since it presumes a weaker correlation structure; (5) the random correlation coefficient setting (green cells) generates results similar to those of the fixed setting.

We compared eight test statistics with three mediation assessment approaches; the signal mean was fixed at 0.05, and the variance was fixed at 0.1. The significance level was 0.001, and the percentage of active mechanisms was 0.1. The composite test achieved higher power when the signals were sparse, the multivariate approach is generally more powerful than the decorrelation approach, and the hybrid approach achieves power comparable to that of the multivariate approach. Empirically, the sparsity can be observed through the Q-Q plots (one can even use a KS test to gauge the signal strength), and the signal pattern can be observed through the histogram of ab (compare the areas where ab > 0 and ab < 0). Using the composite test, the red horizontal line indicates the significance cutoff at an FDR-adjusted p-value of 0.1; the significant genes common to the decorrelation and multivariate approaches are also highlighted.
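The remark that a KS test can gauge signal strength from the p-value distribution can be sketched as follows; the mixture proportions below are invented for illustration and do not correspond to any dataset in the paper. The one-sample KS distance from Uniform(0, 1) grows with the fraction of true signals.

```python
import numpy as np

def ks_uniform(p):
    """One-sample KS distance between a set of p-values and Uniform(0, 1)."""
    p = np.sort(np.asarray(p))
    n = p.size
    i = np.arange(1, n + 1)
    # Supremum distance, checked just above and just below each jump of the ECDF.
    return np.max(np.maximum(i / n - p, p - (i - 1) / n))

rng = np.random.default_rng(6)
# Hypothetical genome-wide scan: 9500 null mechanisms plus 500 sparse signals.
p_mixed = np.concatenate([rng.uniform(size=9500), rng.uniform(0, 1e-4, size=500)])
p_null = rng.uniform(size=10000)

D_mixed = ks_uniform(p_mixed)  # inflated by the excess of tiny p-values
D_null = ks_uniform(p_null)    # small: consistent with uniformity
```

Comparing `D_mixed` with `D_null` mimics reading the departure from the diagonal in the Q-Q plots: a clearly larger KS distance signals a non-negligible fraction of active mechanisms.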

Data application
TCGA-LUAD preprocessing
The R package TCGAbiolinks [6] was used to retrieve the dataset from the Data Portal of the Genomic Data Commons (GDC, v26.0) [7]. We filtered out the CpG loci with missing values, the genes with abundant missing information (≥20%), and the genes annotated with a single methylation site (i.e., m_i = 1), and we only used the expression of the primary tumor. We applied the logit and log transformations to the DNA methylation β values and the gene expression values, respectively. Finally, we had 20670 genes in the methyl-gene dataset, and the largest mechanism consisted of 1015 potential mediators.
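A schematic of the filtering and transformation steps; the array sizes, missingness rate, and values below are placeholders, not the TCGA-LUAD data: drop features with ≥20% missing entries, then logit-transform the methylation β values (gene expression values would be log-transformed analogously).

```python
import numpy as np

rng = np.random.default_rng(3)
# Placeholder matrix: rows = methylation features, columns = samples.
beta_values = rng.uniform(0.01, 0.99, size=(100, 20))
beta_values[rng.random(beta_values.shape) < 0.05] = np.nan  # sprinkle missingness

# Drop features with >= 20% missing values, mirroring the filtering rule above.
missing_frac = np.isnan(beta_values).mean(axis=1)
kept = beta_values[missing_frac < 0.20]

# Logit transformation of the retained methylation beta values.
m_values = np.log(kept / (1 - kept))
```

The same mask-then-transform pattern applies to the expression matrix, with `np.log` in place of the logit.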

CPTAC-LUAD preprocessing
We downloaded the normalized dataset from the supplementary material of the original paper [8]. We filtered out the miRNAs and proteins with abundant missing information (≥20%) and the proteins annotated with a single miRNA, and we only used the expression of the tumor. Finally, we had 6955 proteins in the miRNA-protein dataset, and the largest mechanism consisted of 132 potential mediators.

Figure S4: Disjoint effect. Correlations between mediators may originate from (a) common causes, (b) causation from M1 to M2, or (c) causation from M2 to M1. However, only scenario (b) can be recognized as an authentic mediation effect. Correlated mediators M are transformed into uncorrelated mediators P when the test statistics require independence between coefficients. The mediators can be decorrelated as in (d) to prevent the causal interpretation of disjoint effects. The tilde coefficients then serve as linear combinations of the original coefficients.

Figure S6: Type I error using the composite test, the joint significance test, and the normality-based test. With a sufficient sample size (n = 1000) for estimating the coefficients, we could successfully control the type I error.

Table S1: Simulation settings (e, number of experiments performed). The scenarios include Alternative (m = 30, e = 30), Disjoint alternative (m = 30, e = 30), and Alternative (m = 100, e = 10). General settings for one experiment consist of 10000 mechanisms, each containing 30 mediators. Three dependency settings were applied: independent; fixed (a block of 10 mediators with correlation coefficients fixed at 0.3); and random (three blocks of correlated mediators of sizes two, three, and five, with random correlation coefficients sampled from uniform distributions with maximum bounds of 0.8, 0.5, and 0.3, respectively; the simulated correlation matrix is then adjusted to be positive definite). (a) We modified the general settings to assess the impact of the sample size, the application of a decorrelation process, and the null type. (b) We investigated the effects of the sparsity level, signal strength, signal pattern, and number of mechanisms on the power of the proposed statistics.

Figure S7: Comparison of three mediation assessment approaches. We compared eight test statistics.

Figure S8: Performance of the test statistics with increasing signal sparsity. We compared eight test statistics by using the composite test with different settings for the signal patterns. The significance level was set at 0.001. The figure displays the percentage of significant mechanisms under different mediator settings and signal strengths with increasing sparsity levels.

Figure S9: Preferences of the test statistics.

Figure S10: Manhattan plots highlighting the significant genes in the methyl-gene dataset.

Figure S11: Q-Q plots of all statistics. We found that the methyl-gene dataset presented strong signals at a genome-wide scale, whereas the miRNA-protein dataset only reported a few proteins. This plot excludes a few extreme p-values for clarity of visualization.

Figure S12: Clustering analysis of the methyl-gene dataset using the STRING database. Using the gene sets provided by the decorrelation and multivariate approaches, we performed a clustering analysis on the confidence scores provided by STRING. Only clusters with more than five proteins are illustrated, and the nine clusters have the same color notation as their corresponding pathways in Figure S13(a).

Figure S14: PD-L1 expression and the PD-1 checkpoint pathway in cancer, provided by KEGG. The pathway illustrates the struggle between the apoptosis mechanism and the tumorigenesis mechanism in T cells and tumor cells. Several key proteins in this pathway also appear in Figure 6 of the manuscript.

Table S3: Cluster annotation table provided by STRING. We list the significant KEGG pathways, Reactome pathways, and the top 30 GO Biological Process terms provided by STRING for the annotation of the methyl-gene dataset. This annotation is based on the union of the gene sets provided by the decorrelation and multivariate approaches.