- Split View
-
Views
-
Cite
Cite
Haixiang Zhang, Yinan Zheng, Lifang Hou, Cheng Zheng, Lei Liu, Mediation analysis for survival data with high-dimensional mediators, Bioinformatics, Volume 37, Issue 21, November 2021, Pages 3815–3821, https://doi.org/10.1093/bioinformatics/btab564
- Share Icon Share
Abstract
Mediation analysis has become a prevalent method to identify causal pathway(s) between an independent variable and a dependent variable through intermediate variable(s). However, little work has been done when the intermediate variables (mediators) are high-dimensional and the outcome is a survival endpoint. In this paper, we introduce a novel method to identify potential mediators in a causal framework of high-dimensional Cox regression.
We first reduce the data dimension through a mediation-based sure independence screening method. A de-biased Lasso inference procedure is used for Cox’s regression parameters. We adopt a multiple-testing procedure to accurately control the false discovery rate when testing high-dimensional mediation hypotheses. Simulation studies are conducted to demonstrate the performance of our method. We apply this approach to explore the mediation mechanisms of 379 330 DNA methylation markers between smoking and overall survival among lung cancer patients in The Cancer Genome Atlas lung cancer cohort. Two methylation sites (cg08108679 and cg26478297) are identified as potential mediating epigenetic markers.
Our proposed method is available with the R package HIMA at https://cran.r-project.org/web/packages/HIMA/.
Supplementary data are available at Bioinformatics online.
1 Introduction
Mediation analysis was first proposed in the field of social science (Baron and Kenny, 1986). It has been widely applied in different areas, including neuroscience (Chén et al., 2017; Zhao et al., 2020), genomics and epigenomics (Fang et al., 2021; Valeri et al., 2017), microbiome studies (Sohn and Li, 2019; Zhang et al., 2018), etc. With the advancement of data collection techniques, it is now interesting and desirable to make inference on high-dimensional mediators. In recent years, substantial research efforts have been devoted to developing methodology for high-dimensional mediation analysis. For example, Zhang et al. (2016) and Gao et al. (2019) proposed innovative methods on testing mediation effects in high-dimensional epigenetic studies. Derkach et al. (2019) considered a model for high-dimensional mediation analysis with latent variables. Zhang (2019) introduced two procedures for mediator selection with high-dimensional exposures and high-dimensional mediators. Djordjilović et al. (2019) considered the testing for groups of potential mediators in high-dimensional mediation models. Zhang et al. (2019, 2021a, b) and Wang et al. (2020) considered the statistical inference for mediation effects with high-dimensional and compositional microbiome data. Liu et al. (2020) developed a powerful Divide-Aggregate Composite-null Test for large-scale mediation hypotheses. Loh et al. (2020) proposed a nonlinear framework for mediation analysis with high-dimensional mediators. Zhou et al. (2020) presented new inference procedures for the indirect effect in high-dimensional linear mediation models. Shi and Li (2020) developed a hypothesis testing procedure for high-dimensional mediators using the logic of Boolean matrices. Dai et al. (2021) developed a multiple-testing procedure that accurately controls the false discovery rate (FDR) when testing high-dimensional mediation hypotheses.
The above-mentioned results are mainly focused on noncensored outcomes. In low-dimensional case, some authors have studied the mediation analysis with survival data. For example, Lange and Hansen (2011), VanderWeele (2011), Tchetgen (2011) and Fulcher et al. (2017) proposed several causal mediation analysis frameworks with a single mediator and a survival outcome. Gelfand et al. (2016) presented a comparison of semiparametric proportional hazards and fully parametric accelerated failure time approaches to causal mediation analysis. Wang and Albert (2016) considered causal mediation analysis for the Cox model with a smooth baseline hazard estimator. Liu et al. (2018), Didelez (2019) and Zheng and Liu (2021) considered the mediation analysis for longitudinal and survival data. Fasanelli et al. (2019) proposed a method to estimate the marginal time-dependent causal effects in mediation analysis with survival data. Huang and Yang (2017) and Yu et al. (2019) studied mediation analysis of survival outcomes with multiple mediators. Cho and Huang (2019) investigated mediation analysis with causally ordered mediators using the Cox model.
However, there is a dearth of suitable models for high-dimensional mediation analysis on the survival outcome. To the best of our knowledge, Luo et al. (2020) was the first work toward high-dimensional mediator selection for the survival endpoint. In this paper, we propose a novel mediator identification procedure for the high-dimensional Cox model. Compared with Luo et al. (2020), our method has the following advantages. First, we use a series of marginal mediation effect () pathways (exposure → mediator → outcome), which roughly describe the mediation effect of each individual mediator, to screen out potentially significant mediators. On the other hand, Luo et al. (2020) only considered the effect β (mediator → outcome) as the term of screening criterion. Therefore, our mediation-based screening could be more accurate than Luo et al. (2020)’s screening method. Second, we adopt the de-biased Lasso (Fang et al., 2016) to estimate the effect β (mediator → outcome), where the estimate and its standard error are available. Therefore, our method can give inference results for all the de-biased Lasso estimators. In comparison, Luo et al. (2020) used the minimax concave penalty (MCP; Zhang, 2010) technique to estimate the effect β, which only provides statistical inference on nonzero MCP-based estimators. Third, we employ Dai et al. (2021)’s joint significance test with an mixture null distribution, which can more accurately control the FDR for large-scale multiple testing. However, Luo et al. (2020) used a naive joint significance rule with a uniform null distribution for the maximum P-value. Their procedure results in a valid but overly conservative test with low power (Dai et al., 2021; Huang, 2018).
The remainder of this paper is organized as follows. In Section 2, we present the regression model for mediation analysis with a survival outcome. We propose a three-step testing procedure for mediation effects in the high-dimensional Cox model. In Section 3, we evaluate the performance of our method via numerical simulations. In Section 4, an application to The Cancer Genome Atlas (TCGA) lung cancer cohort is provided. Some concluding remarks are given in Section 5.
2 Statistical methods
We use the counterfactual framework as VanderWeele and Vansteelandt (2014) and Huang and Yang (2017) to formally define the mediation effects and list assumptions for the identification of such effects. We denote the exposure for the ith individual as Xi and the baseline adjusted covariates (e.g. age and gender) as . Under stable unit treatment value assumption (Imbens and Rubin, 2015), let , we use to denote the potential survival time, respectively, for individual i when the exposure is set to x, and the mediators are set to m. We use to denote the potential value of the kth mediator for individual i when the exposure is set to x. Here, we assume that the mediators are not causally related to each other. Formally, let and use to denote the potential value of the kth mediator for individual i when the exposure is set to x and all mediators except the kth mediator are set to , then we assume for all and all possible . We would like to point out that this assumption does not require all mediators to be independent given the exposure X and the baseline adjusted covariates Z, and it allows for potential unmeasured common causes (either induced by the exposure or not) between mediators. In our example, where all the mediators are measured at the same time and a direct causal relationship between them is less likely, this assumption is reasonable. Under the consistency assumption (Imbens and Rubin, 2015), we have the observed mediators and the survival time .
As discussed in VanderWeele and Vansteelandt (2014) and VanderWeele et al. (2014), the following assumptions regarding potential confoundings, in addition to the positivity assumption (Imbens and Rubin, 2015), will allow us to nonparametrically identify the joint causal mediation effect as well as path-specific causal effects in the framework above:
(C.1) , i.e. no unmeasured confounders between the exposure and the survival outcome;
(C.2) , i.e. no unmeasured confounders between the mediators and the survival outcome;
(C.3) , i.e. no unmeasured confounders between the exposure and the mediators;
(C.4) , i.e. no exposure-induced confounding between the mediators and the survival outcome.
Assuming the censoring time C is noninformative, we can identify the parameters in these models.
Here, we point out that Luo et al. (2020)’s method adapts Zhang et al. (2016)’s framework to the survival endpoint. In Figure 1, we illustrate a scenario of high-dimensional mediation-based Cox model with omitted confounding variables, where the p mediators could be correlated with each other. Of note, the situation with causally ordered mediators described in Cho and Huang (2019) will not be captured by our suggested procedure.
Similarly, the path-specific causal effect on the log-hazard difference scale for the mediator Mk () can be defined as a comparison of log hazard for and , which can be approximated as . We would like to point out that even when the rare event approximation does not hold, testing the null hypothesis is still valid for testing the existence of such path-specific causal effect through Mk.
Our aim is to estimate and test the path-specific mediation effects of the kth mediator Mk, for . Denote by the index set of those significant mediators. Assume that we have n i.i.d. samples . For practical analysis, we first conduct a standardization of the mediator variables with mean zero and variance one. The proposed approach is as follows:
Of note, we have made statistical inference in Steps 2 and 3 of our proposed method conditional on the selected set (Step 1). As mentioned before, as . Asymptotically, we assume that , where and . Basically, the randomness of is actually due to . Because the key idea of de-biased Lasso is to project the scores of interested parameters (e.g. ) onto the linear span of the score functions of nuisance parameters (e.g. ), the corresponding estimation function of is uncorrelated with the score function of (Fang et al., 2016). Hence, the randomness of has very limited impact on the de-biased Lasso estimator of asymptotically.
3 Simulation studies
In this section, we conduct two simulation studies to assess the performance of our proposed method. First, we generate failure times from Cox Model (3) with . The exposure Xi follows from N(0, 2), and ; the covariates , where and are independently generated from N(0, 2), ; , and otherwise; the mediators are generated from Model (2), where , and otherwise; i.e. the are active mediators. , and the error terms are generated from . To simulate the dependency structure of mediators close to the real data, we use the first step of our method (mediators screening) to pick up the top p DNA methylation markers from the real data in Section 4. Σe is set to the correlation matrix of those p DNA methylation markers. In Supplementary Figure S1, we present the histogram for the lower triangular of the correlation matrix Σe, which indicates that some of the mediators are highly correlated. For illustration, in Supplementary Figure S2, we show the 10 × 10 upper submatrix of Σe for the active mediators . Moreover, the censoring times Ci are generated from a uniform distribution over , where (censoring rate is about 20%) and 5 (censoring rate is about 40%), respectively. All the simulations are based on 200 replications, where p = 10 000, n = 300 and 500, respectively.
For fair comparison, we consider Luo et al. (2020)’s method (denoted as ‘Luo et al.’), where the number of survived variables in their first step is the same as our method with . In Table 1, we present the probabilities to be screened in for those active mediators (Step 1) over 200 replications. The results indicate that Luo et al. (2020)’s screening method has a poor performance, while our mediation-based screening has a higher probability to include those active mediators .
. | . | CR = 20% . | CR = 40% . | ||
---|---|---|---|---|---|
. | . | Proposed . | Luo et al. . | Proposed . | Luo et al. . |
n = 300 | M1 | 200 | 50 | 200 | 43 |
M2 | 158 | 0 | 152 | 0 | |
M3 | 200 | 131 | 200 | 115 | |
M4 | 98 | 0 | 116 | 0 | |
M5 | 196 | 8 | 190 | 1 | |
M6 | 90 | 0 | 81 | 0 | |
M7 | 200 | 141 | 200 | 125 | |
M8 | 200 | 24 | 199 | 18 | |
M9 | 200 | 11 | 199 | 18 | |
M10 | 177 | 1 | 179 | 0 | |
n = 500 | M1 | 200 | 60 | 200 | 63 |
M2 | 179 | 0 | 179 | 0 | |
M3 | 200 | 167 | 200 | 159 | |
M4 | 101 | 0 | 126 | 0 | |
M5 | 200 | 6 | 200 | 9 | |
M6 | 132 | 0 | 128 | 0 | |
M7 | 200 | 190 | 200 | 171 | |
M8 | 200 | 35 | 200 | 31 | |
M9 | 200 | 19 | 199 | 26 | |
M10 | 195 | 0 | 182 | 0 |
. | . | CR = 20% . | CR = 40% . | ||
---|---|---|---|---|---|
. | . | Proposed . | Luo et al. . | Proposed . | Luo et al. . |
n = 300 | M1 | 200 | 50 | 200 | 43 |
M2 | 158 | 0 | 152 | 0 | |
M3 | 200 | 131 | 200 | 115 | |
M4 | 98 | 0 | 116 | 0 | |
M5 | 196 | 8 | 190 | 1 | |
M6 | 90 | 0 | 81 | 0 | |
M7 | 200 | 141 | 200 | 125 | |
M8 | 200 | 24 | 199 | 18 | |
M9 | 200 | 11 | 199 | 18 | |
M10 | 177 | 1 | 179 | 0 | |
n = 500 | M1 | 200 | 60 | 200 | 63 |
M2 | 179 | 0 | 179 | 0 | |
M3 | 200 | 167 | 200 | 159 | |
M4 | 101 | 0 | 126 | 0 | |
M5 | 200 | 6 | 200 | 9 | |
M6 | 132 | 0 | 128 | 0 | |
M7 | 200 | 190 | 200 | 171 | |
M8 | 200 | 35 | 200 | 31 | |
M9 | 200 | 19 | 199 | 26 | |
M10 | 195 | 0 | 182 | 0 |
‘Proposed’ denotes our method; ‘Luo et al.’ denotes Luo et al. (2020)’s method; ‘CR’ denotes the censoring rate of failure times.
. | . | CR = 20% . | CR = 40% . | ||
---|---|---|---|---|---|
. | . | Proposed . | Luo et al. . | Proposed . | Luo et al. . |
n = 300 | M1 | 200 | 50 | 200 | 43 |
M2 | 158 | 0 | 152 | 0 | |
M3 | 200 | 131 | 200 | 115 | |
M4 | 98 | 0 | 116 | 0 | |
M5 | 196 | 8 | 190 | 1 | |
M6 | 90 | 0 | 81 | 0 | |
M7 | 200 | 141 | 200 | 125 | |
M8 | 200 | 24 | 199 | 18 | |
M9 | 200 | 11 | 199 | 18 | |
M10 | 177 | 1 | 179 | 0 | |
n = 500 | M1 | 200 | 60 | 200 | 63 |
M2 | 179 | 0 | 179 | 0 | |
M3 | 200 | 167 | 200 | 159 | |
M4 | 101 | 0 | 126 | 0 | |
M5 | 200 | 6 | 200 | 9 | |
M6 | 132 | 0 | 128 | 0 | |
M7 | 200 | 190 | 200 | 171 | |
M8 | 200 | 35 | 200 | 31 | |
M9 | 200 | 19 | 199 | 26 | |
M10 | 195 | 0 | 182 | 0 |
. | . | CR = 20% . | CR = 40% . | ||
---|---|---|---|---|---|
. | . | Proposed . | Luo et al. . | Proposed . | Luo et al. . |
n = 300 | M1 | 200 | 50 | 200 | 43 |
M2 | 158 | 0 | 152 | 0 | |
M3 | 200 | 131 | 200 | 115 | |
M4 | 98 | 0 | 116 | 0 | |
M5 | 196 | 8 | 190 | 1 | |
M6 | 90 | 0 | 81 | 0 | |
M7 | 200 | 141 | 200 | 125 | |
M8 | 200 | 24 | 199 | 18 | |
M9 | 200 | 11 | 199 | 18 | |
M10 | 177 | 1 | 179 | 0 | |
n = 500 | M1 | 200 | 60 | 200 | 63 |
M2 | 179 | 0 | 179 | 0 | |
M3 | 200 | 167 | 200 | 159 | |
M4 | 101 | 0 | 126 | 0 | |
M5 | 200 | 6 | 200 | 9 | |
M6 | 132 | 0 | 128 | 0 | |
M7 | 200 | 190 | 200 | 171 | |
M8 | 200 | 35 | 200 | 31 | |
M9 | 200 | 19 | 199 | 26 | |
M10 | 195 | 0 | 182 | 0 |
‘Proposed’ denotes our method; ‘Luo et al.’ denotes Luo et al. (2020)’s method; ‘CR’ denotes the censoring rate of failure times.
In Table 2 and Supplementary Table S1, we report the estimation results for mediation effects , which include the estimated biases (Bias) given by the sample mean of the estimates minus the true value, and the mean squared errors (MSE) of the estimates. Here, we omit the results for , because their performances are similar to that of . For significant mediators, the Bias and MSE of our method (denoted as ‘Proposed’) are much smaller than those of ‘Luo et al.’. Hence, the proposed approach is more efficient than Luo et al. (2020)’s method toward the estimation of active mediation effects. In Table 3, we present the estimated FDR of mediation effects, where the FDR threshold level is 0.05. The results indicate that both methods can control the FDR under the threshold level. In Figures 2–5, we illustrate the empirical power for each of the active mediators separately. The figures indicate that our procedure is much more powerful than Luo et al. (2020)’s method in selecting significant mediators. All the above reported results become much better when the sample size n is increasing. However, it seems that the increasing of censoring rate has a negative affect on both methods, which is common in survival analysis.
. | CR = 20% . | CR = 40% . | ||
---|---|---|---|---|
. | Proposed . | Luo et al. . | Proposed . | Luo et al. . |
–0.0197 | –0.2650 | –0.0271 | –0.2649 | |
(0.0029) | (0.0921) | (0.0037) | (0.0921) | |
–0.1036 | –0.2500 | –0.1157 | –0.2500 | |
(0.0162) | (0.0625) | (0.0190) | (0.0625) | |
–0.0135 | –0.0173 | |||
(0.0011) | (0.0059) | (0.0014) | (0.0080) | |
–0.0515 | –0.0900 | |||
(0.0044) | (0.0081) | (0.0037) | (0.0081) | |
–0.0231 | –0.0235 | –0.0619 | ||
(0.0009) | (0.0038) | (0.0011) | (0.0039) | |
–0.0107 | –0.0225 | –0.0118 | –0.0225 | |
(0.0003) | (0.0005) | (0.0003) | (0.0005) | |
–0.0395 | –0.0404 | |||
(0.0038) | (0.0420) | (0.0036) | (0.0462) | |
0.0218 | –0.1089 | 0.0222 | ||
(0.0015) | (0.0132) | (0.0022) | (0.0139) | |
–0.0035 | –0.0609 | –0.0007 | –0.0621 | |
(0.0009) | (0.0044) | (0.0010) | (0.0046) | |
–0.0168 | –0.0600 | –0.0151 | –0.0600 | |
(0.0016) | (0.0036) | (0.0016) | (0.0036) | |
0.0002 | 0.0001 | |||
() | () | () | () | |
0.0013 | 0 | 0.0082 | 0 | |
(0.0007) | (0) | (0.0007) | (0) | |
0 | 0 | |||
() | (0) | () | (0) |
. | CR = 20% . | CR = 40% . | ||
---|---|---|---|---|
. | Proposed . | Luo et al. . | Proposed . | Luo et al. . |
–0.0197 | –0.2650 | –0.0271 | –0.2649 | |
(0.0029) | (0.0921) | (0.0037) | (0.0921) | |
–0.1036 | –0.2500 | –0.1157 | –0.2500 | |
(0.0162) | (0.0625) | (0.0190) | (0.0625) | |
–0.0135 | –0.0173 | |||
(0.0011) | (0.0059) | (0.0014) | (0.0080) | |
–0.0515 | –0.0900 | |||
(0.0044) | (0.0081) | (0.0037) | (0.0081) | |
–0.0231 | –0.0235 | –0.0619 | ||
(0.0009) | (0.0038) | (0.0011) | (0.0039) | |
–0.0107 | –0.0225 | –0.0118 | –0.0225 | |
(0.0003) | (0.0005) | (0.0003) | (0.0005) | |
–0.0395 | –0.0404 | |||
(0.0038) | (0.0420) | (0.0036) | (0.0462) | |
0.0218 | –0.1089 | 0.0222 | ||
(0.0015) | (0.0132) | (0.0022) | (0.0139) | |
–0.0035 | –0.0609 | –0.0007 | –0.0621 | |
(0.0009) | (0.0044) | (0.0010) | (0.0046) | |
–0.0168 | –0.0600 | –0.0151 | –0.0600 | |
(0.0016) | (0.0036) | (0.0016) | (0.0036) | |
0.0002 | 0.0001 | |||
() | () | () | () | |
0.0013 | 0 | 0.0082 | 0 | |
(0.0007) | (0) | (0.0007) | (0) | |
0 | 0 | |||
() | (0) | () | (0) |
‘Proposed’ denotes our method; ‘Luo et al.’ denotes Luo et al. (2020)’s method; ‘CR’ denotes the censoring rate of failure times.
. | CR = 20% . | CR = 40% . | ||
---|---|---|---|---|
. | Proposed . | Luo et al. . | Proposed . | Luo et al. . |
–0.0197 | –0.2650 | –0.0271 | –0.2649 | |
(0.0029) | (0.0921) | (0.0037) | (0.0921) | |
–0.1036 | –0.2500 | –0.1157 | –0.2500 | |
(0.0162) | (0.0625) | (0.0190) | (0.0625) | |
–0.0135 | –0.0173 | |||
(0.0011) | (0.0059) | (0.0014) | (0.0080) | |
–0.0515 | –0.0900 | |||
(0.0044) | (0.0081) | (0.0037) | (0.0081) | |
–0.0231 | –0.0235 | –0.0619 | ||
(0.0009) | (0.0038) | (0.0011) | (0.0039) | |
–0.0107 | –0.0225 | –0.0118 | –0.0225 | |
(0.0003) | (0.0005) | (0.0003) | (0.0005) | |
–0.0395 | –0.0404 | |||
(0.0038) | (0.0420) | (0.0036) | (0.0462) | |
0.0218 | –0.1089 | 0.0222 | ||
(0.0015) | (0.0132) | (0.0022) | (0.0139) | |
–0.0035 | –0.0609 | –0.0007 | –0.0621 | |
(0.0009) | (0.0044) | (0.0010) | (0.0046) | |
–0.0168 | –0.0600 | –0.0151 | –0.0600 | |
(0.0016) | (0.0036) | (0.0016) | (0.0036) | |
0.0002 | 0.0001 | |||
() | () | () | () | |
0.0013 | 0 | 0.0082 | 0 | |
(0.0007) | (0) | (0.0007) | (0) | |
0 | 0 | |||
() | (0) | () | (0) |
. | CR = 20% . | CR = 40% . | ||
---|---|---|---|---|
. | Proposed . | Luo et al. . | Proposed . | Luo et al. . |
–0.0197 | –0.2650 | –0.0271 | –0.2649 | |
(0.0029) | (0.0921) | (0.0037) | (0.0921) | |
–0.1036 | –0.2500 | –0.1157 | –0.2500 | |
(0.0162) | (0.0625) | (0.0190) | (0.0625) | |
–0.0135 | –0.0173 | |||
(0.0011) | (0.0059) | (0.0014) | (0.0080) | |
–0.0515 | –0.0900 | |||
(0.0044) | (0.0081) | (0.0037) | (0.0081) | |
–0.0231 | –0.0235 | –0.0619 | ||
(0.0009) | (0.0038) | (0.0011) | (0.0039) | |
–0.0107 | –0.0225 | –0.0118 | –0.0225 | |
(0.0003) | (0.0005) | (0.0003) | (0.0005) | |
–0.0395 | –0.0404 | |||
(0.0038) | (0.0420) | (0.0036) | (0.0462) | |
0.0218 | –0.1089 | 0.0222 | ||
(0.0015) | (0.0132) | (0.0022) | (0.0139) | |
–0.0035 | –0.0609 | –0.0007 | –0.0621 | |
(0.0009) | (0.0044) | (0.0010) | (0.0046) | |
–0.0168 | –0.0600 | –0.0151 | –0.0600 | |
(0.0016) | (0.0036) | (0.0016) | (0.0036) | |
0.0002 | 0.0001 | |||
() | () | () | () | |
0.0013 | 0 | 0.0082 | 0 | |
(0.0007) | (0) | (0.0007) | (0) | |
0 | 0 | |||
() | (0) | () | (0) |
‘Proposed’ denotes our method; ‘Luo et al.’ denotes Luo et al. (2020)’s method; ‘CR’ denotes the censoring rate of failure times.
. | CR = 20% . | CR = 40% . | ||
---|---|---|---|---|
Methods . | n = 300 . | n = 500 . | n = 300 . | n = 500 . |
Proposed | 0.0097 | 0.0047 | 0.0092 | 0.0039 |
Luo et al. | 0.0125 | 0.0063 | 0.0325 | 0.0117 |
. | CR = 20% . | CR = 40% . | ||
---|---|---|---|---|
Methods . | n = 300 . | n = 500 . | n = 300 . | n = 500 . |
Proposed | 0.0097 | 0.0047 | 0.0092 | 0.0039 |
Luo et al. | 0.0125 | 0.0063 | 0.0325 | 0.0117 |
‘Proposed’ denotes our method; ‘Luo et al.’ denotes Luo et al. (2020)’s method; ‘CR’ denotes the censoring rate of failure times.
. | CR = 20% . | CR = 40% . | ||
---|---|---|---|---|
Methods . | n = 300 . | n = 500 . | n = 300 . | n = 500 . |
Proposed | 0.0097 | 0.0047 | 0.0092 | 0.0039 |
Luo et al. | 0.0125 | 0.0063 | 0.0325 | 0.0117 |
. | CR = 20% . | CR = 40% . | ||
---|---|---|---|---|
Methods . | n = 300 . | n = 500 . | n = 300 . | n = 500 . |
Proposed | 0.0097 | 0.0047 | 0.0092 | 0.0039 |
Luo et al. | 0.0125 | 0.0063 | 0.0325 | 0.0117 |
‘Proposed’ denotes our method; ‘Luo et al.’ denotes Luo et al. (2020)’s method; ‘CR’ denotes the censoring rate of failure times.
As suggested by one reviewer, we conduct the second simulation to study the performance of our method when there are no indirect effects for any mediators. The settings are identical with the first simulation, except that for other k; and for other k. The estimation results for are similar to those of . Hence, we only report the Bias and MSE for and in Table 4 (other cases are similar and omitted). From the results, we can see that both methods are unbiased on the estimation of mediation effects. In Table 5, we give the FDR of multiple-testing in the case when there are no mediation effects. It seems that both method can well control the FDR.
. | . | CR = 20% . | CR = 40% . | ||
---|---|---|---|---|---|
. | . | Proposed . | Luo et al. . | Proposed . | Luo et al. . |
n = 300 | –0.0003 | 0.0009 | 0.0008 | ||
(0.0001) | (0.0001) | (0.0001) | (0.0001) | ||
0.0572 | 0 | 0.0608 | 0 | ||
(0.0053) | (0) | (0.0059) | (0) | ||
0 | 0 | 0 | |||
(0) | () | (0) | (0) | ||
n = 500 | 0.0004 | 0.0006 | 0.0002 | ||
() | (0.0001) | () | () | ||
0.0446 | 0 | 0.0488 | 0 | ||
() | (0) | (0.0004) | (0) | ||
0 | 0 | 0 | 0 | ||
(0) | (0) | (0) | (0) |
. | . | CR = 20% . | CR = 40% . | ||
---|---|---|---|---|---|
. | . | Proposed . | Luo et al. . | Proposed . | Luo et al. . |
n = 300 | –0.0003 | 0.0009 | 0.0008 | ||
(0.0001) | (0.0001) | (0.0001) | (0.0001) | ||
0.0572 | 0 | 0.0608 | 0 | ||
(0.0053) | (0) | (0.0059) | (0) | ||
0 | 0 | 0 | |||
(0) | () | (0) | (0) | ||
n = 500 | 0.0004 | 0.0006 | 0.0002 | ||
() | (0.0001) | () | () | ||
0.0446 | 0 | 0.0488 | 0 | ||
() | (0) | (0.0004) | (0) | ||
0 | 0 | 0 | 0 | ||
(0) | (0) | (0) | (0) |
‘Proposed’ denotes our method; ‘Luo et al.’ denotes Luo et al. (2020)’s method; ‘CR’ denotes the censoring rate of failure times.
. | . | CR = 20% . | CR = 40% . | ||
---|---|---|---|---|---|
. | . | Proposed . | Luo et al. . | Proposed . | Luo et al. . |
n = 300 | –0.0003 | 0.0009 | 0.0008 | ||
(0.0001) | (0.0001) | (0.0001) | (0.0001) | ||
0.0572 | 0 | 0.0608 | 0 | ||
(0.0053) | (0) | (0.0059) | (0) | ||
0 | 0 | 0 | |||
(0) | () | (0) | (0) | ||
n = 500 | 0.0004 | 0.0006 | 0.0002 | ||
() | (0.0001) | () | () | ||
0.0446 | 0 | 0.0488 | 0 | ||
() | (0) | (0.0004) | (0) | ||
0 | 0 | 0 | 0 | ||
(0) | (0) | (0) | (0) |
. | . | CR = 20% . | CR = 40% . | ||
---|---|---|---|---|---|
. | . | Proposed . | Luo et al. . | Proposed . | Luo et al. . |
n = 300 | –0.0003 | 0.0009 | 0.0008 | ||
(0.0001) | (0.0001) | (0.0001) | (0.0001) | ||
0.0572 | 0 | 0.0608 | 0 | ||
(0.0053) | (0) | (0.0059) | (0) | ||
0 | 0 | 0 | |||
(0) | () | (0) | (0) | ||
n = 500 | 0.0004 | 0.0006 | 0.0002 | ||
() | (0.0001) | () | () | ||
0.0446 | 0 | 0.0488 | 0 | ||
() | (0) | (0.0004) | (0) | ||
0 | 0 | 0 | 0 | ||
(0) | (0) | (0) | (0) |
‘Proposed’ denotes our method; ‘Luo et al.’ denotes Luo et al. (2020)’s method; ‘CR’ denotes the censoring rate of failure times.
. | CR = 20% . | CR = 40% . | ||
---|---|---|---|---|
Methods . | n = 300 . | n = 500 . | n = 300 . | n = 500 . |
Proposed | 0.040 | 0.030 | 0.050 | 0.045 |
Luo et al. | 0.050 | 0.060 | 0.030 | 0.025 |
. | CR = 20% . | CR = 40% . | ||
---|---|---|---|---|
Methods . | n = 300 . | n = 500 . | n = 300 . | n = 500 . |
Proposed | 0.040 | 0.030 | 0.050 | 0.045 |
Luo et al. | 0.050 | 0.060 | 0.030 | 0.025 |
‘Proposed’ denotes our method; ‘Luo et al.’ denotes Luo et al. (2020)’s method; ‘CR’ denotes the censoring rate of failure times.
. | CR = 20% . | CR = 40% . | ||
---|---|---|---|---|
Methods . | n = 300 . | n = 500 . | n = 300 . | n = 500 . |
Proposed | 0.040 | 0.030 | 0.050 | 0.045 |
Luo et al. | 0.050 | 0.060 | 0.030 | 0.025 |
. | CR = 20% . | CR = 40% . | ||
---|---|---|---|---|
Methods . | n = 300 . | n = 500 . | n = 300 . | n = 500 . |
Proposed | 0.040 | 0.030 | 0.050 | 0.045 |
Luo et al. | 0.050 | 0.060 | 0.030 | 0.025 |
‘Proposed’ denotes our method; ‘Luo et al.’ denotes Luo et al. (2020)’s method; ‘CR’ denotes the censoring rate of failure times.
4 An application to lung cancer data
We apply our proposed method to the TCGA lung cancer cohort study including lung squamous cell carcinoma and lung adenocarcinoma, where the data are freely available at https://xenabrowser.net/datapages/. Our interest is to identify potential epigenetic markers linking smoking and survival of lung cancer patients. In the analysis, we focus on 593 patients with nonmissing clinical and epigenetic information, whose ages ranged from 33 to 90 years. The survival endpoint is defined as the number of days from initial diagnosis to death, which could be censored. The median survival time is 678 days. Two hundred and forty-three patients died during the follow-up, with a censoring rate of 59%. A total of 379 330 DNA methylation markers (M) profiled by Infinium HumanMethylation450 BeadChip array are potential mediators. The exposure X is the number of packs smoked per years, and the survival time is the outcome variable. Other covariates (Z) include age at initial diagnosis, gender (male = 1; female = 0), tumor stage (Stage I = 1; Stage II = 2; Stage III = 3; Stage IV = 4) and radiotherapy (yes = 1; no = 0).
In Table 6, we report the summary results on the two selected mediators by our method, where the FDR threshold level is 0.05. For cg08108679 (in gene PRKCZ, chromosome 1, position: 2 003 274), the estimated pathway effect (α) on is –0.0092 (0.0024), where the number in parenthesis is the corresponding standard error (SE); the estimated pathway effect on is –3.4997 (1.0248). For cg26478297 (in gene PRKCG, chromosome 19, position: 54 387 436), the estimated pathway effects on and are –0.0256 (0.0068) and –1.3362 (0.4011), respectively. Hence, the two selected CpGs have positive log-hazard indirect effect (smoking increases the mortality). The two CpGs are located in different chromosomes but belonging to the same gene family of protein kinase C (PKC). PKC family members are known to be involved in diverse cellular signaling pathways and have been studied extensively as a group of proteins that involve in cancer development (Dowling et al., 2017). Previous studies (Guo et al., 2008; Wyatt et al., 1999) have found that PKC is activated in human epithelial cells when exposed to cigarette smoke extract, which may in turn influence the invasion and metastasis of lung cancer (Gopalakrishna et al., 1994).
CpGs . | Chromosome . | Gene . | . | . | . |
---|---|---|---|---|---|
. | . | . | (SE) . | (SE) . | . |
cg08108679 | Chr1:2003274 | PRCKZ | –0.0092 | –3.4997 | 0.0006 |
(0.0024) | (1.0248) | ||||
cg26478297 | Chr19:54387436 | PRCKG | –0.0256 | –1.3362 | 0.0009 |
(0.0068) | (0.4011) |
CpGs . | Chromosome . | Gene . | . | . | . |
---|---|---|---|---|---|
. | . | . | (SE) . | (SE) . | . |
cg08108679 | Chr1:2003274 | PRCKZ | –0.0092 | –3.4997 | 0.0006 |
(0.0024) | (1.0248) | ||||
cg26478297 | Chr19:54387436 | PRCKG | –0.0256 | –1.3362 | 0.0009 |
(0.0068) | (0.4011) |
‘SE’ denotes standard error; ‘’ is given in (7).
CpGs . | Chromosome . | Gene . | . | . | . |
---|---|---|---|---|---|
. | . | . | (SE) . | (SE) . | . |
cg08108679 | Chr1:2003274 | PRCKZ | –0.0092 | –3.4997 | 0.0006 |
(0.0024) | (1.0248) | ||||
cg26478297 | Chr19:54387436 | PRCKG | –0.0256 | –1.3362 | 0.0009 |
(0.0068) | (0.4011) |
CpGs . | Chromosome . | Gene . | . | . | . |
---|---|---|---|---|---|
. | . | . | (SE) . | (SE) . | . |
cg08108679 | Chr1:2003274 | PRCKZ | –0.0092 | –3.4997 | 0.0006 |
(0.0024) | (1.0248) | ||||
cg26478297 | Chr19:54387436 | PRCKG | –0.0256 | –1.3362 | 0.0009 |
(0.0068) | (0.4011) |
‘SE’ denotes standard error; ‘’ is given in (7).
For comparison, we also use Luo et al. (2020)’s method to analyze this dataset. However, their approach fails to identify any significant mediators. In summary, our proposed method works well for mediation testing with survival outcomes in practical applications.
5 Concluding remarks
We have proposed a multiple-testing procedure for high-dimensional mediation effects with the survival outcome. To address the ultra high-dimensional DNA methylation markers, we used a screening technique to reduce the dimension of potential mediators. Moreover, we adopted the de-biased Lasso method and ‘JS-mixture’ procedure to identify significant mediators. Simulation results indicated that our method has a satisfactory performance. An application to TCGA lung cancer cohort was provided to illustrate the utility of our proposed approach.
There are several topics to be studied in the future. First, we have adopted marginal screening in Step 1 of our method. As pointed out already in the original SIS paper by Fan and Lv (2008), correlations among the mediators may cause problems. Fan and Lv (2008) alleviated this by introducing the iterative SIS. Although our approach works well in the simulated examples, it is interesting to further study the iterative SIS in our method from both the theory and application aspects. Second, group testing for mediation effects is an attractive direction (Derkach et al., 2020; Djordjilović et al., 2019; Krull and MacKinnon, 2001), it is interesting to consider the group mediators in high-dimensional survival data. Third, we have imposed some traditional assumptions related to no unmeasured confounding in our method. However, in the high-dimensional mediator situation, additional complications occur. Specifically, the interrelationship among the (potential) mediators plays a crucial role. As suggested by one reviewer, it is interesting to consider the situation with causally ordered mediators described in Cho and Huang (2019). Fourth, we have used in (5) as valid p-values conditional on the selected set in Step 1. As one reviewer suggested, it is desirable to consider the randomness of for our method in the nonasymptotic situation. There are two possible ways to guarantee valid p-values theoretically: (i) apply the proposed Steps 2 and 3 directly without using the mediator screening step. However, the computational burden for de-biased Lasso is extremely heavy for ultra high-dimensional mediators, e.g. there are a total of 379 330 DNA methylation markers in the real application; (ii) split the samples into two equal parts, one part for Step 1 and the other part for Steps 2 and 3. However, this sample-splitting technique suffers from loss of efficiency, because only half of the whole samples are used in the screening (Step 1) and inference (Steps 2 and 3), respectively.
Acknowledgements
The authors would like to thank the Editor, the Associate Editor and two reviewers for their constructive and insightful comments that greatly improved the manuscript. The content is solely the responsibility of the authors and does not necessarily represent the official view of the NIH.
Funding
The work of Liu was supported by NIH UL1 TR002345 and R21 AG063370.
Conflict of Interest
None declared.
Data availability
The data that support the findings of this study are publicly available at https://xenabrowser.net/datapages/.
Dataset id: TCGA.LUNG.sampleMap/HumanMethylation450.