Michael B Sohn, Jiarui Lu, Hongzhe Li, A compositional mediation model for a binary outcome: Application to microbiome studies, Bioinformatics, Volume 38, Issue 1, January 2022, Pages 16–21, https://doi.org/10.1093/bioinformatics/btab605
Abstract
The delicate balance of the microbiome is implicated in our health and is shaped by external factors, such as diet and xenobiotics. Therefore, understanding the role of the microbiome in linking external factors and our health conditions is crucial to translate microbiome research into therapeutic and preventative applications.
We introduced a sparse compositional mediation model for binary outcomes to estimate and test the mediation effects of the microbiome, utilizing the compositional algebra defined in the simplex space and a linear zero-sum constraint on probit regression coefficients. For this model with the standard causal assumptions, we showed that both the causal direct and indirect effects are identifiable. We further developed a method for sensitivity analysis of the assumption of no unmeasured confounding between the mediator and the outcome. We conducted extensive simulation studies to assess the performance of the proposed method and applied it to real microbiome data to study the mediation effects of the microbiome in linking fat intake to overweight/obesity.
An R package can be downloaded from https://github.com/mbsohn/cmmb.
Supplementary files are available at Bioinformatics online.
1 Introduction
The human microbiome is recognized as a key determinant of normal physiology and immune homeostasis (Li, 2015; Honda and Littman, 2016; Thaiss et al., 2016). Essential functions provided by the microbiome include the regulation of the immune system and metabolic function, the synthesis of essential vitamins, and the removal of toxic compounds (Heintz-Buschart and Wilmes, 2018). It has also been shown that the microbiome changes readily in response to extrinsic factors, such as diet and xenobiotics (Wu et al., 2011; Lewis et al., 2015; Kurilshikov et al., 2017). This dual role of the microbiome is very appealing in biomedical science, as it can be used as a non-invasive therapeutic application. Modulating targeted microbes using xenobiotics, for instance, would be more effective than imposing a complete dietary change for obesity treatment and could be as effective as bariatric surgery with no severe side effects. To translate the microbiome research into therapeutic and preventative applications, however, we need to understand mechanisms underlying the effect of external factors or interventions on the disease transmitted through the perturbation in the microbiome.
Mediation analysis, which studies the effect of a treatment on an outcome transmitted through a variable called a mediator, has been widely applied in numerous disciplines, such as sociology and epidemiology. It has traditionally been formulated and implemented under the structural equation modeling (SEM) framework (Baron and Kenny, 1986; MacKinnon et al., 2002); however, with recent advances in causal inference, which clarify the assumptions needed for causal interpretation, mediation analysis under the potential outcomes (PO) framework has been gaining popularity (Pearl, 2001; Rubin, 2005; Imai et al., 2010; VanderWeele and Vansteelandt, 2010). Recent studies have extended the traditional single-mediator model to the multiple-mediator model (Imai and Yamamoto, 2013; VanderWeele and Vansteelandt, 2014), even in high-dimensional settings (Chén et al., 2015; Huang and Pan, 2016; Zhao and Luo, 2016). These mediation models, however, are not directly applicable to microbiome data because of their compositional nature.
Compositional data comprise the proportions or percentages of a whole, imposing a unit-sum constraint, i.e. the sum of components is 1 or 100%. This unit-sum constraint makes a composition with k components lie in the (k − 1)-dimensional simplex space and makes it impossible to alter one component without altering at least one of the other components. Neglecting this compositional structure can thus cause undesirable consequences. Sohn and Li (2019) proposed a sparse compositional mediation model (CMM) for continuous outcomes under the PO framework utilizing the algebra defined in the simplex space (Aitchison, 1986; Billheimer et al., 2001) and a linear constraint on regression coefficients, which is a necessary condition to satisfy the basic properties of compositional data, such as scale and permutation invariance (Aitchison and Bacon-Shone, 1984; Lin et al., 2014). Subsequently, a few compositional mediation methods for continuous outcomes have been proposed (Wang et al., 2020; Zhang et al., 2021). In many human microbiome studies, however, the outcome is binary, such as the presence or absence of disease.
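As a small illustration of the two properties just described, the following Python sketch closes a vector of abundances to unit sum and checks that a log-contrast with zero-sum coefficients is unaffected by the overall scale. The abundances and coefficients are hypothetical values chosen for the example; they are not taken from the paper.

```python
import math

def closure(x):
    """Rescale positive values so that they sum to 1 (a point in the simplex)."""
    total = sum(x)
    return [v / total for v in x]

def log_contrast(x, beta):
    """Linear combination of log-components: sum_j beta_j * log(x_j)."""
    return sum(b * math.log(v) for b, v in zip(beta, x))

counts = [30.0, 50.0, 20.0]            # hypothetical raw abundances
comp = closure(counts)                 # proportions summing to 1

# Raising one raw abundance changes every proportion after closure.
bumped = closure([60.0, 50.0, 20.0])

# With coefficients that sum to zero, the log-contrast ignores the scale:
# it is the same for raw abundances and for proportions.
beta = [1.0, -0.4, -0.6]               # hypothetical zero-sum coefficients
scale_invariant = math.isclose(
    log_contrast(counts, beta), log_contrast(comp, beta), abs_tol=1e-12
)
```

Without the zero-sum constraint, the log-contrast would shift by a multiple of the log of the total, so the fitted effect would depend on an arbitrary normalization.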
In this article, we extend CMM to accommodate binary outcomes. The effect of a treatment on all the components of a compositional mediator is jointly estimated using the algebra in the simplex space. To quantify the effects of a treatment and a compositional mediator on binary outcomes, an L1-penalized probit model with a linear constraint is used. Its parameters are estimated by an algorithm that combines iteratively reweighted least squares (IRLS) (Green, 1984; Lee et al., 2006) and the coordinate descent method of multipliers (CDMM) (Lin et al., 2014). To obtain asymptotically unbiased estimates for the parameters of the L1-penalized probit model, we developed a debiasing procedure that extends the methods of Shi et al. (2016) and Lu et al. (2019). We defined an estimator for the mediation effect under the PO framework and evaluated its performance in extensive simulation settings. We also developed a method for sensitivity analysis of the assumption of no unmeasured confounding between the mediator and the outcome. We applied CMM to a real dataset, COMBO (Wu et al., 2011), to link dietary fat intake to overweight/obesity and found a significant effect of fat intake on overweight/obesity mediated through the gut microbiome.
2 Materials and methods
2.1 Algebraic operators in simplex space
2.2 Compositional mediation model for binary outcomes
Suppose that we have n random samples from a population, where we observe an outcome Yi, a compositional mediator , a treatment Ti, and covariates for , and that we consider an expected causal effect of Ti on Yi mediated through , depicted in Figure 1. Then, a model for this mediation effect should take the compositional nature of into account, as . To develop such a model, we utilize algebraic operations defined in the simplex space and a zero-sum constraint on regression coefficients for the components of a composition.
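The algebraic operations usually meant in this context are Aitchison's perturbation and powering (Aitchison, 1986). The sketch below shows both in plain Python. The way they are combined at the end (a baseline composition perturbed by an effect composition raised to the treatment value) is an illustrative assumption and should not be read as a restatement of Model (1); all numeric values are hypothetical.

```python
def closure(x):
    """Rescale positive values so that they sum to 1."""
    total = sum(x)
    return [v / total for v in x]

def perturb(x, y):
    """Aitchison perturbation: component-wise product followed by closure."""
    return closure([a * b for a, b in zip(x, y)])

def power(x, alpha):
    """Aitchison powering: component-wise power followed by closure."""
    return closure([a ** alpha for a in x])

baseline = [0.5, 0.3, 0.2]   # hypothetical baseline composition
effect = [0.2, 0.3, 0.5]     # hypothetical treatment-effect composition

# Assumed illustrative form: baseline perturbed by the effect powered by t.
shifted_t0 = perturb(baseline, power(effect, 0))  # t = 0 leaves the baseline unchanged
shifted_t1 = perturb(baseline, power(effect, 1))  # t = 1 applies the full perturbation
```

Both operations return a composition, so every intermediate quantity stays inside the simplex, which is the point of working with this algebra rather than with ordinary addition and scaling.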
2.3 Model assumptions and identification
2.4 Estimation of composition parameters
2.5 Estimation of regression parameters
The details of this algorithm are provided in Supplementary Material B.
2.6 Debiasing procedure and its asymptotic convergence
2.7 Hypothesis test of mediation effect
To construct a sampling distribution of , we repeat the following steps B times: (i) randomly select n samples from the original n samples with replacement, and (ii) estimate . We use the 95% percentile confidence interval to test the significance of in this study. Alternatively, we can estimate an approximate P-value for utilizing the fact that any bootstrap replicate should have a distribution close to that of when the null hypothesis is true, where denotes an estimated indirect effect derived from a bootstrap sample (Efron and Tibshirani, 1994).
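The resampling steps above can be sketched in Python as follows. This is a generic percentile bootstrap, assuming an arbitrary user-supplied `estimator` function; the sample mean stands in for the indirect-effect estimator, which is not implemented here, and the toy data are simulated for illustration only.

```python
import random

def percentile_ci(data, estimator, n_boot=2000, level=0.95, seed=0):
    """Percentile bootstrap CI: resample n rows with replacement, re-estimate,
    and take the empirical quantiles of the bootstrap replicates."""
    rng = random.Random(seed)
    n = len(data)
    reps = []
    for _ in range(n_boot):
        sample = [data[rng.randrange(n)] for _ in range(n)]  # step (i)
        reps.append(estimator(sample))                       # step (ii)
    reps.sort()
    lo_idx = max(round((1 - level) / 2 * n_boot), 0)
    hi_idx = min(round((1 + level) / 2 * n_boot) - 1, n_boot - 1)
    return reps[lo_idx], reps[hi_idx]

def mean(xs):
    """Placeholder estimator (hypothetical stand-in for the indirect effect)."""
    return sum(xs) / len(xs)

toy_rng = random.Random(1)
toy = [toy_rng.gauss(0.3, 1.0) for _ in range(100)]  # simulated toy data
lower, upper = percentile_ci(toy, mean)

# The effect is declared significant when the interval excludes zero.
significant = not (lower <= 0.0 <= upper)
```

In a real analysis each element of `data` would be one subject's full record (treatment, mediator, outcome, covariates), so that rows are resampled jointly.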
2.8 Sensitivity analysis
3 Results
3.1 Simulation study I: synthetic data
Mediation analysis for multiple or high-dimensional mediators often assumes independence between mediators. One approach to satisfying this assumption is to use principal components (PCs) of mediators, i.e. PCs of as mediators. We use this approach under structural equation modeling (hereinafter referred to as PCS) and under the potential outcomes framework (PCP) to evaluate the performance of CMM. The main difference between these two approaches is how to estimate the direct and indirect effects: for PCS, the inner product of path coefficients (i.e. ) was used for the indirect effect; and for PCP, an expression derived from the mediation formula was used (Pearl, 2001; Imai et al., 2010), which is like the expression for .
In data generation, we randomly generated a treatment Ti from a Bernoulli distribution with success probability 0.5; a compositional disturbance from a multivariate logistic normal (LN) distribution (Aitchison, 1986) with mean and covariance ; a regression disturbance from a standard normal distribution, where . We fixed , and c = 1 for k = 5, 25, 50. For a baseline composition, was used. A composition and an outcome Yi were then generated according to Models (1) and (2), respectively. Throughout the simulation studies and a real data application, we tested the direct and indirect effects at the 95% confidence level.
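A rough Python sketch of a data-generating process of this kind is given below. Because Models (1) and (2) and the parameter values are not reproduced in this text, the forms used here are assumptions: the mediator is a baseline composition perturbed by a powered effect composition and by a logistic-normal disturbance, and the outcome is a probit-type threshold on a zero-sum log-contrast. The disturbance uses independent normals, and every numeric value is hypothetical.

```python
import math
import random

def closure(x):
    total = sum(x)
    return [v / total for v in x]

def perturb(x, y):
    return closure([a * b for a, b in zip(x, y)])

def power(x, alpha):
    return closure([a ** alpha for a in x])

def logistic_normal(k, rng, sd=1.0):
    """Draw a k-part composition by exponentiating k-1 independent normals,
    appending a reference part, and closing (inverse additive log-ratio)."""
    z = [rng.gauss(0.0, sd) for _ in range(k - 1)]
    return closure([math.exp(v) for v in z] + [1.0])

def simulate(n, k, a, b, c, baseline, seed=0):
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        t = 1 if rng.random() < 0.5 else 0            # Bernoulli(0.5) treatment
        u = logistic_normal(k, rng)                    # compositional disturbance
        m = perturb(perturb(baseline, power(a, t)), u) # assumed mediator model
        log_contrast = sum(bj * math.log(mj) for bj, mj in zip(b, m))
        latent = c * t + log_contrast + rng.gauss(0.0, 1.0)
        y = 1 if latent > 0 else 0                     # assumed probit-type outcome
        rows.append((t, m, y))
    return rows

k = 5
baseline = [1.0 / k] * k                      # hypothetical baseline composition
a = closure([1.0, 2.0, 1.0, 1.0, 1.0])        # hypothetical treatment effect
b = [0.5, -0.5, 0.0, 0.0, 0.0]                # hypothetical zero-sum coefficients
data = simulate(n=100, k=k, a=a, b=b, c=1.0, baseline=baseline)
```

A correlated disturbance would require a multivariate normal draw with a chosen covariance matrix in place of the independent draws used here.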
We first compared the coverage rate for the indirect effect, which measures the proportion of the time that the estimated intervals contain the true value of the indirect effect. To this end, we first generated in each repetition, where r is randomly generated from the standard uniform distribution. In this setting, the true or known value of the total indirect effect is between 0 and 0.14. We then constructed a bootstrap confidence interval (CI) with 2000 bootstrap samples and measured the coverage rate for each method with each k. Figure 2 shows the results of 100 repetitions for each k. CMM yields a coverage rate around the nominal coverage rate (i.e. 0.95) for all the values of k considered. PCS gives a coverage rate around 0.95 when k = 5, but its coverage rate trends upward as k increases. The coverage rate of PCP is a little lower than the nominal coverage rate for all k considered.
The second measure we used in performance comparison is the true positive rate versus the relative effect size, . Instead of randomly generating r, we increased r from 0 to 1 by 0.01 and calculated the true positive rate at a given r, which reflects a relative effect size of . For each value of r, we used 100 repetitions. As shown in Figure 3, CMM outperforms PCP and PCS, even in a low dimensional setting (i.e. k = 5).
We also compared the power and the size of these methods with n = 100 and k = 200. In this setting, we fixed r = 1 and estimated the total mediation effects and their bootstrap CIs. Based on 1000 and 500 simulations for the size and the power, respectively, all the methods controlled type I errors (CMM = 0.00, PCP = 0.01, and PCS = 0.01), but, similar to the results with smaller k, CMM had much higher power than the other methods (CMM = 0.73, PCP = 0.03, and PCS = 0.03).
3.2 Simulation study II: real microbiome data
To make the simulation setting more realistic, we used the composition of taxa in a real dataset, referred to as the ‘COMBO’ data (Wu et al., 2011), which is analyzed in Section 3.3. We first randomly permuted Ti and Yi to measure the empirical size at . For the power, we randomly generated Ti from N(0, 1) and estimated a with the Dirichlet regression (Maier, 2014). We then located the two largest and two smallest values of a and set if if , and bj = 0 otherwise, where the subscript indicates the jth order. The direct effect c was set to 1, and Yi was generated by the probit regression model (2). The estimated in this setting was 0.29 ± 0.14. As PCP had a slightly better performance than PCS, we included only PCP in the comparison.
As shown in Table 1, both PCP and CMM roughly control type I errors and have comparable powers for the direct effect. However, PCP has very low power to detect the total indirect effect, which is similar to the results in Section 3.1.
Method | Power (DE) | Power (IDE) | Size (DE) | Size (IDE)
---|---|---|---|---
CMM | 0.900 | 0.942 | 0.066 | 0.004
PCP | 0.810 | 0.222 | 0.051 | 0.001
The 1000 and 500 simulations were used for size and power, respectively.
3.3 Real data analysis: COMBO data
We applied CMM to the COMBO data, which consist of 16S rRNA gene sequences from fecal samples of 96 healthy individuals. The data also contain demographic and clinical information, including fat intake and BMI. Operational taxonomic units (OTUs) were summarized at the genus level, and the genera that appeared in fewer than 10% of the samples were excluded, leaving 45 genera in 96 samples for analysis. Because of the compositional nature of the data, the OTU counts assigned to the genera were transformed into proportions after adding a small number (0.5) to avoid the log-transformation of zero proportions, which is a common practice in compositional data analysis (Aitchison, 1986).
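The preprocessing described above can be sketched in a few lines of Python. The sketch assumes that the pseudo-count of 0.5 is added to every count (rather than only to zeros) and that a genus is kept when it is present in at least 10% of samples; the count table is a hypothetical toy example, not the COMBO data.

```python
def filter_by_prevalence(table, min_frac=0.10):
    """Keep columns (genera) that are nonzero in at least min_frac of rows (samples)."""
    n = len(table)
    keep = [
        j for j in range(len(table[0]))
        if sum(1 for row in table if row[j] > 0) / n >= min_frac
    ]
    return [[row[j] for j in keep] for row in table], keep

def counts_to_proportions(counts, pseudo=0.5):
    """Add a pseudo-count to every entry (assumption), then close to unit sum."""
    shifted = [c + pseudo for c in counts]
    total = sum(shifted)
    return [v / total for v in shifted]

# Hypothetical toy table: 10 samples (rows) by 3 genera (columns).
table = [[10, 0, 5], [7, 0, 3]] + [[8, 0, 0] for _ in range(8)]

filtered, keep = filter_by_prevalence(table)   # the all-zero genus is dropped
props = [counts_to_proportions(row) for row in filtered]
```

After this step every proportion is strictly positive, so log-ratios and log-contrasts of the components are well defined.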
We dichotomized BMI at 25, the threshold generally used to define normal weight (BMI < 25) and overweight/obesity (BMI ≥ 25), and tested whether the total effect of fat intake on overweight/obesity was statistically significant. The total calorie intake was included in the model as a pretreatment covariate. The estimated total effect with a probit model (i.e. ) was 0.122 with a 95% bootstrap CI of (0.017, 0.247). In other words, fat intake has a positive effect on overweight/obesity. CMM was then applied to study a mechanism of the effect of fat intake on overweight/obesity, in which the 45 genera were included as the components of a compositional mediator. The estimated direct effect was 0.018 with a CI of (−0.003, 0.073) and the estimated indirect effect was 0.030 with a CI of (0.000, 0.113), indicating positive mediation effects of fat intake on overweight/obesity.
To estimate component-wise mediation effects, we need to know the distribution of for ; however, it is not attainable even though we know a distribution of for . Thus, we assessed the product of path coefficients instead to identify potential component-wise mediation effects, as it is directly related to the component-wise mediation effect. The genus Oscillibacter was identified as a potential mediator: its estimated product of path coefficients was 0.062 with a 95% bootstrap CI of (0.002, 0.185). In previous studies, Oscillibacter-like organisms have been identified as potentially important gut microbes that mediate high fat-induced gut dysfunction and permeability, and it has been shown that a decrease in Oscillibacter led to an increase in gut permeability, which was shown to be associated with obesity (Lam et al., 2012; Teixeira et al., 2012). The estimated products of path coefficients for other components and their 95% bootstrap CIs are shown in Figure 4.
Since only Oscillibacter was identified as a potentially significant mediator, we included another genus to quantify the sensitivity of the results to the assumption of no unmeasured confounding. Note that CMM takes a compositional mediator, so the number of components (mediators) must be greater than one. Figure 5 presents the result of the sensitivity analysis. The estimated mediation effect through Oscillibacter and Allisonella at ρ = 0 was 0.026 with a 95% bootstrap CI of (0.006, 0.043). For , the sign and significance of the estimated mediation effect remained unchanged. The 95% bootstrap CI covered the value of zero only when .
4 Discussion
In this study, we propose a sparse compositional mediation model for binary outcomes. To account for the characteristics of compositional data, we adopt the staying-in-the-simplex approach to jointly estimate the effect of a treatment on all the components of a compositional mediator, and we use an L1-penalized log-contrast regression model to estimate the effects of the treatment and the components of a compositional mediator on binary outcomes. We demonstrated in simulation studies that CMM performs better than the methods based on principal component approaches. CMM also indicates which components (taxa) could be potential drivers of mediation effects, which cannot be obtained directly from the principal component-based approaches. Applying CMM to the COMBO data, we found a significant positive mediation effect of the gut microbiome in linking fat intake and overweight/obesity.
CMM, like other causal mediation models, requires assumptions to identify the direct and indirect effects. These assumptions are generally not verifiable with observational data. However, the assumption that treatment assignment is ignorable given observed pretreatment covariates is usually attained in a subgroup having similar characteristics. The no-confounding effects assumption between mediators and an outcome is often taken for granted after the observed pretreatment covariates are adjusted, and its sensitivity to unmeasured confounding effects is often measured. We allow pretreatment covariates in modeling CMM and provide a method for sensitivity analysis.
For the rare outcome case, the natural direct and indirect effects can be defined in log odds ratios, assuming follows a logistic distribution in Model (2); and their estimates can be approximated by c and , respectively. However, the logit model is computationally more intensive than the probit model for general cases in estimating the mediation effect. CMM was developed mainly for the general outcome case, which is more common in microbiome studies. So, CMM may not be an optimal method for the rare outcome. CMM uses a non-parametric bootstrap approach to testing the direct and indirect effects that involve the debiasing procedure, so it requires substantial computation time. For instance, it took 9 h and 29 min to run CMM with 100 samples and 200 components on a MacBook Pro with 2.0 GHz quad-core Intel Core i5. It would take longer if sensitivity analysis were also performed. We recommend sensitivity analysis be performed with a subset, as we did in the analysis of the COMBO data in Section 3.3.
The proposed method can be extended to multi-categorical treatments by utilizing indicator coding. However, extending CMM to multi-categorical outcomes or count outcomes is not trivial. These extensions are interesting future research topics. Another interesting and urgent extension of CMM is for longitudinal data, which has become increasingly common in clinical microbiome studies.
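For readers unfamiliar with indicator coding, the following Python sketch shows how a multi-categorical treatment could be expanded into 0/1 indicator columns relative to a reference level. The function name and the treatment labels are hypothetical and serve only as an illustration; how the resulting columns would enter CMM is not specified here.

```python
def indicator_coding(levels, reference):
    """Map each treatment level to 0/1 indicators, omitting the reference level."""
    others = [lv for lv in sorted(set(levels)) if lv != reference]
    coded = [[1 if lv == other else 0 for other in others] for lv in levels]
    return coded, others

# Hypothetical three-arm treatment with "control" as the reference level.
arms = ["control", "low_fat", "high_fat", "control"]
coded, columns = indicator_coding(arms, reference="control")
```

Each non-reference level gets its own column, and subjects at the reference level are coded as all zeros, so each indicator contrasts one treatment level with the reference.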
Acknowledgements
The authors would like to thank the reviewers for reviewing and suggesting valuable improvements to this work.
Funding
This work has been partially supported by the startup fund from the University of Rochester Medical Center.
Conflict of Interest: none declared.