A compositional mediation model for a binary outcome: Application to microbiome studies

Sohn, Michael B; Lu, Jiarui; Li, Hongzhe

doi:10.1093/bioinformatics/btab605

Abstract

Motivation

The delicate balance of the microbiome is implicated in our health and is shaped by external factors, such as diet and xenobiotics. Therefore, understanding the role of the microbiome in linking external factors and our health conditions is crucial to translate microbiome research into therapeutic and preventative applications.

Results

We introduced a sparse compositional mediation model for binary outcomes to estimate and test the mediation effects of the microbiome utilizing the compositional algebra defined in the simplex space and a linear zero-sum constraint on probit regression coefficients. For this model with the standard causal assumptions, we showed that both the causal direct and indirect effects are identifiable. We further developed a method for sensitivity analysis for the assumption of the no unmeasured confounding effects between the mediator and the outcome. We conducted extensive simulation studies to assess the performance of the proposed method and applied it to real microbiome data to study mediation effects of the microbiome on linking fat intake to overweight/obesity.

Availability and implementation

An R package can be downloaded from https://github.com/mbsohn/cmmb.

Supplementary information

Supplementary files are available at Bioinformatics online.

1 Introduction

The human microbiome is recognized as a key determinant of normal physiology and immune homeostasis (Li, 2015; Honda and Littman, 2016; Thaiss et al., 2016). Essential functions provided by the microbiome include the regulation of the immune system and metabolic function, the synthesis of essential vitamins, and the removal of toxic compounds (Heintz-Buschart and Wilmes, 2018). It has also been shown that the microbiome changes readily in response to extrinsic factors, such as diet and xenobiotics (Wu et al., 2011; Lewis et al., 2015; Kurilshikov et al., 2017). This dual role of the microbiome is very appealing in biomedical science, as it can be used as a non-invasive therapeutic application. Modulating targeted microbes using xenobiotics, for instance, would be more effective than imposing a complete dietary change for obesity treatment and could be as effective as bariatric surgery with no severe side effects. To translate the microbiome research into therapeutic and preventative applications, however, we need to understand mechanisms underlying the effect of external factors or interventions on the disease transmitted through the perturbation in the microbiome.

Mediation analysis, which studies the effect of treatment on outcome transmitted through a variable called a mediator, has been widely applied in numerous disciplines, such as sociology and epidemiology. It traditionally has been formulated and implemented under the structural equation modeling (SEM) framework (Baron and Kenny, 1986; MacKinnon et al., 2002); however, with recent advances in causal inference, which clarifies the assumptions needed for causal interpretation, mediation analysis under the potential outcomes (PO) framework has been gaining popularity (Pearl, 2001; Rubin, 2005; Imai et al., 2010; VanderWeele and Vansteelandt, 2010). Recent studies have extended the traditional single-mediator model to the multiple-mediators model (Imai and Yamamoto, 2013; VanderWeele and Vansteelandt, 2014), even in high-dimensional settings (Chén et al., 2015; Huang and Pan, 2016; Zhao and Luo, 2016). These mediation models, however, are not directly applicable for microbiome data due to the compositional nature of the microbiome data.

Compositional data comprise the proportions or percentages of a whole, imposing a unit-sum constraint, i.e. the sum of components is 1 or 100%. This unit-sum constraint makes a composition with k-components lie in the $(k - 1)$ -dimensional simplex space $S^{k - 1}$ and makes it impossible to alter one component without altering at least one of the other components. Neglecting this compositional structure thus can cause undesirable consequences. Sohn and Li (2019) proposed a sparse compositional mediation model (CMM) for continuous outcomes under the PO framework utilizing the algebra defined in the simplex space (Aitchison, 1986; Billheimer et al., 2001) and a linear constraint on regression coefficients, which is a necessary condition to satisfy the basic properties of compositional data, such as scale and permutation invariance (Aitchison and Bacon-Shone, 1984; Lin et al., 2014). Subsequently, a few compositional mediation methods for continuous outcomes have been proposed (Wang et al., 2020; Zhang et al., 2021). In many human microbiome studies, however, the outcome is binary, such as the presence or absence of disease.

In this article, we extend CMM to accommodate binary outcomes. The effect of a treatment on all the components of a compositional mediator is jointly estimated using the algebra in the simplex space. For the quantification of the effects of a treatment and a compositional mediator on binary outcomes, an L1-penalized probit model with a linear constraint is used. Its parameters are estimated by an algorithm that combines the iteratively reweighted least-squares (IRLS) (Green, 1984; Lee et al., 2006) and the coordinate descent method of multipliers (CDMM) (Lin et al., 2014). To obtain asymptotically unbiased estimates for the parameters of the L1-penalized probit model, we developed a debias procedure that extends the methods of Shi et al. (2016) and Lu et al. (2019). We defined an estimator for the mediation effect under the PO framework and evaluated its performance in extensive simulation settings. We also developed a method for sensitivity analysis for the assumption of the no unmeasured confounding effects between the mediator and the outcome. We applied CMM to a real dataset, COMBO (Wu et al., 2011), to link diet fat intake to overweight/obesity and found a significant effect of fat intake on overweight/obesity mediated through the gut microbiome.

2 Materials and methods

2.1 Algebraic operators in simplex space

We first provide the definitions of the algebraic operators in the simplex space that appear in this article. For two compositions of k-components

η, ζ \in S^{k - 1}

⁠, the perturbation operator is defined as

η \oplus ζ = {(η_{1} ζ_{1} / \sum_{j = 1}^{k} η_{j} ζ_{j}, \dots, η_{k} ζ_{k} / \sum_{j = 1}^{k} η_{j} ζ_{j})}^{⊤};

the inverse of the perturbation operator as

η ⊖ ζ = {(η_{1} ζ_{1}^{- 1} / \sum_{j = 1}^{k} η_{j} ζ_{j}^{- 1}, \dots, η_{k} ζ_{k}^{- 1} / \sum_{j = 1}^{k} η_{j} ζ_{j}^{- 1})}^{⊤};

the power transformation for a composition

η

by a scalar

υ

as

η^{υ} = {(η_{1}^{υ} / \sum_{j = 1}^{k} η_{j}^{υ}, \dots, η_{k}^{υ} / \sum_{j = 1}^{k} η_{j}^{υ})}^{⊤};

and a norm for composition as

‖ η ‖ = {(η^{T} η)}^{1 / 2} = {(alr {(η)}^{T} N^{- 1} alr (η))}^{1 / 2},

where

alr (\cdot)

is the additive log-ratio transformation and

N^{- 1}

is the inverse matrix of a

(k - 1) \times (k - 1)

matrix

N = I_{k - 1} + 1_{k - 1} 1_{k - 1}^{T}

(Aitchison, 1986; Billheimer et al., 2001).

2.2 Compositional mediation model for binary outcomes

Suppose that we have n random samples from a population, where we observe an outcome Y_i, a compositional mediator $M_{i}$ ⁠, a treatment T_i, and covariates $X_{i}$ for $i = 1, \dots, n$ ⁠, and that we consider an expected causal effect of T_i on Y_i mediated through $M_{i}$ ⁠, depicted in Figure 1. Then, a model for this mediation effect should take the compositional nature of $M_{i}$ into an account, as $M_{i} \in S^{k - 1}$ ⁠. To develop such a model, we utilize algebraic operations defined in the simplex space and a zero-sum constraint on regression coefficients for the components of a composition.

Fig. 1.

A compositional mediation model: aj, bj, and c are path coefficients, j=1,…,k; U1i and U2i are disturbance terms for k compositional mediators Mi and an outcome Yi, respectively; Ti is a treatment variable; Xi is a set of pretreatment covariates

Open in new tab Download slide

A compositional mediation model: a_j, b_j, and c are path coefficients, $j = 1, \dots, k$ ⁠; $U_{1 i}$ and $U_{2 i}$ are disturbance terms for k compositional mediators $M_{i}$ and an outcome Y_i, respectively; T_i is a treatment variable; $X_{i}$ is a set of pretreatment covariates

With the perturbation and power transformation operators, the proposed compositional mediation model for a binary outcome (CMM) is given by

M_{i} = (m_{0} \oplus a^{T_{i}} \oplus_{r = 1}^{n_{x}} h_{r}^{X_{r i}}) \oplus U_{1 i}

(1)

\begin{matrix} Y_{i} = 1 {c_{0} + c T_{i} + b^{⊤} (log M_{i}) + g^{⊤} X_{i} + U_{2 i} > 0}, \\ subject to 1_{k}^{⊤} b = 0, \end{matrix}

(2)

where

m_{0}

is a baseline composition (i.e. when

T_{i} = E (T_{i})

⁠); c₀ a baseline measure for Y_i; a a vector of composition parameters for a treatment; c regression coefficients for the treatment; b regression coefficients for the composition;

h_{1}, \dots, h_{n_{x}}

and g nuisance parameters corresponding to

X_{i}

⁠;

U_{1 i}

and

U_{2 i}

disturbance terms for

M_{i}

and Y_i, respectively; and

\oplus_{r = 1}^{n_{x}} η_{r} = η_{1} \oplus \dots \oplus η_{n_{x}}

⁠. We assume

U_{1 i}

follows a logistic normal distribution with mean

0

and covariance Σ and

U_{2 i}

follows a standard normal distribution. Model (1) formulates the effect of a treatment on a compositional mediator perturbed from the baseline composition, which is measured by the parameter a, after adjusting for pretreatment covariates, and Model (2) links treatment and a compositional mediator to a binary outcome after adjusting for pretreatment covariates while it accounts for the compositional nature of

M_{i}

by imposing a zero-sum constraint,

b^{⊤} 1_{k} = 0

⁠. Note that the vector of regression coefficients b is scale-invariant with respective to

M_{i}

because of the zero-sum constraint, i.e.

{(log C M_{i})}^{⊤} b = {(log M_{i})}^{⊤} b

for any constant C.

2.3 Model assumptions and identification

As in most of the work on causal mediation analysis, estimators of the natural direct and indirect (or mediation) effects for the proposed method are defined under the causal assumptions: the stable unit treatment value assumption (SUTVA) (Imbens and Rubin, 2015), the positivity assumption, and the no-unmeasured confounding assumption, i.e. a set of pretreatment covariates is sufficient to control for confounding effects. See Supplementary Material D.3 for details of these assumptions. Suppose that Models (1) and (2) are correctly specified. Then, under these assumptions, the direct effect

ζ (t)

and the total indirect effect

δ (t)

are identifiable and given by

\begin{matrix} ζ (τ) \equiv E [Y_{i} (t, log M_{i} (τ)) - Y_{i} (t', log M_{i} (τ)) | X_{i} = x] \\ = E {Φ (\frac{c t + f_{ζ} (τ, X_{i})}{\sqrt{b_{- k}^{⊤} Σ b_{- k} + 1}}) - Φ (\frac{c t' + f_{ζ} (τ, X_{i})}{\sqrt{b_{- k}^{⊤} Σ b_{- k} + 1}})}, \end{matrix}

\begin{matrix} δ (τ) \equiv E [Y_{i} (τ, log M_{i} (t)) - Y_{i} (τ, log M_{i} (t')) | X_{i} = x] \\ = E {Φ (\frac{{(log a)}^{⊤} b t + f_{δ} (τ, X_{i})}{\sqrt{b_{- k}^{⊤} Σ b_{- k} + 1}}) \\ - Φ (\frac{{(log a)}^{⊤} b t' + f_{δ} (τ, X_{i})}{\sqrt{b_{- k}^{⊤} Σ b_{- k} + 1}})}, \end{matrix}

where t is an observed treatment,

t'

a reference value for the treatment,

f_{ζ} (τ, x) = c_{0} + b^{⊤} (log m_{0} + τ log a + \sum_{r = 1}^{n_{x}} x_{r} log h_{r}) + g^{⊤} x

⁠, and

f_{δ} (τ, x) = c_{0} + c τ + b^{⊤} (log m_{0} + \sum_{r = 1}^{n_{x}} x_{r} log h_{r}) + g^{⊤} x

⁠. Note that these estimators,

ζ (τ)

and

δ (τ)

⁠, are invariant to the order of components (taxa) because of the constraint on a (i.e. lies in the simplex space) and the zero-sum constraint of b (i.e.

1_{k}^{⊤} b = 0

⁠).

2.4 Estimation of composition parameters

To estimate the parameters in Model (1), we minimize the difference between observed and estimated compositions in

S^{k - 1}

⁠. With the operators in

S^{k - 1}

⁠, we solve the following optimization problem:

\hat{θ} = \underset{m_{0}, a, h_{r} \in S^{k - 1}}{argmin} \sum_{i = 1}^{n} {‖ M_{i} ⊖ (m_{0} \oplus a^{T_{i}} \oplus_{r = 1}^{n_{x}} h_{r}^{X_{r i}}) ‖}^{2},

(3)

where

{\hat{θ}}^{⊤} = ({\hat{m}}_{0}, \hat{a}, {\hat{h}}_{1}, \dots, {\hat{h}}_{n_{x}})

⁠. The objective function in (3) is convex in terms of

alr (m_{0}), alr (a)

⁠, and

alr (h_{r})

for

r = 1, \dots, n_{x}

⁠, and the estimators are consistent and unbiased estimators. See Supplementary Material A for details.

2.5 Estimation of regression parameters

To estimate the parameters of the composition in Model (2), we will use a log-contrast model. Let

η_{i} = 2 y_{i} - 1, z_{i} = {(1, t_{i}, log {(m_{i})}^{⊤}, x_{i}^{⊤})}^{⊤}, β = {(c_{0}, c, b^{⊤}, g^{⊤})}^{⊤}

⁠, and

q (η_{i} z_{i}^{⊤} β) = - log Φ (η_{i} z_{i}^{⊤} β)

⁠. Then, the L₁ penalized log-likelihood function for Model (2) is given by

\hat{β} = \underset{β}{argmin} {\frac{1}{n} \sum_{i = 1}^{n} q (η_{i} z_{i}^{⊤} β)}, subject to | | β | |_{1} \leq t; 1_{k}^{⊤} b = 0,

where

t \geq 0

is some constant. The solution of this optimization problem is equivalent to the solution of the following optimization problem:

\hat{β} = \underset{β}{argmin} {\frac{1}{2 n} | | Ξ^{1 / 2} (u - Z β) | |_{2}^{2} + λ {‖ β ‖}_{1}}; 1_{k}^{⊤} b = 0,

(4)

where Ξ is an n × n diagonal matrix with its ith diagonal term

Ξ_{i i} = ξ_{i} (η_{i} z_{i}^{⊤} β^{*}) [z_{i}^{⊤} β^{*} + ξ_{i} (η_{i} z_{i}^{⊤} β^{*})]

⁠;

β^{*}

a vector lying between

β_{0}

and

β

⁠;

ξ_{i} (β) = η_{i} ϕ (η_{i} z_{i}^{⊤} β) / Φ (η_{i} z_{i}^{⊤} β)

⁠;

ξ (β_{0}) = {(ξ_{1} (β_{0}), \dots, ξ_{1} (β_{0}))}^{⊤}

⁠;

Z = {(z_{1}, \dots, z_{n})}^{⊤}

⁠;

u = Z β_{0} + Ξ^{-} ξ (β_{0})

⁠; and

λ \geq 0

is a penalty term. Letting

\tilde{Z} = Z (I_{p} - ι ι^{⊤} / k)

⁠, where

ι^{⊤} = (0, 0, 1_{k}^{⊤}, 0, \dots, 0)

⁠, Equation (4) can be written as

\hat{β} = \underset{β}{argmin} {\frac{1}{2 n} | | Ξ^{1 / 2} (u - \tilde{Z} β) | |_{2}^{2} + λ {‖ β ‖}_{1}}, ι^{⊤} β = 0.

(5)

We need this transformation of Z for the debiasing procedure in the following section. Note that the solutions of Equations (4) and (5) are the same since

ι^{⊤} β = 0

⁠. The objective function in Equation (5) has the form of weighted least squares; however, Ξ and u depend on unknown quantities,

β^{*}

and

β_{0}

⁠, respectively. Therefore, we propose a method that combines iteratively reweighted least squares and coordinate descent method of multipliers (IRLS-CDMM), which iteratively updates

Ξ^{(ℓ)}

and

u^{(ℓ)}

⁠. The algorithm for IRLS-CDMM consists of constructing the augmented Lagrangian,

L_{μ} = \frac{1}{2 n} | | Ξ^{1 / 2} (u - \tilde{Z} β) | |_{2}^{2} + λ {‖ β ‖}_{1} + ς ι^{⊤} β + \frac{μ}{2} {(ι^{⊤} β)}^{2},

where ς is the Lagrange multiplier and

μ > 0

a penalty parameter; and iterative updates of

β^{(ℓ + 1)} \leftarrow \underset{β}{argmin} L_{μ} (β, Ξ^{(ℓ)}, u^{(ℓ)}, γ^{(ℓ)}), Ξ^{(ℓ + 1)} \leftarrow \underset{β^{*}}{argmin} \sum_{i = 1}^{n} q (η_{i} {\tilde{z}}_{i}^{⊤} β^{*}), u^{(ℓ + 1)} \leftarrow \tilde{Z} β^{(ℓ + 1)} + Ξ^{- (ℓ + 1)} ξ (β^{(ℓ + 1)}), {(ς / μ)}^{(ℓ + 1)} \leftarrow {(ς / μ)}^{(ℓ)} + ι^{⊤} β .

The details of this algorithm are provided in Supplementary Material B.

2.6 Debiasing procedure and its asymptotic convergence

The solution

\hat{β}

of Equation (5) is biased because of L₁ penalization. To correct this bias, we adapt the debiased procedure of Shi et al. (2016) and Lu et al. (2019). The proposed de-biased estimator of

\hat{β}

given

\hat{Ξ}

and

\hat{u}

is given by

{\hat{β}}_{d b} = \hat{β} + \frac{1}{n} \tilde{Θ} {\tilde{Z}}^{⊤} \hat{Ξ} (\hat{u} - \tilde{Z} \hat{β}),

(6)

where

\tilde{Θ} = (I_{p} - ι ι^{⊤} / k) \hat{Θ}

and

\hat{Θ} = {({\hat{θ}}_{1}, \dots, {\hat{θ}}_{p})}^{⊤}

is a solution of the following convex problem,

{\hat{θ}}_{j} = \min θ_{j}^{⊤} \hat{Σ} θ_{j} s . t . | | \hat{Σ} θ_{j} - (I_{p} - ι ι^{⊤} / k) e_{j} | |_{\infty} \leq γ,

where

j = 1, \dots, p, \hat{Σ} = {\tilde{Z}}^{⊤} \hat{Ξ} \tilde{Z} / n, e_{j} \in R^{p}

the vector with one at the jth position and zero everywhere else, and γ some constant. Under some regularity conditions, the debiased estimator

{\hat{β}}_{d b}

converges to

β

as

n, p \to \infty

⁠. The algorithm for the debiasing procedure and the asymptotic properties of the debiased estimators are given in Supplementary Materials C and D.

2.7 Hypothesis test of mediation effect

The distribution of the total mediation effect

δ (τ)

is unknown, so we propose a bootstrap approach to test the significance of an expected causal mediation effect,

H_{0} : δ (τ) = 0 vs . H_{1} : δ (τ) \neq 0.

(7)

To construct a sampling distribution of $δ (τ)$ ⁠, we repeat the following steps B times: (i) randomly select n samples from the original n samples with replacement, and (ii) estimate $δ_{b} (τ)$ ⁠. We use the 95% percentile confidence interval to test the significance of $δ (τ)$ in this study. Alternatively, we can estimate an approximate P-value for $δ (τ)$ utilizing the fact that any bootstrap replicate $δ {(τ)}_{b} - δ (τ)$ should have a distribution close to that of $δ (τ)$ when the null hypothesis is true, where $δ {(τ)}_{b}$ denotes an estimated indirect effect derived from a bootstrap sample (Efron and Tibshirani, 1994).

2.8 Sensitivity analysis

In mediation analysis, the assumption of no unmeasured confounding effects is not verifiable. Particularly, no unmeasured confounding effects between a mediator and an outcome cannot be assured even in a randomized experiment, thus rendering estimated mediation effects prone to be biased. To address this potential problem, we propose a method for sensitivity analysis that extends the method proposed by Imai et al. (2010). Let

ρ \equiv Corr (alr {(U_{1 i})}_{j}, U_{2 i})

be a correlation between the disturbance terms for the compositional mediator and the outcome, respectively for all

j = 1, \dots k - 1

and

Y_{i} = 1 {{\tilde{c}}_{0} + \tilde{c} T_{i} + {\tilde{g}}^{⊤} X_{i} + U_{0 i} > 0}

be a probit regression model for the total effect of T on Y. Suppose that the model assumptions are satisfied and the models are correctly specified. Then, for a given correlation ρ, we can identify the expected total mediation effect using

δ_{ρ} (τ) = E {Φ ({\tilde{f}}_{δ} (τ) + \frac{{(log a)}^{⊤} b_{ρ} (t - τ)}{Ψ (ρ, b_{ρ}, Σ)}) - Φ ({\tilde{f}}_{δ} (τ) + \frac{{(log a)}^{⊤} b_{ρ} (t' - τ)}{Ψ (ρ, b_{ρ}, Σ)})},

(8)

where

{\tilde{f}}_{δ} (τ) = {\tilde{c}}_{0} + \tilde{c} τ + {\tilde{g}}^{⊤} x_{i}, b_{ρ}

is an adjusted estimate given ρ, and

Ψ (ρ, b_{ρ}, Σ) = {[{(b_{ρ})}_{- k}^{⊤} Σ {(b_{ρ})}_{- k} + 2 ρ {(b_{ρ})}_{- k}^{⊤} diag {(Σ)}^{1 / 2} + 1]}^{1 / 2}

⁠. We can estimate

b_{ρ}

utilizing the correlation between

U_{0 i}

and

alr {(U_{1 i})}_{j}

⁠. Note the range of ρ is not from −1 to 1 when the number of components is more than one since the components are not independent of each other, and its range becomes narrower as the number of components increases. See Supplementary Material E for details.

3 Results

3.1 Simulation study I: synthetic data

Mediation analysis for multiple or high-dimensional mediators often assumes independence between mediators. One approach to satisfying this assumption is to use principal components (PCs) of mediators, i.e. PCs of $log (M)$ as mediators. We use this approach under structural equation modeling (hereinafter referred to as PCS) and under the potential outcomes framework (PCP) to evaluate the performance of CMM. The main difference between these two approaches is how to estimate the direct and indirect effects: for PCS, the inner product of path coefficients (i.e. $log {(a)}^{⊤} b$ ⁠) was used for the indirect effect; and for PCP, an expression derived from the mediation formula was used (Pearl, 2001; Imai et al., 2010), which is like the expression for $δ (τ)$ ⁠.

In data generation, we randomly generated a treatment T_i from a Bernoulli distribution with success probability 0.5; a compositional disturbance $U_{1 i}$ from a multivariate logistic normal (LN) distribution (Aitchison, 1986) with mean $0$ and covariance $N$ ⁠; a regression disturbance $U_{2 i}$ from a standard normal distribution, where $i = 1, \dots, 50$ ⁠. We fixed $a = {(20, 10, 5, 2, 1_{k - 4}^{⊤})}^{⊤} / (20, 10, 5, 2, 1_{k - 4}^{⊤}) 1_{k}, b = {(0.5, - 0.5, 0.5, - 0.5, 0_{k - 4}^{⊤})}^{⊤}$ ⁠, and c = 1 for k = 5, 25, 50. For a baseline composition, $m_{0} = 1_{k} / k$ was used. A composition $M_{i}$ and an outcome Y_i were then generated according to Models (1) and (2), respectively. Throughout the simulation studies and a real data application, we tested the direct and indirect effects at the 95% confidence level.

We first compared the coverage rate for the indirect effect, which measures a proportion of the time that estimated intervals contain the true value of an indirect effect. To this end, we first generated ${(log a)}^{⊤} b \times r$ in each repetition, where r is randomly generated from the standard uniform distribution. In this setting, the true or known value of the total indirect effect is between 0 and 0.14. We then constructed a bootstrap confidence interval (CI) with 2000 bootstrap samples and measured the coverage rate for each method with each k. Figure 2 shows the results of 100 repetitions for each k. CMM yields the coverage rate around the nominal coverage rate (i.e. 0.95) for all the values of k considered. PCS gives the coverage rate around 0.95 when k = 5 but has an upward trend along with increased k. The coverage rate of PCP is a little lower than the nominal coverage rate for all k considered.

Fig. 2.

Open in new tab Download slide

Coverage probabilities for the indirect effect estimated by CMM, PCP, and PCS for different numbers of taxa k

The second measure we used in performance comparison is the true positive rate versus the relative effect size, ${(log a)}^{⊤} b \times r$ ⁠. Instead of randomly generating r, we increased r from 0 to 1 by 0.01 and calculated the true positive rate at a given r, which reflects a relative effect size of ${(log a)}^{⊤} b$ ⁠. For each value of r, we used 100 repetitions. As shown in Figure 3, CMM outperforms PCP and PCS, even in a low dimensional setting (i.e. k = 5).

Fig. 3.

Open in new tab Download slide

True positive rate versus relative effect size for CMM, PCP, and PCS and for different numbers of taxa k

We also compared the power and the size of these methods with n = 100 and k = 200. In this setting, we fixed r = 1 and estimated the total mediation effects and their bootstrap CIs. Based on 1000 and 500 simulations for the size and the power, all the methods control type I errors (CMM = 0.00, PCP = 0.01, and PCS = 0.01), but similar to the results with smaller k, CMM had a much higher power compared to the other methods (CMM = 0.73, PCP = 0.03, and PCS = 0.03).

3.2 Simulation study II: real microbiome data

To make a simulation setting more realistic, we used the composition of taxa in a real dataset, referred to as the ‘COMBO’ data (Wu et al., 2011), which was analyzed in Section 5.1. We first randomly permuted T_i and Y_i to measure the empirical size at $α = 0.05$ ⁠. For the power, we randomly generated T_i from N(0, 1) and estimated a with the Dirichlet regression (Maier, 2014). We then located the two largest and two smallest values of a and set $b_{j} = 0.5$ if $j \in {a_{(k)}, a_{(k - 1)}}, b_{j} = - 0.5$ if $j \in {a_{(1)}, a_{(2)}}$ ⁠, and b_j = 0 otherwise, where the subscript $(j)$ indicates the jth order. The direct effect c was set to 1, and Y_i was generated by the probit regression model (2). The estimated ${(log a)}^{⊤} b$ in this setting was 0.29 ± 0.14. As PCP had a slightly better performance than PCS, we included only PCP in comparison.

As shown in Table 1, both PCP and CMM roughly control type I errors and have comparable powers for the direct effect. However, PCP has very low power to detect the total indirect effect, which is similar to the results in Section 3.1.

Table 1.

Power and size in testing DE and IDE: n = 96 and k = 45

	Power		Size
$α = 0.05$	DE	IDE	DE	IDE
CMM	0.900	0.942	0.066	0.004
PCP	0.810	0.222	0.051	0.001

	Power		Size
$α = 0.05$	DE	IDE	DE	IDE
CMM	0.900	0.942	0.066	0.004
PCP	0.810	0.222	0.051	0.001

The 1000 and 500 simulations were used for size and power, respectively.

Open in new tab

Table 1.

Power and size in testing DE and IDE: n = 96 and k = 45

	Power		Size
$α = 0.05$	DE	IDE	DE	IDE
CMM	0.900	0.942	0.066	0.004
PCP	0.810	0.222	0.051	0.001

	Power		Size
$α = 0.05$	DE	IDE	DE	IDE
CMM	0.900	0.942	0.066	0.004
PCP	0.810	0.222	0.051	0.001

The 1000 and 500 simulations were used for size and power, respectively.

Open in new tab

3.3 Real data analysis: COMBO data

We applied CMM to the COMBO data, which consists of 16S rRNA gene sequences from fecal samples of 96 healthy individuals. It also contains demographic and clinical information including fat intake and BMI. Operational taxonomic units (OTUs) were summarized at the genus level, and the genera that appear in smaller than 10% of the samples were excluded, leaving 45 genera in 96 samples for analysis. Because of the compositional nature, the OTU counts assigned to the genera were transformed into proportions after adding a small number (0.5) to avoid the log-transformation of zero proportions, which is a common practice in compositional data analysis (Aitchison, 1986).

We dichotomized BMI at 25, which is generally used to define being normal (BMI < 25) or overweight/obese (BMI $\geq$ 25), and tested if the total effect of fat intake on overweight/obesity was statistically significant. The total calorie intake was included in the model as a pretreatment covariate. The estimated total effect with a probit model (i.e. $Y_{i} = 1 {{\tilde{c}}_{0} + \tilde{c} T_{i} + \tilde{g} X_{i} + U_{0 i} > 0}$ ⁠) was 0.122 with a 95% bootstrap CI of (0.017, 0.247). In other words, fat intake has a positive effect on overweight/obesity. CMM was then applied to study a mechanism of the effect of fat intake on overweight/obesity, in which the 45 genera were included as the components of a compositional mediator. The estimated direct effect was 0.018 with a CI of (−0.003, 0.073) and the estimated indirect effect was 0.030 with a CI of (0.000, 0.113), indicating positive mediation effects of fat intake on overweight/obesity.

To estimate component-wise mediation effects, we need to know the distribution of $log {(U_{1 i})}_{j}$ for $j = 1, \dots, k$ ⁠; however, it is not attainable even though we know a distribution of $alr {(U_{1 i})}_{j}$ for $j = 1, \dots, k - 1$ ⁠. Thus, we assessed the product of path coefficients instead to identify potential component-wise mediation effects, as it is directly related to component-wise mediation effect. The genus Oscillibacter was identified as a potential mediator: its estimated product of path coefficients was 0.062 with a 95% bootstrap CI of (0.002, 0.185). In previous studies, Oscillibacter-like organisms have been identified as a potentially important gut microbe that mediates high fat-induced gut dysfunction and permeability, and it has been shown that a decrease of Oscillibacter led to an increase in gut permeability, which was shown to be associated with obesity (Lam et al., 2012; Teixeira et al., 2012). The estimated products of path coefficients for other components and their 95% bootstrap CIs are shown in Figure 4.

Fig. 4.

Open in new tab Download slide

Estimated products of path coefficients closely related to the component-wise mediation effects of fat intake on obesity. The bootstrap CIs for Eggerthella, Butyricimonas, Coprococcus, Acidaminococcus, Allisonella, Dialister, and Veillonella are $(- 0.069, 0.035), (- 0.059, 0.061), (- 0.066, 0.046), (- 0.023, 0.157), (- 0.018, 0.216), (- 0.068, 0.022)$ ⁠, and $(- 0.079, 0.072)$ ⁠, respectively

Since only Oscillibacter was identified as a potentially significant mediator, we included another genus to quantify the sensitivity of the assumption of the no unmeasured confounding effects. Note that CMM takes a compositional mediator so the number of components (mediators) must be greater than one. Figure 5 presents the result of the sensitivity analysis. The estimated mediation effect through Oscillibacter and Allisonella at ρ = 0 was 0.026 with a 95% bootstrap CI of (0.006, 0.043). For $- 0.11 \leq ρ \leq 1$ ⁠, the sign and significance of the estimated mediation effect remained unchanged. The 95% bootstrap CI covered the value of zero only when $- 0.49 \leq ρ \leq - 0.12$ ⁠.

Fig. 5.

Open in new tab Download slide

Sensitivity analysis. The dashed line indicates the estimated mediation effect for ρ = 0. The solid line represents the estimated mediation effect at each value of ρ, and the gray areas represent the 95% bootstrap CI for the mediation effects

4 Discussion

In this study, we propose a sparse compositional mediation model for binary outcomes. To account for the characteristics of compositional data, we adopt the staying-in-the-simplex approach to jointly estimate the effect of a treatment on all the components of a compositional mediator; and we use an L1-penalized log-contrast regression model to estimate the effects of treatment and the components of a compositional mediator on binary outcomes. We demonstrated that CMM performs better than the methods based on principal component approaches in simulation studies. CMM also provides which components (taxa) could be potential drivers of mediation effects, which cannot be obtained directly by the principal component-based approach. Applying CMM to the COMBO data, we found a significant positive mediation effect of the gut microbiome in linking fat intake and overweight/obesity.

CMM, like other causal mediation models, requires assumptions to identify the direct and indirect effects. These assumptions are generally not verifiable with observational data. However, the assumption that treatment assignment is ignorable given observed pretreatment covariates is usually attained in a subgroup having similar characteristics. The no-confounding effects assumption between mediators and an outcome is often taken for granted after the observed pretreatment covariates are adjusted, and its sensitivity to unmeasured confounding effects is often measured. We allow pretreatment covariates in modeling CMM and provide a method for sensitivity analysis.

For the rare outcome case, the natural direct and indirect effects can be defined in log odds ratios, assuming $U_{2 i}$ follows a logistic distribution in Model (2); and their estimates can be approximated by c and ${(log a)}^{⊤} b$ ⁠, respectively. However, the logit model is computationally more intensive than the probit model for general cases in estimating the mediation effect. CMM was developed mainly for the general outcome case, which is more common in microbiome studies. So, CMM may not be an optimal method for the rare outcome. CMM uses a non-parametric bootstrap approach to testing the direct and indirect effects that involve the debiasing procedure, so it requires substantial computation time. For instance, it took 9 h and 29 min to run CMM with 100 samples and 200 components on a MacBook Pro with 2.0 GHz quad-core Intel Core i5. It would take longer if sensitivity analysis were also performed. We recommend sensitivity analysis be performed with a subset, as we did in the analysis of the COMBO data in Section 3.3.

The proposed method can be extended to multi-categorical treatments by utilizing indicator coding. However, extending CMM to multi-categorical outcomes or count outcomes is not trivial. These extensions are interesting future research topics. Another interesting and urgent extension of CMM is for longitudinal data, which has become increasingly common in clinical microbiome studies.

Acknowledgements

The authors would like to thank the reviewers for reviewing and suggesting valuable improvements to this work.

Funding

This work has been partially supported by the startup fund from the University of Rochester Medical Center.

Conflict of Interest: none declared.

References

Aitchison

J.

,

Bacon-Shone

J.

(

1984

)

Log contrast models for experiments with mixtures

.

Biometrika

,

71

,

323

–

330

.

Google Scholar

Crossref

WorldCat

Aitchison

J.

(

1986

)

The Statistical Analysis of Compositional Data

.

New York, NY

:

Chapman & Hall

.

Baron

R.M.

,

Kenny

D.A.

(

1986

)

The moderator-mediator variable distinction in social psychological research: conceptual, strategic, and statistical considerations

.

J. Pers. Soc. Psychol

.,

51

,

1173

–

1182

.

Billheimer

D.

et al. (

2001

)

Statistical interpretation of species composition

.

J. Am. Stat. Assoc

.,

96

,

1205

–

1214

.

Google Scholar

Crossref

WorldCat

Chén

O.Y.

et al. (

2015

) High-dimensional multivariate mediation with application to neuroimaging data. arXiv:1511.09354.

Efron

B.

,

Tibshirani

R.

(

1994

)

An Introduction to the Bootstrap

. Boca Raton, FL:

Chapman & Hall / CRC.

Green

P.J.

(

1984

)

Iteratively reweighted least squares for maximum likelihood estimation, and some robust and resistant alternatives

.

J. R. Stat. Soc. Ser. B

,

46

,

149

–

192

.

Google Scholar

OpenURL Placeholder Text

WorldCat

Heintz-Buschart

A.

,

Wilmes

P.

(

2018

)

Human gut microbiome: function matters

.

Trends Microbiol

.,

26

,

563

–

574

.

Honda

K.

,

Littman

D.R.

(

2016

)

The microbiota in adaptive immune homeostasis and disease

.

Nature

,

535

,

75

–

84

.

Huang

Y.T.

,

Pan

W.C.

(

2016

)

Hypothesis test of mediation effect in causal mediation model with high-dimensional continuous mediators

.

Biometrics

,

72

,

402

–

413

.

Imai

K.

et al. (

2010

)

Identification, inference and sensitivity analysis for causal mediation effects

.

Stat. Sci

.,

25

,

51

–

71

.

Google Scholar

Crossref

WorldCat

Imai

K.

et al. (

2010

)

A general approach to causal mediation analysis

.

Psychol. Methods

,

15

,

309

–

334

.

Imai

K.

,

Yamamoto

T.

(

2013

)

Identification and sensitivity analysis for multiple causal mechanisms: revisiting evidence from framing experiments

.

Polit. Anal

.,

21

,

141

–

171

.

Google Scholar

Crossref

WorldCat

Imbens

G.

,

Rubin

D.

(

2015

)

Causal Inference for Statistics, Social, and Biomedical Sciences: An Introduction

.

Cambridge

:

Cambridge University Press

.

Kurilshikov

A.

et al. (

2017

)

Host genetics and gut microbiome: challenges and perspectives

.

Trends Immunol

.,

38

,

633

–

647

.

Lam

Y.Y.

et al. (

2012

)

Increased gut permeability and microbiota change associate with mesenteric fat inflammation and metabolic dysfunction in diet-induced obese mice

.

PLoS ONE

,

7

,

e34233

.

Lee

S.

et al. (

2006

) Efficient L1 regularized logistic regression. AAAI-06.

Lewis

J.D.

et al. (

2015

)

Inflammation, antibiotics, and diet as environmental stressors of the gut microbiome in pediatric Crohn’s disease

.

Cell Host Microbe

,

18

,

489

–

500

.

Li

H.

(

2015

)

Microbiome, metagenomics, and high-dimensional compositional data analysis

.

Annu. Rev. Stat. Appl

.,

2

,

73

–

94

.

Google Scholar

Crossref

WorldCat

Lin

W.

et al. (

2014

)

Variable selection in regression with compositional covariates

.

Biometrika

,

101

,

785

–

797

.

Google Scholar

Crossref

WorldCat

Lu

J.

et al. (

2019

)

Generalized linear models with linear constraints for microbiome compositional data

.

Biometrics

,

75

,

235

–

244

.

MacKinnon

D.P.

et al. (

2002

)

A comparison of methods to test mediation and other intervening variable effects

.

Psychol. Methods

,

7

,

83

–

104

.

Maier

M.J.

(

2014

) DirichletReg: Dirichlet regression for compositional data in R. Research Report Series/Department of Statistics and Mathematics, 125. WU Vienna University of Economics and Business, Vienna.

Pearl

J.

(

2001

) Direct and indirect effects. In

Proceedings of the Seventeenth Conference on Uncertainty and Artificial Intelligence

, pp.

411

–

420

.

San Francisco, CA

:

Morgan Kaufmann

.

Google Scholar

Google Preview

OpenURL Placeholder Text

WorldCat

Rubin

D.B.

(

2005

)

Causal inference using potential outcomes

.

J. Am. Stat. Assoc

.,

100

,

322

–

331

.

Google Scholar

Crossref

WorldCat

Shi

P.

et al. (

2016

)

Regression analysis for microbiome compositional data

.

Ann. Appl. Stat

.,

10

,

1019

–

1040

.

Google Scholar

OpenURL Placeholder Text

WorldCat

Sohn

M.B.

,

Li

H.

(

2019

)

Compositional mediation analysis for microbiome studies

.

Ann. Appl. Stat

.,

13

,

661

–

681

.

Google Scholar

Crossref

WorldCat

Teixeira

T.F.

et al. (

2012

)

Potential mechanisms for the emerging link between obesity and increased intestinal permeability

.

Nutr. Res

.,

32

,

637

–

647

.

Thaiss

C.A.

et al. (

2016

)

The microbiome and innate immunity

.

Nature

,

535

,

65

–

74

.

VanderWeele

T.J.

,

Vansteelandt

S.

(

2010

)

Odds ratios for mediation analysis for a dichotomous outcome

.

Am. J. Epidemiol

.,

172

,

1339

–

1348

.

VanderWeele

T.J.

,

Vansteelandt

S.

(

2014

)

Mediation analysis with multiple mediators

.

Epidemiol. Method

,

2

,

95

–

115

.

Google Scholar

Crossref

WorldCat

Wang

C.

et al. (

2020

)

Estimating and testing the microbial causal mediation effect with high-dimensional and compositional microbiome data

.

Bioinformatics

,

36

,

347

–

355

.

Wu

G.

et al. (

2011

)

Linking long-term dietary patterns with gut microbial enterotypes

.

Science

,

334

,

105

–

108

.

Zhang

H.

et al. (

2021

)

Mediation effect selection in high-dimensional and compositional microbiome data

.

Stat. Med

.,

40

,

885

–

896

.

Zhao

Y.

,

Luo

X.

(

2016

) Pathway Lasso: Estimate and select sparse mediation pathways with high dimensional mediators. arXiv:1603.07749.

This article is published and distributed under the terms of the Oxford University Press, Standard Journals Publication Model (https://academic.oup.com/journals/pages/open_access/funder_policies/chorus/standard_publication_model)

Associate Editor:

Download all slides

Month:	Total Views:
August 2021	47
September 2021	52
October 2021	41
November 2021	30
December 2021	66
January 2022	86
February 2022	53
March 2022	65
April 2022	58
May 2022	69
June 2022	37
July 2022	26
August 2022	18
September 2022	26
October 2022	18
November 2022	48
December 2022	26
January 2023	13
February 2023	38
March 2023	84
April 2023	48
May 2023	37
June 2023	41
July 2023	48
August 2023	63
September 2023	57
October 2023	74
November 2023	51
December 2023	63
January 2024	103
February 2024	52
March 2024	56
April 2024	58

Article Contents

A compositional mediation model for a binary outcome: Application to microbiome studies

Abstract

1 Introduction

2 Materials and methods

2.1 Algebraic operators in simplex space

2.2 Compositional mediation model for binary outcomes

2.3 Model assumptions and identification

2.4 Estimation of composition parameters

2.5 Estimation of regression parameters

2.6 Debiasing procedure and its asymptotic convergence

2.7 Hypothesis test of mediation effect

2.8 Sensitivity analysis

3 Results

3.1 Simulation study I: synthetic data

3.2 Simulation study II: real microbiome data

3.3 Real data analysis: COMBO data

4 Discussion

Acknowledgements

Funding

References

Supplementary data

Citations

Views

Altmetric

Email alerts

Citing articles via

Latest

Most Read

Most Cited

Looking for your next opportunity?

Article Contents

A compositional mediation model for a binary outcome: Application to microbiome studies

Abstract

1 Introduction

2 Materials and methods

2.1 Algebraic operators in simplex space

2.2 Compositional mediation model for binary outcomes

2.3 Model assumptions and identification

2.4 Estimation of composition parameters

2.5 Estimation of regression parameters

2.6 Debiasing procedure and its asymptotic convergence

2.7 Hypothesis test of mediation effect

2.8 Sensitivity analysis

3 Results

3.1 Simulation study I: synthetic data

3.2 Simulation study II: real microbiome data

3.3 Real data analysis: COMBO data

4 Discussion

Acknowledgements

Funding

References

Supplementary data

Citations

Views

Altmetric

Email alerts

Citing articles via

Latest

Most Read

Most Cited

Looking for your next opportunity?

This Feature Is Available To Subscribers Only