Abstract

An important problem within both epidemiology and many social sciences is to break down the effect of a given treatment into different causal pathways and to quantify the importance of each pathway. Formal mediation analysis based on counterfactuals is a key tool when addressing this problem. During the last decade, the theoretical framework for mediation analysis has been greatly extended to enable the use of arbitrary statistical models for outcome and mediator. However, the researcher attempting to use these techniques in practice will often find implementation a daunting task, as it tends to require special statistical programming. In this paper, the authors introduce a simple procedure based on marginal structural models that directly parameterize the natural direct and indirect effects of interest. The procedure tends to produce more parsimonious results than current techniques, greatly simplifies testing for the presence of a direct or an indirect effect, and has the advantage that it can be conducted in standard software. However, its simplicity comes at the price of relying on correct specification of models for the distribution of mediator (and exposure) and accepting some loss of precision compared with more complex methods. The Supplementary Data contain implementation examples in SAS software (SAS Institute, Inc., Cary, North Carolina) and R language (R Foundation for Statistical Computing, Vienna, Austria).

Important questions within both epidemiology and social sciences often require moving beyond “simply” estimating the total effect of a given exposure and instead require breaking down the total effect into separate causal pathways. The foremost example of such a strategy, which we consider here, is the decomposition of total effect into an indirect effect mediated through a specific mediator and the remaining direct effect.

The standard approach, which is inspired by Baron and Kenny (1), involves estimating the direct effect as the residual association between outcome and exposure after regression adjustment for the mediator(s) and the indirect effect by subtracting this from the total effect (on an appropriate scale). It has been shown that this approach works in the special case of linear models without interactions but is fundamentally flawed otherwise (2–4). Building on the counterfactual framework (refer to the work by Pearl (5)), a formal approach to mediation analysis has now been developed. Using ideas of Robins and Greenland (6), Pearl (7) showed that a total effect can always be broken down into a so-called natural direct and indirect effect, regardless of the underlying statistical model.

Although much attention has been given to the development of identification conditions for natural direct and indirect effects (6–11), the researcher who wishes to estimate natural direct and indirect effects from actual data continues to face many challenges. This is because current procedures obtain natural direct and indirect effect estimates through a nontrivial combination of parameter estimates from a regression model for the mediator and a regression model for the outcome (4, 8, 10–15). The way to compute natural direct and indirect effects can therefore differ substantially between different types of mediator or outcome, and standard error calculations become even more tedious. Furthermore, even simple models for the mediator and outcome (e.g., a linear model for the mediator and a logistic regression model for the outcome) tend to produce complex expressions of natural direct and indirect effects. This can make results difficult to report (e.g., because these effects may turn out to depend on covariates in a complicated way). Moreover, it makes interesting hypotheses (e.g., for modification of the direct or indirect effect by covariates) essentially impossible to test, because it can be difficult to specify models for the outcome and mediator that satisfy the considered null hypothesis (e.g., that the natural (in)direct effect does not depend on covariates). All in all, this complexity places severe restrictions on the practical utility of current methods for computing natural direct and indirect effects for general epidemiologic research.

In this paper, we suggest a unified model for the direct and indirect effects with a corresponding simple estimation procedure. The estimation procedure is related to work by Hong (16) but generalized to a broad class of outcomes. The approach can, in principle, be used for any type of outcome (binary, continuous, survival, categorical, and so on) and any type of mediator, even when exposure-mediator interactions exist. Because it involves directly modeling the natural direct and indirect effects of interest, results become simpler to report, and interesting hypotheses concerning these effects become straightforward to test. The approach can be implemented in any software package capable of handling weighted modeling. However, its simplicity comes at the price of relying on correct specification of models for the distribution of mediator (and exposure) and of not exploiting all information in the data; more efficient estimators under the same model can thus be obtained. The Supplementary Data present detailed implementation examples in SAS software and R language.

DEFINITIONS AND ASSUMPTIONS

The results of this paper are based on the directed acyclic graph depicted in Figure 1, where A is the observed exposure of interest; M, the mediator; C, a set of baseline confounders; and Y, the outcome. Thus, it is assumed that there are no unmeasured confounders (confounders not included in C) for the exposure-outcome, exposure-mediator, or mediator-outcome relations. Variables are allowed to be of any type (e.g., continuous, binary, categorical, or survival). For each subject, we define the counterfactual variable $Y_{a,m}$ as the outcome we would, possibly contrary to the fact, have observed for that subject had the exposure A been set to the value a and the mediator M set to m. Similarly, the counterfactual variable $M_a$ denotes the value of the mediator if, possibly contrary to the fact, the exposure A was set to a.

Figure 1.

Directed acyclic graph of the causal structure assumed throughout the paper. Note that A is the exposure of interest; M, the mediator; C, a set of baseline confounders; and Y, the outcome.


Following the tradition in the causal inference literature (9), we will describe direct and indirect effects in terms of so-called nested counterfactuals, $Y_{a,M_{a^*}}$, denoting the outcome that would have been observed if A were set to a and M were set to the value it would have taken if A were set to a*. In particular, we will compare $Y_{a,M_{a^*}}$ with $Y_{a^*,M_{a^*}}$ to obtain a measure of the natural direct effect of changing the exposure from a* to a. Such comparison can, for instance, be made in terms of an average difference within levels of covariates, $E[Y_{a,M_{a^*}} - Y_{a^*,M_{a^*}} \mid C]$, or marginally, $E[Y_{a,M_{a^*}} - Y_{a^*,M_{a^*}}]$; as a risk ratio, $P(Y_{a,M_{a^*}} = 1)/P(Y_{a^*,M_{a^*}} = 1)$, and so on. Likewise, we will compare $Y_{a,M_a}$ with $Y_{a,M_{a^*}}$ to obtain a measure of the natural indirect effect. The word “natural” refers to the fact that we let the mediator take the value it would take naturally when the exposure is set to a*.

In this paper, as in most of the work on causal mediation analysis, we will discuss the estimation of natural direct and indirect effects under the no-unmeasured confounding assumption, implicit in the causal diagram of Figure 1, that the same set of covariates C is sufficient to control for confounding of the associations between exposure and outcome, exposure and mediator, and mediator and outcome. In particular, we thus assume that there are no variables L that are effects of exposure and that confound the mediator-outcome relation. A formal description of these assumptions is, for instance, given in the report by VanderWeele and Vansteelandt (4).

COUNTERFACTUAL-BASED MEDIATION ANALYSIS

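The traditional approach to estimating natural direct and indirect effects uses the mediation formula (17) to calculate $E[Y_{a,M_{a^*}} \mid C]$ as
$$E[Y_{a,M_{a^*}} \mid C] = \sum_m E[Y \mid A = a, M = m, C]\, P(M = m \mid A = a^*, C). \tag{1}$$
This corresponds to estimating the mean value of the outcome in each stratum defined by mediator and confounders among the individuals with treatment a but weighting these by the probability of each mediator value among individuals with treatment a*. Likewise, $E[Y_{a,M_{a^*}}]$ can be calculated as
$$E[Y_{a,M_{a^*}}] = \sum_c \sum_m E[Y \mid A = a, M = m, C = c]\, P(M = m \mid A = a^*, C = c)\, P(C = c). \tag{2}$$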

When C is high dimensional, it will be necessary to use parametric models for the outcome mean and mediator distribution. If the outcome Y and mediator M are modeled by a linear model, that is, $E[Y \mid A = a, M = m, C = c] = \alpha_0 + \alpha_1 a + \alpha_2 m + \alpha_3 c$ and $E[M \mid A = a, C = c] = \beta_0 + \beta_1 a + \beta_2 c$, equation 2 simplifies greatly, and the natural direct and indirect effects are captured by $\alpha_1(a - a^*)$ and $\alpha_2\beta_1(a - a^*)$, respectively. However, equation 2 is less suited to outcomes modeled by a nonlinear model, for example, binary outcomes modeled by a logistic regression or survival outcomes, because the resulting expressions for the natural direct and indirect effects easily become complicated (e.g., they may depend on the values of the confounders in a complicated way; refer to the articles by VanderWeele and Vansteelandt (13) and by VanderWeele (15)).
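The linear-model simplification can be traced in a few lines. The following is a sketch under assumed linear models without exposure-mediator interaction, $E[Y \mid A = a, M = m, C = c] = \alpha_0 + \alpha_1 a + \alpha_2 m + \alpha_3 c$ and $E[M \mid A = a, C = c] = \beta_0 + \beta_1 a + \beta_2 c$, where the intercepts and the covariate coefficients $\alpha_3$ and $\beta_2$ are labels assumed here for illustration. Here $Y_{a,M_{a^*}}$ denotes the outcome under exposure a with the mediator at its natural level under a*.

```latex
\begin{align*}
E[Y_{a,M_{a^*}} \mid C = c]
  &= \sum_m (\alpha_0 + \alpha_1 a + \alpha_2 m + \alpha_3 c)\, P(M = m \mid A = a^*, C = c) \\
  &= \alpha_0 + \alpha_1 a + \alpha_2\, E[M \mid A = a^*, C = c] + \alpha_3 c \\
  &= \alpha_0 + \alpha_1 a + \alpha_2 (\beta_0 + \beta_1 a^* + \beta_2 c) + \alpha_3 c, \\[4pt]
E[Y_{a,M_{a^*}} - Y_{a^*,M_{a^*}} \mid C = c] &= \alpha_1 (a - a^*), \\
E[Y_{a,M_a} - Y_{a,M_{a^*}} \mid C = c] &= \alpha_2 \beta_1 (a - a^*).
\end{align*}
```

Neither contrast depends on c, so under these assumed models the same expressions also hold after averaging over the distribution of C.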

ESTIMATING NATURAL EFFECTS BY MARGINAL STRUCTURAL MODELS

Marginal structural models (MSMs) are models for the marginal expectation (or distribution) of a counterfactual outcome (18). They have become popular for nonnested counterfactuals such as $Y_a$. For instance, the total causal effect of the exposure A on the outcome Y can be modeled in terms of an MSM of the form $E[Y_a] = b_0 + b_1 a$, where $b_1$ then captures the average causal effect of the exposure. In contrast, MSMs for nested counterfactuals such as $Y_{a,M_{a^*}}$ have received very little attention, with few exceptions (19, 20). Such models are nonetheless of interest as they enable simultaneous and parsimonious modeling of the natural direct effect of the exposure A on the outcome Y (other than through the mediator M) and the natural indirect effect (through M), as, for example,
$$E[Y_{a,M_{a^*}}] = c_0 + c_1 a + c_2 a^*. \tag{3}$$

Here, we have included the exposure twice to emphasize that it works through 2 distinct causal pathways. It is now easy to infer that $c_1(a - a^*)$ captures the natural direct effect $E[Y_{a,M_{a^*}} - Y_{a^*,M_{a^*}}]$, that $c_2(a - a^*)$ captures the natural indirect effect $E[Y_{a,M_a} - Y_{a,M_{a^*}}]$, and that their sum measures the total effect $E[Y_{a,M_a} - Y_{a^*,M_{a^*}}] = E[Y_a - Y_{a^*}]$.

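Equation 3 is a special case of the more general class of generalized linear MSMs given by
$$g\left(E[Y_{a,M_{a^*}}]\right) = c_0 + c_1 a + c_2 a^* + c_3 a a^*, \tag{4}$$
where g is a link function specifying the requested model for the outcome (e.g., logistic model), and c3 is the coefficient of an interaction term, which can be included if required. When c3 = 0 and g is the logit link, then $\exp[c_1(a - a^*)]$ captures the natural direct effect odds ratio, $\mathrm{odds}(Y_{a,M_{a^*}} = 1)/\mathrm{odds}(Y_{a^*,M_{a^*}} = 1)$; $\exp[c_2(a - a^*)]$ captures the natural indirect effect odds ratio, $\mathrm{odds}(Y_{a,M_a} = 1)/\mathrm{odds}(Y_{a,M_{a^*}} = 1)$; and their product measures the total effect, $\mathrm{odds}(Y_{a,M_a} = 1)/\mathrm{odds}(Y_{a^*,M_{a^*}} = 1)$. Further, a value c3 differing from zero indicates that the magnitude of the direct effect may depend on the natural level at which the mediator is controlled and may thus be the result of an exposure-mediator interaction. Robins and Greenland (6) and Hafeman and Schwartz (9) proposed the terms pure and total natural (in)direct effects when such interactions are present. Thus, when c3 is nonzero and g is the logit link, the natural effects of changing exposure from a* to a are given by the following:
$$\mathrm{OR}^{\text{pure direct}} = \frac{\mathrm{odds}(Y_{a,M_{a^*}} = 1)}{\mathrm{odds}(Y_{a^*,M_{a^*}} = 1)} = \exp[(c_1 + c_3 a^*)(a - a^*)],$$
$$\mathrm{OR}^{\text{total direct}} = \frac{\mathrm{odds}(Y_{a,M_a} = 1)}{\mathrm{odds}(Y_{a^*,M_a} = 1)} = \exp[(c_1 + c_3 a)(a - a^*)],$$
$$\mathrm{OR}^{\text{pure indirect}} = \frac{\mathrm{odds}(Y_{a^*,M_a} = 1)}{\mathrm{odds}(Y_{a^*,M_{a^*}} = 1)} = \exp[(c_2 + c_3 a^*)(a - a^*)],$$
$$\mathrm{OR}^{\text{total indirect}} = \frac{\mathrm{odds}(Y_{a,M_a} = 1)}{\mathrm{odds}(Y_{a,M_{a^*}} = 1)} = \exp[(c_2 + c_3 a)(a - a^*)].$$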
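When, instead, exposure-covariate interactions are of interest, then the above generalized linear MSMs can be phrased conditional on covariates as
$$g\left(E[Y_{a,M_{a^*}} \mid C]\right) = c_0 + c_1 a + c_2 a^* + c_3 a a^* + c_4 C + c_5 a C + c_6 a^* C. \tag{5}$$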

Here, c5 captures the extent to which the direct effect is modified by covariates, and c6 captures the extent to which the indirect effect is modified by covariates.
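As a numerical illustration of the logit MSM with an interaction term, the sketch below assumes the model has the form logit E[Y_{a,M_{a*}}] = c0 + c1 a + c2 a* + c3 a a* and uses the naming convention that "pure" effects hold the other pathway at its a* level while "total" effects hold it at its a level. The function name and the coefficient values are hypothetical and chosen only for illustration.

```python
import math

def natural_effect_odds_ratios(c1, c2, c3, a, a_star):
    """Odds ratios implied by logit E[Y_{a,M_{a*}}] = c0 + c1*a + c2*a* + c3*a*a*.

    Each entry is exp(difference in linear predictors) for the stated contrast.
    """
    d = a - a_star
    return {
        # (a, a*) versus (a*, a*): mediator held at its level under a*
        "pure_direct": math.exp((c1 + c3 * a_star) * d),
        # (a, a) versus (a*, a): mediator held at its level under a
        "total_direct": math.exp((c1 + c3 * a) * d),
        # (a*, a) versus (a*, a*): exposure on the direct path held at a*
        "pure_indirect": math.exp((c2 + c3 * a_star) * d),
        # (a, a) versus (a, a*): exposure on the direct path held at a
        "total_indirect": math.exp((c2 + c3 * a) * d),
        # (a, a) versus (a*, a*)
        "total": math.exp((c1 + c2) * d + c3 * (a * a - a_star * a_star)),
    }

# Hypothetical coefficients, not estimates from any data set.
effects = natural_effect_odds_ratios(c1=0.4, c2=0.2, c3=0.1, a=1, a_star=0)
```

On the odds ratio scale, the total effect factorizes as pure direct times total indirect, and equally as total direct times pure indirect; with c3 = 0 the pure and total versions coincide.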

The class of generalized linear MSMs encompasses a wide range of models, but the Cox and Aalen models (21), which are important models for survival data, are not included in the class. Cox and Aalen models assume that the hazard function corresponding to the counterfactual survival time, $Y_{a,M_{a^*}}$, can be expressed as
$$\lambda_{a,a^*}(t) = \lambda_0(t)\exp(c_1 a + c_2 a^* + c_3 a a^*) \tag{6}$$
and
$$\gamma_{a,a^*}(t) = \gamma_0(t) + c_1 a + c_2 a^* + c_3 a a^*, \tag{7}$$
where λ0(t) and γ0(t) are unspecified baseline hazards. Whenever the outcome is a survival time, we will in addition assume that censoring satisfies the usual assumptions, that is, that censoring is independent of event time (22) conditional on the covariates in the MSM. The rest of the paper is devoted to estimating MSMs that can be written as in equations 4–7.
We propose to estimate the MSMs given by equations 4–7 by the following procedure, which generalizes a proposal by Hong (16) to models for the nested counterfactuals $Y_{a,M_{a^*}}$, corresponding to a wide variety of outcome types. We explain it first for a dichotomous exposure A. Construct a new data set by repeating each observation in the original data set twice and including an additional variable A* capturing the 2 possible values of the exposure relative to the indirect path. For the first replication of the observation, A* is set to the actual value of the exposure (that is, $A^* = A$), while for the second replication, A* is set to the opposite of the actual exposure (i.e., $A^* = 1 - A$ when A is coded to be 0 or 1). The MSM given by equations 4–7 can now be estimated by using standard software by regressing the outcome on the observed exposure A and the additional variable A* on the basis of the new data set, weighting each observation in the expanded data set with
$$W_i = \frac{1}{P(A = A_i \mid C = C_i)} \cdot \frac{P(M = M_i \mid A = A_i^*, C = C_i)}{P(M = M_i \mid A = A_i, C = C_i)}.$$

The first fraction in these weights ensures that the exposure-outcome association is adjusted for confounding by C. Indeed, the impact of up-weighting observations with a rare combination of exposures and confounders is to create a pseudo-population in which the exposure is no longer associated with C and, thus, there is no residual confounding by C (i.e., mimicking a randomized trial). The second fraction of the weights serves to distinguish between the direct and indirect paths (by correcting for the fact that the observed mediator value may differ from the counterfactual value $M_{a^*}$ that is of interest).

The above approach can easily be implemented in standard software by using the following simple estimation procedure for a dichotomous exposure, which apart from step 3 is completely analogous to estimation of standard MSMs:

  1. Estimate a suitable model for the exposure conditional on confounders by using the original data set.

  2. Estimate a suitable model for the mediator conditional on exposure and baseline variables by using the original data set.

  3. Construct a new data set by repeating each observation in the original data set twice and including an additional variable A*, which is equal to the original exposure for the first replication and equal to the opposite of the actual exposure for the second replication. In addition, add an identification variable to indicate which data rows originate from the same subject.

  4. Compute weights by applying the fitted models from steps 1 and 2 to the new data set. In most software packages, this can be done by using “predict-functionality.”

  5. Fit a suitable model to the outcome including only A and A* (and perhaps their interaction) as covariates and weighted by the weights from the previous step. It can be shown (Supplementary Data), provided that the exposure and mediator models in steps 1 and 2 are fitted by using a standard maximum likelihood procedure and provided that the mediator model is sufficiently rich so as not to contradict the restrictions imposed by the chosen generalized linear MSM, that conservative confidence intervals can be obtained as the estimate of the natural direct or indirect effect plus/minus 1.96 times a robust standard error, which can be obtained by using software for generalized estimating equations; alternatively, a bootstrap procedure can be used.
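The five steps above can be sketched on a toy data set with binary A, M, and C. This is a minimal illustration under stated assumptions, not the implementation from the Supplementary Data: empirical frequencies stand in for the fitted exposure and mediator models of steps 1 and 2, and weighted cell means stand in for a weighted outcome regression with a saturated MSM in A and A* (step 5). All function names and the data are hypothetical, and no standard errors are computed.

```python
from collections import Counter, defaultdict

def fit_frequency_models(rows):
    """Steps 1-2 (stand-in): empirical P(A=a | C=c) and P(M=m | A=a, C=c)."""
    n_c = Counter(r["C"] for r in rows)
    n_ac = Counter((r["A"], r["C"]) for r in rows)
    n_mac = Counter((r["M"], r["A"], r["C"]) for r in rows)

    def p_exposure(a, c):
        return n_ac[(a, c)] / n_c[c]

    def p_mediator(m, a, c):
        return n_mac[(m, a, c)] / n_ac[(a, c)]

    return p_exposure, p_mediator

def expand_and_weight(rows, p_exposure, p_mediator):
    """Steps 3-4: two copies per subject, adding A*, a subject id, and weight W."""
    expanded = []
    for subject_id, r in enumerate(rows):
        a, m, c = r["A"], r["M"], r["C"]
        for a_star in (a, 1 - a):  # first copy: A* = A; second copy: A* = 1 - A
            w = (1.0 / p_exposure(a, c)) * p_mediator(m, a_star, c) / p_mediator(m, a, c)
            expanded.append({"id": subject_id, "A": a, "A_star": a_star,
                             "M": m, "C": c, "Y": r["Y"], "W": w})
    return expanded

def weighted_cell_means(expanded):
    """Step 5 (saturated MSM stand-in): weighted mean of Y in each (A, A*) cell."""
    num, den = defaultdict(float), defaultdict(float)
    for r in expanded:
        key = (r["A"], r["A_star"])
        num[key] += r["W"] * r["Y"]
        den[key] += r["W"]
    return {key: num[key] / den[key] for key in num}

# Hypothetical toy data as (A, M, C, Y, count); every (A, M, C) cell is non-empty.
CELLS = [
    (0, 0, 0, 0, 3), (0, 0, 0, 1, 1), (0, 1, 0, 0, 1), (0, 1, 0, 1, 1),
    (1, 0, 0, 0, 1), (1, 0, 0, 1, 1), (1, 1, 0, 0, 1), (1, 1, 0, 1, 3),
    (0, 0, 1, 0, 2), (0, 0, 1, 1, 2), (0, 1, 1, 1, 2),
    (1, 0, 1, 0, 1), (1, 0, 1, 1, 1), (1, 1, 1, 0, 1), (1, 1, 1, 1, 4),
]
rows = [{"A": a, "M": m, "C": c, "Y": y} for a, m, c, y, n in CELLS for _ in range(n)]

p_exposure, p_mediator = fit_frequency_models(rows)
expanded = expand_and_weight(rows, p_exposure, p_mediator)
means = weighted_cell_means(expanded)  # means[(a, a_star)] estimates E[Y_{a, M_{a*}}]

nde = means[(1, 0)] - means[(0, 0)]  # natural direct effect on the difference scale
nie = means[(1, 1)] - means[(1, 0)]  # natural indirect effect on the difference scale
```

In an applied analysis the cell means would be replaced by a weighted regression of Y on A and A* in standard software, with robust standard errors clustered on the subject identifier or a bootstrap, as described in step 5.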

It is well established in the literature on MSMs (refer to the article by Robins et al. (18)) that estimators based on inverse probability weights like $1/P(A = A_i \mid C = C_i)$ can be unstable in samples of small to moderate size, as the weights can become so large that individual observations dominate the estimation. Unless the MSM is saturated, somewhat better behaving estimators may be obtained by instead using stabilized weights given by
$$W_i^{s} = \frac{P(A = A_i)}{P(A = A_i \mid C = C_i)} \cdot \frac{P(M = M_i \mid A = A_i^*, C = C_i)}{P(M = M_i \mid A = A_i, C = C_i)}.$$
When the MSM includes the covariates C (compare with equation 5), then the following stabilized weights can also be used:
$$W_i^{c} = \frac{P(M = M_i \mid A = A_i^*, C = C_i)}{P(M = M_i \mid A = A_i, C = C_i)}.$$

Note that these do not involve inverse probability weighting by the exposure distribution, because the adjustment for confounding by C now happens via a standard regression adjustment. These weights will thereby typically be much more stable.
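The three weighting schemes can be compared side by side for a single row of the expanded data set. The sketch below assumes that the stabilized weights take the usual form with the marginal exposure probability P(A = A_i) in the numerator; the function name, argument names, and probability values are hypothetical.

```python
def mediation_weight(p_a_given_c, p_m_given_astar_c, p_m_given_a_c,
                     p_a_marginal=None, msm_conditions_on_c=False):
    """Weight for one row of the expanded data set.

    - default:                  1 / P(A|C) * mediator ratio   (unstabilized)
    - p_a_marginal given:       P(A) / P(A|C) * mediator ratio (stabilized)
    - msm_conditions_on_c=True: mediator ratio only (MSM conditional on C)
    """
    ratio = p_m_given_astar_c / p_m_given_a_c
    if msm_conditions_on_c:
        return ratio
    numerator = 1.0 if p_a_marginal is None else p_a_marginal
    return numerator / p_a_given_c * ratio

# Hypothetical subject with a rare exposure given covariates: P(A_i | C_i) = 0.05.
unstabilized = mediation_weight(0.05, 0.3, 0.6)
stabilized = mediation_weight(0.05, 0.3, 0.6, p_a_marginal=0.2)
conditional = mediation_weight(0.05, 0.3, 0.6, msm_conditions_on_c=True)
```

With these illustrative numbers the unstabilized weight is 10, the stabilized weight is 2, and the weight for the covariate-conditional MSM is 0.5, which shows how the inverse exposure probability drives the large weights.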

The above approach is very flexible because, unlike traditional approaches, it does not work indirectly by combining parameter estimates from standard models for the mediator M and outcome Y. In particular, it can, in principle, be used for any type of outcome, mediator, and exposure and regardless of the choice of MSM and the models for mediator and exposure. However, the approach may lend itself less ideally to the analysis of continuous mediators, because this requires substituting the probabilities $P(M = M_i \mid A = A_i, C = C_i)$ in the weights by probability densities, which in turn may yield unstable weights (refer to Supplementary Data). For categorical exposures A, a minor modification is needed in that one must repeat the original data set as many times as needed to ensure that, for each subject, A* takes on all the possible values it can take. For continuous exposures, we recommend fitting MSMs conditional on covariates, with corresponding stabilized weights $P(M = M_i \mid A = A_i^*, C = C_i)/P(M = M_i \mid A = A_i, C = C_i)$ to avoid instability due to inverse weighting by the exposure distribution. Here, the user is advised to follow the procedure prescribed for categorical exposures but to replace A* for subject i by randomly drawn exposures. This can be done either by resampling from the observed exposures or by drawing from a normal distribution with the mean and standard deviation matching the observed exposures. For continuous exposures, a minimum of 5 draws should be made for each original observation.
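For a continuous exposure, the expansion step might look as follows. This is a sketch assuming a binary mediator, an already fitted mediator model (here a hypothetical logistic model with made-up coefficients), A* drawn by resampling from the observed exposures, and weights given by the mediator probability ratio for an MSM conditional on covariates. The function and variable names are illustrative only.

```python
import math
import random

def expand_continuous_exposure(rows, mediator_prob, n_draws=5, seed=0):
    """Replicate each subject n_draws times with a resampled A* and attach
    the weight P(M | A*, C) / P(M | A, C)."""
    rng = random.Random(seed)
    observed = [r["A"] for r in rows]
    expanded = []
    for subject_id, r in enumerate(rows):
        den = mediator_prob(r["M"], r["A"], r["C"])
        for _ in range(n_draws):
            a_star = rng.choice(observed)  # draw A* from the observed exposures
            num = mediator_prob(r["M"], a_star, r["C"])
            expanded.append({**r, "id": subject_id, "A_star": a_star, "W": num / den})
    return expanded

# Hypothetical fitted mediator model: logit P(M=1 | A, C) = -0.5 + 0.8*A + 0.3*C.
def mediator_prob(m, a, c):
    p1 = 1.0 / (1.0 + math.exp(-(-0.5 + 0.8 * a + 0.3 * c)))
    return p1 if m == 1 else 1.0 - p1

# Hypothetical observations with a continuous exposure A.
rows = [
    {"A": 0.2, "M": 1, "C": 0, "Y": 1.1},
    {"A": 1.5, "M": 1, "C": 1, "Y": 2.3},
    {"A": 0.7, "M": 0, "C": 0, "Y": 1.6},
]
expanded = expand_continuous_exposure(rows, mediator_prob)
```

The expanded rows can then be passed to a weighted regression of Y on A, A*, and C, with the subject identifier used for robust standard errors. Drawing A* from a normal distribution matched to the observed exposures would only change the line that generates `a_star`.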

The Supplementary Data contain a mathematical validation of the procedure and of the robust standard errors. In addition, the Supplementary Data present implementations of this procedure (one with a binary outcome and one with a survival outcome) in both SAS software and R language.

DISCUSSION AND CONCLUSION

It should be noted that the proposed estimators do not exploit all available information in the data and, thus, that more efficient estimators can, in principle, be obtained. Furthermore, their correctness critically hinges on the correctness of the MSMs and the models used for the exposure and the mediator, except when the MSM is specified conditionally on covariates C, in which case correct specification of the exposure model is not required. If only the MSM is misspecified, then the resulting measures for natural effects can still be interpreted as a “best” approximation; however, this is no longer the case when the exposure or mediator models are misspecified. The current advice must therefore be to conduct a thorough misspecification analysis for the exposure and mediator models and to evaluate the stability of the weights. Tchetgen Tchetgen and Shpitser (20) and subsequently Zheng and van der Laan (23) proposed estimators that are efficient and multiply robust in the sense that they merely require 2 out of 3 models (the 3 models being the model for the exposure, the model for the mediator, and the model for the outcome) to be correct, regardless of which 2. However, implementation of these estimators is more demanding at present. Work is ongoing to develop alternative estimators that can also be obtained via standard software, are more efficient than the ones proposed in this paper, and also share multiple robustness properties.

The sensitivity of effect separation techniques to unmeasured confounders is an active area of research (11, 20, 24–26). The Supplementary Data present a simulation study assessing the sensitivity of the described approach to unmeasured confounders in a simple setup of only binary variables. As expected, the assumption of no unmeasured confounders is found to be critical, the only exception being unmeasured confounding of the exposure-mediator relation, which does not affect estimation of the direct effect.

In summary, this paper has described a simple unified procedure for estimating natural direct and indirect effects. The procedure can be applied to almost any combination of variable types and can be conducted in standard software. Supplementary Data of the paper provide detailed implementation examples in SAS software and R language.

ACKNOWLEDGMENTS

Author affiliations: Department of Biostatistics, University of Copenhagen, Copenhagen, Denmark (Theis Lange); and Department of Applied Mathematics and Computer Science, Ghent University, Ghent, Belgium (Stijn Vansteelandt, Maarten Bekaert).

T. L. was supported by the Commission of Social Inequality in Cancer (grant SU08004). S. V. was supported by the IAP research network (grant P06/03) from the Belgian government (Belgian Science Policy). M. B. acknowledges support from the Institute for the Promotion of Innovation through Science and Technology in Flanders (IWT vlaanderen).

Conflict of interest: none declared.

REFERENCES

1. Baron RM, Kenny DA. The moderator-mediator variable distinction in social psychological research: conceptual, strategic, and statistical considerations. J Pers Soc Psychol. 1986;51(6):1173-1182.
2. Cole SR, Hernán MA. Fallibility in estimating direct effects. Int J Epidemiol. 2002;31(1):163-165.
3. Kaufman JS, Maclehose RF, Kaufman S. A further critique of the analytic strategy of adjusting for covariates to identify biologic mediation. Epidemiol Perspect Innov. 2004;1(1):4. (doi:10.1186/1742-5573-1-4).
4. VanderWeele TJ, Vansteelandt S. Conceptual issues concerning mediation, interventions and composition. Statistics and Its Interface. Vol 2. Somerville, MA: International Press; 2009:457-468.
5. Pearl J. Causality: Models, Reasoning, and Inference. New York, NY: Cambridge University Press; 2009.
6. Robins JM, Greenland S. Identifiability and exchangeability for direct and indirect effects. Epidemiology. 1992;3(2):143-155.
7. Pearl J. Direct and indirect effects. Proceedings of the American Statistical Association Joint Statistical Meetings. Brentwood, MO: MIRA Digital Publishing; 2005:1572-1581. (Technical report R-273).
8. Petersen ML, Sinisi SE, van der Laan MJ. Estimation of direct causal effects. Epidemiology. 2006;17(3):276-284.
9. Hafeman DM, Schwartz S. Opening the Black Box: a motivation for the assessment of mediation. Int J Epidemiol. 2009;38(3):838-845.
10. Imai K, Keele L, Yamamoto T. Identification, inference and sensitivity analysis for causal mediation. Stat Sci. 2010;25(1):51-71.
11. Imai K, Keele L, Tingley D. A general approach to causal mediation analysis. Psychol Methods. 2010;15(4):309-334.
12. VanderWeele TJ. Marginal structural models for the estimation of direct and indirect effects. Epidemiology. 2009;20(1):18-26.
13. VanderWeele TJ, Vansteelandt S. Odds ratios for mediation analysis for a dichotomous outcome. Am J Epidemiol. 2010;172(12):1339-1348.
14. Lange T, Hansen JV. Direct and indirect effects in a survival context. Epidemiology. 2011;22(4):575-581.
15. VanderWeele TJ. Causal mediation analysis with survival data. Epidemiology. 2011;22(4):582-585.
16. Hong G. Ratio of mediator probability weighting for estimating natural direct and indirect effects. Proceedings of the American Statistical Association, Biometrics Section. Alexandria, VA: American Statistical Association; 2010:2401-2415.
17. Pearl J. The Mediation Formula: A Guide to the Assessment of Causal Pathways in Non-Linear Models. Los Angeles, CA: University of California; 2011. (Technical report R-363).
18. Robins JM, Hernán MA, Brumback B. Marginal structural models and causal inference in epidemiology. Epidemiology. 2000;11(5):550-560.
19. van der Laan MJ, Petersen ML. Direct effect models. Int J Biostat. 2008;4(1):1-27.
20. Tchetgen Tchetgen EJ, Shpitser I. Semiparametric estimation of models for natural direct and indirect effects. Berkeley, CA: bepress; 2011. (Harvard University Biostatistics Working Paper Series. Working Paper 129).
21. Aalen O, Klonecki W, Kozek A, Rosinski J, et al. A model for non-parametric regression analysis of counting processes. Lecture Notes in Statistics-2: Mathematical Statistics and Probability Theory. New York, NY: Springer-Verlag; 1980:1-25.
22. Martinussen T, Scheike TH. Dynamic Regression Models for Survival Data. New York, NY: Springer; 2006.
23. Zheng W, van der Laan MJ. Targeted maximum likelihood estimation of natural direct effect. Berkeley, CA: bepress; 2011. (UC Berkeley Division of Biostatistics Working Paper Series. Working Paper 288).
24. VanderWeele TJ. Bias formulas for sensitivity analysis for direct and indirect effects. Epidemiology. 2010;21(4):540-551.
25. Hafeman DM. Confounding of indirect effects: a sensitivity analysis exploring the range of bias due to a cause common to both the mediator and the outcome. Am J Epidemiol. 2011;174(6):710-717.
26. Christensen KB, Labriola M, Lund T. Explaining the social gradient in long-term sickness absence: a prospective study of Danish employees. J Epidemiol Community Health. 2008;62(2):181-183.

Author notes

Abbreviation: MSM, marginal structural model.
