S Yang, P Ding, Asymptotic inference of causal effects with observational studies trimmed by the estimated propensity scores, Biometrika, Volume 105, Issue 2, June 2018, Pages 487–493, https://doi.org/10.1093/biomet/asy008
SUMMARY
Causal inference with observational studies often relies on the assumptions of unconfoundedness and overlap of covariate distributions in different treatment groups. The overlap assumption is violated when some units have propensity scores close to |$0$| or |$1$|, so both practical and theoretical researchers suggest dropping units with extreme estimated propensity scores. However, existing trimming methods often do not incorporate the uncertainty in this design stage and restrict inference to only the trimmed sample, due to the nonsmoothness of the trimming. We propose a smooth weighting, which approximates sample trimming and has better asymptotic properties. An advantage of our estimator is its asymptotic linearity, which ensures that the bootstrap can be used to make inference for the target population, incorporating uncertainty arising from both design and analysis stages. We extend the theory to the average treatment effect on the treated, suggesting trimming samples with estimated propensity scores close to |$1$|.
1. Introduction
In the potential outcomes framework, there is an extensive literature on estimating causal effects based on the assumptions of unconfoundedness and overlap of the covariate distributions (Rosenbaum & Rubin, 1983; Angrist & Pischke, 2008; Imbens & Rubin, 2015). Unfortunately, it is common to have limited overlap in covariates between the treatment and control groups, which affects the credibility of all methods attempting to estimate causal effects for the population (King & Zeng, 2005; Imbens, 2015). Consequently, extreme estimated propensity scores induce large weights, which can result in a large variance and poor finite-sample properties (Kang & Schafer, 2007; Khan & Tamer, 2010). Therefore, it may seem desirable to modify the estimand to average only over the part of the covariate space with treatment probabilities bounded away from |$0$| and |$1$|. For example, in a medical study of a particular chemotherapy for breast cancer, patients with stage I breast cancer have never been treated with chemotherapy, so clinicians redefine the study population to be patients with stage II to stage IV breast cancer, omitting the stage I patients, for whom the propensity scores are zero. This effectively alters the estimand by changing the reference population to a different target population. Petersen et al. (2012) used a projection function to define the target parameter within a marginal structural working model. Li et al. (2018) proposed a general representation for the target population.
Trimming observational studies based on estimated propensity scores was first used in medical applications (e.g., Vincent et al., 2002; Grzybowski et al., 2003; Kurth et al., 2005) and then formalized by Crump et al. (2009), who suggested dropping from the analysis those units whose estimated propensity scores fall outside an interval |$[\alpha_{1},\alpha_{2}]$|, so that the average treatment effect for the target population can be estimated with the smallest asymptotic variance. Other methods, e.g., those of Traskin & Small (2011) and Fogarty et al. (2016), construct the study population based on the covariates themselves. But with moderate- or high-dimensional covariates, these rules for discarding units become complicated. In these cases, dimension reduction, for example seeking a scalar summary of the covariates, seems important. This was the original motivation of the propensity score (Rosenbaum & Rubin, 1983), which is arguably the most interpretable scalar function of the covariates.
Existing methods rarely incorporate the uncertainty in this design stage and restrict inference to the trimmed sample. We incorporate uncertainty in both the design and the analysis stages. The nonsmooth nature of trimming renders the target causal estimand not root-|$n$| estimable (Crump et al., 2009), so, instead of making a binary decision to include or exclude units from analysis, we propose to use a smooth weight function to approximate the existing sample trimming. This allows us to derive the asymptotic properties of the corresponding causal effect estimators using conventional linearization methods for two-step statistics. We show that the new weighting estimators are asymptotically linear, so the bootstrap can be used to construct confidence intervals.
2. Potential outcomes, causal effects and assumptions
For each unit |$i$|, the treatment is |$A_{i}\in\{0,1\}$|, where |$0$| and |$1$| are labels for control and treatment. There are two potential outcomes, one for treatment and the other for control, denoted by |$Y_{i}(1)$| and |$Y_{i}(0)$|, respectively. The observed outcome is |$Y_{i}=Y_{i}(A_{i})$|. Let |$X_{i}$| be the observed pre-treatment covariates. We assume that |$\{A_{i},X_{i},Y_{i}(1),Y_{i}(0)\}_{i=1}^{N}$| are independent draws from the distribution of |$\{A,X,Y(1),Y(0)\}$|. Given the observed covariates, the conditional average causal effect is |$\tau(X)=E\{Y(1)-Y(0)\mid X\}$|. The average treatment effect is |$\tau=E\{Y(1)-Y(0)\}=E\{\tau(X)\}$|. The common assumptions to identify |$\tau$| are as follows (Rosenbaum & Rubin, 1983).
Assumption 1 (Unconfoundedness). For |$a=0,1$|, |$Y(a)$| is independent of |$A$| given |$X$|.
Assumption 2 (Overlap). There exist constants |$c_{1}$| and |$c_{2}$| such that with probability |$1$|, |$0<c_{1}\leqslant e(X)\leqslant c_{2}<1$|, where |$e(X)=\mathrm{pr}(A=1\mid X)$| is the propensity score.
The augmented weighting estimator is doubly robust: under Assumptions 1 and 2, it is consistent for |$\tau$| if either the propensity score model for |$e(X)$| or the outcome model for |$\mu(a,X)=E(Y\mid A=a,X)$| is correctly specified.
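The double robustness property can be illustrated numerically. The following is a minimal sketch under stated assumptions, not the estimator as defined in the paper: it uses the textbook untrimmed forms of the weighting and augmented weighting estimators, a hypothetical data-generating process with one covariate and true effect |$2$|, and illustrative function names (`ipw`, `aipw`). The propensity score and outcome means are supplied directly, either correct or deliberately wrong, rather than estimated.

```python
import math
import random

def ipw(y, a, e):
    """Normalised inverse probability weighting estimate of tau."""
    n1 = d1 = n0 = d0 = 0.0
    for yi, ai, ei in zip(y, a, e):
        if ai == 1:
            n1 += yi / ei
            d1 += 1.0 / ei
        else:
            n0 += yi / (1.0 - ei)
            d0 += 1.0 / (1.0 - ei)
    return n1 / d1 - n0 / d0

def aipw(y, a, e, mu1, mu0):
    """Augmented estimate: outcome-model contrast plus weighted residuals."""
    total = 0.0
    for yi, ai, ei, m1, m0 in zip(y, a, e, mu1, mu0):
        total += m1 - m0
        total += ai * (yi - m1) / ei - (1 - ai) * (yi - m0) / (1.0 - ei)
    return total / len(y)

# Hypothetical data-generating process (an assumption for illustration only).
rng = random.Random(0)
n = 20000
x = [rng.uniform(-1.0, 1.0) for _ in range(n)]
e_true = [1.0 / (1.0 + math.exp(-xi)) for xi in x]
a = [1 if rng.random() < ei else 0 for ei in e_true]
y = [xi + 2.0 * ai + rng.gauss(0.0, 1.0) for xi, ai in zip(x, a)]

mu1_true = [xi + 2.0 for xi in x]   # correct outcome mean under treatment
mu0_true = list(x)                  # correct outcome mean under control
e_wrong = [0.5] * n                 # misspecified propensity score
mu_wrong = [0.0] * n                # misspecified outcome model

est_ipw = ipw(y, a, e_true)
est_right_mu = aipw(y, a, e_wrong, mu1_true, mu0_true)
est_right_e = aipw(y, a, e_true, mu_wrong, mu_wrong)
est_both_wrong = aipw(y, a, e_wrong, mu_wrong, mu_wrong)
print(est_ipw, est_right_mu, est_right_e, est_both_wrong)
```

In this simulation the augmented estimate stays near the true effect when only one of the two inputs is wrong, and drifts away when both are wrong.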
3. Main results for the average causal effect
We show in the Supplementary Material that |$b_{1,\epsilon}\rightarrow0$| as |$\epsilon\rightarrow0$|. Therefore, the increased variability due to estimating the support, |$b_{1,\epsilon}^{{ \mathrm{\scriptscriptstyle T} }}\mathcal{I}(\theta^{*})^{-1}b_{1,\epsilon}$|, is close to |$0$| with a small |$\epsilon$|.
The term |$-b_{2,\epsilon}^{{ \mathrm{\scriptscriptstyle T} }}\mathcal{I}(\theta^{*})^{-1}b_{2,\epsilon}$| implies that the estimated propensity score increases the precision of the simple weighting estimator of |$\tau$| based on the true propensity score, a phenomenon that has previously appeared in the causal inference literature (e.g., Rubin & Thomas, 1992; Hahn, 1998; Abadie & Imbens, 2016).
If the outcome model is correctly specified, then |$\tilde{\mu}(a,X)=\mu(a,X)$| and thus |$C_{0}=C_{1}=0$|. Consequently, the asymptotic variance of |$\hat{\tau}_{\epsilon}^{\mathrm{aug}}$| reduces to |$\tilde{\sigma}_{\epsilon}^{2}+b_{1,\epsilon}^{{ \mathrm{\scriptscriptstyle T} }}\mathcal{I}(\theta^{*})^{-1}b_{1,\epsilon}$|, which is smaller than the asymptotic variance of |$\hat{\tau}_{\epsilon}$|. Intuitively, by regressing |$Y$| on |$X$| and |$A$|, we use the residual as the new outcome, which in general has a smaller variance than |$Y$|.
Because |$\hat{\tau}_{\epsilon}$| and |$\hat{\tau}_{\epsilon}^{\mathrm{aug}}$| are asymptotically linear, the bootstrap can be used to estimate the variances of |$\hat{\tau}_{\epsilon}$| and |$\hat{\tau}_{\epsilon}^{\mathrm{aug}}$| (Shao & Tu, 2012). We evaluate the finite-sample properties of the bootstrap variance estimator by simulation in the Supplementary Material. Let |$\mathcal{S}=\{X:e(X^{{ \mathrm{\scriptscriptstyle T} }}\theta^{*})=\alpha_{1}\ \text{or}\ \alpha_{2}\}$|. We also show that if |$\mathrm{pr}(X\in\mathcal{S})=0$|, the bootstrap works for the weighting estimator with the indicator function, which is confirmed by simulation.
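A bootstrap that carries the design-stage uncertainty must refit the propensity score model in every resample before the weights are recomputed. The sketch below illustrates this under several assumptions that are not taken from the paper: a single covariate, a logistic propensity model fitted by Newton-Raphson, a smooth weight formed as a product of two normal distribution functions with standard deviation |$\epsilon$|, a normalised weighting estimator, and simulated data with a constant effect of |$1$|. All function names are hypothetical.

```python
import math
import random

def normal_cdf(z, sd):
    """CDF of a N(0, sd^2) random variable at z."""
    return 0.5 * (1.0 + math.erf(z / (sd * math.sqrt(2.0))))

def expit(t):
    return 1.0 / (1.0 + math.exp(-t))

def fit_logistic(x, a, iters=10):
    """Newton-Raphson fit of pr(A=1 | x) = expit(b0 + b1 * x)."""
    b0 = b1 = 0.0
    for _ in range(iters):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for xi, ai in zip(x, a):
            p = expit(b0 + b1 * xi)
            v = p * (1.0 - p)
            g0 += ai - p
            g1 += (ai - p) * xi
            h00 += v
            h01 += v * xi
            h11 += v * xi * xi
        det = h00 * h11 - h01 * h01
        b0 += (h11 * g0 - h01 * g1) / det
        b1 += (h00 * g1 - h01 * g0) / det
    return b0, b1

def smooth_estimate(x, a, y, alpha1, alpha2, eps):
    """Two-step estimate: fit the propensity model, then apply smooth weights."""
    b0, b1 = fit_logistic(x, a)
    num1 = den1 = num0 = den0 = 0.0
    for xi, ai, yi in zip(x, a, y):
        e = expit(b0 + b1 * xi)
        w = normal_cdf(e - alpha1, eps) * normal_cdf(alpha2 - e, eps)
        if ai == 1:
            num1 += w * yi / e
            den1 += w / e
        else:
            num0 += w * yi / (1.0 - e)
            den0 += w / (1.0 - e)
    return num1 / den1 - num0 / den0

def bootstrap_se(x, a, y, alpha1, alpha2, eps, reps, rng):
    """Nonparametric bootstrap; the propensity model is refitted each time."""
    n = len(x)
    draws = []
    for _ in range(reps):
        idx = [rng.randrange(n) for _ in range(n)]
        draws.append(smooth_estimate([x[i] for i in idx], [a[i] for i in idx],
                                     [y[i] for i in idx], alpha1, alpha2, eps))
    mean = sum(draws) / reps
    return math.sqrt(sum((d - mean) ** 2 for d in draws) / (reps - 1))

# Hypothetical simulated data with a constant treatment effect of 1.
rng = random.Random(1)
n = 1000
x = [rng.gauss(0.0, 1.0) for _ in range(n)]
a = [1 if rng.random() < expit(xi) else 0 for xi in x]
y = [xi + ai + rng.gauss(0.0, 1.0) for xi, ai in zip(x, a)]

estimate = smooth_estimate(x, a, y, 0.1, 0.9, 1e-4)
se = bootstrap_se(x, a, y, 0.1, 0.9, 1e-4, reps=100, rng=rng)
print(estimate, se)
```

Resampling whole units and repeating both steps is what lets the resulting standard error reflect the estimated propensity score as well as the outcome noise; a bootstrap that held the fitted scores fixed would ignore the first source.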
Although some robust nonparametric methods (Hirano et al., 2003; Lee et al., 2010, 2011) can be used for propensity score estimation, the majority of the literature uses parametric generalized linear models. When the propensity score model is misspecified, the weighting estimators are not consistent for the causal effect defined on the target population |$\mathcal{O}=\{X:\alpha_{1}\leqslant e(X)\leqslant\alpha_{2}\}$|. However, our estimators can still be helpful to inform treatment effects for the population defined as |$\mathcal{O}^{*}=\{X:\alpha_{1}\leqslant e(X^{{ \mathrm{\scriptscriptstyle T} }}\theta^{*})\leqslant\alpha_{2}\}$|, where |$e(X^{{ \mathrm{\scriptscriptstyle T} }}\theta^{*})$| is the propensity score projected to the generalized linear model family. This new study population is defined as being between two hyperplanes of the covariate space, which is slightly more complicated than the study population defined by the trees in Traskin & Small (2011) or by the intervals of covariates in Fogarty et al. (2016). Moreover, the smooth weighting estimators are still asymptotically linear, and again the bootstrap can be used for constructing confidence intervals. See the Supplementary Material for more details.
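The description of |$\mathcal{O}^{*}$| as the region between two hyperplanes can be made explicit. As a worked illustration, assume the logistic link, which is one member of the generalized linear model family and is an assumption made here for concreteness; any strictly increasing link gives the same structure with its own inverse in place of the log odds.

```latex
\alpha_{1}\leqslant e(X^{\mathrm{T}}\theta^{*})\leqslant\alpha_{2}
\quad\Longleftrightarrow\quad
\log\frac{\alpha_{1}}{1-\alpha_{1}}
\leqslant X^{\mathrm{T}}\theta^{*}\leqslant
\log\frac{\alpha_{2}}{1-\alpha_{2}},
\qquad
e(t)=\frac{\exp(t)}{1+\exp(t)}.
```

For example, with cut-offs |$\alpha_{1}=0.05$| and |$\alpha_{2}=0.6$|, the two bounding hyperplanes are |$X^{{ \mathrm{\scriptscriptstyle T} }}\theta^{*}=\log(0.05/0.95)\approx-2.944$| and |$X^{{ \mathrm{\scriptscriptstyle T} }}\theta^{*}=\log(0.6/0.4)\approx0.405$|.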
An important issue regarding the smooth weight function is the choice of |$\epsilon$|, which involves a bias-variance trade-off. On the one hand, the discrepancy between |$\tau_{\epsilon}$| and the target parameter |$\tau(\mathcal{O})$| is |$E([\omega_{\epsilon}(X^{{ \mathrm{\scriptscriptstyle T} }}\theta^{*})-1{\{\alpha_{1}\leqslant e(X^{{ \mathrm{\scriptscriptstyle T} }}\theta^{*})\leqslant\alpha_{2}\}}]\tau(X))$|. Assuming that |$\tau(X)$| is integrable, by the dominated convergence theorem, |$\tau_{\epsilon}$| converges to |$\tau(\mathcal{O})$| as |$\epsilon\rightarrow0$|. This implies that based on |$\hat{\tau}_{\epsilon}$| or |$\hat{\tau}_{\epsilon}^{\mathrm{aug}}$|, we can draw inference for |$\tau(\mathcal{O})$| by choosing a small |$\epsilon$|. On the other hand, as |$\epsilon\rightarrow0$|, the smooth weight function (4) becomes closer to the indicator weight function (1), which increases the variance of the weighting estimators. In practice, we recommend a sensitivity analysis varying |$\epsilon$| over a grid, for example, |$10^{-4},10^{-5},\ldots$|, as illustrated in the Supplementary Material and the application in the next section.
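The role of |$\epsilon$| can be seen by comparing a smooth weight with the indicator weight at a few propensity score values. The sketch below assumes a particular smooth weight, the product of two normal distribution functions with standard deviation |$\epsilon$|; this is an illustrative choice, and the function names and score values are hypothetical.

```python
import math

def normal_cdf(z, sd):
    """CDF of a N(0, sd^2) random variable at z."""
    return 0.5 * (1.0 + math.erf(z / (sd * math.sqrt(2.0))))

def indicator_weight(e, alpha1, alpha2):
    """Hard trimming: keep a unit only if its score lies in [alpha1, alpha2]."""
    return 1.0 if alpha1 <= e <= alpha2 else 0.0

def smooth_weight(e, alpha1, alpha2, eps):
    """Assumed smooth surrogate: product of two normal CDFs with scale eps."""
    return normal_cdf(e - alpha1, eps) * normal_cdf(alpha2 - e, eps)

alpha1, alpha2 = 0.05, 0.6
scores = [0.01, 0.04, 0.06, 0.3, 0.59, 0.61, 0.9]  # illustrative values only
for eps in (1e-2, 1e-3, 1e-4, 1e-5):
    gap = max(abs(smooth_weight(e, alpha1, alpha2, eps)
                  - indicator_weight(e, alpha1, alpha2)) for e in scores)
    print(f"eps={eps:g}  largest gap to the indicator weight: {gap:.3g}")
```

Away from the cut-offs the two weights agree closely once |$\epsilon$| is small. Exactly at a cut-off, this particular smooth weight is about one half while the indicator is one, so the two differ only on the boundary of the trimming region.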
4. National Health and Nutrition Examination Survey data
We examine a dataset from the 2007–2008 U.S. National Health and Nutrition Examination Survey to estimate the causal effect of smoking on blood lead levels (Hsu & Small, 2013). The dataset includes |$3340$| subjects consisting of |$679$| smokers, denoted by |$A=1$|, and |$2661$| nonsmokers, denoted by |$A=0$|. The outcome variable |$Y$| is the measured level of lead in the subject’s blood, with the observed range being from |$0.18~\mu$|g/dl to |$33.10~\mu$|g/dl. The covariates are age, income-to-poverty level, gender, education and race.
The propensity score is estimated by a logistic regression model whose linear predictor includes all covariates. Because there is little overlap where the propensity score is below |$0.05$| or above |$0.6$|, we restrict the estimand for the average smoking effect to the target population |$\mathcal{O}=\{X:0.05\leqslant e(X)\leqslant0.6\}$|. The upper cut-off of |$0.6$| is used because few subjects have propensity scores above it. Trimming removes |$794$| subjects, comprising |$111$| smokers and |$683$| nonsmokers, so the final analysis sample includes |$2546$| subjects, with |$568$| smokers and |$1978$| nonsmokers. In the Supplementary Material, we display the summary statistics of the covariates and give a more detailed interpretation of the target population.
We consider the weighting estimators using both the indicator and the smooth weight functions with |$\epsilon=10^{-4}$| and |$\epsilon=10^{-5}$|. For the augmented weighting estimator, we use a linear outcome model adjusting for all covariates, fitted separately for |$A=0$| and |$A=1$|. Table 1 shows the results. The weighting estimators with the smooth weight function are close to their counterparts with the indicator weight function, but have slightly smaller estimated standard errors. The smooth weighting estimators are insensitive to the choice of |$\epsilon$|. From the results, on average, smoking increases the blood lead level by at least |$0.65~\mu$|g/dl over the target population with |$0.05\leqslant e(X)\leqslant0.6$|.
Weighting estimator | |$\epsilon$| | Estimate | s.e. | |$95\%$| c.i. | Augmented weighting estimator | Estimate | s.e. | |$95\%$| c.i. |
---|---|---|---|---|---|---|---|---|
|$\hat{\tau}(\hat{\theta})$| | – | |$0.646$| | |$0.135$| | |$(0.376,0.916)$| | |$\hat{\tau}^{\mathrm{aug}}(\hat{\theta})$| | |$0.765$| | |$0.107$| | |$(0.552,0.978)$| |
|$\hat{\tau}_{\epsilon}(\hat{\theta})$| | |$10^{-4}$| | |$0.661$| | |$0.124$| | |$(0.412,0.909)$| | |$\hat{\tau}_{\epsilon}^{\mathrm{aug}}(\hat{\theta})$| | |$0.763$| | |$0.105$| | |$(0.554,0.973)$| |
|$\hat{\tau}_{\epsilon}(\hat{\theta})$| | |$10^{-5}$| | |$0.632$| | |$0.133$| | |$(0.366,0.899)$| | |$\hat{\tau}_{\epsilon}^{\mathrm{aug}}(\hat{\theta})$| | |$0.754$| | |$0.105$| | |$(0.543,0.964)$| |
s.e., standard error; c.i., confidence interval.
5. Extension to the average treatment effect on the treated
Another estimand of interest is the average treatment effect for the treated, |$\tau_{\mathrm{ATT}}=E\{Y(1)-Y(0)\mid A=1\}=E\{\tau(X)\mid A=1\}$|. Similar to Crump et al. (2009), if |$\sigma^{2}(1,X)=\sigma^{2}(0,X)$|, we can show that the optimal overlap for estimating |$\tau_{\mathrm{ATT}}$| is of the form |$\mathcal{O}=\{X:1-e(X)\geqslant\alpha\}$| for some |$\alpha$|, for which the estimators have the smallest asymptotic variance. Intuitively, for the treated units with |$e(X)$| close to |$1$|, there are few similar units in the control group that can provide information to infer their |$Y(0)$| values. Therefore, it is reasonable to drop these units with |$e(X)$| close to |$1$| when inferring |$\tau_{\mathrm{ATT}}$|. We give a formal discussion in the Supplementary Material.
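One-sided trimming for |$\tau_{\mathrm{ATT}}$| can be sketched in the same way. The example below is an illustration under assumptions, not the paper's definition: it takes the smooth weight to be a single normal distribution function of |$1-e(X)-\alpha$| with standard deviation |$\epsilon$|, weights controls by the odds |$e(X)/\{1-e(X)\}$|, and uses four made-up units with known scores. The function names are hypothetical.

```python
import math

def normal_cdf(z, sd):
    """CDF of a N(0, sd^2) random variable at z."""
    return 0.5 * (1.0 + math.erf(z / (sd * math.sqrt(2.0))))

def att_smooth_weight(e, alpha, eps):
    """Assumed smooth surrogate for the indicator of 1 - e >= alpha."""
    return normal_cdf(1.0 - e - alpha, eps)

def att_estimate(y, a, e, alpha, eps):
    """Trimmed ATT-type estimate: treated mean minus odds-weighted control mean."""
    num1 = den1 = num0 = den0 = 0.0
    for yi, ai, ei in zip(y, a, e):
        w = att_smooth_weight(ei, alpha, eps)
        if ai == 1:
            num1 += w * yi
            den1 += w
        else:
            odds = ei / (1.0 - ei)
            num0 += w * odds * yi
            den0 += w * odds
    return num1 / den1 - num0 / den0

# Four made-up units; the third is treated with a score of 0.95.
y = [3.0, 1.0, 5.0, 2.0]
a = [1, 0, 1, 0]
e = [0.5, 0.2, 0.95, 0.5]

trimmed = att_estimate(y, a, e, alpha=0.1, eps=1e-4)
print(trimmed)
```

With |$\alpha=0.1$| the treated unit whose score is |$0.95$| receives essentially zero weight, so the treated mean is driven by the remaining treated unit, while the two controls are combined in proportion to their odds of treatment.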
Define |$\tilde{b}_{1,\epsilon}$| and |$\tilde{b}_{2,\epsilon}$| as the analogues of |$b_{1,\epsilon}$| and |$b_{2,\epsilon}$| with weights |$\omega_{{\mathrm{ATT}},\epsilon}(X^{ \mathrm{\scriptscriptstyle T} }\hat{\theta})$|. In contrast to Remark 1, for |$\tau_{\mathrm{ATT}}$|, the term |$\tilde{b}_{1,\epsilon}$| does not converge to |$0$| as |$\epsilon\rightarrow0$|. The correction term in the asymptotic variance formula due to the estimated propensity score instead of the true propensity score, |$\tilde{b}_{1,\epsilon}^{ \mathrm{\scriptscriptstyle T} }\mathcal{I}(\theta^{*})^{-1}\tilde{b}_{1,\epsilon}-\tilde{b}_{2,\epsilon}^{ \mathrm{\scriptscriptstyle T} }\mathcal{I}(\theta^{*})^{-1}\tilde{b}_{2,\epsilon}$|, can be negative, zero, or positive. Ignoring the uncertainty in the estimated propensity score, the inference can be either conservative or anticonservative for |$\tau_{\mathrm{ATT}}$|, which differs from the inference for |$\tau$|. This fundamental difference also appeared for matching estimators (Abadie & Imbens, 2016), which highlights the importance of incorporating the uncertainty in the design stage especially for |$\tau_{\mathrm{ATT}}$|.
Acknowledgement
We benefited from the insightful comments from the associate editor and two reviewers. Peng Ding was partially supported by the U.S. Institute of Education Sciences and National Science Foundation.
Supplementary material
Supplementary material available at Biometrika online includes proofs, a simulation study, an extension, and more details on the application.