Bo Zhang, Dylan S. Small, A Calibrated Sensitivity Analysis for Matched Observational Studies with Application to the Effect of Second-Hand Smoke Exposure on Blood Lead Levels in Children, Journal of the Royal Statistical Society Series C: Applied Statistics, Volume 69, Issue 5, November 2020, Pages 1285–1305, https://doi.org/10.1111/rssc.12443
Summary
We conducted a matched observational study to investigate the causal relationship between second-hand smoke and blood lead levels in children. Our first analysis that assumes no unmeasured confounding suggests evidence of a detrimental effect of second-hand smoke. However, unmeasured confounding is a concern in our study as in other observational studies of second-hand smoke's effects. A sensitivity analysis asks how sensitive the conclusion is to a hypothesized unmeasured confounder U. For example, in our study, one potential unmeasured confounder is whether the child attends a public or private school. A commonly used sensitivity analysis for matched observational studies adopts a worst-case perspective, which assumes that, in each matched set, the unmeasured confounder is allocated to make the bias worst: in a matched pair, the child with the higher blood lead level always attends public school and the other private school. This worst-case allocation of U does not correspond to any realistic distribution of U in the population and is difficult to compare with observed covariates. We propose a new sensitivity analysis method that addresses these concerns. We apply the new method to our study and find that, to explain away the association between second-hand smoke exposure and blood lead levels as non-causal, the unmeasured confounder would have to be a bigger confounder than any measured confounder.
1 Introduction
1.1 An observational study on the effect of second-hand smoke exposure on blood lead levels in children
Concerns about the health effects of second-hand smoke exposure have been around since at least 1928 when Schönherr proposed that lung cancers among non-smoking women could be caused by inhalation of their husbands’ smoke (Smith, 2003). A randomized controlled trial of second-hand smoke exposure is clearly unethical. Over the years, many observational studies have been conducted (e.g. Enstrom and Kabat (2003), Mannino et al. (2003) and Oberg et al. (2011)). In this paper, we are concerned with the effect of second-hand smoke on blood lead levels among children. It is widely acknowledged that active tobacco smoking causes higher blood lead levels (Shaper et al., 1982; Grasmick et al., 1985; Mannino et al., 2005). An important public health question arises about whether exposure to second-hand smoke also causes higher blood lead levels, particularly in children. An effect of second-hand smoke on blood lead levels in children would be concerning because children's nervous systems are still developing and a high blood lead level is thought to cause consequences such as decreased intelligence, impaired growth and anaemia (National Research Council, 1993).
Mannino et al. (2003) studied the relationship between second-hand smoke exposure and blood lead levels in a sample of 5592 US children, aged 4–16 years, and concluded that there was strong evidence that second-hand smoke exposure was associated with increased blood lead levels. If second-hand smoke exposure were randomized, then association would imply causation, but second-hand smoking was not randomized and any causal conclusion needs to be made with great caution.
We followed Mannino et al. (2003) in constructing a data set that included children aged 4–16 years for whom both serum cotinine levels and blood lead levels were measured in the third National Health and Nutrition Examination Survey, along with the following variables: race or ethnicity, age, sex, poverty income ratio, education level of the reference adult, family size, number of rooms in the house and year the house was constructed. The biomarker cotinine is a metabolite of nicotine and an indicator of second-hand smoke exposure (Mannino et al., 2003). We excluded active smokers (cotinine level greater than or equal to 15.0 ng ml−1) as in Mannino et al. (2003) and classified a child as having a high cotinine level (treatment=1) if the cotinine level was in the third tertile, greater than or equal to 0.563 ng ml−1, and a not high cotinine level (treatment=0) otherwise.
Two main problems have been raised in making causal inference from observational studies of second-hand smoke's effects: measurement error and confounding (Kawachi and Colditz, 1996). Measurement error refers to the inaccuracy in classifying the subjects into the exposed and control groups. This difficulty is largely overcome in our study, because the classification is based on accurate cotinine measurement. A second and perhaps more challenging problem is confounding. Individuals who are exposed to environmental tobacco smoke usually display adverse profiles in relation to socio-economic position and health-related behaviours (Smith, 2003). Table 1 summarizes eight baseline covariates in the treated and control groups in our study. We see that children who are exposed to second-hand smoke are in general poorer, younger, live in older houses and have fewer rooms in their houses. A direct comparison of treated and control groups is thus not warranted.
Adjusted means and standardized differences of baseline covariates in the treatment and control groups before and after full matching†

| Covariate | Before matching: Control | Before matching: Treated | Before matching: Standardized difference | After matching: Control | After matching: Treated | After matching: Standardized difference |
|---|---|---|---|---|---|---|
| EDUCATION | 11.13 | 10.93 | −0.07 | 11.16 | 10.93 | −0.08 |
| MALE | 0.49 | 0.49 | −0.01 | 0.52 | 0.51 | −0.01 |
| NON-WHITE | 0.73 | 0.71 | −0.05 | 0.69 | 0.71 | 0.03 |
| POVERTY | 1.86 | 1.42 | −0.39 | 1.47 | 1.42 | −0.04 |
| BEFORE 1974 | 0.37 | 0.29 | −0.18 | 0.29 | 0.29 | −0.01 |
| ROOM | 6.00 | 5.67 | −0.20 | 5.70 | 5.67 | −0.02 |
| FAMILY SIZE | 5.08 | 4.92 | −0.08 | 4.81 | 4.92 | 0.06 |
| AGE | 9.56 | 9.06 | −0.13 | 9.04 | 9.06 | 0.01 |

†EDUCATION, years of education of the reference adult; MALE or FEMALE, gender; NON-WHITE or WHITE, race or ethnicity; POVERTY, income divided by poverty threshold; BEFORE or AFTER 1974, when the house was built; ROOM, number of rooms in the house; FAMILY SIZE, size of the family.
To control for the measured confounders, we constructed a matched observational study. In a matched observational study, subjects with comparable measured confounders are put into matched sets and inferences are drawn based on comparisons within these matched sets (Rosenbaum, 2002, 2010a; Hansen, 2004; Stuart, 2010; Lu et al., 2011; Zubizarreta, 2012; Pimentel et al., 2015). A matched observational study can be analysed in a model-based or non-parametric way. Rubin (1979) found that the combination of matching and model-based adjustment within matched sets is robust to model misspecification and is relatively efficient.
We performed a full matching (Hansen, 2004; Hansen and Klopfer, 2006) on the aforementioned eight variables. Table 1 suggests that the covariates are well balanced after matching; in particular, all standardized differences are less than 0.1 in absolute value after matching (Silber et al., 2016). The mean blood lead level is 3.27 μg dL−1. Assuming that there is no unmeasured confounding, the 95% two-sided confidence interval for the effect of second-hand smoke exposure on blood lead level is (0.71, 0.98) μg dL−1 by using the senfmCI function in the R package sensitivityfull (Rosenbaum, 2017) with default settings. This confidence interval was obtained by inverting the test based on Huber's M-statistic for an additive constant treatment effect.
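This pipeline can be sketched in R as follows. This is a minimal illustration with simulated stand-in data (all variable names are hypothetical), not our analysis code; senfmCI additionally needs the outcomes rearranged into the set-by-subject matrix format documented in sensitivityfull, which we omit here.

```r
library(optmatch)         # fullmatch(): full matching (Hansen, 2004)
library(sensitivityfull)  # senfmCI(): confidence intervals (Rosenbaum, 2017)

set.seed(1)
n <- 500
dat <- data.frame(poverty = rnorm(n), age = rnorm(n))
dat$z <- rbinom(n, 1, plogis(-0.5 * dat$poverty - 0.2 * dat$age))
dat$lead <- 0.8 * dat$z - 0.5 * dat$poverty + rnorm(n)

# Full matching on a propensity score distance built from the covariates
ps <- glm(z ~ poverty + age, family = binomial, data = dat)
dat$set <- fullmatch(match_on(ps, data = dat), data = dat)

# senfmCI(y, treated1, gamma = 1) would then give the 95% CI assuming no
# unmeasured confounding, with y the set-by-subject outcome matrix and
# treated1 flagging sets whose first column holds the lone treated subject.
```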
The above analysis suggests evidence for a causal relationship between second-hand smoke and higher blood lead levels in children; however, the analysis relies on assuming ‘no unmeasured confounding’, which is a rather questionable assumption. Although we have put subjects with similar observed covariates in the same matched sets, we are concerned that some unmeasured confounder that we have not matched on, say some aspect of the socio-economic status or a genetic variant, might be associated with the cotinine level and the blood lead level, and induce a spurious causal relationship. For instance, one potential unmeasured confounder that we are concerned about is whether the child attends a public or private school. Public versus private school could be associated with cotinine exposure because
- (a)
adjusting for reference adult's education and poverty income ratio does not fully adjust for the child's socio-economic status (consequently, private versus public school may capture residual socio-economic status of the child after adjusting for the reference adult's education and poverty income ratio, and be associated with cotinine level) and
- (b)
it is conceivable that students attending public schools are at higher risk of cotinine exposure because there are on average more smokers at public schools.
Private versus public school could be associated with the potential outcomes for blood lead level because public schools tend to have older buildings and one leading source of hazard of lead exposure for US children is lead-based paint which was commonly used in older buildings (Committee on Environmental Health, 2005).
1.2 Sensitivity analysis for matched observational studies: the bounds approach and application to the current study
Sensitivity analysis is one approach to tackling concerns about unmeasured confounding. A sensitivity analysis asks how much of a departure from the no-unmeasured-confounding assumption would be needed to affect the inferences that are drawn from an analysis that assumes no unmeasured confounding. A solid causal conclusion should be relatively insensitive to plausible violations of this no-unmeasured-confounding assumption. Over the years, many sensitivity analysis methods have been proposed for different causal inference frameworks (Cornfield et al., 1959; Rosenbaum and Rubin, 1983; Gastwirth et al., 1998; Imbens, 2003; McCandless et al., 2007; Ichino et al., 2008; Hosman et al., 2010; Carnegie et al., 2016; Ding and VanderWeele, 2016; Dorie et al., 2016; Genbäck and de Luna, 2019; Fogarty, 2019; Franks et al., 2019; Cinelli and Hazlett, 2020).
The simultaneous sensitivity analysis model of Gastwirth et al. (1998) contains two sensitivity parameters: λ, which controls the strength of the association between U and the treatment status Z, and δ, which controls the strength of the association between U and the response Y. Consider a test statistic t(·) for which we want to compute the tail probability P{t(·)⩾a} under the null hypothesis of no treatment effect. A sensitivity analysis computes this tail probability for a given test statistic t(·), a given set of sensitivity parameters (λ, δ) and given values of the unobserved covariate U for each subject. As the values of U vary, the permutation test yields different p-values. Gastwirth et al. (1998) showed that, for matched pairs, a sharp upper bound on the p-value (i.e. the most conservative inference) is obtained by letting the unit with the higher response have U = 1 and the unit with the lower response have U = 0. Rosenbaum (1987) considered model (1) but asserted only the model for the treatment Z, not the model for the outcome Y(0), and showed that the same allocation of U again yields a sharp upper bound on the p-value. Gastwirth et al. (1998) showed that this bound is identical to that obtained by taking δ→∞ in their model. These bounds are sometimes referred to as Rosenbaum bounds in the literature (DiPrete and Gangl, 2004). When the matching design is a full match (Hansen, 2004) rather than a pair match, Gastwirth et al. (2000) and Small et al. (2009) showed how to obtain asymptotically valid bounds by using a computationally fast algorithm.
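To make the worst-case allocation concrete, here is a minimal sketch for matched pairs with the sign test, which corresponds to the δ→∞ limit just described (the analyses in this paper use an aligned rank statistic instead): each pair in which the treated unit has the higher response is a 'success', and under the worst-case allocation of U its probability is at most exp(λ)/{1 + exp(λ)}.

```r
# Upper bound (worst case) on the one-sided sign-test p-value for matched
# pairs under the treatment-only sensitivity model with parameter lambda.
worst_case_pvalue <- function(treated_minus_control, lambda) {
  d <- treated_minus_control[treated_minus_control != 0]  # drop tied pairs
  t_obs <- sum(d > 0)                        # pairs where treated is higher
  p_plus <- exp(lambda) / (1 + exp(lambda))  # worst-case success probability
  pbinom(t_obs - 1, size = length(d), prob = p_plus, lower.tail = FALSE)
}

set.seed(1)
worst_case_pvalue(rnorm(100, mean = 0.5), lambda = log(2))
```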
We followed the approach of Gastwirth et al. (1998) and tested the null hypothesis of no treatment effect for various sensitivity parameters (λ, δ) in our matched design using the aligned rank test (Small et al., 2009). Sensitivity parameters and the corresponding worst-case p-values are summarized in Table 2. We divided the sensitivity parameter δ by the estimated standard deviation σ̂ of the outcome blood lead level, so that the magnitude of δ is more interpretable. The null hypothesis of no treatment effect can still be rejected when (λ, δ/σ̂) = (0.8, 0.25), but not when (λ, δ/σ̂) = (1.0, 0.32).
Worst-case p-values for testing the null hypothesis of no treatment effect at various sensitivity parameters†

| (λ, δ/σ̂) | p-value |
|---|---|
| (0, 0) | 0.000 |
| (0.5, 0.15) | 0.000 |
| (0.8, 0.25) | 0.000 |
| (1.0, 0.32) | 0.488 |
| (1.2, 0.40) | 0.999 |

†σ̂ is the estimated standard deviation of the outcome blood lead level. We report δ/σ̂ so that δ takes into account the magnitude of the outcome and is easier to interpret.
1.3 Limitations of the bounds approach, calibration to observed covariates and motivation for a new framework
The bounds approach derives bounds on the p-value, given the sensitivity parameters, by finding the worst-case allocation of the unmeasured confounder in a finite sample. This approach has two limitations. First, it is conservative for an unmeasured confounder in a superpopulation. Consider one potential unmeasured confounder in our study of whether a child attends a public or private school. Let U = 1 if the child goes to the public school and U = 0 otherwise. The p-values that are reported in the bounds approach of Gastwirth et al. (1998) (see Table 2) correspond to the following worst-case scenario: in each matched pair, the child with higher blood lead level always goes to the public school (U = 1), whereas the child with lower blood lead level always goes to the private school (U = 0). However, this hypothesized binary unmeasured confounder, if it had existed, must have a Bernoulli distribution in the population and, as we construct matched pairs from the population, there is always some positive probability that both units have U simultaneously equal to 0 or 1, i.e. both children attend the private school or both the public school. The worst-case allocation of U under which the bounds are computed is conservative and does not correspond to any realistic probability distribution for this hypothesized unmeasured confounder in the population.
The bounds approach also does not enable researchers to perform accurate calibration. Calibration refers to methodologies that help researchers to judge the plausibility of unmeasured confounding in reference to the observed covariates, i.e. to judge how the amount of unmeasured confounding that would make a study sensitive to bias compares with the confounding from the observed covariates. Several papers have developed calibration methods for various sensitivity analysis frameworks (Ichino et al., 2008; Hosman et al., 2010; Blackwell, 2013; Griffin et al., 2013; Hsu and Small, 2013; Dorie et al., 2016; Middleton et al., 2016; Cinelli and Hazlett, 2020). In particular, Hsu and Small (2013) developed a calibration approach for matched observational studies using the bounds approach of Gastwirth et al. (1998): they calibrated the unmeasured confounder to the observed covariates by comparing the sensitivity parameters (λ, δ) in the simultaneous sensitivity analysis with the coefficients of the observed covariates obtained from a regression analysis. A problem with this approach is that, as explained in the previous paragraph, the binary unmeasured confounder has a deterministic structure under the worst-case allocation to which the bounds correspond and therefore cannot be compared in a meaningful way with any observed covariate that has a probability distribution. This limitation motivates us to develop a new sensitivity analysis framework for matched observational studies, and to leverage it to assess the robustness of our causal conclusion concerning the effect of second-hand smoke exposure on blood lead levels in children.
1.4 Our contribution
Our sensitivity analysis framework for matched observational studies departs substantively from the bounds approach in two respects. First, our framework views matched sets as drawn from a population. The hypothesized unmeasured confounder is modelled as a random variable with a probability distribution, instead of as having a deterministic structure as in the worst-case allocation under which the bounds are computed (Gastwirth et al., 1998; Rosenbaum, 2010a). We borrow from the literature on model-based sensitivity analysis methods (Rosenbaum and Rubin (1983), Imbens (2003), Carnegie et al. (2016) and Cinelli and Hazlett (2020); see the on-line supplementary material section A.1 for a brief review) and adapt it to matched observational studies. Extensive simulations demonstrate that, by combining matching and model-based adjustment, our method is robust to assorted model misspecifications and can be more powerful compared with the bounds approach.
Additionally, by endowing the hypothesized unmeasured confounder with a probability distribution in the population, our method enables calibrating the hypothesized unmeasured confounder to the measured confounders in a more meaningful manner than the calibration strategy in Hsu and Small (2013). Ichino et al. (2008) also developed a sensitivity analysis method for matched observational studies with calibration. Our model and calibration approach differ from theirs in that Ichino et al. (2008) assumed that U is independent of the observed covariates conditional on the treatment and the outcome, whereas we assume that U and the observed covariates are unconditionally independent (and consequently may be conditionally dependent). The conditional independence between U and X that is adopted in Ichino et al. (2008) may not hold in some cases, as is discussed in detail in the on-line supplementary material section A.2 from a causal directed acyclic graph point of view.
More broadly, our calibration method is distinct from many existing calibration methods in that we adjust for the omission of the unmeasured confounder when performing calibration, whereas most existing approaches, with the exception of Hsu and Small (2013) and Cinelli and Hazlett (2020), use the observed statistics of the measured covariates for calibration without adjustment for the omission of the unmeasured confounder. As discussed extensively in Cinelli and Hazlett (2020), this practice has undesirable properties, mainly because the estimates of how the observed covariates are related to the outcome and the treatment assignment are often themselves affected by the omission of the unmeasured confounder U, even when U is assumed to be independent of the observed covariates.
There are two outputs from our sensitivity analysis method. The first is a calibration plot that contrasts the sensitivity parameters with the estimated regression coefficients in magnitude. The second is a table summarizing the variable importance of the hypothesized unmeasured confounder relative to the measured confounders in confounding the outcome and the treatment assignment. We leverage the proposed sensitivity analysis method to assess the robustness of our primary analysis conclusion, i.e. that second-hand smoke exposure has a detrimental effect on children by elevating their blood lead levels. We find that, to explain away the causal conclusion, an unmeasured confounder would need to be associated with the blood lead level to a similar extent as the poverty income ratio and at the same time be more strongly associated with the cotinine level than any of the eight observed covariates. The rest of the paper is organized as follows. In Section 2, we explain the notation and present the model specification. In Section 3, we reframe the problem as a missing data problem and solve it using an expectation–maximization (EM) algorithm. Section 4 discusses the proposed calibration approach and Section 5 presents simulation results. We apply our methodology to perform a calibrated sensitivity analysis for the effect of second-hand smoke exposure on a child's blood lead level in Section 6 and conclude with a brief discussion in Section 7. The R package sensitivityCalibration implements the methodology and is available from the Comprehensive R Archive Network.
2 Notation and framework
We clarify the notation and describe our model in this section. Suppose that there are I matched sets, i = 1,2,…,I, each matched on a set of observed covariates X. Within each matched set, there are ni subjects; let N = ∑i ni be the total number of subjects in the study. In the setting of full matching (Rosenbaum, 2002; Hansen, 2004), each matched set consists of either one treated subject and ni−1 controls, or one control and ni−1 treated subjects. If ni = 2 for all i, then we have pair matching.
Let ij denote subject j, j = 1,2,…,ni, either treated or control, in matched set i. Let Zij = 1 if subject ij is treated and Zij = 0 otherwise. Let Xij denote the observed covariates of subject ij. According to the matched design, Xij = Xij′ for j and j′ in the same matched set i. To conduct a sensitivity analysis, we hypothesize the existence of an unmeasured confounder Uij that is associated with each subject ij. Following the potential outcome framework (Neyman, 1990; Rubin, 1974), we let Yij(1) and Yij(0) denote the potential outcomes of subject ij under treatment (Zij = 1) and under control (Zij = 0) respectively. Hence, the observed outcome of subject ij is Yij = ZijYij(1) + (1 − Zij)Yij(0).
To prepare for calibration, we estimate not only the treatment effect, but also the effect of each observed covariate on the treatment assignment and the response, i.e. the coefficients of the observed covariates in the outcome regression model and the propensity score model. To estimate ψ robustly, we leverage the matched design and write ψ(Xij) = θ⊤Xij + ai, where θ⊤Xij is a linear combination of Xij and ai is a fixed effect that is specific to matched set i (Rubin, 1979). Our model and inference are more robust compared with assuming that ψ(·) is linear in Xij. For simplicity, we assume that κ is linear in Xij here, but we may also include a fixed effect for each matched set when estimating κ(·).
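For concreteness, the following display sketches how these pieces combine. It is our reconstruction of model (2) from the description above and from how the sensitivity parameters (p, λ, δ) are used in Sections 3–6; the symbol θ for the covariate coefficients is our notation:

$$U_{ij} \overset{\text{i.i.d.}}{\sim} \text{Bernoulli}(p), \qquad U_{ij} \perp X_{ij},$$
$$\operatorname{logit} P(Z_{ij} = 1 \mid X_{ij}, U_{ij}) = \kappa(X_{ij}) + \lambda U_{ij},$$
$$Y_{ij} = \beta Z_{ij} + \psi(X_{ij}) + \delta U_{ij} + \epsilon_{ij}, \qquad \psi(X_{ij}) = \theta^{\top} X_{ij} + a_i, \qquad \epsilon_{ij} \sim N(0, \sigma^2),$$

where β is the additive treatment effect and ai is the fixed effect of matched set i.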
3 Estimating model parameters via an expectation–maximization algorithm
We consider a similar model to that in Rosenbaum and Rubin (1983) and Imbens (2003); however, we allow for fixed effects for matched sets. There are only a handful of parameters in the model that was considered in Rosenbaum and Rubin (1983) and Imbens (2003), and these parameters can be easily estimated via maximum likelihood. In our model, to leverage the matched design, we allow for distinct fixed effects for each matched set and the number of parameters is of the same order as the number of matched sets. It is not unusual to have thousands of matched sets or pairs in a matched observational study and directly optimizing the observed data likelihood would be difficult. The key observation here is that the sensitivity analysis problem can be reframed as a missing data problem (Ichino et al., 2008). Specifically, we treat the unmeasured confounder U as a missing covariate and find the maximum likelihood estimate efficiently via an EM algorithm, and then use the block bootstrap (Abadie and Spiess, 2019) to construct the confidence interval. Below, we briefly describe the E-step and M-step in our context. A brief recap of the EM algorithm can be found in the on-line supplementary material section B.1 and more details of the E-step and M-step can be found in supplementary material section B.2.
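As a concrete illustration, one EM iteration for this set-up might look as follows. This is a minimal sketch, not the sensitivityCalibration implementation: the data frame dat, its columns (outcome y, treatment z, covariates x1 and x2, matched-set factor set) and the two-covariate formulas are hypothetical. Because (p, λ, δ) are fixed sensitivity parameters, U enters both fits through an offset and only the remaining coefficients are updated.

```r
# One EM iteration for the hypothesized binary confounder U, treated as a
# missing covariate. fit_y and fit_z are the current outcome (lm) and
# propensity score (glm) fits; both carry U through a fixed offset.
em_step <- function(dat, p, lambda, delta, fit_y, fit_z) {
  d1 <- transform(dat, U = 1)
  d0 <- transform(dat, U = 0)

  # E-step: w = P(U = 1 | y, z, x) under the current parameter values
  lik <- function(d, prior) {
    mu <- predict(fit_y, newdata = d)                     # includes delta * U
    pr <- predict(fit_z, newdata = d, type = "response")  # includes lambda * U
    prior * dnorm(d$y, mu, sigma(fit_y)) * ifelse(d$z == 1, pr, 1 - pr)
  }
  w <- lik(d1, p) / (lik(d1, p) + lik(d0, 1 - p))

  # M-step: weighted refits on the augmented data (each subject enters
  # twice, once with U = 1 and weight w, once with U = 0 and weight 1 - w)
  aug <- rbind(transform(d1, wt = w), transform(d0, wt = 1 - w))
  fit_y <- lm(y ~ z + x1 + x2 + set + offset(delta * U),
              data = aug, weights = wt)
  fit_z <- glm(z ~ x1 + x2 + offset(lambda * U),
               family = quasibinomial,  # avoids non-integer weight warnings
               data = aug, weights = wt)
  list(fit_y = fit_y, fit_z = fit_z, w = w)
}
```

Iterating em_step to convergence from initial fits with U fixed at 0 yields the maximum likelihood estimate for the given (p, λ, δ); confidence intervals then come from the block bootstrap over matched sets.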
4 Calibrating the unmeasured confounder to observed covariates
The parameters in the bounds approach are interpreted in terms of the odds ratio. For instance, in the simultaneous sensitivity analysis of Gastwirth et al. (1998), two subjects matched for observed covariates X differ in their odds of receiving treatment by at most a factor of exp (λ). The magnitude of the sensitivity parameter λ indicates the strength of the hypothesized unmeasured confounder in confounding the treatment assignment; however, an audience without much training in interpreting sensitivity analysis results may still be perplexed about what λ means for their specific problem. Calibration aims to remedy this by comparing the hypothesized unmeasured confounder with the observed covariates in a meaningful way. We ask two interconnected questions. First, if a binary unmeasured confounder had existed, how big an effect, compared with the observed covariates, would it have to have on the treatment assignment and the response to affect materially the qualitative conclusion that we draw from the observational data? Second, what can we say about the importance of this unmeasured confounder relative to that of the other observed covariates?
We leverage our proposed sensitivity analysis model to answer these two questions. For a specified parameter p, we first identify the boundary that separates (λ, δ) pairs that render the treatment effect significant at the 0.05-level from those that do not. In practice, for each fixed λ, we do a binary search to find the largest δ that renders the treatment effect significant. For a specific (λ, δ) pair, we record the coefficient estimates of observed covariates and contrast them with (λ, δ) on the same graph. To make the comparison between the coefficient estimates and the sensitivity parameters meaningful, we transform each observed covariate to the same scale as the hypothesized unmeasured confounder. Here, we follow the suggestion in Gelman (2008) and standardize all non-binary covariates to mean 0 and standard deviation (SD) 0.5 and leave dichotomized covariates intact. According to this standardization scheme, the coefficients of dichotomized variables can be interpreted directly and those of continuous or ordinal variables can be interpreted as the effect of a 2-SD increase in the covariate value, which roughly corresponds to flipping a binary variable from 0 to 1. This calibration plot enables empirical researchers to interpret the sensitivity parameters in a matched observational study better by helping them to draw comparisons with the more tangible observed covariates. When we implement the calibration method in the R package sensitivityCalibration, we make an interactive calibration plot where, for a fixed p, the (λ, δ) pair moves along the boundary and the coefficients of the observed covariates adjust themselves accordingly.
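In code, the boundary search and the standardization step might look like the sketch below; significant() is a hypothetical wrapper that runs the EM fit and block bootstrap of Section 3 and reports whether the 95% confidence interval excludes 0.

```r
# For fixed p and lambda, binary-search the largest delta at which the
# treatment effect is still significant at the 0.05 level.
delta_boundary <- function(p, lambda, significant, lo = 0, hi = 5, tol = 0.01) {
  while (hi - lo > tol) {
    mid <- (lo + hi) / 2
    if (significant(p, lambda, mid)) lo <- mid else hi <- mid
  }
  lo
}

# Gelman (2008) standardization used before plotting the coefficients:
# non-binary covariates to mean 0 and SD 0.5; binary covariates untouched.
standardize <- function(x) {
  if (length(unique(x)) == 2) x else (x - mean(x)) / (2 * sd(x))
}
```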
Comparing the sensitivity parameters with the coefficients of the observed covariates sheds some light on interpreting the magnitude of the sensitivity parameters; however, a naive comparison of sensitivity parameters with regression coefficients may not suffice, for two reasons. First, although U is assumed to be independent of all observed covariates X, any two observed covariates Xi and Xj may be correlated, and the magnitudes of their regression coefficients are not necessarily informative about how important Xi and Xj are in confounding the treatment and the outcome. Second, researchers sometimes would like to calibrate the unmeasured confounder to a group of two or more observed covariates, say ethnicity and poverty income ratio. It is not clear how to achieve this by focusing only on the magnitudes of regression coefficients.
To tackle these challenges, we propose to supplement this naive comparison with an assessment of the relative importance of the hypothesized unmeasured confounder compared with the observed covariates. Many relative importance measures have been developed over the years; see Kruskal and Majors (1989) for a taxonomy of these measures. A ‘well-known, venerable collection’ of measures (Kruskal and Majors, 1989) often expresses the relative importance of covariates in terms of variance reduction, i.e. how much of the variance of the response is accounted for by each covariate. Specifically, in the context of multiple regression, Pratt (1987) derived a unique measure of variable importance based on symmetry and invariance principles under linear transformation. This measure assigns to each variable an importance that has the interpretation of an average reduction in variance if that variable is held fixed. More recently, Azen and Budescu (2003) proposed so-called dominance analysis: a method based on examining R2-values across all subset models.
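For a linear outcome model, Pratt's measure is simple to compute: the importance of covariate j is the product of its standardized regression coefficient and its marginal correlation with the response, and these products sum to the model R2. A minimal sketch, assuming a numeric covariate matrix X (which, in our application, would include the expanded U) and outcome y:

```r
# Pratt's (1987) variable importance in a multiple regression.
pratt_importance <- function(X, y) {
  Xs <- scale(X)                     # standardize the covariates
  ys <- as.vector(scale(y))          # standardize the response
  b  <- coef(lm(ys ~ Xs - 1))        # standardized regression coefficients
  r  <- as.vector(cor(Xs, ys))       # marginal correlations with the response
  setNames(as.vector(b) * r, colnames(X))  # importances; they sum to R^2
}
```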
Our sensitivity analysis framework works seamlessly with such relative importance measures and enables researchers to speak of the importance of the hypothesized unmeasured confounder relative to that of the observed covariates. For each sensitivity parameter trio (p, λ, δ), we run the EM algorithm until it converges; at convergence we have a data set in which each original subject is augmented into two instances: one with the binary unmeasured confounder U = 0 and the other with U = 1. Each instance is associated with a weight, which is the probability that U = 0 or U = 1 given the data and the parameter values at convergence. We expand this weighted data set to obtain a full data set that contains all the observed covariates as well as the binary unmeasured confounder U, as sketched below, and we use this full data set to assess the relative importance of covariates. It is also straightforward to assess the relative importance of a group of covariates under our framework: Pratt's method has the property that the relative importance of several variables is the sum of their individual relative importances. In contrast, dominance analysis handles a group of covariates by adding the entire group to each subset model and computing the average increase in R2. Note that the regression coefficients of the observed covariates, as well as their relative importance measures, are all computed with adjustment for the unmeasured confounder. As pointed out in Section 1.4, this is an important distinction between our proposal and many existing calibration approaches in the literature. We leverage the proposed calibration procedures to study in detail the causal relationship between second-hand smoke exposure and blood lead levels in Section 6.
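A sketch of the expansion step described above; the replication resolution m (copies per unit of weight) is our choice for illustration, not a quantity from the paper:

```r
# Expand the EM-weighted data set into a "full" data set: each subject's
# U = 1 copy is replicated in proportion to its weight wt, and likewise
# for its U = 0 copy, so ordinary (unweighted) importance measures apply.
expand_weighted <- function(aug, m = 100) {
  aug[rep(seq_len(nrow(aug)), times = round(m * aug$wt)), ]
}
```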
5 Simulation study
In a simulation study, we examined the confidence interval coverage of our approach and how our approach compares in power with the bounds approach under the simultaneous sensitivity model of Gastwirth et al. (1998). The power of a sensitivity analysis is defined as the probability that it correctly rejects the null hypothesis in a favourable situation where there is a genuine treatment effect and no unmeasured confounding (Rosenbaum, 2010b). If there is no bias and a genuine treatment effect, then we would hope to reject the null hypothesis of no effect and the power of a sensitivity analysis is the chance that our hope will be realized. The power of a sensitivity analysis guides the choice of methods of analysis for a fixed research design (Hansen et al., 2014).
In Section 5.1, we generate data according to model (2) described in Section 2 with κ(X) and ψ(X) both linear in X, a real treatment effect and no unmeasured confounding (corresponding to (λ, δ) = (0,0)). We verify that our confidence interval has correct coverage and demonstrate that our method is more powerful than the simultaneous sensitivity analysis of Gastwirth et al. (1998). In Section 5.2, we generate the response from a non-linear model and verify that matching combined with regression adjustment within matched sets makes our method robust against model misspecification. In Section 5.3, we perform additional simulations with different effect sizes and error structures, and consider more realistic situations where the matching is not exact. In all of these scenarios, we demonstrate that our framework is robust against various model misspecifications, including a non-normal error structure, a non-linear response model and non-exact matching, in the sense that the coverage of our confidence intervals is always approximately equal to the nominal coverage.
5.1 Linear response model
Power of the sensitivity analysis: linear, heterogeneous and non-linear response models†

| (λ, δ) | Linear, n = 1000: FM | Linear, n = 1000: SSA | Linear, n = 4000: FM | Linear, n = 4000: SSA | Heterogeneous, n = 1000: FM | Heterogeneous, n = 1000: SSA | Non-linear, n = 1000: FM | Non-linear, n = 1000: SSA |
|---|---|---|---|---|---|---|---|---|
| (0, 0) | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 |
| (1, 1) | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 |
| (1.5, 1.5) | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 |
| (2, 2) | 1.000 | 0.998 | 1.000 | 1.000 | 1.000 | 0.840 | 1.000 | 0.824 |
| (2.5, 2.5) | 0.954 | 0.198 | 1.000 | 0.672 | 0.952 | 0.012 | 0.928 | 0.076 |
| (3, 3) | 0.000 | 0.000 | 0.000 | 0.000 | 0.200 | 0.000 | 0.000 | 0.000 |

†FM, full matching with regression adjustment (our method); SSA, simultaneous sensitivity analysis. From left to right: linear model with a constant additive treatment effect and sample size 1000; linear model with a constant additive treatment effect and sample size 4000; linear model with an additive but heterogeneous treatment effect and sample size 1000; non-linear model with a constant additive treatment effect and sample size 1000. Levels of our method are 5.6%, 4.5%, 4.2% and 5.4% respectively, when (λ, δ) = (0,0) is set as the truth and the nominal level is 5%.
5.2 Non-linear response model
It would be difficult, if not impossible, to specify this response model correctly; however, because we perform matching and then regression adjustment within each matched set, the procedure is robust to this non-linear response model. In fact, 94.6% of our 500 experiments have 95% confidence intervals covering the true treatment effect β = 2. Moreover, a comparison of the power of the sensitivity analysis again demonstrates that our method has superior power compared with the simultaneous sensitivity analysis, as shown in Table 3.
5.3 Different effect sizes, error structures and non-exact matching
In this subsection, we report additional simulation results for various effect sizes, error structures and non-exact matching. We first consider the same linear and non-linear response models as in Sections 5.1 and 5.2 but with a smaller effect size β = 1. With the linear response model, 93.2% of the confidence intervals cover the true parameter β = 1 in our simulations, and the confidence intervals have almost exact coverage for the non-linear response model, with 94% of the 95% confidence intervals covering the true parameter. Table 4 further summarizes the power of our sensitivity analysis in this setting.
Power of the sensitivity analysis: smaller effect size and non-normal errors†

| (λ, δ) | Effect size β = 1: linear response | Effect size β = 1: non-linear response | Non-normal error: double exponential | Non-normal error: Student's t (df = 2) |
|---|---|---|---|---|
| (0, 0) | 1.000 | 1.000 | 1.000 | 0.960 |
| (0.5, 0.5) | 1.000 | 1.000 | 1.000 | 0.954 |
| (0.8, 0.8) | 1.000 | 1.000 | 1.000 | 0.932 |
| (1, 1) | 1.000 | 1.000 | 1.000 | 0.914 |
| (1.5, 1.5) | 0.954 | 0.964 | 0.944 | 0.626 |
| (2, 2) | 0.008 | 0.014 | 0.008 | 0.028 |

†The second and third columns are the linear and the non-linear response model with the smaller effect size β = 1 and sample size 1000; levels are 6.0% and 5.0% respectively, when (λ, δ) = (0,0) is set as the truth. The fourth and fifth columns are the linear response model with non-normal errors; levels are 4.6% and 6.2% when the error follows a double-exponential distribution and a Student t-distribution (df = 2) respectively, when (λ, δ) = (0,0) is set as the truth and the nominal level is 5%.
Next, instead of assuming a normal error, we consider two different error structures: a Student t-distribution with df = 2 degrees of freedom and a Laplace (double-exponential) distribution with rate 1.5. We adopt a linear response model with an additive treatment effect β = 1, no unmeasured confounding (λ = δ = 0) and a sample size of 1000. In our simulations, 95.4% and 93.8% of the confidence intervals capture the truth β = 1 when ε follows a Student t-distribution with df = 2 and a Laplace distribution with rate 1.5 respectively. Again, the results demonstrate that our method is robust against misspecified error structures. The power of our sensitivity analysis is further summarized in Table 4.
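For reference, the two non-normal error draws can be generated in base R as follows (a Laplace variate with rate 1.5 is the difference of two independent exponentials with that rate, since base R has no Laplace generator):

```r
n <- 1000
eps_t       <- rt(n, df = 2)                              # Student's t, df = 2
eps_laplace <- rexp(n, rate = 1.5) - rexp(n, rate = 1.5)  # Laplace, rate 1.5
```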
In all of the aforementioned simulations, we made matching easy by simulating multiple subjects with exactly the same observed covariates, which is rarely the case in practice. To mimic a real data set better, we add uniform[−0.2, 0.2], uniform[−0.5, 0.5] or uniform[−1.0, 1.0] noise to each observed covariate so that subjects in the same matched set are similar, but no longer identical, in their observed covariates. We run the experiment with the same linear response model as specified before, with β = 1 and σ = 1.5. When we perform the matching, subjects with different observed covariates, sometimes from different strata, are put in the same matched set. As a consequence, hypothesis testing that assumes a homogeneous, additive treatment effect without adjustment for covariates may yield invalid inferences. However, our method adjusts for covariates and produces valid inferences if the model for the covariates is correctly specified. Our confidence intervals covered the true parameter 92.4%, 95.6% and 94.4% of the time for the uniform[−0.2, 0.2], uniform[−0.5, 0.5] and uniform[−1, 1] noise respectively. Table 5 summarizes the power of our sensitivity analysis. Our method is still quite powerful: it correctly rejects the null hypothesis in a favourable situation 94.4% of the time when (λ, δ) = (1.5, 1.5), whereas the simultaneous sensitivity analysis rejects only 62.6% of the time.
Power of the sensitivity analysis: linear response model with non-exact matching†
| (λ, δ) | uniform[−0.2, 0.2] | uniform[−0.5, 0.5] | uniform[−1, 1] |
|---|---|---|---|
| (0, 0) | 1.000 | 1.000 | 1.000 |
| (0.5, 0.5) | 1.000 | 1.000 | 1.000 |
| (0.8, 0.8) | 1.000 | 1.000 | 1.000 |
| (1, 1) | 1.000 | 1.000 | 1.000 |
| (1.5, 1.5) | 0.944 | 0.878 | 0.772 |
| (2, 2) | 0.008 | 0.008 | 0.008 |

†The levels are 7.6%, 4.4% and 5.6% for the different degrees of inexact matching, when (λ, δ) = (0,0) is set as the truth and the nominal level is 5%.
6 Application: second-hand smoking and blood lead levels in children
6.1 Robustness of the causal relationship between second-hand smoking and blood lead levels in children
We now leverage our proposed sensitivity analysis method to examine how sensitive the treatment effect is to unmeasured confounding in the second-hand smoke exposure and blood lead levels study that was described in Section 1. For a fixed sensitivity parameter trio (p, λ, δ), we maximized the observed data likelihood of model (2) as described in Section 3 after conducting a full match. Model (2) assumes that the error term in the blood lead level model is normally distributed; the simulations in Section 5.3 suggest that matching combined with regression adjustment is quite robust to non-normal errors. We first tabulate the 95% confidence intervals of the treatment effect for various combinations of sensitivity parameters (p, λ, δ) in Table 6. Additional results for more (p, λ, δ) combinations are presented in the on-line supplementary material section C.2. Note that (λ, δ) = (0,0) corresponds to no unmeasured confounding, so we would expect approximately the same confidence intervals regardless of p; this is indeed so. We compare our method with the bounds approach of Gastwirth et al. (1998): the last column in Table 6 summarizes the worst-case p-values of testing no treatment effect for various (λ, δ) pairs. When (λ, δ) = (1.0, 1.0), the 95% confidence interval does not cover 0 for p = 0.1, p = 0.3 or p = 0.5; however, the bounds approach of Gastwirth et al. (1998) yields a p-value of 0.488. The null hypothesis cannot be rejected according to the bounds approach, yet an unmeasured confounder of this magnitude cannot explain away the treatment effect under our sensitivity analysis model.
95% confidence intervals of the treatment effect for various sensitivity parameters (p, λ, δ), together with the worst-case p-values from the bounds approach of Gastwirth et al. (1998)

| (λ, δ) | 95% confidence interval, p = 0.5 | 95% confidence interval, p = 0.3 | 95% confidence interval, p = 0.1 | Worst-case p-value |
|---|---|---|---|---|
| (0, 0) | (0.397, 0.808) | (0.407, 0.823) | (0.415, 0.825) | 0.000 |
| (0.5, 0.5) | (0.349, 0.750) | (0.345, 0.766) | (0.380, 0.771) | 0.000 |
| (0.8, 0.8) | (0.243, 0.673) | (0.275, 0.677) | (0.341, 0.749) | 0.000 |
| (1.0, 1.0) | (0.144, 0.572) | (0.180, 0.594) | (0.317, 0.711) | 0.488 |
| (1.2, 1.2) | (0.0448, 0.455) | (0.106, 0.507) | (0.259, 0.676) | 0.999 |
| (1.5, 1.5) | (−0.142, 0.276) | (−0.0670, 0.323) | (0.212, 0.602) | 1.000 |
| (2.0, 2.0) | (−0.551, −0.141) | (−0.442, −0.0501) | (0.0491, 0.474) | 1.000 |
For a fixed (λ, δ) pair, a public health researcher may be interested in plotting the 95% confidence interval against all values of p between 0 and 1 and reporting the inference for various p. See Fig. 1 for such a plot for various (λ, δ) combinations. Fig. 1 suggests that the choice of p = 0.5 yields approximately the most conservative treatment effect for this study. Intuitively, when p is close to 0 or 1, subjects with different treatment status in the same matched set tend to have the same value of U, which makes it difficult to attribute the observed treatment effect to the disagreement in U; therefore, we would expect p taking a value somewhere in between either end to yield the most conservative inference. Instead of reporting Fig. 1, empirical researchers may simply report a summary statistic. For a fixed (λ, δ) pair, let CIp = [Lp, Up] represent the level α confidence interval corresponding to p. One choice is to report the most conservative CIp, i.e. CIp*, where p* = arg minp Lp. One may also report ∪p∈(0,1) CIp so that the confidence interval has the correct (though possibly conservative) coverage for all realizations of p.
Fig. 1. 95% confidence interval versus p for (λ, δ) pairs (a) (1, 1) and (b) (1.5, 1.5)
Fig. 2. Calibration plot with (p, λ, δ) = (0.5, 1, 1.99). The labels and the covariate information that they encode are: POVERTY, poverty income ratio; AGE, age at the time of interview; SEX, male or female; HOUSE, whether the house was built before or after 1974; ROOM, number of rooms in the house; FAMILY, size of family; ETHNICITY, white or non-white; EDUCATION, years of education of the reference adult
A public health researcher studying this particular data set could stop here and report the sensitivity analysis results as discussed above. However, the subject matter expert, as well as any audience of the report, may stare at the sensitivity analysis results, rather confused. The absolute scale of (λ, δ) sheds little light on the actual interpretation of this hypothesized unmeasured confounder: the causal conclusion is still significant when (λ, δ) = (1.2,1.2) for p = 0.5 and p = 0.3, but no longer so when (λ, δ) = (1.5,1.5). Is the hypothesized unmeasured confounding indeed something worth worrying about? Does this degree of sensitivity to unmeasured confounding cast doubt on the findings when no unmeasured confounding was assumed or does it reinforce the findings? To help subject matter experts to interpret the sensitivity parameters better in their specific context, we compare the hypothesized unmeasured confounder with the observed covariates below.
6.2 Calibration to observed covariates
Fig. 2 displays our calibration plot when p = 0.5. The marker represents one (λ, δ) pair on the boundary and, for this specific sensitivity parameter pair (λ, δ) = (1, 1.99), we estimate the coefficients of the observed covariates (after standardizing to mean 0 and SD 0.5 where appropriate), which are shown on the same plot as labelled dots. A subject matter expert can make sense of Fig. 2 as follows: for the binary unmeasured confounder U to explain away the treatment effect, its sensitivity parameters would need to be as large as (λ, δ) = (1, 1.99). A (λ, δ) pair as large as (1, 1.99) says that flipping the unmeasured confounder from 0 to 1 corresponds to a one-unit increase in the propensity score (on the logit scale) and an increase of almost 2 μg dL−1 in the blood lead level, holding everything else fixed. To draw a comparison, a 2-SD increase in the poverty income ratio roughly corresponds to a 1.2-unit increase in the propensity score and an increase of 1 μg dL−1 in the blood lead level, holding everything else fixed. We implement an interactive calibration plot in our R package sensitivityCalibration; the on-line supplementary material section C.3 contains four snapshots of this animated plot with different (λ, δ) pairs on the boundary. The estimated coefficients of the observed covariates are relatively insensitive to changes in (λ, δ).
As is discussed in Section 4, the magnitudes of the regression coefficients, even after standardization, do not necessarily indicate their importance. We adopt two useful and interpretable metrics to describe the importance of the hypothesized unmeasured confounder relative to that of the observed covariates. In the outcome regression model, we assess the relative variable importance by using both Pratt's method (Pratt, 1987) and the dominance analysis method (Azen and Budescu, 2003). The results are summarized in Table 7. In Pratt's method, the contribution of each covariate is the share of the total R2 that this covariate accounts for, whereas, in the dominance analysis, the contribution of each covariate is the average increase in R2 across all subset models. Although the two analyses have different motivations, the results align perfectly. With δ = 2, the hypothesized binary unmeasured confounder U is almost comparable with the observed covariate POVERTY (the poverty income ratio) in confounding the blood lead level. Similarly, we assess the relative importance of U in confounding the propensity score under the logistic regression model via the generalized dominance analysis. With λ = 1, the hypothesized unmeasured confounder has higher relative importance than any of the observed covariates in confounding the treatment assignment. This supplementary information helps subject matter experts to judge for themselves how robust the causal conclusion is.
Table 7. Assessing the relative importance of observed confounders and the hypothesized unmeasured confounder†

Results for outcome regression (δ = 2)                 | Results for propensity score (λ = 1),
Pratt's method           | Dominance analysis          | generalized dominance analysis
Covariate   Contribution | Covariate   Contribution   | Covariate   Contribution
AGE          0.044       | AGE          0.043         | U            0.045
POVERTY      0.040       | POVERTY      0.035         | POVERTY      0.024
U            0.032       | U            0.031         | HOUSE        0.005
ETHNICITY    0.023       | ETHNICITY    0.020         | ETHNICITY    0.004
FAMILY       0.018       | FAMILY       0.019         | FAMILY       0.003
HOUSE        0.018       | HOUSE        0.017         | ROOM         0.003
SEX          0.013       | SEX          0.013         | AGE          0.003
COP          0.010       | COP          0.012         | EDUCATION    0.001
EDUCATION    0.007       | EDUCATION    0.011         | SEX          0.000
ROOM        −0.003       | ROOM         0.002         |

†POVERTY, poverty income ratio; AGE, age at the time of interview; SEX, male or female; HOUSE, whether the house was built before or after 1974; ROOM, number of rooms in the house; FAMILY, size of family; ETHNICITY, white or non-white; EDUCATION, years of education of the reference adult; COP, treatment status.
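The two importance measures in Table 7 are simple to compute. The following R sketch is our own illustration, not the paper's implementation: pratt() returns Pratt's indices, which partition the total R2 of the linear outcome model, and dominance() averages each covariate's increase in R2 over all subsets of the remaining covariates (the enumeration is exponential in the number of covariates, which is unproblematic for the handful of variables here).

    ## Pratt's index: standardized coefficient times marginal correlation;
    ## the indices sum to the model R^2, so each is a share of R^2.
    pratt <- function(y, X) {
      ys <- as.vector(scale(y)); Xs <- scale(X)
      b  <- coef(lm(ys ~ Xs - 1))            # standardized coefficients
      setNames(as.vector(b * cor(Xs, ys)), colnames(X))
    }

    ## Dominance analysis: each covariate's contribution is its average
    ## increase in R^2 over all subsets of the other covariates.
    dominance <- function(y, X) {
      p  <- ncol(X)
      r2 <- function(cols) {
        if (!length(cols)) return(0)         # null model
        summary(lm(y ~ X[, cols, drop = FALSE]))$r.squared
      }
      setNames(sapply(seq_len(p), function(j) {
        others  <- setdiff(seq_len(p), j)
        subsets <- list(integer(0))          # start from the empty subset
        for (k in seq_along(others))
          subsets <- c(subsets, combn(others, k, simplify = FALSE))
        mean(sapply(subsets, function(s) r2(c(s, j)) - r2(s)))
      }), colnames(X))
    }

Applied to the covariates together with a hypothesized U, these functions yield shares of R2 of the kind reported in Table 7.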
We can now make causal statements regarding the detrimental effect of second-hand smoke on children's blood lead levels as follows. Under the assumption of treatment ignorability, exposure to second-hand smoke causes a significant increase in blood lead levels in children, with a 95% confidence interval of (0.397, 0.808) μg dL−1. To explain away this causal conclusion, the residual confounding, represented by an independent binary unmeasured confounder, would need to be associated with blood lead levels to an extent similar to the poverty income ratio and, at the same time, more strongly associated with the cotinine level than any of the eight observed covariates.
7 Discussion
In this paper, we investigate the detrimental effect of second-hand smoke exposure on blood lead levels among children by using observational data. To assess the robustness of our causal conclusion better, we develop a new sensitivity analysis framework for matched observational studies that allows calibration, i.e. the hypothesized unmeasured confounder can be meaningfully compared with the observed covariates. Our sensitivity analysis has two outputs. The first is a calibration plot: for a fixed sensitivity parameter p, we identify (λ, δ) pairs such that the estimated treatment effect is significant at the 0.05 level and, for each (λ, δ) pair on the boundary, we estimate the coefficients of the observed covariates by regression adjustment within each matched set and contrast them with (λ, δ) on the same plot. The second, because a naive comparison of the magnitudes of regression coefficients with the sensitivity parameters can be misleading, is a table summarizing the importance of the unmeasured confounder relative to the observed covariates, computed by using both Pratt's method and the dominance analysis. Together these outputs enable a further assessment of the robustness of our causal conclusion regarding the detrimental effect of second-hand smoke on blood lead levels. We find that, to explain away the observed causal relationship, the unmeasured confounding would need to be associated with blood lead levels to a degree similar to the poverty income ratio, and more strongly associated with the cotinine level than any observed covariate.
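One standard rendering of the regression adjustment step, sketched below under the same hypothetical names as in the earlier sketch with matched_set denoting a matched-set identifier, includes fixed effects for the matched sets, so that confounding that is constant within a set is absorbed and the covariate coefficients reflect within-set variation only.

    ## A hedged sketch of regression adjustment with matched-set fixed
    ## effects; `matched_set` is a hypothetical set identifier in `dat`.
    fit <- lm(dat$lead ~ dat$z + Xs + factor(dat$matched_set))
    summary(fit)$coefficients["dat$z", ]   # treatment estimate, SE, t, p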
In this paper, we model the unmeasured confounder as binary; we discussed the motivations behind this choice in Section 2. This modelling choice is not essential in principle and can be generalized. When the unmeasured confounder has a continuous distribution, the E-step yields an integral rather than a weighted sum, and this integral typically has no closed form; however, it can be approximated by discretizing it into a weighted sum so that the EM algorithm by the method of weights can still be applied (Ibrahim, 1990). Alternatively, one may use a Monte Carlo EM algorithm (Wei and Tanner, 1990). These approaches usually involve heavy computation. Hence, we focus on analysing a binary unmeasured confounder in this paper, although the extension to a continuous unmeasured confounder with any prespecified distribution is immediate. For future work, it would be interesting to investigate the consequences of assuming a dichotomous unmeasured confounder. In the context of hypothesis testing, Wang and Krieger (2006) showed that causal conclusions are most sensitive to a binary unobserved covariate for a large class of test statistics in the setting of matched pairs with binary treatment and response. An analogous result in the context of estimation would be of great interest.
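To make the method-of-weights idea concrete, the following self-contained R sketch, which is our illustration under assumed working models (a Gaussian outcome and a logistic propensity score) rather than the paper's code, gives the unmeasured confounder discrete support grid with prior masses prior: a binary U corresponds to grid = c(0, 1) and prior = c(1 - p, p), and a continuous U is approximated by a fine grid, which is exactly the discretization described above. The fixed sensitivity parameters (λ, δ) enter the two fits as offsets, so only the remaining coefficients are estimated.

    ## EM by the method of weights for an unmeasured confounder U with
    ## discrete support `grid` and prior masses `prior`; an illustrative
    ## sketch (X is a numeric covariate matrix with one row per subject).
    em_method_of_weights <- function(y, z, X, grid, prior,
                                     lambda, delta, iters = 50) {
      n <- length(y); k <- length(grid)
      ## expand the data: one copy of each observation per support point of U
      u  <- rep(grid, each = n); pu <- rep(prior, each = n)
      yy <- rep(y, k); zz <- rep(z, k)
      XX <- X[rep(seq_len(n), k), , drop = FALSE]
      w  <- rep(1 / k, n * k)                    # initial E-step weights
      for (it in seq_len(iters)) {
        ## M-step: weighted fits; lambda * u and delta * u are fixed offsets
        fit_y <- lm(yy ~ zz + XX, weights = w, offset = delta * u)
        fit_z <- suppressWarnings(               # weights are non-integer
          glm(zz ~ XX, family = binomial, weights = w, offset = lambda * u))
        ## E-step: posterior weight of each support point given (y, z, x)
        s2  <- sum(w * resid(fit_y)^2) / sum(w)  # ML-type variance estimate
        fy  <- dnorm(yy, fitted(fit_y), sqrt(s2))
        pz  <- fitted(fit_z)
        lik <- matrix(fy * ifelse(zz == 1, pz, 1 - pz) * pu, nrow = n)
        w   <- as.vector(lik / rowSums(lik))     # normalize within observation
      }
      coef(fit_y)[["zz"]]                        # adjusted treatment effect
    }

For the binary confounder of this paper one would call, for example, em_method_of_weights(y, z, X, grid = c(0, 1), prior = c(1 - p, p), lambda, delta); the sensitivity analysis then repeats such a fit over a grid of (p, λ, δ) values.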
Supporting information
Additional ‘supporting information’ may be found in the on-line version of this article:
‘Supplementary materials for “A calibrated sensitivity analysis for matched observational studies with application to the effect of second-hand smoke exposure on blood lead levels in children”’.
Acknowledgements
The authors thank the Associate Editor and referees for helpful comments.