Summary

We conducted a matched observational study to investigate the causal relationship between second-hand smoke exposure and blood lead levels in children. Our primary analysis, which assumes no unmeasured confounding, suggests evidence of a detrimental effect of second-hand smoke. However, unmeasured confounding is a concern in our study, as in other observational studies of second-hand smoke's effects. A sensitivity analysis asks how sensitive the conclusion is to a hypothesized unmeasured confounder U. For example, in our study, one potential unmeasured confounder is whether the child attends a public or private school. A commonly used sensitivity analysis for matched observational studies adopts a worst-case perspective, which assumes that, in each matched set, the unmeasured confounder is allocated to make the bias as severe as possible: in a matched pair, the child with the higher blood lead level always attends public school and the other private school. This worst-case allocation of U does not correspond to any realistic distribution of U in the population and is difficult to compare with observed covariates. We propose a new sensitivity analysis method that addresses these concerns. We apply the new method to our study and find that, to explain away the association between second-hand smoke exposure and blood lead levels as non-causal, the unmeasured confounder would have to be a bigger confounder than any measured confounder.

1 Introduction

1.1 An observational study on the effect of second-hand smoke exposure on blood lead levels in children

Concerns about the health effects of second-hand smoke exposure date back to at least 1928, when Schönherr proposed that lung cancers among non-smoking women could be caused by inhalation of their husbands’ smoke (Smith, 2003). A randomized controlled trial of second-hand smoke exposure is clearly unethical, so, over the years, many observational studies have been conducted (e.g. Enstrom and Kabat (2003), Mannino et al. (2003) and Oberg et al. (2011)). In this paper, we are concerned with the effect of second-hand smoke on blood lead levels among children. It is widely acknowledged that active tobacco smoking causes higher blood lead levels (Shaper et al., 1982; Grasmick et al., 1985; Mannino et al., 2005). An important public health question is whether exposure to second-hand smoke also causes higher blood lead levels, particularly in children. An effect of second-hand smoke on blood lead levels in children would be concerning because children's nervous systems are still developing and a high blood lead level is thought to cause consequences such as decreased intelligence, impaired growth and anaemia (National Research Council, 1993).

Mannino et al. (2003) studied the relationship between second-hand smoke exposure and blood lead levels in a sample of 5592 US children, aged 4–16 years, and concluded that there was strong evidence that second-hand smoke exposure was associated with increased blood lead levels. If second-hand smoke exposure were randomized, then association would imply causation, but second-hand smoking was not randomized and any causal conclusion needs to be made with great caution.

We followed Mannino et al. (2003) in constructing a data set that included children aged 4–16 years for whom both serum cotinine levels and blood lead levels were measured in the third National Health and Nutrition Examination Survey, along with the following variables: race or ethnicity, age, sex, poverty income ratio, education level of the reference adult, family size, number of rooms in the house and year the house was constructed. The biomarker cotinine is a metabolite of nicotine and an indicator of second-hand smoke exposure (Mannino et al., 2003). We excluded active smokers (cotinine level greater than or equal to 15.0 ng ml−1) as in Mannino et al. (2003) and classified a child as having a high cotinine level (treatment=1) if the cotinine level was in the third tertile, greater than or equal to 0.563 ng ml−1, and a not high cotinine level (treatment=0) otherwise.

Two main problems have been raised in making causal inference from observational studies of second-hand smoke's effects: measurement error and confounding (Kawachi and Colditz, 1996). Measurement error refers to the inaccuracy in classifying the subjects into the exposed and control groups. This difficulty is largely overcome in our study, because the classification is based on accurate cotinine measurement. A second and perhaps more challenging problem is confounding. Individuals who are exposed to environmental tobacco smoke usually display adverse profiles in relation to socio-economic position and health-related behaviours (Smith, 2003). Table 1 summarizes eight baseline covariates in the treated and control groups in our study. We see that children who are exposed to second-hand smoke are in general poorer, younger, live in older houses and have fewer rooms in their houses. A direct comparison of treated and control groups is thus not warranted.

Table 1

Adjusted means and standardized differences of baseline covariates in the treatment and control groups before and after full matching

Covariate    | Before matching                               | After matching
             | Control | Treated | Standardized difference   | Control | Treated | Standardized difference
EDUCATION    | 11.13   | 10.93   | −0.07                     | 11.16   | 10.93   | −0.08
MALE         | 0.49    | 0.49    | −0.01                     | 0.52    | 0.51    | −0.01
NON-WHITE    | 0.73    | 0.71    | −0.05                     | 0.69    | 0.71    | 0.03
POVERTY      | 1.86    | 1.42    | −0.39                     | 1.47    | 1.42    | −0.04
BEFORE 1974  | 0.37    | 0.29    | −0.18                     | 0.29    | 0.29    | −0.01
ROOM         | 6.00    | 5.67    | −0.20                     | 5.70    | 5.67    | −0.02
FAMILY SIZE  | 5.08    | 4.92    | −0.08                     | 4.81    | 4.92    | 0.06
AGE          | 9.56    | 9.06    | −0.13                     | 9.04    | 9.06    | 0.01

EDUCATION, years of education of the reference adult; MALE or FEMALE, gender; NON-WHITE or WHITE, race or ethnicity; POVERTY, income divided by poverty threshold; BEFORE or AFTER 1974, when the house was built; ROOM, number of rooms in the house; FAMILY, size of the family.

To control for the measured confounders, we constructed a matched observational study. In a matched observational study, subjects with comparable measured confounders are put into matched sets and inferences are drawn based on comparisons within these matched sets (Rosenbaum, 2002, 2010a; Hansen, 2004; Stuart, 2010; Lu et al., 2011; Zubizarreta, 2012; Pimentel et al., 2015). A matched observational study can be analysed in a model-based or non-parametric way. Rubin (1979) found that the combination of matching and model-based adjustment within matched sets is robust to model misspecification and is relatively efficient.

We performed a full matching (Hansen, 2004; Hansen and Klopfer, 2006) on the aforementioned eight variables. Table 1 suggests that the covariates are well balanced after matching; in particular, all standardized differences are less than 0.1 after matching (Silber et al., 2016). The mean blood lead level is 3.27 μg dL−1. Assuming that there is no unmeasured confounding, the 95% two-sided confidence interval for the effect of second-hand smoke exposure on blood lead level is (0.71, 0.98) μg dL−1, computed by using the senfmCI function in the R package sensitivityfull (Rosenbaum, 2017) with default settings; this interval is obtained by inverting a test based on Huber's M-statistic for an additive constant treatment effect.
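
To make the standardized differences in Table 1 concrete, the following base R sketch computes them; the variable names (x, z, set) are illustrative, and the control reweighting after full matching follows one common convention rather than necessarily the exact computation behind Table 1.

```r
# Hypothetical helper: standardized difference of a covariate x between
# treated (z = 1) and control (z = 0) groups, with optional weights.
std_diff <- function(x, z, w = rep(1, length(x))) {
  m1 <- weighted.mean(x[z == 1], w[z == 1])
  m0 <- weighted.mean(x[z == 0], w[z == 0])
  s  <- sqrt((var(x[z == 1]) + var(x[z == 0])) / 2)  # pooled SD, a common convention
  (m1 - m0) / s
}

# After full matching, one common convention is to reweight the controls in
# each matched set so that the controls in a set carry the same total weight
# as its treated subjects; `set` is the matched-set label.
full_match_weights <- function(z, set) {
  w <- rep(1, length(z))
  for (s in unique(set)) {
    idx <- set == s
    w[idx & z == 0] <- sum(z[idx] == 1) / sum(z[idx] == 0)
  }
  w
}
```

Before matching, std_diff(x, z) with unit weights gives a before-matching standardized difference in the spirit of Table 1; after matching, passing full_match_weights(z, set) gives a weighted version.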

The above analysis suggests evidence for a causal relationship between second-hand smoke and higher blood lead levels in children; however, the analysis relies on assuming ‘no unmeasured confounding’, which is a rather questionable assumption. Although we have put subjects with similar observed covariates in the same matched sets, we are concerned that some unmeasured confounder that we have not matched on, say some aspect of the socio-economic status or a genetic variant, might be associated with the cotinine level and the blood lead level, and induce a spurious causal relationship. For instance, one potential unmeasured confounder that we are concerned about is whether the child attends a public or private school. Public versus private school could be associated with cotinine exposure because

  • (a)

    adjusting for reference adult's education and poverty income ratio does not fully adjust for the child's socio-economic status (consequently, private versus public school may capture residual socio-economic status of the child after adjusting for the reference adult's education and poverty income ratio, and be associated with cotinine level) and

  • (b)

    it is conceivable that students attending public schools are at higher risk of cotinine exposure because there are on average more smokers at public schools.

Private versus public school could be associated with the potential outcomes for blood lead level because public schools tend to have older buildings, and one leading source of lead exposure for US children is lead-based paint, which was commonly used in older buildings (Committee on Environmental Health, 2005).

1.2 Sensitivity analysis for matched observational studies: the bounds approach and application to the current study

Sensitivity analysis is one approach to tackling concerns about unmeasured confounding. A sensitivity analysis asks how much of a departure from the no-unmeasured-confounding assumption would be needed to affect the inferences that are drawn from an analysis that assumes no unmeasured confounding. A solid causal conclusion should be relatively insensitive to plausible violations of this no-unmeasured-confounding assumption. Over the years, many sensitivity analysis methods have been proposed for different causal inference frameworks (Cornfield et al., 1959; Rosenbaum and Rubin, 1983; Gastwirth et al., 1998; Imbens, 2003; McCandless et al., 2007; Ichino et al., 2008; Hosman et al., 2010; Carnegie et al., 2016; Ding and VanderWeele, 2016; Dorie et al., 2016; Genbäck and de Luna, 2019; Fogarty, 2019; Franks et al., 2019; Cinelli and Hazlett, 2020).

In this section, we focus on reviewing a commonly used sensitivity analysis framework for matched observational studies: the bounds approach. We illustrate the bounds approach by using the simultaneous sensitivity analysis model that was proposed by Gastwirth et al. (1998). Gastwirth et al. (1998) hypothesized the existence of an unmeasured variable U such that there would be no more unmeasured confounding if U were observed and matched on. Specifically, they considered the following model:
(1)
logit P(Z = 1 | X, U) = κ(X) + λU,    P{Y(0) ≤ y | X, U} = F(y − δU | X),
where X is a vector of observed covariates, U a hypothesized unmeasured confounder, Z the treatment status, Y(0) the potential outcome under no treatment, κ(X) an unspecified function of the observed covariates and F(· | X) an unspecified conditional distribution function. This model contains as a special case that the potential outcome under control follows a normal distribution:
Y(0) | X, U ∼ N{ψ(X) + δU, σ²},
where ψ(X) is an arbitrary function of the observed covariates and σ² a variance parameter.

The simultaneous sensitivity analysis model contains two sensitivity parameters: λ, which controls the strength of the association between U and the treatment status Z, and δ, which controls the strength of the association between U and the response Y. Consider a test statistic t(·) for which we want to compute the tail probability P{t(·) ⩾ a} under the null hypothesis of no treatment effect. A sensitivity analysis computes this tail probability for a given test statistic t(·), a given set of sensitivity parameters (λ, δ) and given values of the unobserved covariate U for each subject. As the values of U vary, the permutation test yields different p-values. Gastwirth et al. (1998) showed that, for matched pairs, a sharp upper bound on the p-value (i.e. the most conservative inference) is obtained by letting the unit with the higher response have U = 1 and the unit with the lower response have U = 0. Rosenbaum (1987) considered a version of model (1) that asserts only the treatment model for Z and places no model on the outcome Y(0), and showed that a sharp upper bound on the p-value is again obtained by letting the unit with the higher response have U = 1 and the unit with the lower response have U = 0; Gastwirth et al. (1998) showed that this bound is identical to that obtained by taking δ→∞ in their model. These bounds are sometimes referred to as Rosenbaum bounds in the literature (DiPrete and Gangl, 2004). When the matching design is a full match (Hansen, 2004) rather than a pair match, Gastwirth et al. (2000) and Small et al. (2009) showed how to obtain asymptotically valid bounds by using a computationally fast algorithm.
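
As a concrete illustration of the δ→∞ special case just described (the Rosenbaum bound), the base R sketch below computes the worst-case upper bound on a one-sided p-value for matched pairs, using the Wilcoxon signed rank statistic and a large-sample normal approximation rather than the M-statistics and aligned rank tests used elsewhere in this paper; y_t and y_c are hypothetical vectors of treated and control responses, one entry per pair.

```r
# Worst-case (Rosenbaum-bound) p-value for matched pairs at sensitivity
# parameter lambda, i.e. treatment odds within a pair differ by at most
# exp(lambda); Wilcoxon signed rank statistic with a normal approximation.
rosenbaum_bound_p <- function(y_t, y_c, lambda = 0) {
  Gamma  <- exp(lambda)
  d      <- (y_t - y_c)[y_t != y_c]         # drop tied pairs
  r      <- rank(abs(d))
  T_obs  <- sum(r[d > 0])                   # signed rank statistic
  # Worst case: in every pair the higher-response unit has U = 1, so the
  # chance that it is the treated unit is at most Gamma / (1 + Gamma).
  p_plus <- Gamma / (1 + Gamma)
  mu     <- p_plus * sum(r)
  v      <- p_plus * (1 - p_plus) * sum(r^2)
  1 - pnorm((T_obs - mu) / sqrt(v))         # upper bound on the one-sided p-value
}
```

At λ = 0 this reduces to the usual randomization p-value; as λ grows, the bound increases towards 1.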

We followed the approach of Gastwirth et al. (1998) and tested the null hypothesis of no treatment effect for various sensitivity parameters (λ, δ) in our matched design using the aligned rank test (Small et al., 2009). Sensitivity parameters and the corresponding worst-case p-values are summarized in Table 2. We divided the sensitivity parameter δ by the estimated standard deviation σ̂ of the outcome blood lead level, so that the magnitude of δ is more interpretable. The null hypothesis of no treatment effect can still be rejected when (λ, δ/σ̂) = (0.8, 0.25), but not when (λ, δ/σ̂) = (1.0, 0.32).

Table 2

Simultaneous sensitivity analysis of Gastwirth et al. (1998) applied to the second-hand smoke and blood lead levels study

(λ, δ/σ̂)    | p-value
(0, 0)       | 0.000
(0.5, 0.15)  | 0.000
(0.8, 0.25)  | 0.000
(1.0, 0.32)  | 0.488
(1.2, 0.40)  | 0.999

σ̂ is the estimated standard deviation of the outcome blood lead level. We report δ/σ̂ so that δ takes into account the magnitude of the outcome and is easier to interpret.

1.3 Limitations of the bounds approach, calibration to observed covariates and motivation for a new framework

The bounds approach derives bounds on the p-value, given the sensitivity parameters, by finding the worst-case allocation of the unmeasured confounder in a finite sample. This approach has two limitations. First, it is conservative for an unmeasured confounder in a superpopulation. Consider one potential unmeasured confounder in our study: whether a child attends a public or private school. Let U = 1 if the child attends a public school and U = 0 otherwise. The p-values that are reported in the bounds approach of Gastwirth et al. (1998) (see Table 2) correspond to the following worst-case scenario: in each matched pair, the child with the higher blood lead level always attends the public school (U = 1), whereas the child with the lower blood lead level always attends the private school (U = 0). However, this hypothesized binary unmeasured confounder, if it existed, would have a Bernoulli distribution in the population and, as we construct matched pairs from the population, there is always some positive probability that both units have U simultaneously equal to 0 or to 1, i.e. both children attend private schools or both attend public schools. The worst-case allocation of U under which the bounds are computed is conservative and does not correspond to any realistic probability distribution for this hypothesized unmeasured confounder in the population.

The bounds approach also does not enable researchers to perform accurate calibration. Calibration refers to methodologies that aid researchers in judging the plausibility of the unmeasured confounding in reference to observed covariates, i.e. to calibrate how much unmeasured confounding would make a study sensitive to bias in terms of how it compares with the confounding from observed covariates. Several papers have developed calibration methods for various sensitivity analysis frameworks (Ichino et al., 2008; Hosman et al., 2010; Blackwell, 2013; Griffin et al., 2013; Hsu and Small, 2013; Dorie et al., 2016; Middleton et al., 2016; Cinelli and Hazlett, 2020). In particular, Hsu and Small (2013) developed a calibration approach for matched observational studies using the bounds approach of Gastwirth et al. (1998). Hsu and Small (2013) calibrated the unmeasured confounder to the observed covariates by comparing the sensitivity parameters (λ, δ) in the simultaneous sensitivity analysis with the coefficients of the observed covariates obtained from a regression analysis. A problem with this approach is that, as explained in the previous paragraph, the binary unmeasured confounder has a deterministic structure under the worst-case allocation that the bounds correspond to and therefore cannot be compared with any observed covariate with a probability distribution in a meaningful way. This limitation motivates us to develop a new sensitivity analysis framework for matched observational studies, and to leverage it to assess the robustness of our causal conclusion concerning the effect of second-hand smoke exposure on blood lead level in children.

1.4 Our contribution

Our sensitivity analysis framework for matched observational studies departs substantively from the bounds approach in two respects. First, our framework views matched sets as drawn from a population. The hypothesized unmeasured confounder is modelled as a random variable with a probability distribution, instead of as having a deterministic structure as in the worst-case allocation under which the bounds are computed (Gastwirth et al., 1998; Rosenbaum, 2010a). We borrow from the literature on model-based sensitivity analysis methods (Rosenbaum and Rubin (1983), Imbens (2003), Carnegie et al. (2016) and Cinelli and Hazlett (2020); see the on-line supplementary material section A.1 for a brief review) and adapt it to matched observational studies. Extensive simulations demonstrate that, by combining matching and model-based adjustment, our method is robust to assorted model misspecifications and can be more powerful compared with the bounds approach.

Additionally, by endowing the hypothesized unmeasured confounder with a probability distribution in the population, our method enables calibrating the hypothesized unmeasured confounder to the measured confounders in a more meaningful manner than the calibration strategy in Hsu and Small (2013). Ichino et al. (2008) also developed a sensitivity analysis method for matched observational studies with calibration. Our model and calibration approach differ from theirs in that Ichino et al. (2008) assumed that U is independent of the observed covariates conditionally on the treatment and the outcome, whereas we assume that U and the observed covariates are unconditionally independent (and consequently may be conditionally dependent). The conditional independence between U and X that is adopted in Ichino et al. (2008) may not hold in some cases, as is discussed in detail in the on-line supplementary material section A.2 from a causal directed acyclic graph point of view.

More broadly, our calibration method is distinct from many existing calibration methods in that we adjust for the omission of the unmeasured confounder when performing calibration, whereas most existing approaches, with the exception of Hsu and Small (2013) and Cinelli and Hazlett (2020), use the observed statistics of the measured covariates for calibration without adjustment for the omission of the unmeasured confounder. As discussed extensively in Cinelli and Hazlett (2020), this practice has undesirable properties, mainly because the estimates of how the observed covariates are related to the outcome and the treatment assignment are often themselves affected by the omission of the unmeasured confounder U, even when U is assumed to be independent of the observed covariates.

There are two outputs from our sensitivity analysis method. The first output is a calibration plot that contrasts sensitivity parameters with estimated regression coefficients in magnitude. This calibration plot is further supplemented with a table summarizing the variable importance of the hypothesized unmeasured confounder relative to the measured confounders in confounding the outcome and the treatment assignment. We leverage our proposed sensitivity analysis method to assess the robustness of our primary analysis conclusion, i.e. that second-hand smoke exposure has a detrimental effect on children by elevating their blood lead levels. We find that, to explain away the causal conclusion, an unmeasured confounder needs to be associated with blood lead level to a similar extent to that of the poverty income ratio and at the same time be more associated with cotinine level than any of the eight observed covariates. The rest of the paper is organized as follows. In Section 2, we explain the notation and present model specifications. In Section 3, we reframe the problem as a missing data problem and solve it using an expectation–maximization (EM) algorithm. Section 4 discusses the proposed calibration approach and Section 5 presents simulation results. We apply our methodology to do a calibrated sensitivity analysis for the effect of second-hand smoke exposure on a child's blood lead level in Section 6 and conclude with a brief discussion in Section 7. R package sensitivityCalibration implements the methodology and is available through the Comprehensive R Archive Network.

2 Notation and framework

We clarify the notation and describe our model in this section. Suppose that there are I matched sets, i = 1, 2, …, I, each matched for a set of observed covariates X. Within each matched set there are n_i subjects; let N = Σ_{i=1}^{I} n_i be the total number of subjects in the study. In the setting of full matching (Rosenbaum, 2002; Hansen, 2004), each matched set consists of either one treated subject and n_i − 1 controls, or one control and n_i − 1 treated subjects. If n_i = 2 for all i, then we have pair matching.

Let ij denote subject j, j = 1, 2, …, n_i, either treated or control, in matched set i. Let Z_{ij} = 1 if subject ij is treated and Z_{ij} = 0 otherwise. Let X_{ij} denote the observed covariates of subject ij. According to the matched design, X_{ij} = X_{ij′} for subjects j and j′ in the same matched set i. To conduct a sensitivity analysis, we hypothesize the existence of an unmeasured confounder U_{ij} that is associated with each subject ij. Following the potential outcome framework (Neyman, 1990; Rubin, 1974), we let Y_{ij}(1) and Y_{ij}(0) denote the potential outcomes of subject ij under treatment (Z_{ij} = 1) and control (Z_{ij} = 0) respectively. Hence, the observed outcome of subject ij is Y_{ij} = Z_{ij} Y_{ij}(1) + (1 − Z_{ij}) Y_{ij}(0).

Now we describe our model for the distribution of the unmeasured confounder, the treatment assignment and the response. We use an independent Bernoulli random variable to model the residual confounding that is not captured by the observed covariates:

U_{ij} ∼ Bernoulli(p),   independently across subjects ij and independently of X_{ij}.

Here, we use a Bernoulli random variable to represent the unmeasured confounding for several reasons. First, it substantially reduces the computational cost of parameter estimation and is often adopted in the literature (Rosenbaum and Rubin, 1983; Imbens, 2003; Dorie et al., 2016). Second, it facilitates contrasting our method with the bounds approach, where the unmeasured confounder takes values 1 and 0 under the worst-case allocation. Third, although the support of U (binary versus continuous) affects the interpretation of the association between U and the treatment assignment and outcome, later we shall reparameterize the association into a relative importance measure that is scale free. In Section 7 we shall describe how our method could be extended to consider a continuous unmeasured confounder.

We further assume that (X, U) are all the confounders, observed and unobserved, so that, conditionally on (X, U), the treatment assignment is independent of the potential outcomes, i.e.

Z_{ij} ⊥ {Y_{ij}(1), Y_{ij}(0)} | (X_{ij}, U_{ij}).

We assume that the treatment follows a logistic regression model,

logit P(Z_{ij} = 1 | X_{ij}, U_{ij}) = κ(X_{ij}) + λU_{ij},

and the response follows a normal model,

Y_{ij} | X_{ij}, U_{ij}, Z_{ij} ∼ N{ψ(X_{ij}) + βZ_{ij} + δU_{ij}, σ²}.

According to this formulation, β denotes a constant and additive treatment effect, and (λ, δ) are sensitivity parameters that control how strongly the unobserved covariate U is associated with the treatment assignment and the response. In our model, p is also a sensitivity parameter.

To prepare for calibration, we estimate not only the treatment effect, but also the effect of each observed covariate on the treatment assignment and the response, i.e. the coefficients of the observed covariates in the outcome regression model and the propensity score model. To estimate ψ robustly, we leverage the matched design and write ψ(X_{ij}) = a_i + ψ^T X_{ij}, where ψ^T X_{ij} is a linear combination of X_{ij} and a_i is a fixed effect that is specific to matched set i (Rubin, 1979). Our model and inference are more robust than assuming that ψ(X_{ij}) is linear in X_{ij}. For simplicity, we assume that κ(·) is linear in X_{ij} here, but we may also include a fixed effect for each matched set when estimating κ(·).

The entire data-generating process is summarized as follows:
(2)
U_{ij} ∼ Bernoulli(p),  independently of X_{ij},
logit P(Z_{ij} = 1 | X_{ij}, U_{ij}) = κ(X_{ij}) + λU_{ij},
Y_{ij} | X_{ij}, U_{ij}, Z_{ij} ∼ N{a_i + ψ^T X_{ij} + βZ_{ij} + δU_{ij}, σ²}.

To recap, we want to estimate κ, ψ and the treatment effect β for fixed sensitivity parameters (p, λ, δ), and to compare the hypothesized unmeasured confounder U with the observed covariates in a meaningful way. In the on-line supplementary material section C.1, we discuss how to extend the model to accommodate treatment effect heterogeneity further.
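
A minimal simulation sketch of the data-generating process (2) may help fix ideas; the single covariate, the linear forms chosen for κ(·) and ψ(·) and all numerical values below are arbitrary illustrations rather than settings used in this paper.

```r
# Simulate data from model (2): U ~ Bernoulli(p) independent of X,
# logit P(Z = 1 | X, U) = kappa(X) + lambda * U,
# Y | X, U, Z ~ N(psi(X) + beta * Z + delta * U, sigma^2).
set.seed(1)
I <- 200; n_i <- 2                                   # matched sets and subjects per set
p <- 0.5; lambda <- 1; delta <- 1                    # sensitivity parameters
beta <- 2; sigma <- 1.5                              # treatment effect and error SD

set_id <- rep(seq_len(I), each = n_i)
X  <- rnorm(I * n_i, mean = rep(rnorm(I), each = n_i), sd = 0.1)  # similar X within a set
U  <- rbinom(I * n_i, 1, p)                                       # unmeasured confounder
Z  <- rbinom(I * n_i, 1, plogis(0.5 * X + lambda * U))            # kappa(X) = 0.5 * X
Y  <- 1.0 * X + beta * Z + delta * U + rnorm(I * n_i, sd = sigma) # psi(X) = X
dat <- data.frame(set = set_id, X, U, Z, Y)
```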

3 Estimating model parameters via an expectation–maximization algorithm

We consider a similar model to that in Rosenbaum and Rubin (1983) and Imbens (2003); however, we allow for fixed effects for matched sets. There are only a handful of parameters in the model that was considered in Rosenbaum and Rubin (1983) and Imbens (2003), and these parameters can be easily estimated via maximum likelihood. In our model, to leverage the matched design, we allow for distinct fixed effects for each matched set and the number of parameters is of the same order as the number of matched sets. It is not unusual to have thousands of matched sets or pairs in a matched observational study and directly optimizing the observed data likelihood would be difficult. The key observation here is that the sensitivity analysis problem can be reframed as a missing data problem (Ichino et al., 2008). Specifically, we treat the unmeasured confounder U as a missing covariate and find the maximum likelihood estimate efficiently via an EM algorithm, and then use the block bootstrap (Abadie and Spiess, 2019) to construct the confidence interval. Below, we briefly describe the E-step and M-step in our context. A brief recap of the EM algorithm can be found in the on-line supplementary material section B.1 and more details of the E-step and M-step can be found in supplementary material section B.2.

Let x_l = (x_obs,l, x_mis,l) denote the complete covariate data for subject l, l = 1, 2, …, N, where x_mis,l = u_l, so that x_l = (x_obs,l, u_l). Let a = (a_1, …, a_I) denote the fixed effects for the I matched sets, γ = (a, κ, ψ, σ, β) the set of parameters of interest and l(γ | x_l, z_l, y_l) the log-likelihood for the lth observation. The conditional expectation of the complete-data log-likelihood for subject l is

Q_l(γ | γ^(s)) = Σ_{u ∈ {0, 1}} P(u_l = u | x_obs,l, z_l, y_l, γ^(s)) l(γ | x_obs,l, u, z_l, y_l),

where γ^(s) denotes the current model parameters and the summation is over all possible realizations of the missing covariate U. For a fixed set of sensitivity parameters (p, λ, δ), the M-step reduces to finding γ = (a, κ, ψ, σ, β) that maximizes

Q(γ | γ^(s)) = Σ_{l=1}^{N} Q_l(γ | γ^(s)) = Q^(1)(a, ψ, β, σ) + Q^(2)(κ),

where Q^(1) collects the terms of the complete-data log-likelihood that involve the normal outcome model and Q^(2) collects the terms that involve the logistic treatment model. To maximize Q, it suffices to maximize Q^(1) and Q^(2) separately, which reduces to finding the maximum likelihood estimates for a weighted regression with matched-set fixed effects and a weighted logistic regression, and can easily be implemented in commonly used statistical software.
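
The E- and M-steps can be written compactly in base R. The sketch below is an illustration of the algorithm applied to the simulated data frame dat from the sketch in Section 2, not the implementation in the sensitivityCalibration package; every subject is duplicated with U = 0 and U = 1, the E-step weight is the conditional probability of each value of U and the M-step is a weighted least squares fit with matched-set fixed effects together with a weighted logistic regression, with δ and λ entering as fixed offsets.

```r
# One possible EM implementation for fixed sensitivity parameters (p, lambda, delta).
em_fit <- function(dat, p, lambda, delta, n_iter = 50) {
  n   <- nrow(dat)
  aug <- rbind(transform(dat, U = 0), transform(dat, U = 1))  # duplicate each subject
  id  <- rep(seq_len(n), 2)                                   # subject index for both copies
  w   <- ifelse(aug$U == 1, p, 1 - p)                         # initial weights from the prior on U

  for (s in seq_len(n_iter)) {
    # M-step: weighted regression with matched-set fixed effects and a weighted
    # logistic regression; delta and lambda are fixed, so they enter as offsets.
    out <- lm(Y ~ X + Z + factor(set) + offset(delta * U), data = aug, weights = w)
    trt <- glm(Z ~ X + offset(lambda * U), family = quasibinomial, data = aug, weights = w)
    sig <- sqrt(sum(w * residuals(out)^2) / n)

    # E-step: w(u) is proportional to P(U = u) * P(Z | X, U) * f(Y | X, U, Z).
    mu  <- predict(out, newdata = aug)                        # includes the delta * U offset
    eta <- predict(trt, newdata = aug)                        # link scale, includes lambda * U
    w   <- ifelse(aug$U == 1, p, 1 - p) *
           dbinom(aug$Z, 1, plogis(eta)) * dnorm(aug$Y, mu, sig)
    w   <- w / ave(w, id, FUN = sum)                          # normalize within each subject
  }
  list(beta = unname(coef(out)["Z"]), outcome = out, treatment = trt, sigma = sig)
}
```

A block bootstrap over matched sets, refitting em_fit on each resample, would then give a confidence interval for β.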

4 Calibrating the unmeasured confounder to observed covariates

The parameters in the bounds approach are interpreted in terms of the odds ratio. For instance, in the simultaneous sensitivity analysis of Gastwirth et al. (1998), two subjects matched for observed covariates X differ in their odds of receiving treatment by at most a factor of exp (λ). The magnitude of the sensitivity parameter λ indicates the strength of the hypothesized unmeasured confounder in confounding the treatment assignment; however, an audience without much training in interpreting sensitivity analysis results may still be perplexed about what λ means for their specific problem. Calibration aims to remedy this by comparing the hypothesized unmeasured confounder with the observed covariates in a meaningful way. We ask two interconnected questions. First, if a binary unmeasured confounder had existed, how big an effect, compared with the observed covariates, would it have to have on the treatment assignment and the response to affect materially the qualitative conclusion that we draw from the observational data? Second, what can we say about the importance of this unmeasured confounder relative to that of the other observed covariates?

We leverage our proposed sensitivity analysis model to answer these two questions. For a specified parameter p, we first identify the boundary that separates (λ, δ) pairs that render the treatment effect significant at the 0.05-level from those that do not. In practice, for each fixed λ, we do a binary search to find the largest δ that renders the treatment effect significant. For a specific (λ, δ) pair, we record the coefficient estimates of observed covariates and contrast them with (λ, δ) on the same graph. To make the comparison between the coefficient estimates and the sensitivity parameters meaningful, we transform each observed covariate to the same scale as the hypothesized unmeasured confounder. Here, we follow the suggestion in Gelman (2008) and standardize all non-binary covariates to mean 0 and standard deviation (SD) 0.5 and leave dichotomized covariates intact. According to this standardization scheme, the coefficients of dichotomized variables can be interpreted directly and those of continuous or ordinal variables can be interpreted as the effect of a 2-SD increase in the covariate value, which roughly corresponds to flipping a binary variable from 0 to 1. This calibration plot enables empirical researchers to interpret the sensitivity parameters in a matched observational study better by helping them to draw comparisons with the more tangible observed covariates. When we implement the calibration method in the R package sensitivityCalibration, we make an interactive calibration plot where, for a fixed p, the (λ, δ) pair moves along the boundary and the coefficients of the observed covariates adjust themselves accordingly.
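
A sketch of the boundary search follows. The helper significant_at_05() is hypothetical: in practice it would refit the model at (p, λ, δ) (for example with a routine such as em_fit above plus a block bootstrap) and return whether the 95% confidence interval for β excludes 0; the bisection assumes, as the binary search in the text does, that significance is monotone in δ for fixed λ.

```r
# For fixed p and lambda, find (approximately) the largest delta at which the
# treatment effect is still significant at the 0.05 level.
largest_delta <- function(p, lambda, significant_at_05, lower = 0, upper = 5, tol = 0.01) {
  if (!significant_at_05(p, lambda, lower)) return(NA)   # insensitive even at delta = 0
  while (upper - lower > tol) {
    mid <- (lower + upper) / 2
    if (significant_at_05(p, lambda, mid)) lower <- mid else upper <- mid
  }
  lower
}

# Gelman (2008) rescaling applied to covariates before plotting their
# coefficients alongside (lambda, delta): non-binary covariates are centred
# and scaled to SD 0.5; binary covariates are left intact.
scale_gelman <- function(x) {
  if (length(unique(x)) > 2) (x - mean(x)) / (2 * sd(x)) else x
}
```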

Comparing the sensitivity parameters with the coefficients of observed covariates sheds some light on interpreting the magnitude of the sensitivity parameters; however, a naive comparison of sensitivity parameters with regression coefficients may not suffice for two reasons. First, although U is assumed to be independent of all observed covariates X, any two observed covariates Xi and Xj may be correlated, and the magnitudes of their regression coefficients |βi| and |βj| are not necessarily informative about how important Xi and Xj are in confounding the treatment and outcome. Second, sometimes researchers would like to calibrate the unmeasured confounder to a group of two or more observed covariates, say ethnicity and poverty income ratio. It is not clear how to achieve this by simply focusing on regression coefficients’ magnitudes.

To tackle these challenges, we propose to supplement this naive comparison with an assessment of the relative importance of the hypothesized unmeasured confounder compared with the observed covariates. Many relative importance measures have been developed over the years; see Kruskal and Majors (1989) for a taxonomy of these measures. A ‘well-known, venerable collection’ of measures (Kruskal and Majors, 1989) often expresses the relative importance of covariates in terms of variance reduction, i.e. how much of the variance of the response is accounted for by each covariate. Specifically, in the context of multiple regression, Pratt (1987) derived a unique measure of variable importance based on symmetry and invariance principles under linear transformation. This measure assigns to each variable an importance that can be interpreted as the average reduction in variance when that variable is held fixed. More recently, Azen and Budescu (2003) proposed so-called dominance analysis: a method based on examining R²-values across all subset models.

Our sensitivity analysis framework works seamlessly with such relative importance measures and enables researchers to speak of the importance of the hypothesized unmeasured confounder relative to that of the observed covariates. For each sensitivity parameter trio (p, λ, δ), we run the EM algorithm until it converges; at convergence we have a data set in which each original observation is augmented into two: one with the binary unmeasured confounder U = 0 and the other with U = 1. Each of them is associated with a weight, which is the probability that U = 1 or U = 0 given the data and the parameter values at convergence. We expand this weighted data set to obtain a full data set that contains all the observed covariates as well as the binary unmeasured confounder U, and we use this full data set to assess the relative importance of the covariates. It is also straightforward to assess the relative importance of a group of covariates under our framework: Pratt's method has the property that the relative importance of several variables is the sum of their individual relative importances. In contrast, dominance analysis handles a group of covariates by adding the entire group to each subset model and computing the average increase in R². Note that the regression coefficients of the observed covariates, as well as their relative importance measures, are all computed with adjustment for the unmeasured confounder. As pointed out in Section 1.4, this is an important distinction between our proposal and many existing calibration approaches in the literature. We leverage the proposed calibration procedures to study in detail the causal relationship between second-hand smoke exposure and blood lead levels in Section 6.
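
For the outcome model, Pratt's measure is simple to compute on the expanded data set described above; the sketch below assumes a numeric outcome y and a covariate matrix or data frame X that already contains the observed covariates and the imputed U, and the function name is illustrative. (The dominance analysis and its generalized version for the logistic propensity score model are not sketched here.)

```r
# Pratt's (1987) importance: the share of R^2 attributed to each covariate,
# computed as the standardized coefficient times the marginal correlation.
pratt_importance <- function(y, X) {
  Xs <- scale(as.matrix(X))            # standardize the covariates (including U)
  ys <- as.vector(scale(y))            # standardize the outcome
  b  <- coef(lm(ys ~ Xs - 1))          # standardized regression coefficients
  r  <- drop(cor(Xs, ys))              # marginal correlations with the outcome
  d  <- b * r                          # Pratt's measure; the d's sum to R^2
  d / sum(d)                           # relative importance (shares of R^2)
}
```

Because Pratt's measure is additive, the importance of a group of covariates (say ethnicity and the poverty income ratio) is simply the sum of the corresponding shares.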

5 Simulation study

In a simulation study, we examined the confidence interval coverage of our approach and how our approach compares in power with the bounds approach under the simultaneous sensitivity model of Gastwirth et al. (1998). The power of a sensitivity analysis is defined as the probability that it correctly rejects the null hypothesis in a favourable situation where there is a genuine treatment effect and no unmeasured confounding (Rosenbaum, 2010b). If there is no bias and a genuine treatment effect, then we would hope to reject the null hypothesis of no effect and the power of a sensitivity analysis is the chance that our hope will be realized. The power of a sensitivity analysis guides the choice of methods of analysis for a fixed research design (Hansen et al., 2014).

In Section 5.1, we generate data according to model (2) described in Section 2 with κ(X) and ψ(X) both linear in X, a real treatment effect and no unmeasured confounding (corresponding to (λ, δ) = (0, 0)). We verify that our confidence interval has correct coverage and demonstrate that our method is more powerful than the simultaneous sensitivity analysis in Gastwirth et al. (1998). In Section 5.2, we generate the response from a non-linear model and verify that matching combined with regression adjustment within matched sets makes our method robust against model misspecification. In Section 5.3, we perform additional simulations with different effect sizes and error structures, and consider more realistic situations where the matching is not exact. In all of these scenarios, we demonstrate that our framework is robust against various model misspecifications, including a non-normal error structure, a non-linear response model and non-exact matching, in the sense that the coverage of our confidence intervals is always approximately equal to the nominal coverage.

5.1 Linear response model

We generate 100 strata with 10 subjects in each stratum. Subjects in the same stratum have the same observed covariates consisting of seven measurements, each independently and normally distributed with means (3, 1, 5, 2, 6, 4, 5) and SDs (1, 0.15, 1.5, 0.2, 1, 0.8, 1). Each subject is further associated with an unmeasured confounder U such that U | X ∼ Bernoulli(0.5). We simulate a favourable situation in which there is a real treatment effect and no unmeasured confounding by letting β = 2, σ = 1.5 and λ = δ = 0, and setting both the treatment assignment model and the response model to be linear in X.
We perform a full matching and then estimate the treatment effect β for various sensitivity parameters. We repeat the simulation 500 times and use 500 bootstrap samples to construct the confidence interval each time. With 100 strata and 10 subjects in each stratum, 94.4% of the 95% confidence intervals cover the true parameter β = 2 when (λ, δ) are set at the truth of 0. We also compare the power of our sensitivity analysis with that of the simultaneous sensitivity analysis of Gastwirth et al. (1998). Table 3 summarizes the power of both approaches for various (λ, δ) values in this case. Our approach is much more powerful: when (λ, δ) = (2.5, 2.5), our approach correctly rejects the null hypothesis 95.5% of the time, whereas the bounds approach for the simultaneous sensitivity analysis correctly rejects the null hypothesis only 19.8% of the time. We repeat the simulation with a larger sample size (200 strata with 20 subjects in each stratum) and the same qualitative results hold. Next, we replace the constant additive treatment effect β = 2 with a heterogeneous treatment effect β_i ∼ N(2, 1). We compute the level of our approach by calculating how many 95% confidence intervals capture E[β_i] = 2, and the power by calculating how many 95% confidence intervals do not contain 0. Table 3 summarizes the results in this scenario and again similar qualitative results hold.
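
For concreteness, the covariate generation just described can be reproduced as follows; the particular linear coefficients of the treatment and response models below are arbitrary illustrative choices, since the exact values used in the simulation are not reproduced here.

```r
# 100 strata of 10 subjects each, identical covariates within a stratum.
set.seed(2)
mu_x <- c(3, 1, 5, 2, 6, 4, 5)
sd_x <- c(1, 0.15, 1.5, 0.2, 1, 0.8, 1)
X_strata <- mapply(rnorm, n = 100, mean = mu_x, sd = sd_x)      # 100 x 7 stratum-level covariates
X <- X_strata[rep(1:100, each = 10), ]                          # replicate within strata
U <- rbinom(nrow(X), 1, 0.5)                                    # U | X ~ Bernoulli(0.5)

beta <- 2; sigma <- 1.5; lambda <- 0; delta <- 0                # favourable situation
theta <- rep(0.2, 7)                                            # illustrative linear coefficients
Z <- rbinom(nrow(X), 1, plogis(drop(X %*% theta) - 4 + lambda * U))
Y <- drop(X %*% theta) + delta * U + beta * Z + rnorm(nrow(X), sd = sigma)
```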
Table 3

Comparison of power of sensitivity analysis

(λ, δ)      | Linear model, n = 1000 | Linear model, n = 4000 | Heterogeneous model, n = 1000 | Non-linear model, n = 1000
            | FM + RA | SSA          | FM + RA | SSA          | FM + RA | SSA                 | FM + RA | SSA
(0, 0)      | 1.000   | 1.000        | 1.000   | 1.000        | 1.000   | 1.000               | 1.000   | 1.000
(1, 1)      | 1.000   | 1.000        | 1.000   | 1.000        | 1.000   | 1.000               | 1.000   | 1.000
(1.5, 1.5)  | 1.000   | 1.000        | 1.000   | 1.000        | 1.000   | 1.000               | 1.000   | 1.000
(2, 2)      | 1.000   | 0.998        | 1.000   | 1.000        | 1.000   | 0.840               | 1.000   | 0.824
(2.5, 2.5)  | 0.954   | 0.198        | 1.000   | 0.672        | 0.952   | 0.012               | 0.928   | 0.076
(3, 3)      | 0.000   | 0.000        | 0.000   | 0.000        | 0.200   | 0.000               | 0.000   | 0.000

FM + RA, full matching with regression adjustment (the proposed method); SSA, simultaneous sensitivity analysis of Gastwirth et al. (1998). From left to right: linear model with a constant additive treatment effect and sample size 1000; linear model with a constant additive treatment effect and sample size 4000; linear model with an additive but heterogeneous treatment effect and sample size 1000; non-linear model with a constant additive treatment effect and sample size 1000. Levels of our method are 5.6%, 4.5%, 4.2% and 5.4% respectively, when (λ, δ) = (0, 0) are set as the truth and the nominal level is 5%.

5.2 Non-linear response model

One advantage of matching is that it renders the model more robust against misspecification. To demonstrate this, we simulate data sets according to the same propensity score model but a non-linear response model as follows:

It would be difficult, if not impossible, to specify this response model correctly; however, because we perform matching and then regression adjustment within each matched set, the procedure is robust to this non-linear response model. In fact, 94.6% of our 500 experiments have 95% confidence intervals covering the true treatment effect β = 2. Moreover, a comparison of the power of the sensitivity analyses again demonstrates that our model has superior power compared with the simultaneous sensitivity analysis, as shown in Table 3.

5.3 Different effect sizes, error structures and non-exact matching

In this subsection, we report additional simulation results for various effect sizes, error structures and non-exact matching. We first consider the same linear and non-linear response models as in Sections 5.1 and 5.2 but with a smaller effect size, β = 1. With the linear response model, 93.2% of the confidence intervals cover the true parameter β = 1 in our simulations, and the confidence intervals have almost exact coverage for the non-linear response model, with 94% of the 95% confidence intervals covering the true parameter. Table 4 further summarizes the power of our sensitivity analysis in this setting.

Table 4

Power of the sensitivity analysis

(λ, δ)      | Effect size = 2/3                     | Non-normal error
            | Linear response | Non-linear response | Double exponential | Student's t (df = 2)
(0, 0)      | 1.000           | 1.000               | 1.000              | 0.960
(0.5, 0.5)  | 1.000           | 1.000               | 1.000              | 0.954
(0.8, 0.8)  | 1.000           | 1.000               | 1.000              | 0.932
(1, 1)      | 1.000           | 1.000               | 1.000              | 0.914
(1.5, 1.5)  | 0.954           | 0.964               | 0.944              | 0.626
(2, 2)      | 0.008           | 0.014               | 0.008              | 0.028

The second and third columns are the linear and non-linear response models with effect size 2/3 and sample size 1000; levels are 6.0% and 5.0% respectively when (λ, δ) = (0, 0) are set at the truth. The fourth and fifth columns are the linear response model with non-normal error; levels are 4.6% and 6.2% when the error follows a double-exponential distribution and a Student t-distribution (df = 2) respectively, when (λ, δ) = (0, 0) are set as the truth and the nominal level is 5%.

Next, instead of assuming a normal error, we consider two different error structures: a Student t-distribution with degrees of freedom df=2 and a Laplace (double-exponential) distribution with rate 1.5. We adopt a linear response model with an additive treatment effect β = 1, no unmeasured confounding (λ = δ = 0) and sample size equal to 1000. 95.4% and 93.8% of confidence intervals capture the truth β = 1 when ε takes on a Student t-distribution with df=2 and a Laplace distribution with rate 1.5 respectively. Again, the results demonstrate that our method is robust against misspecified error structures. The power of our sensitivity analysis is further summarized in Table 4.

In all of the aforementioned simulations, we made matching easy by simulating multiple subjects with exactly the same observed covariates, which is rarely so in practice. To mimic a real data set better, we add uniform[−0.2, 0.2], uniform[−0.5, 0.5] or uniform[−1.0, 1.0] noise to each observed covariate so that subjects in the same matched set are similar, but no longer identical, in their observed covariates. We run the experiment with the same linear response model as specified before with β = 1 and σ = 1.5. When we perform the matching, subjects with different observed covariates, sometimes from different strata, are put in the same matched set. As a consequence, hypothesis testing that assumes a homogeneous, additive treatment effect without adjustment for covariates may yield invalid inferences. However, our method adjusts for covariates and produces valid inferences if the model for the covariates is correctly specified. Our confidence intervals covered the true parameter 92.4%, 95.6% and 94.4% of the time for the uniform[−0.2, 0.2], uniform[−0.5, 0.5] and uniform[−1, 1] noise respectively. Table 5 summarizes the power of our sensitivity analysis. Our method is still quite powerful: it correctly rejects the null hypothesis in a favourable situation 94.4% of the time when (λ, δ) = (1.5, 1.5), whereas the simultaneous sensitivity analysis rejects only 62.6% of the time.

Table 5

Power of the sensitivity analysis: linear response model with non-exact matching

(λ, δ)      | uniform[−0.2, 0.2] | uniform[−0.5, 0.5] | uniform[−1, 1]
(0, 0)      | 1.000              | 1.000              | 1.000
(0.5, 0.5)  | 1.000              | 1.000              | 1.000
(0.8, 0.8)  | 1.000              | 1.000              | 1.000
(1, 1)      | 1.000              | 1.000              | 1.000
(1.5, 1.5)  | 0.944              | 0.878              | 0.772
(2, 2)      | 0.008              | 0.008              | 0.008

The levels are 7.6%, 4.4% and 5.6% for the different degrees of inexact matching, when (λ, δ) = (0, 0) are set as the truth and the nominal level is 5%.

6 Application: second-hand smoking and blood lead levels in children

6.1 Robustness of the causal relationship between second-hand smoking and blood lead levels in children

We now leverage our proposed sensitivity analysis method to examine how sensitive the treatment effect is to unmeasured confounding in the second-hand smoke exposure and blood lead levels study that was described in Section 1.1. For a fixed sensitivity parameter trio (p, λ, δ), we maximized the observed data likelihood of model (2) as described in Section 3 after conducting a full match. Model (2) assumes that the error term in the blood lead level model is normally distributed; the simulations in Section 5.3 suggest that matching combined with regression adjustment is quite robust to non-normal errors. We first tabulate the 95% confidence intervals of the treatment effect for various combinations of sensitivity parameters (p, λ, δ) in Table 6. Additional results for more (p, λ, δ) combinations are presented in the on-line supplementary material section C.2. Note that (λ, δ) = (0, 0) corresponds to no unmeasured confounding and we would expect approximately the same confidence intervals regardless of p; this is indeed so. We compare our method with the bounds approach of Gastwirth et al. (1998). The last column in Table 6 summarizes the worst-case p-values for testing no treatment effect with various (λ, δ) pairs. When (λ, δ) = (1.0, 1.0), the 95% confidence interval does not cover 0 for p = 0.1, p = 0.3 or p = 0.5; however, the bounds approach of Gastwirth et al. (1998) yields a p-value of 0.488. The null hypothesis cannot be rejected according to the bounds approach, yet an unmeasured confounder of this magnitude cannot explain away the treatment effect under our sensitivity analysis model.

Table 6

Treatment effect versus different (λ, δ) pairs

(λ, δ)      | 95% confidence interval for p = 0.5 | 95% confidence interval for p = 0.3 | 95% confidence interval for p = 0.1 | p-value
(0, 0)      | (0.397, 0.808)                      | (0.407, 0.823)                      | (0.415, 0.825)                      | 0.000
(0.5, 0.5)  | (0.349, 0.750)                      | (0.345, 0.766)                      | (0.380, 0.771)                      | 0.000
(0.8, 0.8)  | (0.243, 0.673)                      | (0.275, 0.677)                      | (0.341, 0.749)                      | 0.000
(1.0, 1.0)  | (0.144, 0.572)                      | (0.180, 0.594)                      | (0.317, 0.711)                      | 0.488
(1.2, 1.2)  | (0.0448, 0.455)                     | (0.106, 0.507)                      | (0.259, 0.676)                      | 0.999
(1.5, 1.5)  | (−0.142, 0.276)                     | (−0.0670, 0.323)                    | (0.212, 0.602)                      | 1.000
(2.0, 2.0)  | (−0.551, −0.141)                    | (−0.442, −0.0501)                   | (0.0491, 0.474)                     | 1.000

For a fixed (λ, δ) pair, a public health researcher may be interested in plotting the 95% confidence interval against all values of p between 0 and 1 and reporting the inference for various p; see Fig. 1 for such a plot for various (λ, δ) combinations. Fig. 1 suggests that the choice of p = 0.5 yields approximately the most conservative treatment effect for this study. Intuitively, when p is close to 0 or 1, subjects with different treatment status in the same matched set tend to have the same value of U, which makes it difficult to attribute the observed treatment effect to the disagreement in U; therefore, we would expect p taking a value somewhere between the two ends to yield the most conservative inference. Instead of reporting Fig. 1, empirical researchers may simply report a summary statistic. For a fixed (λ, δ) pair, let CI_p = [L_p, U_p] represent the level α confidence interval corresponding to p. One choice is to report the most conservative CI_p, i.e. CI_{p*}, where p* = argmin_{p∈[0,1]} L_p. One may also report ∪_{p∈[0,1]} CI_p so that the confidence interval has correct (though possibly conservative) coverage for all realizations of p.
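
The two summaries just described are easy to compute once CI_p has been evaluated on a grid of p values; lower and upper below are hypothetical vectors of interval end points on p_grid for a fixed (λ, δ) pair.

```r
# Most conservative interval CI_{p*} (p* minimizes the lower end point) and the
# union of the CI_p over the grid, summarized by the range of its end points.
summarize_over_p <- function(p_grid, lower, upper) {
  i_star <- which.min(lower)
  list(most_conservative = c(p = p_grid[i_star], L = lower[i_star], U = upper[i_star]),
       union_over_p      = c(L = min(lower), U = max(upper)))
}
```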

Fig. 1. 95% confidence interval versus p for (λ, δ) pairs (a) (1, 1) and (b) (1.5, 1.5)

Fig. 2. Calibration plot with (p, λ, δ) = (0.5, 1, 1.99): the labels and the covariate information that they encode are POVERTY, poverty income ratio; AGE, age at the time of interview; SEX, male or female; HOUSE, whether the house was built before or after 1974; ROOM, number of rooms in the house; FAMILY, size of family; ETHNICITY, white or non-white; and EDUCATION, years of education of the reference adult

A public health researcher studying this particular data set could stop here and report the sensitivity analysis results as discussed above. However, the subject matter expert, as well as any audience of the report, may find these results difficult to interpret. The absolute scale of (λ, δ) sheds little light on the actual interpretation of this hypothesized unmeasured confounder: the causal conclusion is still significant when (λ, δ) = (1.2, 1.2) for p = 0.5 and p = 0.3, but no longer so when (λ, δ) = (1.5, 1.5). Is the hypothesized unmeasured confounding indeed something worth worrying about? Does this degree of sensitivity to unmeasured confounding cast doubt on the findings obtained under the assumption of no unmeasured confounding, or does it reinforce them? To help subject matter experts interpret the sensitivity parameters in their specific context, we compare the hypothesized unmeasured confounder with the observed covariates below.

6.2 Calibration to observed covariates

Fig. 2 displays our calibration plot when p = 0.5. The marker represents one (λ, δ) pair on the boundary and, for this specific sensitivity parameter pair (λ, δ) = (1, 1.99), we estimate the coefficients of the observed covariates (after they are standardized to mean 0 and standard deviation 0.5 where appropriate); these are shown on the same plot as labelled dots. A subject matter expert can read Fig. 2 as follows: for the binary unmeasured confounder U to explain away the treatment effect, it may be associated with sensitivity parameters (λ, δ) = (1, 1.99). A (λ, δ) pair as large as (1, 1.99) says that flipping the unmeasured confounder from 0 to 1 corresponds to a one-unit increase in the propensity score (on the logit scale) and an increase of almost 2 μg dL−1 in the blood lead level, holding everything else fixed. For comparison, a 2-SD increase in the poverty income ratio roughly corresponds to a 1.2-unit increase in the propensity score and an increase of 1 μg dL−1 in the blood lead level, holding everything else fixed. We implement an interactive calibration plot in our R package sensitivityCalibration, and section C.3 of the on-line supplementary material contains four snapshots of this animated plot with different (λ, δ) pairs on the boundary. The estimated coefficients of the observed covariates are relatively insensitive to the change in (λ, δ).
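The 2-SD standardization mentioned above follows Gelman (2008): continuous covariates are centred and divided by twice their standard deviation, so each coefficient corresponds to a 2-SD change and is comparable with the coefficient of a binary covariate. The R sketch below shows how such standardized coefficients could be obtained; the simulated data frame and variable names are placeholders rather than the objects used in our analysis.

    # 2-SD standardization (Gelman, 2008): mean 0, standard deviation 0.5
    standardize_2sd <- function(x) (x - mean(x)) / (2 * sd(x))

    set.seed(0)
    dat <- data.frame(lead    = rexp(200),                      # placeholder outcome
                      treat   = rbinom(200, 1, 0.5),            # placeholder treatment
                      poverty = runif(200, 0, 5),
                      age     = sample(4:16, 200, replace = TRUE))
    dat$poverty_std <- standardize_2sd(dat$poverty)
    dat$age_std     <- standardize_2sd(dat$age)

    # Coefficients on this scale are what would be plotted as the labelled dots,
    # alongside the (lambda, delta) pair for the hypothesized confounder U
    coef(lm(lead ~ treat + poverty_std + age_std, data = dat))
    coef(glm(treat ~ poverty_std + age_std, family = binomial, data = dat))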

As discussed in Section 4, the magnitudes of the regression coefficients, even after standardization, do not necessarily indicate their importance. We therefore adopt two useful and interpretable metrics to describe the importance of the hypothesized unmeasured confounder relative to that of the observed covariates. In the outcome regression model, we assess relative variable importance by using both Pratt's method (Pratt, 1987) and the dominance analysis method (Azen and Budescu, 2003). The results are summarized in Table 7. In Pratt's method, the contribution of each covariate is the share of the total R2 that the covariate accounts for, whereas, in the dominance analysis, the contribution of each covariate is the average increase in R2 that it produces across all subset models. Although the two analyses have different motivations, they produce the same ordering of covariates here. With δ = 2, the hypothesized binary unmeasured confounder U is almost comparable with the observed covariate POVERTY (the poverty income ratio) in confounding the blood lead level. Similarly, we assess the relative importance of U in confounding the propensity score under the logistic regression model via the generalized dominance analysis. With λ = 1, the hypothesized unmeasured confounder has higher relative importance than any observed covariate in confounding the treatment assignment. This supplementary information helps subject matter experts judge for themselves how robust the causal conclusion is.
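To make the two importance measures concrete, the following R sketch computes Pratt's contributions and a dominance measure defined, as described above, by the average increase in R2 across all subset models. It is illustrative only, uses simulated data, and does not reproduce Table 7.

    set.seed(1)
    n <- 500
    X <- data.frame(age = rnorm(n), poverty = rnorm(n), u = rbinom(n, 1, 0.5))
    y <- 0.5 * X$age + 0.4 * X$poverty + 0.6 * X$u + rnorm(n)

    # Pratt's method: standardized coefficient times marginal correlation with
    # the outcome; the contributions sum to the model R-squared
    Xs <- scale(X)
    ys <- as.numeric(scale(y))
    b  <- coef(lm(ys ~ Xs))[-1]
    r  <- as.numeric(cor(Xs, ys))
    pratt <- as.numeric(b) * r
    names(pratt) <- names(X)

    # Dominance analysis: average increase in R-squared when a covariate is
    # added, averaged over all subsets of the remaining covariates
    r2 <- function(vars) if (length(vars) == 0) 0 else
      summary(lm(y ~ ., data = X[, vars, drop = FALSE]))$r.squared
    vars <- names(X)
    dominance <- sapply(vars, function(v) {
      others  <- setdiff(vars, v)
      subsets <- c(list(character(0)),
                   unlist(lapply(seq_along(others),
                                 function(k) combn(others, k, simplify = FALSE)),
                          recursive = FALSE))
      mean(sapply(subsets, function(s) r2(c(s, v)) - r2(s)))
    })
    rbind(pratt = pratt, dominance = dominance)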

Table 7

Assessing the relative importance of observed confounders and the hypothesized unmeasured confounder

Outcome regression (δ = 2)                                Propensity score (λ = 1),
Pratt's method              Dominance analysis            generalized dominance analysis
Covariate    Contribution   Covariate    Contribution     Covariate    Contribution
AGE          0.044          AGE          0.043            U            0.045
POVERTY      0.040          POVERTY      0.035            POVERTY      0.024
U            0.032          U            0.031            HOUSE        0.005
ETHNICITY    0.023          ETHNICITY    0.020            ETHNICITY    0.004
FAMILY       0.018          FAMILY       0.019            FAMILY       0.003
HOUSE        0.018          HOUSE        0.017            ROOM         0.003
SEX          0.013          SEX          0.013            AGE          0.003
COP          0.010          COP          0.012            EDUCATION    0.001
EDUCATION    0.007          EDUCATION    0.011            SEX          0.000
ROOM         −0.003         ROOM         0.002

POVERTY, poverty income ratio; AGE, age at the time of interview; SEX, male or female; HOUSE, whether the house was built before or after 1974; ROOM, number of rooms in the house; FAMILY, size of family; ETHNICITY, white or non-white; EDUCATION, years of education of the reference adult; COP, treatment status.

We can now make causal statements regarding the detrimental effect of second-hand smoke on a child's blood lead level as follows. Under the assumption of treatment ignorability, exposure to second-hand smoke causes a significant increase in blood lead levels in children, with a 95% confidence interval of (0.397, 0.808) μg dL−1. To explain away this causal conclusion, the residual confounding, represented by an independent binary unmeasured confounder, would need to be associated with the blood lead level to a similar extent as the poverty income ratio and, at the same time, be more strongly associated with the cotinine level than any of the eight observed covariates.

7 Discussion

In this paper, we investigate the detrimental effect of second-hand smoke on blood lead levels among children by using observational data. To assess the robustness of our causal conclusion better, we develop a new sensitivity analysis framework for matched observational studies that allows calibration, i.e. the hypothesized unmeasured confounder can be meaningfully compared with the observed covariates. Our sensitivity analysis has two outputs. The first output is a calibration plot. For a fixed sensitivity parameter p, we identify (λ, δ) pairs such that the estimated treatment effect is significant at the 0.05 level and, for each (λ, δ) pair on the boundary, we further estimate the coefficients of the observed covariates by regression adjustment within each matched set and contrast them with (λ, δ) on the same plot. The second output goes beyond a naive comparison of the magnitudes of regression coefficients with the sensitivity parameters: we supplement the calibration plot with a table summarizing the importance of the unmeasured confounder relative to the observed covariates, using both Pratt's method and the dominance analysis. This enables us to assess further the robustness of our causal conclusions regarding the detrimental effect of second-hand smoke on blood lead levels. We find that, to explain away the observed causal relationship, the unmeasured confounding would need to be associated with blood lead levels to a similar degree as the poverty income ratio, and more strongly associated with the cotinine level than any observed covariate.

In this paper, we model the unmeasured confounder as binary; we discussed the motivation behind this modelling choice in Section 2. This aspect is not essential and can in principle be generalized. When the unmeasured confounder has a continuous distribution, the E-step yields an integral rather than a weighted sum and this integral typically has no closed form; however, it can be approximated by discretizing it into a weighted sum, so that the EM algorithm by the method of weights can still be applied (Ibrahim and Weisberg, 1992). Alternatively, one may use a Monte Carlo EM algorithm (Ibrahim, 1990). These approaches usually involve heavy computation. Hence we focus on a binary unmeasured confounder in this paper, although the extension to a continuous unmeasured confounder with any prespecified distribution is immediate. For future work, it would be interesting to investigate the consequences of assuming a dichotomous unmeasured confounder. In the context of hypothesis testing, Wang and Krieger (2006) showed that, for a large class of test statistics in matched pairs with binary treatment and response, causal conclusions are most sensitive to a binary unobserved covariate. An analogous result in the context of estimation would be of great interest.
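To illustrate the discretization that would replace the E-step integral, the following sketch approximates a standard normal unmeasured confounder by a grid of support points with prior weights and computes posterior weights under a toy complete-data likelihood. The function f_toy and its parameters are placeholders; none of this is our implementation.

    # Replace integration over a continuous U ~ N(0, 1) by a weighted sum over a grid
    grid_u  <- seq(-4, 4, length.out = 41)         # support points standing in for U
    w_prior <- dnorm(grid_u)
    w_prior <- w_prior / sum(w_prior)              # prior weights on the grid

    # E-step contribution for one subject: posterior weights over the grid,
    # proportional to prior weight times the complete-data likelihood at each point
    e_step_weights <- function(f_complete) {
      w <- w_prior * sapply(grid_u, f_complete)
      w / sum(w)
    }

    # Toy complete-data likelihood, for illustration only
    f_toy <- function(u) dnorm(1.2, mean = 1 * u, sd = 1)
    post  <- e_step_weights(f_toy)
    sum(post * grid_u)   # e.g. a posterior mean of U feeding into the weighted M-step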

Supporting information

Additional ‘supporting information’ may be found in the on-line version of this article:

‘Supplementary materials for “A calibrated sensitivity analysis for matched observational studies with application to the effect of second-hand smoke exposure on blood lead levels in children”’.

Acknowledgements

The authors thank the Associate Editor and referees for helpful comments.

References

Abadie, A. and Spiess, J. (2019) Robust post-matching inference. Working Paper. Department of Economics, Harvard University, Cambridge. (Available from https://scholar.harvard.edu/spiess/publications/robust-post-matching-inference.)
Azen, R. and Budescu, D. V. (2003) The dominance analysis approach for comparing predictors in multiple regression. Psychol. Meth., 8, 129–148.
Blackwell, M. (2013) A selection bias approach to sensitivity analysis for causal effects. Polit. Anal., 22, 169–182.
Carnegie, N. B., Harada, M. and Hill, J. L. (2016) Assessing sensitivity to unmeasured confounding using a simulated potential confounder. J. Res. Educ. Effect., 9, 395–420.
Cinelli, C. and Hazlett, C. (2020) Making sense of sensitivity: extending omitted variable bias. J. R. Statist. Soc. B, 82, 39–67.
Committee on Environmental Health (2005) Lead exposure in children: prevention, detection, and management. Pediatrics, 116, 1036–1046.
Cornfield, J., Haenszel, W., Hammond, E., Lilienfeld, A., Shimkin, M. and Wynder, E. (1959) Smoking and lung cancer. J. Natn. Cancer Inst., 22, 173–203.
Ding, P. and VanderWeele, T. J. (2016) Sensitivity analysis without assumptions. Epidemiology, 27, 368–377.
DiPrete, T. A. and Gangl, M. (2004) Assessing bias in the estimation of causal effects: Rosenbaum bounds on matching estimators and instrumental variables estimation with imperfect instruments. Sociol. Methodol., 34, 271–310.
Dorie, V., Harada, M., Carnegie, N. B. and Hill, J. (2016) A flexible, interpretable framework for assessing sensitivity to unmeasured confounding. Statist. Med., 35, 3453–3470.
Enstrom, J. E. and Kabat, G. C. (2003) Environmental tobacco smoke and tobacco related mortality in a prospective study of Californians, 1960–98. Br. Med. J., 326, 1057–1060.
Fogarty, C. B. (2019) Studentized sensitivity analysis for the sample average treatment effect in paired observational studies. J. Am. Statist. Ass., to be published.
Franks, A., D'Amour, A. and Feller, A. (2019) Flexible sensitivity analysis for observational studies without observable implications. J. Am. Statist. Ass., to be published.
Gastwirth, J., Krieger, A. M. and Rosenbaum, P. R. (1998) Dual and simultaneous sensitivity analysis for matched pairs. Biometrika, 85, 907–920.
Gastwirth, J. L., Krieger, A. M. and Rosenbaum, P. R. (2000) Asymptotic separability in sensitivity analysis. J. R. Statist. Soc. B, 62, 545–555.
Gelman, A. (2008) Scaling regression inputs by dividing by two standard deviations. Statist. Med., 27, 2865–2873.
Genbäck, M. and de Luna, X. (2019) Causal inference accounting for unobserved confounding after outcome regression and doubly robust estimation. Biometrics, 75, 506–515.
Grasmick, C., Huel, G., Moreau, T. and Sarmini, H. (1985) The combined effect of tobacco and alcohol consumption on the level of lead and cadmium in blood. Sci. Totl Environ., 41, 207–217.
Griffin, B. A., Eibner, C., Bird, C. E., Jewell, A., Margolis, K., Shih, R., Slaughter, M. E., Whitsel, E. A., Allison, M. and Escarce, J. J. (2013) The relationship between urban sprawl and coronary heart disease in women. Hlth Place, 20, 51–61.
Hansen, B. B. (2004) Full matching in an observational study of coaching for the SAT. J. Am. Statist. Ass., 99, 609–618.
Hansen, B. B. and Klopfer, O. S. (2006) Optimal full matching and related designs via network flows. J. Computnl Graph. Statist., 15, 609–627.
Hansen, B. B., Rosenbaum, P. R. and Small, D. S. (2014) Clustered treatment assignments and sensitivity to unmeasured biases in observational studies. J. Am. Statist. Ass., 109, 133–144.
Hosman, C. A., Hansen, B. B. and Holland, P. W. H. (2010) The sensitivity of linear regression coefficients' confidence limits to the omission of a confounder. Ann. Appl. Statist., 4, 849–870.
Hsu, J. Y. and Small, D. S. (2013) Calibrating sensitivity analyses to observed covariates in observational studies. Biometrics, 69, 803–811.
Ibrahim, J. G. (1990) Incomplete data in generalized linear models. J. Am. Statist. Ass., 85, 765–769.
Ibrahim, J. G. and Weisberg, S. (1992) Incomplete data in generalized linear models with continuous covariates. Aust. J. Statist., 34, 461–470.
Ichino, A., Mealli, F. and Nannicini, T. (2008) From temporary help jobs to permanent employment: what can we learn from matching estimates and their sensitivity? J. Appl. Econmetr., 23, 305–327.
Imbens, G. W. (2003) Sensitivity to exogeneity assumptions in program evaluation. Am. Econ. Rev., 93, 126–132.
Kawachi, I. and Colditz, G. A. (1996) Confounding, measurement error, and publication bias in studies of passive smoking. Am. J. Epidem., 144, 909–915.
Kruskal, W. and Majors, R. (1989) Concepts of relative importance in recent scientific literature. Am. Statistn, 43, 2–6.
Lu, B., Greevy, R., Xu, X. and Beck, C. (2011) Optimal nonbipartite matching and its statistical applications. Am. Statistn, 65, 21–30.
Mannino, D. M., Albalak, R., Grosse, S. D. and Repace, J. (2003) Second-hand smoke exposure and blood lead levels in U.S. children. Epidemiology, 14, 719–727.
Mannino, D. M., Homa, D. M., Matte, T. and Hernandez-Avila, M. (2005) Active and passive smoking and blood lead levels in U.S. adults: data from the Third National Health and Nutrition Examination Survey. Nictn. Tobacc. Res., 7, 557–564.
McCandless, L. C., Gustafson, P. and Levy, A. (2007) Bayesian sensitivity analysis for unmeasured confounding in observational studies. Statist. Med., 26, 2331–2347.
Middleton, J. A., Scott, M. A., Diakow, R. and Hill, J. L. (2016) Bias amplification and bias unmasking. Polit. Anal., 24, 307–323.
National Research Council (1993) Measuring Lead Exposure in Infants, Children, and Other Sensitive Populations. Washington DC: National Academy Press.
Neyman, J. (1990) On the application of probability theory to agricultural experiments (Engl. transl. D. Dabrowska and T. P. Speed). Statist. Sci., 5, 465–480.
Oberg, M., Jaakkola, M. S., Woodward, A., Peruga, A. and Prüss-Ustün, A. (2011) Worldwide burden of disease from exposure to second-hand smoke: a retrospective analysis of data from 192 countries. Lancet, 377, 139–146.
Pimentel, S. D., Kelz, R., Silber, J. and Rosenbaum, P. (2015) Large, sparse optimal matching with refined covariate balance in an observational study of the health outcomes produced by new surgeons. J. Am. Statist. Ass., 110, 515–527.
Pratt, J. W. (1987) Dividing the indivisible: using simple symmetry to partition variance explained. In Proc. 2nd Tampere Conf. Statistics (eds T. Pukkila and S. Puntanen), pp. 245–260. Tampere: University of Tampere.
Rosenbaum, P. R. (1987) Sensitivity analysis for certain permutation inferences in matched observational studies. Biometrika, 74, 13–26.
Rosenbaum, P. R. (2002) Observational Studies. New York: Springer.
Rosenbaum, P. R. (2010a) Design of Observational Studies. New York: Springer.
Rosenbaum, P. R. (2010b) Design sensitivity and efficiency in observational studies. J. Am. Statist. Ass., 105, 692–702.
Rosenbaum, P. R. (2017) sensitivityfull: sensitivity analysis for full matching in observational studies. R Package Version 1.5.6. (Available from https://CRAN.R-project.org/package=sensitivityfull.)
Rosenbaum, P. R. and Rubin, D. B. (1983) Assessing sensitivity to an unobserved binary covariate in an observational study with binary outcome. J. R. Statist. Soc. B, 45, 212–218.
Rubin, D. B. (1974) Estimating causal effects of treatments in randomized and nonrandomized studies. J. Educ. Psychol., 66, 688–701.
Rubin, D. B. (1979) Using multivariate matched sampling and regression adjustment to control bias in observational studies. J. Am. Statist. Ass., 74, 318–328.
Shaper, A. G., Pocock, S. J., Walker, M., Wale, C. J., Clayton, B., Delves, H. T. and Hinks, L. (1982) Effects of alcohol and smoking on blood lead in middle-aged British men. Br. Med. J., 284, 299–302.
Silber, J. H., Rosenbaum, P. R., McHugh, M. D., Ludwig, J. M., Smith, H. L., Niknam, B. A., Even-Shoshan, O., Fleisher, L. A., Kelz, R. R. and Aiken, L. H. (2016) Comparison of the value of nursing work environments in hospitals across different levels of patient risk: the value of better nursing environments. J. Am. Med. Ass. Surg., 151, 527–536.
Small, D., Gastwirth, J., Krieger, A. and Rosenbaum, P. R. (2009) Simultaneous sensitivity analysis for observational studies using full matching or matching with multiple controls. Statist. Interfc., 2, 203–211.
Smith, G. D. (2003) Effect of passive smoking on health. Br. Med. J., 326, 1048–1049.
Stuart, E. A. (2010) Matching methods for causal inference: a review and a look forward. Statist. Sci., 25, 1–21.
Wang, L. and Krieger, A. M. (2006) Causal conclusions are most sensitive to unobserved binary covariates. Statist. Med., 25, 2257–2271.
Zubizarreta, J. R. (2012) Using mixed integer programming for matching in an observational study of kidney failure after surgery. J. Am. Statist. Ass., 107, 1360–1371.

This article is published and distributed under the terms of the Oxford University Press, Standard Journals Publication Model (https://academic.oup.com/journals/pages/open_access/funder_policies/chorus/standard_publication_model)