Using ecological propensity score to adjust for missing confounders in small area studies

Summary Small area ecological studies are commonly used in epidemiology to assess the impact of area level risk factors on health outcomes when data are only available in an aggregated form. However, the resulting estimates are often biased due to unmeasured confounders, which typically are not available from the standard administrative registries used for these studies. Extra information on confounders can be provided through external data sets such as surveys or cohorts, where the data are available at the individual level rather than at the area level; however, such data typically lack the geographical coverage of administrative registries. We develop a framework of analysis which combines ecological and individual level data from different sources to provide an adjusted estimate of area level risk factors which is less biased. Our method (i) summarizes all available individual level confounders into an area level scalar variable, which we call ecological propensity score (EPS), (ii) implements a hierarchical structured approach to impute the values of EPS whenever they are missing, and (iii) includes the estimated and imputed EPS into the ecological regression linking the risk factors to the health outcome. Through a simulation study, we show that integrating individual level data into small area analyses via EPS is a promising method to reduce the bias intrinsic in ecological studies due to unmeasured confounders; we also apply the method to a real case study to evaluate the effect of air pollution on coronary heart disease hospital admissions in Greater London.

The BYM model proposed by Besag et al. (1991) is assigned to ψ_{1qd} + ψ_{2qd} for district d:

ψ_{1qd} | ψ_{1q,−d} ~ Normal( (1/n_d) Σ_{d′~d} ψ_{1qd′}, σ²_{ψ1q}/n_d ),   ψ_{2qd} ~ Normal(0, σ²_{ψ2q}),   (1.1)

where ψ_{1qd} is specified through the intrinsic conditional autoregressive model (iCAR) proposed by Besag and Kooperberg (1995), n_d is the number of neighbours of district d and d′~d denotes adjacency, while ψ_{2qd} follows a normal distribution with a common variance across units. Note that the iCAR is improper, as it is possible to add a constant to each ψ_{1qd} without changing the distribution, so a global intercept α_q needs to be added together with a sum-to-zero constraint on the ψ_{1qd}. We followed the specification provided in Lunn et al. (2012), p. 264, and specified a flat prior on α_q over (−∞, +∞), so that the joint prior of the intercept and random effects is equivalent to specifying an iCAR prior on the random effects without the constraint.
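As an illustration of this structure, the sketch below builds the iCAR precision matrix Q = D − W and draws one BYM random effect. The function names, and the use of an eigendecomposition to sample from the improper iCAR (which automatically yields a sum-to-zero draw), are illustrative choices and not the paper's implementation, which fits the model by MCMC.

```python
import numpy as np

def icar_precision(W):
    """iCAR precision matrix Q = D - W for a binary adjacency matrix W.

    Q is singular (its rows sum to zero), which is why the iCAR prior is
    improper and needs an intercept plus a sum-to-zero constraint.
    """
    D = np.diag(W.sum(axis=1))
    return D - W

def sample_bym(W, sigma2_psi1, sigma2_psi2, rng):
    """Draw one BYM effect psi1 + psi2: psi1 from the iCAR, psi2 i.i.d. normal."""
    n = W.shape[0]
    Q = icar_precision(W) / sigma2_psi1
    vals, vecs = np.linalg.eigh(Q)
    keep = vals > 1e-8 * vals.max()        # drop the null space (constant vector)
    z = rng.normal(size=keep.sum())
    psi1 = vecs[:, keep] @ (z / np.sqrt(vals[keep]))   # N(0, Q^+), sums to zero
    psi2 = rng.normal(0.0, np.sqrt(sigma2_psi2), size=n)
    return psi1 + psi2
```

Sampling on the non-null eigenvectors draws exactly from the constrained iCAR, since the null space of Q is spanned by the constant vector.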
A correlation structure between the confounders should be included to allow borrowing of strength, as some might have been collected over several years, thus being available for many individuals, while others might not. We extend σ²_{ψ1q} in (1.1) to Σ_{ψ1} and, equivalently, σ²_{ψ2q} to Σ_{ψ2}; the diagonal elements model the variances of each confounder (spatially structured and unstructured), while the off-diagonal elements identify the covariances among confounders. This specification leads to the multivariate BYM model (MVBYM). See Gamerman et al. (2003) and Thomas et al. (2004) for details and applications of the MVBYM.

The specification of RW(2)
We choose a T-state RW(2) as f(·) in the imputation model. First, each continuous variable C_p from (2.3) in the main text is converted into a categorical variable by cutting its range into T equally spaced states (t = 1, . . . , T), so that each C_{ip} is assigned a state t_i. Then the non-linear relationship between C_p and EPS is approximated through the state values s_t estimated by the RW(2), i.e. f(C_{ip}) ≈ s_{t_i}. The RW(2) links first- and second-order neighbours of s_t through the following conditional distribution (shown for the interior states 2 < t < T − 1):

s_t | s_{−t} ~ Normal( [4(s_{t−1} + s_{t+1}) − (s_{t−2} + s_{t+2})] / 6, σ²_s / 6 ),

with the usual adjustments at the boundary states. The same specification is considered for the link between EPS and λ in the analysis model (Equation 2.4 in the main text).
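A minimal sketch of the discretisation step and of the RW(2) structure (penalty) matrix, whose full conditionals give a distribution of the form above. The function names and the equal-width cut based on the observed range are assumptions made for illustration.

```python
import numpy as np

def rw2_structure(T):
    """Structure matrix K = D2' D2 of a second-order random walk on T states.

    D2 takes second differences; the full conditional of each interior state
    s_t given its first- and second-order neighbours follows from K.
    """
    D2 = np.zeros((T - 2, T))
    for t in range(T - 2):
        D2[t, t:t + 3] = [1.0, -2.0, 1.0]
    return D2.T @ D2

def discretise(c, T):
    """Cut a continuous confounder into T equally spaced states 1..T."""
    edges = np.linspace(c.min(), c.max(), T + 1)
    states = np.clip(np.digitize(c, edges[1:-1]), 0, T - 1)
    return states + 1
```

The interior diagonal of K equals 6, matching the conditional precision 6/σ²_s in the distribution above.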
3. The following MAR criterion is then applied to remove M or m from around 50% of the areas, where l_i is the indicator of the missingness of M_i or m_i. The complete cases are defined as the areas with M_i or m_i available (l_i = 0, i.e. i ∈ S), while the remaining areas have l_i = 1. The simulation result with true β_2 = 0 is shown in Table 1.

Simulation study to compare EPS imputation specification
There are two ways to include the information on X in the imputation model: 1. including X as a predictor (PredX), i.e. adding X_i as a covariate in the imputation model for EPS_i; 2. including X as a response variable (RespX), i.e. modelling X_i given EPS_i through an additional logistic regression (4.6). The former (PredX) is the standard way of including covariates in an imputation model (Kenward and Carpenter, 2007), whereas the latter (RespX) is used by McCandless et al. (2012) to define the relationship between EPS_i and X. RespX is natural for EPS imputation since it follows the EPS estimation model, but the specification of RespX requires the additional logistic regression (4.6), which is computationally less efficient than PredX.
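The PredX specification can be sketched as follows. The linear form for f(C), the normal imputation draw, and the function names are illustrative assumptions; RespX would instead add a logistic model for X given EPS, which requires a joint MCMC update and is why it is computationally less efficient.

```python
import numpy as np

def impute_predx(eps, X, C, miss, rng):
    """PredX sketch: regress the observed EPS on X and C, then draw the
    missing values from the fitted normal model.

    f(C) is taken as linear in C here, a simplification of the RW(2)
    specification used in the paper.
    """
    obs = ~miss
    Z = np.column_stack([np.ones_like(eps), X, C])
    beta, *_ = np.linalg.lstsq(Z[obs], eps[obs], rcond=None)
    resid = eps[obs] - Z[obs] @ beta
    sigma = resid.std(ddof=Z.shape[1])          # residual standard deviation
    out = eps.copy()
    out[miss] = Z[miss] @ beta + rng.normal(0.0, sigma, miss.sum())
    return out
```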

The simulation to compare RespX and PredX
Little research appears to have been done on the comparison of these two specifications involving the exposure X when modelling missing EPS, so the following simulation study is designed to compare the performance of PredX and RespX.

Simulation design
The simulation uses the datasets generated in Section 3, with the true imputation model specified as EPS_i = −3 + |C_{1i}| + |C_{2i}| + Normal(0, 0.5) and the true Y generated as in Section 3. A logistic missingness model logit{P(l_i = 1)} is specified to create around 50% missing EPS. Each scenario contains 100 datasets, and four statistics (mean, RMSE, CI coverage and CI width at the 95% level) are computed for the estimated β_2.
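The data-generating and missingness steps can be sketched as follows. The choice of EPS itself as the missingness driver, the unit slope on it, and the reading of 0.5 as a standard deviation are illustrative assumptions; only the intercept is tuned to hit roughly 50% missingness, as the paper's exact missingness coefficients are not reproduced here.

```python
import numpy as np

def generate_eps(n, rng):
    """True imputation model: EPS_i = -3 + |C1i| + |C2i| + Normal(0, 0.5)
    (0.5 read here as the standard deviation)."""
    C1, C2 = rng.normal(size=(2, n))
    eps = -3.0 + np.abs(C1) + np.abs(C2) + rng.normal(0.0, 0.5, n)
    return C1, C2, eps

def mar_mask(driver, rng, target=0.5):
    """Logistic missingness on an observed driver variable.

    The intercept is tuned so that roughly `target` of the values go
    missing; the slope of 1 on the driver is an illustrative choice.
    """
    lin = driver - np.quantile(driver, 1.0 - target)
    p = 1.0 / (1.0 + np.exp(-lin))
    return rng.uniform(size=len(driver)) < p
```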
The simulation initially assesses whether X should be included in the imputation process when the imputation function f(·) linking the confounders C is either correctly specified or omitted. It then compares the two specifications of X in the imputation model: PredX (X treated as a predictor) and RespX (X treated as a response variable). To focus the comparison on PredX versus RespX, the true link function h(EPS_i) = 0.2(EPS_i + 2)² is used.

Simulation results
The simulation results are shown in Table 2. They show that it is important to include the information on X in the imputation process, either through PredX or RespX, especially when the imputation function f(·) is not correctly specified (in practice, f(·) is unknown). For example, the first three rows show that the bias of β_2 is reduced from 0.84 to a negligible value by including the information on X.
The simulation also suggests that the format in which the information on X is included is not critical, as the results for PredX and RespX are largely indistinguishable.

5. Simulation study designed to assess the model performance in presence of spatial structure and non-linearity

We designed and ran an additional simulation study to assess the performance of our model in the presence of spatial structure and of a non-linear relationship between unmeasured confounding factors and the health outcome. The synthetic data are generated broadly following the main simulation design presented in Section 3, with a few exceptions. In particular, in the interest of simplicity, we assume only one ecological confounder C; to allow for spatial structure and non-linearity, we assume that two of the four unmeasured ecological confounders are continuous instead of binary.
We simulate 100 replicated data sets as follows: 1. Simulate a spatial region comprising 300 areas on a regular 20 × 15 grid and construct a binary 300 × 300 spatial neighbourhood matrix for this region based on spatial adjacency.
2. Generate a spatial random effect ϕ using a Gaussian process representation with zero mean and a covariance function described by an exponential model (e.g. Banerjee et al., 2014).
3. Simulate a higher-order 4 × 5 super-grid that groups the original grid cells in clusters of 15 cells. This mimics the real-world application presented in the main paper, where larger spatial units (i.e. local authority districts) are used for modelling the spatially structured random effects in the EPS estimation and imputation, to overcome the issue of spatial sparsity at ward level.
We keep the set of true parameter values as described in Section 3 (note that for the ecological confounder C we use the coefficients used in Section 3 for C_1).
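The first two steps above can be sketched as follows. The rook-adjacency convention for the neighbourhood matrix, the function names and the exponential-covariance parameters are illustrative assumptions.

```python
import numpy as np

def grid_adjacency(nx, ny):
    """Binary rook adjacency for an nx x ny regular grid (row-major order)."""
    n = nx * ny
    W = np.zeros((n, n))
    for i in range(ny):
        for j in range(nx):
            k = i * nx + j
            if j + 1 < nx:                      # right neighbour
                W[k, k + 1] = W[k + 1, k] = 1
            if i + 1 < ny:                      # neighbour in the next row
                W[k, k + nx] = W[k + nx, k] = 1
    return W

def exponential_gp(coords, variance, phi_range, rng):
    """Draw a zero-mean GP with exponential covariance sigma2 * exp(-d / range)."""
    d = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)
    cov = variance * np.exp(-d / phi_range)
    return rng.multivariate_normal(np.zeros(len(coords)), cov)
```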
The EPS estimation, imputation and adjustment follow the modelling approach described in Section 2 of the main paper. Table 3 presents the results of this simulation study, assuming a true value of β_2 = 0.5. Under direct adjustment, if the non-linearity in the relationship between M_4 and Y is ignored, the RMSE increases and the coverage decreases. At the same time, scenario 2 shows a bias of 0.11 and an RMSE of 0.12 when only C and X are included in the analysis. Scenario 3 shows that the EPS framework returns unbiased results; the root mean square error increases slightly with respect to the direct adjustment, but the coverage rises from the 62% seen for the direct adjustment with M_4 included as linear to 72% when the spatial structure is also considered in the model.

Convergence
The following figure shows the potential scale reduction factor (R̂); it estimates the factor by which the scale of the credible intervals could be reduced if the MCMC simulation were run forever. It is recommended as a measure of parameter convergence (Gelman and Hill, 2006), and the usual practice is to continue the simulation until R̂ < 1.1. For our analysis R̂ is always smaller than 1.1, suggesting good convergence. Figure 1: R̂ for all the parameters estimated through our proposed modelling framework in the illustrative example.
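For reference, the classic between/within-chain computation of R̂ can be sketched as follows; this is the basic (non-split) version of the diagnostic, and the function name is an illustrative choice.

```python
import numpy as np

def rhat(chains):
    """Potential scale reduction factor for an (m, n) array of m chains.

    Compares between-chain and within-chain variance; values near 1
    indicate the chains have mixed (common cutoff: 1.1).
    """
    m, n = chains.shape
    chain_means = chains.mean(axis=1)
    W = chains.var(axis=1, ddof=1).mean()      # within-chain variance
    B = n * chain_means.var(ddof=1)            # between-chain variance
    var_plus = (n - 1) / n * W + B / n         # pooled posterior variance estimate
    return np.sqrt(var_plus / W)
```

Well-mixed chains give R̂ near 1, while chains stuck at different locations inflate the between-chain term and push R̂ well above 1.1.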