-
PDF
- Split View
-
Views
-
Cite
Cite
Katie Wilson, Jon Wakefield, Pointless spatial modeling, Biostatistics, Volume 21, Issue 2, April 2020, Pages e17–e32, https://doi.org/10.1093/biostatistics/kxy041
Close -
Share
Abstract
The analysis of area-level aggregated summary data is common in many disciplines including epidemiology and the social sciences. Typically, Markov random field spatial models have been employed to acknowledge spatial dependence and allow data-driven smoothing. In the context of an irregular set of areas, these models always have an ad hoc element with respect to the definition of a neighborhood scheme. In this article, we exploit recent theoretical and computational advances to carry out modeling at the continuous spatial level, which induces a spatial model for the discrete areas. This approach also allows reconstruction of the continuous underlying surface, but the interpretation of such surfaces is delicate since it depends on the quality, extent and configuration of the observed data. We focus on models based on stochastic partial differential equations. We also consider the interesting case in which the aggregate data are supplemented with point data. We carry out Bayesian inference and, in the language of generalized linear mixed models, if the link is linear, an efficient implementation of the model is available via integrated nested Laplace approximations. For nonlinear links, we present two approaches: a fully Bayesian implementation using a Hamiltonian Monte Carlo algorithm and an empirical Bayes implementation, that is much faster and is based on Laplace approximations. We examine the properties of the approach using simulation, and then apply the model to the classic Scottish lip cancer data.
1. Introduction
When modeling residual spatial dependence, it is appealing to formulate modeling in terms of an underlying continuous spatial surface, and this is the usual approach for point-referenced data. Continuous modeling becomes more difficult when the data contain regional aggregates at varying spatial resolutions. In epidemiological studies, data is often aggregated for reporting or anonymization. While there exists a wealth of techniques to model regional data at a fixed resolution (Cressie and Wikle, 2011; Banerjee and others, 2014), these models do not extend in a straightforward fashion to situations where more than one resolution is used. In this article, we develop methods for dealing with such situations. We describe two motivating settings.
In the first scenario, we suppose we have areal data only. In epidemiology and the social sciences this situation is the most common, since such data usually satisfy confidentiality constraints, and typically arises from aggregation over a disjoint, irregular partition of the study map, based on administrative boundaries. As an example, we consider incident lip cancer counts observed in 56 counties in Scotland over the years 1975–1980. These data provide a good test case, since they have been extensively analyzed in the literature; see Wakefield (2007) and the references there-in. In this setting, we may view the continuous underlying surface as a device to induce a spatial prior for the areas that avoids the usual arbitrary element of defining neighbors over an irregular geography.
The second scenario, we consider is one in which one data source is available as area level averages over a set of areas but is supplemented with data collected from surveys at known locations. Our interest in this problem arises from spatial modeling of demographic indicators in a developing world context. One source of data is the census, which provides data at the aggregate level, e.g. the average or sum of a variable over an administrative areal unit. In many countries, demographic and health information is collected via surveys, such as Demographic and Health Surveys (DHS; Corsi and others, 2012), and these provide a second source of data. These surveys are typically stratified cluster designs with countries being stratified into coarse areas and into urban/rural, with enumeration areas (EAs) sampled within strata, and then households sampled within EAs. In these surveys, the locations of the EAs, i.e. the GPS coordinates, are often available. In each of these examples, we assume that there is a latent, continuous Gaussian random field (GRF) that varies in space, |$\{S(\boldsymbol{x}): \boldsymbol{x}\in R \subset \mathbb{R}^2\}$| where |$R$| is our study region of interest.
The situation with which we are concerned with in this article is closely related to the change of support problem (COSP; Bradley and others, 2016; Cressie and Wikle, 2011; Gelfand, 2010; Gotway and Young, 2002). This problem occurs when one would like to make inference at a particular spatial resolution, but the data are available at another resolution. Much of this work focuses on normal data, with block kriging being used. For example, Fuentes and Raftery (2005) combine point and aggregate pollution data, with the latter consisting of outputs from numerical models, produced over a gridded surface. Berrocal and others (2010) considered the same class of problem, but added a time dimension and used a regression model with coefficients that varied spatially to relate the observed data to the modeled output. Moraga and others (2017) develop a similar framework to ours and use a stochastic partial different equation (SPDE) approach in order to relate two levels of pollution data. Specifically, the model they propose relates the continuous surface to the area (grid) level by taking an unweighted average of the surface at various points within each grid. We extend this work in several regards: most importantly, our model can accommodate non-normal outcomes and we also allow for a more complex relationship between the point-level process and the aggregated data. Therefore, we are able to address a wider range of situations.
Diggle and others (2013) take a different approach for discrete data and model various applications using log-Gaussian Cox processes, including the reconstruction of a continuous spatial surface from aggregate data. Their approach is based on Markov chain Monte Carlo (MCMC) and follows Li and others (2012) in simulating random locations of cases within areas, which is a computationally expensive step. Recently, Taylor and others (2015, 2018) have focused on improving computation for this scenario and extended the framework to include spatiotemporal models.
A related problem to the COSP, is the modeling of areal data over time, but with boundary changes. Lee and others (2017) analyze space-time data on male bladder cancer in Nova Scotia; the spatial aggregation changes over time, with the older data tending to be of aggregate form and the point data being the norm in more recent years. Building on previous work (Fan and others, 2011; Li and others, 2012; Nguyen and others, 2012), they use a local EM algorithm in conjunction with a local polynomial to model the risk surface.
We propose a three-stage Bayesian hierarchical model that can combine point and areal data by assuming a common underlying smooth, continuous surface. We use the SPDE approach of Lindgren and others (2011) to model the latent field, which allows for computationally efficient inference. The article is structured as follows. In Section 2, we describe the model and in Section 3 the computational details. A simulation study in Section 4 considers a number of scenarios including point data, areal data, and a combination of the two. In Section 5, we illustrate the non-linear areal data only situation for the famous Scotland lip cancer example. Section 6 contains concluding remarks.
2. Model description
We propose a general model framework for inference that can be used for data collected at points, over areas, or a combination of the two. We describe the likelihood first for normal and then for Poisson data (as an illustration of a non-normal outcome), before concluding with a discussion of the model for the latent spatial surface.
2.1. Normal responses
2.2. Poisson responses
If we have non-rare outcomes and only observe the sum then the situation is far more difficult to deal with since the sum of binomials with varying probabilities is a convolution of binomials. If we observe the individual outcomes |$Y_{ij}$| (and not just the sum), then we can model each as binomial (so that we do not have to resort to the convolution). The common situation with disease counts and expected numbers are available across a set of areas is considered in Section 5.
2.3. Model for the latent process
3. Computation
There are two steps to the computation, first the continuous latent surface is discretized in a convenient fashion (Section 3.1) and second the posterior is approximated. We begin with the normal case (Section 3.2) before turning to the more difficult Poisson case (Section 3.3).
3.1. Approximating the latent process
The major hurdle to the more widespread modeling of spatial data with a continuous surface has been the computation. In particular, inverting and finding the determinant of the Matérn covariance matrix, which is in general not sparse, has been a roadblock when the number of points is not small. However, recent work by Lindgren and others (2011) and Simpson and others (2012) detail the connection between GRFs and Gaussian Markov random fields (GMRFs). They first note that GRFs with a Matérn covariance function are solutions to a particular SPDE, and under certain relatively non-restrictive choices this produces a Markovian GRF (MGRF). They then show it is possible to obtain a representation of the solution to the SPDE using a GMRF.
For inference, the discretized version of the spatial prior is combined with the likelihood. In the setting, where we have known locations, it follows from (3.1) that the value of the spatial random effect at an observation point, |$\boldsymbol{x}_{ij}$|, can be approximated by a weighted average of the value of the GMRF on the three nearest mesh vertices. We can write, |$S(\boldsymbol{x}_{ij}) \approx \tilde{S}(\boldsymbol{x}_{ij}) = \boldsymbol{A}_{ij}^\text{T} \boldsymbol{w}$|, where |$\boldsymbol{A}_{ij}$| is an |$m \times 1$|-vector of weights that corresponds to the |$ij$|-th row of a sparse projection matrix |$\boldsymbol{A}$|. The nonzero entries of |$\boldsymbol{A}_{ij}$|, which correspond to the mesh points comprising the triangle containing |$\boldsymbol{x}_{ij}$|, are calculated using Barycentric interpolation. In the case where the observation location, |$\boldsymbol{x}_{ij}$|, is at a mesh vertex, |$\boldsymbol{A}_{ij}$| contains one non-zero entry that is equal to one.
For the normal response model with areal data, we use a fully Bayesian approach, since a fast computational strategy is available. Specifically, the integrated nested Laplace approximation (INLA), an approach for analyzing latent Gaussian models (Rue and others, 2009), can be used. INLA works by using a combination of Laplace approximations along with numerical integration to obtain approximations to the posterior marginals. The SPDE approach has also been implemented in the R package R-INLA (Lindgren and Rue, 2015). For the Poisson response model, R-INLA cannot be used for data aggregated over areas; instead, we consider approaches that involve empirical Bayes (EB), Laplace approximations, MCMC, or hybrid combinations of these techniques.
3.2. Normal responses
This type of model can be fit using INLA, since |$\boldsymbol{D}_i^\text{T} \boldsymbol{w}$| is Gaussian. See Appendix A of the supplementary material available at Biostatistics online for details on the implementation in the context of the simulation that we describe in Section 4.
3.3. Poisson responses
Due to the structure of this model, it is not possible to use the R-INLA software for fitting, but we describe three alternatives. First, a quick approximation is offered by EB with a Laplace approximation being used to integrate out the spatial random effects. To implement this, we use the R package TMB (which stands for Template Model Builder; Kristensen 2014). This is very efficient and estimates of the spatial hyperparameters and fixed effects can be computed within minutes. Second, we resort to MCMC methods. It is well known that in the Gaussian Process context, MCMC methods can be inefficient (Filippone and others, 2013). We opt to use a Hamiltonian Monte Carlo (HMC; Neal, 2011) transition operator for updating |$\boldsymbol{w}$|. Specifically, we first update the spatial hyperparameters |$\boldsymbol{\theta}$| using a random walk proposal and then jointly update |$\boldsymbol{w}$| and |$\beta_0$| using HMC. Finally, we consider a hybrid approach where estimates for the spatial hyperparameters |$\boldsymbol{\theta}$| are found using the empirical Bayes approach and then, conditional on these estimates, posteriors for |$\boldsymbol{w}$| and any fixed effects are explored using MCMC methods. Details of these algorithms in the context of the Scotland example can be found in Appendix C of the supplementary material available at Biostatistics online.
In both the simulation and the real data example, we use relatively vague priors; see Appendices A and C of the supplementary material available at Biostatistics online for details.
4. Simulation study in the normal response case
4.1. Set up
We illustrate the method for normal responses via a simulation considering observations associated with points and observations associated with areas. As a motivating example, we assume the aim is to construct a poverty surface; understanding the spatial structure of poverty and poverty-related factors is of considerable interest (e.g. Gething and others, 2015; Minot and Baulch, 2005; Okwi and others, 2007). Poverty has many different facets, and we take the wealth index (Rutstein and Johnson, 2004) as our measure, which serves as a surrogate for long-term standard of living. The wealth index is comprised of several variables such as household ownership of consumables, access to drinking water, and toilet facilities. The score is then standardized to have mean 0 and standard deviation 1. We simulate a surface of the average wealth index score within households.
We will consider situations in which the wealth index is measured at point locations and we also consider incorporating census data, which provides the average wealth index at the area-level. Observations associated with points are taken from a design that is informed by the Kenya DHS (Kenya National Bureau of Statistics, 2015). It is simplified in that we do not consider stratification or explicit cluster sampling for the 400 locations, which correspond to the centroids of enumeration areas (EAs) from the Kenya 2008 DHS. The dots on the plots in Figure 1 indicate the locations of these sampling points. We emphasize that these are point locations. It would be straightforward to extend the simulation and model to acknowledge the complex design, see Wakefield and others (2018).
Mesh (left) and latent spatial surface (right) used for the simulations. The mesh extends beyond the border of Kenya to avoid boundary effects. The black dots represent the locations of the 400 enumeration areas.
Let |$i=1,\dots,n$| index the administrative areas in Kenya and |$j=1,\dots,n_i$| represent the EAs in area |$R_i$|. Hence, |$\sum_{i=1}^n n_i=400$|. Furthermore, let |$h=1,\dots,N_i$| index the households included in the census in area |$R_i$| and let |$N_{ij}$| be the number of households surveyed at the |$j$|th location (EA) in |$R_i$|. For our simulation, the number of households participating in a survey, |$N_{ij}$|, ranges from 41 to 81, with mean 55 to give 21 946 households in total. We let |$Y_{ih}$| be the wealth index of household |$h$| in area |$R_i$|. As assumed in the DHS, all households in EA |$j$| have the same location.
We assume that the spatial model is a MGRF with Matérn covariance controlled by variance parameter |$\lambda^2$| and scale parameter |$\kappa$|. We set |$\beta_0 = 0$|, |$\sigma^2 = 0.25$|, and |$\lambda^2 = 0.75$|. To simulate the spatial surface, we use the SPDE approach, which requires a triangulated mesh. This mesh is shown in the left panel of Figure 1 with |$m=2,786$| mesh points; these mesh points are approximately 15 km apart in the interior of Kenya. We set |$\kappa = \exp(1/2)$|, which corresponds to a practical range of |$\phi = \sqrt{8}/\kappa=1.72$| degrees. The simulated average household wealth index surface, i.e., |$\beta_0 + \tilde{S}(\boldsymbol{x})$|, is shown in the right panel of Figure 1, where the spatial effect |$\tilde{S}(\boldsymbol{x})$| approximates |$S(\boldsymbol{x})$|.
To simulate data at the 400 EAs, we approximate (4.2) by |$\mu_{ij} = \beta_0 + \tilde{S}(\boldsymbol{x}_{ij})$| where |$\tilde{S}(\boldsymbol{x}_{ij})$| is the simulated spatial effect at EA |$j$| in area |$R_i$|. To simulate the census data we use gridded population estimates from SEDAC (Center for International Earth Science Information Network - CIESIN - Columbia University, 2016), which are available on an (approximately) 1 km square grid at the equator. The gridded population estimates are then transformed to household estimates by dividing the population estimates by 3.9, the mean size of households in 2014 (Kenya National Bureau of Statistics, 2015). We then approximate (4.1) by |$\mu_i = \beta_0 + \frac{1}{N_i}\sum_{g=1}^{G_i} N_{ig} \tilde{S}(\boldsymbol{x}_{ig})$| where |$N_{ig}$| is the household estimate for grid cell |$g$| in area |$R_i$|, |$N_i = \sum_{g=1}^{G_i}N_{ig}$| is the household estimate for area |$R_i$|, and |$\tilde{S}(\boldsymbol{x}_{ig})$| is the simulated spatial effect at the centroid of grid cell |$g$|.
We consider five different scenarios with varying levels of information available on location: (i) survey data only, (ii) census data up to county level (|$n=47$|) only, (iii) both survey data and census data up to county level, (iv) census data up to provincial level (|$n=8$|) only, and (v) both survey data and census data up to provincial level. When we analyze survey and census data together, we assume the two data sources are independent, which in practice means that the surveyed population is only a small fraction of the total population.
Comparison of results under the five scenarios: 400 surveys with exact location (Surveys), census data at the county level (47 Areas), both survey and census data at the county level (Surveys + 47 Areas), census data at the provincial level (8 Areas), and both survey and census data at the provincial level (Surveys + 8 Areas). Top row is the data available under each scenario. Black dots are the locations of the enumeration areas and black borders correspond to the various boundaries (47 counties and 8 provinces). Middle row is the predicted (posterior mean) of the spatial surface. Bottom row is the posterior standard deviation of the predicted surface.
Posterior median and 95% credible intervals (CI) for parameters, mean squared error (MSE), and mean absolute error (MAE) of the surfaces in the simulation under five scenarios: 400 surveys with exact location (Surveys), census data at the county level (47 Areas), both survey and census data at the county level (Surveys + 47 Areas), census data at the provincial level (8 Areas), and both survey and census data at the provincial level (Surveys + 8 Areas)
| Scenario . | |$\beta_0$|: |$0$| . | |$\sigma^2$|: |$0.25$| . | |$\phi$|: |$1.72$| . | |$\lambda^2$|: |$0.75$| . | MSE . | MAE . |
|---|---|---|---|---|---|---|
| Surveys | 0.0614 | 0.292 | 2.16 | 0.894 | 0.206 | 0.333 |
| (|$-$|0.549, 0.635) | (0.235, 0.365) | (1.56, 3.12) | (0.515, 1.65) | |||
| 47 Areas | 0.198 | 0.213 | 2.08 | 0.687 | 0.339 | 0.463 |
| (|$-$|0.334, 0.728) | (0.0618, 0.367) | (1.37, 3.24) | (0.388, 1.27) | |||
| Surveys + 47 Areas | 0.0529 | 0.323 | 2.11 | 0.833 | 0.170 | 0.316 |
| (|$-$|0.519, 0.575) | (0.262, 0.401) | (1.56, 3.03) | (0.500, 1.50) | |||
| 8 Areas | |$-$|0.0108 | 0.211 | 1.10 | 0.654 | 0.551 | 0.602 |
| (|$-$|0.511, 0.444) | (0.0617, 0.364) | (0.383, 2.30) | (0.317, 1.76) | |||
| Surveys + 8 Areas | 0.0571 | 0.298 | 2.17 | 0.879 | 0.201 | 0.330 |
| (|$-$|0.552, 0.616) | (0.241, 0.371) | (1.58, 3.15) | (0.516, 1.63) |
| Scenario . | |$\beta_0$|: |$0$| . | |$\sigma^2$|: |$0.25$| . | |$\phi$|: |$1.72$| . | |$\lambda^2$|: |$0.75$| . | MSE . | MAE . |
|---|---|---|---|---|---|---|
| Surveys | 0.0614 | 0.292 | 2.16 | 0.894 | 0.206 | 0.333 |
| (|$-$|0.549, 0.635) | (0.235, 0.365) | (1.56, 3.12) | (0.515, 1.65) | |||
| 47 Areas | 0.198 | 0.213 | 2.08 | 0.687 | 0.339 | 0.463 |
| (|$-$|0.334, 0.728) | (0.0618, 0.367) | (1.37, 3.24) | (0.388, 1.27) | |||
| Surveys + 47 Areas | 0.0529 | 0.323 | 2.11 | 0.833 | 0.170 | 0.316 |
| (|$-$|0.519, 0.575) | (0.262, 0.401) | (1.56, 3.03) | (0.500, 1.50) | |||
| 8 Areas | |$-$|0.0108 | 0.211 | 1.10 | 0.654 | 0.551 | 0.602 |
| (|$-$|0.511, 0.444) | (0.0617, 0.364) | (0.383, 2.30) | (0.317, 1.76) | |||
| Surveys + 8 Areas | 0.0571 | 0.298 | 2.17 | 0.879 | 0.201 | 0.330 |
| (|$-$|0.552, 0.616) | (0.241, 0.371) | (1.58, 3.15) | (0.516, 1.63) |
Posterior median and 95% credible intervals (CI) for parameters, mean squared error (MSE), and mean absolute error (MAE) of the surfaces in the simulation under five scenarios: 400 surveys with exact location (Surveys), census data at the county level (47 Areas), both survey and census data at the county level (Surveys + 47 Areas), census data at the provincial level (8 Areas), and both survey and census data at the provincial level (Surveys + 8 Areas)
| Scenario . | |$\beta_0$|: |$0$| . | |$\sigma^2$|: |$0.25$| . | |$\phi$|: |$1.72$| . | |$\lambda^2$|: |$0.75$| . | MSE . | MAE . |
|---|---|---|---|---|---|---|
| Surveys | 0.0614 | 0.292 | 2.16 | 0.894 | 0.206 | 0.333 |
| (|$-$|0.549, 0.635) | (0.235, 0.365) | (1.56, 3.12) | (0.515, 1.65) | |||
| 47 Areas | 0.198 | 0.213 | 2.08 | 0.687 | 0.339 | 0.463 |
| (|$-$|0.334, 0.728) | (0.0618, 0.367) | (1.37, 3.24) | (0.388, 1.27) | |||
| Surveys + 47 Areas | 0.0529 | 0.323 | 2.11 | 0.833 | 0.170 | 0.316 |
| (|$-$|0.519, 0.575) | (0.262, 0.401) | (1.56, 3.03) | (0.500, 1.50) | |||
| 8 Areas | |$-$|0.0108 | 0.211 | 1.10 | 0.654 | 0.551 | 0.602 |
| (|$-$|0.511, 0.444) | (0.0617, 0.364) | (0.383, 2.30) | (0.317, 1.76) | |||
| Surveys + 8 Areas | 0.0571 | 0.298 | 2.17 | 0.879 | 0.201 | 0.330 |
| (|$-$|0.552, 0.616) | (0.241, 0.371) | (1.58, 3.15) | (0.516, 1.63) |
| Scenario . | |$\beta_0$|: |$0$| . | |$\sigma^2$|: |$0.25$| . | |$\phi$|: |$1.72$| . | |$\lambda^2$|: |$0.75$| . | MSE . | MAE . |
|---|---|---|---|---|---|---|
| Surveys | 0.0614 | 0.292 | 2.16 | 0.894 | 0.206 | 0.333 |
| (|$-$|0.549, 0.635) | (0.235, 0.365) | (1.56, 3.12) | (0.515, 1.65) | |||
| 47 Areas | 0.198 | 0.213 | 2.08 | 0.687 | 0.339 | 0.463 |
| (|$-$|0.334, 0.728) | (0.0618, 0.367) | (1.37, 3.24) | (0.388, 1.27) | |||
| Surveys + 47 Areas | 0.0529 | 0.323 | 2.11 | 0.833 | 0.170 | 0.316 |
| (|$-$|0.519, 0.575) | (0.262, 0.401) | (1.56, 3.03) | (0.500, 1.50) | |||
| 8 Areas | |$-$|0.0108 | 0.211 | 1.10 | 0.654 | 0.551 | 0.602 |
| (|$-$|0.511, 0.444) | (0.0617, 0.364) | (0.383, 2.30) | (0.317, 1.76) | |||
| Surveys + 8 Areas | 0.0571 | 0.298 | 2.17 | 0.879 | 0.201 | 0.330 |
| (|$-$|0.552, 0.616) | (0.241, 0.371) | (1.58, 3.15) | (0.516, 1.63) |
4.2. Survey data
In the first scenario, we consider the situation in which we have survey data available from 400 EAs. To fit the model using R-INLA, we construct the projection matrix |$\boldsymbol{A}$| as described in Section 3.1. We fit model (4.2) using the SPDE approach. Computational details can be found in Appendix A of the supplementary material available at Biostatistics online.
Posterior medians and 95% credible intervals (CIs) for the parameters are presented in Table 1 and the predicted spatial random effect surface is depicted in Figure 2 (left column). In general, the posterior medians are relatively close to their true values and all CIs cover the true value, though are fairly wide. The predicted spatial surface (posterior mean) over Kenya is visually similar to the true spatial surface, though there is some attenuation. Regions of Kenya that have a higher spatial effect are predicted to be lower and vice versa; this shrinkage to the mean phenomenon is well known in the spatial literature (Section 6.4 of Diggle and Ribeiro, 2007). We also see that the posterior standard deviation of the spatial effect is lower in the vicinity of the 400 EAs and higher elsewhere. The posterior median and 2.5th and 97.5th percentiles of the predicted average household wealth index is depicted in Figure 1 (left column) of the supplementary material available at Biostatistics online and we see similar patterns.
4.3. Census data (47 counties)
We next consider a situation in which we have census data for each of the |$n=47$| counties in Kenya. To implement (4.1), we approximate |$\mu_i$| using (3.2), which requires the population density at the mesh points. To determine the population estimate corresponding to the grid containing the mesh point, we used gridded population estimates from SEDAC. Figure 2 of the supplementary material available at Biostatistics online shows which mesh points have higher than average population density in each of the |$n=47$| counties.
The results are presented in Table 1 and depicted in Figure 2 (second column). Again, we see that the posterior medians are relatively close to the true values. The predicted spatial surface is similar to the truth and is very similar to the predicted surface estimated for the point data. In general, the posterior standard deviation of the spatial effect is higher under this scenario than when we had information from 400 surveys. This is also evident when comparing the 2.5th and 97.5th percentile of the predicted average household wealth index, displayed in Figure 1 (middle column) of the supplementary material available at Biostatistics online.
4.4. Survey and census data (47 counties)
Another scenario that might arise is one in which we have both survey data at 400 EAs and census data available for 47 counties. Thus, we simply combine the methods from Sections 4.2 and 4.3. The results are presented in Table 1 and displayed in Figure 2 (third column) and Figure 1 (right column) of the supplementary material available at Biostatistics online. Overall, there is a slight improvement over the survey information only case. We note that there are some identifiability problems when estimating the two variance parameters, which manifests itself here with |$\sigma^2$| being overestimated and |$\lambda^2$| underestimated.
4.5. Census data (8 provinces)
In order to evaluate the effect when area information is known at a greater aggregate level than previously considered, we examine a situation where we only have census data available for each of the |$n=8$| provinces in Kenya. Implementation-wise, this scenario is analogous to the one previously described in Section 4.3. Results are presented in Table 1 and a depiction of the posterior mean and standard deviation along with a map displaying the eight provinces is in Figure 2 (fourth column). In this scenario, inference for the parameters is severely deteriorated when compared with the previous cases. In particular, the CIs are much wider than in the previous scenarios and the MSE and MAE are substantially larger.
4.6. Survey and census data (8 provinces)
The last scenario we consider is similar to that in Section 4.4, where we have survey and census data available (at the provincial level). Parameter estimates are presented in Table 1 and the posterior mean and standard deviation of the random effect is depicted in Figure 2 (last column). Again, identifiability issues are evident in inference for the variances. The spatial effect surface is similar to the surveys-only scenario.
In terms of the MSEs, the values are 0.206, 0.339, and 0.170 when we have survey data with geographic coordinates, census data at the county-level (|$n=47$|), and a combination. In this simulation, there is a loss of accuracy when we only have census data, but it is not dramatic. However, when we aggregate at the provincial-level (|$n=8$|) the MSE is 0.551 and when we additionally incorporate survey data the MSE is 0.201. In general, we see a modest improvement when incorporating the census data over just using survey data. The improvement is significantly better when we used county-level census data rather than provincial-level census data. Similar trends hold for the mean absolute errors.
5. Application to Scottish lip cancer data
Let |$R_i$| denote county |$i$|, |$i=1,\dots,n=56$| and let |$Y_{iaj}$| be the binary male lip cancer indicator in stratum (age-band) |$a$| of county |$i$| at location |$\boldsymbol{x}_{ij}$|, |$j=1,\dots,N_{ia}$| where |$N_{ia}$| is the male population in county |$R_i$| age group |$a$|. In the usual case, the available data correspond to summed disease counts |$Y_{i++}=\sum_{a=1}^A\sum_{j=1}^{N_{ia}} Y_{iaj}$| and expected numbers |$E_i$|; these expected numbers are often pre-calculated as |$E_i = \sum_{a=1}^A N_{ia} q_a$|, where |$q_a$| is a reference risk for stratum |$a$|. The |$q_a$| may be taken from a previous time period or calculated (via internal standardization) in advance. The rarity of many diseases, and the lack of stratum-specific information, means that simplifying modeling assumptions are needed, as we now describe.
Top left: the SIR estimates of the relative risks of lip cancer in 56 counties of Scotland. Top right: relative risk estimates (posterior medians) from the BYM model. Bottom left: relative risk estimates (posterior medians) from aggregating results from the hybrid EB/MCMC approach. Bottom right: relative risk estimates (posterior medians) from aggregating results from the fully Bayesian approach.
Inference for this model proceeds as discussed in Section 3.3. The mesh used is shown in Figure 4 with |$m=2417$| mesh points, which results in mesh points that are |$\approx$| 6.3 km apart. We determined the relative population density for each |$R_i$| in the same manner as we did for the Kenya simulation and mesh points associated with higher relative population densities are shown in Figure 4.
Left: mesh used for Scotland analysis, consisting of |$m=2417$| mesh points. Middle: distribution of population in Scotland. The gray circles represent mesh points where the population density is larger than average for that area. Right: the predicted continuous relative risk surface from using the fully Bayesian approach.
We considered several different computational strategies, which are described more fully in Appendix C of the supplementary material available at Biostatistics online. Briefly, we implemented an EB approach in which the spatial random effects were integrated out using Laplace approximations, a fully Bayesian approach (using HMC), and a hybrid of these two in which HMC was used, with the spatial hyperparameters |$\boldsymbol{\theta}$| fixed at the EB estimates. For the fully Bayesian approach, we initialized four chains, used a burn in of 10 000 iterations, ran an additional 1 000 000 iterations and thinned them to ultimately save 1000 iterations from each chain. For the hybrid approach, we also initialized four chains, used a burn in of 500 iterations, and ran an additional 1000 iterations for each chain. Convergence summaries for both the fully Bayesian and hybrid approach are given in Appendix D of the supplementary material available at Biostatistics online.
Estimates and 95% CIs for the parameters |$\exp(\beta_0)$|, |$\phi$|, and |$\lambda^2$| are presented in Table 1 of the supplementary material available at Biostatistics online for the three different computational strategies. There is good agreement across the three approaches, though we notice that the posterior CIs tend to be wider when using the fully Bayesian computational strategy, which is not surprising given the assumption of asymptotic normality used to construct the CIs in the EB approach.
We also obtain predictions and posterior standard deviations of |$\tilde{S}(\boldsymbol{x})$|, displayed in Figure 6 of the supplementary material available at Biostatistics online. We note that the posterior standard deviation of the surface is smallest in regions of Scotland where the population is greatest, and larger elsewhere. Furthermore, the posterior standard deviation tends to be a little lower for the hybrid approach than for the fully Bayesian approach, which is not surprising given that the spatial hyperparameters were fixed in the hybrid approach. The predicted continuous relative risk surface using the fully Bayesian approach (posterior median) is presented in Figure 4. As expected, we see that the continuous relative risk surface is largest in the counties with higher SIRs, and lowest in the counties with the smallest SIRs.
We obtain relative risk estimates (posterior medians), as well as 95% CIs for each of the 56 counties from this model by aggregating (with respect to the population density) the continuous relative risk surface within each county. To obtain estimates of the desired quantiles in both the fully Bayesian and hybrid approach, for each county |$R_i$| we obtain |$b=1,\dots,B=4{,}000$| draws from the “aggregated” relative risk surface, |$\theta_i^{(b)} = \exp\left(\beta_0^{(b)}\right) \ \boldsymbol{D}_i^\text{T} \boldsymbol{T}^{(b)}$|, where |$\boldsymbol{T}^{(b)} = \left[~\exp\left(w_1^{(b)}\right),\dots,\exp\left(w_m^{(b)}\right)~ \right]^\text{T}$| (see (3.3)). As before, |$\boldsymbol{D}_i$| has at most |$m_i$| nonzero entries that correspond to the population density estimates, |$d(\boldsymbol{x}_{ik})$|. From here, we can obtain the desired summary measures. Posterior medians are presented in Figure 3 and the 95% CIs are displayed in Figure 7 of the supplementary material available at Biostatistics online. We see that the relative risk estimates are nearly identical for both computational strategies and are similar to the SIRs, but that the estimates are shrunk towards the overall mean, which is as expected.
We also compare our results to those obtained using the BYM model. Parameter estimates are in Appendix D of the supplementary material available at Biostatistics online. The predicted relative risks (posterior medians) for each county are presented in Figure 3, and the 95% CIs are presented in Figure 7 of the supplementary material available at Biostatistics online. The results are very similar to the continuous model.
For the fully Bayesian approach using HMC (using our own code), it took approximately a week to fit the model using a computing cluster. This can be improved tremendously by using the hybrid approach. It takes on the order of minutes to obtain the empirical Bayes estimates and about 10 min to run the HMC.
6. Discussion
In this article, we propose a Bayesian hierarchical model that can accommodate observations taken at different spatial resolutions. To this end, we assume a continuous spatial surface, which we model using the SPDE approach.
In the simulation example, we considered surveys taken at point locations and census data associated with areas. When the only data available was census data at the county-level, there was not a substantial loss in accuracy when comparing it to a situation in which we had survey (point) data. In general, we would not expect this to hold when comparing point data to areal data as the loss of information depends on the strength of the spatial dependence, the number and geographical configuration of the areas, and on the amount and quality of the survey data. When the size of the areas increased (comparing 47 counties to 8 provinces), estimates for the spatial parameters and the overall household wealth index were highly variable and the predicted spatial surface was much less nuanced.
With point data, we are not able to learn anything about the surface at a spatial resolution which is less than the distance between the two closest points. If we only have areal data, the situation is far worse. Hence, one should not over-interpret fine spatial scale structure. In both the point and areal data cases, model checking is difficult. For areal data, one may simply view the model as a method to induce an area-level spatial prior. The results can be presented at the area-level, as we did with the Scottish lip cancer example; with area-level data only, we can only check the model at the aggregate level.
We also applied our method to the Scotland lip cancer dataset. Overall, there were very minor differences in the relative risk estimates for each county when comparing our continuous spatial model to a discrete spatial model (i.e. the BYM model) but again there is strong spatial dependence in these data (which explains in part the popularity of these data). However, we note that modeling a continuous surface is particularly attractive in that it is not subject to definitions of administrative boundaries, which can often be arbitrary, and a continuous risk surface, in general, more accurately reflects disease etiology. Furthermore, it can easily be adapted to situations where we might also have point level covariate data. In the latter case, the use of the models we have described avoids the ecological fallacy, which occurs when area-level associations differ from the individual-level counterparts. To avoid ecological bias, one requires point level covariates but the availability of such data is increasing (Gething and others, 2015).
For normal outcomes, all computation can be performed quickly using R-INLA. In the simulation example, it took about 2 min to fit each of the models on a standard laptop. For Poisson outcomes, computation is much more difficult. There has been an increased interest in implementing sparse matrix operations in Stan, which would improve usability of this method. In general, there is still a need for computationally efficient MCMC schemes for Gaussian process data.
In the simulation study we considered, we looked at combining survey (point) data and census (areal) data. However, with older DHS surveys exact geographic coordinates are not available. Instead, it is only known in which area of the country the survey was taken. This is different from the problem we considered in that, instead of observing outcomes that are associated with an entire area (census data), outcomes are for a specific point in the area, but that exact location is unknown. Therefore, the methods proposed here would need to be altered to accommodate this type of situation.
7. Software
All code and input data used in the simulation and application is available on github (https://github.com/wilsonka/pointless-spatial-modeling).
Acknowledgments
The authors would like to thank Dan Simpson for numerous helpful conversations and Jim Thorson for advice on TMB. Conflict of Interest: None declared.
Funding
Both authors were supported by R01CA095994 from the National Institutes of Health.




