Victoire Michal, Jon Wakefield, Alexandra M Schmidt, Alicia Cavanaugh, Brian E Robinson, Jill Baumgartner, Model-Based Prediction for Small Domains Using Covariates: A Comparison of Four Methods, Journal of Survey Statistics and Methodology, Volume 12, Issue 5, November 2024, Pages 1489–1514, https://doi.org/10.1093/jssam/smae032
Abstract
We consider methods for model-based small area estimation when the number of areas with sampled data is a small fraction of the total areas for which estimates are required. Abundant auxiliary information is available from the survey for all the sampled areas. Further, through an external source, there is information for all areas. The goal is to use auxiliary variables to predict the outcome of interest for all areas. We compare areal-level random forests and LASSO approaches to a frequentist forward variable selection approach and a Bayesian shrinkage method using a horseshoe prior. Further, to measure the uncertainty of estimates obtained from random forests and the LASSO, we propose a modification of the split conformal procedure that relaxes the assumption of exchangeable data. We show that the proposed method yields intervals with the correct coverage rate and this is confirmed through a simulation study. This work is motivated by Ghanaian data available from the sixth Ghana Living Standards Survey (GLSS) and the 2010 Population and Housing Census, in the Greater Accra Metropolitan Area (GAMA) region, which comprises eight districts that are further divided into enumeration areas (EAs). We estimate the areal mean household log consumption using both datasets. The outcome variable is measured only in the GLSS for 3 percent of all the EAs (136 out of 5019) and 174 potential covariates are available in both datasets. In the application, among the four modeling methods considered, the Bayesian shrinkage performed the best in terms of bias, mean squared error (MSE), and prediction interval coverages and scores, as assessed through a cross-validation study. We find substantial between-area variation with the estimated log consumption showing a 1.3-fold variation across the GAMA region. The western areas are the poorest while the Accra Metropolitan Area district has the richest areas.
We propose a procedure to compute prediction intervals that is valid for a variety of modeling methods (including random forests and the LASSO) and that relaxes the assumption of exchangeable data. We compare the performance of random forests and the LASSO with forward variable selection and a Bayesian method to produce small area estimates when a large proportion of the areas are not sampled.
1. MOTIVATION
In 2015, the United Nations (UN) released their 2030 agenda for sustainable development goals (SDGs) consisting of 17 goals, the first of which was to end poverty worldwide (United Nations 2015). For their first SDG, the UN made seven guidelines explicit, including the implementation of “poverty eradication policies” at a disaggregated level. To that end, producing reliable and high-resolution spatial estimates of socioeconomic status and income inequality is fundamental to help decision makers prioritize and target certain areas for decentralized interventions (Elbers et al. 2002). These detailed maps empower local communities to understand their situation compared to their neighbors, which also helps when planning interventions (Bedi et al. 2007).
In Ghana, household surveys are collected every 5–7 years to measure the living conditions of households across Ghanaian regions and districts and to monitor poverty. To track the wealth of the Ghanaian population, the equivalized consumption is recorded for the sampled households. Although household income is not directly measured, the equivalized expenditure is an alternative that allows decision makers to assess a household’s standard of living (Johnson et al. 2005). This measure corresponds to the household consumption scaled by a weight based on the number of members in the household. We aim to estimate the equivalized consumption at a disaggregated level, to help communities, civil society organizations, and policymakers better understand the distribution of households’ living standards in Ghana, in order to prioritize certain areas when implementing poverty eradication policies. The sixth Ghana Living Standards Survey (GLSS), conducted in 2012–2013, was the last household survey carried out prior to the new UN SDGs agenda. The fifth GLSS had shown that inequalities had increased since 2006. In particular, although overall poverty decreased nation-wide, the wealthiest decile of the population consumed 6.8 times more than the poorest (Cooke et al. 2016). A downside of these household surveys is that the sampling design is stratified two-stage cluster sampling, which only allows for reliable survey sampling estimates at the district level at best. Ghana is divided into ten regions, which comprised 170 districts in 2010 or, at a finer level, around 38,000 enumeration areas (EAs). Producing reliable estimates at the EA level would further help the authorities in their policy decisions (Corral et al. 2022).
We analyze data from the sixth GLSS for the Greater Accra Metropolitan Area (GAMA), which consists of eight districts. The GLSS used a stratified two-stage cluster sample in which strata are defined by an urban or rural indicator. The first-stage sample of clusters (EAs) was selected with probability proportional to size. Within the sampled EAs, 15 households were systematically sampled. For each sampled household, we have a detailed assessment of consumption, together with information on education, employment, and assets, totaling 174 auxiliary variables. This yields a sample of 136 of the 5019 EAs in this Ghanaian region. Given that a small proportion of the areas are observed, a model-based prediction approach is sensible (Pfeffermann 2013; Tzavidis et al. 2018; Ghosh 2020; Erciulescu and Opsomer 2022; Hogg et al. 2023). Additionally, the sampled EAs are anonymized, which means it is unknown which 136 of the 5019 EAs are represented in the survey. Finally, we have data available from the 2010 Ghanaian census for all EAs in the GAMA. Among others, the same 174 variables are measured in this census and in the sixth GLSS. The aim of this work is to produce estimates, with uncertainty, of the mean log household consumption at the EA level in the GAMA. Note that, as is common in the literature on poverty mapping, we model the equivalized consumption on the log scale so that the data are approximately symmetric (Elbers et al. 2003; Nguyen et al. 2017). If needed, the estimates could be transformed back to the original scale.
Given the large number of auxiliary variables relative to the number of sampled EAs, we assess the performance of random forests and the LASSO (which performs variable selection) to estimate the mean household log consumption at the EA level in the GAMA. For the sake of comparison, we also consider a forward variable selection approach in the frequentist framework and a Bayesian shrinkage method using a horseshoe prior. For all four approaches, we adopt EA-level models. Due to the nature of the motivating Ghanaian datasets, where only a small proportion of the areas are sampled and are anonymized, synthetic small area estimators are of interest. Further, we propose a modification of the split conformal (SC) procedure to compute prediction intervals for the random forest and LASSO predictions while relaxing the assumption of exchangeable responses, which is necessary due to the complex sampling design.
This paper is organized as follows. Section 1.1 briefly reviews the literature on small area estimation (SAE) and variable selection in the frequentist, Bayesian and machine learning frameworks. Section 2 describes the four methods that will be compared and the proposed procedure to produce prediction intervals for estimates obtained through random forests and the LASSO. Section 3 shows the results from two simulation studies. First, section 3.1 presents a comparison between the proposed modified SC and the original SC procedures. Then, section 3.2 provides a comparison between the four methods that perform variable selection. Section 4 discusses the results from the four methods applied to the Ghanaian datasets. Finally, section 5 concludes the paper with a discussion.
1.1 Literature Review
SAE concerns the estimation of area-level summaries when data are sparse or non-existent in selected areas (Rao and Molina 2015). This area of research in survey sampling has greatly evolved over the last 50 years (Pfeffermann 2002, 2013; Rao and Molina 2015; Ghosh 2020). Tzavidis et al. (2018) point out that the use of SAE by national statistical institutes and other organizations to produce official statistics reflects this increasing popularity; notable examples include the povmap software developed by the World Bank (Elbers et al. 2003; World Bank 2015) and the Small Area Income and Poverty Estimates project carried out by the US Census Bureau (Census Bureau 2018).
In survey sampling, the design-based framework is distinguished from the model-based framework. Design-based methods, also called randomization methods, assume the variable of interest to be fixed in the finite population while the randomness comes from the sampling process. Direct (weighted) estimators have favorable design-based properties in large samples and rely only on the sampling weights and the recorded responses within each sampled area to produce areal estimates. Hence, estimates for non-sampled areas are missing. Additionally, data sparsity will yield imprecise direct estimates at the areal level. Similarly, data sparsity within areas may lead to imprecise model-assisted estimates. These latter approaches also fall under the umbrella of design-based inference. Model-assisted methods are design-based approaches that model the responses to gain precision but are still design consistent (Särndal et al. 2003). An alternative is to use model-based approaches, where the responses are no longer assumed fixed but are treated as random variables, modeled using auxiliary information and/or random effects. In this framework, it is common to use exterior sources of information to augment the auxiliary information from the sample to the entire finite population; for example, using information obtained from censuses. Tzavidis et al. (2018) describe a two-step approach to produce model-based small area estimates. First, a model is fitted using the survey responses and survey auxiliary variables. Then, the outcome is predicted for the entire finite population according to the estimated model parameters and finite population auxiliary information.
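The two-step approach of Tzavidis et al. (2018) can be sketched in code. The sketch below is purely illustrative: it uses synthetic data, ordinary least squares standing in for the fitted model, and dimensions that mirror the Ghanaian application (136 sampled EAs, predictions for 5019 areas), but the covariates and coefficients are invented. The paper's analyses are conducted in R; Python is used here only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Step 1: fit a model on the survey responses and survey covariates.
# Hypothetical setup: 136 sampled areas, 5 auxiliary variables.
n_sampled, n_all, p = 136, 5019, 5
X_sample = rng.normal(size=(n_sampled, p))
beta_true = np.array([1.0, 0.5, 0.0, 0.0, -0.3])   # invented coefficients
y_sample = X_sample @ beta_true + rng.normal(scale=0.1, size=n_sampled)

X_design = np.column_stack([np.ones(n_sampled), X_sample])
beta_hat, *_ = np.linalg.lstsq(X_design, y_sample, rcond=None)

# Step 2: predict the outcome for the entire finite population using
# census covariates and the estimated model parameters.
X_census = rng.normal(size=(n_all, p))
y_pred = np.column_stack([np.ones(n_all), X_census]) @ beta_hat
print(y_pred.shape)  # (5019,)
```

In the paper the fitted model is one of the four methods compared in section 2; the point of the sketch is only the train-on-sample, predict-for-population structure.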
Abundant auxiliary information may be measured in the sample, for the sampled areas, and through exterior sources, for all the areas of the region of interest. It may therefore be necessary to select a subset of covariates to model the response variable in the presence of high-dimensional auxiliary information. In this way, precision can be increased as unnecessary auxiliary variables are not included. The inference procedure for model-based approaches can be performed under the frequentist or Bayesian frameworks, or with flexible parametric models via machine learning techniques.
Machine learning methods are becoming more popular in the survey sampling community; see for example, Wang et al. (2014) and Breidt and Opsomer (2017). However, it is not straightforward to perform inference and assess the estimates’ uncertainty under these approaches. For example, the bootstrap does not work for non-smooth targets such as LASSO estimates (Dezeure et al. 2015). Among machine learning methods, random forests (Breiman 2001) can be fitted to unit-level or area-level data for a flexible approach. Random forests are a collection of regression trees that recursively partition the responses into increasingly homogeneous subgroups (nodes), based on covariate splits. Random forests potentially have the benefit of accommodating non-linear relationships and complex interactions and naturally select variables through these covariate splits. Each individual regression tree is fitted on a bootstrap sample of the original dataset. There is a growing literature on methods to measure the uncertainty of random forest point estimates, for example using different jackknife approaches (Wager et al. 2014; Steinberger and Leeb 2016; Wager and Athey 2018) or V-statistics (Zhou et al. 2021). However, the subsampling procedures have drawbacks, including their computational overheads and their unclear application to survey data. Recently, Zhang et al. (2019) proposed the so-called out-of-bag (OOB) prediction intervals, which are computed based on quantiles of the random forest OOB prediction errors. These denote the difference between a data point’s outcome and its point estimate, obtained from a random forest grown without said data point. In simulation studies, Zhang et al. (2019) show that their proposed method performs similarly to the SC approach proposed by Lei et al. (2018).
The SC approach may be used to compute prediction intervals for any modeling method (e.g., linear models or random forests) and is relatively new in the survey sampling literature (Bersson and Hoff 2024; Wieczorek 2023). To compute prediction intervals for random forest estimates with the SC method, the original dataset is first split into two datasets. A random forest is trained on one subsample, and point estimates and their associated prediction errors are obtained for the other subsample. Then, the intervals are computed based on the empirical quantiles of the prediction errors from the second subsample. Note that while the OOB method proposed by Zhang et al. (2019) only estimates prediction intervals for random forests, the SC method can potentially be applied to any modeling procedure used to obtain point estimates. A common feature of all these prediction interval methods is that the data are assumed to be exchangeable (Angelopoulos and Bates 2023). This is a strong assumption and is not usually true for data gathered from a complex survey design.
Inference procedures for model-based approaches can also follow the frequentist or the Bayesian paradigms. In these frameworks, variable selection is an important yet contentious research topic. In the frequentist framework, two-step procedures are common. Variables are first iteratively selected (forward selection) or removed (backward elimination) to model the outcome, based on the optimization of some criterion (e.g., AIC, BIC, R2). Then, a final model that includes only the selected covariates is fitted to the data. In SAE, it is common to select variables by comparing models through some criterion (e.g., AIC or BIC, or survey sampling adjusted versions); for example, Han (2013); Rao and Molina (2015), and Lahiri and Suntornchost (2015). In the frequentist framework, regularization methods have also been proposed in the literature, such as ridge regression and the LASSO (Tibshirani 1996, 2011; McConville et al. 2017). These methods apply constraints to the regression parameters. However, in the case of the LASSO, these constraints yield estimates of the model parameters whose uncertainty estimation is difficult, especially in a survey setting. In a simulation study, Lei et al. (2018) show that their proposed SC method performs well in computing prediction intervals for predictions obtained through the LASSO, when the data are exchangeable.
In the Bayesian framework, variable selection is conducted by imposing informative priors on the model parameters. Multiple shrinkage priors have been proposed in the literature, for example, Bayesian ridge regression and the Bayesian LASSO (Hans 2010). In the former, a Gaussian prior is assigned to the regression parameters, while a double-exponential distribution is used for the latter. It can be shown that, under the respective priors, computing the maximum a posteriori estimates of the parameters results exactly in ridge-type and LASSO-type estimators (Reich and Ghosh 2019). A more recent popular approach is the use of the horseshoe prior (Carvalho et al. 2010), which imposes a priori a heavier weight toward zero than a normal or double-exponential distribution (Datta and Ghosh 2013; Porwal and Raftery 2022).
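For concreteness, the horseshoe prior of Carvalho et al. (2010) can be written in its standard hierarchical form; the display below uses the usual generic notation for a regression coefficient vector, not symbols defined in this paper:

```latex
\beta_j \mid \lambda_j, \tau \sim \mathcal{N}\!\left(0, \lambda_j^2 \tau^2\right), \qquad
\lambda_j \overset{iid}{\sim} \mathrm{C}^{+}(0, 1), \qquad
\tau \sim \mathrm{C}^{+}(0, 1), \qquad j = 1, \dots, p,
```

where C+(0, 1) denotes the standard half-Cauchy distribution. The local scales lambda_j allow individual coefficients to escape the global shrinkage controlled by tau, which produces the heavy concentration near zero mentioned above.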
2. METHODS
Note that for a non-sampled area and the estimator reduces to , with prediction interval, .
Random forests and the LASSO are considered to estimate in the model-based framework. For the sake of comparison, we also consider a forward variable selection approach in the frequentist paradigm and a Bayesian shrinkage method.
Therefore, inference is conducted using and the non-sampled mean predictions, , are computed using the available covariates’ non-sampled means, .
2.1 Random Forest and LASSO Approaches
To measure the uncertainty associated with the random forest and LASSO predictions (4) and (5), we propose a modification of the SC prediction intervals of Lei et al. (2018), relaxing the assumption of exchangeable sampled and non-sampled data points. The original SC procedure assumes and are exchangeable. However, as shown in (3), and are not exchangeable. Hence, in the proposed modified SC procedure, we assume the mean structures to be similar and allow the variances to be scaled differently, as is the case in (3). Specifically, in this context of a complex sampling design, we assume the variance is independent of the sample strata. The unit-level variance, , is assumed fixed across the strata, and the sampled and non-sampled areal-level variances only vary with the number of sampled and non-sampled units, nc and , respectively. We propose scaling the residuals computed in the original SC procedure before computing the empirical quantile necessary for the prediction intervals. This quantile is then rescaled when computing the prediction intervals. The proposed scaled SC procedure can be described through the following steps:
Randomly split into two equal sized datasets. Denote the resulting two sets of area indices by S1 and S2;
Train a random forest or a LASSO approach on and predict ;
Compute the scaled absolute residuals ;
Find , the th smallest residual R, for ;
Let the prediction interval be .
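The five steps above can be sketched in code. The sketch below is illustrative only: it uses synthetic data, ordinary least squares in place of the random forest or LASSO, and assumes the residual scaling is by the square root of the number of sampled units per area (consistent with areal variances proportional to 1/nc as described in the text, though the paper's exact formulas are not reproduced here). The paper's implementation is in R (appendix D); Python is used here for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic areal data: the mean of area c has variance sigma^2 / n_c,
# where n_c is the number of sampled units in that area (assumption).
m, p, sigma = 200, 3, 1.0
X = rng.normal(size=(m, p))
n_c = rng.integers(5, 50, size=m)
beta = np.array([1.0, -0.5, 0.2])
y = X @ beta + rng.normal(size=m) * sigma / np.sqrt(n_c)

# Step 1: random equal split into training (S1) and calibration (S2) sets.
idx = rng.permutation(m)
S1, S2 = idx[: m // 2], idx[m // 2 :]

# Step 2: train a model on S1 (least squares stands in for the random
# forest or LASSO) and predict the calibration responses.
coef, *_ = np.linalg.lstsq(X[S1], y[S1], rcond=None)
pred_S2 = X[S2] @ coef

# Step 3: scaled absolute residuals -- multiplying by sqrt(n_c) puts
# residuals from areas of different sizes on a common scale.
R = np.abs(y[S2] - pred_S2) * np.sqrt(n_c[S2])

# Step 4: the k-th smallest scaled residual, k = ceil((|S2|+1)(1-alpha)).
alpha = 0.1
k = int(np.ceil((len(S2) + 1) * (1 - alpha)))
d = np.sort(R)[min(k, len(R)) - 1]

# Step 5: interval for a new area with n_new sampled units, rescaling
# the quantile back to that area's standard deviation.
x_new, n_new = rng.normal(size=p), 15
center = float(x_new @ coef)
lower, upper = center - d / np.sqrt(n_new), center + d / np.sqrt(n_new)
print(lower < center < upper)  # True
```

The only departure from the original SC procedure of Lei et al. (2018) is in steps 3 and 5, where the residuals and the quantile are rescaled by the area-specific standard deviations.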
Appendix A in the supplementary data online provides a proof of the marginal coverage of the proposed scaled SC prediction intervals.
2.2 Forward Selection
2.3 Bayesian Shrinkage Approach
3. SIMULATION STUDY
This section presents two simulation studies that assess the performance of the proposed scaled SC procedure and compare the four modeling methods. Section 3.1 focuses on the proposed scaled SC method that computes prediction intervals while relaxing the assumption of exchangeable data points. In section 3.2, different generating models and sampling designs are studied to compare the four model selection methods described in section 2.
Inference is performed in R. The random forests of B = 1,000 trees are trained using the ranger package (Wright et al. 2022). For each simulation scenario, the random forest hyperparameters are fixed after a cross-validation study of different values. The code to conduct the proposed scaled SC procedure for random forest estimates is available in appendix D in the supplementary data online. The LASSO method is conducted through the glmnet package, using the cv.glmnet function to define the optimal shrinkage penalty parameter. Bayesian inference is performed with the NIMBLE package (de Valpine et al. 2017). Convergence of the MCMC chains is assessed through trace plots, effective sample sizes, and the R-hat statistic (Gelman and Rubin 1992).
3.1 Simulation Study: Scaled SC Procedure
(Stratified) Select all areas and within each area, sample units;
(Stratified) Select all areas and within each area, sample units;
(One-stage) Sample areas and within each area, select all units;
(Two-stage) Sample areas and within each area, sample units;
(Two-stage) Sample areas and within each area, sample units.
The proportion of sampled areas is higher in the stratified sampling designs, as all areas are selected. Hence, the areal-level inference is conducted on more data points in the first two scenarios than in the remaining three, and we expect any modeling method to perform better in these two scenarios. The one-stage and two-stage designs all yield m = 500 areal-level responses. The difference between these last three scenarios is in the sampling fraction within areas. Out of the five simulation scenarios considered, the fourth one is the closest to the Ghanaian data analyzed in section 4.
For each simulation scenario and in each sample, the estimates described in (1) are computed using four methods: a linear model that includes the correct six covariates; a linear model that omits x4, x5, and x6; a random forest method that considers all six covariates to grow the trees; and a linear LASSO model. After conducting a cross-validation study, the random forest hyperparameters were set to mtry = 2 and nodesize = 5. For each scenario, 50 percent, 80 percent, and 95 percent prediction intervals (see (2)) are computed following the SC and the proposed scaled SC procedures in each sample, for each modeling method. These non-parametric methods may be applied to any modeling approach. The objective of this simulation study is to assess whether the proposed scaled SC method yields valid coverage rates of the prediction intervals.
All results are shown in figure 1. The observed coverages for each scenario, method, and interval level are shown in the left panel and the interval widths, in the right panel. The simulation scenarios are identified by 1–5, as described above. Further, we differentiate the results for the sampled and non-sampled areas (Yes/No, respectively). Scenarios 1 and 2 present results for sampled areas, as all areas are sampled in a stratified sample. Scenario 3 shows the results for non-sampled areas because the target is exactly estimated in the sampled areas since all units are selected in a one-stage sample. Therefore, in the third scenario, we measure the predictions’ uncertainty only in the non-sampled areas.

Coverages and Widths of the Prediction Intervals (PI) Obtained From the Proposed Scaled and Original Split Conformal (SC) Procedures for the Four Modeling Methods and Across the Five Simulation Scenarios (1–5). Yes: coverages and widths across the sampled areas; No: coverages and widths across the non-sampled areas.
In the first scenario and for the sampled areas in the fourth scenario, nc and are equal. Therefore, the data points are exchangeable, and the SC and proposed scaled SC approaches are the same. In these scenarios, both methods yield the right coverages of the prediction intervals, regardless of the modeling method. In terms of interval widths, the linear model with an incorrect set of covariates leads to the widest intervals under both SC procedures. The linear model with correct covariates and the LASSO method yield the narrowest intervals regardless of the SC method.
For all other sampling schemes, . Therefore, the sampled and non-sampled means, and , are not exchangeable and have differently scaled variances, and , respectively. In these cases, the original SC intervals obtained for all four modeling methods do not attain the right coverages. The original SC leads to under-coverage of the prediction intervals. On the other hand, the proposed scaled SC procedure produces prediction intervals with the right coverages, regardless of the interval level and modeling method. In particular, when fitting a linear model with the LASSO constraint or with the right mean structure, the scaled SC intervals have exactly the right coverages. When modeling with a non-parametric random forest approach or with a linear model assuming the wrong mean structure, the scaled SC prediction intervals show slight errors in the coverage rate.
In terms of interval width, the proposed scaled SC intervals tend to be a little wider than their original SC counterparts for all interval levels. Regardless of the simulation scenario, the SC intervals and proposed scaled SC intervals obtained for the random forest estimates tend to be narrower than the ones obtained for the linear estimates using the incorrect set of covariates.
Finally, appendix E in the supplementary data online shows the results from the equivalent simulation study in the design-based framework. In this case, the finite population is assumed fixed and different samples are taken. Similar results are obtained: when , the proposed scaled SC procedure yields the correct coverages, while the original SC approach produces under-covering prediction intervals.
3.2 Simulation Study: Prediction Methods Comparison
To compare the performance of the random forest and the LASSO methods to the frequentist forward variable selection and the Bayesian shrinkage approaches, as described in section 2, a model-based simulation study is conducted considering three generating models for the outcome and five sampling designs. Similar to section 3.1 and appendix E in the supplementary data online, a design-based counterpart to this simulation study is shown in appendix F in the supplementary data online, producing similar results. We replicate R = 100 times the creation of three finite populations of M = 1,000 areas of sizes Nc with and as follows:
where the covariates are such that and with coefficients ;
where the covariates are such that with and ;
with covariates .
The covariates are the same across the replicated populations and the randomness comes from the y’s. Populations A and B assume a linear relationship between the outcome and the first ten covariates. In scenario B, however, the strength of the association is weak, and the covariates are correlated. Population C is inspired by Scornet (2017) and assumes a non-linear relationship between the outcome and covariates. Throughout this simulation study, areas are interchangeably termed “areas” or “EAs.” From each finite population, a sample is drawn following the two sampling schemes:
(Stratified) Select all areas and within each area, sample units;
(Two-stage) Sample areas and within each area, sample units.
These simulation scenarios were motivated by the Ghanaian data analyzed in section 4, where only 15 households were sampled within the selected areas. Additional sampling designs are considered in appendix G in the supplementary data online, where a higher number of units is sampled within the selected areas, and in appendix H in the supplementary data online, where a simulation study aims to emulate the Ghanaian data analyzed in section 4. For each scenario, the estimates and their prediction intervals are computed as described in section 2. Further, for each scenario, the estimates and their uncertainty are also computed assuming anonymized EAs. In this context, the modeling methods are trained on the sample and predictions are obtained ignoring which areas have been sampled, that is, assuming at the prediction stage. We conducted the latter study because the available Ghanaian data that are analyzed in section 4 do not identify the sampled EAs. The random forest hyperparameters are set following a cross-validation study conducted for each simulation scenario. A random forest with hyperparameters set to is found to perform the best for both sampling schemes in populations A and B. In population C, we set and , for the stratified samples and the two-stage samples, respectively. In all scenarios, the Bayesian approach runs through an MCMC procedure with two chains of 5,000 iterations, including a burn-in period of 2,500 iterations; practical convergence was validated for these values by assessing trace plots, effective sample sizes, and R-hat statistics (Gelman and Rubin 1992). For a particular finite population and sampling scheme, running the random forest over 100 replicates takes 55 minutes, the LASSO takes 4 minutes, the forward selection method takes 7 minutes, and the Bayesian approach, 2.5 hours.
The methods’ performances are compared via four measures: the mean absolute bias ; mean squared error ; coverages of 50 percent, 80 percent, and 95 percent prediction intervals, where with the prediction interval; and their proper interval score (Gneiting and Raftery 2007). Smaller values of the proper interval score are preferred, indicating narrow intervals with empirical coverage close to the nominal level. Additionally, we extract which covariates have been selected by each method. Note that in the Bayesian framework, a covariate is said to be selected when its coefficient’s posterior 95 percent credible interval does not include zero. For the random forest approach, a variable is selected when the p-value related to its importance (Altmann et al. 2010) is smaller than 0.05. The variable importance is computed based on results from random forests fitted with permutations of the set of covariates.
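The proper interval score of Gneiting and Raftery (2007) for a central (1 - alpha) prediction interval [l, u] and outcome y is (u - l) + (2/alpha)(l - y) when y < l, and (u - l) + (2/alpha)(y - u) when y > u; a covered outcome contributes only the width. A minimal Python sketch (the paper's computations are in R):

```python
import numpy as np

def interval_score(lower, upper, y, alpha):
    """Proper interval score of Gneiting and Raftery (2007) for a
    central (1 - alpha) prediction interval [lower, upper]."""
    lower, upper, y = (np.asarray(v, dtype=float) for v in (lower, upper, y))
    width = upper - lower
    penalty_below = (2.0 / alpha) * (lower - y) * (y < lower)
    penalty_above = (2.0 / alpha) * (y - upper) * (y > upper)
    return width + penalty_below + penalty_above

# A covered outcome contributes only the interval width,
print(round(float(interval_score(0.0, 1.0, 0.5, alpha=0.05)), 6))  # 1.0
# while a miss is penalized at rate 2 / alpha per unit of distance.
print(round(float(interval_score(0.0, 1.0, 1.1, alpha=0.05)), 6))  # 5.0
```

Averaging this score over areas rewards intervals that are both narrow and well calibrated, which is why smaller values are preferred.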
Figure 2 shows the selected covariates by each method for each model and sampling design. When the association is linear between the covariates and the outcome (A and B), regardless of the sampling design, the forward selection approach tends to adequately select the true auxiliary information. However, it also tends to select irrelevant variables. Each unimportant covariate is selected about 20 percent of the time by the forward selection method. In scenario A, the LASSO and Bayesian approaches also select the right covariates 100 percent of the time, while rarely including redundant covariates. When the covariates are correlated, the LASSO and Bayesian methods tend to miss the right covariates 20–70 percent of the time, depending on the sampling design. In scenarios A and B, the random forest method misses the right covariates 10–50 percent of the time, while it always captures the correct set when the association is non-linear. The LASSO selects one out of two correct covariates about 80 percent of the time in this third population. Both the forward selection and Bayesian approaches miss the correct set of covariates in scenario C almost 100 percent of the time and include irrelevant variables.

Covariate Selection Frequency for Each Method Across the Six Simulation Scenarios. Left of the vertical dashed line: true covariates used in the generating models.
Figure 3 shows the absolute biases multiplied by 100, MSEs, prediction intervals’ coverages, and proper scores for all methods, generating models (A–C), and sampling schemes. The results for the sampled and non-sampled areas are differentiated by red and black symbols, respectively. Only results for the sampled EAs are produced for the stratified sampling design, as all areas are sampled. The results assuming the anonymized EAs are distinguished from the ones in which we know which areas have been sampled by the circle and cross symbols, respectively. The results shown in figure 3 are also provided in tables I.1–I.8 of appendix I in the supplementary data online.

Mean Absolute Bias, MSE, Coverages and Proper Scores of the Prediction Intervals, Obtained for Each Method Across the Six Simulation Scenarios. RF: Random forest approach.
Mean Absolute Bias, MSE, Coverages, and Proper Scores of the 50 Percent, 80 Percent and 95 Percent Prediction Intervals, Obtained for Each Method in the Eight-Fold Cross-Validation Study on the GAMA Sample
. | Absolute bias . | MSE . | PI coverage . | Proper interval score . | ||||
---|---|---|---|---|---|---|---|---|
. | . | 95% . | 80% . | 50% . | 95% . | 80% . | 50% . | |
Bayesian shrinkage | 0.244 | 0.086 | 94.1 | 80.9 | 48.5 | 1.33 | 1.91 | 3.86 |
Forward selection | 0.975 | 0.168 | 72.1 | 52.2 | 28.7 | 3.55 | 5.35 | 8.03 |
LASSO | 0.965 | 0.133 | 91.9 | 76.5 | 49.3 | 1.75 | 2.48 | 4.65 |
Random forest | 0.516 | 0.097 | 91.9 | 79.4 | 50.0 | 1.61 | 2.08 | 4.16 |
For all performance measures, the four modeling approaches yield similar results whether or not it is known which areas have been sampled. For example, in population C with a two-stage sampling design, the MSE results over the anonymized sampled EAs are no worse than the results over the non-sampled EAs, regardless of the modeling method. This is reassuring for the analysis of the Ghanaian data, where the sampled EAs are anonymized.
In terms of bias, all methods are virtually unbiased, with mean absolute biases between zero and 0.08, regardless of the population and sampling design. In the linear scenarios, the random forest tends to yield slightly higher mean absolute biases than the forward selection, LASSO, and Bayesian methods, which is sensible because this non-parametric approach does not exploit the correct (linear) mean structure.
In terms of MSE, there is no apparent difference between the LASSO, forward selection, and Bayesian methods across all simulation models and sampling schemes. These three methods, which fit linear models, yield slightly smaller MSEs than the random forest approach when the association between the covariates and the outcome is strongly linear (A). For scenario C, however, the random forest produces MSEs about one-third of those of the three linear modeling approaches, regardless of the sampling scheme. This is explained by the fact that the random forest, being non-parametric, adapts better to the non-linear setting.
The prediction intervals computed for all four methods in each sampling scheme for populations A and B yield the right coverages, with slight under-coverage for the random forest approach; this may be due to the incorrect mean structure that is fitted to the data. The random forest intervals are wider in model A for both sampling designs, as deduced from the proper interval scores. When the relationship between the outcome and the covariates is non-linear (C), all four modeling methods yield under-coverage in both sampling designs. The random forest method, which accommodates a non-linear relationship, leads to prediction intervals with slightly higher coverage rates than the other three methods, in particular across the sampled areas, but still misses the targeted rates by about 30 percentage points. Note, however, that the random forest approach produces prediction intervals with smaller proper scores than the other three modeling methods.
4. AREAL LOG CONSUMPTION PREDICTION IN THE GAMA
In this section, the four modeling methods described in section 2 are applied to the data for the GAMA in Ghana. Using the sixth GLSS and the 2010 Ghanaian census, a complete map of the mean equivalized consumption (on the log scale) is produced across the M = 5019 EAs for each method. Note that in the household survey, only m = 136 EAs have been sampled. To provide estimates in the missing areas, the response values are modeled using the p = 174 auxiliary variables, which are measured in both available datasets and have been scaled for computational efficiency.
For the random forest approach, due to the small sample size m and following Hastie et al. (2009), a cross-validation study on the survey data was run to set the hyperparameters to B = 1,000 trees grown with mtry = 25 and nodesize = 3. The Bayesian approach required two MCMC chains of 100,000 iterations, including a burn-in period of 50,000, and a thinning factor of 50. Convergence was assessed through the trace plots, effective sample sizes, and R̂ statistics smaller than 1.1.
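As an illustration, a hyperparameter search of this kind can be sketched with scikit-learn, where `max_features` and `min_samples_leaf` play the roles of `mtry` and `nodesize` in R's randomForest. The synthetic data, the grid values, and the smaller number of trees below are placeholders for speed, not the paper's settings.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(0)
# Synthetic stand-in for the m = 136 sampled EAs and p = 174 scaled covariates.
m, p = 136, 174
X = rng.normal(size=(m, p))
y = X[:, 0] - 0.5 * X[:, 1] + rng.normal(scale=0.5, size=m)

# Grid over the two tuned hyperparameters; fewer trees than B = 1,000
# to keep this sketch quick.
grid = GridSearchCV(
    RandomForestRegressor(n_estimators=200, random_state=0),
    param_grid={"max_features": [10, 25, 50], "min_samples_leaf": [1, 3, 5]},
    cv=5,
    scoring="neg_mean_squared_error",
)
grid.fit(X, y)
print(grid.best_params_)
```

The selected combination would then be re-fitted on the full sample with the larger number of trees.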
Wakefield et al. (2020) point out the importance of including the design variables in model-based SAE methods. To that end, the urban indicator, which corresponds to the sampling strata, is added to all four modeling methods. In the forward selection approach, this means that the urban indicator is added to the vector of selected covariates, even if it was not selected in the first step. In the Bayesian shrinkage and LASSO methods, it means no shrinkage is applied to the regression coefficient corresponding to the urban indicator. Finally, in the random forest approach, it means that the urban indicator is part of the variables considered for each covariate split. Figure 4 presents the covariates selected by each method. Despite all methods including the urban indicator, only the random forest finds it relevant, with a p-value for its variable importance smaller than 0.05. Additionally, figure 4 shows that the horseshoe prior leads to only one variable whose coefficient's posterior 95 percent credible interval does not include zero. The LASSO approach selects about 6 percent of the available covariates (11 variables), while the forward selection and random forest methods select more than 12 percent of the variables (21 and 22, respectively). The variable indicating whether a household's floor is made of cement or concrete is selected by all four methods.

Selected Covariates for Each Method When Modeling the Log Equivalized Consumption in GAMA.
Figure 5 shows the areal estimates of mean log consumption and the widths of their 95 percent prediction intervals for each of the four methods. Among the four methods, the random forest approach yields the most homogeneous point estimates across the EAs, with predictions ranging between eight and nine, compared to a range of seven to ten for the other three methods. This can further be seen in figure 6, which compares the predictions obtained using each method for each EA. The prediction interval widths are shown across the EAs in figure 5 and compared between the modeling methods in pairwise scatter plots in figure 7. The prediction intervals computed for the linear approach with forward variable selection are the narrowest. As expected, the widths of the intervals obtained through the proposed scaled SC approach for the random forest and LASSO predictions behave similarly: the widths are of the form 2σ̂ᵢq̂, where the empirical quantile q̂ of the scaled residuals is the only quantity that differs between the LASSO and random forest approaches. Note that in this analysis, the scaled SC procedure divides the dataset into two halves, consequently computing the necessary residuals and quantile based on only m/2 = 68 data points.
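The mechanics of a scaled split conformal interval can be sketched as follows. This is a minimal illustration, not the paper's implementation: the per-point scale factors (called `sigma` here) are taken as given inputs, whereas the paper derives them from the area-level variance structure, and the working model is a LASSO with an arbitrary penalty.

```python
import numpy as np
from sklearn.linear_model import Lasso

def scaled_split_conformal(X_train, y_train, sigma_train, X_new, sigma_new,
                           model, alpha=0.05, seed=0):
    """Sketch: split the sample, fit the model on one half, and calibrate
    a quantile of the sigma-scaled absolute residuals on the other half."""
    rng = np.random.default_rng(seed)
    n = len(y_train)
    idx = rng.permutation(n)
    fit_idx, cal_idx = idx[: n // 2], idx[n // 2:]

    model.fit(X_train[fit_idx], y_train[fit_idx])
    # Absolute residuals on the calibration half, scaled point by point.
    resid = np.abs(y_train[cal_idx] - model.predict(X_train[cal_idx]))
    scaled = resid / sigma_train[cal_idx]
    # Finite-sample-corrected empirical quantile of the scaled residuals.
    k = int(np.ceil((len(cal_idx) + 1) * (1 - alpha)))
    q = np.sort(scaled)[min(k, len(cal_idx)) - 1]

    pred = model.predict(X_new)
    return pred - q * sigma_new, pred + q * sigma_new

# Toy check with constant scales, where the procedure reduces to the
# ordinary split conformal algorithm.
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 5))
y = X[:, 0] + rng.normal(scale=0.3, size=200)
lo, hi = scaled_split_conformal(X[:150], y[:150], np.ones(150),
                                X[150:], np.ones(50), Lasso(alpha=0.01))
print(np.mean((y[150:] >= lo) & (y[150:] <= hi)))
```

With constant scales all intervals share the same width 2q̂; heterogeneous scales widen or narrow each interval proportionally.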

Estimated Mean Log Equivalized Consumption in the GAMA EAs (Left) and Widths of the Corresponding 95 Percent Prediction Intervals (Right) Obtained From Each Modeling Method. RF: random forest.

Pairwise Comparison of the Areal Estimates Obtained From Each of the Four Methods: Forward Selection, LASSO, Bayesian Shrinkage and Random Forest.

Pairwise Comparison of the Areal Prediction Interval Widths Obtained From Each of the Four Methods: Forward Selection, LASSO, Bayesian Shrinkage and Random Forest.
Finally, to determine which method performs best in this particular data application, an eight-fold cross-validation study is conducted. The 136 sampled EAs comprise eight rural and 128 urban EAs. In each fold, 17 EAs (one rural and 16 urban) are removed from the sample, the four methods are fitted on the remaining 119 EAs, and predictions are obtained for the 17 removed EAs. The four methods are compared in terms of mean absolute bias, MSE, and coverages and proper scores of the 50 percent, 80 percent, and 95 percent prediction intervals in table 1. These performance measures are computed as described in section 3.2, with the number of replicates R corresponding to the eight folds and the total number of areas M replaced by the number of sampled EAs. The Bayesian shrinkage approach performs the best among the four methods we consider, yielding the smallest bias, MSE, and interval scores, and attaining the right coverage rates of the prediction intervals. In contrast, the prediction intervals obtained for the forward selection approach show substantial under-coverage. A cross-validation study was also conducted in which the four methods were fitted without forcing the inclusion of the urban indicator; the results are not shown, as the performance of the four methods in terms of bias, MSE, coverage, and proper score was similar to that in table 1. Likewise, the Bayesian shrinkage method considered here applies the horseshoe prior to the regression coefficients, and other priors could have been considered, such as a Bayesian ridge; a cross-validation study with the Bayesian ridge approach on the GAMA sample again gave results similar to those in table 1, so they are not presented.
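A common choice for a proper interval score is that of Gneiting and Raftery (2007): the interval width plus a penalty of 2/α per unit by which the observation falls outside the (1 − α) interval. Assuming this is the score reported in table 1, a minimal implementation (variable names are ours):

```python
import numpy as np

def interval_score(lower, upper, y, alpha):
    """Proper interval score of Gneiting and Raftery (2007): width plus a
    2/alpha penalty per unit the observation falls outside the interval."""
    lower, upper, y = map(np.asarray, (lower, upper, y))
    width = upper - lower
    below = (2.0 / alpha) * np.clip(lower - y, 0, None)
    above = (2.0 / alpha) * np.clip(y - upper, 0, None)
    return width + below + above

# A covered observation contributes only the interval width,
# while a miss is penalized heavily at the 95 percent level.
print(interval_score(0.0, 1.0, 0.5, alpha=0.05))   # width only
print(interval_score(0.0, 1.0, 1.1, alpha=0.05))   # width + 40 * overshoot
```

Smaller scores are better: the score rewards narrow intervals but only if they cover, which is why the under-covering forward selection intervals score worst in table 1 despite being the narrowest.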
On the original scale, on average, we find that the Bayesian shrinkage consumption estimates among the richest 10 percent of EAs are 2.3 times larger than those among the poorest 10 percent. We also find that urban EAs, which make up 92 percent of all EAs, are not uniformly distributed across the estimated consumption deciles: only 79 percent of the poorest decile is urban, versus 91 percent of the richest decile. Following Dong and Wakefield (2021), we rank the EAs from poorest to richest based on the Bayesian shrinkage point estimates, to identify the EAs where interventions should be prioritized. In particular, in this Bayesian framework, we obtain each EA's posterior ranking distribution by ranking the point estimates at each MCMC iteration. Figure 8 shows the posterior ranking distributions for five of the 10 percent poorest EAs and five of the 10 percent richest EAs. Additionally, the right-hand side of figure 8 maps the 10 percent poorest and richest EAs. We find that the Greater Accra South district, which corresponds to the western EAs in figure 8, contains most of the poorest EAs, while the Accra Metropolitan Area district, which corresponds to the southern EAs, is the richest. Figure 8 further shows that the ranking distributions of the 500 poorest EAs overlap, which suggests a need to intervene in all of the poorest 500 EAs.
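The ranking step can be sketched as follows, with synthetic draws standing in for the MCMC samples of the area-level means; the dimensions and threshold are illustrative only.

```python
import numpy as np

# Toy stand-in: S MCMC draws of area-level means for M areas.
rng = np.random.default_rng(2)
S, M = 1000, 8
true_means = np.linspace(8.0, 9.0, M)          # log-consumption scale
draws = true_means + rng.normal(scale=0.2, size=(S, M))

# Rank the areas at every MCMC iteration (1 = poorest); the ranks across
# draws form each area's posterior ranking distribution.
ranks = draws.argsort(axis=1).argsort(axis=1) + 1

# Posterior probability that each area is among the poorest 25 percent.
prob_poorest = (ranks <= M // 4).mean(axis=0)
print(np.round(prob_poorest, 2))
```

Overlapping ranking distributions, as in figure 8, show up here as probabilities spread over several areas rather than concentrated on exactly M/4 of them.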

Left: Histograms of the Posterior Ranking Distributions of Five of the 10 Percent Poorest EAs (Left Column, Red) and Five of the 10 Percent Richest EAs (Right Column, Green), as Estimated From the MCMC Samples Obtained for the Bayesian Shrinkage Approach. Right: Map of the Greater Accra Metropolitan Area Highlighting the 500 Poorest EAs (Red) and the 500 Richest EAs (Green). There are a total of 5019 EAs in the study region.
5. DISCUSSION
In this paper, approaches based on random forests and the LASSO are compared with a frequentist forward variable selection procedure and a Bayesian shrinkage method to estimate area-level means of a variable of interest when abundant auxiliary variables are available. Throughout, the areas correspond to the sampling clusters. The methods are area-level model-based small area prediction procedures used to obtain areal estimates and their uncertainties. First, a random forest approach models the outcome values. By construction, auxiliary variables are selected when partitioning the response values through covariate splits. Then, in the frequentist framework, a LASSO method selects covariates by shrinking irrelevant regression coefficients toward zero. To measure the uncertainty of estimates obtained from random forests and the LASSO methods, a modification of the SC procedure is proposed. The SC algorithm (Lei et al. 2018) estimates prediction intervals with no specific distribution assumption for the data. However, the data are assumed to be exchangeable. The proposed scaled SC procedure relaxes the assumption that the data are exchangeable. Specifically, the proposed algorithm allows the data points to have variances of different scales. This proposed scaled SC procedure allows inference to be conducted for the random forest and the LASSO estimates.
A first simulation study assesses the performance of the proposed scaled SC method against the original SC procedure. When the data points are exchangeable, both procedures perform similarly, regardless of the modeling method. In the simulation scenarios where the number of sampled units differs from the number of non-sampled units, the variances of the predicted quantities are on different scales, so the data are not exchangeable. Hence, the SC procedure does not yield the appropriate coverage rates for the prediction intervals in these scenarios. The proposed scaled SC method corrects the under-coverage in all the simulation scenarios considered.
The random forest and LASSO approaches are compared with the frequentist forward selection and Bayesian shrinkage methods in an additional simulation study. When data are generated from a linear model, the methods that assume normality yield smaller biases and MSEs than the random forest approach. All modeling methods, however, lead to adequate prediction interval coverages. The random forest method performs better in terms of MSE when the data are generated from a non-linear model. All methods yield under-coverage when few units are selected within the sampled areas in this complex population.
In the sixth GLSS, conducted from 2012 to 2013, the log equivalized consumption is measured at the household level in a small fraction of the areas (EAs) within the GAMA, alongside 174 auxiliary variables. The same auxiliary information is recorded for all the GAMA EAs in the 2010 Ghanaian census. Using both datasets and the four EA-level model-based approaches, areal estimates of the mean log equivalized consumption are computed for all EAs in the GAMA, together with prediction intervals that measure their uncertainty. The forward selection and random forest methods each select more than 12 percent of the auxiliary variables and the LASSO about 6 percent, while the Bayesian horseshoe model yields a posterior credible interval that excludes zero for only one coefficient. The random forest procedure estimates a smoother map of the mean log consumption than the other three approaches. A cross-validation study conducted on the sample data shows that the Bayesian shrinkage method performs the best, among the four methods considered, on this particular dataset.
Finally, before fitting random forests to the different datasets, cross-validation studies were run to set the hyperparameters: the number of regression trees in the forest, the number of variables considered at each covariate split, and the terminal node size. This step could be improved, as other hyperparameter values might have led to better-performing random forests; for further discussion of random forest hyperparameter selection, see McConville and Toth (2019) and Dagdoug et al. (2023). In addition, the proposed scaled SC procedure used to compute prediction intervals for the random forest and LASSO estimates relies on an equal split of the data points, one half to fit the model and the other to compute prediction errors. In the data application of this paper, this implies that the prediction interval limits are based on 68 data points. This equal partition, suggested by Lei et al. (2018) for the original SC algorithm, could be revisited to attempt to narrow the resulting intervals.
Supplementary Materials
Supplementary materials are available online at academic.oup.com/jssam.
This work was supported by the Fonds de Recherche Nature et Technologies [B2X – 314857 to V.M.] and Institut de valorisation des données (IVADO) [Fundamental Research Project, PRF-2019-6839748021 to A.M.S., J.B., and B.E.R.].