MULTIPLE IMPUTATION WITH SURVEY WEIGHTS: A MULTILEVEL APPROACH

Multiple imputation established as a practical and ﬂexible method for analyzing partially observed data, particularly under the missing at random assumption. However, when the substantive model is a weighted analysis, there is concern about the empirical performance of Rubin’s rules and also about how to appropriately incorporate possible interaction between the weights and the distribution of the study variables. One approach that has been suggested is to include the weights in the imputation model, potentially also allowing for interactions with the other variables. We show that the theoretical criterion justifying this approach can be approximately satisﬁed if we stratify the weights to deﬁne level-two units in our data set and include random intercepts in the imputation model. Further, if we let the covariance matrix of the variables have a random distribution across the level-two units, we also allow imputation to reﬂect any interaction between weight strata and the distribution of the variables. We evaluate our proposal in a number of simulation scenarios, showing it has promising performance both in terms of coverage levels of the model parameters and bias of the associated Rubin’s variance estimates. We illus-trate its application to a weighted analysis of factors predicting reception-year readiness in children in the UK Millennium Cohort Study.


INTRODUCTION
When collecting data for research, it is often the case that we are not able to obtain all the desired information for various reasons (e.g., lack of resources, unwillingness to disclose information, loss to follow-up). Unfortunately, such missing data complicate the intended analysis, not only causing a loss of power but also potentially biasing the results-when the reason for the missing data is associated with our scientific question.
For this reason, many methods have been developed to deal with missing data, each relying on a series of different assumptions. One of the biggest categories of missing data methods is represented by imputation strategies. Imputing missing data means replacing the missing values with a particular value, drawn from a specified distribution, typically from the conditional distribution of the missing data given the observed data. Fitting the substantive analysis model to such an imputed dataset gives the same weight to observed and imputed values; however, the latter are, at best, good guesses, and therefore, they should be somehow down-weighted. Otherwise, such an approach will result in marked underestimation of the standard errors because of a failure to reflect uncertainty due to the missing values.
In some specific settings, methods to obtain a valid variance estimate under single imputation have been derived (Rao and Shao 1992;S€ arndal, Swensson, and Wretman 1992;Rao 1996;Beaumont, Haziza, and Bocci 2011), and these are often used to handle missing data in surveys.
Alternatively, a very flexible, general method to address the same issue is multiple imputation (MI) Rubin (1987). With MI, the missing values are imputed from the Bayesian predictive distribution of the missing data, given the observed data, to create K imputed datasets. The substantive analysis model is then fitted to each of these in turn, giving K different estimates of the model parametersĥ k together with their standard error estimatesr k . These are combined for final inference using Rubin's rules: ðĥ k Àĥ MI Þ 2 : Over the last 25 years, the practicality and flexibility of MI, coupled with the availability of accessible software, has led it to become increasingly popular, particularly in clinical research. As with all statistical methods, MI relies on some assumptions; in particular, most MI methods assume data are Missing at Random (MAR) (Rubin 1976), which broadly means that the reason for their missingness is unrelated to the unseen values, after conditioning on all the observed data. Another important issue is that of congeniality. This was raised in the original work from Rubin and was thoroughly investigated in a series of articles in the mid-nineties (Fay 1992(Fay , 1993Meng 1994). These highlighted that in order for MI to lead to valid inference, the imputation model (i.e., the model used to impute the data) and the substantive model (i.e., the original model we wanted to fit on the complete data) must be congenial, which means, loosely speaking, that they need to be derived from the same joint model. In some situations, particularly when the substantive model has nonlinear effects or interactions, it can be challenging to choose a congenial imputation model (Goldstein, Carpenter, and Browne 2014;Bartlett, Seaman, White, and Carpenter 2015; for a practical review of the issue, see Carpenter and Kenward 2013, pp. 64-73).
Another issue that was discussed in the study by Meng (1994) is that of selfefficiency; a procedure is self-efficient if it is not possible to gain precision by applying it to a subset of the whole data. Self-efficiency of the complete-data procedure is required for the validity of MI inference (Meng and Romero 2003).
This article focuses on the situation in which the substantive model is a weighted regression model; this is common in survey sampling settings, where appropriate weighting is often used to take account of the sampling schemes (e.g., S€ arndal et al. 1992). Throughout, we assume the weights considered are the final ones, after adjustments for nonresponse (Holt and Elliot 1991) and calibration (Deville and S€ arndal 1992). The idea of MI was originally expounded in a survey setting, and in Rubin (1987), it was implicitly assumed that the imputer should have access to the variables used to construct any weights and should always include them in the imputation model. However, discussion of weighting was limited to a brief reference in the introduction, where it was noted that "[weighting's] apparent simplicity disappears with multivariate outcomes", followed by two excercises. Two questions remained unanswered at the time: (i) How should we include weights in the imputation model? (ii) Does Rubin's variance formula still hold in these settings?
To answer these questions, it is important to clarify the inferential framework under which properties of MI are to be evaluated. For example, if evaluating the properties with respect to the joint distribution of the response mechanism and the sampling mechanism, Rubin's variance estimator is valid under the assumption of proper imputation (Rubin 1987). Kim, Michael Brick Fuller, and Kalton (2006) took a different approach and evaluated the properties with respect to the joint distribution of the sampling mechanism, the response mechanism, the imputation mechanism, and a super-population model. They showed that, because of a lack of self-efficiency, even in the case of a simple weighted regression model with missing data in the outcome only, using a standard imputation model with Rubin's rules results in an upwardly biased estimate of the variance. In particular, assuming both our substantive and imputation model are of the form: and that we want to use the weighted least square estimator: then the bias in the Rubin's variance estimator is where r 2 is the residual variance, the subscript "m" indexes missing observations, and "o" to the observed, so that for example, X o represents the set of covariates X for complete records. As Kim et al. (2006) pointed out, a practically important consequence follows from this expression: the bias vanishes if the weights are included in the space spanned by the variates in the regression model (i.e., for w o ¼ X o d and w m ¼ X m d, for some value of d). Therefore, a simple way to correct for the bias in Rubin's variance estimator when using survey weights, at least with linear regression, is to introduce these into the linear predictor. However, Kim et al. (2006) also picked up the result of Meng (1994) that the MI variance estimator is biased upward if the imputer assumes more than the analyst; therefore, for accurate inference within domains for survey data, the imputer needs to include in his model (i) the weights, (ii) all the domain indicators (i.e., all relevant covariates of the weighted regression), and (iii) their interactions for valid MI inference in general. Seaman et al. (2012) extended Kim's results, showing that if using multiple imputation with this correctly specified model, Rubin's variance estimate is asymptotically unbiased with missing data in the outcome only. With missing covariates, there is generally an upward bias in the variance; however, the simulation results they report suggest this is of little practical concern.
In practice, a key issue is the correct specification of the imputation model, taking into account the weights and their interaction with the covariates. Ideally, we would impute separately in each weight strata so that the relationship between the variables are allowed to differ across weight strata. Thus ideally, we should include weight as a factor variable, possibly with interactions with other variables, alongside including it as a linear variable to avoid the bias in Rubin's MI variance estimator. In general, this will make estimation noisy because of the increased number of parameters.
Instead in this article, we propose using the weight variable to define a second level and then adopt a multilevel approach.
In section 2, we describe our proposed approach and how it approximately satisfies Kim et al.'s (2006) criteria. Then in section 3, we describe a set of simulation studies to evaluate our proposal against alternative approaches. The results are given in section 4. In section 5, we apply the same methods to handle missing items in wave 2 of the Millennium Cohort Study; finally, we conclude in section 6 with a discussion.

METHODS
As set out in the previous section, the situation we are interested in is a sample survey dataset, provided with weights and affected by missing items. This is exemplified by the Millennium Cohort Study we analyze in section 5. Our setup follows Kim et al. (2006); we assume we have a complex sample from an infinite super-population. We evaluate the properties of Rubin's MI variance estimator considering the joint distribution of the sampling, response, imputation, and super-population models.

Proposed Approach
We have already noted that the bias in Rubin's MI variance formula, given by (5), vanishes if the weights are included in the space of the variates spanned in the regression model (i.e., if w o ¼ X o d and w m ¼ X m d, for some value of d).
For example, suppose that we have p covariates for each unit, of which the first is the intercept and the second the weight. Then if the p Â 1 vector d ¼ ð0; 1; 0; . . . ; 0Þ T ; this criterion is satisfied. However, while this is sufficient for valid variance estimation using Rubin's rules for a mean, as noted previously, it is insufficient when we have domains in our data; then we need to include both the domain indicators and their interactions with the weights.
Instead, now suppose that we group the weights, without loss of generality, into g ¼ 1; . . . ; G groups. We include additional G dummy variables as the leftmost covariates in X indexing which of the groups unit i's weight belongs to. Also, let d ¼ ð w 1 ; w 2 ; . . . ; w G ; 0; . . . ; 0Þ T ; where w g is the mean weight for group g. Now the criteria for the bias in the variance vanishing is approximately satisfied. Further, the approximation will improve as the weight SD within the groups decreases. This is often possible to do in applications because the weights are calculated (often by the data provider) from a set of categorical predictors. Importantly, we also note that including these extra parameters in the regression model of interest changes from one that is marginal to the weights to one that conditions on them and that this may not be the model of concern for the analyst.
While approximately satisfying the criterion for Rubin's variance formula to work, this approach also has the advantage that it does not require the relationship between the weights and the dependent variable to be linear; it is unstructured across the groups. However in general, fitting a large number of fixed parameters is not desirable nor is it consistent with the aims of the data analyst, as pointed out previously.
Instead, we propose letting the G weight groups define a second level in the data and including random intercepts. This still approximately satisfies the criterion for bias in (5) to vanish, but now we can pool information across weight groups where appropriate. In other words, when we impose the standard assumption that the random intercept distribution has zero mean, the fixed part of the model will represent the (marginal) expected relationship for the population, as desired by the analyst. This is not sufficient in general, though, because ideally (as noted in the introduction) we should allow for an interaction between the weights and the other variables in the imputation model. We can do this by allowing the covariance matrix of the (level-one) variables to vary across the (level-two) units (weight-strata). Again, rather than introduce a lot of parameters, we can give the covariance matrix a random distribution across strata.
For a specific example of our proposal, suppose that the substantive model is a weighted linear regression of y i;j on x 1;i;j ; x 2;i;j , where i ¼ 1; . . . ; n j indexes units in strata j ¼ 1; . . . ; J with weight w j . Suppose data are MAR. We let the weight strata define level two, and our imputation model is: where W -1 denotes the inverse Wishart distribution. Notice this includes the random intercepts u k;j for each variable k and that the level-1 covariance matrix X j varies across weight strata j, allowing the association of y, x 1 , x 2 to vary across strata. h k;0;j represent the overall means of the three variables, while W is the level-two covariance matrix. This model was proposed in different context by Yucel (2011) and developed for individual patient data meta-analysis by Quartagno and Carpenter (2016). Compared with including the weights as a linear term in the imputation model, together with their interaction with the other variables, model (6) has the advantage that the relationship across the weight strata is not required to be linear; it is driven by the data, and information is pooled across strata as appropriate. While in general it only approximately satisfies the criteria for Rubin's variance formula to hold, we will show by simulations that the difference between the empirical and Rubin's MI variance is small or negligible, and suggesting this will be satisfactory in applications.
Note that if the weight is common in each group G, then as the number of observations in each strata gets large, this approach tends to the natural-and often optimal-approach of imputing separately in each strata. However, if the proportion of missing in some strata is high, our approach may be able to improve on this.
Having outlined our proposal, we now evaluate it using a series of simulation studies, comparing with imputing separately in each strata, ignoring the weights in the imputation, and including them in various ways.

SIMULATION STUDIES
First, we describe the base-case simulation scenario, before outlining the methods we are comparing with (6), and briefly discuss their relative merits. We conclude this section by describing three additional simulation scenarios.
The simulation scenarios are designed to reveal differences between the methods. In all the scenarios, we consider three variables: Y, X 1 , and X 2 ; our total sample size is 400 individuals, stratified in ten equal-sized strata, each with corresponding known weight.

Base-Case Scenario
The base-case scenario simulated data from the following mechanism: 3; 4; 5; 6; 7; 8; 9; 10Þ Here, j indexes different weight strata. After having generated the data, the substantive analysis model is a weighted linear regression, where Y is the dependent variable, X 1 and X 2 are the covariates, and we assume the weights are known and equal to: This may seem an extreme choice and in applications weights (and fixed effect parameters b) would probably be more homogenous; however, we decided to use such extreme values in order to bring out the properties of the methods. We simulate 1,000 data sets and make Y Missing Completely at Random (MCAR) with probability 0.5. We compare analysis of the original full data (FD) and complete records (CR) and use the competing multiple imputation methods we now describe. All of the imputation models are fitted by means of a Gibbs sampling algorithm using data augmentation to impute the missing data, using the R-package jomo (Quartagno, Grund, and Carpenter 2018).

Imputation Methods
We now describe the seven imputation approaches that we compare.

Multiple imputation with no weights (MI-noW).
Multiple imputation with no weights (MI-noW) uses the first and simplest imputation model we might consider. It consists of a multivariate normal model for the three partially observed variables and does not make any use of the weights: We know from section 1 that weights should be included in the imputation model for Rubin's rules to hold, and therefore, we expect this method to perform relatively poorly.
3.2.2 Multiple imputation with weights (MI-W). The next option is to use an imputation model where the weights are included as additional variables; the easiest way to do this is to include them as an additional covariate in the multivariate normal imputation model, assuming a linear relation between weights and all three variables: While this model includes the weights it (i) assumes a linear relationship between the weights and the variables and (ii) does not include the interactions between the weights and the covariates that appear in the substantive model, which according to the literature is desirable, as seen in the introduction.

Multiple imputation with weights and interactions (MI-xW).
As outlined in the introduction, the literature (Kim et al. 2006;Seaman et al. 2012) suggests a better imputation model should include not only the weights but also all interactions between weights and covariates. This can be done easily when missing data are confined to the outcome variable-but not when data are missing in all variables. In this setting, we need to use the substantive model compatible imputation developed by Goldstein et al. (2014) (see also Bartlett et al. 2015).
The idea is to use an imputation model that partitions the joint distribution of the three variables between a joint distribution for the covariates and a conditional distribution of the dependent variable given the covariates: Missing data in Y are imputed from the conditional model given the covariates, weights, and their interactions, while missing data in the covariates are imputed compatibly with the model for Y, by means of a Metropolis-Hastings step within the Gibbs sampler. Note the model specified for Y in (11) is not the substantive model, which is the weighted regression of y on x 1 , x 2 . Although this approach should improve on (1) and (2), there are two potential shortcomings. First, when values are missing in all three variables, we do not have an interaction with the weights and the distribution of x 2 jx 1 and vice versa because their covariance matrix, X, is common across the strata. Second, it again assumes a linear relationship of y on the weights and their interactions; with a large number of covariates, the conditional model for Y becomes complicated, and estimating the parameters may lead to noisy results.

Stratum-specific multiple imputation (MI-S).
Where we have welldefined strata and sufficient data in each, this is perhaps the best approach. It is straight forward (we use standard imputation in each strata), allows a full interaction in the relationship between the variables by strata, and satisfies the criteria for Rubin's rules to give valid inference. For the ten strata in our simulated data, we therefore have, The disadvantage of this method is that it may struggle with small strata or substantial numbers of missing values within some strata.
3.2.5 Homoscedastic multilevel multiple imputation (MLMI-Hom). This is the first of our three multilevel imputation approaches. This approach does not use the random weight-strata covariance matrices in (6); the reason for this is to explore if this aspect is necessary. Thus, the imputation model is (6) with common covariance matrix: where the weight strata form the j ¼ 1; . . . ; J level-two groups. The problem with this method is that the level-one correlation between the outcome and the covariates of the substantive model is kept fixed in the imputation model across strata; this may bias inferences when (as will often be the case in practice) the association between variables varies with the weights.
3.2.6 Heteroscedastic multilevel multiple imputation (MLMI-Het). The sixth approach is (6), discussed in section 2. While not necessarily optimal in all scenarios, it should have good performance across them all. In particular, it allows the relationships between the variables to vary with the weights but does not insist this happens in a linear way.
3.2.7 Substantive model compatible multilevel MI (MLMI-SMC). The seventh and final method is to use multilevel substantive model compatible imputation (Goldstein et al. 2014). Essentially, this makes method three, (11), multilevel by giving the coefficients in the model of yjx 1 ; x 2 random coefficients across the weight strata: where v k;j are independent from u k;j . This model is almost as flexible as (6), but not quite, as the distribution of (x 1 , x 2 ) does not fully vary across weight strata.

Further Simulation Scenarios
In addition to the base-case scenario with missing values in Y alone described at the start of this section, we consider four further cases: 3.3.1 Base-case scenario: missingness in Y, X 1 , X 2 . We use the base-case scenario (7)   Data are made MCAR independently in each variable with probability 0.2. When imputing data generated using this model, all the methods apply the latent normal method for imputing the binary dependent variable (Carpenter and Kenward 2013, Chapter 5; Quartagno and Carpenter 2019).
3.3.5 Realistic scenario. As discussed, the base-case simulation scenario involved the choice of very extreme simulation parameters to make the comparison of the performance of different methods clear. In this last scenario, we instead use more realistic parameters, mimicking the distribution of data from the Millennium Cohort Study. In particular, we generate data for a total of 5,400 individuals divided in nine strata; weights associated with each stratum range between 0.23 and 2, and we use the following data-generating mechanism: x 1;i;j

SIMULATION RESULTS
All the simulations used 1,000 replications. Imputation used the jomo package, generating twenty imputed tables, with a burn-in of 500, and 500 updates between each imputed dataset.

Base-Case Scenario
The results of the base-case scenario, with 50 percent of y values MCAR but other variables complete, are shown in the top part of table 1. As expected, MI-S performs best here, giving approximately unbiased point estimates and good coverage levels close to 95 percent. However, both multilevel imputation methods, MLMI-Het and MLMI-SMC, give similar results for bias, precision, and coverage. In particular, the model SE (i.e., the SE obtained using Rubin's rules) and the empirical SE are similar for all three parameter estimates. While MI-xW has similar results, there seems to be slightly more bias in the parameter estimates; also, the model SE is somewhat larger than empirical SE and greater than MI-S, MLMI-Het, and MLMI-SMC. Additionally, MI-noW, MI-W, and MLMI-Hom all have some shortcomings in the way they handle weights in the imputation model, and therefore, it is not surprising that they lead to unsatisfactory results. Interestingly, the estimated SEs are consistently smaller with MI-S, MLMI-Het, and MLMI-SMC than for the CR analysis. In general, when missing data are in the outcome only, it is not possible to recover information without auxiliary variables. Here though, the combination of small strata with relatively high proportions of missing data is likely to be the reason why results after MI gain over CR.

Base-Case Scenario-Missingness in Y, X1, X2
The results are shown in the bottom half of table 1. The MIMI-Het and MIML-SMC now give the best results, outperforming MI-S because of their ability to pool information across strata through the random effects (this issue with MI-S also appears to affect CR and MI-xW, where the model SEs for b 1 , b 2 are relatively large). The MIMI-Het and MIML-SMC also recover a nontrivial proportion of information compared with CR. Some of the parameter estimates with CR, MI-S, and MI-xW are now slightly more biased, possibly because of a common reason (i.e., the fact that strata are so small); this leads to undercovering for b 2 with CR and MI-S. Although with MI-xW, an overestimation of the standard error leads to overcovering, despite the bias introduced.
Finally, we note that in both these scenarios, the full flexibility of MLMI-Het is not needed, as the covariance matrix of the covariates is common across strata in the data-generating mechanism. However, this does not adversely affect its performance.

Base-Case Scenario-Missingness Proportional to Weights
This is a challenging scenario because we have a nontrivial proportion of missing data in each variable, and missingness is proportional to the weights that are inversely proportional to the strata coefficients. The top parts of table 2 and figure 1 summarize the results.
The best results are now obtained with MLMI-Het and MLMI-SMC, which both perform better than MI-S. This can be clearly seen in the top row of figure 1, which shows the three zip plots (Morris 2016) for MI-S, MLMI-Het, and MLMI-SMC. Each zip plot is for b 2 (true value 1.707) and ranks the results of the 1,000 replications top to the bottom according to their p value against the null hypothesis. The red vertical line indicates the true value, and we can see that both multilevel imputation methods seem approximately unbiased. The purple bars indicate simulations that failed to cover the true value in NOTE.-Data are generated from the base-case configuration and successively missing data are introduced either (i) with missingness directly proportional to weights or (ii) inversely proportional. We compare the true value of the parameters with the full data estimates and with the estimates handling missing data with CR, MI-noW, MI-W, MI-xW, MI-S, MLMI-Hom, MLMI-Het, and MLMI-SMC.
their interval. These are approximately 5 percent for both multilevel methods; however, MLMI-Het tend to cover a bit less than 95 percent and MLMI-SMC a bit more. As expected, the fact that MLMI-Het allows for different covariate covariance across the strata (which is not present here) gives slightly larger SEs (wider CIs) than MIMI-SMC. By contrast, the zip plot for MI-S has approximately correct coverage levels, but the mean estimate is slightly biased, as can be seen from the fact that the noncovering simulations are almost all to the left of the true value. This is likely due to the fact that in this scenario, the probability of missingness is extremely high for the most weighted strata, which are the ones with the biggest effect on the overall weighted estimate of the parameters. When imputing using MI-S, we therefore do not have enough information in some strata to build and fit our stratum-specific imputation model, leading to biased estimates, mainly toward the null.
Because missingness is no longer MCAR here, CR is not valid. Also (table 2, top half), MI-xW gives poor results here; the pattern of missing data means it leans heavily on its incorrect assumption of a linear effect of the weights.

Base-Case Scenario-Missingness Proportional to Weights
This scenario is less challenging than the previous one because the proportion of missing data is higher in the strata with lower weight. The results are shown in the lower parts of table 2 and figure 1. Once again, we see good results for MIMI-Het and MLMI-SMC; however, relative to these, the performance of MI-S and MI-xW has improved.

GLM Scenario
Recall that in this scenario, once again each variable is MCAR with probability 0.2. Table 3 shows the results. There is a little bias in all the coefficient estimates, most likely due to small sample effects in GLMs. Here, MI-xW, MI-S, MLMI-Het, and MLMI-SMC are all competitive, with best results for MI-xW and MLMI-SMC. However, for MI-xW the model SEs tend to be smaller than the empirical SEs; this is avoided with MLMI-SMC and MLMI-Het. For MI-S, the model SEs are also larger than the empirical SEs, and this allows the coverage to be relatively good despite the slight bias (particularly for b 2 ).

Realistic Scenario
Results are again shown in table 3. While generally inference seems to be acceptable with most imputation methods, as indicated by negligible biases and good coverage levels, MI-xW, MLMI-Het, and MLMI-SMC are the best methods for variance estimation, as they are the methods for which model and empirical standard errors are most similar. MI-S seems to work similarly well, as expected given that weight strata are large in this example (i.e., 600 observations per stratum), and hence, within-stratum imputation is not as noisy as in the previous examples.

Summary of Simulation Results
In summary, both the MLMI-Het and MLMI-SMC give similar results across the range of scenarios considered here, and in each scenario, either are competitive with the best method or give the best results. In particular, neither MI-S nor MI-xW give such consistently good results.

APPLICATION TO MILLENNIUM COHORT DATA
We now use the methods evaluated previously in an analysis of the Millennium Cohort Study dataset (Plewis 2007). This is a multidisciplinary research project following the lives of around 19,000 children born in the UK in 2000 and 2001. We focus on the second wave (children around three years of age), where some items are missing, particularly in the family income and hearing problems variables (around 12 percent missing). Our substantive model is a weighted regression of the quantitative Bracken school readiness score on three explanatory variables: logarithm of family income, whether the child has hearing problems (1 ¼ yes; 0 ¼ no), and the number of siblings. The sampling weights are provided with the data; more detail on their derivation is given by Plewis (2007). We analyzed the complete records, and then we multiply imputed missing values using each of the methods presented in section 2. We used a burn in of 500 updates and imputed 20 datasets, updating the sampler 500 times between each imputation. As there are only nine weight strata, we use them to define the second level for the multilevel imputation method (i.e., all weights are the same within each stratum). Table 4 summarizes the results. Compared with the CR analysis, only MI-S gives larger SEs, suggesting estimation in some strata is poor, so this approach may be less reliable here. The other best-performing methods from the simulation study (MLMI-Het, MLMI-SMC, and MI-xW) give similar results. Focusing on results from these three methods, compared with CR, they suggest (i) a > 1 SE stronger positive effect of income on school readiness score (b 1 ); (ii) a slightly weaker effect of hearing problems (b 2 ), and (iii) a marginally stronger negative effect of a greater number of siblings (b 3 ).

DISCUSSION
In this article, we have reviewed some of the issues raised by using multiple imputation to impute missing values when the substantive analysis is a weighted model. This led us to propose a multilevel approach, where (i) the weights are used to form strata which define level-two, (ii) we include random intercepts, and (iii) we allow the variance structure of the variables to vary across strata. While random-effects models has been proposed previously as substantive models in order to shrink across weight strata (Elliott and Little 2000;Elliott 2007;Xia and Elliott 2016), to our knowledge, the use of a multilevel model, allowing associations between the variables to vary across weight strata in these settings, is new. Furthermore, while the possibility of using such models for the imputation of missing data has been suggested from a theoretical point of view (Zhou, Elliott, and Raghunathan 2016b), in this article, this strategy is evaluated through simulations and application of real data for the first time.
We have evaluated our approach in a series of simulation studies, finding encouraging results across all the scenarios. In applications, we may need to group weights in order to form strata. In this case, the approach is likely to perform best if the strata are relatively homogeneous.
Given our results, we believe that adopting our approach (either MLMI-Het or MLMI-SMC) addresses the issues raised by (Kim et al. 2006) and so renders their conclusion that "MI is not generally recommended for public use data files" unduly negative.
If our approach is adopted and the imputer and analyst are separate, we believe that those imputing data and subsequently releasing them for public use should also publish the imputation model so that users can see the structure that has been captured in the imputation model.
Across the scenarios we considered for MLMI-Het and MLMI-SMC, the empirical standard error was either close to or slightly larger than the modelbased SE (obtained using Rubin's rules), suggesting at worse the approach may be slightly conservative but still more efficient than CR (cf Meng, 1994).
Compared with the MI-xW approach, our multilevel approach does not rely on a linear association between the weights and the other variables; this can vary as dictated by the data across the weight strata. A further potential advantage of MLMI-SMC is that it can be combined with the approach outlined by Goldstein et al. (2014) to impute consistent with nonlinear relationships and interactions.
One possible issue with our proposed method is that with high dimensional data, incorporating all domains with their interaction might be complicated, even using our random effects to reduce the number of parameters. Because of this, an alternative multiple imputation method based on finite population Bayesian bootstrap has been recently proposed (Zhou, Elliott, andRaghunathan 2016a, 2016b); this method could have a potential advantage with large numbers of domains. We plan in the future to compare this method with our strategy to explore under which circumstances one is better than the other.
In this article, we focused on weights arising from simple random sampling. Extensions to consider more complex multistage sampling designs may be needed in applications. For example, our approach could have to be used to impute multilevel datasets. In this setting, the weight strata give rise to a crossclassified structure (Browne, Goldstein, and Rasbash 2001). In principle, this could be addressed simply by using a more complicated imputation model. This is a topic for future research.
For standard MI (i.e., when the substantive model is not weighted), our approach can also be used. In other words, we can estimate the probability of missing data and use these to form strata. If the weights are right, but the imputation model is wrong, preliminary simulations show this provides a degree of double-robustness.
The strategy we investigated in this article is tailored to handle missing items in surveys. We did not focus on the case of completely missing units, since we assume this will be addressed by the weights. This was also considered by Seaman et al. (2012), who recommended a two-stage combination of multiple imputation to handle item nonresponse and inverse probability weighting (IPW) to handle unit nonresponse.
To conclude, we have proposed and evaluated a multilevel multiple imputation approach for situations where the substantive analysis is weighted and found promising results. The approach described here is implemented in the R package jomo Quartagno and Carpenter (2014).