Missing data and prediction: the pattern submodel

SUMMARY
Missing data are a common problem for both the construction and implementation of a prediction algorithm. Pattern submodels (PS), in which a separate submodel is fit for each missing data pattern using only data from that pattern, are a computationally efficient remedy for handling missing data at both stages. Here, we show that PS (i) retain their predictive accuracy even when the missing data mechanism is not missing at random (MAR) and (ii) yield an algorithm that is the most predictive among all standard missing data strategies. Specifically, we show that the expected loss of a forecasting algorithm is minimized when each pattern-specific loss is minimized. Simulations and a re-analysis of the SUPPORT study confirm that PS generally outperforms zero imputation, mean imputation, complete-case analysis, complete-case submodels, and even multiple imputation (MI). The degree of improvement is highly dependent on the missingness mechanism and on the effect size of the missing predictors. When the data are MAR, MI can yield comparable forecasting performance, but generally at a larger computational cost. We also show that predictions from the PS approach are equivalent to the limiting predictions from an MI procedure that depends on the missingness indicators (the MIMI model). The focus of this article is on out-of-sample prediction; implications for model inference are only briefly explored.

1.1 PS loss is a weighted average of a full and reduced model

For a linear model the squared prediction error is a common and relevant loss function. To examine the bias-variance tradeoff in PS, it is helpful to revisit a simple example given by Shmueli (2010), in which the expected prediction error (EPE) is evaluated for a fully specified ("large") model versus an underspecified ("small") model. Suppose data come from the model

$$Y = f(x) + \epsilon = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \epsilon, \qquad \epsilon \sim N(0, 1).$$

When no predictors are missing, we estimate the full model $\hat{f}(x) = \hat{\beta}_0 + \hat{\beta}_1 x_1 + \hat{\beta}_2 x_2$. Here the EPE is the sum of the squared bias, the variance, and the irreducible error of the predictions or fitted values (Hastie and others, 2009):

$$\mathrm{EPE}_L = E\big[(Y - \hat{f}(x))^2\big] = \mathrm{Bias}^2\big(\hat{f}(x)\big) + \mathrm{Var}\big(\hat{f}(x)\big) + \sigma^2,$$

where EPE_L denotes the EPE of the full (large) model. In contrast, the EPE of the underspecified model, or submodel, is given by

$$\mathrm{EPE}_S = E\big[(Y - \hat{f}^*(x))^2\big] = \big(f(x) - E[\hat{f}^*(x)]\big)^2 + \mathrm{Var}\big(\hat{f}^*(x)\big) + \sigma^2,$$

where $\hat{f}^*(x) = \hat{\beta}^*_{0,2} + \hat{\beta}^*_{2,2} x_2$ is fit using X2 alone and the out-of-sample X2 is drawn from the conditional distribution of X2 given X1 = x1. The EPE for the correctly specified full model is essentially just the irreducible error, whereas the EPE for the underspecified model increases as the out-of-sample predictor moves away from its population mean. There is a bias-variance trade-off between the former approach (data are pooled across patterns, but the implied prediction model is poorer) and the latter (a better pattern-specific model that is estimated less precisely).

Now suppose X1 is missing with probability P(M1 = 1), so the full model can be used only when X1 is observed and the submodel must be used otherwise. The EPE of the pattern submodel approach is just a weighted average of the large and small prediction models:

$$\mathrm{EPE}_{PS} = \sum_m P(M = m)\,\mathrm{EPE}_m = \mathrm{EPE}_L \big(1 - P(M_1 = 1)\big) + \mathrm{EPE}_S\, P(M_1 = 1).$$

1.1.1 Bias/Variance Tradeoff

We simulated the EPE displayed in Figure 1. The simulation fixes the out-of-sample value X1 = x1 and averages the squared prediction error over repeated draws of the training data and of X2 given X1 = x1.
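To make the weighted-average identity concrete, here is a minimal Monte Carlo sketch of EPE_L, EPE_S, and EPE_PS. It is a sketch only: the coefficients, the X1-X2 correlation, the sample sizes, and the missingness probability are illustrative assumptions, not the paper's actual simulation settings.

```python
import numpy as np

rng = np.random.default_rng(0)
beta = np.array([1.0, 2.0, 3.0])            # beta0, beta1, beta2 (illustrative)
rho, sigma, n, reps = 0.5, 1.0, 100, 2000   # X1-X2 correlation, noise SD, n, replications

def draw(n):
    """Draw (X1, X2) from a correlated bivariate normal and Y from the true model."""
    x = rng.multivariate_normal([0, 0], [[1, rho], [rho, 1]], size=n)
    y = beta[0] + x @ beta[1:] + rng.normal(0, sigma, n)
    return x, y

x1_grid = np.linspace(-3, 3, 13)            # out-of-sample values at which X1 is fixed
err_L = np.zeros((reps, x1_grid.size))
err_S = np.zeros((reps, x1_grid.size))
for r in range(reps):
    x, y = draw(n)
    XL = np.column_stack([np.ones(n), x])         # large model: intercept, X1, X2
    XS = np.column_stack([np.ones(n), x[:, 1]])   # small model: intercept, X2 only
    bL, *_ = np.linalg.lstsq(XL, y, rcond=None)
    bS, *_ = np.linalg.lstsq(XS, y, rcond=None)
    # Out-of-sample: fix X1 = x1, draw X2 from X2 | X1 = x1, draw Y from the true model
    x2_new = rng.normal(rho * x1_grid, np.sqrt(1 - rho**2))
    y_new = beta[0] + beta[1] * x1_grid + beta[2] * x2_new + rng.normal(0, sigma, x1_grid.size)
    err_L[r] = (y_new - (bL[0] + bL[1] * x1_grid + bL[2] * x2_new)) ** 2
    err_S[r] = (y_new - (bS[0] + bS[1] * x2_new)) ** 2

p_miss = 0.5                                      # P(M1 = 1), as in Figure 1
epe_L, epe_S = err_L.mean(axis=0), err_S.mean(axis=0)
epe_PS = (1 - p_miss) * epe_L + p_miss * epe_S    # weighted-average identity
print(np.round(np.column_stack([x1_grid, epe_L, epe_S, epe_PS]), 2))
```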
The out-of-sample prediction error from the large model is given by the green line in Figure 1 and is equal to the model variance. If the data were generated from the large model but predictions were made from the small model that includes only X2, then the expected prediction error is approximated by the purple points in Figure 1.
The yellow points in Figure 1 denote the prediction error that arises from the PS in this setting: f1 makes predictions when all data are available and f2 makes predictions when only X2 is available. Clearly, PS has smaller EPE for every out-of-sample X1. In this case the probability of missingness was 50%, P(M1 = 1) = 0.5.

Fig. 1. Comparison of expected prediction error for the large, fully specified model, E[Y|X1, X2] = β0 + β1X1 + β2X2, and the small, underspecified model, E[Y|X2] = β*_{0,2} + β*_{2,2}X2. Pattern submodel (PS) predictions are a weighted average of the large and small models, weighted by P(M1 = 1).

Table 1. Missing data mechanisms used for simulation. ν0 is calculated empirically so that the probability of missingness is maintained at the desired level; expit(x) = e^x / (1 + e^x).
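As a concrete illustration of how ν0 can be calibrated empirically, the sketch below generates missingness in X1 through the expit link and searches for the intercept that hits a target missingness rate. The linear predictors and the 50% target are illustrative stand-ins for the specifications in Table 1.

```python
import numpy as np

rng = np.random.default_rng(1)
expit = lambda x: 1 / (1 + np.exp(-x))

def calibrate_nu0(lin_pred, target, grid=np.linspace(-10, 10, 2001)):
    """Empirically choose nu0 so that the mean missingness probability hits the target."""
    rates = np.array([expit(nu0 + lin_pred).mean() for nu0 in grid])
    return grid[np.argmin(np.abs(rates - target))]

n = 10_000
x1, x2 = rng.normal(size=n), rng.normal(size=n)

# MAR: missingness in X1 depends only on the fully observed X2 (coefficient illustrative)
# MNAR: missingness in X1 depends on the unobserved X1 itself
for label, lp in [("MAR", 1.0 * x2), ("MNAR", 1.0 * x1)]:
    nu0 = calibrate_nu0(lp, target=0.5)
    m1 = rng.binomial(1, expit(nu0 + lp))   # M1 = 1 means X1 is missing
    print(label, "nu0 =", round(nu0, 2), "realized missingness =", m1.mean())
```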
Table 2. Squared imputation error of the true out-of-sample X1 compared to the imputed X̂1 under different imputation methods and missing data mechanisms (MAR, MNAR, MAR PMY, and MNAR PMY): Imputation Error of X1 = Σ_i (X1i − X̂1i)². Multiple imputation was done in the usual way using predictive mean matching and chained equations, where the variable with the least amount of missing data (Y) is imputed first and the variable with the next least amount of missing data (here X1) is imputed second.
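Computing the Table 2 metric is mechanical once the held-out truth is kept aside. The sketch below does so for mean imputation and for a single regression imputation that stands in for one chained-equations step; the data-generating process and the MCAR missingness are simplifying assumptions, not the mechanisms of Table 2.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 5_000
x2 = rng.normal(size=n)
x1 = 0.5 * x2 + rng.normal(size=n)               # X1 correlated with X2
y = 1 + 2 * x1 + 3 * x2 + rng.normal(size=n)
m1 = rng.binomial(1, 0.5, size=n).astype(bool)   # MCAR missingness in X1, for simplicity
obs = ~m1

# Mean imputation: fill missing X1 with the mean of the observed X1
x1_mean_imp = np.where(m1, x1[obs].mean(), x1)

# Regression imputation: predict X1 from (X2, Y) on complete cases,
# a single-imputation stand-in for one step of chained equations
Z = np.column_stack([np.ones(n), x2, y])
coef, *_ = np.linalg.lstsq(Z[obs], x1[obs], rcond=None)
x1_reg_imp = np.where(m1, Z @ coef, x1)

for label, imp in [("mean", x1_mean_imp), ("regression", x1_reg_imp)]:
    err = np.sum((x1[m1] - imp[m1]) ** 2)        # Imputation Error of X1, as in Table 2
    print(f"{label} imputation error: {err:.1f}")
```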

1.5.2 Remark B: Conditioning Y on X and M

One might ask whether we are interested in the model marginalized over M, E[Y|X] = Xβ, or the conditional model, E[Y|X, M] = Xβ + Mδ. This is a philosophical question with differing viewpoints. For inferential purposes it has been argued that the marginal model is the model of interest. However, in many situations the mixture of conditional models is the simpler way to express a complicated marginal model. Note that by assuming the marginal model is true from the start, one must also assume that the data are MAR and that the δ parameters are zero. A way to assess this is to evaluate the degree to which the MI model and the PS model give similar predictions, as we did in the SUPPORT example.
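That check can be operationalized by fitting both models to the same data and comparing their predictions case by case. A minimal sketch follows; a single regression imputation stands in for a full MI procedure, and the data-generating process is an illustrative assumption.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 5_000
x2 = rng.normal(size=n)
x1 = 0.5 * x2 + rng.normal(size=n)
y = 1 + 2 * x1 + 3 * x2 + rng.normal(size=n)
m1 = rng.binomial(1, 0.5, size=n).astype(bool)   # M1 = 1: X1 missing
obs = ~m1

def ols(X, y):
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef

# Pattern submodels: full model on complete cases, X2-only submodel on the rest
b_full = ols(np.column_stack([np.ones(obs.sum()), x1[obs], x2[obs]]), y[obs])
b_sub = ols(np.column_stack([np.ones(m1.sum()), x2[m1]]), y[m1])
pred_full = b_full[0] + b_full[1] * x1 + b_full[2] * x2   # x1 known here because simulated
pred_ps = np.where(m1, b_sub[0] + b_sub[1] * x2, pred_full)

# Imputation-based competitor: impute X1 from X2, then fit one pooled regression
a = ols(np.column_stack([np.ones(obs.sum()), x2[obs]]), x1[obs])
x1_imp = np.where(m1, a[0] + a[1] * x2, x1)
b_mi = ols(np.column_stack([np.ones(n), x1_imp, x2]), y)
pred_mi = b_mi[0] + b_mi[1] * x1_imp + b_mi[2] * x2

# Agreement diagnostics: correlation and mean absolute difference of the predictions
print(np.corrcoef(pred_ps, pred_mi)[0, 1], np.abs(pred_ps - pred_mi).mean())
```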
1.5.3 Remark C: The Relationship between Y and M

The relationship between Y and M plays an important role in our modeling assumptions. An outcome generated from a selection model formulation is assumed to be independent of the missing data mechanism, such that the outcome would be the same regardless of whether covariate information is missing or observed. The pattern mixture model formulation instead assumes that the missing data mechanism is part of the response model, such that the outcome can depend on the missing data pattern. The two approaches represent fundamentally different descriptions of the underlying process, and they coincide only when δ1 = ... = δp = 0, which is the MAR assumption.
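Because a simulation generates X1 before masking it, the coincidence condition can be probed by fitting the conditional model E[Y|X, M] = Xβ + Mδ directly and inspecting the estimated δ. The data-generating process and the δ value below are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(4)
n, delta = 10_000, 1.5              # delta: outcome shift for the X1-missing pattern
x2 = rng.normal(size=n)
x1 = 0.5 * x2 + rng.normal(size=n)
m1 = rng.binomial(1, 0.5, size=n)   # pattern indicator, M1 = 1: X1 missing

# Pattern-mixture generation: the outcome depends on the pattern through delta
y = 1 + 2 * x1 + 3 * x2 + delta * m1 + rng.normal(size=n)

# Fit the conditional model E[Y | X, M] = X beta + M delta
X = np.column_stack([np.ones(n), x1, x2, m1])
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
print("estimated delta:", round(coef[3], 3))   # near 1.5 here; near 0 when data are MAR
```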
1.5.4 Remark D: Extending to Generalized Linear Models and Other Prediction Approaches

We performed the same set of simulations assuming a true logistic regression model, using a logarithmic scoring rule to compare methods. The general ordering of results holds and will be explored in future papers. These results extend to random forests as well.
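For reference, a minimal sketch of the logarithmic scoring rule used to compare probability forecasts; the outcome model and the two competing forecasts are illustrative.

```python
import numpy as np

rng = np.random.default_rng(5)

def log_score(y, p, eps=1e-12):
    """Mean logarithmic score of predicted probabilities (higher is better)."""
    p = np.clip(p, eps, 1 - eps)
    return np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

# Illustrative binary outcome and two competing probability forecasts
n = 1_000
x = rng.normal(size=n)
p_true = 1 / (1 + np.exp(-(0.5 + 1.5 * x)))   # true logistic model
y = rng.binomial(1, p_true)
print(log_score(y, p_true))                   # oracle forecast
print(log_score(y, np.full(n, y.mean())))     # constant (prevalence) forecast
```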

