Analyses of Sensitivity to the Missing-at-Random Assumption Using Multiple Imputation With Delta Adjustment: Application to a Tuberculosis/HIV Prevalence Survey With Incomplete HIV-Status Data

Abstract Multiple imputation with delta adjustment provides a flexible and transparent means to impute univariate missing data under general missing-not-at-random mechanisms. This facilitates the conduct of analyses assessing sensitivity to the missing-at-random (MAR) assumption. We review the delta-adjustment procedure and demonstrate how it can be used to assess sensitivity to departures from MAR, both when estimating the prevalence of a partially observed outcome and when performing parametric causal mediation analyses with a partially observed mediator. We illustrate the approach using data from 34,446 respondents to a tuberculosis and human immunodeficiency virus (HIV) prevalence survey that was conducted as part of the Zambia–South Africa TB and AIDS Reduction Study (2006–2010). In this study, information on partially observed HIV serological values was supplemented by additional information on self-reported HIV status. We present results from 2 types of sensitivity analysis: The first assumed that the degree of departure from MAR was the same for all individuals with missing HIV serological values; the second assumed that the degree of departure from MAR varied according to an individual's self-reported HIV status. Our analyses demonstrate that multiple imputation offers a principled approach by which to incorporate auxiliary information on self-reported HIV status into analyses based on partially observed HIV serological values.

Keywords: causal mediation analysis; incomplete data; nonignorable nonresponse; sensitivity analysis
Abbreviations: AIDS, acquired immune deficiency syndrome; HIV, human immunodeficiency virus; MAR, missing at random; MNAR, missing not at random; NDE, natural direct effect; NIE, natural indirect effect; TB, tuberculosis; ZAMSTAR, Zambia-South Africa TB and AIDS Reduction.
Missing data are common in epidemiologic studies and can lead to substantial bias and misleading inference when inadequately handled. Incomplete data are frequently analyzed only under the missing-at-random (MAR) assumption when they may more plausibly be missing not at random (MNAR). Data are said to be MAR if, conditional on the observed values, missingness of any variable does not depend on the unobserved values (1). Because the MAR assumption cannot be verified from the observed data, it is important to perform sensitivity analyses that assess the impact on the study results of departures from this assumption. However, methods for implementing structured sensitivity analyses are in need of further development and wider dissemination (2). This article reviews the procedure of multiple imputation with delta adjustment and demonstrates how it can be used to assess sensitivity to departures from MAR, both when estimating the prevalence of a partially observed outcome and when performing parametric causal mediation analyses with a partially observed mediator using the approach of Valeri and VanderWeele (3). Mediation analysis allows researchers to explore alternative mechanisms for a given outcome-exposure relationship via third variables and is becoming an increasingly popular tool in epidemiologic research.
We applied the delta-adjustment approach to data from a survey on the prevalence of tuberculosis (TB) and human immunodeficiency virus (HIV) that was conducted as part of the Zambia-South Africa TB and AIDS Reduction (ZAMSTAR) Study (4). We wished to obtain overall and sex-specific estimates of HIV prevalence and investigate the mediating influence of HIV status on the relationship between educational attainment and active pulmonary TB.
Missingness of the HIV test result data is most plausibly MNAR, because prior knowledge or strong beliefs about one's status influence test acceptance. Evidence from several recent longitudinal studies suggests that individuals who have previously tested HIV-positive may be more likely to refuse testing subsequently compared with individuals who were HIV-negative when last tested (5)(6)(7)(8). Such individuals may refuse testing because they fear further disclosure of their status to others. Some authors have advocated the collection of additional auxiliary information on prior testing behavior (8) to adjust for this, but there is little guidance on how to incorporate this information into the final analysis; current ad hoc approaches include supplementing the partially observed HIV serological values with self-reported values. By including self-reported HIV status in the imputation model for incomplete HIV serological values we demonstrate a novel and principled approach to incorporating this information that builds on current guidelines from the World Health Organization and the United Nations Programme on HIV and AIDS for handling missingness of HIV status data (9).
Collecting information on past HIV testing behavior, including the self-reported result of the most recent HIV test, also provides an opportunity to conduct more nuanced sensitivity analyses as it is likely that rates of HIV test acceptance differ within groups defined on the basis of self-reported HIV status. To this end, we present results from 2 types of sensitivity analysis: the first assumed that the degree of departure from MAR was the same for all individuals with missing HIV serological values, and the second assumed that the degree of departure from MAR varied according to an individual's self-reported HIV status. To our knowledge, these are the first sensitivity analyses of this type to be reported in the literature.

ZAMSTAR Study
We used data from a TB/HIV prevalence survey conducted as part of the ZAMSTAR Study (4). This survey aimed to include approximately 4,000 adults aged 18 years or older in each of 16 trial communities in Zambia and 8 communities in the Western Cape province of South Africa. We restricted our analyses to the 34,446 adult participants with an evaluable TB sputum sample among the 16 trial communities in Zambia.
Information on HIV status was available from 2 sources. All survey participants were offered point-of-care, rapid HIV testing as part of the study, yielding a partially observed variable for HIV status based on serological analysis. Participants were also asked about prior HIV tests, yielding a fully observed, self-reported, auxiliary variable with 4 categories: HIV-positive, HIV-negative, refused to disclose the result of the most recent HIV test, and never tested. Data were also collected on a large number of sociodemographic and socioeconomic variables and on prior diagnosis, symptoms, and treatment for TB and/or HIV. Data collected on highest school grade completed were used to create an educational-attainment exposure variable with the following 5 categories: none, primary (less than grade 8), lower secondary (grade 8 or 9), upper secondary (grade 10, 11, or 12), and college/university.
Among participants with an evaluable TB sputum sample, 31.8% had missing HIV serological values. In order to create a data set with univariate missingness, we deleted 648 (1.2%) observations that had missing values on any other variable included in the final imputation model. Omitting these observations had no impact on inference under the MAR assumption (data not shown). Communities were grouped into 4 noncontiguous regions characterized by their annual risk of TB infection (defined by the percentage of schoolchildren with a positive tuberculin test in a 2005 tuberculin skin test survey in all 24 trial communities (10)) and whether they were urban, rural, or located in Lusaka, the capital city.

Multiple imputation
Multiple imputation involves first specifying a distribution for the unobserved data given the observed data. Multiple complete data sets are produced by taking random draws from this distribution. Each imputed data set is analyzed using standard methods, and point estimates and standard errors for the quantities of interest are aggregated across the imputed data sets using Rubin's rules. Standard implementations assume that the missing data are MAR. A comprehensive treatment of the underlying statistical theory can be found in Rubin (11).
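The pooling step described above can be sketched as follows. This is a minimal illustration of Rubin's rules for a scalar quantity, not the code used in the study; the function name `pool_rubin` and the input values are hypothetical, and the sketch omits the degrees-of-freedom calculation needed for confidence intervals.

```python
import math

def pool_rubin(estimates, variances):
    """Pool M point estimates and their squared standard errors using Rubin's rules."""
    m = len(estimates)
    if m < 2 or m != len(variances):
        raise ValueError("need at least two imputations with matching variances")
    q_bar = sum(estimates) / m                                # pooled point estimate
    w_bar = sum(variances) / m                                # within-imputation variance
    b = sum((q - q_bar) ** 2 for q in estimates) / (m - 1)    # between-imputation variance
    total_var = w_bar + (1 + 1 / m) * b                       # total variance
    return q_bar, math.sqrt(total_var)

# Hypothetical prevalence estimates and squared SEs from M = 5 imputed data sets.
est = [0.180, 0.182, 0.179, 0.183, 0.181]
var = [0.0020 ** 2] * 5
pooled, se = pool_rubin(est, var)
print(f"pooled estimate = {pooled:.4f}, SE = {se:.4f}")
```

The pooled standard error exceeds the within-imputation standard error because the between-imputation variance reflects the extra uncertainty due to the missing data.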
We first describe a standard implementation of multiple imputation under the MAR assumption for a single incomplete variable. We then show how the delta-adjustment procedure extends this approach to allow for multiple imputation under alternative MNAR assumptions by modifying the values imputed under a MAR assumption so that they differ from the observed values in a specified way.

Multiple imputation under the MAR assumption
MAR assumption. Suppose that we have a vector of fully observed variables X and a single partially observed variable Y. Let R = 1 if Y is observed and R = 0 if Y is missing. The MAR assumption states that, conditional on the observed data, missingness of Y does not depend on the unobserved data. This can be formulated as
$$\Pr(R = 1 \mid X, Y) = \Pr(R = 1 \mid X),$$
or equivalently as
$$f(Y \mid X, R = 1) = f(Y \mid X, R = 0).$$
Constructing the imputation model. The imputation model should include all of the variables in the analysis model(s) of interest as well as any variable that is a significant predictor of both the HIV test result and missingness of the HIV test result (12).
We constructed 4 imputation models of increasing complexity for the HIV test result variable under the MAR assumption. Model A was a logistic regression of HIV test results on age and region only. Model B included the variables in model A plus active pulmonary TB. Model C included the variables in model B plus current TB treatment, past TB treatment, household wealth index, educational attainment, marital status, diabetes status, smoking status, alcohol consumption, hunger in past 3 months, household crowding, circumcision status (males only), current cough, persistent cough for more than 2 weeks, current chest pain, current fever, current night sweats, current shortness of breath, and unintentional weight loss in past month. Model D included all of the variables in model C plus the auxiliary HIV self-report variable. Because the risk factors for a positive HIV test and for HIV test refusal varied by sex and also because our analysis models contained an interaction term for age by sex, we imputed missing HIV status values for men and women separately in all 4 models. We created M = 25 imputed data sets under each imputation model using the mice package in R (13). Our imputation procedure did not account for clustering by census enumeration area or household because this had little impact on inference in complete-case analyses (data not shown). We did not include any additional interaction terms in the imputation model.

Multiple imputation under MNAR using delta adjustment
Multiple imputation with delta adjustment offers a transparent and flexible means by which to impute univariate data under general MNAR mechanisms, and thus to assess sensitivity to departures from MAR. Inspired by original proposals in Rubin (14), it has previously been used by van Buuren et al. (15) and implemented for a variety of variable types in the R package SensMice by Resseguier et al. (16). Further examples can be found in Carpenter and Kenward (17,18).
After fitting an imputation model for the incomplete variable Y under MAR, implementation of the delta-adjustment procedure involves adding a fixed quantity δ to the linear predictor before imputing missing data using the updated model. As such, it is a simple type of pattern-mixture model. When Y is binary and the missing data are imputed using a logistic regression model, δ represents the difference in the log-odds of Y = 1 for individuals with missing Y values compared with individuals with observed Y values. A simple imputation model under MAR is
$$\operatorname{logit}\{\Pr(Y = 1 \mid X)\} = \beta_0 + \beta_X X,$$
and a corresponding imputation model under MNAR is given by
$$\operatorname{logit}\{\Pr(Y = 1 \mid X, R)\} = \beta_0 + \beta_X X + \delta (1 - R).$$
Varying δ across a range of values, ideally elicited from a subject-matter expert, produces an analysis of sensitivity to departures from MAR.
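The adjustment step can be sketched in a few lines. The example below is an illustrative sketch only, written in Python rather than the R code referred to in the article: it assumes that the MAR linear predictors for individuals with missing Y have already been obtained from a fitted logistic imputation model, and the names `impute_with_delta` and `lp_missing` and all numerical values are hypothetical. A proper multiple-imputation procedure would also draw the model parameters from their posterior distribution for each imputed data set, which is omitted here.

```python
import math
import random

def expit(x):
    """Inverse logit."""
    return 1.0 / (1.0 + math.exp(-x))

def impute_with_delta(linear_predictors, delta, rng):
    """Draw binary imputations after shifting each MAR linear predictor by delta."""
    return [1 if rng.random() < expit(lp + delta) else 0 for lp in linear_predictors]

rng = random.Random(2024)
# Hypothetical MAR linear predictors (log-odds of Y = 1) for individuals with missing Y.
lp_missing = [rng.gauss(-1.5, 0.5) for _ in range(20000)]

for odds_ratio in (1.0, 2.0, 5.0):
    delta = math.log(odds_ratio)
    y_imp = impute_with_delta(lp_missing, delta, rng)
    share = sum(y_imp) / len(y_imp)
    print(f"exp(delta) = {odds_ratio:.1f}: imputed proportion with Y = 1 = {share:.3f}")
```

Setting δ = 0 recovers imputation under MAR, so the same function serves for both the primary and the sensitivity analyses.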
Extending the procedure. The delta-adjustment procedure can be refined to allow the degree of departure from MAR to vary among individuals with missing Y values according to their values on another fully observed variable Z. Examples can be found in Moreno-Betancur and Chavance (19) and Liublinska and Rubin (20). If Z is a 4-level categorical variable, we impute under the following model:
$$\operatorname{logit}\{\Pr(Y = 1 \mid X, Z, R)\} = \beta_0 + \beta_X X + \sum_{j=2}^{4} \beta_{Z_j} I(Z = j) + (1 - R) \sum_{j=1}^{4} \delta_j I(Z = j).$$
Choice of adjustment values. Our final choice of adjustment values was informed by findings from a study that used data from 3 consecutive, annual rounds of HIV counseling and testing in the Karonga District of Malawi between 2007 and 2010 to investigate patterns in refusal of HIV testing over time (6). Given the result of their last HIV test, this study provided estimates of the proportion of individuals self-reporting as HIV-positive and HIV-negative as well as the proportion that accepted or refused HIV testing at the next testing round. We adjusted these figures to take account of differences in testing behavior between the populations in the Malawi and ZAMSTAR studies. We used these estimates in conjunction with expert opinion and the observed ZAMSTAR data to obtain an appropriate set of sensitivity parameter values. Further details of our approach are provided in Web Appendix 1 (available at http://aje.oxfordjournals.org/), including illustrative probability trees (Web Figures 1 and 2). Example R code for implementing the imputation procedure is provided in Web Appendix 2.
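One way to see how group-specific adjustment values relate to beliefs about prevalence is to convert assumed prevalences among tested and untested individuals into a log odds ratio. The sketch below is a simplified illustration and not the elicitation procedure of Web Appendix 1: it works with marginal prevalences within each self-report group, whereas δ in the imputation model is conditional on covariates. The function name `implied_delta` and every prevalence value are hypothetical.

```python
import math

def logit(p):
    return math.log(p / (1.0 - p))

def implied_delta(p_missing, p_observed):
    """Log odds ratio comparing individuals with missing and observed test results."""
    return logit(p_missing) - logit(p_observed)

# Hypothetical prevalences (among tested, among untested) by self-reported status.
elicited = {
    "self-reported negative": (0.10, 0.15),
    "self-reported positive": (0.90, 0.90),
    "refused to disclose": (0.30, 0.50),
    "never tested": (0.15, 0.16),
}

for group, (p_obs, p_mis) in elicited.items():
    d = implied_delta(p_mis, p_obs)
    print(f"{group}: delta_j = {d:.3f}, exp(delta_j) = {math.exp(d):.2f}")
```

A group in which the two prevalences are assumed equal yields a δ of 0, which corresponds to MAR within that group.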

Parametric causal mediation analysis
Analyses assessing sensitivity to departures from MAR can be difficult to perform when the primary analysis is of a complex form that requires multiple subcomponent models to be fitted to the data. Multiple imputation is particularly well suited to such situations. Here we demonstrate how the delta-adjustment procedure can be used to assess the impact of departures from MAR on estimates arising from a parametric causal mediation analysis. This analysis investigated whether part of the observed relationship between educational attainment and active pulmonary TB can be explained via HIV status. While we use the term "effect" throughout, as with any observational study, we cannot rule out the possibility of uncontrolled confounding, issues surrounding the exposure definition, and model misspecification.
Valeri and VanderWeele (3) presented an integrated framework for parametric mediation analysis that is valid in the presence of exposure-mediator interaction and allows for the outcome and mediator variables to be any combination of binary, categorical, continuous, or count. This extended previous work that considered only a binary outcome and a continuous mediator (21). The approach involves fitting 2 parametric regression models to the data: a regression of the outcome on the exposure, mediator, and other confounders and a regression of the mediator on exposure and other confounders. The exposure variable can take 2 or more levels. In our example, TB status was the outcome, HIV test result was the mediator, educational attainment was the exposure, and we fitted 2 logistic regression models. Our confounder set for this analysis contained age, sex, region, and an age by sex interaction, resulting in 48 observed covariate patterns. Primary education was used as the reference category for the educational-attainment exposure variable.
The Valeri and VanderWeele approach (3) decomposes the total effect of setting the exposure to level a rather than to level a* into the product of a natural direct effect (NDE) and a natural indirect effect (NIE) on the odds-ratio scale. Such a decomposition is often not possible using the standard approach of Baron and Kenny (22). The causal effects are identified assuming that there is no unobserved confounding of any of the outcome-exposure, outcome-mediator, or mediator-exposure relationships and that no confounder of the outcome-mediator relationship is associated with the exposure. While the latter assumption may not be satisfied in our setting, this does not affect our ability to illustrate the delta-adjustment method. Although the NDE does not vary when there is no exposure-mediator interaction, in general the NDE, NIE, and total effect depend on the values of the confounding variables. Standard errors for these quantities can be obtained via bootstrapping or the multivariate delta method (3). Further details are provided in Web Appendix 3.
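The multiplicative decomposition can be checked numerically for a binary outcome and a binary mediator. The sketch below is an assumption-laden illustration, not the authors' implementation: it evaluates the counterfactual risks directly by averaging the outcome model over the mediator distribution, rather than using the closed-form expressions of reference 3, and it folds the confounder contributions for one covariate pattern into the intercepts. All function names and coefficient values are hypothetical.

```python
import math

def expit(x):
    return 1.0 / (1.0 + math.exp(-x))

def odds(p):
    return p / (1.0 - p)

def counterfactual_risk(a, a_star, theta, beta):
    """P(Y = 1) when exposure is set to a and the mediator is distributed as under a_star.

    theta = (intercept, exposure, mediator, exposure-by-mediator) for the outcome model;
    beta = (intercept, exposure) for the mediator model.
    """
    t0, t1, t2, t3 = theta
    b0, b1 = beta
    p_m1 = expit(b0 + b1 * a_star)               # P(M = 1 | a_star)
    p_y_m0 = expit(t0 + t1 * a)                  # P(Y = 1 | a, M = 0)
    p_y_m1 = expit(t0 + t1 * a + t2 + t3 * a)    # P(Y = 1 | a, M = 1)
    return p_y_m1 * p_m1 + p_y_m0 * (1.0 - p_m1)

def mediation_odds_ratios(theta, beta, a=1, a_star=0):
    """Return (NDE, NIE, total effect) on the odds-ratio scale."""
    y_a_ma = counterfactual_risk(a, a, theta, beta)
    y_a_mastar = counterfactual_risk(a, a_star, theta, beta)
    y_astar_mastar = counterfactual_risk(a_star, a_star, theta, beta)
    nde = odds(y_a_mastar) / odds(y_astar_mastar)
    nie = odds(y_a_ma) / odds(y_a_mastar)
    te = odds(y_a_ma) / odds(y_astar_mastar)
    return nde, nie, te

# Hypothetical coefficients for one covariate pattern.
theta = (-5.0, -0.3, 1.2, 0.1)
beta = (-1.5, -0.4)
nde, nie, te = mediation_odds_ratios(theta, beta)
print(f"NDE = {nde:.3f}, NIE = {nie:.3f}, TE = {te:.3f}, NDE x NIE = {nde * nie:.3f}")
```

Because the NDE and NIE share the intermediate counterfactual risk, their product equals the total-effect odds ratio by construction. A causal interpretation still rests on the identification assumptions stated above.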
We implemented the parametric causal mediation analysis within the multiple imputation framework as follows: We first fitted the regression models for the outcome and the mediator in each imputed data set and then pooled the resulting imputation-specific coefficient estimates and their variance-covariance matrices using Rubin's rules. Finally, we calculated the causal-effect estimates and their standard errors.

Risk factors for HIV infection and HIV test refusal
Odds ratios for a number of potential risk factors for HIV infection and HIV test refusal, stratified by sex and adjusted for age and region, are presented in Tables 1 and 2, respectively. Self-reported HIV status was strongly related to both a positive HIV test and HIV test refusal in this sample, and its distribution varied considerably by sex, age, and region (Web Table 1).

Sensitivity analyses
We first assumed that the degree of departure from MAR was identical for all individuals with missing HIV serological values. In this case, δ represented the difference in the log-odds of a positive HIV test result for individuals with missing HIV test results compared with individuals with observed HIV test results. We considered a range of values from exp(δ) = 1.0 to exp(δ) = 5.0.
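To make the scale of these values concrete, the following worked calculation shows how multiplying the odds by exp(δ) changes an imputation probability. It is an illustrative sketch only; the probability p = 0.20 is a hypothetical value, not an estimate from the study. If p is the probability of a positive result implied by the MAR imputation model for a given individual, the adjusted probability is
$$p_\delta = \frac{e^{\delta}\, p}{1 - p + e^{\delta}\, p}.$$
With p = 0.20, exp(δ) = 1.0 leaves the probability at 0.20, whereas exp(δ) = 5.0 gives
$$p_\delta = \frac{5 \times 0.20}{0.80 + 5 \times 0.20} = \frac{1.00}{1.80} \approx 0.56.$$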
We then explored the impact of allowing the degree of departure from MAR to vary according to self-reported HIV status Z. We let δ1, δ2, δ3, and δ4 capture the degree of departure from MAR for individuals who self-reported as HIV-negative, who self-reported as HIV-positive, who refused to disclose their most recent test result, and who reported that they had never been tested for HIV, respectively. The values chosen for δ1, δ2, δ3, and δ4 (summarized in Table 3) captured our beliefs about the missing-data mechanism, assuming that no individual failed to report having had a prior HIV test. Missingness for individuals who self-reported as HIV-negative was believed to be MNAR (exp(δ1) > 1), because in addition to those who tested negative at their last test, this group includes individuals who know or suspect that they are HIV-positive but prefer to report as HIV-negative. Conversely, missingness for individuals who self-reported as HIV-positive was believed to be MAR (exp(δ2) = 1). Missingness for individuals who refused to disclose their status was believed to be strongly MNAR (exp(δ3) > 1), while missingness for individuals who reported that they had never previously been tested for HIV was believed to be MAR or weakly MNAR (exp(δ4) close to 1).

Estimation of HIV prevalence
Table 4 presents estimates of HIV prevalence from a complete-case analysis, best- and worst-case analyses (in which all missing HIV test values were imputed as 0 or 1, respectively), and the 4 alternative multiple-imputation analyses under MAR. Table 5 presents estimates of HIV prevalence from a selected subset of multiple-imputation analyses under MNAR based on imputation model D. The estimates from complete-case analysis were systematically lower than those produced by multiple imputation under MAR, while imputation models A, B, and C produced very similar estimates of the overall HIV prevalence. Including self-reported HIV status in the imputation model resulted in an increased estimate of the overall HIV prevalence.
In MNAR analyses, estimates of the overall HIV prevalence varied from 18.1% under MAR (exp(δ) = 1.0) to 24.9% when exp(δ) = 5.0. Allowing the degree of departure from MAR to vary according to self-reported HIV status, as captured by the group-specific δj values, was associated with more subtle differences in the estimates of HIV prevalence than applying a common δ value to all participants with missing HIV test result data (Table 5). This is further illustrated by the filled contour plot in Web Figure 3, which presents overall and sex-stratified estimates of HIV prevalence by group-specific δj values.
While there was evidence that HIV status mediated the effects of having upper-secondary or college/university education (compared with primary education) on active pulmonary TB across a majority of covariate patterns, there was no evidence that HIV status mediated the effects of having lower-secondary education. The NIE on active pulmonary TB of having no education compared with having primary education was in the opposite direction of the NDE for all covariate patterns, indicating a lack of mediation. Accounting for missing data via multiple imputation under MAR did not produce a qualitative change in inference regarding mediation. A representative set of causal-effect estimates and 95% confidence intervals for one covariate pattern is presented in Table 6.
Analyses assessing sensitivity to departures from the MAR assumption. Estimates of the average NDE for each level of the educational-attainment exposure variable were insensitive to departures from the MAR assumption across a majority of covariate patterns. While estimates of the NIE of having a college/university education exhibited moderate sensitivity to departures from MAR across all covariate patterns, estimates of the NIE for the remaining exposure levels exhibited little sensitivity. Sensitivity of the average NIE and total effect to departures from MAR was primarily attributable to sensitivity of the coefficient estimate for the educational-attainment exposure in the model for the mediator (Web Figure 4). In general, accounting for possible violation of the MAR assumption was not associated with a qualitative change in inference regarding mediation. A sensitivity analysis for the covariate pattern shown in Table 6 is presented in Figure 1.

DISCUSSION
In this study, we reviewed multiple imputation with the delta-adjustment procedure and demonstrated how it can be used to impute data under general MNAR mechanisms, thus facilitating analysis of sensitivity to departures from the MAR assumption. We applied the approach to data from a survey on TB/HIV prevalence, conducted as part of the ZAMSTAR Study, assessing the impact of departures from MAR on HIV prevalence and causal-effect estimates in 2 types of sensitivity analysis.
The first sensitivity analysis assumed that the degree of departure from MAR was the same for all individuals with missing HIV serological values, while the second assumed that the degree of departure from MAR varied according to an individual's self-reported HIV status. Although we assumed that the degree of departure from MAR for individuals with missing HIV test result values did not vary according to TB status or educational attainment, sensitivity analyses exploring the impact of such dependencies could be performed in an identical fashion.
Our approach to sensitivity analysis produces a range of inferences by varying the sensitivity parameters across a range of plausible values. This allows the investigator to explore how the inference changes according to the assumption placed on the missing-data mechanism. A possible alternative, attractive to policy-makers, provides a single inference by placing informative prior distributions on the sensitivity parameters in a fully Bayesian analysis (23).
Recently developed multiple-model multiple-imputation approaches (24,25) can be used to approximate such analyses within the multiple-imputation framework. We acknowledge that elicitation of the sensitivity parameter values can represent a significant challenge in many applied research settings. In situations where there is a clear hypothesis to be tested (for example, determining whether HIV prevalence has fallen below a specified value), it can be easier to conduct a tipping-point analysis (20,26). In this approach, the investigator varies the sensitivity parameters across a large range of values in order to determine a set of values for which there is a qualitative change in inference. The investigator must then evaluate whether this set of values is plausible for the data at hand and thus whether the results of their analyses are sensitive to departures from MAR. Improved tools for the elicitation of sensitivity parameters are needed if MNAR methods are to enjoy routine use among applied researchers.
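A tipping-point search of the kind described above can be sketched as a scan over a grid of exp(δ) values. The example is a deliberately simplified illustration under stated assumptions: it uses the expected number of positive results among individuals with missing values instead of repeated random imputations, it ignores sampling uncertainty (a real analysis would usually compare a confidence limit with the threshold), and the function names, counts, linear predictors, and threshold are all hypothetical.

```python
import math

def expit(x):
    return 1.0 / (1.0 + math.exp(-x))

def prevalence_under_delta(n_obs_pos, n_obs, missing_lps, delta):
    """Expected prevalence when missing values are imputed with log-odds shifted by delta."""
    expected_missing_pos = sum(expit(lp + delta) for lp in missing_lps)
    return (n_obs_pos + expected_missing_pos) / (n_obs + len(missing_lps))

def tipping_point(n_obs_pos, n_obs, missing_lps, threshold, odds_ratios):
    """Return the smallest exp(delta) on the grid at which prevalence reaches the threshold."""
    for odds_ratio in sorted(odds_ratios):
        delta = math.log(odds_ratio)
        if prevalence_under_delta(n_obs_pos, n_obs, missing_lps, delta) >= threshold:
            return odds_ratio
    return None  # the conclusion does not change anywhere on the grid

# Hypothetical data: 700 respondents tested (119 positive) and 300 untested.
missing_lps = [-1.5] * 300
grid = [1.0 + 0.25 * k for k in range(17)]  # exp(delta) from 1.0 to 5.0
result = tipping_point(119, 700, missing_lps, threshold=0.20, odds_ratios=grid)
print("tipping point at exp(delta) =", result)
```

The investigator would then judge whether the returned value of exp(δ) is plausible for the population under study.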
Multiple imputation offers a rigorous approach by which to incorporate auxiliary information on self-reported HIV status into analyses based on partially observed HIV serological analysis. Exploiting auxiliary information on self-reported HIV status produced estimates of overall and subgroup-specific HIV prevalence with greater face validity when it was included as a variable in the imputation model and also allowed us to perform more sophisticated analyses of sensitivity to departures from the MAR assumption. Future population-based studies should continue to collect information on self-reported HIV status in addition to testing for HIV, especially in settings with high rates of prior testing. Seeking more information on past HIV-testing behavior (for example, the date of the most recent HIV test) or beliefs about status if never tested would also be valuable. For example, we encountered some difficulty in selecting an appropriate range of delta values for the never-tested subgroup. This group is likely to contain a mixture of individuals at quite different levels of risk of HIV infection. Some individuals might not have access to testing, some might refuse testing because they believe themselves to be at very low risk, and others might refuse testing because they believe themselves to be at high risk and fear disclosure. In the absence of further information about the composition of this subgroup, it may be reasonable to consider a larger range of values for the degree of departure from MAR than was presented here, for example, from exp(δ4) = 0.5 to exp(δ4) = 2.0.
Our causal-effect estimates exhibited marked insensitivity to departures from MAR. Nevertheless, the validity of these estimates depends critically on the set of identifying restrictions detailed earlier and on the assumption that the 2 component parametric models are correctly specified. While we are confident that we have captured the most important confounders of the outcome-mediator, outcome-exposure, and mediator-exposure relationships, and that the confounders of the outcome-mediator relationship for which we adjusted are not associated with the exposure, the impact of violations of these assumptions could be explored in further sensitivity analyses. For example, Tchetgen Tchetgen and Phiri (27) and Naimi (28) have derived bounds for natural effects when the exposure is associated with one or more confounders of the outcome-mediator relationship. Furthermore, some readers may not agree that educational attainment constitutes a well-defined counterfactual cause (29); further discussion of this perspective is provided in Web Appendix 3. While we have focused on an example with a single incomplete variable, we note that delta-adjustment procedures can also be used to adjust for missing data in longitudinal clinical trials subject to dropout (19,30). Furthermore, while some authors (15,16) have attempted to perform delta adjustment in conjunction with the chained-equations algorithm, at present this approach lacks a strong theoretical foundation and thus should be used with caution.
In conclusion, multiple imputation with delta adjustment offers a transparent and flexible means to perform analyses of sensitivity to departures from the MAR assumption in the presence of a single incomplete variable. While appropriate for use in conjunction with all types of univariable and multivariable analysis, this method may represent a particularly important tool for sensitivity analysis in contexts such as mediation analysis where multiple subcomponent models must be fitted to the data.