Testing unit root non-stationarity in the presence of missing data in univariate time series of mobile health studies

Abstract The use of digital devices to collect data in mobile health studies introduces a novel application of time series methods, with the constraint of potential data missing at random or missing not at random (MNAR). In time-series analysis, testing for stationarity is an important preliminary step to inform appropriate subsequent analyses. The Dickey–Fuller test evaluates the null hypothesis of unit root non-stationarity, under no missing data. Beyond recommendations under data missing completely at random for complete case analysis or last observation carry forward imputation, researchers have not extended unit root non-stationarity testing to more complex missing data mechanisms. Multiple imputation with chained equations, Kalman smoothing imputation, and linear interpolation have also been used for time-series data, however such methods impose constraints on the autocorrelation structure and impact unit root testing. We propose maximum likelihood estimation and multiple imputation using state space model approaches to adapt the augmented Dickey–Fuller test to a context with missing data. We further develop sensitivity analyses to examine the impact of MNAR data. We evaluate the performance of existing and proposed methods across missing mechanisms in extensive simulations and in their application to a multi-year smartphone study of bipolar patients.


Introduction
Assessing the stationarity of a univariate time series has long been a question of interest to evaluate data assumptions and inform subsequent analyses.However, solutions for testing for stationarity have been largely developed for the field of economics, where the data is typically recorded as a fully observed time series.The increase in the prevalence of time series data across disciplines (e.g., social science, political science, and psychiatry) introduces new challenges for existing time series analysis methods.
One growing source of time series data is personal digital devices.This technology is increasingly employed to collect information, particularly in the field of health research (Aledavood et. al., 2017;Silva et. al., 2015).It can record participants' real-time exposures and outcomes, with the potential for many observations per second (Torous et. al., 2017;Vaidya et. al., 2013).Mobile health (mHealth) studies for example utilize individuals' smartphones and wearable devices to collect information on the participants daily activities and well-being (Mandel & Ghosh, 2021;Aledavood et. al., 2017).As smartphones become an essential tool for day-to-day life, mHealth study designs will become ever-more prevalent, and necessitate the development of new statistical methods to handle the particular challenges that mobile health data can introduce.MHealth data encompasses active data which is collected directly from the participant, such as daily survey responses provided by participants, and passive data which is observed with no action required from the participant, such as GPS records, accelerometer data, and phone and text logs.However, these records may suffer from pervasive missingness resulting from technological errors, inactive devices, or participants being unable, unwilling, or uninterested in engaging with the software.The Bipolar Longitudinal Study (BLS) is an mHealth cohort study of 74 individuals with Bipolar I or II disorder, schizophrenia, or schizoaffective disorder followed for up to five years using the Beiwe application (Onnela et. al., 2021, Huang & Onnela, 2019;Barnett & Onnela, 2020) on the participant's smartphone.It was configured in the study to record basic information of the participant's anonymized phone calls and text messages, their accelerometer and GPS data, and daily responses to a 31 item survey including questions on mood, sleep, social interaction, physical activity, and other health conditions (Cai et. al., 2022).The study provides researchers with the possibility of identifying the causal effects of at home interventions such as increased social or physical activity to improve an individual's mood and symptoms.However, the validity of such analysis will be dependent on modeling assumptions, including stationarity.In the Bipolar Longitudinal Study we observe missing data rates for the daily survey varying from 2% to over 90%.Missingness is also pervasive in passive data and could result from a variety of factors including glitches, noncompliance, or failure to update software.For example, for one participant in the BLS study we fail to observe any text data for over a four month gap, or 16% of the follow up their period.Preliminary results in the Bipolar Longitudinal Study also indicate the presence of non-stationarity among several participants' time series data (Cai et. al., 2022).
In time series analysis, stationarity is defined as constant mean and variance across time (Metcalfe & Cowpertwait, 2009).Models for stationary processes such as linear mixed models are overwhelmingly used in mHealth (Tewari & Murphy, 2017;Luckett et. al., 2019;Neto et. al., 2016).However, non-stationarity is likely possible in mHealth studies and it is important to test for it prior to proceeding with traditional longitudinal models.There are many types of non-stationarity such as change points and unstable variance, but the most commonly evaluated is unit root.Unit root non-stationarity is characterized by an autocorre-lation of one between the current state and lagged values.A common example of a unit root time series is a random walk.The augmented Dicker Fuller test, or ADF test, is the conventional method used to test for unit root non-stationarity (Metcalfe & Cowpertwait, 2009).However, despite the importance of stationarity assumptions in analyses, unit root testing is often omitted from mHealth studies (Tewari & Murphy, 2017), likely due to the ADF test requirement for complete data.
There are three classifications for missing data to determine its relative impact on analyses and validity: missing completely at random (MCAR), missing at random (MAR), and missing not at random (MNAR) (Little & Rubin, 2019).When data is missing completely at random, the likelihood of a data point not being observed is not influenced by any observed or unobserved data; this is the optimal scenario.In the case of daily surveys from mHealth studies, this could be conceived as a participant failing to respond to survey questions at random a few days a week.Alternatively, when data is MAR, the probability of missingness is only affected by observed variables.An example of missing at random in the mHealth context is loss of interest in study engagement, where missing rates of survey responses increase with time in the study.Lastly, the most challenging mechanism is if data is missing not at random, and the probability of missing data is influenced by unobserved information.For mobile health, this might be represented by an individual not responding to questions on their mood, due to an unobserved depressive state.Because it is hypothesized that mHealth data is often missing at random or missing not random (Goldberg et. al., 2021), time series methods must be adapted to these contexts.
The existing missing data methods used in time series analysis impose strong assumptions about unit root, and thus may have impacts on validity when used with the ADF test.Basic methods such as mean imputation and last observation carry forward are known to bias estimates and distort variance even when data is MCAR (Little & Rubin, 2019).Linear and spline interpolation have been shown to produce accurate estimates when used for imputation in covariates (Terry et. al., 1986;Skjelbred and Kong, 2019) but likely bias autocorrelation estimates due to their imposition on the autocorrelation structure.Wijesekara and Liyanage (2020) show under MCAR that Kalman smoothing (Moritz & Bartz-Beielstein, 2017) achieves high accuracy when imputing values in a univariate time series, however the method has not been evaluated on other missing mechanisms.Commonly used multiple imputation methods developed for non-longitudinal data, such as multiple imputation with chained equations fail to account for the correlation between the lagged and current value when imputing variance (Azur et. al., 2011;Little & Rubin, 2019), and thus may provide unreliable results.Lastly, complete case analysis which simply drops all unobserved time points and thus disrupts the time sequence may be appropriate when data is MCAR (Shin & Sarkar, 1996).If the time series has unit root, the autocorrelation will be maintained, while if the autocorrelation is less than one, the autocorrelation will be underestimated.
The existing research on stationarity testing and autocorrelation estimation with incomplete data is sparse, and relies on strong assumptions about the missingness mechanism.Little and Rubin (2019) propose an EM algorithm for autocorrelation estimation for an AR(1) time series, but condition on stationarity.Shin andSarkar (1994, 1996) demonstrate that if the time series follows an AR or ARMA structure and data is MCAR, then complete case analysis or last observation carry forward are sufficient methods to be used prior to the ADF test.Park, Shin, et al. (2005) develop a Bayesian test for incomplete data that tests multiple hypotheses of asymmetry and unit root non-stationarity.However there is a need for the investigation of the impact of different missing mechanisms and rates on unit root testing, and a consideration of more sophisticated missing data methods for the ADF test.
We first propose using imputations from a state space model with multiple imputation (SSMimpute) (Cai et. al., 2022) for unit root testing with the ADF test.While this method is shown to produce non-biased coefficients for covariate estimates with multivariate time series analysis, it has not been tested with a univariate time series and estimation of the lagged value coefficient.Under the assumption of a single lag order and normally distributed errors, we additionally propose two likelihood maximization approaches to adapt the ADF test to a context of missing data.The first uses mathematical optimization to maximize the observed data likelihood, and runs the ADF test using the obtained estimates.The second incorporates an EM algorithm to recursively impute missing data and estimate the autocorrelation and variance, and runs the ADF test on the final imputed time series.We additionally introduce sensitivity analyses with a δ adjustment for the state space model with multiple imputation and maximum likelihood approach using an EM algorithm to consider the ramifications of MNAR data on unit root testing.The remainder of this paper will first review unit root testing without missing data, then expand these methods to the context of MCAR and MAR data, and lastly introduce sensitivity analyses for MNAR data.We will use simulation and application to BLS mobile health data to review to validity and performance of proposed methods.

Testing unit root without missing data
Let t represent time point from t = 0, 1, ..., T .Then assuming lag order of 1, we can define the time series Y t such that y t = ρy t−1 +ǫ t , where E(ǫ t ) = 0, V ar(ǫ t ) = σ 2 , and ǫ t are independently and identically distributed.We state that time series y has unit root if ρ = 1.If this is true, the process is known as a random walk, and as T → ∞, the variance of the time series will diverge to infinity (Metcalfe & Cowpertwait, 2009).This renders various statistical assumptions invalid, such as those needed to apply standard longitudinal analysis methods including linear mixed models.Notably, preliminary analysis of the Bipolar Longitudinal Study has identified that some participants' response patterns appear to follow a random walk, demonstrating the importance of testing for unit root in mHealth data (Cai et. al., 2022).The Dickey Fuller test was developed to test the null hypothesis: H 0 : ρ = 1 against the alternative that H 1 : ρ < 1 (Dickey & Fuller, 1979).The test employs ordinary least square methods regressing Y t by Y t−1 to estimate ρ and σ2 such that test statistic is defined as: (1) Under the null hypothesis, the distribution of the test statistic has no closed form (Dickey & Fuller, 1979), but is named the Dickey Fuller distribution.For testing purposes, p-values and rejection regions are generated through interpolation, using a table of distribution quantiles obtained via simulation (Dickey & Fuller, 1979).Assuming ǫ t are distributed normally, we note that y t |y t−1 ∼ N (ρy t−1 , σ 2 ).We can thus solve for the joint log-likelihood of ρ, σ 2 conditional on y 1 , ..., y T as: (2) 3 Testing unit root in the presence of MCAR or MAR missing data For t = 1, ..., T , let r t serve as an indicator of whether y t is observed, and p t represent the probability of observing y t , i.e. p t = P (r t = 1).If data is missing completely at random, we assume that p t is independent of Y t , t, and any other observed or unobserved variables, for t = 1, ..., T .When the time series is missing at random, p t is dependent only on observed variables.In the case of a univariate time series, we will assume that the probability of observing a time point is dependent on only observed covariates X including time t and previously observed non-missing y t , such that for some function g, p t = g(X).We let k = t 1 , ..., t n represent the index of times with non-missing observations, such that y t1 , y t2 , ..., y tn represent all observed time series values, with n total non-missing observations, and n ≤ T .Note that as seen in (Bertin et. al., 2011), we can solve that the distribution of y t k conditional on the previously observed point y t k−1 is normal, such that We can now adapt the complete time series log-likelihood from (2) for observed data: Note that when p t is independent of ρ and σ, the likelihood can be maximized with respect to these parameters by disregarding the first term of the log-likelihood.This condition holds when data is MCAR or MAR.

Maximum Likelihood Estimation with Numeric Optimization
Using likelihood from (4), where we assume ǫ t is distributed normally, even if we assume data is MCAR or MAR and ignore the first term, we are unable to solve for an algebraic maximum.Instead, we suggest using numerical optimization, specifically the Nelder-Mead method to maximize the function with respect to σ 2 and ρ.We calculate initial estimates of ρ and σ using only time points where y t and y t−1 are observed, such that To implement the ADF test using estimates generated from numerical maximization (ρ, σ2 ), we calculate a conservative new test statistic as: We will refer to this method as MLEN (maximum likelihood estimation by numerical optimization, conservative).This statistic fails to leverage on the longer observation period, and treats the follow up time as strictly the number of non-missing time points.For time series with severe missing rates, or shorter follow up, the conservative nature of this test statistic becomes a limitation for its power.To address this issue, we additionally consider a scaled version of the above statistic, where we leverage on the unobserved time points, by calculating DF ρ,s = T n DF ρ,c .We will refer to this method as MLENS (maximum likelihood estimation by numerical optimization, scaled).
When assumptions of normality and lag order of one are satisfied, the algorithm benefits from exact model specification, allowing for accurate results with computational ease.

Maximum Likelihood Estimation with Expectation Maximization algorithm
We additionally consider a maximum likelihood estimation approach using an iterative expectation maximization algorithm (MLEEM).We calculate initial estimates for the parameters as in (5).The algorithm then recursively imputes missing observations, and updates parameters until convergence.In iteration j, we impute missing observations chronologically such that for unobserved y t , ŷ(j) t = ρ(j−1) ŷ(j) t−1 .In the M step, we maximize the likelihood using the imputed time series, which is equivalent to ordinary least squares estimation when regressing Y t by Y t−1 , and update estimates ρ(j) and σ2 (j) .Similarly to the numerical optimization method, the EM approach for maximum likelihood estimation benefits from a exact model specification, and computational efficiency when assumptions hold.It additionally allows for direct calculation of the ADF test statistic from the fully imputed time series, avoiding issues of reduced power encountered by numerical optimization without imputation.

State Space Model with Multiple Imputation
We lastly propose a more flexible imputation method to be used prior to unit root testing.The state space model with multiple imputation (SSMimpute) does not assume normality, nor a single order lag structure (Cai et. al., 2022).Assuming q lag order, the state space model is fit by regressing Y t by its lagged values, Y t−q , ..., Y t−1 .We generate initial estimates for the missing values of Y t using Kalman filtering, and impute these values in the lagged regressors, Y t−q , ..., Y t−1 , but not Y t .The algorithm then recursively until convergence estimates the coefficients of the regressors and the variance through maximizing the likelihood and calculating new imputations from the posterior to update missing values of Y t−q , ..., Y t−1 .Note that in the case of a single lag order, the maximization step will be estimating ρ and σ 2 as defined previously.When convergence is achieved for the likelihood and coefficient estimation, multiple imputations are drawn from the last posterior distribution.These imputations are used to conduct the ADF test.Because the Dickey Fuller test statistic's distribution has no closed form, we pool test results across imputations by calculating the median test statistic, as has been employed in other contexts (Eekhout et. al., 2017;van de Wiel et. al., 2009).Specifically, the SSMimpute method (Cai et. al., 2022)  The SSMimpute approach for unit root testing relaxes assumptions by not relying on normality and single lag order.Although this makes it slightly more computationally intensive, by not calculating multiple imputations within each iteration, it eases the computational burden without compromising on estimate accuracy (Cai et. al., 2022).Additionally Cai et. al. (2022) demonstrated the method has low bias across a range of data generation scenarios.
Full properties of the four proposed methods for testing unit root non-stationarity with missing data can be seen in Table 1.

MNAR sensitivity analyses
Data is classified as missing not at random, or MNAR when the probability of not observing a time point is dependent on unobserved information.This is the most severe missing data classification, and yet it cannot be tested for due to its nature.In the case of a univariate time series, we assume that data MNAR would signify that the probability of missing a time point is dependent on the value of that individual time point.In the psychiatric mobile health context, this missingness seems plausible, with participants being less likely to report a negative mood when in a depressive state.Because we are unable to observe or test for the influence of missing not at random, sensitivity analyses are commonly used to assess the impact of MNAR data, through incorporating a range of values of δ which represent the hypothesized difference between observed and unobserved time points.We propose methods to conduct such a sensitivity analysis for unit root testing for the maximum likelihood estimation with expectation maximization and state space model with multiple imputation methods.
The MLEEM method's sensitivity analysis within each iteration, recursively across time t adds a term to each imputation for missing value y t such that ŷ(j) t = ρ(j−1) ŷ(j) t−1 ± δ t .Thus δ t represents the hypothesized missing mechanism, or specifically the expected difference in y t given r t = 1 and y t given r t = 0.As our eventual goal is testing for unit root, we wish avoid creating any major jumps in the time series when imputing, to not disrupt autocorrelation estimates.To do so, we assume that for a missing gap from time points y u to y v , the time series will add δ t to imputations for the first half of time points t = (u, u + 1, ..., u + ⌊ v−u 2 ⌋), and subtract δ ).This will essentially create a rise and fall peak effect across each missing gap.
We focus on the context where δ t is constant across t, however we additionally consider the case from Example 15.4 in Little and Rubin (2019) where it is hypothesized that all missing values fall between (λ, ∞).Note that in this scenario, for a sensitivity analysis λ would be chosen as a range of plausible values which would subsequently inform δ t .Under this hypothesis, conditioning on the normality assumption we can further inform our δ.We let φ and Φ represent the standard normal density and cumulative distribution functions respectively, and obtain the following adjustment: .
For the SSMimpute approach, following initialization, we incorporate δ within the E-step of each iteration of the algorithm.After imputation of missing values in lagged regressors, we will add a δ informed term to ensure all imputations are shifted to represent the hypothesized MNAR effects.Thus if the state space model in iteration j imputes for lagged regressor Y t−q at time point t a value ỹ(j) t−q , we calculate the final imputation in this iteration to be ŷ(j) t−q = ỹ(j) t−q + m t−q δ t−q .Similarly to the MLEEM sensitivity analysis, we wish to create a rise and fall of imputed values, to avoid a large jump in the imputed time series.Thus for a missing gap in between time points u and v, we let m t = (t − u + 1) for t = (u, u + 1, ..., u ).We will refer to this adjustment as peak δ.We additionally consider the case where a value missing has a stagnant effect, and let m t = 1 across all time points (δ s ).As by the time the algorithm converges the δ-adjustment should be incorporated into the posterior, we do not additionally impose the adjustment to the multiple imputations sampled from said posterior.

Strict assumptions Relatively computationally intensive
To assess the performance of proposed methods against existing missing data approaches for univariate time series, we conduct a simulation study, divided in two parts.The first applies methods under the hypothesis that data is MCAR or MAR, and the later employs the proposed sensitivity analyses, under the assumption data is MNAR.

Main Simulation
We conduct 2000 simulations generating time series with T = 500 and a single lagged order such that y t = ρy t−1 + ǫ t and ǫ t ∼ N (0, 1).For each simulation, we consider ρ = .5,.9,.95, 1.Within each simulation for the four time series we consider MCAR, MAR, MNAR missing mechanisms, with missing rates of 30, 50, and 70 percent.For MCAR data we determine missingness by a random sample of T from a binomial distribution with fixed p = 0.3, 0.5, 0.7.For MAR, we specify p = c × t, with c set such that on average the desired missing rate was achieved, and the probability of missing increases as t increases.Lastly, to simulate MNAR, we first consider an extreme deterministic scenario, where values of y t above the 30, 50, or 70th quantile are all classified as missing (MNAR-D).We additionally for this simulation consider data such that probability of missing, p is calculated as an interaction between t 2 and y t (MNAR-T).Specifically, we calculate p = e ct 2 yt /(1 + e ct 2 yt ), with c defined such that on average the desired missing rate is achieved.For each resulting time series, we apply mean imputation (M), last observation carry forward (LOCF), linear and spline interpolation (IntL, IntS)and Kalman smoothing imputation (K) from the imputeTS package in R (Moritz & Bartz-Beielstein, 2017), and complete case analysis (CC) including only observed time points.For each of these single imputations and complete case analysis we apply the Dickey Fuller test.We additionally consider multiple imputation with chained equations (MICE) with covariates y t−1 , t and m = 5 from the mice R package (van Buuren & Groothuis-Oudshoorn, 2011), and obtain a pooled result for the Dickey Fuller test by taking the median test statistic.We compare these methods with results from MLEN, MLENS, MLEEM, and SSMimpute with m = 5, and a single lag order.In addition to extracting the test statistic and p-value for each simulation, we calculate the estimated autocorrelation, ρ.To evaluate the methods, we examine autocorrelation, test statistic, and p-value distributions across ρ, and the power across different values of ρ and type 1 error.
For the simulation we consider 4 different MNAR mechanisms.First, we consider MNAR-D and MNAR-T, as described above.We additionally consider, MNAR-P, which uses a probabilitic framework such that the time series observations were divided into three quantiles with 40%, 10%, and 0% missing rates, in descending order of quantile.The fourth, MNAR-H, a hybrid between probabilistic and deterministic missing, added an additional quantile to MNAR-P in which all data was missing.As in the main simulation, we consider 30, 50, and 70 percent missing rates, and set quantile cutoffs accordingly.For MLEEM, we consider δ t = 0.05, 0.1, 0.2, 0.3 with a fixed δ across time.For SSMimpute, we consider a peak δ t = 0.05, 0.1, 0.2, 0.3.We remove the stagnant δ adjustment for SSMimpute due to convergence issues.

Simulation Results
The percent of simulations which rejected the null of unit root in the main simulation are shown in Table 2 by missing rate, method, missing mechanism (MCAR, MAR, MNAR-D), and ρ.For ρ = 1, the value represents type I error, while for ρ < 1, the value is equivalent to power.The entry with rate of 0 for each ρ under complete case (CC) represents the performance of the ADF test with no missing data.We omit the case where ρ = 0.5, as here all methods have power greater than 0.95.Additionally, MICE and mean imputation are omitted, but across all simulations both methods had a type I error greater than 0.9, demonstrating their ineffectiveness for unit root testing.The MLEN method is indeed very conservative, with type I error of 0 for MCAR and MAR data, and low power.When data is MCAR or MAR, MLEEM, MLENS, and SSMimpute have power over 79% for a ρ of 0.95.These methods additionally demonstrate type I error less that 0.05 with a missing rate of 30% when data is MCAR or MAR.Complete case analysis also has comparable type I error, but power suffers with higher rates of missing data, as expected.When data is MAR with 70% missing, the complete case power is merely 32%, compared to 96% and 85% for MLENS and SSMimpute, respectively.We observe that LOCF exhibits both inflated type I error and reduced power, while linear and spline (excluded from table) interpolation and Kalman smoothing imputation both bias towards non-stationarity and have low power.For a deterministic missing not at random, or MNAR-D, time series, all methods perform poorly as expected, given they are misspecified.Regardless of method, even with only 30% missing not at random, type I error is greater than 8%, meaning an increase chance of rejecting the null of non-stationarity when the time series is truly non-stationary.With SSMimpute this effect is especially severe, with a type I error of 30%.For time series with probability of missing as an interaction of value and time, MNAR-T, and moderate missing rates, methods no longer exhibit an inflated type I error, but power does suffer, particularly for complete case analysis.The box-plots of p-values by method and missing mechanism with 50% missing in Figure 1 further demonstrate simulation results.High power is seen by box-plots for ρ < 1 mostly below the α = 0.05 cutoff, and low type I error is visualized by little of the ρ = 1 distribution of p-values below the α = 0.05 line.Across all methods the MNAR-T mechanism does not suffer from the high type I error seen for MNAR-D, but does have lower power compared to MCAR and MAR.In Figure 2, the ρ estimates by method are plotted, demonstrating the biases by method and missing mechanism.The four proposed methods, SSMimpute, MLEEM, MLEN, and MLENS all have low bias when data is MCAR or MAR.As seen with p-values, mean and MICE imputation underestimate ρ. linear interpolation, and Kalman smoothing imputation bias results towards non-stationarity, as is seen by their estimates above the true value of ρ, when ρ < 1.As expected, complete case analysis underestimates ρ when ρ < 1, and is accurate when ρ = 1.In Web Appendix A, we include graphs for the p-values, ρ estimates, and test statistics with 30, 50, and 70 percent missing across methods, ρ, and missing mechanism.
In the sensitivity analysis simulation, we see that across different types of MNAR data the proposed δ adjustments are able to shift results from the ADF test.Specifically, we demonstrate that with positive δ Table 2: Simulated power (when ρ < 1) and type I error (when ρ = 1) by missing mechanism, ρ, missing rate, and method.We denote complete case analysis by CC, SSMimpute by SSM, last observation carry forward by LOCF, linear interpolation by IntL, and Kalman smoothing imputation by K.  values, we are able to correct the inflated type I error exhibited by MNAR-D data in the general simulation.In Figure 3, p-values for the SSMimpute and MLEEM methods are visualized across a range of δ values for MNAR-D time series.Increases in δ generate a increase in p-value for both methods, representing a reduced likelihood of rejecting the null hypothesis and thus a shift towards non-stationarity.Simulation results additionally demonstrate that the SSMimpute method appears more sensitive to changes in δ compared to MLEEM.This was expected, as it is a more flexible model and thus more able to adapt to reflect changes in the imputation structure.The results for hybrid and probibalistic missing not at random are similar to those seen with deterministic missing not at random.For MNAR-T, we did not see the same pattern of inflated type I error in prior to sensitivity analyses, and the proposed δ adjustments tested fail to incorporate the impact of t on the probability of missing.Despite these limitations, the increasing δ adjustments do provide an increasing range of p-value and autocorrelation estimates.This is especially pronounced when applied to time series with higher rates of missing data.Full visualizations across tested MNAR mechanisms are shown in Web Appendix B, with autocorrelation and p-value graphs by missing mechanism, missing rate, and ρ.
Overall when the time series is MCAR or MAR, the simulation demonstrates the high power and acceptable type I error of proposed methods, and the limitations of existing methods when testing for unit root nonstationarity.We additionally observe inflated type I error across all methods when data is MNAR.The impact of time points missing not at random can be assessed using a sensitivity analysis with the MLEEM or SSMimpute methods by specifying a range of δ adjustments.

Application to Mobile Health Data
We applied the proposed methods for unit root testing in univariate time series with missing data to mobile health data collected from the Bipolar Longitudinal Study (BLS).BLS is an ongoing longitudinal cohort study of 74 participants with Bipolar I or II disorder, schizophrenia, or schizoaffective disorder recruited from the Psychotic Disorders Division at McLean Hospital in Belmont, MA.The Beiwe mobile application developed by Jukka-Pekka Onnela's lab (Onnela et. al., 2021, Huang & Onnela, 2019;Barnett & Onnela, 2020) is employed to passively collect data on physical activity, GPS locations, and telecommunications of texts and calls.The application additionally was configured to prompt users to daily respond to a 5-minute survey at 5:00 PM on their moods, sleep, social activity, and psychotic symptoms.To apply proposed methods, we focus on negative and positive mood scores which are built as aggregates of survey responses to questions.Negative mood score is generated from questions relating to fear, anxiety, embarrassment, hostility, stress, upset, irritation, and loneliness, ranging from 0 to 27 (Cai et. al., 2022).Positive mood score is from questions relating to stress management, determination, and being alert, energetic, happy, inspired, and outgoing, with scores ranging from 0 to 28.Across participants, missing rates in mood scores vary from less than 10% to over 90%.We applied new methods SSMimpute, MLEEM, MLEN, MLENS, and compare them to existing methods of complete case analysis, LOCF, linear and spline interpolation, Kalman smoothing imputation, mean imputation, and MICE for unit root testing.We additionally considered MNAR senstivity analyses for the SSMimpute and MLEEM methods.As the mood score errors are not normally distributed, we found MLE methods conditioning on normality were unfit, and tended to over-report nonstationarity and estimate ρ close to 1.To evaluate the performance of other methods, we highlight the time series of three participants.We first focus on the negative mood score for a participant with bipolar disorder followed for 708 days, with a 22.6% missing rate in negative mood.We will concentrate on the first 407 days of follow up, prior to a significant life change for the patient which created a change point in the observed data.As seen in Figure 4, the negative mood time series of participant 1 appears non-stationary with potential seasonality.All methods but mean and MICE imputation had a p-value greater than 0.05, and thus failed to reject the null hypothesis of unit root.These results concord with those found in the simulation, where MICE and mean imputation strongly bias test results towards stationarity regardless of the true ρ.For reference, all but mean imputation and MICE estimated the autocorrelation to be greater than 0.95.As expected, a sensitivity analysis for the SSMimpute and MLEEM methods with δ values ranging from 0 to 1 had no impact on failing to reject the null of non-stationarity.
We additionally consider the positive mood score of a participant with bipolar disorder with 89 days of follow up and a 41.6% missing rate in positive mood observations.In Figure 4, we see for the positive score of participant 2 almost all missing time points are consecutive, with only one observation following the missing block.All methods but spline interpolation, Kalman smoothing imputation, MLEN and MLENS reject the null of unit root.This reflects results seen in the simulation, where spline interpolation and Kalman smoothing imputation appear to bias towards non-stationarity, reflected by an inflation in the type I error.Comparing autocorrelation estimates, those of linear and spline interpolation and Kalman smoothing imputation are larger than 0.6, while for complete case analysis, SSMimpute, and LOCF they are less that 0.4.This example of a short time series with high rates of missing data demonstrates the discrepancies in results that can be obtained using different methods, and the importance of considering the impact of the chosen method on autocorrelation estimation and unit root testing.When considering values of δ from 0 to 1 for a MNAR senstivity analysis, MLEEM with a δ ≥ 0.5 fails to reject the null hypothesis.SSMimpute shows increased estimates of autocorrelation, but across all considered δ concludes that the time series is stationary.
Lastly, we include the negative mood score of a participant with schizophrenia and 1461 days of follow up with 34.6% missing.In Figure 4, the time series visually appears stationary, with intermittent missing.Indeed, all methods reject the null hypothesis of unit root and find the autocorrelation to be less than one.This example demonstrates that with sufficiently low value of ρ, a relatively low missing rate, and a long follow up, the power is high enough that the chosen method to handle missingness has no impact on unit-root testing.Even with in sensitivity testing, we found little changes in autocorrelation estimates and no changes by Dickey Fuller test results across δ values.

Discussion
The increase in not fully observed time series data from contexts such as mHealth necessitates the adaptation of current time series analysis methods to a missing data setting.In particular, testing a univariate time series for stationarity is an important evaluation to inform appropriate subsequent analyses.However, there are few recommendations for testing unit root with missing data, and no consideration for missing mechanisms beyond missing completely at random.Here we propose adaptations to the ADF test including  an MLE approach for testing for unit root when the time series has lag order of one and normal and independent errors, and a state space model multiple imputation method to test for unit root when these conditions are not met.We additionally introduce sensitivity analyses to evaluate the impact of data MNAR on test results and autocorrelation estimation.In our simulation we find that both methods are effective under MCAR and MAR mechanisms, with improved power and reasonable type I error when missingness is moderate.We additionally show that our proposed sensitivity analyses are effective at generating a range of autocorrelation estimates and test results depending on the missing mechanism assumptions.By applying the existing univariate time series missing methods and our proposed methods to mood time series from the Bipolar Longitudinal Study, we demonstrate that our simulation results hold in observed data, and the limitation of the proposed MLE method when errors are not normally distributed.
The proposed SSMimpute, MLEEM, and MLENS methods offer computational efficiency, with high power and acceptable type I error across MCAR and MAR time series with varying rates of missingness.They additionally offer researchers the ability to evaluate the impact of hypothesized MNAR mechanisms on test results through sensitivity analyses.When noise is normally distributed and there is a single lag error, we recommend the MLENS method which demonstrates high power and low type I error when correctly specified.Should data be hypothesized to be missing not at random, the MLEEM method can be employed for sensitivity analyses.Should the time series follow a more complex generation process, the SSMimpute method is most appropriate, as it can handle non-normal noise and lag order greater than one.These methods will allow for improved validity of future mHealth analyses by accurate unit root testing to inform model selection and appropriate assumptions.However, the general limitations in power of the ADF test, especially when autocorrelation is close to one or there are few observations (Harris, 1992) still apply to the proposed methods.Additionally, the ADF test and thus proposed adaptations to address missing data only test for unit root non-stationarity.Other violations to stationarity such as change points and unstable variance could also be found in mHealth data, and should also be tested for to inform subsequent analyses.Further research is needed to develop hypothesis tests for additional types of non-stationarity in the context of missing data.We also note that more research is needed to provide a framework for identifying the correct lag order under missing data.

Figure 1 :
Figure 1: P-value box plots by missing mechanism, ρ, and method for time series with 50% missing.The red horizontal line indicates cut-off values of α = 0.05.We denote complete case analysis by CC, last observation carry forward by LOCF, linear interpolation by IntL, Kalman smoothing imputation by K, and mean imputation by M.

Figure 3 :
Figure 3: P-value box plots for sensitivity analysis by missing rate, true ρ, method, and δ adjustment.The red horizontal line indicates cut-off values of α = 0.05.

Figure 4 :
Figure 4: Time series plots for three participants included in BLS study.Missing observations are marked by red shading.

Table 1 :
Properties of four proposed methods for unit root testing with missing data.