High-dimensional changepoint estimation with heterogeneous missingness

We propose a new method for changepoint estimation in partially-observed, high-dimensional time series that undergo a simultaneous change in mean in a sparse subset of coordinates. Our first methodological contribution is to introduce a 'MissCUSUM' transformation (a generalisation of the popular Cumulative Sum statistic) that captures the interaction between the signal strength and the level of missingness in each coordinate. In order to borrow strength across the coordinates, we propose to project these MissCUSUM statistics along a direction found as the solution to a penalised optimisation problem tailored to the specific sparsity structure. The changepoint can then be estimated as the location of the peak of the absolute value of the projected univariate series. In a model that allows different missingness probabilities in different component series, we identify that the key interaction between the missingness and the signal is a weighted sum of squares of the signal change in each coordinate, with weights given by the observation probabilities. More specifically, we prove that the angle between the estimated and oracle projection directions, as well as the changepoint location error, are controlled with high probability by the sum of two terms, both involving this weighted sum of squares, and representing the error incurred due to noise and the error due to missingness respectively. A lower bound confirms that our changepoint estimator, which we call 'MissInspect', is optimal up to a logarithmic factor. The striking effectiveness of the MissInspect methodology is further demonstrated both on simulated data, and on an oceanographic data set covering the Neogene period.


Introduction
The Big Data era offers the exciting prospect of being able to transform our understanding of many scientific phenomena, but at the same time many traditional statistical techniques may perform poorly, or may no longer be computable at all, when applied to contemporary data challenges. A core assumption that underpins much of statistical theory, as well as the way in which we think about statistical modelling, is that our data are realisations of independent and identically distributed random variables. However, practical experience reveals that this is typically unrealistic for modern data sets, and developing methods and theory to handle departures from this important but limited setting represents a key theme for the field.
In contexts where data are collected over time, one of the simplest generalisations of an independent and identically distributed data stream is given by changepoint models.
Here, we postulate that our data may be segmented into shorter, homogeneous series. Of course, the structural break, or changepoint, between these series is often of interest in applications, such as distributed denial of service monitoring of network traffic (Peng, Leckie and Ramamohanarao, 2004), disease progression tracking via the alignment of electronic medical records (Huopaniemi et al., 2014) and the analysis of 'shocks' in stock price data (Chen and Gupta, 1997).
Another issue that turns out to be critical in working with Big Data in practice is that of missing data. One reason for this is that when each observation is high-dimensional, it is frequently the case that most or even every observation has missingness in some coordinates; thus a complete-case analysis, which simply discards such observations, is unviable (Zhu, Wang and Samworth, 2019).
The aim of this paper is to study the core, high-dimensional changepoint problem of a sparse change in mean, but where our data are corrupted by missingness. In fact, in cases where our data arise as discrete observations of several continuous processes, the observation times in different coordinates may not be the same, and such a setting also fits within our framework. A key feature of both our methodology and theory is that we wish to be able to handle heterogeneous missingness, i.e. where the levels of missingness may differ across coordinates. Specifically, our primary theoretical goal is to understand the way in which the missingness interacts with the signal strengths in the different series to determine the difficulty of the problem.
In Section 2, we consider a setting where the practitioner has access to a partially-observed p × n data matrix, where p is the number of series (coordinates) being monitored, and n is the number of time points. We seek to identify a time at which the p-dimensional mean vector changes, in at least one coordinate. One of the key ideas that underpins our methodological contribution is to define a new version of the popular Cumulative Sum (CUSUM) transformation (Page, 1955) that is able to handle the missingness appropriately. This operation, which we refer to as the MissCUSUM transformation, returns a p × (n − 1) matrix, and the intuition is that in coordinates that undergo a change in mean, the transformed series should peak in absolute value near the changepoint. One of the main advantages of our proposal is that it avoids the need to impute missing data.
Since the changepoint location is shared across the signal coordinates, it is natural to seek to borrow strength across the different data streams to estimate the changepoint. To this end, our next goal is to estimate a projection direction, in order to convert the MissCUSUM transformation into a univariate CUSUM series. Such a projection direction should ideally maximise the signal-to-noise ratio of the projected series. When the data are fully observed, the oracle projection direction turns out to be the leading left singular vector of the (rank one) CUSUM transformation of the mean matrix. This facilitates estimation approaches based on entrywise ℓ1-penalised M-estimation, as in the inspect algorithm of Wang and Samworth (2018). A crucial difference when we have to handle missing data, however, is that the MissCUSUM transformation of the mean matrix is no longer of rank one, which means that the entrywise ℓ1-penalty no longer adequately captures the sparsity structure of the vector of mean change. Instead, we introduce a new optimisation problem that penalises the ℓ1-norm of the leading left singular vector of a rank one approximation of the MissCUSUM transformation. This methodological proposal, which we call MissInspect, leads to considerably improved performance. Implementation code for our method is available in the GitHub repository https://github.com/wangtengyao/MissInspect. A further benefit of the MissInspect methodology is that it is amenable to theoretical analysis. In particular, we study a Missing Completely At Random model with row-homogeneous missingness; in other words, the observation probability remains constant in each row, but may vary arbitrarily across rows. In Proposition 1 in Section 3, we provide a high-probability bound on the angle between the estimated and oracle projection directions. Theorems 1 and 2 then establish high-probability bounds on the accuracy of the estimated changepoint location whenever the estimated and oracle projection
directions are sufficiently well aligned, for a sample-splitting variant of our algorithm. Theorem 1 provides a very general guarantee, while Theorem 2 establishes a faster rate whenever the observation probability in each row satisfies a lower bound. This faster rate comprises two terms, representing the error incurred due to noise in the observations, and the error due to missingness, respectively. The key quantity in both of these terms turns out to be a weighted Euclidean norm of the vector of mean change, where the weights are given by the observation probabilities in each row. This weighted norm therefore captures the interaction between the signal strength and the missingness probabilities, and suggests that our analysis handles effectively the heterogeneity of the missingness across rows (a more naive analysis would see the worst-case observation probability appearing in the bounds). This intuition is confirmed by our minimax lower bound (Theorem 3), which indicates that the MissInspect algorithm attains the minimax rate of convergence in all problem parameters, up to a logarithmic factor.
Section 4 explores the empirical performance of our MissInspect methodology. We study the ability of the algorithm to estimate both the oracle projection direction and the changepoint location, and compare with an alternative algorithm that imputes the missing entries using the well-known softImpute algorithm (Mazumder, Hastie and Tibshirani, 2010), and then runs the original inspect algorithm. We find that the MissInspect algorithm considerably outperforms this approach, which provides further evidence of its practical utility in missing data settings. In this section, we also present an application of the MissInspect methodology to detect changes in oceanographic currents from carbon isotope measurements extracted from cores drilled into the ocean floor. Section 5 discusses various methodological and theoretical extensions of our proposal to more complicated problems, such as the estimation of multiple changepoints, or more general data generating and missingness mechanisms. Proofs of our main results are given in Section 6, with auxiliary results deferred to the Appendix.
The study of changepoint problems dates back at least to Page (1955), and has since found applications in many different areas, including genetics (Olshen et al., 2004; Zhang et al., 2010), disease outbreak detection (Sparks, Keighley and Muscatello, 2010), aerospace engineering (Henry, Simani and Patton, 2010) and functional magnetic resonance imaging studies (Aston and Kirch, 2013), in addition to those already mentioned. Entry points to the literature include Csörgő and Horváth (1997) and Horváth and Rice (2014). In high-dimensional changepoint settings, where we may have a sparsity assumption on the coordinates of change, prior work includes Bai (2010), Zhang et al. (2010), Horváth and Hušková (2012), Cho and Fryzlewicz (2014), Chan and Walther (2015), Jirak (2015), Cho (2016), Soh and Chandrasekaran (2017), Wang and Samworth (2018), Enikeeva and Harchaoui (2019), Padilla et al. (2019) and Liu, Gao and Samworth (2021). The only works of which we are aware on changepoint estimation with missing data are those of Xie, Huang and Willett (2013) and Londschien, Kovács and Bühlmann (2021), both of which consider settings different from ours. Xie, Huang and Willett (2013) study a situation where partially-observed sequential data lie close to a time-varying, low-dimensional submanifold embedded within an ambient space; on the other hand, Londschien, Kovács and Bühlmann (2021) consider changepoint detection in graphical models. Finally, we mention that our focus in this work is on the offline version of the changepoint estimation problem, where the practitioner sees the whole data set prior to determining a changepoint location. The corresponding online version, where data are observed sequentially and the challenge is to declare a change as soon as possible after it has occurred, has also received attention in recent years; see, e.g., Mei (2010), Xie and Siegmund (2013), Chan (2017) and Chen, Wang and Samworth (2021).
We conclude this section by introducing some notation that is used throughout the paper. Given n ∈ N, we let [n] := {1, . . ., n}. For a vector u = (u_1, . . ., u_M)^T ∈ R^M, a matrix A ∈ R^{M×N} and a weight vector q = (q_1, . . ., q_M)^T ∈ (0, 1]^M, we let ‖u‖_{r,q} := (Σ_{i=1}^M |u_i|^r q_i)^{1/r}. Writing σ_1(A), . . ., σ_s(A) for the nonzero singular values of A, where s := rank(A), we let ‖A‖_op := max_{i∈[s]} σ_i(A), ‖A‖_* := Σ_{i=1}^s σ_i(A) and ‖A‖_F := (Σ_{i=1}^s σ_i(A)^2)^{1/2} denote its operator, nuclear and Frobenius norms respectively. We also write ‖u‖_0 := Σ_{i=1}^M 1{u_i ≠ 0}. We denote by diag(u) the M × M diagonal matrix with u as its diagonal. For S ⊆ [M] and T ⊆ [N], we write u_S := (u_i : i ∈ S)^T ∈ R^{|S|} and write A_{S,T} ∈ R^{|S|×|T|} for the sub-matrix of A obtained by extracting the rows and columns with indices in S and T respectively. For two matrices A, B ∈ R^{M×N}, we denote their trace inner product as ⟨A, B⟩ := tr(A^T B), and their Hadamard product as A • B. For two nonzero vectors u, v ∈ R^M, we write ∠(u, v) for the acute angle between them. We let B^M := {x ∈ R^M : ‖x‖_2 ≤ 1} and S^{M−1} := {x ∈ R^M : ‖x‖_2 = 1} denote the unit Euclidean ball and sphere in R^M respectively, and define S^{M−1}(k) := {x ∈ S^{M−1} : ‖x‖_0 ≤ k}. Given positive sequences (a_n), (b_n), we write a_n ≲ b_n to mean that there exists a universal constant C > 0 such that a_n ≤ C b_n for all n.
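For readers who find code helpful, the weighted norm ‖u‖_{r,q}, which plays a central role in our results, can be computed as in the following NumPy sketch (the function name is ours, for illustration only):

```python
import numpy as np

def weighted_norm(u, q, r=2):
    """Weighted r-norm ||u||_{r,q} = (sum_i |u_i|^r * q_i)^(1/r)."""
    u = np.asarray(u, dtype=float)
    q = np.asarray(q, dtype=float)
    return float((np.abs(u) ** r * q).sum() ** (1.0 / r))
```

With q equal to the all-one vector, this reduces to the usual Euclidean norm when r = 2.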

MissInspect methodology
Throughout this work, we will assume that the practitioner has access to a partially-observed p × n data matrix. We will denote the full data matrix as X = (X_{j,t}) ∈ R^{p×n}, and let Ω = (ω_{j,t}) ∈ {0, 1}^{p×n} denote the revelation matrix, so that ω_{j,t} = 1 if X_{j,t} is observed, and is equal to zero otherwise. Formally, then, we can regard the observed data as (X • Ω, Ω); note here that, since the practitioner has access to the matrix Ω, they are able to distinguish between an observed zero and a zero caused by missingness in X • Ω. For our theoretical analysis, the • notation is a convenient way of avoiding the need to introduce an 'NA' category for missing values.
In our theory, we will regard X as a realisation of a random matrix, whose mean matrix we denote by µ = (µ_1, . . ., µ_n) ∈ R^{p×n}. The changepoint structure of µ is encoded via the assumption that there exist z ∈ [n − 1] and µ^(1), µ^(2) ∈ R^p with θ := µ^(2) − µ^(1) ≠ 0 such that

µ_t = µ^(1) for t = 1, . . ., z, and µ_t = µ^(2) for t = z + 1, . . ., n. (1)

In Section 3, we will assume that the change in mean is sparse, in the sense that ‖θ‖_0 ≤ k for some k that is typically much smaller than p. However, we remark that our methodology is adaptive to this unknown sparsity level.
Our goal is to estimate the changepoint location z. To this end, we first introduce a new version of the CUSUM transformation that is appropriate in our missing data setting. Writing L_{j,t} := Σ_{r=1}^t ω_{j,r}, R_{j,t} := Σ_{r=n−t+1}^n ω_{j,r} and N_j := L_{j,n} = R_{j,n} for j ∈ [p] and t ∈ [n], we define the MissCUSUM transformation T^Miss_{p,n} : R^{p×n} × {0, 1}^{p×n} → R^{p×(n−1)} by

[T^Miss_{p,n}(M, Ω)]_{j,t} := {L_{j,t} R_{j,n−t} / (L_{j,t} + R_{j,n−t})}^{1/2} { (1/L_{j,t}) Σ_{r=1}^t M_{j,r} ω_{j,r} − (1/R_{j,n−t}) Σ_{r=t+1}^n M_{j,r} ω_{j,r} }

when L_{j,t} > 0 and R_{j,n−t} > 0, and define [T^Miss_{p,n}(M, Ω)]_{j,t} := 0 otherwise. Since the subscripts p and n of T^Miss_{p,n} can be inferred from the dimensions of its arguments, we will frequently abbreviate this transformation as T^Miss. We note that this transformation only depends on M through M • Ω. In practice, we will always apply this transformation to pairs of the form (M • Ω, Ω); in other words, an entry of the first argument is zero whenever the corresponding entry of the second argument is zero. When the data matrix is fully observed (i.e. Ω is an all-one matrix), the MissCUSUM transformation reduces to the standard CUSUM transformation T(M) := T^Miss(M, Ω).
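As an illustrative sketch, the transformation can be implemented in a few lines of NumPy, replacing t and n − t in the classical CUSUM normalisation by the observed counts L_{j,t} and R_{j,n−t} (the helper name miss_cusum is ours; the implementation in our GitHub repository may differ):

```python
import numpy as np

def miss_cusum(X, Omega):
    """MissCUSUM transform of (X * Omega, Omega): observed-count weights
    replace t and n - t in the classical CUSUM normalisation."""
    X = np.asarray(X, dtype=float) * Omega          # zero out missing entries
    Omega = np.asarray(Omega, dtype=float)
    p, n = X.shape
    T = np.zeros((p, n - 1))
    cum_x = np.cumsum(X, axis=1)                    # running sums of observed values
    cum_w = np.cumsum(Omega, axis=1)                # running observation counts L_{j,t}
    tot_x, tot_w = cum_x[:, -1], cum_w[:, -1]
    for t in range(1, n):
        L = cum_w[:, t - 1]                         # observations among first t points
        R = tot_w - cum_w[:, t - 1]                 # observations among last n - t points
        ok = (L > 0) & (R > 0)
        left = np.where(ok, cum_x[:, t - 1] / np.maximum(L, 1), 0.0)
        right = np.where(ok, (tot_x - cum_x[:, t - 1]) / np.maximum(R, 1), 0.0)
        T[:, t - 1] = np.where(ok, np.sqrt(L * R / np.maximum(L + R, 1)) * (left - right), 0.0)
    return T
```

When Ω is the all-one matrix, L = t and R = n − t, so the normalisation {LR/(L + R)}^{1/2} becomes the familiar {t(n − t)/n}^{1/2} of the standard CUSUM.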
A key feature of the MissCUSUM transformation is that it captures the interaction between the signal strength and the number of observations in each coordinate. To illustrate this, we focus on a single (jth) coordinate, and the noiseless setting where X = µ. In this case, the absolute MissCUSUM series attains its peak at the changepoint, and the peak level in the jth coordinate is controlled by the absolute mean change |θ_j|, together with the effective sample size min(L_{j,z}, R_{j,n−z}).
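A quick numerical check of this peak behaviour, in a single noiseless coordinate, can be sketched as follows (illustrative code under our reading of the transformation; the helper name is ours). Note that LR/(L + R) lies between min(L, R)/2 and min(L, R), so the peak level is within a factor of √2 of |θ_j| min(L_{j,z}, R_{j,n−z})^{1/2}:

```python
import numpy as np

def miss_cusum_1d(x, omega):
    """One-coordinate MissCUSUM with observed-count normalisation."""
    x = np.asarray(x, dtype=float) * omega
    n = len(x)
    out = np.zeros(n - 1)
    for t in range(1, n):
        L, R = omega[:t].sum(), omega[t:].sum()
        if L == 0 or R == 0:
            continue                                  # no observations on one side
        out[t - 1] = np.sqrt(L * R / (L + R)) * (x[:t].sum() / L - x[t:].sum() / R)
    return out

rng = np.random.default_rng(0)
n, z, theta = 100, 40, 1.5
mu = np.where(np.arange(n) < z, 0.0, theta)           # noiseless signal: jump at z
omega = rng.binomial(1, 0.3, size=n)                  # roughly 30% of points observed
T = np.abs(miss_cusum_1d(mu, omega))
# the peak of |MissCUSUM| sits at the changepoint, at level
# |theta| * sqrt(L_z * R_{n-z} / (L_z + R_{n-z}))
```

The peak is typically attained on a plateau of time points between consecutive observations, which is why combining coordinates, as discussed next, helps to localise z.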
Another interesting property of the MissCUSUM transformation is that the multivariate setting allows us to borrow strength across the different coordinates to compensate for some of the missingness. To see this, note that the MissCUSUM transformation is piecewise constant in each coordinate. In particular, even in the noiseless setting, the absolute MissCUSUM series will typically not have a unique maximiser in each coordinate, but combining the information across coordinates allows us to pin down the changepoint location to an interval whose endpoints are determined by the observation times closest to the changepoint, on either side, across all coordinates. This will often be a shorter interval than one would obtain from any of the individual component series.
The next step of the MissInspect algorithm is to use the MissCUSUM transformation to find a good projection direction v ∈ S^{p−1}. The idea is that even though it is not possible to project the data along v, due to the missingness, we can nevertheless compute the univariate series ((v^T T^Ω)_t)_{t∈[n−1]}, where T^Ω := T^Miss(X • Ω, Ω). Writing A^Ω := T^Miss(µ • Ω, Ω) and A := T(µ), if each column of X has identity covariance matrix, then for a generic projection direction v ∈ S^{p−1}, we find that E{(v^T T^Ω)_t | Ω} = (v^T A^Ω)_t and Var{(v^T T^Ω)_t | Ω} = 1. In Proposition 2 below, we will show that A^Ω can be well approximated by the rank one matrix (diag √q)A = (θ • √q)γ^T, where the vector γ ∈ R^{n−1} attains its peak in absolute value at the true changepoint location z. Substituting this rank one approximation into the expression for E{(v^T T^Ω)_t | Ω} suggests that an oracle projection direction is a unit vector in the direction of θ • √q, which is the leading left singular vector of (diag √q)A.
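In code, the oracle projection direction is simply the unit vector along θ • √q (an illustrative NumPy sketch with hypothetical values; the function name is ours):

```python
import numpy as np

def oracle_direction(theta, q):
    """Oracle projection direction: unit vector along theta * sqrt(q)."""
    v = np.asarray(theta, dtype=float) * np.sqrt(np.asarray(q, dtype=float))
    return v / np.linalg.norm(v)

theta = np.array([2.0, -1.0, 0.0, 0.0])   # sparse mean change (example values)
q = np.array([0.5, 0.125, 0.9, 0.9])      # per-coordinate observation rates
v = oracle_direction(theta, q)
# coordinates with no mean change receive zero weight, regardless of their q_j
```

Note that the weighting by √q downweights signal coordinates that are rarely observed, reflecting their reduced contribution to the projected signal-to-noise ratio.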
For the corresponding problem with fully observed data, Wang and Samworth (2018) proposed a semi-definite relaxation technique for estimating the oracle projection direction. Unfortunately, since A^Ω is not a rank one matrix when some data are missing, this relaxation turns out to be too coarse, and a new approach is required. Motivated by the fact that θ • √q has the same sparsity pattern as θ, and viewing T^Ω as a perturbation of (diag √q)A, we propose to estimate the oracle projection direction by solving the following optimisation problem:

(v̂, ŵ) ∈ argmax_{ṽ ∈ B^p, w̃ ∈ B^{n−1}} { ṽ^T T^Ω w̃ − λ ‖ṽ‖_1 }, (2)

where λ > 0 is a tuning parameter to be specified later. Here, with a suitable choice of λ, the ℓ1 penalty on ṽ in (2) exploits the sparsity of the oracle projection direction to allow for consistent estimation of (θ • √q)/‖θ • √q‖_2, even when the dimension p is large, as will be shown in Proposition 1 in Section 3. A further advantage of (2) over the semi-definite relaxation approach is that it directly exploits the row sparsity pattern of the rank one matrix (θ • √q)γ^T, as opposed to just the overall entrywise sparsity of this matrix. Using the estimated oracle projection direction v̂, we can project the MissCUSUM transformation T^Ω of (X • Ω, Ω), and estimate the changepoint by the location of the maximum absolute value of the univariate projected series. Pseudocode for the MissInspect algorithm is given in Algorithm 1.
Algorithm 1: Pseudocode of the MissInspect algorithm
Input: X_Ω = X • Ω, Ω, λ > 0
Output: ẑ
The optimisation problem in Step 2 of Algorithm 1 is bi-concave in (ṽ, w̃); i.e., the objective is concave in ṽ for every fixed w̃ and concave in w̃ for every fixed ṽ. Hence, we can alternate between optimising over ṽ and w̃ in (2). By inspecting the Karush–Kuhn–Tucker conditions as in Lemma 1, we see that when λ < ‖T^Ω‖_{2→∞}, both steps of each iteration have closed-form expressions, which lead us to the iterative procedure to optimise (2) given in Algorithm 2. In that algorithm, we define the soft-thresholding function soft(x, λ) := sgn(x) max(|x| − λ, 0), applied entrywise. We remark that T^Ω is known to the practitioner, so we can always choose λ < ‖T^Ω‖_{2→∞}. As usual for such iterative algorithms for bi-concave optimisation, the objective increases at each iteration; empirically, we have not observed any convergence issues.
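The alternating scheme can be sketched as follows: each ṽ-update is a soft-thresholding of T^Ω w̃ followed by normalisation, and each w̃-update is a normalised matrix–vector product (an illustrative Python sketch; the initialisation and stopping rule in Algorithm 2 may differ, and the function names are ours):

```python
import numpy as np

def soft(x, lam):
    """Entrywise soft-thresholding: sgn(x) * max(|x| - lam, 0)."""
    return np.sign(x) * np.maximum(np.abs(x) - lam, 0.0)

def sparse_left_direction(T, lam, n_iter=50):
    """Alternately maximise v^T T w - lam * ||v||_1 over unit balls (sketch)."""
    # initialise w from the leading right singular vector of T
    _, _, Vt = np.linalg.svd(T, full_matrices=False)
    w = Vt[0]
    v = np.zeros(T.shape[0])
    for _ in range(n_iter):
        v = soft(T @ w, lam)                  # closed-form v-update
        nv = np.linalg.norm(v)
        if nv == 0:                           # lam too large: no direction survives
            break
        v /= nv
        w = T.T @ v                           # closed-form w-update
        w /= np.linalg.norm(w)
    return v
```

Each update solves its block subproblem exactly, so the objective in (2) is non-decreasing along the iterations.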
Algorithm 2: Pseudocode for an iterative procedure optimising (2)
We conclude this section by illustrating the MissInspect algorithm in action in Figure 1. Here, with n = 250 and p = 100, we generated n independent p-variate Gaussian observations with mean structure (1) and identity covariance matrix. We took z = 100, and θ = ϑk^{−1/2}(1_k^T, 0_{p−k}^T)^T, with k = 10 and ϑ = 2. Thus, the first 10 coordinates represent signals, while the remaining 90 are noise coordinates. All entries of our data matrix were observed independently (and independently of the data), with probability 0.2. The top panels display visualisations of the data and the MissCUSUM transformation respectively. In the bottom-left panel, the coloured lines are the first five components of the MissCUSUM transformation; we see that these traces are piecewise constant, with jumps at observed data points. Even though each of these five is obtained from a signal coordinate, the locations of the peaks of these individual series would not yield very reliable changepoint estimates, both because the noise introduces considerable variability (e.g. the peak of the purple series runs from time 217 to 228), and because the missingness can lead to fairly long stretches where these series are constant. Nevertheless, once all 100 series are aggregated appropriately by our MissInspect algorithm, the resulting black trace does have a sharper peak close to the true changepoint. The bottom-right plot shows two nonparametric density estimates of the estimated changepoint locations from the MissInspect procedure over 1000 repetitions from this data generating mechanism; the first is a histogram, which requires the choice of a bin width, while the second is the log-concave maximum likelihood estimator (Dümbgen and Rufibach, 2009; Cule, Samworth and Stewart, 2010), which is fully automatic. Both indicate a sharp peak for the density close to the true changepoint; in the latter case, the mode is exactly at 100.

Theoretical guarantees
We will focus our theoretical analysis on the single changepoint setting, in order to try to articulate more clearly the way that the coordinate-wise signal-to-noise ratio and missingness mechanism interact to determine both the performance of the MissInspect algorithm and the fundamental difficulty of the problem. Moreover, we assume that the revelation matrix Ω = (ω_{j,t}) ∈ {0, 1}^{p×n} has a row-homogeneous distribution, in the sense that there exists a vector q = (q_1, . . ., q_p)^T ∈ (0, 1]^p such that ω_{j,t} ∼ Bern(q_j), independently for all j ∈ [p] and t ∈ [n]. We will refer to q as the observation rate vector. Such a row-homogeneous assumption may be appropriate, for instance, in applications where each component series is measured by a separate device with its own observation rate. As for the data, we will assume that the columns X_1, . . ., X_n of X are independent, with

X_t ∼ N_p(µ_t, σ² I_p), (3)

where µ_1, . . ., µ_n satisfy (1). We write (X, Ω) ∼ P_{n,p,z,θ,σ,q} when, in addition, Ω is row-homogeneous with observation rate vector q, independent of X.
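A draw from this single-change, row-homogeneous model can be simulated as follows (an illustrative NumPy sketch; the function name and defaults are ours):

```python
import numpy as np

def generate_data(n, p, z, theta, sigma, q, seed=0):
    """Draw (X, Omega): X_t ~ N_p(mu_t, sigma^2 I) with mean jump theta
    after time z, and omega_{j,t} ~ Bern(q_j) independently."""
    rng = np.random.default_rng(seed)
    mu = np.zeros((p, n))
    mu[:, z:] += np.asarray(theta, dtype=float)[:, None]   # mean change after time z
    X = mu + sigma * rng.standard_normal((p, n))
    Omega = rng.binomial(1, np.asarray(q, dtype=float)[:, None], size=(p, n))
    return X, Omega
```

The observed data passed to the algorithms would then be (X * Omega, Omega).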
Proposition 1. Let (X, Ω) ∼ P_{n,p,z,θ,σ,q} and let (v̂, ŵ) be obtained from Step 2 in Algorithm 1, applied with inputs X_Ω = X • Ω, Ω and λ ≥ 2σ√(n log(pn)). Then, with high probability, sin ∠(v̂, θ • √q) is bounded above by a sum of two terms. Considering the case λ = 2σ√(n log(pn)) for simplicity, Proposition 1 reveals that the first of these terms represents the estimation error caused by the noise in the data we observe, and we see that ‖θ‖_{2,q}/σ can be thought of as an effective signal-to-noise ratio.
On the other hand, the second term reflects the error due to our incomplete observations (and would be present even in the noiseless case with σ = 0); here ‖θ‖²_{2,q}/‖θ‖²_2 may be regarded as a signal-weighted observation probability.
From a theoretical point of view, the fact that v̂ is estimated using the entire available data set X_Ω makes it difficult to analyse the post-projection noise structure. For this reason, in the analysis below, we work with a sample-splitting variant of Algorithm 1, as given in Algorithm 3. Here, the projection direction v̂ is estimated using only the observed data at odd-numbered time points, and the MissCUSUM transformation of the observed data at even-numbered time points is then projected along v̂ to obtain the final estimate of the changepoint location.
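The odd/even split can be sketched in a few lines (illustrative Python with 0-based indexing, so 'odd-numbered time points' in the paper's 1-based convention are columns 0, 2, 4, . . .; the helper name is ours):

```python
import numpy as np

def split_odd_even(X_omega, Omega):
    """Split observed data into the first n1 odd- and n1 even-numbered
    time points (1-based): the direction is estimated on the first half,
    the changepoint is then located on the second."""
    n = X_omega.shape[1]
    n1 = n // 2
    odd = np.arange(0, 2 * n1, 2)       # 1-based odd time points
    even = np.arange(1, 2 * n1, 2)      # 1-based even time points
    return (X_omega[:, odd], Omega[:, odd]), (X_omega[:, even], Omega[:, even])
```

The final estimate multiplies the argmax over the even-numbered half by two, to convert back to the original time scale.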
Algorithm 3: Pseudocode for the sample-splitting variant of the MissInspect algorithm.
Input: X_Ω = X • Ω ∈ R^{p×n}, Ω ∈ {0, 1}^{p×n}, λ > 0
1 Set n_1 := ⌊n/2⌋;
2 Let Ω^(1) ∈ {0, 1}^{p×n_1} and Ω^(2) ∈ {0, 1}^{p×n_1} denote the matrices formed from the first n_1 odd- and the n_1 even-numbered columns of Ω respectively;
3 Let X^(1)_Ω ∈ R^{p×n_1} and X^(2)_Ω ∈ R^{p×n_1} denote the matrices formed from the first n_1 odd- and the n_1 even-numbered columns of X_Ω respectively;
4 Compute T^(1) := T^Miss(X^(1)_Ω, Ω^(1));
5 Compute T^(2) := T^Miss(X^(2)_Ω, Ω^(2));
6 Compute the projection direction v̂ from T^(1) as in Step 2 of Algorithm 1;
7 Set ẑ := 2 argmax_{t ∈ [n_1 − 1]} |(v̂^T T^(2))_t|;
Output: ẑ
Theorem 1 is our first main result on the performance of Algorithm 3 in contexts where our data are generated from a single-changepoint, row-homogeneous model P_{n,p,z,θ,σ,q}.
Theorem 1. Suppose (X, Ω) ∼ P_{n,p,z,θ,σ,q}. Assume for simplicity that n and z are even. Let ẑ be the output of Algorithm 3 with inputs X • Ω, Ω and λ = 2σ√(n log(pn)). Then there exist universal constants C, C′ > 0 such that, whenever a signal strength condition (4) holds, the estimation error |ẑ − z| is controlled with high probability. Condition (4) ensures that the projection direction v̂ obtained in Step 6 of Algorithm 3 has non-trivial correlation with the oracle projection direction θ • √q; cf. Proposition 1. An attractive feature of Theorem 1 is the way that the interaction between the signal strength and the observation rate is captured through ‖θ‖_{2,q}. As mentioned in the introduction, this weighted norm provides much greater understanding of the influence of missingness on the performance of the MissInspect algorithm than more naive bounds that depend on the worst-case missingness probability across all rows. For instance, we see that a high degree of missingness in noise or weak-signal coordinates may not have too much of a detrimental effect on performance compared with complete observation of these coordinates. See also Theorem 3 below for confirmation of the way in which ‖θ‖_{2,q} also controls the fundamental difficulty of the problem (not just for our procedure).
A further attraction of Theorem 1 is the absence of any condition on the number of observations in each row. On the other hand, it turns out that if the expected number of observations in each row is at least k/τ² (up to logarithmic factors), then we can obtain a substantially improved bound on the rate of estimation of Algorithm 3.
Theorem 2. Suppose (X, Ω) ∼ P_{n,p,z,θ,σ,q}. Assume for simplicity that n and z are even. Let ẑ be the output of Algorithm 3 with inputs X • Ω, Ω and λ = 2σ√(n log(pn)). Then, for suitably defined quantities τ and ρ, there exist universal constants c, C_1, C_2 > 0 such that, if ρ ≤ c and nτ² min_{j∈[p]} q_j ≥ C_1 k log(pn), then |ẑ − z| satisfies an improved high-probability bound. The rate obtained in Theorem 2 is essentially the square of that obtained in Theorem 1. In fact, an additional improvement is the reduction of ‖θ‖_2 in the second term to ‖θ‖_∞. Again, we see the decomposition of the estimation error into terms reflecting the noise in the observed data and the incompleteness of the observations respectively.
As a complement to Theorem 2, we now present a minimax lower bound, which studies the fundamental limits of the expected estimation error that are achievable by any algorithm. We write Ẑ for the set of estimators of z, i.e. the set of Borel measurable functions of the observed data (X • Ω, Ω) taking values in [n − 1]. Theorem 3 states, roughly, that if σ is not too small relative to ‖θ‖_{2,q}, then there exists c > 0, depending only on M, such that for n ≥ 4, the worst-case expected error of any estimator in Ẑ is at least c times the minimum of σ²/‖θ‖²_{2,q} and n.
Theorem 3 reveals that the MissInspect algorithm as given in Algorithm 3 attains the minimax optimal estimation error rate, up to logarithmic factors, in all of the parameters of the problem, at least in settings where the signals are of comparable magnitude. Note that the MissInspect algorithm also matches (deterministically) the second term in the minimum in Theorem 3, because it trivially satisfies |ẑ − z| ≤ n − 2. The form of the lower bound in Theorem 3 confirms that ‖θ‖_{2,q} is the correct functional of the mean change vector θ and observation rate vector q for capturing the difficulty of the changepoint estimation problem in our missing data setting.
Numerical studies

Choice of tuning parameter
The tuning parameter choice λ = 2σ√(n log(pn)) is convenient in our theoretical analysis. However, this choice often turns out to be slightly too conservative in practice, so to explore this, we considered the output of Algorithm 2 for a range of λ values, under several different settings of n, p, k, θ and q. Figure 2 displays the mean angle between the estimated projection direction v̂ from (2) and the oracle projection direction (θ • √q)/‖θ • √q‖_2 as a function of λ in two such sets of simulations. In both panels, we set n = 1000, p = 500, z = 400 and took λ = aσ√(n log(pn)) for a ∈ [0, 2]. The vector of mean change is θ = ϑk^{−1/2}(1_k^T, 0_{p−k}^T)^T. Data were observed according to the row-homogeneous missingness model with q_j = q_s if θ_j ≠ 0 and q_j = q_n otherwise. In the left panel of the figure, we set q_s = q_n = 0.2 and varied k ∈ {3, 10, 50} and ϑ ∈ {1, 1.5, 2, 2.5, 3}, whereas in the right panel, we took k = 3, ϑ = 2, σ = 1 and varied q_s, q_n ∈ {0.1, 0.2, 0.3, 0.4, 0.5}. We note that the choice λ = 2^{−1}σ√(n log(pn)) performs well in all settings, especially when the signal is relatively sparse. We therefore settled on this choice of λ throughout our numerical studies. It is reassuring to see from the right panel of Figure 2 that the performance of the projection direction estimator has almost no dependence on q_n, as predicted by our Proposition 1 (since ‖θ‖_{2,q} does not depend on q_n).

Validation of theoretical results
The aim of this subsection is to provide empirical confirmation of the forms of the bounds obtained in Proposition 1 and Theorem 2. In particular, we would like to verify that the crucial quantity ‖θ‖_{2,q} does indeed capture the appropriate interaction between signal and missingness that determines the performance of the MissInspect algorithm. The two panels of Figure 3 study the angle between the estimated and oracle projection directions, and the estimated changepoint location error, respectively. To obtain this figure, we set n = 1200, p = 1000 and generated data vectors under (3) with every entry observed independently with probability q ∈ {0.1, 0.2, 0.4, 0.8}, independent of the data. A single change occurred at z = 400 with vector of mean change θ = ϑk^{−1/2}(1_k^T, 0_{p−k}^T)^T and k = 3. We investigated the performance of MissInspect over 200 Monte Carlo repetitions for each of ϑ ∈ {0.5, 1, 1.5, 2} and σ ∈ {0.2, 0.4, 0.8, 1.6}. The left panel of Figure 3 shows that the logarithm of the mean sine angle loss decreases approximately linearly with log ‖θ‖_{2,q}, with gradient approximately −1. This is consistent with the conclusion of Proposition 1, which shows that the sine angle loss is controlled with high probability by an upper bound that is inversely proportional to ‖θ‖_{2,q} = ϑq^{1/2}. Moreover, curves corresponding to σ = 0.2, 0.4, 0.8, 1.6 are roughly equally spaced on the logarithmic scale, which corresponds to the linear dependence on λ = 2^{−1}σ√(n log(pn)) of the first term in the high-probability bound in Proposition 1. For fixed σ, the blue, orange and green curves are approximately overlapping, especially when the sine angle loss is large. In particular, doubling ϑ and reducing q by a factor of four leaves the sine angle loss virtually unchanged in these settings, which is consistent with the first term in the high-probability upper bound in Proposition 1 being the dominant one, with its reciprocal dependence on ‖θ‖_{2,q} = ϑq^{1/2}. The contribution from the second term in
Proposition 1 is still visible in the high signal-to-noise ratio settings where, for instance when σ = 0.2, the ϑ = 2 curve (green) lies above the ϑ = 1 curve (orange). This is again consistent with the form of the second term in the bound in Proposition 1, which, in our setting, does not depend on ϑ, but is inversely proportional to q^{1/2}.
A similar story emerges in the right panel of Figure 3 for the changepoint location estimation accuracy. Here, for fixed ϑ and σ, most points lie on approximate straight lines with slope −2, which is in agreement with the ‖θ‖^{−2}_{2,q} dependence in the high-probability bound on |ẑ − z| in Theorem 2. The σ² dependence in the first term of the bound in Theorem 2 is represented by the mostly equi-spaced curves for the four different equi-spaced σ values on the logarithmic scale. The contribution of the second term in the bound in Theorem 2 can be seen from the three curves corresponding to the smallest noise scale σ = 0.2. Here, the estimation error only improves slightly as ϑ increases, which is in agreement with our theoretical prediction, since, in the setting of this simulation, the second term in Theorem 2 is proportional to q^{−1/2} and does not depend on ϑ.

Comparison with a competitor
Since we are not aware of other methods that have been proposed for the problem studied in this paper, in this subsection, we compare the performance of MissInspect with a natural alternative that combines the idea of handling missing data via imputation with the original inspect procedure. Specifically, given a data matrix with incomplete observations, this comparator first applies the softImpute procedure of Mazumder, Hastie and Tibshirani (2010), with the maximum matrix rank parameter set to 2 (since X • Ω can be viewed as a perturbation of its mean (diag √q)µ, which has rank 2). It then performs changepoint estimation on the imputed data matrix using the inspect procedure of Wang and Samworth (2018), with the suggested regularisation parameter choice therein. We refer to this alternative approach as ImputeInspect. Table 1 compares the performance of MissInspect and ImputeInspect under various settings. Here, we choose n = 1200, p = 2000, k ∈ {3, √p, p}, ϑ ∈ {1, 2, 3}. Data are generated according to (3) with row-homogeneous missingness, independent of the data. The changepoint occurs at z = 400, with the vector of mean change having ℓ2 norm ϑ and proportional to (1, 2^{−1/2}, . . ., k^{−1/2}, 0, . . ., 0)^T. The observation rate vector q = (q_1, . .
., q p ) is randomly generated, independent of all other sources of randomness, such that q j iid ∼ Beta 10ν, 10(1 − ν) for j ∈ [p], where ν ∈ {0.1, 0.5}.Since both the ImputeInspect and MissInspect procedures are projection based, it is natural to compare their performance in both the projection direction estimation and the final changepoint estimation errors.Note that the oracle projection direction for ImputeInspect is parallel to θ (since the imputed matrix has no missing entries), whereas the oracle projection direction of MissInspect is parallel to θ • √ q.We see in Table 1 that MissInspect consistently outperforms ImputeInspect, often dramatically, for all observation fractions, sparsity levels and signal strengths considered.
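The imputation step of the comparator can be sketched in a few lines. The function below is a simplified, hypothetical stand-in for the softImpute procedure of Mazumder, Hastie and Tibshirani (2010): rather than their soft-thresholded SVDs with warm starts, it simply alternates a rank-truncated SVD reconstruction with refilling of the observed entries, which suffices to illustrate the pipeline.

```python
import numpy as np

def soft_impute(X, mask, rank=2, n_iter=100):
    """Iteratively fill missing entries with a rank-constrained SVD
    reconstruction. A simplified, illustrative stand-in for softImpute;
    X is p x n, mask is True where an entry is observed."""
    Z = np.where(mask, X, 0.0)           # initialise missing entries at zero
    for _ in range(n_iter):
        U, s, Vt = np.linalg.svd(Z, full_matrices=False)
        low_rank = (U[:, :rank] * s[:rank]) @ Vt[:rank]
        Z = np.where(mask, X, low_rank)  # keep observed entries, refill the rest
    return Z

# Example: complete a rank-1 matrix with a single missing entry.
X = np.outer(np.array([1.0, 2.0, 3.0, 4.0]), np.array([1.0, 1.0, 2.0, 2.0]))
mask = np.ones((4, 4), dtype=bool)
mask[0, 0] = False
Z = soft_impute(X, mask, rank=1)
```

ImputeInspect would then hand the completed matrix to the inspect procedure, whereas MissInspect works with the incomplete matrix directly.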

Real data analysis
In this subsection, we illustrate the applicability of the MissInspect algorithm on an oceanographic data set covering the Neogene geological period. Oceanographers study historic changes in the global ocean circulation system by examining microfossils that record the isotopic composition of the water at the time at which they lived (Wright and Miller, 1996). In particular, large cores are extracted from the ocean floor, and a species of microfossils called foraminifera are taken from small slices of sediment at different depths within the core. The ratio of the abundances of the ¹³C and ¹²C isotopes in their calcium carbonate shells is compared against a standard, in order to understand the carbon composition of the oceans during their lifetime, and hence to determine the direction in which ocean currents flowed. The depth of the foraminifera within the core is used as a proxy for the geological age of the fossil, measured in millions of years (Ma).
Our data, which are available in a GitHub repository, and were previously analysed by Samworth and Poore (2005) and Poore et al. (2006), consist of measurements from 16 cores extracted from the North Atlantic, Pacific and Southern Oceans, and are displayed in Figure 4. In total, there are 7369 observations at 6295 distinct time points, but Figure 4 makes clear that the heterogeneous nature of the data collection process means that it is appropriate to think of the data as containing missingness. The figure also indicates the 10 most prominent changepoints identified by applying the MissInspect algorithm in combination with binary segmentation, as discussed in Section 5 below. It is notable that the first changepoint, found by applying the algorithm to the full data set, occurs at 6.13 Ma, a time that has previously been identified as one of rapid change in oceanographic current flow (Poore et al., 2006, p. 13).

Extensions
As mentioned in the introduction, one of our main theoretical goals in this work is to understand the way in which the missingness and the signal interact in changepoint problems to determine the difficulty of the problem. This is particularly challenging when we seek to handle both high dimensionality and different levels of missingness in different coordinates. For the purposes of our theoretical analysis, then, it is natural to impose stronger assumptions elsewhere, so as best to expose the interesting phenomena at play. Nevertheless, it remains of interest to consider the extent to which the methodology could be generalised, and the assumptions relaxed, to cover a wider range of scenarios and problems that one might encounter in practice.
As we saw in analysing the oceanography data in Section 4.4, it may be that we wish to identify multiple changepoints. There are several standard techniques for extending single changepoint procedures to such settings, including binary segmentation and different versions of wild binary segmentation (Fryzlewicz, 2014; Kovács et al., 2020). Any of these approaches can be used in conjunction with the MissInspect algorithm to identify multiple changepoints in high-dimensional data streams in the presence of missingness. The theoretical analysis of such a procedure would be technically involved, but would proceed along similar lines to that of Wang and Samworth (2018) for the case of fully-observed data.
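The wrapper logic is generic: any single-changepoint routine can be applied recursively to sub-segments until no sufficiently large statistic remains. The sketch below illustrates plain binary segmentation, using a simple univariate CUSUM in place of MissInspect for self-containedness; the function names and the threshold rule are ours, chosen for illustration.

```python
import numpy as np

def cusum_fit(x):
    """Return (best split index, CUSUM statistic) for a single mean change.
    A plain univariate CUSUM stands in here for the MissInspect fit."""
    n = len(x)
    best_t, best_stat = None, 0.0
    csum = np.cumsum(x)
    total = csum[-1]
    for t in range(1, n):
        left = csum[t - 1] / t
        right = (total - csum[t - 1]) / (n - t)
        stat = abs(np.sqrt(t * (n - t) / n) * (left - right))
        if stat > best_stat:
            best_t, best_stat = t, stat
    return best_t, best_stat

def binary_segmentation(x, threshold, min_len=2):
    """Recursively apply the single-changepoint routine to sub-segments,
    keeping a split whenever its statistic exceeds the threshold."""
    changepoints = []
    def recurse(lo, hi):
        if hi - lo < 2 * min_len:
            return
        t, stat = cusum_fit(x[lo:hi])
        if t is not None and stat > threshold:
            changepoints.append(lo + t)
            recurse(lo, lo + t)     # search to the left of the split
            recurse(lo + t, hi)     # and to the right
    recurse(0, len(x))
    return sorted(changepoints)
```

Replacing `cusum_fit` by a MissInspect-style fit on the corresponding sub-matrix of the (incomplete) data gives the multiple-changepoint procedure used for the oceanographic analysis.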
As always when handling missing data, the situation becomes much more complicated when the missingness and the data are not independent, i.e. when the Missing Completely At Random (MCAR) assumption does not hold. In the worst case, the missingness may render the changepoint estimation problem impossible, for instance if no signal coordinate has observed data on both sides of the changepoint. A less adversarial setting would be one in which all observations exceeding 1 are censored. Thus, if the vector of mean change had positive entries in the signal coordinates, we would expect to see fewer observations in these coordinates after the change. The censoring would lead to (different) truncated Gaussian distributions before and after the change, but a difference between the means of these distributions would persist, so changepoint estimation may still be possible. In general, careful and problem-specific modelling of the dependence between the data and the missingness mechanism is recommended.
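This censoring mechanism is easy to simulate. The sketch below uses illustrative numbers (a mean change of 1.5, not taken from the paper) and confirms the two effects described above: the observation rate drops after the change, yet the means of the observed (truncated Gaussian) values still differ across the changepoint.

```python
import numpy as np

rng = np.random.default_rng(0)

# One signal coordinate: mean 0 before the change, mean 1.5 after
# (illustrative parameters only).
before = rng.normal(0.0, 1.0, size=100_000)
after = rng.normal(1.5, 1.0, size=100_000)

# Censor (treat as missing) every observation exceeding 1.
obs_before = before[before <= 1.0]
obs_after = after[after <= 1.0]

rate_before = obs_before.size / before.size   # close to Phi(1), about 0.84
rate_after = obs_after.size / after.size      # close to Phi(-0.5), about 0.31
```

The observed values follow different truncated Gaussians, but their means still differ (approximately −0.29 before versus 0.36 after the change for these parameters), so the change remains detectable even though the missingness now depends on the data.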
Finally, we discuss settings of temporal and spatial dependence in the data. In the former case, a natural model is to replace (3) with one in which µ_1, . . ., µ_n satisfy (1) and the noise vectors (W_1, . . ., W_n) form a mean-zero, stationary Gaussian process. In this case, with row-homogeneous missingness independent of the data, the oracle projection direction remains (θ • √q)/‖θ • √q‖_2, and the MissInspect methodology does not need to be altered. On the other hand, if spatial dependence is introduced into (1) by replacing the identity covariance matrix there with a general covariance matrix Σ, then the oracle projection direction becomes proportional to Σ^{−1}(θ • √q). If Σ is unknown, then estimating Σ^{−1} may represent a significant challenge, but the task may be considerably simplified if our data stream satisfies additional structural assumptions. For instance, if Σ = diag(σ_1², . . ., σ_p²), where σ_1, . . ., σ_p are unknown, then we can estimate these quantities robustly using, for example, the median absolute deviation of the marginal one-dimensional series (Hampel, 1974). As another example, if Σ is Toeplitz with Σ = (ρ^{|j−k|})_{j,k∈[p]} for some ρ ∈ (−1, 1), then Σ^{−1} is tridiagonal, and its form can again be used to estimate ρ (Wang and Samworth, 2018, Lemma 12).
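Both structural simplifications can be illustrated numerically. The sketch below uses a first-difference variant of the MAD scale estimator (our choice for illustration, since a single mean change then contaminates only one difference; the text refers simply to the MAD of the marginal series), and verifies the tridiagonal structure of Σ^{−1} in the Toeplitz case.

```python
import numpy as np

def mad_scale(x):
    """Robust noise-scale estimate via the MAD of first differences
    (in the spirit of Hampel, 1974). Dividing by sqrt(2) accounts for
    differencing; the factor 1.4826 makes the estimator consistent for
    a Gaussian standard deviation. The first-difference construction is
    an illustrative choice: a single mean change affects one difference
    only, so the median is essentially unmoved."""
    d = np.diff(x)
    return 1.4826 * np.median(np.abs(d - np.median(d))) / np.sqrt(2)

rng = np.random.default_rng(1)
# Series with true sd 2 and a large level shift halfway through.
x = np.concatenate([rng.normal(0.0, 2.0, 500), rng.normal(10.0, 2.0, 500)])
sigma_hat = mad_scale(x)   # close to 2 despite the mean change

# Toeplitz case: Sigma = (rho^{|j-k|}) has a tridiagonal inverse.
rho, p = 0.6, 6
Sigma = rho ** np.abs(np.subtract.outer(np.arange(p), np.arange(p)))
Prec = np.linalg.inv(Sigma)
# Entries more than one step off the diagonal vanish (up to rounding).
off = np.abs(Prec[np.abs(np.subtract.outer(np.arange(p), np.arange(p))) >= 2])
```

The second part is the structural fact cited from Wang and Samworth (2018, Lemma 12): only the diagonal and first off-diagonal of Σ^{−1} are non-zero, so estimating the single parameter ρ determines the whole precision matrix.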
6 Proof of main results

Proof of Proposition 1
The proof of Proposition 1 requires the following result, which provides an entrywise control of the difference between the mean of the MissCUSUM transformation of (X • Ω, Ω) conditional on Ω and its unconditional mean.
First assume that j satisfies nτq_j ≥ 24 log(kn), and let a_j := 8 log(kn)/(3q_j); it follows that a_j ≤ nτ/6. Recall that L_{j,t} = ∑_{r=1}^{t} ω_{j,r} and R_{j,t} = ∑_{r=n−t+1}^{n} ω_{j,r} for j ∈ [p] and t ∈ [n], and consider the event A_{j,t}. By Bernstein's inequality (Lemma 3), we can bound the probability of its complement, where we have used the facts that tq_j ≤ 8 log(kn)/3 for t ≤ a_j − 1 and that 2 log(kn) ≤ 3tq_j/2 for t ≥ a_j in the penultimate inequality.
We now bound |∆_{j,t}| for different values of t on A_{j,t}. First, consider a_j ≤ t ≤ z. Since n − z ≥ nτ > a_j, we have δ_{j,n−z} ≤ δ_{j,nτ} ≤ 1/2. We deduce that, for j ∈ S and a_j ≤ t ≤ z, ∆_{j,t} ≤ √q_j A_{j,t}(δ_{j,t} + 4δ_{j,n−z}/3); by a similar calculation for deviations in the opposite direction, we also have ∆_{j,t} ≥ −√q_j A_{j,t}(δ_{j,t} + 4δ_{j,n−z}/3).
Thus, using the fact that A_{j,t} ≤ θ_j min{√t, √(n−z)}, we deduce that |∆_{j,t}| ≤ (7θ_j/6)√(log(kn)) for a_j ≤ t ≤ z. By symmetry, if z < t ≤ n − a_j, we also have |∆_{j,t}| ≤ (7θ_j/6)√(log(kn)).
A symmetric argument shows that |∆_{j,t}| ≤ 8θ_j√(log(kn)) for n − a_j ≤ t ≤ n − 1. Combining the above bounds on |∆_{j,t}|, we see that the same control holds for all j satisfying nτq_j ≥ 24 log(kn) and all t ∈ [n − 1]. We now turn our attention to j satisfying nτq_j < 24 log(kn). If q_j = 0, then ∆_{j,t} = 0, so we may assume that q_j > 0. Define ℓ_j := 24 log(kn)/(nτq_j), so that ℓ_j > 1. For j ∈ S, consider the event B_j. By Lemma 3 again, we can bound the probability of its complement. On B_j, we have (A_Ω)_{j,t} ≤ (A_Ω)_{j,z} ≤ θ_j min{√L_{j,z}, √R_{j,n−z}} ≤ θ_j√((1 + ℓ_j)nτq_j) ≤ θ_j√(48 log(kn)).
On the other hand, a corresponding lower bound holds, and consequently, when nτq_j < 24 log(kn), we obtain the same form of control on |∆_{j,t}|. The first claim follows from (6) and (7), and the remaining claim then follows, as desired.
The proposition follows on observing that the penultimate inequality in the corresponding display uses the fact that (A_Ω)_{j,t} − (T_Ω)_{j,t} | Ω ∼ N(0, σ²) for all t ∈ [n − 1] and j ∈ [p] such that L_{j,t}R_{j,n−t} ≠ 0, and that this difference is equal to 0 when L_{j,t}R_{j,n−t} = 0.
Proof of Theorem 1

Writing (X^{(2)}, Ω^{(2)}) for the data and missingness pattern at even time points, our main decomposition of interest here is T^{(2)}_Ω = A^{(2)}_Ω + E^{(2)}_Ω. Since Algorithm 3 remains the same if we replace v̂ in Step 6 with −v̂, we may assume without loss of generality that v̂^⊤(θ • √q) ≥ 0. Observe that, for every t ∈ [n_1 − 1], (v̂^⊤E^{(2)}_Ω)_t is stochastically dominated by N(0, σ²). Hence, together with the first conclusion of Proposition 2 and a union bound, there exists an event B with probability at least 1 − 4/(kn_1) − 1/n_1 on which the corresponding maximum is controlled. Combining (11) and (12), and increasing the universal constant C > 0 if necessary, we have by (4) that the projected series is suitably bounded on A ∩ B. In particular, on A ∩ B, we have from the definition of ẑ that (v̂^⊤T^{(2)}_Ω)_{z/2} ≥ 0, so on this event we have the basic inequality. By Wang and Samworth (2018, Lemma 7), on the event A ∩ B, for every t in the relevant range, combining (13), (14) and (12) yields (15). For C ≥ 84√3, we have by (4) that the minimum on the left-hand side of (15) must be achieved by the first term. We therefore deduce, with probability at least P(A ∩ B) ≥ 1 − 22/n, the desired conclusion.

Proof of Theorem 2
The proof of Theorem 2 will make use of the following propositions.
Since (ω_{j,r})_{j∈[p], r∈(t,z]} are independent Bern(q_j) random variables, we have by Lemma 3 that P(B_t^c) ≤ δ/6, and on B_t the first term is suitably controlled. For the second term on the right-hand side of (16), let F denote the σ-algebra generated by the (ω_{j,r}); each G_j is measurable with respect to F. We can therefore apply Lemma 3 conditionally on F to obtain that there is an event C_t with P(C_t^c | F) ≤ δ/6 on which each |G_j| is controlled in terms of log(12/δ). For 0 ≤ a < b ≤ n, define H_{j,(a,b)}, and consider the event

A_{j,t} := {max(H_{j,(0,t)}, H_{j,(0,z)}, H_{j,(z,n)}) ≤ 1/5} ∩ {H_{j,(t,z)} ≤ √(2 log(12p/δ)/((z − t)q_j)) + log(12p/δ)/(3(z − t)q_j)}.
By Lemma 3 again, we can bound P(A_{j,t}^c), where we used the assumption nτq_j ≥ 60 log(12p/δ) in the final inequality. On A_{j,t}, we also use the fact that t ≥ (49/50)nτ. Combining the resulting inequality with (18), on ∩_{j∈[p]} A_{j,t} ∩ C_t, we obtain a bound by the Cauchy–Schwarz inequality. For the third and fourth terms on the right-hand side of (16), since |z − t| ≤ nτ/50, we may use the fact that (1 + √(2a) + a/3)² ≤ 2 + 25a² for any a > 0. Therefore, on ∩_{j∈[p]} A_{j,t}, since |z − t| ≤ nτ/50 and nτ min_{j∈[p]} q_j ≥ 60k log(12p/δ), we have by the Cauchy–Schwarz inequality again that (21) holds. Finally, combining (16), (17), (20) and (21), and noting that ∑_{j=1}^{p} P(A_{j,t}^c) + P(B_t^c) + P(C_t^c) ≤ δ, the proof is complete.
Proposition 4. Suppose that Ω = (ω_{j,t})_{j∈[p],t∈[n]} and W = (W_{j,t})_{j∈[p],t∈[n]} are independent, with ω_{j,t} ∼ Bern(q_j) independently and q_j ∈ (0, 1], and with the W_{j,t} independent and identically distributed. If nτ min_{j∈[p]} q_j ≥ 20 log(11p/δ), then, for any δ ∈ (0, 1], the stated bound holds.

Proof. By symmetry, we may assume without loss of generality that t < z. We note that (E_Ω)_{j,z} − (E_Ω)_{j,t} is a centred normal random variable conditional on Ω, so we start by examining its conditional variance. By the definition of T^{Miss}, we have (22). Now, by the mean value theorem, there exists a random variable Ξ_j ∈ [L_{j,t}, L_{j,z}] such that (23) holds; a direct calculation also gives (24). Substituting (23) and (24) into (22), and observing that ∑_{r=1}^{n} W_{j,r}ω_{j,r} is positively correlated with the remaining term, the result follows on taking u := 2 log(11/δ) and using the fact that nτ min_{j∈[p]} q_j ≥ 20 log(11p/δ).
Proof of Theorem 2. We write z_1 := z/2 and n_1 := n/2. Taking C, C′ > 0 from Theorem 1, we may assume that c ∈ (0, 1/50] is small enough that hypothesis (4) of Theorem 1 is satisfied when ρ ≤ c. Hence, by Theorem 1, there is an event E with probability at least 1 − 22/n on which its conclusion holds. By further reducing c > 0 if necessary, we may assume that on E, and when ρ ≤ c, we have |ẑ − z| ≤ nτ/50. Let A^{(2)}, ∆^{(2)} and E^{(2)}_Ω be defined as in the proof of Theorem 1. With v̂ = (v̂_1, . . ., v̂_p)^⊤ ∈ S^{p−1} as defined in Algorithm 3, an inspection of the proof of Theorem 1 reveals that, on E, the corresponding bounds also hold for all relevant t. Recall that v̂ is measurable with respect to the σ-algebra generated by the odd-numbered time points, and that ∆^{(2)} and E^{(2)}_Ω are measurable with respect to the σ-algebra generated by the even-numbered time points. By taking the universal constant C_1 > 0 in the statement of the theorem to be sufficiently large, we can ensure that the lower bounds on nτ min_{j∈[p]} q_j in Propositions 3 and 4 are satisfied. It follows from these propositions that, when ρ ≤ c, for each t ∈ [z_1 − n_1τρ, z_1 + n_1τρ], there is an event A_t of probability at least 1 − n^{−2} on which both (27) and (28) hold. Combining (26), (27), (28) and the basic inequality as in (13), we have on the event E ∩ ∩_{t∈[z_1−n_1τρ, z_1+n_1τρ]} A_t, and with ρ ≤ c, that (29) holds for some unit-length (random) vector w = (w_j)_{j∈[p]} that is orthogonal to v̂ and some α, β ∈ R such that α² + β² = 1. Moreover, by inspecting the proof of Theorem 1, we see that on E we have |β| = sin ∠(v̂, v) ≤ ρ. Then from (29), the desired bound follows on this event, where the final step uses the definition of ρ and the fact that nτ² min_{j∈[p]} q_j ≥ C_1 k log(pn). The result follows since P(E ∩ ∩_{t∈[z_1−n_1τρ, z_1+n_1τρ]} A_t) ≥ 1 − 22/n − (2n_1τρ + 1)/n² ≥ 1 − 23/n.
A Auxiliary lemmas and proofs

We first note that maximisers of our objective function f exist, since f is concave and the constraint set is convex and compact. Moreover, for (v, w) ∈ B^p × B^n, we have f(v, w) ≤ g(v, w). If ‖M‖_{2→∞} < λ, then f(v, w) ≤ g(v, w) ≤ 0, with both equalities holding if and only if v = 0, and we deduce that v* = 0. If ‖M‖_{2→∞} = λ, then again f(v, w) ≤ g(v, w) ≤ 0, with both equalities holding if and only if either v = 0 or v^⊤Mw = ‖Mw‖_∞ = λ; the latter case yields the constraints on w* and v* given in the statement. Finally, we consider the case where ‖M‖_{2→∞} > λ. We can find (v_0, w_0) ∈ S^{p−1} × S^{n−1} such that ‖Mw_0‖_∞ = ‖M‖_{2→∞} and v_0^⊤Mw_0 = ‖v_0‖_1‖Mw_0‖_∞, and consequently f(v*, w*) ≥ f(v_0, w_0) = g(v_0, w_0) > 0.
In particular, we may assume that M^⊤v* ≠ 0 in the remainder of the proof.

Lemma 2. Suppose that A, T ∈ R^{p×n} satisfy ‖T − A‖_∞ ≤ λn^{−1/2} for some λ ≥ 0. Suppose further that v ∈ S^{p−1}(k) and w ∈ S^{n−1} are respectively the leading left and right singular vectors of A, and that

(v̂, ŵ) ∈ argmax_{(ṽ, w̃) ∈ S^{p−1} × S^{n−1}} {⟨T, ṽw̃^⊤⟩ − λ‖ṽ‖_1}.
We state below a version of Bernstein's inequality that is convenient to apply in our setting.
In particular, if Y ∼ Bin(n, q) for n ∈ N and q ∈ (0, 1), and H := Y/(nq) − 1, then for any u > 0 and δ ∈ (0, 1) the corresponding bounds hold for H. Moreover, the same conclusions hold with −S and −H replacing S and H respectively.
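For concreteness, the standard Bernstein-type bound that such a lemma typically packages can be written as follows; this is a sketch of the usual textbook statement, whose constants and exact form may differ from those of Lemma 3.

```latex
% Bernstein's inequality for independent, mean-zero, bounded summands
% (standard form; the precise statement of Lemma 3 may differ).
\[
  \mathbb{P}\bigl(S \ge \sqrt{2Vy} + My/3\bigr) \le e^{-y},
  \qquad S := \sum_{i=1}^n Y_i, \quad V := \sum_{i=1}^n \mathbb{E}(Y_i^2),
  \quad \mathbb{E} Y_i = 0, \quad |Y_i| \le M .
\]
% Specialising to Y \sim \mathrm{Bin}(n,q), with H := Y/(nq) - 1, M = 1 and
% V = nq(1-q) \le nq, gives for any u > 0
\[
  \mathbb{P}\Bigl( H \ge \sqrt{\tfrac{2u}{nq}} + \tfrac{u}{3nq} \Bigr) \le e^{-u}.
\]
```

Deviations in the opposite direction are handled, as stated above, by replacing S and H with −S and −H.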
Proof. Writing Y_i := a_i(X_i − q_i) for i ∈ [n], we have for any integer r ≥ 2 that E|Y_i|^r = a_i^r{q_i(1 − q_i)^r + (1 − q_i)q_i^r} ≤ a_i^r q_i(1 − q_i). Consequently, the first two conclusions follow from Boucheron, Lugosi and Massart (2013, (2.10) and Theorem 2.10). The final two conclusions follow from the first two on setting a = (1, . . ., 1)^⊤ ∈ R^n and y = nqu.

Figure 1: The MissInspect algorithm in action. Top left: visualisation of the data matrix with p = 100 and n = 250, where each column represents a p-dimensional observation and missing entries are shown in white. Darker colours indicate larger values. Time runs from left to right, and a change in mean occurs at time 100 in each of the first ten rows. Top right: visualisation of the MissCUSUM transformation of the data. Bottom left: the first five rows of the MissCUSUM matrix are plotted in colour, and the black curve shows the projected MissCUSUM series, which is maximised at the estimated changepoint location of 90 (black dashed line). The true changepoint is shown as a grey solid line. Bottom right: histogram of estimated changepoints over 1000 repetitions from the same data-generating setting; a log-concave density estimate is shown in red.
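The MissCUSUM transformation visualised in the top-right panel can be sketched as a CUSUM computed from observed entries only. The normalisation below, √(LR/(L+R)) with L and R the observed counts on either side of the split, is a plausible generalisation of the classical CUSUM weight and may differ in detail from the paper's exact definition.

```python
import numpy as np

def miss_cusum(X, mask):
    """CUSUM-type transformation using only observed entries.
    X: p x n data matrix; mask: p x n boolean array, True where observed.
    The sqrt(L*R/(L+R)) weight is an illustrative generalisation of the
    classical CUSUM normalisation."""
    p, n = X.shape
    T = np.zeros((p, n - 1))
    Xo = np.where(mask, X, 0.0)
    for j in range(p):
        for t in range(1, n):
            L = int(mask[j, :t].sum())   # observed count left of the split
            R = int(mask[j, t:].sum())   # observed count right of the split
            if L == 0 or R == 0:
                continue                 # statistic undefined; leave as zero
            left_mean = Xo[j, :t].sum() / L
            right_mean = Xo[j, t:].sum() / R
            T[j, t - 1] = np.sqrt(L * R / (L + R)) * (right_mean - left_mean)
    return T
```

With full observation, L = t and R = n − t, so the weight reduces to the classical √(t(n−t)/n), and the peak of the (projected) absolute series estimates the changepoint.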

Figure 4: Ratios of carbon isotope measurements taken from foraminifera in different cores from the North Atlantic, Pacific and Southern Oceans. The label of each panel indicates both the ocean and the number of the core, while the horizontal axis measures geological time (0–23 Ma). The red dashed lines indicate the 10 most prominent changepoints identified by applying the MissInspect algorithm in combination with binary segmentation, with the most significant change plotted as a solid line.