High-dimensional, multiscale online changepoint detection

We introduce a new method for high-dimensional, online changepoint detection in settings where a $p$-variate Gaussian data stream may undergo a change in mean. The procedure works by performing likelihood ratio tests against simple alternatives of different scales in each coordinate, and then aggregating test statistics across scales and coordinates. The algorithm is online in the sense that its worst-case computational complexity per new observation, namely $O\bigl(p^2 \log (ep)\bigr)$, is independent of the number of previous observations; in practice, it may even be significantly faster than this. We prove that the patience, or average run length under the null, of our procedure is at least at the desired nominal level, and provide guarantees on its response delay under the alternative that depend on the sparsity of the vector of mean change. Simulations confirm the practical effectiveness of our proposal.


Introduction
Modern technology has not only allowed the collection of data sets of unprecedented size, but has also facilitated the real-time monitoring of many types of evolving processes of interest. Wearable health devices, astronomical survey telescopes, self-driving cars and transport network load-tracking systems are just a few examples of new technologies that collect large quantities of streaming data, and that provide new challenges and opportunities for statisticians.
Very often, a key feature of interest in the monitoring of a data stream is a changepoint; that is, a moment in time at which the data generating mechanism undergoes a change. Such times often represent events of interest, e.g. a change in heart function, and moreover, the accurate identification of changepoints often facilitates the decomposition of a data stream into stationary segments.
Historically, it has tended to be univariate time series that have been monitored and studied, within the well-established field of statistical process control (e.g. Duncan, 1952; Page, 1954; Barnard, 1959; Fearnhead and Liu, 2007; Oakland, 2007; Tartakovsky, Nikiforov and Basseville, 2014). These days, however, it is frequently the case that many data processes are measured simultaneously. In the context of changepoint detection, this introduces the new challenge of borrowing strength across the different component series in an attempt to detect much smaller changes than would be possible through the observation of any individual series alone.
The field of changepoint detection and estimation also has a long history (e.g. Page, 1955), but has been undergoing a marked renaissance in recent years; entry points to the field include Csörgő and Horváth (1997) and Horváth and Rice (2014). However, the vast majority of this ever-growing literature has focused on the offline changepoint problem, where, after the entire data stream is observed, the statistician is asked to identify any changepoints retrospectively. For univariate, offline changepoint estimation, state-of-the-art methods include the Pruned Exact Linear Time method (PELT) (Killick, Fearnhead and Eckley, 2012), Narrowest-Over-Threshold (NOT) (Baranowski, Chen and Fryzlewicz, 2019), the Simultaneous Multiscale Changepoint Estimator (SMUCE) (Frick, Munk and Sieling, 2014) and $\ell_0$-penalisation (Wang, Yu and Rinaldo, 2018), while work on multivariate and high-dimensional offline changepoints includes the double CUSUM method of Cho (2016), the inspect algorithm of Wang and Samworth (2018), as well as Enikeeva and Harchaoui (2019), Liu, Gao and Samworth (2019) and Padilla et al. (2019).
Despite this rich literature on offline changepoint problems, it is the online version of the problem that is arguably the more important for many applications: one would like to be able to detect a change as soon as possible after it has occurred. Of course, one option here is to apply an offline method after seeing every new observation (or batch of observations). However, this is unlikely to be a successful strategy: not only is there a difficult and highly dependent multiple testing issue to handle when using the method repeatedly on data sets of increasing size, but moreover, the storage and running time costs may frequently be prohibitive.
In this work, we are interested in algorithms for detecting changepoints in high-dimensional data that are observed sequentially. In order to avoid the trap mentioned in the previous paragraph and ensure that any methods we consider can be applied to large data streams, we will focus our attention on online algorithms. By this, we mean that the computational complexity for processing each new observation should depend only on the number of bits needed to represent the new data point observed, and not on the storage requirements of any previously observed data. This turns out to be a very stringent requirement, in the sense that finding online algorithms with good statistical performance is typically extremely challenging. Online algorithms must necessarily store only compact summaries of the historical observations, so the class of all possible procedures is severely restricted.
To set the scene for our contributions, let $X_1, X_2, \ldots$ be a sequence of independent random vectors in $\mathbb{R}^p$. Assume that for some unknown, deterministic time $z \in \mathbb{N} \cup \{0\}$, the sequence is generated according to
$$X_1, \ldots, X_z \sim N_p(\mu^-, I_p) \quad \text{and} \quad X_{z+1}, X_{z+2}, \ldots \sim N_p(\mu^+, I_p), \qquad (1)$$
for some $\mu^-, \mu^+ \in \mathbb{R}^p$. When $\mu^+ \neq \mu^-$, we say that there is a changepoint at time $z$. In many applications, such as in industrial quality control, where the distribution of relevant properties of goods in a manufacturing process under regular conditions may be well understood, we may assume that the mean before the change is known (or at least can be estimated to high accuracy using historical data). However, the vector of change, $\theta := \mu^+ - \mu^-$, is typically unknown. Thus, for simplicity, we will work in the setting where $\mu^- = 0$ and $\mu^+ = \theta$. Let $P_{z,\theta}$ denote the joint distribution of $(X_n)_{n=1}^{\infty}$ under (1), and $E_{z,\theta}$ the expectation under this distribution. Note that when $\theta = 0$, the joint distribution of the data does not depend on $z$, and we therefore let $P_0 = P_{z,0}$ denote this joint distribution (with corresponding expectation $E_0$). We will then say that the data are generated under the null. By contrast, if $\theta \neq 0$, we will say that the data are generated under the alternative, though we emphasise that in fact the alternative is composite, being indexed by $z \in \mathbb{N}$ and $\theta \in \mathbb{R}^p \setminus \{0\}$. In practice, in order for our procedure to have uniformly non-trivial power, it will be necessary to work with a subset of the alternative hypothesis parameter space that is well separated from the null, in the sense that the $\ell_2$-norm of the vector of mean change, $\vartheta := \|\theta\|_2$, is at least a known lower bound $\beta > 0$.
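For concreteness, a data stream following model (1) with $\mu^- = 0$ and $\mu^+ = \theta$ can be simulated as follows. This is a minimal sketch: the function name `simulate_stream` and its interface are ours for illustration, and are not part of the ocd package.

```python
import numpy as np

def simulate_stream(n, p, z, theta, rng=None):
    """Simulate X_1, ..., X_n from model (1) with mu- = 0 and mu+ = theta:
    independent N_p(0, I_p) rows up to time z, then N_p(theta, I_p) rows."""
    rng = np.random.default_rng() if rng is None else rng
    X = rng.standard_normal((n, p))
    X[z:] += theta   # rows with (0-indexed) index >= z are post-change
    return X

# Example: p = 100, change at z = 300, 5-sparse mean change with 2-norm 1.
rng = np.random.default_rng(0)
p, z, s, vartheta = 100, 300, 5, 1.0
theta = np.zeros(p)
theta[:s] = vartheta / np.sqrt(s)
X = simulate_stream(500, p, z, theta, rng)
```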
We measure the performance of a sequential changepoint procedure via its responsiveness, subject to a given upper bound on the false alarm rate, or equivalently, a lower bound on the average run length in the absence of change. Specifically, following the concepts introduced by Lorden (1971), we define the patience of a sequential changepoint procedure $N$ to be $E_0(N)$, and its worst-case response delay to be
$$\bar{E}^{\mathrm{wc}}_{\theta}(N) := \sup_{z \in \mathbb{N}_0} \mathop{\mathrm{ess\,sup}} E_{z,\theta}\bigl\{(N - z)^+ \,\big|\, X_1, \ldots, X_z\bigr\}.$$
While controlling the worst-case response delay provides a very strong theoretical guarantee on the average detection delay of the procedure, even under the worst possible pre-change data sequence, obtaining a good bound for this quantity is often difficult. We therefore also consider the average-case response delay, or simply the response delay, of a procedure $N$, defined as
$$\bar{E}_{\theta}(N) := \sup_{z \in \mathbb{N}_0} E_{z,\theta}\bigl\{(N - z)^+\bigr\}.$$
We note that $\bar{E}_{\theta}(N) \leq \bar{E}^{\mathrm{wc}}_{\theta}(N)$. A good sequential changepoint procedure should have small worst- and average-case response delays, uniformly over the relevant class of alternatives $\{P_{z,\theta} : (z, \theta) \in \mathbb{N} \times \mathbb{R}^p, \|\theta\|_2 \geq \beta\}$, subject to its patience being at least some suitably large, pre-determined $\gamma > 0$. Finally, as mentioned above, we are interested in sequential changepoint procedures that are online, so that the computational complexity per additional observation should be a function of $p$ only.
Our main contribution in this work is to propose, in Section 2, a new algorithm called ocd (short for online changepoint detection) for high-dimensional, online changepoint detection in the above setting. The procedure works by performing likelihood ratio tests against simple alternatives of different scales in each coordinate, and then aggregating test statistics across scales and coordinates for changepoint detection. The ocd algorithm has worst-case computational complexity $O\bigl(p^2 \log(ep)\bigr)$ per new observation, so satisfies our requirement for being an online algorithm. In fact, as we explain in Section 2.1, the algorithmic complexity is often even better than this. Moreover, as we illustrate in Section 4, it has extremely effective empirical performance. In terms of theoretical guarantees, it turns out to be more convenient to analyse a slight variant of our initial algorithm, which we refer to as ocd′. This has the same order of computational complexity per new observation as ocd, but enables us to ensure that, whenever we are yet to declare that a change has occurred at the changepoint, only post-change observations contribute to the running test statistics. In practice, the original ocd algorithm also appears to have this property for typical pre-change sequences, and we argue heuristically that there is a sense in which it is more efficient than ocd′ by a factor of at most 2.
Our theoretical analysis in Section 3 initially considers separately versions of the ocd′ algorithm best tuned towards settings where the vector $\theta$ of change is dense, and where it is sparse in an appropriate sense. We then present results for a combined, adaptive procedure that seeks the best of both worlds. In all cases, the appropriate version of ocd′ has guaranteed patience, at least at the desired nominal level. In the (small-change) regime of primary interest, and when $\vartheta$ is of the same order as $\beta$, the response delay of ocd′ is of order at most $\sqrt{p}/\vartheta^2$ in the dense case, up to a poly-logarithmic factor; this can be improved to order $s/\vartheta^2$, again up to a poly-logarithmic factor, when the effective sparsity of $\theta$ is $s < \sqrt{p}$.
As alluded to above, there is a paucity of prior literature on multivariate, online changepoint problems, though exceptions include Tartakovsky et al. (2006), Mei (2010) and Zou et al. (2015). These works focus either on the case where both the pre- and post-change distributions are exactly known, or where, for each coordinate, both the sign and a lower bound on the magnitude of the change are known in advance. A number of methods have also been proposed that involve scanning a moving window of fixed size for changes (Xie and Siegmund, 2013; Soh and Chandrasekaran, 2017; Chan, 2017). Such methods can be effective when the signal-to-noise ratio is large enough that the change can be detected within the prescribed window, but may experience excessive response delay in other cases. Of course, the window size may be increased to compensate, but this correspondingly increases the computational complexity and storage requirements, so allowing the window size to vary with the signal strength would fail to satisfy our definition of an online algorithm.
Numerical results illustrate the performance of our ocd algorithm in Section 4. Proofs of our main results are given in Section 5, with various auxiliary results deferred to Section 6.

Notation
We write N 0 for the set of all non-negative integers.
For a matrix $A = (A_{i,j}) \in \mathbb{R}^{d_1 \times d_2}$ and $j \in [d_2]$, we write $A_{\bullet,j} := (A_{1,j}, \ldots, A_{d_1,j})^\top \in \mathbb{R}^{d_1}$ and $A_{-j,j} := (A_{1,j}, \ldots, A_{j-1,j}, A_{j+1,j}, \ldots, A_{d_1,j})^\top \in \mathbb{R}^{d_1 - 1}$. We use $\Phi(\cdot)$ and $\phi(\cdot)$ to denote the distribution function and density function of the standard normal distribution respectively. For two real-valued random variables $U$ and $V$, we write $U \geq_{\mathrm{st}} V$ if $P(U \leq x) \leq P(V \leq x)$ for all $x \in \mathbb{R}$. We adopt the convention that an empty sum is $0$.
An online changepoint procedure

The ocd algorithm
In this section, we describe our online changepoint procedure, ocd, in more detail. As mentioned in the introduction, the procedure aggregates likelihood ratio test statistics against simple alternatives of different scales in different coordinates. If we want to test a null of $N(0,1)$ against a simple post-change alternative distribution of $N(b,1)$ for some $b \neq 0$ in coordinate $j \in [p]$, then by Page (1954), the optimal online changepoint procedure is to declare that a change has occurred by time $n$ when the test statistic
$$R^j_{n,b} := \max_{0 \leq h \leq n} \sum_{i=n-h+1}^{n} b(X^j_i - b/2) \qquad (2)$$
exceeds a certain threshold. Note that $\sum_{i=n-h+1}^{n} b(X^j_i - b/2)$ can be viewed as the likelihood ratio test statistic between the null and this simple alternative using the tail sequence $X_{n-h+1}, \ldots, X_n$. Thus $R^j_{n,b}$ can be regarded as the most extreme of these likelihood ratio statistics, over all possible starting points for the tail sequence. Write
$$t^j_{n,b} := \mathop{\mathrm{sargmax}}_{0 \leq h \leq n} \sum_{i=n-h+1}^{n} b(X^j_i - b/2) \qquad (3)$$
for the length of the tail sequence over which the associated likelihood ratio statistic (in the $j$th coordinate) is maximised, where sargmax denotes the smallest element of the argmax. One way to aggregate across the $p$ coordinates would be to use $\sum_{j=1}^{p} R^j_{n,b}$ as a test statistic. However, this approach is not ideal for two reasons. Firstly, the exact distribution of the tail likelihood ratio statistic $R^j_{n,b}$ is hard to obtain, making it difficult to analyse the aggregated statistic under the null. More importantly, this aggregated statistic uses the same simple alternative $N(b,1)$ in all coordinates, and so even after varying the magnitude of $b$, it is only effective against a very limited set of alternative distributions in $\{P_{z,\theta} : z \in \mathbb{N}, \|\theta\|_2 \geq \beta\}$, namely those for which the change is of very similar magnitude in all coordinates. To overcome these problems, our procedure uses the coordinate-wise statistics $(R^j_{n,b} : j \in [p])$, which we call 'diagonal statistics', to detect changes that have a large proportion of their signal concentrated in one coordinate. Then, for each $j \in [p]$, we also compute tail partial sums of length $t^j_{n,b}$ in all other coordinates $j' \neq j$, given by
$$A^{j',j}_{n,b} := \sum_{i=n-t^j_{n,b}+1}^{n} X^{j'}_i,$$
and aggregate them to form an 'off-diagonal statistic' anchored at coordinate $j$. Note that the number of summands in $A^{j',j}_{n,b}$ depends only on the observed data in the $j$th coordinate, and not on the data being aggregated in the $j'$th coordinate. These off-diagonal statistics are used to detect changes whose signal is not concentrated in a single coordinate. Intuitively, if a change has occurred and $\theta_j / b \geq 1$, then we can expect the tail length in coordinate $j$ to be roughly of order $n - z$ for sufficiently large $n$, and this will ensure that the off-diagonal statistic anchored at coordinate $j$ is close to the generalised likelihood ratio test statistic between the null and the composite alternative $\{P_{z,\theta} : \|\theta\|_2 \neq 0\}$. If, in addition, a non-trivial proportion of the signal is contained in coordinates $[p] \setminus \{j\}$, then this statistic will be powerful for detecting the change.
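The per-coordinate statistic $R^j_{n,b}$ is computable online via the classical CUSUM recursion of Page (1954): $R^j_{n,b} = \max\{R^j_{n-1,b} + b(X^j_n - b/2),\, 0\}$, with the tail length resetting to zero whenever the running statistic returns to its floor. A minimal one-dimensional sketch (the function name `page_update` is ours, not from the ocd package):

```python
def page_update(R, t, x, b):
    """One step of Page's CUSUM recursion for a single coordinate and scale b:
    R_n = max(R_{n-1} + b*(x_n - b/2), 0), with the tail length t_n reset to 0
    whenever the running statistic hits its floor at 0."""
    R_new = R + b * (x - b / 2)
    if R_new <= 0:
        return 0.0, 0   # the maximising tail is the empty one
    return R_new, t + 1

# Under a sustained mean shift of size b in this coordinate, R drifts upwards
# at rate b^2/2 per observation, so it eventually exceeds any fixed threshold.
R, t = 0.0, 0
for x in [0.1, -0.2, 1.1, 0.9, 1.0]:   # toy data; last three mimic a shift of 1
    R, t = page_update(R, t, x, b=1.0)
# now R == 1.5 and t == 3
```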
The full description of the ocd procedure is given in Algorithm 1. Note that, for notational simplicity, we have suppressed the time dependence of many variables as they are updated recursively in the algorithm. In the following, when necessary, we will make this dependence explicit by writing $A_{n,b}$, $t_{n,b}$, $Q_{n,b}$, $S^{\mathrm{diag}}_n$ and $S^{\mathrm{off}}_n$ for the relevant quantities at the end of the $n$th iteration of the repeat loop.
By Lemma 10, the quantity $bA^{j,j}_{n,b} - b^2 t^j_{n,b}/2$, as defined in the algorithm, is equal to the quantity $R^j_{n,b}$ defined in (2) (we will also suppress its $n$ dependence when it is clear from the context). Moreover, by Lemma 11, the two definitions of $t^j_{n,b}$ from Algorithm 1 and (3) coincide. In the algorithm, we allow $b$ to range over the (signed) dyadic grid $\mathcal{B} \cup \mathcal{B}_0$, since the maximal signal strength in individual coordinates, $\|\theta\|_\infty$, can range from $\vartheta/\sqrt{p}$ to $\vartheta$. In this way, the algorithm automatically adapts to different signal strengths in each coordinate. Here, the inclusion of $\mathcal{B}_0$ and the extra logarithmic factors in the denominators of elements of $\mathcal{B} \cup \mathcal{B}_0$ appear for technical reasons in the theoretical analysis of the algorithm. Algorithm 1 uses $S^{\mathrm{diag}}$ and $S^{\mathrm{off}}$ to aggregate the diagonal and off-diagonal statistics respectively, as mentioned above, and declares that a change has occurred as soon as either of these quantities exceeds its own pre-determined threshold. As mentioned previously, $S^{\mathrm{diag}}$ tracks the maximum of $R^j_b$ over all scales $b$ and coordinates $j$. Before introducing $S^{\mathrm{off}}$, we first discuss the off-diagonal statistics $Q^j_b$ in Algorithm 1, which are $\ell_2$-aggregations of the normalised tail sums $A^{j',j}_b / \sqrt{t^j_b \vee 1}$, each hard-thresholded at level $a$. The hard-thresholding level can be chosen to detect dense or sparse signals $\theta$; in the sparse case, a non-zero $a$ facilitates an aggregation that aims to exclude coordinates with negligible change (thereby reducing the variance of the normalised tail sums). Finally, $S^{\mathrm{off}}$ is computed as the maximum of the $Q^j_b$ over all anchoring coordinates $j \in [p]$ and scales $b \in \mathcal{B}$.
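To make the aggregation concrete, the following sketch computes an off-diagonal statistic of the kind described above: an $\ell_2$-aggregation of hard-thresholded, normalised tail sums. The function name and the exact thresholding convention here are our simplifications for illustration; the precise form is specified in Algorithm 1.

```python
import numpy as np

def off_diagonal_stat(A_col, t_j, j, a):
    """Sketch of the off-diagonal statistic anchored at coordinate j: an
    l2-aggregation of the normalised tail sums A[j', j] / sqrt(t_j or 1)
    over j' != j, keeping only those whose magnitude exceeds the hard
    threshold a.  (Simplified; see Algorithm 1 for the exact form.)"""
    scale = np.sqrt(max(t_j, 1))
    z = A_col / scale                  # normalised tail sums
    z = np.delete(z, j)                # exclude the anchoring coordinate
    kept = z[np.abs(z) >= a]           # hard thresholding at level a
    return float(np.sum(kept ** 2))

A_col = np.array([3.0, 0.5, -4.0, 0.2])   # toy tail sums, anchored at j = 0
Q_dense = off_diagonal_stat(A_col, t_j=4, j=0, a=0.0)    # dense: keep all
Q_sparse = off_diagonal_stat(A_col, t_j=4, j=0, a=1.5)   # sparse: threshold
```

With `a = 0` every coordinate contributes, while a positive `a` discards the coordinates whose normalised tail sums are too small to be informative.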
Although the off-diagonal statistics described in the previous paragraph are effective for detecting changes when the signal sparsity is known, it is desirable for the practitioner to have a combined procedure that adapts to the sparsity level. This may be computed straightforwardly by tracking $S^{\mathrm{off}}$ for $a = a_{\mathrm{dense}}$ and $a = a_{\mathrm{sparse}}$, as well as $S^{\mathrm{diag}}$, and declaring a change when any of these three statistics exceeds a suitable threshold. Figure 1 illustrates the performance of this adaptive procedure, together with the time evolution of normalised versions of all three statistics tracked, on synthetic data sets both with and without a change. This adaptive procedure is analysed theoretically in Section 3.3 and empirically in Section 4.
The ocd procedure satisfies our definition of an online algorithm. Indeed, for each new observation, the algorithm updates the tail lengths $t_{n,b}$ and tail partial sums $A_{n,b}$ recursively, and then computes $S^{\mathrm{diag}}_n$ and $S^{\mathrm{off}}_n$ via $A_{n,b}$. These steps require $O\bigl(p^2 \log(ep)\bigr)$ operations. Moreover, the total storage used is $O\bigl(p^2 \log(ep)\bigr)$ throughout the algorithm.

Figure 1: Behaviour of the three normalised statistics in ocd under the null and under the alternative with different signal strengths, sparsity levels and assumed lower bounds. A change is declared as soon as one of these three normalised statistics exceeds 1. The data were generated in the top-left panel according to $P_0$, and, in the other panels, according to $P_{z,\theta}$, with $p = 100$, $z = 300$ and $\theta = \vartheta U$, where $U$ is uniformly distributed on the union of all $s$-sparse unit spheres in $\mathbb{R}^p$ (see Section 4.2 for a more detailed description).
In fact, the computational complexity of ocd can often be reduced, because typically $T := \{t^j_b : j \in [p], b \in \mathcal{B}\}$ has cardinality much less than $p|\mathcal{B}|$ (which is the worst case, attained when all elements are distinct). Correspondingly, at each time step, we need only store the $p \times |T|$ matrix $(B_{k,t})_{k \in [p], t \in T}$ given by $B_{k, t^j_b} := A^{k,j}_b$, resulting in an improved per-iteration computational complexity and storage for ocd of $O(p|T|)$. For simplicity of exposition, we have not presented this computational speed-up in Algorithm 1, and it appears to be difficult to provide theoretical guarantees on $|T|$. Nevertheless, we have implemented the algorithm in this form in the R package ocd (Chen, Wang and Samworth, 2020), and have found it to provide substantial computational savings in practice.

A slight variant of ocd
While the ocd algorithm performs very well numerically, it turns out to be easier theoretically to analyse a slight variant, which we call ocd′, and describe in Algorithm 2. Again, we have suppressed the time dependence $n$ of many variables, including $\tau_{n,b}$, $\tilde{\tau}_{n,b}$, $\Lambda_{n,b}$ and $\tilde{\Lambda}_{n,b}$, in the algorithm. The main difference between the two algorithms is that in ocd′, the off-diagonal statistics $Q^j_b$ are computed using tail partial sums of length $\tau^j_b$ instead of $t^j_b$. These new tail partial sums are recorded in $\Lambda_b \in \mathbb{R}^{p \times p}$.
By Lemma 16, we always have
$$t^j_b / 2 \leq \tau^j_b \leq 3 t^j_b / 4 \qquad (4)$$
whenever $t^j_b \geq 2$. In this sense, the tail sample size used by ocd′ is smaller than that of ocd by a factor of at most 2. The benefit of using a shorter tail in ocd′ is that, when $n$ exceeds a known, deterministic threshold, we can be sure that, whenever we have not declared that a change has occurred by time $z$, the tail partial sums consist exclusively of post-change observations. In practice, we observe that, even in Algorithm 1, the tail lengths $t^j_{z,b}$ at the changepoint are generally very short for many coordinates, so the inclusion of a few pre-change observations in the tail partial sum calculation does not significantly affect the efficacy of the changepoint detection procedure. Indeed, in practice, Algorithm 1 is statistically more efficient than Algorithm 2 in many settings, by a factor of between 4/3 and 2, as suggested by (4). By construction, $\tau^j_b$ and $\Lambda^{\bullet,j}_b$ are computable online, through the auxiliary variables $\tilde{\tau}^j_b$ and $\tilde{\Lambda}^{\bullet,j}_b$. Thus Algorithm 2 is also an online algorithm, with overall computational complexity per observation and storage remaining at $O\bigl(p^2 \log(ep)\bigr)$ in the worst case; similar computational improvements to those mentioned for ocd at the end of Section 2.1 are also possible here.
Algorithm 2: Pseudo-code of the ocd′ algorithm, a slight variant of ocd.

Theoretical analysis
As mentioned in Section 2, the input $a$ in Algorithms 1 and 2 allows users to detect changepoints of different sparsity levels. More precisely, for any $\theta \in \mathbb{R}^p$, we have by Lemma 15 that there exists a smallest $s(\theta) \in \{2^0, 2^1, \ldots, 2^{\lceil \log_2 p \rceil}\}$ such that the set
$$S(\theta) := \bigl\{j \in [p] : |\theta_j| \geq \vartheta / \sqrt{s(\theta) \log_2(2p)}\bigr\}$$
has cardinality at least $s(\theta)$. On the other hand, we also have $|S(\theta)| \leq s(\theta) \log_2(2p)$. We call $s(\theta)$ the effective sparsity of the vector $\theta$, and $S(\theta)$ its effective support. Intuitively, the sum of squares of the coordinates in the effective support of $\theta$ has the same order of magnitude as $\|\theta\|_2^2$, up to logarithmic factors. Moreover, if $\theta$ is an $s$-sparse vector in the sense that $\|\theta\|_0 \leq s$, then $s(\theta) \leq s$, and equality is attained when, for example, all non-zero coordinates have the same magnitude.
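The effective sparsity can be computed directly by scanning the dyadic grid. The following sketch does this under our reading of the threshold inside the effective support (the function name is ours):

```python
import numpy as np

def effective_sparsity(theta):
    """Smallest s in {1, 2, 4, ...} such that at least s coordinates satisfy
    |theta_j| >= ||theta||_2 / sqrt(s * log2(2p)); the threshold here is our
    reading of the definition of the effective support S(theta)."""
    theta = np.asarray(theta, dtype=float)
    p = len(theta)
    vartheta = np.linalg.norm(theta)
    s_max = 2 ** int(np.ceil(np.log2(p))) if p > 1 else 1
    s = 1
    while True:
        thresh = vartheta / np.sqrt(s * np.log2(2 * p))
        if np.sum(np.abs(theta) >= thresh) >= s or s >= s_max:
            return s   # Lemma 15 guarantees such an s on the dyadic grid
        s *= 2

# A 1-sparse vector has effective sparsity 1; for an s-sparse vector with
# equal-magnitude non-zero coordinates, s(theta) is at most s.
e1 = np.zeros(64); e1[0] = 3.0
theta = np.zeros(64); theta[:8] = 1.0
```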
In this section, we initially analyse the theoretical performance of Algorithm 2 for two different choices of $a$ in $S^{\mathrm{off}} = S^{\mathrm{off}}(a)$, namely $a = 0$ and $a = \sqrt{8 \log(p-1)}$. We then present our combined, adaptive procedure and its performance guarantees. Define $N^{\mathrm{diag}} := \inf\{n : S^{\mathrm{diag}}_n \geq T^{\mathrm{diag}}\}$ and $N^{\mathrm{off}} := \inf\{n : S^{\mathrm{off}}_n \geq T^{\mathrm{off}}\}$. Then the stopping time for our changepoint detection procedure is simply $N := N^{\mathrm{diag}} \wedge N^{\mathrm{off}}$.

Dense case
Here, we analyse the changepoint detection procedure $N = N(0)$, which, as we will see, is most suitable for detecting dense mean changes, in the sense that $s(\theta) \geq \sqrt{p}$ (though we do not assume this in our theory). In this case, when $p \geq 2$ and conditionally on $\tau^j_b$, the quantity $Q^j_b$ follows a chi-squared distribution with $p - 1$ degrees of freedom under the null, provided that $\tau^j_b$ is positive. Motivated by Laurent and Massart (2000, Lemma 1), we choose a threshold of the form
$$p - 1 + 2\sqrt{(p-1)\, T^{\mathrm{off}}} + 2\, T^{\mathrm{off}},$$
say, for some $T^{\mathrm{off}} > 0$.
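As a quick sanity check on a threshold of this form, the Laurent–Massart bound states that a $\chi^2_k$ variable exceeds $k + 2\sqrt{kt} + 2t$ with probability at most $e^{-t}$; the following sketch (with a function name of our own) verifies this by Monte Carlo for $k = p - 1$:

```python
import numpy as np

def chi2_threshold(p, T_off):
    """Threshold of the form p - 1 + 2*sqrt((p-1)*T_off) + 2*T_off; by
    Laurent and Massart (2000, Lemma 1), a chi-squared(p-1) random variable
    exceeds it with probability at most exp(-T_off)."""
    k = p - 1
    return k + 2 * np.sqrt(k * T_off) + 2 * T_off

rng = np.random.default_rng(1)
p, T_off = 100, 3.0
thresh = chi2_threshold(p, T_off)
samples = rng.chisquare(p - 1, size=20000)
tail = float(np.mean(samples >= thresh))   # bound: exp(-3), about 0.05
```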
The following theorem provides control of the patience of ocd′.
Theorem 1. Let $X_1, X_2, \ldots$ be generated according to $P_0$. For any $\gamma \geq 1$, with $a = 0$ and suitably chosen thresholds $T^{\mathrm{diag}}$ and $T^{\mathrm{off}}$ as inputs to Algorithm 2, the corresponding output $N$ satisfies $E_0(N) \geq \gamma$.

We note that either of the two statistics $S^{\mathrm{diag}}$ and $S^{\mathrm{off}}$ may trigger a false alarm under the null. The two threshold levels $T^{\mathrm{diag}}$ and $T^{\mathrm{off}}$ are chosen so that $E_0(N^{\mathrm{diag}})$ and $E_0(N^{\mathrm{off}})$ have comparable upper bounds.
Our next result controls the response delay of ocd′, in both the worst-case and average-case senses.
Theorem 2. Assume that $X_1, X_2, \ldots$ are generated according to $P_{z,\theta}$ for some $z$ and $\theta$ such that $\|\theta\|_2 = \vartheta \geq \beta > 0$, and that $\theta$ has effective sparsity $s := s(\theta)$. Then there exists a universal constant $C > 0$ such that the output $N$ from Algorithm 2, with the inputs given above, satisfies a worst-case response delay bound of order $\sqrt{p}/\vartheta^2$, up to poly-logarithmic factors. Furthermore, there exists $\beta_0(s) > 0$, depending only on $s$, such that for all $\beta \leq \beta_0(s)$, the output $N$ satisfies corresponding average-case response delay bounds, with separate statements for $s \geq 2$ and for $s = 1$.

Sparse case
We now assume that $p \geq 2$, and analyse the performance of $N = N\bigl(\sqrt{8\log(p-1)}\bigr)$; in other words, we choose $a = \sqrt{8\log(p-1)}$. This choice turns out to work particularly well when the vector of mean change is sparse, in the sense that $s(\theta) \leq \sqrt{p}$, though again we do not assume this in our theory. The motivation for this choice of $a$ comes from the fact that, for fixed $b$ and $j$, the normalised tail sums $\Lambda^{j',j}_b / \sqrt{\tau^j_b \vee 1}$, for $j' \neq j$, behave like independent standard normal random variables under the null. It is therefore natural to choose $a$ to be of the same order as the maximum of $p - 1$ independent and identically distributed $N(0,1)$ random variables. The declaration threshold $T^{\mathrm{off}}$ is determined based on Lemma 17. Theorem 3 below shows that, in the sparse case, the patience of our procedure is also guaranteed to be at least at the nominal level $\gamma > 0$. In addition, as in the dense case, we can also control the response delay of ocd′, in Theorem 4.

Theorem 3. Let $X_1, X_2, \ldots$ be generated according to $P_0$. For any $\gamma \geq 1$, let $(X_t)_{t \in \mathbb{N}}$, $\beta > 0$, $a = \sqrt{8\log(p-1)}$, $T^{\mathrm{diag}} = \log\{16 p \gamma \log_2(4p)\}$ and $T^{\mathrm{off}} = 8 \log\{16 p \gamma \log_2(2p)\}$ be the inputs of Algorithm 2, with corresponding output $N$. Then $E_0(N) \geq \gamma$.
Theorem 4. Assume that $X_1, X_2, \ldots$ are generated according to $P_{z,\theta}$ for some $z$ and $\theta$ such that $\|\theta\|_2 = \vartheta \geq \beta > 0$, and that $\theta$ has effective sparsity $s := s(\theta)$. Then there exists a universal constant $C > 0$ such that the output $N$ from Algorithm 2, with the inputs of Theorem 3, satisfies response delay bounds of order $s/\vartheta^2$, up to poly-logarithmic factors.

Comparing Theorems 2 and 4, we see that the thresholding induced by the non-zero choice of $a = \sqrt{8\log(p-1)}$ in Theorem 4 facilitates an improved dependence on the effective sparsity $s$ in the bound on the response delay whenever $s$ is of smaller order than $\sqrt{p}$.

Adaptive procedure
To adapt to different sparsity levels $s$, we can run ocd (or ocd′) with two values of $a$ simultaneously: we choose $a = a_{\mathrm{dense}} = 0$ to form the off-diagonal dense statistic $S^{\mathrm{off,d}} = S^{\mathrm{off}}(a_{\mathrm{dense}})$, and $a = a_{\mathrm{sparse}} = \sqrt{8\log(p-1)}$ to form the off-diagonal sparse statistic $S^{\mathrm{off,s}} = S^{\mathrm{off}}(a_{\mathrm{sparse}})$.
We recall that the diagonal statistic $S^{\mathrm{diag}}$ does not depend on the choice of $a$. For clarity, we redefine the three stopping times here: $N^{\mathrm{diag}} := \inf\{n : S^{\mathrm{diag}}_n \geq T^{\mathrm{diag}}\}$, $N^{\mathrm{off,d}} := \inf\{n : S^{\mathrm{off,d}}_n \geq T^{\mathrm{off,d}}\}$ and $N^{\mathrm{off,s}} := \inf\{n : S^{\mathrm{off,s}}_n \geq T^{\mathrm{off,s}}\}$, for appropriately chosen thresholds $T^{\mathrm{diag}}$, $T^{\mathrm{off,d}}$ and $T^{\mathrm{off,s}}$. The output of this adaptive procedure is then $N := N^{\mathrm{diag}} \wedge N^{\mathrm{off,d}} \wedge N^{\mathrm{off,s}}$. The following results provide patience and response delay guarantees for this adaptive procedure.
Comparing these two results with the corresponding theorems in Sections 3.1 and 3.2, we see that by choosing slightly more conservative thresholds, the adaptive procedure retains the nominal patience control while (up to constant factors) achieving the best of both worlds in terms of its response delay guarantees under different sparsity regimes.
To better understand the worst-case and average-case response delay bounds in Theorem 6, it is helpful to assume that $\vartheta / C_1 \leq \beta \leq \vartheta \leq C_1$ and $\log(\gamma/\beta) \leq C_2 \log p$ for some $C_1, C_2 > 0$. Under these additional assumptions, the result of Theorem 6 takes a simpler form, for some $C > 0$ depending only on $C_1$ and $C_2$.

Figure 2: Illustration of the dependence on sparsity of the worst-case and average-case response delays for the dense, sparse and adaptive versions of ocd′, as given by Theorems 2, 4 and 6.

In particular, the average-case response delay upper bound exhibits a phase transition when the effective sparsity level $s$ is of order $\sqrt{p}$, which is the boundary between the sparse and dense cases. Similar sparsity-related elbow effects have been observed in the minimax rates for estimating high-dimensional Gaussian means (Collier, Comminges and Tsybakov, 2017). On the other hand, we note that the quadratic dependence on $\vartheta$ in the denominator is known to be optimal in the case $p = 1$ (Lorden, 1971, Theorem 3). The different dependencies on sparsity of the worst-case and average-case response delays for the dense, sparse and adaptive versions of ocd′ are illustrated in Figure 2.

Numerical studies
In this section, we study the empirical performance of the ocd algorithm and compare it with other online changepoint detection methods. Recall that the (adaptive) ocd algorithm declares a change when any of the three statistics $S^{\mathrm{diag}}$, $S^{\mathrm{off,d}}$ and $S^{\mathrm{off,s}}$ exceeds its respective threshold $T^{\mathrm{diag}}$, $T^{\mathrm{off,d}}$ or $T^{\mathrm{off,s}}$. If a priori knowledge about the signal sparsity is available, it may be slightly preferable to use $N^{\mathrm{diag}} \wedge N^{\mathrm{off,d}}$ in the dense case, and $N^{\mathrm{diag}} \wedge N^{\mathrm{off,s}}$ in the sparse case, but for simplicity of exposition, we will focus on the adaptive version of our ocd procedure throughout the remainder of this section. While the threshold choices given in Theorem 5 guarantee that the patience of (adaptive) ocd will be at least at the nominal level, in practice they may be conservative. We therefore describe a scheme for the practical choice of thresholds in Section 4.1. Recall that, in order to form $S^{\mathrm{off,d}}$ and $S^{\mathrm{off,s}}$, two different entrywise hard thresholds for $A^{j',j}_b / \sqrt{t^j_b \vee 1}$ need to be specified. For $S^{\mathrm{off,d}}$, we choose $a = 0$ for both theoretical analysis and practical usage. For $S^{\mathrm{off,s}}$, the theoretical choice is $a = \sqrt{8\log(p-1)}$, but since this is also slightly conservative, the choice $a = \sqrt{2\log p}$ is used in our practical implementation of the algorithm, and in the numerical simulations below.

Practical choice of declaration thresholds
The purpose of this section is to introduce an alternative to using the theoretical thresholds $T^{\mathrm{diag}}$, $T^{\mathrm{off,d}}$ and $T^{\mathrm{off,s}}$ provided by Theorem 5, namely to determine the thresholds through Monte Carlo simulation. The basic idea is that, since the null distribution is known, we can simulate from it to determine the patience for any given choice of thresholds. A complicating issue is the fact that the choices of the three thresholds $T^{\mathrm{diag}}$, $T^{\mathrm{off,d}}$ and $T^{\mathrm{off,s}}$ are related, so that we may be able to achieve the same patience by increasing $T^{\mathrm{diag}}$ and decreasing $T^{\mathrm{off,d}}$, for example. To handle this, we first argue that the renewal nature of the processes involved means that, at least for moderately large thresholds, the times to exceedance for each of the three statistics $S^{\mathrm{diag}}$, $S^{\mathrm{off,d}}$ and $S^{\mathrm{off,s}}$ are approximately exponentially distributed. Evidence to support this is provided by Figure 3, where we present QQ-plots of $N^{\mathrm{diag}}/m(N^{\mathrm{diag}})$, $N^{\mathrm{off,d}}/m(N^{\mathrm{off,d}})$ and $N^{\mathrm{off,s}}/m(N^{\mathrm{off,s}})$, where each $m(N)$ is the empirical median of the corresponding statistic $N$ (divided by $\log 2$) over 200 repetitions.
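The normalisation behind these QQ-plots can be sketched as follows: since the median of an $\mathrm{Exp}(\lambda)$ distribution is $(\log 2)/\lambda$, dividing the empirical median by $\log 2$ estimates the mean, and the normalised run lengths can then be compared with standard exponential quantiles. The helper name `exp_qq_points` is ours:

```python
import numpy as np

def exp_qq_points(run_lengths):
    """Normalise run lengths by m(N) := median / log(2), an estimate of the
    mean under an exponential fit, and return (theoretical, empirical)
    quantile pairs for a QQ-plot against the standard exponential."""
    x = np.sort(np.asarray(run_lengths, dtype=float))
    m = np.median(x) / np.log(2)    # median of Exp(rate) is log(2)/rate
    n = len(x)
    theo = -np.log1p(-(np.arange(1, n + 1) - 0.5) / n)  # Exp(1) quantiles
    return theo, x / m

# For genuinely exponential run lengths, the points should hug the diagonal.
rng = np.random.default_rng(4)
theo, emp = exp_qq_points(rng.exponential(5000.0, size=2000))
```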
We can therefore set an individual Monte Carlo threshold for $S^{\mathrm{diag}}$ as follows (the other two statistics can be handled in identical fashion): for $r \in [B]$, simulate an independent data stream of length $\lceil \gamma \rceil$ under the null, record $V^{(r)} := \max_{n \leq \lceil \gamma \rceil} S^{\mathrm{diag}}_n$ for that stream, and take $T^{\mathrm{diag}}$ to be the $(1/e)$th quantile of $\{V^{(r)} : r \in [B]\}$. The rationale for the final step here is that if $P_0(V^{(1)} < T^{\mathrm{diag}}) = 1/e$, then $P_0(\tilde{N}^{\mathrm{diag}} > \gamma) = 1/e$, where $\tilde{N}^{\mathrm{diag}} := \min\{n : S^{\mathrm{diag}}_n \geq T^{\mathrm{diag}}\}$. Thus, under an exponential distribution for $\tilde{N}^{\mathrm{diag}}$, we have that $\tilde{N}^{\mathrm{diag}}$ has individual patience $\gamma$.
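The individual calibration step can be sketched generically as follows; here `stat_max_fn` stands in for a routine returning the running maximum of one of the statistics over a simulated null stream (this interface is our illustration, not that of the ocd package):

```python
import numpy as np

def mc_threshold(stat_max_fn, gamma, B, rng):
    """Monte Carlo calibration of an individual declaration threshold:
    simulate B independent null streams of length ceil(gamma), record the
    running maximum of the statistic on each, and return the (1/e)th
    quantile of these maxima."""
    n = int(np.ceil(gamma))
    V = np.array([stat_max_fn(n, rng) for _ in range(B)])
    return float(np.quantile(V, 1 / np.e))

# Toy stand-in for S^diag: the running maximum of a single N(0,1) stream.
rng = np.random.default_rng(2)
T = mc_threshold(lambda n, r: r.standard_normal(n).max(),
                 gamma=500, B=200, rng=rng)
```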
Having determined appropriate thresholds $T^{\mathrm{diag}}$, $T^{\mathrm{off,d}}$ and $T^{\mathrm{off,s}}$, we can then use similar ideas to set a suitable combined scaling factor $T^{\mathrm{comb}}$. In particular, we also argue that $N^{\mathrm{diag}} \wedge N^{\mathrm{off,d}} \wedge N^{\mathrm{off,s}}$ has an approximate exponential distribution; see Figure 3 for supporting evidence. We therefore proceed as follows: for $r \in [B]$, simulate an independent data stream of length $\lceil \gamma \rceil$ under the null, record the maximum $W^{(r)}$ over $n \leq \lceil \gamma \rceil$ of the three statistics, each normalised by its individual threshold, and take $T^{\mathrm{comb}}$ to be the $(1/e)$th quantile of $\{W^{(r)} : r \in [B]\}$. Similarly to before, our reasoning here is that if $P_0(W^{(1)} < T^{\mathrm{comb}}) = 1/e$, then the combined procedure with thresholds scaled by $T^{\mathrm{comb}}$ satisfies $P_0(N^{\mathrm{diag}} \wedge N^{\mathrm{off,d}} \wedge N^{\mathrm{off,s}} > \gamma) = 1/e$. Thus, under an exponential distribution for $N^{\mathrm{diag}} \wedge N^{\mathrm{off,d}} \wedge N^{\mathrm{off,s}}$, it again has the desired nominal patience. Our practical thresholds, therefore, are $T^{\mathrm{diag}} T^{\mathrm{comb}}$, $T^{\mathrm{off,d}} T^{\mathrm{comb}}$ and $T^{\mathrm{off,s}} T^{\mathrm{comb}}$ for $S^{\mathrm{diag}}$, $S^{\mathrm{off,d}}$ and $S^{\mathrm{off,s}}$ respectively. Table 1 confirms that, with these choices of Monte Carlo thresholds, the patience of the adaptive ocd algorithm remains at approximately the desired nominal level.

Table 1: Estimated run lengths under the null using the Monte Carlo thresholds described in Section 4.1, over 500 repetitions, with desired patience level $\gamma = 5000$. The algorithm is terminated after 20000 data points in each repetition. Each reported value is the average run length taken over those repetitions that declared prior to time 20000. For reference, $E(X \mid X < 20000) \approx 4626.9$ when $X \sim \mathrm{Exp}(1/5000)$.

Table 2: Response delays of $N^{\mathrm{diag}}$, $N^{\mathrm{off,d}}$ and $N^{\mathrm{off,s}}$, and the response delay of the combined declaration time $N$ for ocd, with the percentage of repetitions on which each statistic triggers the declaration first (or equal first) shown in parentheses. The quickest response in each setting is given in bold. Other parameters: $p = 100$, $\gamma = 5000$, $z = 0$ and $\theta = \vartheta U$, where the distribution of $U$ is described in Section 4.2.

Numerical performance of ocd
In this section, we study the empirical performance of ocd. As shown in Figure 1, under the alternative, all three statistics $S^{\mathrm{diag}}$, $S^{\mathrm{off,d}}$ and $S^{\mathrm{off,s}}$ in ocd can be the first to trigger a declaration that a mean change has occurred. We thus examine different settings under which each of these three statistics can be the quickest to react to a change. Our simulations were run for $p = 100$, $s \in \{1, p^{1/2}, p\}$, $z \in \{0, 1000\}$, $\gamma = 5000$, $\vartheta \in \{1, 0.5, 0.25\}$ and $\beta \in \{\vartheta, 4\vartheta, \vartheta/4\}$. In all cases, $\theta$ was generated as $\vartheta U$, where $U$ is uniformly distributed on the union of all $s$-sparse unit spheres in $\mathbb{R}^p$. By this, we mean that we first generate a uniformly random subset $S$ of $[p]$ of cardinality $s$, and then set $U := Z/\|Z\|_2$, where $Z = (Z_1, \ldots, Z_p)$ has independent components satisfying $Z_j \sim N(0, 1)\mathbb{1}_{\{j \in S\}}$. Instead of terminating the ocd procedure once one of the three statistics declares a change (as we would in practice), we run the procedure until all three statistics have exceeded their respective thresholds. Tables 2 and 3 summarise the performance of the three statistics for $z = 0$; simulation results for $z = 1000$ were similar, and are therefore not included here.
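The sampling scheme for $\theta$ described above can be sketched as a minimal implementation of the $s$-sparse unit-sphere construction:

```python
import numpy as np

def sparse_unit_direction(p, s, rng):
    """Draw U uniformly from the union of s-sparse unit spheres in R^p:
    pick a uniformly random support S of cardinality s, fill it with
    independent N(0,1) variables, zero the rest, and normalise to unit
    Euclidean norm."""
    S = rng.choice(p, size=s, replace=False)
    Z = np.zeros(p)
    Z[S] = rng.standard_normal(s)
    return Z / np.linalg.norm(Z)

rng = np.random.default_rng(1)
theta = 0.5 * sparse_unit_direction(p=100, s=10, rng=rng)  # vartheta = 0.5
```

By construction, $\|\theta\|_2 = \vartheta$ and $\theta$ has at most $s$ non-zero coordinates.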
We first discuss the case where β is correctly specified (Table 2). When the sparsity s is small or moderate and ϑ is small, the diagonal statistic $S^{\mathrm{diag}}$ is likely to be the first to declare a change. The response delay of $S^{\mathrm{diag}}$ increases with s, which means that the off-diagonal sparse statistic $S^{\mathrm{off,s}}$ typically reacts quickest to a change when s is moderate to large and ϑ is not too small. On the other hand, the stopping time $N^{\mathrm{off,d}}$, which is driven by the off-diagonal dense statistic, is not significantly affected by s (in agreement with our average-case bound in Theorem 2), and is usually the dominant statistic when the signal is dense. A further observation is that the three individual response delays, as well as the combined response delay, are all approximately proportional to $\vartheta^{-2}$, a phenomenon that is supported by Theorem 6.
Table 3 presents corresponding results when β is both over- and under-specified. We note that both $N^{\mathrm{off,d}}$ and $N^{\mathrm{off,s}}$ are almost unaffected by either type of misspecification. For $N^{\mathrm{diag}}$, a mild over-specification of β helps it to react faster, while an under-specification causes it to have increased response delay. However, since we also observe that $N^{\mathrm{diag}}$ rarely declares first by a large margin, the performance of ocd is highly robust to misspecification of β, especially when s is not too small.

Table 3: Estimated response delays over 200 repetitions for $N^{\mathrm{diag}}$, $N^{\mathrm{off,d}}$ and $N^{\mathrm{off,s}}$ and the response delay of the combined declaration time N for ocd, in settings where β is both over- and under-specified. The quickest response in each setting is given in bold. Other parameters: p = 100, γ = 5000, z = 0 and θ = ϑU, where the distribution of U is described in Section 4.2.

Comparison with other methods
We now compare our adaptive ocd algorithm with other online changepoint detection algorithms proposed in the literature, namely those of Mei (2010), Xie and Siegmund (2013) and Chan (2017). Since we were unable to find publicly-available implementations of any of these algorithms, we briefly describe below their methodology and the small adaptations that we made in order to allow them to be used in our settings. Mei (2010) assumes knowledge of θ and, on observing each new data point, aggregates likelihood ratio tests in each coordinate of the null $N(0,1)$ against an alternative of $N(\theta_j, 1)$ in the $j$th coordinate. More precisely, in the notation of (2), the algorithm declares a change when either $\sum_{j \in [p]} R^j_{n,\theta_j}$ or $\max_{j \in [p]} R^j_{n,\theta_j}$ exceeds a given threshold. In our setting, where we do not know θ and only assume that $\|\theta\|_2 \ge \beta$, we replace $R^j_{n,\theta_j}$ in these two statistics with $\max_{b \in \mathcal{B} \cup \mathcal{B}_0} R^j_{n,b}$. The algorithms of Xie and Siegmund (2013) and Chan (2017) have a similar flavour. The idea is to test the null $N_p(0, I_p)$ distribution against an alternative where the $j$th coordinate has a $(1 - p_0)N(0,1) + p_0 N(\mu_j, 1)$ mixture distribution, for some known $p_0 \in [0,1]$ and unknown $\mu_j \in \mathbb{R}$. After specifying a window size $w$, both algorithms search for the strongest evidence against the null from the past $r \in [w]$ observations. Specifically, writing $Z^j_{n,r} := r^{-1/2}\sum_{i=n-r+1}^{n} X^j_i$ for $n \in \mathbb{N}$, $r \in [n]$ and $j \in [p]$, the test statistics, denoted $S^+_{\mathrm{XS,C}}(p_0, \lambda, \kappa, w)$, aggregate mixture log-likelihood ratios based on $Z^j_{n,r} \vee 0$ over coordinates and window lengths, where Xie and Siegmund (2013) take $(\lambda, \kappa, w) = (1, 2, 200)$ and Chan (2017) takes $(\kappa, w) = (4, 200)$. Since such a test statistic is only effective when $\sum_{j \in [p]} (\mu_j \vee 0)^2$ is large, we considered statistics of the form $S^+_{\mathrm{XS,C}}(p_0, \lambda, \kappa, w) \vee S^-_{\mathrm{XS,C}}(p_0, \lambda, \kappa, w)$, where $S^-_{\mathrm{XS,C}}(p_0, \lambda, \kappa, w)$ replaces the exponent $Z^j_{n,r} \vee 0$ with $Z^j_{n,r} \wedge 0$.
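A minimal sketch of the adapted Mei (2010) aggregation described above, assuming a small hypothetical finite grid of drift values $b$ standing in for $\mathcal{B} \cup \mathcal{B}_0$; only the one-sided CUSUM recursion and the two aggregations are shown, not the full procedure:

```python
import numpy as np

def mei_statistics(X, B_grid):
    """Adapted Mei-style statistics: for each coordinate j and each drift b
    in B_grid, run the one-sided CUSUM recursion
        R^j_{n,b} = max(R^j_{n-1,b} + b*(X^j_n - b/2), 0),
    take max over b within each coordinate, then aggregate over coordinates
    by a sum and by a maximum.  Returns both aggregate trajectories."""
    n_obs, p = X.shape
    R = np.zeros((len(B_grid), p))
    sums, maxes = np.empty(n_obs), np.empty(n_obs)
    for n in range(n_obs):
        for i, b in enumerate(B_grid):
            R[i] = np.maximum(R[i] + b * (X[n] - b / 2), 0.0)
        per_coord = R.max(axis=0)            # max_b R^j_{n,b}
        sums[n], maxes[n] = per_coord.sum(), per_coord.max()
    return sums, maxes

rng = np.random.default_rng(2)
X = rng.standard_normal((100, 20))           # null data, p = 20
S_sum, S_max = mei_statistics(X, B_grid=[0.25, 0.5, 1.0])
```

A declaration would be triggered when either trajectory exceeds its Monte Carlo threshold.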
An adaptive choice of $p_0$ is not provided by the authors, but the choices $p_0 \in \{1/\sqrt{p}, 0.1, 1\}$ have been considered; we found the choice $p_0 = 1/\sqrt{p}$ to be the most competitive overall, so for simplicity of exposition, we present only that choice in our results.
For each of the Mei (2010), Xie and Siegmund (2013) and Chan (2017) algorithms, we determined appropriate thresholds using Monte Carlo simulation, as suggested by the authors, and in a similar fashion to the way in which we set the ocd thresholds as described in Section 4.1.This guarantees that the algorithms have approximately the nominal patience, and so allows us to compare the methods by means of the response delay.
Table 4 displays the response delays for the ocd algorithm, as well as for the alternative methods described above, for $p \in \{100, 2000\}$, $s \in \{5, \sqrt{p}, p\}$ and $\vartheta \in \{1, 0.5, 0.25\}$. We also ran simulations for $p = 1000$, $s \in \{1, p/2\}$ and $\vartheta \in \{2, 0.125\}$, but the results are qualitatively similar and are therefore omitted. Overall, the results reveal that ocd performs very well in comparison with existing methods across a wide range of scenarios; in several cases it is by far the most responsive procedure, and in none of the settings considered is it outperformed by much. The Xie and Siegmund (2013) and Chan (2017) algorithms perform similarly to each other, and in most settings are both more competitive than the Mei (2010) method described above. We note that the performance of the Xie and Siegmund (2013) and Chan (2017) algorithms is relatively better when the signal-to-noise ratio is high; in these scenarios, the default window size w = 200 is large enough that sufficient evidence against the null can typically be accumulated within the moving window. For lower signal-to-noise ratios, this ceases to be the case, and from time z + w onwards, the test statistic has the same marginal distribution (with no positive drift). This explains the relative deterioration in performance of those algorithms in the harder settings considered. As mentioned in the introduction, if the change in mean were known to be small, then the window size could be increased to compensate, but at additional computational expense; a further advantage of ocd, then, is that its computational time depends only on p (and not on β or other problem parameters).

Table 4: Estimated response delays for ocd and the methods of Mei (2010) (Mei), Xie and Siegmund (2013) (XS) and Chan (2017) (Chan) over 200 repetitions, with z = 0, γ = 5000 and θ generated as described in Section 4.2. The smallest response delay is given in bold.
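The moving-window quantities $Z^j_{n,r}$ underlying the Xie and Siegmund (2013) and Chan (2017) procedures can be sketched as follows; this computes only the standardised tail sums over a window of length $w$, not the full mixture statistic:

```python
import numpy as np

def windowed_Z(X, n, w):
    """Z^j_{n,r} = r^{-1/2} * sum_{i=n-r+1}^n X^j_i for r = 1, ..., min(w, n).

    X has shape (n_obs, p); n is a 1-based time index.  Returns an array of
    shape (min(w, n), p) whose rows are indexed by the window length r."""
    w = min(w, n)
    tail = X[n - w:n]                        # the last w observations
    csum = np.cumsum(tail[::-1], axis=0)     # sums over the last r points
    r = np.arange(1, w + 1)[:, None]
    return csum / np.sqrt(r)                 # standardise by r^{-1/2}

rng = np.random.default_rng(3)
X = rng.standard_normal((300, 10))
Z = windowed_Z(X, n=300, w=50)               # row r-1 holds (Z^j_{300,r})_j
```

Recomputing all $w$ rows at each time step is what makes the per-observation cost of these methods grow with the window size, in contrast to the recursive updates used by ocd.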
(a) By (5) and a union bound, we have … Recall that under the null, … Thus, we have by Laurent and Massart (2000, Lemma 1) that for all $n \in [m]$, $j \in [p]$ and $b \in \mathcal{B}$, … Combining (13) and (14), we have … To study the distribution of H, consider the one-sided sequential probability ratio test of … with log-boundaries $T^{\mathrm{diag}}$ and $-\infty$. The associated stopping time for this test is … Since … is a Markov process that renews itself every time it hits 0, H follows a geometric distribution with success probability …, where the last inequality follows from Lemma 12. Consequently, … As the above inequality holds for all $j \in [p]$ and $b \in \mathcal{B} \cup \mathcal{B}_0$, we have …, as desired, where in the penultimate inequality we twice used the fact that …

The proof of Theorem 2 is quite involved. We first define some relevant quantities, and then state and prove some preliminary results. For $\theta \in \mathbb{R}^p$ with effective sparsity $s(\theta)$, there can be at most one coordinate of θ of magnitude larger than $\vartheta/\sqrt{2}$, so there exists $b_* \in \{\beta/\sqrt{s(\theta)\log_2(2p)},\, -\beta/\sqrt{s(\theta)\log_2(2p)}\}$ such that the associated set $\mathcal{J}$ has cardinality at least $s(\theta)/2$ (note that the condition $\theta_j/b_* \ge 1$ in the definition of $\mathcal{J}$ ensures that $\{\theta_j : j \in \mathcal{J}\}$ all have the same sign as $b_*$). Both $b_*$ and $\mathcal{J}$ can be chosen as functions of θ. Now, given any sequence $X_1, X_2, \ldots \in \mathbb{R}^p$ and $\theta \in \mathbb{R}^p$, define for any $\alpha \in (0, 1]$ the function
$$q(\alpha) = q(\alpha; X_1, \ldots, X_z, \theta) := \inf\bigl\{y \in \mathbb{R} : |\{j \in \mathcal{J} : t^j_{z,b_*} \le y\}| \ge \alpha|\mathcal{J}|\bigr\},$$
where $t^j_{z,b_*}$ is obtained by running Algorithm 2 up to time z with $a = 0$ and $T^{\mathrm{diag}} = T^{\mathrm{off}} = \infty$. In other words, $q(\alpha)$ is the empirical α-quantile of the tail lengths $(t^j_{z,b_*} : j \in \mathcal{J})$ when we run the algorithm without declaring any change up to time z. Recall the definition of the function ψ in (5).
Proposition 7. Assume that $X_1, X_2, \ldots$ are generated according to $P_{z,\theta}$ for some z and θ such that $\|\theta\|_2 = \vartheta \ge \beta > 0$, and that θ has effective sparsity $s := s(\theta) \ge 2$. Then the output N from Algorithm 2, with the input specified there, satisfies the bound (19) for any $\alpha \in (0, 1]$.

Proof. Since the bound in (19) is positive, we may, throughout the proof and for arbitrary $z \in \mathbb{N}$, restrict attention to realisations $X_1 = x_1, \ldots, X_z = x_z$ for which we have not declared a change by time z; in other words, $N > z$. This restriction also ensures that $q(\alpha)$, defined in (18) by setting the thresholds to infinity, is indeed the empirical α-quantile of the tail lengths $(t^j_{z,b_*} : j \in \mathcal{J})$ at the changepoint. Denote $\mathcal{J}_\alpha := \{j \in \mathcal{J} : t^j_{z,b_*} \le q(\alpha)\}$. We now fix some $r_0$ as in (20), and note that $r_0 > 3q(\alpha) \ge 3t^j_{z,b_*}$ for all $j \in \mathcal{J}_\alpha$. For $j \in \mathcal{J}_\alpha$, we define the event $\Omega^j_r := \bigl\{t^j_{z+\lceil r\rceil, b_*} > 2\lceil r\rceil/3\bigr\}$.
Under $P_{z,\theta}$, conditional on $X_1 = x_1, \ldots, X_z = x_z$, we know that $X_{z+1}, X_{z+2}, \ldots \overset{\mathrm{iid}}{\sim} N_p(\theta, I_p)$. Hence, using Lemma 11 and applying Lemma 13(b) to $t^j_{z+\lceil r\rceil, b_*} \wedge \lceil r\rceil$ for $j \in \mathcal{J}_\alpha$, we obtain … We now work on the event $\Omega^j_r$, for some $j \in \mathcal{J}_\alpha$. We note that (20) guarantees that $\lceil r\rceil \ge 2$, and thus $t^j_{z+\lceil r\rceil, b_*} \ge 2\lceil r\rceil/3 \ge 2$. Then, by Lemma 16 and the fact that …, we conclude that on the event $\Omega^j_r$, … Recall that $\Lambda^{\bullet,j}_{z+\lceil r\rceil, b_*} \in \mathbb{R}^p$ records the tail CUSUM statistics with tail length $\tau^j_{z+\lceil r\rceil, b_*}$. We observe by (22) that on $\Omega^j_r$, only post-change observations are included in $\Lambda^{\bullet,j}_{z+\lceil r\rceil, b_*}$. Hence, on the event $\Omega^j_r$, we have … for $k \in [p]\setminus\{j\}$. Therefore, on the event $\Omega^j_r$ and conditional on $\tau^j_{z+\lceil r\rceil, b_*}$ and $X_1 = x_1, \ldots, X_z = x_z$, the random variable … follows a non-central chi-squared distribution with $p - 1$ degrees of freedom and non-centrality parameter $\|\theta_{-j}\|_2^2\, \tau^j_{z+\lceil r\rceil, b_*}$. Since $j \in \mathcal{J}$ and $s \ge 2$, we observe by (17) and (22) that $\|\theta_{-j}\|_2^2\, \tau^j_{z+\lceil r\rceil, b_*} \ge \vartheta^2\lceil r\rceil/6$ on $\Omega^j_r$. Write … Then, by Birgé (2001, Lemma 8.1), we have … Combining (21) and (24), we deduce that …, where the last inequality uses (20). Therefore, we have …, where the penultimate inequality follows from the fact that $1 - \Phi(x) \le \frac{1}{2}e^{-x^2/2}$ for $x \ge 0$. The desired bound (19) follows by substituting in the expressions for $r_0$ and $b_*$.
The following two propositions control the residual tail length quantile term $q(\alpha)$ in (19), in the worst-case and average-case scenarios respectively.

Proposition 8. Let $X_1, X_2, \ldots$, z, θ, s, a, p and N be defined as in Proposition 7. On the event $\{N > z\}$, we have …

Proof. We will show the stronger result that on the event $\{N > z\}$, we have … We claim that … for all $n \in \{z - t^j_{z,b}, \ldots, z\}$. To see this, note that the claim holds when $n = z - t^j_{z,b}$, since the right-hand side of (26) is 0 by (25). Now, assume that (26) holds for some $n = m - 1$. Then, … This proves the claim by induction. In particular, on the event $\{N > z\}$, we have $T^{\mathrm{diag}} > R^j_{z,b/2} > b^2 t^j_{z,b}/8$, as desired.

Proposition 9. Let $X_1, X_2, \ldots$, z, θ, s, a, p and N be defined as in Proposition 7. There exists $\beta_0(s) > 0$, depending only on s, such that for all $\beta < \beta_0(s)$, we have …

Proof. Recall the definition of $b_*$ in (17). We may assume, without loss of generality, that $b_* = \beta/\sqrt{s\log_2(2p)}$ (the case $b_* = -\beta/\sqrt{s\log_2(2p)}$ can be proved in essentially the same way). We first prove the result for $s > 256$. Recall that … for $r \in \mathbb{N}$, and define $S^j_0 := \tilde{S}^j_0 := 0$.
Writing $\xi^j_0 := \operatorname{argmax}_{0 \le r \le \Delta b_*^{-2}} S^j_r$, $\xi^j := \operatorname{argmax}_{r \in \mathbb{N}_0} S^j_r$ and $\tilde{\xi}^j_0 := \operatorname{argmax}_{0 \le r \le \Delta b_*^{-2}} \tilde{S}^j_r$, where $\Delta := 8\log(16sb_*^{-2})$, we note that, like $t^j_{z,b_*}$, these three maxima are also uniquely attained almost surely (see the proof of Lemma 13). By construction, we have for each $j \in [p]$ that … Writing $q_\xi(\alpha) := \inf\{y : |\{j \in \mathcal{J} : \xi^j \le y\}| \ge \alpha|\mathcal{J}|\}$ for the empirical α-quantile of $(\xi^j : j \in \mathcal{J})$, it follows that $q(\alpha) \le q_\xi(\alpha)$, and so it suffices to control $\mathbb{E}\{q_\xi(s^{-1/2})\}$ instead of $\mathbb{E}\{q(s^{-1/2})\}$. To this end, we observe that … $16\Delta s$ and $\tilde{\xi}^j_0 \ge \xi^j_0$, and thus … For the first term on the right-hand side of (27), by Donsker's invariance principle and the continuity of the argmax map (see, e.g., van der Vaart and Wellner, 1996, Lemma 3.2.1 and Theorem 3.2.2), we have in the limit $\beta \to 0$ that $\Delta b_*^{-2} \to \infty$, and so …, where $(B_t)_{t \ge 0}$ denotes a standard Brownian motion. In particular, we can find $\beta_0(s) > 0$, depending only on s, such that for $\beta \le \beta_0(s)$ and $s > 256$, we have …, where in the second step we used the arcsine law for Brownian motion (see, e.g., Mörters and Peres, 2010, Theorem 5.26), and in the final step we used the fact that $4s^{-1/4} < 1$.
For the second term on the right-hand side of (27), we have by a union bound that … By reducing $\beta_0(s) > 0$ if necessary, we may ensure for all $\beta \le \beta_0(s)$ that … Since $\Delta = 8\log(16sb_*^{-2})$, we have … Substituting (28) and (29) into (27), we have, for all $j \in \mathcal{J}$, that … As a result, $|\{j \in \mathcal{J} : \ldots\}|$ is stochastically larger than a $\mathrm{Bin}(|\mathcal{J}|, s^{-1/4})$ random variable. Thus, for $s > 256$, we have …, where we have used Hoeffding's inequality and the fact that $|\mathcal{J}| \ge s/2$ in the last step. On the other hand, we have …, where we have used Lemma 14(b) in the second inequality and Lemma 13(d) (with $\Delta b_*^{-2}$ taking the role of c there) in the final inequality. As a result, …, where we have used in the final step the fact that $e^{-s^{1/2}/2} \le s^{-1/2}/100$ for $s > 256$. This proves the desired result for $s > 256$. Finally, for $s \le 256$, we have by Lemma 13(c) that, for $\beta < \sqrt{s}/2$, …, and the desired bound then follows.
We are now in a position to prove Theorem 2.
Proof of Theorem 2. The proof proceeds with different arguments for the case s ≥ 2 and the case s = 1.
…it follows by induction that $R^{j_*}_{z+n,b_*} \ge \tilde{R}_n$ for all $n \in \mathbb{N}_0$. Then, for $n \ge 4T^{\mathrm{diag}}/(b_*\theta_{j_*}) =: n_0$, we have … Therefore, … After substituting in the expressions for $b_*$, $\theta_{j_*}$ and $T^{\mathrm{diag}}$, we see that …, which proves both (6) and (8).

Proofs from Sections 3.2 and 3.3
Hence, it follows that …, as desired.
Proof of Theorem 4. We note that the case $s = 1$ in the proof of Theorem 2 does not rely on the off-diagonal statistics. Hence (30) remains valid here with $a = 8\log(p-1)$, and the last expression in (30) again proves the desired bound (9). For the case $s \ge 2$, we follow the proof of Proposition 7 exactly until (23), with the only exception that we now fix, instead of (20), … By the definition of the effective sparsity of θ, for a fixed $j \in \mathcal{J}_\alpha$, the set $L^j := \{k \in [p] : \ldots \sqrt{s\log_2(2p)} \ldots,\, k \ne j\}$ has cardinality at least $s - 1$. On the event $\Omega^j_r$, we have, by (22), that for all $k \in L^j$, … We then observe, by (32), that $\tilde{a}\lceil r\rceil \ge 32\log p > 2a$. (33) Now, from (23) we have on the event $\Omega^j_r$ that, for all $k \in L^j$, … We denote … Then, by the Chernoff–Hoeffding binomial tail bound (Hoeffding, 1963, Equation (2.1)), we have …, where the penultimate inequality follows from (33). Now, on the event $\Omega^j_r$, …, where the penultimate inequality uses the fact that $|L^j| \ge s - 1$ and the last inequality follows from (32). We now denote … Combining (21), (34) and (35), we deduce that …
Therefore, we have … Combining this with Proposition 8 (applied with α = 1) and substituting in the expression for $T^{\mathrm{off}}$, we have … for some universal constant $C > 0$, as desired.
Proof of Theorem 6. We observe that $\bar{\mathbb{E}}_{\mathrm{wc}}(N) \le \ldots$, and similarly for $\bar{\mathbb{E}}_\theta(N)$. The desired bounds (10), (11) and (12) are therefore direct consequences of Theorems 2 and 4 (note that the different constants in the thresholds only affect the value of the universal constant).
Proof. We prove the claim by induction on n. The base case n = 0 holds since, by definition, $R^j_{0,b} = 0$ and the sum on the right-hand side of (36) is empty. Assume that (36) holds for n = m − 1. Then, by the update procedure in Algorithm 2, we have …, and the desired result follows.
Lemma 11. For $n \in \mathbb{N}_0$, $b \in \mathcal{B} \cup \mathcal{B}_0$ and $j \in [p]$, let $t^j_{n,b}$ be defined as in Algorithm 2 and $R^j_{n,b}$ as in Lemma 10. Then …

Proof. We observe from the procedure in Algorithm 2 that $R^j_{n,b} = 0$ if and only if $t^j_{n,b} = 0$, and that $R^j_{n,b} > 0$ if and only if $t^j_{n,b} = t^j_{n-1,b} + 1$. Hence, … We now prove that
$$t^j_{n,b} = \operatorname{sargmax}_{0 \le h \le n} \sum_{i=n-h+1}^{n} b(X^j_i - b/2)$$
by induction on n. The base case n = 0 holds because $t^j_{0,b} = 0$ and the sum on the right-hand side of (37) is empty. Assume that the claim holds for n = m − 1. Then, by the inductive hypothesis and Lemma 10, …, and the desired result follows.
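The equivalence between the recursive form of $R^j_{n,b}$ (Lemma 10) and its tail-maximum form, together with the sargmax characterisation of $t^j_{n,b}$ (Lemma 11), can be checked numerically; the sketch below runs both forms on a single simulated coordinate:

```python
import numpy as np

rng = np.random.default_rng(4)
b, n_obs = 0.5, 200
x = rng.standard_normal(n_obs)
increments = b * (x - b / 2)

# Recursive form (Lemma 10): R_n = {R_{n-1} + b(x_n - b/2)} v 0, with the
# tail length t_n reset to 0 whenever R_n hits 0 and incremented otherwise.
R, t = 0.0, 0
for inc in increments:
    R += inc
    if R <= 0:
        R, t = 0.0, 0
    else:
        t += 1

# Direct tail form (Lemma 11):
#   R_n = max_{0<=h<=n} sum_{i=n-h+1}^n b(x_i - b/2),
# and t_n is the smallest maximiser (sargmax) of the tail sums.
tail_sums = np.concatenate([[0.0], np.cumsum(increments[::-1])])  # h = 0,...,n
R_direct = tail_sums.max()
t_direct = int(np.argmax(tail_sums))  # np.argmax returns the first maximiser
```

The recursion is what makes the statistic updatable in constant time per observation, while the tail-maximum form is what the theory analyses.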
For two distributions $P_0$ and $P_1$ on the same measurable space, the sequential probability ratio test of $H_0 : X_1, X_2, \ldots \overset{\mathrm{iid}}{\sim} P_0$ against $H_1 : X_1, X_2, \ldots \overset{\mathrm{iid}}{\sim} P_1$, with log-boundaries $a > 0$ and $b < 0$, is defined by the (extended) stopping time
$$N := \inf\Bigl\{n : \sum_{i=1}^{n} \log\{(\mathrm{d}P_1/\mathrm{d}P_0)(X_i)\} \notin (b, a)\Bigr\},$$
together with the decision rule after stopping that accepts $H_0$ if $\sum_{i=1}^{N} \log\{(\mathrm{d}P_1/\mathrm{d}P_0)(X_i)\} \le b$ and accepts $H_1$ if $\sum_{i=1}^{N} \log\{(\mathrm{d}P_1/\mathrm{d}P_0)(X_i)\} \ge a$.
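A minimal sketch of this test for the Gaussian mean alternative $P_1 = N(\mu_1, 1)$ against $P_0 = N(0, 1)$, where the log-likelihood ratio increment for observation $x_i$ is $\mu_1 x_i - \mu_1^2/2$:

```python
def sprt_gaussian(x, mu1, a, b):
    """Sequential probability ratio test of N(0,1) vs N(mu1,1) with
    log-boundaries a > 0 and b < 0.  Returns (stopping index, decision),
    or (None, None) if neither boundary is reached on the given data
    (the stopping time is an extended one and need not be finite)."""
    llr = 0.0
    for i, xi in enumerate(x, 1):
        llr += mu1 * xi - mu1**2 / 2   # log{(dP1/dP0)(x_i)}
        if llr >= a:
            return i, "H1"
        if llr <= b:
            return i, "H0"
    return None, None
```

The one-sided version used in Lemma 12 corresponds to taking the lower boundary $b = -\infty$.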
Lemma 12. Suppose that N is the stopping time associated with the one-sided sequential probability ratio test of $H_0 : X_1, X_2, \ldots \overset{\mathrm{iid}}{\sim} P_0$ against $H_1 : X_1, X_2, \ldots \overset{\mathrm{iid}}{\sim} P_1$, with log-boundaries $a > 0$ and $-\infty$. Then $\mathbb{P}_0(N < \infty) \le e^{-a}$.
Proof. Let $L_n := \prod_{i=1}^{n} (\mathrm{d}P_1/\mathrm{d}P_0)(X_i)$. On the event $\{N < \infty\}$, we have $L_N \ge e^a$. Therefore,
$$\mathbb{P}_0(N < \infty) = \mathbb{E}_0\bigl(\mathbb{1}_{\{N < \infty\}}\bigr) \le e^{-a}\,\mathbb{E}_0\bigl(L_N \mathbb{1}_{\{N < \infty\}}\bigr) \le e^{-a},$$
where the final inequality holds because $(L_n)_{n \in \mathbb{N}_0}$ is a non-negative martingale under $P_0$ with $\mathbb{E}_0(L_0) = 1$. This proves the desired result.
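The bound of Lemma 12 can be sanity-checked by Monte Carlo (a sketch, not part of the proof): with $P_0 = N(0,1)$ and $P_1 = N(\mu_1, 1)$, we estimate how often the log-likelihood ratio walk ever reaches the boundary $a$ under the null, within a long but finite horizon.

```python
import numpy as np

rng = np.random.default_rng(6)
mu1, a, horizon, reps = 1.0, 2.0, 2000, 2000

# Under P_0, the increments mu1*X_i - mu1^2/2 are N(-mu1^2/2, mu1^2), so the
# walk drifts downwards; Lemma 12 bounds the crossing probability by e^{-a}.
crossings = 0
for _ in range(reps):
    llr = np.cumsum(mu1 * rng.standard_normal(horizon) - mu1**2 / 2)
    if llr.max() >= a:
        crossings += 1
freq = crossings / reps
bound = np.exp(-a)
```

The empirical crossing frequency should sit below $e^{-a}$, up to Monte Carlo error.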
We collect in the following lemma some useful bounds on both the maximum and the argmax of a Gaussian random walk with a negative drift.
Proof. (a) Let $g(p)$ denote the left-hand side of (40). It suffices to prove that g is an increasing function on $(0, q]$. We may also assume that $x \ge 1$, since otherwise the result is clear. Now, let … We can therefore compute
$$g'(p) = \frac{h(p/q)h'(p) - h(p)h'(p/q)/q}{h(p/q)^2}.$$
We write $p(x) := (2\pi)^{-1/2}x^{-1/2}e^{-x/2}$ for the density of a $\chi^2_1$ distribution. For $\lambda \in (0, 1/4]$, we bound the moment generating function above as follows: $\ldots \int x^{1/2}e^{-x/4}\,\mathrm{d}x = \ldots$ We set $u = 6pe^{-a^2/8} + 4x$. If $x \le pe^{-a^2/8}/4$, choose $\lambda = p^{-1}xe^{a^2/8} \le 1/4$; if $x > pe^{-a^2/8}/4$, choose $\lambda = 1/4$. In both cases, we have …, as required.

For $d \in \mathbb{N}$, we write $[d] := \{1, \ldots, d\}$. Given $a, b \in \mathbb{R}$, we denote $a \vee b := \max(a, b)$ and $a \wedge b := \min(a, b)$. For a set S, we use $\mathbb{1}_S$ and $|S|$ to denote its indicator function and cardinality respectively. For a real-valued function f on a totally ordered set S, we write $\operatorname{sargmax}_{x \in S} f(x) := \min \operatorname{argmax}_{x \in S} f(x)$ for the smallest maximiser of f in S. For a new observation $X_n$, ocd updates $t_{n,b} \in \mathbb{R}^p$ and $A_{n,b} \in \mathbb{R}^{p \times p}$ for $O(\log(ep))$ different values of b.
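The first bound collected in this lemma, $\mathbb{P}(M \ge a) \le e^{-2ab}$ for the running maximum M of a Gaussian random walk with drift $-b$ (obtained via the Brownian embedding $\mathbb{P}\{\sup_{t \ge 0}(W_t - bt) \ge a\} = e^{-2ab}$), can be checked by Monte Carlo; this is a sanity-check sketch, not part of the proof:

```python
import numpy as np

rng = np.random.default_rng(5)
a, b, n_steps, reps = 2.0, 0.5, 2000, 2000

# Walk S_r = sum_{i<=r} (Y_i - b) with Y_i iid N(0,1): the discrete-time
# maximum is dominated by the supremum of the drifted Brownian motion, so
# P(max_r S_r >= a) <= exp(-2*a*b).
hits = 0
for _ in range(reps):
    steps = rng.standard_normal(n_steps) - b
    if np.cumsum(steps).max() >= a:
        hits += 1
freq = hits / reps
bound = np.exp(-2 * a * b)
```

Since the walk's maximum over integer times is at most the Brownian supremum, the empirical frequency should fall below the bound, up to Monte Carlo error.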


Figure 3: QQ-plots of standardised versions of $N^{\mathrm{diag}}$, $N^{\mathrm{off,d}}$ and $N^{\mathrm{off,s}}$, as well as $N = N^{\mathrm{diag}} \wedge N^{\mathrm{off,d}} \wedge N^{\mathrm{off,s}}$, against theoretical Exp(1) quantiles.
(b) For $j \in [p]$ and $b \in \mathcal{B} \cup \mathcal{B}_0$, denote $N^j_b := \inf\{n : R^j_{n,b} \ge T^{\mathrm{diag}}\}$, where $R^j_{n,b}$ is defined by (2). By Lemma 10, we have that $R^j_{n,b} = \{R^j_{n-1,b} + b(X^j_n - b/2)\} \vee 0$, and that this process is always non-negative. Then $N^{\mathrm{diag}} = \min\{N^j_b : j \in [p], b \in \mathcal{B} \cup \mathcal{B}_0\}$. Now, fix some $j \in [p]$ and $b \in \mathcal{B} \cup \mathcal{B}_0$. Define $U_0 := 0$ and $U_h := \inf\{n > U_{h-1} : R^j_{n,b} \notin (0, T^{\mathrm{diag}})\}$ for $h \in \mathbb{N}$, and let H … for $b \in \mathcal{B}$ and $j \in [p]$. The desired result then follows immediately by taking $b = b_*$ and restricting to the subset $\mathcal{J} \subseteq [p]$.

Fix $b \in \mathcal{B}$ and $j \in [p]$. Recall from (2) and Lemma 10 the definition of $R^j_{n,b}$ and the recursive relation $R^j_{n,b} = \{R^j_{n-1,b} + b(X^j_n - b/2)\} \vee 0$. By the update procedure for $t^j_{n,b}$ in Algorithm 2 and Lemma 11, we have …

Proof of Theorem 3. It suffices to prove only that $\mathbb{P}_0(N^{\mathrm{off}} \le m) \le 1/4$, since the rest of the proof is identical to that of Theorem 1. Since $\Lambda^{k,j}_b \mid \tau^j_b \overset{\mathrm{iid}}{\sim} N(0, \tau^j_b)$ for all $b \in \mathcal{B}$, $j \in [p]$ and $k \in [p]\setminus\{j\}$ under the null, by the fact that $T^{\mathrm{off}} \ge 12$ and Lemma 17, we have …

… $\sum X_i \sim N\bigl(-b(r_2 - r_1), r_2 - r_1\bigr)$. (a) Let $(W_t)_{t \ge 0}$ be a standard Brownian motion, so that $(S_r + rb)_{r \in \mathbb{N}} \overset{d}{=} (W_r)_{r \in \mathbb{N}}$. Thus $\mathbb{P}(M \ge a) \le \mathbb{P}\bigl\{\sup_{t \ge 0}(W_t - bt) \ge a\bigr\} = e^{-2ab}$, …

Table 2: Estimated response delays over 200 repetitions for $N^{\mathrm{diag}}$, $N^{\mathrm{off,d}}$ and $N^{\mathrm{off,s}}$ and the response delay of the combined declaration time N for ocd, with the percentages of repetitions on which each statistic triggers the declaration first (or equal first) shown in parentheses. The quickest response in each setting is given in bold. Other parameters: p = 100, γ = 5000, z = 0 and θ = ϑU, where the distribution of U is described in Section 4.2.