Seeded Binary Segmentation: A general methodology for fast and optimal change point detection

In recent years, there has been an increasing demand for efficient algorithms for large scale change point detection problems. To this end, we propose seeded binary segmentation, an approach relying on a deterministic construction of background intervals, called seeded intervals, in which single change points are searched for. The final selection of change points based on the candidates from seeded intervals can be done in various ways, adapted to the problem at hand. Thus, seeded binary segmentation is easy to adapt to a wide range of change point detection problems, be it univariate, multivariate or even high-dimensional. We consider the univariate Gaussian change in mean setup in detail. For this specific case we show that seeded binary segmentation leads to a near-linear time approach (i.e. linear up to a logarithmic factor) independent of the underlying number of change points. Furthermore, using appropriate selection methods, the methodology is shown to be asymptotically minimax optimal. While computationally more efficient, its finite sample estimation performance remains competitive with state-of-the-art procedures. Moreover, we illustrate the methodology for high-dimensional settings with an inverse covariance change point detection problem where our proposal leads to massive computational gains while still exhibiting good statistical performance.


Introduction and Motivation
Change point detection refers to the problem of estimating the location of abrupt structural changes in data that are ordered, e.g. in time, in space or along the genome. One can distinguish between online (sequential) and offline (retrospective) change point detection. We are interested in the latter setup, where one has a set of ordered observations and the data in the segments between the change points (also called break points) are assumed to come from some common underlying distribution. The goal is to estimate both the location and the number of the change points. Applications cover areas in biology (e.g. copy number variation in Olshen et al. (2004); or ion channels in Hotz et al. (2013)), finance (Kim et al., 2005), environmental science (e.g. climate data in Reeves et al. (2007); or environmental monitoring systems in Londschien et al. (2019)) and many more.
There are two common algorithmic approaches to change point detection. Modifications of dynamic programming (see e.g. Friedrich et al. (2008); Boysen et al. (2009); Killick et al. (2012)) aim to solve a global optimisation problem, which often leads to statistically efficient methods, even in a multiscale manner (Frick et al., 2014). Greedy procedures provide heuristic alternatives to these (see e.g. Fang et al. (2019)) and are often based on binary segmentation (Vostrikova, 1981). Plain binary segmentation is usually fast and easy to adapt to various change point detection scenarios, but it is not optimal in terms of statistical estimation. Fryzlewicz (2014) proposed wild binary segmentation and Baranowski et al. (2019) the similar narrowest over threshold method; both improve on statistical detection, but lose some of the computational efficiency of plain binary segmentation. Both approaches draw a high number of random search intervals and in each the best split point is found greedily, leading to a set of candidate split points. The two methods differ in the way the final change point estimates are selected from these candidates. The computational time for both essentially depends on the total length (and hence the number) of the drawn intervals. In frequent change point scenarios one should draw a higher number of random intervals, both from a theoretical and a practical perspective. The problems that arise are the following. In practice one may not know whether the number of drawn search intervals is sufficient, and in the worst case, when essentially all possible segments are drawn, the total length of the considered intervals is cubic in the number of observations. Hence, as one considers all possible split points within the drawn intervals, going through all of them can already result in a computational cost that is cubic (e.g. in univariate Gaussian change in mean setups), or possibly even worse if some complicated model fits are used for the investigated problem.
We propose seeded binary segmentation as a generic approach for fast and flexible change point detection in large scale problems. Our approach is similar to wild binary segmentation and the narrowest over threshold method in the sense that for many different search intervals the best split point for each is first determined greedily, in order to form a preliminary set of candidates out of which the final selection can be made. However, we propose a deterministic construction of search intervals which can be pre-computed. The resulting seeded intervals allow computational efficiency, i.e. lead to a total length for the search intervals that is linear up to a logarithmic factor (and thus we will refer to it as near-linear), independent of the underlying number of change points, while keeping the improvements in terms of statistical estimation compared to standard binary segmentation. Figure 1 illustrates the statistical efficiency vs. computational efficiency of seeded binary segmentation compared to wild binary segmentation for a univariate signal. It is remarkable that the estimation performance of the default option suggested for seeded binary segmentation is very close to what seems to be possible in this specific setup. Hence, our approach not only uses available resources more efficiently, but final results are also much more robust across the number of generated seeded intervals.

Figure 1: Estimation performance versus computational effort for the blocks signal (Donoho and Johnstone, 1994) with 5 · 2048 observations and 55 change points. The dots and squares represent the average and the error bars the range of obtained values across 50 simulations. Crosses correspond to the proposed default options for the methods. The dashed horizontal lines are drawn as a visual aid. Seeded binary segmentation was combined with a greedy selection and the final models were chosen using the same information criterion as for wild binary segmentation. Note the logarithmic scales on both axes.
More details on the blocks signal and the error measures can be found in section 4. Two settings where seeded binary segmentation is particularly useful in connection with ever larger data sets are the following. First, applications with very long univariate Gaussian or similar low-dimensional signals, with possibly hundreds of thousands of observations, have driven the emergence of computationally more efficient approaches. Based on dynamic programming, Killick et al. (2012) proposed the pruned exact linear time method, which returns the optimal partitioning in linear time in frequent change point scenarios when pruning works well, but remains of quadratic complexity in the worst case (with a few change points only). Maidstone et al. (2017) proposed the functional pruning optimal partitioning method and demonstrated faster empirical performance compared to the pruned exact linear time method in many scenarios. Fryzlewicz (2018) recently proposed wild binary segmentation 2 for the univariate Gaussian change in mean setup, a hybrid between classical binary segmentation and wild binary segmentation aiming to improve the computational performance of the latter. In all these setups, seeded binary segmentation leads to a competitive approach with guaranteed near-linear cost independent of the underlying number of change points, as opposed to the previously mentioned methods (including even plain binary segmentation), which can all have a quadratic computational complexity in certain method-specific adverse scenarios. Note that dynamic programming typically runs fast in frequent change point scenarios, while wild binary segmentation runs fast in the presence of only a few change points, but neither of them is guaranteed to be fast under all circumstances.
However, a second, and to us even more appealing, application of seeded binary segmentation is in setups where model fits are expensive. Several examples, especially in high-dimensional change point detection settings, are discussed at the beginning of section 5. The commonality among these setups is that they usually require computationally costly single fits, and thus it is essential for practical feasibility to keep the total length of the search intervals as small as possible. We will discuss this in more detail, using a concrete example, in the last section.
The structure of the paper is as follows. In the next section we introduce seeded binary segmentation in more detail, together with background and further discussion (e.g. details regarding computational times). For ease of explanation we do this in the classical setup of Gaussian variables with changing mean, for which section 3 contains theoretical results regarding consistency and localisation error. Section 4 provides simulation results which demonstrate competitive performance compared to other state-of-the-art estimation methods. As noted previously, the applicability of the seeded binary segmentation methodology is very wide and we illustrate this with a high-dimensional example in section 5.

On the way to Seeded Binary Segmentation
A well-studied problem, which we will analyse in detail as an illustrative example of the methodology, is the detection of shifts in the mean of independent univariate Gaussian variables with variance that stays constant over time. Recent proposals for this setup are for example those of Frick et al. (2014), Fryzlewicz (2014), Du et al. (2016), Li et al. (2016) and Fang et al. (2019); the references therein also provide an overview. The understanding gained for this piecewise constant Gaussian setup can often be generalised to other scenarios, e.g. to multivariate and dependent data, to different parametric as well as non-parametric models, or even to changes in a piecewise linear structure. Hence, while focusing on this classical model for illustrative purposes, we would like to emphasise again the much wider potential applicability of our methodology (which we will come back to in the last section).

The Gaussian setup and notation
We consider the model

X_t = f_t + ε_t, t = 1, . . . , T,    (1)

where f_t is a fixed one-dimensional, piecewise constant signal with an unknown number N of change points at unknown locations η_1 < · · · < η_N ∈ {2, . . . , T − 1}. The noise ε_t is assumed to be independent and identically distributed Gaussian with zero mean and variance σ². Without loss of generality, the variance can be assumed to be 1. The test statistic used for detection by binary segmentation and related procedures is the CUSUM statistic (Page, 1954), defined for a split point s within the segment (ℓ, r] as

T_(ℓ,r](s) = ( (r − s) / (n(s − ℓ)) )^{1/2} Σ_{t=ℓ+1}^{s} X_t − ( (s − ℓ) / (n(r − s)) )^{1/2} Σ_{t=s+1}^{r} X_t,    (2)

with integers 0 ≤ ℓ < r ≤ T and n = r − ℓ. The CUSUM statistic is the generalised likelihood ratio test for a single change point at location s in the interval (ℓ, r] against a constant signal. The value |T_(ℓ,r](s)| will be referred to as gain, because its square gives the reduction in squared errors in the segment (ℓ, r] when splitting at the point s and fitting a separate mean on the corresponding left and right segments. The location ŝ_(ℓ,r] of the maximal absolute CUSUM statistic is the best split point and thus a candidate when dividing the segment into two parts. Note that in scenarios other than the Gaussian setup from equation (1) one would only need to change the CUSUM statistic to one that is adapted to the problem at hand. Whether one declares ŝ_(ℓ,r] to be a change point usually depends on the corresponding maximal gain |T_(ℓ,r](ŝ_(ℓ,r])|. We treat the final selection of change points as a separate issue and will discuss specific ways to do so later on.
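To make the computations concrete, here is a small Python sketch (the function name is ours) that evaluates the CUSUM statistic for all splits of a segment at once via cumulative sums; it uses the mean-difference form sqrt(k(n − k)/n) (X̄_left − X̄_right) with k = s − ℓ, which is algebraically equivalent to the weighted-sum form of the statistic:

```python
import numpy as np

def max_cusum(x, l, r):
    """Best split of the segment (l, r] according to the CUSUM statistic.

    Uses the mean-difference form sqrt(k(n-k)/n) * (mean_left - mean_right)
    with k = s - l and n = r - l, computed for all splits simultaneously
    via cumulative sums in O(r - l) time.
    Returns (s_hat, gain) with gain = |T_(l,r](s_hat)|.
    """
    seg = np.asarray(x[l:r], dtype=float)
    n = r - l
    csum = np.cumsum(seg)
    k = np.arange(1, n)                              # sizes of the left part
    left_mean = csum[:-1] / k
    right_mean = (csum[-1] - csum[:-1]) / (n - k)
    t = np.sqrt(k * (n - k) / n) * (left_mean - right_mean)
    j = int(np.argmax(np.abs(t)))
    return l + 1 + j, float(abs(t[j]))
```

For a noiseless jump of height 1 in the middle of 100 observations, this returns the split at 50 with gain sqrt(50 · 50/100) = 5.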

Related methods based on binary segmentation
Binary segmentation (Vostrikova, 1981) first finds the best split ŝ_(0,T] on all data and, if |T_(0,T](ŝ_(0,T])| is above a pre-defined threshold, then searches for further split points in the left and right segments (0, ŝ_(0,T]] and (ŝ_(0,T], T]. The same iterative search is continued as long as the gains are still above the pre-defined threshold and there are sufficiently many observations in the resulting segments. The wild binary segmentation method (Fryzlewicz, 2014) and the narrowest over threshold method (Baranowski et al., 2019) both generate M random intervals I_m = (ℓ_m, r_m], m = 1, . . . , M, where the starting and end points are drawn independently, uniformly and with replacement from {0, 1, . . . , T}, possibly subject to some minimal length constraint to be able to evaluate the chosen test statistic. One determines candidate splits and corresponding maximal gains for each of the random intervals via the CUSUM statistic defined in equation (2). The two methods are slightly different beyond this point, in the way the final change point estimates are selected.
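For concreteness, the interval drawing step can be sketched as follows (Python; the function name and the rejection step used to impose the minimal length constraint are our choices):

```python
import random

def random_intervals(T, M, min_len=2, seed=0):
    """Draw M random search intervals (l, r] with endpoints sampled
    independently and uniformly from {0, ..., T}, as in wild binary
    segmentation and the narrowest over threshold method; draws that are
    too short are simply rejected and redrawn."""
    rng = random.Random(seed)
    out = []
    while len(out) < M:
        l, r = rng.randint(0, T), rng.randint(0, T)
        if l > r:
            l, r = r, l
        if r - l >= min_len:
            out.append((l, r))
    return out
```

Note that each such interval has expected length of order T, which is exactly why the total length of the drawn intervals, and hence the computational cost, grows quickly with M.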
Wild binary segmentation orders the candidates according to the respective maximal gains (i.e. values of the test statistic) and selects greedily, i.e. the split point with highest value is the first change point estimate. Then all intervals are eliminated from the list which contain this found change point. From the remaining candidates again the one with maximal gain is selected, followed by the elimination of all intervals that contain this second one. The procedure is repeated as long as the split with highest gain remains above a certain pre-defined threshold. A suitable threshold can be derived from asymptotic theory for example. Out of several possible segmentations corresponding to different thresholds, one can also select the one optimising a chosen information criterion (e.g. proposed by Zhang and Siegmund (2007)).
The narrowest over threshold method selects change points amongst the candidates that have a maximal gain above a pre-defined threshold. Amongst these candidates, the one having the shortest (i.e. narrowest) background interval is selected first. Then all intervals that contain this first estimate are eliminated and, amongst the remaining candidates above the pre-defined threshold, the one with the currently shortest background interval is selected next. The procedure is repeated until no maximal gain above the pre-defined threshold is left. It is also possible to repeat this procedure with differing thresholds and then take the segmentation optimising some information criterion. Note that the narrowest over threshold way of selecting the final estimates makes it applicable in more general scenarios, where the model is e.g. piecewise linear.
Elimination of the intervals in each step for both methods is necessary in order to avoid detecting the same change point at slightly different locations for several different background intervals. The two methods allow for detection in a larger class of signals compared to binary segmentation. The intuition behind this is that some of the random intervals will contain only a single change point sufficiently well separated from the boundaries and thus serve as good local backgrounds in which change points are easily detectable. As a comparison, binary segmentation is particularly bad in signals that exhibit many up and down jumps that cancel each other if a background contains several change points.
The main drawback of wild binary segmentation and the narrowest over threshold method is the increased computational cost. This grows linearly with the total length of the used random intervals (each random interval having O(T) expected length). In frequent change point scenarios one should draw a higher number of random intervals in order to ensure that there are still sufficiently many intervals with a single change point in them. In the worst case, when drawing essentially all possible intervals, the complexity is cubic, which is clearly not satisfactory for large scale change point detection problems. As mentioned previously, it is also an issue that one might not know how many change points to expect and thus how many intervals to draw. Fryzlewicz (2018) summarises these issues as the lack of computational adaptivity of wild binary segmentation and proposes a modification, called wild binary segmentation 2, which is a hybrid between wild and classical binary segmentation in the following sense. In order to speed up computations, he suggests drawing a smaller number M′ ≪ M of random intervals, localising the first candidate greedily, drawing again M′ random intervals in the resulting left and right segments, localising again, and repeating iteratively. From the resulting preliminary list of candidates the final estimates are again determined based on certain maximal gains. However, this proposal is fairly specific to the univariate Gaussian setup and might fail when adapted to high-dimensional or other more complicated scenarios, such as a piecewise linear model. An interesting two-stage procedure also targeting computational issues in the univariate Gaussian case was proposed by Lu et al. (2017). They obtain pilot estimates of change points on a sparse subsample and then refine these estimates in a second step in the respective neighbourhoods using dense sampling. This can reduce the computational cost to sub-linear, and is thus useful for extremely long time series.
However, note that computational issues in high-dimensional setups primarily arise from data sets being too 'wide' rather than too 'long'. Moreover, there is a price to pay in terms of detectable changes with this two-stage procedure, namely, it is statistically suboptimal.
Next, we propose our conceptually simple seeded binary segmentation method that has a number of advantages. To name a few, it leads to a near-linear time algorithm under all circumstances, independent of the number of change points, it simplifies theoretical analysis and has an applicability wider than the Gaussian scenario.

Seeded Binary Segmentation
The question to start with regarding the randomly generated intervals above is whether all of them are necessary. When the starting and end points are drawn uniformly at random, there will be many intervals with length of the order of the original series (see bottom of Fig. 2). While these long intervals are among the main drivers of computational time, they are often not really useful when there are many change points, as they then contain several change points. One should rather focus on shorter intervals. As an example, the total length of the random intervals at the bottom of Fig. 2 is approximately the same as for the seeded intervals shown at the top of Fig. 2. While the shown random intervals contain only a few short intervals and do not cover the short scales uniformly, the seeded intervals achieve this much better. Formally, we propose the following deterministic construction guaranteeing a good coverage of single change points.
Definition 1 (Seeded intervals) Let a ∈ [1/2, 1) denote a given decay parameter. For 1 ≤ k ≤ ⌈log_{1/a}(T)⌉ (i.e. logarithm with base 1/a) define the k-th layer as the collection of n_k intervals of initial length l_k that are evenly shifted by the deterministic shift s_k as follows:

I_k = ∪_{i=1}^{n_k} { ( ⌊(i − 1)s_k⌋ , ⌈(i − 1)s_k + l_k⌉ ] }, where n_k = 2⌈(1/a)^{k−1}⌉ − 1, l_k = T a^{k−1} and s_k = (T − l_k)/(n_k − 1)

(for k = 1 the single interval (0, T]). The overall collection of seeded intervals is I = ∪_k I_k.

Some technical remarks are due first. Due to the rounding to integer grid values, the effective lengths and shifts in practice differ slightly from l_k and s_k. Moreover, some intervals might appear multiple times on small scales (depending on the choice of a). Such redundant ones can be eliminated. Using some rough bounds, one easily obtains a near-linear bound for the total length of all the intervals contained in I: the total length within each layer is bounded by a small constant multiple of T, so the total length of I is of order O(T log_{1/a}(T)). Thus, evaluating the CUSUM statistics for all of these intervals also has a computational cost of the order O(T log_{1/a}(T)). Note that one can simply stop at an earlier layer than the ⌈log_{1/a}(T)⌉-th one if a certain minimal segment length is desirable (e.g. in high-dimensional scenarios) or when prior knowledge on the true minimal segment length is available.
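A possible implementation of this construction is sketched below in Python, assuming layer sizes n_k = 2⌈(1/a)^{k−1}⌉ − 1, lengths l_k = Ta^{k−1} and shifts s_k = (T − l_k)/(n_k − 1); the function name, the exact rounding convention and the duplicate removal via a set are our choices:

```python
import math

def seeded_intervals(T, a=1 / math.sqrt(2), m=2):
    """Seeded intervals: layer k holds n_k = 2*ceil((1/a)^(k-1)) - 1
    intervals of length l_k = T * a^(k-1), with starts evenly shifted by
    s_k = (T - l_k) / (n_k - 1). Duplicates (possible on small scales)
    are removed; intervals covering fewer than m observations are dropped.
    """
    intervals = set()
    n_layers = math.ceil(math.log(T) / math.log(1 / a))
    for k in range(1, n_layers + 1):
        n_k = 2 * math.ceil((1 / a) ** (k - 1)) - 1
        l_k = T * a ** (k - 1)
        s_k = (T - l_k) / (n_k - 1) if n_k > 1 else 0.0
        for i in range(1, n_k + 1):
            lo = math.floor((i - 1) * s_k)
            hi = math.ceil((i - 1) * s_k + l_k)
            if hi - lo >= m:                 # minimal segment length
                intervals.add((lo, hi))
    return sorted(intervals)
```

For T = 100 and a = 1/2 the first layer is the single interval (0, 100] and the second layer consists of (0, 50], (25, 75] and (50, 100], i.e. the overlapping dyadic construction described below.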
While the lengths of intervals decay rapidly over the layers, the number of intervals is rapidly increasing as the scales become smaller. In particular, at the lowest layer the total number of intervals is of the same order as the total number of observations, i.e. for k = ⌈log_{1/a}(T)⌉, |I_k| is of the order O(T). This allows a good coverage even in frequent change point scenarios. Hence, in some sense, the number of intervals is automatically adapted to each scale. Figure 2 shows the first few layers of two collections of seeded intervals as well as some random intervals for a comparison. The key idea is best understood for a = 1/2 (top of Fig. 2). Here one obtains an overlapping dyadic partition. Each interval from the previous layer is split into a left, a right and an overlapping middle interval, each of half the size of the previous layer. The middle plot of Fig. 2 shows seeded intervals with a slower decay, namely a = (1/2)^{1/2}. Here the solid lines in layers I_1, I_3 and I_5 show intervals that were also included in the interval system for a = 1/2 (top), while the dashed lines are additionally obtained intervals. These intervals on the intermediate layers I_2 and I_4 are still evenly spread. In particular, a sufficiently big overlap amongst the consecutive intervals within the same layer is guaranteed. This is key, because one obtains for each change point background intervals with only that single change point in there and such that the change point is not too close to the boundaries, which facilitates detection. We recommend using a = (1/2)^{1/2} as the decay in practice, but as indicated above, adding intermediate layers is very easy when trying to achieve better finite sample performance in very challenging scenarios. Several related interval constructions were proposed by Guenther Walther together with various co-authors.
First proposed for the detection of spatial clusters (Walther, 2010) as well as for density estimation (Rufibach and Walther, 2010), these proposals were then gradually adapted towards more classical change point detection (or closely related problems) by Rivera and Walther (2013); Walther (2013, 2015). The difference between these various interval systems and seeded intervals mainly lies in the specific choices of intervals for each scale. Pein et al. (2017) also considered dyadic partitions but, as a major difference, without overlap and with no flexibility in the decay. Recently, yet another interval system was proposed by Chan and Chen (2017) in the context of bottom up approaches. However, these specific interval constructions appeared as tricks and side remarks rather than being developed into a stand-alone flexible methodology. We believe our construction is more intuitive, in particular in showing the trade-offs between computational time and statistical precision.
Algorithm 1 describes our seeded binary segmentation. The decay parameter a governs the trade-off between computational time and expected estimation performance. Limiting the minimal number of observations m is reasonable e.g. in high-dimensional scenarios or if prior information on the minimal true segment lengths is available. Choosing a larger value of m not only speeds up computations, but also reduces the risk of selecting false positives coming from very small intervals during the final selection. We find it important to disentangle the construction of search intervals from the final selection, as various options with different advantages and disadvantages arise. Some possible selection methods are discussed in section 2.4.

Require: a decay parameter a ∈ [1/2, 1), a minimal segment length m ≥ 2 and a selection method.
Create a collection I of seeded intervals with decay a from Definition 1 which cover ≥ m observations.
for i = 1 to |I| do
    Take the i-th interval in I and denote its boundaries by ℓ and r.
    Calculate the CUSUM statistics T_(ℓ,r](s) from equation (2) for s = ℓ + 1, . . . , r − 1.
end for
Apply the chosen selection method to T_(ℓ,r](·), (ℓ, r] ∈ I to output the final change point estimates.

Algorithm 1: Seeded Binary Segmentation
Overall, the key components for our seeded interval construction are 1) a sufficiently fast decay such that the total length of all intervals is near-linear, 2) choosing a larger number of intervals on the small scales and thus suitably adapting for the possibility that there can be many more small segments than large ones, and last, but not least 3) guaranteeing a sufficient overlap amongst intervals on the same layer. The choice of a can be interpreted as a trade-off between computational time and expected performance.

Selection methods
While mentioned earlier, we recall some possible selection methods here. Greedy selection takes the location of the maximal CUSUM value (max_{(ℓ,r]∈I} |T_(ℓ,r](ŝ_(ℓ,r])|) as the first change point candidate, then eliminates from I all intervals that contain this candidate, and repeats these two steps as long as there are some intervals left. From the list of these candidates, all above a certain threshold (typically some constant times log^{1/2} T) are kept. Alternatively, one can also investigate various thresholds, each leading to a different segmentation, and at the end select the segmentation that optimises some information criterion.
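This greedy selection can be sketched as follows, operating on a precomputed candidate list with one tuple (ℓ, r, ŝ, gain) per search interval; the tuple format and the naive linear-time containment check are our choices, made for clarity rather than efficiency:

```python
def greedy_selection(candidates, threshold):
    """Greedy selection on precomputed candidates.

    candidates: list of tuples (l, r, s_hat, gain), one per search interval,
    where s_hat is the best split of (l, r] and gain = |T_(l,r](s_hat)|.
    Returns the selected change points in increasing order.
    """
    found = []
    # process candidates by decreasing gain; stop once below the threshold
    for l, r, s, gain in sorted(candidates, key=lambda c: -c[3]):
        if gain <= threshold:
            break
        # skip intervals containing an already found change point
        # (here: strictly inside; boundary conventions may differ)
        if all(not (l < cp < r) for cp in found):
            found.append(s)
    return sorted(found)
```

For example, with candidates [(0, 100, 50, 5.0), (0, 50, 25, 1.0), (40, 60, 50, 4.0), (60, 90, 75, 3.0)] and threshold 2.0, the split at 50 is taken first, the interval (40, 60] is then eliminated because it contains 50, the split at 75 is taken next, and (0, 50] falls below the threshold, giving [50, 75].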
The narrowest over threshold selection takes all intervals (ℓ, r] ∈ I for which |T_(ℓ,r](ŝ_(ℓ,r])| is above a pre-defined threshold (typically some constant times log^{1/2} T). Out of these it takes the shortest (narrowest) interval (ℓ*, r*], i.e. the one for which r* − ℓ* ≤ r − ℓ for all intervals (ℓ, r] with maximal CUSUM value above the threshold. Then it eliminates all intervals that contain the found change point candidate ŝ_(ℓ*,r*], and repeats these two steps until there are no intervals left with a maximal CUSUM statistic above the pre-defined threshold. While raising the threshold tends to result in fewer change points, due to the narrowest selection followed by the elimination step it can also happen that one obtains more change points. Hence, in some scenarios this selection method can be somewhat unstable, as even similar thresholds can lead to segmentations differing both with respect to the number and the location of the boundaries. It is also possible to evaluate a set of different segmentations arising from different thresholds in terms of some information criterion and then pick the overall best.

Various other options, e.g. the steepest drop to low levels proposed by Fryzlewicz (2018), or selection thresholds guaranteeing false positive control either via theoretical approximation results (Fang et al., 2019) or via simulations (facilitated by the favourable O(T log T) scaling of the total length of seeded intervals), could be adapted as well. Besides the choice of a, the overall computational costs also depend on how complicated the final model selection is. Greedy selection, for example, is very fast, while full solution paths (using all possible thresholds) of the narrowest over threshold method can be slow. Whether this becomes an issue depends on the problem at hand. In high-dimensional setups the expensive model fits would typically dominate the cost of model selection anyway. To make this more formal, consider the following theorems.
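A sketch of the narrowest over threshold selection, operating on the same kind of precomputed candidate list of tuples (ℓ, r, ŝ, gain) per search interval (format and containment convention are our illustrative choices):

```python
def narrowest_over_threshold(candidates, threshold):
    """Narrowest over threshold selection on precomputed candidates.

    candidates: list of tuples (l, r, s_hat, gain). Among candidates with
    gain above the threshold, repeatedly pick the one with the shortest
    background interval whose interval contains no previously found
    change point. Returns the change points in increasing order.
    """
    found = []
    above = [c for c in candidates if c[3] > threshold]
    # process by increasing interval length (narrowest first)
    for l, r, s, gain in sorted(above, key=lambda c: c[1] - c[0]):
        if all(not (l < cp < r) for cp in found):
            found.append(s)
    return sorted(found)
```

On the example candidates [(0, 100, 50, 5.0), (0, 50, 25, 1.0), (40, 60, 50, 4.0), (60, 90, 75, 3.0)] with threshold 2.0, the narrowest qualifying interval (40, 60] is processed first, so the split at 50 is now found from a short background rather than from (0, 100], again yielding [50, 75].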
Theorem 1 Assume model (1) and that in seeded binary segmentation (i.e., Algorithm 1) we perform either the greedy selection or the narrowest over threshold selection, with an arbitrarily fixed single threshold. Then the computational complexity of seeded binary segmentation is O(T log T ) and the memory complexity is O(T ).
Proof 1 The algorithm consists of two steps: the calculation of the CUSUM statistics from (2) for all seeded intervals and the subsequent model selection. For the first step, the cost of calculating CUSUM statistics on all seeded intervals is proportional to their total length, i.e., O(T log T ).
In the second step, we only consider the candidate split points from seeded intervals with maximal CUSUM values above the given threshold, and sort them either according to their maximal gains (for the greedy selection) or their lengths (for the narrowest over threshold selection). Note that the number of intervals contained in a collection of seeded intervals is of the order O(T). Hence, the sorting procedure involves O(T log T) computations in the worst case. Then we check for each seeded interval in the sorted list whether it contains any previously found change point. Such a check can be done in O(log T) computations as follows. We store all previously found change points in a balanced binary tree structure (e.g., the AVL tree by Adel'son-Vel'skiĭ and Landis, 1962), where both the search and the insertion of new elements have the worst case complexity O(log T), and the memory complexity is O(T) (given the maximum of T total points that could potentially be declared as change points and thus need to be stored in the AVL tree). Thus, one needs O(log T) time for checking whether the candidate split is suitable (i.e. the corresponding search interval does not contain any previously found change point, for which finding the nearest neighbouring points in the AVL tree to the currently considered candidate is sufficient), and O(log T) time for updating the AVL tree supposing that the candidate is suitable and thus that it is declared/detected as a new change point. Alternatively, we can utilise the structure of seeded intervals by noticing that each potential split point s ∈ {1, . . . , T} is contained in at most O(log T) intervals (as in each of the O(log T) scales there is a constant number of intervals containing s). This allows us to mark all of the intervals that contain the found change point with a cost of O(log T) such that these intervals can then be skipped in consecutive steps when adding the new change points.
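The O(log T) containment check can be illustrated with Python's bisect module on a sorted list (function name ours): the binary search has the same O(log T) cost as in a balanced tree, although inserting into a Python list costs O(T), which is why an AVL tree is used in the proof instead.

```python
import bisect

def contains_found(found_sorted, l, r):
    """Return True if the sorted list found_sorted contains a change point
    strictly inside (l, r). Finding the nearest neighbour of l via binary
    search takes O(log T), as in the balanced-tree argument above."""
    i = bisect.bisect_right(found_sorted, l)   # first element > l
    return i < len(found_sorted) and found_sorted[i] < r
```

For instance, with found change points [50, 75], the interval (40, 60] is rejected (it contains 50), while (0, 40] is accepted.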
Overall, both approaches (AVL trees and the specific structure of seeded intervals) lead to a worst case complexity O(T log T) for the second step, as we go through at most O(T) intervals (the total number of intervals in a collection of seeded intervals) and perform at most O(log T) computations for each (checking whether the interval is suitable and an update of the AVL tree). Therefore, the total computational complexity is O(T log T) and it does not depend on the underlying number of true change points. The total memory complexity is composed of the storage of information (maximal gain, best split, as well as start and end point for each search interval) plus the AVL tree, all of which requires at most O(T) memory.

Theorem 2 Assume model (1) and that in seeded binary segmentation (i.e., Algorithm 1) we determine a full solution path (i.e., calculate segmentations with all thresholds corresponding to maximal gains in any of the seeded intervals) and then select the final model out of the ones in the solution path based on an information criterion. Specifically, consider an information criterion with a penalty PEN(·) of the form

Σ_{t=1}^{T} (X_t − f̂_t)² + PEN(N̂),    (3)

where f̂_t denotes the piecewise constant fit of the candidate segmentation, N̂ its number of change points, and each incremental update of PEN(·) requires O(1) computations. Then the worst case computational complexity of seeded binary segmentation is O(T log T) for the greedy selection and O(T² log T) for the narrowest over threshold selection.

We note that the information criterion (3) in Theorem 2 is fairly general, including e.g., the Akaike information criterion, the Bayes information criterion and its modifications (Zhang and Siegmund, 2007) as special cases. The upper bound for the narrowest over threshold selection in Theorem 2 is worse than that for the greedy selection because, as mentioned previously, for the narrowest over threshold selection even similar thresholds sometimes lead to solutions differing both with respect to number and location of the change point estimates. Note that for the cases when close-by thresholds produce similar models, one could reuse parts of previous segmentations and thus improve on the worst case bound.
We do not know whether any smart updating (potentially using the specific deterministic structure of seeded intervals) could lead to a provably better complexity than the crude bound stated. For this, one would need to be able to say something about the instability in the solution path (i.e., to bound the number of cases in which similar thresholds produce very different segmentations).
From a practical perspective, thresholds κ = C log^{1/2} T for some constant C are of interest. Thus, instead of investigating all possible thresholds, picking some of this specific order could be sufficient to obtain good empirical results while maintaining fairly fast computations, i.e., O(T log T). Alternatively, one could use a reasonable threshold (maybe even a few different ones) and then create a 'solution path' by deleting the found change points in a bottom-up fashion based on local values of the test statistic (i.e., at each step one deletes the change point which has the smallest CUSUM value in the segment determined by its closest left and right neighbours, similar to the reverse segmentation proposal of Chan and Chen (2017)) to obtain a hierarchy of nested models for which the evaluation of the information criterion is cheap (similar to greedy selection). Yet another option is to consider fewer layers in the collection of seeded intervals, leading to much fewer intervals in total (as most intervals are at the lowest layers covering small scales) and thus also much fewer possible thresholds to be used for the solution path. While the overall cost of the selection procedure could be reduced massively, the speedup would come at a price, namely, the risk of not detecting true change points with a small spacing in between. Last, but not least, computations in parallel are easy to implement both for the calculation of the test statistic in the various search intervals and for the model selection using different thresholds (in the narrowest over threshold selection).
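The bottom-up deletion just described is easy to sketch for the univariate change in mean model (1). The following is a minimal Python illustration (not the paper's implementation; the function names and the plain-list bookkeeping are ours), using the standard CUSUM statistic computed from pre-computed cumulative sums:

```python
import math

def cusum(x_cumsum, u, s, v):
    """CUSUM statistic for split s in segment (u, v], using prefix sums
    (x_cumsum[i] = x_1 + ... + x_i, with x_cumsum[0] = 0)."""
    n, n_l, n_r = v - u, s - u, v - s
    mean_l = (x_cumsum[s] - x_cumsum[u]) / n_l
    mean_r = (x_cumsum[v] - x_cumsum[s]) / n_r
    return math.sqrt(n_l * n_r / n) * abs(mean_l - mean_r)

def bottom_up_path(x, cps):
    """Delete found change points one by one, always removing the one with
    the smallest CUSUM value in the segment bounded by its current left and
    right neighbours; returns the resulting hierarchy of nested models."""
    cs = [0.0]
    for v in x:
        cs.append(cs[-1] + v)
    cps = sorted(cps)
    path = [list(cps)]
    while cps:
        bounds = [0] + cps + [len(x)]
        # local CUSUM value of each remaining change point
        vals = [cusum(cs, bounds[i], bounds[i + 1], bounds[i + 2])
                for i in range(len(cps))]
        cps.pop(vals.index(min(vals)))
        path.append(list(cps))
    return path
```

On a noiseless signal with a single jump, spurious candidates have small local CUSUM values and are deleted first, so the true change point survives longest in the hierarchy.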
Proof 2 In order to apply the information criterion (3), we need to compute the full solution path, namely, the solutions for all possible choices of thresholds. Consider first the greedy selection, which has the nice property that the obtained solutions are nested, i.e., adding another split point (equivalent to lowering the threshold) splits one of the existing intervals into two parts. Hence, to obtain the full solution path, one can just let the model selection run as in Theorem 1, but with the threshold set to 0, leading to at most O(T log T) computations. Further, the penalty PEN(·) can be evaluated in an incremental way, i.e., each time a new change point is added, which by assumption needs O(1) computations per update. The least squares part in (3) can also be updated incrementally with O(log T) computations at each step. More precisely, we have to find the nearest left and right neighbours (out of all previously found change points) of the newly added change point. Once the neighbours are found, one can update the least squares cost of the specific segment that was freshly divided in criterion (3) in O(1) time by utilising cumulative sum type information (all of which can be pre-computed once in the beginning in O(T) time and stored in O(T) space). Such searches of nearest neighbours can be done in O(log T) time if we build and iteratively update a balanced binary tree (e.g., the AVL tree of Adelson-Velskii and Landis, 1962) for the found change points, as in the proof of Theorem 1. Note that such balanced trees can also be updated incrementally in O(log T) time. Given that we go through at most O(T) intervals and perform at most O(log T) computations at each, we thus obtain an upper bound on the worst case computational complexity of O(T log T) for the greedy selection.
Consider now the narrowest over threshold selection. Note that there are at most O(T) seeded search intervals and thus at most O(T) thresholds which can potentially lead to different solutions. For each threshold, we can compute the solution in O(T log T) time by Theorem 1 and evaluate the information criterion (3) in O(T log T) time (i.e., in an incremental way, similarly to the greedy selection considered above). Thus, in total, we have an upper bound on the worst case complexity of O(T² log T).
Note that only the value of the information criterion (3) needs to be stored for every threshold that leads to a different solution, requiring O(T) memory. The final solution can be computed again in O(T log T) time once the optimal threshold is selected. Thus, the memory complexity is O(T) for both selection methods. Of course, if one wants to record all possible segmentations from the solution path of the narrowest over threshold selection, the memory complexity could be O(T²).
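The incremental bookkeeping in the proof above can be sketched as follows. This is an illustrative Python fragment (not the reference implementation); for simplicity, a sorted list with `bisect` stands in for the AVL tree, so insertions here cost O(T) rather than the O(log T) achievable with a balanced tree, while the O(1) least squares update via cumulative sums is as in the proof:

```python
import bisect

def make_prefix(x):
    """Cumulative sums of x and x^2, pre-computed once in O(T)."""
    cs, cs2 = [0.0], [0.0]
    for v in x:
        cs.append(cs[-1] + v)
        cs2.append(cs2[-1] + v * v)
    return cs, cs2

def seg_rss(cs, cs2, u, v):
    """Residual sum of squares of segment (u, v] around its mean, in O(1)."""
    s = cs[v] - cs[u]
    return cs2[v] - cs2[u] - s * s / (v - u)

class SolutionPath:
    """Track the least squares part of the criterion while change points
    are added one at a time (greedy selection with threshold 0)."""
    def __init__(self, x):
        self.cs, self.cs2 = make_prefix(x)
        self.cps = [0, len(x)]          # sorted segment boundaries
        self.rss = seg_rss(self.cs, self.cs2, 0, len(x))

    def add(self, eta):
        i = bisect.bisect_left(self.cps, eta)   # nearest neighbours
        u, v = self.cps[i - 1], self.cps[i]
        # replace the cost of the freshly divided segment by its two parts
        self.rss += (seg_rss(self.cs, self.cs2, u, eta)
                     + seg_rss(self.cs, self.cs2, eta, v)
                     - seg_rss(self.cs, self.cs2, u, v))
        self.cps.insert(i, eta)
        return self.rss
```

Each `add` call touches only the one segment that the new change point divides, which is what makes the full solution path cheap to evaluate for nested (greedy) segmentations.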

Statistical theory
The introduction of seeded intervals (in Definition 1) not only accelerates the computation, but also simplifies the statistical analysis of the resulting segmentation method, i.e., seeded binary segmentation (see Algorithm 1). Consistency results depend on the assumed model as well as on the selection method. As an illustration, we consider again the Gaussian change in mean model (1), for which wild binary segmentation (Fryzlewicz, 2014) and the narrowest over threshold method (Baranowski et al., 2019) were claimed to be minimax optimal. However, the proof of the former seems to contain errors, e.g., some bounds for the single change point case were wrongly applied to the multiple change point case. This has been noted by  in the (sub)Gaussian setup. Contrary to what is claimed by Fryzlewicz (2014), the greedy selection utilised in wild binary segmentation does not lead to optimal detection (and the same is true for wild binary segmentation 2 of Fryzlewicz (2018), as the steepest drop to low levels selection utilised therein is also a certain greedy selection method). Put differently, without additional assumptions on the generated intervals or a modification of the selection method itself, consistency seems achievable only for a smaller class of signals than claimed originally. Albeit in a nonparametric setup, Padilla et al. (2019a) also mention the errors in the original proofs of Fryzlewicz (2014), along with some discussion of an additional assumption that they impose under which nearly optimal results can be attained (see their second remark after Theorem 2 on page 8). However, their assumption is a restriction on the generated search intervals requiring knowledge of the minimal segment lengths. Depending on the investigated data, such background knowledge may or may not be a realistic assumption. Note that from a practical perspective greedy selection still offers advantages (e.g., easy applicability and implementation in various scenarios, segmentations that are nested and stable across thresholds, etc.) and in terms of empirical performance it is often at least as good as, for example, the narrowest over threshold method (see the simulations in section 4). As seeded binary segmentation would suffer similarly if the greedy selection method were chosen, we show strong consistency using the narrowest over threshold selection method (Baranowski et al., 2019) as an illustration.

Theorem 3 Assume model (1) and let δ* = min_{i=1,...,N} δ_i with δ_i = |f_{η_i+1} − f_{η_i}|, and λ = min_{i=0,...,N} |η_{i+1} − η_i| with η_0 = 0 and η_{N+1} = T. Assume also that condition (4) holds. Let N̂ and η̂_1 < · · · < η̂_N̂ denote respectively the number and the locations of the change points estimated by seeded binary segmentation in Algorithm 1 with the narrowest over threshold selection method. Then there exist constants C_1, C_2, independent of T and a (in Definition 1), such that, given the threshold κ = C_1 log^{1/2} T for the selection method, the conclusion (5) holds as T → ∞. The signal f as well as δ_i, δ* and λ are allowed to depend on T.
The assumption for detection in (4) and the localisation rates δ_i^{-2} log T in (5) are minimax optimal up to constants (see e.g. Chan and Walther, 2013; Frick et al., 2014; Chan and Chen, 2017; Li et al., 2019). That is, in the asymptotic sense, no method can detect all N change points under a weaker condition than (4), and no method can achieve a better localisation rate than δ_i^{-2} log T.
Proof 3 We consider the interval (η_i − λ, η_i + λ] for each change point η_i. By the construction of seeded intervals in Definition 1, we can find a seeded interval (c_i − r_i, c_i + r_i] such that (c_i − r_i, c_i + r_i] ⊆ (η_i − λ, η_i + λ], r_i ≥ a²λ and |c_i − η_i| ≤ r_i(1 + a²)/2. Then the problem almost reduces to the detection of a single change point on the interval (c_i − r_i, c_i + r_i]. The rest of the proof follows nearly the same lines as the proof of Theorem 1 in Baranowski et al. (2019), and is thus omitted.
We can further establish the consistency of seeded binary segmentation using a certain data-adaptive selection rule (i.e., the strengthened Schwarz information criterion).
Theorem 4 Assume the notation of Theorem 3, and that λ ≥ log^{θ_0} T and that N as well as δ_j, j = 1, . . . , N, lie in (c_1, c_2) for some constants θ_0 > 1 and 0 < c_1 < c_2 < ∞. Let N̂ and η̂_1 < · · · < η̂_N̂ be respectively the number and the locations of the change points estimated by seeded binary segmentation in Algorithm 1 with the narrowest over threshold selection, the threshold of which is determined by the strengthened Schwarz information criterion using θ ∈ (1, θ_0). Then there is a constant C, independent of T and a (in Definition 1), such that the analogue of (5) holds as T → ∞.

Proof 4 This follows in exactly the same way as the proof of Theorem 3 in Baranowski et al. (2019).
In a nutshell, seeded binary segmentation combined with the narrowest over threshold selection has the same statistical guarantee as the original narrowest over threshold method of Baranowski et al. (2019), but requires far less computations, with the worst case computational complexity being nearly linear (for a single threshold based selection), see Theorem 1.
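To make the size claims behind the near-linear complexity concrete, here is a small Python sketch of a seeded interval construction in the spirit of Definition 1 (the exact rounding and layer conventions below are our own simplification): layer k contains n_k = 2⌈(1/a)^{k−1}⌉ − 1 evenly shifted intervals of length T·a^{k−1}, so the number of intervals grows linearly in T while their total length grows like T log T.

```python
import math

def seeded_intervals(T, a=1 / 2 ** 0.5, m=2):
    """Deterministic seeded intervals: layer k holds n_k evenly shifted
    intervals of length T * a^(k-1); stop once the intervals get shorter
    than the minimal number of observations m."""
    intervals = []
    max_layers = math.ceil(math.log(T / m) / math.log(1 / a)) + 1
    for k in range(1, max_layers + 1):
        length = T * a ** (k - 1)
        if length < m:
            break
        n_k = 2 * math.ceil((1 / a) ** (k - 1)) - 1
        shift = (T - length) / (n_k - 1) if n_k > 1 else 0.0
        for i in range(n_k):
            lo = math.floor(i * shift)
            hi = min(T, math.ceil(i * shift + length))
            intervals.append((lo, hi))
    return intervals

ivs = seeded_intervals(10_000)
count = len(ivs)
total_length = sum(hi - lo for lo, hi in ivs)
# count grows linearly in T, total_length like T log T
```

Note the contrast with M random intervals of expected length ~T/3 each: there the total length is O(TM), and M itself may need to grow with the number of change points.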

Empirical results
We consider the same simulations as presented by Fryzlewicz (2014) in connection with wild binary segmentation. For a description of the signals see Appendix B of Fryzlewicz (2014). We used implementations of wild binary segmentation and the narrowest over threshold method from the R packages wbs and not from CRAN, respectively, with a small bug concerning the generation of random intervals fixed in both packages. Moreover, for our proposed seeded binary segmentation methodology we slightly modified the code in wbs to allow for seeded intervals as inputs. The code is available at https://github.com/kovacssolt/SeedBinSeg. Arbitrary intervals as inputs were already supported by the not package. Note, however, that the latter implementation can be slow, as all segmentations in the solution path resulting from all possible thresholds are recorded, which already leads to a quadratic number of candidates being saved (see Theorem 2 and the corresponding discussion for details). Table 1 shows results based on values averaged over 100 simulations for each setup. Reported are the mean squared errors (MSE), Hausdorff distances, V measures (Rosenberg and Hirschberg, 2007; more precisely, viewing the change point segmentation as a clustering problem by identifying each constant part as a cluster), errors in the estimates of the number of change points, as well as the total length of the constructed search intervals. Parentheses show the corresponding standard deviations. For our seeded binary segmentation methodology we chose m = 2 and a = (1/2)^{1/2}. Exceptions are the rows marked with a star, where the decay was chosen as a = (1/2)^{1/8}, i.e., an increase in the total length of search intervals by a factor of roughly four. For both wild binary segmentation and the narrowest over threshold method we chose the recommended 5000 random intervals. Moreover, all methods selected the final model using the same information criterion.
Overall, seeded binary segmentation with greedy selection (g-SeedBS) performs similarly to wild binary segmentation (WBS), while seeded binary segmentation with the narrowest over threshold selection (n-SeedBS) is similar to the original narrowest over threshold method (NOT). Most importantly, using seeded intervals we achieve competitive performance with total lengths of search intervals that are orders of magnitude smaller than with random intervals. In these scenarios both selection types perform similarly; in our experience, increasing the noise level would lead to better results with the greedy selection. The performance of our default methodology (with a = (1/2)^{1/2}) is slightly worse for the blocks signal compared to the random intervals based counterparts. However, when choosing a = (1/2)^{1/8}, and thus accepting an increase in the total length of the search intervals by about a factor of four, performances are very close again. Note that the total length of search intervals is then still only a tenth of that of the random intervals. For a better understanding, the simulations for the blocks signal are depicted in Fig. 3. In the top row (with a = (1/2)^{1/2}) one can see points lining up vertically when considering Hausdorff distances. This line is at around 40 and indicates that one of the true change points (lying at this distance from its closest neighbour) is missed. However, when increasing the computational effort (a = (1/2)^{1/8}, bottom row), this vertical line essentially disappears and for all three error measures many points line up exactly on the identity line. This indicates that the change points found by seeded binary segmentation with greedy selection and by wild binary segmentation agree exactly in many simulation runs.
g-SeedBS, Seeded binary segmentation with greedy selection; n-SeedBS, Seeded binary segmentation with narrowest over threshold selection; WBS, Wild binary segmentation; NOT, Narrowest over threshold method; MSE, Mean squared error; Hausdorff dist., Hausdorff distance; tot. length, total length of search intervals (in thousands); signals are described in Appendix B of Fryzlewicz (2014).
Besides increasing the computational effort via the choice of the decay a, there could be other options to improve empirical performance. For example, one could find preliminary change point estimates with the recommended choice a = (1/2)^{1/2} and, in a second step, include additional search intervals covering the ranges between previously found neighbouring change points. Should the best candidates from these additional intervals have test statistics above the pre-defined acceptance threshold, then one can add them to the final set of change point estimates as well. Another option, mentioned in the definition of seeded intervals, is to increase the minimal number of observations m (i.e., consider fewer layers) if prior information on the minimal true segment lengths is available, which reduces the risk of selecting false positives coming from very small intervals during the final model selection. One could also use more refined test statistics instead of the CUSUM statistics that are adapted to various scales, e.g., multiscale variants.

Fig. 3: Simulation results from Table 1 for the blocks signal with decay a = (1/2)^{1/2} (top) and decay a = (1/2)^{1/8} (bottom).
Beyond the univariate case: An example for high-dimensional models
As mentioned earlier, another appealing application of seeded binary segmentation beyond the extensively discussed univariate Gaussian setup is in scenarios where model fits are expensive, for example in the recently emerging area of high-dimensional change point detection (e.g., means in Soh and Chandrasekaran (2017); Wang and Samworth (2018); covariance and precision matrices/graphical models in Roy et al. (2017); Gibberd and Nelson (2017); Wang et al. (2017); Gibberd and Roy (2017); Bybee and Atchadé (2018); Avanesov and Buzun (2018); Londschien et al. (2019); linear regression in Leonardi and Bühlmann (2016); Wang et al. (2019)). Several of these proposals require the calculation of numerous lasso (Tibshirani, 1996), graphical lasso (Friedman et al., 2008) or similar rather costly estimators. Another example would be multivariate non-parametric change point detection, for which Kovács et al. (2020) use random forests (Breiman, 2001) and other classifiers, Padilla et al. (2019b) use kernel density estimation, while Matteson and James (2014) utilise energy distances. Overall, in all approaches requiring costly single fits, it is essential to keep the total length of search intervals as small as possible to make computations feasible and thus to have practically usable methods.
We would like to highlight this in a specific example involving change point detection in (high-dimensional) Gaussian graphical models. We consider a setup and estimator as introduced by Londschien et al. (2019). Let (X_i)_{i=1}^T ⊆ R^p be a sequence of independent Gaussian random vectors with means μ_i ∈ R^p and covariance matrices Σ_i ∈ R^{p×p} such that (μ_i, Σ_i), i = 1, . . . , T, is piecewise constant with an unknown number N of change points at unknown locations and corresponding segment boundaries. For 0 ≤ u < v ≤ T, let X_{(u,v]} ∈ R^{p×(v−u)} denote the matrix of the observations X_{u+1}, . . . , X_v, denote by μ̂_{(u,v]} their mean and let S_{(u,v]} = (X_{(u,v]} − μ̂_{(u,v]})(X_{(u,v]} − μ̂_{(u,v]})^t/(v − u) be the corresponding covariance matrix. Let 0 < δ < 1/2 be some required minimal relative segment length and λ > 0 be a regularisation parameter. According to the proposal of Londschien et al. (2019), for a segment (u, v] with v − u > 2δT, define the split point candidate as

η̂_{(u,v]} = argmax_{u+δT ≤ s ≤ v−δT} { L_T(Ω̂^{glasso}_{(u,v]}; S_{(u,v]}) − L_T(Ω̂^{glasso}_{(u,s]}; S_{(u,s]}) − L_T(Ω̂^{glasso}_{(s,v]}; S_{(s,v]}) },   (6)

where L_T corresponds to a multivariate Gaussian log-likelihood based loss for the considered segment. Moreover, Ω̂^{glasso}_{(u,v]} is the graphical lasso precision matrix estimator (Friedman et al., 2008) for the corresponding segment with a specifically scaled regularisation parameter, i.e.
Ω̂^{glasso}_{(u,v]} = argmin_{Ω ≻ 0} { tr(S_{(u,v]} Ω) − log det(Ω) + λ_{(u,v]} ‖Ω‖_1 }.

The goal here is to compare the estimation performance and the computational times of the estimator (6) when applied to seeded as well as random intervals. We consider a specific chain network model (Fan et al., 2009, Example 4.1) with p = 20 variables: Σ_{ij} = exp(−a|s_i − s_j|) with a = 1/2 and s_i − s_{i−1} = 0.75, i = 2, . . . , p. The resulting precision matrix Σ^{−1} is tridiagonal. We consider a modified version Σ̃ obtained by replacing the top left 5 × 5 block of Σ by I_5, i.e., an identity matrix of dimension 5. Let F_1 = N(0_p, Σ) and F_2 = N(0_p, Σ̃). We take 20 segments of length 100 each (i.e., N = 19) and generate independent observations in the following way. The first hundred observations are drawn from F_1, the following hundred from F_2, then another hundred from F_1, F_2, F_1, etc. In total, we obtain T = 2000 observations. We simulated 10 different data sets this way and applied to each of these seeded binary segmentation (with greedy selection) with decays a = 1/2, (1/2)^{1/2}, (1/2)^{1/4}, (1/2)^{1/8}, (1/2)^{1/16} as well as M = 100, 200, 400, 800, 1600 random intervals for wild binary segmentation, leading to the 5 red, respectively 5 black, point clouds in Fig. 4. To be precise, we set δ = 0.015, hence, we left out 30 observations at the boundaries of the background intervals. This also implies that all intervals with fewer than 60 observations were discarded for both interval construction types. We set λ = 0.007 and a threshold of 0.04 as the necessary improvement in (6) in order to keep the proposed split point η̂_{(u,v]} in the segment (u, v] and declare it a change point. Note that with this specific choice of δ we do not have a truly high-dimensional scenario in any of the segments, but regularisation is still helpful, especially when considering short background intervals (where the number of variables p is of similar magnitude as the number of observations).
The reasons why we kept the number of variables p comparably low are, on the one hand, the easier choice of tuning parameters and, on the other hand, the fact that we would quickly face computational bottlenecks with wild binary segmentation when increasing p. In this 20-dimensional example, around 2300 seconds are needed for wild binary segmentation to reach an estimation accuracy that is already achieved in about 28 seconds using seeded intervals, see the top of Fig. 4. The bottom of Fig. 4 shows computational gains of similar magnitude (17 vs. 1200 seconds). The differences in computational times between the top and the bottom of Fig. 4 arise from different implementations. The top shows a naive implementation where for each split point s ∈ (u, v] both the input covariance matrices (S_{(u,s]}, S_{(s,v]}) as well as the corresponding graphical lasso fits (Ω̂^{glasso}_{(u,s]}, Ω̂^{glasso}_{(s,v]}) are computed from scratch. However, the input covariance matrices can be updated cheaply for neighbouring split points, i.e., once S_{(u,s]} is available, it can be updated at a cost of O(p²) to obtain S_{(u,s+1]}. This better implementation, shown at the bottom of Fig. 4, saves 40-50% in computational time both for wild and for seeded binary segmentation, but still preserves the massive computational gains of the seeded intervals. The remaining computational costs mainly come from the expensive graphical lasso fits. Without computing the graphical lasso fits, our R implementation would only need about 2 seconds for the fastest version of seeded binary segmentation (a = 1/2), and about 9 seconds for M = 100 and 150 seconds for M = 1600 for wild binary segmentation. However, as mentioned previously, without the regularised fits of the graphical lasso the estimation error would be clearly worse, and hence this is not an option.
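The covariance update trick can equivalently be implemented with cumulative sums of the observations and of their outer products, so that S_{(u,v]} for any segment is available in O(p²) time after an O(Tp²) pre-computation. A dependency-free Python sketch (our own illustrative code, not the implementation used for Fig. 4):

```python
def cov_prefix(X):
    """Cumulative sums of the observations and of their outer products,
    so that any segment covariance S_(u,v] is available in O(p^2) time."""
    p = len(X[0])
    C1 = [[0.0] * p]                       # C1[t] = sum of X_1, ..., X_t
    C2 = [[[0.0] * p for _ in range(p)]]   # C2[t] = sum of outer products
    for x in X:
        C1.append([C1[-1][j] + x[j] for j in range(p)])
        C2.append([[C2[-1][j][k] + x[j] * x[k] for k in range(p)]
                   for j in range(p)])
    return C1, C2

def seg_cov(C1, C2, u, v):
    """S_(u,v] = (segment sum of outer products)/(v-u) - mean * mean^t."""
    n, p = v - u, len(C1[0])
    mu = [(C1[v][j] - C1[u][j]) / n for j in range(p)]
    return [[(C2[v][j][k] - C2[u][j][k]) / n - mu[j] * mu[k]
             for k in range(p)] for j in range(p)]
```

Moving a split point from s to s+1 then amounts to reading off two segment covariances from the prefix arrays, which matches the O(p²) per-split update cost mentioned above.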
Thus, the next best guess to speed up would be to use warm starts from neighbouring splits when updating the graphical lasso fits, similar to the covariance matrix updates described previously. Unfortunately, not all algorithms that have been developed to compute graphical lasso fits are guaranteed to converge with warm starts. Mazumder and Hastie (2012b) proposed the DP-GLASSO algorithm, which can be started at any positive definite matrix. When using the corresponding dpglasso R package from CRAN, we obtained speedups of 30-35% with warm starts, but even this way the computational times were roughly 10-20 times worse than with the graphical lasso algorithm of Friedman et al. (2008) implemented in the glasso R package on CRAN. We used the latter to obtain the results of Fig. 4. Admittedly, there are many factors influencing computational times (e.g., convergence thresholds), and different algorithms/implementations might be better or worse in other scenarios with a higher number of variables p and/or a different regularisation parameter λ.
It is hard to draw an overall conclusion on which combination of implementation and tricks would give the fastest results for changing Gaussian graphical models, but it is even harder to be fast without utilising the more efficient seeded interval construction, for the following reason. In each background interval, evaluating (6) at a single split point s requires the calculation of two graphical lasso fits. Hence, the total computational cost (for a small δ) would be roughly O(T log(T)p² + T log(T)·gl(p)) for seeded binary segmentation versus O(TMp² + TM·gl(p)) for wild binary segmentation, where gl(p) denotes the cost of computing a graphical lasso estimator for a p-dimensional covariance matrix. There are various algorithms computing exact or approximate solutions for the graphical lasso estimator with differing computational cost, but scalings for gl(p) of O(p³) (or even worse) are common once the input covariance matrices are available, unless some special structures can be exploited (e.g., the block diagonal screening proposed by Witten et al. (2011) as well as Mazumder and Hastie (2012a)). The terms T log(T)p² and TMp² come from the calculation of the input covariance matrices via updates. Summarising the observations from above, for typical graphical lasso algorithms the total costs would be roughly O(T log(T)p³) for seeded binary segmentation vs. O(TMp³) for wild binary segmentation. As in the univariate Gaussian change in mean case, M needs to be much larger than O(log T) if there are many change points; in the worst case, M is of order O(T²) when evaluating essentially all possible background intervals. In practice, one often does not know which choice of M is sufficient. This is a problem, as the performance depends very much on the choice of M (see Fig. 4), especially when there are many change points.
Another advantage of seeded intervals is that the results do not depend much on the choice of the decay a in the seeded interval construction, in contrast to the very strong dependence of wild binary segmentation on the choice of M (the number of random intervals).
Overall, the aim of this example is to demonstrate that seeded binary segmentation is highly useful in scenarios beyond the univariate Gaussian change in mean case. While we precisely quantify the computational gains in the changing Gaussian graphical model scenario when using graphical lasso estimators, it is more difficult to do so for some other costly estimators such as the lasso (Tibshirani, 1996) or random forests (Breiman, 2001), for the following reasons. As a main difference, these have a cost that depends not only on the number of variables but also on the number of observations; moreover, possibilities to update neighbouring fits would need to be considered carefully. Even in the absence of an exact quantification of the computational costs in these scenarios, one can expect the substantial computational gains from seeded intervals to be crucial for keeping computations feasible when model fits are very expensive.

Conclusions
Two popular algorithmic approaches tackling change point detection are modifications of dynamic programming and greedy procedures, mostly based on binary segmentation. While dynamic programming provides optimal partitionings, its worst case computational complexity is typically quadratic in the length of the time series. Moreover, dynamic programming often requires considerable work to be adapted to the problem at hand, in particular when aiming to do so in efficient ways, see e.g. Killick et al. (2012). Plain binary segmentation, on the other hand, is easy to adapt to various change point detection scenarios while being fast at the same time. The price to pay is a loss of optimality in terms of statistical estimation. The recently proposed wild binary segmentation (Fryzlewicz, 2014) and related methods improve the statistical detection by considering random background intervals, but these methods lose the computational efficiency of plain binary segmentation.
In this paper we showed that this loss of computational efficiency can be avoided by means of a systematic construction of background intervals. The so-called seeded intervals have a total length of order O(T log T), compared to O(TM) for wild binary segmentation with M random intervals. The choice of M should depend on the assumed minimal segment length, which is typically unknown prior to the analysis. In frequent change point scenarios, M should be clearly larger than O(log T); in the worst case, M is of order O(T²). The seeded interval construction is not only more efficient, but its total length of search intervals is also independent of the number of change points, which guarantees fast computations under all circumstances, unlike for competing methods. Importantly, nothing needs to be sacrificed for this computational gain, i.e., the improved statistical estimation performance is retained, as demonstrated in various simulation results. We proved minimax optimality of our proposed seeded binary segmentation methodology for the univariate Gaussian change in mean problem, and demonstrated massive computational gains for detecting change points in high-dimensional Gaussian graphical models.