Seeded binary segmentation: a general methodology for fast and optimal changepoint detection

S. Kovács, P. Bühlmann, H. Li and A. Munk

Biometrika, Volume 110, Issue 1, March 2023, Pages 249–256, https://doi.org/10.1093/biomet/asac052
Summary
We propose seeded binary segmentation for large-scale changepoint detection problems. We construct a deterministic set of background intervals, called seeded intervals, in which single changepoint candidates are searched for. The final selection of changepoints based on these candidates can be done in various ways, adapted to the problem at hand. The method is thus easy to adapt to many changepoint problems, ranging from univariate to high dimensional. Compared to recently popular random background intervals, seeded intervals lead to reproducibility and much faster computations. For the univariate Gaussian change in mean set-up, the methodology is shown to be asymptotically minimax optimal when paired with appropriate selection criteria. We demonstrate near-linear runtimes and competitive finite sample estimation performance. Furthermore, we illustrate the versatility of our method in high-dimensional settings.
1. Introduction and motivation
We consider offline, or retrospective, changepoint detection, where one has a set of ordered observations and the data in the segments between the changepoints, or breakpoints, are assumed to come from some common distribution. The goal is to estimate the number of abrupt structural changes, their locations and the signal in between. Applications cover biology, e.g., copy number variation in Olshen et al. (2004) and ion channels in Hotz et al. (2013), finance (Kim et al., 2005) and environmental science, e.g., climate data in Reeves et al. (2007) and environmental monitoring systems in Londschien et al. (2021).
There are two common approaches to tackling changepoint detection algorithmically: modifications of dynamic programming (see, e.g., Friedrich et al., 2008; Boysen et al., 2009; Killick et al., 2012; Frick et al., 2014; Maidstone et al., 2017) aim to solve a global optimization problem, which often leads to statistically efficient methods; greedy heuristic procedures (see, e.g., Fang et al., 2020) approximate a global optimization problem and are often based on binary segmentation (Vostrikova, 1981) or its modifications (Fryzlewicz, 2014; Baranowski et al., 2019). All these methods have certain drawbacks, such as not being statistically efficient, not being computationally efficient in certain scenarios or being hard to generalize to scenarios beyond simple problems of univariate changepoints.
We propose seeded binary segmentation as a generic approach to fast, flexible and statistically efficient changepoint detection in large-scale problems. The approach is similar to wild binary segmentation (Fryzlewicz, 2014) and the related narrowest-over-threshold method (Baranowski et al., 2019), where a large number of intervals of data are analysed separately. For each interval, the best change location is determined, in order to form a preliminary set of candidate changepoints from which the final selection can be made using an appropriate selection criterion. However, while Fryzlewicz (2014) and Baranowski et al. (2019) used random intervals, we propose a deterministic construction of intervals, called seeded intervals. The advantages of seeded intervals include much faster computations, simpler theoretical analysis and stable, reproducible results; see Kovács et al. (2020a) for a discussion on reproducibility. Our approach is also computationally efficient compared to other competitors: independently of the underlying number of changepoints, the total length of our search intervals scales near linearly in the sample size, i.e., linearly up to a logarithmic factor. We consider two simple changing mean signals in order to discuss and highlight potential advantages of this methodology. Similar examples can be found beyond this univariate problem, but we focus on this one because it has many well-established, specialized competitors. In Example 1 below, the signal has a baseline mean level of zero and two consecutive short segments of length ten at location |$T/3$| with up/down jumps. Example 2 below has frequent up/down jumps every tenth observation. The jump sizes are set to be large, such that detection becomes easy for any reasonable estimation method.
Example 1. Let |$X_{1}, \ldots, X_{\lfloor T/3\rfloor-10}\sim\mathcal{N}(0,1)$|, |$X_{\lfloor T/3\rfloor-9}, \ldots, X_{\lfloor T/3\rfloor}\sim\mathcal{N}(4,1)$|, |$X_{\lfloor T/3\rfloor+1}, \ldots, X_{\lfloor T/3\rfloor+10}\sim \mathcal{N}(-4,1)$| and |$X_{\lfloor T/3\rfloor+11}, \ldots, X_{T}\sim\mathcal{N}(0,1)$| be independent observations.
Example 2. Let |$X_{1}, \ldots, X_{10}\sim\mathcal{N}(4,1)$|, |$X_{11}, \ldots, X_{20}\sim\mathcal{N}(-4,1)$|, |$X_{21}, \ldots, X_{30}\sim\mathcal{N}(4,1)$|, |$X_{31}, \ldots, X_{40}\sim \mathcal{N}(-4,1)$|, etc., up to observation |$T$|, be independent observations.
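To make the two set-ups concrete, the following is a minimal NumPy sketch generating data from Examples 1 and 2; the function names and the fixed random seed are our own illustration, not part of the original implementation.

```python
import numpy as np

def example1(T, rng=None):
    """Example 1: baseline mean zero with two short segments of length 10,
    means +4 and -4, located around observation T/3."""
    rng = rng or np.random.default_rng(0)
    mu = np.zeros(T)
    c = T // 3
    mu[c - 10:c] = 4.0    # observations floor(T/3)-9, ..., floor(T/3)
    mu[c:c + 10] = -4.0   # observations floor(T/3)+1, ..., floor(T/3)+10
    return mu + rng.standard_normal(T)

def example2(T, rng=None):
    """Example 2: up/down jumps between means +4 and -4 every tenth observation."""
    rng = rng or np.random.default_rng(0)
    mu = np.where((np.arange(T) // 10) % 2 == 0, 4.0, -4.0)
    return mu + rng.standard_normal(T)
```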
Figure 1 shows that only two of the considered methods are both fast, i.e., near linear, and have low estimation error in terms of the Hausdorff distance for both examples: our seeded binary segmentation with greedy solution path and the specialized functional pruning optimal partitioning method of Maidstone et al. (2017). However, the latter technique currently only works for changes in a univariate parameter and cannot be used for, e.g., detecting changes in multivariate data. The pruned exact linear time method of Killick et al. (2012) uses another pruning technique with somewhat wider applicability. As shown in the right panels of Fig. 1, this method is only fast in frequent changepoint scenarios, such as Example 2, but remains of quadratic complexity if there are long segments with no changepoints; see Example 1.

The main advantage of binary segmentation, compared to dynamic programming, is its easy adaptability and implementation in various changepoint problems. Its runtime in practice is often nearly linear, with a quadratic worst case for rather special signals such as Example 2. Its biggest drawback is that plain binary segmentation is not optimal in terms of statistical estimation; see Fig. 1(a) and the 1992 Stanford University PhD thesis by E. S. Venkatraman. Wild binary segmentation and the related narrowest-over-threshold method are able to improve estimation. Ideally, some of the drawn intervals will contain only a single changepoint sufficiently well separated from the boundaries, and thus serve as good local backgrounds for detection, as opposed to longer intervals with multiple changepoints. However, due to the need to search for the best candidate split point within a sufficiently high number of random intervals, both methods can be much slower than binary segmentation. Their computational times essentially depend on the total length, and hence the number, of intervals. In scenarios with short minimal spacing between consecutive changepoints one should draw more random intervals. In the worst case, such as in the examples presented, one would actually need |$O(T^2)$| random intervals, and thus an overall |$O(T^3)$| computational time, to achieve reasonable statistical performance. To summarize, the main drawbacks of random-interval-based methods are that one may not know whether the number of drawn intervals is sufficient for the underlying signal, that estimation results may not be stable or reproducible across repetitions, and that in the worst case the computational time is cubic, or possibly even worse for problems requiring more costly fits than univariate means.

Fig. 1. (a), (b) Average estimation error and (c), (d) computational times for Examples 1 and 2. We compare our seeded binary segmentation with greedy solution path (SeedBS-GS, decay |$a=1/2^{1/2}$|) with wild binary segmentation (WBS) using |$M_1=5000$|, |$M_2= 5000\{1+({T/1000})^{1/2}\}$| and |$M_3=5000(1+T/1000)$| random intervals; binary segmentation (BS); the pruned exact linear time (PELT) method; and the functional pruning optimal partitioning (FPOP) method. The narrowest-over-threshold method would have comparable behaviour to WBS, as they merely differ in the selection of the final changepoint estimates. In order to eliminate concerns about potential differences in model selection, the detection thresholds are set such that each method detects exactly as many changepoints as the true number for that set-up. The thin dashed grey and brown lines in (c) and (d) indicate computational times with scalings |$O(T)$| and |$O(T^2)$|.
We advocate replacing the random intervals within wild binary segmentation, the narrowest-over-threshold method and similar methods by a computationally much more efficient deterministic construction, called seeded intervals. Remarkably, the speed-up comes at no price, as the statistical guarantees for seeded intervals remain the same as for random intervals. In the worst case, seeded binary segmentation uses |$O(T)$| seeded intervals with a total length |$O(T\log T)$|, leading to a near-linear overall computational cost, as opposed to the worst-case quadratic cost for dynamic programming, or even cubic costs for random-interval-based competitors; see § 2.3 and the Supplementary Material. The seeded interval construction not only improves statistical estimation compared to standard binary segmentation, but also retains its versatility. Compared to dynamic programming algorithms, seeded binary segmentation is much more flexible and easier to implement, making it a highly useful alternative for, e.g., univariate signals with dependent noise or multivariate signals, where specialized pruned dynamic programming algorithms are not available or too slow. Another appealing application of our methodology is high-dimensional changepoint detection and similar set-ups with costly model fits; see § 3 and the Supplementary Material. In such scenarios seeded intervals are powerful and attractive, as they keep the total length and number of search intervals small, and thus also the required number of such costly fits.
2. On the way to seeded binary segmentation
2.1. Set-up and notation

We consider the univariate Gaussian change in mean model |$X_t = f_t + \varepsilon_t$| for |$t=1,\ldots,T$|, referred to as model (1), where |$f_t$| is a fixed, piecewise constant signal with an unknown number of changepoints at unknown locations, and the |$\varepsilon_t$| are independent |$\mathcal{N}(0,\sigma^2)$| errors. For |$0\le i < j \le T$|, write |$\bar X_{(i,j]} = \sum_{t=i+1}^{j} X_t/(j-i)$| for the segment mean. For an interval |$(\ell,r]$| and a candidate split point |$s\in\{\ell+1,\ldots,r-1\}$|, the CUSUM statistic is |$T_{(\ell,r]}(s) = \{ (r-s)(s-\ell)/(r-\ell) \}^{1/2} \{\bar X_{(\ell,s]} - \bar X_{(s,r]}\}/\sigma$|, referred to as (2).
2.2. Seeded binary segmentation
We advocate the replacement of the random search intervals of wild binary segmentation (Fryzlewicz, 2014), the narrowest-over-threshold method (Baranowski et al., 2019) and related methods by a computationally more efficient deterministic construction. Why are these random intervals computationally inefficient? When the start and end points are drawn uniformly at random, there will be many long intervals, see the bottom of Fig. 2, their average length being |$T/3$|. Such long intervals are computationally expensive and often not useful, as they contain several changepoints. Instead, the focus should be on shorter intervals containing single changepoints. The total length of the seeded and random intervals in Fig. 2 is comparable. However, only a few of the random intervals are short, whereas seeded intervals guarantee coverage of single changepoints even for short spacings.
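The |$T/3$| figure is the expected distance between two independent uniform draws on |$(0,T]$|; a quick simulation confirms it (illustration only, not part of the original analysis):

```python
import numpy as np

# Endpoints drawn uniformly at random, as in wild binary segmentation:
# the mean interval length approaches E|U1 - U2| = T/3.
rng = np.random.default_rng(1)
T, M = 1000, 100_000
lengths = np.abs(rng.integers(1, T + 1, size=M) - rng.integers(1, T + 1, size=M))
print(lengths.mean())  # approximately T/3 = 333.3
```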

Fig. 2. Layers of seeded intervals (top) versus random intervals ordered according to length (bottom). In the top panel the dashed lines are created additionally when |$a=(1/2)^{1/2}$| compared to |$a=1/2$| (solid lines).
Definition 1 (Seeded intervals). Let |$a\in[1/2,1)$| denote a given decay parameter. Define |$I_1 = (0,T]$| and, for |$k = 2, \ldots, \lceil\log_{1/a}(T)\rceil$|, i.e., the logarithm with base |$1/a$|, define the |$k$|th layer as the collection of |$n_k$| intervals of initial length |$l_k$| that are evenly shifted, up to rounding, by the deterministic shift |$s_k$|, namely |${\mathcal I}_k = \bigcup_{i=1}^{n_k} \{ (\lfloor (i-1)s_k \rfloor, \lceil(i-1)s_k + l_k \rceil ] \}$|, and define their collection |${\mathcal I} = \bigcup_{k=1}^{\lceil\log_{1/a}(T)\rceil} {\mathcal I}_k$| as the set of seeded intervals, where |$n_k = 2\lceil (1/a)^{k-1} \rceil - 1$|, |$l_k = Ta^{k-1}$| and |$s_k = (T-l_k)/(n_k-1)$|. If some intervals appear multiple times on small scales due to rounding, they can be eliminated.
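Definition 1 translates directly into code. The following is a minimal Python sketch of the construction, with 0-indexed half-open intervals |$(\ell, r]$| represented as pairs; it is an illustration, not the authors' implementation.

```python
import math

def seeded_intervals(T, a=1 / math.sqrt(2)):
    """Seeded intervals of Definition 1: layer k holds n_k intervals of
    length l_k = T * a^(k-1), evenly shifted by s_k = (T - l_k)/(n_k - 1)."""
    intervals = {(0, T)}  # layer 1 is the full interval (0, T]
    n_layers = math.ceil(math.log(T, 1 / a))
    for k in range(2, n_layers + 1):
        n_k = 2 * math.ceil((1 / a) ** (k - 1)) - 1
        l_k = T * a ** (k - 1)
        s_k = (T - l_k) / (n_k - 1)
        for i in range(1, n_k + 1):
            left = math.floor((i - 1) * s_k)
            right = math.ceil((i - 1) * s_k + l_k)
            intervals.add((left, right))  # the set removes rounding duplicates
    return sorted(intervals)
```

For |$a=1/2$| and |$T=8$|, this yields the overlapping dyadic layers |$(0,8]$|; |$(0,4], (2,6], (4,8]$|; and so on, matching the solid lines in Fig. 2.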
To facilitate detection, interval constructions should aim at covering each changepoint with a sufficiently long background interval that contains only that single changepoint, and such that the changepoint is not too close to the boundaries. The top panel of Fig. 2 visualizes our construction. The key idea is best understood for |$a=1/2$|; observe the solid lines in layers |${\mathcal I}_1, {\mathcal I}_3, {\mathcal I}_5$| and |${\mathcal I}_7$|. One obtains an overlapping dyadic partition. The additional overlapping and evenly spread dashed intervals in layers |${\mathcal I}_2, {\mathcal I}_4$| and |${\mathcal I}_6$| are added for the slower decay |$a=1/2^{1/2}$|. To see why overlap is necessary, consider a signal with changepoints at |$T/4$| and |$T/2$| and decay |$a=1/2$|, but without overlap, i.e., |$\tilde {\mathcal I} = \{(0,T], (0,T/2], (T/2,T], (0,T/4], \ldots\}$|. Then only |$(0,T]$| covers |$T/2$|; for some of the other intervals, |$T/2$| would be exactly the boundary, i.e., not allowing detection. As |$(0,T]$| contains two changepoints rather than a single one, detection is not ensured in all scenarios. With the proposed overlap, there would be, e.g., the interval |$(T/4,3T/4]$| containing only the single changepoint at |$T/2$|. Besides the overlap, the fact that all scales are sufficiently covered is the reason for the good statistical performance. While the lengths of the intervals decay rapidly over the layers, their number rapidly increases as the scales become smaller. At the lowest layer, the total number of intervals is |$O(T)$|, allowing good coverage even for frequent changepoints. Hence, in some sense, the number of intervals is automatically adapted to each scale. As the last key component, there are |$O(T)$| intervals in |${\mathcal I}$| and their total length is |$O(T\log T)$|. Evaluating the CUSUM statistic (2) for all of these intervals costs time proportional to the total length and thus leads to near-linear computations. Overall, the key components of the seeded interval construction are (i) a total of |$O(T)$| intervals with a near-linear total length of |$O(T \log T)$|; (ii) choosing more intervals on the small scales, thus suitably adapting to the possibility that there can be many more small segments than large ones; and (iii) guaranteeing a sufficient overlap between intervals.
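Using the sketch above, these counts are easy to verify empirically for a given sample size:

```python
ivs = seeded_intervals(100_000, a=0.5)
print(len(ivs))                    # number of intervals, O(T)
print(sum(r - l for l, r in ivs))  # total length, O(T log T)
```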
Some related interval constructions have been proposed earlier. Interval systems first used for the detection of spatial clusters (Walther, 2010) and for density estimation (Rufibach & Walther, 2010) were gradually adapted towards more classical changepoint detection, and related problems, by Rivera & Walther (2013) and Chan & Walther (2013, 2015). The difference between these various systems and seeded intervals mainly lies in the specific choice of intervals for each scale. Pein et al. (2017) also considered dyadic partitions, but without overlap and with no flexibility in the decay. Recently, yet another interval system was proposed by Chan & Chen (2022) in the context of bottom-up approaches. We believe that our construction is more intuitive and capable of tuning the trade-off between computational time and expected performance through the decay |$a$|.
Algorithm 1 below describes seeded binary segmentation. In practice, the decay |$a=1/2^{1/2}$| already leads to competitive performance in many scenarios and may thus be taken as the default. In very challenging problems, choosing a slower decay might slightly improve the expected estimation performance at the price of somewhat higher computational costs. Limiting the minimal number of observations |$m$|, i.e., stopping before the last layer, is reasonable if a certain minimal segment length is desirable, e.g., in high-dimensional scenarios, or when prior knowledge of the true minimal segment length is available. Choosing a larger value for |$m$| speeds up computations and also reduces the risk of selecting false positives coming from very short intervals during the final selection. We find it important to disentangle the interval construction from the final selection, as various options lead to differing advantages and guarantees; see § 2.3.
Algorithm 1. Seeded binary segmentation.

Require: a decay parameter |$a\in[1/2,1)$|, a minimal segment length |$m \ge 2$| and a selection method.

Step 1. Create the collection |${\mathcal I}$| of seeded intervals (Definition 1) with decay |$a$| which cover at least |$m$| observations.

Step 2. For |$i=1$| to |$|{\mathcal I}|$|: take the |$i$|th interval in |${\mathcal I}$|, denote its boundaries by |$\ell$| and |$r$|, and calculate the CUSUM statistics |$T_{(\ell,r]}(s)$| from (2) for |$s = \ell+1,\ldots, r-1$|.

Step 3. Apply the chosen selection method to |$T_{(\ell,r]}(\cdot)$|, |$(\ell,r] \in {\mathcal I}$|, to output the final changepoint estimates.
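A compact end-to-end sketch of Algorithm 1 in Python, building on `seeded_intervals` from above and using greedy selection with a fixed threshold (see § 2.3); the unit noise variance and the specific threshold handling are illustrative assumptions.

```python
import numpy as np

def cusum(x, left, right):
    """CUSUM statistics T_{(l,r]}(s) from (2), unit variance assumed, for all
    splits s = left+1, ..., right-1, computed in O(right - left) via cumsums."""
    seg = x[left:right]
    n = len(seg)
    cs = np.cumsum(seg)
    k = np.arange(1, n)                 # number of observations left of the split
    mean_l = cs[:-1] / k
    mean_r = (cs[-1] - cs[:-1]) / (n - k)
    return np.sqrt(k * (n - k) / n) * (mean_l - mean_r)

def seeded_bs(x, threshold, a=1 / 2 ** 0.5, m=2):
    """Seeded binary segmentation (Algorithm 1) with greedy selection."""
    candidates = []
    for left, right in seeded_intervals(len(x), a):
        if right - left < max(m, 2):
            continue                    # enforce the minimal segment length
        stats = np.abs(cusum(x, left, right))
        s_hat = left + 1 + int(np.argmax(stats))
        candidates.append((stats.max(), s_hat, left, right))
    candidates.sort(reverse=True)       # rank intervals by their maximal gain
    changepoints = []
    for gain, s_hat, left, right in candidates:
        if gain < threshold:
            break                       # no interval above the threshold remains
        if any(left < cp < right for cp in changepoints):
            continue                    # interval contains a found changepoint
        changepoints.append(s_hat)
    return sorted(changepoints)
```

For instance, `seeded_bs(example1(10_000), threshold=2 * np.log(10_000) ** 0.5)`, using `example1` from the earlier sketch, should recover the changepoints around |$T/3$|; the constant 2 in the |$C\log^{1/2}T$| threshold is an arbitrary illustrative choice.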
2.3. Selection methods and corresponding computational guarantees
Each interval |$(\ell,r]\in{\mathcal I}$| gives a candidate |$\hat s_{(\ell,r]} = \arg\max_{s\in\{\ell+1, \ldots, r-1\}} |T_{(\ell,r]}(s)|$| and a maximal gain |$G_{(\ell,r]} = |T_{(\ell,r]}(\hat s_{(\ell,r]})|$|. The question is which candidates to declare final changepoint estimates. Various selection methods are suited to different model classes, with different statistical and computational guarantees and empirical performance. We consider greedy selection with a single fixed threshold, or with multiple thresholds leading to a solution path; and narrowest-over-threshold selection (Baranowski et al., 2019) with a single fixed threshold, or with multiple thresholds leading to a solution path. Other options are, e.g., steepest drop to low levels (Fryzlewicz, 2020), localized pruning (Cho & Kirch, 2022), selection thresholds guaranteeing false positive control via theory (Fang et al., 2020) or via simulations, as well as so-called seeded dynamic programming approaches, as described in the Supplementary Material.
Greedy selection ranks the intervals according to their gains |$G_{(\ell,r]}$|: it takes as the first changepoint the candidate |$\hat s$| of the interval with the highest gain, eliminates from |${\mathcal I}$| all intervals that contain |$\hat s$|, and then repeats these two steps of picking the next best candidate and eliminating all intervals containing it. The fixed-threshold variant stops when no interval remains with a gain |$G$| above the chosen threshold. Narrowest-over-threshold selection with a fixed threshold takes all intervals |$(\ell,r]\in {\mathcal I}$| for which |$G_{(\ell,r]}$| is over the threshold; it takes the shortest, i.e., narrowest, interval amongst these, with corresponding candidate |$\hat s$|; it eliminates all intervals that contain the found changepoint |$\hat s$|; and it repeats these two steps as long as intervals with gains above the predefined threshold remain. In the implementation we treat all intervals on the same layer of seeded intervals as equally long, despite slight differences due to rounding, and break ties between equally long intervals by ranking them according to their gains. In some scenarios this selection method can be somewhat unstable, as even similar thresholds can lead to segmentations differing both in the number and in the location of the segment boundaries. Typical thresholds in the univariate Gaussian model are |$C\log^{1/2} T$| for both fixed-threshold methods. The solution-path-based variants compute segmentations for all the |$|{\mathcal I}|=O(T)$| possible thresholds arising from the distinct gain values |$G_{(\ell,r]}$|, |$(\ell,r]\in {\mathcal I}$|. Out of the obtained segmentations, the best one in terms of some information criterion (cf. Zhang & Siegmund, 2007) is finally chosen. Besides the choice of |$a$| and the minimal segment length |$m$|, the overall computational costs also depend on how complicated the final model selection is.
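For contrast with the greedy rule coded above, here is a minimal sketch of narrowest-over-threshold selection at a fixed threshold, operating on the `(gain, s_hat, left, right)` candidate tuples produced in `seeded_bs`; for simplicity it uses exact interval lengths with gains as tie-breakers, rather than the per-layer tie handling described above.

```python
def narrowest_over_threshold(candidates, threshold):
    """Among intervals whose gain exceeds the threshold, repeatedly pick the
    narrowest one (ties broken by gain) and discard all intervals containing
    its candidate changepoint."""
    live = sorted(
        (right - left, -gain, s_hat, left, right)
        for gain, s_hat, left, right in candidates
        if gain > threshold
    )
    changepoints = []
    for _, _, s_hat, left, right in live:
        if any(left < cp < right for cp in changepoints):
            continue
        changepoints.append(s_hat)
    return sorted(changepoints)
```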
For model (1), seeded binary segmentation has worst-case |$O(T \log T)$| computational and |$O(T)$| memory complexity if combined with any of the following selection methods:
(i) greedy selection with an arbitrarily fixed single threshold,
(ii) narrowest-over-threshold selection with an arbitrarily fixed single threshold,
(iii) greedy selection with a complete solution path, provided that the information criterion is of the form |$\tilde\eta_1<\cdots<\tilde\eta_m \mapsto \sum_{i = 0}^m \sum_{t \in (\tilde\eta_i,\tilde\eta_{i+1}]} (\bar X_{(\tilde\eta_i,\tilde\eta_{i+1}]} - X_t)^2 + \mathrm{\small{PEN}}(\{\tilde\eta_1,\ldots,\tilde\eta_m\})$| with |$\bar X_{(i,j]} = \sum_{t=i+1}^{j} X_t / (j-i)$|, |$\tilde\eta_0=0$| and |$\tilde\eta_{m+1}=T$|, and under the additional assumption that, given the penalty |$\mathrm{\small{PEN}}(S)$|, the computation of |$\mathrm{\small{PEN}}(S \cup \{\tilde\eta\})$| is |$O(\log T)$| for every finite set of split points |$S$| and any arbitrary additional split point |$\tilde\eta$|.
The information criterion in statement (iii) is fairly general, covering, e.g., the Bayes information criterion and its modifications (Zhang & Siegmund, 2007). The computational complexity of narrowest-over-threshold selection with a complete solution path, i.e., the variant NS, remains open. Empirically, the computational cost seems to scale near linearly when using an algorithm that updates the segmentations from one threshold to the next; however, we could not prove such near-linear scaling. A trivial bound is |$O(T^2\log T)$| when computing each segmentation separately, but empirically we did not find any example with quadratic complexity. Lastly, replacing the full grid search within seeded intervals by the faster adaptive optimistic search of Kovács et al. (2020b) can lead to further speed-ups, e.g., sublinear computations when cumulative sums and knowledge of the minimal segment length are available, or |$O(T)$| without such assumptions.
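As an illustration of the criterion in statement (iii), the following sketch scores one segmentation from the solution path, assuming a simple penalty linear in the number of changepoints (one of many possibilities covered by the statement):

```python
import numpy as np

def criterion(x, splits, pen_per_changepoint):
    """Within-segment sums of squared deviations from the segment means,
    plus an assumed penalty PEN(S) = pen_per_changepoint * |S|."""
    bounds = [0] + sorted(splits) + [len(x)]
    sse = sum(((x[a:b] - x[a:b].mean()) ** 2).sum()
              for a, b in zip(bounds, bounds[1:]))
    return sse + pen_per_changepoint * len(splits)
```

Along the greedy solution path, one would evaluate this criterion for each of the |$O(T)$| candidate segmentations and return the minimizer.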
2.4. Statistical theory
Seeded intervals not only accelerate the computation, but also simplify the statistical analysis compared to random intervals. Consistency results depend on the assumed model and the selection method. As an illustration, the following strong consistency result is presented.
For the Gaussian change in mean model (1), seeded binary segmentation combined with narrowest-over-threshold selection with either a suitable fixed threshold or solution path is asymptotically minimax optimal both regarding signal conditions and localization error for the changepoints; see the Supplementary Material for the precise formulation.
Hence, these variants enjoy the same statistical guarantees as the random-interval-based counterpart of Baranowski et al. (2019), but require far less computation. Contrary to claims by Fryzlewicz (2014), the greedy selection utilised in wild binary segmentation might fail for some rather special signals. Consistency for greedy selection is therefore possible only in a restricted signal class, unless one makes additional assumptions or slightly modifies the selection; we thus do not develop theory for greedy selection. However, greedy selection still has some advantages and often even outperforms narrowest-over-threshold selection; see the Supplementary Material for details.
3. Empirical results: the univariate Gaussian case and a high-dimensional example
In Examples 1 and 2 in § 1, the seeded binary segmentation variants clearly outperformed their random-interval-based counterparts in terms of estimation. In addition to such examples with very short minimal segment lengths, we also consider the same simulations as Fryzlewicz (2014). We observe that seeded intervals achieve competitive performance with orders-of-magnitude smaller total lengths of search intervals, and hence computational cost, than random intervals. Our method with greedy selection performs similarly to wild binary segmentation, while combined with narrowest-over-threshold selection it performs similarly to the original narrowest-over-threshold method. Detailed results, along with considerations for practice, e.g., choices of the decay |$a$| and the minimal segment length |$m$| that might improve estimation, are presented in the Supplementary Material.
To emphasize the applicability of seeded binary segmentation in scenarios with more complex and expensive model fits, we consider the proposal of Londschien et al. (2021) for detecting changes in high-dimensional graphical models: the gain function is based on the multivariate Gaussian loglikelihood, and the underlying precision matrices for each split point are estimated using the graphical lasso (Friedman et al., 2008). In the example shown in Fig. 3, for the same statistical accuracy, seeded binary segmentation is roughly |$30$|–|$80$| times faster than wild binary segmentation, due to the lower total length of the intervals and thus many fewer computationally intensive graphical lasso fits. Applying the adaptive optimistic search of Kovács et al. (2020b) in each interval could lead to further significant speed-ups.
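A rough sketch of such a likelihood-based gain, using scikit-learn's GraphicalLasso as a stand-in for the authors' estimator; the regularization value `alpha` is a hypothetical choice, and details such as scaling and regularization across segments differ from Londschien et al. (2021):

```python
import numpy as np
from sklearn.covariance import GraphicalLasso

def glasso_neg_loglik(Z, alpha=0.1):
    """Negative Gaussian loglikelihood of segment Z (up to constants),
    evaluated at a graphical lasso precision matrix estimate."""
    K = GraphicalLasso(alpha=alpha).fit(Z).precision_
    S = np.cov(Z, rowvar=False, bias=True)
    _, logdet = np.linalg.slogdet(K)
    return len(Z) * (np.trace(S @ K) - logdet)

def glasso_gain(X, left, right, split):
    """Gain of splitting (left, right] at `split`: loglikelihood improvement
    from fitting two precision matrices instead of one."""
    return (glasso_neg_loglik(X[left:right])
            - glasso_neg_loglik(X[left:split])
            - glasso_neg_loglik(X[split:right]))
```

Within seeded binary segmentation, one would maximize this gain over admissible split points in each seeded interval; this is exactly where the small total interval length pays off, as every evaluation involves costly graphical lasso fits.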

Fig. 3. Estimation performance on the changing Gaussian graphical model from the Supplementary Material, with |$2000$| observations and |$19$| changepoints, using (a), (b) a slower naive implementation and (c), (d) a faster one using updates for neighbouring input covariances. Five square (red) and circular (black) point clouds correspond to five different values of the decay and five numbers of random intervals, respectively. Repeated symbols correspond to |$10$| different simulation runs. Note the logarithmic scales.
Acknowledgement
We thank an associate editor and two reviewers for constructive comments, and Tobias Ruckstuhl for helping with a Fortran-based fast implementation. Kovács and Bühlmann were supported by the European Research Council (786461 CausalStats). Li and Munk acknowledge the support of the DFG Cluster of Excellence Multiscale Bioimaging (EXC 2067). This work was supported by the Deutsche Forschungsgemeinschaft (DFG FOR 5381).
Supplementary material
The Supplementary Material includes additional remarks and details, e.g., on related methods, computational costs, time benchmarking and statistical theory. A fast implementation of our seeded binary segmentation is available at https://github.com/kovacssolt/ChangePoints.