Automatic Change-Point Detection in Time Series via Deep Learning

Detecting change-points in data is challenging because of the range of possible types of change and types of behaviour of data when there is no change. Statistically efficient methods for detecting a change will depend on both of these features, and it can be difficult for a practitioner to develop an appropriate detection method for their application of interest. We show how to automatically generate new offline detection methods based on training a neural network. Our approach is motivated by many existing tests for the presence of a change-point being representable by a simple neural network, and thus a neural network trained with sufficient data should have performance at least as good as these methods. We present theory that quantifies the error rate for such an approach, and how it depends on the amount of training data. Empirical results show that, even with limited training data, its performance is competitive with the standard CUSUM-based classifier for detecting a change in mean when the noise is independent and Gaussian, and can substantially outperform it in the presence of auto-correlated or heavy-tailed noise. Our method also shows strong results in detecting and localising changes in activity based on accelerometer data.

Many change-point detection methods are based upon modelling data when there is no change and when there is a single change, and then constructing an appropriate test statistic to detect the presence of a change (e.g. James et al., 1987; Fearnhead and Rigaill, 2020). The form of a good test statistic will vary with our modelling assumptions and the type of change we wish to detect. This can lead to difficulties in practice. As we use new models, it is unlikely that there will be a change-point detection method specifically designed for our modelling assumptions. Furthermore, developing an appropriate method under a complex model may be challenging, while in some applications an appropriate model for the data may be unclear but we may have substantial historical data that shows what patterns of data to expect when there is, or is not, a change.
In these scenarios, currently a practitioner would need to choose the existing change detection method which seems the most appropriate for the type of data they have and the type of change they wish to detect. To obtain reliable performance, they would then need to adapt its implementation, for example tuning the choice of threshold for detecting a change. Often, this would involve applying the method to simulated or historical data.
To address the challenge of automatically developing new change detection methods, this paper is motivated by the question: can we construct new test statistics for detecting a change based only on having labelled examples of change-points? We show that this is indeed possible by training a neural network to classify whether or not a data set has a change of interest. This turns change-point detection into a supervised learning problem.
A key motivation for our approach is the observation that many common test statistics for detecting changes, such as the CUSUM test for detecting a change in mean, can be represented by simple neural networks. This means that, with sufficient training data, the classifier learnt by such a neural network will give performance at least as good as classifiers corresponding to these standard tests. In scenarios where a standard test, such as CUSUM, is being applied but its modelling assumptions do not hold, we can expect the classifier learnt by the neural network to outperform it.
There has been increasing recent interest in whether ideas from machine learning, and methods for classification, can be used for change-point detection. Within computer science and engineering, these include a number of methods designed for, and that show promise on, specific applications (e.g. Ahmadzadeh, 2018; De Ryck et al., 2021; Gupta et al., 2022; Huang et al., 2023). Within statistics, Londschien et al. (2022) and Lee et al. (2023) consider training a classifier as a way to estimate the likelihood-ratio statistic for a change. However, these methods train the classifier in an unsupervised way on the data being analysed, using the idea that a classifier would more easily distinguish between two segments of data if they are separated by a change-point. Chang et al. (2019) use simulated data to help tune a kernel-based change detection method. Methods that use historical, labelled data have been used to train the tuning parameters of change-point algorithms (e.g. Hocking et al., 2015; Liehrmann et al., 2021). Also, neural networks have been employed to construct similarity scores of new observations to learned pre-change distributions for online change-point detection (Lee et al., 2023). However, we are unaware of any previous work using historical, labelled data to develop offline change-point methods. As such, and for simplicity, we focus on the most fundamental aspect, namely the problem of detecting a single change. Detecting and localising multiple changes is considered in Section 6 when analysing activity data. We remark that by viewing the change-point detection problem as a classification instead of a testing problem, we aim to control the overall misclassification error rate instead of handling the Type I and Type II errors separately. In practice, asymmetric treatment of the two error types can be achieved by suitably re-weighting misclassifications in the two directions in the training loss function.

Figure 1: A neural network with 2 hidden layers and width vector m = (4, 4).
The method we develop has parallels with likelihood-free inference methods (Gourieroux et al., 1993; Beaumont, 2019), in that one application of our work is to use the ability to simulate from a model so as to circumvent the need to analytically calculate likelihoods. However, the approach we take is very different from standard likelihood-free methods, which tend to use simulation to estimate the likelihood function itself. By comparison, we directly target learning a function of the data that can discriminate between instances that do or do not contain a change (though see Gutmann et al., 2018, for likelihood-free methods based on re-casting the likelihood as a classification problem).
For an introduction to the statistical aspects of neural network-based classification, albeit not specifically in a change-point context, see Ripley (1994).
We now briefly introduce our notation. For any n ∈ Z_+, we define [n] := {1, . . ., n}. We take all vectors to be column vectors unless otherwise stated. Let 1_n be the all-one vector of length n. Let 1{·} represent the indicator function. The symbol |·| represents the absolute value or cardinality of its argument, depending on the context. For a vector x = (x_1, . . ., x_n)^⊤, we define its p-norm as ∥x∥_p := (∑_{i=1}^n |x_i|^p)^{1/p} for p ≥ 1; when p = ∞, define ∥x∥_∞ := max_i |x_i|. All proofs, as well as additional simulations and real data analyses, appear in the supplement.

Neural networks
The initial focus of our work is on the binary classification problem for whether a change-point exists in a given time series. We will work with multilayer neural networks with Rectified Linear Unit (ReLU) activation functions and binary output. The multilayer neural network consists of an input layer, hidden layers and an output layer, and can be represented by a directed acyclic graph, see Figure 1. Let L ∈ Z_+ represent the number of hidden layers and m = (m_1, . . ., m_L)^⊤ the vector of the hidden layer widths, i.e. m_i is the number of nodes in the ith hidden layer. For a neural network with L hidden layers we use the convention that m_0 = n and m_{L+1} = 1. For bias vectors b_ℓ ∈ R^{m_{ℓ+1}} and weight matrices W_ℓ ∈ R^{m_{ℓ+1}×m_ℓ}, ℓ ∈ {0, . . ., L}, each hidden layer applies a map of the form x ↦ σ(W_ℓ x + b_ℓ) componentwise, where σ(x) = max(x, 0) is the ReLU activation function. The neural network can be mathematically represented by the composite function h : R^n → {0, 1} as

h(x) = σ*_λ( W_L σ( W_{L−1} · · · σ(W_0 x + b_0) · · · + b_{L−1} ) + b_L ),    (1)

where σ*_λ(x) = 1{x > λ}, λ > 0. We define the function class H_{L,m} to be the class of functions h(x) with L hidden layers and width vector m. The output layer in (1) employs the shifted heaviside function σ*_λ(x), which is used for binary classification as the final activation function. This choice is guided by the fact that we use the 0-1 loss, which focuses on the percentage of samples assigned to the correct class, a natural performance criterion for binary classification. Besides its wide adoption in machine learning practice, another advantage of using the 0-1 loss is that it is possible to utilise the theory of the Vapnik-Chervonenkis (VC) dimension (see, e.g. Shalev-Shwartz and Ben-David, 2014, Definition 6.5) to bound the generalisation error of a binary classifier equipped with this loss; indeed, this is the approach we take in this work. The relevant results regarding the VC dimension of neural network classifiers are given in, e.g., Bartlett et al. (2019). As in Schmidt-Hieber (2020), we work with the exact minimiser of the empirical risk. In both binary and multiclass classification, it is possible to work with other losses which make it computationally easier to minimise the corresponding risk; see e.g. Bos and Schmidt-Hieber (2022), who use a version of the cross-entropy loss. However, loss functions different from the 0-1 loss make it impossible to use VC-dimension arguments to control the generalisation error, and more involved arguments, such as those using the covering number (Bos and Schmidt-Hieber, 2022), need to be used instead. We do not pursue these generalisations in the current work.
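To make the architecture concrete, the composite function h can be sketched directly. The following is a minimal illustration only (not the implementation used in the experiments), with the weight and bias arrays supplied by the caller:

```python
import numpy as np

def relu(x):
    """ReLU activation sigma(x) = max(x, 0), applied componentwise."""
    return np.maximum(x, 0.0)

def make_network(weights, biases, lam):
    """Feed-forward ReLU network h: R^n -> {0, 1} with a shifted heaviside
    output, in the spirit of the class H_{L,m} described in the text.
    `weights[l]` has shape (m_{l+1}, m_l) and `biases[l]` has length m_{l+1}."""
    def h(x):
        a = np.asarray(x, dtype=float)
        for W, b in zip(weights[:-1], biases[:-1]):
            a = relu(W @ a + b)                 # hidden layers use ReLU
        out = weights[-1] @ a + biases[-1]      # scalar pre-activation
        return int(out.item() > lam)            # shifted heaviside sigma*_lambda
    return h
```

For example, a single hidden layer computing the two ReLU units max(x_1 − x_2, 0) and max(x_2 − x_1, 0), summed by the output unit, yields |x_1 − x_2| before thresholding at λ.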
3 CUSUM-based classifier and its generalisations are neural networks

Change in mean
We initially consider the case of a single change-point at an unknown location τ ∈ [n − 1], with the data modelled as

X = μ_L 1_{[1:τ]} + μ_R 1_{[(τ+1):n]} + ξ,

where μ_L, μ_R are the unknown signal values before and after the change-point, 1_{[a:b]} denotes the vector whose ith entry is 1{a ≤ i ≤ b}, and ξ ∼ N_n(0, I_n).
The CUSUM test is widely used to detect mean changes in univariate data. For the observation x, the CUSUM transformation C(x) = (C_1(x), . . ., C_{n−1}(x))^⊤, with C_i(x) = v_i^⊤ x for suitable unit-norm contrast vectors v_i, is the log likelihood-ratio statistic for testing a change at time i against the null of no change (e.g. Baranowski et al., 2019). For a given threshold λ > 0, the classical CUSUM test for a change in the mean of the data is defined as

h^CUSUM_λ(x) = 1{ max_{i∈[n−1]} |C_i(x)| > λ }.

The following lemma shows that h^CUSUM_λ(x) can be represented as a neural network.
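As an illustration, a direct (non-neural) implementation of the CUSUM transformation and the resulting classifier might look like the following sketch; the normalisation of C_t used here is one standard choice and may differ from the paper's exact scaling:

```python
import numpy as np

def cusum_transform(x):
    """CUSUM transformation (C_1(x), ..., C_{n-1}(x)); C_t is, up to sign, a
    standardised difference between the means of x before and after t."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    t = np.arange(1, n)
    left = np.cumsum(x)[:-1]        # partial sums of x_1, ..., x_t
    right = x.sum() - left          # sums of x_{t+1}, ..., x_n
    return np.sqrt(t * (n - t) / n) * (left / t - right / (n - t))

def h_cusum(x, lam):
    """CUSUM-based classifier: flag a change iff max_t |C_t(x)| > lam."""
    return int(np.max(np.abs(cusum_transform(x))) > lam)
```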
The fact that the widely-used CUSUM statistic can be viewed as a simple neural network has far-reaching consequences: this means that given enough training data, a neural network architecture that permits the CUSUM-based classifier as its special case cannot do worse than CUSUM in classifying change-point versus no-change-point signals.This serves as the main motivation for our work, and a prelude to our next results.

Beyond the mean change model
We can generalise the simple change in mean model to allow for different types of change or for non-independent noise. In this section, we consider change-point models that can be expressed as a change in regression problem, where the model for data given a change at τ is of the form

X = Zβ + c_τ φ + Γξ,    (2)

where, for some p ≥ 1, Z is an n × p matrix of covariates for the model with no change, c_τ is an n × 1 vector of covariates specific to the change at τ, and the parameters β and φ are, respectively, a p × 1 vector and a scalar. The noise is defined in terms of an n × n matrix Γ and an n × 1 vector of independent standard normal random variables, ξ.
For example, the change in mean problem has p = 1, with Z a column vector of ones, and c_τ a vector whose first τ entries are zeros and remaining entries are ones. In this formulation β is the pre-change mean, and φ is the size of the change. The change in slope problem (Fearnhead et al., 2019) has p = 2, with the columns of Z being a vector of ones and a vector whose ith entry is i; and c_τ has ith entry max{0, i − τ}. In this formulation β defines the pre-change linear mean, and φ the size of the change in slope. Choosing Γ to be proportional to the identity matrix gives a model with independent, identically distributed noise; other choices allow for auto-correlation.
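The design matrices described above are straightforward to construct. A brief illustrative sketch, using 1-based indices i ∈ [n] as in the text:

```python
import numpy as np

def design_mean_change(n, tau):
    """Change-in-mean: Z is a column of ones; c_tau is 0 up to tau, then 1."""
    Z = np.ones((n, 1))
    c = (np.arange(1, n + 1) > tau).astype(float)
    return Z, c

def design_slope_change(n, tau):
    """Change-in-slope: Z has columns (1, i); c_tau has entries max(0, i - tau)."""
    i = np.arange(1, n + 1, dtype=float)
    Z = np.column_stack([np.ones(n), i])
    c = np.maximum(0.0, i - tau)
    return Z, c
```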
The following result is a generalisation of Lemma 3.1, which shows that the likelihood-ratio test for (2), viewed as a classifier, can be represented by our neural network.
Importantly, this result shows that for this much wider class of change-point models, we can replicate the likelihood-ratio-based classifier for change using a simple neural network.
Other types of changes can be handled by suitably pre-transforming the data. For instance, squaring the input data would be helpful in detecting changes in the variance, and if the data followed an AR(1) structure, then changes in autocorrelation could be handled by including transformations of the original input of the form (x_t x_{t+1})_{t=1,...,n−1}. On the other hand, even if such transformations are not supplied as the input, a neural network of suitable depth is able to approximate these transformations and consequently successfully detect the change (Schmidt-Hieber, 2020, Lemma A.2). This is illustrated in Figure 7 in the appendix, where we compare the performance of neural network based classifiers of various depths constructed with and without using the transformed data as inputs.
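The two pre-transformations mentioned above can be sketched as follows (an illustration only; any monotone rescaling of these derived series would serve equally well as network input):

```python
import numpy as np

def variance_features(x):
    """Squared observations: a change in variance becomes a change in mean
    of the derived series."""
    x = np.asarray(x, dtype=float)
    return x ** 2

def autocorr_features(x):
    """Lag-1 products (x_t * x_{t+1}): for AR(1)-type data, a change in
    autocorrelation shows up as a change in mean of this derived series."""
    x = np.asarray(x, dtype=float)
    return x[:-1] * x[1:]
```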

Generalisation error of neural network change-point classifiers
In Section 3, we showed that CUSUM and generalised CUSUM could be represented by a neural network. Therefore, with a large enough amount of training data, a trained neural network classifier that included CUSUM, or generalised CUSUM, as a special case would perform no worse than it on unseen data. In this section, we provide generalisation bounds for a neural network classifier for the change-in-mean problem, given a finite amount of training data. En route to this main result, stated in Theorem 4.3, we provide generalisation bounds for the CUSUM-based classifier, in which the threshold has been chosen on a finite training data set.
We write P(n, τ, μ_L, μ_R) for the distribution of the multivariate normal random vector X ∼ N_n(μ, I_n), where μ_i = μ_L for i ≤ τ and μ_i = μ_R for i > τ. Define η := τ/n. Lemma 4.1 and Corollary 4.1 control the misclassification error of the CUSUM-based classifier.
The theoretical results derived for the neural network-based classifier, here and below, all rely on the fact that the training and test data are drawn from the same distribution. However, we observe that in practice, even when the training and test sets have different error distributions, neural network-based classifiers still provide accurate results on the test set; see our discussion of Figure 2 in Section 5 for more details.

The misclassification error in (3) is bounded by two terms. The first term represents the misclassification error of the CUSUM-based classifier, see Corollary 4.1, and the second term depends on the complexity of the neural network class measured in its VC dimension. Theorem 4.2 suggests that for training sample size N ≫ n² log n, a well-trained single-hidden-layer neural network with 2n − 2 hidden nodes would have comparable performance to that of the CUSUM-based classifier. However, as we will see in Section 5, in practice a much smaller training sample size N is needed for the neural network to be competitive in the change-point detection task. This is because the 2n − 2 hidden layer nodes in the neural network representation of h^CUSUM_λ encode the components of the CUSUM transformation (±v_t^⊤ x : t ∈ [n − 1]), which are highly correlated. By suitably pruning the hidden layer nodes, we can show that a single-hidden-layer neural network with O(log n) hidden nodes is able to represent a modified version of the CUSUM-based classifier with essentially the same misclassification error. More precisely, let Q := ⌊log_2(n/2)⌋ and write T_0 := {2^q : 0 ≤ q ≤ Q} ∪ {n − 2^q : 0 ≤ q ≤ Q}. We can then define

h^{CUSUM*}_λ(x) := 1{ max_{t∈T_0} |v_t^⊤ x| > λ }.

By the same argument as in Lemma 3.1, we can show that h^{CUSUM*}_λ can be represented by a neural network with a single hidden layer of 2|T_0| nodes. The following theorem shows that high classification accuracy can be achieved under a weaker training sample size condition compared to Theorem 4.2.
Theorem 4.3. Fix B > 0 and let the training data D be generated as in Theorem 4.2. Let ĥ_ERM := argmin_{h∈H_{L,m}} L_N(h) be the empirical risk minimiser for a neural network with L ≥ 1 layers and m = (m_1, . . ., m_L)^⊤ hidden layer widths. If m_1 ≥ 4⌊log_2(n)⌋ and m_r m_{r+1} = O(n log n) for all r ∈ [L − 1], then there exists a universal constant C > 0 such that for any δ ∈ (0, 1), (4) holds with probability 1 − δ.
Theorem 4.3 generalises the single hidden layer neural network representation in Theorem 4.2 to multiple hidden layers. In practice, multiple hidden layers help keep the misclassification error rate low even when N is small; see the numerical study in Section 5. Theorems 4.2 and 4.3 are examples of how to derive generalisation errors of a neural network-based classifier in the change-point detection task. The same workflow can be employed for other types of changes, provided that suitable representation results of likelihood-based tests in terms of neural networks (e.g. Lemma 3.2) can be obtained. In a general result of this type, the generalisation error of the neural network will again be bounded by the sum of the error of the likelihood-based classifier and a term originating from the VC-dimension bound on the complexity of the neural network architecture.
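As an aside, the pruned dyadic candidate set T_0 used in the discussion above is cheap to construct; a minimal sketch:

```python
import math

def pruned_grid(n):
    """Dyadic grid T_0 = {2^q : 0 <= q <= Q} ∪ {n - 2^q : 0 <= q <= Q} with
    Q = floor(log2(n/2)), giving only O(log n) candidate change locations."""
    Q = int(math.floor(math.log2(n / 2)))
    powers = [2 ** q for q in range(Q + 1)]
    return sorted(set(powers) | {n - p for p in powers})
```

Every point of the grid lies in [n − 1], and any τ ∈ [n − 1] has a grid point t_0 with |t_0 − τ| ≤ min{τ, n − τ}/2, which is the geometric-covering property exploited in the proof of Theorem 4.3.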
We further remark that, for simplicity of discussion, we have focused our attention on data models where the noise vector ξ = X − EX has independent and identically distributed normal components. However, since CUSUM-based tests are available for temporally correlated or sub-Weibull data, with suitably adjusted test threshold values, the above theoretical results readily generalise to such settings. See Theorems A.3 and A.5 in the appendix for more details.

Numerical study
We now investigate empirically our approach of learning a change-point detection method by training a neural network. Motivated by the results from the previous section, we will fit neural networks with a single hidden layer, and consider how varying the number of hidden layers and the amount of training data affects performance. We will compare to a test based on the CUSUM statistic, both for scenarios where the noise is independent and Gaussian, and for scenarios where there is auto-correlation or heavy-tailed noise. The CUSUM test can be sensitive to the choice of threshold, particularly when we do not have independent Gaussian noise, so we tune its threshold based on training data.
When training the neural network, we first standardise the data onto [0, 1], i.e. x̃_i = (x_i − min_j x_j)/(max_j x_j − min_j x_j). This makes the neural network procedure invariant to either adding a constant to the data or scaling the data by a constant, which are natural properties to require. We train the neural network by minimising the cross-entropy loss on the training data. We run training for 200 epochs with a batch size of 32 and a learning rate of 0.001 using the Adam optimiser (Kingma and Ba, 2015). These hyperparameters are chosen based on a training dataset with cross-validation; more details can be found in Appendix B.
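The standardisation step can be sketched as follows (a minimal sketch, assuming the min-max form stated above; the handling of a constant sequence is our own convention):

```python
import numpy as np

def standardise(x):
    """Min-max standardisation onto [0, 1]; the output is invariant to
    adding a constant to the data or multiplying the data by a constant."""
    x = np.asarray(x, dtype=float)
    rng = x.max() - x.min()
    if rng == 0:                    # constant sequence: map to all zeros
        return np.zeros_like(x)
    return (x - x.min()) / rng
```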
The above procedure is then repeated N/2 times to generate independent sequences x_1, . . ., x_{N/2} with a single change, and the associated labels are (y_1, . . ., y_{N/2})^⊤ = 1_{N/2}. We then repeat the process another N/2 times with μ_R = μ_L to generate sequences without changes, x_{N/2+1}, . . ., x_N, with (y_{N/2+1}, . . ., y_N)^⊤ = 0_{N/2}. The data with and without change, (x_i, y_i)_{i∈[N]}, are combined and randomly shuffled to form the training data. The test data are generated in a similar way, with a sample size N_test = 30000 and the slight modification that μ_R | τ ∼ Unif([−1.75b, −0.25b] ∪ [0.25b, 1.75b]) when a change occurs. We note that the test data is drawn from the same distribution as the training set, though potentially having changes with signal-to-noise ratios outside the range covered by the training set. We have also conducted robustness studies to investigate the effect of training the neural networks on scenario S1 and testing on S1′, S2 or S3. Qualitatively similar results to Figure 2 have been obtained in this misspecified setting (see Figure 6 in the appendix).

We compare the performance of the CUSUM-based classifier, with the threshold cross-validated on the training data, with neural networks from four function classes: H_{1,m^(1)}, H_{1,m^(2)}, H_{5,m^(1)1_5} and H_{10,m^(1)1_10}, where m^(1) = 4⌊log_2(n)⌋ and m^(2) = 2n − 2 respectively (cf. Theorem 4.3 and Lemma 3.1). Figure 2 shows the test misclassification error rate (MER) of these procedures in the four scenarios S1, S1′, S2 and S3. We observe that when data are generated with independent Gaussian noise (Figure 2(a)), the trained neural networks with m^(1) and m^(2) single hidden layer nodes attain very similar test MER compared to the CUSUM-based classifier. This is in line with our Theorem 4.3. More interestingly, when the noise has either autocorrelation (Figure 2(b, c)) or a heavy-tailed distribution, the trained neural networks with (L, m) = (5, m^(1)1_5) and (10, m^(1)1_10) outperform the CUSUM-based classifier, even after we have optimised the threshold choice of the latter. In addition, as shown in Figure 5 in the online supplement, when the first two layers of the network are set to carry out truncation, which can be seen as a composition of two ReLU operations, the resulting neural network outperforms the Wilcoxon statistics-based classifier (Dehling et al., 2015), which is a standard benchmark for change-point detection in the presence of heavy-tailed noise. Furthermore, from Figure 2, we see that increasing L can significantly reduce the average MER when N ≤ 200. Theoretically, as the number of layers L increases, the neural network is better able to approximate the optimal decision boundary, but it becomes increasingly difficult to train the weights due to issues such as vanishing gradients (He et al., 2016). A combination of these considerations leads us to develop a deep neural network architecture with residual connections for detecting multiple changes and multiple change types in Section 6.
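The labelled training-set generation described above can be sketched as follows. This is an illustrative sketch with simplifying assumptions: μ_L = 0, unit-variance Gaussian noise, and μ_R drawn uniformly from [−b, −0.25b] ∪ [0.25b, b] (the paper's test set widens this range to ±1.75b):

```python
import numpy as np

def generate_training_data(N, n, b=1.0, rng=None):
    """Simulate N labelled sequences of length n: half with a single mean
    change at a uniformly drawn location tau in [1, n-1] (label 1), and
    half with no change (label 0), then shuffle."""
    rng = np.random.default_rng(rng)
    X = rng.standard_normal((N, n))              # noise; mu_L = 0 throughout
    y = np.zeros(N, dtype=int)
    for i in range(N // 2):
        tau = int(rng.integers(1, n))            # change location in [1, n-1]
        mu_R = rng.uniform(0.25 * b, b) * rng.choice([-1.0, 1.0])
        X[i, tau:] += mu_R                       # post-change mean shift
        y[i] = 1
    perm = rng.permutation(N)                    # shuffle changes / no-changes
    return X[perm], y[perm]
```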
6 Detecting multiple changes and multiple change types: case study

From the previous section, we see that single and multiple hidden layer neural networks can represent CUSUM or generalised CUSUM tests and may perform better than likelihood-based test statistics when the model is misspecified. This prompted us to seek a general network architecture that can detect, and even classify, multiple types of change. Motivated by the similarities between signal processing and image recognition, we employed a deep convolutional neural network (CNN) (Yamashita et al., 2018) to learn the various features of multiple change-types. However, stacking more CNN layers cannot guarantee a better network because of vanishing gradients in training (He et al., 2016). Therefore, we adopted the residual block structure (He et al., 2016) for our neural network architecture. After experimenting with various architectures with different numbers of residual blocks and fully connected layers on synthetic data, we arrived at the network architecture shown in Figure 9 in the appendix.

We demonstrate the power of our general purpose change-point detection network in a numerical study. We train the network on N = 10000 instances of data sequences generated from a mixture of no change-point in mean or variance, change in mean only, change in variance only, no change in a non-zero slope, and change in slope only, and compare its classification performance on a test set of size 2500 against that of oracle likelihood-based classifiers (where we pre-specify whether we are testing for change in mean, variance or slope) and adaptive likelihood-based classifiers (where we combine likelihood-based tests using the Bayesian Information Criterion). Details of the data-generating mechanism and classifiers can be found in Appendix B.
The classification accuracies of the three approaches in weak and strong signal-to-noise ratio settings are reported in Table 1. We see that the neural network-based approach achieves similar classification accuracy to the adaptive likelihood-based method for weak SNR, and higher classification accuracy than the adaptive likelihood-based method for strong SNR. We would not expect the neural network to outperform the oracle likelihood-based classifiers, as it has no knowledge of the exact change-type of each time series.
We now consider an application to detecting different types of change. The HASC (Human Activity Sensing Consortium) project data contain motion sensor measurements during a sequence of human activities, including "stay", "walk", "jog", "skip", "stair up" and "stair down". Complex changes in sensor signals occur during the transition from one activity to the next (see Figure 3). We have 28 labels in the HASC data, see Figure 10 in the appendix. To agree with the dimension of the output, we drop the two dense layers "Dense(10)" and "Dense(20)" in Figure 9. The resulting network can be effectively applied for change-point detection in sensory signals of human activities, and can achieve high accuracy in change-point classification tasks (Figure 12 in the appendix).
Finally, we remark that our neural network-based change-point detector can be utilised to detect multiple change-points. Algorithm 1 outlines a general scheme for turning a change-point classifier into a location estimator, where we employ an idea similar to that of MOSUM (Eichinger and Kirch, 2018) and repeatedly apply a classifier ψ to data from a sliding window of size n. Here, we require ψ applied to each data segment X*_{[i,i+n)} to output both a class label L_i (1 if a change is predicted and 0 otherwise) and the corresponding probability p_i of having a change. In our particular example, for each data segment X*_{[i,i+n)} of length n = 700, we define ψ so that L_i = 0 if the network predicts a class label in {0, 4, 8, 12, 16, 22} (see Figure 10 in the appendix) and L_i = 1 otherwise. The thresholding parameter γ is chosen to be 1/2.

In the example shown, from left to right there are 4 activities, "stair down", "stay", "stair up" and "walk", whose change-points at 990, 1691 and 2733 are marked by black solid lines. The grey rectangles represent the group of "no-change" segments, with labels "stair down", "stair up" and "walk"; the red rectangles represent the group of "one-change" segments, with labels "stair down→stay", "stay→stair up" and "stair up→walk".
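The sliding-window scheme of Algorithm 1 can be sketched as follows. This is a simplified illustration, not the paper's exact procedure: here `classifier` is a hypothetical interface mapping a length-n segment to a change probability in [0, 1], and change-points are declared at local maxima of that probability exceeding γ:

```python
import numpy as np

def sliding_window_detect(x, classifier, n, gamma=0.5):
    """MOSUM-style multiple change-point estimation: slide a window of size n
    over x, score each segment with the trained classifier, and report the
    centre of each window at a local probability maximum above gamma.
    The >=/> pair picks the last index of a flat run of equal scores."""
    x = np.asarray(x, dtype=float)
    probs = np.array([classifier(x[i:i + n]) for i in range(len(x) - n + 1)])
    estimates = []
    for i in range(1, len(probs) - 1):
        if probs[i] > gamma and probs[i] >= probs[i - 1] and probs[i] > probs[i + 1]:
            estimates.append(i + n // 2)        # centre of the flagged window
    return estimates
```

With a toy classifier that flags a large difference between the two window halves, a single mean change produces a run of flagged windows, and the scheme returns one location near the true change.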

Discussion
Reliable testing for change-points and estimating their locations, especially in the presence of multiple change-points, other heterogeneities or untidy data, is typically a difficult problem for the applied statistician: they need to understand what type of change is sought, be able to characterise it mathematically, find a satisfactory stochastic model for the data, formulate the appropriate statistic, and fine-tune its parameters. This makes for a long workflow, with scope for errors at every stage.
In this paper, we showed how a carefully constructed statistical learning framework could automatically take over some of those tasks, and perform many of them 'in one go' when provided with examples of labelled data. This turned the change-point detection problem into a supervised learning problem, and meant that the task of learning the appropriate test statistic and fine-tuning its parameters was left to the 'machine' rather than the human user.
The crucial question was that of choosing an appropriate statistical learning framework. The key factor behind our choice of neural networks was the discovery that the traditionally used likelihood-ratio-based change-point detection statistics could be viewed as simple neural networks, which (together with bounds on generalisation errors beyond the training set) enabled us to formulate and prove the corresponding learning theory. However, there is a plethora of other excellent predictive frameworks, such as XGBoost, LightGBM or Random Forests (Chen and Guestrin, 2016; Ke et al., 2017; Breiman, 2001), and it would be of interest to establish whether and why they could or could not provide a viable alternative to neural nets here. Furthermore, if we view the neural network as emulating the likelihood-ratio test statistic, in that it will create test statistics for each possible location of a change and then amalgamate these into a single classifier, then we know that test statistics for nearby changes will often be similar. This suggests that imposing some smoothness on the weights of the neural network may be beneficial.
A further challenge is to develop methods that can adapt easily to input data of different sizes, without having to train a different neural network for each input size.For changes in the structure of the mean of the data, it may be possible to use ideas from functional data analysis so that we pre-process the data, with some form of smoothing or imputation, to produce input data of the correct length.
If historical labelled examples of change-points, perhaps provided by subject-matter experts (who are not necessarily statisticians), are not available, one question of interest is whether simulation can be used to obtain such labelled examples artificially, based on (say) a single dataset of interest. Such simulated examples would need to come in two flavours: one batch 'likely containing no change-points' and the other containing some artificially induced ones. How to simulate reliably in this way is an important problem, which this paper does not solve. Indeed, we can envisage situations in which simulating in this way may be easier than solving the original unsupervised change-point problem involving the single dataset at hand, with the bulk of the difficulty left to the 'machine' at the learning stage when provided with the simulated data.
For situations where there is no historical data, but there are statistical models, one can obtain training data by simulation from the model.In this case, training a neural network to detect a change has similarities with likelihood-free inference methods in that it replaces analytic calculations associated with a model by the ability to simulate from the model.It is of interest whether ideas from that area of statistics can be used here.
The main focus of our work was on testing for a single offline change-point, and we treated location estimation and extensions to multiple-change scenarios only superficially, via the heuristics of testing-based estimation in Section 6. Similar extensions can be made to the online setting once the neural network is trained, by retaining the final n observations in an online stream in memory and applying our change-point classifier sequentially. One question of interest is whether and how these heuristics can be made more rigorous: equipped with an offline classifier only, how can we translate the theoretical guarantee of this offline classifier to that of the corresponding location estimator or online detection procedure? In addition to this approach, how else can a neural network, however complex, be trained to estimate locations or detect change-points sequentially? In our view, these questions merit further work.

Appendix

This is the appendix for the main paper Li, Fearnhead, Fryzlewicz, and Wang (2023), hereafter referred to as the main text. We present proofs of our main lemmas and theorems. Various technical details, results of the numerical study and real data analyses are also listed here.

A.2 The Proof of Lemma 3.2
As Γ is invertible, (2) in the main text is equivalent to

Γ^{−1}X = Γ^{−1}Zβ + Γ^{−1}c_τ φ + ξ.

Write X̃ = Γ^{−1}X, Z̃ = Γ^{−1}Z and c̃_τ = Γ^{−1}c_τ. If c̃_τ lies in the column span of Z̃, then the model with a change at τ is equivalent to the model with no change, and the likelihood-ratio test statistic will be 0. Otherwise we can assume, without loss of generality, that c̃_τ is orthogonal to each column of Z̃: if this is not the case, we can construct an equivalent model where we replace c̃_τ with its projection onto the space that is orthogonal to the column span of Z̃. As ξ is a vector of independent standard normal random variables, the likelihood-ratio statistic for a change at τ against no change is a monotone function of the reduction in the residual sum of squares of the model with a change at τ. The residual sum of squares of the no-change model is ∥X̃ − P_{Z̃}X̃∥_2^2, where P_{Z̃} denotes the orthogonal projection onto the column span of Z̃. The residual sum of squares for the model with a change at τ is ∥X̃ − P_{Z̃}X̃∥_2^2 − (c̃_τ^⊤X̃)^2/∥c̃_τ∥_2^2, by the orthogonality of c̃_τ to the columns of Z̃. Thus, the reduction in residual sum of squares of the model with the change at τ over the no-change model is (c̃_τ^⊤X̃)^2/∥c̃_τ∥_2^2, and writing ṽ_τ := c̃_τ/∥c̃_τ∥_2, the likelihood-ratio test statistic is a monotone function of |ṽ_τ^⊤X̃|. This is true for all τ, so the likelihood-ratio test is equivalent to max_τ |ṽ_τ^⊤X̃| > λ for some λ. This is of a similar form to the standard CUSUM test, except that the form of ṽ_τ is different. Thus, by the same argument as for Lemma 3.1 in the main text, we can replicate this test with h(x) ∈ H_{1,2n−2}, but with different weights to represent the different form of ṽ_τ.
A.3 The Proof of Lemma 4.1

Proof. (a) For each i ∈ [n − 1], since ‖vi‖2 = 1, we have vi^⊤X ∼ N(0, 1). Hence the claimed bound follows from the Gaussian tail bound and a union bound, taking t = √(2 log(n/ε)).
(b) We write X = µ + Z, where Z ∼ Nn(0, In). Since the CUSUM transformation is linear, we have C(X) = C(µ) + C(Z). By part (a), there is an event Ω with probability at least 1 − ε on which ‖C(Z)‖∞ ≤ √(2 log(n/ε)). Hence on Ω, by the triangle inequality, each entry of C(X) differs from the corresponding entry of C(µ) by at most √(2 log(n/ε)), which gives the claim.

A.4 The Proof of Corollary 4.1

Proof. Applying Lemma 4.1 in the main text with ε = n e^{−nB²/8} gives the required bound for each fixed parameter value, and the desired result follows by integrating over π0.

A.7 The Proof of Theorem 4.3
The following lemma gives the misclassification rate for the generalised CUSUM test when we only test for changes on a grid of O(log n) values.
(b) There exists some t0 ∈ T0 such that |t0 − τ| ≤ min{τ, n − τ}/2. By Lemma A.1, the CUSUM signal at t0 is at least a constant multiple of that at τ. Consequently, by the triangle inequality and the result from part (a), we have with probability at least 1 − ε that the maximum of the statistic over the grid exceeds the stated threshold, as desired.
Using the above lemma we have the following result.
Proof of Theorem 4.3. We follow the proof of Theorem 4.2 up to (5). The result then follows from the conditions of the theorem.

A.8 Generalisation to time-dependent or heavy-tailed observations

So far, for simplicity of exposition, we have primarily focused on change-point models with independent and identically distributed Gaussian observations. However, neural network based procedures can also be applied to time-dependent or heavy-tailed observations. We first consider the case where the noise series ξ1, . . ., ξn is a centred stationary Gaussian process with short-range temporal dependence. Specifically, writing K(u) := cov(ξt, ξt+u), we assume the summability condition (6).

Theorem A.3. Fix B > 0, n > 0 and let π0 be any prior distribution on Θ(B). We draw (τ, µL, µR) ∼ π0, set Y := 1{µL = µR} and generate X accordingly, where ξ is a centred stationary Gaussian process satisfying (6). Suppose that the training data D := ((X(1), Y(1)), . . ., (X(N), Y(N))) consist of independent copies of (X, Y), and let hERM := arg min_{h∈H_{L,m}} L̂N(h) be the empirical risk minimiser for a neural network with L ≥ 1 layers and m = (m1, . . ., mL)^⊤ hidden layer widths. If m1 ≥ 4⌊log₂(n)⌋ and m_r m_{r+1} = O(n log n) for all r ∈ [L − 1], then for any δ ∈ (0, 1), the stated bound holds with probability at least 1 − δ.

Proof. The first bound follows from the proof of Wang and Samworth (2018, supplementary Lemma 10).
On the other hand, for t0 defined in the proof of Lemma A.1, we have a corresponding lower bound in terms of |µL − µR| and τ.
We can then complete the proof using the same arguments as in the proof of Theorem 4.3.
We now turn to non-Gaussian distributions, and recall that the Orlicz ψα-norm of a random variable Y is defined as ‖Y‖_{ψα} := inf{t > 0 : E exp(|Y|^α/t^α) ≤ 2}. For α ∈ (0, 2), such a random variable Y has a heavier tail than a sub-Gaussian distribution. The following lemma is a direct consequence of Kuchibhotla and Chakrabortty (2022, Theorem 3.1); we state the version used in Li et al. (2023, Proposition 14).

A.9 Multiple change-point estimation
Algorithm 1 is a general scheme for turning a change-point classifier into a location estimator. While it is theoretically challenging to derive guarantees on the change-point location estimation error of the neural network based estimator, we motivate this methodological proposal here by showing that Algorithm 1, applied in conjunction with a CUSUM-based classifier, attains the optimal rate of convergence for the change-point localisation task. We consider the model xi = µi + ξi, where the ξi are independent standard Gaussian noise variables.

Proof. For simplicity of presentation, we focus on the case where n is a multiple of 4, so γ = 1/2. By Lemma A.2 and a union bound, the event Ω holds with high probability. We work on the event Ω henceforth. Consequently, Li defined in Algorithm 1 is below the threshold γ = 1/2 for all i ∈ (τr−1 + n/2, τr − n/2] ∪ (τr + n/2, τr+1 − n/2], monotonically increases for i ∈ (τr − n/2, τr − ∆], monotonically decreases for i ∈ (τr + ∆, τr + n/2], and is above the threshold γ for i ∈ (τr − ∆, τr + ∆]. Thus, exactly one change-point, say τ̂r, will be identified on (τr−1 + n/2, τr+1 − n/2], and τ̂r = arg max_{i∈(τr−1+n/2, τr+1−n/2]} Li ∈ (τr − ∆, τr + ∆], as desired. Since the above holds for all r ∈ [ν], the proof is complete.
Assuming that log(n*) ≍ log(n) and choosing B to be of order √(log n), the above theorem shows that using the CUSUM-based change-point classifier ψ = h^CUSUM_{λ*} in conjunction with Algorithm 1 allows for consistent estimation of both the number and the locations of multiple change-points in the data stream. In fact, the rate of estimating each change-point, 2B²/|µ(r) − µ(r−1)|², is minimax optimal up to logarithmic factors (see, e.g., Verzelen et al., 2020, Proposition 6). An inspection of the proof of Theorem A.6 reveals that the same result would hold for any ψ for which the event Ω holds with high probability. In view of the representability of h^CUSUM_{λ*} in the class of neural networks, one would intuitively expect that a theoretical guarantee similar to Theorem A.6 would be available for the empirical risk minimiser in the corresponding neural network function class. However, the particular way in which we handle the generalisation error in the proof of Theorem 4.3 makes it difficult to proceed in this way, because the distribution of the data segments obtained via sliding windows exhibits complex dependence and no longer follows the common prior distribution π0 used in Theorem 4.2.
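To make the sliding-window localisation scheme concrete, here is a hedged sketch in the spirit of Algorithm 1; the exact window handling, padding and tie-breaking in the paper's Algorithm 1 may differ, and the plug-in classifier used in the usage example below (a rescaled CUSUM-type score) is purely illustrative.

```python
import numpy as np

def localise(x, classifier, n, gamma=0.5):
    """Sketch of a sliding-window localisation scheme: slide a length-n window
    over the series, record the classifier's change score L_i for the window
    centred at i, and declare one change-point at the argmax of each excursion
    of (L_i) above the threshold gamma."""
    N = len(x)
    half = n // 2
    L = np.full(N, -np.inf)
    for i in range(half, N - half):
        L[i] = classifier(x[i - half:i + half])
    estimates = []
    i = half
    while i < N - half:
        if L[i] > gamma:
            j = i
            # scan to the end of this excursion above gamma
            while j < N - half and L[j] > gamma:
                j += 1
            estimates.append(i + int(np.argmax(L[i:j])))
            i = j
        else:
            i += 1
    return estimates
```

For example, with the illustrative score `lambda w: abs(w[:len(w)//2].mean() - w[len(w)//2:].mean())` and a single mean shift, the excursion above γ = 1/2 is centred on the true change-point, mirroring the behaviour of Li established in the proof above.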

B.1 Simulation for Multiple Change-types
In this section, we present a numerical study with a single change-point but multiple change types: change in mean, change in slope and change in variance.
The data set with change/no-change in mean is generated from P(n, τ, µL, µR). We employ the change-in-slope model of Fearnhead et al. (2019), in which φ0, φ1 and φ2 are parameters chosen to guarantee the continuity of the two linear pieces at time t = τ. An analogous model, with the noise variance changing at τ, generates the data set with change in variance.
In addition, we set µ = 0 and φ0 = 0, and the noise follows a normal distribution with mean 0. For flexibility, we let the noise variances in the change-in-mean and change-in-slope models be 0.49 and 0.25 respectively. Both Scenarios 1 and 2 defined below use the neural network architecture displayed in Figure 9.
Benchmark. Aminikhanghahi and Cook (2017) reviewed methodologies for detecting change-points of different types.
For simplicity, we employ the Narrowest-Over-Threshold (NOT) algorithm (Baranowski et al., 2019) for changes in mean and slope, and the single variance change-point detection algorithm (Chen and Gupta, 2012) for changes in variance. These algorithms are available in the R packages not and changepoint. The oracle likelihood-ratio test LR_oracle assumes that we pre-specify whether we are testing for a change in mean, variance or slope. For the construction of the adaptive likelihood-ratio test LR_adapt, we first separately apply the three detection algorithms for changes in mean, variance and slope to each time series; we then compute three values of the Bayesian information criterion (BIC), one per change type, based on the results of change-point detection. Finally, the label corresponding to the minimum BIC value is taken as the predicted label.

Scenario 1: Weak SNR. Let n = 400, N_sub = 2000 and n′ = 40. The data are generated with the parameter settings in Table 2. We use the model architecture in Figure 9 to train the classifier. The learning rate is 0.001, the batch size is 64, the filter size in the convolution layers is 16, the kernel size is (3, 30) and the number of epochs is 500. The transformations are (x, x²). We also use the inverse time decay technique to dynamically reduce the learning rate. The result, displayed in Table 1 of the main text, shows that the test accuracies of LR_oracle, LR_adapt and NN on 2500 test data sets are 0.9056, 0.8796 and 0.8660 respectively.
Scenario 2: Strong SNR. The parameters for generating strong-signal data are listed in Table 2. The other hyperparameters are the same as in Scenario 1. The test accuracies of LR_oracle, LR_adapt and NN on 2500 test data sets are 0.9924, 0.9260 and 0.9672 respectively. We can see that the neural network-based approach achieves higher classification accuracy than the adaptive likelihood-based method.
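The minimum-BIC selection step behind LR_adapt can be sketched as follows. This is a deliberately simplified, hypothetical version: it scores only "no change", "change in mean" and "change in variance" with hand-rolled Gaussian BICs (the paper's construction also handles slope, via the NOT and changepoint packages in R), and all function names and parameter counts are our own choices.

```python
import numpy as np

def bic_no_change(x):
    # Gaussian BIC with a single mean and variance (2 parameters)
    n = len(x)
    return n * np.log(np.mean((x - x.mean()) ** 2)) + 2 * np.log(n)

def bic_mean_change(x):
    # best single split with a mean change and common variance (4 parameters)
    n, best = len(x), np.inf
    for tau in range(10, n - 10):
        rss = (np.sum((x[:tau] - x[:tau].mean()) ** 2)
               + np.sum((x[tau:] - x[tau:].mean()) ** 2))
        best = min(best, n * np.log(rss / n) + 4 * np.log(n))
    return best

def bic_var_change(x):
    # best single split with a variance change about the global mean (4 parameters)
    n, m, best = len(x), x.mean(), np.inf
    for tau in range(10, n - 10):
        s1 = np.mean((x[:tau] - m) ** 2)
        s2 = np.mean((x[tau:] - m) ** 2)
        best = min(best, tau * np.log(s1) + (n - tau) * np.log(s2) + 4 * np.log(n))
    return best

def adaptive_label(x):
    """Return the change type whose best-fitting model minimises BIC, echoing
    the minimum-BIC labelling rule used to build LR_adapt."""
    scores = {"no_change": bic_no_change(x),
              "mean_change": bic_mean_change(x),
              "var_change": bic_var_change(x)}
    return min(scores, key=scores.get)
```

Each score is −2 × (maximised Gaussian log-likelihood) plus a k log n penalty, up to additive constants shared by all three models, so the scores are directly comparable.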

B.2.1 Simulation for simultaneous changes
In this simulation, we compare the classification accuracies of the likelihood-based classifier and the NN-based classifier in the presence of simultaneous changes. For simplicity, we focus on two classes only: no change-point (Class 1), and a change in mean and variance at the same change-point (Class 2). The change-point location τ is randomly drawn from Unif{40, . . ., n − 41}, where n = 400 is the length of the time series. Given τ, to generate the data of Class 2, we use the parameter settings for change in mean and change in variance in Table 2 to randomly draw µL, µR and σ1, σ2 respectively. The data before and after the change-point τ are generated from N(µL, σ1²) and N(µR, σ2²) respectively. To generate the data of Class 1, we simply draw the data from N(µL, σ1²). We generate 2500 replicates of each class as the training dataset. The test dataset is generated by the same procedure, but with 15000 test samples. We use two classifiers to evaluate the classification accuracy of simultaneous change versus no change: a likelihood-ratio (LR) based classifier (Chen and Gupta, 2012, p. 59) and the 21-residual-block neural network (NN) classifier displayed in Figure 9. The results are displayed in Table 3. We can see that under weak SNR the NN performs better than the LR-based method, while it performs as well as the LR-based method under strong SNR.
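The data-generating mechanism above can be sketched directly. The function name and the balanced 50/50 class mix are our choices; the change-point law Unif{40, . . ., n − 41} and the segment distributions follow the description above, while the actual (µL, µR, σ1, σ2) draws would come from the Table 2 settings.

```python
import numpy as np

def simulate_pair(mu_L, mu_R, sigma1, sigma2, n=400, rng=None):
    """Generate one (series, label) pair for the simultaneous-change study:
    Class 2 (label 1) changes both mean and variance at a single point tau
    drawn uniformly from {40, ..., n - 41}; Class 1 (label 0) has no change."""
    rng = np.random.default_rng() if rng is None else rng
    tau = rng.integers(40, n - 40)  # Unif{40, ..., n - 41}
    y = int(rng.random() < 0.5)     # balanced classes (our choice)
    if y == 1:
        x = np.concatenate([rng.normal(mu_L, sigma1, tau),
                            rng.normal(mu_R, sigma2, n - tau)])
    else:
        x = rng.normal(mu_L, sigma1, n)
    return x, y
```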
In this simulation, we compare the performance of the Wilcoxon change-point test (Dehling et al., 2015), CUSUM, the simple neural network H_{L,m} and the truncated H_{L,m} under heavy-tailed noise. Consider the model Xi = µi + ξi, i ≥ 1, where (µi)_{i≥1} are signals and (ξi)_{i≥1} is a stochastic process. To test the null hypothesis of no change, Dehling et al. (2015) proposed a Wilcoxon-type cumulative sum statistic, designed to detect change-points in time series with outliers or heavy tails. Under the null hypothesis, the limiting distribution of Tn1 can be approximated by the supremum of a standard Brownian bridge process (W(0)(λ))_{0≤λ≤1} up to a scaling factor (Dehling et al., 2015, Theorem 3.1). In our simulation, we choose the optimal threshold value on the training dataset using a grid search.
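A minimal sketch of a Wilcoxon-type cumulative sum statistic in the spirit of Dehling et al. (2015) is the following; the O(n²) double loop and the n^{3/2} normalisation are our simplifications, and the exact scaling and tie handling in the paper's Tn1 may differ.

```python
import numpy as np

def wilcoxon_cusum(x):
    """Wilcoxon-type change-point statistic:
    max over k of |sum_{i<=k} sum_{j>k} (1{x_i <= x_j} - 1/2)|, divided by n^{3/2}.
    Rank-based, so it is robust to outliers and heavy tails."""
    n = len(x)
    best = 0.0
    for k in range(1, n):
        # count of pairs (i <= k < j) with x_i <= x_j, centred at its null mean
        s = np.sum(x[:k, None] <= x[None, k:]) - 0.5 * k * (n - k)
        best = max(best, abs(s))
    return best / n ** 1.5
```

Because the statistic depends on the data only through pairwise comparisons, a single Cauchy outlier shifts it far less than it shifts a CUSUM of raw values, which is the robustness exploited in this simulation.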
The truncated simple neural network means that we truncate the data by z-score in the data preprocessing step: given a vector x = (x1, x2, . . ., xn)^⊤, we set xi ← x̄ + sgn(xi − x̄) Z σx whenever |xi − x̄| > Z σx, where x̄ and σx are the mean and standard deviation of x.
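This preprocessing step is equivalent to clipping each observation to the interval [x̄ − Zσx, x̄ + Zσx], which a one-liner makes explicit (the function name is ours):

```python
import numpy as np

def truncate_zscore(x, Z=3.0):
    """Clamp observations whose z-score exceeds Z to mean +/- Z * std,
    matching the z-score truncation rule: x_i <- mean + sgn(x_i - mean) * Z * std
    whenever |x_i - mean| > Z * std."""
    mu, sd = x.mean(), x.std()
    return np.clip(x, mu - Z * sd, mu + Z * sd)
```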
The training dataset is generated using the same parameter settings as in Figure 2(d) of the main text. The misclassification error rate (MER) of each method is reported in Figure 5. We can see that the truncated simple neural network has the best performance. As expected, the Wilcoxon-based test performs better than the untruncated simple neural network based tests. However, we stress that the main purpose of Figure 2 of the main text is to demonstrate that simple neural networks can replicate the performance of CUSUM tests. Even when prior information about heavy-tailed noise is available, we still encourage practitioners to use a simple neural network with z-score truncation added in the data preprocessing step.

B.2.3 Robustness Study
This simulation is an extension of the numerical study of Section 5 in the main text. We trained our neural network using training data generated under scenario S1 with ρt = 0 (i.e. corresponding to Figure 2(a) of the main text), but generate the test data under the settings corresponding to Figure 2(a, b, c, d). In other words, apart from the top-left panel, in the remaining panels of Figure 6 the trained network is misspecified for the test data. We see that the neural networks continue to work well in all panels, and in fact have performance similar to those in Figure 2(b, c, d) of the main text. This indicates that the trained neural network has likely learned features related to the change-point rather than distribution-specific artefacts.

B.2.4 Simulation for change in autocorrelation
In this simulation, we discuss how neural networks can recreate test statistics for various types of change. For instance, if the data follow an AR(1) structure, then changes in autocorrelation can be handled by including transformations of the original input of the form (xt xt+1)_{t=1,...,n−1}. On the other hand, even if such transformations are not supplied as input, a deep neural network of suitable depth is able to approximate these transformations and consequently detect the change successfully (Schmidt-Hieber, 2020, Lemma A.2). This is illustrated in Figure 7, where we compare the performance of neural network based classifiers of various depths constructed with and without the transformed data as inputs.
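Supplying the lag-product transformation as an extra input channel can be sketched as follows; the function name and the choice to pad the lag-product channel to length n (by repeating its last entry) are ours, made only so that both channels share a common length.

```python
import numpy as np

def ar_features(x):
    """Stack the raw series with its first-order lag products (x_t * x_{t+1}),
    the transformation that exposes changes in AR(1) autocorrelation to a
    shallow network. Returns an array of shape (2, n)."""
    lagprod = x[:-1] * x[1:]                      # (x_t * x_{t+1}), length n - 1
    lagprod = np.concatenate([lagprod, lagprod[-1:]])  # pad to length n (our choice)
    return np.stack([x, lagprod])
```

A change in the AR(1) coefficient shifts the mean of the lag-product channel, turning a change in autocorrelation into a change in mean that a simple CUSUM-style network can pick up.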
The signal strength is chosen in line with Lemma 4.1 to ensure a good range of signal-to-noise ratios. We then generate x = (µL 1{t≤τ} + µR 1{t>τ} + εt)_{t∈[n′]}, with the noise ε = (εt)_{t∈[n′]} ∼ N_{n′}(0, I_{n′}). We then draw independent copies x1, . . ., x_{N′} of x. For each xk, we randomly choose 60 segments of length n ∈ {300, 400, 500, 600}; the segments which include τk are labelled '1', the others are labelled '0'. The training dataset size is N = 60N′, where N′ = 500. We then draw another Ntest = 3000 independent copies of x as our test data for change-point location estimation. We study the performance of the change-point location estimator produced by using Algorithm 1 together with a single-layer neural network, and compare it with the performance of CUSUM, MOSUM and Wilcoxon statistics-based estimators. As we can see from Figure 8, under Gaussian models where CUSUM is known to work well, our simple neural network-based procedure is competitive. On the other hand, when the noise is heavy-tailed, our simple neural network-based estimator greatly outperforms the CUSUM-based estimator. The RMSE here is defined as (N⁻¹ Σ_{i=1}^N (τ̂i − τi)²)^{1/2}, where τ̂i is the estimated change-point for the i-th observation and τi is the true change-point.

There are 21 residual blocks in our deep neural network, and each residual block contains 2 convolutional layers. Following the suggestions in Ioffe and Szegedy (2015) and He et al. (2016), each convolution layer is followed by one Batch Normalization (BN) layer and one ReLU layer. In addition, there are 5 fully-connected (dense) layers right after the residual blocks; see the third column of Figure 9. For example, Dense(50) means that the dense layer has 50 nodes and is connected to a dropout layer with dropout rate 0.3. To further prevent overfitting, we also apply L2 regularization in each fully-connected layer (Ng, 2004). As the number of labels in HASC is 28 (see Figure 10), we drop the dense layers "Dense(20)" and "Dense(10)" in Figure 9. The output layer has size (28, 1).
We remark on two further issues here. (a) For other problems, the number of residual blocks, the number of dense layers and the hyperparameters may vary depending on the complexity of the problem. In Section 6 of the main text, the neural network architecture for both synthetic and real data has 21 residual blocks, reflecting the trade-off between time complexity and model complexity. Following the suggestion in He et al. (2016), one can also add more residual blocks to the architecture to improve the classification accuracy. (b) In practice, we may not have enough training data, but there are potential ways to overcome this, either by using data augmentation or by increasing q. In extreme cases where almost all of our data contain no change, we can artificially add changes to such data in line with the type of change we want to detect.

Since each ordered pair of distinct activities gives a possible transition, there are 30 possible types of change-point. The total number of labels is 36 (6 activities and 30 possible transitions). However, we only found 28 different types of label in this real dataset; see Figure 10. The initial learning rate is 0.001 and the number of epochs is 400. The batch size is 16, the dropout rate is 0.3, the filter size is 16 and the kernel size is (3, 25). Furthermore, we use 20% of the training dataset to validate the classifier during the training step.

C.4 Training and Detection
Figure 12 shows the training and validation accuracy curves. After 150 epochs, both the solid and dashed curves approach 1. The test accuracy is 0.9623; see the confusion matrix in Figure 13. These results show that our neural network classifier performs well on both the training and test datasets.
Next, we apply the trained classifier to the 3 repeated sequential datasets of Person 7 to detect the change-points. The first sequential dataset has shape (3, 10743). First, we extract the length-n sliding windows with stride 1 as the input dataset; the input size becomes (9883, 6, 700). Second, we use Algorithm 1 to detect the change-points, where we relabel the activity labels as the "no-change" label and the transition labels as the "one-change" label. Figures 14 and 15 show the results of multiple change-point detection for the other 2 sequential datasets from Person 7.
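The window extraction step above can be sketched as follows; the function name is ours, and the stride is kept as a parameter even though the paper uses stride 1.

```python
import numpy as np

def sliding_windows(x, n, stride=1):
    """Extract length-n sliding windows from a multichannel series of shape
    (d, T). With stride 1 this yields T - n + 1 windows, each of shape (d, n),
    returned stacked as an array of shape (num_windows, d, n)."""
    d, T = x.shape
    starts = np.arange(0, T - n + 1, stride)
    return np.stack([x[:, i:i + n] for i in starts])
```

Each window is then fed to the trained classifier, and the resulting per-position change scores are processed by Algorithm 1.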
Figure 2: Plot of the test set MER, computed on a test set of size Ntest = 30000, against training sample size N for detecting the existence of a change-point in data series of length n = 100. We compare the performance of the CUSUM test and neural networks from four function classes: H_{1,m(1)}, H_{1,m(2)}, H_{5,m(1)1_5} and H_{10,m(1)1_{10}}, where m(1) = 4⌊log₂(n)⌋ and m(2) = 2n − 2, under scenarios S1, S1′, S2 and S3 described in Section 5.
Figure 4 illustrates the result of multiple change-point detection in the HASC data, which uses Algorithm 1.

Algorithm 1: Algorithm for change-point localisation

Figure 3 :
Figure 3: The sequence of accelerometer data in the x, y and z axes. From left to right, there are 4 activities: "stair down", "stay", "stair up" and "walk"; their change-points at 990, 1691 and 2733 respectively are marked by black solid lines. The grey rectangles represent the group of "no-change" windows with labels "stair down", "stair up" and "walk"; the red rectangles represent the group of "one-change" windows with labels "stair down→stay", "stay→stair up" and "stair up→walk".

Figure 4 :
Figure 4: Change-point detection in the HASC data. The red vertical lines represent the underlying change-points and the blue vertical lines represent the estimated change-points. More details on multiple change-point detection can be found in Appendix C.

Figure 5 :
Figure 5: Scenario S3 with Cauchy noise, adding the Wilcoxon-type change-point detection method (Dehling et al., 2015) and the simple neural network with truncation in data preprocessing to the comparison. The average misclassification error rate (MER) is computed on a test set of size Ntest = 15000, against training sample size N, for detecting the existence of a change-point in data series of length n = 100. We compare the performance of the CUSUM test, the Wilcoxon test, H_{1,m(2)} and H_{1,m(2)} with Z = 3, where m(2) = 2n − 2 and Z = 3 refers to the z-score truncation: given a vector x = (x1, x2, . . ., xn)^⊤, we set xi ← x̄ + sgn(xi − x̄) Z σx whenever |xi − x̄| > Z σx, where x̄ and σx are the mean and standard deviation of x.

Figure 6 :
Figure 6: Plot of the test set MER, computed on a test set of size Ntest = 30000, against training sample size N for detecting the existence of a change-point in data series of length n = 100. We compare the performance of the CUSUM test and neural networks from four function classes: H_{1,m(1)}, H_{1,m(2)}, H_{5,m(1)1_5} and H_{10,m(1)1_{10}}, where m(1) = 4⌊log₂(n)⌋ and m(2) = 2n − 2, under scenarios S1, S1′, S2 and S3 described in Section 5. The subcaption "A → B" means that we apply the trained classifier "A" to the target test dataset "B".

Figure 8 :
Figure 8: Plot of the root mean square error (RMSE) of change-point estimation (S1 with ρt = 0, and S3), computed on a test set of size Ntest = 3000, against bandwidth n, for data series of length n* = 2000. We compare the performance of change-point estimation by CUSUM, MOSUM, Algorithm 1 and Wilcoxon (the latter only for S3).

Figure 9 :
Figure 9: Architecture of our general-purpose change-point detection neural network. The left column shows the standard input layers of the network with input size (d, n), where d may represent the number of transformations or channels; the middle column comprises 21 residual blocks and one global average pooling layer; the right column includes 5 dense layers, with the number of nodes in brackets, and the output layer. More details of the neural network architecture appear in the supplement.

Figure 10 :
Figure 10: Label Dictionary.

There are 7 persons' observations in this dataset. The first 6 persons' sequential data are treated as the training dataset, and we use the last person's data to validate the trained classifier. Each person performs each of the 6 activities "stay", "walk", "jog", "skip", "stair up" and "stair down" for at least 10 seconds. The transition point between two consecutive activities can be treated as a change-point.

Figure 11 :
Figure 11: Label Frequency

Figures 14 and 15: Multiple change-point detection results for the other 2 sequential datasets from Person 7.
Figure 13: Confusion Matrix of the Real Test Dataset

Table 1 :
Test classification accuracy of the oracle likelihood-ratio method (LR_oracle), the adaptive likelihood-ratio method (LR_adapt) and our residual neural network (NN) classifier for setups with weak and strong signal-to-noise ratios (SNR). Data are generated as a mixture of: no change-point in mean or variance (Class 1), change in mean only (Class 2), change in variance only (Class 3), no change in a non-zero slope (Class 4) and change in slope only (Class 5). We report the true positive rate of each class, with the overall accuracy in the last row.
We use an architecture with 21 residual blocks followed by a number of fully connected layers. Figure 9 shows an overview of the architecture of the final general-purpose deep neural network for change-point detection. The precise architecture and training methodology of this network NN can be found in Appendix C. Neural Architecture Search (NAS) approaches (see Paaß and Giesselbach, 2023, Section 2.4.3) offer principled ways of selecting neural architectures, and some of these approaches could be adapted to our setting.

Table 2 :
The parameters for weak and strong signal-to-noise ratio (SNR).

Table 3 :
Test classification accuracy of the likelihood-ratio (LR) based classifier (Chen and Gupta, 2012, p. 59) and our residual neural network (NN) classifier with 21 residual blocks for setups with weak and strong signal-to-noise ratios (SNR). Data are generated as a mixture of no change-point (Class 1) and change in mean and variance at the same change-point (Class 2). We report the true positive rate of each class and the accuracy in the last row. The optimal threshold value of LR is chosen by a grid search on the training dataset.