Population-based change-point detection for the identification of homozygosity islands

Abstract Motivation This work is motivated by the problem of identifying homozygosity islands on the genome of individuals in a population. Our method directly tackles the issue of identification of the homozygosity islands at the population level, without the need of analysing single individuals and then combine the results, as is made nowadays in state-of-the-art approaches. Results We propose regularized offline change-point methods to detect changes in the parameters of a multidimensional distribution when we have several aligned, independent samples of fixed resolution. We present a penalized maximum likelihood approach that can be efficiently computed by a dynamic programming algorithm or approximated by a fast binary segmentation algorithm. Both estimators are shown to converge almost surely to the set of change-points without the need of specifying a priori the number of change-points. In simulation, we observed similar performances from the exact and greedy estimators. Moreover, we provide a new methodology for the selection of the regularization constant which has the advantage of being automatic, consistent, and less prone to subjective analysis. Availability and implementation The data used in the application are from the Human Genome Diversity Project (HGDP) and is publicly available. Algorithms were implemented using the R software R Core Team (R: A Language and Environment for Statistical Computing. Vienna (Austria): R Foundation for Statistical Computing, 2020.) in the R package blockcpd, found at https://github.com/Lucas-Prates/blockcpd.

1. Introduction.In diploid organisms, such as humans, each individual's genome is organized into pairs of chromosomes, each half inherited from each parent.When an individual is an offspring of biologically related parents, both chromosomes of the same pair can share identical segments, creating long stretches of consecutive homozygosity, known as runs of homozygosity (ROH).
In the last decades, studies on the identification of ROH carried out in human populations have revealed the presence of ROH even in cosmopolitan non-inbred populations, disclosing an increment of inbreeding levels and the consequent reduction of genetic diversity of populations, which is proportional to the walking distance from Africa, as expected by the out-of-Africa model of human colonization (Ceballos et al., 2018;Kirin et al., 2010;Lemes et al., 2018;Leutenegger et al., 2011;Pemberton et al., 2012).
The distribution of ROH along the chromosomes is very uneven, resulting in some genomic regions having significant absence (coldspots) or excess of ROH (ROH islands) (Ceballos et al., 2018).The mechanisms for the emergence of these regions are still under discussion.For example, there is evidence that ROH islands could represent regions that harbor genes target of positive selection since low-recombination regions commonly are locations of selective sweeps, in which a new beneficial mutation increases in frequency and becomes fixed, causing the overall reduction in genetic diversity of the region (Ceballos et al., 2018;Pemberton et al., 2012).
To detect ROH and ROH islands, the genetic material of individuals from a given population is genotyped, and a set of single nucleotide polymorphisms (SNPs) is obtained.Each SNP entry is codified to 1 if that SNP belongs to an ROH for that individual and to 0 otherwise, where a marker is defined as belonging to an ROH for an individual if it is surrounded by a region with high frequency of homozygous SNPs.Finally, ROH islands are regions in which ROH are most frequent in that population.That is, the positions in the array with high frequency of individual ROH passing through them.Therefore, we can think on the problem of ROH island detection in a population as the identification of regions with high frequencies of 1's in the codified SNPs of individuals of that population.That is, this problem can be regarded as a change-point problem for the parameters of a multidimensional random vector with Bernoulli marginal distributions.
1.1.Change point detection.Classically, change point detection refers to the problem of determining the times at which sequential observed data undergoes an abrupt change.
In that type of setting, a change point may refer to changes in mean (Page, 1954;Tsay, 1988;Keshavarz, Scott and Nguyen, 2018), variance (Chen and Gupta, 1997;Hawkins and Zamba, 2005), regression slope (Chow, 1960;Qu and Perron, 2007), general distributions forms (Matteson and James, 2014), or other types of change.Many of these methods have been applied to a wide range of problems such as stream anomaly detection in industry (Li et al., 2018), monitoring of sleep stages using EEG/EMG (Agudelo-España et al., 2020), identification of cyberattacks on networks (Tartakovsky et al., 2006), between many other interesting applications.For more references about change point methods for time series and other applications, we refer the reader to the book Chen and Gupta (2012) or the recent reviews Truong, Oudre and Vayatis (2020) and Lee (2010).
As Niu, Hao and Zhang (2016) points out, from a methodological point of view, there is a variety of ways to formulate the change-point detection problem: online vs. offline, single change point vs. multiple change points, parametric vs. non-parametric, Bayesian vs. non-Bayesian, and many other approaches when we dive into specific estimation procedures.But in most of the classical formulations for change point detection, the estimation problem is considered under the hypothesis that the number of observations along the dimension of interest grows.Early results of Hinkley (1970) show that, even for more straightforward problems, these methods are usually consistent for the change points fractions, but not the change points properly.For offline issues, a possible interpretation is that we have a fixed number of series (usually one) over a unit interval.What grows with the number of observations, in this case, is the resolution over this interval.
As technology advances and other types of data arise from different application areas, new types of change point detection problems arise as well.The ROH islands detection for a population is an example of this.Indeed, here the dimension of interest is discrete and finite, and what grows is the number of independent observations of the random vector.Moreover, since the indices correspond to specific positions such as biological markers, it is essential to recover the exact location of the change points.Therefore, in this case, theoretical consistency results must guarantee convergence to the true change points as the number of independent observations of the joint multidimensional distribution grows.
1.2.Our contribution and related works.In this work we consider several aligned, independent samples of a multidimensional distribution and we assume the distribution has a block structure with different parameters on each block.We focus on detecting such changes in the parameters, that is, the boundaries of the blocks when the number of observations grows.
We propose a penalized maximum likelihood approach to detect the set of change points in the distribution.We allow for multiple change points without assuming an a priori fixed, known number.The penalized maximum likelihood approach has also been considered recently in Castro et al. (2018); Leonardi et al. (2021), but on a different type of change-point problem.There, the approach was introduced for non-parametric discrete distributions in order to detect points of independence on a multidimensional random vector, under independent or non-independent sampling.Here we use a similar methodology but on a different problem; namely, the detection of changes on some parameters of the distribution of the random vector.We also show how to compute the exact estimator by a dynamic programming algorithm and how to approximate the global solution by an efficient greedy binary segmentation algorithm.Albeit using a parametric approach, we do not consider a specific family of distributions.Instead, we state a set of general hypotheses that the family must satisfy and then prove the consistency of the change point set estimator when the number of aligned sequences grows, for the exact estimator as well as for the binary segmentation algorithm.The approach proposed here naturally allows us to directly use bootstrap techniques to assess the variability of the change point estimators.Bootstrap resampling methods have been applied before to construct confidence intervals for the change point location in the single change point problem (Antoch, Hušková and Veraverbeke, 1995;Hušková and Kirch, 2008) or on multiple change point problems (Frick, Munk and Sieling, 2013).We suggest using the bootstrap method to approximate the probability of each index being detected as a change point and to measure the variability of the entire set of change points.A simulation study is presented to provide insights on the behavior of the algorithms and the bootstrap confidence estimation as sample size grows.Finally, we discuss the identification of the ROH islands on the African and European populations, based on the analysis of SNP data from the Human Genome Diversity Project (HGDP) (Li et al., 2008).We compare the ROH islands detected by our method with those detected by the procedure proposed in McQuillan et al. (2008); Kirin et al. (2010), using the PLINK software (Purcell et al., 2007), a well-established command line software designed to solve many medical and population genetics problems.
It is worth mentioning that the methodology for change-point detection developed in this work is very general, and it is not restricted to discrete distributions.In particular, we verify the conditions leading to the consistency of the estimators for standard distributions such as categorical random variables and Gaussian random variables.Still, other families of distributions can satisfy these conditions, widening the possibility of applying our method to many different applied problems.

Change Point Estimation.
2.1.Definition of the model.Let m be a positive integer and consider a random vector X = (X 1 , . . ., X m ) taking values in A m , where A is any state space.Let Θ be a parametric space and consider a family F = {f θ | θ ∈ Θ} of probability densities or probability mass functions over A , for each ∈ N, indexed by θ ∈ Θ.We assume each variable X i , i = 1, . . ., m is distributed according to some f θ ∈ F .Given two integers r and s, with r ≤ s, we use the notation r : s for the set {r, r + 1 . . ., s}.Denote by C the class of all ordered sets C = {c 0 , c 1 , . . ., c k } ⊆ 0 : m such that c 0 = 0 and c k = m.We say that C ∈ C is a change point set for the random vector X if the variables in the subvector (X cj−1+1 , . . ., X cj ) are identically distributed with distribution f θj ∈ F and θ j = θ j+1 for all j = 1, . . ., k − 1. Observe that the change point vector is unique, and from now on it will be denoted by C * .Given any set C ∈ C, we denote by k C the number of positive elements in C; that is, if Assuming the blocks in the random vector X are independent, we can write the probability of observing x = (x 1 , . . ., x m ) ∈ A m in the model with parameters (C, θ), θ = {θ j ∈ Θ : j ∈ 1 : k C } by (2.1) where f θj (x (cj−1+1):cj ) represents the distribution of a random vector over A cj−cj−1 with parameter θ j .The independence assumption over the different blocks is not a necessary condition for the method but the generalization to a non-independent setting is out of the scope of this work.We present below two well-known examples of families of distributions, one for continuous random variables and the other for discrete random variables, with the exact expression for the likelihood function (2.1).
EXAMPLE 1.Let A = {a 1 , . . ., a d } be a finite set and let be the family of all probability distributions over A. The likelihood function (2.1) for x ∈ A m can be written, for C ∈ C and θ = (p 1 , . . ., p k ) as of probability densities over A = R, where f (µ,σ 2 ) denotes the density of a Gaussian random variable with mean µ and variance σ 2 .Fixing the change point set C with k C = k and the vector of parameters θ = ((µ j , σ 2 j )) k j=1 , the likelihood function assuming the variables in x are independent is given by 2.2.Estimation and model selection.Consider a sample of n i.i.d.random vectors x n = {x (i) } n i=1 distributed as X, with change point set C * = (c * 0 , . . ., c * k * ) and parameters θ * = (θ * 1 , . . ., θ * k * ).Our main goal is to estimate the change point set C * and the parameters θ * .For any integer interval I ⊂ 1 : m assume we can compute the maximum likelihood estimator based on the subsample {x (i) j } i∈1:n,j∈I .Write the maximum likelihood function for the set C of candidate change points as where θ j denotes the maximum likelihood estimator computed on the sample {x c } i∈1:n,c∈Ij with I j = (c j−1 + 1) : c j .From (2.3) the log-likelihood function is given by Let R : C → R denote some regularization function and J : N → R an increasing function on the sample size n.We introduce the penalized likelihood estimator based on the functions R and J in the following definition.
DEFINITION 1.Given a sample x n and a constant λ > 0, the Penalized Likelihood (PL) function for the set of change points C is defined as The PL estimator for the change point set is then defined as As we will show later in Theorems 1 and 2, in order to obtain the consistency of the change point estimator defined by (2.6) we need the functions R and J to satisfy some properties.This will be made precise in the statements of these theorems.
3. Computation of the Estimator.In order to efficiently estimate the change point set, we suppose in this section that the regularization function R is additive.That means that there exists a function ρ : {1, . . ., m} 2 → R such that for C = {c 0 , . . ., c k } we have 3.1.Dynamic programming segmentation algorithm.As presented in Jackson et al. (2005), dynamic programming can be used to calculate exactly the PL estimator.Under an additive regularization, the function we want to minimize can be written as The equation shows that we can completely decouple the loss from different blocks.Let C i be the set of all ordered change point sets in 1 : i. Define as the optimal value for the segmentation up to variable i.The estimator C(x n ) given by (2.6) is obtained by computing F (m).But notice that which establishes a recursion equation for the values of F (i), i varying from 1 to m.The value of F (1) can be computed trivially, and then we use the recursion to compute the values until we reach F (m).
Albeit m is fixed and only n grows, the number of variables m can be very large in some applications, so it is useful to express the complexity in terms of both.The dynamic programming segmentation algorithm runs on a time complexity of O(m 2 ).However, we are assuming that Q have been previously computed for all intervals in 1 : m.
To compute Q, we need to compute the maximum likelihood estimators for each block.For most models, this can be done efficiently by computing the sufficient statistics, and then compute the entries of Q.For the models presented in Examples 1 and 2, the time complexity for computing the sufficient statistics is T (n, m) = O(nm), and then O(m 2 ) to compute the entries of Q.In the case where no sufficient statistics exist, we need to reprocess the data every time, so the complexity to compute Q is O(nm 3 ).
Hence, the final time complexity of the algorithm is O(T (n, m) + m 2 ).In the worst case the algorithm is O(nm 3 ), and can be very slow for big values of m.
Depending on the function ρ chosen, the PELT algorithm of Killick, Fearnhead and Eckley (2012) might be applicable.It consists of an adaptation of the dynamic programming algorithm discussed here.In some scenarios, such as when the number of change points is proportional to the number of variables, it runs in O(m).The final complexity would be O(nm + m) when suitable sufficient statistics exist.
3.2.Hierarchical segmentation algorithm.For efficient computation of C(x n ) we can use an approximation to the optimum in (2.6), known as hierarchical segmentation or binary segmentation, first proposed in Scott and Knott (1974).Given an integer interval I = r : s, write x n I for the data with columns restricted to I and assume the penalizing function R is additive.Define the penalized loss of I as with the appropriate renumbering of the columns in x n I .That is, the penalized loss corresponding to the interval I is the penalized loss defined in (2.5) when we only consider the data x n I and perform no splits.We use the convention PL(∅) = 0.For c ∈ I = r : s, define (3.2) h I (c) = PL(r : c) + PL((c + 1) : s) .
Observe that when c = s, by convention we have (s + 1) : s = ∅ so that h I (s) = PL(I).
The hierarchical segmentation algorithm works recursively as follows.It begins with the set C HS (x n ) = {0, m} corresponding to the single interval I = 1 : m.In each iteration and for each interval I determined by C HS (x n ), the algorithm computes and adds it to C HS (x n ).Observe that if in one iteration ĉ = s, as s ∈ C HS (x n ), no changes are made on C HS (x n ).The process continues until no more points can be added to C HS (x n ).
To evaluate PL at each interval, we either store all possible values in the same fashion as for the dynamic programming algorithm or we evaluate them on the run.Since storing would make the algorithm O(m 2 ) in any scenario, it is more interesting to pre-compute the sufficient statistics and evaluate the loss on the intervals as they appear.
In the worst-case scenario, the algorithm has time complexity of order O(T (n, m) + m 2 ).However, the algorithm has asymptotic complexity of order O(T (n, m) + mk C * )), as stated in the next proposition.
PROPOSITION 1.The hierarchical segmentation algorithm will asymptotically perform exactly 2k C * −1 recursive calls.Moreover, its asymptotic complexity is of order O(T (n, m)+ mk C * ), where T (n, m) is the time complexity to compute the sufficient statistics for each interval.
The proof of this proposition is given in Appendix A.
4. Consistency of the Change Point Estimators.In this section we state the theoretical results that guarantee the consistency of the estimator (2.6) computed exactly by the dynamic programming algorithm or approximated by the hierarchical segmentation algorithm.For each method, we state a different set of assumptions that must be satisfied by the family of probability distributions considered in the model.Then we prove that standard families of distributions such as Gaussian and categorical probability distributions satisfy these assumptions.
ASSUMPTION 1. Suppose there exists a function l * : C → R such that (PL1) For any C ∈ C we have that almost surely as n → ∞, where l is the log-likelihood function defined in (2.4).Moreover, assume there exists α > 0 such that ∞, and such that for any C ⊇ C * we have that We now state the consistency result of the PL estimator given in (2.6).
THEOREM 1. Suppose that the family F of probability distributions satisfy Assumption 1.Let R be a penalizing function such that R(C) > R(C ) whenever C ⊃ C and let J(n) be such that J(n)/v(n) → ∞ and J(n)/n → 0 when n → ∞.Then the estimator of the change point set given by (2.6) satisfies C(x n ) = C * eventually almost surely as n → ∞.
Notice that the regularization function R does not need to be additive to guarantee the consistency of the PL estimator.However, this is a desirable property to efficiently compute the estimator by using the dynamic programming segmentation algorithm.
In order to prove that the estimator given by the hierarchical segmentation algorithm is also consistent, we need a slightly different set of hypotheses considering the local nature of this algorithm.Denote by I the set of all intervals I ⊂ 1 : m.Given I ∈ I, denote by θ I the maximum likelihood estimator of the parameter θ on the interval I and as before, let x n I be the data restricted to the interval I. Define the maximum log-likelihood function for the interval I as l(I; Let h I be the loss function for the interval I as defined in (3.2).
ASSUMPTION 2. There exists a function l * : I → R such that (H1) For any integer interval I ∈ I we have that We can now state the consistency of the change point set estimator given by the hierarchical segmentation algorithm.THEOREM 2. For any λ > 0, let C HS (x n ) be the estimator computed by the hierarchical segmentation algorithm.Suppose that the family F of probability distributions satisfy Assumption 2. Suppose that R satisfies (3.1) for some function ρ : I → R and that ρ(I) < ρ(I 1 ) + ρ(I 2 ) whenever It turns out that verifying directly the inequality (4.3) on Assumption 2 is usually hard.We now state a lemma that provides easier to verify conditions that imply the desired inequality.
then it is equals to h * I (s) on this interval.Then, the function h * I satisfies that min The following lemma shows that under general conditions on the function h * I verified in practice by many models, the conditions of Lemma 1 hold.We now state two results that guarantee that both families of probability distributions given in Examples 1 and 2 satisfies Assumptions 1 and 2. Hence, both the dynamic programming and hierarchical segmentation algorithms provide consistent estimators of the change point parameters (C * , θ * ).
PROPOSITION 2. The family of discrete categorical distributions, as presented in Example 1, satisfies Assumptions 1 and 2. PROPOSITION 3. The family of Gaussian densities, as presented in Example 2, with unknown mean and constant known variance for all 2blocks satisfies Assumption 1 and 2.
The proof of Proposition 2 is presented in Appendix A and the proof of Proposition 3 is given in the Supplementary Material to this article, jointly with results of simulations implemented for this specific family of distributions.
5. Bootstrap Methods to Assess Variability.A criticism usually made to change point detection methods is that they often do not provide measures of variability or confidence sets for the estimated change points set.Here, we propose the usage of bootstrap methods to tackle this problem.
As mentioned in the Introduction, in our setting, each index usually has a specific interpretation, such as being a genetic marker or a particular position on a river.Hence, a researcher might be interested in the statistical significance of a given point detected as a change point.To know how likely an index c in 1 : (m − 1) is detected as a change point on a model with parameters (C * , θ * ) we can compute Since our scenario assumes the rows of x n correspond to i.i.d samples of the vector X, bootstrap resampling is straightforward.Resampling B data sets x n,1 , . . ., x n,B with the same dimension of x n by using the bootstrap on the rows of x n and computing the change point set C(x n,b ) for b = 1, . . ., B, the bootstrap estimate for the quantity above is To measure the variability of the whole estimated change point set, we could consider a metric d defined on sets in C. Our interest would be to calculate By replacing C * by C(x n ), the bootstrap estimates of these quantities are respectively.6. Simulations.In this section we show the results of a simulation study to compare the convergence of both algorithms as sample size increases.We consider two families of distributions: the Bernoulli distribution and the Gaussian distribution with both mean and variance unknown (results available as Supplementary Material).We fixed the number of variables m and varied the number of change points in the model.
For both distributions, we varied the number of samples n from 50 to 500 in steps of 50.For each sample size, we simulated 1000 data sets with n samples.We fixed the number of variables as m = 200 and the number of change points |C * | = k * to take values in {10, 20, 50}.For each k * , the change points were sampled without replacement from a uniform distribution in [1,199] and where maintained fixed for all data sets and all sample sizes.Bernoulli parameters were sampled independently from a uniform distribution.For the Gaussian distribution, means were sampled independently from a N (0, √ 5), and variances from an Exp(1).These parameters also remained fixed for all data sets and all sample sizes.The penalization constant λ was selected from the set {0.1, 1, 10}.The models were fitted for each λ in the set, and then BIC was used to choose the final model.The penalization functions were set to J(n) = log(n) and ρ(r : s) = 1.
In order to evaluate the convergence of the change point set estimated by both, the dynamic programming and hierarchical segmentation algorithms, we considered the Jaccard Index, defined by We also took into account the convergence of the estimated number of change points to k * .It allow us to see when the model is heavily underestimating or overestimating the number of change points.
Figure 1 shows that both algorithms are converging to the true change point set as sample sizes grow.The boxplots on the top are becoming increasingly narrower and closer to 1, which indicates total similarity between the change point sets.For k * = 10, most values are indeed 1, as indicated by the collapsed boxplots from sample sizes greater or equal to 350.We also observe that the performance of the algorithms are very similar as sample sizes grow, since the boxplots shapes become similar.The hierarchical algorithm exhibits more variability, since the boxplots are wider and have more outliers.The graphics on the bottom row of the figure, that display boxplots and average lines for the estimated number of change points, help us understand also the behavior of the algorithms as sample size increases.
For the proposed bootstrap procedure of Section 5, we evaluated how likely each index was estimated as a change point by the hierarchical algorithm on the k * = 10 scenario (Figure 2).
Intuitively, we expect the change point detection rate to be higher and with less variability if the blocks it divides have very discrepant parameters and are larger.For the Bernoulli model, we can asses how the detection rate varies with the absolute difference between consecutive parameters on each block, as observed in Figure 3.
The detection of each of the change points increases with sample size, and that change points which separates blocks with higher absolute difference of probability parameters are detected more frequently.

ROH islands on African and European populations.
As described in the introduction, we propose to frame the problem of ROH islands detection as a change point detection problem assuming a Bernoulli marginal distribution for each codified SNP.Observe that we do not need to assume independence between different SNPs in order to have consistent estimators of the change-points, we only assume consecutive parameters on the blocks of the distribution are different.In particular, the ROH islands can be determined as those blocks with a high value of the estimated parameter.
Choosing a regularization function that suits the problem is not an easy task.We can use domain knowledge to construct a proper regularization function.The first consideration is that the distance between SNPs is not uniform.That is, the distance between the i-th and j-th  SNP is not |i − j|, but rather |B(i) − B(j)|, where B is a function that maps each SNP to its physical location on the chromosome.The physical location of a SNP is measured as the number of base pairs before that particular SNP.The second observation is that very small blocks are usually not interesting for the analyst.It is usual to set a minimum block size in which SNPs are grouped.
Considering these observations, we define the regularization function ρ for the block r : s as In the expression above, T denotes a threshold for the minimal physical distance allowed for an ROH island, and β = 10 6 is a scaling factor to work on mega bases unit.The regularization function R(C) in (2.5) is then defined as the sum of the function ρ over the different blocks in C, as in (3.1).
The SNPs data we analyzed was obtained from the Human Genome Diversity Project (HGDP), and consists of approximately 600,000 SNP markers from Illumina HuHap 650k platform (Li et al., 2008).We considered individuals from African and European populations.On this dataset, each row represents an individual from the population, and each column corresponds to each SNP.
For each population, we performed ROH identification for each individual with the criteria described by McQuillan et al. (2008) and Kirin et al. (2010), using the software PLINK v1.9 (Purcell et al., 2007), and compared to the data after preprocessing.The results for chromosome 22 on both populations are shown in Figure 4.
We then estimated the change-point set and parameters on each block for each population using a value of λ that was selected using BIC from the set {0.1, 1, 10}.The threshold size T was set to be 1% of the chromosome size, and the penalization for the sample size set to J(n) = √ n. Figure 5 shows a comparison between the high frequency regions detected by PLINK and by our method.
For both populations, we observe that the peaks with higher ROH frequency detected by PLINK in general correspond to blocks with higher parameter detected by the change point detection method.There are also blocks with high parameter not detected by PLINK, showing that the methods do not always provide the same ROH islands.This can be understood if we look back at Figure 4.There are regions that have a very high number of overlapping blocks in the processed data, but that do not appear at PLINK.Since they correspond to regions with uninterrupted homozigosity SNPs in the processed data, the output of our method will detect such regions as a block with very high probability parameter, but not in PLINK, since almost no individuals with ROH have detected ROH at that region.These contrasting regions can rise due to small values of the preprocessing parameter, PLINK's criteria, or the presence of a high amount of missing data in a specific genomic region.
Finally, Figure 6 shows an application of the bootstrap to provide how reliable is the estimate of the change point set.We see some spikes on the percentage of detection near the start of most blocks, indicating these are indeed the most detected indices.However, notice that the spikes do not have a very high percentage of detection.This is due to the fact that the change points oscillate near the ends of the segments.Since only one change point can be detected at a time in such regions due to the restriction imposed on the minimum block size, the confidence of detection actually refers to the confidence of the change points on a given region and not on a specific position.A last interesting observation is that some blocks have almost no percentage of change points detected in their central parts, indicating that the model is confident about the boundaries of most of the blocks.8. Discussion.In this paper, we studied a different change point detection scenario that arises with new applications.The inferential goals are the same as the classical problem, but consistency is studied differently, and exact estimation for the change point set is required.We proved that a penalized maximum likelihood approach can be applied and that this method consistently estimates the change point set for different classes of distribution families.We also showed how the bootstrap could be used to obtain confidence estimates for each change point and the whole change point set.
The penalization constant λ appearing in the definition of the estimator has to be set by the user or estimated from data.theorems guarantee that the algorithms are consistent as long as the constant is fixed.However, since the speed of convergence depends on this constant, it is interesting to select a value that enhances convergence speed.
If our method is used in a regression setting as part of feature selection for other algorithm, cross-validation can be used to select a proper value for λ.If it is used in a more descriptive manner, BIC can be used for selection, such as applied in the simulations.As long as the proposed set for lambda is finite and fixed, the algorithms continue to be consistent.The simulation study suggests that the choice of the algorithm should depend on sample sizes.The performance of the hierarchical algorithm seems to be equivalent to the exact solution provided by the dynamical programming algorithm for bigger sample sizes.The exact algorithm remains better for small sample sizes since it seems more robust when selecting the penalization constant.There is a clear speed gain using the greedy approach since it is asymptotically O(m(n + |C|)), while the exact algorithm has complexity O(nm + m 2 ).This difference is crucial in high dimensional analysis, such as applications in genetics.Of course, limiting the number of estimated blocks also aids in time speed and might avoid over-segmentation.
This new approach for change point detection is motivated by the problem of identifying homozygosity islands on the genome of individuals in a population.Our method directly tackles the issue of determining the homozygosity islands at the population level without analyzing single individuals and then combining the results, as is made nowadays in stateof-the-art approaches.Applying this method to real data of two populations from the Human Genome Diversity Project (HGDP) showed the potentiality of these algorithms to highlight highly homozygous regions in the genome.The non-homogeneity of the regularization function R can provide flexibility to incorporate more domain knowledge of the application area.
There is much to explore in future research.A first step will be to check if the assumptions hold for a wider class of distributions, such as Exponential Family, finite-state Markov Chains and Multivariate Gaussian within each block.For the i.i.d Gaussian with unknown mean and variance, the proof that assumption 1 holds is very similar to the one provided for the case with known variance in the supplementary material.However, proving assumption 2 seems to be more challenging.A second interesting question is to prove or give a counterexample of whether 2 implies 1.That is if the convergence of the greedy algorithm implies the convergence of the exact algorithm as well.A third interesting question is how to derive rates of convergence of the change point set estimators. 9. Additional information.Simulations and computation of the estimators were performed using the R software R Core Team (2020) and the R package blockcpd developed for this task.The package provides easy-to-usedr of usage and implements the algorithms for the Bernoulli, Gaussian with unknown mean and variance, Exponential and Poisson families for the i.i.d case.An implementation for a two state Markov Chain is also provided.The c++ R framework provided by the Rcpp package (Eddelbuettel and François, 2011;Eddelbuettel and Balamuta, 2017) was used to overcome performance bottlenecks.
To obtain the final complexity, note that a call in the interval r : s the algorithm does |s − r + 1| ≤ m + 1 comparisons and memory checks, and therefore has complexity of O(mk C * )).
PROOF OF THEOREM 1.First we will prove that C(x n ) will almost surely contain C * .Let C ∈ C such that C ⊇ C * .Then by (PL1) in Assumption 1 we have that eventually almost surely On the other hand, as as n → ∞.Therefore, eventually almost surely we have As the number of C ⊇ C * is finite we have that eventually almost surely eventually almost surely as n → ∞.As a result we obtain that C(x n ) = C * eventually almost surely as n → ∞.
We now present the proof of the consistency of the hierarchical estimator.
PROOF OF THEOREM 2. The proof begins by showing that, at any given possible scenario, the algorithm takes a correct choice almost surely.Then, an inductive argument will guarantee that the algorithm is consistent.For every integer interval I = r : s, the possible scenarios are: The correct decision for the algorithm in case (a) is to halt, not performing more recursive calls for that interval.For case (b), the correct decision is to choose any of the change points available and perform recursive calls on the sub intervals.We show that the algorithm will take the correct decision almost surely for both cases.For case (a), suppose I has no change points inside and let u ∈ r : (s − 1).By (H2) in Assumption 2 we have that eventually almost surely as n → ∞, because ρ(r, s) − ρ(r, u) − ρ(u + 1, s) < 0. Hence, no splitting will be done eventually almost surely, and the algorithm will not perform more recursive calls inside I.For case (b) we have to prove that the algorithm will almost surely choose true change points to split the interval.The inequality (H1) on Assumption 2 implies that, for any u that is not a change point, there exists a change point c * in I \ {s} such that and therefore ĉ ∈ C * , eventually almost surely as n → ∞.
We finish the proof by using mathematical induction on the number of variables m.If m = 1, then there are no change points on the model, and the algorithm will not even have comparisons to make.Hence, the estimated change point set will be empty by construction and therefore equal to the true set of change points.Assume now that the algorithm is consistent for every vector of dimension m <= m − 1.We will prove that it will be consistent for vectors of dimension m.The first run of the algorithm is on the interval 1 : m.By case (a), if there are no change points on this interval, the algorithm will almost surely not split the interval and will halt.Hence, the estimated set will almost surely be equal to the true set of change points.On the other hand, if there are change points on the interval 1 : m, the proof in case (b) shows us again that the algorithm eventually almost surely takes the correct choice and splits the interval at a change point c ∈ C * .After the split, recursive calls are made on 1 : c and on (c + 1) : m.But the length of these vectors is at most m − 1, and by the induction hypothesis, the algorithm will eventually almost surely retrieve all the change points in Before presenting the proof of Lemma 2 we state and prove the following basic lemma.
LEMMA 3. Let f, g : R → R be twice differenciable convex functions.If there exists a constant α such that f (x) + g(x) = α , then f and g must be linear functions.
PROOF.Differentiating both sides twice we obtain that f (x) + g (x) = 0 .
Since both f (x) ≥ 0 and g (x) ≥ 0, then f (x) = 0 = g (x) and the result follows.We now check each one of the condition of Lemma 2, beginning with (b).To prove that h * I is concave on the interval [c * j−1 , c * j ], it is sufficient to show that (u − r + 1)ψ(θ r:u ) and (s − u)ψ(θ (u+1):s ) are convex on this interval.Take g(u) = (u − r + 1)ψ(θ r:u ), and treat the vectors as column vectors.Then the first derivative of g is T ∇ψ(θ r:u ) , where ∇ψ(θ r:u ) is the gradient of ψ(θ) evaluated at θ r:u and the second equality follows from the fact that (u − s + 1)t (u) = −t(u).The second derivative of g is then because Hψ(θ r:u ), the Hessian matrix of ψ evaluated at θ r:u , is positive definite and −t (u)t(u) ≥ 0. We conclude that g is convex on We finish the proof by showing condition (a) in Lemma 1, that is that the minimum is attained at the interior of I. Suppose that this is not the case, that is that the minimum is attained at h * I (s).Since h * I (s) is also a maximum, because by definition we have that the function in the whole interval I must be constant.Following the same arguments as in the proof of condition (b) we conclude that θ * i = θ * j for all i, j ∈ {1, . . ., k * }, which is a contradiction with the hypothesis that θ * j = θ * j+1 for all j = 1, . . ., k * − 1.
We will now verify that Assumptions 1 and 2 hold for the families of distributions in Examples 1 and 2. These results were stated in Propositions 2 and 3, respectively.PROOF OF PROPOSITION 2. Denote by C * = {c * 0 , . . ., c * k C * } and p * = (p * 1 , . . ., p * k C * ) the true change point set and parameters, respectively.We will denote by C = {c 0 , . . ., c kC } and p = (p 1 , . . ., p kC ) any arbitrary change point set and we will denote by I j , respectively I * j , the interval between two change points in C, respectively in C * , that is I j = (c j−1 + 1) : c j and I * j = (c * j−1 + 1) : c * j .We begin by verifying hypotheses (PL1) and (PL2) in Assumption 1.In the case of categorical random variables, the log-likelihood function (2.4) for the sample x n = {x (i) } n i=1 , evaluated at the maximum likelihood estimator for θ can be written as To check (PL1) in Assumption 1 let any set C ∈ C. By the Law of Large Numbers we have that .
As the difference N Ij (a) − n|I j |p j (a) can be written as a sum of zero-mean independent random variables with finite variance, we have, by the Law of the Iterated Logarithm, see for example Breiman (1969, Theorem 3.52), that for a given constant c and for all a ∈ A, eventually almost surely as n → ∞.Hence for all δ > 0 we have that eventually almost surely as n → ∞.Finally, we obtain that with p * min = min{p * j (a) : p * j (a) > 0, j = 1, . . ., k C * , a ∈ A}.On the other hand, it is easy to see that l(C; x n ) − l(C * ; x n ) ≥ 0 for any C ⊇ C * .Then, hypothesis (PL2) holds for v(n) = δ log n, for any δ > 0. This establishes the conditions on Assumption 1.Now we verify the hypotheses (H1)-(H2) in Assumption 2. Observe that as in the case of (PL1), the Law of Large Numbers can be invoked to prove that whenever I \ {s} ∩ C * = ∅, concluding the proof of (H1).The proof of (H2) follows by (PL2) in Assumption 1. and Then, for any C ∈ C, by some manipulation we obtain that the almost sure limit of the maximum log-likelihood function is for some > 0. This establishes condition (PL2).
Simulations for Gaussian random variables.We provide a similar simulation study as the one presented in the main text the Gaussian family of distributions.
In these plots again we observe over-segmentation for the hierarchical algorithm for small sample sizes, and more common than in the Bernoulli case.The dynamic programming algorithm seems again to be more robust with respect to the choice of the penalization constant λ.
In Figure 1 is evident that the hierarchical algorithm is more volatile, but as sample size grows, has an equivalent, and maybe even better, performance than the dynamic programming exact solution.

LEMMA 1 .
Given an integer interval I assume that(a) There exists a point c ∈ I such that h * I (c) < h * I (s).(b) The function h * I restricted to the intervals [c

LEMMA 2 .
Let Θ be a convex subset of R d for some d ∈ N.For any interval I ∈ I define the parameter θ I = k * k=1 |I∩I * k | |I| θ * r .Assume the function h * I defined by (4.2), for I = r : s, is of the form h * I (u) = (u − r + 1)ψ(θ r:u ) + (s − u)ψ(θ (u+1):s ) + α, with ψ a strictly convex function having second order derivatives in the interior of Θ and α a constant not depending on u.Then h * I satisfies the conditions (a)-(c) in Lemma 1.
where 1 C(x n,b ) (c) denotes the indicator function that c ∈ C(x n,b ).In practice, however, it might be difficult for the researcher to pinpoint the exact location of the change point beforehand.His practical experience most likely indicates that a change occurs within an region.Then, we can address the question of how significant it is for the interval I to contain a change point by computing p(I) = P (C * ,θ * ) I ∩ C = ∅ .This quantity can be estimated by

FIG 1 .
FIG 1.Comparison of estimated change point sets between algorithms and against ground truth for the Bernoulli family using the Jaccard Index (top) and the number of change-points (bottom).The true number of change points k * is indicated on top of each figure and marked on the purple dotted horizontal line.

FIG 2 .
FIG 2. Percentage of detection by the hierarchical segmentation algorithm of each index as a change point.A random data set from the scenario k * = 10, n = 250 and Bernoulli family was selected, and 200 bootstrap samples were obtained.Black lines show how frequently the particular indexes were estimated as change points.Vertical dashed red lines indicate the position of the true change points on the simulated model.

FIG 3 .
FIG 3. The curves on the left figure shows the bootstrap averaged percentage of detection of true change-points as sample size grows.The average is taken over data sets with the same size.The curve on the right shows the absolute difference between the parameters of the blocks they divide.The difference is plotted with respect to the change-point index (cp index) and not location on the random vector.

FIG 4 .
FIG 4. ROH regions (green and blue segments) as defined by our method (top) and by PLINk (bottom) for African (left) and European (right) populations.Each row corresponds to an individual and each column to a SNP marker.The blue segments on the top graphics correspond to individuals not having ROH detected by PLINK.

FIG 5 .
FIG 5. Comparison between the blocks detected by the change-point method and the methodology proposed in McQuillan et al. (2008) and Kirin et al. (2010), using PLINK.The black lines indicate the frequency of occurrence of each SNP in an ROH as detected by PLINK (left scale).The purple lines indicate the blocks detected by our methodology.The height of each block corresponds to the estimated parameter on each block (right scale).

FIG 6 .
FIG 6. Bootstrap confidence of detection of the estimated change point set with 500 replications.Left scale shows the bootstrap percentage of detection as a change point per index.The purple lines indicate the blocks detected by our methodology.The height of each block corresponds to the estimated parameter on each block (right scale).
(a) There are no change points in I \ {s}; (b) There are change points in I \ {s}.
1 : c and (c + 1) : m exactly.By a union of those change point sets, we have that the final estimated change point set is equal to C * eventually almost surely as n → ∞.PROOF OF LEMMA 1.Let h * I (u) be the minimum value of h * I on I. First, notice that by (a), h * I (u) ≤ h * I (c) < h * I (s), so the minimum is strictly smaller than the value of the function at the end of the interval.We now show that u must be a change point in C * .Suppose that u ∈ C * , and let I * j = [c * j−1 , c * j ] be the unique interval that contains u.Since h* I is concave in [c * j−1 , c * j+1 ],if the minimum is attained at an interior point, then h * I must be constant on I * j .However, by (c), this would imply that h * I (u) = h * I (s), which is a contradiction.Then u ∈ C * and min PROOF OF LEMMA 2. Observe that it is enough to consider h * I as given by h* I (u) = −(u − r + 1)ψ(θ r:u ) − (s − u)ψ(θ (u+1):s ) , u ∈ I ,as the constant C does not depend on u.For any u ∈[c * j−1 , c * j ] define t(u) = c * j−1 −r+1 u−r+1 .Then we can write θ r:u = t(u)θ r:c * j−1 + (1 − t(u))θ * j .

NN
a) log pIj (a) =: l * (C) C ⊇ C * we have that l * (C) − l * (C * ) = p * r for all r and j with |I j ∩ I * r | = ∅.But this can only happen if C ⊇ C * , which is a contradiction.To verify hypothesis (PL2) observe that for any C ⊇ C *l(C; x n ) − l(C * ; x n ) = Ij (a) log N Ij (a) n|I j | − k C * j=1 a∈A N I * j (a) log p * j (a) ,where the last inequality follows because N I * j (a)/n|I * j | are the maximum likelihood estimators for p * j (a).As C ⊇ C * , we have that any interval I * r contains one or more intervals I j , and Ij (a) logN Ij (a)/n|I j | p * ij (a)where i j is the corresponding index in C * .Therefore we obtain thatl(C; x n ) − l(C * ; x n ) ≤ kC j=1 n|I j | D( p j ; p * ij ) ,wherep j (a) = N Ij (a) n|I j | , a ∈ Aand D denotes the Kullback-Leibler divergence between the probability distributions { p j (a)} a∈A and {p * ij (a)} a∈A .By a well-known inequality, see for example Csiszar and Talata (2006, Lemma 6.3) we have that x n ) → l * (I) : = |I| a∈A pI (a) log pI (a) for any integer interval I ∈ I. Now, for I = r : s consider the function h * I : I → R defined as h * (u) = −l * (r : u) − l * ((u + 1) : s) , u ∈ I .As l * (I) satisfies the hypotheses of Lemma 2 with ψ(p) = a∈A p(a) log p(a) we have that min u∈I\{s}∩C * h * (u) < min u ∈I\{s}∩C * h * (u) l * (C) .We begin by verifying conditions (PL1) and (PL2) on Assumption 1. Observe that for any intervalI ∈ I we have that C) ≤ l * (C * ) .Moreover, the inequality is strict unless C ⊇ C * , and then (PL1) follows.To prove (PL2) in Assumption 1, if we define the empirical variance of an interval I asσ2 I = 1 n|I j | c ∈ Ij n i=1 (X (i) c − μj ) 2then the log-likelihood for any block over a finite sample can be written asl(C; x n ) point set C, define I r (C) = {j ∈ (1 : k C )|I j ⊆ I * r }, the set of all indices whose blocks defined by C are contained in I * r .If C does not segment I * r , then there is only one index in I r and the set that it indexes is exactly I * r .The log-likelihood difference to the true change point set is thenl(C; x n ) − l(C * ; x n ) = nBy the law of the iterated logarithm, there exists j * > 1 such that lim sup n→∞ S j * < j * .By lim sup properties, * r |µ * ,2 r − |I| μ2 I .

FIG 1 .
FIG 1.Comparison of estimated change point sets between algorithms and against ground truth for the Gaussian case using the Jaccard Index.The number of change points k * is indicated in the title.