Abstract
Microarray technologies allow simultaneous measurement of DNA copy number at thousands of positions in a genome. Gains and losses of DNA sequences reveal themselves through characteristic patterns of hybridization intensity. To identify change points along the chromosomes, we develop a marker clustering method consisting of 2 parts. First, a “circular clustering tree test statistic” attaches to each marker a statistic that measures the likelihood that it is a change point. Second, outlier detection approaches are applied to the resulting marker statistics. The method provides a new way to build a binary tree that can accurately capture change-point signals and is easy to perform. A simulation study shows good performance in change-point detection, and cancer cell line data are used to illustrate performance when regions of true copy number changes are known.
1 INTRODUCTION
DNA copy number alterations in tumor cells are key genetic events in the development and progression of human cancers (Tirkkonen and others, 1998; Lengauer and others, 1998; Forozan and others, 2000). These alterations are typically a result of genomic events producing gains and losses of chromosomes or chromosomal regions. Losses and gains of DNA can contribute to alterations in the expression of tumor suppressor genes and oncogenes, respectively. Therefore, the identification of DNA copy number alterations in tumor genomes may help to discover critical genes associated with cancers and, eventually, to improve therapeutic approaches. A variety of techniques have been developed to screen genomes for all possible regions with DNA copy number alterations. For example, comparative genomic hybridization (CGH) (Kallioniemi and others, 1992, 1994), high-resolution array-based CGH (Pinkel and others, 1998; Snijders and others, 2001), representational difference analysis (Lisitsyn and others, 1993), and single nucleotide polymorphism (SNP) arrays (Zhao and others, 2004) can all be used to assess copy number.
The goal in analyzing DNA copy number data is to partition the whole genome into regions with the same underlying copy number and, subsequently, to quantify the copy number in each region. Therefore, locating copy number changes is an important step in the analysis of DNA copy number data. Various statistical methods have been proposed to address this problem. The software CGH-Plotter uses moving median and mean filtering, k-means clustering, and a dynamic programming algorithm (Autio and others, 2003). Jong and others (2003, 2004) proposed a genetic local search algorithm to maximize a likelihood with a penalty term containing the number of change points. Other proposals for change-point detection methods include a dynamic programming algorithm with a penalized likelihood (Picard and others, 2005) and a weighted likelihood function using adaptive weight smoothing (Hupe and others, 2004). Olshen and others (2004) proposed a modified binary segmentation procedure, called circular binary segmentation (CBS), to look for 2 breakpoints at a time in the data arranged in a circle. Wang and others (2005) proposed a simple but effective method based on hierarchical clustering along the chromosomes (CLAC) to select regions of interest while simultaneously controlling the false discovery rate (FDR). Fridlyand and others (2004) used hidden Markov models (HMMs) to estimate the underlying copy numbers (the hidden states). In Hsu and others (2005), a smoothing algorithm was given that denoises the data using a wavelets approach.
A recent paper (Lai and others, 2005) described an extensive study comparing several of these methods and pointed out that some segmentation methods, especially the CGH segmentation method (Picard and others, 2005) and CBS (Olshen and others, 2004), appeared to perform consistently well whether the noise level was low or high. Willenbrock and Fridlyand (2005) also did a comparison study of HMMs, CBS, and the Hupe method and concluded that CBS has the best operational characteristics in terms of its sensitivity and FDR. In Picard's method, however, the CGH profile is assumed to follow a Gaussian signal, an assumption that may be violated, while CBS is not designed to detect single outliers. For processing a large number of high-resolution arrays, the speed of many algorithms becomes an issue. Both HMMs and CBS can be slow and difficult to apply to very high–density oligonucleotide arrays.
In this report, we seek approaches to segment the chromosomes into regions of equal copy number that can include either a single marker or multiple contiguous markers. We propose a method, which we call the “circular clustering tree test statistic” (CCTTS), and following the construction of the statistics, we apply outlier detection approaches to identify change points along the chromosomes. Our method is novel in that it provides a new way to build up a binary tree that can accurately capture change-point signals and is easy to perform.
The rest of the paper is organized as follows. The CCTTS algorithm and properties are given in Section 2. In Section 3, we provide several change-point detection approaches based on the scores produced by the CCTTS algorithm. In Section 4, the proposed methods are evaluated by simulation studies. The method is applied to data from cell lines with known copy number changes in Section 5. We summarize our conclusions and discuss open questions in Section 6.
2 METHODS
It is clear that estimating the locations of regions with aberrant DNA copy numbers is closely connected to the problem of finding the locations of change points (Olshen and others, 2004). This makes change-point methods a natural framework to approach the analysis of array DNA copy number data. Let Y_1, …, Y_N be a sequence of random variables that correspond to markers positioned sequentially along the genome. In general, an index i is called a change point if Y_1, …, Y_{i−1} have a common distribution and Y_i, …, Y_N have a different common distribution. Intuitively, a point can be identified as a change point when the difference in signal intensity is large between 2 adjacent segments demarcated by the point. The correct choice for the sizes of the 2 adjacent segments is an important issue when the signal difference is measured. Segments that are either too small or too large will reduce the sensitivity of change-point detection methods. Our proposed method—the CCTTS—automatically gives 2 “optimal” adjacent segments (or clusters) with which to measure the difference at any point along a chromosome. The idea is to build a hierarchical cluster-style tree from the bottom upward such that the gain/normal/loss regions are separated into different branches. In this procedure, the tree building and the evaluation of statistics at any point are carried out simultaneously. Furthermore, the statistics associated with the tree fully reflect the information involved in the tree structure, including the magnitude of the difference between any 2 adjacent clusters and the order of the merging procedure. These statistics can then be used for detecting the most likely change points along a chromosome.
2.1 The CCTTS algorithm
The CCTTS algorithm uses a bottom-up strategy that generates a binary tree to represent the similarities in the observed copy number intensities along a chromosome. The algorithm begins with the data arranged in a circle, obtained by connecting the 2 ends of the chromosome together, with every observation representing a singleton cluster. At each of the steps, the closest 2 adjacent clusters are merged into a single cluster, where closest is defined by the smallest difference in copy number intensities, and the differences at the new cluster boundaries are updated. Hence, after each step there is one less cluster.
Let Y_1, …, Y_N be a sequence of random variables for the relative copy number intensities at N marker positions. We define a signed distance to measure the difference between any 2 adjacent clusters C_1 and C_2 (with C_2 immediately to the right of C_1) as

D(C_1, C_2) = [ |C_1||C_2| / (|C_1| + |C_2|) ]^{1/2} (Ȳ_{C_1} − Ȳ_{C_2}),   (2.1)

where |C| is the number of markers in cluster C and Ȳ_C is the mean intensity of cluster C.

1) Arrange the data in a circle so that the first point is connected to the last; that is, index N + 1 is identified with index 1. Define N clusters each consisting of one point. Let I = {1, …, N} be the set of indices corresponding to each of the N clusters, and define a set of statistics S_1, …, S_N that will be used to decide if point i is a change point. To start, use

S_i = D({i}, {i + 1}) = (Y_i − Y_{i+1}) / √2,  i = 1, …, N,

which is the distance between each pair of adjacent observations.
2) Let l = argmin_{i ∈ I} |S_i|. Then merge the 2 adjacent clusters {l} and {l + 1} into a new cluster C = {l, l + 1}, remove index l from I, and update S_{l−1} and S_{l+1} in the following way: set S_{l−1} = D({l − 1}, C) if |D({l − 1}, C)| > |S_{l−1}|, and set S_{l+1} = D(C, {l + 2}) if |D(C, {l + 2})| > |S_{l+1}|; otherwise the statistics are left unchanged.
3) In general, merge the 2 adjacent clusters joined by the boundary with the smallest |S_q|, where q = argmin_{i ∈ I} |S_i|. Then remove index q from I. If the data still contain more than one cluster, update the boundary statistics in the following way.
Let LC and RC be the left and right adjacent clusters to the cluster C we just formed, respectively (see Figure 1), and let S_j and S_k be the statistics at the boundaries between LC and C and between C and RC. Then

S_j = D(LC, C) if |D(LC, C)| > |S_j|, and S_j is unchanged otherwise;   (2.2)
S_k = D(C, RC) if |D(C, RC)| > |S_k|, and S_k is unchanged otherwise.   (2.3)
4) Repeat Step (3) until one big cluster is obtained. After merging the last 2 clusters, updating S is not relevant and is not performed. At that point there will be N statistics S_1, …, S_N.
The horizontal axis describes the positions of several markers along a chromosome, and the vertical axis is the signal. Horizontal lines stand for the mean of each cluster. The 2 solid line clusters are merged to form cluster C, and then the statistics S_j and S_k are updated.
In Step (2) or (3), if there are ties for the minimum absolute value |S_q|, then all tied indices are simultaneously removed from I, all corresponding pairs of adjacent clusters are merged, and all statistics on the boundaries of these newly formed clusters are updated as in Step (2) or (3). Note that 2 boundary points may correspond to the merging of more than 2 adjacent clusters.
The simple example in Table 1 is used to illustrate the CCTTS algorithm. The first row shows a data set with 5 points. The algorithm starts by defining 5 clusters: {1}, {2}, {3}, {4}, and {5}. After connecting the data in a circle, S_1, …, S_5 are calculated and given in the second row. In the first step, 5 = argmin_{i ∈ I} |S_i| is removed from I, and {5} and {1} are merged together to obtain 4 clusters: {5, 1}, {2}, {3}, and {4}. To update S_4 and S_1 corresponding to the boundaries of the newly formed cluster, D({4}, {5, 1}) and D({5, 1}, {2}) (given in the row labeled D(·, ·)) are compared to S_4 and S_1. Both S_4 and S_1 are updated to new values that are larger in absolute value, as seen in the following row. The remaining values S_2 and S_3 are unchanged. Then, this procedure of finding argmin_{i ∈ I} |S_i|, updating I, merging 2 adjacent clusters together, calculating D(·, ·), and updating S_i is repeated until one cluster is formed. The final merging order matches the magnitude of the final statistics (in absolute value).
Table 1. Example of CCTTS. Numbers in parentheses are the smallest absolute Si
| Yi | 0.8 | 1.6 | 1.3 | 0.2 | 0.9 | I | argmini∈ I|Si| |
| Si | − 0.566 | 0.212 | 0.778 | − 0.495 | (0.071) | {1, 2, 3, 4, 5} | 5 |
| D(·, ·) | − 0.612 | − 0.531 | |||||
| Update Si | − 0.612 | (0.212) | 0.778 | − 0.531 | {1, 2, 3, 4} | 2 | |
| D(·, ·) | − 0.6 | 1.021 | |||||
| Update Si | − 0.612 | 1.021 | (− 0.531) | {1, 3, 4} | 4 | ||
| D(·, ·) | − 0.895 | 0.895 | |||||
| Update Si | (− 0.895) | 1.021 | {1, 3} | 1 | |||
| Final Si | − 0.895 | 0.212 | 1.021 | − 0.531 | 0.071 |
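As a concrete illustration, the clustering loop can be sketched in code. The signed distance used below, D(C_1, C_2) = sqrt(|C_1||C_2|/(|C_1|+|C_2|)) (mean(C_1) − mean(C_2)), is an assumed form chosen because it reproduces the numbers in Table 1; tie handling is omitted for brevity.

```python
import math

def dist(c1, c2):
    # Assumed signed distance between adjacent clusters (left, right):
    # sqrt(|C1||C2|/(|C1|+|C2|)) * (mean(C1) - mean(C2)).
    n1, n2 = len(c1), len(c2)
    return math.sqrt(n1 * n2 / (n1 + n2)) * (sum(c1) / n1 - sum(c2) / n2)

class _Cluster:
    def __init__(self, label, values):
        self.label = label      # marker index of this cluster's right boundary
        self.values = values    # observations currently in the cluster
        self.s = 0.0            # statistic at the boundary to the next cluster
        self.prev = self.next = None

def cctts(y):
    """Return the N boundary statistics S_1..S_N (0-indexed) for data y."""
    n = len(y)
    nodes = [_Cluster(i, [v]) for i, v in enumerate(y)]
    for i, nd in enumerate(nodes):               # circular arrangement
        nd.next = nodes[(i + 1) % n]
        nd.prev = nodes[(i - 1) % n]
    for nd in nodes:
        nd.s = dist(nd.values, nd.next.values)   # initial pairwise distances
    final, live = [None] * n, set(nodes)
    while len(live) > 1:
        nd = min(live, key=lambda t: abs(t.s))   # boundary with smallest |S|
        final[nd.label] = nd.s                   # S is finalized once merged
        if len(live) == 2:                       # last merge: both finalize
            final[nd.next.label] = nd.next.s
            break
        nxt = nd.next                            # absorb nd into its right
        nxt.values = nd.values + nxt.values      # neighbor
        nd.prev.next, nxt.prev = nxt, nd.prev
        live.remove(nd)
        d_left = dist(nxt.prev.values, nxt.values)
        if abs(d_left) > abs(nxt.prev.s):        # keep the larger statistic
            nxt.prev.s = d_left
        d_right = dist(nxt.values, nxt.next.values)
        if abs(d_right) > abs(nxt.s):
            nxt.s = d_right
    return final

# Table 1 data: the final statistics match the last row of the table.
print([round(s, 3) for s in cctts([0.8, 1.6, 1.3, 0.2, 0.9])])
# → [-0.895, 0.212, 1.021, -0.531, 0.071]
```

The linked-list bookkeeping keeps each boundary statistic attached to its original marker index, so a statistic is frozen at the moment its boundary is merged, mirroring the removal of indices from I.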
In addition to treating the data as a circle, there is a substantial difference between the CCTTS algorithm and the conventional clustering approach (Wang and others, 2005) with respect to how S_j and S_k are updated in Step (2) or (3), for a given difference measure. In the conventional approach, S_j and S_k are simply updated by D(LC, C) and D(C, RC), respectively, which ignores the previous statistic at a marker when updating it after a merge. This leads to a suboptimal choice of statistics. To clarify, suppose marker j is a true change point and |S_j| > |D(LC, C)|; then our algorithm keeps the larger statistic S_j at marker j, whereas the standard algorithm would assign the smaller statistic D(LC, C). This also means that, with the usual tree-building procedure, there can be inconsistency between the order in which indices are removed from I and the magnitude of the resulting statistics: |S_j| and |S_k| may end up smaller than |S_q|, even though position q is merged before positions j and k. The merging order indicates that q is less likely to be a change point than j and k. Our method of updating difference measurements in the CCTTS algorithm produces a sequence of statistics whose magnitudes are consistent with their position in the merging order.
The CCTTS algorithm provides a way to calculate the statistics S_i and also leads to 2 “optimally sized” adjacent clusters used to achieve each statistic value (see Figure 2). Updating procedures (2.2) and (2.3) keep track of 2 optimally sized adjacent clusters at each step and detect the signals using both the mean differences and the sizes of the clusters. This makes it possible to identify two interesting cases: (i) small regions with extremely large or small values and (ii) regions of consistent gain/loss that do not deviate much from zero. The magnitude of S_i naturally reflects the likelihood that marker i is a change point in the whole data set. Let |S_(1)| ≥ |S_(2)| ≥ ⋯ ≥ |S_(N)| denote the ordered absolute values. Then the marker corresponding to S_(1) is most likely to be a change point, the marker corresponding to S_(2) is the second most likely change point, and so on. Several additional properties of our statistics are given in detail in Section 2.2.
(a) Simulated data along a chromosome, with (b) corresponding statistics computed by the CCTTS algorithm. Six true change points are marked in (a) with different shapes. Horizontal solid and dashed lines indicate the corresponding adjacent right and left clusters used when defining S_i. It can be clearly seen that change points correspond to outliers for the S_i's.
2.2 Properties
Under the null hypothesis of no change points, the final statistics S_1, …, S_N have the following properties:
a) If Y_1, …, Y_N are iid, then all S_i are identically distributed, but not independent.
b) If Y_1, …, Y_N come from a normal population, then all S_i are approximately identically normally distributed, but not independent.
The clustering tree built by the CCTTS algorithm is based on a circular arrangement of the data. Property (a) is a natural consequence of this circular symmetric arrangement. In Appendix A, we give a simple example of a sequence data set containing only 3 points Y_1, Y_2, and Y_3 and illustrate the functional relationships between the statistic scores and the data set. In general, for data arranged in a sequence Y = (Y_1, …, Y_N), the statistic S_1 is a function of the whole sequence, so we write S_1 = f(Y_1, …, Y_N), where the subscript 1 on S corresponds to the first location, Y_1. On the other hand, for any i, if the data are rearranged as Y^(i) = (Y_i, …, Y_N, Y_1, …, Y_{i−1}), then, according to the previous notation, the CCTTS algorithm gives the statistic corresponding to the first location of Y^(i): f(Y_i, …, Y_{i−1}). Since both data sets Y and Y^(i), when arranged in a circle, are exactly the same, S_i = f(Y_i, …, Y_{i−1}). Therefore, S_1 and S_i are identically distributed because Y_1, …, Y_N are iid. Property (a) plays a significant role when analyzing non-normal data for change-point detection, since the symmetry can be exploited to reduce the number of permutations needed to produce a reference distribution.
Property (b) has been established empirically by examining normal Q–Q plots of simulated data. From property (a), we only need to illustrate that S_1 is approximately normally distributed. Let Y^(1), …, Y^(n) be n data sets generated iid from a normal distribution, and let the statistic S_1 be computed from each data set by the CCTTS algorithm; the resulting values form a sample drawn from the distribution of S_1. Figure 3 shows normal Q–Q plots for these replications over several numbers of markers N and indicates that S_1 is approximately normally distributed. In our simulations, we simultaneously calculated sample correlation coefficients between S_1 and S_i which, because of the symmetric circular arrangement and property (a), also estimate the correlation coefficients for any 2 statistics separated by the same circular distance. Adjacent statistics S_i and S_{i+1} were negatively correlated for all N considered, while correlations between non-adjacent statistics were small, so that those statistics could be treated as approximately independent. Due to the negative correlations between adjacent statistics, a histogram of all statistics jointly has larger tails than would be expected from the marginal approximate normality of any one statistic S_i. Property (b) makes it possible to analyze normal data for change-point detection easily and efficiently.
Normal Q–Q plots of S_1 from simulated data sets. The dash-dotted line is a robust linear fit of the sample order statistics which passes through the first and third quartiles. The solid line has slope 1 and represents the standard normal distribution. Simulations are applied for different data sizes N. The plots demonstrate approximate normality of the statistics, but with larger variances than standard normal.
3 CHANGE-POINT DETECTION
We assume that the observed Y_i is a realization of the true relative copy number μ_i at marker i plus random noise,

Y_i = μ_i + ε_i,  i = 1, …, N.
We discuss and introduce several different approaches for outlier detection, separately for normal and non-normal data.
3.1 Normally distributed data
We assume that the data are normally distributed with a common variance. The scores S_1, …, S_N are computed according to the CCTTS. Under the null hypothesis of no change points, the S_i are each approximately normally distributed (see property (b)). Therefore, some approaches used to identify outliers in normal data are appropriate here, even though adjacent statistics S_i and S_{i+1} are negatively correlated. To perform this procedure, we apply the simple d standard deviation method, where the value of d can be user defined.
Approach 1: outlier detection method (ODM).
The scores are sorted in decreasing absolute values: |S_(1)| ≥ |S_(2)| ≥ ⋯ ≥ |S_(N)|. The method starts with the assumption of no outliers among the scores, then examines the largest scores and determines the first most likely outliers. If outliers are detected, they are removed from the set of scores. The procedure is repeated for the remaining scores until no more outliers can be found. After completing this procedure, if k = 0, there are no change points; otherwise, we find k change points corresponding to S_(1), …, S_(k).
1) Set k = 0. Calculate the mean m and standard deviation s of all the scores.
2) If |S_(1) − m| ≤ ds, stop. Otherwise, identify the largest l such that |S_(l) − m| > ds. Then remove S_(1), …, S_(l) from the set of scores, recalculate the mean m and standard deviation s for the remaining scores, and then update k = l.
3) In general, if |S_(k+1) − m| ≤ ds, stop. Otherwise, identify the largest l such that |S_(k+l) − m| > ds.
4) Remove S_(k+1), …, S_(k+l) from the set of scores, then recalculate the mean m and standard deviation s for the remaining scores.
5) Set k = k + l, and go to Step (3).
The value of d can be user defined based on some domain or data knowledge, or theoretically using Chebyshev's theorem (Barnett and Lewis, 1994). In practice, d is usually chosen between 3 and 6 (Maletic and Marcus, 2000). From our experience, values between and 4 were found to generate the best results with fewer false positives and false negatives in general.
Under the null hypothesis of no change points, the negative correlations between adjacent statistics S_i and S_{i+1}, together with the approximate independence between non-adjacent statistics, mean that a sample variance computed from S_1, …, S_N overestimates the variance of the marginal distribution of any one statistic S_i. Therefore, outlier detection in the outlier detection method (ODM) is conservative. Outlier detection with one or several (outlier) points removed is also expected to be conservative. However, this conservativeness is not a concern in most cases due to the following 2 facts: (i) a sample variance from S_1, …, S_N is an asymptotically unbiased estimator of the marginal variance as the number of markers N grows (see Appendix B) and (ii) a user-defined value of d can be adjusted for outlier detection. Therefore, the negative correlations do not have much effect on ODM performance for large data sets. When the number of markers is small, we may choose a smaller value of d to adjust for the conservativeness.
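The iterative screen above can be sketched as follows. The flagging rule |S − m| > d·s and the use of the population standard deviation are assumptions, since the text leaves the exact criterion to the user; the function names are illustrative.

```python
import statistics

def odm(scores, d=3.5):
    """Iterative d-standard-deviation outlier screen on CCTTS scores (sketch).
    Returns the original indices flagged as change points."""
    order = sorted(range(len(scores)), key=lambda i: -abs(scores[i]))
    remaining = [scores[i] for i in order]       # sorted by decreasing |S|
    n_flagged = 0
    while len(remaining) > 1:
        m = statistics.mean(remaining)
        s = statistics.pstdev(remaining)
        k = 0                                    # flagged scores form a prefix
        while k < len(remaining) and abs(remaining[k] - m) > d * s:
            k += 1
        if k == 0:                               # no new outliers: stop
            break
        n_flagged += k
        remaining = remaining[k:]                # re-estimate m, s without them
    return sorted(order[:n_flagged])

# 49 background scores around 0 plus one clear outlier at index 49.
scores = [0.1 if i % 2 == 0 else -0.1 for i in range(49)] + [10.0]
print(odm(scores, d=3.0))
# → [49]
```

Note that in this toy call a slightly smaller d is used: with only 50 scores, the single outlier itself inflates the pooled standard deviation, which is exactly the conservativeness discussed above.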
3.2 Non-normal data
In the outlier detection method, we assumed that the data follow a normal distribution. For non-normal data, we introduce 2 methods to identify change points based on permutation-derived reference distributions. The sequential detection method (SDM) is the most straightforward to apply.
Approach 2: Sequential detection method (SDM).
Assume the null hypothesis is true, that is, there are no change points. Using property (a), each statistic value computed from position-permuted data has the same distribution and, therefore, can be used as a contribution to the permutation distribution. For example, by doing 100 permutations on a data set of N markers, 100N null test statistics are obtained. This drastically reduces the number of permutations needed.
1) For the original data, we calculate statistic scores according to the CCTTS and sort them in decreasing absolute values: |S_(1)| ≥ |S_(2)| ≥ ⋯ ≥ |S_(N)|.
2) Randomly permute the order of the markers M times. For each permuted data set P, collect all absolute statistic values |S_i^P| and construct a reference distribution. Outliers are chosen if |S_(i)| is greater than the upper α quantile q_α of the permutation distribution. Let k = #{i : |S_(i)| > q_α}.
3) If k = 0, there are no change points. Otherwise, the data can be partitioned into k or k + 1 segments separated by the k change points. The exact number of segments (k or k + 1) depends on whether a change point falls at the artificial join between the 2 chromosome ends.
4) Steps (1) to (3) are recursively performed on each small segment until no more change points are identified. Single-point regions of gain or loss may be identified by the occurrence of 2 adjacent change points. These may occur either because there was a very small region of copy number change on the chromosome that is measured by only one marker or because there was an error in the data at one marker. Suppose a chromosome with N markers has only one single-point region of gain/loss. Any permuted data set from this chromosome also has one single-point region of gain/loss. That means 2 out of the N statistic values in each permutation will be relatively large, and therefore the magnitude of the quantile q_α depends critically on N. It will be difficult to detect change points for small values of N. Therefore, since the SDM does not work well in this situation, we propose a modification to improve detection of single-point regions. Let s be the sample standard deviation of the original data. Suppose that during Step (3) on a segment with mean μ̄ no change points were detected (k = 0). We propose to examine the indices of the 2 largest statistics, S_(1) and S_(2). If they are adjacent in the circle, then the marker they bracket may be a single-point region. If the observed value at that marker is close to μ̄ relative to s, then we conclude that there are no change points on the segment; otherwise, we conclude that the 2 corresponding positions are 2 change points.
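The pooled permutation null in Step (2) can be sketched generically. Here stat_fn stands in for the CCTTS scores (a simple circular neighbor difference is used as a placeholder), and M and alpha are illustrative parameters, not values prescribed by the text.

```python
import random

def circular_diffs(y):
    # Placeholder per-marker statistic: difference to the circular neighbor.
    n = len(y)
    return [y[i] - y[(i + 1) % n] for i in range(n)]

def permutation_threshold(y, stat_fn=circular_diffs, M=100, alpha=0.001,
                          rng=None):
    """Upper-alpha quantile of the pooled permutation null (sketch).
    By symmetry (property (a)), all N statistics from each of the M
    permuted data sets contribute, giving M*N null draws."""
    rng = rng or random.Random(0)
    y = list(y)
    null = []
    for _ in range(M):
        rng.shuffle(y)
        null.extend(abs(s) for s in stat_fn(y))
    null.sort()
    k = min(len(null) - 1, int((1 - alpha) * len(null)))
    return null[k]

data = [0.0] * 20 + [1.0] * 20   # one obvious change in level
q = permutation_threshold(data)
# markers whose observed |S| exceeds q would be declared change points
```

Pooling all M·N null values is what makes the symmetry property useful in practice: a few hundred permutations already yield tens of thousands of null draws for a high-density array.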
Approach 3: backward residual method (BRM).
The SDM works well and is robust to unequal variances. However, when segments become small, the SDM permutation approach can miss some change points. For example, in a very short segment with 2 obvious change points, the number of distinct permutations is so small that even the largest statistic cannot achieve a sufficiently extreme permutation p-value, and no change points can be identified at the level α. Therefore, as an alternative approach, we developed a backward residual method (BRM) that uses the whole data set all the time rather than working with smaller segments. The algorithm starts in the same way as the SDM, but after some of the change points are detected, new permutations are performed on the residual data instead of on the smaller segments. Similar to the SDM, a modification can be made to deal with single-point regions in data sets. Let s be the sample standard deviation of the original data. In Step (4), consider the next 2 most likely change points. Identify whether there are potential single-point regions produced by combining them with previously detected change points or by examining whether they are adjacent. Suppose a potential single-point region lies in segment L. We need to decide if this is a real single-point region. If the observed value at that point is close to the mean μ̄_L of segment L relative to s, we conclude that there are no more change points; otherwise, one or both of the 2 points are identified as new change points.
1) As before, for the original data, we calculate statistic scores according to the CCTTS and sort them in decreasing absolute values: |S_(1)| ≥ |S_(2)| ≥ ⋯ ≥ |S_(N)|.
2) Let the whole data set form one segment and set k = 0.
3) Randomly permute the order of the data M times. For each permuted data set, collect all absolute statistic values to form a reference distribution. The threshold value is chosen to be the upper α quantile of the permutation distribution, say q_α.
4) If |S_(k+1)| ≤ q_α, stop. Otherwise, set k = k + 1 (k detected change points and k or k + 1 segments); denote by μ̄_L the mean of the segment containing a given observation, and update each observation to its mean-adjusted residual, so that Y_i is replaced by Y_i − μ̄_L. Then go to Step (3).
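The residual step in Step (4), replacing each observation by its value minus its current segment mean, can be sketched as follows; segment boundaries are passed as the indices where a new segment begins, a representation assumed for illustration.

```python
def residual_update(y, change_points):
    """Mean-adjust each segment (sketch of the BRM residual step).
    change_points: sorted indices at which a new segment begins."""
    bounds = [0] + sorted(change_points) + [len(y)]
    out = []
    for a, b in zip(bounds, bounds[1:]):
        seg = y[a:b]
        mu = sum(seg) / len(seg)            # current segment mean
        out.extend(v - mu for v in seg)     # mean-adjusted residuals
    return out

print(residual_update([1.0, 1.0, 5.0, 5.0], [2]))
# → [0.0, 0.0, 0.0, 0.0]
```

After this adjustment, already-detected level shifts are flattened out, so the next round of permutations on the residual data can reveal weaker change points that the first pass missed.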
Software written in MATLAB is available from the first author.
4 SIMULATIONS
In this section, we investigate through Monte Carlo simulations the performance of the proposed methods. Data containing regions of gain, normal, and loss copy numbers were generated from the model Y_i = μ_i + ε_i, i = 1, …, N, where N is the sample size, μ_i is the mean, and ε_i is distributed as N(0, σ²). We assume one region of loss and one region of gain per chromosome. Within a loss region, all markers have the same state, with copy number reduced by 1. However, within the gain region we allow 3 possible states, amplified by 1, 2, or 3 copies.
Following Pollack and others (2002), we set the means μ_i for copy numbers deleted by 1 and amplified by 1, 2, and 3 to the corresponding estimated intensity levels, and we use a mean of 0 for normal regions. To mimic typical data, the following quantities were generated randomly: the position of loss/gain regions, the size of loss/gain regions, the number and allocation of the 3 amplified states, and the sizes of the 3 amplified regions within the regions of gain. In order to compare our methods and the CBS method, the simulated data sets do not contain single points of loss/gain. The details of the simulation are given in the following.
1) A data set is defined to contain a normal region, a region containing a loss, and a region containing a gain, in random order.
2) To generate the 3 regions, each of size m = 60, the following parameters are used:
a) Data from normal region: Y_i ∼ N(0, σ²).
b) Data from region of loss:
c) Data from region of gain:
i) Initially generate all m markers in the region from the normal model, Y_i ∼ N(0, σ²).
ii) Generate the number of amplified states n ∈ {1, 2, 3} and then randomly select the magnitudes of the gains by sampling n times without replacement from the 3 amplification levels for the region of gain.
iii) Generate the width w of the gained interval, with minimum width 2 for 1 amplified state, 4 for 2 amplified states, and 6 for 3 amplified states: w ∼ Uniform{2n, 2n + 1, …, m}.
iv) Locate the gained interval [i, i + w − 1] in the region by generating its starting position i from a uniform distribution: i ∼ Uniform{1, 2, …, m − w + 1}.
v) Divide the gained interval into n segments, each of length at least 2, with lengths chosen from a uniform distribution, and then randomly assign the n amplified states (selected in (ii)) to the n segments.
vi) For each marker in the gained interval, generate its value with variance σ² and mean equal to the level of the amplified state assigned to its segment; markers in the region of gain outside the gained interval keep their normal values from (i).
3) The whole simulated data set is produced by randomly concatenating the 3 regions generated in Step (2); examples are given in Figure 4.
Examples of 4 simulated data sets. Each data set is produced by randomly concatenating 3 parts containing normal, loss, and gain regions, respectively. In the gain regions, (a) has 1 state amplified by 3 copies, (b) has 2 states amplified by 1 and 2 copies, and (c) and (d) have all 3 states, amplified by either 1, 2, or 3 copies.
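A simplified generator along these lines is sketched below. The mean levels in MU are hypothetical placeholders (the Pollack-based values used in the paper are not reproduced here), and the gained sub-interval is split into equal-length segments rather than by uniform draws, so this is a sketch of the design, not the exact simulation.

```python
import random

# Hypothetical mean levels; the paper's Pollack-based values differ.
MU = {"normal": 0.0, "loss": -0.5, 1: 0.3, 2: 0.6, 3: 0.9}

def simulate_chromosome(m=60, sigma=0.1, seed=None):
    """Simulate normal/loss/gain regions of size m each, in random order."""
    rng = random.Random(seed)
    normal = [rng.gauss(MU["normal"], sigma) for _ in range(m)]
    loss = [rng.gauss(MU["loss"], sigma) for _ in range(m)]
    gain = [rng.gauss(MU["normal"], sigma) for _ in range(m)]
    n_states = rng.randint(1, 3)                # number of amplified states
    states = rng.sample([1, 2, 3], n_states)    # copies gained per state
    width = rng.randint(2 * n_states, m)        # >= 2 markers per state
    start = rng.randint(0, m - width)
    seg_len = width // n_states                 # equal split (simplified)
    for t, state in enumerate(states):
        lo = start + t * seg_len
        for j in range(lo, lo + seg_len):
            gain[j] = rng.gauss(MU[state], sigma)
    regions = [normal, loss, gain]
    rng.shuffle(regions)                        # random region order
    return regions[0] + regions[1] + regions[2]

print(len(simulate_chromosome(seed=1)))
# → 180
```

With m = 60 this reproduces the 180-marker layout used in the simulation study, and varying sigma mimics the different noise levels compared in Table 2.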
One hundred data sets with 180 markers were simulated for each value of σ ∈ {0.05, 0.1, 0.15, 0.2, 0.25, 0.5, 1}. In order to compare the simulation results across different values of σ, the standard normal errors ε_i were generated once and scaled by σ for each noise level. Examples from the simulation are shown in Figure 4. If the regions of loss and gain are in the middle, a data set with only one amplified state in the gain region leads to 4 change points, whereas a data set with all 3 amplified states leads to 6 change points. The total number of change points in 100 data sets for a given σ is 500. We use d = 3.5 standard deviations to perform the ODM. For both the SDM and the BRM, α = 0.001 is used to define outliers.
The number of change points detected is summarized in Table 2. For comparison, the results of CBS without pruning (Olshen and others, 2004) and of the CLAC (Wang and others, 2005) are included. In order to compare the CPU time, we coded the algorithm for CBS in MATLAB instead of using the existing R package. The CLAC method was included for comparison since it also involves hierarchical clustering of points. CLAC was implemented with the existing R package (available at http://www-stat.stanford.edu/∼wp57/CGH-Miner/) using default parameters. Because reference samples from normal arrays are needed in the CLAC method, we simulated 100 reference data sets using a normal distribution with zero mean and variance σ² set to be the same as in the data to be analyzed.
Table 2. Results from applying ODM, SDM, BRM, CBS, and CLAC to the simulated data. ODM is performed with d = 3.5 standard deviations. α = 0.001 is used to define outliers for both SDM and BRM. CBS is applied without pruning
| Number of | Method | σ = 0.05 | σ = 0.10 | σ = 0.15 | σ = 0.20 | σ = 0.25 | σ = 0.50 | σ = 1 |
| True change points | — | 500 | 500 | 500 | 500 | 500 | 500 | 500 |
| Change points correctly detected | ODM | 495 | 426 | 337 | 268 | 203 | 59 | 12 |
| | SDM | 461 | 416 | 334 | 264 | 207 | 58 | 14 |
| | BRM | 496 | 428 | 337 | 272 | 212 | 61 | 13 |
| | CBS | 453 | 407 | 337 | 264 | 218 | 69 | 17 |
| | CLAC | 13 | 13 | 15 | 26 | 29 | 27 | 26 |
| Change points mislocated (in position by 1, 2, 3) | ODM | 3 (3, 0, 0) | 33 (29, 3, 1) | 82 (63, 13, 6) | 115 (76, 26, 13) | 128 (80, 29, 19) | 78 (36, 25, 17) | 21 (7, 5, 9) |
| | SDM | 3 (3, 0, 0) | 28 (26, 1, 1) | 78 (60, 13, 5) | 112 (76, 23, 13) | 126 (81, 25, 20) | 86 (40, 27, 19) | 26 (8, 8, 10) |
| | BRM | 3 (3, 0, 0) | 34 (30, 3, 1) | 85 (65, 14, 6) | 118 (79, 26, 13) | 137 (86, 30, 21) | 84 (38, 27, 19) | 28 (10, 8, 10) |
| | CBS | 8 (7, 1, 0) | 39 (32, 5, 2) | 73 (59, 7, 7) | 116 (83, 21, 12) | 135 (92, 24, 19) | 87 (44, 23, 20) | 24 (10, 5, 9) |
| | CLAC | 231 (44, 175, 12) | 214 (46, 149, 19) | 209 (69, 115, 25) | 190 (72, 95, 23) | 185 (77, 82, 26) | 128 (57, 51, 20) | 102 (51, 33, 18) |
| Missed change points | ODM | 2 | 41 | 81 | 117 | 169 | 363 | 467 |
| | SDM | 36 | 56 | 88 | 124 | 167 | 356 | 460 |
| | BRM | 1 | 38 | 78 | 110 | 151 | 355 | 459 |
| | CBS | 39 | 54 | 90 | 120 | 147 | 344 | 459 |
| | CLAC | 256 | 273 | 276 | 284 | 286 | 345 | 372 |
| Additional change points detected | ODM | 5 | 9 | 11 | 19 | 30 | 45 | 26 |
| | SDM | 8 | 21 | 28 | 41 | 48 | 71 | 42 |
| | BRM | 10 | 13 | 18 | 28 | 36 | 61 | 42 |
| | CBS | 7 | 13 | 22 | 19 | 29 | 53 | 40 |
| | CLAC | 65 | 83 | 85 | 88 | 89 | 101 | 422 |
The simulation results in Table 2 show that the proposed methods are competitive with the CBS method without pruning. In contrast, most of the change points detected by CLAC are not in the right locations, and many change points are missed. This may be caused by the mean smoothing step in CLAC, which reduces noise in the simulation data at the expense of blurring the edges of the loss/gain regions. When σ is small, our methods appear to perform better than CBS. When σ is large (e.g. σ = 1), SDM finds a few more false positives and ODM a few more false negatives than CBS. As σ increases, the data sets become noisier, and the numbers of both false negatives and false positives from all methods increase. However, once σ reaches very large values, the data sets appear approximately equal to those without change points, and the number of additional change points reflects the false-positive rate in the absence of any change points. This can be seen in the last 2 columns of Table 2, where the number of additional change points detected for σ = 1 is less than for σ = 0.5 (except for CLAC).
To further assess the true-positive and false-positive rates, we applied the CCTTS-based methods in additional simulations with σ gradually increasing toward infinity. We note that, from the viewpoint of change-point detection, analyzing a data set X is equivalent to analyzing cX for any nonzero constant c. Therefore, for any fixed σ₀ > 0, analyzing a simulation data set generated with noise level σ yields the same results as analyzing the rescaled data set (σ₀/σ)X, in which the jump sizes shrink as σ grows. As σ → ∞, the data set becomes one without change points. In Figure 5, we plot change-point detection rates against σ. We define true positives to be both change points correctly detected and those detected but mislocated. The results indicate that the number of change points correctly detected and the true-positive rates decrease as σ increases. However, among the true positives, the number of change points mislocated increases when σ is small and then decreases as σ becomes large. Similarly, for small values of σ, the false positives increase due to the increased noise. When σ reaches a certain value (around 0.5), the number of false positives begins to decrease and eventually approaches a limiting value corresponding to σ = ∞ (data without change points).
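The scale-equivalence argument above is easy to check numerically: multiplying a data set by any nonzero constant multiplies every split statistic by the same factor and so leaves the location of the best split unchanged. The sketch below uses a generic CUSUM-type mean-difference score as a stand-in for the CCTTS (the function `best_split` and all parameter values are illustrative, not from the paper):

```python
import numpy as np

def best_split(x):
    """Return the index k maximizing the mean-difference statistic
    |mean(x[:k]) - mean(x[k:])| * sqrt(k*(n-k)/n), a CUSUM-type score."""
    n = len(x)
    scores = [abs(x[:k].mean() - x[k:].mean()) * np.sqrt(k * (n - k) / n)
              for k in range(1, n)]
    return int(np.argmax(scores)) + 1

rng = np.random.default_rng(0)
# One change point at position 50: mean jumps from 0 to 0.6, noise sd 0.1.
x = np.concatenate([rng.normal(0.0, 0.1, 50), rng.normal(0.6, 0.1, 50)])

k1 = best_split(x)
k2 = best_split(3.7 * x)   # rescale by an arbitrary nonzero constant
print(k1, k2)              # identical split locations
```

Because every score is scaled by the same positive factor, the argmax (and hence the detected change point) is invariant, which is why increasing σ is equivalent to shrinking the jump size.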
Results from applying ODM, SDM, and BRM to the simulation data for different values of σ. (a) The number of change points correctly detected. (b) The number of change points mislocated. (c) True-positive rates (total numbers of change points correctly detected and mislocated divided by the number of true change points, 500). (d) The number of additional change points detected (false positives).
We also compared the CPU time consumed (with MATLAB software) to perform these analyses. For analyzing 100 data sets, ODM, SDM, and BRM each consumed only a fraction of the CPU time used by the CBS method.
5 APPLICATION TO A DATA SET OF FIBROBLAST CELL LINES
To further assess the performance of our methods, we applied them to a set of CGH data with known DNA copy number changes (Snijders and others, 2001). The data consisted of single experiments on 15 fibroblast cell lines. The variable used for analysis was the normalized average of the log₂ test over reference ratio. For 6 of the cell lines, the copy number alteration encompassed the whole chromosome. Therefore, we first limited our analysis to the other 9 cell lines and tested one chromosome at a time for change points. The ODM was performed with both d = 3.5σ and d = 4σ. For SDM and BRM, we used thresholds based on α = 0.001 and α = 0.0005. Results can be found in Table 3. Outcomes from applying the CBS method reported by Olshen and others (2004) are included for comparison.
Results from applying ODM, SDM, BRM, and CBS to 9 cell lines with known copy number alterations
| Cell line/Chromosome | ODM | | SDM | | BRM | | CBS | |
| | d = 3.5σ | d = 4σ | α = 0.001 | α = 0.0005 | α = 0.001 | α = 0.0005 | α = 0.01 | α = 0.005 |
| GM03563/3 | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes |
| GM03563/9 | Yes | Yes | Yes | Yes | Yes | Yes | No | No |
| GM03563/false(single) | 8(2) | 4(0) | 7(2) | 5(3) | 6(3) | 5(3) | 8 | 5 |
| GM05296/10 | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes |
| GM05296/11 | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes |
| GM05296/false(single) | 3(2) | 1(1) | 4(3) | 3(3) | 2(4) | 2(4) | 3 | 0 |
| GM01750/9 | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes |
| GM01750/14 | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes |
| GM01750/false(single) | 3(1) | 0(0) | 3(1) | 3(1) | 3(1) | 3(1) | 1 | 0 |
| GM03134/8 | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes |
| GM03134/false(single) | 2(1) | 0(1) | 4(1) | 1(1) | 3(2) | 1(2) | 3 | 1 |
| GM13330/1 | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes |
| GM13330/4 | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes |
| GM13330/false(single) | 6(0) | 3(0) | 8(1) | 7(1) | 8(1) | 7(1) | 8 | 5 |
| GM01535/5 | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes |
| GM01535/12 | Yes | Yes | Yes | Yes | Yes | Yes | No | No |
| GM01535/false(single) | 3(3) | 1(1) | 1(6) | 1(6) | 2(6) | 1(6) | 2 | 0 |
| GM07081/7 | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes |
| GM07081/15 | No | No | No | No | No | No | No | No |
| GM07081/false(single) | 3(3) | 1(2) | 4(4) | 1(5) | 2(5) | 2(5) | 1 | 0 |
| GM13031/17 | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes |
| GM13031/false(single) | 5(0) | 1(1) | 7(1) | 5(1) | 5(2) | 4(2) | 5 | 3 |
| GM01524/6 | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes |
| GM01524/false(single) | 6(1) | 2(2) | 5(1) | 4(2) | 4(2) | 4(2) | 6 | 2 |
In Table 3, “yes” for ODM, SDM, and BRM means that the alteration was found for the particular cell line and chromosome at the given level, although in some cases, the locations of the change points may be 1 or 2 positions away from the real ones. Some additional change points may also be identified, which, as we will discuss further, can be eliminated through a “pruning” procedure. “Yes” for CBS is taken from Olshen and others (2004), but the localization relative to the truth was not given. “No” means that the alteration was not found. “Single” and “false” represent the numbers of chromosomes where change points were found that do not correspond to known alterations. The difference between “single” and “false” is that “single” includes only the chromosomes where all detected change points were single-point regions, whereas “false” includes regions larger than single points. Using the d = 4σ criterion, ODM gives better results than the other methods. Outcomes for both SDM and BRM show little difference between α = 0.001 and α = 0.0005. Unlike the CBS application, our methods were applied directly to the original data sets without any prior smoothing; therefore, single-point regions can be identified.
All the alterations for the 6 remaining cell lines covered a whole chromosome and would not be identified by analyzing each chromosome separately. However, analysis of an entire genome may help to identify whole-chromosome alterations. We applied ODM to each of these cell lines, followed by one specific pruning step: if the absolute mean difference between the 2 adjacent segments around a change point detected by ODM was less than a preset threshold, then that change point was not considered a detected change point. The results are shown in Figure 6 and clearly reveal whole-chromosome alterations in each cell line.
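The pruning step just described can be sketched in a few lines. This is a minimal illustration of the rule stated above (function name, the `min_mean_diff` parameter, and the toy data are ours, not from the paper); note that after a change point is dropped, the two flanking segments merge, so the remaining points must be re-checked:

```python
import numpy as np

def prune_change_points(x, cps, min_mean_diff):
    """Drop any detected change point whose two flanking segments differ
    in mean by less than `min_mean_diff`. `cps` is a sorted list of
    change-point indices; segment i is x[b[i]:b[i+1]] with
    b = [0] + cps + [len(x)]."""
    cps = sorted(cps)
    changed = True
    while changed:                # re-check after each removal, since
        changed = False           # removing a point merges two segments
        bounds = [0] + cps + [len(x)]
        for i, cp in enumerate(cps):
            left = x[bounds[i]:cp]
            right = x[cp:bounds[i + 2]]
            if abs(left.mean() - right.mean()) < min_mean_diff:
                cps.pop(i)
                changed = True
                break
    return cps

# Toy example: a real jump at 30, and a spurious tiny jump at 60.
x = np.concatenate([np.zeros(30), np.full(30, 1.0), np.full(30, 1.05)])
print(prune_change_points(x, [30, 60], min_mean_diff=0.2))  # [30]
```

The threshold plays the role of the preset value in the text; as discussed later, its choice depends on the data type and sample purity.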
The ODM analysis of 6 whole fibroblast cell lines. Markers are arranged by chromosome. The vertical dotted lines demarcate chromosomes. The horizontal lines represent the mean of each segment.
6 DISCUSSION AND CONCLUSIONS
We have proposed several new approaches for assessing DNA copy number variation along a chromosome. These methods can be used to detect copy number alterations in tumors or other tissues. All approaches can be applied directly to data sets without any prior smoothing, and single-point regions (or outliers) can be identified. We applied these methods to an array CGH data set of fibroblasts with known copy number changes confirmed by spectral karyotyping and were able to identify all the expected alterations. We also applied the methods to a 100-K SNP data set of non-small cell lung carcinoma, namely cell line H2122 from Zhao and others (2005). Our results were similar to those of Zhao and others (2005) for large regions, but we identified many more small regions or single-point alterations. The analysis of the full SNP data set was very fast (less than 3 min on a 3.20-GHz PC for ODM). These methods are also likely to be useful in assessing normal variation. In addition, through simulation we have demonstrated that our methods can correctly detect change points that partition chromosomes into regions with the same underlying copy number.
In developing the methods, we have made the implicit assumption that residual variances (defined in (3.1)) are equal across the chromosome. In some additional simulations, we found that SDM still works well for heteroscedastic data as long as heteroscedasticity across different regions is mild. In general, if heteroscedasticity is apparent, we suggest that some pre-analysis should be performed to derive good estimates of variances. Then the signed distance measure in (2.1) could be modified to involve estimated variances, although this approach would require future investigation.
An issue that can arise with SDM is the edge effect caused by artificially connecting the first and last points, Xᵢ and Xⱼ, of the segment under consideration. That is, if the index k of an identified change point is “close” to either i or j, then it might not be a real change point. The same problem occurs for the CBS method (Olshen and others, 2004) and can be handled in a similar way. By contrast, in the process of change-point detection, BRM does not cut the whole data set into several segments, so the problem of small segments does not exist. Also, the edge effect is a minor issue for BRM because there is only one edge to be considered.
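One simple way to handle the edge effect described above is to discard candidate change points that fall within a small margin of either end of the segment. The sketch is ours (the margin value is an assumed tuning parameter, not prescribed by the paper):

```python
def drop_edge_changepoints(cps, i, j, margin):
    """Discard candidate change points within `margin` positions of either
    end of the segment (i, j), where the artificial join of the first and
    last points makes detections suspect."""
    return [k for k in cps if (k - i) > margin and (j - k) > margin]

# Candidates at 3 and 97 sit too close to the segment ends (0, 100).
print(drop_edge_changepoints([3, 50, 97], i=0, j=100, margin=5))  # [50]
```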
Any statistical method for analyzing DNA copy number will detect more change points or alterations than there are real copy number changes. These false positives may be caused by local trends in the data or other reasons that are not fully understood. Therefore, a pruning procedure may be helpful to eliminate some change points or to merge adjacent segments in order to reduce the number of false positives. Pruning can be performed either from a statistical viewpoint, by controlling a sequence of sums of squares as in CBS, or from a biological viewpoint, by merging 2 adjacent segments as in the HMM method (Fridlyand and others, 2004) and the penalized least-squares regression method (Huang and others, 2005). Our methods do not include this step because any pruning procedure is dominated by a threshold parameter which must be determined by the type of data produced, the level of purity of the populations, and whether type I or type II error is more undesirable. As Fridlyand and others (2004) point out, threshold values for comparing the means of 2 adjacent segments could be quite low when analyzing primary tumors and much higher when looking for genetic abnormalities homogeneously present in all cells. For the results given in Table 3, CBS includes a pruning procedure by controlling a sequence of sums of squares, while our 3 methods do not perform any pruning.
It may be noted that the critical values used for SDM and BRM in both the simulated and real data studies (α = 0.001 and 0.0005) are much smaller than the ones used in CBS (α = 0.01 and 0.005). This is because all our change-point detection methods are essentially based on identifying outliers among the scores produced by the CCTTS algorithm, which is conceptually different from the CBS approach. CBS, using the likelihood ratio, identifies only the largest change point at each segmentation step, whereas our methods find as many change points as possible each time. This distinction affects the permutation distributions. Our approach constructs a reference permutation distribution by pooling all absolute statistic scores, whereas CBS constructs its distribution by collecting only the score with the maximum absolute value from each permutation. The CBS reference distribution is therefore stochastically larger than ours, and the p-value of a given observed statistic is smaller with respect to our reference distribution than with respect to that of CBS. The cutoffs are chosen to give comparable performance, but the interpretation is not the same.
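The stochastic ordering of the two reference distributions is easy to demonstrate by simulation. Below, standard normal draws stand in for permuted statistic scores (an illustrative assumption; the real permuted CCTTS scores are not normal draws): pooling all absolute scores gives a much smaller p-value for the same observed statistic than keeping only each permutation's maximum, which is why a smaller α is appropriate with the pooled distribution.

```python
import numpy as np

rng = np.random.default_rng(1)
n_perm, n_scores = 2000, 100

# Each permutation yields n_scores absolute statistic scores
# (standard normal draws here, as a stand-in for permuted scores).
perm_scores = np.abs(rng.standard_normal((n_perm, n_scores)))

pooled = perm_scores.ravel()        # our style: every absolute score
maxima = perm_scores.max(axis=1)    # CBS style: one maximum per permutation

obs = 2.5                           # a hypothetical observed statistic
p_pooled = np.mean(pooled >= obs)
p_max = np.mean(maxima >= obs)
print(p_pooled, p_max)              # p_pooled is much smaller than p_max
```

The max-based distribution dominates the pooled one, so identical cutoffs would make the max-based test far more conservative; the differing α values compensate for this.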
We have shown that, for normally distributed data, the ODM is easy to perform and extremely fast. The numbers of permutations needed for both SDM and BRM are small compared with the number needed for CBS. Furthermore, the more markers there are, the fewer permutations are needed. Existing methods tend to analyze each chromosome separately because chromosomes are physically separated. Our methods can also analyze an entire genome simultaneously to identify regions of copy number variation. Inferences about change points drawn from analyzing the whole genome, rather than single chromosomes, can be used to further identify whole-chromosome gain or loss.
APPENDIX A
A.1. Simple example illustrating the clustering algorithm and property (a)
We consider a simple situation of a data set containing only 3 points in sequence, X₁, X₂, and X₃. Let d₁ = D(X₁, X₂), d₂ = D(X₂, X₃), and d₃ = D(X₃, X₁), where D is defined by (2.1) and the pair (X₃, X₁) is adjacent under the circular structure. At the beginning of the CCTTS algorithm, the initial scores are s₁ = d₁, s₂ = d₂, and s₃ = d₃. We focus here on how to obtain the final score s₂. According to the CCTTS algorithm, we have the following cases.

1) When d₁ = d₂, d₂ = d₃, or d₁ = d₃ (a tie for the smallest distance), more than 2 clusters are merged simultaneously. One big cluster is obtained, and s₂ is not updated. So the final score is s₂ = d₂.

2) When d₂ < d₁ and d₂ < d₃, we merge the clusters {X₂} and {X₃}, and we update s₁ and s₃, but not s₂. Since X₂ and X₃ are merged together, s₂ will not be updated anymore. Therefore, the final score is s₂ = d₂.

3) When d₁ < d₂ and d₁ < d₃, we merge the clusters {X₁} and {X₂}, and we update s₂ to obtain s₂ = D({X₁, X₂}, {X₃}). Note that, at this stage, the data set contains 2 clusters. The next merging step produces one big cluster, and hence the statistic scores are not updated anymore. Therefore, the value above is the final score s₂.

4) When d₃ < d₁ and d₃ < d₂, we merge the clusters {X₃} and {X₁}. By an argument similar to that in (3), the final score is s₂ = D({X₂}, {X₃, X₁}).

Summarizing (1), (2), (3), and (4), s₂ can be expressed as

s₂ = D({X₁, X₂}, {X₃}) if d₁ < min(d₂, d₃); s₂ = D({X₂}, {X₃, X₁}) if d₃ < min(d₁, d₂); s₂ = d₂ otherwise.
APPENDIX B
B.1 Conditions for asymptotically unbiased variance estimation
Let t₁, t₂, …, tₙ be the statistic scores computed from the CCTTS. According to property (b), the scores are approximately identically normally distributed, with negative correlation between 2 adjacent statistics and approximate independence between statistics at least 2 positions apart. We therefore simplify the assumptions on the joint distribution of (t₁, …, tₙ) as follows:

E(tᵢ) = μ, Var(tᵢ) = σ², Corr(tᵢ, tᵢ₊₁) = ρ < 0, and Corr(tᵢ, tⱼ) = 0 for 2 ≤ |i − j| ≤ n − 2, with indices taken modulo n so that Corr(tₙ, t₁) = ρ.

Here, the assumption Corr(tₙ, t₁) = ρ is based on the circular structure. The sample variance from t₁, …, tₙ is given by

S² = (1/(n − 1)) Σᵢ₌₁ⁿ (tᵢ − t̄)², where t̄ = (1/n) Σᵢ₌₁ⁿ tᵢ.

Under the above assumptions, E(S²) = σ²[1 − 2ρ/(n − 1)], so S² is an asymptotically unbiased estimator of σ² as n → ∞.
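Under a circular correlation structure with variance σ², correlation ρ between circularly adjacent scores, and zero correlation otherwise, a direct calculation gives E(S²) = σ²[1 − 2ρ/(n − 1)], which tends to σ² as n grows. This can be checked by Monte Carlo (the values of n, ρ, and σ² below are illustrative):

```python
import numpy as np

n, rho, sigma2 = 20, -0.3, 1.0

# Circulant covariance: sigma2 on the diagonal, correlation rho between
# circularly adjacent scores (including positions n and 1), zero elsewhere.
cov = sigma2 * np.eye(n)
for i in range(n):
    cov[i, (i + 1) % n] = cov[(i + 1) % n, i] = rho * sigma2

rng = np.random.default_rng(2)
reps = 20000
samples = rng.multivariate_normal(np.zeros(n), cov, size=reps)
s2 = samples.var(axis=1, ddof=1)            # sample variance of each draw

expected = sigma2 * (1 - 2 * rho / (n - 1))  # = 1 + 0.6/19, about 1.032
print(s2.mean(), expected)
```

With ρ < 0, the sample variance slightly overestimates σ² at finite n, but the bias factor 2ρ/(n − 1) vanishes as n → ∞, matching the asymptotic unbiasedness claimed above. (The covariance matrix is positive definite here since its circulant eigenvalues 1 + 2ρ cos(2πk/n) stay above 1 − 2|ρ| = 0.4.)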
This work was supported by the Mathematics of Information Technology and Complex Systems (MITACS/NCE), by National Institutes of Health grant HL68532, and Genome Canada through the Center for Applied Genomics, Toronto. Conflict of Interest: None declared.