Abstract

Microarray technologies allow for simultaneous measurement of DNA copy number at thousands of positions in a genome. Gains and losses of DNA sequences reveal themselves through characteristic patterns of hybridization intensity. To identify change points along the chromosomes, we develop a marker clustering method which consists of 2 parts. First, a “circular clustering tree test statistic” attaches a statistic to each marker that measures the likelihood that it is a change point. Then construction of the marker statistics is followed by outlier detection approaches. The method provides a new way to build up a binary tree that can accurately capture change-point signals and is easy to perform. A simulation study shows good performance in change-point detection, and cancer cell line data are used to illustrate performance when regions of true copy number changes are known.

1 INTRODUCTION

DNA copy number alterations in tumor cells are key genetic events in the development and progression of human cancers (Tirkkonen and others, 1998; Lengauer and others, 1998; Foronzan and others, 2000). These alterations are typically a result of genomic events producing gains and losses of chromosomes or chromosomal regions. Losses and gains of DNA can contribute to alterations in the expression of tumor suppressor genes and oncogenes, respectively. Therefore, the identification of DNA copy number alterations in tumor genomes may help to discover critical genes associated with cancers and, eventually, to improve therapeutic approaches. A variety of techniques have been developed to screen genomes for all possible regions with DNA copy number alterations. For example, comparative genomic hybridization (CGH) (Kallioniemi and others, 1992, 1994), high-resolution array-based CGH (Pinkel and others, 1998; Snijders and others, 2001), representational difference analysis (Lisitsyn and others, 1993), and single nucleotide polymorphism (SNP) arrays (Zhao and others, 2004) can all be used to assess copy number.

The goal in analyzing DNA copy number data is to partition the whole genome into regions with the same underlying copy number and, subsequently, to quantify the copy number in each region. Therefore, locating copy number changes is an important step in the analysis of DNA copy number data. Various statistical methods have been proposed to address this problem. The software CGH-Plotter uses moving median and mean filtering, k-means clustering, and a dynamic programming algorithm (Autio and others, 2003). Jong and others (2003, 2004) proposed a genetic local search algorithm to maximize a likelihood with a penalty term containing the number of change points. Other proposals for change-point detection methods include a dynamic programming algorithm with a penalized likelihood (Picard and others, 2005) and a weighted likelihood function using adaptive weight smoothing (Hupe and others, 2004). Olshen and others (2004) proposed a modified binary segmentation procedure, called circular binary segmentation (CBS), to look for 2 breakpoints at a time in the data arranged in a circle. Wang and others (2005) proposed a simple but effective method based on hierarchical clustering along the chromosomes (CLAC) to select regions of interest while simultaneously controlling the false discovery rate (FDR). Fridlyand and others (2004) used hidden Markov models (HMMs) to estimate the underlying copy numbers (the hidden states). In Hsu and others (2005), a smoothing algorithm was given that denoises the data using a wavelets approach.

A recent paper (Lai and others, 2005) described an extensive study comparing several of these methods and pointed out that some segmentation methods, especially the CGH segmentation method (Picard and others, 2005) and CBS (Olshen and others, 2004), appeared to perform consistently well whether the noise level was low or high. Willenbrock and Fridlyand (2005) also did a comparison study of HMMs, CBS, and the Hupe method and concluded that CBS has the best operational characteristics in terms of its sensitivity and FDR. In Picard's method, however, the CGH profile is assumed to follow a Gaussian signal, an assumption that may be violated, while CBS is not designed to detect single outliers. For processing a large number of high-resolution arrays, the speed of many algorithms becomes an issue. Both HMMs and CBS can be slow and difficult to apply to very high–density oligonucleotide arrays.

In this report, we seek approaches to segment the chromosomes into regions of equal copy number that can include either a single marker or multiple contiguous markers. We propose a method, which we call the “circular clustering tree test statistic” (CCTTS), and following the construction of the statistics, we apply outlier detection approaches to identify change points along the chromosomes. Our method is novel in that it provides a new way to build up a binary tree that can accurately capture change-point signals and is easy to perform.

The rest of the paper is organized as follows. The CCTTS algorithm and properties are given in Section 2. In Section 3, we provide several change-point detection approaches based on the scores produced by the CCTTS algorithm. In Section 4, the proposed methods are evaluated by simulation studies. The method is applied to data from cell lines with known copy number changes in Section 5. We summarize our conclusions and discuss open questions in Section 6.

2 METHODS

It is clear that estimating the locations of regions with aberrant DNA copy numbers is closely connected to the problem of finding the locations of change points (Olshen and others, 2004). This makes change-point methods a natural framework to approach the analysis of array DNA copy number data. Let Y1,Y2, be a sequence of random variables that correspond to markers positioned sequentially along the genome. In general, an index p:ip<j is called a change point if Yi,,Yp have a common distribution and Yp+1,,Yj have a different common distribution. Intuitively, a point can be identified as a change point when the difference in signal intensity is large between 2 adjacent segments demarcated by the point. The correct choice for the sizes of the 2 adjacent segments is an important issue when the signal difference is measured. Segments that are either too small or too large will reduce the sensitivity of change-point detection methods. Our proposed method—the CCTTS—automatically gives 2 “optimal” adjacent segments (or clusters) with which to measure the difference at any point along a chromosome. The idea is to build a hierarchical cluster-style tree from the bottom upward such that the gain/normal/loss regions are separated into different branches. In this procedure, the tree building and the evaluation of statistics at any point are carried out simultaneously. Furthermore, the statistics associated with the tree fully reflect the information involved in the tree structure, in- cluding the magnitude of the difference between any 2 adjacent clusters and the order of the merging procedure. These statistics can then be used for detecting the most likely change points along a chromosome.

2.1 The CCTTS algorithm

The CCTTS algorithm uses a bottom-up strategy that generates a binary tree to represent the similarities in the observed copy number intensities along a chromosome. The algorithm begins with the data arranged in a circle, obtained by connecting the 2 ends of the chromosome together, with every observation representing a singleton cluster. At each of the N1 steps, the closest 2 adjacent clusters are merged into a single cluster, where closest is defined by the smallest difference in copy number intensities, and the differences at the new cluster boundaries are updated. Hence, after each step there is one less cluster.

Let Y1,Y2,,YN be a sequence of random variables for the log2 relative copy number intensities at N marker positions. We define a signed distance to measure the difference between any 2 adjacent clusters as 

graphic
(2.1)
where markers Y1(1),,Yn(1) are adjacent to Y1(2),,Ym(2). Note that D(·,·) is simply a normal test statistic for the data with variance 1. Here, the implicit assumption that the data have a common variance (without loss of generality, 1) can be relaxed for segmentation methods. We call index j (k) the left (right) boundary of a cluster {Yj+1,,Yk} (when j=0, in a circular arrangement, the left boundary of a cluster {Y1,,Yk} is defined to be N). The CCTTS iterative algorithm is given in detail in the following and is illustrated in Figure 1.

  • 1) Arrange the data in a circle so that the first point is connected to the last. That means YN+1=Y1 and Y0=YN. Define N clusters each consisting of one point. Let I={1,2,,N} be the set of indices corresponding to each of the N clusters, and define a set of statistics that will be used to decide if point i is a change point. To start, use 

    graphic
    which is the distance between each pair of adjacent observations.

  • 2) Let l=argminiI|Si|. Then merge the 2 adjacent clusters {Yl} and {Yl+1}, remove index l from I, and update Sl1 and Sl+1 in the following way: 

    graphic

  • 3) In general, merge the 2 adjacent clusters C1={Yj+1,,Yq}and C2={Yq+1,,Yk} with the smallest |Si|:|Sq|=miniI|Si|. Then remove index q from I. If the data still contain more than one cluster, update Sj and Sk in the following way:

    • Let LC and RC be the left and right adjacent clusters to the cluster C={Yj+1,,Yk} we just formed, respectively (see Figure 1).

    • Update Sj using 

      graphic
      (2.2)

    • Update Sk using 

      graphic
      (2.3)

  • 4) Repeat Step (3) until one big cluster is obtained. After merging the last 2 clusters, updating S is not relevant and is not performed. At that point there will be N statistics Sj,j=1,,N.

Fig. 1.

The horizontal axis describes the positions of several markers along a chromosome, and the vertical axis is the signal. Horizontal lines stand for the mean of each cluster. The 2 solid line clusters are merged to form cluster C and then statistics Sj and Sk are updated.

Fig. 1.

The horizontal axis describes the positions of several markers along a chromosome, and the vertical axis is the signal. Horizontal lines stand for the mean of each cluster. The 2 solid line clusters are merged to form cluster C and then statistics Sj and Sk are updated.

In Step (2) or (3), if there are ties for the minimum value miniI|Si|, say, |Sq1|==|Sqm|=miniI|Si|, then indices q1,,qm are simultaneously removed from I, all corresponding pairs of adjacent clusters are merged, and all statistics Si on the boundaries of these newly formed clusters are updated as in Step (2) or (3). Note that 2 boundary points may correspond to the merging of more than 2 adjacent clusters.

The simple example in Table 1 is used to illustrate the CCTTS algorithm. The first row shows a data set with 5 points. The algorithm starts by defining 5 clusters: {Y1},{Y2},{Y3},{Y4},{Y5} and I={1,2,3,4,5}. Si=D({Yi},{Yi+1})(i=1,2,3,4,5) are calculated and given in the second row after connecting the data in a circle. In the first step, argminiI|Si|=5 is removed from I, and {Y5} and {Y1} are merged together to obtain 4 clusters: {Y2},{Y3},{Y4}, and {Y5,Y1}. To update S4 and S1 corresponding to the boundaries of the newly formed cluster, D({Y4},{Y5,Y1})=0.531 (given in the second row) is compared to S4=0.495 and D({Y5,Y1},{Y2})=0.612 is compared to S1=0.566. Both S1 and S4 are updated to new values that are larger in absolute values as seen in the third row. The remaining values of Si,iI={1,2,3,4}, are unchanged. Then, this procedure of computing argminiI|Si|, updating I, merging 2 adjacent clusters together, calculating D(·,·), and updating Si is repeated until one cluster is formed. The final merging order 5,2,4,1,3 matches the magnitude of the final statistics Si (in absolute value).

Table 1.

Example of CCTTS. Numbers in parentheses are the smallest absolute Si

Yi 0.8 1.6 1.3 0.2 0.9 I argminiI|Si
Si − 0.566 0.212 0.778 − 0.495 (0.071) {1, 2, 3, 4, 5} 
D(·, ·) − 0.612   − 0.531    
Update Si − 0.612 (0.212) 0.778 − 0.531  {1, 2, 3, 4} 
D(·, ·) − 0.6  1.021     
Update Si − 0.612  1.021 (− 0.531)  {1, 3, 4} 
D(·, ·) − 0.895  0.895     
Update Si (− 0.895)  1.021   {1, 3} 
Final Si − 0.895 0.212 1.021 − 0.531 0.071   
Yi 0.8 1.6 1.3 0.2 0.9 I argminiI|Si
Si − 0.566 0.212 0.778 − 0.495 (0.071) {1, 2, 3, 4, 5} 
D(·, ·) − 0.612   − 0.531    
Update Si − 0.612 (0.212) 0.778 − 0.531  {1, 2, 3, 4} 
D(·, ·) − 0.6  1.021     
Update Si − 0.612  1.021 (− 0.531)  {1, 3, 4} 
D(·, ·) − 0.895  0.895     
Update Si (− 0.895)  1.021   {1, 3} 
Final Si − 0.895 0.212 1.021 − 0.531 0.071   

In addition to treating the data as a circle, there is a substantial difference between the CCTTS algorithm and the conventional clustering approach (Wang and others, 2005) with respect to how Sj and Sk are updated in Step (2) or (3), for a given difference measure. In the conventional approach, Sj and Sk are simply updated by D(LC,C) and D(C,RC), respectively, which ignores the previous statistic SO at a marker when updating the statistic to SN, after merging clusters C1 and C2. This leads to a suboptimal choice of statistics. To clarify, let SjO be the difference measurement between clusters LC and C1, and let SjN be the difference between LC and C. Suppose Yj is a true change point and |SjO|>|SjN|, then our algorithm assigns the larger statistic SjO to marker j, whereas the standard algorithm would assign the smaller statistic SjN. This also means that there can be inconsistency between the order in which indices are removed from I and the magnitude of the resulting statistics |Sj|. With the usual tree-building procedure, |Sj| and |Sk| may be smaller than |Sq|, even though position q is merged before positions j and k. The merging order indicates that Yq is less likely to be a change point than Yj and Yk. Our method of updating difference measurements Sj in the CCTTS algorithm will produce a sequence of statistics S1,S2,,SN, ordered by their position in the clustering process.

The CCTTS algorithm provides a way to calculate the statistics Si and also leads to 2 “optimally sized” adjacent clusters to achieve a statistic value (see Figure 2). Updating procedures (2.2) and (2.3) keep track of 2 optimally sized adjacent clusters at each step and detect the signals from using both the mean differences and the sizes of the clusters. This makes it possible to identify two interesting cases: (i) small regions with extremely large or small values and (ii) regions of consistent gain/loss that do not deviate much from zero. The magnitude of |Si| naturally reflects the likelihood that Yi is a change point in the whole data set Y1,,YN. Let |Sp1||Sp2||SpN|. Then Yp1 is most likely to be a change point, Yp2 is the second most likely change point, and so on. Several additional properties of our statistics are given in detail in Section 2.2.

Fig. 2.

(a) Simulated data Yi along a chromosome, with (b) corresponding statistics Si computed by the CCTTS algorithm. Six true change points are marked in (a) with different shapes. Horizontal solid and dashed lines indicate the corresponding adjacent right and left clusters used when defining Si. It can be clearly seen that change points correspond to outliers for the Si‘s.

Fig. 2.

(a) Simulated data Yi along a chromosome, with (b) corresponding statistics Si computed by the CCTTS algorithm. Six true change points are marked in (a) with different shapes. Horizontal solid and dashed lines indicate the corresponding adjacent right and left clusters used when defining Si. It can be clearly seen that change points correspond to outliers for the Si‘s.

2.2 Properties

Under the null hypothesis of no change points, the final statistics S1,S2,,SN have the following properties:

  • a) If Y1,,YN are iid, then all Si(i=1,,N) are identically distributed, but not independent.

  • b) If {Y1,,YN} come from a normal population, then all Si(i=1,,N) are approximately identically normally distributed, but not independent.

The clustering tree built by the CCTTS algorithm is based on a circular arrangement of the data. Property (a) is a natural consequence of this circular symmetric arrangement. In Appendix A, we give a simple example of a sequence data set containing only 3 points Y1,Y2, and Y3 and illustrate the functional relationships between the statistic scores Si and the data set {Y1,Y2,Y3}. In general, for a given data arranged in a sequence: Y1,Y2,,YN, the method for calculating Si is a function of Y1,Y2,,YN. So we write S1=f(Y1,Y2,,YN), where the subscript 1 on S corresponds to the first location, Y1. On the other hand, for any j:1<jN, if the data are rearranged as Yj,Yj+1,,YN,Y1,,Yj1, then, according to the previous notation, the CCTTS algorithm gives statistic Sj corresponding to the first location of Yj: Sj=f(Yj,Yj+1,,YN,Y1,,Yj1). Since both data sets Y1,Y2,,YN and Yj,Yj+1,,YN,Y1,,Yj1, when arranged in a circle, are exactly the same, Sj=Sj=f(Yj,Yj+1,,YN,Y1,,Yj1). Therefore, S1=f(Y1,Y2,,YN) and Sj=f(Yj,Yj+1,,YN,Y1,,Yj1)(j=2,3,,N) are identically distributed because Y1,,YN are iid. Property (a) plays a significant role when analyzing non-normal data for change-point detection, since the symmetry can be exploited to reduce the number of permutations needed to produce a reference distribution.

Property (b) has been established empirically by examining normal QQ plots of simulated data. From property (a), we only need to illustrate that S1 is approximately normally distributed. Let {Yi1,,YiN},i=1,,n, be n data sets generated iid from N(0,1). The statistics {Si1,,SiN} were computed from the ith data set by the CCTTS algorithm. Then {S11,S21,,Sn1} is a sample drawn from the distribution of S1. Figure 3 shows normal QQ plots for n=10000 replications with number of markers N=50,100,200,400,800,1600 and indicates that S1 is approximately normally distributed. In our simulations, we simultaneously calculated sample correlation coefficients between S1 and S1+j(j>0) which, because of the symmetric circular arrangement and property (a), also estimate the correlation coefficients for any 2 statistics Si and Si+j. When j=1, for different N, all these correlation coefficients were between 0.38 and 0.35. When j>1, correlations between Si and Si+j are small (e.g. correlations between Si and Si+2 are around 0.066), and these statistics could be treated as approximately independent. Due to the negative correlations between Si and Si+1, a histogram of all statistics S1,,SN jointly has larger tails than would be expected from the marginal approximate normality of any one statistic Si. Property (b) makes it possible to analyze normal data for change-point detection easily and efficiently.

Fig. 3.

Normal QQ plots with respect to S1 from 10000 simulated data sets. The dash-dotted line is a robust linear fit of the sample order statistics which passes through the first and third quartiles. The slope of the solid line is 1 and represents the standard normal distribution. Simulations are applied for different data size N. The plots demonstrate approximate normality of the statistics but with larger variances than standard normal.

Fig. 3.

Normal QQ plots with respect to S1 from 10000 simulated data sets. The dash-dotted line is a robust linear fit of the sample order statistics which passes through the first and third quartiles. The slope of the solid line is 1 and represents the standard normal distribution. Simulations are applied for different data size N. The plots demonstrate approximate normality of the statistics but with larger variances than standard normal.

3 CHANGE-POINT DETECTION

We assume that the observed Yi is a realization of the true log2 relative copy number α+βi at marker i plus random noise, 

graphic
(3.1)
where N is the number of markers on a given chromosome and ϵi are iid with mean 0. Under the null hypothesis of no change points H0:βi=0,i=1,,N, property (a) implies that S1,,SN have the same distribution. As discussed in Section 2.2, the problem of change-point identification can be reduced to the problem of finding extremely large |Si| that deviate from the pattern of {S1,,SN}. In other words, our task becomes the identification of the outliers among the scores {S1,,SN}. Interpretation of identified change points must take into account the artificial circular data. For example, if the number of detected change points is one, we must conclude that the original data contain no change points, since circular data with only one change point are still one cluster. For simplicity, we will not exclude a detected change point that is the last point on a chromosome.

We discuss and introduce several different approaches for outlier detection, separately for normal and non-normal data.

3.1 Normally distributed data

We assume that the data {Y1,,YN} are normally distributed with a common variance. The scores S1,,SN are computed according to the CCTTS. Under the null hypothesis of no change points, S1,,SN are each approximately normally distributed (see property (b)). Therefore, some approaches used to identify outliers for normal data are appropriate here, even though S1,,SN are negatively correlated between 2 adjacent Si. To perform this procedure, we apply the simple d standard deviation method, where the value of d can be user defined.

Approach 1: outlier detection method (ODM).

S1,,SN are sorted in decreasing absolute values: |Sp1||Sp2||SpN|. The method starts with the assumption of no outliers in S1,,SN, then examines and determines the first most likely outliers. If outliers are detected, then they are removed from the set of scores 𝒮={S1,,SN}. The procedure will be repeated for the remaining scores until no more outliers can be found.After completing this procedure, if j1, there are no change points; otherwise, we find change points Yp1,,Ypj.

  • 1) Set j=0. Calculate the mean m and standard deviation s of all the scores.

  • 2) If |Sp1m|d·s, stop. Otherwise, identify the largest l such that |Sp1m|,,|Splm|>d·s. Then remove Sp1,,Spl from 𝒮, recalculate the mean m and standard deviation s for the remaining scores in 𝒮, and then update jj+l.

  • 3) In general, if |Spj+1m|d·s, stop. Otherwise, identify the largest l such that |Spj+1m|,,|Spj+lm|>d·s.

  • 4) Remove Spj+1,,Spj+l from 𝒮, then recalculate the mean m and standard deviation s for the remaining scores in 𝒮.

  • 5) jj+l, go to Step (3).

The value of d can be user defined based on some domain or data knowledge, or theoretically using Chebyshev's theorem (Barnett and Lewis, 1994). In practice, d is usually chosen between 3 and 6 (Maletic and Marcus, 2000). From our experience, values between 3.5 and 4 were found to generate the best results with fewer false positives and false negatives in general.

Under the null hypothesis of no change points, the negative correlations between Si and Si+1 and approximate independence between Si and Si+j(j>1) mean that a sample variance σ^2 from S1,,SN overestimates the variance σ2 of the marginal distribution for any one statistic Sk. Therefore, outlier detection in the outlier detection method (ODM) is conservative. Outlier detection in S1,,SN with one or several (outliers) points removed is also expected to be conservative. However, this conservativeness is not a concern in most cases due to the following 2 facts: (i) a sample variance σ^2 from S1,,SN is an asymptotically unbiased estimator of the marginal variance σ2 with respect to the number of markers N (see Appendix B) and (ii) a user-defined value of d can be adjusted for outlier detection. Therefore, the negative correlations do not have much effect on ODM performance for large data sets. When the number of markers is small, we may choose a smaller value of d to adjust for the conservativeness.

3.2 Non-normal data

In the outlier detection method, we assumed that the data followed a normal distribution. For non-normal data, we will introduce 2 methods to identify change points based on permutation-derived reference distributions. The sequential detection method (SDM) is the most straightforward method to apply.

Approach 2: Sequential detection method (SDM).

Assume the null hypothesis H0 is true, that is, there are no change points. Using property (a), each statistic value Si computed from position permuted data has the same distribution and, therefore, can be used as a contribution to the permutation distribution. For example, by doing 100 permutations on a data set of N=200 markers, 20000 null test statistics are obtained. This drastically reduces the number of permutations needed.

  • 1) For the original data Y1,,YN, we calculate statistic scores according to CCTTS and sort them in decreasing absolute values: |Sp1||Sp2||SpN|.

  • 2) Randomly permute the order of the markers M times (M×N>10000). For each permutation data set P, collect all absolute statistic values |Sp1P|,,|SpNP| and construct a reference distribution. Outliers are chosen if |Spj| is greater than the upper αth quantile of the permutation distribution Uα. Let k=maxj(|Spj|>Uα).

  • 3) If k1, there are no change points. Otherwise, the data can be partitioned into k or k+1 segments separated by k change points Yp1,,Ypk. The exact number of segments (k or k+1) depends on whether YN is chosen as a change point.

  • 4) Steps (1) to (3) are recursively performed on each small segment until no more change points are identified. Single-point regions of gain or loss may be identified by the occurrence of 2 adjacent change points. These may occur either because there was a very small region of copy number change on the chromosome that is measured by only one marker or because there was an error in the data at one marker. Suppose a chromosome with N markers has only one single-point region of gain/loss. Any permuted data from this chromosome also has one single-point region of gain/loss. That means 2 out of the N statistic values Si in each permutation will be relatively large, and therefore the magnitude of quantile Uα depends critically on N. It will be difficult to detect change points for small values of N. Therefore, since the SDM does not work well in this situation, we propose a modification to improve detection of single-point regions. Let s be the sample standard deviation of the original data. Suppose during Step (3) on a segment with mean ms no change points were detected (k1). We propose to examine the indices of Sp1 and Sp2. If they are adjacent in the circle, then point Yp1 (or Yp2, depending on the relative positions of p1 and p2) may be a single-point region. If |Yp1ms|4s, then we conclude that there are no change points on the segment; otherwise, we conclude that Yp1 and Yp2 are 2 change points.

Approach 3: backward residual method (BRM).

The SDM works well and is robust to unequal variances. However, when segments become small, the SDM permutation approach can miss some change points. For example, look at a very simple data set: {Y1,,Y6}={0,1,1,1,0,0}. It is obvious that there are 2 change points and the largest statistic, |Sp1|, is 1.2247. The probability of achieving this largest value in the permutation distribution is 3!·3!·25!·6=0.1. No change points can be identified at the level α<0.1. Therefore, as an alternative approach, we developed a backward residual method (BRM) that uses the whole data set all the time rather than working with smaller segments. The algorithm starts in the same way as the SDM, but after some of the change points are detected, new permutations will be performed on the residual data instead of on the smaller segments.Similar to the SDM, a modification can be made to deal with single-point regions in data sets. Let s be the sample standard deviation of the original data. In Step (4), when kk0, let Ypl and Ypl+1 (l=1 when k0=1, l=k+1 when k0>1) be the next 2 most likely change points. Identify if there are potential single-point regions produced by combining Ypl with previously detected change points or by examining Ypl and Ypl+1. Suppose this potential single-point region {Yj} lies in segment L. We need to decide if this is a real single-point region. If |YjmL|4s (mL is the mean of segment L), we conclude that there are no more change points. Otherwise, Ypl or both Ypl and Ypl+1 are identified to be new change points.

  • 1) As before, for the original data Y1,,YN, we calculate statistic scores according to CCTTS and sort them in decreasing absolute values: |Sp1||Sp2||SpN|.

  • 2) Let k0=1(one segment) and {X1,,XN}={Y1,,YN}.

  • 3) Randomly permute the order of the data {X1,,XN}M times (M×N>10000). For each permutation data set, collect all absolute statistic values to form a reference distribution. The threshold values can be chosen to be the upper αth quantile of the permutation distribution, say Uα. Let k=argmaxj(|Spj|>Uα).

  • 4) If kk0, stop. Otherwise, k0k (k0 detected change points or segments); denote mi as the mean of the segment which contains observation Yi and update {X1,,XN} to be the mean adjusted residual of {Y1,,YN}, so that Xi=Yimi,i=1,,N. Then go to Step (3).

Software written in MATLAB is available from the first author.

4 SIMULATIONS

In this section, we investigate through Monte Carlo simulations the performance of the proposed methods. Data containing regions of gain, normal, and loss copy numbers were generated from the model yi=μi+σϵi,1iN, where N is the sample size, μ is the mean, and ϵ is distributed as N(0,1). We assume one region of loss and one region of gain per chromosome. Within a loss region, all markers have the same state, with copy number reduced by 1. However, within the gain region we allow 3 possible states amplified by 1,2, or 3 copies.

Following Pollack and others (2002), we set μ‘s for the copy numbers deleted by 1 and amplified by 1, 2, and 3 to be d1=log2(0.75),a1=log2(1.25),a2=log2(1.5), and a3=log2(1.75), respectively, and we use μ=0 for normal regions. To mimic typical data, the following quantities were generated randomly: the position of loss/gain regions, the size of loss/gain regions, the number and allocation of the 3 amplified states, and the sizes of the 3 amplified regions within the regions of gain. In order to compare our methods and the CBS method, the simulated data sets do not contain single points of loss/gain. The details of the simulation are given in the following.

  • 1) A data set is defined to contain a normal region, a region containing a loss, and a region containing a gain in random order.

  • 2) To generate the 3 regions, each of size m: Y1,,Ym, the following parameters are used:

    • a) Data from normal region: Y1,,YmN(0,σ).

    • b) Data from region of loss:

      • i) Initially generate Y1,,YmN(0,σ).

      • ii) Generate the width w of a loss region with minimum width w=2: 

        graphic

      • iii) Locate the loss interval in the region: Yi+1,,Yi+w by generating i from a uniform distribution: 

        graphic

      • iv) Let d1=log2(0.75) and Yi+jYi+j+d1,j=1,,w.

    • c) Data from region of gain:

      • i) Initially generate Y1,,YmN(0,σ).

      • ii) Generate the number of amplified states: nUni({1,2,3}) and then randomly select the magnitude of the gains by sampling without replacement n times from {1,2,3} within the region of gain.

      • iii) Generate the width of a region with minimum width 2 for 1 amplified state, 4 for 2 amplified states, and 6 for 3 amplified states: 

        graphic

      • iv) Locate the gain interval in the region: Yi+1,,Yi+w by generating i from a uniform distribution: 

        graphic

      • v) Divide the gain region into n segments, with the length of each segment at least 2 and chosen from a uniform distribution, and then randomly assign the n amplified states (selected by (ii)) to the n segments.

      • vi) Let a1=log2(1.25),a2=log2(1.5), and a3=log2(1.75). If segment Yk+1,,Yk+v corresponds to the ith state, then Yk+jYk+j+ai,j=1,,v.

  • 3) The whole simulated data set is produced by randomly concatenating the 3 regions generated by (2) and examples are given in Figure 4.

Fig. 4.

Examples of 4 simulated data sets with σ=0.05. Each data set is produced by randomly concatenating the 3 parts: ·, ×, and + which contain normal, loss, and gain regions, respectively. In the gain regions, (a) has 1 state amplified by 3 copies, (b) has 2 states amplified by 1 and 2 copies, (c) and (d) have all 3 states, amplified by either 1, 2, or 3 copies.

Fig. 4.

Examples of 4 simulated data sets with σ=0.05. Each data set is produced by randomly concatenating the 3 parts: ·, ×, and + which contain normal, loss, and gain regions, respectively. In the gain regions, (a) has 1 state amplified by 3 copies, (b) has 2 states amplified by 1 and 2 copies, (c) and (d) have all 3 states, amplified by either 1, 2, or 3 copies.

One hundred data sets with 180 markers were simulated for m=60 and for each value of σ{0.05,0.10,0.15,0.20,0.25,0.5,1}. In order to compare the simulation results across different values of σ, Y1,,YmN(0,1) are generated and σY1,,σYm are used for different σ. Examples with σ=0.05 from simulation are shown in Figure 4. If regions of loss and gain are in the middle, a data set with only one amplified state for a gain region leads to 4 change points, whereas a data set with all 3 amplified states leads to 6 change points. The total number of change points in 100 data sets for a given σ is 500. We use d=3.5 standard deviations to perform the ODM. For both the SDM and the BRM, α=0.001 is used to define outliers.

The number of change points detected is summarized in Table 2. For comparison, the results of the CBS without pruning (Olshen and others, 2004) and the CLAC (Wang and others, 2005) are included. In order to compare the CPU time, we coded the algorithm for CBS in MATLAB instead of the existing R package. The CLAC method was included for comparison since it also involves hierarchical clustering of points. CLAC was implemented with the existing R package (available at http://www-stat.stanford.edu/∼wp57/CGH-Miner/) using default parameters. Because reference samples from the normal arrays are needed in the CLAC method, we simulated 100 reference data sets using a normal distribution with zero mean and variance σ, set to be the same as in the data to be analyzed.

Table 2.

Results from applying ODM, SDM, BRM, CBS, and CLAC to the simulated data. ODM is performed with d = 3.5 standard deviation. α = 0.001 is used to define outliers for both SDM and BRM. CBS is applied without pruning

Number of Method σ = 0.05 σ = 0.10 σ = 0.15 σ = 0.20 σ = 0.25 σ = 0.50 σ = 1 
True change points  500 500 500 500 500 500 500 
 ODM 495 426 337 268 203 59 12 
Change points SDM 461 416 334 264 207 58 14 
correctly detected BRM 496 428 337 272 212 61 13 
 CBS 453 407 337 264 218 69 17 
 CLAC 13 13 15 26 29 27 26 
 ODM 33 82 115 128 78 21 
  (3, 0, 0) (29, 3, 1) (63, 13, 6) (76, 26, 13) (80, 29, 19) (36, 25, 17) (7, 5, 9) 
Change points SDM 28 78 112 126 86 26 
mislocated  (3, 0, 0) (26, 1, 1) (60, 13, 5) (76, 23, 13) (81, 25, 20) (40, 27, 19) (8, 8, 10) 
(in position by 1, 2, 3) BRM 34 85 118 137 84 28 
 (3, 0, 0) (30, 3, 1) (65, 14, 6) (79, 26, 13) (86, 30, 21) (38, 27, 19) (10, 8, 10) 
 CBS 39 73 116 135 87 24 
  (7, 1, 0) (32, 5, 2) (59, 7, 7) (83, 21, 12) (92, 24, 19) (44, 23, 20) (10, 5, 9) 
 CLAC 231 214 209 190 185 128 102 
  (44, 175, 12) (46, 149, 19) (69, 115, 25) (72, 95, 23) (77, 82, 26) (57, 51, 20) (51, 33, 18) 
 ODM 41 81 117 169 363 467 
Missed change points SDM 36 56 88 124 167 356 460 
BRM 38 78 110 151 355 459 
 CBS 39 54 90 120 147 344 459 
 CLAC 256 273 276 284 286 345 372 
 ODM 11 19 30 45 26 
Additional change points detected SDM 21 28 41 48 71 42 
BRM 10 13 18 28 36 61 42 
 CBS 13 22 19 29 53 40 
 CLAC 65 83 85 88 89 101 422 
Number of Method σ = 0.05 σ = 0.10 σ = 0.15 σ = 0.20 σ = 0.25 σ = 0.50 σ = 1 
True change points  500 500 500 500 500 500 500 
 ODM 495 426 337 268 203 59 12 
Change points SDM 461 416 334 264 207 58 14 
correctly detected BRM 496 428 337 272 212 61 13 
 CBS 453 407 337 264 218 69 17 
 CLAC 13 13 15 26 29 27 26 
 ODM 33 82 115 128 78 21 
  (3, 0, 0) (29, 3, 1) (63, 13, 6) (76, 26, 13) (80, 29, 19) (36, 25, 17) (7, 5, 9) 
Change points SDM 28 78 112 126 86 26 
mislocated  (3, 0, 0) (26, 1, 1) (60, 13, 5) (76, 23, 13) (81, 25, 20) (40, 27, 19) (8, 8, 10) 
(in position by 1, 2, 3) BRM 34 85 118 137 84 28 
 (3, 0, 0) (30, 3, 1) (65, 14, 6) (79, 26, 13) (86, 30, 21) (38, 27, 19) (10, 8, 10) 
 CBS 39 73 116 135 87 24 
  (7, 1, 0) (32, 5, 2) (59, 7, 7) (83, 21, 12) (92, 24, 19) (44, 23, 20) (10, 5, 9) 
 CLAC 231 214 209 190 185 128 102 
  (44, 175, 12) (46, 149, 19) (69, 115, 25) (72, 95, 23) (77, 82, 26) (57, 51, 20) (51, 33, 18) 
 ODM 41 81 117 169 363 467 
Missed change points SDM 36 56 88 124 167 356 460 
BRM 38 78 110 151 355 459 
 CBS 39 54 90 120 147 344 459 
 CLAC 256 273 276 284 286 345 372 
 ODM 11 19 30 45 26 
Additional change points detected SDM 21 28 41 48 71 42 
BRM 10 13 18 28 36 61 42 
 CBS 13 22 19 29 53 40 
 CLAC 65 83 85 88 89 101 422 

The simulation results in Table 2 show that the proposed methods are competitive with the CBS method without pruning. In contrast, most of the change points detected by the CLAC are not in the right locations, and many change points are missed. This may be caused by the mean smoothing step in CLAC that reduces noise in the simulation data at the expense of blurring the edges of the loss/gain regions. When σ is small, our methods appear to perform better than CBS. When σ is large (e.g. σ=0.5), SDM finds a few more false positives and ODM has a few more false negatives than CBS. As σ increases, the data sets become more noisy, and the numbers of both false negatives and false positives from all methods increase. However, once σ reaches very large values, the data sets appear approximately equal to those without change points and the number of additional change points reflects the false-positive rate in the absence of any change points. This can be seen in the last 2 columns in Table 2, where the number of additional change points detected for σ=1 is less than for σ=0.5 (except for CLAC).

To further assess the true-positive and false-positive rates, we applied CCTTS-based methods in additional simulations with σ gradually increasing toward infinity. We note that, from the viewpoint of change-point detection, analyzing a data set y1,y2,,yN is equivalent to analyzing λy1,λy2,,λyN for any nonzero constant λ. Therefore, for any fixed σ, analyzing a simulation data set yi=μi+σϵi,1iN, yields the same results as analyzing the data set yi/σ=μi/σ+ϵi,1iN. As σ, the data set becomes one without change points. In Figure 5, we plot change-point detection rates against σ. We define true positives to be both change points correctly detected and those detected but mislocated. The results indicate that the number of change points correctly detected and the true-positive rates decrease as σ increases. However, among the true positives, the number of change points mislocated increases when σ is small and then decreases as σ becomes large. Similarly, for small values of σ, the false positives increase due to the increased noise. When σ reaches a certain value (around 0.5), the number of false positives begins to decrease and eventually approaches a limiting value corresponding to σ= (data without change points).

Fig. 5.

Results from applying ODM, SDM, and BRM to the simulation data for different values of σ. (a) The number of change points correctly detected. (b) The number of change points mislocated. (c) True-positive rates (total numbers of change points correctly detected and mislocated divided by the number of true change points, 500). (d) The number of additional change points detected (false positives).

Fig. 5.

Results from applying ODM, SDM, and BRM to the simulation data for different values of σ. (a) The number of change points correctly detected. (b) The number of change points mislocated. (c) True-positive rates (total numbers of change points correctly detected and mislocated divided by the number of true change points, 500). (d) The number of additional change points detected (false positives).

We also compared the CPU time consumed (with MATLAB software) to perform these analyses. For analyzing 100 data sets with σ=0.15, the ODM, SDM, and BRM consumed about 0.02%,6.5%, and 3.3% of the CPU time used by the CBS method, respectively.

5 APPLICATION TO A DATA SET OF FIBROBLAST CELL LINES

To further assess the performance of our methods, we applied them to a set of CGH data with known DNA copy number changes (Snijders and others, 2001). The data consisted of single experiments on 15 fibroblast cell lines. The variable used for analysis was the normalized average of log2 of the test over reference ratio. For 6 of the cell lines, the copy number alteration encompassed the whole chromosome. Therefore, we first limited our analysis to the other 9 cell lines and tested one chromosome at a time for change points. The ODM was performed with both d=3.5 and d=4. For SDM and BRM, we used thresholds based on α=0.001 and 0.0005. Results can be found in Table 3. Outcomes from applying the CBS method reported by Olshen and others (2004) are included for comparison.

Table 3.

Results from applying ODM, SDM, BRM, and CBS to 9 cell lines with known copy number alterations

Cell line/Chromosomes ODM SDM BRM CBS 
 3.5σ 4σ α = 0.001 0.0005 α = 0.001 0.0005 α = 0.01 0.005 
GM03563/3 Yes Yes Yes Yes Yes Yes Yes Yes 
GM03563/9 Yes Yes Yes Yes Yes Yes No No 
GM03563/false(single) 8(2) 4(0) 7(2) 5(3) 6(3) 5(3) 
GM05296/10 Yes Yes Yes Yes Yes Yes Yes Yes 
GM05296/11 Yes Yes Yes Yes Yes Yes Yes Yes 
GM05296/false(single) 3(2) 1(1) 4(3) 3(3) 2(4) 2(4) 
GM01750/9 Yes Yes Yes Yes Yes Yes Yes Yes 
GM01750/14 Yes Yes Yes Yes Yes Yes Yes Yes 
GM01750/false(single) 3(1) 0(0) 3(1) 3(1) 3(1) 3(1) 
GM03134/8 Yes Yes Yes Yes Yes Yes Yes Yes 
GM03134/false(single) 2(1) 0(1) 4(1) 1(1) 3(2) 1(2) 
GM13330/1 Yes Yes Yes Yes Yes Yes Yes Yes 
GM13330/4 Yes Yes Yes Yes Yes Yes Yes Yes 
GM13330/false(single) 6(0) 3(0) 8(1) 7(1) 8(1) 7(1) 
GM01535/5 Yes Yes Yes Yes Yes Yes Yes Yes 
GM01535/12 Yes Yes Yes Yes Yes Yes No No 
GM01535/false(single) 3(3) 1(1) 1(6) 1(6) 2(6) 1(6) 
GM07081/7 Yes Yes Yes Yes Yes Yes Yes Yes 
GM07081/15 No No No No No No No No 
GM07081/false(single) 3(3) 1(2) 4(4) 1(5) 2(5) 2(5) 
GM13031/17 Yes Yes Yes Yes Yes Yes Yes Yes 
GM13031/false(single) 5(0) 1(1) 7(1) 5(1) 5(2) 4(2) 
GM01524/6 Yes Yes Yes Yes Yes Yes Yes Yes 
GM01524/false(single) 6(1) 2(2) 5(1) 4(2) 4(2) 4(2) 
Cell line/Chromosomes ODM SDM BRM CBS 
 3.5σ 4σ α = 0.001 0.0005 α = 0.001 0.0005 α = 0.01 0.005 
GM03563/3 Yes Yes Yes Yes Yes Yes Yes Yes 
GM03563/9 Yes Yes Yes Yes Yes Yes No No 
GM03563/false(single) 8(2) 4(0) 7(2) 5(3) 6(3) 5(3) 
GM05296/10 Yes Yes Yes Yes Yes Yes Yes Yes 
GM05296/11 Yes Yes Yes Yes Yes Yes Yes Yes 
GM05296/false(single) 3(2) 1(1) 4(3) 3(3) 2(4) 2(4) 
GM01750/9 Yes Yes Yes Yes Yes Yes Yes Yes 
GM01750/14 Yes Yes Yes Yes Yes Yes Yes Yes 
GM01750/false(single) 3(1) 0(0) 3(1) 3(1) 3(1) 3(1) 
GM03134/8 Yes Yes Yes Yes Yes Yes Yes Yes 
GM03134/false(single) 2(1) 0(1) 4(1) 1(1) 3(2) 1(2) 
GM13330/1 Yes Yes Yes Yes Yes Yes Yes Yes 
GM13330/4 Yes Yes Yes Yes Yes Yes Yes Yes 
GM13330/false(single) 6(0) 3(0) 8(1) 7(1) 8(1) 7(1) 
GM01535/5 Yes Yes Yes Yes Yes Yes Yes Yes 
GM01535/12 Yes Yes Yes Yes Yes Yes No No 
GM01535/false(single) 3(3) 1(1) 1(6) 1(6) 2(6) 1(6) 
GM07081/7 Yes Yes Yes Yes Yes Yes Yes Yes 
GM07081/15 No No No No No No No No 
GM07081/false(single) 3(3) 1(2) 4(4) 1(5) 2(5) 2(5) 
GM13031/17 Yes Yes Yes Yes Yes Yes Yes Yes 
GM13031/false(single) 5(0) 1(1) 7(1) 5(1) 5(2) 4(2) 
GM01524/6 Yes Yes Yes Yes Yes Yes Yes Yes 
GM01524/false(single) 6(1) 2(2) 5(1) 4(2) 4(2) 4(2) 

In Table 3, “yes” for ODM, SDM, and BRM means that the alteration was found for the particular cell line and chromosome at the given α level, although in some cases, the locations of the change points may be 1 or 2 positions away from the real ones. Some additional change points may also be identified, which, as we will discuss further, can be eliminated through a “pruning” procedure. “Yes” for CBS is taken from Olshen and others (2004), but the localization relative to the truth was not given. “No” means that the alteration was not found. “Single” and “false” represent the number of chromosomes where change points were found that do not correspond to known alterations. The difference between “single” and “false” is that “single” includes only the chromosomes where all detected change points were single-point regions, whereas “false” includes regions that are larger than single points. Using the d=4 criterion, ODM gives better results than other methods. Outcomes for both SDM and BRM show that there is a little difference between α=0.001 and 0.0005. Unlike the CBS application, our methods were directly applied to the original data sets without any prior smoothing process. Therefore, single-point regions can be identified.

All the alterations for 6 of the remaining cell lines covered a whole chromosome and would not be identified by analyzing each chromosome separately. However, analysis of an entire genome may help to identify whole chromosome alterations. We applied ODM to each of these cell lines with d=4 standard deviation followed by one specific pruning step: If the absolute mean difference between 2 adjacent segments around a change point detected by ODM was less than 0.35, then that change point was not considered a detected change point. The results are shown in Figure 6 and clearly reveal whole chromosome alterations along each cell line.

Fig. 6.

The ODM analysis of 6 whole fibroblast cell lines. Markers are arranged by chromosome. The vertical dotted lines demarcate chromosomes. The horizontal lines represent the mean of each segment.

Fig. 6.

The ODM analysis of 6 whole fibroblast cell lines. Markers are arranged by chromosome. The vertical dotted lines demarcate chromosomes. The horizontal lines represent the mean of each segment.

6 DISCUSSION AND CONCLUSIONS

We have proposed several new approaches for assessing DNA copy number variation along a chromosome. These methods can be used to detect copy number alterations in tumors or other tissues. All approaches can be directly applied to analyze data sets without any prior smoothing process, and single-point regions (or outliers) can be identified. We applied these methods to an array CGH data set of fibroblasts with known copy number changes confirmed by spectral karyotyping and were able to identify all the expected alterations. We also applied the methods to a 100-K SNP data set of non-small cell lung carcinoma, namely cell line H2122 from Zhao and others (2005). Our results were similar to Zhao and others (2005) for large regions, but we identified many more small regions or single-point alterations. The analysis time for the SNP data of 125906 markers was very fast (less than 3 min on a 3.20-GHz PC in performing ODM). These methods are also likely to be useful in assessing normal variation. In addition, through simulation we have demonstrated that our methods can correctly detect change points that partition chromosomes into regions with the same underlying copy number.

In developing the methods, we have made the implicit assumption that residual variances (defined in (3.1)) are equal across the chromosome. In some additional simulations, we found that SDM still works well for heteroscedastic data as long as heteroscedasticity across different regions is mild. In general, if heteroscedasticity is apparent, we suggest that some pre-analysis should be performed to derive good estimates of variances. Then the signed distance measure in (2.1) could be modified to involve estimated variances, although this approach would require future investigation.

An issue that can arise with SDM is the edge effect caused by artificially connecting the first and the last points Yi and Yj in the segment {Yi,,Yj}. That is, if the index k of an identified change point Yk is “close ” to either i or j, then it might not be a real change point. The same problem occurs for the CBS method (Olshen and others, 2004) and can be handled in a similar way. By contrast, in the process of change-point detection, BRM does not cut the whole data set into several segments so that this problem of small segments does not exist. Also, the edge effect is a minor issue for BRM because there is only one edge to be considered.

Any statistical method for analyzing DNA copy number will detect more change points or alterations than there are real copy number changes. These false positives may be caused by local trends in the data or other reasons that are not totally understood. Therefore, a pruning procedure may be helpful to eliminate some change points or to merge adjacent segments in order to reduce the number of false positives. Pruning can be performed either from a statistical viewpoint, by controlling a sequence of sum of squares as in CBS, or from a biological viewpoint by merging 2 adjacent segments as in the HMM method (Fridlyand and others, 2004) and the penalized least-squares regression method (Huang and others, 2005). Our methods do not include this step because any pruning procedure is dominated by a threshold parameter which must be determined by the type of data produced, the level of purity of the populations, and whether type I or type II error is more undesirable. As Fridlyand and others (2004) point out, threshold values for comparing the means of 2 adjacent segments could be as low as 0.25 when analyzing primary tumors and as high as 0.5 when looking for genetic abnormalities homogeneously present in all cells. For the results given in Table 3, CBS includes a pruning procedure by controlling a sequence of sum of squares, while our 3 methods do not perform any pruning.

It may be noted that the critical values α=0.001, 0.0005 used for SDM and BRM in both simulated and real data studies are much smaller than the ones used in CBS (α=0.01, 0.005). This is because all our change-point detection methods are essentially based on identifying the outliers among the scores produced by CCTTS algorithm, and this is conceptually different from the CBS approach. CBS, using the likelihood ratio, identifies only the largest change point at each segmentation step, whereas our methods find as many change points as possible each time. This distinction affects the permutation distributions. Our approach constructs a reference permutation distribution Fa by collecting all absolute statistic scores, whereas CBS constructs a distribution Fs by collecting only one score with the maximum absolute value for each permutation. That means that the distribution Fs is stochastically larger than the distribution Fa and the p-value of an observed statistic with respect to Fa is smaller than Fs. The cutoffs are chosen to give comparable performance, but the interpretation is not the same.

We have shown that, for normally distributed data, the ODM is easy to perform and extremely fast. The numbers of permutations needed for both SDM and BRM are small compared with the number needed for CBS. Furthermore, more markers means that even fewer permutations are needed. Existing methods tend to analyze each chromosome separately due to the fact that chromosomes are physically separated. Our methods can also analyze an entire genome simultaneously to identify regions of copy number variation. Inferences about change points drawn from analyzing the whole genome, rather than single chromosomes, can be used to further identify whole chromosome gain or loss.

APPENDIX A

A.1. Simple example illustrating the clustering algorithm and property (a)

We consider a simple situation of a data set containing only 3 points in sequence: Y1,Y2, and Y3. Let D1=D({Y1},{Y2}),D2=D({Y2},{Y3}), and D3=D({Y3},{Y1}), where D is defined by (2.1). At the beginning of the CCTTS algorithm, the initial scores are S1=D1,S2=D2, and S3=D3. We focus here on how to obtain the final score of S1. According to the CCTTS algorithm, we have

  • 1) When |D1|=|D2||D3| or |D1|=|D3||D2| or |D2|=|D3||D1|, then more than 2 clusters are merged simultaneously. One big cluster is obtained, and S1 is not updated. So the final score is 

    graphic

  • 2) When |D1|<min(|D2|,|D3|), then we merge clusters {Y1} and {Y2}, and we update S2 and S3, but not S1. Since {Y1} and {Y2} are merged together, S1 will not be updated anymore. Therefore, the final score is 

    graphic

  • 3) When |D2|<min(|D1|,|D3|), then we merge clusters {Y2} and {Y3}, and we update S1 to obtain 

    graphic
    Note that, at this stage, the data set contains 2 clusters. The next merging procedure will produce one big cluster, and hence statistic scores will not be updated anymore. Therefore, the above achieved value is the final score of S1.

  • 4) When |D3|<min(|D1|,|D2|), then we merge clusters {Y3} and {Y1}. In a similar discussion as in (3), the final score of S1 can be obtained by 

    graphic

By summarizing (1), (2), (3), and (4), S1 can be mathematically expressed as 

graphic
Similarly, we can obtain mathematical expressions for S2 and S3. For example, S2 can be written as 
graphic
From this simple case, we find that (i) the final statistic scores Si from the CCTTS algorithm are functions of the original sequence data set and (ii) if S1 is written as S1=f(Y1,Y2,Y3), then S2=f(Y2,Y3,Y1). So, S1 and S2 are identically distributed when Y1,Y2, and Y3 are iid.

APPENDIX B

B.1 Conditions for asymptotically unbiased variance estimation

Let S1,,SN be the statistic scores computed from the CCTTS. According to property (b) of Si being approximately identically normally distributed, negative correlations between 2 adjacent statistics, and approximate independence between statistics at least 2 positions apart, we simplify the assumptions for the joint distribution of S1,,SN as follows:

  • 1) Marginal distributions: 

    graphic

  • 2) Correlations: 

    graphic

Here, the assumption of the correlation Corr(S1,SN)=ρ is based on the circular structure. The sample variance σ^2 from S1,,SN is given by 

graphic
where S-=1Ni=1NSi. Our goal is to prove that σ^2 is an asymptotically unbiased estimator of σ2 with respect to N under the assumptions given in (1) and (2). This can be done by the following derivation: 
graphic
Therefore, 
graphic
E(σ^2)>σ2 when ρ<0 (negative correlation), and E(σ^2)σ2 as N.

This work was supported by the Mathematics of Information Technology and Complex Systems (MITACS/NCE), by National Institutes of Health grant HL68532, and Genome Canada through the Center for Applied Genomics, Toronto. Conflict of Interest: None declared.

References

Autio
R
Hautaniemi
S
Kauraniemi
P
Yli-Harja
O
Astola
J
Wolf
M
Kallioniemi
A
CGH-plotter: MATLAB toolbox for CGH-data analysis
Bioinformatics
 , 
2003
, vol. 
19
 (pg. 
1714
-
1715
)
Barnett
V
Lewis
T
Outliers in Statistical Data
 , 
1994
New York
John Wiley and Sons
Foronzan
F
Mahlamaki
EH
Monni
O
Chen
Y
Veldman
R
Jiang
Y
Gooden
GC
Ethier
SP
Kallioniemi
A
Kallioniemi
OP
Comparative genomic hybridization analysis of 38 breast cancer cell lines: a basis for interpreting complementary DNA microarray data
Cancer Research
 , 
2000
, vol. 
60
 (pg. 
4519
-
4525
)
Fridlyand
J
Snijders
AM
Pinkel
D
Albertson
DG
Jain
A
Hidden Markov models approach to the analysis of array CGH data
Journal of Multivariate Analysis
 , 
2004
, vol. 
90
 (pg. 
132
-
153
)
Hsu
L
Self
SG
Grove
D
Randolph
T
Wang
K
Delrow
JJ
Loo
L
Porter
P
Denoising array-based comparative genomic hybridization data using wavelets
Biostatistics
 , 
2005
, vol. 
6
 (pg. 
211
-
226
)
Huang
T
Wu
B
Lizardi
P
Zhao
H
Detection of DNA copy number alterations using penalized least squares regression
Bioinformatics
 , 
2005
, vol. 
21
 (pg. 
3811
-
3817
)
Hupe
P
Stransky
N
Thiery
J-P
Radvanyi
F
Barillot
E
Analysis of array CGH data: from signal ratio to gain and loss of DNA regions
Bioinformatics
 , 
2004
, vol. 
20
 (pg. 
3413
-
3422
)
Jong
K
Marchiori
E
Meijer
G
Vaart
AVD
Ylstra
B
Breakpoint identification and smoothing of array comparative genomic hybridization data
Bioinformatics
 , 
2004
, vol. 
20
 (pg. 
3636
-
3637
)
Jong
K
Marchiori
E
van der Vaart
A
Ylstra
B
Meijer
G
Weiss
M
Chromosomal Breakpoint Detection in Human Cancer. Lecture Notes in Computer Science
 , 
2003
, vol. 
Volume 2611
 
Berlin
Springer
Kallioniemi
A
Kallioniemi
OP
Piper
J
Tanner
M
Stokke
T
Chen
L
Smith
HS
Pinkel
D
Gray
JW
Waldman
FM
Detection and mapping of amplified DNA sequences in breast cancer by comparative genomic hybridization
Proceedings of the National Academy of Sciences USA
 , 
1994
, vol. 
91
 (pg. 
2156
-
2160
)
Kallioniemi
A
Kallioniemi
OP
Sudar
D
Rutovitz
D
Gray
JW
Waldman
F
Pinkel
D
Comparative genomic hybridization for molecular cytogenetic analysis of solid tumors
Science
 , 
1992
, vol. 
258
 (pg. 
818
-
821
)
Lai
WR
Johnson
MD
Kucherlapati
R
Park
PJ
Comparative analysis of algorithms for identifying amplifications and deletions in array CGH data
Bioinformatics
 , 
2005
, vol. 
21
 (pg. 
3763
-
3770
)
Lengauer
C
Kinzler
K
Vogelstein
B
Genetic instabilities in human cancers
Nature
 , 
1998
, vol. 
396
 (pg. 
643
-
649
)
Lisitsyn
N
Lisitsyn
N
Wigler
M
Cloning the differences between two complex genomes
Science
 , 
1993
, vol. 
259
 (pg. 
946
-
951
)
Maletic
JI
Marcus
A
Data cleansing: beyond integrity analysis
Proceedings of The Conference on Information Quality (IQ2000)
 , 
2000
Cambridge
MA
(pg. 
200
-
209
)
Olshen
AB
Venkatraman
ES
Lucito
R
Wigler
M
Circular binary segmentation for the analysis of array-based DNA copy number data
Biostatistics
 , 
2004
, vol. 
5
 (pg. 
557
-
572
)
Picard
F
Robin
S
Lavielle
M
Vaisse
C
Daudin
J-J
A statistical approach for array CGH data analysis
BMC Bioinformatics
 , 
2005
, vol. 
6
 pg. 
27
 
Pinkel
D
Segraves
R
Sudar
D
Clark
S
Poole
I
Kowbel
D
Collins
C
Kuo
WL
Chen
C
Zhai
Y
and others
High resolution analysis of DNA copy number variation using comparative genomic hybridization to microarrays
Nature Genetics
 , 
1998
, vol. 
20
 (pg. 
207
-
211
)
Pollack
JR
Sørlie
T
Perou
CM
Rees
CA
Jeffrey
SS
Lonning
PE
Tibshirani
R
Botstein
D
Børresen-Dale
A
Brown
PO
Microarray analysis reveals a major direct role of DNA copy number alteration in the transcriptional program of human breast tumors
Proceedings of the National Academy of Sciences USA
 , 
2002
, vol. 
99
 (pg. 
12963
-
12968
)
Snijders
AM
Nowak
N
Segraves
R
Blackwood
S
Brown
N
Conroy
J
Hamilton
G
Hindle
AK
Huey
B
Kimura
K
and others
Assembly of microarrays for genome-wide measurement of DNA copy number
Nature Genetics
 , 
2001
, vol. 
29
 (pg. 
263
-
264
)
Tirkkonen
M
Tanner
M
Karhu
R
Kallioniemi
A
Isola
J
Kallioniemi
OP
Molecular cytogenetics of primary breast cancer by CGH
Genes, Chromosomes & Cancer
 , 
1998
, vol. 
21
 (pg. 
177
-
184
)
Wang
P
Kim
Y
Pollack
J
Narasimhan
B
Tibshirani
R
A method for calling gains and losses in array CGH data
Biostatistics
 , 
2005
, vol. 
6
 (pg. 
45
-
58
)
Willenbrock
H
Fridlyand
J
A comparison study: applying segmentation to array CGH data for downstream analyses
Bioinformatics
 , 
2005
, vol. 
21
 (pg. 
4084
-
4091
)
Zhao
X
Weir
BA
LaFramboise
T
Lin
M
Beroukhim
R
Garraway
L
Beheshti
J
Lee
JC
Naoki
K
Richards
WG
and others
Homozygous deletions and chromosome amplifications in human lung carcinomas revealed by single nucleotide polymorphism array analysis
Cancer Research
 , 
2005
, vol. 
65
 (pg. 
5561
-
5570
)
Zhao
XJ
Li
C
Paez
JG
Chin
K
Jänne
PA
Chen
TH
Girard
L
Minna
J
Christiani
D
Leo
C
and others
An integrated view of copy number and allelic alterations in the cancer genome using single nucleotide polymorphism arrays
Cancer Research
 , 
2004
, vol. 
64
 (pg. 
3060
-
3071
)