Abstract

Motivation: Array CGH technologies enable the simultaneous measurement of DNA copy number for thousands of sites on a genome. We developed the circular binary segmentation (CBS) algorithm to divide the genome into regions of equal copy number. The algorithm tests for change-points using a maximal t-statistic with a permutation reference distribution to obtain the corresponding P-value. The number of computations required for the maximal test statistic is O(N2), where N is the number of markers. This makes the full permutation approach computationally prohibitive for the newer arrays that contain tens of thousands markers and highlights the need for a faster algorithm.

Results: We present a hybrid approach to obtain the P-value of the test statistic in linear time. We also introduce a rule for stopping early when there is strong evidence for the presence of a change. We show through simulations that the hybrid approach provides a substantial gain in speed with only a negligible loss in accuracy and that the stopping rule further increases speed. We also present the analyses of array CGH data from breast cancer cell lines to show the impact of the new approaches on the analysis of real data.

Availability: An R version of the CBS algorithm has been implemented in the “DNAcopy” package of the Bioconductor project. The proposed hybrid method for the P-value is available in version 1.2.1 or higher and the stopping rule for declaring a change early is available in version 1.5.1 or higher.

Contact:  [email protected]

Supplementary information: Supplementary data are available at Bioinformatics online.

1 INTRODUCTION

The DNA copy number at a location in a genome is the number of copies of DNA. The normal copy number is two for the autosomal chromosomes in humans. Chromosomal aberrations in the form of copy number gains or losses are common in cancer and studying them is a way of identifying and validating important cancer genes. For example, Whang-Peng et al. (1982) identified deletion in chromosome 3p(14-23) in small cell lung cancer cell lines. Comparative genomic hybridization (CGH) (Kallioniemi et al., 1992; du Manoir et al., 1993) was the first method developed to measure the DNA copy number variation of entire genomes at a 10–20M resolution. Higher throughput techniques based on microarray technology (hence array CGH) have been developed to simultaneously measure DNA copy number at thousands of locations on a genome. Pinkel and Albertson (2005) present a review of the array CGH technologies.

The purpose of these technologies is to study variations in DNA copy number and to identify chromosomal regions that have been gained or lost. We developed the circular binary segmentation (CBS) algorithm (Olshen et al., 2004) to divide the genome into regions of equal DNA copy number. Several alternate algorithms have also been proposed for the analysis of array copy number data. Willenbrock and Fridlyand (2005) in a comparison of algorithms for array CGH data concluded that DNAcopy (our software implementing the CBS algorithm) has the best operational characteristics in terms of its sensitivity and FDR for breakpoint detection. Lai et al. (2005) conducted a comparison of methods for analyzing array CGH data that included CBS and 10 other approaches. They concluded that CBS is one of the two methods that appear to perform consistently well. Unfortunately, they also found the CBS algorithm to be one of the slowest. In light of the proven ability of the CBS algorithm to identify the locations of copy number changes it is desirable to improve its speed.

In this manuscript we present two speed enhancements to the original CBS algorithm. The first one is a hybrid approach for the computation of the P-value of the maximal t-statistic using a tail probability approximation for the maxima of a Gaussian random field. The second one is a sequential testing approach for deriving a stopping rule that reduces the number of permutations when there is strong evidence for the existence of a change-point. The enhanced algorithms use the same test statistic as the original to detect change-points but modifies the procedure used to determine whether the change-points are statistically significant. We compare the performance of the new methods to the original permutation approach both in terms of speed and accuracy on simulated data as well as real data from breast cancer cell lines.

2 METHODS

The CBS procedure formulates the analysis of array CGH data as a problem of detecting change-points, where the change-points are the genomic locations of copy number transitions. The algorithm starts with the whole chromosome and segments it recursively by testing for change-points; it stops when none can be found in any of the segments. The test statistic was chosen to enable us to detect a narrow changed segment in the middle of a large segment.

Let X1,…,Xm be the data corresponding to the m markers for the segment under consideration. The test statistic is the maximal t-statistic given by T = max1 ≤ i < jm |Tij|, where Tij is the two-sample t-statistic to compare the mean of the observations with index from i + 1 to j, to the mean of the rest of the observations. That is

where formulaXj +1 + · + Xm)/(mj + i), and formula is the corresponding mean squared error. Note that if we view the segment being tested as indexed by a circle by connecting its two endpoints then the method tests whether there are two complementary arcs that have unequal means. We declare a change to be statistically significant if the P-value is smaller than a threshold level α (typically 0.01) and estimate the locations of the change-points as the i and j (if j < m) that maximize the test statistic.

In the original implementation of the CBS algorithm, we compute the P-value using a permutation reference distribution due it being a robust nonparametric method. However, this resulted in the computation time growing quadratically with the number of markers on the array (this was also noticed by Lai et al., 2005). This is due to the test-statistic T being the maximum of m(m-1)/2 different statistics for every segment considered and m increases with the number of markers on the array. This is computationally burdensome when analyzing high resolution array CGH data because although computing it once is quick it has to be repeated thousands of times to construct the permutation reference distribution. Another source of computational burden is that the procedure computes the entire set of permuted statistics in order to declare that a change-point exists even if part way through the process there is overwhelming evidence for its existence. We will now present a hybrid method to compute the P-value and derive a stopping rule to declare a change early, both of which will speed up the CBS algorithm substantially.

2.1 Hybrid P-value

The set {i, j: 1 ≤ i < jm } of all splits considered for the test statistic T can be written for any k (≤ m/2) as A1 ∪ A2, where
The set A1 corresponds to all splits in which the minor arc, which is the smaller of the two arcs made by the two-point intersection of a circle, contains at most k observations and the set A2 corresponds to all splits such that both the arcs have more than k observations. Note that T = max {T1, T2}, where formula. For an observed test statistic value of b, the P-value P(T > b) is bounded below by P(T2 > b) and, because of the Bonferroni inequality, bounded above by P(T1 > b + P(T2 > b. We compute P(T1 > b using a permutation approach since T1 is the maximum of several correlated t-statistics and an approximation to its distribution is unavailable even for normally distributed Xs. Since T1 requires only mk statistics to be computed, the computational burden is nearly linear in the number of markers with a suitable choice of k.
We approximate P(T2 > b as follows. Heuristically, since Tij is the standardized difference of means, it has a limiting normal distribution under suitable regularity conditions. So for m and k are large enough the distribution of the statistic Tij, under the null hypothesis of no change, is approximately standard normal. Observe that T2 = max {T2 +, T2-} where formula and formula So under the null hypothesis of no change, the distribution of the statistic T2+ is the same as the one if the Xi s are independent standard normal. Siegmund (1988) and Yao (1989) independently derived approximations for the tail probabilities of the statistic T2+ (this statistic is the same as Z3 in Yao 1993), which is given by
where δ = k/m, ϕ is the standard normal density, and ν is defined as
Since by symmetry P(T2+ > b) = P(T2- > b) and the probability of both T2+ and T2- exceeding b is small, P(T2 > b) ≈ 2 P(T2+ > b). This approximation is asymptotic, i.e. the ratio of the true probability and the approximate formula converges to 1 as formula a constant greater than 0 and 0 < δ < 1/2 fixed. In the next section we show that it is robust and recommend a choice of k as a function of m.

The CBS algorithm is modified in the following manner. Let b be the value of the test statistic T on the observed X1,…,Xm. Since the approximation is asymptotic we compute the full permutation P-value if there are fewer than m0 markers m. Otherwise, we first compute P(T2 > b from the approximation above. If it exceeds α we can declare that there is no change; otherwise we compute P(T1 > b and the P-value is the sum of the two and a change is declared if the sum is less than α . The only difference between the hybrid and the permutation approaches is the P-value. The choice of P-value method thus affects whether a change-point is detected but not its estimated location. Since the hybrid P-value is an approximation for an upper bound it can result in fewer change-points detected. We will study empirically, the impact of this on the procedure's ability to detect the true change-points.

2.2 Stopping rule to declare a change

In the CBS algorithm the magnitude of the P-value is relevant only to decide if it exceeds α. The permutation P-value is given by the proportion of times the permuted statistic exceeds the original statistic. Thus the permutations can be stopped and the null of no change accepted as soon as the permuted statistic exceeds the original more than αB times, where B is the number of permutations. However, at least (1-α)B permuted statistics must be computed to declare that a change exists even if there is overwhelming evidence earlier for it. For example, when B = 10000 and α = 0.01 the procedure cannot stop and declare that a change is present even if none of the first 1000 permuted statistics exceed the original statistic. We use concepts from sequential testing to derive a stopping rule that declares a change before all the permutations are completed. Such a stopping rule will also benefit the hybrid method since it has a permutation component.

Let E1,…, EB be the binary random variables indicating whether the permuted statistic exceeds the original and let R(i) = E1 + · + Ei be their partial sums. Let r be the smallest integer greater than α B, (α - P{T2 > b} instead of α for the hybrid) that is, r / B is the smallest P-value for which the null hypothesis of no change is not rejected. The stopping rule is a sequence of integers b1 < · < br such that the permutations are stopped after the bith permutation, where i is the smallest j for which R(bj) < j. That is, the permutations are stopped the first time there are fewer than i permuted statistics exceeding the original statistic among the first bi permutations. The bs are chosen to satisfy
for a pre-specified (Type I) error rate η. Since there are a large number of boundaries that satisfy the condition we choose the one given by repeated significance testing theory; that is, each bi is the smallest integer for which P{R(bi) < i | R(B) = r } is less than an η* that is chosen such that the overall error rate is η. This stopping rule increases the type I error rate of the algorithm by
but the increase is negligible since the summand above decreases very rapidly in l. A heuristic proof of this claim will be given later in this section. As with the hybrid P-value, the stopping rule only affects whether a change-point is detected but not its estimated location. We will now show the derivation of η* and the corresponding boundary.
Observe that conditioned on R(B)= r, the Es are a sequence of Br zeroes and r ones all of which are equally likely. For notational simplicity, we will omit the conditioning in the following. Note that the set {R(j) < i} corresponds to all sequences (Es) with at most i − 1 ones among E1,…, Ej; its probability is formula Since P{R(j) < i } is decreasing in j and is zero for j = B, for any threshold η*, bi is the smallest j for which the probability gets below η* The probability of interest P{R(bi) < i for any 1 ≤ ir}, for any boundary {b1,…, br}}, can be obtained as follows. The locations of the r ones are a random sample drawn without replacement from 1, …, B. Let L1 < · < Lr be the ordered locations. Since R(bi) < i if and only if Li > bi, the probability of interest is P{Li > bi for any 1 ≤ ir} which can be written as
(1)
sinceformula and the sets on the right are mutually exclusive. This can be calculated exactly using the following recursive equation:
and formula. Since this can be computationally daunting, we approximate the sum in (1) by
(2)
where d is the number of prior points for which the boundary is not crossed. This follows by observing that the sets in the summands of (2) contain those in (1). This is similar in vein to the improved Bonferroni inequalities in Wordsley (1982). Since P{R(bi) < i for any 1 ≤ i ≤ r | R(B) = r} increases as η* increases, the boundary is obtained using an iterative procedure until the desired error bound η is reached.

We will now give a heuristic proof for the claim that this stopping rule results in a minimal increase in the type I error. Under the null hypothesis of no change R(B) has a uniform distribution on {0,…, B}. Using the Bonferroni inequality we can bound P{R(bi) < i for any 1 ≤ i ≤ r | R(B) = l} by formula. This conditional distribution of R(bi) is hypergeometric with mean μl = l × bi/B and variance formula. Recall that i is the η* quantile of R(bi) when l =r. Thus (il)/ σl is smaller than formula, where Q is the η* quantile of R(bi). Appealing to normal approximation of the hypergeometric distribution we see that P{R(bi) < i | R(B) = l} decreases rapidly as l increases. Empirically when l = 2r, P{R(bi) < i | R(B) = l} is approximately (η*)2 when i = 1 and becomes much smaller as i increases resulting in an excess type I error smaller than α η .

2.3 Simulation experiments

We conducted simulation experiments to evaluate the performance gain that the improved CBS algorithm provides and the cost in terms of its ability to detect change-points. We first show that the approximation for P(T2 > b works well for large m and suitably chosen k and hence could be used to obtain the P-value for the statistic T. We then segment the same simulated data with and without change-points, using the original permutation P-value and the new hybrid by itself and with early stopping. For all these simulations the marker data are generated from standard normal, uniform or beta (0.5, 1.0) distributions. Since the approximation was obtained for the maximum of a Gaussian random field, the normal data is where it is expected to perform the best. The other two were chosen to provide different degrees of non-normality with the uniform having a flat density and the beta a very skewed J-shaped one. The computing times for the procedures give us a measure of the performance gain and the proportion of times change-points are detected give us the impact on the ability to detect change-points (false positive or missed ones). All computations were done on a 3.2 GHz Intel Xeon PC running the Debian Linux operating system with gcc and g77 compilers and R statistical software.

3 RESULTS

3.1 Evaluating the approximation

We generated 100 000 data sets with m markers, where m is one of 100, 200, 500 or 1000 from each of the three distributions. We calculated the statistic T2 for each of the data sets with k = 25. This choice of k was used so that the differences in means in Tij are based on sufficiently large number of observations for a normal approximation to be reasonable. We computed the tail probability P(T2 > b empirically as well as by using the approximation for a range of bs for each m. The results are shown in Figure 1. Observe that the approximation substantially underestimates the tail probability when m is only 100. For larger values of m the approximation is close for all three marker distributions. Hence we recommend a minimum of 200 markers (the recommended value of m0 in Section 2.1) when using the hybrid approach. Since the full permutation approach for this number of markers can be accomplished with modest computing effort, this requirement is not a serious burden.

The empirical tail probability distribution of the statistic T2 when the data are normal (red), uniform (green) or beta (0.5, 1.0) (blue) for m =100, 200, 500, or 1000 and k = 25. The black line is the approximation.
Fig. 1.

The empirical tail probability distribution of the statistic T2 when the data are normal (red), uniform (green) or beta (0.5, 1.0) (blue) for m =100, 200, 500, or 1000 and k = 25. The black line is the approximation.

The tail probability approximation is an asymptotic result with k / m a constant. Thus one concern is the use of a fixed k for all values of m. In order to assess this we chose m to be 5000 and evaluated the tail probability empirically when k is 25 or 50. This is shown in Figure 2, along with the approximation. Observe that there is a larger difference between the empirical values and the approximation when k is 25 while the degree to which the approximation agrees with the empirical values when k is 50 is similar to that found in the m = 1000 panel of Figure 1. These simulations show that with a properly chosen k the approximation for P(T2 > b works well for larger m and could be used to obtain the P-value for the segmentation procedure as described in the previous section. This also suggests that it would be prudent to choose a larger k when m increases. We recommend that k be set at 25 for m smaller than 1000 and be increased in increments of 5 as the number of markers doubles.

The empirical distribution of the statistic T2 when the data are normal (red), uniform (green) or beta (0.5, 1.0) (blue) for 5000 markers and k = 25 or 50.
Fig. 2.

The empirical distribution of the statistic T2 when the data are normal (red), uniform (green) or beta (0.5, 1.0) (blue) for 5000 markers and k = 25 or 50.

3.2 Performance of the segmentation procedure

In the previous section we showed that the approximation to P(T2 > b works well. We will now evaluate the operating characteristics of the CBS algorithm when the P-value is approximated by P(T1 > b) + P(T2 > b) as well as when the stopping rule to declare a change early is included. Specifically, we wish to assess the reduction in computing time achieved by these modifications and their cost in terms of false positive or missed change-points. We compare the CBS algorithms with the hybrid P-value alone, as well as when combined with the stopping rule to declare a change early, to the original that uses the full permutation P-value.

We simulated the case of no change-points by generating 5000 data sets each for each of 250, 500 or 1000 markers and for each of the three distributions. The segmentations were performed using a P-value threshold (α) of 0.01 or 0.05. Table 1 shows the percent of times the procedure segments the data (size) and the CPU time used. The size values are very similar for the three algorithms and are consistent with the nominal α level. The size of the new algorithms are lower than that of the original, which is consistent with the hybrid P-value being an approximate upper bound. Also, the addition of the stopping rule results in only a negligible increase in the significance level of the test as the achieved size barely increases. However, the newer algorithms show an enormous gain in computing speed. For example, the normal data with 1000 markers and α = 0.05 took 29 h using the original algorithm but only a bit over an hour with the hybrid algorithms. Since the stopping rule only enables us to declare a change early, we do not expect it to affect the time taken when there is no change and the results confirm it. Notice also that the computing times for the original algorithm quadruples when the number of markers doubles, whereas it grows linearly for the hybrid algorithms.

Table 1.

The results of segmentation when there is no change in the data

m = 250m = 500m = 1000
SizeTimeaSizeTimeaSizeTimea
Permutation1.0436.60.92124.61.08500.1
NHybrid1.0811.20.8818.71.0035.3
Hybrid + ES1.0810.20.8417.21.0832.3
Permutation1.1236.61.00132.51.22492.1
UHybrid1.1611.40.9019.11.0034.5
Hybrid + ES1.1810.30.9417.21.0831.8
Permutation0.8635.40.86128.61.14498.0
BHybrid1.0411.10.9219.41.0839.5
Hybrid + ES1.0410.50.9217.81.0636.1
Permutation5.44121.14.56441.74.721749.7
NHybrid5.2222.44.2036.54.4468.6
Hybrid + ES5.2016.94.2427.64.4650.9
Permutation5.24120.75.08451.24.601710.4
UHybrid4.8219.94.4432.83.6449.8
Hybrid + ES4.9214.94.5023.43.6439.5
Permutation5.04119.55.12451.64.961765.7
BHybrid4.9421.84.8035.64.6468.7
Hybrid + ES5.0216.74.8027.24.5449.3
m = 250m = 500m = 1000
SizeTimeaSizeTimeaSizeTimea
Permutation1.0436.60.92124.61.08500.1
NHybrid1.0811.20.8818.71.0035.3
Hybrid + ES1.0810.20.8417.21.0832.3
Permutation1.1236.61.00132.51.22492.1
UHybrid1.1611.40.9019.11.0034.5
Hybrid + ES1.1810.30.9417.21.0831.8
Permutation0.8635.40.86128.61.14498.0
BHybrid1.0411.10.9219.41.0839.5
Hybrid + ES1.0410.50.9217.81.0636.1
Permutation5.44121.14.56441.74.721749.7
NHybrid5.2222.44.2036.54.4468.6
Hybrid + ES5.2016.94.2427.64.4650.9
Permutation5.24120.75.08451.24.601710.4
UHybrid4.8219.94.4432.83.6449.8
Hybrid + ES4.9214.94.5023.43.6439.5
Permutation5.04119.55.12451.64.961765.7
BHybrid4.9421.84.8035.64.6468.7
Hybrid + ES5.0216.74.8027.24.5449.3

aElapsed times on a 3.2 GHz Pentium 4 computer.

The size column gives the percent of data sets when change-points were detected (false positives) and the time column gives the user cpu time in minutes to segment the 5000 data sets. The top and bottom halves correspond to α =0.01 and 0.05, respectively. The N, U and B in the first column indicate the marker distribution. Permutation used the original full permutation P-value, Hybrid used the approximation and Hybrid + ES used the approximation and early stopping.

Table 1.

The results of segmentation when there is no change in the data

m = 250m = 500m = 1000
SizeTimeaSizeTimeaSizeTimea
Permutation1.0436.60.92124.61.08500.1
NHybrid1.0811.20.8818.71.0035.3
Hybrid + ES1.0810.20.8417.21.0832.3
Permutation1.1236.61.00132.51.22492.1
UHybrid1.1611.40.9019.11.0034.5
Hybrid + ES1.1810.30.9417.21.0831.8
Permutation0.8635.40.86128.61.14498.0
BHybrid1.0411.10.9219.41.0839.5
Hybrid + ES1.0410.50.9217.81.0636.1
Permutation5.44121.14.56441.74.721749.7
NHybrid5.2222.44.2036.54.4468.6
Hybrid + ES5.2016.94.2427.64.4650.9
Permutation5.24120.75.08451.24.601710.4
UHybrid4.8219.94.4432.83.6449.8
Hybrid + ES4.9214.94.5023.43.6439.5
Permutation5.04119.55.12451.64.961765.7
BHybrid4.9421.84.8035.64.6468.7
Hybrid + ES5.0216.74.8027.24.5449.3
m = 250m = 500m = 1000
SizeTimeaSizeTimeaSizeTimea
Permutation1.0436.60.92124.61.08500.1
NHybrid1.0811.20.8818.71.0035.3
Hybrid + ES1.0810.20.8417.21.0832.3
Permutation1.1236.61.00132.51.22492.1
UHybrid1.1611.40.9019.11.0034.5
Hybrid + ES1.1810.30.9417.21.0831.8
Permutation0.8635.40.86128.61.14498.0
BHybrid1.0411.10.9219.41.0839.5
Hybrid + ES1.0410.50.9217.81.0636.1
Permutation5.44121.14.56441.74.721749.7
NHybrid5.2222.44.2036.54.4468.6
Hybrid + ES5.2016.94.2427.64.4650.9
Permutation5.24120.75.08451.24.601710.4
UHybrid4.8219.94.4432.83.6449.8
Hybrid + ES4.9214.94.5023.43.6439.5
Permutation5.04119.55.12451.64.961765.7
BHybrid4.9421.84.8035.64.6468.7
Hybrid + ES5.0216.74.8027.24.5449.3

aElapsed times on a 3.2 GHz Pentium 4 computer.

The size column gives the percent of data sets when change-points were detected (false positives) and the time column gives the user cpu time in minutes to segment the 5000 data sets. The top and bottom halves correspond to α =0.01 and 0.05, respectively. The N, U and B in the first column indicate the marker distribution. Permutation used the original full permutation P-value, Hybrid used the approximation and Hybrid + ES used the approximation and early stopping.

For the alternative of there being change-points, we generated 1000 data sets each and added μ to 10 contiguous markers in the middle of each set, where μ is chosen (empirically) to give approximately 80% power of detecting the change when α =0.01. These results are in Table 2, with power being the percent of times the change was detected. The power of all three procedures are comparable with the original only minimally more powerful than the hybrid algorithms. Thus the fact that the size of the hybrid procedure could be moderately conservative has minimal impact on the power of the procedure. As seen in the no change case the hybrid algorithms show enormous speed gains over the original. The addition of the stopping rule makes the procedure 3 to 4 times faster than the hybrid alone.

Table 2.

Segmentation results when there is a change

m = 250m = 500m = 1000
PowerTimeaPowerTimeaPowerTimea
Permutation79.8102.980.1380.678.21509.0
NHybrid80.326.080.047.078.091.5
Hybrid + ES80.48.679.913.078.326.4
Permutation80.0102.878.6379.975.71485.3
UHybrid80.525.878.445.674.886.3
Hybrid + ES80.67.978.312.074.722.3
Permutation76.799.576.8378.672.71450.9
BHybrid77.124.876.846.373.086.6
Hybrid + ES77.08.377.014.272.726.7
Permutation92.1121.287.8441.889.11830.8
NHybrid91.736.887.754.788.4110.2
Hybrid + ES91.613.687.513.688.728.5
Permutation91.5119.789.1445.388.41824.0
UHybrid91.136.088.054.286.8107.6
Hybrid + ES91.213.188.112.886.825.6
Permutation90.1119.590.3452.387.31823.9
BHybrid89.835.990.056.386.9108.5
Hybrid + ES90.013.690.014.387.228.2
m = 250m = 500m = 1000
PowerTimeaPowerTimeaPowerTimea
Permutation79.8102.980.1380.678.21509.0
NHybrid80.326.080.047.078.091.5
Hybrid + ES80.48.679.913.078.326.4
Permutation80.0102.878.6379.975.71485.3
UHybrid80.525.878.445.674.886.3
Hybrid + ES80.67.978.312.074.722.3
Permutation76.799.576.8378.672.71450.9
BHybrid77.124.876.846.373.086.6
Hybrid + ES77.08.377.014.272.726.7
Permutation92.1121.287.8441.889.11830.8
NHybrid91.736.887.754.788.4110.2
Hybrid + ES91.613.687.513.688.728.5
Permutation91.5119.789.1445.388.41824.0
UHybrid91.136.088.054.286.8107.6
Hybrid + ES91.213.188.112.886.825.6
Permutation90.1119.590.3452.387.31823.9
BHybrid89.835.990.056.386.9108.5
Hybrid + ES90.013.690.014.387.228.2

aElapsed times in minutes on a 3.2 GHz Pentium 4 computer.

The power column gives the percent of data sets when change-points were detected (100 - power is the percent of missed change-points). The description of the other elements of the table is the same as the one for Table 1. The times are the total time to segment the 1000 data sets of each type.

Table 2.

Segmentation results when there is a change

m = 250m = 500m = 1000
PowerTimeaPowerTimeaPowerTimea
Permutation79.8102.980.1380.678.21509.0
NHybrid80.326.080.047.078.091.5
Hybrid + ES80.48.679.913.078.326.4
Permutation80.0102.878.6379.975.71485.3
UHybrid80.525.878.445.674.886.3
Hybrid + ES80.67.978.312.074.722.3
Permutation76.799.576.8378.672.71450.9
BHybrid77.124.876.846.373.086.6
Hybrid + ES77.08.377.014.272.726.7
Permutation92.1121.287.8441.889.11830.8
NHybrid91.736.887.754.788.4110.2
Hybrid + ES91.613.687.513.688.728.5
Permutation91.5119.789.1445.388.41824.0
UHybrid91.136.088.054.286.8107.6
Hybrid + ES91.213.188.112.886.825.6
Permutation90.1119.590.3452.387.31823.9
BHybrid89.835.990.056.386.9108.5
Hybrid + ES90.013.690.014.387.228.2
m = 250m = 500m = 1000
PowerTimeaPowerTimeaPowerTimea
Permutation79.8102.980.1380.678.21509.0
NHybrid80.326.080.047.078.091.5
Hybrid + ES80.48.679.913.078.326.4
Permutation80.0102.878.6379.975.71485.3
UHybrid80.525.878.445.674.886.3
Hybrid + ES80.67.978.312.074.722.3
Permutation76.799.576.8378.672.71450.9
BHybrid77.124.876.846.373.086.6
Hybrid + ES77.08.377.014.272.726.7
Permutation92.1121.287.8441.889.11830.8
NHybrid91.736.887.754.788.4110.2
Hybrid + ES91.613.687.513.688.728.5
Permutation91.5119.789.1445.388.41824.0
UHybrid91.136.088.054.286.8107.6
Hybrid + ES91.213.188.112.886.825.6
Permutation90.1119.590.3452.387.31823.9
BHybrid89.835.990.056.386.9108.5
Hybrid + ES90.013.690.014.387.228.2

aElapsed times in minutes on a 3.2 GHz Pentium 4 computer.

The power column gives the percent of data sets when change-points were detected (100 - power is the percent of missed change-points). The description of the other elements of the table is the same as the one for Table 1. The times are the total time to segment the 1000 data sets of each type.

A final set of simulations were done to represent the case of multiple change-points in a chromosome. We reproduce a subset of the cases considered in the second set of simulations in Olshen et al. (2004). There are 497 markers in the chromosome with 6 change-points and the mean log-ratios (μ) for the markers given by:

Marker1–138–225–242–299–308–332–
i137225241298307331497
μ  i−0.180.081.07−0.530.16−0.69−0.16
Marker1–138–225–242–299–308–332–
i137225241298307331497
μ  i−0.180.081.07−0.530.16−0.69−0.16
Marker1–138–225–242–299–308–332–
i137225241298307331497
μ  i−0.180.081.07−0.530.16−0.69−0.16
Marker1–138–225–242–299–308–332–
i137225241298307331497
μ  i−0.180.081.07−0.530.16−0.69−0.16

One thousand observed log-ratio data sets were generated as c* μ + Z, where Z is standard normal and the scale factor c is 10 or 5 to represent low and high noise scenarios. The data were segmented using the permutation, the hybrid and the hybrid with early stopping procedures using α = 0.01 and 0.05. The number of change-points detected by the three procedures are reported in Table 3 and show that the three are nearly identical. These numbers are consistent with the results in Olshen et al. (2004). The number of data sets for which all three procedures identified the same set of change-points are:

Table 3.

The number of change-points detected by the three methods

cα#67891011+Timea
P9184536100667.5
100.01H9144639100123.5
H+ES915483610020.5
P7126716724237684.2
100.05H7196616124237139.7
H+ES709691672424723.3
P85910335300669.2
50.01H85910334310125.9
H+ES859993840031.1
P57219016841218685.4
50.05H57418416942247142.1
H+ES5711901704021833.7
cα#67891011+Timea
P9184536100667.5
100.01H9144639100123.5
H+ES915483610020.5
P7126716724237684.2
100.05H7196616124237139.7
H+ES709691672424723.3
P85910335300669.2
50.01H85910334310125.9
H+ES859993840031.1
P57219016841218685.4
50.05H57418416942247142.1
H+ES5711901704021833.7

aElapsed times in minutes on a 3.2 GHz Pentium 4 computer. The times are the total time to segment the 1000 data sets of each type.

Table 3.

The number of change-points detected by the three methods

cα#67891011+Timea
P9184536100667.5
100.01H9144639100123.5
H+ES915483610020.5
P7126716724237684.2
100.05H7196616124237139.7
H+ES709691672424723.3
P85910335300669.2
50.01H85910334310125.9
H+ES859993840031.1
P57219016841218685.4
50.05H57418416942247142.1
H+ES5711901704021833.7
cα#67891011+Timea
P9184536100667.5
100.01H9144639100123.5
H+ES915483610020.5
P7126716724237684.2
100.05H7196616124237139.7
H+ES709691672424723.3
P85910335300669.2
50.01H85910334310125.9
H+ES859993840031.1
P57219016841218685.4
50.05H57418416942247142.1
H+ES5711901704021833.7

aElapsed times in minutes on a 3.2 GHz Pentium 4 computer. The times are the total time to segment the 1000 data sets of each type.

α, c(0.01, 10)(0.01, 5)(0.05, 10)(0.05, 5)
identical985975977963
α, c(0.01, 10)(0.01, 5)(0.05, 10)(0.05, 5)
identical985975977963
α, c(0.01, 10)(0.01, 5)(0.05, 10)(0.05, 5)
identical985975977963
α, c(0.01, 10)(0.01, 5)(0.05, 10)(0.05, 5)
identical985975977963

The differences between the three procedures are minimal and could be partly explained by the differences in the stream of random numbers in their permutation component.

These simulation results clearly demonstrate that the new CBS algorithm with the hybrid P-value and stopping rule for declaring change early provide speed gains that make CBS a more practical approach for analyzing high resolutions arrays.

4 EXAMPLE

In this section we present two analyses of data from cell lines in order to evaluate how the improved algorithms perform on real data. The first is the analysis of the data from the ROMA (Lucito et al., 2003) array example shown in Figure 4 of Olshen et al. (2004). There are 9820 markers in this array with the maximum of 824 on chromosome 2. The data were segmented with all three procedures at an α level of 0.01. We used the hybrid P-value when the segment being tested had more than 200 markers and used k = 25. All three algorithms found the same 48 segments in the genome as can be seen in Figure 3. The original CBS algorithm segmented the data in 347 s, which was reduced to 44 s with the hybrid P-value and further reduced to 13 s with the inclusion of the stopping rule to declare a change early.

The ROMA array on a breast cancer cell line shown in Olshen et al. (2004). The segments by the original full permutation P-value (black), the new hybrid p-value (red) and the hybrid P-value along with the early stopping to declare change (green).
Fig. 3.

The ROMA array on a breast cancer cell line shown in Olshen et al. (2004). The segments by the original full permutation P-value (black), the new hybrid p-value (red) and the hybrid P-value along with the early stopping to declare change (green).

The second example is the analysis of data from three breast cancer cell lines (MCF7, SKBR3 and ZR75) evaluated using the Affymetrix 100k SNP platform. These data are available on the SNPscan website [http://pevsnerlab.kennedykrieger.org/snpscan.htm; Ting et al. (2006)]. These 100k SNP array data have a total of 115571 markers over 23 chromosomes (none from Y) with chromosome 2 having the maximum number (10352). The data were segmented using all three procedures. The numbers of change-points detected, the number identical and the total computing time per procedure are in the following table.

Cell lineTime (min)
MCF7SKBR3ZR75
P2202422707085.8
H217242271100.2
H+ES21624326925.0
Identical213242263
Cell lineTime (min)
MCF7SKBR3ZR75
P2202422707085.8
H217242271100.2
H+ES21624326925.0
Identical213242263
Cell lineTime (min)
MCF7SKBR3ZR75
P2202422707085.8
H217242271100.2
H+ES21624326925.0
Identical213242263
Cell lineTime (min)
MCF7SKBR3ZR75
P2202422707085.8
H217242271100.2
H+ES21624326925.0
Identical213242263

Detailed breakdown of the number of change-points by cell line and chromosome are provided in the supplementary table. This example demonstrates the speed gain (total computing time for the three cell lines is nearly 5 days for the original versus 100 min for hybrid alone and 25 min for the hybrid with early stopping) the new procedures accomplish without sacrificing efficacy.

5 DISCUSSION

We developed the CBS algorithm as a robust non-parametric method for segmenting array CGH data in order to facilitate the identification of chromosomal regions of gain or loss. This method has been successfully applied in several studies to characterize the copy number variation in different types of cancer (Aguirre et al., 2004; Brennan et al., 2004; Chen et al., 2006; Zhao et al., 2005). In their comparative study Lai et al., (2005) found that the CBS algorithm performed consistently well but was slow. In this manuscript, we presented the hybrid approach to compute the P-value for the maximal statistic, and added a stopping rule to declare a change early, in order to increase speed. The simulation study we conducted shows that the improved methods perform as well as the full permutation approach and results in the desired speed gains. The similarity of the results among the methods was demonstrated on real data from breast cancer cell lines. We conclude that the CBS algorithm is a natural choice for the analysis of high resolution array CGH data.

ACKNOWLEDGEMENT

We thank Mike Wigler for the breast cancer cell line data and Glenn Heller for helpful discussions.

Conflict of Interest: none declared.

REFERENCES

Aguirre
AJ
et al.
,
High-resolution characterization of the pancreatic adenocarcinoma genome.
Proc. Natl. Acad. Sci. USA
,
2004
, vol.
101
(pg.
9067
-
9072
)
Brennan
C
et al.
,
High-resolution global profiling of genomic alterations with long oligonucleotide microarray.
Cancer Res.
,
2004
, vol.
64
(pg.
4744
-
4748
)
Chen
W
et al.
,
Array comparative genomic hybridization reveals genomic copy number changes associated with outcome in diffuse large B-cell lymphomas.
Blood
,
2006
, vol.
107
(pg.
2477
-
2485
)
du Manoir
S
et al.
,
Detection of complete and partial chromosome gains and losses by comparative genomic in situ hybridization.
Human Gene.
,
1993
, vol.
90
(pg.
590
-
610
)
Gentleman
RC
et al.
,
Bioconductor: Open software development for computational biology and bioinformatics
Genome Biol.
,
2004
, vol.
5
pg.
R80
Kallioniemi
A
et al.
,
Comparative genomic hybridization for molecular cytogenetic analysis of solid tumors.
Science
,
1992
, vol.
258
(pg.
818
-
821
)
Lai
WR
et al.
,
Comparative analysis of algorithms for identifying amplifications and deletions in array CGH data
Bioinformatics
,
2005
, vol.
21
(pg.
3763
-
3770
)
Lucito
R
et al.
,
Representational oligonucleotide microarray analysis: a highresolution method to detect genome copy number variation.
Genome Res.
,
2003
, vol.
13
(pg.
2291
-
2305
)
Olshen
AB
et al.
,
Circular binary segmentation for the analysis of array-based DNA copy number data.
Biostatistics
,
2004
, vol.
5
(pg.
557
-
572
)
Pinkel
D
Albertson
D
,
Array comparative genomic hybridization and its application in cancer.
Nature Gene.
,
2005
, vol.
37
(pg.
S11
-
S17
(Suppl. S)
R Development Core Team.
R: A Language and wEnvironment for Statistical Computing
,
2006
Vienna, Austria
R Foundation for Statistical Computing
Siegmund
DO
,
Approximate tail probabilities for the maxima of some random fields.
Ann. of Probab.
,
1988
, vol.
16
(pg.
487
-
501
)
Ting
JC
et al.
,
Analysis and visualization of chromosomal abnormalities in SNP data with SNPscan.
BMC Bioinforma
,
2006
, vol.
7
pg.
25
Whang-Peng
J
et al.
,
Specific chromosome defect associated with human smallcell lung cancer; deletion 3p(14-23).
Science
,
1982
, vol.
215
4529
(pg.
181
-
182
)
Willenbrock
H
Fridlyand
J
,
A comparison study: applying segmentation to array CGH data for downstream analyses.
Bioinformatics
,
2005
, vol.
21
(pg.
4084
-
4091
)
Worsley
KJ
,
An improved Bonferroni inequality and applications.
Biometrika
,
1982
, vol.
69
(pg.
297
-
302
)
Yao
Q
,
Large deviations for boundary crossing probabilities of some random fields
J. Math. Res. Exposit.
,
1989
, vol.
9
(pg.
181
-
192
)
Yao
Q
,
Tests for change-points with epidemic alternatives.
Biometrika
,
1993
, vol.
80
(pg.
179
-
191
)
Zhao
XJ
et al.
,
Homozygous deletions and chromosome amplifications in human lung carcinomas revealed by single nucleotide polymorphism array analysis.
Cancer Res.
,
2005
, vol.
65
(pg.
5561
-
5570
)

Author notes

Associate Editor: Chris Stoeckert