Abstract
Array comparative genomic hybridization (aCGH) is a laboratory technique to measure chromosomal copy number changes. A clear biological interpretation of the measurements is obtained by mapping these onto an ordinal scale with categories loss/normal/gain of a copy. The pattern of gains and losses harbors a level of tumor specificity. Here, we present WECCA (weighted clustering of called aCGH data), a method for weighted clustering of samples on the basis of the ordinal aCGH data. Two similarities to be used in the clustering and particularly suited for ordinal data are proposed, which are generalized to deal with weighted observations. In addition, a new form of linkage, especially suited for ordinal data, is introduced. In a simulation study, we show that the proposed cluster method is competitive to clustering using the continuous data. We illustrate WECCA using an application to a breast cancer data set, where WECCA finds a clustering that relates better with survival than the original one.
1 INTRODUCTION
Chromosomal aberrations are a key event in the development and progression of cancer (Lengauer and others, 1998). Invasive ductal carcinoma forms the largest group of invasive breast tumors and is a heterogenous group of tumors. Subclassification of ductal invasive breast tumors is problematic since no specific histological characteristics were identified. The WHO envisions that subclassification through cDNA expression arrays may aid in the prediction of clinical outcome (Tavassoli and Devilee, 2003). As the position of chromosomal aberrations both varies from tumor to tumor and carries a certain tumor specificity, clustering of the chromosomal aberration data of breast cancer tumors may, as gene expression data, aid in the discovery of new subclasses. Hardly any cluster method dedicated to chromosomal aberration data has been proposed. We present WECCA, weighted clustering of called aCGH data.
Chromosomal DNA copy number is the number of copies of genomic DNA. Normal somatic cells have 2 copies of the autosomal chromosomes: the copy number is 2. In addition, cells of normal males have one copy of the X and the Y chromosome. Nuclei of Down syndrome patients show an extra copy of chromosome 21, whereas in cancer tissue the copy number may vary considerably over the genome.
Among others, Solinas-Toldo and others (1997) and Pinkel and others (1998) showed that DNA copy number can be measured genome-wide using array comparative genomic hybridization (aCGH). aCGH is a high-throughput technique, similar to the gene expression microarray but using chromosomal DNA rather than cDNA for the hybridization. The array consists of many (possibly synthetic) strands of DNA, here referred to as clones, that interrogate small regions of the genomic DNA (Pinkel and Albertson, 2005).
In an aCGH experiment, differently labeled test and reference samples are hybridized together to the array. The reference sample is assumed to have copy number 2 for all somatic chromosomes. Image analysis then results in test and reference intensities for all array elements. Under ideal experimental conditions, the intensity of an array element is linearly proportional to the abundance of the corresponding DNA sequence in the sample. The log2 ratio of the test and the reference intensities reflects the relative copy number in the test sample compared to that in the reference sample.
The log2 ratios undergo 3 preprocessing steps before arriving at the actual copy number. The first preprocessing step is normalization (Khojasteh and others, 2005; Neuvial and others, 2006). Normalization corrects for experimental artifacts in order to make the log2 ratios from different hybridizations comparable. The second step of the preprocessing, named segmentation, is motivated by the underlying discrete DNA copy numbers of test and reference samples. Segmentation algorithms divide the genome into nonoverlapping segments that are separated by break points (Olshen and others, 2004; Picard and others, 2005). These break points indicate a change in DNA copy number. Clones that belong to the same segment are assumed, as they are not separated by a break point, to have the same underlying chromosomal copy number. As a final and last preprocessing step, referred to as calling (Wang and others, 2004; Engler and others, 2006), the DNA copy number of each segment is determined. At present, calling algorithms cannot determine whether there are, say, 3 or 4 copies present. They can, however, detect deviations from the normal copy number and classify each segment as “normal,” “loss,” “gain,” or “amplification”: “normal” if there are 2 copies of the chromosomal segment are present, “loss” (also named deletion) if at least one copy is lost, “gain” if at least one additional copy is present, and “amplification” if there is high level, say > 5, copy number. Preprocessing thus maps the raw log2 intensity values onto the ordinal scale of calls.
The starting point of the work presented here is the preprocessed, called aCGH data. The called data are on the ordinal scale consisting of the values loss, normal, gain (and amplification if desired). Ordinal measurement only possesses the characteristics of “distinctiveness” (different values are assigned to objects that have different values of the property being measured) and “ordering in magnitude” (assigned values indicate an ordering in magnitude, with larger values representing more of the property being measured), and not the characteristics “equal intervals” and “absolute zero” which most continuous measurements do (Allen and Yen, 1979).
Usually, conversion of continuous data to categorical data is associated with information loss. For aCGH data, however, there are convincing reasons to convert the continuous intensities to the ordinal copy number calls. First, the DNA copy numbers, the phenomena under study, are of a discrete nature. The continuous intensities are only intermediates, indicative of the discrete copy numbers. Transforming the intensities into calls is thus a form of “de-noising.” Second, the called data have a clear biological meaning, lacked by the log2 ratios. aCGH profiles from different platforms can be compared directly when using the called data: the interpretation of loss, normal, gain, and amplification is the same across platforms, whereas the log2 ratios are likely to have different interpretations between platforms, possibly even between experiments. The calling could thus be viewed as a final between-array normalization step. Moreover, the called data can be used directly for corroboration with existing knowledge of genomic aberrations, which is expressed in terms of copy numbers, not in terms of log2 ratios. This also applies to the verification of findings from an aCGH experiment which is often done by fluorescent in situ hybridization, which counts the copy number of a genomic segment of interest. Third, in this paper, we show that it need not be disadvantageous to work with the ordinal data. And finally, in the near future, calling algorithms are likely to have an improved resolution, mapping the log2 ratios to a scale better resembling the copy numbers. Methods designed for the ordinal data will benefit from such an improved resolution.
Cluster analysis has the potential to be very useful in the analysis of aCGH data (Jong and others, 2007), as it has been for the analysis of gene expression data (Eisen and others, 1998). Cluster analysis seeks meaningful data-determined groupings of objects (clones or samples), such that the objects within groups are more “similar” than the objects across groups, based on the assumption that this similarity in DNA copy number implies some form of regulatory or functional similarity of clones (or samples).
Cluster methods used in the analysis of expression data may not be appropriate for clustering of the ordinal aCGH data, due to the ordinal nature of the called data and the break-point structure of aCGH profiles. Central to cluster analysis is the notion of distance, or similarity measure for noncontinuous data, between objects being clustered. Distances are not suitable for ordinal data as they assume—in addition to the properties of ordinal measurement—that equal distances between measurement values represent equal distances in the property being measured. For example, the difference in copy number between a loss and a normal is not (necessarily) equal to the difference in copy number between a normal and a gain, as a loss corresponds to either 0 or 1 chromosome copy and a gain to 3 or more. The use of a distance measure in clustering of aCGH data imposes this structure on the ordinal DNA copy number data. Moreover, the distance between clusters, say, a Euclidean distance of 6.12, which has intuitive meaning, has no clear interpretation translatable to practical implications.
In this article, we develop a dedicated cluster method for clustering of samples on the basis of their aCGH profile (this is not to be confused with clustering of clones as calling is sometimes referred to). To deal with the discrete nature of the data, we use similarity measures suited to ordinal data, addressing its 2 properties, distinctiveness and ordering in magnitude. These similarities are motivated from a model for aCGH data, which facilitates interpretation. For instance, the consequences of merging 2 clusters become clear from the similarity measure. Furthermore, the similarity measure is straightforwardly modified such that the medical researcher can assign certain clones or chromosomal regions to have more influence on the clustering. This acknowledges the fact that small-area aberrations may have a relatively large impact on tumor development. Moreover, preliminary exclusion of “dull” regions, regions with few aberrations over the samples, may prevent the algorithm from clustering on irrelevant features. We discuss several forms of linkage, which dictates how the similarity between clusters (instead of samples) is measured. Besides the well-known average and complete linkage, we introduce “total” linkage, especially suited for the data type under study. Finally, we evaluate the developed clustering method in a simulation and illustrate WECCA on a breast cancer data set from Fridlyand and others (2006). WECCA yields clusterings between which differences in survival are more significant than those between the clusters originally found.
To our knowledge, we are the first to propose a cluster method designed for called aCGH data. Liu and others (2006) discuss clustering of classical CGH data. They design a “segment-based similarity” Sim (and variants thereof) to deal with the CGH data. These are then used in classical hierarchical and k-means cluster methods. However, many papers in the medical literature do cluster aCGH data. This is done using either the log2 ratios or the segmentation states and standard hierarchical clustering, for example with the Euclidean distance and average linkage (Jong and others, 2007). Recently, Zhang and others (2006) proposed to use the Hamming distance for clustering of categorical data, which could be applied to aCGH data. Other methods of potential use for the clustering of called aCGH data can be found within the field of artificial intelligence, where the literature on clustering methods for categorical data is growing rapidly (Gibson and others, 1998; Huang, 1998; Ganti and others, 1999; Guha and others, 2000; Meila and Heckerman, 2001; Barbará and others, 2002; Andritsos and others, 2004; Zhong and Ghosh, 2004; and many more).
WECCA is much more tailor-made for aCGH data than these methods. WECCA addresses the ordinal, rather than the nominal, nature of the data through 2 interpretable similarities. Both similarities do not assume independence between clones as is done by other methods (e.g., Barbará and others, 2002). Moreover, WECCA allows for the weighting of clones or chromosomal regions by prior information on chromosomes, or alternatively through a data driven procedure. The 2 weighting methods can be combined. Finally, desirable mathematical properties of WECCA are derived from a simple nonparametric model.
2 MODEL
Here, a nonparametric model of the aCGH data is introduced. Within the context of the model, the similarities used in the clustering are defined and get their clear interpretation. Moreover, the model is employed to show that the similarities can be unbiasedly estimated, as well as to prove properties of the similarities. Let Xij be the DNA copy number of the jth (j = 1,…,p) clone of sample i (i = 1,…,n). Implicitly, we have assumed that all samples are hybridized on the same platform and the clones match up. If this assumption does not hold, the clones should be matched using, for example, the matching technique described in Van Wieringen and others (2006). The Xij are measured on an ordinal scale with categories {1,…,a} corresponding to, say, {loss, normal, gain, amplification}. We assume that the samples stem from m unobserved clusters; the number of clusters need not be specified. Let Zi be a latent random variable that indicates from which group sample i stems, and denote the probability that sample i stems from group g by P(Zi = g) = θg. We assume that samples are independent. We do not assume independence between clones because the copy number count of 2 or more clones (not necessarily adjacent) may be dominated by the same underlying biological mechanism. Instead, we assume clones are divided into r exhaustive and mutually exclusive groups B1,…,Br and assume independence between the groups. Note that this grouping is neither observed nor reconstructed, only hypothesized. Let τs = P(j∈Bs) be the probability that clone j belongs to group Bs for s = 1,…,r. We then write the probability that the DNA copy number Xij of sample i and clone j, given the cluster of sample i and the group of clone j, equals k as π(k|g,s). The probability that Xij equals k can then be written as

3 SIMILARITY MEASURES
In this section, we introduce 2 similarity measures. One addresses the distinctiveness property of ordinal data, and the other the ordering by magnitude property.
3.1 Similarity measure I
The first similarity measure we propose is inspired by the distinctiveness property of ordinal measurements. This property is addressed by the concept of agreement (De Mast and Van Wieringen, 2007). The measurements on a clone of 2 samples agree if they are identical. The similarity measure is then the probability that measurements of an arbitrary clone j of samples i1 and i2 agree:


The probability of agreement satisfies all properties of a similarity measure, as given in Kaufman and Rousseeuw (1990): (1) P(Xi1j = Xi2j)∈[0,1], (2) P(Xij = Xij) = 1, and (3) P(Xi1j = Xi2j) = P(Xi2j = Xi1j). Zero is—at least theoretically—the lower bound of the probability of agreement. In practice, it seems reasonable to consider 1/a as the lower bound of the similarity measure. For, suppose the distribution of the Xij is independent of the groups: π(k|g,s) = π(k|s) for all g. Then, the probability of agreement reduces to:

In Section D of the supplementary material available at Biostatistics online (http://www.biostatistics. oxfordjournals.org) it is shown that similarity I is consistent, which is a favorable property for a cluster method like WECCA. A similarity measure is consistent if—in expectation—2 samples are more similar if they stem from the same group than if they do not. Hence, we expect the probability of the agreement between the measurements Xi1j and Xi2j of clone j of samples i1 and i2 to be higher if the 2 samples stem from the same group than if they stem from different groups.
This similarity is also location and scale invariant: any monotonic transformation of the labels 1,…,a results in the same values of the similarity. Note that, for instance, the Euclidean distance is not scale invariant.
The similarity I between samples i1 and i2 is estimated by

3.2 Similarity measure II
The second similarity measure is inspired by the ordering in magnitude property of ordinal measurements. The concept of concordance reflects whether measurements on 2 samples have the same ordering in magnitude (Kendall and Gibbons, 1990; De Mast and Van Wieringen, 2006). The traditional definition of concordance discards clone pairs from contributing to concordance if they have equal DNA copy numbers for at least one of the samples. In this paper, we include clone pairs with equal copy number, as we consider this an indication of a shared underlying mechanism. Two samples are concordant if their DNA copy numbers of clone A are larger than those of clone B, or both DNA copy numbers of clone A are smaller than those of clone B, or clones A and B have the same DNA copy numbers for both samples. Then, as a second similarity measure for ordinal data we propose the probability that 2 samples are in concordance for 2 randomly selected clones j1 and j2:

We estimate the probability of concordance as follows:


4 WEIGHTING
It may be biologically relevant to allow some clones (or chromosomal regions) to have a larger influence on the clustering. For instance, because these clones are on a chromosomal region that is known to be of more importance to the phenomenon (disease) under study than other chromosomal regions, or because some chromosomes have a higher gene density than others, or because certain locations often have homozygous deletions. Therefore, we modified the similarity measures I and II such that these clones can be assigned a higher or lower weight in the clustering algorithm. Two methods of weighting are proposed.
4.1 Per clone
The first method asks the user to specify a weight for each clone. Let wj ≥ 0 be the weight assigned to clone j = 1,…,p. The weighted counterpart of similarity I is then estimated as



4.2 Per region
An alternative, natural and data-driven weighting is achieved when using regions instead of clones. A region is a series of neighboring clones on the chromosome whose aCGH signature is shared by all clones (Van de Wiel and Van Wieringen, 2007). The ordinal nature of the aCGH data allows for a huge dimension reduction, as the data assume only a limited number of values (3 or 4) and many segments on the genome are likely to have the same values over consecutive clones. A region may consist of one clone representing a small amplification, but also a complete chromosome arm. Hence, the regions capture the relevant features of the data: the break-point information and the copy number count. When using regions rather than clones in the clustering, each region, regardless of its coverage of the genome, contributes equally.
CGHregions (Van de Wiel and Van Wieringen, 2007) is an algorithm that constructs regions. To achieve region delineation, it starts from a large p×n matrix containing the called data and reduces the size of this matrix significantly by determining spatially adjacent sets of clones which for every sample are (almost) constant within the sample. Hence, CGHregions is a multisample approach and these spatially adjacent sets of clones are shared by all samples. In tumor genesis, transitions (from one category, e.g., loss, to another, e.g., normal) occur at different locations for different samples. However, Van de Wiel and Van Wieringen (2007) observed that, in general, many samples representing the same type of tissue have a large overlap in spatially adjacent sets of clones that do not change. We denote the constructed spatially adjacent sets of clones as regions. Clones in a region are not required to be constant “across” samples. This region algorithm is based on 2 principals: (1) the maximum distance between any 2 clone signatures in a region should not exceed c, where (2) c is tuned such that the information loss due to representing each region by its mediod signature is less than 1%. The result is a rectangular matrix R×n where R is a fraction of p and a row is the mediod signature of all clones in that region. Reduction of the number of items may vary between 10 and 250 times depending on the heterogeneity of transition locations in the samples. Van de Wiel and Van Wieringen (2007) contains a detailed discussion of the region construction algorithm and its potential benefits for downstream analysis.
The regions themselves can be weighted, as was done for clones in Section 4.1. When using regions constructed with c = 0 (any transition introduces a new region) and weights equal to the number of covered clones, the resulting clustering is identical to the nonweighted clustering of clones, but the former is computationally much faster.
4.3 Sim
The above weighting approaches can be used to show that the “segment-based similarity” Sim of Liu and others (2006) is a special case of the proposed weighted agreement similarity. Liu and others (2006) discuss other measures, which are variants of Sim. We limit our discussion to Sim as it is the most widely used of all their measures (Liu and others, 2006). The segment-based similarity Sim between 2 CGH or aCGH profiles is defined as the number of overlapping segment pairs. Liu and others (2006) define a segment as a contiguous block of aberrations of the same type. Two segments are overlapping if they have at least one common genomic interval of the same aberration type.
To arrive at Sim from the proposed approach, first construct regions, by applying CGHregions with no information loss (c = 0) to the 2 samples for which Sim is to be calculated. Now, assign weights to the constructed regions as follows. If a region is aberrated in both samples give it weight 1, if it is not aberrated in one of the samples give it weight 0. Use these weights in the calculation of the weighted agreement similarity as given in (4.1). This equals Sim.
5 LINKAGE
At each step of WECCA 2 clusters are merged. The resulting cluster consists of more than one sample. This calls for a modification of the definition of the similarity measures, as they are defined between samples and not between clusters (compare the several types of linkage in hierarchical clustering). Apart from standard linkages as average and complete linkage (see Section F of the supplementary material available at Biostatistics online, http://www.biostatistics.oxfordjournals.org), we introduce a new type of linkage, which we call total linkage.
5.1 Total linkage for agreement
Total linkage for the weighted similarity I is defined as the probability of all samples agreeing. This probability is estimated by

5.2 Total linkage for concordance
Total linkage for similarity II is defined as the probability of all samples being in concordance. The components of this probability are unbiasedly estimated by


6 HIERARCHIAL CLUSTERING
The proposed similarity measures are used for hierarchial clustering, as this is the most widely used clustering method for genomic data. Therefore, WECCA consists of the following steps:
Step 1: Assign weights to each clone, or construct the regions and (optional) assign weights to regions.
Step 2: Form the initial clusters, each containing one sample.
Step 3: Calculate the similarity between all cluster pairs.
Step 4: Merge the 2 clusters with the highest similarity.
Step 5: Iterate between Steps 3 and 4 until one final cluster remains.
Step 6: Determine the cutoff for the obtained dendrogram.
The cutoff is chosen such that it results in a clustering with compact and well-separated clusters. This is often done by plotting a statistic that evaluates each step in the hierarchy of the dendrogram (the effect of a new cluster split), for instance, through a ratio of the within-cluster similarity and the between-cluster similarity. In this plot, one searches for a “knee,” which indicates a qualitative change in behavior. The number of clusters at which the knee is observed indicates the optimal cutoff, and clustering (Halkidi and others, 2001).
The proposed similarity measures can be used in combination with many types of cluster algorithms. We briefly illustrate this for k-means clustering. First, assign the samples randomly to the k clusters. The n samples are then scanned, one at a time. The sample's (average or complete) similarity to all k clusters is calculated, and the sample is moved to the cluster with the highest similarity if that cluster is different from the one to which it already belongs. This process is repeated until no samples are moved or the number of iterations exceeds its maximum.
7 SIMULATION
WECCA is evaluated in a simulation study. Within the context of a simulation, the correct or true clustering is known. Cluster methods can then be evaluated by comparing the resulting unsupervised clusterings with the true clustering, which is done by external validation measures. External validation measures which measure the consensus between the correct and unsupervised clustering and are independent of the similarity used in the clustering, are used to assess the performance of the cluster methods. For an overview of external validation measures, see Halkidi and others (2001) or Handl and others (2005).
We use the simulation setup of Willenbrock and Fridlyand (2005), which is claimed to be more “realistic” than those in older studies, as real data are used to emulate variable copy numbers and segment lengths. We generated 100 data sets, with varying sample sizes (20 and 40), whose aCGH profiles consist of one chromosome of 500 clones using 2, 3, and 4 different state layouts (read: clusters), randomly assigned to the samples. The resulting log2 ratios are median normalized. The normalized data are segmented using DNAcopy (Olshen and others, 2004), which in turn are called using CGHcall (Van de Wiel and others, 2007). The called data are summarized into region data using CGHregions (Van de Wiel and Van Wieringen, 2007). Each resulting data set, continuous, called, and region, is hierarchically clustered using appropriate distances and linkages. The called data are also clustered using the Sim measure of Liu and others (2006). The resulting dendrograms are cut such that the resulting clustering yields 2 clusters (3 or 4 if applicable). Each clustering is compared to the true clustering using 6 external validation measures: proportion of overlapping, average proportion of overlapping, unadjusted Rand index, adjusted Rand index, F-measure, and normalized mutual information (details are given in Section H of the supplementary material available at Biostatistics online, http://www.biostatistics.oxfordjournals.org). All indices range between 0 and 1, 1 being optimal. Table 1 contains the simulation results involving 2 clusters of the cluster methods using the continuous, called, and region data. More results are given in Section I of the supplementary material available at Biostatistics online (http://www.biostatistics.oxfordjournals.org).
Simulation results for 2 state layouts. The column “Data” indicates the data type used; the columns “similarity” and “linkage” specify the similarity and linkage used, respectively; and the column “Samples” gives the number of aCGH profiles in each data set. The remaining columns contain the mean and standard deviation (in brackets) of the external validation measures: the proportion of overlapping samples (Pop), the average proportion of overlapping samples (Apop), the unadjusted Rand index (Rand), the adjusted Rand index (Rand.adj), the F-measure (F), and the normalized mutual information (NMI)
| Data | Similarity | Linkage | Samples | Pop (s.d.) | Apop (s.d.) | Rand (s.d.) | Rand.adj (s.d.) | F (s.d.) | NMI (s.d.) |
| Continuous | Euclidean | Average | 40 | 0.91 (0.08) | 0.91 (0.08) | 0.85 (0.13) | 0.69 (0.26) | 0.91 (0.08) | 0.70 (0.23) |
| Continuous | Euclidean | Complete | 40 | 0.91 (0.08) | 0.91 (0.08) | 0.84 (0.14) | 0.68 (0.27) | 0.90 (0.09) | 0.70 (0.24) |
| Continuous | Pearson | Average | 40 | 0.97 (0.05) | 0.97 (0.05) | 0.95 (0.09) | 0.89 (0.18) | 0.97 (0.05) | 0.89 (0.18) |
| Continuous | Pearson | Complete | 40 | 0.97 (0.05) | 0.97 (0.05) | 0.95 (0.09) | 0.90 (0.18) | 0.97 (0.06) | 0.89 (0.18) |
| Called | Sim | Average | 40 | 0.73 (0.17) | 0.67 (0.16) | 0.75 (0.14) | 0.51 (0.28) | 0.70 (0.18) | 0.62 (0.23) |
| Called | Sim | Complete | 40 | 0.55 (0.12) | 0.47 (0.18) | 0.51 (0.10) | 0.17 (0.17) | 0.50 (0.11) | 0.29 (0.17) |
| Called | Agreement | Average | 40 | 0.91 (0.09) | 0.91 (0.09) | 0.85 (0.14) | 0.70 (0.29) | 0.91 (0.09) | 0.72 (0.25) |
| Called | Agreement | Complete | 40 | 0.89 (0.08) | 0.90 (0.08) | 0.82 (0.14) | 0.64 (0.27) | 0.89 (0.09) | 0.66 (0.24) |
| Called | Agreement | Total | 40 | 0.85 (0.09) | 0.86 (0.08) | 0.76 (0.13) | 0.52 (0.25) | 0.85 (0.09) | 0.56 (0.21) |
| Called | Concordance | Average | 40 | 0.94 (0.08) | 0.94 (0.08) | 0.89 (0.14) | 0.79 (0.27) | 0.94 (0.09) | 0.79 (0.24) |
| Called | Concordance | Complete | 40 | 0.92 (0.09) | 0.92 (0.08) | 0.87 (0.14) | 0.74 (0.28) | 0.92 (0.09) | 0.75 (0.25) |
| Called | Concordance | Total | 40 | 0.86 (0.09) | 0.87 (0.08) | 0.78 (0.13) | 0.55 (0.27) | 0.86 (0.09) | 0.59 (0.23) |
| Regions | Agreement | Average | 40 | 0.99 (0.03) | 0.99 (0.04) | 0.98 (0.06) | 0.95 (0.12) | 0.99 (0.04) | 0.94 (0.12) |
| Regions | Agreement | Complete | 40 | 0.96 (0.07) | 0.96 (0.07) | 0.93 (0.11) | 0.85 (0.22) | 0.96 (0.07) | 0.85 (0.20) |
| Regions | Agreement | Total | 40 | 0.91 (0.12) | 0.91 (0.11) | 0.85 (0.17) | 0.71 (0.34) | 0.90 (0.13) | 0.73 (0.29) |
| Regions | Concordance | Average | 40 | 0.99 (0.01) | 0.99 (0.01) | 0.98 (0.03) | 0.96 (0.06) | 0.99 (0.01) | 0.95 (0.08) |
| Regions | Concordance | Complete | 40 | 0.96 (0.09) | 0.96 (0.09) | 0.92 (0.12) | 0.87 (0.24) | 0.96 (0.09) | 0.87 (0.23) |
| Regions | Concordance | Total | 40 | 0.93 (0.11) | 0.94 (0.10) | 0.90 (0.16) | 0.80 (0.31) | 0.93 (0.11) | 0.81 (0.27) |
| Data | Similarity | Linkage | Samples | Pop (s.d.) | Apop (s.d.) | Rand (s.d.) | Rand.adj (s.d.) | F (s.d.) | NMI (s.d.) |
| Continuous | Euclidean | Average | 40 | 0.91 (0.08) | 0.91 (0.08) | 0.85 (0.13) | 0.69 (0.26) | 0.91 (0.08) | 0.70 (0.23) |
| Continuous | Euclidean | Complete | 40 | 0.91 (0.08) | 0.91 (0.08) | 0.84 (0.14) | 0.68 (0.27) | 0.90 (0.09) | 0.70 (0.24) |
| Continuous | Pearson | Average | 40 | 0.97 (0.05) | 0.97 (0.05) | 0.95 (0.09) | 0.89 (0.18) | 0.97 (0.05) | 0.89 (0.18) |
| Continuous | Pearson | Complete | 40 | 0.97 (0.05) | 0.97 (0.05) | 0.95 (0.09) | 0.90 (0.18) | 0.97 (0.06) | 0.89 (0.18) |
| Called | Sim | Average | 40 | 0.73 (0.17) | 0.67 (0.16) | 0.75 (0.14) | 0.51 (0.28) | 0.70 (0.18) | 0.62 (0.23) |
| Called | Sim | Complete | 40 | 0.55 (0.12) | 0.47 (0.18) | 0.51 (0.10) | 0.17 (0.17) | 0.50 (0.11) | 0.29 (0.17) |
| Called | Agreement | Average | 40 | 0.91 (0.09) | 0.91 (0.09) | 0.85 (0.14) | 0.70 (0.29) | 0.91 (0.09) | 0.72 (0.25) |
| Called | Agreement | Complete | 40 | 0.89 (0.08) | 0.90 (0.08) | 0.82 (0.14) | 0.64 (0.27) | 0.89 (0.09) | 0.66 (0.24) |
| Called | Agreement | Total | 40 | 0.85 (0.09) | 0.86 (0.08) | 0.76 (0.13) | 0.52 (0.25) | 0.85 (0.09) | 0.56 (0.21) |
| Called | Concordance | Average | 40 | 0.94 (0.08) | 0.94 (0.08) | 0.89 (0.14) | 0.79 (0.27) | 0.94 (0.09) | 0.79 (0.24) |
| Called | Concordance | Complete | 40 | 0.92 (0.09) | 0.92 (0.08) | 0.87 (0.14) | 0.74 (0.28) | 0.92 (0.09) | 0.75 (0.25) |
| Called | Concordance | Total | 40 | 0.86 (0.09) | 0.87 (0.08) | 0.78 (0.13) | 0.55 (0.27) | 0.86 (0.09) | 0.59 (0.23) |
| Regions | Agreement | Average | 40 | 0.99 (0.03) | 0.99 (0.04) | 0.98 (0.06) | 0.95 (0.12) | 0.99 (0.04) | 0.94 (0.12) |
| Regions | Agreement | Complete | 40 | 0.96 (0.07) | 0.96 (0.07) | 0.93 (0.11) | 0.85 (0.22) | 0.96 (0.07) | 0.85 (0.20) |
| Regions | Agreement | Total | 40 | 0.91 (0.12) | 0.91 (0.11) | 0.85 (0.17) | 0.71 (0.34) | 0.90 (0.13) | 0.73 (0.29) |
| Regions | Concordance | Average | 40 | 0.99 (0.01) | 0.99 (0.01) | 0.98 (0.03) | 0.96 (0.06) | 0.99 (0.01) | 0.95 (0.08) |
| Regions | Concordance | Complete | 40 | 0.96 (0.09) | 0.96 (0.09) | 0.92 (0.12) | 0.87 (0.24) | 0.96 (0.09) | 0.87 (0.23) |
| Regions | Concordance | Total | 40 | 0.93 (0.11) | 0.94 (0.10) | 0.90 (0.16) | 0.80 (0.31) | 0.93 (0.11) | 0.81 (0.27) |
The simulation results of the proposed clustering methods are discussed, followed by the results for the different data types, and finally the results for hierarchical clustering with Sim and with traditional continuous data. Cluster methods using the concordance similarity outperform the methods equipped with the agreement similarity. In addition, average linkage yields higher external validation indices than complete linkage, which in turn produces higher validation indices than total linkage. Both conclusions hold for called and region data, although the differences between linkages are smaller for the region data. Hence, the cluster method with average linkage and concordance similarity is to be preferred.
If the methods are compared over data types, called versus region, it is striking how much all proposed methods gain when using region data instead of the called data. For instance, the adjusted index of the clustering with average linkage and agreement similarity increases from 0.70 to 0.95.
Clustering with the Sim index of Liu and others (2006), having an in-built region-like (called segments) approach, does not do well compared to the proposed similarities with the region weighting. It is also outperformed by its “relative,” the agreement similarity. This may be due to the fact that the Sim expresses an unequal interest in the calls: only agreements of aberrated segments are counted. This discards the information contained in the normal segments.
The cluster method with concordance and average linkage using the region data performs slightly better than the continuous cluster methods for 2 clusters, whereas for 3 and 4 clusters it is competitive. But when taking the standard deviation of the estimates of the external validation measures into account, there is no clear winner between these methods.
In summary, the simulation results show that WECCA using regions and equipped with average linkage and the concordance similarity performs particularly well, and is competitive to standard clustering methods using continuous aCGH data.
8 BREAST CANCER DATA
The breast cancer data set from Fridlyand and others (2006) is used to illustrate WECCA. The data (publicly available) consist of 67 aCGH breast tumor profiles with 2041 clones (bacterial artificial chromosome and [bacteriophage] P1-derived artificial chromosome) making up a profile. Since we “validate” the cluster methods by associating the clusters with survival, we removed 12 samples that had neither overall survival nor disease-specific survival. Clones on the sex chromosomes and with more than 30% missing values were removed. Remaining missing values are imputed using the k-nearest neighbor method, median normalized, segmented using the circular binary segmentation method (Olshen and others, 2004), called using CGHcall (Van de Wiel and others, 2007), and summarized into regions by means of the method of Van de Wiel and Van Wieringen (2007). The resulting region data consist of 236 regions. Notwithstanding the good performance of the hierarchical clustering with concordance and average linkage in the simulation, here the results of WECCA with concordance and total linkage are presented (Figure 1) because this method produces more useful dendrograms than complete average methods, followed by average linkage methods. A similar point is made in Jain and others (1999). Moreover, total linkage yields more compact clusters (compare with Fridlyand and others (2006): Ward's linkage is used to obtain a compact clustering). The cutoff of the dendrogram results in 7 clusters (details are given in Section J of the supplementary material available at Biostatistics online, http://www.biostatistics.oxfordjournals.org).
Heatmap of the hierarchical clustering with the unweighted concordance and total linkage. Red indicates a loss, black a normal, and green a gain in DNA copy number. The regions are ordered according to their location on the genome, starting below with the first region on chromosome 1. The numbers on the right-hand side are the chromosome numbers. The blue–yellow bar in the left-hand side indicates the p and q arms of the chromosomes (if present in the data set). The labels at the bottom correspond to the order of the samples in the data set.
Heatmap of the hierarchical clustering with the unweighted concordance and total linkage. Red indicates a loss, black a normal, and green a gain in DNA copy number. The regions are ordered according to their location on the genome, starting below with the first region on chromosome 1. The numbers on the right-hand side are the chromosome numbers. The blue–yellow bar in the left-hand side indicates the p and q arms of the chromosomes (if present in the data set). The labels at the bottom correspond to the order of the samples in the data set.
To illustrate the potential of weighting, as introduced in Section 4.1, we repeat the clustering after assigning weights to each region. The weights are derived from a breast cancer study of Pollack and others (2002), who present in Table 5 of their Supplementary Material a list of genes (including potential and known oncogenes) that are overexpressed and amplified. Regions on chromosome arms 8q, 15q, 17q, and 20q, each containing more than 10% of the genes named in the aforementioned list, are assigned a 10-fold weight as opposed to regions on other arms. This weighting expresses the belief that the clustering is predominantly determined by regions on these chromosomal arms, and will result in stronger delineation of these clusters. If, however, there are clusters in the data determined by the (lesser weighted) regions on other parts of the genome, they will be less emphasized by the weighting. Cutting the resulting dendrogram (displayed in Section J of the supplementary material available at Biostatistics online, http://www.biostatistics.oxfordjournals.org) yields 6 clusters.
Here, the clusterings of WECCA are compared to that reported by Fridlyand and others (2006). In the latter, hierarchical clustering with Pearson correlation and Ward linkage was used, which resulted in 3 clusters. The difference in number of clusters previously reported and found by WECCA is large (3 versus 7, or 6). In part, this is caused by total linkage that yields more compact clusters than many other types of linkage. Due to the exploratory nature of cluster methods, the difference only suggests that the samples in the study could be more heterogeneous than previously implied.
Next, we link the resulting clusterings to survival, as in Fridlyand and others (2006). Table 2 contains the p-values of the log-rank tests for a difference in disease-specific survival and overall survival between the clusters. Moreover, it contains the results of the log-rank tests of the WECCA clusterings, nonweighted and weighted, that resulted from cutting the respective dendrograms into 3 groups. This is done to prevent the number of clusters from obscuring the comparison (as the log-rank test may be sensitive to the number of clusters). The clustering reported in Fridlyand and others (2006) and that from WECCA of the region data with unweighted concordance and total linkage are both significant with respect to disease-specific survival. The latter with 7 clusters also has a significant result for overall survival. The clusters of WECCA with weighted concordance and total linkage yield more significant p-values than the other clusterings do for both survival outcomes (Table 2).
p-values of the log-rank test for difference in survival between clusters, for I, the clustering reported in Fridlyand and others (2006), II and III, the unweighted concordance with total linkage clustering with 3 and 7 clusters, respectively, and IV and V, the weighted concordance with total linkage clustering with 3 and 6 clusters, respectively
| Clustering | I | II | III | IV | V |
| Disease-specific survival | 0.009 | 0.006 | 0.002 | 0.005 | < 0.002 |
| Overall survival | 0.165 | 0.073 | 0.004 | 0.007 | < 0.001 |
| Clustering | I | II | III | IV | V |
| Disease-specific survival | 0.009 | 0.006 | 0.002 | 0.005 | < 0.002 |
| Overall survival | 0.165 | 0.073 | 0.004 | 0.007 | < 0.001 |
WECCA thus discerns more clusters than found before. These clusters could be potentially new subgroups in ductal invasive breast cancer. Their highly significant differences in survival make them promising candidates. Even when forced to yield 3 clusters as reported before, the WECCA clusters correlate better with survival. Of course, an extensive study of the histology of these new clusters is needed for interpretation, as is the validation of their predictive power on clinical outcome in new breast cancer studies.
9 CONCLUSION AND DISCUSSION
Cluster methods for clustering samples on the basis of called aCGH data are presented. Central to the methods are 2 easily interpretable similarities that address the 2 properties of ordinal data (distinctiveness and ordering in magnitude). Both similarities were shown to be location and scale invariant, and consistent. To avoid large “dull” chromosomal areas with normal DNA copy number dominating the clustering, 2 strategies for dealing with the weighting of clones or chromosomal regions in the cluster method were proposed. One enables the user to appoint clones or chromosomal regions that ought to have a higher influence on the clustering, which would lead to clearer delineation of the clusters determined by those chromosomal regions. The other weighting method is data driven. Regions, spatially adjacent sets of clones who share an aCGH signature, are determined and used (instead of the individual clones) in WECCA. The use of regions brings about a clustering on the relevant features of the data because long dull chromosomal areas with normal DNA copy number and small areas with many aberrations are weighted equally. Simulation showed that the use of regions greatly improves the performance of the clustering method. Moreover, as the number of regions is only a small percentage of the number of clones, it is a more manageable task to search for regions with differences in the copy number frequency tables between the clusters. By aiming to find regions that may be responsible for the clustering, WECCA allows the regions themselves to be weighted as well. Finally, we introduced a new type of linkage, total linkage, that generates more compact clusters and more useful dendrograms than average or complete linkage. The potential of WECCA was shown in a reanalysis of a breast cancer study of Fridlyand and others (2006) by demonstrating a better association with both disease-specific and overall survival.
We have thus shown that, when using called aCGH data, samples are best (hierarchically) clustered by transforming the clone data to region data, capturing the break-point structure of aCGH profiles, using the probability of concordance as a similarity measure, addressing the ordinal nature of the data, and letting total linkage define the distance between clusters, to produce well-separated and compact clusters. Weighting of the regions themselves could produce a better delineation of the clusters, and may improve the association with clinical outcome.
We discuss 2 possible extensions of WECCA. In the reanalysis of the breast cancer data, the weights were chosen in an ad hoc manner from a previously published breast cancer study. This illustrated the potential of unequal weighting of the features (exploiting prior knowledge). A more sophisticated approach could consist of a data driven, possibly empirical Bayes, procedure that yields weights reflecting the discriminative power (possibly measured by Shannon's entropy) of each feature, and resulting in possibly clearer delineation of the clusters. Another possibility would be to extend the present methodology to biclustering (Sheng and others, 2005). Biclustering would cluster both samples and regions at the same time. This may lead to the discovery of regions which have similar aberration signatures over only a subset of the samples.
Software for WECCA is written in R (R Development Core Team, 2005) and is available from http://www.few.vu.nl/∼wvanwie/WECCA/WECCA.html.
FUNDING
Center for Medical Systems Biology established by the Netherlands Genomics Initiative/Netherlands Organization for Scientific Research.
Conflict of Interest: None declared.

