Abstract

Summary: This synopsis provides an overview of the display, abstraction and analysis of array-based comparative genomic hybridization data using CGHAnalyzer, a software suite designed specifically for this purpose. CGHAnalyzer can simultaneously load copy number data from multiple platforms, query and describe large, heterogeneous datasets and export results. Additionally, CGHAnalyzer employs a host of algorithms for microarray analysis, including hierarchical clustering and class differentiation.

Availability: CGHAnalyzer, the accompanying manual, documentation and sample data are available for download at http://acgh.afcri.upenn.edu. CGHAnalyzer is a Java-based application built in the framework of the TIGR MeV that runs on Microsoft Windows, Macintosh OS X and a variety of Unix-based platforms. It requires the free Java Runtime Environment 1.4.1 or more recent (http://www.java.sun.com).

Contact: weberb@mail.med.upenn.edu

INTRODUCTION

Copy number assays

Genome copy number is widely regarded as important for identifying etiological factors in a range of human diseases (Weber, 2002). Complete and partial non-diploid genomes resulting from cytogenetic alterations have been implicated in the diagnosis of congenital disorders (Milunsky and Huang, 2003) and serve as predictors of clinical outcome in many cancer types (Look et al., 1991). Recently, various widely accessible microarray-based assays for measuring genome copy number, including array-based comparative genomic hybridization (aCGH) (Snijders et al., 2001), dual-channel oligonucleotide assays (Lucito et al., 2003) and single-channel ‘SNP chips’ (Affymetrix, Inc.) (Bignell et al., 2004), have provided superior resolution to metaphase CGH using high-throughput platforms.

Most freely available microarray analysis packages are specifically designed to analyze gene expression data (reviewed in Dudoit et al., 2003). Their great utility lies in loading multiple experiments and providing the descriptive and analytical algorithms necessary to cope with a rapidly growing body of data; however, they are not designed to optimize data extraction and analysis from whole genome copy number datasets. Currently available software designed specifically for copy number assays is limited to single-experiment visualization (Chi et al., 2004), data abstraction from a single platform (Awad, 2004; http://dahlia.stanford.edu:8080/caryoscope/index.html) or copy number breakpoint analysis in a single sample (Eilers and de Menezes, 2005). CGHAnalyzer is a freely available, open source software suite designed specifically for analysis of multiple-experiment copy number data. CGHAnalyzer can display detailed copy number profiles for multiple experiments, query large datasets for minimal common regions of aberration, integrate other genomic features with copy number data (e.g. known/predicted genes), conduct higher order analyses such as hierarchical/k-means clustering and perform statistical tests to identify regions that are differentially altered between classes of samples. The software can apply these operations to large numbers of experiments from various platforms simultaneously and display the results in a range of customizable interactive views, which include hotlinks to the Ensembl and UCSC Genome Browser databases.

The primary advantage of CGHAnalyzer over existing packages is its flexibility of visualization and analysis. Copy number changes in a single experiment or a set of experiments can be estimated with one of several methods provided by the software or with one imported from other packages (Myers et al., 2004). Unlike currently available software, CGHAnalyzer employs a generic genome coordinate-based framework that offers probe-independent analyses. This allows side-by-side analysis of experiments from many platforms and increases the utility of aCGH data in public repositories. Further, CGHAnalyzer's full algorithm set is designed specifically for microarray analysis and presents a wide range of analytical capabilities.

PROGRAM OVERVIEW

Loading data

To highlight the capacity of CGHAnalyzer, we obtained six publicly available datasets to demonstrate the steps in configuring, processing and analyzing data (Table 1). These datasets were chosen to (1) compare data from multiple array platforms, (2) query and annotate large datasets and (3) apply commonly used microarray algorithms to copy number datasets. Although more detailed annotations can be incorporated, the minimal information required to use CGHAnalyzer is (1) a probe set with an associated copy number metric and (2) a genome location for each probe. CGHAnalyzer uses a UCSC Human Genome Server-based coordinate system (http://genome.ucsc.edu) that can be configured to retrieve data from a database; alternatively, datasets can be loaded from a number of standard file formats. Normalized input can include that from a peripheral application employing global normalization procedures (available at http://acgh.afcri.upenn.edu) or output from any number of published protocols (Kim et al., 2005).
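
The minimal input described above (a copy number metric plus a genome location per probe) can be pictured with a small sketch. This is not CGHAnalyzer's file format or API: the `Probe` record, the column names, the `load_probes` helper and the example values are all hypothetical and serve only to illustrate the idea of a coordinate-based record.

```python
import csv
import io
from dataclasses import dataclass


@dataclass(frozen=True)
class Probe:
    """One probe measurement (field names are illustrative assumptions)."""
    probe_id: str
    chrom: str
    start: int          # genome coordinate in bp
    end: int
    log2_ratio: float   # the copy number metric


def load_probes(handle):
    """Parse tab-delimited rows (with a header line) into Probe records.

    Records are returned sorted by (chromosome name, start coordinate);
    chromosome names are compared as plain strings.
    """
    reader = csv.DictReader(handle, delimiter="\t")
    probes = [
        Probe(row["probe_id"], row["chrom"], int(row["start"]),
              int(row["end"]), float(row["log2_ratio"]))
        for row in reader
    ]
    return sorted(probes, key=lambda p: (p.chrom, p.start))


# Two made-up probes, deliberately listed out of genomic order.
example = io.StringIO(
    "probe_id\tchrom\tstart\tend\tlog2_ratio\n"
    "bac_2\tchr10\t60000000\t60150000\t0.62\n"
    "bac_1\tchr10\t55000000\t55150000\t0.58\n"
)
probes = load_probes(example)
```

Because each record carries only coordinates and a ratio, probes from different array types can share one structure, which is the property the coordinate-based framework relies on.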

Visualizing data

After data are loaded into CGHAnalyzer, a graphical interface prompts users for the essential data required for display and analysis. Copy number designations for each probe are then made with a user-selected method: either a standard ratio threshold or a P-value derived from a series of reference experiments. The primary graphical window used for all analyses is a series of ideograms that depict the color-coded copy number status for every probe of each sample in an ordered column, a view standard for multiple-sample genome copy number visualization (Fig. 1A). To analyze the copy number of regions not directly covered by an experiment's probe set, and for cross-platform analysis where few, if any, probes are common to all samples, CGHAnalyzer uses a copy number assignment algorithm that estimates the copy number status of sequences between probes, allowing estimates for all regions of the genome. Aberrant regions of any sample are defined as the sequence extending from an aberrant probe to a neighboring probe of differing copy number. Consequently, cross-experiment analysis has no probe dependence; assays from any platform can be scrutinized simultaneously without requiring that probes be shared between them.
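
The region definition above can be sketched in a few lines. This is one reading of the text, not the CGHAnalyzer implementation: it assumes a symmetric log2 ratio threshold (the value 0.3 is arbitrary), and it assumes that a run of aberrant probes is extended outward to the edge of the nearest flanking probe with a different call. The function names are hypothetical.

```python
def call_status(log2_ratio, threshold=0.3):
    """Return +1 (gain), -1 (loss) or 0 (no change) from a ratio threshold.

    The threshold value is an illustrative assumption.
    """
    if log2_ratio >= threshold:
        return 1
    if log2_ratio <= -threshold:
        return -1
    return 0


def aberrant_regions(probes, threshold=0.3):
    """Turn probe-level calls into genome regions on one chromosome.

    probes: list of (start, end, log2_ratio) tuples sorted by start.
    Returns (region_start, region_end, status) tuples. Each run of
    consecutive probes with the same aberrant call is extended to the
    nearest flanking probe that has a different call; at the ends of the
    probe list the run stops at its own outermost probe boundary.
    """
    calls = [call_status(ratio, threshold) for _, _, ratio in probes]
    regions = []
    i = 0
    while i < len(probes):
        if calls[i] == 0:
            i += 1
            continue
        j = i
        while j + 1 < len(probes) and calls[j + 1] == calls[i]:
            j += 1
        left = probes[i - 1][1] if i > 0 else probes[i][0]
        right = probes[j + 1][0] if j + 1 < len(probes) else probes[j][1]
        regions.append((left, right, calls[i]))
        i = j + 1
    return regions


# Made-up probes: two gained probes flanked by unchanged probes, then a loss.
toy = [(100, 200, 0.0), (1000, 1100, 0.5), (2000, 2100, 0.6),
       (3000, 3100, 0.1), (4000, 4100, -0.7)]
regions = aberrant_regions(toy)
```

In this sketch, when a gained probe sits directly next to a lost probe, the gap between them is claimed by both regions; a real implementation would need a rule for that case.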

Copy number assay data for breast cancer cell line SKBR3 from four platforms (Table 1) demonstrate the utility of this protocol. Little probe homology exists between these platforms. Furthermore, the mean direct sequence coverage of one platform by another is 11.3% (σ = 8.6, n = 12), with a maximum of 25.0% of Platform 2 covered by Platform 1. Despite the lack of coverage similarity, full genome copy number estimations of SKBR3 were highly concordant between platforms, with all confirming the large regions of 10q gain (∼55–85 Mb) and 10q loss (∼90–120 Mb) previously described by non-array-based analysis (Davidson et al., 2000) (Fig. 1A and B). Additionally, all displayed regions are hot-linked directly to both the UCSC Genome Browser and Ensembl.
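
A coverage figure of this kind can be computed from probe intervals alone. The sketch below assumes that "direct sequence coverage" means the fraction of bases spanned by one platform's probes that are also spanned by another platform's probes; the paper does not spell out its exact definition, and the function names and toy intervals are hypothetical. The measure is directional, which would be consistent with 12 ordered pairs among four platforms.

```python
def merge(intervals):
    """Merge overlapping or touching half-open (start, end) intervals."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def coverage_fraction(platform_a, platform_b):
    """Fraction of the bases spanned by platform_a that platform_b also spans."""
    a, b = merge(platform_a), merge(platform_b)
    total = sum(end - start for start, end in a)
    shared = 0
    i = j = 0
    while i < len(a) and j < len(b):
        lo = max(a[i][0], b[j][0])
        hi = min(a[i][1], b[j][1])
        if lo < hi:
            shared += hi - lo
        # Advance whichever interval ends first.
        if a[i][1] < b[j][1]:
            i += 1
        else:
            j += 1
    return shared / total if total else 0.0


# Toy probe intervals on one chromosome (made-up coordinates).
platform_x = [(0, 100), (200, 300)]
platform_y = [(50, 250), (400, 600)]
x_by_y = coverage_fraction(platform_x, platform_y)
y_by_x = coverage_fraction(platform_y, platform_x)
```

Here half of `platform_x` is covered by `platform_y`, but only a quarter of `platform_y` is covered by `platform_x`, which shows why each ordered pair is counted separately.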

Querying data

A detailed knowledge of the most frequent aberrant regions can provide important insight into the underlying disease. Thus, the identification and annotation of these regions is a fundamental operation in genome copy number data analysis. CGHAnalyzer offers a sophisticated query interface for identifying aberrant entities in a dataset. Users can query for the most commonly aberrant probes, genes or customized features. For example, amplification of MYCN (2p24.3), a known clinical prognostic indicator in neuroblastoma (Look et al., 1991), occurs with variable amplicon size (Reiter and Brodeur, 1996). In this example, nine primary neuroblastoma tumors were compared to ten neuroblastoma tumor-derived cell lines (Table 1, Platforms 1 and 4, respectively). The minimal common region of amplification for all samples with an MYCN amplicon (Fig. 1C), obtained by converting all platform data units into genome regions, is a 600 kb region of copy number gain (16.0–16.6 Mb).
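
Once every sample's aberration is expressed as a genome region, a minimal common region reduces to an interval intersection. The sketch below assumes one aberrant interval per sample on the same chromosome; the function name and the toy coordinates are hypothetical and are not taken from the datasets in Table 1.

```python
def minimal_common_region(sample_regions):
    """Intersect one (start, end) aberrant interval per sample.

    Returns the shared (start, end) interval, or None when the samples
    have no position in common.
    """
    start = max(region_start for region_start, _ in sample_regions)
    end = min(region_end for _, region_end in sample_regions)
    return (start, end) if start < end else None


# Three made-up amplicons of different sizes around one locus.
amplicons = [(100, 900), (400, 1200), (300, 700)]
shared = minimal_common_region(amplicons)
```

With these toy inputs the shared interval is (400, 700): the latest start and the earliest end among the three amplicons.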

Analyzing data

All of the algorithms for data mining implemented in TIGR's TM4 software (Saeed et al., 2003) have been adapted to handle higher order copy number data analysis in CGHAnalyzer. For example, the copy number status of selected gene sets may be meaningful classifiers for specific cancer types. We compared the copy number status of 512 cancer-related genes (Futreal et al., 2004) among 15 breast tumors, 9 neuroblastoma tumors and 10 neuroblastoma cell lines [Platforms 1 and 3 (breast tumors); 1 and 4 (neuroblastoma tumors and cell lines)] using the T-test algorithm in CGHAnalyzer, which applies a conservative step-down max-T adjustment for multiple comparisons (Dudoit et al., 2003); this comparison identified a collection of genes important for class differentiation. Although both of these tumor types have a high rate of copy number variation, of the 42 genes deleted in >25% of either group, only deletion of PPP2R1B, a protein phosphatase at 11q22–24 that is lost in many human cancers (Wang et al., 1998), differed significantly between tumor groups (P = 0.002). A deletion of this gene occurred in 76% of breast tumors and 19% of neuroblastoma tumors. A much greater number of gains (126 genes gained in >25% of samples) differentiated groups (n = 14, P ≤ 0.005), and as expected, gains of MYCN (2p24.3) in neuroblastoma and PTK2 (8q24) in breast tumors distinguished the groups.
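
The step-down max-T adjustment mentioned above can be illustrated with a small permutation sketch. It is not the TM4 or CGHAnalyzer code: it assumes an absolute Welch t statistic, at least two samples per class, and a complete enumeration of class relabellings, which is only feasible for very small studies (random permutations would be sampled otherwise). The function names and the toy values are hypothetical.

```python
import itertools
import math
from statistics import mean, variance


def welch_t(x, y):
    """Absolute Welch t statistic for two groups of at least two values."""
    se2 = variance(x) / len(x) + variance(y) / len(y)
    diff = abs(mean(x) - mean(y))
    if se2 == 0:
        return math.inf if diff > 0 else 0.0
    return diff / math.sqrt(se2)


def maxt_adjusted_p(rows, n_first):
    """Step-down max-T adjusted P-values, returned in input gene order.

    rows: one list of per-sample values per gene; the first n_first
    columns belong to class 1 and the remaining columns to class 2.
    """
    n_samples = len(rows[0])
    observed = [welch_t(r[:n_first], r[n_first:]) for r in rows]
    # Rank genes from the largest to the smallest observed statistic.
    order = sorted(range(len(rows)), key=lambda g: -observed[g])
    exceed = [0] * len(rows)
    n_perm = 0
    for first in itertools.combinations(range(n_samples), n_first):
        chosen = set(first)
        second = [c for c in range(n_samples) if c not in chosen]
        running_max = 0.0
        # Walk from the least to the most significant gene, carrying the
        # successive maximum of the permuted statistics.
        for rank in range(len(order) - 1, -1, -1):
            g = order[rank]
            t = welch_t([rows[g][c] for c in first],
                        [rows[g][c] for c in second])
            running_max = max(running_max, t)
            if running_max >= observed[g]:
                exceed[rank] += 1
        n_perm += 1
    adjusted = [0.0] * len(rows)
    previous = 0.0
    for rank, g in enumerate(order):
        previous = max(previous, exceed[rank] / n_perm)  # keep P-values monotone
        adjusted[g] = previous
    return adjusted


# Two made-up genes measured in 3 + 3 samples: one separates the classes, one does not.
toy_rows = [
    [0.9, 1.0, 1.1, -0.1, 0.0, 0.1],
    [0.1, -0.2, 0.0, 0.05, -0.1, 0.15],
]
adjusted_p = maxt_adjusted_p(toy_rows, n_first=3)
```

With only 20 possible relabellings of six samples, the smallest attainable adjusted P-value is 2/20, because the observed labelling and its mirror image always count. That is why permutation-based adjustments need enough samples to resolve small P-values.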

CONCLUSIONS

CGHAnalyzer is a free, flexible utility for genome copy number analysis. It increases the accessibility of high-resolution copy number data by applying an assay-independent data structure. For copy number analyses, it provides elementary visualization tools and advanced data-mining algorithms through an interactive, user-friendly interface, functions that are important for studying cytogenetic disorders and cancer genomes.

Conflict of Interest: Barbara L. Weber declares that she is a full time employee of GlaxoSmithKline and will use this software for research sponsored by and internal to this company.

Table 1

Data were collected from a series of publicly available human genome copy number datasets from several platforms

| Platform | Sample type | Platform type | Probe type | No. of mapped probes | Resolution (kb)^a | Sources |
|---|---|---|---|---|---|---|
| 1 | Multiple cancer cell lines | Glass slide | BAC clone | 4135 | 920 | (Greshock et al., 2004); http://acgh.afcri.upenn.edu |
| 2 | Multiple cancer cell lines | Glass slide | BAC clone | 2240 | 1400 | (Snijders et al., 2001); http://ncbi.nlm.nih.gov/geo |
| 3 | Breast cancer cell lines, primary tumors | Glass slide | cDNA | 6691 | 1100 | (Pollack et al., 2002); http://genome-www5.stanford.edu |
| 4 | Primary neuroblastomas | Glass slide | cDNA | 7566 | 967 | (Beheshti et al., 2003); http://www.utoronto.ca/cancyto |
| 5 | Breast cancer cell line | Glass slide | Oligonucleotide | 85495 | 15 | (Lucito et al., 2003); http://roma.cshl.org |
| 6 | Multiple cancer cell lines | Affymetrix 10K SNP chip | Oligonucleotide | 8473 | 150 | (Bignell et al., 2004); ftp.sanger.pub/p501 |

All probes were mapped onto genome build 34 (June 2003).

^a This includes a UCSC Genome Browser aligned sequence (http://genome.ucsc.edu).

Fig. 1

Display windows allow the simultaneous visualization of many samples and provide mouse-driven direct access to raw data. (A) An estimated copy number comparison of chromosome 10 for the breast cancer cell line SKBR3 shows a high degree of concordance across four platforms. Regions of identified losses (red, seen on left) and gains (green, seen on right) are shown in separate views along chromosome 10. (B) Scatter plot of the test to reference ratio along chromosome 10 for the breast cancer cell line SKBR3 across four platforms, identifying a localized single copy loss spanning ∼4 Mb at 10p12.1 (region A) and a larger region of gain (∼50–90 Mb; region B). The distribution of probes between assays can be used to infer a minimal region of aberration in both cases. (C) The five tumor samples appear to have a more highly localized region of amplification than do eight neuroblastoma cell lines. The MYCN amplicon is highlighted in yellow.

REFERENCES

Awad, I.A.B. (2004) Caryoscope. Stanford University, Stanford, CA.
Beheshti, B., et al. (2003) Chromosomal localization of DNA amplifications in neuroblastoma tumors using cDNA microarray comparative genomic hybridization. Neoplasia, 5, 53–62.
Bignell, G.R., et al. (2004) High-resolution analysis of DNA copy number using oligonucleotide microarrays. Genome Res., 14, 287–295.
Chi, B., et al. (2004) SeeGH—a software tool for visualization of whole genome array comparative genomic hybridization data. BMC Bioinformatics, 5, 13.
Davidson, J.M., et al. (2000) Molecular cytogenetic analysis of breast cancer cell lines. Br. J. Cancer, 83, 1309–1317.
Dudoit, S., et al. (2003) Open source software for the analysis of microarray data. Biotechniques, Suppl., 45–51.
Dudoit, S., et al. (2003) Multiple hypothesis testing in microarray experiments. Stat. Sci., 18, 71–103.
Eilers, P.H. and de Menezes, R.X. (2005) Quantile smoothing of array CGH data. Bioinformatics, 21, 1146–1153.
Futreal, P.A., et al. (2004) A census of human cancer genes. Nat. Rev. Cancer, 4, 177–183.
Greshock, J., et al. (2004) 1-Mb resolution array-based comparative genomic hybridization using a BAC clone set optimized for cancer gene analysis. Genome Res., 14, 179–187.
Kim, S.Y., et al. (2005) ArrayCyGHt: a web application for analysis and visualization of arrayCGH data. Bioinformatics, 21, 2554–2555.
Look, A.T., et al. (1991) Clinical relevance of tumor cell ploidy and N-myc gene amplification in childhood neuroblastoma: a Pediatric Oncology Group study. J. Clin. Oncol., 9, 581–591.
Lucito, R., et al. (2003) Representational oligonucleotide microarray analysis: a high-resolution method to detect genome copy number variation. Genome Res., 13, 2291–2305.
Milunsky, J.M. and Huang, X.L. (2003) Unmasking Kabuki syndrome: chromosome 8p22–8p23.1 duplication revealed by comparative genomic hybridization and BAC-FISH. Clin. Genet., 64, 509–516.
Myers, C.L., et al. (2004) Accurate detection of aneuploidies in array CGH and gene expression microarray data. Bioinformatics, 20, 3533–3543.
Pollack, J.R., et al. (2002) Microarray analysis reveals a major direct role of DNA copy number alteration in the transcriptional program of human breast tumors. Proc. Natl Acad. Sci. USA, 99, 12963–12968.
Reiter, J.L. and Brodeur, G.M. (1996) High-resolution mapping of a 130-kb core region of the MYCN amplicon in neuroblastomas. Genomics, 32, 97–103.
Saeed, A.I., et al. (2003) TM4: a free, open-source system for microarray data management and analysis. Biotechniques, 34, 374–378.
Snijders, A.M., et al. (2001) Assembly of microarrays for genome-wide measurement of DNA copy number. Nat. Genet., 29, 263–264.
Wang, S.S., et al. (1998) Alterations of the PPP2R1B gene in human lung and colon cancer. Science, 282, 284–287.
Weber, B.L. (2002) Cancer genomics. Cancer Cell, 1, 37–47.
