Arief Gusnanto, Henry M. Wood, Yudi Pawitan, Pamela Rabbitts, Stefano Berri, Correcting for cancer genome size and tumour cell content enables better estimation of copy number alterations from next-generation sequence data, Bioinformatics, Volume 28, Issue 1, 1 January 2012, Pages 40–47, https://doi.org/10.1093/bioinformatics/btr593
Abstract
Motivation: Comparison of read depths from next-generation sequencing between cancer and normal cells makes the estimation of copy number alteration (CNA) possible, even at very low coverage. However, estimating CNA from patients' tumour samples poses considerable challenges due to infiltration with normal cells and aneuploid cancer genomes. Here we provide a method that corrects contamination with normal cells and adjusts for genomes of different sizes so that the actual copy number of each region can be estimated.
Results: The procedure consists of several steps. First, we identify the multi-modality of the distribution of smoothed ratios. Then we use the estimates of the mean (modes) to identify underlying ploidy and the contamination level, and finally we perform the correction. The results indicate that the method works properly to estimate genomic regions with gains and losses in a range of simulated data as well as in two datasets from lung cancer patients. It also proves a powerful tool when analysing publicly available data from two cell lines (HCC1143 and COLO829).
Availability: An R package, called CNAnorm, is available at http://www.precancer.leeds.ac.uk/cnanorm or from Bioconductor.
Contact: a.gusnanto@leeds.ac.uk
Supplementary information: Supplementary data are available at Bioinformatics online.
1 INTRODUCTION
Cancer cells often exhibit severe karyotypic alterations: whole chromosome gain or loss and structural rearrangements such as amplifications, deletions and translocations result in widespread aneuploidy (Hartwell and Kastan, 1994). The ability to detect copy number alterations (CNAs) of cancer cells is a crucial step to assess the severity of chromosomal rearrangements and to find chromosomal regions where breakpoints are located. Furthermore, comparison of CNAs across tumours from different patients makes it possible to find regions commonly duplicated or lost, highlighting the locations of cancer-related genes. Several methodologies are available to detect CNAs: comparative genomic hybridization (CGH) (Kallioniemi et al., 1992), array CGH (aCGH) (Pinkel et al., 1998) and single nucleotide polymorphism (SNP) arrays (Bignell et al., 2004). More recently, a new generation of sequencing machines (Roche 454, Illumina GAII, HiSeq, MiSeq, ABI SOLiD, Ion Torrent PGM) has enabled massively parallel sequencing, making it possible to sequence full genomes at affordable cost.
We previously showed (Wood et al., 2010) how it is possible to multiplex several samples in one Illumina GAII lane making copy number analysis by sequencing affordable and competitive with aCGH or SNP arrays. Between one and eight million aligning reads (compared to approximately half a billion reads for full coverage) are enough to provide genome-wide CNA at 50 kb resolution. As we expect sequencing technologies to become more widespread, affordable and accurate, copy number analysis by low coverage sequencing will become even more convenient and informative. Furthermore, sequencing is possible even with low amounts of DNA extracted from formalin-fixed paraffin-embedded specimens (Wood et al., 2010).
Finally, one advantage of sequencing compared with array technology is that the signal scales linearly with the input DNA and shows neither the saturation nor the background noise typical of hybridization techniques.
One of the first steps to take when analysing these data is normalization. The total intensity of signal from array technology or the total number of reads from sequencing does not reflect the total DNA content of the cells of interest, but it is largely determined by various technical aspects, tuned to achieve the maximum intensity range (array technology) or highest number of reads (sequencing). The normalization is a crucial, non-trivial and often underestimated step which can have enormous repercussions on downstream analysis and conclusions. In this article, we present a computational tool, called CNAnorm, to correct, normalize and scale the data from low coverage sequencing experiments.
1.1 A problem with multiple solutions
From a theoretical point of view, it is often impossible with cancer cells to determine the actual ploidy using array or sequencing technology. As an example, we can consider a fully tetraploid genome and a normal one. Since input DNA is adjusted to a given amount (usually dictated by technical requirements), signal from these two samples will be comparable and it would not be possible to distinguish the tetraploid from the diploid genome. However if, for instance, a chromosome loss (gain) occurred in the tetraploid genome, we could detect a ratio of 3/4 (5/4) between the signal from that chromosome and the rest of the genome. A similar loss (gain) in a diploid genome would result in a ratio of 1/2 (3/2). A further complication arises when we deal with DNA from patient tumour samples as these are infiltrated with normal cells resulting in inevitable contamination with the patients' normal DNA. This means that, if we observe a ratio of 3/4 for a given region, we cannot determine if it is due to a loss from a tetraploid genome of 100% tumour sample (as described above) or a loss from a diploid genome contaminated with 50% normal cells.
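The ambiguity described above can be checked with a few lines of arithmetic. This sketch uses the convex-combination view of contamination that the paper develops later (a region's observed ratio mixes the tumour ratio with the normal ratio of one); `observed_ratio` is a hypothetical helper, not part of CNAnorm:

```python
def observed_ratio(true_ratio, psi):
    """Ratio observed for a genomic region when tumour cells carrying
    `true_ratio` are mixed with a fraction `psi` of normal cells
    (whose ratio is one by definition)."""
    return (1 - psi) * true_ratio + psi

# A chromosome loss in a pure tetraploid genome gives ratio 3/4...
pure_tetraploid = observed_ratio(3 / 4, psi=0.0)       # 0.75
# ...indistinguishable from a diploid loss (ratio 1/2) with 50% contamination.
contaminated_diploid = observed_ratio(1 / 2, psi=0.5)  # 0.75
```

Both scenarios produce exactly the same observed ratio, which is why external information (or a conservative default) is needed to choose between them.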
This scenario is quite common in tumours from patients because genomes are frequently aneuploid and contamination is largely inevitable. These aspects are often overlooked by algorithms that analyse CNA data. They are usually designed to analyse cell lines, with the assumption that the underlying genome is not only pure but also largely diploid, or that the average or median ploidy can be considered 'normal'. Some algorithms do perform normalization without assuming that the size of the cancer genome is comparable with the normal one (Castle et al., 2010; Chen et al., 2008; Staaf et al., 2007; van Houte et al., 2009) and, most likely, the best results can be achieved with SNP arrays (Greenman et al., 2010; Yau et al., 2010) or high coverage sequencing, because information on the frequency of the underlying variant can inform on the absolute copy number.
Similarly, tools that analyse high-throughput sequencing data to obtain CNA are already available. Some focus on germline CNV (Xie and Tammi, 2009; Yoon et al., 2009), where aneuploidy and contamination are not an issue, whereas others are designed to detect cancer CNA (Chiang et al., 2009; Ivakhno et al., 2010; Kim et al., 2010). These algorithms, however, do not consider tumour contamination and mostly assume implicitly that the overall size of the two genomes is comparable. To our knowledge, the only exception is FREEC (Boeva et al., 2011), which optionally performs correction for contaminating normal tissue if the ploidy of the most abundant copy number is provided. Unfortunately, this information is not usually available when dealing with patients' tumour samples. The main goal of those tools, however, is segmentation, i.e. detecting where a change in copy number occurs.
In this study, we do not make any assumption on the overall size of the tumour genome, nor its purity. We designed CNAnorm to take advantage of the linearity of signal to provide information on underlying ploidy and tumour content, with the only assumption that the tumour is largely monoclonal or, if polyclonal, most clones share most of the alterations. When multiple solutions are possible, we select the most conservative among the compatible ones, where the most abundant copy number has the smallest number of deviations from diploidy. Furthermore, it allows the user to correct the most conservative solution provided by the software if other independent analyses (e.g. FISH, flow cytometry, known tumour content) can guide the experimenter's choice. Finally, CNAnorm uses a third-party segmentation tool, DNAcopy (Olshen et al., 2004), to output data not only corrected for contamination and aneuploidy, but also segmented.
2 METHODS
2.1 Samples
Using an Illumina GAII, we produced 1 836 450 and 1 653 081 reads from DNA isolated from a fresh frozen lung tumour resection specimen and paired blood, respectively, from patient LS041. Similarly, we produced 3 089 173 test and 2 545 305 control reads from patient LS010. Details on sample preparation, DNA extraction and library preparation are described by Wood et al. (2010). We also considered publicly available datasets: 44 762 968 test and 34 293 547 control reads from cell line HCC1143 (Chiang et al., 2009), as well as 18 546 568 test and 22 269 150 control reads from cell line COLO-829 (Pleasance et al., 2010).
2.2 Sequence alignments, filtering and GC content
Sequences were aligned using the bwa suite (Li and Durbin, 2009) against assembly hg19 of the human genome. Only sequences that could be uniquely aligned and with mapping quality ≥37 were used. For each window, we calculated the average genomic GC content. A Perl script (bam2window.pl) that reads sam/bam files and optionally calculates GC content can be freely downloaded from the CNAnorm website. It produces the table required as input to CNAnorm.
2.3 Read counts
To identify the copy number, we count the number of reads per non-overlapping fixed-width genomic region (window). Throughout the analysis, unless specified otherwise, we set the window size so that the median number of reads per window in the sample with the fewest reads is 30. For samples with more reads (HCC1143 and COLO-829), we set the window size to 50 kb.
The window size is a tuning parameter that can be optimized for the data available. However, this is beyond the scope of this article. From our experience in several different samples, selecting window size in which there are 30–180 read counts per window on average strikes a reasonable balance between error variability and bias of CNA. Using a much smaller window size, e.g. on average 1–5 reads per window, will result in many genomic regions with zero read count and make the overall analysis non-informative. At the other extreme, using a much bigger window size will ‘smooth out’ some pattern of alteration (i.e. increasing bias).
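The windowing step can be illustrated as follows. This is a sketch with hypothetical helper names, not code from the CNAnorm package or from bam2window.pl:

```python
import numpy as np

def reads_per_window(positions, chrom_length, window_size):
    """Count aligned reads in non-overlapping fixed-width windows along
    one chromosome (positions are read start coordinates)."""
    n_windows = int(np.ceil(chrom_length / window_size))
    edges = np.arange(n_windows + 1) * window_size
    counts, _ = np.histogram(positions, bins=edges)
    return counts

def pick_window_size(positions, chrom_length, target_median=30,
                     candidates=(10_000, 25_000, 50_000, 100_000, 250_000)):
    """Smallest candidate window size whose median read count reaches the
    target (30 reads per window, as used in the paper)."""
    for w in candidates:
        if np.median(reads_per_window(positions, chrom_length, w)) >= target_median:
            return w
    return candidates[-1]
```

The candidate sizes here are arbitrary; the point is simply that window width trades off resolution against counting noise, as discussed above.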
2.4 CNA
Let xjk and yjk be the number of reads in window j of chromosome k from the tumour and the normal sample, respectively. The simplest estimate of the copy number ratio is
rjk = xjk / yjk. (1)
A normal genome has two copies of (autosomal) chromosomes, while a tumour genome may have zero, one, two or further multiple copies. So the ratios ideally take values in G≡{gu}={0, 0.5, 1, 1.5, 2,…}, corresponding to tumour copy numbers P≡{pu}={0, 1, 2, 3,…}. In reality, due to (i) error, (ii) different numbers of reads recorded (sequencing coverage), (iii) different sizes of tumour and normal genomes and (iv) contamination of the tumour sample by the normal sample, the estimates will not necessarily take a value in G (see Section 2.8 on contamination). Moreover, the observed CNA corresponding to normal genomic regions may not be centered to ratio one. We deal with the first problem by delineating the error variability in a linear model as described briefly in Section 2.6. We deal with the other problems by shifting and scaling the ratio to estimate CNA as described below.
Taking the above problems into consideration, the simplest ratio rjk is a biased estimator, because it does not take into account the different number of reads in each genome. Each sample may acquire a different level of coverage from the experiment, so the ratio in Equation (1) may not be properly aligned. A common approach to this problem is to normalize by the total number of reads. However, this implicitly assumes that both genomes are of equal size: a reasonable assumption when searching for germline CNV, but one that cannot be applied when one genome could be aneuploid.
We estimate the CNA in several steps below. These steps will be elaborated further in the subsequent sections.
- Calculate the ratios rjk [Equation (1)] and correct them for GC content (Section 2.5).
- Smooth the signal from rjk to obtain the smoothed ratios r̃jk. This step reduces noise and highlights the information about genomic alterations. In this article, we use the smooth segmentation approach (Huang et al., 2007) as described in Section 2.6.
- Perform normalization on the distribution of r̃jk so that the most common genomic regions are centered to ratio one. This can be written as
r̃*jk = δ r̃jk, (2)
where δ is a genome-wide alignment coefficient. The coefficient takes into account the different sizes of the tumour and normal genomes. This step is elaborated further in Section 2.7.
- The estimates r̃*jk above do not take into account the tumour sample contamination. In this step, we estimate the level of contamination, ψ, and correct the distribution of r̃*jk to obtain the estimates of CNA, ρ̂jk. This is discussed further in Section 2.8. The observed ratio for each window rjk (not only r̃jk) can be corrected accordingly once the estimates δ̂ and ψ̂ are obtained.
- At this point, the original data can be segmented using any segmentation tool and the results corrected accordingly. CNAnorm uses DNAcopy (Olshen et al., 2004).
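The pipeline just outlined can be sketched end to end. This is a toy illustration, not the CNAnorm implementation: the smoothing stand-in is a moving median rather than the Cauchy random-effect smoother, and δ and ψ are passed in as plain arguments rather than estimated:

```python
import numpy as np

def estimate_cna(x, y, delta, psi, k=5):
    """Toy end-to-end version of the normalization pipeline. `x` and `y`
    are read counts per window for tumour and normal; `delta` and `psi`
    are assumed to come from the mixture-model steps (Sections 2.7, 2.8)."""
    r = np.asarray(x, float) / np.asarray(y, float)   # step 1: raw ratio
    # step 2: crude smoothing stand-in (moving median), NOT the
    # smooth segmentation of Huang et al. (2007)
    pad = np.pad(r, k // 2, mode="edge")
    smooth = np.array([np.median(pad[i:i + k]) for i in range(len(r))])
    r_star = delta * smooth                           # step 3: center normal regions at 1
    rho_hat = (r_star - psi) / (1.0 - psi)            # step 4: undo contamination shrinkage
    return rho_hat                                    # step 5 (segmentation) omitted
```

With δ=1 and ψ=0 the output is just the smoothed ratio; the later sections describe how δ and ψ are actually estimated from the data.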
2.5 GC correction

2.6 Smooth segmentation
With thousands of windows to consider, we need to use the spatial information in the data to identify patterns, through smoothing. This step is necessary in the context of low-coverage data, because random error variability can severely affect the normalization and correction of the ratio distribution. This, in turn, would result in biased estimates of the mixture proportions, which would lead to wrong assignment of the diploid group. In our context, this step is critical to guide the normalization and scaling process. However, in the case of a large excess of reads, typically >500 per window, this step could be skipped.
It is important to note that we do not propose a new segmentation method in CNAnorm. In this study, we implement the smoothing approach proposed by Huang et al. (2007), which employs a linear model where we assume that the second-order difference of the random-effect parameter follows a Cauchy distribution. The estimates of the random effects are the segmented ratios r̃jk. After normalization and scaling, CNAnorm optionally performs a segmentation as implemented by Olshen et al. (2004), although other segmentation methods can generally be used.
2.7 Genome-wide normalization
Genome-wide normalization is a step to correct the location of the distribution of the copy number ratio by estimating δ from the segmented ratio data r̃jk. Owing to systematic gains and losses, the ratio rjk shows a multi-modal distribution. The segmented ratio r̃jk exhibits this multi-modality more clearly, after the smoothing step has removed unwanted random error. The modes of the distribution of r̃jk indicate the (biased) positions of CNA in G, corresponding to different copy numbers in P. At this stage, the modes of the distribution are not centered to the expected CNA values in G, and the estimation of δ requires us to characterize the distribution of r̃jk.
We model the distribution of r̃jk as a mixture of M normal components,
f(r̃jk) = Σm=1..M πm N(μm, σm²), (4)
where πm is the proportion and μm the mean of component m.
Our experience suggests that a normal distribution is adequate as a component to model the distribution of r̃jk, as the distribution of r̃jk does not show heavy tails. Some genomic regions may exhibit extreme values of rjk (but not of r̃jk) because of low counts in xjk and yjk. This is rare and mainly due to problems in mapping reads to some regions of the genome. We find the smooth segmentation approach relatively robust: extreme values in rjk do not affect our fitting of the mixture distribution to the smoothed ratios r̃jk.
We estimate the mixture components in model (4) using a standard expectation–maximization (EM) algorithm (e.g. McLachlan and Krishnan, 1997). In the algorithm, we impose some constraints to ensure identifiability (see also the Supplementary Material). The number of components in the model, M, is chosen using Akaike's information criterion (AIC) across different plausible values.
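The EM-plus-AIC step can be sketched with a textbook one-dimensional Gaussian-mixture fit. This is a simplified stand-in: CNAnorm's fit adds identifiability constraints that are omitted here:

```python
import numpy as np

def fit_gmm_1d(x, M, n_iter=300):
    """Plain EM for a one-dimensional M-component Gaussian mixture
    (textbook sketch; CNAnorm's fit adds identifiability constraints)."""
    x = np.asarray(x, dtype=float)
    mu = np.quantile(x, (np.arange(M) + 0.5) / M)     # spread-out initial means
    sigma = np.full(M, x.std() + 1e-6)
    pi = np.full(M, 1.0 / M)
    for _ in range(n_iter):
        # E-step: responsibility of each component for each point
        dens = pi * np.exp(-0.5 * ((x[:, None] - mu) / sigma) ** 2) \
               / (sigma * np.sqrt(2.0 * np.pi))
        resp = dens / dens.sum(axis=1, keepdims=True)
        # M-step: update proportions, means and standard deviations
        nk = resp.sum(axis=0)
        pi = nk / len(x)
        mu = (resp * x[:, None]).sum(axis=0) / nk
        sigma = np.maximum(
            np.sqrt((resp * (x[:, None] - mu) ** 2).sum(axis=0) / nk), 1e-3)
    loglik = np.log(dens.sum(axis=1)).sum()
    aic = 2 * (3 * M - 1) - 2 * loglik                # 3M - 1 free parameters
    return np.sort(mu), pi, aic

def choose_M(x, candidates=range(1, 9)):
    """Pick the number of components by minimum AIC."""
    return min(candidates, key=lambda M: fit_gmm_1d(x, M)[2])
```

On well-separated modes this recovers the component means (the modes of the smoothed-ratio distribution) and AIC selects the number of components, mirroring the model-selection step described above.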
Once the estimates μ̂m are obtained, it is important for the subsequent steps to describe the relationship between μ̂m and the corresponding tumour copy numbers (ploidy) in P. In general, we have a set of copy numbers P*≡{p*m}={0, 1, 2, 3,…, M−1} from the M components identified in the mixture model (4). However, we allow a 'leap' in the ploidy, so that the M copy numbers in P* may contain non-unit increments. To deal with this, we acknowledge that the ideal CNA values in G increase linearly with copy number or ploidy P. So, for our purpose of aligning r̃jk, we model μ̂m in a simple linear regression as
μ̂m = α + β p*m + em, (5)
where em is a residual term.
The genome-wide alignment coefficient is then estimated as
δ̂ = 1/μ̃v, (6)
where μ̃v is the fitted value, from the above model (5), of the mean of the most common mixture component v. The estimation of δ shows that the process of genome-wide normalization involves identifying the mixture component corresponding to the normal ratio and shifting the whole distribution of r̃jk multiplicatively so that the normal ratio is centered to one.
Once the estimate δ̂ is obtained, r̃*jk = δ̂ r̃jk is the estimate of 'crude' CNA, in which contamination is still present. When we have pairs of samples from different individuals, we are dealing with a different degree of contamination in each pair, so the 'crude' CNA values are not comparable between individuals for the purpose of, e.g. statistical testing of genomic regions. To make the estimates of CNA comparable between samples, we need to characterize the contamination and make a proper correction to the distribution of r̃*jk, as described in the next section.
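The genome-wide normalization step can be sketched as follows. This hypothetical helper assumes each mixture component has already been assigned a copy number p*m; it fits the linear model of means against copy number and inverts the fitted mean of the most abundant component:

```python
import numpy as np

def genome_wide_delta(mu_hat, pi_hat, copy_numbers):
    """Sketch of the delta estimation: fit mixture means against assigned
    copy numbers (the linear model of Section 2.7) and return
    1 / (fitted mean of the most abundant component)."""
    p = np.asarray(copy_numbers, dtype=float)
    slope, intercept = np.polyfit(p, np.asarray(mu_hat, float), 1)
    v = int(np.argmax(pi_hat))            # index of the most abundant component
    return 1.0 / (intercept + slope * p[v])
```

Multiplying the smoothed ratios by this δ̂ centers the most abundant (assumed normal) component at ratio one, which is exactly the alignment described above.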
2.8 Contamination correction
In clinical situations, pure tumour cells are difficult to obtain, because the material comes from patients' tumour tissue; contamination with normal cells is inevitable. If there were no contamination, the smoothed ratio r̃jk would be expected to take values in G≡{gu}={0, 0.5, 1, 1.5, 2,…}. When contamination by a normal genome is present, the smoothed ratio r̃jk will be shrunk towards ratio one (Supplementary Fig. S1): a region with true ratio ρjk=2 will be shrunk to 2−ψ, and ρjk=0.5 will be shrunk to 0.5+0.5ψ. Given the previous step centring the normal copy number to ratio one, the estimate of 'crude' CNA r̃*jk can be assumed to have arisen from a shrinkage of the non-contaminated ratio ρjk around ratio one,
r̃*jk = (1 − ψ)ρjk + ψ,
where ψ is the contamination proportion (0 ≤ ψ < 1).
We estimate ψ by investigating how the estimates μ̂m have been shrunk towards the mean corresponding to copy number two. We first normalize the estimates μ̂m into μ̂*m = δ̂μ̂m, for m=1,…, M. Under the shrinkage model above, the scaled means regressed on copy number p*m have slope (1−ψ)/2 instead of the expected 1/2, so the estimate of ψ is given by
ψ̂ = 1 − 2β̂*,
where β̂* is the estimated slope of the regression of μ̂*m on p*m. The corrected estimates of CNA are then obtained by reversing the shrinkage,
ρ̂jk = (r̃*jk − ψ̂)/(1 − ψ̂).
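The contamination step can be sketched directly from the shrinkage model. The slope-based estimator used here is one way to obtain ψ̂ that is consistent with the numbers reported in the Results (Section 3.3); the package may compute it differently:

```python
import numpy as np

def contamination_correct(mu_star, copy_numbers, r_star):
    """Estimate the contamination proportion psi from the slope of the
    scaled mixture means against copy number, then reverse the shrinkage
    towards ratio one. Under the shrinkage model the slope is (1-psi)/2
    rather than the contamination-free 1/2."""
    slope, _ = np.polyfit(np.asarray(copy_numbers, float),
                          np.asarray(mu_star, float), 1)
    psi = 1.0 - 2.0 * slope
    rho = (np.asarray(r_star, float) - psi) / (1.0 - psi)
    return psi, rho
```

Applied to means that were shrunk with a known ψ, the function recovers ψ exactly and maps the shrunken ratios back onto the grid G.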
The estimates ρ̂jk above take into account the different read depths, genome sizes and contamination. This makes the estimates comparable between different pairs of samples, which is important when statistical tests are performed to infer the pattern of genome-wide CNA.
2.9 Simulation study
We simulated 100 paired tumour–normal datasets, with tumour and normal genome sizes Nt=11 515 563 746 and Nc=6 175 502 642 nt, respectively (note: the normal genome is diploid), and contamination level 0.15<ψi<0.80 for dataset i. We simulated a mixture of tumour and normal cells in ratio (1−ψi):ψi and then produced three million reads coming from the genomes of such a mixture of cells. Clearly, since the tumour genome is larger than the normal one, a tumour content of (1−ψi) will produce more than a fraction (1−ψi) of the reads. To simulate the reads, we used wgsim 2.6 (default parameters), a tool that is part of the bwa suite (Li and Durbin, 2009). Similarly, we produced three million reads for the control genome.
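The gap between cell-level and read-level tumour content can be checked directly, assuming each genome contributes reads in proportion to its size (which is what the linearity of the sequencing signal implies):

```python
def tumour_read_fraction(psi, nt=11_515_563_746, nc=6_175_502_642):
    """Fraction of reads expected from tumour cells when the cell mixture
    is (1 - psi) tumour : psi normal and each genome contributes reads in
    proportion to its size (nt, nc are the simulated genome sizes above)."""
    return (1 - psi) * nt / ((1 - psi) * nt + psi * nc)

# 30% contamination at the cell level still yields ~81% tumour reads,
# because the simulated tumour genome is nearly twice the normal size.
frac = tumour_read_fraction(0.30)
```

This is why a tumour content of (1−ψi) produces more than that fraction of the reads, as noted above.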
To assess accuracy, we computed the score
Δ = (1/J) Σj |ρ̂j − ρj|, (10)
where ρ̂j is the estimated copy number for window j, ρj is the expected copy number for the same window and J is the number of windows. In Equation (10), a high score Δ suggests that the estimates of CNA
are far from the true values, and vice versa.
3 RESULTS
3.1 Simulation study
First, we present the results for one of the 100 simulated datasets (a single realization) with 30% contamination (ψ=0.30); the results for all simulated datasets are presented subsequently. Figure 1 shows the histogram of the ratio and smoothed ratio (see Supplementary Fig. S6 for more details on ratios across the genome). From the figure, we can now see the underlying distribution of CNAs in the simulated data. The vertical dotted lines in Figure 1B are the estimates of the means from fitting the mixture model (4) to the smoothed ratios; these estimated means correspond to copy numbers P*={0, 1,…, 7}.
Histogram of the ratio (A) and smoothed ratio (B) across the genome in a simulated dataset (single realization) with ψ=0.30 (30% contamination). In (B), the dotted vertical lines mark the estimated means from fitting a mixture distribution to the smoothed ratios. More details on the ratios across the genome are presented in Supplementary Figure S6.
AIC selects eight components in the mixture model, and these can be clearly seen in the figure. The most common mixture component in the simulated data is the fifth, which corresponds to copy number four in the tumour sample. The estimates of the proportions are (in percentages) {4.4, 2.4, 15.9, 15.1, 34.2, 13.6, 9.0, 5.3}. From this result, we align the fifth component to ratio one. The estimated contamination is close to the true value of 30%.
To see the results of the normalization on a single chromosome, Figure 2 presents the estimates of CNA in Chromosome 2 before and after the normalization and scaling for contamination. Since the most common component is tetraploid, the expected ratios are in increments of 0.25, shown in the figure as horizontal dashed lines. The figure indicates that the estimates of CNA are now properly centered.
The ratio in Chromosome 2 of simulated data with ψ=0.3 (30% contamination) before (A) and after (B) normalization and scaling. The solid line is the estimate of CNA as a smoothed signal. The horizontal dashed lines are the different expected ratios if there is no contamination in the samples. Note that the 'normal' copy number is four, so the expected ratios are in G*={0/4, 1/4, 2/4,…}. The patterns of genomic CNA are shown in Supplementary Figures S6 and S7.
Next, we compare the results of the 100 simulations obtained using CNAnorm with those obtained using FREEC (Boeva et al., 2011), as presented in Figure 3. The figure indicates that, understandably, performance worsens for all methods as the contamination level increases. When the information on ploidy is provided, CNAnorm (solid black line) performs better than FREEC (dash-dotted line) 95% of the time. When ploidy is not provided, CNAnorm still performs better than FREEC 72% of the time. Remarkably, CNAnorm shows very good and consistent performance (Δ very close to zero) for contamination up to 40%, even when the information on ploidy is not provided.
The scores Δ from 100 simulated data as a function of normal genome contamination level (percent). A low value of Δ indicates that the estimate of CNA is close to the true value. The score for FREEC is shown in the dash-dotted line and for CNAnorm is shown in solid line. The plus (+) grey points are the scores for CNAnorm when the information on the most frequent ploidy is not provided.
A closer inspection of the above results indicates that, while the performance of FREEC steadily deteriorates with increasing contamination, CNAnorm performs consistently well unless it fails to correctly detect some mixture components. When this happens, there are three main consequences. First, the mixture components are correctly detected but the copy number is 'shifted': the estimates of CNA are off by a discrete number of ploidies. This is easily rectified if the most abundant ploidy is provided; experiments with an average distance of 1 but low Δ when ploidy is available (Fig. 3) fall into this category. Secondly, if a few mixture components are missed or wrongly estimated but the most abundant one is correctly identified, the ratio is correctly shifted but over- or under-scaled. Thirdly, several mixture components could be missed or wrongly assigned; in this case, the normalization is severely affected, and providing the ploidy of the most abundant component may or may not improve it.
3.2 Normalization of patients' sample data
We present the results of the analysis of the LS041 dataset here; the results for the LS010 dataset are presented mainly in the Supplementary Material. Figure 4 shows the effect of smoothing on the distribution of the copy number ratio (see Supplementary Fig. S8 for more details on the ratios rjk and the smoothed ratios r̃jk).
From Figure 4A, we cannot clearly see the multi-modality in the distribution of the ratio rjk across the genome. The smoothing process accentuates the multi-modality, as shown in Figure 4B. The mixture model (4) is fitted to the distribution of the smoothed ratios shown in the figure. Based on AIC, the optimal number of mixture components for LS041 is M=7. Given M=7, the estimates of the means are μ̂=(0.47, 0.65, 0.92, 1.15, 1.35, 1.53, 2.43), marked as vertical dotted lines in Figure 4B.
Histogram of the ratio (A) and smoothed ratio (B) across the genome. In (B), the dotted vertical lines mark the estimated means from fitting a mixture distribution to the smoothed ratios. More details are shown in Supplementary Figure S8.
The estimated proportions (in percentages) of the mixture components are π̂=(1.0, 26.7, 29.5, 28.7, 10.3, 1.0, 2.8), indicating that the third mixture component is the most common (v=3). This suggests that the tumour genome is largely diploid. Next, we plot the estimated means μ̂m against copy number, as shown in Figure 5. The figure shows a linear relationship between the estimates and the copy number. For the last mixture component, we allow a 'leap' in the copy number to obtain a better fit of the linear regression (dotted line). The fitted line has a lower slope than the expected line due to contamination.
Relationship between the estimated means from fitting the mixture model and copy number. The dotted line is the fitted linear regression line, and the dashed line is the expected line with slope 0.5 if there is no contamination in the tumour sample.
In this step, we multiplicatively align the whole distributions of rjk and r̃jk so that the mixture component corresponding to the normal copy number is centered to ratio one, using the estimate δ̂ as the multiplicative factor.
3.3 Contamination correction on patients' sample data
The estimated slope in Figure 5 is lower than expected (0.5) due to the contamination of the tumour sample. We assume in this study that the effect is proportional to the distance of the ratio from one (Supplementary Material). After scaling so that the diploid component has ratio one, the scaled estimates are {0.52, 0.72, 1.01, 1.27, 1.50, 1.70, 2.69}. The value for the diploid component is not exactly one, due to the use of the fitted value in Equation (6). This gives the estimate of contamination ψ̂=0.491, or 49.1%. We correct the whole distribution of r̃jk (and, correspondingly, rjk) using a multiplication centered on ratio one, so that the mean estimates in Figure 5 align closely with the expected distribution, as presented in Figure 6A.
(A) Relationship between the estimated means (after correction for contamination) and copy number. The dotted line is the fitted linear regression line, and the dashed line is the expected line with slope 0.5 when there is no contamination in the tumour sample. (B) Histogram of the segmented ratio across the genome after correction for contamination (estimated at 49.1%). The vertical dashed line marks the mean for the normal copy number after alignment.
In Figure 6B, we can see that the distribution of the segmented ratio is expanded away from the reference point one compared with Figure 4. The result of this contamination correction across the genome is presented in Supplementary Figures S8 and S9. The figures indicate that the estimated CNAs are now close to the expected copy number ratios in G. To see the estimates in more detail for one chromosome, Figure 7 shows the result of our proposed method on Chromosome 1 of the LS041 data. After the normalization step and correction for contamination, the estimates of CNA are now correctly aligned with their expected values in G.
Chromosome 1, before (A) and after (B) the normalization. The solid line is the estimate of CNA.
3.4 Analysis of cell line data: HCC1143 and COLO829
Although CNAnorm was developed to deal with low coverage data from clinical samples, we tested the package by analysing two publicly available higher-coverage sequencing datasets from cell lines: HCC1143, obtained from a human breast cancer (Chiang et al., 2009), and COLO829, obtained from a human malignant melanoma (Pleasance et al., 2010). The results of the analysis of the COLO829 data are presented in the Supplementary Material.
For cell line HCC1143, we performed the default analysis and CNAnorm found that the third component was the most abundant and conservatively assigned it to P*=2. In doing so, it also estimated a tumour content of 47.5%. Although, from a technical point of view, this is a plausible solution, our knowledge of the starting material tells us that a tumour content of 47.5% is suspiciously low for a cell line. We then shifted the assignment to P*+1 and P*+2, which would predict a more likely 87% tumour content or an unrealistic 153.5%, respectively. We therefore accepted P*+1 and normalized the data. The results for the whole genome and a subset of chromosomes are shown in Supplementary Figures S15 and S16. Here, we used external clues to perform the most sensible normalization.
HCC1143 is a very well-characterized cell line, and SKY karyotypes are available from http://www.path.cam.ac.uk/~pawefish/BreastCellLineDescriptions/HCC1143.html. We were able to compare the estimates of CNAnorm against independently observed evidence and confirm that the normalization is appropriate. In particular, CNAnorm estimates that Chromosome 5 is largely tetraploid, Chromosome 12 largely triploid and the acrocentric Chromosome 13 largely diploid (Supplementary Fig. S16). These estimates are easily confirmed by the karyotype. As HCC1143 is extensively rearranged, other abnormalities are harder to validate, but given that CNAnorm performs a genome-wide normalization, there is no reason why the copy number estimation on other chromosomes should not be correct, in particular within the 2–4 ploidy range. Furthermore, CNAnorm estimates the median ploidy at 3.2 copies, which is compatible with the description of 'near-tetraploid' or 'hypo-tetraploid' reported on the above Cancer Genomics Program web page.
4 DISCUSSION
We have investigated a method to estimate CNA from patient tumour samples using next-generation sequencing. We showed how the method performs on a range of simulated data, on low coverage data from patients' samples and on higher coverage data from cell lines. Compared with the several segmentation tools available to analyse high-throughput sequencing data, CNAnorm focuses on correcting the data for contamination, different read depths and different genome sizes. We acknowledge that the problem can admit several equally valid solutions, but we provide an easy way for the user to correct the estimation from CNAnorm when independent clues (such as tumour content, or strong independent evidence about the ploidy of certain regions) are available. We believe that the normalization step is often underestimated or, because of its intrinsic difficulty and plurality of solutions, left to a simplistic approach that assumes that the overall size of a cancer genome is comparable with that of a normal cell.
With the data from the two cell lines (HCC1143 and COLO829), we have shown how information on tumour content, ploidy of some chromosomes and overall ploidy could guide the experimenter to perform a more meaningful normalization. At the same time, CNAnorm could provide further insight on the cancer material analysed. When no external information is available, e.g. in the case of patients' tumours LS041 and LS010, CNAnorm performs the most conservative normalization. In this regard, if regions of homozygous deletions could be identified and confirmed, they would be a valuable guide during the normalization process.
Despite being a powerful tool, CNAnorm is not a 'silver bullet' for CNA analysis. In particular, we would like to point out how polyclonal tumours could produce misleading results. If several tumour clones, each with its own gains and losses, constitute the tumour sample, it will not be possible to detect the underlying ploidy, and the mixture-model approach would over-fit distributions within a single ploidy range. CNAnorm is meant to be robust, and clonal variability in a few chromosomal regions is not problematic. However, in these cases, CNAnorm will tend to underestimate the tumour content. This is, in our opinion, what happens with cell lines HCC1143 and COLO829: although a cell line should be 100% tumour, CNAnorm estimates only 87 and 90%, respectively. We think this is due to some variability within the cells. This variability can be observed, for instance, in HCC1143, where the copy numbers of the long arm of Chromosome 2 and of Chromosome 6 are, unlike most of the rest of the genome, not close to any integer. Since we are aware of possible polyclonal variability, we chose an approach and a segmentation tool, DNAcopy, that do not force every region of the genome to an integer copy number. Clues about polyclonal variation are, per se, potentially informative.
5 CONCLUSION
Next-generation sequencing data from clinical samples obtained directly from patients presents a serious challenge. Analysis of CNAs in tumour samples is not straightforward. This is because the observed raw copy number ratios do not necessarily take the expected values due to random error, different sequencing coverage and contamination with normal cells. We deal with the random error using smoothing methods. The other challenges are dealt with by acknowledging the multi-modality in the distribution of the segmented copy number ratios. This allows us to model the locations of the distribution of ratios corresponding to the different copy numbers and make the necessary correction. The simulation study shows that the method works properly to estimate genomic regions with gains and losses and that it works well in a range of real datasets from patients' samples and cell lines.
ACKNOWLEDGEMENTS
A.G. would like to thank Terry Speed and Yuval Benjamini for a helpful discussion. We thank the surgical team at St James's University Hospital (Leeds, UK).
Funding: Yorkshire Cancer Research (L341PA).
Conflict of Interest: none declared.
REFERENCES
Author notes
Associate Editor: Martin Bishop