sCNAphase: using haplotype resolved read depth to genotype somatic copy number alterations from low cellularity aneuploid tumors

Abstract Accurate identification of copy number alterations is an essential step in understanding the events driving tumor progression. While a variety of algorithms have been developed to use high-throughput sequencing data to profile copy number changes, no tool is able to reliably characterize ploidy and genotype absolute copy number from tumor samples that contain less than 40% tumor cells. To increase our power to resolve the copy number profile from low-cellularity tumor samples, we developed a novel approach that pre-phases heterozygote germline single nucleotide polymorphisms (SNPs) in order to replace the commonly used ‘B-allele frequency’ with a more powerful ‘parental-haplotype frequency’. We apply our tool—sCNAphase—to characterize the copy number and loss-of-heterozygosity profiles of four publicly available breast cancer cell-lines. Comparisons to previous spectral karyotyping and microarray studies revealed that sCNAphase reliably identified overall ploidy as well as the individual copy number mutations from each cell-line. Analysis of artificial cell-line mixtures demonstrated the capacity of this method to determine the level of tumor cellularity, consistently identify sCNAs and characterize ploidy in samples with as little as 10% tumor cells. This novel methodology has the potential to bring sCNA profiling to low-cellularity tumors, a form of cancer unable to be accurately studied by current methods.


INTRODUCTION
Somatic copy number alterations (sCNAs) represent an important class of mutation in the cancer genome, evident by the large number of short focal sCNAs and larger chromosomal scale changes seen in the analysis of individual tumor genomes (1). This class of mutation has been linked to tu-mor progression, metastasis, multidrug resistance and poor clinical outcomes (2)(3)(4)(5)(6). Despite the sporadic accumulation of sCNAs during tumor progression, a number of regions are subject to recurrent sCNAs (7). Some of these recurrent sCNAs are found across different cancer types, while others were specific to a particular type or subtype of the disease (6,(8)(9)(10). As a result, determining the sCNAs in an individual tumor sample has become standard practice in pathology labs for the treatment of some cancers. For example, this type of analysis is routinely used to assign the optimal chemotherapeutic treatments for patients with breast cancer who contain additional copies of the HER2 gene (11,12).
Despite the importance of this class of mutation, it can be difficult to characterize the copy number profile of a tumor genome (13). A typical tumor biopsy will contain both tumor cells as well as cells with a normal, diploid genome. This can be quantified via the cellularity (the proportion of tumor cells in this mixture) or via the tumor DNA purity (the proportion of tumour DNA in the mixture of normal and tumor DNA). Tumor purity is a function of both the cellularity and the tumor ploidy (which we define as the average copy number of the tumor) -e.g. a 50% cellularity tetraploid tumor and will have a 66% tumor purity. The best current methods fail to produce the correct copy number segmentation when tumor cellularity in a sample falls below 40% (Supplementary Table S1). The cellularity for a number of serious forms of cancer, such as Breast Invasive Carcinoma, Lung Adenocarcinoma and some forms of Melanoma routinely fall below this threshold (13), moreover multiple cancers including renal clear cell carcinoma and lower grade glioma show a decreased survival time with lower tumor cellularity (14). Thus, there is an important unmet need to identify copy number mutations in low purity samples.
It is possible to survey the copy number profile of samples with a high tumor content, using a number of different techniques, however, it is difficult to characterize the full spectrum of sCNAs with a single technology. Spectral karyotyping (SKY) is one of the most accurate tools for characterizing and visualizing genome wide changes in ploidy (15)(16)(17)(18), but suffers from a limited resolution and is low throughput.
More recent technologies, such as single-nucleotide polymorphism (SNP) microarrays has provided powerful approaches for interrogating the tumor genome and identifying copy number mutations (13,(19)(20)(21)(22). These include AS-CAT (23) and ABSOLUTE (13), both of which initially presegment SNPs into regions of equal copy-number (using a threshold based, model-free approach) and subsequently estimate ploidy and tumor purity by use of a model for the observed read-depth data conditional on the fixed segmentation. ASCAT and ABSOLUTE are highly successful in samples with as little as 40% tumor DNA (13,23), however, the reliance on an initial model-free segmentation is likely to limit the ability of these methods to detect copy number alterations at lower tumor cellularities (Supplementary Table S1). The performance of these tools are also restricted by the resolution of different microarray platforms as well as fluorescence signal saturation at high copy number.
High throughput sequencing (HTS) is a powerful technology for identifying sCNAs that may make it possible to characterize the complete copy number profile of impure cancer samples. Both ASCAT and ABSOLUTE have been modified to be applicable to HTS, and a number of other computational tools have been specifically designed to identify copy number changes, characterize loss of heterozygosity (LOH) and identify homozygous deletions from HTS data (Supplementary Table S1). These tools use a variety of different signals present in HTS data including read depth aberration, B-allele frequency at somatic and germline SNPs. Most of these tools are not suitable for samples with tumor cellularity less than 40% (23)(24)(25)(26). CLImAT is a recently introduced tool that uses read-depth and Ballele frequency (BAF) to estimate the ploidy and purity of impure tumor genomes and also characterizes copy number and LOH changes (27). At 20% simulated tumor cellularity, CLImAT demonstrated more robust cellularity and ploidy estimates as well as greater sCNA and LOH calling accuracy than Absolute (13), SNVMix (24), Control-FREEC (25) and Patchwork (26); however, this was evaluated using simulated tumor chromosomes rather than tumor-normal mixture samples.
Modeling BAF has strengths and limitations complementary to read-depth (RD) modeling. BAF, which effectively uses an internal control of one allele versus the other, is less susceptible to position-specific biases, such as GC and mappability biases. However, one striking disadvantage of BAF modeling is that it is not possible to directly summate the allelic depth signal over multiple adjacent SNPs, as the non-reference alleles at one adjacent position may not be on the same parental haplotype. Summating RD over windows of size from 10 kb up to 1 Mb leads to substantial increases in statistical power to detect sCNAs. We hypothesized that application of state-of-the-art computational phasing approaches, incorporating population haplotype from the 1000 genomes project (27) as well as direct within-read phasing (28), would allow us to sum allelic depth along phased haplotypes to obtain parentalhaplotype frequency (PHF) estimates. We further hypothesized that modeling PHF instead of BAF could lead to improved power characterize tumor ploidy and sCNAs at ultra-low levels of tumor purity.
In this manuscript we present sCNAphase that has been developed to characterize the full copy number profile of a cancer sample across a range of tumor purity. It achieves this by inferring tumor ploidy, sCNAs and regions of LOH across all levels tumor purity by integrated modeling of PHF and RD. We show that sCNAphase has accuracy comparable to SKY in determining genome-wide changes in ploidy and is able to identify focal sCNAs that are consistent with results from microarray analyses. We also show that sCNAphase can confidently determine regions that have undergone a loss of heterozygosity event and identify regions of homozygous deletion. Moreover, sCNAphase consistently generates accurate sCNA segmentations at low levels of tumor purity and can accurately define levels of tumor purity in mixtures containing 5% tumor DNA.

Data sets
The Illumina whole genome sequencing data of four pairs of tumor and matched normal cell line samples were downloaded from The Cancer Genome Atlas (TCGA) (29) or Illumina BaseSpace (https://basespace.illumina.com). All of the 8 samples are at higher than 49x coverage, except for HCC2218BL at 37x (Table 1). Two independent normal 30X samples were also available for HCC1143 and HCC1954. For the other two samples HCC1187 and HCC2218, we generated a second 'normal' by downsampling the available matched normal to 30X. We will refer to these second normal samples as '0% mixtures' as we will use them as a negative control to investigate whether methods detect tumor DNA in mixtures without tumor DNA.
By mixing the different amount of reads from pure tumor samples together with the '0% mixture' samples (i.e. the second normal sample), a series of mixtures samples were created at 30x coverage with 5%, 20%, 40%, 60%, 80% and 95% of tumor DNA. These mixture samples were obtained as BAM files from TCGA (29) for HCC1143 and HCC1954, from Illumina BaseSpace for HCC1187 and HCC2218 (Supplementary Table S2). We also created an extra 10% mixture for all samples.

Phasing matched normal
By running samtools (30) and BCFtools (30) on the normal samples, we determined the germline heterozygous SNPs. At these loci, samtools is used to calculate the depths for the tumor sample. The somatic mutations were ignored in this step. Then SHAPEIT2 (28), an in silico haplotype phasing tool, was used to phase whole genome sequence short read data. Because SHAPEIT2 requires a set of pre-phased reference haplotypes as input, we used pre-phased haplotypes provided by SHAPEIT2 calculated from the 1000G Phase I data set. Based on this analysis, we assigned each allele from each heterozygote in the matched normal to a haplotype, labeled H1 or H2.

Calculating regional haplotype depth
Our approach only considers read-depth at heterozygous germline SNPs, and ignores information from somatic mu- tations. We calculate the total RD for the tumor (t) and normal (n) in windows i = 1..N, each consisting of K germline heterozygous SNPs (with a default K = 40) as We also calculate the read-depth in these regions specific to haplotype H1 (RD H1 ) as If the K SNPs in a window are split by a large gap (greater than 1M, e.g a centromere), this window is excluded from further analysis. We discuss below detection of copy number switches that occur within a window of n SNPs.
The data modeled by sCNAphase is D = (m t , d t , m n , d n ), We also calculate the values D t , D n as the sum totals of d t , d n respectively across the genome.
We also define the tumor PHF t , which is a generalization of the standard BAF as

Statistical model of haplotype depth under null hypothesis of absence of tumor DNA
Under the null hypothesis of the presence of no tumor DNA in the sample or the matched normal the distribution of H1 allele counts m i t calculated at each window i can be modeled with a binomial distribution with a 50% 'probability of success' However, the presence of copy number variation and mapping biases lead to shifting from 0.5 to an unknown p 0 and greater than expected variation in m i t . As a result, it is necessary to instead use a beta-binomial with parameters alpha and beta equal to the number of counts mapping to H1 and H2 in the normal sample: As m i n and d i nm i n increase, the beta-binomial distribution approaches to the Binomial(m i t , d i t , p bin = p 0 ).

Tumor purity and cellularity
The percentage of tumor content can be measured in two different scales, (i) tumor cellularity (tc) defined as the percentage of tumor cells or (ii) tumor purity (tp) defined as the percentages of tumor DNA in mixtures of tumor and normal cells. These two quantities are related via the tumor ploidy, which we define as the average copy number of the If we assume a diploid normal, then the relationship between tc and tp is given by:

Hidden Markov Model based on haplotype segments
We model the probability of the data conditional on the tumor celullarity tc and the tumor ploidy pl as assuming that the normal depth is independent of tumor cellularity and ploidy, and the total tumor read depth is also independent of cellularity and ploidy. We use a hidden Markov Model to calculate this probability. The hidden states (s i ) of this model are the unobserved copy numbers, x for H1 and y for H2 in the tumor genome in window i, represented by g = (x, y). The total copy number is given by CN(g) = x + y. We consider all hidden states g with copy number in the range 0 ≤ CN(g) ≤ 12, that makes up a set G of 91 possible configurations. We also define a transition probability for transitioning between pairs of states g and l as t(s i = g | s i-1 = l). We can write down the joint probability of the observed tumor depth data and the unobserved state path using the following equation: Emission probability. In the above equation, the emission probability was split into two parts (the second and third components): one models the effect of sCNA on the total read depth d i t comparing to D t and another models the effect of sCNA on read depths for H1 comparing to H2. They can be calculated as, This effectively extends the maximal copy number to 100 to capture the significantly amplified regions, e.g. 50 copies of HER2, which otherwise seems equally unlikely to be any of the genotypes with copy number ≤ 12.
Initial and transition probability. The initial probability p(s 1 = g) in Equation (6) was set to be 1/91 for each g from G, so that there is no a-priori preferences for any particular genotypes. We used a fixed transition probability model defined on the all the states G at locus i. The transition probability between two successive sites s i-1 and s i is defined as in which, rev(l) is to reverse the order of the copy numbers for H1 and H2 for state l. Combining Equations (7) and (8), A can be calculated. Essentially, this transition probability imposes a weak preference for lower copy number states.
Estimation of tumor cellularity and ploidy. We first estimate the tumor purity and ploidy by maximizing the likelihood function in Equation (6) calculated by the Baum-Welch algorithm (31). Because of the flat likelihood surface at low purity tumor (Supplementary Figure S1), two dimensional gradient-free local searching algorithms occasionally failed to find the global maximum. We solved this by performing a one-dimensional optimization to determine the ideal tumor cellularity, for a set of ploidy values increased from 1.8 to 5 with an increment of 0.2. The pair of cellularity and ploidy with the biggest likelihood are chosen as cellularity and ploidy estimate.
Estimation of copy number profile. With the estimated tumor purity and ploidy, the hidden states (s) at each the window i are estimated using Viterbi algorithm (32). The density of the windows calculated by counting the distance between two neighbor windows (Supplementary Figure S2) is less than 100 kb for the vast majority. To impute the copy number information for regions in between two successive windows (segments), two windows are joined into one segment, if they are with the same copy number state and locate in short distance of less than 100 kb. This generates a copy number segmentation for the whole genome, with a few undetermined regions when two windows cannot be joined.

Removal of merging errors at copy number switches
The merging of every K SNPs gives a merging error, if the region spans the boundary of two adjacent sCNAs, which cannot be resolved with a single copy number state. Therefore, we nullify the estimation for segments with merging errors, which can be detected by a likelihood ratio test for each SNP k in a window i as follows, in which m k t , d k t are the alternative allelic depth and the read depth from a tumor sample at locus k; m k n and d k n are depths from the matched normal sample. The numerator assumes the K SNPs are in a single sCNAs, with a single copy number state g from G which maximizes the likelihood. The numerator assumes each SNP k can have an individual genotype that maximizes the likelihood. We then removed 1% of the segments with minimal values.

Workflow
We built a pipeline to generate a full copy number profile including copy number alterations to each parental haplotype, tumor purity and tumor ploidy using the method described above.
The pre-processing steps of the workflow ( Figure 1A) include (i) calling germline SNP variants, (ii) resolving germline allelic haplotypes and (iii) generating regional haplotype depths. Compared with B-allelic depth at per site, the phased allelic depths generated by ordering the alleles according to parental haplotypes (H1 and H2), gives a less variable allelic frequency, from which the regional haplotype depths further reduce the variability ( Figure 1B-D). The regional haplotype depth information is then used for the estimation step. By default, sCNAphase uses Equation (4) to calculate the distribution of P-values for the observed parental haplotype counts under a null model. This is used to infer the presence of tumor DNA. If the distribution of P-values matches the expected distribution in the QQ plot, then running sCNAphase is not recommended as there is not a strong enough signal in the data to infer the copy number segmentation. In terms of computational resources, variant calling using samtools mpileup ran on a single core (Intel Xeon X5680 3.3 GHz) per chromosome and took up to 12 h to complete the largest chromosome for a bam with 71x coverage. SHAPEIT2 runs on a single core per chromosome (Intel Xeon X5680 3.3 GHz), and the largest chromosome takes 2 h to complete.
The estimation step performed an integrated calculation of tumor purity, tumor ploidy and copy number profile, using the idea from Equation (6). The output from this includes (i) a copy number segmentation file that shows the regional changes in overall copy number as well as haplotype copy number in a region; (ii) a digital SKY (dSKY) plot based on the segmentation is generated for visualization similar to traditional SKY plot; (iii) a vcf file includes allelic copy numbers and phases for the each germline SNPs. This estimation procedure was implemented in R, powered by the multithreading technique, OpenMP. With the tasks split to 12 threads on as single chip (Intel Xeon X5680 6 3.3GHz), the estimation step takes ∼4 h.

Comparison of two copy number segmentations
Using one segmentation as the reference, the consistency of the other segmentation (test segmentation) with the reference can be measured in varied ways, given different criteria for counting if two copy number states are consistent. We performed this analysis using three different criteria: Counting per base overlap for copy number gain or loss. A segment is identified as gain if the copy number is higher than the average ploidy; otherwise, it is identified as loss according to a particular segmentation. The two segmentations are counted as overlapped at a single base, if that base is consistently seen as gain or loss. The fraction of overlapped bases to the number of bases in reference and the test segmentation are defined as sensitivity and specificity, respectively.
Counting per base overlap for LOH. The sensitivity and specificity are calculated based on how many bases are consistently identified as LOH. This only stresses identification of one haplotype copy number being zero (not both), disregards whether the overall copy numbers from two predictions being equal.
Counting 50% reciprocal overlap for focal amplifications. A segment is identified as focal amplification if the copy number is at least twice the average ploidy and the size is between 100 kb to 4 Mb. Once a segment is identified as a focal amplification, the actual copy number is disregarded. A reference segment and a test segment are overlapped, if the overlapped region accounts for at least 50% of each of the two segments. The ratio of the total number of overlapped segments to the number of segments from reference or test segmentation are defined as the sensitivity and specificity, respectively.

RESULTS
We developed a new method, sCNAphase, for estimating the copy number and LOH profile of the tumor genome for low cellularity tumors. sCNAphase integrates haplotypespecific allele counts together with total read depth in a Hidden Markov model that explicitly models both tumor DNA purity and ploidy. We evaluated the performance of sCNAphase against two state-of-the-art algorithms (CLI-mAT (version 1.1) and ASCAT (version 2.4)) using mixture samples derived from whole genome sequence data from 4 well-characterized tumor cell-line samples covering a range of ploidies with known copy number information from SNP array and SKY data. The samples used in this study are described in Table 1. Mixture samples were generated in silico over a range of tumor purities from 5% to 95% as described in Materials and Methods.

Haplotype phasing greatly improves the power to identify sCNA from allelic depth
The BAF is commonly used to detect the presence of copy number variants (CNVs) in normal, diploid genome (33). This signal can also be interrogated to find sCNAs by looking for deviations from an expected equal ratio of two alleles at germ-line heterozygous SNPs. In order to investigate the power of this signal, we calculated allelic depth (AD) at all germ-line heterozygous SNPs in a tumor derived cell-line (HCC1143, see Table 1) at varying tumor purities, as well as in the matched normal sample (note that an independent normal was used to generate artificial mixed tumor/normal samples at varying purity, see Materials and Methods). The distribution of P-values under a null model that assumes an equal ratio is inflated even for the normal sample, due to the presence of germline CNVs (Supplementary Figure  S3), thus we instead calculate P-values under the assumption that tumor BAF is the same as germ-line BAF, which corrects this inflation (Figure 2A). BAF provides substantial power to identify sCNA at 100% purity ( Figure 2B), but the signal is too weak at 5% purity to identify any sCNAs ( Figure 2C).
To investigate the improvement in power to detect sCNAs using PHF rather than BAF, we phased germ-line heterozygotes using SHAPEIT2 (28) and the panel of reference haplotypes from the 1000 Genomes Project, then used this information to calculate tumor PHF in non-overlapping windows of 40 consecutive heterozygous SNPs (Materials and Methods). This strategy dramatically improved the power to detect sCNAs in the 100% purity sample ( Figure 2E) and 5% ( Figure 2F), and did not lead to spurious identification of sCNAs in the normal sample ( Figure 2D).

Digital spectral karyotyping with sCNAphase is concordant with spectral karyotyping
The genome-wide results from the analysis of each of the pure tumor cell-lines and the matched germline samples were visualized using dSKY plots. These images were designed to build on the effectiveness of SKY images to represent genome-wide changes in ploidy (Figure 3). Our dSKY plots make visual identification of aneuploidy, loss of heterozygosity and focal changes straightforward. To build on the SKY visualization, red shading on the chromosome ideograms was added to indicate regions of LOH, while green shading indicates homozygous deletions.
Comparing the three available SKY images from these cell-lines to the corresponding dSKY plots, demonstrated that the results from sCNAphase were highly concordant with those from the SKY analyses, both at the ploidy level ( Table 2)   suggested that this cell-line had 4 copies of chromosome 7, and 2 copies of chromosome 17 ( Figure 3A and B; Supplementary Figure S4). In addition to the chromosomal gains or loss shown in SKY, dSKY plots are able to display more important information such as focal copy number changes, regions with loss of parental chromosomes and regions that are altered by events more complicated than whole chromosome gains or loss. For example, at the chromosome level, both SKY and sCNAphase suggest that HCC1187 has four copies of chromosome 7 ( Figure 3A). However, the higher resolution result from sCNAphase was able to detect that two of these copies shared an identical deletion at the tip of the q arm. Analysis of chromosome 17, revealed that while both the SKY and the dSKY suggested HCC1187 contained two copies of the chromosome (suggesting this chromosome had not undergone a copy number mutation), the dSKY plot revealed this chromosome has undergone a copy neutral LOH event ( Figure 3B).
In addition to visualizing large-scale chromosomal copy number alterations, inter-chromosomal translocations make it difficult for traditional SKY analyses to determine the exact copy number of translocated regions. For example, the results of from the SKY analysis of chr1 in HCC1187, make it difficult to resolve the copy number profile of this chromosome at the genome level ( Figure 3C). While the dSKY plots do not provide any information about the translocations, they are able to demonstrate which region has undergone a specific copy number alteration. Together these examples show that dSKY plots, make it possible to begin to untangle complex phenotypes and identify mutations invisible to previous methodologies at a genome-wide level.

sCNAphase accurately calculates tumor ploidy, tumor purity and sCNAs across a range of different levels of simulated tumor purity
The presence of stromal cells or other cells with a normal diploid genome in a solid tumor sample can impact the capacity of genomic-based approaches to characterize muta-  Table 3), we assessed the degree to which the analysis of the mixture samples were consistent with the results from the pure cell lines. A similar analysis was carried out using CLI-mAT and ASCAT, which estimate sCNAs as well as tumor purity and ploidy. This analysis revealed that sCNAphase was able to accurately recapitulate the ploidy results from each of the pure cell-lines across the majority of the mixtures as well as accurately determine the level of tumor DNA in each sample ( Figure 4A and B, Supplementary Table S3). The only inaccurate ploidy calls came from HCC1954 and HCC2218 mixtures at 5% tumor purity. Despite this, sCNAphase was still able to accurately report on the amount of cell-line DNA in these samples. Across the entire cohort of mixtures, the sC-NAphase results only deviated from the simulated proportion by maximum 2% (Figure 4B, Supplementary Table S3). Cellularity estimates, which can be calculated as a function of tumor purity and ploidy (see Materials and Methods), were also reported ( Figure 4C). The sCNAs identified were consistent across the mixtures (see Materials and Methods) except for the 5% mixtures from HCC2218 and HCC1954 in which the ploidy estimates had been incorrectly calculated ( Figure 4D). Adjusting the ploidy to the correct value rescued the segmentation and allowed sCNAphase to capture the same broad copy number profile from these 5% mixtures (Supplementary Figure S5).
In comparison, both CLImAT and ASCAT provided robust purity estimate down to 20% simulated tumor purity, but substantially over-estimated tumor cellularity for the simulated low purity samples ( Figure 4C). Both tools also failed to correctly resolve the correct ploidy of these low purity samples ( Figure 4A). ASCAT provided results that were consistent with the ploidy estimates from sCNAphase for high purity samples, but at 40% and 20% purity, the estimates changed significantly for one and three of cell-lines, respectively. CLImAT also showed the limitation in ploidy estimation at 40%, and it also misinterpreted the tumor ploidy of all the mixtures from HCC1954 and HCC2218. The sCNA predictions of CLImAT and ASCAT were selfconsistent down to 40% purity for 3 of 4 samples, although with lower self-consistency scores than sCNAphase (Figure 4D). These results showcase the utility of the increased power offered by per-segment based haplotype counting strategy to resolve complex tumor genomes over the current state-of-the-art BAF based approach.

The effect of sequencing and mapping artifacts
To test the tolerance of the methods to sequencing artifacts, we applied the tools to a matched tumor normal pair in which the tumor was an independent normal sample (i.e. a 0% tumor sample). In this case, the estimated tumor DNA purity should be 0%, and any sCNAs predicted are due to noise alone. For all four 0% tumor samples, sCNAphase identified < 0.1% tumor purity ( Figure 4C; Supplementary  Table S3). It did, however, identify spurious sCNAs present at this level of purity. On this basis, we can recommend that the sCNAphase segmentation should be disregarded if the estimated tumor cellularity is less than 1%. In order to avoid inference of spurious sCNA, sCNAphase also performs an initial significance test that infers whether there is detectable tumor DNA in the mixture (see Materials and Methods). The output QQ-plots for the four cell-lines at 0% showed no inflation of the test statistics, indicating no detectable tumor DNA, whereas substantial inflation of the test statistic was observed for 10% mixture samples and most of the 5% samples (Supplementary Figure S6). ASCAT and CLImAT reported tumor purity estimates of above 20% for these 0% tumor mixtures, which indicates that the copy number segmentation of these tools may be only reliable down to 40% purity ( Figure 4C).  100%  80  23  66  52  100  26  61  31  80%  80  24  67  47  100  25  56  31  60%  80  24  63  47  100  27  44  38  40%  67  28  61  52  100  30  78  25  20%  80  22  65  57  100  30  61  44  10%  87  29  77  52  100  23  61  48  5%  80  23  37  93  100  29  22  100 Sen for sensitivity; Spe for specificity.

Microarray analysis of cell-lines validates the sCNAphase results
The Cancer Cell Line Encyclopedia and the COSMIC Cell Line Project (34) provide an independent annotation of the mutations present in publically available cell-lines. The copy number profiles of the cell-lines analyzed by sCNAphase had been characterized as part of this COSMIC project, and were independently profiled using a PICNIC analysis of microarray data (35). This resource provides us with an independent annotation of the specific sCNAs in each of these cell-lines and allows us to investigate the capacity of sCNAphase to report on individual sCNAs. To compare the annotations of these cell-lines (Table 2), the base ploidy for each cell-line were determined, by rounding up the ploidy estimates from SKY, flow cytometry and PICNIC to integers and taking the consensus values. In this process, hyper or hypo-teraploidy was round to tetraploidy; hypo-triploidy to triploidy. On this basis, HCC1187 was considered as triploid and all other cell-lines were treated as tetraploids. Any segment with a copy number greater than the ploidy was defined as an amplification, otherwise as a deletion. Sensitivity and specificity were calculated by counting the per base overlap for copy number gain or loss (Materials and Methods). This comparison revealed that the majority of events present in the COSMIC annotation of each cell line could be found across the range of mixtures for each cell-line using sCNAphase, when ploidy was properly assigned (Figure 5A and B). In the samples in which ploidy was incorrectly calculated (5% mixtures for HCC2218 and HCC1954), the capacity of sCNAphase to reflect the COSMIC results was greatly diminished. A similar comparison with the results from CLImAT showed that this approach failed to correctly profile any of the HCC2218 or HCC1954 mixtures ( Figure 5A and B). ASCAT had comparable performance with sCNAphase for all four celllines in the samples that contained more than 40% tumor DNA, but the consistency with COSMIC segmentation significantly dropped at lower purity due to incorrect ploidy estimates. The comparison of the sCNAs identified by an array-based approach to those identified by sCNAphase, CLImAT and ASCAT, illustrate the ability of sCNAphase to identify valid copy number changes across a range of different simulated tumor purities.
LOH events are a common feature of the cancer genome and have been previously linked to the inactivation of tumor suppressors (36,37). Given the capacity of our approach to quantify and identify the haplotype of each chromosome, we assessed the ability of sCNAphase to identify the LOH events present in the COSMIC annotation of these cell lines ( Figure 5C and D). This comparison showed sCNAphase identified ∼90% of the regions of LOHs in COSMIC annotations of these cell-lines while producing few spurious results (except for HCC1954). Furthermore, sCNAphase was still able to identify the same regions of LOH in the low purity samples in which ploidy had been incorrectly assigned. CLImAT and ASCAT, however, showed very inconsistent LOH profile with COSMIC at low than 40% and 20% tumor purity.
One limitation for sCNAphase to reach even higher sensitivity was that sCNAphase provided no estimation at long regions with few SNPs or with highly variable depth profile (see Materials and Methods). For example, chr9p1-13 of HCC1187 was defined as a region of LOH by both COS-MIC and CLImAT ( Figure 5E); however, sCNAphase only identified small islands of LOH (marked in pink) and left the majority as undetermined. When investigating the raw BAFs from this location ( Figure 5F), the majority of the BAFs fell randomly between 0 to 1, producing a depth profile that was too confounding for sCNAphase to confidently resolve. The complexity of the copy number profile at this region was also recognized by ASCAT as revealed in the frequent switching between different copy number states.
In the analysis of HCC1954 there was a low degree of overlap between the regions of LOH identified by the three tools and those present in the COSMIC LOH annotation of the cell-line. Closer inspection indicated that these differences were due to large chromosomal regions predicted to be LOH in COSMIC, but which appear to be regions of high copy number and allelic imbalance, rather than regions of LOH (Supplementary Figure S7). We also found substantial differences between the copy number estimates from COSMIC and sCNAphase for multiple chromosomal arms for HCC1954 as well as differences between the ploidy estimates in COSMIC and those from sCNAphase and the previous SKY analyses (Supplementary Figures S4 and S7). Given the inconsistent results, the COSMIC annotation of the HCC1954 may underestimate the ploidy of this cell-line (Table 2) and as a result, may have reported some regions of LOHs and focal deletions for HCC1954 that do not reflect the true copy number profile of this cell-line.

Focal sCNAs identified by sCNAphase mirror those identified by microarray analysis
Given the clinical importance of recurrent focal amplifications in the cancer process, we assessed the performance of sCNAphase and CLImAT to detect the focal amplifications present in the COSMIC annotation for these four celllines, using the criteria of counting 50% reciprocal overlap for focal amplifications (Materials and Methods). A very stringent threshold was applied, which required focal amplifications to be in between 100 kb and 4 Mb as well as a copy number that was greater than twice the ploidy of the cell-line. It is worth noting that this is a more difficult copy number threshold for amplification than the one used in the previous section. In the analysis of the pure cell-lines, sCNAphase was able to detect the majority of the focal amplifications present in the COSMIC annotation of each of these cell-lines, with a sensitivity approximately twice that of CLImAT (Table 3; Supplementary Table S4).
For mixture samples, sCNAphase was still able to identify the majority of focal events at 5% tumor purity for HCC1187 and HCC1143, and 10% for HCC1954 and HCC2218. The drop in the 5% mixtures from HCC2218 and HCC1954 was due to the underestimation of the tumor ploidy in two samples. Despite this, sCNAphase was able to detect at least 12 copies of the pathologically relevant ERBB2 in HCC1954 and HCC2218 across the entire cohort of mixtures (Supplementary Table S5), highlighting the diagnostic potential of sCNAphase (38,39) at even 5% tumor DNA. Likewise, another focal peak of amplification, that was found to be recurrently altered in breast cancer (11q13) (40,41), was detected in both the 100% HCC1143 and HCC1954 samples, as well as their corresponding 5% mixtures. The identification of focal events, including those that are well known to be clinically significant, across the full range of mixtures demonstrates the capacity of sCNAphase to identify pathologically relevant sCNAs in ultra-low purity samples.
We hypothesized that one reason for a low validation rate (on average, only 41% and 31% of focal amplifications detected by sCNAphase and CLImAT, respectively, were validated) could be that COSMIC underestimates copy number of highly duplicated regions (due to florescence signal saturation) and so the 2 × ploidy threshold for declaring a focal amplification is not reached. To test this, we re-calculated specificity and sensitivity after increasing the sCNAphase detection threshold, but keeping the COSMIC threshold (Supplementary Table S6). As the sCNAphase threshold increased, the specificities almost doubled across all tumor purities, with a much smaller effect on sensitivity.
Homozygous deletions are another class of copy number mutation involved in the tumorigenic process (42)(43)(44). The COSMIC annotations show that there are 7 homozygous deletions larger than 100 Kb in these four cell-lines. sCNAphase was able to consistently detect the three longer homozygous deletions from HCC1143 and HCC1187 (Supplementary Table S7). The size of a focal deletion, more specifically, the number of germ-lines SNPs in the region limits sCNAphase from identifying shorter events. Our analysis of HCC1954 did not detect the longer deletion from chr22 of HCC1954; however, analysis of the raw se-e34 Nucleic Acids Research, 2017, Vol. 45, No. 5 PAGE 12 OF 14 quencing data suggest that this deletion may be shorter than the 332 kb listed in the COSMIC annotation, as the read depth at the flanking region is significantly higher than the deleted region (Supplementary Figure S8). Despite this, sC-NAphase did not identify any false positives and was able to consistently identify these deletions at minimal levels of tumor purity. In contrast, CLImAT failed to identify any of homozygous deletions present in COSMIC, and it classified a few other regions as homozygous deletions that were not present in either the COSMIC or sCNAphase results.

sCNAphase identifies copy neutral/amplified regions of LOH in low purity samples
Examination of the pure cell-lines with sCNAphase, revealed a substantial fraction of the tumor genome that had undergone LOH and contained two or more copies ( Table  2). This is a compound event that requires at least a deletion of one haplotype as well as amplification of the other haplotype. Analysis of these regions revealed a number of loci that were present in multiple unrelated cell-lines, including chr5q and chr17p as well as parts of chr17q, all of which had been previously found to undergo recurrent CN-LOH in cancer (37,(45)(46)(47) (Supplementary Figure S4). One of these regions contains the tumor suppressor TP53, the gene most commonly altered by CN-LOH (45,48). We observed homozygous somatic mutations in TP53 for three of the cell-lines (Supplementary Table S8; Supplementary Figure S9), which are likely to be disruptive. In the remaining cell-line, two heterozygous germline SNPs were identified (Supplementary Table S9), one of which, R283C has been previously shown to increase a carrier's risk of developing cancer (45,49,50). It is likely that the loss of the wild-type allele and the amplification of the deleterious SNP increase this risk. These results demonstrate the power offered by HTS and sCNAphase to characterize the biological impacts of a complex mutation in low purity samples.

DISCUSSION
Although somatic copy number alterations are a wellestablished driver of cancer, the capacity to identify these mutations are impacted by a number of issues including varying levels of tumor purity and frequent changes in tumor ploidy. As a result, the majority of methods designed to characterize the sCNAs in the cancer genome are unable to accurately profile a significant fraction of primary tumor samples. In this study, we have shown that by taking a haplotype-based approach, sCNAphase can overcome these issues and reliably characterize both genomewide, chromosome-arm-wide as well as focal copy number changes present in a tumor genome across a range of tumor purities.
Comparison of the results from the sCNAphase to those from SKY and flow cytometry, illustrated the ability of sC-NAphase to correctly reflect the genome-wide changes in ploidy. Using a range of mixtures, to simulate the challenges posed by low-cellularity primary tumors, we were able to recapitulate the changes in ploidy seen in the analysis of the pure cell-lines, across the spectrum of tumor purities. Comparison of copy number and LOH segmentations obtained from a high density microarray at 100% purity to those reported by sCNAphase across a range of mixture samples demonstrated the high specificity and sensitivity of the method down to 10% tumor purity.
Equally importantly, samples without any tumor DNA were predicted to have less than 0.1% tumor cellularity by sCNAphase, whereas both of the other approaches tested reported at least 20% tumor DNA. sCNAphase has this inbuilt robustness to sequencing and mapping artifacts because it models the observed regional tumor depth data (including total depth and haplotype-specific depth), conditional on normal depth data from the same region.
To focus on characterizing the copy number profile of low purity tumors, we have made the simplifying assumption that there is a dominant tumor clone with low heterogeneity in the tumor biopsy. Regions with heterogeneous copy number between clones with similar abundance would lead to a failure of the merging test statistic, and thus these regions would likely be excluded.
The accurate characterization of the copy number profile in low cellularity samples as well as the identification of mutations in cancer genes, is suggestive of the potential clinical utility of this tool. In addition to low purity tumors, a potential application of our haplotype-based methodology would be in studies that aim to profile the copy number changes in a tumor through the analysis of circulating tumor DNA. Circulating tumor DNA, has been previously shown to contain mutations present in the tumor genome, but tumor DNA in circulation is mixed with DNA from normal cells and the tumor purity is frequently low (51). While sCNAphase can successfully profile low purity mixtures, it is likely that in order to realize this, an optimized version of the tool will need to be developed.