Multi-factor data normalization enables the detection of copy number aberrations in amplicon sequencing data

Motivation: Because of its low cost, amplicon sequencing, also known as ultra-deep targeted sequencing, is now becoming widely used in oncology for detection of actionable mutations, i.e. mutations influencing cell sensitivity to targeted therapies. Amplicon sequencing is based on the polymerase chain reaction amplification of the regions of interest, a process that considerably distorts the information on copy numbers initially present in the tumor DNA. Therefore, additional experiments such as single nucleotide polymorphism (SNP) or comparative genomic hybridization (CGH) arrays often complement amplicon sequencing in clinics to identify copy number status of genes whose amplification or deletion has direct consequences on the efficacy of a particular cancer treatment. So far, there has been no proven method to extract the information on gene copy number aberrations based solely on amplicon sequencing. Results: Here we present ONCOCNV, a method that includes a multifactor normalization and annotation technique enabling the detection of large copy number changes from amplicon sequencing data. We validated our approach on high and low amplicon density datasets and demonstrated that ONCOCNV can achieve a precision comparable with that of array CGH techniques in detecting copy number aberrations. Thus, ONCOCNV applied on amplicon sequencing data would make the use of additional array CGH or SNP array experiments unnecessary. Availability and implementation: http://oncocnv.curie.fr/ Contact: valentina.boeva@curie.fr Supplementary information: Supplementary data are available at Bioinformatics online.


Processing of validation array CGH and SNP array data
Array CGH processing DNA extraction, sample preparation and hybridization were carried out according to the manufacturer's protocol for formalin-fixed paraffin-embedded (FFPE) samples (Protocol G4410-90020, Version 3.4, July 2012, Agilent Technologies). Human genomic DNA from Promega (Male -#G147A, Female -#G152A) was used as reference genome. Array CGH analyses were performed using Sureprint G3 unrestricted CGH 8X60K microarrays (#G4827A, Agilent Technologies), which were scanned with an Agilent Type C scanner (G2505C -DNA microarray scanner with SureScan highresolution technology) using Agilent Scan Control software Version A.8.4.1. Feature Extraction software (Version 11.0.1.1, Agilent Technologies) with an hg19 annotated design file (025683_D_F_20100809, Agilent Technologies) and a modified extraction protocol (CGH_1100_Jul11, provided by Agilent Technologies) were used to extract raw data from tiff files obtained after scanning. The array-based CGH data were analyzed with Agilent Genomic Workbench 7.0.4.0 (Agilent Technologies). The log ratio values were segmented using R package CGHseg (Picard et al., 2011). The SCA algorithm (unpublished) was used to detect significantly aberrant genomic regions.

SNP array processing
SNP array experiments were performed using a CytoScan® HD Array Kit according the manufacturer's protocol (Affymetrix). 250 ng of genomic DNA were used for the target preparation and hybridized microarrays. When the quantity of genomic DNA available was less than 250 ng, a first whole genome amplification (Qiagen, REPLI-g Mini Kit PN: 150023) step was carried out before the assay. The CytoScan® HD profiles were normalized with the Affymetrix Power Tools software package (http://www.affymetrix.com). The log R ratio profile was then segmented in order to detect breakpoints and assign copy number status using Colibri (Rigaill, 2010) and GLAD (Hupé et al., 2004) software. The detection of CNAs was performed with the GAP algorithm (Popova et al., 2009). The sample cellularity and tumor ploidy were estimated by GAP.

Supplementary figure 1. Summary of amplicon sequencing technique
A. Main steps of amplicon sequencing technique. Currently, in amplicon sequencing DNA fragments are sequenced with the low input/low output sequencing technology (e.g., Ion Proton by Life Technologies). When this article is written, the sequencers of this type are the most appropriate for sequencing small amount of reads (~1-2 millions) at low cost and high speed. Reads sequenced with these technologies have various lengths. Since the total length of the targeted regions is relatively small (usually less than 2 Mb), the mapping of reads is fast and unambiguous. The user can use any mapping algorithm that allows mapping of reads with various lengths. B. Visualization of the typical configuration of read mappings (with Integrative Genomics Viewer). Exons 7 and 8 of the TNFRSF14 gene.

Supplementary Figure 2. GC-content bias in the amplicon sequencing data (dataset X)
A. Logarithms of read count normalized with regard to library size (log(NRC Lib )) versus GC-content. B. Fitted values of GC-content functional dependency for 15 control samples. C. Logarithms of read count normalized with regard to GC-content (log(NRC GC )).

Supplementary Figure 3. Amplicon length bias in the amplicon sequencing data (dataset X)
A-B. Logarithms of read count normalized with regard to library size (log(NRC GC )) versus amplicon length. C-D. Logarithms of read count normalized with regard to amplicon length (log(NRC Len )) versus GC-content (C) or amplicon length (D).

Supplementary Figure 4. Distribution of read counts normalized with regard to library size (NRC Lib )
A. Normal control sample X1 (median NRC Lib = 0.96); B. Tumor sample B1 (median NRC Lib = 0.61).

Supplementary Figure 5. Baseline creation
A. Read count profile after the 3 normalization steps (Control sample X1). In orange, a region with extremely high normalized read count values. B. Correlation between normalized read count profiles (Control samples X1 and X2). C. Second and third components of PCA performed on 15 normalized control samples (dataset X). D. Final profile of normalized read count (log(NRC Final )) (Control sample X1). The removed bias cannot correspond to the germline copy number changes. First, the baseline is calculated using at least (n+1) diploid control samples, where n is the number of principal components to keep. Thus, even if one of the control samples contains a real gain or loss, this event is unlikely to be included into the baseline. Second, the targeted regions in the amplicon sequencing almost always correspond to the exons of genes coding for proteins with important cellular functions. Thus, it is also very unlikely that such genes carry copy number changes in exons at the constitutional level. Slightly perturbed weighted mean values of log ratios: Random noise ~ added, where is the standard error of the weighted mean of the corresponding segment. Putative gains and losses corresponding to high densities to left and right of zero (black solid line). Centers of clusters shown by vertical dashed gray lines.
Supplementary figure 10. ONCOCNV uses two statistical tests to prevent the prediction of false positive copy number alterations: the t-test and the fixed variance test. ONCOCNV keeps only those candidate CNAs for which the fixed variance test and t-test p-values are lower than 0.01. (A) A situation when the fixed variance test produces a significant p-value while the t-test does not (Sample A4, chr1, segment 4 from 42525784 to 42526792 bp, annotated as copy-neutral by the array CGH analysis). (B) A situation when the t-test produces a significant p-value while the fixed variance test does not (Sample A5, chr22, segment 4 from 65321270 to 65330546 bp, annotated as copyneutral by the array CGH analysis). These examples demonstrate that using a threshold on the pvalues of both tests allows ONCOCNV to discard false positive predictions. Indeed, the t-test allows us to discard short segments with high overall variance (left panel) that contains several amplicon regions with extremely high or low NRCs (this situation results in a significant p-value of the fixed variance test), while the fixed variance test allows us to discard short regions that, by chance, contain only very close NRC values, all positive or all negative, with relatively small absolute values (this situation results in a significant p-value of the t-test).

Supplementary figure 11. Theoretical sensitivity of CNA detection by ONCOCNV
A. "Shrinkage" of NRC values towards 1 with higher tumor ploidy and higher contamination by normal cells does not allow for the correct CNA detection in highly contaminated or/and hyperploid tumors: loss of one copy (purple), gain of one copy (orange), copy neutral region (yellow). ONCOCNV filters out all candidate CNAs whose weighted geometric mean of NRCs falls within the range [0.875, 1.125] (light blue). B. This filter is likely to remove all CNAs present in less than 25% of cells of diploid tumors, in less than 33% of cells in triploid tumors and in less than 50% of cells of tetraploid tumors. Combination of tumor purity and clonality allowing a one-copy CNA to pass the filter: red (passed), blue (rejected).