Allele-specific multi-sample copy number segmentation in ASCAT

Abstract
Motivation: Allele-specific copy number alterations are commonly used to trace the evolution of tumours. A key step of the analysis is to segment genomic data into regions of constant copy number. For precise phylogenetic inference, breakpoints shared between samples need to be aligned to each other.
Results: Here, we present asmultipcf, an algorithm for allele-specific segmentation of multiple samples that infers private and shared segment boundaries of phylogenetically related samples. The output of this algorithm can directly be used for allele-specific copy number calling using ASCAT.
Availability and implementation: asmultipcf is available as part of the ASCAT R package (version ≥2.5) from github.com/Crick-CancerGenomics/ascat/.


Introduction
Allele-specific copy number alterations (CNAs) are commonly used to trace the evolution of tumours. One of the most frequently used algorithms to infer these copy number changes is ASCAT (Van Loo et al., 2010), which segments each sample separately. Due to measurement noise, the inferred locations of breakpoints shared between samples often differ. These differences can impair analyses of phylogenetic relationships between the samples, because evolutionary methods depend on the assumption that shared breakpoints appear at exactly the same location. Previous approaches to this problem include extensive experimental breakpoint validation (Schwarz et al., 2015), which is expensive and not always feasible, and size-based heuristic filters (Mangiola et al., 2016). Another approach infers allele- and clone-specific CNAs from multi-sample data by binning without segmentation (Zaccaria and Raphael, 2018).
To rigorously address the problem of multi-sample breakpoint detection, we have developed asmultipcf (allele-specific multisample piecewise constant fitting), a robust allele-specific multisample segmentation algorithm that is tightly integrated into the ASCAT framework (Van Loo et al., 2010). The ability of asmultipcf to improve phylogenetic inference was shown in a large case study on 181 samples from 10 patients with lethal metastatic breast cancer (De Mattos-Arruda et al., 2019).

Approach
asmultipcf incorporates and extends two copy number segmentation algorithms previously developed by Nilsen et al. (2012), which leverage vector operations for efficient implementation: first, aspcf (an allele-specific segmentation method for single samples), and second, multipcf (a multi-sample segmentation method, which is not allele-specific). Additionally, asmultipcf handles missing values, making extensive data filtering unnecessary.

Input data
For each sample, the following input data are required across germline heterozygous sites: (i) log ratios (logR), representing log-transformed copy numbers derived from sequencing depth or single nucleotide polymorphism (SNP) array data, and (ii) B allele frequencies (BAF), describing the allelic imbalance of SNPs. The algorithm presented here can handle missing values, so loci with incomplete data across samples do not need to be excluded.

Pre-processing
asmultipcf uses the same pre-processing steps as the allele-specific single-sample algorithm of Nilsen et al. (2012), including (i) mirroring BAFs to obtain a single track in regions of allelic imbalance and (ii) removing extreme outliers from logR and BAF data [see Nilsen et al. (2012) for details]. Given n samples across p SNP

An exact algorithm for weighted segmentation
We evaluate the fit of a segmentation solution to the data with a weighted least squares function that models missing values in the data matrix. A weight matrix $W = (w_{ij}) \in \mathbb{R}^{2n \times p}$ is derived by assigning $w_{ij}$ a weight of 0 if $y_{ij}$ is missing and 1 otherwise. Then all missing values in $Y$ are assigned an arbitrary non-missing (non-NA) value. Our aim is to find a segmentation $S = \{I_1, \ldots, I_M\}$ that minimizes the cost function

$$\sum_{I \in S} \sum_{i=1}^{2n} \sum_{j \in I} w_{ij} \left( y_{ij} - \bar{y}_{iI} \right)^2 + c\,|S| \tag{2}$$

where the best fit on a given segment $I$ is the weighted average of the observations on that segment,

$$\bar{y}_{iI} = \frac{\sum_{j \in I} w_{ij}\, y_{ij}}{\sum_{j \in I} w_{ij}},$$

and where $c$ is a penalty parameter that controls the number of segments. Expanding the square in (2) and omitting the term independent of $S$ gives the equivalent objective

$$-\sum_{I \in S} \sum_{i=1}^{2n} \frac{\left( \sum_{j \in I} w_{ij}\, y_{ij} \right)^2}{\sum_{j \in I} w_{ij}} + c\,|S|.$$

To find an optimal solution to the cost function, we adapt the dynamic programming algorithm of Nilsen et al. (2012) to our weighted problem. The algorithm iteratively minimizes the total errors $e_k$ at locus $k$ across all samples using the errors $e_{k-1}$ up to $k$, the costs of the current segments, $d_k$, and the penalty $c$, together with intermediate variables $A_k$ and $C_k$ (Algorithm 1).
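The relationship between the penalised weighted least squares cost and its reduced form can be checked numerically. The following is a minimal sketch, not the ASCAT implementation: the function names are hypothetical, missing values are represented as `None`, segment starts are 0-based, and segments in which a row has no observed values are assumed to contribute nothing to the cost.

```python
from typing import List, Optional, Sequence, Tuple

def weights_and_filled(Y: Sequence[Sequence[Optional[float]]]) -> Tuple[List[List[float]], List[List[float]]]:
    """Return (W, Y_filled): w_ij = 0 where y_ij is missing (None), 1 otherwise.
    Missing entries are filled with 0.0, an arbitrary value that never
    influences the cost because its weight is 0."""
    W = [[0.0 if y is None else 1.0 for y in row] for row in Y]
    Yf = [[0.0 if y is None else float(y) for y in row] for row in Y]
    return W, Yf

def _segments(starts: Sequence[int], p: int):
    """Turn 0-based segment start indices into half-open (start, end) intervals."""
    bounds = list(starts) + [p]
    return list(zip(bounds, bounds[1:]))

def full_cost(W, Y, starts, c):
    """Weighted squared error around the weighted segment means, plus c per segment."""
    total = 0.0
    for a, b in _segments(starts, len(Y[0])):
        for w_row, y_row in zip(W, Y):
            sw = sum(w_row[a:b])
            if sw == 0:  # assumption: a row with no observed data adds no error
                continue
            mean = sum(w * y for w, y in zip(w_row[a:b], y_row[a:b])) / sw
            total += sum(w * (y - mean) ** 2 for w, y in zip(w_row[a:b], y_row[a:b]))
    return total + c * len(starts)

def reduced_cost(W, Y, starts, c):
    """Cost after expanding the square and dropping the segmentation-independent term."""
    total = 0.0
    for a, b in _segments(starts, len(Y[0])):
        for w_row, y_row in zip(W, Y):
            sw = sum(w_row[a:b])
            if sw == 0:
                continue
            swy = sum(w * y for w, y in zip(w_row[a:b], y_row[a:b]))
            total -= swy ** 2 / sw
    return total + c * len(starts)
```

For any segmentation of the same data, `full_cost` and `reduced_cost` differ by the same constant, the weighted sum of squared observations, so both are minimised by the same segmentation.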

A heuristic algorithm for large data sets
Algorithm 1 is of order $O(p^2)$, which means that the segmentation becomes computationally expensive for long sequences. However, instead of allowing breakpoints at any of the $p$ positions, we can pre-select potential breakpoints and thereby reduce the runtime to $O(q^2)$, where $q$ is the number of potential breakpoints. To identify potential breakpoints, different heuristics can be used. Here, we apply Algorithm 1 to overlapping subsequences (length 5000 with an overlap of 1000), combine all of the inferred breakpoints and use them as input for the subsequent global segmentation. Algorithm 2 describes the fast heuristic version of asmultipcf.
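The windowing step can be sketched as follows. This is an illustration under stated assumptions rather than the package's code: the function names are hypothetical, indices are 0-based, consecutive windows are assumed to start `length - overlap` loci apart, and the start of each window is not treated as a candidate breakpoint because it only reflects where the window was cut.

```python
from typing import Iterable, List, Sequence, Tuple

def window_bounds(p: int, length: int = 5000, overlap: int = 1000) -> List[Tuple[int, int]]:
    """Half-open (start, end) intervals covering loci 0..p-1 with overlapping windows."""
    if p <= 0:
        raise ValueError("p must be positive")
    if not 0 <= overlap < length:
        raise ValueError("overlap must satisfy 0 <= overlap < length")
    step = length - overlap
    bounds = []
    start = 0
    while True:
        end = min(start + length, p)
        bounds.append((start, end))
        if end == p:  # the last window reaches the end of the sequence
            break
        start += step
    return bounds

def merge_candidates(per_window: Iterable[Tuple[int, Sequence[int]]], p: int) -> List[int]:
    """Combine window-local segment starts into global candidate breakpoints.

    per_window holds (window_start, local_segment_starts) pairs. Local index 0
    is dropped (assumption: it marks the window edge, not a breakpoint), and
    the sequence boundaries 0 and p are always included."""
    candidates = {0, p}
    for offset, local_starts in per_window:
        candidates.update(offset + s for s in local_starts if s != 0)
    return sorted(candidates)
```

The merged list plays the role of the potential breakpoints passed to the global segmentation, whose cost then depends on the number of candidates rather than on the number of loci.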

Post-processing
Both algorithms yield a single segmentation solution S for all samples. However, we expect that only some of the segments will be shared between all samples while others will be private. While ASCAT can be run directly on the global segmentation solution, removing unnecessary breakpoints on a per-sample basis can reduce noise in the segment average estimates by generating larger segments. To refine breakpoints individually for each sample, we use the breakpoints inferred from the multi-sample segmentation as potential breakpoints and rerun steps 2 and 3 of Algorithm 2 on each sample separately.

Implementation
asmultipcf is part of the ASCAT R package from version 2.5 onwards. The asmultipcf function contains a parameter to select whether the exact or the fast algorithm should be run, as well as an option to include the per-sample breakpoint refinement. Furthermore, samples can be weight adjusted to account for quality differences in the data. The manual contains example use cases, including a comparison to HATCHet (Zaccaria and Raphael, 2018).

Discussion
The independent segmentation of related samples can artificially inflate tumour heterogeneity. The algorithm presented here addresses this problem by joint segmentation. This approach can potentially underestimate tumour heterogeneity, because CNAs that are shared by many samples are more likely to be detected than CNAs that are private or shared by only a few samples. In practice, however, the penalty parameter c can be adjusted to ensure sensitivity. Overall, asmultipcf substantially improves the analysis of copy number changes of multiple samples.

Algorithm 1: asmultipcf
Input: Matrix $Y$ of log-transformed copy numbers and B allele frequencies; weight matrix $W$; penalty $c > 0$.
Output: Segment start indices and segment averages.
1. Calculate scores by setting $A_0 = [\,]$, $C_0 = [\,]$, $e_0 = 0$ and iterating for $k = 1, \ldots, p$:
• $A_k = [A_{k-1}\ 0] + w_{:k} \circ y_{:k}$ (the column vector is added to every column)
• $C_k = [C_{k-1}\ 0] + w_{:k}$
• $d_k = -\mathbf{1}^{T} (A_k \circ A_k \circ C_k^{-1})$, where $\circ$ denotes an elementwise matrix product and $C_k^{-1}$ the element-wise inverse
• $e_k = [e_{k-1}\ \min(d_k + e_{k-1} + c)]$, storing also the index $t_k \in \{1, \ldots, k\}$ at which the minimum in the last step is achieved.
2. Find segment start indices from right to left as $s_1 = t_p$, $s_2 = t_{s_1 - 1}$, ..., $s_M = 1$, where $M \geq 1$.
3. Find segment averages $\bar{y}_m = (w_{:s_m} \circ y_{:s_m} + \cdots + w_{:s_{m-1}-1} \circ y_{:s_{m-1}-1}) / (w_{:s_m} + \cdots + w_{:s_{m-1}-1})$, with elementwise division and $s_0 = p + 1$.

Algorithm 2: Fast asmultipcf
Input: Matrix $Y$ of log-transformed copy numbers and B allele frequencies; weight matrix $W$; penalty $c > 0$.
Output: Segment start indices and segment averages.
1. Split data set into overlapping subsequences and apply steps 1 and 2 of Algorithm 1 to each of them in order to find potential breakpoints $r_0, r_1, \ldots, r_q$, where $r_0 = 1$ and $r_q = p + 1$.
2. Aggregate sequences between breakpoints by setting $x_{ik} = \sum_{j=r_{k-1}}^{r_k - 1} w_{ij}\, y_{ij}$ and $v_{ik} = \sum_{j=r_{k-1}}^{r_k - 1} w_{ij}$.
3. Calculate segmentation solution by using the aggregated matrices $X$ and $V \in \mathbb{R}^{2n \times q}$ as input to Algorithm 1 instead of $Y$ and $W$, respectively.
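
The two algorithms can be illustrated with a short pure-Python sketch. It is not the ASCAT implementation, and all function names are hypothetical. It uses 0-based indices and plain loops instead of vector operations. It also assumes one particular reading of step 3 of Algorithm 2: the aggregated sums X stand in for the weighted observations and V for the weights. Under that reading, the exact algorithm is the special case in which every locus is a potential breakpoint. Rows without observed values in a candidate segment are assumed to contribute no cost.

```python
from math import inf

def aggregate(Y, W, r):
    """Sum w*y and w between consecutive candidate breakpoints r (r[0] = 0, r[-1] = p)."""
    spans = list(zip(r, r[1:]))
    X = [[sum(w[j] * y[j] for j in range(a, b)) for a, b in spans] for y, w in zip(Y, W)]
    V = [[sum(w[a:b]) for a, b in spans] for w in W]
    return X, V

def segment(X, V, c):
    """Dynamic programme over aggregated columns; returns 0-based segment starts."""
    rows, q = len(X), len(X[0])
    A = [[] for _ in range(rows)]  # A[i][s]: weighted sum of row i over columns s..k
    C = [[] for _ in range(rows)]  # C[i][s]: total weight of row i over columns s..k
    e = [0.0]                      # e[k]: best cost for the first k columns
    t = []                         # t[k-1]: start of the last segment in that solution
    for k in range(q):
        for i in range(rows):
            A[i].append(0.0)
            C[i].append(0.0)
            for s in range(k + 1):
                A[i][s] += X[i][k]
                C[i][s] += V[i][k]
        best, best_s = inf, 0
        for s in range(k + 1):
            # cost of a segment covering columns s..k (assumption: skip rows with zero weight)
            d = -sum(A[i][s] ** 2 / C[i][s] for i in range(rows) if C[i][s] > 0)
            cost = d + e[s] + c
            if cost < best:
                best, best_s = cost, s
        e.append(best)
        t.append(best_s)
    starts, k = [], q
    while k > 0:                   # backtrack from right to left
        k = t[k - 1]
        starts.append(k)
    return starts[::-1]

def segment_means(X, V, starts):
    """Weighted average per segment and row; None where a row has no observed data."""
    bounds = list(starts) + [len(X[0])]
    means = []
    for a, b in zip(bounds, bounds[1:]):
        means.append([sum(x[a:b]) / sum(v[a:b]) if sum(v[a:b]) > 0 else None
                      for x, v in zip(X, V)])
    return means

# Toy data: one sample, logR in the first row and mirrored BAF in the second.
Y = [[0.0, 99.0, 0.0, 1.0, 1.0, 1.0],   # 99.0 is an arbitrary fill for a missing value
     [0.5, 0.5, 0.5, 0.9, 0.9, 0.9]]
W = [[1, 0, 1, 1, 1, 1],                # weight 0 marks the missing logR value
     [1, 1, 1, 1, 1, 1]]

X, V = aggregate(Y, W, list(range(7)))  # exact: every locus is a candidate
exact_starts = segment(X, V, c=0.1)

r = [0, 2, 3, 5, 6]                     # fast: pre-selected candidate breakpoints
Xq, Vq = aggregate(Y, W, r)
fast_starts = [r[s] for s in segment(Xq, Vq, c=0.1)]
```

Because the arbitrary fill value carries weight 0, it affects neither the segmentation nor the segment averages. The same two calls, applied to the rows of a single sample with the multi-sample breakpoints as `r`, correspond to the per-sample refinement described under Post-processing.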