UNMASC: tumor-only variant calling with unmatched normal controls

Abstract Despite years of progress, mutation detection in cancer samples continues to require significant manual review as a final step. Expert review is particularly challenging in cases where tumors are sequenced without matched normal control DNA. Attempts have been made to call somatic point mutations without a matched normal sample by removing well-known germline variants, utilizing unmatched normal controls, and constructing decision rules to classify sequencing errors and private germline variants. With budgetary constraints related to computational and sequencing costs, finding the appropriate number of controls is a crucial step to identifying somatic variants. Our approach utilizes public databases for canonical somatic variants as well as germline variants and leverages information gathered about nearby positions in the normal controls. Drawing from our cohort of targeted capture panel sequencing of tumor and normal samples with varying tumortypes and demographics, these served as a benchmark for our tumor-only variant calling pipeline to observe the relationship between our ability to correctly classify variants against a number of unmatched normals. With our benchmarked samples, approximately ten normal controls were needed to maintain 94% sensitivity, 99% specificity and 76% positive predictive value, far outperforming comparable methods. Our approach, called UNMASC, also serves as a supplement to traditional tumor with matched normal variant calling workflows and can potentially extend to other concerns arising from analyzing next generation sequencing data.


Comparing tumor-only methods features
Existing methods that address tumor-only or control-free tumor variant calling incorporate some filtering such as on sequencing depth, alternate read depth, variant quality score, read quality, etc. Beyond filtering, several methods can be considered supervised or unsupervised learning approaches. Supervised methods rely on a high quality, pre-labeled, large sample size (subjects and variant calls) dataset of variant calls labeled only as germline or somatic. Unsupervised methods involve clustering variants by modeling the relationship between a variant and known clusters of germline or artifact variant allele frequencies or modeling the underlying biology of the sample (tumor purity, copy number, intratumor heterogeneity, multiplicity). Table S1 provides a breakdown of methods and their features and shortcomings.

Method/Citation
Type tSEG ART UMNs H2M RFGS Hiltemann [1] FW SomVarIUS [2] FW,UL (*) SGZ [3] FW,UL ISOWN [4] FW,SL LumosVar [5], LumosVar 2.0 [6] FW,UL (*) GATKcan [7] FW,SL TOBI [8] FW,SL Teer et al. [9] FW UNMASC FW,UL 3 Identifying low quality unmatched normal controls A crucial challenge of tumor-only variant calling with UMNs is to isolate a reliable, high quality set of controls. Otherwise, a low quality sequenced normal control could consistently miscall a variant leading to a decrease in sensitivity. From among over 1,000 tumors with matched normal samples, we plotted and clustered their Picard hsMetrics to identify low quality normals to exclude from the UMN set. While this manuscript does not focus on our approach to identify outlier samples and retain high quality UMNs, this aspect can be further explored. Figure S6 UMN quality metrics. Picard hsMetrics colored by inferred cluster (blue for non-outlier and red for outlier) to identify and exclude lower quality normal v8 samples. PCT TARGET BASES 50X and PCT TARGET BASES 100X denote the percent of target bases covered to 50X and 100X coverage or more, respectively. FOLD 80 BASE PENALTY denotes the fold over-coverage necessary to increase 80% of bases in non-zero coverage targets to the mean coverage level in those targets. PCT PF UQ READS ALIGNED denotes the fraction of passing filter reads that are unique and align to the reference genome. ON BAIT VS SELECTED denotes the fraction of bases on or near baits that are covered by baits. TOTAL READS denotes the total number of reads assigned to the index. AT DROPOUT denotes a measure of how regions with less than 50% GC content are undercovered relative to mean coverage.

Notation and Framework
For each tumor and UMN, a set of variant calls are generated. Each variant call is characterized by a pair of normal and tumor read counts. We describe the procedure applied to the UMN read counts. In a given genomic position, a variant present in the tumor that is not present in the MN would be considered somatic. And if a variant allele was considered present in the MN, the position in the tumor would be ignored. However, without the MN, the status of variant alleles detected at a position is unknown when calling somatic variants in the tumor sample. Using a set of UMNs, our procedure aims to characterize the nature of variant calling at a position or within a genomic interval. All of the proposed methods described below to generate variant call features rely on modeling read counts with a mixture model composed of a discrete uniform distribution and binomial densities with success probability 0.5, p, and 1 − p for 0 < p < 0.5. Let b denote the bth genomic locus, where b = 1, . . . , B. Let (A b , R b ) denote the bth locus's normal alternate and reference read counts, respectively. Letting denote the unobserved classification of the bth locus. Let C denote the set of possible classifications. Let θ = (π C , p) denote the set of underlying parameters where π C denotes the vector of mixture proportions and c∈C π c = 1. Let Bin(n, p) denote a binomial distribution with n trials and success probability p. In addition, let 1 {·} denote the indicator function where 1 {A} equals 1 if A and zero otherwise. The mixture distribution is characterized as The marginal likelihood is B b=1 P (A b |T b ; θ) which is optimized by introducing latent variables C 1 , . . . , C B and optimizing the expected complete log-likelihood through the EM algorithm.

Clustering normal read counts
In normal samples, we assume each locus is drawn from a diploid genome and thus there are three underlying genotypes (AA, AB, BB) or probability densities to consider. Positions with genotype AA, AB, and BB tend to have a VAF near zero, VAF near 0.5, and VAF near one, respectively. To capture positions that do not reflect the three possible genotypes, we include a "noise" density. This density encapsulates sequencing and alignment phenomenon that disagree with the underlying biology. Treating these four underlying classes as unknown, we can cluster their normal read counts. We model the B pairs of read counts with a mixture distribution consisting of six proposed densities. The first density accounts for noise genotypes, the second density accounts for heterozygous genotypes, the third density accounts for homozygous reference genotypes with expected VAF p, and finally, the fourth density accounts for homozygous alternate genotype with expected VAF 1 − p.
Using the notation above, let C = {1, 2, 3, 4}. Let θ = (π C , p). After the EM algorithm converges, we retain all loci classified to classes c = 1, 2, in other words, classified as noise or heterozygous, respectively, for subsequent segmentation. Assuming there are Z UMNs, this procedure is repeated across each set of the tumor variant calls' UMN read counts.

Normal VAF Segmentation (Hard to map regions)
Taking the union of positions across all Z variant calls and retained positions labeled noise or heterozygous, we perform segmentation. The goal is to construct a sparse segmentation, by taking advantage of the ordered loci, where each genomic interval is characterized by the proportion of loci with VAF deviating from 0.5. Suppose there are B Z unique and ordered genomic positions remaining where b = 1, . . . , B Z indexes the relative position and each position has at most n b pairs of normal read counts indexed by UMN z. Let V bz = 1 and V bz = 0 denote the indicator of a variant called and classified at noise or heterozygous or not called, respectively, at locus b for the zth UMN. For example, position b was called a variant among all Z UMNs but only UMNs 1, 2, and 5 have corresponding normal read counts classified as noise or heterozygous, and so n b = Z z=1 V bz = 3. As a data-driven annotation, our proposed segmentation scheme is designed to identify genomic intervals where the normal VAF does not match our expectations, which we refer to as hard-to-map (H2M) intervals. Generated through simulation, Figure S7 illustrates the goal of this segmentation. Figure 1b illustrates this phenomenon observed along chromosome 1.
Circular binary segmentation (CBS) [10] was not utilized because the algorithm is optimized for piecewise constant segments and pre-smoothed univariate data. CBS would not account for sequencing depth and not allow us to incorporate segments with expected VAFs centered around 0.5 and identify noisier genomic segments. The aim of our approach is to quantify the proportion of noisy loci per segment.
Using Figure 1b as motivation for our segmentation method, suppose segment i is composed of normal read counts with indicies b bounded by start and end indices s i−1 and s i , respectively. Using the notation defined, the segment-specific parameter is Within segment i, let z,V bz =1 V bz denote the observed likelihood, model size, and number of pairs of normal read counts, respectively. Due to possible local optimum, the goal within each segment is to retain the clustering result with largest BIC. Given I segments with fixed segmentation bounds s 0 , s 1 , . . . , s I , where s 0 = 0 and s I = B Z , the objective function to be maximized is Notice that calculating L (I) requires EM algorithms run per segment. Let I (t) , I + (t) , I − (t) , I ∼ (t) denote the current segmentation, current segmentation with added segment, current segmentation with dropped segment, current segmentation with adjusted segment, respectively. Here is our segmentation algorithm.
Algorithm 1 VAF segmentation 1: t ← 0, initialize segmentation I (0) and calculate L I (0) . 2: while true do 3: At iteration t, propose adding to, dropping from, and adjusting the current segmentation. 10: Exit the segmentation 12: end if 13: end while Once the segmentation of the normal read counts terminates, the relative positions b are mapped back to genomic positions to annotate genomic intervals by their locally estimated noise parameter π i . Since π i represents the estimated proportion of UMN VAFs drawn from the discrete uniform distribution, we consider segments with π i > 0.5 as hard-to-map, otherwise they are mappable, to reflect a simple majority of VAFs deviating from 0.5.

Strand bias
Strand bias detection is conducted by determining if the allele frequencies of the forward and reverse strands are consistent with one another at a given locus. Using Rsamtools [12], forward and reverse strand read counts harboring alternate and reference bases are obtained per tumor variant call. UNMASC implements the Fisher exact test and the outputted annotation includes the strand-specific alternate and reference read counts along with the strand bias p-value.

Identifying oxoG and Paraffin artifacts
Given existing works [13][14][15] that have studied specific artifacts, we attempt to incorporate their findings into our workflow. The oxoG artifact has been characterized by a sample's relative abundance of C>A (and G>T) base substitutions over all other base substitutions. If these variants occurred at approximately the same time or with the same proportion of reads after library amplification, then we expect them to occur with VAFs centered around a single mean regardless of somatic copy number alterations. We also expect that paraffin induced variants, previously characterized by a sample's relative abundance of C>T and G>A base substitutions and indels should also follow a similar pattern with their own respective allele frequency.
These artifacts are identified by modeling the tumor variant calls made by the zth UMN with a mixture of a discrete uniform density and binomial density for non-artifact and artifact variants, respectively. Using the above notation, let θ = (π 1 , p) and C = {1, 3}. The model is characterized as where variants clustered to the binomial density are considered artifact (labeled "signal"), otherwise the variants are considered not artifact (labeled "noise"). To identify oxoG artifacts, we cluster variants that meet our filtering criteria (total depth, Qscore, 1000 Genomes, ExAC, targeted, and lacking strand bias) with base substitutions C>A and G>T and tumor VAF less than a maximum expected tumor VAF, hopefully less than 0.5. Similar to oxoG, to identify paraffin artifacts, we cluster variants that meet the above filtering criteria but with indels and base substitutions C>T and G>A. This clustering procedure is repeated over each of the Z UMN tumor variant calls to be aggregated together. Conceptually, this procedure is applied to each UMN's set of tumor variant calls to assess the consistency of the artifact distribution.
In addition to clustering variants with the specific base substitution or indel, we make use of the above clustering results in an attempt to identify additional artifacts based on those that share a similar tumor VAF. Taking variants with the same filtering criteria but not satisfying the ExAC, 1000 Genomes, or base substitution/indel criteria, we infer their status as either artifact (labeled "signal like") or not artifact ("noise") using the maximum posterior probability characterized as .
The remaining variants that do not undergo artifact clustering or prediction are labeled "NA".

tVAF segmentation and inferring germline status
With strand and artifact annotations inferred, the remaining tumor-only variants can be segmented to identify local allele frequency patterns of GVs. Without excluding potential AVs, our segmentation algorithm could be distorted by clusters of artifacts at similar allele frequencies.
Due to loci repeatedly called across the Z UMNs, we take the unique union of tumor variant calls. There are many tumor variants with VAFs near 0 and 1. These variants are removed before filtering. Within segment i, to account for possible copy number alterations, the mixture model is composed of a discrete uniform density, a binomial density centered around 0.5, and a pair of binomial densities centered around p i and 1 − p i where p i < 0.5 to identify somatic mutations, segments of allelic balance, and segments of allelic imbalance, respectively. Our proposed segmentation algorithm does not rely on GDs to identify and segment GVs. Let θ i = (π i1 , π i2 , π i3 , p i ). The proposed mixture model for each unique tumor variant read count within segment i is characterized as The steps of tumor VAF segmentation algorithm are identical to the steps of normal VAF segmentation, the only difference being the mixture model proposed for tumor versus normal read counts. The output from this algorithm will contain segment start and end positions, where segment i is characterized by parameters θ i , to be used for inferring the germline status of candidate somatic variants.
Before characterizing a variant relative to its local segment, we need to determine if the segment itself is in allelic balance or imbalance to determine the B allele frequency. If π i1 = 1, this indicates that all loci within a segment are classified as noise and thus the allele status of that segment is unknown. But if π i1 < 1, the ith segment is considered to be in allelic balance if π i2 ≥ 0.5, otherwise the segment is in allelic imbalance. Given the allelic status for a segment, the posterior probability can be calculated of a variant not belonging to the distribution of local GVs. The poster probability of a variant not clustering to the distribution of local GVs For females, chromosome X's tumor VAF was segmented to determine local BAF status relative to candidate variant loci whereas no segmentation along chromosome X was performed for males to characterize the relation between tVAF to BAF.

Annotation-based Filters
Given a set of UMNs, the aim is to extract the set of (1) exonic/coding, (2) on-target, (3) (3), variants with a total tumor read depth of 10 and where the 25th percentile in normal read depth across UMNs is at least 10, harbor at least 5 alternate reads, and 25th percentile in Qscore across UMNs is 30 were retained. If a variant did not meet these thresholds but was reported at least 10 times in COSMIC, not reported in 1000 Genomes and not reported in ExAC or with a population frequency less than 0.005, they would also be retained. Moreover, through clustering of the normal read counts we have inferred homozygous reference (G=0), homozygous alternate (G=2), heterozygous (G=1), or noise (G=NA) status. Among variants in positions classified as G=0 among 50% of UMNs and called among 90% of UMNs. For (4), variants not reported in 1000 Genomes and not in ExAC or with a population frequency less than 0.005 are retained. The final set of variants that met all of the above criteria are referred to as candidate somatic variants. Through tumor VAF segmentation, UNMASC calculates and annotates them with their posterior probability of not sharing the distribution of the local segmented B allele frequency.  Figure S8 UNMASC workflow for a single tumor sample. The UNMASC workflow starting with Z unmatched normals (UMNs) bam files (N j .bam) and a tumor sample's bam file (T.bam) as inputs and resulting in annotated variant calls (VCs) for manual inspection and filtering, where j = 1, . . . , Z. The vcf files (VCF j ) are generated by Strelka and annotated with genes, COSMIC counts, ExAC, SnpEff, Variant Prediction, 1000 Genomes by Oncotator. The vcfs are combined (union MAF) and labeled as ON or OFF target, according to a supplied bed file. UMN read counts are clustered to infer noise or genotype status. Combining UMN genotype status, Oncotator annotations, tumor/normal read depth, qscore, and variant call frequency, we obtain a set of candidate somatic variant calls. Using only noise and heterozygous loci, we segment the corresponding read counts to infer mappable and unmappable genomic segments per chromosome. Tumor strand-specific read counts are extracted with samtools for strand bias annotation. In addition, strand bias (SB), OXOG and FFPE analyses are conducted to further infer artifact status before segmentation to avoid misclassifying germline variants. Non-strand bias, non-OXOG, and non-FFPE tumor loci read counts are segmented. Candidate variant calls are further annotated by tumor and UMNs segmentation results.   Figure S9 An overlay of 100 benchmark nVAF segmentations along chromosomes 1, 6, 10, and 14. The x-axis denotes the genomic position in megabases and the y-axis denotes the proportion of loci within a segment classified as noise. Darker color intensities correspond to more consistently identified regions detected at similar noise proportions. For example, the centromere region on chr1 is consistently detected with 100% of variants classified as noisy for segmentation. In addition, a narrow region less than 20 megabases on chr1 is also consistently identified as hard to map. Similar cases of wide and narrow H2M regions are displayed.

Aggregated filtering and one sample classification performance
We observed the combined effects of applying UNMASC's filtering criteria, varying the number of unmatched normal controls, and varying the threshold for proportion of times a variant was called across UMNs (25%, 50%, 75%, 90%, and 100%). This was conducted by sampling (without replacement) pools of UMNs with sizes ranging from one UMN to all available UMNs. A pool size was sampled ten times to capture any variability in SENS, PPV, and number of filtered variant calls each distinct set of UMNs might result in. The six cumulative layers of applied filtering criterion were applied to both the TO variants and TMN variants. Criteria BASE corresponds to applying the annotation-based filters including depth, Qscore, on-target, exonic, Cosmic, ExAC, and 1000 Genomes filters. Criteria UMNQC includes Criteria BASE specifications but removes three specific UMNs that resulted in decreases in sensitivity for a subset of samples from Criteria BASE. This was due to drops in Qscore and depth below our pre-defined thresholds and a handful of variants not called by one of the three UMNs. We attributed these undesired properties to outlier metrics in the percent of bases covered and fold 80 base change from Picard hsMetrics. For one sample presented in Figure S10, SENS was restored to 100% while PPV suffered a small decrease.
Criteria H2MGERM builds on Criteria UMNQC and filters variants by integrating the segmentation results. More specifically, variants falling in H2M regions and variants inferred to be germline are excluded. This resulted in a significant gain in PPV, reflected by an overall decrease in filtered variant calls, while maintaining SENS at 100%. Next, Criteria OXOG incorporates Criteria H2MGERM by excluding variants UNMASC deemed to be oxoG artifacts. In Figure S10, the dotted black line represents the corresponding criteria's number of filtered TMN variant calls (fVCs). Thus the drop in fVCs due to the oxoG filter suggests this sample tumor variants may have included oxoG artifacts.
Criteria STRAND builds on Criteria OXOG's filtered set by excluding variants deemed to be strand bias artifacts. UNMASC utilizes Rsamtools to count the number of forward or reverse strand and reference or alternate reads to conduct Fisher's Exact test per variant call. At a threshold of 0.05, variants with smaller p-values were considered strand bias artifacts. In Figure S10, the decrease in variability of PPV and mean increase in PPV appears to be attributed to strand bias artifacts present only among the TO variants and not the TMN variants. At last, Criteria FFPE utilized all previous criteria and filtered out variants deemed formalin-fixed paraffin-embedded artifacts. Below, the highlighted sample's variants did not appear to harbor paraffin artifacts after applying the previous sequence of filters.
A more in-depth look at Figure S10 illustrates the performance of a single sample in terms of SENS, PPV, and F1 measure (F1M) as they vary with respect to the layered filtering criteria, number of UMNs, and proportion variant called threshold.
The sample's PPV plateaued early at about 4 UMNs due to the presence of highly private germline variants not identified by 1000 Genomes or ExAC. Two false positive variants were considered hard-to-map by normal VAF segmentation. After applying Criteria C's filtering, the PPV increased from about 45% to 80% when each variant was required to appear in at least 90% of UMNs. Of the remaining TO variants, our OXOG approach inferred three OXOG variants with tumor VAFs of 0.021, 0.029, and 0.027 present in the TMN calls. After excluding inferred OXOG variants from both TMN and TO calls, this sample did not appear to have gold standard variants presenting with strand bias but they were present among TO calls. And finally, based on our proposed FFPE approach, the previously mentioned three OXOG variants also appeared to be FFPE artifacts. And thus for this sample, sensitivity was maintained at 100%, and with at least one UMN, the PPV is approximately 80% with a total of twelve TO variants after all filters are applied.

False Negatives
Figure S12 These plots capture the frequency of false negatives excluded by our workflow and the tumor purity of the 30 samples.