CScape-somatic: distinguishing driver and passenger point mutations in the cancer genome

Abstract Motivation Next-generation sequencing technologies have accelerated the discovery of single nucleotide variants in the human genome, stimulating the development of predictors for classifying which of these variants are likely functional in disease, and which neutral. Recently, we proposed CScape, a method for discriminating between cancer driver mutations and presumed benign variants. For the neutral class, this method relied on benign germline variants found in the 1000 Genomes Project database. Discrimination could, therefore, be influenced by the distinction of germline versus somatic, rather than neutral versus disease driver. This motivates this article in which we consider predictive discrimination between recurrent and rare somatic single point mutations based solely on using cancer data, and the distinction between these two somatic classes and germline single point mutations. Results For somatic point mutations in coding and non-coding regions of the genome, we propose CScape-somatic, an integrative classifier for predictively discriminating between recurrent and rare variants in the human cancer genome. In this study, we use purely cancer genome data and investigate the distinction between minimal occurrence and significantly recurrent somatic single point mutations in the human cancer genome. We show that this type of predictive distinction can give novel insight, and may deliver more meaningful prediction in both coding and non-coding regions of the cancer genome. Tested on somatic mutations, CScape-somatic outperforms alternative methods, reaching 74% balanced accuracy in coding regions and 69% in non-coding regions, whereas even higher accuracy may be achieved using thresholds to isolate high-confidence predictions. Availability and implementation Predictions and software are available at http://CScape-somatic.biocompute.org.uk/. Contact mark.f.rogers.phd@gmail.com or C.Campbell@bristol.ac.uk Supplementary information Supplementary data are available at Bioinformatics online.


COSMIC training data
The initial set of filtered training data used for these models is largely balanced between positives, recurrent (r ≥ 2) somatic variants, and the negatives, r = 1 somatic variants, for each chromosome and based solely in cancer patient data. We selected negative examples from regions within 1,000 (non-coding) or 10,000 (coding) positions of positives to mitigate location bias. *The COSMIC database contains enough data that we limited our datasets to 50,000 drivers for coding and non-coding regions.   The ICGC database [16] provided enough non-coding examples along with patient identifiers that allowed us to assign recurrence values to each variant. Thus we used the same approach with these data as with our COSMIC training set: we labeled variants appearing just once (r = 1) as putative passengers (negative) and those with r > 1 as putative drivers (positive). We had a sufficient number of positive examples that we were able to create three distinct data sets for drivers with r ≥ 2, r ≥ 3, or r ≥ 4. After removing all variants common between ICGC and the COSMIC data sets, we were left with the data sets shown in Supplementary Table 3.

Coding regions
After removing variants common to both the ICGC and COSMIC data sets, we were left with just  4,427 coding examples, including 1,695 positive and 2,732 negative examples. Few of these remaining examples had recurrence levels greater than r = 2, so we were unable to investigate performance at more than one recurrence level.

Features
We used eight distinct feature groups for our two classifiers, five of which are common to both. These are described briefly below.
• Spectrum kernel To capture the disruption that occurs in the sequence surrounding a mutation, we use spectrum kernels [5] that compare k-mer composition in windows of size w across a mutation. For singlepoint mutations the disruption is confined to a relatively small region, so for both coding and noncoding models a window size w = 3 and maximum k-mer size 2 yields the best discrimination.

• Conservation and Evolutionary
We use several conservation based measures, such as PhyloP [7], PhastCons [13] and FATHMM [12] scores. These are derived from multiple sequence alignments of 46 or 100 vertebrate genomes to the human genome [2]. In addition, we used dN/dS ratios between human and 65 different species (one-to-one orthologues).
• GC content Similar to the spectrum kernel, these features represent the percentage GC content for five distinct window sizes of 2, 4, 8, 16 and 32 bases flanking a position. This yields a 10-element vector of percent-GC values: five using the reference allele and five using the variant.
• Distance from gene features To assess a variant's potential impact on nearby gene products, we measure the distance from each variant to gene features annotated in the ENSEMBL gene models: start codon, stop codon, gene, UTR, CDS and exon. This approach is simple, yet provides an opportunity for models to learn relationships between variants and important gene elements.

• Mutation tolerance
The potential impact of a cancer variant may depend on the tolerance for mutations within its region [4]. We thus provide our classifiers with a count of the number of annotated variants within windows of 100 to 1 million bases flanking a variant, not including the variant itself.
• Functional elements This feature group encapsulates chromatin state segmentation from the ENCODE "BroadHMM" dataset encompassing nine human cell types. Genome segments that predict functional elements were generated from a combination of computationally integrated ChIP-seq data and input from a Hidden Markov Model (HMM).
• Mapability This feature group is derived from the ENCODE "Mapability" dataset that incorporates three different measurements and different window sizes to measure sequence uniqueness within a reference genome. A high score indicates an area where sequence is unique.
• Variant effects For coding regions we derive features from the ENSEMBL Variant Effect Predictor (VEP). The VEP provides characteristics for specific genomic locations that we can exploit to predict the likely impact of a SNV. Our features represent just two forms of data from this database: (1) consequence types, such as missense variant, nonsense variant or TF binding site variant and (2) the amino acid composition of the reference and allele residues.
For more thorough descriptions of these feature groups, see [9].
To identify the most effective recurrence levels for identifying positive training examples, we used complete forward-selection training procedures for coding and non-coding models, using values from r ≥ 2 to r ≥ 8 (coding regions) or r ≥ 9 (non-coding) for positive examples. As shown in Figure  1(top) of the main paper, we found that balanced accuracy increased substantially for recurrence levels up to r ≥ 6 for coding regions, reaching a nominal peak with r ≥ 7. For non-coding regions, performance increased steadily from r ≥ 2 up to r ≥ 8 before declining again at higher recurrence levels. For our models we thus selected r ≥ 7 as positive examples for coding regions, and r ≥ 8 for non-coding regions.

Non-coding model
Supplementary Figure 1: Progress of the forward selection process shows that balanced accuracy increases steadily for both coding (left) and non-coding (right) models. Left: coding model balanced accuracy increases from 63.3% to 74.0% as we add the first six feature groups, but plateaus at that point. Right: non-coding model balanced accuracy increases from 61.8% to 69.0% as we add the first four feature groups, but plateaus at the fifth group at 69.2%. Error bars represent 95% confidence intervals over 10 LOCO-CV runs.
Supplementary Figure 1 shows the progress of our forward selection procedure as individual feature groups are added to each of our models. Each model was tested using LOCO-CV, using balanced training sets of 4, 000 positive and 4, 000 negative from 21 chromosomes and testing on the remaining chromosome (test set composition shown in Supplementary Table 1). For both models, balanced accuracy increased steadily as we added feature groups. Coding model performance peaked at six feature groups with balanced accuracy just over 74% (Supplementary Figure 1, right). Forward selection identified feature groups for the coding model in the following order: 1=Spectrum kernel, 2=Mutation tolerance, 3=46-way conservation, 4=Chromatin state, 5=GC content, 6=Mapability and 7=VEP. Since there was no statistically significant difference between the last two models, only the first six groups were included in the final model.
Non-coding performance also increased steadily with a nominal peak at five feature groups with balanced accuracy at just over 69% (Supplementary Figure 1, left). Forward selection identified feature groups for the non-coding model as follows: 1=100-way conservation, 2=Mutation tolerance, 3=GC content, 4=Mapability and 5=Distance from gene features. Coding models used r ≥ 7 as positive examples, while non-coding models used r ≥ 8.

Comparison of benign germline variants and rare (r = 1) somatic variants which are likely enriched as passengers
A key aspect of our motivation for exploring models based solely on somatic variants was our hypothesis that there would be characteristic differences between neutral germline variants and putative somatic passenger variants (occurrence of r = 1). We thus evaluated feature values for assumed neutral variants from the 1000 Genomes database [14] and compared those with the same feature values for putative somatic passenger variants from the COSMIC database. We found a considerable number of features that exhibited different characteristics between these two populations.

Non-coding regions
In non-coding regions, at least three feature groups from the final models exhibit noticeable differences between putative germline neutral variants and somatic neutral variants (Supplementary Figure 2). The Mutation tolerance feature group provides simple counts of the number of somatic variants (both drivers and passengers) within the regions flanking a position. For smaller windows (100 to 1,000 positions) somatic variants are extremely rare around neutral germline variants, but appear more frequently near putative somatic passenger variants. This trend continues as we increase the window size up to 100,000 positions, although the distinctions become less obvious; in fact the final models include only the features for smaller windows. Within the Conservation group we include features such as FATHMM, PhastCons and PhyloP scores as well as constituents of the FATHMM score: HMMER state transition probabilities for wild-type (pW) and mutant (pM) alleles, and the difference between them. Finally, GC content reveals subtle differences in the frequencies, particularly in the regions surrounding wild-type alleles, with neutral germline variants appearing more frequently in low-GC content regions than putative passenger somatic variants (r = 1). somatic examples in non-coding regions supports our hypothesis that there may be characteristic differences between these two kinds of "benign" variants. All differences in these distributions are statistically significant at p < 10 −18 or below (by a Mann-Whitney test).

Coding regions
In coding regions, we again see differences in the Conservation and GC content groups. Within the Conservation group, the three precomputed conservation scores PhastCons, PhyloP and FATHMM all exhibit noticeable differences between germline and somatic variants. Similarly, there are differences in the HMMER state transition probabilities pW and pM as well. GC content scores exhibit differences in the regions immediately surrounding variants. In contrast to non-coding variants, it is the somatic variants that appear more frequently in regions with low GC content. coding regions again appears to support our hypothesis that there are characteristic differences between neutral germline and neutral somatic variants. All differences in these distributions are statistically significant at p < 10 −40 or below (Mann-Whitney).

Classifying non-drivers between germline and somatic variants
To complete the picture of the potential differences between rare (r = 1) somatic variants, which are likely putative passengers, and benign germline variants, we constructed classifiers using the same kinds of features used in the respective CScape-somatic models for coding and non-coding regions. These feature groups include all of the features shown in the preceding two sections. In coding regions, the model achieved 74% balanced accuracy and 0.82 AUC (Figure 4, left), while for non-coding regions the model had balanced accuracy of 62% and 0.67 AUC (Figure 4, right). It is worth reiterating that these models used the same feature groups as the CScape-somatic models: stronger performance might be achieved using our forward selection procedure on each of these. However, our goal was to evaluate characteristic differences between these two sets of neutral variants in the context of our CScape-somatic classifiers.

Cautious Classification
As in our previous studies [11,9], we establish thresholds for each classifier that yield the highest possible accuracy by restricting our attention only to high-confidence predictions. By varying a threshold τ through the range of possible values from the default threshold at 0.5 up to the maximum 1.0, a model's accuracy tends to increase monotonically until it reaches a peak that may be considerably higher than its accuracy at the default threshold. The non-coding classifier reaches its peak at

An alternative model with ICGC as training data and COS-MIC as the test data
Our procedure for training models on COSMIC examples and then testing on a separate data set such as ICGC relies on the suitability of COSMIC recurrence levels for predicting high-recurrence variants, that we interpret as drivers. However other data sets such as ICGC provide enough information to infer recurrence levels, and hence could be used as the basis for a alternative model. As noted in the main paper, we found a sufficient number of examples in the ICGC database to evaluate the distinction between recurrence levels r = 1 and r ≥ 2 in coding regions, and between r = 1 and r ≥ 2, r ≥ 3 or r ≥ 4 in non-coding regions. Hence we can apply the same model construction but with ICGC as training data so as to evaluate the predictive power of models based on ICGC data instead. Models trained on ICGC data did not perform as well in LOCO-CV, nor in testing on the COS-MIC examples (Table 4). In coding regions, balanced accuracy in LOCO-CV declined by up to six percentage points, with ICGC-trained models yielding 68.2% balanced accuracy compared with 74.2% accuracy for COSMIC-trained models. In non-coding regions, the differences were less obvious, with balanced accuracy in LOCO-CV ranging from 62.4% to 68.5% for ICGC-trained models, compared with 69.2% for the final COSMIC-trained non-coding model. We also found a substantial decline in performance when testing the ICGC-trained models against COSMIC examples. This may be due to some important differences in the feature groups used in the ICGC-trained models.
A prominent difference in these feature groups is the reliance on ICGC data, rather than COSMIC data, to obtain mutation frequencies (Table 5). This was necessary to mitigate potential bias in ICGC-trained models by preventing them from "learning" anything about the COSMIC data prior to testing on the COSMIC examples. However, the ICGC database contains 3.5 million distinct point mutations compared to nearly 10.8 million in the COSMIC dataset. Hence there is a slightly higher capacity for models to learn mutation frequencies specific to ICGC that may not translate to other data sets; in other words, these models may tend to overfit to ICGC mutation frequencies. The final selection of feature groups for ICGC models given in Table 5 shows that, as with the COSMIC models,  Table 5: Above are shown the feature groups selected for the best ICGC-trained models based on the same forward selection procedure used with the COSMIC-trained models. The most prominent feature group is mutation frequency based on the ICGC database, followed by the spectrum kernel. For variants in coding regions, distance from gene features such as start/stop codons and transcription start sites is also informative; in non-coding regions, conservation scores, GC content and sequence uniqueness are more predictive of high-recurrence variants.
conservation scores, GC content and sequence uniqueness also play an important role.
6 Evaluation of CScape-somatic on characterised common recurrent driver mutations In this section we report performance on some of the top top recurring single point driver mutations in coding regions proposed by Rheinbay et al (Extended Data Figure 1 in [8]). The predictions are for characterised drivers in well-known cancer genes which are typically also marked by a high recurrence within tumours. We consider all of their recurring mutations labeled by CScape-somatic as lying within coding regions and excluding the allosomes (X and Y). Entries in Supplementary  Figure 1 in [8]). Their study uses data from the Pan-Cancer Analysis of Whole Genomes Consortium and uses in excess of 2,700 cancer genomes from more than 2,500 patients. This table only gives single nucleotide variants located on autosomes and labeled by our classifier as residing in coding regions. The table presents the gene, chromosome (Chr.), position and reference nucleotide (Ref.) based on the GRCh37 reference genome. The three prospective variants are presented (Mut.) with the confidence driver-status given in the next column, in the same relative order, and derived from our predictor CScape-somatic (http://cscape-somatic.biocompute.org.uk).
Though CScape-somatic gives the correct prediction label in all instances, this Table should be regarded as a validity check. These SNVs are well characterised examples of drivers and have a high recurrence rate across a large sample of tumours. The Pan-Cancer Analysis of Whole Genomes (PCAWG) Consortium dataset derives from a combination of the International Cancer Genome Consortium (ICGC) [16] and The Cancer Genome Atlas (TCGA) [15] datasets. This is separate dataset from that used to train CScape-somatic, i.e. the COSMIC database [3]. However, highly recurrent SNVs in the PCAWG data are likely to be highly recurrent in the COSMIC database. If a classifier is trained to zero training error on its training set, then the re-presentation of a training example will be classified correctly. In short, these well characterised examples from the PCAWG dataset may effectively present as training examples and correct classification could be expected. Hence, the presented Supplementary Table 6 is to demonstrate that the method correctly classifies well characterised examples of SNV-drivers. In the main text, though, performance is evaluated via test evaluation i.e. on data which is unseen to the classifier. This is implemented by, for example, LOCO-CV in which the held-out chromosome equates to unseen data for classifier evaluation.