-
PDF
- Split View
-
Views
-
Cite
Cite
Wenyi Zhao, Jingwen Yang, Jingcheng Wu, Guoxing Cai, Yao Zhang, Jeffrey Haltom, Weijia Su, Michael J Dong, Shuqing Chen, Jian Wu, Zhan Zhou, Xun Gu, CanDriS: posterior profiling of cancer-driving sites based on two-component evolutionary model, Briefings in Bioinformatics, Volume 22, Issue 5, September 2021, bbab131, https://doi.org/10.1093/bib/bbab131
- Share Icon Share
Abstract
Current cancer genomics databases have accumulated millions of somatic mutations that remain to be further explored. Due to the over-excess mutations unrelated to cancer, the great challenge is to identify somatic mutations that are cancer-driven. Under the notion that carcinogenesis is a form of somatic-cell evolution, we developed a two-component mixture model: while the ground component corresponds to passenger mutations, the rapidly evolving component corresponds to driver mutations. Then, we implemented an empirical Bayesian procedure to calculate the posterior probability of a site being cancer-driven. Based on these, we developed a software CanDriS (Cancer Driver Sites) to profile the potential cancer-driving sites for thousands of tumor samples from the Cancer Genome Atlas and International Cancer Genome Consortium across tumor types and pan-cancer level. As a result, we identified that approximately 1% of the sites have posterior probabilities larger than 0.90 and listed potential cancer-wide and cancer-specific driver mutations. By comprehensively profiling all potential cancer-driving sites, CanDriS greatly enhances our ability to refine our knowledge of the genetic basis of cancer and might guide clinical medication in the upcoming era of precision medicine. The results were displayed in a database CandrisDB (http://biopharm.zju.edu.cn/candrisdb/).
INTRODUCTION
Cancer is driven by somatic mutations that accumulate in the genome over an individual’s lifetime [1–3]. Thanks to the advances of next-generation sequencing (NGS) technologies in cancer genomics, the Cancer Genome Atlas (TCGA) has accumulated millions of somatic mutations in cancer from over 10 000 tumor-normal pairs. With the massive amounts of cancer genomic data, researchers have the opportunity to gain the somatic mutation landscape of more than 30 tumor types to better understand cancer biology and improve cancer diagnosis and therapy [4]. In the past decade, various computational tools have been developed for predicting cancer driver genes that contribute to malignant initiation or progression, based on mutation frequency, features [such as functional impact, structure (domain), and network/pathway], evolutionary selective pressure and multi-omics integration [5, 6]. Using these tools, lists of cancer driver genes have been generated from various studies. So far, over 1000 genes have been identified as drivers to carcinogenesis according to the data collected from the Cancer Gene Census (CGC) of the Catalogue of Somatic Mutations in Cancer (COSMIC) database and various published researches [4, 7–10].
While the detailed lists of cancer driver genes are gradually increasing, it is not always straightforward to identify which mutations in those genes might be drivers. With the upcoming precision medicine era, this will become an increasingly important question. However, not all mutations occurring in cancer driver genes are driver mutations, nor need every driver mutation to occur in cancer driver genes [11]. Driver mutations are mutations that confer a growth advantage on the cells, increase survival or proliferation and have been positively selected during the evolution of cancer, whereas passenger mutations do not have these features. Due to the overwhelming number of passenger somatic mutations that are unrelated to cancer, a great challenge is how to identify genetic and/or epigenetic events with functional implications for carcinogenesis [11–13]. To tackle this important task, several computational tools have been designed, which can be classified into five categories based on their major characteristics: (1) sequence-based, (2) functional impact-based, (3) structure-based, (4) network or pathway-based and (5) regulatory-based [5, 6, 14, 15]. Besides, there are also some machine learning- and deep learning-based methods [16–18]. And, the lack of a gold standard for evaluating the performance of different algorithms is a persistent problem. A recent study proposed complementary benchmark datasets and used them to comprehensively assess the performance of different algorithms in predicting cancer driver mutations [19]. These datasets provide good standards for evaluating algorithms.
To this end, here we report a new method (CanDriS V1.0) to profile amino acid sites that are potentially cancer-driving (so-called cancer-driving sites), focusing on ‘missense’ somatic mutations in the coding region of a gene. The notion that carcinogenesis is a form of evolution at the level of somatic cells suggests that our understanding of cancer initiation and progression can benefit from molecular evolutionary approaches. The driver sites under the positive selection during the tumor initialization and proliferation are supposed to have high rates of cancer evolution. Therefore, we model driver sites as the rapidly evolving component and passenger sites as the slowly evolving component. Since slowly evolving sites and rapidly evolving sites are usually unknown, a mixture model was then implemented. Then, we implemented an empirical Bayesian procedure to calculate the posterior probability of a site being cancer-driving for all sites in a gene and to make the significant one stands out. One may choose a posterior probability cutoff (for instance, 0.50, 0.90, 0.99, etc.) to filter the results. In the meantime, we applied three complementary benchmark datasets to comprehensively assess the performance of our method comparing it with other 33 algorithms, and the results showed that CanDriS had a consistently promising performance. The source code of CanDriS is available on GitHub (https://github.com/jiujiezz/candris), and the profiling results of PanCancer and 33 tumor types are displayed in database CandrisDB, which is freely available for all academic users at http://biopharm.zju.edu.cn/candrisdb/.
RESULTS
The computational procedure of software CanDriS
We developed a practically feasible analytical procedure CanDriS V1.0, which includes the following components (Figure 1A):
(i) Input the counts of somatic mutations over sites of a gene that can be collected from genomics databases by the user.
(ii) Estimate the unknown parameters under the two-component mixture model. Because the empirical distribution of cancer somatic mutations over sites of a cancer gene is highly skewed, we implement a statistically reliable procedure to estimate the unknown parameters.
(iii) Calculate the posterior probability of each amino acid site (Q) being cancer-driving with the number of somatic missense mutations at a site (z).

Posterior profiling of cancer-driving sites for TCGA at the pan-cancer level. (A) An illustration of CanDriS workflow. (B–D) The distribution of estimated parameters (η, m0 and m1) of 4047 genes. (E) Histogram of the posterior probability Qz for all sites of 4047 genes. (F) Relationship between Qz and z for all sites of 4047 genes.
Posterior profiling of cancer-driving sites in PanCancer analysis
We applied our computational procedure CanDriS to analyze cancer somatic mutations of 9078 tumor-normal pairs across 33 tumor types from the TCGA PanCanAtlas MC3 project (Supplementary Table S1, see Supplementary Data available online at http://bib.oxfordjournals.org/). In PanCancer analysis, genes that have no amino acid site with at least three recurrent missense mutations were excluded from further analysis; therefore, we have estimated the parameters m0, m1 and η for 4047 genes (m0, m1, the recurrent rate of somatic mutations at a passenger or driver site, respectively; η, the probability of any site being a driver in a gene). First, the estimated proportion (η) of cancer-driving sites in a gene is, on average, very small. Roughly speaking, only less than 1% of amino acid sites are potentially cancer-driving. Second, the number of somatic mutations per driver site m1 (mean ± s.e. = 3.204 ± 0.040) is about 31-fold greater than that per passenger site m0 (mean ± s.e. = 0.102 ± 0.001), indicating that recurrent somatic mutations at driver sites are much more frequent than that at passenger sites (Figure 1B–D). For each gene, we apply the empirical Bayesian procedure to calculate the posterior probability (Q-value) of any kth site being cancer-driving according to Equation (2); see the histogram of Q-values in Figure 1E. 715 370 amino acid sites have at least one somatic mutation among which 7327 sites (1.02%) have over a 0.50 posterior probability, 3201 (0.45%) sites have over a 0.90 posterior probability, and 1228 (0.17%) sites have over a 0.99 posterior probability, respectively (Supplementary Table S2, see Supplementary Data available online at http://bib.oxfordjournals.org/). The overall relationship between Q and z (the count of somatic mutations at a site) is shown in Figure 1F. In short, our analyses support the notion that among numerous somatic mutations appearing in cancer, only a very small portion of them may drive the process of carcinogenesis.
As a case in point, we found tumor suppressor gene PTEN contains many somatic mutations in the TCGA dataset, more than 40% of which are missense mutations. PTEN encodes a phosphatidylinositol-3,4,5-trisphosphate 3-phosphatase, which is an important signaling molecule in the PI3K signaling pathway and is often dysregulated in human cancers [20]. Among the 403 amino acid sites of PTEN, 122 sites have at least one somatic missense mutation, but of which, 59 sites are only mutated once (Figure 2C). We estimated that m0 = 0.361 and m1 = 22.8, respectively, and the proportion of cancer-driving sites η = 0.0215. Hence, on average, the number of cancer-driving sites of PTEN is about 403 × 0.0215 ≈ 8. Then, we analyzed the profile of the posterior probability of being a cancer-driving site by plotting it against the human PTEN amino acid sites. It shows that eight sites with at least seven recurrent mutations (z ≥ 7) have a high (Qz ≥ 0.90) posterior probability to be drivers. In contrast, sites with a few somatic mutations (z ≤ 5) are highly unlikely to be drivers. It is important to note that, one should be cautious for those sites with z = 6 because the posterior profiling may be highly sampling-dependent (Figure 2D). Oncogene PIK3CA, encoding the catalytic subunit of phosphatidylinositol 3-kinase (PI3K) which then activates the downstream effectors including pAKT and mTOR, is recurrently mutated in breast invasive carcinoma (BRCA) (312/779, 40.05%) and uterine corpus endometrial carcinoma (UCEC) (221/396, 55.81%). Among the 1068 amino acid sites, 17 out of 127 mutation sites have at least 10 recurrent mutations, and their posterior probabilities to be drivers are approximately equal to 1 (Figure 2G). Some mutations occurring in these sites are well-known oncogenic ones, such as E542K, E545K and H1047R, and also have shown a possibility of therapeutic implications from clinical trials, published clinical evidence and some literature publications. For instance, some clinical trials indicate that PIK3CA H1047R could be a druggable target in colon adenocarcinomas (COAD) and rectum adenocarcinomas (READ) for its sensitive drug response and effect to PI3K inhibitors, AKT inhibitors and mTOR inhibitors [21]. Moreover, PIK3CA E545K and E542K also show a sensitive response to mTOR inhibitors in a preclinical study [22]. Besides, a study suggests that a secondarily acquired PIK3CA H1047R mutation in a gastrointestinal stromal tumor with a primary BRAF mutation may have contributed to resistance to dabrafenib, a reversible adenosine triphosphate (ATP)-competitive kinase inhibitor that targets the MAPK pathway [23].

The profiling analysis of tumor suppressor gene PTEN and oncogene PIK3CA based on CanDriS. (A, E) Lollipop plot of cancer somatic missense mutations in PTEN (340 missense mutations) and PIK3CA (1130 missense mutations). (B, F) The Qz values of amino acid sites in PTEN and PIK3CA. (C, G) The distribution of amino acid sites with different missense mutations (z). (D, H) The profile of the posterior probability of being a cancer-driving site plotting against the PTEN and PIK3CA amino acid sites.
Identification of driver mutations across tumor types
We further applied CanDriS in 33 different tumor types and obtained posterior profiling of cancer-driving sites across tumor types. In almost all tumor types [except stomach adenocarcinoma (STAD) and skin cutaneous melanoma (SKCM)], more than 99% of mutated sites are considered as passengers as we can see in Figure 3. SKCM has the largest number of over 1000 driver sites (Qz ≥ 0.90). The eight tumor types with less than 2000 mutated sites [adrenocortical carcinoma (ACC), cholangiocarcinoma (CHOL), kidney chromophobe (KICH), mesothelioma (MESO), pheochromocytoma and paraganglioma (PCPG), testicular germ cell tumors (TGCT), thymoma (THYM) and uveal melanoma (UVM)] tend to have few, if any, driver sites. These cancer types either have a very small sample size or few mutations.

The number of predicted driver sites across 33 tumor types from TCGA under different posterior probabilities cutoff. The table on the left shows the number of mutated sites for the 33 tumor types under different posterior probability cutoffs (Qz < 0.5, 0.5 ≦ Qz < 0.75, 0.7 ≦ Qz < 0.9, Qz ≧ 0.9). The bar plot on the right shows the proportion of four types of driver sites under different Qz cutoffs (different colors represent the corresponding types).
Then, we calculated the posterior probability of all mutations occurring in cancer-driving sites and identified the potential driver mutations. The result shows that only a small group of mutations are drivers in more than 10 tumor types (cancer-wide drivers), all of which occur in well-known cancer genes (TP53, KRAS and PIK3CA). While most mutations act as drivers in very few tumor types (cancer-specific drivers), among which we found some interesting mutations (Figure 4). For example, general transcription factor II-I (GTF2I) is altered in almost half of all THYM by a single mutation L424H. GTF2I mediates the expression of various genes with importance in cell cycle control, stress response and genomic stability, including FOS, BRCA and STAT proteins [24], but has not been annotated as a driver gene in the COSMIC CGC yet. A recent study found that GTF2I L424H induced cell transformation and metabolic alterations in thymic epithelial cells [25]. Though GTF2I L424H occurred at high frequency in thymic epithelial tumors, strikingly prevalent in the A and AB histotypes, the frequency of mutation progressively decreased in the more aggressive histological types. Therefore, it can be a biomarker of prognosis [26]. IDH1 R132H is the most frequent mutation (>70%) in brain lower-grade glioma (LGG) and acts as a driver both in brain LGG and glioblastoma multiforme (GBM). Dang et al. performed structural studies demonstrating that IDH1 R132H results in the production of the oncometabolite 2-hydroxyglutarate (2HG) which contributes to the formation and malignant progression of gliomas [27]. Oncogenes GNAQ and GNA11 are both members of the G(q) subfamily. Their mutations predominantly cluster in exon 5 (affecting Q209), which is a cancer-driving site based on our analysis. The major types of mutation in GNAQ are Q209L and Q209P; the former is a cancer-specific driver in UVM, and the latter acts as a driver in both UVM and SKCM. GNA11 is a paralog of GNAQ. There was a mutually exclusive pattern in these two genes, and approximately, all UVM tumors harbor these mutations [28]. GNA11 Q209L is another driver in both UVM and SKCM. A study indicated that GNA11 Q209L could induce spontaneously metastasizing tumors in a mouse model, and the constitutive activation of the pathway involving GNA11 and GNAQ appeared to be a major contributor to the development of UVM [28].

The landscape of driver mutations across tumor types. Each driver mutation is represented as a single dot in the scatterplot. The size of the dot is corresponding to the max prevalence of driver mutations in tumor type. The orange represents cancer-specific drivers which are identified as drivers only in very few tumor types, and the red represents cancer-wide drivers in more than 10 tumor types. Only the driver mutations that occurred at high frequency (over 10%) in the corresponding tumor type are denoted by their abbreviations.
Considering the cancer heterogeneity for multi-types of cancer, we also applied the CanDriS for International Cancer Genome Consortium (ICGC) data at the pan-cancer and tumor-type level. And, we analyzed the cancer-wide and cancer-specific driver sites, which had posterior probabilities larger than 0.90 for TCGA and ICGC, respectively, and compared the results. We found that the posterior probabilities of cancer-wide driver sites in TCGA (n = 14) are all greater than 0.90 at the pan-cancer level of ICGC and tended to be drivers in more different tumor types of ICGC. And, due to tumor heterogeneity, most cancer-specific driver sites (n = 979) have a posterior probability of less than 0.90, and more sites (n = 1014) are not even identified as drivers for a single tumor type (Supplementary Tables S3–S5, see Supplementary Data available online at http://bib.oxfordjournals.org/).
CanDriS has a consistently good performance in the benchmark analysis
We then evaluated the performance of CanDriS by comparing it with other 33 algorithms on three benchmark datasets: literature annotation based on OncoKB [29], functional annotation based on in vitro cell viability assays [12] and effects of cancer mutations on tumor formation in xenograft experiment [30]. Three benchmark datasets for comparison represent different features distinguishing driver mutations from over-excess passenger mutations in cancer genomes and are highly complementary to each other [19]. The positive (driver) and negative (passenger) mutations were predefined according to the functional annotation in each benchmark dataset (Figure 5A).

The performance of CanDriS based on the benchmark analysis. (A) The pie plot showing the overlapping summary of negative and positive cases in three benchmark datasets. (B) A heatmap showing the ranks of the CanDriS and other 33 algorithms based on each benchmark dataset, indicated by color (green, in vitro cell viability benchmark; purple, in vivo tumor formation benchmark; and orange, OncoKB annotation benchmark). Ranks are labeled for the top 10 algorithms only. Red, higher ranks; and white, lower ranks. (C–E) The ROC curves of CanDriS and other 33 algorithms in three benchmark datasets (in vivo tumor formation benchmark, in vitro cell viability benchmark and OncoKB annotation benchmark, respectively).
To assess the predictive performance of our method, we employed area under the curve (AUC) of receiver operating characteristics (ROC) curves based on numeric scores generated from each algorithm. In three benchmark datasets, CanDriS had a promising performance, especially in the in vivo tumor formation benchmark, achieved the highest AUC score of 83.1%, and was the only method with an AUC score of >80% (Figure 5C). And CanDriS ranked sixth in the in vitro cell viability benchmark (Figure 5D), with an AUC score of 68.1%. The top three algorithms by the AUC score were CTAT-cancer (AUC = 74.0%), CHASM (AUC = 73.6%) and CanDrA (AUC = 70.2%) in this benchmark, only ranked top 10 in one or two benchmark datasets, and showed an inconsistent performance. Based on OncoKB annotation, CanDriS ranked ninth with an AUC score of 64.4%, and in the meanwhile, the best performing method VEST4 had approximately 70% (AUC = 68.7%) (Figure 5E). A possible reason for the slightly poorer performance in OncoKB may be intrinsic to the data collected from OncoKB, with only slightly more than half of cancer genes (n = 145, 51.42%) overlapping with our consensus driver gene list. In general, CanDriS was one of the only two methods that achieved top-performance (the top 10 algorithms highlighted for each benchmark) and showed an overall good consistency in each benchmark dataset despite the diversity of three benchmark datasets (Figure 5E). The benchmark analysis showed that CanDriS had a consistently good performance.
Cancer-driving site profiling database CandrisDB
To facilitate the utilization of our analysis results, we developed a user-friendly, interactive and open-access web portal, CandrisDB V1.0 (cancer-driving site profiling database), which is freely available for all academic users (Figure 6). We merged the prediction results of various algorithms based on different principles along with some previous PanCancer publications [3, 4, 7–10] to generate a consensus driver gene list. Then, we adopted the results of CanDriS to posterior profiling driver mutation sites for thousands of tumor samples from both TCGA and ICGC at the pan-cancer- and tumor-type level. Furthermore, we annotated all driver mutation sites with functional and pharmacogenomics data collected from other public databases, which include clinical and therapeutic information that might be of interest to clinicians. CandrisDB displays the putative cancer-driving sites in driver genes, which helps to understand the mechanisms of cancer evolution and progression. Besides, with the prospective clinical sequencing of cancer being widely used in cancer diagnosis and treatment, CandrisDB provides a list of potential targets for cancer therapy and guidance for clinical medication. For convenient searching, we only displayed 7327 (TCGA)/6681 (ICGC) potential driver mutation sites whose posterior probability (Qz) is greater than 0.50, as well as 2961 potential driver genes in our database CandrisDB. Moreover, we also presented the CanDriS profiling results for different tumor types on the ‘Tumor type’ page. Besides, the webserver of CanDriS is available on the ‘CanDriS’ page, which helps users easily predict cancer driver mutations.

DISCUSSION
Recently, developing new therapies or drugs that target somatic mutations in putative driver genes has become a hot spot. For example, there is a small-molecule inhibitor targeting BRAF V600E mutations called vemurafenib, which is effective in shrinking tumors [31]. Also, neoantigens generated from somatic mutations have attracted lots of attention as ideal immunotherapy targets [32–35]. Thus, identifying cancer-driving mutations is a key step in understanding the molecular mechanism of cancer development, which helps to generate a list of potentially druggable mutations and to optimize medication prescribing in the upcoming era of precision medicine.
In cancer genomics, the number of recurrent somatic mutations at the same amino acid site is the primary indicator for predicting a particular gene to be carcinogenesis-related. Therefore, one may intuitively predict the cancer-driving sites of a cancer-related gene if the counts of somatic mutations at those sites are above a pre-specified cutoff, but this kind of approach is highly sample-dependent and lacks statistical justification. In this study, to overcome those drawbacks, we developed a new method called CanDriS by implementing a two-component evolutionary model and by calculating the posterior probability of a site being a potential driver while given the observed number of mutations. Our approach has two advantages: first, it takes advantage of the large-scale cancer genomic data without strong gene- or site-specific sampling bias; second, it avoids the cutoff arbitrary, that is, how many recurrent somatic mutations are sufficient for a site to be predicted as cancer-driving, and our method calculates posterior probabilities for sites. And through the benchmark analysis, our method shows a consistently good performance. Using CanDriS, we profiled the potential cancer-driving sites for TCGA and ICGC at the pan-cancer- and tumor-type level, and only approximately 1% of the sites have posterior probabilities larger than 0.90, which indicates that most sites in a gene are passengers (Supplementary Tables S3 and S4, see Supplementary Data available online at http://bib.oxfordjournals.org/). The potential cancer-wide and cancer-specific driver mutations were shown on the database CandrisDB and were annotated with functional and pharmacogenomics information.
Our method potentially has broad applications but may compromise prediction sensitivity. There are still several biological covariates, which may have profound influences on the prediction accuracy and sensitivity. At the sample level, they include cancer stage (primary or metastasis), clinical severity, type of cancers, as well as medical treatments. At the genome level, the major factor could be mutational heterogeneity [7, 36, 37]. We recognized that some assumptions we made in our analyses may be biologically oversimplified. By computer simulations, we found that the posterior probability for a cancer-driving site calculated under the two-component Poisson model tends to be overestimated around the turning point of z in Equation (9) when somatic mutations are over-dispersed. In future studies, we propose two approaches to address this issue. The first one is to improve the computational pipeline so that it has more options and flexibility to treat different cancer samples, for example, primary or metastasis, cancer tissues, or medical treatments. The second one is to extend the two-component mixture model to some more sophisticated models such as the negative binomial distribution (NBD) model. For instance, we implement a Poisson–NBD mixture model, that is, the number of somatic mutations at passenger sites follows a Poisson model, whereas that at driver sites follows an NBD model. It appears that the posterior profiling converges to virtually the same between the Poisson–Poison model and the Poisson–NBD model when the posterior probabilities are above 0.75 or higher. With these caveats in mind, we conclude that the two-component Poisson model can be used as the first statistical attempt toward cancer-driving site profiling.
METHODS
The two-component mixture model for cancer somatic mutations
Empirical Bayesian profiling of cancer-driving sites
Interestingly, under the two-component mixture model, the relationship between P(driver|z) and z is logistic-like; the baseline of P(driver|z) is usually close to zero by allowing z = 0, and a higher posterior probability value indicates higher statistical confidence for the site being cancer-driving. As z at a site increases, the chance of this site being a driver increases toward 100% ultimately. In practice, one may choose a posterior probability cutoff, say, 0.50 such that a site is predicted to be cancer-driving if it receives a posterior probability of more than 0.50. This cutoff is too liberal which may result in a high rate of false positives. On the other hand, a stricter cutoff can be chosen as high as 0.90.
Implementation of parameter estimations
The current version of our method implements an empirical Bayesian approach to calculating Equation (4). The first step is to estimate unknown model parameters. From TCGA, it is straightforward to count the number of somatic mutations at each site of a gene under study and then obtain the empirical frequency of somatic mutations, for example, f0 is the proportion of sites with zero somatic mutation, f1 is that of sites with single somatic mutation and so forth. Here, we develop an efficient procedure to estimate unknown model parameters, which can avoid some technical issues usually caused by the small sampling problem.
Further, we suggest that this empirical Bayesian approach is biologically meaningful only when the statistical testing shows that the two-component mixture model is statistically significantly better than any one-component model, which can be formulated by the likelihood ratio test (LRT) against the null hypothesis of η = 0 versus the alternative of η > 0. The calculated log-LRT value approximately follows a chi-square distribution with the degrees of freedom df = 2. In the case of PTEN, for instance, the null hypothesis of η = 0 was statistically highly rejected (LRT test, P < 10−22).
Some technical issues: cancer-related genes may have highly recurrent somatic mutations at very few sites (hotspots), for example, the 600th codon of BRAF, the highest recurrent mutation site, has received 545 recurrent mutations, much higher than the second site (469th codon) with 16 mutations. In our dataset, there are six well-known cancer-related genes (including oncogenes KRAS, BRAF, IDH1, PIK3CA, NRAS and tumor suppressor gene TP53) each of which has at least one site with more than 100 recurrent somatic mutations. While those highly recurrent mutations are cancer-related, they may cause some statistical biases in our prediction (not shown). Therefore, we excluded any amino acid site with z ≥ 100 in our PanCancer analysis and assigned the posterior probability of being a cancer-driving site Q (z ≥ 100) = 1.
Data sources for CanDriS
Cancer somatic mutation data we used in our study were from TCGA and ICGC. Somatic mutations across 33 tumor types and 10 224 cancer donors were obtained from the TCGA PanCanAtlas MC3 project (https://gdc.cancer.gov/about-data/publications/mc3-2017), including 3 517 790 somatic mutations in the coding regions, 2 035 693 of which caused amino acid changes in cancers (missense mutations). The mutation data from 35 tumor types of ICGC are publicly available from ICGC Release 25 (https://dcc.icgc.org/releases). After filtered out some redundancies, 9078 samples with 928 012 missense mutations from TCGA and 6970 samples with 423 047 missense mutations from ICGC were used in our study (Supplementary Table S1, see Supplementary Data available online at http://bib.oxfordjournals.org/). Sequences and annotations of human genes were extracted from the Ensembl database (http://www.ensembl.org, GRCh37, Release 75) [51]. For each gene, the transcript which is most commonly used in TCGA was chosen.
Data sources for CandrisDB
The list of over 2000 potential cancer driver genes was generated from the combined result of multiple driver gene prediction algorithms based on different principles. And, we also annotated the driver genes based on some driver gene lists from different sources [3, 4, 7–10]. There are 15 algorithms included in our analysis, which can be categorized into five types based on their major features: (1) frequency-based (MuSiC [52], MutSig2CV [8] and OncodriveCLUST [53]), (2) feature-based, for example, functional impact (20/20+ [54], OncodriveFML [55] and CompositeDriver [4]), (3) structural (domain)-based (e-Driver [56], ActiveDriver [57]), (4) evolutionary selective pressure-based (dNdScv [44], CBaSE [58], CN/CS calculator [50]) and (5) network- or pathway-based (DriverNet [59], OncoIMPACT [60], PNC [15]). We also included the result of a machine learning-based method DriverML [18]. And, we linked the driver genes to cancer hallmarks based on CancerGeneNet [61].
The potential list of cancer-driving sites was generated by CanDriS, and the functional and pharmacogenomics annotation of these sites were collected from several public databases and publications. The biologic and oncogenic effects of mutations were collected from OncoKB. And, the prognostic and clinical effects of mutations were collected from published papers, including different levels of evidence (e.g. preclinical, clinical trials, case reports, FDA-approved, etc.), which can support whether a specific mutation can be considered druggable and associated with known drug responses and effects in the listed cancer types. Besides, the entries about mutation–drug interaction were collected from some public databases [e.g. personalized cancer therapy (PCT) [62], DrugBank [63]], which were also based on different levels of evidence that tumors harboring a specific mutation respond to certain drugs. Moreover, the drug-resistant mutations were collected from published resources, such as TTD [64], COSMIC and DEPO [65], which were identified in the literature as resistance mutation, including mutations conferring acquired resistance (after treatment) and intrinsic resistance (before treatment).
Benchmark analysis for CanDriS
Seventy-one somatic mutations with their oncogenicity annotation based on in vivo tumor formation assays were downloaded from the study by Kim et al. [30]. In the analysis, 45 mutations labeled as ‘functional’ were used as positive cases and 26 other mutations were used as negative cases. In vitro cell viability data of missense mutations were collected from FASMIC v1.6 http://bioinformatics.mdanderson.org/main/FASMIC. Overall, 828 mutations were included for analysis, among which, 361 mutations annotated as activating, inactivating, inhibitory, or non-inhibitory in consensus functional call were considered as positive cases, while neutral mutations were considered as negative cases. OncoKB annotations were downloaded from OncoKB v2.7 (https://www.oncokb.org). This version contained 3246 missense mutations with functional annotations (such as oncogenic, likely oncogenic and likely neutral). The oncogenic and likely oncogenic mutations were labeled as positive cases and the likely neutral mutations as negative cases.
The 33 algorithms we used for comparison employ various information as features to build predictive models, which can be grouped into six main categories: sequence context, protein feature, conservation, epigenomic information, ensemble score and cancer-specific [19]. Prediction scores of most algorithms were extracted from dbNSFP V4.1 [66]. Otherwise, CanDrA [67] scores were calculated by using the ‘cancer-in-general’ scores with CanDrA v+. CHASM v3.1 [68] scores were retrieved from the CRAVAT web server (v5.2.4) (http://www.cravat.us/CRAVAT/). CTAT-cancer and CTAT-population scores were calculated by performing principal component analysis in R, as described in the original paper [4]. FATHMM-cancer [69] scores were retrieved from http://fathmm.biocompute.org.uk/cancer.html. TransFIC [70] scores were calculated by a Perl script downloaded from http://bbglab.irbbarcelona.org/transfic/home. Missing values were omitted.
For each benchmark dataset, the ROC curves and AUC scores were obtained by using the R function ‘roc’ provided in the pROC package.
We developed a two-component evolutionary model named CanDriS to estimate the proportion of cancer-driving sites of a gene, and estimate the posterior profiling of cancer-driving sites by an empirical Bayesian procedure.
Compared with 33 tools in three benchmark datasets, CanDriS has a consistently good performance.
We comprehensively profile all potential cancer-driving sites for thousands of tumor samples from TCGA and ICGC at the pan-cancer- and tumor-type level, displaying the results in a database CandrisDB (http://biopharm.zju.edu.cn/candrisdb/).
We also provided annotation for these potential driver mutations with functional and pharmacogenomics information collected from published papers and public resources.
Data availability
The source codes of CanDriS and the original data are available on GitHub (https://github.com/jiujiezz/candris). CandrisDB is freely available for all academic users at http://biopharm.zju.edu.cn/candrisdb/.
Authors’ contributions
X.G. and Z.Z. designed and supervised this study. X.G. conceived the method. Z.Z., W.Z. and J.W. wrote the programs. W.Z., J.Y., G.C., Y.Z., J.W., J.H., W.S. and Z.Z. performed the data analysis. X.G., Z.Z., W.Z. and J.Y. wrote the manuscript. M.J.D., S.C. and Jian.W. revised the manuscript. All authors read and approved the final manuscript.
Acknowledgments
We gratefully acknowledge the clinical contributors and data producers from the TCGA Research Network for referencing the TCGA dataset.
Funding
This work was supported by grants from Key R&D Program of Zhejiang Province (grant no. 2020C03010); National Natural Science Foundation of China (grant no. 31971371); Zhejiang Provincial Natural Science Foundation of China (grant no. LY19H300003); Fundamental Research Funds for the Central Universities; Alibaba-Zhejiang University Joint Research Center of Future Digital Healthcare.
Conflict of interest
The authors declare no conflict of interests.
Wenyi Zhao is a PhD candidate at the College of Pharmaceutical Sciences & College of Computer Science and Technology, Zhejiang University. Her main research interests include the development of bioinformatics tools and databases and cancer genomics.
Jingwen Yang is a postdoc fellow at the MOE Key Laboratory of Contemporary Anthropology, Human Phenome Institute, School of Life Sciences, Fudan University. Her main research interests include the development of bioinformatics tools and evolutionary genomics.
Jingcheng Wu is a PhD candidate at the College of Pharmaceutical Sciences, Zhejiang University. His main research interests include the development of bioinformatics tools and databases and cancer genomics.
Guoxing Cai is currently a master’s student at the College of Pharmaceutical Sciences, Zhejiang University. His main research interests include the development of bioinformatics tools and databases and cancer immunogenomics.
Yao Zhang is an undergraduate student at the College of Pharmaceutical Sciences, Zhejiang University. His main research interests include the development of bioinformatics tools and cancer genomics.
Jeffrey Haltom is a PhD candidate at the Department of Genetics, Development and Cell Biology, Iowa State University. His main research interests include evolutionary genomics and cancer genomics.
Weijia Su is a PhD candidate at the Department of Genetics, Development and Cell Biology, Iowa State University. Her main research interests include evolutionary genomics.
Michael J. Dong is a PhD candidate at the Department of Genetics, Development and Cell Biology, Iowa State University. His main research interests include evolutionary genomics.
Shuqing Chen is a professor at the College of Pharmaceutical Sciences, Zhejiang University. His main research interests include biopharmaceuticals and cancer precision medicine.
Jian Wu is a professor at the College of Computer Science and Technology & School of Medicine, Zhejiang University. His main research interests include artificial intelligence in medicine.
Zhan Zhou is an associate professor at the College of Pharmaceutical Sciences, Innovation Institute for Artificial Intelligence in Medicine, Alibaba-Zhejiang University Joint Research Center of Future Digital Healthcare, Zhejiang University. His main research interests include the development of bioinformatics tools and databases and cancer genomics.
Xun Gu is a professor at the Department of Genetics, Development and Cell Biology, Iowa State University. His main research interests include evolutionary genomics and molecular evolution.
References
Author notes
Wenyi Zhao and Jingwen Yang authors contributed equally to this work.