Abstract

Current cancer genomics databases have accumulated millions of somatic mutations that remain to be further explored. Due to the over-excess mutations unrelated to cancer, the great challenge is to identify somatic mutations that are cancer-driven. Under the notion that carcinogenesis is a form of somatic-cell evolution, we developed a two-component mixture model: while the ground component corresponds to passenger mutations, the rapidly evolving component corresponds to driver mutations. Then, we implemented an empirical Bayesian procedure to calculate the posterior probability of a site being cancer-driven. Based on these, we developed a software CanDriS (Cancer Driver Sites) to profile the potential cancer-driving sites for thousands of tumor samples from the Cancer Genome Atlas and International Cancer Genome Consortium across tumor types and pan-cancer level. As a result, we identified that approximately 1% of the sites have posterior probabilities larger than 0.90 and listed potential cancer-wide and cancer-specific driver mutations. By comprehensively profiling all potential cancer-driving sites, CanDriS greatly enhances our ability to refine our knowledge of the genetic basis of cancer and might guide clinical medication in the upcoming era of precision medicine. The results were displayed in a database CandrisDB (http://biopharm.zju.edu.cn/candrisdb/).

INTRODUCTION

Cancer is driven by somatic mutations that accumulate in the genome over an individual’s lifetime [1–3]. Thanks to the advances of next-generation sequencing (NGS) technologies in cancer genomics, the Cancer Genome Atlas (TCGA) has accumulated millions of somatic mutations in cancer from over 10 000 tumor-normal pairs. With the massive amounts of cancer genomic data, researchers have the opportunity to gain the somatic mutation landscape of more than 30 tumor types to better understand cancer biology and improve cancer diagnosis and therapy [4]. In the past decade, various computational tools have been developed for predicting cancer driver genes that contribute to malignant initiation or progression, based on mutation frequency, features [such as functional impact, structure (domain), and network/pathway], evolutionary selective pressure and multi-omics integration [5, 6]. Using these tools, lists of cancer driver genes have been generated from various studies. So far, over 1000 genes have been identified as drivers to carcinogenesis according to the data collected from the Cancer Gene Census (CGC) of the Catalogue of Somatic Mutations in Cancer (COSMIC) database and various published researches [4, 7–10].

While the detailed lists of cancer driver genes are gradually increasing, it is not always straightforward to identify which mutations in those genes might be drivers. With the upcoming precision medicine era, this will become an increasingly important question. However, not all mutations occurring in cancer driver genes are driver mutations, nor need every driver mutation to occur in cancer driver genes [11]. Driver mutations are mutations that confer a growth advantage on the cells, increase survival or proliferation and have been positively selected during the evolution of cancer, whereas passenger mutations do not have these features. Due to the overwhelming number of passenger somatic mutations that are unrelated to cancer, a great challenge is how to identify genetic and/or epigenetic events with functional implications for carcinogenesis [11–13]. To tackle this important task, several computational tools have been designed, which can be classified into five categories based on their major characteristics: (1) sequence-based, (2) functional impact-based, (3) structure-based, (4) network or pathway-based and (5) regulatory-based [5, 6, 14, 15]. Besides, there are also some machine learning- and deep learning-based methods [16–18]. And, the lack of a gold standard for evaluating the performance of different algorithms is a persistent problem. A recent study proposed complementary benchmark datasets and used them to comprehensively assess the performance of different algorithms in predicting cancer driver mutations [19]. These datasets provide good standards for evaluating algorithms.

To this end, here we report a new method (CanDriS V1.0) to profile amino acid sites that are potentially cancer-driving (so-called cancer-driving sites), focusing on ‘missense’ somatic mutations in the coding region of a gene. The notion that carcinogenesis is a form of evolution at the level of somatic cells suggests that our understanding of cancer initiation and progression can benefit from molecular evolutionary approaches. The driver sites under the positive selection during the tumor initialization and proliferation are supposed to have high rates of cancer evolution. Therefore, we model driver sites as the rapidly evolving component and passenger sites as the slowly evolving component. Since slowly evolving sites and rapidly evolving sites are usually unknown, a mixture model was then implemented. Then, we implemented an empirical Bayesian procedure to calculate the posterior probability of a site being cancer-driving for all sites in a gene and to make the significant one stands out. One may choose a posterior probability cutoff (for instance, 0.50, 0.90, 0.99, etc.) to filter the results. In the meantime, we applied three complementary benchmark datasets to comprehensively assess the performance of our method comparing it with other 33 algorithms, and the results showed that CanDriS had a consistently promising performance. The source code of CanDriS is available on GitHub (https://github.com/jiujiezz/candris), and the profiling results of PanCancer and 33 tumor types are displayed in database CandrisDB, which is freely available for all academic users at http://biopharm.zju.edu.cn/candrisdb/.

RESULTS

The computational procedure of software CanDriS

We developed a practically feasible analytical procedure CanDriS V1.0, which includes the following components (Figure 1A):

  • (i) Input the counts of somatic mutations over sites of a gene that can be collected from genomics databases by the user.

  • (ii) Estimate the unknown parameters under the two-component mixture model. Because the empirical distribution of cancer somatic mutations over sites of a cancer gene is highly skewed, we implement a statistically reliable procedure to estimate the unknown parameters.

  • (iii) Calculate the posterior probability of each amino acid site (Q) being cancer-driving with the number of somatic missense mutations at a site (z).

Posterior profiling of cancer-driving sites for TCGA at the pan-cancer level. (A) An illustration of CanDriS workflow. (B–D) The distribution of estimated parameters (η, m0 and m1) of 4047 genes. (E) Histogram of the posterior probability Qz for all sites of 4047 genes. (F) Relationship between Qz and z for all sites of 4047 genes.
Figure 1

Posterior profiling of cancer-driving sites for TCGA at the pan-cancer level. (A) An illustration of CanDriS workflow. (BD) The distribution of estimated parameters (η, m0 and m1) of 4047 genes. (E) Histogram of the posterior probability Qz for all sites of 4047 genes. (F) Relationship between Qz and z for all sites of 4047 genes.

Posterior profiling of cancer-driving sites in PanCancer analysis

We applied our computational procedure CanDriS to analyze cancer somatic mutations of 9078 tumor-normal pairs across 33 tumor types from the TCGA PanCanAtlas MC3 project (Supplementary Table S1, see Supplementary Data available online at http://bib.oxfordjournals.org/). In PanCancer analysis, genes that have no amino acid site with at least three recurrent missense mutations were excluded from further analysis; therefore, we have estimated the parameters m0, m1 and η for 4047 genes (m0, m1, the recurrent rate of somatic mutations at a passenger or driver site, respectively; η, the probability of any site being a driver in a gene). First, the estimated proportion (η) of cancer-driving sites in a gene is, on average, very small. Roughly speaking, only less than 1% of amino acid sites are potentially cancer-driving. Second, the number of somatic mutations per driver site m1 (mean ± s.e. = 3.204 ± 0.040) is about 31-fold greater than that per passenger site m0 (mean ± s.e. = 0.102 ± 0.001), indicating that recurrent somatic mutations at driver sites are much more frequent than that at passenger sites (Figure 1BD). For each gene, we apply the empirical Bayesian procedure to calculate the posterior probability (Q-value) of any kth site being cancer-driving according to Equation (2); see the histogram of Q-values in Figure 1E. 715 370 amino acid sites have at least one somatic mutation among which 7327 sites (1.02%) have over a 0.50 posterior probability, 3201 (0.45%) sites have over a 0.90 posterior probability, and 1228 (0.17%) sites have over a 0.99 posterior probability, respectively (Supplementary Table S2, see Supplementary Data available online at http://bib.oxfordjournals.org/). The overall relationship between Q and z (the count of somatic mutations at a site) is shown in Figure 1F. In short, our analyses support the notion that among numerous somatic mutations appearing in cancer, only a very small portion of them may drive the process of carcinogenesis.

As a case in point, we found tumor suppressor gene PTEN contains many somatic mutations in the TCGA dataset, more than 40% of which are missense mutations. PTEN encodes a phosphatidylinositol-3,4,5-trisphosphate 3-phosphatase, which is an important signaling molecule in the PI3K signaling pathway and is often dysregulated in human cancers [20]. Among the 403 amino acid sites of PTEN, 122 sites have at least one somatic missense mutation, but of which, 59 sites are only mutated once (Figure 2C). We estimated that m0 = 0.361 and m1 = 22.8, respectively, and the proportion of cancer-driving sites η = 0.0215. Hence, on average, the number of cancer-driving sites of PTEN is about 403 × 0.0215 ≈ 8. Then, we analyzed the profile of the posterior probability of being a cancer-driving site by plotting it against the human PTEN amino acid sites. It shows that eight sites with at least seven recurrent mutations (z ≥ 7) have a high (Qz ≥ 0.90) posterior probability to be drivers. In contrast, sites with a few somatic mutations (z ≤ 5) are highly unlikely to be drivers. It is important to note that, one should be cautious for those sites with z = 6 because the posterior profiling may be highly sampling-dependent (Figure 2D). Oncogene PIK3CA, encoding the catalytic subunit of phosphatidylinositol 3-kinase (PI3K) which then activates the downstream effectors including pAKT and mTOR, is recurrently mutated in breast invasive carcinoma (BRCA) (312/779, 40.05%) and uterine corpus endometrial carcinoma (UCEC) (221/396, 55.81%). Among the 1068 amino acid sites, 17 out of 127 mutation sites have at least 10 recurrent mutations, and their posterior probabilities to be drivers are approximately equal to 1 (Figure 2G). Some mutations occurring in these sites are well-known oncogenic ones, such as E542K, E545K and H1047R, and also have shown a possibility of therapeutic implications from clinical trials, published clinical evidence and some literature publications. For instance, some clinical trials indicate that PIK3CA H1047R could be a druggable target in colon adenocarcinomas (COAD) and rectum adenocarcinomas (READ) for its sensitive drug response and effect to PI3K inhibitors, AKT inhibitors and mTOR inhibitors [21]. Moreover, PIK3CA E545K and E542K also show a sensitive response to mTOR inhibitors in a preclinical study [22]. Besides, a study suggests that a secondarily acquired PIK3CA H1047R mutation in a gastrointestinal stromal tumor with a primary BRAF mutation may have contributed to resistance to dabrafenib, a reversible adenosine triphosphate (ATP)-competitive kinase inhibitor that targets the MAPK pathway [23].

The profiling analysis of tumor suppressor gene PTEN and oncogene PIK3CA based on CanDriS. (A, E) Lollipop plot of cancer somatic missense mutations in PTEN (340 missense mutations) and PIK3CA (1130 missense mutations). (B, F) The Qz values of amino acid sites in PTEN and PIK3CA. (C, G) The distribution of amino acid sites with different missense mutations (z). (D, H) The profile of the posterior probability of being a cancer-driving site plotting against the PTEN and PIK3CA amino acid sites.
Figure 2

The profiling analysis of tumor suppressor gene PTEN and oncogene PIK3CA based on CanDriS. (A, E) Lollipop plot of cancer somatic missense mutations in PTEN (340 missense mutations) and PIK3CA (1130 missense mutations). (B, F) The Qz values of amino acid sites in PTEN and PIK3CA. (C, G) The distribution of amino acid sites with different missense mutations (z). (D, H) The profile of the posterior probability of being a cancer-driving site plotting against the PTEN and PIK3CA amino acid sites.

Identification of driver mutations across tumor types

We further applied CanDriS in 33 different tumor types and obtained posterior profiling of cancer-driving sites across tumor types. In almost all tumor types [except stomach adenocarcinoma (STAD) and skin cutaneous melanoma (SKCM)], more than 99% of mutated sites are considered as passengers as we can see in Figure 3. SKCM has the largest number of over 1000 driver sites (Qz ≥ 0.90). The eight tumor types with less than 2000 mutated sites [adrenocortical carcinoma (ACC), cholangiocarcinoma (CHOL), kidney chromophobe (KICH), mesothelioma (MESO), pheochromocytoma and paraganglioma (PCPG), testicular germ cell tumors (TGCT), thymoma (THYM) and uveal melanoma (UVM)] tend to have few, if any, driver sites. These cancer types either have a very small sample size or few mutations.

The number of predicted driver sites across 33 tumor types from TCGA under different posterior probabilities cutoff. The table on the left shows the number of mutated sites for the 33 tumor types under different posterior probability cutoffs (Qz < 0.5, 0.5 ≦ Qz < 0.75, 0.7 ≦ Qz < 0.9, Qz ≧ 0.9). The bar plot on the right shows the proportion of four types of driver sites under different Qz cutoffs (different colors represent the corresponding types).
Figure 3

The number of predicted driver sites across 33 tumor types from TCGA under different posterior probabilities cutoff. The table on the left shows the number of mutated sites for the 33 tumor types under different posterior probability cutoffs (Qz < 0.5, 0.5 ≦ Qz < 0.75, 0.7 ≦ Qz < 0.9, Qz ≧ 0.9). The bar plot on the right shows the proportion of four types of driver sites under different Qz cutoffs (different colors represent the corresponding types).

Then, we calculated the posterior probability of all mutations occurring in cancer-driving sites and identified the potential driver mutations. The result shows that only a small group of mutations are drivers in more than 10 tumor types (cancer-wide drivers), all of which occur in well-known cancer genes (TP53, KRAS and PIK3CA). While most mutations act as drivers in very few tumor types (cancer-specific drivers), among which we found some interesting mutations (Figure 4). For example, general transcription factor II-I (GTF2I) is altered in almost half of all THYM by a single mutation L424H. GTF2I mediates the expression of various genes with importance in cell cycle control, stress response and genomic stability, including FOS, BRCA and STAT proteins [24], but has not been annotated as a driver gene in the COSMIC CGC yet. A recent study found that GTF2I L424H induced cell transformation and metabolic alterations in thymic epithelial cells [25]. Though GTF2I L424H occurred at high frequency in thymic epithelial tumors, strikingly prevalent in the A and AB histotypes, the frequency of mutation progressively decreased in the more aggressive histological types. Therefore, it can be a biomarker of prognosis [26]. IDH1 R132H is the most frequent mutation (>70%) in brain lower-grade glioma (LGG) and acts as a driver both in brain LGG and glioblastoma multiforme (GBM). Dang et al. performed structural studies demonstrating that IDH1 R132H results in the production of the oncometabolite 2-hydroxyglutarate (2HG) which contributes to the formation and malignant progression of gliomas [27]. Oncogenes GNAQ and GNA11 are both members of the G(q) subfamily. Their mutations predominantly cluster in exon 5 (affecting Q209), which is a cancer-driving site based on our analysis. The major types of mutation in GNAQ are Q209L and Q209P; the former is a cancer-specific driver in UVM, and the latter acts as a driver in both UVM and SKCM. GNA11 is a paralog of GNAQ. There was a mutually exclusive pattern in these two genes, and approximately, all UVM tumors harbor these mutations [28]. GNA11 Q209L is another driver in both UVM and SKCM. A study indicated that GNA11 Q209L could induce spontaneously metastasizing tumors in a mouse model, and the constitutive activation of the pathway involving GNA11 and GNAQ appeared to be a major contributor to the development of UVM [28].

The landscape of driver mutations across tumor types. Each driver mutation is represented as a single dot in the scatterplot. The size of the dot is corresponding to the max prevalence of driver mutations in tumor type. The orange represents cancer-specific drivers which are identified as drivers only in very few tumor types, and the red represents cancer-wide drivers in more than 10 tumor types. Only the driver mutations that occurred at high frequency (over 10%) in the corresponding tumor type are denoted by their abbreviations.
Figure 4

The landscape of driver mutations across tumor types. Each driver mutation is represented as a single dot in the scatterplot. The size of the dot is corresponding to the max prevalence of driver mutations in tumor type. The orange represents cancer-specific drivers which are identified as drivers only in very few tumor types, and the red represents cancer-wide drivers in more than 10 tumor types. Only the driver mutations that occurred at high frequency (over 10%) in the corresponding tumor type are denoted by their abbreviations.

Considering the cancer heterogeneity for multi-types of cancer, we also applied the CanDriS for International Cancer Genome Consortium (ICGC) data at the pan-cancer and tumor-type level. And, we analyzed the cancer-wide and cancer-specific driver sites, which had posterior probabilities larger than 0.90 for TCGA and ICGC, respectively, and compared the results. We found that the posterior probabilities of cancer-wide driver sites in TCGA (n = 14) are all greater than 0.90 at the pan-cancer level of ICGC and tended to be drivers in more different tumor types of ICGC. And, due to tumor heterogeneity, most cancer-specific driver sites (n = 979) have a posterior probability of less than 0.90, and more sites (n = 1014) are not even identified as drivers for a single tumor type (Supplementary Tables S3S5, see Supplementary Data available online at http://bib.oxfordjournals.org/).

CanDriS has a consistently good performance in the benchmark analysis

We then evaluated the performance of CanDriS by comparing it with other 33 algorithms on three benchmark datasets: literature annotation based on OncoKB [29], functional annotation based on in vitro cell viability assays [12] and effects of cancer mutations on tumor formation in xenograft experiment [30]. Three benchmark datasets for comparison represent different features distinguishing driver mutations from over-excess passenger mutations in cancer genomes and are highly complementary to each other [19]. The positive (driver) and negative (passenger) mutations were predefined according to the functional annotation in each benchmark dataset (Figure 5A).

The performance of CanDriS based on the benchmark analysis. (A) The pie plot showing the overlapping summary of negative and positive cases in three benchmark datasets. (B) A heatmap showing the ranks of the CanDriS and other 33 algorithms based on each benchmark dataset, indicated by color (green, in vitro cell viability benchmark; purple, in vivo tumor formation benchmark; and orange, OncoKB annotation benchmark). Ranks are labeled for the top 10 algorithms only. Red, higher ranks; and white, lower ranks. (C–E) The ROC curves of CanDriS and other 33 algorithms in three benchmark datasets (in vivo tumor formation benchmark, in vitro cell viability benchmark and OncoKB annotation benchmark, respectively).
Figure 5

The performance of CanDriS based on the benchmark analysis. (A) The pie plot showing the overlapping summary of negative and positive cases in three benchmark datasets. (B) A heatmap showing the ranks of the CanDriS and other 33 algorithms based on each benchmark dataset, indicated by color (green, in vitro cell viability benchmark; purple, in vivo tumor formation benchmark; and orange, OncoKB annotation benchmark). Ranks are labeled for the top 10 algorithms only. Red, higher ranks; and white, lower ranks. (CE) The ROC curves of CanDriS and other 33 algorithms in three benchmark datasets (in vivo tumor formation benchmark, in vitro cell viability benchmark and OncoKB annotation benchmark, respectively).

To assess the predictive performance of our method, we employed area under the curve (AUC) of receiver operating characteristics (ROC) curves based on numeric scores generated from each algorithm. In three benchmark datasets, CanDriS had a promising performance, especially in the in vivo tumor formation benchmark, achieved the highest AUC score of 83.1%, and was the only method with an AUC score of >80% (Figure 5C). And CanDriS ranked sixth in the in vitro cell viability benchmark (Figure 5D), with an AUC score of 68.1%. The top three algorithms by the AUC score were CTAT-cancer (AUC = 74.0%), CHASM (AUC = 73.6%) and CanDrA (AUC = 70.2%) in this benchmark, only ranked top 10 in one or two benchmark datasets, and showed an inconsistent performance. Based on OncoKB annotation, CanDriS ranked ninth with an AUC score of 64.4%, and in the meanwhile, the best performing method VEST4 had approximately 70% (AUC = 68.7%) (Figure 5E). A possible reason for the slightly poorer performance in OncoKB may be intrinsic to the data collected from OncoKB, with only slightly more than half of cancer genes (n = 145, 51.42%) overlapping with our consensus driver gene list. In general, CanDriS was one of the only two methods that achieved top-performance (the top 10 algorithms highlighted for each benchmark) and showed an overall good consistency in each benchmark dataset despite the diversity of three benchmark datasets (Figure 5E). The benchmark analysis showed that CanDriS had a consistently good performance.

Cancer-driving site profiling database CandrisDB

To facilitate the utilization of our analysis results, we developed a user-friendly, interactive and open-access web portal, CandrisDB V1.0 (cancer-driving site profiling database), which is freely available for all academic users (Figure 6). We merged the prediction results of various algorithms based on different principles along with some previous PanCancer publications [3, 4, 7–10] to generate a consensus driver gene list. Then, we adopted the results of CanDriS to posterior profiling driver mutation sites for thousands of tumor samples from both TCGA and ICGC at the pan-cancer- and tumor-type level. Furthermore, we annotated all driver mutation sites with functional and pharmacogenomics data collected from other public databases, which include clinical and therapeutic information that might be of interest to clinicians. CandrisDB displays the putative cancer-driving sites in driver genes, which helps to understand the mechanisms of cancer evolution and progression. Besides, with the prospective clinical sequencing of cancer being widely used in cancer diagnosis and treatment, CandrisDB provides a list of potential targets for cancer therapy and guidance for clinical medication. For convenient searching, we only displayed 7327 (TCGA)/6681 (ICGC) potential driver mutation sites whose posterior probability (Qz) is greater than 0.50, as well as 2961 potential driver genes in our database CandrisDB. Moreover, we also presented the CanDriS profiling results for different tumor types on the ‘Tumor type’ page. Besides, the webserver of CanDriS is available on the ‘CanDriS’ page, which helps users easily predict cancer driver mutations.

The architecture of CandrisDB.
Figure 6

The architecture of CandrisDB.

DISCUSSION

Recently, developing new therapies or drugs that target somatic mutations in putative driver genes has become a hot spot. For example, there is a small-molecule inhibitor targeting BRAF V600E mutations called vemurafenib, which is effective in shrinking tumors [31]. Also, neoantigens generated from somatic mutations have attracted lots of attention as ideal immunotherapy targets [32–35]. Thus, identifying cancer-driving mutations is a key step in understanding the molecular mechanism of cancer development, which helps to generate a list of potentially druggable mutations and to optimize medication prescribing in the upcoming era of precision medicine.

In cancer genomics, the number of recurrent somatic mutations at the same amino acid site is the primary indicator for predicting a particular gene to be carcinogenesis-related. Therefore, one may intuitively predict the cancer-driving sites of a cancer-related gene if the counts of somatic mutations at those sites are above a pre-specified cutoff, but this kind of approach is highly sample-dependent and lacks statistical justification. In this study, to overcome those drawbacks, we developed a new method called CanDriS by implementing a two-component evolutionary model and by calculating the posterior probability of a site being a potential driver while given the observed number of mutations. Our approach has two advantages: first, it takes advantage of the large-scale cancer genomic data without strong gene- or site-specific sampling bias; second, it avoids the cutoff arbitrary, that is, how many recurrent somatic mutations are sufficient for a site to be predicted as cancer-driving, and our method calculates posterior probabilities for sites. And through the benchmark analysis, our method shows a consistently good performance. Using CanDriS, we profiled the potential cancer-driving sites for TCGA and ICGC at the pan-cancer- and tumor-type level, and only approximately 1% of the sites have posterior probabilities larger than 0.90, which indicates that most sites in a gene are passengers (Supplementary Tables S3 and S4, see Supplementary Data available online at http://bib.oxfordjournals.org/). The potential cancer-wide and cancer-specific driver mutations were shown on the database CandrisDB and were annotated with functional and pharmacogenomics information.

Our method potentially has broad applications but may compromise prediction sensitivity. There are still several biological covariates, which may have profound influences on the prediction accuracy and sensitivity. At the sample level, they include cancer stage (primary or metastasis), clinical severity, type of cancers, as well as medical treatments. At the genome level, the major factor could be mutational heterogeneity [7, 36, 37]. We recognized that some assumptions we made in our analyses may be biologically oversimplified. By computer simulations, we found that the posterior probability for a cancer-driving site calculated under the two-component Poisson model tends to be overestimated around the turning point of z in Equation (9) when somatic mutations are over-dispersed. In future studies, we propose two approaches to address this issue. The first one is to improve the computational pipeline so that it has more options and flexibility to treat different cancer samples, for example, primary or metastasis, cancer tissues, or medical treatments. The second one is to extend the two-component mixture model to some more sophisticated models such as the negative binomial distribution (NBD) model. For instance, we implement a Poisson–NBD mixture model, that is, the number of somatic mutations at passenger sites follows a Poisson model, whereas that at driver sites follows an NBD model. It appears that the posterior profiling converges to virtually the same between the Poisson–Poison model and the Poisson–NBD model when the posterior probabilities are above 0.75 or higher. With these caveats in mind, we conclude that the two-component Poisson model can be used as the first statistical attempt toward cancer-driving site profiling.

METHODS

The two-component mixture model for cancer somatic mutations

In spite that only very few amino acid sites of a cancer gene could be cancer-driving, each driver site may have a much higher incident rate of cancer evolution than that of passenger sites. Based on this fact, we developed a two-component mixture model to distinguish between these two types of sites. To be concise, we assume that each site is either a driver or passenger: the probability of any site being a driver is η, whereas that of being a passenger is 1 − η. Let z be the number of somatic missense mutations (i.e. causing amino acid changes) at a site. It follows that the distribution of z of a gene under study is given by
(1)
where P0(z) is the distribution of z at a passenger site, and P1(z) is that at a driver site.
Somatic mutations at passenger sites are neutral or nearly neutral during cancer progression [38–40]. Without loss of generality, we assume that P0(z) follows a Poisson model, characterized by m0, the recurrent rate of somatic mutations at a passenger site, that is,
(2)
On the other hand, cancer somatic mutations occurring at driver sites are highly recurrent, subject to positive selection pressures in cancer progression that may differ among genes and sites [41–50]. We also assume the number of somatic mutations at a driver site follows a Poisson distribution, and P1(z) follows a Poisson model given by
(3)
which is characterized by the mean m1, the recurrent rate of somatic mutations at a driver site. With these specifications, we call Equation (1) the Poisson–Poisson mixture model.

Empirical Bayesian profiling of cancer-driving sites

Our goal is to profile cancer-driving sites based on high-throughput cancer somatic mutations. To this end, we have to calculate P(driver|z), the posterior probability of an amino acid site being a cancer-driver when the number (z) of somatic mutations is observed at this site. For a given gene, we denote π(driver) = η to be the prior probability of an amino acid site being a driver. The Bayesian rule claims that
(4)

Interestingly, under the two-component mixture model, the relationship between P(driver|z) and z is logistic-like; the baseline of P(driver|z) is usually close to zero by allowing z = 0, and a higher posterior probability value indicates higher statistical confidence for the site being cancer-driving. As z at a site increases, the chance of this site being a driver increases toward 100% ultimately. In practice, one may choose a posterior probability cutoff, say, 0.50 such that a site is predicted to be cancer-driving if it receives a posterior probability of more than 0.50. This cutoff is too liberal which may result in a high rate of false positives. On the other hand, a stricter cutoff can be chosen as high as 0.90.

Implementation of parameter estimations

The current version of our method implements an empirical Bayesian approach to calculating Equation (4). The first step is to estimate unknown model parameters. From TCGA, it is straightforward to count the number of somatic mutations at each site of a gene under study and then obtain the empirical frequency of somatic mutations, for example, f0 is the proportion of sites with zero somatic mutation, f1 is that of sites with single somatic mutation and so forth. Here, we develop an efficient procedure to estimate unknown model parameters, which can avoid some technical issues usually caused by the small sampling problem.

High-throughput cancer genomics revealed that, for almost all cancer genes, only a very small portion of amino acid sites are cancer-driving. This rare cancer-driving site assumption, that is, η << 1, m0 < 1 and m1 > 1. As f0 is expected to be f(z = 0), from Equation (1) we have
(5)
Under the assumptions that m1 >> m0 and η is small, one can use the first-order approximation to estimate m0, that is,
(6)
Given m0, parameters (m1 and η) can be estimated by either the maximum likelihood method or the simple method of moments. The mean and variance of z are calculated directly from the data. After treating m0 as known, we obtain
(7)
and
(8)
respectively. We denote Qk = P(driver|zk) for the empirical posterior probability for the kth site of a gene. As shown in Equation (4), one can derive
(9)

Further, we suggest that this empirical Bayesian approach is biologically meaningful only when the statistical testing shows that the two-component mixture model is statistically significantly better than any one-component model, which can be formulated by the likelihood ratio test (LRT) against the null hypothesis of η = 0 versus the alternative of η > 0. The calculated log-LRT value approximately follows a chi-square distribution with the degrees of freedom df = 2. In the case of PTEN, for instance, the null hypothesis of η = 0 was statistically highly rejected (LRT test, P < 10−22).

Some technical issues: cancer-related genes may have highly recurrent somatic mutations at very few sites (hotspots), for example, the 600th codon of BRAF, the highest recurrent mutation site, has received 545 recurrent mutations, much higher than the second site (469th codon) with 16 mutations. In our dataset, there are six well-known cancer-related genes (including oncogenes KRAS, BRAF, IDH1, PIK3CA, NRAS and tumor suppressor gene TP53) each of which has at least one site with more than 100 recurrent somatic mutations. While those highly recurrent mutations are cancer-related, they may cause some statistical biases in our prediction (not shown). Therefore, we excluded any amino acid site with z ≥ 100 in our PanCancer analysis and assigned the posterior probability of being a cancer-driving site Q (z ≥ 100) = 1.

Data sources for CanDriS

Cancer somatic mutation data we used in our study were from TCGA and ICGC. Somatic mutations across 33 tumor types and 10 224 cancer donors were obtained from the TCGA PanCanAtlas MC3 project (https://gdc.cancer.gov/about-data/publications/mc3-2017), including 3 517 790 somatic mutations in the coding regions, 2 035 693 of which caused amino acid changes in cancers (missense mutations). The mutation data from 35 tumor types of ICGC are publicly available from ICGC Release 25 (https://dcc.icgc.org/releases). After filtered out some redundancies, 9078 samples with 928 012 missense mutations from TCGA and 6970 samples with 423 047 missense mutations from ICGC were used in our study (Supplementary Table S1, see Supplementary Data available online at http://bib.oxfordjournals.org/). Sequences and annotations of human genes were extracted from the Ensembl database (http://www.ensembl.org, GRCh37, Release 75) [51]. For each gene, the transcript which is most commonly used in TCGA was chosen.

Data sources for CandrisDB

The list of over 2000 potential cancer driver genes was generated from the combined result of multiple driver gene prediction algorithms based on different principles. And, we also annotated the driver genes based on some driver gene lists from different sources [3, 4, 7–10]. There are 15 algorithms included in our analysis, which can be categorized into five types based on their major features: (1) frequency-based (MuSiC [52], MutSig2CV [8] and OncodriveCLUST [53]), (2) feature-based, for example, functional impact (20/20+ [54], OncodriveFML [55] and CompositeDriver [4]), (3) structural (domain)-based (e-Driver [56], ActiveDriver [57]), (4) evolutionary selective pressure-based (dNdScv [44], CBaSE [58], CN/CS calculator [50]) and (5) network- or pathway-based (DriverNet [59], OncoIMPACT [60], PNC [15]). We also included the result of a machine learning-based method DriverML [18]. And, we linked the driver genes to cancer hallmarks based on CancerGeneNet [61].

The potential list of cancer-driving sites was generated by CanDriS, and the functional and pharmacogenomics annotation of these sites were collected from several public databases and publications. The biologic and oncogenic effects of mutations were collected from OncoKB. And, the prognostic and clinical effects of mutations were collected from published papers, including different levels of evidence (e.g. preclinical, clinical trials, case reports, FDA-approved, etc.), which can support whether a specific mutation can be considered druggable and associated with known drug responses and effects in the listed cancer types. Besides, the entries about mutation–drug interaction were collected from some public databases [e.g. personalized cancer therapy (PCT) [62], DrugBank [63]], which were also based on different levels of evidence that tumors harboring a specific mutation respond to certain drugs. Moreover, the drug-resistant mutations were collected from published resources, such as TTD [64], COSMIC and DEPO [65], which were identified in the literature as resistance mutation, including mutations conferring acquired resistance (after treatment) and intrinsic resistance (before treatment).

Benchmark analysis for CanDriS

Seventy-one somatic mutations with their oncogenicity annotation based on in vivo tumor formation assays were downloaded from the study by Kim et al. [30]. In the analysis, 45 mutations labeled as ‘functional’ were used as positive cases and 26 other mutations were used as negative cases. In vitro cell viability data of missense mutations were collected from FASMIC v1.6 http://bioinformatics.mdanderson.org/main/FASMIC. Overall, 828 mutations were included for analysis, among which, 361 mutations annotated as activating, inactivating, inhibitory, or non-inhibitory in consensus functional call were considered as positive cases, while neutral mutations were considered as negative cases. OncoKB annotations were downloaded from OncoKB v2.7 (https://www.oncokb.org). This version contained 3246 missense mutations with functional annotations (such as oncogenic, likely oncogenic and likely neutral). The oncogenic and likely oncogenic mutations were labeled as positive cases and the likely neutral mutations as negative cases.

The 33 algorithms we used for comparison employ various information as features to build predictive models, which can be grouped into six main categories: sequence context, protein feature, conservation, epigenomic information, ensemble score and cancer-specific [19]. Prediction scores of most algorithms were extracted from dbNSFP V4.1 [66]. Otherwise, CanDrA [67] scores were calculated by using the ‘cancer-in-general’ scores with CanDrA v+. CHASM v3.1 [68] scores were retrieved from the CRAVAT web server (v5.2.4) (http://www.cravat.us/CRAVAT/). CTAT-cancer and CTAT-population scores were calculated by performing principal component analysis in R, as described in the original paper [4]. FATHMM-cancer [69] scores were retrieved from http://fathmm.biocompute.org.uk/cancer.html. TransFIC [70] scores were calculated by a Perl script downloaded from http://bbglab.irbbarcelona.org/transfic/home. Missing values were omitted.

For each benchmark dataset, the ROC curves and AUC scores were obtained by using the R function ‘roc’ provided in the pROC package.

Key Points
  • We developed a two-component evolutionary model named CanDriS to estimate the proportion of cancer-driving sites of a gene, and estimate the posterior profiling of cancer-driving sites by an empirical Bayesian procedure.

  • Compared with 33 tools in three benchmark datasets, CanDriS has a consistently good performance.

  • We comprehensively profile all potential cancer-driving sites for thousands of tumor samples from TCGA and ICGC at the pan-cancer- and tumor-type level, displaying the results in a database CandrisDB (http://biopharm.zju.edu.cn/candrisdb/).

  • We also provided annotation for these potential driver mutations with functional and pharmacogenomics information collected from published papers and public resources.

Data availability

The source codes of CanDriS and the original data are available on GitHub (https://github.com/jiujiezz/candris). CandrisDB is freely available for all academic users at http://biopharm.zju.edu.cn/candrisdb/.

Authors’ contributions

X.G. and Z.Z. designed and supervised this study. X.G. conceived the method. Z.Z., W.Z. and J.W. wrote the programs. W.Z., J.Y., G.C., Y.Z., J.W., J.H., W.S. and Z.Z. performed the data analysis. X.G., Z.Z., W.Z. and J.Y. wrote the manuscript. M.J.D., S.C. and Jian.W. revised the manuscript. All authors read and approved the final manuscript.

Acknowledgments

We gratefully acknowledge the clinical contributors and data producers from the TCGA Research Network for referencing the TCGA dataset.

Funding

This work was supported by grants from Key R&D Program of Zhejiang Province (grant no. 2020C03010); National Natural Science Foundation of China (grant no. 31971371); Zhejiang Provincial Natural Science Foundation of China (grant no. LY19H300003); Fundamental Research Funds for the Central Universities; Alibaba-Zhejiang University Joint Research Center of Future Digital Healthcare.

Conflict of interest

The authors declare no conflict of interests.

Wenyi Zhao is a PhD candidate at the College of Pharmaceutical Sciences & College of Computer Science and Technology, Zhejiang University. Her main research interests include the development of bioinformatics tools and databases and cancer genomics.

Jingwen Yang is a postdoc fellow at the MOE Key Laboratory of Contemporary Anthropology, Human Phenome Institute, School of Life Sciences, Fudan University. Her main research interests include the development of bioinformatics tools and evolutionary genomics.

Jingcheng Wu is a PhD candidate at the College of Pharmaceutical Sciences, Zhejiang University. His main research interests include the development of bioinformatics tools and databases and cancer genomics.

Guoxing Cai is currently a master’s student at the College of Pharmaceutical Sciences, Zhejiang University. His main research interests include the development of bioinformatics tools and databases and cancer immunogenomics.

Yao Zhang is an undergraduate student at the College of Pharmaceutical Sciences, Zhejiang University. His main research interests include the development of bioinformatics tools and cancer genomics.

Jeffrey Haltom is a PhD candidate at the Department of Genetics, Development and Cell Biology, Iowa State University. His main research interests include evolutionary genomics and cancer genomics.

Weijia Su is a PhD candidate at the Department of Genetics, Development and Cell Biology, Iowa State University. Her main research interests include evolutionary genomics.

Michael J. Dong is a PhD candidate at the Department of Genetics, Development and Cell Biology, Iowa State University. His main research interests include evolutionary genomics.

Shuqing Chen is a professor at the College of Pharmaceutical Sciences, Zhejiang University. His main research interests include biopharmaceuticals and cancer precision medicine.

Jian Wu is a professor at the College of Computer Science and Technology & School of Medicine, Zhejiang University. His main research interests include artificial intelligence in medicine.

Zhan Zhou is an associate professor at the College of Pharmaceutical Sciences, Innovation Institute for Artificial Intelligence in Medicine, Alibaba-Zhejiang University Joint Research Center of Future Digital Healthcare, Zhejiang University. His main research interests include the development of bioinformatics tools and databases and cancer genomics.

Xun Gu is a professor at the Department of Genetics, Development and Cell Biology, Iowa State University. His main research interests include evolutionary genomics and molecular evolution.

References

1.

Stratton
MR
,
Campbell
PJ
,
Futreal
PA
.
The cancer genome
.
Nature
2009
;
458
:
719
24
.

2.

Martincorena
I
,
Campbell
PJ
.
Somatic mutation in cancer and normal cells
.
Science
2015
;
349
:
1483
9
.

3.

Vogelstein
B
,
Papadopoulos
N
,
Velculescu
VE
, et al.
Cancer genome landscapes
.
Science
2013
;
340
:
1546
58
.

4.

Bailey
MH
,
Tokheim
C
,
Porta-Pardo
E
, et al.
Comprehensive characterization of cancer driver genes and mutations
.
Cell
2018
;
173
:
371
85.e18
.

5.

Yi
S
,
Lin
S
,
Li
Y
, et al.
Functional variomics and network perturbation: connecting genotype to phenotype in cancer
.
Nat Rev Genet
2017
;
18
:
395
410
.

6.

Cheng
F
,
Zhao
J
,
Zhao
Z
.
Advances in computational approaches for prioritizing driver mutations and significantly mutated genes in cancer genomes
.
Brief Bioinform
2016
;
17
:
642
56
.

7.

Kandoth
C
,
McLellan
MD
,
Vandin
F
, et al.
Mutational landscape and significance across 12 major cancer types
.
Nature
2013
;
502
:
333
9
.

8.

Lawrence
MS
,
Stojanov
P
,
Mermel
CH
, et al.
Discovery and saturation analysis of cancer genes across 21 tumour types
.
Nature
2014
;
505
:
495
501
.

9.

Sondka
Z
,
Bamford
S
,
Cole
CG
, et al.
The COSMIC cancer gene census: describing genetic dysfunction across all human cancers
.
Nat Rev Cancer
2018
;
18
:
696
705
.

10.

Martínez-Jiménez
F
,
Muiños
F
,
Sentís
I
, et al.
A compendium of mutational cancer driver genes
.
Nat Rev Cancer
2020
;
20
:
1
18
.

11.

Chang
MT
,
Asthana
S
,
Gao
SP
, et al.
Identifying recurrent mutations in cancer reveals widespread lineage diversity and mutational specificity
.
Nat Biotechnol
2016
;
34
:
155
63
.

12.

Ng
PKS
,
Li
J
,
Jeong
KJ
, et al.
Systematic functional annotation of somatic mutations in cancer
.
Cancer Cell
2018
;
33
:
450
62.e10
.

13.

Tokheim
C
,
Karchin
R
.
CHASMplus reveals the scope of somatic missense mutations driving human cancers
.
Cell Syst
2019
;
9
:
9
23.e8
.

14.

Nussinov
R
,
Jang
H
,
Tsai
C-J
, et al.
Review: precision medicine and driver mutations: computational methods, functional assays and conformational principles for interpreting cancer drivers
.
PLoS Comput Biol
2019
;
15
:
e1006658
.

15.

Guo
W-F
,
Zhang
S-W
,
Zeng
T
, et al.
A novel network control model for identifying personalized driver genes in cancer
.
PLoS Comput Biol
2019
;
15
:
e1007520
.

16.

Sundaram
L
,
Gao
H
,
Padigepati
SR
, et al.
Predicting the clinical impact of human mutation with deep neural networks
.
Nat Genet
2018
;
50
:
1161
70
.

17.

Quang
D
,
Chen
Y
,
Xie
X
.
DANN: a deep learning approach for annotating the pathogenicity of genetic variants
.
Bioinformatics
2015
;
31
:
761
3
.

18.

Han
Y
,
Yang
J
,
Qian
X
, et al.
DriverML: a machine learning algorithm for identifying driver genes in cancer sequencing studies
.
Nucleic Acids Res
2019
;
47
:
e45
5
.

19.

Chen
H
,
Li
J
,
Wang
Y
, et al.
Comprehensive assessment of computational algorithms in predicting cancer driver mutations
.
Genome Biol
2020
;
21
:
43
.

20.

Sanchez-Vega
F
,
Mina
M
,
Armenia
J
, et al.
Oncogenic signaling pathways in the cancer genome atlas
.
Cell
2018
;
173
:
321
37.e10
.

21.

Janku
F
,
Wheler
JJ
,
Naing
A
, et al.
PIK3CA mutation H1047R is associated with response to PI3K/AKT/mTOR signaling pathway inhibitors in early-phase clinical trials
.
Cancer Res
2013
;
73
:
276
84
.

22.

Weigelt
B
,
Warne
PH
,
Downward
J
.
PIK3CA mutation, but not PTEN loss of function, determines the sensitivity of breast cancer cells to mTOR inhibitory drugs
.
Oncogene
2011
;
30
:
3222
33
.

23.

Falchook
GS
,
Trent
JC
,
Heinrich
MC
, et al.
BRAF mutant gastrointestinal stromal tumor: first report of regression with BRAF inhibitor dabrafenib (GSK2118436) and whole exomic sequencing for analysis of acquired resistance
.
Oncotarget
2013
;
4
:
310
5
.

24.

Tanikawa
M
,
Wada-Hiraike
O
,
Nakagawa
S
, et al.
Multifunctional transcription factor TFII-I is an activator of BRCA1 function
.
Br J Cancer
2011
;
104
:
1349
55
.

25.

Kim
I-K
,
Rao
G
,
Zhao
X
, et al.
Mutant GTF2I induces cell transformation and metabolic alterations in thymic epithelial cells
.
Cell Death Differ
2020
;
27
:
2263
79
.

26.

Petrini
I
,
Meltzer
PS
,
Kim
I-K
, et al.
A specific missense mutation in GTF2I occurs at high frequency in thymic epithelial tumors
.
Nat Genet
2014
;
46
:
844
9
.

27.

Dang
L
,
White
DW
,
Gross
S
, et al.
Cancer-associated IDH1 mutations produce 2-hydroxyglutarate
.
Nature
2009
;
462
:
739
44
.

28.

Van Raamsdonk
CD
,
Griewank
KG
,
Crosby
MB
, et al.
Mutations in GNA11 in uveal melanoma
.
N Engl J Med
2010
;
363
:
2191
9
.

29.

Chakravarty
D
,
Gao
J
,
Phillips
S
, et al.
OncoKB: a precision oncology knowledge base
.
JCO Precis Oncol
2017
;
1
:
PO.17.00011
. doi: .

30.

Kim
E
,
Ilic
N
,
Shrestha
Y
, et al.
Systematic functional interrogation of rare cancer variants identifies oncogenic alleles
.
Cancer Discov
2016
;
6
:
714
26
.

31.

Chapman
PB
,
Hauschild
A
,
Robert
C
, et al.
Improved survival with vemurafenib in melanoma with BRAF V600E mutation
.
N Engl J Med
2011
;
364
:
2507
16
.

32.

Zhao
W
,
Wu
J
,
Chen
S
, et al.
Shared neoantigens: ideal targets for off-the-shelf cancer immunotherapy
.
Pharmacogenomics
2020
;
21
:
637
45
.

33.

Zacharakis
N
,
Chinnasamy
H
,
Black
M
, et al.
Immune recognition of somatic mutations leading to complete durable regression in metastatic breast cancer
.
Nat Med
2018
;
24
:
724
30
.

34.

Keskin
DB
,
Anandappa
AJ
,
Sun
J
, et al.
Neoantigen vaccine generates intratumoral T cell responses in phase Ib glioblastoma trial
.
Nature
2019
;
565
:
234
9
.

35.

Hilf
N
,
Kuttruff-Coqui
S
,
Frenzel
K
, et al.
Actively personalized vaccination trial for newly diagnosed glioblastoma
.
Nature
2019
;
565
:
240
5
.

36.

Alexandrov
LB
,
Nik-Zainal
S
,
Wedge
DC
, et al.
Signatures of mutational processes in human cancer
.
Nature
2013
;
500
:
415
21
.

37.

Lawrence
MS
,
Stojanov
P
,
Polak
P
, et al.
Mutational heterogeneity in cancer and the search for new cancer-associated genes
.
Nature
2013
;
499
:
214
8
.

38.

Beckman
RA
,
Loeb
LA
.
Negative clonal selection in tumor evolution
.
Genetics
2005
;
171
:
2123
31
.

39.

Luo
B
,
Cheung
HW
,
Subramanian
A
, et al.
Highly parallel identification of essential genes in cancer cells
.
Proc Natl Acad Sci U S A
2008
;
105
:
20380
5
.

40.

McFarland
CD
,
Korolev
KS
,
Kryukov
GV
, et al.
Impact of deleterious passenger mutations on cancer progression
.
Proc Natl Acad Sci
2013
;
110
:
2910
5
.

41.

Crespi
BJ
,
Summers
K
.
Positive selection in the evolution of cancer
.
Biol Rev Camb Philos Soc
2006
;
81
:
407
24
.

42.

Greaves
M
,
Maley
CC
.
Clonal evolution in cancer
.
Nature
2012
;
481
:
306
13
.

43.

Hanahan
D
,
Weinberg
RA
.
Hallmarks of cancer: the next generation
.
Cell
2011
;
144
:
646
74
.

44.

Martincorena
I
,
Raine
KM
,
Gerstung
M
, et al.
Universal patterns of selection in cancer and somatic tissues
.
Cell
2017
;
171
:
1029
41.e21
.

45.

Merlo
LMF
,
Pepper
JW
,
Reid
BJ
, et al.
Cancer as an evolutionary and ecological process
.
Nat Rev Cancer
2006
;
6
:
924
35
.

46.

Ostrow
SL
,
Barshir
R
,
DeGregori
J
, et al.
Cancer evolution is associated with pervasive positive selection on globally expressed genes
.
PLoS Genet
2014
;
10
:
16
20
.

47.

Ovens
K
,
Naugler
C
.
Preliminary evidence of different selection pressures on cancer cells as compared to normal tissues
.
Theor Biol Med Model
2012
;
9
:
44
.

48.

Podlaha
O
,
Riester
M
,
De
S
, et al.
Evolution of the cancer genome
.
Trends Genet
2012
;
28
:
155
63
.

49.

Yates
LR
,
Campbell
PJ
.
Evolution of the cancer genome
.
Nat Rev Genet
2012
;
13
:
795
806
.

50.

Zhou
Z
,
Zou
Y
,
Liu
G
, et al.
Mutation-profile-based methods for understanding selection forces in cancer somatic mutations: a comparative analysis
.
Oncotarget
2017
;
8
:
58835
46
.

51.

Yates
AD
,
Achuthan
P
,
Akanni
W
, et al.
Ensembl 2020
.
Nucleic Acids Res
2020
;
48
:
D682
8
.

52.

Dees
ND
,
Zhang
Q
,
Kandoth
C
, et al.
MuSiC: identifying mutational significance in cancer genomes
.
Genome Res
2012
;
22
:
1589
98
.

53.

Tamborero
D
,
Gonzalez-Perez
A
,
Lopez-Bigas
N
.
OncodriveCLUST: exploiting the positional clustering of somatic mutations to identify cancer genes
.
Bioinformatics
2013
;
29
:
2238
44
.

54.

Tokheim
CJ
,
Papadopoulos
N
,
Kinzler
KW
, et al.
Evaluating the evaluation of cancer driver genes
.
Proc Natl Acad Sci
2016
;
113
:
14330
5
.

55.

Mularoni
L
,
Sabarinathan
R
,
Deu-Pons
J
, et al.
OncodriveFML: a general framework to identify coding and non-coding regions with cancer driver mutations
.
Genome Biol
2016
;
17
:
1
13
.

56.

Porta-Pardo
E
,
Godzik
A
.
E-driver: a novel method to identify protein regions driving cancer
.
Bioinformatics
2014
;
30
:
3109
14
.

57.

Reimand
J
,
Bader
GD
.
Systematic analysis of somatic mutations in phosphorylation signaling predicts novel cancer drivers
.
Mol Syst Biol
2013
;
9
:
637
.

58.

Weghorn
D
,
Sunyaev
S
.
Bayesian inference of negative and positive selection in human cancers
.
Nat Genet
2017
;
49
:
1785
8
.

59.

Bashashati
A
,
Haffari
G
,
Ding
J
, et al.
DriverNet: uncovering the impact of somatic driver mutations on transcriptional networks in cancer
.
Genome Biol
2012
;
13
:
R124
.

60.

Bertrand
D
,
Chng
KR
,
Sherbaf
FG
, et al.
Patient-specific driver gene prediction and risk assessment through integrated network analysis of cancer omics profiles
.
Nucleic Acids Res
2015
;
43
:
e44
4
.

61.

Iannuccelli
M
,
Micarelli
E
,
Surdo
PL
, et al.
CancerGeneNet: linking driver genes to cancer hallmarks
.
Nucleic Acids Res
2020
;
48
:
D416
21
.

62.

Dumbrava
EI
,
Meric-Bernstam
F
.
Personalized cancer therapy—leveraging a knowledge base for clinical decision-making
.
Mol Case Stud
2018
;
4
:
a001578
.

63.

Wishart
DS
,
Feunang
YD
,
Guo
AC
, et al.
DrugBank 5.0: a major update to the DrugBank database for 2018
.
Nucleic Acids Res
2018
;
46
:
D1074
82
.

64.

Wang
Y
,
Zhang
S
,
Li
F
, et al.
Therapeutic target database 2020: enriched resource for facilitating research and early development of targeted therapeutics
.
Nucleic Acids Res
2020
;
48
:
D1031
41
.

65.

Sun
SQ
,
Mashl
RJ
,
Sengupta
S
, et al.
Database of evidence for precision oncology portal
.
Bioinformatics
2018
;
34
:
4315
7
.

66.

Liu
X
,
Li
C
,
Mou
C
, et al.
dbNSFP v4: a comprehensive database of transcript-specific functional predictions and annotations for human nonsynonymous and splice-site SNVs
.
Genome Med
2020
;
12
:
103
.

67.

Mao
Y
,
Chen
H
,
Liang
H
, et al.
CanDrA: cancer-specific driver missense mutation annotation with optimized features
.
PLoS One
2013
;
8
:
e77945
.

68.

Wong
WC
,
Kim
D
,
Carter
H
, et al.
CHASM and SNVBox: toolkit for detecting biologically important single nucleotide mutations in cancer
.
Bioinformatics
2011
;
27
:
2147
8
.

69.

Shihab
HA
,
Gough
J
,
Mort
M
, et al.
Ranking non-synonymous single nucleotide polymorphisms based on disease concepts
.
Hum Genomics
2014
;
8
:
11
.

70.

Gonzalez-Perez
A
,
Deu-Pons
J
,
Lopez-Bigas
N
.
Improving the prediction of the functional impact of cancer mutations by baseline tolerance transformation
.
Genome Med
2012
;
4
:
89
.

Author notes

Wenyi Zhao and Jingwen Yang authors contributed equally to this work.

This article is published and distributed under the terms of the Oxford University Press, Standard Journals Publication Model (https://academic.oup.com/journals/pages/open_access/funder_policies/chorus/standard_publication_model)