-
PDF
- Split View
-
Views
-
Cite
Cite
Jun Wang, Ziying Yang, Carlotta Domeniconi, Xiangliang Zhang, Guoxian Yu, Cooperative driver pathway discovery via fusion of multi-relational data of genes, miRNAs and pathways, Briefings in Bioinformatics, Volume 22, Issue 2, March 2021, Pages 1984–1999, https://doi.org/10.1093/bib/bbz167
- Share Icon Share
Abstract
Discovering driver pathways is an essential step to uncover the molecular mechanism underlying cancer and to explore precise treatments for cancer patients. However, due to the difficulties of mapping genes to pathways and the limited knowledge about pathway interactions, most previous work focus on identifying individual pathways. In practice, two (or even more) pathways interplay and often cooperatively trigger cancer. In this study, we proposed a new approach called CDPathway to discover cooperative driver pathways. First, CDPathway introduces a driver impact quantification function to quantify the driver weight of each gene. CDPathway assumes that genes with larger weights contribute more to the occurrence of the target disease and identifies them as candidate driver genes. Next, it constructs a heterogeneous network composed of genes, miRNAs and pathways nodes based on the known intra(inter)-relations between them and assigns the quantified driver weights to gene–pathway and gene–miRNA relational edges. To transfer driver impacts of genes to pathway interaction pairs, CDPathway collaboratively factorizes the weighted adjacency matrices of the heterogeneous network to explore the latent relations between genes, miRNAs and pathways. After this, it reconstructs the pathway interaction network and identifies the pathway pairs with maximal interactive and driver weights as cooperative driver pathways. Experimental results on the breast, uterine corpus endometrial carcinoma and ovarian cancer data from The Cancer Genome Atlas show that CDPathway can effectively identify candidate driver genes [area under the receiver operating characteristic curve (AUROC) of |$\geq $|0.9] and reconstruct the pathway interaction network (AUROC of>0.9), and it uncovers much more known (potential) driver genes than other competitive methods. In addition, CDPathway identifies 150% more driver pathways and 60% more potential cooperative driver pathways than the competing methods. The code of CDPathway is available at http://mlda.swu.edu.cn/codes.php?name=CDPathway.
Introduction
Cancer is one of the most complex diseases that threaten human health. Systemic cancer genomics projects, such as The Cancer Genome Atlas (TCGA) [1] and the International Cancer Genome Consortium [2], have taken initial steps toward developing a blueprint of human cancer genomes by identifying, characterizing and cataloging alterations in thousands of tumor samples. A major challenge of interpreting the data in these projects is to distinguish the mutations that play a role in the initiation and progression of cancer, from the much larger number of passenger alterations that have little influence in cancer cell development.
Most of the early studies focus on detecting individual driver genes based on periodic mutation rates of genes in a large population of cancer patients [3, 4]. The useful implementation of this concept, however, faces a number of well-known challenges. First, the vast majority of driver genes occur rarely, making them difficult to be detected statistically [5]. Second, different types of mutations exist, and the amount of mutated genes is very large; as a consequence, the sets of genes involved in different studies typically have a small overlap. Therefore, it is difficult to establish a consistent causal mechanism for a given cancer [6]. Third, transformed cells are typically mutated in multiple members of a set of functionally related genes. Consequently, mutations that drive transformation are best sought and understood in a pathway context, especially when the mutation rates are relatively low [7]. For these reasons, efforts have been moving toward identifying driver pathways [8–10].
One way to proceed is to identify known pathways that are enriched in genes carrying somatic mutations [1, 4]. More recent methods exploit genomic characteristics of mutations (i.e. mutual exclusivity that genes in the same pathway rarely mutate in the same sample), to identify oncogenic modules [11, 12]. Vandin et al. [13] proposed Dendrix (de novo driver exclusivity), which maximizes a novel weight function that considers the high coverage and mutual exclusivity of the driver gene set, to discover mutated driver pathways. To better address the maximum weight function, Zhao et al. [14] proposed an exact binary linear programming (BLP) model to maximize the weight function. Zhang et al. [15] proposed two optimization models to separately discover common driver gene sets among multiple cancer types (ComMDP) and specific driver gene sets of a particular cancer (SpeMDP). However, these de novo methods focus on single pathways or modules, and they cannot shed light on how various cellular and physiological processes cooperatively alter during the initiation and progression of cancer. Besides mutation data, gene level knowledge including gene size bias [16, 17], gene-interaction networks and expression levels [5, 7, 18] have been incorporated to uncover the significant mutations of a pathway. Copy number alterations [19] and other biological knowledge of mutational processes, such as transcript isoforms, variation in mutation type and redundancy of genetic code, were also integrated for analysis [20]. Approaches that infer patient specific pathways have also been developed [21, 22]. However, these methods ignore pathway interactions and cannot directly identify driver pathways. Furthermore, all the aforementioned methods may be biased towards super important genes, while ignoring other important driver genes with a low mutation rate.
It has been recognized that pathways often function cooperatively in carcinogenesis [23, 24]. For example, TGF-|$\beta $| (transforming growth factor-|$\beta $|) transactivates EGFR (epidermal growth factor receptor) through canonical Smad3 and ERK/Sp1 signaling pathways and then facilitates breast cancer migration and invasion [25]. Unfortunately, computational solutions for cooperative driver pathway discovery are still much limited, due to the difficulty of mapping genes to pathways and the incomplete pathway knowledge. Furthermore, it is difficult to collect the information of molecules within pathways and the associations among pathways. Yeang et al. [12] reported that there are significant combinatorial patterns of mutations occurring in the same patients (i.e. co-occurrence), whose driver genes usually function in different pathways, and genes in the same pathway rarely mutate in the same sample (i.e. mutual exclusivity). Leiserson et al. [26] proposed Multi-Dendrix, which uses a sum of multiple quantities for multiple pathways to identify driver pathways. Multi-Dendrix may obtain only sub-optimal results and weaken the relations between pathways, since it does not consider the functional interactions or co-occurrence of multiple pathways. Zhang et al. [27] proposed CoMDP, which extends Dendrix by maximizing the mutual exclusivity within a pathway and the co-occurrence of mutations between different pathways. However, CoMDP can only obtain one pair of cooperative driver gene sets for a target disease. So CoMDP can be easily trapped, only identify super important genes with the highest coverage and miss other potential driver pathways. Dao et al. [28] proposed BeCo-WithMEFun to discover cooperative driver pathways. BeCo-WithMEFun pursues the mutual exclusivity and functional interactions within pathways and the co-occurrence between pathways using integer linear programming. However, when there are insufficient co-occurrences or functional interactions among genes, BeCo-WithMEFun cannot find a solution or at best can only find few genes. This is because gene modules with joint mutual exclusivity, co-occurrence and functional interaction are rare. Yang et al. [29] first integrated multiple prior knowledge data by matrix factorization and then applied a tri-random walk to identify cooperative driver pathways. This solution ignores mutation data, and it suffers from too many parameters and high time cost.
Existing cooperative driver pathway discovery methods suffer from at least one of the following issues: neglecting pathway knowledge, unable to directly identify cooperative driver pathways [26–28], limited information given by (noisy) single type of data [30], difficulty in optimizing the problem [28] or tedious parameter selection [29]. To address these issues, we develop a novel approach called CDPathway to leverage prior knowledge [i.e. gene–gene interactions (GGIs) and signal networks] and mutation profiles for cooperative driver pathways discovery. CDPathway identifies candidate driver genes based on a driver impact quantification function that integrates somatic mutation, signaling network and GGI data. The quantification function measures the importance score of each gene in the mutation profile and prior functional networks and then takes the score as driver weight of the gene. In this way, CDPathway can uncover driver genes with maximal weights that play important roles in the carcinogenesis and signaling network. Next, CDPathway imports the quantified driver weights, along with the multi-relational data of genes, miRNAs and pathways, to construct a heterogeneous network, in which the edges of gene–miRNA and gene–pathway are specified by the quantified driver weights. To map driver genes to driver pathways and to uncover the cooperations between these driver pathways, CDPathway collaboratively factorizes the weight adjacency matrices of the heterogeneous network. As such, CDPathway can transfer the influence of driver genes on disease to pathways, to reconstruct the pathway interaction network. After that, CDPathway identifies the interactive pathways with the maximal driver weights on disease as cooperative driver pathways.

The workflow of CDPathway. Step 1: CDPathway introduces a driver impact quantification function to measure the importance score of each gene to disease based on the mutation profile and on prior functional networks integrated by GGI and signaling networks and takes the score as driver weight of each gene. Step 2: CDPathway imports the driver weights to construct a heterogeneous network composed of inter-relational and intra-relational networks of genes, miRNAs and pathways and then collaboratively factorizes the weight adjacency matrices of the heterogeneous network to reconstruct the pathway-to-pathway interaction network. CDPathway identifies the pathway pairs with the maximal interactive and driver weight as cooperative driver pathways.
We apply CDPathway on two real TCGA data sets: somatic mutation and copy number variation profiles of breast cancer (BRCA) and of uterine corpus endometrial carcinoma (UCEC). After statistical and biological enrichment analysis (GO biological process terms and KEGG pathways), the genes identified by the driver impact quantification function show strong associations with disease [area under the receiver operating characteristic curve (AUROC) of |$\geq $|0.9] and play crucial roles in important biological processes. In addition, these identified genes are tightly connected in signaling pathways, which suggests that cooperations do exist between pathways to cooperatively trigger cancer. We also evaluate the effectiveness of the collaborative matrix factorization via 5-fold cross validation and obtain an AUROC of >0.9, which supports the credibility of the identified pathway cooperations. Finally, we evaluate the identified cooperative driver pathways. We find that 100% of them contain driver genes and are related to important biological activities. We compare CDPathway with related methods, including individual driver pathways identification methods Dendrix [13] and BLP [14] and multiple driver pathways identification methods CoMDP [27], Multi-Dendrix [26], BeCo-WithMEFun [28] and CoDP [29]. At gene level, CDPathway can identify a larger number of known driver genes with higher statistical significance. At pathway level, CDPathway performs better by 150% in identifying more driver pathways and by 60% in uncovering more potential cooperations between driver pathways than competitive methods. We also apply CDPathway on an ovarian cancer data set (8911 genes |$\times $| 316 genes, detailed results are shown in Section 2 of the Supplementary file). Results show that CDPathway can identify more known driver genes by over 100% and uncover more significant cooperations between pathways by 80%, which further show the effectiveness of CDPathway on different data sets.
Methods
To identify cooperative driver pathways, CDPathway follows the two-step procedure illustrated in Figure 1.
Step 1: identify candidate driver genes via a driver impact quantification function
Mutations in a cancer genome tend to converge on a few biological pathways [31], and there are relatively complete mutation data on diseases. Therefore, we first identify candidate driver genes in this step, and then map these candidates to pathways. In this way, the effects of the candidate driver genes on cancer can be transferred to pathways. In addition, pathway-based or network-based approaches [21, 22] for discovering cancer genes have shown that functional networks are helpful to identify cancer driver genes. Therefore, we import the GGI network and signaling network to identify driver genes. Several methods have been introduced to integrate mutation and prior knowledge data to identify driver genes [32, 33]. Inspired by MaxMIF [33], we introduce a new driver impact quantification function to quantify the driver weight of each gene and to identify candidate driver genes using the weights.
Step 2: identify cooperative driver pathways by matrix tri-factorization
Results
Data sets and parameter investigation
To investigate the effectiveness of CDPathway, we carry out experiments on publicly available cancer data of breast cancer (BRCA) and uterine corpus endometrial carcinoma (UCEC) from TCGA [1]. Details about the multiple data sources used for the experiments are listed in Table 1. Each of the interaction weight between two genes was extracted and standardized with a value ranging from 0 to 1, divided to the largest interaction weight among all genes. Self-loop interactions were removed to simplify the networks. Multiple miRNAs’ expressions change during tumorigenesis and development, and cascade to downstream target gene expression in the signaling pathway, and thus affect tumor progression. Given that, we assume that an miRNA and a pathway are more likely to be related if they are associated with the same gene, so we construct the miRNA–pathway association network based on the known gene–pathway and gene–miRNA associations.
Data type . | Size (BRCA) . | Size (UCEC) . | Direction . | Weight . | Sources . | Version or date . |
---|---|---|---|---|---|---|
mutation profile | 13 965|$\times $|776 | 18 421|$\times $|248 | No | No | TCGA [1] | 2018.11 |
GGI | 16 907|$\times $|16 907 | 16 907|$\times $|16 907 | Undirected | Weighted | STRING [59] | 2017.05 |
signaling network | 12 456|$\times $|12 456 | 12 456|$\times $|12 456 | Directed | Binary | SignaLink [46],SPIKE [47, 48]Pathway Commons [49] | 2018.11 |
gene–gene (|$\mathbf{R}_{gg}$|) | 1631|$\times $|1631 | 2010|$\times $|2010 | Undirected | Binary | reconstructed | 2019.05 |
gene–miRNA (|$\mathbf{R}_{gm}$|) | 1631|$\times $|559 | 2010|$\times $|559 | Undirected | Weighted | miRBase [50] | 2019.03 |
gene–pathway (|$\mathbf{R}_{gp}$|) | 1631|$\times $|212 | 2010|$\times $|212 | Undirected | Weighted | PID [51] | 2018.01 |
miRNA–miRNA (|$\mathbf{R}_{mm}$|) | 559|$\times $|559 | 559|$\times $|559 | Undirected | Binary | Cui’s lab [52] | 2018.01 |
miRNA–pathway (|$\mathbf{R}_{mp}$|) | 559|$\times $|212 | 559|$\times $|559 | Undirected | Binary | reconstruted | 2019.05 |
pathway–pathway (|$\mathbf{R}_{pp}$|) | 212|$\times $|212 | 212|$\times $|212 | Undirected | Binary | PID [51] | 2018.01 |
Data type . | Size (BRCA) . | Size (UCEC) . | Direction . | Weight . | Sources . | Version or date . |
---|---|---|---|---|---|---|
mutation profile | 13 965|$\times $|776 | 18 421|$\times $|248 | No | No | TCGA [1] | 2018.11 |
GGI | 16 907|$\times $|16 907 | 16 907|$\times $|16 907 | Undirected | Weighted | STRING [59] | 2017.05 |
signaling network | 12 456|$\times $|12 456 | 12 456|$\times $|12 456 | Directed | Binary | SignaLink [46],SPIKE [47, 48]Pathway Commons [49] | 2018.11 |
gene–gene (|$\mathbf{R}_{gg}$|) | 1631|$\times $|1631 | 2010|$\times $|2010 | Undirected | Binary | reconstructed | 2019.05 |
gene–miRNA (|$\mathbf{R}_{gm}$|) | 1631|$\times $|559 | 2010|$\times $|559 | Undirected | Weighted | miRBase [50] | 2019.03 |
gene–pathway (|$\mathbf{R}_{gp}$|) | 1631|$\times $|212 | 2010|$\times $|212 | Undirected | Weighted | PID [51] | 2018.01 |
miRNA–miRNA (|$\mathbf{R}_{mm}$|) | 559|$\times $|559 | 559|$\times $|559 | Undirected | Binary | Cui’s lab [52] | 2018.01 |
miRNA–pathway (|$\mathbf{R}_{mp}$|) | 559|$\times $|212 | 559|$\times $|559 | Undirected | Binary | reconstruted | 2019.05 |
pathway–pathway (|$\mathbf{R}_{pp}$|) | 212|$\times $|212 | 212|$\times $|212 | Undirected | Binary | PID [51] | 2018.01 |
Data type . | Size (BRCA) . | Size (UCEC) . | Direction . | Weight . | Sources . | Version or date . |
---|---|---|---|---|---|---|
mutation profile | 13 965|$\times $|776 | 18 421|$\times $|248 | No | No | TCGA [1] | 2018.11 |
GGI | 16 907|$\times $|16 907 | 16 907|$\times $|16 907 | Undirected | Weighted | STRING [59] | 2017.05 |
signaling network | 12 456|$\times $|12 456 | 12 456|$\times $|12 456 | Directed | Binary | SignaLink [46],SPIKE [47, 48]Pathway Commons [49] | 2018.11 |
gene–gene (|$\mathbf{R}_{gg}$|) | 1631|$\times $|1631 | 2010|$\times $|2010 | Undirected | Binary | reconstructed | 2019.05 |
gene–miRNA (|$\mathbf{R}_{gm}$|) | 1631|$\times $|559 | 2010|$\times $|559 | Undirected | Weighted | miRBase [50] | 2019.03 |
gene–pathway (|$\mathbf{R}_{gp}$|) | 1631|$\times $|212 | 2010|$\times $|212 | Undirected | Weighted | PID [51] | 2018.01 |
miRNA–miRNA (|$\mathbf{R}_{mm}$|) | 559|$\times $|559 | 559|$\times $|559 | Undirected | Binary | Cui’s lab [52] | 2018.01 |
miRNA–pathway (|$\mathbf{R}_{mp}$|) | 559|$\times $|212 | 559|$\times $|559 | Undirected | Binary | reconstruted | 2019.05 |
pathway–pathway (|$\mathbf{R}_{pp}$|) | 212|$\times $|212 | 212|$\times $|212 | Undirected | Binary | PID [51] | 2018.01 |
Data type . | Size (BRCA) . | Size (UCEC) . | Direction . | Weight . | Sources . | Version or date . |
---|---|---|---|---|---|---|
mutation profile | 13 965|$\times $|776 | 18 421|$\times $|248 | No | No | TCGA [1] | 2018.11 |
GGI | 16 907|$\times $|16 907 | 16 907|$\times $|16 907 | Undirected | Weighted | STRING [59] | 2017.05 |
signaling network | 12 456|$\times $|12 456 | 12 456|$\times $|12 456 | Directed | Binary | SignaLink [46],SPIKE [47, 48]Pathway Commons [49] | 2018.11 |
gene–gene (|$\mathbf{R}_{gg}$|) | 1631|$\times $|1631 | 2010|$\times $|2010 | Undirected | Binary | reconstructed | 2019.05 |
gene–miRNA (|$\mathbf{R}_{gm}$|) | 1631|$\times $|559 | 2010|$\times $|559 | Undirected | Weighted | miRBase [50] | 2019.03 |
gene–pathway (|$\mathbf{R}_{gp}$|) | 1631|$\times $|212 | 2010|$\times $|212 | Undirected | Weighted | PID [51] | 2018.01 |
miRNA–miRNA (|$\mathbf{R}_{mm}$|) | 559|$\times $|559 | 559|$\times $|559 | Undirected | Binary | Cui’s lab [52] | 2018.01 |
miRNA–pathway (|$\mathbf{R}_{mp}$|) | 559|$\times $|212 | 559|$\times $|559 | Undirected | Binary | reconstruted | 2019.05 |
pathway–pathway (|$\mathbf{R}_{pp}$|) | 212|$\times $|212 | 212|$\times $|212 | Undirected | Binary | PID [51] | 2018.01 |
We divided the original data by 7:1:2 into training data, validation data and test data and selected the parameters using the validation data. The driver weight threshold |$\theta $| is an important parameter for filtering useless genes in Step 2. We split useless genes with driver weight less than |$\theta $| to reconstruct relational networks of genes, miRNAs and pathways. As a result, the reconstructed gene–gene network used in Step 2 is a subnetwork of the original GGI network used in Step 1. We show the distribution of driver weight in Figure 2, the majority of genes’ driver weight are less than 0.01 and 0.001 for BRCA and UCEC data sets, indicating that these genes are passenger genes, which do not contribute to the carcinogenesis. To study the sensitivity of |$\theta $|, based on the distribution of driver weights in Figure 2, we fix low-rank size |$k_g=k_m=k_p=k_o$| as 50 and penalty coefficient |$\gamma $| as 100 and increase |$\theta $| from 0 to 0.1 in BRCA data set and from 0 to 0.01 in UCEC data set. Figure 3 reports the AUROC and the area under the precision recall curve (AURPC) under different input values of |$\theta $| in reconstructing pathway–pathway associations.


Both the AUROC value and AUPRC value vary no more than 0.2 as the increase of |$\theta $| and reach to the highest when |$\theta \approx 0.07$| on the BRCA data set and |$\theta \approx 0.006$| on the UCEC data set. Particularly, when |$\theta $| =0, both the AUROC and AUPRC are the lowest, indicating that filtering passenger genes is helpful for reconstructing the pathway–pathway interaction network. In practice, we also tested with a larger threshold, like 0.5, 1 and 5, the AUROC and AUPRC values drop to lower than when |$\theta =0$|, since important driver genes are excluded with such a large threshold. Given these observations, we adopt |$\theta $| = 0.07 and |$\theta $| = 0.006 for experiments on BRCA and UCEC data sets, respectively.
The low-rank size |$k_o$| and penalty coefficient |$\gamma $| are important parameters for reconstructing pathway–pathway interaction network, and tuning them altogether is time consuming. For simplicity, we simply fix |$k_g=k_m=k_p=k_o$| and then study how |$k_o$| and |$\gamma $| affect the reconstruction of pathway–pathway interaction network. We increase |$k_o$| from 10 to 110 and |$\gamma $| from |$10^{-4}$| to |$10^6$| on BRCA and UCEC data sets. Figure 4 reports the AUROC and AURPC under different input values of |$k_o$| and |$\gamma $|.

The AUROC and AUPRC of CDPathway under different low-rank size |$k_o$| and sparsity coefficient |$\gamma $|.
Both the AUROC and AUPRC increase as the increase of |$k_o$| and reach to the plateau when |$k_o \geq 40$| on the two data sets. That is because the intra-relations cannot be preserved when the size (|$k_o$|) of low-rank representation |$\mathbf{H}_o$| is too small. Similarly, both the AUROC and AUPRC increase as the rise of |$\gamma $| and reach to a plateau when |$\gamma \geq 10$| on both data sets. The values of AUROC and AUPRC manifest a significant reduce when |$\gamma < 1$|. That is because |$\gamma \sum _{o \in \{g,m,p\}} {|{\mathbf{H}_o}|_1^2}$| encourages the sparsity of |$\mathbf{H}_o$| and thus to reduce the false positive reconstructed edges between pathway nodes. A too low |$\gamma $| cannot enforce the sparsity.
Based on the above analysis, we adopt |$k_o=90$| and |$k_o=100$|, |$\gamma =10^3$| and |$\gamma =10^3$| for experiments on BRCA and UCEC data sets, respectively. Based on these selected parameters, CDPathway again performs well on the test data with an AUROC of 0.8952 and 0.9041, an AUPRC of 0.5714 and 0.5846 on the BRCA and UCEC dataset, respectively.
The driver weight quantification function can effectively identify driver genes
Known driver genes can be identified by our driver weight quantification function with statistic significance. To study the effectiveness of identifying known driver genes (reported in [53]), we specify the number of identified candidate driver genes in Step 1 from one to the number of genes in the mutation profiles (13 965 in BRCA and 18 421 in UCEC) and plot the ROC and PR curves in Figure 5.

Receiver operating characteristics (ROC) and precision recall (PR) curves of identifying known driver genes for BRCA and UCEC data sets. The corresponding AUROC and AUPRC values are also included in the legend. CDPathway-noSig: CDPathway without the signaling network. CDPathway-noGGI: CDPathway without the GGI network. CDPathway-noNetwork: CDPathway without the signaling and GGI networks.
For both BRCA and UCEC data sets, the AUROC values are over 0.9, and AUPRC values are over 0.2. The reason why the AUPRC is relatively low compared to the AUROC is due to that CDPathway identifies genes which drive the carcinogenesis of the target cancer, while the known genes are drivers for various diseases. Another cause is that AUPRC is more sensitive to the class-imbalance task and identifying known driver gene in essence is a class-imbalance task. As can be seen, the precision is very high at the beginning, which indicates that top-ranked genes are important driver genes. The genes with weights less than |$\theta $| should be excluded from the heterogeneous network, and they drag down the Precision for a high recall and cause a low AUPRC value. Overall, CDPathway can effectively identify known driver genes.
In order to investigate the contribution of GGI and signaling networks, we also compare CDPathway with CDPathway without the GGI or the signaling networks. When a single functional network is disregarded, the AUROC and AUPRC values are reduced by 1.7% and 16.7% on BRCA and by 1.8% and 13.1% on UCEC, respectively. When both networks are removed, the AUROC (AUPRC) value drops sharply by 15.1% (62.8%) and 16.3% (53.6%) on BRCA and UCEC data sets, respectively. Therefore, we can conclude that both prior knowledge networks contribute to the identification of driver genes, and disregarding any of the two will compromise the performance. In addition, we also study how CDPathway perform when the functional networks are not complete. We randomly remove 5–50% edges in the GGI or signaling network and record the AUROC and AUPRC values in Figure 6. As can be seen, the AUROC and AUPRC manifest a decrease as more edges removed. Generally, CDPathway relies to the completeness of functional networks, but the AUROC (AUPRC) drops less than 10% (25%) when even 50% of edges in the GGI or signaling network are removed. That is because we also integrate the somatic mutation data in Step 1, and the mutation rate of genes (i.e. coverage) are helpful to identify potential driver genes. Therefore, we can infer that the driver weights of genes are not so dependent on the completeness of the functional networks.

AUROC and AUPRC values of identifying known driver genes with different proportions of removed interactions in the GGI and signaling networks. CDPathway-rGGI(Sig): CDPathway uses only the GGI(signaling) network with removed interactions. CDPathway-GGI(Sig): CDPathway using only the GGI(signaling) network.
After filtering useless genes, we further apply a hypergeometric test on the identified candidate driver genes of the two real data sets to study the statistical significance of the candidate driver genes. The |$P$|-values of the identified known driver genes are 9.68E-62 and 1.67E-54 for BRCA and UCEC data sets, respectively. Therefore, we can conclude that CDPathway can identify known driver genes with statistical significance. In addition, we select the top 30 candidate driver genes in Step 1 and do functional and pathway enrichment analysis using the widely used DAVID toolkit [54]. As Table 2 shows, there are 17 and 14 known driver genes among the 30 candidate driver genes for BRCA and UCEC data sets, respectively. Besides the known driver genes, CDPathway can also identify several novel driver genes. For the BRCA data set, it is highlighted that FOXA1 (Forkhead box protein A1) and GATA3 (GATA binding protein 3) co-regulate the expression of genes, which is essential for the luminal mammary epithelial cell development [55, 56]. Another example is that mutations in MAP2K4 produce perturbations similar to MAP3K1 loss [57]. For UCEC data set, TP53 (tumor protein p53) is involved with DNA repair, growth arrest and apoptosis. In particular, mutations in TP53 can lead to uncontrolled proliferation and invasive growth. On the other hand, DCC (deleted in colorectal carcinoma) is reported to have an anti-metastatic role [58], meaning that it may only contribute to cancer in the context of a pre-existing condition. We conjecture that the mutations in DCC may contribute to cancer progression for patients with defective mismatch repair and/or impaired TP53 functionality. A network-based analysis also provides supporting evidence of TTN mutations as a disease marker [60].
Candidate driver genes identified by CDPathway. The known driver genes and enriched biological processes in BRCA and UCEC data sets are highlighted in boldface font
Data set . | Genes . | GO/KEGG/Disease enrichment . |
---|---|---|
BRCA | TP53 PIK3CA CDH1 PTEN BRCA2 MAP3K1 AKT1 CTCF PIK3R1 RUNX1 GATA3 FOXA1 NCOA3 CNOR1 DMD MAP2K4 ERBB2 NF1 ARID1A LRP2 ATM KRAS ACTN2 FLT4 MTOR RELN LRBA ARHGEF1 CASP8 DCC | Breast cancer; pathways in cancer; central carbon metabolism in cancer; apoptosis; microRNAs in cancer; transcriptional activator activity; RNA polymerase II transcription regulatory region sequence-specific binding |
UCEC | PTEN PIK3CA CTNNB1 TP53 PIK3R1 ARID1A CTCF KRAS FGFR2 FBXW7 PPP2R1A TTN SGK1 DMD ESR1 SPOP CCND1 TNS1 MTOR FAT1 LRP6 ERBB3 CUX1 MED12 DCC APC SOS1 NFE2L2 CTNND1 ZMYM2 | Endometrial cancer; pathways in cancer; central carbon metabolism in cancer; microRNAs in cancer; negative regulation of transcription from RNA polymerase II promoter |
Data set . | Genes . | GO/KEGG/Disease enrichment . |
---|---|---|
BRCA | TP53 PIK3CA CDH1 PTEN BRCA2 MAP3K1 AKT1 CTCF PIK3R1 RUNX1 GATA3 FOXA1 NCOA3 CNOR1 DMD MAP2K4 ERBB2 NF1 ARID1A LRP2 ATM KRAS ACTN2 FLT4 MTOR RELN LRBA ARHGEF1 CASP8 DCC | Breast cancer; pathways in cancer; central carbon metabolism in cancer; apoptosis; microRNAs in cancer; transcriptional activator activity; RNA polymerase II transcription regulatory region sequence-specific binding |
UCEC | PTEN PIK3CA CTNNB1 TP53 PIK3R1 ARID1A CTCF KRAS FGFR2 FBXW7 PPP2R1A TTN SGK1 DMD ESR1 SPOP CCND1 TNS1 MTOR FAT1 LRP6 ERBB3 CUX1 MED12 DCC APC SOS1 NFE2L2 CTNND1 ZMYM2 | Endometrial cancer; pathways in cancer; central carbon metabolism in cancer; microRNAs in cancer; negative regulation of transcription from RNA polymerase II promoter |
Candidate driver genes identified by CDPathway. The known driver genes and enriched biological processes in BRCA and UCEC data sets are highlighted in boldface font
Data set . | Genes . | GO/KEGG/Disease enrichment . |
---|---|---|
BRCA | TP53 PIK3CA CDH1 PTEN BRCA2 MAP3K1 AKT1 CTCF PIK3R1 RUNX1 GATA3 FOXA1 NCOA3 CNOR1 DMD MAP2K4 ERBB2 NF1 ARID1A LRP2 ATM KRAS ACTN2 FLT4 MTOR RELN LRBA ARHGEF1 CASP8 DCC | Breast cancer; pathways in cancer; central carbon metabolism in cancer; apoptosis; microRNAs in cancer; transcriptional activator activity; RNA polymerase II transcription regulatory region sequence-specific binding |
UCEC | PTEN PIK3CA CTNNB1 TP53 PIK3R1 ARID1A CTCF KRAS FGFR2 FBXW7 PPP2R1A TTN SGK1 DMD ESR1 SPOP CCND1 TNS1 MTOR FAT1 LRP6 ERBB3 CUX1 MED12 DCC APC SOS1 NFE2L2 CTNND1 ZMYM2 | Endometrial cancer; pathways in cancer; central carbon metabolism in cancer; microRNAs in cancer; negative regulation of transcription from RNA polymerase II promoter |
Data set . | Genes . | GO/KEGG/Disease enrichment . |
---|---|---|
BRCA | TP53 PIK3CA CDH1 PTEN BRCA2 MAP3K1 AKT1 CTCF PIK3R1 RUNX1 GATA3 FOXA1 NCOA3 CNOR1 DMD MAP2K4 ERBB2 NF1 ARID1A LRP2 ATM KRAS ACTN2 FLT4 MTOR RELN LRBA ARHGEF1 CASP8 DCC | Breast cancer; pathways in cancer; central carbon metabolism in cancer; apoptosis; microRNAs in cancer; transcriptional activator activity; RNA polymerase II transcription regulatory region sequence-specific binding |
UCEC | PTEN PIK3CA CTNNB1 TP53 PIK3R1 ARID1A CTCF KRAS FGFR2 FBXW7 PPP2R1A TTN SGK1 DMD ESR1 SPOP CCND1 TNS1 MTOR FAT1 LRP6 ERBB3 CUX1 MED12 DCC APC SOS1 NFE2L2 CTNND1 ZMYM2 | Endometrial cancer; pathways in cancer; central carbon metabolism in cancer; microRNAs in cancer; negative regulation of transcription from RNA polymerase II promoter |
Table 2 also shows some enriched function terms. There are several overlapping important biological processes enriched in both BRCA and UCEC data sets, including transcriptional activator activity, central carbon metabolism in cancer, apoptosis, etc. The candidate driver genes are also enriched in the target disease and cancer related pathways. These results demonstrate that CDPathway can effectively identify candidate driver genes engaging in the cancer related pathways and biological processes.
Collaborative matrix factorization can effectively reconstruct the pathway-pathway interaction network
To investigate the effectiveness of collaborate matrix factorization in Step 2, we randomly divide the original pathway–pathway interaction network into 5-folds for cross validation. Next, we plot the ROC and PR curves of CDPathway in Figure 7. We run 5-fold cross validation for 10 independent rounds and report the average results. We can see that CDPathway achieves high AUROC values of 0.9419 and 0.9318, AUPRC values of 0.7838 and 0.7707 on BRCA and UCEC data sets, respectively. The high AUROC and AUPRC values indicate that CDPathway can accurately reconstruct pathway–pathway associations.

ROC and PR curves of reconstructing the pathway–pathway interaction network for BRCA and UCEC data sets. CDPathway-noSig: CDPathway without the signaling network. CDPathway-noGGI: CDPathway without the GGI network. CDPathway-noNetwork: CDPathway without the signaling and GGI networks. CDPathway-noWeight: CDPathway without the weights of driver genes.
To investigate the contribution of the quantified driver gene weights obtained in Step 1, we test the collaborate matrix factorization without importing driver weights. As shown in Figure 7, both AUROC and AUPRC drop by 12.6% and 8.3%, 27.2% and 23.4% on BRCA and UCEC data sets, respectively, indicating that the imported weights are helpful in reconstructing the pathway–pathway interaction network. In addition, we also test CDPathway without prior functional networks. Without any functional network, both the AUROC and AURPC values drop more seriously. The integration of the signaling and GGI networks can jointly improve the performance. In addition, the GGI and signaling networks show equal importance (no more than 3% difference) for the reconstruction of the pathway–pathway interaction network. We can conclude that the GGI and signaling networks complement each other and both contribute to the reconstruction of the pathway–pathway interaction network.
CDPathway can identify significant cooperative driver pathways
We select 10 pathway pairs with top 10 highest values in the reconstructed pathway–pathway interaction network as cooperative driver pathways. As shown in Table 3, by checking the member genes of these pathways, we find that each pathway contains at least one known driver gene, which indicates that all pathways play important roles in carcinogenesis by undertaking the biological activities of the driver genes. In addition, these pathways also confirm that the driver effects of genes on the disease are correctly transferred to pathways by the collaborative matrix factorization on the heterogeneous network in Step 2.
No. . | BRCA . | UCEC . |
---|---|---|
1 | Internalization of ErbB1, Trk receptor signaling mediated by PI3K and PLC-gamma | JNK signaling in the CD4 TCR pathway, HIV-1 Nef negative effector of Fas and TNF-alpha |
2 | Degradation of beta catenin, canonical Wnt signaling pathway | Internalization of ErbB1, Trk receptor signaling mediated by PI3K and PLC-gamma |
3 | Ceramide signaling pathway, EPO signaling pathway | ATM pathway, reelin signaling pathway |
4 | IL2 signaling events mediated by STAT5, osteopontin-mediated events | canonical Wnt signaling pathway, BCR signaling pathway |
5 | Alternative NF-kappaB pathway, HIF-1-alpha transcription factor network | TGF-beta receptor signaling, ceramide signaling pathway |
6 | p53 pathway, class I PI3K signaling events mediated by Akt | FOXA1 transcription factor network, regulation of Androgen receptor activity |
7 | Aurora A signaling, signaling events mediated by hepatocyte growth factor receptor (c-Met) | mTOR signaling pathway, IL2 signaling events mediated by STAT5 |
8 | Notch signaling pathway, Iintegrin family cell surface interactions | TNF receptor signaling pathway, FOXM1 transcription factor network |
9 | Aurora C signaling, C-MYC pathway | Netrin-mediated signaling events, alternative NF-kappaB pathway |
10 | IL8- and CXCR1-mediated signaling events, Integrins in angiogenesis | Insulin-mediated glucose transport, direct p53 effectors |
No. . | BRCA . | UCEC . |
---|---|---|
1 | Internalization of ErbB1, Trk receptor signaling mediated by PI3K and PLC-gamma | JNK signaling in the CD4 TCR pathway, HIV-1 Nef negative effector of Fas and TNF-alpha |
2 | Degradation of beta catenin, canonical Wnt signaling pathway | Internalization of ErbB1, Trk receptor signaling mediated by PI3K and PLC-gamma |
3 | Ceramide signaling pathway, EPO signaling pathway | ATM pathway, reelin signaling pathway |
4 | IL2 signaling events mediated by STAT5, osteopontin-mediated events | canonical Wnt signaling pathway, BCR signaling pathway |
5 | Alternative NF-kappaB pathway, HIF-1-alpha transcription factor network | TGF-beta receptor signaling, ceramide signaling pathway |
6 | p53 pathway, class I PI3K signaling events mediated by Akt | FOXA1 transcription factor network, regulation of Androgen receptor activity |
7 | Aurora A signaling, signaling events mediated by hepatocyte growth factor receptor (c-Met) | mTOR signaling pathway, IL2 signaling events mediated by STAT5 |
8 | Notch signaling pathway, Iintegrin family cell surface interactions | TNF receptor signaling pathway, FOXM1 transcription factor network |
9 | Aurora C signaling, C-MYC pathway | Netrin-mediated signaling events, alternative NF-kappaB pathway |
10 | IL8- and CXCR1-mediated signaling events, Integrins in angiogenesis | Insulin-mediated glucose transport, direct p53 effectors |
No. . | BRCA . | UCEC . |
---|---|---|
1 | Internalization of ErbB1, Trk receptor signaling mediated by PI3K and PLC-gamma | JNK signaling in the CD4 TCR pathway, HIV-1 Nef negative effector of Fas and TNF-alpha |
2 | Degradation of beta catenin, canonical Wnt signaling pathway | Internalization of ErbB1, Trk receptor signaling mediated by PI3K and PLC-gamma |
3 | Ceramide signaling pathway, EPO signaling pathway | ATM pathway, reelin signaling pathway |
4 | IL2 signaling events mediated by STAT5, osteopontin-mediated events | canonical Wnt signaling pathway, BCR signaling pathway |
5 | Alternative NF-kappaB pathway, HIF-1-alpha transcription factor network | TGF-beta receptor signaling, ceramide signaling pathway |
6 | p53 pathway, class I PI3K signaling events mediated by Akt | FOXA1 transcription factor network, regulation of Androgen receptor activity |
7 | Aurora A signaling, signaling events mediated by hepatocyte growth factor receptor (c-Met) | mTOR signaling pathway, IL2 signaling events mediated by STAT5 |
8 | Notch signaling pathway, Iintegrin family cell surface interactions | TNF receptor signaling pathway, FOXM1 transcription factor network |
9 | Aurora C signaling, C-MYC pathway | Netrin-mediated signaling events, alternative NF-kappaB pathway |
10 | IL8- and CXCR1-mediated signaling events, Integrins in angiogenesis | Insulin-mediated glucose transport, direct p53 effectors |
No. . | BRCA . | UCEC . |
---|---|---|
1 | Internalization of ErbB1, Trk receptor signaling mediated by PI3K and PLC-gamma | JNK signaling in the CD4 TCR pathway, HIV-1 Nef negative effector of Fas and TNF-alpha |
2 | Degradation of beta catenin, canonical Wnt signaling pathway | Internalization of ErbB1, Trk receptor signaling mediated by PI3K and PLC-gamma |
3 | Ceramide signaling pathway, EPO signaling pathway | ATM pathway, reelin signaling pathway |
4 | IL2 signaling events mediated by STAT5, osteopontin-mediated events | canonical Wnt signaling pathway, BCR signaling pathway |
5 | Alternative NF-kappaB pathway, HIF-1-alpha transcription factor network | TGF-beta receptor signaling, ceramide signaling pathway |
6 | p53 pathway, class I PI3K signaling events mediated by Akt | FOXA1 transcription factor network, regulation of Androgen receptor activity |
7 | Aurora A signaling, signaling events mediated by hepatocyte growth factor receptor (c-Met) | mTOR signaling pathway, IL2 signaling events mediated by STAT5 |
8 | Notch signaling pathway, Iintegrin family cell surface interactions | TNF receptor signaling pathway, FOXM1 transcription factor network |
9 | Aurora C signaling, C-MYC pathway | Netrin-mediated signaling events, alternative NF-kappaB pathway |
10 | IL8- and CXCR1-mediated signaling events, Integrins in angiogenesis | Insulin-mediated glucose transport, direct p53 effectors |
We further check their cooperation and find that all the pathway pairs are associated in the collected pathway–pathway interaction network. We visualize pathway pair 1 of BRCA data set in Figure 8 as an example. Other examples are provided in the Supplementary file. We observe that the pathway pair 1 interacts with each other to undertake biological activities. These two pathways also share common known driver genes: PIK3CA, PIK3R1 and KRAS. These genes have interactions and may cooperate with each other to drive the target disease. The shared gene NCOA3 does not have evident functional edges with any other gene, but it was predicted to have connections with the driver gene EGFR [59], which also supports the capability of CDPathway in identifying cooperative driver pathways.
![Pathway pair 1 of BRCA data set in the GGI network from String [59]. The rose red lines indicate interactions from curated databases and blue lines indicate interactions have been experimentally determined. Genes in the ’Internalization of ErbB1’ pathway are circled, and genes in ’Trk receptor signaling mediated by PI3K and PLC-gamma’ pathway are enclosed by rectangles. These two pathways share common known driver genes of PIK3CA, PIK3R1 and KRAS.](https://oup.silverchair-cdn.com/oup/backfile/Content_public/Journal/bib/22/2/10.1093_bib_bbz167/2/m_bbz167f8.jpeg?Expires=1747907618&Signature=JNyH~WOgE3-JEjamtcvJjtmh7Hy88WppV0AF~pox9EnXr60rA~lUj4LNfoFUBdioyTat-ppyhIKjclM78q5Pg1HbcI6-MZWabyC1ib03roC4xvI5uOAwkJnMzDrfeXcz43wNOA6H1~rLculSduYH2L5cT1fWRprPbyIYFy2lKJgXBlYE1siPKtyjDIJZ6Kug9kiYtgouESpraXym0zHn6x3lfT0W1gn8-9x7YrLsJOvkpAXsoZQEjGoRfyZ~FlO0EmkYqEBGpW9oP4rg4AOrdi3V3UPs92e4M~uAP8sGf52UIlusjBpHkcxAPHmMn1KzAJHgUu-p7~rrdAGXp~fxlg__&Key-Pair-Id=APKAIE5G5CRDK6RD3PGA)
Pathway pair 1 of BRCA data set in the GGI network from String [59]. The rose red lines indicate interactions from curated databases and blue lines indicate interactions have been experimentally determined. Genes in the ’Internalization of ErbB1’ pathway are circled, and genes in ’Trk receptor signaling mediated by PI3K and PLC-gamma’ pathway are enclosed by rectangles. These two pathways share common known driver genes of PIK3CA, PIK3R1 and KRAS.
There are five known pan-cancer related pathways identified by CDPathway, including Class I PI3K signaling events mediated by Akt, mTOR signaling pathway, TGF-|$\beta $| receptor signaling, canonical Wnt signaling pathway and Notch signaling pathway [61]. Breast cancer studies have shown that the abnormality of internalization of ErbB1, canonical Wnt signaling pathway, alternative NF-kappaB pathway, class I PI3K signaling events mediated by Akt, Notch signaling pathway and C-MYC pathway promote the carcinogenesis [62–67]. For UCEC, Trk receptor signaling mediated by PI3K and PLC-gamma, TGF-beta receptor signaling, FOXM1 transcription factor network and mTOR signaling pathway are reported to have tight associations with the target disease [68–70].
We can also find that identified driver pathway pairs play important roles in the life activities related to the initiation of cancer. For example, in pathway pair 3 (Figure S15 of Supplementary file) on the UCEC data set, ATM pathway provides the crucial link between DNA damage, cell cycle progression and cell death by first sensing double stranded DNA breaks and subsequently phosphorylating and activating other downstream proteins functioning in DNA damage repair, cell cycle arrest and apoptotic pathways [71]. Therefore, the identified pathways’ pairs cooperate with each other and play important roles in biological activities. Itis quite possible that their cooperation drives the occurrence of the target disease.
Comparison with other methods
We compare CDPathway against six related and competitive methods, including two individual driver pathways identification methods Dendrix [13] and BLP [14] and four multiple driver pathways identification methods CoMDP [27], Multi-Dendrix [26], BeCo-WithMEFun [28] and CoDP [29]. We discussed these approaches in the Introduction. We set the parameter ranges of the comparing methods as described in the original papers and select pathways that contain the most known driver genes with statistic significance. Since these methods, except CoDP, do not identify exact driver pathways, we perform a comparison at the gene level, by mapping pathways identified by CDPathway and CoDP to genes using known gene–pathway association networks. On the other hand, we compare CDPathway with these methods at the pathway level by doing a KEGG enrichment test for the driver gene set identified by comparing methods via DAVID [54].
As shown in Table 4, CDPathway outperforms other methods by identifying on average twice more known driver genes. Biologically more genes make less sense, but CDPathway uncovers more driver genes instead of passenger genes than other methods. In fact, CDPathway uncovers how driver pathways cooperatively trigger cancer by including these driver genes to affect important biological activities. Therefore, CDPathway provides driver pathways with diversity. As shown in Table 4, the known driver genes are identified with statistical significance (hypergeometric test |$P$|-value <0.05) and has better statistical significance and F1-Score than other comparing methods, except getting a similar F1-Score with CoMDP on the UCEC data set. Therefore, CDPathway does not identify more known driver genes by only identifying more pathways, it balances the precision and recall of identifying cooperative driver pathways. In contrast, the comparing methods identify limited driver genes. As a result, they also provide limited diversity of driver pathways in different data sets. These results indicate that CDPathway can better capture the heterogeneity of different data sets than comparing methods.
Known driver gene sets of BRCA and of UCEC identified by CDPathway and other comparing methods
Method . | Driver genes of BRCA . | Enrichment . | F1-score . | Driver genes of UCEC . | Enrichment . | F1-score . |
---|---|---|---|---|---|---|
CDPathway | PIK3CA, PIK3R1, KRAS, EGFR, BRAF, AKT1, APC, AXIN1, CTNNB1, MAP3K1, CASP8, RB1, JAK2, PTPN11, JAK1, EP300, CREBBP, SMAD4, GATA2, MET, TP53, CDH1, TNFAIP3, FBXW7 | 6.28E-26 | 0.3912 | MAP3K1, CASP8, PIK3CA, PIK3R1, KRAS, NRAS, EGFR, BRAF, AKT1, ATM, GATA3, JAK2, PTPN11, JAK1, TP53, CTNNB1, APC, CDH1, MET, EP300, CREBBP | 1.52E-23 | 0.3105 |
CoDP | FGFR2, FGFR3, MET, CDKN2A, EP300, TP53, MAP2K1, PIK3CA, PIK3R1, CTNNB1, AKT1, PTPN11, BRAF, CASP8, MAP3K1 | 5.27E-5 | 0.1222 | NRAS, MAP2K1, B2M, HRAS, KRAS, BRAF, JAK2, SOCS1, CTNNB1, KLF4, CDKN2A, MED12, CDH1, AR, EP300, APC, SMARCA4, AKT1, PIK3CA | 2.32E-10 | 0.1935 |
BeCo-WithMEFun | NCOA3 TP53 | 3.82E-03 | 0.1111 | ARID1A TP53 | 2.70E-03 | 0.1176 |
Dendrix | CDH1, TP53 | 2.56E-03 | 0.1250 | PIK3CA, PIK3R1 | 1.83E-03 | 0.1333 |
BLP | TP53, PIK3CA, GATA3 | 5.19E-05 | 0.1764 | TP53, PTEN | 1.83E-03 | 0.1333 |
Multi-Dendrix | CDH1, TP53, GATA3, PIK3CA | 6.85E-07 | 0.2353 | PIK3CA, PIK3R1 | 1.83E-03 | 0.1333 |
CoMDP | MAP3K1, TP53, CDH1, GATA3, MLL3, PIK3CA | 3.81E-11 | 0.3529 | PIK3R1, FBXW7, PIK3CA, PPP2R1A, TP53, CTNNB1, PTEN | 4.97E-14 | 0.3889 |
Method . | Driver genes of BRCA . | Enrichment . | F1-score . | Driver genes of UCEC . | Enrichment . | F1-score . |
---|---|---|---|---|---|---|
CDPathway | PIK3CA, PIK3R1, KRAS, EGFR, BRAF, AKT1, APC, AXIN1, CTNNB1, MAP3K1, CASP8, RB1, JAK2, PTPN11, JAK1, EP300, CREBBP, SMAD4, GATA2, MET, TP53, CDH1, TNFAIP3, FBXW7 | 6.28E-26 | 0.3912 | MAP3K1, CASP8, PIK3CA, PIK3R1, KRAS, NRAS, EGFR, BRAF, AKT1, ATM, GATA3, JAK2, PTPN11, JAK1, TP53, CTNNB1, APC, CDH1, MET, EP300, CREBBP | 1.52E-23 | 0.3105 |
CoDP | FGFR2, FGFR3, MET, CDKN2A, EP300, TP53, MAP2K1, PIK3CA, PIK3R1, CTNNB1, AKT1, PTPN11, BRAF, CASP8, MAP3K1 | 5.27E-5 | 0.1222 | NRAS, MAP2K1, B2M, HRAS, KRAS, BRAF, JAK2, SOCS1, CTNNB1, KLF4, CDKN2A, MED12, CDH1, AR, EP300, APC, SMARCA4, AKT1, PIK3CA | 2.32E-10 | 0.1935 |
BeCo-WithMEFun | NCOA3 TP53 | 3.82E-03 | 0.1111 | ARID1A TP53 | 2.70E-03 | 0.1176 |
Dendrix | CDH1, TP53 | 2.56E-03 | 0.1250 | PIK3CA, PIK3R1 | 1.83E-03 | 0.1333 |
BLP | TP53, PIK3CA, GATA3 | 5.19E-05 | 0.1764 | TP53, PTEN | 1.83E-03 | 0.1333 |
Multi-Dendrix | CDH1, TP53, GATA3, PIK3CA | 6.85E-07 | 0.2353 | PIK3CA, PIK3R1 | 1.83E-03 | 0.1333 |
CoMDP | MAP3K1, TP53, CDH1, GATA3, MLL3, PIK3CA | 3.81E-11 | 0.3529 | PIK3R1, FBXW7, PIK3CA, PPP2R1A, TP53, CTNNB1, PTEN | 4.97E-14 | 0.3889 |
Known driver gene sets of BRCA and of UCEC identified by CDPathway and other comparing methods
Method . | Driver genes of BRCA . | Enrichment . | F1-score . | Driver genes of UCEC . | Enrichment . | F1-score . |
---|---|---|---|---|---|---|
CDPathway | PIK3CA, PIK3R1, KRAS, EGFR, BRAF, AKT1, APC, AXIN1, CTNNB1, MAP3K1, CASP8, RB1, JAK2, PTPN11, JAK1, EP300, CREBBP, SMAD4, GATA2, MET, TP53, CDH1, TNFAIP3, FBXW7 | 6.28E-26 | 0.3912 | MAP3K1, CASP8, PIK3CA, PIK3R1, KRAS, NRAS, EGFR, BRAF, AKT1, ATM, GATA3, JAK2, PTPN11, JAK1, TP53, CTNNB1, APC, CDH1, MET, EP300, CREBBP | 1.52E-23 | 0.3105 |
CoDP | FGFR2, FGFR3, MET, CDKN2A, EP300, TP53, MAP2K1, PIK3CA, PIK3R1, CTNNB1, AKT1, PTPN11, BRAF, CASP8, MAP3K1 | 5.27E-5 | 0.1222 | NRAS, MAP2K1, B2M, HRAS, KRAS, BRAF, JAK2, SOCS1, CTNNB1, KLF4, CDKN2A, MED12, CDH1, AR, EP300, APC, SMARCA4, AKT1, PIK3CA | 2.32E-10 | 0.1935 |
BeCo-WithMEFun | NCOA3 TP53 | 3.82E-03 | 0.1111 | ARID1A TP53 | 2.70E-03 | 0.1176 |
Dendrix | CDH1, TP53 | 2.56E-03 | 0.1250 | PIK3CA, PIK3R1 | 1.83E-03 | 0.1333 |
BLP | TP53, PIK3CA, GATA3 | 5.19E-05 | 0.1764 | TP53, PTEN | 1.83E-03 | 0.1333 |
Multi-Dendrix | CDH1, TP53, GATA3, PIK3CA | 6.85E-07 | 0.2353 | PIK3CA, PIK3R1 | 1.83E-03 | 0.1333 |
CoMDP | MAP3K1, TP53, CDH1, GATA3, MLL3, PIK3CA | 3.81E-11 | 0.3529 | PIK3R1, FBXW7, PIK3CA, PPP2R1A, TP53, CTNNB1, PTEN | 4.97E-14 | 0.3889 |
Method . | Driver genes of BRCA . | Enrichment . | F1-score . | Driver genes of UCEC . | Enrichment . | F1-score . |
---|---|---|---|---|---|---|
CDPathway | PIK3CA, PIK3R1, KRAS, EGFR, BRAF, AKT1, APC, AXIN1, CTNNB1, MAP3K1, CASP8, RB1, JAK2, PTPN11, JAK1, EP300, CREBBP, SMAD4, GATA2, MET, TP53, CDH1, TNFAIP3, FBXW7 | 6.28E-26 | 0.3912 | MAP3K1, CASP8, PIK3CA, PIK3R1, KRAS, NRAS, EGFR, BRAF, AKT1, ATM, GATA3, JAK2, PTPN11, JAK1, TP53, CTNNB1, APC, CDH1, MET, EP300, CREBBP | 1.52E-23 | 0.3105 |
CoDP | FGFR2, FGFR3, MET, CDKN2A, EP300, TP53, MAP2K1, PIK3CA, PIK3R1, CTNNB1, AKT1, PTPN11, BRAF, CASP8, MAP3K1 | 5.27E-5 | 0.1222 | NRAS, MAP2K1, B2M, HRAS, KRAS, BRAF, JAK2, SOCS1, CTNNB1, KLF4, CDKN2A, MED12, CDH1, AR, EP300, APC, SMARCA4, AKT1, PIK3CA | 2.32E-10 | 0.1935 |
BeCo-WithMEFun | NCOA3 TP53 | 3.82E-03 | 0.1111 | ARID1A TP53 | 2.70E-03 | 0.1176 |
Dendrix | CDH1, TP53 | 2.56E-03 | 0.1250 | PIK3CA, PIK3R1 | 1.83E-03 | 0.1333 |
BLP | TP53, PIK3CA, GATA3 | 5.19E-05 | 0.1764 | TP53, PTEN | 1.83E-03 | 0.1333 |
Multi-Dendrix | CDH1, TP53, GATA3, PIK3CA | 6.85E-07 | 0.2353 | PIK3CA, PIK3R1 | 1.83E-03 | 0.1333 |
CoMDP | MAP3K1, TP53, CDH1, GATA3, MLL3, PIK3CA | 3.81E-11 | 0.3529 | PIK3R1, FBXW7, PIK3CA, PPP2R1A, TP53, CTNNB1, PTEN | 4.97E-14 | 0.3889 |
We also compare CDPathway with comparing methods on identifying known driver genes. We fix the number of identified driver genes from 1 to the number of genes in the somatic mutation profile and show the ROC and PR curves in Figure 9. Since BeCo-WithMEFun cannot get a solution when the number of identified genes is larger than 40, its ROC and PR curves cannot be obtained. As shown in Figure 9, CDPathway outperforms the comparing methods with respect to AUROC and AUPRC by at least 15%, since it makes use of somatic mutation data and prior knowledge of GGI network to identify potential driver genes. We observe that multiple driver pathway identification methods outperform individual driver pathway identification methods, since the former methods uncover driver pathway with diversity, while the latter ones only uncover very limited driver genes. In addition, the PR curves of these comparing methods drop sharply with the increased number of identified genes. That is because these individual driver pathway identification methods can only identify one or two gene sets (i.e. pathway), and they cannot provide meaningful results when identifying more than 10 genes within only one or two pathways.

Comparison of ROC and PR curves of identifying known driver genes for BRCA and UCEC data sets.
CDPathway can also uncover novel driver genes, which play important roles in important biological activities. For example, we obtain EGF (epidermal growth factor) in the BRCA data set, and EGF has tight associations with the survival rate of breast cancer patients [72]. The reason why CDPathway outperforms other methods is that CDPathway integrates both mutation and GGI data, which help reflecting the importance of genes and map weights of driver genes to pathways. As a result, its identified cooperative driver pathways contain more known (potential) driver genes. These results suggest that instead of considering individual genes, transferring the driver impacts of genes to pathways through their associations can better explore how pathways cooperatively drive cancer.
As shown in Tables 5 and 3, pathways identified by these methods also take responsibility for disease-related activities. Overall, 40% of the pathways identified by CDPathway and by the comparing methods are overlapped, which support CDPathway’s accuracy in identifying disease-related pathways. For example, for UCEC, the ‘mTOR signaling pathway’ was identified as the driver pathway by Dendrix, Multi-Dendrix, CoMDP and also by CDPathway. The ‘mTOR signaling pathway’ is ubiquitously expressed in cells and is a therapeutic target for the cancer treatment arsenal. Our CDPathway can identify more significant driver pathways by 150% than the individual driver identification methods, and it also can discover more driver pathways with cooperation by 60% than cooperative driver pathway identification methods. Particularly, only 60% of the cooperation among pathways identified by CoDP (most effective method among the comparing methods) is supported by the pathway–pathway interaction network, while that of CDPathway is 100%. This observation indicates that CDPathway is more effective in identifying pathway pairs with both cooperation and functional significance. CDPathway can accurately identify important genes in Step 1 and Step 2, can retain the driver effects of genes on interacting pathways and thus can more accurately identify cooperative driver genes. Because of the driver genes, these identified pathways not only have interaction with each other but also cooperatively carry out disease related activities.
Method . | BRCA . | UCEC . |
---|---|---|
CoDP | Beta1 integrin cell surface interactions, syndecan-1-mediated signaling events; salidated transcriptional targets of AP1 family members Fra1 and Fra2, signaling mediated by p38-alpha and p38-beta; signaling events mediated by VEGFR1 and VEGFR2, ceramide signaling pathway; ddirect p53 effectors, validated nuclear estrogen receptor alpha network; | Downstream signaling in naive CD8 T cells, IL12-mediated signaling events; regulation of RAC1 activity, regulation of RhoA activity; regulation of nuclear beta catenin signaling and target gene transcription, IL4-mediated signaling events; TCR signaling in naive CD4 T cells, validated targets of C-MYC transcriptional repression; |
BeCo-WithMEFun | p53 pathway, class I PI3K signaling events mediated by Akt; | B cell receptor signaling pathway, T cell receptor signaling pathway; |
Dendrix | Thyroid cancer | Trka receptor signaling pathway, mTOR signaling pathway |
BLP | Small cell lung cancer | p53 signaling pathway |
Multi-Dendrix | Thyroid cancer,p53 pathway | mTOR signaling pathway, IL2 signaling events mediated by STAT5; |
CoMDP | Thyroid cancer,p53 pathway | ErbB signaling pathway, mTOR signaling pathway; apoptosis, B cell receptor signaling pathway; |
Method . | BRCA . | UCEC . |
---|---|---|
CoDP | Beta1 integrin cell surface interactions, syndecan-1-mediated signaling events; salidated transcriptional targets of AP1 family members Fra1 and Fra2, signaling mediated by p38-alpha and p38-beta; signaling events mediated by VEGFR1 and VEGFR2, ceramide signaling pathway; ddirect p53 effectors, validated nuclear estrogen receptor alpha network; | Downstream signaling in naive CD8 T cells, IL12-mediated signaling events; regulation of RAC1 activity, regulation of RhoA activity; regulation of nuclear beta catenin signaling and target gene transcription, IL4-mediated signaling events; TCR signaling in naive CD4 T cells, validated targets of C-MYC transcriptional repression; |
BeCo-WithMEFun | p53 pathway, class I PI3K signaling events mediated by Akt; | B cell receptor signaling pathway, T cell receptor signaling pathway; |
Dendrix | Thyroid cancer | Trka receptor signaling pathway, mTOR signaling pathway |
BLP | Small cell lung cancer | p53 signaling pathway |
Multi-Dendrix | Thyroid cancer,p53 pathway | mTOR signaling pathway, IL2 signaling events mediated by STAT5; |
CoMDP | Thyroid cancer,p53 pathway | ErbB signaling pathway, mTOR signaling pathway; apoptosis, B cell receptor signaling pathway; |
Method . | BRCA . | UCEC . |
---|---|---|
CoDP | Beta1 integrin cell surface interactions, syndecan-1-mediated signaling events; salidated transcriptional targets of AP1 family members Fra1 and Fra2, signaling mediated by p38-alpha and p38-beta; signaling events mediated by VEGFR1 and VEGFR2, ceramide signaling pathway; ddirect p53 effectors, validated nuclear estrogen receptor alpha network; | Downstream signaling in naive CD8 T cells, IL12-mediated signaling events; regulation of RAC1 activity, regulation of RhoA activity; regulation of nuclear beta catenin signaling and target gene transcription, IL4-mediated signaling events; TCR signaling in naive CD4 T cells, validated targets of C-MYC transcriptional repression; |
BeCo-WithMEFun | p53 pathway, class I PI3K signaling events mediated by Akt; | B cell receptor signaling pathway, T cell receptor signaling pathway; |
Dendrix | Thyroid cancer | Trka receptor signaling pathway, mTOR signaling pathway |
BLP | Small cell lung cancer | p53 signaling pathway |
Multi-Dendrix | Thyroid cancer,p53 pathway | mTOR signaling pathway, IL2 signaling events mediated by STAT5; |
CoMDP | Thyroid cancer,p53 pathway | ErbB signaling pathway, mTOR signaling pathway; apoptosis, B cell receptor signaling pathway; |
Method . | BRCA . | UCEC . |
---|---|---|
CoDP | Beta1 integrin cell surface interactions, syndecan-1-mediated signaling events; salidated transcriptional targets of AP1 family members Fra1 and Fra2, signaling mediated by p38-alpha and p38-beta; signaling events mediated by VEGFR1 and VEGFR2, ceramide signaling pathway; ddirect p53 effectors, validated nuclear estrogen receptor alpha network; | Downstream signaling in naive CD8 T cells, IL12-mediated signaling events; regulation of RAC1 activity, regulation of RhoA activity; regulation of nuclear beta catenin signaling and target gene transcription, IL4-mediated signaling events; TCR signaling in naive CD4 T cells, validated targets of C-MYC transcriptional repression; |
BeCo-WithMEFun | p53 pathway, class I PI3K signaling events mediated by Akt; | B cell receptor signaling pathway, T cell receptor signaling pathway; |
Dendrix | Thyroid cancer | Trka receptor signaling pathway, mTOR signaling pathway |
BLP | Small cell lung cancer | p53 signaling pathway |
Multi-Dendrix | Thyroid cancer,p53 pathway | mTOR signaling pathway, IL2 signaling events mediated by STAT5; |
CoMDP | Thyroid cancer,p53 pathway | ErbB signaling pathway, mTOR signaling pathway; apoptosis, B cell receptor signaling pathway; |
Discussion
Cooperative driver pathways discovery helps understanding how multiple pathways function cooperatively in cancer initiation and progression and provides more precise therapy to patients. For this discovery, we proposed a two-stage based approach called CDPathway. CDPathway first uses a driver impact quantification function to identify candidate driver genes. Next, CDPathway collaboratively factorizes the adjacency matrices of a heterogeneous network of genes, miRNAs and pathway to fuse multi-relational data. CDPathway then reconstructs the pathway–pathway interaction network and identifies the pathway pairs with the highest reconstruction values as the cooperative driver pathways.
Experimental results on public breast, endometrial and ovarian cancer data sets show that CDPathway not only uncovers known driver genes with better statistical significance than competitive methods but also discovers more novel driver genes. We observe that integrating mutation data and prior functional networks can better differentiate driver genes from passenger genes because CDPathway uncovers genes with high mutation rate, which are hub nodes in the functional networks. The collaborative matrix factorization correctly transfers these driver genes to pathways, and thus the cooperative driver pathways identified by CDPathway contain more known driver genes than those of existing methods. Additionally, less known and less frequently altered genes with well-characterized cancer drivers suggest a mechanism of action. For example, on BRCA data set, FOXA1 and GATA3 co-regulate the expression of genes, which are essential for the luminal mammary epithelial cell development. On the pathway level, CDPathway can identify more potential biological cooperation among pathways. On both BRCA and UCEC data sets, each identified pathway contains at least one known driver gene, and we found the cooperation of Internalization of ErbB1 and Trk receptor signaling mediated by PI3K and PLC-gamma may be important to the occurrence of disease. We can infer that because of the tight associations within and between genes, miRNAs and pathways, transferring the driver impact from genes to pathways through miRNAs helps to uncover cooperative driver pathways.
We plan to extend this work to other biological networks. Given that there are many other types of molecules important to biological activities, such as LncRNAs, genes and drugs. Our method can be further applied on multi-type related biological networks to uncover more disease-related molecules. The challenge of extending our method is to build such relational biological networks and to choose a type of molecules with trigger influence on the disease.
Pathways undertake important biological activities and are robust by providing reliable gene regulatory relationships. Biological pathways have been applied to explore the pathology involved with cancer occurrence, diagnosis and prognosis. In practice, multiple pathways cooperatively drive the occurrence and progression of cancer, but scanty computational models can identify cooperative driver pathways.
We propose a novel model CDPathway to identify cooperative driver pathways. CDPathway combines the mutation data and prior knowledge of genes (i.e. gene–gene interaction network) to capture the influences of genes on the occurrence of cancers. CDPathway further exploits these influences to explore cooperative driver pathways via tightly associated biological networks of genes, miRNAs and pathways.
CDPathway shows a good prognostic performance in breast, endometrial and ovarian cancers. CDPathway can identify driver pathways with significance and diversity. It not only identifies known driver genes already explored by existing methods but also finds out novel driver genes. In addition, CDPathway uncovers the cooperations between driver pathways, which better reflects how biological activities trigger cancer than existing methods.
Funding
Natural Science Foundation of China (61873214 and 61872300); Fundamental Research Funds for the Central Universities (XDJK2020B028 and XDJK2019B024); Natural Science Foundation of CQ CSTC (cstc2018jcyjAX0228); fund from King Abdullah University of Science and Technology (FCC/1/1976-19-01).
Jun Wang is a Professor of the School of Software, Shandong University. Her research interests include cancer data mining, pathway analysis, GWAS and machine learning.
Ziying Yang is a master student of College of Computer and Information Sciences, Southwest University. Her research interests include cancer pathways analysis and biological data fusion.
Carlotta Domeniconi is an associate professor of Department of Computer Science, George Mason University. Her research interests include biological data mining and machine learning.
Xiangliang Zhang is an associate professor of Computational Bioscience Research Center (CBRC), Computer Science, Electrical and Mathematical Science and Engineering Division, King Abdullah University of Science and Technology, SA. Her research interests include data mining and machine learning.
Guoxian Yu is a Professor of the School of Software, Shandong University and Computational Bioscience Research Center (CBRC), Computer, Electrical and Mathematical Sciences and Engineering (CEMSE) Division, King Abdullah University of Science and Technology (KAUST), SA. His research interests include gene function prediction, ontology analysis and data mining.
References