Explainable Multilayer Graph Neural Network for cancer gene prediction

Abstract Motivation The identification of cancer genes is a critical yet challenging problem in cancer genomics research. Existing computational methods, including deep graph neural networks, fail to exploit the multilayered gene–gene interactions or provide limited explanations for their predictions. These methods are restricted to a single biological network, which cannot capture the full complexity of tumorigenesis. Models trained on different biological networks often yield different and even opposite cancer gene predictions, hindering their trustworthy adaptation. Here, we introduce an Explainable Multilayer Graph Neural Network (EMGNN) approach to identify cancer genes by leveraging multiple gene–gene interaction networks and pan-cancer multi-omics data. Unlike conventional graph learning on a single biological network, EMGNN uses a multilayered graph neural network to learn from multiple biological networks for accurate cancer gene prediction. Results Our method consistently outperforms all existing methods, with an average 7.15% improvement in area under the precision–recall curve over the current state-of-the-art method. Importantly, EMGNN integrated multiple graphs to prioritize newly predicted cancer genes with conflicting predictions from single biological networks. For each prediction, EMGNN provided valuable biological insights via both model-level feature importance explanations and molecular-level gene set enrichment analysis. Overall, EMGNN offers a powerful new paradigm of graph learning through modeling the multilayered topological gene relationships and provides a valuable tool for cancer genomics research. Availability and implementation Our code is publicly available at https://github.com/zhanglab-aim/EMGNN.


Introduction
Understanding the precise function and disease pathogenicity of a gene is dependent on the target gene's properties, as well as its interaction partners in a disease-specific context [36,47,12].High-throughput experiments, such as whole-genome sequencing and RNA sequencing of bulk and single-cell assays, have enabled unbiased profiling of genetic and molecular properties for all genes across the genome.Experimental methods to probe both physical [30,3] and genetic interactions [27,8] provide valuable insights of the functional relevance between a pair of genes.Based on these data, computational methods have been developed to predict gene functions for understudied and uncharacterized genes by combining the gene's property with its network connectivity patterns [2,17,29].However, the prediction of gene pathogenicity in disease-specific contexts is challenging.Functional assays describing the gene and its gene network are relevant to disease only to the degree to which the measured property correlates with disease physiology [28]; while our understanding of complex disease physiology is poor, even for diseases with large sample size and data modalities, such as cancer [24].
As the completeness of known cancer genes is questioned, predicting novel cancer genes remains a crucial task in cancer genomics research.These genes, which are often mutated or aberrant expressed in cancer cells, play a key arXiv:2301.08831v2[cs.LG] 3 May 2023 role in the development and progression of the disease [39].Large-scale cancer sequencing consortia projects have generated genomic and molecular profiling data for a variety of cancer types, providing an information-rich resource for identifying novel cancer genes.Building on the hypothesis that pan-cancer multi-omic modalities provide critical information to cancer gene pathogenicity, a pioneering work EMOGI [35] innovatively modeled the multi-omics features of cancer genes in Protein-Protein interaction (PPI) networks to predict novel cancer genes.To address the challenge of functional properties irrelevant to cancer disease physiology, EMOGI featurized each gene by a vector summarizing multi-omics data levels across various cancer types in The Cancer Genome Atlas (TCGA) [43].EMOGI then modeled the gene-gene interactions from pre-defined generic PPI networks using a Graph Convolution Neural network (GCN).When trained on a set of high-confidence cancer-and non-cancer genes, EMOGI identified 165 newly predicted cancer genes without necessarily recurrent alterations, but interact with known cancer genes.
A major limitation of EMOGI is that it didn't address the disease physiology relevance in the pre-defined graph topology and connectivity patterns.EMOGI employed six different pre-defined graphs, including genetic-focused networks such as Multinet [19], and generic protein interaction networks such as STRING-db [40].Among EMOGI models trained on different PPI networks, we found an average standard deviation of 25.2% in unlabelled cancer gene predictions, demonstrating the newly predicted cancer genes were different when using different PPI networks.Thus, a trustworthy adaptation of the EMOGI method's output is challenging because conflicting prediction results are ubiquitous.Furthermore, as cancer disease physiology is complex, using a single predefined graph to represent the gene-gene relationships cannot fully capture its molecular landscape; therefore, more sophisticated, data-driven methods are needed to decipher the gene relationships in disease-specific contexts.
Here, we propose a novel graph learning framework, EMGNN (Explainable Multilayer Graph Neural Network), for predicting gene pathogenicity based on multiple input biological graphs.EMGNN maximizes the concordance of functional gene relationships with the unknown disease physiology by jointly modeling the multilayered graph structure.We evaluated the performance of EMGNN in predicting cancer genes using the same compiled datasets as EMOGI and showed that our proposed method achieves state-of-the-art performance by combining information from all six PPI networks.Furthermore, we explained EMGNN's prediction by both model-level integrated gradients and molecularlevel gene pathways.By examining newly predicted cancer genes identified by EMGNN, we demonstrated biological insights can be revealed by leveraging the complementary information in different types of biological networks.Overall, EMGNN provides a powerful new paradigm of graph learning through modeling the multilayered topological gene relationships.Our key contributions can be summarized as follows: • We develop an Explainable Multilayer Graph Neural Network (EMGNN) approach to identify cancer genes by leveraging multiple protein-protein interaction networks and multi-omics data.
• Our method demonstrates superior performance compared to existing approaches as quantified by a significant increase in the AUPRC across six PPI networks.The average improvement in performance is 7.15% over the current state-of-the-art method, EMOGI.
• We identify the most important multi-omics features for the prediction of each cancer gene, as well as the most influential PPI networks, using model interpretation strategies.
• EMGNN identifies newly predicted cancer genes by integrating multiple PPI networks, providing a unified and robust prediction for novel cancer genes discovery.Our code is publicly available on GitHub1 2 Results

Overview of EMGNN model
To effectively model multilayered graph structures for cancer gene prediction, we developed a graph neural network (GNN) model EMGNN that jointly learned from multiple biological networks as inputs.The extension of GNNs to handle multiple networks is a non-trivial task, as they are designed to operate on a single graph.The input for EMGNN is multiple graphs where each graph describes gene-gene relationships by an adjacency matrix, and a feature matrix for genes that is shared by multiple graphs.The output is a binary node classification of putative cancer vs non-cancer gene.EMGNN achieves multilayered graph modeling in a three-step approach (Figure 1).In the first step, EMGNN updates the node representation within each graph layer (i.e., a single biological network) by a shared message passing operator.
As different layer-wise graphs have distinct connectivity patterns, node representations will be updated differently in each layer.The shared message passing operator allows the EMGNN model to incorporate new graphs as inputs, while keeping the model's trainable parameters fixed.

Input Graphs Meta Graph Construction Shared Graph Neural Network
Meta Graph Neural Network Classification Layer EMGNN then integrates the layer-wise node representations of the same genes across multiple biological networks by a meta graph (Methods).For each gene, the meta-graph consists of a meta-node for this gene, which receives messages across individual networks from the given gene's representations.This message passing step is modeled with a second GNN, referred to as the Meta GNN (Figure 1).Meta GNN enables directed message-passing to combine and exchange information from different networks to the meta nodes, which will contain the final representations of the genes.In the last step, a multi-layer perceptron (MLP) takes the meta node representations as input, and performs the node classification task.
Notably, our EMGNN model is a generalized, multilayered form of a single graph GNN.In the special cases where multilayered graphs have identical adjacency matrices, or only one graph is provided as inputs, EMGNN reduces to a standard single-graph GNN, where the shared message passing operators are standard GNN operators, and meta GNN reduces to an identity operation.Thus, EMGNN generalizes single-graph GNN by jointly learning from the complementary information stored in multiple graphs.We applied EMGNN to predict cancer genes due to the wealth of multi-omic profiling data available for cancer, yet the underlying tumorigenesis is highly complex and cannot be fully captured by a single biological network.To ensure fair comparisons, we used a compiled dataset and kept the identical training/test split from a previous report [35] (Methods).In total, this dataset consisted of 887 labeled cancer genes, 7753 non-cancer genes and 14019 unlabeled genes, along with six PPI networks with multi-omics features.The detailed numbers of labeled positive and negative genes for training the EMGNN model in each PPI network can be found in Supplementary Table 1.
The integration of multiple graphs leads to substantial improvements in the performance of EMGNN.We trained EMGNN models and evaluated the testing performance with respect to different numbers of PPI networks.While we added other networks to the training and validation set, we held-out the same testing set from the combined training set and kept the test set identical to previous works [15,35].An illustration of the process of adding a new biological network and a definition of training and validation splits are shown in Supplementary Figure 1.As shown in Figure 2, the testing performance increased for each of the six testing datasets as the number of input networks increased.For Multinet, STRING-db, IRef, IRef(2015) testsets, the incorporation of more graphs steadily increased the performance without reaching a plateau.For PCNet testset, EMGNN achieved the best testing performance by combining four PPI networks.Because PCNet is the most densely connected network, this behavior is consistent with previously reported benchmarking results, which suggests that the performance scale with the network size [16].
EMGNN trained by incorporating all six graphs achieved state-of-the-art performance for all six test sets (Table 1).As each test set was an independent set of held-out labeled cancer and non-cancer genes from each network and we kept the set identical to previous reports (Methods), the testing performance will inform generalization error of the trained EMGNNs and provide fair comparisons to previous results.EMGNN outperformed the current state-of-the-art method EMOGI by an average margin of 7.15% AUPRC across all test sets, with the largest gain of 11.1% in performance observed in the old version Iref.The smallest gain was observed in PCNet, likely because PCNet is already an expert-assembled graph combining the information from the other five graphs [16].Nevertheless, for PCNet test set, EMGNN combining six graphs is significantly more accurate than EMOGI using PCNet (p-value=0.012,t-test).Therefore, incorporating information from multiple networks leads to enhanced predictive power for gene pathogenicity prediction.

Evaluating the performance of different GNN architectures and graph ablations.
To understand each model component's contribution to EMGNN's superior performance and model robustness, we next performed an ablation study using different GNN architectures and input perturbations.For GNN architectures, we compared Graph Convolutional Network (GCN) [21], and Graph Attention Network (GAT) [41].After identifying the most effective GNN architecture, we proceeded to evaluate its performance under varied input conditions, including the incorporation of random and constant features, and the random removal of 20% and 40% of edges within the input graphs.
As shown in Table 1, the GNN architecture played an essential role in EMGNN testing performance.We observed that GCN is the best-performing GNN architecture in all test datasets.Our findings demonstrated that the choice of GNN architecture has a significant impact on the performance of our model.Thus, EMGNN refers to EMGNN(GCN) throughout the paper unless otherwise stated.
Biologically, the EMGNN node features and edges are determined using high-throughput assays that inherently have measurement errors.To this end, we sought to answer the two following questions: How robust is EMGNN to node feature and graph structure perturbations?Are the node features and the graph structure both crucial for the prediction of cancer genes?We examined the performance of EMGNN under different types of input perturbations (Table 2), including removal of the multi-omics node features and edges, and adding uninformative vectors (random or constant).Each line represents a test set of positive and negative labeled genes held out in a specific PPI network.The addition of PPI networks was conducted using a random sampling approach, where three combinations of PPI networks were sampled randomly at each point.Note that the testing nodes remain the same as more networks are added.We observe that the performance increased for the majority of the test datasets, as the number of input networks increases.
We found that EMGNN decreased in performance for both random and all-one node features, suggesting that the node features derived from TCGA consortia were informative and highly relevant to cancer pathophysiology and cancer gene pathogenicity.For edge ablations, we randomly removed 20% and 40% edges in each PPI network.The removal of edges slightly decreased EMGNN performance.This is expected, because EMGNN integrates six PPI networks and demonstrates its robustness towards connectivity pattern by jointly modeling a multilayered graph topology.Overall, EMGNN effectively leveraged both node features and edges to achieve accurate predictions.

Explaining EMGNN reveals biological insights of cancer gene pathogenicity
Explainable and trustworthy models are essential for understanding the biological mechanisms of known cancer genes and facilitating the discovery of novel cancer genes.Given EMGNN's accurate and robust predictive performance, we developed Integrated Gradients methods [22] to explain the node and edge attributions of EMGNN (Methods).
A unique advantage of EMGNN is its integration of multiple PPI networks; therefore, we focused our analysis on the relative contributions from each PPI network to the known cancer gene predictions (Figure 3A).Each PPI network's importance was measured by the corresponding meta edge importance (Methods).We examined the relative contributions for two well-known cancer genes (TP53 and BRCA1, predicted cancer gene probability ŷ = 0.99, 0.98 respectively) and two newly predicted cancer genes (COL5A1 and MSLN, predicted cancer gene probability ŷ = 0.98, 0.90 respectively; see Figure 3A).Notably, different genes were predicted as cancer genes leveraging evidence from different PPI networks.For example, Multinet contributed to BRCA1, but not for MSLN.The newly predicted cancer gene COL5A1 combined information from all six PPI networks.To systematically examine if some PPI networks were statistically more important contributors than others, we performed an ANOVA test across all meta edge importance for cancer genes (Figure 3B).We found that different PPI networks made significantly different contributions to cancer gene prediction (P-value=1.2e− 65).This suggests that certain PPI networks were more informative than others, likely due to their unique connectivity patterns that were more reflective of cancer development and progression.Among pairwise comparisons, the updated version of Iref(2015) achieved a significantly higher contribution than the original Iref (P-value=1.3e− 50, t-test), which consolidated our observation that the incorporation of other PPI networks substantially improved upon the model using only Iref (see Iref performance over input PPI networks in Figure 2).Thus, EMGNN successfully learned complementary information from the connectivity patterns in each layer of the multilayer graph, as shown by the overall contributions from individual graphs and gene-specific variations.COL5A1 and MSLN are newly predicted cancer genes.B) Overall distribution of meta-edge feature importance for all known cancer genes across six PPI networks.Meta-edge feature importance was normalized to 1 (see Methods for details).C) An hypothetical illustration of PPI network cancer neighborhood implicated in the variation of meta-edge importance.D) Empirical analysis demonstrates a higher correlation between meta-edge importance and cancer neighborhood for genes with a large meta-edge variance.
We hypothesized that the variation of meta-edge importance was a result of different cancer gene neighborhood in different PPI networks (Figure 3C).For a target gene, we define "cancer neighbors" as the number of neighboring genes that are known cancer genes, which is then normalized by the degree of the target node in the given network.Hypothetically, for gene A whose neighbors were all cancer genes across PPI networks, the meta-edge importances were also comparable.In contrast, for gene B whose cancer gene neighbors varied across PPI networks, we should observe a positive covariance between cancer gene neighbors and meta-edge importance.To test our hypothesis, we computed the standard deviations of meta-edge importance for each labeled cancer gene, as well as the Pearson correlation between meta-edge importance and the cancer gene neighbors across PPI networks.For genes with large standard deviation of meta-edge importance, the important PPI networks tend to have a higher cancer neighbors, as demonstrated by a higher correlation between meda-edge importance and cancer neighbors (Figure 3D).Due to the complex graph convolutions that enabled message-passing beyond one-hop neighbors, our cancer neighbors may not fully capture the variations of PPI network importance.Overall, our empirical analysis demonstrates that cancer neighborhood is implicated in genes with divergent connectivity patterns across PPI networks.
On the individual gene level, a detailed explanation will reveal gene-specific genetic and molecular aberrations that the EMGNN model relies on for cancer gene prediction.As EMGNN's node features were derived by multi-omics pan-cancer datasets, we next assessed if certain types of omic data were informative to cancer gene predictions.Our model explanation results of node features indicated that Single Nucleotide Variation (SNV) features were found to be significantly less informative than other types of features, which is consistent with previous reports that CNAs were more detrimental to cancer progression than SNVs [9].In contrast, we observed that DNA methylation was significantly more important for known cancer gene prediction than other omics data (P-value<0.01for all three pairwise t-test of other omics against DNA methylation).We further examined the node feature importance of the same four genes (TP53, BRCA1, COL5A1, MSLN).As expected, the omics feature contributions varied on different genes and were highly gene specific, demonstrating the heterogeneity of tumorigenesis.Point mutations were major contributors to the prediction of TP53 as a cancer gene, which is consistent with findings from previous studies [1,14].Moreover, the prediction of BRCA1 correctly identified gene expression and copy number aberrations as the most significant features [5].DNA methylation had a moderate contribution for the two newly predicted cancer genes, COL5A1 and MSLN (Figure 4B).As DNA methylations are reversible epigenetic modifications, this may suggest potential novel therapeutic targets for certain cancer genes mediated by DNA methylation [37].

EMGNN identifies newly predicted cancer genes by integrating multilayer graphs
A key utility of EMGNN, given its superior performance and explainability, is to identify newly predicted cancer genes that share similar topological patterns to known cancer genes, but may have been missed by a conventional recurrent alteration analysis [35,38].We applied the trained EMGNN model to predict cancer genes on a total of n = 14019 unlabeled genes.
By integrating multilayer graphs, EMGNN addressed the divergent and inconsistent cancer gene predictions from previous models trained using a single PPI network.Prior to EMGNN, models trained using single PPI networks such as EMOGI had made conflicting predictions on which genes were cancerous.Indeed, we observed substantial variations among the predictions of EMOGI models trained on individual PPI networks, with an average of 29% and 63% difference between the highest and lowest predictions for the top-100 and all unlabelled nodes, respectively.Furthermore, we found an average standard deviation of 25.2% in unlabelled cancer gene predictions of EMOGI, demonstrating the predicted novel cancer genes were different when using different PPI networks.In contrast, EMGNN resolved these discrepancies using a more accurate and robust representation of the data from multilayer graphs.
Our analysis identified 435 genes with a high probability of being newly predicted cancer genes, with an 88% predicted cancer gene probability threshold.This threshold was selected based on its ability to provide a precision greater than 95% in the labeled set, indicating a high degree of accuracy in identifying true cancer genes.The identification of these novel cancer genes may provide new insights into the molecular mechanisms of cancer and offer potential targets for the development of novel therapies, demonstrating that machine learning predictions can augment the completeness of cancer gene catalogs [35].The complete list of gene predictions from EMGNN can be found in the project repository on GitHub.
As a case study, we analyzed the predictions of a newly predicted cancer gene, COL5A1 (Figure 5).For this gene, EMOGI model trained on STRINGdb predicted a non-cancer gene with high confidence (ŷ = 0.03), EMOGI models trained on IRefIndex, CPDB and PCNet predicted a cancer gene with high confidence (ŷ > 0.98), while the models trained on Multinet and IRefIndex2015 predicted a cancer gene with moderate likelihood (ŷ = 0.775 and 0.897, respectively).The fact that STRINGdb was the best-performing PPI network among models trained on individual PPI networks further complicated the decision making whether COL5A1 should be considered as cancer/non-cancer gene  (Table 1).This level of divergence in predictions hinders a trustworthy adaptation of model predictions in clinical and pragmatic settings.In contrast, EMGNN integrated the information from each individual PPI network in a data-driven approach and provided more accurate, unified predictions of cancer genes (Table 1).EMGNN predicted COL5A1 as a cancer gene with high confidence (Figure 5A).Importantly, we also found all individual PPI networks were contributing similarly to the final EMGNN prediction (Figure 3A), and revealed the multi-omics features implicated in this prediction (Figure 4B).
Leveraging the explanation results of node contributions for COL5A1 across its neighboring genes, we further illustrated the potential biological mechanisms of COL5A1 using a gene set enrichment analysis (Methods).These neighboring genes formed a ranked gene list based on their explained EMGNN contributions to the prediction of COL5A1 as a cancer gene or not.We discovered that three cancer hallmark gene sets, i.e. apical junction, coagulation, and complement system (part of the innate immune system), were significantly enriched in COL5A1 neighboring genes (Figure 5B).For example, the apical junction cancer hallmark geneset contained genes annotated to function in cell-cell adhesion among epithelial cells, many of those were enriched in the top contributors (Figure 5C).This was further supported by other studies, where COL5AI was associated with skin cancer, the type of cancer with a strong epithelial cell origin [25,45,13].Therefore, we demonstrated how molecular mechanisms of newly predicted cancer genes could be interpreted and discovered using the explainable EMGNN framework.

Discussion
The biomedical and biological domain contains a wealth of information, which is often represented and analyzed using graph structures to reveal relationships and patterns in complex data sets.Various gene interaction and proteinprotein interaction networks describe the functional relationships of genes and proteins, respectively.The gene-gene relationships were often described in generic cellular contexts and/or by integrating different, heterogeneous sources of information.Therefore, a single graph often struggles to best match disease-specific conditions, and different graph construction and integration methods render distinct predictive powers [4,6].Substantial efforts have been devoted to develop integrated [4,6] and tissue-specific graphs [46].Here, we took a complementary approach and developed a new graph learning framework, EMGNN, to jointly model multilayered graphs.Applying EMGNN to predict cancer genes demonstrated its superior performance over previous graph neural networks trained on single graphs.We also employed model explanation techniques to assess both node and edge feature importance.Our results showed that EMGNN leveraged the complementary information from different graph layers and omics features to predict cancer genes.Importantly, we found that cancer genes that have conflicting predictions based on different single graphs, or are missed by previous state-of-the-art predictors, can be recovered effectively using EMGNN.This demonstrates the robustness of EMGNN predictions by joint modeling the multilayered graphs.
The EMGNN model can be viewed as a data-driven, gradient-enabled integration method for multiple graphs.By providing multiple PPI networks as input, EMGNN learns from the different connectivity patterns that represent complementary information to predict cancer genes.Since all PPI networks share the same type of nodes and edges (though not necessarily the same set of nodes in each network), EMGNN currently integrates homogeneous, undirected graphs; however, the EMGNN framework can be extended to various types of graphs and to perform cross-data modality integration.In biology and biomedicine, hierarchical graphs and heterogeneous graphs are particularly prevalent, such as Gene Ontology [7].For example, biomedical data is often organized in hierarchical levels, starting with genes and molecules, moving on to cells and tissues, and finally reaching the level of individual patients and populations.Therefore, an interesting future direction is to apply EMGNN to model multiple graphs with more heterogeneous node and edge types, and with more complex inter-graph structures.
DNA methylation and gene expression aberrations are major contributors to EMGNN's cancer gene predictions, which are considered important features when explaining its omics node features.Unlike single nucleotide variation and copy number alteration that introduced permanent mutations to DNAs, epigenetic and transcriptomic alterations of cancer genes are potentially reversible by targeted therapies.Model explanations for EMGNN revealed these molecular aberrations that may be leveraged for screening and re-purposing of drugs, especially for previously less well-characterized, newly predicted cancer genes.This highlights the importance of model explanations to gain biological and biomedical insights when developing deep learning models to predict gene pathogenicity.
In summary, we present a novel deep learning approach for the prediction of cancer genes by integrating multiple gene-gene interaction networks.By applying graph neural networks to each individual network and then combining the representations of the same genes across networks through a meta-graph, our model is able to effectively integrate information from multiple sources.We demonstrate the effectiveness of our approach through experiments on benchmark datasets, achieving state-of-the-art performance.Furthermore, the ability to interpret the model's decision-making process through the use of integrated gradients allows for a better understanding of the contribution of different multi-omic features and PPI networks.Overall, our approach presents a promising avenue for the prediction of novel cancer genes.

Datasets
We use the datasets and train/test splits compiled by Schulte-Sasse et al. [35] to ensure a fair comparison.Specifically, we trained our proposed model with six PPI Networks: CPDB [18], Multinet [19], PCNet [16], STRING-db [40], Iref [31] and its newest version Iref(2015).The preprocessing of these six PPI networks was done in Schulte-Sasse et al. [35].We provide a brief description of the preprocessing steps here for clarity and self-containment.Depending on the source of the data, different confidence thresholds were applied to filter out low-confidence interactions.Interactions with a score higher than 0.5 in CPDB and 0.85 in STRING-db were included.For Multinet and IRefIndex(2015), the data was obtained from the Hotnet2 github repository [33].In the case of the recent version of IRefIndex, analysis was restricted to binary interactions between two human proteins.No further processing was performed on the PCNet.
As node features, we used single-nucleotide variants (MF), copy number aberrations (CNA), DNA methylation (METH) and gene expression (GE) data of 29, 446 samples from TCGA [43], from 16 different cancer types.The SNV frequency was calculated for each gene in each cancer type by dividing the number of non-silent SNVs in that gene by its exonic gene length.Gene-associated CNAs were collected from TCGA, where the copy number rate for each gene was defined as the number of times it was amplified or deleted in a specific cohort.DNA methylation for each gene in a cancer type was computed as the difference in methylation signal between tumor and matched normal samples.The expression level of each gene in each sample was quantified using RNA-seq data [42], and differential expression was computed as a log 2 fold change between cancer and matched normal samples, averaged across samples.
The positive examples included expert-curated cancer genes, high-confidence cancer genes mined from PubMed abstracts, and genes with altered expression and promoter methylation in at least one cancer type [32,39].The negative examples were compiled by removing genes not associated with cancer from a set of all genes and were selected based on their absence in the positive set and various cancer databases [32,26].For further information regarding the data collection and processing methods, refer to the study by Schulte-Sasse et al. [35].

Multilayer Graph Neural Network
GNNs.Let a graph be denoted by G = (V, E), where V = {v 1 , . . ., v N } is the set of vertices and E is the set of edges.Let A ∈ R N ×N denote the adjacency matrix, X = [x 1 , . . ., x N ] T ∈ R N ×d I denote the node features where d I is the features dimensions and Y = [y 1 , . . ., y N ] T ∈ N N denote the label vector.Graph neural networks have been successfully applied to many graph-structured problems [34,11], as they can effectively leverage both the network structure and node features.They typically employ a message-passing scheme, which constitutes of the two following steps.In the first step, every node aggregates the representations of its neighbors using a permutation-invariant function.
In the second step, each node updates its own representation by combining the aggregated message from the neighbors with its own previous representation, where h u represents the hidden representation of node u at the l th layer of the GNN architecture.Many choices for the Aggregate and Combine functions have been proposed in the recent years, as they have a huge impact in the representation power of the model [44].Among the most popular architectures, are Graph Convolution Networks (GCNs) [21], and Graph Attention Networks (GAT) [41].In GCN, each node aggregates the feature vectors of its neighbors with fixed weights inversely proportional to the central and neighbors' node degrees, In GAT, each node aggregates the messages from its neighbor using learnable weighted scores: where the attention coefficients α u,v are computed as Multilayer Graph Construction.Extending graph neural networks to handle multiple networks is not a trivial task, as they are designed to operate on a single graph.Next, we describe our method, which can accurately learn node representations using graph neural networks, from multilayer graphs.
Let N be the total number of genes, each associated with a feature vector x j ∈ R d I .Let also K be the number of gene-gene interaction networks.We represent each network G (i) with an adjacency matrix A (i) ∈ Z Ni×Ni and feature matrix X (i) ∈ R Ni×d I , where N i is the number of genes in the i-th network.Since some genes are not presented in all the graphs, the following equation holds In the first step, for each graph G i we apply a graph neural network f 1 that performs message-passing and updates the node representation matrix H (i) = f 1 (X (i) , A (i) ) of each graph i ∈ {0, 1, . . ., K − 1}.We set f 1 to be shared across all graphs.This design allows us to handle a variable number of graphs while keeping the number of trainable parameters fixed.
Next, to aggregate and share information between each graph, we construct a meta graph G meta,j for each gene/node j, where the same genes j across all graphs are connected to a meta node v j .We initialize the features of the meta node v j with the initial features of the corresponding gene j.We apply a second GNN f 2 to update the representation of the meta node v j , H meta,j = f 2 (X meta,j , A meta,j ), where X meta,j contains the features of gene j from all the networks and A meta,j is the adjacency matrix of the meta graph G meta,j , j ∈ {0, 1, . . ., N }.We set f 2 to be shared across all genes.Therefore, in this stage, the model combines and exchanges information between the different networks.Finally, a multi-layer perceptron f 3 predicts the class of the meta node j, ŷj = f 3 (H meta,j ).An illustration of the proposed model can be found in Figure 1.
Experimental Details.To ensure a fair comparison with previous work, we utilized the same experimental setup as in Schulte et al. [35].In particular, we divided the data for each testing graph into a 75% training set and a 25% testing set using stratified sampling to maintain equal proportions of known cancer and non-cancer genes in both sets.Since our model takes multiple graphs as input for each experiment, we retained the test nodes of one graph as the test set, and we allocated 90% of the remaining nodes from the other graphs to the training set and 10% to the validation set.When adding other PPI networks to the training and validation set, we held-out the same testing set from the combined training set and kept the test set identical to previous works [15,35].An illustration of the process of adding a new biological network and a definition of training and validation splits are shown in Supplementary Figure 1.The model was trained for 2000 epochs, using the cross-entropy loss function, and the ADAM optimizer [20] with a learning rate of 0.001.The initial GNN had three layers with a hidden dimension of 64, while the meta-GNN had a single layer with a hidden dimension of 64.The Supplementary Table 1 provides information on the PPI networks, including statistics, as well as the number of positive and negative genes used for training.

Model interpretation
We used the integrated gradient (IG) module in Captum, to assign an importance score to each input feature.Captum is a tool for understanding and interpreting the decision-making process of machine learning models [22].It offers a range of interpretability methods that allow users to analyze the predictions made by their models and understand how different input variables contribute to these predictions.IG interprets the decisions of neural networks by estimating the contribution of each input feature to the final prediction.The integrated gradient approximates the integral of gradients of the model's output with respect to the inputs along a straight line path from a specific baseline input to the current input.The baseline input is typically chosen to be a neutral or a meaningless input, such as an allzero vector or a random noise.Formally, let F (x) be the function of a neural network, where x is the input and x is the baseline input.The integrated gradients for input x and baseline x0 along the i-th dimension is defined as: dα, where the integral is taken along the straight line path from x to x and ∂F (x)/∂x i is the gradient of F (x) along the i-th dimension.
However, the traditional integrated gradient method, which is designed for single input models, is not directly applicable to graph neural networks as they have two distinct inputs, namely node features and network connectivity.This necessitates the development of a modified approach for computing integrated gradients in graph neural networks that considers both inputs.To this end, we propose a decomposition of the problem into two parts: identifying the most important node features and identifying the most crucial edges in the network separately.Since we predict the class of each gene by combining all the graphs, from the meta-node representations, we apply the interpretation analysis only to the meta-nodes.
Node feature interpretation analysis.We analyze the contribution of node features to the predictions of the GNN by using the traditional integrated gradient method while keeping the edges in the network fixed.Specifically, we interpolate between the current node features input and a baseline input where the node features are zero: Attribution xi = (x i − xi ) ∂F (x+α(x−x,A) ∂xi dα, where A are the adjacency matrices of the graphs.Since the prediction for each gene is also based on the features of surrounding genes in the graphs, we extract attribution values for the k-hop neighbor genes as well, where k is equal to the number of message-passing layers in the first GNN.Therefore, the output of the attribution method for each node u, is a matrix K (u) ∈ R N ×d .Each entry K ij of the matrix, corresponds to the attribution of the feature j of node i to the target node u.From this matrix, we select the row that corresponds to the feature attributions of the corresponding meta node.
Edge feature interpretation analysis.To analyze the contribution of edges in the meta-graph to the predictions of the GNN, we use the integrated gradient method for the edges while keeping the node features fixed.Specifically, we interpolate between the current edge input and a baseline input where the weights of the edges are zero: Attribution ei = 1 α=0 ∂F (X,Aα) ∂we i dα, where A α corresponds to the graphs with the edge weights equal to α.We further normalize the attribution values of each meta node by dividing them by their maximum value, resulting in a range of [0, 1] for each edge.This explanation technique allows us to understand which edges in the meta-graph are crucial for the model's decision-making process, and therefore which input PPI networks are important for each gene prediction.

Newly predicted cancer gene discovery
We applied the trained EMGNN model that combined all six individual PPI networks to predict novel cancer genes in the n = 14019 unlabeled genes.We ranked these genes by their predicted cancer gene probability for potential new predicted cancer genes (NPCG) in this study.For each unlabeled gene, we also applied EMOGI models trained on individual PPI networks to predict the probability of it being a cancer gene.The complete list of the predicted probabilities for all the unlabeled genes, can be found in the project repository on GitHub.

Gene set enrichment analysis
To understand the biological mechanisms of EMGNN's cancer gene prediction, we employed gene set enrichment analysis (GSEA) to analyze the functional enrichment of important gene features in curated cancer pathway annotations.Specifically, to determine the importance of neighboring gene nodes, we aggregated the maximum feature importance of each node using Captum's feature explanation results.Genes with zero importance were excluded in this analysis as they did not contribute to the prediction of this target gene.We then ranked the neighboring gene nodes based on their importance and used this ranked gene list as input for GSEA.The enrichment p-value and multiple testing corrected FDR were computed by GSEA python package [10] against cancer hallmark gene sets [23].

Figure 2 :
Figure2: Test AUPRC and standard deviation values of EMGNN(GCN) with respect to the number of input PPI networks.Each line represents a test set of positive and negative labeled genes held out in a specific PPI network.The addition of PPI networks was conducted using a random sampling approach, where three combinations of PPI networks were sampled randomly at each point.Note that the testing nodes remain the same as more networks are added.We observe that the performance increased for the majority of the test datasets, as the number of input networks increases.

Figure 3 :
Figure3: Explanation of each PPI network's contribution to cancer gene predictions.A) Representative PPI network contributions in known cancer genes and newly predicted cancer genes.TP53 and BRCA1 are known cancer genes; COL5A1 and MSLN are newly predicted cancer genes.B) Overall distribution of meta-edge feature importance for all known cancer genes across six PPI networks.Meta-edge feature importance was normalized to 1 (see Methods for details).C) An hypothetical illustration of PPI network cancer neighborhood implicated in the variation of meta-edge importance.D) Empirical analysis demonstrates a higher correlation between meta-edge importance and cancer neighborhood for genes with a large meta-edge variance.

Figure 4 :
Figure4: Explanations of multi-omic node features importance in cancer gene predictions.A) Overall distribution of node feature importance grouped by omic feature types, including single-nucleotide variants (MF), DNA methylation (METH), gene expression (GE) and copy number aberrations (CNA), for known cancer genes.B) Detailed node feature importance for the four genes analyzed in Figure3B.X-axis labels were color-coded to match the omic feature types in panel A. Individual tumor types were coded according to TCGA study abbreviations[43].

Figure 5 :
Figure 5: EMGNN predicts COL5A1 as a novel cancer gene and reveals biological insights.A) A comparison of predicted cancer gene probability from EMGNN and EMOGI models trained on single PPI networks.As a probability of 50% equaled random guessing between cancer vs non-cancer gene, the bar heights reflected the prediction confidence.B) Three cancer hallmark genesets were significantly enriched in the important neighboring genes of COL5A1 as revealed by interpreting EMGNN model.C) Enrichment of apical junction cancer hallmark geneset in COL5A1 neighboring genes.The neighboring genes of COL5A1 were ranked by their EMGNN node importance on the x-axis, where each blue bar represented a gene in the apical junction geneset.A strong left-shifted curve demonstrates enrichment of apical junction geneset in the top important genes to predict COL5A1 as a cancer gene.

Table 1 :
Test AUPRC values and standard deviation across different PPI networks across five different runs.
Figure 1: An illustration of our proposed Explainable Multilayer Graph Neural Network (EMGNN) approach.The model consists of three main steps: (1) Apply a shared GNN to update the node representation matrix of each input graph;(2) Construct a meta graph for each gene, where the same genes across all graphs are connected to a meta node, and update the representation of the meta nodes with a second GNN, Meta GNN; (3) Use a multi-layer perceptron to predict the class of each meta node.

Table 2 :
Test AUPRC and standard deviation of EMGNN(GCN) for different input perturbations methods across three different runs.