MCDHGN: heterogeneous network-based cancer driver gene prediction and interpretability analysis

Abstract
Motivation: Accurately predicting cancer driver genes is of great significance for research on carcinogenesis and for cancer treatment. In recent years, a growing number of deep-learning-based methods have been used to predict cancer driver genes. However, deep-learning algorithms often behave as black boxes and cannot explain their outputs. Here, we propose a novel cancer driver gene mining method based on heterogeneous network meta-paths (MCDHGN), which uses meta-path aggregation to enhance the interpretability of predictions.
Results: MCDHGN constructs a heterogeneous network from several types of multi-omics data that are biologically linked to genes. The differential probabilities of SNV, DNA methylation, and gene expression data between cancerous and normal tissues are extracted as the initial features of genes. Nine meta-paths are manually selected, and the representation vectors obtained by aggregating information within and across meta-paths are used as new features for the subsequent classification and prediction tasks. Compared with eight homogeneous and heterogeneous network models on two pan-cancer datasets, MCDHGN achieves higher AUC and AUPR values. Additionally, MCDHGN provides interpretability for predicted cancer driver genes through the varying weights of biologically meaningful meta-paths.
Availability and implementation: https://github.com/1160300611/MCDHGN


Supplemental materials

Introduction
This paper makes the following contributions:
(1) Extraction of initial gene-node features and construction of the heterogeneous network. A 48-dimensional multi-omics vector is generated for each gene based on the mutation probability of the gene in the tumour cells of the GEO clinical cases, including the probability of copy number variation (CNV) and single nucleotide variation (SNV), the probability of DNA methylation, and the expression level of the gene product. The heterogeneous network is then constructed from the interrelationships between gene expression products together with annotations of the relevant biological pathways, in preparation for the subsequent tasks.
(2) Meta-path-based feature extraction and cancer driver gene prediction by node classification on the heterogeneous network. We design custom meta-paths for message aggregation and sample the different kinds of meta-paths in the heterogeneous network using neighbourhood-based random-walk sampling. An attention mechanism aggregates messages within and between meta-paths to obtain the final node embeddings, and a multilayer perceptron (MLP) predicts the node attributes from them.
(3) Evaluation and case studies. We compare the classification performance of our model with current state-of-the-art methods and multiple GNN models, design ablation experiments to explore the factors affecting model performance, and predict potential cancer driver genes using the full sample. We present the prediction results, select two high-confidence cases for an illustrative case study, and find supporting evidence in the literature for the analysis results.
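The two-level attention aggregation named in contribution (2) can be illustrated with a minimal sketch. This is not the MCDHGN implementation: the scoring functions are simplified dot products, and the meta-path names, `attn_vec`, and `query_vec` are hypothetical fixed values that a real model would learn. It only shows how neighbours are first pooled within each meta-path and how the per-meta-path summaries are then combined with weights that sum to one.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def weighted_sum(weights, vectors):
    dim = len(vectors[0])
    return [sum(w * vec[i] for w, vec in zip(weights, vectors)) for i in range(dim)]

def intra_metapath(target, neighbours, attn_vec):
    """Pool the neighbours reached through ONE meta-path.

    Each neighbour is scored on the concatenation [target || neighbour];
    attn_vec is a hypothetical stand-in for a learned attention vector.
    """
    scores = [dot(attn_vec, target + nb) for nb in neighbours]
    return weighted_sum(softmax(scores), neighbours)

def inter_metapath(summaries, query_vec):
    """Combine per-meta-path summaries; returns (embedding, meta-path weights)."""
    weights = softmax([dot(query_vec, s) for s in summaries])
    return weighted_sum(weights, summaries), weights

# Toy example: one target gene, two hypothetical meta-paths.
target = [1.0, 0.0]
path_names = ["gene-pathway-gene", "gene-ppi-gene"]
neighbours = {
    "gene-pathway-gene": [[1.0, 0.0], [0.0, 1.0]],
    "gene-ppi-gene": [[2.0, 2.0]],
}
attn_vec = [0.5, 0.0, 1.0, 0.0]   # assumed, not learned
query_vec = [1.0, 1.0]            # assumed, not learned

summaries = [intra_metapath(target, neighbours[p], attn_vec) for p in path_names]
embedding, path_weights = inter_metapath(summaries, query_vec)
for name, w in zip(path_names, path_weights):
    print(f"{name}: weight {w:.3f}")
```

The per-meta-path weights returned by `inter_metapath` are the kind of quantity that can be inspected for interpretability; the MLP classifier that would consume `embedding` is omitted here.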

Initial features
The initial features of the genes consist of three parts, giving a final feature matrix of dimension 48*N, where N is the number of gene nodes. The three components are as follows.
Gene mutation frequency: We calculate the gene mutation frequency from the occurrence rate of single nucleotide variations (SNVs) in the MAF (Mutation Annotation Format) files of patients in the TCGA (The Cancer Genome Atlas) database. MAF files contain detailed information about single nucleotide variations and are typically produced by cancer genome projects such as TCGA. Using GENCODE annotations, our scripts compute the SNV frequency normalized by the exonic length of each gene. Preprocessing the MAF file of an individual cancer type proceeds as follows: first, the MAF file is loaded and non-silent mutations are removed; next, hypermutated samples are eliminated according to a list of hypermutated samples; subsequently, a gene × sample matrix is calculated; finally, if necessary, the matrix is normalized by gene length.
The gene × sample matrices of the 16 cancer types are each averaged, resulting in an average matrix with dimensions 16*N. This matrix represents the average SNV mutation frequency of each gene in the TCGA samples of each cancer type.
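The counting and averaging steps above can be sketched as follows. This is an illustrative outline under stated assumptions, not the authors' script: the input is assumed to be a list of already-filtered `(gene, sample)` SNV records, and the gene names, sample ids, and exon lengths are toy values.

```python
def snv_frequency(mutations, samples, exon_length_bp,
                  hypermutated=frozenset(), per_kb=True):
    """Mean SNV count per retained sample for each gene.

    mutations      : iterable of (gene, sample_id), one entry per retained SNV
    samples        : all sample ids of this cancer type
    exon_length_bp : {gene: exonic length in bp} (e.g. derived from GENCODE)
    hypermutated   : sample ids to exclude
    per_kb         : if True, divide by exonic length in kilobases
    """
    kept = {s for s in samples if s not in hypermutated}
    counts = {}
    for gene, sample in mutations:
        if sample in kept:
            counts[gene] = counts.get(gene, 0) + 1
    freq = {}
    for gene, length in exon_length_bp.items():
        mean_count = counts.get(gene, 0) / len(kept)
        freq[gene] = mean_count / (length / 1000.0) if per_kb else mean_count
    return freq

# Toy data for two hypothetical cancer types.
exon_length_bp = {"g1": 2000, "g2": 1000}
by_cancer = {
    "cancer_A": snv_frequency(
        [("g1", "s1"), ("g1", "s2"), ("g2", "s1"),
         ("g1", "s3"), ("g2", "s3"), ("g2", "s3")],
        samples=["s1", "s2", "s3"],
        exon_length_bp=exon_length_bp,
        hypermutated={"s3"},
    ),
    "cancer_B": snv_frequency(
        [("g2", "t1")],
        samples=["t1", "t2"],
        exon_length_bp=exon_length_bp,
    ),
}

# One row per cancer type, one column per gene (16*N with 16 cancer types).
genes = sorted(exon_length_bp)
matrix = [[by_cancer[c][g] for g in genes] for c in sorted(by_cancer)]
print(matrix)
```

Passing the full sample list explicitly keeps samples without any retained SNV in the denominator, which a count derived only from the mutation records would miss.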
Gene methylation: DNA methylation data for tumor and normal samples from the Illumina 450k bead array are processed with the get_mean_sample_meth.py script to compute the methylation matrices for both tumor and normal samples. This script also defines the promoter regions of genes and assigns each measured CpG site, along with its distance from the transcription start site (TSS), to a specific gene. The average methylation matrices for tumor and normal samples are obtained in this way, and differential DNA methylation is calculated as a log2 fold change.
In this fold change, tumor samples is the heterogeneity value of the tumor samples and mean(normal samples) is the average heterogeneity value of the normal samples.
Ultimately, we obtain a 16*N matrix representing the differential DNA methylation for 16 different types of cancer.
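A minimal sketch of the two steps described above (assigning CpG sites to promoters, then taking a log2 fold change) is given below. It does not reproduce get_mean_sample_meth.py: the promoter window of 1000 bp, the pseudocount `eps`, and the toy records are all assumptions made for illustration.

```python
import math

def promoter_methylation(cpg_sites, window=1000):
    """Average the values of CpG sites within `window` bp of a gene's TSS.

    cpg_sites : iterable of (gene, distance_to_tss_bp, methylation_value)
    window    : assumed promoter half-width; the real definition may differ
    """
    sums, counts = {}, {}
    for gene, dist, value in cpg_sites:
        if abs(dist) <= window:
            sums[gene] = sums.get(gene, 0.0) + value
            counts[gene] = counts.get(gene, 0) + 1
    return {g: sums[g] / counts[g] for g in sums}

def differential_methylation(tumor, normal, eps=1e-6):
    """log2(tumor / normal) for genes measured in both groups.

    eps is an assumed pseudocount that avoids division by zero.
    """
    return {g: math.log2((tumor[g] + eps) / (normal[g] + eps))
            for g in tumor if g in normal}

# Toy records: (gene, distance to TSS, methylation value).
tumor_sites = [("g1", -200, 0.8), ("g1", 300, 0.4), ("g1", 5000, 0.1),
               ("g2", 100, 0.5)]
normal_sites = [("g1", -200, 0.3), ("g1", 300, 0.3)]

tumor = promoter_methylation(tumor_sites)
normal = promoter_methylation(normal_sites)
dm = differential_methylation(tumor, normal)
print(dm)
```

The site at 5000 bp from the TSS falls outside the assumed window and is ignored, and `g2` is dropped because it has no normal measurement.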
Gene expression: preprocess_gene_expression.py is a script for preprocessing gene expression data quantified with the FPKM method (Fragments Per Kilobase of transcript per Million mapped reads). The data include normal and tumor tissue samples from TCGA (The Cancer Genome Atlas) as well as normal tissue samples from GTEx. In our application of this script, only normal tissue data from GTEx and tumor tissue data from TCGA are used to calculate the log2 fold changes between them. Log2 fold changes are commonly used to compare gene expression levels between samples, such as normal and pathological tissues, and to identify differentially expressed genes that may play a crucial role in disease development. The workflow reads the gene expression data for tumor and normal tissues, ensures that the genes (rows) of both datasets are aligned, calculates for each gene the ratio of its median expression across all tumor samples to its median expression across all normal samples, and takes the base-2 logarithm of these ratios to obtain the log2 fold changes, as shown in equation 2.
\[ \mathrm{FC}_c = \log_2\left(\frac{\operatorname{median}(P_c)}{\operatorname{median}(N_t)}\right) \qquad \text{(equation 2)} \]
Ultimately, a 16*N matrix is obtained, representing the log2 fold changes of N different genes across 16 types of cancer samples compared to normal tissue samples, reflecting the differential expression of genes between cancerous and normal tissues.
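Equation 2 can be sketched in a few lines. The following is an illustrative version, not preprocess_gene_expression.py itself: the optional `pseudocount` is an assumption that does not appear in equation 2, and the expression values are toy numbers.

```python
import math
from statistics import median

def log2_fold_change(tumor, normal, pseudocount=0.0):
    """log2(median tumor expression / median normal expression) per gene.

    tumor, normal : {gene: [FPKM values across samples]}
    pseudocount   : assumed optional offset; with the default 0.0 this is
                    exactly equation 2, but a median of 0 then raises an error
    Only genes present in both inputs are kept (row alignment).
    """
    shared = sorted(set(tumor) & set(normal))
    return {
        g: math.log2((median(tumor[g]) + pseudocount)
                     / (median(normal[g]) + pseudocount))
        for g in shared
    }

# Toy FPKM values for one cancer type.
tumor_fpkm = {"g1": [8, 16, 32], "g2": [100, 100, 100], "g3": [1]}
normal_fpkm = {"g1": [2, 4, 6], "g2": [100, 90, 110]}

fc = log2_fold_change(tumor_fpkm, normal_fpkm)
print(fc)
```

Using the median rather than the mean makes the ratio less sensitive to a few samples with extreme expression; repeating the calculation for each of the 16 cancer types yields the rows of the 16*N matrix.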

Datasets
CPDB is an online cancer cell line proteomics database that encompasses protein expression data derived from large-scale mass spectrometry and RNA sequencing of thousands of cancer cell lines. It serves as a valuable resource for cancer treatment and research. For our study, we specifically use the protein-protein interaction (PPI) network data available in CPDB. This network represents the relationships between genes based on the interactions between their protein products.
The Human MSigDB collection is an online repository designed for the analysis of genomic data. It consists of various specialized gene sets that can identify gene expression patterns related to different biological processes, diseases, and drug responses. The MSigDB collection has two main components: (1) clinically validated signalling pathway collections and gene or protein annotation information from clinical trials, whose biological process and entity annotations contribute to a better understanding of the interactions between pairs of genes and of the links between genes that belong to the same physiological process; and (2) computationally derived gene sets that provide highly informative summaries of gene information in specific contexts, generated using modern high-throughput sequencing technologies and computational methods.
For all nine types of nodes, we consider only the intersection with the genes present in the CPDB network when constructing the heterogeneous network. Additionally, we manually define nine types of meta-paths. Meta-paths allow us to fully leverage the diverse types of nodes and edges in the heterogeneous network and thus to describe gene interactions more effectively and comprehensively. Selecting appropriate meta-paths helps to better reflect the relationships between genes while also carrying biological significance. This approach allows us to strike a balance between the predictive performance and interpretability of the model.
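The intersection step and the idea of a gene-to-gene meta-path through an intermediate node can be sketched as follows. The gene ids, set names, and the rule of discarding sets with fewer than two remaining genes are assumptions for illustration; the actual node types and meta-paths are those listed in Supplementary Table 1.

```python
from collections import defaultdict
from itertools import combinations

def restrict_to_network(gene_sets, network_genes):
    """Keep only set members that occur in the PPI network.

    Sets left with fewer than two genes are dropped (an assumption here),
    because they cannot connect a pair of genes.
    """
    restricted = {}
    for name, members in gene_sets.items():
        kept = set(members) & network_genes
        if len(kept) >= 2:
            restricted[name] = kept
    return restricted

def metapath_neighbours(gene_sets):
    """Gene-(intermediate node)-Gene meta-path instances.

    Returns {gene: {neighbour: [intermediate nodes linking the pair]}}.
    """
    nbrs = defaultdict(lambda: defaultdict(list))
    for name in sorted(gene_sets):
        for a, b in combinations(sorted(gene_sets[name]), 2):
            nbrs[a][b].append(name)
            nbrs[b][a].append(name)
    return nbrs

# Toy PPI network and toy gene sets (all names hypothetical).
ppi_edges = [("g1", "g2"), ("g1", "g3"), ("g4", "g5")]
network_genes = {g for edge in ppi_edges for g in edge}
pathways = {
    "pathway_A": ["g1", "g2", "g3", "g_outside_1"],
    "pathway_B": ["g4", "g5", "g_outside_2"],
    "pathway_C": ["g4", "g_outside_3"],
}

filtered = restrict_to_network(pathways, network_genes)
nbrs = metapath_neighbours(filtered)
print(dict(nbrs["g2"]))
```

In this toy case `g2` and `g3` share no direct PPI edge but become meta-path neighbours through `pathway_A`, which is the kind of additional relationship a meta-path contributes beyond the homogeneous PPI network.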

Extracted meta-paths:
In Supplementary Figure 7, we provide a diagram of the heterogeneous network. The biological entity descriptions before the nodes can be matched one-to-one with Table 2-1 in the main text. The content in parentheses represents the abbreviated form of intermediate nodes within the meta-path, as shown in Supplementary Figure 1.
The final extracted meta-paths, along with their biological nodes and the semantics of the paths, are presented in Supplementary Table 1. The abbreviated names within parentheses for intermediate nodes correspond to the node names shown in Supplementary Figure 1. Additionally, different edge colors in Supplementary Figure 1 represent different types of meta-path connection patterns.

Baseline Introduction:
GCN (Kipf and Welling, 2016): This method is a commonly used graph convolution algorithm for processing homogeneous graph data. Here, a three-layer GCN is employed for computation on the homogeneous network.
GAT (Veličković et al., 2017): This method uses an attention mechanism to compute the representation of each node from its neighbouring nodes using masked attention.
GGAT (Qiu et al., 2021): This method introduces a gated network to learn the weights of each attention head, allowing for dynamic adjustment of attention head weights.
HGT (Hu et al., 2020): An attention model for heterogeneous networks, where nodes are considered as entities and edges as relations. Entity vectors are learnable representation vectors, so no dedicated entity vector initialization is needed and random vectors are used instead.
EMOGI (Schulte-Sasse et al., 2021): This method is based on GCN and uses multiple omics data, such as genomics and gene expression data, as gene features to predict pan-cancer driver genes in a PPI network.
MTGCN (Peng et al., 2022): Building upon the EMOGI method, this approach considers situations where nodes in the homogeneous network have no interactions with each other. It adds a multi-task module for link prediction to enhance the model's learning of input samples.
MODIG (Zhao et al., 2022): This method takes into account multi-dimensional homogeneous networks, where gene-gene interaction networks are generated based on multiple common relationships. It employs a joint learning mechanism based on GAT in multi-dimensional homogeneous networks.
MAGNN (Fu et al., 2020): This method is a commonly used meta-path-based approach for heterogeneous networks. Because of limits on computational scale, full-graph operations on the entire heterogeneous graph are not possible, so only mini-batch training is feasible.
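Several of the baselines above (GCN, EMOGI, MTGCN) build on the graph convolution of Kipf and Welling, in which each layer computes ReLU(D^-1/2 (A + I) D^-1/2 X W). The following plain-Python sketch of a single layer is for illustration only: the adjacency matrix, features, and weight matrix are toy values, and the baseline implementations themselves use deep-learning libraries rather than nested lists.

```python
import math

def matmul(a, b):
    """Dense matrix product of two lists of lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]

def gcn_layer(adj, features, weight):
    """One graph-convolution layer: ReLU(D^-1/2 (A + I) D^-1/2 X W).

    adj      : n x n adjacency matrix (0/1 entries, no self-loops)
    features : n x d_in node feature matrix X
    weight   : d_in x d_out weight matrix W (learned in practice)
    """
    n = len(adj)
    # Add self-loops so that each node keeps its own features.
    a_hat = [[adj[i][j] + (1.0 if i == j else 0.0) for j in range(n)]
             for i in range(n)]
    deg = [sum(row) for row in a_hat]
    # Symmetric normalization by the square roots of the node degrees.
    norm = [[a_hat[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(n)]
            for i in range(n)]
    out = matmul(matmul(norm, features), weight)
    return [[max(0.0, v) for v in row] for row in out]

# Toy graph: two connected nodes with one feature each.
adj = [[0, 1], [1, 0]]
features = [[1.0], [3.0]]
weight = [[1.0]]  # assumed value, not a trained parameter

hidden = gcn_layer(adj, features, weight)
print(hidden)
```

A three-layer GCN as used for the baseline corresponds to applying `gcn_layer` three times with separate weight matrices, typically without the ReLU on the final layer.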