-
PDF
- Split View
-
Views
-
Cite
Cite
Peifeng Ruan, Shuang Wang, DiSNEP: a Disease-Specific gene Network Enhancement to improve Prioritizing candidate disease genes, Briefings in Bioinformatics, Volume 22, Issue 4, July 2021, bbaa241, https://doi.org/10.1093/bib/bbaa241
- Share Icon Share
Abstract
Biological network-based strategies are useful in prioritizing genes associated with diseases. Several comprehensive human gene networks such as STRING, GIANT and HumanNet were developed and used in network-assisted algorithms to identify disease-associated genes. However, none of these networks are disease-specific and may not accurately reflect gene interactions for a specific disease. Aiming to improve disease gene prioritization using networks, we propose a Disease-Specific Network Enhancement Prioritization (DiSNEP) framework. DiSNEP first enhances a comprehensive gene network specifically for a disease through a diffusion process on a gene–gene similarity matrix derived from disease omics data. The enhanced disease-specific gene network thus better reflects true gene interactions for the disease and may improve prioritizing disease-associated genes subsequently. In simulations, DiSNEP that uses an enhanced disease-specific network prioritizes more true signal genes than comparison methods using a general gene network or without prioritization. Applications to prioritize cancer-associated gene expression and DNA methylation signal genes for five cancer types from The Cancer Genome Atlas (TCGA) project suggest that more prioritized candidate genes by DiSNEP are cancer-related according to the DisGeNET database than those prioritized by the comparison methods, consistently across all five cancer types considered, and for both gene expression and DNA methylation signal genes.
Introduction
Gene networks capture existing knowledge of biochemistry, functional and statistical interactions between genes, where nodes represent genes and edges represent interactions between genes [1]. Human gene networks have been widely used to identify genetic or genomic factors of diseases from gene expression profiles [2, 3], genome-wide association studies (GWAS) [4, 5] and epigenome-wide association studies (EWAS) [6, 7]. Many network-assisted algorithms were developed, which utilize gene network topologies to help prioritize disease-associated genes. These algorithms can be broadly grouped into two categories: direct neighborhood methods that prioritize candidate genes through utilizing directly connected neighbors on a network [5, 8] and network diffusion methods that prioritize candidate genes through propagating disease association signals on an entire network [9, 10]. As emphasized in a review paper, prioritizing candidate genes helps maximize the biological relevance of follow-up validation experiments or functional studies [11]. Using network-assisted prioritization algorithms, novel candidate omics features have been identified in a wide spectrum of diseases, such as Cerebral Cavernous Malformations [3], Crohn’s disease [4] and different types of cancers [5–7], when many of these omics features may be missed by standard statistical tests [2–7]. In these network-assisted algorithms, biological networks help prioritize disease-associated genes through their proximity to genes within networks that have strong disease association signals [12].
To better prioritize candidate genes for a disease using a network-assisted algorithm, a gene network that accurately represents gene interactions for the disease is critical. A number of comprehensive human gene networks were developed, such as STRING [13], HumanNet [4], GIANT [14] and ConsensusPathDB [15]. Also growing rapidly is the number of interactions in these networks [1]. As none of these comprehensive human gene networks are disease-specific, they may not accurately reflect gene interactions for a specific disease. Huang et al. [1] showed that the choice of a gene network greatly affects identification of downstream candidate genes. Nam et al. [16] proposed a disease-specific network-assisted prioritization method, but their method only works for immune diseases and binary protein–protein interaction networks. In this article, we propose a general Disease-Specific Network Enhancement Prioritization (DiSNEP) framework that enhances a comprehensive human gene network for a specific disease in order to improve prioritization of genes associated with the disease.
The proposed DiSNEP framework first enhances a general gene network into a disease-specific gene network for a disease. This is done through a random walk process using a gene similarity matrix derived from disease omics data where each element of the gene similarity matrix is a correlation between two genes. The enhanced disease-specific gene network can then be used to prioritize gene association signals by diffusion. We conducted simulation studies to show the better performance of the DiSNEP framework compared to methods that prioritize candidate genes using a general human gene network or without prioritization. In all simulation scenarios considered, the genes prioritized by DiSNEP include more true disease signals than those by the comparison methods with the original gene network or without prioritization. We applied the DiSNEP framework to prioritize cancer-associated gene expression signal genes using gene expression-enhanced cancer-specific networks and to prioritize DNA methylation signal genes using DNA methylation-enhanced cancer-specific networks for five cancer types from The Cancer Genome Atlas (TCGA) project: kidney renal papillary cell carcinoma (KIRP), thyroid carcinoma (THCA), stomach adenocarcinoma (STAD), liver hepatocellular carcinoma (LIHC) and prostate adenocarcinoma (PRAD). Our results show that (1) consistently across all five cancer types considered and (2) for both gene expression signal genes and DNA methylation signal genes, more candidate genes prioritized by DiSNEP are cancer-related than those by a general gene network or without prioritization according to the DisGeNET database [17]. We also applied DiSNEP to independent 450K DNA methylation data of hepatocellular carcinoma (HCC) tumor and adjacent normal tissues from the Columbia University Irving Medical Center (CUIMC) GES54503 [44] for a replication analysis and concluded that the prioritization results are reproducible if a different dataset is used. Additional pathway enrichment analyses across the five TCGA cancers suggest that the prioritized genes by DiSNEP are indeed cancer-specific.
Methods
The proposed DiSNEP framework has three steps: (1) enhance a general gene network S into a disease-specific gene network SE, (2) diffuse gene association signals g into gE on the enhanced disease-specific gene network SE and (3) prioritize and select top ranked candidate genes based on gE. Figure 1 shows the workflow of the proposed DiSNEP framework.

The DiSNEP framework workflow. We first diffuse a general gene network S on a gene similarity matrix E, which is generated using a type of disease omics data. We then remove ‘noise’ edges and make the network binary. The resulted disease-specific network SE better reflects the true interactions between genes for the disease. SE is then used to prioritize gene association measures through a diffusion process, and the top-ranked prioritized genes by the prioritized gene association measures are selected.
Step 1: Enhance a general gene network S
Generate a gene similarity matrix using disease omics data
We use disease omics data, for example gene expression data of a disease from both case and control samples to generate a gene similarity matrix E that reflects correlations/interactions between gene–gene pairs for the disease by calculating absolute values of pairwise correlations between genes. This could be similarly applied to DNA methylation data where the largest absolute value of correlation between any two CpGs on two different genes is selected to represent the correlation between the two genes.
Enhance a general human network S into SE
Here α is a regularization parameter representing weights for signal sources where α = 0 means no disease-specific enhancement, i.e. no data from the disease are used to enhance S0, and t is iteration steps. Simulation studies suggest that the optimal α value is 0.75, while different values of α > 0.5 do not have much impact on prioritization results. This is consistent with previous observations using similar random walk processes [18, 19]. We set iterations to stop when L1 matrix norm of St + 1 − St is smaller than 1 × 10−6. Typically, the threshold is reached after 10–15 iterations.
Note that after convergence, St is no longer symmetric. To build SE, we further remove ‘noise edges’ in St, i.e. for top 10% of edges ordered by edge weights, we set their edge weights to be 1. We set the rest 90% edge weights to be 0. The 10% was chosen based on a previous study that investigated a set of gene ontology-derived gold standards [18]. Since St is not symmetric, in order to have an undirected network, i.e. symmetric, we then set edge weights between any pairs of genes as 1 if at least one of the two weights between two genes is 1 [18]. The resulted network is then denoted as SE.
Step 2: Update gene association signals g
Disease-specific gene association signal scores g
Association P-values could be obtained from appropriate statistical tests based on study designs, e.g. t-test comparing mean gene expression levels between cases and controls. Denote pi as the association P-value for gene i, we calculate the association signal score gi = Φ−1 (1 − pi), where Φ is the standard normal distribution function, and denote the association signal scores of n genes as g0 = (g1, g2, …, gn). This could be similarly applied to DNA methylation data where the largest association signal score of all CpGs on a gene represents the association signal score for the gene.
Update gene association signals g into gE
Here β is a regularization parameter representing weights for signal sources where β = 0 means no prioritization using SE. Similar to α, β has an optimal value at 0.75 from simulation studies and different values of β > 0.5 do not have much impact on prioritization results, which is consistent with previous observations [18, 19]. We set iterations to stop when L1 vector norm of gt + 1 − gt is smaller than 1 × 10−6. The final updated association signal scores are denoted as gE.
Step 3: Prioritize candidate genes based on gE
We prioritize genes ranked by their gE scores and choose m top-ranked genes as the prioritized candidate signal genes, where m is usually under a few hundred [20].
Note that the same set of disease omics data is used twice for two purposes in the DiSNEP framework, first to enhance a general gene network into a disease-specific network and second to generate disease association measures to be diffused for prioritization.
Results
Simulation studies
We conducted simulation studies to investigate the performance of the proposed DiSNEP framework, which enhances a general gene network into a disease-specific network and then prioritizes gene associations on the enhanced network. We evaluated the performance using true positive rates (TPR) and false positive rates (FPR) and constructed receiver operating characteristic (ROC) curves with areas under the ROC curves (AUCs). We compared the performance of DiSNEP to that of (1) the GeneWanderer method that prioritizes candidate genes by diffusion on the column-normalized STRING general gene network [9] and (ii) the t-test method that generates disease association measures similarly as in DiSNEP and GeneWanderer without using network information.
Simulation settings
Generating a general human gene network
We downloaded the human gene network from the STRING database v11, which has 19 354 genes and 11 759 454 edges [13]. We trimmed the STRING network to keep genes that are also in the TCGA KIRP gene expression dataset after quality control steps (see Real data applications). We ended with 14 066 genes and 4 056 937 edges. Using KIRP as an example to prioritize KIRP-associated genes, we downloaded a list with 499 renal carcinoma-associated genes from the DisGeNET database and considered these 499 genes as the true KIRP-associated genes. The DisGeNET database V6.0 was developed by expert curated repositories, GWAS catalogues, animal models and literatures [17]. We set the general gene network S in simulation studies to have 2000 genes, with 50 genes being renal carcinoma-associated signal genes randomly selected from the 499 renal carcinoma-associated genes and the remaining 1950 genes being noises randomly selected from the 13 567 non-renal-carcinoma-associated genes on the STRING network. We kept the edges of these 2000 genes selected from the STRING network.
Simulating gene expression data
In each simulation, we simulated gene expressions of the 2000 genes for 300 tumor samples and 50 independent normal samples mimicking real data applications using a multivariate normal distribution. For the 50 signal genes of tumor samples, we set 25 genes to have large association signals with means generated from N (μ1, 1) and the other 25 to have medium association signals with means generated from N (μ2, 1). The 1950 noise genes of tumor samples and all 2000 genes of normal samples have means generated from N (8, 1). For correlations among the 2000 genes, we set strong pairwise correlations among the 25 large signal genes from Uniform (0, 0.1) and median correlations among the other 25 median signal genes from Uniform (0, 0.02). We set weak correlations among the 1950 noise genes from Uniform (0, 0.01). Finally, we denote standard deviations of expression level of all genes as σ. We set parameters (μ1, μ2,σ) to have values (10, 9, 3, 10, 9, 2, 10, 9, 1, 11, 10, 1, 12, 11, 1) for different effect sizes. We then simulated gene expression data of 2000 genes for 300 cases and 50 controls from a multivariate normal distribution with parameters generated above and calculated the association signal scores g and the gene similarity matrix E using simulated gene expression data in each simulated dataset. We simulated 100 datasets and applied the proposed DiSNEP framework and the comparison methods, GeneWanderer and two-sample t-test.
Simulation results
With the selected top-ranked m candidate genes by signal scores, we calculated TPR and FPR and constructed ROC curves with AUCs when we ranged m from 1 to 2000 with a grid of 1. As typically a few hundred candidate genes are considered in the follow-up analyses [20], we compared parts of AUCs of the three methods when the minimum top-ranked m prioritized genes were chosen so that FPR = 0.1. Across the 100 simulated datasets, the mean of the minimum m’s selected to have FPR = 0.1 is 225 when α and β were set at 0.75. Figure 2A displays AUC results from the 100 simulated datasets when α and β were set at 0.75, and (μ1, μ2,σ) were set at (10, 9, 1), respectively. Under the same parameter setting, Figure 2B displays the average numbers of true signal genes out of top-ranked m prioritized genes when m increases from 0 to 500. Table 1 summarizes averaged AUCs when FPR ≤ 0.1 under different α and β settings when (μ1, μ2) were set at (10, 9) and σ varying from 1, 2 to 3. Additional simulation results are included in Supplementary Table S7. It is clear that in all simulation scenarios, the proposed DiSNEP framework that enhances a general gene network for a specific disease achieves higher AUCs than that using a general gene network or without network-assisted prioritization. This demonstrates that the enhancement of a general gene network for a specific disease does improve the prioritization of candidate genes for the disease. Among different values of α and β, the optimal values are both 0.75 when different α and β values ranging from 0.5 to 1 do not have much impact on the performance of DiSNEP and GeneWanderer. Previous studies using similar random walk processes also reported similar findings [18, 19].

Evaluations of the three methods in simulation studies. (A) Boxplot of AUCs when FPR ≤ 0.1; α and β were set at 0.75 and σ, μ1 and μ2 were set at 1, 10 and 9. (B) The number of true signal genes out of top ranked 0–500 genes increased by 50.
Simulation results with mean AUCs for t-test, GeneWanderer and DiSNEP when FPR ≤ 0.1 with different values of α and β in different simulation scenarios
Methods | Mean AUC | |||||
t-test | 0.019 | |||||
Scenario I μ1 = 10a μ2 = 9b σ = 1c | β | |||||
0.5 | 0.75 | 1 | ||||
GeneWanderer | 0.028 | 0.029 | 0.027 | |||
DiSNEP | α | 0.5 | 0.034 | 0.036 | 0.036 | |
0.75 | 0.037 | 0.038 | 0.037 | |||
1 | 0.035 | 0.036 | 0.035 | |||
Scenario II μ1 = 10 μ2 = 9 σ = 2 | Methods | Mean AUC | ||||
t-test | 0.018 | |||||
β | ||||||
0.5 | 0.75 | 1 | ||||
GeneWanderer | 0.026 | 0.027 | 0.025 | |||
DiSNEP | α | 0.5 | 0.031 | 0.032 | 0.027 | |
0.75 | 0.032 | 0.033 | 0.027 | |||
1 | 0.030 | 0.030 | 0.024 | |||
Scenario III μ1 = 10 μ2 = 9 σ = 3 | Methods | Mean AUC | ||||
t-test | 0.016 | |||||
β | ||||||
0.5 | 0.75 | 1 | ||||
GeneWanderer | 0.025 | 0.026 | 0.024 | |||
DiSNEP | α | 0.5 | 0.028 | 0.028 | 0.025 | |
0.75 | 0.029 | 0.030 | 0.025 | |||
1 | 0.026 | 0.027 | 0.023 |
Methods | Mean AUC | |||||
t-test | 0.019 | |||||
Scenario I μ1 = 10a μ2 = 9b σ = 1c | β | |||||
0.5 | 0.75 | 1 | ||||
GeneWanderer | 0.028 | 0.029 | 0.027 | |||
DiSNEP | α | 0.5 | 0.034 | 0.036 | 0.036 | |
0.75 | 0.037 | 0.038 | 0.037 | |||
1 | 0.035 | 0.036 | 0.035 | |||
Scenario II μ1 = 10 μ2 = 9 σ = 2 | Methods | Mean AUC | ||||
t-test | 0.018 | |||||
β | ||||||
0.5 | 0.75 | 1 | ||||
GeneWanderer | 0.026 | 0.027 | 0.025 | |||
DiSNEP | α | 0.5 | 0.031 | 0.032 | 0.027 | |
0.75 | 0.032 | 0.033 | 0.027 | |||
1 | 0.030 | 0.030 | 0.024 | |||
Scenario III μ1 = 10 μ2 = 9 σ = 3 | Methods | Mean AUC | ||||
t-test | 0.016 | |||||
β | ||||||
0.5 | 0.75 | 1 | ||||
GeneWanderer | 0.025 | 0.026 | 0.024 | |||
DiSNEP | α | 0.5 | 0.028 | 0.028 | 0.025 | |
0.75 | 0.029 | 0.030 | 0.025 | |||
1 | 0.026 | 0.027 | 0.023 |
aMeans of the 25 large association signal genes were generated from N(μ1, 1).
bMeans of the 25 medium association signal genes generated from N(μ2, 1).
cσ is the standard deviation of gene expression data.
Simulation results with mean AUCs for t-test, GeneWanderer and DiSNEP when FPR ≤ 0.1 with different values of α and β in different simulation scenarios
Methods | Mean AUC | |||||
t-test | 0.019 | |||||
Scenario I μ1 = 10a μ2 = 9b σ = 1c | β | |||||
0.5 | 0.75 | 1 | ||||
GeneWanderer | 0.028 | 0.029 | 0.027 | |||
DiSNEP | α | 0.5 | 0.034 | 0.036 | 0.036 | |
0.75 | 0.037 | 0.038 | 0.037 | |||
1 | 0.035 | 0.036 | 0.035 | |||
Scenario II μ1 = 10 μ2 = 9 σ = 2 | Methods | Mean AUC | ||||
t-test | 0.018 | |||||
β | ||||||
0.5 | 0.75 | 1 | ||||
GeneWanderer | 0.026 | 0.027 | 0.025 | |||
DiSNEP | α | 0.5 | 0.031 | 0.032 | 0.027 | |
0.75 | 0.032 | 0.033 | 0.027 | |||
1 | 0.030 | 0.030 | 0.024 | |||
Scenario III μ1 = 10 μ2 = 9 σ = 3 | Methods | Mean AUC | ||||
t-test | 0.016 | |||||
β | ||||||
0.5 | 0.75 | 1 | ||||
GeneWanderer | 0.025 | 0.026 | 0.024 | |||
DiSNEP | α | 0.5 | 0.028 | 0.028 | 0.025 | |
0.75 | 0.029 | 0.030 | 0.025 | |||
1 | 0.026 | 0.027 | 0.023 |
Methods | Mean AUC | |||||
t-test | 0.019 | |||||
Scenario I μ1 = 10a μ2 = 9b σ = 1c | β | |||||
0.5 | 0.75 | 1 | ||||
GeneWanderer | 0.028 | 0.029 | 0.027 | |||
DiSNEP | α | 0.5 | 0.034 | 0.036 | 0.036 | |
0.75 | 0.037 | 0.038 | 0.037 | |||
1 | 0.035 | 0.036 | 0.035 | |||
Scenario II μ1 = 10 μ2 = 9 σ = 2 | Methods | Mean AUC | ||||
t-test | 0.018 | |||||
β | ||||||
0.5 | 0.75 | 1 | ||||
GeneWanderer | 0.026 | 0.027 | 0.025 | |||
DiSNEP | α | 0.5 | 0.031 | 0.032 | 0.027 | |
0.75 | 0.032 | 0.033 | 0.027 | |||
1 | 0.030 | 0.030 | 0.024 | |||
Scenario III μ1 = 10 μ2 = 9 σ = 3 | Methods | Mean AUC | ||||
t-test | 0.016 | |||||
β | ||||||
0.5 | 0.75 | 1 | ||||
GeneWanderer | 0.025 | 0.026 | 0.024 | |||
DiSNEP | α | 0.5 | 0.028 | 0.028 | 0.025 | |
0.75 | 0.029 | 0.030 | 0.025 | |||
1 | 0.026 | 0.027 | 0.023 |
aMeans of the 25 large association signal genes were generated from N(μ1, 1).
bMeans of the 25 medium association signal genes generated from N(μ2, 1).
cσ is the standard deviation of gene expression data.
To further evaluate the performance of DiSNEP, for each gene, we defined an interaction ratio score, which is the ratio of the number of interactions of this gene to the average number of interactions of all genes. We then calculated correlations between interaction ratio scores and prioritization ranks of signal genes and noise genes separately. Supplementary Table S4 shows that across all simulation scenarios considered, interaction ratio scores of signal genes are larger on average using DiSNEP than those using GeneWanderer, while interaction ratio scores of noise genes are smaller on average using DiSNEP than those using GeneWanderer. Correlations between interaction ratio scores and prioritization ranks for signal genes are on average larger using DiSNEP than those using GeneWanderer, while correlations for noise genes are on average smaller using DiSNEP than those using GeneWanderer. This indicates that signal genes, which should rank higher, indeed have larger interaction ratio scores using DiSNEP, i.e. DiSNEP improves prioritization through utilizing gene interaction information specifically for the disease as designed.
Real data applications
Prioritize cancer-associated genes using omics data from TCGA for five cancer types
We applied DiSNEP to prioritize cancer-associated gene expression (or DNA methylation) signal genes using gene expression-enhanced (or DNA methylation-enhanced) cancer-specific networks for five cancer types from TCGA V16: KIRP, THCA, STAD, LIHC and PRAD. We used gene expression data (DNA methylation data) of tumor and adjacent normal samples to enhance a general gene network into a gene expression-enhanced cancer-specific network (DNA methylation-enhanced cancer-specific network) and then prioritizing cancer-associated gene expression signals (DNA methylation signals) using the enhanced cancer-specific network.
We conducted the same quality control steps for gene expression data of the five cancers where we removed samples with more than 30% missing in gene expression measures and removed genes with more than 30% samples missing. We then imputed missing values using KNN [21] and corrected batch effects using Combat [22]. For DNA methylation data, if both Infinium HumanMethylation27 and HumanMethylation450 BeadChips are available, we used the overlapping probes. If only HumanMethylation450 BeadChip is available, we used all probes of the HumanMethylation450 BeadChip. We conducted the same quality control steps for all five cancers where we removed CpGs on sex chromosomes and overlapping with single-nucleotide polymorphisms. We removed samples with more than 30% missing in methylation measures and removed CpGs with more than 30% samples missing. We then imputed missing values using KNN [21], corrected batch effects using Combat [22] and corrected type I/II probe bias using wateRmelon [23]. Table 2 summarizes gene expression data and DNA methylation data of the five cancer types. After these quality control steps, we restricted the omics data and the general STRING gene network to have the same set of genes, which are used for all three methods.
Summary of gene expression data and DNA methylation data of the five TCGA cancer types
. | . | KIRP . | THCA . | STAD . | LIHC . | PRAD . |
---|---|---|---|---|---|---|
GeneExpression | No. of tumor samples | 290 | 501 | 274 | 367 | 497 |
No. of tumor/adjacent normal pairs | 30 | 58 | 27 | 49 | 52 | |
No. of genes | 16,729 | 16 842 | 20 396 | 16 085 | 17 139 | |
DNAMethylation | No. of tumor samples | 291 | 503 | 443 | 337 | 498 |
No. of tumor/adjacent normal pairs | 50 | 56 | 27 | 50 | 50 | |
No. of CpG sites | 22 485 | 337 497 | 22 488 | 337 483 | 337 491 | |
No. of genes | 13 157 | 23 232 | 13 158 | 23 232 | 23 233 |
. | . | KIRP . | THCA . | STAD . | LIHC . | PRAD . |
---|---|---|---|---|---|---|
GeneExpression | No. of tumor samples | 290 | 501 | 274 | 367 | 497 |
No. of tumor/adjacent normal pairs | 30 | 58 | 27 | 49 | 52 | |
No. of genes | 16,729 | 16 842 | 20 396 | 16 085 | 17 139 | |
DNAMethylation | No. of tumor samples | 291 | 503 | 443 | 337 | 498 |
No. of tumor/adjacent normal pairs | 50 | 56 | 27 | 50 | 50 | |
No. of CpG sites | 22 485 | 337 497 | 22 488 | 337 483 | 337 491 | |
No. of genes | 13 157 | 23 232 | 13 158 | 23 232 | 23 233 |
Summary of gene expression data and DNA methylation data of the five TCGA cancer types
. | . | KIRP . | THCA . | STAD . | LIHC . | PRAD . |
---|---|---|---|---|---|---|
GeneExpression | No. of tumor samples | 290 | 501 | 274 | 367 | 497 |
No. of tumor/adjacent normal pairs | 30 | 58 | 27 | 49 | 52 | |
No. of genes | 16,729 | 16 842 | 20 396 | 16 085 | 17 139 | |
DNAMethylation | No. of tumor samples | 291 | 503 | 443 | 337 | 498 |
No. of tumor/adjacent normal pairs | 50 | 56 | 27 | 50 | 50 | |
No. of CpG sites | 22 485 | 337 497 | 22 488 | 337 483 | 337 491 | |
No. of genes | 13 157 | 23 232 | 13 158 | 23 232 | 23 233 |
. | . | KIRP . | THCA . | STAD . | LIHC . | PRAD . |
---|---|---|---|---|---|---|
GeneExpression | No. of tumor samples | 290 | 501 | 274 | 367 | 497 |
No. of tumor/adjacent normal pairs | 30 | 58 | 27 | 49 | 52 | |
No. of genes | 16,729 | 16 842 | 20 396 | 16 085 | 17 139 | |
DNAMethylation | No. of tumor samples | 291 | 503 | 443 | 337 | 498 |
No. of tumor/adjacent normal pairs | 50 | 56 | 27 | 50 | 50 | |
No. of CpG sites | 22 485 | 337 497 | 22 488 | 337 483 | 337 491 | |
No. of genes | 13 157 | 23 232 | 13 158 | 23 232 | 23 233 |
TCGA KIRP cancer
Using TCGA KIRP as an example, we investigated KIRP candidate genes prioritized by DiSNEP and the comparison methods.
With gene expression data, after quality control steps, there are 16 729 genes for 290 tumor samples and 30 adjacent normal samples, among which there are 30 tumor and adjacent normal pairs. All 320 samples were used to generate the gene similarity matrix E. The 30 tumor and adjacent normal pairs were used to generate association signal scores g. Of those 16 729 genes, 14 066 are in the STRING network with 4 056 937 edges. We chose the top 100 genes ranked by prioritized association signal scores using DiSNEP and GeneWanderer and the top 100 genes ranked by association signal scores using paired t-test. Table 3 shows that among the top-ranked 100 genes by paired t-test, GeneWanderer and DiSNEP, 7, 30 and 35 genes are associated with renal carcinoma according to DisGeNET [17]. This suggests that the enhanced KIRP-specific gene network more accurately reflects gene interactions for renal carcinoma than STRING, a general gene network. Therefore, more renal carcinoma-associated genes are prioritized. The pathway enrichment analyses according to KEGG [24] show that the top 100 genes by DiSNEP also achieve the most significant enrichment P-values among the three methods not only for two general cancer pathways, ‘pathway in cancer’ and ‘proteoglycans in cancer’, but also for the renal cancer specific pathway, ‘renal cell cancer’ (Table 3).
Results for TCGA KIRP cancer when gene expression signal genes and DNA methylation signal genes are prioritized using gene expression-enhanced KIRP-specific network and DNA methylation-enhanced KIRP-specific network, respectively. Displayed are the number of renal carcinoma-associated genes according to DisGeNETa and the enrichment P-values of ‘pathway in cancer’, ‘proteoglycans in cancer’ and ‘renal cell cancer’ from KEGG among top-ranked 100 signal genes for the three methods
. | Gene expression signals . | DNA methylation signals . | ||||
---|---|---|---|---|---|---|
Paired t-test . | GeneWanderer . | DiSNEP . | Paired t-test . | GeneWanderer . | DiSNEP . | |
Number of renal carcinoma-associated genes | 7 | 30 | 35 | 8 | 34 | 40 |
Enrichment P-value of ‘pathway in cancer’ | NS | 1.4 × 10−20 | 2.4 × 10−23 | NS | 3.5 × 10−24 | 3.4 × 10−34 |
Enrichment P-value of ‘proteoglycans in cancer’ | NS | 6.1 × 10−21 | 1.6 × 10−27 | NS | 1.7 × 10−22 | 2.1 × 10−24 |
Enrichment P-value of ‘renal cell cancer’ | NS | 3.4 × 10−6 | 2.4 × 10−10 | NS | 6.8 × 10−7 | 1.7 × 10−11 |
. | Gene expression signals . | DNA methylation signals . | ||||
---|---|---|---|---|---|---|
Paired t-test . | GeneWanderer . | DiSNEP . | Paired t-test . | GeneWanderer . | DiSNEP . | |
Number of renal carcinoma-associated genes | 7 | 30 | 35 | 8 | 34 | 40 |
Enrichment P-value of ‘pathway in cancer’ | NS | 1.4 × 10−20 | 2.4 × 10−23 | NS | 3.5 × 10−24 | 3.4 × 10−34 |
Enrichment P-value of ‘proteoglycans in cancer’ | NS | 6.1 × 10−21 | 1.6 × 10−27 | NS | 1.7 × 10−22 | 2.1 × 10−24 |
Enrichment P-value of ‘renal cell cancer’ | NS | 3.4 × 10−6 | 2.4 × 10−10 | NS | 6.8 × 10−7 | 1.7 × 10−11 |
NS, not significant.
aThere are 499 renal-carcinoma-associated genes according to DisGeNET [17].
Results for TCGA KIRP cancer when gene expression signal genes and DNA methylation signal genes are prioritized using gene expression-enhanced KIRP-specific network and DNA methylation-enhanced KIRP-specific network, respectively. Displayed are the number of renal carcinoma-associated genes according to DisGeNETa and the enrichment P-values of ‘pathway in cancer’, ‘proteoglycans in cancer’ and ‘renal cell cancer’ from KEGG among top-ranked 100 signal genes for the three methods
. | Gene expression signals . | DNA methylation signals . | ||||
---|---|---|---|---|---|---|
Paired t-test . | GeneWanderer . | DiSNEP . | Paired t-test . | GeneWanderer . | DiSNEP . | |
Number of renal carcinoma-associated genes | 7 | 30 | 35 | 8 | 34 | 40 |
Enrichment P-value of ‘pathway in cancer’ | NS | 1.4 × 10−20 | 2.4 × 10−23 | NS | 3.5 × 10−24 | 3.4 × 10−34 |
Enrichment P-value of ‘proteoglycans in cancer’ | NS | 6.1 × 10−21 | 1.6 × 10−27 | NS | 1.7 × 10−22 | 2.1 × 10−24 |
Enrichment P-value of ‘renal cell cancer’ | NS | 3.4 × 10−6 | 2.4 × 10−10 | NS | 6.8 × 10−7 | 1.7 × 10−11 |
. | Gene expression signals . | DNA methylation signals . | ||||
---|---|---|---|---|---|---|
Paired t-test . | GeneWanderer . | DiSNEP . | Paired t-test . | GeneWanderer . | DiSNEP . | |
Number of renal carcinoma-associated genes | 7 | 30 | 35 | 8 | 34 | 40 |
Enrichment P-value of ‘pathway in cancer’ | NS | 1.4 × 10−20 | 2.4 × 10−23 | NS | 3.5 × 10−24 | 3.4 × 10−34 |
Enrichment P-value of ‘proteoglycans in cancer’ | NS | 6.1 × 10−21 | 1.6 × 10−27 | NS | 1.7 × 10−22 | 2.1 × 10−24 |
Enrichment P-value of ‘renal cell cancer’ | NS | 3.4 × 10−6 | 2.4 × 10−10 | NS | 6.8 × 10−7 | 1.7 × 10−11 |
NS, not significant.
aThere are 499 renal-carcinoma-associated genes according to DisGeNET [17].
Among the top 100 gene expression signal genes by DiSNEP only but not the other two methods, there are several important KIRP genes based on DisGeNET, such as CUL1 [26], PIK3CA [26], CD44 [27], MMP9 [28], MDM2 [29] and KDR [30]. We examined the gene CUL1 as an example. CUL1 has an association signal score 4.34 from paired t-test and ranks #3945. After prioritization using GeneWanderer, CUL1 ranks #202 with an interaction ratio score of 3.45. After prioritization using DiSNEP, CUL1 ranks #56 with an interaction ratio score of 6.29. This suggests that the KIRP-enhanced network SE more accurately reflects the strong interactions between CUL1 and other genes, which helps boosting the CUL1 gene signal.
For DNA methylation data, both HumanMethylation27 and HumanMethylation450 BeadChips are available for KIRP samples. After quality control steps, there are 13 157 genes with 22 485 CpGs for 291 tumor samples and 50 adjacent normal samples, among which there are 50 tumor and adjacent normal pairs. All 341 samples were used to generate the gene similarity matrix E. The 50 tumor and adjacent normal pairs were used to generate association signal scores g. Of those 13 157 genes, 10 104 are in the STRING network with 2 359 788 edges. We similarly chose the top 100 genes ranked by the association signal scores. Among these top 100 genes by paired t-test, GeneWanderer and DiSNEP, 8, 34 and 40 genes are reported to be associated with renal carcinoma according to DisGeNET [17] (Table 3), suggesting that the enhanced KIRP-specific gene network more accurately reflects gene interactions for renal carcinoma. The pathway enrichment analyses [24] also show that DiSNEP achieves the most significant enrichment P-values not only for two general cancer pathways, ‘pathway in cancer’ and ‘proteoglycans in cancer’, but also for the renal cancer-specific pathway, ‘renal cell cancer’.
We similarly investigated DNA methylation signal genes prioritized by DiSNEP only but not the other two methods. There are several important KIRP genes based on DisGeNET, such as PIK3CA [25], CUL1 [26], KDR [30], PIK3R1 [31], CXCL12 [32], NFKB1 [33] and PLK1 [34]. Using gene PIK3R1 as an example, PIK3R1 has an association signal score 1.82 from paired t-test and ranks #5062. After prioritization using GeneWanderer, PIK3R1 ranks #549 with an interaction ratio score of 2.54. After prioritization using DiSNEP, PIK3R1 ranks #95 with an interaction ratio score of 6.09. Again, this suggests that the KIRP-enhanced network SE more accurately reflects the strong interactions between PIK3R1 and other genes, which helps boosting the PIK3R1 gene signal.
TCGA THCA cancer
We similarly investigated prioritized THCA-associated genes as the second example.
Table 4 shows that among the top 100 gene expression (DNA methylation) signal genes by paired t-test, GeneWanderer and DiSNEP, 8 (8), 49 (52) and 54 (58) genes are associated with thyroid carcinoma according to DisGeNET [17]. The pathway enrichment analyses also show that DiSNEP-prioritized top genes achieve the most significant enrichment P-values among the three methods not only for two general cancer pathways, ‘pathway in cancer’ and ‘proteoglycans in cancer’, but also for the thyroid cancer specific pathway, ‘thyroid cancer’. Among the top 100 gene expression signal genes by DiSNEP only, there are several important THCA genes, such as ATM [35], FYN [36], PIK3CA [37], FGF2 [38] and NRAS [39]. We similarly examined gene PIK3CA. PIK3CA has an association signal score 1.99 from paired t-test and ranks #10428. After prioritization using GeneWanderer, PIK3CA ranks #148 with an interaction ratio score of 4.08. After prioritization using the THCA-enhanced network, PIK3CA ranks #71 with an interaction ratio score of 6.29. This suggests that the THCA-enhanced network SE more accurately reflects the strong interactions between PIK3CA and other genes, which helps boosting the PIK3CA gene signal. Similarly, among the top 100 DNA methylation signal genes by DiSNEP only, there are important THCA genes, such as ATM [35], PIK3CA [37], NRAS [39], CXCR4 [40], STAT1 [41], MMP9 [42] and PPARG1 [43]. We examined the gene CXCR4. CXCR4 has an association signal score 3.98 from paired t-test and ranks #7320. After prioritization using GeneWanderer, CXCR4 ranks #172 with an interaction ratio score of 3.78. After prioritization using the THCA-enhanced network, CXCR4 ranks #82 with an interaction ratio score of 6.28. This similarly suggests that the THCA-enhanced network SE more accurately reflects the strong interactions between CXCR4 and other genes, which helps boosting the CXCR4 gene signal.
Results for TCGA THCA cancer when gene expression signal genes and DNA methylation signal genes are prioritized using gene expression-enhanced THCA-specific network and DNA methylation-enhanced THCA-specific network, respectively. Displayed are the number of thyroid carcinoma-associated genes according to DisGeNETa and the enrichment P-values of ‘pathway in cancer’, ‘proteoglycans in cancer’ and ‘thyroid cancer’ from KEGG among top-ranked 100 signal genes for the three methods
. | Gene expression signals . | DNA methylation signals . | ||||
---|---|---|---|---|---|---|
Paired t-test . | GeneWanderer . | DiSNEP . | Paired t-test . | GeneWanderer . | DiSNEP . | |
Number of thyroid carcinoma-associated genes | 8 | 49 | 54 | 8 | 52 | 58 |
Enrichment P-value of ‘pathway in cancer’ | NS | 1.8 × 10−23 | 9.7 × 10−28 | NS | 4.5 × 10−28 | 7.8 × 10−39 |
Enrichment P-value of ‘proteoglycans in cancer’ | NS | 1.1 × 10−23 | 1.9 × 10−26 | NS | 3.3 × 10−26 | 8.2 × 10−29 |
Enrichment P-value of ‘thyroid cancer’ | NS | 6.9 × 10−9 | 1.2 × 10−11 | NS | 5.9 × 10−9 | 1.1 × 10−11 |
. | Gene expression signals . | DNA methylation signals . | ||||
---|---|---|---|---|---|---|
Paired t-test . | GeneWanderer . | DiSNEP . | Paired t-test . | GeneWanderer . | DiSNEP . | |
Number of thyroid carcinoma-associated genes | 8 | 49 | 54 | 8 | 52 | 58 |
Enrichment P-value of ‘pathway in cancer’ | NS | 1.8 × 10−23 | 9.7 × 10−28 | NS | 4.5 × 10−28 | 7.8 × 10−39 |
Enrichment P-value of ‘proteoglycans in cancer’ | NS | 1.1 × 10−23 | 1.9 × 10−26 | NS | 3.3 × 10−26 | 8.2 × 10−29 |
Enrichment P-value of ‘thyroid cancer’ | NS | 6.9 × 10−9 | 1.2 × 10−11 | NS | 5.9 × 10−9 | 1.1 × 10−11 |
NS, not significant.
aThere are 720 thyroid carcinoma-associated genes according to DisGeNET [17].
Results for TCGA THCA cancer when gene expression signal genes and DNA methylation signal genes are prioritized using gene expression-enhanced THCA-specific network and DNA methylation-enhanced THCA-specific network, respectively. Displayed are the number of thyroid carcinoma-associated genes according to DisGeNETa and the enrichment P-values of ‘pathway in cancer’, ‘proteoglycans in cancer’ and ‘thyroid cancer’ from KEGG among top-ranked 100 signal genes for the three methods
. | Gene expression signals . | DNA methylation signals . | ||||
---|---|---|---|---|---|---|
Paired t-test . | GeneWanderer . | DiSNEP . | Paired t-test . | GeneWanderer . | DiSNEP . | |
Number of thyroid carcinoma-associated genes | 8 | 49 | 54 | 8 | 52 | 58 |
Enrichment P-value of ‘pathway in cancer’ | NS | 1.8 × 10−23 | 9.7 × 10−28 | NS | 4.5 × 10−28 | 7.8 × 10−39 |
Enrichment P-value of ‘proteoglycans in cancer’ | NS | 1.1 × 10−23 | 1.9 × 10−26 | NS | 3.3 × 10−26 | 8.2 × 10−29 |
Enrichment P-value of ‘thyroid cancer’ | NS | 6.9 × 10−9 | 1.2 × 10−11 | NS | 5.9 × 10−9 | 1.1 × 10−11 |
. | Gene expression signals . | DNA methylation signals . | ||||
---|---|---|---|---|---|---|
Paired t-test . | GeneWanderer . | DiSNEP . | Paired t-test . | GeneWanderer . | DiSNEP . | |
Number of thyroid carcinoma-associated genes | 8 | 49 | 54 | 8 | 52 | 58 |
Enrichment P-value of ‘pathway in cancer’ | NS | 1.8 × 10−23 | 9.7 × 10−28 | NS | 4.5 × 10−28 | 7.8 × 10−39 |
Enrichment P-value of ‘proteoglycans in cancer’ | NS | 1.1 × 10−23 | 1.9 × 10−26 | NS | 3.3 × 10−26 | 8.2 × 10−29 |
Enrichment P-value of ‘thyroid cancer’ | NS | 6.9 × 10−9 | 1.2 × 10−11 | NS | 5.9 × 10−9 | 1.1 × 10−11 |
NS, not significant.
aThere are 720 thyroid carcinoma-associated genes according to DisGeNET [17].
Reproducibility results using TCGA LIHC DNA methylation data and CUIMC HCC DNA methylation data
. | Paired t-test . | GeneWanderer . | DiSNEP . |
---|---|---|---|
No. of overlapping genes between the two sets of top-ranked 100 prioritized genes using the TCGA LIHC and CUIMC HCC data | 23 | 91 | 76 |
No. of liver cancer-associated genes out of the top-ranked 100 prioritized genes based on DisGeNET using the TCGA LIHC data | 2 | 43 | 46 |
No. of liver cancer-associated genes out of the top-ranked 100 prioritized genes based on DisGeNET using the CUIMC HCC data | 3 | 40 | 46 |
No. (%) of overlapping liver cancer-associated genes using the TCGA LIHC and CUIMC HCC data | 0 out of 23 (0%) | 38 out of 91 (42%) | 36 out of 76 (47%) |
. | Paired t-test . | GeneWanderer . | DiSNEP . |
---|---|---|---|
No. of overlapping genes between the two sets of top-ranked 100 prioritized genes using the TCGA LIHC and CUIMC HCC data | 23 | 91 | 76 |
No. of liver cancer-associated genes out of the top-ranked 100 prioritized genes based on DisGeNET using the TCGA LIHC data | 2 | 43 | 46 |
No. of liver cancer-associated genes out of the top-ranked 100 prioritized genes based on DisGeNET using the CUIMC HCC data | 3 | 40 | 46 |
No. (%) of overlapping liver cancer-associated genes using the TCGA LIHC and CUIMC HCC data | 0 out of 23 (0%) | 38 out of 91 (42%) | 36 out of 76 (47%) |
Reproducibility results using TCGA LIHC DNA methylation data and CUIMC HCC DNA methylation data
. | Paired t-test . | GeneWanderer . | DiSNEP . |
---|---|---|---|
No. of overlapping genes between the two sets of top-ranked 100 prioritized genes using the TCGA LIHC and CUIMC HCC data | 23 | 91 | 76 |
No. of liver cancer-associated genes out of the top-ranked 100 prioritized genes based on DisGeNET using the TCGA LIHC data | 2 | 43 | 46 |
No. of liver cancer-associated genes out of the top-ranked 100 prioritized genes based on DisGeNET using the CUIMC HCC data | 3 | 40 | 46 |
No. (%) of overlapping liver cancer-associated genes using the TCGA LIHC and CUIMC HCC data | 0 out of 23 (0%) | 38 out of 91 (42%) | 36 out of 76 (47%) |
. | Paired t-test . | GeneWanderer . | DiSNEP . |
---|---|---|---|
No. of overlapping genes between the two sets of top-ranked 100 prioritized genes using the TCGA LIHC and CUIMC HCC data | 23 | 91 | 76 |
No. of liver cancer-associated genes out of the top-ranked 100 prioritized genes based on DisGeNET using the TCGA LIHC data | 2 | 43 | 46 |
No. of liver cancer-associated genes out of the top-ranked 100 prioritized genes based on DisGeNET using the CUIMC HCC data | 3 | 40 | 46 |
No. (%) of overlapping liver cancer-associated genes using the TCGA LIHC and CUIMC HCC data | 0 out of 23 (0%) | 38 out of 91 (42%) | 36 out of 76 (47%) |
TCGA STAD, LIHC and PRAD cancers
We obtained similar results and reached similar conclusions for STAD, LIHC and PRAD. Details are included in Supplementary Tables S1–S3. The analysis results of interaction ratio scores for five cancer types are included in Table S5.
Reproducibility of DiSNEP
To investigate whether the prioritization results of DiSNEP are reproducible, we applied DiSNEP and the two comparison methods to independent 450K DNA methylation data of tumor and adjacent normal tissues of HCC from CUIMC GES54503 [44]. After applying the same quality control steps, the CUIMC HCC DNA methylation data have 412 287 CpGs from 23 809 genes for 66 matched tumor and adjacent normal pairs. Note that the TCGA LIHC DNA methylation data have 337 483 CpGs from 23 233 genes for 50 matched tumor and adjacent normal pairs. We compared the two sets of top-ranked 100 prioritized genes using TCGA LIHC and CUIMC HCC DNA methylation data and summarized results in Table 5. Among the two sets of top-ranked 100 prioritized genes, there are 91 overlapping genes using GeneWanderer and 76 overlapping genes using DiSNEP. It is expected that network-assisted gene prioritization methods with a general network will reproduce more. When we further examined the 76 overlapping prioritized genes using DiSNEP based on the DisGeNET database [3], 36 (47%) are liver cancer-associated. While out of the 91 overlapping prioritized genes using GeneWanderer, 38 (42%) are liver cancer-associated. This suggests that the prioritization results by DiSNEP and GeneWanderer are reproducible, when DiSNEP reproduces slightly more cancer-associated genes based on DisGeNET.
Specificity of DiSNEP
We also investigated if genes prioritized by DiSNEP for a disease are indeed disease-specific.
To do so, we conducted additional enrichment analyses using all terms of the Disease Ontology (DO) database (May 2020 release) [46].
We first used top 100 DiSNEP-prioritized genes for enrichment analysis on all DO terms and selected top 20 most significantly enriched DO terms. We then extracted these 20 DO terms’ enrichment P-values using the top 100 GeneWanderer-prioritized genes. Supplementary Tables S8–S12 summarize these enrichment results of the top 20 enriched DO terms for five TCGA cancers. We can clearly see that most (≥80%) of the top 20 enriched DO terms are cancer terms. For all the enriched cancer DO terms, DiSNEP achieves the most significant enrichment P-values among the three methods. Since different cancer DO terms share many common genes, these results suggest that genes prioritized by DiSNEP are disease-specific or, in this case, cancer-specific, although specific cancer DO terms such as ‘Kidney cancer’ or ‘Thyroid cancer’ are not in the top 20 enriched DO terms.
We then used top 100 GeneWanderer-prioritized genes for enrichment analysis on all DO terms and selected top 20 most significantly enriched DO terms. We then extracted these 20 DO terms’ enrichment P-values using the top 100 DiSNEP-prioritized genes. Supplementary Tables S13–S17 summarize these enrichment results of the top 20 enriched DO terms for five TCGA cancers. We can clearly see that DiSNEP achieves the most significant enrichment P-values among the three methods for all enriched cancer DO terms but not for some of the enriched non-cancer DO terms, such as ‘connective tissue disease’, ‘musculoskeletal system disease’, ‘immune system disease’, ‘hypersensitivity reaction disease’ and ‘bone disease’ for KIRP. This also suggests that genes prioritized by DiSNEP are disease-specific, or cancer-specific in this case.
Next, we examined the DO terms with the names of the TCGA cancers we investigated: ‘Kidney cancer’, ‘Thyroid cancer’, ‘Stomach cancer’, ‘Liver cancer’, and ‘Prostate cancer’. Using the top 100 DiSNEP-prioritized genes, the enrichment P-values for these five DO cancer terms are the most significant among the three methods (Supplementary Table S18).
Finally, using the top 100 DiSNEP-prioritized genes, the top 100 GeneWanderer-prioritized genes and the top 100 genes by t-test, we examined the number of enriched non-cancer DO terms out of the top 20 most significantly enriched DO terms for five TCGA cancers (Supplementary Table S19). It is clear that there are fewer non-cancer DO terms out of the top 20 enriched DO terms using top 100 DiSNEP-prioritized genes than those enriched using top 100 GeneWanderer-prioritized or top 100 genes by t-test. These results again suggest that genes prioritized by DiSNEP are disease-specific, or cancer-specific in this case.
Discussion
In this article, we proposed DiSNEP, a framework to enhance a general human gene network into a disease-specific network in order to improve prioritization of candidate genes associated with this disease. Network-assisted methods have been proven to be useful in prioritizing candidate disease genes. Several comprehensive gene networks were developed. However, these networks are not for specific diseases, thus interactions among genes in these networks are not necessarily reflecting true interactions between genes for specific diseases. This may lead to a reduced accuracy when prioritizing candidate disease genes. Our proposed DiSNEP framework addresses this problem by first enhancing a general gene network into a disease-specific network using a gene similarity matrix generated from disease omics data through a random walk process. The enhanced disease-specific network is then used to prioritize candidate genes associated with the disease by diffusion. In simulation studies, the proposed DiSNEP framework has improved performance than that using a general gene network. We applied DiSNEP to prioritize gene expression signal genes and DNA methylation signal genes associated with five different cancer types from TCGA using gene expression-enhanced cancer-specific networks and DNA methylation-enhanced cancer-specific networks, respectively. Results suggest that more prioritized genes by DiSNEP are cancer-related than those by a general gene network or without prioritization consistently across the five cancers considered when we used the DisGeNET database as the gold standards for cancer-associated genes.
We also considered more restricted criteria to define gold standards for cancer-associated genes based on the DisGeNET database. We investigated three DisGeNET criteria, namely (1) the global disease association (GDA) score, (2) the number of reference papers and (3) the disease-specific index (DSI) to see how changes in DisGeNET gold standards affect prioritization results. We found that DiSNEP consistently prioritizes more cancer-associated genes out of the top 100 prioritized genes with different DisGeNET criteria to define true cancer-associated genes as gold standards. Detailed results are included in Part E in the Supplementary Material and Supplementary Figure S2. Moreover, the replication analysis to independent 450K DNA methylation data of CUIMC HCC tumor and adjacent normal tissues suggests that the prioritization results by DiSNEP are reproducible. One limitation is that the enhanced disease-specific network is based on STRING, while evaluations of DiSNEP are based on the DisGeNET database. It is possible that some disease associations are included in both STRING and DisGeNET, making it easy for DiSNEP to identify these associations. More rigorous evaluation procedures may be taken to evaluate the performance of the DiSNEP method.
We demonstrated that using either gene expression data or DNA methylation data to capture correlations/interactions among genes for a specific cancer type can help improving prioritization genes associated with the cancer. DiSNEP can be readily applied to use other types of omics data or multiple types of omics data jointly, which potentially could better capture correlations/interactions among genes for a disease to further improve the enhancement of a general gene network into a disease-specific network. Note that biological implications of multi-omics data should be accounted carefully when using multi-omics data for enhancement. In both simulation studies and real data applications, we used the same omics data to enhance a general gene network and to define gene signal scores. We also investigated using one type of omics data to enhance a general gene network, with which to enhance signal scores obtained using a different type of omics data. Supplementary Table S6 shows the gene expression signal genes prioritized using gene expression-enhanced cancer-specific networks or DNA methylation-enhanced cancer-specific networks and the DNA methylation signal genes prioritized using DNA methylation-enhanced cancer-specific networks or gene expression-enhanced cancer-specific networks. We can see that the difference is modest. More investigations are needed when trying to use multi-omics data to enhance a general gene network into a disease-specific gene network, with which to prioritize association signals. Since a small subset of cancer genes, such as those in the Cancer Gene Census (CGC) [45], have been causally implicated using experiments, and dysfunctions of these genes explain how they drive cancers, we could take advantages of these information and further connect these genes in an enhanced cancer-specific network after enhancing with disease omics data.
Although we focused on cancers and enhanced a general network into cancer-specific networks to prioritize genes associated with different cancer types, the DiSNEP framework is general and can be readily applied to other diseases. The developed DiSNEP framework is implemented in an R package, which is freely available at https://github.com/pfruan/DiSNEP.
The proposed DiSNEP framework enhances a general human gene network into a network for a specific disease that better reflects true gene interactions for the disease, which subsequently improves network-assisted candidate gene prioritization.
The proposed DiSNEP framework is able to prioritize more cancer-related genes than those by network-assisted methods with a general network.
The proposed DiSNEP framework consistently improves cancer gene prioritization across all five cancer types considered.
The prioritization results by the proposed DiSNEP framework are reproducible and are indeed disease-specific.
Peifeng Ruan was a doctoral student at the George Washington University, USA and is a postdoc associate at Yale University, USA.
Shuang Wang is a Professor at Columbia University, USA.