NetCore: a network propagation approach using node coreness

Abstract We present NetCore, a novel network propagation approach based on node coreness, for phenotype–genotype associations and module identification. NetCore addresses the node degree bias in PPI networks by using node coreness in the random walk with restart procedure, and achieves improved re-ranking of genes after propagation. Furthermore, NetCore implements a semi-supervised approach to identify phenotype-associated network modules, which anchors the identification of novel candidate genes at known genes associated with the phenotype. We evaluated NetCore on gene sets from 11 different GWAS traits and showed improved performance compared to the standard degree-based network propagation using cross-validation. Furthermore, we applied NetCore to identify disease genes and modules for Schizophrenia GWAS data and pan-cancer mutation data. We compared the novel approach to existing network propagation approaches and showed the benefits of using NetCore in comparison to those. We provide an easy-to-use implementation, together with a high confidence PPI network extracted from ConsensusPathDB, which can be applied to various types of genomics data in order to obtain a re-ranking of genes and functionally relevant network modules.


ConsensusPathDB high confidence network
The underlying PPI network for this study was constructed from PPIs collected from 19 different publicly available databases (see http://consensuspathdb.org for a list of databases). To reduce the number of false positive interactions, we developed a confidence assessment for every interaction based on topology- and annotation-based measures [1] and assigned every interaction a score between 0 (low confidence) and 1 (high confidence).
For this study we kept interactions with a confidence score > 0.95, which resulted in a PPI network consisting of 10,707 proteins and 114,516 interactions. This network consists of one large connected component and several smaller connected components of sizes 2 to 4. Since convergence of network propagation assumes that the underlying graph is connected, we performed all analyses for this study on the largest connected component of the PPI network, which consists of 10,586 proteins and 114,341 interactions.
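The filtering step described above can be sketched as follows. This is a minimal illustration using only the Python standard library, not the pipeline used for this study; the function name `high_confidence_lcc`, the edge-tuple format and the toy edges are assumptions made for the example.

```python
from collections import defaultdict, deque


def high_confidence_lcc(scored_edges, threshold=0.95):
    """Keep edges scoring above `threshold`, then return the largest
    connected component as an adjacency dict (node -> set of neighbors)."""
    adjacency = defaultdict(set)
    for u, v, score in scored_edges:
        if score > threshold and u != v:
            adjacency[u].add(v)
            adjacency[v].add(u)

    seen, best = set(), set()
    for start in adjacency:
        if start in seen:
            continue
        # Breadth-first search collects the component containing `start`.
        component, queue = {start}, deque([start])
        while queue:
            node = queue.popleft()
            for nb in adjacency[node]:
                if nb not in component:
                    component.add(nb)
                    queue.append(nb)
        seen |= component
        if len(component) > len(best):
            best = component
    return {node: set(adjacency[node]) for node in best}


# Hypothetical toy input: (protein, protein, confidence score)
toy_edges = [
    ("A", "B", 0.99), ("B", "C", 0.97), ("C", "A", 0.96),
    ("D", "E", 0.98),   # small separate component
    ("A", "D", 0.50),   # removed by the confidence filter
]
lcc = high_confidence_lcc(toy_edges)
```

In the toy input, the low-confidence edge A-D is dropped, which leaves D-E as a separate component of size 2 that is then discarded in favor of the component A, B, C.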
To characterize the PPI network further we have conducted network analysis using the NetworkAnalyzer [2] plugin for Cytoscape [3].

Node degree distribution
The PPI network shows the characteristic node degree distribution of biological networks, with a few nodes of very high degree and most nodes having a low degree (Figure M1). The power law fit has an R-squared of 0.903.
Figure M1. Node degree distribution in the ConsensusPathDB high confidence PPI network. The red line is a power law fit y = a·x^b with a = 9,193.4 and b = -1.577.
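A power law fit of this form can be obtained by linear least squares on log-transformed data, since log y = log a + b·log x. The sketch below illustrates that approach; it assumes that both the fit and the R-squared are computed on the logarithmized values, which may differ in detail from what NetworkAnalyzer does. The function name `fit_power_law` and the toy degree list are hypothetical.

```python
import math
from collections import Counter


def fit_power_law(degrees):
    """Fit count(k) = a * k**b by least squares on log-log data.

    Returns (a, b, r_squared), with R-squared computed on the
    log-transformed values (an assumption of this sketch)."""
    counts = Counter(d for d in degrees if d > 0)
    xs = [math.log(k) for k in counts]
    ys = [math.log(n) for n in counts.values()]
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx
    log_a = mean_y - b * mean_x
    ss_res = sum((y - (log_a + b * x)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return math.exp(log_a), b, 1.0 - ss_res / ss_tot


# Toy degree list constructed to follow count(k) = 64 * k**-2 exactly
toy_degrees = [1] * 64 + [2] * 16 + [4] * 4 + [8] * 1
a, b, r_squared = fit_power_law(toy_degrees)
```

A negative exponent b, as in the fit reported here, reflects that the number of nodes falls off with increasing degree.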

Shortest paths
The average shortest path length in the PPI network is 3.577; the longest shortest path between any two nodes (i.e. the network diameter) is 11.
Supplementary Methods Figure M2. Histogram of shortest path lengths in the PPI.
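Both statistics can be computed with a breadth-first search from every node of an unweighted, connected graph. The following sketch shows the idea on a toy path graph; it is not the NetworkAnalyzer implementation, and the function name `path_length_stats` and the example graph are assumptions.

```python
from collections import deque


def path_length_stats(adjacency):
    """Return (average shortest path length, diameter) of a connected,
    unweighted graph given as node -> set of neighbors."""
    total, pairs, diameter = 0, 0, 0
    for source in adjacency:
        # Breadth-first search yields shortest distances from `source`.
        dist = {source: 0}
        queue = deque([source])
        while queue:
            node = queue.popleft()
            for nb in adjacency[node]:
                if nb not in dist:
                    dist[nb] = dist[node] + 1
                    queue.append(nb)
        total += sum(dist.values())
        pairs += len(dist) - 1          # ordered pairs (source, other node)
        diameter = max(diameter, max(dist.values()))
    return total / pairs, diameter


# Hypothetical toy graph: a path A - B - C - D
path_graph = {"A": {"B"}, "B": {"A", "C"}, "C": {"B", "D"}, "D": {"C"}}
avg_length, diam = path_length_stats(path_graph)
```

Each unordered pair is counted twice (once from each endpoint), which leaves the average unchanged.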

Major hubs
Supplementary Methods Table M1 shows the 30 major hubs in the network.

Example of the network propagation process
To exemplify the process of network propagation, we chose one of the modules that NetCore identified for Type-2 Diabetes. The module consists of seven genes, three of which were previously associated with the disease in the GWAS catalog: ATP8B2, MTNR1B and PTPRD. These nodes were scored with a weight of 1, and the remaining nodes in the module with a weight of 0. Figure 3 displays the spread of the weights during the random walk with restart procedure for the sub-network of the seven nodes and their connections in the PPI network. After six steps the weights are already very close to their values at convergence. The weights for every node at every step are given in the table. We note that the nodes are also connected to other nodes in the network (which are not displayed here), and therefore their final weights are also affected by these other connections. Since the restart parameter was set to 0.8, only a fraction of 0.2 of the weight is propagated from the three disease nodes. Since we applied core normalization, the weight is spread according to the coreness of the neighbors. Finally, four more nodes are included in the sub-network; these had a significant P-value for their weight after the propagation, as well as a weight above the minimum weight that was chosen for the entire network (wmin = 0.002). IL2RG is connected only to PTPRD, while MTNR1B is connected to both PTPRD and MTNR1A. LPAR6 is connected to both PTPRD and ATP8B2, which are not directly connected to each other, as well as to MCOLN3. Overall, our analysis predicted four novel genes that are connected to well-known disease genes and could contribute to the disease phenotype.

Gene  Step 1  Step 2  Step 3  Step 4  Step 5  Step 6  Final weights
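The propagation process described in this example can be illustrated with a small, self-contained sketch of a random walk with restart. Two points are assumptions of the sketch rather than statements about the NetCore code: the update rule p(t+1) = α·p(0) + (1 - α)·W·p(t), and the core normalization W[i][j] = coreness(i) / (sum of the coreness of all neighbors of j), so that a node passes its weight to its neighbors in proportion to their coreness. The function names and the toy graph are hypothetical, and the numbers do not reproduce the weights of the Type-2 Diabetes module.

```python
def coreness(adjacency):
    """k-core number of every node, by repeatedly removing a node of
    minimum remaining degree."""
    degree = {n: len(nbs) for n, nbs in adjacency.items()}
    remaining, core, k = set(adjacency), {}, 0
    while remaining:
        node = min(remaining, key=degree.get)
        k = max(k, degree[node])
        core[node] = k
        remaining.remove(node)
        for nb in adjacency[node]:
            if nb in remaining:
                degree[nb] -= 1
    return core


def propagate(adjacency, seeds, alpha=0.8, tol=1e-9, max_steps=1000):
    """Random walk with restart on a connected graph, using a
    coreness-based normalization (assumed form, see lead-in).

    Returns the final weights and the list of weights after each step."""
    core = coreness(adjacency)
    # Total coreness of each node's neighbors: normalizes outgoing weight.
    out_norm = {j: sum(core[i] for i in adjacency[j]) for j in adjacency}
    p0 = {n: (1.0 if n in seeds else 0.0) for n in adjacency}
    p, history = dict(p0), []
    for _ in range(max_steps):
        nxt = {}
        for i in adjacency:
            incoming = sum(core[i] / out_norm[j] * p[j] for j in adjacency[i])
            nxt[i] = alpha * p0[i] + (1 - alpha) * incoming
        history.append(nxt)
        change = sum(abs(nxt[n] - p[n]) for n in adjacency)
        p = nxt
        if change < tol:
            break
    return p, history


# Hypothetical toy graph: seed "s" touches a triangle (a, b, c) and a leaf "l"
toy = {
    "s": {"a", "l"}, "l": {"s"},
    "a": {"s", "b", "c"}, "b": {"a", "c"}, "c": {"a", "b"},
}
final, steps = propagate(toy, seeds={"s"}, alpha=0.8)
```

In the toy graph, node a lies in the triangle and has coreness 2, while the leaf l has coreness 1, so after the first step a receives twice as much of the propagated weight from the seed as l does. Because the restart parameter is 0.8, only a fraction of 0.2 of the seed weight is passed on in that step.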

Figure S1
Figure S1: We compared the performance of NetCore in identifying 11 GWAS gene sets for three different values of the restart parameter α: 0.3, 0.5 and 0.8. The lower the value, the smaller the restart probability, which results in more of the weight being diffused throughout the network. Performance was measured as the ratio between the number of significant genes reported by NetCore that belong to the input GWAS gene set and the total number of significant genes reported by NetCore. In 5 of the 11 GWAS gene sets the highest performance is achieved with α = 0.8.
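The ratio used in this comparison can be written as a short function. This is a minimal sketch; the function name `seed_fraction` and the gene identifiers are placeholders, not part of the NetCore implementation.

```python
def seed_fraction(significant_genes, seed_genes):
    """Fraction of the significant genes that belong to the input gene set."""
    significant = set(significant_genes)
    if not significant:
        return 0.0  # convention of this sketch when nothing is significant
    return len(significant & set(seed_genes)) / len(significant)


# Placeholder identifiers: 2 of the 4 significant genes are seed genes
ratio = seed_fraction(["G1", "G2", "G3", "G4"], ["G1", "G3", "G9"])
```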

Figure S5
Figure S5: Extended seed sub-networks for 11 GWAS gene sets. The orange nodes are the original seed nodes. The gray nodes were added to the seed sub-network after the propagation because they had a significant P-value (p < 0.01) and exceeded a minimum weight, which is calculated from the distribution of weights after the propagation. The sizes of the nodes reflect their weights after the propagation. The edges are taken from the PPI network.
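The selection of the gray nodes can be sketched as a simple filter over the propagation results. The sketch assumes that weights and P-values are available as dictionaries keyed by node and that the minimum weight has already been determined; the function name `extended_seed_subnetwork` and all values in the toy example are hypothetical.

```python
def extended_seed_subnetwork(adjacency, seeds, weights, p_values,
                             w_min, p_cutoff=0.01):
    """Return (nodes, added, edges) of the extended seed sub-network.

    A non-seed node is added if its P-value is below `p_cutoff` and its
    weight after propagation is above `w_min`; edges are those of the
    underlying network between the selected nodes."""
    added = {
        n for n in adjacency
        if n not in seeds and p_values[n] < p_cutoff and weights[n] > w_min
    }
    nodes = set(seeds) | added
    edges = {
        frozenset((u, v))
        for u in nodes for v in adjacency[u] if v in nodes
    }
    return nodes, added, edges


# Hypothetical toy data: two seeds (S1, S2) and two candidate nodes (N1, N2)
toy_net = {"S1": {"N1", "N2"}, "N1": {"S1", "S2"}, "N2": {"S1"}, "S2": {"N1"}}
toy_weights = {"S1": 0.8, "S2": 0.8, "N1": 0.05, "N2": 0.001}
toy_pvalues = {"S1": 0.001, "S2": 0.001, "N1": 0.004, "N2": 0.2}
nodes, added, edges = extended_seed_subnetwork(
    toy_net, {"S1", "S2"}, toy_weights, toy_pvalues, w_min=0.002
)
```

In the toy data only N1 passes both criteria, and it links the two seed nodes, which are not directly connected to each other.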

Figure S6
Figure S6: Largest modules for 11 GWAS sets based on NetCore. The largest module is the largest connected component of the extended seed sub-network. Orange genes are in the original gene sets (seed nodes), and gray ones were added after the propagation. The edges are taken from the PPI network.

Figure S7
Figure S7: "Pathways in Cancer". The pathway as depicted by KEGG and generated using Pathview (1). The colored nodes are present in the module from NetCore. Red nodes are present in the NCG cancer consensus list. Blue nodes are newly predicted genes, some are present in the NCG cancer candidate list, and some not. The Cox regression plots are based on