NIAPU: network-informed adaptive positive-unlabeled learning for disease gene identification

Abstract
Motivation: Gene–disease associations are fundamental for understanding disease etiology and developing effective interventions and treatments. Identifying genes not yet associated with a disease due to a lack of studies is a challenging task in which prioritization based on prior knowledge is an important element. The computational search for new candidate disease genes may be eased by positive-unlabeled learning, the machine learning (ML) setting in which only a subset of instances are labeled as positive while the rest of the dataset is unlabeled. In this work, we propose a set of effective network-based features to be used in a novel Markov diffusion-based multi-class labeling strategy for putative disease gene discovery.
Results: The performances of the new labeling algorithm and the effectiveness of the proposed features have been tested on 10 different disease datasets using three ML algorithms. The new features have been compared against classical topological and functional/ontological features and a set of network- and biological-derived features already used in gene discovery tasks. The predictive power of the integrated methodology in searching for new disease genes has been found to be competitive against state-of-the-art algorithms.
Availability and implementation: The source code of NIAPU can be accessed at https://github.com/AndMastro/NIAPU. The source data used in this study are available online on the respective websites.
Supplementary information: Supplementary data are available at Bioinformatics online.


Introduction
The discovery of gene-disease associations (GDAs) is made difficult by incomplete knowledge of biological and physiological processes. When approaching complex, multi-gene diseases and traits, it is hard to disentangle the contribution of each gene, and computational biological approaches for predicting GDAs (Opap and Mulder, 2017; Piro and Cunto, 2012) can support and guide experimental methods (e.g., genome-wide association studies (GWAS) or linkage studies, among others), which are expensive and time-consuming.
The fuzzy background of yet unknown or truly unassociated genes makes the computational identification of disease genes difficult to carry out accurately. In machine learning (ML), this setting translates into the ability to identify new positive instances among a set of positive and unlabeled samples, a task known as "positive-unlabeled (PU) learning" (Liu et al., 2003; Bekker and Davis, 2020). This task can be addressed through semi-supervised learning algorithms following one of two approaches. In the first one, the set of unlabeled instances is assumed to be a contaminated set of negative instances, and the contamination is taken into account during the modeling process by weighting the data points or adding penalties on misclassification (Elkan and Noto, 2008; Mordelet and Vert, 2014; Claesen et al., 2015; Ke et al., 2018). In the specific case of gene discovery, this contamination arises because the presumed negative instances may contain positive genes that have not yet been discovered. The second approach, called the two-step technique, aims at relabeling the instances and then training a supervised learning algorithm (Liu et al., 2003; Yang et al., 2012, 2014). For example, Yang et al. (2012) introduced a multi-class labeling procedure considering five different labels, namely Positive (P), Likely Positive (LP), Weakly Negative (WN), Likely Negative (LN), and Reliable Negative (RN), based on a Markov process with restart (Can et al., 2005), widely applied in disease gene identification (Köhler et al., 2008; Li and Patra, 2010a,b). Then, a supervised learning algorithm is trained on the relabeled data.
In the present work, we considered the multi-class labeling approach since it allows identifying a set of originally unlabeled items, namely the LP set, whose features are close to those of the items in P. This translates into the identification of a small set of genes more likely to contain true positive instances, hence providing a set of new candidate disease genes for prioritization.
arXiv:2108.06158v4 [cs.LG] 25 Jan 2023
Going beyond the approach of Yang et al. (2012), we propose several significant modifications of the multi-class method regarding the distance matrix defining the Markov process and the selection of the different classes. Some of these modifications were needed in order to apply the method to general PU data sets, while others were proposed to make the process of class formation more rigorous and, at the same time, flexible. The approach considered here, being a two-step technique, is based on the separability and smoothness assumptions (Bekker and Davis, 2020), which require that the features be able to distinguish between positive and negative instances and that, at the same time, instances with similar features be more likely to have the same label. Therefore, as a further contribution, we propose the use of specific network-informed features, one of them introduced for the first time in this work, based on protein-protein interaction (PPI) data, which provide a characterization of the topological relationships of all human genes with respect to the set of disease genes. The use of such measures grants a much more precise classification of genes than other topological measures. In particular, the set of seed genes is identified very precisely, as are the genes closest to and farthest from them, as shown in Section 3.1. The Network-Informed Adaptive Positive-Unlabeled (NIAPU) framework is therefore formed by two components: the Network Diffusion and Biology-Informed Topological (NeDBIT) features and the Adaptive Positive-Unlabeled (APU) labeling algorithm.

Data sources and preprocessing
The proposed methodology exploits two types of data, i.e., reliable PPIs and known GDA data. PPI data provide valuable biological knowledge for the identification of undiscovered disease genes (Piro and Cunto, 2012; Doncheva et al., 2012; Silverman et al., 2020; Tieri et al., 2019; Petti et al., 2021). Human PPI data, i.e., the human interactome, were gathered from the BioGRID (Stark et al., 2006) dataset. The human interactome is obtained by choosing Homo sapiens genes (organism ID 9606), from which we extract a connected network consisting of 19761 genes and 678932 non-redundant, undirected interactions (see Supplementary File 1).
GDAs were derived from DisGeNET (Piñero et al., 2016, 2020), a discovery platform containing one of the largest publicly available collections of genes and variants associated with human diseases, together with a score denoting the association confidence and significance. Ten diseases were considered: malignant neoplasm of breast (disease ID C0006142, 1074 genes), schizophrenia (C0036341, 883 genes), liver cirrhosis (C0023893, 774 genes), colorectal carcinoma (C0009402, 702 genes), malignant neoplasm of prostate (C0376358, 616 genes), bipolar disorder (C0005586, 477 genes), intellectual disability (C3714756, 447 genes), drug-induced liver disease (C0860207, 404 genes), depressive disorder (C0011581, 289 genes), and chronic alcoholic intoxication (C0001973, 268 genes). The selection criterion for these diseases was the highest cardinality of GDAs in the DisGeNET curated dataset, to ensure sufficient information for the ML task. To validate the gene discovery results, we relied on the all genes DisGeNET dataset, which we refer to as the extended dataset. The latter contains associated genes from additional sources not present in the curated version (Bundschus et al., 2008, 2010; Bravo et al., 2014, 2015). More details can be found in Supplementary File 2. After performing additional cleaning steps (see Supplementary File 2), we ended up with a set of seed genes for each disease, denoted by Σ, with their association score S. In particular, we have 1025 genes for disease C0006142, 832 for C0036341, 747 for C0023893, 672 for C0009402, 606 for C0376358, 451 for C0005586, 431 for C3714756, 320 for C0860207, 279 for C0011581, and 255 for C0001973.
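As an illustration of the network extraction step, the following minimal sketch keeps one connected component of an undirected edge list. It assumes that the "connected network" is the largest connected component of the interactome, which the text does not state explicitly; the function name and the toy gene identifiers are hypothetical.

```python
from collections import defaultdict, deque

def largest_connected_component(edges):
    """Return (nodes, edges) of the largest connected component.

    `edges` is an iterable of (gene_a, gene_b) pairs. Self-loops and
    duplicate or reversed pairs are dropped, so the result is
    non-redundant and undirected.
    """
    adjacency = defaultdict(set)
    for a, b in edges:
        if a != b:
            adjacency[a].add(b)
            adjacency[b].add(a)

    seen, best = set(), set()
    for start in adjacency:
        if start in seen:
            continue
        # Breadth-first search from an unvisited node.
        component, queue = {start}, deque([start])
        while queue:
            node = queue.popleft()
            for neighbor in adjacency[node]:
                if neighbor not in component:
                    component.add(neighbor)
                    queue.append(neighbor)
        seen |= component
        if len(component) > len(best):
            best = component

    kept_edges = {tuple(sorted((a, b))) for a in best for b in adjacency[a]}
    return best, kept_edges

# Toy edge list with a duplicate, a self-loop, and a second component.
toy_edges = [("A", "B"), ("B", "A"), ("B", "C"), ("D", "E"), ("C", "C")]
nodes, kept = largest_connected_component(toy_edges)
```

On the toy input, the component {A, B, C} is retained with its two distinct interactions, while the smaller component {D, E} is discarded.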

Multi-class labeling: Adaptive PU (APU) labeling algorithm and classification
The APU algorithm consists of a multi-class labeling procedure that relies on the labels introduced in Yang et al. (2012): P, LP, WN, LN, and RN. P instances are the known disease genes, RN instances are the genes whose features are the furthest from the average features of the P set, and the remaining labels are assigned through a Markov process with restart (Can et al., 2005). The novelty of the proposed method is the construction of a new transition matrix starting from the distance matrix between the features of the genes. The matrix needs to be normalized in order to preserve the total transition probability of the state vector, whose initial value is different from zero only for genes in the P and RN classes. Moreover, the class selection has been made flexible by using an adaptable quantile separation instead of fixed thresholds. These characteristics make the process of class formation more rigorous and, at the same time, more flexible, hence easily adaptable to different settings, datasets, and needs.
Let $V$ be a set whose generic $i$-th element $v_i$, $i = 1, \dots, n$, is characterized by the pair $(x_i, y_i)$, where $x_i \in [0, 1]^d$ is the feature vector and $y_i \in \{0, 1\}$ is the initial label. The APU algorithm is defined by the following steps:
Step 1: Compute the matrix $W$, whose elements $w_{ij}$ are defined from the distance between the feature vectors $x_i$ and $x_j$. The symmetric matrix $W$ represents the similarity score between elements $i$ and $j$.
Step 2: Compute the reduced matrix $W_r$ as follows:
$$ w_{r,ij} = \begin{cases} w_{ij} & \text{if } w_{ij} \ge q_w, \\ 0 & \text{otherwise.} \end{cases} $$
The threshold $q_w$ is computed as a given quantile of the distribution of the elements in the matrix $W$ in order to exclude from the propagation process links between poorly related elements. To obtain a proper Markov process, i.e., one preserving the probability distribution, the matrix $W_r$ must be normalized as $W_n = D^{-1} W_r$, where $D$ is the diagonal matrix with elements $d_{ii} = \sum_j w_{r,ij}$.
Step 3: Initialize the propagation process with the initial state vector $g_0$ defined as follows. Let $|P|$ be the cardinality of $P$ (the set of seed genes) and $\bar{x} = (\bar{x}^1, \dots, \bar{x}^d)$, where $\bar{x}^k = \frac{1}{|P|} \sum_{i \in P} x_i^k$, be the average features of $P$. The RN genes are chosen to be the ones whose features are the most distant from $\bar{x}$. We select the $|P|$ genes most distant from $\bar{x}$ in order to keep the classes balanced. Then, the $i$-th element of $g_0$ is defined as
$$ g_{0,i} = \begin{cases} 1 & \text{if } i \in P, \\ -1 & \text{if } i \in RN, \\ 0 & \text{otherwise.} \end{cases} $$
When needed, a different number of RN genes can be selected. In this case, the initial value of the RN genes in the state vector $g_0$ must be set to $-|P|/|RN|$ so that the positive and negative values are balanced in $g_0$, with the sum of its elements equal to zero.
Step 4: Define a Markov process with restart for the state vector $g_r$, as given in Equation (2), where the parameter $\alpha$ is usually set to 0.8 (Yang et al., 2012; Li and Patra, 2010a). Starting from the state vector $g_0$, the dynamics in Equation (2) ends in the stationary state $g_\infty$, numerically reached when $|g_r - g_{r-1}| < 10^{-6}$.
Step 5: Use $g_\infty$ to assign the remaining labels. Selecting only the elements that belong neither to P nor to RN, the values of the asymptotic distribution of those elements are sorted, and the ranking of the corresponding elements is used to form the remaining classes: LP, WN, and LN. A simple rule is to divide the ranking into three equal parts and identify LP samples with the first third, WN with the second third, and LN with the last third. However, depending on the type of analysis and the problem addressed, any division of the ranking can be considered acceptable.
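The five steps can be sketched in a few lines of NumPy. This is an illustrative sketch under stated assumptions, not the reference implementation. It assumes a similarity of the form one minus the normalized Euclidean distance, a restart update of the form g_r = (1 - α) W_n^T g_{r-1} + α g_0, and an L1 norm for the stopping rule. The function name `apu_labels` and its parameters are hypothetical.

```python
import numpy as np

def apu_labels(X, y, alpha=0.8, quantile=0.5, tol=1e-6, max_iter=10_000):
    """Sketch of APU labeling. X: (n, d) features in [0, 1]; y: 0/1 labels."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    n = len(y)
    pos = np.flatnonzero(y == 1)
    unl = np.flatnonzero(y == 0)

    # Step 1 (ASSUMED form): similarity = 1 - normalized Euclidean distance.
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    W = 1.0 - dist / dist.max()

    # Step 2: zero out links below the quantile threshold, then row-normalize.
    q_w = np.quantile(W, quantile)
    W_r = np.where(W >= q_w, W, 0.0)   # the diagonal (w_ii = 1) always survives
    W_n = W_r / W_r.sum(axis=1, keepdims=True)

    # Step 3: RN = the |P| unlabeled genes farthest from the mean of P.
    x_bar = X[pos].mean(axis=0)
    far = np.argsort(-np.linalg.norm(X[unl] - x_bar, axis=1))
    rn = unl[far[: len(pos)]]
    g0 = np.zeros(n)
    g0[pos] = 1.0
    g0[rn] = -len(pos) / len(rn)       # keeps sum(g0) equal to zero

    # Step 4 (ASSUMED update rule): propagate with restart until convergence.
    g = g0.copy()
    for _ in range(max_iter):
        g_next = (1.0 - alpha) * W_n.T @ g + alpha * g0
        converged = np.abs(g_next - g).sum() < tol
        g = g_next
        if converged:
            break

    # Step 5: rank the remaining genes and split the ranking into thirds.
    labels = np.full(n, "", dtype=object)
    labels[pos], labels[rn] = "P", "RN"
    rest = np.setdiff1d(unl, rn)
    ranked = rest[np.argsort(-g[rest])]
    for name, chunk in zip(("LP", "WN", "LN"), np.array_split(ranked, 3)):
        labels[chunk] = name
    return labels, g

# Toy data: 30 genes, 4 features, the first 5 genes are seeds.
rng = np.random.default_rng(0)
X = rng.random((30, 4))
y = np.zeros(30, dtype=int)
y[:5] = 1
labels, g = apu_labels(X, y)
```

Because the columns of the transposed matrix sum to one, the update preserves the sum of the state vector, which stays at zero throughout the iteration. The dense pairwise distance array is adequate only for small examples; a network of the size of the interactome would need a blockwise or sparse computation.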

Network Diffusion and Biology-Informed Topological (NeDBIT) features
The NeDBIT features include two network diffusion-based features, namely heat diffusion and balanced diffusion, and two biology-informed topological metrics, namely NetShort and NetRing. Network diffusion methods are widely used in disease gene discovery processes (Lancour et al., 2018; Picart-Armada et al., 2019; Janyasupab et al., 2021). We coupled network diffusion methods and innovative topology-based features in order to make the most of the combined predictive power of both approaches. Moreover, all the features are computed by exploiting the association score S. In this way, the NeDBIT features do not assign the same weight to all seed genes and are therefore more informative about the disease under investigation.

Heat diffusion feature
This feature is obtained by using a heat diffusion process over the network, which is among the most used processes for disease gene prioritization and prediction (see Carlin et al., 2017 and references therein). Starting with a distribution of weights, with positive values only on the seed genes, their evolution is determined by the diffusion equation on the graph (Nitsch et al., 2010)
$$ \frac{dz(t)}{dt} = -L\, z(t), \tag{3} $$
where $L$ is the graph Laplacian matrix, $L = K - A$, $K$ is the diagonal matrix with the degree of nodes on the diagonal, namely $K_{ii} = k_i$, and $A$ is the adjacency matrix of the PPI network. The weights at time $t$ are given by the formal solution of Equation (3),
$$ z(t) = \exp(-L t)\, z(0), \tag{4} $$
where exp is the matrix exponential. Regarding the initial distribution of weights, we assign $z_i(0) = s_i$ for seed genes in Σ and 0 otherwise, where $s_i$ is the association score.
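A small numerical sketch of the heat diffusion feature is given below. It assumes the standard form z(t) = exp(-tL) z(0) with L = K - A and evaluates the matrix exponential through the eigendecomposition of the symmetric Laplacian. The function name, the toy graph, and the diffusion time are hypothetical choices made for illustration.

```python
import numpy as np

def heat_diffusion(A, z0, t):
    """Return z(t) = exp(-t L) z0 with L = K - A for a symmetric adjacency A."""
    A = np.asarray(A, dtype=float)
    L = np.diag(A.sum(axis=1)) - A
    # L is symmetric, so exp(-t L) = V diag(exp(-t * lam)) V^T.
    lam, V = np.linalg.eigh(L)
    return V @ (np.exp(-t * lam) * (V.T @ np.asarray(z0, dtype=float)))

# Toy path graph 0 - 1 - 2 with a single seed (node 0, association score 0.9).
A = np.array([[0, 1, 0],
              [1, 0, 1],
              [0, 1, 0]])
z0 = np.array([0.9, 0.0, 0.0])
z = heat_diffusion(A, z0, t=1.0)
```

In this toy case the total weight is conserved and decreases with the distance from the seed. A full eigendecomposition is impractical for a network with tens of thousands of nodes; a sparse routine such as `scipy.sparse.linalg.expm_multiply` would be a more realistic choice there.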

Balanced diffusion feature
This feature is obtained by using the diffusion equation in (3) but with a different version of the graph Laplacian matrix, i.e., $L_b = I - K^{-1} A$.
The weights at time $t$ are obtained as in Equation (4) by using the operator $L_b$, and the initial weights are assigned as for the previous measure.
This form of the graph diffusion operator differs from heat diffusion in that the operator $L$ diffuses the same amount of score for each link, whereas $L_b$ diffuses the same amount of score for each node. This implies a different short-time behavior of the diffusion process on the graph.
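The balanced variant can be sketched in the same way. The code below assumes the solution has the form exp(-t L_b) z(0) with L_b = I - K^{-1}A, and uses the fact that L_b is similar to the symmetric normalized Laplacian, so that a symmetric eigendecomposition is still applicable. The function name and the toy star graph are hypothetical, and the graph is assumed to have no isolated nodes.

```python
import numpy as np

def balanced_diffusion(A, z0, t):
    """Return exp(-t L_b) z0 with L_b = I - K^{-1} A (no isolated nodes)."""
    A = np.asarray(A, dtype=float)
    k = A.sum(axis=1)
    sqrt_k = np.sqrt(k)
    # L_b = K^{-1/2} L_sym K^{1/2} with L_sym = I - K^{-1/2} A K^{-1/2},
    # hence exp(-t L_b) = K^{-1/2} exp(-t L_sym) K^{1/2}.
    L_sym = np.eye(len(k)) - A / np.outer(sqrt_k, sqrt_k)
    lam, V = np.linalg.eigh(L_sym)
    y = V @ (np.exp(-t * lam) * (V.T @ (sqrt_k * np.asarray(z0, dtype=float))))
    return y / sqrt_k

# Toy star graph: hub 0 connected to leaves 1, 2, 3; leaf 1 is the only seed.
A = np.array([[0, 1, 1, 1],
              [1, 0, 0, 0],
              [1, 0, 0, 0],
              [1, 0, 0, 0]])
z0 = np.array([0.0, 0.8, 0.0, 0.0])
z = balanced_diffusion(A, z0, t=1.0)
```

Under this convention a constant vector is left unchanged by the diffusion, and the degree-weighted total of the weights is conserved instead of their plain sum, which is one way to see how the two operators differ.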

NetShort
The NetShort measure (White and Smyth, 2003) is based on the idea that a generic node is topologically important for a disease if a large number of seed nodes must be traversed to reach it. For each node, the weights are assigned as a function of the elements $a_{ij}$ of the adjacency matrix $A$ and of the normalized scores
$$ \tilde{s}_i = \begin{cases} s_i / \max S & \text{if } i \in \Sigma, \\ \alpha \, \min S / \max S & \text{otherwise,} \end{cases} $$
where $\min S$ and $\max S$ are the minimum and the maximum of the association scores and $\alpha$ is the penalization parameter given to non-seed nodes. We use $\alpha = 0.5$ so that all non-seed nodes have normalized score $\tilde{s}_i = \frac{1}{2} \min S / \max S$, while seed nodes have normalized score $\min S / \max S \le \tilde{s}_i \le 1$. Then, the NetShort measure $NS_i$ of node $i$ is computed from the lengths $d_{ij}$ of the weighted shortest paths from $i$ to $j$.
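The score normalization can be illustrated with a short sketch. It assumes that seed scores are divided by the maximum association score and that non-seed nodes receive α times the ratio of the minimum to the maximum score, which is consistent with the ranges stated above. The sketch stops at the normalized scores and does not attempt the edge-weight or NetShort formulas. The function name and gene identifiers are hypothetical.

```python
def normalized_scores(scores, all_nodes, alpha=0.5):
    """Map association scores to normalized scores.

    scores: dict mapping each seed gene to its association score.
    all_nodes: iterable with every node of the network.
    Non-seed nodes receive alpha * min(S) / max(S) (ASSUMED form).
    """
    s_min, s_max = min(scores.values()), max(scores.values())
    return {
        node: scores[node] / s_max if node in scores else alpha * s_min / s_max
        for node in all_nodes
    }

# Two seed genes and one non-seed gene.
tilde_s = normalized_scores({"G1": 0.3, "G2": 0.6}, ["G1", "G2", "G3"])
```

With α = 0.5, the non-seed gene G3 receives half the normalized score of the weakest seed, so every seed scores strictly higher than any non-seed node.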

NetRing
The NetRing measure, introduced for the first time in this work, is based on the concept of ring structure (Baronchelli and Loreto, 2006) generalized to a set of seed nodes. Starting from seed nodes, a partition of the graph into sub-graphs, or rings, is introduced with the following property:
$$ R(l) = \{\, i : \min_{j \in \Sigma} l_{ij} = l \,\}, $$
where $l_{ij}$ is the (unweighted) length of the shortest path from $i$ to $j$. For $l \ge 1$, $R(l)$ contains all the non-seed nodes with a minimal distance $l$ from at least one seed node. From the definition it follows that $R(0)$ is the set of seed nodes and that the rings are indexed by $l = 0, \dots, L$, where $L$ is the highest value of the minimal distance from non-seed nodes to seed nodes.
An initial rank $\tilde{r}_i$ is first computed by means of the association score; the NetRing measure $r_i$ of node $i$ is then defined separately for seed and non-seed nodes. The score for seed genes is a convex combination of the initial rank $\tilde{r}_i$ and the average of the initial rank of the neighbors of the node, so that seed nodes having many seed nodes as neighbors have a higher rank. The rank of non-seed nodes is obtained by summing the level of the ring and the average of two terms, i.e., the number of genes belonging to the same or higher rings and the sum of the ranks of genes in the lower ring corrected by the ring level. The correction is introduced to make the rank $r_j$ comparable with $\tilde{r}_j$. Additional important considerations about the NetRing measure can be found in Supplementary File 2.
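The ring partition itself can be obtained with a multi-source breadth-first search, as in the following sketch. Only the ring levels are computed; the rank formulas are not reproduced. The function names and the toy adjacency lists are hypothetical.

```python
from collections import deque

def ring_levels(adjacency, seeds):
    """Return {node: l}, the minimal unweighted distance to any seed."""
    level = {s: 0 for s in seeds}
    queue = deque(seeds)
    while queue:
        node = queue.popleft()
        for neighbor in adjacency[node]:
            if neighbor not in level:
                level[neighbor] = level[node] + 1
                queue.append(neighbor)
    return level

def rings(adjacency, seeds):
    """Group nodes by ring level: {l: set of nodes at minimal distance l}."""
    grouped = {}
    for node, l in ring_levels(adjacency, seeds).items():
        grouped.setdefault(l, set()).add(node)
    return grouped

# Toy graph: s1 - a - b - s2, with c attached to b.
adjacency = {
    "s1": ["a"],
    "a": ["s1", "b"],
    "b": ["a", "s2", "c"],
    "s2": ["b"],
    "c": ["b"],
}
R = rings(adjacency, ["s1", "s2"])
```

Here the two seeds form ring 0, nodes a and b form ring 1, and node c forms ring 2, so the highest ring level in this toy graph is 2.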

Results
The performance of NIAPU is tested on the ten disease datasets detailed in Section 2.1. A visual overview of the workflow is given in Figure 1. Section 3.1 is devoted to testing the performance of NIAPU (APU+NeDBIT) against the implementation of the APU labeling algorithm with two different sets of features commonly used when dealing with disease gene identification. The performances are investigated in terms of out-of-sample classification. Section 3.2 analyzes the performance of NIAPU in the identification of candidate disease genes. To this end, a subset of seed genes is masked out to see whether such genes are predicted as LP. Section 3.3 deals with comparing NIAPU with other disease gene identification algorithms, while Section 3.4 presents results from an enrichment analysis of the candidate disease genes obtained by the NIAPU methodology.

NeDBIT classification performances
The effectiveness of the NeDBIT features is tested by comparing NIAPU against the implementation of the APU labeling algorithm with two different sets of features: the first (PUDI), computed following Yang et al. (2012), is based on topological features (originally taken from Xu and Li, 2006) and functional information based on the semantic similarity of GO terms (originally taken from Wang et al., 2007); the second (TFO) includes simple topological, functional, and ontological features (see Supplementary Files 2 and 3). The comparison is carried out in terms of out-of-sample classification performance: the ten datasets detailed in Section 2.1 were split into a training set (70%) and a test set (30%), keeping class balance. Then, we trained the three ML algorithms defined in Step 6 of Section 2.2 (RF, SVM, and MLP) for the three different applications of the APU algorithm.
Results related to malignant neoplasm of breast are reported in Figure 2 in terms of confusion matrices. The comparison among TFO, PUDI, and NeDBIT features shows that the latter are far superior to the others. The joint usage of APU and NeDBIT features (NIAPU) succeeded in discriminating the class P from the rest of the genes and in better separating the pseudo-classes LP, WN, LN, and RN.
Regarding the pseudo-classes, the identification performances were also satisfactory using TFO and PUDI features, albeit with a drop in accuracy compared to NeDBIT. This highlights the effectiveness of the APU label assignment. RF and MLP delivered the best performances. Regarding SVM, LN samples were sometimes misclassified as either WN or RN.
Overall, for the P and RN classes, the NIAPU classification is almost perfect because the NeDBIT features allow those classes to be properly separated from the others: they grasp the topological aspects of the set of seed genes as a whole, assigning lower and lower weights to genes that are progressively "far" from the set of seed genes. For the rest of the classes, the performances are good but some genes are misclassified. This is due to the label assignment via quantiles, which introduces some arbitrary noise at the boundaries between quantiles.
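A class-balanced 70/30 split can be sketched with the standard library alone, as shown below. This is only an illustration of the splitting scheme; the function name, the random seed, and the toy labels are hypothetical, and in practice a library routine such as `train_test_split` from scikit-learn with its `stratify` argument serves the same purpose.

```python
import random
from collections import defaultdict

def stratified_split(labels, test_fraction=0.3, seed=0):
    """Return (train_indices, test_indices), keeping class proportions."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    train, test = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_fraction)
        test.extend(indices[:n_test])
        train.extend(indices[n_test:])
    return sorted(train), sorted(test)

# Toy labels: ten genes in each of the five classes.
toy_labels = [c for c in ("P", "LP", "WN", "LN", "RN") for _ in range(10)]
train_idx, test_idx = stratified_split(toy_labels)
```

Each class contributes three genes to the test set and seven to the training set, so the five classes remain equally represented in both parts.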
Results related to the other diseases are provided in Supplementary File 2, along with the results of a five-fold cross-validation study carried out for the three sets of features.

NIAPU performances in disease gene identification
We tested the ability of NIAPU to identify new candidate genes. We performed a validation by excluding 20% of the seed genes, setting them as unlabeled both in the computation of the NeDBIT features and in the APU labeling algorithm. We repeated the procedure five times with non-overlapping gene sets. We investigated whether NIAPU was able to properly classify the removed positive genes as LP. For brevity, only the results for malignant neoplasm of breast are reported in Table 1 (other diseases in Supplementary File 2). On average, around 46% of unlabeled seed genes fell in the LP class, while the rest fell in a decreasing classification trend toward the RN class. We also observed a clear correspondence between the labeling and the association score: the higher the score, the more likely the gene is to be found in the LP class. This underlines the influence of scores on the NeDBIT features. Analogous results can be found in Supplementary File 2 for the remaining diseases.
Aggregated results related to ML classification for malignant neoplasm of breast are reported in Table 2. All the classes were identified by RF and MLP with high scores, while SVM reported lower metrics, particularly with regard to the LN class. Therefore, NIAPU turned out to be robust also in more challenging settings with reduced seed gene sets.
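The masking protocol can be sketched as follows. The sketch assumes that the five non-overlapping gene sets are obtained by shuffling the seed genes and splitting them into five folds of roughly 20% each; the function names, the random seed, and the toy identifiers are hypothetical.

```python
import random

def masking_folds(seed_genes, n_folds=5, seed=0):
    """Split the seed genes into n_folds disjoint sets of roughly equal size."""
    genes = list(seed_genes)
    random.Random(seed).shuffle(genes)
    return [genes[i::n_folds] for i in range(n_folds)]

def lp_fraction(masked_genes, labels):
    """Fraction of masked seed genes that were assigned the LP label."""
    return sum(labels[g] == "LP" for g in masked_genes) / len(masked_genes)

# Toy run: ten seed genes, five folds of two genes each.
toy_seeds = [f"gene{i}" for i in range(10)]
folds = masking_folds(toy_seeds)
# In each run, the genes of one fold are treated as unlabeled, and the
# features and labels are recomputed from the remaining seeds.
remaining = [sorted(set(toy_seeds) - set(fold)) for fold in folds]
example = lp_fraction(["gene0", "gene1"], {"gene0": "LP", "gene1": "WN"})
```

Averaging `lp_fraction` over the five runs gives a recovery rate of the kind reported in Table 1.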

NIAPU vs. other disease gene identification tools
We compared the predictive performance in the identification of candidate disease genes of NIAPU against known gene discovery algorithms, namely DIAMOnD (Ghiassian et al., 2015), Markov clustering (MCL) (Enright et al., 2002; Sun et al., 2011), random walk with restart (RWR) (Köhler et al., 2008; Valdeolivas et al., 2019), two variants of GUILD (Guney and Oliva, 2012), one exploiting the NetCombo measure and the other based on Functional Flow (fFlow) (Nabieva et al., 2005), and ToppGene (Chen et al., 2009a) (relying on the implementation provided by the GUILD software). See Supplementary File 2 for a detailed description of these algorithms.
For this analysis, we relied on the extended GDA dataset provided by DisGeNET. We assigned the labels using NIAPU on the curated version of the dataset and then investigated whether the seed genes contained in the extended version (but not in the curated one) fell into the LP class. We considered the ranking retrieved by NIAPU at different quantile thresholds. In Figure 3, we report the results of this comparison in terms of F1 score. Most of the time, our methodology outperformed or was on par with the state-of-the-art algorithms for disease gene identification, often being the best-performing method when looking for a large number of candidate genes and showing comparable performance for smaller numbers. Indeed, DIAMOnD performs at its best when considering a low ratio (10-20%) of predicted genes, while NIAPU shows good performances both for low and high percentages of candidate genes, outperforming DIAMOnD in the latter case. In fact, as stated by the authors themselves, DIAMOnD becomes unreliable when exceeding 200 predicted genes (Ghiassian et al., 2015).
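The evaluation of a ranked list of candidates can be illustrated with the sketch below. It assumes that the F1 score is computed from the top-k candidates against the set of genes found only in the extended dataset, with k expressed as a fraction of that set's size; the exact evaluation protocol may differ. The function names and gene identifiers are hypothetical.

```python
def f1_at_k(ranked_candidates, true_genes, k):
    """F1 score of the top-k ranked candidates against the validation genes."""
    predicted = set(ranked_candidates[:k])
    true_genes = set(true_genes)
    hits = len(predicted & true_genes)
    if hits == 0:
        return 0.0
    precision = hits / len(predicted)
    recall = hits / len(true_genes)
    return 2 * precision * recall / (precision + recall)

def f1_at_fraction(ranked_candidates, true_genes, fraction):
    """Same score, with k set to a fraction of the number of validation genes."""
    k = round(fraction * len(set(true_genes)))
    return f1_at_k(ranked_candidates, true_genes, k)

# Toy ranking of six candidates against four validation genes.
ranking = ["g1", "g2", "g3", "g4", "g5", "g6"]
validation = {"g1", "g3", "g7", "g8"}
f1_half = f1_at_fraction(ranking, validation, 0.5)   # top 2 candidates
f1_full = f1_at_fraction(ranking, validation, 1.0)   # top 4 candidates
```

In the toy case, considering more candidates raises recall faster than it lowers precision, so the F1 score grows from the top 2 to the top 4 candidates.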

Enrichment analysis
For a further evaluation of our results, for each of the ten diseases considered, we performed a gene ontology/pathway/disease enrichment analysis of the first 100 predicted genes in the LP class from the validation on the extended GDA dataset. This analysis was performed using Enrichr (Chen et al., 2013; Kuleshov et al., 2016; Xie et al., 2021).
The selected LP genes do not correspond to any of the curated GDA disease genes; therefore, among the enriched diseases, we cannot expect to find the same disease for which the gene discovery process is carried out. Instead, among the enriched terms (diseases, GO terms, or pathways), we should be able to find diseases and biological processes that are related to the disease under scrutiny.
We report the enrichment analysis results in Table 3. In particular, we present the top enriched diseases or biological processes for each analyzed disease, together with references to literature that support such links.
Although not conclusive, the evidence in the literature of links and shared biological mechanisms between the analyzed diseases and the enriched diseases provides additional support for the validity and efficacy of the disease gene discovery process.

Discussions and conclusions
In this paper, we presented the NIAPU algorithm, which addresses the computational identification of previously unknown disease genes in the context of positive-unlabeled learning. The advantage of the proposed method is that it allows accurate characterization of the positive samples (P set), via the NeDBIT features, and refined control of the likely positive samples (LP set), via the APU labeling procedure. The LP set, extracted from the set of unlabeled elements, contains, with the highest probability, elements related to the disease of interest. Moreover, NIAPU turned out to be an effective labeling procedure, allowing machine learning models to be trained appropriately and to deliver highly accurate classification performances. As for disease gene identification, NIAPU proved to be effective in two different experiments. In the first one, masking out a subset of seed genes, it turned out that ~46% of those fell in the LP class. In the second one, assigning labels using NIAPU on the curated version of the DisGeNET dataset and then searching for the seed genes found only in the extended version, the NIAPU algorithm outperformed or was on par with the state-of-the-art algorithms for disease gene discovery.
It is worth noting that the NeDBIT features are designed to be able to use link-weighted and node-weighted graphs and that, by having increasingly accurate PPIs, we expect increasingly good results from the application of NIAPU. On the other hand, the NIAPU methodology is clearly influenced by the reliability of seed genes, the association score assigned to them, and the background network topology (here, the PPI network and its reliability).
Indeed, GDA datasets may be affected by disease-gene association bias due to the quantity of research on a given disease/trait. In this regard, a recent systematic review (De Magalhães, 2021) demonstrated that 87.7% of all genes could be associated with cancer. This indicates that given the massive amount of research focused on cancer, which also applies to other types of diseases, the definition "associated with" is to be checked carefully and critically.
The usage of datasets that are as error-free, unbiased, and reliable as possible (e.g., using an interactome validated in the specific pathological context, possibly with weighted PPIs) could potentially improve the classification performance of the method. In this regard, it is worth mentioning that an algorithm with the same theoretical grounding as NIAPU has been applied in different contexts (e.g., nephrology, gastroenterology, rare diseases) (Shahini et al., 2022a,b), paying particular attention to the selection of seed genes and reference interactomes.

Fig. 1. The complete NIAPU pipeline. PPI and GDAs are used to obtain a disease-related network. Features are extracted (Section 2.3) and APU is applied (Section 2.2) to assign new labels to train ML algorithms for the final gene classification. The new labels can be used for disease gene-discovery purposes (Section 3.3).

Fig. 2. Confusion matrices for multi-class classification on malignant neoplasm of breast (C0006142). The APU labeling and the newly defined NeDBIT features allow for a better and clearer distinction of the P class and the pseudo-classes.

Fig. 3. Gene discovery performances in terms of F1 score. Results are reported for six diseases for increasing numbers of candidate genes considered as a percentage of the total number of associated genes in the extended dataset, which is different for each disease. The rest of the diseases can be found in Supplementary File 2.

Table 1. Labeling of the unlabeled seed genes by NIAPU for malignant neoplasm of breast (C0006142). Results are reported as averages with standard deviations over the five runs (GDAS: association score S).

Table 2. Classification scores as pooled mean and standard deviation (over all the diseases). Five runs were performed for each disease, masking out 20% of seed genes.

Table 3. Enrichment analysis of the LP genes predicted for the ten diseases of interest. The top enriched diseases and GO terms are reported, along with notes about disease relationships and main reference articles.