Supervised learning is an accurate method for network-based gene classification

Abstract Background Assigning every human gene to specific functions, diseases and traits is a grand challenge in modern genetics. Key to addressing this challenge are computational methods, such as supervised learning and label propagation, that can leverage molecular interaction networks to predict gene attributes. In spite of being a popular machine-learning technique across fields, supervised learning has been applied only in a few network-based studies for predicting pathway-, phenotype- or disease-associated genes. It is unknown how supervised learning broadly performs across different networks and diverse gene classification tasks, and how it compares to label propagation, the widely benchmarked canonical approach for this problem. Results In this study, we present a comprehensive benchmarking of supervised learning for network-based gene classification, evaluating this approach and a classic label propagation technique on hundreds of diverse prediction tasks and multiple networks using stringent evaluation schemes. We demonstrate that supervised learning on a gene’s full network connectivity outperforms label propagaton and achieves high prediction accuracy by efficiently capturing local network properties, rivaling label propagation’s appeal for naturally using network topology. We further show that supervised learning on the full network is also superior to learning on node embeddings (derived using node2vec), an increasingly popular approach for concisely representing network connectivity. These results show that supervised learning is an accurate approach for prioritizing genes associated with diverse functions, diseases and traits and should be considered a staple of network-based gene classification workflows. Availability and implementation The datasets and the code used to reproduce the results and add new gene classification methods have been made freely available. Contact arjun@msu.edu Supplementary information Supplementary data are available at Bioinformatics online.


Introduction
A grand challenge in the post-genomic era is to characterize every gene across the genome in terms of the cellular pathways they participate in, and which multifactorial traits and diseases they are associated with.Computationally predicting the association of genes to pathways, traits or diseases-the task termed here as 'gene classification'-has been critical to this quest, helping prioritize candidates for experimental verification and for shedding light on poorly characterized genes (Bernardes and Pedreira, 2013;Jiang et al., 2016;Peña-Castillo et al., 2008;Piro and Cunto, 2012;Radivojac et al., 2013;Sharan et al., 2007;Yang et al., 2011).Key to the success of these methods has been the steady accumulation of large amounts of publicly available data collections, such as curated databases of genes and their various attributes (Buniello et al., 2019;Kanehisa et al., 2017Kanehisa et al., , 2019;;Kanehisa and Goto, 2000;Liberzon et al., 2011;Piñero et al., 2015Piñero et al., , 2017;;Smith et al., 2018;Subramanian et al., 2005;Wu et al., 2013;Xin et al., 2016), controlled vocabularies of biological terms organized into ontologies (Ashburner et al., 2000;Schriml et al., 2012;Smith et al., 2004;The Gene Ontology Consortium, 2019), high-throughput functional genomic assays (Athar et al., 2019;Edgar et al., 2002;Leinonen et al., 2011) and molecular interaction networks (Greene et al., 2015;Huang et al., 2018;Li et al., 2017;Stark et al., 2006;Szklarczyk et al., 2015).
While protein sequence and 3D structure are remarkably informative about the corresponding gene's molecular function (Altschul et al., 1990;Jiang et al., 2016;Radivojac et al., 2013;Sleator and Walsh, 2010;Whisstock and Lesk, 2003), the pathways or phenotypes that a gene might participate in significantly depends on the other genes that it works with in a context dependent manner.Molecular interaction networks-graphs with genes or proteins as nodes and their physical or functional relationships as edges-are powerful models for capturing the functional neighborhood of genes on a whole-genome scale (Karaoz et al., 2004;Leone and Pagnani, 2005;Vazquez et al., 2003).These networks are often constructed by aggregating multiple sources of information about gene interactions in a context-specific manner (Greene et al., 2015;Ideker and Sharan, 2008).Therefore, unsurprisingly, several studies have taken advantage of these graphs to perform network-based gene classification (Guan et al., 2010;Ko ¨hler et al., 2008;Leiserson et al., 2015;Park et al., 2013;Vanunu et al., 2010;Warde-Farley et al., 2010).
The canonical principle of network-based gene classification is 'guilt-by-association', the notion that proteins/genes that are strongly connected to each other in the network are likely to perform the same functions, and hence, participate in similar higher-level attributes, such as phenotypes and diseases (Wang et al., 2011).Instead of just aggregating 'local' information from direct neighbors (Schwikowski et al., 2000), this principle is better realized by propagating pathway or disease labels across the network to capture 'global' patterns, achieving state-of-the-art results (Ca ´ceres and Paccanaro, 2019; Cowen et al., 2017;Deng et al., 2004;Karaoz et al., 2004;Ko ¨hler et al., 2008;Komurov et al., 2010;Leiserson et al., 2015;Leone and Pagnani, 2005;Mostafavi et al., 2008;Murali et al., 2011;Nabieva et al., 2005;Page et al., 1999;Tsuda et al., 2005;Vanunu et al., 2010;Vazquez et al., 2003;Warde-Farley et al., 2010;Zhou et al., 2003;Zhu et al., 2003).These global approaches belong to a class of methods referred to here as 'label propagation'.Distinct from label propagation (LP) is another class of methods for gene classification that relies on the idea that network patterns characteristic of genes associated with a specific phenotype or pathway can be captured using supervised machine learning (Barutcuoglu et al., 2006;Greene et al., 2015;Guan et al., 2010;Krishnan et al., 2016;Lanckriet et al., 2004;Park et al., 2013).While this class of methods-referred to here as 'supervised learning'-has yielded promising results in a number of applications, how it broadly performs across different types of networks and diverse gene classification tasks is unknown.Consequently, supervised learning (SL) is used far less compared to LP for network-based gene classification.
The goal of this study is to perform a comprehensive, systematic benchmarking of SL for network-based gene classification across a number of genome-wide molecular networks and hundreds of diverse prediction tasks using meaningful evaluation schemes.Within this rigorous framework, we compare SL to a widely used, classic LP technique, testing both the original (adjacency matrix A) and a diffusionbased representation of the network (influence matrix I; Fig. 1).This combination results in four methods (listed with their earliest known references): LP on the adjacency matrix (LP-A) (Schwikowski et al., 2000), LP on the influence matrix (LP-I) (Page et al., 1999), SL on the adjacency matrix (SL-A) (Barutcuoglu et al., 2006) and SL on the influence matrix (SL-I) (Lanckriet et al., 2004).Additionally, we evaluate the performance of SL using node embeddings as features, as the use of node embeddings is burgeoning in network biology.
Our results demonstrate that SL outperforms LP for genefunction, gene-disease and gene-trait prediction.We also observe that SL captures local network properties as efficiently as LP, where both methods achieve more accurate predictions for genesets that are more tightly clustered in the network.Lastly, we show that SL using the full network connectivity is superior to using low-dimensional node embeddings as the features, which, in turn, is competitive to LP.

Networks
We chose a diverse set of undirected, human gene/protein networks based on criteria laid out in Huang et al. (2018) (Fig. 1): (i) networks constructed using high-or low-throughput data, (ii) the type of interactions the network was constructed from and (iii) if annotations were directly incorporated in constructing the network.We used versions of the networks that were released prior to 2017 so as not to bias the temporal holdout evaluations.We used all edge scores (weights) unless otherwise noted, and the nodes in all networks were mapped into Entrez genes using the MyGene.infodatabase (Wu et al., 2013;Xin et al., 2016).If the original node ID mapped to multiple Entrez IDs, we added edges between all possible mappings.The networks used in this study are BioGRID (Stark et al., 2006), the full STRING network (Szklarczyk et al., 2015), as well as the subset with just experimental support (referred to as STRING-EXP in this study), InBioMap (Li et al., 2017) and the tissue-naı ¨ve network from GIANT (Greene et al., 2015), referred to as GIANT-TN in this study.These networks cover a wide size range, with the number of nodes ranging from 14 089 to 25 689 and the number of edges ranging from 141 629 to 38 904 929.More information on the networks can be found in the Supplementary Section 1.1.

Network representations
We considered three distinct representations of molecular networks: the adjacency matrix, an influence matrix and low-dimensional node embeddings.Let G ¼ V; E; W ð Þdenote an undirected molecular network, where V is the set of vertices (genes), E is the set of edges (associations between genes) and W is the set of edge weights (the strengths of the associations).G can be represented as a weighted adjacency matrix A i; j ¼ W i; j , where A 2 R VÂV .G can also be represented as an influence matrix, F 2 R VÂV , which can capture both local and global structure of the network.F was obtained using a random walk with restart transformation kernel (Leiserson et al., 2015), where a is the restart parameter, I is the identity matrix and W D is the degree weighted adjacency matrix given by W D ¼ AD À1 , where D 2 R VÂV is a diagonal matrix of node degrees.A restart parameter of 0.85 was used for every network in this study.Detailed information of how the restart parameter was chosen can be found in Supplementary Section 1.2 and Figure S1.G can also be transformed into a low-dimensional representation through the process of node embedding.In this study, we used the node2vec algorithm (Grover and Leskovec, 2016), which borrows ideas from the word2vec algorithm (Mikolov et al., 2013a(Mikolov et al., , 2013b) ) used in natural language processing.The objective of node2vec is to find a low-dimensional representation of the adjacency matrix, E 2 R VÂd , where d ( V.This is done by optimizing the following log-probability objective function: where N S u ð Þ is the network neighborhood of node u generated through a sampling strategy S and e u ð Þ 2 R d is the feature vector of node u.In node2vec, the sampling strategy is based on random walks that are controlled using two parameters p and q, in which a high value of q keeps the walk local (a breadth-first search), and a high value of p encourages outward exploration (a depth-first search).The values of p and q were both set to 0.1 for every network in this study.Detailed information of how the node2vec hyperparameters were chosen can be found in the Supplementary Section 1.2.

Prediction methods
We compared the prediction performance across four specific methods across two classes, LP and SL.

Label propagation
LP methods are the most widely used methods in network-based gene classification and achieve state-of-the-art results (Cowen et al., 2017;Ko ¨hler et al., 2008).In this study, we considered two LP methods, LP-A and LP-I.First, we constructed a binary vector of ground-truth labels, x 2 R VÂ1 , where x i ¼ 1 if gene i is a positively labeled gene in the training set, and 0 otherwise.In LP-A, we constructed a score vector, S 2 R VÂ1 , denoting the predictions, where A is the adjacency matrix.Thus, the predicted score for a gene using LP-A is equal to the sum of the weights of the edges between the gene and its direct, positively labeled network neighbors.
In LP-I, the score vector, S, is generated using equation (3), except A is by replaced by F, the influence matrix [equation ( 1 )].In both LP-A and LP-I, only positive examples in the training set are used to calculate the score vector S to reflect how LP is typically used in practice (Cowen et al., 2017;Ko ¨hler et al., 2008;Picart-Armada et al., 2019).Both positive and negative examples in the test set are later used for evaluation.

Supervised learning
SL can be used for network-based gene classification by using each gene's network neighborhoods as feature vectors, along with gene labels, in a classification algorithm.Here, we used logistic regression with L2 regularization as the SL classification algorithm, which is a linear model that aims to minimize the following cost function (Pedregosa et al., 2011): where w 2 R m is the vector of weights for a model with m features, C determines the regularization strength, n the number of examples, y is the ground-truth label, X 2 R n Â m is the data matrix and c is the intercept.After training a model using the labeled genes in the training set, the learned model weights are used to classify the genes in the testing set, returning a prediction probability for these genes that is bounded between 0 and 1.The regularization parameter, C, was set to 1.0 for all models in this study.In this study, three different network-based gene-level feature vectors were used to train three different SL classifiers: the rows of the adjacency matrix (SL-A), the rows of the influence matrix (SL-I) and the rows of the node embedding matrix (SL-E).Model selection and hyperparameter tuning are described in detail in the Supplementary Section 1.2.

Geneset-collections
We curated a number of geneset-collections to test predictions on a diverse set of tasks: function, disease and trait (Fig. 1).Function prediction was defined as predicting genes associated with biological processes that are part of the Gene Ontology (referred to here as 'GOBP') (The Gene Ontology Consortium, 2019;Ashburner et al., 2000) obtained from MyGene.info (Wu et al., 2013;Xin et al., 2016), and pathways from the Kyoto Encyclopedia of Genes and Genomes (Kanehisa et al., 2017(Kanehisa et al., , 2019;;Kanehisa and Goto, 2000), referred to 'KEGGBP' since disease-related pathways were removed from the original KEGG annotations in the Molecular Signatures Database (Liberzon et al., 2011;Subramanian et al., 2005).Disease prediction was defined based on predicting genes associated with diseases in the DisGeNET database (Piñero et al., 2015(Piñero et al., , 2017)).Annotations from this database were divided into two separate geneset-collections: those that were manually-curated (referred to as 'DisGeNet' in this study) and those derived using the BeFree textmining tool (referred to as 'BeFree' in this study).Trait prediction was defined as predicting genes linked to human traits from Genome-wide Association Studies (GWAS), curated from a community challenge (Choobdar et al., 2019), and mammalian phenotypes (annotated to human genes) from the Mouse Gene Informatics (MGI) database (Smith et al., 2018).
Each of these six geneset-collections contained anywhere from about a hundred to tens of thousands of genesets that varied widely in specificity and redundancy.Therefore, each collection was preprocessed to ensure that the final set of prediction tasks from each source is specific, largely non-overlapping and not driven by multiattribute genes.First, if genesets in a collection corresponded to terms in an ontology (e.g.biological processes in the GOBP collection), annotations were propagated along the ontology structure to obtain a complete set of annotations for all genesets.Second, we removed genesets if the number of genes annotated to the geneset was above a certain threshold and then compared these genesets to each other in order to remove genesets that were highly-overlapping with other genesets in the collection, resulting in a set of specific, non-redundant genesets.Finally, individual genes that appeared in more than 10 of the remaining genesets in a collection were removed from all the genesets in that collection to remove multi-attribute (e.g.multi-functional) genes that are potentially easy to predict (Gillis and Pavlidis, 2011).Detailed information on geneset preprocessing and geneset attributes can be found in the Supplementary Section 1.3, Table S2 and Figure S3.
Selecting positive and negative examples: In each geneset-collection, for a given geneset, genes annotated to that set were designated as the set of positive examples.The SL methods additionally required a set of negative genes for each given geneset for training, and both SL and LP methods require a set of negative genes for each geneset for testing.A set of negative genes was generated by: (i) finding the union of all genes annotated to all genesets in the collection, (ii) removing genes annotated to the given geneset and (iii) removing genes annotated to any geneset in the collection which significantly overlapped with the given geneset (P-value <0.05 based on the onesided Fisher's exact test).

Validation schemes
We performed extensive and rigorous evaluations based on three validation schemes: temporal holdout, study-bias holdout and 5fold cross validation (5FCV).In temporal holdout, within a genesetcollection, genes that only had an annotation to any geneset in the collection after January 1, 2017 were considered test genes, and all other genes were considered training genes.Temporal holdout is the most stringent evaluation scheme for gene classification since it mimics the practical scenario of using current knowledge to predict the future and is the preferred evaluation method used in the CAFA challenges (Jiang et al., 2016;Radivojac et al., 2013).Since the Gene Ontology was the only source with clear date-stamps for all its annotations, temporal holdout was applied only to the GOBP geneset-collection.For study-bias holdout, genes were ranked by the number of PubMed articles they were mentioned in, obtained from Brown et al. (2015).The top two-thirds of the most-mentioned genes were considered training genes, and the rest of the leastmentioned genes were used for testing.Study-bias holdout mimics the real-world situation of learning from well-characterized genes to predict novel un(der)-characterized genes.The last validation scheme is the traditional 5FCV, where the genes are split into five equal folds in a stratified manner (i.e. in each split, the proportion of genes in the positive and negative classes is preserved).In all these schemes, only genesets with at least 10 positive genes in both the training and test sets were considered.More information on the validation schemes is available in the Supplementary Section 1.4.

Evaluation metrics
In this study, we considered three evaluation metrics: the area under the precision-recall curve (auPRC), the precision of the top-K ranked predictions (P@TopK) and, the area under the receiver-operator curve (auROC).For P@TopK, we set K equal to the number of ground-truth positives in the testing set.Since the standard auPRC and P@TopK scores are influenced by the prior probability of finding a positive example (equal to the proportion of positives to the total of positives and negatives), we expressed both metrics as the logarithm (base 2) of the ratio of the original metric to the prior.More details on the evaluation metrics can be found in the Supplementary Section 1.5.

Results
We systematically compare the performance of four gene classification methods (Fig. 1): SL-A, SL-I, LP-A and LP-I.We choose six geneset-collections that represent three prominent gene classification tasks: gene-function (GOBP, KEGGBP), gene-disease (DisGeNet, BeFree) and gene-trait (GWAS, MGI) prediction.We use three different validation schemes: temporal holdout (train on genes annotated before 2017 and test on genes annotated in 2017 or later; only done for GOBP as it has clear timestamps), holdout based on study bias (train on well-studied genes and predict on less-studied genes) and the traditional 5FCV.Temporal holdout and study-bias holdout validation schemes are presented in the main text as they are more stringent and reflective of real-world tasks as compared to 5FCV (Kahanda et al., 2015).To ascertain the robustness of the relative performance of the methods to the underlying network, we choose five different genome-scale molecular networks that differ in their content and construction.To be in concert with temporal holdout evaluation and curtail data leakage, all the networks used throughout this study are the latest versions released before 2017.We present evaluation results based on the auPRC in the main text and results based on the P@topK and auROC in the Supplementary Figures S4-S8.We note that the 5FCV, P@topK and auROC results in the Supplementary Material are, for the most part, consistent with the results presented in the main text of this study.
Our first analysis was to directly compare all four prediction methods against each other for each geneset in a given collection.For each geneset-collection-network combination, we rank the four methods per geneset (based on auPRC) using the standard competition ranking and calculate each method's average rank across all the genesets in the collection (Fig. 2).For function prediction, SL-A is the top-performing method by a wide margin (particularly clear based on GOBP temporal holdout), with SL-I being the second best method.For disease and trait prediction, SL-A and SL-I still outperform LP-I, but to a lesser extent.In all cases, LP-A is the worst performing method.The large performance difference between the SL and LP methods in the GOBP temporal holdout validation is noteworthy since temporal holdout is the most stringent validation scheme and the one employed in community challenges, such as CAFA (Jiang et al., 2016;Radivojac et al., 2013).
Following the observation that SL methods outperform LP methods based on relative ranking, we use a non-parametric paired test (Wicoxon signed-rank test) to statistically assess the difference between specific pairs of methods (Fig. 3A).For each geneset-collection-network combination, we compare the two methods in one class to the two methods in the other class (i.e.we compare SL-A to LP-A, SL-A to LP-I, SL-I to LP-A and SL-I to LP-I).Each comparison yields a P-value along with the number of genesets in the collection where one method outperforms the other.After correcting the four P-values for multiple hypothesis testing (Benjamini et al., 2006), if a method from one class outperforms both methods from the other A B C Fig. 2. Average rank across the four methods.Each point in each boxplot represents the average rank for a geneset-collection-network combination, obtained based on ranking the four methods in terms of performance for each geneset in a geneset-collection using the standard competition ranking.(A) Functional prediction tasks using GOBP temporal holdout, (B) functional prediction tasks using study-bias holdout for GOBP and KEGGBP and (C) disease and trait prediction tasks using study-bias holdout for DisGeNet, BeFree, GWAS and MGI.The results are shown for auPRC where different colors represent different networks and different marker styles represent the different geneset-collections.SL methods outperform LP methods for all prediction tasks class independently (in terms of the number winning genesets), and if both (corrected) P-values are <0.05,we consider a method to have significantly better performance compared to the entire other class.
Additionally, we track the percentage of times the SL methods outperform the LP methods across all four comparisons within a geneset-collection-network combination.
The results show that for function prediction SL is almost always significantly better than LP when considering auPRC (Fig. 3B).Based on temporal holdout on GOBP, both SL-A and SL-I are always significantly better than both LP methods.Based on study-bias holdout, in the 10 function prediction geneset-collections-network combinations using GOBP and KEGGBP, SL-A is a significantly better method 8 times (80%) and SL-I is a significantly better method 6 times (60%).Neither LP-I nor LP-A ever significantly outperforms the SL models.The performance of SL and LP is more comparable for disease and trait prediction, but SL methods still perform better in a larger fraction of genesets.For the 20 disease and trait genesetcollection-network combinations, SL-I is a significantly better method 8 times (40%), and SL-A is a significantly better method 6 times (30%), LP-I is a significantly better method once (5%), and LP-A is never a significantly better method.
To visually inspect not only the relative performance of all four methods, but to also see how well the models are performing in an absolute sense, we examined the boxplots of the auPRC values for every geneset-collection-network combination (Fig. 4).The first notable observation is that, regardless of the method, function prediction tasks have much better performance results than disease/trait prediction tasks (Fig. 4B).Based on temporal holdout for function prediction (GOBPtmp), SL-A is the top-performing model based on the highest median performance for every network.Additionally, for all networks except STRING-EXP, SL-I is the second best performing model.For the 10 combinations of five networks with GOBP and KEGGBP, the top method based on the highest median performance is an SL method all but once, with SL-A being the top model 7 times (70%), SL-I being the top model 2 times (20%, GOBP and KEGGBP on GIANT-TN) and LP-A being the top model once (10%, KEGGBP on STRING-EXP).As noted earlier, for disease and trait prediction, SL and LP methods have more comparable performance.Of the 20 geneset-collection-network combinations, each of SL-A, SL-I, LP-I and LP-A is the top method based on median performance 5 (25%), 10 (50%), 4 (20%) and 1 (5%) times, respectively.
Although the boxplots in Figure 4 can give an idea of effect sizes, to further quantify this, we looked at the ratios of auPRC values across all genesets (Supplementary Section 2.2 and Fig. S10).The results show that SL-A and SL-I both have a substantial effect size compared to LP-I for function prediction.Also, for all prediction tasks, the effect size of SL methods over LP-I is equal to or greater than the effect size of LP-I over LP-A, where LP-I is widely considered a much better model than LP-A and thus, the comparison between LP-I and LP-A can be viewed as a baseline effect size.
Among the two classes of network-based models-SL and LP-it is intuitively clear how LP directly uses network connections to propagate information from the positively nodes to other nodes close in the network.On the other hand, while SL is an accurate method for gene classification, it has not been studied if SL's performance is tied to any traditional notion of network connectivity.To shed light on this problem, we investigated the performance of SL-A and LP-I as a function of three different properties of individual genesets in a collection: the number of annotated genes, edge density (a measure of how tightly connected the geneset is within itself) and segregation (a measure of how isolated the geneset is from the rest of the network).While the performance of neither SL-A nor LP-I has a strong association with the size of the geneset, the Fig. 3. Testing for a statistically significant difference between SL and LP methods.(A) A key on interpreting the analysis.For each network-geneset combination, each method is compared to the two methods from the other class (i.e.SL-A versus LP-I, SL-A versus LP-A, SL-I versus LP-I, SL-I versus LP-A).If a method was found to be significantly better than both methods from the other class (Wilcoxon ranked-sum test with an FDR threshold of 0.05), the cell is annotated with that method.If both models in that class were found to be significantly better than the two methods in the other class, the cell is annotated in bold with just the class.The color scale represents the fraction of genesets that were higher for the SL methods across all four comparisons.The first column uses GOBP temporal holdout, whereas the remaining six columns use study-bias holdout.(B) SL methods show a statistically significant improvement over LP methods, especially for function prediction performance of SL-A has a strong positive correlation with both edge density and segregation of the geneset, similar to what is seen for LP-I (Fig. 5).For visual clarity, Figure 5 presents results for just the STRING network, but very similar results are seen in the other networks as well (Supplementary Fig. S9).Detailed information on how the geneset and network properties are calculated can be found in the Supplementary Section 1.3.
Finally, since machine learning on node embeddings is gaining popularity for network-based node classification, we compared the top SL and LP methods tested here to this approach.Specifically, we compared LP-I and SL-A to an SL method using embeddings (SL-E) obtained from the node2vec algorithm (Grover and Leskovec, 2016) (Fig. 6).For function prediction, we observe that SL-E substantially outperforms LP-I.For GOBP temporal holdout, SL-E is always significantly better than LP-I.For the GOBP and KEGGBP study-bias holdout, out of the 10 geneset-collection-network combinations, SL-E is significantly better than LP-I 5 times (50%), whereas the converse is true only once (10%).These patterns nearly reverse for the 20 disease/trait prediction tasks, with LP-I performing significantly better than SL-E 6 times (30%), and SL-E significantly outperforming LP-I 3 times (15%).The comparison between SL-E and SL-A showed that SL-A demonstrably outperforms SL-E for both function and disease/trait prediction tasks.Among the 30 genesetcollection-network combinations, SL-A is a significantly better model 20 times (67%), whereas SL-E comes out on top just once (3%).This shows that although methods that use node embeddings are a promising avenue of research, they should be compared to the strong baseline of SL-A when possible.

Discussion
We have conducted the first comprehensive benchmarking of SL for network-based gene classification, establishing it as a leading approach.. Further, to the best of our knowledge, neither the studies that propose new methods nor those that systematically compare existing approaches have directly compared the two classes of methods-SL and LP-against each other.Our work provides this systematic comparison and shows that SL methods demonstrably outperform LP methods for network-based gene classification, particularly for function prediction.
Both SL and LP methods are, in general, more accurate for function prediction than disease and trait prediction.This trend is likely due to the fact that molecular interaction networks are primarily intended, either through curation or reconstruction, to reflect biological relationships between genes/proteins as they pertain to 'normal' cellular function.The utility of network connectivity to gene-disease or gene-trait prediction is incidental to the information the network holds about gene-function associations.This notion is supported by the observation that genesets related to function genesets are more tightly clustered than disease and trait genesets in the genome-wide molecular networks used in this study (Supplementary Fig. S3).Further analysis of prediction accuracy of genesets as a function of their network connectivity lends credence to the use of network structure by SL (Fig. 5 and Supplementary Fig. S9).Part of LP's appeal, widespread use and development is this natural use of network topology to predict gene properties by diffusing information from characterized genes to uncharacterized genes in their network vicinity.Therefore, we expect that genes associated with There is no correlation between the number of genes in the geneset versus performance (A, D), but there is a strong correlation between the performance and the edge density (B, E) as well as segregation (C, F).The different colored dots represent function genesets (red, GOBP and KEGGBP), disease genesets (blue, DIGenet and BeFree) and trait genesets (black, GWAS and MGI).The vertical line is the 95% confidence interval.Similar trends can be seen for the other networks (Supplementary Fig. S9).(Color version of this figure is available at Bioinformatics online.)A B Fig. 6.Performance of SL-E versus LP-I and SL-A.We compare the performance of SL on the embedding matrix (SL-E) versus LP-I and SL-A using a Wilcoxon rankedsum test.The performance metric is auPRC, the color scale represents the fraction of terms that were higher for the SL-E model (with purple meaning that SL-E had a higher fraction of better performing genesets compared to either LP-I or SL-A) and an 'x' signifies that the P-value from the Wilcoxon test was below 0.05.(A) Shows that SL-E is quite competitive with the classic method of LP-I and (B) shows that SL-A outperforms SL-E in a majority of cases.(Color version of this figure is available at Bioinformatics online.)tightly clustered pathways, traits or diseases will be easier to predict using LP, which is observed in our analysis (Fig. 5 and Supplementary Fig. S9).On the other hand, since SL (based on the full network) is designed to use global gene connectivity, it has been unclear if there is any association between the local clustering of genesets and their prediction performance using SL.Here, we show that the performance of SL, across networks and types of prediction tasks, is highly correlated with local network clustering of the genes of interest (Fig. 5 and Supplementary Fig. S9).This result substantiates SL as an approach that can accurately predict gene attributes by taking advantage of local network connectivity.
While being accurate, training a SL model on the adjacency matrix (SL-A) can take some computational time and resources as the size of the molecular network increases, thus, considerably differing in speed for, say, STRING-EXP (14 089 nodes and 141 629 unweighted edges) and GIANT-TN (25 689 nodes and 38 904 929 weighted edges).Worthy of note in this context is the recent excitement in deriving node embeddings for each node in a network, concisely encoding its connectivity to all other nodes, and using them as features in SL algorithms for node classification (Cai et al., 2018;Cui et al., 2018;Goyal and Ferrara, 2018;Grover and Leskovec, 2016;Hamilton et al., 2017;Perozzi et al., 2014;Wang et al., 2016).Although we show that SL-A markedly outperforms SL-E (Fig. 6), the unique characteristics of SL-E methods call for further exploration.For instance, the greatly reduced number of features allows SL-E methods to be more readily applicable to classifiers more complex than logistic regression, such as deep neural networks, which are typically ill-suited for problems where the number of features is much greater than the number of training examples.Further, since the reduced number of features allow SL-E methods to be trained orders of magnitude faster than SL-A or SL-I, they can be easily incorporated into ensemble learning models, which combine the results from many shallow learning algorithms.Akin to LP (Valdeolivas et al., 2019;Warde-Farley et al., 2010;Zhao et al., 2019), node embeddings also offer a convenient route to incorporating multiple networks into SL approaches.While methods such as SL-I and SL-A may require concatenating the original networks or integrating them into a single network before learning, recent work has shown that SL-E-based methods can embed information from multiple molecular/heterogeneous networks and learn gene classifiers in tandem (Alshahrani and Hoehndorf, 2018;Ata et al., 2018;Bai et al., 2019;Cho et al., 2016;Gligorijevi c et al., 2018;Li et al., 2019b;Nelson et al., 2019;Yang et al., 2018;Zitnik and Leskovec, 2017).However, none of these studies have compared the variety of SL-E methods to learning directly on the adjacency matrix.Given our finding here that SL-A greatly outperforms SL-E for function, disease and trait prediction, we advise and urge that every new SL-E method should be compared to SL-A for network-based gene classification.
In past work, SL methods for gene classification have mostly relied on hand-crafting features from graph-theory metrics, such as degree and centrality measures, or combining metrics to expand the feature set, resulting in a feature set size of $30 or less (Li et al., 2019a;Zhang et al., 2016).We do not include a comparison to these types of methods in this study because predicting genes to functions or diseases based on generic network metrics, such as high degree, does not capture anything unique about specific functions or diseases.On the other hand, SL models with individual genes as features contain information biologically relevant to the specific prediction task (Lee et al., 2013(Lee et al., , 2019)).
Critical to all these conclusions is the rigorous preparation of diverse, specific prediction tasks and the choice of meaningful validation schemes and evaluation metrics.Temporal holdout and studybias holdout validations help faithfully capture the performance of the computational methods when a researcher uses them to prioritize novel uncharacterized genes in existing molecular networks for experimental validation based on a handful of currently known genes.Although we provide all the results for the auROC metric in the Supplementary Materials for completion (Supplementary Figs S4, S5 and S8), we base our conclusions on metrics driven by precision: auPRC and P@topK.While auROC is still commonly used in genomics, it is ill-suited to most biological prediction tasks including gene classification since they are highly imbalanced problems, with negative examples far outnumbering positive examples (Saito and Rehmsmeier, 2015).Optimizing for precision-based metrics, on the other hand, helps control for false-positives among the top candidates (Davis and Goadrich, 2006), an important consideration when providing a list of candidate genes for further study.Accompanying the results in this manuscript, we are providing our comprehensive evaluation framework in the form of data-networks, prediction tasks and evaluation splits-on Zenodo and the underlying code on Github to enable other researchers to not only reproduce our results but also to add new network-based gene classification methods for comparison.Together, the data and the code provide the community a systematic framework to conduct gene classification benchmarking studies.See 'Availability of data and materials' for more information.
In this study, we have presented conclusive evidence that SL is an accurate method for gene classification.However, there exist many possible avenues for future exploration including studying how negative example selection affects gene classification.Although negative example selection has been studied for function prediction in a few studies (Youngs et al., 2013(Youngs et al., , 2014)), there is room for exhaustive testing of how to best generate and incorporate negative examples for gene classification in both LP and SL methods.To begin to address this, we included an analysis using negative examples for LP (Supplementary Section 2.3 and Figs S11-S14), where we show using negative examples slightly improves the performance of LP models, but not enough to change any of the findings in this work.Another avenue of exploration is performing a large model selection and hyperparameter tuning benchmarking study, including more non-linear models and embedding techniques.Lastly, it will be of interest to try network-based SL on other types of tasks, such as predicting enhancer roles in gene regulatory networks or predicting amino acid residue properties from 3D protein structures.
In conclusion, we have established that SL outperforms LP for network-based gene classification across networks and prediction tasks (functions, diseases and traits).We show that SL, in which every gene is its own feature, is able to capture network information just as well as LP.Finally, we show that SL-A demonstrably outperforms SL using node embeddings, and thus we strongly recommend that future work on using node embeddings for gene classification draws a comparison to using SL-A.

Fig. 1 .
Fig. 1.Workflow for gene classification pipeline.Four methods are compared: SL-A, SL-I, LP-A and LP-I.Model performance on a variety of gene classification tasks is evaluated over a number of different molecular networks, validation schemes and evaluation metrics.Additionally, the performance of SL using node embeddings as features (SL-E) is evaluated (not shown in this figure)

Fig. 4 .
Fig. 4. Boxplots for performance across all geneset-collection-network combinations.(A)The performance for each individual geneset-collection-network combination is compared across the four methods; SL-A (red), SL-I (light red), LP-I (blue) and LP-A (light blue).The methods are ranked by median value with the highest scoring method on the left.Results show SL methods outperform LP methods, especially for function prediction.(B) Each point in the plot is the median value from one of the boxplots in (A).This shows that both SL and LP methods perform better for function prediction compared to disease/trait prediction.(Color version of this figure is available at Bioinformatics online.)

Fig. 5 .
Fig. 5. Performance versus network/geneset properties.SL-A (A-C) is able to capture network information as efficiently as LP-I (D-F), for the STRING network.There is no correlation between the number of genes in the geneset versus performance (A, D), but there is a strong correlation between the performance and the edge density (B, E) as well as segregation (C, F).The different colored dots represent function genesets (red, GOBP and KEGGBP), disease genesets (blue, DIGenet and BeFree) and trait genesets (black, GWAS and MGI).The vertical line is the 95% confidence interval.Similar trends can be seen for the other networks (Supplementary Fig.S9).(Color version of this figure is available at Bioinformatics online.)