Pitfalls of machine learning models for protein–protein interaction networks

Abstract

Motivation: Protein–protein interactions (PPIs) are essential to understanding biological pathways as well as their roles in development and disease. Computational tools based on classic machine learning have been successful at predicting PPIs in silico, but the lack of consistent and reliable frameworks for this task has led to network models that are difficult to compare, and to discrepancies between algorithms that remain unexplained.

Results: To better understand the inference mechanisms that underpin these models, we designed an open-source benchmarking framework that accounts for a range of biological and statistical pitfalls while facilitating reproducibility. We use it to shed light on the impact of network topology and on how different algorithms deal with highly connected proteins. By studying functional genomics-based and sequence-based models on human PPIs, we show their complementarity: the former perform best on lone proteins while the latter specialise in interactions involving hubs. We also show that algorithm design has little impact on performance with functional genomics data. We replicate our results between human and S. cerevisiae data and demonstrate that models using functional genomics are better suited to PPI prediction across species. With rapidly increasing amounts of sequence and functional genomics data, our study provides a principled foundation for the future construction, comparison, and application of PPI networks.

Availability and implementation: The code and data are available on GitHub: https://github.com/Llannelongue/B4PPI.


Protein–protein interaction data
To train machine learning algorithms, the quality of the gold standard is paramount. Data on PPIs were obtained from IntAct (Orchard et al. 2014) and downloaded from the EMBL-EBI FTP server (timestamp: 15/10/2021). We restricted the data to human heterodimeric protein–protein interactions with UniProt IDs. To reduce the risk of false positives, we removed complex expansions (where the pairwise interactions within a complex are unreliable) and interactions based on colocalisation only. This quality-control step leaves 128,790 PPIs, covering 15,506 proteins (out of 20,386 in UniProt). Based on this dataset, we created an index of the number of recorded interactions per protein and compiled a list of hubs (highly connected proteins). In line with the literature, hubs are defined as the 20% of proteins with the most interactions (Jin et al. 2007), which here corresponds to proteins with more than 21 partners. The quality of the interactions is assessed further by looking at the MIscore (IntAct User Guide), a quality score based on the manual curation of the interactions and annotations of the HUPO PSI-MI consortium, which takes into account the detection method, the interaction type and the number of publications reporting the interaction. For PPIs with multiple entries, the highest MIscore was used. The distribution of MIscores in the dataset (Figure 1) shows a natural threshold at 0.47, which restricts the dataset to 78,229 interactions covering 12,026 proteins. We also ran the same analyses without filtering on MIscores (i.e. using all 128,790 interactions) and found that all the results presented here held true.
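For illustration, a minimal pandas sketch of the MIscore filtering and hub definition; the file name and the columns uniprot_A, uniprot_B and miscore are assumptions, not the released data format:

```python
import pandas as pd

# Hypothetical input: one row per curated IntAct pair, with columns
# "uniprot_A", "uniprot_B" and "miscore" (all names are assumptions).
ppis = pd.read_csv("intact_human_filtered.csv")

# Keep only pairs at or above the MIscore threshold visible in Figure 1.
ppis = ppis[ppis["miscore"] >= 0.47]

# Count recorded interaction partners per protein.
degree = pd.concat([ppis["uniprot_A"], ppis["uniprot_B"]]).value_counts()

# Hubs: the 20% most-connected proteins (Jin et al. 2007); on this
# dataset that corresponds to proteins with more than 21 partners.
hubs = set(degree[degree > degree.quantile(0.80)].index)
```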

Functional genomics annotations and amino acid sequences
Protein sequences in humans are well documented and can be obtained from UniProt, but FG features can be more challenging, as they should be diverse (i.e. cover a wide range of properties), of high quality, and have high coverage (i.e. few missing proteins). For the same reasons as described above, aggregated, manually curated, and professionally reviewed databases are preferred. Based on features that have previously been used successfully for this task, it is relevant to include information about cellular and tissue localisation, biological functions, and gene expression patterns (Jansen 2003; Ben-Hur and Noble 2005; Zhang et al. 2012; Kotlyar et al. 2015).
One of the main databases on proteins is UniProt (The UniProt Consortium 2021), and in particular its knowledgebase UniProtKB. Swiss-Prot, the section of UniProtKB that is reviewed and manually curated, is used in this work to ensure optimal quality. The data from Swiss-Prot were downloaded through the UniProt API, restricting to reviewed, non-obsolete, human proteins (last download: 09/11/2021). The different columns were then cleaned to extract the information of interest in a standardised format, and we use UniProt IDs throughout. The database provides information for 20,386 proteins; more details about each feature are given in Table 1. UniProt's API is also used to map UniProt IDs between different databases and to map outdated IDs. In particular, we extract amino acid sequences for each protein, more than 95% of which come from the translation of coding sequences submitted to the International Nucleotide Sequence Database Collaboration (Arita et al. 2021). Annotated domains and motifs are also included in the database. Additionally, we extract gene ontology (GO) annotations of biological processes, cellular components and molecular functions. For each protein, each of the FG features is represented as a bag-of-words, i.e. a sparse vector whose length is the number of annotations in the database.
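A minimal sketch of this bag-of-words encoding with scikit-learn, using invented protein IDs and GO terms:

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy annotations: each protein's GO biological-process terms as a
# space-separated string (protein IDs and GO terms invented here).
go_bp = {
    "P12345": "GO:0006915 GO:0007165",
    "Q67890": "GO:0007165 GO:0008152",
}

# Binary bag-of-words: one sparse vector per protein, with one
# dimension per distinct annotation observed in the database.
vectorizer = CountVectorizer(analyzer=str.split, binary=True)
X = vectorizer.fit_transform(go_bp.values())  # sparse, shape (2, 3) here
```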
When working with gene expression data, both biological and technical noise need to be accounted for correctly. The Bgee public repository (Bastian et al. 2021) addresses this by aggregating curated, standardised gene expression patterns from healthy, wild-type samples. The human data come mainly from GTEx v6 (phs000424.v6.p1), with an added layer of manual curation to remove unhealthy subjects. For each gene, the final data provide binary calls of presence or absence of expression for each combination of anatomical entity and developmental stage. We downloaded the database from the Bgee FTP server (version 14.2) and obtained information for 59,777 genes, 320 anatomical entities and 33 developmental stages, which leads to 1,147 stage/entity combinations. The Bgee entries are matched to UniProt IDs using UniProt's own mapping table.
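A minimal sketch of how such binary calls can be reshaped into one expression vector per gene (table layout and column names are assumptions):

```python
import pandas as pd

# Toy Bgee-style long table of binary expression calls: one row per
# gene / anatomical entity / developmental stage combination.
calls = pd.DataFrame({
    "gene":    ["ENSG0001", "ENSG0001", "ENSG0002"],
    "entity":  ["brain",    "liver",    "brain"],
    "stage":   ["adult",    "adult",    "adult"],
    "present": [1, 0, 1],
})

# One binary vector per gene over the stage/entity combinations
# (1,147 of them in the full dataset).
calls["combo"] = calls["entity"] + "/" + calls["stage"]
expr = calls.pivot_table(index="gene", columns="combo",
                         values="present", fill_value=0)
```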
The Human Protein Atlas (HPA) (Uhlén et al. 2015; Thul et al. 2017) provides data mapping human proteins to tissues and cells. In particular, we used the Tissue Atlas (Uhlén et al. 2015), which presents the distribution of proteins across tissues and cell types, and the Cell Atlas (Thul et al. 2017), which contains the distribution across subcellular locations. The Tissue Atlas contains data similar to Bgee, but the overlap is likely to be limited as the two databases only share GTEx RNA-seq data. While Bgee has a more thorough curation process, HPA contains many original in-house experimental results, which justifies the inclusion of both data sources. We downloaded the HPA data from their website (release 20.1, Ensembl version 92.38). The data in HPA are identified by Ensembl gene IDs, which are mapped to UniProt IDs using UniProt's API. We restricted the dataset to the reviewed proteins present in Swiss-Prot and, to ensure the quality of annotations, discarded the entries HPA annotated as "uncertain". For the tissue IHC data, we mapped expression levels to numerical values (high = 3, medium = 2, low = 1 and "not detected" = -1), with untested tissues being mapped to 0. Similar pre-processing was used for the consensus RNA-seq data and the subcellular locations.
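A minimal sketch of this mapping on a toy HPA-style table (column names are assumptions for illustration):

```python
import pandas as pd

# Toy HPA-style tissue IHC table; the columns "Gene", "Tissue",
# "Level" and "Reliability" are assumed names.
hpa = pd.DataFrame({
    "Gene":        ["ENSG0001", "ENSG0001", "ENSG0002"],
    "Tissue":      ["liver", "brain", "liver"],
    "Level":       ["High", "Not detected", "Low"],
    "Reliability": ["Enhanced", "Enhanced", "Uncertain"],
})

# Discard uncertain annotations, then map expression levels to numbers.
levels = {"High": 3, "Medium": 2, "Low": 1, "Not detected": -1}
hpa = hpa[hpa["Reliability"] != "Uncertain"]
hpa["value"] = hpa["Level"].map(levels)

# Wide matrix: one row per gene, one column per tissue; untested
# tissues (absent rows) become 0.
matrix = hpa.pivot_table(index="Gene", columns="Tissue",
                         values="value", fill_value=0)
```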

Pre-processing to measure feature similarity
For each protein, each FG feature was represented as a vector whose length is the number of annotations.
To measure the feature-specific similarity between two proteins, we compared the two vectors using cosine similarity (Singhal 2001), a popular tool widely used for similar tasks in Natural Language Processing. For two vectors $u = (u_i)$ and $v = (v_i)$, their cosine similarity $\cos(u, v)$ is:

$$\cos(u, v) = \frac{u \cdot v}{\|u\|\,\|v\|} = \frac{\sum_i u_i v_i}{\sqrt{\sum_i u_i^2}\,\sqrt{\sum_i v_i^2}}$$

As a result, for each of the 207,784,305 possible pairs of proteins, we obtained 12 similarity features: biological processes, cellular components, molecular function, domains and motifs from UniProt; gene expression from Bgee; tissue/cell expression, tissue expression, RNA-seq expression and subcellular locations from the Human Protein Atlas (Table 1).
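A minimal sketch of this computation on toy sparse vectors:

```python
from scipy.sparse import csr_matrix
from sklearn.metrics.pairwise import cosine_similarity

# Two toy annotation vectors (sparse, as in the bag-of-words encoding).
u = csr_matrix([[1, 0, 1, 1]])
v = csr_matrix([[1, 1, 0, 1]])

sim = cosine_similarity(u, v)[0, 0]  # (u . v) / (||u|| ||v||) = 2/3
```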

Creation of the gold standard
The PPIs obtained from IntAct are divided between a training set and two testing sets. First, a set of 1,562 proteins (13%) was randomly set aside to ensure that some unseen proteins are present in the testing sets; the necessity of this is shown in Supplementary Figure 2. This percentage was chosen as it gives a train/test ratio of ~70%/30%, which is standard when data are limited. The dataset was then randomly divided under this constraint, yielding 53,331 PPIs in the training set, 12,449 in T1 and the same number in T2 (Supplementary Table 1). We tested the sensitivity of the results to the division parameters by creating new gold standards where 10% and 15% of proteins were set aside for testing (instead of the 13% mentioned above) and found that all the results remained identical.
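A toy sketch of the constraint (all variable names invented): pairs involving a held-out protein are forced into the testing sets, and only the remaining pairs are split at random to reach the desired ratio:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy proteins and interaction pairs, standing in for the real gold standard.
proteins = [f"P{i:05d}" for i in range(100)]
ppis = {tuple(rng.choice(proteins, size=2, replace=False)) for _ in range(300)}

held_out = set(rng.choice(proteins, size=13, replace=False))  # ~13%

forced_test = [p for p in ppis if held_out & set(p)]      # must go to testing
splittable  = [p for p in ppis if not held_out & set(p)]  # split ~70/30 at random
```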
The negative examples (i.e. non-interacting protein pairs) are obtained by random sampling among all possible pairs, excluding any pair that has been observed experimentally, to limit the risk of false negatives. For the training set, balanced sampling is used (Yu et al. 2010) to favour learning, which means that the probability of sampling a protein for the negative set is proportional to its frequency in the positive set. For T1 and T2, we used uniform sampling (all proteins have the same probability of being sampled) to limit sampling bias. The training set and T1 both have 50% positive examples, while T2 has only 1% of interacting pairs (Supplementary Table 1).
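A minimal sketch of both sampling schemes; this is a simplified illustration, not the exact implementation in the released code:

```python
import numpy as np

def sample_negatives(positives, n, balanced, rng):
    """Draw n non-observed pairs. With balanced=True, proteins are
    sampled with probability proportional to their frequency in the
    positive set (Yu et al. 2010); otherwise uniformly. Observed
    pairs are excluded to limit false negatives."""
    pool = [p for pair in positives for p in pair]
    if not balanced:
        pool = sorted(set(pool))  # uniform: every protein once
    observed = {frozenset(pair) for pair in positives}
    negatives = set()
    while len(negatives) < n:
        a, b = rng.choice(pool, size=2)
        if a != b and frozenset((a, b)) not in observed:
            negatives.add(frozenset((a, b)))
    return [tuple(pair) for pair in negatives]

rng = np.random.default_rng(0)
positives = [("P1", "P2"), ("P1", "P3"), ("P2", "P4")]
train_negs = sample_negatives(positives, 3, balanced=True, rng=rng)
```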
To confirm that the results obtained are not due to sequence homology between the training and testing sets, we designed two additional testing scenarios. For the first one, following a commonly used rule (Greener et al. 2022; Hou et al. 2022), we limited sequence similarity between any distinct training and testing proteins to 25%. The second scenario was more restrictive and excluded all paralogs of training proteins from the testing set. Supplementary Figure 12 compares the different models' performance in these scenarios and confirms that the results reported are not due to sequence homology.
To investigate how models deal with different network topologies, especially hubs and lone proteins, we had to create a separate testing set to ensure a sufficient sample size in each category (hub-hub, hub-lone and lone-lone interactions). We did so by aggregating PPIs from T1 and T2 and using balanced sampling for the non-interacting pairs. This results in 49,796 pairs (50% positive) (Supplementary Table 2).
We quantified the difference in FG annotations between hubs and non-hubs. For example, in the training set, hubs have on average 11.6 annotations for biological processes (a significant feature in the logistic regression model discussed in Results) while non-hubs only have 5.8 (median 6 vs 3). The same phenomenon is observed for cellular components (6.2 vs 3.6 annotations on average).

S. cerevisiae data
The pipeline described above was also followed for the S. cerevisiae data. UniProt lists 6,721 yeast proteins with the same information as for humans (Supplementary Table 3), but HPA and Bgee do not include data for this organism. PPIs were obtained from IntAct following the same procedure, although no selection based on MIscores was made, given the absence of an obvious threshold in their distribution (Supplementary Figure 10). The final PPI dataset comprised 43,068 interactions covering 5,679 proteins.
The split between training and testing sets was done similarly, by setting aside 737 proteins for testing and then randomly allocating PPIs to keep 30,369 PPIs for training (70% of the gold standard). Because there are fewer data on yeast, and only one testing set is needed to replicate the analysis conducted on humans, the remaining 12,699 PPIs were not divided further between T1 and T2; if the goal were to measure the generalisability of a yeast model, this could easily be done.

Training
FG-based machine learning models were trained using the scikit-learn library (Pedregosa et al. 2011). For models that cannot deal with missing data, mean imputation was used (Supplementary Table 4). Hyper-parameter search was done using Weights & Biases' Bayesian method (Biewald 2020) to find the optimal settings of each algorithm in a reasonable time. All hyper-parameter choices are in Supplementary Table 4 and Supplementary Table 5.
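A minimal sketch of the imputation step, with placeholder feature matrices:

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Placeholder feature matrices with NaNs marking missing values.
X_train = np.array([[0.8, np.nan], [0.1, 0.5], [0.4, 0.9]])
X_test = np.array([[np.nan, 0.7]])

imputer = SimpleImputer(strategy="mean")
X_train = imputer.fit_transform(X_train)  # fit on training data only
X_test = imputer.transform(X_test)        # avoids test-set leakage
```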
Deep learning models were trained using PyTorch Lightning (Falcon and The PyTorch Lightning team 2019; Paszke et al. 2019). The Siamese architecture (Bromley et al. 1994; Chopra, Hadsell and LeCun 2005) was composed of a bidirectional Gated Recurrent Unit (GRU) (Cho et al. 2014) followed by a linear output (Figure 2). Long Short-Term Memory networks (LSTMs) (Hochreiter and Schmidhuber 1997) and Convolutional Neural Networks (CNNs) (LeCun et al. 1990) were also tested, but the GRU was preferred for its runtime efficiency and its ability to handle proteins of varying lengths. Full parameters are in Supplementary Table 4, Supplementary Table 5 and in the open-source code.
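A minimal PyTorch sketch of such a Siamese encoder; the vocabulary size, layer widths and the way the two embeddings are combined are illustrative assumptions, not the published hyper-parameters:

```python
import torch
from torch import nn

class SiameseGRU(nn.Module):
    """Shared bidirectional GRU encoding each amino-acid sequence,
    followed by a linear output scoring the pair (cf. Figure 2)."""

    def __init__(self, n_tokens=26, emb_dim=16, hidden=64):
        super().__init__()
        self.embed = nn.Embedding(n_tokens, emb_dim, padding_idx=0)
        self.gru = nn.GRU(emb_dim, hidden, batch_first=True,
                          bidirectional=True)
        self.out = nn.Linear(4 * hidden, 1)  # 2 proteins x 2 directions

    def encode(self, seq):                # seq: (batch, length) token IDs
        _, h = self.gru(self.embed(seq))  # h: (2, batch, hidden)
        return torch.cat([h[0], h[1]], dim=1)

    def forward(self, seq_a, seq_b):      # identical weights on both sides
        pair = torch.cat([self.encode(seq_a), self.encode(seq_b)], dim=1)
        return self.out(pair).squeeze(-1)  # interaction logit
```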

Evaluation
The Receiver Operating Characteristic (ROC) and Precision-Recall (PR) curves are complementary options for evaluating PPI prediction. The ROC curve is unaffected by the prevalence of interacting pairs, a benefit since the true prevalence of PPIs is largely unknown; however, this also means that both classes are weighted equally, whereas interacting pairs are often of greater interest than non-interacting ones. The PR curve addresses this, as precision puts the emphasis on positive examples. It has also been shown that ROC curves tend to overestimate the performance of PPI prediction tools (Wang et al. 2023), yet most published models still report them, which justifies the inclusion of both metrics.
Both curves are reported alongside their respective Areas Under the Curve (AUC). To statistically compare ROC curves on the same testing set, we used the DeLong nonparametric test (DeLong, DeLong and Clarke-Pearson 1988) and reported the p-value. We corrected for multiple testing by using a conservative significance threshold of 5 × 10⁻⁴, corresponding to a Bonferroni correction for 100 pairwise comparisons (Neyman and Pearson 1928).
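For reference, a minimal scikit-learn sketch computing both AUCs on placeholder labels and scores (the DeLong comparison is not part of scikit-learn and is omitted here):

```python
from sklearn.metrics import auc, precision_recall_curve, roc_auc_score

# Placeholder 0/1 labels and predicted interaction scores for a testing set.
y_true = [0, 0, 1, 1, 1, 0]
y_score = [0.1, 0.4, 0.35, 0.8, 0.7, 0.2]

roc_auc = roc_auc_score(y_true, y_score)
precision, recall, _ = precision_recall_curve(y_true, y_score)
pr_auc = auc(recall, precision)  # area under the PR curve
```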

Environmental impact statement
We did our best to minimise the greenhouse gas emissions related to this project and, using the Green Algorithms calculator (v2.1) (Lannelongue, Grealey and Inouye 2021), estimated its carbon footprint at 51 kgCO2e, which corresponds to 4.7 tree-years.

Figure 1: Distribution of the MIscore in IntAct.

Figure 2: Diagram of the deep learning architecture used to predict interactions from a pair of protein sequences.

Table 1: Features used to train human models (GO = Gene Ontology).