ProteinPrompt: a webserver for predicting protein–protein interactions

Abstract

Motivation: Protein–protein interactions (PPIs) play an essential role in a great variety of cellular processes and are therefore of significant interest for the design of new therapeutic compounds as well as for the identification of side effects due to unexpected binding. Here, we present ProteinPrompt, a webserver that uses machine learning algorithms to predict specific, currently unknown PPIs. Our tool is designed to quickly and reliably predict contact propensities based on an input sequence in order to scan large sequence libraries for potential binding partners, with the goal of accelerating and assuring the quality of the laborious process of drug target identification.

Results: We collected and thoroughly filtered a comprehensive database of known binders from several sources, which is available for download. ProteinPrompt provides two complementary search methods of similar accuracy for comparison and consensus building. The default method is a random forest (RF) algorithm that uses the autocorrelations of seven amino acid scales. Alternatively, a graph neural network (GNN) implementation can be selected. Additionally, a consensus prediction is available. For each query sequence, potential binding partners are identified from a protein sequence database. The proteomes of several organisms are available and can be searched for binders. To evaluate the predictive power of the algorithms, we prepared a test dataset that was rigorously filtered for redundancy; no sequence pairs similar to the ones used for training were included. On this challenging dataset, the RF method achieved an accuracy of 0.88 and an area under the curve of 0.95, while the GNN achieved an accuracy of 0.86. Since the underlying learning approaches are unrelated, comparing the results of the RF and the GNN reduces the likelihood of errors. The consensus reached an accuracy of 0.89.

Availability and implementation: ProteinPrompt is available online at http://proteinformatics.org/ProteinPrompt, where the training and test data used to optimize the methods are also available. The server makes it possible to scan the human proteome for potential binding partners of an input sequence within minutes. For local offline usage, we furthermore created a ProteinPrompt Docker image which allows for batch submission: https://gitlab.hzdr.de/proteinprompt/ProteinPrompt. In conclusion, we offer a fast, accurate, easy-to-use online service for predicting binding partners from an input sequence.


Collecting data points
In total, we collected 41,482 positive protein–protein pairs with at most 50% sequence identity and 39,457 negative protein–protein pairs. The positive set contains 10,538 individual proteins and the negative set contains 13,304 individual proteins. A summary of the respective test and training sets is given in Table S1, and an overview of the overlapping proteins among the four datasets is depicted in Figure S1. The following sections investigate the datasets in further detail.

Table S1: Summary of individual protein pairs and number of unique proteins per dataset.
Figure S1: Number of overlapping proteins between the different datasets. Numbers in parentheses denote the number of unique proteins in each dataset.
Table S2: Percentage of protein pairs with a feature vector correlation above a threshold of 0.8 or 0.9.

Feature vector correlation   Sequence similarity 40%   Sequence similarity 50%
> 0.8                        0.00074%                  0.0065%
> 0.9                        0.0001%                   0.00081%

Figure S2: Boxplot representing the distribution of all pairwise feature vector correlations from protein datasets that have been filtered based on 50% and 40% sequence identity.

Analysis of the CD-hit threshold
To quantify the impact of a more stringent filtering of proteins with respect to their feature vectors, we calculated pairwise feature vector correlations within a set of proteins with at most 50% sequence identity (as used in our study) and within a set of proteins with at most 40% sequence identity (the most stringent threshold offered by CD-hit).
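For orientation, each protein's part of the feature vector consists of 30 autocorrelation values per scale (seven scales, 210 values per protein; see the permutation analysis below). The following sketch shows one common normalized autocorrelation variant, with the Kyte-Doolittle hydrophobicity scale as an illustrative stand-in; the exact scales and autocorrelation definition used by ProteinPrompt are specified in the main manuscript:

    import numpy as np

    # Illustrative stand-in: Kyte-Doolittle hydrophobicity in the amino acid
    # order ARNDCQEGHILKMFPSTWYV. The actual seven scales and the exact
    # autocorrelation definition are given in the main manuscript.
    KD = dict(zip("ARNDCQEGHILKMFPSTWYV",
                  [1.8, -4.5, -3.5, -3.5, 2.5, -3.5, -3.5, -0.4, -3.2, 4.5,
                   3.8, -3.9, 1.9, 2.8, -1.6, -0.8, -0.7, -0.9, -1.3, 4.2]))

    def autocorrelation(sequence, scale=KD, max_lag=30):
        """30 autocorrelation values of one residue scale along a sequence
        (a common normalized variant; sequence must be longer than max_lag)."""
        s = np.array([scale[aa] for aa in sequence], dtype=float)
        s = (s - s.mean()) / s.std()  # standardize the per-residue profile
        return np.array([np.mean(s[:-d] * s[d:]) for d in range(1, max_lag + 1)])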
Starting from a protein sequence database containing 31,867 proteins (as described in Section 2.1 of the manuscript), we ran CD-hit with cutoffs of 50% and 40%, generating datasets containing 15,728 and 13,719 distinct protein clusters, respectively. For each of these clusters, we retrieved exactly one reference protein that shares at most 50% or 40% sequence identity with all other reference proteins, depending on the dataset.
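These runs can be reproduced with plain CD-hit calls; a minimal sketch with placeholder file names:

    import subprocess

    # Placeholder file names; -c sets the identity cutoff and -n the word size
    # (per the CD-hit documentation: -n 3 for c = 0.5, -n 2 for c = 0.4).
    for cutoff, word, out in [(0.5, 3, "nr50.fasta"), (0.4, 2, "nr40.fasta")]:
        subprocess.run(["cd-hit", "-i", "proteins.fasta", "-o", out,
                        "-c", str(cutoff), "-n", str(word)], check=True)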
Based on these two sets of reference proteins, we calculated all pairwise feature vector correlations within each dataset and plotted their distributions. For the set of 15,728 proteins sharing at most 50% sequence identity, we calculated 123,677,128 pairwise feature vector correlations, and 94,112,340 pairwise correlations for the smaller dataset.
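These correlations reduce to a single standardized matrix product; a minimal numpy sketch follows (for the full 15,728-protein set the intermediate correlation matrix occupies roughly 2 GB in double precision, so a block-wise variant may be preferable in practice):

    import numpy as np

    def pairwise_correlations(X):
        """Pearson correlations between all rows of X (n_proteins x n_features).

        Standardizing each row once reduces the all-vs-all correlation to a
        matrix product; only the upper triangle holds distinct pairs."""
        Xs = (X - X.mean(axis=1, keepdims=True)) / X.std(axis=1, keepdims=True)
        corr = (Xs @ Xs.T) / X.shape[1]
        iu = np.triu_indices(len(X), k=1)  # n * (n - 1) / 2 distinct pairs
        return corr[iu]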

Analysis of the datasets
In this section, we examine the relationship of sequence pairs in our datasets rather than of individual sequences as in the previous section. Because of the pairwise nature of binding, it is critical that different datasets do not contain similar pairs. A BLAST search with an e-value cutoff of 100 was performed to create pairwise alignments for all sequences in the datasets. Alignments with e-values above this threshold are considered insignificant or unrelated. All sequence pairs for which alignments were returned had similarities above 30%; unaligned sequences (e-value above 100) are therefore subsumed into the region of up to 30% similarity. Since we cannot distinguish anything below 30% similarity, this region was fused into a single larger bin, while similarities above 30% were binned in steps of 10%. However, only few sequence pairs were aligned in the 30–40% range, so we consider this bin rather unreliable.
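Such an all-vs-all alignment can be reproduced along the following lines; a minimal sketch with placeholder file names, using standard BLAST+ command-line options:

    import subprocess

    # All-vs-all blastp with a permissive e-value cutoff of 100, reporting
    # percent identity and e-value per aligned pair (placeholder file names).
    subprocess.run(["makeblastdb", "-in", "seqs.fasta", "-dbtype", "prot",
                    "-out", "seqdb"], check=True)
    subprocess.run(["blastp", "-query", "seqs.fasta", "-db", "seqdb",
                    "-evalue", "100",
                    "-outfmt", "6 qseqid sseqid pident evalue",
                    "-out", "allvsall.tsv"], check=True)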
To further investigate the influence of the remaining similarities, we removed, in several steps, all similarities above 60%, 50%, 40% and 30% between the training and test datasets. Table S6 shows the quality measures for the RF, Table S7 for the GNN. The data suggest that neither the RF nor the GNN suffers greatly from the removal of the remaining sequence similarity. Only the sensitivity and specificity of the RF appear to be affected by the similarity cutoff of the datasets, and more precisely by the sizes of the positive and negative training datasets (listed in Table S8). Interestingly, for the RF the specificity increases with decreasing cutoff. This can be understood from the relative shift in size between binders (pos) and non-binders (neg): the positive data naturally have a higher propensity to be filtered out due to similarity. For a 30% cutoff there are nearly twice as many negative data points as positive ones (see Table S8). The negative data are therefore overemphasized, and since the random forest minimizes the overall error rate, it favors the majority (negative) class, which increases the specificity.
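A minimal sketch of such a filter, assuming an all-vs-all similarity lookup sim (e.g. percent identity from the BLAST run above, with unaligned sequences assigned to the sub-30% bin) and the pair-similarity rule S = max(S1 + S2, S3 + S4) illustrated in Figure S3 below; the exact cutoff convention is an assumption here:

    def pair_similarity(sim, a, b, c, d):
        """Similarity between sequence pairs (a, b) and (c, d) as in Figure S3:
        both matchings of the four cross-alignments are combined and the
        larger sum wins, S = max(S1 + S2, S3 + S4)."""
        return max(sim[a][c] + sim[b][d], sim[a][d] + sim[b][c])

    def filter_test_pairs(test_pairs, train_pairs, sim, cutoff):
        """Drop test pairs whose similarity S to any training pair reaches the
        cutoff (e.g. 2 x 60 for a 60% per-sequence step; assumed convention)."""
        return [tp for tp in test_pairs
                if all(pair_similarity(sim, tp[0], tp[1], rp[0], rp[1]) < cutoff
                       for rp in train_pairs)]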
Since there is not much difference in the AUROC and AUPRC scores for the different datasets, we would argue that the two methods presented in the paper are robust to sequence similarity and also to its removal. Both algorithms seem to learn principles of protein–protein interactions beyond simple similarity detection. The magnitude of the changes in overall performance could to some degree also be explained by the reduction in training data. Figure S3 illustrates the calculation of similarities between data points. Figure S4 shows the similarities between the training and test data, Figure S5 shows similarities between entries within the training dataset, and Figure S6 shows similarities between entries of the test dataset. Tables S3, S4, and S5 show the data visualized in these three figures, respectively.

Figure S3: Illustration of the calculation of similarities between data points. The final similarity of two sequence pairs is S = max(S1 + S2, S3 + S4).

Figure S4: Redundancy analysis between binders in training and test data. Depicted is a 2D histogram of the similarity between binding sequence pairs from the training dataset versus binding sequence pairs from the test dataset. The image on the right is in log scale to visualize the background; see Table S3 for values. The image on the left is identical to Figure 1 in the main manuscript, except for the color scheme.

Figure S5: Redundancy analysis between binders in the training data. See Table S4 for values.

Table S8: Sizes of the filtered datasets.

Figure S7: Schematic description of the graph neural network implementation. The proteins are encoded into graphs in which each amino acid is represented by one node and subsequent residues are connected by an edge. During the optimization procedure, the GraphNet (GN) modifies the aggregate and update functions. From each GNN, a feature vector is created. These feature vectors are concatenated and then passed to the final MLP, which predicts the binding. All elements of the entire scheme are optimized jointly in a single back-propagation.

Figure S8: Server workflow. The user pastes the query sequence, provides a unique tag, and selects a method and an organism to scan. The server returns a sorted table of potential binders.

Comparison with the STRING database

Here we show 20 additional test cases beyond the five presented in Section 3.1.2 of the main manuscript. The additional proteins were taken from [5]; the procedure was equivalent. Figure S9 is the extended version of Figure 5 in the main manuscript, and Table S10 is the extended version of Table S9.

Figure S9: EV PPI scores from ProteinPrompt (green) and from the STRING database (grey) for an extended set of 25 different proteins. A few of the EV PPIs in the STRING database appear in the training data of ProteinPrompt. Some protein–protein pairs that were falsely classified as non-binding have been experimentally verified to be binders; this explains the outliers for CDK1, GRB2 and KDR. There are few outliers in the STRING database data due to the cutoff at a confidence score of 0.7.

Table S10: Detailed comparison of the STRING database with ProteinPrompt. Only EV PPIs from the STRING database with high confidence (score above 0.7) were used.

Verification of ProteinPrompt::RF with the datasets of Park and Marcotte
Park & Marcotte [1] performed a detailed analysis of the pairwise input of sequence-based protein–protein interaction (PPI) predictions. As a result, the Marcotte lab provides several training and test datasets for download [3]. In principle, they provide two human and two yeast datasets that differ in their composition: 'human_balanced', 'human_random', 'yeast_balanced', and 'yeast_random'. Each dataset contains 'typical' cross-validation (CV) sets, but also sets in which the test data are split into distinct classes (C1-C3) depending on the appearance of the proteins in the respective training data; see Table S11 for a definition of these test classes. Park & Marcotte also provide the results of seven different PPI prediction methods, which were trained on these data (M1-M7). Ding et al. extended this by another three methods (MMI, NMBAC, MMI+NMBAC) [2]. The availability of these data has two major benefits. First, when developing a new method for PPI prediction, one can compare directly with the results of these ten methods without having to go through the often time-consuming process of installing the software and all the necessary dependencies. Second, as in our case, having obtained results with higher accuracy than previous methods, it is possible to distinguish whether this increase in prediction quality is due to an improved learning method or whether a more suitable dataset is the cause. We performed two tests on each of the four available datasets. First, we ran the normal version of ProteinPrompt, trained on our human training data as described in the main text. The results for the test data in both human subsets, 'random' and 'balanced', show a similar level of accuracy as on our own test data. However, this is not a very robust result, as there may be significant overlap between our training data and the test data provided by Park & Marcotte. Since we have filtered our data very rigorously, as described in the previous section, the verification using our own test data is more reliable.
As a second approach, we also trained new random forest models using the training datasets of Park & Marcotte. The models summarized as ProteinPrompt retrain were trained similarly to our original random forest model, i.e., we used the given protein pair orientation A-B from the training data file, but also the reversed orientation B-A and the two combinations in which one of the two proteins has an inverted amino acid sequence: A-B_inv and B-A_inv. This enhancement of the training space has small but consistent positive effects on the models. Table S12 shows the results for the 'human balanced' dataset, Table S13 for 'human random', Table S14 for 'yeast balanced', and Table S15 for 'yeast random'. The first thing to notice is that the accuracy for the human test sets decreases significantly when the random forest model is trained on the Park & Marcotte datasets. However, the achieved AUROC (area under the ROC curve) and AUPRC (area under the precision-recall curve) values are still highly comparable to those of the more powerful competitors.
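A sketch of this four-fold enhancement, assuming a hypothetical featurize(seq_a, seq_b) that builds the concatenated 420-dimensional pair feature vector; which protein is inverted in each variant is our reading of the notation above:

    def augmented_feature_vectors(seq_a, seq_b, featurize):
        """Four training samples per annotated pair: both pair orientations,
        plus both orientations with one inverted amino acid sequence."""
        return [
            featurize(seq_a, seq_b),         # A-B
            featurize(seq_b, seq_a),         # B-A
            featurize(seq_a, seq_b[::-1]),   # A-B_inv
            featurize(seq_b, seq_a[::-1]),   # B-A_inv
        ]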
This clearly suggests that our combination of autocorrelated biochemical residue scales and a random forest model is as good as other methodological approaches used to predict protein–protein interactions. The true benefit that elevates ProteinPrompt lies in the extensive training and testing datasets that we collected and curated. Here, we could show that collecting as much training data as possible, combined with rigorous cleaning, can boost the prediction accuracy.
The decrease in accuracy for yeast in the normal version of ProteinPrompt reveals a limitation in generalizability to distant organisms. This is not too surprising, since learning methods are generally valid only within the domain of their training data. We could show that the accuracies increase when models are trained with yeast-specific data.

Table S11: Definition of the test classes of Park & Marcotte.
C1: Both proteins in the test pair are found in the training set.
C2: Only one protein in the test pair is found in the training set.
C3: Neither protein in the test pair is found in the training set.

Feature evaluation
To assess the information content of the feature scales, we calculated the Pearson correlation between all pairs of scales and estimated the importance of each scale for the outcome of the models by permutation feature importance. Figure S11 shows the Pearson correlation coefficients between all pairs of scales used to train the RF. Despite some notable correlations, e.g. between hydrophobicity and polarity, hydrophilicity and polarity, or surface area and side-chain volume, most pairs are not correlated.

Figure S11: Heatmap of the Pearson correlation between different residue scales. Despite some strong correlations between individual scales, no scale correlates with all others such that it could be omitted.

Figure S12: Permutation feature importance analysis for the random forest.
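As an illustration of the scale-to-scale correlation underlying Figure S11, the following sketch correlates two well-known residue scales; the seven scales actually used by ProteinPrompt are those given in the main text:

    import numpy as np

    # Two illustrative residue scales in the order ARNDCQEGHILKMFPSTWYV;
    # stacking all seven scales row-wise and calling np.corrcoef on the
    # stack yields the full 7 x 7 matrix plotted in Figure S11.
    kd_hydrophobicity = np.array([ 1.8, -4.5, -3.5, -3.5,  2.5, -3.5, -3.5,
                                  -0.4, -3.2,  4.5,  3.8, -3.9,  1.9,  2.8,
                                  -1.6, -0.8, -0.7, -0.9, -1.3,  4.2])
    hw_hydrophilicity = np.array([-0.5,  3.0,  0.2,  3.0, -1.0,  0.2,  3.0,
                                   0.0, -0.5, -1.8, -1.8,  3.0, -1.3, -2.5,
                                   0.0,  0.3, -0.4, -3.4, -2.3, -1.5])

    r = np.corrcoef(kd_hydrophobicity, hw_hydrophilicity)[0, 1]
    print(f"Pearson r between the two scales: {r:.2f}")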

Impact of individual scales on RF
To evaluate the importance of each of the seven amino acid scales in our random forest approach, we applied a feature permutation approach as introduced by Breiman [4]. To introduce noise for a specific feature (amino acid scale), we extracted the 60 corresponding feature vector elements of each protein pair for all pairs in our test set. That is, from our 420-dimensional feature vector describing a protein pair, positions 1–30 and 211–240 represent the 30 autocorrelation values of the hydrophobicity feature for proteins one and two, respectively. All other features can be selected via their respective columns in the feature vector. For each of the seven features we therefore extracted a set of 60 autocorrelation values from 9,876 protein pairs. We subsequently shuffled these values, which preserves their distribution, and re-inserted the randomized values into the feature vectors. We ran the model evaluation for each of the seven disturbed feature scales and calculated the difference to our original test dataset in overall accuracy, area under the ROC curve, sensitivity, and specificity. The results are shown in Figure S12.
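A sketch of this per-scale permutation, assuming a fitted sklearn-style model, the test feature matrix X_test (number of pairs x 420) with the column layout described above, and hypothetical names SCALE_NAMES, y_test and baseline_acc:

    import numpy as np

    def permute_scale(X, scale_idx, rng):
        """Shuffle the 60 columns of one scale across the test samples: the 30
        autocorrelation lags for protein one plus the matching 30 for protein
        two (offset 210). The block is shuffled jointly across samples, which
        preserves the value distribution."""
        cols = np.r_[scale_idx * 30:(scale_idx + 1) * 30,
                     210 + scale_idx * 30:210 + (scale_idx + 1) * 30]
        Xp = X.copy()
        Xp[:, cols] = rng.permutation(Xp[:, cols])  # permute rows of the block
        return Xp

    rng = np.random.default_rng(0)
    for i, name in enumerate(SCALE_NAMES):  # seven scales, main-text order
        acc = model.score(permute_scale(X_test, i, rng), y_test)
        print(f"{name}: accuracy change {acc - baseline_acc:+.3f}")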
In general, we observed a loss of accuracy and AUC for each of the seven features. The sensitivity also decreased strongly for all features, while the specificity always increased. However, the gain in specificity is always outweighed by the loss in sensitivity, resulting in an overall decrease in accuracy.
The biggest effect was found for polarity, with a 15% loss of accuracy. Hydrophobicity and hydrophilicity also resulted in a significant loss in accuracy of slightly over 8%. Minor effects were found for the remaining four features, with losses in accuracy between 2.4% and 4.0%.

Impact of the individual scales on GNN
The results of the permutation feature importance analysis are shown in Figure S13. Permutation of any scale resulted in large losses of accuracy, sensitivity, specificity, and area under the receiver operating characteristic curve.
For this reason, no statement can be made about the significance of the individual scales for the final result. Rather, this is an indicator of the model's lack of robustness to such permutations.

Figure S13: Permutation feature importance analysis for the GNN model, performed to determine the impact of each feature scale. Each permutation resulted in an immense loss of accuracy. This indicates that the model lacks robustness to shifts in the input data rather than allowing conclusions about the importance of individual feature scales.