Cracking the black box of deep sequence-based protein–protein interaction prediction

Abstract Identifying protein–protein interactions (PPIs) is crucial for deciphering biological pathways. Numerous prediction methods have been developed as cheap alternatives to biological experiments, reporting surprisingly high accuracy estimates. We systematically investigated how much reproducible deep learning models depend on data leakage, sequence similarities and node degree information, and compared them with basic machine learning models. We found that overlaps between training and test sets resulting from random splitting lead to strongly overestimated performances. In this setting, models learn solely from sequence similarities and node degrees. When data leakage is avoided by minimizing sequence similarities between training and test set, performances become random. Moreover, baseline models directly leveraging sequence similarity and network topology show good performances at a fraction of the computational cost. Thus, we advocate that any improvements should be reported relative to baseline methods in the future. Our findings suggest that predicting PPIs remains an unsolved task for proteins showing little sequence similarity to previously studied proteins, highlighting that further experimental research into the ‘dark’ protein interactome and better computational methods are needed.

The results of an extensive literature screening for high-performing PPI prediction methods can be found in Supplemental Table S1. Only twelve of 32 reviewed publications made their code available (13 methods; Richoux et al. 1 proposed two). Eight methods were written in Python (PIPR 2, DeepFE 3, Richoux-FC, Richoux-LSTM 1, D-SCRIPT 4, Topsy-Turvy 5, PRoBERTa 6, and TransformerGO 7), two in Matlab (You et al. 8, Ding et al. 9), one in Lua (Hashemifar et al. 10), and two in C++ (Hamp and Rost 11, SPRINT 12). We excluded the methods that do not use sequences as their only input (Hashemifar et al. 10, Hamp and Rost 11, TransformerGO 7) and focused on DL methods with high reported accuracies, which we managed to reproduce with reasonable effort. Additionally, we included the SPRINT method as a baseline comparison since it relies solely on sequence similarity for its predictions. As further similarity-based baselines, we included simple baseline ML methods (random forest, SVM), which we ran on three different encodings of the amino acid sequences (PCA, MDS, node2vec). We also included two node label classification algorithms, which we ran on the line graphs of the PPI networks (harmonic function, global and local consistency), to test how much can be predicted using topology alone. Overall, we hence tested seven published methods and eight baseline ML models.
PIPR. PIPR 2 encodes the two input sequences using pre-trained embeddings. These embeddings take co-occurrence similarities of amino acids and physicochemical properties into account. PIPR feeds the embeddings to a deep siamese network: the embeddings are input to two residual recurrent CNN units (convolutional layer, max pooling, bidirectional GRU with residual shortcuts, global average pooling) with shared weights. The outputs are then combined using element-wise multiplication. Finally, the network predicts binary PPIs by training a multi-layer perceptron with leaky ReLU (final activation function: softmax, loss function: binary cross-entropy, optimizer: Adam with AMSGrad, learning rate: 0.001, epsilon: 1e-6). For our tests, we fixed the hyperparameters according to the results of the original publication (number of epochs: 50, batch size: 256, dimension of the convolutional and GRU layers: 25, sequence embedding cropped at position: 2000).
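To make the siamese architecture more concrete, the following minimal Keras sketch mirrors the building blocks described above. It is a simplified illustration rather than PIPR's implementation: it uses a single convolution/GRU unit without residual shortcuts, a randomly initialized embedding layer as a stand-in for the pre-trained embeddings, and assumed layer sizes apart from the hyperparameters stated above.

```python
# Simplified sketch of a siamese residual-RCNN-style predictor (not PIPR's actual code):
# single conv/GRU unit, no residual shortcuts, random embeddings as a stand-in.
import tensorflow as tf
from tensorflow.keras import layers, Model

SEQ_LEN, VOCAB, EMB_DIM, HIDDEN = 2000, 25, 12, 25  # HIDDEN = conv/GRU dimension from the paper

def branch():
    inp = layers.Input(shape=(SEQ_LEN,))
    x = layers.Embedding(VOCAB, EMB_DIM)(inp)                 # stand-in for pre-trained embeddings
    x = layers.Conv1D(HIDDEN, 3, activation="relu")(x)        # convolutional layer
    x = layers.MaxPooling1D(3)(x)                              # max pooling
    x = layers.Bidirectional(layers.GRU(HIDDEN, return_sequences=True))(x)  # bidirectional GRU
    x = layers.GlobalAveragePooling1D()(x)                     # global average pooling
    return Model(inp, x)

shared = branch()                                              # shared weights for both proteins
seq_a, seq_b = layers.Input(shape=(SEQ_LEN,)), layers.Input(shape=(SEQ_LEN,))
combined = layers.Multiply()([shared(seq_a), shared(seq_b)])   # element-wise multiplication
h = layers.LeakyReLU()(layers.Dense(HIDDEN)(combined))         # MLP with leaky ReLU
out = layers.Dense(2, activation="softmax")(h)                 # binary prediction via softmax
model = Model([seq_a, seq_b], out)
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=0.001, epsilon=1e-6, amsgrad=True),
              loss="binary_crossentropy")
```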
DeepFE. DeepFE 3 pre-trained a Word2vec/Skip-gram model (size: 20, window: 4), using all Swissprot protein sequences (from 2018) as input. The model treats every amino acid as a word and, therefore, a sequence as a sentence. PPI candidate sequences are then represented using the Word2vec embeddings and are fed to a deep siamese network consisting of four units (dense layer with ReLU, batch normalization, dropout layer) with decreasing dimensionality (2048, 1024, 512, and 128). The outputs are then concatenated and put through another unit with dimensionality 8. PPI prediction is made at the final two-dimensional layer with a softmax activation function. A stochastic gradient descent optimizer is used (learning rate: 0.01, momentum: 0.9, decay: 0.001). For our tests, we fixed the hyperparameters according to the results of the original publication (number of epochs: 45, batch size: 256, maximum protein length fed to the model: 850).
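A minimal sketch of the residue-level Word2vec encoding is shown below, assuming gensim (version 4 or later, where the size parameter is called vector_size); the toy sequences and the encode helper are illustrative only.

```python
# Hypothetical sketch: treat each amino acid as a word and each sequence as a sentence.
from gensim.models import Word2Vec

sequences = ["MKTLLVLAVA", "MSDQEAKPST"]        # toy corpus instead of all Swissprot sequences
corpus = [list(seq) for seq in sequences]        # one "sentence" per protein

# Skip-gram model with the stated hyperparameters (size 20, window 4)
w2v = Word2Vec(corpus, vector_size=20, window=4, sg=1, min_count=1)

def encode(seq, max_len=850):
    """Per-residue embeddings of the first max_len residues (illustrative helper)."""
    return [w2v.wv[aa] for aa in seq[:max_len] if aa in w2v.wv]
```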
Richoux-FC and Richoux-LSTM. Richoux et al. 1 represent all sequences using one-hot encoding and present two DL models: a fully connected model and a recurrent LSTM model, which we refer to as Richoux-FC and Richoux-LSTM. Richoux-FC first flattens both sequence inputs and passes them separately through two units consisting of a dense layer (dimensionality: 20, ReLU) and a batch normalization layer. The two outputs are concatenated and passed to another unit (dense layer, dimensionality: 20, ReLU). The final layer (dimensionality: 1) predicts the PPI using a sigmoid activation function. Richoux-LSTM first extracts features from the sequences via three units (convolution, pooling, batch normalization) and a final LSTM layer. The parameters of these layers are shared between the two input sequences. Then, the outputs are concatenated and passed to a dense layer (ReLU) with another batch normalization. Finally, the LSTM model predicts the PPIs using a sigmoid activation function on the final one-dimensional dense layer. Both models use the Adam optimizer with an initial learning rate of 0.001, which is reduced by 10% whenever the loss plateaus for 5 epochs, down to a minimum of 0.0008. For our tests, we fixed the hyperparameters according to the results of the original publication (number of epochs for Richoux-FC: 25, number of epochs for Richoux-LSTM: 100, batch size: 256, maximum sequence length: 1166).
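The fully connected variant can be sketched in a few lines of Keras. This is a simplified reading of the description above (one dense/batch-normalization unit per input branch and an assumed one-hot input shape of 1166 × 20), not the authors' implementation.

```python
# Simplified sketch of the Richoux-FC architecture (assumed input shape, one unit per branch).
from tensorflow.keras import layers, Model, optimizers, callbacks

MAX_LEN = 1166                                   # maximum sequence length from the paper

def unit(x, dim=20):
    x = layers.Dense(dim, activation="relu")(x)  # dense layer (dimensionality 20, ReLU)
    return layers.BatchNormalization()(x)        # batch normalization

seq_a, seq_b = layers.Input(shape=(MAX_LEN, 20)), layers.Input(shape=(MAX_LEN, 20))
xa, xb = unit(layers.Flatten()(seq_a)), unit(layers.Flatten()(seq_b))
x = unit(layers.Concatenate()([xa, xb]))
out = layers.Dense(1, activation="sigmoid")(x)   # final one-dimensional sigmoid layer
model = Model([seq_a, seq_b], out)
model.compile(optimizer=optimizers.Adam(learning_rate=0.001), loss="binary_crossentropy")

# Learning rate schedule as described: reduce by 10% on a 5-epoch loss plateau, down to 0.0008
lr_schedule = callbacks.ReduceLROnPlateau(monitor="loss", factor=0.9, patience=5, min_lr=0.0008)
```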
D-SCRIPT and Topsy-Turvy. Before running the D-SCRIPT model, sequences are embedded using a pre-trained language model by Bepler and Berger 13, which transforms a sequence of length n into a matrix of dimension n × 6165. In our case, we embedded the yeast and the human proteome separately. The two embeddings of a PPI between proteins of lengths n and m are passed separately through a projection module with shared weights, which reduces the dimensionality to n × 100 and m × 100, respectively. This module consists of a dense layer (dimensionality: n × 100 and m × 100, respectively, with ReLU activation), followed by a dropout layer. The outputs are combined by concatenating their element-wise differences and Hadamard products and used as input for the residue contact module. A dense layer (dimensionality: 50), followed by a batch normalization layer and a ReLU activation, produces a tensor of dimensionality n × m × 50. D-SCRIPT passes this tensor to a convolutional layer (width: 7), followed by a batch normalization layer and a sigmoid activation function, to produce a contact prediction matrix Ĉ ∈ [0, 1]^{n×m}. The interaction prediction module transforms this matrix into an interaction probability by subjecting it to a max-pooling operation (size: 9), a custom global pooling operation, and a custom logistic activation function. The loss function minimized during training is a weighted sum of the binary cross-entropy loss (prediction) and a contact-map magnitude loss. We fixed all hyperparameters to the ones from the publication, i.e., training for 10 epochs with a batch size of 25 and an Adam optimizer with a learning rate of 0.001.
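The combination of the two projected embeddings into the input of the residue contact module can be illustrated with a short NumPy sketch; taking absolute element-wise differences is an assumption on our side, and the dimensions are toy values.

```python
# Toy illustration of how two projected embeddings are combined for the contact module.
import numpy as np

n, m, d = 300, 250, 100                          # protein lengths and projected dimension
z1, z2 = np.random.rand(n, d), np.random.rand(m, d)

diff = np.abs(z1[:, None, :] - z2[None, :, :])   # element-wise differences (absolute values assumed)
prod = z1[:, None, :] * z2[None, :, :]           # Hadamard products
features = np.concatenate([diff, prod], axis=-1) # n x m x 2d input to the residue contact module
print(features.shape)                            # (300, 250, 200)
```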
Topsy-Turvy computes a topology score, the GLIDE score, and the D-SCRIPT prediction probability for each PPI. The GLIDE score combines a local, neighborhood-based metric (common weighted normalized score) and a global metric (diffusion state distance) to capture the network surrounding the two protein nodes, regardless of whether they are close (local metric) or not (global metric). The score is binarized at a cutoff of 92.5. A GLIDE score prediction loss extends the part of the D-SCRIPT loss responsible for the interaction probability, with a relative importance of 0.2. The GLIDE score prediction loss is the binary cross-entropy loss between the D-SCRIPT prediction and the binarized GLIDE score. Hyperparameters were fixed as specified in the publication, i.e., training for 10 epochs with a batch size of 25 and an Adam optimizer with a learning rate of 0.001.
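As a rough illustration of the combined classification loss, the following NumPy sketch mixes the standard prediction loss with the GLIDE score prediction loss at a relative importance of 0.2; the exact weighting scheme used by Topsy-Turvy may differ, so the (1 − λ)/λ split and all numeric values are assumptions.

```python
# Toy sketch of mixing the PPI prediction loss with the GLIDE score prediction loss.
import numpy as np

def bce(y, p, eps=1e-7):
    p = np.clip(p, eps, 1.0 - eps)
    return float(-(y * np.log(p) + (1 - y) * np.log(1 - p)).mean())

p_hat = np.array([0.9, 0.2, 0.7])          # D-SCRIPT interaction probabilities (toy values)
labels = np.array([1.0, 0.0, 1.0])         # true PPI labels
glide = np.array([95.0, 80.0, 93.0])       # GLIDE scores (toy values)
glide_bin = (glide > 92.5).astype(float)   # binarized at the 92.5 cutoff

lam = 0.2                                  # relative importance of the GLIDE term (assumed split)
interaction_loss = (1 - lam) * bce(labels, p_hat) + lam * bce(glide_bin, p_hat)
```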
SPRINT. Instead of finding hidden patterns via DL, SPRINT 12 entirely relies on the hypothesis that a protein pair which is pairwise similar to an interacting protein pair is more likely to interact. Therefore, we can use it as a baseline model to see how well a dataset can be predicted using sequence similarities alone (Explanation 3). SPRINT searches for similar subsequences between the candidate protein pair and the database of known interactions using an approach very similar to BLAST's hit-and-extend strategy. Instead of relying on consecutive initial seeds of a certain length like BLAST, SPRINT also allows for gapped initial seeds to increase the number of hits. Using all hits, SPRINT calculates similarity scores and sorts the output in decreasing order of score, where a higher score represents a higher probability of interaction.

Baseline ML models
Similarity-based models. We implemented additional baseline models to compare the performance of DL methods against classical ML. Note that the idea behind implementing the baseline models is not to show that classical ML models suffice for the PPI prediction task but to explain the phenomenal accuracies reported for sequence-based DL models. Because of this, we designed our baseline models such that they can only learn from sequence similarities and node degrees: the similarities are the only input features, and the node degrees can be learnt implicitly during training.
As models, we trained a random forest (sklearn 1.0.2 RandomForestClassifier, 100 decision trees, six parallel jobs, a prediction greater than 0.5 is interpreted as an interaction) and an SVM (sklearn SVC, RBF kernel, maximum number of iterations: 1000, a prediction > 0.5 is interpreted as an interaction). We adopted these hyperparameters from the sklearn defaults; no tuning was done.
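A minimal sklearn sketch of the two baselines with the stated settings is given below; the toy features and the use of probability=True for the SVM (needed to threshold predicted probabilities at 0.5) are assumptions.

```python
# Minimal sketch of the similarity-based baseline classifiers (toy data).
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC

X_train = np.random.rand(1000, 256)                 # two concatenated 128-dim protein embeddings
y_train = np.random.randint(0, 2, 1000)
X_test = np.random.rand(200, 256)

rf = RandomForestClassifier(n_estimators=100, n_jobs=6).fit(X_train, y_train)
svm = SVC(kernel="rbf", max_iter=1000, probability=True).fit(X_train, y_train)  # probability=True assumed

# a predicted probability of the positive class > 0.5 is interpreted as an interaction
rf_pred = (rf.predict_proba(X_test)[:, 1] > 0.5).astype(int)
svm_pred = (svm.predict_proba(X_test)[:, 1] > 0.5).astype(int)
```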
Proteins were represented using a similarity vector where each entry is the pairwise similarity score (bitscore) to every other protein in the human or yeast proteome (defined by Swissprot). A PPI was encoded as the concatenation of the two protein embeddings. An all-against-all similarity matrix for human and yeast was computed by the team of SIMAP2 14, yielding a 6718 × 6718 matrix for yeast and a 20 353 × 20 353 matrix for human. Since this dimensionality is too high, we reduced it via three approaches: PCA, MDS, and node2vec, each returning 128 dimensions.
In theory, preprocessing the whole dataset before splitting it into training and test sets is considered data leakage. In this case, it is acceptable since we consider the whole universe (the proteome). Introducing data leakage via joint preprocessing usually means that information is available during testing that would not be available in a real scenario (e.g., the divergence from the mean of the dataset). Here, however, the dimensionality reductions are computed on the similarity matrix of all known human proteins; therefore, we do not introduce any dataset-specific biases. Given a new PPI, we would already possess the embeddings of its individual proteins.
PCA and MDS yielded matrices of dimensions 6718 × 128 and 20 353 × 128 for the yeast and human proteomes, respectively. For the computation of the node2vec encoding, we built a weighted similarity network from the similarity matrix such that an edge between two proteins exists if and only if their bitscore is positive. Self-loops were excluded. Since SIMAP2 uses cutoffs, some of the proteins had no edges. Because the input to node2vec is just an edge list, the resulting node2vec embeddings were smaller, yielding matrices of dimensions 6194 × 128 and 20 210 × 128 for the yeast and human proteomes. Altogether, we hence tested three different embeddings as input to two ML models, yielding six baseline models.
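The PCA-based encoding can be sketched as follows; the toy similarity matrix and the encode_pair helper are illustrative stand-ins for the SIMAP2 bitscore matrix and our feature construction.

```python
# Hypothetical sketch of the PCA-based protein encoding and PPI feature construction.
import numpy as np
from sklearn.decomposition import PCA

n_proteins = 500                                    # stand-in for 6718 (yeast) / 20 353 (human)
sim = np.random.rand(n_proteins, n_proteins)        # toy all-against-all bitscore matrix
sim = (sim + sim.T) / 2                             # symmetrize

embedding = PCA(n_components=128).fit_transform(sim)  # n_proteins x 128

def encode_pair(i, j):
    """Feature vector of a candidate PPI: concatenation of the two protein embeddings."""
    return np.concatenate([embedding[i], embedding[j]])

x = encode_pair(0, 1)                               # 256-dim input for the random forest / SVM
```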
Topology-based models. Both models are node label classification algorithms implemented in NetworkX 2.8. Because the PPI labels are edge labels, we first converted all PPI networks to line graphs for this task (i.e., all edges become nodes, and all nodes become edges; Supplementary Figure S1).
Zhu et al. 15 formulate the problem of assigning labels to unlabeled nodes in a partially labeled network (our line graph) in terms of a Gaussian random field on the graph. The solution is given by the minimum harmonic energy function, which is unique and can be calculated explicitly using matrix operations. In NetworkX 2.8, the solution is calculated iteratively. On an unweighted graph, in the first iteration, node label probabilities are calculated from the labels of neighboring, labeled nodes. In the next iteration, these probabilities are updated according to the label probabilities of the neighboring nodes and the known node labels. This procedure is repeated n times, in our case n = 30.
Zhou et al. 16 slightly modified the method presented by Zhu et al. such that information is spread symmetrically. They also introduce a parameter specifying the relative amount of information a node receives from its neighbors (here, α = 0.99) versus from the initial labels (1 − α). A final difference is that, in contrast to the harmonic function, the label probabilities also influence the initially labeled nodes, so outliers can potentially be corrected. Like the harmonic function, the local and global consistency algorithm is iterated n times, here n = 30.
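A minimal NetworkX sketch of the topology-based baselines is shown below, assuming a toy PPI graph and toy edge labels; function names and defaults follow NetworkX 2.8.

```python
# Sketch of the topology-based baselines: line graph + node label classification (toy data).
import networkx as nx
from networkx.algorithms import node_classification

ppi = nx.Graph()
ppi.add_edges_from([("P1", "P2"), ("P2", "P3"), ("P3", "P4"), ("P1", "P4")])

lg = nx.line_graph(ppi)                          # PPI edges become nodes of the line graph

# known labels (1 = interaction, 0 = non-interaction) for training edges; test edges stay unlabeled
labels = {("P1", "P2"): 1, ("P3", "P4"): 0}
for edge, label in labels.items():
    node = edge if edge in lg else (edge[1], edge[0])  # line-graph nodes keep one edge orientation
    lg.nodes[node]["label"] = label

pred_hf = node_classification.harmonic_function(lg, max_iter=30, label_name="label")
pred_lgc = node_classification.local_and_global_consistency(lg, alpha=0.99, max_iter=30,
                                                            label_name="label")
```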

Figure S2:
Figure S2: Node degrees of the proteins in the original vs. rewired networks. This serves merely as a sanity check to verify that the rewiring indeed preserved the node degrees of all proteins in expectation.

Figure S5:
Figure S5: Distribution of SPRINT scores. For almost all original datasets, scores assigned to positive PPIs are significantly higher than scores assigned to negative examples. This difference drops strongly for the rewired test sets and the INTER → INTRA 0 / INTRA 1 predictions (note the log-scaled y-axis). For INTRA 0 → INTRA 1, no difference between scores for positive and negative samples can be seen, resulting in random performance measures.

Figure S6:
Figure S6: Degree ratios in the training datasets for the (a) original, (b) rewired, and (c) partition datasets. Most of the proteins in HUANG and PAN have either exclusively positive or exclusively negative interactions. DU has many proteins with only positive interactions, while the RICHOUX datasets have many proteins with only negative annotations.

Figure S8:
Figure S8: Training and validation accuracy over all epochs until early stopping for all deep learning methods on the original datasets. The red line indicates which model was used to evaluate the test set in the early stopping setting.

Figure S9:
Figure S9: Training and validation loss over all epochs until early stopping for all deep learning methods on the original datasets. The red line indicates which model was used to evaluate the test set in the early stopping setting.

Figure S10:
Figure S10: Training and validation accuracy over all epochs until early stopping for all deep learning methods on the rewired datasets. The red line indicates which model was used to evaluate the test set in the early stopping setting.

Figure S11:
Figure S11: Training and validation loss over all epochs until early stopping for all deep learning methods on the rewired datasets. The red line indicates which model was used to evaluate the test set in the early stopping setting.

Figure S12:
Figure S12: Training and validation accuracy over all epochs until early stopping for all deep learning methods on the partition datasets. The red line indicates which model was used to evaluate the test set in the early stopping setting.

Figure S14:
Figure S14: Balanced accuracies (AUPRs for SPRINT) for all methods and datasets.

Figure S16:
Figure S16: Precision (AUPRs for SPRINT) for all methods and datasets.

Figure S18:
Figure S18: Recall (AUPRs for SPRINT) for all methods and datasets.
Figure S32:
Figure S32: Overlap of proteins for all datasets, visualized separately for positive and negative datasets. Among the panels are (d) the overlap of proteins occurring in the negative rewired training and test sets, the overlap of proteins occurring in the positive partitioned training and test sets, and the overlap of proteins occurring in the negative partitioned training and test sets.

Table S5:
Accuracies of all methods evaluated on the Gold Standard test set. Richoux-LSTM, DeepFE, Richoux-FC, and PIPR were trained with the training set and validated on the validation set. The other methods do not have tunable hyperparameters and hence were trained on the collapsed training and validation set.

Table S6:
Accuracies of all methods evaluated on the Gold Standard test set after early stopping.

Table S7:
Epochs in which early stopping was triggered.

Figure S13: Training and validation loss over all epochs until early stopping for all deep learning methods on the partition datasets. The red line indicates which model was used to evaluate the test set in the early stopping setting.
Figure S15: Difference between the balanced accuracies from the early stopping setting and the balanced accuracies obtained without early stopping. Only D-SCRIPT and Topsy-Turvy seem to profit from early stopping.
Figure S17: Difference between the precisions from the early stopping setting and the precisions obtained without early stopping. Precision increases especially for D-SCRIPT.
Figure S19: Difference between the recalls from the early stopping setting and the recalls obtained without early stopping. Recall increases for most methods except Richoux-FC and PIPR.