GeNNius: an ultrafast drug–target interaction inference method based on graph neural networks

Abstract Motivation Drug–target interaction (DTI) prediction is a relevant but challenging task in the drug repurposing field. In-silico approaches have drawn particular attention as they can reduce the associated costs and time commitment of traditional methodologies. Yet, current state-of-the-art methods present several limitations: existing DTI prediction approaches are computationally expensive, thereby hindering the ability to use large networks and exploit available datasets; moreover, the generalization of DTI prediction methods to unseen datasets remains unexplored, even though it could improve their accuracy and robustness. Results In this work, we introduce GeNNius (Graph Embedding Neural Network Interaction Uncovering System), a Graph Neural Network (GNN)-based method that outperforms state-of-the-art models in terms of both accuracy and time efficiency across a variety of datasets. We also demonstrate its prediction power to uncover new interactions by evaluating previously unknown DTIs for each dataset. We further assess the generalization capability of GeNNius by training and testing it on different datasets, showing that this framework can potentially improve the DTI prediction task by training on large datasets and testing on smaller ones. Finally, we investigate qualitatively the embeddings generated by GeNNius, revealing that the GNN encoder maintains biological information after the graph convolutions while diffusing this information through nodes, eventually distinguishing protein families in the node embedding space. Availability and implementation GeNNius code is available at https://github.com/ubioinformat/GeNNius.


Introduction
The process of identifying new drugs to treat a specific disease can be simplified by seeking a chemical compound that modulates a pharmacological target implicated in that disease, with the goal of altering its biological activity. Even though different biological entities can be chosen as targets, such as RNA or proteins, the latter are the most common pharmacological targets (Santos et al. 2017). Targeting proteins allows the modulation of many biological processes implicated in maintaining health and potentially preventing or treating diseases. For example, drugs targeting metabolic enzymes can alter how cells process nutrients and energy (Duggan et al. 2023).
Although high-throughput wet-lab techniques were developed to accelerate drug discovery pipelines, both in vitro and in vivo approaches are time-consuming and can be costly (DiMasi et al. 2016). To address these limitations, computational methods have arisen as promising tools to reduce the time and resources required to bring new treatments to market. The field of drug repurposing involves predicting novel drug–target interactions (DTIs) that will ultimately enable the discovery of new uses for already approved drugs (Pushpakom et al. 2019). In recent years, the availability of large amounts of data has made it possible to design machine learning models that can assist in these drug development tasks, through, e.g. the identification of complex molecular patterns that were not previously uncovered. These models typically leverage multiple types of data, including amino acid sequences (Zhao et al. 2021) and 3D protein structures (Verma et al. 2021), as recent advances in protein structure prediction such as AlphaFold (Jumper et al. 2021, Varadi et al. 2021) have significantly increased the amount of structural information available.
Specific to DTI prediction, several different machine learning architectures have been proposed in recent years (see Section 2.1). Several of these are computationally expensive, hindering the use of large networks that better describe the available corpus of drug–target interactions; furthermore, running times are often not reported. In terms of architecture, while some models utilize simple linear regression techniques, others contain more complex mechanisms such as transformers (Huang et al. 2020). However, most of these technologies do not consider the global view of how proteins and drugs are connected, which could be informative towards the discovery of novel relationships. To allow for modeling the network topology, recent works have proposed representing DTI data as a graph (Luo et al. 2017, Peng et al. 2021). Specifically, DTIs can be modeled as a heterogeneous graph connecting drugs and proteins (both represented as nodes) based on interactions recorded in wet-lab experiments (edges). This representation can be augmented by adding additional node types (e.g. diseases) or edge types (e.g. protein similarity). The DTI prediction model is then trained to predict whether a drug has the potential to interact with a protein.
Advances in machine learning for graphs have highlighted Graph Neural Networks (GNNs) as a powerful tool to model these complex networks for a wide range of applications across diverse fields, including economics (Xu et al. 2022), particle physics (Que and Luk 2022), and especially biomedicine (Yi et al. 2021). The defining characteristic of a GNN is that it uses a form of neural message passing, where at each iteration the hidden embeddings of the nodes are updated (Hamilton 2020). Further, contrary to most previous node embedding techniques such as Node2Vec (Grover and Leskovec 2016), several GNN architectures are able to generalize from a set of training examples to unseen data points. This capability is of utmost importance to guarantee that the developed models generalize when facing unseen interactions.
Recently, entire libraries have been developed to work with GNNs. Special mention should be made of PyTorch Geometric (PyG), a geometric deep learning library built on top of PyTorch (Fey and Lenssen 2019). Among other functions and layers, PyG implements the SAGEConv layer, which corresponds to the GraphSAGE operator originally designed to allow training GNNs on large networks (Hamilton et al. 2017). SAGEConv simultaneously learns the topological structure of each node's neighborhood, as well as the distribution of the node features in the neighborhood.
In this work, we present a novel DTI prediction method, termed Graph Embedding Neural Network Interaction Uncovering System (GeNNius), built upon SAGEConv layers followed by a neural network (NN)-based classifier. GeNNius outperforms state-of-the-art DTI prediction methods across several datasets, not only in AUROC and AUPRC, but also in execution time. Since ensuring the ability of in silico drug repurposing approaches to find new interactions is of utmost importance, we also evaluated the ability of GeNNius to predict true interactions reported as negative in a given dataset, yielding promising results. We further assessed the generalization capability of our model by training on one dataset and testing on a different one. This procedure mimics more realistically how the model would behave in a real-world setting.
Finally, while drug repurposing approaches based on complex machine learning models have eased the discovery of new targets, they often lack explainability. We analyzed qualitatively how drug features (such as commonly used molecular descriptors) and protein features (such as the amino acid ratios of protein sequences) are combined in a nonlinear manner while solving the DTI prediction task. This analysis revealed that the GNN encoder maintains biological information while diffusing this information through nodes, eventually distinguishing protein families in the node embeddings.
Overall, the results of our evaluation provide strong support for the effectiveness of GeNNius, and introduce relevant guidelines for building GNN-based drug repurposing approaches.

Methods
Model architecture
The GeNNius architecture is composed of a Graph Neural Network (GNN) encoder that generates node embeddings, and a Neural Network (NN)-based classifier that aims to learn the existence of an edge (i.e. an interaction) given the concatenation of a drug and a protein node embedding (Fig. 1).
In GNNs, nodes in the graph exchange messages with their neighbors to update their feature representations, which is formulated with two fundamental functions, the message and the update functions (Wu et al. 2022):

$$ m_v^{(k)} = \mathrm{AGG}\left(\left\{ M\left(h_v^{(k-1)}, h_u^{(k-1)}, e_{vu}\right) : u \in N(v) \right\}\right), \qquad h_v^{(k)} = U^{(k)}\left(h_v^{(k-1)}, m_v^{(k)}\right), \tag{1} $$

where $k \in \{1, \ldots, K\}$ represents the layer, $m_v$ the aggregated message vector for node $v$, $N(v)$ the neighbor nodes of $v$, and $h_v \in \mathbb{R}^d$ the embedding of node $v$, of dimension $d$. $M(h_v, h_u, e_{vu})$ defines the message between node $v$ and its neighbor node $u$, which depends on the edge information $e_{vu}$. Finally, $U^{(k)}$ is the node update function, which combines the aggregated messages from the node's neighbors with the node's own representation. The GeNNius encoder is composed of four SAGEConv layers, which are responsible for generating network-preserving node embeddings $h \in \mathbb{R}^d$ ($d = 17$ in our case) by aggregating information from the embeddings of each node's local neighborhood. Thus, in GeNNius, the embedding of node $v$ at SAGEConv layer $k$ is given by:

$$ h_v^{(k)} = f\left( W_0^{(k)} h_v^{(k-1)} + W_1^{(k)} \cdot \mathrm{AGG}\left(\left\{ h_u^{(k-1)} : u \in N(v) \right\}\right) \right), $$

where $f$ is the activation function (Tanh in our case) and AGG represents the aggregation function (SUM in our case). $W_0^{(k)}$ and $W_1^{(k)}$ are the learnable weight matrices; since we are working with heterogeneous graphs, where a drug is only connected to proteins and vice versa, if $W_0^{(1)} \in \mathbb{R}^{d \times d_P}$ then $W_1^{(1)} \in \mathbb{R}^{d \times d_N}$, or the other way around, with $d_P$ ($d_N$) being the initial dimension of the protein (drug) node features. For $k > 1$, both matrices have dimension $d \times d$.
The NN-based classifier is composed of two dense layers, both using ReLU as the activation function, followed by the output layer, which consists of a single neuron with a sigmoid activation function. The input to the classifier is a vector of dimension 2d (the concatenation of a drug and a protein embedding), and the output is the estimated probability of an interaction (positive edge).
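GeNNius is implemented with PyG's SAGEConv; as a dependency-free illustration, the following NumPy sketch reproduces the bipartite SAGE update and the concatenation-based edge scoring with random, untrained weights and hypothetical graph sizes (the real classifier has two hidden layers; one is shown here for brevity):

```python
import numpy as np

rng = np.random.default_rng(0)

def sage_layer(h_dst, h_src, adj, W0, W1):
    """One SAGEConv-style update on a bipartite graph:
    h_v' = tanh(W0 h_v + W1 * sum_{u in N(v)} h_u).
    adj[v, u] = 1 if destination node v is linked to source node u."""
    msg = adj @ h_src                        # SUM aggregation over neighbors
    return np.tanh(h_dst @ W0.T + msg @ W1.T)

def edge_score(h_drug, h_prot, W_mlp, w_out):
    """Classifier sketch: concatenate a drug and a protein embedding,
    one ReLU layer (the paper uses two), then a sigmoid output neuron."""
    z = np.concatenate([h_drug, h_prot])
    hidden = np.maximum(0.0, W_mlp @ z)              # ReLU
    return 1.0 / (1.0 + np.exp(-(w_out @ hidden)))   # sigmoid -> probability

d, d_N, d_P = 17, 12, 20       # embedding dim, drug and protein feature dims
n_drugs, n_prots = 5, 7        # hypothetical toy graph
drugs = rng.normal(size=(n_drugs, d_N))   # 12 molecular descriptors per drug
prots = rng.normal(size=(n_prots, d_P))   # 20-D amino acid ratios per protein
adj = (rng.random((n_drugs, n_prots)) < 0.4).astype(float)

# First layer projects each node type to the common dimension d = 17.
W0 = rng.normal(size=(d, d_N)) * 0.1
W1 = rng.normal(size=(d, d_P)) * 0.1
h_drugs = sage_layer(drugs, prots, adj, W0, W1)

W0p = rng.normal(size=(d, d_P)) * 0.1
W1p = rng.normal(size=(d, d_N)) * 0.1
h_prots = sage_layer(prots, drugs, adj.T, W0p, W1p)

W_mlp = rng.normal(size=(8, 2 * d)) * 0.1
w_out = rng.normal(size=8) * 0.1
p = edge_score(h_drugs[0], h_prots[0], W_mlp, w_out)
print(h_drugs.shape, round(float(p), 3))   # (5, 17) and a probability in (0, 1)
```

Note how the two weight matrices of the first layer have different input dimensions (d_P vs. d_N), exactly as required by the heterogeneous setting described above.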
The GeNNius architecture (depicted in Fig. 1) and its hyperparameters were chosen through a grid search with ten independent runs, using different types and numbers (K) of GNN layers, different embedding dimensions d, activation functions, aggregation functions, and different numbers of heads for layers with attention. This approach helped us fine-tune the model (see Supplementary Material S1 for a detailed description of the process and hyperparameters).

Model configuration
The model was trained with the Adam optimizer (Kingma and Ba 2017) and a learning rate of 0.01. We use a loss that combines the sigmoid of the output layer and the binary cross-entropy in a single function. This combination takes advantage of the log-sum-exp trick for numerical stability (Paszke et al. 2019). Given a dataset divided into batches of size N, the loss $l_n$ for sample $n$ in the batch is computed as follows:

$$ l_n = -\left[ y_n \log \hat{y}_n + (1 - y_n) \log\left(1 - \hat{y}_n\right) \right], $$

where $y_n \in \{0, 1\}$ is the associated label for sample $n$, $\hat{y}_n = \sigma(x_n)$ the estimated probability of the sample belonging to the positive class (i.e. existence of an interaction), and $x_n$ the output of the last linear layer (before the activation function). The final batch loss $L$ is then computed as the average of $(l_1, \ldots, l_N)$. Finally, a dropout of 0.2 is used at the encoder stage (the GNN) to address potential collinearities of node features (Wager et al. 2013) (dropout rate chosen through hyperparameter tuning, see Supplementary Section S1).
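This combined loss corresponds to PyTorch's BCEWithLogitsLoss. A minimal NumPy sketch shows how the numerically stable form on raw logits agrees with the naive sigmoid-plus-log formula while avoiding overflow for large logits:

```python
import numpy as np

def bce_with_logits(x, y):
    """Numerically stable binary cross-entropy on raw logits x, using the
    log-sum-exp trick (as in PyTorch's BCEWithLogitsLoss):
    l_n = max(x, 0) - x * y + log(1 + exp(-|x|))."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    l = np.maximum(x, 0) - x * y + np.log1p(np.exp(-np.abs(x)))
    return l.mean()   # batch loss L = average of (l_1, ..., l_N)

def bce_naive(x, y):
    """Direct form -[y log(sigmoid(x)) + (1-y) log(1-sigmoid(x))];
    overflows for large |x|."""
    p = 1.0 / (1.0 + np.exp(-np.asarray(x, float)))
    y = np.asarray(y, float)
    return -(y * np.log(p) + (1 - y) * np.log(1 - p)).mean()

logits = np.array([-2.0, 0.5, 3.0, -0.1])
labels = np.array([0.0, 1.0, 1.0, 0.0])
print(np.isclose(bce_with_logits(logits, labels),
                 bce_naive(logits, labels)))            # True
print(bce_with_logits(np.array([1000.0]), np.array([0.0])))  # 1000.0, no overflow
```

For a logit of 1000 with label 0, the naive form would evaluate log(1 - sigmoid(1000)) = log(0) and return inf/nan, while the stable form returns the exact loss value.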
The model is implemented with early stopping, computed on validation data, with a minimum of 40 training epochs. The latter is especially useful for small datasets, where an early stop may occur during the first epochs, eventually causing underfitting. The model was built with the latest version of PyTorch Geometric (2.3.0), with PyTorch 2.0.0-cuda11.7, and the following packages: pyg-lib (0.2.0+pt20cu117), torch-scatter (2.1.1+pt20cu117), and torch-sparse (0.6.17+pt20cu117). A Dockerfile for running the model is available at https://github.com/ubioinformat/GeNNius.
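The interplay between early stopping and the 40-epoch minimum can be sketched as follows; the patience value and the exact stopping criterion are illustrative assumptions, not GeNNius's actual implementation:

```python
class EarlyStopper:
    """Sketch of early stopping with a minimum number of epochs, mirroring
    the idea described above: even if the validation loss plateaus early,
    training continues for at least `min_epochs` epochs."""

    def __init__(self, patience=10, min_epochs=40):
        self.patience, self.min_epochs = patience, min_epochs
        self.best, self.bad_epochs, self.epoch = float("inf"), 0, 0

    def step(self, val_loss):
        """Record one epoch's validation loss; return True to stop training."""
        self.epoch += 1
        if val_loss < self.best:
            self.best, self.bad_epochs = val_loss, 0
        else:
            self.bad_epochs += 1
        return self.epoch >= self.min_epochs and self.bad_epochs >= self.patience

stopper = EarlyStopper(patience=5, min_epochs=40)
for epoch in range(200):
    val_loss = 1.0 + 0.01 * epoch   # toy loss that plateaus immediately
    if stopper.step(val_loss):
        break
print(epoch + 1)   # -> 40: patience ran out early, but min_epochs delays the stop
```

On a small dataset, a noisy validation loss can trigger the patience criterion within the first few epochs; the minimum-epoch guard prevents the underfitting this would cause.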

Model training and evaluation
In the standard setting, in which a single dataset (a graph in our case) is used to evaluate model performance, the input graph is randomly split into a 70:10:20 ratio for train, validation, and test, respectively, via the RandomLinkSplit function of PyG. This function also randomly selects the negative edges needed for training and testing the NN-based classifier, with a 1:1 positive/negative ratio. The training set requires further shuffling of positive and negative edges. Only 70% of the train edges are used for training the encoder, while the rest are kept apart for the edge prediction part (i.e. the edge classifier).
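A simplified stand-in for this procedure (PyG's RandomLinkSplit additionally separates message-passing from supervision edges, which is omitted here) might look like the following sketch, with a hypothetical toy bipartite graph:

```python
import random

def split_edges(pos_edges, all_drugs, all_prots, seed=0):
    """Sketch of a 70:10:20 train/val/test edge split with 1:1 negative
    sampling. Negatives are sampled independently per split, a
    simplification relative to PyG's RandomLinkSplit."""
    rng = random.Random(seed)
    edges = list(pos_edges)
    rng.shuffle(edges)
    n = len(edges)
    n_train, n_val = int(0.7 * n), int(0.1 * n)
    parts = {"train": edges[:n_train],
             "val": edges[n_train:n_train + n_val],
             "test": edges[n_train + n_val:]}
    pos_set = set(pos_edges)
    out = {}
    for name, pos_split in parts.items():
        negs = set()
        while len(negs) < len(pos_split):       # 1:1 positive/negative ratio
            e = (rng.choice(all_drugs), rng.choice(all_prots))
            if e not in pos_set:                # never sample a known interaction
                negs.add(e)
        out[name] = (pos_split, sorted(negs))
    return out

drugs = [f"D{i}" for i in range(10)]
prots = [f"P{i}" for i in range(10)]
# Deterministic toy interaction set: 20 positive drug-protein edges.
pos = [(f"D{i}", f"P{j}") for i in range(10) for j in range(10)
       if (i + 2 * j) % 5 == 0]
splits = split_edges(pos, drugs, prots)
print({k: (len(v[0]), len(v[1])) for k, v in splits.items()})
# -> {'train': (14, 14), 'val': (2, 2), 'test': (4, 4)}
```

The printed sizes reflect the 70:10:20 ratio on positives and the 1:1 matching of negatives within each split.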
To assess the performance of the models in the edge classification task on test data, we use the area under the Receiver Operating Characteristic curve (AUROC), as well as the area under the precision-recall curve (AUPRC), both widely used for evaluating DTI prediction models.We refer to Supplementary Section S2 for a more extended description of these metrics.
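Both metrics are standard and available off the shelf; for instance, with scikit-learn (the labels and scores below are a toy example, not GeNNius outputs):

```python
from sklearn.metrics import roc_auc_score, average_precision_score

# Toy labels and predicted interaction probabilities for four drug-protein pairs.
y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]

auroc = roc_auc_score(y_true, y_score)             # area under the ROC curve
auprc = average_precision_score(y_true, y_score)   # area under the PR curve
print(round(auroc, 3), round(auprc, 3))            # -> 0.75 0.833
```

AUROC here is the probability that a random positive pair is ranked above a random negative one (3 of the 4 positive/negative pairings are ordered correctly, hence 0.75); average precision is the usual step-wise approximation of the AUPRC.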

Node features
Due to the different nature of drugs and proteins, we chose vastly different sets and dimensions of features for drug and protein nodes. The protein node features are encoded as a 20-D vector, accounting for the 20 different amino acids, where each feature indicates the proportion of the corresponding amino acid in the protein sequence associated with the node. Drug node features are chosen to be well-known molecular descriptors, calculated with RDKit (Landrum 2022) from their SMILES (Simplified Molecular-Input Line-Entry System) strings. Specifically, the 12 selected features for drug nodes are: LogP value, molecular weight, number of hydrogen bond acceptors, number of hydrogen bond donors, number of heteroatoms (i.e. any atom that is not carbon or hydrogen), number of rotatable bonds, topological polar surface area (surface sum over all polar atoms, primarily O and N, also including their attached H atoms), number of rings and aromatic rings, number of NHs and OHs, number of nitrogen and oxygen atoms, number of heavy atoms in a molecule (atomic number > 1), and number of valence electrons. While some of the above-mentioned features are related, model learning and performance are not expected to deteriorate, as a dropout layer was introduced to reduce the potential effect of feature collinearity (the correlation matrices of drug/protein features are provided in Supplementary Section S3). While other node features could be considered, such as pre-computed protein embeddings, training the model with those features showed almost no increase in performance (Supplementary Section S4). In addition, the pre-computed protein embeddings were not available for all proteins, which reduced the number of nodes, hindering the training process and making it impossible for the model to generalize when trained on small networks (Supplementary Section S4).
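As an illustration, the 20-D protein feature vector can be computed from a sequence in a few lines (the sequence below is a toy fragment; the drug descriptors would be computed with RDKit and are only named in a comment to keep the sketch dependency-free):

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"   # the 20 standard amino acids

def aa_composition(sequence):
    """20-D protein node features: the ratio of each standard amino acid
    in the sequence, as described above."""
    counts = Counter(sequence.upper())
    total = sum(counts[a] for a in AMINO_ACIDS) or 1   # ignore non-standard codes
    return [counts[a] / total for a in AMINO_ACIDS]

# Drug features would come from RDKit descriptors on the SMILES string,
# e.g. Descriptors.MolLogP, Descriptors.MolWt, Descriptors.NumHAcceptors,
# Descriptors.NumHDonors, Descriptors.TPSA, ... (12 descriptors in total).
features = aa_composition("MKTAYIAKQR")   # toy 10-residue sequence
print(len(features), round(sum(features), 6))   # -> 20 1.0
```

Since the features are proportions, each vector sums to one regardless of sequence length, making proteins of very different sizes directly comparable.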

Related work
In order to benchmark our proposed method GeNNius, we focus on the latest DTI prediction models that have been shown to outperform previously developed models in their respective publications.
• DTINet (Luo et al. 2017). It considers a heterogeneous graph with four node types (drugs, proteins, side effects, and diseases) and six edge types (DTIs, protein-protein interactions, drug-drug interactions, drug-disease associations, protein-disease associations, drug-side-effect associations, plus similarity edges between drugs and proteins). After compact feature learning on each drug/protein network, it calculates the best projection of one space onto another using a matrix completion method, and then infers interactions according to a proximity criterion. We note that the large quantity of data required to run the method hampered its reproducibility, as the code for generating all these matrices was not available.
• MolTrans (Huang et al. 2020). Like HyperAttentionDTI, it needs the SMILES for drugs and the amino acid sequences for proteins. It makes use of unlabeled data to decompose both drugs and proteins into high-quality substructures, to later create an augmented embedding using transformers. Due to this architecture, the model is able to identify which substructures contribute more to the overall interaction between a protein and a drug.
• MCL-DTI (Qian et al. 2023). This model inputs drug SMILES to generate two types of multimodal drug information (molecular image and chemical text), and the amino acid sequence for the target representation. Afterwards, the model uses two decoders on the generated embeddings, which include a multi-head self-attention block and a bi-directional multi-head cross-attention block for cross-information learning. The drug and protein decoders are used separately to obtain the interaction feature maps. Finally, the feature maps generated by the decoders are fed into a fusion block for feature extraction, which outputs the DTI prediction results.
We selected these models following two main criteria: reproducibility and architectural diversity. Firstly, all these models have working code available. Secondly, our choice of models encompasses both diverse architectural approaches and feature selection, allowing for a comprehensive exploration of the problem space. Notably, we decided to include DTINet, even though it is an older model compared to the rest, because it is widely cited and often used as a comparison in the field. Some other recent methods were excluded from this review due to reproducibility issues (Li et al. 2022, 2021, Binatlı and Gönen 2023, Yao et al. 2023).

Datasets
In this work, we selected various datasets that have been widely used for DTI prediction tasks. For all considered datasets, we dropped those drugs and proteins for which SMILES or amino acid sequences could not be generated, yielding slightly smaller networks (see Supplementary Section S5).
Note that the above-mentioned datasets, with the exception of BindingDB and Davis, contain only positive samples, i.e. positive links in the network.Nevertheless, when choosing negative samples, we performed random subsampling to have a balanced dataset prior to training the model.
Dataset statistics are summarized in Table 1. These datasets were released in different years, and thus some drug–target interactions can be shared across datasets (see Supplementary Section S6).

Dataset configuration for inferring unknown positives
DTI datasets contain information from diverse sources, have been released in different years, and may be curated in various ways. As a result, negatively labeled edges in one dataset may be reported as positive in other datasets. We evaluate these unknown positive edges for each dataset to assess whether GeNNius can predict them (see Supplementary Section S6 for details on the number of such edges in each dataset). Importantly, we ensured that testing edges do not appear as negatives during training, to assess how well GeNNius predicts these specific interactions. We repeated the process ten independent times, enabling us to investigate the variability of the prediction depending on the training edges, which is often not reported in DTI prediction models.

Data leakage prevention during evaluation on unseen datasets
Contrary to previously proposed models, we assess the generalization capability of GENNIUS by training it on one dataset and testing it on another.For a fair assessment, it is necessary to ensure that there is no data leakage of DTIs between training and testing.
Let us consider two nodes that are present in both the training and test datasets. There are four possible scenarios for an edge connecting these nodes. A positive edge in both datasets is a clear example of data leakage from the train to the test set, as we already informed the model about that positive edge during training. Hence, those repeated DTIs are removed during training. On the other hand, edges that appear in one dataset but not in the other are kept. Keeping the negative edges in the training data makes sense from a usability perspective, since a non-reported DTI in a given dataset does not necessarily mean that the pair does not interact, and we would like to test the capabilities of the model under this general scenario. Further, a negative edge may be shared by both datasets; however, since negative edges are randomly selected when generating the training and testing sets, the probability of picking the same edge in both datasets is very low. As an illustrative example, when using DrugBank for training and NR for testing, the probability of selecting the same negative edge is approximately 3 × 10⁻⁶.
We performed five independent training runs on each dataset, i.e. randomly selecting each time a different set of edges for training the model.Next, for each trained model, we performed five independent testing runs.We report the average and standard deviation of the AUROC and AUPRC metrics, of the test set, across the total 25 runs per training-testing dataset pair.

Protein and drug annotation
Protein family and enzyme annotation was retrieved from the ChEMBL database (release 31), as its family hierarchy is manually curated and follows commonly used nomenclature (Gaulton et al. 2016). Drug chemical annotation was generated using ClassyFire, an automated structural classifier of chemical entities (Djoumbou Feunang et al. 2016). Annotation was used for coloring the t-SNE plots.

Hardware
All simulations were performed on a server with 64 Intel Xeon Gold 6130 2.1 GHz cores, 754 GB of RAM, and an NVIDIA GeForce RTX 3080 (driver version 515.43.04) with CUDA 11.7.

GeNNius outperforms state-of-the-art methods
The proposed model was run on the eight selected datasets with five independent runs. The resulting AUROC and AUPRC metrics on the test sets across all datasets, as well as running times (corresponding to train, validation, and test), are presented in Fig. 2 (see also Supplementary Section S7). GeNNius returned AUROC and AUPRC performance close to 1 (>0.9) for large datasets; while smaller datasets reported worse results, they are still compelling (>0.8 in almost all runs). NR, being the smallest one, achieved the worst results (>0.7). In addition, the large datasets showed stable results, with a low standard deviation, across the five independent runs. Further, the model execution time was ultrafast for all datasets (less than a minute for the largest dataset). Note that the time variance in the large datasets is due to early stopping. Furthermore, we also performed experiments with different subsampling ratios of positive:negative pairs (1:1, 1:2, 1:10), showing that the 1:1 ratio does not oversimplify the problem of predicting DTIs (see Supplementary Section S8).
Next, we compared the performance of GeNNius with previously proposed methods. Table 2 shows the performance results of GeNNius and the state-of-the-art methods for both DrugBank and BIOSNAP, the largest standard DTI datasets. We focus on these datasets as they better characterize the current size of testable available drugs. GeNNius outperformed all benchmarked methods in terms of AUROC and AUPRC. Importantly, the execution time is significantly reduced, even when executed without a GPU (see Supplementary Section S9). Previous methods' running times were in the order of tens of minutes (except DTINet, which took 4.23 min), while GeNNius took <0.6 min to perform the training, validation, and testing. The closest performance in AUROC and AUPRC to GeNNius was achieved by EEG-DTI. However, EEG-DTI took more than three orders of magnitude longer to run (917.39 versus 0.58 min in DrugBank). Finally, we also compared GeNNius to the off-the-shelf machine learning baselines Logistic Regression (LR) and Random Forest (RF), to assess the actual

GeNNius prediction capabilities for inferring previously unreported drug–target interactions
To analyze the capability of GeNNius to detect unknown interactions, we first identified those drug–protein pairs lacking an edge in one dataset (negative label) but connected in other datasets (positive label). Then, we assessed whether GeNNius was able to annotate these edges as positive. We trained the model ensuring that the edges for testing were not seen during the training process, and repeated the process ten times. Further details of the experimental set-up are given in Section 2.2.2.
The ratio of correctly predicted edges for each dataset is presented in Fig. 3. When trained with large datasets, GeNNius returned good prediction capabilities, detecting more than 80% of edges in almost all cases. It is worth noting that with DrugBank, GeNNius successfully predicted more than 90% of these edges across all runs. Further, when using the Yamanishi datasets (E, GPCR, IC, and NR), GeNNius returned satisfactory results, predicting 70% of DTIs on average across different runs, although with higher variability than when using large datasets. This suggests that training on a small dataset hinders the inference of new interactions, as the random choice of edges for training has a larger impact on the predictive power in these cases. We note that the observed outliers could be due to a non-informative random selection of training edges. Finally, the Davis dataset yielded significantly worse results than the other datasets. At first sight, this behavior could be attributed to the origin of the Davis dataset, as it is generated from affinity experiments. However, BindingDB, which is also generated from affinity data, does not yield such low performance. Hence, this may indicate that the problem comes from the significant difference in topology between Davis and all the other datasets. Davis is the only dataset formed as a single connected network, while the other datasets have more than one connected component. It also presents significantly different average degree values for drug nodes (see Table 1).

GeNNius generalization capabilities
We evaluated GeNNius performance when training and testing on different datasets. In order to ensure that there is no data leakage that might oversimplify the prediction task, DTIs common to the train and test datasets were discarded prior to applying the model (see Section 2.2.3 for a more detailed description of the set-up).
AUROC results are presented in Fig. 4 (the AUPRC results obtained are similar, see Supplementary Section S11), where each entry of the heatmap shows the performance of GeNNius on the row dataset when trained on the column dataset. The reported values correspond to 25 runs, where the statistical deviations in AUROC and AUPRC arise from the random selection of edges.
In general terms, GeNNius returned compelling results in its generalization capabilities; however, there was a strong dependence on the training dataset. GeNNius reported the best generalization capabilities when trained on larger datasets, such as DrugBank, BIOSNAP, and E. On the other hand, when the model is trained on the smallest dataset, NR, it cannot generalize, resulting in lower AUROC/AUPRC values compared to the others (whiter colors in the NR column). In addition, despite the Davis dataset being similar in size to the other Yamanishi datasets, it returned the second-worst results for both training and testing. As mentioned previously, Davis' topology is different from the rest of the networks. In addition, Davis and BindingDB, unlike the other datasets, come from affinity experiments. However, the latter seems to perform similarly, albeit slightly worse, than DrugBank when used for training.
We also found that, for smaller networks, our method obtains better results when trained on large datasets and tested on smaller ones than when trained and tested on the same small dataset (see results in Section 3.1).

Figure 1. Schematic of the GeNNius architecture. GeNNius inputs a graph containing drug (red) and protein (blue) nodes, where N_d and N_p represent the numbers of drugs and proteins, respectively. First, a GNN generates node representations with an embedding of dimension d = 17. Second, an NN-based classifier aims at learning the existence of an edge given a set of concatenations of drug and protein embeddings. Note that at this stage a negative set of edges is generated.

Figure 2. Boxplots of AUROC and AUPRC metrics on test data for five independent runs of GeNNius on the eight selected datasets. (Upper) AUROC results. (Middle) AUPRC results. (Lower) Time results in minutes.

Figure 3. Boxplot of the ratio of correctly identified positive edges in 10 independent runs, trained and tested on the same dataset. Note that e is the number of edges to be evaluated.

Figure 4. Performance of GeNNius in terms of AUROC when training on one dataset (column) and testing on another (row). Train datasets do not contain positive edges that appear in the test dataset.
• EEG-DTI (Peng et al. 2021). EEG-DTI, which stands for end-to-end heterogeneous graph representation learning-based method, also considers a heterogeneous network, where the nodes and edges are the same as in DTINet (see above). The model first generates a low-dimensional embedding for drugs and proteins with three Graph Convolutional Network (GCN) layers, and then concatenates these layers for drugs and proteins separately. Finally, it calculates the inner product to obtain the protein–drug score.
• HyperAttentionDTI (Zhao et al. 2021). This method only requires the SMILES string for drugs and the amino acid sequence for proteins. It embeds each character of the different sequences into vectors. The model is based on the attention mechanism and Convolutional Neural Networks (CNNs) in order to make DTI predictions.
• MolTrans (Huang et al. 2020). Its name stands for Molecular Interaction Transformer for predicting DTIs.

Table 2. Benchmarking results of GeNNius against four state-of-the-art DTI methods and two off-the-shelf machine learning baselines, for the BIOSNAP and DrugBank datasets. Best values are highlighted in bold, excluding baseline results. All AUROC and AUPRC results correspond to the test set; execution time corresponds to train/validation/test. SOTA models were run in their default configurations: MolTrans and MCL-DTI correspond to 5 independent runs (execution time corresponds to a single iteration), DTINet and EEG-DTI correspond to 10-fold cross-validation, and HyperAttentionDTI to 10-times repeated 5-fold cross-validation. All simulations were run on the hardware specified in Section 2.1.