Abstract

Because no method efficiently represents the multimodal information of a protein, including its structure and sequence information, machine-learning predictions of compound-protein binding affinity (CPA) still suffer from low accuracy. To overcome this limitation, we develop a coevolutionary strategy within a novel end-to-end architecture (named FeatNN) to jointly represent the structure and sequence features of proteins and ultimately optimize the mathematical models for predicting CPA. Furthermore, from a data-driven perspective, we propose a rational method that can utilize both high- and low-quality databases to optimize the accuracy and generalization ability of FeatNN in CPA prediction tasks. Notably, we visually interpret the feature interaction process between sequence and structure in the rationally designed architecture. As a result, FeatNN considerably outperforms the state-of-the-art (SOTA) baseline in virtual drug evaluation tasks, indicating the feasibility of this approach for practical use. By efficiently representing the multimodal information of proteins via a coevolutionary strategy, FeatNN achieves higher CPA prediction accuracy and better generalization ability.

Introduction

Since it is time and resource consuming to experimentally assess compounds and target protein binding affinities during drug discovery and development, effective drug identification approaches using computational methods could greatly accelerate the drug candidate discovery process by learning the abstract binding information between drug and target and accurately predicting compound-protein binding affinities (CPA) [1, 2], especially in cases where great numbers of sources for compound and protein interaction data are available through open source databases. For instance, BindingDB [3] currently provides a comprehensive collection of experimentally measured binding affinity data including more than 1 million protein–ligand complexes in the Protein Data Bank (PDB) [4], which substantially increases the potential for in silico CPA prediction. However, even with these abundant data, accurately predicting CPA is still the fundamental challenge preventing this method from being used in practical drug candidate screening applications due to the lack of a method to efficiently extract features from the data. To increase the accuracy of CPA prediction, the development of computational methods has proceeded with a variety of protein information embedding and representation strategies [5–8]. Despite substantial advancements, these strategies have met challenges with respect to further increasing the accuracy of CPA prediction.

Initially, researchers tended to represent protein features only using the protein sequence information, namely, the target (protein) is regarded as a sequence of residues. In these models, a pairwise array with the residue features of the protein as its column (or row) and the SMILES sequence information of the compound as its row (or column) is often utilized as the attention matrix to learn the potential interaction between a protein and a compound [9]. Typically, these models rely on the sequence information of the compounds and proteins of interest to learn their interactions via pairwise matrices, with the aim of predicting the binding affinities between them [9–13]. For example, multilayer 1-dimensional convolutional neural networks (1D-CNNs) are utilized to extract the features from the residue sequences of proteins, and the obtained vectors are used to represent the features of proteins, predict the CPAs and intensively study the noncovalent interaction between the ligand and binding target [14–16]. However, in addition to a protein’s sequence of residues, the 3D structure of a protein also contributes significantly to its features [17, 18]. Therefore, neglecting the 3D spatial structure information of the protein may prevent the full realization of the potential of computational modeling in CPA prediction.

In this context, approaches for representing and embedding protein structure information have been proposed to improve the accuracy of CPA prediction. These include molecular docking simulation methods [19, 20], which build on background knowledge of molecular dynamics, and structure-based machine learning methods [8, 21]. Relying on biophysical knowledge, the docking method computationally simulates the potential binding sites and 3D structures of compound-protein complexes, so it depends heavily on high-quality 3D protein structure data during CPA prediction [22, 23]. Despite a few success stories, this method is severely limited by the scarcity of high-quality 3D structure data of proteins (the precise position of each atom in a protein) [24]. By contrast, machine learning-based approaches can use 3D protein structure data with either high or low resolution (the positions of key atoms in a protein). These models are fed with the spatial 3D information of the proteins in order to attain a superior ability to predict CPA [25–27]. For instance, the structural features of proteins were extracted through 3D atomic representations in voxel space by applying 3D CNNs [8]. However, the performance of these models was not significantly improved by introducing the structural information of the proteins [6, 8]. We hypothesized that this was because these methods do not comprehensively consider the multimodal information (both sequence and structure information) of the protein. To address this problem, we sought to develop a method that can rationally incorporate the multimodal information of proteins into CPA prediction models in order to improve CPA prediction performance.

Inspired by multi-feature fusion tactics via coevolution [28], we designed an end-to-end neural network architecture (Figure 1), named the fast evolutional aggregating and thoroughgoing graph neural network (FeatNN). Through the coevolutionary strategy, FeatNN efficiently represents the multimodal information (containing both structure and sequence information) of proteins and thus overcomes the multimodal protein information representation challenge. On the IC50 and KIKD datasets generated from PDBbind [29], FeatNN outperforms the SOTA method (MONN) in CPA prediction tasks by 21.33 and 17.07% in the R2 metric, by 6.16 and 2.98% in the root mean square error (RMSE), and by 7.00 and 5.45% in the Pearson coefficient, respectively (Figure 2).

Architecture overview of FeatNN. (A) The atom and bond information of a given compound is encoded into a molecular graph, which acts as the input for the compound extractor module to distill its features. The compound extractor includes a deep GCN block (Supplementary Figure 12 available online at http://bib.oxfordjournals.org/) and multihead attention blocks (Supplementary Figure 14 available online at http://bib.oxfordjournals.org/). (B) The features of a protein are embedded with matrices and vectors as inputs to the Prot-Aggregation module (Supplementary Figure 17 available online at http://bib.oxfordjournals.org/), whose outputs are then fed to the Evo-Updating module (Supplementary Figure 18 available online at http://bib.oxfordjournals.org/), which co-evolutionarily updates the structure and sequence features. Together, the Prot-Aggregation module and the Evo-Updating module form the protein extractor block. (C) The extracted atom and residue features are processed by the affinity learning module (Supplementary Figure 20 available online at http://bib.oxfordjournals.org/), which also enables FeatNN to learn the potential interaction features between the atoms of the compound and the residues of the protein. Additionally, the sets of information derived from the atom features and residue features are integrated through the affinity learning module to predict the CPA. The parameter settings of FeatNN are shown in Supplementary Table 2 available online at http://bib.oxfordjournals.org/.
Figure 1

Evaluation of FeatNN, BACPI, SIGN, GraphDTA (GATNet, GATGCN, GCNNet, GINConvNet) and MONN. Performance was evaluated on compound-clustered datasets with similarity thresholds of 0.3, 0.4, 0.5 and 0.6, constructed from PDBbind with KIKD and IC50 measurements, respectively. The benchmark dataset is generated from PDBbind (version 2020, the general set) and contains 12 699 compound-protein pairs. Performance results are plotted as the mean values and standard deviations (SD) obtained by a 5-fold cross-validation strategy with 10 independent experiments. Each point represents the mean of an independent experimental group, with error bars indicating the SD. We chose the three indicators (the RMSE, Pearson coefficient and R2) that can best evaluate the prediction performance of the methods in terms of the continuous values (CPA) they predicted. (A) Performance evaluated on the dataset generated from PDBbind with KIKD measurement. (B) Performance evaluated on the dataset generated from PDBbind with IC50 measurement. Please note that the results of SIGN presented here differ from those reported in the original literature [26], possibly because we use PDBbind-v2020 as our benchmark database instead of the PDBbind-v2016 used in that study. In addition, considering the biological meaning behind the data, we split the dataset into two parts (‘IC50’ and ‘KIKD’ [9]) instead of simply mixing the affinities measured with ‘IC50’, ‘Ki’ and ‘Kd’ together as in that study. Moreover, we applied compound-cluster and protein-cluster strategies in our study to avoid data leakage caused by biologically correlated knowledge (similar structures or sequences among proteins or compounds). In most cases, MONN achieved the best performance among the baselines; therefore, we consider MONN as the SOTA baseline in our paper.
Figure 2

The major technical advances of FeatNN are listed as follows.

  • (i) An Evo-Updating block is employed in the protein encoding module to interactively update the sequence and structure information of proteins so that high-quality protein features are extracted and presented, enabling FeatNN to outperform the SOTA model by as much as 21.33% in R2.

  • (ii) In FeatNN, the distance matrices of protein residues are discretized into one dimension, and the word embedding strategy is applied to encode protein structure information, so that the network could effectively represent the multimodal protein information and lower the computational cost simultaneously.

  • (iii) With respect to the extraction of compound features, a specific residual connection is applied to represent the molecular graph, in which the features of the initial nodes are added onto each layer of the GCN [30], such that the graph features representation limitation caused by the notorious oversmoothing problem in traditional deep GCNs is solved.

  • (iv) With the pretraining and fine-tuning strategy, the R2 performance of the optimized model, FeatNNoptm, further increases by 3.29% on average compared to that of FeatNN.

  • (v) FeatNN has excellent generalization in the affinity prediction task, which is vital and pivotal in the drug discovery domain. Targeting severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) 3-chymotrypsin (3C)-like protease and Akt-1, the generalization of FeatNN vastly outperforms the SOTA baseline in the affinity value prediction task.

  • (vi) The prediction results of FeatNN with different conformations of the same protein are robust when 3D structure information is directly introduced in the model while neglecting the molecular dynamics of the protein.

Materials

Dataset construction

Even though the PDBbind [31], BindingDB [3] and Binding MOAD [32–34] databases (Supplementary Figure 2 and Supplementary Table 1 available online at http://bib.oxfordjournals.org/) contain paired information of protein-ligand complexes with structural data and the corresponding binding affinities, it was necessary to eliminate some data to comply with the quality standards of our model and baselines. The exclusion criteria included defects in the protein PDB file and inconsistencies between the sequence information in UniProt and PDB. Based on these criteria, we constructed a benchmark dataset from PDBbind (version 2020, the general set) [29] that contains 12 699 compound-protein pairs. Meanwhile, a refined dataset [31] with higher-quality structural information was also constructed from PDBbind (version 2020, the refined set, see Supplementary Figure 2F available online at http://bib.oxfordjournals.org/). Additionally, we generated another dataset based on BindingDB (version 6 Feb 2022; the general set) [3] that is rich in data on compound-protein paired complexes but poor in protein diversity. The complex structure information in this dataset is not strictly paired and is of low quality, because not all complexes in BindingDB have strictly paired 3D structure conformations, and most of these complexes correspond to multiple protein conformations with different PDB entries. Therefore, for complexes without a strict correspondence between protein and compound, we preferentially chose the ligand-free or high-resolution PDB file. This generated dataset contains more than 210 thousand compound-protein pairs (Supplementary Table 1 available online at http://bib.oxfordjournals.org/).
To test the generalization ability of the models, we constructed new datasets from the Binding MOAD database (see Supplementary Table 1 available online at http://bib.oxfordjournals.org/) and excluded the complexes that appear in the datasets (training, validation and test datasets) constructed from PDBbind [35] (Supplementary Figure, Supplementary Table 1 available online at http://bib.oxfordjournals.org/). An affinity value of a certain measurement type (i.e. Ki, Kd or IC50) was provided for each complex, and ‘KIKD’ was used to refer to the combination of Ki-measured data and Kd-measured data due to their high homogeneity. More details about the dataset construction process are available in the Supplementary Methods 3.3 available online at http://bib.oxfordjournals.org/.

Training data generation

The PDBbind-based (both the general and refined datasets) training dataset generation process included three key steps. (1) Before performing data cleaning, we first assessed whether the regression labels (CPA values) in both PDBbind and BindingDB followed normal distributions to avoid the potential prediction deviation problem (Supplementary Figure 2 available online at http://bib.oxfordjournals.org/). (2) We then clustered the input compounds and proteins according to a given threshold (0.3, 0.4, 0.5 or 0.6) [9] to avoid the potential data leakage problem that could occur due to data similarities. In this step, we assessed the similarity of the proteins using their multiple sequence alignment (MSA) scores and calculated the similarity of the compounds based on their fingerprints. Compounds or proteins whose similarity fell within the threshold were then assigned to the same dataset; the details of this process are provided in Supplementary Methods 3.4 and 3.5 available online at http://bib.oxfordjournals.org/. (3) Finally, we used a 5-fold cross-validation strategy [36] to generate training datasets to alleviate the potential overfitting problem. The dataset was randomly shuffled and split with a training-validation-testing ratio of approximately 7:1:2. For the generation of the BindingDB-based training dataset, we directly shuffled and split the dataset with the same training-validation-testing ratio. The datasets generated from Binding MOAD were only used for testing the models’ generalization ability and transferability.
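
The clustering step (2) can be illustrated with a minimal sketch. It assumes single-linkage clustering over a precomputed pairwise distance and a greedy assignment of whole clusters to the three subsets; the actual procedure is the one given in Supplementary Methods 3.4 and 3.5, and the function names `cluster_by_threshold` and `split_clusters` as well as the toy one-dimensional "distance" are hypothetical. Cross-validation folds are not shown.

```python
import random


def cluster_by_threshold(distance, n_items, threshold):
    """Single-linkage clustering: items closer than `threshold` share a cluster."""
    parent = list(range(n_items))

    def find(i):
        # Union-find root lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(n_items):
        for j in range(i + 1, n_items):
            if distance(i, j) < threshold:
                parent[find(i)] = find(j)

    clusters = {}
    for i in range(n_items):
        clusters.setdefault(find(i), []).append(i)
    return list(clusters.values())


def split_clusters(clusters, ratios=(0.7, 0.1, 0.2), seed=0):
    """Assign whole clusters to train/validation/test, approximating `ratios`."""
    rng = random.Random(seed)
    shuffled = clusters[:]
    rng.shuffle(shuffled)
    total = sum(len(c) for c in shuffled)
    targets = [r * total for r in ratios]
    splits = ([], [], [])
    for cluster in shuffled:
        # Put the cluster into the subset that is furthest below its target size.
        k = max(range(3), key=lambda s: targets[s] - len(splits[s]))
        splits[k].extend(cluster)
    return splits


# Toy data: ten "compounds" on a line; distance is the absolute difference.
points = [0.0, 0.1, 0.15, 1.0, 1.05, 2.0, 3.0, 3.2, 5.0, 6.0]


def toy_distance(i, j):
    return abs(points[i] - points[j])


clusters = cluster_by_threshold(toy_distance, len(points), threshold=0.3)
train, valid, test = split_clusters(clusters)
```

Because whole clusters are assigned to a single subset, any two items that end up in different subsets are at least `threshold` apart, which is the property the clustering is meant to guarantee.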

Baseline methods

To assess the performance of FeatNN, we chose the multiobjective neural network (MONN) [9] and the structure-aware interactive graph neural network (SIGN) [26] to represent SOTA architectures, and two classic methods, the drug-target binding affinity graph neural network (GraphDTA) [37] and the bidirectional attention neural network for compound-protein interaction (BACPI) [38], as our baseline models. We followed the same experimental settings as those used in the original studies that reported these baseline models.

MONN applies a GCN block [30] to extract compound features and a 1D-CNN block to extract protein features and then constructs a pairwise matrix from the features of compounds and proteins to describe noncovalent interactions and predict CPA.

GraphDTA comprises four models: the graph attention network (GATNet), graph convolutional network (GCNNet), the combined GAT and GCN (GATGCN) and graph isomorphism network (GINConvNet), all of which utilize architectures with a GCN block and an attention mechanism to extract protein and compound features and finally predict CPA through several dense layers that aggregate the features of compounds and proteins.

BACPI serves as a bidirectional attention neural network and uses a 1D-CNN block to extract protein features from residue sequences and a graph attention network to extract compound features. CPA is predicted through several dense layers; this is similar to the GraphDTA approach.

SIGN is a structure-based method that converts the protein-ligand complex into a complex interaction graph and extracts its features from this graph. The training data for this model must strictly contain paired data (both protein and compound) from a complex with high-quality structure information.

Results

The design of FeatNN with input protein sequence and structure information

Given that the structure-based models that only consider the structure information of a protein might not well represent the protein’s multimodal information, namely the sequence and structure information, we hypothesized that introducing the multimodal information of protein with a rational strategy in the CPA prediction model may further improve its CPA prediction performance.

To test this hypothesis, in an end-to-end neural network architecture, we first developed a method to represent the protein structure information, including the Euclidean distances between the residues of proteins in 3D space and the dihedral angles (φ and ψ) on the backbones of proteins. We then co-evolutionarily updated this structure information with the residue sequence information of proteins, with the aim of comprehensively and efficiently representing their multimodal information. The general workflow of this model, FeatNN, is depicted in Figure 1. FeatNN was designed with a flexible architecture that can process amino acid sequences and atom sequences of any length; thus, the whole set of information about proteins and compounds can be characterized. More specifically, the compound information proceeds through the compound extractor module (Figure 1A and Supplementary Figure 13 available online at http://bib.oxfordjournals.org/) that consists of a multihead vertex representation (Figure 1A and Supplementary Figure 14 available online at http://bib.oxfordjournals.org/) and deep GCN blocks (Figure 1A and Supplementary Figure 9 available online at http://bib.oxfordjournals.org/). Notably, the deep GCN block is applied to prevent the oversmoothing problem during the training process [39] of the compound extractor (the oversmoothing problem is described in more detail in Supplementary Note 1.1 available online at http://bib.oxfordjournals.org/). To allow remote atoms to communicate with a given node, a master node is employed to capture both local and global features simultaneously, so that FeatNN can learn comprehensive compound features from both global and local views.
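
The idea of adding the initial node features back at every GCN layer can be sketched as follows. This is an illustration only, assuming a plain additive residual, ReLU activations and equal feature widths in all layers; the exact formulation used in FeatNN is the one given in the Supplementary material, and the function names and toy molecular graph are hypothetical.

```python
import numpy as np


def normalized_adjacency(adj):
    """Symmetric normalization D^-1/2 (A + I) D^-1/2 used by standard GCNs."""
    a_hat = adj + np.eye(adj.shape[0])
    deg_inv_sqrt = 1.0 / np.sqrt(a_hat.sum(axis=1))
    return a_hat * deg_inv_sqrt[:, None] * deg_inv_sqrt[None, :]


def deep_gcn_initial_residual(adj, h0, weights):
    """Stack GCN layers, adding the initial node features h0 after every layer."""
    a_norm = normalized_adjacency(adj)
    h = h0
    for w in weights:
        # ReLU(GCN propagation) plus the untouched initial features.
        h = np.maximum(a_norm @ h @ w, 0.0) + h0
    return h


# Toy molecular graph: a chain of three atoms with 4-dimensional features.
adj = np.array([[0.0, 1.0, 0.0],
                [1.0, 0.0, 1.0],
                [0.0, 1.0, 0.0]])
rng = np.random.default_rng(0)
h0 = rng.normal(size=(3, 4))
weights = [rng.normal(scale=0.1, size=(4, 4)) for _ in range(8)]
out = deep_gcn_initial_residual(adj, h0, weights)
```

However many layers are stacked, each node's output keeps a direct contribution from its own initial features, which is what counteracts the tendency of repeated neighborhood averaging to make all node representations alike.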

Meanwhile, for the representation of protein structure information, the distance matrix of protein residues is discretized into one dimension, and the strategy of word embedding is applied to encode structure information regarding the Euclidean distances between protein residues as a discrete distance matrix (DDM), which greatly reduces the computational cost of obtaining structure information while still allowing the model to effectively represent the structure information of proteins. After that, the protein features are generally learned by the protein extractor module (Figure 1B and Supplementary Figure 15 available online at http://bib.oxfordjournals.org/). In the protein extractor module, a Prot-Aggregation block (Figure 1B and Supplementary Figure 17 available online at http://bib.oxfordjournals.org/) first converts the residue sequence of the given protein, the DDM and the torsion matrix into two variables: a new matrix representing the residue sequence of the protein and a new distance matrix encoded with the structure information of the protein. The two outputs generated from the Prot-Aggregation block are then fed into the Evo-Updating block (Figure 1B and Supplementary Figure 18 available online at http://bib.oxfordjournals.org/), which serves as the vital component in the protein encoder module (Figure 1B and Supplementary Figure 16 available online at http://bib.oxfordjournals.org/). In this way, the structure and sequence information are interactively aggregated through a coevolutionary strategy in the Evo-Updating block, which ensures that FeatNN can learn preeminent features from multimodal protein information.
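
The discretization and word-embedding steps can be sketched as below. This is a hedged illustration: the bin edges, the embedding width, the random (rather than learned) embedding table and the function names are assumptions made for the example and are not values taken from FeatNN.

```python
import numpy as np


def discrete_distance_matrix(coords, bin_edges):
    """Map pairwise residue distances (in angstroms) to integer bin indices."""
    diff = coords[:, None, :] - coords[None, :, :]
    dist = np.linalg.norm(diff, axis=-1)
    return np.digitize(dist, bin_edges)


def embed_ddm(ddm, table):
    """Look up one vector per distance bin, as a word embedding would."""
    return table[ddm]


# Toy protein: four residues placed on a line (coordinates in angstroms).
coords = np.array([[0.0, 0.0, 0.0],
                   [3.8, 0.0, 0.0],
                   [7.6, 0.0, 0.0],
                   [20.0, 0.0, 0.0]])
bin_edges = np.arange(2.0, 22.0, 2.0)  # assumed edges: 2, 4, ..., 20

ddm = discrete_distance_matrix(coords, bin_edges)

# In a trained model this table would be learnable; here it is random.
rng = np.random.default_rng(0)
table = rng.normal(size=(len(bin_edges) + 1, 8))
embedded = embed_ddm(ddm, table)  # shape: (residues, residues, 8)
```

Each real-valued distance is replaced by a single integer token, so the structure input becomes a small vocabulary that can be embedded with one table lookup instead of being processed as continuous 3D coordinates.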

Finally, the learned representations of compound features and protein features are input into the affinity learning module (Figure 1C and Supplementary Figure 20 available online at http://bib.oxfordjournals.org/). The detailed designs of the compound extraction module, protein extraction module and affinity learning module are described in the Methods and Supplementary sections available online at http://bib.oxfordjournals.org/.

FeatNN outperformed the SOTA model in CPA prediction

To assess the performance of FeatNN, the models mentioned above were trained on the dataset generated from the general PDBbind set, and their CPA prediction performances were compared (Figure 2 and Supplementary Figure 3 available online at http://bib.oxfordjournals.org/). In addition to our model (FeatNN), the baseline models were BACPI [38], SIGN [26], MONN [9] and four variants of GraphDTA (i.e. GATGCN, GCNNet, GATNet and GINConvNet) [37]. Because some compounds and proteins tend to be highly similar and homologous, we followed the clustering strategy (for details, see Supplementary Methods 3.4 and 3.5 available online at http://bib.oxfordjournals.org/) proposed in previous studies to prevent information leakage from the test set data during the model training process [9, 40]. Four different clustering thresholds (0.3, 0.4, 0.5 and 0.6) were used to cluster similar data and split them into training, validation and test sets in the control group experiment; each threshold indicates the minimum distance between any two clusters. For example, a 0.3 clustering threshold meant that any compounds from two different sets (training, validation or test set) were at least 30% different in terms of their respective structures. In terms of the compound-clustered test group, FeatNNgeneral outperformed the SOTA baselinegeneral (MONN) by 21.33% in the R2 metric under IC50 (Figure 2B and Supplementary Table 3 available online at http://bib.oxfordjournals.org/) and 17.07% under KIKD (Figure 2A and Supplementary Table 3 available online at http://bib.oxfordjournals.org/). In addition, the evaluation results of the protein-clustered test group can be found in Supplementary Figure 3 available online at http://bib.oxfordjournals.org/. FeatNNgeneral also surpassed the baseline models in most cases (Supplementary Figure 3 and Supplementary Table 3 available online at http://bib.oxfordjournals.org/).
However, as shown in Supplementary Figure 3A available online at http://bib.oxfordjournals.org/, the SIGN model achieved the best performance in RMSE but the worst in Pearson and R2 on the ‘KIKD’ dataset constructed from the general set of PDBbind-v2020, possibly because the SIGN model efficiently learned the absolute error (RMSE) between the predicted affinities and the real ones but was unable to learn their correlation (Pearson, R2). Even though the similarity of the data (protein or compound) in the same dataset (training, validation or test datasets) decreases with increasing threshold, the CPA prediction correlation performance of FeatNNgeneral remained consistent and it outperformed the baselines, indicating the robustness and outstanding performance of FeatNN in comparison with the baseline models. Furthermore, we trained FeatNNrefine on the refined datasets of PDBbind [31] to assess whether a high-quality structural dataset can enhance its CPA prediction performance. Interestingly, we found that the Pearson performances of FeatNNrefine and SOTA baselinerefine were, respectively, elevated by 2.65 and 5.45% compared to the corresponding methods trained on general datasets of PDBbind with the compound-clustered method (with the threshold of 0.3, details in Supplementary Figures 4A, 5A and Supplementary Tables 4, 5 available online at http://bib.oxfordjournals.org/). However, the R2 and Pearson values of FeatNNrefine and SOTA baselinerefine were found to be somewhat lower when applying the protein-clustered method, indicating that the accuracy and generalization of the models were affected, possibly due to the limited amount of high-quality data in the refined dataset of PDBbind-v2020 (Supplementary Figures 4B, 5B and Supplementary Tables 4, 5 available online at http://bib.oxfordjournals.org/). According to the dataset statistics (Supplementary Table 1 available online at http://bib.oxfordjournals.org/), protein diversity is poor in the refined dataset. Such a negative effect is observed possibly because the diversification of protein data is crucial for the performance of a computational model in CPA prediction tasks [41].
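
For reference, the three indicators used throughout this section can be computed as in the sketch below. Two assumptions are made here that the text does not state explicitly: R2 is taken as the coefficient of determination (not the squared Pearson coefficient), and the reported percentages are taken as relative gains over the baseline value. The function names are hypothetical.

```python
import math


def regression_metrics(y_true, y_pred):
    """Return RMSE, Pearson correlation and R2 (coefficient of determination)."""
    n = len(y_true)
    mean_t = sum(y_true) / n
    mean_p = sum(y_pred) / n
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_t) ** 2 for t in y_true)
    cov = sum((t - mean_t) * (p - mean_p) for t, p in zip(y_true, y_pred))
    var_p = sum((p - mean_p) ** 2 for p in y_pred)
    return {
        "rmse": math.sqrt(ss_res / n),
        "pearson": cov / math.sqrt(ss_tot * var_p),
        "r2": 1.0 - ss_res / ss_tot,
    }


def relative_gain_percent(new, baseline):
    """Relative change of `new` over `baseline`, in percent."""
    return (new - baseline) / baseline * 100.0


# Toy affinities (e.g. on a -log10 scale) and predictions.
metrics = regression_metrics([1.0, 2.0, 3.0, 4.0], [1.1, 1.9, 3.2, 3.8])
```

RMSE measures the absolute size of the errors, whereas the Pearson coefficient measures only how well predictions track the ordering and spread of the measured values, so a model can rank differently on the two.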

Performances of FeatNN on the BindingDB dataset

Even though the PDBbind database has rich protein diversity, the amount of paired information in this database is limited (12 699 records). By contrast, the BindingDB database is much larger (218 615 records), but the quality of its structural data is not very high; it is also poor in protein diversity and provides limited structure information for the compound-protein complexes. To comprehensively evaluate the performance of FeatNN, we first tested FeatNN and the baseline models on a large-scale compound-protein interaction dataset generated from BindingDB [3]. Of its 218 615 compound-protein pairs, 153 031 were used as training samples, 21 861 as validation samples and 43 723 as test samples. To conduct a fair comparison, we evaluated the CPA prediction performance of the models by averaging the prediction results obtained over 10 independent training processes on this dataset. In contrast to the computer vision and natural language processing fields, the data in the biotechnology field are more variable: the diversity of data in different datasets and the composition of data pairs may greatly change the performance of a model. As shown in Table 1, FeatNN outperformed the SOTA baseline with the best RMSE (0.765), Pearson correlation coefficient (0.850) and R2 value (0.719).

Table 1

Performance evaluation of different prediction approaches on the dataset generated from BindingDB

Model         R2 ↑            RMSE ↓          Pearson ↑
FeatNN        0.719 (0.003)   0.765 (0.004)   0.850 (0.001)
MONN          0.706 (0.004)   0.783 (0.005)   0.844 (0.002)
BACPI         0.577 (0.005)   0.935 (0.006)   0.769 (0.002)
GATGCN        0.543 (0.015)   0.992 (0.016)   0.742 (0.012)
GCNNet        0.510 (0.023)   1.030 (0.023)   0.717 (0.015)
GINConvNet    0.451 (0.124)   1.080 (0.119)   0.669 (0.094)
GATConvNet    0.327 (0.027)   1.200 (0.024)   0.585 (0.001)

We apply RMSE, Pearson and R2 to evaluate the CPA prediction performance. The results of each group were obtained from 10 independent experiments. The mean value (and SD) of each independent experimental group is shown in the table. Note: SIGN is highly dependent on the structure information of the complex and binding pockets, whereas most structure information recorded in BindingDB is redundant and of low quality (it lacks the pocket and binding site information needed to represent the complex graph as the input training data), which makes it difficult to process the data before training SIGN. Therefore, we did not train SIGN on BindingDB.


Applying a pretraining strategy enhanced the performance of FeatNN

First, to assess the generalization ability of FeatNN (details in Supplementary Methods 3.6 available online at http://bib.oxfordjournals.org/), we constructed an independent test dataset with high-quality paired information from a third database, Binding MOAD (the details of the generation of this dataset are provided in Supplementary Table 1 available online at http://bib.oxfordjournals.org/). As shown in Supplementary Figure 6 available online at http://bib.oxfordjournals.org/, the generalization ability of FeatNN depended strongly on the amount of paired information in the training dataset. When trained on the general PDBbind dataset, FeatNNgeneral showed superior generalization performance, outperforming the SOTA baselinegeneral by 4.57 and 5.72% in the Pearson coefficient on the IC50 and KIKD measurement datasets constructed from Binding MOAD, respectively (Supplementary Figures 6, 7 and Supplementary Tables 6, 8 available online at http://bib.oxfordjournals.org/). However, the models trained on the refined PDBbind dataset (both FeatNNrefine and the SOTA baselinerefine), despite its higher data quality, showed considerably lower generalization ability than the corresponding models trained on the general PDBbind dataset (FeatNNgeneral and the SOTA baselinegeneral) (Supplementary Figure 6, Supplementary Table 6 available online at http://bib.oxfordjournals.org/), with R2 decreasing by 62.95 and 93.10% for FeatNN and the SOTA baseline, respectively, possibly because of the limited amount of paired information used in the training process.

To further enhance the performance of FeatNN, we trained FeatNNoptm by applying a pretraining strategy [42], in which FeatNN was first warmed up on the dataset generated from BindingDB, whose structure data are of relatively low quality (Figure 3A, Supplementary Methods 3.7 available online at http://bib.oxfordjournals.org/). Because CPA prediction on PDBbind and on BindingDB is the same type of task, we hypothesized that the compound extractor parameters learned from one dataset would generalize and transfer well to the other. To test this hypothesis, we assessed whether the performance of FeatNN on the PDBbind dataset could be improved by this parameter transfer strategy. The compound extractor parameters learned from BindingDB were first frozen. The protein extractor and affinity learning module were then initialized with the ‘knowledge’ (parameters) learned from BindingDB and fine-tuned over multiple rounds of training on the datasets generated from PDBbind to obtain FeatNNoptm (Figure 3A). As a result, the RMSE, Pearson coefficient and R2 of FeatNNoptm on the PDBbind test dataset improved by 3.29, 1.93 and 5.47%, respectively (Figure 3B and Supplementary Table 7 available online at http://bib.oxfordjournals.org/), suggesting the excellent transferability of FeatNN to different datasets. Interestingly, the generalization ability of FeatNNoptm was further enhanced by 2.04 and 5.79% in the Pearson coefficient and R2, respectively, compared with that of FeatNN trained directly on PDBbind (Supplementary Figure 7 and Supplementary Table 8 available online at http://bib.oxfordjournals.org/).
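The freeze-then-fine-tune idea described above can be illustrated with a framework-free sketch. All names here (`sgd_step`, the module keys and the numerical values of the parameters and gradients) are hypothetical and serve only to show that frozen modules keep their pretrained values while the remaining modules start from the pretrained values and continue to be updated; this is not the training code of FeatNN.

```python
def sgd_step(params, grads, frozen, lr=0.1):
    """Apply one gradient-descent update to every module not listed in `frozen`."""
    updated = {}
    for module, weights in params.items():
        if module in frozen:
            # Frozen module: pretrained weights are carried over unchanged
            updated[module] = list(weights)
        else:
            # Trainable module: starts from pretrained weights and is updated
            updated[module] = [w - lr * g for w, g in zip(weights, grads[module])]
    return updated


# Hypothetical parameters obtained from pretraining on the larger dataset
pretrained = {
    "compound_extractor": [0.5, -0.2],
    "protein_extractor": [0.1, 0.3],
    "affinity_module": [0.7],
}

# Hypothetical gradients from one fine-tuning batch on the smaller dataset
grads = {
    "compound_extractor": [1.0, 1.0],
    "protein_extractor": [0.5, -0.5],
    "affinity_module": [2.0],
}

fine_tuned = sgd_step(pretrained, grads, frozen={"compound_extractor"})
print(fine_tuned["compound_extractor"] == pretrained["compound_extractor"])  # True
```

In a deep learning framework such as PyTorch, the same effect is commonly obtained by setting `requires_grad = False` on the parameters of the frozen module before building the optimizer.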

Figure 3

The performance of FeatNN is greatly improved after optimization with a fine-tuning strategy. (A) To optimize the performance of FeatNN, the parameters of the compound extractor obtained from the warm-up (pretraining) strategy on BindingDB are frozen, and the protein extractor module and affinity learning module are then fine-tuned on PDBbind to obtain FeatNNoptm. (B) The RMSE, Pearson coefficient and R2 of FeatNN with the fine-tuning strategy (FeatNNoptm) improved by 3.29, 1.93 and 5.47%, respectively, compared with those of the FeatNN version trained directly on PDBbind-v2020. FeatNN: original FeatNN trained on PDBbind. FeatNNoptm: FeatNN optimized with a fine-tuning strategy. The results for each group were obtained from 10 independent experiments with a 5-fold cross-validation strategy. The mean value, upper and lower quartiles and SD of each independent experiment group are shown in Figure 3B. Box plots: boxes depict the upper and lower quartiles of the data, and the vertical line in each box indicates the median of the group.

The functionality-based interpretation of the FeatNN module

To elucidate the function of each block in FeatNN, we assessed the performance of FeatNN after ablating the blocks (Supplementary Methods 3.8 available online at http://bib.oxfordjournals.org/) that were specifically designed to elevate its performance (for details, see the Materials and Methods section). The results shown in Figure 4 demonstrate that a variety of components contribute significantly to the accuracy of FeatNN in CPA prediction. For instance, without Evo-Updating, the robustness and prediction accuracy of FeatNN deteriorated by approximately 14.34% in the RMSE, 11.60% in the Pearson coefficient and 31.25% in R2, emphasizing the significance of the coevolutionary strategy in the protein extractor. Strikingly, when the oversmoothing problem was not addressed in the deep GCN block, the prediction accuracy deteriorated by approximately 15.22% in the RMSE, 15.61% in the Pearson coefficient and 36.33% in R2. In addition, the master node in the deep GCN block, which represents the global information of each compound and communicates with remote graph nodes through the graph warp unit (Figure 4 and Supplementary Table 9 available online at http://bib.oxfordjournals.org/), also contributed significantly to the accuracy of CPA prediction, highlighting the importance of interactively updating the global and local features and of addressing the oversmoothing problem when representing the information of compounds. More importantly, the R2 of the FeatNN versions that used only protein sequence information or only structure information (DDM and torsion matrix) declined markedly, by approximately 36.52 and 69.34%, respectively, compared with that of the intact FeatNN baseline (Figure 4 and Supplementary Table 9 available online at http://bib.oxfordjournals.org/), emphasizing the importance of introducing the coevolutionary strategy to jointly aggregate and update the sequence and structure information of proteins.
We also ablated the compound-protein interactive matrix in the affinity learning module, which helps FeatNN to represent and learn the interaction information between the compound and the protein, and found that R2 declined by 38.09% (Figure 4), supporting the rationale for learning interaction features through the compound-protein interactive matrix. In addition, we ablated the torsion-related architecture and found that R2 declined by 13.48% (Figure 4 and Supplementary Table 9 available online at http://bib.oxfordjournals.org/), highlighting the necessity of introducing torsion information into FeatNN.
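The percentage changes quoted in this section are relative changes with respect to a reference model. A minimal sketch of that arithmetic is given below, assuming the usual definition (difference divided by the reference value); the function name `relative_change_percent` is hypothetical. Because the absolute metric values of the ablated variants are reported only in the Supplementary material, the example uses the FeatNN and MONN values from Table 1 instead.

```python
def relative_change_percent(reference, value):
    """Relative change of `value` with respect to `reference`, in percent."""
    return (value - reference) / reference * 100.0


# Values taken from Table 1 (BindingDB): MONN as reference, FeatNN as value
r2_gain = relative_change_percent(reference=0.706, value=0.719)
rmse_change = relative_change_percent(reference=0.783, value=0.765)

# A positive change in R2 and a negative change in RMSE both indicate improvement
print(f"R2: {r2_gain:+.2f}%  RMSE: {rmse_change:+.2f}%")
```

Note that the sign convention matters: for error metrics such as the RMSE, a negative relative change is an improvement, whereas for R2 and the Pearson coefficient a positive change is an improvement.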

Figure 4

Essential block ablation results of FeatNN. Ablation results of FeatNN on the dataset generated from PDBbind, emphasizing the functionality of the essential blocks of FeatNN. The accuracy and robustness of FeatNN in terms of CPA prediction decline dramatically without the Evo-Updating block or torsion information, which function as the core of protein feature extraction. Addressing the oversmoothing problem in the deep GCN block also remarkably increases the ability of the compound extractor to extract features from compounds, which in turn enhances the CPA prediction accuracy of the overall model. In addition, introducing the master node into the network to learn the global information of compounds is also important. The performance of the FeatNN versions that use only protein sequence information or only structure information also declines remarkably compared with that of the entire FeatNN baseline, suggesting the importance of applying the coevolutionary strategy to interactively represent and update the features of both sequence and 3D protein structure information. Furthermore, ablating the compound-protein interactive matrix causes a significant decline in the performance of FeatNN, indicating the importance of learning the interaction features between the protein and the compound. The results for each group were obtained from 10 independent experiments with a 5-fold cross-validation strategy. The mean value, upper and lower quartiles and SD of each independent experimental group are depicted in Figure 4. Box plots: boxes depict the upper and lower quartiles of the data, and the vertical line in each box indicates the median of the group. Abbreviations: Info: information.

The interpretation of information flows in FeatNN

To understand how information flows in the deep GCN block, the Evo-Updating block and the affinity learning module, we visualized the original features in the intermediate layers of FeatNN (Supplementary Figure 8 available online at http://bib.oxfordjournals.org/). Because it is difficult to show the information transformation process directly from the original features, we applied t-distributed stochastic neighbor embedding (t-SNE) [43], a dimensionality reduction algorithm for high-dimensional data, to obtain a clear two-dimensional view of the data distribution (Figure 5). As shown in Figure 5, the atom features became more aggregated as the GCN layers deepened. This observation illustrates how node information flows through the layers and how each node aggregates the features of its neighbor nodes through the message passing mechanism [44] in the deep GCN block (Figure 5A). In the Evo-Updating block, embedded sequence features and structure features were obtained from the Prot-Aggregation block; the sequence features and structure features then partially updated each other while each retained part of its own information (Figure 5B). As the Evo-Updating layers deepened, the difference between the sequence features and structural features gradually lessened, and the layers fused more multimodal information. Additionally, we extracted the compound and protein features learned in each layer of the deep GCN block and Evo-Updating block, respectively, for dimension reduction analysis (Figure 5C and D). The distributions of the compound features learned in each layer of the deep GCN block are clearly illustrated (Figure 5C and D).
We found that the features aggregated by the first three layers of the block have a certain degree of similarity, whereas the distribution of compound features tends to be more distinguishable in the deep layers of the GCN block (Figure 5C), which might enable FeatNN to learn the precise features of the compound and address the notorious oversmoothing problem (Figure 4 and Supplementary Figure 11A–C available online at http://bib.oxfordjournals.org/). In the Evo-Updating block, the protein structural features and sequence features learned in the same layer remained close to each other in the feature space (Figure 5D). More interestingly, both the sequence and structural features learned in the deep layer of the block were updated along the same direction (evolution) through this coevolutionary strategy, which efficiently represents the multimodal information of proteins and ultimately benefits the CPA prediction accuracy (Figure 4).
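The aggregation behavior of message passing, and the oversmoothing it can cause in deep stacks, can be illustrated with a toy example. The sketch below is a deliberately simplified assumption: it uses plain mean aggregation over a four-atom chain with scalar features, and a hypothetical mixing weight `alpha` that adds part of the initial features back in each layer. It is not the deep GCN block of FeatNN, but it shows why repeated neighbor averaging makes node features converge and how retaining initial information keeps them distinguishable.

```python
def mean_aggregate(features, neighbors):
    """One message-passing step: each node averages itself and its neighbors."""
    return [
        (features[i] + sum(features[j] for j in neighbors[i])) / (1 + len(neighbors[i]))
        for i in range(len(features))
    ]


def propagate(features, neighbors, n_layers, alpha=0.0):
    """Stack n_layers of aggregation.

    alpha > 0 mixes a fraction of the initial features back in at every
    layer (a simplified stand-in for a residual connection to the input).
    """
    h = list(features)
    for _ in range(n_layers):
        agg = mean_aggregate(h, neighbors)
        h = [(1 - alpha) * a + alpha * x0 for a, x0 in zip(agg, features)]
    return h


def spread(values):
    """Range of the node features; a small spread means the nodes look alike."""
    return max(values) - min(values)


# Toy 4-atom chain 0-1-2-3 with hypothetical scalar atom features
neighbors = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
x0 = [0.0, 1.0, 2.0, 3.0]

plain = propagate(x0, neighbors, n_layers=6)
residual = propagate(x0, neighbors, n_layers=6, alpha=0.2)

print(spread(plain) < spread(x0))        # True: features aggregate with depth
print(spread(residual) > spread(plain))  # True: initial information limits smoothing
```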

Figure 5

Information flows in FeatNN’s deep GCN and Evo-Updating blocks. (A) Visualization of the compound information aggregation process in the deep GCN block. (B) Visualization of the coevolutionary process between the protein sequence and structure information in the Evo-Updating block. (C) t-SNE dimensionality reduction analysis of deep GCN block (6 layers). (D) t-SNE dimensionality reduction analysis of Evo-Updating block (2 layers). Abbreviations: EU L1 or L2: Evo-Updating Layer1 or Layer2. GCN L1 or L2: GCN block Layer1 or Layer2. Struct L1 or L2: Structure features in EU L1 or L2. Seq L1 or L2: Sequence features in EU L1 or L2. Embedded Sequence Info: sequence features obtained from the Prot-Aggregation block. Embedded Structure Info: structure features obtained from the Prot-Aggregation block. Initial atom features: atom features obtained from the graph embedding.

FeatNN outperformed the SOTA baseline in virtual drug evaluation tasks

To verify the feasibility of using FeatNN in practice [38, 45], we initially selected the ‘SARS-CoV-2 3C-like protease’ as the drug target (receptor), which is a verified target for developing drugs against SARS-CoV-2 [46]. We selected, without bias, 28 bioactive small molecules [46–60] from published research and the DrugBank database (listed in Supplementary Table 10 available online at http://bib.oxfordjournals.org/; note that these molecules related to the target were present in neither PDBbind nor BindingDB). The process of receptor-based affinity value prediction with FeatNN is shown in Figure 6A. As the receptor, we selected a ligand-free protein structure of the SARS-CoV-2 3C-like protease with the PDB ID 7CWC. Strikingly, FeatNN reached a Pearson coefficient of 0.612 in this CPA prediction task (Figure 6B), whereas the SOTA baseline (MONN) obtained a Pearson coefficient of 0.402 (Figure 6C), suggesting the outstanding performance of FeatNN in searching for potential drug candidates from a massive database.

Figure 6

Affinity prediction results of FeatNN and the SOTA baseline in practice. (A) Receptor-based virtual drug evaluation task: for the two receptors, the SARS-CoV-2 3C-like protease and Akt-1, related bioactive compounds were selected without bias (Supplementary Tables 10 and 11 available online at http://bib.oxfordjournals.org/) from published research and the DrugBank database to test the affinity prediction precision and generalization ability of FeatNN. Targeting the 3CL protease: (B) the affinity prediction of 28 validated bioactive compounds by FeatNN results in a Pearson coefficient of 0.612; (C) the affinity prediction of 28 validated bioactive compounds by MONN results in a Pearson coefficient of 0.402. Targeting Akt-1: (D) the affinity prediction of 10 validated bioactive compounds by FeatNN results in a Pearson coefficient of 0.735; (E) the affinity prediction of 10 validated bioactive compounds by MONN results in a Pearson coefficient of 0.551. Note: The above experiments show that MONN serves as the SOTA baseline on the datasets generated from both the PDBbind and BindingDB databases, which is why we used only MONN as a representative baseline model for testing. The structure conformations of the 3CL protease and Akt-1 were extracted from the PDB files with the PDB IDs 7CWC and 3O96, respectively. Each point is the average of 15 independent experiments.

In addition, to verify the robustness of FeatNN, we repeated the prediction task multiple times and analyzed the results statistically (Figure 6B). Nonetheless, a concern remained regarding the multimodality-based FeatNN model: the prediction results obtained with different 3D protein structure conformations might vary. To assess this possibility, we selected ligand-free protein conformations of the SARS-CoV-2 3C-like protease from three PDB files (PDB IDs 7CWC, 7CWB and 7BAJ; Supplementary Figure 9A available online at http://bib.oxfordjournals.org/) as receptors for CPA prediction with FeatNN (Figure 6B and Supplementary Figure 9B and C available online at http://bib.oxfordjournals.org/). Remarkably, the CPA predictions for the 28 validated compounds remained robust, with Pearson coefficients of 0.606 and 0.607, indicating that the prediction results obtained with FeatNN are stable across different target conformations (Figure 6B and Supplementary Figure 9B and C available online at http://bib.oxfordjournals.org/). To verify the feasibility of using FeatNN on different targets, we additionally chose a target named Akt-1 (PDB ID: 3O96), which is a critical receptor for the transmission of growth-promoting signals and resisting cancer [52]. In this experiment, 10 previously reported drugs (Supplementary Table 11 available online at http://bib.oxfordjournals.org/) that target Akt-1 [61–70] were selected for the virtual drug evaluation task, and FeatNN achieved a higher Pearson coefficient (0.735) than the SOTA baseline (0.551) in the CPA prediction task (Figure 6D and E).
With different Akt-1 conformations (PDB IDs 6HHJ, 3MV5 and 3CQW and the ligand-free conformation predicted by AlphaFold2 [28]), the Pearson coefficient also remained stable (Figure 6D, Supplementary Figure 10 and Supplementary Table 11 available online at http://bib.oxfordjournals.org/), indicating the robustness and reliable prediction ability of FeatNN in virtual drug evaluation tasks with different targets.

Discussion

The FeatNN model proposed in this study introduces a coevolutionary strategy to effectively represent multimodal protein features. Through a t-SNE visualization analysis and a module ablation study, we showed from an interpretability perspective that the protein sequence and structure features were jointly updated and aggregated, which ultimately benefited the CPA prediction accuracy of our approach. We found that the Evo-Updating block and deep GCN block in FeatNN function as the key components for aggregating and updating the features of proteins and compounds, respectively (Figure 4), emphasizing the significance of applying the coevolutionary strategy in protein feature extraction. Altogether, FeatNN learns efficiently from a limited data resource yet is still able to cope with the complexity of structure data and achieve outstanding performance.

Although it is theoretically appealing to introduce the structural information of proteins into a CPA prediction model, several obstacles had to be overcome in the development of FeatNN. First, we addressed the oversmoothing problem [39] by introducing a specific residual connection in each layer of the GCN, which adds part of the initial information of the molecular graph to the current layer [71, 72]; as a result, the ability of the model to extract compound features was enhanced as the layers deepened (Supplementary Figure 11A–C available online at http://bib.oxfordjournals.org/). Second, in the deep GCN block, a master node was employed to learn the global features during the training process, thus facilitating communication among remote nodes. Third, the protein distance matrix was discretely encoded to overcome the information overload problem of the traditional continuous distance matrix. As a result, FeatNN greatly outperformed the SOTA model in the generalization tests on an independent database and in the affinity value prediction tasks targeting the ‘SARS-CoV-2 3C-like protease’ and ‘Akt-1’, indicating that FeatNN can be a powerful tool for advancing the drug development process.
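To make the idea of a discretely encoded distance matrix concrete, the following sketch bins pairwise residue distances into a small number of classes. The coordinates, the bin edges (4, 8 and 12 Å) and the function names are hypothetical choices for illustration only; the actual encoding scheme used in FeatNN is described in the Materials and Methods and may differ.

```python
import bisect
import math


def pairwise_distances(coords):
    """Continuous distance matrix between 3D points (e.g. residue positions)."""
    return [[math.dist(a, b) for b in coords] for a in coords]


def discretize(dist_matrix, bin_edges):
    """Replace each distance by the index of its bin.

    With edges [e0, e1, ...], bin 0 holds d <= e0, bin 1 holds e0 < d <= e1,
    and the last bin holds every distance beyond the final edge.
    """
    return [[bisect.bisect_left(bin_edges, d) for d in row] for row in dist_matrix]


# Hypothetical residue coordinates in angstroms
coords = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0), (20.0, 0.0, 0.0)]
bin_edges = [4.0, 8.0, 12.0]  # hypothetical bin boundaries

ddm = discretize(pairwise_distances(coords), bin_edges)
for row in ddm:
    print(row)
```

Binning turns a matrix of real-valued distances into a matrix of a few integer classes, which can then be embedded like categorical tokens instead of being fed to the network as raw continuous values.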

Nevertheless, owing to the scarcity of precise noncovalent interaction binding site data between the ligand and the binding pocket, and to the imbalance between the few positive and predominantly negative binding-site data, FeatNN currently has difficulty interpreting the CPA prediction results at the interaction level. Traditional methods such as upsampling and gradient penalty still cannot resolve this imbalance problem without sufficient binding interaction data [73]. Docking simulation combined with AI may be able to interpret the results predicted by AI models at the interaction level [20], which may be a new research direction for the further development of FeatNN in our future work. Moreover, 3D structural information is relevant not only to proteins but also to compounds [25, 74]. In this study, we introduced only the protein structural information, and experiments to additionally introduce compound geometry information are ongoing [75]. Theoretically, the strategy developed for protein feature extraction in our model could also be utilized to extract the geometric information of compounds. Introducing both protein and compound structure features into our model to further enhance its performance is appealing, given that applying only the protein structure features in this study has already achieved a remarkable result. Other protein properties, such as the residue types of binding ligands, secondary structures and physicochemical characteristics, are also very important features, and incorporating them into our model might further improve its performance. However, the challenge of representing these features with a rational method or within an interpretable architecture is left to future studies.

Limitations

(i) The training of a deep learning model depends strongly on the training data. In practice, if the compounds or proteins encountered differ substantially from the data in the training set, the confidence in the prediction results will be greatly reduced. (ii) Furthermore, because the architecture of FeatNN depends heavily on the 3D structure of the protein, some proteins cannot be characterized owing to residue continuity defects in their PDB files and must be discarded. The amount of training data is therefore reduced, although this does not significantly affect the performance of FeatNN. (iii) Even though FeatNN can achieve improved precision and generalization ability in CPA prediction while ignoring the information regarding the binding pose between the ligand and the binding pocket, it is difficult for FeatNN to interpret the CPA prediction results at the interaction level because of the scarcity and imbalance of precise noncovalent interaction data between the ligand and the binding pocket.

Conclusion

The proposed FeatNN model introduces a torsion matrix and a distance matrix in its protein extractor module and utilizes a deep GCN block with a master node in its compound extractor module to predict the affinity of a given compound-protein pair. The experimental results of our study showed that FeatNN outperformed the SOTA baseline by a significant margin, and the feasibility of applying FeatNN to lead compound screening was also verified. This approach demonstrates great potential for reducing the considerable time and expense involved in drug candidate screening experiments and provides an interpretable architecture based on biology databases.

Author Contributions

B.G., H.Z. and H.J. contributed equally to this work. X.W., B.G. and H.Z. conceptualized and designed the study. B.G., H.Z. and H.J. conducted the experiments and collected the data. B.G., H.Z., H.J., X.W., H.Y., X.L., N.G. and Y.Z. analyzed and interpreted the data. B.G., H.Z. and X.W. drafted the paper. All authors critically revised the manuscript and approved the final version for submission.

Key Points
  • We apply both 3D protein structure and sequence information with a coevolutionary strategy.

  • We addressed the oversmoothing problem in graph representation of compounds.

  • FeatNN achieved highly enhanced affinity prediction on well-known databases compared with the state-of-the-art methods.

  • Generalization ability and feasibility of FeatNN are superior to the SOTA baseline both on the datasets generated from the Binding MOAD database and the virtual drug evaluation tasks targeting the receptor of the SARS-CoV-2 3CL protease and Akt-1.

Acknowledgement

We would like to thank Professor Jie Yang for his feedback and advice on writing this paper.

Funding

The Scientific and Technological Innovation 2030 Program of China—major projects (2021ZD0200408 to X.W.), the National Natural Science Foundation of China (81971866 to X.W.), the Natural Science Foundation of Zhejiang Province (LR20H090002 to X.W.), the Leading Innovative and Entrepreneur Team Introduction Program of Zhejiang (2019R01007 to X.W.) and the Fundamental Research Funds for the Central Universities (K20210195 to X.W.).

Data availability

The data that support the findings of this study are included in the paper, and further data are available from the corresponding author upon reasonable request.

Code and data availability

The source code, the original data for the figures and the datasets (generated from the PDBbind and BindingDB databases) used to train and test the models are available at: https://github.com/StarLight1212/FeatNN. The data in this paper are available at https://drive.google.com/file/d/12Z9AwrAfYto4-2JplLBxo-KGhMQ3314a/view?usp=share_link.

Author Biographies

Binjie Guo is a Ph.D candidate student at the Department of Neurobiology and Department of Rehabilitation Medicine, First Affiliated Hospital, Zhejiang University School of Medicine, Zhejiang University, Hanzhou, China; Liangzhu Laboratory, MOE Frontier Science Center for Brain Science and Brain-machine Integration, State Key Laboratory of Brain-machine Intelligence, Zhejiang University, Hanzhou, China; NHC and CAMS Key Laboratory of Medical Neurobiology, Zhejiang University, Hangzhou, China; Co-innovation Center of Neuroregeneration, Nantong University, Nantong, Jiangsu, China. His research interests include screening and design of Drugs and drug delivery systems powered by artificial intelligence technologies for the treatment of central nervous system diseases.

Hanyu Zheng is a Ph.D. candidate at the School of Medicine, Zhejiang University, Hangzhou, China. Her research interests include regenerative medicine and the design of gene delivery systems powered by artificial intelligence technologies.

Haohan Jiang is a Ph.D. candidate at the School of Medicine, Zhejiang University, Hangzhou, China. His research interests include gene therapy and the design of gene delivery systems powered by artificial intelligence technologies.

Xiaodan Li is a postdoctoral associate at the School of Medicine, Zhejiang University. Her research interests focus on protein-based drugs and gene therapy for central nervous system diseases.

Naiyu Guan is a postdoctoral associate at the School of Medicine, Zhejiang University, Hangzhou, China. Her research interests include construction of brain organoids and design of biocompatible porous micro-nano materials for the treatment of spinal cord injury.

Yanming Zuo is a postdoctoral associate at the Liangzhu Laboratory, MOE Frontier Science Center for Brain Science and Brain-machine Integration, State Key Laboratory of Brain-machine Intelligence, Zhejiang University, Hangzhou, China. His research interests include design of nanodrugs and bio-compatible hydrogel materials for the treatment of stroke and spinal cord injury.

Yicheng Zhang is an undergraduate student at the School of Computer Science and Technology, Zhejiang Sci-Tech University, Hangzhou, China. His research focuses on the application of deep learning technologies in biomedicine.

Hengfu Yang is a professor at the School of Computer Science, Hunan First Normal University, Changsha, China. His research interests include image processing, information security, privacy data protection and artificial intelligence technologies in biomedicine.

Xuhua Wang is a professor at the Department of Neurobiology and Department of Rehabilitation Medicine, First Affiliated Hospital, Zhejiang University School of Medicine, Zhejiang University, Hangzhou, China; Liangzhu Laboratory, MOE Frontier Science Center for Brain Science and Brain-machine Integration, State Key Laboratory of Brain-machine Intelligence, Zhejiang University, Hangzhou, China; NHC and CAMS Key Laboratory of Medical Neurobiology, Zhejiang University, Hangzhou, China; Co-innovation Center of Neuroregeneration, Nantong University, Nantong, Jiangsu, China. His research interests include artificial intelligence-powered drug discovery, drug and gene delivery systems, tissue engineering and regenerative medicine, and brain-computer interface technologies.

Author notes

Binjie Guo, Hanyu Zheng and Haohan Jiang contributed equally to this work.

This article is published and distributed under the terms of the Oxford University Press, Standard Journals Publication Model (https://academic.oup.com/journals/pages/open_access/funder_policies/chorus/standard_publication_model).