Computational prediction of drug–target interactions using chemogenomic approaches: an empirical survey
Ali Ezzat, Min Wu, Xiao-Li Li, Chee-Keong Kwoh
Briefings in Bioinformatics, Volume 20, Issue 4, July 2019, Pages 1337–1357, https://doi.org/10.1093/bib/bby002
Abstract
Computational prediction of drug–target interactions (DTIs) has become an essential task in the drug discovery process. It narrows down the search space for interactions by suggesting potential interaction candidates for validation via wet-lab experiments, which are well known to be expensive and time-consuming. In this article, we aim to provide a comprehensive overview and empirical evaluation of computational DTI prediction techniques, to serve as a guide and reference for our fellow researchers. Specifically, we first describe the data used in such computational DTI prediction efforts. We then categorize and elaborate on the state-of-the-art methods for predicting DTIs. Next, an empirical comparison is performed to demonstrate the prediction performance of some representative methods under different scenarios. We also present interesting findings from our evaluation study, discussing the advantages and disadvantages of each method. Finally, we highlight potential avenues for further enhancement of DTI prediction performance as well as related research directions.
Introduction
In silico prediction of interactions between drugs and their target proteins is desirable, as it effectively complements wet-lab experiments, which are typically costly and laborious. Newly discovered drug–target interactions (DTIs) are critical for identifying novel targets that interact with existing drugs, as well as new drugs that target certain disease-associated genes.
Drug repositioning, for instance, is the reuse of existing drugs for novel indications; that is, existing drugs may be used to treat diseases other than those they were originally developed for [1]. As existing drugs have already been extensively studied (e.g. their bioavailability and safety profiles), repositioning them significantly reduces costs and accelerates the drug discovery process, which has made drug repositioning a popular strategy for drug discovery [2]. One famous example of a repositioned drug is Gleevec (imatinib mesylate), which was originally thought to interact only with the Bcr-Abl fusion protein associated with leukemia. However, Gleevec was later found to also interact with PDGFR and KIT, eventually leading it to be repositioned to treat gastrointestinal stromal tumors as well [3, 4]. This is one of many drug repositioning success stories in the literature [5–10]. As the Gleevec example demonstrates, a drug’s promiscuity (i.e. interaction with multiple targets) may contribute to its polypharmacology (i.e. having multiple therapeutic effects), which is clear motivation for attempting to discover new DTIs for existing drugs.
On the other hand, a large number of small-molecule compounds have not yet been used as drugs and, for the majority of them, their interaction profiles with proteins are still unknown. For example, the PubChem database currently houses >90 million compounds, most of which have unknown interaction profiles [11]. Detecting interactions (with disease-associated genes and target proteins) for these compounds would be useful for developing new drugs, as it would help narrow down prospective drug candidates to work with in the drug discovery process [12]. Moreover, detecting such interactions may provide insight by uncovering off-targets that can cause undesirable side effects [13]. Therefore, prediction of DTIs is of great importance; it is essential for drug repositioning, assists with drug candidate selection and helps detect side effects in advance.
While experimental wet-lab techniques exist for identifying such interactions, they involve tedious and time-consuming work. This is where computational methods prove useful, as they may be used to efficiently predict potential interaction candidates with reasonable accuracy, thus narrowing down the DTI search space to be investigated by their wet-lab counterparts.
Currently, there are three major categories of computational methods for predicting DTIs. The first category is the ligand-based approaches, which leverage the concept that similar molecules tend to share similar properties and usually bind similar proteins [14]. In particular, they predict interactions using the similarity between the proteins’ ligands [15]. However, the prediction results of ligand-based approaches may become unreliable when the number of known ligands per protein is insufficient [16].
The second category is the docking approaches, which take the 3D structures of a drug and a protein and then run a simulation to determine whether they would interact [17–19]. However, there are proteins for which the 3D structure is not known, so docking cannot be applied to them. For example, many drug targets are membrane proteins [20] for which the prediction of the 3D structure is still challenging [21]. In addition, dealing with a receptor protein’s flexibility can be challenging, as a large number of degrees of freedom need to be considered in the calculations.
The third category is the chemogenomic approaches, which use information from both the drug and target sides simultaneously to perform prediction. An advantage of chemogenomic approaches is that they can work with widely abundant biological data to perform prediction. For example, the information used for prediction in [22] consisted of chemical structure graphs and genomic sequences for the drugs and targets, respectively, which are available and easy to obtain from publicly accessible online databases.
In this survey, we focus on reviewing the more popular third category, the chemogenomic methods. The survey starts by describing the kinds of data required to perform the prediction task as well as how they may be obtained. Next, we classify the chemogenomic methods into five types and aim to provide an overview of all the important prediction methods that belong to each of these types. Furthermore, we choose representative methods for each of the five types below, present a comprehensive comparison among them and discuss the advantages and disadvantages of these methods.
Neighborhood models. Neighborhood methods predict the interaction profile for a drug (or a target) based on its nearest neighbors’ interaction information.
Bipartite local models. Bipartite local models (BLMs) first perform two sets of predictions individually, namely, one from the drug side and one from the target side, and then aggregate these predictions to generate the final prediction scores for given drug–target pairs.
Network diffusion models. Network diffusion methods investigate graph-based techniques (e.g. Random Walk) for influence propagation in drug–target networks and predict novel DTIs.
Matrix factorization models. Matrix factorization first learns the latent feature matrices for drugs and targets from the DTI matrix, and then multiplies these two latent feature matrices to reconstruct the interaction matrix for prediction.
Feature-based classification models. Drug–target pairs in training data are represented as feature vectors, which are then fed into machine learning models [e.g. Random Forest, Support Vector Machines (SVMs)] for predicting novel interactions.
In previous surveys [23, 24], it was commonplace to separate the prediction methods into only two categories, similarity-based methods and feature-based methods. However, as researchers have proposed many more prediction methods since then, we further divide the similarity-based methods into four categories, each with its own characteristics. This new categorization of chemogenomic methods proved convenient and useful when representative methods were chosen from each category and compared with each other in cross-validation (CV) experiments, whose results are presented later in this study. Conclusions were drawn regarding the advantages and disadvantages of the prediction methods and their corresponding categories, which is useful information for practitioners as well as newcomers to the field.
Compared with previous reviews on this topic of DTI prediction [23–26], our survey is more comprehensive and up-to-date regarding the chemogenomic methods for predicting DTIs. In addition, we provide a novel categorization for the different chemogenomic approaches. Moreover, we describe the kinds of data that may be used in chemogenomic prediction tasks; however, note that we especially focus on listing software packages that generate features for representing drugs and targets (as opposed to online databases containing readily available information on DTIs). Furthermore, in the Supplementary Material of this study, we provide a comprehensive list of data sets that have been compiled by fellow researchers and used in previous work. We also perform an empirical comparison among various state-of-the-art methods from the different categories and discuss their advantages and limitations based on the results. One of the surveys, [23], also provided comparison results among different methods; however, as many new prediction methods have appeared since it was published in 2013, it is desirable to summarize more recent advanced methods. Finally, we discuss potential future trends as well as promising research directions that could be used to further improve DTI prediction.
In the recent review by Chen et al. [26], all online databases that store information on drugs and their targets were mentioned and described in detail (KEGG [27], DrugBank [28], etc.). Furthermore, a literature review on algorithms for DTI prediction was provided where the different algorithms are described and discussed. In addition to the algorithms, online Web servers for predicting interactions were described, and promising research directions in the field of drug discovery have been discussed as well. Our survey is similar to [26] in terms of reviewing the state-of-the-art methods and providing a list of potential future research directions. However, we provide a different categorization of the various prediction methods, and the future research directions proposed here differ from and complement those discussed in [26]. Finally, from the data perspective, while we do not focus on the databases from which data can be obtained, we make an effort here to list the different software packages that may be used to generate further descriptors for drugs and targets. We also provide, as Supplementary Material, a list of data sets that have been used in previous efforts in DTI prediction.
While targets exist in multiple forms, this survey primarily considers protein targets. As such, unless otherwise stated, all targets being referred to in this work are proteins.
The rest of this survey is organized as follows. ‘Data representation and types’ section first introduces the data for representing drugs, targets and their interactions. Then, ‘Methods’ section presents our novel categorization of the various prediction methods in detail. Next, ‘Empirical evaluation’ section demonstrates the empirical comparison results for various methods on benchmark data. Finally, ‘Avenues for improvement and further research’ section discusses future directions for DTI prediction.
Data representation and types
To train a classifier for predicting DTIs, a list of known DTIs is required. In other words, we want to predict which drugs and targets interact or not based on existing training data. Data for representing the drugs and targets involved are also needed. These required data are described in more detail below. Furthermore, we provide as Supplementary Material a complete list of publicly accessible data sets that have been used in previous efforts for predicting DTIs. An overview of a typical DTI prediction task is given in Figure 1.
Figure 1. Flowchart of a standard DTI prediction task using a chemogenomic prediction method.
Interaction data
Information on known DTIs needs to be gathered, as a classifier will be trained on these known interactions to predict the new interactions. Such information can be found in publicly available online databases that store information on drugs and their known targets. Examples of databases that have been used in previous work include KEGG [27], DrugBank [28], ChEMBL [29] and STITCH [30] (see [26] for an exhaustive list of such databases). The interaction data gathered from these databases are usually formatted into an interaction (adjacency) matrix between drugs and targets. This matrix corresponds to a bipartite graph where nodes represent drugs and targets, and edges connect drug–target pairs that interact.
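As a minimal illustration of this formatting step, the sketch below builds an interaction matrix from a list of known drug–target pairs; the pair list and identifiers are hypothetical placeholders.

```python
import numpy as np

# Hypothetical known interactions gathered from an online database.
known_pairs = [("D00001", "hsa:1956"), ("D00001", "hsa:2064"), ("D00002", "hsa:1956")]

drugs = sorted({d for d, _ in known_pairs})
targets = sorted({t for _, t in known_pairs})
d_index = {d: i for i, d in enumerate(drugs)}
t_index = {t: j for j, t in enumerate(targets)}

# Interaction (adjacency) matrix Y of the bipartite drug-target graph:
# Y[i, j] = 1 if drug i is known to interact with target j, else 0.
Y = np.zeros((len(drugs), len(targets)), dtype=int)
for d, t in known_pairs:
    Y[d_index[d], t_index[t]] = 1
```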
Drug and target data
Types of data that are available for drugs and may be used for training DTI classifiers include—but are not limited to—graphical representations of drugs’ chemical structures [31], side effects [32], Anatomical Therapeutic Chemical (ATC) codes [33] and gene expression responses to drugs [34]. Other forms of data may further be extracted from the chemical structure graphs of drugs including substructure fingerprints as well as constitutional, topological and geometrical descriptors among other molecular properties (e.g. via the Rcpi [35], PyDPI [36] or Open Babel [37] packages).
As for targets, available data that can be obtained include genomic sequences [38], Gene Ontology (GO) information [39], gene expression profiles [40], disease associations [41] and protein–protein interaction (PPI) network information [42, 43] among others. Further, information may also be extracted from protein sequences, including amino acid composition, CTD (composition, transition and distribution) and autocorrelation descriptors (e.g. via the PROFEAT Web server [44]).
Methods
Many (chemogenomic) DTI prediction methods have been developed over the past decade. We briefly describe them here and categorize them based on the techniques that they use for prediction. Table 1 summarizes our categorization of the different methods for predicting DTIs.
Table 1. Categories of the different methods for predicting DTIs

| Category | Methods | Category description |
|---|---|---|
| Neighborhood | Nearest Profile and Weighted Profile [22], SRP [45] | Neighborhood methods use relatively simple similarity functions to perform predictions |
| BLMs | Bleakley et al. [46], LapRLS [47], RLS-avg and RLS-kron [48], BLM-NII [49] | BLMs perform two sets of predictions, one from the drug side and one from the target side, and then aggregate these predictions to give the final prediction scores |
| Network diffusion | NBI [50], Wang et al. [51], NRWRH [52], PSL [53], DASPfind [54] | Network diffusion methods investigate graph-based techniques to predict new interactions |
| Matrix factorization | KBMF2K [55], PMF [56], CMF [57], WGRMF [58], NRLMF [59], DNILMF [60] | Matrix factorization finds two latent feature matrices that, when multiplied together, reconstruct the interaction matrix |
| Feature-based classification | He et al. [61], Yu et al. [62], Fuzzy KNN [63], Ezzat et al. [64], EnsemDT [65], SITAR [66], RFDT [78], PDTPS [81], ER-Tree [83], SCCA [84], MH-L1SVM [86] | Feature-based classification methods require drug–target pairs to be explicitly represented as fixed-length feature vectors |
In sections ‘Neighborhood’, ‘Bipartite local models’, ‘Network diffusion’ and ‘Matrix factorization’, the input data for the methods consist of an interaction matrix Y showing which drugs and targets interact, a similarity matrix Sd for drugs and a similarity matrix St for targets. In section ‘Feature-based classification’, the similarity matrices are replaced by drug and target feature matrices for representing the drugs and targets, respectively.
Neighborhood
Neighborhood methods use relatively simple similarity functions to perform predictions. More precisely, a new drug or target has its interaction profile predicted using its similarities to other drugs or targets, respectively; a new drug is one that has no known targets and, similarly, a new target is one that has no known interactions with any drugs.
Nearest Profile and Weighted Profile
In both of these methods, predictions from both the drug and target sides are averaged to obtain the final predictions.
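A minimal sketch of the Weighted Profile idea follows, assuming the interaction matrix Y and similarity matrices Sd and St introduced earlier; it implements the similarity-weighted averaging described in the text rather than any particular published code.

```python
import numpy as np

def weighted_profile(Y, Sd, St):
    """Similarity-weighted average of the neighbors' interaction profiles."""
    # Drug side: each drug's predicted profile is the average of all drugs'
    # known profiles, weighted by drug-drug similarity.
    pred_drug = (Sd @ Y) / Sd.sum(axis=1, keepdims=True)
    # Target side: done symmetrically using target-target similarities.
    pred_target = ((St @ Y.T) / St.sum(axis=1, keepdims=True)).T
    # Average the two sides to obtain the final prediction scores.
    return (pred_drug + pred_target) / 2
```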
Similarity Rank-based Predictor
In addition to the above score, which was computed using Sd, a similar corresponding score is also obtained using St, and then the two scores are averaged to give the final prediction score.
Bipartite local models
BLMs perform two sets of predictions, namely, one from the drug side and one from the target side, and then aggregate these predictions to give the final prediction scores for the potential interaction candidates.
SVM-based BLMs
This pioneering effort [46] introduced the concept of BLM where a local model is trained for each drug (or target) to predict which targets (or drugs) would interact with it. In the case of [46], the local models were SVM classifiers. Predictions from the drug and target sides are then averaged to get the final results.
Specifically, assuming a bipartite DTI network, the algorithm tries to predict whether the edge eij exists between drug di and target tj. The following steps are performed:
Ignoring tj, a classifier is trained for di using the list of its known interactions with other targets (positive examples) as well as the list of targets not known to interact with di (negative examples). Interactions are labeled as +1, whereas noninteractions are labeled as −1. The trained classifier is used to predict for eij.
Ignoring di, a classifier is trained for tj using the list of its known interactions with other drugs as well as the list of drugs not known to interact with tj. Interactions are labeled as +1, whereas noninteractions are labeled as −1. The trained classifier then predicts for eij.
Predictions from both the drug and target sides (i.e. from both classifiers) are aggregated (e.g. by averaging) to give the final prediction score for eij.
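A minimal sketch of this procedure is given below, using scikit-learn's SVC with a precomputed kernel (the similarity matrices serve as kernels) and averaging as the aggregation function; variable names are illustrative, and the implementation in [46] may differ in detail.

```python
import numpy as np
from sklearn.svm import SVC

def blm_predict_edge(Y, Sd, St, i, j):
    """Predict the score of edge (drug i, target j) via two local models."""
    # Drug-side local model: train on all targets except tj, labels from Y.
    # (Assumes di has at least one known target; otherwise see BLM-NII below.)
    mask_t = np.arange(Y.shape[1]) != j
    clf_d = SVC(kernel="precomputed")
    clf_d.fit(St[np.ix_(mask_t, mask_t)], np.where(Y[i, mask_t] == 1, 1, -1))
    score_d = clf_d.decision_function(St[j, mask_t].reshape(1, -1))[0]

    # Target-side local model: train on all drugs except di, symmetrically.
    mask_d = np.arange(Y.shape[0]) != i
    clf_t = SVC(kernel="precomputed")
    clf_t.fit(Sd[np.ix_(mask_d, mask_d)], np.where(Y[mask_d, j] == 1, 1, -1))
    score_t = clf_t.decision_function(Sd[i, mask_d].reshape(1, -1))[0]

    # Aggregate the two local predictions (here, by averaging).
    return (score_d + score_t) / 2
```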
Laplacian Regularized Least Squares
Regularized Least Squares
Bipartite Local Models with Neighbor-based Interaction Profile Inferring
BLM algorithms exploiting local models achieved decent performance for DTI prediction. However, they suffered from a notable limitation: they cannot train local models for drugs or targets that do not have any known interactions (i.e. new drugs or targets). To address this issue, Bipartite Local Models with Neighbor-based Interaction Profile Inferring (BLM-NII) [49], which is based on RLS-avg, introduces a preprocessing method denoted as NII to infer temporary interaction profiles for those novel drugs or targets.
Now that drug di does not have an empty profile, classifier training and prediction may proceed as per normal. The NII procedure is similarly applied to the target side wherever applicable, and predictions are obtained from the target side as well. Predictions from the drug and target sides are then aggregated as is typical in algorithms from the category of BLMs. The NII preprocessing procedure was found to improve prediction performance.
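A sketch of such neighbor-based profile inference is shown below; it uses a simple similarity-weighted average of the neighbors' profiles, which captures the spirit of NII although the exact weighting in [49] differs.

```python
import numpy as np

def infer_profile(Y, Sd, i):
    """Infer a temporary interaction profile for a new drug i from its neighbors."""
    w = Sd[i].copy()
    w[i] = 0.0                 # exclude the drug itself from the neighborhood
    return w @ Y / w.sum()     # similarity-weighted average of known profiles
```

The inferred soft profile then replaces drug i's empty row in Y during local-model training.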
Regularized Least Squares with Weighted Nearest Neighbors
Network diffusion
The network diffusion category includes methods that investigate graph-based techniques to predict new interactions; the network diffusion technique is predominant in this category, hence its name.
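As a concrete example of one such graph-based technique, below is a minimal random-walk-with-restart sketch of the kind used on heterogeneous DTI networks (e.g. in NRWRH, described below); the adjacency construction and restart probability are illustrative assumptions.

```python
import numpy as np

def random_walk_with_restart(A, seed, restart=0.7, tol=1e-8):
    """Random walk with restart over a network with adjacency matrix A."""
    col_sums = A.sum(axis=0, keepdims=True)
    M = A / np.where(col_sums == 0, 1, col_sums)   # column-stochastic transitions
    p0 = seed / seed.sum()                          # restart distribution
    p = p0.copy()
    while True:
        p_next = (1 - restart) * (M @ p) + restart * p0
        if np.abs(p_next - p).sum() < tol:
            return p_next       # steady-state visiting probabilities = scores
        p = p_next
```

For a heterogeneous network as in Figure 2, A could be assembled as the block matrix [[Sd, Y], [Yᵀ, St]] with the seed vector marking the query drug; target nodes are then ranked by their steady-state probabilities.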
Network-based inference
Heterogeneous graph inference
Figure 2. (A) Bipartite DTI network; (B) heterogeneous network that additionally includes drug and target pairwise similarities (the dashed lines).
Network-based Random Walk with Restart on the Heterogeneous network
Probabilistic Soft Logic
To predict new interactions, the above rules are applied wherever applicable on the DTI network (i.e. each of the rules is applied to its corresponding relations that exist in the network).
Furthermore, to avoid investigating the large number of all possible triad and tetrad relations, a technique called blocking is used beforehand where edges between pairs of drugs or targets (which correspond to pairwise similarities) are removed from the network if their weights (i.e. similarity values) are below some user-defined cutoff value.
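The blocking step amounts to thresholding the similarity matrices before constructing the network; a minimal sketch, with the cutoff as a user-defined value:

```python
import numpy as np

def block_similarities(S, cutoff=0.5):
    """Remove similarity edges whose weights fall below the cutoff."""
    S_blocked = S.copy()
    S_blocked[S_blocked < cutoff] = 0.0   # removed edges carry zero weight
    return S_blocked
```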
Determine All Simple Paths, Find Interactions
Matrix factorization
Matrix factorization takes an input matrix and tries to find two other matrices that, when multiplied together, approximate the input matrix. In the case of DTI prediction, the interaction matrix Y is factorized into two matrices A and B such that Y ≈ ABᵀ, where k is an adjustable parameter corresponding to the number of latent features in A and B, and k is much smaller than the numbers of drugs and targets.
Matrix factorization identifies latent features of drugs and targets in an unsupervised fashion, which is useful for collaborative filtering. For example, if the latent vectors of two drugs turn out to be similar, then it is likely that these drugs share many of the same interactions, thus allowing the transfer of interactions between them.
As we are looking for missing interactions in the matrix Y, matrix factorization can be used as a matrix completion technique (i.e. the abovementioned transfer of interactions between drugs themselves and targets themselves), which makes it a good fit for the DTI prediction problem. An illustration of matrix factorization is given in Figure 3.
Figure 3. Illustration of matrix factorization. The goal is to find two latent feature matrices, A and B, that reconstruct the interaction matrix Y when multiplied together.
Kernelized Bayesian Matrix Factorization with Twin Kernels
Kernelized Bayesian Matrix Factorization with Twin Kernels (KBMF2K) [55] is, to our knowledge, the first in a number of methods that use matrix factorization for predicting DTIs. It combines a Bayesian probabilistic formulation with the concept of matrix factorization to perform prediction. Specifically, it uses variational approximation to perform nonlinear dimensionality reduction, which improves efficiency in terms of computation time.
As there are too many algorithmic details to mention, we only provide a minimal overview of the algorithm here; Figure 4, which is inspired by a figure in [55], illustrates it.
Figure 4. Minimal representation of the KBMF2K algorithm, which predicts DTIs from drug and target kernels, Sd and St.
Assuming R is the chosen subspace dimensionality, the projection matrix Pd contains projection parameters, and Λd contains the corresponding priors. With the projection matrix Pd, the drug kernel matrix Sd is used to project the interactions (more precisely, the drug–target pairs) to a low-dimensional space (called the pharmacological space). This results in Gd, which consists of the low-dimensional representations of drugs in this space. The same is done to obtain Gt (using a projection matrix Pt), which consists of lower-dimensional representations of targets in that same space. Having obtained lower-dimensional representations of both drugs and targets in the same unified space, a prediction scores matrix F is obtained and presented as the predicted interaction matrix.
Probabilistic Matrix Factorization
Probabilistic Matrix Factorization (PMF) [56] is another matrix factorization method that uses probabilistic formulations as well. Specifically, it models interactions via ‘probabilistic linear models with Gaussian noise’. Unlike KBMF2K, however, it does not depend on or use similarity matrices between drugs or targets while performing prediction, and thus it achieves relatively lower performance than other matrix factorization techniques introduced here.
The first term on the right-hand side of the above equation is the squared-error function to be minimized, while the last two terms are extra Tikhonov regularization terms that are added to help avoid overfitting by preventing the latent features of A and B from assuming large values. The goal here is to find the two latent matrices A and B that maximize the log-likelihood presented above. Finally, the prediction scores matrix is obtained as Ŷ = ABᵀ.
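A minimal gradient-descent sketch of this kind of regularized squared-error factorization follows (learning rate, iteration count and initialization are illustrative assumptions; PMF's actual formulation is probabilistic):

```python
import numpy as np

def factorize(Y, k=10, lam=0.1, lr=0.01, iters=500, seed=0):
    """Fit Y ~ A @ B.T by minimizing squared error + Tikhonov regularization."""
    rng = np.random.default_rng(seed)
    A = rng.normal(scale=0.1, size=(Y.shape[0], k))  # drug latent features
    B = rng.normal(scale=0.1, size=(Y.shape[1], k))  # target latent features
    for _ in range(iters):
        E = Y - A @ B.T                  # residual of the reconstruction
        A_grad = -E @ B + lam * A        # gradient of the objective w.r.t. A
        B_grad = -E.T @ A + lam * B      # ... and w.r.t. B
        A -= lr * A_grad
        B -= lr * B_grad
    return A @ B.T                       # prediction scores matrix Y_hat
```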
Collaborative Matrix Factorization
where Md and Mt are the numbers of drug and target similarity matrices, respectively, and λ is a regularization parameter. ωd and ωt are weight vectors for linearly combining the drug and target similarity matrices, respectively. The fifth line in the above equation includes Tikhonov regularization terms for ωd and ωt, while the sixth line is a constraint that ensures the weights in each of ωd and ωt sum up to 1.
Weighted Graph Regularized Matrix Factorization
The weight matrix W here has the same role as in CMF; by setting Wij = 0 for unknown drug–target pairs (i.e. test set instances), they would not contribute toward the prediction of interactions. The weight matrix W is important, as, otherwise, these test instances would count as noninteractions (i.e. as negative instances) and may unfavorably affect predictions.
Neighborhood Regularized Logistic Matrix Factorization
Dual-Network Integrated Logistic Matrix Factorization
Dual-Network Integrated Logistic Matrix Factorization (DNILMF) [60] can be considered an extension of NRLMF. DNILMF additionally incorporates network-based similarity in a way that is similar to how it is done in RLS-avg and RLS-kron. Unlike RLS-avg and RLS-kron, however, all kernels (i.e. similarity matrices) undergo a kernel diffusion step beforehand. For a drug (or target) kernel, a local similarity matrix is generated by keeping the similarities to the nearest k neighbors for each drug (or target) while discarding the rest, and then the local similarity matrix is diffused with the global similarity matrix over a number of iterations.
As both NRLMF and DNILMF are based on logistic matrix factorization, their objective functions [from Equations (32) and (35)] resemble one another. However, while NRLMF uses graph regularization to make use of the similarity matrices Sd and St in prediction, DNILMF instead obtains the diffused kernels Kd and Kt, incorporates them into the logistic function and uses them in the objective function from Equation (35).
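A sketch of the kernel preprocessing described above is given below: each row keeps only its k nearest neighbors to form a local kernel, which is then iteratively diffused with the global kernel. This follows the textual description; the exact diffusion updates in DNILMF may differ.

```python
import numpy as np

def knn_sparsify(S, k=5):
    """Local similarity matrix: keep similarities to the k nearest neighbors."""
    local = np.zeros_like(S)
    for i in range(S.shape[0]):
        nn = np.argsort(S[i])[::-1][:k]   # indices of the k most similar items
        local[i, nn] = S[i, nn]
    # Row-normalize so each row's kept similarities sum to 1.
    return local / local.sum(axis=1, keepdims=True)

def diffuse(S_global, k=5, iters=3, alpha=0.5):
    """Diffuse the local kernel with the global kernel over a few iterations."""
    local = knn_sparsify(S_global, k)
    K = S_global.copy()
    for _ in range(iters):
        K = alpha * (local @ K @ local.T) + (1 - alpha) * S_global
    return (K + K.T) / 2                  # symmetrize the diffused kernel
```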
Feature-based classification
Feature-based classification methods are those that need drug–target pairs to be explicitly represented as fixed-length feature vectors. Given a drug feature vector and a target feature vector, the drug–target pair would typically be represented by the concatenation of the two vectors. In addition to the feature vector, each drug–target pair has a label to show whether it is a known interaction (i.e. positive class) or a noninteraction (i.e. negative class). With the feature vectors and labels, various supervised machine learning methods can thus be developed for predicting DTIs, as illustrated in Figure 5.
Figure 5. Illustration of how feature-based prediction models are created. In the training phase, feature vectors for the training instances (i.e. the drug–target pairs) are generated by concatenating the feature vectors of the involved drugs and targets. Along with their labels, the training instances are used to train the prediction model. In the testing phase, feature vectors are generated for the testing instances, and the prediction model (from the training phase) is used to predict for the testing instances.
Note that it is more accurate to refer to noninteractions as unlabeled pairs, as we do not know for certain whether these pairs are true noninteractions. Despite this, methods of this category commonly treat unlabeled pairs as if they were, in fact, true noninteractions.
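A minimal sketch of this setup: concatenate drug and target feature vectors to form pair instances, then train an off-the-shelf classifier (a Random Forest here); the feature matrices below are random placeholders.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
Fd = rng.random((50, 100))     # placeholder: 50 drugs x 100 drug features
Ft = rng.random((40, 200))     # placeholder: 40 targets x 200 target features
Y = (rng.random((50, 40)) < 0.05).astype(int)   # placeholder interaction matrix

# Build one instance per drug-target pair by concatenating feature vectors.
pairs = [(i, j) for i in range(50) for j in range(40)]
X = np.array([np.concatenate([Fd[i], Ft[j]]) for i, j in pairs])
y = np.array([Y[i, j] for i, j in pairs])       # 1 = interaction, 0 = unlabeled

clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
scores = clf.predict_proba(X)[:, 1]             # interaction likelihood per pair
```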
Incremental and forward feature selection
In [61], drugs were represented by a number of common functional groups that are found in drugs’ chemical structures, while targets were represented by pseudo amino acid composition. An innovative feature selection procedure was additionally introduced in this work for the sake of improving the prediction performance by using a better feature set.
The feature selection procedure starts by ranking features using the mRMR (minimum Redundancy Maximum Relevance) algorithm [76]. Incremental feature selection is then applied on the ranked features, i.e. the ranked features are added to the selected feature set in order, one by one, until the prediction performance on a temporary validation set stops improving. Finally, the set of selected features is further filtered by applying forward feature selection to it. After the feature selection phase is complete, a nearest neighbor algorithm is then applied to obtain final predictions.
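A sketch of the incremental step is shown below, assuming the features have already been ranked (e.g. by mRMR) and that a scoring function evaluates a candidate feature set on a validation split; both inputs are placeholders.

```python
def incremental_feature_selection(ranked_features, score_fn):
    """Add ranked features one by one until validation performance stops improving."""
    selected, best_score = [], -float("inf")
    for f in ranked_features:
        candidate = selected + [f]
        s = score_fn(candidate)     # e.g. validation performance with these features
        if s <= best_score:
            break                   # performance stopped improving; stop adding
        selected, best_score = candidate, s
    return selected
```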
Random Forest and SVMs
Random Forest and SVM models were proposed for predicting interactions in [62]. Assuming that the data consist of nd drugs and nt targets, there are nd × nt drug–target pairs in total. When the dimensionality of the data is high (i.e. drug–target pairs are represented by many features), it becomes challenging, if not infeasible, to use the entire data set of all drug–target pairs as training data to train a classification model. Therefore, the set of noninteractions, which is far bigger than the set of known interactions, is undersampled until its size is equal to that of the set of interactions.
Drug and target features were generated using the DRAGON (http://www.talete.mi.it/) and PROFEAT [44] packages, respectively. Drug features generated by DRAGON include constitutional and topological descriptors, eigenvalue-based indices and 2D autocorrelations among others. On the other hand, target features generated by PROFEAT include CTD and autocorrelation descriptors, amino acid composition and so on.
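The undersampling step can be sketched as follows, operating on pairwise instances X with labels y (built, for example, as in the pipeline sketch above):

```python
import numpy as np

def undersample(X, y, seed=0):
    """Randomly sample negatives to match the number of positives."""
    rng = np.random.default_rng(seed)
    pos = np.flatnonzero(y == 1)
    neg = rng.choice(np.flatnonzero(y == 0), size=len(pos), replace=False)
    idx = np.concatenate([pos, neg])
    return X[idx], y[idx]
```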
Fuzzy K-nearest neighbors
Fuzzy K-nearest neighbors (Fuzzy KNN) [63] models each training instance as belonging to two classes (i.e. positive and negative classes) with different membership values. For each test instance, its membership value to each of the two classes is computed by a kind of weighted average of its similarities to its nearest K neighbors, and the higher of the two values decides which class it belongs to. The drugs were represented by FP2 fingerprints that were generated using the Open Babel [37] package, while the targets were represented using pseudo amino acid composition.
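A sketch of the fuzzy membership computation for one test instance follows, using inverse-distance weighting over its K nearest training neighbors; this reflects the standard Fuzzy KNN formulation (with the usual fuzzifier parameter m), not necessarily the paper's exact code.

```python
import numpy as np

def fuzzy_knn_memberships(dists, neighbor_labels, m=2.0):
    """Membership of a test instance to the positive and negative classes.

    dists: distances to the K nearest training neighbors.
    neighbor_labels: 0/1 labels of those neighbors.
    """
    w = 1.0 / (dists ** (2.0 / (m - 1.0)) + 1e-12)   # inverse-distance weights
    u_pos = np.sum(w * (neighbor_labels == 1)) / np.sum(w)
    u_neg = np.sum(w * (neighbor_labels == 0)) / np.sum(w)
    return u_pos, u_neg   # the larger membership decides the predicted class
```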
Decision tree ensemble with oversampling
An ensemble technique was introduced in [64] to predict interactions. Drug descriptors were computed using the Rcpi package [35], whereas target descriptors were generated via the PROFEAT Web server [44]. Similar to the Random Forest used in [62], an ensemble of decision trees is trained, and feature subspacing is applied (i.e. a subset of the features is randomly sampled for each decision tree). However, in contrast to Random Forest, which performs bagging on the same sampled group of negatives, a different set of negatives is randomly sampled for each decision tree, giving better coverage of the negative class and including more of it in the training process. In addition, clustering is used to look for small disjuncts in the interacting class, which are then oversampled to reinforce them. This deals with an issue in the data known as within-class imbalance.
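A sketch of the per-tree negative sampling and feature subspacing is given below (the clustering-based oversampling step is omitted); the instance matrix X and labels y are assumed to be the pairwise instances described earlier.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def train_ensemble(X, y, n_trees=50, feat_frac=0.5, seed=0):
    """Each tree sees all positives, a fresh negative sample and a feature subset."""
    rng = np.random.default_rng(seed)
    pos, neg = np.flatnonzero(y == 1), np.flatnonzero(y == 0)
    ensemble = []
    for t in range(n_trees):
        neg_s = rng.choice(neg, size=len(pos), replace=False)  # fresh negatives
        idx = np.concatenate([pos, neg_s])
        feats = rng.choice(X.shape[1], size=int(feat_frac * X.shape[1]),
                           replace=False)                      # feature subspace
        tree = DecisionTreeClassifier(random_state=t)
        tree.fit(X[np.ix_(idx, feats)], y[idx])
        ensemble.append((tree, feats))
    return ensemble

def predict_ensemble(ensemble, X):
    """Average the trees' positive-class probabilities."""
    return np.mean([t.predict_proba(X[:, f])[:, 1] for t, f in ensemble], axis=0)
```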
Decision tree ensemble with dimensionality reduction
EnsemDT [65] is another ensemble technique that is similar to the one presented in [64]. However, EnsemDT does not oversample small disjuncts in the interacting class. Instead, it uses dimensionality reduction. That is, dimensionality reduction is applied to the drug and target feature vectors before concatenating them to form the instances. Three dimensionality reduction techniques were investigated, namely, Singular Value Decomposition (SVD), Partial Least Squares [77] and Laplacian Eigenmaps [69]. While dimensionality reduction is commonly used to improve the computational efficiency (i.e. reducing the running time), these techniques were found to improve the prediction performance as well.
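The dimensionality reduction step can be sketched with, e.g., truncated SVD applied to each feature matrix before the concatenation; the matrices and component counts below are placeholders.

```python
import numpy as np
from sklearn.decomposition import TruncatedSVD

rng = np.random.default_rng(0)
Fd = rng.random((50, 100))    # placeholder drug feature matrix
Ft = rng.random((40, 200))    # placeholder target feature matrix

# Reduce each side separately; the reduced vectors are then concatenated
# per drug-target pair to form lower-dimensional instances.
Fd_red = TruncatedSVD(n_components=20, random_state=0).fit_transform(Fd)
Ft_red = TruncatedSVD(n_components=20, random_state=0).fit_transform(Ft)
```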
Rotation Forest-based Predictor of Drug–Target Interactions
Rotation Forest-based Predictor of Drug–Target Interactions (RFDT) [78] uses yet another ensemble learning technique to predict DTIs. In particular, a variant based on Rotation Forest [79] was used. For each base classifier, the feature set is randomly divided into K roughly equal subsets (where K is a parameter). In other words, the n × p feature matrix X is split into K submatrices such that each submatrix has around p/K columns, where n is the number of instances and p is the number of features. Bagging is then applied to the training set; that is, a subset of the training examples is randomly sampled to form the training set for the current base classifier. Next, principal component analysis (PCA) is applied to each of the submatrices separately, and then the resulting features from each submatrix are combined to form a block-diagonal matrix called the rotation matrix. Finally, the feature matrix X is multiplied by the rotation matrix, and the resulting matrix is used as the training set along with the corresponding labels to train the base classifier. This procedure is repeated for all base classifiers constituting the ensemble.
Using a rotation matrix (that is constructed by dividing the feature set into K randomized subsets) and bagging are both ways of injecting diversity into the ensemble. Increased diversity within the ensemble is known to improve the overall prediction performance [80].
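A sketch of constructing one base classifier's rotation matrix is shown below (PCA per random feature subset, assembled block-diagonally); it assumes more instances than features per subset and omits details such as the per-subset sampling of the original Rotation Forest.

```python
import numpy as np
from sklearn.decomposition import PCA

def rotation_matrix(X, K=3, seed=0):
    """Block-diagonal rotation matrix from PCA on K random feature subsets."""
    rng = np.random.default_rng(seed)
    p = X.shape[1]
    subsets = np.array_split(rng.permutation(p), K)  # K roughly equal subsets
    R = np.zeros((p, p))
    for s in subsets:
        pca = PCA().fit(X[:, s])              # principal axes of this subset
        R[np.ix_(s, s)] = pca.components_.T   # loadings as one diagonal block
    return R   # X @ R (centering omitted for brevity) trains one base tree
```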
In this study, PubChem fingerprints (i.e. binary vectors indicating presence or absence of 881 common substructures) are used to represent drugs. Targets, on the other hand, are represented using autocovariance vectors that were generated using the targets’ genomic sequences; specifically, a position-specific scoring matrix (PSSM) was computed for each target (using its sequence), and then the PSSMs were used to generate autocovariance vectors for representing the targets.
Predicting Drug Targets with Protein Sequence
Predicting Drug Targets with Protein Sequence (PDTPS) [81] is similar to RFDT in that it makes use of PSSMs to represent targets. However, in place of autocovariance, it instead computes bi-gram probabilities from the PSSMs. In addition, PCA is later applied to reduce the dimensionality of the features. For prediction, PDTPS uses Relevance Vector Machines (RVMs). Experimental results showed that the proposed method was successful.
RVM [82] is a machine learning method that is functionally identical to SVM. However, unlike SVM, it uses Bayesian learning to make use of probabilistic formulations in prediction. Prediction models trained via RVM are typically sparse (i.e. compact and interpretable), while, at the same time, they are able to produce results that are comparable with (and exceed) those of SVM.
Extremely Randomized Trees
In [83], Extremely Randomized Trees (ER-Tree) are used to perform prediction. In regular Decision Tree-based ensembles, each Decision Tree follows certain rules for (i) selecting attributes to use for tree-splitting and (ii) determining cutoff points within the attributes. In ER-Tree, randomization is explicitly introduced into the training process by random selection of attributes and cutoff points. This explicit randomization helps strongly reduce the variance of the tree-based models, thus improving prediction performance. Furthermore, bagging is avoided (i.e. the entire training set is used) to keep the bias as low as possible.
This step is recursively repeated for the two child nodes, NL and NR, and so on until this base classifier is trained. This procedure is repeated for all the base classifiers, forming the ensemble.
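In scikit-learn, this scheme corresponds to ExtraTreesClassifier, whose default bootstrap=False setting matches the description above (every tree sees the full training set):

```python
from sklearn.ensemble import ExtraTreesClassifier

# bootstrap=False (the default) means no bagging: every tree is grown on the
# full training set, while split attributes and cutoff points are randomized.
clf = ExtraTreesClassifier(n_estimators=200, bootstrap=False, random_state=0)
# clf.fit(X, y); scores = clf.predict_proba(X_test)[:, 1]
```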
In terms of data representation, drugs are represented as PubChem fingerprints, while targets are represented using Pseudo Substitution Matrix Representation (pseudo-SMR).
Similarity-based Inference of drug–TARgets (SITAR)
Chemical substructures–protein domains correlation model
In [84], drugs are represented as PubChem fingerprints (binary vectors indicating the absence/presence of 881 common chemical substructures), while targets are represented as domain fingerprints (binary vectors indicating the absence/presence of 876 protein domains obtained from the Pfam database [85]).
When used to predict DTIs, SCCA produced results that are comparable with those of SVM. However, unlike SVM, which is focused only on prediction, SCCA is an interpretable classifier that, having been trained, can be inspected for learned rules that may contain useful insights. As stated above, SCCA emphasizes learning sparse weight vectors, which makes it possible to inspect these weight vectors for biological insights; the nonzero elements in the learned weight vector would correspond to the most significant chemical structures and protein domains that govern DTIs.
SVMs and minwise hashing
Linear SVM is used as the classifier. Two variants have been considered: one with an L2 regularization term (MH-L2SVM) and another with an L1 regularization term (MH-L1SVM). The two variants were found to produce similar prediction performance. However, the learned weight vector from MH-L1SVM is more interesting because the number of features extracted was much smaller than that of MH-L2SVM (i.e. fewer features to inspect for insights).
Finally, using the inverse operation of the minwise operation mentioned above, the weight vector learned using the compact fingerprints is converted into a final weight vector for the original fingerprint. This final weight vector is then inspected for biological interpretation.
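A minimal minwise-hashing sketch for binary fingerprints follows: each of h random permutations contributes the index of the first 'on' bit, yielding a compact signature whose collision rate approximates Jaccard similarity. The exact encoding in [86] and the inversion back to original-feature weights are not reproduced here.

```python
import numpy as np

def minhash_signatures(F, h=32, seed=0):
    """Compact each binary fingerprint (rows of F) into h min-hash values."""
    rng = np.random.default_rng(seed)
    n, p = F.shape
    sigs = np.full((n, h), p, dtype=int)
    for k in range(h):
        perm = rng.permutation(p)             # one random feature permutation
        for i in range(n):
            on = np.flatnonzero(F[i, perm])   # positions of 'on' bits
            if on.size:
                sigs[i, k] = on[0]            # index of the first on bit wins
    return sigs   # collision probability between rows ~ Jaccard similarity
```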
Empirical evaluation
We performed a comprehensive empirical comparison among various methods under the three distinct CV settings proposed in [25], as follows:
S1, where random drug–target pairs are left out as the test set;
S2, where entire drug profiles are left out as the test set; and
S3, where entire target profiles are left out as the test set.
S1 is the traditional setting for evaluation. Meanwhile, S2 and S3 are proposed to evaluate the ability of various methods to predict interactions for novel drugs and targets. Here, novel drugs and targets are those for which no interaction information is available. As such, additionally conducting experiments under S2 and S3 paints a fuller picture of how the different methods perform in various given situations. Illustrations of the different CV settings are provided in Figure 6.
Figure 6. The different cross-validation settings: (A) S1 involves leaving out random drug–target pairs from the interaction matrix Y to use as the test set, (B) S2 is the setting where entire drug profiles are left out and (C) S3 leaves out entire target profiles. Gray boxes represent left-out test instances.
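A sketch of how the three settings partition the data is given below: S1 masks random cells of Y, while S2 and S3 mask whole rows (drugs) or columns (targets); the fold assignment and seeding are illustrative.

```python
import numpy as np

def cv_folds(Y, setting="S1", n_folds=10, seed=0):
    """Yield boolean test masks over Y for the chosen CV setting."""
    rng = np.random.default_rng(seed)
    if setting == "S1":                    # leave out random drug-target pairs
        for part in np.array_split(rng.permutation(Y.size), n_folds):
            mask = np.zeros(Y.size, dtype=bool)
            mask[part] = True
            yield mask.reshape(Y.shape)
    elif setting == "S2":                  # leave out entire drug profiles (rows)
        for part in np.array_split(rng.permutation(Y.shape[0]), n_folds):
            mask = np.zeros(Y.shape, dtype=bool)
            mask[part, :] = True
            yield mask
    else:                                  # S3: entire target profiles (columns)
        for part in np.array_split(rng.permutation(Y.shape[1]), n_folds):
            mask = np.zeros(Y.shape, dtype=bool)
            mask[:, part] = True
            yield mask
```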
In our experiments, we performed five repetitions of a 10-fold CV procedure under each of the above scenarios using AUPR [88] (area under the precision–recall curve) as the evaluation metric. That is, under each 10-fold CV procedure, the data set (the interaction data, specifically) is divided into 10 folds. The folds take turns being left out as the test set, and the prediction performance for each of them is evaluated in terms of AUPR. The computed AUPRs are then averaged to give the AUPR of the 10-fold CV. This process is repeated five times, and the AUPRs of the 10-fold CVs are averaged to give the final AUPR.
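The evaluation loop then looks roughly like the sketch below, reusing cv_folds from above, with scikit-learn's average_precision_score serving as an AUPR estimate and a hypothetical predict function standing in for any of the compared methods.

```python
import numpy as np
from sklearn.metrics import average_precision_score

def evaluate(Y, predict, setting="S1", n_reps=5, n_folds=10):
    """Average AUPR over n_reps repetitions of n_folds-fold CV."""
    rep_auprs = []
    for rep in range(n_reps):
        fold_auprs = []
        for test_mask in cv_folds(Y, setting, n_folds, seed=rep):
            Y_train = Y * ~test_mask           # hide test cells from the method
            scores = predict(Y_train)          # full score matrix from the method
            fold_auprs.append(
                average_precision_score(Y[test_mask], scores[test_mask]))
        rep_auprs.append(np.mean(fold_auprs))  # AUPR of this 10-fold CV run
    return np.mean(rep_auprs)                  # final AUPR over all repetitions
```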
AUPR has been used as the main evaluation metric in previous work on DTI prediction. Furthermore, in cases of class imbalance, AUPR is more appropriate because it severely penalizes highly ranked incorrect recommendations [89], which better reflects the aim of having accurate predictions at the top of the prediction lists. For these reasons, we use AUPR as the evaluation metric in our empirical comparison as well. In addition, in DTI prediction, the relative order of the labels is more important than the exact prediction values; thus, it makes more sense to use an evaluation metric that measures how well the different drug–target pairs are ranked.
Benchmark data set
Some of the most widely used data sets in the field of DTI prediction are those introduced in [22]. Specifically, these are four data sets concerning four different classes of target proteins, namely, enzymes (Es), ion channels (ICs), G protein-coupled receptors (GPCRs) and nuclear receptors (NRs). Interaction data were extracted from the KEGG database [27] (see Table 2 for statistics on each of the data sets). In addition, each data set provides a drug similarity matrix Sd, where the pairwise similarities between the drugs were computed using SIMCOMP [90], and a target similarity matrix St, where the pairwise similarities between the targets were computed using normalized Smith–Waterman scores [91].
Table 2. Statistics of each data set

| Data set | NR | GPCR | IC | E |
|---|---|---|---|---|
| Drugs | 54 | 223 | 210 | 445 |
| Targets | 26 | 95 | 204 | 664 |
| Interactions | 90 | 635 | 1476 | 2926 |
Selected methods
We include a subset of the methods mentioned in section ‘Methods’ such that the different categories are represented. As baseline methods, we selected Nearest Profile and Weighted Profile from the neighborhood-based methods. We further selected CMF and WGRMF from the matrix factorization methods. From the network-based methods, we selected Wang et al.’s method from section ‘Heterogeneous graph inference’ (which we refer to as NBI+ from now on). As for BLMs, we selected Regularized Least Squares with Weighted Nearest Neighbors (RLS-WNN). The selected methods are the best-performing ones in their respective categories as reported in the publications where they appeared, which is why they were chosen to represent their categories. The source code for all the selected methods is downloadable via the URL: https://github.com/alizat/Chemogenomic-DTI-Prediction-Methods.
The data sets in Table 2 are in the form of similarity matrices that were precomputed from nonvectorial data, that is, the raw data from which the matrices were derived (i.e. chemical structure graphs and genomic sequences) are not in the form of fixed-length feature vectors. Therefore, feature-based classification methods were not included in this comparison. However, we conducted a separate comparison among various feature-based methods on another benchmark data set introduced in [64]. Please refer to the Supplementary Material for the results of this comparison.
Parameters for all prediction methods were tuned to give their optimal prediction performance under each of the cross-validation settings. The optimal parameter values were obtained by grid search.
Results
The results of the different methods under S1, S2 and S3 CV settings are given in Tables 3, 4 and 5, respectively. We discuss the results below, stating advantages and disadvantages of each method as well as other general comments. Note that the results on the NR data set are particularly unstable because of its excessively small size [25]. As such, while we provide results for the NR data set, they are otherwise mostly ignored in the discussion below.
Table 3. AUPR results under S1

| Method | NR | GPCR | IC | E |
|---|---|---|---|---|
| Nearest Profile | 0.496 (0.012) | 0.464 (0.009) | 0.522 (0.005) | 0.621 (0.003) |
| Weighted Profile | 0.425 (0.012) | 0.440 (0.010) | 0.756 (0.003) | 0.727 (0.001) |
| RLS-WNN | **0.729** (0.032) | 0.727 (0.018) | 0.856 (0.011) | 0.849 (0.006) |
| CMF | *0.639* (0.016) | **0.754** (0.002) | **0.937** (0.002) | **0.883** (0.003) |
| WGRMF | 0.602 (0.038) | *0.737* (0.002) | *0.923* (0.002) | *0.877* (0.002) |
| NBI+ | 0.287 (0.021) | 0.255 (0.005) | 0.162 (0.002) | 0.206 (0.002) |

Note: Best and second-best AUPR results in each column are bold and italic, respectively. SDs are given in parentheses.
Table 4. AUPR results under S2

| Method | NR | GPCR | IC | E |
|---|---|---|---|---|
| Nearest Profile | 0.417 (0.031) | 0.283 (0.017) | 0.208 (0.013) | 0.223 (0.007) |
| Weighted Profile | 0.376 (0.022) | 0.231 (0.005) | 0.187 (0.005) | 0.118 (0.002) |
| RLS-WNN | *0.545* (0.023) | 0.369 (0.007) | 0.334 (0.010) | *0.393* (0.013) |
| CMF | 0.521 (0.027) | *0.407* (0.011) | *0.353* (0.014) | 0.384 (0.012) |
| WGRMF | **0.570** (0.014) | **0.427** (0.011) | **0.367** (0.016) | **0.413** (0.017) |
| NBI+ | 0.267 (0.025) | 0.201 (0.010) | 0.112 (0.007) | 0.110 (0.004) |

Note: Best and second-best AUPR results in each column are bold and italic, respectively. SDs are given in parentheses.
Table 5. AUPR results under S3

| Method | NR | GPCR | IC | E |
|---|---|---|---|---|
| Nearest Profile | 0.393 (0.037) | 0.444 (0.025) | 0.589 (0.021) | 0.647 (0.015) |
| Weighted Profile | 0.379 (0.024) | 0.327 (0.011) | 0.721 (0.005) | 0.673 (0.007) |
| RLS-WNN | **0.491** (0.032) | 0.574 (0.021) | 0.763 (0.007) | 0.778 (0.018) |
| CMF | *0.478* (0.017) | *0.599* (0.033) | *0.779* (0.011) | *0.782* (0.013) |
| WGRMF | 0.464 (0.018) | **0.609** (0.032) | **0.813** (0.007) | **0.808** (0.018) |
| NBI+ | 0.300 (0.020) | 0.203 (0.006) | 0.193 (0.006) | 0.210 (0.007) |

Note: Best and second-best AUPR results in each column are bold and italic, respectively. SDs are given in parentheses.
Pair prediction case, S1
We draw two conclusions based on the results in Table 3. First, CMF is the overall best method under the S1 CV setting, followed by WGRMF. This shows that matrix factorization methods outperform the other methods, rendering them the most promising DTI prediction methods under S1. Second, Weighted Profile performs better than Nearest Profile on the IC and E data sets, likely because these data sets, being larger than the NR and GPCR data sets, provide more neighbors from which interactions can be inferred accurately.
Drug prediction case, S2
Moving on to the S2 CV setting, it is obvious from the results in Table 4 that it is a more challenging setting than S1. According to insights obtained from a previous study on pair-input computational predictions [92], it is more difficult to predict new interactions for drugs (or targets) when they do not appear in the training set at all. This is in contrast to the S1 case where drug (or target) interaction profiles are only partially left out.
Going back to the results, WGRMF performed the best out of all the methods, followed by CMF. Again, matrix factorization methods seem to be doing well in general. WGRMF did better than CMF under S2 thanks to its graph regularization terms, which shows the usefulness of manifold learning in this less informative CV setting.
RLS-WNN, which uses network similarity, is able to give a reasonable prediction performance. This is thanks to the WNN preprocessing procedure that reinforces the learning process by inferring temporary profiles for the left-out drugs. Note that RLS-WNN computes network similarity in the form of GIP kernels that are used later in the algorithm. Naturally, the temporary profiles are better for computing network similarity than the initially empty profiles of the left-out drugs, which underscores the importance of preprocessing procedures like WNN when the incorporation of network similarity in training the classifiers is intended.
Target prediction case, S3
Finally, we reach the results for the S3 setting. As expected, the AUPR results of S3 are also lower than those obtained under S1, but they are consistently higher than those obtained under S2. This leads to the conclusion that target genomic sequence similarities are generally more reliable than drug chemical structure similarities, a conclusion that has been previously reached in [48].
As in the S2 case, the matrix factorization methods are generally superior, with WGRMF performing better than CMF thanks to its graph regularization terms. RLS-WNN gave a comparable performance. As for NBI+, similar to the S1 and S2 cases, it was unable to outperform even the baseline methods, Nearest Profile and Weighted Profile. Thus, we conclude that network-based methods are generally not the best choice for DTI prediction.
Discussions
Generally speaking, the matrix factorization methods are the best methods when it comes to predicting DTIs. In addition, the manifold assumption, namely that points lie on or near a low-dimensional manifold [67–69], appears to be successful in improving DTI prediction performance (as displayed by WGRMF). However, when prior information is available in abundance (the S1 setting), manifold learning becomes slightly less useful (as shown by CMF doing better than WGRMF under S1), though still useful nonetheless.
It is important to mention that while RLS-WNN did not beat the matrix factorization methods, it is a much faster algorithm. It is also more robust with respect to parameter selection; the matrix factorization methods have more parameters that are sensitive and need fine-tuning. As such, when setting out to predict DTIs, it is always a good idea to obtain initial predictions with RLS-WNN first. We also emphasize that all BLMs are generally fast and memory-efficient algorithms and should be the first to consider if the data set used is significantly larger than the ones used in this study.
Regarding the network-based method, NBI+, it did not do as well as the other methods. It may be that the properties of the DTI networks are unfavorable for such a network-based method, for example the low average number of known interactions per drug or target and the presence of a considerable number of undiscovered interactions among the noninteractions (which can negatively influence predictions). Furthermore, network-based methods do not do well in predicting new interactions for orphan drugs for which no interactions are previously known. The problem is even more challenging when the interaction to be predicted involves an orphan target as well; in that case, the path on the network between the orphan drug and target would be too indirect and would thus be given a low prediction score. Finally, a previous survey [26] stated that predictions from network-based methods tend to be biased toward drugs with more associated targets (or targets with more associated drugs) and that it is generally nontrivial to predict ‘an interaction between a drug in one subnetwork and a target in another’.
On the other hand, network-based methods still have a place in DTI prediction. As an example, the pioneering network-based method NRWRH [52] generated a heterogeneous network (as in Figure 2) on which a random walk was performed to obtain predictions, an elegant idea indeed. Augmenting the heterogeneous network with more information (e.g. extra drug and target pairwise similarities) may help remedy, to some extent, the issues that network-based methods face in predicting interactions for orphan drugs or targets. It may also be helpful to draw inspiration from previous work on generating functional linkage networks (FLNs) [93–96]. FLNs are networks of functional associations between genes, and they have been successfully used in research investigating gene-related functions and diseases. Constructing FLNs requires gathering information from multiple heterogeneous sources of varying quality and completeness that may occasionally correlate highly with each other; such experience can be transferred to the generation of heterogeneous DTI networks on which network-based methods can predict new DTIs with better accuracy.
Now, we move on to an issue related to experimental design. As mentioned earlier, drug–target pairs are left out as test instances to see how well they are predicted by the different methods. This is done by setting the values of the test instances to 0 (i.e. setting Yij = 0 for test instances). The issue here is that known noninteractions and test instances would both be represented by the same 0 value, which may not be ideal. However, giving noninteractions a unique representation that separates them from test instances is not straightforward. In [48], it was found via experimentation that representing noninteractions by any value far from 0 (e.g. −1) is generally not a good idea. This is mainly because of the severe imbalance in the data (i.e. many more noninteractions than interactions); supposing, for example, that noninteractions are represented as −1, classifiers would focus on predicting noninteractions correctly at the expense of predicting interactions correctly. While some algorithms (e.g. CMF and WGRMF) partially circumvent the representation issue by using a weight matrix W that prevents test instances from contributing to the predictions, most (if not all) previous work in DTI prediction has represented test instances by setting them to 0. Note that this issue does not apply to feature-based classification methods, where test instances are simply excluded from the training set and the trained classifier is then used to perform predictions on them.
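The masking trick is easy to express in code. Below is a minimal sketch of weighted matrix factorization under a simple gradient-descent solver, where W_ij = 0 for test pairs so they contribute nothing to the loss; the hyperparameter values are illustrative, and CMF/WGRMF themselves add similarity-based regularization terms not shown here.

```python
import numpy as np

def weighted_mf(Y, W, k=10, lam=0.1, lr=0.01, iters=500, seed=0):
    """Factorize Y ~ A @ B.T under the masked squared loss
    ||W * (A @ B.T - Y)||^2 plus L2 regularization on A and B."""
    rng = np.random.default_rng(seed)
    n, m = Y.shape
    A = 0.1 * rng.standard_normal((n, k))   # drug latent factors
    B = 0.1 * rng.standard_normal((m, k))   # target latent factors
    for _ in range(iters):
        R = W * (A @ B.T - Y)               # residual; zeroed on test pairs
        A -= lr * (R @ B + lam * A)         # gradient step for A
        B -= lr * (R.T @ A + lam * B)       # gradient step for B
    return A @ B.T                          # prediction scores for all pairs
```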
Avenues for improvement and further research
In this section, we give examples of how researchers have attempted to improve DTI prediction performance and occasionally add suggestions of our own.
Using more information
As mentioned in section ‘Data representation and types’, there are multiple information sources that can possibly be used for DTI prediction. These sources represent different aspects of the drugs and targets involved and can help improve prediction performance if used concurrently. We provide some examples below of previous work that used more than one source of information at once.
One such work was [97], where many drug and target kernels were used and a multiple kernel learning method was developed to take them as input and determine how best to merge them to produce the best predictions. An interesting aspect of this work is that some of the kernels were produced from the same information source. From target protein sequences, normalized Smith–Waterman, mismatch and spectrum kernels were created (via the KeBABS package [98]). On the drug side, the Rchemcpp package [99] was used to obtain spectrum, Lambda-k, Marginalized, MinMax and Tanimoto kernels from the drugs’ chemical structures. For more on multiple kernel learning in general, the reader is referred to [100]; a sketch of the kernel-combination idea follows Table 6. Furthermore, Table 6 provides a list of software packages for extracting drug and target features from their chemical structures and genomic sequences, respectively.
Table 6. Software packages to compute features for drugs and targets

| Package | Link |
|---|---|
| ChemCPP | http://chemcpp.sourceforge.net |
| RDKit | http://www.rdkit.org/ |
| PyDPI [36] | https://sourceforge.net/projects/pydpicao/ |
| OpenBabel [37] | http://openbabel.org/ |
| Rcpi [35] | http://bioconductor.org/packages/release/bioc/html/Rcpi.html |
| Rchemcpp [99] | http://shiny.bioinf.jku.at/Analoging/ |
| KeBABS [98] | http://www.bioinf.jku.at/software/kebabs/ |
| PROFEAT [44] | http://bidd2.nus.edu.sg/cgi-bin/profeat2016/main.cgi |
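To make the kernel-integration idea concrete (as promised above), here is a minimal sketch that forms a convex combination of precomputed kernel matrices; a uniform combination is the trivial baseline, whereas a full multiple kernel learning method such as the one in [97] learns the weights from data. The function name and defaults are our own.

```python
import numpy as np

def combine_kernels(kernels, weights=None):
    """Convex combination K = sum_i w_i * K_i of precomputed kernels.

    kernels: list of (n x n) kernel matrices, e.g. several drug kernels
             derived from the same chemical structures.
    weights: nonnegative mixing weights; defaults to uniform.
    """
    if weights is None:
        weights = np.ones(len(kernels))
    weights = np.asarray(weights, dtype=float)
    weights /= weights.sum()                 # normalize to a convex sum
    return sum(w * K for w, K in zip(weights, kernels))
```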
Another work that used multiple kernels was [101]. Drugs were represented as FP2 fingerprints. As for targets, different representations were generated including those based on autocovariance, entropy, discrete wavelet and substitution matrices among several others. An ensemble of SVM classifiers was trained, one classifier for each target descriptor type. Each drug–target pair was represented by concatenating the FP2 fingerprint of the drug with the target’s descriptor. Predictions from the different SVM classifiers were summed up to give the final predictions.
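A sketch of such a per-view SVM ensemble follows, using scikit-learn; the variable names and SVM parameters are our own illustrative choices, while the sum-of-decision-values combination follows [101].

```python
import numpy as np
from sklearn.svm import SVC

def svm_view_ensemble(train_views, y_train, test_views):
    """Train one SVM per target-descriptor view and sum the decision
    values to obtain the final interaction scores.

    train_views / test_views: lists of (pairs x features) matrices, one
    per descriptor type (drug fingerprint concatenated with one target
    descriptor); y_train: binary interaction labels.
    """
    scores = np.zeros(test_views[0].shape[0])
    for X_tr, X_te in zip(train_views, test_views):
        clf = SVC(kernel="rbf", C=1.0).fit(X_tr, y_train)
        scores += clf.decision_function(X_te)   # signed margin per pair
    return scores
```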
Protein secondary structure information [102] has not been used often. It can be extracted from protein sequences, which is convenient, as sequences are always available for proteins. Known drug–disease and protein–disease associations may also be used as additional sources of information [103]; however, these data have likewise not been used frequently for DTI prediction.
Determination of 3D structures of membrane proteins (via wet-lab techniques) is becoming more feasible over time [21], so we may eventually witness the use of protein structure information in global-scale DTI prediction. This would possibly trigger the emergence of software packages routinely used to generate descriptors for a protein from its structure (instead of from its sequence). Such features may yield better prediction performance, as it is widely accepted that a protein’s structure dictates its function. Until such software packages appear, researchers can experiment to find the features that are most useful to extract from protein structures.
Finally, for a comprehensive overview of the different ways to integrate multiple information sources simultaneously for improving prediction performance, we refer the reader to [104].
Ensemble learning
There are two types of ensemble methods: heterogeneous ensembles and homogeneous ensembles.
Heterogeneous ensembles consist of different learners that have different induction biases, typically trained on the same data. A procedure known as stacking is usually used, where the results from the different learners are concatenated to form feature vectors that are then used to train yet another meta-learner, which gives the final predictions [80]. The improvement in prediction performance comes from the diversity induced by the different inductive biases of the learners constituting the ensemble.
A heterogeneous ensemble for DTI prediction was previously developed [105] that consisted of four methods: Weighted Profile, RLS-avg, LapRLS and NBI. After predicting with these methods, an SVM meta-learner is trained with their results and then used to give the final prediction. The ensemble showed improved prediction performance over all the constituent methods. Another heterogeneous ensemble, DrugE-Rank [106], also uses a number of different learners as in [105], but instead of the SVM meta-learner, it uses a ranking algorithm, LambdaMART [107], to give the final predictions.
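A minimal stacking sketch follows, with an SVM meta-learner as in [105]; the use of predict_proba and the variable names are our own illustrative choices.

```python
from sklearn.svm import SVC

def stacked_ensemble(base_scores_train, y_train, base_scores_test):
    """base_scores_*: (pairs x n_base_learners) matrices whose columns
    hold the prediction scores of the base methods (e.g. Weighted
    Profile, RLS-avg, LapRLS and NBI). The meta-learner is trained on
    these columns and produces the final interaction scores."""
    meta = SVC(kernel="rbf", probability=True).fit(base_scores_train, y_train)
    return meta.predict_proba(base_scores_test)[:, 1]
```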
In terms of possible future work on heterogeneous ensembles, one technique that has not yet been applied to DTI prediction is ensemble pruning, in which only a subset of the base learners is retained to constitute the ensemble. This leads to smaller ensembles and, subsequently, better computational efficiency, because fewer base learners perform predictions. In addition, it has been shown that smaller ensembles obtained via pruning can also have better generalization performance [108].
Homogeneous ensembles, on the other hand, consist of learners of the same type. For example, Random Forest [109] is a homogeneous ensemble method consisting of many decision trees. The diversity needed to improve prediction performance can come from several sources. One is bagging, which induces diversity by randomly sampling the training examples with replacement, so that each learner is trained on a different training set. Another is feature subspacing, which induces diversity by randomly sampling a subset of the features for each learner. Both are sketched below.
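A minimal sketch of both diversity mechanisms, assuming scikit-learn decision trees as base learners; the subspace size and ensemble size are illustrative defaults, and the probability columns are guarded because a bootstrap sample may miss a class.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def bagged_subspace_trees(X, y, X_test, n_trees=50, seed=0):
    """Ensemble of trees diversified by bagging and feature subspacing."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    k = max(1, int(np.sqrt(d)))                      # subspace size heuristic
    scores = np.zeros(X_test.shape[0])
    for _ in range(n_trees):
        rows = rng.integers(0, n, size=n)            # bootstrap sample (bagging)
        cols = rng.choice(d, size=k, replace=False)  # random feature subspace
        tree = DecisionTreeClassifier().fit(X[rows][:, cols], y[rows])
        proba = tree.predict_proba(X_test[:, cols])
        if proba.shape[1] == 2:                      # a bootstrap may miss a class
            scores += proba[:, 1]
    return scores / n_trees
```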
A homogeneous ensemble method based on decision trees was introduced in [110]. It randomly projects the feature matrix (representing the different drug–target pairs) into a lower-dimensional matrix. This reduces the dimensionality of the data (thus improving computational speed) and injects diversity into the ensemble, leading to gains in prediction performance. Examples of other homogeneous ensembles include [62, 64, 65], which were described earlier in section ‘Methods’.
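A compact sketch of that random-projection idea, assuming scikit-learn’s SparseRandomProjection as the projection step (the exact algorithm in [110] may differ in its details):

```python
import numpy as np
from sklearn.random_projection import SparseRandomProjection
from sklearn.tree import DecisionTreeClassifier

def random_projection_trees(X, y, X_test, n_trees=30, dim=200, seed=0):
    """Each tree sees a different random low-dimensional projection of
    the pair-feature matrix: faster training plus extra diversity."""
    scores = np.zeros(X_test.shape[0])
    for i in range(n_trees):
        proj = SparseRandomProjection(n_components=dim, random_state=seed + i)
        X_low = proj.fit_transform(X)                # project training pairs
        tree = DecisionTreeClassifier(random_state=seed + i).fit(X_low, y)
        scores += tree.predict_proba(proj.transform(X_test))[:, 1]
    return scores / n_trees
```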
Besides bagging and feature subspacing, there are other ways to generate diversity among the base learners that have not yet been used in homogeneous ensemble methods for DTI prediction. One is to use different parameter settings for each base learner [80]. Another is to randomly flip the labels of some training examples (i.e. convert 1 to 0 or vice versa) [111]. These tricks may enhance prediction performance further.
Deep learning
The use of deep learning has been steadily increasing in drug discovery [112, 113]. The reason is that deep learning can build complex models capable of learning difficult concepts, thereby outperforming competing methods. In addition, as it can extract useful features from the input features, we believe deep learning methods would especially shine when it comes to merging different sources of information. The two main limitations that long held deep learning back were that (1) a lot of training data is needed to train the complex model being generated and (2) a lot of computational power is needed to perform the training. Over time, however, these two issues are becoming less of an obstacle, thanks to the accumulation of more data as well as the emergence of more high-performance computing resources.
In [25], the authors suggested the use of a CV setting, S4, where the drugs and targets used in training do not appear in the test set, which is known to be a challenging setting indeed. In the experiments conducted in [25], only trivial interactions were predicted successfully under S4. We believe that deep learning, with its ability to obtain useful deep representations of the drugs, targets and interactions, has the potential to do much better than other state-of-the-art methods in predicting interactions under S4. This remains to be confirmed in future work.
A number of DTI prediction efforts have made use of deep learning to improve prediction performance. Deep learning techniques that have been used in DTI prediction include restricted Boltzmann machines [114], deep neural networks [115–117], stacked autoencoders [118, 119] and deep belief networks [120]. As yet, none of the deep learning methods developed for DTI prediction has attempted to simultaneously use multiple heterogeneous sources of drug and target information. It would be interesting to see such efforts in future work.
Absence of reliable negatives
A prevalent issue in DTI prediction is the absence of a list of reliable negatives, i.e. there are no confirmed noninteractions. Unfortunately, reporting noninteractions is not something researchers routinely do. Nevertheless, efforts have been made to deal with this problem.
Biased SVM is a variant of SVM that was used in [121] to give different weights to the positive and negative classes in the data. Positive examples, being more reliable, are given higher weights than the negative examples. The weights are tuned to give the best possible prediction performance.
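In scikit-learn, the class weighting at the heart of biased SVM is a one-liner; the 10:1 ratio below is purely illustrative, as [121] tunes the weights.

```python
from sklearn.svm import SVC

# Positive (interaction) class weighted 10x the unlabeled/negative class;
# the actual ratio should be tuned, e.g. by cross-validation.
biased_svm = SVC(kernel="rbf", class_weight={1: 10.0, 0: 1.0})
# biased_svm.fit(X_train, y_train)
# scores = biased_svm.decision_function(X_test)
```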
PUDT [122] is a DTI prediction method that uses positive unlabeled learning to deal with the issue of unconfident negatives. Sets of unlabeled drug–target pairs are labeled as reliable negative and likely negative. An SVM classifier is then trained where, similar to biased SVM, weights are given to these negative classes (along with the positive class) and are tuned to give a good prediction performance.
In another approach, predictions are first made using a simple network-based method; from the resulting set of predictions, a subset is taken as a reliable negative set that can be used later with any prediction method.
In [124], the BioLip [125] and BindingDB [126] databases are searched for drug–target pairs with measured binding affinities weak enough to be used as reliable negatives.
Rather than using binary values to represent interactions and noninteractions, the use of data sets where continuous values correspond to drug–target affinities has been proposed [25]. Examples include the data sets introduced in [127] and [128], which contain kinase dissociation constants and kinase inhibition constants, respectively; lower constants correspond to higher affinities and vice versa. It is suggested in [25] that such data sets be used as benchmarks in future DTI prediction efforts. The idea is worth considering, as these data sets represent reality more accurately than traditional binary-valued data sets, and using them would implicitly eliminate the issue of reliable negatives discussed above. We suspect this trend will grow as data of this kind become more abundant.
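Working with such affinity data typically begins with a log transform that puts the constants on a manageable scale. The helper below converts dissociation constants to pKd and assumes input in nanomolar units; the benchmark papers may apply their own specific variants of this transform.

```python
import numpy as np

def kd_to_pkd(kd_nM):
    """pKd = -log10(Kd in molar); larger pKd means stronger binding.
    For example, a Kd of 1 nM maps to pKd = 9, and 1 uM to pKd = 6."""
    return -np.log10(np.asarray(kd_nM, dtype=float) * 1e-9)
```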
Big data
Over 90 million chemical compounds are currently stored in PubChem [129], while the BioAssays conducted so far have covered only about 2.4 million compounds (i.e. targets are known for these compounds) [11]. BioAssay experiments are unlikely to cover the remaining compounds in the foreseeable future, so virtual screening of these compounds is inevitable. However, as the size of the data is exceptionally large, big data technologies (e.g. cloud computing) will need to be used.
However, adjustments to algorithms (or possibly novel algorithms altogether) will also be needed to handle data of such size. For example, the work in [86] is a step in this direction: minwise hashing was used to obtain a compact representation of the drug–target pairs, and the reduced dimensionality lowers both the space and time complexities. Another work aiming for scalability is [130], where a memory-efficient tree structure is developed to query large databases for similar drug–target pairs.
Recently, new technologies for dealing with big data have been emerging, making it easier to process huge amounts of data. Spark, for example, can distribute computational tasks over a cluster of computers, leading to faster processing of the data. It would be interesting to see Spark used to detect new interactions over large numbers of proteins and compounds and, possibly, to use the detected interactions to guide the BioAssay experiments mentioned earlier, so that they discover higher numbers of compound–protein interactions.
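As a small taste of what this could look like, the PySpark sketch below distributes pairwise Tanimoto similarity computation across a cluster; the `fingerprints` list (compound id, set of fingerprint on-bits) is a hypothetical toy input, and a real run would load millions of fingerprints from distributed storage instead.

```python
from pyspark.sql import SparkSession

def tanimoto(a, b):
    """Tanimoto coefficient between two fingerprint on-bit sets."""
    inter = len(a & b)
    return inter / (len(a) + len(b) - inter) if inter else 0.0

spark = SparkSession.builder.appName("dti-similarity").getOrCreate()

# Hypothetical input: (compound id, frozenset of fingerprint on-bits).
fingerprints = [("cmp1", frozenset({1, 5, 9})),
                ("cmp2", frozenset({1, 9, 12})),
                ("cmp3", frozenset({2, 5}))]

drugs = spark.sparkContext.parallelize(fingerprints)
pairs = drugs.cartesian(drugs).filter(lambda p: p[0][0] < p[1][0])  # unique pairs
sims = pairs.map(lambda p: (p[0][0], p[1][0], tanimoto(p[0][1], p[1][1])))
print(sims.collect())
```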
Network visualization
As an exploratory analysis aid, network visualization may be used to display the DTI bipartite network. Inspecting the network visually may provide clues or insights that could otherwise be difficult to reach.
For example, it may be easier to determine why certain interactions tend to get low prediction scores by carefully observing the visualized network for hints. The user may consider using edge width or coloring edges with a color scale to indicate how high or low their prediction scores are. This particular example is illustrated in Figure 7.
Figure 7. Visualization of the NR DTI network, where circles and squares represent drugs and targets, respectively. An S1 CV experiment was performed, and the final averaged prediction scores are represented by the thickness of the edges.
Many tools exist for visualizing networks. Two tools that are used in visualizing DTI networks are NodeXL [131] and Cytoscape [132].
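For those who prefer scripting, networkx (a further option besides the two tools above) can reproduce the edge-width idea of Figure 7 in a few lines; the `scored_edges` triples below are hypothetical placeholders for the averaged CV prediction scores described earlier.

```python
import networkx as nx
import matplotlib.pyplot as plt

# Hypothetical (drug, target, score) triples; real input would come from
# the averaged prediction scores of a CV experiment.
scored_edges = [("drugA", "target1", 0.9),
                ("drugA", "target2", 0.4),
                ("drugB", "target3", 0.7)]

G = nx.Graph()
for drug, target, score in scored_edges:
    G.add_edge(drug, target, weight=score)

pos = nx.spring_layout(G, seed=42)                    # deterministic layout
widths = [5.0 * G[u][v]["weight"] for u, v in G.edges()]
nx.draw_networkx(G, pos, width=widths, node_size=300, font_size=7)
plt.axis("off")
plt.show()
```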
Noncoding RNAs
While this work primarily focuses on target proteins, there is another type of target—noncoding RNAs (ncRNAs)—for which drugs have been successfully developed. ncRNAs are RNAs that do not code for proteins, and they comprise multiple subcategories, including microRNAs (miRNAs), long noncoding RNAs and intronic RNAs, among several others. To give a few examples, drugs based on miRNAs have been used to treat hepatitis C virus infection [133] and Alport nephropathy [134], while others based on intronic RNAs have been used to treat Duchenne muscular dystrophy [135] and Usher syndrome [136]. Each type of ncRNA has unique behaviors and mechanisms, presenting various challenges and opportunities, all of which are discussed with examples in a recent overview [137].
We expect more research involving ncRNAs in the future. Worth mentioning is the NRDTD database [138], recently set up to store information on ncRNAs and their binding drugs; research into ncRNAs as drug targets is likely to make frequent use of it.
Evaluation metrics
An important aspect of DTI prediction that deserves attention is how the prediction performance of the different classifiers is evaluated. Take the AUPR metric, which was used in this study to compare the different prediction methods: the AUPR score was computed by pooling the prediction values for all drug–target pairs, ranking them and computing a single AUPR over the pooled ranking.
However, as we may also be interested in how well ranked the predictions are for each drug or target separately, computing the AUPR differently may be worth considering. Specifically, a per-drug AUPR may be calculated by computing a separate AUPR for each drug (i.e. sort each drug’s predicted targets and compute that drug’s AUPR) and then averaging over all drugs. A per-target AUPR may be obtained in a similar fashion. The per-drug and per-target AUPRs may better reflect the aspects of prediction performance that we are actually interested in testing.
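Computing a per-drug AUPR is straightforward with scikit-learn, as sketched below; rows with no known interactions are skipped, as AUPR is undefined for them.

```python
import numpy as np
from sklearn.metrics import average_precision_score

def per_drug_aupr(Y_true, Y_score):
    """Average the AUPR computed separately over each drug's row of the
    interaction matrix; Y_true is binary, Y_score holds predictions."""
    auprs = [average_precision_score(y, s)
             for y, s in zip(Y_true, Y_score) if y.sum() > 0]
    return float(np.mean(auprs))
```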
Conclusion and outlook
Drug repositioning involves many computational techniques that are used in various circumstances depending on the current level of available knowledge on the target disease [141]. Comprehensive surveys providing overviews on these computational approaches have been published previously [142, 143]. Of these approaches, we gave an overview of DTI prediction, which is an important task in drug discovery. Indeed, many Web servers have been developed to facilitate this task for practitioners who wish to perform it on a global scale [26]. Examples of such Web servers include DINIES [144], BalestraWeb [145] and SuperPred [146] among several others [54, 147–150].
In this work, we started by describing the data required for the task of DTI prediction and gave examples of the different kinds of data that may be used. Next, we gave an up-to-date overview of the state-of-the-art methods that are trained with said data and then used to predict new interactions. We then performed an empirical comparison between a number of selected methods to show their prediction performance under different scenarios. Finally, we provided a list of avenues for further improving prediction performance.
Research on chemogenomic DTI prediction has been conducted for about a decade now, from pioneering works such as [151–153] up to the present day. It is expected to continue for several years, with contributions involving deep learning concepts, multiview learning and possibly unprecedented clever features for representing drugs and/or targets. In addition, as algorithms grow more sophisticated, big data technologies (e.g. Spark) may enter the picture.
From the data perspective, there is the issue of data sets being binary in nature, i.e. given an interaction matrix Y, Yij = 1 if drug di and target tj interact and 0 otherwise. This brings forth a significant problem: some of the 0s in Y may be interactions that are yet undiscovered, which may throw off the training process of the different classifiers. Moreover, in reality, drug–target pairs have binding affinities that vary over a spectrum; interactions are not binary on/off. Data sets with continuous values representing drug–target binding affinities (as opposed to discrete 0 and 1 values) have been previously proposed [127, 128], and we expect the use of such continuous-valued data sets to eventually catch on, as they represent reality better than the binary data sets used in the majority of previous work in DTI prediction.
So far, the majority of the work has been concerned with protein targets. However, ncRNAs are expected to eventually take some of the spotlight. Many ncRNA-targeting drugs have been developed, and many more are expected to appear as our understanding of how ncRNAs operate improves [137]. As for global-scale prediction of drug–ncRNA interactions using machine learning algorithms (in the manner exemplified for proteins in this survey), we may well witness such efforts in the near future; they would likely use a repository such as NRDTD [138], which stores extensive information on known drug–ncRNA interactions.
Funding
This work was supported by the Agency for Science, Technology and Research (A*STAR), Singapore.
Ali Ezzat is a Ph.D. student at the School of Computer Engineering in Nanyang Technological University (NTU), Singapore. He received the B.Sc. degree in computer science from Ain Shams University, Egypt, in 2006 and the M.Sc. degree in bioinformatics from Nanyang Technological University, Singapore, in 2013. He also worked as an IT professional at United OFOQ, Egypt, from 2008 to 2011. His research interests include machine learning, data mining and bioinformatics.
Min Wu is currently a Research Scientist in the Data Analytics Department at the Institute for Infocomm Research (I2R) under the Agency for Science, Technology and Research (A*STAR), Singapore. He received the B.Eng. from the University of Science and Technology of China (USTC), China in 2006 and his Ph.D. degree from Nanyang Technological University, Singapore in 2011. His current research interests include machine learning, data mining and bioinformatics.
Xiao-Li Li is currently a department head at the Institute for Infocomm Research, A*STAR, Singapore. He also holds adjunct professor positions at the National University of Singapore and Nanyang Technological University. His research interests include data mining, machine learning, AI, and bioinformatics. He has served as a (senior) PC member/workshop chair/session chair in leading data mining related conferences (including KDD, ICDM, SDM, PKDD/ECML, WWW, IJCAI, AAAI, ACL and CIKM) and as an editor of bioinformatics-related books. Xiaoli has published more than 160 high quality papers and won best paper/benchmark competition awards.
Chee-Keong Kwoh received the bachelor’s degree in electrical engineering (first class) and the master’s degree in industrial system engineering from the National University of Singapore in 1987 and 1991, respectively. He received the PhD degree from the Imperial College of Science, Technology and Medicine, University of London, in 1995. He is currently an associate professor at the School of Computer Engineering, Nanyang Technological University (NTU). His research interests include data mining, soft computing and graph-based inference, bioinformatics and biomedical engineering. He is a member of the Association for Medical and Bio-Informatics, Imperial College Alumni Association of Singapore.