Ping Xuan, Yangkun Cao, Tiangang Zhang, Xiao Wang, Shuxiang Pan, Tonghui Shen, Drug repositioning through integration of prior knowledge and projections of drugs and diseases, Bioinformatics, Volume 35, Issue 20, October 2019, Pages 4108–4119, https://doi.org/10.1093/bioinformatics/btz182
Abstract
Identifying and developing novel therapeutic effects for existing drugs helps reduce drug development costs. Most previous methods focus on integrating heterogeneous data on drugs and diseases from multiple sources to predict candidate drug–disease associations. However, they fail to take the prior knowledge of drugs and diseases and the sparse characteristic of their associations into account. It is essential to develop a method that exploits more of the useful information to predict reliable candidate associations.
We present DisDrugPred, a method based on non-negative matrix factorization, to predict candidate disease indications for drugs. A new type of drug similarity is first calculated based on the diseases associated with the drugs. DisDrugPred integrates two types of disease similarities, the associations between drugs and diseases, and several drug similarities from different levels, including the chemical structures of drugs, the target proteins of drugs, the diseases associated with drugs and the side effects of drugs. The prior knowledge of drugs and diseases and the sparse characteristic of drug–disease associations provide a deep biological perspective for capturing the relationships between drugs and diseases. At the same time, the possibility that a drug is associated with a disease also depends on their projections in the low-dimensional feature space. Therefore, DisDrugPred integrates the diverse prior knowledge, the sparse characteristic of associations and the projections of drugs and diseases. DisDrugPred achieves better prediction performance than several state-of-the-art methods for drug–disease association prediction. During validation, DisDrugPred also retrieves more actual drug–disease associations in the top part of the prediction results, which often attracts more attention from biologists. Moreover, case studies on five drugs further confirm DisDrugPred’s ability to discover potential candidate disease indications for drugs.
The fourth type of drug similarity and the predicted candidates for all the drugs are available at https://github.com/pingxuan-hlju/DisDrugPred.
Supplementary data are available at Bioinformatics online.
1 Introduction
Developing a new drug is a lengthy, complex and expensive process which generally takes 10–15 years and 0.8–1.5 billion dollars (Dickson and Gagnon, 2004; Pushpakom et al., 2018; Tamimi and Ellis, 2009). Drug repositioning aims to identify novel therapeutic effects for drugs that have already been approved by the regulatory agencies (Lotfi Shahreza et al., 2018; Padhy and Gupta, 2011). Approved drugs have known and well-characterized bioavailability, safety and pharmacology, which can significantly accelerate drug development. Compared to developing a drug de novo, drug repositioning may reduce the development period to 6.5 years and the cost to about $300 million (Nosengo, 2016; Pritchard et al., 2017).
Computational prediction of novel therapeutic indications for approved drugs can screen candidate drug–disease associations for further experimental validation (Chen et al., 2016; Hurle et al., 2013; Li et al., 2015). Previous work can be roughly grouped into two categories. Since drugs execute their functions by targeting related genes (Bleakley and Yamanishi, 2009; Fakhraei et al., 2014; Yamanishi et al., 2008), a drug and a disease that are associated with each other are usually related to some common genes. Furthermore, the more genes they have in common, the more likely they are to be associated. Thus, methods of the first category infer the association propensity of a drug and a disease based on their related genes or gene expressions (Sirota et al., 2011; Wang et al., 2014a). Similarly, the association propensity can also be estimated according to the protein complexes shared by the drug and disease (Yu et al., 2015) and their common perturbed genes (Peyvandipour et al., 2018). However, these methods cannot be applied to drugs and diseases that share no interacting genes or proteins.
The second category takes advantage of various data, including the similarities of drugs, diseases and targets, as well as the interactions and associations between drugs, targets and diseases, for drug repositioning. The similarities of drugs and diseases are integrated by a kernel function to predict drug–disease associations (Wang et al., 2013). Several methods infer candidate drug indications by information flow or random walks on a heterogeneous network composed of drugs, targets and diseases (Liu et al., 2016; Luo et al., 2016, 2018; Wang et al., 2014b). Other methods exploit the data of drugs and diseases and predict novel drug uses by a logistic regression model, a statistical model, sparse subspace learning or similarity constrained matrix factorization (Gottlieb et al., 2011; Iwata et al., 2015; Liang et al., 2017; Zhang et al., 2018a). In addition, recent research indicates that, besides proteins, microRNAs and lncRNAs may also serve as drug targets (Chen et al., 2015, 2018a; Qu et al., 2018). Responses are another important attribute of drugs (Liu et al., 2018; Zhang et al., 2018b). Therefore, the microRNAs, lncRNAs and responses related to drugs are potential additional information for drug–disease association prediction. However, there are not yet enough experimentally verified microRNAs, lncRNAs and responses to accurately predict drug-related diseases. Overall, integrating heterogeneous data from multiple sources is essential for exploring drug–disease associations. However, these previous methods ignore the prior knowledge of drugs and diseases and the biological characteristic of drug–disease associations.
In this article, we present DisDrugPred, a novel method for predicting candidate drug–disease associations. We first calculate a new type of drug similarity based on the diseases that are associated with the drugs. DisDrugPred then exploits the similarity, association and interaction data about drugs, diseases and the target proteins of drugs. DisDrugPred integrates not only the diverse prior knowledge of drugs and diseases but also the projections of drugs and diseases in a low-dimensional feature space. Integrating the prior knowledge about when two drugs (diseases) are more similar captures the relationships between the drug–disease associations and the similarities of drugs (diseases) from a biological perspective. Projecting the drugs and diseases into a common low-dimensional feature space makes it possible to measure the distances between them. These distances between drugs and diseases are also closely related to their association possibilities. Hence a unified model is constructed, and an iterative optimization algorithm is developed for solving the model to obtain the association possibilities of drugs and diseases. The experimental results based on cross-validation show that DisDrugPred significantly outperforms several state-of-the-art prediction methods. In particular, when focusing on the top part of the prediction results, DisDrugPred retrieves more actual drug–disease associations. Case studies on five drugs further confirm that DisDrugPred is able to discover potential disease indications of drugs.
2 Materials and methods
Our goal is to predict the potential therapeutic indications, i.e. the candidate diseases, for a given drug of interest. We first calculate a new type of similarity between drugs to exploit the information of their associated diseases. A novel prediction model based on non-negative matrix factorization (Lee and Seung, 2001) is then proposed, which integrates the multisource data about drugs and diseases. The drug–disease association scores are obtained by solving the model with an iterative algorithm. A greater association score of drug ri and disease dj means that ri is more likely to be associated with dj.
2.1 Datasets for drug indication prediction
The associations between drugs and diseases, the chemical substructure profiles of drugs, the domain profiles of target proteins of drugs, the target annotation profiles of drugs and the disease semantic similarities are obtained from previous work on prediction of drug–disease associations (Liang et al., 2017). The 3051 drug–disease associations were originally extracted from the Unified Medical Language System (Bodenreider, 2004), and they cover the treatment relationships between 763 drugs and 681 diseases. The chemical substructure profiles of drugs are constructed from the chemical fingerprints extracted from the PubChem database (Kim et al., 2015). The domains of drug-related proteins and the gene ontology annotations of these proteins are obtained from the InterPro (Mitchell et al., 2015) and UniProt (Consortium, 2018) databases, respectively. We extract the side effect indications of drugs from the SIDER database (Kuhn et al., 2016); 571 of the 763 drugs have side effect indications. The disease similarities that incorporate the disease ontology (DO) and the disease-related genes are extracted from the DincRNA database (Cheng et al., 2018); 386 of the 681 diseases have this kind of disease similarity. The disease names come from the US National Library of Medicine (MeSH, http://www.ncbi.nlm.nih.gov/mesh).
2.2 Calculation and representation of multisource data
2.2.1 Five types of drug similarities
As two drugs, such as ra and rb, with more common chemical substructures are usually more similar, the previous work LRSSL (Liang et al., 2017) calculates the cosine similarity of their chemical substructure vectors as the first type of similarity of ra and rb (Fig. 1a). Moreover, drugs that share more target protein domains, or that interact with more target proteins with similar functions, often have relatively higher similarity (Ding et al., 2013; Perlman et al., 2011). Hence LRSSL also calculates the second and third types of drug similarities with the cosine similarity measure.
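The cosine similarity used for these profile-based drug similarities can be illustrated with a minimal sketch. The two binary profiles below are hypothetical toy vectors, not data from LRSSL or PubChem, and the function name is chosen only for illustration.

```python
import math

def cosine_similarity(u, v):
    """Cosine similarity of two equal-length numeric vectors (0.0 if either is all zeros)."""
    if len(u) != len(v):
        raise ValueError("profiles must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # convention assumed here for drugs with an empty profile
    return dot / (norm_u * norm_v)

# Hypothetical binary substructure profiles (1 = substructure present).
profile_ra = [1, 1, 0, 1, 0]
profile_rb = [1, 0, 0, 1, 1]
similarity = cosine_similarity(profile_ra, profile_rb)  # 2 shared / (sqrt(3) * sqrt(3))
```

For binary profiles the numerator is simply the number of shared features, so the score grows with the overlap and stays between 0 and 1.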
The disease semantic similarities are calculated using Wang’s method (Wang et al., 2010). The method constructs a directed acyclic graph (DAG) for a disease that contains all of the semantic terms related to the disease, such as the DAG of Breast Neoplasms in Figure 1(b). The similarity of two diseases is calculated based on their DAGs: the more terms their DAGs have in common, the more similar the two diseases are. The values of disease semantic similarity range between 0 and 1. Note that, as only the disease semantic similarities cover all the diseases related to the drugs of interest, the fourth type of drug similarity is calculated from these semantic similarities alone.
In addition, drugs sharing more similar side effects tend to interact with common target proteins and further have more similar functions (Gottlieb et al., 2012; Sridhar et al., 2016; Zitnik et al., 2018). Thus, the fifth type of drug similarity is measured by the cosine similarity based on the side effects related to the drugs. All five types of drug similarities are represented by the matrices R1, R2, R3, R4 and R5 (each of size Nr × Nr, where Nr is the number of drugs), where (Rt)ij is the tth type of similarity of drugs ri and rj, and mr is the number of the drug similarity types.
2.2.2 Representation of disease similarities
First, the semantic similarity of two diseases quantifies how similar the disease terms related to them are. Two diseases are generally more similar when they have more common terms. Wang et al. have calculated the disease semantic similarities (Wang et al., 2010), and these similarities are widely used by previous work on drug–disease association prediction (Liang et al., 2017; Zhang et al., 2018a). Our method also exploits these disease similarities, whose values range between 0 and 1. Second, the DO has been developed as a formal ontology for human disease, and it aims to provide an etiological-based disease classification (Schriml et al., 2018). In addition, the functional similarity of two diseases may be measured by their related genes. Cheng et al. integrated the DO and disease-related genes to obtain another type of disease similarity (Cheng et al., 2014). In our study, the two types of disease similarities are denoted as the matrices D1 and D2 (each of size Nd × Nd), where (Ds)ij is the sth type of similarity of diseases di and dj, Nd is the number of diseases and md is the number of the disease similarity types (Fig. 1b).
2.2.3 Representation of the drug–disease associations
As shown in Figure 1(c), the drug–disease bipartite graph is formed by the known associations between drugs and diseases. According to the graph, a matrix A (of size Nd × Nr) is constructed to represent the associations between Nd diseases and Nr drugs, where Aij is 1 if disease di was observed to be associated with drug rj and 0 otherwise.
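As a small illustration of this representation, the sketch below builds a binary disease-by-drug matrix from a list of known pairs. The identifiers `d1`, `r1`, etc. and the helper name are hypothetical placeholders, not entries from the dataset.

```python
def build_association_matrix(diseases, drugs, known_pairs):
    """Return an Nd x Nr list-of-lists with 1 where (disease, drug) is a known association."""
    disease_index = {d: i for i, d in enumerate(diseases)}
    drug_index = {r: j for j, r in enumerate(drugs)}
    A = [[0] * len(drugs) for _ in diseases]
    for disease, drug in known_pairs:
        A[disease_index[disease]][drug_index[drug]] = 1
    return A

# Hypothetical toy example: 3 diseases (rows) and 2 drugs (columns).
diseases = ["d1", "d2", "d3"]
drugs = ["r1", "r2"]
known_pairs = [("d1", "r1"), ("d3", "r1"), ("d2", "r2")]
A = build_association_matrix(diseases, drugs, known_pairs)
```

Rows index diseases and columns index drugs, matching the convention that Aij refers to disease di and drug rj.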
2.3 Drug–disease association prediction model
2.3.1 Modeling the drug–disease association relationships
2.3.2 Modeling the projections of diseases and drugs
2.3.3 Modeling the prior knowledge of disease similarities
2.3.4 Modeling the prior knowledge of drug similarities
2.3.5 Modeling smoothness prior
2.3.6 Modeling the biological characteristic of associations
2.3.7 Introducing regularization term for preventing overfitting
2.4 Optimization
As the objective function (10) with the variables P, Ws and Ht is not convex, it is impractical to get its global minimum. We present an algorithm to find its local minimum by separating the optimization problem into several subproblems and then optimizing them iteratively.
The convergence curve of the objective function confirms that it converges to a local minimum (Fig. 2). The objective is minimized by iteratively applying the updating rules of P, Ws and Ht. The iterative process stops when the absolute difference between the objective values at two consecutive iterations is less than a preset threshold or when the maximum number of iterations, 100, is reached. Finally, Pij is regarded as the estimated association score between disease di and drug rj (Fig. 3).
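The stopping rule can be sketched as follows. This is not DisDrugPred's algorithm: the updating rules for P, Ws and Ht are not reproduced in this text, so the sketch substitutes the plain multiplicative updates of Lee and Seung (2001) for a single factorization A ≈ WH. The tolerance `tol=1e-6`, the rank and the toy matrix are assumptions made only for illustration.

```python
import random

def matmul(X, Y):
    """Product of two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]

def transpose(X):
    return [list(col) for col in zip(*X)]

def frobenius_loss(A, W, H):
    """Squared Frobenius norm of A - WH."""
    WH = matmul(W, H)
    return sum((a - b) ** 2 for ra, rb in zip(A, WH) for a, b in zip(ra, rb))

def factorize(A, rank, tol=1e-6, max_iter=100, seed=0, eps=1e-12):
    """Alternate multiplicative updates until the loss change is below tol
    or max_iter iterations have been run (stand-in for the paper's update rules)."""
    rng = random.Random(seed)
    n, m = len(A), len(A[0])
    W = [[rng.random() + 0.1 for _ in range(rank)] for _ in range(n)]
    H = [[rng.random() + 0.1 for _ in range(m)] for _ in range(rank)]
    history = [frobenius_loss(A, W, H)]
    for _ in range(max_iter):
        # H <- H * (W^T A) / (W^T W H), element-wise
        Wt = transpose(W)
        num, den = matmul(Wt, A), matmul(matmul(Wt, W), H)
        H = [[h * p / (q + eps) for h, p, q in zip(hr, nr, dr)]
             for hr, nr, dr in zip(H, num, den)]
        # W <- W * (A H^T) / (W H H^T), element-wise
        Ht = transpose(H)
        num, den = matmul(A, Ht), matmul(W, matmul(H, Ht))
        W = [[w * p / (q + eps) for w, p, q in zip(wr, nr, dr)]
             for wr, nr, dr in zip(W, num, den)]
        history.append(frobenius_loss(A, W, H))
        if abs(history[-2] - history[-1]) < tol:
            break  # objective changed by less than the threshold
    return W, H, history

# Hypothetical toy association matrix (3 diseases x 4 drugs).
A_demo = [[1, 0, 1, 0], [0, 1, 0, 1], [1, 0, 1, 1]]
W, H, history = factorize(A_demo, rank=2)
scores = matmul(W, H)  # reconstructed scores, analogous in role to P
```

Because the problem is not convex, a loop of this kind only reaches a local minimum, and the result depends on the random initialization.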
3 Experimental evaluations and discussions
3.1 Evaluation metrics
We perform 5-fold cross-validation to evaluate the performance of a method in predicting drug–disease associations. All known drug–disease associations are randomly divided into five equal subsets, four of which are used for training a prediction model, while the remaining subset is used for evaluation. The associations in the remaining subset are added to the testing set and regarded as positive samples. The testing set also contains all the unobserved drug–disease associations, which are regarded as negative samples. In the ranking list of associations, the higher the positive samples are ranked, the better the prediction performance is. Note that, because the association dataset is split into five folds for cross-validation, the fourth type of drug similarity is recomputed in each fold using only the drug–disease associations available for training.
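The fold construction described above can be sketched as below, under the assumption that the associations are held in a binary matrix. The function name, the seed and the toy matrix are illustrative only, and the recomputation of the fourth drug similarity on each training matrix is left out.

```python
import random

def five_fold_splits(A, n_folds=5, seed=0):
    """Yield (training matrix, held-out positives, negatives) for each fold."""
    positives = [(i, j) for i, row in enumerate(A) for j, v in enumerate(row) if v == 1]
    negatives = [(i, j) for i, row in enumerate(A) for j, v in enumerate(row) if v == 0]
    rng = random.Random(seed)
    rng.shuffle(positives)
    # Deal the shuffled positives into n_folds subsets of (nearly) equal size.
    folds = [positives[k::n_folds] for k in range(n_folds)]
    for held_out in folds:
        train = [list(row) for row in A]
        for i, j in held_out:
            train[i][j] = 0  # hide the held-out associations from training
        # Test set = held-out positives plus all unobserved pairs as negatives.
        yield train, held_out, negatives

# Hypothetical toy matrix with 10 known associations.
A_toy = [[1, 0, 1, 0, 1],
         [0, 1, 0, 1, 0],
         [1, 1, 0, 0, 1],
         [0, 0, 1, 1, 0]]
splits = list(five_fold_splits(A_toy))
```

Any similarity derived from the associations should be computed from `train` rather than from the full matrix, so that held-out positives do not leak into training.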
Precision is the proportion of correctly identified positive samples among the retrieved samples, and recall is the same as the true positive rate (TPR). We also evaluate the performance of association prediction by using the precision-recall (PR) curve and the area under the PR curve (AUPR). For 5-fold cross-validation, the final performance is obtained by averaging CV: a separate performance value (AUC or AUPR) is obtained for each of the five folds when it is used as the test set, and the five values are averaged to give the final performance (Pahikkala et al., 2015).
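Averaging CV can be illustrated with a short sketch. Here AUC is computed through its standard pairwise-ranking interpretation (the fraction of positive-negative pairs in which the positive scores higher, with ties counted as one half), which is an assumption about implementation rather than the authors' code. The per-fold scores and labels are hypothetical.

```python
def auc(scores, labels):
    """AUC as the probability that a positive is scored above a negative (ties count 0.5)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative sample")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def averaging_cv(per_fold):
    """Compute the metric separately on each fold's test set, then average."""
    values = [auc(scores, labels) for scores, labels in per_fold]
    return sum(values) / len(values)

# Hypothetical predictions for two folds: (scores, labels) per fold.
fold_results = [
    ([0.9, 0.8, 0.3, 0.1], [1, 0, 1, 0]),  # AUC = 3/4
    ([0.7, 0.6, 0.2], [1, 0, 0]),          # AUC = 1
]
mean_auc = averaging_cv(fold_results)
```

The alternative, pooling the predictions of all folds before computing one curve, can give a different value, which is why the averaging scheme is stated explicitly.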
Considering the candidates in the top part of ranking list are usually selected by the biologists to further validate with wet-lab experiments, it is better to make the top part contain more positive samples. We thus calculate the recall rate within top part, which is the proportion of positive samples identified correctly in the top k list among the total positive ones, as another evaluation metric.
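A minimal sketch of the recall rate within the top k, using hypothetical scores and labels for one drug:

```python
def recall_at_top_k(scores, labels, k):
    """Fraction of all positive samples that appear among the k highest-scored candidates."""
    total_positives = sum(labels)
    if total_positives == 0:
        raise ValueError("no positive samples to retrieve")
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    hits = sum(label for _, label in ranked[:k])
    return hits / total_positives

# Hypothetical candidate list: 3 positives among 6 candidates.
scores = [0.9, 0.2, 0.8, 0.4, 0.7, 0.1]
labels = [1, 0, 0, 1, 1, 0]
recall_top3 = recall_at_top_k(scores, labels, 3)  # 2 of 3 positives retrieved
recall_top4 = recall_at_top_k(scores, labels, 4)  # all 3 positives retrieved
```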
3.2 Comparison with other methods
To evaluate the performance of the presented method, DisDrugPred, we compare it with several state-of-the-art methods for drug–disease association prediction: TL_HGBI (Wang et al., 2014b), MBiRW (Luo et al., 2016), LRSSL (Liang et al., 2017) and SCMFDD (Zhang et al., 2018a). We describe these methods in more detail below:
Furthermore, the drug and disease similarities are introduced as constraints for learning the drug and disease features, respectively. The output is the estimated association score between the ith disease and the jth drug.
DisDrugPred’s hyperparameters, α1–α6, need to be tuned, and their values are selected from {0.05, 0.1, 0.2, 0.5, 1, 5, 10, 20, 50}. Using grid search, DisDrugPred yields the best performance when α1 = 10, α2 = 10, α3 = 0.1, α4 = 10, α5 = 10 and α6 = 10. To make fair comparisons, the hyperparameters of the other methods are set to the optimal values suggested in their original publications (i.e. α = 0.4 and β = 0.3 for TL_HGBI, α = 0.3, l = 2 and r = 2 for MBiRW, μ = 0.01, λ = 0.01, γ = 2 and k = 10 for LRSSL, k = 45%, μ = 1 and λ = 4 for SCMFDD). In addition, the sensitivity coefficients (SC, van Riel, 2006) of DisDrugPred’s six parameters are evaluated by changing one parameter at a time while fixing the remaining ones. The SC values of α1–α6 are 5.23e-04, 0.0148, 0.0121, 0.0032, 0.0191 and 7.38e-05, respectively. Hence DisDrugPred is not sensitive to the perturbation of α1, α4 and α6, while α2, α3 and α5 have relatively greater impacts on DisDrugPred.
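A grid search over this candidate set can be sketched as follows. The `evaluate` callback is a hypothetical stand-in for training the model and returning a cross-validated score, and the toy objective below uses only two parameters to keep the example small.

```python
from itertools import product

# Candidate values listed in the text for each hyperparameter.
CANDIDATES = [0.05, 0.1, 0.2, 0.5, 1, 5, 10, 20, 50]

def grid_search(evaluate, n_params=6, candidates=CANDIDATES):
    """Try every combination of candidate values and keep the best-scoring one."""
    best_score, best_params = float("-inf"), None
    for params in product(candidates, repeat=n_params):
        score = evaluate(params)
        if score > best_score:
            best_score, best_params = score, params
    return best_params, best_score

# Hypothetical toy objective with two parameters, maximal at (10, 0.1).
def toy_evaluate(params):
    return -((params[0] - 10) ** 2 + (params[1] - 0.1) ** 2)

best_params, best_score = grid_search(toy_evaluate, n_params=2)
```

An exhaustive grid over all six parameters would contain 9^6 = 531,441 combinations, each requiring a full cross-validation run, so in practice such a search is often restricted or parallelized.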
As AUC and AUPR are better metrics for comparing learning algorithms with probability estimations (Ling et al., 2003; Saito and Rehmsmeier, 2015), we use them to evaluate DisDrugPred and the other methods. The ROC curves and their corresponding AUCs obtained by the different approaches are given in Figure 4(a). DisDrugPred_with_R4 and LRSSL_with_R4 are the instances of DisDrugPred and LRSSL which exploit four types of drug similarities, i.e. R1, R2, R3 and R4. DisDrugPred_with_R4 achieves the highest average AUC over all of the 763 drugs (AUC = 0.922). It outperforms TL_HGBI by 19.5%, MBiRW by 7.1%, LRSSL_with_R4 by 7% and SCMFDD by 28.4%. As shown in Figure 4(b), DisDrugPred_with_R4 also produces the highest average AUPR on the 763 drugs (AUPR = 0.143). Its AUPR is 11.3%, 9.9%, 3.6% and 13.7% better than that of TL_HGBI, MBiRW, LRSSL_with_R4 and SCMFDD, respectively. LRSSL_with_R4 yields the second best performance. Its AUC is slightly better than that of MBiRW, while its AUPR is 6.3% higher. SCMFDD did not perform as well as the other methods, as it is very sensitive to the disease and drug similarities. DisDrugPred_with_R4 and LRSSL_with_R4 utilize multiple types of drug similarities, while the other methods focus on only one type of drug similarity. These two methods perform better than the other methods, which indicates that integrating more types of drug similarities is essential for improving the prediction accuracy.
In addition, instances of DisDrugPred and LRSSL are constructed by using R1, R2, R3, R4 and the fifth type of drug similarity based on side effects (R5), and they are referred to as DisDrugPred_side_effect and LRSSL_side_effect. The former still performs better than the latter in terms of both AUC and AUPR, which confirms the superiority of DisDrugPred’s algorithm. Since only 571 of the 763 drugs have side effect data, the subsequent analysis still concentrates on DisDrugPred_with_R4 and LRSSL_with_R4, which cover all 763 drugs.
For all the prediction results on the 763 drugs, we perform a Wilcoxon test to evaluate whether DisDrugPred’s performance is significantly better than that of the other methods. The statistical results (Table 1) indicate that DisDrugPred yields significantly better performance under the P-value threshold of 0.05 in terms of both AUCs and AUPRs.
P-value between DisDrugPred and another method | TL_HGBI | MBiRW | LRSSL | SCMFDD |
---|---|---|---|---|
P-value of ROC curve | 7.2981e-140 | 4.2955e-55 | 2.4715e-11 | 3.1511e-297 |
P-value of PR curve | 2.1728e-41 | 1.6194e-15 | 2.5977e-10 | 8.9884e-229 |
The higher the recall rate on the top k ranked potential drug–disease associations is, the more real associations are identified successfully. DisDrugPred performs better than the other methods at various k cutoffs (Fig. 5): it ranks 70.5% of the positive samples in the top 30, 84.5% in the top 90 and 89.4% in the top 150. Although the AUC of LRSSL is very close to that of MBiRW, all of the recall rates of LRSSL are higher than those of MBiRW. The former ranks 61.8%, 76.2% and 80.7% in the top 30, 90 and 150, respectively, and the latter ranks 47%, 68.3% and 76.8%. TL_HGBI is not as good as MBiRW, ranking 37%, 51.7% and 61.8% in the top 30, 90 and 150. SCMFDD again did not perform as well as the other methods, with corresponding recall rates of 12.5%, 28.4% and 40.8%.
3.3 Importance of the drug similarities, the disease similarities and DisDrugPred’s algorithm
To validate the importance of incorporating the fourth type of drug similarity (R4), two instances of DisDrugPred, DisDrugPred_with_R4 and DisDrugPred_without_R4, are constructed. The former is trained with R4, while the latter is trained without R4. Since LRSSL can be extended to exploit more types of drug similarities, we also construct two instances of LRSSL, LRSSL_with_R4 and LRSSL_without_R4, which are trained with and without R4, respectively.
First, the instances of DisDrugPred and LRSSL with R4 perform better than the corresponding instances without R4 (Supplementary Fig. S1). DisDrugPred_with_R4’s AUC and AUPR are 1.9% and 1.9% higher than those of DisDrugPred_without_R4. LRSSL_with_R4’s AUC and AUPR also increase by 5.1% and 1.9% compared with LRSSL_without_R4. This shows the importance of incorporating the drug similarity R4 for improving prediction performance.
Second, the instances of DisDrugPred perform better than the instances of LRSSL whether or not their models are trained with R4 (Supplementary Fig. S1). DisDrugPred_with_R4 achieves 7% and 3.6% higher AUC and AUPR than LRSSL_with_R4. DisDrugPred_without_R4’s AUC and AUPR also increase by 10.2% and 3.6% compared with LRSSL_without_R4. This confirms that the algorithm of DisDrugPred also helps improve prediction performance.
In addition, to evaluate the effect of exploiting multiple types of disease similarities, an instance of DisDrugPred is constructed by using the first and second types of disease similarities (D1 and D2), and is referred to as DisDrugPred_with_D2. Another instance is trained without D2, and is named DisDrugPred_without_D2. Since the other methods exploit only D1 and cannot use both D1 and D2, we only estimate DisDrugPred’s performance. As shown in Supplementary Figure S1, DisDrugPred_with_D2’s AUPR is 0.3% higher than that of DisDrugPred_without_D2, while the two instances have equal AUCs. This indicates that D2 has a slight effect on the prediction performance.
3.4 Case studies on five drugs
To further demonstrate DisDrugPred’s ability to discover the potential drug–disease associations, case studies on five drugs, ciprofloxacin, clonidine, ampicillin, etoposide and cefotaxime, are conducted. For each of these five drugs, the candidate drug–disease associations are prioritized by their association scores, and the top 10 candidates are collected, 50 candidates in total (Table 2).
Drug | Rank | Disease name | Description | Rank | Disease name | Description |
---|---|---|---|---|---|---|
Ciprofloxacin | 1 | Gram-negative bacterial infections | CTD | 6 | Pneumonia, bacterial | CTD |
Ciprofloxacin | 2 | Streptococcal infections | DrugBank | 7 | Soft tissue infections | CTD |
Ciprofloxacin | 3 | Bacterial infections | CTD | 8 | Serratia infections | PubChem |
Ciprofloxacin | 4 | Enterobacteriaceae infections | CTD | 9 | Chlamydia infections | CTD |
Ciprofloxacin | 5 | Salmonella infections | CTD | 10 | Helicobacter infections | CTD |
Clonidine | 1 | Pain | CTD | 6 | Sleep disorders | ClinicalTrials |
Clonidine | 2 | Neurologic manifestations | Unconfirmed | 7 | Nausea | CTD |
Clonidine | 3 | Depressive disorder | CTD | 8 | Edema | CTD |
Clonidine | 4 | Vomiting | CTD | 9 | Facial pain | Literature (Yoon et al., 2015) |
Clonidine | 5 | Muscle cramp | PubChem | 10 | Muscle rigidity | Unconfirmed |
Ampicillin | 1 | Streptococcal infections | CTD | 6 | Septicemia | DrugBank, repoDB |
Ampicillin | 2 | Proteus infections | CTD | 7 | Gram-positive bacterial infections | CTD |
Ampicillin | 3 | Bacterial infections | CTD | 8 | Enterobacteriaceae infections | DrugBank |
Ampicillin | 4 | Pneumonia, bacterial | CTD, ClinicalTrials | 9 | Wound infection | CTD |
Ampicillin | 5 | Gram-negative bacterial infections | CTD | 10 | Staphylococcal skin infections | DrugBank |
Etoposide | 1 | Breast | CTD | 6 | Lymphoma | CTD |
Etoposide | 2 | Sarcoma | CTD | 7 | Urinary tract infections | Inferred candidate by 1 literature |
Etoposide | 3 | Leukemia | DrugBank | 8 | Ovarian neoplasms | Literature (Bozkaya, 2017) |
Etoposide | 4 | Hodgkin disease | CTD | 9 | Melanoma | DrugBank |
Etoposide | 5 | Lymphoma, Non-Hodgkin | CTD | 10 | Head and neck neoplasms | DrugBank |
Cefotaxime | 1 | Bacterial infections | CTD, ClinicalTrials | 6 | Gram-positive bacterial infections | CTD, DrugBank |
Cefotaxime | 2 | Enterobacteriaceae infections | DrugBank | 7 | Helicobacter infections | Literature (van der Voort et al., 2000) |
Cefotaxime | 3 | Gram-negative bacterial infections | CTD, DrugBank | 8 | Eye infections, bacterial | Literature (Kramann et al., 2001) |
Cefotaxime | 4 | Pseudomonas infections | DrugBank | 9 | Staphylococcal skin infections | DrugBank |
Cefotaxime | 5 | Respiratory tract infections | CTD, ClinicalTrials | 10 | Septicemia | DrugBank |
Note: (1) ‘CTD’ means that the drug–disease association is included in the CTD and has been curated manually. (2) ‘ClinicalTrials’ means that the drug–disease association has been recorded in the online database ClinicalTrials.gov. (3) ‘DrugBank’ means that the drug–disease association is contained in the DrugBank database, which captures drug trial information. (4) ‘repoDB’ means that the drug–disease association is included in the repoDB database, which records approved and failed drugs and their indications. (5) ‘PubChem’ means that the PubChem database has recorded toxicological information about the drug and disease. (6) ‘Literature’ means that published literature supports the drug–disease association. (7) ‘Inferred candidate’ means that the drug–disease association is a potential one inferred from the literature and included in the CTD. (8) ‘Unconfirmed’ means that there is no evidence to confirm the drug–disease association.
| Drug name | Rank | Disease name | Description | Rank | Disease name | Description |
|---|---|---|---|---|---|---|
| Ciprofloxacin | 1 | Gram-negative bacterial infections | CTD | 6 | Pneumonia, bacterial | CTD |
| | 2 | Streptococcal infections | DrugBank | 7 | Soft tissue infections | CTD |
| | 3 | Bacterial infections | CTD | 8 | Serratia infections | PubChem |
| | 4 | Enterobacteriaceae infections | CTD | 9 | Chlamydia infections | CTD |
| | 5 | Salmonella infections | CTD | 10 | Helicobacter infections | CTD |
| Clonidine | 1 | Pain | CTD | 6 | Sleep disorders | ClinicalTrials |
| | 2 | Neurologic manifestations | Unconfirmed | 7 | Nausea | CTD |
| | 3 | Depressive disorder | CTD | 8 | Edema | CTD |
| | 4 | Vomiting | CTD | 9 | Facial pain | Literature (Yoon et al., 2015) |
| | 5 | Muscle cramp | PubChem | 10 | Muscle rigidity | Unconfirmed |
| Ampicillin | 1 | Streptococcal infections | CTD | 6 | Septicemia | DrugBank, repoDB |
| | 2 | Proteus infections | CTD | 7 | Gram-positive bacterial infections | CTD |
| | 3 | Bacterial infections | CTD | 8 | Enterobacteriaceae infections | DrugBank |
| | 4 | Pneumonia, bacterial | CTD, ClinicalTrials | 9 | Wound infection | CTD |
| | 5 | Gram-negative bacterial infections | CTD | 10 | Staphylococcal skin infections | DrugBank |
| Etoposide | 1 | Breast | CTD | 6 | Lymphoma | CTD |
| | 2 | Sarcoma | CTD | 7 | Urinary tract infections | Inferred candidate by 1 literature |
| | 3 | Leukemia | DrugBank | 8 | Ovarian neoplasms | Literature (Bozkaya, 2017) |
| | 4 | Hodgkin disease | CTD | 9 | Melanoma | DrugBank |
| | 5 | Lymphoma, Non-Hodgkin | CTD | 10 | Head and neck neoplasms | DrugBank |
| Cefotaxime | 1 | Bacterial infections | CTD, ClinicalTrials | 6 | Gram-positive bacterial infections | CTD, DrugBank |
| | 2 | Enterobacteriaceae infections | DrugBank | 7 | Helicobacter infections | Literature (van der Voort et al., 2000) |
| | 3 | Gram-negative bacterial infections | CTD, DrugBank | 8 | Eye infections, bacterial | Literature (Kramann et al., 2001) |
| | 4 | Pseudomonas infections | DrugBank | 9 | Staphylococcal skin infections | DrugBank |
| | 5 | Respiratory tract infections | CTD, ClinicalTrials | 10 | Septicemia | DrugBank |
First, the Comparative Toxicogenomics Database (CTD) provides key information about drugs and their effects on human diseases, manually curated from the published literature (Davis et al., 2016). DrugBank is a database that captures clinical trial information for drugs, including the drug and the disease for which each trial was conducted (Wishart et al., 2017). The repoDB database records approved and failed drugs and their respective indications (Brown and Patel, 2017). As shown in Table 2, 29 candidates are contained in CTD and supported by direct evidence, 13 candidates are included in DrugBank and 1 candidate is recorded in repoDB. This indicates that these candidate diseases are indeed associated with the corresponding drugs.
Next, ClinicalTrials.gov (https://clinicaltrials.gov/) is an online resource provided by the US National Library of Medicine that includes a large number of clinical trials on various drugs and the corresponding diseases. PubChem is an open chemistry database supported by the National Institutes of Health (https://pubchem.ncbi.nlm.nih.gov/) that provides information on chemical substances, including drugs, and their biological activities (Kim et al., 2015). Four candidates are included in ClinicalTrials.gov and two candidates are contained in PubChem, indicating that these drug–disease associations are supported by clinical trials (ClinicalTrials.gov) or by recorded toxicological information (PubChem). In addition, the four candidates labeled ‘Literature’ are supported by published studies in which the drugs are confirmed to have effects on the corresponding diseases.
Besides the manually curated drug–disease associations, CTD also contains potential associations inferred from the literature. One etoposide-related candidate disease, urinary tract infections, is contained in the inferred part of CTD, which suggests that etoposide is more likely to be associated with this disease. Of the 50 candidates in total, 2 are not confirmed by any observed evidence and are labeled ‘Unconfirmed’. Taken together, the case studies indicate that DisDrugPred is indeed capable of discovering potential candidate drug–disease associations.
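The evidence counts quoted above can be re-derived from the Description column of Table 2. The following is a minimal sketch rather than part of DisDrugPred: the dictionary `TABLE2` is a hand transcription of that column (ranks 1 to 10 per drug), and the names `TABLE2` and `tally` are illustrative. Labels such as ‘Literature (Yoon et al., 2015)’ and ‘Inferred candidate by 1 literature’ are shortened to their category.

```python
from collections import Counter

# Hand-transcribed evidence labels from Table 2, ranks 1-10 for each drug.
# Cells naming two sources (e.g. "CTD, ClinicalTrials") are kept as one string
# and split during counting.
TABLE2 = {
    "Ciprofloxacin": ["CTD", "DrugBank", "CTD", "CTD", "CTD",
                      "CTD", "CTD", "PubChem", "CTD", "CTD"],
    "Clonidine": ["CTD", "Unconfirmed", "CTD", "CTD", "PubChem",
                  "ClinicalTrials", "CTD", "CTD", "Literature", "Unconfirmed"],
    "Ampicillin": ["CTD", "CTD", "CTD", "CTD, ClinicalTrials", "CTD",
                   "DrugBank, repoDB", "CTD", "DrugBank", "CTD", "DrugBank"],
    "Etoposide": ["CTD", "CTD", "DrugBank", "CTD", "CTD",
                  "CTD", "Inferred candidate", "Literature", "DrugBank", "DrugBank"],
    "Cefotaxime": ["CTD, ClinicalTrials", "DrugBank", "CTD, DrugBank", "DrugBank",
                   "CTD, ClinicalTrials", "CTD, DrugBank", "Literature", "Literature",
                   "DrugBank", "DrugBank"],
}

def tally(table):
    """Count how many candidates carry each evidence label."""
    counts = Counter()
    for cells in table.values():
        for cell in cells:
            for source in cell.split(","):
                counts[source.strip()] += 1
    return counts

counts = tally(TABLE2)
total = sum(len(cells) for cells in TABLE2.values())
supported = total - counts["Unconfirmed"]

for source, n in counts.most_common():
    print(f"{source}: {n}")
print(f"{supported} of {total} candidates have at least one supporting source")
```

Because a candidate can appear in more than one database, the per-source counts sum to more than 50; only the ‘Unconfirmed’ count is subtracted to obtain the number of supported candidates.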
3.5 Prediction of novel drug–disease associations
After evaluating its prediction performance by cross-validation and case studies, we applied DisDrugPred to predict novel drug–disease associations. All known drug–disease associations were used to train DisDrugPred’s prediction model. The potential candidate associations were then obtained from the model and are listed in Supplementary Table S1. In addition, the fourth type of drug similarity, which is based on the diseases associated with the drugs, is shown in Supplementary Table S2.
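The train-then-rank workflow described here can be illustrated with a deliberately simplified sketch. It uses plain non-negative matrix factorization with the standard Lee-Seung multiplicative updates and omits the similarity, prior-knowledge and sparsity terms that DisDrugPred adds, so it is not the authors' model. The toy association matrix `A`, the rank, and all function names are assumptions made for illustration.

```python
import random

def matmul(X, Y):
    """Multiply two matrices stored as lists of rows."""
    cols = list(zip(*Y))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols] for row in X]

def transpose(X):
    return [list(col) for col in zip(*X)]

def sq_error(A, W, H):
    """Squared Frobenius error between A and its reconstruction W*H."""
    WH = matmul(W, H)
    return sum((a - p) ** 2 for ra, rp in zip(A, WH) for a, p in zip(ra, rp))

def init_factors(n, m, rank, seed=0):
    """Random strictly positive starting factors."""
    rng = random.Random(seed)
    W = [[rng.random() + 0.1 for _ in range(rank)] for _ in range(n)]
    H = [[rng.random() + 0.1 for _ in range(m)] for _ in range(rank)]
    return W, H

def nmf(A, W, H, iters=200, eps=1e-9):
    """Plain NMF (A ~ W*H) with Lee-Seung multiplicative updates."""
    for _ in range(iters):
        Wt = transpose(W)
        num, den = matmul(Wt, A), matmul(Wt, matmul(W, H))
        H = [[h * a / (b + eps) for h, a, b in zip(hr, nr, dr)]
             for hr, nr, dr in zip(H, num, den)]
        Ht = transpose(H)
        num, den = matmul(A, Ht), matmul(matmul(W, H), Ht)
        W = [[w * a / (b + eps) for w, a, b in zip(wr, nr, dr)]
             for wr, nr, dr in zip(W, num, den)]
    return W, H

def novel_candidates(A, scores, drug, top_k=3):
    """Rank diseases with no known association to `drug` by estimated score."""
    unknown = [j for j, a in enumerate(A[drug]) if a == 0]
    return sorted(unknown, key=lambda j: scores[drug][j], reverse=True)[:top_k]

# Toy matrix: rows are drugs, columns are diseases, 1 = known association.
A = [[1, 1, 0, 0],
     [1, 0, 0, 0],
     [0, 0, 1, 1],
     [0, 0, 1, 0]]

W0, H0 = init_factors(len(A), len(A[0]), rank=2)
W, H = nmf(A, W0, H0)      # train on all known associations
scores = matmul(W, H)      # estimated association scores
print(novel_candidates(A, scores, drug=1))
```

Only the entries that are zero in `A` are ranked, since the known associations were used for training and are not novel predictions.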
4 Conclusions
We have developed DisDrugPred, a method based on non-negative matrix factorization, for predicting potential drug–disease associations. After calculating the fourth type of drug similarity, DisDrugPred captures the various intra-relationships of drugs and diseases, i.e. the five types of drug similarities and two types of disease similarities. It also captures the inter-relationships between drugs and diseases, i.e. the known drug–disease associations. Moreover, the diverse prior knowledge and the projections of drugs and diseases are deeply integrated to enhance reasoning about drug–disease associations. The experimental results confirm that DisDrugPred’s algorithm also contributes to its superior performance. An iterative algorithm is developed to obtain the estimated drug–disease association scores, which can be used to rank the candidate diseases for each drug. In our experiments, DisDrugPred consistently outperforms the other methods tested here in terms of both AUC and AUPR. In particular, DisDrugPred is more useful for biologists because the top of its ranking list contains more real drug–disease associations. Case studies on five drugs demonstrate DisDrugPred’s ability to discover potential disease indications. DisDrugPred can serve as a prioritization tool that generates reliable candidates for subsequent identification of actual drug–disease associations through wet-lab experiments.
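For readers less familiar with the metrics mentioned above, the sketch below shows one common way to compute them for a single drug's ranked disease list. It is an illustration under stated assumptions, not the evaluation code used in this work: AUC is computed as the fraction of correctly ordered positive/negative pairs, AUPR is approximated by average precision, and the toy labels and scores are invented. The function names are hypothetical, and ties in the ranking are broken by input order.

```python
def auc(labels, scores):
    """Area under the ROC curve as the fraction of positive/negative pairs
    in which the positive is scored higher (ties count one half)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def average_precision(labels, scores):
    """Average precision, a common estimate of the area under the PR curve."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    hits, total = 0, 0.0
    for rank, i in enumerate(order, start=1):
        if labels[i] == 1:
            hits += 1
            total += hits / rank   # precision at this positive's rank
    return total / hits

def recall_at_k(labels, scores, k):
    """Share of all true associations found among the top-k candidates."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sum(labels[i] for i in order[:k]) / sum(labels)

# Toy example: 1 marks a true association, scores are model estimates.
labels = [1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.3, 0.1]

print(auc(labels, scores))                # 5 of 6 pairs ordered correctly
print(average_precision(labels, scores))  # mean of 1/1 and 2/3
print(recall_at_k(labels, scores, k=1))   # 1 of 2 positives in the top 1
```

Recall among the top-ranked candidates is the quantity most relevant to the remark about the top of the ranking list, because it measures how many true associations a biologist would encounter when inspecting only the first few predictions.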
Funding
The work was supported by the Natural Science Foundation of China [61702296, 61302139], the Heilongjiang Postdoctoral Scientific Research Starting Foundation [BHL-Q18104], the Natural Science Foundation of Heilongjiang Province [FLHPY2019329], the Fundamental Research Foundation of Universities in Heilongjiang Province for Technology Innovation [KJCX201805], the Fundamental Research Foundation of Universities in Heilongjiang Province for Youth Innovation Team [RCYJTD201805], the Young Innovative Talent Research Foundation of Harbin Science and Technology Bureau [2016RQQXJ135] and the Foundation of Graduate Innovative Research [YJSCX2018-140HLJU, YJSCX2018-047HLJU].
Conflict of Interest: none declared.
References