MDTips: a multimodal-data-based drug–target interaction prediction system fusing knowledge, gene expression profile, and structural data

Abstract
Motivation: Screening new drug–target interactions (DTIs) by traditional experimental methods is costly and time-consuming. Recent advances in knowledge graphs, chemical linear notations, and genomic data enable researchers to develop computation-based DTI models, which play a pivotal role in drug repurposing and discovery. However, a multimodal fusion DTI model that integrates the available heterogeneous data into a unified framework is still lacking.
Results: We developed MDTips, a multimodal-data-based DTI prediction system, by fusing knowledge graphs, gene expression profiles, and structural information of drugs/targets. MDTips yields accurate and robust DTI predictions. We found that multimodal fusion learning can fully exploit the importance of each modality and incorporate information from multiple aspects, thus improving model performance. Extensive experimental results demonstrate that deep learning-based encoders (i.e. Attentive FP and Transformer) outperform traditional chemical descriptors/fingerprints, and that MDTips outperforms other state-of-the-art prediction models. MDTips is designed to predict the candidate targets, side effects, and indications of input drugs using all available modalities. Via MDTips, we reverse-screened candidate targets of 6766 drugs, which can be used for drug repurposing and discovery.
Availability and implementation: https://github.com/XiaoqiongXia/MDTips and https://doi.org/10.5281/zenodo.7560544.


Introduction
Developing a new drug typically takes 10–15 years and costs about 0.8–1.5 billion dollars, and it faces a considerable risk of failure (Paul et al. 2010). To alleviate this dilemma, a novel drug development strategy, drug repurposing (Pushpakom et al. 2019), was proposed to discover new indications for existing drugs. Large-scale and reliable prediction of drug–target interactions (DTIs) would substantially facilitate drug repurposing and development.
With the development of AI techniques and the accumulation of large-scale biomedical data, deep learning (DL)-based DTI prediction models exhibit a great advantage over traditional computational methods such as molecular docking (Forli et al. 2016) and machine learning (Bagherian et al. 2021), thus, attracting increasing research attention to tackle the DTI prediction problem (Du et al. 2022). Generally, the DL-based models for DTI prediction can be classified into two types: single and multimodality models.
Single-modality models for DTI focus on learning representations of drugs and targets from a specific modality. The concatenated representations are then fed to a fully connected network to predict the DTI probability. Advanced DL techniques, such as convolutional neural networks (CNNs) (Öztürk et al. 2018), Transformer (Vaswani et al. 2017), and graph neural networks (GNNs) (Zhao et al. 2021), have motivated their application in DTI prediction models. They take the 1D sequential strings, 2D graphs or image grids (de Souza et al. 2022), or 3D structures (Yazdani-Jahromi et al. 2022) of drugs and targets as input and automatically learn representation vectors from these complex data types. In addition, several models are constructed from a genomics perspective, representing the features of drugs and targets with gene expression profiles (Xie et al. 2018, Shao et al. 2020, Zhong et al. 2022). Despite these promising developments, such models have several limitations. First, large amounts of known drug–target pairs (DTPs) are needed to train DL-based DTI models, while the labeled data volume is always limited. Second, they face cold-start problems, where model accuracy decreases when predicting the interactions of a novel drug without any known target information (Nguyen et al. 2022).
More recently, multimodality models for DTI prediction have progressed rapidly beyond single-modality models by integrating heterogeneous data into a unified framework. The heterogeneous data, including drug–target, drug–disease, and drug–drug (Zhu et al. 2019) interactions, form a complicated biological network where nodes are drugs and targets, edges are interactions, and node attributes are structural information and/or gene expression signatures of drugs and targets. Multimodal fusion, which has boosted the performance of many classical problems (e.g. visual question-answering) (Xue and Marculescu 2022), is employed to integrate heterogeneous information from networks and automatically extract features of drugs and targets to facilitate further DTI prediction (Nguyen et al. 2022). Many models integrate diverse entities (e.g. drug, target, and disease) and edge types in the heterogeneous network or knowledge graph via GNNs (Li et al. 2023), network-based methods (Tian et al. 2022), and knowledge graph embedding (KGE) methods (Li et al. 2022). Furthermore, to incorporate both the structural information of drugs and targets and the networks, these works design a protein sequence encoder and a drug structure encoder to extract initial features for targets and drugs, and then fuse structural information and networks via GCN-based interaction modules (Wang et al. 2022), a joint representation framework based on heterogeneous networks (Zhou et al. 2021), a neural factorization machine (Ye et al. 2021), or a shared unit (Ma et al. 2023). Such models can automatically learn representations of drugs and targets from structural information and the heterogeneous network in an end-to-end manner, thus outperforming single-modality models.
Despite these promising developments in multimodality models, two shortcomings remain: (i) these models fuse features from at most two modalities (structural information and the heterogeneous network), ignoring gene expression profiles, which capture biological responses and changes in cellular processes under diverse perturbations (e.g. compound treatment and gene knockout) (Tanoli et al. 2021); in addition, they lack a comprehensive evaluation of each modality's contribution; (ii) these models are tailored for predicting DTIs, ignoring other related drug information such as side effects and indications, which is essential for drug repurposing.
To address the above issues, we developed MDTips by integrating knowledge graphs, gene expression profiles, and structural information of drugs/targets. Specifically, for structural information, we use Attentive FP and Transformer to learn representation vectors from 2D drug molecular graphs and 1D target amino acid sequences (AAS), respectively. For gene expression signatures, we apply fully connected feed-forward networks (FC) to extract high-dimensional features of drugs and targets. For knowledge graphs, we employ ConvE (Dettmers et al. 2018) to learn embeddings of entities and relations. In this way, we can fuse all data modalities by concatenating the feature vectors and feeding the fused representation to an FC module to calculate the interaction probability. Furthermore, the framework enables a comprehensive evaluation of each modality. From an application perspective, MDTips can efficiently utilize the KG and all available modalities to predict an input drug/compound's potential targets, indications, and side effects.
To summarize, MDTips differs from previous multimodality models by (i) fusing three data modalities of drugs and targets, (ii) enabling comparison of each modality's contribution, and (iii) providing pretrained models for predicting multiple types of drug information.

Datasets
This study uses a comprehensive biological KG, the Drug Repurposing Knowledge Graph (DRKG) (Ioannidis 2020), which includes 97 238 entities belonging to 13 entity types and 5 874 261 triplets belonging to 107 edge types. We extract 88 439 DTPs from DRKG as the DTI task-related dataset DTP_KS. The remaining triplets with multiple relations (e.g. drug–drug, drug–disease, and gene–disease) form the supporting knowledge KG_support. Drugs' SMILES strings are collected from DrugBank (Wishart et al. 2018) and converted into canonical SMILES using the RDKit package. Targets' AAS are collected from UniProt. We extract 24 418 DTPs with all three modalities (i.e. KG, graph/sequence, and gene expression signature) from DTP_KS to obtain the subset DTP_KSE (Table 1). The gene expression signatures, which measure the consensus transcriptional response of 978 genes to perturbations, are downloaded from the consensus signatures for LINCS L1000 perturbations (Himmelstein and Baranzini 2016). All DTPs in DTP_KS and DTP_KSE are positive samples. Negative samples are generated by combining n drugs and m targets into n × m pairs and filtering out all positive samples. We under-sampled negative samples with a positive-to-negative ratio of 1:2. The positive and negative samples generated from DTP_KS/DTP_KSE are mixed to form the experimental datasets Dataset_KS/Dataset_KSE.
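The negative-sampling scheme above can be sketched as follows. This is a minimal illustration of the enumerate-filter-undersample procedure; the function and variable names are ours, not from the MDTips code.

```python
import random

def build_dataset(drugs, targets, positive_pairs, ratio=2, seed=0):
    """Enumerate all n x m drug-target pairs, drop the known positives,
    and under-sample the remainder at a positive:negative ratio of 1:ratio."""
    positives = set(positive_pairs)
    candidates = [(d, t) for d in drugs for t in targets if (d, t) not in positives]
    rng = random.Random(seed)
    negatives = rng.sample(candidates, min(ratio * len(positives), len(candidates)))
    # label positives 1 and sampled negatives 0
    return [(d, t, 1) for d, t in positives] + [(d, t, 0) for d, t in negatives]
```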

The workflow of MDTips
MDTips is a binary classification model. The input is a DTP (d, t), and the output is y ∈ {0, 1}, where y = 1 indicates an interaction between the input drug d and the target t. MDTips consists of two main components: (i) representation modules that learn the representations of d and t; and (ii) a fusion and decoder module that integrates the representations generated by each representation module and predicts the label of (d, t) (Fig. 1). MDTips considers all the modalities of drugs and targets in the existing datasets: Dataset_KS contains two modalities (i.e. KG and graph/sequence), and Dataset_KSE contains all three modalities (i.e. KG, graph/sequence, and expression signature). For DTPs in Dataset_KSE, we design three representation modules, K_representation (K), S_representation (S), and E_representation (E), which learn drug/target representations based on KGs, graphs/sequences, and gene expression signatures, respectively. We then fuse the learned drug/target representations and feed the fused embedding into a four-layer perceptron neural network. Finally, the DTI prediction score is computed. The evaluation metrics and experiment details are presented in Supplementary data.

Attentive FP for drug representation
Drugs are converted to molecular graphs, where nodes represent atoms and edges represent bonds. We construct vectors to represent the chemical features of nodes (e.g. atom type) and edges (e.g. conjugated bond) using the RDKit and DGL-LifeSci packages. Concretely, formal charge and radical electron number are encoded as integers, and other features are encoded as one-hot vectors. This yields vectors of length 39 for nodes and 11 for edges (Supplementary Tables S1 and S2). The molecular graphs with node and edge features are fed into Attentive FP to learn drug representations. Attentive FP introduces an attention mechanism into GNNs, allowing a target atom to focus on the most relevant "messages" from its neighborhood (Xiong et al. 2020). The atom representation is updated in the message passing and readout phases, formulated as:

  C_v^(k−1) = Σ_{u∈N(v)} M^(k−1)(h_u^(k−1), h_v^(k−1)),
  h_v^k = GRU^(k−1)(h_v^(k−1), C_v^(k−1)),

where v is the target atom and u is a neighbor atom of v. At the (k−1)-th layer, h_u^(k−1) and h_v^(k−1) are the representation vectors of u and v, and M^(k−1) is the message function, which uses the graph attention network (GAT) (Supplementary data) (Veličković et al. 2017). The current representation h_v^k is calculated by applying the update function, a gated recurrent unit GRU^(k−1) (Supplementary data), to the previous representation h_v^(k−1) and the attention context C_v^(k−1). After obtaining the updated node representations, we compute graph-level representations from the node features using the method introduced by Xiong et al. (2020).
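The attention-based neighborhood update can be illustrated with a toy NumPy sketch. This is a deliberately simplified stand-in for the learned GAT message function and GRU update (a single attention head and a plain gated blend instead of a full GRU cell); all parameter names are hypothetical.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def attentive_update(h_v, h_neighbors, W, a):
    """One simplified message-passing step in the spirit of Attentive FP:
    attention over projected neighbor messages, then a gated update.
    W (d x d) and a (2d,) are toy parameters; the real model learns them."""
    z_v = W @ h_v
    z_u = np.array([W @ h for h in h_neighbors])               # (n, d)
    # attention logits from concatenated (target, neighbor) projections
    logits = np.array([np.tanh(a @ np.concatenate([z_v, z])) for z in z_u])
    alpha = softmax(logits)                                    # attention weights
    context = (alpha[:, None] * z_u).sum(axis=0)               # attention context C_v
    # GRU-like gated blend of old state and new context
    gate = 1.0 / (1.0 + np.exp(-(z_v + context)))
    return gate * np.tanh(context) + (1 - gate) * h_v
```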

Transformer for target representation
The explainable substructure partition fingerprint (Huang et al. 2019) is used to decompose targets' AAS into a discrete set of moderate-sized substructures, with the predefined substructure vocabulary S = {s_1, s_2, …, s_4114}. Drawing inspiration from natural language processing, we use the Transformer model (Vaswani et al. 2017) for target representation learning.
For an input target T, we decompose its AAS into a sequence of substructures T = {t_1, t_2, …, t_l}, where t_i ∈ S. We then map T to an index vector Z = (z_1, z_2, …, z_l), where z_i is the index in S of substructure t_i. A padding operation unifies the lengths of the substructure sequences, with maximum length m = 545. To capture positional information, we define a positional vector P = (p_1, p_2, …, p_l), where p_i = i. We then generate the target substructure embedding E_seq ∈ R^(m×d) and positional embedding E_pos ∈ R^(m×d) using the word embedding method, where d = 64. The target representation E ∈ R^(m×d) is formulated as:

  E = E_seq + E_pos.

The embedding E is fed to a Transformer encoder comprising two sublayers: a multi-head self-attention layer (mAtten) (Supplementary data) and a fully connected feed-forward network (FC). First, mAtten computes a new target representation by considering the contextual information of the sequence. The updated representation is then fed into the FC, each sublayer being followed by dropout, layer normalization, and residual connection modules. The process is formulated as:

  E′ = LayerNorm(E + Dropout(mAtten(E))),
  E″ = LayerNorm(E′ + Dropout(FC(E′))).
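The encoding pipeline (substructure indices → padding → E = E_seq + E_pos) can be sketched as follows. The tiny vocabulary, random embedding tables, and dimensions here are illustrative only, not the real 4114-substructure vocabulary or the learned d = 64 embeddings.

```python
import numpy as np

# Toy vocabulary standing in for the 4114 predefined substructures (illustrative)
VOCAB = {"MKT": 0, "AY": 1, "IA": 2, "K": 3, "<pad>": 4}

def encode_target(substructures, max_len=8, d=6, seed=0):
    """Map a substructure sequence to indices, pad to max_len, and add
    token and positional embeddings, as in E = E_seq + E_pos."""
    idx = [VOCAB[s] for s in substructures][:max_len]
    idx += [VOCAB["<pad>"]] * (max_len - len(idx))
    rng = np.random.default_rng(seed)
    E_seq_table = rng.normal(size=(len(VOCAB), d))   # token embedding table
    E_pos_table = rng.normal(size=(max_len, d))      # positional embedding table
    return E_seq_table[idx] + E_pos_table            # shape (max_len, d)
```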

K_representation
The knowledge graph K is a multirelational graph comprising a set of triples (h, r, t), where h, r, and t denote the head entity, relation, and tail entity, respectively. We use the KGE model ConvE (Dettmers et al. 2018) to learn representations for all entities and relations in K. The embedding vectors of h, r, and t are randomly initialized as e_h ∈ R^k, r_r ∈ R^k, and e_t ∈ R^k, where k is the embedding dimension of entities and relations. The interaction score of (h, r, t) is then calculated with the scoring function ψ_r(e_h, e_t), defined as:

  ψ_r(e_h, e_t) = ReLU(vec(ReLU([ē_h; r̄_r] ∗ ω)) W) e_t,

where ReLU denotes rectified linear units, and ē_h ∈ R^(k_w×k_h) and r̄_r ∈ R^(k_w×k_h) are 2D reshapings of e_h and r_r, with k = k_w × k_h. The 2D convolutional layer with filters ω is applied to the concatenated matrix [ē_h; r̄_r]. Binary cross-entropy loss (BCELoss) is applied to update the initial embeddings and model parameters.
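A single-filter NumPy sketch of this scoring function follows; it uses a naive "valid" convolution, one filter, and no bias terms, so it illustrates the shapes and order of operations rather than the full ConvE implementation.

```python
import numpy as np

def conv2d_valid(x, w):
    """Naive 'valid' 2D cross-correlation (stand-in for the conv layer)."""
    H, W_ = x.shape
    kh, kw = w.shape
    out = np.empty((H - kh + 1, W_ - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = (x[i:i + kh, j:j + kw] * w).sum()
    return out

def conve_score(e_h, r_r, e_t, k_w, k_h, omega, W):
    """Simplified ConvE score: reshape e_h and r_r to 2D, stack them,
    convolve with omega, apply ReLU, project with W, and dot with e_t."""
    stacked = np.concatenate([e_h.reshape(k_w, k_h), r_r.reshape(k_w, k_h)], axis=0)
    feat = np.maximum(conv2d_valid(stacked, omega), 0).ravel()   # inner ReLU
    return float(np.maximum(feat @ W, 0) @ e_t)                  # outer ReLU, <., e_t>
```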

E_representation
For each DTP in DTP_KSE, the drug d is represented by its molecule-perturbation signature, and the target t is represented by its gene knockdown or overexpression perturbation signature, depending on whether drug d inhibits or activates target t. Specifically, the representations of drug d and target t are defined as:

  e_d = M_D[d, :],   e_t = M_XPR[t, :] if d inhibits t, or M_OE[t, :] if d activates t,

where M_D ∈ R^(n_D×978), M_OE ∈ R^(n_OE×978), and M_XPR ∈ R^(n_XPR×978) are consensus signature matrices, and D, OE, and XPR are the perturbation sets of drugs, gene overexpression, and gene knockdown, respectively.
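This modality is a signature lookup; a minimal sketch follows, with the signature matrices represented as name-keyed dictionaries of 978-gene vectors (dictionary form and all names are ours, for illustration).

```python
import numpy as np

def expression_features(drug, target, effect, M_D, M_XPR, M_OE):
    """Look up consensus-signature rows: the drug's compound-perturbation
    signature, and the knockdown (if the drug inhibits the target) or
    overexpression (if it activates it) signature of the target."""
    d_vec = M_D[drug]
    t_vec = M_XPR[target] if effect == "inhibit" else M_OE[target]
    return d_vec, t_vec
```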

Fusion and decoder module
The representations of drugs and targets are learned by the K, S, and E modules. We denote the learned representations of drug d_i as d_ik, d_is, and d_ie, and those of target t_j as t_jk, t_js, and t_je; score_ij is the interaction score of DTP_ij. We first map each modality representation of DTP_ij into a latent feature space through an FC with dropout and a ReLU activation, followed by layer normalization (Lei Ba et al. 2016). For example, the S representation of DTP_ij is formulated as:

  h_s = LayerNorm(ReLU(Dropout(W [d_is; t_js] + b))),

where h_s is the hidden state of DTP_ij's S representation, and W and b are parameters learned during training. The dropout rate is set to 0.1. The same operations are applied to the K and E representations. We then concatenate h_k, h_s, and h_e and feed the result into a four-layer perceptron neural network, whose three hidden layers contain 1024, 1024, and 512 units, respectively. We use the Sigmoid function in the final layer to map the output to [0, 1]. The BCELoss function is used to train the model by back-propagating the error and updating all model parameters in an end-to-end manner.
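The fusion-and-decoder step can be sketched as a toy forward pass. The real decoder uses learned weights, dropout, and hidden sizes 1024/1024/512; here the dimensions are tiny and the weights arbitrary, purely to show the concatenate → MLP → Sigmoid flow.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def fuse_and_score(h_k, h_s, h_e, weights):
    """Concatenate the per-modality DTP representations and run a small MLP
    (ReLU hidden layers, Sigmoid output) to obtain an interaction probability.
    `weights` is a list of (W, b) pairs, one per layer."""
    h = np.concatenate([h_k, h_s, h_e])
    for W, b in weights[:-1]:
        h = np.maximum(W @ h + b, 0)      # ReLU hidden layers
    W, b = weights[-1]
    return float(sigmoid(W @ h + b)[0])   # probability in [0, 1]
```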

Comparison with other S_representation modules
Our S_representation module includes Attentive FP for drug representation and Transformer for target representation. To explore the impact of different drug/target encoders on MDTips, we first compared Attentive FP with two molecular fingerprint-based methods (Morgan and Daylight) and two sequence-based methods (CNN and RNN) as drug encoders.
We then compared Transformer with two target descriptors, amino acid composition (AAC) up to 3-mers (Du et al. 2014) and conjoint triad features (Shen et al. 2007), and two sequence-based methods (CNN and RNN) as target encoders. We conducted all experiments on Dataset_KS. Attentive FP is a superior drug encoder compared to Daylight, Morgan, and RNN, and comparable to CNN (Fig. 2A). Its higher average AUPR value (the primary metric) and interpretability (Xiong et al. 2020) make it useful for guiding chemists in structural optimization for target interaction. Transformer performs better than the other target encoders (Fig. 2C and D). For example, the AUPR and area under the receiver operating characteristic curve (AUROC) of Attentive FP are significantly higher than those of Morgan, with increases of 0.8% (P = .002) and 0.3% (P = .003), respectively. The AUPR and AUROC of Transformer are significantly higher than those of AAC, with increases of 0.6% (P = .002) and 0.25% (P = .009), respectively.
In conclusion, Attentive FP and Transformer outperform traditional chemical descriptors/fingerprints and biological features obtained from databases.

Performance on Dataset_KSE
We verified the effectiveness of the three single modalities (K, S, and E) and the fused ones (KS, KE, SE, and KSE) on Dataset_KSE. KSE performs best among all seven models (Fig. 3A and B). The AUPR and AUROC values of KSE are significantly higher than those of the other models (P < .05). The KSE, KE, KS, and K models, which utilize the KG, perform better than those without it (SE, E, and S). KG-driven DTI models not only utilize the interactions between drugs and targets but also incorporate other information from the large-scale triples in the KG (e.g. drug–drug, drug–disease, and gene–disease), thus significantly improving DTI model performance. Additionally, we observed that fused models significantly outperform single-modality models: for example, SE is better than S and E, and KE is better than K and E. This result is consistent with the intuition that different modalities provide information from different aspects and thereby improve performance. Multimodal fusion learning can fully exploit the importance of each modality.
Additionally, each model has low standard deviations of AUPR and AUROC, demonstrating its robustness.

Performance on Dataset_KS
We verified the effectiveness of the two single modalities K and S, and the fused KS, on Dataset_KS. The KS model achieves the best performance (AUPR = 0.951 ± 0.003 and AUROC = 0.970 ± 0.003). The AUPR and AUROC of KS are significantly higher than those of S (AUPR: P = .002; AUROC: P = .006), but there is no significant difference between KS and K (AUPR: P = .064; AUROC: P = .064) (Fig. 3C and D). These results suggest that the KG plays a pivotal role in the DTI model, and that the structural information of drugs/targets is only subordinate when the KG is provided.

The framework of MDTips
The results indicate that KS and KSE perform best on Dataset_KS and Dataset_KSE, respectively. Since multimodal fusion efficiently integrates information from KGs, graphs/sequences of drugs/targets, and gene expression signatures for DTI prediction, we trained two fusion prediction models, KS and KSE, on Dataset_KS and Dataset_KSE using all available modalities. We also trained the S model on Dataset_KS for drugs with only structural information. The high AUPR and AUROC values demonstrate that these prediction models are powerful enough to detect the true DTIs labeled in the datasets (Fig. 4). The DTP scores predicted by the KSE, KS, and S models are strongly correlated (Spearman r > 0.8, P < .001), indicating relatively high consistency.
Based on the above results, we constructed a comprehensive drug-centric prediction framework named MDTips, containing the KSE, KS, S, and K models. MDTips is an open system that utilizes all available modalities to predict candidate targets of compounds: it uses the input SMILES to conduct a comprehensive database search for matching modalities. When all three modalities are available for the input SMILES, MDTips deploys the KSE model. For novel compounds that lack any prior knowledge, the S model is used to predict candidate targets (Supplementary Fig. S1). Additionally, drug-related information (e.g. side effects and indications) can be predicted by loading the pretrained K model or re-training the K model on the combination of the predicted DTIs and the previous KG. All prediction processes are implemented in a few lines of code, facilitating parallel prediction for massive numbers of drugs/compounds.
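The modality-based dispatch described above can be sketched as a simple selection rule; model names mirror the paper, while the function and flag names are illustrative.

```python
def select_model(has_kg, has_structure, has_expression):
    """Pick the MDTips sub-model given which modalities are available
    for the query compound (richest available combination wins)."""
    if has_kg and has_structure and has_expression:
        return "KSE"
    if has_kg and has_structure:
        return "KS"
    if has_structure:
        return "S"   # novel compounds with only a SMILES string
    if has_kg:
        return "K"
    raise ValueError("at least one modality is required")
```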

Comparison with other methods on Dataset_KS
We compared MDTips with six DTI prediction methods: two knowledge-based methods, KGE_NFM (Ye et al. 2021) and DRKG (Ioannidis 2020); a sequence-based method, DeepDTA (Öztürk et al. 2018); two graph-based methods, CPI-GNN (Tsubaki et al. 2019) and GraphDTA (Nguyen et al. 2021); and a 3D structure-based method, AttentionSiteDTI (Yazdani-Jahromi et al. 2022). We first compared Dataset_KS with the custom-built datasets of the six models and found that Dataset_KS covers the most extensive set of positive DTIs (Table 2). We therefore compared MDTips with the other six models (Supplementary data) using Dataset_KS.
MDTips and all six models achieve relatively high predictive performance (Table 3). Among these, MDTips achieves the best performance (AUPR = 0.951 ± 0.003), significantly higher (P = .002) than that of the second-best model, KGE_NFM (AUPR = 0.947 ± 0.002). The KG-based models (MDTips, KGE_NFM, and DRKG) performed better than the structure/graph/sequence-based models (DeepDTA, CPI-GNN, GraphDTA, and AttentionSiteDTI). This result is consistent with the previous finding that KGs can significantly improve the performance of DTI models by integrating other data. Possible reasons why AttentionSiteDTI does not perform well on Dataset_KS are: (i) DTIs with different effects, such as binding, up-/downregulation, and inhibition/activation, coexist in the dataset; and (ii) we use AlphaFold2-predicted 3D structures for targets without experimentally determined structures in the Protein Data Bank (PDB), which may introduce errors into the model.

Performance of MDTips with different undersampling ratios
The sampling ratio of Dataset_KS and Dataset_KSE is positive:negative = 1:2. We observed that with increasing sampling ratios, the AUPR values decrease while the AUROC values remain almost unchanged. This result suggests that AUPR is an informative metric, since it can reveal differences in model performance on balanced and unbalanced datasets. We also observed that KSE performs best on Dataset_KSE (Fig. 5A and B) and KS performs best on Dataset_KS (Fig. 5C and D) under different sampling ratios. Multimodal fusion can significantly improve model performance in the case of sample imbalance; for example, the AUPR of KSE is 8% higher than that of S (P = .002).
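Why AUPR reacts to class imbalance while AUROC does not can be seen with a minimal worked example; both metrics are implemented from scratch here on synthetic scores, purely to illustrate the effect described above.

```python
def auroc(labels, scores):
    """Probability that a random positive outranks a random negative."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def aupr(labels, scores):
    """Average precision: mean precision at the rank of each positive."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    tp, total = 0, 0.0
    for rank, i in enumerate(order, start=1):
        if labels[i] == 1:
            tp += 1
            total += tp / rank
    return total / sum(labels)

# An extra negative that preserves the pairwise ranking pattern leaves
# AUROC unchanged but lowers AUPR (precision drops at fixed recall).
balanced = ([1, 0, 1], [0.9, 0.8, 0.6])
imbalanced = ([1, 0, 0, 1], [0.9, 0.8, 0.79, 0.6])
```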

Reverse screening and molecular docking
Discovering new DTIs is a crucial step in drug repurposing/discovery. We designed a reverse screening process that employs MDTips to identify candidate targets of an input drug (Fig. 7A). We screened 6766 drugs in Dataset_KS using the KS model and 857 drugs in Dataset_KSE using the KSE model, treating DTIs with a score >0.7 as credible predictions.
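The screening rule (score threshold of 0.7, top 10 targets per drug) amounts to a simple filter-and-rank step; a minimal sketch, with scores passed as a target-to-score mapping and all names illustrative:

```python
def credible_targets(scores, threshold=0.7, top_n=10):
    """Keep targets whose predicted interaction score exceeds the threshold,
    and return up to the top-N of them ranked by score."""
    ranked = sorted(scores.items(), key=lambda kv: -kv[1])
    return [t for t, s in ranked if s > threshold][:top_n]
```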
We collected the top 10 credible targets of all 6766 drugs, involving 2400 targets (Supplementary Table S3). The results show a remarkably high frequency of cytochrome P450 (CYP) family members. The most frequent, CYP3A4, appears 2065 times, far more than the runner-up ALB with 1582 counts, meaning that around 30.6% of the drugs are predicted to interact with CYP3A4. CYP family members are heavily involved in drug metabolism (Zanger and Schwab 2013) and should be an essential consideration in drug development. Other high-frequency targets are shown in Supplementary Fig. S2. Several targets play a significant role in anticancer therapy. For example, CDK2 is a druggable target involved in acute myeloid leukemia differentiation and various cancers (e.g. ovarian and breast cancer) (Tadesse et al. 2020). CA2 is upregulated in cancers (e.g. hepatocellular carcinoma) (Xing et al. 2021) and plays a role in the establishment of the tumor endothelium (Annan et al. 2019).
All the high-frequency targets mentioned above appear prominently in the Proteomaps (Liebermeister et al. 2014) at the protein level (Fig. 6A). The candidate targets can be further grouped into highly enriched functional categories and pathways, including steroid hormone biosynthesis, amino acid metabolism, and ion channels (Fig. 6B). According to the Metascape results (Zhou et al. 2019), targets that respond to hormones make up the biological-function category with the highest proportion (12.85%) of all the targets. In addition, several categories, such as GPCR signaling, response to nitrogen compound, and protein phosphorylation, each account for over 10% of all the targets (Fig. 6C). Tissue/cell-specific results show an exceptionally high proportion (9.01%) of liver-expressed targets, much higher than the runner-up tissue, lung (4.06%) (Fig. 6D). This result is consistent with the fact that most drugs are closely tied to conversion and metabolism in the liver (Almazroo et al. 2017).

Case study
Melphalan is a crucial agent in antineoplastic therapy. Its small molecular weight and alkylating properties enable it to interact with multiple targets. Clarifying its putative targets can inform rational structure modifications, augment therapeutic effectiveness, and diminish toxicity. Notably, melphalan possesses all three modalities and has been the subject of several target investigations. Accordingly, the KSE and KS models can be employed to predict its potential targets (Supplementary Table S4). The results suggest that melphalan may activate HMOX1, inhibit CDC20, and interact with MYC. The molecular docking results show melphalan's bound conformations (Fig. 7B–D) and binding free energies (Fig. 7E–G) for the predicted targets. The lowest binding free energy of melphalan to HMOX1 is −6.61 kcal/mol, significantly lower than that of melphalan to its known target ABCC1, −3.42 kcal/mol (P = 7.556 × 10^−10). Moreover, melphalan is predicted to treat cancers such as breast and ovarian cancer, consistent with previous research reporting melphalan as a treatment for BRCA-related ovarian carcinoma, breast cancer, and refractory cancer (Bouwman and Jonkers 2012). MDTips also identified side effects of melphalan, such as nausea, vomiting, and thrombocytopenia (Kashimura et al. 1988) (Supplementary Table S5).

Discussion
Our work still has room for improvement. First, large-scale and high-quality datasets, which play a critical role in improving DTI prediction performance, are still needed. Although the number of validated DTIs has increased dramatically over the past decades, reliable negative DTI samples are still lacking. Additionally, many studies focus on hotspot drugs and targets, accumulating extensive interactions for these hotspots. These facts inevitably lead to data bias and poor predictive performance on new targets/drugs that never appeared in the training set. Thus, in this study, we used as many known DTIs as possible. To the best of our knowledge, Dataset_KS is the largest DTI dataset, including 88 439 DTIs, 6766 drugs, and 8089 targets, and Dataset_KSE is the first large-scale dataset representing drugs and targets with three modalities (i.e. KG, sequence/graph, and expression profiles), including 24 418 DTPs, 857 drugs, and 2108 targets. MDTips covers 8089 targets belonging to one or more of 41 potentially druggable gene categories, according to DGIdb (Freshour et al. 2021). The most notable druggable gene categories, such as ion channels and enzymes, are well represented, at 74.1% (352/475) and 70.6% (2194/3106) in MDTips (Supplementary Table S6). However, inherent biases still exist in the datasets: hotspot targets such as CYP family members have a relatively high frequency, which inevitably makes the model tend to identify high-frequency entities.
Second, our current MDTips framework does not yet consider the 3D structures of drugs and proteins. The comparison of MDTips with AttentionSiteDTI indicates that DTI models trained on predicted 3D structures do not outperform models trained on AAS. 3D structure-based models should be trained on highly credible datasets that include measured ligand–receptor affinities and co-crystal structures of ligands and proteins. However, the data volume of such datasets is relatively scarce due to the expensive costs and the need for experimental verification of structures. Recently, AlphaFold2 was developed to predict structures for most of the human proteome from AAS (Jumper et al. 2021), and high-accuracy protein structure predictions have been publicly released in the AlphaFold Protein Structure Database (Varadi et al. 2022). However, AlphaFold2 has potential limitations in predicting structures of proteins with few experimental structures in the PDB. For example, AlphaFold2 can capture the overall backbone features of G protein-coupled receptors but falls short in predicting transmembrane domains, the shape of ligand-binding pockets, and the conformation of transducer-binding interfaces (He et al. 2023), which are critical for interactions with drugs. The application of novel 3D-based methods should therefore rely on credible datasets.
For a fair comparison, all models should be trained on the same dataset; KSE outperformed the other six models trained on Dataset_KSE. Although LINCS is still being updated, its data size remains relatively small compared with the vast chemical space. The data augmentation method proposed in DeepCE (Pham et al. 2021) could alleviate this problem of limited data.
A common problem of KGE is that the model must be re-trained whenever new entities or relations are added, which costs considerable time and computing resources. This problem also exists in our model. Moreover, retraining becomes increasingly challenging as KGs grow in scale. Transfer learning can address this problem well (Zhuang et al. 2021); for example, the entity and relation embeddings of ConvKB are initialized using pretrained parameters produced by TransE (Nguyen et al. 2017).
Several potential directions may further improve MDTips: (i) integrating multimodal data with more sophisticated algorithms, such as attention mechanisms, and improving model interpretability will be a focus of future work; (ii) learning features from multiple molecular representations via transfer learning is an important research direction; and (iii) incorporating more knowledge into the DTI prediction model with effective KGE methods is also promising. We will therefore integrate more structural representations (sequence strings, molecular graphs, and 3D structures), gene expression profiles, and larger-scale KGs into the same framework in a future version of MDTips.

Conclusion
In this work, we developed a DL-based DTI prediction system, MDTips, which integrates KGs, graphs/sequences, and gene expression signatures to predict an input drug's potential targets and additional information, such as side effects and indications. We demonstrated that multimodal fusion can significantly improve model performance, and MDTips achieves high performance, indicating that KGs and gene expression signatures play a pivotal role in DTI prediction. Additionally, the DL-based encoders Attentive FP and Transformer outperform traditional chemical descriptors/fingerprints and biological features. MDTips also outperforms other state-of-the-art methods, including knowledge-based, sequence-based, graph-based, and 3D structure-based ones. Furthermore, the case study shows that MDTips can predict DTIs and additional information, such as side effects and indications, for an input drug. In summary, MDTips offers a highly competitive approach for DTI prediction and will benefit drug repurposing.