MolTrans: Molecular Interaction Transformer for Drug Target Interaction Prediction

Drug target interaction (DTI) prediction is a foundational task for in silico drug discovery, which is costly and time-consuming due to the need for experimental search over a large drug compound space. Recent years have witnessed promising progress for deep learning in DTI prediction. However, the following challenges remain open: (1) solely data-driven molecular representation learning approaches ignore the sub-structural nature of DTI, and thus produce results that are less accurate and difficult to explain; (2) existing methods focus on limited labeled data while ignoring the value of massive unlabeled molecular data. We propose the Molecular Interaction Transformer (MolTrans) to address these limitations via: (1) a knowledge-inspired sub-structural pattern mining algorithm and interaction modeling module for more accurate and interpretable DTI prediction; (2) an augmented transformer encoder to better extract and capture the semantic relations among sub-structures extracted from massive unlabeled biomedical data. We evaluate MolTrans on real-world data and show that it improves DTI prediction performance compared to state-of-the-art baselines.


Introduction
Drug discovery is notoriously costly and time-consuming due to the need for experimental search over a large drug compound space. Drug-target protein interaction (DTI) prediction serves as the foundation for finding new drugs (i.e. virtual screening) and new indications of existing drugs (i.e. drug repositioning), since the therapeutic effects of drug compounds are detected by examining DTIs (Hughes et al., 2011). During the compound identification process, researchers often need to conduct assay experiments and search over 97 M possible compounds in a candidate database (Broach et al., 1996).
Luckily, with massive biomedical data and knowledge being collected and made available, along with advances in deep learning technologies that have demonstrated huge success in many application areas, the drug discovery process, particularly DTI prediction, has been significantly enhanced. Recently, various deep models have shown encouraging performance in DTI prediction. They often take drug and protein data as inputs, cast DTI as a classification problem, and make predictions by feeding the inputs through deep learning models such as deep neural networks (DNN) (Unterthiner et al., 2014), deep belief networks (DBN) (Wen et al., 2017) and convolutional neural networks (CNN) (Mayr et al., 2018; Öztürk et al., 2018). Despite these efforts, the following challenges are still open. 1. Inadequate modeling of the interaction mechanism. Existing works (Gao et al., 2018; Öztürk et al., 2018) learn molecular representations and make predictions based on the whole molecular structure of drugs and proteins, ignoring that interactions are sub-structural, involving only the relevant sub-structures of drugs and proteins (Jia et al., 2009; Schenone et al., 2013). Full-structural molecular representations introduce noise and hurt prediction performance. Also, the learned representations are hard to interpret since they do not provide a tractable path to indicate which sub-structures of drugs and proteins contribute to the interactions. 2. Restriction to limited labeled data. Previous works (Gao et al., 2018; Lee et al., 2019; Öztürk et al., 2018; Wen et al., 2017) focus on the data in hand and limit their scope to several thousand drugs and proteins, while ignoring the vast (e.g. on the order of millions) unlabeled biomedical data available. The model architectures in previous works are also not designed to enable the integration of massive datasets.
Present Work. To solve these challenges, we propose a transformer (Vaswani et al., 2017)-based, bio-inspired molecular data representation method [coined the Molecular Interaction Transformer (MolTrans)] to leverage vast unlabeled data for in silico DTI prediction. MolTrans is enabled by the following technical contributions: 1. Knowledge-inspired representation and interaction modeling for more accurate and explainable prediction. Inspired by the knowledge that DTI is sub-structural, MolTrans derives a data-driven method called Frequent Consecutive Sub-sequence (FCS) mining that adaptively extracts high-quality, fit-sized sub-structures for both proteins and drugs. In addition, MolTrans includes a bio-inspired interaction module imitating the real biological DTI process. The new sub-structure fingerprints enable a tractable path for understanding which sub-structure combination is most relevant to the outcome through an explicit map in the interaction module. 2. Leveraging massive unlabeled biomedical data. MolTrans mines through millions of drug and protein sequences from multiple unlabeled data sources to extract high-quality sub-structures of drugs and proteins. The vast data yield much higher-quality sub-structures than using the small training dataset alone. We also augment the representation using transformers (Vaswani et al., 2017), which capture the complex signals among the long sequential sub-structure outputs generated from the unlabeled data.
We provide a comprehensive performance comparison of state-of-the-art methods under various realistic drug discovery settings, including unseen drug/target problems and scarce training data setups. We show empirically that MolTrans robustly improves predictive performance, by up to 25% over the best performing state-of-the-art baseline.
Related work. Numerous computational methods have been developed for the DTI prediction problem. Similarity-based methods such as kernel regression (Pahikkala et al., 2015) and matrix factorization (Zheng et al., 2013) exploit drug-target similarity information from known DTIs to infer new ones. However, these methods have been shown not to generalize across protein classes (Wen et al., 2017). Feature-based methods feed numerical descriptors of drugs and proteins into downstream prediction models. Popular numerical descriptors include ECFP (Rogers and Hahn, 2010) and PubChem (Bolton et al., 2008) for drugs, and Composition-Transition-Distribution (CTD; Dubchak et al., 1995) and the protein sequence composition descriptor (PSC; Cao et al., 2013) for proteins. Classic machine learning methods such as gradient boosting (He et al., 2017) have shown promise in predictive performance. Recently, deep learning-based methods (Öztürk et al., 2019; Tsubaki et al., 2019; Unterthiner et al., 2014; Wen et al., 2017) have shown further performance improvements due to their capability to capture complex non-linear signals of DTI. MolTrans differs from existing works in (i) its knowledge-driven model architecture design rather than a direct application of existing deep learning models; (ii) its emphasis on interpretability, instead of predictive performance alone, to potentially aid medicinal chemists in better decision making and (iii) its usage of external drug and target data to complement the interaction dataset.

Problem definition
We formulate DTI prediction as a classification task to determine whether a pair of drug and target protein will interact. In our setting, a drug is represented by its Simplified Molecular Input Line Entry System (SMILES) string S_i, which consists of a sequence of chemical atom and bond tokens (e.g. C, O, S) generated by a depth-first traversal over the molecule graph. We denote by S the drug's SMILES representation. A target protein, denoted A, is represented by a sequence of protein tokens, where each token is one of the 23 amino acids. The notation table is provided in Supplementary Material S1. The DTI prediction task is defined as below.
Problem 1 (DTI Prediction). Given compound sequences S = {S_1, ..., S_n} for n drugs and protein sequences A = {A_1, ..., A_m} for m proteins, the DTI prediction task can be cast as learning a function mapping F: S × A → [0, 1] from drug-target pairs to an interaction probability score.

The MolTrans method
The MolTrans framework learns to predict DTI as follows. Given the input drug and protein data, an FCS mining module first decomposes them into explicit sequences of sub-structures using a specialized decomposition algorithm. The outputs are then fed into an augmented transformer embedding module to obtain an augmented contextual embedding for each sub-structure through transformer encoders (Vaswani et al., 2017). Next, in the interaction prediction module, drug sub-structures are paired with protein sub-structures via pairwise interaction scores. A CNN layer is then applied on the interaction map to capture higher-order interactions. Finally, a decoder module outputs a score indicating the probability of interaction. A method illustration is provided in Fig. 1.

FCS mining module
Driven by the domain knowledge that DTI happens at a sub-structural level, MolTrans first decomposes the molecular sequences of proteins and drugs into sub-structures. In particular, we propose a data-driven sequential pattern mining algorithm, the FCS algorithm, to find recurring sub-sequences across drug and protein databases. Inspired by the invention of sub-word units in the natural language processing field (Gage, 1994; Sennrich et al., 2015), FCS aims to generate a hierarchical set of frequent sub-sequences.
The algorithm is summarized in Algorithm 1. FCS decomposes each protein/drug sequence hierarchically into sub-sequences, smaller sub-sequences and individual atom/amino-acid symbols. FCS first initializes a vocabulary set V of distinct amino acid tokens or SMILES character tokens and, given these tokens, tokenizes the entire drug/protein corpus. We call the tokenized set W. Then, it scans through W and identifies the most frequent pair of consecutive tokens (A, B). FCS then replaces every occurrence of (A, B) in the tokenized set W with the new token (AB) and adds this new token to the vocabulary set V. This scan-identify-update process is repeated until no token pair's frequency is above the threshold h or the size of V reaches a pre-defined maximum. Through this operation, frequent sub-sequences are merged into single tokens, and sub-sequences that are not frequent enough are decomposed into sets of shorter tokens. In the end, for a drug/protein, FCS yields a sequence C = {C_1, ..., C_k} of k sub-structures, where each C_i is from the set V.
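The scan-identify-update loop above closely parallels byte-pair encoding. A minimal pure-Python sketch of such a mining loop is shown below; the function name, loop structure and stopping conditions follow the description here, but the actual Algorithm 1 implementation may differ.

```python
# Sketch of an FCS-style (byte-pair-encoding-like) mining loop.
# Names are illustrative; not the paper's implementation.
from collections import Counter

def fcs_mine(corpus, theta=2, max_vocab=100):
    """corpus: list of token lists (atoms / amino acids).
    Returns the learned vocabulary V and the re-tokenized corpus W."""
    vocab = {tok for seq in corpus for tok in seq}   # initial single tokens
    W = [list(seq) for seq in corpus]
    while len(vocab) < max_vocab:
        # scan: count consecutive token pairs across the corpus
        pairs = Counter()
        for seq in W:
            for a, b in zip(seq, seq[1:]):
                pairs[(a, b)] += 1
        if not pairs:
            break
        (a, b), freq = pairs.most_common(1)[0]
        if freq < theta:                 # stop below frequency threshold
            break
        merged = a + b                   # identify: new sub-structure token
        vocab.add(merged)
        # update: replace every (a, b) occurrence with the merged token
        for seq in W:
            i = 0
            while i < len(seq) - 1:
                if seq[i] == a and seq[i + 1] == b:
                    seq[i:i + 2] = [merged]
                else:
                    i += 1
    return vocab, W
```

On a toy corpus such as `["ABAB", "ABC"]`, the frequent pair (A, B) is merged into a single token AB, while the infrequent suffix C stays as an individual symbol.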
Using the FCS algorithm, MolTrans converts the input drug and target into sequences of explicit sub-structures C_d and C_p, respectively. The significance of FCS is threefold. 1. It is more explainable than previous sub-structure fingerprinting methods. An explicit sub-structure fingerprint such as the PubChem encoding has on average 100 granular sub-structures for a small molecule, where many sub-structures are subsets of others, making it intractable to know which sub-structure leads to the outcome. In contrast, the FCS drug encoding is capable of giving explicit hints, as it decomposes each drug molecule into discrete, moderately sized partitions of sub-structures, as shown in Section 3.7. 2. It allows leveraging massive unlabeled data for improved sub-structure mining. For example, we use the Uniprot dataset (Boutet et al., 2007), which consists of 560 823 unique protein sequences, and the ChEMBL database (Gaulton et al., 2012), which includes 1 870 461 drug SMILES strings. We observe that the quality of the mined sub-structures originates from the massive unlabeled data we used. In small datasets, the occurrences of many useful sub-structures fall below the reasonable minimum frequency, whereas a large aggregated dataset can successfully identify them with a larger pool of sequences. We also show, in Section 3.8, that the encoding has better predictive power when using massive unlabeled data than when using small datasets. 3. FCS can capture fundamental and meaningful biomedical semantics. The generated sub-structures correspond to fundamental units of drugs and proteins that recur frequently. We find that the FCS algorithm identifies similar sets of fundamental biochemical sub-structures across datasets with different characteristics, such as different organism types in the protein dataset and different drug-likeness in the drug dataset, suggesting the robustness of the FCS algorithm (Supplementary Material S1). In general, we use a more general dataset (e.g. ChEMBL) instead of a focused dataset (e.g. approved drugs in DrugBank) because a larger dataset can improve downstream predictive performance (Section 3.8).

Augmented transformer embedding module
To capture the chemical semantics of sub-structures, MolTrans includes an augmented embedding module, in which it first initializes a learnable sub-structure lookup dictionary and then augments the embedding with contextual sub-structural information via transformer encoders (Vaswani et al., 2017). The transformer is a state-of-the-art deep learning architecture that leverages a self-attention mechanism to generate contextual embeddings. It was originally developed for natural language processing tasks; here, we adapt it for molecular representation learning. In our setting, the self-attention mechanism in the transformer encoder modifies each input sub-structure embedding by learning from all sub-structures of the same molecule. The resulting sub-structural embedding is better because it is contextual, taking into account the complex chemical relationships among neighboring sub-structures. Concretely, for each input drug-target pair, we transform the corresponding sequences of sub-structures C_p and C_d into two matrices M^p ∈ R^{k×H_p} and M^d ∈ R^{l×H_d}, where k/l is the total number of sub-structures (i.e. the cardinality of the vocabulary set V from the FCS algorithm) for protein/drug, H_p and H_d are the maximum lengths of the sub-structure sequences for protein and drug, and each column M^p_i and M^d_j is a one-hot vector corresponding to the sub-structure index of the ith sub-structure of the protein sequence and the jth sub-structure of the drug sequence. The content embeddings E^p_cont,i and E^d_cont,j for each protein and drug sub-structure are generated via learnable dictionary lookup matrices W^p_cont ∈ R^{Θ×k} and W^d_cont ∈ R^{Θ×l}, where Θ is the latent embedding size for each sub-structure.
Since MolTrans uses sequential sub-structures, we also include positional embeddings E^p_pos,i and E^d_pos,j via lookup dictionaries (Vaswani et al., 2017) W^p_pos ∈ R^{Θ×H_p} and W^d_pos ∈ R^{Θ×H_d}. The final embeddings E^p_i and E^d_j are generated as the sum of the content and positional embeddings. The model above outputs a set of independent sub-structure embeddings. However, these sub-structures have chemical relationships (e.g. octet rules) among themselves. To capture this contextual information, we further augment the embeddings using transformer encoder layers (Vaswani et al., 2017).
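As an illustration of this embedding step, the following numpy sketch sums a content and a positional lookup and refines the result with a single self-attention head. The real module uses multi-head transformer encoder layers with learned parameters; all names, dimensions and the single-head simplification here are ours.

```python
# Toy sketch: content + positional embedding, refined by one
# self-attention head (illustrative stand-in for the transformer encoder).
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def augment_embeddings(idx, W_cont, W_pos, Wq, Wk, Wv):
    """idx: (L,) sub-structure indices for one molecule."""
    L = len(idx)
    E = W_cont[idx] + W_pos[:L]          # content + positional embedding
    Q, K, V = E @ Wq, E @ Wk, E @ Wv     # single-head self-attention
    A = softmax(Q @ K.T / np.sqrt(Q.shape[-1]))
    return A @ V                          # contextual sub-structure embedding

rng = np.random.default_rng(0)
vocab_size, max_len, dim = 20, 8, 16      # toy values, not the paper's 384/545
W_cont = rng.normal(size=(vocab_size, dim))
W_pos = rng.normal(size=(max_len, dim))
Wq, Wk, Wv = (rng.normal(size=(dim, dim)) for _ in range(3))
E_aug = augment_embeddings(np.array([3, 7, 1]), W_cont, W_pos, Wq, Wk, Wv)
```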

Interaction prediction module
MolTrans includes an interaction module that consists of two layers: (i) an interaction tensor to model pairwise sub-structural interaction and (ii) a CNN layer over interaction map to extract neighborhood interaction.
Pairwise interaction. To model the pairwise interaction, for each sub-sequence i in the protein and sub-sequence j in the drug, we compute F(E^p_i, E^d_j), where F is a function that measures the interaction between the pair. It can be any function such as sum, average or dot product. After this layer, we have a tensor I ∈ R^{H_d×H_p×U}, where H_d/H_p is the length of the sub-sequences for drug/protein, respectively, and U is the size of the output of F; each entry of this tensor accounts for the interaction of an individual sub-structure pair of protein and drug. To provide explainability, we favor the dot product as the aggregation function because it generates a single scalar that explicitly measures the intensity of interaction between an individual target-drug sub-structural pair. As the dot product output is one-dimensional for every pair, I becomes a two-dimensional interaction map. If a value in the map is high, it will be activated in the downstream layer and lead to a higher likelihood of DTI. Through end-to-end learning, if a pair of sub-structures indeed interacts, it will have a high interaction score at the corresponding position in the interaction map. Thus, by examining this map, we can directly see which sub-structure pairs contribute to the final outcome.
Neighborhood interaction. Nearby sub-structures of proteins and drugs influence each other in triggering interactions. Hence, besides modeling individual pairwise interactions, it is also necessary to model interactions with nearby regions. We achieve this through a CNN (Krizhevsky et al., 2012) layer on top of the interaction map I. The intuition is that by applying several order-invariant local convolution filters, interactions with nearby regions can be captured and aggregated. We obtain the output representation O of the input drug-target pair. This interaction module is inspired by the Deep Interactive Inference Network (Gong et al., 2018).
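The dot-product interaction map can be sketched as below, with a simple mean over local neighborhoods standing in for the learned CNN filters; the names and the mean-pooling choice are ours, not the paper's.

```python
# Sketch of the pairwise interaction map (F = dot product), plus a crude
# local aggregation standing in for the CNN over the map.
import numpy as np

def interaction_map(E_d, E_p):
    """E_d: (H_d, dim) drug embeddings; E_p: (H_p, dim) protein embeddings.
    With dot product, U collapses to 1, so I is a 2-D map."""
    return E_d @ E_p.T                    # I[j, i] = <drug_j, protein_i>

def local_aggregate(I, k=3):
    """Mean over k x k neighborhoods: an illustrative stand-in for the
    order-invariant local convolution filters."""
    H, W = I.shape
    O = np.zeros((H - k + 1, W - k + 1))
    for r in range(O.shape[0]):
        for c in range(O.shape[1]):
            O[r, c] = I[r:r + k, c:c + k].mean()
    return O
```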
Thanks to this explicit interaction modeling, we can later visualize the strength of each individual sub-structural interaction pair from the interaction map. To output a probability indicating the likelihood of interaction, we first flatten O into a vector and apply a linear layer parameterized by a weight matrix W_o and a bias vector b_o, followed by a sigmoid function σ(a) = 1/(1 + e^{-a}) to produce the interaction probability. The whole model is trained end-to-end with the binary cross-entropy loss, where Y is the ground-truth label.
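For concreteness, the decoder step can be sketched as follows, assuming a standard sigmoid output with binary cross-entropy loss as described; the function and variable names are illustrative.

```python
# Toy sketch of the decoder: linear layer on the flattened map,
# sigmoid output, binary cross-entropy against the label Y.
import math

def predict_and_loss(O_flat, W_o, b_o, y):
    """O_flat: flattened interaction representation; y in {0, 1}."""
    a = sum(w * x for w, x in zip(W_o, O_flat)) + b_o   # linear layer
    p = 1.0 / (1.0 + math.exp(-a))                      # sigmoid sigma(a)
    loss = -(y * math.log(p) + (1 - y) * math.log(1 - p))  # BCE loss
    return p, loss
```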

Implementation
MolTrans is implemented in PyTorch (Paszke et al., 2019). For the FCS algorithm, we set the minimum number of occurrences of sub-structures in the dataset to 500 for both drugs and proteins, which results in 23 532 drug sub-structures and 16 693 protein sub-structures. We use two layers of transformer encoders for both drugs and proteins. The input embedding is of size 384, and we use 12 attention heads for each transformer encoder with intermediate dimension 1536. We set the maximum sequence length to 50 for drugs and 545 for proteins, covering 95% of the sequences in the dataset. We truncate/pad sequences that are above/below the maximum length. We show that model performance is not biased by sequence length in Supplementary Material S2.
For the CNN, we use three filters with kernel size three. For optimization hyper-parameters, we use the Adam optimizer with a learning rate of 1e-5. We set the batch size to 64 and run for 30 epochs; training converges between 8 and 15 epochs. The dropout rate is 0.1.
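For reference, the hyper-parameters reported in this section can be collected into a single configuration, as in the sketch below; the key names are ours, while the values come from the text above.

```python
# Hyper-parameters from the Implementation section, gathered as a dict.
# Key names are illustrative; values are as reported in the text.
moltrans_config = {
    "fcs_min_frequency": 500,
    "n_drug_substructures": 23_532,
    "n_protein_substructures": 16_693,
    "n_transformer_layers": 2,
    "embedding_dim": 384,
    "n_attention_heads": 12,
    "intermediate_dim": 1536,
    "max_drug_len": 50,
    "max_protein_len": 545,
    "cnn_filters": 3,
    "cnn_kernel_size": 3,
    "optimizer": "Adam",
    "learning_rate": 1e-5,
    "batch_size": 64,
    "max_epochs": 30,
    "dropout": 0.1,
}
```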

Result
We design experiments to answer the six questions (Q1-Q6) addressed in the subsections below.

Experimental setup
Dataset. We use the MINER DTI dataset from the BIOSNAP collection (Zitnik et al., 2018a) as our main experimental dataset. It consists of 4510 drug nodes, 2181 protein targets and 13 741 DTI pairs from DrugBank (Wishart et al., 2008). The BIOSNAP dataset only contains positive DTI pairs. For negative pairs, we sample from the unseen pairs, following common practice (Zhang and Chen, 2018; Zitnik et al., 2018b). We obtain a balanced dataset with equal positive and negative samples. In addition to BIOSNAP, we also include two benchmark datasets in the main predictive performance comparison experiment. DAVIS consists of wet-lab assay K_d values among 68 drugs and 379 proteins (Davis et al., 2011), and BindingDB consists of K_d values among 10 665 drugs and 1413 proteins (Liu et al., 2007). DTI pairs with K_d values below 30 units are considered positive. For balanced training, we sub-sample the same number of negative DTI pairs as positive samples for the training set. We keep the datasets' original negative ratios in the validation and testing sets. Dataset statistics are provided in Table 1.
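The balanced-dataset construction for BIOSNAP-style data can be sketched as below: positives are the known DTI pairs, and an equal number of negatives is sampled from the unseen pairs. The function and variable names are illustrative (for DAVIS/BindingDB, positives would instead come from the K_d < 30 rule).

```python
# Sketch of balanced DTI dataset construction: negatives sampled
# uniformly from unseen drug-protein pairs, one per positive.
import itertools
import random

def balanced_dti_dataset(drugs, proteins, positive_pairs, seed=0):
    """Returns (drug, protein, label) triples with equal class sizes."""
    positives = set(positive_pairs)
    unseen = [p for p in itertools.product(drugs, proteins)
              if p not in positives]
    rng = random.Random(seed)
    negatives = rng.sample(unseen, len(positives))  # equal class sizes
    return ([(d, t, 1) for d, t in positives]
            + [(d, t, 0) for d, t in negatives])
```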
Metrics. We use ROC-AUC (area under the receiver operating characteristic curve) and PR-AUC (area under the precision-recall curve) to measure binary classification performance. In addition, we report sensitivity and specificity at the threshold that achieves the best F1 score on the validation set.
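The threshold-selection rule can be sketched in pure Python as follows; the helper names are ours, and a real evaluation would use a metrics library rather than this loop.

```python
# Sketch: pick the score cut-off that maximizes F1 on validation data,
# then report sensitivity/specificity at that cut-off.
def best_f1_threshold(scores, labels):
    best_t, best_f1 = 0.5, -1.0
    for t in sorted(set(scores)):
        tp = sum(s >= t and y == 1 for s, y in zip(scores, labels))
        fp = sum(s >= t and y == 0 for s, y in zip(scores, labels))
        fn = sum(s < t and y == 1 for s, y in zip(scores, labels))
        f1 = 2 * tp / (2 * tp + fp + fn) if tp else 0.0
        if f1 > best_f1:
            best_t, best_f1 = t, f1
    return best_t

def sensitivity_specificity(scores, labels, t):
    tp = sum(s >= t and y == 1 for s, y in zip(scores, labels))
    fn = sum(s < t and y == 1 for s, y in zip(scores, labels))
    tn = sum(s < t and y == 0 for s, y in zip(scores, labels))
    fp = sum(s >= t and y == 0 for s, y in zip(scores, labels))
    return tp / (tp + fn), tn / (tn + fp)
```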
Evaluation strategies. We divide the dataset into training, validation and testing sets in a 7:1:2 ratio. For every experiment, we conduct five independent runs with different random splits of the dataset. We then select the best performing model based on ROC-AUC on the validation set. The selected model is then evaluated on the test set, with results reported below.
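The per-run 7:1:2 random split can be sketched as below (the function name and seeding scheme are ours):

```python
# Sketch of the 7:1:2 train/validation/test split, one per random seed.
import random

def split_712(pairs, seed):
    """Shuffle and split drug-target pairs 70% / 10% / 20%."""
    pairs = pairs[:]                      # do not mutate the caller's list
    random.Random(seed).shuffle(pairs)
    n = len(pairs)
    n_train, n_val = int(0.7 * n), int(0.1 * n)
    return (pairs[:n_train],
            pairs[n_train:n_train + n_val],
            pairs[n_train + n_val:])
```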

Baselines
We compare MolTrans with the following baselines. We focus on state-of-the-art deep learning models, as they have demonstrated superior performance over shallow models.
1. LR (Cao et al., 2013; Rogers and Hahn, 2010) applies a logistic regression model on the concatenated drug and protein feature vectors. We experiment with all combinations of ECFP4 (Rogers and Hahn, 2010) and PubChem (Wang et al., 2009) for drugs, and PSC (Cao et al., 2013) and CTD (Dubchak et al., 1995) for proteins. We find that ECFP4 for drugs and PSC for proteins gives the highest performance. 2. DNN uses a three-layer deep neural network with hidden size 1024 on top of the concatenated ECFP4 and PSC vector. 3. GNN-CPI (Tsubaki et al., 2019) uses a graph neural network to encode drugs and a CNN to encode proteins. The latent vectors are then concatenated and fed into a neural network for compound-protein interaction prediction. We follow the same hyper-parameter setting described in the paper. 4. DeepDTI (Wen et al., 2017) models DTI using a DBN (Hinton, 2009), which is a stack of Restricted Boltzmann Machines (Hinton, 2012). It uses the concatenation of ECFP2, ECFP4 and ECFP6 as the drug feature and PSC as the protein feature. We optimize the hyper-parameters described in the paper based on validation set performance. 5. DeepDTA (Öztürk et al., 2018) applies CNNs on both the raw SMILES string and the protein sequence to extract local residue patterns; its original task is to predict binding affinity values. We add a sigmoid activation at the end to convert it to a binary classification problem, and we conduct a hyper-parameter search to ensure fairness. 6. DeepConv-DTI (Lee et al., 2019) uses CNN and global max pooling layers to extract variable-length local patterns in the protein sequence and applies a fully connected layer on the ECFP4 drug fingerprint. It conducts extensive experiments on different datasets and is the state-of-the-art model for the binary DTI prediction task. We follow the same hyper-parameter setting described in the paper.

Q1: MolTrans achieves superior predictive performance
To answer Q1, we randomly select 20% of the drug-protein pairs as the test set. Table 2 shows that MolTrans has consistently better predictive performance than the baselines in the DTI prediction setting, in both ROC-AUC and PR-AUC, across all datasets. MolTrans achieves up to a 25% increase over the best performing baseline (DAVIS PR-AUC). Note that, due to different thresholds across methods, sensitivity and specificity may vary.

Q2: MolTrans has competitive performance in unseen drug and target setting
To imitate the unseen drug/target task, we randomly select 20% of the drugs/target proteins and use all DTI pairs associated with them as the test set. The results are in Table 3. We observe that KronRLS's performance varies across settings. This is because KronRLS is a similarity-based method and is hence susceptible to the properties of the data in hand. In the unseen drug setting, we find the one-layer LR is better than the multi-layer DNN, but worse than the SOTA methods with more carefully designed deep architectures. This shows the necessity of carefully designed model architectures. We also see that MolTrans has competitive performance against the SOTA deep learning baselines in both settings.

Q3: MolTrans performs best with scarce data
Although the availability of DTI data is exploding, in some real-world drug discovery pipelines there are new target proteins or drugs with only a handful of labels due to budget restrictions. Hence, robust performance under low-resource constraints is desirable in the DTI setting. We train each method on 5%, 10%, 20% and 30% of the dataset and predict on the rest (we use 10% of the test edges as a validation set for early stopping). The results are reported in Table 4. We see that MolTrans is the most robust method. In contrast, SOTA baselines such as DeepDTI and DeepConv-DTI degrade as the missing fraction increases. One reason why MolTrans performs well in the scarce setting is that it leverages embeddings of sub-structures, which are relatively abundant and hence transferable, compared to other methods that utilize entire drugs and proteins.

Q4: MolTrans is robust in various protein families
Target proteins come from different protein families. It is important that the prediction algorithm is not biased toward one particular protein family. In this experiment, we test predictive performance on four of the largest druggable target classes: enzymes, ion channels, G-protein-coupled receptors (GPCRs) and nuclear receptors. We take one test set of BIOSNAP and map the target proteins to the four protein families using the GtoPdb database (https://www.guidetopharmacology.org/targets.jsp). We find 1908 enzyme interactions, 533 GPCR interactions, 496 ion channel interactions and 104 nuclear receptor interactions. We find that MolTrans is robust in each of these individual protein families (Fig. 2). In particular, enzymes, GPCRs and ion channels show higher performance than the overall performance across all protein classes.

Q5: MolTrans allows model understanding
To answer Q5, we show through examples how the interaction map I can provide hints on which sub-structure leads to the interaction. A high-value cell in the interaction map stands for a potentially activated interaction between a drug and a target sub-structure that is important to the final interaction outcome. To visualize this, we generate a heat map of I to see which cells have high values. We then select a threshold to mask out the majority of cells with low values, and examine the literature to see whether the remaining cells contain clues to the interaction outcome. We first feed the drug 2-nonyl n-oxide and the protein cytochrome b-c1 complex subunit 1 into MolTrans, and visualize the interaction map by filtering scalars larger than a threshold in Figure 3. We see that the nitrogen oxide group [N+]([O-]) and KNWV have the highest interaction coefficient, matching a previous study (Lightbown and Jackson, 1956), which showed that the nitrogen oxide group is essential for cytochrome inhibition activity. This example supports that MolTrans is capable of providing reasonable cues for understanding the model prediction and possibly shedding light on the inner workings of DTI. To add more credibility, we feed the Ephrin type-A receptor 4 (Epha4) target and the drug Dasatinib into MolTrans; the map shows that the amino-thiazole group [S(=O)(=O) and N sub-structures] is highlighted with the protein motifs KF and DVG, which overlaps with the Epha4-Dasatinib complex described in a previous study (Farenc et al., 2011). We also feed the input target protein HDAC2 and the input drug hydroxamic acid. The interaction map assigns high intensity to the NC(=O) group and the carbon chain, paired with the protein sub-structures KK, YG, DIG and DD. The suggested ligand sub-structure matches the observed interaction in the HDAC2-SAHA co-complex (Lauffer et al., 2013). Interaction maps for additional examples are provided in Supplementary Material S3.
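The masking step used in these visualizations can be sketched as below; the function name and threshold value are illustrative, and a real analysis would plot the masked map as a heat map.

```python
# Sketch: keep only high-intensity cells of the interaction map and
# return their (drug_sub, protein_sub) index pairs for inspection.
def mask_interaction_map(I, threshold):
    """I: 2-D interaction map as a list of rows (drug x protein)."""
    return [(j, i)
            for j, row in enumerate(I)
            for i, v in enumerate(row)
            if v >= threshold]
```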

Q6: Ablation study
We conduct an ablation study in the full-data setting with the following setups: 1. -CNN: we remove the CNN from the interaction module, flatten the interaction map I and feed it directly into the decoder. 2. -AugEmbed: we remove the transformer in the augmented embedding module and feed the interaction module with the positional and content embeddings only. 3. -Interaction: we remove the interaction module and predict from the sub-structure embeddings directly. 4. Small: we mine FCS sub-structures from the small training dataset only, instead of the massive unlabeled data. 5. -FCS: we replace the FCS sub-structures with other popular fingerprints.
From Table 5, we see that the CNN, the transformers and the interaction module all contribute to the final model performance. The FCS fingerprint alone has strong predictive power, as shown by -Interaction. In addition, from Small, we see that the massive unlabeled data are useful, as they enrich the input and boost performance. From -FCS, we see that our model is adaptable to other popular fingerprints with similarly strong performance. (Note: the three best performing baselines are used for comparison in Table 5.)

Conclusion
In this work, we introduce MolTrans, an end-to-end, biologically inspired, deep learning-based framework that models the DTI process.
We test under realistic drug discovery settings and evaluate against state-of-the-art baselines. We demonstrate empirically that MolTrans has competitive performance in accurately predicting DTI under all settings, with improved explainability.
Financial Support: This work was in part supported by the National Science Foundation award SCH SCH-2014438, IIS-1418511, CCF-1533768, IIS-1838042, the National Institute of Health award NIH R01 1R01NS107291-01 and R56HL138415, and IQVIA.