DeepTraSynergy: drug combinations using multimodal deep learning with transformers

Abstract
Motivation: Screening bioactive compounds in cancer cell lines has received increasing attention. Multidisciplinary drugs or drug combinations play a more effective role in treatment and selectively inhibit the growth of cancer cells.
Results: We propose a new deep learning-based approach for drug combination synergy prediction called DeepTraSynergy. The approach uses multimodal input, including drug–target interaction, protein–protein interaction, and cell–target interaction, to predict drug combination synergy. Transformers are used to learn the feature representation of drugs. The approach is multitask: it predicts three outputs, namely the drug–target interaction, the toxic effect, and the drug combination synergy. Drug combination synergy is the main task, and the other two are auxiliary tasks that help the approach learn a better model. Accordingly, three loss functions are defined: synergy loss, toxic loss, and drug–protein interaction loss, the last two serving as auxiliary losses. DeepTraSynergy outperforms classic and state-of-the-art models in predicting synergistic drug combinations on the two latest drug combination datasets. It achieves accuracy values of 0.7715 and 0.8052 (an improvement over other approaches) on the DrugCombDB and OncologyScreen datasets, respectively. We also evaluate the contribution of each component of DeepTraSynergy to show its effectiveness in the proposed method. Introducing the relations between proteins (PPI networks) and drug–protein interaction significantly improves the prediction of synergistic drug combinations.
Availability and implementation: The source code and data are available at https://github.com/fatemeh-rafiei/DeepTraSynergy.


Introduction
Multidisciplinary drugs or drug combinations have been shown to be more effective than single-drug therapies and to selectively inhibit the growth of cancer cells (Tang 2017). Designing new drugs with optimal performance for cancer patients is highly important in the pharmaceutical industry (He et al. 2018, Lee et al. 2018, Abbasi et al. 2019, Abbasi 2021). The primary purpose of such studies is to prioritize combination anticancer therapies, based on the simultaneous use of several drugs with different mechanisms of action, to overcome the resistance seen with single medicines and reduce side effects (Pemovska et al. 2013, Masoudi-Sobhanzadeh et al. 2021). Several target genes are involved in cancer cell proliferation, together with factors such as genomic complexity, molecular context, and tumor heterogeneity. Furthermore, their protein products are essential in controlling abnormal pathways and networks that lead to diverse responses to anticancer drugs among patients (Pang et al. 2014, Rubin 2015, Schmitt et al. 2016). Drug combinations have emerged as a promising therapeutic approach to overcoming drug resistance and improving the effectiveness of anticancer therapies (Wang et al. 2021). In other words, a combination can use multiple drugs to act on various targets, pathways, or cellular processes involved in the pathogenesis of a particular disease (Anighoro et al. 2014, Madani Tonekaboni et al. 2018, Masoudi-Sobhanzadeh 2020). The number of possible drug combinations increases rapidly with the number of drugs; hence, wet-lab experiments alone are not enough to discover new drug combinations. Therefore, to reduce the search space, there is a need to develop more efficient computational methods for predicting synergistic drug combinations (Paul et al. 2010).
Current computational methods use synergy scores to predict effective drug combinations. The synergy score is defined as the degree of drug interaction. Synergy is generally determined by a selected reference model based on the properties of dose-response curves, which measures the response based on the difference between the expected and observed dose-response profiles (Zagidullin et al. 2021). Subsequently, a combination can be classified as synergistic, additive, or antagonistic. Information such as structural similarity and biochemical properties is vital to understanding the behavior of drug compounds. Also, incorporating drug-target and drug-drug interactions can improve effective combination therapies (Masoudi-Nejad et al. 2013, Mousavian et al. 2016). Moreover, drug-protein interaction and protein-protein interaction (PPI) are important factors for investigating the effectiveness of drug synergy (Chen et al. 2020, Wang et al. 2021). In recent years, gene expression profiles have also helped to predict the synergistic effects of drug compounds on cancer cell lines (Preuer et al. 2018, Guo et al. 2021, Dehghan et al. 2023).
Many approaches to drug synergy prediction have been introduced so far. Some of them use classic machine learning methods. Sidorov et al. (2019) used Random Forest (RF) and Extreme Gradient Boosting (XGBoost) to predict, per cell line, the synergy of a given drug combination. Their results show that XGBoost offers slightly better performance than RF. Julkunen et al. (2020) provided comboFM, a machine learning framework for predicting drug combination responses in preclinical studies based on cell lines or patient-derived cells. They applied a higher-order factorization machine (FM) to learn the higher-order tensors of drug pairs. The input features of comboFM contained the molecular fingerprints of the two drugs, the concentration values of both drugs, and the gene expression profiles of cancer cell lines. comboFM was shown to be an effective tool for systematically prescreening drug combinations to support precision oncology programs. Another study introduced a novel semisupervised algorithm that uses not only the drug's chemical structure but also a drug-target interaction network. Liu et al. (2019) drew on drug-protein heterogeneous network-based inference to derive the properties of drug combinations. They trained a gradient tree boosting classifier to predict new drug combinations using the extracted properties. Zhang and Yan (2019) developed a field-aware FM that incorporates pharmacological data into predicting two- and three-drug synergistic combinations.
In recent years, deep learning-based drug synergy prediction has received attention. Some of these approaches only utilize the knowledge of proteins directly targeted by drugs and diseases. Yang et al. (2021) introduced an advanced graph-based deep learning method that used the graph representation of the PPI network to identify anticancer drug combinations. Using a spatial-based graph convolutional network, the model encodes information about the topological structure of the target protein modules of a drug pair and the protein modules associated with a particular cancer cell line in the PPI network. Kim et al. (2021) proposed a drug synergy prediction model based on multitask deep neural networks for understudied cancer types. To overcome the data scarcity challenge, they used transfer learning, so that models trained on data-rich tissues are transferred to data-poor tissues. They also used a multitask deep neural network with multimodal inputs (molecular, genetic, and phenotypic characteristics) and multiple outputs (drug sensitivity and synergism) for cancer cell lines. Zhang et al. (2021) used an autoencoder-based deep neural network (AuDNN) model to predict the synergy of drug combinations by integrating multiomics data and chemical structure data. Three autoencoders were trained to generate representations of cancer cell lines from gene expression, copy number, and genetic mutation data of tumor samples. AuDNNsynergy can also be used to predict combinations in novel cell lines. Jiang et al. (2020) used a graph convolutional network to predict the synergy of drug combinations in 39 cell lines derived from six types of cancer. Multimodal graphs were constructed for each cell line based on the drug-drug interaction networks, drug-protein interaction networks, and PPI networks. Jin et al. (2021) developed ComboNet, a deep learning-based model that jointly uses molecular structure and biological targets to predict synergistic drug combinations. Brahim Kuru et al. (2021) presented MatchMaker, a deep neural network-based algorithm for predicting drug synergy scores using drug chemical structure information and cell line gene expression. Dong et al. (2021) introduced an interpretable deep signaling pathway model called IDSP, which is a deep graph neural network. In their work, gene-gene and gene-drug regulatory relationships are included in synergistic drug prediction. Li et al. (2023) used a Siamese convolutional network and random matrix projection to learn more informative drug combination features. After cell line features are extracted with a convolutional network, these features are integrated and passed into a multilayer perceptron (MLP) to predict the synergy score.
Liu and Xie (2021) introduced a knowledge-based deep learning model, TranSynergy, and a new method, Shapley additive gene set enrichment analysis (SA-GSEA), to predict synergistic drug combinations and improve the interpretability of the machine learning model. Wang et al. (2021) proposed a deep learning-based model that uses graph neural networks and an attention mechanism to predict the synergy of drug combinations. Sun et al. (2020) proposed a deep tensor factorization model that combines a tensor factorization framework with a deep neural network to predict the synergistic effect of drug pairs. It achieves almost as good predictive performance as the advanced model (DeepSynergy) while using significantly fewer data sources.
In multitask learning (MTL), several tasks are learned and predicted simultaneously. MTL improves the prediction accuracy of each task-specific model compared to training each model separately. This study proposes a new deep multitask and multimodal approach for drug synergy prediction. The model takes the PPI network, the cell-target interaction, and both drug sequences as input and predicts the drug-target interaction, its toxic effect, and the drug combination synergy as different tasks. Drug combination synergy is the main task, and the other two are auxiliary tasks that help the approach learn a better model. Our approach differs from other state-of-the-art protocols as follows:
1) To describe drug molecules, we propose a transformer-based approach.
2) To predict the interaction between input drugs and all protein sequences, we use a binding affinity prediction model. A one-class learning loss is used to learn only from active compound-target pairs.
3) We propose a new architecture that effectively combines drug-target interaction, PPI, and cell-target interaction for drug synergy prediction.
4) We propose a toxic loss to prevent overlapping exposure. Overlapping exposure happens when the drug-target modules overlap with each other and with the disease modules.
5) We conduct comprehensive ablation studies to validate the significance of the different modules of the proposed approach.
The paper is organized in the following manner. Section 2 describes in detail the proposed method. The experimental results are presented in Section 3. Finally, the discussion and future work are provided in Section 4.

Materials and methods
The overall schematic of the proposed approach is depicted in Fig. 1. As mentioned in Section 1, the aim is to predict the synergy of a drug pair based on protein-protein, drug-protein, and cell line-protein interactions. The proposed deep learning-based architecture has four main subnetworks: a PPI network, a drug feature extraction network, a protein-compound interaction network, and a synergy prediction network.
At first, the problem formulation is presented, followed by a detailed explanation of each step of the proposed approach.

Problem formulation
Let the dataset be $D = \{(d_1^{(i)}, d_2^{(i)}, c^{(i)}, s^{(i)})\}_{i=1}^{N}$, where $d_1^{(i)}$ and $d_2^{(i)}$ denote the first and second drugs of the $i$th drug pair sample. The synergy value of the paired drugs for a cell line (denoted by $c^{(i)}$) is represented by $s^{(i)}$. The main goal is to design a system that predicts the synergy value of the input drug pair. To this end, the inputs to the approach are the drug pair, the cell line, the PPI graph $G_{pp}$, and the cell line-protein interaction matrix $C_{pc}$, where $G_{pp}$ and $C_{pc}$ are used as auxiliary inputs. In $G_{pp}$, each node is a protein, and two proteins are connected if there are biochemical events and/or electrostatic forces between them. $C_{pc}$ is a matrix with $|P| \times |C|$ elements, where $|P|$ and $|C|$ denote the number of proteins and the number of cell lines, respectively.
The synergy score $s^{(i)}$ is computed based on the zero interaction potency (ZIP) reference model. The ZIP model captures the drug interaction relationships by comparing the changes in the potency of the dose-response curves between individual drugs and their combinations. By combining the advantages of both the Loewe and the Bliss models, the ZIP model assumes that two noninteracting drugs are expected to incur minimal changes in their dose-response curves (Liu et al. 2020).

PPI network
In the proposed approach, node2vec is used to analyse the PPI network by learning a representation for each node of $G_{pp}$. The node2vec network, denoted by $N_P$, produces a representation for each protein based on its neighborhood in $G_{pp}$. The neighborhood of each node is preserved in the learned representation, which can improve the predictive power.
The number of unique proteins is high, which leads to computational issues in the proposed approach. To reduce it, the output of node2vec (the learned feature vectors of the proteins) is fed into a clustering method, the k-means algorithm, to group proteins with similar feature representations. In node2vec, proteins with the same local neighborhood are expected to have similar feature vectors, so clustering places these proteins in the same cluster. The number of clusters is determined experimentally. The final output of clustering is $o_p \in \mathbb{R}^{|P_c| \times n}$, where $|P_c|$ is the number of protein clusters and $n$ is the dimension of the representation space.
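The clustering step can be illustrated with a minimal sketch in plain Python. It assumes that the cluster centroids are used as the rows of the cluster-level feature matrix, which the text does not state explicitly. The function names, the farthest-first initialisation, and the toy 2-dimensional embeddings are illustrative choices, not taken from the released code.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def farthest_first(vectors, k):
    """Deterministic initialisation: repeatedly pick the point farthest from the chosen centroids."""
    centroids = [list(vectors[0])]
    while len(centroids) < k:
        nxt = max(vectors, key=lambda v: min(squared_distance(v, c) for c in centroids))
        centroids.append(list(nxt))
    return centroids


def kmeans(vectors, k, iterations=20):
    """Plain Lloyd's k-means; returns (centroids, cluster index per vector)."""
    centroids = farthest_first(vectors, k)
    assignments = [0] * len(vectors)
    for _ in range(iterations):
        # Assignment step: nearest centroid for every protein embedding.
        assignments = [
            min(range(k), key=lambda c: squared_distance(v, centroids[c]))
            for v in vectors
        ]
        # Update step: each centroid becomes the mean of its members.
        for c in range(k):
            members = [v for v, a in zip(vectors, assignments) if a == c]
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return centroids, assignments


# Toy stand-in for node2vec output: four proteins with 2-dimensional embeddings.
protein_embeddings = [[0.0, 0.1], [0.1, 0.0], [5.0, 5.1], [5.1, 5.0]]
cluster_features, protein_to_cluster = kmeans(protein_embeddings, k=2)
```

Here `cluster_features` has one row per cluster and `protein_to_cluster` maps each protein to its cluster index; in practice a library implementation of k-means would replace the hand-written loop.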

Drug feature extraction network
The proposed approach introduces an architecture based on transformers to extract features from drugs (Fig. 2). One of the main advantages of a transformer, which led us to use it, is that it provides context for any position in the drug molecule. Also, the parts of the drug molecule that contribute most to the prediction receive higher importance. The vision transformer, introduced by Dosovitskiy et al. (2020), is modified to apply to the sequence data derived from the SMILES representation of the drug molecule. To this end, the SMILES representation of the drug molecule is first divided into patches. To avoid losing information, the patches should overlap, which is achieved by choosing an appropriate segment length. Then, each patch is fed into the embedding layer, which includes patch embedding and position embedding. The transformer encoder consists of a normalization layer, a multihead attention layer, and an MLP. The transformer encoder is repeated $L$ times in the architecture to create the feature extraction network. The feature extraction network is shared between the two drugs. Hence, the outputs of the $N_F$ network for the first and second input drugs are denoted by $o_F^{d_1}$ and $o_F^{d_2}$, respectively.
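The patching step can be sketched as follows. The sketch assumes character-level patches and takes the overlap ratio to be the fraction of a patch shared with its neighbor, as defined in the hyperparameter description; the function name and the choice to leave the final patch unpadded are illustrative assumptions.

```python
def overlapping_patches(smiles, patch_size, overlap_ratio):
    """Split a SMILES string into fixed-length patches that overlap their neighbors."""
    overlap = round(patch_size * overlap_ratio)
    step = max(1, patch_size - overlap)  # distance between consecutive patch starts
    patches = []
    # Stop once the remaining tail is already covered by the previous patch.
    for start in range(0, max(len(smiles) - overlap, 1), step):
        patches.append(smiles[start:start + patch_size])
    return patches


# Aspirin as a small example; patch size 8 with half of each patch shared.
aspirin_smiles = "CC(=O)OC1=CC=CC=C1C(=O)O"
patches = overlapping_patches(aspirin_smiles, patch_size=8, overlap_ratio=0.5)
```

With these settings every position except the first and last four characters appears in two patches, so no substructure is cut at a single fixed boundary. A real pipeline would additionally tokenise and pad the patches before embedding.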

Compound-protein interaction network
In this section, the compound-protein interaction network is explained. Many compound-protein interactions are not yet discovered, and the design of this network takes this into account. The labeled compound-protein pairs are limited, and the available pairs are the active ones. All known inactive pairs and unknown pairs are considered inactive. Therefore, the loss function for the protein-compound interaction network must be learned from samples that are all active. To handle this, we use a one-class classification loss. One-class classification has received growing attention in machine learning (Perera and Patel 2019). This paper uses the compactness loss and the descriptiveness loss introduced by Perera and Patel (2019). The inputs of this subnetwork are the feature representation of the drug, $o_F^{d_l^i}$, $l = 1, 2$, and the feature representation of all proteins, $o_p$. The compound-protein interaction network includes a multihead self-attention layer, a dot product layer, and an MLP. The multihead self-attention layer computes an attention mechanism several times in parallel. It is effective because it can jointly attend to multiple important positions. The MLP takes $o_p$ as input and maps it to a new representation ($o_p^m$) with the same distribution as the drug representation space. Moreover, if a drug can bind to some proteins to form a drug-protein complex, the protein and compound binding sites are expected to have the same representation. Hence, $o_F^{d_l^i}$ and $o_p^m$ are fed into the dot product layer to produce the binding affinity of drug $l$ with the $|P_c|$ protein clusters. The output of this subnetwork is a matrix denoted by $o_B^{d_l^i}$. To learn the compound-protein interaction network effectively, the descriptiveness loss ($L_{Descriptive}$) is defined using the cross-entropy loss, which reflects the model's ability to discriminate between the different classes. In our work, similar to Perera and Patel (2019), the compactness loss ($L_{compactness}$) is measured by the variance of each feature batch. This means that the feature vectors of all samples that belong to the same class should be similar.
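As a rough illustration of the compactness idea, the sketch below scores a batch of feature vectors by the mean per-feature variance across the batch. This is a simplified stand-in under that assumption: the exact normalisation used by Perera and Patel (2019) or in the released code may differ, and the function name and toy batches are hypothetical.

```python
def compactness_loss(batch):
    """Mean per-feature (population) variance of a batch of feature vectors."""
    n = len(batch)
    variances = []
    for column in zip(*batch):  # iterate over feature dimensions
        mean = sum(column) / n
        variances.append(sum((x - mean) ** 2 for x in column) / n)
    return sum(variances) / len(variances)


# Identical feature vectors are perfectly compact; spread-out ones are penalised.
tight_batch = [[1.0, 2.0], [1.0, 2.0]]
spread_batch = [[1.0, 2.0], [3.0, 2.0]]
tight_value = compactness_loss(tight_batch)
spread_value = compactness_loss(spread_batch)
```

Minimising this quantity pulls the representations of active compound-target pairs toward each other, which is the behavior the compactness term is meant to encourage.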

Synergy network
The inputs of the synergy network are $o_F^{d_1^i}$ and $o_F^{d_2^i}$, $o_B^{d_1^i}$ and $o_B^{d_2^i}$, $C_{pc}$, and $c^{(i)}$. The network contains several dot product layers. In the first step, $o_B^{d_l^i}$ and $o_p^m$ are fed into a dot product layer. This layer computes the representation of all proteins to which the first drug can bind. A similar computation is done for the second drug. The outputs of the dot product layers for the two drugs are concatenated and fed into an MLP, whose output is denoted by $o_{BM}$. In the second step, $o_B^{d_l^i}$ and the vector of $C_{pc}$ corresponding to $c^{(i)}$, $C_{pc}(c^{(i)})$, are fed into a dot product layer. Its goal is to account for the representation of the signaling proteins related to the input cell line. The outputs of the two dot product layers are concatenated and fed into an MLP whose output is denoted by $o_{CM}$. Finally, $o_{BM}$ and $o_{CM}$ are concatenated and fed into the last MLP layers to predict the synergy value; that is, fusion is performed at the classification stage. The final synergy value for the $i$th drug pair is denoted by $o^{(i)}$. The synergy loss function is a binary cross-entropy loss, defined as follows:
$$L_{Synergy} = -\frac{1}{N} \sum_{i=1}^{N} \left[ s^{(i)} \log o^{(i)} + \left(1 - s^{(i)}\right) \log \left(1 - o^{(i)}\right) \right] \quad (1)$$
where $N$ is the number of training samples and $s^{(i)}$ is the binary synergy label of the $i$th sample.
Figure 2. The overall schematic of the Transformer used as a feature extractor in the proposed approach.
In drug synergy prediction, an effective approach should consider the toxic effect originating from overlapping exposure. It has been shown that overlapping exposure is statistically significant in the occurrence of adverse effects. Hence, we define another term in the overall loss function that accounts for this issue. Each drug is expected to have separate bindable target proteins to prevent toxicities (Cheng et al. 2019, Yang et al. 2021). Therefore, the toxic loss function ($L_{Toxic}$) is defined in terms of the inner product of $o_B^{d_1^i}$ and $o_B^{d_2^i}$. $L_{Toxic}$ is minimized when this inner product is zero, which happens if the first and second drugs have separate bindable target proteins.
In our work, the whole loss is defined as follows:
$$L = L_{Synergy} + k_1 L_{Toxic} + k_2 L_{compactness} + k_3 L_{Descriptive}$$
where $k_1$, $k_2$, and $k_3$ are the weights of the toxic loss, compactness loss, and descriptive loss in the final loss, respectively.
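The way the loss terms fit together can be sketched in a few lines of plain Python. The sketch assumes that the toxic term is the plain inner product of the two binding-affinity vectors and that the synergy term enters the sum with weight 1; neither detail is confirmed by the text, and all function names and toy vectors are hypothetical.

```python
import math


def binary_cross_entropy(target, probability, eps=1e-7):
    """Binary cross-entropy for one sample, with clipping for numerical safety."""
    p = min(max(probability, eps), 1.0 - eps)
    return -(target * math.log(p) + (1 - target) * math.log(1.0 - p))


def toxic_loss(affinity_d1, affinity_d2):
    """Inner product of two binding-affinity vectors (assumed form of the toxic term)."""
    return sum(a * b for a, b in zip(affinity_d1, affinity_d2))


def total_loss(synergy, toxic, compactness, descriptive, k1, k2, k3):
    """Weighted sum of the main loss and the three auxiliary terms."""
    return synergy + k1 * toxic + k2 * compactness + k3 * descriptive


# Two drugs binding disjoint protein clusters incur no toxic penalty.
disjoint = toxic_loss([0.9, 0.0, 0.1], [0.0, 0.8, 0.0])
# Two drugs sharing a strongly bound cluster are penalised.
overlapping = toxic_loss([0.9, 0.0, 0.1], [0.9, 0.0, 0.0])
uncertain_bce = binary_cross_entropy(1, 0.5)
combined = total_loss(1.0, 2.0, 3.0, 4.0, k1=0.5, k2=0.25, k3=1.0)
```

The inner product is zero exactly when no protein cluster receives nonzero affinity from both drugs, which matches the stated condition under which the toxic loss is minimized (for nonnegative affinities).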

Dataset
The approach is applied to two well-known datasets, DrugCombDB (Liu et al. 2020) and OncologyScreen (O'Neil et al. 2016). DrugCombDB is a comprehensive dataset of drug combinations collected from many different resources, such as high-throughput screening, external databases, and manual curation of the PubMed literature. DrugCombDB contains 6 891 566 pairwise drug combinations with 2887 unique drugs and 124 unique cell lines. In our approach, as in competing approaches, we used a reduced version of the DrugCombDB dataset. It contains 69 436 pairwise drug combinations with 764 unique drugs and 76 unique cell lines. OncologyScreen is a smaller drug combination dataset. It contains 4176 pairwise drug combinations with 21 unique drugs and 29 unique cell lines.
On both datasets, we perform 5-fold cross-validation. We divide the whole dataset into five equal parts; in each run, four parts are used as training and validation data, and the remaining part is used as test data. This procedure is repeated five times, and the average performance is reported.
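The splitting procedure can be sketched as below. The shuffling seed and the function name are illustrative assumptions, and the sketch leaves out the further division of the four training parts into training and validation data, which the text does not specify.

```python
import random


def k_fold_indices(n_samples, k=5, seed=0):
    """Yield (train_indices, test_indices) for each of k disjoint folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]  # k near-equal parts
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


# Ten toy samples: each fold holds out two samples and trains on the other eight.
splits = list(k_fold_indices(10, k=5))
```

Each sample is held out exactly once, so averaging the five test scores uses every sample for evaluation without ever testing on training data.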

Performance measures
To describe the performance of the predictive model, performance is evaluated by measures that are common for classification tasks: the area under the receiver operating characteristic curve (AUC-ROC), the area under the precision-recall curve (AUC-PR), accuracy (ACC), recall, and F1 score. These measures are selected to address different characteristics of the learned models.
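For reference, the threshold-based measures and AUC-ROC can be computed as in the following minimal sketch. The 0.5 decision threshold, the function names, and the toy labels and scores are assumptions for illustration; AUC-ROC is computed here through its pairwise-ranking interpretation.

```python
def classification_metrics(y_true, y_pred):
    """Accuracy, recall, and F1 score from binary labels and binary predictions."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    accuracy = (tp + tn) / len(y_true)
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * tp / (2 * tp + fp + fn) if tp + fp + fn else 0.0
    return accuracy, recall, f1


def roc_auc(y_true, scores):
    """Probability that a random positive is scored above a random negative (ties count half)."""
    positives = [s for t, s in zip(y_true, scores) if t == 1]
    negatives = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in positives for n in negatives)
    return wins / (len(positives) * len(negatives))


# Toy labels and predicted synergy probabilities (made up for illustration).
labels = [1, 1, 0, 0, 1]
scores = [0.9, 0.4, 0.35, 0.6, 0.8]
predictions = [1 if s >= 0.5 else 0 for s in scores]
accuracy, recall, f1 = classification_metrics(labels, predictions)
auc = roc_auc(labels, scores)
```

Recall depends only on the positive class, which is why it is the measure most sensitive to false negatives among those listed.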
In all experiments, hyperparameter optimization is done using grid search. The number of clusters is searched over {200, 300, 400, 600}, the learning rate over {0.0001, 0.001, 0.01, 0.1}, the weights of the toxic loss, compactness loss, and descriptive loss ($k_1$, $k_2$, and $k_3$) over {1/4, 1/2, 1}, the patch size over {20, 30, 40, 50, 60}, and the overlap stride ratio over {1/4, 1/3, 1/2, 3/4}. The overlap stride ratio $r$ is defined as the fraction of a patch that overlaps with the neighboring patch.
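To give a sense of the size of this search space, the sketch below enumerates the grid with `itertools.product`. It assumes the grid is fully joint across all seven hyperparameters, which the text does not confirm; the dictionary keys are hypothetical names.

```python
from itertools import product

# Candidate values as listed in the text; the key names are illustrative.
search_space = {
    "n_clusters": [200, 300, 400, 600],
    "learning_rate": [0.0001, 0.001, 0.01, 0.1],
    "k1": [0.25, 0.5, 1.0],
    "k2": [0.25, 0.5, 1.0],
    "k3": [0.25, 0.5, 1.0],
    "patch_size": [20, 30, 40, 50, 60],
    "overlap_ratio": [1 / 4, 1 / 3, 1 / 2, 3 / 4],
}


def grid(space):
    """Yield one dictionary per combination of hyperparameter values."""
    names = list(space)
    for values in product(*(space[name] for name in names)):
        yield dict(zip(names, values))


configurations = list(grid(search_space))
n_configurations = len(configurations)  # 4 * 4 * 3 * 3 * 3 * 5 * 4
```

A fully joint grid would contain 8640 configurations, each requiring its own cross-validated training run, so in practice the search may well have been staged or restricted.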

Ablation study
To show the impact of the different modules on the performance of the proposed approach, an ablation study is carried out. Three versions of the proposed approach are created: (i) Transformer: only the synergy loss is considered; (ii) Transformer+Toxic: the synergy and toxic losses are considered; and (iii) Transformer+DTI: the synergy and interaction losses are considered. The results of the ablation study on the DrugCombDB and OncologyScreen datasets are shown in Tables 1 and 2, respectively. As shown in Table 1, the proposed approach (DeepTraSynergy), which comprises the synergy, toxic, and interaction losses, obtains the best results on the considered performance metrics. From the ablation study, it can be concluded that each component (i.e. synergy loss, toxic loss, interaction loss) brings further improvement.
From Table 2, DeepTraSynergy achieves the best performance in all five measures: ACC, Recall, AUC-ROC, AUC-PR, and F1-score. Table 2 shows that adding drug-target interaction knowledge (Transformer+DTI) improves ACC by 1.62% over the Transformer.

DeepTraSynergy improves ACC even more (by 4.41%), which confirms that each module of the proposed approach contributes to the performance improvement. To evaluate statistically whether the improvement from the different modules is significant, a t-test is used at a significance level of 0.05. The P-values in Tables 1 and 2 show that DeepTraSynergy (the full proposed approach) outperforms the other versions.
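One way such a comparison can be carried out is a paired t-test over per-fold scores, sketched below. The text does not say whether the test was paired or unpaired, so the paired form is an assumption, and the fold accuracies are made-up numbers for illustration only, not results from the paper.

```python
import math
import statistics


def paired_t_statistic(scores_a, scores_b):
    """t statistic of the per-fold differences between two models."""
    differences = [a - b for a, b in zip(scores_a, scores_b)]
    mean_difference = statistics.mean(differences)
    standard_error = statistics.stdev(differences) / math.sqrt(len(differences))
    return mean_difference / standard_error


# Hypothetical 5-fold accuracies (NOT values reported in the paper).
full_model = [0.78, 0.77, 0.79, 0.76, 0.78]
ablated_model = [0.76, 0.74, 0.78, 0.72, 0.76]

t_value = paired_t_statistic(full_model, ablated_model)
# Two-sided critical value of Student's t at the 0.05 level with 4 degrees of freedom.
T_CRITICAL = 2.776
is_significant = abs(t_value) > T_CRITICAL
```

With five folds there are only four degrees of freedom, so the difference between two models must be consistent across folds to reach significance at the 0.05 level.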

Method comparison
To show the effectiveness of the proposed approach, three state-of-the-art approaches including GraRep (Cao et al.
As shown in Tables 3 and 4, our approach obtains the best results compared with the KGNN, GCN, and DeepSynergy methods. Compared with the GraphSynergy method, the AUC-ROC value on the DrugCombDB dataset is lower. On DrugCombDB, we obtain a 3.27% improvement in the AUC-PR measure over the best competing method, which indicates that our approach handles the positive class better than the other approaches. DeepTraSynergy shows its largest improvement in the recall measure. From these results, it can be inferred that the proposed approach effectively lowers the number of false-negative samples. As a result, DeepTraSynergy outperforms GraphSynergy and NEXGB in predicting synergistic drug pairs. We used a t-test to evaluate the proposed approach statistically against the other approaches. The results show that on the DrugCombDB dataset, DeepTraSynergy outperforms five approaches at a significance level of 0.05 and outperforms GraphSynergy and NEXGB with a P-value lower than .1. On the OncologyScreen dataset, DeepTraSynergy outperforms all other approaches with a P-value lower than .02. From the results in Tables 2 and 4, a simple architecture in which a transformer is used as a feature extractor (i.e. Transformer) obtains results comparable to those of the other methods, which suggests that the Transformer can learn more discriminative features.
It should be noted that the OncologyScreen dataset is a subset of the O'Neil dataset. The OncologyScreen dataset contains 4176 drug pair-cell line combinations with 21 unique drugs and 29 unique cell lines, whereas the O'Neil dataset contains all drug-drug-cell line triplets of the OncologyScreen dataset plus additional samples. In total, it has 13 243 unique drug pair-cell line combinations, consisting of 38 drugs and 31 cell lines. The original dataset contains 23 052 drug pair-cell line records. After removing replicates, as in Wang et al. (2021) and Hu et al. (2022), 13 243 triplets remain. Also following Wang et al. (2021) and Hu et al. (2022), we used 10 as the threshold to classify the triplets as synergistic or nonsynergistic. The results are shown in Fig. 3. DeepTraSynergy obtains better results in all measures.
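The preprocessing described above can be illustrated with a small sketch. It assumes that replicated triplets are merged by averaging their synergy scores and that a score strictly above 10 counts as synergistic; the text states only that replicates were removed and that 10 was the threshold. The drug and cell line names are placeholders.

```python
from collections import defaultdict


def merge_replicates(records):
    """Average the scores of records that share the same unordered drug pair and cell line."""
    grouped = defaultdict(list)
    for drug_a, drug_b, cell_line, score in records:
        key = (*sorted((drug_a, drug_b)), cell_line)  # drug order does not matter
        grouped[key].append(score)
    return {key: sum(values) / len(values) for key, values in grouped.items()}


def synergy_label(score, threshold=10.0):
    """1 for synergistic, 0 for nonsynergistic."""
    return 1 if score > threshold else 0


# Placeholder records: the first two are replicates of the same triplet.
records = [
    ("drug_A", "drug_B", "cell_X", 12.0),
    ("drug_B", "drug_A", "cell_X", 16.0),
    ("drug_A", "drug_C", "cell_X", 4.0),
]
merged = merge_replicates(records)
labels = {key: synergy_label(score) for key, score in merged.items()}
```

Sorting the two drug names before building the key treats (A, B) and (B, A) as the same combination, which is what makes the replicate count drop.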

Investigate the effect of patch size
To investigate the effect of patch size and overlap stride ratio on the final performance of the proposed approach, we ran an experiment with the best hyperparameter values for the learning rate (0.001) and the number of clusters (300). We then varied the patch size over {20, 30, 40, 50, 60} and the overlap stride ratio over {1/4, 1/3, 1/2, 3/4} and applied the approach to the DrugCombDB dataset. The results are given in Table 5. When the overlap stride ratio is set to 1/2, the proposed approach performs better at every patch size. A low overlap stride ratio means that two neighboring patches overlap only slightly. Moreover, the best AUC-PR value is obtained at a patch size of 40.
We designed another experiment to investigate how well the proposed approach predicts synergy for new, unseen data. For the DrugCombDB dataset, we used the leave-one-cell-line-out setting. In this setting, all samples related to one cell line are excluded, the model is trained on the rest, and the trained model is then applied to the excluded samples. This procedure is repeated for all cell lines, and the results are averaged. The results are shown in Table 6. In all measures except AUC-ROC, the proposed approach performs better than the other approaches.
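The leave-one-cell-line-out protocol can be sketched as follows. The record layout (dictionaries with a `cell_line` key) and the toy samples are illustrative assumptions; only A375 is a cell line named in the text.

```python
def leave_one_cell_line_out(samples):
    """Yield (held_out_cell_line, train_samples, test_samples) for every cell line."""
    cell_lines = sorted({sample["cell_line"] for sample in samples})
    for held_out in cell_lines:
        train = [s for s in samples if s["cell_line"] != held_out]
        test = [s for s in samples if s["cell_line"] == held_out]
        yield held_out, train, test


# Toy samples with placeholder drug names and one placeholder cell line.
samples = [
    {"drugs": ("drug_A", "drug_B"), "cell_line": "A375", "label": 1},
    {"drugs": ("drug_A", "drug_C"), "cell_line": "A375", "label": 0},
    {"drugs": ("drug_A", "drug_B"), "cell_line": "cell_Y", "label": 0},
]
splits = list(leave_one_cell_line_out(samples))
```

Because no sample from the held-out cell line is seen during training, the averaged score estimates performance on a cell line that is entirely new to the model.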

Evaluation of independent test set
In this section, the goal is to evaluate the generalization ability of the proposed approach. To do so, we trained the model on one dataset and applied it to an independent dataset. Specifically, the model is trained on the O'Neil dataset and evaluated on the DrugCombDB dataset. The reason is that DrugCombDB is bigger than the O'Neil dataset and contains some cell lines and drugs that do not exist in the O'Neil dataset. Table 7 shows the results, which are compared with those of the model trained and tested on the same dataset (i.e. the DrugCombDB dataset). By comparing Table 7 with the reported results of the state-of-the-art methods in Table 3, we found that the proposed method achieves results comparable to many of them, such as DeepSynergy, GraRep, and KGNN. This supports the generalizability of the proposed approach. To compare the generalization ability with the state-of-the-art methods, we downloaded the provided source code of GraphSynergy and DeepSynergy and ran it with the same setting. GraphSynergy scored zero on all measures in this setting, which is due to its algorithm. The comparison between DeepSynergy and DeepTraSynergy is given in Fig. 4.

Performance evaluation by hyperparameter tuning
To show how sensitive the proposed approach is to hyperparameter tuning, we ran another experiment with two scenarios: (i) we tune the hyperparameter values on the DrugCombDB dataset and train and test the model on the DrugCombDB dataset; (ii) we tune the hyperparameter values on the OncologyScreen dataset and train and test the model on the DrugCombDB dataset. The results are shown in Table 8-DrugCombDB. There is a small difference between the results, but it is not significant. We also ran the reverse experiment. First, we tune the hyperparameter values on the OncologyScreen dataset and train and test the model on the OncologyScreen dataset. Then, we tune the hyperparameter values on the DrugCombDB dataset and train and test the model on the OncologyScreen dataset. The results are shown in Table 8-OncologyScreen. In this case, too, the results differ only slightly.

Predicting novel synergistic combinations
In this section, we designed an experiment to predict novel synergistic combinations using DeepTraSynergy. The top ten predicted combinations for the A375 cell line are given in Table 9. We carried out a literature search and found literature confirmation for at least five of them. The references to the related publications are given in Table 9. Temsirolimus appears in seven of the ten predicted combinations. The reason is that it is an antineoplastic agent used in the treatment of renal cell carcinoma. By comparing the reported results of DeepDDS and DeepTraSynergy, we found that only one combination, Copanlisib and Regorafenib, is common to the top ten predicted synergistic combinations of the two approaches. For some combinations, we did not find any literature confirmation; however, in the Axitinib and Temsirolimus combination, for example, Axitinib has demonstrable single-agent activity in melanoma (Fruehauf et al. 2008). It has also been shown that Talazoparib in combination with Niraparib could treat melanoma cells (Jonuscheit et al. 2021). The top 100 highest-scored predictions are provided as an Excel file in Supplementary Data.

Discussion
This paper presents a new method for predicting drug synergies. The contributions of the proposed approach are the use of transformers as feature extractors and a new architecture that uses auxiliary knowledge such as the protein-protein interaction network, compound-protein interaction, and cell-protein interaction. The transformer-based feature extractor simultaneously captures the local structure and encodes the long-range dependencies. Since only a limited number of drug combinations and drug-protein interactions exist for each specific cell line, some hidden connections in the network may be obscured. To address this problem, the proposed architecture incorporates two additional auxiliary tasks, drug-protein interaction and toxicity prediction, into drug synergy prediction. The ablation study and the comparison with state-of-the-art approaches indicate that the transformer-based feature extractor (without using any other knowledge) learns more discriminative features for drug molecules. Moreover, the experiments show that incorporating a compound-protein interaction network in the proposed approach can improve the results.
To evaluate the generalization ability of the proposed approach, we evaluated the learned model on an independent dataset, and the results indicate that the proposed approach generalizes well. Also, when predicting new synergistic combinations for the A375 cell line, we found that at least five of the top ten predicted pairs have literature confirmation.
For future work, we seek a better way to learn the compound-protein interaction network. One of the main contributions of the proposed approach is showing that the drug-protein interaction network plays an important role in drug synergy prediction. Moreover, not only the drug-protein interaction itself but also how drugs bind to the protein or proteins (in the same cell line) provides meaningful information for drug synergy prediction. Hence, identifying the pockets to which a drug can bind and incorporating this information could improve drug synergy prediction performance. In addition, the way knowledge from different modalities is combined could be enhanced using attention-based fusion techniques.

Supplementary data
Supplementary data are available at Bioinformatics online.

Conflict of interest
None declared.

Funding
This research did not receive any specific grant from funding agencies in the public, commercial, or not-for-profit sectors.