Abstract

Motivation

A global effort is underway to identify compounds for the treatment of COVID-19. Since de novo compound design is a lengthy and expensive process, researchers are also seeking existing compounds that can be repurposed for COVID-19 and new viral diseases.

We propose a machine learning representation framework that uses deep learning induced vector embeddings of compounds and viral proteins as features to predict compound-viral protein activity. The prediction model in turn uses a consensus framework to rank approved compounds against viral proteins of interest.

Results

Our consensus framework achieves a high mean Pearson correlation of 0.916, a mean R² of 0.840 and a low mean squared error of 0.313 for the task of compound-viral protein activity prediction on an independent test set. As a use case, we identify a ranked list of 47 compounds common to three main proteins of the SARS-COV-2 virus (PL-PRO, 3CL-PRO and Spike protein) as potential therapeutic agents, including 21 antivirals, 15 anticancer, 5 antibiotics and 6 other investigational human compounds. We perform additional molecular docking simulations to demonstrate that the majority of these compounds have low binding energies, and thus high binding affinity, with the potential to be effective against the SARS-COV-2 virus.

Availability and implementation

All the source code and data are available at: https://github.com/raghvendra5688/Drug-Repurposing and https://dx.doi.org/10.17632/8rrwnbcgmx.3. We also implemented a web-server at: https://machinelearning-protein.qcri.org/index.html.

Supplementary information

Supplementary data are available at Bioinformatics online.

1 Introduction

The outbreak of COVID-19 started in December 2019, in China’s Hubei province (Dong et al., 2020), and to date, this pandemic has caused over 95 million infections and over 2 million deaths worldwide (World Health Organization, 2020). There is an immediate need for effective treatment and vaccines to contain the spread of this pandemic. Given the time and resources required to develop new compounds to treat COVID-19 and emerging viral diseases, it is not feasible to rely completely on the traditional process of compound discovery, which takes on average 15 years and costs $2–3 billion to bring a new compound to market (Pushpakom et al., 2019). A more pragmatic approach is drug repurposing: using novel in silico techniques to accurately identify a set of candidate compounds that can exhibit high activity against viral proteins and potentially inhibit them.

In this article, we present a consensus framework of in silico embedding-based modeling techniques, which utilizes different combinations of representations for compounds and viral proteins including:

  • Morgan Fingerprints (MFP) (Capecchi et al., 2020) as chemoinformatic descriptors of compounds + a convolutional neural network (CNN) (LeCun et al., 1995) autoencoder-based vector representation for viral protein sequence.

  • A teacher forcing—long short-term memory neural network (TF-LSTM) (Lamb et al., 2016) autoencoder-based vector representation for compounds + CNN autoencoder-based vector representation for viral proteins.

  • Canonical SMILES based sequential representation of compounds + Primary structure (linear chain of amino acid) based sequential representation of viral proteins.

The goal of the consensus framework is to identify known and investigational compounds as candidates for viral diseases, using COVID-19 as a specific use case. The crux of our approach is that when new viruses emerge, already collected information on other viruses might be useful for inferring virus-specific compound activity. This is further supported by observations in quantitative structure–activity relationship (QSAR) models (Roy et al., 2015), where the intuition that compounds with similarities in structure and physio-chemical properties tend to have similar activities against given viral proteins is commonly utilized. For our use case, we focus on primary protein targets of severe acute respiratory syndrome-related coronavirus 2 (SARS-CoV-2).

In the recent literature, a plethora of AI and network medicine-based approaches have been applied for drug repurposing/repositioning (Beck et al., 2020; Gysi et al., 2020; Zeng et al., 2020; Zhou et al., 2020). The most commonly solved problem is the prediction of interaction/activity/binding affinity between compounds and protein targets using a variety of AI methods (Thafar et al., 2019). One limitation of these approaches is that they are often trained on human protein sequences (kinases, nuclear receptors and G-protein-coupled receptors), which are very different from viral protein sequences, and hence they may not generalize well to in silico compound-viral protein activity prediction. In another work, AtomNet (Wallach et al., 2015), the authors predict compound-protein binding affinity using 3D structural information extracted by a convolutional neural network (CNN). However, high-quality 3D structures for novel viruses are seldom available. In Gao et al. (2018), compound–target interactions were predicted using a hybrid of a graph neural network (Kipf and Welling, 2016) and a recurrent neural network (RNN). Similarly, in Beck et al. (2020), a hybrid CNN and RNN model called molecule transformer drug–target interaction predictor was proposed to identify known antiviral drugs for the potential treatment of SARS-CoV-2 infection. One limitation of these approaches is that the models are trained on labeled compound–viral protein interactions in databases such as ChEMBL and do not benefit from the viral protein sequences (∼2.5 million) available in other databases such as Uniprot, or from the compounds (∼2.5 million) in databases such as PubChem, because labeled interaction information is missing for them. However, as shown in Rao et al. (2019), unsupervised or self-supervised learning on unlabeled data (i.e. learning a vector representation for a given data type) can greatly benefit the downstream supervised learning task.

Furthermore, there also exist network medicine-based approaches which use knowledge graph representations (Gysi et al., 2020; Zeng et al., 2020; Zhou et al., 2020) in combination with graph-theoretic (network propagation, network proximity and diffusion) as well as graph neural network-based approaches to identify potential compounds targeting COVID-19 as a disease. The knowledge graph is constructed using the interactions between multiple entities such as diseases, compounds, genes, human proteins and the viral protein interactome. The goal is usually to identify links between existing approved compounds and new diseases such as COVID-19. The authors of Zeng et al. (2020) highlighted that the quality of the knowledge graph originally constructed from noisy sources was a potential limitation, which could impact their downstream Cov-KGE models. Additionally, these techniques can benefit from the vector representations learned for compounds and viral proteins using an unsupervised learning framework as proposed in our work. The vector representation can be used in addition to the node representation learned through the graph embedding procedure. In Gysi et al. (2020), the authors indicated that their deep graph neural network approach does not consider node features and is currently based only on the topology of the underlying graph.

In this article, we try to address several of these limitations following a data-driven perspective. We collect information about various viral organisms, their main proteins and their known compound interactions from a plethora of resources including ChEMBL (Gaulton et al., 2017), PubChem (Kim et al., 2016), NCBI (Wheeler et al., 2008), UniProt (The UniProt Consortium, 2017), DrugBank (Wishart et al., 2018) etc. In this work, we use the terms small molecules and compounds interchangeably. The traditional approach for estimating compound (ligand) activity for a particular viral protein (enzyme) is through molecular docking (Kitchen et al., 2004). Molecular docking inherently requires a high-quality 3D crystal structure of the protein of interest as well as annotation information about the presence of active sites (Chakraborti and Srinivasan, 2020). Moreover, it is computationally expensive to perform docking simulations for a large number of compounds in combination with many viral proteins. In contrast, it is relatively easy to collect information about the primary structure (linear chain of amino acids) of proteins associated with viruses from resources such as UniProt. Likewise, chemical information for compounds in the form of SMILES strings is readily available in resources such as DrugBank and ChEMBL. Finally, standardized activity (inhibition/potency/affinity) information for a plethora of compound-viral protein combinations is available in databases such as PubChem and ChEMBL.

These are essential resources required to build in silico embedding-based compound-viral protein activity predictors using machine learning (ML) techniques. The primary notion is that, given a large dataset of compound-viral protein activities, ML models can identify frequently occurring patterns, such as k-mers in the viral protein sequences and subsequences in the SMILES representation of compounds (or frequently occurring patterns in the MFP), that together drive the activity values to be high or low.

Our primary contributions are:

  • Collection and curation of compound-viral protein activity from resources such as PubChem and ChEMBL leading to >60k interactions between >50k compounds and 97 viral organisms.

  • Propose autoencoder frameworks (unsupervised) to obtain numeric vector representations for compounds (2.5 million) and viral proteins (2.5 million), respectively, which can be utilized for downstream compound-viral protein activity prediction task by traditional supervised ML techniques.

  • Propose four different end-to-end deep learning techniques to predict compound-viral protein activity based on SMILES strings of compounds and primary structure of viral proteins.

  • Showcase the effectiveness of the consensus framework as it outperforms all the individual modeling techniques on the test set.

  • Identify a ranked list of 47 compounds as potential therapeutic agents for COVID-19 by targeting the three main proteins of the SARS-COV-2 virus using our consensus framework. These include 21 antivirals, 15 anticancer, 5 antibiotics and 6 other investigational human compounds.

  • Demonstrate that the majority of the compounds in the top-ranked list attain low binding energies (high binding affinity) in molecular docking experiments for each of the three viral proteins of the SARS-COV-2 virus.

  • Provide a general and extensible framework where individual components can be replaced to test if the respective change helps to improve the overall results. The entire source code is made publicly available (https://github.com/raghvendra5688/Drug-Repurposing) and a web-server (https://machinelearning-protein.qcri.org/index.html) is also provided for the ease of nonexperts.

Figure 1 illustrates our compound-viral activity prediction framework.

Flowchart of our proposed consensus framework. We collect ≈ 2.5 million SMILES representations of compounds from MOSES and ChEMBL databases. This is utilized to learn a SMILES embedding representation (numeric vector representation) via a TF-LSTM autoencoder model. We also collect ≈ 2.5 million viral protein amino acid (AA) sequences from Uniprot database. These are passed through a CNN autoencoder to learn viral protein embedding representation (numeric vector representation). We collect, curate and assimilate compound-viral protein activities from resources such as NCBI, PubChem and ChEMBL to build our dataset (D). The corresponding bioactivities in these samples are transformed into a standardized pChEMBL value and are used to build downstream regression models. These regression models are various machine learning (ML) techniques which take advantage of different representations of compounds and viral proteins for in silico compound–viral protein activity prediction. We then take a consensus of the top 5 predictors based on their performance w.r.t. 4 evaluation metrics on the test set. Here ‘blue’ color edges correspond to traditional ML models based on SMILES embedding + Protein embedding representations, ‘red’ color edges represent ML models based on Morgan Fingerprint (chemoinformatic descriptors) + Protein embedding representations and ‘green’ color edges correspond to end-to-end deep learning models based on canonical SMILES + Protein Amino Acid (AA) Sequence representations for predicting compound-viral protein activities. (Color version of this figure is available at Bioinformatics online.)
Fig. 1.


2 Materials and Methods

In order to build our in silico embedding-based compound-viral protein activity predictors, we collected information about compounds, viral protein sequences and compound–viral protein interactions (activity values) from resources such as MOSES (Polykovskiy et al., 2018), ChEMBL, UniProt, PubChem and NCBI. Below, we describe the details of data collection and curation steps required for the preparation of quality data, essential for accurate downstream predictive models.

2.1 Data collection and curation

2.1.1 Compounds

We initially collected 556 134 SMILES strings for compounds used in Gupta et al. (2018). However, in order to have a more robust and realistic set of molecules, the dataset was augmented with 1 936 962 compounds available in the MOSES dataset (Polykovskiy et al., 2018). Together, these two datasets represented 2.5 million SMILES for compounds. We then filtered this dataset to remove salts and stereochemical information. In Gupta et al. (2018), the authors restricted their canonical SMILES sequence length to the range [34,74] for their LSTM-based compound generation methodology. In Connor (2020), the authors highlighted that increasing the sequence length to 128 characters led to better quality compound generation using an LSTM framework. In our work, we include compounds whose SMILES string lengths are in the range [10,128], so that both small compounds and large ligands are part of our chemical search space, which is more inclusive and comprehensive than that used in Gupta et al. (2018). As a result, our final compound set S consisted of 2 459 695 canonical SMILES for small molecules.

To train the majority of traditional supervised ML algorithms, it is essential to have a numeric vector representation for compounds. We used the set S to train a TF-LSTM (Gers et al., 1999; Lamb et al., 2016) based autoencoder (Kramer, 1991) which generates a low-dimensional vector representation (LSc) for each compound. Furthermore, to have a comprehensive comparison, we also used traditional chemoinformatic descriptors such as Morgan Fingerprints (MFP) (Capecchi et al., 2020) derived from compound structure as an alternative vector representation for each compound.

2.1.2 Viral proteins

We downloaded all the viral protein sequences available in UniProt (The UniProt Consortium, 2017), comprising a total of 2 684 774 protein sequences. Among these, 10 685 are deposited in SwissProt (Boeckmann et al., 2003), i.e. are manually curated and functionally annotated, whereas the remaining 2 674 089 are obtained from TrEMBL (Boeckmann et al., 2003) and are not well-curated. These viral proteins span over 2742 viral organisms. A necessary condition for training deep learning models with protein sequences is to have a fixed length L. In Khurana et al. (2018) and Elbasir et al. (2019), the authors used sequence lengths of 800 and 1200 for training their deep learning models. In this work, we filter viral proteins to keep sequences with L ≤ 2000, resulting in a set V of 2 658 225 viral protein sequences, thereby retaining 99% of all viral proteins available in Uniprot.

In order to train traditional supervised methods, it is essential to have numeric vector representation for protein sequences. We utilized the set V to train a CNN (LeCun et al., 1995) based autoencoder which then generates the required low dimensional representation (LSv) for each viral protein sequence.

2.1.3 Compound-viral protein activities

The primary focus of our use case is the three main proteins of the SARS-COV-2 virus: papain-like proteinase (PL-PRO), 3C-like proteinase (3CL-PRO, also referred to as the cleavage protein) and the Spike glycoprotein (S glycoprotein). We centered our work on these SARS-COV-2 proteins for the following reasons: (i) High-quality 3D structures are deposited in the protein data bank (PDB) (Protein Data Bank, 1971) (PDB Ids: 6W02, 5R7Y, 6M0J respectively), which makes validation possible through molecular docking experiments. (ii) For several other viral organisms, the PL-PRO and 3CL-PRO are the main proteins targeted by compounds (Fear et al., 2007). (iii) It has been shown (Lan et al., 2020) that the Spike protein attaches the virion to the cell membrane by interacting with the host receptor, initiating the infection.

However, our proposed framework can easily be extended to other viral proteins associated with the SARS-COV-2 virus as well as proteins associated with other viruses. As the SARS-COV-2 is a new virus, it is harder to get quality data about compound-viral protein activity. However, information about similar viruses, their main proteins and small molecules used to target these viral proteins are available in repositories such as PubChem, ChEMBL and BindingDB (Liu et al., 2007).

We initially searched for compound activity information related to SARS-COV-1 (SARS-1), Middle East Respiratory Syndrome (MERS), Human Immunodeficiency Virus (HIV) and Hepacivirus C (HepC) using the ‘PUG-REST’ API of NCBI (Wheeler et al., 2008), which was used to download raw information from various NCBI Assay records. We processed only those records which contain Assay IDs (AID). A given assay can report different kinds of compound bioactivities depending on the objective of the study. These bioactivities include measurements such as IC50, EC50, AC50, Ki, Kd, Potency etc. as described in Haas et al. (2017). These biological activities are standard potency measures that are derived from dose–response assays at different concentrations designed to measure activation or inhibition of targets and pathways of pharmacological significance (Haas et al., 2017).

We note that these bioactivity measurements may vary across assays, but to obtain a large set of compound-viral protein activities for the in silico modeling techniques, it is essential to combine several of these bioactivities with certain restrictions. For example, we filter out those records which do not contain a PubChem standard value for activity (as otherwise, it is difficult to have an unbiased comparison of compound activities). The PubChem standard value for bioactivity is measured in micromolar (µM = $10^{-6}$ M) concentration. We initially selected records containing an IC50 value, as done by Ullah et al. (2017), which is the concentration of a compound at which 50% inhibition of a viral protein is observed. Furthermore, it is known from enzyme kinetics (Cheng-Prusoff Equation (16)) that when a compound binds to a protein in an uncompetitive scenario, i.e. an assay, the Ki value is equal to the IC50 value. Similarly, it was shown in Thafar et al. (2019) that records containing Kd and Potency values as bioactivities (measured in PubChem standard value, i.e. μM) can be combined with those holding IC50 and Ki values. Here, combining corresponds to creating a dataset which includes all compound-viral protein bioactivity samples that had Ki, Kd, IC50 or Potency as label information for the downstream supervised learning task. Thus, using these 4 measurements of compound-viral protein activities and filtering records based on the aforementioned sequence lengths of compounds and viral proteins, we obtain an interaction set of 13 763 compound-viral protein activities from PubChem.

We next downloaded all compound and viral protein interactions available in the ChEMBL (Gaulton et al., 2017) repository. As a part of the internal quality checks provided by ChEMBL, we include only those compound–viral protein interactions which have a confidence score of at least 5. The confidence score value reflects both the type of target assigned to a particular assay and the confidence that the target assigned is the correct target for that assay. As stated in Gaulton et al. (2017), assays assigned a nonmolecular target type, e.g. a cell-line or an organism, receive a confidence score of 1, while assays with assigned protein targets receive a confidence score of at least 5. Moreover, we remove those activities for which a standard pChEMBL value is not available. The myriad published activities from heterogeneous resources utilized by ChEMBL are converted into a standardized activity, namely, the pChEMBL value. This value allows us to compare different measures of half-maximal response on a negative logarithmic scale. For instance, an IC50 value of 1 nanomolar (nM = $10^{-9}$ M) would have a pChEMBL value of 9. The PubChem standard value is measured in micromolar (μM) concentration whereas the pChEMBL value is derived from measurements in nanomolar (nM) concentration. Hence, in order to change units and convert bioactivity measurements obtained from PubChem to the standard pChEMBL value, we use the following formula: $\mathrm{pChEMBL} = -\log_{10}(\mathrm{Activity}_{\mathrm{PubChem}}) + 6$.

Here, $\mathrm{Activity}_{\mathrm{PubChem}}$ corresponds to either IC50, Ki, Kd or Potency. Hence $10^{-3}$ units of PubChem standard value, or 1 nM, corresponds to a pChEMBL value of 9 ($= -\log_{10}(10^{-3}) + 6$). We initially obtain a set of 92 638 such compound-viral protein activities and after filtering for only those records which contain IC50, Ki, Kd and Potency as standard types, we limit the set to 62 219 interactions. We then remove records where the compounds contain salts or where the corresponding SMILES string exceeds 128 characters. We truncated viral protein sequences to have a maximal length L = 2000 amino acids in the interaction set. This results in a final set of 54 756 bioactivity samples obtained and curated via ChEMBL.
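The unit conversion described above can be illustrated with a minimal sketch. It assumes the activity is given in micromolar (the PubChem standard unit) and that the logarithm carries a negative sign, as the negative logarithmic scale of the pChEMBL value and the worked example (1 nM gives 9) imply. The function name `pubchem_um_to_pchembl` is hypothetical and is not taken from the authors' repository.

```python
import math


def pubchem_um_to_pchembl(activity_um: float) -> float:
    """Convert an activity in micromolar (PubChem standard value) to a pChEMBL value.

    pChEMBL = -log10(activity in mol/L) = -log10(activity_um * 1e-6)
            = -log10(activity_um) + 6
    """
    if activity_um <= 0:
        raise ValueError("activity must be a positive concentration")
    return -math.log10(activity_um) + 6.0


# 1 nM = 1e-3 uM corresponds to a pChEMBL value of 9;
# 1 uM corresponds to 6; 1000 uM (1 mM) corresponds to 3.
one_nanomolar = pubchem_um_to_pchembl(1e-3)
one_micromolar = pubchem_um_to_pchembl(1.0)
one_millimolar = pubchem_um_to_pchembl(1e3)
```

Larger pChEMBL values therefore indicate more potent compounds, since a lower concentration is needed to reach the half-maximal response.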

We take a union of the two data sources (ChEMBL and PubChem) resulting in the dataset D consisting of 60 195 such interactions. These interactions comprise 54 617 unique compounds, 153 unique viral protein sequences (based on Uniprot accession ids), and span over 97 different viral organisms. We randomly split the dataset D into Dtrain (54 175 interactions) and Dtest (6020 activities) in the ratio of 9:1, which are then used as the training and independent test set respectively for the task of building in silico embedding-based compound-viral protein activity predictors. The independent test set is pertinent to our framework as it enables us to take the consensus (mean) of the top k predictive models based on their performance on the test set.
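The consensus step mentioned above (the mean of the top k predictive models) can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: models are ranked here by mean squared error only, whereas the framework ranks them with respect to four evaluation metrics, and the function name `consensus_prediction` and the toy model names are hypothetical.

```python
import numpy as np


def consensus_prediction(predictions, y_true, k=5):
    """Average the predictions of the k models with the lowest mean squared error.

    predictions : dict mapping a model name to its predicted activity values
    y_true      : observed activity values for the same samples
    Returns the names of the selected models and their mean prediction.
    """
    y_true = np.asarray(y_true, dtype=float)
    mse = {
        name: float(np.mean((np.asarray(pred, dtype=float) - y_true) ** 2))
        for name, pred in predictions.items()
    }
    top = sorted(mse, key=mse.get)[:k]  # lowest error first
    consensus = np.mean([np.asarray(predictions[name], dtype=float) for name in top], axis=0)
    return top, consensus


# Toy example with three hypothetical models and k = 2.
y_obs = np.array([5.0, 6.0, 7.0])
toy_predictions = {
    "model_a": y_obs + 0.1,
    "model_b": y_obs - 0.2,
    "model_c": y_obs + 2.0,
}
top_models, y_consensus = consensus_prediction(toy_predictions, y_obs, k=2)
```

In the toy example the two most accurate models are averaged and the poorly performing third model is excluded from the consensus.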

All details of the steps followed to prepare, assimilate and curate compounds, viral proteins and compound–viral protein interactions are available in the ‘README’ file in the ‘data’ folder of the github repository (https://github.com/raghvendra5688/Drug-Repurposing) to enhance the reproducibility of our approach.

2.2 Methods Overview

Compound-viral protein activity prediction can be modeled as a regression task. We learn a mapping function g that takes as input a joint compound and viral protein representation $(x_c, x_v)$ and outputs the activity value $y_{cv}$. In Figure 2, $y_{cv}$ corresponds to $-\log_{10}(\mathrm{IC}_{50})$ and is used as the standardized pChEMBL activity value. If $\mathcal{L}$ is the model-specific loss function, then the regression task reduces to estimating the parameters w which minimize the total loss over all compound-viral protein pairs: $\min_{w} \sum_{c,v} \mathcal{L}(y_{cv}, g(x_c, x_v; w))$.

Overview figure depicting our predictive modeling process. For each compound c and each viral protein v, we use representations xc and xv based on SMILES strings and primary structure respectively. For each compound–viral protein interaction, the activity value used in the training set is obtained from myriad resources. Here, the $-\log_{10}(\mathrm{IC}_{50})$ value, with IC50 measured in nM units, i.e. $-\log_{10}(10^{3} \times 10^{-9}) = 6$, is the standardized pChEMBL activity value (ycv).
Fig. 2.


In this article, the mapping function g is an ML method such as a generalized linear model (GLM) (Agresti, 2015), random forests (RF) (Breiman, 2001), XGBoost (Chen and Guestrin, 2016) or support vector machines (SVMs) (Mall and Suykens, 2015; Suykens and Vandewalle, 1999), and $\mathcal{L}$ is the squared loss function. For these techniques, xc is passed either to a TF-LSTM (Gers et al., 1999) or to a Morgan Fingerprint generator (Capecchi et al., 2020), and xv is passed to a CNN (LeCun et al., 1995), to generate numeric vector representations LSc (for compounds) and LSv (for viral proteins) which are utilized by the aforementioned ML models to estimate activity values, such that $\hat{y}_{cv} = g(LS_c, LS_v; w)$.

Furthermore, we also considered end-to-end deep learning models using CNN, LSTM, CNN–LSTM and graph attention network (GAT)–CNN as function g, where xc corresponds to the canonical SMILES sequence for compounds, xv reflects the primary structure or linear chain of amino acids (AA) for viral protein sequences, and $\hat{y}_{cv} = g(x_c, x_v; w)$. The SMILES representation is parameterized by a sequence of vectors, $x_c = \{x_{c,1}, x_{c,2}, \ldots, x_{c,l}\}$, where $x_{c,i}$ is a one-hot coded vector (Harris and Harris, 2010), i.e. a binary vector of length 72 (72 unique character combinations appearing in SMILES using the ‘SmilesPE’ package https://github.com/XinhaoLi74/SmilesPE in python) with 1 bit active for the ith character combination in the SMILES string, and l = 128. Similarly, for each viral protein sequence (Protein AA Sequence), $x_v = \{x_{v,1}, x_{v,2}, \ldots, x_{v,L}\}$, where $x_{v,j}$ is a one-hot coded vector of length 22 (20 for amino acids, 1 for gap and 1 for ambiguous amino acids) and L = 2000. Figure 2 provides an overview of our modeling process.
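The one-hot coding of a viral protein sequence described above can be sketched as follows. The alphabet size (22) and the fixed length (L = 2000) come from the text; the symbol ordering, the use of the gap symbol "-" for padding, the use of "X" for ambiguous residues and the function name `one_hot_protein` are assumptions made only for this illustration.

```python
import numpy as np

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"        # 20 standard amino acids (assumed order)
GAP = "-"                                   # assumed gap/padding symbol
AMBIGUOUS = "X"                             # assumed symbol for ambiguous residues
ALPHABET = AMINO_ACIDS + GAP + AMBIGUOUS    # 22 symbols in total
INDEX = {symbol: i for i, symbol in enumerate(ALPHABET)}


def one_hot_protein(sequence: str, max_len: int = 2000) -> np.ndarray:
    """Return a (max_len, 22) binary matrix with exactly one active bit per position.

    Sequences longer than max_len are truncated; shorter ones are padded with
    the gap symbol. Any residue outside the 20 standard amino acids and the
    gap symbol is mapped to the ambiguous class.
    """
    padded = sequence.upper()[:max_len].ljust(max_len, GAP)
    encoded = np.zeros((max_len, len(ALPHABET)), dtype=np.float32)
    for position, residue in enumerate(padded):
        encoded[position, INDEX.get(residue, INDEX[AMBIGUOUS])] = 1.0
    return encoded


# A short illustrative fragment; "B" is not a standard residue and falls
# into the ambiguous class.
x_v = one_hot_protein("MKVLB")
```

The compound side follows the same idea with a 72-symbol SMILES vocabulary and l = 128, with the tokenization left to the SMILES tokenizer.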

2.3 Compound autoencoder: TF-LSTM

The goal of a compound autoencoder model (Kramer, 1991) is to learn the innate low dimensional representation LSc from SMILES strings of compounds (xc) in an unsupervised setting such that compounds with similar patterns tend to be closer in the low dimensional space. Our compound autoencoder framework consists of an encoder, a decoder, and a sequence to sequence (seq2seq) model which encapsulates the encoder and decoder and provides a way to interface with each. We are interested in the output of the LSTM encoder, which can be represented as $h = \mathrm{Encoder}_{\mathrm{LSTM}}(e(x_c))$. Here, $e(x_c)$ represents the SMILES embedding representation for a compound, and h corresponds to the hidden state representations encapsulating sequential information, which are used as LSc in our downstream predictive models. A detailed working mechanism of TF-LSTM is provided in Supplementary Material.

We trained this TF-LSTM model on 2.5 million SMILES strings for small molecules. Interestingly, 96.7% of the SMILES generated by our TF-LSTM model were valid small molecules [tested using RDKit (Landrum, 2013) package], and the model attained a mean categorical cross-entropy (Goodfellow et al., 2016) error of 0.001. The convergence of the reconstruction error for our TF-LSTM model is depicted in Supplementary Figure S1(a). Supplementary Figure S2(a) illustrates our TF-LSTM compound autoencoder model.

2.4 Protein autoencoder: CNN

The goal of the viral protein autoencoder model is to learn a low dimensional protein embedding representation LSv from the AA sequences of viral proteins xv. We used a convolutional autoencoder neural network for this purpose. Our protein autoencoder framework consists of two main components: an encoder and a decoder as highlighted in Supplementary Figure S2(b). The autoencoder was trained in an unsupervised fashion to learn a low dimensional space (LSv). A detailed description of the protein autoencoder is provided in Supplementary Material.

We trained our autoencoder on 2 658 225 viral proteins. The mean categorical cross-entropy (Goodfellow et al., 2016) error for the autoencoder was 0.1. The convergence of the reconstruction error for the autoencoder is depicted in Supplementary Figure S1(b).

2.5 Traditional machine learning models

We used four state-of-the-art ML models, namely, GLMs (Agresti, 2015), RFs (Breiman, 2001), XGBoost (Chen and Guestrin, 2016) and SVMs (Mall and Suykens, 2015; Suykens and Vandewalle, 1999) as mapping function g. Thus, our predicted activity value can be represented as $\hat{y}_{cv} = g(LS_c, LS_v; w)$ for a given compound c and viral protein v. It has been shown that nonlinear ML techniques such as RFs, XGBoost and SVMs can be used efficiently for a variety of bioinformatics problems (Mall et al., 2017; Rawi et al., 2018; Mall et al., 2018; Ullah et al., 2018; Palotti et al., 2019; Elbasir et al., 2020).

GLM (Agresti, 2015) is a flexible version of linear regression model which allows the errors or residuals of the response variable to follow a distribution other than the normal distribution. In our work, GLM serves as a baseline comparison technique. RFs belong to the class of ensemble supervised learning techniques. RF algorithm applies the technique of bagging or bootstrapped aggregating (Breiman, 2001) to decision tree learners. Given Dtrain, the bagging procedure repeatedly selects random samples with replacement and fits separate trees to these samples and aggregates them to build the final regressor.

Gradient boosting machine (GBM) (Friedman, 2001) belongs to the family of predictive methods that use an iterative strategy, such that the learning framework consecutively fits new models to obtain a more accurate estimate of the response variable after each iteration. The advantage of the boosting procedure is that it works by decreasing the bias of the model, without increasing the variance. A more scalable and accurate version of GBM is XGBoost (Chen and Guestrin, 2016). It uses a scalable end-to-end tree boosting system with a weighted quantile sketch for approximate tree learning. XGBoost can scale to a large number of samples using limited computational resources.

SVMs were originally introduced in Drucker et al. (1997) and Suykens and Vandewalle (1999) and belong to the family of linear optimization techniques where the regression task is treated as function estimation and is achieved by constructing optimal hyperplanes. They only become suitable for nonlinear regression tasks when a corresponding kernel is chosen (Drucker et al., 1997; Suykens and Vandewalle, 1999). The choice of kernel makes it possible to encode the similarity structure of the input data in a high-dimensional space. We use the radial basis function (RBF) kernel, a universal kernel, for our nonlinear SVM model, which is optimized using a standard cross-validation procedure.
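As a small illustration of how an RBF kernel encodes similarity, the sketch below computes k(x, z) = exp(-gamma * ||x - z||^2) for a few toy vectors. The value of gamma and the example points are assumptions chosen for illustration, not values used in this work.

```python
import math


def rbf_kernel(x, z, gamma=1.0):
    """RBF similarity: 1 for identical points, decaying towards 0 with distance."""
    sq_dist = sum((xi - zi) ** 2 for xi, zi in zip(x, z))
    return math.exp(-gamma * sq_dist)


def kernel_matrix(points, gamma=1.0):
    """Pairwise kernel values; this is the matrix a kernel SVM operates on."""
    return [[rbf_kernel(p, q, gamma) for q in points] for p in points]


# Toy feature vectors (illustrative only).
points = [[0.0, 0.0], [1.0, 0.0], [3.0, 4.0]]
K = kernel_matrix(points, gamma=0.5)
```

Nearby points receive kernel values close to 1 and distant points values close to 0, so gamma controls how local the resulting model is and is typically tuned by cross-validation.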

We used the ‘sklearn’ package (Pedregosa et al., 2011) available in Python for building our optimal GLM, RF, XGBoost and SVM models after performing hyper-parameter optimization using 5-fold cross-validation. For cross-validation, we shuffled the training dataset and then randomly split the data into five parts, using four parts as the training set and the remaining part as the validation set to identify the optimal set of hyper-parameters. This process is repeated five times, so that each part serves once as the validation set, and the hyper-parameters with the best average performance are selected as optimal. These hyper-parameters are then used to build the final model on the entire training set.
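The selection procedure described above can be sketched in plain Python. The snippet is a minimal illustration rather than the authors' pipeline: a toy one-feature ridge regressor (`fit_ridge_1d`, a hypothetical helper) stands in for the GLM, RF, XGBoost and SVM models, and the grid of penalty values is made up. It only shows how the fold-averaged validation error picks a hyper-parameter before the final refit on all training data.

```python
import random


def k_fold_indices(n_samples, k=5, seed=0):
    """Shuffle sample indices and split them into k nearly equal folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]


def fit_ridge_1d(xs, ys, alpha):
    """Closed-form 1-D ridge fit without intercept: w = sum(xy) / (sum(x^2) + alpha)."""
    return sum(x * y for x, y in zip(xs, ys)) / (sum(x * x for x in xs) + alpha)


def mse(y_true, y_pred):
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def cross_validate(xs, ys, alpha, k=5, seed=0):
    """Average validation MSE of one hyper-parameter value over k folds."""
    scores = []
    for held_out in k_fold_indices(len(xs), k, seed):
        held = set(held_out)
        train = [i for i in range(len(xs)) if i not in held]
        w = fit_ridge_1d([xs[i] for i in train], [ys[i] for i in train], alpha)
        preds = [w * xs[i] for i in held_out]
        scores.append(mse([ys[i] for i in held_out], preds))
    return sum(scores) / k


# Toy data (illustrative only): y = 2x plus small Gaussian noise.
rng = random.Random(42)
xs = [rng.uniform(-1, 1) for _ in range(100)]
ys = [2.0 * x + rng.gauss(0, 0.1) for x in xs]

grid = [0.0, 0.1, 1.0, 10.0, 100.0]            # hypothetical candidate penalties
cv_scores = {alpha: cross_validate(xs, ys, alpha) for alpha in grid}
best_alpha = min(cv_scores, key=cv_scores.get)  # best average validation error
final_w = fit_ridge_1d(xs, ys, best_alpha)      # refit on the entire training set
```

The same seed is used for every candidate, so all hyper-parameter values are compared on identical folds.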

2.6 End-to-end deep learning models

We built four end-to-end deep learning models for our regression problem where the mapping functions g were CNN, LSTM, CNN-LSTM and GAT-CNN. These models directly work on the compound (xc) and viral protein (xv) representations, unlike traditional ML techniques.

2.6.1 CNN Model:

This deep learning architecture comprises two CNN encoders, one for compounds and one for viral proteins. The compound (xc) and viral protein (xv) representations are each passed through an embedding layer (e(·)) to generate the compound embedding matrix and the viral protein embedding matrix, respectively. A single convolutional layer with multiple filter sizes, k ∈ K = {3, 6, 9, 12}, is applied on top of each embedding matrix, followed by a max-pooling operation, to generate the hidden state vectors for small molecules and viral protein sequences, as depicted in Figure 3(a). The hidden state vector hc for compounds and hv for viral protein sequences are then concatenated (h) and are considered the output of the CNN encoders.

Fig. 3. Different end-to-end deep learning models used as data-driven predictive models for the task of estimating compound-viral protein activity. (a) CNN model, (b) LSTM model, (c) CNN–LSTM model and (d) GAT-CNN model

We then have multiple feed-forward layers on top of h which are ultimately connected to the output unit corresponding to the activity value. The CNN encoders can capture contiguous sequences in the SMILES representations and k-mers in viral protein sequence, whereas the feed-forward layers capture the co-occurrence of such patterns that drive the activity value to be either high or low based on our training set Dtrain. We use nonlinear activations at every layer and optimize the model architecture w.r.t. hyper-parameters such as filter sizes, learning rate, etc.
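The encoder described above (embedding, one convolutional layer with filter sizes {3, 6, 9, 12}, max-pooling, concatenation of hc and hv) can be sketched without any deep learning library. Everything below is an illustrative assumption: weights are random rather than learned, tokenization is character-level, ReLU is used as the nonlinearity, and the embedding size, number of filters and input strings are made up.

```python
import random

FILTER_SIZES = (3, 6, 9, 12)  # the filter sizes K quoted in the text


def make_embedding_table(alphabet, emb_dim, rng):
    """One random vector per symbol; in a trained model these would be learned."""
    return {ch: [rng.uniform(-1, 1) for _ in range(emb_dim)] for ch in alphabet}


def make_filters(emb_dim, n_filters, rng):
    """n_filters random kernels of shape (size x emb_dim) for every filter size."""
    return {size: [[[rng.uniform(-1, 1) for _ in range(emb_dim)]
                    for _ in range(size)]
                   for _ in range(n_filters)]
            for size in FILTER_SIZES}


def conv_relu_maxpool(emb, kernel):
    """Slide one kernel over the sequence, apply ReLU, keep the maximum over positions."""
    size = len(kernel)
    best = 0.0  # ReLU floor; also the result if the sequence is shorter than the kernel
    for start in range(len(emb) - size + 1):
        window = emb[start:start + size]
        activation = sum(w * v
                         for k_row, e_row in zip(kernel, window)
                         for w, v in zip(k_row, e_row))
        best = max(best, activation)
    return best


def cnn_encode(sequence, table, filters):
    """Embed a sequence and return one pooled feature per kernel."""
    emb = [table[ch] for ch in sequence]
    return [conv_relu_maxpool(emb, kernel)
            for size in FILTER_SIZES for kernel in filters[size]]


rng = random.Random(0)
smiles = "CC(=O)OC1=CC=CC=C1C(=O)O"            # aspirin, used only as a toy input
protein = "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ"  # arbitrary toy amino acid string

compound_table = make_embedding_table(sorted(set(smiles)), 8, rng)
protein_table = make_embedding_table(sorted(set(protein)), 8, rng)
compound_filters = make_filters(8, 4, rng)
protein_filters = make_filters(8, 4, rng)

h_c = cnn_encode(smiles, compound_table, compound_filters)
h_v = cnn_encode(protein, protein_table, protein_filters)
h = h_c + h_v  # concatenated hidden vector that would feed the feed-forward layers
```

Each filter of size k responds to a contiguous run of k tokens, and max-pooling keeps only its strongest response, so the length of h depends on the number of filters and not on the sequence length.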

2.6.2 LSTM model:

The LSTM model consists of two LSTM encoders. We have an LSTM encoder based on the compound representation (xc) and another one based on the viral protein representation (xv). The compound LSTM encoder generates the hidden state vector (hc) while the viral protein encoder generates the hidden state vector (hv). The two hidden vectors are then concatenated together (h) as illustrated in Figure 3(b).

We again have multiple feed-forward layers on top of h, which are connected to the output unit representing the activity value. Owing to their memory units, the LSTM encoders capture both short- and long-term dependencies in the SMILES strings and viral protein sequences, and the feed-forward layers encapsulate the co-occurrence of such patterns driving the activity value to be high or low for a given compound-viral protein combination.

2.6.3 CNN–LSTM model:

The CNN–LSTM model is a combination of the CNN and LSTM models. By combining them, this model can capture spatially contiguous as well as long-term dependencies in the SMILES strings and viral protein sequences. The outputs of the encoders are concatenated to generate the hidden representation h, which is passed to multiple feed-forward layers and is ultimately connected to the output layer consisting of one unit for the activity value.

2.6.4 GAT–CNN model:

This deep learning architecture is composed of two parts, a graph attention network (Veličković et al., 2017) and a convolutional neural network. A compound structure can be represented as a graph whose nodes are the atoms of the compound, with an edge between a pair of atoms if a bond exists between them. To convert a compound structure to a graph representation, we use the RDKit package, which takes SMILES strings as input. Furthermore, RDKit allows us to extract different atom features such as the atom’s degree, the total number of hydrogens, the number of hydrogens together with the number of bonded neighbors, whether the atom is aromatic, the implicit valence of the atom and the atom symbol. These features can be used as node properties for atoms. In total, we extract 78 such features from the SMILES strings. Given the graph-based representation of a compound molecule (xc) along with the extracted node features, the GAT model learns an embedding representation of the compound that encapsulates the topological information available in its graph.
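To make the attention step concrete, the following is a minimal single-head graph attention layer in plain Python, following the formulation of Veličković et al. (2017): scores e_ij = LeakyReLU(a · [W h_i ‖ W h_j]) are normalized with a softmax over each node's neighborhood (including the node itself) and used to average the transformed neighbor features. The three-atom toy graph, its three-dimensional node features and the hand-picked weights are assumptions for illustration; they are not the 78 RDKit features or the trained parameters of this work, and the output nonlinearity is omitted.

```python
import math


def matvec(W, h):
    return [sum(w * x for w, x in zip(row, h)) for row in W]


def leaky_relu(x, slope=0.2):
    return x if x > 0 else slope * x


def gat_layer(features, edges, W, a):
    """Single-head graph attention layer.

    features: list of node feature vectors
    edges:    undirected bonds as (i, j) index pairs
    W:        weight matrix (out_dim x in_dim)
    a:        attention vector of length 2 * out_dim
    Returns the new node vectors and the attention weights per node.
    """
    n = len(features)
    neighbors = {i: {i} for i in range(n)}  # every node also attends to itself
    for i, j in edges:
        neighbors[i].add(j)
        neighbors[j].add(i)

    z = [matvec(W, h) for h in features]    # shared linear transform
    outputs, attention = [], {}
    for i in range(n):
        nbrs = sorted(neighbors[i])
        scores = [leaky_relu(sum(ak * v for ak, v in zip(a, z[i] + z[j])))
                  for j in nbrs]
        top = max(scores)                   # subtract the max for numerical stability
        exps = [math.exp(s - top) for s in scores]
        total = sum(exps)
        alphas = [e / total for e in exps]
        attention[i] = dict(zip(nbrs, alphas))
        outputs.append([sum(al * z[j][d] for al, j in zip(alphas, nbrs))
                        for d in range(len(z[i]))])
    return outputs, attention


# Toy graph: the heavy atoms of ethanol, C-C-O (illustrative only).
# Hypothetical node features: [is carbon, is oxygen, degree].
features = [[1.0, 0.0, 1.0], [1.0, 0.0, 2.0], [0.0, 1.0, 1.0]]
edges = [(0, 1), (1, 2)]
W = [[0.5, -0.2, 0.1], [0.3, 0.8, -0.5]]
a = [0.1, -0.3, 0.2, 0.4]

node_embeddings, attention = gat_layer(features, edges, W, a)
```

Because the attention weights are computed only over bonded atoms, the resulting node vectors reflect the topology of the molecular graph.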

The second component of this architecture is a convolutional neural network which takes the protein AA sequence as input. This component is composed of an embedding layer and multiple convolutional layers. At each convolutional layer, a nonlinear activation function is applied, followed by a max-pooling operator. It learns the protein embedding (hv), which is concatenated with the SMILES embedding (hc) generated by the GAT to produce h, which is then passed to feed-forward layers. The output layer provides the value corresponding to the compound activity.

The optimal model architecture hyper-parameters (like hc = 256, hv = 64) for each of the end-to-end deep learning models are provided in Supplementary Table S1.

2.7 Consensus framework

Gysi et al. (2020) demonstrate that aggregating the results obtained from different methodologies can provide better performance than individual models when identifying suitable repurposable compounds for COVID-19. In a similar vein, we take a consensus, i.e. the average of the pChEMBL values predicted by our top performing in silico embedding-based compound-viral protein predictors on the independent test set. Our models are based on diverse representations of compounds (SMILES embedding, MFP or canonical SMILES) and viral proteins (protein embedding or AA sequence) and therefore learn different combinations of nonlinear patterns from the data. We argue that taking a consensus of the top predictive models exploits this diversity to attain better predictive performance, as illustrated in Table 1. Figure 1 highlights the various combinations of input data representations and the top compound-viral protein predictors aggregated in the consensus framework, selected on the basis of their test-set performance.
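The consensus itself is a plain average across models. A minimal sketch, in which the model names and predicted pChEMBL values are made-up placeholders:

```python
def consensus(predictions_by_model):
    """Average the predictions of several models, sample by sample."""
    model_preds = list(predictions_by_model.values())
    n_samples = len(model_preds[0])
    if any(len(p) != n_samples for p in model_preds):
        raise ValueError("all models must predict the same samples")
    return [sum(p[i] for p in model_preds) / len(model_preds)
            for i in range(n_samples)]


# Hypothetical predicted pChEMBL values for three compound-protein pairs.
predictions = {
    "xgboost_mfp": [6.0, 7.2, 5.1],
    "svm_mfp": [6.4, 7.0, 5.3],
    "cnn": [5.9, 7.4, 4.9],
}
consensus_pchembl = consensus(predictions)
```

Averaging tends to help most when the individual models make different errors, which is the rationale given above for combining models built on different input representations.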

Table 1.

Comparison of performance of devised ML techniques for our compound–viral activity prediction problem evaluated w.r.t. the four evaluation metrics on Dtest

| Model | Representations | MAE | MSE | Pearson R | R2 |
|---|---|---|---|---|---|
| Mean | Dummy regressor | 1.100 ± 0.003 | 1.936 ± 0.006 | - | - |
| Median | Dummy regressor | 1.002 ± 0.002 | 2.310 ± 0.004 | - | - |
| GLM | SMILES Embedding + Protein Embedding | 0.662 ± 0.003 | 0.869 ± 0.009 | 0.740 ± 0.003 | 0.548 ± 0.005 |
| RF | SMILES Embedding + Protein Embedding | 0.557 ± 0.005 | 0.625 ± 0.010 | 0.826 ± 0.003 | 0.682 ± 0.005 |
| SVM | SMILES Embedding + Protein Embedding | 0.508 ± 0.004 | 0.478 ± 0.011 | 0.869 ± 0.002 | 0.755 ± 0.003 |
| XGBoost^a | SMILES Embedding + Protein Embedding | 0.453 ± 0.003 | 0.423 ± 0.007 | 0.885 ± 0.002 | 0.783 ± 0.004 |
| GLM | Morgan Fingerprint + Protein Embedding | 0.647 ± 0.003 | 0.775 ± 0.008 | 0.774 ± 0.003 | 0.600 ± 0.005 |
| RF | Morgan Fingerprint + Protein Embedding | 0.529 ± 0.003 | 0.552 ± 0.004 | 0.849 ± 0.002 | 0.720 ± 0.003 |
| SVM^a | Morgan Fingerprint + Protein Embedding | 0.439 ± 0.003 | 0.357 ± 0.005 | 0.905 ± 0.002 | 0.818 ± 0.003 |
| XGBoost^a | Morgan Fingerprint + Protein Embedding | 0.404 ± 0.002 | 0.329 ± 0.003 | 0.911 ± 0.001 | 0.830 ± 0.002 |
| CNN^a | SMILES Sequence + Protein AA Sequence | 0.451 ± 0.003 | 0.398 ± 0.006 | 0.892 ± 0.002 | 0.795 ± 0.004 |
| LSTM | SMILES Sequence + Protein AA Sequence | 0.500 ± 0.002 | 0.514 ± 0.006 | 0.863 ± 0.002 | 0.745 ± 0.003 |
| CNN-LSTM | SMILES Sequence + Protein AA Sequence | 0.516 ± 0.004 | 0.551 ± 0.009 | 0.852 ± 0.002 | 0.725 ± 0.004 |
| GAT-CNN^a | SMILES Sequence + Protein AA Sequence | 0.478 ± 0.003 | 0.439 ± 0.007 | 0.880 ± 0.002 | 0.775 ± 0.003 |
| μBest (top 10 methods) | All combinations | 0.423 ± 0.004 | 0.342 ± 0.009 | 0.911 ± 0.003 | 0.829 ± 0.005 |
| μBest (top 5 methods) | All combinations | 0.403 ± 0.002 | 0.313 ± 0.006 | 0.917 ± 0.002 | 0.841 ± 0.003 |

Note: Here, we report the mean performance and ± corresponds to maximal standard deviation. Top 10 models are highlighted in bold and ‘a’ superscript is added to top 5 models w.r.t. the four evaluation metrics. Last row corresponds to mean of top 5 methods.

4 Results

4.1 Experimental results on Dtest

We perform 10 randomizations for each of our predictive models by randomly splitting the full dataset D into Dtrain and Dtest in proportions 9:1 for training and testing purposes, respectively, as mentioned earlier in Section 2. During each randomization, all the models are built using the same Dtrain and evaluated on the same test set Dtest to avoid any unwanted bias in the downstream consensus framework. For the traditional machine learning techniques (GLM, RF, SVM and XGBoost), the optimal hyper-parameters are obtained using 5-fold cross-validation on Dtrain. However, owing to computational costs, the optimal architecture for the end-to-end deep learning models is identified by dividing each training set (Dtrain) on the fly into 80% for training and 20% for validation. The cross-validation performance of the traditional supervised machine learning techniques (GLM, RF, SVM and XGBoost) using either the SMILES embedding or the Morgan Fingerprint representation for compounds and the protein embedding representation for viral proteins is depicted in Supplementary Figure S2(c).
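The repeated-split protocol can be sketched as follows. This is an illustrative outline, not the authors' code: the seeds, the toy activity values and the stand-in model (a constant mean predictor scored by MAE) are assumptions, and in practice every model would be trained and evaluated on the same split within each repetition.

```python
import random
import statistics


def split_9_to_1(n_samples, seed):
    """Shuffle indices and hold out one tenth of them as the test set."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    n_test = n_samples // 10
    return indices[n_test:], indices[:n_test]  # (train indices, test indices)


def repeated_evaluation(xs, ys, fit, score, n_repeats=10):
    """Fit and score a model on n_repeats random 9:1 splits; return mean and std."""
    scores = []
    for seed in range(n_repeats):
        train, test = split_9_to_1(len(xs), seed)
        model = fit([xs[i] for i in train], [ys[i] for i in train])
        scores.append(score(model, [xs[i] for i in test], [ys[i] for i in test]))
    return statistics.mean(scores), statistics.stdev(scores)


# Stand-in model: always predict the training mean; the score is the test MAE.
def fit_mean(train_x, train_y):
    return statistics.mean(train_y)


def mae_score(model, test_x, test_y):
    return sum(abs(y - model) for y in test_y) / len(test_y)


# Toy activity values (illustrative only); features are unused by this model.
rng = random.Random(1)
ys = [rng.uniform(4.0, 10.0) for _ in range(200)]
xs = list(range(200))

mean_mae, std_mae = repeated_evaluation(xs, ys, fit_mean, mae_score)
```

Reporting the mean and standard deviation over the repetitions, as in Table 1, indicates how sensitive a model's score is to the particular split.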

Table 1 provides a comprehensive comparison of the mapping functions g utilized in our work, including the baseline mean, median and optimal GLM regressors as well as optimal nonlinear models such as RF, SVM, XGBoost, CNN, LSTM, CNN-LSTM and GAT-CNN. In Table 1, we report the mean and corresponding standard deviation (±) in performance for each of the four quality metrics over the 10 randomizations. These four quality metrics are the mean absolute error (MAE), mean squared error (MSE), Pearson correlation R (Pearson R) and the coefficient of determination (R2). Each of these metrics is estimated using the predicted pChEMBL values versus the groundtruth pChEMBL values for compound–viral protein interactions (Dtest). For MAE and MSE, the lower the value and the closer to 0, the better the predictive performance of the model, whereas for Pearson R and R2, the higher the value and the closer to 1, the better the model’s predictive capability.
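For reference, the four metrics can be written out directly from their standard definitions. The sketch below uses made-up pChEMBL values purely for illustration.

```python
import math


def mae(y_true, y_pred):
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def mse(y_true, y_pred):
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def pearson_r(y_true, y_pred):
    """Linear correlation; undefined (division by zero) for constant predictions."""
    mt = sum(y_true) / len(y_true)
    mp = sum(y_pred) / len(y_pred)
    cov = sum((t - mt) * (p - mp) for t, p in zip(y_true, y_pred))
    var_t = sum((t - mt) ** 2 for t in y_true)
    var_p = sum((p - mp) ** 2 for p in y_pred)
    return cov / math.sqrt(var_t * var_p)


def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - residual sum of squares / total sum of squares."""
    mt = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mt) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Made-up groundtruth and predicted pChEMBL values.
y_true = [5.0, 6.0, 7.0, 8.0]
y_pred = [5.5, 5.5, 7.5, 7.5]

metrics = {
    "MAE": mae(y_true, y_pred),
    "MSE": mse(y_true, y_pred),
    "Pearson R": pearson_r(y_true, y_pred),
    "R2": r2_score(y_true, y_pred),
}
```

For this toy example the values are MAE = 0.5, MSE = 0.25, Pearson R ≈ 0.894 and R2 = 0.8.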

We highlight two baseline regressors, i.e. the mean and the median regressor, to showcase the effectiveness of our nonlinear predictive models in Table 1. Here, the mean regressor takes the mean value of all the compound-viral protein activities available in the training set and uses it as a fixed output. Similarly, the median regressor outputs the median value of all the compound-viral protein activities available in the training set. The performance of these two baseline regressors is significantly lower than that of the other machine learning techniques. Additionally, we demonstrate that the GLM models built on LSc and LSv for compounds and viral proteins, respectively, are two of the worst performing models w.r.t. the four evaluation metrics. This necessitates the usage of nonlinear machine learning techniques when using numeric vector representations for compounds (SMILES embedding/Morgan Fingerprint) and proteins (Protein embedding) for the task of accurate compound-viral protein activity prediction, as illustrated in Table 1.
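The two baselines amount to a constant predictor fitted on the training activities. A minimal sketch, with a hypothetical class name and made-up training values:

```python
import statistics


class ConstantRegressor:
    """Baseline that ignores the inputs and always predicts one training statistic."""

    def __init__(self, strategy="mean"):
        if strategy not in ("mean", "median"):
            raise ValueError("strategy must be 'mean' or 'median'")
        self.strategy = strategy
        self.value_ = None

    def fit(self, y_train):
        if self.strategy == "mean":
            self.value_ = statistics.mean(y_train)
        else:
            self.value_ = statistics.median(y_train)
        return self

    def predict(self, n_samples):
        return [self.value_] * n_samples


# Made-up training activities (pChEMBL values), skewed by one large value.
y_train = [5.0, 5.5, 6.0, 9.5]

mean_baseline = ConstantRegressor("mean").fit(y_train)
median_baseline = ConstantRegressor("median").fit(y_train)
```

Among constant predictors, the mean minimizes the squared error and the median minimizes the absolute error on the data it is fitted to, which is consistent with the mean regressor having the lower MSE and the median regressor the lower MAE in Table 1.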

From Table 1, we observe that the best individual predictive model w.r.t. all quality metrics is the XGBoost model, highlighted in Table 1 by ‘*’, which is built on the LSc obtained from the Morgan Fingerprint for compounds and the LSv obtained from the protein autoencoder for viral proteins. It is closely followed by the SVM model on the same representations, the end-to-end CNN and GAT-CNN deep learning models based on the sequence representations, and the XGBoost model built on the SMILES embedding (LSc) for compounds and the protein embedding (LSv) for viral proteins. These top 5 models each achieve Pearson R > 0.85 and R2 in excess of 0.75. Furthermore, we observe from Table 1 that when we take a consensus (average) of the top 10 predictive models, its performance is comparable to that of the best individual predictive (XGBoost) model. This can partly be explained by the inclusion in the consensus of models with much lower predictive capability than the top performing models, such as RF (SMILES Embedding/Morgan Fingerprint + Protein Embedding) and the CNN-LSTM deep learning model. However, when we take a consensus of the top 5 predictors, we achieve performance superior to that of the best individual predictor (XGBoost model), as depicted in Table 1. This superior predictive capability can be attributed to the high Pearson R and R2 of the individual models included in the consensus and to their ability to capture different combinations of nonlinear patterns from the diverse representations of the data. It is noteworthy that the standard deviations of each predictive model over the 10 randomizations of Dtest are low w.r.t. the four evaluation metrics, as illustrated in Table 1, indicating low variance in the generalization performance of our proposed models.

Next, we evaluate the predictive performance of the best model obtained from the 10 randomizations for each mapping function g. The predictive capability of each of these models is highlighted in Supplementary Figure S3. We additionally compare the predictions of the top 5 in silico predictors with the ground-truth compound-viral protein activities available for the same test set Dtest, as illustrated in Figure 4(a). In Figure 4(a), the x-axis represents the sample id in Dtest, and for each sample there are five values spread vertically along the y-axis. Each of these values corresponds to the difference between the groundtruth and the interaction value predicted by one of our in silico embedding-based models. The closer the predicted score is to the true pChEMBL value, the closer the residual pChEMBL value is to 0. We observe larger deviations from 0 in the residual pChEMBL values, i.e. relatively larger prediction errors, when the true pChEMBL value is either too small (close to sample id ‘0’ on the x-axis) or too large (close to sample id ‘6000’ on the x-axis). This can partly be attributed to the small number of compound-viral protein activity samples with small pChEMBL values (5) or large pChEMBL values (10) available in the training set Dtrain, as depicted in Figure 4(b), to train the in silico embedding-based predictors. However, for a majority of the samples the residuals are close to 0 for each of the top 5 predictors, showcasing their good predictive capability.

Fig. 4. (a) Comparison of difference in groundtruth versus predicted pChEMBL values for top 5 in silico embedding-based compound-viral activity predictors on the same Dtest. (b) Comparison of density distribution of pChEMBL values in training and test set. In (a), the x-axis represents compound–viral protein activity samples ordered by their groundtruth pChEMBL values (lowest to highest) and the y-axis corresponds to the residual pChEMBL values. For a majority of the samples, the residuals are close to zero for each of the top 5 predictors, indicating the good predictive capability of these models. We use the ‘loess’ function with default parameters available in the ‘ggplot2’ package in R to fit a smooth local regressor via a nonparametric approach for each in silico predictor. Panel (b) shows the distribution as well as the density of the pChEMBL values available in Dtrain and Dtest.

4.2 Experimental results for COVID-19 use case

In a recent work (Riva et al., 2020), a library of approximately 12 000 clinical-stage or Food and Drug Administration (FDA)-approved small molecules was profiled by means of assay development and high-throughput screening based on bioactivities against the SARS-COV-2 virus in Vero E6 cells to identify candidate therapeutic drugs for COVID-19. Riva et al. (2020) have deposited a total of 68 assays containing 2483 of these compounds in a publicly available library named ReFRAME (https://reframedb.org/). As per the guidelines mentioned on their website, a good majority of these compounds are embargoed due to collaborations with pharmaceutical companies. Furthermore, we take a union of these compounds with 117 FDA-approved drugs which are in some stage of clinical trial for any known viral organism, as indicated in Andersen et al. (2020) and available at http://drugvirus.info/. After filtering these compounds based on the length of their SMILES sequences as per the criterion defined in Section 2, we end up with a set S comprising 1482 compounds including known antivirals, antibiotics, anticancer and other human investigational compounds (see details about preparing the set S in the Supplementary Material). We then make activity predictions (pChEMBL values) for each of these compounds on the three main proteins of the SARS-COV-2 virus, i.e. the PL-Pro (PDB ID: 6WO2), the 3CL-Pro (PDB ID: 5R7Y) and the Spike protein (PDB ID: 6MOJ). Additional information about their primary AA sequences, UniProt IDs, etc. is provided in Supplementary Table S2.

We first obtain the predicted pChEMBL values from the top 5 in silico compound-viral activity predictors indicated in Table 1 and take a consensus, i.e. the average of these predictions, for each of the three main proteins of the SARS-COV-2 virus. We then select the top 100 compounds with the highest predicted activity against each of these three viral proteins. By taking the intersection of these lists, we obtain a set of 47 compounds which are consistently predicted to have high activity (high pChEMBL values) against all three main viral proteins and thus can potentially be effective against the SARS-COV-2 virus. This candidate set includes 21 antivirals, 15 anticancer, 5 antibiotics and 6 other investigational human compounds, as depicted in Table 2. Our candidate list includes antiviral therapies such as Lopinavir, Ritonavir and Filociclovir, which have been undergoing clinical trials (https://clinicaltrials.gov/) for SARS-COV-2 as highlighted in Riva et al. (2020). Our consensus embedding-based in silico framework also identifies Remdesivir, a viral RNA polymerase inhibitor (Warren et al., 2016), which has been granted emergency use authorization by the FDA for the treatment of COVID-19 on the basis of clinical trial data demonstrating a reduction in time to recovery (Food and Drug Administration, 2020).
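The selection step (top-k list per protein, then intersection) can be sketched as below. The compound names and scores are placeholders, k is reduced from 100 to fit the toy data, and ordering the shared compounds by their mean predicted value across proteins is an assumption made here for illustration.

```python
def top_k(scores, k):
    """Names of the k compounds with the highest predicted pChEMBL value."""
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return [name for name, _ in ranked[:k]]


def shared_candidates(scores_by_protein, k):
    """Compounds that appear in the top-k list of every protein."""
    top_sets = [set(top_k(scores, k)) for scores in scores_by_protein.values()]
    shared = set.intersection(*top_sets)

    def mean_score(compound):
        values = [scores[compound] for scores in scores_by_protein.values()]
        return sum(values) / len(values)

    # Assumption: order the shared compounds by mean predicted value.
    return sorted(shared, key=mean_score, reverse=True)


# Hypothetical consensus predictions for four placeholder compounds.
scores_by_protein = {
    "PL-Pro": {"cmpd_A": 7.8, "cmpd_B": 7.5, "cmpd_C": 6.0, "cmpd_D": 5.2},
    "3CL-Pro": {"cmpd_A": 7.9, "cmpd_B": 6.1, "cmpd_C": 7.7, "cmpd_D": 5.0},
    "Spike": {"cmpd_A": 8.2, "cmpd_B": 7.8, "cmpd_C": 6.5, "cmpd_D": 5.9},
}

shared_top2 = shared_candidates(scores_by_protein, k=2)
shared_top3 = shared_candidates(scores_by_protein, k=3)
```

With k = 2 only `cmpd_A` is in every list, while k = 3 admits `cmpd_B` and `cmpd_C` as well; the intersection therefore shrinks or grows with the cut-off, just as the 47 shared compounds depend on the choice of the top 100.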

Table 2.

Top ranked 47 compounds for each of PL-Pro, 3CL-Pro and Spike proteins of SARS-COV-2 virus consistently appearing in the ranked list of top 100 compounds against these viral proteins

| Compound | PL-Pro predicted pChEMBL | PL-Pro binding energy | 3CL-Pro predicted pChEMBL | 3CL-Pro binding energy | Spike protein predicted pChEMBL | Spike protein binding energy |
|---|---|---|---|---|---|---|
| Lopinavir^a | 7.777 | −6.3 | 7.851 | −8.7 | 8.226 | −5.0 |
| Ritonavir^a | 7.562 | −6.4 | 7.777 | −7.7 | 7.845 | −5.5 |
| Palinavir^a | 7.416 | −6.4 | 7.48 | −7.2 | 7.699 | −6 |
| Simeprevir^a | 7.646 | −5.6 | 7.476 | −6.1 | 8.206 | −6.2 |
| Cabotegravir^a | 7.194 | −7.1 | 6.951 | −9.5 | 7.002 | −6.8 |
| L-870812^a | 6.937 | −7.1 | 6.895 | −8.9 | 6.68 | −7.2 |
| MK-4965^a | 7.319 | −7.5 | 6.893 | −9.6 | 7.302 | −7.1 |
| Tipranavir^a | 6.634 | −7.4 | 6.83 | −8.3 | 6.794 | −6.6 |
| Zanamivir^a | 6.798 | −5.7 | 6.801 | −5.9 | 6.748 | −5.9 |
| BMS-707035^a | 6.938 | −7.2 | 6.766 | −8.8 | 6.511 | −6.6 |
| GSK-364735^a | 7.086 | −6.4 | 6.745 | −9.6 | 6.552 | −7 |
| Paritaprevir^a | 6.751 | −6.8 | 6.571 | 5.6 | 7.443 | −6.2 |
| Filociclovir^a | 6.542 | −5.7 | 6.463 | −7.1 | 6.647 | −6.2 |
| TMC-647055^a | 6.717 | −5.8 | 6.459 | 11.7 | 6.539 | −5.5 |
| Elvitegravir^a | 6.462 | −6.8 | 6.402 | −8 | 6.236 | −5.7 |
| Dapivirine^a | 6.584 | −6.7 | 6.385 | −8.7 | 6.32 | −6.4 |
| PLX-8394^c | 6.208 | −9.1 | 6.358 | −9.4 | 6.494 | −7.2 |
| Triciribine PO3^c | 6.385 | −6.7 | 6.354 | −8.1 | 6.314 | −6.5 |
| Zidovudine^a | 5.966 | −5.7 | 6.279 | −7.4 | 6.264 | −5.6 |
| API-2^(a/c) | 5.964 | −6.7 | 6.175 | −8.3 | 6.208 | −5.7 |
| Fluorouracil^a | 5.965 | −4.5 | 6.157 | −5.2 | 6.353 | −4.6 |
| Gossypol^c | 6.029 | −5.7 | 6.11 | −4.2 | 6.069 | −6 |
| LM 565^b | 6.137 | 8.7 | 6.094 | 72.4 | 6.461 | −2.9 |
| PF-03814735^c | 6.051 | −7.6 | 6.091 | −8.2 | 6.126 | −7.1 |
| Barasertib^c | 6.006 | −8 | 6.087 | −8.3 | 6.171 | −6.8 |
| Edoxudine^a | 5.925 | −5.9 | 6.075 | −7.6 | 6.246 | −5.6 |
| Cefozopran^b | 5.884 | −7 | 6.049 | −8.1 | 6.255 | −6 |
| Entrectinib^c | 6.231 | −6.8 | 6.039 | −9.3 | 6.023 | −7 |
| Clemizol^c | 6.085 | −6.2 | 6.015 | −8 | 6.105 | −6 |
| VBY-825^a | 6.112 | −6 | 6.006 | −8 | 6.07 | −4.7 |
| R-763^c | 6.158 | −6.6 | 6.002 | −7.8 | 6.26 | −6.7 |
| Bietaserpine^d | 6.054 | −6.1 | 5.994 | −2.1 | 6.323 | −4.9 |
| ACT-077825^d | 5.916 | −6.9 | 5.973 | −6.7 | 6.223 | −4.8 |
| MP-412^c | 6.069 | −6.6 | 5.971 | −9 | 6.243 | −5.6 |
| Remdesivir^a | 5.907 | −6.2 | 5.964 | −8 | 6.37 | −6.4 |
| ABT-263^c | 6.005 | −4.2 | 5.925 | 1.9 | 6.211 | −5.6 |
| BMS-903452^d | 5.929 | −6.9 | 5.913 | −7.8 | 6.174 | −6.3 |
| Brilacidin^d | 6.016 | −5.7 | 5.913 | −2.2 | 6.266 | −5.2 |
| Taselisib^c | 5.934 | −7 | 5.906 | −8.6 | 6.142 | −7.1 |
| Goxalapladib^d | 5.982 | −6.9 | 5.905 | −6.6 | 6.27 | −5.1 |
| HKI-357^c | 6.009 | −6.8 | 5.884 | −8.7 | 6.143 | −6.2 |
| Sitravatinib^c | 5.895 | −6.3 | 5.879 | −8.2 | 6.069 | −7 |
| Rifabutin^b | 5.904 | −9.4 | 5.878 | −12.3 | 6.136 | −12.1 |
| Omadacycline^b | 6.002 | −6.1 | 5.865 | −2.6 | 6.251 | −5.3 |
| Cefpiramide^b | 5.883 | −6.8 | 5.851 | −8.3 | 6.179 | −5.9 |
| VCH-286^d | 5.88 | −6.6 | 5.847 | −8.1 | 6.028 | −4.6 |
| BMS-754807^c | 5.915 | −6.6 | 5.833 | −8.3 | 6.095 | −7.1 |

Note: ‘PPS’ represents the pChEMBL value predicted by the consensus model, whereas ‘BE’ corresponds to the binding energy (units: kcal/mol) obtained via the molecular docking experiment. Here, a, b, c and d in superscript correspond to antivirals, antibiotics, anticancer and other human compounds, respectively. Rifabutin is highlighted in bold as it consistently achieves a low binding energy in the molecular docking experiments. Similarly, LM 565 is italicized as it consistently attains a high binding energy score in the docking experiments and can potentially be a false positive.


Zhou et al. (2020) identified several compounds, including Toremifene, using a network-based drug-repurposing approach for SARS-COV-2. These were further validated against the Spike viral protein in Martin and Cheng (2020) using a combination of homology modeling, molecular docking, molecular dynamics simulation and binding affinity calculations. In a similar vein, to showcase the accuracy of our consensus framework, we perform additional molecular docking experiments on the set of 47 compounds that consistently had high predicted activities against the three main viral proteins of the SARS-COV-2 virus. All details of the molecular docking setup are provided in the Supplementary Material. For each of the three main proteins, Table 2 lists our predicted pChEMBL value and the corresponding binding energy score obtained via molecular docking for the 47 candidate compounds. We observe that a good majority of the top-ranked compounds consistently achieved a low binding energy (−6 kcal/mol) in the molecular docking experiments for all the considered viral proteins of SARS-COV-2. It is noteworthy that, among all the compounds in our final candidate list, LM 565 is the only one that attains a high binding energy score in the docking experiments for each of the three viral proteins and thus can potentially be a false positive. This illustrates that our consensus framework can serve as a data-driven screening tool that reduces the list of candidate drugs from an initial set S (1482 compounds) to a curated list of 47 potential compounds (3% of the original set S). This list can then be validated either through molecular docking experiments (reducing computational costs) or through bioassays in the absence of a known 3D crystal structure of the viral proteins.
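
The screening step described above, which keeps only the compounds ranked in the top 100 for every viral protein, can be sketched as a set intersection over per-protein rankings. This is a minimal illustration under stated assumptions rather than the implementation used in this work: the function name `consensus_shortlist`, the dictionary layout, the ordering of the shortlist by mean predicted pChEMBL and the toy scores are all hypothetical.

```python
from typing import Dict, List


def consensus_shortlist(
    scores: Dict[str, Dict[str, float]], top_k: int = 100
) -> List[str]:
    """Return the compounds ranked in the top_k for every protein.

    scores maps protein name -> {compound name -> predicted pChEMBL}.
    The result is ordered by mean predicted pChEMBL, highest first
    (an assumption made for this sketch).
    """
    top_sets = []
    for per_compound in scores.values():
        ranked = sorted(per_compound, key=per_compound.get, reverse=True)
        top_sets.append(set(ranked[:top_k]))
    common = set.intersection(*top_sets) if top_sets else set()

    def mean_score(compound: str) -> float:
        return sum(s[compound] for s in scores.values()) / len(scores)

    return sorted(common, key=mean_score, reverse=True)


# Toy data: hypothetical compounds and values, not taken from Table 2.
toy_scores = {
    "PL-Pro": {"A": 7.5, "B": 6.9, "C": 5.1, "D": 6.2},
    "3CL-Pro": {"A": 7.8, "B": 5.0, "C": 6.4, "D": 6.6},
    "Spike": {"A": 8.1, "B": 6.3, "C": 6.0, "D": 6.5},
}
shortlist = consensus_shortlist(toy_scores, top_k=2)
```

With `top_k=2` only compound "A" is in the top two for all three toy proteins. Raising `top_k` widens the shortlist, which mirrors the trade-off between the size of the candidate list and the cost of the follow-up docking.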

Furthermore, from our molecular docking experiments, we identified Rifabutin (in the set of 47 curated compounds), an antibiotic used to treat tuberculosis and Mycobacterium avium complex, as having the lowest binding energy scores for each of the three main viral proteins of SARS-COV-2. In a recent review, Wojewodzic (2020) highlighted that bacteriophages such as Rifabutin can be a potential game changer in the trajectory of COVID-19. Here we provide additional insights into the interaction of Rifabutin with SARS-COV-2 viral proteins. The PL-Pro viral protein has a right-hand thumb-palm-fingers architecture and contains a ubiquitin-like (UBL) domain at the N-terminus (see Fig. 5A). Several van der Waals and hydrogen-bond interactions stabilize the PL-Pro-Rifabutin complex (see Fig. 5B and C).

Fig. 5. (A) Cartoon representation of the PL-Pro viral protein bound to Rifabutin (red). The protein is colored according to secondary structure: helices are brown and strands are blue. (B) Surface representation of the complex structure highlighting the binding surface. (C) Rifabutin interactions with the AA residues of the PL-Pro protease. (D) Cartoon representation of the 3CL-Pro viral protein bound to Rifabutin (red). The three domains are shown in different colors: domain I in light green, domain II in light brown and domain III in light blue. (E) Surface representation of the complex structure highlighting the binding groove at the domain interface. (F) Rifabutin making significant interactions with the crucial AA residues of the SARS-COV-2 3CL-Pro protease. (G) Cartoon representation of the Spike protein bound to Rifabutin (red). The protein is rendered according to secondary structure elements. (H) Surface representation of the complex structure highlighting the binding surface. (I) Rifabutin interacts with the RBD of the Spike protein close to its receptor-binding site. (Color version of this figure is available at Bioinformatics online.)


The 3CL-Pro viral protein regulates transcription and replication processes by cleaving the polyprotein chains into different nonstructural proteins. It has 306 AA residues arranged in three distinct domains (I-III). Domains I and II mainly have an antiparallel β-barrel structure, while domain III comprises five α-helices (see Fig. 5D). Rifabutin docks at the interface between domains II and III of 3CL-Pro, and the complex is stabilized by several interactions with AA residues from both domains (see Fig. 5E and F). The core of the RBD of the Spike protein consists of antiparallel β-sheets (b1–4 and b7) with short interconnecting loops and helices (see Fig. 5G). Rifabutin binds close to the Spike protein–ACE-2 interaction site, and the complex is stabilized by hydrogen bonds and hydrophobic interactions (see Fig. 5H and I).

5 Discussion and conclusion

In this work, we show that the problem of predicting the activity value for compound–viral protein interactions can be formulated as a regression task. We illustrate that data-driven ML models (g(·)) based on simple representations of compounds (SMILES strings or Morgan fingerprints) and viral proteins (AA sequences) can perform this task accurately. As our models are based on representations of compounds (xc) and viral proteins (xv), we can further enhance them by using additional information such as 2D images of compounds. Similarly, we can utilize information such as physicochemical and structural properties of proteins, as showcased in Khurana et al. (2018) and Elbasir et al. (2019), to strengthen our models.

Our predictive framework is built on Dtrain, which contains information for over 97 different viral organisms along with their main proteins; hence our models are generalizable. This means that our models can produce an accurate ranked list of potential inhibitors for the next big viral threat once its associated proteins are known, and thus can be used as a data-driven screening tool. Moreover, it is known that viruses frequently mutate (Fleischmann, 1996). As a result, the viral protein will also have multiple point mutations, i.e. a few AAs in the viral protein sequence might change. This can have an immense impact on the 3D structure as well as the functionality of the viral protein (Bhattacharya et al., 2017). Thus, techniques based on virtual ligand screening using docking experiments, which require a high-quality 3D structure, such as Verma et al. (2020), Duarte et al. (2020) and Arul et al. (2020), can suffer in this situation. Our models, however, focus on the primary structure, so with point mutations only the vector representations LSv and xv will change. Since our mapping functions are generalizable (based on frequently co-occurring k-mers and subsequences in SMILES strings), we obtain a revised ranked list of compounds for the mutated viral protein in a computationally efficient manner.
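
To make the point-mutation argument concrete, the following sketch splits a protein sequence into overlapping k-mers and reports which k-mers change after a single substitution. It is only an illustration under assumptions: the stride-1 tokenization, the choice k = 3, the helper names and the 10-residue fragment are hypothetical and are not taken from our pipeline.

```python
from typing import List


def kmers(sequence: str, k: int = 3) -> List[str]:
    """Split a sequence into overlapping k-mers (stride 1)."""
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


def changed_kmer_positions(wild: str, mutant: str, k: int = 3) -> List[int]:
    """Indices of the k-mers that differ between two equal-length sequences."""
    if len(wild) != len(mutant):
        raise ValueError("substitution-only comparison needs equal lengths")
    return [
        i
        for i, (a, b) in enumerate(zip(kmers(wild, k), kmers(mutant, k)))
        if a != b
    ]


wild_type = "MKTAYIAKQR"  # hypothetical 10-residue fragment
mutant = "MKTAYLAKQR"     # single substitution I -> L at index 5
changed = changed_kmer_positions(wild_type, mutant, k=3)
```

A single substitution alters at most k of the overlapping k-mers (here the three starting at indices 3, 4 and 5), so most of the sequence-level input is unchanged and the model only needs to be re-evaluated, not retrained, for the mutated protein.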

For the COVID-19 use case, our consensus framework identifies a list of 47 compounds as potential inhibitors. By further validating this curated list using molecular docking experiments, we identified Rifabutin as a potential inhibitor, as it consistently achieved a low binding energy score for all three main proteins of the SARS-COV-2 virus. This suggests that a hybrid drug-repurposing approach can be developed, in which in silico compound-viral protein activity predictors are first used to screen a large set of compounds and produce a much smaller list. This list can be further curated using molecular docking experiments (utilizing high-quality 3D crystal structures) to prioritize the potential candidates for downstream in vivo and clinical trial stages.

Moreover, for the COVID-19 use case, our consensus framework recognized antivirals such as Remdesivir, Lopinavir and Ritonavir, which have been identified by multiple in silico and in vitro studies (Beigel et al., 2020; Sanders et al., 2020) as potentially effective against the SARS-COV-2 virus. However, according to the recent results from the SOLIDARITY trial (Pan et al., 2020), the aforementioned antivirals appear to have little or no meaningful effect on the overall mortality rate in hospitals. This highlights a limitation of our work. Our current mapping function g only considers xc and xv and does not include any information about the host organism (xh). Recently, in Gordon et al. (2020), 26 SARS-CoV-2 viral proteins were expressed in human cells and 332 high-confidence human protein interactions were identified using a network-based drug-repurposing approach. Similarly, in Gysi et al. (2020), a consensus of network-based approaches was utilized to identify repurposing candidates. Their drug-repurposing strategy relied on network proximity, diffusion and AI-based metrics, allowing all approved compounds to be ranked by their likely efficacy for COVID-19 and leading to 81 promising candidates. In Zeng et al. (2020), a network-based deep learning framework is used on top of a knowledge graph constructed on multiple entities such as diseases, drugs/compounds, genes and proteins (human and viral protein interactome) with the goal of identifying links between existing approved compounds and COVID-19. Moreover, the tool CoVex (Sadegh et al., 2020) integrates the human protein–protein interaction network and the host-interacting proteins, and employs strategies such as trust rank or multi-Steiner trees to identify repurposable drugs for COVID-19.

All the above-mentioned approaches take into consideration the interaction with the human interactome, a key missing link in our current framework. In future, we plan to extend our mapping function to g(xc,xv,xh;w) by considering compound–viral protein interactions, compound–human protein target interactions, human protein–protein interactions and human protein–viral protein interactions in a similar knowledge graph representation, in order to identify potentially repurposable compounds for any viral disease. Another strand of work that can be explored is the use of Transformer Networks, which use self-attention to capture long-range dependencies in sequence-to-sequence modeling, for building the SMILES embedding representation. Recent work in natural language processing has convincingly demonstrated that Transformer Networks are substantially more proficient than LSTMs with a comparable level of accuracy (Vaswani et al., 2017). In our particular instance, both the SMILES representation for compounds and the linear chain of amino acids for proteins can benefit from these approaches.
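
As a rough illustration of why self-attention captures long-range dependencies, the sketch below computes scaled dot-product self-attention (Vaswani et al., 2017) over a short sequence of token embeddings. It is a simplified, assumption-laden example: the learned query, key and value projections are omitted (Q = K = V = x), and the three 2-dimensional embeddings are hypothetical stand-ins for SMILES characters or amino acids.

```python
import math
from typing import List

Matrix = List[List[float]]


def softmax(row: List[float]) -> List[float]:
    """Numerically stable softmax over one row of scores."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(x: Matrix) -> Matrix:
    """Scaled dot-product self-attention with Q = K = V = x.

    Learned projection matrices are omitted to keep the sketch short.
    """
    d = len(x[0])
    out = []
    for query in x:
        # Score the query against every position, near or far.
        scores = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(d) for key in x
        ]
        weights = softmax(scores)
        # Each output is a weighted average of all value vectors.
        out.append(
            [sum(w * row[j] for w, row in zip(weights, x)) for j in range(d)]
        )
    return out


# Three hypothetical 2-dimensional token embeddings.
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
attended = self_attention(tokens)
```

Every output position is computed from all input positions in a single step, regardless of their distance in the sequence, whereas an LSTM has to propagate the same information through every intermediate time step.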

Financial Support: none declared.

Conflict of Interest: none declared.

References

Agresti A. (2015) Foundations of Linear and Generalized Linear Models. John Wiley & Sons, Inc., Corporate Headquarters, 111 River Street, Hoboken, NJ 07030-5774.

Andersen P.I. et al. (2020) Discovery and development of safe-in-man broad-spectrum antiviral agents. Int. J. Infectious Dis., 93, 268–276.

Arul M.N., Kumar S., Jeyakanthan J. and Srivastava V. (2020) Searching for target-specific and multi-targeting organics for Covid-19 in the drugbank database with a double scoring approach. Scientific Reports, 10, 1–16.

Beck B. et al. (2017) Assay operations for SAR support. In: Assay Guidance Manual [Internet]. Eli Lilly & Company and the National Center for Advancing Translational Sciences, Bethesda Maryland.

Beck B.R. et al. (2020) Predicting commercially available antiviral drugs that may act on the novel coronavirus (SARS-COV-2) through a drug-target interaction deep learning model. Comput. Struct. Biotechnol. J., 18, 784–790.

Beigel J.H. et al. (2020) Remdesivir for the treatment of Covid-19—preliminary report. N. Engl. J. Med., 383, 1813–1826.

Bhattacharya R. et al. (2017) Impact of genetic variation on three dimensional structure and function of proteins. PLoS One, 12, e0171355.

Boeckmann B. et al. (2003) The swiss-prot protein knowledgebase and its supplement trembl in 2003. Nucleic Acids Res., 31, 365–370.

Breiman L. (2001) Random forests. Mach. Learn., 45, 5–32.

Capecchi A. et al. (2020) One molecular fingerprint to rule them all: drugs, biomolecules, and the metabolome. J. Cheminformatics, 12, 1–15.

Chakraborti S., Srinivasan N. (2020) Drug repurposing approach targeted against main protease of sars-cov-2 exploiting ‘neighbourhood behaviour’ in 3d protein structural space and 2d chemical space of small molecules. Molecular Omics, 16, 474–491.

Chen T., Guestrin C. (2016) Xgboost: a scalable tree boosting system. In: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, ACM, New York, pp. 785–794.

Connor M.O. (2020) Deep Learning Coronavirus Cure, https://github.com/mattroconnor/deep_learning_coronavirus_cure (25 June 2020, date last accessed).

Dong E. et al. (2020) An interactive web-based dashboard to track Covid-19 in real time. Lancet Infect. Dis., 20, 533–534.

Drucker H. et al. (1997) Support vector regression machines. In: Advances in Neural Information Processing Systems, Curran Associates, Inc., New York, USA. pp. 155–161.

Duarte R.R.R., Dennis C.C., Luis P.I., Jez L.M., Douglas F.N. and Timothy R.P. (2020) Repurposing FDA-approved drugs for Covid-19 using a data-driven approach. ChemRxiv.

Elbasir A. et al. (2019) Deepcrystal: a deep learning framework for sequence-based protein crystallization prediction. Bioinformatics, 35, 2216–2225.

Elbasir A. et al. (2020) Bcrystal: an interpretable sequence-based protein crystallization predictor. Bioinformatics, 36, 1429–1438.

Fear G. et al. (2007) Protease inhibitors and their peptidomimetic derivatives as potential drugs. Pharmacol. Ther., 113, 354–368.

Fleischmann W.R. Jr. (1996) Viral genetics. In: Samuel, B. (ed.) Medical Microbiology. 4th edn. University of Texas Medical Branch at Galveston.

Food, Drug Administration. et al. (2020) Coronavirus (Covid-19) update: FDA issues emergency use authorization for potential covid-19 treatment. FDA News Release, 1.

Friedman J.H. (2001) Greedy function approximation: a gradient boosting machine. Ann. Stat., 29, 1189–1232.

Gao K.Y. et al. (2018) Interpretable drug target prediction using deep neural representation. In: Proceedings of the Twenty-Seventh International Joint Conference on Artificial Intelligence (IJCAI-18), Curran Associates, Inc., New York, USA. Vol. 2018. pp. 3371–3377.

Gaulton A. et al. (2017) The chembl database in 2017. Nucleic Acids Res., 45, D945–D954.

Gers F.A., Schmidhuber J.A., Cummins F.A. (1999) Learning to forget: continual prediction with LSTM. Neural Comput., 12, 2451–2471.

Goodfellow I. et al. (2016) Deep Learning. MIT Press, Cambridge, MA.

Gordon D.E. et al. (2020) A SARS-COV-2 protein interaction map reveals targets for drug repurposing. Nature, 583, 459–468.

Gupta A. et al. (2018) Generative recurrent networks for de novo drug design. Mol. Informatics, 37, 1700111.

Gysi D.M., Ítalo D.V., Marinka Z., Asher A., Xiao G., Onur V., Susan D.G. et al. (2020) Network medicine framework for identifying drug repurposing opportunities for Covid-19. Proceedings of the National Academy of Sciences, 118.

Haas J.V., Brian J.E., Philip W.I., Viswanath D. and Jeffrey R.W. (2017) Minimum significant ratio – a statistic to assess assay variability. In: Assay Guidance Manual [Internet]. Eli Lilly & Company and the National Center for Advancing Translational Sciences.

Harris D., Harris S. (2010) Digital Design and Computer Architecture. Morgan Kaufmann, Curran Associates, Inc., New York, USA.

Khurana S. et al. (2018) Deepsol: a deep learning framework for sequence-based protein solubility prediction. Bioinformatics, 34, 2605–2613.

Kim S. et al. (2016) Pubchem substance and compound databases. Nucleic Acids Res., 44, D1202–D1213.

Kipf T.N., Welling M. (2016) Semi-supervised classification with graph convolutional networks. International Conference on Learning Representations, 1–14.

Kitchen D.B. et al. (2004) Docking and scoring in virtual screening for drug discovery: methods and applications. Nat. Rev. Drug Discov., 3, 935–949.

Kramer M.A. (1991) Nonlinear principal component analysis using autoassociative neural networks. AIChE J., 37, 233–243.

Lamb A.M. et al. (2016) Professor forcing: a new algorithm for training recurrent networks. In: Advances in Neural Information Processing Systems, Curran Associates, Inc., New York, USA. pp. 4601–4609.

Lan J. et al. (2020) Structure of the SARS-COV-2 spike receptor-binding domain bound to the ACE2 receptor. Nature, 1–6.

Landrum G. (2013) Rdkit documentation. Release, 1, 1–79.

LeCun Y. et al. (1995) Convolutional networks for images, speech, and time series. The Handbook of Brain Theory and Neural Networks, 3361, 1995.

Liu T. et al. (2007) Bindingdb: a web-accessible database of experimentally determined protein–ligand binding affinities. Nucleic Acids Res., 35, D198–D201.

Mall R. et al. (2017) Detection of statistically significant network changes in complex biological networks. BMC Syst. Biol., 11, 32.

Mall R. et al. (2018) RGBM: regularized gradient boosting machines for identification of the transcriptional regulators of discrete glioma subtypes. Nucleic Acids Res., 46, e39.

Mall R., Suykens J.A. (2015) Very sparse LSSVM reductions for large-scale data. IEEE Trans. Neural Netw. Learn. Syst., 26, 1086–1097.

Martin W.R., Cheng F. (2020) Repurposing of FDA-approved toremifene to treat Covid-19 by blocking the spike glycoprotein and NSP14 of SARS-COV-2. Journal of Proteome Research, 19, 4670–4677.

Palotti J. et al. (2019) Benchmark on a large cohort for sleep-wake classification with machine learning techniques. NPJ Dig. Med., 2, 1–9.

Pan H. et al. (2020) WHO Solidarity Trial Consortium. Repurposed antiviral drugs for Covid-19; interim WHO Solidarity trial results. New England Journal of Medicine, 384, 497–511.

Pedregosa F. et al. (2011) Scikit-learn: machine learning in Python. J. Mach. Learn. Res., 12, 2825–2830.

Polykovskiy D. et al. (2018) Molecular sets (MOSES): a benchmarking platform for molecular generation models. arXiv Preprint arXiv:1811.12823.

Protein Data Bank. (1971) Protein data bank. Nat. New Biol., 233, 223.

Pushpakom S. et al. (2019) Drug repurposing: progress, challenges and recommendations. Nat. Rev. Drug Discov., 18, 41–58.

Rao D. et al. (2019) Continual unsupervised representation learning. In: Advances in Neural Information Processing Systems, Curran Associates, Inc., New York, USA. pp. 7647–7657.

Rawi R. et al. (2018) Parsnip: sequence-based protein solubility prediction using gradient boosting machine. Bioinformatics, 34, 1092–1098.

Riva L. et al. (2020) Discovery of SARS-COV-2 antiviral drugs through large-scale compound repurposing. Nature, 586, 113–119.

Roy K. et al. (2015) Understanding the Basics of QSAR for Applications in Pharmaceutical Sciences and Risk Assessment. Academic Press.

Sadegh S., Julian Matschinske, David B. Blumenthal, Gihanna Galindez, Tim Kacprowski, Markus List, Reza Nasirigerdeh (2020) Exploring the SARS-COV-2 virus-host-drug interactome for drug repurposing. Nature Communications, 11, 1–9.

Sanders J.M. et al. (2020) Pharmacologic treatments for coronavirus disease 2019 (Covid-19): a review. JAMA, 323, 1824–1836.

Suykens J.A., Vandewalle J. (1999) Least squares support vector machine classifiers. Neural Process Lett., 9, 293–300.

Thafar M. et al. (2019) Comparison study of computational prediction tools for drug-target binding affinities. Front. Chem., 7, 782.

The UniProt Consortium. (2017) Uniprot: the universal protein knowledgebase. Nucleic Acids Res., 45, D158–D169.

Ullah E. et al. (2018) Harnessing Qatar biobank to understand type 2 diabetes and obesity in adult Qataris from the first qatar biobank project. J. Transl. Med., 16, 99.

Ullah E. et al. (2017) Identification of cancer drug sensitivity biomarkers. In: 2017 IEEE International Conference on Bioinformatics and Biomedicine (BIBM). IEEE, New York, USA, pp. 2322–2324.

Vaswani A. et al. (2017) Attention is all you need. In: Guyon I. et al. (eds) Advances in Neural Information Processing Systems. Vol. 30. Curran Associates, Inc., New York, USA, pp. 5998–6008.

Veličković P., Guillem C., Arantxa C., Adriana R., Pietro L. and Yoshua B. (2017) Graph attention networks. International Conference on Learning Representations, pp. 1–12.

Verma D., Kapoor S., Das S., Thakur K. (2020) Potential inhibitors of SARS-COV-2 main protease (Mpro) identified from the library of FDA approved drugs using molecular docking studies. Preprints, 2020040149 (doi: 10.20944/preprints202004.0149.v1).

Wallach I., Dzamba M. and Heifets A. (2015) Atomnet: a deep convolutional neural network for bioactivity prediction in structure-based drug discovery. CoRR, abs/1510.02855.

Warren T.K. et al. (2016) Therapeutic efficacy of the small molecule GS-5734 against ebola virus in rhesus monkeys. Nature, 531, 381–385.

Wheeler D.L. et al. (2008) Database resources of the national center for biotechnology information. Nucleic Acids Res., 36, D13–D21.

Wishart D.S. et al. (2018) Drugbank 5.0: a major update to the drugbank database for 2018. Nucleic Acids Res., 46, D1074–D1082.

Wojewodzic M.W. (2020) Bacteriophages could be a potential game changer in the trajectory of coronavirus disease (Covid-19). PHAGE, 1, 60–65.

Zeng X. et al. (2020) Repurpose open data to discover therapeutics for Covid-19 using deep learning. J. Proteome Res., 19, 4624–4636.

Zhou Y. et al. (2020) Network-based drug repurposing for novel coronavirus 2019-NCOV/SARS-COV-2. Cell Discov., 6, 14–18.

This article is published and distributed under the terms of the Oxford University Press, Standard Journals Publication Model (https://academic.oup.com/journals/pages/open_access/funder_policies/chorus/standard_publication_model)
Associate Editor: Pier Luigi Martelli

Supplementary data