DockCoV2: a drug database against SARS-CoV-2

Abstract The current state of the COVID-19 pandemic is a global health crisis. To fight the novel coronavirus, one of the best-known ways is to block enzymes essential for virus replication. Currently, we know that the SARS-CoV-2 virus encodes about 29 proteins such as spike protein, 3C-like protease (3CLpro), RNA-dependent RNA polymerase (RdRp), Papain-like protease (PLpro), and nucleocapsid (N) protein. SARS-CoV-2 uses human angiotensin-converting enzyme 2 (ACE2) for viral entry and transmembrane serine protease family member II (TMPRSS2) for spike protein priming. Thus in order to speed up the discovery of potential drugs, we develop DockCoV2, a drug database for SARS-CoV-2. DockCoV2 focuses on predicting the binding affinity of FDA-approved and Taiwan National Health Insurance (NHI) drugs with the seven proteins mentioned above. This database contains a total of 3,109 drugs. DockCoV2 is easy to use and search against, is well cross-linked to external databases, and provides the state-of-the-art prediction results in one site. Users can download their drug-protein docking data of interest and examine additional drug-related information on DockCoV2. Furthermore, DockCoV2 provides experimental information to help users understand which drugs have already been reported to be effective against MERS or SARS-CoV. DockCoV2 is available at https://covirus.cc/drugs/.

The SARS-CoV-2 genome contains ∼30 000 nucleotides that encode about 29 proteins such as spike protein, 3Clike protease (3CLpro, also called main protease, Mpro), Papain-like protease (PLpro), RNA-dependent RNA polymerase (RdRp, also named nsp12), and nucleocapsid (N) protein (9,(13)(14)(15). These five proteins are all known to play important roles in the viral replication process in some way. For example, the surface spike glycoprotein is known to be critical for virus entry through binding to the major antigens of host receptors such as angiotensin-converting enzyme 2 (ACE2) (16,17). The 3CLpro is a key enzyme which digests pp1a and pp1ab, two proteins that yield functional viral proteins in the virus' life cycle (10,18). Blocking the activity of this enzyme would therefore suppress viral replication, making it an efficient therapeutic strategy. Likewise, PLpro is responsible for the cleavage of the viral polypeptide (19), and RdRp catalyzes viral RNA synthesis and plays an important role in the replication and transcription machinery of SARS-CoV-2 (20). Remdesivir, the potent inhibitor against SARS-CoV-2, is effective because of how it targeted the RdRp protein (20,21). N protein is an RNAbinding protein necessary for viral RNA transcription and replication (15). In addition to the viral proteins, SARS-CoV-2 uses human ACE2 for viral entry and human transmembrane serine protease family member II (TMPRSS2) for spike protein priming (16). We therefore hypothesize that these five SARS-CoV-2 proteins, spike, 3CLpro, PLpro, RdRp, N protein, and two host proteins, ACE2 and TMPRSS2, can be potential druggable targets.
Previously, researchers have targeted these viral proteins for drug discovery by using structure-assisted design, virtual and high-throughput screening (8,11). Similarly, these approaches can also be used for SARS-CoV-2 research (22).
For example, Jin et al. used a computer-aided drug design and virtual screening to identify the mechanism-based inhibitor (N3) and the antineoplastic drug carmofur targeting Mpro (10,11). They further determined the crystal structures of the N3-and carmofur-Mpro complexes and confirmed the antiviral activities of N3 and carmofur using the plaque-reduction assays (10,11).
Drug repositioning is the process of finding new uses for existing approved drugs, and is believed to offer great benefits over de novo drug discovery, as well as enable rapid clinical trials and regulatory review for COVID-19 therapy (23). Here we develop the database, DockCoV2, by performing molecular docking analyses of seven proteins including spike, 3CLpro, PLpro, RdRp, N protein, ACE2 and TMPRSS2 with 2285 FDAapproved and 1478 NHI drugs. DockCoV2 also provides appropriate experimental information with literature support. Several databases focus on delivering repurposing drugs against SARS-CoV-2. For example, Excelra's COVID-19 Drug Repurposing Database includes 128 small molecules and biologics (https://www.excelra. com/covid-19-drug-repurposing-database/) and the Open-Data collected the actionable data and validated methodologies (https://doi.org/10.1101/2020.06.04.135046). Additionally, Covid19 DB lists the curated data of clinical trials in Covid-19/2019-nCoV (http://www.redo-project.org/ covid19db/). To our knowledge, no database provides a more up-to-date and comprehensive resource with drugtarget docking results for repurposed drugs against SARS-CoV-2.

DATA SOURCES
For drug repositioning, we selected FDA and Taiwan NHI (National Health Insurance Administration) as our list of approved drugs. FDA-approved drugs were retrieved from the ZINC15 database (24), and Taiwan NHI-approved drugs were downloaded from the website of NHI. Drug names were extracted from the top five ingredients of each product, and drug structures were obtained from the Pub-Chem Compound database (25). Due to the limitation of the ligand preparation pipeline, we dropped some compounds with ions which could not be converted properly by the gen3d operation of OpenBabel (version 3.0.0) (26). Note that we also added some additional FDA-approved drugs which are not in the ZINC15 database. In total, 2285 FDAapproved drugs and 1478 Taiwan NHI-approved drugs were recruited as our candidates for virtual screening.
Seven target proteins were selected because of their involvement in viral entry and replication. The protein structures, including 3CLpro (PDB ID: 6LU7) (10), PLpro (PDB ID: 6WX4) (https://doi.org/10.1101/2020.04.29. 068890), RdRp (PDB ID: 7BV2) (21), spike receptorbinding domain (RBD) (PDB ID: 6M0J) (27), N protein (PDB ID: 6M3M) (15) and ACE2 (PDB ID: 1R42) (28) were retrieved from Protein Data Bank. To ensure the availability of the binding sites of the proteins, we removed the compounds added for protein structure determination such as ligands, cofactors, and ions. For the spike protein, only the RBD was used for molecular docking and for 7BV2, only the nsp12 was used out of the nsp7-nsp8-nsp12 com-plex. The structure of human TMPRSS2 is currently not available in Protein Data Bank, so we used homology modeling on the sequence of TMPRSS2 from Uniprot (ID: O15393) (29) to build an approximate structure based on the template of the human serine protease hepsin (PDB ID: 5CE1) using SWISS-MODEL (30).

DATABASE CONTENT
DockCov2 aims to speed up the process of finding potential drugs by providing a computational representation of molecular docking with the most commonly-used drug databases (31-35). Through our website, researchers can query the docking scores of candidate drugs with essential drug-related information from the database to identify which drugs could be potential drugs. We designed a virtual screening pipeline, and docked 2285 FDA approved drugs and 1478 Taiwan National Health Insurance (NHI) approved drugs with the seven main proteins of the SARS-CoV-2 infection mechanism (36)(37)(38). All of our docking results are saved in the formats of Protein Data Bank, Partial Charge (Q) and Atom Type (T) (PDBQT), and can be viewed either directly on the website through NGLView (39) or downloaded for further docked structure refinement.
The overview of the database is shown in Figure 1. In addition to the molecular docking score, the joint panel section has three tabs: docking structure, ligand information and experimental data. The docking structure panel visualizes the docking PDBQT from the self-calculated database, which also contains druglikeness information and compound similarity. The ligand information panel contains (a) structural information from PubChem Compound database (25); (b) drug information from DrugBank (33); (c) pathway information from Kyoto Encyclopedia of Genes and Genomes (KEGG) (40); (d) clinical-related information from Repurposing Hub (34); and (e) other chemical information. In the experimental data panel, we collected experimental and literary evidence from ChEMBL (35), a study of gene set enrichment analysis (GSEA) from Zhou et al. (41) and COVID-19 Crowd Generated Gene and Drug Set Library (https://amp.pharm.mssm.edu/covid19).

Identifier conversion
We chose PubChem compound identifiers (CIDs) as the key identifier in DockCoV2 because it is unique and widely used and can be easily linked to other external databases. For drugs retrieved from NHI, drug names were translated into CIDs using whole source mappings provided by UniChem (31) and PubChem Identifier Exchange Service (https:// pubchem.ncbi.nlm.nih.gov/idexchange/). Specifically, multiple CIDs may be returned by PubChem Identifier Exchange Service for each drug name. We canonically chose the first returned record in order to be consistent with the result by manual search via the graphical interface of PubChem. Once we have the PubChem CIDs of our compound library, simplified molecular-input line-entry system (SMILES) (42), international chemical identifier (InChI), hashed values of international chemical identifiers (InChIKey), and synonyms were fetched through the Pub-Chem PUG API (43).

Ligand information
For drug development, Lipinski's rule of five (44), Ghose filter (45), Veber filter (46), rapid elimination of swill (REOS) filter (47) are several available indicators to evaluate the druglikeness. Therefore, components of these rules were calculated by RDKit (http://www.rdkit.org). Moreover, fragment screening is able to find a sensible starting point for the evolution of a new molecule in modern drug discovery. Previously, a large crystallographic fragment screen (48) was performed against 3CLpro by the Diamond Light Source group. Seventy-eight hits were released to the public in March 2020. In DockCoV2, we performed the substructure matching of these fragments toward the drug library we used here. To let the user easily fetch structurally sim-ilar compounds, similarity search (49) strategies were also adopted. Similarity search within the DockCoV2 and toward whole PubChem compound databases are both supported. To generate similarity values between any two compounds in DockCoV2, a 512-bit fingerprint of each compound is calculated using the Morgan algorithm (50). Each bit represents the presence or absence of a particular structural feature. From here, similarity values between any compound pairs are represented by the Dice coefficient. In the current version of DockCoV2, the rule of five, substructure matching, and the calculation of similarity values were all done by using the RDKit 2020.03.2 release.
Pathway information can serve as the basis of hypothesis in drug discovery. Therefore, two databases of KEGG (namely, KEGG Drug and KEGG Compound) were integrated into DockCoV2. PubChem CIDs were converted into DrugBank IDs first, then further converted into KEGG Drug IDs and KEGG Compound IDs. Alterna-Nucleic Acids Research, 2021, Vol. 49, Database issue D1155 tively, PubChem CIDs were directly converted to KEGG Drug IDs and KEGG Compound IDs by PubChem Identifier Exchange Service. Clinically related information including approval information, clinical phase, mechanism of action (MoA), target, disease area, and indication were also retrieved from the metadata file from the Drug Repurposing Hub.

Experimental data
To provide useful validation information for drug repurposing and development, chemical information was aggregated either from external databases or directly calculated from structures. Three biological assays CHEMBL4303805, CHEMBL4303810, CHEMBL4303819 were fetched from ChEMBL (35). All of these three assays are functional assays. For CHEMBL4303805, Antiviral activities were determined as inhibition of SARS-CoV-2 induced cytotoxicity of Caco-2 cells at 10 uM after 48 h by high content imaging. Ratios of inhibition of each compound were reported. For CHEMBL4303810, Antiviral activities against SARS-CoV-2 (USA-WA1/2020 strain) were measured by imaging in HRCE cells at MOI 0.4 after 96 h. Hit scores were reported from 0 to 1 for on-disease versus off-disease activity: scores >0.6 are considered hits. For CHEMBL4303819, Inhibition of cell viability relative to arbidol control (inhibition index) was measured by fluorescence (OD590nm) in Vero E6 cells, which were infected with SARS-CoV-2 (strain BavPat1) at MOI 0.002 after 72 h. Inhibition index with a value over 1 indicates higher activity. ChEMBL IDs were translated into CIDs using whole source mappings provided by UniChem (31) and PubChem Identifier Exchange Service as well.
Gene Set Enrichment Analysis (GSEA) is a method that determines whether an a priori defined set of genes show statistically significant expression differences between two biological states (51). Here, three NCBI Gene Expression Omnibus (GEO) data, GSE1739, GSE33267 and GSE122876 were used to identify sets of differentially expressed genes post SARS or MERS infection. Gene expression profiles of drug treatment were retrieved from the Connectivity Map (CMAP) database. The GSEA enrichment scores indicate the potential effectiveness of candidate drugs to reverse the gene expression signature of HCoV infection. For detailed information, please see Zhou et al. (41). Literature was also fetched from COVID 19 Coward Generated Gene and Drug Set Library to assist the user to validate the hypothesis of an interaction between a drug and its target. Only compounds that were reported in experiments were collected.

VIRTUAL SCREENING
There are many softwares that exist for virtual screening such as VSDK (52), AUDocker (53), DockoMatic (54), PyRx (55), etc. However, the efficiency of these tools may not be applicable on large drug banks such as the complete FDA-approved drug list. Therefore, we used AutoDock Vina (version 1.1.2) (56) as the core docking utility to reconstruct our virtual screening pipeline, and ran it on a Kubernetes server with parallel computing (Figure 2). Kubernetes is an engine server, which can automatically deploy, Figure 2. The virtual screening pipeline. We used AutoDock Vina as the core docking utility to reconstruct the virtual screening pipeline. In addition, the Kubernetes server and Redis data store were adopted for parallel computing. After finishing docking, the top 20 docking poses were taken to build a protein heatmap for visualization. scale and manage the containerized tasks to clusters. In practice, we uploaded the FDA-approved and Taiwan NHIapproved drug lists to Redis data storage, and generated 128 Kuberenetes pods in parallel. Each pod continuously takes up a new docking task of a drug until the Redis drug list is completed. Based on our implementation, the virtual screening for our drug list only costs 3 days per protein.
To simulate binding affinity between protein and ligands, each of the drug's 3D structure was obtained from the PubChem database in structure-data file (SDF) format. The gen3d operation of OpenBabel (version 3.0.0) (26) was used for energy minimization. This operation iterated 500 cycles of performing the geometry optimization with the MMFF94 force field and conducting the weighted rotor conformational search, in order to generate a likely global minimum energy conformer in MOL2 file format. Since AutoDock Vina only takes the PDBQT format as input, we used AutoDockTool 1.5.7 (http://autodock.scripps.edu/ resources/adt) to convert the file format from MOL2 to PDBQT with default parameter settings excluding a parameter -A, which would add hydrogens to structure only if there are none already. After that, we applied rigid-body docking on these converted files using AutoDock Vina. In order to consider all of the potential docking poses, the entire protein is taken as the search space for blind search. We noted that the number of runs of the docking simulation should be adjusted accordingly by considering the variety  of protein sizes. In AutoDock Vina, the number of runs is set by the exhaustiveness parameter, and the default setting of exhaustiveness is set at eight for a search space below 30 × 30 × 30Å (56). We proportionally scaled the exhaustiveness depending on the size of the protein by a factor of 2. For example, if the size of a protein is 60 × 60 × 60Å, the exhaustiveness would be 8 × (60/30) × (60 /30) × (60 /30) × 2 = 128. AutoDock Vina reported multiple docking scores for each run, and the top score was selected as the final result shown on the website. The distribution of the final scores of each protein is shown in Figure 3. By comparing the scores between systems (different proteins), there is no bias of docking scores toward any protein we are interested in. By examining the intra-system scores, we observed that for each protein if there is a group or few ligands that are with particularly good docking scores.
After finishing docking, AutoDock Vina will generate multiple docking poses for each ligand-protein pair. For the sake of a straightforward representation of the docking results, the top 20 docking poses from AutoDock Vina were taken to build a protein heatmap. For each pose, a sphere centered at the centroid of ligand with a radius of 5Å is highlighted, and the frequency of residues inside the sphere is incremented. The protein heatmap and the ligand docking position with the best affinity score are shown on the main page of each entry (a ligand-protein pair) in Dock-CoV2, where the red scale of the protein heatmap denotes the frequency of residues over twenty trials. Figure 4 is a demonstration of the structure of 3CLpro (PDB ID: 6LU7) (10) with molecular docking. In order to validate our virtual screening pipeline, we separated the original structure into a ligand and a receptor, and docked them with the proposed pipeline. The left-hand side of Figure 4  producing our results or repurposing the proposed pipeline for other drugs or proteins, we released our scripts and the configuration file for docking procedures on GitHub (https://github.com/ailabstw/DockCoV2).

INTERFACE
In order to provide an efficient searching interface for users, we built a website with a search engine based on Django 3.0 (https://djangoproject.com) and Bootstrap 4.3.1 (https: //getbootstrap.com/). The search bar, which can be found either in the index page or on navbar, allows the user to input the drug name, PubChem CID or synonym to explore the docking score with all proteins. The details of each ligand are shown after choosing a specific ligandprotein pair, including docking structure, ligand information and experimental data. PDBQT files of ligand and protein from AutoDock Vina are also provided on the main page of each ligand-protein pair. In the docking structure panel, both protein heatmap and sequences are displayed on the right-side of the detail page. Protein heatmap is interactively demonstrated with the NGL viewer tool, which is a WebGL-based molecular viewer (57). To easily visualize the targeted binding sites on the proteins, not only basic operations on protein heatmap, but also selecting or limiting atoms/residues through the selection language is available on the docking structure panel. The ligand information panel is divided into ligand relative, clinical relative, druglikeness, other relative info and drug similarity. Ligand relative part consists of basic compound Information, including 3D structure view, CAS number, SMILES, InChI, InChI key, synonyms and pathway from KEGG Drug and KEGG Compound databases. Clinical Relative consists of approval information, which denotes whether or not this ligand has been approved by the Taiwan NHI or FDA, as well as other corresponding data from Drug Repurposing Hub, including clinical phase, mechanism of action (MoA), target, disease area, and indication. Results and components of Lipinski's rule of five, Ghose filter, Veber filter and REOS filter are shown in druglikeness section. External links to Pub-Chem and Drugbank are also provided in the other relative info section. SARS-CoV-2 relative, related documents and drug assays are included in experimental data.

CONCLUSION
Drug repositioning, the exploration of repurposing existing drugs that have already passed the safety screening for new indications, is a relatively quick therapeutic solution for emerging infectious diseases like SARS-CoV-2. For the purpose of drug repositioning, molecular docking is a well-known virtual screening method to evaluate a great amount of compounds automatically. Based on the urgent need to find effective drugs, we established Dock-CoV2, the first virtual screening database for SARS-CoV-2. DockCoV2 focuses on predicting 2285 FDA-approved and 1478 Taiwan NHI-approved drugs targeting five proteins relating to the mechanism of viral entry and replication. Those docked structures could also be further utilized and get more accurate binding poses and the correct ranking of the binding affinity through methods such as molecular mechanics Poisson-Boltzman surface area (MM/PBSA) and molecular mechanics generalized Boltzman surface area (MM/GBSA) calculations (22,58). In addition, DockCoV2 also provides experimental data, including biological assays, pathway information, and gene set enrichment analysis recruited from other validated databases. Future versions will include the prediction of binding affinity with other potentially druggable targets against SARS-CoV-2 and the reported experimental assays and high-throughput screening results to support our prediction.