DrugCentral 2021 supports drug discovery and repositioning

Abstract DrugCentral is a public resource (http://drugcentral.org) that serves the scientific community by providing up-to-date drug information, as described in previous papers. The current release includes 109 newly approved (October 2018 through March 2020) active pharmaceutical ingredients in the US, Europe, Japan and other countries; and two molecular entities (e.g. mefuparib) of interest for COVID-19. New additions include a set of pharmacokinetic properties for ∼1000 drugs, and a sex-based separation of side effects, processed from FAERS (FDA Adverse Event Reporting System); as well as a drug repositioning prioritization scheme based on the market availability and intellectual property rights for FDA-approved drugs. In the context of the COVID-19 pandemic, we also incorporated REDIAL-2020, a machine learning platform that estimates anti-SARS-CoV-2 activities, and added the ‘drugs in the news’ feature, which offers a brief enumeration of the most interesting drugs at the present moment. The full database dump and data files are available for download from the DrugCentral web portal.


INTRODUCTION
DrugCentral integrates a broad spectrum of drug resources related to chemical structures, biological activities, regulatory data, pharmacology and drug formulations (1). Since 2018, DrugCentral has continuously strengthened its role as a key resource for the worldwide scientific community, being additionally cross-referenced by several resources, such as UniProt (2), ChEBI (3), Hetionet (4), GUILDify (5), UniChem (6) and Guide to Pharmacology (7). DrugCentral has served as a primary resource for RepoDB, a drug repurposing database (8), a time-resolved computational drug repurposing algorithm (9), and an adverse drug event network for computational toxicology predictions (10). First introduced and published in the 2017 NAR database issue (1), DrugCentral reconciles the basic scientist's understanding of the 'drug' concept (active pharmaceutical ingredient) with the view of the patient and healthcare practitioner (pharmaceutical formulation). Since its initial launch, the two DrugCentral papers (1,11) have been cited more than 160 times according to Google Scholar, and the website is accessed on average by ∼8000 visitors monthly, with a monthly average of ∼20 000 page views and over 20 000 full database downloads per year (as of 15 September 2020). Throughout regulatory and scientific documents, several terms are often used interchangeably: drug substance, new chemical (or molecular) entity and active (pharmaceutical) ingredient. While these terms have precise contextual meaning, in this paper preference is given to the term 'drug' as synonymous with these three concepts. The term 'formulation' is used when discussing pharmaceutical products.
The current update adds drugs newly approved by the US Food and Drug Administration (FDA, https://www.fda.gov/home) and the European Medicines Agency (EMA, https://www.ema.europa.eu/en) up to 31 March 2020. Drugs approved by the Japan Pharmaceuticals and Medical Devices Agency (PMDA, https://www.pmda.go.jp/english/index.html) were also monitored up to the latest information available, i.e. November 2019. In addition, for numerous drugs present in DrugCentral since 2018, regulatory agency information was added according to their approval status.
An important component of drug discovery and repositioning is information related to the pharmacokinetic (PK) properties of drugs, e.g. maximum recommended dose or half-life, as well as information related to side effects. In this regard, DrugCentral 2021 introduces critically reviewed information on PK, thus increasing the clinical pharmacology-related information coverage for drugs. Furthermore, adverse drug events separated by sex are tabulated at the drug level, to increase our understanding of drug safety.
Sudden outbreaks can rapidly impact global health, as evidenced by the COVID-19 pandemic, caused by severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2). This pandemic has accelerated the need to rely on computational platforms (12) capable of identifying and advancing novel therapeutics for clinical evaluation. In this regard, the current DrugCentral update equips computational and medicinal chemists with (i) drug repositioning categories, i.e. an in-depth classification of drugs based on current market status and intellectual property rights in the US (13), to prioritize new therapeutic uses for 'old drugs'; and (ii) a suite of machine learning models that predict anti-SARS-CoV-2 activities, REDIAL-2020 (14), to prioritize compounds against COVID-19.

Active pharmaceutical ingredients
The current DrugCentral update adds 109 newly approved drugs and two molecules (mefuparib and EIDD-2801, or ) with anti-SARS-CoV-2 potential to the 4531 indexed in 2018 (11). The vast majority of these were approved by the US FDA (95 drugs), followed by EMA (36 drugs), with 31 overlapping drugs. Compared to the additions in 2018, the number of newly approved drugs in Japan has nearly tripled, i.e. 16 new drugs compared to 6. In the past two years, the ratio of newly approved drugs between small organic molecules and biologics has changed in favor of the first class (70 small molecule drugs compared to 35 biologics), which contrasts with a more balanced ratio encountered in the last version of the database (11). Compared to the 2018 update, we note increases in the number of approved subtypes of biologics, such as antibody-drug conjugates (60% increase), oligonucleotides (50% increase) and monoclonal antibodies (30% increase). Approximately half of the drugs processed (i.e. 52) are orphan drugs (15), highlighting the therapeutic gain in the group of rare diseases (15,16). Out of the newly added drugs, the ChEMBL database (17) indexes 104 (91) of the 111 drugs, KEGG (18) indexes 107, DrugBank captures 105 and the Guide to Pharmacology 77 drugs, respectively (Table 1).
Our knowledge-based protein classification (19) bins human proteins into four categories, according to their 'target development level' (TDL): Tclin are MoA-designated drug targets via which approved drugs act (15,20,21), currently 659 human proteins; Tchem are proteins that are not Tclin, but are known to bind small molecules with high potency; Tbio includes proteins that have Gene Ontology (22) 'leaf' (lowest level) term annotations based on experimental evidence, or that meet two of the following three conditions: a fractional publication count (23) above five, three or more GeneRIF ('Gene Reference Into Function') annotations (https://www.ncbi.nlm.nih.gov/gene/about-generif), or 50 or more commercial antibodies, as counted in the Antibodypedia portal (24). The fourth category, Tdark, currently includes ∼31% of the human proteome: proteins that were manually curated at the primary sequence level in UniProt, but do not meet any of the Tclin, Tchem or Tbio criteria. DrugCentral 2021 contains 669 Tchem, 219 Tbio and 14 Tdark proteins linked to 3859, 607 and 39 bioactivity points, respectively. These proteins are mapped onto the Target Central Resource Database (TCRD) and interfaced with the TCRD portal, Pharos (25,26).
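The TDL binning described above is a small rule cascade, and it can be sketched as follows. This is an illustrative sketch only, not the TCRD implementation: the `ProteinEvidence` record and its field names are hypothetical, and each boolean field stands in for an upstream curation step (for example, what counts as 'high potency') that is not modeled here.

```python
from dataclasses import dataclass


@dataclass
class ProteinEvidence:
    """Hypothetical evidence record for one human protein."""
    is_moa_target: bool = False             # MoA-designated target of an approved drug
    has_potent_ligand: bool = False         # binds a small molecule with high potency
    has_experimental_go_leaf: bool = False  # GO leaf term backed by experimental evidence
    fractional_pub_count: float = 0.0
    generif_count: int = 0
    antibody_count: int = 0                 # commercial antibodies (Antibodypedia)


def classify_tdl(p: ProteinEvidence) -> str:
    """Return Tclin, Tchem, Tbio or Tdark following the order given in the text."""
    if p.is_moa_target:
        return "Tclin"
    if p.has_potent_ligand:
        return "Tchem"
    # Tbio: an experimental GO leaf annotation, or two of the three conditions.
    conditions_met = sum([
        p.fractional_pub_count > 5,
        p.generif_count >= 3,
        p.antibody_count >= 50,
    ])
    if p.has_experimental_go_leaf or conditions_met >= 2:
        return "Tbio"
    return "Tdark"
```

The checks run in priority order, so a protein that is both an MoA target and a potent-ligand binder is reported as Tclin, matching the definition of Tchem as 'not Tclin'.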

Pharmacological classification
New and existing drugs in DrugCentral were mapped (or remapped) into the latest versions of the World Health Organization Anatomic, Therapeutic and Chemical classification system (WHO ATC, https://www.whocc.no/), the FDA Established Pharmacologic Class (EPC, https://bit.ly/2OWiJdH), the Medical Subject Headings (MeSH) (27) and ChEBI (3) pharmacological classifications using the adaptive mapping schemes described in 2018. The resulting pharmacological additions are described in Table 1. Among novel drugs, 78 were linked to 136 pharmacologic classifications; 313 of the drugs were mapped to 424 additional pharmacologic terms. (17 052) and topical (9832) administrations. The percentage of human prescription (Rx) products (52.7%) remains only slightly higher compared to OTCs.

Drug repurposing categories
The current version of DrugCentral includes a recently published drug repurposing categorization scheme (13), according to which drugs are sorted based on their market availability and intellectual property rights (including exclusivity protections) into three distinct categories: OFP, or off-patent, which are on-market drugs with expired patents or exclusivities; ONP, or on-patent, which are on-market drugs covered by current patents and exclusivity protections; and OFM, or off-market, which includes all previously marketed drugs that have been discontinued or withdrawn. The analysis, based on the US FDA's Orange Book (FDA-OB), mapped small organic molecules and peptides from DrugCentral (having molecular weight between 100 and 1250) onto FDA-OB. In total, 996 drugs were categorized as OFP, 320 as OFM and 237 as ONP (Figure 1). These drugs can be found in a variety of pharmaceutical formulations, but oral drugs are predominant in all three sets: 73% in OFP, 82% in ONP and 62% in OFM. Moreover, the data show an increasing proportion of oral drugs among more recently approved drugs (i.e. ONP and OFP compared to OFM). This classification scheme allows researchers to inform their decisions with respect to drug repositioning based on the existing intellectual property landscape. Given that, in time, novel drugs will be added and other drugs will change categories (i.e. ONP drugs naturally migrate to OFP and, possibly, to OFM), this drug repositioning classification will be updated on a yearly basis following the previously described workflow (13). This feature complements the pharmacopedic nature of DrugCentral, providing the scientific community (academia and industry) with support to more efficiently advance 'old' drugs toward new therapeutic opportunities (28).
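The three-way split described above can be expressed as a short decision function. The sketch below is a simplification under stated assumptions, not the published workflow (13): the function name and its boolean inputs are hypothetical, the molecular weight bounds are treated as inclusive, and the actual Orange Book mapping of patents and exclusivities is reduced to a single flag.

```python
from typing import Optional


def repositioning_category(mol_weight: float,
                           on_market: bool,
                           has_active_protection: bool) -> Optional[str]:
    """Assign OFP, ONP or OFM; return None for drugs outside the analyzed range.

    has_active_protection is assumed to be True when at least one patent or
    exclusivity is still in force for the drug.
    """
    # Only small molecules and peptides with MW between 100 and 1250 were analyzed.
    if not (100 <= mol_weight <= 1250):
        return None
    if not on_market:
        return "OFM"  # previously marketed, now discontinued or withdrawn
    return "ONP" if has_active_protection else "OFP"


# Counts reported in the text, used here only to derive each category's share.
counts = {"OFP": 996, "OFM": 320, "ONP": 237}
total = sum(counts.values())
shares = {name: round(100 * n / total, 1) for name, n in counts.items()}
```

Because an ONP drug becomes OFP once its protections lapse, re-running such a function with refreshed patent data is one way to picture the yearly update mentioned above.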

ADMET-PK data
DrugCentral 2021 now includes nine measured properties that describe pharmacokinetics (PK) such as absorption, distribution, metabolism, excretion and toxicity (ADMET) for a number of drugs. These ADMET-PK data were retrieved from five authoritative references (29-33), which themselves are extensively curated compilations from biomedical literature or drug records. These ADMET-PK properties are highly relevant for understanding the fate of drugs in the human body, for estimating dosage regimens and for conducting data analyses or machine learning studies. The number of drugs indexed with each property is summarized in Figure 2. What follows is a brief description of the ADMET-PK properties incorporated in DrugCentral 2021.
i. The absolute oral bioavailability (BA) indicates the fraction of the orally dosed drug that is absorbed through the gut, undergoes first-pass metabolism (gut and liver) and reaches systemic circulation.
ii. The volume of distribution at steady state (Vd) is the theoretical volume (expressed in L/kg) necessary to contain the measured steady-state drug concentration in plasma.
iii. The systemic (or total) clearance (CL) is the volume of plasma from which a drug is completely removed per unit time. It is expressed as mL/min/kg and it is the sum of the clearance of the drug by each organ: kidneys, liver, etc.
iv. Half-life (t1/2) is the time (expressed in hours) it takes for a drug to decrease to half of its maximum concentration in plasma.
v. The fraction unbound (fu) is the fraction of drug that is not bound to plasma proteins.
vi. Water solubility (S) indicates the degree of a drug dissolving in water at neutral pH and 37 °C.
vii. The extent of metabolism (EoM) is the fraction of the drug (API) excreted unchanged (mainly, in urine).
viii. The Biopharmaceutical Drug Disposition Classification System (BDDCS) is an adaptation of the FDA Biopharmaceutical Classification System for bioequivalence studies. In BDDCS, drugs are assigned to four categories in accordance with solubility and EoM cutoffs: Class 1 are high solubility, extensively metabolized drugs; Class 2 are low solubility, extensively metabolized drugs; Class 3 are high solubility, poorly metabolized drugs; and Class 4 are low solubility, poorly metabolized drugs. It should be noted that the solubility used for BDDCS is the one defined by FDA guidance: the solubility of the formulated active ingredient at its highest approved dose strength, in 250 mL of water, at 37 °C, over the pH range 1-6.8 (https://www.fda.gov/media/70963/download). BDDCS has proven to be useful in understanding the role of drug transporters (34), in predicting the brain permeability of drugs (35) and in understanding the PK specificity of drug targets (36). BDDCS, S and EoM data were gathered from two separate publications (31,32).
ix. The Maximum Recommended Therapeutic Daily Dose (MRTD) is the dose threshold above which a drug starts to manifest adverse reactions. Therefore, it is a measure of the toxicity potential of a drug. The original publication (33) reported MRTD in mg/kg/day units, whereas DrugCentral 2021 uses M/kg/day (i.e. the mg quantities were divided by the molecular weight of the specific active ingredient). MRTD values were re-normalized to an average body weight of 70 kg instead of the original 60 kg, although the 'average 70 kg man' concept needs re-evaluation (37).
As new data points become available, these will be added in DrugCentral.
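Two of the properties listed above lend themselves to a short worked sketch: the BDDCS class follows directly from two yes/no attributes, and Vd and CL can be combined into a rough half-life estimate. The half-life relation t1/2 = ln(2) × Vd / CL is the textbook one-compartment approximation and is an assumption added here for illustration; DrugCentral reports measured t1/2 values, which need not match this estimate. The function names are hypothetical.

```python
import math


def bddcs_class(high_solubility: bool, extensively_metabolized: bool) -> int:
    """Map the two BDDCS attributes onto classes 1 to 4 as defined in the text."""
    if extensively_metabolized:
        return 1 if high_solubility else 2
    return 3 if high_solubility else 4


def half_life_hours(vd_l_per_kg: float, cl_ml_per_min_per_kg: float) -> float:
    """Estimate t1/2 in hours from Vd (L/kg) and CL (mL/min/kg).

    Uses the one-compartment approximation t1/2 = ln(2) * Vd / CL.
    """
    if cl_ml_per_min_per_kg <= 0:
        raise ValueError("clearance must be positive")
    vd_ml_per_kg = vd_l_per_kg * 1000.0  # L/kg -> mL/kg so the units cancel
    minutes = math.log(2) * vd_ml_per_kg / cl_ml_per_min_per_kg
    return minutes / 60.0                # minutes -> hours
```

For example, a drug with Vd = 0.7 L/kg and CL = 10 mL/min/kg would have an estimated half-life of roughly 0.8 hours under this approximation.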

Sex-differences in adverse drug events
FAERS (FDA Adverse Event Reporting System, https://open.fda.gov/data/faers/) data were first incorporated in DrugCentral 2018 (11). Compared to the 2018 release, there was a 10% increase in unique drugs (from 2023 to 2220), which are associated with 12 098 unique MedDRA terms (i.e. adverse events, AEs; Medical Dictionary for Regulatory Activities, https://www.meddra.org/), resulting in 739 990 drug-AE combinations. The larger the log likelihood ratio (LLR) value (38) for an AE, the more likely it is that the event occurred due to a drug; signals are considered significant for AEs with LLRs larger than the calculated drug-specific threshold value (t). Statistically relevant signals for the LLR test yield 1618 unique drugs associated with 8185 unique AEs, for a total of 147 191 (20%) significant drug-AE combinations. The DrugCentral 2021 FAERS dataset supports sex-specific granularity for AEs. An overview of the sex differences described in Table 3 shows a larger number of AEs reported for women compared to men. Indeed, at LLR > 5*t, the number of API-AE pairs almost doubles in females. This phenomenon, first reported in the US using FAERS data (39), and independently confirmed in the Netherlands (40), shows that sex bias in medical treatment persists, ten years after it was first discussed (41). Creating an interface that highlights sex differences in AEs may facilitate further analyses and may reveal essential drug actions to pave the way for truly personalized medicine (42).
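The thresholding logic above (an AE is a signal when its LLR exceeds the drug-specific threshold t, with a stricter cut at 5*t) can be sketched as a filter over drug-AE records. This is a minimal sketch that assumes LLR and t have already been computed upstream; the `DrugAE` record, its field names and the 'significant'/'strong' labels are hypothetical and do not reflect the DrugCentral schema.

```python
from collections import Counter
from typing import Iterable, NamedTuple


class DrugAE(NamedTuple):
    """Hypothetical record for one drug-AE pair within one sex stratum."""
    drug: str
    adverse_event: str
    sex: str          # e.g. "F" or "M"
    llr: float        # log likelihood ratio for this pair
    threshold: float  # drug-specific threshold t


def signal_level(llr: float, threshold: float, strong_factor: float = 5.0) -> str:
    """Label a pair as 'strong' (LLR > 5*t), 'significant' (LLR > t) or 'none'."""
    if llr > strong_factor * threshold:
        return "strong"
    if llr > threshold:
        return "significant"
    return "none"


def count_signals_by_sex(records: Iterable[DrugAE], level: str = "significant") -> Counter:
    """Count pairs per sex at or above the requested signal level."""
    wanted = {"significant": {"significant", "strong"}, "strong": {"strong"}}[level]
    return Counter(
        r.sex for r in records if signal_level(r.llr, r.threshold) in wanted
    )
```

Comparing `count_signals_by_sex(records, "strong")` between the two sexes is the kind of tabulation that underlies the comparison at LLR > 5*t.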

REDIAL-2020
DrugCentral 2021 incorporates a web server named 'REDIAL-2020' to efficiently estimate anti-SARS-CoV-2 activities from molecular structure (14). REDIAL-2020 hosts a suite of machine learning (ML) models that represent various experimental assays related to the live virus infectivity (LVI), viral entry (VE) and virus replication (VR) processes. It currently consists of six ML models that represent six assays using data from the NCATS (National Center for Advancing Translational Sciences) COVID-19 portal (43). These assays are: the SARS-CoV-2 cytopathic effect, CPE (LVI) (44); Vero E6 host cell cytotoxicity (LVI counterscreen); Spike-ACE2 protein-protein interaction (AlphaLISA; VE) (45); TruHit (VE) counterscreen; angiotensin-converting enzyme 2 (ACE2; VE) inhibition; and 3C-like proteinase (3CL or Mpro; VR) inhibition (46). These models use chemical structures (or drug names; or PubChem CIDs) as input; a similarity search retrieves similar compounds in the NCATS dataset, and sorts them according to the Tanimoto similarity score. In addition to anti-SARS-CoV-2 activities, the top 10 most similar entries compared to the query molecule are displayed. Promising compounds are the ones that are (i) active in the CPE but inactive in cytotoxicity LVI models; (ii) active in the Spike-ACE2 (AlphaLISA) model and inactive in both the TruHit and ACE2 counterscreen VE models; or (iii) active in the 3CL (VR) model; or any combination of the above. We are committed to updating the current models periodically and to building additional models to represent more assays as new data become available in the literature. Initially, for each assay type, ML models based on each descriptor category (fingerprint, pharmacophore and physicochemical) were developed by employing 22 different ML algorithms from scikit-learn (47). The best performing model from each descriptor type was used to build consensus models.
Finally, the best performing models according to their performance on the validation and test sets (15% of the initial set, each) were picked and implemented in the REDIAL-2020 prediction server. Against three different external sets, these models exhibited predictivity in the range of 60-75%. An in-depth discussion of the models, their training procedures, performance, external predictivity and implementation is provided elsewhere (14). Based on the same concept as the L1000 gene perturbation profile similarity, which was implemented in DrugCentral 2018 (11), REDIAL-2020 serves a complementary need, i.e. the search for drugs effective against COVID-19, as opposed to the evidence-based (factual) DrugCentral system. Both aim to support the process of drug discovery and repositioning.
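Two pieces of the REDIAL-2020 workflow described above are easy to illustrate: ranking library compounds by Tanimoto similarity to a query, and combining the six model outputs into the 'promising compound' rule. The sketch below is not the REDIAL-2020 code. It assumes fingerprints are already available as sets of 'on' bit indices (the server derives them from chemical structures), and the dictionary keys for the six models are hypothetical names chosen for this example.

```python
from typing import Dict, FrozenSet, List, Tuple


def tanimoto(a: FrozenSet[int], b: FrozenSet[int]) -> float:
    """Tanimoto coefficient between two fingerprints given as sets of on-bits."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def top_similar(query: FrozenSet[int],
                library: Dict[str, FrozenSet[int]],
                k: int = 10) -> List[Tuple[str, float]]:
    """Return the k library entries most similar to the query, best first."""
    scored = [(name, tanimoto(query, fp)) for name, fp in library.items()]
    return sorted(scored, key=lambda item: item[1], reverse=True)[:k]


def is_promising(active: Dict[str, bool]) -> bool:
    """Apply rules (i)-(iii) from the text to per-model activity calls."""
    lvi = active["cpe"] and not active["cytotox"]
    ve = active["alphalisa"] and not active["truhit"] and not active["ace2"]
    vr = active["3cl"]
    return lvi or ve or vr
```

The counterscreens act as vetoes within their own assay group only: a compound that is cytotoxic fails rule (i) but can still qualify through rule (ii) or (iii).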

Drugs in the news
Given the lack of approved therapeutic options, the COVID-19 pandemic has heightened the interest in approved medicines that are suitable for drug repositioning. A number of them have been used off-label in COVID-19 patients, and are therefore of interest to the community at large. Assessments of evidence for COVID-19-related treatments are frequently updated by the American Society of Health-System Pharmacists, ASHP (https://bit.ly/3mvXCQX). Reflecting heightened interest in COVID-19, the front page of DrugCentral 2021 now includes a list of drugs that are 'in the news' (Figure 3). The current list includes favipiravir, which is not available in the US, but is approved as Avigan in Japan and Russia and emergency approved in Italy (48), and remdesivir, which was granted emergency authorization in Japan and was FDA-approved as Veklury (https://bit.ly/33zA8Su), among other drugs.

SUMMARY AND FUTURE DIRECTIONS
DrugCentral 2021 is up-to-date with drug marketing approvals and patent/exclusivity annotations up to 31 March 2020 and 23 June 2020, respectively. We incorporated ADMET-PK data and sex-based adverse events from FAERS, in addition to an anti-SARS-CoV-2 activities prediction server. At its core, DrugCentral continues to index {pharmaceutical formulation--drug--drug target--disease} associations, although a significant number of additional attributes have been added to facilitate drug discovery and repositioning. We will continue to incorporate new drugs as soon as regulatory approvals are published. Drugs withdrawn for reasons other than safety will be flagged in the OFM category, and all other drugs will be annually updated with respect to their marketing/patent/exclusivity status (13) in order to maintain easily accessible lists for drug repositioning. The FAERS interface will be streamlined to highlight sex differences in the drug safety profiles of existing drugs. Within the next six months, we plan to launch a chemical substructure and similarity search functionality. Last but not least, we have performed an extensive curation of veterinary drugs, which will be annotated in the next major DrugCentral release.

Web interface
The DrugCentral web interface has been updated since the 2018 release to integrate novel data types and functionalities. The 'Drugs in the news' section will be updated monthly, by monitoring drugs that are widely associated with current events.

Download
DrugCentral data can be downloaded in PostgreSQL format (full database dump available) for advanced data query, export and integration. User interaction with the local instance is facilitated through structured query language (SQL) examples as previously available, together with downloads of the chemical structures of the drugs in several formats (e.g. SDF, InChI and SMILES) and drug bioactivity profiles in tabular format. The database is also available via a Docker container (https://dockr.ly/35G46a6) and as a public instance (drugcentral:unmtid-dbs.net:5433). A Python API is also available (https://bit.ly/2RAHRtV).
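As a hedged illustration of how the public instance might be reached from Python, the sketch below turns the instance string quoted above into a PostgreSQL connection string. Several points are assumptions rather than facts from this paper: the string is read as database:host:port, the helper names are hypothetical, the user name and password are placeholders that must be obtained from the DrugCentral download page, and the third-party psycopg2 driver is assumed to be installed before `connect_and_ping` is called.

```python
from typing import Dict, Union


def parse_instance(spec: str) -> Dict[str, Union[str, int]]:
    """Split 'database:host:port' (assumed layout) into named parts."""
    dbname, host, port = spec.split(":")
    return {"dbname": dbname, "host": host, "port": int(port)}


def build_dsn(spec: str, user: str, password: str) -> str:
    """Build a libpq-style connection string for the given instance."""
    p = parse_instance(spec)
    return (f"host={p['host']} port={p['port']} dbname={p['dbname']} "
            f"user={user} password={password}")


def connect_and_ping(dsn: str) -> str:
    """Open a connection and return the server version string."""
    import psycopg2  # third-party driver, imported lazily so parsing works without it
    conn = psycopg2.connect(dsn)
    try:
        with conn.cursor() as cur:
            cur.execute("SELECT version()")
            return cur.fetchone()[0]
    finally:
        conn.close()


# Placeholder credentials: replace with the values published by DrugCentral.
dsn = build_dsn("drugcentral:unmtid-dbs.net:5433", "USER", "PASSWORD")
```

The example query deliberately avoids naming any DrugCentral table, since the schema is documented with the database dump rather than in this section.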

SUPPLEMENTARY DATA
Supplementary Data are available at NAR Online.