CoV3D: a database of high resolution coronavirus protein structures

Abstract
SARS-CoV-2, the etiologic agent of COVID-19, exemplifies the general threat to global health posed by coronaviruses. The urgent need for effective vaccines and therapies is leading to a rapid rise in the number of high resolution structures of SARS-CoV-2 proteins that collectively reveal a map of virus vulnerabilities. To assist structure-based design of vaccines and therapeutics against SARS-CoV-2 and other coronaviruses, we have developed CoV3D, a database and resource for coronavirus protein structures, which is updated on a weekly basis. CoV3D provides users with comprehensive sets of structures of coronavirus proteins and their complexes with antibodies, receptors, and small molecules. Integrated molecular viewers allow users to visualize structures of the spike glycoprotein, which is the major target of neutralizing antibodies and vaccine design efforts, as well as sets of spike-antibody complexes, spike sequence variability, and known polymorphisms. In order to aid structure-based design and analysis of the spike glycoprotein, CoV3D permits visualization and download of spike structures with modeled N-glycosylation at known glycan sites, and contains structure-based classification of spike conformations, generated by unsupervised clustering. CoV3D can serve the research community as a centralized reference and resource for spike and other coronavirus protein structures, and is available at: https://cov3d.ibbr.umd.edu.


INTRODUCTION
Coronaviruses have been responsible for several outbreaks over the past two decades, including SARS-CoV in 2002-2003, MERS-CoV in 2012 (de Wit et al., 2016), and the current COVID-19 pandemic, caused by SARS-CoV-2, which began in late 2019 (Tse et al., 2020). The scale of the pandemic has led to unprecedented efforts by the research community to rapidly identify and test therapeutics and vaccines, and to understand the molecular basis of SARS-CoV-2 entry, pathogenesis, and immune targeting.
Between February and April 2020, a large number of structures were generated and deposited in the Protein Data Bank (PDB) (Rose et al., 2011): 12 spike glycoprotein structures, over 100 main protease structures, and 23 structures of SARS-CoV-2 non-structural proteins (NSPs). These high resolution protein structures are of immense importance for understanding viral assembly and for rational vaccine design. The first structures of SARS-CoV-2 trimeric spike glycoproteins were reported in February and early March 2020 (Walls et al., 2020; Wrapp et al., 2020). The spike glycoprotein is the major target of SARS-CoV-2 vaccines and antibody therapeutics, and previously determined spike glycoprotein structures have enabled advances including rational design of CoV spike glycoproteins with proline substitutions to stabilize the prefusion conformation, yielding improved protein expression and immunogenicity (Pallesen et al., 2017). Given that the rapid rate of coronavirus protein structure determination and deposition is likely to continue, and given the importance of these structures to the research community, a simple and updated resource detailing them would provide a useful reference.
Here we describe a new database of experimentally determined coronavirus protein structures, CoV3D.
CoV3D is updated automatically on a weekly basis, as new structures are released in the PDB. Structures are classified by CoV protein, as well as by bound molecule, such as monoclonal antibody, receptor, or small molecule ligand. To enable insights into the spike glycoprotein, we also include information on SARS-CoV-2 residue polymorphisms, sequence diversity across betacoronaviruses mapped onto spike glycoprotein structures, and structures of spike glycoproteins with modeled glycosylation. This resource can enable efforts in rational vaccine design, targeting by immunotherapies, biologics, and small molecules, in addition to basic research into coronavirus structure and recognition, and is publicly available at https://cov3d.ibbr.umd.edu.

MATERIALS AND METHODS
CoV3D is implemented using the Flask web framework (https://flask.palletsprojects.com/) and the SQLite database engine (https://www.sqlite.org/). Structures are identified from the PDB on a weekly basis using NCBI BLAST command line tools (Camacho et al., 2009). Structural visualization is performed using NGL viewer (Rose and Hildebrand, 2015). SARS-CoV-2 spike glycoprotein sequences and sequence information were downloaded from NCBI Virus (Hatcher et al., 2017), followed by filtering out sequences with missing residues. Sequence polymorphism information was obtained by BLAST search using a reference SARS-CoV-2 spike glycoprotein sequence (QHD43416.1). To develop spike glycoprotein alignments, betacoronavirus spike glycoprotein sequences were downloaded from NCBI Virus and aligned with Clustal Omega (Sievers et al., 2011) in SeaView (Gouy et al., 2010). Sequences that were redundant (>95% similarity) or contained missing residues were removed, with the remaining 70 sequences forming the pan-betacoronavirus alignment. A subset of 18 sequences from the pan-betacoronavirus alignment was used to generate the SARS-like sequence alignment, which contains every sequence from the pan-betacoronavirus alignment with >70% sequence similarity to the SARS-CoV-2 spike. Sequence logos are generated dynamically for user-specified residue ranges using the command-line version of WebLogo (Crooks et al., 2004). N-glycans are modeled onto spike glycoprotein structures using a glycan modeling and refinement protocol in Rosetta (Labonte et al., 2017). An example command line and Rosetta Script for this glycan modeling protocol are provided as Supplemental Information.
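The redundancy filtering step described above can be illustrated with a short sketch. This is not the procedure used to build CoV3D, whose details are not given here; it assumes a greedy pass over pre-aligned sequences and measures similarity as percent identity over ungapped columns. The function names and the toy alignment are hypothetical.

```python
from typing import Dict, List


def pairwise_identity(a: str, b: str) -> float:
    """Fraction of identical residues over columns where neither sequence has a gap."""
    if len(a) != len(b):
        raise ValueError("sequences must come from the same alignment")
    compared = 0
    matches = 0
    for x, y in zip(a, b):
        if x == "-" or y == "-":
            continue  # skip gapped columns
        compared += 1
        if x == y:
            matches += 1
    return matches / compared if compared else 0.0


def filter_redundant(aligned: Dict[str, str], threshold: float = 0.95) -> List[str]:
    """Greedy filter: keep a sequence only if its identity to every
    previously kept sequence does not exceed the threshold."""
    kept: List[str] = []
    for name, seq in aligned.items():
        if all(pairwise_identity(seq, aligned[k]) <= threshold for k in kept):
            kept.append(name)
    return kept


# Toy alignment (assumed data, not real spike sequences).
toy = {
    "seqA": "ACDEFGHIKLMNPQRSTVWY",
    "seqB": "ACDEFGHIKLMNPQRSTVWY",  # identical to seqA, so redundant
    "seqC": "ACDEFGHIKLAAAAASTVWY",  # 75% identical to seqA, so retained
}
representatives = filter_redundant(toy)
```

A greedy filter depends on input order: the first member of each redundant group is the one retained, so a sketch like this would need a defined ordering (for example, reference sequences first) to give reproducible representatives.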

Database Contents
The main components of the CoV3D database are interrelated tables, datasets and tools for coronavirus protein structures and spike glycoprotein sequences. The structure portion of the database includes dedicated pages and tables for:
• Spike glycoprotein structures

Example use case: viewing and conservation of RBD site
The spike glycoprotein structure table on CoV3D includes structures of the SARS-CoV-2 spike in complex with antibodies and the receptor ACE2. One of these is the structure of SARS-CoV-2 spike receptor binding domain (RBD) in complex with human antibody CR3022 (PDB code 6W41) (Yuan et al., 2020), which can be visualized in the browser using the "View" link in the table (Figure 1A).
Inspection of this complex shows one contiguous region of the spike RBD that makes multiple contacts with the CR3022 antibody (residues 375-390; circled in Figure 1A). CR3022-contacting residues include K378 (lysine), P384 (proline), and K386 (lysine). The sequence logo generator on the CoV3D site can then be used to generate logos representing that sequence range for SARS-like coronaviruses (Figure 1B) as well as a broader set of betacoronaviruses, including SARS-CoV, SARS-CoV-2, and MERS-CoV (Figure 1C). The SARS-like logo highlights the high conservation of this region of the spike glycoprotein, providing a mechanistic basis for the cross-reactive binding exhibited by CR3022 for SARS-CoV and SARS-CoV-2 RBDs (Yuan et al., 2020). However, aside from the cysteine residue at position 379, there is much lower conservation of this region across the broader set of betacoronaviruses (Figure 1C).
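The conservation that a sequence logo conveys can also be expressed numerically for each alignment column. The sketch below assumes the standard Shannon information measure for a 20-letter protein alphabet (maximum log2(20), about 4.32 bits) and omits any small-sample correction. It is an illustration only, not the WebLogo implementation, and the function names and toy alignment are hypothetical.

```python
import math
from collections import Counter
from typing import List


def column_information(column: str, alphabet_size: int = 20) -> float:
    """Information content in bits for one alignment column, ignoring gaps."""
    residues = [c for c in column if c != "-"]
    if not residues:
        return 0.0
    n = len(residues)
    counts = Counter(residues)
    entropy = -sum((k / n) * math.log2(k / n) for k in counts.values())
    return math.log2(alphabet_size) - entropy


def range_information(aligned_seqs: List[str], start: int, end: int) -> List[float]:
    """Per-column information for 1-based, inclusive alignment columns start..end."""
    return [
        column_information("".join(seq[i] for seq in aligned_seqs))
        for i in range(start - 1, end)
    ]


# Toy three-column alignment (assumed data): the middle column is invariant.
toy_alignment = ["KCP", "KCA", "RCG"]
bits = range_information(toy_alignment, 1, 3)
```

In this toy case the invariant cysteine column reaches the maximum information value, while the third column, with three different residues, scores lowest; this mirrors how a fully conserved position stands out against a variable background in a logo.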

Example use case: location of polymorphisms
One useful feature of CoV3D is readily available information on identified SARS-CoV-2 spike polymorphisms, and their mapping onto the spike glycoprotein structures. Under "Sequences", users can navigate to "Spike Polymorphisms" where a table is shown with observed single and multiple substitutions in the spike glycoprotein, along with the counts of sequences containing them. The D614G variant is currently the most prevalent, with more occurrences than the reference sequence represented in spike glycoprotein structures (923 sequences, versus 637 sequences). The position of this substitution on the spike glycoprotein can provide some indication of its possible structural and functional impact. By clicking "view" next to this substitution, users can visualize the position of this substitution, showing that it is located on the spike surface, and is not located within or adjacent to the receptor binding domain (RBD; Figure 2A). However, this site is closer to the RBD than the site of another less prevalent SARS-CoV-2 spike variant (T791I; Figure 2B).
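The kind of tally shown in the polymorphism table can be sketched as follows. CoV3D derives polymorphism information from BLAST searches against the reference sequence; the example below instead assumes full-length sequences of the same length as the reference, compared position by position, and uses a short toy reference rather than the real spike sequence. All function names are hypothetical.

```python
from collections import Counter
from typing import Iterable, Tuple


def substitutions(reference: str, sequence: str) -> Tuple[str, ...]:
    """Substitutions relative to the reference, written like 'D614G'
    (reference residue, 1-based position, observed residue)."""
    if len(reference) != len(sequence):
        raise ValueError("this sketch assumes equal-length, ungapped sequences")
    return tuple(
        f"{ref}{pos}{obs}"
        for pos, (ref, obs) in enumerate(zip(reference, sequence), start=1)
        if ref != obs
    )


def count_polymorphisms(reference: str, sequences: Iterable[str]) -> Counter:
    """Count sequences per distinct set of substitutions (single or multiple)."""
    counts: Counter = Counter()
    for seq in sequences:
        subs = substitutions(reference, seq)
        counts[",".join(subs) if subs else "reference"] += 1
    return counts


# Toy data (assumed): a four-residue "reference" and four observed sequences.
toy_reference = "MDKT"
toy_sequences = ["MDKT", "MGKT", "MGKT", "MGKI"]
table = count_polymorphisms(toy_reference, toy_sequences)
```

Keying the counts on the full set of substitutions, rather than on each substitution separately, keeps single and multiple substitutions as distinct rows, which matches the table layout described above.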

Example use case: visualizing oligomannose glycosylation
N-glycosylation of viral glycoproteins can play a key role by masking the glycoprotein from the immune system, or by affecting function. Experimentally determined protein structures often lack full N-glycans due to limitations from resolution or intrinsic glycan dynamics or heterogeneity. To enable visualization and additional analysis or modeling of glycosylated spike glycoproteins, CoV3D includes sets of structures with modeled N-glycans at all predicted glycosylation sites, with N-glycans built onto the glycoprotein structures and refined using a protocol in Rosetta (Labonte et al., 2017). Examples of glycosylated structures that can be visualized in CoV3D are shown in Figure 3, and these can be downloaded directly by users for further processing. This permits users to view features such as the amount of N-glycosylation present on the ACE2 surface (Figure 3A) and the relatively high glycosylation of the spike glycoprotein base (bottom in Figure 3B) versus the RBD. Presently, these structures include oligomannose glycans that were found to be prevalent on the SARS-CoV spike based on previous mass spectrometry analysis (Ritchie et al., 2010), with five branched mannose sugars. In the future, we plan to include more glycosylated structures, with additional options for glycan sizes and types.
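How the predicted glycosylation sites were selected is not detailed in this section. As a generic illustration, candidate N-glycosylation sites are commonly identified by scanning a sequence for the canonical sequon N-X-S/T, where X is any residue except proline. The sketch below does only that; the function name and the toy sequence are hypothetical, and the result is a list of candidate positions, not a statement about which spike sites are glycosylated.

```python
import re
from typing import List

# Asparagine followed by any residue except proline, then serine or threonine.
# The lookahead lets overlapping sequons be reported individually.
SEQUON = re.compile(r"N(?=[^P][ST])")


def find_sequons(sequence: str) -> List[int]:
    """Return 1-based positions of asparagines in N-X-S/T sequons (X != P)."""
    return [match.start() + 1 for match in SEQUON.finditer(sequence.upper())]


# Toy sequence (assumed data): NIT and NGS are sequons, NPS is not.
toy_sequence = "MNITAANPSNGS"
sites = find_sequons(toy_sequence)
```

A sequon is necessary but not sufficient for glycosylation, so positions found this way would normally be checked against experimental glycan data before glycans are modeled onto a structure.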

DISCUSSION
We have constructed the CoV3D database as a reference for the research community, providing a simple and updated interface to high resolution coronavirus 3D structures and sequence variability. This will allow researchers to identify and classify new coronavirus protein structures as they are released, particularly for SARS-CoV-2, and it can enable insights into sequence features, polymorphisms, and glycosylation. A recent study combining coronavirus structural and sequence analysis revealed insights regarding the determinants of ACE2 recognition (Wan et al., 2020), and CoV3D can enable prospective comparative studies, in addition to modeling and structure-based design efforts. This database will likely be of interest to virologists, computational biologists, immunologists, and those interested in learning about and targeting SARS-CoV-2 proteins with small molecules and antibodies, as well as those engaged in vaccine design.