A Comprehensive, Flexible Collection of SARS-CoV-2 Coding Regions

The world is facing a global pandemic of COVID-19 caused by the SARS-CoV-2 coronavirus. Here we describe a collection of codon-optimized coding sequences for SARS-CoV-2 cloned into Gateway-compatible entry vectors, which enable rapid transfer into a variety of expression and tagging vectors. The collection is freely available. We hope that widespread availability of this SARS-CoV-2 resource will enable many subsequent molecular studies to better understand the viral life cycle and how to block it.

SARS-CoV-2 coding sequence collection Gateway-compatible TEV (tobacco etch virus) sequence
A global pandemic of the coronavirus disease COVID-19, a severe respiratory illness caused by a novel virus from the family Coronaviridae (SARS-CoV-2), has infected millions and caused hundreds of thousands of deaths (World Health Organization 2020a). COVID-19 manifestation in patients can range from a lack of symptoms to severe pneumonia and death (Huang et al. 2020). Person-to-person spread through respiratory droplets has been identified as a major source of transmission of the virus. Various measures, from social distancing to nationwide lockdowns, have been imposed to contain and control the transmission of SARS-CoV-2 (Cohen and Kupferschmidt 2020). Despite these measures, the number of confirmed COVID-19 cases has continued to rise (World Health Organization 2020a), highlighting the need for an effective vaccine and antiviral agents. Furthermore, the extrapolations concerning the evolution of the pandemic are particularly alarming (Ferguson et al. 2020). It is therefore of intense and pressing interest to better understand this virus and its interaction with host cells on a molecular level.
ORF1AB, a large polyprotein which is post-translationally processed into 16 proteins (Chan et al. 2020; Wu et al. 2020). Progress on molecular characterization has been made on several viral proteins (Walls et al. 2020; Zhang et al. 2020), providing valuable insights into host-virus interaction, but more research is necessary. The Gateway system offers efficient and high-throughput transfer of the viral coding sequences (CDSs) into a large selection of Gateway-compatible destination vectors used for protein expression in many biological systems, e.g., Escherichia coli, Saccharomyces cerevisiae, insect, or mammalian cells (Walhout et al. 2000). Broad availability of a collection of SARS-CoV-2 CDSs has the potential to enable many downstream biochemical and structural studies, including scalable assays for screening drug candidates, and thus a better understanding of processes within the viral life cycle and of how those processes might be disrupted.

Synthesis of viral coding sequences
Based on the published annotation of the genome sequence of the HKU-SZ-005b (GenBank MN975262; Chan et al. 2020) and Wuhan-Hu-1 (GenBank MN908947; Wu et al. 2020) isolates of SARS-CoV-2, we requested the synthesis of viral coding sequences (GenScript and Integrated DNA Technologies), including termination codons and attB recombination sequences, with optimization of codon usage to reduce GC content and optimize expression in human and insect cells. A start codon was added to NSP2-16 to allow independent transcription and translation, as the endogenous products are derived from ORF1AB by post-translational processing. ORF9Bwu, an alternative ORF within the N gene from SARS-CoV-2 (Wu et al. 2020), was subsequently amplified by polymerase chain reaction (PCR) from the viral N gene with primers listed in Table S1.
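The preparation of a mature NSP coding sequence for independent expression can be illustrated with a minimal Python sketch. It is not the authors' pipeline: the function names, the choice of TAA as the appended stop codon, and the example sequence are assumptions made for illustration only, and no codon optimization is attempted.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def gc_content(seq: str) -> float:
    """Return the fraction of G and C bases in a DNA sequence."""
    seq = seq.upper()
    if not seq:
        raise ValueError("sequence must not be empty")
    return (seq.count("G") + seq.count("C")) / len(seq)


def add_start_and_stop(mature_cds: str, stop: str = "TAA") -> str:
    """Prepend ATG and append a stop codon to a mature-protein CDS.

    Assumption: the input encodes a product that is normally released
    from a polyprotein, so it carries no initiator codon of its own.
    The default stop codon (TAA) is an arbitrary illustrative choice.
    """
    cds = mature_cds.upper()
    if len(cds) % 3 != 0:
        raise ValueError("CDS length must be a multiple of 3")
    if stop not in STOP_CODONS:
        raise ValueError("stop must be one of TAA, TAG, TGA")
    # Avoid adding a second stop codon if one is already present.
    if cds[-3:] not in STOP_CODONS:
        cds += stop
    return "ATG" + cds


if __name__ == "__main__":
    example = "GCTTACACC"  # hypothetical 3-codon fragment, not a real NSP
    print(add_start_and_stop(example))
    print(round(gc_content(example), 2))
```

Comparing `gc_content` before and after a codon-usage change gives a quick check that an optimized sequence moved in the intended direction.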
Generation of Gateway-compatible viral coding sequence clone collections
Synthesized viral coding sequences were incorporated into Gateway Entry plasmids: either pDONR207 (Invitrogen Cat #12213013) or pDONR223 (Rual et al. 2004). To enable C-terminal fusion constructs, we also generated an equivalent set of Gateway-compatible clones without termination codons. These clones were made either by PCR-amplifying the whole plasmid with primers that eliminated the stop codon, or by amplifying CDS regions from the first collection, using downstream primers with complementary regions that were internal to each stop codon, and which simultaneously incorporated the flanking sequences necessary for incorporation into a Gateway Entry plasmid [pDONR207, pDONR221 (Invitrogen Cat #12536017) or pDONR223]. Expression clones with N-terminal fusion tags (e.g., for purification) can be produced simply by preparing the appropriate Gateway-compatible Destination vector. However, to enable the subsequent removal of such N-terminal fusion tags, we generated an additional set of clones containing, at the N-terminus of the ORF, a recognition sequence for nuclear inclusion protease from tobacco etch virus (TEV). TEV sequences were incorporated by amplifying CDS regions from the first collection using forward primers that also provide the TEV sequence, paired with the original reverse primers.
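The relationship between the three clone sets (with stop codon, without stop codon, and with an N-terminal TEV site) can be sketched at the sequence level as follows. This is an illustrative sketch only: the function names are hypothetical, the nucleotide encoding of the TEV recognition site (ENLYFQG) is one of many synonymous possibilities and is not taken from Table S1, and the attB flanks and primer design are not modeled.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}

# One possible encoding of the TEV recognition site E-N-L-Y-F-Q-G.
# Assumption: the codons actually used in the collection may differ.
TEV_SITE_DNA = "GAAAACCTGTATTTTCAGGGC"


def remove_stop(cds: str) -> str:
    """Return the CDS without its terminal stop codon (for C-terminal fusions)."""
    cds = cds.upper()
    if len(cds) % 3 != 0:
        raise ValueError("CDS length must be a multiple of 3")
    if cds[-3:] in STOP_CODONS:
        return cds[:-3]
    return cds


def add_n_terminal_tev(cds: str) -> str:
    """Prepend a TEV recognition site, in frame, to the 5' end of the CDS."""
    cds = cds.upper()
    if len(cds) % 3 != 0:
        raise ValueError("CDS length must be a multiple of 3")
    return TEV_SITE_DNA + cds


if __name__ == "__main__":
    with_stop = "ATGGCTTACTAA"  # hypothetical toy CDS
    print(remove_stop(with_stop))         # no-stop variant
    print(add_n_terminal_tev(with_stop))  # TEV-tagged variant
```

Because the TEV site is 21 nt long (seven codons), prepending it preserves the reading frame of the downstream CDS.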
Each SARS-CoV-2 CDS bacterial clone (DH5α E. coli strain, NEB Cat# C2987) was isolated from a single colony, and its inserted CDS was confirmed by full-length Sanger sequencing (The Centre for Applied Genomics, Toronto, Canada). All clones with a pDONR221 or pDONR223 backbone were sequenced with M13F and M13R primers. Clones with a pDONR207 backbone were sequenced with customized forward and reverse primers. All primer sequences are available in Table S1.
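As a rough illustration of insert confirmation, the sketch below checks whether an expected CDS occurs exactly within an assembled sequencing consensus, in either orientation. It is a simplification and not the authors' procedure: real Sanger data require base calling, trimming, and alignment that tolerates ambiguous bases, and the function names and sequences here are hypothetical.

```python
# Translation table for complementing unambiguous DNA bases (N maps to N).
_COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    return seq.upper().translate(_COMPLEMENT)[::-1]


def insert_confirmed(expected_cds: str, consensus_read: str) -> bool:
    """True if the expected CDS is found exactly in the read or its reverse complement."""
    expected = expected_cds.upper()
    read = consensus_read.upper()
    return expected in read or expected in reverse_complement(read)


if __name__ == "__main__":
    expected = "ATGGCTTAC"            # hypothetical toy CDS
    consensus = "ccATGGCTTACggttt"    # hypothetical consensus with vector flanks
    print(insert_confirmed(expected, consensus))
```

An exact-match test of this kind is deliberately strict: any single-base difference returns False, which then calls for inspecting the alignment and the underlying trace.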

Data availability
Clones are available through Addgene. Table S1 contains all primers used. Table S2 contains detailed descriptions of clones in the collection and links to the clone resource available from Addgene. Supplemental material available at figshare: https://doi.org/10.25387/g3.12725096.

RESULTS AND DISCUSSION
A total of 98 clones (Table 1) are currently included in the Gateway-compatible collection, covering 28 out of 29 total annotated CDSs in the SARS-CoV-2 genome. NSP11 was omitted due to the incompatibility of its 36 base pair length with the Gateway cloning system (Cheo et al. 2004). All 28 of these CDS regions are available as clones with and without termination codons. The 'no-stop' collection was further extended to include six clones encoding different cleaved products of the spike (S) protein ("S-fragment" 1-6). We also included two CDS variants with in-frame deletions ("S-24nt" and "E-27nt") and one truncated CDS variant ("ORF8B-truncated"), each of which was detected by recent viral transcriptome mapping efforts (Davidson et al. 2020; Kim et al. 2020), and two missense catalytic variants (NSP3 C857A and NSP5 C146A; Gordon et al. 2020).
Although our collection facilitates tagging of SARS-CoV-2 proteins for various functional studies, certain applications require removal of tags at some stage, for example, after protein purification. Fusion tags can potentially interfere with the yield, structure, and function of purified proteins, such as during large-scale production and crystallography studies (Booth et al. 2018). To address this, we expanded our collection to include clones containing an N-terminal recognition sequence for the nuclear inclusion protease from tobacco etch virus (TEV; Carrington and Dougherty 1987; Carrington and Dougherty 1988). TEV protease is one of the best characterized and most widely used endoproteolytic reagents due to its stringent sequence specificity, ease of production, and ability to tolerate a variety of residues at the P1' position of its recognition site (Waugh 2011). We note that our clones are not expression vectors in and of themselves, and we have not yet assessed the expression of any of our clones after moving to a Gateway Destination expression vector. However, our Gateway-compatible collection allows users the flexibility to conveniently move any of the SARS-CoV-2 ORFs into any Gateway Destination expression vector with any preferred N-terminal or C-terminal fusion.
To promote open-access dissemination of the collection, all clones have been deposited with the non-profit organization Addgene (Kamens 2015) and, where Addgene cannot be used, are freely available from the authors. Table S2 summarizes all CDSs in the collection, together with their nucleotide sequences, nucleotide and amino acid lengths, and links for ordering clones.
We hope that this SARS-CoV-2 CDS-clone collection will be a valuable resource for many applications, including study of how coronaviruses can exploit host cellular processes for the viral replication cycle (de Wilde et al. 2018), understanding virus-host protein-protein interactions (Gordon et al. 2020; Lasso et al. 2019), production of recombinant virus proteins for structural studies (Edavettal et al. 2012), mapping of protein subcellular localization using N-terminal fluorescent reporters (Tanz et al. 2013), or development of vaccines or other therapeutics (Jing et al. 2012; McDonald et al. 2007).