Potential Pathogenicity Determinants Identified from Structural Proteomics of SARS-CoV and SARS-CoV-2

Abstract Despite SARS-CoV and SARS-CoV-2 being equipped with highly similar protein arsenals, the corresponding zoonoses have spread among humans at extremely different rates. The specific characteristics of these viruses that led to such distinct outcomes remain unclear. Here, we apply proteome-wide comparative structural analysis aiming to identify the unique molecular elements in the SARS-CoV-2 proteome that may explain the differing consequences. By combining protein modeling and molecular dynamics simulations, we suggest nonconservative substitutions in functional regions of the spike glycoprotein (S), nsp1, and nsp3 that are contributing to differences in virulence. Particularly, we explain why the substitutions at the receptor-binding domain of S affect the structure–dynamics behavior in complexes with putative host receptors. Conservation of functional protein regions within the two taxa is also noteworthy. We suggest that the highly conserved main protease, nsp5, of SARS-CoV and SARS-CoV-2 is part of their mechanism of circumventing the host interferon antiviral response. Overall, most substitutions occur on the protein surfaces and may be modulating their antigenic properties and interactions with other macromolecules. Our results imply that the striking difference in the pervasiveness of SARS-CoV-2 and SARS-CoV among humans seems to significantly derive from molecular features that modulate the efficiency of viral particles in entering the host cells and blocking the host immune response.


Supplementary Figures
SARS-CoV-2 spike glycoproteins) and the putative receptors ACE2 and ACE. A contact was considered for pairs of residues whose Cα atoms are less than 8 Å apart. The line colors indicate the different independent simulations (three or five for each system), and the purple "X" indicates the number of contacts in the initial conformation.

Phylogenetic tree of all known betacoronaviruses using the full protein sequence of 3CLpro. Red text indicates viruses of concern for human health. The amino acid changes at site 285, which may affect dimerization, are shown along the branches. An alanine at this site defines the SARS-CoV-2 clade with the horseshoe bats from mainland China.

Fig. S4. Ensemble workflow for structure prediction of SARS-CoV-2 nsp3. Case-by-case protocols of structure prediction are determined by finely parsing each protein sequence using information about the position of intrinsically disordered regions (IDR), transmembrane regions (TM), signal peptides, and templates (A). The method applied to SARS-CoV-2 nsp3 defined six regions to be modeled by the local modeling (LM), fragment-based (FB), and/or ab initio (AB) approaches. The first region, 2-169, of the nsp3 sequence was modeled using the combination LM+FB (B) so that the structured region (5-111) could be determined at high resolution via LM of the side chains using a highly similar template (PDB 2gri, identity 76%). The bound conformation of intrinsically disordered segments was predicted using FB, a more flexible method. The choice of prediction method for protein regions with templates of high identity includes a thorough structural analysis of the template. For example, the region 413-676 of nsp3 aligns with a high-identity template (2w2g, 76%), so modeling only the variant side chains was considered.
However, due to the predicted loss of a disulfide bridge, the FB approach was used since it allows larger conformational changes (C).

2002), whereas ACE is found at moderate to high levels; neither was quantitative and both were based on small sample sizes. Other early reports of high expression of ACE2 in the lung were based on protein expression using a peptide-derived polyclonal antibody and immunohistochemistry (Hamming et al. 2004; Ren et al. 2006). Aside from being non-quantitative, the results are difficult to interpret and could be due to the non-specificity of these types of antibodies. The three remaining reports of ACE2 expression in the lung or lung-derived epithelium were based on improved antibody immunohistochemistry

used to identify ACE2 as the SARS-CoV receptor, ACE was not overexpressed as was done with ACE2. A subsequent study using pseudotyped virus demonstrated a correlation between ACE2 expression and infectivity in several cell lines and, although the association with ACE2 was much higher, overexpression of ACE was shown to increase infectivity in some cell lines (Nie et al. 2004). Notably, ACE2-mediated infection was highest in kidney- and colon-derived cell lines and much less efficient in those from the lung. This same pattern in kidney-, colon-, and lung-derived cell lines was confirmed in a more recent analysis of beta-coronaviruses (Letko et al. 2020). We note that the several studies that focused on SARS-CoV infection mediated by ACE2 were conducted using kidney- and colon-derived cell lines, which express higher levels of ACE2 than ACE. In the lung and other respiratory tissues, the reverse is true; ACE is expressed at higher levels than ACE2.
Experiments in vitro show that ACE does not bind to SARS-CoV (and perhaps SARS-CoV-2) S protein as efficiently as ACE2, but, given an environment in which levels of ACE are an order of magnitude higher than ACE2, as is the case in the lung, the potential alternative association with ACE has to be further verified.

electroneutrality of the system. All ions present in the original crystallographic structures were kept for the simulations. The system was first energy minimized for 5000 steps with steepest descent. This was followed by two phases of equilibration. In the first phase, 6 ns passed while applying positional constraints on α-carbons except for those at binding interfaces. The temperature was gradually raised to 298.15 K. In the second phase, 20 ns passed during which positional constraints were applied to the α-carbons of the C-terminal domains of ACE and ACE2 and a few residues in the core of the spike protein's receptor-binding domain. In each simulation of the triplicate, atomic velocities were reinitialized from a Maxwell-Boltzmann distribution at 10 ns, using a different random number seed in each case. Equilibration and production were run in the NPT ensemble, and a 2 fs time step was used.

non-conservative and highly concentrated in the C-terminal end of the protein (Fig. S9A). An ab initio model was generated for this region (556-633, Fig. S9B). Fewer mutations appear at the N-terminal side, and a highly conserved region appears in the middle of the protein sequence.

Ub1 has significant structural homology with Ras effector proteins, and it is thought that Ub1 may interact with and modulate the activity of Ras in the host. This interaction potentially affects growth and cell cycle cascades and could be a potential mechanism for cell mortality (Serrano et al. 2007).
Significantly, there are differences in residue identity between SARS-CoV and SARS-CoV-2 in residues 41 to 63, 83 to 87, 88 to 94, and 95 to 98 of the Ub1 region (Fig. S6A). These regions correspond to structural homology with Ras effectors, and these residue changes may affect interactions with host Ras and contribute to pathogenicity divergence (Serrano et al. 2007).

The Glu-rich acidic region is also known as the hypervariable region (HVR). It is intrinsically disordered and its interaction partners are not entirely clear. One yeast two-hybrid study demonstrated interaction with nsp6, but a GST pull-down assay also revealed interactions with nsp8, nsp9, and three regions of nsp3 itself (nucleic-acid binding domain, betacoronavirus-specific marker domain, and transmembrane region 1) (Lei et al. 2018). This region is significantly elongated relative to the SARS-CoV counterpart, with 16 additional amino acids, including several potential sites of post-translational modification.

The X domain binds ADP-ribose; the structure of this complex was recently solved for SARS-CoV-2 (PDB id 6w02), and several studies and structures of homologous complexes are available in the Protein Data Bank (PDB id: 5hol, 5dus, and 2fav). ADP-ribosylation is a type of post-translational modification (PTM), which may be implicated in inhibiting the immune response of the host through PTM of proteins related to the expression of interleukin-6 and interferon-beta (Lei et al. 2018). A GST pull-down assay showed interaction of the X domain of nsp3 with nsp12. Several non-conservative substitutions occur in this domain relative to SARS-CoV, but those are located in surface regions and do not significantly affect protein conformation (RMSD 1.5 Å, relative to 2fav).

The papain-like protease domain (papain-like protease 2; PL2pro) (Lei et al. 2018) contains two ubiquitin-binding sites.
It is implicated in suppression of the host immune response, but its targets and, more generally, which immune-related signal transduction cascades are affected are unclear. The substitution Gln977Lys in this domain likely intensifies its interaction with ISG15 by forming a salt bridge with Glu127, suggesting an important mechanism for variable virulence, as discussed in the main text (Fig. 6).

consists of a small globular protein (113 amino acid residues) with seven antiparallel β-sheets and one α-helix (Sutton et al. 2004). Parallel dimers occur between two nsp9 monomers interacting in their C-terminal α-helices and their N-fingers (Zeng et al. 2018). In SARS-CoV, the Gly100 and Gly104 residues are shown to be in the core of the dimer interface (Sutton et al. 2004). Mutation of these conserved glycines in the nsp9 α-helix impairs ssDNA binding, suggesting that dimerization is essential for nucleic acid binding (Zeng et al. 2018

Because SARS-CoV and SARS-CoV-2 sequences are highly conserved (sequence identity is 97%), much structural information from SARS-CoV is transferable to SARS-CoV-2. The three substitutions in SARS-CoV-2 nsp9 relative to SARS-CoV, Asn34Thr, Ser35Thr, and His48Leu, are located on the nsp9 surface, distant from the dimerization region (Fig. S10). In particular, the substitution Asn34Thr can result in an additional phosphorylation site, but phosphorylation of nsp9 has not yet been reported. As Thr34 and Thr35 are close to a potential ubiquitination site (Lys36), we hypothesize that their phosphorylation may prevent nsp9 ubiquitination.
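
The contact criterion given in the supplementary figure caption near the start of this section (two residues are in contact when their Cα atoms are less than 8 Å apart) can be illustrated with a minimal sketch. The function name, the coordinate layout, and the toy coordinates below are assumptions made for illustration only; they are not the analysis code or data used in this work.

```python
from math import dist

def count_ca_contacts(ca_a, ca_b, cutoff=8.0):
    """Count residue pairs (one from each chain) whose C-alpha atoms
    are closer than `cutoff` angstroms.

    ca_a, ca_b: sequences of (x, y, z) C-alpha coordinates in angstroms.
    The 8 A default follows the criterion quoted in the figure caption.
    """
    return sum(1 for p in ca_a for q in ca_b if dist(p, q) < cutoff)

# Toy coordinates (hypothetical, not taken from any structure).
spike_ca = [(0.0, 0.0, 0.0), (10.0, 0.0, 0.0)]
receptor_ca = [(0.0, 0.0, 5.0), (10.0, 0.0, 7.9), (30.0, 0.0, 0.0)]

n_contacts = count_ca_contacts(spike_ca, receptor_ca)
print(n_contacts)
```

Applied frame by frame to a trajectory, a count of this kind would give one curve per independent simulation, with the value for the starting structure serving as the reference point.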

Nsp11 is a short protein only 13 a.a. in length and is translated from both polyproteins ORF1a and ORF1ab. In SARS-CoV, nsp11 has been implicated in RNA synthesis (Su et al. 2006). SARS-CoV-2 nsp11 has 85% sequence identity to SARS-CoV nsp11, with two substitutions between them (Ser5Gln and Thr6Ser).

2, nsp15 is very conserved (89% identity). However, the recently solved crystal structure of the nsp15 monomer shows a significant conformational variation in the region of the catalytic site. Although several non-conservative substitutions occur around this region, the conformation of the catalytic loop (a.a. 234-249) is known to greatly change upon protein oligomerization (Fig. S13).

CoV-2 nsp16. Non-conservative substitutions relative to SARS-CoV nsp16 are depicted in orange and key functional residues are depicted in green. PDB id 3r24 was used as the template.
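
The nsp11 figures quoted above (13 residues, two substitutions, 85% identity) are consistent with a simple count of identical positions: 11 of 13 residues match, which is about 84.6%. The sketch below shows that arithmetic for two pre-aligned sequences of equal length. The function name is an assumption, and the two strings are placeholders that only mark substitutions at positions 5 and 6; they are not the real nsp11 sequences.

```python
def percent_identity(seq_a, seq_b):
    """Percent of identical positions between two pre-aligned,
    equal-length, gap-free sequences."""
    if len(seq_a) != len(seq_b) or not seq_a:
        raise ValueError("sequences must be non-empty and of equal length")
    matches = sum(a == b for a, b in zip(seq_a, seq_b))
    return 100.0 * matches / len(seq_a)

# Placeholder 13-residue strings (hypothetical): positions 5 and 6 differ.
reference = "AAAAAAAAAAAAA"
variant = "AAAAGGAAAAAAA"

identity = percent_identity(reference, variant)
print(f"{identity:.1f}% identity")
```

Note that this simple definition ignores gaps; reported identities for proteins with insertions or deletions depend on how gapped columns are counted.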

The nonstructural protein nsp16 is involved in capping of viral mRNA to protect it from host degradation, and it has been demonstrated that it must be associated with nsp10 to be active. One potential method of ablating nsp16 activity is to disrupt the binding interface between nsp16 and nsp10. In SARS-CoV, the following interfacial mutations in nsp16 were shown to ablate nsp16 2

The envelope (E) protein is one of the four structural proteins of coronaviruses. The other structural proteins are the membrane (M), nucleocapsid (N), and spike (S) proteins. The E protein is a type 1 transmembrane protein that is able to form pentamers by associating with other E proteins. This pentamer forms a membrane pore that is able to transport ions (Li et al. 2014). The pore structure is called a viroporin and is present in many common viruses. The ion channel activity can be inhibited in SARS-CoV E protein by the drug HMA. Only a few E proteins are present in each viral particle, but the protein is highly expressed in the host cells (Vennema et al. 1996

The E protein has been shown to interact with other viral proteins. Tandem affinity purification assays established interaction between the S and the E proteins, but the mechanism of this interaction was not pursued further (Alvarez et al. 2010). This study also shows that the E protein interacts with nsp3, and the authors suggest that nsp3 mediates E protein ubiquitination. In addition, the E protein co-immunoprecipitates with the N protein, but the function of this interaction remains unclear (Maeda et al. 1999

It is also important to note that the C-terminus of the E protein interacts with the C-terminus of the M protein on the cytoplasmic side, and this is required for virus envelope formation (Lim and Liu 2001). Since this protein is highly conserved relative to its SARS-CoV counterpart (96% identical), this information is likely applicable to SARS-CoV-2 E protein.

SARS-CoV-2 E protein has a predicted short 11 aa N-terminal tail, a 25 aa transmembrane region, and a 37 aa C-terminal cytoplasmic region, including the PBM (DLLV). Four variations relative to SARS-CoV E protein were verified, all located in the C-terminal end (Fig. S16), including one conservative substitution, two non-conservative substitutions (Val56Phe, Glu69Arg), and a deletion. Similar to SARS-CoV, SARS-CoV-2 E protein is likely to assemble as a pentamer to form a viroporin (Fig. S16), and the three C-terminal substitutions are likely exposed to the cytoplasm in the pentamer; they could therefore be involved in modulating E protein interactions with other proteins.

ORF6 is an auxiliary protein in SARS coronaviruses that is not required for virus replication (Huang et al. 2007). However, it can increase virus replication when expressed in a heterologous system or at low multiplicity of infection (Zhao et al. 2009

ORF7a is an accessory protein of coronaviruses and it is not essential for viral replication in vitro (Schaecher, Touchette, et al. 2007). ORF7a is a type I transmembrane protein, localized mainly in the Golgi apparatus and at the cell surface (Nelson et al. 2005; Taylor et al. 2015). In addition, ORF7a colocalizes with calnexin (an ER marker), showing that it is also localized at the ER (Nelson et al. 2005). ORF7a is an antagonist of bone marrow stromal antigen 2 (BST-2/CD317/tetherin) (Taylor et al. 2015). BST-2 is a pre-B-cell growth promoter that inhibits virus release by tethering budding virions to the host cell membrane (Sauter et al. 2010). Greater virion tethering to the cell membrane is observed when ORF7a is not present (Taylor et al. 2015). ORF7a is usually located in the Golgi apparatus and relocates to the plasma membrane when BST-2 is expressed, colocalizing with it (Taylor et al. 2015). ORF7a binds to BST-2 and reduces the restriction activity of BST-2 by preventing its glycosylation (Taylor et al. 2015

The ORF7a structure is composed of seven antiparallel β-sheets that together make a β-sandwich (Fig. S19). ORF7a is very conserved between SARS-CoV and SARS-CoV-2, with 85% sequence identity. A total of 18 variations are verified, with 11 non-conservative substitutions and a deletion. Substitutions in the luminal domain are located on the protein surface (Fig. S19

packaging. All of these residues are fully conserved in SARS-CoV-2 N. A long intrinsically disordered region follows these residues (181-246) and is predicted to fold upon binding (ANCHOR2 prediction (Mészáros et al. 2018)).

and hijack this complex for ubiquitination and degradation.

Structural analysis of SARS-CoV-2 ORF10. ORF10 of SARS-CoV-2 is the last predicted coding sequence upstream of the poly-A tail and is the shortest predicted coding sequence, composed of 38 a.a.
ORF10 is predicted to harbor a long helix and a pair of β-strands. It appears to be unique to SARS-CoV-2 (Fig. S24).