Evolutionary changes in the number of dissociable amino acids on spike proteins and nucleoproteins of SARS-CoV-2 variants

Abstract The spike protein of the severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) is responsible for target recognition, cellular entry, and endosomal escape of the virus. At the same time, it is the part of the virus that exhibits the greatest sequence variation across the many variants which have emerged during its evolution. Recent studies have indicated that with progressive lineage emergence, the positive charge on the spike protein has been increasing, with certain positively charged amino acids (AAs) improving the binding of the spike protein to cell receptors. We have performed a detailed analysis of dissociable AAs of more than 1400 different SARS-CoV-2 lineages, which confirms these observations while suggesting that this progression has reached a plateau with Omicron and its subvariants and that the positive charge is not increasing further. Analysis of the nucleocapsid protein shows no similar increase in positive charge with novel variants, which further indicates that positive charge of the spike protein is being evolutionarily selected for. Furthermore, comparison with the spike proteins of known coronaviruses shows that already the wild-type SARS-CoV-2 spike protein carries an unusually large amount of positively charged AAs when compared to most other betacoronaviruses. Our study sheds light on the evolutionary changes in the number of dissociable AAs on the spike protein of SARS-CoV-2, complementing existing studies and providing a stepping stone towards a better understanding of the relationship between the spike protein charge and viral infectivity and transmissibility.


Introduction
The coronavirus disease 2019  pandemic caused by the severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) gave rise to an unprecedented global effort to collect and share the information of the virus' ongoing evolution (Moorthy et al. 2020;Pratt, Bull, and Med 2021) with its emerging stream of new variants with varying degrees of disease severity, transmissibility, and other traits (Singh and Yi 2021;Telenti, Hodcroft, and Robertson 2022). Consequently, this large amount of data enables the study of how the virus has been changing since it was first detected in humans and the discovery of those mutations which made it adapt further, with a particular focus on the designated variants of interest (VOIs) and variants of concern (VOCs) (Magazine et al. 2022;Obermeyer et al. 2022;Bloom and Neher 2023;Carabelli et al. 2023). The spike (S) protein of the virus-responsible for target recognition, cellular entry, and endosomal escape (Huang et al. 2020)-in particular exhibits the greatest sequence variation, not only across different SARS-CoV-2 variants but also in coronaviruses in general (Cavanagh 2005;Harvey et al. 2021). As an example, the highly transmissible Omicron variant has fifteen mutations solely in its receptor-binding domain (RBD), the part of the S protein that binds to angiotensin-converting enzyme 2 (ACE2) receptor. The high variability of the S protein across different variants has also led to recent efforts in studying whether VOCs could be identified from the sequence of the S protein alone, which would make lineage assignment relatively easier compared to the use of complete genome sequences (O'Toole et al. 2022).
Numerous experimental, computational, and bioinformatics studies have analysed the influence of the observed mutations in the S protein on the viral function. Some of the mutations have been linked to an increased transmissibility of the virus, while and negatively charged AAs (blue and red, respectively) on the S protein surface. The actual charge distribution on the S protein is a more complex question, which depends, among other things, on the bathing solvent conditions, such as electrolyte concentration and pH, and local values of AA dissociation constants (Javidpour et al. 2021;Kim et al. 2022). Mutations of dissociable AAs on the S protein of selected SARS-CoV-2 VOCs: beta (B), delta (C), Omicron BA.1 (D), Omicron BA.4 (E), and Omicron XBB (F). Only those mutations that lead to either a gain or a loss of a dissociable AA are shown (Hodcroft 2021): gain of a positively charged AA (blue), gain of a negatively charged AA (red), and loss of either positively or negatively charged AA (light gray). Mutations which replace one dissociable AA with another of the same charge type are not represented. Note that these images show only the positions of the mutations of dissociable AAs and do not depict the changes in charge on the S protein. The complete PDB structure of the S protein in (B)-(F) was obtained from Kim et al. (2022). All the images were drawn using UCSF Chimera (Pettersen et al. 2004).
others have been linked to changes in its binding to ACE2 (Jawad et al. 2022;Obermeyer et al. 2022;Sang et al. 2022;Zhang et al. 2022). A particularly interesting observation, made as the COVID-19 pandemic progressed, concerns the tendency of the mutations in the S protein of different VOCs to increase the number of positively charged amino acid (AA) residues in it (Pawłowski 2021) (illustrated on a few examples in Fig. 1). This observation is further supported by a more recent analysis of the charge on the S protein of major SARS-CoV-2 lineages, which identified a striking change in the S protein charge with the evolution of the virus (Cotten and Phan 2023). Molecular dynamics and large-scale ab initio studies also found that the increase in the (partial) charge of S protein RBD of some VOCs increases the binding to ACE2 and other cell surface receptors (Adhikari et al. 2022a, 2022b, Nie et al. 2022Gan et al. 2022;Kim et al. 2022) and that the total charge of the RBD might be a simple predictor for the RBD-ACE2 binding affinity based on the data obtained for main VOCs (Barroso da Silva, Giron, and Laaksonen 2022). That charge might influence viral infectivity has also been shown in avian influenza viruses, where hemagglutinin proteins from low and high pathogenic strains exhibit clear difference in their surface charge (Baggio, Filippini, and Righetto 2023). These observations go hand in hand with a large body of studies showing the electrostatic interactions to be of enormous importance both in biological systems, in general (Holm, Kékicheff, and Podgornik 2000;Zhou and Pang 2018), as well as in viruses, in particular (Šiber, Bož ič , and Podgornik 2012;Zandi et al. 2020).
To study the effect of S protein charge on, for instance, its binding affinity requires significant computational effort and more often than not involves a certain degree of approximation (Javidpour et al. 2021;Barroso da Silva, Giron, and Laaksonen 2022;Kim et al. 2022), unless one resorts to quantum-level calculations (Ching et al. 2023). On the other hand, the amount of dissociable AAs provides a good estimate for the total charge of the S protein and importantly enables a broad study of different SARS-CoV-2 lineages (Pawłowski 2021;Cotten and Phan 2023). In this work, we use the large amount of available data on different SARS-CoV-2 lineages that have emerged over the past 3 years of the pandemic to examine in detail how the number of dissociable AAs on the S protein-as a proxy for its total charge-has changed with increasing lineage divergence and evolution of the virus. We observe several different clusters corresponding to the emergence of different variants. We indeed observe a general tendency toward an increase in the total number of positively charged AAs on the S protein, a tendency that, however, seems to have recently plateaued. For comparison, we perform the same analysis on the SARS-CoV-2 nucleoprotein (N protein) to show that the evolutionary preference for positive charge is specific to the S protein.
We additionally examine the available data on S and N proteins in known coronaviruses to frame our results in a wider context. In this way, our work complements the existing studies on the importance of S protein charge for its interaction with the environment, at the same time adding several novel observations regarding the preference for particular dissociable AAs in different variants and the saturation of their number with Omicron VOC and its many subvariants.

Dissociable AAs on S and N proteins show different amounts of change with lineage divergence
In general, there are six AAs that can (de)protonate and thus acquire charge: three of them can be positively charged (arginine (ARG), lysine (LYS), and histidine (HIS)), whereas three of them can be negatively charged (aspartic acid (ASP), glutamic acid (GLU), and tyrosine (TYR))-see also the Methods section. Figure 2 shows how the average number of different dissociable AAs on the S and N proteins of 1421 SARS-CoV-2 variants changes with increasing average lineage divergence from the wild-type (WT) genome. Since the length of the S protein is approximately three times the length of the N protein (≈1,270 AA compared to ≈415 AA, respectively), it is not surprising that the number of dissociable AAs on the S protein is, in general, much larger than the number of dissociable AAs on the N protein. One can, nonetheless, observe very different trends in their numbers as lineage divergence increases. For instance, the number of positively charged LYS AAs on the S protein tends to steadily increase with lineage divergence, with a peak number of around sixty-seven with the Omicron subvariants BA.1, BA.3, and the recombinant XD ( Fig. 2A). However, this number slightly decreases afterward for the more divergent subvariants BA.4, BA.5, and the recombinant XBB-see also Fig. 3 and Table 1. The number of HIS also increases with increasing divergence, albeit it does so only at relatively highly divergent lineages. The changes in the number of negatively charged AAs are less prominent, with perhaps the exception of TYR, whose number is slightly increased with the more divergent Omicron subvariants.   Table 1).
In contrast to the S protein, the number of dissociable AAs on the N protein does not show any significant increases or decreases with lineage divergence (Fig. 2B). Here, the more interesting observation is that the number of certain AAs, such as TYR and LYS, occupied two distinct values in the early variants. As the lineages began to diverge more, only one of the values is selected for (a lower number of TYR and a higher number of LYS). Nonetheless, the number of dissociable AAs on the N protein shows far fewer changes with divergent SARS-CoV-2 lineages compared to their number in the S protein.  Fig. 2 shows, different dissociable AAs show different patterns of change with the evolution of SARS-CoV-2 and the emergence of new lineages. In order to see whether any patterns can be observed between different lineages with respect to their preference for a particular AA type, Fig. 3 shows a heatmap of the relative changes in the number of dissociable AAs on the S and N proteins from different SARS-CoV-2 lineages compared to the WT, with select VOCs and VOIs marked along the divergence progression. Looking at the S protein first (Fig. 3A), we can observe that the first positively charged AA residue to show an increase is ARG, which reaches a slight peak in a cluster of variants that covers Delta and Gamma variants (cf. also Table 1). The number of ARG, however, decreases again for more divergent variants, and in its place, the number of LYS is significantly increased, in a cluster covering Omicron subvariants BA.1, BA.3, and XD. While the number of LYS also remains high for later, more divergent subvariants, its number, nonetheless, drops in comparison to this cluster. Interestingly, the most divergent variants, including the Omicron subvariants B.2.75, BJ.1, and the recombinant XBB, show an increased number of HIS residues, which are only fractionally charged at physiological pH. The numbers of negatively charged AAs on the S protein change less drastically, but have a tendency toward a slight decrease in the more divergent lineages, with the notable exception of TYR, the number of which is increased in most Omicron subvariants.
While the overall number of dissociable AAs on the N protein changes less drastically (Fig. 3B), some trends can nonetheless be observed. Variants such as Delta and Gamma show an increase in the number of TYR and a decrease in the number of ASP. On the other hand, these changes are absent in the Omicron subvariants, where the most notable observation is the decrease in the number of GLU residues. We again observe that as lineage divergence increases, the number of positively charged AAs on the N protein shows a much lesser degree of change compared to their counterparts on the S protein. Figures 3 and 2 together make it clear that the evolutionary preference for the increase in the amount of positively charged AAs is particular to the S protein of SARS-CoV-2, and no similar effect can be observed for the number of dissociable AAs on the N protein.

Number of positively charged AAs on the S protein has plateaued with Omicron subvariants
As already mentioned, the total number of dissociable AAs can serve as a proxy for the amount of charge on the S and N proteins. Figure 4 separates the contributions of the positively (ARG, LYS, and HIS) and negatively (ASP, GLU, and TYR) charged AAs and shows how their total number changes with the increasing divergence of SARS-CoV-2 lineages from the WT genome. Similar to what has been observed previously (Pawłowski 2021;Cotten and Phan 2023), the number of positively charged AAs on the S protein increases with lineage divergence, while the number of negatively charged AAs remains rather steady or even decreases slightly for highly divergent lineages (Fig. 4A). This indeed indirectly implies that the overall charge on the S protein has been becoming more positive with increasing lineage divergence. However, we can also observe that the total number of positively charged AAs appears to have reached a plateau with the Omicron variant, with only small changes in the amount of positively charged AAs observed between its different subvariants.
The changes in the total number of positively and negatively charged AAs on the N protein as lineage divergence increases are, on the other hand, again far smaller compared to those of the S protein (Fig. 4B). However, we can also observe that the most divergent (and recently emerged) lineages seem to clearly prefer a version of the N protein with slightly fewer negatively charged AAs and slightly more positively charged AAs compared to the WT, albeit with a significantly smaller variation in their number compared to the less divergent lineages. This, combined with the observations of the number of charged AAs on the S protein (Fig. 4A), implies that the number of dissociable AAs on both the S and N proteins has reached an 'equilibrium' where any further significant changes appear to be less likely.

SARS-CoV-2 has more positively charged AAs than other known (beta)coronaviruses
Compared to the abundance of data on SARS-CoV-2, there is much less information available regarding the evolution of the S proteins of other coronaviruses. Nonetheless, we can compare the number of dissociable AAs on the S protein of most currently known coronaviruses based on their reference sequences. Figure 5 thus shows the comparison of the number of positively and negatively charged AAs on the S proteins of the reference strains of fortysix different coronaviruses (see the Methods section), together with twenty-two VOCs and VOIs of SARS-CoV-2 (Table 1). In general, the S proteins of coronaviruses tend to have a larger number of GLU and ASP AAs compared to the number of ARG and LYS AAs (Fig. 5A), as well as a large amount of TYR (Fig. 5B), which indicates that the overall charge of the S protein is negative. There are a few exceptions that have approximately the same number of GLU and ASP AAs compared to ARG and LYS, including human coronavirus NL63, suncus murinus coronavirus X74, two avian coronaviruses (IBV and 9203), and two bat coronaviruses (Rhinolophus bat coronavirus HKU2 and Rousettus bat coronavirus HKU9). Compared to most other betacoronaviruses-and, in fact, most  other coronaviruses in general-both WT SARS-CoV-2 and its variants have a larger number of ARG and LYS AAs on their S proteins. One notable exception is SARS-CoV, which has a similar amount of positively and negatively charged AAs as WT SARS-CoV-2. In general, Fig. 5 shows that the S protein of WT SARS-CoV-2 already has fewer negatively charged AAs and more positively charged AAs than other betacoronaviruses and that with lineage divergence, it is the number of positively charged AAs that has increased even further.

Discussion and Conclusions
By analysing the number of dissociable AAs on the S and N proteins of more than 1,400 different SARS-CoV-2 lineages, we have shown that there is an overall tendency towards an increase in the number of positive AAs on the S protein with increasing lineage divergence, an effect that is not readily observed for the N protein (Fig. 3). While the number of dissociable AAs on a protein can only be considered a proxy for its total charge, which is in fact a result of many different local and global factors (Adhikari et al. 2022a), our observations confirm that a more positively charged S protein of SARS-CoV-2 is evolutionarily favourable. This result is in line with previous studies (Pawłowski 2021; Barroso da Silva, Giron, and Laaksonen 2022; Cotten and Phan 2023) which have shown that the charge on the S protein has been increasing ever since the start of the pandemic. However, we have also shown that the overall number of positively charged AAs has seemingly reached a plateau with Omicron and its subvariants (Figs 4 and 1).
Further observation of changes in the number of dissociable AAs in emerging variants as SARS-CoV-2 continues to evolve will show whether this plateau is only temporary or whether the positive charge on the S protein has reached its peak. On the other hand, our observation that the positive charge on the N protein, which plays a role in condensing the RNA genome (Dinesh et al. 2020;Cascarina and Ross 2022;Li and Zandi 2022), has changed only little with the evolution of SARS-CoV-2 might indicate that it is already optimized. We also note that while a similar analysis can be carried out on other structural proteins of SARS-CoV-2, such as the envelope and membrane proteins, these are significantly smaller than the S and N proteins and have far fewer dissociable AAs to start with. Consequently, any evolutionary changes in the number of dissociable AAs occur on a much smaller scale-on the order of a change in a single AA residue-and are thus difficult to compare.
Detailed inspection of the exact AA composition of the S protein in different SARS-CoV-2 lineages showed that the main contribution to positive charge comes from additional LYS residues, which reached the largest number with BA.1 and BA.3 Omicron subvariants and the XD recombinant variant ( Fig. 2A). Interestingly, the most divergent lineages, including the Omicron subvariants BA.2.75 and BJ.1 as well as the XBB recombinant variant, also show a significant increase in the number of HIS residues, which is only fractionally charged at neutral pH. Comparison with the changes in the number of dissociable AAs on the N proteins of different SARS-CoV-2 lineages (Fig. 2B) shows that while there is a slight tendency for an increase in charge with increasing lineage divergence, this occurs to a far lesser extent. We argue that this is an additional confirmation that charge plays an important role in the function(s) of the S protein and that the observed increase in the number of positively charged AAs on it is not a general effect that would occur in any viral protein as lineages continue to diverge.
Numerous studies have demonstrated how individual AA mutations that increase the local charge in the RBD of the S protein change its interaction with the ACE2 receptor (which is not the only receptor that binds to the S protein of SARS-CoV-2 (Loo et al. 2023)). Even here, the question of positive charge is a complex issue: on the one hand, individual mutations that increase positive charge can reduce the binding affinity, as they are incompatible with particular LYS residues on the receptor (Zhang et al. 2022). On the other hand, the absence of charge-increasing Q493R mutation in the BA.4 and BA.5 Omicron subvariants results in a significantly weaker binding affinity to ACE2 compared to Omicron subvariants BA.1, BA.2, and BA.3 (Sang et al. 2022). Charge can also be an important factor in other interactions, as it can disrupt the hydrogen bond network and thus influence the overall interaction (Ching et al. 2023), and it has the potential to diminish antibody binding (Harvey et al. 2021). Our results, together with previous studies (Pawłowski 2021;Cotten and Phan 2023), clearly demonstrate that the evolutionary progress of SARS-CoV-2 lineages favours an overall increase in the positive charge of the S protein, making it stand out in this respect from among other known betacoronaviruses (Fig. 5). These changes likely have an effect that is greater than the contributions of individual AA mutations themselves: for instance, anionic, negatively charged lipids represent a dominant fraction of charged lipid species in biological membranes (Galassi and Wilke 2021), and consequently, their role in the interaction between proteins and membranes is of great biological interest (Pö yry and Vattulainen 2016). More positively charged proteins would then interact more strongly with the membranes, consequently making the positive charge mutations more desirable. We hope that further studies can elucidate how the observed increase in the overall positive charge of the S protein benefits viral infectivity, transmissibility, and other traits.

SARS-CoV-2 variants
We obtained a list of SARS-CoV-2 Pango lineages from the CoV-Lineages.org lineage report (O'Toole et al. 2021) on 24 November 2022. These lineages were used as an input to download virus genomic and protein data from NCBI Virus (Hatcher et al. 2017) using the provided command line tools; the data were downloaded between 30 November 2022 and 5 January 2023. We used the accompanying annotations to obtain the isolate collection dates and kept the earliest record with an available full date of collection (i.e. year, month, and day) as the timepoint of the lineage 'emergence' for use in our analysis.
For each Pango lineage, we furthermore obtained the information on lineage divergence-the number of nucleotide changes (mutations) in the entire genome relative to the root of the phylogenetic tree, i.e. the start of the outbreak-from the global SARS-CoV-2 data available on Nextstrain (Hadfield et al. 2018). We have selected only those entries with a genome coverage of >99 per cent and extracted their lineage divergence and the number of mutations. Since individual entries within a Pango lineage can still exhibit small differences in their lineage divergence from the WT genome, we averaged over them to obtain the average lineage divergence for each Pango lineage. To allow for an easier interpretation of our results, we list in Table 1 a comparison of the average lineage divergence of selected VOCs and VOIs used in our analysis.
As the last selection step, we retained only those Pango lineages whose downloaded protein fasta file was not empty. We used these protein data to obtain the number of dissociable AAs on the S and N proteins. The numbers of dissociable AAs were then averaged over all available protein data for a given Pango lineage. The final number of analysed SARS-CoV-2 Pango lineages for which the entirety of the data described earlier was attainable is N =1421.

Coronaviridae
As a point of comparison, we also examined the number of dissociable AAs on the S and N proteins of known coronaviruses. We used the coronaviruses listed in the most recent Virus Metadata Resource (2 December 2022) issued by the International Committee on Taxonomy of Viruses (International Committee on Taxonomy of Viruses 2021) and limited ourselves to the genomes of those viruses with available REFSEQ accession numbers. We used these to download the representative genome and protein fasta files from NCBI Virus (Hatcher et al. 2017). Due to the large amount of variation in the annotations across the different coronavirus datasets, we limited ourselves solely to the REFSEQ genomes and neglected any information on different strains and variants. These data were downloaded on 28 January 2023. We then followed the same procedure as with SARS-CoV-2 variants to obtain the number of dissociable AAs on the S and N proteins of different coronaviruses.
Our final dataset of coronaviruses includes forty-six different species; of these, one is a repetition of WT SARS-CoV-2. The dataset also includes SARS-CoV, Middle East respiratory syndrome-related coronavirus, and all other four known human coronaviruses. In general, the dataset comprises nineteen species from the genus Alphacoronavirus, fifteen species from the genus Betacoronavirus, five species from the genus Gammacoronavirus, and seven species from the genus Deltacoronavirus.

Dissociable AAs
We analysed the S and N proteins of SARS-CoV-2 variants and coronaviruses to obtain the (average) numbers of six dissociable AAs: GLU, ASP, TYR, ARG, LYS, and HIS. We used Biopython to parse the protein fasta files and count the number of dissociable AAs on the S and N proteins. Three of the six AAs can carry positive charge (ARG, LYS, and HIS), while the other three can carry negative charge (ASP, GLU, and TYR) (Lide 2013). HIS typically carries a relatively small fractional charge at physiological pH, while TYR starts to acquire charge only at very basic pH; the importance of the latter in charge-mediated interactions has been demonstrated in recent studies (Barroso da Silva, Giron, and Laaksonen 2022). In our analysis, we did not consider cysteine, which has a thiol with a functional end group that is a very weak acid and is usually not considered to be an acid at all (Nap et al. 2014; Bož ič and Podgornik 2017).

Data availability
All the data presented in this work and a detailed description of data collection are openly available in OSF at https://osf.io/78b3f/, reference number 78B3F.