The Heterogeneous Landscape and Early Evolution of Pathogen-Associated CpG Dinucleotides in SARS-CoV-2

Abstract COVID-19 can lead to acute respiratory syndrome, which can be due to dysregulated immune signaling. We analyze the distribution of CpG dinucleotides, a pathogen-associated molecular pattern, in the SARS-CoV-2 genome. We characterize CpG content by a CpG force that accounts for statistical constraints acting on the genome at the nucleotidic and amino acid levels. The CpG force, as the CpG content, is overall low compared with other pathogenic betacoronaviruses; however, it widely fluctuates along the genome, with a particularly low value, comparable with the circulating seasonal HKU1, in the spike coding region and a greater value, comparable with SARS and MERS, in the highly expressed nucleocapside coding region (N ORF), whose transcripts are relatively abundant in the cytoplasm of infected cells and present in the 3′UTRs of all subgenomic RNA. This dual nature of CpG content could confer to SARS-CoV-2 the ability to avoid triggering pattern recognition receptors upon entry, while eliciting a stronger response during replication. We then investigate the evolution of synonymous mutations since the outbreak of the COVID-19 pandemic, finding a signature of CpG loss in regions with a greater CpG force. Sequence motifs preceding the CpG-loss-associated loci in the N ORF match recently identified binding patterns of the zinc finger antiviral protein. Using a model of the viral gene evolution under human host pressure, we find that synonymous mutations seem driven in the SARS-CoV-2 genome, and particularly in the N ORF, by the viral codon bias, the transition–transversion bias, and the pressure to lower CpG content.


Introduction
When a virus enters a new host, it can present pathogenassociated molecular patterns (PAMPs) that are rarely seen in circulating strains that have adapted to that host's immune environment over evolutionary timescales. The emergence of SARS-CoV-2, therefore, provides a rare window into innate immune signaling that may be relevant for understanding immune-mediated pathologies of SARS-CoV-2, antiviral treatment strategies, and the evolutionary dynamics of the virus, where evidence for selective pressures on viral features can reflect what defines "self" in its new host. As a case in point, the 1918 influenza pandemic was likely caused by a strain that originated in water fowl and entered the human population after possible evolution in an intermediate host.
That viral genome presented CpG dinucleotides within a context and level of density rarely found in the human genome where they are severely underrepresented (Karlin et al. 1994; Karlin and Mr azek 1997;Cheng et al. 2013), particularly in a set of genes coding for the proteins associated with antiviral innate immunity (Greenbaum et al. 2008(Greenbaum et al. , 2014. Over the past century, the 1918 H1N1 lineage evolved in a directed manner to lower these motifs and gain UpA motifs, in a way that could not be explained by its usage of amino acid codon bias (cb) (Greenbaum et al. 2008(Greenbaum et al. , 2014. It has since been found that these motifs can engage the pattern recognition receptors (PRRs) of the innate immune system (Jimenez-Baranda et al. 2011;Vabret et al. 2017) and directly bind the zinc finger antiviral protein (ZAP) in a CpGdependent manner (Zhu et al. 2011;Takata et al. 2017;Meagher et al. 2019;Luo et al. 2020). Hence, the interrogation of emergent viruses from this perspective can predict novel host virus interactions.
COVID-19 presents, thus far, a different pathology than that associated with the 1918 H1N1, which was disproportionately fatal in healthy young adults. It has been characterized by a large heterogeneity in the immune response to the virus (Hirano and Masaaki 2020;Mehta et al. 2020;Zhou et al. 2020) and likely dysregulated type-I interferon (IFN) signaling (Kindler and Thiel 2016;Hadjadj et al. 2020;Ragab et al. 2020;Vabret et al. 2020). Various treatments to attenuate inflammatory responses have been proposed and are currently under analysis or being clinically tested (RECOVERYCollaborativeGroup 2020; Xu, Han, et al. 2020). It is therefore essential to quantify pathogen-associated patterns in the SARS-CoV-2 genome for multiple reasons. The first is to better understand the pathways engaged by innate immune system and the specific agonists to help build better antiviral therapies. Another is to better predict the evolution of motif content in synonymous mutations in SARS-CoV-2, as it will help understand the process and timescales of attenuation in humans. Third is to offer a principled approach for optimizing vaccine strategy for designed strains (Amanat and Krammer 2020;Kames et al. 2020) to better reflect human genome features.
In this work, we will use the computational framework developed in Greenbaum et al. (2014) to carry out a study of nonself-associated dinucleotide usage in SARS-CoV-2 genomes. The statistical physics framework is based on the idea of identifying the abundance or scarcity of dinucleotides given their expected usage based on host features. It generalizes the standard dinucleotide relative abundance introduced in Karlin and Mr azek (1997), as it can easily incorporate constraints in coding regions coming from amino acid content and codon usage. The outcomes of the approach are selective forces (Greenbaum et al. 2014) that characterize the deviations with respect to the number of dinucleotides which is statistically expected under a set of various constraints. Such formalism has further been applied to identify noncoding RNA from repetitive elements in the human genome expressed in cancer that engage PRRs (Tanne et al. 2015), to characterize the CpG evolution through synonymous mutations in H1N1 (Greenbaum et al. 2014) and to characterize local and nonlocal forces on dinucleotides across RNA viruses (Vabret et al. 2017).
We perform an analysis of the landscape of CpG motifs and associated selective forces in SARS-CoV-2 in comparison with other ssRNA viruses and other genomes in the coronavirus family in order to understand specific PAMP features in the new SARS-CoV-2 strains. We also focus on the heterogeneity of CpG motif usage along the SARS-CoV-2 genome. Finally, we use a model of viral genome evolution under human host pressure given by the CpG force to study synonymous mutations, and in particular those which change CpG content, observed since the SARS-CoV-2 entered the human population, and study sequence motifs preceding CpG loss loci. The model is used to score all possible synonymous mutations from an ancestral sequence sampled in Wuhan at the beginning of the COVID-19 pandemic (GISAID ID: EPI_ISL_406798) and is validated on single-nucleotide variants observed in the sequence data collected so far. This approach points out at hotspots where new mutations will likely attenuate the virus, while evolving in contact with the human host.

Definition of Coding and Noncoding CpG Forces
To characterize CpG dinucleotide usage on SARS-CoV-2 genome, we have computed the CpG forces following the approach introduced in Greenbaum et al. (2014) and described in Materials and Methods. CpG forces are intensive parameters that characterize the abundance or scarcity of CpG dinucleotides in a DNA or RNA sequence with respect to their expected usage relative to a reference nucleotide distribution. We propose two frameworks to define these reference distribution, schematically represented in figure 1. In the noncoding framework, nucleotides are randomly drawn according to a fixed nucleotide bias, whereas in the coding framework, the amino acid sequence is fixed, and the cb defines the reference distribution.

Noncoding Forces
The CpG noncoding force relative to the sequence nucleotide bias essentially captures the same information as the relative abundance index, I ¼ qðCGÞ qðCÞqðGÞ , where q(CG), q(C), and q(G) are, respectively, the frequencies of CpG dinucleotides and of C, G nucleotides in the sequence (Karlin and Mr azek 1997). The CpG noncoding force is well approximated by the logarithm of the relative abundance index: f % log I (see supplementary section SI.3 and fig. SI.13, Supplementary Material online). Positive and negative forces correspond therefore to, respectively, excess and scarcity of dinucleotides with respect to their expected occurrences determined by the nucleotide bias. Table 1 show that the CpG noncoding forces for human coding (Greenbaum et al. 2014) and noncoding RNA (Tanne et al. 2015) (relative to the human nucleotide bias) are negative, and particularly low for type-I IFN transcripts involved in the innate immune system (Greenbaum et al. 2014), confirming that CpG motifs are overall scarce in the human genome (Karlin et al. 1994;Karlin and Mr azek 1997;Cheng et al. 2013;Greenbaum et al. 2014). As shown in table 1 strongly pathogenic viruses in humans, such as Ebola, the Spanish flu H1N1 (1918) and the avian flu H5N1 (2005), are characterized by large CpG forces compared with the average force on human RNAs. The CpG force value can then be used as an indicator for the propensity of a viral sequence to be sensed by PRRs as nonself and engage the human innate immune response (Greenbaum et al. 2014;Tanne et al. 2015;Vabret et al. 2017). A comparative analysis for noncoding force in the Coronaviridae family will be discussed in the following Sections.

Coding Forces
The CpG coding force is based on the comparison of CpG occurrences in a coding RNA (or DNA) sequence and random synonymous sequences (associated to the same amino acids) drawn according to prescribed codon usage, see figure 1. The computation of coding forces relative to the human codon usage for SARS-CoV-2 will be discussed in the following sections, and it will be used to characterize the  . 2a).
The value ' À1:1 of the global noncoding force for SARS-CoV-2 is comparable with the one for human noncoding RNA and lower than other strongly pathogenic viruses in humans, such as H1N1, H5N1, and Ebola (table 1). Among the coronaviruses ( fig. 2a and table 2), MERS shows the highest CpG force followed by SARS, whereas some bat coronaviruses have even stronger CpG force. SARS-CoV-2 is among the viruses with lower global CpG force; its value is median among the hCoV that circulates in humans with less pathogenicity, between HKU1 with a smaller CpG force and OC43 with a larger one (supplementary fig. SI.4, Supplementary Material online, for a comparison with other hCoVs). The absence of a straightforward correlation between global CpG force and the pathology of a coronavirus in human calls for a finer, local analysis of CpG forces we report below. Figure 2b compares the forces (comparisons based on dinucleotide number rather than force give qualitatively similar FIG. 1. Schematic definitions of CpG noncoding and coding forces. The viral RNA sequence to be analyzed is shown in black, with CpG dinucleotide motifs in red. The force is computed by comparing the occurrences number of CpG with ensembles of random sequences fulfilling some of the constraints acting on the natural sequence, see Materials and Methods for details. Top, "noncoding" framework, in green: sequences of the same lengths can be generated by randomly drawing nucleotides according to prescribed frequencies, here taken from the human genome. Bottom: When the sequence under consideration codes for a protein (sequence of amino acids in black letters), random sequences (violet letters) can be generated in a "coding" framework as follows. For each amino acid, a licit codon is randomly chosen according to prescribed codon usage (here, computed from the coding regions in the human genome). Notice that the above computational frameworks are not restricted to CpG and can be applied to other dinucleotidic motifs.   To go beyond the global analysis, we study the local noncoding forces acting on CpG in fixed-length windows along the genome. Results for SARS, MERS, SARS-CoV-2, hCoV-HKU1, and two representative sequences of bat and pangolin coronaviruses, chosen for their closeness to SARS-CoV-2, are reported in figure 2c. SARS-CoV-2 shows a strong Early Evolution of CpG Dinucleotides in SARS-CoV-2 . doi:10.1093/molbev/msab036 MBE heterogeneity of CpG forces along its genome: in some genomic regions, especially at the 5 0 and 3 0 extremities, SARS-CoV-2, SARS, and MERS (together with the bat and pangolin viruses) have a peak in CpG forces, which is absent in the hCoV-HKU1 (as well as in the other hCoVs, see supplementary fig. SI.4, Supplementary Material online). The heterogeneous CpG content in SARS-CoV-2 has been also pointed out in Digard et al. (2020).
The high CpG forces at the extremities could have an important effect on the activation of the immune response via sensing, as the life cycle of the virus is such that the initial and final part of the genome are those involved in the subgenomic transcription needed for viral replication (Lai and Cavanagh 1997;Kim et al. 2020). During the infection, many more RNA fragments from these regions are present in the cytoplasm than from the other parts of the viral genome. Consequently, despite the relatively low CpG content of SARS-CoV-2 compared with other coronaviruses, there can be high concentrations of CpG-rich RNA due to the higher transcription of these regions.
The similarity between the high values of the maximum local forces of SARS-CoV-2 and those of SARS, bat and pangolin coronaviruses shown in figure 2a confirm this pattern: MERS and SARS, viruses that are likely less well adapted to a human host, have the highest local peaks in CpG content, followed by SARS-CoV-2 and then by seasonal strains that circulate in humans. It is interesting to notice that high and very high levels of proinflammatory cytokines/chemokines (such as IL-6 and TNF-a) have been observed in, respectively, SARS and MERS and, at times, SARS-CoV-2 infection (Zhou et al. 2014;Fajgenbaum and

Forces Acting on Coding Regions Widely Vary across Structural Proteins
We now restrict our analysis to the coding regions of SARS-CoV-2 and, in particular, on two structural proteins, N (nucleocapside) and S (spike) (Amanat and Krammer 2020;Hoffmann et al. 2020;Xu, Chen, et al. 2020). As stressed before, the computation of the force in such coding regions accounts for the extra constraints on the amino acid content and takes the human codon bias as reference background.
The landscape of coding CpG forces with respect to the human codon bias is shown, restricted to the coding regions of SARS-CoV-2 genome, in figure 3a, and compared with the noncoding forces from figure 2, with respect to the human nucleotide bias (dashed red lines). The two curves are similar apart from a global shift toward lower forces for the coding forces. Notice that this shift essentially vanishes if the noncoding force is computed with respect to the nucleotide bias computed on human coding RNAs only (Howe and Song 2013; fig. 3a).
Apart from this global shift, the strongly heterogeneous landscape of the CpG coding forces along the SARS-CoV-2 genome does not substantially differ from the findings of figure 2c. In particular, the peak of high CpG density and force is still present at the 5 0 and the 3 0 ends of the genome, including the N ORF, the envelope E ORF, and membrane glycoprotein M ORF regions. In the S ORF region, the coding CpG force remains small. Detailed results for the S and N ORFs are shown in, respectively, figures 3b and 3c. These structural proteins are present and quite similar across the Coronaviridae family and allow us to compare several strains of coronaviruses. In the S ORF, SARS-CoV-2 shows the lowest global CpG force among the human-infecting betacoronaviruses, see figure 3c. The CpG force is much higher for protein N in SARS-CoV-2, immediately below the level of SARS and above that of MERS, see comparison with human-infecting members of the Coronaviridae family presented in figure 3b.

Force-Based Model Accounts for Early Evolution of Synonymous CpG-Related Mutations in SARS-CoV-2
We now assess the ability of our CpG force model to predict biases in the synonymous mutations already detectable across the few months of evolution following the first sequencing of SARS-CoV-2 (data from GISAID; Elbe and Buckland-Merrett 2017; Wuhan ancestral strain has GISAID ID: EPI_ISL_406798, collected in Wuhan on December 26, 2019; last updated sequence September 29, 2020; see Materials and Methods). Barring confounding effects, we expect that high-force regions, such as N ORF, will be driven by the host immune system pressure toward a lower number of CpG motifs. Other regions, such as S ORF, already have low CpG content and would feel no pressure to decrease the CpG content, so random mutations would likely leave their CpG number unaffected or increase it. We define in the following synonymous-single nucleotide variants (syn-SNV) as nucleotide synonymous substitution with respect to the Wuhan ancestral strain, observed at least in five collected sequences (0.01% of the sample) (such cutoff removes very rare mutations which may be due to sequencing errors). Figure 4a (bottom and middle panels) shows that many syn-SNV that decrease the number of CpG are located at the 5 0 and 3 0 ends of the sequence, in correspondence with the high peaks in CpG local force, notably in the N ORF region and at the 5 0 extremity of the genome. Conversely, syn-SNV that increase the number of CpG are more uniformly spread along the sequence. The behaviors of the local CpG force and of the local density of CpG-decreasing syn-SNV, computed on the same sliding windows along the sequence, show strong similarities, see middle panels in figure 4a.
To better explain this CpG mutational trend along the sequence, we define a putative equilibrium CpG force of the SARS-CoV-2 genome in human host, as the average CpG force of hCoVs in table 2: hCoVs have long been circulated in humans and are, therefore, supposed to be close to Di Gioacchino et al. . doi:10.1093/molbev/msab036 MBE equilibrium with their host. Other choices for equilibrium force will be discussed later. Regions with a CpG force much larger than the equilibrium one are predicted to be under strong evolutionary pressure to decrease their CpG content. This prediction is confirmed by the fact that CpGdecreasing syn-SNV are much more frequent than CpGincreasing ones, see figure 4a, middle. Conversely, in regions with forces slightly smaller than the equilibrium force value, the presence of a small evolutionary pressure to increase CpG is confirmed by the fact that CpG-increasing syn-SNV are slightly more frequent than CpG-decreasing ones. The scatter plot of the local forces and densities of CpG-increasing (blue) and -decreasing (red) syn-SNV along the SARS-CoV-2 Wuhan ancestral strain is shown in figure 4b. The correlation coefficient is much larger for CpG-decreasing syn-SNV (r 2 ¼ 0:85) than for CpG-increasing mutations (r 2 ¼ 0:16). The two scatter plots cross at a local CpG force f ' À1:860:2, very close to the equilibrium force, f eq ¼ À1:79. This result supports our choice of the equilibrium force. The global force of SARS-CoV-2 (f ¼ À1:71) is also compatible with this crossing point. On the contrary, other possible choices for the equilibrium force, such as the coding force computed on human type-I IFNs, f ¼ À2:89 would not match the crossing point. The results above suggest to introduce, as will be done Di Gioacchino et al. . doi:10.1093/molbev/msab036 MBE in the following, the CpG drive defined as the difference between the CpG local force and the equilibrium CpG force. Table 3 complements figure 4 with a detailed description of CpG-decreasing and -increasing syn-SNV along the ORFs and the 5 0 and 3 0 untranslated regions (UTRs) of SARS-CoV-2 genome. The regions with high negative CpG drive have a large density of CpG removing mutations, see for instance, 5 0 UTR, 3 0 UTR, N ORF, and M ORF. Importantly, syn-SNV are in many loci across the sequence ( fig. 4a) and taking into account syn-SNV counts in the sequence data or considering unique syn-SNV does not qualitatively affect our conclusions. Focusing on N ORF, remarkably 21% (47% with count) of syn-SNV remove a CpG motif. Such percentage represents a fraction of 75% (94% with counts), among the total number of syn-SNV affecting a CpG. On the opposite, the regions with a small-amplitude, negative or positive, drive such as ORF1a, ORF1b, and S ORF have a low density of CpG-affecting mutations; among the syn-SNV affecting CpG motifs the percentage for syn-SNV adding a CpG motif or removing a CpG motif are comparable. For S ORF, having the largest positive drive, the large majority of synonymous variants, 85% (92% with counts), leaves the CpG content unchanged with only few, 7% (4% with counts), syn-SNV affecting a CpG motif. Among syn-SNV affecting CpG, a slight predominance of CpG increasing syn-SNV is observed with 53% (56% with counts) CpG increasing against 47% (44% with counts) CpG decreasing syn-SNV. Last of all, figure 4c shows that, in N ORF, a rapid accumulation of CpG removing syn-SNV is observed in the sampled sequences as a function of the delay between the time of collection and the beginning of the COVID-19 pandemic. This increase is much steeper that the gradual rise of syn-SNV increasing CpG occurrences. In the S region, on the contrary, a gradual rise of syn-SNV is observed both for CpGincreasing and -decreasing mutations, with a slight predominance of CpG-increasing ones.

Analysis of Synonymous Mutations in N ORF Suggests Implication of ZAP in Progressive Loss of CpG
We have then studied the nucleotidic patterns preceding, along the viral sequence, the CpG dinucleotides lost in syn-  Early Evolution of CpG Dinucleotides in SARS-CoV-2 . doi:10.1093/molbev/msab036 MBE SNV encountered so far. In N ORF, the ORF with largest density of CpG decreasing syn-SNV as shown in table 3, 24 syn-SNV removing one CpG have been found. The nucleotide motifs preceding these loci are listed in the top 19 lines of table 4 (for some loci, more than one syn-SNV removed the same CpG), together with their positions along N ORF of SARS-CoV-2 and their number of occurrences in the sequence data. Seven out of 19 of these loci, which represent 71% of total syn-SNV removing a CpG (1,587 out of the 2,239), correspond to a motif of the type CnxGxCG, where nx is a spacer of n nucleotides and were identified as ZAPbinding patterns in (Luo et al. 2020). The binding affinity of ZAP to the motifs was shown to depend on the spacer length, n, with top affinity for n ¼ 7 (Luo et al. 2020) (table 4). Notice that 43% (three out of the seven) of the CpG-suppressionrelated motifs in SARS-CoV-2 correspond to n ¼ 7. Other motifs of the type CnxGcCG are also present in SARS-CoV-2, but their CpG is not lost in sequence data, see last five lines of table 4; the dissociation constants associated to their spacer lengths are on average larger than the ones of the motifs showing CpG loss. From the spacer length-dependent-binding affinity given in Luo et al. (2020) (table 4), we have computed a score, which we call ZAP affinity score (ZAS), which is related to the probability of having at least one ZAP bound to such motif (see Materials and Methods for technical details). The ZAS computed in sliding windows across the genome is presented in figure 4a (top plot): N ORF is the richest region in motifs of the form CnxGxCG, with the largest ZAS. Our analysis is confirmed in table 5, which reports all Syn-SNV removing CpG following an extended sequence motif. Even if N ORF represents only 4% of the total sequence length, 18% of extended motifs CnxGxCG and 26% (58% with counts) of syn-SNV removing a CpG on an extended motif are on this region. In contrast, only two extended motifs of type CnxGxCG were found in 5 0 UTR even if many repeated CpG at short interspace were present, see supplementary  (table 3) but have a large difference in CnxGxCG-like motif content, as shown in table 5. Remarkably, when considering the counts, the number of CpG-decreasing mutations occurring in N ORF, out of which 71% are on CnxGxCG-like motifs, is 10-fold more than that occurring in M ORF. These results support the existence of early selection pressure to lower CnxGxCG-like motifs in N ORF, where they are particularly concentrated.

Model Is Able to Discriminate Observed and Nonobserved Single Nucleotide Variants among Early Synonymous Mutations
Our model can be further used to predict the odds of synonymous mutations, either implying CpG or not, from the ancestral SARS-CoV-2 (GISAID ID: EPI_ISL_406798) sequence.

MBE
For this purpose, we introduce a synonymous mutation score (SMS), defined in Materials and Methods, whose value expresses how likely a mutation is to appear under the joint actions of the CpG force, the codon bias (we consider here the virus codon bias, calculated on the Wuhan ancestral strain, rather than the human cb, as SARS-CoV-2 is likely not in equilibrium with its host yet. This choice will be justified later), and the transition-transversion bias (ttb) (Jiang and Zhao 2006) (we consider the canonical ratio 4:1 here, see Materials and Methods for details.). For synonymous mutations that do not affect CpG the only mutational driving factors in our model are the cb, and the ttb, which are global drives on the genome. Synonymous mutations changing CpG are additionally driven by the local force to increase or decrease CpG content depending on the CpG mutational drive in the region under consideration ( fig. 4a and table 3). Figure 5a and 5b shows SMS along, respectively, the N ORF and S ORF, for all the observed Syn-SNV lowering (blue), increasing (red), or leaving unchanged (gray) CpG content, along with their counts in the sequence sample. In N ORF as in S ORF, the majority of Syn-SNV have high SMS, validating the model predictions. The sign and amplitude of the SMS for CpG-affecting Syn-SNV in figure 5a and 5b result from the interplay between the virus cb with the CpG drive in the model: the virus cb, computed on the whole genome which is globally low in CpG content, already favors synonymous mutations removing CpG. Due to the cb, both in N ORF and S ORF, Syn-SNV removing CpG tend to have a positive SMS, whereas Syn-SNV adding CpG tend to have a negative SMS (see supplementary fig. SI.12, Supplementary Material online, for SMS computed without CpG drive along N ORF and S ORF). The large and negative CpG drive in N ORF adds to the cb trend. As shown in figure 5a, it further raises the SMS of CpG-removing Syn-SNV and further decreases the negative SMS of CpG-adding Syn-SNV, in agreement with the data. The resulting SMS amplitude for CpG-affecting mutations is generally larger than for mutations nonaffecting CpG. On the opposite, in S ORF, as shown in figure 5b (see also supplementary fig. SI.12, Supplementary Material online), the positive drive acts against the cb trend, reducing or raising the SMS for, respectively, CpG-decreasing or -increasing Syn-SNVs. In some loci, the CpG drive is strong enough to reverse the sign of the SMS for CpG-increasing Syn-SMS, and the SMS become positive. The resulting SMS amplitude for CpGaffecting mutations is generally smaller than for mutations leaving CpG content unchanged, in agreement with the observation than CpG-affecting Syn-SNV are rare in S ORF.
To make our arguments more quantitative, we tested the ability of our model to discriminate between observed and nonobserved Syn-SNV in sequence data collected so far. For the sake of clarity, nonobserved Syn-SNV refer to the set of possible synonymous mutations that have been not observed so far in the sequence data (or observed very rarely, i.e., with <5 counts). In figure 5c and 5d, we show that the distribution of SMS for syn-SNV are shifted to higher values compared with its counterpart for nonobserved syn-SNV, both in N ORF and S ORF. Hence, our model is able to statistically discriminate between syn-SNV and nonobserved syn-SNV (ANOVA F-test: 1,085 for N and 5,590 for S). Moreover, when ranking all possible synonymous mutations in decreasing order according to their SMS and considering them as true predictions if they have been observed in sampled sequences so far, and thus correspond to an syn-SNV, or false predictions if they have not been observed in the collected sequences, we obtained very good classification performances (AUROC ¼ 0.82 for N and 0.84 for S). As a complementary test we give in supplementary figure SI.8, Supplementary Material online, the positive predicted value as a function  To identify what ingredient in the model is responsible for correctly distinguishing between observed and nonobserved synonymous mutations, we compare the discriminatory performances of the full model (including the cb, the ttb, and the CpG drive) with all model variations obtained removing one ingredient at the time, up to a null model in which all synonymous mutations are equally likely and all have zero SMS. In figure 5e and 5f, we compare, for the different models, the average SMS for Syn-SNV, obtained both with and without their counts in the sequence sample (see horizontal light and dark green lines in fig. 5a and 5b), with the average SMS for the nonobserved syn-SNV. The models are ranked by the difference of average SMS between observed and nonobserved Syn-SNV. As expected, the null model is unable to distinguish between the two sets, assigning vanishing average SMS to both. The gap between the average SMS of observed and nonobserved Syn-SNV progressively increases when introducing back the different ingredients of the model, up to the full one.
We have checked that the choice of the viral codon bias is important for such discriminatory ability. When using the human codon bias instead of the viral one, the model based only on the human codon bias behaves similar to the uniform bias model; the full model is again the best model but with a smaller difference in averages SMS ( It is worth noticing that models based on viral codon bias and ttb alone (without the force) already provide good discriminatory performances in terms of classification of SNV (AUROC and F score). This result is expected from the predominance of syn-SNV not affecting CpG occurrences, especially in S ORF. In addition, in regions, such as N ORF, where CpG lowering mutations are frequently observed, the viral codon bias, due to the low global CpG content, favors such synonymous mutations in agreement with figure 5e and 5f. Table 6 gives the average SMS and differences, as well as AUROC and ANOVA tests, for all ORFs and 5 0 and 3 0 UTR. We consistently find that the full model is always a good model to describe syn-SNV observed in the early evolution of SARS-CoV-2. There is a clear separation between observed and unobserved syn-SNV average SMS with large AUROC for all the ORFs and UTRs (! 0:68 for all the ORFs and UTRs and ! 0:71 for the regions with a better mutational statistics, with at least 20 syn-SNV) and a large ANOVA F value (! 3 for all the ORFs and UTRs and ! 13 among regions with at least 20 syn-SNV); the ORF1a, S ORF, and ORF7b are the only regions for which the average difference between SMS of observed and nonobserved syn-SNV computed with no CpG force is slightly larger. This again suggests that, for the S ORF, syn-SNV are only marginally affecting CpG motifs and are mainly driven by codon bias and ttb.
Remarkably, the ranking of models and their discriminatory performances are essentially the same when taking into account or not the counts of the syn-SNV, even if a net increase in the average syn-SNV SMS score is present for models with the CpG drive in most ORFs and UTRs, and  (Łuksza and L€ assig 2014) and correcting for sampling bias, will lie in between the two limit-case SMS discussed above.

Discussion
The present work reports analysis of dinucleotide motif usage, particularly CpG, in the early evolution of SARS-CoV-2 genomes up to October 2020. First, a comparative analysis with other genomes shows that the overall CpG force, and the associated CpG content are not as large as for highly pathogenic viruses in humans (such as H1N1, H5N1, Ebola, and SARS and MERS in the Coronaviridae family). However, the CpG force, when computed locally, displays large fluctuations along the genome. This strong heterogeneity is compatible with viral recombination, in agreement with the hypothesis stated in Andersen et al. (2020). The degree to which this heterogeneity in any way reflects zoonotic origins should be further worked out using phylogenetic analysis. In particular, the segment coding for the Spike protein has a much lower CpG force. The S protein has to bind ACE2 human receptors and TMPRSS2 (Hoffmann et al. 2020;Vabret et al. 2020). A fascinating reason that could explain the low CpG force on this coding region is that it may come (at least in part) from other coronaviruses that better bind human entry receptors (Andersen et al. 2020;MacLean et al. 2020). Other regions, in particular, the initial and final part of the genome, including the 5 0 and 3 0 UTR and N ORF, are characterized by a larger density of CpG motifs (and corresponding CpG force), which are comparable with what is found in SARS and MERS viruses in the Betacoronavirus genus. Interestingly, the initial and final part of the genome are implied in the fullgenome and subgenomic viral replication. In particular, the coding region of the N protein and its RNA sequence, present in the 3 0 UTRs of all SARS-CoV-2 subgenomic RNAs, has been shown in Kim et al. (2020) to be the most abundant transcript in the cytoplasm. The high concentration of N transcripts in the cytoplasm could contribute to a dysregulated innate immune response. A mechanism generating different densities of PAMPs being presented to the immune system at different points in the viral life cycle can affect immune recognition and regulation. The precise way this can contribute to immunopathologies associated with COVID-19 and how this is related to the cytokine signaling dysfunction associated with severe cases (Ragab et al. 2020) need further experimental investigation. The analysis of the evolution of synonymous mutations since the outbreak of COVID-19 shows that mutations lowering the number of CpG have taken place in regions with higher CpG content, at the 5 0 and 3 0 ends of the sequence, and in particular in the N protein-coding region. The sequence motifs preceding the loci of the CpG removed by mutations match, especially in N ORF, some of the strongly binding patterns of the ZAP protein (Luo et al. 2020). Natural sequence evolution seems to be compatible for protein N with our model, in which synonymous mutations are driven by the virus codon bias and the CpG forces leading to a progressive loss in CpG. These losses are expected to lower the CpG forces, until they reach their equilibrium values in human host, as is seen in hCoV coronaviruses commonly circulating in human population (Abdul-Rasool and Fielding 2010). More data, collected at an unprecedented pace (Pickett et al. 2012;Elbe and Buckland-Merrett 2017;Hadfield et al. 2018), and on a longer evolutionary time are needed to confirm these hypothesis. Since the data collected are likely affected by relevant sampling biases, a more precise analysis of synonymous mutations could be carried out using the available phylogenetic reconstruction of viral evolution (Gonzalez-Reiche et al. 2020). Nevertheless our results seem robust, because they are consistent both considering unique synonymous variants and including their counts. They coherently point to the presence of putative mutational hotspots in the viral evolution. Although the results presented here are preliminary due to the early genomics of this emerging virus, they have been confirmed by incoming sequence data since our first analysis (dated May 5, 2020, see supplementary fig. SI.11, Supplementary Material online), and they point to interesting future directions to identifying the drivers of SARS-CoV-2 evolution and building better antiviral therapies. In this work, we focused on synonymous mutations, but it would interesting to extend our fixed amino acid model for viral evolution to take into account nonsynonymous mutations and to further model transmission and mutation (in the presence of a proofreading mechanism; Denison et al. 2011) processes in SARS-CoV-2 to predict the timescale at which natural evolution driven by host mimicry would bring the virus to an equilibrium with its host (Greenbaum et al. 2008(Greenbaum et al. , 2014. After our work was posted on the bioRxiv, Nchioua and colleagues have shown the importance of ZAP in controlling the response against SARS-CoV-2 (Nchioua et al. 2020) by demonstrating that a knockout of this protein increases SARS-CoV-2 replication. The interaction between SARS-CoV-2 and ZAP has also been observed with unbiased methods in another recent work (Flynn et al. 2020). This finding supports our prediction that recognition of SARS-CoV-2 by ZAP imposes a significant fitness cost on the virus, as demonstrated by its early evolution to remove ZAP recognition motifs. Two other recent theoretical works Sadykov et al. 2021) corroborate our results showing that at the single nucleotide level, there is a net prevalence of C!U synonymous mutations (the most common nucleotide mutation which may cause a CpG loss) in the early evolution of SARS-CoV-2. Moreover, a recent analysis of the immune profile of patients with moderate and severe disease revealed an association between early, elevated cytokines and worse disease outcomes identifying a maladapted immune response profile associated with severe COVID-19 outcome (Lucas et al. 2020).

CpG Density Versus Local and Global Forces
Throughout this work, we used CpG forces to quantify the CpG content of a given sequence. Here we want to compare this approach with the simple count of CpG motifs in the sequence. In supplementary figure SI.2, Supplementary Material online, we show that some of our results, such as the large fluctuations of the CpG content across the SARS-CoV-2 genome, are also apparent from a simple motif density analysis. However, this counting strategy is not suitable to make comparisons among viruses of different families, mainly because of the different lengths and usage biases of viral genomes. Moreover, without the statistical framework at the basis of the CpG force, it is very difficult to take into account the many constraints acting on a genetic sequence, notably the constraint on the amino acids that have to be coded for in the sequence.
The force formalism is, therefore, much more flexible and allows us to introduce in a theoretically grounded way the synonymous mutation score, which we used to characterize mutations that are likely to happen. The drawback of such a formalism is the quite large number of extra choices that have to be done, and which can influence the result. These choices are discussed in the following.

Force Computation
The technique at the core of many of the analyses made here is taken from Greenbaum et al. (2014). Here we briefly review this technique, starting from its noncoding version that takes as reference bias the nucleotide bias and then describing the coding version that takes as reference bias the codon bias at fixed amino acid sequence.

Force Computation in Noncoding Case
Given a motif m and a sequence s 0 ¼ fs 1 ; . . . ; s N g of length N, we consider the ensemble of all sequences with length N, which we denote with S, and we suppose the probability of observing s out of this ensemble to be Here, qðs i Þ is the nucleotide bias, that is, the probability of the ith nucleotide being s i (e.g., we always used in this work the human frequency of nucleotides as qðs i Þ), f nc is the force we want to compute (the subscript nc stands for noncoding), and N m is the number of times the motif m appears in the sequence. Z is the normalization constant, that is, qðs i Þ e f nc N m ðsÞ : Therefore, the force f nc is a parameter that quantifies the lack (if negative) or abundance (if positive) of occurrences of m with respect to the number of occurrences due to the local probabilities qðs i Þ. We can fix f nc by requiring that the number of motifs in the observed sequence, N m ðs 0 Þ ¼ n 0 , is equal to the average number of motifs computed with the probability equation (1), hni, that is, Notice that this is equivalent to the request that f nc is so that probability of having observed s 0 is maximum.
Let us focus now on the specific case of a dinucleotide motif, that is, our motif m consists of the pair ab, where a and b are two consecutive nucleotides (e.g., a ¼ C and b ¼ G for the CpG motif). In this case, within an approximation discussed in the supplementary section SI.3, Supplementary Material online, the force computed as above turns out to be the logarithm of the relative abundance index, that is, where q(ab) is the number of motifs ab divided by the total length of the sequence N. In supplementary figure SI.13, Supplementary Material online, we tested the accuracy of this approximation in our specific case. As it is clear from equation (4), the choice of the nucleotide bias q(s) affects the absolute value of the forces but not the difference between forces computed on different viral genomes using the same reference bias. We have chosen as reference nucleotide bias the human nucleotide bias (computed on all the genome or on the coding DNA only). This choice can be then replaced by any other reference bias (possible choices include the cb computed on the ancestral SARS-CoV-2 sequence or other human Coronavirus viral sequences or the one computed on RNA transcripts of human type-I IFNs, at the core of innate immune response) and will shift the values of the forces without affecting the ranking of the force on different viral sequences, see figure 2 and table 1.

Force Computation in the Coding Case
Our technique can be generalized to coding sequences at fixed amino acid sequence and codon bias. In this case, we write each sequence s as a series of codons, and its probability is defined as where now the bias takes the form of a codon usage bias, and the normalization constant Z changes accordingly into a sum over all possible synonymous sequences. The subscript c stands for coding and differentiates this force from the noncoding one. The force f c can be computed, analogously with the procedure for the simpler case, by requiring that the number of motifs observed in s 0 is equal to the statistical average performed with equation (5), as described in detail in Greenbaum et al. (2014). As shown in figure 3a, the CpG force at fixed amino acids are roughly comparable with the Early Evolution of CpG Dinucleotides in SARS-CoV-2 . doi:10.1093/molbev/msab036 MBE one at fixed nucleotide bias when computing the nucleotide bias on human coding sequences. The force computed in the coding (or noncoding case) is an useful tool to determine the content of a given dinucleotide, while taking into account a number of constraints.

Definition of SMS
We use the ideas discussed above to introduce a model in order to assign a score, which we call SMS, to each possible single-codon synonymous mutation of an ancestral sequence. We consider a system evolving for a small timescale, and a mutation that changes the ith codon c i into a synonymous c 0 i . The transition probability, that is, the probability of observing the mutation, for such evolution can be decomposed in the product of two evolution operators: the first TðN CG ! N 0 CG Þ representing the change in the number of CpG motifs in the mutated sequence and the second Tðc i ! c 0 i Þ representing the gain in mutating the particular codon in position i.
The first operator can be computed from the dynamical equation introduced in Greenbaum et al. (2014) for the evolution of the CpG number N CG of a sequence under an initial force (Here we drop the subscripts nc and c used in the previous section to identify noncoding and coding forces, since the SMS is defined for a generic force) f through an equilibrium force f eq : The equilibrium force can be computed on a viral strain which is supposed to be at equilibrium with the human innate immune system, because it has evolved in the human host since a long time. Equation (6) was used in Greenbaum et al. (2014) to describe the evolution of the CpG number in H1N1, taking as the equilibrium force the one of the Influenza B strain. In analogy with this approach, we take here as f eq the average force calculated on coding regions of the seasonal hCoVs (i.e., hCoV-229E, hCoV-NL63, hCoV-HKU1, hCoV-OC43). Other possible choices are discussed below (see also supplementary fig. SI.7, Supplementary Material online). s is a parameter determining the characteristic timescale for synonymous mutations. Based on equation (6), we define the transition operator for CpG number as where DN CG ¼ N 0 CG À N CG . Notice that for all the synonymous mutations leaving unchanged the CpG number, the above operator is one. The codon mutational operator is where qðc i Þ is the frequency of codon c i from the chosen codon usage bias. Putting together, these two terms allow us to estimate how likely a specific synonymous mutation is to happen. The SMS accompanying a mutation is defined as the logarithm of this quantity, To conclude, we remark that different models can be used in the SMS computation, where a model is specified by giving the choice of including or not the force term, the choice of the equilibrium force to be used, the choice of including or not the cb term, and choice of the reference cb to be used.

Adding transition-transversion bias to SMS
It is well known that transversions (i.e., mutations of purines in pyrimidines and vice versa) are suppressed with respect to transitions (i.e., mutations of purines in purines or pyrimidines in pyrimidines).
We introduce here a simple way to account for transitiontransversion bias (ttb) in the model used to assign the SMS. We suppose that a mutation with n transversions happens four times less than a mutation with n À 1 transversions. This probability ratio, which is a standard value in the literature (Jiang and Zhao 2006), has been recently shown to be close to the observed value for SARS-CoV-2 (Roy et al. 2020). To include that in our model, consider mutating a codon c to c 0 , one of its synonymous codons. Let SMSðc; c 0 Þ be the SMS for this event, computed with a given model. We then count the number of transitions, n trn , and the number of transversions, n trv , and modify the SMS into SMS 0 , so that SMS 0 ðc; c 0 Þ ¼ SMSðc; c 0 Þ þ n trn logð2Þ À n trv logð2Þ: (10) This choice is motivated by mainly two considerations: 1) in this way, a dynamical model where mutation probabilities are proportional to the exponential of SMS (as the one used to justify the SMS itself) correctly gives a 4-fold probability to a transition than a tranversion (if the two mutations have the same SMS without this new term) and 2) the splitting on the extra term in a positive weight for transitions and a negative weight for transversions ensures that the average SMS before and after adding this term is comparable.

ZAP Affinity Score
We introduce the ZAS to roughly quantify a priori the likelihood of ZAP biding to a given region of RNA. ZAS is based on the dissociation constants obtained in vitro in Luo et al. (2020). Let us consider the case of a single motif (be it CG, or CnxGxCG, with n ¼ 4, 5, 6, 7, or 8), M, with dissociation constant K d . The association constant is then defined as where ½ZAP þ M is the concentration of complexes, [ZAP] and ½M are the concentration of free molecules. Let us denote by ½ZAP 0 and ½M 0 the total concentration of molecules (bound and unbound). If we suppose that only a small part of the available molecules form a complex, that is, more specifically that K a ½ZAP 0 ( 1 and K a ½M 0 ( 1, then K a ½ZAP 0 ' K a ½ZAP is the probability of binding. If we have n sites with association constants K where we also used that n is sufficiently small so that ½ZAP P n i¼1 K ðiÞ a ( 1. Finally, ZAS is defined as that is, p=½ZAP. Although ZAS itself does not depend on ½ZAP, its interpretation (and in particular its connection with the probability of binding) does, as it requires K ðiÞ a ½ZAP 0 ( 1; K ðiÞ a ½M 0 ( 1, and ½ZAP 0 P n i¼1 K ðiÞ a ( 1. The K ðiÞ a used here range from about 10 5 (for the simple CpG motif) to 10 7 (for C7xGxCG) mol/L. It is more difficult to estimate ½M 0 and ½ZAP 0 during the infection. However, we hypothesize that these requirements are fulfilled in cells, and that our interpretation in terms of binding probability is acceptable.

Robustness of Analysis with Respect to Choice of Parameters
We discuss here how force values and SMS scores change by changing model parameters.
Parameters affecting the force values: i. Nucleotide, codon bias choices: the most relevant effect due to this choice is a global shift of the force, as we show in figure 3a for the noncoding case, which does not change the ranking of forces when comparing different sequences using the same reference bias. ii. Choice of the length of the segment to compute the force: the force is an intensive parameter. However, here we use the force to quantify the content of CpG motifs, which are quite rare. For this reason, computing forces on small segments can lead to large negative values (the force is À1 when no CpG motif is present) and to unnatural fluctuations. For this reason, to compute local force, we fix large sliding windows of 3 kb, and we use Gaussian sliding averages to smooth the resulting curves. The effect of Gaussian smoothing and changing sliding window on the force are presented in supplementary figure SI.9, Supplementary Material online.
Parameters affecting the SMS: i. The codon bias (or nucleotide bias for mutations in 5 0 and 3 0 UTRs): it is both present as a reference bias in the force computations for the CpG drive term and directly used as a more generic evolutionary driver for the synonymous mutations. For the computation of forces in the CpG drive, we have used the human codon or nucleotide bias as reference usage, but such choice is actually irrelevant because the drive is a difference of the segment force and the equilibrium one. The choice of bias is, on the opposite, very relevant for the choice of the synonymous mutations driver. Indeed, in figure 5e and 5f, it is apparent that the virus codon bias alone gives to the model a certain ability to discriminate between observed an unobserved syn-SNV. We tested also the human codon bias which gives bad performances (see supplementary fig. SI.6, Supplementary Material online). ii. Choice of equilibrium force: this choice is arbitrary to a certain degree. We use as equilibrium force (computed with the human codon bias) À1.79 which is the average coding force hCoVs (229E, NL63, HKU1, OC43), since these viruses are well adapted to the human environment so likely a good equilibrium point for SARS-CoV-2.
To check the effects of other choices of equilibrium forces, in supplementary figure SI.7, Supplementary Material online, we performed the same analysis shown in figure 5e and 5f with other two possible choices of equilibrium forces: the global force of SARS-CoV-2 (À1.71, which is quite low, but slightly higher than the average of the seasonal hCoVs), and the average global force of INF-I transcripts (À2.89 which is much lower than that of seasonal hCoVs), see also tables 1 and 2. Although the SMS assigned to the mutations are in general different, especially when taking into account the counts in syn-SNV, and so the average SMS in figures, the ranking of the various models in terms of average SMS difference (syn-SNV vs. unobserved syn-SNV) is quite robust. iii. Presence of ttb: we observe that the presence of this term always increases the difference between the average SMS in observed and unobserved synonymous SNV. Two choices are needed to fix this term: the value of the probability ratio of a transition with respect to a transversion (here we considered this ratio to be 4), and the specific way of realizing this bias by adding a bonus/ penalty term to transitions and trasnversions. The latter choice is almost irrelevant when considering the differences of average SMS between observed and unobserved SNV.
Data Analysis and Data Availability SARS-CoV-2 sequences are taken from GISAID (Elbe and Buckland-Merrett 2017). We downloaded each sequence present in the database on October 05, 2020 (the most recent sequence was collected on September 29, 2020). Before any of our analyses, we discarded all the sequences where one or more nucleotides were wrongly read (other characters than A, C, G, T, U). This left us with 56,045 SARS-CoV-2 sequences. To obtain figure 2, we considered, in addition to the SARS-CoV-2 sequences are taken from GISAID, other Alphacoronavirus and Betacoronavirus sequences (whole genomes and genes) which have been obtained from VIPR (Pickett et al. 2012). The preprocessing consisted again of discarding all the sequences with one error or more. After this process, we collected 341 SARS, 48 MERS, 20 hCoV-229E, 48 hCoV-NL63, 14 hCoV-HKU1, 124 hCoV-NL63, 166 bat-CoVs, and 5 pangolin-CoVs whole genomes. For figure 2b, we used the largest number possible of sequences, up to a maximum of 100. For figure 2a (viral sequences) and 2c, we chose a single sequence for each species. However, we checked that the result is qualitatively the same if we use Early Evolution of CpG Dinucleotides in SARS-CoV-2 . doi:10.1093/molbev/msab036 MBE other sequences from the same species for human coronaviruses. The curves in figure 2c are smoothed through a Gaussian sliding average (on windows of 3 kb, the Gaussian being centered in the window, normalized, with a standard deviation of 300 b). The ancestral SARS-CoV-2 sequence used throughout the work has been collected on 26-12-2019 (ID: EPI_ISL_406798).
In figure 3a, the SARS-CoV-2 sequence has been processed to ensure the correct reading frame. Therefore, the ORF1ab gene is read in the standard frame up to the ribosomal shifting site, and it is read in the shifted frame from that site up to the end of the polyprotein. Moreover, the small noncoding parts between successive proteins have been cut, resulting in a loss of 634 nucleotides (including the 5 0 and 3 0 UTR). A Gaussian smoothing has been performed to obtain the plotted CpG forces (as in fig. 2c). To produce the bar plots in figure 3b and c, we collected genes data on VIPR. Then we discarded all the sequences with one or more errors, and we computed for each gene an average of up to 20 different sequences (coming from the same species). For some structural proteins, we did not find 20 different genes but in any case, the standard deviation of the averages of figure 3b and 3c is smaller than 0.02. In particular, we used 20 sequences of SARS-CoV-2, MERS, hCoV-NL63, hCoV-OC43 proteins, 14 sequences for hCoV-229E, 13 for hCoV-HKU1, and 4 for SARS. More detailed information about the genomes used in this work are given in supplementary section SI.1, Supplementary Material online.
The mutations used for figures 4 and 5 have been collected by extracting ORFs from the SARS-CoV-2 sequence data set and comparing them with the Wuhan ancestral strain. ORFs with mutations too close to the start or end codon are not considered, together with ORFs with insertion/deletion, this filtering procedure leaving us with 48,511 sequences to obtain mutation data. Mutations with <5 counts in different sequences are discarded. All curves in figure 4a are smoothed with the same Gaussian average used in figures 2 and 3. Finally, to get the mutation data in 5 0 and 3 0 UTR, we considered the UTRs of the Wuhan ancestral strain, and we compared them with those of other sequences. The number of nucleotides of the Wuhan ancestral considered part of 5 0 and 3 0 UTR for the search for mutations in other sequences is given in table 3. This length is chosen so that a large number of uploaded sequences (about 50,000) have a UTR of the same length or longer. In the UTR analysis, all observed mutations are considered "synonymous." The code used to compute coding and noncoding forces is publicly available at https://github.com/adigioacchino/dinu-cleotide_forces (last accessed February 16, 2021).