Synonymous mutations and the molecular evolution of SARS-CoV-2 origins

Abstract Human severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) is most closely related, by average genetic distance, to two coronaviruses isolated from bats, RaTG13 and RmYN02. However, there is a segment of high amino acid similarity between human SARS-CoV-2 and a pangolin-isolated strain, GD410721, in the receptor-binding domain (RBD) of the spike protein, a pattern that can be caused by either recombination or by convergent amino acid evolution driven by natural selection. We perform a detailed analysis of the synonymous divergence, which is less likely to be affected by selection than amino acid divergence, between human SARS-CoV-2 and related strains. We show that the synonymous divergence between the bat-derived viruses and SARS-CoV-2 is larger than between GD410721 and SARS-CoV-2 in the RBD, providing strong additional support for the recombination hypothesis. However, the synonymous divergence between pangolin strain and SARS-CoV-2 is also relatively high, which is not consistent with a recent recombination between them, instead, it suggests a recombination into RaTG13. We also find a 14-fold increase in the dN/dS ratio from the lineage leading to SARS-CoV-2 to the strains of the current pandemic, suggesting that the vast majority of nonsynonymous mutations currently segregating within the human strains have a negative impact on viral fitness. Finally, we estimate that the time to the most recent common ancestor of SARS-CoV-2 and RaTG13 or RmYN02 based on synonymous divergence is 51.71 years (95% CI, 28.11–75.31) and 37.02 years (95% CI, 18.19–55.85), respectively.


Introduction
The COVID-19 pandemic is perhaps the biggest public health and economic threat that the world has faced for decades (Li et al. 2020;Wu et al. 2020;Zhou et al. 2020b). It is caused by a coronavirus (Lu et al. 2020;Zhang and Holmes 2020), severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2), an RNA virus with a 29,903-bp genome consisting of four major structural genes Zhou et al. 2020). Of particular relevance to this study is the spike protein which is responsible for binding to the primary receptor for the virus, angiotensin-converting enzyme 2 (ACE2) (Wan et al. 2020;Wu et al. 2020;Zhou et al. 2020b).
Human SARS-CoV-2 is related to a coronavirus (RaTG13) isolated from the bat Rhinolophus affinis from Yunnan province of China (Zhou et al. 2020b). RaTG13 and the human strain reference sequence (GenBank accession number MN996532) are 96.2 per cent identical and it was first argued that, throughout the genome, RaTG13 is the closest relative to human SARS-CoV-2 (Zhou et al. 2020b). And RaTG13 and SARS-CoV-2 were 91.02 per cent and 90.55 per cent identical, respectively, to coronaviruses isolated from Malayan pangolins (Pangolin-CoV) seized at the Guangdong customs of China, which therefore form a close outgroup to the SARS-CoV-2þRaTG13 clade (Zhang, Wu, and Zhang 2020). Furthermore, five key amino acids in the receptor-binding domain (RBD) of spike were identical between SARS-CoV-2 and Pangolin-CoV, but differed between those two strains and RaTG13 (Zhang, Wu, and Zhang 2020). Xiao et al. assembled and analyzed a full-length Pangolin-CoV genome sequence, showing that the RBD of its S protein differs from the SARS-CoV-2 by only one noncritical amino acid (Xiao et al. 2020). Similar observations were made using Pangolin-CoV strains found in Malayan pangolin samples seized by the Guangxi customs of China (Lam et al. 2020). Additionally, it is shown that when analyzing a window of length 582 bp in the RBD, nonsynonymous mutations support a phylogenetic tree with SARS-CoV-2 and Pangolin-CoV as sister groups, while synonymous mutations do not (Lam et al. 2020). They discuss two possible explanations for their results, one which includes recombination and another which includes selection-driven convergent evolution. Independent analysis also support SARS-CoV-2 obtains the receptor-binding motif through recombination from a donor related to this Pangolin-CoV strain (Li et al. 2020). Detailed phylogenetic analysis on subregions across the S protein showed that it is the RaTG13 sequence that show exceptionally divergent pattern in the RBD region, they instead argued a recombination occurred into RaTG13 from an unknown divergent source (Boni et al. 2020). This would explain the amino acid similarity between SARS-CoV-2 and Pangolin-CoV in the RBD as an ancestral trait that has been lost (by recombination) in RaTG13. Using a phylogenetic analysis, they also dated the RaTG13 and SARS-CoV-2 divergence to be between 40 and 70 years. Recently, Zhou et al. discovered a viral strain, RmYN02 from the bat Rhinolophus malayanus, with a reported 97.2 per cent identity in the ORF1ab gene but with only 61.3 per cent sequence similarity to SARS-CoV-2 in the RBD (Zhou et al. 2020a). Moreover, the RmYN02 strain also harbors multiple amino acid insertions at the S1/S2 cleavage site in the spike protein (Zhou et al. 2020a).
To analyze the history of these sequences further, we here focus on patterns of synonymous divergence, which has received less focus, but also is less likely to be affected by selection than amino acid divergence. We develop a bias corrected estimator of synonymous divergence specific for SARS-CoV-2 and related strains, and analyze divergence using both sliding windows and a whole-genome approach between SARS-CoV-2 and related viral strains.  (Valitutto et al. 2020) were also added to the database. Blast databases were created using the default parameters for makeblastdb. Blast searches were performed using blastn (Altschul et al. 1990) with parameters '-word_size 7 -reward 1 -penalty -3' and all other parameters as the default settings. All the blast hits to different Guangdong pangolin viral strain sequences were merged as one hit, and the blast hits to different Guangxi pangolin viral strain sequences were also merged.

Alignment
To obtain an in-frame alignment of the genomes, we first identified the coding sequences of each viral strain using independent pairwise alignments with the coding sequences of the SARS-CoV-2 (Wuhan-Hu-1) genome. The genome alignments were performed using MAFFT (v7.450) (Katoh and Standley 2013) with parameters '-maxiterate 1000 -localpair'. The coding sequences of each gene were aligned using PRANK (Loytynoja 2014) (v.170427) with parameters '-codon -F'. Finally, the alignments for all genes were concatenated following their genomic order. ORF1a was excluded since its sequence is a subset of ORF1ab.

Recombination detection
We detected possible recombination events across the genome using a combination of seven algorithms: RDP (Martin and Rybicki 2000), GENECONV (Padidam, Sawyer, and Fauquet 1999), Bootscan (Salminen et al. 1995), Maxchi (Smith 1992), Chimaera (Posada and Crandall 2001), SiSscan (Gibbs, Armstrong, and Gibbs 2000), and 3SEQ (Boni, Posada, and Feldman 2007) implemented in RDP5 program (Martin et al. 2015) (version Beta 5.5) and then considered the recombination signals that were supported by at least two methods. We note that these seven methods are all based on inferring recombination using the same type of evidence, and concordance between the methods cannot be interpreted as validation of the recombination signal. However, we will also use phylogenetic methods and methods based on relative sequence divergence to further investigate the putative recombination signals. The analysis was performed on the multiple sequence alignment consisting of the five viral strains.
All regions showing recombination signals (Supplementary Table S5) were removed in subsequent analyses from all strains when stating that recombination regions were removed.

Tree estimation
We estimated phylogenetic trees using two methods: neighbor joining (NJ) and maximum likelihood (ML). The NJ trees were estimated using d N or d S distance matrices which estimated using codeml (Yang 2007) with parameters 'runmode ¼ -2, CodonFreq ¼ 2, cleandata ¼ 1'. To obtain bootstrap values, we bootstrapped the multiple sequence alignments 1,000 times, repeating the inference procedure for each bootstrap sample. The NJ tree was estimated using the 'neighbor' software from the PHYLIP package (Felsenstein 2009). For ML trees, we used IQ-TREE (Nguyen et al. 2015) (v1.5.2) with parameter '-m TEST -alrt 1000' which did substitution model selection for the alignments and performed ML tree estimation with the selected substitution model for 1,000 bootstrap replicates. For this analysis, we masked all regions (Supplementary Table S5) that show recombination signals in any of the five studied viral genomes. We masked regions from all sequences when at least one sequence showed evidence for recombination in that region. All masked regions are listed in Supplementary Table S5. The coordinates (based on the Wuhan-Hu-1 genome) of the three recombination regions (merged set of all the regions in Supplementary Table S5) were: 14611-15225, 21225-24252 and 25965-28297. We also estimate genome-wide divergence between RaTG13 and Wuhan-Hu-1 only excluding the region (position 22853-23092) where potential recombination was detected for the Wuhan-Hu-1 strain (Supplementary Table S5).

Simulations
We simulated divergence with realistic parameters for SARS-CoV-2 using a continuous-time Markov chain under the F3x4 codon-based model (Goldman and Yang 1994;Muse and Gaut 1994;Yang et al. 2000), which predicts codon frequencies from the empirical nucleotide frequencies in all three codon positions and using the global genomic ML estimates of the transition/transversion bias j (¼2.9024) and the d N /d S ratio x (¼0.0392) estimated from the human SARS-CoV-2 comparison to the nearest outgroup sequence, RaTG13 (see Section 3). For the simulations of short 300-bp sequences, we kept x constant but varied time such that the number of synynoymous substitutions per synonymous sites, d S , varied between 0.25 and 3.00. Estimates of d S > 3 are truncated to 3. For simulations of genome-wide divergence between RaTG13 and human strains, we fix d S at 0.1609 (the ML estimate outside the RBD region reported in Section 3). In all cases, we use 10,000 independent replicate simulations for each parameter setting.
2.6 Estimation of sequence divergence in 300-bp windows d N and d S were estimated using two different methods implemented in the PAML package (Yang 2007) (version 4.9d): a count-based method, YN00 (Yang and Nielsen 2000) as implemented in the program 'yn00' with parameters 'icode ¼ 0, weighting ¼ 0, commonf3x4 ¼ 0', and a ML method (Goldman and Yang 1994;Muse and Gaut 1994) implemented in codeml applied with arguments 'runmode¼ -2, CodonFreq ¼ 2'. For real data, the calculations are based on multiple sequence alignment, and sites at which any sequence has missing data are removed. The estimates in 300-bp windows were further bias corrected as described below.

Bias correction for d S estimates in 300-bp window
To correct for the biases observed in the estimation of d S (see Section 3), we identified a quartic function which maps fromd S , the estimates of d S , intod S Ã , the bias-corrected estimate such that to a close approximation, E½d S *] ¼ d S . To identify the coefficients of this function, we used 10,000 simulations as described previously, on a grid of d S values (0.25, 0.5, 0.75, . . ., 3.0). We then identified coefficients such that sum of (E½d S *] À d S ) 2 is minimized over all simulation values.

Database searches
The genome of human coronavirus can effectively recombine with other viruses to form a chimeric new strain when they coinfect the same host (Forni et al. 2017;Boni et al. 2020). Complicated recombination histories have been observed in the receptor-binding motif region of the spike protein (Lam et al. 2020;Xiao et al. 2020;Zhang, Wu, and Zhang 2020) and several other regions (Boni et al. 2020) of the SARS-CoV-2, it is thus important to exhaustively search along the viral genome for other regions potentially of recombination origin and identify possible donors associated with them. To identify possible viral strains that may have contributed, by recombination, to the formation of human SARS-CoV-2, we searched NCBI and EMBL virus entries along with GISAID Epiflu and EpiCov databases for similar sequences using BLAST in 100 bp windows stepping every 10 bp (Fig. 1b). The majority of the genome (78.1%, 2330/2982 of the windows) has one unique best hit, likely reflecting the high genetic diversity of the coronavirus. 21.9 per cent of the genomic regions has multiple best hits, which suggests that these regions might be more conserved. Among the windows with unique best hits, 97.0 per cent (2260/2330) of them were the RaTG13 or RmYN02 bat strains and 1.9 per cent of them, including the ACE2 contact residues region of the S protein, were the pangolin SARS-CoV-2 virus. These observations are consistent with previous results that RaTG13 and RmYN02 are the most closely related viral strains, while the region containing the ACE2 contact residues is more closely related to the pangolin virus strain (Lam et al. 2020;Li et al. 2020;Xiao et al. 2020;Zhang, Wu, and Zhang 2020). A considerable amount of genomic regions (20 windows with unique hits) show highest sequence identity with other coronaviruses of the SARS-CoV-2 related lineage (Lam et al. 2020) (bat-SL-CoVZC45 and bat-SL-CoVZXC21 (Hu et al. 2018)). In addition, there were 6 windows whose unique top hits are coronavirus of a SARS-CoV related lineage (Lam et al. 2020) (Supplementary Table S4). The mosaic pattern that different regions of the genome show highest identity to different virus strains is likely to have been caused by the rich recombination history of the SARS-CoV-2 lineage (Boni et al. 2020;Li et al. 2020;Patiño-Galindo et al. 2020). Moreover, its unique connection with SARS-CoV-related lineages in some genomic regions may suggest recombination between the ancestral lineage of SARS-CoV-2 and distantly related virus lineages, although more formal analyses are needed to determine the recombination history (see also Boni et al. 2020 for further discussion). Searching databases with BLAST using the most closely related viral strains, RaTG13 and RmYN02, we observe a very similar pattern, as that observed for SARS-CoV-2, in terms of top hits across the genome (Fig. 1b), suggesting that these possible recombination events with distantly related lineages are not unique to the SARS-CoV-2 lineage, but happened on the ancestral lineage of SARS-CoV-2, RaTG13, and RmYN02. A notable exception is a large region around the S gene, where RmYN02 show little similarity to both SARS-CoV-2 and RaTG13.

Sequence similarity and recombination
We focus further on studying the synonymous evolution of SARS-CoV-2, and analyzing Wuhan-Hu-1 as the human nCoV19 reference strain ) along with the four viral strains with highest overall identity: the bat strains RmYN02 and RaTG13 (Zhou et al. 2020a;Zhou et al. 2020b), and the Malayan pangolin strains, GD410721 and GX_P1E, which were isolated from Malayan pangolin samples seized by Guangdong and Guangxi Customs of China, respectively. These four strains have previously been identified as the strains most closely related to SARS-CoV-2 (Lam et al. 2020;Xiao et al. 2020). Other available phylogenetically related, but less similar viral strains, such as bat-SL-CoVZXC21 and bat-SL-CoVZC45 (Hu et al. 2018), are not included due to nearly saturated synonymous mutations when compared with SARS-CoV-2 (ML estimates of d S ¼ 3.2067 and 2.8445, respectively).
We performed recombination analyses across the five viral genomes based on the consensus of the seven recombinationdetection methods implemented in RDP5 (see Section 2). We identified nine recombination regions affecting at least one of the sequences (Supplementary Table S5). Phylogenetic analyses of these regions confirm phylogenetic incongruence when compared with genome-wide trees ( Fig. 2 and Supplementary Figs S1-S3). Particularly, a recombination signal is found in a region encompassing the RBD of the S protein, suggesting that the human SARS-CoV-2 (Wuhan-Hu-1) sequence is a recombinant with the Pangolin-CoV (GD410721) as the donor (Supplementary  Table S5). Phylogenetic analyses also support that Wuhan-Hu-1 and GD410721 form a clade relative to RaTG13 (Supplementary  Table S5) masked using ML (Fig. 2a) and neighbor-joining based on synonymous ( Fig. 2b) or nonsynoymous (Fig. 2c) mutation distance metrics, consistently support RmYN02 as the nearest outgroup to human SARS-CoV-2, in contrast to previous analyses before the discovery of RmYN02, which instead found RaTG13 to be the nearest outgroup (Lam et al. 2020;Wu et al. 2020). This observation is also consistent with the genome-wide phylogeny constructed in previous study (Zhou et al. 2020a).
We plot the overall sequence similarity (% nucleotides identical) between SARS-CoV-2 and the four other strains analyzed in windows of 300 bp (Fig. 1). Notice that the divergences between human SARS-CoV-2 and the bat viral sequences, RaTG13 and RmYN02, in most regions of the genome, are quite low compared to the other comparisons. A notable exception is the suspected recombination region in RmYN02 that has an unusual high level of divergence with all other viruses (Fig. 1e). However, there is also another exception: a narrow window in the RBD of the S gene where the divergence between SARS-CoV-2 and GD410721 is moderate and the divergences between GD410721 and both SARS-CoV-2 and RaTG13 are quite high and show very similar pattern. This, as also found in the recombination analyses based on methods implemented in RDP5, would suggest a recombination event from a strain related to GD410721 into an ancestor of the human strain (Lam et al. 2020;Xiao et al. 2020;Zhang, Wu, and Zhang 2020), or alternatively, from some other species into RaTG13, as previously hypothesized (Boni et al. 2020). We note that RmYN02 is not informative about the nature of this event as it harbors a long and divergent haplotype in this region, possibly associated with another independent recombination event with more distantly related viral strains (Fig. 1e). The other four sequences are all highly, and approximately equally, divergent from RmYN02 in this large region (Fig. 1e), suggesting that the RmYN02 strain obtained a divergent haplotype from the recombination event. When BLAST searching using 100-bp windows along the RmYN02 genome, we find no single viral genome as the top hit, instead the top hits are found sporadically in different viral strains of the SARS-CoV lineage (Fig. 1f), suggesting that the sequence of the most proximal donor is not represented in the database.

Estimating synonymous divergence and bias correction
While the overall divergence in the S gene encoding the spike protein could suggest the presence of recombination in the region, previous study (Lam et al. 2020) reported that the tree based on synonymous substitutions supported RaTG13 as the sister taxon to the human SARS-CoV-2 also in this region. That would suggest the similarity between GD410721 and human SARS-CoV-2 might be a consequence of convergent evolution, possibly because both strains adapted to the use of the same receptor. An objective of the current study is to examine if there are more narrow regions of the spike protein that might show evidence of recombination. We investigate this issue using estimates of synonymous divergence per synonymous site (d S ) in sliding windows of 300 bp. However, estimation of d S is complicated by the high levels of divergence and extremely skewed nucleotide content in the third position of the sequences (Table 1) which will cause a high degree of homoplasy. We, therefore, entertain methods for estimation that explicitly account for unequal nucleotide content and multiple hits in the  Table 3.
same site such as ML methods and the YN00 method (Yang and Nielsen 2000). It is shown that for short sequences, some counting methods, such as the YN00 method, can perform better in terms of mean-squared error (MSE) for estimating d N and d S (Yang and Nielsen 2000). However, it is unclear in the current case how best to estimate d S . For this reason, we performed a small simulations study (see Section 2) for evaluating the performance of the ML estimator of d N and d S (as implemented in codeml (Yang 2007)) under the F3x4 model and the YN00 method implemented in PAML. In general, we find that estimates under the YN00 are more biased with slightly higher MSE than the ML estimate for values in the most relevant regime of d S < 1.5 (Fig. 3). However, we also notice that both estimators are biased under these conditions. For this reason, we perform a bias correction calibrated using simulations specific to the nucleotide frequencies and d N /d S ratio observed for SARS-CoV-2 (see Section 2). The bias corrections we obtain ared , for the ML estimator andd S * 1 d S þ1.492d S 2 À 3.166d S 3 þ 1.241d S 4 for yn00. Notice that there is a trade-off between mean and variance (Fig. 3) so that the MSE becomes very large, particularly for the for yn00 method, after bias correction. For d S >2, the estimates are generally not reliable, however, we note that for d S < 1.5 the bias-corrected ML estimator tends overall to have slightly lower MSE, and we, therefore, use this estimator for analyses of 300 bp regions.

Synonymous divergence
We estimate d N and d S under the F3x4 model in codeml (Goldman and Yang 1994;Muse and Gaut 1994) and find genome-wide estimates of d S ¼ 0.1604, d N ¼ 0.0065 (d N /d S ¼ 0.0405) between SARS-CoV-2 and RaTG13 and d S ¼ 0.2043, d N ¼ 0.0220 (d N /d S ¼ 0.1077) between SARS-CoV-2 and RmYN02. However, a substantial amount of this divergence might be caused by recombination with more divergent strains. We, therefore, also estimate d N and d S for the regions with inferred recombination tracts (Supplementary Table S5) removed from all sequences (Table 3). We then find values of d S ¼ 0.1462 (95% CI, 0.1340-0.1584) and d S ¼ 0.1117 (95% CI, 0.1019-0.1215) between SARS-CoV-2 and RaTG13 and RmYN02, respectively. This confirms that RmYN02 is the virus most closely related to SARS-CoV-2. The relative high synonymous divergence also shows that the apparent high nucleotide similarity between SARS-CoV-2 and the bat strains (96.2% (Zhou et al. 2020b) and 97.2% (Zhou et al. 2020a)) is caused by conservation at the amino acid level (d N /d S ¼ 0.0410 and 0.0555) exacerbated by a high degree of synonymous homoplasy facilitated by a highly skewed nucleotide composition at the third position of codons (with an AT content >72%, Table 1).
The synonymous divergence to the pangolin sequences GD410721 and GX_P1E in genomic regions with inferred recombination tracts removed is 0.5095 (95% CI, 0.4794-0.5396) and 1.0304 (95% CI, 0.9669-1.0939), respectively. Values for other comparisons are shown in Tables 2 and 3. In comparisons between SARS-CoV-2 and more distantly related strains, d S will be larger than 1, and with this level of saturation, estimation of divergence is associated with high variance and may be highly dependent on the accuracy of the model assumptions. This makes phylogenetic analyses based on synonymous mutations unreliable when applied to these more divergent sequences. Nonetheless, the synonymous divergence levels seem generally quite compatible with a molecular clock with a d S of 0.9974 (95% CI, 0.9381-1.0567, GD410721), 1.0366 (95% CI, 0.9737-1.0995, RaTG13), 1.0333 (95% CI, 0.9699-1.0967, RmYN02), and 1.0304 (95% CI, 0.9669-1.0939, Wuhan-Hu-1) between the outgroup, GX_P1E, and the three ingroup strains. The largest value is observed for RaTG13 (d S ¼ 1.0366), despite this sequence being the earliest sampled sequence, perhaps caused by additional undetected recombination into RaTG13.

Sliding windows of synonymous divergence
To address the issue of possible recombination, we plot d S between SARS-CoV-2, GD410721, and RaTG13 and the ratio of d S (SARS-CoV-2, GD410721) to d S (SARS-CoV-2, RaTG13) in 300 bp sliding windows along the genome. Notice that we truncate the estimate of d S at 3.0. Differences between estimates larger than 2.0 should not be interpreted strongly, as these estimates have high variance and likely will be quite sensitive to the specifics of the model assumptions.
We find that d S (SARS-CoV-2, GD410721) approximately equals d S (GD410721, RaTG13) and is larger than d S (SARS-CoV-2, RaTG13) in almost the entire genome showing than in these parts of the genome GD410721 is a proper outgroup to (SARS-CoV-2, RaTG13) assuming a constant molecular clock. One noticeable exception from this is the RBD region of the S gene. In this region, the divergence between SARS-CoV-2 and GD410721 is substantially lower than between GD410721 and RaTG13 ( Fig. 4a and c). The same region also has much smaller divergence between SARS-CoV-2 and GD410721 than between SARS-CoV-2 and RaTG13 ( Fig. 4a and c). The pattern is quite different than that observed in the rest of the genome, most easily seen by considering the ratio of d S (SARS-CoV-2, GD410721) to d S (SARS-CoV-2, RaTG13) ( Fig. 2b and d). In fact, the estimates of d S (SARS-CoV-2, RaTG13) are saturated in this region, even though they are substantially lower than 1 in the rest of the genome. This strongly suggests a recombination event in the region and provides independent evidence of that previously reported based on amino acid divergence (e.g. Zhang, Wu, and Zhang 2020).
The combined evidences from synonymous divergence and the topological recombination inference provide strong support for the recombination hypothesis. However, these analyses alone do not distinguish between recombination into RaTG13 from an unknown source as previously hypothesized (Boni et al. 2020) and recombination between SARS-CoV-2 and GD410721 as proposed as one possible explanation by Lam et al. (2020). To distinguish between these hypotheses, we searched for sequences that might be more closely related, in the RBD region, to RaTG13 than SARS-CoV-2 and we plotted sliding window similarities across the genome for RaTG13 (Fig. 1c). We observe relatively low sequence identity between RaTG13 and all three The nucleotide compositions at the first and second positions can be found in Supplementary Table S1. other strains in the ACE2 contact residue region of the spike protein, which is more consistent with the hypothesis of recombination into RaTG13, as proposed in Boni et al. (2020). Moreover, our BLAST search analyses of RaTG13 in this region show highest local sequence similarity with GX pangolin virus strains which is the genome-wide outgroup for the three other sequences (Lam et al. 2020). This observation is more compatible with the hypothesis of recombination from a virus related to GX pangolin strains, than with recombination between SARS-CoV-2 and GD410721.   The d S estimates are in lower triangle, and the d N estimates are in upper triangle. The coordinates relative to the Wuhan-Hu-1 genome of the masked region can be found in Section 2. The 95 per cent confidence intervals, calculated based on 1,000 bootstrap replicates, are included in the brackets for each estimate.
Unfortunately, because of the high level of synonymous divergence to the nearest outgroup, tree estimation in small windows is extremely labile in this region. In fact, synonymous divergence appears fully saturated in comparison with GX_P1E, eliminating the possibility to infer meaningful trees based on synonymous divergence. However, we can use the overall ML tree using both synonymous and nonsynonymous mutations (Fig. 2d). The ML tree using sequence from the ACE2 contact residue region supports the clustering of SARS-CoV-2 and GD410721, but with unusual long external branches for all strains except SARS-CoV-2, possibly reflecting smaller recombination regions within the ACE2 contact residue region.

Weakly deleterious mutations and clock calibrations
The use of synonymous mutations provides an opportunity to calibrate the molecular clock without relying on amino acid changing mutations that are more likely to be affected by selection. The rate of substitution of weakly and slightly deleterious mutations is highly dependent on ecological factors and the effective population size. Weakly deleterious mutations are more likely to be observed over small time scales than over long time scales, as they are unlikely to persist in the population for a long time and go to fixation. This will lead to a decreasing d N /d S ratio for longer evolutionary lineages. Furthermore, changes in effective population size will translate into changes in the rate of substitution of slightly deleterious mutations. Finally, changes in ecology (such as host shifts, host immune changes, changes in cell surface receptor, etc.) can lead to changes in the rate of amino acid substitution. For all of these reasons, the use of synonymous mutations, which are less likely to be the subject of selection than nonsynonymous mutations are preferred in molecular clock calculations. For many viruses, the use of synonymous mutations to calibrate divergence times is not possible, as synonymous sites are fully saturated even at short divergence times. However, for the comparisons between SARS-CoV-2 and RaTG13, and SARS-CoV-2 and RmYN02, synonymous sites are not saturated and can be used for calibration. We find an estimate of x ¼ 0.0391 between SARS-CoV-2 and RaTG13, excluding just the small RDB region showing a recombination signal in SARS-CoV-2 (Supplementary Table S5, coordinates: 22851-23094). Using 1,000 parametric simulations under the estimated values and the F3x4 codon model, we find that the estimate is approximately unbiased ( x ¼ 0.0398, SEM ¼ 0.0001) and with standard deviation 0.0033, providing an approximate 95 per cent CI of (0.0332, 0.0464). Also, using fifty-nine human strains of SARS-CoV-2 from GenBank and National Microbiology Data Center (see Section 2), we obtain an estimate of x ¼ 0.5604 using the F3x4 model in codeml. Notice that there is a 14-fold difference in d N /d S ratio between these estimates. Assuming very little of this difference is caused by positive selection, this suggests that the vast majority of mutations currently segregating in the SARS-CoV-2 are slightly or weakly deleterious for the virus.

Dating of divergence between bat viruses and SARS-CoV-2
To calibrate the clock, we use the estimate provided by (http://vi rological.org/t/phylodynamic-analysis-of-sars-cov-2-update-20 20-03-06/420) of l ¼1.04 Â 10 À3 substitutions/site/year (95% CI: 0.71 Â 10 À3 , 1.40 Â 10 À3 ). The synonymous specific mutation rate can be found from this as d S /year ¼ lS ¼ l/ (pS þ xpN), where x is the d N /d S ratio, and pN and pS are the proportions of nonsynonymous and synonymous sites, respectively. The estimate of the total divergence on the two lineages is then t ¼ dS pS þ xpN ð Þ =l. Inserting the numbers from Table 3 for the divergence between SARS-CoV-2 and RaTG13 and RmYN02, respectively, we find a total divergence of 96.92 years and 74.05 years respectively. Taking into account that RaTG13 was isolated July 2013, we find an estimated tMRCA between that strain and SARS-CoV-2 oft ¼(96.92 þ 6.5)/2 ¼ 51.71 years. Similarly, we find an estimate of divergence between SARS-CoV-2 and RmYN02 oft ¼74.05/2 ¼ 37.02 years, assuming approximately equal sampling times. The estimate for SARS-CoV-2 and RaTG13 is compatible with the values obtained using different methods for dating (Boni et al. 2020). The variance in the estimate in d S is small and the uncertainty is mostly dominated by the uncertainty in the estimate of the mutation rate. We estimate the SD int using 1,000 parametric simulations, using the ML estimates of all parameters, for both RaTG13 versus SARS-CoV-2 and for RmYN02 versus SARS-CoV-2, and for each simulated data also simulating values of l and x from normal distributions with mean 1.04 Â 10 À3 and SD 0.18 Â 10 À3 , and mean 0.5604 and SD 0.1122, respectively. We subject each simulated data set to the same inference procedure as done on the real data. Our estimate of the SD in the estimate is 11.8 for RaTG13 versus SARS-CoV-2 and 9.41 for RmYN02 versus SARS-CoV-2, providing an approximate 95 per cent CI of (28.11, 75.31) and (18.19, 55.85), respectively. For RaTG13, if including all sites, except the 244 bp in the RBD of the S gene (Supplementary  Table S5), the estimate is 55.02 years with an approx. 95 per cent CI of (29.4, 80.7). As more SARS-CoV-2 sequences are being obtained, providing more precise estimates of the mutation rate, this CI will become narrower. However, we warn that the estimate is based on a molecular clock assumption and that violations of this assumption eventually will become a more likely source of error than the statistical uncertainty quantified in the calculation of the CI. We also note that, so far, we have assumed no variation in the mutation rate among synonymous sites. However, just from the analysis of the 300-bp windows, it is clear that is not true. The variance in the estimate of d S among 300-bp windows from the RaTG13-SARS-CoV-2 comparison is approximately 0.0113. In contrast, in the simulated data assuming constant mutation rate, the variance is approximately 0.0034, suggesting substantial variation in the synonymous mutation rate along the length of the genome. Alternatively, this might be explained by undetected recombination in the evolutionary history since the divergence of the strains.

Discussion
The highly skewed distribution of nucleotide frequencies in synonymous sites in SARS-CoV-2 (Kandeel et al. 2020), along with high divergence, complicates the estimation of synonymous divergence in SARS-CoV-2 and related viruses. In particular, in the third codon position, the nucleotide frequency of T is 43.5 per cent while it is just 15.7 per cent for C. This resulting codon usage is not optimized for mammalian cells (e.g. Chamary, Parmley, and Hurst 2006). A possible explanation is a strong mutational bias caused by Apolipoprotein B mRNA-editing enzymes (APOBECs) which can cause cytosine-to-uracil changes (Giorgio et al. 2020).
A consequence of the skewed nucleotide frequencies is a high degree of homoplasy in synonymous sites that challenges estimates of d S . We here evaluated estimators of d S in 300-bp sliding windows and found that a bias-corrected version of the ML estimator tended to perform best for values of d S < 2. We used this estimator to investigate the relationship between SARS-CoV-2 and related viruses in sliding windows. We show that synonymous mutations show shorter divergence to pangolin viruses, than the otherwise most closely related bat virus, RaTG13, in part of the RBD of the spike protein. This strongly suggests that the previously reported amino acid similarity between pangolin viruses and SARS-CoV-2 is not due to convergent evolution, but more likely is due to recombination. In the recombination analysis, we identified recombination from pangolin strains into SARS-CoV-2, which provides further support for the recombination hypothesis. However, we also find that the synonymous divergence between SARS-CoV-2 and pangolin viruses in this region is relatively high, which is not consistent with a recent recombination between the two. It instead suggests that the recombination was into RaTG13 from an unknown strain, rather than between pangolin viruses and SARS-CoV-2, as proposed in Boni et al. (2020). The alternative explanation of recombination into SARS-CoV-2 from the pangolin virus would require the additional assumption of a mutational hotspot to account for the high level of divergence in the region between SARS-CoV-2 and the donor pangolin viral genome. To fully distinguish between these hypotheses, additional strains would have to be discovered that either are candidates for introgression into RaTG13 or can break up the lineage in the phylogenetic tree between pangolin viruses and RaTG13.
The fact that synonymous divergence to the outgroups, RaTG13 and RmYN02, is not fully saturated, provides an opportunity for a number of different analyses. First, we can date the time of the divergence between the bat viruses and SARS-CoV-2 using synonymous mutations alone. In doing so, we find estimates of 51.71 years (95% CI, 28.11-75.31) and 37.02 years (95% CI, 18.19-55.85), respectively. Most of the uncertainty in these estimates comes from uncertainty in the estimate of the mutation rate reported for SARS-CoV-2. As more data are being produced for SARS-CoV-2, the estimate should become more precise and the CI significantly narrowed. We note that the mutation rate we use here are estimated based on the entire genome, which may differ from that in nonrecombination regions. To address this problem, we downloaded all the SARS-CoV-2 sequences that are available until 17 August 2020 from GISAID, and obtained an estimate of 1:0.81 for the ratio of mutation rates in the recombination and nonrecombination regions, using the 'GTRGAMMA' model implemented in the RAxML (Stamatakis 2014). Given the length ratio between the two partitions is 1:4, the difference between the partitions will cause a slight overestimate of the mutation rate by $5 per cent, which is relatively small compared to the confidence intervals and the potential for other unknown sources of uncertainty. However, we warn that a residual cause of unmodeled statistical uncertainty is deviations from the molecular clock. Variation in the molecular clock could be modeled statistically (see, e.g. Drummond et al. 2006;Lartillot, Phillips, and Ronquist 2016), but the fact that synonymous mutations are mostly saturated for more divergent viruses that would be needed to train such models, is a challenge to such efforts. On the positive side, we note that the estimates of d S given in Table 3 in general are highly compatible with a constant molecular clock. Boni et al. (2020) obtained divergence time estimates similar to ours using a very different approach based on including more divergent sequences and applying a relaxed molecular clock. We see the two approaches as being complimentary. In the traditional relaxed molecular clock approach, more divergent sequences are needed that may introduce more uncertainty due to various idiosyncrasies such as alignment errors. Furthermore, the relaxed molecular clock uses both synonymous and nonsynonymous mutations and is, therefore, more susceptible to the effects of selection. Our approach allows us to focus on just the relevant in-group species and to use only synonymous mutations. The disadvantage is that we cannot accommodate a relaxed molecular clock. However, the fact that both approaches provide similar estimates is reassuring and suggests that neither idiosyncrasies of divergent sequences, natural selection, or deviations from a molecular clock has led to grossly misleading conclusions.
Another advantage of estimation of synonymous and nonsynonymous rates in the outgroup lineage is that it can provide estimates of the mutational load of the current pandemic. The d N /d S ratio is almost 14 times larger in the circulating SARS-CoV-2 strains than in the outgroup lineage. While some of this difference could possibly be explained by positive selection acting at a higher rate after zoonotic transfer, it is perhaps more likely that a substantial proportion of segregating nonsynonymous mutations are deleterious, suggesting a very high and increasing mutation load in circulating SARS-CoV-2 strains.

Data availability
The pangolin virus sequences, GD410721 and GX_P1E, were downloaded from GISAID with accession numbers EPI_ISL_410721 and EPI_ISL_410539, respectively, and RmYN02 sequence was provided by E. C. Holmes. All other sequences analyzed in this study were downloaded from either NCBI GenBank or National Microbiology Data Center (NMDC). The accession codes for nonhuman sequences can be found in Supplementary Table S2 and the accession codes for human sequences can be found in Supplementary Table S3.

Supplementary data
Supplementary data are available at Virus Evolution online.