Assessing Uncertainty in the Rooting of the SARS-CoV-2 Phylogeny

Abstract The rooting of the SARS-CoV-2 phylogeny is important for understanding the origin and early spread of the virus. Previously published phylogenies have used different rootings that do not always provide consistent results. We investigate several different strategies for rooting the SARS-CoV-2 tree and provide measures of statistical uncertainty for all methods. We show that methods based on the molecular clock tend to place the root in the B clade, whereas methods based on outgroup rooting tend to place the root in the A clade. The results from the two approaches are statistically incompatible, possibly as a consequence of deviations from a molecular clock or excess back-mutations. We also show that none of the methods provide strong statistical support for the placement of the root in any particular edge of the tree. These results suggest that phylogenetic evidence alone is unlikely to identify the origin of the SARS-CoV-2 virus and we caution against strong inferences regarding the early spread of the virus based solely on such evidence.

SARS-CoV-2, the virus causing COVID-19 or "severe acute respiratory syndrome," has a single-stranded RNA genome 29,891 nucleotides in length . The exact origin of the virus causing the human pandemic is unknown, but two coronaviruses isolated from bats-RaTG13 isolated from Rhinolophus affinis (Zhou, Chen, et al. 2020) and RmYN02 isolated from Rhinolophus malayanus , both from the Yunnan province of China-appear to be closely related. After accounting for recombination, the divergence time between these bat viruses and SARS-CoV-2 is estimated to be $52 years [95% CI (28,75)] and 37 years [95% CI (18,56)] , for RaTG13 and RmYN02, respectively, using a strict clock, only the most closely related sequences, and only synonymous mutations, or 51 years [95% HPD credible interval (40, 70)] for RaTG13 (Boni et al. 2020) using a relaxed clock and all mutations including divergent sequences saturated in synonymous sites. After the emergence of the virus was first reported from Wuhan in China (Li 2020) it rapidly spread to many other areas of the world (World Health Organization 2020). However, the events leading to the early spread of the viruses are still unclear, in part because there is substantial uncertainty about the rooting of the SARS-CoV-2 phylogeny. The importance in identifying the origin of the virus has prompted other analyses on the uncertainty of rooting the phylogeny (Gomez-Carballa et al. 2020;Morel et al. 2020). Previous analyses have reached different conclusions about the rooting of the phylogeny. Although analyses that used an outgroup reached one placement Tang et al. 2020;Yu et al. 2020;Zhang et al. 2020), analyses that used midpoint rooting reached another placement (Li, Zai, Zhao, et al. 2020;Li, Zai, Wang, et al. 2020;Nie et al. 2020), and yet other analyses using a Bayesian molecular clock have reached a different placement of the root Giovanetti et al. 2020;Lemey et al. 2020;.
As illustrated in figure 1 (see figure legend for methods description), there is considerable discrepancy between rootings based on rooting with the two closest outgroup sequences using maximum likelihood ( fig. 1A), which has a rooting in clade A, and a Bayesian analysis using a molecular clock ( fig. 1B), which gives a rooting in clade B with posterior probability 0.96 (supplementary fig. S1, Supplementary Material online). We have here used clade designations by Rambaut et al. (2020). Clade B contains the earliest sequences from Wuhan, and a rooting in this clade would be compatible with the epidemiological evidence of an origin of SARS-CoV-2 in or near Wuhan. However, if an outgroup rooting is assumed ( fig. 1A), the inferred origin is in Clade A which consists of many individuals from both inside and outside East Asia. Such a rooting would be compatible with origins of SARS-CoV-2 outside of Wuhan. The rooting of the SARS-CoV-2 pandemic is, therefore, critical for our understanding of the origin and early spread of the virus. However, it is not clear how best to root the tree and how much confidence can be placed in any particular rooting of the tree.
There are many different methods for inferring the root of a phylogenetic tree, but they largely depend on three possible sources of information: outgroups, the molecular clock, and nonreversibility. The latter source of information can be used if the underlying mutational process is nonreversible, that is, for some pair of nucleotides (i, j), the number of mutations from i to j differs from the number of mutations from j to i, in expectation at stationarity. However, this source of information is rarely used to root trees because it relies on strong assumptions regarding the mutational process, and it has been shown to perform poorly on real data (Huelsenbeck et al. 2002). Most studies use methods based on either outgroup rooting, molecular clock rooting, or a combination of both. Outgroup rooting is perhaps the conceptually easiest method to understand, and arguably the most commonly used method. In outgroup rooting, the position in which one or more outgroups connects to the ingroup tree is the root position. Outgroup rooting can be challenged by long-branch attraction if distant outgroups are being used (e.g., Felsenstein 1978;Maddison et al. 1984;Hendy and Penny 1989;Graham et al. 2002). In such cases, the outgroup will have a tendency to be placed on the longest branches of the ingroup tree. In viruses, in particular, because of their high mutation rate, it can be challenging to identify an outgroup sequence that is sufficiently closely related to the ingroup sequences to allow reliable rooting. An alternative to outgroup rooting is molecular clock rooting, which is based on the assumption that mutations occur at an approximately constant rate, or at a rate that can be modeled and predicted using statistical models (e.g., using a relaxed molecular clock, such as Yoder and Yang 2000;Drummond et al. 2006). The rooting is then preferred that makes the data most compatible with the clock assumption by some criterion. Early methods for rooting using molecular clocks were often labeled FIG. 1.. Estimated maximum likelihood tree using RAxML-NG (A) and maximum clade credibility tree using tip-dating using BEAST (B) for SARS-CoV-2. The software package pangolin (https://github.com/hCoV-2019/pangolin, updated on May 1, 2020) was used for lineage assignment based on lineages updated on April 27, 2020. The maximum likelihood estimate of the phylogeny was obtained using the program RAxML-NG (Kozlov et al. 2019) under the GTRþC model of DNA substitution. We estimated the maximum clade credibility tree using a time-measured Bayesian phylogenetic model implemented in BEAST (Suchard et al. 2018) v1.10.4 using a Markov chain Monte Carlo (MCMC) length of 10,000,000 steps and sampling every 1,000 steps. We used a Hasegawa, Kishino, and Yano (HKY) substitution model (Hasegawa et al. 1985) with C distributed rate heterogeneity with invariant sites and the uncorrelated relaxed clock with a lognormal distribution using default parameters. We specified exponential growth coalescent tree priors, used an exponential prior distribution with a mean and initial value of 1 Â 10 À3 for the clock rate, and used the tree prior with default values for the root height. Both of these parameters had effective sample sizes of >100. All of the other BEAST parameters not mentioned here were set as the default values. TreeAnnotator was used to annotate the maximum clade credibility tree. The estimated posterior probability of a rooting in clade B is large (9,616/10,000 sampled trees have roots in clade B, see supplementary fig. S1, Supplementary Material online). Pipes et al. . doi:10.1093/molbev/msaa316 MBE "midpoint rooting" as some original methods were based on placing the root halfway between the most distant leaf nodes in the tree (e.g., Swofford et al. 1996). More modern methods use more of the phylogenetic information, for example, by finding the rooting that minimizes the variance among leaf nodes in their distance to the root (e.g., Mai et al. 2017) or produces the best linear regression of root-to-tip distances against sampling times when analyzing heterochronous data (Rambaut et al. 2016). Methods for inferring phylogenetic trees that assume an ultrametric tree (i.e., a tree that perfectly follows a molecular clock), such as unweighted pair group method with arithmetic mean (UPGMA; Sokal and Michener 1958), directly infer a rooted tree. Similarly, Bayesian phylogenetic methods using birth-death process priors (Kendall 1948;Thompson 1975) or coalescent priors (Kingman 1982a(Kingman , 1982b(Kingman , 1982c) also implicitly infer the root. But even with uninformative priors on the tree, the placement of the root can be estimated in Bayesian phylogenetics using molecular clock assumptions. An advantage of such methods, over methods that first infer the branch lengths of the tree and then identify the root most compatible with a molecular clock, is that they explicitly incorporate uncertainty in the branch length estimation when identifying the root and they simultaneously provide measures of statistical uncertainty in the rooting of the tree. Huelsenbeck et al. (2002) investigated the use of Bayesian inference of root placement and found high consistency between outgroup rooting and molecular clock rooting. The objective of this study is to determine how well the root of the SARS-CoV-2 phylogeny can be identified and to provide measures of statistical uncertainty for the placement of the root of the SARS-CoV-2 pandemic. There are several challenges when doing so. First, and most importantly, there is very little variability among the early emerging strains of the virus, challenging both molecular clock and outgroup rooting. Secondly, although the nearest outgroup sequence (RmYN02) is 97.2% identical to SARS-CoV-2 (Zhou, Chen, et al. 2020), the synonymous divergence is >11% revealing the presence of appreciable homoplasy, providing potential additional uncertainty for outgroup rooting. Thirdly, it is unclear if a molecular clock assumption is suitable during the early phases after zoonotic transfer where selection could possibly be quite strong. Finally, coronaviruses experience substantial recombination (e.g., Boni et al. 2020;Patino-Galindo et al. 2020), and although there is no current evidence of recombination into SARS-CoV-2 since its divergence with RaTG13 and RmYN02 (Boni et al. 2020;Wang et al. 2020), both of these viruses show evidence of recombination with other viruses, particularly around the gene encoding the Spike protein, that elevates the divergence from outgroup viral strains locally (e.g., Boni et al. 2020;Wang et al. 2020). Recombination in the outgroups is at odds with the assumption of a single phylogenetic tree shared by all sites assumed by phylogenetic models when using outgroup rooting, particularly if more than one outgroup is included in the analysis.
To investigate the possible rootings of the SARS-CoV-2 phylogeny we used six different methods and quantified the uncertainty in the placement of the root for each method on the inferred maximum likelihood topology. We note that the question of placement of a root is a question idiosyncratic to a specific phylogeny, and to define this question we fixed the tree topology, with the exception of the root placement, in all analyses. In all cases, we applied the method to the alignment of 132 SARS-CoV-2 sequences and two putative outgroup sequences, RaTG13 and RmYN02 (see supplementary table S1, Supplementary Material online) that was constrained such that the protein-coding portions of the SARS-CoV-2 genome were in frame, and is described in detail in Wang et al. (2020). To ensure that we could accurately capture the rooting from available sequences, the sequences used for the analysis are chosen to be representative of the basal branches of the phylogeny and/or were early sequenced strains. There are two orders of magnitude more strains available in public databases, however these sequences are more terminally located and would provide little additional information about the placement of the root but have the potential to add a significant amount of additional noise. We are therefore focusing our efforts on the limited data set of early sequences. However, we note that future inclusion of more sequences with a basal position in the phylogeny (with few splits between the edge leading to the sequence and the root) could add additional information. The maximum likelihood estimate of the phylogeny was obtained using the program RAxML-NG (Kozlov et al. 2019) under the GTRþ C model of DNA substitution. The topology of the tree is shown in figure 2. The outgroup sequences were pruned from the tree using nw_prune from Newick utilities v1.6 (Junier and Zdobnov 2010). Bootstrapping was performed using the RAxML-NG -bootstrap option. For the RaTG13þRmYN02 analysis, only bootstrapped trees that formed a monophyletic group for RaTG13 and RmYN02 were kept. The clades of the tree were assigned according to nomenclature proposed by Rambaut et al. (2020) where the A and B clades are defined by the mutations at the genome positions 8782 and 28144 and based on whether or not they share those sites with RaTG13. The six different methods for identifying the root of the SARS-CoV-2 phylogeny were: (1) Outgroup rooting using RaTG13. We constrained the tree topology to be equal to the unrooted SARS-CoV-2 phylogeny, that is, the only topological parameter estimated was the placement of the RaTG13 sequence on the unrooted SARS-CoV-2 phylogeny. We masked the potential recombination segment (NC_045512v2 positions 22851-23094) in RaTG13 identified in Wang et al. (2020) from the alignment. To quantify uncertainty we obtained 1,000 bootstrap samples. We note that although interpretation of bootstrap proportions in phylogenetics can be problematic (see Efron et al. 1996), in the current context they should have a more simple interpretation as providing a confidence set for the placement of the root, that is, if the sum of bootstrap proportions exceed 0.95 for a set of edges, under repeated sampling we would expect the root to be placed on one of these edges with probability >0.95. However, as there are very few informative sites, the bootstrap could potentially lead to poorly calibrated confidence intervals. To assess this issue, we SARS-CoV-2 Phylogeny . doi:10.1093/molbev/msaa316 MBE performed 1,000 parametric simulations using pyvolve (Spielman and Wilke 2015) using maximum likelihood estimates, from the original data set, of the model of molecular evolution and the phylogenetic tree, including branch lengths (see supplementary table S2, Supplementary Material online). For each simulation, we then estimated the tree using the same procedure as used for the real data for both the simulated data, and for 100 bootstrap replicates. We then constructed confidence sets by finding a set of branches b ¼ f b 1 ; b 2 ; . . . ; b k g such that pðb 1 Þ þ pðb 2 Þ þ Á Á Á þ pðb k Þ ! 0:95 and for which jbj is minimal, where pðb i Þ is the bootstrap proportion for branch i. Notice in supplementary figure S2, Supplementary Material online, that the bootstrap proportions provide rather poor measures of confidence if interpreted as such. This is likely because of the small number of mutations observed on each branch. We, therefore, consider the posterior probabilities (described in the next sections) to be more interpretable measures of uncertainty than the bootstrap proportions.
(2) Outgroup rooting using RmYN02. We used the same methods as in (1) Rambaut (2000Rambaut ( , 2016 applied to the maximum likelihood tree. This method uses the molecular clock to root the tree. We again quantified uncertainty using 1,000 bootstrap samples. (5) We used the Bayesian molecular clock rooting method described in Huelsenbeck et al. (2002) but constrained to maintain the maximum likelihood topology as in the previous rooting methods. We wrote specialized software to calculate the posterior probability distribution of the root position under the molecular clock (the "Rooter" method). The program constrained the unrooted tree of the human SARS-CoV-2 sequences, estimated via maximum likelihood. However, all other parameters of the phylogenetic model were treated as random variables. The GTRþC model of DNA substitution was assumed Probabilities are shown for branches with probability greater than 0.05 for at least one method. Branch lengths are not to scale. Lineage assignment was performed using the pangolin software. After running the software for assignment, only lineages called for sequences where both "aLRT" and "UFbootstrap" values which quantify the branch support in phylogeny construction are !80. Pipes et al. . doi:10.1093/molbev/msaa316 MBE in all Bayesian analyses. We used Markov chain Monte Carlo (MCMC) with 10,000,000 cycles with a sample frequency of 1,000 to update all of the model parameters. For the outgroup criterion, we initialized the tip dates using the sample dates of the viruses (which ranged from December 23, 2019 to March 24, 2020). The molecular clock was enforced, with an exponential prior with parameter k ¼ 1000 placed on the tree height. (6) We used an outgroup rooting method (the "Ogrooter" method) as described in (5) except where each branch length had an independent exponential prior with parameter k ¼ 1000. The outgroup criterion was used to root the tree. That is, we kept track of where the RaTG13 and RmYN02 sequences, which were forced to be monophyletic, joined the ingroup tree of 132 human SARS-CoV-2 sequences. We report the marginal posterior probability of the root position, which is approximated using MCMC as the fraction of the time the outgroup sequences joined the various branches.
Notice, the four methods for outgroup rooting are largely compatible ( fig. 2). Most of the bootstrap replicates place the root in one of two places: in a clade leading to three Japanese sequences, two sequences from the USA, two Shenzhen sequences, and one Beijing sequence (with bootstrap proportion varying between 0.068 and 0.184, and posterior probability 0.0413) and in a clade leading to two Washington sequences, one Shanghai sequence, and one Zhejiang sequence (with bootstrap proportion varying between 0.074 and 0.142, and posterior probability 0.0363). None of these rootings are very epidemiologically plausible given that the first outbreak of SARS-CoV-2 was identified in Wuhan. There are also positive bootstrap proportions on other edges of the tree. Importantly, there is not a single placement that has high bootstrap proportion. In fact, the highest bootstrap proportion on any edge of the tree for any bootstrap method is only 0.245 and when using both RmYN02 and RaTG13, no placement has a higher bootstrap proportion than 0.2. Perhaps, surprisingly, the bootstrap proportions do not get more concentrated when adding both RmYN02 and RaTG13. A possible explanation for this is the reduction in alignment length when removing the recombination fragments from RmYN02. The two methods for placing the root using a molecular clock are also mostly compatible with each other. Rooter places about half of the posterior probability (0.464), and the rootto-tip regression rooting method (rtt) places 0.341 bootstrap proportion, at the earliest collected sequence from Wuhan (Wuhan/IPBCAMS-WH-01/2019). Rooter also places 0.137 probability on the edge leading to this sequence and 0.151 probability on the sister edge to this sequence. However, there is also considerable probability assigned in various other positions. No singular placement in the tree receives more than 0.464 probability.
To investigate differences in signs of temporal signal for the outgroup rooting and the molecular clock rooting, we calculated root-to-tip distances using TempEst v1.5.3 (Rambaut et al. 2016) for the ML tree using the outgroup rooting (supplementary fig. S3, Supplementary Material online) and a rerooting of the ML tree using the molecular clock rooting (supplementary fig. S4, Supplementary Material online). Rerooting was performed using nw_reroot from Newick utilities v1.6 (Junier and Zdobnov 2010). As expected, the molecular clock rooting has more temporal signal (r ¼ 0.403, Pvalue ¼ 7.226 Â 10 À8 ) than the outgroup rooting (r ¼ 0.271, P-value ¼ 3.367 Â 10 À4 ). Additionally, we infer the root age of the molecular clock rooting to be in mid-October 2019 (2019.794) with a 95% confidence interval of [2019.225, 2020.363] and we estimate the rate of evolution to be 5.470 Â 10 À4 substitutions per site per year with a 95% confidence interval of [3.049 Â 10 À4 , 7.891Â 10 À4 ]. Despite our small sample size and our focus on basal lineages of the SARS-CoV-2 tree to assess uncertainty in the rooting of the phylogeny, this is largely compatible with other estimates of the time to the most common recent ancestor (e.g., Lai et al. 2020;van Dorp et al. 2020). The inferred root age of the outgroup rooting is much earlier than other estimates, in mid-August of 2019 (2019.632), with a much wider 95% confidence interval of [2017.632, 2020.469], an indication of greater uncertainty and less molecular clock signal. We estimate the rate of evolution for the outgroup rooting to be 3.547 Â 10 À4 substitutions per site per year with a 95% confidence interval of [1.112 Â 10 À4 , 5.969Â 10 À4 ]. We calculated the confidence intervals for the root age and the rate of evolution using the standard errors of the x-intercept and the slope, respectively, which was estimated using a nonparametric bootstrap of the alignment sites with n ¼ 5,000. Using the same bootstrapping of the alignment, we calculated the P-values for the correlations using a two-sided Wald test. The results are qualitatively similar to the results inferred from the BEAST analysis in figure 1B, which also give an older inferred root age when the tree has a root is in clade A (mean age 2019.769) than when it is in clade B (mean age 2019.827) (supplementary fig. S1, Supplementary Material online).
We also assessed the temporal signal in the data using Bayesian evaluation of temporal signal (BETS) , which estimates the marginal likelihoods using a model that contains sampling times and a model containing no sampling times. A model is preferred over another according to their ratio of marginal likelihoods. The model using sampling times is expected to have the highest statistical fit if the data contain temporal signal. BETS provides further evidence that there is temporal signal in the data with only a modest improvement in statistical fit for the relaxed clock over the strict clock model (supplementary fig. S5, Supplementary Material online). Additionally, Duchene, Featherstone, et al. (2020) have used BETS to show that there is positive evidence for temporal signal if SARS-CoV-2 genomes past February 2nd are included in the data. Although molecular clock rooting and the outgroup rooting strategies internally give qualitatively similar results, they are largely incompatible with each other. The molecular clock rooting places the root in the B clade with high confidence, whereas outgroup rooting places the root in the A clade with similarly high confidence. The reason for this discrepancy is unclear, but it could be caused either by deviations from a molecular clock or excess back-mutations, that is, SARS-CoV-2 Phylogeny . doi:10.1093/molbev/msaa316 MBE unexpectedly many mutations in the same site occurring both within the SARS-CoV-2 phylogeny and on the lineage leading to the outgroup(s). We were able to capture outgroup rooting compatible with the molecular clock rooting (obtaining an outgroup rooting in clade B instead of clade A) by removing three positions from the alignment (8782, 18060, and 28144) (supplementary fig. S6, Supplementary Material online). All of these positions have negative phyloP values based on the UCSC 119way alignment (Fernandes et al. 2020), which suggests fast evolution. Although positions 8782 and 18060 are synonymous changes, position 28144 is a missense mutation in orf8 whose function is unclear, but which has also back-mutated in more recent samples of the A clade. Also, all three mutations are between T and C which occur with particularly high rate within SARS-CoV-2 (see e.g., https://virological.org/t/issues-with-sars-cov-2-sequencingdata/473). The most likely explanation for the observed discrepancy between the rootings might be hypermutability in these sites causing excess back-mutations, suggesting that the molecular clock rooting is more reliable. However, we cannot exclude an increased rate of mutation (or sequencing errors) in the A clade that would attract the root to this clade. However, both methods of rooting reveal substantial uncertainty in the placement of the root.
The rooting of the SARS-CoV-2 phylogeny has important implications for our understanding of Covid19 epidemiology. Rooting in the B clade is compatible with an origin in or near Wuhan, whereas a rooting in Clade A suggests alternative origins of the virus, perhaps outside East Asia. The phylogenetic evidence alone, which is the focus of our study, is not sufficient to resolve this issue. However, we note that the vast majority of epidemiological evidence points to an origin of the virus in or near Wuhan. Furthermore, hypermutation in a few sites, not accounted for in standard models of molecular evolution, might be able to explain the signal in the outgroup rooting toward a root in Clade A. For that reason we consider the rooting in Clade B, as estimated in molecular clock analyses, and implied in standard phylodynamic analyses (e.g., Lemey et al. 2020), to be the most plausible rooting. However, as a root in Clade B cannot be excluded based on the phylogenetic evidence alone, it might be prudent to avoid strong inferences regarding the early divergence of SARS-CoV-2 based on a fixed rooting in either the A or the B clade, and analyses based on the outgroup rooting should be avoided until outgroups more closely related to SARS-CoV-2 have been discovered. Although the outgroup rooting seems the least compatible with the available epidemiological evidence, we recommend that studies that use rooting for inferences: 1) use methods that can take uncertainty in the rooting into account, for example by integrating over possible rootings, as accomplished by Bayesian phylogenetic methods, and 2) combine evidence from outgroup rooting and molecular clock rootings. The latter could, for example, be accomplished in a Bayesian framework by also including outgroup sequences in traditional phylodynamic analyses, but with a different prior governing the evolution and epidemiology of the outgroup sequences than that used for the ingroup sequences.

Supplementary Material
Supplementary data are available at Molecular Biology and Evolution online.