Intra- vs. Interhost Evolution of SARS-CoV-2 Driven by Uncorrelated Selection—The Evolution Thwarted

Abstract In viral evolution, a new mutation has to proliferate within the host (Stage I) in order to be transmitted and then compete in the host population (Stage II). We now analyze the intrahost single nucleotide variants (iSNVs) in a set of 79 SARS-CoV-2 infected patients with most transmissions tracked. Here, every mutation has two measures: 1) iSNV frequency within each individual host in Stage I; 2) occurrence among individuals ranging from 1 (private), 2–78 (public), to 79 (global) occurrences in Stage II. In Stage I, a small fraction of nonsynonymous iSNVs are sufficiently advantageous to rise to a high frequency, often 100%. However, such iSNVs usually fail to become public mutations. Thus, the selective forces in the two stages of evolution are uncorrelated and, possibly, antagonistic. For that reason, successful mutants, including many variants of concern, have to avoid being eliminated in Stage I when they first emerge. As a result, they may not have the transmission advantage to outcompete the dominant strains and, hence, are rare in the host population. Few of them could manage to slowly accumulate advantageous mutations to compete in Stage II. When they do, they would appear suddenly as in each of the six successive waves of SARS-CoV-2 strains. In conclusion, Stage I evolution, the gate-keeper, may contravene the long-term viral evolution and should be heeded in viral studies.


Mol. Biol. Evol. 40(9):msad204 https://doi.org/10.1093/molbev/msad204 Advance Access publication September 14, 2023

Introduction
Selection for new mutations is the essence of molecular evolution (Li 1997). For a virus, this phase of selection must happen within a host first. Hence, a study of viral evolution has to consider the selective advantage, or disadvantage, within individuals. We shall refer to this stage of evolution as Stage I. After the mutations sweep through within the host, they compete with the prevalent strains from other individuals in Stage II evolution.
In Stage I, we need to track intrahost single nucleotide variants (iSNVs), which are the alternative alleles at the same genomic position within an intrahost sample. For a de novo mutation in an individual to become detectable as an iSNV, it must increase from one virion in millions to an appreciable frequency beyond the sequencing error rate. Before an iSNV reaches 50% in frequency, it is essentially invisible in the current practice of presenting only one viral genome per individual. This practice explicitly assumes little intrahost variation and directs the focus to Stage II, bypassing Stage I evolution entirely (Korber et al. 2020; Rambaut, Holmes, et al. 2020; Tang et al. 2020; Zeng et al. 2020; Dellicour et al. 2021; Planas et al. 2021; Ruan, Luo, et al. 2021).
Presenting one genome per host can be justified if the number of virions that successfully colonize a new host (denoted $N_0$) is very small. Obviously, with $N_0 = 1$, there is no within-host diversity at the start of infection. Note that $N_0$ should be much smaller than the number of virions in the droplets or aerosol carrying the virus (Killingley et al. 2022; Puhach et al. 2022). While $N_0$ has been frequently estimated to be close to 1 (Braun et al. 2021; Lythgoe et al. 2021; Martin and Koelle 2021; Wang, Wang, et al. 2021), others have shown that $N_0$ is large enough to preserve the intrahost polymorphism during transmission (Popa et al. 2020; Ruan, Hou, et al. 2021). The difference in estimates is mainly due to de novo mutations in the donors (as well as the recipients), which are not involved in the transmission and should be excluded from the estimation of $N_0$.
While tracking iSNVs is necessary for a full understanding of viral evolution, iSNVs also have clinical value. Viral strains that have spread widely and displayed detrimental effects on human health have been classified as variants of concern (VOCs), including Delta and Omicron (WHO). VOCs are reported only when their characteristic mutations become high-frequency (>50%) iSNVs. However, these mutations may be detectable at lower frequencies within hosts long before VOCs are identified. Despite the unprecedented efforts in surveillance, the lack of intermediate sequences has prevented us from accurately describing how the VOCs emerge (Ruan, Wen, et al. 2021; Wu et al. 2021; Du et al. 2022; Ghafari et al. 2022; Mallapaty 2022; Magiorkinis 2023; Markov et al. 2023). Several hypotheses have been proposed for the origin of VOCs, including persistent evolution in a few chronically infected COVID-19 patients (Choi et al. 2020; Rambaut, Loman, et al. 2020; Kemp et al. 2021; Hill et al. 2022; Scherer et al. 2022), cryptic circulation in a human population with insufficient sampling (Wilkinson et al. 2021; Brito et al. 2022), and reverse zoonosis from animal hosts such as rodents and mink (Oude Munnink et al. 2021; Wei et al. 2021; Hale et al. 2022). Exploring the differences of selective forces in the two stages may help us understand the lack of intermediate sequences of emerging VOCs.
In this study, we track the evolution of SARS-CoV-2 in Stage I through the transition to Stage II. By comparing the evolutionary forces in the two stages, we would know whether and how the current exclusive focus on Stage II evolution may bias, or even distort, the understanding of long-term viral evolution, including the emergence of VOCs. In particular, we may need this understanding to anticipate the future of COVID-19.

Results
In this study, we present a data set of 79 COVID-19 confirmed cases. The mutation profile of the viral genomes within each patient, relative to the reference genome (Wuhan-Hu-1), is shown in figure 1. This dataset is uniquely informative in two ways. First, the contact records of this cohort of patients are available. Second, the viral sequences from each patient are shown as iSNVs with their frequencies indicated by color. Although fixed mutations are no longer "variants" in the strict sense of the word, they used to be iSNVs until reaching fixation. Hence, they are still classified as variants.
The 116 mutations, detected in the cohort of 79 COVID-19 patients, are classified into three groups which are, from left to right in figure 1, 67 private, 14 public, and 35 global mutations. Private mutations occur in only one single individual while public and global mutations are observed, respectively, in multiple (usually 2-10) and almost all (>70) individuals. Note that the green-to-red gradient denotes the increase in frequency with the red color showing near-fixation within the individual. Eight of these sites (four private, three public, and one global) are marked light gray. These are sites of low read depth (<100) packed in a 25 bp stretch of the genome. These gray dots should be considered uninformative sequencing reads.
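The three-way grouping by occurrence can be sketched in a few lines. This is only an illustration, not the authors' pipeline: the function name, the toy mutation labels, and the choice of 71 as the smallest "global" count (taken from the ">70" stated above) are assumptions.

```python
def classify_mutation(occurrences: int, global_min: int = 71) -> str:
    """Classify a mutation by the number of patients carrying it.

    Assumption: 'global' means more than 70 of the 79 patients, as the
    text states; the exact cutoff used by the authors may differ.
    """
    if occurrences < 1:
        raise ValueError("a detected mutation occurs in at least one patient")
    if occurrences == 1:
        return "private"
    if occurrences >= global_min:
        return "global"
    return "public"


# Hypothetical occurrence counts, for illustration only.
toy_counts = {"mutA": 1, "mutB": 3, "mutC": 79}
labels = {name: classify_mutation(n) for name, n in toy_counts.items()}
print(labels)
```

Keeping the cutoff as a parameter makes it easy to check how sensitive the public/global split is to that choice.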
It is visually obvious that global mutations are a sea of red dots. The 35 global iSNV mutations, with intrahost frequency >0.9, overlap with the defining polymorphisms of Delta strain (A23403G, C22995A) (Planas et al. 2021; Ruan, Hou, et al. 2022), thus confirming the infection by Delta strain. Importantly, red color sites are also frequently seen among private and public mutations (fig. 1). The pattern suggests that an iSNV usually has to reach a high frequency (colored red) within a few individuals before it spreads through the population. In other words, Stage II evolution commences only after the completion of Stage I. With two distinct stages of evolution, each stage can now be analyzed separately, thus simplifying the task of analyzing a complex process.

Transmission of iSNVs From Donors to Recipients
The data set of figure 1 also records the detailed contact information among this cohort of 79 patients, shown in figures 2-4. The contact records establish the chain of transmission among patients (solid arrows) with some ambiguities (dotted arrows). Most important, these figures reveal the circumstances under which mutations are transmitted (becoming public) or not transmitted (remaining private).
Figures 2-4 show 15 mutations that occur in only parts of the transmission chains which are either public or private. Global mutations that occur in nearly all individuals, usually at iSNV > 0.9, are not shown. Of the three kinds, public mutations are the least abundant as they are the bridge between private and global mutations.
Public mutations have different degrees of within-host advantage, as shown in figures 2 and 3, respectively. Figure 2 displays mutations of moderate selective advantage within individuals. These are iSNV mutations that increase their frequencies step by step in more than one individual. The first one, C925T, has not reached fixation in any individual in the transmission chain. The second one, A6823G, reached fixation in the recipient from the donor (gz5266) with iSNV frequency at 59%. This iSNV seems to be a de novo mutation in gz5266 as it is not seen upstream of the transmission chain. The two mutations are deemed "moderately" advantageous within hosts only in comparison with the mutations of figures 3 and 4 below. After all, the ability to increase to a high frequency in 2-3 transmissions is impressive.
The third mutation of figure 2, C27092T, appears in the first patient (gz4925) of this cohort with the iSNV frequency of 46%. C27092T could be the weakest within-host mutation among the 15 mutations identified in this chain. We infer its weakness for two reasons. First, it is already at 46% at the beginning of the chain. Even if it rose to this frequency de novo in gz4925, it is still weaker than most others. Besides, it is likely that C27092T arose earlier and has taken some time to reach 46%. Second, C27092T failed in one of the two recipients (gz5002) from gz4925. In a mapped chain like this one, one can distinguish between nontransmission and post-transmission failure. Importantly, the box surrounding gz4925, 5002, and 5087 has dotted lines to indicate that all other patients outside of the box have C27092T at 100%. We will return to this mutation after figures 2 and 3 are presented.
The transmission patterns of figure 2 suggest that unfixed iSNVs must have a strong population structure in both space and time. In other words, samples taken at different times, or from different tissues, of the same individual would often be quite different in mutation profile (Popa et al. 2020; Gao et al. 2021; Lythgoe et al. 2021; Ruan, Hou, et al. 2021; Li, Du, et al. 2022). Such a population structure may also explain why donors and recipients, or two recipients downstream of the same donor, often have different mutation profiles. In contrast, iSNVs reaching 100% are more often truly fixed in the host such that all samples would carry the mutation at ∼100%.
In figure 3, the three public mutations are quite different from those of figure 2. Each of the three iSNVs is a de novo mutation as it is absent upstream of the host along the transmission chain. Since each reaches 100% in the host where it is first observed, the speed of spread would suggest substantial selective advantage. With that, one might have expected the mutations to have spread widely but, instead, all of them get transmitted only once or twice. In other words, the advantage appears to be mainly within the host but does not extend to a transmission advantage between hosts.

FIG. 1. Heatmap portraying SARS-CoV-2 mutations in 79 patients. Each row is a patient's mutation profile and each column is the mutation across patients. The iSNV frequency in each host is indicated by color (gray color denoting sites with unreliable reads). The 116 mutations are classified into three groups from left to right: 67 private (one occurrence), 14 public (multiple occurrences), and 35 global (all patients) mutations. Note that public mutations are relatively rare, compared with private mutations, suggesting a hurdle of transmission for private mutations. The reference genome is Wuhan-Hu-1.

The conjecture that the selective advantages in the two stages may be decoupled can be seen more clearly in figure 4. These are 9 de novo mutations that, like those of figure 3, rise to 100% within the host. Their further spread to other individuals, however, is completely absent. Thus, rapid rises of mutations within hosts rarely result in subsequent widespread transmission among hosts. We now return to the C27092T mutation of figure 2 which, as stated above, is the weakest iSNV within hosts. Interestingly, it is the only mutation that comes very close to being a global mutation, thus hinting at its strength in transmission between hosts. In short, figures 2-4 together suggest that selection for fitness characteristics in Stage I and Stage II may be uncorrelated, or even antagonistic.

FIG. 2. The transmission of three public mutations with mild intrahost fitness. The transmission network of 79 patients is shown in the upper panel. These three mutations (C925T, A6823G, and C27092T) are of moderate frequency (8%, 59%, or 46%) when first observed, but increased to higher frequency in later recipient patients. The spread of C925T (A) and A6823G (B) is limited, and they are present in only three and two individuals, respectively. C27092T (C) reaches fixation (>95%) in all but three downstream recipients (marked by an asterisk). This mutation is deemed mildly advantageous as it has taken an unknown length of time to reach the high iSNV frequency prior to entering this cohort of patients.

FIG. 3. The transmission of three public mutations with strong intrahost fitness. These mutations reach 100% when first observed but are absent in the donors, thus suggesting large fitness gain in the new host. However, the spread of these mutations is limited in the cohort of patients with T17838C (A), C7844T (B), and C506T (C) present in only three, six, and three patients.

Selection Within- versus Between-hosts: Two Uncorrelated Forces
The total results of figures 1-4 are summarized in figure 5 with the synonymous (S) and nonsynonymous (A for amino acid altering) mutations separately tallied. To detect selection, the A:S ratio is a conventional measure (Li et al. 1985; Nei and Gojobori 1986; Yang and Nielsen 2000). If there is no selection on all mutations, the expected A:S ratio would be the same in any grouping of mutations. The neutral A:S ratio is a function of the codon usage and the nucleotide substitution pattern of each genome; for example, the A:S ratio in the human genome is ∼2.5 (Fay et al. 2001; Voight et al. 2006; Fu and Akey 2013; Martincorena et al. 2017). An observed A/S ratio larger (or smaller) than the neutral one is an indication of positive (or negative) selection for nonsynonymous changes.
Below, we first analyze the influence of selection in Stage I using private mutations, as shown in the red-border box of figure 5. We then analyze selection in Stage II, using mutations that reach iSNV frequency ≥ 0.9, as shown in the black-border box.

Selection for Viral Proliferation Within Hosts (The Red-border Box)
The iSNV frequencies in the red-border box of figure 5 are grouped into 3 bins, Low (L, 0.05-0.1), Middle (M, 0.1-0.9) and High (H, >0.9). Frequencies <0.05 are not used as errors below 0.05 are high. From the L to M bin, the A:S ratio decreases from 1.9 (21:11) to 1.0 (11:11). The standard population genetic interpretation (Fay and Wu 2003; Fu and Akey 2013; Wang et al. 2018; Chen et al. 2022) is that the L bin mutations consist mainly of neutral and deleterious mutations. These deleterious mutations have not been eliminated yet but will be eventually. The M bin, with the deleterious mutations eliminated, contains mostly neutral mutations.
In contrast, the A:S ratio increases from 11:11 to 9:0 between the M and H bin (P = 0.012 by Fisher's Exact Test). A salient feature of advantageous mutations is that their frequency spectrum tilts toward the high frequency bins (usually >0.8 in frequency; see Wang et al. 2018). It is interesting that the low-to-median frequency portion (<0.7) is not strikingly different from the neutral mutation spectrum. Hence, the high A:S ratio in the H bin is most easily explained by the spread of advantageous mutations.
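The M versus H comparison can be rechecked with a short, standard-library implementation of Fisher's exact test. This is a minimal sketch, assuming the reported P value is two-sided and computed by summing the probabilities of all tables no more likely than the observed one; the function name is illustrative, and the counts (11:11 and 9:0) are those quoted in the text.

```python
from math import comb


def fisher_exact_two_sided(a: int, b: int, c: int, d: int) -> float:
    """Two-sided Fisher's exact test for the 2x2 table [[a, b], [c, d]]."""
    row1, row2 = a + b, c + d
    col1 = a + c
    n = row1 + row2
    denom = comb(n, col1)

    def prob(x: int) -> float:
        # Hypergeometric probability that the top-left cell equals x.
        return comb(row1, x) * comb(row2, col1 - x) / denom

    p_obs = prob(a)
    lo = max(0, col1 - row2)
    hi = min(row1, col1)
    # Sum over all tables that are no more probable than the observed one
    # (small tolerance guards against floating-point ties).
    return sum(prob(x) for x in range(lo, hi + 1) if prob(x) <= p_obs * (1 + 1e-9))


# Rows: M bin (A=11, S=11) and H bin (A=9, S=0), as given in the text.
p_value = fisher_exact_two_sided(11, 11, 9, 0)
print(round(p_value, 3))  # about 0.012
```

Under these assumptions the computed value is close to 0.0116, which is in line with the reported P = 0.012.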

Selection for Viral Spread Among Hosts
We now examine the interhost selection (Stage II) by examining the mutation occurrences from left to right in figure 5. We first use the last row of the table in figure 5 that sums up all iSNVs with a frequency of >0.05.If iSNVs with a frequency >0.05 are somewhat advantageous within individuals, as alluded to above, the sums should reflect the average advantage within hosts.
As shown in the table, the A/S ratio is 1.86 (41:22), 1.0 (7:7) and 5.4 (27:5) for private, public and global mutations, respectively. Generally, the A/S ratio in the population would decrease as the frequency increases, due to the working of negative selection. However, this trend may not necessarily be the expectation in viral evolution since the mutations have already been through one round of selection in Stage I. In particular, given the large number of virions within a single individual, the mutation at the time of its emergence is likely to be $<10^{-6}$ in frequency. In that case, iSNVs of even 0.05 in frequency are likely to be somewhat advantageous. At least, it is reasonable to assume that such iSNVs are not deleterious within hosts. In short, if the selective advantages in Stage I and II are correlated, the decrease in the A/S ratio from low (private mutations) to medium (public mutations) frequencies reported above (1.86 to 1.0) is opposite of the expectation. In the next step from public to global mutations, the A/S ratio does increase from 1.0 (7:7) to 5.4 (27:5) as expected.
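The ratios quoted above follow directly from the tallies. The snippet below is a small worked check, using only the counts stated in the text; the helper name and the handling of a zero synonymous count (returned as infinity, as would apply to a 9:0 bin) are illustrative choices.

```python
def as_ratio(nonsyn: int, syn: int) -> float:
    """A/S ratio; returns infinity when no synonymous mutation is observed."""
    return nonsyn / syn if syn else float("inf")


# (nonsynonymous, synonymous) counts for iSNVs with frequency > 0.05.
tallies = {"private": (41, 22), "public": (7, 7), "global": (27, 5)}
ratios = {group: round(as_ratio(a, s), 2) for group, (a, s) in tallies.items()}
print(ratios)
```

The drop from private to public followed by the rise from public to global is the non-monotonic pattern the paragraph discusses.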
To test the postulate that the selective advantage in Stage I does not translate to an advantage in Stage II, we next focus on high-frequency iSNVs that should have the strongest advantages in Stage I (see the first row of the table with a black-border box) among all iSNVs. While we use A/S ratios to gauge the effects of selection above, the number of synonymous mutations in the iSNV > 0.9 class is too small to yield informative A/S ratios. (In fact, the paucity of such synonymous iSNVs is an indication that they are rarely advantageous enough within hosts to reach a high frequency.) We therefore ask the following question: Given 9 nonsynonymous iSNVs > 0.9 that are private, how many public mutations are expected? We use the formula (Fu 1995) of $f_i = \theta/i$ where $f_i$ is the number of mutations occurring in $i$ of the 79 patients and $\theta$ is a constant for the population. Figure 5 shows $f_1 = \theta = 9$. Hence, the expected number of public mutations that are high frequency iSNVs should be $\sum_{i=2}^{78} \theta/i \approx 36$. It is striking that the observed number is only 1, nowhere close to the expected 36. Clearly, fixed private iSNVs are not transmitted to become public iSNVs. For a succinct summary of this section, the selective advantage as an iSNV in Stage I may be a liability in Stage II of interhost transmission.
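The expected count can be reproduced by summing $\theta/i$ over the public classes $i = 2, \dots, 78$ with $\theta = 9$. This is a minimal numerical check of the formula as stated; the variable names are illustrative.

```python
theta = 9          # f_1, the number of private high-frequency nonsynonymous iSNVs
n_patients = 79

# Public mutations occur in i = 2 .. 78 of the 79 patients.
expected_public = sum(theta / i for i in range(2, n_patients))
print(round(expected_public, 1))
```

The sum comes to roughly 35.5, consistent with the approximate value of 36 given in the text and far above the single public mutation observed.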

Private and Global Mutations in Association With Different Viral Genes
We now ask where private and global mutations may fall among the viral genes. Public mutations are too few to be included in this analysis. We compare the S (Spike) protein with the rest of the viral genome. As shown in table 1, global mutations tend to fall in the S protein more often than expected, based on the size consideration (13% of the genome). Indeed, S protein mutations are widely known to affect viral transmission via cell attachment and entry. Interestingly, private mutations do not show an aggregation on the S protein. Perhaps, given the small number of virions that are transmitted between individuals (see the next section), the ability to be attached to cells is critical. In intrahost selection, the number of virions is so large that many other forces may be at least as important as the attachment efficiency.
In summary, we ask whether the selective forces in the two stages are correlated. While the transmission patterns of figures 2-4 do not find evidence of strong correlation, figure 5 offers a more definitive answer. Whether an advantage in Stage I is advantageous, neutral or disadvantageous in Stage II would depend on how often the fitness traits in the two stages overlap. Indeed, the two types of traits may even be antagonistic (see Discussion).
In this last section, we address the $N_0$ estimation. The whole study is based on the transmission of within-host diversity from the donor to the recipients. Hence, if $N_0$ is (or is very close to) 1, then no diversity could be transmitted.
Although several studies (Braun et al. 2021; Lythgoe et al. 2021; Martin and Koelle 2021; Wang, Wang, et al. 2021; Li, Deng, et al. 2022) estimate a very tight bottleneck $N_0$, often including $N_0 = 1$ in the procedure, these calculations are flawed as explained below.
Most studies use the full dataset, as in figure 6A and B (from Popa et al.), which shows many sample-specific variants either on the X-axis (donor specific) or Y-axis (recipient specific). These variants most likely have emerged after, and hence were not involved in, the transmission. As the de novo variants are maximally different between donor and recipient, they would yield a maximum likelihood estimate (MLE) of $N_0 = 1$ by the binomial sampling. In such cases, MLE is simply "the best among the incorrect" as shown in figure 6C. The red dots represent the donor-recipient relationship at $N_0 = 1$, which is a far departure from those of figure 6A and B. As $N_0$ increases, figure 6D shows the pattern of $N_0 = 20$; if $N_0 = 100$, the pattern is shown by the black dots of figure 6C.
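The effect of the bottleneck size on donor-recipient frequencies can be illustrated with a small simulation of binomial sampling. This is only a sketch under simplifying assumptions: founding virions are drawn independently from the donor, no selection or de novo mutation occurs after transmission, and the donor frequencies, replicate count, and function names are hypothetical.

```python
import random


def transmit(donor_freq: float, n0: int, rng: random.Random) -> float:
    """Recipient frequency after sampling n0 founding virions binomially."""
    carriers = sum(1 for _ in range(n0) if rng.random() < donor_freq)
    return carriers / n0


def simulate_pairs(n0, donor_freqs, reps, rng):
    """Return (donor, recipient) frequency pairs for one bottleneck size."""
    return [(p, transmit(p, n0, rng)) for p in donor_freqs for _ in range(reps)]


def mean_abs_shift(pairs):
    """Average absolute donor-to-recipient change in frequency."""
    return sum(abs(r - d) for d, r in pairs) / len(pairs)


rng = random.Random(0)  # fixed seed so the illustration is repeatable
donor_freqs = [0.1, 0.3, 0.5, 0.7, 0.9]  # hypothetical donor iSNV frequencies
pairs_by_n0 = {n0: simulate_pairs(n0, donor_freqs, 200, rng) for n0 in (1, 20, 100)}
shifts = {n0: mean_abs_shift(pairs) for n0, pairs in pairs_by_n0.items()}
print(shifts)
```

With $N_0 = 1$ every recipient frequency is 0 or 1, and the scatter around the diagonal shrinks as $N_0$ grows. Note also that, under pure sampling, a variant absent from the donor stays absent in the recipient, so recipient-specific variants cannot be produced by any bottleneck size in this model.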
Overall, if we factor in measurement errors in the estimation, the prudent (and conservative) estimation would be $N_0 \geq 10$, even if the actual $N_0$ is 1,000. Most important, the intrahost polymorphism should be integrated into the analyses except when $N_0 \sim 1$, an estimate that can be convincingly rejected.
Finally, in an attempt that is not overly conservative, we estimate $N_0$ by the beta-binomial method (Sobel Leonard et al. 2017). Sample-specific variants (i.e., variants detected only in donors or recipients) are excluded from the estimation as almost all of them are de novo mutations. Among the 40 available transmission pairs, the estimates from three pairs are outliers (the red-border box) in figure 6F. The low estimates, due mainly to three advantageous variants (C925T, A6823G, C27092T; see figs.

Discussion
Any virus in the course of evolution has to go through two stages. It has to rise to a high frequency within the individual(s) to have a chance for transmission (Stage I). In Stage II, the virus has to enable the host to transmit it. We document in this study that the selective forces in the two stages are uncorrelated, and possibly antagonistic. In the extreme case, where every mutation that manages to become dominant within individuals is unable to spread in the population, or vice versa, viral evolution simply could not proceed. We have previously reported that SARS-CoV-2 has been in a "runaway" mode that sped up its evolution greatly (Ruan, Hou, et al. 2022). This report shows that this runaway evolution may have been tempered or constrained by the two-stage evolution.
There are many reasons why selection may operate divergently within versus between hosts. For example, a mutation that causes faster viral growth in all tissues outside of the respiratory tract may be the dominant strain in the host, but this mutation could not be transmitted. On the other hand, a cold-temperature tolerant mutant that is suited to transmission may not compete well within the host. Several lines of evidence have shown that strains more competitive in the hosts often lose out to the less competitive ones in human populations. For example, Omicron is less efficient in replication and fusion compared with Delta (Zhao et al. 2022), but Omicron has displaced Delta in human populations. Also, Omicron is more infectious than Delta but has a lower viral load than Delta (Puhach et al. 2022), even in rhesus macaque (van Doremalen et al. 2022). In other cases, the trend also appears true. For example, in chronic SARS-CoV-2 infections, Kemp et al. (2021) found that a single spike mutation D796H that decreases susceptibility to neutralizing antibodies actually results in infectivity decline. A different study (Lee et al. 2023) also found that the spike M1237I mutation increases viral assembly and secretion but decreases efficiency of transmission. The evidence supports the proposition that selection in Stage I and Stage II may be antagonistic.
The antagonism enables the mutations that are deleterious in Stage I evolution (but generally gain fitness advantage in Stage II evolution) to persist in multiple hosts for a long time, greatly retaining the genetic diversity of the virus. At the same time, many adaptive mutations would emerge during Stage I evolution, although these mutations may have no competitive advantage in Stage II evolution. Most spontaneous mutations are deleterious according to evolutionary theory (Shen et al. 2022), so there are very few mutations that are adaptive in both Stage I and Stage II evolution. However, antagonistic pleiotropy (Williams 1957) allows the mutations, which are only partially favorable in either Stage I or Stage II evolution, to have more staying power in an evolutionary context. In this way, the virus can weigh its competitive advantages during the two stages, and finally form a VOC that gains overall benefit within and between hosts by possible hitchhiking or recombination.
We hence propose a model in figure 7 where a mutant has to rise to a high frequency in Stage I (the lower panel for iSNVs) before it can enter the competition in Stage II (the upper panel for SNPs). The model incorporates three types of iSNVs as presented in Results. Type I is the mutations of figure 4 that have high fitness advantage within hosts but do not get transmitted between hosts. Type I mutations contribute little to the long-term viral evolution.
Type II iSNVs confer moderate advantages in Stage I. These mutations must increase their frequencies step by step via multiple hosts (shown by the staircase trajectory), thus requiring much longer time to become fixed iSNVs than Type I mutations. It is expected that Type II mutations accumulate continually in this slow process. We also note that even a moderate advantage in Stage I may be associated with a disadvantage in Stage II. Even with a fitness disadvantage in Stage II (basic reproductive number $R_0 < 1$), Type II mutations could still spread among multiple hosts due to the stochasticity of early transmission but eventually become extinct in host population (Ruan, Wen, et al. 2021). Hence, only a small fraction of advantageous mutations of Type II could be established in the host population.
Type III iSNVs could confer an advantage in Stage II but few of them would realize that potential as they generally do not get out of the gate in Stage I. Occasionally, they may hitchhike with Type II mutations to a high frequency in Stage I. In reciprocity, Type III mutations can compensate for the transmission limitation of Type II mutations, eventually leading to the emergence of successful strains.
Interestingly, hitchhiking and compensation have been detected in persistent SARS-CoV-2 infection in immunosuppressed individuals (Kemp et al. 2021). The mutant D796H alluded to above is a Type II mutation found in the patients. After convalescent plasma therapy, a spike deletion mutant ΔH69/ΔV70, with a higher level of infectivity, compensates for the reduced infectivity of the D796H mutation. With the double mutants of D796H and ΔH69/ΔV70, the strain became dominant in the host. Furthermore, in our study, mutation T27049C may be a Type III mutation as it occurs in 41 patients, but at low iSNV frequencies of 5-11% (supplementary fig. S1, Supplementary Material online and fig. 1). In other words, T27049C has limited within-host proliferation but appears to be good at transmission.
The model thus explains a most perplexing feature of SARS-CoV-2 evolution. Since the beginning of COVID-19, there have been six waves of viral strains, referred to as W0-W5 (Ruan, Hou, et al. 2022) where W3, W4, and W5 are, respectively, the Alpha, Delta, and Omicron wave.

Each wave carries a set of mutations (21 for Alpha, 31 for Delta, and >50 for Omicron) that represent a complete replacement of those of the previous wave. Strikingly, each replacement happened in a few weeks with the sudden appearance of a new strain carrying the full set of mutations (Wei et al. 2021; Mallapaty 2022; Ruan, Hou, et al. 2022; Ruan, Wen, et al. 2022). The best documented replacement is the Alpha-Delta transition, in which Delta swept through within a month. The mechanism can be explained by the model of figure 7 whereby multiple Type II and III mutations are slowly assembled into a new strain. The process happens in only a few individuals. Because the process is hardly noticeable during the assembly phase, the eventual emergence of the new strain would appear to be very sudden. This suddenness is merely a perception. Several hypotheses of VOC origins (Kemp et al. 2021; Oude Munnink et al. 2021; Wei et al. 2021; Du et al. 2022; Ghafari et al. 2022; Hill et al. 2022; Mallapaty 2022; Magiorkinis 2023; Markov et al. 2023) have been proposed to understand the emergence of VOCs, but the lack of intermediate sequences is an important obstacle to our accurate understanding of the origin of VOCs. All five VOCs (Alpha, Beta, Gamma, Delta, and Omicron) had evolved from the pre-VOC progenitors, rather than from one another (Carabelli et al. 2023), suggesting the undetected lineages could be evolving for a long time. These pre-VOCs may be largely noncompeting and likely occupy semi-independent epidemiological niches that are not regionally defined (Mutz et al. 2022). The uncorrelated, and possibly antagonistic, driving forces in Stage I and Stage II evolution found in this study provide a new explanation for the lack of intermediate sequences and the possible emergence pattern of VOCs.
Long before Delta became prevalent, most (27) of the 31 Delta mutations were already present at very low frequency in India (Ruan, Hou, et al. 2022). Unlike typical natural populations whereby such rare mutations are scattered across haplotypes with each harboring 1-2 such mutations, all 27 rare mutations are found on the same, albeit rare, haplotype. Importantly, although a rare haplotype can be quickly lost in most evolutionary processes, such a rare viral strain would not be lost in the population due to its intrahost advantage, stated explicitly in figure 7. The sudden appearance has at times been taken in the literature to imply the existence of animal reservoirs (Oude Munnink et al. 2021; Wei et al. 2021; Mallapaty 2022). For example, Wei et al. (2021) have suggested that Omicron was assembled in mice before it jumped to humans. Such an explanation has its limitation because Delta, as well as other new strains, also experienced the swift replacement but these strains are still believed to have evolved solely in humans.
The transmission bottleneck of SARS-CoV-2 is a controversial issue (Popa et al. 2020;Armero et al. 2021;Braun et al. 2021;Lythgoe et al. 2021;Martin and Koelle 2021;Li, Deng, et al. 2022;Li, Du, et al. 2022).Our analysis suggests that N 0 has been severely underestimated, mainly because the genetic divergence between donor and recipient is exaggerated.While it is true that "the larger the divergence, the smaller the N 0 estimate", small N 0 in fact does not lead to the divergence actually observed.The divergence between donor and recipient is often the results of de novo mutations that fall on the X and Y axes of figure 6.Even N 0 = 1 could not account for the divergence.In some cases, a few advantageous mutations may also bias the N 0 estimate downward whereas small N 0 should affect all mutations.As in some other reports (Popaet al. 2020),

FIG. 7. The evolutionary model of variant of concern (VOC). There are three main types of variants in the two-stage evolution. The lower and upper panels depict Stage I and Stage II evolution, respectively. Type I (yellow) has high intrahost fitness but is limited in its ability to transmit. Type II (blue) is moderately advantageous within the host but slightly disadvantageous or neutral in Stage II evolution. Type III (red) gains an advantage in interhost transmission but generally cannot get out of the gate in Stage I evolution. The staircase trajectory represents the transmission between hosts, highlighted by a circle. Since it is unlikely for a single mutation to be beneficial in both stages, a Type III variant may hitchhike with a Type II variant to a high frequency in Stage I. At the same time, the Type III variant can compensate for the transmission deficiency of the Type II variant, leading to the emergence of a VOC (purple line).
our analyses show N0 to be at least in the hundreds and large enough to transmit the genetic diversity between hosts.
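The bottleneck argument above can be illustrated with a small simulation of binomial sampling, the null model shown in figure 6C. This is a minimal sketch and not the estimation procedure used in the study: the function names, the donor frequencies, and the random seed are assumptions chosen for illustration, and the model ignores selection and de novo mutation.

```python
import math
import random


def sample_recipient_af(donor_af, n0, rng):
    """Recipient allele frequency after n0 founding virions are drawn
    from the donor by binomial sampling (no selection, no new mutation)."""
    carriers = sum(1 for _ in range(n0) if rng.random() < donor_af)
    return carriers / n0


def binomial_sd(donor_af, n0):
    """Expected standard deviation of the recipient frequency under
    binomial sampling of n0 founders."""
    return math.sqrt(donor_af * (1.0 - donor_af) / n0)


def simulate(donor_afs, n0, seed=0):
    """Simulate one transmission event for each donor allele frequency."""
    rng = random.Random(seed)  # fixed seed (assumption) for repeatability
    return [sample_recipient_af(p, n0, rng) for p in donor_afs]


# Hypothetical donor frequencies, for illustration only
donor_afs = [0.05, 0.2, 0.5, 0.8]
af_n0_1 = simulate(donor_afs, n0=1)      # every variant is fixed or lost
af_n0_100 = simulate(donor_afs, n0=100)  # frequencies stay near the donor's
```

With N0 = 1 every variant in the recipient is either lost (0) or fixed (1), which corresponds to the pattern marked by the arrows in figure 6C. With N0 = 100 the recipient frequencies scatter around the donor values with a standard deviation of at most 0.05.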
In this context, a key question about COVID-19 3 years after its onset is whether Omicron is the last wave. While subvariant VOCs of Omicron are common, the threat would come from a new wave of variants that share no mutations with Omicron. It is not far-fetched that Delta may re-emerge from the ashes, as Delta has not entirely disappeared (Yaniv et al. 2022). The re-emergence of a previous wave has been reported; for example, Wave 1 of Ruan, Hou, et al. (2022) disappeared after W2 but later re-emerged as W3 (Alpha) after the acquisition of additional mutations. The monitoring of VOCs should incorporate the features of figure 7 by focusing on potential new waves in addition to new subvariants of Omicron. In conclusion, Stage I appears to exert strong selective pressure on SARS-CoV-2, as it filters out many mutations and deprives them of the opportunity to compete in Stage II. This stage of evolution has been neglected in previous studies and deserves much more attention.

Samples and Transmission Network
Our study included 79 COVID-19 patients infected with the SARS-CoV-2 Delta strain who were admitted to the Guangzhou Eighth People's Hospital from May 21 to June 18, 2021. All patients in this cohort were confirmed by the local Centers for Disease Control and transferred to Guangzhou Eighth People's Hospital, Guangzhou. Epidemiological data were collected, including histories of direct exposure to confirmed cases (see supplementary table S2, Supplementary Material online). Transmission chains were visualized with Cytoscape v3.9.1 (Shannon et al. 2003).

Viral RNA Sequencing
The sequencing library was prepared using an amplicon-based enrichment method as described previously (Wang, Chen, et al. 2021). All samples were sequenced on the MGISEQ-2000 platform.

Reanalysis of Previously Published SARS-CoV-2 Data
We reanalyzed 138 COVID-19 samples with clinical information from the data of Popa et al. (2020), which include 39 transmission pairs. We downloaded the clinical information and VCF files available at https://doi.org/10.5281/zenodo.5224640. We used Python scripts to merge the iSNV frequencies of these 138 samples. For each transmission pair, we identified the variants at a frequency of ≥1% and showed the allele frequency change between donor and recipient. We used a transmission bottleneck (N0) threshold of 100, as estimated by Martin and Koelle (2021), to divide the alleles into two groups.
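The merging and filtering steps described above can be sketched as follows. This is a minimal illustration, not the scripts used in the study. The input layout (one dictionary of variant frequencies per sample), the function names, and the variant labels are assumptions, parsing of the actual VCF files is omitted, and the sketch assumes that the ≥1% threshold may be met in either the donor or the recipient.

```python
def merge_isnv_frequencies(samples):
    """Merge per-sample iSNV frequencies into one table.

    `samples` maps sample ID -> {variant label: frequency}.
    Returns {variant label: {sample ID: frequency}}, using 0.0 where
    a variant was not called in a sample.
    """
    variants = sorted({v for freqs in samples.values() for v in freqs})
    return {
        v: {sid: freqs.get(v, 0.0) for sid, freqs in samples.items()}
        for v in variants
    }


def pair_frequency_changes(table, donor, recipient, min_freq=0.01):
    """Return (variant, donor AF, recipient AF) for every variant that
    reaches min_freq in the donor or the recipient (an assumption)."""
    rows = []
    for variant, freqs in table.items():
        d, r = freqs[donor], freqs[recipient]
        if max(d, r) >= min_freq:
            rows.append((variant, d, r))
    return rows


# Toy data with hypothetical labels and values, for illustration only
samples = {
    "donor_1": {"pos300:C>T": 1.0, "pos100:A>G": 0.12, "pos2000:G>A": 0.004},
    "recipient_1": {"pos300:C>T": 1.0, "pos100:A>G": 0.09},
}
table = merge_isnv_frequencies(samples)
changes = pair_frequency_changes(table, "donor_1", "recipient_1")
```

In this toy example the variant at 0.4% in the donor falls below the 1% threshold and is dropped, while the other two are kept for the donor-recipient comparison.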

Calculating the Number of Nonsynonymous (N) and Synonymous Sites (S) in SARS-CoV-2 Reference Genome
We downloaded 12 coding region sequences (CDSs) of the SARS-CoV-2 reference genome (Wuhan-Hu-1, GenBank accession no. NC_045512.2) from NCBI, including ORF1ab, ORF1a, S, ORF3a, E, M, ORF6, ORF7a, ORF7b, ORF8, N, and ORF10. We removed the stop codons of all 12 CDSs first. Production of pp1ab depends on the occurrence of a −1 programmed ribosomal frameshift at nucleotide 13,468, just four codons upstream of the ORF1a (266-13,483) termination codon. After cutting the overlapping segments (nucleotides 266-13,468) between ORF1ab and ORF1a from ORF1a, we concatenated the trimmed ORF1a with the remaining 11 CDSs (including ORF1ab) into a single sequence (29,244 nucleotides in total). YN00 from PAML v4.9a (Yang 2007) was then used to calculate N (the number of nonsynonymous sites) and S (the number of synonymous sites). There are 22,599.3 nonsynonymous (N) and 6,644.7 synonymous (S) sites in the coding regions of the reference genome. Thus, with no selection, the A/S ratio should be close to 3.4 (22,599.3/6,644.7).
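The arithmetic behind the neutral expectation can be checked with a short sketch. The site counts are those reported above; the function names and the normalization of an observed A:S ratio by the neutral expectation are assumptions added for illustration and are not part of the YN00 procedure.

```python
# Site counts reported in the text (YN00 on the concatenated CDSs)
N_SITES = 22599.3  # nonsynonymous sites
S_SITES = 6644.7   # synonymous sites


def neutral_as_ratio(n_sites=N_SITES, s_sites=S_SITES):
    """Expected ratio of nonsynonymous (A) to synonymous (S) mutations
    when selection is absent."""
    return n_sites / s_sites


def normalized_as_ratio(a_count, s_count, n_sites=N_SITES, s_sites=S_SITES):
    """Observed A:S ratio divided by the neutral expectation.

    Values near 1 would be consistent with neutrality; values below or
    above 1 would suggest negative or positive selection, respectively.
    """
    if s_count == 0:
        raise ValueError("s_count must be positive")
    return (a_count / s_count) / neutral_as_ratio(n_sites, s_sites)


total_sites = N_SITES + S_SITES   # length of the concatenated sequence
expected = neutral_as_ratio()     # neutral A/S expectation
```

The two site counts sum to the 29,244 nucleotides of the concatenated sequence, and their ratio rounds to the 3.4 quoted in the text.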

Genetic Drift in a Growing Population
Based on a branching process, Chen et al. (2017) derived the genetic drift after a single generation. Here, we extend their result to obtain the genetic drift after multiple generations, which can be used to estimate the variance of the alternative allele frequency within a host. According to Chen et al. (2017), the average and variance of population size at time t are

FIG. 4. The limited spread of nine private mutations with strong intrahost fitness. Each of the nine mutations is present and, most importantly, fixed in only one host (marked by the red-border box). They are absent both upstream and downstream of this one patient, thus suggesting a large fitness gain within the host but little or no transmission advantage between hosts.

FIG. 5. The number of nonsynonymous and synonymous mutations within and among hosts. The lower panel shows the relationship between iSNV frequency (Y-axis) and the occurrence of iSNVs (X-axis) in 79 patients. Each nonsynonymous mutation (A) or synonymous mutation (S) is shown by a red triangle or circle. The upper panel shows the number of A and S mutations across different occurrences of iSNVs. The red-border box depicts the iSNV evolution and the black-border box depicts the evolution of high-frequency iSNVs in the human population. The A:S ratios show how positive and negative selection operate in the viral evolution (see the main text).
2 and 6E) are highly biased and should be excluded in N0 estimation. The remaining 37 pairs yield N0 estimates of 70-500 in 15 pairs and 1,200-1,500 in 22 pairs (fig. 6F). Our estimation is thus in agreement with the study that furnishes figure 6A and B (Popa et al. 2020) by rejecting N0 ∼ 1.

FIG. 6. Allele frequency (AF) changes between donor and recipient used in estimating N0. (A and B) AF changes among 39 donor-recipient pairs (Popa et al. 2020). (B) magnifies the low-frequency portion of (A). (C) The expected AF change in a donor-recipient pair when N0 = 1 (red points) or 100 (black points) based on binomial sampling. The arrows indicate the distribution of fixed or lost variants when N0 = 1. (D) The change of AF when N0 = 20. (E) Allele frequencies of 40 donor-recipient pairs in this study. The sites used to estimate N0 are marked by orange points, which are detected in both donors and recipients. Orange dashed lines show the frequency threshold of 5%. (F) Estimated N0 across 40 transmission pairs. Among the 40 available pairs, the low estimates from three pairs are outliers (red-border box) due to the presence of advantageous variants (C925T, A6823G, C27092T). Orange points represent the maximum likelihood estimates and the error bars denote the 95% confidence interval.