-
PDF
- Split View
-
Views
-
Annotate
-
Cite
Cite
Yongsen Ruan, Zhida Luo, Xiaolu Tang, Guanghao Li, Haijun Wen, Xionglei He, Xuemei Lu, Jian Lu, Chung-I Wu, On the founder effect in COVID-19 outbreaks: how many infected travelers may have started them all?, National Science Review, Volume 8, Issue 1, January 2021, nwaa246, https://doi.org/10.1093/nsr/nwaa246
- Share Icon Share
Abstract
How many incoming travelers (I0 at time 0, equivalent to the ‘founders’ in evolutionary genetics) infected with SARS-CoV-2 who visit or return to a region could have started the epidemic of that region? I0 would be informative about the initiation and progression of epidemics. To obtain I0, we analyze the genetic divergence among viral populations of different regions. By applying the ‘individual-output’ model of genetic drift to the SARS-CoV-2 diversities, we obtain I0 < 10, which could have been achieved by one infected traveler in a long-distance flight. The conclusion is robust regardless of the source population, the continuation of inputs (It for t > 0) or the fitness of the variants. With such a tiny trickle of human movement igniting many outbreaks, the crucial stage of repressing an epidemic in any region should, therefore, be the very first sign of local contagion when positive cases first become identifiable. The implications of the highly ‘portable’ epidemics, including their early evolution prior to any outbreak, are explored in the companion study (Ruan et al., personal communication).
INTRODUCTION
It is generally accepted that, in principle, a small number of infected people arriving in a new place could trigger an epidemic (if the basic reproduction number, R0, is not too small [1–3]). The main issue is how many travelers actually started each epidemic. Here, ‘a new place’ may mean a country, or a bordered region, within which the bulk of human interactions happen. Relative to the within-region movement, a bordered region is lightly connected to the rest of the world. Since the epidemic in any bordered region could have been started by one single infected traveler, or by 1000 of them, we take the population genetic approach to analyzing the divergence among viral populations in relation to the ‘founder effect’ [4].
We shall let It be the amount of input at time t (i.e. the number of infected people coming into an uninfected region). The crucial number is I0, i.e. the first batch of input. The magnitude of It is important in public health practice. If It has been large with continual input lasting for weeks, then a bordered region may be able to prevent the epidemic from being exported to (or being imported from) other regions, solely by restricting human movements out of (or into) its borders. On the other hand, if the epidemic in a region could be started with I0 < 10 (and It>0 ∼ 0), then sealing off either emigration or immigration would not be effective in stopping a pandemic. Unless the bordered regions are maintaining zero infections, the danger would be coming mainly from within their borders. Here, we aim to infer It, in particular, in the early period of an epidemic.
In this study, we use a population genetic framework [5]. Because the focus is the stochastic differentiation among viral populations, epidemiological models generally do not cover such topics. The genetic drift formulation used here also permits the calculation of the extinction probability of the invasion of the virus ([5], Ruan et al., personal communication). Epidemiological parameters, such as the number of uninfected individuals, the effects of quarantine and the development of immunity [6,7], are not considered here as population differentiation takes place in the earliest stage of the invasion. During this stage, neither quarantine nor herd immunity has yet become a major factor in the outbreak.
Theory
To estimate It, a conventional method is to inspect the changes in the population size of the viruses, Nt [8,9]. Viral population size corresponds to the number of infected individuals, assuming a viral clone in each person. Because Nt is only weakly dependent on It, the conventional approach does not offer the resolution we seek for. It would be more informative to examine multiple populations for their differences in genetic polymorphism. The differences would depend strongly on It at the very beginning of the epidemic (Fig. 1).

Schematic diagram of viral population divergence among regions. In G0 (generation 0), I0 = 8 and infected individuals arrive in regions 1–3 with 3, 5 and 7 of them, respectively, carrying the L-type virus. In the beginning, genetic drift is particularly strong and the frequency of L fluctuates as modeled by the Branching Process. G8 is about 5 weeks after the first arrival when the data are collected. After G8, the fluctuation is greatly dampened due to the large population size. In the later stage (after G20), even weak selection could drive gene frequency toward the fixation of the more contagious genotype. Regions 4–6 are not independent samples and are not included in the analysis (see Data of the main text).
For studying population differentiation, the source population infecting the travelers needs to harbor genetic variants in non-trivial frequencies to yield informative data [10]. For example, Tang et al. [11] reported the existence of two lineages that are distinguishable by two Single Nucleotide Polymorphisms (SNPs), one being a Serine/Lysine (S/L) polymorphism. According to ref. 11, the S lineage accounts for ∼30% and the L lineage accounts for ∼70% among the 103 viral genomes they examined.
For the ease of estimating It’s, the variants should ideally be neutral in fitness. Indeed, since variants under selection would have a short retention time in the population, SNPs are often neutral [12–14]. While the fitness differential between the L and S variants remains unclear, our simulations show that the estimation of It is only weakly dependent on selection (see below).
The estimation of I0 as well as It>0 are conducted for multiple viral populations, which should ideally originate from independent samples. Among these populations, we model their differentiation as a function of It. It is not likely to be very large between regions (including the source) reachable only by air flight, each of which may carry at most a small number of infected passengers. As long as all extant populations are derived from the same source population, the estimates of It are only weakly dependent on the actual genetic make-up of the source. For that reason, the source population need not be known.
The hypothesis is that the viral populations seeded by the infected travelers have experienced strong fluctuation in gene frequency. This may happen at the beginning of the epidemic when Nt is small. Soon afterwards, the fluctuation in gene frequencies would be quickly dampened as Nt grows. The fluctuation in gene frequencies due to the random transmissions of genes is referred to as genetic drift [14,15] or the founder effect [4]. The standard formulation of genetic drift by the Wright-Fisher model (or the alternative model of Moran [16]) is not applicable for tracking the viral population. Instead, we use the ‘individual output’ model we previously proposed [5]. All models assume discrete generations. Based on the infection dynamics estimated in a recent study [17], we assume that each discrete generation is ∼4 days. If we use a longer or shorter generation time, the outcome would be similar as long as the progeny production is calibrated with the generation time.
Simulations
Based on the definition above, we simulate the stochastic changes of the viral population until it reaches the 20th generation (i.e. t = 20). Since genetic drift is negligible when Nt is large (e.g. >105), we simulate the trajectory by a deterministic model when Nt > 105.
Defining the parameter set (I0, T, X0, s) and the distribution of k
We set I0 = 1, 5, 10, 20, 50, 100 and T = 0, 1, 2, 3, 20. Thus, (I0, T) = (5, 3) means five travelers each generation for four generations, counting T = 0 as the initial batch. We set T in this range but will show that T = 0, 1 or 20 hardly matters. The frequency of L lineage in the source population is X0 = 0.1, 0.3, 0.5, 0.7, 0.9. Again, the polymorphism frequency in the source population has turned out to have little effect on the differentiation among populations. In addition, we also set s = 0, 0.1 to study the effect of selection. For each parameter set (I0, T, X0, s), we repeat the simulation 100 times.
Data
The estimation of I0 as well as It>0 should be done on multiple viral populations that are independent samples from the same source. If they are not fully independent, then our estimates of It would be conservative (i.e. over-estimation) since any exchange between populations should reduce the divergence.
Here, we generate two sets of data as shown in Table 1. In the hypothetical Set I, the gene frequency is taken from a normal distribution with the mean of 0.7 and standard deviation of 0.06. The mean is close to the average frequency of the L type in this pandemic. Set I is generated to show the possible pattern of population divergence if I0 is in the hundreds.
Region . | A1a . | A2 . | A3 . | B1 . | C1 . | C2 . | C3 . | D1 . | D2b . | D3 . |
---|---|---|---|---|---|---|---|---|---|---|
I (hypothetical) | 0.73 | 0.70 | 0.71 | 0.64 | 0.75 | 0.61 | 0.70 | 0.74 | 0.82 | 0.69 |
II (realistic) | 0.70 | 0.95 | 0.40 | 0.80 | 1.00 | 0.99 | 0.50 | 0.33 | 0.95 | 0.67 |
Region . | A1a . | A2 . | A3 . | B1 . | C1 . | C2 . | C3 . | D1 . | D2b . | D3 . |
---|---|---|---|---|---|---|---|---|---|---|
I (hypothetical) | 0.73 | 0.70 | 0.71 | 0.64 | 0.75 | 0.61 | 0.70 | 0.74 | 0.82 | 0.69 |
II (realistic) | 0.70 | 0.95 | 0.40 | 0.80 | 1.00 | 0.99 | 0.50 | 0.33 | 0.95 | 0.67 |
aEach letter in the region code indicates a separate continent and the number indicates a country or region; bD2 could have been derived from C1 or C2 due to the inter-continental travel pattern. Its exclusion from the analysis would increase the spread of Fst in Figs 2–4 and would thus lead to a smaller I0 estimate.
Region . | A1a . | A2 . | A3 . | B1 . | C1 . | C2 . | C3 . | D1 . | D2b . | D3 . |
---|---|---|---|---|---|---|---|---|---|---|
I (hypothetical) | 0.73 | 0.70 | 0.71 | 0.64 | 0.75 | 0.61 | 0.70 | 0.74 | 0.82 | 0.69 |
II (realistic) | 0.70 | 0.95 | 0.40 | 0.80 | 1.00 | 0.99 | 0.50 | 0.33 | 0.95 | 0.67 |
Region . | A1a . | A2 . | A3 . | B1 . | C1 . | C2 . | C3 . | D1 . | D2b . | D3 . |
---|---|---|---|---|---|---|---|---|---|---|
I (hypothetical) | 0.73 | 0.70 | 0.71 | 0.64 | 0.75 | 0.61 | 0.70 | 0.74 | 0.82 | 0.69 |
II (realistic) | 0.70 | 0.95 | 0.40 | 0.80 | 1.00 | 0.99 | 0.50 | 0.33 | 0.95 | 0.67 |
aEach letter in the region code indicates a separate continent and the number indicates a country or region; bD2 could have been derived from C1 or C2 due to the inter-continental travel pattern. Its exclusion from the analysis would increase the spread of Fst in Figs 2–4 and would thus lead to a smaller I0 estimate.
In Set II, we assign gene frequencies to 10 populations in Table 1 using the reported frequencies as a guide (GISAID (https://www.gisaid.org/); see Supplement). These 10 populations, distributed among four continents, are as likely to be independently derived as we could ascertain. The choice is based on three criteria: (i) the geography of the countries/regions and the distance between them; (ii) the timing of the documented onset of the epidemic; (iii) the abundance of DNA sequences. We consult the frequencies in samples collected before late March 2020, corresponding roughly to G8 in Fig. 1. Due to the rapidly changing data reporting (GISAID [26]), the frequency profile of Set II is plausibly realistic as reported in mid-April (see Supplement). Readers with access to the more up-to-date data can compare the new observations with the theoretical results to improve the estimation of It.
Comparisons between simulations and data
In Fig. 2, the Fst distributions based on the datasets of Table 1 are presented. The two very different distributions would be informative when compared with the simulated distributions.
Given the large number of possible combinations of parameters, (I0, T, X0, s), the task could have been daunting. Fortunately, all but one parameter would have little impact on the results. In Fig. 3A–D, only X0 varies and the Fst distributions are very similar. The simplest explanation is that small populations would all deviate from the initial frequency, X0, regardless of where it is. Figure 3E–H shows that the results do not depend strongly on T either, T being the time duration of traveler input. Since, at T = 2, the number of infected individuals would have swelled by 20-fold without any new input, the results are not significantly affected by later inputs.

The Fst distribution at G20. For each parameter set of (I0, T, X0, s), we repeat the simulations 100 times. For all panels, I0 = 10, s = 0. Panel A–D, T = 20 and X0 ranges from 0.1 to 0.7. Panel E–H, X0 = 0.7, T ranges from 0 to 20. These eight panels show that neither X0 nor T would impact the Fst distribution much.
In Fig. 4, the value of I0 is varied from 10, 50 to 100. Here, we assume no travelers arriving after the first generation because their contributions, as shown in Fig. 3E–H, are insubstantial. The left panels (Fig. 4A, C and E) show the trajectories of 100 populations, which diverge in the first generation or two and then evolve steadily as the population size increases. On the right panels are the comparisons between the simulated and ‘observed’ distributions; the latter being those based on Dataset II (Fig. 4B and D) or Dataset I (Fig. 4F). It is clear that the expected distribution of Fst is very sensitive to It (I0, in particular). The more realistic Dataset II can only be explained by I0 ≤ 10 whereas the artificial Dataset I would agree with I0 ≥ 100. We should note that the extensive survey of parameter space (see Supplement) shows the conclusion of I0 ≤ 10 to be robust.

Comparisons between data and simulations with various I0 values. For all panels, s = 0 (no selection), X0 = 0.7 and T = 0. The value of I0 is shown next to each panel. Panel (A, C and E): the frequency of L type over time (100 repeats), the average is showed by the orange dotted line. Panel (B and D): Fst distribution; the simulation results are in black and the distributions from Dataset II (realistic data) are in orange. Panel (F): like panels B and D but Dataset I is used.
In the results presented above, L and S lineages are assumed neutral. Intuitively, selection should drive the trajectories to converge and the left panels (Fig. 5A, C and E) do show that trend. If we let the L type enjoy a 10% selective advantage, the results of Fig. 5 still indicate I0 < 10. Note that the range of I0 spans a smaller range in Fig. 5 than that in Fig. 4. In other words, with selection, the estimated I0 should be even smaller than indicated above.

Same as Fig. 4 except that (i) s = 0.1 with selection favoring the L type; (ii) X0 = 0.5 so the L type would not reach fixation so quickly.
The simulation results are based on the distribution of k that follows the power law (see Equations 6 and 7) with V(k) ∼ 10E(k). Such a large V(k) means that the genetic drift would be very large, requiring large It’s to reduce the drift. It is hence interesting that, even under such stringent conditions, I0 is still <10. In the Supplement, we show that the estimated value of I0 would be substantially lower if we use the Poisson distribution of k, associated with the conventional Wright-Fisher model. With V(k) = E(K), I0 would be 2–4. Hence, the conclusion presented in this section is robust.
Inference of parameters (I0, T, X0, s)
In the last section, we present a range of parameter values that yield the expected population divergence for a comparison with the data. To corroborate the visual comparisons, we also carry out formal inferences using the ABC (Approximate Bayesian Computation) procedure on Dataset II (see Supplement). The results of Fig. S3 and S4 indeed confirm the visual impression as the observed divergence is sensitive only to I0, but not to the variation in T, X0 and s. The formal inference of I0 at 2–5 is even smaller than the visual impression would suggest. Finally, this analysis considers only the number of infections that can spread the virus further. A more extensive model that incorporates the development of symptoms, the border control and the local quarantine, all of which may contribute to the suppression of the epidemics, will be presented later (Ruan et al., personal communication).
DISCUSSION
In the theory of genetic drift [5], even 100 infected travelers from a source viral population would give rise to a fairly uniform level of genetic polymorphism among bordered regions. In contrast, the reported data indicate substantial divergence among countries (GISAID (https://www.gisaid.org/); see Supplement). Dataset II of Table 1 is realistic in this respect. The divergent polymorphisms across countries depend mainly on a critical parameter—the size of the first cohort arriving in a country, I0, which is estimated to be <10. The number may in fact be smaller than it seems since a long distance flight carrying one single infected but symptomless patient could infect this many people, all of whom would be without symptoms upon arrival [27].
On the robustness of the estimation
In Figs 3 and 5, we show that, despite the complexity of the model with many parameters, none of them (T, X0, s), except I0, plays a significant role in the divergence among viral populations. As discussed in conjunction with Fig. 3, the distribution of k does not matter either. In fact, in the standard Wright-Fisher model, the estimate of I0 would be <5. It is also noted that the E(k) and V(k) values used are for a generation time of 4 days. For a shorter generation time, the values would be correspondingly smaller and the results should be similar.
The model also assumes that each population is an independent sample of the source population. Since all populations are likely to exchange some individuals due to traveling, the actual divergence among populations would be even smaller than simulated. In other words, to attain the observed level of divergence, I0 would have to be even smaller than estimated. Considering all these variables, we believe the conclusion of I0 < 10 to be robust. In the analysis of regional divergence, the results would depend strongly on the smaller I0 of the two regions being compared. Hence, the assumption of the same I0 among all regions would be a reasonable one.
Subsequent viral evolution after arriving in a new continent
While our focus is on the divergence in the first few generations, we now briefly discuss the subsequent evolution after this initial critical period. The primary lineage delineation, the S/L polymorphism defined by two SNPs [11], has many subtypes (see Supplementary data and Table S1 for details). For example, western European countries including Italy, Switzerland, Germany and Belgium are predominantly of the L type with a similar abundance in the L2 subtype. In contrast, while Japan is also predominantly of the L type, it has mainly the L1 subtype. This contrast suggests that Japan may represent an independent sample from the western European samples, which have likely been spreading regionally after the initial seeding. Another example is the S1 and S2 subtypes, which differentiate between the samples from China and the west coast of the US.
These patterns suggest that, after the initial seeding, each major region or continent has been evolving along an independent path. Since the initial seeding may be extremely difficult to prevent, the onus is to suppress the regional spread. The analyses of the subtypes in Asia, Australia and various parts of North America would offer additional details of the spread of the virus, as has been done recently [28–31]. These details are beyond the scope of this study, which focuses on the early stages of the viral spread.
Implications
The analysis suggests that the COVID-19 epidemic in each region surveyed was likely started by a very small number of travelers (I0 < 10). With such a tiny trickle of human movement, it would have been very inefficient for any region to prevent infected individuals from exporting an epidemic to (or importing it from) other places. For that reason, the crucial stage of repressing an epidemic in any region should be the very first sign of local contagion.
Finally, due to the ‘portability’ of COVID-19, each epidemic, including the first one on record, could have easily been imported. Where then did all these epidemics begin? While the interest in the ‘origin’ is intense, we suggest the question be broadened as ‘the origin and early evolution’ of SARS-CoV-2. The latter implies a process whereas the former seems to mean a single time point. The process of early evolution may have stretched over different regions in a long time-span and involved multiple host species. Like many other evolutionary questions on origin, we suggest the question be phrased as the early evolution of SARS-CoV-2, rather than be about the ‘origin’. The former implies a process whereas the latter seems to mean a single time point. This distinction is important as seen in the debates on the ‘origin’ of dogs [32,33] and new species in novel environments [34]. By compressing a process into a simple ‘origin’, we may be asking a false question about, say, ‘the first dog’ or ‘the first patient’. The possible early evolution of SARS-CoV-2 is addressed in the companion study (Ruan et al., personal communication).
SUPPLEMENTARY DATA
Supplementary data are available at NSR online.
Acknowledgements
We thank all those who have contributed sequences to the GISAID (Global Initiative on Sharing All Influenza Data) database (https://www.gisaid.org/). Acknowledgements and detailed information of the genome sequences we consulted for this study are given in Table S2. The actual data fitted to our model were computationally generated (see the main text). We thank Drs Yaping Zhang, Suhua Shi, Weiwei Zhai, Jianzhi George Zhang, Bingjie Chen, Zheng Hu and Jindong Zhao for comments on and suggestions for this manuscript.
FUNDING
This work was supported by grants from the Ministry of Science and Technology of the People's Republic of China (No. 2020YFC0847000) to H.W., X.L. and J.L. This work was funded by the National Natural Science Foundation of China (31730046, 91731000 to C.I.W), Guangdong Basic and Applied Basic Research Foundation (2020B1515020030, 2019A1515010708), and CAS ‘Light of West China’ Program.
Conflict of interest statement. None declared.
REFERENCES