On the founder effect in COVID-19 outbreaks: how many infected travelers may have started them all?

Abstract How many incoming travelers (I0 at time 0, equivalent to the ‘founders’ in evolutionary genetics) infected with SARS-CoV-2 who visit or return to a region could have started the epidemic of that region? I0 would be informative about the initiation and progression of epidemics. To obtain I0, we analyze the genetic divergence among viral populations of different regions. By applying the ‘individual-output’ model of genetic drift to the SARS-CoV-2 diversities, we obtain I0 < 10, which could have been achieved by one infected traveler in a long-distance flight. The conclusion is robust regardless of the source population, the continuation of inputs (It for t > 0) or the fitness of the variants. With such a tiny trickle of human movement igniting many outbreaks, the crucial stage of repressing an epidemic in any region should, therefore, be the very first sign of local contagion when positive cases first become identifiable. The implications of the highly ‘portable’ epidemics, including their early evolution prior to any outbreak, are explored in the companion study (Ruan et al., personal communication).


INTRODUCTION
It is generally accepted that, in principle, a small number of infected people arriving in a new place could trigger an epidemic (if the basic reproduction number, R 0 , is not too small [1][2][3]). The main issue is how many travelers actually started each epidemic. Here, 'a new place' may mean a country, or a bordered region, within which the bulk of human interactions happen. Relative to the within-region movement, a bordered region is lightly connected to the rest of the world. Since the epidemic in any bordered region could have been started by one single infected traveler, or by 1000 of them, we take the population genetic approach to analyzing the divergence among viral populations in relation to the 'founder effect' [4].
We shall let I t be the amount of input at time t (i.e. the number of infected people coming into an uninfected region). The crucial number is I 0 , i.e. the first batch of input. The magnitude of I t is important in public health practice. If I t has been large with continual input lasting for weeks, then a bordered region may be able to prevent the epidemic from being exported to (or being imported from) other regions, solely by restricting human movements out of (or into) its borders. On the other hand, if the epidemic in a region could be started with I 0 < 10 (and I t>0 ∼ 0), then sealing off either emigration or immigration would not be effective in stopping a pandemic. Unless the bordered regions are maintaining zero infections, the danger would be coming mainly from within their borders. Here, we aim to infer I t , in particular, in the early period of an epidemic.
In this study, we use a population genetic framework [5]. Because the focus is the stochastic differentiation among viral populations, epidemiological models generally do not cover such topics. The genetic drift formulation used here also permits the calculation of the extinction probability of the invasion of the virus ( [5], Ruan et al., personal communication). Epidemiological parameters, such as the number of uninfected individuals, the effects of quarantine and the development of immunity [6,7], are not considered here as population differentiation takes place in the C The Author(s) 2020. Published by Oxford University Press on behalf of China Science Publishing & Media Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited. Natl Sci Rev, 2021, Vol. 8, nwaa246 In G 0 (generation 0), I 0 = 8 and infected individuals arrive in regions 1-3 with 3, 5 and 7 of them, respectively, carrying the L-type virus. In the beginning, genetic drift is particularly strong and the frequency of L fluctuates as modeled by the Branching Process. G 8 is about 5 weeks after the first arrival when the data are collected. After G 8 , the fluctuation is greatly dampened due to the large population size. In the later stage (after G 20 ), even weak selection could drive gene frequency toward the fixation of the more contagious genotype. Regions 4-6 are not independent samples and are not included in the analysis (see Data of the main text).
earliest stage of the invasion. During this stage, neither quarantine nor herd immunity has yet become a major factor in the outbreak.

Theory
To estimate I t , a conventional method is to inspect the changes in the population size of the viruses, N t [8,9]. Viral population size corresponds to the number of infected individuals, assuming a viral clone in each person. Because N t is only weakly dependent on I t , the conventional approach does not offer the resolution we seek for. It would be more informative to examine multiple populations for their differences in genetic polymorphism. The differences would depend strongly on I t at the very beginning of the epidemic (Fig. 1). For studying population differentiation, the source population infecting the travelers needs to harbor genetic variants in non-trivial frequencies to yield informative data [10]. For example, Tang et al. [11] reported the existence of two lineages that are distinguishable by two Single Nucleotide Polymorphisms (SNPs), one being a Serine/Lysine (S/L) polymorphism. According to ref. 11, the S lineage accounts for ∼30% and the L lineage accounts for ∼70% among the 103 viral genomes they examined.
For the ease of estimating I t 's, the variants should ideally be neutral in fitness. Indeed, since variants under selection would have a short retention time in the population, SNPs are often neutral [12][13][14]. While the fitness differential between the L and S variants remains unclear, our simulations show that the estimation of I t is only weakly dependent on selection (see below).
The estimation of I 0 as well as I t>0 are conducted for multiple viral populations, which should ideally originate from independent samples. Among these populations, we model their differentiation as a function of I t . I t is not likely to be very large between regions (including the source) reachable only by air flight, each of which may carry at most a small number of infected passengers. As long as all extant populations are derived from the same source population, the estimates of I t are only weakly dependent on the actual genetic make-up of the source. For that reason, the source population need not be known.
The hypothesis is that the viral populations seeded by the infected travelers have experienced strong fluctuation in gene frequency. This may happen at the beginning of the epidemic when N t is small. Soon afterwards, the fluctuation in gene frequencies would be quickly dampened as N t grows. The fluctuation in gene frequencies due to the random transmissions of genes is referred to as genetic drift [14,15] or the founder effect [4]. The standard formulation of genetic drift by the Wright-Fisher model (or the alternative model of Moran [16]) is not applicable for tracking the viral population. Instead, we use the 'individual output' model we previously proposed [5]. All models assume discrete generations. Based on the infection dynamics estimated in a recent study [17], we assume that each discrete generation is ∼4 days. If we use a longer or shorter generation time, the outcome would be similar as long as the progeny production is calibrated with the generation time.
From one generation to the next, each individual produces k 'descendants' (or infects k others) with the mean of E(k) and the variance of V(k). In the Wright-Fisher model, k follows the Poisson distribution and V(k) is tied to E(k) [5]. In the 'individual output' model, k may follow any distribution, which is often measurable but not in any common form. E(k) dictates the population growth, N t , while V(k) determines the fluctuation in N t and in gene frequency. We will attempt to obtain E(k) and V(k) from the empirical data and, for a comparison, will also allow V(k) = E(k) to approximate the Wright-Fisher model. N t is a function of E(k), V(k) and I t .
Here, we assume I t to be a constant, hence, is not too small, E(N T ) would depend mainly on I t of the first few generations. In fact, the results are often similar whether there is constant input or not (i.e. I t = I 0 , or I t = 0 when t ≥ 1). In other words, Equation (1 ) below would yield similar results to Equation (1).
With reasonable accuracy, E(k) can be obtained from the growth trajectory of N t but I 0 has to be obtained by a different means. While many of the assumptions such as exponential growth and the constancy of I t may hold for only a few generations, most of our results depend primarily on the dynamics of the first few generations. The actual trajectory of each population would also depend on V(k). Using the simpler Equation (1 ), To obtain I 0 for Equations (1) or (1 ), we have to model gene frequencies. Using the example of the S/L polymorphism, we let X T be the frequency of the L lineage at time T. If the fitness of the S and L type is the same, then where X 0 is the frequency in the source population.
Equations (3) and (4) will need some modifications if we use Equation (1), or if we consider the fitness difference between L and S (see Supplement).

Simulations
The actual realization of X T in each population is obtained by iteration described here. We assume two types of viruses (L type and S type; [11]). The relative fitness of L type to S type is 1 + s (s = 0 represents no selection). In addition, there is a source population, in which the frequency of the L type is X 0 . At generation t, there will be I t immigrants from the source population. I 0 is the founder population size. A parameter T sets the time limit of immigration. Thus, At generation t − 1, the numbers of the L type and S type are L t−1 and S t−1 respectively. Also, After one generation, there will be I t (I t = I L + I S ; I L , I S are the numbers of L and S type, respectively) immigrants from the source population. In addition, N t−1 will increase to N t. . Thus, at generation t, the numbers of the L and S type are where k i is the progeny number of the i-th individual of either type. The distribution of k i will be defined in the next section. If there is selection, the number of L type will change as follows: At generation t, the population size and the frequency of L type are Based on the definition above, we simulate the stochastic changes of the viral population until it reaches the 20th generation (i.e. t = 20). Since genetic drift is negligible when N t is large (e.g. >10 5 ), we simulate the trajectory by a deterministic model when N t > 10 5 .
To quantify the population differentiation, we calculate the pairwise Fst values [14,18], defined below. With X t and Y t for a pair of populations, wherep = (X t + Y t )/2. If Fst = 0, X t = Y t and the two populations are identical in gene frequency. If Fst = 1, the two populations are maximally differentiated with X t = 0 and Y t = 1, or vice versa. Defining the parameter set (I 0 , T, X 0 , s) and the distribution of k We set I 0 = 1, 5, 10, 20, 50, 100 and T = 0, 1, 2, 3, 20. Thus, (I 0 , T) = (5, 3) means five travelers each generation for four generations, counting T = 0 as the initial batch. We set T in this range but will show that T = 0, 1 or 20 hardly matters. The frequency of L lineage in the source population is X 0 = 0.1, 0.3, 0.5, 0.7, 0.9. Again, the polymorphism frequency in the source population has turned out to have little effect on the differentiation among populations. In addition, we also set s = 0, 0.1 to study the effect of selection. For each parameter set (I 0 , T, X 0 , s), we repeat the simulation 100 times. As stated, the conventional Wright-Fisher Model requires k (progeny number of an individual) to follow a Poisson distribution with V(k) = E(k). Here, we assume the spread of virus to be associated with the social network, which usually follows the power law [19,20]. Specifically, we let k follow Zipf's law (a discrete power-law distribution; [21]): The estimate of the basic reproduction number (R 0 ) of SARS-CoV-2 ranges from 1.4 to 6.5 [22][23][24]. Here, we focus on the early phase of the viral population growth by using R 0 = 6.5. The relationship between E(k) and R 0 is as follows: where τ is the serial interval and G is the generation time, and τ is estimated to be ∼5 days [23,25] and G is 4 days [17]. Then, E (k) = R G /τ 0 = 4.5. (Note that E(k) would become smaller with smaller G, as stated above.) To have k follow the power law with E(k) = 4.5, we assume M = 30 and c = 1.3 in Equation (6), which yields V(k) = 45. As noted in Chen YX et al. [5], the strength of genetic drift depends mainly on E(k) and V(k) rather than the actual distribution. By setting V(k) so large, we ensure that the I 0 estimate would be on the high side (see Discussion).

Data
The estimation of I 0 as well as I t>0 should be done on multiple viral populations that are independent samples from the same source. If they are not fully independent, then our estimates of I t would be conservative (i.e. over-estimation) since any exchange between populations should reduce the divergence.
Here, we generate two sets of data as shown in Table 1. In the hypothetical Set I, the gene frequency is taken from a normal distribution with the mean of 0.7 and standard deviation of 0.06. The mean is close to the average frequency of the L type in this pandemic. Set I is generated to show the possible pattern of population divergence if I 0 is in the hundreds.
In Set II, we assign gene frequencies to 10 populations in Table 1 using the reported frequencies as a guide (GISAID (https://www.gisaid.org/); see Supplement). These 10 populations, distributed among four continents, are as likely to be independently derived as we could ascertain. The choice is based on three criteria: (i) the geography of the countries/regions and the distance between them; (ii) the timing of the documented onset of the epidemic; (iii) the abundance of DNA sequences. We consult the frequencies in samples collected before late March 2020, corresponding roughly to G 8 in Fig. 1. Due to the rapidly changing data reporting (GISAID [26]), the frequency profile of Set II is plausibly realistic as reported in mid-April (see Supplement). Readers with access to the more up-todate data can compare the new observations with the theoretical results to improve the estimation of I t .

Comparisons between simulations and data
In Fig. 2, the Fst distributions based on the datasets of Table 1 are presented. The two very different distributions would be informative when compared with the simulated distributions.
Given the large number of possible combinations of parameters, (I 0 , T, X 0 , s), the task could have been daunting. Fortunately, all but one parameter would have little impact on the results. In Fig. 3A-D, only X 0 varies and the Fst distributions are very similar. The simplest explanation is that small populations would all deviate from the initial frequency, X 0 , regardless of where it is. Figure 3E-H shows that the results do not depend strongly on T either, T being the time duration of traveler input. Since, at T = 2, the number of infected individuals would have swelled by 20-fold without any new input, the results are not significantly affected by later inputs.
In Fig. 4, the value of I 0 is varied from 10, 50 to 100. Here, we assume no travelers arriving after the first generation because their contributions, as shown in Fig. 3E-H, are insubstantial. The left panels (Fig. 4A, C and E) show the trajectories of 100 populations, which diverge in the first generation or two and then evolve steadily as the population size increases. On the right panels are the comparisons between the simulated and 'observed' distributions; the latter being those based on Dataset II (Fig. 4B and D) or Dataset I (Fig. 4F). It is clear that the expected distribution of Fst is very sensitive to I t (I 0 , in particular). The more realistic Dataset II can only be explained by I 0 ≤ 10 whereas the artificial Dataset I would agree with I 0 ≥ 100. We should note that the extensive survey of parameter space (see Supplement) shows the conclusion of I 0 ≤ 10 to be robust.
In the results presented above, L and S lineages are assumed neutral. Intuitively, selection should drive the trajectories to converge and the left panels (Fig. 5A, C and E) do show that trend. If we let the L type enjoy a 10% selective advantage, the results of Fig. 5 still indicate I 0 < 10. Note that the range of I 0 spans a smaller range in Fig. 5 than that in Fig. 4. In other words, with selection, the estimated I 0 should be even smaller than indicated above.
The simulation results are based on the distribution of k that follows the power law (see Equations 6 and 7) with V(k) ∼ 10E(k). Such a large V(k) means that the genetic drift would be very large, requiring large I t 's to reduce the drift. It is hence interesting that, even under such stringent conditions, I 0 is still <10. In the Supplement, we show that the estimated value of I 0 would be substantially lower if we use the Poisson distribution of k, associated with the conventional Wright-Fisher model. With V(k) = E(K), I 0 would be 2-4. Hence, the conclusion presented in this section is robust.

Inference of parameters (I 0 , T, X 0 , s)
In the last section, we present a range of parameter values that yield the expected population divergence for a comparison with the data. To corroborate the visual comparisons, we also carry out formal inferences using the ABC (Approximate Bayesian Computation) procedure on Dataset II (see Supplement). The results of Fig. S3 and S4 indeed confirm the visual impression as the observed divergence is sensitive only to I 0 , but not to the variation in T, X 0 and s. The formal inference of I 0 at 2-5 is even smaller than the visual impression would suggest. Finally, this analysis considers only the number of infections that can spread the virus further. A more extensive model that incorporates the development of symptoms, the border control and the local quarantine, all of which may contribute to the suppression of the epidemics, will be presented later (Ruan et al., personal communication).

DISCUSSION
In the theory of genetic drift [5], even 100 infected travelers from a source viral population would give rise to a fairly uniform level of genetic polymorphism among bordered regions. In contrast, the reported data indicate substantial divergence among countries (GISAID (https://www.gisaid.org/); see Supplement). Dataset II of Table 1 is realistic in this respect. The divergent polymorphisms across countries depend mainly on a critical parameter-the size of the first cohort arriving in a country, I 0 , which is estimated to be <10. The number may in fact be smaller than it seems since a long distance flight carrying one single infected but symptomless patient could infect this many people, all of whom would be without symptoms upon arrival [27].

On the robustness of the estimation
In Figs 3 and 5, we show that, despite the complexity of the model with many parameters, none of them (T, X 0 , s), except I 0 , plays a significant role in the divergence among viral populations. As discussed in conjunction with Fig. 3, the distribution of k does not matter either. In fact, in the standard Wright-Fisher model, the estimate of I 0 would be <5. It is also noted that the E(k) and V(k) values used are for a generation time of 4 days. For a shorter generation time, the values would be correspondingly smaller and the results should be similar.
The model also assumes that each population is an independent sample of the source population. Since all populations are likely to exchange some individuals due to traveling, the actual divergence among populations would be even smaller than simulated. In other words, to attain the observed level of divergence, I 0 would have to be even smaller than estimated. Considering all these variables, we believe the conclusion of I 0 < 10 to be robust. In the analysis of regional divergence, the results would depend strongly on the smaller I 0 of the two regions being compared. Hence, the assumption of the same I 0 among all regions would be a reasonable one.

Subsequent viral evolution after arriving in a new continent
While our focus is on the divergence in the first few generations, we now briefly discuss the subsequent evolution after this initial critical period. The primary lineage delineation, the S/L polymorphism defined by two SNPs [11], has many subtypes (see Supplementary data and Table S1 for details). For example, western European countries including Italy, Switzerland, Germany and Belgium are predominantly of the L type with a similar abundance in the L2 subtype. In contrast, while Japan is also predominantly of the L type, it has mainly the L1 subtype. This contrast suggests that Japan may represent an independent sample from the western European samples, which have likely been spreading regionally after the initial seeding. Another example is the S1 and S2 subtypes, which differentiate between the samples from China and the west coast of the US. These patterns suggest that, after the initial seeding, each major region or continent has been evolving along an independent path. Since the initial seeding may be extremely difficult to prevent, the onus is to suppress the regional spread. The analyses of the subtypes in Asia, Australia and various parts of North America would offer additional details of the spread of the virus, as has been done recently [28][29][30][31]. These details are beyond the scope of this study, which focuses on the early stages of the viral spread.

Implications
The analysis suggests that the COVID-19 epidemic in each region surveyed was likely started by a very small number of travelers (I 0 < 10). With such a tiny trickle of human movement, it would have been very inefficient for any region to prevent infected individuals from exporting an epidemic to (or importing it from) other places. For that reason, the crucial stage of repressing an epidemic in any region should be the very first sign of local contagion. Finally, due to the 'portability' of COVID-19, each epidemic, including the first one on record, could have easily been imported. Where then did all these epidemics begin? While the interest in the 'origin' is intense, we suggest the question be broadened as 'the origin and early evolution' of SARS-CoV-2. The latter implies a process whereas the former seems to mean a single time point. The process of early evolution may have stretched over different regions in a long time-span and involved multiple host species. Like many other evolutionary questions on origin, we suggest the question be phrased as the early evolution of SARS-CoV-2, rather than be about the 'origin'. The former implies a process whereas the latter seems to mean a single time point.