mstree: a multispecies coalescent approach for estimating ancestral population size and divergence time during speciation with gene flow.

Abstract Gene flow between species may cause variations in branch length and topology of gene tree, which are beyond the expected variations from ancestral processes. These additional variations make it difficult to estimate parameters during speciation with gene flow, as the pattern of these additional variations differs with the relationship between isolation and migration. As far as we know, most methods rely on the assumption about the relationship between isolation and migration by a given model, such as the isolation-with-migration model, when estimating parameters during speciation with gene flow. In this article, we develop a multispecies coalescent approach which does not rely on any assumption about the relationship between isolation and migration when estimating parameters and is called mstree. mstree is available at https://github.com/liujunfengtop/MStree/ and uses some mathematical inequalities among several factors, which include the species divergence time, the ancestral population size, and the number of gene trees, to estimate parameters during speciation with gene flow. Using simulations, we show that the estimated values of ancestral population sizes and species divergence times are close to the true values when analyzing the simulation data sets, which are generated based on the isolation-with-initial-migration model, secondary contact model, and isolation-with-migration model. Therefore, our method is able to estimate ancestral population sizes and speciation times in the presence of different modes of gene flow and may be helpful to test different theories of speciation.


Introduction
The role of gene flow in speciation is a fundamental issue in evolutionary biology. Allopatric speciation considers complete lack of gene flow as prerequisite to the formation of new species. However, parapatric and sympatric speciation allow gene flow during speciation. Although allopatric speciation has been historically taken as the paramount mode of speciation (Futuyma and Mayer 1980), theoretical modeling and empirical evidence increasingly support that speciation can occur with gene flow (Gourbiere and Mallet 2010;Smadja and Butlin 2011;Feder et al. 2012).
There are usually two kinds of models to make inferences about gene flow during speciation. Some methods are based on an isolation-with-migration (IM) model (Wang and Hey 2010;Tian and Kubatko 2016;Dalquen et al. 2017) and others on an isolation-with-initial-migration (IIM) model (Mailund et al. 2012;Costa and Wilkinson-Herbots 2017). However, the above two models include an assumption about the relationship between isolation and migration. Here, we use the properties of coalescent-based model in gene tree data for estimating the important parameters such as ancestral population sizes and divergence times without any assumption of the relationship between isolation and migration. Furthermore, we conduct simulations to examine the accuracy of the estimates of parameters. The simulation results show that our method can accurately estimate the parameters. At last, we compared mstree with the program 3s (Dalquen et al. 2017) and IMa3 (Hey et al. 2018) with simulation data; the simulation results show that mstree is faster than 3s and IMa3.

Materials and Methods
The Theoretical Model Consider two closely related species (1 and 2) with an outgroup species 3. We assume that there is only gene flow between two closely related species ( fig. 1). We use s 0 and s 1 to denote the two species divergence times, scaled by mutation rate. Let h 0 ¼ 4N 0 l and h 1 ¼ 4N 1 l measure the two ancestral species population sizes. Here, l is the mutation rate per site and generation, and the N 0 and N 1 denote the effective population sizes. There are five possible gene trees for a locus with three sequences (k, l, and m), which are from species 1, species 2, and species 3, respectively ( fig. 2). For any locus, t 0 is the coalescent time among three sequences and t 1 is the coalescent time between two sequences. In the presence of gene flow between species 1 and species 2, the gene tree G 1 may be possible at a locus. Otherwise, only gene trees G 2 -G 5 are possible. In this study, the term "loci" refers to independent or loosely linked short segments of the genome, and we assume that there is no recombination within a locus while different loci are free recombining. For tens of thousands of loci, there are some mathematical inequalities among the species divergence time, the ancestral species population size, and the number of gene trees, of which t 1 is larger than s 1 . Based on the coalescent theory with no gene flow under given species, the probability of gene tree, of which t 1 belongs to s 1; s 0 Â Á , is 1 À e À2 s 0 Às 1 ð Þ =h 1 ; and the probability of gene tree, of which t 0 belongs to s 0 ; s Case C: g 4 þg 5 g 2 þg 3 þg 4 þg 5 % 2 3 e À2 s 0 Às 1 ð Þ =h 1 for the category G 2 -G 5 ( fig. 2).
The above gene tree distributions allow us to compute the parameters h 0 , s 0 , h 1 , and s 1 . The approach of estimating the parameters is called mstree and the strategies are as follows. First, we can estimate the value of s 0 based on the fact that the shape of gene tree may be ((k, l),m), ((k, m),l) or ((l, m),k) where a is the number of gene trees, of which t 0 is larger than t 00 and t 1 is less than s 0 ; b 0 is the number of gene trees, of which t 0 belong to t 0 ; t 00 Â Á and t 1 is less than s 0 . Similarly, we can also estimate the value of h 1 based on the estimated value of s 0 by using the formula e À2 s 0 Às 0 1 ð Þ=h1 % a aþb 0 in case A. If we choose t 0 that is less than s 0 and assume that t 0 is larger than s 1 when t 0 is closed to s 0 , the estimated value of h 1 approximates À2 where a is the number of gene trees, of which t 1 is larger than s 0 and b 0 is the number of gene trees, of which t 1 belongs to t 0 ; s 0 Â Á . Lastly, we estimate the value of s 1 based on the values of s 0 and h 1 by using the formula 2 3 e À2 s 0 Às 1 ð Þ =h 1 % n 2 þn 3 aþn 2 þn 3 in case C. If there exists t 0 that is less than s 0 and satisfies 2 3 e À2 s 0 Àt 0 ð Þ=h1 % n 2 þn 3 aþn 2 þn 3 , where a is the number of gene trees, of which t 1 is larger than t 0 and the shape is ((k, l),m) ( fig. 2: G 2 -G 3 ); n 2 is the number of gene trees, of which the shape is ((k, m),l) (fig. 2: G 4 ); and n 3 is the number of gene trees, of which the shape is ((l, m),k) ( fig. 2: G 5 ), t 0 can be considered as the estimated value of s 1 .

Compared with IMa3 and Analyzed Real Data
The model estimated by IMa3 is IM model, and we used a fixed true species topology for IMa3. The real data are the genomic sequences of the human (H), chimpanzee (C), and gorilla (G) from Burgess and Yang (2008). The data set comprises 14,663 autosomal loci, and the mean locus length is 508 bp.

Results
The Accuracy of mstree We used the program ms (Hudson 2002) to simulate gene trees at multi loci under the IIM, secondary contact (SC), and IM model. For the IIM model, the gene flow stopped at 2 3 s 1 in the past; For the SC model, the time of SC began at 1 3 s 1 in the past. Two sets of parameter values were used, roughly based on estimates from the hominoids (Burgess and Yang 2008) and the mangroves (Zhou et al. 2007). They are as follows: h 0 ¼ h 1 ¼ 0:005, s 0 ¼ 0:006, and s 1 ¼ 0:004 (hominoids); h 0 ¼ h 1 ¼ 0:01, s 0 ¼ 0:02, and s 1 ¼ 0:01 (mangroves). For the three models, gene flow is symmetrical and the migration rate (the expected number of migrants per generation) is 1. The number of loci is 10,000 and the number of replicates is 1,000. Analyzing the simulation data by using mstree, the results show that the parameter estimates are very close to the true values and are not sensitive to the model's assumption about the relationship between isolation and migration (table 1). e in table 1 is the threshold value in mstree and  Gene flow is symmetrical and the migration rate is 1. e is the threshold value in mstree. The number of loci is 10,000. The number of replicates is 1,000. IIM, isolation-with-initialmigration; SC, secondary contact; IM, isolation-with-migration. The best estimates are marked in bold.
describes the degree of approximation between two sides of the formulas in cases A, B, and C. For example, in case A, e ¼ 0:03 means the value of e À2 s 0 Às 0 should be <0.03 when e À2 s 0 Às 0 1 ð Þ=h1 % a aþb 0 . In mstree, the value of e must be <0.05. The results in table 1 show that the smaller e increased the standard deviations of the parameter estimates and the estimate of s 1 . Therefore, we suggest that the value of e should be 0.03 when using mstree. Furthermore, we applied mstree to additional two parameter sets (Dalquen et al. 2017), which have larger parameter values and different values for two hs (table 2). The results show that mstree still performs well.

The Factors That Influence Parameter Estimates
In addition, we performed more simulations to test how different factors influence parameter estimates, such as the number of loci, migration rate, and the direction of migration. The numbers of loci are 5,000, 10,000, and 50,000; the migration rates are 0.1, 1, and 10; and the directions of migration are symmetrical and asymmetrical. The results are shown in supplementary tables S1-S6, Supplementary Material online. For the parameters h 0 , h 1 , and s 0 , the results show that larger number of loci makes the estimates more accurate and the estimates are not sensitive to the model's assumption, migration rate, and the direction of migration. For the parameter s 1 , we have the same conclusion except for the case that migration rate is 10. When migration rate is 10, asymmetrical gene flow decreases the accuracy of s 1 estimate. This indicates that the estimate of s 1 is sensitive to the direction of migration with large migration rate (supplementary fig. S1,

Compared with 3s and IMa3
The input file of mstree is gene tree, which is in Newick format, and the gene tree can be estimated from the observed sequence alignments where there must be three sequences, with one sequence from each species, at each locus. Therefore, we need a program, such as PHYLIP, to infer gene trees when applying mstree to experimental data. Because inference of gene trees is associated with error and uncertainty, we did some simulations to investigate the effect of the gene tree uncertainty (supplementary table S7, Supplementary Material online). We used program ms and seq-gen (Rambaut and Grassly 1997) to generate sequence data under JC69 model and used program dnamlk in PHYLIP package to infer gene trees. Although there has been some decline in the accuracy of parameter estimates because of the inferred error of gene trees, the estimates of mstree are still near to the true values and not sensitive to the model's assumption. Comparing mstree with the program 3s (Dalquen et al. 2017) and IMa3 (Hey et al. 2018), mstree is faster than 3s and IMa3 (supplementary table S7, Supplementary Material online). Although 3s and IMa3 performed very well on some parameter estimates, the s 1 estimates of 3s and s 0 estimates of IMa3 were very poor.

Robustness of mstree and Analyzing Real Data
Though our method is not affected by gene flow between the sister species, our method assumes that there is no gene flow between the ingroup and the outgroup. Therefore, we examined the robustness of our method in the presence of gene flow between the ingroup and the outgroup. The results are shown in supplementary table S8, Supplementary Material online. Our method is robust to the simulations based on IIM model between the ingroup and the outgroup. For the simulations based on SC and IM model between the ingroup and the outgroup, the accuracy of parameter estimates is on the decline. At last, we apply mstree to the genomic sequences of the human (H), chimpanzee (C), and gorilla (G) (Burgess and Yang 2008). The estimates of parameters are similar to those of Burgess and Yang (2008), but the estimate of s HC is slightly higher (supplementary table S9, Supplementary Material online). In order to quantify uncertainty in the estimates obtained, we resort to bootstrapping with 100 replicates. The averages and the standard errors of estimates are as follows: b h HCG ¼ 0:0032 6 0:0000, b h HC ¼ 0:0068 6 0:0005, b s HCG ¼ 0:0059 6 0:0000, and b s HC ¼ 0:0038 6 0:0005. -h and s estimates are scaled by 10 2 . Gene flow is symmetrical and the migration rate is 1. e is the threshold value in mstree. The number of loci is 10,000. The number of replicates is 1,000. IIM, isolation-with-initial-migration; SC, secondary contact; IM, isolation-with-migration.

Discussion
Supplementary table S8, Supplementary Material online, shows the performance of mstree in the presence of gene flow with species 3. Under the influence of gene flow between the ingroup and the outgroup, s 0 was underestimated and was closed to the time that gene flow stopped except for IIM model. When s 0 was underestimated, the estimates of other parameters were far away from the true value. Burgess and Yang (2008) estimated divergence times under the assumption of no gene flow. However, Zhu and Yang (2012) applied the test based on SIM3s model to a human-chimpanzee-gorilla genomic data and the test results suggested gene flow around the time of speciation of human and chimpanzee. Compared with the estimated divergence times from Burgess and Yang (2008), the analysis from mstree suggested migrations between sister species. In addition, there are two significant differences between mstree and COALGF (Tian and Kubatko 2016), which describes the distribution of coalescent histories under the coalescent model with gene flow: 1) mstree uses the coalescent history distribution under coalescent model without gene flow to infer model parameters based on summary statistics; however, COALGF computes probabilities of gene tree histories given species trees under the coalescent process with gene flow and the results obtained from COALGF may be used to infer model parameters based on a maximum likelihood framework. 2) mstree does not make any assumption about the mode of gene flow between sister taxa; however, COALGF assumes that the mode of gene flow between sister taxa is IM.
To summarize, we propose a multispecies coalescent approach, mstree, for estimating the parameters during speciation with gene flow. Theoretically, our method does not rely on any assumption about the relationship between isolation and migration. Furthermore, the simulation results demonstrate that mstree can accurately estimate species divergence time and ancestral population size regardless of IIM model, SC model, or IM model.

Supplementary Material
Supplementary data are available at Genome Biology and Evolution online.