A Bayesian Implementation of the Multispecies Coalescent Model with Introgression for Phylogenomic Analysis

Abstract Recent analyses suggest that cross-species gene flow or introgression is common in nature, especially during species divergences. Genomic sequence data can be used to infer introgression events and to estimate the timing and intensity of introgression, providing an important means to advance our understanding of the role of gene flow in speciation. Here, we implement the multispecies-coalescent-with-introgression model, an extension of the multispecies-coalescent model to incorporate introgression, in our Bayesian Markov chain Monte Carlo program Bpp. The multispecies-coalescent-with-introgression model accommodates deep coalescence (or incomplete lineage sorting) and introgression and provides a natural framework for inference using genomic sequence data. Computer simulation confirms the good statistical properties of the method, although hundreds or thousands of loci are typically needed to estimate introgression probabilities reliably. Reanalysis of data sets from the purple cone spruce confirms the hypothesis of homoploid hybrid speciation. We estimated the introgression probability using the genomic sequence data from six mosquito species in the Anopheles gambiae species complex, which varies considerably across the genome, likely driven by differential selection against introgressed alleles.


Introduction
A number of recent studies have revealed cross-species hybridization/introgression in a variety of species ranging from Arabidopsis (Arnold et al. 2016), butterflies (Martin et al. 2013), corals (Mao et al. 2018), and birds (Ellegren et al. 2012) to mammals such as bears Kumar et al. 2017), cattle (Wu et al. 2018), gibbons (Chan et al. 2013;Shi and Yang 2018), and hominins (Nielsen et al. 2017). Introgression may play an important role in speciation (Harrison and Larson 2014;Mallet et al. 2016;Martin and Jiggins 2017). Inference of introgression and estimation of migration rates can contribute to our understanding of the speciation process (Mallet et al. 2016;Martin and Jiggins 2017). Furthermore, introgression and deep coalescence (or incomplete lineage sorting) are two major challenges for species tree reconstruction (Martin et al. 2013;Liu et al. 2014;Fontaine et al. 2015).
There is a large body of literature on the use of networks to model non-treelike evolution (Huson et al. 2011) and a number of methods have been developed to detect cross-species gene flow. Most use summaries of the multilocus sequence data such as the estimated gene trees (Solis-Lemus and Ane 2016; Wen et al. 2016;Solis-Lemus et al. 2017;Cao et al. 2019) or the counts of parsimony-informative site patterns (Green et al. 2010;Durand et al. 2011;Blischak et al. 2018). See Degnan (2018) and Folk et al. (2018) for recent reviews. We focus on coalescent-based full-likelihood models applied to multilocus sequence alignments from closely related species. These come in two forms. The isolation-with-migration (IM) model assumes continuous migration, with species exchanging migrants at certain rates every generation (Hey and Nielsen 2004;Hey 2010), whereas the multispecies-coalescent-with-introgression (MSci) model assumes episodic introgression/hybridization (Yu et al. 2014). Although the probability density of the gene trees under the IM (Hey 2010) and MSci (Yu et al. 2014) models is straightforward to compute, developing a Bayesian Markov chain Monte Carlo (MCMC) program that is feasible for use with genome-scale data sets has been challenging. The space of unknown genealogical histories (including the migration/introgression histories) is large, and constraints between the species tree and the gene trees make it difficult to traverse the parameter space in the posterior. Current implementations of full-likelihood methods through MCMC include IMa3 (Hey 2010;Hey et al. 2018) for the IM model, and *BEAST Jones 2019) and PHYLONET/MCMC-SEQ ) for the MSci model. It does not appear computationally feasible to apply those programs to realistically sized data sets, with more than 200 loci, say.
In this article, we extend the multispecies-coalescent (MSC) model in the BPP program (Rannala and Yang 2003;Burgess and Yang 2008;Yang 2015) to accommodate introgression, resulting in the MSci model (Degnan 2018). The MSci model can be used to estimate species divergence times and the number, timings, and intensities of introgression events. By accommodating gene flow and providing more reliable estimates of evolutionary parameters, the model may also be used in heuristic species delimitation (Jackson et al. 2017;Leach e et al. 2019). We conduct simulation to examine the statistical properties of the method, in comparison with two summary methods, SNAQ (Solis-Lemus and Ane 2016; Solis-Lemus et al. 2017) and HYDE (Blischak et al. 2018). We apply the new method to data sets of purple cone spruce (Sun et al. 2014;Zhang et al. 2018), budding yeast (Rokas et al. 2003;, and Anopheles mosquito genomes Thawornwattana et al. 2018a), to examine the computational efficiency of our algorithms in comparison with previous implementations Zhang et al. 2018) and to estimate the introgression probability and to study its variation across the genome.

The MSci Model
We extend the MSC model (Rannala and Yang 2003) to accommodate cross-species hybridization (or introgression) by introducing hybridization (or H) nodes ( fig. 1). Each H node has two parents (H l and H r , for left and right) and one daughter, although the H node and its parents may have the same age when there is an admixture or horizontal gene transfer ( fig. 1B and C). In model A, both parental species become extinct after hybridization, whereas model B represents an introgression from species RSA into THC. Model C represents hybrid speciation, whereas model D represents bidirectional introgression (Kubatko 2009).
When we trace a lineage backward in time and reach an H event, the lineage may traverse either the left or the right parental species, according to the introgression probability (u or 1 À u). This probability is equivalent to the "inheritance probability" c of Yu et al. (2014) and the "heritability" of Solis-Lemus and Ane (2016). The MSci model includes three sets of parameters: the speciation and introgression times (s); the population size parameters (h), with each h ¼ 4Nl, where N is the effective population size and l is the mutation rate per generation per site; and the introgression probabilities (u). Both ss and hs are measured by the expected number of mutations per site. Here, we assume that the MSci model is fixed; cross-model MCMC moves will be developed in future work.
Let G ¼ fG i g be the set of gene trees for the L loci. For each locus i, G i represents the gene-tree topology, the branch lengths (coalescent times), and the paths taken at the H nodes, indicated by a set of flags for each gene-tree branch, with l for left, r for right and 1 for null (meaning that the branch does not pass the H node). The data X ¼ fX i g are the sequence alignments at the L loci. Sites within the same locus are assumed to share the same genealogical history, whereas the gene trees and coalescent times are assumed to be independent among loci given the species tree and parameters. The ideal data for this kind of analysis are loosely linked short genomic segments (called loci), so that recombination within a locus is unimportant, whereas different loci are largely independent (Burgess and Yang 2008;Lohse et al. 2016;Hey et al. 2018). The Bayesian formulation consists of two components. The first is the probability density of gene trees given the species tree under the MSci model, f ðG i js; h; uÞ, given in Yu et al. (2014) although here G i includes the flags for hybrid nodes. Note that this differs from the density given by Kubatko (2009), as pointed out by Solis-Lemus and Ane (2016). The second component is the likelihood of the sequence data at each locus i given the gene tree, f ðX i jG i Þ (Felsenstein 1981). The posterior probability density of the parameters on the species tree given sequence data is then (1) where f ðs; h; uÞ is the prior on parameters. We assign inverse-gamma priors on hs and ss and a beta prior on u.
We have implemented six MCMC proposals to average over the gene trees (G i ) and sample from the posterior (eq. 1). Those proposals 1) change node ages on gene trees, 2) change gene-tree topologies using subtree pruning and regrafting (SPR), 3) change hs on the species tree using sliding windows, 4) change ss on the species tree using a variant of the rubber-band algorithm (Rannala and Yang 2003), 5) changing all node ages on the species tree and gene trees using a multiplier, and 6) change the introgression probabilities us using sliding windows. The proposals are detailed in Materials and Methods using the example species tree model of figure 2.

Simulation Study
We conducted three sets of simulation to examine the performance of BPP in different situations.
The first set includes multiple sequences from each species and examines BPP estimation of parameters in the MSci model and the impact of factors such as the number of loci, the introgression probability u, and the species tree model. We used models A and C of figure 1. Each data set consisted of 10, 100, or 1,000 loci, with 10 sequences from each species per locus (and 30 sequences in total). We used two values for u (0.1 and 0.5) and two values for h (0.001 and 0.01). Either a JC (Jukes and Cantor 1969) or a GTR þ C (Yang 1994a(Yang , 1994b substitution model was used to simulate data, but JC was always used to analyze them. The results may be summarized as follows (supplementary figs. S1-S8, Supplementary Material online).
• First, there were large variations in estimation precision and accuracy among the different parameters. For example, estimates of hs for modern species were accurate even in small data sets in all combinations of trees, models, and h values. In contrast, hs for some ancestral species (such as h S , h T , h H l , and h H r in model A when the true h ¼ 0:001) were poorly estimated, with the posterior dominated by the prior even with 100 or 1,000 loci. Those parameters were hard to estimate as very few sequences enter or coalesce in the populations. These same parameters were much better estimated when the true h ¼ 0:01 as then many more sequences could enter and coalesce in the ancestral species. For similar reasons, the ages of ancestral nodes such as s H and s S were much better estimated when the true h ¼ 0:01 than when h ¼ 0:001. • Second, parameter estimates under model C were more precise than under model A, because the former has 9 parameters, whereas the latter 13. • Third, there were virtually no differences in the results whether the data were simulated under JC or GTR þ C.
As the role of the mutation model in BPP is to correct for multiple hits at the same site and as the simulated sequences are highly similar, the choice of the mutation model is unimportant. Similar observations were made in previous simulations examining species tree estimation without introgression (Shi and Yang 2018). • Last, the data size (the number of loci) had a huge impact on the precision and accuracy of estimation. In particular, data of only 10 or 100 loci did not produce reliable estimates of u, whereas estimates from 1,000 loci were both precise (with narrow intervals) and accurate (close to the true values). Because the MSci models are parameter rich, large data sets in the order of 1,000 loci are necessary for reliable inference.
In the second set of simulations we compared BPP with two summary methods: SNAQ (Solis-Lemus and Ane 2016; Solis-Lemus et al. 2017) and HYDE (Blischak et al. 2018), using one sequence per species. We simulated data under model A, with three ingroup species (A, B, and C), as well as two outgroup species D and E, as required by SNAQ (Solis-Lemus et al. 2017). One sequence was sampled per species per locus. The data were then analyzed using the three programs to estimate u (supplementary fig. S9, Supplementary Material online). Data size had a large impact on the precision and accuracy of the estimates. All three methods performed poorly with 10 or 100 loci (or gene trees), but the estimates were close to the true values with 1,000 loci. Overall, the three methods had similar performance in estimating u. In some small data sets, SNAQ and HYDE had extreme estimates of 0, whereas BPP always produced nonzero estimates, due to Bayesian shrinkage through the prior.

FIG. 2.
A species tree for three species (A, B, C) with a gene tree for 12 sequences running inside it to illustrate the gene-tree node-age move and the gene-tree SPR move. There are four speciation nodes (R; S; U; V) and two hybridization nodes (H 1 and H 2 ). This MSci model is also used in simulation, where the model is referred to as "2H".
A Bayesian Implementation of the MSci Model . doi:10.1093/molbev/msz296 MBE Note that the problem examined here is a conventional parameter estimation problem under a well-specified model, so that standard statistical theory applies, which states that the Bayesian method has optimal large-sample properties (O'Hagan and Forster 2004). The small differences among the methods suggest that information in the data concerning u mostly lies in the proportions of gene trees, which may be reliably estimated even if phylogenetic information content at each individual locus is low. We note that BPP has several advantages. 1) BPP accommodates the uncertainties in the data appropriately and produces posterior credible intervals (CIs), whereas SNAQ and HYDE generate point estimates only.
2) BPP estimates all 13 parameters in the model, whereas SNAQ and HYDE estimate only 2 (u and the internal branch length) with the others unidentifiable. Estimates of ancestral population sizes (hs) and species divergence and introgression times (ss) may be useful for understanding the evolutionary history of the species. 3) BPP can use loci of any data configuration, including loci with sequences from only one or two species, which are informative for BPP but carry no information about gene trees. 4) Some introgression models or biologically important scenarios are unidentifiable using SNAQ and HYDE but can be analyzed using BPP (see below). In contrast, SNAQ and HYDE have a huge computational advantage over BPP and may be very useful for exploratory analysis in large data sets.
The third set of simulations explored the performance of BPP under models that are unidentifiable using SNAQ and HYDE. We used model D of figure 1 and model 2H of figure 2, with results in supplementary figure S10, Supplementary Material online. Model D represents bidirectional introgression between two species. Population size parameters (hs) for modern species A and B were well estimated even with 100 loci, as was h R for the root, but h X for species X (branch X-R) and h Y (for branch Y-R) were more poorly estimated (supplementary fig. S10A, Supplementary Material online). Both s parameters were well estimated. The introgression probabilities u X and u Y were poorly estimated in small data sets of 10 or 100 loci but were fairly accurate with 1,000 loci.
Model 2H ( fig. 2) involves two introgression events on a species tree of three species. There were large differences in information content for different parameters (supplementary fig. S10B, Supplementary Material online). Parameters hs for modern species were well estimated even in small data sets, but hs for most ancestral species were poorly estimated because of lack of coalescent events in those populations. Parameter h H 1r was more accurately estimated than h H 1l because more sequences passed node H 1 from the right (with probability 1 À u ¼ 0:9) than from the left (with u ¼ 0:1), and h H 1l and h H 1r were more reliably estimated than h H 2l and h H 2r because more sequences passed node H 1 than node H 2 . Similarly s H 1 was better estimated than s H 2 . With 1,000 loci, all six node ages (for R; S; U; V; H 1 , and H 2 ) were well estimated. The two introgression probabilities (u H 1 and u H 2 ) were poorly estimated with 10 or 100 loci but were reliably estimated when 1,000 loci were used.
In summary, in both introgression scenarios of models D and 2H, where SNAQ and HYDE are inapplicable, BPP appears to be a well-behaved method, providing reliable estimates of introgression probabilities as well as species divergence and introgression times.

Analysis of the Purple Cone Spruce Data
We analyzed three data sets concerning the origin of the purple cone spruce in the Qinghai-Tibet Plateau, Picea purpurea, hypothesized to be a hybrid species, formed through homoploid hybridization between P. wilsonii (W) and P. likiangensis (L) (Sun et al. 2014). Two small data sets were previously analyzed using *BEAST under model A of figure 3, whereas the third one (the "Full" data) is a much larger data set from which the first two were sampled. We attempted to apply PHYLONET/MCMC-SEQ  to analyze any of those data sets but were unsuccessful. The program used all 144 cores on our server and did not produce any output after 5 days. The data sets appeared to be too large for the program.
Parameter estimates under model A from the two small data sets were similar to those in Zhang et al. (2018), with u estimates between 0.32 and 0.44, although the estimates involved large uncertainties (supplementary table S1, Supplementary Material online). The uncertainty is apparently due to the use of only 11 loci (although many sequences are available at each locus) and the shallowness of the species tree, with species divergence times being comparable with coalescent times (or with similar ss and hs). Accommodating rate variation among loci had very small effects. The full data produced similar parameter estimates to the two small data sets, but u is larger, at 0.47-0.49. We also applied model C ( fig. 3) to the full data, which produced more precise estimates because of the smaller number of parameters ( fig. 3 and supplementary table S1, Supplementary Material online). The u estimate under model C was 0.53, with the 95% highest posterior density (HPD) CI to be 0.36-0.71.
Note that the models represent different biological scenarios. Model A assumes the existence, and subsequent extinction at the time of hybridization, of species DH and EH ( fig. 3). This is not a very plausible model (Sun et al. 2014). Model C represents speciation through homoploid hybridization, with species RDW and REL coming into contact and forming a hybrid species (H) at time s H . A possible scenario is that changes in species distribution may have led to habitat overlap between P. wilsonii and P. likiangensis during the Quaternary glaciation in the central Qinghai-Tibet Plateau (Sun et al. 2014). We calculated marginal likelihoods (Bayes factors) to compare models A-C ( fig. 1). The log marginal likelihood was -18,361 for model A, -18,359 and -18,361 for two cases of model B (with s H ¼ s D and s H ¼ s E , respectively), and -18,362 for model C, suggesting the fit of the models to data is similar. The marginal likelihoods are thus indecisive. We suggest that model C should be preferred, because of its biological plausibility.

Analysis of the Budding Yeast Data Set
We fitted the MSci model of figure 4 to the data of 106 loci from 5 species of budding yeast. This model had a posterior probability of >95% in the PHYLONET/MCMC-SEQ analysis of the Flouri et al. . doi:10.1093/molbev/msz296 MBE same data by . The u estimate from BPP was 0.70 (with the 95% HPD CI 0.56-0.83), compared with 0:7560:06 in . The small differences may be due to the use of different priors and the assumption of a constant h across all populations in the PHYLONET analysis. The results confirm the expectation that full-likelihood programs, if computationally feasible, should produce similar results. Running time for achieving an effective sample size (ESS) of 1,000 for u was $3 min for BPP using all 8 threads on a notebook, compared with $17 h for PHYLONET using 32 threads on a computer server . If we make a 10-fold allowance for the fact that the model is fixed in BPP while PHYLONET spent computational efforts attempting changes to the model, this very roughly translates into a 100-fold difference in mixing/computational efficiency between the two programs (17 Â 60=3 Â 4=10 ¼ 136).

Variable Introgression across the Genome in the Anopheles gambiae Species Complex
To examine the variation in introgression intensity across the Anopheles genome, we analyzed blocks of 100 loci, assuming the species tree of figure 5. Estimates of u A!GC for the A. arabiensis ! A. gambiae þ A. coluzzii introgression and u R!Q for the A. merus ! A. quadriannulatus introgression vary considerably across genomic regions or chromosomal arms ( fig. 6). The probability u A!GC is high (>0.5) in most blocks, whereas u R!Q is high in 3La and 3R.
We then merged the loci on the same chromosomal arms/ regions to form 12 large coding and noncoding data sets, and analyzed them under the model of fig. 5 (table 1). We also sampled three sequences per locus to form data triplets for analysis using the maximum likelihood (ML) program 3S (Thawornwattana et al. 2018a, supplementary table S3, Supplementary Material online, GAR and RQO). For all autosomes, the introgression/migration rate from A. arabiensis to A. gambiae þ A. coluzzii is very high (with u A!GC > 0:5), whereas M A!G ranges from 0.12 to 1.12. To reconcile the estimates from the two models, note that M is the expected number of migrants per generation, so that even a small M may mean a large number of migrants accumulated over many generations. As noted previously Thawornwattana et al. 2018a), the autosomes are overwhelmed by the A ! GC introgression so that all species tree methods that ignore gene flow infer incorrect species trees. In population genetic models of population subdivision, migration rates of M ( 1 do not lead to substantial population subdivision. However, here M as low as 0.1 may have a significant impact on the species phylogeny if the species arose through radiative speciation events and the ancestral species had large sizes. Parameter u R!Q varied across chromosomal regions and was high for the inversion region 3La. Coding and noncoding loci produced highly consistent estimates of species trees,   Thawornwattana et al. 2018a), but estimates of migration rate/introgression probability differed between the two data sets. The higher introgression rates for coding than noncoding loci in regions 3La, 2L1 þ 2, and 3R suggest the intriguing possibility that the introgressed genes may have brought adaptive advantages, so that introgression is aided by natural selection. The functions of coding genes or exons that are most likely transferred across the species barriers could be examined to explore this hypothesis. There is overall consistency between estimates from the IM model in 3S and the MSci model in BPP in that regions with high u tend to have high M as well. Note that both u and M reflect long-term effective gene flow, after the filtering of introgressed alleles by natural selection.

Identifiability of MSci Models
If the probability distributions of the data are identical for two sets of parameter values (H and H 0 ), with f ðXjHÞ ¼ f ðXjH 0 Þ for all possible data X, then H is unidentifiable given data X.
Previous studies of identifiability have mostly focused on the use of gene-tree topologies as data (Zhu and Degnan 2017;Degnan 2018). Note that a model unidentifiable given genetree topologies alone may be identifiable given gene trees with branch lengths or coalescent times, and that a model unidentifiable when one sequence is sampled per species may be identifiable when multiple samples per species are available (Yu et al. 2012;Pardi and Scornavacca 2015;Zhu and Degnan 2017). A comprehensive examination of the identifiability issue under MSci is beyond the scope of this article. Here, we consider a few simple cases. First, the population size parameter h is unidentifiable if at most one sequence per locus is sampled from that species or its descendant species and ss associated with a hybridization event may also be unidentifiable. Consider the model of figure 2 and suppose the data consist of one sequence from each species. Then h A , h B , and h C , as well as h H 1l ; h H 1r ; h H 2l , and h H 2r , are unidentifiable. In addition, s H 1 and s H 2 are unidentifiable. Parameters s U , s V, s S , s R , and FIG. 6. Posterior means and 95% HPD intervals of the introgression probabilities u A!GC and u R!Q for the Anopheles gambiae species complex in BPP analysis of the blocks. Each block consists of 100 loci, which are assumed to have the same u at each hybridization node. Flouri et al. . doi:10.1093/molbev/msz296 MBE u H 1 and u H 2 are identifiable, as are h R ; h S ; h U , and h V. In this case, the gene tree at any locus depends on whether sequence c takes the left path at H 1 and enters species U (which happens with probability u H 1 ), or it takes the right path and enters species H 2 (which happens with probability 1 À u H 1 ), but not on the age of the H 1 node. The same applies to the path taken by sequence c at H 2 .
The most interesting case for the MSci model implemented here is where multiple sequences are sampled from each species at each locus, with multiple sites per locus. We speculate that the MSci model is identifiable on such data of sequence alignments as long as it is identifiable when the data consist of gene trees with coalescent times: H is identifiable using multilocus data X if and only if f ðG; tjHÞ 6 ¼ f ðG; tjH 0 Þ for some G and t. Note that identifiability implies statistical consistency for a full-likelihood method as implemented here. If the model is identifiable, the Bayesian parameter estimates will approach the true values when the number of loci approaches infinity.
Here, we note an interesting unidentifiability issue with model D of figure 1. Let H ¼ (h A ; h B ; h R ; h X ; h Y ; s R ; s X ; u X ; u Y ) be the parameters of the model, and let H 0 have the same parameter values as Then f ðGjHÞ ¼ f ðGjH 0 Þ for any H, G, and data configuration (with n A and n B sequences from A and B, respectively, say). Thus for every point H in the parameter space, there is a "mirror" point H 0 with exactly the same likelihood. With H, a certain number of A sequences may take the left (upper) path at X (with probability u X ) and enter population XR, coalescing at the rate 2=h X , whereas with H 0 , the same A sequences may take the right (horizontal) path (with probability 1 À u 0 X ¼ u X ) and enter population YR, coalescing at the rate 2=h 0 Y ¼ 2=h X . The differences between the two scenarios are in the labeling only, with "left" and X under H corresponding to "right" and Y under H 0 , but the probabilities involved are exactly the same. The same argument applies to sequences from B going through node Y, and to sequences from A and B considered jointly. This is a case of the label-switching problem. Arguably H and H 0 have the same biological interpretations concerning the relatedness of the sequences sampled from A and B. If the priors on u X and u Y are symmetrical, say betaða; aÞ, the posterior density will satisfy f ðHjXÞ ¼ f ðH 0 jXÞ for all X. Otherwise, the "twin towers" may not have exactly the same height.
Note that the label-switching kind of unidentifiability does not hinder the utility of the model. One can apply an identifiability constraint, such as u < 0:5, to remove the unidentifiability. However, in the general case of multiple bidirectional introgression events or multiple species on the species tree, it may be complicated to decide on the identifiability of the model.
Finally, we point out that there are many scenarios of data configurations and parameter settings in which some parameters are only weakly identifiable and very hard to estimate. For example, if h C is very small relative to s H 1 in figure 2, sequences from C will have coalesced before reaching node H 1 , so that only one C sequence passes H 1 and the data will have little information about s H 1 ; h H 1l ; h H 1r ; s H 2 ; h H 2l , and h H 2r .

Full-Likelihood and Summary Methods to Accommodate Introgression/Migration
Although biologically simplistic, the MSci and IM models offer powerful tools for analysis of genomic sequence data from closely related species, when cross-species gene flow appears to be the norm (Mallet et al. 2016;Martin and Jiggins 2017). Full-likelihood implementations of those models, including the ML (Zhu and Yang 2012;Dalquen et al. 2017) and Bayesian MCMC methods (Hey et al. 2018;Zhang et al. 2018), make efficient use of the information in the data and naturally accommodate phylogenetic uncertainties at individual loci caused by high sequence similarities (Edwards et al. 2016;Xu and Yang 2016). The complexity of those models means that large data sets with hundreds or thousands of loci may be necessary to obtain reliable parameter estimates, as indicated by our analyses of both simulated and real data. In this article, we have developed new MCMC proposal algorithms for MSci models (of MBE types A-D of fig. 1) and have successfully applied them to analyze large data sets of over 10,000 loci (table 1). The algorithms appear to have good mixing efficiency. We suggest that this is a promising start, from which further improvements to the algorithms may be possible. Future work will include implementation of efficient MCMC proposals to move between MSci models, and a systematic examination of identifiability issues. We note that the computational load for BPP increases with an increase in the number of species, the number of hybridization events, the number of loci, the number of sequences per locus, or the number of sites per sequence. Increasing the number of species or hybridization events increases the number of parameters (as well as the number of models when the model is changing in the MCMC) so that the parameter space becomes much larger. Increasing the number of loci also increases the posterior search space since the MCMC has to sample in the space of gene trees for each locus (Flouri et al. 2018). In comparison, the number of sites per sequence has the least impact on the amount of computation. We found it helpful to make a distinction between computational efficiency of an MCMC algorithm, which reflects the computational time for each MCMC iteration, and mixing efficiency, which is measured by the ESS in parameter estimates for a given number of MCMC iterations. When the data set gets larger, particularly with more loci in the data set, the posterior for parameters becomes spiky, which in general leads to a deterioration of MCMC mixing efficiency, so that a greater number of MCMC iterations become necessary to produce estimates with acceptable precision. We conjecture that poor mixing is a more serious problem than poor computational efficiency for most MCMC algorithms in phylogenomics.
Our simulation suggests that under simple introgression scenarios, summary methods such as SNAQ and HYDE can produce as reliable estimates of the introgression probability as BPP. However, full-likelihood methods provide measures of uncertainties and are applicable to complex introgression scenarios which are unidentifiable using summary methods. Summary methods are simple to implement, computationally efficient, and useful for analyzing large data sets. They can be used to generate hypotheses for further testing and estimation using BPP. Furthermore, the current implementation of MSci in BPP assumes the molecular clock and is unsuitable for distantly related species. Summary methods such as SNAQ use outgroups to root the tree without the need for the molecular clock.

Variable Introgression Probability across the Genome
The models implemented here ( fig. 1A-C) assume that the introgression probability or migration rate is constant among loci or across the genome. However, the impact of introgressed alleles on the fitness of the individual may strongly depend on the function of the genes in the introgressed region. Genes involved in cross-species incompatibilities are unlikely to be accepted in the recipient species. For example, crossing experiments between A. arabiensis and A. gambiae highlighted large differences between the chromosomes, with the X chromosome being most resistant to introgression, presumably because it harbors genes involved in cross-species sterility and inviability (Slotman et al. 2005). Differential selection across the genome means that the u parameter should vary among loci. Note that u in our models when estimated from genetic sequence data reflects the long-term combined effects of migration, recombination, and natural selection. It may be very different from the per-generation hybridization rate, which should apply to the whole genome.
In our analysis of the Anopheles genomic data, we used blocks of 100 loci to partially accommodate the variation in migration rate or introgression probability across chromosomal regions ( fig. 6). We leave it to future work to implement MSci models with u varying among loci. We note that many sequences per locus may be necessary to estimate locusspecific migration rates or introgression probabilities.

MCMC Proposals
We have adapted the five proposals in Rannala and Yang (2003) to accommodate hybridization nodes on the species tree and added another move to update the u parameters.
Step 1. Change node ages on gene trees using a sliding window. Suppose the concerned gene-tree node is node x with age t x in population X, with parent node p in population P and two daughter nodes u and v in populations U and V, respectively. To propose a new node age t Ã x first determine the bounds, t L < t Ã x < t U , with t U determined by the age of the parent node (t p ) and t L by the age of the oldest daughter node: t L ! maxðt u ; t v Þ. In addition, if the two daughter nodes are in different populations (with U 6 ¼ V), t L must be older than the youngest common ancestor of populations U and V on the species tree.
Generate the new age t Ã x by sampling around t x , reflected into the interval ðt L ; t U Þ. The new node x Ã has to reside in a population that is descendant to the parent population P and ancestral to the child populations U and V. Among those target populations, we sample one uniformly. Given the sampled population for x Ã , we sample the flags for the three branches: p-x Ã ; x Ã -u, and x Ã -v. In each case, the two ends of the branch are already assigned a population. This move may cause large changes to the flags even though it does not change the genetree topology. For example, consider the change of t x in figure 2. Node x is in population S, with branch x-v having the flags l1, since the branch passes H 1 from the left and does not pass H 2 . Suppose the new age, generated in the interval (t u , t p ), is t Ã x > s R so that the new node x Ã resides in R. The resampled flags for branch x Ã -v may be rr, if the new branch passes both H 1 and H 2 from the right. The proposal ratio is given by the probabilities of sampling the flags at the H nodes.
Step 2. SPR move to change the gene-tree topology. This move cycles through the nonroot nodes on the gene trees. Suppose the node is a. We prune off its parent node y. The remaining part of the gene tree is called the backbone. We sample the new age (t Ã y ) before reattaching the subtree y-a onto the backbone. This move always changes the node age t y but may not change the gene-tree topology. Flouri et al. . doi:10.1093/molbev/msz296 MBE First, we determine the bounds on the age of reattachment point: t L < t Ã y < t U . The maximum age is unbounded, whereas the minimum is t L ! t a . However, if there are no branches on the backbone passing the population of node a, the reattachment point has to be in an ancestral species (in which there exists at least one branch on the backbone) and t L has to be greater. For example, clade a in figure 2 resides in population B, and if we prune off clade a, there will still be branches in B on the backbone for reattachment. In contrast, clade a 0 resides in population H 1r , but if we prune off y 0 -a 0 , there will be no branches in population H 1r for reattachment and the youngest ancestor of H 1r with branches on the backbone is V, so that t L ¼ s V .
We generate a new age t Ã y around the current age (t y ), reflected into the interval (t L , t U ) if necessary, and then reattach y and clade a to a branch on the backbone at time t Ã y . A feasible target branch should cover t Ã y and should at time t Ã y be in a population ancestral to the population of a. We sample a target branch at random, either uniformly or with weights determined using local likelihoods.
In case the new branch y Ã -a passes hybridization nodes, we sample the flags at each hybridization node, as in step 1. Suppose we prune off branch y 0 -a 0 in figure 2 and the new age is t 0Ã y > s R . Then, we let branch y 0Ã -a 0 go through H 2l or H 2r according to their probabilities. The proposal ratio is given by the number of target branches for reattachment and the probabilities for sampling the flags.
Step 3. Change hs on the species tree using a sliding window. This step is the same as in Rannala and Yang (2003).
Step 4. Change ss on the species tree using a variant of the rubber-band algorithm (Rannala and Yang 2003). We generate a new age (s Ã ) around the current age, reflected into the interval ðs L ; s U Þ, determined using the ages of the parent nodes and daughter nodes on the species tree. Next we change the ages of the affected nodes on the gene trees using the rubber-band algorithm. An affected node has age in the interval ðs L ; s U Þ and resides in the current population (with age s) or the two daughter populations (if a speciation node is changed), or in the two current populations (H l and H r ) and the daughter population (if an H node is changed). For example, to change s S , the bounds are ðs H 2 ; s R Þ, and the affected nodes on the gene tree of figure 2 are in species S, U, and H 2 . These are x, u, and w. To change s H 2 the bounds are ðs H 1 ; s V Þ and the affected nodes are in species H 2l ; H 2r , and H 1r , and are a 0 . The proposal for changing node ages on the gene tree given the bounds is as in (Rannala and Yang 2003, eqs. A7 and A8).
Step 5. Rescale all node ages on the species tree and on the gene trees using a mixing step (a multiplier) (Rannala and Yang 2003).
Step 6. Change the introgression probability u for each introgression event using a sliding window. This step affects the gene-tree density, but not the sequence likelihood.
The sliding window used in BPP is the Bactrian move with the triangle kernel (Yang and Rodriguez 2013;Thawornwattana et al. 2018b).
Step lengths are adjusted automatically during the burn-in, to achieve an acceptance rate of $30% (Yang and Rodriguez 2013).

Simulation Study
We conducted three sets of simulations. The first set includes multiple sequences from each species and examines BPP estimation of parameters in the MSci model and the impact of the number of loci, the introgression probability u, and the species tree model. The second set compares BPP with two summary methods: SNAQ (Solis-Lemus and Ane 2016; Solis-Lemus et al. 2017) and HYDE (Blischak et al. 2018), using one sequence per species. The third set explores the performance of BPP when the model is unidentifiable using SNAQ and HYDE.
For the first set of simulations, multilocus data sets were simulated under the MSci models A and C of figure 1 and then analyzed using BPP to examine the precision and accuracy of parameter estimation. For model A, we used s R ¼ 0:03, s S ¼ 0:02; s T ¼ 0:02, and s H ¼ 0:01. For model C, we used s R ¼ 0:03 and s S ¼ s T ¼ s H ¼ 0:01. We used two values of u (0.1 and 0.5) and two values of h (0.001 and 0.01), applied to all populations. Each data set consisted of 10, 100, or 1,000 loci, and at each locus, 10 sequences were sampled from each species (with 30 sequences in total). The sequence length was 500 sites.
Data were generated using the "simulate" option of BPP. Gene trees with branch lengths (coalescent times) were simulated under the MSci model. Then, sequences were "evolved" along the branches of the gene tree according to either the JC (Jukes and Cantor 1969) or the GTR þ C (Yang 1994a(Yang , 1994b) models, and the sequences at the tips of the gene tree constituted the data at the locus. In the GTR þ C model, the GTR parameters varied among loci according to estimates obtained for chromosomal arm 2L from the A. gambiae species complex ). The base-frequency parameters were generated from a Dirichlet distribution ðp T ; p C ; p A ; p G Þ $ Dirð25:18; 20:50; 25:22; 20:38Þ. The GTR exchangeability parameters (Yang 1994a) were ða; b; c; d; e; f Þ $ Dirð7:59; 3:23; 2:95; 2:93; 2:93; 7:57Þ. The overall rates for loci varied according to a gamma distribution G(5, 5), whereas the rates for sites at the same locus varied according to the gamma distribution with mean one, Gða; aÞ (Yang 1994b), with the shape parameter a sampled from G(20, 4).
Each data set was analyzed using BPP. The JC model was always assumed whether the data were simulated under JC or GTR þ C. Inverse-gamma priors were assigned on parameters h and s 0 (the root age), with the shape parameter 3 and the prior mean equal to the true value: IG(3, 0.02) for h ¼ 0:01 and IG(3, 0.002) for h ¼ 0:001, and s 0 $ IG(3, 0.06). The inverse-gamma distribution with shape parameter a ¼ 3 has the coefficient of variation 1 and constitutes a diffuse prior. The uniform prior Uð0; 1Þ was used for u.
Pilot runs were used to determine the suitable chain length, and then the same settings (such as the burn-in, the number of MCMC iterations, and the sampling frequency) were used to analyze all replicates. Convergence was assessed A Bayesian Implementation of the MSci Model . doi:10.1093/molbev/msz296 MBE by running the same analysis multiple times and confirming consistency between runs (Yang 2015;Flouri et al. 2018).
The second set of simulation was to compare BPP with summary methods. Most methods are designed to test for the presence of gene flow (hybridization or migration) (Degnan 2018). Here, we used two methods that can estimate the introgression probability under a fixed introgression model: SNAQ (Solis-Lemus and Ane 2016) implemented in the program PhyloNetworks (Solis-Lemus et al. 2017) and HYDE (Blischak et al. 2018). The basic algorithms for SNAQ and HYDE are formulated for the case of three species with one or two outgroup species used to root the tree. SNAQ uses the proportions of the three gene-tree topologies, based on the observation that the probabilities for the two mismatching gene trees (which have different topologies from the species tree) are the same if there is deep coalescent but no gene flow while they are different if there is gene flow as well (Yu et al. 2014). HYDE uses the proportions of the three parsimony-informative site patterns pooled across loci or genomic regions (xxyy, xyxy, and xyyx), based on the observation that the probabilities for the two "mismatching" site patterns (xyxy and xyyx) are the same if there is deep coalescent but no gene flow while these are different if there is gene flow as well (Green et al. 2010).
We used model A of figure 1, plus two outgroup species D and E, to simulate L ¼ 10, 100, or 1,000 loci, with one sequence per species per locus. The data were then analyzed using SNAQ and HYDE, as well as BPP. The JC model was used both to simulate and to analyze the data. For SNAQ, gene trees were inferred using RAxML (Stamatakis et al. 2012). For BPP, the point estimates (posterior means) of u were used for comparison even though estimates for all parameters, with CIs, were produced.
The third set of simulation explores the performance of BPP under models that are unidentifiable using SNAQ and HYDE. An MSci model may be identifiable given the gene trees with coalescent times but unidentifiable given gene-tree topologies only (Degnan 2018). We simulated and analyzed data using BPP under two models: model D of figure 1 with bidirectional introgression between species A and B and the model of figure 2 (referred to as model 2H), with three species and two introgression events. Under model D, there is only one gene tree between two species so that its frequency is uninformative and SNAQ is not applicable, and nor is HYDE. Under model 2H, frequencies of three gene trees or three site patterns cannot be used to estimate two introgression probabilities and two internal branch lengths: It is thus impossible to apply SNAQ and HYDE to such data.
For model D, we used the following parameter values: s R ¼ 0:01; s X ¼ s Y ¼ 0:005; u X ¼ 0:1; u Y ¼ 0:3, and h ¼ 0:01 for all populations. We simulated 10 replicate data sets, each of 10, 100, or 1,000 loci. At each locus, we sampled 10 sequences per species (20 sequences in total), with the sequence length to be 500. The JC mutation model was used both to simulate and to analyze data by BPP. Note that there is an interesting identifiability issue (or label-switching issue) with model D, such that the two sets of parameters h R ; h Y ; h X ; s R ; s X ; 1 À u X ; 1 À u Y Þ are unidentifiable (see Discussion). Thus, an identifiability constraint should be applied, such as u X < 0:5. We ran the MCMC without any constraint, but the MCMC sample was preprocessed, with H replaced by H 0 if the sampled value for u X > 0:5, before the posterior summary was generated.

Analysis of the Purple Cone Spruce Data Sets
We reanalyzed sequence data concerning the origin of the purple cone spruce Picea purpurea (P) from the Qinghai-Tibet Plateau, hypothesized to have originated through homoploid hybridization between P. wilsonii (W) and P. likiangensis (L) (Sun et al. 2014). The data were generated by Sun et al. (2014). To make the computation feasible for the *BEAST program, Zhang et al. (2018) constructed and analyzed two nonoverlapping data subsets (data sets 1 and 2), each with 40, 30, and 30 phased sequences for P, W, and L, respectively, at 11 autosomal loci. We analyzed these data sets for comparison with the analysis of Zhang et al. (2018) using *BEAST. We also used BPP to analyze the "Full" data set from which data sets 1 and 2 were sampled, with 112, 100, and 120 sequences per locus for the same 11 loci.
The species tree of figure 3 was assumed (Sun et al. 2014). The priors were s 0 $ IGð3; 0:004Þ; h $ IGð3; 0:003Þ, and u $ Uð0; 1Þ. Rates for loci were either constant or had a Dirichlet distribution with a ¼ 2 (Burgess and Yang 2008). We used a burn-in of 32,000 iterations and took 10 5 samples, sampling every 10 iterations. The program was run at least twice for each analysis, to check for consistency between runs. Each run (on a single core) took $5 days.
Marginal likelihood for models A-C ( fig. 1) was calculated using thermodynamic integration with Gaussian quadrature (Lartillot and Philippe 2006;Rannala and Yang 2017), with 16 quadrature points.

Analysis of the Budding Yeast Data Set
We analyzed a budding yeast data set with 106 loci and five species: Saccharomyces cerevisiae (Scer), Saccharomyces paradoxus (Spar), Saccharomyces mikatae (Smik), Saccharomyces kudriavzevii (Skud), and Saccharomyces bayanus (Sbay). This is a subset of the data set published by Rokas et al. (2003) and previously analyzed by . The species tree or MSci model is shown in figure 4, with a Skud ! Sbay introgression. We used inverse-gamma priors IG(3, 0.04) for hs and IG(3, 0.2) for s 0 , and u $ Uð0; 1Þ.

Analysis of the Genomic Data from the A. gambiae Species Complex
We used the coding and noncoding loci compiled by Thawornwattana et al. (2018a) from the genomic sequences for six species in the A. gambiae species complex: A. gambiae Flouri et al. . doi:10.1093/molbev/msz296 MBE