Inferring the Direction of Introgression Using Genomic Sequence Data

Abstract Genomic data are informative about the history of species divergence and interspecific gene flow, including the direction, timing, and strength of gene flow. However, gene flow in opposite directions generates similar patterns in multilocus sequence data, such as reduced sequence divergence between the hybridizing species. As a result, inference of the direction of gene flow is challenging. Here, we investigate the information about the direction of gene flow present in genomic sequence data using likelihood-based methods under the multispecies-coalescent-with-introgression model. We analyze the case of two species, and use simulation to examine cases with three or four species. We find that it is easier to infer gene flow from a small population to a large one than in the opposite direction, and easier to infer inflow (gene flow from outgroup species to an ingroup species) than outflow (gene flow from an ingroup species to an outgroup species). It is also easier to infer gene flow if there is a longer time of separate evolution between the initial divergence and subsequent introgression. When introgression is assumed to occur in the wrong direction, the time of introgression tends to be correctly estimated and the Bayesian test of gene flow is often significant, while estimates of introgression probability can be even greater than the true probability. We analyze genomic sequences from Heliconius butterflies to demonstrate that typical genomic datasets are informative about the direction of interspecific gene flow, as well as its timing and strength.


Introduction
Gene flow between species occurs as a result of hybridization followed by backcrossing in one of the hybridizing species. While interspecific gene flow has a predominantly homogenizing effect, it may create new beneficial combinations of alleles at multiple loci, facilitating species diversification and adaptation (Arnold and Kunte 2017;Campbell et al. 2018;Feurtey and Stukenbrock 2018;Marques et al. 2019;Edelman and Mallet 2021). The outcome of introgression in each direction is influenced by multiple factors including mate choice (Peters et al. 2017), ecological selection, and hybrid incompatibility (for reviews, see Coyne and Orr 2004;Martin and Jiggins 2017;Moran et al. 2021). Given that these factors typically differ between species and that selection on introgressed material acts independently in different recipient species, it is likely that gene flow is often asymmetrical, being more prevalent in one direction than in the other. Reliable inference of the direction of introgression, as well as its timing and rate, will advance our understanding of this important evolutionary process and its consequences, including the role of gene flow during speciation and the adaptive nature of introgressed alleles.
Two models of interspecific gene flow have been developed in the multispecies coalescent (MSC) framework, representing different modes of gene flow (Jiao et al. 2021;Hibbins and Hahn 2022). The MSC-with-introgression (MSC-I; Flouri et al. 2020) model, also known as multispecies network coalescent (MSNC, Yu et al. 2012;Wen and Nakhleh 2018;Zhang et al. 2018), assumes that gene flow occurs at a particular time point in the past. The magnitude of gene flow is measured by the introgression probability (φ), the proportion of immigrants in the recipient population at the time of introgression. The MSC-withmigration (MSC-M) model, also known as the isolationwith-migration (IM) model, assumes that gene flow occurs continuously at a certain rate every generation after species divergence (Nielsen and Wakeley 2001;Hey et al. 2018). The rate of gene flow is measured by the expected number of immigrants from populations A to B per generation, M AB = N B m AB , where N B is the (effective) population size of population B and m AB is the proportion of immigrants in population B from A. In both models, the rates of gene flow (φ or M) are "effective" rates, reflecting combined effects of gene flow and negative or positive natural selection on introgressed alleles, influenced by the local recombination rate (Petry 1983; Barton and Bengtsson 1986).
Interspecific gene flow alters gene genealogies, causing fluctuations over the genome in the genealogical history of sequences sampled from extant species. Under both the MSC-M and MSC-I models, gene trees and coalescent times have probabilistic distributions specified by the model and parameters, including species divergence times, population sizes for extant and extinct species, and the rate of gene flow (see Yang 2014;Jiao et al. 2021 for reviews). Multilocus sequence alignments are informative about gene tree topologies and coalescent times, and thus about the direction of gene flow as well as its timing and strength. However, opposite directions of gene flow often create similar features in gene genealogies and in the sequence data. For example, gene flow in either direction reduces the average and minimum divergence between the hybridizing species. In the special case of sampling one sequence per species per locus, the data cannot identify introgression direction between two sister species (say A and B), because the coalescent time (t ab ) between the two sequences at each locus (a, b) has the same distribution under the models with A → B or B → A introgression (Yang and Flouri 2022, fig. 10; see also Discussion). If multiple sequences are sampled per species per locus, introgression direction becomes identifiable . Even so, inference of introgression direction may be expected to be a challenging task. This is particularly so for heuristic methods for inferring gene flow based on summary statistics. For example, the D statistic (Green et al. 2010;Durand et al. 2011) operates on species quartets and cannot identify the direction of gene flow. Although heuristic methods exist for inferring the direction of gene flow, based on estimated local genomic divergences (Green et al. 2010, fig. S39) or genome-wide site-pattern counts D FOIL (Pease and Hahn 2015), they do not make efficient use of information in the data, often require a specific species phylogeny and sampling setup, and cannot infer gene flow between sister lineages. For recent discussions of the strengths and weaknesses of heuristic versus likelihood methods, see Jiao et al. (2021), Hibbins and Hahn (2022), Huang et al. (2022), and Yang and Flouri (2022).
Here, we study the inference of introgression direction, focusing on the Bayesian method under the MSC-I model . Suppose introgression occurs from species A → B but we analyze genomic data assuming B → A introgression. We address the following questions. (a) Will we often detect introgression despite the assumed wrong direction? (b) How will the estimated introgression probability (φ B→A ) compare with the true introgression probability (φ A→B )? (c) How reliable will estimates of the time of introgression be, as well as other parameters such as species divergence times and population sizes? (d) Does the method behavior differ depending on whether gene flow is between sister lineages or between nonsister lineages, and whether gene flow is from a small population to a large one, or in the opposite direction? (e) How can we infer the direction of introgression (A → B vs. B → A)? (f) Are typical genomic data informative about the direction of gene flow? We focus on both Bayesian estimation of parameters, in particular the introgression probability , and on Bayesian tests of introgression (Ji et al. 2023).
We use a combination of mathematical analysis and computer simulation to characterize features of sequence data that are informative about the direction of gene flow. We first study the case of two species (A, B) by examining the distribution of coalescent times (t aa , t ab , t bb ) under the MSC-I model. The theory allows us to compare and quantify the amount of information in the data under different scenarios. Next, we explore the amount of information gained when a third species is added to branches of the species tree for two species and study the impact of introgression direction when gene flow involves nonsister species. Finally, we test these methods with genomic sequences from three species of Heliconius butterflies to verify the applicability of our results derived from the theoretical analysis and computer simulation and to demonstrate how the framework can be applied to infer the direction of gene flow, as well as its timing and strength. Our results provide practical guidelines for inferring introgression and its direction from genomic sequence data.

Notation and Problem Setup
We use the MSC-I model of figure 1a with A → B introgression to introduce the notation and set up the problem. Species A and B diverged at time τ R and hybridized later at time τ X . The magnitude of introgression is measured by the introgression probability or admixture proportion φ Y , which is the proportion of immigrants in population B from A at the time of introgression. There are three types of parameters in the model: species divergence times or introgression times (τ R , τ X ), population sizes for extant and extinct species (θ A , θ B , θ X , θ Y , θ R ), and the introgression probability (φ Y ). We measure divergence time (τ) by the expected number of mutations per site, with τ = Tμ, where T is the divergence time in generations and μ is the mutation rate per site per generation. As time T and rate μ are confounded in analysis of sequence data, only τ is estimable. Each branch on the species tree represents a species or population and is associated with a population size parameter, θ = 4N e μ, where N e is the (effective) population size of the species. A branch on the species tree is also referred to by its daughter node so that branch RX is also branch X, with population size θ X . Both τ and θ are measured as expected number of mutations per site; that is, one time unit is the expected time to accumulate one mutation per site. At this time scale, coalescence occurs between any two sequences in a population of size θ as a Poisson process with rate 2 θ .
Thawornwattana et al. · https://doi.org/10.1093/molbev/msad178 MBE Each dataset consists of sequence alignments at L loci, with n A sequences from A and n B sequences from B at each locus, and with N sites in each sequence. Underlying the sequences at each locus is a gene tree with branch lengths (coalescent times), with its probability distribution specified by the MSC-I model (Yu et al. 2014). We assume no recombination among sites in the sequence of the same locus and free recombination between loci; a recent simulation suggests that inference under the MSC is robust to moderate levels of recombination (Zhu et al. 2022). Under these assumptions, gene trees and sequence alignments are independent among loci. The data are analyzed under three MSC-I models that differ in introgression direction: model I with A → B introgression, model O with B → A introgression, and model B with bidirectional introgression (A ⇆ B) ( fig. 1a-c). The "inflow" (I) and "outflow" (O) labels are used here in anticipation of models involving more than two species to be analyzed later. We use the multilocus sequence data to estimate parameters in the MSC-I model . We also use the Bayesian test to detect the presence of gene flow, comparing an MSC-I model (

Distributions of Coalescent Times and Identifiability of Introgression Direction
We study the distributions of coalescent times between two sequences sampled from the same population (t aa , t bb ) or from different populations (t ab ). These are analytically tractable and are given in Appendix. Note that likelihood methods under the MSC-I model average over the full distribution of the gene tree (G) and coalescent times (t) for sampled sequences at every locus. However, this distribution depends on the number of sequences sampled per species (n A , n B ) and is too complex to analyze. Instead, we examine the pairwise coalescent times (t aa , t ab , t bb ) as important summaries of the data, and use their distributions to demonstrate the identifiability of introgression direction, to characterize the information content in estimation of introgression probability, and to predict the behavior of Bayesian parameter estimation  and Bayesian test of gene flow (Ji et al. 2023). Note that our theory for coalescent times applies to arbitrary sample configurations (n A , n B ); for example, if multiple sequences are sampled per species, t ab will refer to any pair of sequences, one from A another from B.
First, we ask whether introgression direction can be inferred using sequence data sampled from extant species. From equation (A2), we have f I (t ab ) = f O (t ab ) for all t ab > 0, with the parameter mapping τ (O) where the superscripts indicate the assumed model. Thus, t ab alone cannot distinguish models I and O. In other words, in the case of two species, introgression direction is unidentifiable using data of only one sequence per species per locus (Yang and Flouri 2022, fig. 10; see also Discussion).
However, introgression direction is identifiable if multiple sequences are sampled from A and B. Information for distinguishing models I and O comes mostly from coalescent times between sequences sampled from the same species (t aa , t bb ). If gene flow is A → B, the coalescent time for sequences from the donor species, t aa , is not affected by the A → B introgression. If different populations on the species tree have the same size (θ A = θ X = θ R ), t aa will have a smooth exponential distribution (e.g., fig. 2a, model I). Otherwise the distribution is discontinuous at time points τ X and τ R , because of population size changes. In contrast, t bb has a mixture distribution, depending on the hybridizing species to which each of the two B sequences is traced back on the gene genealogy (i.e., either parental species RX or RY at node Y, fig. 1a). Thus, the two models make different predictions about coalescent times t aa and t bb , and the direction of The magnitude of introgression is measured by the introgression probability: φ Y ≡ φ A→B in a and c or φ X ≡ φ B→A in b and c. Note that in the MSC-I models studied in this paper, branches RX and XA represent distinct populations with different population size parameters (θ X , θ A ), as are branches RY and YB. Horizontal arrows (XY and YX) represent introgression events rather than real populations and have no θ associated with them. The arrow points to introgression direction in the real world (forward in time) Direction of Introgression · https://doi.org/10.1093/molbev/msad178 MBE introgression is identifiable when multiple sequences are sampled per species per locus. If the introgression direction is specified (i.e., under the unidirectional introgression model), introgression probability (e.g., φ Y given model I) is identifiable using data of one sequence per species per locus. However, the bidirectional introgression model (model B) involves an unidentifiability of the label-switching type, with two unidentifiable modes or "towers" in the posterior surface if multiple sequences are sampled per species , or four unidentifiable modes if a single sequence is sampled per species; see Discussion for details.  R should be well estimated based on coalescent events in the root population. Below we focus on parameters θ (O) X , θ (O) Y , and φ X , which are harder to estimate.

Asymptotic Analysis and Best-Fitting Parameter Values
First, by considering the distributions of t aa , we predict θ * (O) X < θ (I) X (table 1). In the true model I, both A sequences enter X and may coalesce during (τ X , τ R ). In the fitting model O, the two A sequences may be separated into different populations due to introgression (one in X and the other in Y), so they may not coalesce in (τ X , τ R ) as often. Thus, having θ * (O) X < θ (I) X will increase the coalescent rate in X and help to fit model O to f(t aa ) over (τ X , τ R ).
Next from t bb , we predict θ * (O) (table 1). In the true model, A → B introgression reduces the chance of coalescence between sequences from B during (τ X , τ R ). In the fitting model, both B sequences enter Y, leading to a higher chance of coalescence during (τ X , τ R ). Thus, having θ * (O) Y > θ (I) Y helps to reduce the chance of coalescence in (τ X , τ R ).
Finally, by matching the amount of coalescence between sequences a and b over the time interval (τ X , τ R ), or by matching the probability densities f I (t ab ) and where Δτ = τ R − τ X is assumed to be the same under models I and O based on the arguments above. Equation (1) predicts that more gene flow will be inferred under model if the coalescent rate between sequences a and b during (τ X , τ R ) is lower in the fitting model than in the true model, a higher φ * X than the true φ Y will increase the chance of such coalescence and achieve a better fit to f I (t ab ). Similarly, less gene flow is ex- X to be 0.31, 0.35, 0.44, and 0.22 for cases a-d, respectively, compared with the inferred values φ * X of 0.27, 0.30, 0.98, and 0.17 (supplementary table S1, Supplementary Material online). The approximation is reasonably good except for case c, where φ * X was very high. We discuss these cases further when describing simulation results below.

Simulation Results Under the True Models I and B: Parameter Estimates Have Drastically Different Precisions
To verify and extend our theoretical analysis, we simulated datasets under model I ( fig. 1a) and analyzed them under models I, O, and B ( fig. 1a-c) using four sets of parameter values. Each dataset consists of L = 250, 1,000 or 4,000 loci, with n A = n B = 4 sequences sampled per species per locus and N = 500 sites in the sequence. Posterior means and 95% highest-probability-density (HPD) credibility intervals

Discontinuities in f(t aa ), f(t ab ), and f(t bb )
Species divergence time is informed by the discontinuities in the coalescent times (t aa , t ab , t bb ) (c) Population sizes for extant species: f(t aa ) and f(t bb ) over (0, τ X ) Population size for an extant species is easily estimated by the heterozygosity in the species (d) Population sizes for ancestral species not involved in introgression: Population sizes for ancestral species not involved in introgression are determined by coalescent times in the ancestral species (e) Ancestral population size: The fitting model O predicts a deficit of coalescence of A sequences (t aa ) over (τ X , τ R ) due to introgression but there is no such deficit in the true model I or in the data. Having a larger coalescent rate (or smaller population size θ (O) X ) in model O thus helps to improve the model fit to coalescence of A sequences in the data (f) Ancestral population size: There is a deficit of coalescence of B sequences (t bb ) over (τ X , τ R ) in the true model I or in the data. Having a smaller coalescent rate (or larger population size θ (O) Y ) in model O thus helps with the model fit (g) Introgression probability: f(t ab ) over (τ X , τ R ) Introgression probability is informed by the amount of between-species coalescence (t ab ) over (τ X , τ R ). Equation ( Direction of Introgression · https://doi.org/10.1093/molbev/msad178 MBE (CIs) are plotted in figure 3 (see also supplementary table S1, Supplementary Material online for L = 4, 000). Model I is the true model, so that the performance under this model constitutes the best-case scenario. Indeed all parameters are well estimated, with the posterior means approaching true values and the CI width approaching 0 when the amount of data L → ∞ ( fig. 3 and supplementary table S1, Supplementary Material online, cases a-d, model I). However, the amount of information in the data varies hugely for different parameters, as reflected in the relative error, measured, for example, by the CI width divided by the true value. Population sizes for extant species (θ A , θ B ) are much better estimated than those for ancestral species (θ X , θ Y ). Divergence times (τ R , τ X ) are well estimated as well. Introgression probability (φ Y ) has substantial uncertainties with wide CIs but with L = 4, 000 loci in the data, the estimates are fairly precise, suggesting that thousands of loci are necessary to estimate introgression probability precisely. The results parallel those found in a previous simulation examining the impact of data size (such as the number of loci, the number sequences per species, and the number of sites) on inference under the MSC-I model .
Model B allows bidirectional introgression and thus is a correct model, although it is overparametrized with an extra parameter φ X . As the amount of data increases, φ Y should converge to the true value while φ X to 0. Estimates of other parameters are very similar to those under model I, and the CI widths under models I and B are also very similar. In particular, φ Y is estimated with similar precision in the two models. In large datasets of L = 4, 000 loci, the average CI width is 0.07, 0.12, 0.08, and 0.16 for cases a-d under model I, compared with 0.07, 0.12, 0.09, 0.17 under model B. Even in small or intermediate datasets with L = 250 or 1,000 loci, the CIs for φ Y are similar between the two models. Thus, overparametrization incurred little cost to statistical performance of model B. This might seem surprising, because, given the difficulty of inferring introgression direction, one might expect the assumed incorrect B → A introgression in model B would interfere with estimation of φ Y in the correct direction, so that φ Y would have a much larger variance under model B than under model I. However, information concerning φ Y is largely determined by 1) the number of sequences reaching the hybridization node Y and 2) the ease with which one can tell the parental path taken by each B sequence at Y (see the next subsection for detailed discussions). Thus, there may be little difference in information content about φ Y between models I and B. Computationally, model B is much more expensive than model I due to sampling an extra parameter in the Markov chain Monte Carlo (MCMC) algorithm and to MCMC mixing issues .

Information Content for Estimating Introgression Probability Under the True Model
Here, we consider estimation of introgression probability φ Y in model I in the four cases ( fig. 3, cases a-d, model I).
We characterize the amount of information concerning φ Y when the correct model is assumed, and explain why φ Y was much better estimated in case a (same θ tall tree) than in case b (same θ short tree), and in case c (small to large) than in case d (large to small) ( fig. 3; supplementary table S1, Supplementary Material online: cases a-d, model I), even though the data size is the same and the true φ Y is the same (0.2) in all cases. The theory is also useful for understanding later simulation results for larger species trees.
Consider tracing the genealogical history of sequences at a locus backwards in time. When sequences from B reach the hybridization node Y ( fig. 1a), there is a binomial sampling process, with each sequence taking the horizontal (introgression) parental path (into RX) with probability φ Y and the vertical parental path (into RY) with 1 − φ Y . However, there are two differences from a typical binomial sampling. First, the number of B sequences reaching node Y is a random variable. Second, the outcome of the sampling process (i.e., the parental path taken by the sequence) is not observed but instead reflected in the gene tree and coalescent times (and thus in mutations in the sequences). Using a coin-tossing analogy, the number of coin tosses is random, and the outcome of the toss is visible only probabilistically. If a B sequence coalesces with an A sequence during the time interval (τ X , τ R ), it will be clear that the B sequence has taken the introgression parental path.
Thus, the amount of information in the data concerning φ Y is determined by two factors: 1) the number of B sequences reaching Y and 2) the ease with which one can tell the parental path taken by each B sequence at Y. The number of B sequences reaching Y at the locus is given as n B − c B , where n B is the number of B sequences sampled at the locus and c B is the number of coalescent events among them in B before reaching Y. The distribution of n B − c B can be easily calculated as a function of n B and 2τ Y /θ B , the length of branch B measured in coalescent units (Tavaré 1984: eqs. 6.1 and 6.2; Wakeley 2009: eqs. 3.39 and 3.41). More B sequences will reach Y the larger n B is and the smaller 2τ Y /θ B is. As a result, it will be harder to estimate φ Y if introgression is older (larger τ Y ).
The second factor-the ease with which one can tell the parental path taken by each B sequence at Y-concerns the probability that two sequences entering X coalesce in X before reaching R; there is more information about φ Y the longer the internal branch RX is or the smaller the population size θ X is ( fig. 1a). This may be seen by considering the special case where the data consist of one sequence per species per locus and where the true coalescent time (t ab ) is available at each locus. Then the information content for estimating φ Y may be measured by the Fisher information, given by where the expectation is with respect to t ab (eq. A3), and with equality holding if P X = 1. There is thus more information for estimating φ Y the closer P X is to 1, or in other words if the branch length in coalescent units,

MBE
(3) thus suggests that increasing P X is more effective in reducing V(φ Y ) than increasing the number of loci (L) by the same factor, which is in turn more effective than increasing the number of sampled sequences per locus (n B ) by the same factor. For example, doubling n B − c B reduces the variance for φ Y by a half, but doubling P X reduces the variance by more than a half. In our simulation ( fig. 3, model I), the introgression probability φ Y was better estimated in case a (same θ tall tree) than in case b (same θ short tree). At L = 4, 000, the 95% HPD CI width was 0.07 for case a, and 0.12 for case b. Consider the two factors. First, in case a (tall tree), branch YB is longer, with length 2τ Y /θ B in coalescent units, with a smaller number of sequences reaching Y than in case b (short tree). Indeed, given n B = 4 sequences from B, the probability that n B − c B = 1, 2, 3, and 4 sequences remain by time τ Y is 0.388, 0.515, 0.095, and 0.002, respectively in case a, with an average of 1.71 (supplementary fig. S1, Supplementary Material online). For the short tree of case b, the corresponding probabilities are 0.122, 0.481, 0.347, and 0.050, with average 2.32. The average number of sequences reaching Y differ by a factor 1.36. Second, in case a (tall tree), any B sequence reaching Y and taking the left parental path is more likely to coalesce with A sequences in X than in case b (short tree), with P X = 1 − e −1 = 0.632 in case a and P X = 1 − e −0.5 = 0.393 in case b, differing by a factor of 1.61. As increasing P X is more effective than increasing n B − c B (eq. 3), φ Y was more precisely estimated (with smaller variance) in case a than in b ( fig. 3; supplementary table S1, Supplementary Material online).
The difference between case c (small to large) and case d (large to small) was even greater, with φ Y much better estimated in c ( fig. 3). At L = 4, 000, the CI width was 0.08 for case c and 0.16 for case d (supplementary table S1, Supplementary Material online). In case c, more B sequences reach Y because of the large θ B than in case d. Furthermore, B sequences reaching Y into X have a high chance of coalescence with other sequences in population X. Both effects make it easier to estimate φ Y in case c than in case d (eq. 3). It is thus easier to estimate φ Y if introgression is from a small population to a large one than in the opposite direction (supplementary fig. S2, Supplementary Material online). Note that φ Y is the proportion of immigrants in the recipient population, so that with the same φ Y , there are many more migrants in case c than in d.

Parameter Estimation Under Misspecified Introgression Direction
When model O was used to analyze data simulated under model I ( fig. 1), the introgression direction is misspecified. As discussed above (table 1), species divergence and introgression times (τ R , τ X ) are well estimated despite misspecification, as are population sizes for extant species and for the root (θ A , θ B , θ R ). Indeed, those parameters are estimated with the same precision under models O and I ( fig. 3).
Here, we focus on parameters φ X , θ X , θ Y ( fig. 3, model O). Our arguments from the asymptotic analysis (table 1) also apply, although in simulations the results are affected by random sampling errors due to finite data size.
In cases a and b, all populations have the same size. Biases in parameter estimates under model O are well predicted by the theory (table 1): based on coalescent times t aa , t bb , and t ab , we expect In case c (small to large), introgression is from a small population to a large one. As the coalescent rate for sequences a and b over (τ X , τ R ) is much slower in the fitting model than in the true model, consideration of t ab predicts a large φ X or a small will compensate for reduced coalescence between B sequences caused by the A → B introgression (table 1). Thus, predictions about θ (O) Y based on t ab and t bb are somewhat conflicting. In the simulation, Material online). The extreme estimate causes small biases in τ R and τ X and poor estimates of θ (O) X ( fig. 3). Case d (large to small) assumes introgression from a large population to a small one ( fig. 1a) (table 1). Moreover, the larger source population in the true model (θ (I) X ) means t ab is less common in (τ X , τ R ), with most coalescence occurring in the common ancestor R. Thus, based on t ab we predict a larger θ (O) Y or a smaller φ X to reduce the amount of coalescence in (τ X , τ R ) in the fitting model (eq. 1). Thus, considerations of both t bb and t ab sug- Y is smaller or greater than θ (I) X , the introgression probability φ X may be greater or smaller than the true φ Y , according to equation (1). In our setting, θ (O) Y = 0.0107, slightly greater than θ (I) X = 0.01, and φ X = 0.17, slightly smaller than φ Y = 0.2 (supplementary table S1, Supplementary Material online).

Bayesian Test of Introgression: Power and False Positive Rate
We applied the Bayesian test of gene flow (Ji et al. 2023) to the data analyzed in figure 3. We are interested in the power of the test under the correct model I. Also we ask how often the test is significant if it is conducted under model O, with introgression direction misspecified.
Note that the behavior of the test or the asymptotic behavior of posterior probabilities of the compared models is determined by the parameter values in the limit of L → ∞ (Yang and Zhu 2018). If data are simulated under model I (with φ Y > 0) and analyzed under model I, the posterior probability for the true model I should approach 1, the Bayes factor in support of model I against model Ø of no gene flow ( fig. 1d) B Iø → ∞, and the power of the test should approach 100%, when the data size L → ∞ (Yang and Zhu 2018). If the data are simulated under model I and analyzed under model B, the power for testing φ Y (which has the true value φ Y > 0) should approach 100%, and the false positive rate for testing φ X (which has the Thawornwattana et al. · https://doi.org/10.1093/molbev/msad178 MBE true value φ X = 0) should approach 0, when the data size L → ∞.
If the data are generated under model I and analyzed under model O, both the null and alternative models are incorrect. According to our analysis φ * X > 0, and model O is a "less wrong" model than model Ø, judged by the KL divergence (Yang and Zhu 2018). Thus, when L → ∞, B Oø → ∞, and the probability of rejecting H ø : φ X = 0 will approach 100%. Here, the biological interpretation of test results is somewhat ambiguous. If one emphasizes the fact that model O allows gene flow while model Ø does not, detecting gene flow may be considered a correct result. However, if one emphasizes misspecification of introgression direction in model O, accepting model O may be considered a rather severe false positive error. In this paper, we use the second interpretation.
The Additional Information that Results from Including a Third Species Given two species (A, B) with introgression from A → B at the rate of φ ( fig. 1a), we consider the information gain for estimating φ from including a third species (C). There are five branches on the two-species tree onto which C can be attached ( fig. 4a-e): (a) the root population, (b,c) the source and target populations before gene flow, and (d, e) the source and target populations after gene flow. Case c is one of "inflow," with gene flow from the outgroup species (A) into one of the ingroup species (B), while b represents "outflow," with gene flow from an ingroup species (A) into the outgroup (B). Note that in all cases the correct MSC-I model is used in the analysis, so that the estimate (posterior mean) of φ will converge to the true value (which is 0.2). However, the information content may differ among the five cases. As in the case of two species, the amount of information concerning φ is determined by two factors: 1) the number of sequences reaching the hybridization node and 2) the ease with which one can tell the parental path taken by each sequence at the hybridization node. When introgression is between nonsister species, information concerning the parental path taken by each sequence may be in the change of gene-tree topology rather than in the change of between-species coalescent time.
We assumed the same population size θ 1 = 0.01 for all populations, but examined the impact of different population sizes in cases b and c. We simulated 100 replicate datesets in each case. The posterior means, the posterior standard deviation (SD), and the width of the HPD CI for φ are summarized in figure 4f-h. The 95% CIs for other parameters are shown in supplementary figure S3, Supplementary Material online.

Equal Population Sizes on the Species Tree
If all populations on the species tree have the same size (θ), we expect the amount of information for estimating φ to be in the order a ≺ d ≺ (b, e) ≺ c, with the order of b and e undecided ( fig. 4f-h).
First, a ≺ d. Cases a and d are the least informative. Adding an outgroup species C in case a adds little information about φ. In d, the C sequences may reach node X and coalesce with a B sequence in RX, providing information about whether sequences from B take the introgression parental path at node Y. Thus, we expect more information in the data in d than in a.
Next, d ≺ b. The number of B sequences reaching node Y is the same in the two cases, so the only difference is in the difficulty of inferring the parental path taken by B sequences at Y. In case b, coalescence of a B sequence with an A sequence causes a change to gene tree topology. In case d, introgression does not cause such topological change to the gene tree. The information content may thus be higher in b than in d.
Next, d ≺ e. In case e, sequences from both B and C may reach the hybridization node Y while in d only sequences from B may reach Y, so that the sample size at node Y is larger (less than twice as large) in e than in d. In d, more sequences enter population RX, increasing slightly the probability of coalescence for any B sequence that takes the introgression parental path at Y, but this effect may be less important than that of increased sample size in e.
Next, b ≺ c (i.e., it is easier to infer inflow than outflow). In both cases, the number of B sequences reaching node Y or the sample size at Y is the same. However, the two cases differ in the ease with which one can tell the parental path taken by each B sequence at Y. In c, coalescence of a B sequence with an A sequence over (τ X , τ R ) causes a change to Direction of Introgression · https://doi.org/10.1093/molbev/msad178 MBE gene tree topology. In case b, such topology change occurs only if the coalescence occurs in the shorter time interval (τ X , τ S ), and the resulting gene tree is harder to infer because of the shorter internal branch. It is thus harder to resolve the parental path taken by each B sequence at Y in b than in c, and the data are less informative about φ in b. It is harder to infer outflow than inflow.
Finally, e ≺ c. In case c, introgression leads to changes in gene tree topology whereas in e, more sequences reach Y with a larger sample size. The relative effects depend on the parameter values. In the simulation here, the increased sample size was less effective than the gene tree topology change ( fig. 4g and h, case c same-θ vs. case e). Note that in e the data are more informative about φ the closer τ S is to  MBE τ Y , and in both c and e the data are more informative the smaller τ X is.

Different Population Sizes on the Species Tree
For cases b (outflow) and c (inflow), we also consider different population sizes. The results are shown in figure 4f-h. First, in case b, φ is most poorly estimated in the lar-ge→small setting, much better estimated in the same-θ (or large→large) setting, and best in the small→large setting. This can be explained easily by the theory we developed in analysis of the two species case: a large recipient population means many sequences reaching the hybridization node Y and a large sample size, while a small donor species (θ X ) means fast coalescence and easy determination of the parental path taken at node Y. For example, the probability that more than one B sequence reaches Y is 0.613 in case b (same θ or small→large), and 0.012 in case b (large→small), with a large difference in the sample size.
Similarly in case c (inflow), φ is more poorly estimated in the large→small and same-θ (large→large) settings, and was better in the small→large setting. The differences among the three settings are much smaller than in case b.
Although case b outflow is less informative about φ than c inflow in the case of same-θ, the order is reversed in the small→large setting ( fig. 4). The same number of B sequences reaches node Y in both cases, so the difference must be due to the different levels of difficulty by which one can tell the parental paths taken by B sequences at node Y. In case b, B sequences taking the introgression parental path go through the small population SX and may coalesce at a high rate with sequences from A (which lead to changes to the gene tree topology informative about introgression), and with sequences from both A and C in population RS. In case c, B sequences taking the vertical parental path may coalesce in population RS with C sequences, but given that both populations SY and RS are large, this effect may be expected to be minor. While multiple factors can have opposing effects on the relative information content concerning φ in cases b versus c small→large, the data are more informative in case b than in c overall. Material online for different population sizes. The results for the large datasets of L = 4, 000 are summarized in supplementary tables S2 and S3, Supplementary Material online. We also applied the Bayesian test of introgression (Ji et al. 2023) to the simulated data. The results are summarized in supplementary figures S6 and S7, Supplementary Material online.

Simulation Results in the Case of Four Species
Overall, the results parallel those for the cases of two and three species discussed above. See the Supplementary Material online text "Simulation results in the case of four species" for detailed descriptions.

Analysis of Heliconius Genomic Datasets to Infer the Direction of Introgression
Overview To assess the applicability of our results from the asymptotic analysis and computer simulation to empirical datasets and the statistical and computational feasibility of inferring the direction of gene flow using genomic sequence data, we analyzed data from Heliconius cydno (C), H. melpomene (M), and H. hecale (H) ( fig. 6). Gene flow is known to occur between H. cydno and H. melpomene, whereas H. hecale is more distantly related, and is here treated as an outgroup, and is assumed not to have had introgression with the other two (Martin et al. 2013). We analyzed coding and noncoding loci on each chromosome as separate datasets (see supplementary table S4 . 1). We ran the MCMC algorithm in BPP to generate the posterior estimates of parameters in each model ) and conducted the Bayesian test of introgression (Ji et al. 2023). We describe the results for the coding and noncoding datasets from chromosome 1 (tables 2 and 3) in detail before discussing results for the other chromosomes.

Bayesian Test of Introgression for Chromosome 1
Results of the Bayesian test are summarized in table 3. To compare the four different models, we calculated Bayes factors using two approaches: thermodynamic integration with Gaussian quadrature (Lartillot and Philippe 2006;Rannala and Yang 2017) and Savage-Dickey density ratio (Ji et al. 2023); see Materials and Methods. The calculated values of the Bayes factor for the same test varied depending on the number of quadrature points in the thermodynamic-integration approach and on the threshold parameter in the Savage-Dickey density ratio, reflecting the challenges of calculating the marginal likelihoods or Bayes factors reliably in large datasets (Rannala and Yang 2017). For example, log B Iø for comparison of model I (C → M introgression) against model Ø (no gene flow) was 1087.1 and 1082.5, respectively, when K = 32 and 64 quadrature points were used in Gaussian quadrature. This difference is mainly due to the difficulty of calculating Direction of Introgression · https://doi.org/10.1093/molbev/msad178 MBE the power posterior rather than the use of too few quadrature points (Rannala and Yang 2017). Nevertheless, both values are far greater than the cutoff of 4.6 (= log 100). Similarly the Savage-Dickey density ratio approach estimates B Iø to be ∞ at all three threshold values (ϵ = 1%, 0.1%, 0.01%). Both approaches thus strongly support model I with C → M introgression and reject model Ø with no gene flow.
For both datasets from chromosome 1, the two approaches to Bayes factor calculation lead to the same conclusion, as do the three threshold values for the Savage-Dickey density ratio (ϵ = 1%, 0.1%, 0.01%). The null hypothesis φ C→M = 0 is rejected in the I-Ø and B-O comparisons, with strong support for the C → M introgression, whether or not the M → C introgression is accommodated in the model.
The B-I comparison tests the null hypothesis φ M→C = 0 when both the null and alternative models accommodate the C → M introgression. This test leads to strong support for the null model I, with B BI < 0.01. With C → M introgression accommodated, the data strongly support the absence of M → C introgression. Unlike Frequentist hypothesis testing, which can never support the null hypothesis strongly, here the Bayesian test strongly favors the null model I, rejecting the more general alternative model B. Thus, all tests have led to the same conclusions. Both the coding and noncoding datasets strongly support the presence of H. cydno → H. melpomene introgression, and both strongly support the absence of the H. melpomene → H. cydno introgression.
Parameter Estimation for Chromosome 1 Bayesian parameter estimates under the four models are summarized in table 2. Consistent with the results of the Bayesian test above, estimates of φ under model B suggest that gene flow is unidirectional. The estimates for the noncoding data are φ C→M = 0.28 (95% HPD CI: 0.25-0.31) and φ M→C < 1% in the opposite direction, while for the coding data, they are φ C→M = 0.51 (95% HPD CI: 0.47-0.54) and φ M→C < 1% (table 2). The reasons for the higher rate (φ C→M ) for the coding than the noncoding data are unknown. One intriguing possibility is that introgression is mostly adaptive, driven by natural selection, and that coding loci are under stronger selection. The time of introgression is nearly zero, suggesting that gene flow may be ongoing. Estimates under model I are nearly identical to those under model B. In model O where only M → C gene flow is allowed, the introgression probability is estimated to be φ M→C = 0.17 (0.15,0.20) for the noncoding data, and 0.14 (0.08, 0.20) for the coding data. Those rates are substantial, consistent with the significant test results (B Oø ). Even if gene flow is unidirectional from C to M, assuming introgression in the opposite (and presumably wrong) direction leads to high estimates of the rate and significant test results. Those results again parallel our simulations (supplementary figs. S2, S6, and S7, Supplementary Material online). The misspecified We note that the divergence time between H. cydno and H. melpomene (τ s ) is estimated to be much smaller, and θ S is much larger under model Ø (no gene flow) than under model I or B. This is because ignoring gene flow when it occurs causes model Ø to misinterpret reduced between-species sequence divergence (due to introgression) as more recent species divergence (Leaché et al. 2014;Tiley et al. 2023).

Parameter Estimation for the Other Autosomes
We analyzed the coding and noncoding data from all chromosomes in the same way, with parameter estimates under the four models (Ø, I There is overall consistency among the autosomes (chromosomes 1-20), although estimates of some parameters from chromosomes 5, 10, 13, 15, and 19 appear as outliers. For example, estimates of θ C and θ M are unusually large for chromosomes 5, 15, 19, and 20. A likely explanation is that the H. melpomene sample was partially inbred, with large variations in heterozygosity across chromosomes. We discuss results for the autosomes first before dealing with chromosome 21 (the Z chromosome).
For the autosomes, there is overall consistency between the coding and noncoding data: divergence times τ r and τ s and population sizes θ H and θ r are larger for the noncoding than coding data, by a similar factor across chromosomes (supplementary fig. S8, Supplementary Material online). This can be explained by a reduced effective neutral mutation rate for the coding data, due to purifying selection removing nonsynonymous mutations.
Although model Ø (no gene flow) underestimated the divergence time between the two species involved in gene flow, τ s (see above), all four models including model Ø produce nearly identical estimates of τ r , indicating that the impact of introgression is local on the species tree, only affecting estimates of parameters for nodes close to the introgression event. Estimates of τ s under model O are consistently smaller than under models I and B, especially for the coding data, apparently related to the low estimates of φ M→C for the coding data under model O. Introgression time τ c = τ m is nearly zero for most chromosomes under models I, O, and B, indicating that gene flow may be ongoing (Huang et al. 2022).
Estimates of introgression probability φ C→M are very similar between models I and B, and they are consistently larger for coding than noncoding data. Estimates of φ M→C under model B are consistently ≈ 0, suggesting the absence of M → C gene flow. Estimates of φ M→C under model O, assuming introgression in the wrong direction, are always larger than estimates under model B, but vary among chromosomes. These results are consistent with our simulations (e.g., fig. 3, cases a-d), where estimates of introgression probability φ X in model O vary, even though the true rate in the opposite direction is fixed (φ Y = 0.2), influenced by estimates of population sizes such as θ X and θ Y .

Bayesian Test of Introgression for the Autosomes
Bayes factors calculated via the Savage-Dickey density ratio are presented in supplementary table S6, Supplementary Material online. The results are similar to those for chromosome 1, with overwhelming evidence for the C → M introgression and no evidence for M → C introgression. For some datasets, B Oø < 100, so that the test of gene flow (H 0 : φ M→C = 0) is not significant when introgression was assumed to be in the wrong direction.  Nevertheless, parameters not involved in the unidentifiability (such as τ r , τ c , θ r ) are well estimated. In theory, the four modes represent unidentifiability of the label-switching type, and a relabeling algorithm can be used to process the MCMC samples to map the parameter values onto one of the four modes, as in Yang and Flouri (2022). This is not pursued here. Instead, our objective here is to provide explanations for the results of supplementary figure S8, Supplementary Material online (chromosome 21, model B). We recommend that multiple samples per species per locus (in particular from the recipient species) should be used to estimate introgression probabilities. Note that one diploid individual is equivalent to two haploid sequences.

Inferring the Direction of Gene Flow Using Genomic Data
In this study, we have identified features of genomic sequence data that are informative about the direction of gene flow, and quantified the power of the Bayesian test of gene flow and the precision and biases in estimates of parameters under the MSC-I model such as the time and strength of introgression. Our asymptotic analysis, computer simulation and real data analysis have produced highly consistent results. We have illustrated that one may gain much insight into the workings of likelihoodbased inference under the MSC-I model by simply considering pairwise coalescent times (t aa , t ab , t bb ) even though these are very simple summaries of the original data of multilocus sequence alignments (table 1). Knowledge of important features in the data that drive the estimation of model parameters, such as the introgression time and introgression probability, is very useful when we interpret results from analysis of real datasets.
Our analyses of both simulated and real data have demonstrated that typical genomic datasets may be very informative about the direction, timing and strength of introgression, and that current Bayesian implementations of the MSC-I model can accommodate thousands of genomic loci and are able to detect gene flow with nearly 100% power and to estimate the introgression time and introgression probability with high precision and accuracy ( fig. 3; supplementary figs. S2-S7, Supplementary Material online; see also Thawornwattana et al. 2022;Ji et al. 2023).
One major result from our analysis is that if introgression is assumed to occur in the wrong direction, the Bayesian test of gene flow will often be significant, and Bayesian estimates of introgression rate will typically be  nonzero and may even be greater than the true rate in the correct direction. Thus, neither a significant test nor a high rate estimate is reliable evidence that introgression occurred in the specified direction. This result may seem surprising and disturbing given that introgression in the specified direction is nonexistent. Our analyses of both simulated and real data suggest that the bidirectional model may be applied to infer the introgression direction. If gene flow is truly unidirectional, overparametrization of the bidirectional model appears to incur little cost in statistical performance even though it does add to computational cost: posterior CIs and power to detect gene flow under the bidirectional model are very similar to those under the true unidirectional model.
Of course a better approach to inferring the introgression direction is to implement efficient cross-model MCMC algorithms to search in the space of all MSC-I models for the given set of species. Indeed, MCMC algorithms that move between MSC-I models already exist (Wen and Nakhleh 2018;Zhang et al. 2018). These propose changes to the MSC-I model when the gene trees at all loci are fixed, and if the proposed new model is in conflict with some gene trees, the proposal is abandoned. Such algorithms have poor mixing properties if the dataset is not very small because the proposed new model is very likely to be in conflict with at least some gene trees. The algorithms do not appear to be feasible for analyzing even small datasets with 100 loci (Wen and Nakhleh 2018;Zhang et al. 2018). However, thousands of loci are often needed to provide precise and reliable inference of introgression between species. Smart MCMC moves that make coordinated changes to the gene trees when the chain moves from one model to another-similar to the algorithms developed under the MSC model with no gene flow for updating species divergence times (the rubber-band algorithm, Rannala and Yang 2003) or species phylogenies (the speciestree NNI or SPR moves, Yang and Rannala 2014; Rannala and Yang 2017)-may offer significant improvements even though they are challenging to develop.
Most heuristic methods for detecting gene flow are based on species triplets or quartets and use summaries of sequence data such as genome-wide site-pattern counts (as in the D-statistic, Green et al. 2010;Durand et al. 2011 andHYDE, Blischak et al. 2018) or frequencies of estimated gene tree topologies (as in SNAQ, Solis-Lemus and Ane 2016). Those methods are agnostic about the direction of gene flow. The D FOIL method of Pease and Hahn (2015) extends the D-statistic to identify the introgression direction: it assumes a particular species phylogeny for five species (a balanced quartet tree plus an outgroup), with one sequence sampled per species per locus. None of those heuristic methods can identify gene flow between sister lineages or its direction. Overall current heuristic methods make use of a small portion of information about gene flow in the multilocus sequence alignments, and offer exciting opportunities for improvements.  fig. 6, table 2), Calculated Using Thermodynamic Integration with 32 or 64 Gaussian Quadrature Points and Savage-Dickey Density Ratio with Threshold ϵ = 1%, 0.1%, or 0.01%.

Unidentifiability of Introgression Models
In this study (in particular, during the analysis of the Heliconius data), we have encountered several different types of unidentifiability issues. Here, we include a summary, which is technical and can be skipped (see also Yang and Flouri 2022 for further discussions). Yang and Flouri (2022) distinguished between withinmodel and cross-model unidentifiability. If the probability distributions of the data are identical under model m with parameters Θ and under model m ′ with parameters Θ′, with for all possible data X, then data X cannot identify (m, Θ) and (m ′ , Θ′). If m = m ′ and Θ ≠ Θ′, the parameters within the given model are unidentifiable. If m ≠ m ′ , the two models are unidentifiable (cross-model); in this case there is a parameter mapping from Θ in m to Θ′ in m ′ .
In the case of two species (say A and B) with one sequence sampled per species per locus, the coalescent time (t ab ) between the two sequences (a, b) has the same distribution under model I with A → B introgression and under model O with B → A introgression (Appendix). As a result, the two models are unidentifiable, or in other words, the introgression direction is unidentifiable   fig. 10). This is a case of cross-model unidentifiability. The parameter mapping is θ R and φ X = φ Y , with τ R and τ X being identical between the two models (eq. A3, fig. 1 If the model (model I, say) is given, parameters τ R , τ X , θ X , and φ Y are identifiable even with data of one sequence per species per locus. In the example of chromosome 21 for the Heliconius data, parameters τ s , τ c , θ c , and φ m are identifiable (supplementary fig. S8 and table S5, Supplementary Material online).
In the case of two species, introgression direction becomes identifiable if multiple sequences are sampled per species per locus . Furthermore, if data from other species are available and if gene flow occurs between nonsister species, introgression direction affects the distributions of the gene trees and coalescent times, and is identifiable whether one sequence or multiple sequences are sampled per species per locus (Jiao et al. 2021;Hibbins and Hahn 2022;Yang and Flouri 2022).
Such models can still be used in inference. If multiple samples are available per species per locus, model B with introgression between sister lineages shows two unidentifiable modes involving the two introgression probabilities and two population size parameters : This is a within-model unidentifiability of the label-switching type.
The case of the model B with only one sequence per species per locus was not discussed by Yang and Flouri (2022), although it arose in the analysis of data for chromosome 21 in the Heliconius genomic data (supplementary fig. S8, Supplementary Material online). With such data, model B with introgression between sister lineages shows four unidentifiable modes in the posterior: in figure 7b, . 7). If introgression is between nonsister lineages, each bidirectional introgression pair will create two cross-model modes, whether one sequence or multiple sequences are sampled per species per locus .

Asymmetry of Gene Flow in Nature
No systematic studies have examined the frequency of unidirectional versus bidirectional gene flow given that two species are involved in introgression. Both scenarios appear to be common. Sometimes gene flow occurs in one direction even though opportunities exist also in the opposite direction. A well-documented example is gene flow in the Anopheles gambiae group of mosquitoes in sub-Saharan Africa (della Torre et al. 1997;. Analysis of genomic data provides strong evidence for gene flow from A. arabiensis to A. gambiae or its sister species A. coluzzii, while the rate of gene flow in the opposite direction was estimated to be 0 (Thawornwattana et al. 2018;Flouri et al. 2020). This result from comparisons of genomic sequences is consistent with crossing experiments which supported introgression of autosomal regions from A. arabiensis into A. gambiae but not in the opposite direction (della Torre et al. 1997;. One possible explanation is that the X chromosome from one species may be incompatible with the autosomal background of the other species (Slotman et al. 2004;. The introgression from A. arabiensis into the common ancestor of A. gambiae and A. coluzzii has been hypothesized to have facilitated the range expansion of A. gambiae and A. coluzzii into the more arid savanna habitats of A. arabiensis (Coluzzi et al. 1979;Ayala and Coluzzi 2005).
Note that the rate of gene flow in the MSC-I model estimated from the genomic sequence data is an "effective" rate, reflecting the combined effects of gene flow and natural selection. Most introgressed alleles are expected to be purged in the recipient species by selection because they are deleterious or incompatible with the host genomic background (Schumer et al. 2018;Matute et al. 2020). It seems likely that alleles at introgressed loci from species Thawornwattana et al. · https://doi.org/10.1093/molbev/msad178 MBE A on the genomic background of species B will have different fitnesses than introgressed alleles from B on the background of A. Another factor is geographic context. If a smaller population of species A hybridizes with a larger population of species B, A is more likely to be swamped by B, making introgression asymmetrical. With all those factors considered, one should expect gene flow to be asymmetrical in most systems, with different rates in the two directions.

Gene Flow in Heliconius Butterflies
Heliconius cydno and H. melpomene are broadly sympatric across Central America and northwestern South America, and are known to hybridize in the wild (Mallet et al. 2007 male hybrids backcross to either parental species much more readily than the pure species mate with one another (Naisbit et al. 2001(Naisbit et al. , 2002. Previous studies used different approaches to estimate gene flow between these two species. Early phylogenetic analyses of multilocus data attributed recent gene flow between H. cydno chioneus and H. melpomene rosina as a cause for gene tree variation among loci (Beltrán et al. 2002). An IM analysis (Hey and Nielsen 2004) using a small number of loci yielded an estimated symmetric bidirectional migration rate m between the two species of 1.7 × 10 −6 (95% CI 1.0 − 45 × 10 −6 ) per generation, with H. cydno chioneus having a larger effective population size (Bull et al. 2006). An IM model allowing for different migration rates in each direction found evidence for unidirectional gene flow from H. cydno into H. melpomene, with 2N M m C→M = 0.294 (90% HPD CI: 0.116-0.737), whereas 2N C m M→C = 0.000 (0.000, 0.454) (Kronforst et al. 2006), consistent with our results. Similar patterns were obtained in a subsequent IMa2 analysis (Hey 2010) of a larger dataset (Kronforst et al. 2013). In a more recent analysis of genome-scale data, Martin et al. (2015)  Our estimates are in general consistent among chromosomes and between coding and noncoding data. However, only one diploid individual per species is included in the genomic data, with some from inbred lines (selected for sequencing because of easy assembly). These features of the data may have affected our estimates and account for the outlier estimates observed for a few chromosomes (supplementary fig. S8, Supplementary Material online). Overall, our analyses of genomic data are consistent with previous estimates.
We note that the null model ø in the Bayesian test used in this study constrains the population sizes (θ C = θ c and θ M = θ m ) as well as the introgression probability (φ = 0), compared with the alternative model (models I, O, or B) (Ji et al. 2023). Rejection of the null model may in theory be due to either introgression or inequality of population sizes, or both. A sharper test may use an alternative model with the same constraints on the population sizes as in the null model (θ C = θ c , θ M = θ m ) so that the two models under comparison have the only difference concerning the introgression probability (φ = 0 vs. φ > 0); this is test 2 in Ji et al. (2023, fig. 3). For the Heliconius data, we note that the CIs for φ c exclude the null value φ c = 0 for every autosome (supplementary fig. S8, Supplementary Material online), providing strong evidence for some introgression in the minority direction. Furthermore, it may be interesting to examine the impact of priors on parameters on the Bayesian test (Ji et al. 2023). We leave it to future work to use more genomic data and more focused tests to infer gene flow in this group of Heliconius butterflies.

Asymptotic Analysis and Simulation in the Case of Two Species
We examined the distributions of coalescent times and conducted computer simulations under model I of figure  1a, with A → B introgression. We used four sets of parameter values. a) same θ tall tree: all populations have the same size with θ = 0.01. The other parameters are τ R = θ, τ X = 0.5θ, and φ Y = 0.2. b) same θ short tree: θ = 0.01 for all populations, τ R = 0.5θ, τ X = 0.25θ, and φ Y = 0.2. c) small to large: different species on the species tree have different population sizes, with θ A = θ X = θ R = θ 0 = 0.002 on the left of the tree and θ B = θ Y = θ 1 = 0.01 on the right, with introgression from a small population to a large one ( fig. 1a). Other parameters are τ R = 3θ 0 , τ X = 1.5θ 0 , and φ Y = 0.2. d) large to small: This is the same as case (c) except that θ A = θ X = θ R = θ 0 = 0.01 on the left of the tree and θ B = θ Y = θ 1 = 0.002 on the right, so that introgression is from a large population to a small one.
We simulated multilocus sequence datasets under model I ( fig. 1a) and analyzed them under models I, O, and B ( fig. 1a-c). Each replicate dataset consisted of L = 250, 1,000 or 4,000 loci, with n = 4 sequences sampled per species per locus. The sequence length is N = 500 sites. The simulate option of BPP (Flouri et al. 2018) was used to simulate gene trees with coalescent times and to "evolve" sequences along the gene tree under the JC model Direction of Introgression · https://doi.org/10.1093/molbev/msad178 MBE (Jukes and Cantor 1969). Sequences at the tips of the gene tree constitute the data. The number of replicates was 100.
Each replicate dataset was then analyzed using BPP (Flouri et al. 2018 under models I, O, and B of figure 1a-c. This setting in which the model is fixed corresponds to the A00 analysis of (Yang 2015). The JC model was assumed in the analysis. Gamma priors were assigned to the age of the root of the species tree (τ R ) and to population size parameters (θ), with the shape parameter α = 2 so that the prior was diffuse and with the rate parameter β chosen so that the prior mean was close to the true values. We used τ R ∼ G(2, 200) and θ ∼ G(2, 200) for case a "same θ tall tree"; τ R ∼ G(2, 400) and θ ∼ G(2, 200) for case b "same θ short tree"; τ R ∼ G(2, 400) and θ ∼ G(2, 400) for case c "small to large" and d "large to small." Introgression probability φ was assigned the beta prior beta(1, 1), which is U(0, 1). MCMC settings were chosen by performing pilot runs, with MCMC convergence assessed by verifying consistency between replicate runs for the same analysis. The same setting was then used to analyze all replicate datasets. We used 16,000 MCMC iterations as burnin, and then took 10 5 samples, sampling every 2 iterations. Running time for analyzing one replicate dataset was ∼45 min for L = 250 loci or ∼3 h for L = 1, 000 using one thread, and ∼12 h for L = 4, 000 using two threads.
Simulation to Evaluate the Gain in Information for Estimating φ by Adding a Third Species Given the introgression model for two species (A, B) of figure 1a, with A → B introgression, we added a third species (C) and assessed the gain in information for estimating φ. There are five branches on the two-species tree, to which the third species could be attached (fig. 4a-e): (a) the root population, (b, c) the source and target populations before gene flow, and (d, e) the source and target populations after gene flow. In all cases φ = 0.2. The original two-species tree had τ R = θ 1 and τ X = θ 1 /2. In cases b-e, species C was attached to the midpoint of the target branch, while in a, the new root was 1.25× as old as the old root. For models a, d, and e, all populations on the species tree had the same size, with θ 1 = 0.01. For cases b and c, three scenarios were considered: 1) equal population size, with θ 1 = 0.01 for all populations; 2) from small to large, with θ A = θ X = θ S = θ 0 = 0.002 for the thin branches in case b and θ A = θ X = θ 0 = 0.002 in case c and with θ 1 = 0.01 for all other branches; and 3) from large to small, with θ B = θ Y = θ 0 = 0.002 in case b and θ B = θ Y = θ S = θ 0 = 0.002 in case c and with θ 1 = 0.01 for all other branches. For each parameter setting, we simulated 100 replicate datesets. Each dataset consisted of L = 1, 000 loci, with n A = n B = 4 sequences per species per locus and N = 500 sites in the sequence. Each dataset was analyzed using BPP to estimate the parameters in the MSC-I model ( fig. 4a-e). Gamma priors were assigned to τ R and θ: τ R ∼ G(2, 200) and θ ∼ G(2, 200), while φ A→B ∼ U(0, 1). We used 32,000 MCMC iterations as burnin, and then took 10 6 samples, sampling every 10 iterations. Running time for analyzing one dataset using one thread was ∼30 h.

Simulation in the Case of Four Species: Inflow Versus Outflow
We simulated data under the three MSC-I models (I, O, B) of figure 5a-c, with introgression between nonsister species A and B on a four-species tree ( (A, (B, C)), D). The three models differ in the assumed direction of gene flow, with I for inflow from A to B, O for outflow from B to A, and B for bidirectional introgression between A and B. We used two sets of parameter values. In the first set (same-θ), all species on the tree had the same population size, with θ 0 = θ 1 = 0.01. In the second set (different-θ), the thin branches had θ 0 = 0.002 while the thick branches had θ 1 = 0.01 ( fig. 5a-c). Other parameters were the same in the two settings, with τ R = 4θ 0 , τ S = 3θ 0 , τ T = 2θ 0 , and τ X = τ Y = 1.5θ 0 , and the introgression probabilities were φ X = φ Y = 0.2.
Each dataset consists of L=250, 1,000, or 4,000 loci, with n = 4 sequences per species per locus and with N = 500 sites in the sequence. The number of replicates was 100. With three MSC-I models (I, O, B), two population-size settings (same-θ vs. different-θ), and three data sizes (L), a total of 3 × 2 × 3 × 100 = 1800 datasets were generated. Each dataset was analyzed under the three models (I, O, B). Gamma priors were assigned to τ R and θ: τ R ∼ G(2, 200) and θ ∼ G(2, 400), while φ ∼ U(0, 1). We used 32,000 MCMC iterations as burnin, and took 2 × 10 5 samples, sampling every 5 iterations. Running time for analyzing one dataset was ∼12 h for small datasets of L = 250 loci and 60 h for L = 1, 000 using one thread, and ∼120 h for L = 4, 000 using two threads.

Analysis of the Heliconius Butterfly Dataset
We processed the raw genomic sequencing data of Edelman et al. (2019) from three species of Heliconius butterflies, H. hecale (H), H. cydno (C), and H. melpomene (M), to retrieve coding and noncoding loci for each chromosome, following the procedure of Thawornwattana et al. (2022). See supplementary table S4, Supplementary Material online for the number of loci in each of the 22 datasets. Each locus consisted of one unphased diploid sequence per species, except the Z chromosome (chromosome 21) for which only a haploid sequence is available per species (from ZW females). Heterozygote phase in the diploid sequence was resolved using an analytical integration algorithm in the likelihood calculation in BPP (Gronau et al. 2011;Flouri et al. 2018;Huang et al. 2022). We fitted four MSC-I models with different introgression directions: (Ø) MSC with no gene flow, (I) C → M introgression, (O) M → C introgression, and (B) C ⇆ M bidirectional introgression.

Bayesian Test of Introgression
We applied the Bayesian test of introgression (Ji et al. 2023) to data for two species simulated under the models of figure 1a-c, the data for four species simulated under models I, O, and B of figure 5, and the Heliconius datasets ( fig. 6).
Bayesian model selection was used to compare the null model of no gene flow H 0 : φ = 0 and the alternative model of introgression H 1 : φ > 0. The Bayes factor was calculated as B 10 = M 1 M 0 , where M 0 and M 1 are marginal likelihood values under H 0 and H 1 , respectively. If the prior model probabilities are π 0 and π 1 , B 10 can be converted into posterior model probabilities as P(H 1 | X) P(H 0 | X) = π 1 π 0 · B 10 . If π 0 = π 1 , B 10 = 100 will translate to the posterior probability P(H 0 |X) ≈ 1%. Thus, B 10 > 100 may be considered strong evidence in support of H 1 over H 0 , while B 10 < 0.01 is strong evidence in favor of H 0 over H 1 .
As H 0 and H 1 are nested, B 10 can be calculated using the Savage-Dickey density ratio (Dickey 1971), by using an MCMC sample under H 1 (Ji et al. 2023). Define an interval of null effects, ø : φ < ϵ, inside which the introgression probability is so small that introgression may be considered nonexistent. The Bayes factor in favor of H 1 over H 0 is then B 10,ϵ = P(ø) P(ø| X) , where P(ø) is the prior probability of the null interval, while P(ø|X) is the posterior probability, both calculated under H 1 (Ji et al. 2023). Note that P(ø) = P(φ < ϵ) = ϵ if the prior is φ ∼ U(0, 1). When ϵ → 0, B 10,ϵ → B 10 (Ji et al. 2023). We used a few values for ϵ in the range 0.01-1% to assess its effect. This approach has a computational advantage as it requires running the MCMC under H 1 only and avoids trans-model MCMC algorithms or calculation of marginal likelihood values. For the Heliconius datasets, we in addition used thermodynamic integration combined with Gaussian quadrature to calculate the marginal likelihood under each model, using 32 or 64 quadrature points (Lartillot and Philippe 2006;Rannala and Yang 2017). This approach applies even if the compared models are nonnested, and was used to conduct pairwise comparisons among all four models fitted to the Heliconius data.

Supplementary Material
Supplementary data are available at Molecular Biology and Evolution online.