Multispecies coalescent and its applications to infer species phylogenies and cross-species gene flow

Abstract Multispecies coalescent (MSC) is the extension of the single-population coalescent model to multiple species. It integrates the phylogenetic process of species divergences and the population genetic process of coalescent, and provides a powerful framework for a number of inference problems using genomic sequence data from multiple species, including estimation of species divergence times and population sizes, estimation of species trees accommodating discordant gene trees, inference of cross-species gene flow and species delimitation. In this review, we introduce the major features of the MSC model, discuss full-likelihood and heuristic methods of species tree estimation and summarize recent methodological advances in inference of cross-species gene flow. We discuss the statistical and computational challenges in the field and research directions where breakthroughs may be likely in the next few years.


INTRODUCTION
Developed in the 1980s, the coalescent is a stochastic process that describes the genealogical history of a sample of DNA sequences taken from a population [1][2][3]. Whereas traditional population genetic models of drift and mutation describe changes in allele frequencies over generations in the population, the coalescent focuses on the sample and traces the genealogical history of lineage joining of the sampled sequences backwards in time. The coalescent model is in particular suited to inference using genetic sequence data [4][5][6][7].
The multispecies coalescent (MSC) is an extension of the single-population coalescent to the case of multiple species [8]. It integrates the process of species divergences and the within-population process of drift and mutation. Placing the coalescent in the context of a species phylogeny makes it possible to use the ever-increasing genomic sequence data from multiple species to address a number of important biological questions, and in the past two decades, the MSC has emerged as the natural framework for such inferences. These include estimation of population parameters (such as species divergence times, population sizes for extant species and extinct ancestors and rates of cross-species gene flow), estimation of species phylogeny accommodating heterogeneous gene genealogies across the genome and delineation of species boundaries (species delimitation) [9][10][11][12]. In molecular phylogenetics, incorporation of the MSC to accommodate the so-called gene-tree-species-tree conflicts has been heralded as a 'paradigm shift' [13]. Stochastic fluctuation in genealogical history of sequences across the genome, when accommodated in the model, is not a 'conflict' or 'problem', but rather a source of information for important evolutionary parameters such as ancestral population sizes [14][15][16] and rates of cross-species gene flow [17,18].
The past decade has seen exciting advancements in the implementation and extension of the MSC model for inference using genomic sequence data. The data we consider in this review are sequence alignments at hundreds or thousands of loci, with the different loci having independent coalescent histories while all sites in the sequence at the same locus share the same history. Ideal data for such analysis are short segments sampled from the genome that are far apart [16]. While we use the term gene or locus, the data should ideally be non-coding DNA,  although exonic data have been successfully used in such analyses [19,20]. We describe the major features of the MSC model (in particular, the probability distribution of gene trees and coalescent times), and discuss its applications in two major areas: the estimation of the species phylogeny and the inference of cross-species gene flow. We focus on fulllikelihood methods (maximum likelihood or ML and Bayesian inference), as they have the best statistical properties, but include heuristic methods based on summaries of the data in our discussion. Several comprehensive reviews on heuristic methods have been published [9,10,[21][22][23]. We review recent advances in using the MSC model to infer ancient gene flow, including models of continuous migration (the so-called isolation-with-migration model) and the introgression/hybridization models. We end the paper with a discussion of the challenges and perspectives in the field. Our focus in this review is on MSCbased analyses of multilocus sequence data, and we do not consider population genetics methods that use summary statistics such as allele frequencies and single nucleotide polymorphisms (SNPs) to infer demographic processes including population structure and admixture [24,25].

Fisher-Wright model and the coalescent
The Fisher-Wright model [26,27] in population genetics describes the biological process of reproduction and drift in an idealized population of constant size, with non-overlapping generations, random mating and no population structure or selection ( Fig. 1(a)). Individuals of the next generation are generated by random sampling of gametes from the current population: the frequencies of alleles at a locus (say, A and a for two alleles) in the next generation are generated by binomial sampling given the allele frequencies in the current generation.
The coalescent model describes the same process of reproduction and drift, with the focus on the sample of sequences and with time running backwards ( Fig. 1(b)) [1]. When we trace the genealogical history of the sample backwards in time, lineages join or coalesce when we reach their common ancestors. While the forward Fisher-Wright model and backward coalescent model are two characterizations of the same process, the coalescent approach of focusing on the sample offers major advantages for many inference problems using genetic sequence data. For example, coalescent simulation of the genealogy of the sample is often far more efficient than forward simulation tracking the whole population. The basic coalescent model has been extended to accommodate demographic changes, recombination, population subdivision and selection [5,7]. Here we focus on the basic coalescent and on the probability distribution of gene tree topologies and coalescent times generated by the process.
Consider first n = 2 sequences sampled from a diploid population of size N. With random mating assumed in the Fisher-Wright model, sequences pick parents at random when we trace the genealogical history of the sample to the previous generation. As there are 2N parental sequences to choose from, the probability that the two sequences pick the same parent (that is, they coalesce) in the previous generation is 1/(2N). In other words, coalescent occurs as a Poisson process at the rate of 1/(2N), faster in smaller populations, and the coalescent time (the waiting time until the two sequences find their common ancestor) has a geometric distribution with the mean of 2N generations. Thus, two sequences sampled at random are on average separated by 2N × 2 generations or θ = 4Nμ mutations per site, where μ is the mutation rate per site per generation. Parameter θ , known as the population size parameter, is the average distance between two sequences sampled at random from the population. It is also known as heterozygosity and can vary hugely even between close species. Typical values include θ ≈ 0.1% for humans [28] and 0.1%-5% for Heliconius butterflies [29].
In analysis of sequence data, it is convenient to measure time by the mutational distance so that one time unit is the expected time to accumulate one mutation per site. With this time unit, the coalescent waiting time for two sequences (t 2 ) is approximately exponential with the mean θ/2, with density If there are n > 2 sequences in the sample, there will be ( n 2 ) = n(n − 1)/2 pairs and each pair coalesce at the rate of 2/θ , with the total rate ( n 2 ) · (2/θ ). The time until the next coalescent event has an exponential distribution with mean When a coalescent occurs, each of the ( n 2 ) pairs has the same probability to join. The number of lineages is then reduced from n to n − 1, and the process repeats, until the most recent common ancestor (MRCA) is reached ( Fig. 1(b)).
The n − 1 successive coalescent events generate a genealogical tree (G) of the sequences in the sample. This is a rooted tree with the internal nodes ranked by age, and is called the ranked tree or labelled history [30] (Fig. 1(b)). The number of possible labelled histories for a sample of size n is The extension of the single-population coalescent to multiple species has been called the interspecific coalescent [31] or censored coalescent [8], and is now commonly known as the multispecies coalescent [32]. Suppose that there are s species, which are related through a species phylogeny. Instead of a single parameter θ , the model now involves two sets of parameters: s − 1 species divergence times (τ s) and 2s − 1 population size parameters (θ s), with a total of 3s − 2 parameters (Fig. 2). Both the τ and θ are measured in the expected number of mutations per site. Figure 2. A species tree for three species (A, B and C) showing parameters in the MSC model, and the four possible coalescent histories for a locus with one sequence from each species, with probabilities (1 − φ, 1 3 is the probability that sequences b and c do not coalesce in species BC. Note that the first two histories correspond to the same rooted gene tree G 1 , and there are three gene trees: G 1 , G 2 and G 3 . Figure 3. A species tree for three species, (A, (B, C)), with a gene tree for five sequences at a locus to illustrate the MSC density of the gene tree with coalescent times.
Given the species tree, coalescent events occur independently in different populations, with the coalescent rate (2/θ ) given by the population size. When we trace the history of the sequences at a locus backwards in time and reach a speciation event, the coalescent process and rate are reset, because of the change in population size and because of sequences coming from the sibling species. For example, in Fig. 3, sequences c 1 and c 2 coalesce at the rate 2/θ C in species C. When they enter species BC at time τ BC , the coalescent rate (for each pair) is reset to 2/θ BC and the number of lineages becomes 3. Furthermore, we assume that gene trees at different loci are independent. One important feature of the MSC model is that the divergence time between sequences from two species must be greater than the species divergence time: sequences split before species or equivalently the gene tree fits inside the species tree. This intrinsic constraint between the species tree and the gene trees is the source of computational challenges in Bayesian implementations of the MSC model.
There are two important probability distributions under the MSC model: the (marginal) probabilities of gene tree topologies [21,33,34] and the joint distribution of the gene tree topology and coalescent times [8]. The former is useful for two-step methods of species tree estimation, which use reconstructed gene tree topologies as data, while the latter is used in full-likelihood methods, which use information in gene tree branch lengths (coalescent times) as well.

Probabilities of gene tree topologies
Under the MSC model, the gene tree topologies and coalescent times have a joint probability distribution given the species tree and parameters. For small species trees, it is easy to derive the marginal probability of gene tree topologies [2,33,35]. This line of work typically assumes one sequence sampled per species at every locus, so that there is no coalescent in modern species at the tips of the species tree. The case of three species is considered in [2]. Let the three species be A, B and C, with the phylogeny S = (A, (B, C)) ( Fig. 2). Let the divergence times be τ = (τ BC , τ ABC ) and the population sizes be θ = (θ BC , θ ABC ). Suppose that three sequences are sampled from the three species (a, b and c). There are three possible gene tree topologies: G 1 = (a, (b, c)) matches the species tree, while G 2 = (b, (c, a)) and G 3 = (c, (a, b)) are the mismatching gene trees.
When we trace the genealogy of the three sequences, sequences b and c may coalesce in population BC as a Poisson event at the rate of 2/θ BC just as in the single-population coalescent. Note that the probability that a Poisson event of rate λ does not occur in a time interval t is e −λt . Thus, the probability that sequences b and c do not coalesce in population BC or over the time interval Here τ /(θ BC /2) is known as the internal branch length in coalescent units-one coalescent unit in population BC is 2N BC generations or θ BC /2 mutations per site. If b and c coalesce in population BC, the gene tree must be G 1 . Otherwise, all three sequences enter species ABC and coalesce in random order so that the three gene trees occur with equal probability. Thus, the probabilities for the three gene trees (G 1 , G 2 , G 3 ) are Figure 4. Asymmetrical species tree for four species A, B, C and D, and three labelled histories (G 1 , G 2a , G 2b ) for a locus with one sequence from each species. Here G 1 matches the species tree, while G 2a and G 2b are distinct labelled histories sharing the same topology ((a, b), (c, d)), which is different from the species tree.
For certain species trees and parameter values, a mismatching gene tree may be more probable than the matching gene tree. The species tree is then said to be in the anomaly zone [33,34]. The anomaly zone does not exist for species trees of three species-as P(G 1 ) > P(G 2 ) = P(G 3 ) in equation (4), but can occur for asymmetrical species trees of four species, and for any species tree of five or more species [34].
Consider the asymmetrical species tree for four species S = (A, (B, (C, D))) of Fig. 4, and suppose that the three divergence times are very close, with τ ABCD ≈ τ BCD ≈ τ CD . Then all three coalescent events for the four sequences (a, b, c and d) will most likely occur in the root population ABCD, so that the 18 = ( 4 2 )( 3 2 )( 2 2 ) labelled histories will have nearly equal probability 1 18 . There are 15 possible rooted gene trees, 12 asymmetrical and 3 symmetrical. Each symmetrical gene tree (e.g. G 2 in Fig. 4) corresponds to two labelled histories (G 2a and G 2b in Fig. 4), so that its probability is ∼ 2 18 . Each of the 12 asymmetrical gene trees (e.g. G 1 in Fig. 4) is compatible with only one labelled history, with probability ∼ 1 18 . Thus, P(G 2 ) ≈ 2P(G 1 ). When the divergence times (τ s) are unequal but the internal branches are short enough, it is possible for the symmetrical mismatching gene tree G 2 to have a higher probability than the matching asymmetrical gene tree G 1 , in which case the species tree is in the anomaly zone.
If the species tree is in the anomaly zone, the simple majority-vote approach of using the most commonly observed gene tree as the estimate of the species tree is statistically inconsistent: the more gene trees there are, the more certain that the species-tree estimate will be incorrect. Note that the existence of the anomaly zone is not an intrinsic difficulty for species tree estimation; it instead highlights the importance of adopting a proper statistical inference framework. Full-likelihood methods are consistent for all species trees both inside and outside the anomaly zone, as they accommodate the probability distribution of the gene trees under the MSC appropriately. The discussion of the anomaly zone typically assumes true gene trees and ignores phylogenetic reconstruction errors in estimated gene trees. There have been only a handful of empirical examples of the anomaly zone, in African Anopheles mosquitoes [20], skinks [36], flightless birds [37] and gibbons [19].
The probabilities of gene tree topologies can be used to calculate the likelihood function for estimating the species tree using (reconstructed) gene trees as input data, as in the STELLS program [38]. However, popular heuristic methods such as MP-EST [39] and ASTRAL [40] do not use this theory and are instead based on species triplets or quartets. Furthermore, calculation of the probabilities of gene tree topologies, which involves summing over all coalescent histories that are compatible with each gene tree, becomes expensive when the number of species increases [21].

Joint probability distribution of gene trees and coalescent times
While the marginal probability of the gene tree topology may be challenging to compute, it is straightforward to derive the joint distribution of gene tree topologies and coalescent times. The general form, for an arbitrary species tree and an arbitrary number of sequences, is given in [8].
The joint density of gene trees and coalescent times is a product over the populations on the species tree, and as a result, we focus on the contribution from one population. A population is represented by a branch on the species tree (say XY) or by the daughter node of the branch (say X). Let τ X and τ Y be node ages or divergence times, and θ X be the population size. Suppose that m sequences enter the population at time τ X and l sequences leave the population at time τ Y , with 1 ≤ l ≤ m. For example, in the gene tree of Fig. 3, m = 3 lineages enter population BC while l = 2 lineages leave it. Unlike the singlepopulation coalescent, under the MSC, lineages entering a population do not necessarily find their common ancestor in that population, and the coalescent process may be 'censored' [8]. Note that if X is the root of the species tree, l must be 1.
The MSC density for the part of the gene tree residing in population XY is the product of three components. The first is the joint density of the m − l independent exponential coalescent waiting times {t X m , t X m−1 , . . . , t X l +1 }. The second component is for the gene tree topology in XY, and is a product of m − l probabilities, each being the probability, 1/( i 2 ), of choosing two out of i lineages to join, for i = m, m − 1, ..., l + 1. These two components are the same as in the single-population coalescent. The third component is the probability that no coalescent events occur in the last time interval before reaching τ Y . Multiplying the three components, we obtain the MSC density of the gene tree in XY as For example, the contribution of species BC to the MSC density of the gene tree in Fig. 3 is As coalescent processes in different populations operate independently, the MSC density for the whole gene tree at a locus is the product of the contributions across all populations. For the gene tree of Fig. 3, this is The four pairs of brackets correspond to species A, C, BC and ABC, respectively. Coalescence is not possible in species B as only one sequence is sampled from that species. With multiple loci in the data, the joint MSC density of the gene trees is a product across all loci, because the genealogical histories at different loci are assumed to be independent. The formulation allows the loci to have different sampling configurations. For example, the number of sequences from each species may vary among loci and some species may be missing at some loci.

Species-tree-gene-tree conflicts
The gene tree representing the coalescent history of the sequences at a locus may not match the species tree. Such a discordance may occur because when we trace the history of the sample backwards in time, sequences from different species may not coalesce as soon as they reach the most recent common ancestor on the species tree but instead coalesce in more ancient ancestors (e.g. gene trees G 1b , G 2 , G 3 in Fig. 2). This delayed coalescence or deep coalescence is also known as incomplete lineage sorting. While several biological processes, including gene duplication followed by gene loss or horizontal gene transfer [41,42], can cause the gene tree to differ from the species tree as well, deep coalescence is more fundamental because coalescent is simply biological reproduction and drift and thus may affect every species. Deep coalescence is more common when multiple species arise through a rapid succession of speciation events, resulting in very short internal branches on the species tree relative to the coalescent waiting time (note that φ in equation (3) is greater for smaller τ and larger θ BC ). The existence of the anomaly zone is an extreme case of deep coalescence. Deep coalescence is related to how short the internal branches are, rather than how deep they are on the species tree, and may thus occur in both shallow and deep species trees [43].

Full-likelihood methods
ML methods [44,45] and Bayesian inference [46][47][48][49] use the joint distribution of gene trees and coalescent times [8] and operate on multilocus sequence data directly. Let the sequence data be X = {X j }, where X j is the alignment of n j sequences at the jth locus for j = 1, 2, ..., L. Let S be the species tree, and let = {τ , θ , η} be the vector of parameters, including species divergence times (τ ), population sizes (θ ) and parameters in the mutation model (η). The likelihood of the sequence data given the MSC model has the form where f (X j |G j , t j , η) is the phylogenetic likelihood given the gene tree G j and branch lengths t j at locus j [50], while f (G j , t j |S, ) is the MSC density of the gene tree described above [8]. As the genealogical histories at different loci are independent, the likelihood of the sequence data is a product across all loci. The summation in equation (8) is over all possible gene tree topologies for the sequences, and the integral is n j − 1 dimensional, over the n j − 1 coalescent times on each gene tree. The gene trees and coalescent times are not observed, and the likelihood function averages over them, accommodating their uncertainties. The species tree S and the MSC parameters can be estimated using ML by maximizing equation (8).
Both the phylogenetic likelihood f (X j |G j , t j , η) and MSC density f (G j , t j |S, ) are straightforward to calculate, but averaging over all the possible gene tree topologies and coalescent times at each locus is computationally infeasible except for small data sets. The only ML implementation available is the 3S program [44,45], which enumerates the gene trees and uses numerical integration (Gaussian quadrature) to calculate the integrals. Although limited to three species and three sequences per locus, 3S can handle tens of thousands of loci.
With more than three species, the Bayesian method has a computational advantage over ML, with the Markov chain Monte Carlo (MCMC) algorithm averaging over the gene trees and coalescent times. We assign prior distributions to the species tree and model parameters. For example, the species tree can be assigned a uniform prior over all rooted trees, while the population-size parameters (θ s) can be assigned gamma or inverse-gamma priors. The inverse-gamma priors for the θ are conjugate (so that both the prior and posterior for the θ are inverse gamma), allowing the θ to be integrated out analytically [51], which helps with MCMC mixing. The age of the species-tree root can be assigned a gamma or inverse-gamma prior, while the other node ages can be constructed using a Dirichlet distribution [52]. The MCMC algorithm samples from the joint posterior distribution of the species tree, the MSC parameters and the gene trees at all loci In particular, the samples of (S, ) generated by the algorithm are from the marginal posterior f (S, |X), and the frequency at which a species tree is visited is an estimate of its posterior probability. In this way, MCMC averages out the gene trees and coalescent times numerically.
The first implementation of the Bayesian approach is the program BEST [53]. This uses the samples of gene trees with branch lengths produced by MRBAYES [54] and applies an importance-sampling correction because MRBAYES does not assume that the gene trees are distributed according to the MSC density. This strategy does not work well, as the species tree and the gene trees place tight constraints on each other in the MSC model. Currently, two Bayesian programs under the MSC are in common use: *BEAST [46] and BPP [47][48][49], both of which explicitly use the MSC model. The algorithm in BPP for species tree inference goes through several proposal steps in each MCMC iteration, as follows.
(1) Update the coalescent times t j on the gene tree at each locus j. (2) Update the gene tree topology G j at each locus j through a subtree-pruning-and-regrafting (SPR) algorithm. (3) Update the population sizes (θ s).
(4) Update the species divergence times (τ s). (5) Update the species tree topology S through a nearest-neighbor interchange (NNI) or SPR move, which may change the gene trees to avoid conflicts. (6) Use a multiplier to rescale all node ages on the species tree and on all gene trees.
Perhaps the greatest challenge in such MCMC algorithms comes from the constraint between the species tree and the gene trees. Consider step 4 for changing species divergence time τ AB , the age of the ancestral node for two sister species/clades A and B. Let t ab be the sequence divergence time for two sequences from A and B. Then τ AB < t ab . If the dataset includes thousands of loci and many sequences from A and B at each locus, the smallest of t ab among all loci may be almost identical to the current τ AB . Then, when we use a sliding window to change τ AB , the window size will have a width near zero, and the MCMC is virtually stuck. A 'rubber-band' algorithm was proposed in [8], which changes τ and the affected node ages on gene trees jointly. Similarly, in step 5, it is very inefficient to change the species tree when all gene trees are fixed. A breakthrough was to make coordinated changes to the gene trees when an NNI algorithm is used to change the species tree [47]. The algorithm has since been extended to SPR [48,55] and ported to *BEAST as well [55,56]. Those improvements have pushed the limit of datasets that can be analyzed using Bayesian MCMC programs from ∼100 to ∼10 000 loci [19,20].

Heuristic or summary methods
Many heuristic methods for species tree estimation have been developed, which use summaries of the data rather than the original mutlilocus sequence alignments. For extensive reviews, see [9,10,22,23].
MP-EST [39] estimates triplet gene trees under the molecular clock (rate constancy among lineages), and then uses a composite likelihood function, treating the frequencies of the triplet gene trees as input data from a trinomial distribution (with probabilities given in equation (4)). A composite or pseudo-likelihood is constructed by multiplying those probabilities for all possible triplets, ignoring lack of independence among them. This composite likelihood is maximized to estimate the species tree.
ASTRAL [40] uses a phylogenetic method to infer unrooted gene trees, and extracts the quartets from them. It then finds the species tree that is most compatible with the quartets in the set. A procedure has also been developed to attach local support values for nodes on the inferred species tree [59].
NJ-ST [57] uses a distance method to estimate an unrooted species tree from a collection of unrooted gene trees. The species tree estimate is the neighborjoining tree built from a distance matrix where the distance between two species is defined as the average number of internal nodes on the gene tree between the species.
All those three methods are two-step methods, treating estimated gene tree topologies as data. They are consistent, with the probability to recover the correct species tree approaching 1 when the number of gene trees increases. As discussed above, the anomaly zone does not exist for rooted triplets or equivalently for unrooted quartets. However, the argument for consistency is based on the assumption that the input gene trees are known without error. Phylogenetic reconstruction errors are known to affect the performance of two-step methods [60]. Furthermore, as those two-step methods use gene tree topologies but not branch lengths or coalescent times, they suffer from unidentifiability issues [9]. They can estimate the species tree topology but not all parameters in the MSC model.
Another summary method is called SVDQUAR-TETS [58]. This is a quartet method, designed for data of coalescent-independent sites, sites that have independent histories. Such sites are similar to SNPs but include constant sites as well. Genome sequencing projects do not generate such data. When the method is applied to multilocus sequence alignments, sites are pooled across loci, as in the concatenation method, so that the data are the counts of 256 (=4 4 ) site patterns for the species quartet. Note that the site-pattern counts pooled across loci are summaries of the original multilocus alignments. When all sites have independent histories, the summation over gene trees and the integral over coalescent times under the MSC model (equation (8)) are analytically tractable [58,61]. Pooling sites across loci causes information loss and identifiability issues, so that the method is unable to identify all parameters in the MSC model even if the species tree topology is identifiable [9,62]. Some two-step methods use both gene tree topologies and branch lengths (coalescent times) [63]. However, those methods were found to have poorer performance than methods based on topologies alone [64,65]. This is because the methods ignore random sampling errors in branch-length estimates. It is easy to see that sampling errors in branch lengths may have a major impact on estimation of the species tree and the MSC parameters. For example, if two sequences from two species are identical at a locus so that the estimated coalescent time is t ab = 0, the species divergence time τ AB will be forced to be 0 as well (since τ AB < t ab ), which may have a dramatic effect on species tree estimation. While coalescent times or branch lengths on gene trees contain much information [62], it is important to accommodate their uncertainties. Figure 5 shows results from a small simulation to illustrate the different performance of a fulllikelihood method (BPP), two summary methods (ASTRAL and MP-EST) and ML analysis of concatenated data. The species tree is challenging with short internal branches in both sets of simulations. BPP recovered the true species tree with higher probability than the two summary methods and concatenation. For set 1, all four methods are consistent, with the probability of recovering the true species tree approaching 1 for every method when the number of loci increases. For set 2, the species tree is in the anomaly zone, and concatenation/ML is inconsistent, with the probability for the mismatching balanced tree approaching 1, while the other three methods are consistent. Note that the ML method applied to concatenated data assumes one tree and one set of divergence times for all loci and can be inconsistent [67].

Comparison between full-likelihood and heuristic methods
Heuristic methods based on data summaries have a huge computational advantage over full-likelihood methods. For large datasets with hundreds or thousands of species and thousands of loci, they may be the only methods that are currently feasible computationally. Heuristic methods have poorer statistical performance than full-likelihood methods, and the difference can be large for challenging species trees with short internal branches [9,19,62,64,68]. As two-step methods typically ignore phylogenetic reconstruction errors in gene trees, their performance may suffer from uncertainties in the gene trees [60,64]: for those methods, species trees are only as good as the gene trees on which they are built [9,23].
An important strength of full-likelihood methods is that they can provide estimates of parameters in the MSC model when the species tree is fixed [16,56]. The MSC model for a species tree of s species has s − 1 divergence times (τ s) and 2s − 1 population sizes (θ s) (Fig. 2), all of which can be identified and estimated by full-likelihood methods using multilocus sequence data. In contrast, summary methods use only a portion of information in the data and are unable to identify all parameters in the model. For example, in the case of three species, the MSC model involves seven parameters (two τ and five θ ), but there are only two distinct frequencies of gene trees (equation (4)), so that twostep methods using gene tree topologies alone can identify only the internal branch length in coalescent units: φ or 2 τ /θ BC of equation (3). For large datasets for which species tree estimation using fulllikelihood methods is too expensive, it may be advisable to use summary methods to infer the species tree, and then full-likelihood methods to estimate the population parameters on the species tree.

MULTISPECIES COALESCENT WITH MIGRATION OR INTROGRESSION
In the past two decades, analyses of genomic data have highlighted the prevalence of cross-species gene flow [69][70][71]. Ancient gene flow has been detected in a variety of species, from mosquitoes [20,72] and butterflies [73] to hominins [74]. Like deep coalescence, gene flow causes genealogical fluctuations across the genome, posing challenges to species tree estimation [75][76][77][78]. Perhaps more importantly, hybridization can lead to rapid genomic changes, leading to beneficial new phenotypes and ecological adaptations. Inferring the mode and timing of gene flow may help us to achieve a better and richer understanding of the process of speciation and adaptation [70,71]. Two types of model of gene flow have been developed, both as extensions to the MSC model. The first is the migration model (MSC+M), also known as the isolation-with-migration (IM) model [17,79], which assumes that gene flow occurs at a certain rate every generation. The second is the hybridization/introgression model (MSC+I or MSci) [80,81], in which hybridization occurs at a fixed time point in the past. Here we discuss the distribution of gene trees under those models of gene flow. ML and Bayesian methods of inference proceed as before (equations (8) and (9)), except that the model may involve parameters that measure the timing and strength of gene flow and the gene tree may include the migration or introgression history, as well as the tree topology and coalescent times. We also mention a few heuristic methods for testing for the presence of gene flow and estimating its rate.

Isolation with migration
Consider two populations A and B with population sizes θ A and θ B that have been exchanging migrants at the rates of M AB and M BA since their divergence at time τ R (Fig. 6(a)). The parameter vector in the IM model for two species is thus Here the population migration rate M AB = m AB N B is the expected number of migrants from A to B (in the real world with time running forwards) per generation, with m AB the proportion of individuals in population B that are immigrants from population A. The rate M BA = m BA N A is defined similarly. Note that migration rates in the IM model reflect the long-term effects of migration, genetic drift, recombination, as well as natural selection purging introduced alleles [71]. We consider the probability density of gene trees under the IM model. There are two formulations, depending on whether the gene tree at a locus includes the migration history.
In the first formulation, the gene tree includes the tree topology and coalescent times, but not the migration history (or with the migration history integrated out). This relies on the theory developed in the structured coalescent framework in which the backwards-in-time process of coalescence and migration is described using a continuous-time Markov chain [82][83][84]. The state of the chain is specified by the number of sequences in the sample and their population IDs [18,45,61]. Consider the IM model for the two species (A and B) of Fig. 6(a) and suppose that two sequences (a and b) are sampled at locus j ( Fig. 6(b)), so that the gene tree is just the sequence divergence time t j (we suppress the subscript and write t j as t henceforth). When we trace the genealogy of the two sequences backwards in time, the sequences may move between populations and they may coalesce. The possible states are s AA , s AB , s BB , s A and s B . Here s AA means that both sequences are in population A, s BB means that both are in B, while s AB means that one is in A and the other is in B. With only two sequences in the sample, there is no need to distinguish s AB and s BA . If the two sequences have coalesced, the state becomes s A or s B , and these are lumped into one artificial absorbing state, s A|B , since there is no need to trace the history any further. Let Q = {q uv } be the generator matrix for the Markov chain over the time interval (0, τ R ), where q uv is the instantaneous rate of transition from states u to v. That is, Here the time unit is one mutation per site, m 1 = 4M BA /θ A = m BA /μ is the mutation-scaled migration rate into species A and m 2 = 4M AB /θ B = m AB /μ is the rate into B. Note that the Markov chain runs backwards in time while the migration rates (e.g. M AB and m 2 ) are defined under the real-world forward-in-time view. For example, in the first row, the transition from s AA to s AB represents migration from B to A in the real world, and either sequence in A can be the migrant, so that the rate is 2m BA per generation or 2m BA /μ = 2m 1 per mutational time unit. The transition from s AA to s A|B means that the two sequences coalesce in A, with rate 2/θ A . State s BB is not reachable from s AA instantaneously. The transition probability matrix over any time where p uv (t) is the probability that, given state u at time 0, the chain will be in state v at time t. The matrix P(t) is analytically tractable in special cases (e.g. when the model is symmetrical with M AB = M BA and θ A = θ B , [18]), but can be calculated in general using efficient algorithms for matrix exponentiation. Let s 0 be the initial state, which is one of s AA , s AB and s BB , depending on which species each sequence is sampled from (s 0 = s AB in the gene tree of Fig. 6(b)). The density of the divergence time t is Recall that the probability density f(t) means that f(t) t is the probability that the divergence time is in the small interval (t, t + t). In the case of t < τ, the two sequences coalesce before reaching τ . The probability f(t) t is a sum of two terms, corresponding to the coalescent occurring in either A or B.
The first term, p s 0 s A A (t)(2/θ A ) t, is the probability that both sequences are in species A right at t, times the probability, (2/θ A ) t, that they coalesce during (t, t + t). Similarly, the second term is the probability of coalescent occurring in B. In the case of t > τ, the two sequences do not coalesce in either A or B before time τ and both enter the ancestral species R. Here 1 − p s 0 s A|B (τ ) is the probability that the Markov chain is in any of the two-sequence states at time τ (in other words, sequences a and b have not coalesced by time τ ). Inside species R, the two sequences coalesce at the rate 2/θ R , with the waiting time (t − τ ) exponentially distributed. Note that calculation of P(t) for the Markov chain integrates out the migration history at each locus analytically, so that equation (11) is a function of the divergence time t but not of the migration events or times. Even in the case of two sequences ( Fig. 6(b)), there are an infinite number of migration histories that give rise to the same t, and equation (11) averages over all of them.
The Markov chain (Q) specified above applies to two species and two sequences. A different Markov chain has to be constructed if there are more species or more sequences. The theory is general and works for arbitrary numbers of species and sequences. For a tree of s extant species, we divide the timeline into s epochs according to the (s − 1) species divergence times. In each epoch, the populations are fixed so that the coalescent and migration rates stay the same, and a Markov chain can be constructed [18,61]. With the MSC density of gene trees calculated this way, the likelihood under the IM model is given by equation (8), although the parameter vector includes the migration rates as well. This strategy of integrating out the migration history may offer a huge computational advantage. However, the number of states in the Markov chain grows explosively with the increase in the number of species and the number of sequences [61]. The formulation is feasible for very small numbers of species and sequences only. The only implementation of this strategy appears to be the ML program 3S [18,45], which is limited to three species and three sequences, although tens of thousands of loci can be handled.
In the second formulation, the gene tree at a locus includes the tree topology, coalescent times and the full migration history, including the number, times and directions of migration events (Fig. 6(b)). The probability density for such a gene tree is easy to compute because both coalescent and migration are Poisson events with exponential waiting times [85][86][87]. In the gene tree of Fig. 6(b), the time period (0, t) is broken into six time segments by the coalescent, migration and speciation events, and within each segment, the number of lineages is constant, as are the coalescent and migration rates. Then the probability density of the gene tree (G) is given by the rates for the coalescent and migration events times the probability of no events over the whole time period The three pairs of brackets represent contributions to the gene tree density from species A, B and R, respectively. For species A, there are two migration events into A (with rates m 2 1 ), a coalescent does not occur over time segment s 2 and migration does not occur over segments s 1 , s 2 , s 3 or s 5 , during which the number of lineages is 1, 2, 1 and 1, respectively. Hence the term for species A. Note that the probability of no events, or the probability that none of multiple independent Poisson events with a total rate of λ occurs, over time t is e −λt . The contribution from species B is given similarly. In species R, a coalescent occurs after the waiting time s 6 = t − τ , so the rate is 2/θ R and the probability of no event is e −2(t−τ )/θ R .
Unlike equation (11) in which the gene tree means divergence time t, here G represents the full coalescent and migration history at the locus, such as the (backwards-in-time) transitions of sequence b from B into A at time s 1 and back to B at time s 1 + s 2 , and so on. If we sum over all possible histories that have divergence time t (one of which is that of Fig. 6(b)), the marginal density f(t) will be given by equation (11).
Equation (12) is easily generalizable to more species and sequences. For a general gene tree, one can break the time period from the present time to the root of the gene tree into time segments by the coalescent and migration events at the locus and by the speciation events. Then the probability density of the gene tree is simply given as the product of rates for the coalescent and migration events that occurred times the probability of no events over the whole time period.
This formulation is used in Bayesian implementations of the IM model such as IMA [88,89] and G-PHOCS [90]. The posterior is given by equation (9) except that the gene tree G j includes the migration history. G-PHOCS is an extension of an earlier version of BPP [8,16] and is computationally more efficient than IMA and can deal with a few thousand loci. The algorithm averages over the migration history at every locus and becomes inefficient at high migration rates, as there will be many migration events to average over. Note that the sequence likelihood depends on the gene tree and coalescent times but not migration events.

Multispecies coalescent with introgression
The introgression or multispecies coalescent with introgression (MSci) model assumes that gene flow occurs between species at fixed time points in the past (Fig. 7). There are two types of nodes on the species tree: speciation nodes and hybridization nodes. While a speciation node (if it is not the root) has one parent, a hybridization node has two parents, with their contributions to the hybrid species represented by probabilities ϕ and 1 − ϕ. When we trace the history of sequences backwards in time and meet a hybridization node, each sequence picks one of the two parents according to probabilities ϕ and 1 − ϕ. The parameters in the model include the introgression probabilities as well as the species divergence/introgression times (τ s) and population sizes (θ s), with = {τ , θ , ϕ}. The introgression probability ϕ, also written as γ , has been called (inappropriately) 'inheritance probability' or 'heritability'. Like the migration rate in the IM model, the introgression probability reflects the long-term effects of drift and selection on introgressed alleles. The MSci model has been referred to as the network multispecies coalescent [91,92] or multispecies network coalescent [93,94]. We avoid the term 'network' as it has been used to refer to a variety of processes, including gene tree reconstruction errors [95].
Four types of MSci model are implemented in BPP (Fig. 7) [81]. In model A, two species SH and TH merge to form a hybrid species HC. This scenario may be rare, but the model can be used to accommodate introgressions involving ghost or unsampled species (Fig. 8(a) and (b)). Model B assumes introgression from species RA to TC at time τ S = τ H . This is distinguishable using genetic data from the alternative model in which there is introgression from RB to SC (B 2 in Fig. 8(d)). Model C (Fig. 7(c)) is a case of hybrid speciation. Model D assumes that two species RA and RB came into contact at time τ X = τ Y and exchanged migrants.
The two parental branches are sometimes called the 'major hybrid edges' and 'minor hybrid edges', according as ϕ > 1 2 , and the binary species tree that remains after all minor hybrid branches are removed is called the 'major species tree' [95]. This characterization is useful if gene flow occurs in pulses as assumed by the MSci model, but may be misleading if gene flow is continuous. For example, continuous migration at a low rate per generation can drastically change the gene tree distribution so that, when the Natl Sci Rev, 2021, Vol. 8, nwab127 Figure 7. (a)-(d) MSci models A, B, C and D implemented in BPP [81], showing the parameters. In model A, two parental species SH and TH merge to form a hybrid species H at time τ H , but both parental species become extinct (see Fig. 8(a) and (b) for alternative interpretations). In model B, there is introgression from species RA to TC at time τ S = τ H . In model C, species RA and RB come into contact to form hybrid species HC at time τ S = τ H = τ T . Model D assumes bidirectional introgression between species RA and RB at time τ X = τ Y . Here the introgression probability (ϕ) is assigned to the horizontal (introgression) branch at each hybridization node, whereas in [81] it is sometimes assigned to the vertical branch.   Fig. 7(a), involving a ghost species X. In model A 2 , species SUX contributes migrants to species THC at time τ H and has since become extinct or unsampled in the data, while in model A 3 , TUX is the ghost species. Models A 1 (Fig. 7(a)), A 2 and A 3 are indistinguishable using genetic data. (c) and (d) Two versions of model B, which are identifiable using genetic data.
MSci model is fitted to the data, the major species tree may reflect gene flow, rather than species divergences [20,72,78]. Below we consider the probabilities of gene tree topologies under the MSci model. These can be used in the two-step methods to estimate the introgression probabilities or to infer the introgression model using reconstructed gene trees as input data, as in the PHYLONET/ML program [96].
The calculation is very similar to that under the simple MSC model (equation (4)). Consider model B (Fig. 9(a)), with three sequences at the locus (a, b, c) [78]. If sequences b and c coalesce in species T, the gene tree will be G 1 = (a, (b, c)), while if a and c coalesce in species S, the gene tree will be G 2 = (b, (c, a)). If neither event occurs, the two coalescent events for the three sequences will occur in species R and the three gene trees will occur with Figure 9. (a) MSci model B for three species (Fig. 7(b)) and (b) a gene tree for four sequences for illustrating the gene tree density under the MSci model. equal probabilities. Thus, G 3 = (c, (a, b)) must be the least probable gene tree. We have where φ S = e −2(τ R −τ S )/θ S and φ T = e −2(τ R −τ T )/θ T are the probabilities that two sequences entering species S or T do not coalesce in that species (cf. φ of equation (3)). Consider gene tree G 1 , which means that sequences b and c coalesce first. If sequence c enters S (which happens with probability ϕ), G 1 can occur only if sequences c and a do not coalesce in S. Hence the first term, ϕφ S · 1 3 . If sequence c enters H (which happens with probability 1 − ϕ), sequences b and c can coalesce in T or R. Hence the second term, . The gene tree probabilities (equations (13)) are functions of ϕ, φ S and φ T , while φ S and φ T are simple functions of the internal branch lengths in coalescent units on the species tree. We have P( and c are more likely to coalesce in T than are a and c to coalesce in S [78]. Next we consider the joint density of (G j , t j ), the gene tree with the complete history of coalescence and introgression events at locus j, including the parental path taken by each sequence at each hybridization node. This is used in full-likelihood implementations of the MSci model. This joint density is very similar to that under the MSC without gene flow (equation (7)), with the only modification that each time a sequence passes a hybridization node, there is a probability ϕ or 1 − ϕ depending on the parental path taken. Thus, for the gene tree of Fig. 9(b), The five pairs of brackets correspond to species C, S, H, T and R ( Fig. 9(b)). For species S (i.e. SR), sequence c 1 picks parental path S and coalesces with sequence a at time t 2 , so that the contribution to the gene tree density from S is ϕ(2/θ S )e −2(t 2 −τ S )/θ S . Introgression is counted as an event in the receiving population (rather than the source population) when we trace the lineages backwards in time and reach a hybridization node. Bayesian implementations of the introgression model can then proceed as before, with the joint posterior of the MSci model and parameters given by equation (9), except that S now represents the MSci model, the parameter vector includes the introgression probabilities (ϕs) as well as the divergence/introgression times (τ s) and population sizes (θ s), and the gene tree G j includes the introgression history at the locus. There are currently three Bayesian MCMC implementations of the MSci model: PHYLONET/MCMC-SEQ [93], *BEAST [94,97] and BPP [81] (Table 1). PHYLONET and *BEAST can allow changes to hybridization events in the MCMC and can infer the introgression model from the data. Those programs appear to reach their limits with <100 loci. BPP assumes that the MSci model is specified and fixed and the program estimates the parameters under the model. It has been applied to datasets of over 10 000 loci [29,81]. Also, BPP implements four different types of introgression model (Fig. 7), while only model A is available in PHYLONET and *BEAST. Binary species trees generated by taking different parental paths at hybridization nodes are called 'displayed species trees' [92] or 'parental species trees'. An interesting formulation of the MSci model specifies the distribution of the gene trees as a mixture over the displayed species trees, with the mixing probabilities given by the introgression probabilities at the hybridization nodes (Fig. 10); see, e.g. [98,99]. To simulate a gene tree, one would sample a displayed species tree first and then generate the gene tree according to the simple MSC model. This is in general incorrect as it forces all sequences at the locus to take the same parental path at each hybridization node, whereas correctly there should be a binomial sampling process when two or more sequences reach a hybridization node. In the model of Fig. 10, if sequences b and c reach species Y, it should be possible for one of them to take the left parent and the other the right parent. In the special case where each hybridization node on the species tree has at most one sequence from all its descendant populations, the formulation is correct and can be used to derive the probability distribution of gene trees. For example, equations (13) for the case of three species and three sequences ( Fig. 9(a)) can be derived this way. It is also interesting to note that, under the MSci model, the most probable gene tree may have a topology that is different from all of the displayed species trees [100].

Heuristic methods for inferring gene flow
A number of heuristic methods have been developed to test for the presence of gene flow and to estimate its strength. Here we mention a few briefly. The most popular method is the D-statistic or ABBA-BABA test [101]. This uses the species tree (((A, B), C), O) for three species A, B and C, with the outgroup species O, and is based on the counts of site patterns when one sequence or genome is available from each species [102]. There are three parsimonyinformative site patterns: AABB matches the species tree, while ABBA and BABA are the mismatching patterns, where A and B are any two distinct nucleotides. The probabilities for the two mismatching site patterns ABBA and BABA should be equal if there exists deep coalescence but no gene flow, but they are different if there is gene flow between the non-sister species (A and C or B and C) in addition to deep coalescence. Thus, gene flow can be tested by using the site-pattern frequencies to examine the deviation of from 0. The D-statistic has been extended to the case of five species, assuming a symmetric species tree in the so-called D FOIL test [103]. The site pattern frequencies can also be used to estimate the introgression probability, as in the program HYDE [104,105]. From one gets the estimatê This is based on the hybridization model with τ S = τ T and θ S = θ T (Fig. 7(c)). The estimate should be biased if this symmetry does not hold. A similar argument may be applied to gene tree topologies instead of site patterns (equation 13, Fig. 9(a)). The probabilities of the two mismatching gene trees ((b, c), a) and ((c, a), b) are equal if there exists deep coalescence but no gene flow, but different if there is in addition gene flow between the non-sister species (A and C or B and C). Thus, the observed frequencies of gene tree topologies can be used to estimate the introgression probability, as in the SNAQ method [95,106]. Assume that φ S = φ T = φ in equations (13), and let f 2 = P(G 2 ) = 1 3 φ + ϕ(1 − φ) and f 3 = P(G 3 ) = 1 3 φ be the probabilities of the two mismatching gene trees. Thenφ The D-statistic cannot be used to detect gene flow between sister species or to estimate the time of introgression. Such unidentifiability issues also exist in other methods that detect hybridization events using genome-wide averages, such as the average interspecies sequence divergence [107] or the joint allele frequency spectrum [108].

Unidentifiability, low information content and challenges of identifying the mode of gene flow
One area where more research is urgently needed is the identifiability of introgression models. If the probability distributions of the data are identical for two sets of parameter values ( and ), with f(X| ) = f(X| ) for essentially every dataset X, then is unidentifiable given data X. Several studies have examined identifiability issues of summary methods that use gene tree topologies as data [76,80,91,109], but little research has been done on full-likelihood methods.
Some cases of unidentifiability are easy to identify. If it is impossible for two or more sequences to be in one species when we trace the genealogical history of the sample backwards in time, the population size (θ ) for that species will be unidentifiable, since it takes two sequences to define a distance. For example, in the MSC model with no gene flow (Fig. 2), the population sizes for the extant species are unidentifiable if only one sequence is sampled from each species per locus, but this unidentifiability disappears when multiple sequences are available from each species. Furthermore, parameters or models that are unidentifiable using gene tree topologies alone may become identifiable when both gene trees and branch lengths (coalescent times) are used. In the case of three species, there are only three gene trees, so that use of gene tree topologies can identify only one (under the MSC model) or two (under the MSci model) parameters, whereas there are 7 (Fig. 2) and 13 ( Fig. 7(a)) parameters in those two models, respectively, which are all identifiable when information from both gene trees and coalescent times is used.
The identifiability of full-likelihood methods applied to data of multilocus sequence alignments, with multiple sequences per species, is the most interesting case, because full-likelihood methods are expected to be optimal from a statistical point of view and because multilocus alignments are Natl Sci Rev, 2021, Vol. 8, nwab127 the dominating data form in such analyses. Flouri et al. [81] conjectured that the MSci model is identifiable on multilocus sequence alignments as long as it is identifiable on data of gene trees with coalescent times. Given this, the problem of identifiability can be studied by considering the gene trees with coalescent times (G j and t j ) as the input data. It is noted that MSci model D (Fig. 7(d)) has an unidentifiability issue of the label-switching type [81] (Fig. 11). For every set of parameters, = (θ R , θ A , θ B , θ X , θ Y , τ R , τ X , ϕ X , ϕ Y ), there is a 'mirror' point , which has identical parameter values as except that θ X = θ Y , θ Y = θ X , ϕ X = 1 − ϕ X and ϕ Y = 1 − ϕ Y . Both and have exactly the same likelihood, f(X|S, ) = f(X|S, ), for all possible data X. This is a label-switching issue, and does not affect the utility of the model: one may apply a constraint such as ϕ X < 1 2 to remove the unidentifiability or apply more sophisticated post-processing of the MCMC sample if the 'twin towers' are not well separated [110]. The cases where the bidirectional introgression involves non-sister species or where there are multiple introgression events are yet to be studied.
Even if all parameters are identifiable, typical datasets may lack information for their reliable estimation. For example, typical datasets may be highly informative about species divergence times, but not about population sizes for ancestral species, especially if those species correspond to very short branches on the species tree [111]. In the case of three species both gene flow between non-sister species and population structure in the ancestral species can cause the asymmetry in the probabilities of the two mismatching gene trees [112], so that the two models are unidentifiable using gene tree topologies alone. In general, it may be hard to distin-guish the different models of gene flow, such as the complete isolation model (MSC with no gene flow), the migration (IM) model, the isolation-with-initialmigration (IIM) model [113] and the introgression (MSci) model. Simulation may be useful to evaluate the power to distinguish such models using genomic datasets.

CONCLUSION
The multispecies coalescent model provides a powerful framework for analysis of genomic sequences sampled from multiple species to extract the rich information about the evolutionary history of the species. Incorporating species phylogeny in population genetic models of population subdivision opens up opportunities for addressing many exciting questions in evolutionary biology, such as detecting gene flow during and after species formation and delineating species boundaries, as well as inferring demographic changes and estimating population sizes for extinct ancestral species. As discussed in [92], the basic MSC model accommodating species phylogeny and coalescent is in effect a null model, which can be extended to include other important biological processes, leading to models such as r Currently, large differences exist between fulllikelihood methods and heuristic methods. The former have higher statistical efficiency while the latter are orders-of-magnitude faster computationally. There is thus much room for improvement for both classes of methods. For the present, a pragmatic approach to analyzing large datasets may be to use summary methods to estimate the species tree and then full-likelihood methods to estimate the parameters.
Analysis of the simple three-species case [62] suggests that there is rich historical information both in gene tree branch lengths (which two-step methods such as ASTRAL, MP-EST and SNAQ ignore) and in the stochastic fluctuation of genealogical history across loci (which genome-averaging approaches such as SVDQUARTETS and D-statistic ignore). Heuristic methods that make use of both kinds of information may thus have much improved power. For Bayesian implementations of the MSC model, mixing inefficiency of the MCMC algorithm appears to be a far more serious problem than the increase in computational cost for each MCMC iteration [48]. Developing smart proposal algorithms that respect and accommodate the mutual constraints between the species tree and the gene trees is likely to bring dramatic improvement to the capacity of the full-likelihood methods. To empirical biologists, the MSC framework makes it possible to ask exciting evolutionary questions; to method developers, it offers rich opportunities for testing cutting-edge algorithms in computational statistics (in particular, trans-model MCMC algorithms). With the advancements of sequencing technologies and rapid accumulation of genomic sequence data as the driving force, the field will in all likelihood continue to be a research hotspot in the coming years.