Phase Resolution of Heterozygous Sites in Diploid Genomes is Important to Phylogenomic Analysis under the Multispecies Coalescent Model

Abstract Genome sequencing projects routinely generate haploid consensus sequences from diploid genomes, which are effectively chimeric sequences with the phase at heterozygous sites resolved at random. The impact of phasing errors on phylogenomic analyses under the multispecies coalescent (MSC) model is largely unknown. Here, we conduct a computer simulation to evaluate the performance of four phase-resolution strategies (the true phase resolution, the diploid analytical integration algorithm which averages over all phase resolutions, computational phase resolution using the program PHASE, and random resolution) on estimation of the species tree and evolutionary parameters in analysis of multilocus genomic data under the MSC model. We found that species tree estimation is robust to phasing errors when species divergences were much older than average coalescent times but may be affected by phasing errors when the species tree is shallow. Estimation of parameters under the MSC model with and without introgression is affected by phasing errors. In particular, random phase resolution causes serious overestimation of population sizes for modern species and biased estimation of cross-species introgression probability. In general, the impact of phasing errors is greater when the mutation rate is higher, the data include more samples per species, and the species tree is shallower with recent divergences. Use of phased sequences inferred by the PHASE program produced small biases in parameter estimates. We analyze two real data sets, one of East Asian brown frogs and another of Rocky Mountains chipmunks, to demonstrate that heterozygote phase-resolution strategies have similar impacts on practical data analyses. We suggest that genome sequencing projects should produce unphased diploid genotype sequences if fully phased data are too challenging to generate, and avoid haploid consensus sequences, which have heterozygous sites phased at random. In case the analytical integration algorithm is computationally unfeasible, computational phasing prior to population genomic analyses is an acceptable alternative. [BPP; introgression; multispecies coalescent; phase; species tree.]


INTRODUCTION
Next-generation sequencing technologies have revolutionized population genetics and phylogenetics by making it affordable to sequence whole genomes or large portions of the genome, even for nonmodel organisms. Many phylogenomic studies use the approach of reduced representation library to maximize their DNA sequencing efforts on a small subset of the genome. These strategies can generate thousands of genomic segments (called loci in this article irrespective of whether they are protein-coding) with high coverage, and target sequences can be assembled with confidence. Examples include restriction site-associated DNA sequencing (RADseq), which is used frequently to identify single nucleotide polymorphisms (SNPs) for population genetic and phylogeographic studies (Andrews et al. 2016;Leaché and Oaks 2017), although, it has also been applied to address phylogenetic questions at deeper timescales (Eaton et al. 2017). A more common approach for phylogenomic studies is targeted sequence capture, generating so-called reduced-representation data sets, with typically longer sequences for distantly related species than with RADseq data. Examples include exome sequencing, ultraconserved elements (UCEs, Faircloth et al. 2012), anchored hybrid enrichment (AHE, Lemmon et al. 2012), conserved nonexonic elements (CNEEs, Edwards et al. 2017), or rapidly evolving long exon capture (RELEC, Karin et al. 2020).
Typical sequencing technologies produce short fragments of sequenced DNA called "reads" that are either de novo assembled or mapped to a pre-existing reference genome. This leads to chromosomal positions being sequenced a variable number of times across the genome (usually referred to as the sequencing depth). A common practice in genome sequencing projects has been to produce the so-called "haploid consensus sequence" for a diploid individual, which uses the most common nucleotide at any heterozygous site to produce one genomic sequence. Assemblers like Velvet (Zerbino and Birney 2008), ABySS (Simpson et al. 2009), and Trinity (Grabherr et al. 2011), pick up only one of the two nucleotide bases at any heterozygous site and essentially resolve the phase of heterozygous sites at random, producing chimeric sequence that may not exist in nature. Suppose a diploid individual is heterozygous at two sites in a genomic region, so that the diploid genotype may be represented Y...R, with two heterozygous sites Y (for T/C) and R (for A/G) (Fig. 1). Suppose the reads are 14×T and 6×C at the first site, and 7×A, 10×G, and 1×T at the second (with the single T to be most likely a sequencing error). The haploid consensus sequence is constructed as T...G. In effect a heterozygote site with high quality scores for the two nucleotides is represented as one consensus nucleotide with a low quality score. Because it is largely pure chance which of the two nucleotides at a heterozygous site has the greater number of reads, this strategy is equivalent to resolving the phase at random and using only one of the constructed sequences. The resulting haploid consensus sequence may not be a real biological sequence and may not represent the biology of the diploid individual. Besides loss of information, a more serious problem is that the artifactual phased haploid sequence may be unusually divergent from other sequences in the sample, potentially introducing systematic biases in downstream inference. Currently, constructing true diploid de novo assemblies is expensive. A sequencing platform has been developed in combination with bioinformatic algorithms to determine the true diploid genome sequence but the strategy still involves high cost (Weisenfeld et al. 2017). If a read is long and fully covers a locus, multiple heterozygous sites in the same locus will be naturally phased. However, if the reads are short, and the two heterozygous sites do not occur in the same read, their genotypic phase resolution will become an issue.
How the heterozygote phase is resolved may have a significant impact on population genomic and phylogenomic inference using genomic sequence data. Phase information is well-known to be important for relating genotype to phenotype in human disease mapping (Tewhey et al. 2011). Similarly, Gronau et al. (2011) found that use of an analytical integration method (which averages over all possible phase resolutions) leads to nearly identical performance as the use of true phase resolutions for estimating population parameters, and that random phase resolution produced unreliable estimates. Andermann et al. (2019) developed a bioinformatics pipeline to recover allelic sequences from sequence capture data and found it to produce more accurate estimation of species divergence times under the MSC model (Rannala and Yang 2003) than other strategies such as use of consensus haploid sequences, random phasing, or ambiguity encoding. Overall little is known about the effects of heterozygote phase resolution on many inference problems using multilocus genomic sequence data under the MSC model, including species tree estimation, estimation of population sizes and species divergence times, and inference of cross-species introgression/hybridization.
We have implemented in BPP (Flouri et al. 2018) an analytical integration algorithm to handle unphased diploid sequences, developed by Gronau et al. (2011) in their G-PhoCS program, which is an orthogonal extension of an earlier version of BPP (Rannala and Yang 2003;Burgess and Yang 2008). Previously, Kuhner and Felsenstein (2000) implemented an Markov chain Monte Carlo (MCMC) algorithm to average over different phase resolutions in the likelihood calculation for estimating under the single-population coalescent. The algorithm was found to mix slowly even for small data sets. The analytical integration algorithm uses a data-augmentation strategy, in which the unknown fully resolved haploid sequences constitute the complete data or latent variables and enumerates and averages over all possible phase resolutions, weighting them according to their likelihoods based on the whole sequence alignment. For example, if a diploid sequence has two heterozygous sites, Y...R, the approach will average over both phased genotypic resolutions: (i) T...A and C...G versus (ii) T...G and C...A (Fig. 1). Note that there may be rich information about the phase resolution of any unphased sequence in an alignment of many sequences, either from the same species or from different but closely related species. Consider for example the phase 336 SYSTEMATIC BIOLOGY VOL. 71 resolutions for a human diploid sequence Y...R (Fig. 1). If we observe in the chimpanzee fully resolved sequences T...A and C...G (e.g., in an individual homozygous at both sites, with genotypes T/T...A/A) and never observe sequences T...G and C...A, then very likely the human diploid sequence has the haploid genotypes T...A and C...G, because we assume no recombination within each locus. Our implementation of the algorithm works with all four analyses under the MSC model in BPP (Yang 2015;Flouri et al. 2018;Flouri et al. 2020b), including species tree estimation (Yang and Rannala 2014;Rannala and Yang 2017) and species delimitation through Bayesian model selection (Yang and Rannala 2010;Yang and Rannala 2014;Leaché et al. 2019). We also implemented the algorithm under the multispecies-coalescent-withintrogression (MSci) model (Flouri et al. 2020a).
Here, we use computer simulation to evaluate different phase-resolution strategies in terms of their precision and accuracy in Bayesian species tree estimation under the MSC and in parameter estimation under both the MSC and MSci models. In addition to using the true phase resolution, which is generated during the simulation and is known with certainty, we also include analytical phase integration (Gronau et al. 2011;Flouri et al. 2018), phase resolution using the program PHASE (Stephens et al. 2001;Stephens and Donnelly 2003), and random resolution. The strategy of random resolution is largely equivalent to the common method of using haploid consensus sequences. The PHASE program was developed for population data from the same species, but is here applied to unphased sequences from both within and between species. We note that a number of computational phasing algorithms are available (Andres et al. 2007;Browning and Browning 2011), such as Haplotyper (Niu et al. 2002) and fastPHASE (Scheet and Stephens 2006). These are mostly developed to improve the computational efficiency and to handle long sequences (Choi et al. 2018), and are expected to produce similar results to PHASE in analysis of short sequences.

Simulation to Estimate Species Trees
We use the program MCcoal in BPP3.4 (Yang 2015) or the simulate switch of BPP4.3 (Flouri et al. 2020b) to simulate gene trees and multilocus sequence data using four fixed species trees for eight species (Fig. 2a,a ,b,b ). The trees have very short branches, mimicking challenging species trees generated during radiative speciation events. In the two deep trees, species divergences are much older than average coalescent times (/2). In the two shallow trees, species divergences are very recent relative to coalescent times, mimicking different populations of the same species. Note that in this study, we make no distinction between species and populations. The MSC model has two sets of parameters: the species divergence times (s) and the population size parameters (s). Both are measured by the expected number of mutations/substitutions per site. For each species/population, = 4N, where N is the effective population size and is the mutation rate per site per generation. We consider two mutation rates, with = 0.001 (low rate) or 0.01 (high rate), respectively, for all populations on the tree. The species divergence times (s) are given as multiples of . We consider 10, 20, 50, or 100 loci, with each locus having 500 sites. On average there should be 0.5 and 5 heterozygous sites between the two sequences of any individual at the low and high rates, respectively. We sample S = 2 or 4 haploid sequences (or 1 or 2 diploid individuals) per species at each locus. Gene trees with branch lengths (coalescent times) are generated independently among loci using the MSC density given the species tree and parameters (Rannala and Yang 2003). The JC model (Jukes and Cantor 1969) is then used to "evolve" the sequences along the gene tree to generate the sequence alignments at the tips of the tree. Analysis of this full data set by BPP is strategy "F." To simulate unphased diploid sequences, two sequences from the same species are combined into one diploid sequence, using the International Union of Pure and Applied Chemistry (IUPAC) ambiguity characters to represent heterozygous sites (for example, Y means a T/C heterozygote) (Fig. 1c). The data of unphased diploid sequences are analyzed using the diploid or phase option of the BPP program (strategy "D"), which analytically averages over all possible phase resolutions (Gronau et al. 2011). With strategy "P," we use the program PHASE (Stephens et al. 2001) to resolve the phase and then analyze the phased sequences using BPP (with 16 or 32 sequences in the alignment per locus for S = 2 and 4, respectively). Lastly, we use random phase resolution, referred to as strategy "R." The simulation program automatically generates the sequence alignments for strategies F, D, and R. For strategy P, we ran PHASE 2.1 (Stephens et al. 2001) to reconstruct the phased sequences for each locus, and used the PERL program SeqPhase (Flot 2010) to convert files.
The number of replicate data sets is 100. With four trees, two mutation rates ( = 0.001 or 0.01), two sampling configurations (S = 2 or 4), four numbers of loci (L = 10,20,50,100), we generated in total 4×2×2×4×100 = 6400 data sets, each of which is analyzed using the four strategies. The BPP program (Flouri et al. 2018) was used in the analysis. Inverse-gamma priors are assigned on parameters under the MSC model, with the shape parameter 3 so that the priors are diffuse and with the mean to be close to the true value. We use ∼ IG(3, 0.02) with mean 0.01 and 0 ∼ IG(3, 0.08) with mean 0.04 for the age of the root of the species tree for data simulated with the high rate ( = 0.01). For data of the low rate ( = 0.001), the priors are ∼ IG(3, 0.002) with mean 0.001 and 0 ∼ IG(3, 0.008) with mean 0.004. The prior means for 0 are close to the true values for the deep trees but are larger than the true values for the shallow trees, although the priors are diffuse. For species tree estimation, we integrate out FIGURE 2. (a and a ) Deep and shallow balanced species trees and b and b ) deep and shallow unbalanced species trees for eight species used for simulating data under the MSC model. c and c ) Deep and shallow species trees with introgression used to simulate data under the MSci model. The ages of internal nodes (s) are shown next to the nodes, with = 0.01 (high rate) or 0.001 (low rate). The blue indexes at internal nodes of the tree are used to identify the parameters associated with the ancestral species (e.g., 9 is the age of the root and 9 is the population size for the root population in a and b). s analytically through the use of the conjugate inversegamma priors. We conducted pilot runs to determine the chain lengths needed for convergence. The final settings for the MCMC are 20,000 iterations for burn-in, then taking 2×10 5 samples, sampling every two iterations.
Strategy P requires running the Bayesian MCMC program PHASE L times if there are L loci in the data set, to generate the fully resolved sequence alignments at the loci. This is somewhat expensive if there is a large number of loci and the mutation rate is high resulting in many heterozygous sites at each locus. After the data sets are generated, the BPP analysis of each data set by strategies F, P, and R involves about the same amount of computation. Strategy D is more expensive as the method averages over all possible phase resolutions, which may involve likelihood calculation for many site patterns, especially if there are many sequences per locus with many heterozygous sites.
For species tree estimation (A01 analysis in Yang 2015), we calculated the proportion (among the 100 replicates) with which each node on the true species tree is found in the maximum a posteriori (MAP) species tree in the BPP analysis. This is a measure of accuracy since the MAP tree is the best "point estimate" of the species tree (Rannala and Yang 1996). We examined the size and coverage probability of the 95% credibility set of species trees. The coverage probability is the proportion among the 100 replicate data sets in which the credibility set includes the true species tree. The size of the set indicates the precision or power of the method, but the method is considered reliable only if the coverage probability exceeds the nominal 95%.

Simulation to Estimate Parameters under the MSC Model
The same data simulated under the MSC model for species tree estimation are analyzed using the four phase-resolution strategies to estimate parameters in the MSC model (s and s), with the species tree fixed. This is the A00 analysis in Yang (2015). We calculated the posterior means and the 95% HPD CI intervals for each parameter and examine the relative root mean 338 SYSTEMATIC BIOLOGY VOL. 71 square error (rRMSE), using the posterior means as point estimates. This is defined as where ω is the true value of any parameter, andω i its estimate (posterior mean) in the ith replicate data set, with i = 1,...,R over the R = 100 replicates. For example, rRMSE = 0.1 means that the mean square error is 10% of the true value. The rRMSE is a combined measure of bias and variance.

Simulation to Estimate Parameters under the MSci Model
The MSci models for three species of Figure 2c and c are assumed to generate gene trees and sequence alignments using the simulate option of BPP4.3 (Flouri et al. 2020a). The three species have the phylogeny (A,(C,B)), but there was introgression from A to C at the time H = S , with the introgression probability ϕ = 0.1 and 0.3. Other settings are the same as above for the simulation under the MSC model. We consider two mutation rates (with = 0.001 and 0.01) and four datasizes (with L = 10,20,50, and 100 loci), with each locus having 500 sites. We sample either S = 2 or 4 sequences per species per locus. The JC model is used both to simulate and to analyze the data.

Analyses of Two Real Data Sets
We applied different phase-resolution strategies (D, P, and R) to analyze two previously published data sets, one of East Asian brown frogs (Zhou et al. 2012) and another of Rocky Mountains chipmunks (Sarver et al. 2021), to demonstrate that the effects discovered in the simulations apply to real data analysis. With real data, the option of true phase resolution (F) is unavailable, and the analytical phase resolution (D) is expected to perform the best. In addition, we include an approach of treating heterozygote sites in the alignment as ambiguity characters in the likelihood calculation, and refer to it as strategy "A" (for ambiguity). As heterozygotes (with, e.g., Y meaning both T and C) are not ambiguities (with Y meaning either T or C), this is a mistaken approach of handling the data, and has the obvious effect of underestimating the heterozygosity or parameters. It was thus not included in our simulation, but we use it in the real data analysis to illustrate its effects.
We reanalyzed a data set of five nuclear loci from the East Asia brown frogs in the Rana chensinensis species complex (Zhou et al. 2012) to infer the species tree (the A01 analysis) and to estimate the parameters under the MSC on the MAP tree (the A00 analysis). There are three morphologically recognized species or four populations: R. chensinensis (clades C and L), R. kukunoris (K) and R. huanrensis (H) (Fig. 3a). The data set was previously analyzed by Yang (2015), treating heterozygotes as ambiguities (strategy A). Each locus has 20-30 sequences, with sequence lengths to be 285-498 sites. We assign inverse-gamma priors on parameters: ∼ IG(3, 0.002) with mean 0.001 and 0 ∼ IG(3, 0.004) with mean 0.002 for the root age. We used a burnin of 8000 iterations, then taking 10 5 samples, sampling every two iterations. The same analysis was run at least twice to confirm consistency between runs. This is a small data set and the MCMC algorithm mixes well.
The second data set consist of nuclear loci from six species of Rocky Mountains chipmunks in the Tamias quadrivittatus group: Tamias canipes (C), Tamias cinereicollis (I), Tamias dorsalis (D), Tamias quadrivittatus (Q), Tamias rufus (R), and Tamias umbrinus (U) (Fig. 3b). Sarver et al. (2021) used a targeted sequence-capture approach to sequence 51 Rocky Mountains chipmunks from those six species. As a reference genome assembly was lacking, reads were assembled iteratively into contigs using an approach called "assembly by reduced complexity." A data set of 1060 nuclear loci was compiled for molecular phylogenomic and introgression analyses, including three individuals from an outgroup species, Tamias striatus. Each locus consists of 54 sequences, with the sequence length ranging from 14 to 1026 sites. Highquality heterozygotes, judged by mapping quality and read depth, are represented in the alignments using the IUPAC ambiguity codes. The filters applied by the authors suggest that the loci may be mostly coding exons or conserved parts of the genome. The majority of loci have ≤ 5 variable sites (including the outgroup). We used the first 500 loci in our analyses to infer the species tree and to estimate parameters under the MSC model. We assigned inverse-gamma priors on parameters: ∼ IG(3, 0.002) with mean 0.001 and 0 ∼ IG(3, 0.01) with mean 0.005 for the root age. In the A01 analysis (species tree estimation), we used a burnin of 16,000 iterations, then taking 2×10 5 samples, sampling every two iterations. The A00 analysis (parameter estimation on the MAP tree) used the same settings except that only 10 5 samples were collected. The same analysis was run at least twice to confirm consistency between runs.

Species Tree Estimation under the MSC Model
Bayesian analysis of each replicate data set using each of the four strategies produced a sample from the posterior distribution of the species trees, which we summarized to identify the maximum a posteriori probability (MAP) tree, and construct the 95% credibility set of species trees. The proportion, among the 100 replicates, with which the clades represented by those short branches were recovered 339 FIGURE 3. Inferred species trees (a) for East Asian brown frogs and (b) for Rocky Mountains chipmunks. Branch lengths reflect the posterior means of divergence times, with branch bars representing the 95% HPD intervals, obtained under the MSC using the analytical phase integration algorithm (strategy D). Estimates of other parameters are in Table 6. in the MAP tree are shown in Table 1 and Supplementary Tables S1-S3 available on Dryad at https://doi.org/10.5061/dryad.vmcvdncrd. Other clades on the trees, represented by longer branches, were recovered with probability near 100%, even for the low mutation rate and 10 loci. We also plotted the posterior probabilities for the true tree for the different phasing strategies in Figure 4, Supplementary Figures S1-S3 available on Dryad. Strategy F, the analysis of the fully resolved haploid data, is expected to have the best performance and is thus the gold standard, against which the other strategies are compared.
In data simulated using the two deep trees (Deep B and Deep U) (Fig. 2a,b), the four phase-resolution strategies produced similar probabilities for recovering the true clades, with the differences among methods not being larger than the random sampling errors due to the limited number of replicates (Supplementary Tables S1 and S3 available on Dryad). The different strategies most often produced the same MAP tree, although the posterior probability attached to the MAP tree varies somewhat among methods, but the differences are comparable to MCMC sampling errors. This can be seen in Supplementary Figures S1 and S3 available on Dryad, where the posterior for the true tree is plotted. Even random resolution (R) produced very similar results to the use of the fully resolved data (F). Note that in data simulated at the high rate, there are very likely to be two or more heterozygote sites in the diploid genotype of each individual at any locus, and the switch error rate for random phase resolution, which is the average proportion of heterozygous sites misassigned relative to the previous heterozygous site (Stephens and Donnelly 2003;Andres et al. 2007), is 50%. Even the PHASE program generates substantial errors of phase resolution at the high mutation rate (Table 2). Species tree estimation is thus robust to considerable phasing errors when species divergences are much older than average coalescent times.
For the two shallow trees (Fig. 2a ,b ), large differences were found among the four strategies (Table 1 and  Supplementary Table S2  (MSC A01, shallow, S = 4) Probabilities of recovering true clades and the size and coverage of the 95% credibility set of species trees when the true species tree is Shallow B and Shallow U (Fig. 2a ,b )  Note: The two mutation rates are low ( = 0.001) and high ( = 0.01), while 10L, 20L, 50L, 100L are the number of loci. C 10 , C 12 , etc. are probabilities of recovering the true clades on the species trees, while "tree" is the probability of recovering the whole tree. "CI size" is the number of species trees in the 95% credibility set and and "CI cover" is the probability that the set contains the true species tree. Results for other simulation settings are in Supplementary Tables S1-S3 available on Dryad. larger CI sets, with lower coverage than strategies F and D. Thus phasing errors have different effects on species tree estimation depending on whether the species tree is deep or shallow. We suggest that this may be explained by the probability that the sequences from the same species coalesce before they reach the time of species divergence, when one traces the genealogical history at each locus backwards in time. For example, the probability that S = 2 sequences from species A coalesce before reaching the common ancestor of A and B is P{t mrca < AB } = 1−e −4 ≈ 0.982 in the two deep trees and 1−e −0.4 ≈ 0.330 in the two shallow trees, while the corresponding probabilities for S = 4 sequences are 0.967 and 0.077 for the deep and shallow trees, respectively (Supplementary Fig. S4 available on Dryad). In the deep trees, there is a high chance for all sequences from the same species to coalesce before reaching species divergence, and then the problem will be similar to using the ancestral sequence for each species (which is mostly determined by the most common nucleotides at the individual sites; Yang et al. 1995) for species tree estimation, a process that is not expected to be sensitive to phasing errors. In the shallow species trees, there are high chances that sequences from the same species may not have coalesced before reaching the time of species divergence, and sequences with phasing errors will enter ancestral populations, interfering with species tree estimation. Posterior probability for the true species tree for phase-resolution strategies D (diploid), P (PHASE), and R (random) plotted against the probability for strategy F (full data). The data are simulated under the MSC models with species trees Shallow B and Shallow U (Fig. 2a ,b ), with S = 4 sequences sampled per species. Each plot has 100 scatter points, for the 100 replicate data sets, with the x-axis to be the posterior probability for strategy F while the y-axis is for strategies D, P, or R. "Low" ( = 0.001) and "high" ( = 0.01) refer to the mutation rate, and L (= 10, 20, 50, 100) is the number of loci. Results for other simulation settings are in Supplementary Figures S1-S3 available on Dryad. Note: Data of L = 100 loci are used in the calculation although the error rate does not depend on the number of loci. The same data generated under the MSC model are used in the A01 (species tree estimation) and A00 (parameter estimation) analyses. Note that the error rate for random phase resolution (R) is expected to be 0.5.
While our main objective in this study is to evaluate the impacts of different phasing strategies, it is worth noting the effects of other major factors on species tree estimation that are obvious from our results (Fig. 4,  Supplementary Figs. S1-S3 available on Dryad and Table 1, Supplementary Tables S1-S3 available on Dryad). By design species tree B is harder to recover than tree U because tree B has four short branches (for clades C 10 , C 12 , C 13 , and C 15 ) while tree U has only three (for clades C 10 , C 11 , and C 15 ) (Fig. 2). Thus tree B is recovered with much lower probability than tree U by all methods in all parameter settings. We note that the individual clades in tree B are recovered with lower probabilities than those in tree U (Table 1 and  Supplementary Tables S1-S3 available on Dryad). We speculate that this may be due to the fact that the four short branches in tree B are close together (so that 945 trees around them are nearly equally good) while the three short branches in tree U are far apart (so that only 3×15 = 45 trees around them are nearly equally good). Because of the symmetry in tree B, the probabilities of recovering clades C 10 and C 13 should be equal, as are those for C 12 and C 15 . Differences within each pair reflect the random sampling errors due to our use of only 100 replicates. (Note that clades C 11 and C 14 were always recovered in the simulation.) The mutation rate had a dramatic impact on the precision and accuracy of species tree estimation. At the higher rate (with = 0.01 vs. 0.001), the credibility set was smaller, its coverage was higher, and the MAP tree matched the true species tree with higher probability. In our species trees, species divergence times () are proportional to . This allows us to compare the two values of , mimicking the use of conserved or variable regions of the genome for species tree estimation. Our study focuses on closely related species with highly 342 SYSTEMATIC BIOLOGY VOL. 71 similar sequences, and data simulated at the high rate contain more variable sites and more phylogenetic information.
The number of loci similarly had a huge impact on species tree estimation. With more loci, inference became more precise (with smaller credibility set) and more accurate (with the MAP tree matching the true tree with greater probability). Increasing the number of loci by 10 fold improves performance for all strategies more than increasing the mutation rate by the same factor.
The number of sequences sampled per species had consistent but relatively small effects on species tree estimation. Changing S = 2 to 4 improved the probabilities of recovering the true clades in the true species tree, reduced the CI set size, and improved the coverage of the CI set, but the improvements are in general small.
It is noteworthy that the coverage of the 95% CI set was below the nominal 95% in small or uninformative data sets while above 95% in large and informative data sets. In the case of 10 loci at the low rate for tree Deep B, coverage was even below 50% (Supplementary Table S1 available on Dryad). Even though the set included nearly 500 trees, more than a half of the CI sets failed to include the true tree. In contrast, at the high mutation rate and with 50 or 100 loci, CI coverage was often 100%. The method is overconfident in small and uninformative data sets and conservative in large and informative ones. The same pattern was noted in a previous simulation examining the information content in phylogenomic data sets , Table 3). Note that in our simulation, the replicate data sets are generated under a fixed model (species tree) and fixed parameter values, so that we are evaluating the Frequentist properties of Bayesian model selection, and a match is not expected (Huelsenbeck and Rannala 2004;Yang and Rannala 2005). Yet the large discrepancies are striking.

Estimation of Divergence Times and Population Sizes under the MSC Model
The Impact of the Phasing Strategies.-The same data sets simulated for species tree estimation were analyzed to estimate the parameters in the MSC model (s and s) with the species tree fixed (Fig. 2a,a ,b,b ). The posterior means and 95% HPD CI for the 100 replicates are plotted in Figure 5, and Supplementary Figures S5-S11 available on Dryad, while the relative root mean square errors (rRMSE) are presented in Supplementary Tables S4-S11 available on Dryad. Whereas the rRMSE reflects both biases and variances in parameter estimation, the data sets generated by the four phase-resolution strategies have about the same size in terms of the number of loci, the number of sequences per locus, and the number of sites per sequence, so that the sampling errors or variances are similar among methods and the differences in rRMSE mainly reflect differences in bias. Furthermore, we may use the symmetry of species tree B to gauge the magnitude of random sampling errors due to our use of 100 replicates: for instance, rRMSE should be equal for A , B , E , and F , and for 10 and 13 , on the balanced trees.
The four phase-resolution strategies (F, D, P, and R) performed similarly for the Deep trees at the lower rate and when only S = 2 sequences (or one individual) are sampled per species. We note that with S = 2 and at the low mutation rate (with heterozygosity at = 0.001), there will be on average 0.5 heterozygous sites at the same locus, and the probability of having two or more heterozygous sites is 1−0.999 500 − 500×0.999 499 × 0.001 = 0.0901. Then phase resolution will not be a serious issue, and all four strategies examined in the study will be nearly equivalent.
At the high mutation rate ( = 0.01) for the Shallow trees, differences were noted among the strategies even for S = 2 sequences ( Supplementary Fig. S6 available on Dryad and Supplementary Tables S5 and S7 available on Dryad). The PHASE program produced underestimates for the youngest species divergence times ( 11 and 14 on Shallow B and 15 on Shallow U) (Fig. 2a ,b ). The biases became more pronounced when S = 4 sequences per species are in the sample (Fig. 5

and Supplementary
Tables S9-S11 available on Dryad). At the high rate, there are on average five heterozygotes per locus in the individual and the probability of having two or more heterozygotes at the locus is 96%. Two factors may be responsible for the bias. First the PHASE program may have inferred heterozygote phase incorrectly (indeed the error rate is comparable to that of random phasing with S = 2). Second PHASE is an MCMC program generating a posterior distribution of different phase resolutions but we used only the optimal resolution. As the optimal resolution may involve the least amount of divergence, this approach is expected to lead to underestimation of sequence divergences or of the and parameters. At the high rate and for shallow trees, random phasing (R) also created serious biases, but the biases are in the opposite direction. Random phasing overestimated the youngest species divergence times ( 11 and 14 on Shallow B and 15 on Shallow U), and overestimated for all modern species. The overestimation of modern is most striking, and occurred for both deep and shallow species trees at the high rate and is more dramatic with more sequences (S = 4 rather than 2) or more loci.
We examined the number of distinct site patterns in the alignment at each locus for the high-rate data ( Supplementary Fig. S12 available on Dryad). Site patterns are compressed for the JC model, so that one site pattern is constant while the others are variable (Yang 2006, p. 144), and the number is thus an indication of the level of sequence divergence. At almost every locus, the PHASE program (P) produced alignments with fewer distinct site patterns than the true phase resolution (e.g., with the mean to be 36.07 compared with the true value 38.51 on tree B), apparently because we used the optimal phase resolution inferred by the program and ignored the less likely ones. Random resolution produced about the same number of site patterns as the true number (average 38.36 vs. 38.51 for tree Deep B). The number of site patterns is thus not the reason for the poor performance of random phasing.
Note that calculation of the heterozygosity for each diploid individual, which is simply the proportion of heterozygous sites in the two sequences at the locus, does not rely on phase resolution. If we calculate the heterozygosity for each diploid individual and then average over individuals of the same species, we will get a reasonably good estimate of for that species. However, in the gene-tree based analysis conducted in BPP, each randomly phased haploid sequence is compared not only with the other sequence from the same individual but also with sequences from other individuals through the use of a gene tree relating all phased haploid sequences at the locus. While the true haploid sequences may all be closely related, random phase resolution may generate chimeric sequences that are very different from naturally occurring fully resolved sequences, inflating apparent coalescent times and genetic diversity in the population. This effect is expected to be more serious when more individuals are included in the sample.
Estimation of for a single species.-To explore this interpretation, we conducted a small simulation sampling independent loci from a single species to estimate the only parameter (Fig. 6, Table 3). With S = 2 sequences per locus (one diploid individual), the four phase-resolution strategies are equivalent. However, with the increase of S, the strategy of random phase resolution becomes increasingly biased. Previously Felsenstein (1992) examined the efficiency of two summary methods based on the number of segregating (variable) sites ( S ; Watterson 1975) and the average pairwise distance ( ; Tajima 1983), relative to the maximum likelihood (ML) method based on gene genealogies. He found that the summary methods ( S and ) were much less efficient than the ML estimate, with orders-of-magnitude differences in the variance in large samples (Felsenstein 1992, Tables 1 and 2), indicating that there is much information about in the genealogical histories. The ML method should be very similar to BPP here as both are full likelihood methods. Here, we note that the number of segregating sites does not depend on phase resolutions, and similarly the average proportion of different sites, averaged over all the S(S−1)/2 pairwise comparisons, depends on the site configurations at each variable site (such as 10 Ts and 4 Cs) but not on the genotypic phase between different heterozygous sites. Both Waterson's estimator and the average pairwise distance are thus unaffected by phasing errors. It is also noteworthy that those two simple methods are not affected by recombination within the locus, while coalescent-based methods are (Felsenstein 2019). While it is not unexpected that a full likelihood method may be more sensitive to certain errors in the model or in the data than heuristic methods, in this case it is striking that the systematic bias is so large (with estimates to be several times larger than the true value) when the coalescent-based method is applied to randomly phased sequences. Felsenstein's (1992) analysis, as mentioned above, assumed knowledge of the true gene trees and coalescent times (or equivalently infinitely long sequences at each locus). Here, BPP is applied to sequence alignments and accommodates uncertainties in the genealogical trees. The different methods then have much more similar performance (Table 3, S , , and BPP strategy F), suggesting that the uncertainties in the genealogical trees due to mutational variations in the sequences have eroded much of the information in the gene trees. The summary methods (in particular, ) have larger variances than the BPP estimates, especially in large samples of S = 32 sequences, but the differences are relatively small. We also note that analytical phase integration (D) produced variances that are nearly identical to those for the use of the full data (F).

Impacts of Other Factors on Parameter Estimation
Under the MSC Model.-We note that different parameters are estimated with very different precision and accuracy, reflecting the different amount of information in the data. Population size parameters (s) for modern species are well estimated, as well as 9 for the root population, but s for other ancestral species, especially those represented by very short branches (e.g., 10 , 13 , 12 , 15 in tree B) have large errors (Fig. 5, Supplementary  Figs. S1-S3 available on Dryad). Species divergence times are all well estimated, with rRMSE to be even much smaller than those for population size parameters for modern species (Supplementary Tables S4-S11 available  on Dryad).
Both the mutation rate and the number of loci had a major impact on the estimation of the parameters. For all phasing strategies increasing the number of loci by 10-fold improves performance more than increasing the mutation rate by the same factor (Fig. 5, Supplementary Figs. S1-S3, Tables S4-S11 available on Dryad).

Estimation of Introgression Probability under the MSci
Model We used the MSci models of Figure 2c,c to simulate sequence data and used BPP to analyze them to estimate parameters in the MSci model. We are in particular interested in whether the different strategies of heterozygote phase resolution may lead to biases in the estimation of the timing ( H ) and strength of the introgression (ϕ). The results are summarized in Figure 7 and Supplementary Figures S13-S19 available on Dryad and Table 4 and Supplementary Tables S12-S18 available on Dryad.
As before, the diploid strategy (D) produced results almost indistinguishable from the use of the full data (F) in all parameter settings. The performance of the PHASE program (P) and random phasing (R) depends on the mutation rate and, to an lesser extent, on the number of sequences per species S. At the low rate, and in particular with only S = 2 sequences per species, all four strategies have similar performance, but large differences were found at the high mutation rate. Strategy R overestimates the modern and the species divergence times () at the high rate, with the bias being more serious for S = 4 sequences than for S = 2. This is the same behavior as discussed earlier in the simulation under the MSC model. Strategy R also tends to overestimate ϕ, but the bias is small. Strategy P had the opposite bias and produced underestimates of modern and species divergence times when the mutation rate is high, with smaller biases than for strategy R. Strategy P also underestimates the introgression probability (ϕ).
An interesting question is whether each method detects introgression. We calculated the proportion of replicates in which the lower limit of the 95% HPD CI for ϕ exceeds a small value, set somewhat arbitrarily at 0.001. If the CI excludes the small value, we may take it as evidence that ϕ = 0 is ruled out so that there is significant evidence for introgression. By this measure of power of the Bayesian "test," strategies D and P had nearly identical power as the use of the full data (F), while random resolution (R) had reduced power at the high mutation rate (Table 5, Supplementary Table S19 available on Dryad). Overall, power was very high even with only 10-20 loci and at the low mutation rate. Having more sequences is noted to boost the power of the test for all phase-resolution strategies.

Running Time for Different Analyses
The running time for the A01 analysis under the MSC model (species tree estimation) for the four phasing strategies (F, D, P, and R), averaged over the 100 replicates, is plotted against the number of loci in Supplementary Figure S20 available on Dryad. Running time increases nearly linearly with the number of loci, with the slope being steeper when S = 4 sequences are sampled per species than for S = 2. The diploid integration algorithm (D) has the longest running time. Note that the number of parameters in the MSC model, the number of loci, the number of sequences etc. are identical for the four strategies, so that their computational load is proportional to the number of site patterns. As strategy D enumerates all possible phase resolutions (including the true resolution), which may result in many distinct site patterns, it is more expensive than the other methods. The running time for each BPP analysis on a single core ranged from ∼20 min for 10 loci to ∼5 h for strategy D with data of 100 loci. Strategy P involves running the Bayesian MCMC program PHASE for each of the L loci. At the low mutation rate with very few heterozygous sites per locus, this requires minimal computation ( Supplementary Fig. S21 available on Dryad), but at the high rate and with S = 4 sequences per species, the running time can be comparable with running the subsequent BPP analyses.
The running time for the A00 analysis (parameter estimation) under the MSC and MSci models is shown in Supplementary Figures S22-S25 available on Dryad. The A00 analysis under the MSC involves less computation than the A01 analysis as there is no MCMC moves to change the species tree. Overall, the same patterns are observed as discussed above for the A01 analysis. Note that the computer cluster used in this work consist of computers with different processors, so there may be random fluctuations in running time due to the different jobs being assigned to different processors. For example the differences in Supplementary Figures S21 and S23 available on Dryad reflect this random fluctuation as the data were the same.
Overall, the data sets generated in the simulation of this study are relatively small, and the running time for the best method (strategy "D") is comparable with that for strategy "P," in particular if one considers the preprocessing overhead of computational phasing.

Analysis of Two Real Data Sets
We analyzed two real data sets using four different phase-resolution strategies: D (diploid), P (PHASE), R (random), and A (ambiguity). With real data, the option of true phase resolution (F) is unavailable, and the analytical phase resolution (D) is expected to have the best performance, against which we compare the other strategies.
East Asia brown frogs.-We reanalyzed a data set of five nuclear loci from the East Asia brown frogs in the Rana chensinensis species complex (Zhou et al. 2012) (Fig. 3a). This data set was previously analyzed by Yang (2015) using strategy A. The number of site patterns at each locus is 18-26 for strategy A, and 22-102 for strategy D. Running time using one thread on our server was 3 min for A, 7-8 min for P and R, and 12 min for D. In the A01 analysis (species tree estimation), the four strategies (D, P, R, and A) produced the same MAP tree (Fig. 3a): (((H, L), C), K), with the posterior to be 0.29 for D, 0.36 for P, 0.35 for R, and 0.21 for A. The analysis of Yang (2015) produced a different MAP tree, ((H, L), (C, K)). The difference is due to the use of different priors: Yang (2015) used BPP3.1, with gamma priors on the parameters (s for all populations and for the root), whereas here inverse gamma priors are used in BPP4.3. Note that the species trees have low support in both analyses.
In the A00 analysis (parameter estimation under MSC with the MAP species tree fixed), the posterior means and 95% HPD intervals are shown in Table 6a. Strategy P (PHASE) produced similar results to strategy D. Strategy R (random) produced overestimates of s for modern species, while strategy A (ambiguity) produced serious underestimates of s for modern species and divergence times. The results are consistent with our findings from the simulation.

Rocky Mountains
Chipmunks.-In the A01 analysis (species tree inference) of the 500 nuclear loci for Rocky Mountains chipmunks, strategies D, P, and R produced the same MAP tree, shown in Figure 3b, with the posterior for every node ∼1.0. This is also the species tree inferred by Sarver et al. (2021) using summary methods, although the authors obtained lower support values even with all 1060 loci used. The difference may be due to the higher power of the BPP analysis, which uses the full data of multilocus sequence alignments rather than data summaries (e.g., Shi and Yang 2018;Kim and Degnan 2020;Zhu and Yang 2021). Strategy A (ambiguity) produced a different MAP species tree from the other strategies (Fig. 3b) (D, (IQR))) instead of (D, (C, (IQR))), with the posterior at 0.94. The running time for the A01 analysis, using eight cores on a server with Intel Xeon Gold 6154 3.0GHz processors, was 9 hours for strategy A, and 16-17 h for strategies D, P, and R, with strategy D having slightly longer running time. The number of site patterns at the 1060 loci for strategy D is shown in Supplementary  Fig. S26 available on Dryad. Strategy P also needed the additional time for running the PHASE program, which was 33 min to phase all 1060 loci using one thread on the server.
In the A00 analysis (parameter estimation), strategy P (PHASE) produced nearly identical results to strategy D (diploid) ( Table 6). Compared with strategy D, strategy R (random) produced overestimates of s for modern species, while divergence times for recent nodes were also overestimated very slightly. Strategy A (ambiguity) produced serious underestimates of s for modern species, with divergence times, especially of recent nodes, to be underestimated as well. Those results mimic our findings about the relative performance of the different strategies in the simulated data. Running time for the A00 analysis was 2.5 h for strategy A, and 5-6 h for strategies D, P, and R. Note that in the A00 analysis the chain is only half as long as in the A01 analysis.

The Impact of Phasing Errors Depends on the Inference Problem
We have used simulation to examine the performance of four different strategies for handling heterozygote phase in genomic sequence data: F (full phased data), D (diploid analytical phase integration), P (PHASE), and R (random). Inference problems examined have included species tree estimation under the MSC model and parameter estimation under the MSC and MSci models. We found that the different strategies, including random phase resolution (or equivalently the use of haploid consensus sequences), did not affect species tree estimation when the species divergences are much older than the coalescent times. The different phasing strategies may be expected to have even less impact on inference of deep phylogenies, where within-species polymorphism is much lower than between-species divergence. However, species tree estimation is affected by phasing errors if the species tree is shallow and between-species divergence is similar to within-species polymorphism, if the mutation rate is high so that there are many heterozygote sites in the sequence, and if many sequences are sampled from each species. Phasing errors are clearly important when genomic data are used to infer the divergence history of populations of the same species. Previously, Kates et al. (2018) used several summary methods (including astral and concatenation/maximum likelihood) to analyze a data set of breadfruit, jackfruit, and relatives (333 loci from 23 species of the genus Artocarpus) to infer the species phylogeny, evaluating several phase-resolution strategies such as haploid consensus sequences, ambiguity encoding, and a bioinformatics procedure to incorporate read information to phase alleles. The authors found that heterozygote phase resolution had minimal impact on phylogeny reconstruction. The results are consistent with our simulation, which also found that phasing errors do not influence estimation of deep species phylogenies. However, our simulation also suggests that the authors' conclusion based on the deep species tree of Artocarpus, with an estimated age of 40 Myr for the crown age of the ingroup, may not apply to inference of shallow phylogenies, in which species divergence times are similar to average coalescent times.
We found that estimation of parameters in the MSC and MSci models is more sensitive to phasing errors than is species tree estimation. In particular, population sizes for modern species are seriously overestimated under the MSC and MSci models when random phasing or haploid consensus sequences are used. Our analysis of the simple case of estimating under the singlepopulation coalescent suggests that the bias is caused mainly by the unusual sequences generated by random phase resolution ( Fig. 6 and Table 3). Estimates of the introgression probability and introgression time under the MSci model may also be biased by errors in random phasing. The biases are more serious when the mutation rate is high so that there are multiple heterozygote sites at each locus and when multiple sequences are sampled per species. Those results are consistent with Gronau et al. (2011), who also found that random phase resolution affected parameter estimation in their analysis of genomic sequence data from different human populations.

Limitations of our Simulation and Implications to Practical
Data Analysis Here, we note a few limitations of our study. First, we have examined only one inference method, the Bayesian method implemented in the BPP program. Our results may be expected to apply to other full likelihood implementations such as StarBeast (Ogilvie et al. 2017;Zhang et al. 2018). or PhyloNet-Seq (Wen and Nakhleh 2018), but may not apply to summary methods. Similarly, we considered only a few inference problems under the MSC and MSci models using genomic sequence data. We have not examined the impact of phasing errors on inference of population demographic changes or on inference of migration/introgression histories (our simulation under the MSci model assumed a fixed introgression event).
Given those caveats, we discuss the implications of our simulation results to practical data analysis. First, our simulation as well as those of Gronau et al. (2011) and Andermann et al. (2019) suggest that random phase resolution or the use of haploid consensus sequences should be avoided. Strategy R never performed better than computational phasing (strategy P) in our simulations. Similarly strategy A (ambiguity) should not be recommended. Virtually all phylogenetic likelihood programs accommodate ambiguities in a sequence alignment representing undetermined nucleotides using a data-augmentation algorithm in the likelihood calculation (Felsenstein 2004, pp. 255-6;Yang 2014, pp. 110-112). As this approach misinterprets heterozygotes as ambiguities, it leads to underestimation of polymorphism or for the modern species. Bias may also be introduced into estimates of other parameters, such as underestimation of divergence times (Andermann et al. 2019). The approach also underestimates the information content in the data, as it in effect treats two sequences (although unphased) as only one. This mistake in the treatment of the data was made by 350 SYSTEMATIC BIOLOGY VOL. 71 Rannala and Yang (2003) in the analysis of three human noncoding loci of Zhao et al. (2000), Yu et al. (2001), and Makova et al. (2001), and by Yang (2015) in the analysis of the five nuclear loci from East Asian brown frogs (Zhou et al. 2012). The mistake is easy to see from the occurrence of the same ambiguity character (such as Y) in multiple sequences at the same site in the alignment.
Strategy D (diploid analytical integration) produced results that are extremely similar to the use of the full data (F) in all simulation settings of this study (see also Gronau et al. 2011). As the algorithm averages over all possible phase resolutions and constitutes a full likelihood approach to handling missing data, it is the optimal statistical approach when the data consist of unphased diploid sequences, and may thus be recommended in general, even for inference problems that are not examined in our simulation study. As a statistical inference method, strategy D is equivalent to the approach of sampling phase resolutions in a Markov chain Monte Carlo (MCMC) algorithm (Kuhner and Felsenstein 2000), but the approach of analytical phase integration appears to have a computational advantage.
The computational effort required by strategy F is proportional to the number of distinct site patterns at each locus that may result from enumerating all possible phase resolutions. For the data sets simulated in this article as well as the two real data sets analyzed here, the computational load for strategy D is comparable to (if not better than) that for strategy P, in particular when one considers the preprocessing of computational phasing required by strategy P. In such cases, strategy D should be recommended. However, when there are many long sequences of high heterozygosity at a locus, enumeration of all phase resolutions may lead to a huge number of site patterns. For example, the three noncoding regions of human DNA analyzed by Rannala and Yang (2003) have about 60 sequences per locus, with ∼ 10 4 sites. The number of site patterns is 50-73 in the unphased alignments (strategy A), but reaches 1.2-4.4 million for strategy D, rendering the analysis unfeasible. Note that those loci are long genomic segments, which may be affected by recombination, whereas data sets suitable for analysis under the MSC typically involve much shorter genomic segments (e.g., Burgess and Yang 2008).
We suggest that computational phasing (strategy P) should be an acceptable alternative when strategy D is computationally unfeasible. In our analyses of the simulated and real data sets, strategy P produced similar results to the use of full data (F) or the analytical phase integration approach (D), with very small biases. Note that the Bayesian program PHASE assumes a population genetics model and is designed for sequence or allelic data from the same species. However, our use of it to analyze sequence data from multiple species produced relatively small biases in parameter estimation in both simulated data and in the two real data sets, much better than random phase resolution or haploid consensus sequences. We also note that phasing based on reads combined with bioinformatic analysis shows great promise (Andermann et al. 2019). In particular, exciting developments in sequencing technology to provide longer reads, combined with computational algorithms (Porubsky et al. 2020;Zhou et al. 2020;Cheng et al. 2021), may soon make it practical to produce routinely fully phased diploid genomes.