Evaluating the Use of ABBA–BABA Statistics to Locate Introgressed Loci

Several methods have been proposed to test for introgression across genomes. One method tests for a genome-wide excess of shared derived alleles between taxa using Patterson’s D statistic, but does not establish which loci show such an excess or whether the excess is due to introgression or ancestral population structure. Several recent studies have extended the use of D by applying the statistic to small genomic regions, rather than genome-wide. Here, we use simulations and whole-genome data from Heliconius butterflies to investigate the behavior of D in small genomic regions. We find that D is unreliable in this situation as it gives inflated values when effective population size is low, causing D outliers to cluster in genomic regions of reduced diversity. As an alternative, we propose a related statistic f^d, a modified version of a statistic originally developed to estimate the genome-wide fraction of admixture. f^d is not subject to the same biases as D, and is better at identifying introgressed loci. Finally, we show that both D and f^d outliers tend to cluster in regions of low absolute divergence (dXY), which can confound a recently proposed test for differentiating introgression from shared ancestral variation at individual loci.


Introduction
Hybridization and gene flow between taxa play a major role in evolution, acting as a force against divergence, and as a potential source of adaptive novelty (Abbott et al. 2013). Although identifying gene flow between species has been a long-standing problem in population genetics, the issue has received considerable recent attention with the analysis of shared ancestry between humans and Neanderthals (e.g., Yang et al. 2012;Wall et al. 2013;Sankararaman et al. 2014). With genomic data sets becoming available in a wide variety of other taxonomic groups, there is a need for reliable, computationally tractable methods that identify, quantify, and date gene flow between species in large data sets.
A sensitive and widely used approach to test for gene flow is to fit coalescent models using maximum-likelihood or Bayesian methods (Pinho and Hey 2010). However, simulation and model fitting are computationally intensive tasks and are not easily applied on a genomic scale. A simpler and more computationally efficient approach that is gaining in popularity is to test for an excess of shared derived variants using a four-taxon test (Kulathinal et al. 2009;Green et al. 2010;Durand et al. 2011). The test considers ancestral ("A") and derived ("B") alleles and is based on the prediction that two particular single nucleotide polymorphism (SNP) patterns, termed "ABBA" and "BABA" (see Materials and Methods), should be equally frequent under a scenario of incomplete lineage sorting without gene flow. An excess of ABBA or BABA patterns is indicative of gene flow between two of the taxa and can be detected using Patterson's D statistic (Green et al. 2010;Durand et al. 2011; see Materials and Methods for details). However, an excess of shared derived variants can arise from factors other than recent introgression, in particular nonrandom mating in the ancestral population due to population structure (Eriksson and Manica 2012). It is therefore important to make use of additional means to distinguish between these alternative hypotheses, for example, by examining the size of introgressed tracts (Wall et al. 2013), or the level of absolute divergence in introgressed regions (Smith and Kronforst 2013).
The D statistic was originally designed to be applied on a genome-wide or chromosome-wide scale, with block-jackknifing used to overcome the problem of nonindependence between loci (Green et al. 2010). However, many researchers are interested in identifying particular genomic regions subject to gene flow, rather than simply estimating a genomewide parameter. Theory predicts that the rate of gene flow should vary across the genome, both in the case of secondary contact after isolation (Barton and Gale 1993) as well as continuous gene flow during speciation (Wu 2001). Indeed, a maximum-likelihood test for speciation with gene flow devised by Yang (2010) is based on detecting this underlying heterogeneity. Moreover, adaptive introgression might lead to highly localized signals of introgression, limited to the particular loci under selection.
Many methods for characterizing heterogeneity in patterns of introgression across the genome have been proposed. Several genomic studies have used F ST to characterize heterogeneity in divergence across the genome, often interpreting the variation in F ST as indicative of variation in rates of gene flow (e.g., Ellegren et al. 2012). However, it is well established that, as a relative measure of divergence, F ST is dependent on within-population genetic diversity (Charlesworth 1998), and is therefore an unreliable indicator of how migration rates vary across the genome. In particular, heterogeneity in purifying selection and recombination rate could confound F STbased studies (Noor and Bennett 2009;Hahn et al. 2012;Roesti et al. 2012;Cruickshank and Hahn 2014). Various studies of admixture among human populations, or between humans and Neanderthals, have used probabilistic methods to assign ancestry to haplotypes, and infer how this ancestry changes across a chromosome (Sankararaman et al. 2008;Price et al. 2009;Henn et al. 2012;Lawson et al. 2012;Omberg et al. 2012;Churchhouse and Marchini 2013;Maples et al. 2013;Sankararaman et al. 2014). Other methods have modeled speciation with the allowance for variable introgression rates among loci (Garrigan et al. 2012;Roux et al. 2013), allowing the detection of more ancient gene flow.
There have also been recent attempts to characterize heterogeneity in patterns of introgression across the genome using the D statistic, calculated either in small windows Smith and Kronforst 2013) or for individual SNPs (Rheindt et al. 2014). The robustness of the D statistic for detecting a genome-wide excess of shared derived alleles has been thoroughly explored (Green et al. 2010;Durand et al. 2011;Yang et al. 2012;Eaton and Ree 2013;Martin et al. 2013). However, it has not been established whether D provides a robust and unbiased means to identify individual loci with an excess of shared derived alleles, or to demonstrate that these loci have been subject to introgression. Any inherent biases of the D statistic when applied to specific loci have implications for methods that assume its robustness.
For example, Smith and Kronforst (2013) made use of the D statistic in a proposed test to distinguish between the hypotheses of introgression and shared ancestral variation at wing-patterning loci of Heliconius butterflies. Two wingpatterning loci are known to show an excess of shared derived alleles between comimetic populations of Heliconius melpomene and H. timareta (Heliconius Genome Consortium 2012). At one of these loci, phylogenetic evidence and patterns of linkage disequilibrium are consistent with recent gene flow (Pardo-Diaz et al. 2012). Nevertheless, Smith and Kronforst (2013) argue that this shared variation might represent an ancestral polymorphism that was maintained through the speciation event by balancing selection. Conceptually, this is not unlike the population structure argument of Eriksson and Manica (2012), except that here structure is limited to one or a few individual loci.
Smith and Kronforst proposed that the alternative explanations of introgression or ancestral polymorphism could be distinguished by considering absolute divergence within and outside of the loci of interest. Both hypotheses predict an excess of shared derived alleles at affected loci, but introgression should lead to reduced absolute divergence due to more recent coalescence at these loci, whereas the locus-specific population structure hypothesis predicts no reduction in absolute divergence at these loci compared with other loci in the genome. Loci with an excess of shared derived alleles, and therefore showing evidence of shared ancestry, were located by calculating the D statistic in nonoverlapping 5-kb windows across genomic regions of interest, and identifying outliers using an arbitrary cutoff (the 10% of windows with the highest D values). The mean absolute genetic divergence (d XY ) was then compared between the outliers and nonoutliers, and found to be significantly lower in outlier windows, consistent with recent introgression (Smith and Kronforst 2013). This method makes two assumptions. First, that the D statistic can accurately identify regions that carry a significant excess of shared variation, and second, that D outliers do not have inherent biases leading to their cooccurrence with regions of low absolute divergence. These assumptions, which extend the use of D beyond its original definition, may be made by other researchers for similar purposes, but they remain to be tested.
Here, we first assess the reliability of the D statistic as a means to quantify introgression at individual loci. Using simulations of small sequence windows, we compare D to a related statistic that was developed by Green et al. (2010) specifically for estimating f, the proportion of the genome that has been shared, and we propose improvements to this statistic. We then use whole-genome data from several Heliconius species to investigate how these statistics perform on empirical data, and specifically how they are influenced by underlying heterogeneity in diversity across the genome. Lastly, we use a large range of simulated data sets to test the proposal that recent gene flow can be distinguished from shared ancestral variation based on absolute divergence in D outlier regions.

Results
The D Statistic Is Not an Unbiased Estimator of Gene Flow Patterson's D statistic was developed to detect, but not to quantify introgression. We used the deterministic derivation of the expected value of D (E[D]) provided by Durand et al. (2011, eq. 5) to test how sensitive the value of D is to other factors apart from the proportion of introgression. We define the proportion of introgression (f) as the proportion of haplotypes in the recipient population (P 2 ) that trace their ancestry through the donor population (P 3 ) at the time of gene flow ( fig. 1A). The expected D value increases with the proportion of introgression (f), but not linearly (fig. 1B) and expected D increases as population size decreases ( fig. 1B and C). The split times between populations also have a small effect, with a more recent split between P 1 and P 2 leading to higher expected values of D ( fig. 1C). This implies that empirically calculated values of the D statistic will depend on various parameters other than the amount of gene flow, irrespective of the number of sites analyzed.

Direct Estimators of f Outperform D on Simulated Data
Analysis of simulated data confirmed that the D statistic is not an appropriate measure for quantifying introgression over small genomic windows, but that direct estimation of the proportion of introgression (f) provides a more robust alternative. The D statistic (eq. 1, Materials and Methods) was compared with the f estimator of Green et al. (2010) (eq. 4, Materials and Methods), which is referred to asf G below, along with two proposed modified versions of this statistic (eqs. 5 and 6, Materials and Methods). The first,f hom (eq. 5), is similar tof G in that it explicitly assumes unidirectional gene flow from P 3 to P 2 , but makes a further assumption that maximal introgression would lead to complete homogenization of allele frequencies in P 2 and P 3 . This is a conservative assumption, as an extremely high rate of migration would be necessary to attain a maximal value off hom . The second,f d (eq. 6), is dynamic in that it allows for bidirectional introgression on a site-by-site basis, setting the donor population at each site as that which has the higher frequency of the derived allele. These f estimators are distinct from the F 2 , F 3 , and F 4 statistics of Reich et al. (2009Reich et al. ( , 2012 and Patterson et al. (2012), which all test for correlated allele frequencies associated with introgression (much like the D statistic). However,  F 4 -ratio is conceptually very similar to the f estimators discussed here, in that it estimates the proportional contribution of a donor population.
To compare the utility of these statistics for quantifying introgression in small genomic windows, we simulated sequences from four populations: P 1 , P 2 , and P 3 and outgroup O, with the relationship (((P 1 ,P 2 ),P 3 ),O), with a single instantaneous gene flow event, either from P 3 to P 2 or from P 2 to P 3 . Simulations were performed over a range of different values of f (the probability that any particular haplotype is shared during the introgression event), and with various window sizes, recombination rates, and times of gene flow. depends on the two split times, t 12 and t 23 , separating populations P 1 , P 2 , and P 3 . It assumes a single instantaneous admixture event from P 3 to P 2 at t GF , after which a proportion f of P 2 individuals trace their ancestry through P 3 . The effective population size, N e , is constant through time and the In all three cases, t 23 is set at 2 million generations ago, and f is set to 0.1.
underestimates, but were nevertheless well correlated with the simulated f value (supplementary fig. S1A, Supplementary Material online). This is unsurprising, as genetic drift and the accumulation of mutations in these lineages after the introgression event should dilute the signal of introgression. In simulations of gene flow in the opposite direction, from P 2 to P 3 , bothf G andf hom showed considerable stochasticity, particularly when recombination rates were low and gene flow was recent (supplementary fig. S1, Supplementary Material online). The size of the window had little effect on this behavior (supplementary fig. S1, Supplementary Material online), implying that it was not an effect of the number of sites analyzed, but rather the level of independence among sites. This is also to be expected, as these statistics can give values greater than 1 where derived allele frequencies happen by chance to be higher in P 3 than P 2 (see Materials and Methods) . Unlike these two statistics,f d behaved predictably at all recombination rates and times of gene flow, giving estimates that were fairly well correlated with the simulated f,  Simulations covered 11 different values of f, the proportion of introgression. Gene flow was simulated either from P 3 to P 2 (left-hand column) or from P 2 to P 3 (righthand column). Dashed diagonal lines show the expectation of a perfect estimator of f.

247
Locating Introgression . doi:10.1093/molbev/msu269 MBE Although none of the examined measures were able to accurately quantify both forms of introgression in all cases,f d has some appealing characteristics as a measure to identify introgressed loci in a genome scan approach. It has low variance and is not prone to false positives when gene flow is absent and recombination rare. In all of our simulations, it provided estimates that were proportional to the simulated level of introgression. Although it tended toward underestimates, genome scans for introgressed loci would primarily be interested in relative rates of introgression across the genome, rather than absolute rates.
f Estimators Are Robust to Variation in Nucleotide Diversity Across the Genome Analysis of published whole-genome data from Heliconius species confirmed that Patterson's D statistic was prone to extreme values in regions of low diversity, whereas f estimators were not (fig. 3A-C). We reanalyzed published wholegenome sequence data from two closely related Heliconius butterfly species, H. melpomene and H. timareta, and four outgroup species from the related silvaniform clade. The races H. melpomene amaryllis and H. timareta thelxinoe are sympatric in Peru, and show genome-wide evidence of gene flow (Martin et al. 2013), with particularly strong signals at two wing-patterning loci: HmB, which controls red pattern elements, and HmYb, which controls yellow and white pattern elements (Heliconius Genome Consortium 2012; Pardo-Diaz et al. 2012). To determine whether heterogeneity in diversity across the genome may influence D and the f estimators, we calculated D,f G ,f hom ,f d , and nucleotide diversity () in nonoverlapping 5-kb windows across the genome. Variance in the D statistic was highest among windows with low nucleotide diversity and decreased rapidly with increasing diversity ( fig. 3A and supplementary fig. S4, Supplementary Material online). Windows from the wingpatterning loci were among those with the highest D values, but there were many additional windows with D values approaching or equal to 1. By contrast,f d , calculated for all windows with positive D, was far less sensitive to the level of diversity, with most outlying windows showing intermediate levels of diversity ( fig. 3B). Notable exceptions were windows located within the wing-patterning regions, which tended to have highf d values and below average diversity. This is consistent with the strong selection known to act upon the patterning loci. The lack of extremef d values in windows with low diversity suggests that most of the D outliers are spurious, and thatf d provides a better measure of whether a locus has shared ancestry between species. Finally, we also tested the other two f estimators described here:f G andf hom (eqs. 4 and 5, Materials and Methods). Both performed similarly tof d except that both had higher variance (supplementary figs. S3 and S4, Supplementary Material online), consistent with the simulations reported above (supplementary fig. S1, Supplementary Material online), and both gave a considerable number of values greater than 1, confirming thatf d was the most conservative and stable statistic.
Taken together, these findings demonstrate that, when small genomic windows are analyzed, a high D value alone is not sufficient evidence for introgression. Many of the D outlier loci probably represent statistical noise, concentrated in regions of low diversity, whereas outliers for the f estimates, and particularlyf d , tend to be less biased.
This effect could also be observed on the scale of whole chromosomes. The variance in D among 5-kb windows for each of the Heliconius chromosomes (n = 21) was strongly negatively correlated with the average diversity per chromosome (r[19] = À 0.936, P < 0.001; fig. 3C). This relationship was most clearly illustrated by the Z chromosome: It had the lowest diversity by some margin, as expected given its reduced effective population size, and the highest variance among D values for 5-kb windows, despite the fact that previous chromosome-wide analyses suggest very limited gene flow affecting this chromosome (Martin et al. 2013). By contrast, the variance inf d , estimated for all windows with positive D, had a weak positive correlation with the mean diversity per chromosome (r[19] = 0.440, P < 0.05). This was driven by the fact that the Z chromosome had the lowest diversity and also the lowest variance inf d , as expected given the reduced gene flow affecting this chromosome. When the Z chromosome was excluded, there was no significant relationship between the variance inf d values and average diversity (r[18] = 0.092, P 4 0.05). We also considered the effect of window size on the variance of D and the estimators of f. As window size increases, the higher variance of D in regions of lower diversity persists, but becomes less extreme (supplementary fig. S4, Supplementary Material online). In summary, these data show that extreme D values, both positive and negative, occur disproportionately in genomic regions with lower diversity, whereasf d values are less biased by underlying heterogeneity in genetic variation.

Inherent Biases in the D andf Statistics Confound a Test to Distinguish between Introgression and Shared Ancestral Variation
The biases associated with the D statistic described above may have important consequences for methods that use D to identify candidate introgressed regions. For example, Smith and Kronforst (2013) proposed a method to discriminate between gene flow and shared ancestral variation that relies upon D values calculated for small genomic regions (see Introduction). Briefly, the Smith and Kronforst test calculated D for all nonoverlapping 5-kb windows. Absolute divergence (d XY ) was then compared between the set of windows that were outliers for the D statistic (defined as the windows with the top 10% of D values) and the remaining 90% of nonoutlier windows. The method predicts that introgression between species at a specific genomic region should reduce the between-species divergence in this region as compared with the rest of the genome, whereas shared ancestry due to ancestral population structure would not lead to lower divergence. We first confirmed this prediction using simulations, and then assessed whether biases in the D statistic might affect the power of the method.

MBE
To test the prediction that introgression and ancestral population structure leave distinct footprints in terms of absolute divergence, 10,000 sequence windows for three populations (P 1 ,P 2 ,P 3 ) and an outgroup (O) were simulated. In total, 9,000 windows were defined as "Background," having the topology (((P 1 ,P 2 ),P 3 ),O), without any gene flow or population structure. The remaining 1,000 windows were defined as "Alternate" and were subject to either gene flow or structure (see Materials and Methods for details). Ten percent of windows were defined as Alternate to match Smith and Kronforst's design, wherein the top 10% of D values are taken as outliers. Three different Alternate scenarios were considered: Gene flow from P 2 to P 3 , gene flow from P 3 to P 2 and ancestral structure leading to shared ancestry between P 2 and P 3 . For all scenarios, Alternate windows were defined with the topology ((P 1 ,(P 2 ,P 3 )),O). In the gene flow scenarios, the split time between P 2 and P 3 in the Alternate topology was set to be more recent than the split between P 1 and P 2 in the Background topology ( fig. 4A and B). In the ancestral structure scenario, the split time between P 1 and P 2 in the Alternate topology was set to be more ancient than the split between P 2 and P 3 in the Background topology ( fig. 4C). This was designed to model a region of the genome undergoing balancing selection or some other process that maintains polymorphism at particular loci before the speciation event. Gene flow or structure in the Alternate windows can be considered to be complete (f = 1). For example, under gene flow from P 2 to P 3 , all P 3 alleles trace their ancestry through P 2 at the time of gene flow. This simplified design, where gene flow or structure is absent in 90% of the sequences and complete in 10%, allowed for the most straightforward and predictable test of Smith and Kronforst's method; if the logic of the method does not follow in this extreme scenario, it is unlikely to do so in more complex situations.
For each of the three evolutionary scenarios, 120 different permutations of split times and times of gene flow or structure were simulated. The split times and times of gene flow for all models are given in the first three columns of supplementary tables S1-S3, Supplementary Material online. To simplify our comparisons between models, we focused specifically on d XY between P 2 and P 3 , the most relevant parameter when testing for introgression between P 2 and P 3 . We tested whether P 2 -P 3 d XY was significantly lower in the Alternate windows (those that had experienced gene flow or structure) compared with the Background windows, using a Wilcoxon rank-sum test, with Bonferroni correction over all 120 models As predicted, in all models simulating gene flow, average d XY between P 2 and P 3 was significantly lower in Alternate windows compared with Background windows. In contrast, in all models simulating ancestral population structure, there was no significant difference in P 2 -P 3 d XY between the Background and Alternate windows, again in agreement with predictions. These findings therefore demonstrate that the intuitive premise of Smith and Kronforst's (2013) method is justified.
We then tested whether introgression could be distinguished from shared ancestral variation where loci with shared ancestry are not known (as would be the situation with empirical data), but are instead inferred by selecting the top 10% of D values (outliers), following the Smith and Kronforst method. We also tested this method using the top 10% of f estimates among windows with positive D (usingf d ,f G , andf hom ). Using the D statistic to identify outliers, mean d XY between P 2 and P 3 was significantly reduced in outlier windows as compared with nonoutlier windows in all 120 models simulating gene flow from P 3 to P 2 , and all but one of the models simulating gene flow from P 2 to P 3 . The single nonsignificant case had the most ancient possible t 23 and the most recent possible t 12 , and only 11.9% of D outlier windows were genuine Alternate windows, the lowest recall of any model. Using any of the three f estimators, mean d XY

FIG. 4.
Simulations to evaluate a method to distinguish introgression from shared ancestral variation. (A-C) Combined models were made up of 9,000 sequence windows simulated under the Background topology (brown outline) and 1,000 windows simulated under an Alternate topology (colored line). Three distinct evolutionary scenarios were simulated by varying the split times t 12 , t 23 , t GF , and t STR ; (A, E, I) Gene flow from P 2 to P 3 , (B, F, J) gene flow from P 3 to P 2 , (C, G, K) ancestral structure. (D, H, L) Null models were made up of 10,000 sequences simulated under the background topology only. (E-L) Example data from a single simulated data set for each of the four types of models. Split times (in units of 4N generations) were as follows: t 12 = 0.6 in all four cases, t 23 = 0.8 in all four cases, t GF = 0.4 in both gene flow models and t STR = 1.0. Points show mean and standard deviation for P 2 -P 3 d XY calculated over subsets of trees: Simulated Background and Alternate trees (brown and colored points) or nonoutliers and outliers (gray and black points) identified using the D andf d statistics. A significant reduction in P 2 -P 3 d XY for the Alternate compared with Background windows, or for outliers compared with nonoutliers, is indicated by astrices. (E-H) Results of simulations with a population recombination rate (4Nr) of 0.01. (I-L) Results for the same models, but with a population recombination rate (4Nr) of 0.001. between P 2 and P 3 was significantly reduced in outlier windows in all gene flow models.
However, mean P 2 -P 3 d XY was also significantly reduced in D and foutlier windows in more than half of the 120 models simulating ancestral population structure (figs. 4C, 4G, and 5; table 1 and supplementary table S3, Supplementary Material online). This demonstrates that a simple test for reduced divergence in P 2 -P 3 d XY among D orf outlier windows would, under a range of ancestral structure scenarios, produce results consistent with introgression. The fact that this bias was similar whether D or f estimators were used to identify outliers indicates that there is an inherent tendency in all of these statistics toward regions with below-average divergence between P 2 and P 3 . To confirm this finding, we analyzed a set of simulations using a null model, with no gene flow or structure in any of the 10,000 windows, over 45 permutations of split times (supplementary table S4, Supplementary Material online). Outlier windows showed significantly reduced d XY between P 2 and P 3 in most or all of the null models. Finally, we repeated all of these simulations with a lower withinwindow recombination rate parameter (4Nr) of 0.001. This tended to increase the reduction in P 2 -P 3 d XY for outliers in ancestral structure and null models (figs. 4I-L and 5), with at most three of the ancestral structure models showing nonsignificant drops in P 2 -P 3 d XY for outliers, and most or all null models showing significantly lower P 2 -P 3 d XY for outliers, regardless of the statistic used (table 1).
In summary, although shared ancestral variation and introgression can theoretically be distinguished based on the fact that only the latter should reduce d XY between P 2 and P 3 , an inherent bias in both the D and fstatistics makes a simple test for a statistical difference in d XY between outliers and nonoutliers problematic. Both D and f outliers tended toward windows with lower P 2 -P 3 d XY , regardless of the underlying evolutionary history, and particularly when recombination rates were low. In the absence of any gene flow, the outliers must therefore be identifying windows that coalesce more recently in the ancestral population. However, even when the reduction in P 2 -P 3 d XY was significant for ancestral structure or null models, it was typically smaller than the reductions in d XY seen in the gene flow models ( fig. 5). In the presence of gene flow, some windows coalesce more recently than the species split, so the magnitude of the reduction in P 2 -P 3 d XY is greater. This difference could potentially be used to distinguish introgression from shared ancestral variation, but can not be done with a simple significance test, and will require a more sophisticated model-fitting approach.

Discussion
With the advent of population genomics, studies of species divergence have moved from simply documenting interspecific gene flow, toward the identification of specific genomic regions that show strong signals of either introgression or divergence (Garrigan et al. 2012;Heliconius Genome Consortium 2012;Staubach et al. 2012;Roux et al. 2013;Bosse et al. 2014;Huerta-S anchez et al. 2014;Sankararaman et al. 2014). This is a useful goal for many reasons. It can permit the identification of large-scale trends, such as chromosomal differences, and the fine-scale localization of putative targets of adaptive introgression for further characterization. Therefore, simple and easily computable statistics that can be used to identify loci with a history of introgression have considerable appeal.
Previous studies have explored the behavior of Patterson's D statistic, a test for gene flow based on detecting an inequality in the numbers of ABBA and BABA patterns, using wholegenome analyses across large numbers of informative sites (Green et al. 2010;Yang et al. 2012;Eaton and Ree 2013;Martin et al. 2013;Wall et al. 2013). These studies have shown that D is a robust method when applied as intended: To test for an excess of shared variation on a genome-wide scale. Indeed, a major strength of the ABBA-BABA test is that it combines data from across the genome, accounting for chance fluctuations among loci, and therefore is able to detect the net effect of gene flow. Moreover, the nonindependence among linked sites can be accounted for by blockjackknifing (Green et al. 2010). However, it is not clear whether D can be extended beyond its original use to identify specific loci with introgressed variation. We have documented two main problems with this approach. First, D is not an unbiased estimator of the amount of introgression that has occurred. In particular, it is influenced by effective population size (N e ), leading to more extreme values when N e is low. Second, when calculated over small windows, it is highly stochastic, particularly in genomic regions of low diversity and low recombination rate, such that D outliers will tend to be clustered within these regions. Local reductions in genetic diversity along a chromosome can come about through neutral processes, such as population bottlenecks, but also through directional selection. Therefore, these problems may be exacerbated in studies specifically interested in loci that experience strong selective pressures, as this would increase the likelihood of detecting chance outliers at such loci.
Direct estimation of f, the proportion of introgression, holds more promise as a robust method for detecting introgressed loci. Green et al. (2010) proposed that f could be estimated by comparing the observed difference in the number of ABBA and BABA patterns to that which would be expected in the event of complete introgression. As this expected value is calculated from the observed data, this method controls for differences in the level of standing variation, making it more suitable for application to small regions. In Green et al.'s approach, complete introgression from P 3 to P 2 was taken to mean that P 2 would come to resemble a subpopulation of lineage P 3 . Here, we make the conservative assumption that complete introgression would lead to homogenization of allele frequencies, such that the frequency of the derived allele in P 2 would be identical to that in P 3 . Green et al.'s approach assumed unidirectional introgression from P 3 to P 2 , but can lead to spurious values when introgression occurs in the opposite direction. We have therefore proposed a new dynamic estimator of f, in which the donor population can differ between sites, and is always the population with the higher frequency of the derived allele. Although this conservative estimator leads to slight underestimation of the amount of introgression that has occurred, it provides an estimate that is roughly proportional to the level of introgression, regardless of the direction. It is therefore a more suitable measure for identifying introgressed loci. This is supported by our analysis of whole-genome data from Heliconius butterflies, where many 5-kb windows had maximal D values (D = 1), but only a few had highf d values, the vast majority of which were located around the wing-patterning loci previously identified as being shared between these species through adaptive introgression (Heliconius Genome Consortium 2012;Pardo-Diaz et al. 2012).
The sensitivity of D to heterogeneous genomic diversity is likely to affect studies that have drawn conclusions from D FIG. 5. Mean d XY between P 2 and P 3 in outlier windows as a percentage of P 2 -P 3 d XY in nonoutlier windows. Outlier windows defined by Alternate or Background topology (simulation) or by outlying D andf values, as per figure 4. Model types shown in color (gene flow from P 2 to P 3 , green; gene flow from P 3 to P 2 , blue; ancestral structure, red; null model, brown). Results for two different recombination rates are shown (4Nr = 0.01, left; 4Nr = 0.001, right). statistics calculated for particular genome regions. For example, Wall et al. (2013) identified regions carrying putative long (8-100 kb) haplotypes segregating in European humans, and then found that these regions showed evidence of a Neanderthal origin, as indicated by elevated D statistics. However, it may be that such haplotypes would be overrepresented in low-recombination regions, which also tend to have reduced diversity in humans and many other species (Cutter and Payseur 2013). In another recent Heliconius study, F ST was calculated for 5-kb windows across the genome. Windows showing increased differentiation between H. melpomene and H. pachinus (according to F ST ) also showed significantly elevated D statistics in a test for introgression between the same species pair . This illustrates how the sensitivity of both D and F ST to withinspecies diversity can produce conflicting results. This sensitivity is likely to be particularly problematic in studies using very small genomic regions. At the extreme, Rheindt et al. (2014) calculated D for single SNPs and predicted that genes linked to SNPs with outlying D values are more likely to have been introgressed.
In the present study, to investigate whether biases in D when calculated over small regions could influence subsequent analyses, we investigated a recently proposed method to distinguish between introgression and shared ancestral variation (Smith and Kronforst 2013). The premise of this test is that introgression should result in an excess of shared derived alleles and a reduction in absolute divergence (d XY ), whereas shared ancestral variation will exhibit the former but not the latter signature. Our simulations confirmed that the intuitive predictions of this method are valid, but also showed that this test can be misled by the use of D to identify outliers. Windows that were outliers for D exhibited below average d XY in simulations with gene flow, but also in most simulations with ancestral structure, or where both gene flow and ancestral population structure were absent. All three f estimators also failed to distinguish between introgression and ancestral structure in many models. This implies that all of these statistics are systematically biased toward regions that coalesce more recently, regardless of whether gene flow has occurred.
We predict that D would have additional problems in real genomes, where selective constraint leads to a correlation between within-species diversity and between-species divergence, causing D outliers to be even more strongly associated with reduced d XY . However, it is notable that the reduction in divergence among D and foutliers was almost always greater in simulations with introgression than in simulations with ancestral structure or with no Alternate topology, across a large range of split times and dates of gene flow. There may, therefore, be considerable information about the evolutionary history of DNA sequences present in the joint distribution of d XY andf d . On the other hand, in real data, levels of divergence can vary dramatically due to heterogeneity in selective constraint, mutation rate, and recombination rate, which would exaggerate the problems described here. Even an unbiased statistic, when applied to small genomic windows, would be confounded by heterogeneity in recombination rate across the genome. In regions of reduced recombination, fewer independent data points are sampled by each window, so extreme estimates become more likely. Heterogeneity in recombination rate is therefore an essential consideration in any study that aims to scan the genome for regions of interest.

Conclusions
In an era of increasing availability of genomic data, there is a demand for simple summary statistics that can reliably identify genomic regions that have been subject to selection, introgression and other evolutionary processes. It seems unlikely, however, that any single summary statistic will be able to reliably distinguish these processes from noise introduced by demography, drift, and heterogeneity in recombination rate. Here, we have shown that, while Patterson's D statistic provides a robust signal of shared ancestry across the genome, it should not be used for na€ ıve scans to ascribe shared ancestry to small genomic regions, due to its tendency toward extreme values in regions of reduced variation. Estimation of f, the proportion of introgression, particularly using our proposed statisticf d , provides a better means of identifying putatively introgressed regions. Nevertheless, both D andf d tend to identify regions of reduced interspecies divergence, even in the absence of gene flow, which may confound tests to distinguish between recent introgression and shared ancestral variation based on absolute divergence (d XY ) in outlier regions. However, the joint distribution of d XY andf statistics may be a useful summary statistic for modelfitting approaches to distinguish between these evolutionary hypotheses.

Statistics Used to Detect Shared Ancestry
In this study, we focused on an approach to identify an excess of shared derived polymorphisms, indicated by the relative abundance of two SNP patterns termed ABBAs and BABAs (Green et al. 2010). Given three populations and an outgroup with the relationship (((P 1 , P 2 ), P 3 ), O) ( fig. 1A), ABBAs are sites at which the derived allele B is shared between the nonsister taxa P 2 and P 3 , whereas P 1 carries the ancestral allele, as defined by the outgroup. Similarly, BABAs are sites at which the derived allele is shared between P 1 and P 3 , whereas P 2 carries the ancestral allele. Under a neutral coalescent model, both patterns can only result from incomplete lineage sorting or recurrent mutation, and should be equally abundant in the genome (Durand et al. 2011). A significant excess of ABBAs over BABAs is indicative either of gene flow between P 2 and P 3 , or some form of nonrandom mating or structure in the population ancestral to P 1 , P 2 , and P 3 . This excess can be tested for, using Patterson's D statistic, where C ABBA (i) and C BABA (i) are counts of either 1 or 0, depending on whether or not the specified pattern (ABBA or 253 Locating Introgression . doi:10.1093/molbev/msu269 MBE BABA) is observed at site i in the genome. Under the null hypothesis of no gene flow and random mating in the ancestral population, D will approach zero, regardless of differences in effective population sizes (Durand et al. 2011). Hence, a D significantly greater than zero is indicative of a significant excess of shared derived alleles between P 2 and P 3 . If population samples are used, then rather than binary counts of fixed ABBA and BABA sites, the frequency of the derived allele at each site in each population can be used (Green et al. 2010;Durand et al. 2011), effectively weighting each segregating site according to its fit to the ABBA or BABA pattern, with where p ij is the frequency of the derived allele at site i in population j. These values are then used in equation 1 to calculate D (Durand et al. 2011). Green et al. (2010) also proposed a related method to estimate f, the fraction of the genome shared through introgression (Green et al. 2010;Durand et al. 2011). This method makes use of the numerator of equation 1, the difference between sums of ABBAs and BABAs, which is called S. In the example described above, with ((P 1 ,P 2 ),P 3 ),O), the proportion of the genome that has been shared between P 2 and P 3 subsequent to the split between P 1 and P 2 can be estimated by comparing the observed value of S to a value estimated under a scenario of complete introgression from P 3 to P 2 . P 2 would then resemble a lineage of the P 3 taxon, and so the denominator of equation 1 can be estimated by replacing P 2 in equations 2 and 3 with a second lineage sampled from P 3 , or by splitting the P 3 sample into two, where P 3a and P 3b are the two lineages sampled from P 3 . Splitting P 3 arbitrarily in this way may lead to stochastic errors at individual sites, particularly with small sample sizes. These should be negligible when whole-genome data are analyzed but could easily lead to erroneous values off (includingf 4 1) when small genomic windows are analyzed, as in the present study. We therefore used a more conservative version, in which we assume that complete introgression from P 3 to P 2 would lead to complete homogenization of allele frequencies. Hence, in the denominator, P 3a and P 3b are both substituted by P 3 : Although this conservative assumption may lead to underestimation of the proportion of sites shared, it also reduces the rate of stochastic error. Moreover, in the present study, we are less concerned with the absolute value off , and more with the relative values off between genomic regions.
Thef statistic assumes unidirectional gene flow from P 3 to P 2 (i.e., P 3 is the donor and P 2 is the recipient). Because the branch leading to P 3 is longer than that leading to P 2 ( fig. 1A), gene flow in the opposite direction (P 2 to P 3 ) is likely to generate fewer ABBAs. Thus, in the presence of gene flow from P 2 to P 3 , or in both directions, thef equation should lead to an underestimate. However, when small genomic windows are analyzed, the assumption of unidirectional gene flow could lead to overestimates, because any region in which derived alleles are present in both P 2 and P 3 , but happen to be at higher frequency in P 2 , will yield f estimates that are greater than 1. Thus, we propose a dynamic estimator in which the denominator is calculated by defining a donor population (P D ) for each site independently. For each site, P D is the population (either P 2 or P 3 ) that has the higher frequency of the derived allele, thus maximizing the denominator and eliminating f estimates greater than 1: Assessing the Ability of D and f Estimators to Quantify Introgression in Small Sequence Windows To assess how reliably Patterson's D statistic, and other estimators of f are able to quantify the actual rate of introgression, we simulated sequence data sets with differing rates of introgression using ms (Hudson 2002). For each data set, we simulated 100 sequence windows for eight haplotypes each from four populations with the relationship (((P 1 ,P 2 ),P 3 ),O). The split times t 12 and t 23 (as on fig. 1A) were set to 1 Â 4N generations and 2 Â 4N generations ago, respectively, and the root was set to 3 Â 4N generations ago. An instantaneous, unidirectional admixture event, either from P 3 to P 2 or from P 2 to P 3 , was simulated at a time t GF with a value f, which determines the probability that each haplotype is shared. We tested two different values for t GF : 0.1 and 0.5 Â 4N generations ago. For each direction of gene flow and each t GF , 11 simulated data sets were produced, with f values ranging from 0 (no gene flow) to 1 (all haplotypes are shared). Finally, the entire set of simulations was repeated with three different window sizes: 1, 5, and 10 kb, and with three different recombination rates: 0.001, 0.01, and 0.1, in units of 4Nr, the population recombination rate. DNA sequences were generated from the simulated trees using Seq-Gen (Rambaut and Grass 1997), with the Hasegawa-Kishino-Yano substitution model and a branch scaling factor of 0.01. Simulations were run using the provided script compare_f_estimators.r, which generates the ms and Seq-Gen commands automatically. An example set of commands to simulate a single 5-kb sequence using the split times mentioned above, with gene flow from P 3 to P 2 at t GF = 0.1 and f = 0.2, and with a recombination rate parameter of 0.01 would be: ms 32 1 -I 4 8 8 8 8 -ej 1 2 1 -ej 2 3 1 -ej 3 4 1 -es 0.1 2 0.8 -ej 0.1 5 3 -r 50 5000 -T j tail -n + 4 j grep -v // 4 treefile partitions=($(wc -l treefile))

Analysis of Heliconius Whole-Genome Sequence Data
To investigate how the D andf statistics are affected by underlying diversity in a given window, we reanalyzed whole genome data from Martin et al. (2013). For ABBA-BABA analyses, populations were defined as follows: P 1 = H. m. aglaope (four diploid samples), P 2 = H. m. amaryllis (4), P 3 = H. timareta thelxinoe (4), O=H. hecale (1), H. ethilla (1), H. pardalinus sergestus (1), and H. pardalinus ssp. nov. (1). Patterson's D (eq. 1) and the three f estimators (eqs. 4-6) were calculated, along with nucleotide diversity () and absolute divergence (d XY ), for nonoverlapping 5-kb windows across the genome. Both and d XY were calculated as the mean number of differences between each pair of individuals, sampled either from the same population (), or from separate populations (d XY ). Sites with missing data were excluded in a pairwise manner, and each pair of individuals contributed equally to the mean. Windows were restricted to single scaffolds and windows for which fewer than 3,000 sites had genotype calls for at least half of the individuals were discarded.
To calculate D and the f estimators only biallelic sites were considered. The ancestral state was inferred using the outgroup taxa, except when the four outgroup taxa were not fixed for the same allele, in which case the most common allele overall was taken as ancestral. The HmB locus was defined as positions 300000-450000 on scaffold HE670865 and the HmYb locus as positions 650000-900000 on scaffold HE667780 of version 1.1 of the H. m. melpomene genome sequence. We also analyzed windows from each of the 21 chromosomes of the H. m. melpomene genome sequence separately. Scaffolds were assigned to chromosomes according to the Heliconius Genome Consortium (2012), and incorporating the improved assignment of Z-linked scaffolds by Martin et al. (2013)  Assessing a Test to Distinguish Introgression from Shared Ancestral Variation Based on Absolute Divergence Smith and Kronforst (2013) proposed a simple test to distinguish between the hypotheses of pre and postspeciation shared ancestry based on absolute divergence. To assess this method on data of known history, we generated a large range of sequence data sets using ms (Hudson 2002) and Seq-Gen (Rambaut and Grass 1997). For the simplest ("null") model 10,000 5-kb sequence windows were simulated for eight haplotypes each from three populations and an outgroup, with the relationship (((P 1 ,P 2 ),P 3 ),O), without gene flow or population structure. To approximate a scenario in which a subset of the genome has a distinct phylogenetic history, either due to gene flow or genomically localized ancestral population structure, we used a combined model approach. This entailed combining 9,000 5-kb windows from the null model (90% Background windows), with 1,000 5-kb windows simulated with the topology ((P 1 ,(P 2 ,P 3 )),O), consistent with shared ancestry between P 2 and P 3 (10% Alternate windows). By altering the split times, three distinct scenarios were emulated: Gene flow from P 2 to P 3 , gene flow from P 3 to P 2 , and ancestral structure ( fig. 4A-D). Using entirely distinct topologies in this way is equivalent to making the probability of gene flow (or structure) equal to one in the 1,000 Alternate windows. Although this approach of partitioning each data set into two somewhat arbitrarily sized subsets with evolutionary histories at two extremes is biologically unlikely, it provided a simple and powerful framework in which to evaluate Smith and Kronforst's approach, with clear expectations. Model combination data sets were generated using run_model_combinations.py and shared_ancestry_simulator.R, which generates the ms and Seq-Gen commands automatically, in a similar form to those given above. For example if t 12 = 1, t 23 = 2, 4Nr = 0.01, and gene flow from P 3 to P 2 at t GF = 0.2, the ms calls for Background and Alternate models, respectively, would be: ms 32 1 -I 4 8 8 8 8 -ej 1 2 1 -ej 2 3 1 -ej 3 4 1 -r 50 5000 -T ms 32 1 -I 4 8 8 8 8 -ej 0.2 2 3 -ej 2 3 1 -ej 3 4 1 -r 50 5000 -T We calculated Patterson's D (eq. 1) and the three f estimators (eqs. 4-6) for all windows, and identified the top 1,000 "outliers" (10%) with the most extreme values. For D, only positive values were included as outliers, as negative values indicate an excess of BABAs, consistent with introgression between P 1 and P 3 . Similarly, for f estimators, only windows with D!0 were considered, as these values only give meaningful quantification of introgression when there is an excess of ABBAs. To compare P 2 -P 3 divergence between the Background and Alternate windows, or between outlier and nonoutlier windows, we calculated d XY for each window as described above, for each pair of populations. Average d XY was compared between subsets of windows using a Wilcoxon rank-sum test, as values tended to be nonnormally distributed (confirmed with Bonferroni-corrected Shapiro-Wilk tests).
These tests were repeated over a large range of split times. In all cases the root was set to 3.0 Â 4N generations ago, and the other splits ranged from 0.2 to 2.0. Times of gene flow and structure also varied on the same scale. In total, this gave 45 null models and 120 models each for the two gene flow scenarios and ancestral structure scenario (405 overall). The analyzed models therefore covered a vast range of biologically relevant scales. In all cases, the Seq-Gen branch scaling factor was set to 0.01. Full parameters for all models are provided in supplementary tables S1-S4, Supplementary Material online. Finally, to examine the effects of recombination rate, the entire simulation study was repeated using population recombination rate