Inferring Adaptive Introgression Using Hidden Markov Models

Abstract Adaptive introgression—the flow of adaptive genetic variation between species or populations—has attracted significant interest in recent years and it has been implicated in a number of cases of adaptation, from pesticide resistance and immunity, to local adaptation. Despite this, methods for identification of adaptive introgression from population genomic data are lacking. Here, we present Ancestry_HMM-S, a hidden Markov model-based method for identifying genes undergoing adaptive introgression and quantifying the strength of selection acting on them. Through extensive validation, we show that this method performs well on moderately sized data sets for realistic population and selection parameters. We apply Ancestry_HMM-S to a data set of an admixed Drosophila melanogaster population from South Africa and we identify 17 loci which show signatures of adaptive introgression, four of which have previously been shown to confer resistance to insecticides. Ancestry_HMM-S provides a powerful method for inferring adaptive introgression in data sets that are typically collected when studying admixed populations. This method will enable powerful insights into the genetic consequences of admixture across diverse populations. Ancestry_HMM-S can be downloaded from https://github.com/jesvedberg/Ancestry_HMM-S/.


Introduction
It is becoming increasingly clear that admixture, gene flow between genetically divergent populations, is a common phenomenon in nature. In some cases, introgressed genetic material confers a selective advantage for individuals in the recipient population, commonly referred to as adaptive introgression, and it is thought to underlie the evolution of numerous adaptive phenotypes (Hedrick 2013;Racimo et al. 2015;Suarez-Gonzalez et al. 2018), for example pesticide resistance in mice (Song et al. 2011) and mosquitos (Norris et al. 2015), and complex mimicry patterns in Heliconius butterflies (The Heliconius Genome Consortium 2012). Perhaps the most famous example is the introgression of an allele of EPAS1 from archaic Denisovans into a modern human population, where the Denisovan allele is thought to have increased in frequency in Tibet due to higher fitness at high altitudes (Huerta-S anchez et al. 2014;Jeong et al. 2014;Racimo et al. 2015). Admixture therefore has the potential to facilitate adaptive phenotypic outcomes across diverse populations and is rapidly emerging as one of the fundamental drivers of natural selection (Hedrick 2013;Suarez-Gonzalez et al. 2018).
Recent admixture is thought to be an important evolutionary force in Drosophila melanogaster as well. Populations of this species migrated out from sub-Saharan Africa to colonize the rest of the world $10,000-15,000 years ago (Thornton and Andolfatto 2006). During this expansion, the population that left Africa experienced a dramatic bottleneck that reshaped haplotypic variation across the genome, resulting in decreased diversity and extended linkage disequilibrium (Thornton and Andolfatto 2006;Pool et al. 2012). More recently, descendants of the ancestral and derived populations have admixed in several locations across the world, and these have been the subjects of numerous previous analyses of admixture and local ancestry Kao et al. 2015;Lack et al. 2015;Pool 2015;Bergland et al. 2016;Lack et al. 2016;Corbett-Detig and Nielsen 2017;Medina et al. 2018). In particular, the population history for one large admixed sample from South Africa (Lack et al. 2015) is consistent with a simple admixture model where cosmopolitan ancestry introgressed into this population once $30 years prior to sampling (Medina et al. 2018).
Although there have been numerous investigations into the factors that cause nascent reproductive isolation between populations (e.g., Coyne and Orr 2004) and genome-wide signatures of selection (Kolaczkowski et al. 2011;Langley et al. 2012;Reinhardt et al. 2014;Garud et al. 2015), comparatively little work has focused on adaptive outcomes resulting from admixture in D. melanogaster. As in other species, pesticides are a major driver of selection, and resistance factors can quickly spread in populations when pesticides are introduced, either from standing genetic variation or from de novo mutations (Karasov et al. 2010;Garud et al. 2015). In D. melanogaster, specific alleles of several different genes are known to confer resistance to common pesticides. For instance, alleles of several Cyp6 Cytochrome P450 genes are implicated in DDT resistance (Daborn et al. 2002;Schmidt et al. 2017). Similarly, alleles of the gene acetylcholinesterase (Ace) can confer resistance to organophosphate pesticides (Aldridge 1950). Such alleles have been shown to quickly increase in frequency in populations exposed to pesticides (Daborn et al. 2002;Menozzi et al. 2004;Karasov et al. 2010) and in the case of Ace and Cyp6g1, the resistant alleles are thought to have arisen de novo on multiple distinct haplotypes in cosmopolitan populations during adaptation and are therefore often cited as examples of soft sweeps (Karasov et al. 2010;Garud et al. 2015). These results have also been argued to show that adaptation in D. melanogaster is not limited by de novo mutations (Karasov et al. 2010), but little is known about how the balance of de novo mutations and gene flow has shaped current day patterns of pesticide resistance.
Adaptive introgression results in characteristic genomic signatures that are distinct from both those of neutral introgression and those of classical models of natural selection at the molecular level. First, adaptively introgressed alleles will typically exceed the baseline introgression fraction ( fig. 1A). Second, because adaptive haplotypes increase quickly in frequency, the surrounding segments of nonrecombined ancestry is expected to be longer than under a neutral model (Shchur et al. 2020). To a first approximation, these patterns are qualitatively similar to classical models of selective sweeps. However, because introgressing haplotypes are genetically distinct and selected alleles are introduced at moderate starting frequencies, the characteristics of genetic variation associated with alleles contributed by adaptive introgression differs substantially (Fraïsse et al. 2014;Racimo et al. 2015;Shchur et al. 2020). Moreover, even neutral admixture affects haplotype patterns, confounding direct quantification of selective coefficients using conventional techniques for selection in single populations (Lohmueller et al. 2011;Racimo et al. 2015). Accurate detection and quantification of adaptive introgression therefore cannot rely on many of the rich and detailed models of adaptive evolution (reviewed for instance in Pavlidis and Alachiotis 2017) and remains a fundamental challenge in evolutionary genomics (Racimo et al. 2015).
An identifying characteristic of adaptively introgressed alleles is that they reach higher frequencies than neutral alleles (Hedrick 2013;Racimo et al. 2015;Suarez-Gonzalez et al. 2018). A first step in specifically searching for adaptive introgression is therefore to infer the ancestry frequencies of admixed samples locally across the genome, which is typically accomplished using hidden Markov models (HMM) (Falush et al. 2003;Sankararaman et al. 2008;Baran et al. 2012;Maples et al. 2013). By identifying loci with unusually high proportions of introgressing ancestry within admixed populations, it is sometimes possible to detect signatures of adaptive introgression (Racimo et al. 2015). However, these approaches typically require tailor-made methods for identifying local ancestry outliers consistent with selection. Moreover, as described above, natural selection itself shapes the resulting ancestry tract length distribution, and a more general and powerful method for detecting and quantifying adaptive introgression could explicitly model the consequences of adaptive introgression during local ancestry inference (LAI).
Recently, a few software packages for detecting adaptive introgression have been released. Genomattn (Gower et al. 2020) uses convolutional neural networks to perform this task, and VolcanoFinder (Setter et al. 2020), can infer adaptively introgressed loci from patterns of elevated heterozygosity surrounding the introgressed allele. VolcanoFinder is intended for mutations that were introduced from a highly divergent population and then went to fixation some time in the past. Although it has the advantage of not using data from the donor population, making it applicable in human genetics to detect introgression from unsampled and now extinct hominins, it is less suitable for detecting more recently introgressed adaptive alleles that have still not gone to fixation, or alleles introgressed from a closely related population. The objective of this article is to develop a method applicable to segregating alleles, from possibly highly related populations, when reference data from the donor population are available.
We have previously developed a method for LAI named Ancestry_HMM (Corbett-Detig and Nielsen 2017; Medina et al. 2018). Briefly, our approach uses an HMM framework to perform LAI using genomic samples from an admixed focal population and two unadmixed, ancestral reference populations. By assuming a neutral admixture model (Liang and Nielsen 2014), our approach can infer both the timing of multiple admixture pulses and local ancestry patterns across the genome based on low coverage data from unphased diploid samples or samples of arbitrary ploidy, including data generated through pooled sequencing strategies. Because of the generality of this framework, it is both possible and appealing to expand this approach to explicitly model and search for the contributions of adaptive introgression to patterns of local ancestry within samples from admixed populations.
Here, we introduce a novel method called Ancestry_HMM-S (AHMM-S, where S stands for selection) to explicitly model the impacts of natural selection during admixture. Our approach enables the detection of adaptive introgression and estimation of the strength of selection acting on individual loci. We validate this approach through extensive forward simulations, demonstrating that AHMM-S is robust under many plausible scenarios of selection during admixture. We use AHMM-S to analyze a genomic data set of admixed samples of D. melanogaster from a population in South Africa, where we identify several loci that show signatures of adaptive introgression. Our results show that selection has driven cosmopolitan haplotypes carrying insecticide resistant alleles to high frequencies, likely a result of the application of chemical insecticides in South Africa.

Expected Patterns of Ancestry Transitions during Adaptive Introgression
We began by modifying Ancestry_HMM to estimate the likelihood of adaptive introgression from haplotype patterns surrounding a candidate locus. We did this by adapting a framework for calculating the expected lengths of haplotypes carrying an adaptively introgressed allele that we have previously developed (Shchur et al. 2020), and implemented a fast method for calculating the corresponding transition rates that are used in the HMM. Briefly, we assume a single discrete admixture event, a "one-pulse" model, that took place t generations prior to the time of sampling. The probability of the ancestry states in a HMM (emissions probabilities) is not affected by selection and is unchanged from Ancestry_HMM and we therefore refer readers to previous works for details on how these probabilities are calculated (Corbett-Detig and Nielsen 2017; Medina et al. 2018).
However, in order to take selection into account, we must update the transition probabilities to reflect the expected increased frequency of the selected site relative to background levels (i.e., the initial admixture fraction) ( fig. 1B). We do this by modeling the increase in frequency of an additively adaptive allele using a familiar logistic deterministic approximation (Kaplan et al. 1989), as well as the decay of the introgressed haplotypes surrounding the locus through recombination (Shchur et al. 2020) (fig. 1C). By optimizing this model at regular intervals along a chromosome and comparing these results to neutral models, we can detect loci that experience adaptive introgression and quantify the strength of selection that has acted on these sites ( fig. 1D).

Model Evaluation and Validation
In order to validate AHMM-S, we performed forward simulations of selection during admixture. In brief, we simulated admixed populations of diploid individuals which received an introgressive pulse from a second population carrying an Following an admixture event, recombination will break up introgressed haplotypes. In the absence of selection, the frequency of the introgressed genotype (red regions) is expected to remain at a constant low level and haplotype lengths are expected to be short. If positive selection is acting on an introgressed locus (yellow star), the genotype frequency is expected to be higher, and the haplotype lengths larger. (B) Cartoon of two scenarios of adaptive introgression. The top panel shows the genotype frequencies of a (blue) 1% introgression pulse of an adaptive locus with weaker selection at position 0, sampled after 400 generations, and (orange) a 10% introgression pulse with stronger selection sampled after 200 generations. The bottom panel shows the corresponding transition rates. In general, larger introgression pulses correspond with high baseline transition rates (i.e., the probability of recombination event between the two genotypes) and stronger selection with bigger haplotype blocks and dips in transition rates in a wider region surrounding the selected site. (C) An example of simulated population with a 1% introgression pulse and a selected allele with s ¼ 0.05 at position 5,000,000. The frequency of the introgressed genotype is shown in red, and the expected transition rate of the selected site in black (in transitions from the introgressed genotype to the receiving genotype per Morgan). (D) Likelihood surface of simulated chromosome. Adaptive introgression was inferred using AHMM-S for values of s from 0.001 to 0.15 at every 10 sites along the chromosome, for the same simulation as shown in C. The likelihood ratio for each unique combination of site and s is plotted. The red cross marks the position and selective coefficient used in the simulation.
adaptive allele, t generations prior to sampling. We simulated low coverage, short-read allele counts for 25 diploid individuals, in order to represent a realistic and modest sampling strategy. The simulated reads were conditional on a fixed error rate and their known genotype at each site. We varied the admixture fraction m between 0.01 and 0.5 and the selective coefficient s of the adaptive allele from 0 to 0.1, and we sampled the population at steps from 50 to 1,000 generations. We estimated a selective strength at each locus in our simulated data set and identified the selected site as the site with the highest likelihood ratio across the chromosome, as might typically be done when searching for adaptive introgression in real data sets. Our simulations cover an extremely broad range of parameter spaces, and whereas AHMM-S performs generally well, we were also able to define several important limitations ( fig. 2). At lower levels of selection and shorter time periods since introgression, it is difficult to identify the position of the adaptively introgressed locus. The mean distance to the locus is on the order of 1 Mb, suggesting that the inferred locus is generally incorrect and that outlier values in likelihood ratio is mostly caused by noise. This is consistent with likelihood ratios under these conditions being low (<50), as well as the inferred selective coefficients being too high. This phenomenon is likely caused by there being little difference in genotype frequency of the selected locus compared with the other neutrally segregating introgressed loci. Since our method depends on there being a difference in genotype frequencies, and correspondingly transition rates, it has low power to distinguish adaptive introgression from noise under these conditions. Nonetheless, the method performs well in general for a range in the plausible parameter space, and we find that for populations sampled 300 or more generations after admixture with moderate selection (s ¼ 0.01), the model accurately estimates both the position of the selected site and the strength of selection. In cases of stronger selection (s > 0.05), the inferred position and estimated selective coefficient are close to the real value already at 50 or 100 generations. Under favorable conditions, we are able to identify the correct position within a few kb, which is close to the limit of resolution in our simulations, which are based on SNP densities found in D. melanogaster. The inferred selective coefficients are also close to the real value under these conditions, and the error is within 20%.
Overall, AHMM-S works best when selection is strong and/ or when selected sites have reached frequencies that are significantly higher than baseline introgression levels. These are conditions that are most likely to be associated with important phenotypic changes and are for this reason of greatest interest to biologists. Nevertheless, weakly selected sites may reach high frequencies over longer periods of time. Unfortunately, this also means that recombination will have time to break diagnostic haplotype patterns apart, decreasing the ability to quantify the relevant selective coefficient. This pattern is expected over time scales beyond what we validate for here and that is much longer than is suitable for most LAI applications.

Effects of Algorithm for Calculating Expected Transition Rates
We evaluated the simulated scenarios above using the fourpoint analytical approximation of the expected transition rates. This method calculates expected transition rates for a given site and a given selective coefficient based on just four sites spaced along the chromosome and then interpolates the rates for all other sites, but we also evaluated the performance of a slower forward iteration-based approach, where the expected transition rate is instead calculated between each pair of loci along a chromosome for a given model of selection. We applied the same simulation strategy for a subset of population parameters using the forward iteration algorithm and inferred adaptive introgression (supplementary fig. S1, Supplementary Material online). Both methods perform similarly well across a range of input parameters, and we therefore decided to only use the four-point approximative method for further analysis.

Effects of Sample Size and Sequencing Approach
We tested the effects of sample size by increasing the number of individuals in the set of simulated reads from 25 to 75 ( fig. 3). Although a larger sample size improved the estimates of both the selective coefficient and the location of the locus, the effects on the estimated selection coefficient was generally small. Therefore researchers should use the largest feasible sample size when studying adaptive introgression using this method, although the primary results from the modest sample sizes suggest that this method is applicable for all species but those that are most challenging to sample in high numbers.
We also investigated the performance of AHMM-S when using pooled sequencing data. A subset of the simulated populations were converted to pooled reads (instead of reads separated by individual), and adaptive introgression was then inferred ( fig. 3B). The estimated strength of selection is close to the real value, and errors ranging from 5% to 25%, which is comparable to the values generated when using individually sequenced samples instead. The accuracy of the position estimate is worse than for individually sequenced samples, though it also improves with stronger selection and longer time since admixture. We therefore suggest using individually barcoded and sequenced samples, but pooling may provide an economic tradeoff if accurately mapping the specific selected site is not a primary concern.
Generating a Null Model AHMM-S performs a likelihood test for each site (or a subset of sites) on a chromosome, and for relatively densely spaced markers this implies a substantial number of hypotheses are tested in a single run of the program. As many of the sites will be in genetic linkage with each other, these tests are not independent. Furthermore, since linkage is expected to decay over time, identifying a cutoff for statistical significance is difficult and to some degree depends on the time since admixture. For this reason we recommend performing simulations of a neutral introgression scenario with similar population parameters as the data set of interest. The distribution of likelihood ratio scores generated in these simulations can then provide a null model for variation in likelihood scores under a neutral admixture model.

Computational Performance
The computational performance of AHMM-S is influenced by the number of sites, the number of samples and the type of algorithm used for approximating the trajectory of a selected allele. A data set with 20,000 sites and 25 diploid individuals takes $2 h using a single thread of an Intel i7-8550U CPU when using the 4-point approximation method for computing the transition rates and 6.5 h for the same data set when using the forward iteration method to obtain transition rates. Memory usage is similar for both methods with 120-140 Mb used for both algorithms. When using a larger data set consisting of 75 individuals, computation time is $4 times longer, and memory usage rises to 700 Mb.

Robustness to Parameter Misspecification
AHMM-S assumes knowledge of several demographic parameters, including the time of introgression, the admixture fraction and the effective size of the admixed population. In practice, these must also be estimated from the data and the true parameters cannot be known without some uncertainty. We have previously shown that LAI using Ancestry_HMM is neither strongly affected by parameter misspecification nor the presence of selection (Corbett-Detig and Nielsen 2017), but that the estimated times since introgression were. We therefore evaluated the robustness of estimates of selective strength by AHMM-S and the consequences of poorly estimated parameter values, by intentionally misspecifying the necessary parameters on a subset of the simulations used for validation. Although misspecifying population size has little effect on the final estimated selective coefficient, both time since introgression and the admixture fraction can skew the estimate (supplementary fig. S2, Supplementary Material online). Even so, if the specified values are within 20% of the true values, the errors in estimation of s are within 40%.
In general, it is straightforward to accurately estimate the overall admixture fraction using a range of approaches (for instance Pritchard et al. 2000;Alexander et al. 2009) and our previous work has shown the time of admixture can be approximated well using approaches that we (Corbett-Detig and Nielsen 2017; Medina et al. 2018) and others (Pool and Nielsen 2009;Gravel 2012;Loh et al. 2013) have developed, even with moderate impacts of natural selection. In contrast, the effective sizes of admixed populations may be challenging to accurately infer. However misspecification of the effective population size has only a minor impact on estimates of selection obtained using this method, suggesting that our approach is robust even to substantial uncertainty regarding the effective population size. Effects of Continuous Gene Flow, Going to Fixation, Dominant/Recessive Selection, and Segregation in the Donor Population AHMM-S uses a simple model for adaptive introgression, which assumes 1) a single pulse of introgression, 2) that the selected allele is fixed in the donor population, and 3) that selection is additive for the adaptively introgressed allele. We tested the performance when violating these assumptions by simulating populations where there was either continuous gene flow for 20-100 generations, where the selected locus was segregating at 50% in one ancestral population or where the selected allele was either dominant or recessive. The results are summarized in supplementary figures S3-S5, Supplementary Material online. In general, AHMM-S is able to identify adaptive introgression in these cases, but with somewhat reduced precision compared with when the population model is not violated.
A further limitation of AHMM-S is that it is not capable of estimating the selective coefficient when the selected allele has gone to fixation. This is caused by the method that is used to calculate the expected transition rates (supplementary fig.  S6, Supplementary Material online). In such a scenario, the program is still generally capable of identifying the location of the adaptively introgressed site, but the reported value of s will be lower than the true value. As it is possible to quantify the local ancestry along the chromosome with the software package on which AHMM-S is based (Corbett-Detig and Nielsen 2017), it should be easy to identify such cases.
The Effects of a Small Introgression Fraction AHMM-S calculates expected transition rates based on a logistic function for calculating the allele frequency trajectory of the selected site. As noted in Shchur et al. (2020), this logistic function is only a good approximation of the trajectory when the admixture pulse is large. Shchur et al. developed a simple stochastic method based on multiple forward simulations for estimating the trajectory when the admixture pulse is small, and we implemented it in AHMM-S. We compared the logistic and the stochastic methods for simulations of populations with m ¼ 0.001 and m ¼ 0.0001, where we conditioned the simulations on the selected allele not being lost by drift. In our simulations (supplementary fig. S7, Supplementary Material online), we see little effect of a small m on our estimates, and no large difference between the logistic and stochastic methods for approximating the trajectory of the adaptive allele. As the logistic approximation is significantly faster, we would recommend its use even in cases where the initial admixture pulse is very small.

Linkage between Multiple Selected Sites
Multiple selected sites can be located near each other, and we examined how AHMM-S can handle such a scenario and how it affects inference of selective coefficients. We ran simulations where two positively selected sites were placed at varying distances from each other, ranging from 0.1 to 5 cM, and inferred adaptive introgression (supplementary fig. S8, Supplementary Material online). When the two sites are located in close proximity (0.1 cM), AHMM-S will typically produce a single peak for both the likelihood ratio and the inferred selection. The inferred selection coefficient is close to the sum of s for each site, as would be expected for additive selection. When the sites are placed at increasingly large distances, individual peaks are distinguishable from 1 to 2 cM, and the additive effect is diminished as the distance increases. Care must still be taken though, when interpreting the inferred selective coefficients in such cases as they are likely to be overestimations reflecting in part the joint effect of two sites.
It is also possible that negatively selected loci affect the inference of adaptive introgression at linked sites. Purging of weakly deleterious alleles following admixture has been suggested in a number of systems (Harris and Nielsen 2016;Kim et al. 2018;Meiklejohn et al. 2018) often causing large changes in introgression patterns along the genome. Although we do not consider this effect here, we expect that patterns of large peaks of introgressing ancestry are unlikely to occur under a model of purely weakly deleterious variation and are unlikely to dramatically affect inferences using this method.
Detecting Negative Selection AHMM-S can detect negatively selected sites as well by treating negative selection of the introgressed genotype as positive selection of the receiving genotype (defined as the less common genotype and more common genotype respectively). To test how well this works, we ran a smaller set of simulations (s was set to -0.05 and m to either 0.1 or 0.01) where a single introgressed allele experienced negative selection following admixture. Supplementary figure S9, Supplementary Material online shows the case where m is 0.1, and whereas we can easily detect the position of the negatively selected locus, the introgressed allele will quickly be lost, which leads to an underestimate of the selective coefficient. For this reason, we expect that it will often be challenging to detect negatively selected alleles with small introgression fractions. However, if selection is weak and the introgression pulse is large the allele will segregate for a significant time and AHMM-S will be able to generate a reasonably accurate estimate of the selective coefficient.

Suitability for Detecting Dobzhansky-Muller Incompatibilities
A specific example of negative selection that is expected to be common following admixture between species and distantly related populations are Dobzhansky-Muller incompatibilities (DMIs) (Coyne and Orr 2004) and there has also been considerable interest in identifying DMIs within admixed populations (Corbett-Detig et al. 2013;Schumer et al. 2014;Pool 2015;Powell et al. 2020). DMIs are caused by complex epistatic interactions between at least two different loci and we therefore evaluated how our approach might cope with such scenarios. As a proof of concept, we simulated several twolocus DMI scenarios and used AHMM-S to identify sites which show signs of selection. We found that our method consistently identifies selected sites with high accuracy, that is, Svedberg et al. . doi:10.1093/molbev/msab014 selected loci are detected within 8 kb. As expected, due to the conditional nature of selection against DMI loci, estimated selective coefficients are typically small relative to a single locus model, with estimates 50-90% lower than the actual value (supplementary fig. S10, Supplementary Material online). Our approach may therefore be applicable for detecting DMI's in addition to adaptive alleles. However, we caution that without additional evidence (e.g., linkage disequilibrium in admixed samples) or experimentation to demonstrate epistatic selection, it will not typically be possible to distinguish between DMI's and strong single locus selection based purely on the results from our program. If possible, putatively selected sites should be biologically interrogated to identify specific likely modes of selection. Furthermore, parameters, such as the distance between the incompatible loci, the size and timing on the admixture pulse can affect the strength and direction of selection on incompatible loci, and comparing real data to simulations with similar parameters may also provide increased confidence in the presence of DMIs.

Adaptive Introgression in D. melanogaster
In order to test AHMM-S on real data, we selected a population sample of D. melanogaster from South Africa that has shown signals of admixture in previous studies (Lack et al. 2015;Corbett-Detig and Nielsen 2017;Medina et al. 2018). This data set is moderately sized (n ¼ 81), the admixture history is approximately consistent with a one-pulse admixture model, and the previously estimated time since admixture (m ¼ 0.17, t ¼ 430 generations) suggests that this population is ideally suited for testing our approach (Corbett-Detig and Nielsen 2017; Medina et al. 2018). First we performed simulations of neutral admixture in a similar population, to determine the null model against which we test for adaptive introgression. We identified likelihood ratio outlier peaks (Materials and Methods) and then determined the likelihood ratio threshold that would generate on average a single false discovery per genome. In our case, this threshold is 15; that is, we expect one likelihood ratio outlier above 15 under a neutral model.
We then applied our method to this population, where we observed highly variable patterns of adaptive introgression across the genome. Specifically, we identified one locus on chromosome 2L, three on 2R, 13 on 3R, and none on 3L and X as putative targets of selection following admixture, with selection coefficients ranging from 0.0046 to 0.0115 ( fig. 4 and table 1). We therefore find evidence for moderately strong fitness effects associated with introgressing cosmopolitan ancestry in the focal population. The inferred selective coefficients are consistent with our expectations given the relatively short time since admixture, in which selection would need to be relatively strong to drive alleles to moderate frequencies.
Selection for resistance to commonly used insecticides might underlie many of the signatures of adaptive introgression that we observe. Several candidate loci are located close to genes known to be associated with resistance, such as the three loci on 2R, which are all located within 5 kb from Cyp6 Cytochrome 450 genes. Two of the most prominent likelihood ratio outliers are located next to Cyp6g1 and Cyp6w1 respectively, and allelic variants of these genes are specifically known to confer resistance to DDT exposure (Daborn et al. 2002;Schmidt et al. 2017). The third candidate locus is located close to a cluster of several Cyp6 genes, of which Cyp6a17 and Cyp6a23 have been shown to be associated with resistance to other insecticides (Battlay et al. 2018). Additionally, the candidate site with the highest likelihood ratio on chromosome 3R is located within the acetylcholinesterase (Ace) gene. Several different common alleles of Ace confer resistance to the large class of organophosphate insecticides and it is a locus that is known to be under selection when these insecticides are introduced (Karasov et al. 2010;Garud et al. 2015). Resistance to insecticides is a strong candidate phenotype driving adaptive introgression because DDT and most other insecticides were first applied for insect control in populations outside of Africa, where resistance originally evolved (Schmidt et al. 2017). DDT is still actively used in South Africa to control mosquito populations (Biscoe et al. 2005), and the country imports a wide range of other broad-spectrum insecticides (Quinn et al. 2011). Our results therefore strongly suggest that resistance to commonly used pesticides has been a primary driver of adaptive introgression in admixed populations of D. melanogaster.
In total, 17 loci are classified as potential candidates for adaptive introgression. We performed a gene ontology (GO) analysis of genes that were either spanning, or were located within 5 kb of the candidate locus (N ¼ 46). Two categories, "organic cyclic compound binding" and "heterocyclic compound binding" (both N ¼ 21) showed a significant enrichment (at q < 0.1) after correcting for false discovery rate (supplementary table S1, Supplementary Material online). We note that these GO categories contain many more genes in addition to Ace and Cyp6 genes, presenting the possibility that our method may have identified genes that contribute to insecticide resistance in nature and that were not previously known (table 1 and supplementary table S1, Supplementary Material online). For instance, of the other potential adaptively introgressed genes, lncRNA:Hsrx has been shown to provide some protection against insecticides in laboratory experiments (Chowdhuri et al. 2001), but it has not been identified in selection scans. As each candidate locus is located close to several genes, further functional work is necessary to determine exactly which genes are driving the signatures of selection and their specific phenotypic effects.
Our findings here mirror earlier studies on selection in D. melanogaster, and two of our most distinct outliers, Ace and Cyp6g1, have also been strong outliers in selection scans presented in previous papers (e.g., Karasov et al. 2010;Garud et al. 2015). In the case of Ace, several different alleles which confer resistance to insecticides have been found at high frequencies in cosmopolitan populations, and Karasov et al. argues that this suggests that adaptation in this species is not limited by the mutation rate. These mutations are thought to have appeared in cosmopolitan populations and loci showing signs of adaptive introgression in our data set is consistent with this idea, especially since the majority of D. melanogaster genetic diversity is found in Africa (Begun and Aquadro 1992;Thornton and Andolfatto 2006;Pool et al. 2012). On the other hand, this suggests that adaptation to pesticides is not driven by additional de novo mutations in our data set, but instead it is shaped in large part by introgression.
In Garud et al., three genes showed the strongest signals for recent strong selection. Besides Ace and Cyp6g1, which we find here, CHKov1 was also a major outlier. A transposon insertion inside CHKov1 is associated with resistance to both infection by the sigma virus and to organophosphate pesticides (Aminetzach et al. 2005;Magwire et al. 2011). We do not find CHKov1 among our set of genes, but it is located $60 kb from one candidate peak (on chromosome 3R position 25262165). It is possible that we were not able to identify the exact position of this gene with sufficient precision, but simulations above suggest that we can accurately map loci with this strength of selection. It is also possible that cosmopolitan alleles at this locus are not selected within our focal population. Consistent with this idea, CHKov1 is thought to have spread from standing variation that was present within the ancestral African populations (Aminetzach et al. 2005;Magwire et al. 2011). Our results are therefore concordant with expectations from known geographic distributions of strongly selected insecticide resistance loci and further support the idea that resistance to insecticides has been an important driver of adaptive introgression.

Conclusions
Generalized tools for inferring and mapping adaptive introgression and estimating the strength of selection from genomic data have long been lacking. Here, we provide an approach for this problem that is well-suited for detecting adaptive introgression in commonly generated data sets. AHMM-S can both identify the locations of adaptive introgression genes and infer their selective coefficients. It is robust over a range of introgressive scenarios, and especially in cases where an adaptively introgressed gene has strongly increased in frequency. Such scenarios are especially important, as the loci with the most severe shift in genotype frequencies are more likely to be important for understanding adaptation. Many previous studies have addressed the question of adaptive introgression by tailoring analysis methods to the specific data at hand (for instance Sankararaman et al. 2014;Vernot and Akey 2014), but AHMM-S provides a more general solution and enables searches for adaptive gene flow over a large range of eukaryotic species. AHMM-S is based on a simple model of adaptive introgression with positive selection of a single locus following a single introgression pulse. Despite this it can still identify adaptive introgression with reasonable precision from more complex scenarios, such as continuous gene flow. To improve the performance in the future, these phenomena could be explicitly modeled when generating the expected transition rates. We could model negative background selection around a positively selected site, or different selective regimes, such as balancing or conditional selection, to improve our estimates of selective strengths. In the case of balancing selection, we would currently expect to be able to identify the selected locus (given the introgression pulse is different from the equilibrium frequency of the selected site), but it would be interpreted as a positively selected site. In order to infer the exact nature of the adaptive evolution that has taken place for more complex scenarios, such as balancing selection or epistatic interactions, it will most likely be easiest to generate expected haplotype patterns and transition rates through simulations, rather than analytical approximations.
In this work, we also use AHMM-S to investigate possible adaptive introgression in D. melanogaster in South Africa. Several of the loci we identify are associated both with pesticide resistance and with strong selection in cosmopolitan populations, suggesting that pesticide use has been a major driver of selection both in sub-Saharan Africa and elsewhere. Although some of these resistant alleles are thought to have evolved through de novo mutations in cosmopolitan populations, here, we find that they have been introduced through introgression in South Africa. This positions admixture to be an important factor to consider when studying the causes of adaptation and selection within species and populations.

Materials and Methods
We developed an approach based on an adaptation of Ancestry_HMM (Corbett-Detig and Nielsen 2017; Medina et al. 2018) that allows one to infer adaptive introgression by implementing a model for tract length distributions surrounding an adaptively introgressed locus (Shchur et al. 2020). This allows us to calculate expected transition rates between the two genotypes at a given distance from a locus of interest, which then can be used in a HMM that can calculate a likelihood score for a particular value of s at a particular site. We assume a single discrete admixture event, a "one-pulse" model, that took place t generations prior to the time of sampling. Therefore, the state space of the HMM is all possible counts of chromosomes of ancestry type one given the number of chromosomes, or ploidy, of a sample (e.g., for a diploid, H ¼ f0,1,2g). The probability of the ancestry states at a given site in an admixed genome is unchanged in our modified framework (emissions probabilities, see Corbett-Detig and Nielsen 2017;Medina et al. 2018). However, to incorporate natural selection, we must update the transition probabilities to reflect the increased frequency of the selected site relative to background level. We define a three locus coalescent process where one site denotes an introgressing allele experiencing additive selection. The other two sites trace the ancestry linked to that site. For example, at the time of admixture, the only possible three locus haplotypes are 0*-0-0 and 1*-1-1, where * denotes the selected locus. By tracking the frequencies of recombinant haplotypes, that is, chromosomes in which the two linked sites correspond to different ancestry states (0*-1-0, 0*-0-1, 1*-1-0, 1*-0-1), we can define an ancestry transition model along the chromosome in the regions adjacent to the selected site. Given a data set with known recombination rates between each site, AHMM-S can then generate expected transition rates going away from that site in each direction, either by using a forward iteration strategy, where the transition rates between adjacent sites are calculated for each site along a chromosome, or through an approximative method which can interpolate transition rates based on just four sites.
When running AHMM-S, a single chromosome data set is specified, together with the population size N, introgression fraction m and time since introgression t, which all need to have been estimated previously. AHMM-S will then estimate a likelihood ratio for a specific site p and a specific selective coefficient s compared with the neutral case for that site. AHMM-S will loop over all sites, or a subset of sites, and can then either calculate the likelihood ratio in a grid for a defined set of values of s, or find the value of s that gives the highest likelihood for each site, by using golden section search. The size of window used in the HMM can be specified by the user, as can the number of sites that will be analyzed.

4-Point Approximation
The transition rates between ancestries of different types are functions of recombination distance r from the site under selection. In particular, the transition rate f 10 ðrÞ from ancestry type 1 to ancestry type 0 is a monotonously growing function with a finite limit at infinity which is equal to the transition rate under neutrality. Similarly, the transition rate f 01 ðrÞ is a monotonously decreasing function. We will search for an approximate solution of these functions in a form of b f 10 r ð Þ ¼ L À ke Àar p ; where coefficients L; k; a and p can be calculated as follows.
Let r 1 and r 2 be two points such that 0 < r 1 < 1. We can estimate transition rates numerically for r ¼ 0, r 1 and r 2 .
Inferring Adaptive Introgression Using HMMs . doi:10.1093/molbev/msab014 In order to be informative, r 2 can be set to the value of expected tract length under neutrality, and we set r 1 ¼ r 2 =10. The expected tract length under neutrality can be calculated analytically under SMC' model (Marjoram and Wall 2006). Using the formula derived by Liang and Nielsen (2014), we set where N e is the effective size of the admixed population, m is the admixture fraction and t is the time of introgression.
Given these values, we can calculate the coefficients of the function b f 10 r ð Þ. L is the transition rate for the neutrality (s ¼ 0), because lim r!1 b f 10 ¼ L. Again, following (Liang and Nielsen 2014), the neutral transition rate is given by the following formula L ¼ 2N e ð1 À mÞð1 À e Àt=2N e Þ: In order to find the last two parameters a and p, there are two more equations corresponding to r ¼ r 1 and r ¼ r 2 . Denote for i ¼ 1; 2. Then p ¼ logðy 1 =y 2 Þ logðr 1 =r 2 Þ ; and a ¼ À y 1 r p 1 : We validated this approximation by comparing with simulations and with "forward time" approximation, see supplementary figure S11, Supplementary Material online for an example. In all the considered scenarios, our approximation turned out to be very precise.

Simulations
We validated the method using extensive simulations of populations of 100,000 diploid individuals over a range of parameter values. We varied the selective coefficient s from 0 to 0.1 (0, 0.001, 0.01, 0.05, 0.1) and the admixture fraction m from 0.01 to 0.5 (0.01, 0.05, 0.1, 0.2, 0.5). We let the simulations run to t ¼ 1,000 generations or 0.99 frequency of the selected allele, whichever came sooner, and we drew samples from the admixed population at 50, 100, 200, 300, 500, and 1,000 generations. Twenty simulations were run for each parameter set and 25-75 diploid individuals were sampled per time point per simulation. The selective coefficient s is specified for the diploid case, and selection acts additively, meaning that a heterozygous individual experiences half the selective strength.
Simulations were performed using SELAM (Corbett-Detig and Jones 2016), which is a forward simulator that records the full local ancestry across the genome. We then converted the haplotype information generated by SELAM to simulated genotypes representing the reference panels of the two ancestral populations, using the results of a coalescent simulation consistent with the evolutionary history of ancestral D. melanogaster populations (following Pool et al. 2012;Corbett-Detig and Nielsen 2017). Specifically, we used the SMC' coalescent simulator, MaCS (Chen et al. 2008) with the following command line: $macs 200 10000000 -i 1 -h 1000 -t 0.0376 -r 0.171 -c 5 86.5 -I 2 100 100 0 -en 0 2 0.183 -en 0.0037281 2 0.000377 -en 0.00381 2 1 -ej 0.00382 2 1 -eN 0.0145 0.2 -s 1 To construct admixed individuals, we applied genotypes obtained from each MaCS-simulated population to fill in genotype data on the appropriate chromosomes for each ancestry type. Pairs of chromosomes were joined to create single diploid individuals, and then we simulated short-read pileup data for each variable site by drawing the depth from a Poisson distribution with mean 2 and the resulting alleles from a binomial distribution with p equal to the genotype frequency in the individual sampled at that site and applying a uniform error rate of 0.01 per allele per site.
We emphasize that users interested in evaluating the utility of this or our previous software packages can produce similar simulations tailored to their specific evolutionary models or sequencing and sampling schemes by applying our method using the LAI simulation packages released in Schumer et al. (2020).
We then analyzed the simulated reads with AHMM-S, and specified the values of N, m, and t that were used in each simulation. We analyzed either 25 or 75 diploid individuals, or 50 pooled chromosomes, depending on experiment. We used the 4-point approximative method for calculating transition rates, and included sites in the HMM in a window extending 10% of the chromosome length in each direction of the focal site. We used the golden section search algorithm and extracted the site that had the highest likelihood ratio value as the inferred adaptively introgressed site. For the pooled samples, only every 100 sites were analyzed, in order to reduce computational time.

Robustness
The robustness of AHMM-S against misspecification of input parameter values was evaluated using a subset of the same simulations that were generated for validation. When running AHMM-S, we varied population size (10,000 and 1,000,000 instead of 100,000), introgression fraction m and introgression time t, setting them to 10% and 20% above and below the true value. Each parameter set was run in 20 replicates.

Simulations for Testing the Effects of Model Violations
We performed simulations to test the effects of continuous gene flow, segregation in donor population, dominance and fixation in focal population on performance of AHMM-S in Svedberg et al. . doi:10.1093/molbev/msab014 estimating position and selective coefficient. All simulations were performed in 20 replicates. Introgression pulse was set to 0.01 or 0.1, selective coefficient to 0.01 or 0.05. The simulations were performed as described above, with the following differences: for continuous gene flow, we set an initial introgression pulse to 0.01 or 0.1, and then allowed for 1% gene flow for 20 generations or 0.2% gene flow for 100 generations. When simulating an adaptive allele segregating in the donor population, we set the allele frequency to 50% at the time of introgression. To simulate dominant or recessive selection, we modified the SELAM selection file to change selection to work in a completely dominant or recessive way.

Simulating a Small Admixture Pulse
When running simulations with a small admixture pulse (m ¼ 0.001 or 0.0001), there is a strong risk that the adaptively introgressed site would be lost to drift. In order to counteract this, we condition our simulations on not being lost, by restarting SELAM if the site was lost, until we had 20 replicates for each set of parameters. After this the AHMM-S pipeline was ran as usual.

Multiple Selected Sites
We tested the ability of AHMM-S to distinguish adaptively introgressed sites that are located near each other by running simulations as described above, but with two selected sites with equal selective strengths at 0.1, 1, 2, and 5 cM distance from each other. Three parameter sets were used: introgression fraction m ¼ 0.01, selection s ¼ 0.025 and time t ¼ 500 generations; m ¼ 0.1, s ¼ 0.005 and t ¼ 200; and m ¼ 0.17, s ¼ 0.005 and t ¼ 430. The last parameter set corresponds to the population of D. melanogaster analyzed in this study. We then analyzed the simulations with AHMM-S and plotted the inferred selective coefficients and likelihood ratios to determine peak separation.

Detecting Negative Selection
Simulations of introgression followed by negative selection of an introgressed allele located in the middle of the chromosome were performed using SELAM as previously described. AHMM-S is designed to infer positive selection on the introgressed genotype, and in order to make it detect negative selection, we converted the genotype identity of the simulated individuals, so negative selection of the introgressed genotype is interpreted as positive selection of the receiving genotype. We then converted the simulation files to reads as before and analyzed it with AHMM-S.

Dobzhansky-Muller Incompatibilities
Simulations of diploid individuals were performed as described above for two different scenarios of DMIs. In both cases, two loci (A and B) had a negative DMI interaction. At each locus two alleles were present (A0 and A1, B0 and B1) where the integer denotes the ancestral population that contributed the relevant allele. In scenario one, an individual carrying any combination of alleles that were not from a single population (i.e., A0þB1, or A1þB0) had a reduced fitness (s < 1). This is an example of dominant sign epistasis.
In scenario two, only one specific combination (i.e., A0þB1 but not A1þB0) was selected against (s < 1). In both cases all other allelic combinations had s ¼ 1. We ran these scenarios for three values of s: -0.01, -0.05 and -0.1, with a 50% introgression pulse. The two sites were placed on the same chromosome at a distance of 40 cM, where linkage plays little role in governing outcomes. We then converted the simulations to reads as previously described and analyzed them using AHMM-S. Since DMIs cause negative selection at the interacting loci, and AHMM-S is only designed to handle positive selection, we changed genotype identities before converting the simulations, which allowed us to treat the negative selection at these loci as positive selection and bypass this limitation.

Drosophila melanogaster data
We applied AHMM-S to a publicly available data set of D. melanogaster from South Africa and Europe (Lack et al. 2015;Lack et al 2016). We extracted both homozygous and heterozygous regions for each sample and ran AHMM-S after supplying the appropriate ploidy for each (i.e., 1 if inbred, 2 if outbred) as we have done previously (Medina et al. 2018). We also removed any chromosome arm containing one of the common chromosomal inversions in this species (Corbett-Detig and Hartl 2012;Lack et al. 2016). As these assemblies are generally quite high quality, we used the genotype emissions function in AHMM to calculate the local ancestry across the genome (Corbett-Detig and Nielsen 2017). We also supplied a fine-scale recombination map for this species (Comeron et al. 2012).
Because the direction of introgression is known, that is, populations with cosmopolitan ancestry have recently backmigrated into Africa, we scanned specifically for adaptive introgression of cosmopolitan alleles into these predominantly African populations. We performed 50 simulations of a neutral admixture scenario with the same population parameters (m ¼ 0.17, t ¼ 430, N ¼ 100,000) and used these to determine a likelihood ratio cutoff that would produce an acceptable false positive rate. A likelihood ratio > 15 and filtering for proximity to other higher peaks produced 42 outliers above this threshold, corresponding to a false positive rate of $5% in the D. melanogaster data set. Using simulations, we also determined that we could distinguish selected sites separated by at least 2 cM (see section on multiple sites). A GO enrichment analysis was performed on the candidate loci using Gowinda (Kofler and Schlötterer 2012). We ran the program with default parameters and separately included only genes that either spanned the candidate locus or included genes located 5 kb upstream or downstream of the locus.

Supplementary Material
Supplementary data are available at Molecular Biology and Evolution online. and an award from the Alfred P. Sloan Foundation to R.B.C. R.N. and R.C.D. were funded within the framework of the HSE University Basic Research Program. V.S. was supported by grant RFBR 19-07-00515.

Data Availability
The source code of Ancestry_HMM-S can be downloaded from https://github.com/jesvedberg/Ancestry_HMM-S/. A user manual is also available at this location. No new data were generated for this study.