Estimating Temporally Variable Selection Intensity from Ancient DNA Data

Abstract Novel technologies for recovering DNA information from archaeological and historical specimens have made available an ever-increasing amount of temporally spaced genetic samples from natural populations. These genetic time series permit the direct assessment of patterns of temporal changes in allele frequencies and hold the promise of improving power for the inference of selection. Increased time resolution can further facilitate testing hypotheses regarding the drivers of past selection events such as the incidence of plant and animal domestication. However, studying past selection processes through ancient DNA (aDNA) still involves considerable obstacles such as postmortem damage, high fragmentation, low coverage, and small samples. To circumvent these challenges, we introduce a novel Bayesian framework for the inference of temporally variable selection based on genotype likelihoods instead of allele frequencies, thereby enabling us to model sample uncertainties resulting from the damage and fragmentation of aDNA molecules. Also, our approach permits the reconstruction of the underlying allele frequency trajectories of the population through time, which allows for a better understanding of the drivers of selection. We evaluate its performance through extensive simulations and demonstrate its utility with an application to the ancient horse samples genotyped at the loci for coat coloration. Our results reveal that incorporating sample uncertainties can further improve the inference of selection.


Introduction
A problem of long standing in population genetics is to understand evolutionary processes, including selection, in shaping the genetic composition of populations. Although the majority of these studies have focused on single time point polymorphism data, the use of temporally spaced genetic samples can provide valuable information on selection and demography Skoglund et al. 2014). The commonest data sources have been evolve and resequence studies combining experimental evolution under controlled laboratory or field mesocosm conditions with next-generation sequencing technology, which however are typically limited to the species with small evolutionary timescales (e.g., Turner and Miller 2012;Bosshard et al. 2017;Good et al. 2017). Recent advances in technologies for obtaining DNA molecules from ancient biological material have resulted in massive increases in time serial samples of segregating alleles from natural populations (e.g., Mathieson et al. 2015;Librado et al. 2017;Fages et al. 2019), which offer unprecedented opportunities to study the chronology and tempo of selection across evolutionary timescales (see Dehasque et al. 2020, for a review).
As the number of published ancient genomes is growing rapidly, a range of statistical methods that estimate selection coefficients and other population genetic parameters based on ancient DNA (aDNA) data have been introduced over the last 15 years (e.g., Bollback et al. 2008;Malaspinas et al. 2012;Mathieson and McVean 2013;Foll et al. 2014Foll et al. , 2015Steinrücken et al. 2014;Ferrer-Admetlla et al. 2016;Schraiber et al. 2016;Shim et al. 2016;He et al. 2020aHe et al. , 2020bMathieson 2020;Lyu et al. 2022). See Malaspinas (2016) for a review. Most existing methods are built upon the hidden Markov model (HMM) framework of Williamson and Slatkin (1999), where the latent population allele frequency is modeled as a hidden state following the Wright-Fisher model (Fisher 1922;Wright 1931), and MBE at each sampling time point, the observed sample allele frequency drawn from the underlying population is modeled as a noisy observation of the latent population allele frequency. To remain computationally feasible, these approaches usually work with the diffusion approximation of the Wright-Fisher model (e.g., Bollback et al. 2008;Malaspinas et al. 2012;Steinrücken et al. 2014;Ferrer-Admetlla et al. 2016;Schraiber et al. 2016;He et al. 2020aHe et al. , 2020bLyu et al. 2022). The diffusion approximation enables us to efficiently integrate over all possible underlying allele frequency trajectories, therefore producing substantial reductions in the computational cost of the likelihood calculation. These methods have already been successfully applied in aDNA studies (e.g., Ludwig et al. 2009;Sandoval-Castellanos et al. 2017;Ye et al. 2017;Wutke et al. 2018). Moment-based approximations of the Wright-Fisher model, as tractable alternatives, are commonly used in the approaches tailored to experimental evolution (e.g., Feder et al. 2014;Lacerda and Seoighe 2014;Terhorst et al. 2015;Paris et al. 2019) due to their poor performance for large evolutionary timescales (He et al. 2021).
Although the field of aDNA is experiencing an exponential increase in terms of the amount of available data, accompanied by an increase in available tailored statistical approaches, data quality remains a challenge due to postmortem damage, high fragmentation, low coverage, and small samples. To our knowledge, none of the existing methods allow for modeling the two main characteristics of aDNA, that is, the high error rate caused by the damage of aDNA molecules and the high missing rate resulting from the fragmentation of aDNA molecules, with the exception of Ferrer-Admetlla et al. (2016) and He et al. (2020a), which partially resolved this issue. Ferrer-Admetlla et al. (2016) incorporated genotype calling errors but excluded samples with missing genotypes whereas He et al. (2020a) allowed genotype missing calls but assumed no calling errors. Moreover, most existing approaches assume that the selection coefficient is fixed over time, which is commonly violated in aDNA studies. Most cases of adaptation in natural populations involve adaptation to ecological, environmental, and cultural shifts, where it is no longer appropriate for the selection coefficient to remain constant through time. Shim et al. (2016) extended WFABC (Foll et al. 2015) to detect and quantify changing selection coefficients from genetic time series but more suitable for the scenario of a short evolutionary timescale like experimental evolution due to computational efficiency. More recently, Mathieson (2020) introduced a novel method to infer selection and its strength and timing of changes from aDNA data. However, these methods still ignored genotype calling errors and missing calls.
To address these challenges, in this work, we introduce a novel Bayesian approach for estimating temporally variable selection intensity from aDNA data whereas modeling sample uncertainties arising from postmortem damage, high fragmentation, low coverage, and small samples. We test our method through extensive simulations, in particular when samples are sparsely distributed in time with small sizes and in poor quality (i.e., high missing rate and error rate). We reanalyze a data set of 201 ancient horse samples from Wutke et al. (2016) that were genotyped at the loci encoding coat coloration to illustrate the applicability of our approach on aDNA data.

New Approaches
To model sample uncertainties caused by the damage and fragmentation of aDNA molecules, unlike existing approaches, we base our selection inference on raw reads rather than called genotypes. Our method is built upon a novel two-layer HMM framework, where the first hidden layer characterizes the underlying frequency trajectory of the mutant allele in the population through the Wright-Fisher diffusion, the second hidden layer represents the unobserved genotype of the individual in the sample, and the third observed layer represents the data on aDNA sequences (see fig. 1 for the graphical representation of our two-layer HMM framework). Such an HMM framework enables us to work with genotype likelihoods as input rather than allele frequencies. Our posterior computation is carried out by applying the particle marginal Metropolis-Hastings (PMMH) algorithm introduced by Andrieu et al. (2010), which permits a joint update of the selection coefficient and population mutant allele frequency trajectory. The reconstruction of the underlying population allele frequency trajectories allows for a better understanding of the drivers of selection. Moreover, our approach provides a procedure for testing hypotheses about whether the shift in selection is linked to specific ecological, environmental, and cultural drivers. Further details on our method appear in Section "Materials and Methods".

Results
In this section, we test our approach through extensive simulations and show its applicability with the data on aDNA sequences from earlier studies of Ludwig et al. Estimating Temporally Variable Selection Intensity · https://doi.org/10.1093/molbev/msad008 MBE where they sequenced a total of 201 ancient horse samples for eight loci that determine horse coat coloration. In what follows, we base our selection inference on the maximum a posteriori (MAP) estimate and assume only a single event that might change selection.

Performance Evaluation
We run forward-in-time simulations of the Wright-Fisher model with selection (e.g., Durrett 2008) and assess the performance of our method with simulated data sets of genotype likelihoods. We let the only event that might change selection occur in generation 350, thereby taking the selection coefficient to be s(k) = s − for k < 350 otherwise s(k) = s + . We uniformly draw the selection coefficients s − and s + from [ − 0.05, 0.05] and pick the dominance parameter of h = 0.5 (i.e., assuming codominance). To mimic the demographic history of the horse population given in Der Sarkissian et al. (2015), we adopt a bottleneck demographic history, where the population size N(k) = 32, 000 for k < 200, N(k) = 8, 000 for 200 ≤ k < 400, and N(k) = 16, 000 for k ≥ 400. The initial population mutant allele frequency x 1 is uniformly drawn from [0.1, 0.9]. For clarity, we write down the procedure for generating the data set of genotype likelihoods: Repeat Step 1 until x K ∈ (0, 1): Step 1: Generate s − , s + , and x 1 : K .
We take the parameter α n,k g n,k to be ϕψ and the other two to be (1 − ϕ)ψ/2, where ϕ and ψ are the parameters introduced to control the quality of the simulated data set in terms of the missing rate and error rate with a common threshold for genotype calling (i.e., 10 times more likely, see Kim et al. 2011). By following this procedure, we simulate a total of 801 generations starting from generation 0 and draw a sample of 10 individuals every 40 generations, 210 sampled individuals in total (i.e., nearly the size of the ancient horse samples), in our simulation studies.
We run a group of simulations to assess how our method performs for different data qualities, where we vary the parameters ϕ ∈ {0.75, 0.85, 0.95} and ψ ∈ {0.5, 1.0}, respectively, giving rise to six possible combinations of the missing rate and error rate (see table 1 where the mean and standard deviation of missing rates and error rates are computed with 1,000 replicates for each combination). For each scenario listed in table 1, we consider nine possible combinations of the selection coefficients s − and s + (see table 2), and for each combination we repeatedly run the procedure described above until we get 200 data sets of genotype likelihoods. Thus, in summary, we consider 200 replicates for each of 54 combinations of the data quality and selection scenario.
For each replicate, we choose a uniform prior over [ − 1, 1] for both selection coefficients s − and s + , and adopt the reference population size N 0 = 16, 000. We run 10,000 PMMH iterations with 1,000 particles, where  s − > 0 s + > s − Positively selected mutation following a positive change MBE each generation is partitioned into five subintervals in the Euler-Maruyama scheme. We discard the first half of the total PMMH samples as burn-in and thin the remaining by keeping every fifth value. See figure 2 for our posteriors for the selection coefficients s − and s + produced from a simulated data set of genotype likelihoods (see supplementary table S1, Supplementary Material online), including our estimate for the underlying frequency trajectory of the mutant allele in the population. Evidently, in this example, our approach can accurately infer temporally variable selection from genetic time series in genotype likelihood format. The true underlying frequency trajectory of the mutant allele in the population fluctuates slightly around our estimate and is completely covered in our 95% highest posterior density (HPD) interval.

Performance in Estimating Selection Coefficients
To assess how our method performs for estimating selection coefficients, we show the boxplot results of our estimates across different data qualities in figure 3, where the tips of the whiskers are the 2.5%-quantile and the 97.5%-quantile, and the boxes denote the first and third quartiles with the median in the middle. We summarize the bias and root mean square error (RMSE) of our estimates in supplementary table S2, Supplementary Material online. As illustrated in figure 3, our estimates for both selection coefficients are nearly median-unbiased across different data qualities although a slightly large bias can be found when the data quality is poor such as scenario A (around 31.0% missing rate and 13.9% error rate). The bias completely vanishes as the data quality improves (see, e.g., scenario F).
To assess how our method performs for different levels of the selection coefficient, especially weak selection, we run an additional group of simulations with an example of codominance, where we adopt the parameters ϕ = 0.85 and ψ = 1. We assume no event that changes selec-  We observe from figure 4 that our estimates are nearly median-unbiased for weak selection, but the bias gets larger with an increase in the strength of selection (i.e., |s|). This bias could be caused by ascertainment arising from the procedure that we use to generate simulated data sets. In our simulation studies, only the simulated data sets in which no fixation event has occurred in the underlying population are kept. This setting means that the Wright-Fisher model in our data generation process is equivalent to that conditioned on no fixation event occurred, which does not match that in our approach for estimating the selection coefficient. Such a mismatch could bring about the underestimation of the selection coefficient (i.e., the simulated data sets that are retained correspond to a biased sample of the underlying population mutant allele frequency trajectories that reach loss or MBE fixation more slowly, and therefore the estimates for the strength of selection are more likely to be smaller than their true values), especially for strong selection, since fixation events are more likely (see the histograms of the loss probability of mutations and those of the fixation probability of mutations in supplementary fig. S1, Supplementary Material online for each level of the selection coefficient, where the loss probability and fixation probability are calculated with 1,000 replicates for each simulated data set). We find that the simulated data sets with large loss probabilities are all generated with the selection coefficient s ∈ [ − 0.05, − 0.01) whereas those with large fixation probabilities are all generated with the selection coefficient s ∈ (0.01, 0.05], which correspond to the levels of the selection coefficient with significant bias in figure 4. The bias resulting from the mismatch can be fully eliminated by conditioning the Wright-Fisher diffusion to survive (He et al. 2020b).

Performance in Testing Selection Changes
To evaluate how our approach performs for testing selection changes, we produce the receiver operating characteristic (ROC) curves across different data qualities in figure 5, where the true-positive rate (TPR) and false-positive rate (FPR) are computed for each value of the posterior probability for the selection change that is used as a threshold to classify a locus as experiencing a shift in selection, and the ROC curve is produced by plotting the TPR against the FPR. We compute the area under the ROC curve (AUC) to summarize the performance. From figure 5, we see that even though data qualities vary across different scenarios, all curves are very concave and close to the upper left corner of the ROC space with their AUC values varying from 0.89 to 0.94. This suggests that our procedure has superior performance in testing selection changes even for the data sets of up to around 31.0% missing rate and 13.9% error rate (see scenario A). Although these ROC curves almost overlap with each other, we can still see that improved data quality yields better performance (see, e.g., scenario F).
To assess how our approach performs for different levels of the selection change, in particular for small changes, we run an additional group of simulations with an example of codominance, where we adopt the parameters ϕ = 0.85 and ψ = 1. We vary the selection change Δs We generate a data set of genotype likelihoods with the selection coefficients s − and s + . We repeat this procedure until we obtain 200 simulated data sets for each subinterval. Similarly, we run the ROC analysis, and the resulting ROC curves for different levels of the selection change are shown in figure 6 with their AUC values.
By examining the plots of the ROC curves in figure 6, we find that the performance becomes significantly better with the increase in the degree of the change in the selection coefficient (i.e., |Δs|) as expected. Even for small changes |Δs| < 0.001, the AUC value is still larger than 0.70, and for large changes |Δs| > 0.01, the AUC value is up to 0.98. Our results illustrate that our method has strong discriminating power of testing selection changes, even though such a change is small.

Horse Coat Coloration
We employ our approach to infer selection acting on the ASIP and MC1R genes associated with base coat colors

MBE
(black and chestnut) and the KIT13 and TRPM1 genes associated with white coat patterns (tobiano and leopard complex) based on the data on aDNA sequences from previous studies of Ludwig et al. (2009), Pruvost et al. (2011), and Wutke et al. (2016, which were found to be involved in ecological, environmental, and cultural shifts (Ludwig et al. 2009(Ludwig et al. , 2015Wutke et al. 2016). Since only called genotypes are available in Wutke et al. (2016), we use the following procedure to generate genotype likelihoods for each gene in the same format as those produced by GATK (McKenna et al. 2010): if the genotype is called, we set the genotype likelihood of the called genotype to 1 and those of the other two to 0, and otherwise, all possible (ordered) genotypes are assigned equal genotype likelihoods that are normalized to sum to 1. See genotype likelihoods for each gene in supplementary table S4, Supplementary Material online.
Due to the underlying assumption of our approach that mutation occurred before the initial sampling time point, in our following analysis, we exclude the samples drawn before the sampling time point that the mutant allele was first found in the sample for each gene. We adopt the horse demographic history estimated by Der Material online) with the average length of a single generation of the horse being eight years and set the reference population size N 0 = 16, 000 (i.e., the most recent size of the horse population). For each gene, we take the dominance parameter h to be the value given in Wutke et al. (2016).
In our PMMH procedure, we run 20,000 iterations with a burn-in of 10,000 iterations, and the other settings are the same as we adopted in Section "Performance Evaluation", including those in the Euler-Maruyama approach. The estimates of the selection coefficient and its change for each gene with their 95% HPD intervals are summarized in supplementary table S5, Supplementary Material online.

Selection of Horse Base Coat Colors
Base coat colors in horses are primarily determined by ASIP and MC1R, which direct the type of pigment produced, black eumelanin (ASIP) or red pheomelanin (MC1R) (Corbin et al. 2020). More specifically, ASIP on chromosome 22 is associated with the recessive black coat, and MC1R on chromosome 3 is associated with the recessive chestnut coat. Ludwig et al. (2009) found that there was a rapid increase in base coat color variation during horse domestication (starting from approximately 3,500 BC) and provided strong evidence of positive selection acting on ASIP and MC1R. Fang et al. (2009) suggested that such an increase was directly caused by human preferences and demands. We apply our approach to test their hypothesis that selection acting on ASIP and MC1R was changed when horse became domesticated and estimate their selection intensities. The resulting posteriors for ASIP and MC1R are shown in figures 7 and 8, respectively.
Our estimate of the selection coefficient for the ASIP mutation is 0.0018 with 95% HPD interval [ − 0.0012, 0.0066] before domestication and 0.0003 with 95% HPD interval [ − 0.0022, 0.0031] after horses were domesticated. The 95% HPD interval contains 0 for the selection coefficient s − , but there is still some evidence showing that the ASIP mutation was most probably favored by selection before domestication since the posterior probability for positive selection is 0.904. The posterior for the selection coefficient s + is approximately symmetric about 0, which indicates that the ASIP mutation was effectively neutral after horse domestication started. Our estimate of the change in selection acting on the ASIP mutation

MBE
when horses became domesticated is −0.0021 with 95% HPD interval [ − 0.0071, 0.0031], and the posterior probability for such a negative change is 0.779. Our estimate for the underlying ASIP mutation frequency trajectory illustrates that the ASIP mutation frequency rises substantially in the pre-domestication period and then keeps approximately constant in the post-domestication period.
Our estimate of the selection coefficient for the MC1R mutation is −0.0300 with 95% HPD interval [ − 0.0928, 0.0997] before horses became domesticated and 0.0116 with 95% HPD interval [0.0079, 0.0174] after domestication started. Our estimates reveal that the MC1R mutation was effectively neutral or selectively deleterious in the pre-domestication period (with posterior probability for negative selection being 0.670) but became positively selected after horse domestication started (with posterior probability for positive selection being 1.000). Our estimate of the change in selection acting on the MC1R mutation from a pre-to a post-domestication period is 0.0439 with 95% HPD interval [ − 0.0891, 0.1058], and the posterior probability for such a positive change is 0.748. We see a slow decline in the MC1R mutation frequency before domestication (even though the evidence of negative selection before domestication is weak) and then a significant increase after horse domestication started in our estimate for the underlying MC1R mutation frequency trajectory.

Selection of Horse White Coat Patterns
Tobiano is a white spotting pattern in horses characterized by patches of white that typically cross the topline somewhere between the ears and tail. It is inherited as an autosomal dominant trait that was reported in Brooks et al. (2007) to be associated with a locus in intron 13 of the KIT gene on chromosome 3. Wutke et al. (2016) observed that spotted coats in early domestic horses revealed a remarkable increase, but medieval horses carried significantly fewer alleles for these traits, which could result from the shift in human preferences and demands. We apply our method to test their hypothesis that selection acting on KIT13 was changed when the medieval period began (in around AD 400) and estimate their selection intensities. We show the resulting posteriors for KIT13 in figure 9.
Our estimate of the selection coefficient for the KIT13 mutation is 0.0039 with 95% HPD interval [ − 0.0010, 0.0103] before the medieval period, which shows that the KIT13 mutation was positively selected before the Middle Ages (i.e., the posterior probability for positive selection is 0.935). Our estimate of the selection coefficient for the KIT13 mutation is −0.0284 with 95% HPD interval [ − 0.0627, 0.0015] during the medieval period, which demonstrates that the KIT13 mutation became selectively deleterious during the Middle Ages (i.e., the posterior probability for negative selection is 0.969). Our estimate of the change in selection acting on the KIT13 mutation when the Middle Ages started is −0.0326 with 95% HPD interval [ − 0.0700, 0.0008], and the posterior probability for such a negative change is 0.969. We observe from our estimate for the underlying KIT13 mutation frequency trajectory that the KIT13 mutation experienced a gradual increase after horses were domesticated and then a marked decline in the Middle Ages.
Leopard complex is a group of white spotting patterns in horses characterized by a variable amount of white in the coat with or without pigmented leopard spots, which is inherited by the incompletely dominant TRPM1 gene residing on chromosome 1 (Terry et al. 2004). The first genetic evidence of the leopard complex coat pattern could date back to the Pleistocene ( Estimating Temporally Variable Selection Intensity · https://doi.org/10.1093/molbev/msad008 MBE leopard complex coat pattern in domestic horses but did not investigate whether TRPM1 undergone a change in selection from a pre-to a post-domestication period. We apply our method to test the hypothesis that selection acting on TRPM1 was changed when horses became domesticated and estimate their selection intensities. The resulting posteriors for TRPM1 are shown in figure 10. Our estimate of the selection coefficient for the TRPM1 mutation is −0.0005 with 95% HPD interval [ − 0.0042, 0.0029] before horses were domesticated and −0.0078 with 95% HPD interval [ − 0.0139, − 0.0005] after domestication started. Our estimates provide little evidence of negative selection in the pre-domestication period (with posterior probability for negative selection being 0.617) but strong evidence of negative selection in the post-domestication period (with posterior probability for negative selection being 0.980). Our estimate of the change in selection acting on the TRPM1 mutation when horses became domesticated is −0.0066 with 95% HPD interval [ − 0.0142, 0.0025]. The 95% HPD interval for the change in the selection coefficient contains 0, which however still provides sufficient evidence to support that a negative change took place in selection when horses became domesticated (i.e., the posterior probability for such a negative change is 0.942). Our estimate for the underlying trajectory of the TRPM1 mutation frequency displays a slow decrease in the TRPM1 mutation frequency during the pre-domestication period with a significant drop after horse domestication started.

Discussion
In this work, we have introduced a novel Bayesian approach for estimating temporally changing selection intensity from aDNA data. To our knowledge, most earlier methods ignore sample uncertainties resulting from the damage and fragmentation of aDNA molecules, which however are the main characteristics of aDNA. Our method to circumvent this issue is that we base our selection inference on genotype likelihoods rather than called genotypes, which facilitates the incorporation of genotype uncertainties. We have developed a novel two-layer HMM framework, where the top hidden layer models the underlying frequency trajectory of the mutant allele in the

MBE
population, the intermediate hidden layer models the unobserved genotype of the individual in the sample, and the bottom observed layer denotes the data on aDNA sequences. By working with the PMMH algorithm, where the marginal likelihood is approximated through particle filtering, our approach enables us to reconstruct the underlying population allele frequency trajectories. Moreover, our method provides the flexibility of modeling time-varying demographic histories.
The performance of our approach has been evaluated through extensive simulations, which show that our procedure can produce accurate selection inference from aDNA data across different evolutionary scenarios even though samples are sparsely distributed in time with small sizes and in poor quality. Note that in our simulation studies, our procedure has only been tested in the case of a bottleneck demographic history and codominance, but in principle, the conclusions we have drawn here hold for other demographic histories and levels of dominance. Demographic histories have been demonstrated to have little influence on the inference of selection from time series genetic data in Jewett et al. (2016), but in practice, demographic history misspecification can largely bias the inference of selection (see, e.g., Schraiber et al. 2016 We have illustrated the utility of our approach with an application to ancient horse samples from earlier studies of Ludwig et al. (2009), Pruvost et al. (2011), and Wutke et al. (2016, which were genotyped at the loci (e.g., ASIP, MC1R, KIT13, and TRPM1) for horse coat coloration. The published demographic estimate for the horse population from Der Sarkissian et al. (2015) is utilized in our analysis. Our findings are consistent with previous studies that the coat color variation in the horse is a domestic trait that was subject to early selection by humans (Hunter 2018), for example, ASIP, MC1R, and TRPM1, and human preferences have changed greatly over time and across cultures (Wutke et al. 2016), for example, KIT13. We have also run Our results for the base coat color are consistent with previous studies that the shift in horse coat color variation in the early stage of domestication could be caused by relaxed selection for camouflage alleles (Hunter 2018). More specifically, the ASIP mutation was positively selected in the pre-domestication period, but the MC1R mutation was not. From Sandoval-Castellanos et al. (2017), forest cover was growing as a result of global warming during the Late Pleistocene, which pushed horses into the forest full of predators. Dark-colored coats could help horses avoid predators through better camouflage, therefore improving their chances of survival. After horses were domesticated, the ASIP mutation was no longer selectively advantageous, but the MC1R mutation became favored by selection. The shift in the horse coat color preference from dark to light could be explained by that light-colored horses were no longer required to be protected against predation due to domestication. Furthermore, lightcolored coats could facilitate horse husbandry since it was easier to keep track of the horses that were not camouflaged ).
Our results for the tobiano coat pattern illustrate that the KIT13 mutation was favored by selection from domestication till the Middle Ages and then became negatively selected, which confirm the findings of Wutke et al. (2016). Such a negative change in selection of the tobiano coat could result from pleiotropic disadvantages, a lower religious prestige, a reduced need to separate domestic horses from their wild counterparts or novel developments in weaponry during the medieval period (see Wutke et al. 2016, and references therein).
Our results for the leopard complex coat pattern demonstrate that the TRPM1 mutation was negatively selected from the Late Pleistocene onwards. Our evidence of negative selection acting on the TRPM1 mutation in the predomestication period is not strong enough, but we can still see a slow drop in the TRPM1 mutation frequency over time. The TRPM1 mutation is the most common cause of congenital stationary night blindness (CSNB) (Bellone  et al. 2013), which could reduce the chance to survive in the wild since vision is key for communication, localization, orientation, avoiding predators, and looking for food (Murphy et al. 2009). The weak intensity of negative selection could be explained as resulting in part from that only horses homozygous for the leopard complex coat pattern are influenced by CSNB, which however remarkably increased when horses were domesticated. In the postdomestication period, horses were harnessed mainly for power and transportation, for example, they were used to pull wheeled vehicles, chariots, carts, and wagons in the early stage of domestication and later used in war, in hunting and as a means of transport, which all strongly rely on the ability to see. Moreover, night-blind horses are nervous and timid in human care, and difficult to handle at dusk and darkness (Rebhun et al. 1984).

MBE
These genes have been well studied by other methods (e.g., Bollback et al. 2008;Malaspinas et al. 2012;Steinrücken et al. 2014;Schraiber et al. 2016;He et al. 2020aHe et al. , 2020b, but our results are not completely consistent with those presented in previous studies (see supplementary table S8, Supplementary Material online for a summary of the results produced by existing approaches). The discrepancy can be mainly explained by that the strength of selection is assumed to be fixed over time in all existing methods. Other potential causes of the discrepancy are, for example, only 89 ancient horse samples from Ludwig et al. (2009) being used to infer selection acting on the ASIP and MC1R mutations (e.g., Malaspinas et al. 2012;Steinrücken et al. 2014;Schraiber et al. 2016) and a more adequate modeling of relevant scenarios (e.g., He et al. 2020a modeled genetic recombination and local linkage in the inference of selection acting on the KIT13 and KIT16 mutations).
In aDNA studies, the samples with missing genotypes are usually filtered, and the remaining samples are manually grouped into a small number of sampling time points, for example, ancient horse samples were grouped into six sampling time points in Ludwig et al. (2009) or nine sampling time points in Wutke et al. (2016). Grouping has been shown to significantly alter the results of the inference of selection from time series genetic data in He et al. (2020b). Our approach provides an alternative that can address the issue caused by the procedure of sample filtering

MBE
and grouping (see supplementary figs. S9 and S10, Supplementary Material online, as well as supplementary table S9, Supplementary Material online, for additional simulation studies, where we compare our results based on genotype likelihoods with those produced with called genotypes). Our simulation studies show that the commonly used procedure of processing aDNA data for the inference of selection can significantly bias the result, in particular for poor data quality, and suggest that the method based on genotype likelihoods can be a more promising alternative for future aDNA studies.
In our work, the demographic history is required to be prespecified but allowed to be changed over time, as in Schraiber et al. (2016) and He et al. (2020b). Our method is ready to be extended to co-estimate the population size like Malaspinas et al. (2012), but genetic variation at a single locus is not sufficient to produce a reliable estimate of the population size, for example, the population size estimated by Malaspinas et al. (2012) from allele frequency time series data at ASIP is far smaller than the genomewide estimate produced by Der Sarkissian et al. (2015), and therefore Malaspinas et al. (2012) could not distinguish positive from negative selection for ASIP, which has been shown to be positively selected in other studies (e.g., Ludwig et al. 2009;Steinrücken et al. 2014;Schraiber et al. 2016;He et al. 2020b). Several methods like Foll et al. (2015) and Ferrer-Admetlla et al. (2016) can jointly estimate the population size and selection coefficients from time series genomic data but are usually used in experimental evolution rather than aDNA studies due to computational efficiency. Moreover, most of these methods are commonly divided into two steps (but see Ferrer-Admetlla et al. 2016). The first step is to estimate the population size from all loci across the genome assuming selective neutrality, and in the second step, given the estimate of the population size, the selection coefficient for each single locus is independently inferred. These twostep procedures suffer from the difficulty of rejecting the null hypothesis of selective neutrality and the issue of underestimating the population size and selection coefficients (Paris et al. 2019), in particular with an increase in the proportion of selected loci across the genome. A method to address this issue can be to co-estimate the population size with purifying and background selection in the first step as a null model and then fix that model in the second step (Johri et al. 2020(Johri et al. , 2021Johri, Aquadro, et al. 2022;. Joint inference of demographic and selective parameters from ancient genomes, an important topic of future investigation, is anticipated to significantly improve the estimation of selection coefficients and even permit the inference of demographic changes. Although the level of dominance is prespecified in our analysis of simulated data and aDNA data such as Malaspinas et al. (2012) and He et al. (2020aHe et al. ( , 2020b, our method allows to co-estimate the dominance parameter (see supplementary figs. S11 and S12, Supplementary Material online, as well as supplementary table S10, Supplementary Material online, for additional simulation studies evaluating its performance in co-estimating the dominance parameter). In practice, a reliable estimate of the dominance parameter depends heavily on the quality and quantity of data, for example, Foll et al. (2015) illustrated that WFABC performed well in the joint estimation of the dominance parameter from the Panaxia dominula data (i.e., a total of 58,592 samples distributed over 60 generations, 51 sampling time points) but exhibited poor performance in the simulation studies of Kojima et al. (2020), where three replicated populations were simulated, each with 250 samples distributed over 60 generations, five sampling time points. Considering that the quality and quantity of aDNA data are poor, prespecifying the dominance parameter in aDNA studies with sufficient prior knowledge can be a feasible and reasonable alternative.
Compared with existing methods (e.g., Bollback et al. 2008;Malaspinas et al. 2012;Steinrücken et al. 2014;Ferrer-Admetlla et al. 2016;Schraiber et al. 2016;He et al. 2020aHe et al. , 2020b, our Bayesian procedure enables the selection coefficient to vary in time (i.e., piecewise constant) although the event that might change selection is required to be prespecified. However, this is still important in aDNA studies as adaptation in natural populations often involves adaptation to ecological, environmental, and cultural shifts. We run our procedure on the ancient horse samples presented in supplementary table S4, Supplementary Material online with the same settings as we adopted in Section "Horse Coat Coloration", except that the selection coefficient is fixed over time (see supplementary fig. S13, Supplementary Material online for the resulting posteriors and supplementary table S11, Supplementary Material online for the estimates of the selection coefficients with their 95% HPD intervals). We find, for example, that the KIT13 mutation was effectively neutral during the postdomestication period, which contradicts the archaeological evidence and historical records that spotted horses were subject to early selection by humans, but the preference shifted in the medieval period (see Wutke et al. 2016, and references therein). Compared with the results shown in figures 7-10, we see the necessity of modeling temporally variable selection in aDNA studies. Our procedure lends itself to being extended to allow multiple events that might change selection (see Section "Bayesian Inference of Selection"). To guarantee computational efficiency, a feasible solution is to adopt an adaptive strategy that allows for automatically tuning the selection coefficients during a run (see Luengo et al. 2020, for a review). A potential direction for future research is the inference of selection and its strength and timing of changes from time serial genetic samples (Shim et al. 2016;Mathieson 2020).
One fundamental limitation of our method applied for the inference of selection from aDNA data is that it assumes that all samples have been collected after the mutant allele was created. However, allele age is not always available, and as a result, in our analysis we have to exclude the samples drawn before the time that the mutant allele MBE was first observed in the sample, which might alter the result of the inference of selection. To address this problem, we can extend our approach to jointly estimate the allele age as in Malaspinas et al. (2012), Schraiber et al. (2016), andHe et al. (2020b), and the foreseeable challenge is how to resolve particle degeneracy and impoverishment issues in our PMMH-based procedure that result from lowfrequency mutant alleles at the early stage facing a higher probability of being lost. Also, similar to most existing approaches (except for Mathieson and McVean 2013;Lyu et al. 2022), our method lacks the ability to take gene migration into account, which is a common source of confounding in aDNA studies of adaptive processes since natural populations are almost always structured (Mathieson et al. 2015). To resolve this issue, we can incorporate migration into the Wright-Fisher model with selection and co-estimate the migration rate through our PMMH-based procedure, where following Lyu et al. (2022), we alternatively update the selection-related and migration-related parameters to improve the mixing of the chain. In addition, our method assumes that the gene of our interest is independent of others, which however can be easily violated once there exist interactions between genes like epistasis and linkage, for example, MC1R is epistatic to MC1R (Rieder et al. 2001), and KIT13 is tightly linked to KIT16 (Dumont and Payseur 2008). Ignoring gene interactions might bias the inference of selection (He et al. 2020a). To extend our method to the scenario of multiple genes with epistasis and/or linkage, we need to find a good approximation for the Wright-Fisher model of multiple genes evolving under selection with epistasis and linkage over time, which becomes challenging with an increase in the number of genes. We will discuss how to extend our method to the case of two genes with epistasis and/ or linkage in our upcoming work, and the extension for the scenario of multiple interacting genes (≥3) will be the topic of future investigation.

Materials and Methods
In this section, we describe in detail our approach to infer temporally variable selection from aDNA data whereas modeling sample uncertainties resulting from the damage and fragmentation of aDNA molecules.

Wright-Fisher Diffusion
Let us consider a diploid population of N randomly mating individuals at a single locus A, which evolves subject to selection under the Wright-Fisher model (see, e.g., Durrett 2008). We assume discrete time, nonoverlapping generations, and nonconstant population size. Suppose that at locus A there are two possible allele types, labeled A 0 and A 1 , respectively. We attach the symbol A 0 to the ancestral allele, which originally exists in the population, and the symbol A 1 to the mutant allele, which arises in the population only once. We let selection take the form of viability selection and take per-generation relative viabilities of the three possible genotypes A 0 A 0 , A 0 A 1 , and A 1 A 1 to be 1, 1 + hs, and 1 + s, respectively, where s ∈ [ − 1, + ∞) is the selection coefficient and h ∈ [0, 1] is the dominance parameter.
We now consider the standard diffusion limit of the Wright-Fisher model with selection. We measure time in units of 2N 0 generations, denoted by t, where N 0 is an arbitrary reference population size fixed through time, and assume that the population size changes deterministically, with N(t) being the number of diploid individuals in the population at time t. In the diffusion limit of the Wright-Fisher model with selection, as the reference population size N 0 goes to infinity, the scaled selection coefficient α = 2N 0 s is kept constant and the ratio of the population size to the reference population size N(t)/N 0 converges to a function β(t). As demonstrated in Durrett (2008), the mutant allele frequency trajectory through time converges to the diffusion limit of the Wright-Fisher model with the reference population size N 0 approaching infinity. We refer to this diffusion limit as the Wright-Fisher diffusion with selection.
We let X denote the Wright-Fisher diffusion with selection, which models the mutant allele frequency evolving in the state space [0, 1] under selection. Many existing approaches define the Wright-Fisher diffusion X in terms of the partial differential equation (PDE) that characterizes its transition probability density function (e.g., Bollback et al. 2008;Steinrücken et al. 2014;He et al. 2020b). Instead, like Schraiber et al. (2016), He et al. (2020a), and Lyu et al. (2022), we define the Wright-Fisher diffusion X as the solution to the stochastic differential equation for t ≥ t 0 with initial condition X(t 0 ) = x 0 , where W represents the standard Wiener process.

Bayesian Inference of Selection
Suppose that the available data are always drawn from the underlying population at a finite number of distinct time points, say t 1 < t 2 < · · · < t K , measured in units of 2N 0 generations. At the sampling time point t k , there are N k individuals sampled from the underlying population, and for individual n, let r n,k represent, in this generic notation, all of the reads at the locus of interest. The population genetic parameters of interest in this work are the selection coefficient s and dominance parameter h, and for ease of notation, we set ϑ = (s, h) in what follows.

Hidden Markov Model
Our method is built upon the HMM framework introduced by Bollback et al. (2008), where the underlying population evolves under the Wright-Fisher diffusion with selection in equation (2), and the observed sample is made up of the individuals independently drawn from MBE where superscript represents the particle label. By substituting equation (6) into equation (5), the filtering distribution p(x k+1 | r 1 : k+1 , ϑ) (up to proportionality) can be approximated bŷ p(x k+1 | r 1 : k+1 , ϑ) ∝ p(r k+1 | x k+1 ) M m=1 1 M p(x k+1 | x m k ; ϑ).
From equation (7), the approximation of the filtering distribution p(x k+1 | r 1 : k+1 , ϑ) can be sampled with importance sampling, where we generate a set of particles x 1 : M k+1 from the predictive distribution p(x k+1 | r 1 : k , ϑ) with each particle x m k+1 being assigned a weight w m k+1 ∝ p(r k+1 | x m k+1 ) (Gordon et al. 1993). We resample M particles with replacement amongst the set of particles x 1 : M k+1 with probabilities proportional to weights w 1 : M k+1 . For clarity, we write down the bootstrap particle filter algorithm: Step 1: Initialize the particles at the sampling time point t 1 : Step 1a: Draw x m 1 ∼ p(x 1 | ϑ) for m = 1, 2, . . . , M.
Step 2c: Resample M particles with replacement amongst x 1 : M k with w 1 : M k . With the procedure for the bootstrap particle filter described above, the smoothing distribution p(x 1 : K | r 1 : K , ϑ) can be sampled by uniformly drawing amongst the set of particles x 1 : M 1 : K , and the marginal likelihood p(r 1 : K | ϑ) can be estimated bŷ p(r 1 : K | ϑ) =p(r 1 | ϑ) We can therefore write down the PMMH algorithm: Step 1: Initialize the population genetic parameters ϑ and population mutant allele frequency trajectory x 1 : K : Step 1a: Draw ϑ 1 ∼ p(ϑ).

Repeat
Step 2 until a sufficient number of samples of the population genetic parameters ϑ and population mutant allele frequency trajectory x 1 : K have been obtained: Step 2: Update the population genetic parameters ϑ and population mutant allele frequency trajectory x 1 : K : Step 2a: Draw ϑ i ∼ q(ϑ | ϑ i−1 ).
Step 2b: Run a bootstrap particle filter with ϑ i to yield p(r 1 : K | ϑ i ) and x i 1 : K .
Step 2c: Accept ϑ i and x i 1 : K with otherwise set ϑ i = ϑ i−1 and x i 1 : K = x i−1 1 : K . Note that superscript represents the iteration in our procedure described above. In the sequel, we adopt random walk proposals for each component of the population genetic parameters ϑ unless otherwise specified.
Once enough samples of the population genetic parameters ϑ and population mutant allele frequency trajectory x 1 : K have been obtained, we can get the MAP estimates for the population genetic parameters ϑ by computing the posterior p(ϑ | r 1 : K ) with nonparametric density estimation techniques (see Izenman 1991, for a review). Alternatively, we can yield the minimum mean square error estimates for the population genetic parameters ϑ through their posterior means. Similarly, we can take the posterior mean of the samples of the population mutant allele frequency trajectory to be our estimate for the population mutant allele frequency trajectory x 1 : K .
Our method allows the selection coefficient s to be piecewise constant over time. For example, we let the selection coefficient s(t) = s − if t < τ, otherwise s(t) = s + , where τ is the time of an event that might change selection, for example, the times of plant and animal domestication. Our procedure can then be directly applied to estimate the selection coefficients s − and s + for any prespecified time τ. The only modification required is to simulate the underlying mutant allele frequency trajectory of the population through the Wright-Fisher diffusion with the selection coefficient s − for t < τ and s + for t ≥ τ, respectively. In this case, our method also provides a procedure to estimate and test the change in the selection coefficient, denoted by Δs = s + − s − , at time τ by computing the posterior p(Δs | r 1 : K ) from the samples of the selection coefficients s − and s + . This is a highly desirable feature in aDNA studies as it enables us to test hypotheses about whether the shift in selection is linked to specific ecological, environmental, and cultural drivers.

Supplementary Material
Supplementary data are available at Molecular Biology and Evolution online.