An efficient and robust ABC approach to infer the rate and strength of adaptation

Abstract Inferring the effects of positive selection on genomes remains a critical step in characterizing the ultimate and proximate causes of adaptation across species, and quantifying positive selection remains a challenge due to the confounding effects of many other evolutionary processes. Robust and efficient approaches for adaptation inference could help characterize the rate and strength of adaptation in nonmodel species for which demographic history, mutational processes, and recombination patterns are not currently well-described. Here, we introduce an efficient and user-friendly extension of the McDonald–Kreitman test (ABC-MK) for quantifying long-term protein adaptation in specific lineages of interest. We characterize the performance of our approach with forward simulations and find that it is robust to many demographic perturbations and positive selection configurations, demonstrating its suitability for applications to nonmodel genomes. We apply ABC-MK to the human proteome and a set of known virus-interacting proteins (VIPs) to test for long-term adaptation in genes interacting with viruses. We find substantially stronger signatures of positive selection on RNA-VIPs than DNA-VIPs, suggesting that RNA viruses may be an important driver of human adaptation over deep evolutionary time scales.


Introduction
Genomes contain a record of the evolutionary processes that shape diversity within and across species, and software tools that use genomic sequences to infer aspects of the evolutionary past are now an integral part of population genetics research. Of particular interest to evolutionary biologists are methods that can disentangle various processes that may contribute to diversification between species, such as adaptation and genetic drift. Such methods have the potential to resolve fundamental questions about the evolutionary (e.g., Corbett-Detig et al. (2015); Galtier (2016); Galtier and Rousselle (2020)) and biological (e.g., Enard et al. (2016); James et al. (2016)) drivers of diversification at the genomic level. Though numerous methods have been proposed to this end, it remains challenging to generate accurate and unbiased estimates. Characterizing the biases that unaccounted-for evolutionary processes introduce into available approaches, and assessing evidence for genome-wide adaptation, remain highly active areas of research in molecular population genetics (McDonald and Kreitman 1991; Gillespie 1994; Smith and Eyre-Walker 2002; Hahn 2008; Fay 2011; Tataru et al. 2017; Kern and Hahn 2018; Jensen et al. 2019; Johri et al. 2020, 2022a,b).
The development of methods that are both computationally efficient and reasonably robust to model misspecification remains a major challenge. Most computational approaches that infer the rate of long-term adaptation at the DNA level derive from the McDonald–Kreitman test (MK-test) framework (McDonald and Kreitman 1991) or the related Poisson Random Field (PRF) framework (Sawyer and Hartl 1992). Both methods use divergence and polymorphism data to estimate the proportion of nonsynonymous substitutions fixed by positive selection in coding sequences, comparing alleles that are likely to have fitness effects (putatively selected) to those less likely to be under selection (putatively neutral). A significant excess of fixed differences in the putatively functional set relative to the putatively neutral set is taken as a signal of positive selection. The rate of adaptation is often summarized by the quantity α, defined as the proportion of non-synonymous (or putatively functional) fixed differences that were under positive selection along a particular evolutionary branch. Smith and Eyre-Walker (2002) applied a simple theoretical model of directional selection relating polymorphism and divergence to the rate of adaptation, and showed that α could be inferred with the quantity

α ≈ 1 − (D_S P_N) / (D_N P_S),   (1)

where D_S is the number of synonymous fixed differences in a sequencing sample, D_N is the number of non-synonymous fixed differences, P_N is the number of non-synonymous polymorphic sites, and P_S is the number of synonymous polymorphic sites. When α is close to 1, positive selection is the predominant determinant of molecular divergence; when α is close to 0, drift dominates sequence divergence. This convenient formula has been widely applied to estimate molecular adaptation, in part because of its simplicity. Indeed, the quantities on the right-hand side of equation (1) are commonly inferred by comparing a population sample of sequenced individuals (sequencing sample) to a
closely related outgroup species. Although widely used, it should be noted that MK-test and PRF-based approaches have multiple drawbacks that could bias the estimation of α. For instance, equation (1) relies on a null model derived from nearly-neutral theory (Kimura 1968; Ohta 1974; Kimura 1977), and assumes that selected polymorphism, either deleterious or beneficial, is rarely observed. Subsequent modeling and empirical studies argued that weakly selected alleles can attain both low and high frequencies and may cause substantial biases in inferences that use equation (1) (Balloux and Lehmann 2012; Lanfear et al. 2014; Booker and Keightley 2018; Galtier and Rousselle 2020; Rousselle et al. 2020).
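The estimator in equation (1) is simple enough to sketch directly. The helper below is purely illustrative (it is not part of the ABC-MK package):

```python
def alpha_mk(dn, ds, pn, ps):
    """Standard MK-test estimate of alpha (equation 1): the excess of
    non-synonymous fixed differences relative to the neutral expectation
    implied by the polymorphism ratio P_N/P_S."""
    if dn == 0 or ps == 0:
        raise ValueError("D_N and P_S must be non-zero")
    return 1.0 - (ds * pn) / (dn * ps)

# Example: an excess of non-synonymous divergence yields alpha > 0.
print(alpha_mk(dn=80, ds=100, pn=40, ps=100))  # 1 - (100*40)/(80*100) = 0.5
```

With equal divergence and polymorphism ratios the estimate is 0, consistent with purely neutral divergence.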
Though weakly deleterious alleles are less likely to reach fixation, they can reach relatively high frequencies and contribute to the class P_N. This causes overestimation of the neutral mutation rate in the putatively functional category, which makes the estimation of α downwardly biased (Charlesworth and Eyre-Walker 2008). Fixation of weakly deleterious alleles can also cause overestimation of α after a long population bottleneck, since the reduced population size makes selection less efficient at purging deleterious alleles (Eyre-Walker 2002; Eyre-Walker and Keightley 2009). Conversely, weakly beneficial alleles can be found at intermediate or high frequencies (Tataru et al. 2017; Uricchio et al. 2019), especially when the rate of strongly beneficial mutations is high (as might be expected in a large population) or weakly beneficial alleles contribute substantially to polymorphism (as might be expected under some polygenic selection models). The presence of weakly selected alleles has been addressed over the last decade by MK-test and PRF-based methods mainly by explicitly modelling the Distribution of Fitness Effects (DFE) for negatively selected variants (Boyko et al. 2008; Eyre-Walker and Keightley 2009; Messer and Petrov 2013; Racimo and Schraiber 2014; Galtier 2016), along with the recent incorporation of positively selected variants in such DFE models (Galtier 2016; Tataru et al. 2017).
Despite the development of several methods that account for weakly selected polymorphism, some empirical observations remain challenging to explain under existing models, such as the apparent low rate of adaptation in primates, the constrained range of genetic diversity across species, and differences in the rate of adaptation among taxa (Galtier 2016; Castellano et al. 2018, 2019a). Generating a deeper biological and evolutionary understanding of the drivers of differentiation across species may require new methods and models that can efficiently estimate the DFE while simultaneously accounting for many (potentially confounding) evolutionary processes.
Demographic processes (such as population contractions, expansions, and migrations) are another primary potential source of bias in the inference of selection (Jensen et al. 2019; Johri et al. 2020, 2022a,b), just as selection is an important potential confounder in the inference of demography (Schrider et al. 2016; Torres et al. 2018). The developers of robust inference methods have typically sought to account for both selection and demographic processes simultaneously. The cost of incorporating both demography and selection is accrued in terms of model complexity and loss of efficiency, as it is much more challenging to compute likelihoods or summary statistics under joint demography/selection models. There is some hope, however, that methods based on the asymptotic MK-test (aMK-test) (Messer and Petrov 2013) may have some inherent robustness, since these approaches rely on summary statistics that involve ratios of functional and (putatively) non-functional alleles. Hence, some of the effects of demography should be absorbed into the ratio, as both categories of alleles will be affected.
Here, we develop an extension of the Approximate Bayesian Computation (ABC) method ABC-MK presented in Uricchio et al. (2019) that greatly improves the efficiency of the inference. In Uricchio et al. (2019), analytical calculations were used to explore the effect of background selection (BGS) and selective interference on weakly beneficial alleles, but the estimation procedure employed was based on computationally intensive forward-in-time simulations and took days even on a High Performance Computing cluster. We developed a simpler and much more computationally efficient ABC-based inference procedure that accounts for the DFE of deleterious and beneficial alleles and for partial recombination between selected genomic elements, while avoiding forward-in-time simulations. We describe the inference procedure, assess its performance and robustness under nonequilibrium demographic scenarios and different intensities of adaptation, and apply it to human genomic data. We show that the method is reasonably robust to non-equilibrium demographic events and to different strengths of adaptation, use it to provide additional evidence for a substantial effect of RNA viruses on human adaptation rates, and discuss caveats and potential extensions of our work.

Preliminaries
Our first goal is to calculate the expected rate of fixation and the expected site frequency spectrum of neutral and selected polymorphism under a model of directional selection with incomplete recombination. To do so, we follow the results of Uricchio et al. (2019), which in turn extended the results of several earlier studies (e.g., Eyre-Walker and Keightley (2009); Messer and Petrov (2013)).
In subsequent sections we extend these calculations by developing a random sampling scheme that accounts for the Poisson variance in mutation and fixation rates, and allows us to develop a simple inference pipeline. We briefly review the core aspects of the theoretical framework.
Our ultimate goal is to estimate α, the proportion of non-synonymous substitutions fixed by positive selection, as well as the DFE over de novo non-synonymous mutations and substitutions. We suppose that selection is directional, with both positively selected and negatively selected mutations. We first consider the case where each selected locus evolves independently, and in subsequent sections we consider cases with BGS and selective interference.
The DFE over beneficial alleles consists of two point masses, one representing strongly beneficial alleles and the other representing weakly beneficial alleles. The rate of adaptation can therefore be decomposed into weakly and strongly beneficial components, α = α_W + α_S. The substitution rate for non-synonymous alleles is denoted d_N, with d_N+, d_N−, and d_N0 representing the rates for positively selected, negatively selected, and neutral alleles, respectively (note that d_N = d_N+ + d_N− + d_N0 by definition; since there is a DFE for beneficial and deleterious alleles we could further decompose d_N− and d_N+ into more categories, but we collapse the DFE into two compartments here and consider the full DFE in later sections). In the same way, we denote by d_S the substitution rate of synonymous mutations, which are assumed to be neutral. We can write α as

α = d_N+ / d_N = 1 − (d_N− + d_N0) / d_N.   (2)

Note that we define α as the realized proportion of positively selected substitutions along the branch, and hence equation (2) is an expression for the expectation of α. As noted by Messer and Petrov (2013), d_N / d_S can be estimated from sequence alignments with the ratio D_N / D_S, under the assumption that the observed number of substitutions along a branch is proportional to the rate. The ratio (d_N− + d_N0) / d_S is more complex to estimate, because it relies on partitioning substitutions by their fitness effects. Under the assumption that polymorphic alleles are rarely selected (because deleterious sites are removed from the population quickly and beneficial sites go to fixation rapidly), previous work (Smith and Eyre-Walker 2002) showed that this ratio can be approximated by substituting P_N / P_S, giving

α ≈ 1 − (D_S / D_N)(P_N / P_S).   (3)

However, if selected polymorphisms segregate in the sample, then P_N in equation (3) will be inflated relative to the true rate of mutation for neutral non-synonymous alleles, which results in underestimation of α. A potential solution is to exclude alleles with derived allele frequencies lower than some
threshold from the quantities P_N and P_S, since most (negatively) selected alleles should be constrained to low frequency (Fay et al. 2001). While this solution works well for some DFEs (for example, when all deleterious alleles are strongly selected), weakly deleterious alleles can reach appreciable frequencies and bias inference regardless of the chosen frequency threshold. Messer and Petrov (2013) extended this idea by developing a simple estimator of α (asymptotic MK) that uses all frequencies simultaneously, rewriting the estimator of Smith and Eyre-Walker (2002) as

α(x) = 1 − (D_S / D_N)(P_N(x) / P_S(x)),   (4)

where P_N(x) and P_S(x) are the numbers of non-synonymous and synonymous alleles at frequency x in a sequencing sample. This method improves the quality of α estimates by using all of the frequency data simultaneously and providing confidence intervals for α. The α(x) values are fit with an exponential function of the form α(x) = a + b · e^(−cx); evaluating the best-fit exponential at x = 1 eliminates the effect of weakly deleterious alleles. In the following sections, for both empirical and analytical results, we define P_N(x) and P_S(x) as all non-synonymous and synonymous polymorphisms above frequency x. As noted in Uricchio et al. (2019), this cumulative definition scales better as sample size increases, since most common allele frequencies x have very few polymorphic sites in large samples, and the cumulative P_N(x) and P_S(x) trivially share the same asymptote. Despite the improvement, the method does not estimate the DFE and assumes that beneficial alleles do not contribute to P_N, which leads to underestimation of α in cases where selection is predominantly weak.
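The cumulative definition of P_N(x) and P_S(x) can be made concrete with a small sketch (an illustrative helper of our own, not the package's API), which computes α(x) from site frequency spectra using all polymorphisms at frequency x and above:

```python
def alpha_x(pn_sfs, ps_sfs, dn, ds):
    """alpha(x) curve (equation 4) with the cumulative definition: P_N(x) and
    P_S(x) are all non-synonymous/synonymous polymorphisms at frequency >= x.
    pn_sfs, ps_sfs: counts per derived-frequency class, rarest class first."""
    alphas = []
    for x in range(len(pn_sfs)):
        pn_x = sum(pn_sfs[x:])   # cumulative non-synonymous count above x
        ps_x = sum(ps_sfs[x:])   # cumulative synonymous count above x
        alphas.append(1.0 - (ds / dn) * (pn_x / ps_x))
    return alphas

# Toy spectra: an excess of rare non-synonymous variants (weakly deleterious
# alleles) depresses alpha at low x; the curve rises toward its asymptote.
curve = alpha_x(pn_sfs=[150, 50, 25, 10], ps_sfs=[100, 50, 25, 10], dn=100, ds=100)
```

In this toy example the lowest-frequency bin gives a strongly negative estimate, while higher-frequency bins approach the asymptotic value, which is the behavior the exponential fit exploits.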

Expected fixation rates and frequency spectra
A complementary approach to that of Messer and Petrov (2013) is to directly model the effects of the DFE for beneficial and deleterious alleles on the shape of the α(x) curve, and to infer the best-fitting model parameters. Note that while E[α(x)] is not straightforward to calculate, because it depends on the ratio of several random variables, the expectation of each component of equation (4) (P_S(x), P_N(x), D_S, D_N) is easily calculated in a directional selection model from first principles using diffusion theory (Evans et al. 2007). Therefore, we make a first-order approximation,

E[α(x)] ≈ 1 − (E[D_S] / E[D_N])(E[P_N(x)] / E[P_S(x)]),   (5)

with the component expectations given by equations (6) and (7).
In general, our approach can be applied to any DFE for which we can analytically solve the fixation rates and expected frequency spectra. Descriptions of these calculations follow this section and can be found in the online documentation for our software at jmurga.github.io/MKtest.jl.

Expected frequency spectrum
The expected number of alleles at frequency x follows from standard diffusion theory for the site frequency spectrum in an equilibrium population (e.g., see equation 31 of Evans et al. (2007)):

Ψ(x) = θ_s (1 − e^(−4Ns(1−x))) / ((1 − e^(−4Ns)) x(1−x)),   (6)

where Ψ(x) is the expected number of alleles at frequency x in a population of size N, and θ_s = 4Nμ_s is the population-scaled mutation rate for mutations with selection coefficient s. To obtain the downsampled frequency spectrum in a finite sample of 2n chromosomes, we convolved equation (6) with the binomial distribution.
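As an illustrative sketch (assuming the standard diffusion form of the selected frequency spectrum; the function names are ours, not the package's), the binomial downsampling step can be written as:

```python
import math

def psi(x, theta_s, gamma):
    """Equilibrium frequency spectrum for selected alleles under the
    diffusion approximation; gamma = 4Ns is the population-scaled selection
    coefficient. The neutral limit (gamma -> 0) recovers theta_s / x."""
    if abs(gamma) < 1e-8:
        return theta_s / x
    return theta_s * (1.0 - math.exp(-gamma * (1.0 - x))) / (
        (1.0 - math.exp(-gamma)) * x * (1.0 - x))

def downsample_sfs(theta_s, gamma, n, grid=2000):
    """Binomial downsampling to a sample of 2n chromosomes:
    E[Psi_n(i)] = integral of Psi(x) * C(2n, i) * x^i * (1-x)^(2n-i) dx,
    approximated here with a midpoint rule on a regular grid."""
    sfs = [0.0] * (2 * n - 1)            # derived allele counts 1 .. 2n-1
    dx = 1.0 / grid
    for k in range(grid):
        x = (k + 0.5) * dx
        density = psi(x, theta_s, gamma) * dx
        for i in range(1, 2 * n):
            sfs[i - 1] += density * math.comb(2 * n, i) * x**i * (1 - x)**(2 * n - i)
    return sfs
```

A quick sanity check: in the neutral case the downsampled spectrum recovers the classic θ/i expectation, and negative selection depresses the high-frequency classes.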

Expected number of fixations
Considering the distribution of selection coefficients over new mutations, with μ_s the mutation rate to alleles with selection coefficient s, and the fixation probability π_s, we calculate the expected fixation rate as

d = ∫ μ_s π_s ds.   (7)

Following Uricchio et al. (2019), the expected fixation rate can be decomposed by fitness effect by replacing θ_s = 4Nμ_s in equations (6) and (7) with a Gamma distribution Γ[α, β] over negative selection coefficients and a point-mass distribution over positive selection coefficients, where p_+ and p_− are the probabilities that a new allele is beneficial or deleterious, respectively; the resulting closed-form expressions involve the Riemann zeta function ζ (Eyre-Walker 2010).
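A discretized sketch of this calculation, using Kimura's diffusion fixation probability for a new mutation (our illustrative helpers; the selection classes and rates below are hypothetical, not the paper's estimates):

```python
import math

def fix_prob(s, N):
    """Kimura's fixation probability for a single new copy (initial frequency
    1/(2N)) of an allele with selection coefficient s in a diploid population
    of size N; the neutral limit is 1/(2N)."""
    if abs(s) < 1e-12:
        return 1.0 / (2 * N)
    return (1.0 - math.exp(-2.0 * s)) / (1.0 - math.exp(-4.0 * N * s))

def expected_fixation_rate(class_rates, N):
    """Discretized version of d = integral of mu_s * pi_s ds: a sum over
    discrete selection classes of (mutation rate to the class) x 2N x
    (fixation probability). A purely neutral class contributes exactly its
    mutation rate, as expected under the neutral theory."""
    return sum(mu * 2 * N * fix_prob(s, N) for s, mu in class_rates.items())

# Hypothetical per-class mutation rates: neutral, weakly deleterious, beneficial.
d = expected_fixation_rate({0.0: 5e-9, -0.001: 4e-9, 0.01: 1e-9}, N=1000)
```

Deleterious classes contribute less than their mutation rate and beneficial classes more, which is exactly the decomposition of d_N into d_N0, d_N−, and d_N+ used above.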

Background selection and adaptive divergence
BGS (Charlesworth et al. 1993; Hudson and Kaplan 1995; Nordborg et al. 1996) and selective interference (e.g., Hill–Robertson interference; Hill and Robertson 1966) can affect the rate of fixation of weakly deleterious or beneficial alleles. Up to this point, we have considered only selected loci that evolve independently of all other selected loci. In this section we relax this assumption by exploring the effects of selective interference on fixation rates and the frequency spectrum. We follow approximations that apply in some circumstances (in particular, when BGS is the predominant driver of selective interference), but may fail when strongly beneficial alleles interfere (Good et al. 2014; Cvijović et al. 2018; Garcia and Lohmueller 2021; Ragsdale 2022).
To explore α(x) accounting for recombination and BGS impact, we focused on a model in which the coding locus is flanked on each side by loci of length L, which contain deleterious alleles.We modeled deleterious alleles with a population-scaled selection coefficient −2Nt undergoing persistent deleterious mutation at rate 4Nµ − and the whole flanking loci recombined at a rate r per-base, pergeneration.The effects of BGS on fixations and frequency spectra have been subject of much theoretical work (Charlesworth et al. 1993;Charlesworth 1994;Hudson and Kaplan 1995;Barton 1995;Nordborg et al. 1996).Previous work has shown that diversity at the coding locus (π) is decreased relative to its neutral expectation (π 0 ), and closed form expressions for the expected reduction in diversity are available (Hudson and Kaplan 1995;Nordborg et al. 1996).While patterns of sequence variation induced by BGS can be quite complex (Nicolaisen and Desai 2013;Good et al. 2014;Torres et al. 2018Torres et al. , 2020)), to a first approximation the effect of BGS can be thought of as a reduction in the effective population size N e , with N e = N π π 0 (McVicker et al. 2009).To account for the role of BGS on the fixation rate of deleterious alleles, we replace N in the prior equations with N e after accounting for BGS.We also replace N with N e in formulae for the frequency spectra.
For beneficial alleles, the effects of selective interference are slightly more complex. Strongly beneficial alleles are essentially unaffected by BGS, in that their fixation probabilities depend very little on the reduction in neutral diversity (Uricchio et al. 2019). Weakly beneficial alleles, in contrast, can have their fixation probabilities substantially reduced by BGS. We followed Barton (1995) to derive formulae for the reduction in fixation rate of weakly and strongly beneficial alleles after accounting for BGS, as described in the Supplemental Materials of Uricchio et al. (2019) (see the Background selection and adaptive divergence section). The reduction in fixation probability for a weakly beneficial allele with selection coefficient s under interference with deleterious alleles with selection coefficient t is a function of l, the distance in base pairs from the region of interest, 1 ≤ l ≤ L (see equation 17d of Barton (1995)). Multiplying across all deleterious linked sites and factoring in flanking sequences to both the left and right of the focal site (which requires us to square the one-sided product), we obtain a closed-form expression, involving the polygamma function ψ, for the reduction Φ in the probability of fixation relative to the case with no linkage. We note that these expressions are not expected to hold under very high rates of mutation for beneficial alleles, and will be less accurate for strongly beneficial alleles than for weakly beneficial alleles. Given these expressions, we can replace the fixation rates for beneficial alleles in our prior formulae with the fixation rates after accounting for selective interference.

Poisson-sampling process
The previous sections described the expectations of fixation rates and frequency spectra under a model of directional selection and selective interference. We now develop a simple random sampling scheme around these expectations that accounts for sampling and process variance, linking the analytical calculations to an ABC procedure that avoids computationally expensive forward simulations.
We note that the model we explore is quite similar to the BGS model in DeGiorgio et al. Following the Poisson Random Field model (Sawyer and Hartl 1992), we supposed that the numbers of fixed differences and polymorphic sites are Poisson-distributed random variables with mean values given by the expectations in the previous sections.
To avoid performing branch-length estimation in our computation, we assumed that the empirically observed number of fixations is proportional to the length of the evolutionary branch of interest T, the locus length L, and the mutation rate μ. We take the observed total number of fixations (including both non-synonymous and synonymous sites) as a proxy for the expected number, and then sample weakly deleterious, neutral, and beneficial substitutions proportional to their relative expected rates for a fixed set of model parameters. The expected number of positively selected substitutions is then

E[D_N+] = D · d_N+ / (d_N + d_S),   (8)

where D is the observed total number of substitutions in a dataset of interest.
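This sampling step can be sketched as follows (pure-Python Poisson draws via Knuth's multiplication method; the class labels and rate values are hypothetical, for illustration only):

```python
import math
import random

def poisson(lam, rng):
    """Poisson draw via Knuth's multiplication method (adequate for the
    modest means used here; not suited to very large means)."""
    threshold, k, p = math.exp(-lam), 0, 1.0
    while True:
        p *= rng.random()
        if p <= threshold:
            return k
        k += 1

def sample_fixations(D, rates, rng):
    """Sample substitution counts per class as Poisson random variables with
    means D * d_i / sum(d): expectations are anchored on the observed total
    number of substitutions D, apportioned by relative expected rates."""
    total = sum(rates.values())
    return {name: poisson(D * d / total, rng) for name, d in rates.items()}

# Hypothetical relative rates for synonymous, neutral, weakly deleterious,
# and positively selected non-synonymous substitutions.
counts = sample_fixations(200, {"syn": 0.5, "neu": 0.2, "del": 0.2, "pos": 0.1},
                          random.Random(42))
```

Each replicate draw thus carries Poisson process variance around the class expectations while summing, on average, to the observed total.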
It should be noted that both sampling variance and process variance affect the number of variable alleles at any particular allele frequency in a sequencing sample. The process variance arises from the random mutation–fixation process along the branch, while the sampling variance arises from the random subset of chromosomes included in the sequencing data. To account for the process variance, in a manner exactly analogous to the fixed variants described above, we sampled a Poisson-distributed number of polymorphic alleles at each frequency x proportional to its rate under the expected frequency spectra. To account for the sampling variance, the expected frequency spectra were downsampled using a binomial distribution (with probability of success given by the frequency x) in a sample of 2n chromosomes. To account for background selection, we solved equation (9) and adjusted the expected rates accordingly before sampling.

Computational workflow
Our goal is to infer α, α_W, and α_S given a set of observed α(x) values from a sequencing dataset. Since α in our framework is a model output and not a parameter per se (i.e., α depends on the random number of fixations along an evolutionary branch, which in turn depends on the parameters of the evolutionary model), we cannot immediately obtain the corresponding fixation rates and frequency spectra for a given set of expected α values without first solving for the mutation rates and fixation probabilities implied by the input model (see Table 1). Given α and α_W (which together uniquely determine α_S), a DFE over negatively selected alleles, a known B value, a selection coefficient for beneficial alleles, a selection coefficient for flanking deleterious alleles, and recombination (ρ) and mutation (θ) rates on the coding locus, we numerically solve for the probability of fixation of beneficial alleles and the mutation rate on the flanking locus that correspond to the desired α values and BGS strength. This allows us to rapidly calculate the expected frequency spectra and fixation rates corresponding to the desired α values, and to generate a sample of α(x) values, following the Poisson-sampling process, under the corresponding evolutionary model.
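A minimal sketch of the solving step: a closed-form piece maps a target α to the positive substitution rate it implies, and a generic bisection stands in for the numerical solve over the full model (both helpers are ours, for illustration; the package's internal solver is more involved):

```python
def solve_rate_for_alpha(alpha_target, d_neg, d_neu):
    """Closed-form step: since alpha = d_pos / (d_pos + d_neg + d_neu),
    the positive substitution rate implied by a target alpha is
    d_pos = alpha * (d_neg + d_neu) / (1 - alpha)."""
    return alpha_target * (d_neg + d_neu) / (1.0 - alpha_target)

def bisect_parameter(f, target, lo, hi, tol=1e-10):
    """Generic bisection for f(p) = target with f increasing on [lo, hi].
    In the full model, f would map a candidate beneficial mutation rate (or
    fixation probability) to the realized alpha under BGS."""
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if f(mid) < target:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)
```

For example, with d_neg + d_neu = 0.6 and a target α of 0.4, both routes recover d_pos = 0.4, since 0.4 / (0.4 + 0.6) = 0.4.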

Inferring parameters with ABC
We used a generic Approximate Bayesian Computation (ABC) algorithm to infer the rate and strength of adaptation. ABC first samples parameter values from prior distributions; second, it simulates the model using these random parameter values and calculates informative summary statistics from the simulations; and third, it compares the simulated summary statistics to the observed empirical data. The simulated summary statistics that best match the observed data form an approximate posterior distribution of the parameters. Our approach uses the computational workflow and the Poisson-sampling scheme described above to sample α(x), generating summary statistics corresponding to different evolutionary scenarios, and then compares these summary statistics to the empirical α(x) estimates. Since we do not know a priori the value of any model parameter, we sample 10^5 parameter sets from uniform prior distributions to estimate summary statistics (see Table 1, Table 2), which allows for flexibility in the DFE of deleterious and beneficial alleles. We supplied the summary statistics and empirical α(x) to ABCreg (Thornton 2009) to estimate the empirical values of α_W, α_S, and α, while accounting for BGS in bins of 2.5% from π/π_0 = 0.1 to π/π_0 = 1.
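The rejection step can be sketched generically (a toy implementation of rejection-ABC for illustration; ABCreg itself uses a regression-adjusted acceptance scheme):

```python
import random

def abc_rejection(observed, simulate, priors, n_sims=5000, tolerance=0.025, rng=None):
    """Rejection-ABC: draw parameters from uniform priors, simulate summary
    statistics (here, a vector such as alpha(x)), and keep the `tolerance`
    fraction of draws whose statistics are closest (Euclidean distance) to
    the observed ones, forming an approximate posterior sample."""
    rng = rng or random.Random()
    draws = []
    for _ in range(n_sims):
        params = {name: rng.uniform(lo, hi) for name, (lo, hi) in priors.items()}
        stats = simulate(params, rng)
        dist = sum((s - o) ** 2 for s, o in zip(stats, observed)) ** 0.5
        draws.append((dist, params))
    draws.sort(key=lambda d: d[0])
    return [p for _, p in draws[: int(n_sims * tolerance)]]

# Toy illustration: recover a parameter "a" from two noisy summary statistics.
def toy_model(params, rng):
    a = params["a"]
    return [a + rng.gauss(0, 0.01), 2 * a + rng.gauss(0, 0.01)]

posterior = abc_rejection([0.5, 1.0], toy_model, {"a": (0.0, 1.0)},
                          n_sims=4000, tolerance=0.05, rng=random.Random(3))
```

The accepted draws concentrate around the generating value, and their spread shrinks as the tolerance or the simulation noise decreases.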
We conducted ABC inferences using two different kinds of data, as explained in the following sections. First, for simulated data, we bootstrapped the simulated replicates to generate 100 datasets to test the method. For empirical data, we followed the bootstrap procedure extensively described in Enard and Petrov (2020) and Di et al. (2021) to generate the control sets required to create null distributions and compare adaptation levels between VIP and non-VIP genes.
As summary statistics, we used the values of α(x) for x ∈ (2, 5, 20, 50, 200, 661, 925), x ∈ (2, 5, 20, 50, 200, 500, 700), or x ∈ (2, 4, 5, 10, 20, 30, 50, 70), depending on the input data and sample size. These values are proportionally similar to the ones used in Uricchio et al. (2019). However, we excluded singletons, because very low-frequency alleles are particularly sensitive to sequencing errors and to distortions due to demographic processes or other model misspecifications. We inferred posterior distributions for each bootstrapped dataset, each using the same 10^5 randomly sampled parameter sets as priors to estimate summary statistics. We set the tolerance threshold in ABCreg to 0.025, such that 2.5 × 10^3 values were accepted to form each posterior distribution.

Forward-in-time simulations
We used SLiM 3 (Haller and Messer 2019) to generate simulated sequence variation data under our model and to test the performance of our approach. We performed three different sets of simulations, with and without non-equilibrium demographics, in which we modeled diverse rates of adaptation and BGS. For each set of simulations, we considered a 5.5 Myr branch length, mimicking the human split from chimpanzee. In simulations accounting for demography, we added demographic events following a two-epoch model considering diverse expansion and bottleneck strengths. Each simulation represents a coding locus of 2 × 10^3 bp flanked on each side by a 10^5 bp locus containing deleterious alleles driving background selection. A total of 5 × 10^4 genes were simulated, accounting for 10^8 bp of coding sequence. We performed the simulations following previously estimated values of negative selection on human proteins (Boyko et al. 2008), where the distribution of deleterious effects follows a Gamma distribution with shape and scale parameters of 0.184 and 0.000402, respectively, implying a mean fitness effect of 2Ns = −457 for negatively selected non-synonymous alleles.
Weakly and strongly beneficial alleles followed point-mass distributions with population-scaled selection coefficients of 2Ns = 10 and 2Ns = 500, respectively. Our simulations supposed that 25% of mutations in each coding locus are synonymous and 75% are non-synonymous. We used a mutation rate within each coding locus of θ = 4Nμ = 0.001 and a mean human recombination rate in the flanking sequence of ρ = 4Nr = 0.001. Given these parameters, we set an α value of 0.4 for the equilibrium and Tennessen simulations, and we decreased α to 0.2 for the two-epoch simulations, to vary the rate of adaptation tested. Similarly, because the α(x) pattern can also be affected by the number of individuals sampled (n), we tested different sample sizes for the equilibrium (n = 500), Tennessen (n = 661), and two-epoch (n = 50) simulations. See Table 3 and Table 4 for the parameter values of the forward simulations. We rescaled the ancestral population size (N_e = 10,000) of the equilibrium and non-equilibrium model simulations according to our computational resources.

Human divergence and polymorphism
We also tested our model with empirical data. We retrieved polymorphism and divergence data for coding sequences from the human hg38 assembly. The human assembly, coding sequences, and annotations were retrieved from Ensembl release 109 (Cunningham et al. 2022). We estimated fixed substitutions in the human branch by Maximum Likelihood and ancestral sequence reconstruction based on human, chimpanzee, gorilla, and orangutan alignments. First, we used pblat (Wang and Kong 2019) to align human CDS sequences to the panTro6, gorGor6, and ponAbe3 assemblies downloaded from the UCSC database (Nassar et al. 2023). To ensure that short exons at the edges of CDS were included, we used the pblat -fine option and a minimum identity threshold (-minIdentity) of 60%, keeping only the best-scoring hit. If two or more hits had equal best scores, the hit with the highest percentage identity was retained. Next, chimpanzee, gorilla, and orangutan sequences were aligned back to the hg38 human assembly with pblat in order to verify that the overall middle genomic position of each isoform differed by no more than 1 kb. By doing so, we finally retrieved 17,538 protein-coding genes accounting for a total of 94,030 CDS isoforms, which allowed us to identify the best reciprocal orthologous hits. Then, human, chimpanzee, gorilla, and orangutan sequences were aligned using MACSE v2 to explicitly account for codon structure in protein-coding alignments (Ranwez et al. 2018). The total numbers of non-synonymous and synonymous fixations on the human branch were quantified from substitution maps using HyPhy v2.5 (Kosakovsky Pond et al.
2020). The process was divided into two steps. First, we fit the MG94xREV codon substitution model, a modification of the Muse and Gaut (1994) model that includes a generalized time-reversible mutation rate matrix. The fit used the MACSE alignments and the great-ape phylogenetic tree, with branch lengths re-estimated during fitting, allowing d_N/d_S estimation per branch. Second, each fitted model was used to reconstruct ancestral sequences and substitution maps. Fixations were summarized by gene, taking into account all possible isoforms. Sites present in multiple isoforms were counted only once, using the most constrained annotation; that is, a synonymous substitution was considered non-synonymous if the same position was non-synonymous in at least one of the gene's isoforms.
Polymorphic sites and derived allele frequencies were estimated from the 1000GP phase 3 high-coverage phased data (Byrska-Bishop et al. 2022) across seven African-ancestry populations. Non-synonymous and synonymous polymorphisms were annotated using VEP (McLaren et al. 2016). Ancestral and derived allele frequencies were estimated using Ensembl v109 human ancestral allele information from the EPO multi-alignments. Polymorphic sites were counted only if they overlapped the parts of human coding sequences that were previously aligned. In total, 17,538 orthologues were included in the analysis.

Bootstrap test
We followed the bootstrap procedure extensively described in Enard and Petrov (2020) and Di et al. (2021) to compare adaptation levels between VIP and non-VIP genes. The bootstrap procedure allows us to quantify adaptation specific to VIPs by comparing them with a control set that matches the VIPs on the number and average values of several confounding factors that can also determine the rate of adaptive evolution relative to the rest of the genome (Castellano et al. 2019b; Huang 2021). The bootstrap test relies on a straightforward control-set-building algorithm that adds control genes to a progressively growing control set (Enard and Petrov 2020), such that the growing control set has the same range of values of confounding factors as the tested set of genes of interest. Following this procedure, we avoid matching VIP genes individually with non-VIP controls and instead match the overall VIP set; otherwise, we would drastically reduce the pool of potential control genes. When growing the non-VIP control set, we set a 5% matching limit on the average of each confounding factor relative to the VIP set, and limit each non-VIP gene to at most 1.2% of the control set. In total, we match nine potential confounding factors between the VIP and non-VIP control datasets (other confounding factors previously included in Di et al. (2021) were found not to impact the present comparison of VIPs and matched non-VIPs):
• Average overall expression in 53 GTEx v7 tissues (The GTEx Consortium 2020). We used the log (base 2) of TPM (Transcripts Per Million).
• Expression (log base 2 of TPM) in GTEx lymphocytes. Expression in immune tissues may impact the rate of sweeps.
• Expression (log base 2 of TPM) in GTEx testis. Expression in testis might also impact the rate of sweeps.
• deCode recombination rates in 50 kb and 500 kb gene-centered windows. The average recombination rates in the gene-centered windows are calculated using the most recent deCode recombination map (Halldorsson et al. 2019). We use both 50 kb and 500 kb window estimates to account for the effect of varying window sizes on the estimation of this confounding factor (the same logic applies to other factors where we also use both 50 kb and 500 kb windows).
• GC content at the third position of codons in the aligned Ensembl coding sequences.
• The number of protein-protein interactions (PPIs) in the human protein interaction network (Luisi et al. 2015). The number of PPIs has been shown to influence the rate of sweeps (Luisi et al. 2015). We use the log (base 2) of the number of PPIs.
• The aligned coding sequence length.
• The proportion of immune genes. The matched control sets have the same proportion of immune genes as the tested set, immune genes being genes annotated with the Gene Ontology terms GO:0002376 (immune system process), GO:0006952 (defense response) and/or GO:0006955 (immune response) as of May 2020 (The Gene Ontology Consortium 2021).
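The greedy control-set construction described above can be sketched roughly as follows; the function name, data layout, and acceptance-rule details are illustrative assumptions, not the published implementation:

```python
import random

def build_control_set(vip_means, candidates, size=500, tol=0.05,
                      max_frac=0.012, seed=0, max_tries=200000):
    """Greedily grow a control set whose per-factor averages stay within
    `tol` (5%) of the VIP averages; each candidate gene may fill at most
    `max_frac` (1.2%) of the final set. `candidates` maps gene name to a
    vector of confounding-factor values."""
    rng = random.Random(seed)
    names = list(candidates)
    max_copies = max(1, int(max_frac * size))
    used, chosen = {}, []
    sums = [0.0] * len(vip_means)
    for _ in range(max_tries):
        if len(chosen) >= size:
            break
        name = rng.choice(names)
        if used.get(name, 0) >= max_copies:
            continue
        factors = candidates[name]
        n = len(chosen) + 1
        means = [(s + f) / n for s, f in zip(sums, factors)]
        # accept the gene only if every factor average stays within tolerance
        if all(abs(m - t) <= tol * abs(t) for m, t in zip(means, vip_means)):
            used[name] = used.get(name, 0) + 1
            chosen.append(name)
            sums = [s + f for s, f in zip(sums, factors)]
    return chosen
```

The key design point is that matching is enforced on the set-wide averages rather than gene by gene, which keeps the pool of usable control genes large.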

Evolutionary processes affecting adaptation inference
We used our forward simulations to retrieve polymorphism and divergence data with a priori known adaptation parameters. To input the data into our model, we pooled the SFS and number of fixations of 5 × 10⁴ genes. Note that this is about 2.5 times as many genes as appear in the human genome; we use this larger number of genes so that the trends in the simulated data are clear and not dominated by noise, while noting that noise may play a larger role in real datasets for species with limited proteome coverage.
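For concreteness, pooling per-gene spectra and computing α(x) from the pooled counts can be sketched as below; `pool_sfs` and `alpha_x` are illustrative names, and P(x) is taken as the cumulative count of polymorphisms at derived frequency x or above, following the definition used later in the text:

```python
def pool_sfs(per_gene_sfs):
    """Element-wise sum of per-gene SFS vectors (index i holds the count
    of variants with derived allele count i + 1) into one pooled spectrum."""
    pooled = [0] * len(per_gene_sfs[0])
    for sfs in per_gene_sfs:
        for i, c in enumerate(sfs):
            pooled[i] += c
    return pooled

def alpha_x(pn, ps, dn, ds):
    """alpha(x) = 1 - (D_S / D_N) * (P_N(x) / P_S(x)), where P(x) counts
    polymorphisms at derived frequency x or above (cumulative tail)."""
    tails = []
    tn = ts = 0.0
    for n_i, s_i in zip(reversed(pn), reversed(ps)):
        tn += n_i
        ts += s_i
        tails.append((tn, ts))
    return [1.0 - (ds / dn) * (n / s) for n, s in reversed(tails)]
```

With pooled non-synonymous and synonymous spectra plus the fixation totals, this yields one α(x) value per frequency class.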
As demonstrated previously, the frequency spectrum (Tataru et al. 2017) and α(x) (Uricchio et al. 2019) are substantially affected by the presence of weakly beneficial alleles when the rate of such mutations is relatively high (see Figure 1, Figure 2). The asymptote of the α(x) curve bends below the true value of α, which is caused by an excess of high-frequency non-synonymous variants under weak positive selection. This results in a downward bias of α estimates when the aMK test (Messer and Petrov 2013) is applied to the data (see Table 5). In real sequencing datasets we cannot a priori separate beneficial and deleterious alleles, but in our simulated datasets we can remove the weakly beneficial alleles and test whether this fixes the downward bias in aMK. When removing weakly beneficial alleles, we observed an increase in α estimates from aMK, which tend towards the true value of α (see Figure 1 and Table 5). In addition, α(x) can be substantially affected by BGS, especially when weakly beneficial alleles contribute to the frequency spectrum (see Figure 1). In cases where α is dominated by strong adaptation, both asymptotic values (accounting for all alleles, or just neutral and deleterious) tend to be similar, because strongly beneficial alleles are not substantially impeded by selective interference with linked deleterious variation.
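The asymptotic MK correction fits an exponential model α(x) = a + b·exp(−cx) and reports the fitted curve evaluated at x = 1 (Messer and Petrov 2013). A minimal sketch, using a grid search over c with closed-form least squares for a and b rather than the nonlinear optimizer used in practice:

```python
import math

def fit_asymptotic_alpha(xs, alphas, c_grid=None):
    """Fit alpha(x) = a + b*exp(-c*x) by grid search over c, with
    closed-form least squares for (a, b) at each c; return the fitted
    curve at x = 1, i.e. the asymptotic alpha estimate."""
    if c_grid is None:
        c_grid = [0.05 * k for k in range(1, 401)]  # c in (0, 20]
    n = len(xs)
    best = None
    for c in c_grid:
        e = [math.exp(-c * x) for x in xs]
        se = sum(e)
        see = sum(v * v for v in e)
        sy = sum(alphas)
        sey = sum(v * y for v, y in zip(e, alphas))
        det = n * see - se * se
        if abs(det) < 1e-12:
            continue
        # normal equations for the two-parameter linear fit at fixed c
        b = (n * sey - se * sy) / det
        a = (sy - b * se) / n
        sse = sum((a + b * v - y) ** 2 for v, y in zip(e, alphas))
        if best is None or sse < best[0]:
            best = (sse, a, b, c)
    _, a, b, c = best
    return a + b * math.exp(-c)
```

On data generated exactly from the exponential model, the grid search recovers the asymptote to high precision; on noisy α(x) curves it behaves like the usual curve-fitting approach.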
We also tested the effect of recent demographic events with a simulation of the Tennessen demographic model, specifically for the African continental group (Tennessen et al. 2012). We used demographic parameters following Adrion et al. (2020). To improve performance, we simulated the African population in isolation rather than including the full multi-population model; consequently, our simulations do not include the effects of migration on the frequency spectrum. We observed patterns similar to the equilibrium simulation regarding the overall shape and asymptotic values of α(x) (Figure 2). Since this model includes a recent and rapid population expansion, we observe distortions of the frequency spectrum at extremely low and high frequencies due to the excess of rare alleles relative to an equilibrium demographic model.
The same effects were observed in the two-epoch expansion model. Overall, we found the same expected differences in the aMK and α(x) curves in the two-epoch expansion, although α was reduced from 0.4 to 0.2 compared to the equilibrium and Tennessen simulations (see Figure 3 and Table 6). Hence, BGS and the proportion of weakly advantageous polymorphism drive the differences in the α(x) curves regardless of the demographic model and the overall α. Nonetheless, as in the Tennessen simulation, α(x) is biased at extremely low and high frequencies due to the excess of rare alleles, depending on the strength of the event. However, as in the Tennessen simulation, aMK estimates are almost unaffected compared to the equilibrium simulation when removing weakly advantageous mutations. That is not the case in the two-epoch bottleneck model. Although weakly advantageous polymorphism affects the α(x) curves similarly depending on the BGS strength, aMK estimates are biased compared to the equilibrium simulation. Since a population contraction makes selection less efficient at removing deleterious mutations, such scenarios lead to an increase in P_N and an underestimation of α, depending on the strength of the demographic event (see Figure 3 and Table 6). Because our method is based on α(x), such scenarios must be treated particularly carefully, and we discuss them in the following sections.

ABC evaluation using simulation
To compare true parameter values to inferred values, we estimated the mode and 95% CI of each posterior distribution obtained from ABC-MK as an approximation of α. Our method can distinguish both weak and strong adaptive contributions to α while performing reasonably accurate estimations. Table 7 shows the inferred values and the associated error for each parameter and simulation. In all cases, the posterior distribution overlaps the distribution of the true values from bootstrapped datasets. Figure 6 presents the simulated values and the mode estimates from posterior distributions on equilibrium simulations.
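Summarizing an ABC posterior by its mode and a quantile-based 95% CI can be sketched as follows; the histogram-peak mode estimator here is a simple stand-in for the density-based estimators typically used in ABC software:

```python
def posterior_summary(samples, bins=50, ci=0.95):
    """Point estimate (histogram mode) and quantile-based credible interval
    from a vector of posterior samples."""
    xs = sorted(samples)
    lo, hi = xs[0], xs[-1]
    width = (hi - lo) / bins
    counts = [0] * bins
    for x in xs:
        k = min(int((x - lo) / width), bins - 1)
        counts[k] += 1
    peak = counts.index(max(counts))
    mode = lo + (peak + 0.5) * width          # midpoint of the fullest bin
    a = (1.0 - ci) / 2.0
    q_lo = xs[int(a * (len(xs) - 1))]
    q_hi = xs[int((1.0 - a) * (len(xs) - 1))]
    return mode, (q_lo, q_hi)
```

The mode serves as the point estimate of α, and the quantile interval is reported alongside it; with enough accepted samples the two summaries are stable to the choice of bin count.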
We also tested our method using simulations performed under a two-epoch model, including moderate and strong ancient expansion and bottleneck demographics, and under the demographic model of Tennessen et al. (2012) (see Tables 7 and 8). Although the demographic events affect the number of segregating sites and the shape of the SFS (which is not modelled in our calculations), our method is reasonably robust to these distortions, including an ancient expansion in the ancestral population and a period of recent rapid growth, when excluding low-frequency variants (DAC < 5, i.e., DAF < 0.0038; Figures 5, 6, and 7). The inference is biased towards weak selection when we include variants at which the SFS is most distorted, such as singletons and very high-frequency variants. However, the overall value of α is not strongly affected (Figure 4). Such a pattern likely reflects the qualitative similarity between recent growth events and weakly beneficial alleles in terms of their effects on the SFS, as both disproportionately increase the number of non-synonymous variants at low frequency relative to an equilibrium model. In Figure 4, we explore a more comprehensive range of parameters using the best-performing set of summary statistics (i.e., excluding derived allele counts under 5). Tables 7 and 8 show the estimated parameters and their corresponding errors for both the Tennessen et al. (2012) and two-epoch model simulations.
We noted that in cases where the bottleneck in the ancestral population is stronger, our software estimated α less accurately than in the expansion scenarios (see Figure 7, Table 8). For such cases, we compared Grapes (Galtier 2016), a Maximum Likelihood software that fits the DFE to infer α, to our software. We ran Grapes using the GammaZero and GammaExpo models, since these models best fit the simulated DFE (see Material and Methods). As shown in Figure 8, the GammaZero model and ABC-MK perform similarly in all cases, but the inference also fails when the ancient bottleneck and the BGS are strong. Moreover, although the simulations included a higher proportion of weak adaptation, and the SFS included a significant amount of weakly adaptive variants (with parameters α = 0.2 and α_W = 0.1), we observed that the GammaExpo model provided less accurate estimates than the GammaZero model in all simulated scenarios (see Figures 7 and 3). Interestingly, although Grapes assumes independent segregating sites, we noted that it achieves accurate α estimates in the simulations including BGS. Note that Grapes implements the r nuisance parameter correction proposed in Eyre-Walker et al. (2006). The r parameter was initially designed to capture the effect of demographic changes on the fate of synonymous and non-synonymous mutations (Eyre-Walker et al. 2006). In practice, the r parameter allows each SFS category to account for its own mutation rate (see equations 6 and 7 from Galtier (2016)). Considering the frequency category i of the SFS, r_i is estimated as the relative effective mutation rate of frequency class i compared to singletons at neutral sites (Eyre-Walker et al. 2006; Galtier 2016; Tataru et al. 2017). Multiple studies have shown that correcting the expected SFS by r can at least partially absorb other SFS distortions, such as those produced by ascertainment bias, nonrandom sampling, population substructure, or linked selection (Galtier 2016; Tataru et al. 2017; Galtier and Rousselle 2020). Similarly, we found that the r parameter can absorb unexpected polymorphic patterns in α estimates due to linked selection (Galtier 2016; Tataru et al. 2017; Galtier and Rousselle 2020). Still, as noted in Galtier and Rousselle (2020), we observed that absorbing linked selection distortions leads to a significant underestimation of the inferred γ parameter by Grapes, as well as of the DFE shape values, resulting in largely biased DFE model inferences compared with ABC-MK, which recovers these parameters more precisely (see Figure 9, Figure 10, Table 9).
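Under the standard neutral model, the expected SFS count in frequency class i is proportional to 1/i, so a simple moment-style estimate of r_i from the synonymous SFS is r_i = i · P_S(i) / P_S(1). The sketch below illustrates that idea only; Grapes itself estimates the r parameters jointly by maximum likelihood:

```python
def r_parameters(ps_sfs):
    """Moment-style estimate of Eyre-Walker et al. (2006)-type r_i values
    from the synonymous (assumed neutral) SFS. `ps_sfs[0]` holds
    singletons (i = 1), so r_1 = 1 by construction. Under the standard
    neutral model E[SFS_i] is proportional to 1/i, hence the effective
    mutation rate of class i relative to singletons is i * P_S(i) / P_S(1)."""
    p1 = ps_sfs[0]
    return [(i + 1) * count / p1 for i, count in enumerate(ps_sfs)]
```

A perfectly neutral equilibrium SFS yields r_i = 1 for every class, while distortions from demography or linked selection show up as departures from 1 that the correction can then absorb.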

Application of ABC-MK: human virus-interacting proteins
As an example application, we estimated adaptation in human genes that interact with viruses. Several studies have found that viral infections have driven adaptation in the human genome (e.g., Nédélec et al. (2016); Castellano et al. (2019b)). Genomic analysis of patterns of variation within experimentally determined Virus-Interacting Proteins (VIPs) has repeatedly uncovered signals of both frequent and strong adaptation (Enard et al. 2016; Uricchio et al. 2019). Multiple lines of evidence suggest that the selective pressure imposed by viruses on their hosts is an important driver of strong adaptation during both long-term (Enard et al. 2016; Castellano et al. 2019b; Uricchio et al. 2019) and more recent human evolution (Deschamps et al. 2016; Racimo et al. 2017; Enard and Petrov 2018).
We recently found that RNA viruses have driven more recent human adaptation (in the past 50,000 years) at RNA virus-interacting VIPs (RNA VIPs) than DNA viruses at DNA virus-interacting VIPs (DNA VIPs) (Enard and Petrov 2018, 2020). Here we used ABC-MK to test whether the same was true during the millions of years of human evolution since the divergence from chimpanzees. To this end, we limited our analyses to a subset of VIP and non-VIP mammalian orthologs, because gene age can significantly impact the rate of adaptation: younger genes can evolve faster and experience mutations with stronger fitness effects as they approach their fitness optimum (Moutinho et al. 2022). Thus, to compare the impact of RNA and DNA viruses, we measured adaptation rates separately for the specific VIPs of nine different RNA viruses and the specific VIPs of six different DNA viruses, each with more than 180 VIPs (Figure 12).
We subset the genomic data corresponding to the 661 individuals of African descent sampled in the Thousand Genomes Project, and fixed substitutions on the human branch were identified by maximum likelihood based on alignments of human (hg38 assembly), chimpanzee (panTro6 assembly), and orangutan (ponAbe3 assembly). Adaptation rates in VIPs were estimated following the bootstrap procedure described in Enard and Petrov (2020) and Di et al. (2021) (see Materials and Methods). The bootstrap procedure allows us to build random non-VIP control sets that match the average values of several confounding factors that might explain the adaptation level rather than the virus interactions themselves. We estimated non-VIP null distributions by sampling 1,000 non-VIP sets of the same size as the analyzed VIP category. Because we defined P_N(x) and P_S(x) as any polymorphisms above frequency x, we excluded any variants above 70% frequency to avoid mispolarization errors (see Figure 10). α values for the VIP and non-VIP datasets were inferred through the mode and 95% CI estimates from the posterior distributions, following the previously described ABC scheme for each dataset.
We inferred values of α, α_W, and α_S that were very similar to those inferred in Uricchio et al. (2019).
Looking at the VIPs for specific RNA or DNA viruses, we find that a subset of RNA virus families drove the vast majority of adaptation in VIPs (Figure 11). More specifically, RNA viruses drove strong adaptation but not weak adaptation (Figure 11), with the VIPs of five RNA viruses (RHINOV, HIV, ZIKAV, IAV, and DENV) showing significantly elevated strong adaptation, versus only one DNA virus (HPV). We further find that this holds using either α or ω_a (Figure S1) (Galtier 2016) to compare VIPs and controls. These results are consistent with growing evidence that RNA viruses drove strong recent adaptation more frequently than DNA viruses (Enard and Petrov 2018, 2020).
Figure 3 Effect of weakly advantageous mutations in the presence of BGS and population size change strength. We simulated the effect of weakly advantageous alleles and BGS on α(x) under a two-epoch model. Each row represents a B value. Each column represents the strength of the contraction or expansion, assuming a total proportion of adaptive substitutions of α = 0.2 in the absence of BGS. The dotted line is the true simulated α. α_W was set to contribute half of the total α. Stronger BGS removes beneficial mutations from α(x).

Figure 1 ABC-MK ω_a inference on different RNA and DNA VIP categories. Violin plots and solid lines represent inferences on non-VIP bootstrapped datasets and VIP categories according to the virus interaction. The dashed line is the overall VIP inference.
We assign B values from McVicker et al. (2009) or Murphy et al. (2022) at each polymorphic or fixed site. Then, we adjust the fixation probability of deleterious alleles by the expected value of B, and the fixation probability of weakly beneficial alleles by the predicted reduction from equation (9). In practice, we bin sites into B value ranges, such that all sites within (for example) a 2.5% B value range of 0.675 to 0.7 experience the same N_e and the same strength of selective interference (for example, N_e = 0.6875N, the midpoint of this B value window). Because inferred background selection strength is unavailable for most species, our software can run without this information by inputting any expected B value range. Nonetheless, if downstream empirical datasets account for the inferred background selection strength estimated by McVicker et al. (2009) or Murphy et al. (2022), the analysis can be limited to an empirical B value or B value range as an approximation of the expected reduction in diversity.
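The 2.5% binning scheme can be sketched as a simple midpoint lookup; the window width and clamping behavior here are illustrative assumptions:

```python
def b_bin_midpoint(b, step=0.025):
    """Map a B value to the midpoint of its `step`-wide window, so that
    (for example) all sites with B in [0.675, 0.7) are modelled with the
    same local effective population size, Ne_local = 0.6875 * Ne."""
    n_bins = round(1.0 / step)
    k = min(int(b / step), n_bins - 1)   # clamp B = 1.0 into the last bin
    return (k + 0.5) * step
```

Every site in a window is then treated with the same rescaled N_e, so the fixation-probability adjustments only need to be computed once per bin rather than once per site.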

Table 1
Model parameter definitions

Table 2
ABC-MK prior values.

Table 4
SLiM simulated values for the two-epoch model simulations

Table 6 α and aMK estimations on the two-epoch model. Columns: N_anc, N, B, true α_W, true α, aMK (neutral + deleterious), aMK (all alleles).

Table 9
DFE inference on two-epoch model simulated datasets using ABC-MK and Grapes. Each method shows the mean ratio between the inferred and true values across simulation replicates.