Inferring the distributions of fitness effects and proportions of strongly deleterious mutations

Abstract The distribution of fitness effects is a key property in evolutionary genetics as it has implications for several evolutionary phenomena including the evolution of sex and mating systems, the rate of adaptive evolution, and the prevalence of deleterious mutations. Despite the distribution of fitness effects being extensively studied, the effects of strongly deleterious mutations are difficult to infer since such mutations are unlikely to be present in a sample of haplotypes, so genetic data may contain very little information about them. Recent work has attempted to correct for this issue by expanding the classic gamma-distributed model to explicitly account for strongly deleterious mutations. Here, we use simulations to investigate one such method, adding a parameter (plth) to capture the proportion of strongly deleterious mutations. We show that plth can improve the model fit when applied to individual species but underestimates the true proportion of strongly deleterious mutations. The parameter can also artificially maximize the likelihood when used to jointly infer a distribution of fitness effects from multiple species. As plth and related parameters are used in current inference algorithms, our results are relevant with respect to avoiding model artifacts and improving future tools for inferring the distribution of fitness effects.


Introduction
The distribution of fitness effects (DFE) among new mutations has long been recognized as having a fundamental importance in evolutionary genetics since the shape of this distribution affects many evolutionary phenomena.For example, the DFE governs the rate of adaptive evolution (Fisher 1930;Schultz and Lynch 1997;Spigler et al. 2017) and the frequency (and hence prevalence) of mutations with different effect sizes (Haldane 1937;Agrawal and Whitlock 2012) and has important implication for phenomena such as inbreeding avoidance, the evolution of sex, and mating system evolution (Kondrashov 1985;Hartfield and Keightley 2012;Hedrick and Garcia-Dorado 2016).Because of this, the DFE has been extensively studied over several decades (for a review, see Eyre-Walker and Keightley 2007).Many important results on the DFE have emerged from this large body of literature; the DFE differs among species (Charlesworth and Eyre-Walker 2006;Castellano et al. 2018), and the shape depends on how welladapted the population is and its effective population size N e (Eyre-Walker 2002;Lynch and Conery 2003;Goldstein 2013;Huber et al. 2017).While the full DFE is probably complex and multimodal, parts of the DFE of deleterious mutations can be modeled as a leptokurtic gamma distribution (Loewe and Charlesworth 2006;Silander et al. 2007) or by a series of discrete bins showing the proportions of mutations with different effects (Keightley and Eyre-Walker 2010).
Most of these results on the DFE have been obtained through two experimental techniques: mutation-inducing experiments and mutation accumulation experiments.In the former, mutations are induced, and their effects are evaluated by comparing different strains of a model organism using some trait, such as fertility or growth rate, as a fitness proxy (Sanjuán et al. 2004).In the latter and more common approach, an ancestral strain is split into small, separate populations which are kept for many generations under favorable conditions that impose very little selection (Ohnishi 1977;Halligan and Keightley 2009;Böndel et al. 2019).The effects of accumulated mutations can then be evaluated by comparing the fitness proxy of the population strains with that of the ancestral strain.The DFE can also be estimated by sampling and comparing haplotypes from individuals in the wild (Dobzhansky and Wright 1941;Crow and Temin 1964;Eyre-Walker et al. 2006;Castellano et al. 2018).Data from the sampled haplotypes are often analyzed in the form of a site frequency spectrum (SFS) which describes the frequency with which different alleles segregate in the sampled population (Fisher 1931;Wright 1938;Evans et al. 2007).
One major challenge in DFE inference is measuring the occurrence of strongly deleterious mutations.Assuming deleterious alleles are not completely recessive, their mutation-selection frequency is generally proportional to 1/s, where s is the mutation's selection coefficient (Wright 1937;Crow and Kimura 1970), and mutations are less likely to be present in a sample of haplotypes https://doi.org/10.1093/g3journal/jkad140Advance Access Publication Date: 20 June 2023

Investigation
as they become more strongly deleterious due to the large selective disadvantage they entail.In other words, the probability distribution of mutational effects in segregating mutations only will almost certainly be different from the probability distribution of all mutations, since strongly deleterious mutations (with s ≫ 1/2N e ) have a tiny probability of showing up in a population sample (Crow and Kimura 1970;Reich and Lander 2001;Rice et al. 2015).This emphasizes the importance of distinguishing between an observed DFE, consisting of alleles segregating in the gene pool of a population at a given timepoint, and an input DFE that represents the actual probability distribution of mutational effects as they arise by mutation and are added to the gene pool.
Much work has been done on inferring what proportion of the DFE consists of strongly deleterious mutations; a study on human polymorphism data puts the estimate at <15% (Eyre- Walker et al. 2006).However, this proportion differs greatly among different branches of the tree of life.The proportion of lethal mutations was found to be approximately 40% in the vesicular stomatitis virus (Sanjuán et al. 2004) and even above 70% in Drosophila (Keightley and Eyre-Walker 2007).However, 2 mutations which are both described as strongly deleterious can still be orders of magnitude different in effect as mutations with, for example, s = 10 2 /N e and s = 10 5 /N e both satisfy s ≫ 1/2N e .As a result, modern DFE inference algorithms such as DFE-alpha (Eyre-Walker and Keightley 2007; Keightley et al. 2016) and polyDFE (Tataru et al. 2017;Tataru and Bataillon 2020) which all work by analyzing a SFS may struggle to describe the extent to which many different strongly deleterious mutations arise.This is partly because very strongly deleterious mutations are unlikely to show up in a genome sample.However, the number of nonsegregating sites in a genome sample (some of which are presumed to be under strong purifying selection) may provide information about the rate at which strongly deleterious mutations occur.In any case, it is hard to differentiate the effect sizes of mutations once the effect size is large, since such a mutation will have essentially no probability of appearing in a genome sample (Galtier and Rousselle 2020).
A few studies have attempted to account for these strongly deleterious mutations by fitting a DFE under the assumption that a certain fraction of mutations are so strongly deleterious that they will remain absent from a haplotype sample, regardless of N e (Eyre-Walker et al. 2006;Boyko et al. 2008;Elyashiv et al. 2010;Kim et al. 2017;Galtier and Rousselle 2020).In particular, Galtier and Rousselle (2020) used DFE methods to infer the mean N e s of mutations among species, thereby obtaining estimate of the difference in N e between different species under the assumption that species shared a common, gamma-distributed DFE.The field of DFE research has often parameterized gamma distributions through the usage of a shape parameter, often denoted β, and a mean value (Keightley and Eyre-Walker 2007).The optimal shape parameter of the gamma-distributed DFE varied greatly among species, but a good fit for similar shape parameters was obtained when a fraction of strongly selected mutations, with an effect size making them deleterious to the point of being "essentially independent of N e ," was included in the SFSs.Here, N e -independent mutations were defined as those deleterious enough so that their probability of showing up in a haplotype sample is essentially zero regardless of N e .Galtier and Rousselle (2020) implemented this approach by assuming that a proportion of 0 ≤ p lth ≤ 1 mutations were either lethal or so strongly selected that they never appeared in a SFS; each entry in the selected SFS was subsequently scaled by 1 − p lth .The method yielded a high model likelihood but produced surprising results such that estimates of the proportion of strongly deleterious mutations (including lethal and nearly lethal mutations) in Drosophila being over 50%, which is well above an order of magnitude larger than previous and frequently used estimates (lethal or very strongly deleterious mutations are estimated to occur at a rate of approx.10% of the rate of mildly deleterious mutations; Simmons and Crow 1977;Crow and Simmons 1983, also see Keightley and Eyre-Walker 2007).While this method provides a way of accounting for strongly deleterious mutations that will be absent from genomic data, the properties and accuracies of this approach have yet to be fully investigated.
Here, we investigate to what extent explicitly accounting for strongly deleterious mutations in a gamma-distributed DFE model improves the accuracy of inference when inferring an input DFE from simulated and genomic data.We do this by formulating a genetically explicit individual-based Wright-Fisher model to simulate steady-state observed DFEs under different strengths of selective effects, then infer the DFEs using the state-of-the-art DFE inference software polyDFE (Tataru et al. 2017).We show that explicitly accounting for a proportion of strongly deleterious mutations can increase the accuracy of inference when inferring the DFE of a single population.However, we also demonstrate that including this parameter artificially increases the inferred proportion of strongly deleterious mutations when considering SFSs from multiple different populations.We further show that the p lth resulting in the best model fit is not equivalent to the true proportion of lethal or strongly deleterious mutations in both cases.

Methods
To simulate the observed DFE of a population, we model a diploid, sexually reproducing Wright-Fisher population with discrete generations and a default population size of N = 500.Individuals have two different chromosomes, each with two homologs.Both chromosomes have a map length of [0, R] where we used R = 10 such that any real number r with 0 ≤ r ≤ R denotes a unique site on the chromosome; the potential number of sites is therefore effectively infinite (e.g.Roze and Rousset 2009).In the first chromosome, mutations are nonneutral, although mutations that are effectively neutral can occur, while in the second chromosome, mutations are fully neutral.The sites on the neutral and nonneutral chromosome are used to calculate a neutral and nonneutral SFS, respectively.These are then used as input datasets for the DFE inference software (see below).
Individuals go through a life cycle similar to that of SFS_CODE (Hernandez 2008).Specifically, individuals are sampled with a probability relative to their fitness standardized by the mean fitness of their sex.The fitness of the ith individual, w i , is defined as Thus, fitness is the product of the effects of all Nhet heterozygous mutations and all Nhom homozygous mutations, thereby using a model of selection where individual effects are additive within a locus (i.e.h = 1/2 for all mutations) and multiplicative between loci.Selection coefficients s are sampled from a gamma distribution (see parameterization below).Parents each provide 1 of their 2 homologs to their offspring after a number of recombination events, sampled from a Poisson distribution with mean R = 10, has occurred.Mutations then occur in offspring, and the number of mutations is sampled from a Poisson distribution with a mean of U = 1.0 mutations per chromosome per generation, yielding a per genome mutation rate similar to that of some eukaryotes (Haag-Liautard et al. 2007).The number of neutral and nonneutral mutations is sampled independently with the same mean.The selection coefficients of new mutations are sampled from a gamma distribution with a mean of S = 2N e s and a shape parameter of β.Sampled values of S are rescaled to s by calculating s = S/2N e using the estimated N e for that parameter set (see below).
Offspring are created by randomly sampling a mother and a father for each offspring, until N viable (w > 0) offspring have been produced with a constant 1:1 sex ratio.Once N viable offspring have been formed, these will constitute the new parent generation.
To estimate N e , we added a single, neutral, linked locus to the center of the nonneutral chromosome and mutated the locus at a Poisson-distributed rate with a mean of 1 by adding a random value sampled from N(0, 1) to the allelic value, thus having the locus satisfy V m = 1, which is the new genetic variance from mutation appearing per generation (Lynch and Hill 1986;Barton and Turelli 1989;Keightley and Otto 2006).The N e of the population was defined as the mean (over 50 replicates) steady-state variance at the neutral linked locus after a burn-in period of 10N generations.Without selection, the expected variance at this locus is NV M = N e V M (Lynch and Hill 1986), and this was confirmed by simulation (Supplementary Information 1, Supplementary Figure 1).Mutational effects were sampled as 2N e s (Hernandez 2008), then rescaled by dividing the sampled value with 2N e before applying it in Eq. 1.To accomplish this, N e was estimated before the Wright-Fisher simulation by plotting a standard curve of estimated steady-state N e against an assumed N e (the value used to rescale 2N e s to s) for a given gamma distribution of selective effects.This allowed us to know the approximate steady-state N e which would emerge in our Wright-Fisher simulation before we used these to simulate data for SFS construction (Supplementary Information 2, Supplementary Figure 2).
Data for calculating SFSs were outputted from the simulation after a burn-in period of 10N generations, after which the full neutral and selected genomes of 10 random individuals (i.e.20 haplotypes) where outputted and used to calculated 1 neutral and 1 selected unfolded SFS.Unfolded SFSs were used throughout.
To infer the mean scaled selection coefficients of mutations S = 2N e s and the shape parameter β, we used polyDFE v1.11 (Tataru et al. 2017;Tataru and Bataillon 2020).To determine whether these parameters could be obtained using our simulation setup, we estimated them from SFSs calculated from simulated populations with the same S and β combinations that Tataru et al. (2017) used to test the inference of purely deleterious DFEs (see their Fig. 4).We found that polyDFE could accurately infer them from our simulations (Supplementary Information 3, Supplementary Figure 3) For each single-species polyDFE run, an input file with 50 neutral and 50 selected SFS was used, wherein each pair of neutral and selected SFS was calculated from the basis of a single replicate of a Wright-Fisher simulation (hence 50 Wright-Fisher simulations per polyDFE run set for single-species DFE).The assumed sequence lengths were 20,000 base pairs.For the multispecies polyDFE runs, an input file with 20 neutral and 20 selected SFSs from 3 different species was used (60 SFSs).The polyDFE command lines and init-files used are available in Supplementary Information 4.
The concept of p lth was previously used when fitting a single DFE model to a collection of SFSs obtained from multiple different species (Galtier and Rousselle 2020).To examine the efficacy and validity of p lth for fitting multispecies DFEs, we simulated various gamma-distributed DFEs in different populations, calculated output SFSs for each population, and inferred DFEs for datasets containing SFSs from multiple different populations.We repeated this analysis with DFEs inferred from genomic data from different species (Chen et al. 2017).
For single-species inferences, we first used the model described to simulate populations with S = 2N e s in the range of [1,000,10,000] with increments of 1, 000 and β = 0.4.Each simulation was replicated 50 times (500 replicates in total).For each simulation set, we then inferred S and β using polyDFE under 10 different p lth values in the range [0.0, 0.9] with increments of 0.1.The parameter p lth was implemented, following Galtier and Rousselle (2020), by setting an initial p lth value and scaling each entry of the selected SFS by 1 − p lth .For each p lth value, 10 polyDFE runs were performed (1,000 polyDFE runs in total) since the performance of polyDFE is affected by the initial S value used.For each combination of p lth value and simulation set, half of the polyDFE runs were run using randomly sampled initial values of the estimated parameters.These values were sampled from a uniform real distribution U(1,000, 10,000) for S and U(0.2, 0.8) for β.For the other half of the polyDFE runs, the true values of S and β were used as the initial value of the estimated parameters (following Tataru et al. 2017, Supplementary file 1, page 3).To thoroughly test how p lth affects the overall DFE inference under weaker selection, we also studied the effect of p lth on S and β inference given different combinations of S = 50, 250, 2,500 and β = 0.15, 0.40, 0.65.For the simulation sets with S = 2N e s in the range of [1,000, 10,000] with increments of 1,000 and β = 0.4, we used standard results from probability theory to calculate the true proportion of lethal mutations resulting from these distributions (Appendix A) and compared it to the value of p lth under which the highest accuracy of inference was found.
We also studied the effects of p lth on inference of multispecies DFEs from more than 1 species and in particular if β is fixed between species.This was studied with both arbitrarily defined weakly deleterious DFEs (which were used because they contained hardly any strongly deleterious mutations), and DFEs inferred from genomic data (Chen et al. 2017(Chen et al. , 2020)).First, we simulated 3 Wright-Fisher populations with different DFEs with the shape parameters β and means S of β = 0.15, S = 5, β = 0.40, S = 20, and β = 0.65, S = 40.We then constructed polyDFE input files using 20 neutral and selected SFSs from each population.Using polyDFE, S and β were estimated from this input file under 4 different p lth values using similar procedure of replication and definition of initial values as for the single-species case.Second, using DFEs estimated from genomic data, we ran the same analysis by simulating the DFEs estimated for populations of Arabidopsis lyrata, Capsella grandiflora, and Zea mays using the data of Chen et al. (2017).We compared the likelihood of the estimated parameters under 4 different p lth values, so that they cover cases when the initial values were correctly defined and when these were purposely highly inaccurate.Following Galtier and Rousselle (2020), we also studied the effect of p lth on S inference from these DFEs given different fixed values of β (as opposed to jointly estimated).Confidence intervals of estimates shown in plots were calculated as ̅ x being the mean value, n being the sample size, and Z being the 97.5 percentile value of the t-distribution given n − 1 degrees of freedom.

Analytical results
To illustrate the utility of including a p lth parameter in DFE inference and determine where it might be most useful, we first present analytical results demonstrating which deleterious alleles are unlikely to be present in a population sample.Assume that mutations are introduced at a rate θ = 4N e μL for N e the effective population size, μ the per-site mutation probability per generation, and L the number of sites where selected sites are likely to arise [e.g.nonsynonymous sites (Galtier 2016)].A focal mutation has a scaled selection coefficient S = 2N e s as a heterozygote and is in the population at a frequency x.If we sample n haploid genomes, then the expected probability that this mutation is present in i of n samples is (Tataru et al. 2017, Equation 1): Where B indicates the binomial distribution of sampling i selected alleles from n genomes and H the expected time that the allele spends between frequencies x and x + δx (Wright 1938): Note there are a few differences between Eq. 3 and the version in Tataru et al. (2017).Eq. 2 scales s by 2N e whereas Tataru et al. (2017) use 4N e scaling, due to the different manner in which heterozygote fitness is calculated.That is, we consider mutations to be additive with a dominance coefficient of h = 1/2, such that the fitness effect of a single heterozygous and homozygous mutation is 1 − sh and 1 − s, respectively.S is also multiplied by −1 in Eq. 3, so a positive value denotes a deleterious mutation with that selection magnitude.
Using Poisson random field theory (Sawyer and Hartl 1992;Sethupathy and Hannenhalli 2008), the probability of observing a certain number of sites with i selected alleles is Poissondistributed with mean E(P = i) (Eq.2).Hence, the probability that no sites will have i selected alleles is e −E(P=i) , and the probability that no copies of the selected allele are present equals . The repository with the code used for all simulations also contains a Mathematica notebook to perform these calculations (see Data availability).
Figure 1 outlines the probability that no selected allele will be present in a sample of genomes, for different sample sizes and mutation rates.We see that if the mutation rate is low (θ = 0.1), then even weakly deleterious mutations (S > 10) are unlikely to be present.For θ = 1, while a small sample is also unlikely to contain strongly deleterious mutations (S > 100), a large sample size of 100 is likely to capture some strongly deleterious mutations with 100 < S < 1,000.A very large sample can capture even more strongly deleterious mutations if the mutation rate is high (θ = 10).However, under all cases considered, it becomes unlikely to sample extremely deleterious mutations (S approaching 5,000), even with large sample sizes and mutation rates.Overall, for realistic sample sizes and mutation rates, once S exceeds 100, it becomes difficult to capture strongly deleterious mutations in a population sample, and it is in this parameter space that p lth might be useful in capturing the proportion of these deleterious mutations.A p lth -type parameter might also be necessary to calculate the fraction of all deleterious variants if the population mutation rate is very low.

Model fitting for single-species DFEs
In simulations with a high S, the accuracy of inference of S and β was affected by p lth (Fig. 2, a and b), and the inaccuracy of inference in 4/10 of all cases was lowest under a nonzero value of p lth (Fig. 2, c and d).In these cases, the inaccuracy of inference (measured as the difference between log 2 (estimate/simulation) and 0 such that log 2 (estimate/simulation) = 0 is maximum accuracy of inference) decreased by a value between 0.13 and 0.99.In terms of the actual estimates of S, this means that in the 4 cases where p lth improved the accuracy of inference, S was estimated 3-49% (depending on simulation set) more accurately than in the corresponding set where p lth = 0.0.In general, S inference becomes more inaccurate as p lth increases; however, the accuracy of inference of β is less affected unless p lth is very high (Fig. 2).At low mutation rates, high accuracy of inference can be achieved under a high p lth , although this comes at the expense of the accuracy of inference of β (see Supplementary Information 7).
Under low values of S and varying values of β, nonzero values of p lth generally increased the inaccuracy of inference, highlighting Fig. 1.The probability that a deleterious mutation with scaled selection coefficient S will not be present in a sample of n genomes.Results are plotted for different n, θ values as denoted by the legend.Note that we do not plot S < 10 since calculations do not assume neutrality and break down as S approaches 0. Strongly deleterious mutations are less likely to end up in a sample of haplotypes, making it difficult to differentiate the effects sizes by analyzing site frequency spectra.that p lth only improves the accuracy of inference when S is high (Fig. 3).
By calculating the true proportion of lethals (Appendix A), which can be used as a lower bound for the proportion of strongly deleterious mutations, we found that the p lth value resulting in the highest accuracy of inference (Figs. 2 and 3) were not equivalent to the true proportion of lethals (Fig. 4), with the true proportions being much higher than those inferred using p lth .

Model fitting for multispecies DFEs
We simulated weakly deleterious DFEs for multiple species that had essentially 0 probability of yielding lethal mutations (Supplementary Information 5, Supplementary Figure 4).Despite this setup, we found that the likelihood of the estimated parameters could be artificially maximized when assuming a high p lth value, when inferring a joint set of DFE parameters from SFSs across different species (Fig. 5a).When a DFE model is fitted to SFSs from multiple species, the model likelihood is artificially maximized because a nonzero value of p lth pushes the entries of the different SFSs closer together by making them numerically similar (Fig. 6).That is, species with very different SFSs will seem to have more similar SFSs under a high value of p lth , because p lth will result in the entries of these different SFSs all being closer to 0. Because of this, fitting a single DFE model to joint SFS data yields a better model likelihood once p lth is applied.Thus, we found that p lth can erroneously result in a high likelihood of a model despite essentially no strongly deleterious mutations being present.
p lth can also artificially maximize the likelihood of DFE models when simulating DFEs estimated from genomic data (Fig. 5b).Whether the likelihood of the DFE model levels off or keeps increasing with p lth depends on the sampled initial values for the polyDFE runs.For high p lth values, the initial values sampled for each polyDFE run becomes increasingly important.This seems to happen as the likelihood of the inferred DFE model drops substantially if the initial S fed into polyDFE was low, and the SFSs are modified by a high p lth value resulting in a seemingly very deleterious selected SFS.Because of this effect, 95% confidence intervals on the mean likelihood increase as p lth increases (Fig. 5).The parameter p lth can also artificially maximize the likelihood of a DFE model in cases where β is fixed and S is inferred, similar to the approach used by Galtier and Rousselle (2020) (Fig. 7).

Exploring wider parameter space
The effect of p lth on the accuracy of inference was also tested under larger population sizes and a larger number of sampled haplotypes (Supplementary Information 6).While p lth was less useful for correcting the underestimation of the mean S when a larger population size was used, the results remained qualitatively similar such that p lth had a more positive effect on the accuracy of inference as S was increased.The ability of p lth to artificially For each simulation set (differing in S; color), bottom panels show the subset that had the highest accuracy of inference of S. For these subsets, the inaccuracy of inference for c) S and d) β is shown.A value of 0 is equivalent to perfect accuracy of inference.The results show that for each simulation set (these differ in their value of S), the best accuracy of inference (i.e.lowest inaccuracy of inference) is found when p lth is either 0 or very low (p lth = 0.1, p lth = 0.2) when a DFE model is fitted to SFS data from a single species.
maximize the likelihood of a model fitted to SFS data from multiple species was also tested under larger simulated population and sample size, and this yielded a result that was qualitatively the same (Supplementary Information 6, Supplementary Figures 5 and 6).That is, p lth will also artificially maximize the likelihood of a DFE model based on multiple species when these species have larger populations.This makes intuitive sense given the mechanism through which p lth seems to cause the artificial maximization of likelihood; that is, the entries of different SFSs will become numerically similar as a higher p lth is used, regardless of the sample size.
We also investigated inference in a single species using a 5-fold lower mutation rate.Here, higher values of p lth can result in the best accuracy of inference for S; however, this happens at the expense of poor accuracy of inference of β (Supplementary Information 7).

Discussion
Improving the accuracy with which we can infer the DFE is important since this is a fundamental parameter of evolutionary genetics with implications for many branches of biology.Constructing an accurate description of the DFE boils down to constructing a model which can describe a wide range of effect sizes of mutations at a broad range of frequencies and hence can be applied to many different species.This is not a trivial problem, because while several models consisting of parametric distributions have been suggested, including gamma, exponential, and lognormal distributions, the evidence shows the DFE of deleterious mutations follows some complex and multimodal distribution rather than a simple parametric one (Nielsen and Yang 2003;Eyre-Walker and Keightley 2007;Kousathanas and Keightley 2013).Thus, fitting a single parametric distribution model to a deleterious DFE can result in a biased estimate, and   A).Red lines show the interval containing p lth value which resulted in the highest accuracy of inference (Figs. 2 and 3).The results show that the lower bounds for true proportion of lethals for each simulation set (points) are higher than the p lth which results in the best accuracy of inference (0.0-0.2, i.e. between the red lines).
this bias is especially prone to affect the inferred proportion of strongly deleterious mutations (Kim et al. 2017).Our results illustrate that once strongly deleterious mutations are present, expanding the typical gamma distribution model of a DFE to one which accounts for a proportion of strongly deleterious mutations (Galtier and Rousselle 2020) can in principle improve the accuracy of inference, although it can artificially increase the likelihood of the inferred parameters when a single DFE is fitted to data from different species or populations.
We show that when inferring a DFE from a single species, a model under which p lth > 0 can sometimes yield the highest likelihood even when the DFE does not contain a fraction of p lth strongly deleterious mutations.This result highlights that p lth is not the true proportion of lethals and should not be viewed as such.Future developments might produce a more useful correcting factor that accounts for underrepresentation of strongly deleterious mutations while being compatible with classic parametric distributions such as gamma distributions.This would likely involve first calculating how much underestimating might occur, assuming some gamma-distributed DFE, and part of answering this would be to calculate the mean effect size of mutations with 2N e s > 2N e .
Further, the classic notion of a model likelihood is not welldefined when using p lth , since p lth is used to multiply all entries of the SFSs by 1 − p lth .This rescaling results in the data itself being modified and consequently 2 DFE models with differing values of p lth will be models on different datasets.This means that the likelihood of 2 different DFE models with different p lth values cannot be compared in the same way.This explanation does not mean that there is a problem with the concept of model likelihood in general; rather, it instead means that model likelihood cannot be used for model selection in the same way when p lth is used, since different p lth models are effectively attempting to explain different datasets.
Given SFS data from a single species, our results illustrate that DFE inference performs better on strongly deleterious DFEs when entries in the input SFS are modified by some value p lth to reflect the proportion of strongly deleterious mutations that will not end up in a haplotype sample.This is in line with the results of Fig. 4A in Tataru et al. (2017) showing that mutational effects are underestimated when the input DFE is strongly deleterious, indicating that strongly selected mutations are not fully detected or represented in the haplotype sample.This is one parameter space where including p lth may help with improving the accuracy of inference, although the resulting p lth values are unlikely to reflect the actual proportion of strongly deleterious mutations.This means that although including some value p lth may improve the model likelihood, p lth cannot reliably be used to infer the proportion of strongly deleterious mutations.A recent study wherein a DFE model was derived on completely different principles (by considering gene regulatory networks and metabolic pathways) also concluded that no parametric distribution suffices to describe the DFE of strongly deleterious mutations and that a satisfactory model must involve some extra class of strongly deleterious mutation akin to p lth (Brajesh et al. 2019).These findings suggest our conclusion has broad generality beyond the Wright-Fisher model.In practice, information about the optimal value of p lth to use when fitting a DFE model to a particular species may eventually be determined a priori based on the species in question.For example, an application of Fisher's geometric model (Fisher 1930) to DFE theory yields the prediction that as the complexity of an organism increases, a larger proportion of the DFE should be strongly deleterious (Lourenço et al. 2011;Tenaillon 2014).This has been tested several times and found to be consistent with data by using the level of pleiotropy as a proxy for the complexity of an organism (Martin and Lenormand 2006;Huber et al. 2017).
It is well-known that the likelihood of an individual DFE model drops when fitted to multiple species (Huber et al. 2017;Galtier and Rousselle 2020), which highlights that the DFE differs among species (Eyre-Walker 2002;Lynch and Conery 2003).In this study, we also show that p lth can artificially improve the likelihood of parameter estimates when a single DFE is fitted to multiple species.This result has several important implications.First, it illustrates the point that while the data strongly suggests the DFE of deleterious mutations usually follows a gamma and lethal model, it can still result in a misleading description of the DFE when fitted to data from multiple species.Second, it shows that since p lth consists of a modification of the data, we should bear this in mind when assessing the likelihood of a model; we find the likelihood of a DFE model is maximized under a high p lth in cases where the true DFE is known to be very weakly deleterious and known to not contain a proportion of p lth strongly deleterious mutations.Because of this, we can conclude that if the likelihood of a given DFE model for multiple species is maximized under the assumption of some p lth > 0, this should not be considered evidence that the DFEs being modeled does indeed contain a proportion of p lth strongly deleterious mutations.Third, our results offer an explanation of the mechanism by which this effect occurs, namely, that p lth makes entries of different SFSs more numerically similar, thereby inflating the likelihood of the resulting DFE model.Several studies model DFEs as following a gamma and lethal model, or gamma and "point mass" model (Eyre-Walker et al. 2006;Elyashiv et al. 2010), but when such models involve applying a correction factor to SFSs, they seem unsuitable for fitting a single DFE to data from a group of species.While our study is limited to purely deleterious DFEs, more work on improving the accuracy of inference for DFEs with beneficial mutations is also needed, since recent work shows that the occurrence of strongly beneficial mutations (much like strongly deleterious mutations) can make DFE inference inaccurate (Booker 2020).Even in the study of purely deleterious DFEs, attention has almost exclusively been focused on the effects of SNPs (resulting from point mutations), but the DFE of INDELs (insertions and deletions) is understudied, and accounting for such mutations might require a slightly different modeling approach (Barton and Zeng 2018).As in our study, mutational effects are most often assumed to be additive; however, the average dominance coefficient of new mutations appears to be substantially lower than h = 1/2 (Mukai et al. 1972;Simmons and Crow 1977;Lynch et al. 1999;Fernández et al. 2004;Spigler et al. 2017).While it is implicitly assumed that current DFE inference algorithms remains accurate when the assumption of additive dominance is violated, very little work has been done to test this assumption (but see Wade et al. 2022).Similarly, other "gamma + lethal" DFE models exist, and since these differ slightly from p lth implemented in Galtier and Rousselle (2020), testing their ability compared to the p lth implementation of a "gamma + lethal" DFE model is a topic worthy of further research (Boyko et al. 2008;Kim et al. 2017).
Experimental evidence suggest that the DFE of deleterious mutations follow a bimodal or perhaps even multimodal distribution, meaning that a good model fit may not be possible under classic parametric distributions such as exponential, gamma, or lognormal (Nielsen and Yang 2003;Eyre-Walker and Keightley 2007;Kousathanas and Keightley 2013).Because of this, it would be worthwhile for future research to explore whether DFE model fitting under different distribution could result in an improve  model fit.In a recent study, the DFE was represented by using a nonparametric distribution in the form of several nonoverlapping uniform distributions (Johri et al. 2020).Some of the current DFE inference software can also be set to infer a DFE where selection coefficients take discrete values rather than necessarily conforming to a single continuous distribution (Tataru et al. 2017).Expanding the set of distributions typically used for the DFE model could prove to be fertile grounds for new research and result in more accurate models.

Conclusion
While the parameter p lth can in principle improve the accuracy of inference, obtaining a good model fit under some nonzero p lth value should not be viewed as evidence for a proportion of p lth mutations segregating in the population in question.This is because it can be shown that p lth is not equivalent to the true proportion of lethals.When inferring a single DFE for a group of species or populations, the usage of p lth is also problematic since it modifies the data, resulting in different SFSs becoming more alike and artificially increases the likelihood of the model inferred.Thus, comparing the likelihood of two models with different p lth values cannot be done in a standard way, since these two models will effectively have different data.We have presented a detailed study of some of the problems with p lth as a concept which will be useful to anyone modeling the DFE, especially with regard to avoiding model artifacts.

Fig. 2 .
Fig. 2. Mean inaccuracy of inference for a) S and b) β under the assumption of different values of p lth for different Wright-Fisher simulation sets (differing in the simulation S, ranging from 1,000 to 10,000).Each point shows the mean value of 10 polyDFE replicates.Bars denote 95% confidence intervals (if they are not visible, then they lie within the point).For each simulation set (differing in S; color), bottom panels show the subset that had the highest accuracy of inference of S. For these subsets, the inaccuracy of inference for c) S and d) β is shown.A value of 0 is equivalent to perfect accuracy of inference.The results show that for each simulation set (these differ in their value of S), the best accuracy of inference (i.e.lowest inaccuracy of inference) is found when p lth is either 0 or very low (p lth = 0.1, p lth = 0.2) when a DFE model is fitted to SFS data from a single species.

Fig. 4 .
Fig. 4. The true proportion of strongly deleterious mutations as a function of S (same values as used in Fig. 2) given β = 0.40 and N e = 500 (AppendixA).Red lines show the interval containing p lth value which resulted in the highest accuracy of inference (Figs.2 and 3).The results show that the lower bounds for true proportion of lethals for each simulation set (points) are higher than the p lth which results in the best accuracy of inference (0.0-0.2, i.e. between the red lines).

Fig. 5 .
Fig. 5. Mean likelihood with 95% confidence intervals returned by polyDFE for a multispecies DFE model against 10 p lth values.β was inferred, and S was assumed fixed at a randomly sampled value.polyDFE runs were conducted on 20 neutral and selected SFSs from 3 different simulated DFEs.a) Model likelihood for inference based on simulated weakly deleterious DFEs with the shape parameters β and means S of β = 0.15, S = 5, β = 0.40, S = 20 and β = 0.65, S = 40 combined into 1 data input file.b) Model likelihood for inference based on the DFE of 3 populations simulated parameterized with data from Chen et al. (2017) using DFEs inferred for A. lyrata, C. grandiflora, and Z. mays.The results show that (1) model likelihood can be maximized under a high value of p lth despite little or no strongly deleterious mutations being present in the DFE because p lth makes different SFSs more similar and (2) this effect can also occur for DFE estimated from natural populations.

Fig. 7 .
Fig. 7. Mean likelihood returned by polyDFE for estimates of S given combinations of fixed β and p lth .A multispecies DFE where 20 neutral and selected SFSs from DFEs based on those estimated for A. lyrata, C. grandiflora, and Z. mays by Chen et al. (2017) were combined was used as data input for polyDFE.The dashed line show the mean β estimated for A. lyrata, C. grandiflora, and Z. mays.The results again shows that model likelihood can be artificially maximized under the assumption of a high value of p lth , and the extent to which this happens depends on the initial values supplied to the DFE inference software.

Fig. 6 .
Fig. 6.Value of the 20 entries of the SFSs of 3 different simulated DFEs (black, red, and blue) under 4 different p lth values: a) p lth = 0.0, b) p lth = 0.3, c) p lth = 0.6, and d) p lth = 0.9.Confidence intervals fall within points as shown.The results show that increasing the p lth value pushes the entries of different SFSs together by making them more numerically similar.