Abstract

Estimating the distribution of fitness effects (DFE) of new mutations is of fundamental importance in evolutionary biology, ecology, and conservation. However, existing methods for DFE estimation suffer from limitations, such as slow computation speed and limited scalability. To address these issues, we introduce fastDFE, a Python-based software package, offering fast, and flexible DFE inference from site-frequency spectrum (SFS) data. Apart from providing efficient joint inference of multiple DFEs that share parameters, it offers the feature of introducing genomic covariates that influence the DFEs and testing their significance. To further simplify usage, fastDFE is equipped with comprehensive VCF-to-SFS parsing utilities. These include options for site filtering and stratification, as well as site-degeneracy annotation and probabilistic ancestral-allele inference. fastDFE thereby covers the entire workflow of DFE inference from the moment of acquiring a raw VCF file. Despite its Python foundation, fastDFE incorporates a full R interface, including native R visualization capabilities. The package is comprehensively tested and documented at fastdfe.readthedocs.io.

Introduction

The distribution of fitness effects (DFE) of new mutations is central to our understanding of how natural selection shapes genetic variation. Inferring the DFE usually involves contrasting the site-frequency spectrum (SFS) of putatively neutral (synonymous or intergenic) and selected (nonsynonymous) sites (Keightley and Eyre-Walker 2007; Galtier 2016; Tataru et al. 2017). The intuition behind this approach is that beneficial alleles are more likely to segregate at higher frequencies than deleterious ones, implying that the SFS provides information on the fitness of new mutations. Summarizing all mutations’ fitness naturally gives rise to the DFE, which essentially is a probability distribution of the selection coefficients of mutations at selected (nonsynonymous) sites.

To account for distortions to the selected SFS due to demography, the neutral SFS is commonly used, which is assumed to be subject to demography alone. Demography is either modeled explicitly, or by the introduction of nuisance parameters which are fit so as to re-scale the observed neutral SFS to match the standard Kingman SFS (Keightley and Eyre-Walker 2007; Tataru et al. 2017). After accounting for demography, the remaining distortions to the selected SFS are assumed to be entirely due to selection.

SFS-based methods for estimating the DFE summarize population genetic variation by quantifying the number of alleles at specific frequencies and discarding details about the specific sites where these frequencies are exhibited. As a result, we obtain one DFE for all variants collectively. This approach provides limited information when considering a single DFE, but its utility significantly increases when comparing DFEs across multiple species, populations, or genomic regions (Chen et al. 2021; Moutinho et al. 2022; Latrille et al. 2023). Existing methods that can infer more than one DFE at a time are either notoriously slow and sometimes prone to numerical instabilities (Tataru and Bataillon 2019) or offer only limited parameter sharing capabilities (Galtier 2016). Here we introduce fastDFE, which is specifically designed to facilitate and encourage the joint inference of multiple DFEs at once, allowing for the introduction of genomic covariates, and covering the entire workflow of DFE inference from the moment of acquisition of a raw VCF file.

Model and Enhancements

fastDFE primarily builds upon the model of polyDFE, overcoming its constraints such as extensive computation time and limited scalability (Tataru et al. 2017; Tataru and Bataillon 2019). As such, it provides the same functionalities such as parametric bootstrapping, customizable DFE parametrizations, nested model comparison, and joint inference. Similarly, the impact of demography is accounted for by the introduction of nuisance parameters (cf. A.2 in the appendix).

To obtain the expected SFS given mutations having a specific selection coefficient, fastDFE makes use of the expected allele frequency sojourn times from Poisson random field theory (Sawyer and Hartl 1992; Sethupathy and Hannenhalli 2008) (A.1). Let E[Psel(i, S)] denote the expected number of derived alleles at frequency i given (population-scaled) selection coefficient S = 4Nes (Figure A.1). Obtaining the expected SFS given a specific DFE requires integrating E[Psel(i, S)] over the DFE, i.e. ∫−∞ E[Psel(i, S)]ϕr(S)dS, where ϕr(S) denotes a density describing the DFE with regard to some chosen parametrization r. The actual result, i.e. inferring the DFE given an SFS, is then achieved by maximum-likelihood optimization with respect to r, where a composite Poisson likelihood for the SFS is used (A.4).

In fastDFE, the above integral is discretized, precomputed, and cached, leading to an at least 1000-fold performance improvement over polyDFE while producing very similar results (Table B.1, Figures B.1, B.2, and B.3). One issue with polyDFE is its numerical integration procedure, which has difficulties to properly capture the DFE's mass in regions where it is highly concentrated. One example of this is the oft-used Γ distribution with low shape parameters. fastDFE circumvents this problem by describing the DFE by means of its cumulative distribution function (CDF), thus providing more reliable estimates (A.5).

When inferring DFEs for multiple datasets, it is essential to standardize the derivation of the SFS to ensure results are directly comparable. To aid this process, fastDFE comes equipped with a VCF parser, enabling the extraction of the necessary SFS input data from raw VCF files. In order to obtain multiple SFS, a single VCF file can optionally be stratified with regard to some genomic properties, the stratification into a neutral and selected SFS being one example. To further aid the data preparation process, sites can be filtered and annotated with respect to degeneracy and ancestral state. Distinguishing between ancestral and derived alleles is important as it greatly improves the accuracy of DFE estimates, especially for beneficial mutations. The implemented ancestral-allele annotation utility uses outgroup information and follows the probabilistic model of EST-SFS (Keightley and Jackson (2018)). At last, utilities are provided to determine the number of mutational target sites when monomorphic sites are not present in the VCF file at hand.

Examples

Basic Inference

In the following Python code snippet, we demonstrate how to use fastDFE to infer the DFE for a single population. First, we define the spectra for both neutral (sfs_neut) and selected (sfs_sel) sites for a sample size of n = 8, leveraging data from a Scandinavian silver birch study (Sendrowski (2022)). We also parametrize the DFE using GammaExpParametrization—a common parametrization consisting of a mixture of a (reflected) gamma and an exponential distribution for deleterious and beneficial selection coefficients, respectively. We use this information to instantiate a BaseInference object, which serves to infer a single DFE. graphic

Upon carrying out the inference process, we proceed to visualize the results (Fig. 1). Note that an equivalent R script would have significant similarities to the provided Python example. graphic

Output from code snippet in Basic Inference. Top left: observed neutral and selected SFS. Top right: observed (selected) and modeled (selected) SFS. Bottom left: parameter estimates. For GammaExpParametrization, Sd and Sb represent the average (population-scaled) strength of deleterious and beneficial mutations, respectively. Parameter b denotes the shape of the gamma distribution and pb the probability of a mutation being beneficial, whereas ɛ is the ancestral misidentification parameter (A.3). α denotes the expected proportion of beneficial nonsynonymous substitutions and is not a real parameter, but rather a function of the inferred DFE. Bottom right: discretized DFE estimate where S = 4Nes is the population-scaled selection coefficient. Vertical bars indicate 95% confidence intervals. DFE, distribution of fitness effects; SFS, site-frequency spectrum.
Fig. 1.

Output from code snippet in Basic Inference. Top left: observed neutral and selected SFS. Top right: observed (selected) and modeled (selected) SFS. Bottom left: parameter estimates. For GammaExpParametrization, Sd and Sb represent the average (population-scaled) strength of deleterious and beneficial mutations, respectively. Parameter b denotes the shape of the gamma distribution and pb the probability of a mutation being beneficial, whereas ɛ is the ancestral misidentification parameter (A.3). α denotes the expected proportion of beneficial nonsynonymous substitutions and is not a real parameter, but rather a function of the inferred DFE. Bottom right: discretized DFE estimate where S = 4Nes is the population-scaled selection coefficient. Vertical bars indicate 95% confidence intervals. DFE, distribution of fitness effects; SFS, site-frequency spectrum.

VCF Parsing

fastDFE offers a versatile parser extracting site-frequency spectra from VCF files while allowing for stratifications of the SFS. The stratified spectra can be directly used for either joint or marginal DFE estimation. In the following code example, we obtain the SFS used in Basic Inference. We determine the site-degeneracy on-the-fly by passing DegeneracyAnnotation, which requires both a reference (FASTA) and annotation (GFF) file to be specified. DegeneracyStratification is then used to stratify the SFS into selected (0-fold degenerate) and neutral (4-fold degenerate) sites. Ancestral alleles are determined by passing MaximumLikelihoodAncestralAnnotation, which requires the VCF to contain one or more outgroup individuals. Since the specified VCF file only contains biallelic sites, we also provide TargetSiteCounter, which determines the number of monomorphic sites for each stratification by sampling sites from the provided FASTA file. The resulting spectra are depicted in Figure C.1. graphic

Joint Inference

In comparative analyses, the ability to stratify and associate site-frequency spectra with covariates (such as genomic features or species-specific properties) can prove very useful. Joint inference enables the sharing of an arbitrary number of parameters across multiple datasets, henceforth referred to as types. Parameters that are not shared are estimated marginally, i.e. for each type separately. However, when using many types with marginal parameters, the large number of parameters to be optimized can make it difficult to find good optima. Covariates address this by introducing a customizable, typically linear, relationship with a specific parameter across all types. This reduces the number of additional parameters to one per covariate and allows for the testing of covariate significance through a likelihood ratio test against the null model, where the parameter in question is constant across all types.

In the example below, we access a configuration file containing data from a study testing the adaptive walk model of gene evolution in A. thaliana (Moutinho et al. 2022). The underlying premise is that younger genes experience less deleterious mutations than older, well-adapted genes. The study provides site-frequency spectra stratified by gene, which are associated with a number of proxies for gene age. Notably, we focus on the relative solvent accessibility (RSA) as our proxy of interest. Younger genes tend to have larger RSA scores because of their shorter length, resulting in a larger proportion of exposed residues due to their greater surface-to-volume ratio. The spectra for each gene, after being downsampled to 20 individuals, are grouped into 10 quantiles based on their RSA score. The covariate for each bin is taken to be the mean RSA score over all its genes. A linear relationship is then introduced with Sd—the average strength of deleterious mutations (cf. this script). All remaining parameters are specified to be shared (i.e. constant) across all types, and the significance of the included covariate is determined by comparing the likelihood against the null model in which Sd is constant across all bins. graphic

We can now visualize the linear relationship between Sd and RSA score, as well as the inferred joint DFE (Figure 2). Indeed, younger genes, being associated with larger RSA scores, are found to exhibit less deleterious mutations. graphic

Output from code snippet in Joint Inference. Left: values of Sd versus. RSA score for the different bins. Sd is restricted to vary linearly with RSA. Right: inferred joint DFE. The low P-value (indicated in the plot title) confirms the high significance of covariate inclusion. Vertical bars indicate 95% confidence intervals. DFE, distribution of fitness effects.
Fig. 2.

Output from code snippet in Joint Inference. Left: values of Sd versus. RSA score for the different bins. Sd is restricted to vary linearly with RSA. Right: inferred joint DFE. The low P-value (indicated in the plot title) confirms the high significance of covariate inclusion. Vertical bars indicate 95% confidence intervals. DFE, distribution of fitness effects.

Discussion

fastDFE aims to improve on existing DFE estimation software, particularly in the realm of joint inference and data preparation. It offers a very efficient solution for joint inference—both performance-wise and by allowing for the introduction of covariates. It is thus well suited for comparative analyses, whether across species, populations, or varying genomic contexts. The VCF parsing and annotation features greatly facilitate the data preparation process, potentially minimizing data handling errors while making different datasets more comparable. Customizable DFE parametrizations by means of the CDF also offer increased flexibility and reliability.

The availability of a Python and R interface instead of a traditional command-line approach aims to make DFE estimation more customizable. In conclusion, we hope that fastDFE will contribute to new insights in evolutionary biology, ecology, and conservation—especially within the scope of comparative studies. The software is freely available at github.com/Sendrowski/fastDFE, and comprehensive documentation as well as usage examples can be found at fastdfe.readthedocs.io.

Supplementary Material

Supplementary material is available at Molecular Biology and Evolution online.

Acknowledgments

The authors would like to thank Sylvain Glémin for discussion on model implementation and Julien Dutheil for his helpful comments on the manuscript. The authors also acknowledge Martin Lascoux and Luis Leal (Formas, Grant no. 2020-01456) for kindly providing the birch dataset.

Funding

This work was supported by the Novo Nordisk Foundation (Data Science Collaborative Research Programme, Grant no. 0069105).

Data Availability

The data underlying this article are available on Zenodo, at https://zenodo.org/doi/10.5281/zenodo.10947575.

References

Chen
J
,
Bataillon
T
,
Glémin
S
,
Lascoux
M
.
Hunting for beneficial mutations: conditioning on SIFT scores when estimating the distribution of fitness effect of new mutations
.
Genome Biol Evol.
2022
:
14
(
1
):
evab151
. https://doi.org/10.1093/gbe/evab151.

Galtier
N
.
Adaptive protein evolution in animals and the effective population size hypothesis
.
PLoS Genet.
2016
:
12
(
1
):
e1005774
. https://doi.org/10.1371/journal.pgen.1005774.

Keightley
P
,
Eyre-Walker
A
.
Joint inference of the distribution of fitness effects of deleterious mutations and population demography based on nucleotide polymorphism frequencies
.
Genetics
.
2007
:
177
(
4
):
2251
2261
. https://doi.org/10.1534/genetics.107.080663.

Keightley
P
,
Jackson
B
.
Inferring the probability of the derived versus the ancestral allelic state at a polymorphic site
.
Genetics
.
2018
:
209
(
3
):
897
906
. https://doi.org/10.1534/genetics.118.301120.

Latrille
T
,
Joseph
J
,
Hartas´anchez
D
,
Salamin
N
. Mammalian protein-coding genes exhibit widespread beneficial mutations that are not adaptive. bioRxiv 2023.05.03.538864. https://doi.org/10.1101/2023.05.03.538864, 4 May 2023, preprint: not peer reviewed.

Moutinho
A
,
Eyre-Walker
A
,
Dutheil
J
.
Strong evidence for the adaptive walk model of gene evolution in Drosophila and Arabidopsis
.
PLoS Biol.
2022
:
20
(
9
):
e3001775
. https://doi.org/10.1371/journal.pbio.3001775.

Sawyer
SA
,
Hartl
DL
.
Population genetics of polymorphism and divergence
.
Genetics
.
1992
:
132
(
4
):
1161
1176
. https://doi.org/10.1093/genetics/132.4.1161.

Sendrowski
J
. Demography of birch populations across Scandinavia, 2022. https://www.diva-portal.org/smash/record.jsf?pid=diva2%3A1675636.

Sethupathy
P
,
Hannenhalli
S
.
A tutorial of the Poisson random field model in population genetics
.
Adv Bioinformatics.
2008
:
2008
:
257864
. https://doi.org/10.1155/2008/257864.

Tataru
P
,
Bataillon
T
.
polyDFEv2.0: testing for invariance of the distribution of fitness effects within and across species
.
Bioinformatics
.
2019
:
35
(
16
):
2868
2869
. https://doi.org/10.1093/bioinformatics/bty1060.

Tataru
P
,
Mollion
M
,
Glémin
S
,
Bataillon
T
.
Inference of distribution of fitness effects and proportion of adaptive substitutions from polymorphism data
.
Genetics
.
2017
:
207
(
3
):
1103
1119
. doi: .

This is an Open Access article distributed under the terms of the Creative Commons Attribution-NonCommercial License (https://creativecommons.org/licenses/by-nc/4.0/), which permits non-commercial re-use, distribution, and reproduction in any medium, provided the original work is properly cited. For commercial re-use, please contact [email protected] for reprints and translation rights for reprints. All other permissions can be obtained through our RightsLink service via the Permissions link on the article page on our site—for further information please contact [email protected].
Associate Editor: Deepa Agashe
Deepa Agashe
Associate Editor
Search for other works by this author on:

Supplementary data