Abstract

Summary

The fixation index FST can be used to identify non-neutrally evolving loci from genome-scale SNP data across two or more populations. Recent years have seen the development of sophisticated approaches to estimate FST based on Markov-Chain Monte-Carlo simulations. Here, we present a vectorized R implementation of an extension of the widely used BayeScan software for codominant markers, adding the option to group individual SNPs into pre-defined blocks. A typical application of this new approach is the identification of genomic regions, genes, or gene sets containing SNPs that evolved under directional selection.

Availability and implementation

The R implementation of our method, which builds on the powerful population genetics and genomics software PopGenome, is available freely from CRAN.

Supplementary information

Supplementary data are available at Bioinformatics online.

1 Introduction

Genomic analyses that calculate FST values frequently aim to separate neutral contributions to FST caused by population histories from locus-specific contributions due to non-neutral evolution. However, different populations have different histories and are influenced by different factors that affect mutations, genetic drift, and migration (Kronholm et al., 2010). As a consequence, the neutral distribution of FST values strongly depends on the organisms and populations considered. The separation between non-neutral and neutral (population history-based) FST contributions can be made explicit in Bayesian approaches that employ Markov-Chain Monte-Carlo (MCMC) methods to estimate FST (Beaumont and Balding, 2004; Foll and Gaggiotti, 2008; Villemereuil and Gaggiotti, 2015). The BlockFeST method introduced in this paper is based on the work of Beaumont and Balding (2004), which establishes an FST based hierarchical Bayesian model to detect loci that evolve non-neutrally. This Bayesian approach uses a logistic regression model to distinguish between locus-specific effects (selection) and population-specific effects (reflecting the common history of all loci). Foll and Gaggiotti, 2008, extended this work through a reversible jump MCMC method (Green, 1995), which calculates an explicit posterior probability for non-neutral evolution at each locus (implemented in BayeScan, http://cmpg.unibe.ch/software/BayeScan). Whereas these previous algorithms allowed the analysis of either haplotypes or of individual single nucleotide polymorphisms (SNPs), we propose to group neighboring SNPs that likely experienced the same evolutionary pressures into blocks, e.g. by gene, by gene set or by sliding window. This joint analysis of co-evolving SNPs increases the statistical power of the approach, allowing us to more reliably detect directional population-specific selection.

2 Materials and methods

To identify non-neutrally evolving blocks of DNA from SNP data, BayeScan (Foll and Gaggiotti, 2008) allows two possible strategies: either to consider each SNP as an independently evolving unit, or to consider haplotypes that combine all SNPs for one block of DNA. Considering each SNP separately may lead to statistical problems due to non-independence, and ignores prior knowledge (the strong linkage of SNPs within one co-evolving block). Conversely, using haplotypes can produce misleading results at moderate to high mutation rates, as each mutation introduces a new haplotype (Hudson, 1992; Riebler et al., 2008). To combine the strengths of both approaches, we use a slightly modified logistic regression model to estimate region-specific FST:
log(FSTij1FSTij)=αI+βj
(1)

Here, FSTij measures the differentiation between subpopulation j and the ancestral population at locus i within region I. αI is a parameter specific to region I (capturing region-specific deviations from neutral evolution), while βj is a parameter specific to population j (capturing locus-independent effects, i.e. consequences of population history). We consider the likelihood contribution of each SNP separately, but assume that the locus-specific effect αI is identical for all SNPs located in the same DNA block I. This means that also in the jump model, all SNPs in one block would be classified as evolving either neutrally (αI=0) or non-neutrally (αI0). Note that in this modified model, the ancestral population is still modelled as a Dirichlet distribution that considers the allele frequencies of each SNP independently. Recent studies have shown that the reversible jump model tends to produce false positives, particularly for balancing selection (Lotterhos and Whitlock, 2014; Luu et al., 2017). Thus, we restrict the model to the detection of positive (directional) selection. Moreover, in previous simulations (Pfeifer, 2014), we found that the FST values obtained after the initial pilot runs used by BayeScan to tune the reversible-jump model are almost as informative as the corresponding posterior values. Pilot runs are initial MCMC iterations (without the reversible jump model) to estimate the distribution of the model parameters α and β. Thus, we propose a simple sampling scheme to verify significant outlier loci subject to local adaptation based on the distribution of the αI values observed after the MCMC iterations (approximated through a region-specific normal distribution N(αI)) without using a time consuming reversible jump model for testing the null hypotheses: (i) For each region I, sample two values xI,yIN(αI), resulting in two distributions of sampled values Dx and Dy across regions. (ii) For each region I, increment its counter cI if yI is above the q-quantile for Dx. (iii) Repeat (i–ii) Niter = 1000 times.

3 Simulation and results

To validate our algorithm and to compare it to alternative approaches, we performed extensive simulations using the MSMS program (Ewing and Hermisson, 2010), which allows coalescent simulations for a structured population under selection. We assume a population with an effective population size of Ne=10 000 that split into two subpopulations 4000 generations ago. After the splitting event, there is no migration between the two populations (island model). To evaluate the performance of the methods to detect selection in cases of variable pairwise population differentiation, we also study a three-population model including an ancestral population (divergence mode) (see Supplementary Material for details and the MSMS calls). We assessed the performance of BlockFeST and compared it to those of BayeScan (Foll and Gaggiotti, 2008) applied to individual SNPs, FST (Hudson, 1992), and the recently published PCA method implemented in the R-package pcadapt (Luu et al., 2017) (see Supplementary Material). For pcadapt and BayeScan, we used the P-values and calculated the sum of logs for each region to compare it to FST and BlockFeST.

We explored a wide range of simulation scenarios, varying recombination rates, selection strengths and coalescent times. Specifically, we studied the robustness of the approaches regarding the effect of neutrality eroding the signal of selection (Supplementary Figs S3–S4 and S8–S9). While FST and pcadapt are the computationally fastest algorithms (Supplementary Table S1), we found that BlockFeST is most powerful and accurate in almost all cases, especially when the signal of local adaptation is not yet eroded by recombination, i.e. in cases of recent positive selection (Fig. 1 and Supplementary Figs S1–S9). For the divergence model, pcadapt and BlockFeST perform comparably well. While the measure of precision shows slightly better results for pcadapt in the divergence model simulations, we obtain higher AUC values for BlockFeST also here, indicating a better balance between sensitivity and specificity (Fig. 1B and Supplementary Figs S4–S9). BayeScan produces elevated P-values, falsely pointing to balancing selection even when no such signal is present, confirming the results in Lotterhos and Whitlock, 2014. This behaviour highlights the tendency of the jump model to produce false positives especially in cases of hierarchical population structures. Despite its simplicity, FST performs well in comparison to the computationally more involved alternatives, with good detection of intermediate signals of positive selection. Overall, we conclude that BlockFeST, which groups SNPs into blocks and calculates empirical P-values, provides faster and more reliable detection of region-specific directional selection compared to the standard application of BayeScan, and performs favourably also compared to alternative approaches.

Fig. 1.

(A) Island model. BlockFeST, pcadapt, FST, and BayeScan are compared in terms of precision, defined as the fraction of true positives recovered in the top 5% of predictions across varying recombination rates. (B) Divergence model. ROC curves for a three-population model, where population 1 and 2 split 4000 generations ago, when directional selection set in, after having split from the ancestral population 8000 ago; here, the recombination rate is set to zero. For a summary of additional simulation results, see Supplementary Figures S1–S9

Acknowledgements

We thank Pablo E. Verde, Gregory Ewing, Stefan Habenschuss, Laura Rose and Juliette de Meaux for helpful discussions.

Funding

M.J.L. acknowledges financial support from the German Research Foundation (DFG grants EXC 1028 and CRC 680).

Conflict of Interest: none declared.

References

Beaumont
 
M.A.
,
Balding
D.J.
(
2004
)
Identifying adaptive genetic divergence among populations from genome scans
.
Mol Ecol
.,
13
,
969
980
.

Ewing
 
M.
,
Hermisson
J.
(
2010
)
MSMS: a coalescent simulation program including recombination, demographic structure and selection at a single locus
.
Bioinformatics
,
26
,
2064
2065
.

Foll
 
M.
,
Gaggiotti
O.
(
2008
)
A genome-scan method to identify selected loci appropriate for both dominant and codominant markers: a Bayesian perspective
.
Genetics
,
180
,
977
993
.

Green
 
P.J.
(
1995
)
Reversible jump Markov chain Monte Carlo computation and Bayesian model determination
.
Biometrika
,
82
,
711
732
.

Hudson
 
R.R.
 et al.  (
1992
)
Estimating of levels of gene flow from DNA sequence data
.
Genetics
,
13
,
583
589
.

Kronholm
 
I.
 et al.  (
2010
)
Influence of mutation rate on estimators of genetic differentiation: lessons from Arabidopsis thaliana
.
BMC Genet
.,
11
,
33
.

Lotterhos
 
K.E.
,
Whitlock
M.C.
(
2014
)
Evaluation of demographic history and neutral parameterization on the performance of FST outlier tests
.
Mol. Ecol
.,
23
,
2178
2192
.

Luu
 
K.
 et al.  (
2017
)
pcadapt: an R package to perform genome scans for selection based on principal component analysis
.
Mol Ecol Resour
.,
17
,
67
77
.

Pfeifer
 
B.
(
2014
) Whole-genome population genomic analyses. Dissertation, Landesbibliothek der Heinrich-Heine-Universitaet Duesseldorf.

Riebler
 
A.
 et al.  (
2008
)
Bayesian variable selection for detecting adaptive genomic differences among populations
.
Genetics
,
178
,
1817
1829
.

Villemereuil
 
P.
,
Gaggiotti
O.
(
2015
)
A new FST-based method to uncover local adaptation using environmental variables
.
Methods Ecol Evol
.,
6
,
1248
1258
.

This article is published and distributed under the terms of the Oxford University Press, Standard Journals Publication Model (https://academic.oup.com/journals/pages/open_access/funder_policies/chorus/standard_publication_model)
Associate Editor: Inanc Birol
Inanc Birol
Associate Editor
Search for other works by this author on:

Supplementary data