BetaScan2: Standardized Statistics to Detect Balancing Selection Utilizing Substitution Data

Abstract Long-term balancing selection results in a build-up of alleles at similar frequencies and a deficit of substitutions when compared with an outgroup at a locus. The previously published β(1) statistics detect balancing selection using only polymorphism data. We now propose the β(2) statistic which detects balancing selection using both polymorphism and substitution data. In addition, we derive the variance of all β statistics, allowing for their standardization and thereby reducing the influence of parameters which can confound other selection tests. The standardized β statistics outperform existing summary statistics in simulations, indicating β is a well-powered and widely applicable approach for detecting balancing selection. We apply the β(2) statistic to 1000 Genomes data and report two missense mutations with high β scores in the ACSBG2 gene. An implementation of all β statistics and their standardization are available in the BetaScan2 software package at https://github.com/ksiewert/BetaScan.


Introduction
Balancing selection can maintain multiple alleles at a locus. Several scenarios can result in this type of selection, including heterozygote advantage or frequency-dependent selection. Recently, there have been significant methodological advances in the detection of balancing selection in population-level sequencing data (Siewert and Voight, 2017;DeGiorgio et al., 2014;Cheng and DeGiorgio, 2018;Bitarello et al., 2018). As a result of this recent interest, a number of specific balanced loci in a wide variety of species have been reported with strong observational or experimental evidence (Wheat et al., 2010;Network, 2015;Schweizer et al., 2018;Sano et al., 2018). These results suggest there may be more loci yet to be discovered.
Selection can maintain multiple balanced alleles for very long periods of time, referred to as long-term balancing selection. By doing so, selection increases the time to most recent common ancestor (TMRCA) (Charlesworth, 2006). The resulting signature can be detected to identify putatively balanced loci. Classically, this signature was defined as an excess of heterozygosity and a deficit of substitutions. This is due to variants building up around the balanced alleles through time but being prevented from fixing in the population (Hudson et al., 1987;Tajima, 1989). However, power analyses demonstrate that these classic methods have low power.
More recent methods detect a more precise signature in the site frequency spectrum, and as a result, are higher powered. Specifically, they detect a build-up of variants at highly similar frequencies (Siewert and Voight, 2017;Bitarello et al., 2018). This build-up is most likely a main contribution to the signal that is captured implicitly in methods that rely on simulations of balancing selection to generate an expected site frequency spectrum (DeGiorgio et al., 2014;Cheng and DeGiorgio, 2018). A build-up occurs because two allelic classes are maintained in a population, accumulating variation throughout time. These variants can fix within each haplotype class, and will therefore all be at the same frequency until their frequency is altered by recombination. This signature is not independent of excess heterozygosity, but is instead a more specific signature of the same underlying process.
Simulation-based methods have been shown to be the highest-powered methods, however they require additional types of data rendering them inapplicable in some situations (e.g., large grids of simulations of balancing selection, a sequenced genome from a closely related outgroup and/or genome-wide data in order to estimate the background site frequency spectrum; DeGiorgio et al., 2014;Cheng and DeGiorgio, 2018). In contrast, the b ð1Þ statistics are nearly as high powered, yet do not require additional types of data, enabling wide-spread application.
The previously developed method b detects the signature of alleleic class build-up around each SNP it is applied to (i.e., each core SNP) (Siewert and Voight, 2017) by looking for an excess of genomically proximate SNPs that have a very similar allele frequency to the core SNP. Conceptually, b is a weighted sum of SNP counts around each core SNP, where SNPs are weighted more if they have very similar frequencies to the core SNP. By calculating a b score in a window around each SNP, one can find loci in which the pattern of allele frequencies is consistent with the action of balancing selection. Mathematically, b is the difference between two estimators of the mutation rate b h, one sensitive to excess frequency similarity ( b h b ) and one that is not (Watterson's estimator b h W ) (Watterson, 1975). We previously reported two versions of b: b ð1Þ incorporates outgroup sequence data to call ancestral allele states, whereas b ð1ÞÃ only requires a folded site frequency spectrum.
We now report two improvements to b. The first is a new estimator based on the number of fixed differences with an outgroup species (i.e., substitutions), b h D . Incorporating this estimator into our b framework increases power, owing to the expected reduction of substitutions under balancing selection (Hudson et al., 1987;Charlesworth, 2006). Second, we derive the variance of our statistics, enabling normalization of b. This allows b scores to be properly compared across a range of parameters which can affect its distribution, a feature lacking from other summary statistics, with the exception of Tajima's D (Tajima, 1989).

Results and Discussion
We measure the power of our statistics using simulations conducted using SLiM 2 (supplementary information, Supplementary Material online) (Haller and Messer, 2017). We find that b ð2Þ has higher power than either b ð1Þ statistic, with the greatest gain in power at equilibrium frequencies 25% and 75%, demonstrating that substitution counts provide additional signal over polymorphism data ( fig. 1A, supplementary figs. 5 and 6,Supplementary Material online). When there is mutation rate variation across simulations, we find that standardization improves power ( fig. 1B).
Next, we compare power with alternative methods: NCD2 (Bitarello et al., 2018), NCD mid (Cheng and DeGiorgio, 2018), T1 and T2 (DeGiorgio et al., 2014), Tajima's D (Tajima, 1989), and the HKA test (Hudson et al., 1987). The power to detect balancing selection depends on underlying population parameters, including mutation rate, recombination rate and effective population size. We found that consistent with prior reports, the power of all methods increased with lower recombination rate (supplementary fig.  8, Supplementary Material online) and higher mutation rate (supplementary fig. 9, Supplementary Material online) (DeGiorgio et al., 2014;Siewert and Voight, 2017). This is because a higher mutation rate results in more SNPs fixing in allelic class, whereas a lower mutation rate results in less recombination decoupling SNPs from the balanced SNP. We found that power decreases with increasing population size (supplementary fig. 10, Supplementary Material online). This is expected because a higher population size increases the expected TMRCA at a neutral locus, resulting in a smaller difference in TMRCA between balanced and neutral loci.
We next scan the human genome for evidence of balancing selection, applying the b We next tested if b ð2Þ captures genomic regions which are likely true positives. Trans-species haplotypes between human and chimpanzee are an independent measure of balancing selection and perhaps the closest available thing to a set of true positive loci under long-term balancing selection. We found that SNPs in trans-species haplotypes from Leffler et al. (2013) had a significantly higher b ð2Þ std value than SNPs which are not (Mann-Whitney U P-value ¼ 1:08 Â 10 À14 ). We also compared the values of b ð2Þ std versus T2 in SNPs in trans-haplotypes. We found that after removing rare variants, the mean percentile for SNPs in these haplotypes was 0.63 for b ð2Þ std and 0.66 for T2. Before removing rare variants, the mean percentile for b ð2Þ std was 0.64 and T2 was 0.72. This indicates that b ð2Þ std may have additional noise when calculated on rare variants in real data due to demography that is not captured in the theoretical site frequency spectrum. These effects will be captured in the empirical neutral distribution used by T2, increasing its power. As with the b ð1Þ statistics, we therefore recommend only applying b ð2Þ to SNPs which are not rare. Because balanced variants are unlikely to be maintained at extreme equilibrium frequencies (Ewens and Thomson, 1970;Nei and Roychoudhury, 1973), this should not hurt power.
If b Two missense SNPs in the gene ACSBG2, rs17856650 and rs17856651, have standardized b scores in the top 99.9 percentile in the CEU population ( fig. 2), top 99.5 percentile in the CHB population and top 99 percentile in the YRI population. Confirming this finding, this gene was one of the top 100 loci detected using the T2 statistic (DeGiorgio et al., 2014), but has been previously uncharacterized. In all three populations, the percentile of the standardized b ð2Þ score is slightly higher than the unstandardized, demonstrating the power of standardization in regions with a lower mutation rate, such as within genes. The Acyl-CoA Synthetase Bubblegum Family Member 2 (ACSBG2) gene is involved in fatty acid metabolism (Pei et al., 2006). ACSBG2 may play a role in spermatogenesis (Fraisl et al., 2006), a process that has been previously associated with balancing selection (Bitarello et al., 2018). In addition, this gene was highlighted as harboring potentially deleterious lineage-specific nonsynonymous single nucleotide changes in bonobo (Han et al., 2019). However, the reported bonobo changes are at a different location in the gene than the two missense SNPs in humans.
The results presented in this manuscript demonstrate the power and wide-applicability of the BetaScan2 suite of statistics. In addition to the ability to standardize b and the implementation of b ð2Þ , BetaScan2 can calculate b std on a predefined window of interest, instead of requiring the use of a sliding window across the genome. This locus-based calculation is made possible by standardization allowing comparison of scores across different window sizes, and enables the use of b on species for which only a small fraction of the genome is available. BetaScan2 is freely available at https:// github.com/ksiewert/BetaScan.

Materials and Methods
We first derive a closed-form solution for the expected number of substitutions under the neutral model (supplementary information, Supplementary Material online). This expression leads to an estimator of the mutation rate b h D based upon the number of substitutions D: : Here T is the divergence time between the two species, N e is the effective population size, and n is the number of surveyed chromosomes. To incorporate information from substitutions, Under long-term balancing selection, variants nearby the selected site are maintained at similar allele frequencies rather than fixing in the population, resulting in an increased estimate of h b and reduced estimate of h D . Therefore, b is expected to score above zero in the presence of long-term balancing selection, whereas it will be below or around zero under neutrality. Intuitively, the variances of the b statistics depend on the population size, survey sample size, equilibrium frequency, and mutation rate. We derive the variances of b ð1ÞÃ and b ð2Þ (supplementary material, Supplementary Material online), allowing for comparison of scores across samples, window sizes and genomic loci where these factors may differ. Our expressions for variance match simulation results (supplementary figs. 1 and 2, Supplementary Material online). To obtain the variance of b ð1Þ we used the framework reported in Achaz (2009) (supplementary information, Supplementary Material online). We note that the expected value of all b statistics is zero. This leads to, for b ð2Þ for instance: Full derivations for b

Supplementary Material
Supplementary data are available at Genome Biology and Evolution online.