Robust and window-insensitive mixture model approaches for localizing balancing selection

Long-term balancing selection typically leaves narrow footprints of increased genetic diversity, and therefore most detection approaches only achieve optimal performances when sufficiently small genomic regions (i.e., windows) are examined. Such methods are sensitive to window sizes and suffer substantial losses in power when windows are large. This issue creates a tradeoff between noise and power in empirical applications. Here, we employ mixture models to construct a set of five composite likelihood ratio test statistics, which we collectively term B statistics. These statistics are agnostic to window sizes and can operate on diverse forms of input data. Through simulations, we show that they exhibit comparable power to the best-performing current methods, and retain substantially high power regardless of window sizes. Moreover, these methods display considerable robustness to high mutation rates and uneven recombination landscapes, as well as an array of other common confounding scenarios. Further, we applied these statistics on a bonobo population-genomic dataset. In addition to the MHC-DQ genes, we uncovered several novel candidate genes, such as KLRD1, involved in viral defense, and NKAIN3, associated with sexuality. Finally, we integrated the set of statistics into open-source software named BalLeRMix, for future applications by the scientific community.

Nevertheless, all extant approaches are limited by their sensitivity to the size of the region that the statistics are computed on (hereafter referred to as the "window size"). Because the footprints of long-term balancing selection are typically narrow Charlesworth, 2006), small windows with size comparable to that of the theoretical footprint are commonly used in practice to optimize their power, especially for summary statistics. However, such small fixed window sizes lead to increased noise in the estimation of each statistic, and the uneven recombination landscape (Smukowski and Noor, 2011) across chromosomes ensures that a fixed window size is not suitable for all genomic regions. On the other hand, when an overly large genomic region is considered as the window, true signals will likely be overwhelmed by the surrounding neutral regions, diminishing method power as shown by Cheng and DeGiorgio (2019). Available model-based approaches (DeGiorgio et al., 2014;Cheng and DeGiorgio, 2019) could have been made robust to large window sizes if they instead adopted the SFS expected under a neutrally-evolving population of constant size as the null hypothesis, because their model of balancing selection for the alternative hypothesis converges to this constant-size neutral model for large recombination rates. However, this neutral model does not account for demographic factors that can impact the genomewide distribution of allele frequencies, such as population size changes. To guard against such demographic influences, the model-based T 1 and T 2 statistics (DeGiorgio et al., 2014;Cheng and DeGiorgio, 2019) employ the genome-wide SFS instead, compromising the robustness against large windows. Moreover, Cheng and DeGiorgio (2019) showed that although the power of the T 2 statistic decays much slower than other approaches as window size increase, the loss of power is still substantial.
In this article, we describe a set of composite likelihood ratio test statistics that are based on a mixture model ( Figures 1A and B), which allows for the robust and flexible detection of long-term balancing selection regardless of window size. Dependent on the types of data available, we propose a set of five likelihood ratio test statistics termed B 2 , B 2,MAF , B 1 , B 0 , and B 0,MAF , which respectively accommodate data with substitutions and derived (B 2 ) or minor (B 2,MAF ) allele frequency polymorphisms, with substitutions and polymorphisms with unknown allele frequency (B 1 ), and with derived (B 0 ) or minor (B 0,MAF ) allele frequency polymorphisms only. We comprehensively evaluated their performances under diverse scenarios, including their powers for balancing selection with varying ages, distinct strengths and equilibrium frequencies, robustness against window sizes, and robustness against confounding factors such as demographic history, recombination rate variation, and mutation rate variation. We also compared and discussed their performances with other leading approaches-namely HKA, β, β * , β (2) , NCD, T 1 , and T 2 . We further applied B 2 on bonobo genomic data (Prado-  to explore long-term balancing selection in the other close relative of humans. Lastly, we developed the software BalLeRMix (BALancing selection LikElihood Ratio MIXture models) to implement these novel tests for the convenience of the scientific community.

Theory
In this section, we describe a novel set of likelihood ratio test statistics (B 2 , B 2,MAF , B 1 , B 0 , and B 0,MAF ) to scan genomes for balancing selection. These five new methods have been implemented in the software package BalLeRMix, which is written in Python and is available at http://www.personal.psu.edu/mxd60/ballermix.html.

Probability distributions given derived allele polymorphisms and substitutions
For n sampled alleles at an informative site (i.e., polymorphism or substitution), denote the number of derived alleles as k, k = 1, 2, . . . , n. Let ξ n (k) be the total number of informative sites across the wholegenome with k derived alleles observed out of n sampled alleles. The probability of observing such a site is therefore .
When balancing selection maintains an equilibrium frequency of x on the site under selection, the outcomes of observing derived alleles on this site (out of n lineages) can be approximated by a binomial distribution of n trials with a success probability of x. Following this binomial model, the probability of observing the selected site with k observed derived alleles is .
For an informative site d recombination units away from the presumed site under selection, it can either be linked to the derived (with equilibrium frequency x) or ancestral (with equilibrium frequency 1 − x) haplotype under balancing selection, resulting in a bimodal distribution ( Figure 1C). Therefore, the probability of observing k derived alleles out of n sampled alleles is where α A (d) = e −Ad and where A is a model parameter that determines the size of the genomic footprint of balancing selection. When allele frequency information is unavailable at polymorphic sites, the probability of observing a polymorphic site (k = n) or substitution (k = n) would be Similarly, when substitutions are not considered or are missing in the data (i.e., only observe derived allele counts k = 1, 2, . . . , n − 1), the two mixing components can be normalized as .
The probability of observing a polymorphic site with k derived alleles out of n sampled alleles is then

Probability distributions given minor allele polymorphisms and substitutions
When alleles cannot be confidently polarized, minor allele frequencies are often used instead. For informative sites with n sampled alleles, denote the minor allele count as k, k = 0, 1, . . . , n/2 , and the total number of such sites in the genome as η n (k). Substitutions are assigned to η n (0), as the minor allele count is zero. The probability of observing a site with k minor alleles out of n sampled alleles in the genome is .
Assume the equilibrium minor allele frequency at the locus undergoing long-term balancing selection is x ∈ (0, 0.5]. The probability of observing k minor alleles out of n sampled alleles is then n,x (k) = Bin(k; n, x) + Bin(n − k; n, x)1 {k =n/2} n j=1 Bin(j; n, x) .
Hence, for an informative site d recombination units away from the presumed site under selection, the probability of observing k minor alleles out of n sampled alleles is Similarly, when substitutions are not considered or are missing in the data (i.e., only observed minor alleles counts k = 1, 2, . . . , n/2 ), the two mixing components can be normalized as and n,x (k) = Bin(k; n, x) + Bin(n − k; n, x)1 {k =n/2} n−1 j=1 Bin(j; n, x) .
The probability of observing a polymorphic site with k minor alleles out of n sampled alleles is then

Composite likelihood ratio tests based on the mixture models
Based on the probability distributions described for the five models, for each model X ∈ {"2", "2,MAF", "1", "0", "0,MAF"}, the composite likelihood of a genomic region with L informative sites under the null hypothesis of neutrality is where n = [n 1 , n 2 , . . . , n L ] and k = [k 1 , k 2 , . . . , k L ] are the vectors of sample sizes and derived or minor allele counts, respectively, at the L informative sites in the genomic region. Similarly, the composite likelihood under the alternative hypothesis of model X would be is the vector of recombination distances between the test site and each of the L informative sites. This likelihood is maximized at Hence, under model X ∈ {"2", "2,MAF", "1", "0", "0,MAF"}, the log composite likelihood ratio test statistic for the test site is

Interpretation of estimated A and x parameters
The likelihood for the alternative model is maximized over the parameters A and x, where x represents the presumed equilibrium minor allele frequency, and A decides the influence of balancing selection on neutral sites of varying distance away from the test site. After optimizing over this parameter space, the parameter values under the optimal likelihood, A and x, provide information on the nature of detected genomic footprints. The value of x should reflect the enriched minor allele frequency across the region, whereas A is informative about the impact of the balancing selection. Intuitively, A describes how far balancing selection has impacted the genomic variation at nearby neutral sites, and the smaller the A, the wider the footprint would be, and likely the younger the balanced polymorphism. Nonetheless, note that the increase in A values is not only decided by the age of balancing selection, as a large A also reduces the number of informative sites that yield meaningful likelihood ratios, and can thus also occur when data in the examined area fit the alternative model poorly. Therefore, we advise only comparing the A values among regions with reasonably high composite likelihood ratios, and that caution be used when making inferences from these values as they do not map to an explicit evolutionary model.

Performances on simulated data
We simulated 50 kilobase (kb) long sequences using SLiM3 (Haller and Messer, 2019), under the threespecies demographic model ( Figure S1) inspired by the demographic history of great apes (see Methods), and extensively evaluated the performances of all five B statistic variants. We also compared the B statistics to the summary statistics β, β * , HKA, NCD2, and β (2) , which are respectively analogues to B 0 , B 0,MAF , B 1 , B 2,MAF , and B 2 , and to the likelihood statistics T 1 and T 2 , which are respectively analogues to B 1 and B 2 .

Robust high power under varying window sizes
We first examined the robustness of the B statistics to overly large window sizes, under a scenario of strong heterozygote advantage (selective coefficient s = 0.01 with dominance coefficient h = 20) acting on a mutation that arose 7.5 × 10 4 generations prior to sampling. Because BetaScan Voight, 2017, 2018) (which implements the standardized and nonstandardized β, β * , and β (2) statistics, among which we only consider the standardized) operates on windows of fixed physical length, we adopted window sizes of 1, 1.5, 2.5, 3, 5, 10, 15, 20, and 25 kb for all summary statistics and B statistics. The T statistics were applied on windows with matching expected numbers of informative sites. Supplementary Note 1 details the calculation for matching the number of informative sites to physical length of a genomic region.
To reduce potential stochastic fluctuations in the number of true positives when the false positive rate is controlled at a low level, we examined the area under a partial curve with no greater than a 5% false positive rate (hereafter referred to as "partial AUC"). As shown in Figure 2, under optimal window sizes for most other statistics, all variants of B statistics display substantial partial AUCs comparable to that of the respective T statistic variant, which has outperformed other equivalent summary statistics in most previous simulation studies (DeGiorgio et al., 2014;Voight, 2017, 2018;Bitarello et al., 2018;Cheng and DeGiorgio, 2019). Most remarkably, as the window size increases, while all other statistics exhibit drastic decays in power, the powers of all variants of the B statistic only show minor decreases. In fact, when comparing the powers under 25 kb windows against those under optimal window sizes for each statistic, the powers of all statistics drop more than twice as much as B 1 and B 2 ( Figure 2B). In comparison with each method's optimal performance, most statistics (except all B statistics and T 2 , the model-based analog of B 2 ) lose more than 80% of their optimal power under the largest window size examined ( Figure 2C).
Although T 2 still retains considerably higher partial AUC compared to all other extant methods, it still decreases to a value substantially lower than that of B 2 . Such robustness of B statistics to large windows is reasonable and expected, because the probability distribution of allele frequencies at sites far enough from the test site will match the genome-wide SFS, thereby contributing little to the overall likelihood ratio.
Among all statistics evaluated, we found that those considering polymorphism data only (i.e., B 0 variants and β variants) demonstrated relatively poor robustness to increases in window size. This result indicates that the detectable footprint of balancing selection in polymorphism data by itself may decay faster than other types of information, and that incorporating substitution data may help improve robustness to large windows.
Considering that the powers of all B statistics stabilize at a fixed level as the window size increases ( Figure 2), we permit the B statistics to employ all informative sites on a chromosome. However, to reduce computational load, we only consider sites with mixing proportion α A (d) ≥ 10 −8 for each value of A considered during optimization, which does not create discernible differences in performance from when all data are considered ( Figure S2). However, to ensure that other methods still display considerable power for their comparisons, we applied the summary statistics with their optimal window sizes of one kb, and T statistics with numbers of informative sites expected in a one kb window (see Methods), unless otherwise stated.
High power for detecting balancing selection of varying age and selective strength  Figure 3D), or 0.2 (h = 1.33, Figure 3E). Across all scenarios considered, T 2 and β * show the highest power for old balancing selection. The best-performing B variants, B 2 and B 2,MAF , display high power as well, and are often comparable to that of the β (2) statistic. The power of B 1 is also similar to HKA, which is its summary statistic analogue. Furthermore, we noticed that B statistics exhibit superior power for younger balanced alleles, particularly when balancing selection is more recent than 2 × 10 5 generations, and when the equilibrium frequency does not equal to 0.5 ( Figure S3). For older selected polymorphisms, although several statistics outperform B statistics, it is important to point out that all previous methods were provided optimal window sizes, whereas B statistics were set to use all sites with considerable α A (d), under which they show lower power than when window sizes are optimized ( Figure 2). This difference in performance between previous methods applied with their optimal window sizes and B statistics can also explain the seemingly inferior performance of the two B 0 variants when compared with the analogous β statistics, as the B 0 variants lose more power than other B variants when computed on extended windows. When applied with the same window size, however, B 0 outperforms β by a large margin ( Figure 2). Nevertheless, these results give us confidence that B statistics have generally high power to detect young and old balancing selection, even when adopting large windows.

Robustness to recombination rate variation and elevated mutation rates
Despite their robustness to large windows and high power for detecting balancing selection, model-based methods, such as the T and B statistics, incorporate recombination distances in their inference framework, and can therefore be especially susceptible to potential inaccuracies in input recombination maps. Additionally, because many approaches for detecting balancing selection aim to identify genomic regions with increased genetic diversity, the elevation of mutation rates is also a common and potent confounding factor for detecting balancing selection (Charlesworth, 2006;Siewert and Voight, 2018;Cheng and DeGiorgio, 2019).
To test their robustness to inaccurate recombination rates, we applied B and T statistics on simulated sequences with uneven recombination maps (10 2 -fold fluctuations in recombination rates; see Methods).
When the sequences evolve neutrally, neither approach is misled ( Figure S4). When the fluctuation in recombination rate is even more drastic (e.g., 10 4 -fold instead of 10 2 ), all methods tend to report fewer false signals than they would under a uniform map ( Figure S5). This result suggests that the misleading effects of inaccurate recombination maps are limited.
We next simulated a 10 kb mutational hotspot at the center of the 50 kb sequence with a mutation rate five times higher than original and surrounding rate µ, and applied each statistic with parameters derived from the original neutral replicates with constant mutation rate µ across the entire sequence. All methods exhibit considerable robustness against this regional increase of mutation rate ( Figure S6).
We further considered an elevated mutation rate of 5µ across the entire 50 kb sequence, and re-examined the performance of each method. As expected, most statistics display substantially inflated proportions of false signals ( Figures 4A and D). Among them, the B 2 statistic reports the least proportion of false signals, followed by the B 1 statistic. Meanwhile, at low false positive rates, B 2 and B 2,MAF statistics report higher proportions of false signals than T 2 , their coalescence model-based analogue, whereas B 1 outperformed T 1 . Additionally, all statistics that consider only polymorphism data, namely the B 0 , B 0,MAF , β, and β * statistics, are substantially misled. The β (2) statistic, albeit taking substitutions into account, also displays surprisingly high proportions of false signals. Taken together, these results suggest that B statistics are reasonably robust against high mutation rates, relative to their analogous approaches, and that incorporating substitutions and polarized allele frequencies may further buttress the robustness.

Robustness to demographic history
The influence of demographic history was the major motivation for T statistics to adopt the genome-wide  Figure S8A; see details in Methods), respectively. The former have been extensively characterized (e.g., Lohmueller et al., 2009;Gravel et al., 2011;Terhorst et al., 2017), and therefore can reliably reflect the performance of each method under realistic scenarios. On the other hand, because we intend to apply the B statistics on bonobo genomic data, we are also interested in evaluating their performance under an inferred bonobo demographic model.
As previously described, we applied the B statistics with unlimited window sizes, whereas the other statistics were provided with smaller window sizes matching the theoretical size for a footprint of longterm balancing selection (see Supplementary Note 1). Despite being provided disadvantageous window sizes, B statistics still demonstrate comparable to, and often higher power than, current summary statistic approaches, both under the human ( Figure S7) and the bonobo ( Figure S8) demographic models. Although T 2 has higher power than the B statistics, we note that the T statistics were operating with optimal window sizes, whereas the window sizes for B statistics are identified across a parameter range. When B 1 and B 2 are applied with identical window sizes as T 1 and T 2 ( Figures S9 and S10), the margins between their performances are no longer substantial. Additionally, we noticed that most statistics tend to have higher power for sequences evolving under the bonobo demographic history than under that of the Europeans (notice that the y-axes in Figures S7 and S8 have different scales).

Finding footprints of balancing selection in bonobo genomes
We further applied the B 2 statistic on the variant calls of 13 bonobos (Prado-  liftedover to human genome assembly GRCh38/hg38. Only biallelic single nucleotide polymorphisms (SNPs) were considered, and substitutions were called using bonobo panPan2 reference sequence (Prüfer et al., 2012), with the human sequence as the ancestral state. Stringent filters were applied to remove repetitive regions and regions with poor mappability (see Methods), with the resulting mask unintendedly removing the majority of the major histo-compatibility (MHC) locus as well. We observed many genomic regions with outstanding B 2 scores ( Figure 5), which lend insights on the evolution of several biological features in bonobos, including pathogen defense and neuro-behaviors.

Evidence for balancing selection on pathogen defense and autoimmunity
Despite that most of the MHC region was removed by the mappability filter, we still observed extraordinary signals from the remainder of this region. More specifically, the HLA-DQA1 and HLA-DQB1 genes harbor the highest peak across the genome (Figures 5 and S11). These two genes encode the α and β chains of HLA-DQ molecule, which is a surface receptor on antigen-presenting cells (Ball and Stastny, 1984), and has long been known to be highly polymorphic in great apes (Takahata et al., 1992;Prüfer et al., 2012;Teixeira et al., 2015).
In addition to the HLA-DQ genes, KLRD1 also presents prominent B 2 scores ( Figure 6) on its first intron. This gene expresses a natural killer (NK) cell surface antigen, also known as CD94. The interaction between KLRD1 (CD94) and NKG2 family proteins can either inhibit or activate the cytotoxic activity of NK cells (Pende et al., 1997;Cantoni et al., 1998;Masilamani et al., 2006), as well as pivot the generation of cell memory in NK cells (Cerwenka and Lanier, 2016). Furthermore, KLRD1 (CD94) has been shown to play an important role in combating viral infections such as cytomegalovirus (CMV; Cerwenka and Lanier, 2016) and influenza (Bongen et al., 2018) in humans, as well as the mousepox virus in mice (Fang et al., 2011). The involvement in viral defense of KLRD1 presents an especially intriguing case for bonobos, because they have been recently shown to have reduced level of polymorphism in MHC class I genes (Maibach et al., 2017;Wroblewski et al., 2017), and were predicted to have lower ability to bind with viral peptides when compared with chimpanzees (Maibach and Vigilant, 2019). The genes encoding another regulator of MHC class I molecules, the Killer cell Immunoglobin-like Receptors (KIR), were also found to have contracted short haplotypes in bonobos (Rajalingam et al., 2001;Walter, 2014;Wroblewski et al., 2019), and with the lineage III KIR genes serving reduced functions (Wroblewski et al., 2019). In this light, the polymorphisms on KLRD1 may be compensating the reduced diversity in their binding partners.
Moreover, besides pathogen defense, variants in the KLRD1 gene were also found to be significantly associated with the condition of Small for Gestational Age (SGA; Harmon et al., 2014) pregnancies among African American humans, which is speculated to result from its interaction with the placental HLA-E, working against implantation. The KLRD1 gene in humans and chimpanzees was also found to be highly conserved (Khakoo et al., 2000;Shum et al., 2002), unlike in bonobos, which we observed high diversity on this gene. This may suggest differential selective pressures on KLRD1 in bonobos compared to humans and chimpanzees.
Several other genes in high-scoring regions are also found to be involved in immunity. For one, the highest peak on chromosome 7 encompasses the entire GPNMB gene ( Figure S12), with elevated B 2 scores particularly on exons. This gene encodes osteoactivin, a transmembrane glycoprotein found on osteoclast cells, macrophages, and melanoblast (Loftus et al., 2009;Yu et al., 2016), and is shown to regulate proinflammatory responses (Ripoll et al., 2007). Aside from its heavy involvement in cancer (Zhou et al., 2012), GPNMB has also been shown to facilitate tissue repair (Li et al., 2010;Rose et al., 2010;Hu et al., 2013), as well as influence iris pigmentation (Bächner et al., 2002;Maric et al., 2013).
Another genomic region showing outstanding B 2 scores is the intergenic region between BPIFB4 and BPIFA2 ( Figure S13), which encode two Bacterialcidal Permeability-Increasing Fold-containing (BPIF) family proteins. The BPI proteins are strong bacterialcidal proteins secreted in the mucosal epithelia as one of the key members in first defenses against gram negative bacterial infection (Levy, 2000;Schultz and Weiss, 2007). The protein BPIFA2, found mainly in saliva, also carries antibacterial functions as a part of innate immunity (Prokopovic et al., 2014) and has been shown to protect against intestinal inflammation (Gorr et al., 2015). The BPIFA2 gene region is recently shown to harbor many SNPs significantly associated with enteropathy (Fujimori et al., 2019) as well. The BPIFB4 gene, however, is better-known by its association with longevity (Villa et al., 2015b;Spinetti et al., 2017;Villa et al., 2018), and is speculated to accomplish this via vascular protection (Villa et al., 2015a;Puca et al., 2016;, although a recent study did find its significant association with enteropathy (Fujimori et al., 2019). Considering that the intergenic region between BPIFB4 and BPIFA2 where we observe high B 2 scores locates downstream of BPIFB4 but upstream of BPIFA2, we speculate that the polymorphisms in this area function more to regulate the expression of BPIFA2, and are thus driven by innate immunity.

Evidence for balancing selection on neuro-behavior
Among the high-scoring genomic regions, we also found a number of genes involved in neurodevelopment.
One of them locates on an intron of NKAIN3 (Figure 7), which has been significantly associated with male sexuality in humans (Drabant et al., 2012;Sanders et al., 2017). This finding echoes with the pervasive socio-sexual behaviors specific in bonobos (de Waal, 1990;Kano, 1992). Encoding a transmembrane cation channel (Gorokhova, 2006), NKAIN3 is specifically expressed in the brain (Uhlén et al., 2015), and has also been associated with vitamin E bioavailability (Major et al., 2012;Desmarchelier et al., 2015) and adult ADHD (Weiflog et al., 2013). However, the related molecular mechanisms behind the action of NKAIN3 remain largely elusive. We also uncovered outstanding signals on the pain receptor gene SCN9A ( Figure S14), which encodes Na V 1.7, a voltage-gated sodium channel, with mutations on the gene associated with various pain disorders (Yang et al., 2004;Cox et al., 2006;Reimann et al., 2010). Interestingly, the peak we observe covers the overlapping RNA gene encoding its anti-sense transcript, SCN1A-AS1, which regulates the expression of SCN9A (Koenig et al., 2015), suggestive of diversified regulation of pain perception in bonobos.
The other two major signals on chromosome 8 sit in CSMD1 and CSMD3 (Figures S15 and S16, respectively). Despite that both genes are large in size (approximately two and 1.5 Mb, respectively) and may be more likely to harbor increased densities of polymorphisms by chance, the high peaks we observe mostly locate on exons, lending increased support for selection to diversify their functions. Both genes have been associated with neurological conditions such as bipolar disorder (Xu et al., 2014), autism (Floris et al., 2008), schizophrenia (Håvik et al., 2011;Kwon et al., 2013), and epilepsy (Shimizu et al., 2003).

Discussion
In this study, we introduced a novel set of composite likelihood ratio statistics-B 2 , B 2,MAF , B 1 , B 0 , and B 0,MAF -to robustly detect footprints of balancing selection with high power and flexibility. The B statistics are based on a mixture model creating a proper nested likelihood ratio test, which helps them overcome the common susceptibility to oversized windows held by current methods. We have extensively evaluated their performances on simulated data compared with current state-of-the-art methods, and have demonstrated the superior properties of the B statistics under various scenarios. We further applied B 2 onto the genomic data of bonobos (Prado- , and uncovered not only the HLA-DQA1 /HLA-DQB1 gene cluster, but also intriguing candidates that are involved in innate immunity.

Evaluating the performance of B statistics through simulations
In our simulation study, the B statistics showed remarkable robustness to large window sizes, with only minor decays in power under oversized windows, whereas other methods exhibited large declines in power.
Moreover, even when considering all data available as input (i.e., the most disadvantageous window size) all variants of B statistics still exhibit comparable power to extant methods and displayed satisfactory performance across varying types and strengths of balancing selection. Under scenarios with confounding factors, such as high mutation rate and non-equilibrium demographic history, the B statistics demonstrated satisfactory robustness as well. Overall, we believe our results have comprehensively characterized the promising performance of the B statistics.

Confounding effects of mutation rate or recombination rate variation
In our simulation study, sequences with a central 10 kb mutational hotspot did not mislead methods as much as those with the mutation rate elevated across the entire sequence. This result may seem counter-intuitive at first, as a smaller region of increased mutation rate may better resemble the footprints of long-term balancing selection. However, upon a closer examination of the site frequency spectra and proportions of polymorphic sites ( Figure S17), sequences with an extended region of high mutation rate exhibit a greater departure in these features under scenarios with no elevated mutation rate than for scenarios with a central mutational hotspot. Specifically, these sequences have more sites with high derived allele frequencies and a higher proportion of polymorphic sites overall ( Figure S17B), likely resulting from the recurrent mutation on sites that were originally substitutions. The increase is also more profound on sites with high derived allele frequency. For example, the proportions of sites with derived allele frequency of 0.96 increased by almost two-fold from approximately 0.00104 to 0.00190, and the proportions of sites with derived allele frequency of 0.98 increased by almost three-fold from 0.00105 to 0.00273. By contrast, the difference in scale between the proportions of polymorphisms (0.182 versus 0.189) is minor. The larger fold-change in the proportions of high-frequency polymorphisms (i.e., sites with k = n − 1, n − 2, and n − 3 derived alleles) relative to that of substitutions (k = n derived alleles) could explain the more profound inflation in power for the statistics relying only on information at polymorphic sites. Similarly, after folding the SFS, the large changes in the proportions of low-frequency alleles were substantially mitigated, echoing the superior performance of B 2,MAF and β relative to their unfolded counterparts.
Another unexpected result from the simulations of elevated mutation rate is the drastic inflation of false signals reported by β statistics (Figure 4), which can also be observed in the non-standardized β statistics ( Figure S18). Although Siewert and Voight (2018) tested their power to detect balancing selection under high mutation rate, it was unexplored whether their β statistics would mis-classify highly mutable neutral sequences as those undergoing balancing selection, and our results show that they could be easily misled.
However, we further found that the performances of the standardized β statistics largely improve when provided with the correct mutation rate and divergence time ( Figure S18B). This result partly confirms the superiority of standardized β statistics over the unstandardized ones. It also suggests that β statistics are considerably susceptible to the confounding effect of mutation rate elevation, and that their performance relies highly on the accuracy of the provided mutation rate. Instead of using a constant mutation rate for the entire scan, we propose that providing locally-inferred population-scaled mutation rates θ may help improve the robustness of β statistics. Indeed, when we instead estimate θ using the mean pairwise sequence difference θ π (Tajima, 1983) for each replicate and provided BetaScan the respective inferred value as the θ parameter, the standardized statistics no longer report as many false signals ( Figure S18C). Furthermore, we observed that both before and after down-sampling, the T statistics report fewer false signals than their respective B statistic analogues. One potential factor behind their marginally superior performance may be that T statistics perform tests on fixed numbers of informative sites, instead of genomic regions measured by physical lengths (as did B statistics and the summary statistics). For T statistics, the size of the genomic region covered by the same number of informative sites would be much narrower under rapidly mutating sequences than in sequences with the original mutation rate. This means that the resulting T scores in either scenario are reflective of the levels of variation for sequences with drastically different lengths. To account for this factor, we provide B 1 and B 2 with informative site-based windows identical to that of T statistics and re-examined their performances ( Figure S19). After matching the windows, B 1 and B 2 variants in turn display higher robustness than T 1 and T 2 to elevated mutation rates, suggesting that B statistics are at least comparably robust to T statistics. Meanwhile, we also matched the window size for B 0 variants and β to gauge the effect of adopting large windows on the proportions of false signals from B 0 variants. When B 0 scans the sequences with one kb windows, though there is an increase in the resulting number of false signals ( Figure S19A), at a 1% false positive rate the proportions of false signals for the two B 0 variants only increase by less than 0.1, and are still substantially lower than that of β and β * ( Figure S19B).
In contrast to mutation rate variation, all statistics are robust to recombination rate variation, with B 0 and B 0,MAF reporting substantially fewer false signals than the others ( Figure S4). This robustness to recombination rate variation is likely due to the high similarity in the SFS and proportion of polymorphic sites to sequences with uniform recombination rates ( Figure S20).

Evidence for balancing selection in bonobos
As one of the two sister species to humans, bonobos (initially known as the pygmy chimpanzees; Prüfer et al., 2012) have been drawing increasing attention from the genomics community (e.g., Prüfer et al., 2012;Prado-Martinez et al., 2013;de Manuel et al., 2016). However, compared with chimpanzees (the other sister species), bonobos are relatively understudied, despite their close relationship to humans and unique social behaviors. For bonobos, one of their most idiosyncratic traits is their high prevalence of sociosexual activities (de Waal, 1990;Kano, 1992;Wrangham, 1993), which serve important non-reproductive functions and include frequent same-sex encounters. As a close relative to humans, their female-dominance, low-aggression, and hypersexual social behaviors contrast fiercely with those of humans and chimpanzees (Kano, 1992;Wrangham, 1993), and has challenged the traditional understanding of intra-and intersexual relationships in hominid societies (Parish et al., 2000). In fact, a standing hypothesis postulates self-domestication as a main player in the evolution of bonobos, in that their unique social behavior has driven the selection against aggression and shaped other related traits (Hare et al., 2012). A growing number of recent studies have also characterized the differences in physiological responses between bonobos and chimpanzees behind their social behaviors (Heilbronner et al., 2008;Hohmann et al., 2009;Wobber et al., 2010a;Surbeck et al., 2012), yet the genetic component underlying their unique behaviors, however, remains largely elusive.
In this study, we performed the first model-based scan for footprints of balancing selection on bonobo polymorphism data. In addition to the genomic regions previously reported to be under ancient balancing selection in humans and chimpanzees (e.g., the HLA-DQ genes at the MHC locus; Leffler et al., 2013;Teixeira et al., 2015;Cheng and DeGiorgio, 2019), we have also uncovered novel candidates such as KLRD1 and NKAIN3, which are related to pathogen defense and sexuality, respectively. Accumulating evidence (de Waal, 1990;Hare et al., 2012;de Groot et al., 2017;Wroblewski et al., 2017;Maibach and Vigilant, 2019) suggests that bonobos host unique features in phenotypes related to these areas, and our results may enhance our understanding of both the genetic basis of their idiosyncrasies and their evolutionary history.
One surprising finding among all outstanding candidates is the signal of balancing selection on NKAIN3 (Figure 7), a gene significantly associated with male sexuality in humans (Drabant et al., 2012;Sanders et al., 2017). The high B 2 scores are also observed on another sexuality-associated locus, the intergenic region between SLITRK5 and SLITRK6 (Sanders et al., 2017, Figure S21). These genomic footprints echo the high prevalence of same-sex sociosexual interactions in bonobo communities, which suggest a potential balance between the social benefits and possible decrease in fitness of other traits brought by such behaviors. Further, a few other genes involved in neuro-development, such as CSMD1 and SCN9A, also show outstanding B 2 scores. These footprints may correspond with the different stress-coping behaviors in bonobos, as well as the observations of heterochrony in the development of social inhibition learning (Wobber et al., 2010b) and spacial memory (Rosati, 2019) between bonobos and chimpanzees.
In addition to their unique behavior, the immune genes in bonobos are also intriguingly more reduced than in humans and chimpanzees, both in their lower diversity of MHC-I repertoires (de Groot et al., 2017;Maibach et al., 2017;Wroblewski et al., 2017), and in the reduction in the number of both KIR genes and their allotypes (Rajalingam et al., 2001). These features are unlikely the natural consequences of demographic factors even after considering the harsher bottlenecks bonobos have undergone compared with chimpanzees; many studies have speculated potential selective sweeps in bonobos on these regions (Prüfer et al., 2012;Walter, 2014;Maibach et al., 2017;Wroblewski et al., 2017Wroblewski et al., , 2019.  (Krief et al., 2010;Liu et al., 2017), and the few remaining MHC-B subtypes in bonobos have been shown to confer protection against malaria (Hill et al., 1992;Sanchez-Mazas et al., 2017). This hypothesis agrees with our observation that MHC-DQ genes, which are often involved in parasite defense, show the highest B 2 scores genome-wide. On the other hand, our results can also relate to the defense of viral pathogens in bonobos. Besides the involvement of KLRD1 in combating virus infections, CSMD1 ( Figure S15) has also been shown to be associated with antibody response to smallpox vaccines in humans (Ovsyannikova et al., 2012)-intriguingly, bonobos were found to tolerate monkeypox, the simian-borne poxvirus (Inogwabini and Leader-Williams, 2012).
Lastly, we noticed that some candidate genes carry multiple distinct functions, and may have been undergoing balancing selection due to potential evolutionary conflicts between some of their functions. For example, the gene CSMD1 is not only associated with antibody response to smallpox vaccines (Ovsyannikova et al., 2012), it also plays a role in neuro-development (Kwon et al., 2013;Xu et al., 2014;Athanasiu et al., 2017). Likewise, the gene GPNMB plays roles not only in tissue repair (Li et al., 2010), but also in iris pigmentation (Bächner et al., 2002). Another candidate, PDE1A gene ( Figure S22), encodes a phosphodiesterase that is pivotal to Ca 2+ -and cyclic nucleotide-signaling . It is expressed in brain, endocrine tissues, kidneys, and gonads (Uhlén et al., 2015), and has multiple splicing variants. In fact, the high-scoring peak we observed on this gene happens to locate around the exons that are spliced out in some variants ( Figure S22). Studies have demonstrated the relation of this gene to brain development (Yan et al., 1994), mood and cognitive disorders (Xu et al., 2011;Martinez and Gil, 2013;Pekcec et al., 2018;Betolngar et al., 2019), and hypertension (Kimura et al., 2017). Meanwhile, the PDE1A protein is also a conserved component of mammalian spermatozoa Vasta et al., 2005), and is involved in the movement of flagella-organelles featured in both animal sperms and some protozoa such as the Trypanosoma parasites (Oberholzer et al., 2007), with which the endemic geographical regions are avoided by bonobos (Inogwabini and Leader-Williams, 2012). Though it is difficult to judge for these genes which functions may be subject to selective pressures, they nonetheless indicate that pleiotropy can be an important driver of balancing selection.
Comparing the B statistics with the T statistics performance under varying sized demographic models (DeGiorgio et al., 2014;Cheng and DeGiorgio, 2019, Figures S7 and S8). In contrast to the T statistics, the null model for B statistics (which also employs the genome-wide SFS) is nested within the alternative, due to their mixture model framework. This feature mitigates the biases introduced by sites far from the test site, while simulataneously accounting for demographic factors. Consequently, the B statistics display robust performance under oversized windows and realistic demographic models in our simulations (Figures 2, S7, and S8).
Another advantage of the B statistics over the T statistic approach, especially for B 2 compared with T 2 , is the computational load. Because the probability distribution of allele frequencies under the Kaplan-Darden-Hudson  model is difficult to compute, the T 2 statistic relies on previously-generated sets of simulated site frequency spectra over a grid of equilibrium frequencies x ∈ {0.05, 0.10, . . . , 0.95} for each distinct sample size n and recombination distance d. Generation of such frequency spectra is computationally intensive, and the load increases substantially with the increase in sample size, thereby limiting the application of T 2 to datasets with larger sample sizes. However, this is not a limitation of B 2 , as the SFS under balancing selection is determined simply as a mixture of the given genome-wide distribution of allele frequencies and a statistical distribution with closed-form solutions whose computational cost is minor, and only increases linearly with the sample size. Moreover, the rapid computation of this spectrum permits a finer grid of equilibrium frequencies x to be interrogated.

Concluding remarks
Extant methods for detecting long-term balancing selections are constrained by the pliability of their inferences as a function of genomic window size. In this study, we presented B statistics, a set of composite likelihood ratio statistics based on nested mixture models. We have comprehensively evaluated their performances through simulations and demonstrated their robust high performances over varying window sizes in uncovering genomic loci undergoing balancing selection. Moreover, we showed that even when applied with the least optimal window sizes, the B statistics still exhibit high power comparable to current methods, which operated under optimal window sizes, in uncovering balancing selection of varying age and selection parameters, as well as robust performance under confounding scenarios such as elevated mutation rates, variable recombination rates, and population size changes. We further applied the B 2 statistic on the whole-genome polymorphism data of bonobos, and discovered not only the well-established MHC-DQ genes, but also novel candidates such as KLRD1, NKAIN3, PDE1A, SCN9A, and CSMD1, with functional implications ranging from pathogen defense to neuro-behavior.
We have implemented these statistics in the open source software BalLeRMix, which can be accessed at http://www.personal.psu.edu/mxd60/ballermix.html along with the empirical scan results for balancing selection in bonobos.

Methods
In this section, we discuss sets of simulations used to evaluate the performances of the B statistics relative to previously-published state-of-the-art approaches (Hudson et al., 1987;DeGiorgio et al., 2014;Voight, 2017, 2018;Bitarello et al., 2018). Finally, we describe the application of our B statistics to an empirical bonobo dataset (Prado- .

Evaluating methods through simulations
We employed the forward-time genetic simulator SLiM (version 3.2; Haller and Messer, 2019) to generate sequences of 50 kb in length evolving with or without balancing selection. Based on the respective levels in humans and other great apes, we assumed a mutation rate of µ = 2.5 × 10 −8 per base per generation (Nachman and Crowell, 2000), and a recombination rate of r = 10 −8 per base per generation (Payseur and Nachman, 2000). In scenarios with constant population sizes, we set the diploid effective population size as N = 10 4 . To create baseline genetic variation, each replicate simulation was initiated with a burn-in period of 10N = 10 5 generations. To speed up simulations, we applied the scaling parameter λ to the number of simulated generations, population size, mutation rate, recombination rate, and selection coefficient, which allows for the generation of the same levels of variation with a speed up in computational time by a factor λ 2 . For scenarios based on a model of constant population size, we used λ = 100. For the demographic models of European humans and bonobos, we used λ = 20.
We simulated data from two other diverged species, under the demographic history inspired by that of humans, chimpanzees (Kumar et al., 2005), and gorillas (Scally et al., 2012). Specifically, the closer and farther outgroups diverged 2.5 × 10 5 and 4 × 10 5 generations ago, respectively, which correspond to five million and eight million years ago, assuming a generation time of 20 years.
To evaluate the power of each method to detect balancing selection with varying selective coefficient s, dominance coefficient h, and age, for each combination of s and h, we considered 15 time points at which the selected allele was introduced, ranging from 5 × 10 4 to 6.5 × 10 5 generations prior to sampling with time points separated by intervals of 5 × 10 4 generations. Assuming a generation time of 20 years, these time points are equivalent to 1, 2, 3, . . . , 15 million years before sampling. For each scenario, a single selected mutation was introduced at the center of each sequence at the assigned time point, and we only considered simulations where the introduced allele was not lost.

Accelerated mutation rate
To evaluate whether the B statistics are robust to high mutation rates, we applied the methods on simulated sequences evolving neutrally along the same demographic history ( Figure S1), but instead with a fivefold higher mutation rate of 5µ = 1.25 × 10 −7 per site per generation. To generate sequences with regional increases in mutation rate, we simulated 50 kb sequences with a five-fold higher mutation rate of 5µ = 1.25 × 10 −7 per site per generation at the central 10 kb of the sequence, and the surrounding region with the original rate µ.

Recombination rate estimation error
For evaluating the robustness to erroneous estimation of recombination rates, we simulated sequences with uneven recombination maps, and applied the model-based methods with the assumption that the recombination rate is uniform. In particular, we divided the 50 kb sequence into 50 regions of one kb each, and in turns inflate or deflate the recombination rate of each region by m fold, such that the recombination rates of every pair of neighboring regions have a m 2 -fold difference. We tested m = 10 and m = 100 in this study.

Demographic history
To examine the performance of methods under realistic demographic parameters, we considered the demographic histories of a central European human population (CEU; Terhorst et al., 2017) and of bonobos (Prado- . For the human population, we adopted the history of population size changes inferred by SMC++ (Terhorst et al., 2017) that spans 10 5 generations, assuming a mutation rate of 5×10 −9 per site per generation and a scaling effective size of 10 4 diploids. To account for recombination rate variation, we allowed each simulated replicate to have a uniform recombination rate drawn uniformly at random between 5 × 10 −9 and 1.5 × 10 −8 per site per generation. We also simulated an additional population that split from the human population 2.5 × 10 5 generations ago, which is identical to the outgroup (named O1) in the demographic model depicted in Figure 3A, with an effective size of 10 4 diploid individuals.
For the bonobo population history, we scaled the PSMC history inferred from the genome of individual A917 (Dzeeta; sample SRS396202) by Prado-  with a mutation rate of 2.5 × 10 −8 per site per generation, identical to the simulations on the three-population demographic history ( Figure 3A).
Because the inferred PSMC model provides a specific ratio of the mutation and recombination rates, we set the recombination rate to 2.84 × 10 −9 per site per generation. To be consistent with the three-population demographic history, we set the population size prior to 71,640 generations ago, which is the maximum time covered by the PSMC inference, to 10 4 diploid individuals, and had the outgroup split 2.5 × 10 5 generations ago with the same diploid population size, identical to the outgroup O1 in the three-population demographic history (Figure 2A).

Application to empirical data
We obtained the genotype calls of 13 bonobos from the Great Ape Project (Prado- , which were mapped to human genome assembly NCBI36/hg18. We lifted over the variant calls to human genome assembly GRCh38/hg38, and polarized the allele frequencies with the bonobo genome assembly panPan2, with the sequence in hg38 considered as the ancestral allele. Only genomic regions mappable across hg38 and panPan2 were considered for further analyses. For mappable polymorphic sites, we only considered bi-allelic SNPs. For mappable sites without variant calls in bonobo, we assumed these sites were monomorphic for the panPan2 reference genome sequence, and called substitutions based on whether the panPan2 reference allele was different from the hg38 reference allele. To circumvent potential artifacts, we performed one-tailed Hardy-Weinberg equilibrium tests on each site and removed sites with an excess of heterozygotes (p < 0.01). This p-value was determined by the distribution of the p-values of all polymorphic sites across the genome, such that 0.035% of such sites are outliers. We also applied the RepeatMasker, segmental duplication, simple repeat, and interrupted repeat filters (all downloaded from UCSC Genome Browser) to remove repetitive regions. To assess the mappability of each genomic region, we employed the mappability module GEMtools (Derrien et al., 2012), and computed the mappability tracks of 50-mers. Regions with 50-mer mappability scores lower than 0.9 were removed. Because BalLeRMix employs a pre-specified grid of A values to accompany the distances d in centi-Morgans (cM), when applying the method, we assumed a uniform recombination rate of 10 −6 cM per site, which is the approximate recombination rate in humans (Payseur and Nachman, 2000). Figure 2: Partial area under the curve (AUC) conditioned on false positive rates (FPRs) less than or equal to 5% (defined such that the maximal value is 1) as a function of window size measured in kilobases (kb) for B statistics (varying shades of blue), β statistics (dotted line with varying shades of blue), T 2 (orange), T 1 (green), HKA (purple), and NCD 2 (0.5) (pink), under a scenario in which a mutation undergoing ancient balancing selection (selective coefficient s = 0.01 and dominance coefficient h = 20) arose 15 million years ago (assuming a generation time of 20 years). Statistics that consider the same input type share the same point shape. The dark red dashed line marks the level of partial AUC expected at the y=x line, or the baseline of randomly choosing between balancing selection and neutrality. (B) The amount of partial AUC lost, and (C) the proportion of the AUC loss as compared with the optimal value for each statistic when the window size increased from the optimum to 25 kb (e.g., largest evaluated).  when (A) all test sites were considered to compute proportions of false signals and when (B, C) the number of test sites were down-sampled to match the number of sites expected on sequences with the original mutation rate µ. In panel B, five informative sites flanking either side of a test site considered were skipped ("sparse" down-sampling), whereas in panel C only the first 1200 informative sites were considered for calculating T , B 2 , and B 1 statistics, and the first 240 polymorphic sites were considered for B 0 and β statistics ("dense" down-sampling). (D-F) Proportions of false signals at a 1% FPR for each statistic for (D) all test sites, (E) sparse down-sampling, and (F) dense down-sampling. Bars for all B statistics are bordered by black lines, and all β statistics are bordered by gray lines in panels D-F. HKA and NCD 2 (0.5) are applied with a fixed step size, and are therefore not considered when down-sampling in panels B, C, E, and F. Figure 5: Manhattan plot displaying B 2 scores across the 22 human autosomes for which the genomic data were mapped, with the top candidates annotated. Outstanding peaks (B 2 > 600) without annotations do not have neighboring protein-coding regions within a 500 kb radius. Figure 6: Evidence of balancing selection on KLRD1. (A) B 2 scores across the 500 kb genomic region on chromosome 12 centered on the peak. The gray bars directly under the B 2 scores represent the masked regions, as well as the features in these regions. The darker the shade, the greater number of types of repetitive sequences (e.g., RepeatMasker mask, segmental duplication, simple repeats, or interrupted repeats) overlapping the region. Vertical gray bars below display the estimated equilibrium minor allele frequency x for each maximum likelihood ratio B 2 , and the black line traces the value for the respective inferred footprint size log 10 ( A). (B) Proportion of informative sites that are polymorphic in the 500 kb region centered on the peak compared with the whole-genome average. (C) Minor allele frequency distribution in the 500 kb region centered on the peak compared with the whole-genome average. Figure 7: Evidence of balancing selection on NKAIN3. (A) B 2 scores across the 500 kb genomic region on chromosome 8 centered on the peak. The gray bars directly under the B 2 scores represent the masked regions, as well as the features in these regions. The darker the shade, the greater number of types of repetitive sequences (e.g., RepeatMasker mask, segmental duplication, simple repeats, or interrupted repeats) overlapping the region. Vertical gray bars below display the estimated equilibrium minor allele frequency x for each maximum likelihood ratio B 2 , and the black line traces the value for the respective inferred footprint size log 10 ( A). (B) Proportion of informative sites that are polymorphic in the 500 kb region centered on the peak compared with the whole-genome average. (C) Minor allele frequency distribution in the 500 kb region centered on the peak compared with the whole-genome average.