New Methods for Inferring the Distribution of Fitness Effects for INDELs and SNPs.

Small insertions and deletions (INDELs; ≤50 bp) are the most common type of variability after single nucleotide polymorphism (SNP). However, compared with SNPs, we know little about the distribution of fitness effects (DFE) of new INDEL mutations and how prevalent adaptive INDEL substitutions are. Studying INDELs has been difficult partly because identifying ancestral states at these sites is error-prone and misidentification can lead to severely biased estimates of the strength of selection. To solve these problems, we develop new maximum likelihood methods, which use polymorphism data to simultaneously estimate the DFE, the mutation rate, and the misidentification rate. These methods are applicable to both INDELs and SNPs. Simulations show that they can provide highly accurate results. We applied the methods to an INDEL polymorphism data set in Drosophila melanogaster. We found that the DFE for polymorphic INDELs in protein-coding regions is bimodal, with the variants being either nearly neutral or strongly deleterious. Based on the DFE, we estimated that 71.5-83.7% of the INDEL substitutions that took place along the D. melanogaster lineage were fixed by positive selection, which is comparable with the prevalence of adaptive substitutions at nonsynonymous sites. The new methods have been implemented in the software package anavar.


Introduction
New mutations can have a range of effects on an organism's fitness, ranging from being strongly harmful, through being only slightly deleterious, to being neutral, and finally on to being either mildly or highly beneficial. The relative frequencies of mutations with different selective effects is known as the distribution of fitness effects (DFE). The DFE is an important parameter as it is required for addressing many fundamental questions (Eyre-Walker and . Examples include understanding determinants of the efficacy of natural selection (Galtier 2016;Corcoran et al. 2017), the genetic basis of polygenic traits (Zuk et al. 2014), and the evolutionary advantage of sex and recombination (Hartfield and Keightley 2012).
Taking advantage of the massive increase in data availability, many methods have been proposed for estimating the DFE using polymorphism data (Eyre-Walker et al. 2006;Keightley and Eyre-Walker 2007;Eyre-Walker and Keightley 2009;Kousathanas and Keightley 2013;Kim et al. 2017;Tataru et al. 2017). Their development in turn allows more reliable inferences about other important quantities such as a, the proportion of adaptive substitutions (Eyre-Walker and ). However, all these methods are concerned with estimating the DFE for single nucleotide polymorphisms (SNPs). Consequently, much less is known about the DFE and a for other types of genetic variation such as small insertions and deletions (INDELs; 50 bp), despite the fact that INDELs are the second most common type of variants (e.g., Montgomery et al. 2013), and hence represent an important source of raw materials for selection to act on.
A major difficulty in studying INDELs lies with ancestral state identification. This requires multispecies genome alignments. However, INDELs occur disproportionately in repetitive genomic regions Montgomery et al. 2013), where alignment algorithms perform poorly (Earl et al. 2014). Furthermore, there is evidence that homoplasy is a significant issue outside repetitive regions, probably due to the existence of cryptic INDEL mutation hotspots (Kvikstad and Duret 2014). Thus ancestral state identification can be expected to be particularly error prone for INDELs. It is well established that misidenfication of ancestral states can lead to severely biased estimates of the strength of selection using the site-frequency spectrum (SFS) (Hernandez et al. 2007). For SNPs, this difficulty can be avoided by using the folded SFS (e.g., Eyre-Walker et al. 2006;Keightley and Eyre-Walker 2007). However, to determine whether a length variant is an insertion or a deletion, we have to know what the ancestral state is, meaning that the issue of polarisation error is inherent for INDELs. As a result, applying existing methods for estimating the DFE to INDEL data may be liable to biases.
Another challenge is that the SFSs for insertions and deletions may be affected by polarisation errors to different extents. This is because when the ancestral state of an insertion segregating at low frequency is misidentified, it will be incorrectly inferred as a deletion segregating at high frequency (and vice versa). There is direct experimental evidence that the deletion mutation rate is higher than the insertion mutation rate Schrider et al. 2013;Besenbacher et al. 2015;Yang et al. 2015). This mutational bias means that there are more deletions segregating in the population than insertions. The larger number of deletions may lead to the SFS for insertions being disproportionally affected by polarisation errors ( fig. 1). This asymmetry can cause the insertion SFS to have a more pronounced, but artificial, uptick at the high-frequency end, which can be misinterpreted as stronger positive selection on insertions over deletions. As pointed out by Kvikstad and Duret (2014), this methodological issue can, at least in principle, compromises the results of previous studies, which suggest that insertions are more likely to be under positive selection than deletions to prevent the genome size from unconstrained contraction caused by the mutational bias toward deletions (Parsch 2003). Similarly, it will make it difficult to test the possibility that insertions have a higher fixation probability because they are favored by insertion-biased gene conversion .
Toward resolving the confounding efforts ancestral state misidentification have on the study of INDELs, we propose new maximum likelihood methods for inferring the DFE using polymorphism data. These methods are based on recent studies on SNPs which show that polymorphism data contain enough information for simultaneous estimation of the mutation rate, the DFE, and the polarisation error rate (Gl emin et al. 2015;Tataru et al. 2017). Our methods are more general than the existing methods in the following aspects. First, they can handle both INDELs and SNPs. Second, insertions and deletions can have different polarisation error rates, mutation rates, and DFEs. Third, for both INDELs and SNPs, the new methods allow the mutation and polarisation error rates to vary across the genome. Incorporating these heterogeneities may be particularly important for INDELs (Kvikstad and Duret 2014). We carried out extensive simulations to examine the performance of the new methods. As an example, we applied the methods to an INDEL polymorphism data set in Drosophila melanogaster we obtained by reanalysing the raw short-read data published by the Drosophila Population Genomics Project (Pool et al. 2012). Through model comparisons, we tried to find the DFE that best described the observed pattern of INDEL polymorphism within proteincoding regions of the genome. Finally, using the best-fitting DFE, we estimated the proportion of INDEL substitutions fixed by positive selection (a).

New Approach
For ease of presentation, we will start with a description of the SNP models. The INDEL models will be presented later as an extension.

The SNP Models
Consider a diploid population with effective size N e . The size of the genomic region of interest is m base pairs, and the sample size is n.
The Discrete Model Assume that there are C different classes of sites in the focal region. These sites can be different with respect to their mutation rates, the fitness effects of new mutations, and polarisation error rates. This discrete model has several advantages. First, it does not assume that the DFE follows a specific probability distribution, and is therefore able to accommodate complex scenarios such as a multimodal DFE (Kousathanas and Keightley 2013). Second, by allowing the mutation and polarisation error rates to vary freely between site classes, the method can include situations whereby these two variables The SFSs for insertions and deletions may be affected to different extents by polarisation errors. We assume that the population size is constant, that INDELs are neutral, and that the sample size is 10. In the genomic region under consideration, the total scaled mutation rate toward insertions, 4N e um, is 10, where N e is the effective population size u is the insertion mutation rate per site per generation, and m is that size of the focal region. The total scaled mutation rate toward deletions is 20. The expected SFSs were generated using standard neutral theory. The SFSs with polarisation errors were generated by assuming that the ancestral state of an INDEL was wrongly identified with probability 0.1.
Inferring the DFE for INDELs and SNPs . doi:10.1093/molbev/msy054 MBE covary (e.g., hypermutable regions may have a higher polarisation error rate). We assume that the mutation process can be approximated by the infinite-sites model. Let the total scaled mutation rate for sites of class c be mh c , where c 2 f1; 2; . . .; Cg and h c ¼ 4N e u c . To understand u c , consider an alternative formulation whereby the mutation rate for the c th class of sites is v c per site per generation, and sites of class c account for a fraction p c of all sites in the focal region (i.e., P c p c ¼ 1). We have mh c ¼ mp c 4N e v c , which leads to u c ¼ p c v c . By using h c , we can perform searches for maximum likelihood estimates (MLEs) of the parameters without having to deal with the constraint P c p c ¼ 1. Define: Thus, h is the average scaled mutation rate per site, and the total scaled mutation rate is mh. If the per-site mutation rate is uniform across the focal region (i.e., v i ¼ v j for i 6 ¼ j and 1 i; j C), then h c =h ¼ p c . To model selection, we assume that, for mutations arising at sites of class c, the fitnesses of the wild-type, heterozygote, and mutant homozygote genotypes are 1, 1 þ s c , and 1 þ 2s c , respectively. The corresponding scaled selection coefficient c c is defined as 4N e s c . Positive and negative c c values signify beneficial and deleterious mutations, respectively.
The site-frequency spectrum (SFS) for the c th site class, which is defined as the expected number of polymorphic sites of size i (i.e., sites where the derived allele is represented i times; 1 i < n), is given by: where Polarisation errors distort the SFS. Specifically, when the ancestral state of a polymorphic site of size i is mis-identified, it will be regarded as a polymorphic site of size nÀi. To model polarisation errors, we let c be the probability that the ancestral state of a polymorphic site of class c is incorrectly identified (Gl emin et al. 2015). The final SFS for sites of class c is then: In what follows, we refer to the SFS with and without the correction of polarisation errors as the corrected and uncorrected SFS, respectively. The corrected SFS for the focal region is simply the sum of all the contributions from the sites in different classes:  Kim et al. 2017) or assume that the error rate is constant across the focal region (Gl emin et al. 2015;Tataru et al. 2017). The model described here is therefore more general. Allowing variation in the polarisation error rate can be important. For instance, sites under stronger selective constraints tend to evolve slower, and are less likely to be polarised incorrectly due to homoplasy. It should, however, be noted that, when c c c for 8c 2 f1; 2; . . .; Cg, not all the parameters are identifiable. To see this, we rewrite equation (5) as: c h c s nÀi ðcÞ: (6) Appealing to equation (1) and defining Ã such that we can rewrite equation (6) as Thus, when there is no difference in fitness effects between mutations arising at sites of different classes, we cannot detect variation in the scaled mutation rate and polarisation error rate because the model reduces to one that depends on h, c and Ã . This result has important implications for data analysis by pointing out that a model with a small number of site classes may provide an adequate description of the data even when the underlying biological process features complex variation in the mutation rate across the genome.

The Continuous Model
Instead of assuming that the focal region is composed of several classes of sites, we can assume that the fitness effects of new mutations follows a continuous distribution characterised by parameters X. Let h be the scaled mutation rate per site, and be the polarisation error rate. The uncorrected SFS becomes: where f ðcjXÞ is the probability density function. The corrected SFS is analogous to equation (4) with c in the subscripts omitted. Although the modeling framework allows the DFE to follow arbitrary probability distribution (including those mixture distributions considered by Galtier 2016), here we only consider the reflected C distribution, that is, Àc $ Cða; bÞ, where c 0 and a and b are the shape and scale parameters, respectively.

Parameter Estimation
Let X ¼ (x 1 , x 2 ,. . ., x nÀ1 ) represent the observed SFS, where x i is the number of polymorphic sites of size i in the sample. Let H denote all the parameters in the model (i.e., h c , c c , and c for c 2 f1; 2; . . .; Cg for the discrete model and h, X, and for the continuous model). To obtain MLEs of H, we use the Barton and Zeng . doi:10.1093/molbev/msy054 MBE Poisson random field model (Sawyer and Hartl 1992;Bustamante et al. 2001). Omitting constants that have no effects on the shape of the likelihood surface, the log likelihood function is defined as:

Controlling for Demography
We have so far assumed that the population is panmictic and of constant size N e . To control for demography, we employ the method of Eyre-Walker et al. (2006). Take the continuous model as an example. First, we define augmented SFSs as: ( Next, a set of neutral variants is added to the model, which introduces two additional parameters h ð0Þ and ð0Þ , which are the scaled mutation rate per site and the polarisation error rate, respectively, for the neutral sites. Let H ð0Þ denote these new parameters and X ð0Þ denote the neutral SFS. The log likelihood of the observed data can be calculated as: where R ¼ (r 2 , r 3 ,. . ., r nÀ1 ) and the two log likelihood functions on the right-hand side are calculated in the same way as equation (10) with w i replaced by w Ã i . The above method for controlling for demography has been used extensively (Eyre-Walker et al. 2006;Muyle et al. 2011;Gl emin et al. 2015;Galtier 2016;Jackson et al. 2017;Tataru et al. 2017). These previous efforts have gathered clear theoretical and empirical evidence that the method is robust against a wide range of demographic processes, as well as the effects caused by selection at linked sites (e.g., background selection and/or selective sweeps). For instance, in a recent analysis of selection on codon usage bias in Drosophila, Jackson et al. (2017) showed that the estimates of c produced by an estimation method that corrects for demography using the r parameters as set out above closely matched those produced by another estimation method that considers an explicit one-step change in population size (see figure 4 A in Jackson et al. 2017).
It should be noted that equation (12) accommodates the possibility that the focal region and the neutral region have different mutation rates. This is more general than several previous models Eyre-Walker and Keightley 2009;Kim et al. 2017;Tataru et al. 2017). However, it may be challenging to distinguish this model from one in which the two regions have the same mutation rate, but a proportion of new mutations in the focal region are so strongly deleterious that they make negligible contributions to the observed SFS.

The INDEL Models
The Discrete Model First consider insertions. Assume that there are C ins different classes of sites. The total scaled mutation rate toward insertions for sites of class c is mh ins c , and the fitness effect and polarisation error rate are c ins c and ins c , respectively (1 c C ins ). The uncorrected SFS for insertions of class c can be calculated using equation (2), and is denoted by W ins c;i . For deletions, we can similarly assume that there are C del different classes of sites. The associated parameters are h del d ; c del d , and del d , and the uncorrected SFS is denoted by . When the ancestral state of a derived insertion of size i is misidentified, it will be wrongly identified as a deletion of size nÀi, and vice versa for deletions (note that size in this context refers to the frequency of the derived allele, not the number of base pairs inserted or deleted). Thus, the corrected SFSs for insertions and deletions are: The Continuous Model For insertions, define the per-site scaled mutation rate and the polarisation error rate as h ins and ins , respectively. The DFE for insertions is determined by parameters X ins . For deletions, we similarly define the following parameters: h del , X del and del . Finally, the corrected SFSs are: ( where W ins i and W del i are the uncorrected SFSs for insertions and deletions, respectively, and are calculated in the same way as equation (9). As in the SNP case, we only consider cases where the DFE follows a reflected C distribution. The shape and scale parameters for insertions and deletions are denoted by a ins , b ins , a del , and b del , respectively.

Parameter Estimation
. ., x del nÀ1 ) be the observed SFSs for insertions and deletions, respectively. The log likelihood of the data is calculated as: where the two terms on the right are calculated using equation (15) with w z i replaced by w z;Ã i (z 2 fins; delg).

Simulation Results
We evaluate the statistical properties of the new models using computer simulations. Unless stated otherwise, the sample size (n) is 50 and the results are based on 100 replicates. In all cases, we assume the population size is constant and only analyse data from the selected region (see Materials and Methods for justification). For the SNP models, we only present results for the discrete SNP model with C > 1 site classes, because

Properties of the Discrete SNP Model
First consider a model with C ¼ 2 site classes. As can be seen from table 1, there is information in the SFS for simultaneously estimating all the parameters to a high degree of accuracy. Before discussing more simulation results, it should be pointed out that, when C > 1, the order of the site classes is arbitrary. That is, the model considered in table 1 is equivalent to one with parameters h 1 ¼ 0:01; c 1 ¼ À20; 1 ¼ 0:01; h 2 ¼ 0:005; c 2 ¼ À5, and 2 ¼ 0:05. For both cases shown in table 1, all the MLEs can be sorted such that h 1 <ĥ 2 andĉ 1 >ĉ 2 . In other words, the MLEs can be assigned unambiguously to site classes according to the order given in the "True value" row. However, if we were to reduce the amount of data, parameter estimates will become more uncertain, and cases such as those withĥ 1 <ĥ 2 andĉ 1 < c 2 will occur, which makes assigning the MLEs to site classes impossible. Thus, presenting mean and SD of the MLEs may give misleading information about the performance of the model.
In light of the earlier discussion, we investigate the statistical properties of the model using two alternative methods. First, we compare the full model to the following reduced models using the v 2 test: "Equal " (all site share the same polarisation error rate), " ¼ 0" (no polarisation error), and "C À 1"(a model with C À 1 site classes, where C is the true number of site classes). Second, we assess how well these various models predict the average fixation probability l (see eq. 18 in Materials and Methods), which is essential for estimating the prevalence of adaptive substitutions (i.e., a and x a ).
Considering the two pairs of cases in table 2, and focusing on the data presented under "Percent significant," we make the following observations. First, as the amount of data reduces, the ability of the model to infer separate for different site classes drops more rapidly than its ability to detect   Barton and Zeng . doi:10.1093/molbev/msy054 MBE the existence of either polarisation error or more than one site class. This suggests that estimating heterogeneity in may be challenging. Considering all four cases, it appears that the tests for detecting the presence of polarisation error (i.e., the full model vs. " ¼ 0") and for detecting the existence of more site classes (i.e., the full model vs. "C À 1") are more powerful, especially the latter. It should be noted that the likelihood surface appears to be rather flat when C ¼ 3 such that different parameter combinations may produce very similar log likelihoods. This is particularly evident when the amount of data is limited (Case 3 vs. Case 4), leading to a reduction in power of the tests. A similar observation was made by Keightley and Eyre-Walker (2010), who also showed that it can be partly alleviated by increasing the sample size. Nonetheless there may well be a limit as to how many site classes can be included. This identifiability problem is analogous to that discussed extensively in the context of using SNP-based methods for estimating past demographic changes (e.g., Myers et al. 2008).
Interestingly, the reduced model "Equal " makes worse predictions of l than the full model in all cases presented in table 2, even when the full model does not normally provide a better fit to the data (Cases 2 and 4). The same applies to the other two reduced models. Thus, despite the statistical difficulties discussed earlier, fitting the full model to the data may be important for obtaining accurate estimates of a and x a . Table 3 contains simulation results based on a discrete model (with C ins ¼ C del ¼ 1) and two continuous models (differing from each other in terms of the size of the focal region m). The mutation rates are $10 times lower than those used in the SNP cases (tables 1 and 2), and polarisation error rates are $2 times higher. These choices are to reflect the fact that INDELs are generally less prevalent than SNPs, and are potentially more difficult to polarise. As can be seen, with a reasonable amount of data, all the parameters can be reliably estimated. Comparing the two continuous models, we notice that, with limited data, the scale parameter b of the C distribution may be overestimated, but estimates of the shape parameter a and the polarisation error rate remain unbiased.

Properties of the INDEL Models
The true values of l ins and l del for the discrete model are 0.0339 and 4:59 Â 10 À6 , respectively. The mean (SD) of the estimates is 0.0345 (0.0055) for l ins , and 5:27 Â 10 À6 (2:91 Â 10 À6 ) for l del . Thus, the true values are well within the observed ranges of variability. The true values of l ins and l del for the two continuous cases are 0.384 and 0.429, respectively. The mean (SD) of the estimates for the case with more data is 0.382 (0.012) for l ins and 0.429 (0.008) for l del . Encouragingly, for the continuous case with less data, despite the tendency to overestimate the scale parameter, estimates of the average fixation probabilities are still highly accurate: 0.388 (0.050) for l ins and 0.418 (0.028) for l del , suggesting that the reliability of estimates of a and x a is unlikely to be compromised. Interestingly, nonsense mutations are somewhat rarer than frameshift INDELs, an observation also made by . These results indicate strong purifying selection against INDELs in protein-coding regions. INDEL diversity patterns appear to be similar between intronic and intergenic regions. They are combined and referred to as noncoding INDELs in what follows to increase statistical power.

Application to D. melanogaster Data
Comparing between INDELs and SNPs, we notice that INDEL diversity in noncoding regions is $10 times lower than p 4 (4-fold site diversity; table 4), consistent with the fact that the INDEL mutation rate is lower than the point mutation rate (Haag-Liautard et al. 2007;Schrider et al. 2013). However, Tajima's D calculated on noncoding INDELs is more negative than that calculated on 4-fold sites (table 4), probably reflecting the fact that many noncoding DNA in the D. melanogaster genome are under selection (Andolfatto 2005). Furthermore, p 0 (0-fold site diversity; table 4) is only $10 times smaller than p 4 . This level of reduction is much smaller than the 30-fold difference observed between CDS and noncoding INDELs. This suggests that, in protein-coding regions,  S2, Supplementary Material online), consistent with the expectation that mutations are on average more deleterious in more conserved genes (Jackson et al. 2015). The results in this and the preceding paragraphs suggest that our INDEL data set is of high quality.

Inferring the DFE and Using Noncoding INDELs as the Neutral Reference
To infer the DFE for INDELs in CDS regions, we used noncoding INDELs as the neutral reference. Following previous efforts in estimating the DFE for SNPs Eyre-Walker and Keightley 2009;Schneider et al. 2011;Galtier 2016;Tataru et al. 2017), we also assumed that the mutation rate toward insertions and deletions, respectively, were the same between the neutral and selected regions. The best-fitting DFE is one with C ¼ 2 classes of selected sites (table 5 and supplementary table S1, Supplementary Material online). The MLEs of c suggest that polymorphic INDELs are either nearly neutral or are so strongly deleterious that they contribute little to polymorphism. This seems to be consistent with the 30fold difference in INDEL diversity level between CDS and noncoding regions, which is more substantial than the 10fold difference between 0-fold and 4-fold sites (table 4). Fitting the data to a discrete model with C ¼ 3 classes of sites also reveals a bimodal DFE, suggesting that the conclusion is robust (supplementary table S1, Supplementary Material online). With a larger sample containing hundreds or even thousands of alleles, and by fitting a DFE with more site classes, it should be possible to obtain further details of the relative frequencies and fitness effects of strongly selected variants, which tend not to segregate in our current sample of size 17. However, this additional information about the strongly selected end of the DFE is unlikely to affect our estimation of a (see below) because these variants make effectively no contribution to divergence.
To better understand the effects of length, we separated the INDELs in CDS regions into the following length categories: 1 bp, 2 bp, 3 bp, frameshifting (!4 bp), and nonframeshifting (!6 bp). We analysed the data in each category separately. As above, noncoding INDELs with the same length were used as the neutral reference and the mutation was assumed to be constant across neutral and selected sites. Considering the dearth of variants, we only fitted a DFE with C ¼ 1 class of selected sites. Viewing the c in this model as the "average" selection coefficient, frameshift INDELs are consistently more deleterious than nonframeshift INDELs (supplementary fig. S3, Supplementary Material online). Consistent with a prevous study , there is no obvious evidence that longer INDELs are under stronger selection.
Using the best-fitting DFE (table 5), the proportion of INDEL substitutions in the CDS regions fixed by positive selection in the D. melanogaster lineage, a, is 83.7% (100% for insertions and 81.8% for deletions). These a estimates are comparable with previous estimates for SNP substitutions in CDS regions (Andolfatto et al. 2011;Schneider et al. 2011).
As mentioned earlier, some noncoding INDELs are probably nonneutral, as suggested by the negative Tajima's D value (table 4). Our use of these variants as the neutral reference are for several practical reasons. Although using INDELs in  NOTE.-The DFE for polymorphic INDELs in the CDS regions were inferred using either noncoding INDELs or 4-fold sites as the neutral reference. A series of different DFEs were fitted to the data, and the best-fitting models presented above were determined by using the Akaike information criterion (AIC) (see supplementary tables S1 and S5, Supplementary Material online). When noncoding INDELs were used as the neutral reference, a was estimated using INDEL divergence in noncoding regions. When 4-fold sites were used as the neutral reference, the mutation rate ratio between SNPs and INDELs, and that between deletions and insertions, were fixed at values obtained from a mutation accumulation experiment (Schrider et al. 2013). a was estimated using a method based on divergence in the 8-30 bp region of short introns < 66 bp long (see the main text).
"dead-on-arrival" transposable elements as neutral reference may be preferable (Petrov 2002), calling variants from repetitive regions using short-read data is highly prone to error (Li 2014). Using data from the 8-30 bp region of short introns 65 bp, which are also putatively neutral (Parsch et al. 2010), is also problematic because of evidence for selection maintaining intron size (Ptak and Petrov 2002;Parsch 2003;. Note that Tajima's D is more negative for INDELs in CDS regions than for those in noncoding regions, suggesting that the latter are probably under weaker purifying selection (table 4). If this is the case, our method tends to underestimate the strength of purifying selection on INDELs in CDS regions, as suggested by the simulation results presented in supplementary table S2, Supplementary Material online. This should lead to an overestimation of l, the average fixation rate (eq. 18), which should in turn put a downward pressure on the estimation of a (eq. 19). However, biases in a also depend on the way selection on noncoding INDELs alters divergence. For example, if fixations of beneficial noncoding INDELs are so common that d S is greater than the divergence level expected under neutral evolution, then this combined with the overestimation of l can lead to a substantial underestimation of a. In contrast, if most noncoding INDELs are selected against and d S is much smaller than the neutral expectation, it may offset the effect caused by the overestimation of l and result in an overestimation of a.

Inferring the DFE and Using 4-Fold Degenerate Sites as the Neutral Reference
To check the robustness of our results, we conducted a second set of analyses without using noncoding INDELs. We extended our model such that it can infer the DFE for INDELs in CDS regions using 4-fold sites as the neutral reference. We chose 4-fold sites instead of the 8-30 bp region of short introns 65 bp because 4-fold sites are probably not under ongoing selection on codon usage in D. melanogaster, and are similar to short introns in multiple aspects of polymorphism patterns (Jackson et al. 2017). Considering the parameter richness of the models, using 4-fold SNPs as the neutral reference should help statistical inference because they are much more numerous than short-intron SNPs. We used the following approach to obtain neutral divergence for INDELs along the D. melanogaster lineage. The nucleotide divergence in the 8-30 bp region of short introns 65 bp is 0.0674 (Jackson B, personal communication). In a mutation accumulation experiment (Schrider et al. 2013), it was found that the rate to point mutations is 12.2 times higher than that to short INDELs, and that the rate to deletions is 5 times higher than that to insertions (averaging across the two genetic backgrounds considered therein). Thus, an estimate of neutral INDEL divergence can be obtained as 0:0674=12:2 ¼ 0:0055, and the corresponding estimates for insertions and deletions are 9:2 Â 10 À4 and 0.0046, respectively.
Due to the use of 4-fold sites as the neutral reference, it is no longer appropriate to assume that the mutation rate is the same between the selected and neutral regions. Given the evidence that the DFE for INDELs probably features a class of strongly deleterious mutations that make little contribution to polymorphism, allowing the selected and neutral regions to have their separate mutation rates is likely to cause the model to underestimate both the mutation rate in the selected region and strength of purifying selection, as confirmed by simulation results presented in supplementary table S3, Supplementary Material online. An underestimation of the strength of purifying selection is likely to cause an underestimation of a. We observed this in our data set-a for all INDELs obtained from the best-fitting DFE for this analysis (supplementary table S4, Supplementary Material online) is only 21.7%, much smaller than the value of 83.7% when noncoding INDELs were used as the neutral reference (table 5).
To resolve the above problem, we again made use of the information reported in the aforementioned mutation accumulation experiment (Schrider et al. 2013). Specifically, we further extended our model, so that the mutation rate ratio between SNPs and INDELs, and that between deletions and insertions, were fixed at 12.2 and 5, respectively. As shown in table 5 (see also supplementary table S5, Supplementary Material online), the best-fitting DFE has C ¼ 2 class of sites, with one under weak selection, and the other being strongly deleterious. The a estimates for all INDELs, insertions, and deletions are, respectively, 71.5%, 59.7%, and 81.3%.
To make sure that the above results are not dependent on our use of the mutation rate ratios estimated by Schrider et al. (2013), we repeated the analysis using ratios obtained by either Petrov and Hartl (1998)

Numerical Details
We used numerical routines provided by the GNU Scientific Library (GSL; https://www.gnu.org/software/gsl/; last accessed April 6, 2018) to perform the integration in equation (3) numerically. For the continuous model (e.g., eq. 9), the integral was evaluated using Gaussian quadrature, which was implemented based on a routine included in the R package statmod (https://cran.r-project.org/web/packages/statmod/index.html; last accessed April 6, 2018). Maximum likelihood estimates of the model parameters were obtained by both gradient-based and derivative-free Inferring the DFE for INDELs and SNPs . doi:10.1093/molbev/msy054 MBE optimization algorithms implemented in the NLopt package (http://ab-initio.mit.edu/wiki/index.php/NLopt; last accessed April 6, 2018). To ensure the global maximum was found, we initialised the search algorithm using multiple randomly selected starting points.

Simulations
We performed parameter estimation using our program, anavar, on random samples simulated using Mathematica (http://www.wolfram.com/; last accessed April 6, 2018). Because the generation of simulated data is separate from the numerical routines we used to implement anavar, this set-up can help verify the numerical robustness of anavar. Note that, in all simulations, we only used the models to analyse variants from selected regions because we wanted to find out how much information we could obtain by analysing them alone. Including neutral variants, as routinely done in real data analysis, may help to increase the accuracy of parameter estimation. Therefore, our choice should give us a rather conservative assessment of the methods' performance.
In addition to testing whether the data contained enough information for all the parameters to be estimated, we also assessed how well a model could predict the average fixation rate, l (expressed in units of 2N e generations). As an example, if nonsynonymous polymorphism data are fitted to the discrete SNP model, l can be estimated as: whereẐ signifies the MLE of parameter Z and h is defined by equation (1). Understanding the ability to accurately estimate l is important because it is needed for estimating a, the proportion of substitutions fixed by positive selection, which can be written as: where d N and d S are the numbers of selected (e.g., nonsynonymous) and neutral (e.g., synonymous) substitutions per site, respectively (Eyre-Walker and Keightley 2009). We did not generate simulated data from models with demographic changes and selection at linked sites because the effectiveness of the method of Eyre-Walker et al. (2006) in controlling for these confounding factors have been studied extensively (Eyre-Walker et al. 2006;Muyle et al. 2011;Gl emin et al. 2015;Galtier 2016;Jackson et al. 2017;Tataru et al. 2017).

The Drosophila melanogaster Data Set
This data set consisted of 17 Rwandan individuals as described in Jackson et al. (2015Jackson et al. ( , 2017 and made available by the Drosophila Population Genomics Project (Pool et al. 2012).

Summary Statistics
Nucleotide diversity (p) (Tajima 1983), Watterson's h (Watterson 1975), and Tajima's D (Tajima 1989) were calculated for variants in noncoding (intronic and intergenic) and coding regions, as well as for 0-fold and 4-fold degenerate SNPs. The numbers of callable sites used to obtain per-site estimates was taken to be the number of sites in each region that were called in the "all sites" VCF file and passed the filters described previously. Additionally for polarised variants the number of callable sites was reduced to those that could be polarised by our parsimony approach.
To obtain rates of divergence at nonsynonymous and synonymous sites, denoted by d N and d S , CDS regions were extracted from the multispecies alignment using the coordinates from the D. melanogaster CDS fasta alignment file. CDS alignments were removed if they were not in frame, did not start with a start codon, did not end with a stop codon or contained premature stop codons. Additionally any codons with missing data were dropped. For each gene, we retained only the longest transcript. These data were then analysed using codeml in PAML (Yang 2007) with a one ratio model to obtain d N and d S .

Software Availability
The new models have been implemented in a user-friendly package anavar, which is freely available at http://zeng-lab. group.shef.ac.uk. In addition to the models developed herein, anavar also contains implementations of several other widely-used models for estimating the DFE (Eyre-Walker et al. 2006) and for studying GC-biased gene conversion (gBGC) (Gl emin et al. 2015). All scripts used for the anavar simulation analyses are available at https://github.com/ henryjuho/anavar_simulations. Additionally, all scripts used in the D. melanogaster analyses can be found at https:// github.com/henryjuho/drosophila_indels.

Supplementary Material
Supplementary data are available at Molecular Biology and Evolution online.