Adaptive evolution is substantially impeded by Hill-Robertson interference in Drosophila

It is known that rates of mutation and recombination vary across the genome in many species. Here we investigate whether these factors affect the rate at which genes undergo adaptive evolution both individually and in combination and quantify the degree to which Hill-Robertson interference (HRi) impedes the rate of adaptive evolution. To do this we compiled a dataset of 6,141 autosomal protein coding genes from Drosophila, for which we have polymorphism data from D. melanogaster and divergence out to D. yakuba. We estimated the rate of adaptive evolution using a derivative of the McDonald-Kreitman test that controls for the slightly deleterious mutations. We find that the rate of adaptive amino acid substitution is positively correlated to both the rates of recombination and mutation. We also find that these correlations are robust to controlling for each other, synonymous codon bias and gene functions related to immune response and testes. We estimate that HRi reduces the rate of adaptive evolution by ∼27%. We also show that this fraction depends on a gene’s mutation rate; genes with low mutation rates lose ∼11% of their adaptive substitutions while genes with high mutation rates lose ∼43%. In conclusion, we show that the mutation rate and the rate of recombination, are important modifiers of the rate of adaptive evolution in Drosophila.


Introduction
It has been shown that there are substantial levels of adaptive protein evolution in many species; for example, in species of Drosophila, rodents, bacteria and some plants it has been estimated that >25% of all amino acid substitutions are consequence of positive adaptive evolution (Bustamante et al. 2002;Smith and Eyre-Walker 2002;Bierne and Eyre-Walker 2003;Sawyer et al. 2003;Charlesworth and Eyre-Walker 2006;Haddrill et al. 2010;Ingvarsson 2010;Slotte et al. 2010;Strasburg et al. 2011). In contrast there are some species, such as humans and many other plants for which rates of adaptive evolution appear to be very low (Chimpanzee Sequencing and Analysis Consortium 2005;Zhang et al. 2005;Boyko et al. 2008;Eyre-Walker and Keightley 2009;Gossmann et al. 2010). The reasons for this variation between species is not fully understood, although effective population size (Ne) appears to be important (Gossmann et al. 2012).
The rate of adaptive evolution also appears to vary between genes within a genome. This is expected for several reasons. First, some genes are expected to undergo more adaptive evolution because of their functions; in particular those genes that interact with the environment or which are caught up in arms races are expected to have high rates of adaptive evolution, whereas those genes with highly conserved functions are expected to adapt slowly. Second, genes with high mutation rates are predicted to adapt faster than those with low mutation rates. This is expected whether most adaptation comes from newly arising mutations or from standing genetic variation. This is obvious if adaptation is mutation limited; if an organism is waiting for advantageous mutations to arise, and adaptation can potentially occur in more than one gene, then adaptation is mostly likely to occur in the gene with the highest mutation rate. However, we also expect adaptation to be greater even if advantageous mutations are selected from standing genetic variation, because genes with the highest mutation rates will contribute most to diversity. Third, we expect the rate of adaptive evolution to depend upon the rate of recombination; genes with low rates of recombination will suffer from Hill-Robertson interference (HRi) (Hill and Robertson 1966;Felsenstein 1974) in which selected mutations interfere with each other: a newly arising advantageous mutation may find itself in linkage disequilibrium with deleterious mutations, which will reduce its probability of fixation if it can not recombine away from them, or in competition for fixation with another advantageous mutation at a linked locus on another chromosome in the population. Fourth, we expect an interaction between the rate of recombination and the rate of mutation; HRi should be more prevalent in genes with high mutation rates and low rates of recombination.
Several studies have shown that gene function is important in determining the rate of adaptive evolution: Obbard et al. 2009 have shown that immune system genes have higher rates of adaptive evolution than other genes in Drosophila and Haerty et al. 2007 andPröschel et al. 2006 have shown that male-biased genes, like testes specific genes, have higher rates of adaptive evolution. It has also been shown that in humans many of the genes that present a signature of positive selection tend to be involved in sensory perception, immune defenses, tumor suppression, apoptosis, and spermatogenesis (Clark et al. 2003;Chimpanzee Sequencing and Analysis Consortium 2005;Nielsen et al. 2005). The role of recombination has also been studied; it has been shown in Drosophila that the rates of adaptation in different regions of the genome vary greatly by differences in the frequency of recombination (Presgraves 2005;Betancourt et al. 2009;Arguello et al. 2010;Mackay et al. 2012;Campos et al. 2014). Surprisingly the role of the mutation rate in the rate of adaptive evolution has not been considered before.
In this work we investigate the effect of the recombination rate and the mutation rate on the rate of adaptive amino acid substitutions (Ka+) in the genome of Drosophila melanogaster and quantify the extent to which HRi impedes the rate of adaptive evolution. Moreover, we are able to estimate how the fraction of lost adaptive amino acid substitutions to HRi depends on gene's mutation rate.

Results
To investigate the role of recombination and mutation in determining the rate of adaptive evolution we compiled 6,141 autosomal protein coding genes from Drosophila for which we have polymorphism data from D. melanogaster and divergence out to D. yakuba. For most of our analyses we use polymorphism data from the Drosophila melanogaster Genetic Reference Panel (DGRP) which was sampled from Raleigh, North Carolina (Mackay et al. 2012). However, in some analyses we compare our results to those obtained using the flies sampled from Gikongoro, Rwanda (DPGP2, Pool et al. 2012 (McDonald and Kreitman 1991) which corrects for slightly deleterious mutations. In this method it is assumed that mutations at one set of sites (in this analysis synonymous sites) are neutral and that selection acts upon the mutations at another set of sites (nonsynonymous sites). The site frequency spectra (SFS) of synonymous and non-synonymous polymorphisms are used to infer the distribution of fitness effects of neutral and deleterious mutations at the nonsynonymous sites and this information is used, in conjunction with the level of synonymous divergence, to predict how many neutral and nearly neutral non-synonymous substitutions are expected. If the observed divergence at non-synonymous sites exceeds this expectation, adaptive evolution is inferred and quantified. The rate of adaptive evolution is typically estimated using one of three statistics: Ka+, the rate of adaptive amino acid substitution, ωA, the rate of adaptive evolution relative to the mutation rate and α, the proportion of substitutions that are adaptive. The α statistic conflates the rates of adaptive and nonadaptive substitution and hence is not useful for our purposes here, and ωA is not useful for studying the effects of mutation on the rate of adaptive evolution because it controls for the factor being investigated, hence we have investigated how Ka+ depends upon the rate of recombination and mutation. However, in terms of recombination, we get qualitatively similar results whether we use Ka+ or ωA.

Recombination and Adaptation
We first studied the relationship between recombination rate and Ka+. To estimate the rate of adaptive evolution it is necessary to combine data from several genes because estimates tend to be error prone and sometimes undefined for individual genes. We therefore grouped genes into 45 bins of 136 genes each based on their rates of recombination. The results are shown in fig. 1 A and supplementary Table 1 A. There is a highly significant positive relationship between the rate of adaptation and recombination rate (Spearman's rank correlation coefficient ρ = 0.64, p < 0.001). However, for values beyond ~2 cM/Mb the relationship between recombination and adaptation reaches an asymptotic value. In order to test whether a curvilinear relationship fits the data better than a linear model we fit the function y = a + b e ⋅ -cx to our data and compared it to the fit of a linear model. Table 1 shows the the inferred parameters, the R 2 and AIC values for the two models. In terms of both AIC and R 2 the curvilinear model is favoured.
Our results are in contrast to those of Campos et al. 2014 who found that ω A was linearly related to the rate of recombination. The difference between the two analyses could be due to the fact that we have used a different measure of adaptive evolution, the polymorphism data come from different populations or to differences in the way in which the rate of recombination was estimated. The difference between the two studies is not due to the measure of adaptive evolution used; we observe a curvilinear  Pool et al. 2012]) and divergence genomic data together with the original unsmoothed high resolution recombination map. In contrast to the linear relationship they originally reported we found the same highly significant curvilinear pattern that we observed using the DGRP polymorphism data (see fig. 1  The positive correlation between recombination rate and Ka+ could be due to a number of potential biases in the dataset. For example, if some classes of genes with high rates of adaptation are preferentially located in regions with high rates of recombination, the observed positive correlation would not necessarily be a consequence of adaptive evolution being affected by recombination. There is evidence that immune system (Obbard et al. 2009) and male-biased or testes specific (Pröschel et al. 2006;Haerty et al. 2007) genes undergo higher rates of adaptive evolution than other genes. We confirm this result taking into account the influence of slightly deleterious mutations, which the previous analyses did not. We find that immune and testes-specific genes together exhibit adaptive rates 1.37x faster than other genes (the difference between immune/testes specific genes and other "control" genes is significant as judged by a permutation test -p = 0.017). However, if we remove immune and testes specific genes we still observe a highly significant curvilinear correlation between recombination rate and Ka+ (see fig. 1 C, In estimating the rate of adaptive evolution we have assumed that synonymous mutations are neutral, however selection is known to act upon synonymous sites in Drosophila (reviewed by Hershberg and Petrov 2008). In many cases, this is thought to be a result of selection favoring codons that can be translated more rapidly or accurately (Shields et al. 1988;Akashi 1994Akashi , 1995Carlini and Stephan 2003;Stoletzki and Eyre-Walker 2006). Additionally, synonymous sites may be under selection to maintain (or avoid) splicing enhancers (Parmley et al. 2006), messenger RNA secondary structures (Parsch et al. 1997;Baines et al. 2004;Stoletzki 2008) or particular short sequence motifs (Antezana and Kreitman 1999). Lawrie et al. 2013 have shown that ~22% of all 4-fold synonymous sites in D. melanogaster are under strong purifying selection, although the specific functional mechanism underlying this strong constraint is complex. Consistent with this weak selection favouring codons that have to be translated more rapidly or accurately we confirm previous results that K4 is significantly correlated to a measure of codon usage bias, Fop (the frequency of optimal codons) (Spearman's rank correlation coefficient ρ = -0.4, p < 0.001) Li 1987, 1989;Moriyama and Hartl 1993;Eyre-Walker 2003, 2006). However, we expect that any sort of weak selection on synonymous mutations to generate a positive correlation between Ka+ and recombination rate. This is because we expect that genes located in regions of high recombination, where selection on synonymous sites is more efficient (Kliman and Hey 1993;Haddrill et al. 2007;Campos et al. 2012), will tend to have a higher estimate of Ka+ because weak negative selection on synonymous mutations inflates the number of synonymous polymorphisms relative to the number synonymous substitutions. Therefore, to investigate whether selection on synonymous codon usage affects our adaptation estimates we divided our genes into 3 roughly equal groups according to their Fop value, and within each of these Fop groups we divided the data into 15 groups of genes according to their recombination rate (details of the groups can be found in the online supplementary material). We observe the same highly significant curvilinear relationship within each of the 3 Fop categories (see supplementary Table 1 and supplementary Table 3) (Spearman's rank correlation coefficient for high Fop genes ρ = 0.87, p < 0.001; medium Fop genes ρ = 0.78, p < 0.001; low Fop genes ρ = 0.76, p < 0.001). We also repeated our analysis using a smaller dataset of 3,369 genes where we can use polymorphisms and substitutions in short introns (<66 bp) as the neutral standard. This dataset is smaller because not all genes fulfil the intron quality and length criteria (see Materials and Methods). The same curvilinear pattern is observed (see fig. 1 D, Table 1 and supplementary Table 1 D) and the strength of the correlation is equivalent to that found with 4-fold sites (Spearman's rank correlation coefficient ρ = 0.75, p < 0.001). Hence, selection on codon usage does not seem to be responsible for the shape or the strength of the relationship between the rates of adaptive evolution and recombination.

Mutation and Adaptation
To investigate whether the rate of adaptive evolution is correlated to the mutation rate is not straightforward since we need to use the rate of synonymous substitution to estimate both the mutation rate and the rate of adaptive evolution. This lack of statistical independence between estimates will tend to generate a negative correlation between Ka+ and K4 just through sampling error. To avoid problems of non-independence we split our synonymous substitution estimate, K4, into 3 independent variables by sampling from a hypergeometric distribution (see Materials and Methods: Hypergeometric Sampling); we used K4,1 to rank genes and assign genes to bins, K4,2 to estimate the rate of adaptive evolution, and K4,3 as an estimate of mutation rate for this bin.
The data were divided, as with the recombination rate analyses, into 45 mutation bins of 136 genes each, but this time the data were divided according to their K4,1 value. Doing this we found a highly significant positive correlation between Ka+ and K4,3 (Spearman's rank correlation coefficient ρ = 0.45, p < is not a consequence of the non-random distribution of this kind of genes relative to the mutation rate. Natural selection on codon usage is expected to weaken rather than generate an artefactual positive correlation between Ka+ and K4,3, because selection on codon usage should reduce the rate of synonymous substitutions more than the level of synonymous polymorphism. To investigate whether selection on codon usage has an effect on the relationship between K a+ and K4,3 we divided the data set into 3 recombination rate levels and 3 Fop levels, and within each recombination rate and Fop class we grouped the genes into 5 groups according to their mutation rate (this yielded 45 bins of 136 genes each, see supplementary Table 2 C). We separate the data according to their recombination rate because it affects both the rate of adaptive evolution as well as the efficiency of selection on codon usage.
The correlation between Ka+ and K4,3, for each recombination rate and Fop category is shown in figure 2 C and 2 D, respectively. The graphs suggest that selection on codon bias makes little difference to the correlation between Ka+ and K4, but that the relationship is strongly affected by the rate of recombination; this is not surprising because we have shown above that genes with low rates of recombination undergo very little adaptive evolution (see fig. 1) and therefore not likely to be influenced by the rate of mutation.
To investigate this more formally we performed an analysis of covariance (ANCOVA), grouping genes by their Fop and recombination rate levels. In ANCOVA, a set of parallel lines are fitted to the data, one for each group. This enables a test of whether the common slope of these lines is significantly different from zero, and one can also investigate whether the groups differ in the dependent variable for a given value of the independent variable by testing whether the lines have different intercepts. If we consider Fop and recombination rate as fixed factors we find no significant correlation between Ka+ and K4,3 (ANCOVA p = 0.16). However, we find evidence that the slopes (ANCOVA p < 0.001) and intercepts (ANCOVA p < 0.001) differ between recombination rate categories, but there is no evidence that either the slope or intercept differs between Fop categories. If genes with low recombination rates (from 0 to 1.32 cM/Mb) are excluded, a very strong positive correlation between Ka+ and K4,3 is found for the rest of the dataset (see  Table 2 A) (Spearman's rank correlation coefficient ρ = 0.42 and p < 0.01), suggesting that the correlation between Ka+ and K4 is not a result of weak selection on 4-fold sites.

The Proportion of Adaptive Substitutions Lost to HRi
Our results show that the rate of adaptive evolution is significantly impeded in low recombining HRi is expected to be more prevalent in regions of the genome with higher rates of mutation, because this will increase the chance that a selected mutation will be segregating with other mutations subject to selection. To investigate whether this is the case in Drosophila we repeated the analysis above splitting the dataset into two subsets according to a gene's inferred mutation rate. To do this we split K 4 into two independent variates by sampling from a hypergeometric distribution. K4,1 was used to divide the genes into different mutation rate categories, while K4,2 was used to calculate Ka+. In this way we ensured that the estimates of adaptive evolution were not influenced by the way in which the data was divided. We find that the proportion of adaptive substitutions differs significantly between the 25% of genes with lowest and the 25% with highest mutation rate estimates (p-value from a bootstrap analysis = 0.

Discussion
We have shown that the rate of adaptive protein evolution is positively correlated to both the rate of recombination and the mutation rate in D. melanogaster. We have shown that these correlations are not due to an enrichment of immune response and testes related genes in regions of high recombination or mutation, or due to selection on synonymous sites. Instead it seems likely that the rate of adaptive evolution is correlated to the rate of recombination because of Hill-Robertson interference (HRi) and that it is correlated to the rate of mutation because genes with higher rates of mutation are more likely to produce the genetic variation needed for adaptation. This work quantifies for the first time the global impact of HRi on a given genome. We estimate that approximately 27% of all adaptive mutations, which would go to fixation if there was free recombination, are lost due to HRi. We show that this effect depends upon the mutation rate with genes with high mutation rates losing a greater proportion of their adaptive substitutions to HRi than genes with low mutation rates.
The recombination rate data we have used only includes cross-overs (CO) and excludes gene conversion (GC) events. This is because GC is expected to be a much less important force reducing HRi than CO. Although GC events occur ~5-times more frequently than COs ( Taken together these results suggest that HRi, at least from deleterious mutations, might be more important in humans than Drosophila. However, this needs to be confirmed by analysis, and this is difficult because humans appear to have undergone relatively little adaptive evolution (Boyko et al. 2008;Eyre-Walker and Keightley 2009;Gossmann et al. 2012) and this makes analysing the factors that affect the rate of adaptive evolution difficult. The potentially higher level of HRi in humans may explain in part why our species appears to have undergone relatively little adaptive evolution compared to Drosophila (Gossmann et al. 2012). However, the effect of HRi will depend upon the distributions of fitness effects and this is something we have limited information about in both of these species. It will be of great interest to do similar analyses to those performed here in other species.  Comeron et al. 2008). This will require further analysis and population genetic modelling, but in combination with other sources of information, for example, the dip in diversity around non-synonymous substitutions (Sella et al. 2009), the site frequency spectrum (Schneider et al. 2011) the high frequency variants that are left by selective sweeps (Fay and Wu 2000) it may be possible to infer much more about the DFE of advantageous mutations than previously thought.
The fact that so many adaptive substitutions are lost to HRi begs the question why Drosophila does not have a higher rate of recombination, particularly in areas where there is little or no recombination in its genome. This may be because selection on modifiers of the recombination rate is weak; a modifier that elevates the rate of recombination may allow advantageous mutations to spread more easily, but by its very nature it will tend to disassociate itself from the advantageous mutations that it helps spread. It therefore gets little or no benefit from the positive effects it causes.

Conclusions
This work has shown how two genomic forces, the rate of recombination and mutation rate, affect the rate of adaptation within the Drosophila melanogaster genome. We find that the rate of adaptive amino acid substitution is positively correlated to both recombination rate and an estimate of the mutation rate. We also find that this correlation is robust to controlling for each other, synonymous codon bias and gene functions related to immune response and testes. Finally we estimate that on average ~27% of all advantageous substitutions have been lost because of HRi and that this quantity depends on gene's mutation rate; ~11% of adaptive substitutions have been lost for low mutation genes and ~43% for high mutation genes. Hence, we have shown evidences that both recombination and mutation are important modifiers of the rate of adaptive evolution in Drosophila.

Population Genomic Data, Polymorphism and Divergence Estimates
This study was carried out on the four large autosomes (2L, 2R, 3L and 3R) of D. melanogaster using  release  5  of  the  Berkeley  Drosophila  Genome  Project  (BDGP  5, http://www.fruitfly.org/sequence/release5genomic.shtml) as the reference genome.

North American Population
The . We obtained a multiple genome alignment between the DGRP isogenic lines (Mackay et al. 2012) and the D. yakuba genome (Clark et al. 2007) using the BDGP 5 coordinates. This alignment is publicly available at http://popdrowser.uab.cat/ (Ràmia et al. 2012). For each gene we took all non-overlapping coding exons, independently of their inclusion levels. When two exons overlapped, the largest was chosen for subsequent analyses. Only exons without frameshifts, gaps or early stop codons were retained. In this way we tried to avoid potential alignment errors will inflate our mutation and adaptation rate estimates and create an artefactual positive correlation between them. Our final data set fulfilling all these criteria had 6,141 coding genes.
Exonic sequences were trimmed in order to contain only full codons. We defined our sites "physically", so we estimated the rates of substitution at sites of different degeneracy separately. Only zero-fold and 4-fold degenerate sites in exon core codons (as described by Warnecke and Hurst 2007) were used. To estimate the rate of synonymous substitutions, we restricted our analysis to those triplets coding the same amino acid in the two species (D. melanogaster -D. yakuba). In restricting our analysis to codons not exhibiting nonsynonymous differences we assume that the codon has undergone no amino acid substitution -this avoids having to compute the different pathways between two codons, which differ by more than one change and it is a reasonable assumption given the level of amino acid divergence. For 4fold degenerate sites we used the method of Tamura (Tamura 1992) to correct for multiple hits; this method allows for unequal GC content and ts/tv bias. Jukes and Cantor substitution method was used to correct for multiple hits at zero-fold sites (Jukes and Cantor 1969). We calculated the number of substitutions and the folded site frequency spectrum (SFS) for 4-fold degenerate sites and zero-fold degenerate sites, using an ad hoc Perl Script.
Following Halligan and Keightley 2006, in this study we used positions 8-30 of introns ≤65 bp in length as an alternative neutral reference for some analyses. For intron sequences, the invariant GT and AG dinucleotides at the 5' and 3' splice junctions, respectively, were excluded before calculating divergence. Only genes with at least two short introns and with less than 10% of gaps in the aligned sequences were kept. 3,369 orthologous genes passed the intron quality criteria in our final data set. We used an ad hoc Perl Script to estimate the number of short intron substitutions and to compute the folded SFS. Multiple hits were corrected using Jukes and Cantor method (Jukes and Cantor 1969).

African Population
We also used population genomic data from an African population. This comes from Gikongoro, Rwanda (DPGP2, Pool et al. 2012). The details of the assembly and data filtering can be found in Campos et al., 2014. The number of synonymous and nonsynonymous sites and substitutions (computed by Comeron 1995 method, which defines a site as a "mutational opportunity") and the SFS for 7,231 autosomal coding genes were estimated by Campos et al. 2012 and details are provided there. We study only those genes shared by both data sets (DGRP and DPGP2) taking into account the differences in gene annotation versions. This resulted in a dataset of 4,283 autosomal genes coming from this dataset.

Codon Bias Estimates and Recombination Landscape
We used the CodonW software (http://codonw.sourceforge.net/ by Peden 1999) to estimate the frequency of optimal codons, Fop. A higher Fop value suggest a higher efficacy of selection for codon usage, and vice versa.
Recombination rates were taken from Comeron et al. 2012 (www.recobinome.com). They estimated the rate of crossovers in 100 kb non-overlapping windows in cM/Mb units. The rate of crossingover for a gene was the rate in the 100kb that overlapped the mid-point of the gene. Unlike Campos et al.
2014, we did not apply LOESS regression to smooth out the recombination landscape, since we were interested in the fine-scale effects of recombination on the D. melanogaster genome.

Testes, Immune Genes and Permutation Test
If immune and male-biased or testes specific genes tend to have higher rates of recombination (and mutation) then the correlations to the rate of adaptive evolution would not necessarily be a consequence of adaptive evolution being affected by recombination (or mutation) rate. Thus, gene Ontology (GO) terms for 6,141 genes were downloaded from Fruitfly release 78 using the R package biomaRt (Durinck et al. 2005). A list of GO terms related to immune response and testes was constructed using the EBI's GO tool QuickGO (Binns et al. 2009). When a given gene was associated to a GO term from this list it was labelled as "Immune&Testes genes", the rest of genes were labelled as "Control genes". The list of immune response and testes related GO terms and the lists of genes in each group can be consulted in the supplementary Table 4. A permutation test was applied to assess whether Ka+ are significantly higher as it has been reported before (Pröschel et al. 2006;Haerty et al. 2007;Obbard et al. 2009) for immune response and testes related genes relative to the rest of control genes. We shuffled without replacement 1,000 times the complete list of genes by means of ad hoc Bash and Perl Scripting. Then, we estimated Ka+ using the DFE-alpha software (Eyre-Walker and Keightley 2009, see below) for each randomized group. Thus, we got the expected null distribution for the differences between "Control genes" minus the "Immune&Testes genes" for the statistic Ka+. Finally, the one-tailed p-value was obtained by counting the number of replicates below the observed difference divided by the total number of replicates (1000). The expected null distributions and the observed differences can be consulted in the supplementary figure 5.

Gene Bins and Adaptation Estimates
To estimate the rate of adaptive evolution it is necessary to combine data from several genes because estimates from a single gene are noisy and often undefined because of the lack of segregating (or divergent) sites for some site classes. We therefore grouped genes into bins according to their rate of recombination, mutation rate and/or Fop. The rank of values for all these bins can be consulted in the supplementary material.
It is essential to have a selection-free reference sequence that can be used as a baseline for determining the the rate of adaptive substitution acting on a particular target sequence (in our case zerofold degenerate sites). In this study we used the exon core 4-fold degenerate sites as the main proxy for the neutral mutation rate. For some cross-validation analyses short intron sites were also used. DFE-alpha (Eyre-Walker and Keightley 2009) models the DFE at functional sites by a gamma distribution, specified by the mean strength of selection, γ = −Nes, and a shape parameter β, allowing the distribution to take on a variety of shapes ranging from leptokurtic to platykurtic. DFE-alpha can model a single, instantaneous change in population size from an ancestral size N1 to a present-day size N2 having occurred t2 generations ago. Provided the SFS at both neutral and functional sites and the respective levels of divergence, DFEalpha infers γ, β, N2/N1, t2, and α at functional sites. From these estimates Ka+ can be easily estimated with the expression: Ka+ = α x Ka. We ran DFE-alpha for each bin independently using the local version provided at: http://www.homepages.ed.ac.uk/pkeightl//software. DFE-alpha was run in the folded SFS mode since the results are more robust.

Hypergeometric Sampling
To analyse the role of mutation on adaptation we correlated 4-fold divergence (K4) to the rate of adaptive zero-fold substitutions (Ka+). A limitation here is that K4 and Ka+ are not independent, since the estimation of Ka+ depends on K4, and then we expect Ka+ and K4 to be negatively correlated just through sampling error. To overcome this problem we split our mutation rate estimate (K4) into 3 independent variables (similar to the splitting done in (Smith and Eyre-Walker 2002;Piganeau and Eyre-Walker 2009;Stoletzki and Eyre-Walker 2011;Gossmann et al. 2012). This was done by generating a random multivariate hypergeometric variable as follows: D4,2 = multivariateHypergeometric(D4,2-3, 0.33 x L4), where L4 is the number of 4-fold sites and D4 is the total number of 4-fold divergent sites. We divided D4,1, D4,2 and D4,3 by ⅓ x L4 to get K4,1, K4,2 and K4,3, respectively. We used K4,1 to rank genes and assign genes to bins, we then used K4,2 to estimate the rate of adaptive non-synonymous substitution (Ka+) and K4,3 as an estimate of the mutation rate.
To test if genes with high mutation rates have lost more adaptive amino acid substitutions than genes with low mutations rates due to HRi, we have categorized genes into low and high mutations groups after splitting K4 into two statistically independent variables; K4,1 was used to rank genes and assign genes to bins, and K4,2 was used to estimate Ka+. Again, this was done by generating a random multivariate hypergeometric variable as follows:

Estimating the number of substitutions lost to HRi
To estimate how many adaptive substitutions are lost to HRi we proceeded as follows. Let Ka+(i), La(i) and RR(i) be the estimated rate of Ka+, the total number of 0-fold sites and the average rate of recombination for the ith group of genes (grouped by recombination rate). We fit a LOESS curve to the relationship between Ka+ and the rate of recombination. Let the estimated value of Ka+ for the ith group of genes from the LOESS curve be Ka+(i)' -this can be thought of as the predicted mean rate of adaptive nonsynonymous substitution for genes of the observed recombination rate. We took the average Ka+ for genes with rates of recombination above 2 cM/MB as our estimate of the rate of adaptive non-synonymous substitution without HRi -let this be Ka+, no_HRi. The expected total number of adaptive non-synonymous without any HRi is therefore: and the number lost to HRi is: for groups of genes with a rate of recombination < 2 cM/MB. Because the mean rate of adaptive nonsynonymous substitution from the LOESS curve can be negative we repeated the analysis setting any value of Ka+(i)' to zero if it was less than zero. Finally the proportion of substitutions lost to HRi:

Confidence Intervals and Bootstraps
To calculate the 95% confidence intervals (CIs) for the proportion of adaptive amino acid substitutions lost due to interference (fHRi), we bootstrapped 1,000 times the data by gene. We split each where fHRi (HIGH) is the proportion of lost substitutions for genes with high mutation rates and fHRi (LOW) is the proportion of lost substitutions for genes with low mutation rates. Finally, the proportion of the Z distribution below zero is the one-tailed p-value for the observed differences between high and low mutation groups.

Statistical Analyses
All statistical analyses were performed using the R statistical package (R Core Team 2013).
ANCOVAs and multiple linear regressions were carried out calling the R function "lm" (from the R package "base"). We calculated Spearman rank correlations (ρ) using the basic R function "cor.test" (from the R package "base"). To compare the linear and curvilinear model fit we used the R function "nls" (from the R package "stats"). The random hypergeometric variable was obtained through the R function "rhyper" (from the R package "stats"). LOESS regression was run using the R package "stats" after setting the smoothness parameter "span" from the default 0.75 value to 1. Increasing the "span" parameter decreases the smoothness of the fitted curve making the regressions more robust (or less noisy) across bootstrap replicates. All these R scripts are available upon request.  Table 1. Linear and curvilinear fit inferred parameters, R 2 and AIC for seven data sets where y = Ka+ and x = cM/Mb. The last column is the p-value for a F-test comparing both models. The DGRP data set is in the first row, the DPGP2 data is in the second row, the DGRP data excluding immune response and testes related genes is in the third row, the DGRP data for high, medium and low Fop genes are in the fourth, fifth and sixths rows, respectively, and finally the DGRP data using short intron sites (<66, bases 8-30) as neutral standard is in the last row.