Variation in the Intensity of Selection on Codon Bias over Time Causes Contrasting Patterns of Base Composition Evolution in Drosophila

Four-fold degenerate coding sites form a major component of the genome, and are often used to make inferences about selection and demography, so that understanding their evolution is important. Despite previous efforts, many questions regarding the causes of base composition changes at these sites in Drosophila remain unanswered. To shed further light on this issue, we obtained a new whole-genome polymorphism data set from D. simulans. We analyzed samples from the putatively ancestral range of D. simulans, as well as an existing polymorphism data set from an African population of D. melanogaster. By using D. yakuba as an outgroup, we found clear evidence for selection on 4-fold sites along both lineages over a substantial period, with the intensity of selection increasing with GC content. Based on an explicit model of base composition evolution, we suggest that the observed AT-biased substitution pattern in both lineages is probably due to an ancestral reduction in selection intensity, and is unlikely to be the result of an increase in mutational bias towards AT alone. By using two polymorphism-based methods for estimating selection coefficients over different timescales, we show that the selection intensity on codon usage has been rather stable in D. simulans in the recent past, but the long-term estimates in D. melanogaster are much higher than the short-term ones, indicating a continuing decline in selection intensity, to such an extent that the short-term estimates suggest that selection is only active in the most GC-rich parts of the genome. Finally, we provide evidence for complex evolutionary patterns in the putatively neutral short introns, which cannot be explained by the standard GC-biased gene conversion model. These results reveal a dynamic picture of base composition evolution.


Introduction
Here, we investigate the forces that affect evolution at 4-fold degenerate coding sites in Drosophila simulans and D. melanogaster. These sites represent a substantial part of the genome and are often used as references against which selection at other sites, for example, nonsynonymous sites, is tested (McDonald and Kreitman 1991;Rand and Kann 1996;Parsch et al. 2010;Stoletzki and Eyre-Walker 2011). Quantifying the forces that affect their evolution is necessary both for a general understanding of genome evolution and for making robust inferences about the influences of demographic factors and selection elsewhere in the genome (Matsumoto et al. 2016).
Codon usage bias (CUB) is a key feature of 4-fold sites, since it involves the disproportionate use of certain codons among the set of codons that code for a given amino acid. There is evidence for CUB in a wide range of organisms, including both prokaryotes and eukaryotes (Drummond and Wilke 2008;Hershberg and Petrov 2008). The most common explanation for CUB is that this maximizes translational efficiency and/or accuracy (Hershberg and Petrov 2008). Avoidance of the toxicity of misfolded proteins generated by translational errors has also been proposed as an explanation of CUB (Drummond and Wilke 2008). Recent work has also suggested the possibility that stabilizing, as opposed to directional, selection maintains the frequencies of synonymous codons, because CUB has been found to be unrelated to recombination rate in D. pseudoobscura, in line with theoretical predictions about the action of stabilizing selection (Charlesworth 2013;Fuller et al. 2014;Kliman 2014).
In most species of Drosophila for which data are available, including D. melanogaster and D. simulans, all the preferred codons are GC-ending (Vicario et al. 2007;Zeng 2010). Selection for preferred codons thus acts to increase the GC content of third position sites in coding sequences (CDSs), and GC-ending and AT-ending codons have been often used as proxies for preferred and unpreferred codons, respectively. As in other species, evidence for selection for preferred codons in D. melanogaster comes from the fact that the level of codon bias is related to expression level (e.g., Duret and Mouchiroud 1999;Hey and Kliman 2002;Campos et al. 2013). There is also a negative relationship between the level of CUB and synonymous site divergence in the Drosophila melanogaster subgroup, consistent with selection for preferred codons (Shields et al. 1988;Powell and Moriyama 1997;Dunn et al. 2001;Bierne and Eyre-Walker 2006).
However, analyses based on between-species sequence divergence have consistently revealed an excess of substitutions towards AT-ending codons in the D. melanogaster lineage (Akashi 1995(Akashi , 1996McVean and Vieira 2001;Poh et al. 2012). Two hypotheses have been proposed for this observation. These are, firstly, that D. melanogaster has undergone a reduction in the population-scaled strength of selection for preferred codons, 4N e s, where N e is the effective population size and s is the selection coefficient favoring preferred codons in heterozygotes for the preferred allele. This reduction in selection could be caused either by a reduction in N e (Akashi 1996), or a reduction in s, perhaps due to changed ecological conditions Vogl 2012a, 2012b). The second explanation is that D. melanogaster has undergone a shift in mutational bias towards AT alleles (Takano-Shimizu 2001; Kern and Begun 2005;Zeng and Charlesworth 2010a;Clemente and Vogl 2012b). It has also been argued that both factors must be invoked to explain patterns of variation and evolution in the D. melanogaster lineage Vogl 2012a, 2012b).
Several attempts to detect selection on codon bias in D. melanogaster have come to conflicting conclusions. For instance, some polymorphism-based studies managed to detect evidence for selection favoring GC-ending codons (Zeng and Charlesworth 2009;Campos et al. 2013), although the intensity of selection may be weak relative to other Drosophila species (Kliman 1999;Andolfatto et al. 2011). However, other studies did not find support for such ongoing selection (Clemente and Vogl 2012a;Vogl and Clemente 2012;Poh et al. 2012). Thus, there is a pressing need to gain a better understanding of the dynamics of selection on codon bias and understand the sources of these conflicting results.
Much less is known about D. simulans. Early studies based on a small number of loci suggest that this species may be at base composition equilibrium, with the number of substitutions from AT-ending codons to GC-ending codons not statistically different from that in the opposite direction (e.g., Akashi 1995Akashi , 1996Kern and Begun 2005;Akashi et al. 2006;Haddrill and Charlesworth 2008). However, more recent analyses have revealed AT-biased substitution patterns (Begun et al. 2007;Poh et al. 2012), suggesting a possible reduction in selection intensity in this lineage, although the reduction may be less severe compared with that in D. melanogaster (McVean and Vieira 2001). In contrast to the situation in D. melanogaster, the few polymorphism-based studies in D. simulans generally point to evidence for selection for preferred codons (Akashi 1997(Akashi , 1999Kliman 1999;Andolfatto et al. 2011). It is therefore unclear whether/how selection intensity has changed over time in D. simulans, and how the dynamics of base composition evolution differ from those in D. melanogaster.
Irrespective of the reason(s) for the AT-biased substitution pattern in these two Drosophila lineages, these findings present a problem for ancestral state reconstruction, a process that is necessary for inferring substitution patterns along a lineage of interest and for polarising segregating sites into ancestral and derived variants to understand their more recent evolution. Use of maximum parsimony methods or maximum likelihood models that assume equilibrium base composition under such circumstances can lead to erroneous inferences although these two methods were used in many previous analyses of various Drosophila species (Akashi et al. 2007;Matsumoto et al. 2015). Departures from base composition equilibrium may also lead to complex polymorphism patterns (Zeng and Charlesworth 2009). Both of these sources of difficulties may contribute to the mixed evidence for the nature of the forces acting on synonymous sites in Drosophila (Zeng and Charlesworth 2010a;Clemente and Vogl 2012a).
A factor that may confound the study of CUB is GC-biased gene conversion (gBGC), which is a recombination-associated process, and acts to increase GC content at sites where recombination occurs (Duret and Galtier 2009). Most studies have found little or no evidence for gBGC in D. melanogaster (Clemente and Vogl 2012b;Comeron et al. 2012;Campos et al. 2013;Robinson et al. 2014), although there is some evidence either for the action of selection for GC basepairs or gBGC on the evolution of non-coding sequences in D. simulans (Haddrill and Charlesworth 2008). In order to control for gBGC, we have analyzed data on the 8-30-bp region of short introns (SIs), which are widely considered to be evolving near-neutrally in Drosophila (Halligan and Keightley 2006;Parsch et al. 2010;Clemente and Vogl 2012b).
To address the questions raised above, we need to look at both divergence and polymorphism data from both species; the analyses should explicitly take into account departures from equilibrium, so that signals of selection can be detected without biases. To this end, we have obtained new wholegenome data from D. simulans and used an existing highquality data set for D. melanogaster. Using the reference genome of D. yakuba as an outgroup, we used state-of-theart methods to reconstruct ancestral states. In addition, we employed methods that can infer selection intensity on different timescales, along the D. melanogaster and D. simulans lineages, with the aim of shedding further light on the evolutionary dynamics of genome composition in these two species.

Sequence Data Preparation
We first describe the sequencing of 22 new D. simulans isofemale lines, 11 of which were collected by William Ballard in 2002 from Madagascar (MD lines: MD03, MD146, MD197, MD201, MD224, MD225, MD235, MD238, MD243, MD255, and MD72); the other 11 were collected by Peter Andolfatto in 2006 from Kenya (NS lines: NS11, NS111, NS116, NS19, NS37, NS49, NS63, NS64, NS89, NS95, and NS96). We produced homozygous lines by full-sib inbreeding in the Charlesworth lab for nine generations; however, six lines (NS11, NS63, NS116, MD224, MD243, and MD255) were lost early in the process of inbreeding. For these lines, we sequenced the initial stocks that we had received from the Andolfatto lab. Genomic DNA was prepared for each isofemale line by pooling 25 females, snap freezing them in liquid nitrogen, extracting DNA using a standard phenol-chloroform extraction protocol with ethanol, and ammonium acetate precipitation. These flies were sequenced by the Beijing Genomics Institute (BGI; http:// bgi-international.com/; last accessed December 28, 2016). A 500-bp short-insert library was constructed for each sample, and the final data provided consisted of 90-bp paired-end Illumina sequencing (pipeline version 1.5), with an average coverage of 64Â. We double-checked the quality of the filtered reads for each allele with FastQC (available at http://www.bioinformatics.babraham.ac.uk/projects/fastqc/; last accessed December 28, 2016), and no further trimming was necessary. The raw reads have been deposited in the European Nucleotide Archive, study accession number: PRJEB7673.
Downstream of sequencing, we combined both data sets and used a BWA/SAMtools/GATK pipeline, previously described in Campos et al. (2014) and Jackson et al. (2015), to generate genotype calls. Briefly, we aligned and mapped reads for each D. simulans line to the second-generation assembly of the D. simulans reference sequence (Hu et al. 2013) using BWA 0.7.10 (Li andDurbin 2009). We used SAMtools 1.1 ) to filter alignments with a mapping quality <20, and to sort and index the resulting alignments. To combine reads from one sample across multiple lanes, we used Picard tools 1.119 (http://broadinstitute.github.io/picard/; last accessed December 28, 2016) to edit BAM file headers and SAMtools 1.1 to merge, resort and index BAM files per sample. We then used Picard tools 1.119 to fix mate information, sort the resulting BAM files and mark duplicates. We performed local realignment using the RealignerTargetCreator and IndelRealigner tools of GATK 3.3 (https://www.broadinstitute.org/gatk/; last accessed December 28, 2016).
For single nucleotide polymorphism (SNP) calling, we used the UnifiedGenotyper for diploid genomes (parameter: sam-ple_ploidy 2) and generated a multisample VCF file (Danecek et al. 2011). Subsequently, we performed variant quality score recalibration (VQSR) to separate true variation from machine artefacts (DePristo et al. 2011). We used biallelic and homozygous (for a given individual) SNPs detected at 4-fold sites at a frequency equal to or higher than seven sequenced individuals as the training set. Six SNP call annotations were considered by the VQSR model: QD, HaplotypeScore, MQRankSum, ReadPosRankSum, FS, and MQ, as suggested by GATK (see http://www.broadinstitute.org/gatk/; last accessed December 28, 2016;DePristo et al. 2011). The SNPs were allocated to tranches according to the recalibrated score, so that a given proportion of the true sites were recovered. We retained variants that passed a cutoff of 95%, the variant score limit that recovers 95% of the variants in the true data set. We refer to this data set as "filtered." From the multisample recalibrated VCF file, we made a consensus sequence FASTA file for each individual using a custom Perl script. The variant calls that did not pass the filter were called N (missing data) at the sites in question. We also generated an unfiltered data set, where we did not implement any form of variant score recalibration. We refer to this data set as "unfiltered." The VCF files and the scripts used to produce them can be downloaded by following the hyperlink provided in http://zeng-lab.group.shef.ac.uk; last accessed December 28, 2016.
Annotation of the D. simulans Data Set Using annotations from the D. simulans reference (Hu et al. 2013), we extracted CDSs for each gene and made FASTA alignments. We included the D. simulans reference sequence and the 1:1 FlyBase orthologous genes of D. melanogaster (release version 5.33) and D. yakuba (release version 1.3). We then performed amino acid sequence alignments using MAFFT (Katoh et al. 2002). These amino acid sequence alignments were translated back to nucleotides using custom scripts in PERL to produce in-frame CDS alignments that included the 42 D. simulans alleles and the D. melanogaster and the D. yakuba outgroups. We extracted 4-fold (and 0-fold) degenerate sites from CDS alignments which were 4-fold (0-fold) degenerate in all lines, with the condition that there was at most one segregating site in the codon to which the 4-fold (0-fold) site belonged. We retained the 4-fold (0-fold) sites from an alignment only if there were at least ten 4-fold (0-fold) sites in that alignment in total. For the polymorphism and substitution analyses on 4-fold sites reported in the Results, we carried out the same procedure with the added condition that sites must also be 4-fold degenerate in the three reference sequences.
We also extracted the intron coordinates from the D. simulans reference genome sequence. Genomes were masked for any possible exons. For each D. simulans intron, we obtained the corresponding orthologous intron of D. melanogaster (Hu et al. 2013). For D. yakuba, for each orthologous gene, we obtained all its annotated introns and blasted them against the D. melanogaster introns (of the same ortholog) with an e-value of <10 À5 and selected the reciprocal best hit (because introns are generally short, the threshold e-value was conservative; see Results). We used RepeatMasker (http://www.repeatmasker.org) to mask repetitive elements in our intron data set, using the library of repeats for D. melanogaster and the default settings. We produced a final alignment of each intronic polymorphism data set of D. simulans with the corresponding D. melanogaster and D. yakuba orthologs using MAFFT.
We extracted positions 8-30 bp of all introns <66-bp long, based on the D. melanogaster reference alignment for each intron, as we considered the D. melanogaster reference to be the best annotated of the three species. To do this, we scanned the D. melanogaster reference sequence for each intronic alignment. We retained the alignment if the D. melanogaster reference sequence was <66-bp long (not including alignment gaps), and then further obtained the coordinates of the 8-bp position and the 30-bp position in the D. melanogaster reference sequence after discarding any gaps introduced by the alignment program. We then cut the whole alignment at these coordinates. These SI sites are thought to be close to neutrally evolving in Drosophila, based on their patterns of polymorphism and substitution (Halligan and Keightley 2006;Parsch et al. 2010;Clemente and Vogl 2012b).
Quality Control of D. simulans Genotypes The lines that were inbred successfully for nine generations to produce homozygous samples still retained low levels of residual heterozygosity, which may have been due to a failure to purge our lines of natural variation (Stone 2012), or to SNP calling errors (the latter should be less likely given the high coverage [64Â] and our stringent SNP calling regime). We quantified the amount of residual heterozygosity per sample for each of the unfiltered and filtered data sets (supplementary fig. S1, Supplementary Material online). As expected, the filtered data set exhibited lower levels of residual heterozygosity (ND samples: mean value = 0.0616%, all values <0.5%; MD samples: mean value = 0.0168%, all values <0.15%). The six lines that were not subject to the inbreeding procedure (see above) did not have substantially higher levels of residual heterozygosity than the remaining samples, presumably because they were already considerably inbred after being kept as laboratory stocks for several years. For downstream analyses we treated heterozygous sites as follows: at each heterozygous site within a sample, one allele was chosen as the haploid genotype call at that site with a probability proportional to its coverage in the sample. The alternative allele was discarded. Because our samples are from partially inbred lines that originated from a mating between at least one wild male and only one wild female, heterozygosity at a site implies that the site is segregating in the wild population. By sampling one allele at random, we attempted to replicate the inbreeding process, which aimed to remove heterozygosity from within the lines.
Pairwise S values (synonymous site diversity) for all 42 D. simulans lines showed three pairs of samples which deviated substantially from the distribution of pairwise S between samples (mean S for all samples = 0.030, SD = 0.0018). These pairs were MD201-NS116 ( S = 7.28 Â 10 À5 ); NS137-NS37 ( S = 0.0034) and NS49-NS96 ( S = 0.0097). A principal component analysis (PCA) of binary genotypes placed NS116 within the cluster of MD samples, and NS116 exhibited a more MD-like genetic distance to the D. simulans reference sequence. These results were based on the filtered data set, but the unfiltered data set returned qualitatively identical patterns (data not shown). We therefore excluded NS116 from all downstream analyses based on the likelihood of its representing labeling error. We also excluded NS37 and NS96 as these individuals had the highest levels of residual heterozygosity out of the remaining two pairs of closely related samples (supplementary fig. S1, Supplementary Material online).
To further assess the quality of our data sets, we compared polymorphism and divergence statistics to data previously published in the literature on D. simulans (see Results). In particular, we calculated a range of summary statistics per gene: F ST between NS and MD samples; , Tajima's D, D , and W within the NS sample, within the MD sample, and for both samples combined. D for a given gene (Langley et al. 2014) is defined as where k represents the mean number of pairwise differences among the n alleles in the sample, and S is the number of segregating sites (Langley et al. 2014). We calculated this statistic using a modified version of the tajima.test() function from the pegas package (Paradis 2010) in R. D is similar to Tajima's D (Tajima 1989), but is normalized by the total amount of diversity. Its advantage over Tajima's D is that it is less dependent on the total diversity for the sample (Langley et al. 2014). We also compared K A and K S between the three reference sequences (D. melanogaster, D. simulans and D. yakuba) in all CDS alignments using the kaks() function from the seqinr package in R, and K SI between the reference sequences in all our SI alignments using the dist.dna() function from the pegas package in R, based on the K80 method (Kimura 1980). These analyses are presented in the first section of the Results.

Divergence-Based Analyses
We used three methods to determine the ancestral state at the melanogaster-simulans (ms) node, all of which used only the three reference sequences. First, we used parsimony, implemented in custom scripts in R. Second, we used the nonhomogeneous general time-reversible (GTR-NH b ) substitution model, implemented in the baseml package of PAML v4.8 (Yang 2007), after checking that GTR-NH b fitted the data better than the stationary GTR model using chi-squared tests (see Results). The use of this method to reconstruct ancestral sites when nucleotide composition is nonstationary is described in the study by Matsumoto et al. (2015) and has been shown to produce highly accurate results in the presence of nonequilibrium base composition, whereas the parsimony method is likely to be biased. Under the GTR-NH b method, we implemented two ways of determining the ancestral state at the ms node, by either using the single best reconstruction (SBR) of the ancestral sequence at the ms node, or by weighting the four possible nucleotides at the ms node by the posterior probability of each. Instead of ignoring suboptimal reconstructions, as the parsimony and SBR methods do, the last option weights all the possible ancestral states by their respective posterior probabilities.
Following Matsumoto et al. (2015), we refer to these two GTR-NH b -based methods as "SBR" and "AWP," respectively. The AWP method should be more reliable than either parsimony or SBR when base composition is not at equilibrium (Matsumoto et al. 2015).
Since some of the models we used are very parameter-rich (e.g., the GTR-NH b model has 39 parameters for three species, and the M1* model described more fully below has 25 parameters for D. simulans and 21 parameters for D. melanogaster, given the sample sizes), we had to group genes into bins to avoid overfitting. To investigate the relationship between selection and GC content at 4-fold sites (a proxy for the extent of CUB), we binned 4-fold sites by the GC content in the D. melanogaster reference sequence, which we used as a proxy for the historic strength of selection favoring GC alleles. GC content evolves very slowly over time (Marais et al. 2004), and is highly correlated between D. simulans and D. melanogaster CDS (Pearson's correlation coefficient r = 0.97, P < 2.2 Â 10 À16 ), so this strategy should accurately represent GC content at the ms node. We binned 4-fold degenerate sites into 20 autosomal and four X-linked bins. Bins were chosen to maintain approximately the same number of genes per bin. The autosomal and X-linked SI sites were always treated as two separate bins. We also followed this binning convention for other analyses. When carrying out correlation analyses between GC content bins and other variables (e.g., substitution rate and estimates of the selection coefficient), we included only the 4-fold degenerate site GC bins, but not the SI bin. We also restricted the correlation analysis to the autosomal bins only. Given the small number of bins on the X chromosome, this type of analysis is underpowered; in fact, the smallest P value that Kendall's can achieve with four data points is 0.08.
To determine whether or not D. melanogaster and D. simulans are in base composition equilibrium, for each bin we counted the numbers of S!W (N S!W ), W !S (N W !S ), and putatively neutral (N neu ) substitutions (i.e., S!S and W !W ), where S represents G or C, the strong (potentially preferred) allele, and W represents A or T, the weak (potentially unpreferred) allele. We did this along each of the D. melanogaster and D. simulans lineages by (probabilistically) comparing the reconstructed ancestral states at the ms node with the reference genomes. This is reasonable because the branch length is much higher than the level of within-species polymorphism (see Results). For the AWP method, we rounded our results to the nearest integer. Where possible, we compared our results to those published in the literature, and to equivalent results kindly provided by Juraj Bergman and Claus Vogl (pers. comm.; supplementary table S2, Supplementary Material online). To obtain the W !S substitution rate (r W !S ) per bin, we divided N W!S by the total number of AT sites (L W ) at the ms node in that bin. Similarly, r S!W ¼ N S!W =L S .

Polymorphism-Based Analyses
For each bin, we estimated the derived allele frequency (DAF) at segregating sites, using the three methods described above to infer ancestral states at the ms node, which should be a reasonable approximation given the rarity of shared polymorphism for the two species (Clemente and Vogl 2012b). We classified these sites into segregating sites at which the ancestral allele was AT and the derived allele was GC (DAF W !S ), and segregating sites at which the ancestral allele was GC and the derived allele was AT (DAF S!W ), as well as segregating sites which had mutated from A to T, or vice versa, and from G to C or vice versa (DAF neu ). We also calculated D (Langley et al. 2014) for each bin. We mostly display results obtained from the AWP method in the Results section, because it is probably the most reliable of the three. Qualitatively, the results are generally insensitive to the choice of method for reconstructing ancestral sites. Thus, we present a set of figures in the supplement (supplementary figs. S6-S11, Supplementary Material online) that are parallel to those shown in the main text, but were obtained using either parsimony or SBR, respectively.
We used two polymorphism-based methods for estimating the population-scaled strength of the force favoring GC alleles, ¼ 4N e s, where N e is the effective population size and s is the selection coefficient against heterozygous carriers of the AT allele. The first is the method of Glé min et al. (2015), which uses three different classes of polarized unfolded site frequency spectra (SFS) for sites that are segregating in the present day: S!W , W !S, and putatively neutral (see above). This method is capable of taking into account polarization errors, which, if untreated, may lead to upwardly biases estimates of (Hernandez et al. 2007), by incorporating them into the model and estimating them jointly with the parameters of interest. It is also capable of correcting for demographic effects, by introducing nuisance parameters to correct for distortions in the SFS due to demography (after Eyre-Walker et al. 2006). Because it only considers the SFS of derived alleles, we expect this method to recover signatures of selection on a relatively recent time scale (~4N e generations if we conservatively assume neutrality). We generated unfolded SFSs for this model using the AWP method to infer the ancestral state at the ms node and estimated the strength of using R code provided in the supplementary material of Glé min et al. (2015). We refer to the models using this method with the same notation as Glé min et al. (2015). These are model M0, where ¼ 0 and polarization errors are not taken into account; M1, where 6 ¼ 0 and polarization errors are not taken into account; and M0* and M1*, which are the equivalent models after correcting for polarization errors. Note that the method for controlling for demography drastically increases the number of model parameters. For instance, for M1, in addition to and the three mutational parameters for each of the three SFSs ( ¼ 4N e ), it requires an additional n -2 nuisance parameters, where n is the number of frequency classes (in our case, this is the same as the sample size). Given the dearth of SNPs relative to substitutions, and in particular the lower diversity level in D. melanogaster, we repeated some of these analyses by pooling SNP data across several nearby GC content bins (see Results).
Second, we used the method of Zeng and Charlesworth (2009), modified as described by Evans et al. (2014), which uses the unpolarized SFS (including fixed sites) to infer parameters of a two-allele model with reversible mutation between W and S alleles, selection and/or gBGC, and changes in population size (see Zeng (2012) for a discussion of the differences between the reversible mutation model and the infinite-sites model on which the method of Glé min et al. (2015) is based). Because this method uses the unpolarized SFS, no outgroup is required. This method can recover signals of selection (and other population genetic parameters) over a longer time scale than the methods of Glé min et al. (2015) because it uses information on the base composition of the species to estimate the parameters (see Zeng and Charlesworth 2009; supplementary fig. S8-S11). As above, we defined W (AT) and S (GC) as our two alleles. We define u as the rate at which S alleles mutate to W alleles, and v as the mutation rate in the opposite direction, and ¼ u=v as the mutation bias parameter. To incorporate a change in population size, we assume that the population in the past is at equilibrium with population size N 1 , which then changes instantaneously to N 0 (this can be either an increase or a reduction in size) and remains in this state for t generations until a sample is taken from the population in the present day (Zeng and Charlesworth 2009;Haddrill et al. 2011;Evans et al. 2014). As with M1* and M1, we also tested the equivalent models where ¼ 0. For each model, in order to ensure that the true MLE was found, we ran the search algorithm multiple times (typically 500), each initialized from a random starting point. All the results reported above were found by multiple searches with different starting conditions. Chi-squared tests were used to evaluate statistical support for different models. We refer to these models as ZC0 ( ¼ 0) and ZC1 ( 6 ¼ 0) below. A software package implementing this approach is available at http://zeng-lab.group.shef.ac.uk. For all methods (Zeng and Charlesworth 2009;Glé min et al. 2015), we fitted independent models for each SI and 4-fold bin (Zeng and Charlesworth 2010b;Messer and Petrov 2013).

Results
Patterns of Polymorphism and Divergence in the D. simulans and D. melanogaster Data Sets For D. simulans, after extracting 4-fold degenerate sites and SI (positions 8-30 bp of introns <66-bp long), we retained 7,551 autosomal CDS alignments and 1,226 X-linked CDS alignments, as well as 5,578 autosomal SI alignments and 516 X-linked SI alignments. The final data set contained the reference sequences of D. simulans, D. melanogaster, and D. yakuba, as well as polymorphism data from 39 D. simulans lines, including 21 Madagascan (MD) lines and 18 Kenyan (NS) lines, with 22 of the 39 lines being described for the first time in this article (see Materials and Methods). For D. melanogaster, we retained 5,550 autosomal CDS alignments and 888 X-linked CDS alignments, as well as 7397 autosomal SI alignments and 738 X-linked SI alignments, containing polymorphism data from 17 Rwandan (RG) lines, as well as the three reference sequences.
Summary statistics calculated using a D. simulans data set that was filtered to separate true genetic variation from variant-calling artefacts are presented in table 1 (see supplementary table S1, Supplementary Material online for the unfiltered data). Consider first the MD lines (n = 21) collected from the putatively ancestral range of the species in Madagascar (Dean and Ballard 2004). Autosomal at 4-fold sites (referred to as 4 ) was 0.0329 and 0.0317 for the unfiltered and filtered data sets, respectively, similar to the value of 0.035 reported by Begun et al. (2007). On the X, 4 was 0.0191 and 0.0182 for the two data sets; the Begun et al. (2007) value was 0.02. Tajima's D and Á p at 4-fold sites are both negative, implying that there may have been a substantial recent population size expansion. Again, values obtained from the filtered and unfiltered data are very similar (cf . table 1 and supplementary table  S1, Supplementary Material online). Overall, diversity was slightly reduced for our filtered data set, which may have been a result of more conservative variant filtering criteria, but the differences are minimal. In what follows, we only present results obtained from the filtered data set. SI sites, which we only obtained from our filtered data set, are more diverse than 0-fold and 4-fold sites in the MD population, for both the autosomes (A) ( SI = 0.0321) and the X ( SI = 0.0208) (table 1).
The samples collected from Kenya (the NS lines; n = 18) have consistently lower diversity levels at 0-fold, 4-fold, and SI sites, and less negative Tajima's D and Á p , probably caused by bottlenecks associated with the colonization process (Dean and Ballard 2004). Nonetheless, F ST between the two populations at 4-fold sites is rather low:~2.5% between NS and MD (table 1), suggesting that there is relatively little genetic differentiation between the ancestral and derived populations. There is also little difference in F ST at 4-fold sites between the X and A. Similar to the MD population, SI sites are the most diverse class of site as measured by (table 1).
The patterns reported above contrast with those observed in D. melanogaster (see table 1 of Jackson et al. 2015). We focus first on samples from the putatively ancestral ranges of both species (i.e., the RG lines for D. melanogaster, and the MD lines for D. simulans). Autosomal 4 is~2.06 times higher in D. simulans, suggestive of higher N e , which may lead to more effective selection (see Discussion). Tajima's D is also less negative in D. melanogaster, with the differences at 4-fold sites being the most noticeable (À0.11 vs. À1.03 for A, and À0.47 vs. À1.31 for the X), suggesting a more stable recent population size in D. melanogaster, which is supported by the fits of the Zeng and Charlesworth (ZC) method to the data (see below). The X:A ratio of 4 in D. melanogaster was 1.08, much higher than the expected value of 0.75 under the standard neutral model, whereas it was 0.57 in D. simulans. Furthermore, F ST at 4-fold sites between RG and a sample from France (Jackson et al. 2015) in D. melanogaster is~10 times higher than that between the MD and NS populations in D. simulans. Interestingly, the difference in F ST between the X and A is much more marked in D. melanogaster (0.29 vs. 0.17 for the X and A, respectively) than in D. simulans (0.025 for both X and A). Various theories have been proposed to explain differences in diversity levels between X and A, which include sex-specific variance in reproductive success (Charlesworth 2001), demographic effects (Pool and Nielsen 2007;Singh et al. 2007;Pool and Nielsen 2008;Yukilevich et al. 2010), positive and negative selection (Singh et al. 2007;Charlesworth 2012), and differences in recombination rate (Charlesworth 2012). Detailed analyses of the factors underlying X-autosomal differences are outside the scope of this study; below we present results from X and the autosomes separately.
We also assayed divergence between the reference sequences in our alignments. Between D. melanogaster and D. simulans, K A , K S and K SI were 0.014, 0.109 and 0.130, respectively. These values are similar to those in Table 1 of Parsch et al. (2010) (K A = 0.019, K S = 0.106 and K SI = 0.123), and in Zhang et al. (2013; supplementary table S2 therein) (K A = 0.015 and K S = 0.12). In our data K A , K S and K SI , between D. melanogaster and D. yakuba were 0.036, 0.266 and 0.294, respectively; between D. simulans and D. yakuba, they were 0.036, 0.250 and 0.302, respectively. Note that divergence is always highest at the SI class of site, which is in agreement with these sites being relatively unconstrained (Halligan and Keightley 2006;Parsch et al. 2010;Clemente and Vogl 2012b). Overall, these patterns suggest that our alignments are of high quality.
In the following sections of this article, we first focus on analysing the forces that act on 4-fold sites. To investigate the relationship between selection and GC content at 4-fold sites (a proxy for the extent of CUB), we binned 4-fold sites by their GC content in the D. melanogaster reference sequence, which we used as a proxy for the historic strength of selection favoring GC alleles. In this part of the analysis, the putatively neutrally evolving SI sites are analyzed as a whole and presented alongside results from 4-fold sites for comparison. Later, to gain further insights into the evolution of the SI sites themselves, we binned them according their GC content, and analyzed the bins in the same manner as the 4-fold sites. Only data from the putatively ancestral populations (i.e., MD in D. simulans and RG in D. melanogaster) are considered, in order to avoid complications introduced by population structure. For ease of notation, we use GC and S (the strong, potentially preferred allele) interchangeably below; the same applies to AT and W (the weak, potentially unpreferred allele).
Excess of S!W substitutions at 4-Fold sites on both the D. simulans and the D. melanogaster Lineages For all the 4-fold site bins and the SI bin (on both A and X), a nonhomogeneous (GTR-NH b ) substitution model implemented in PAML always fitted the data significantly better than a stationary (GTR) substitution model in both species (min 2 = 166.86, df = 28, P = 1.05 Â 10 À21 ), which is indicative of a nonequilibrium base composition. Considering the genome as a whole, both the D. melanogaster and D. simulans lineages showed an excess of S!W changes at autosomal and X-linked 4-fold degenerate sites, regardless of which method was employed to infer ancestral states at the melanogaster-simulans (ms) node (table 2; supplementary table S2, Supplementary Material online; see Materials and Methods). It is evident that the excess is greater in D. melanogaster than D. simulans. For instance, based on autosomal data obtained by the AWP method, which we expect to be the most accurate method of the three (Matsumoto et al. 2015), the ratio N W !S =N S!W , where N W !S and N S!W are the numbers of substitutions between the S and W alleles along the lineage of interest, is 0.49 in D. simulans, but is only 0.26 in D. melanogaster ( 2 = 2145.8, df = 1, P < 0.001). Interestingly, the S!W bias is much more pronounced on the X of D. melanogaster with an N W !S =N S!W ratio of 0.17, significantly different from the A value of 0.26 ( 2 =212.8, df = 1, P < 0.001), whereas in D. simulans the ratios are much closer to one another, 0.53 and 0.49, respectively, although this difference is still significant ( 2 = 6.97, df = 1, P = 0.008). These results are in line with previous findings of an excess of AT (or unpreferred codon) substitutions at silent sites in D. melanogaster (Akashi 1995(Akashi , 1996Takano-Shimizu 2001;Akashi et al. 2006). For D. simulans, our data are in agreement with a data set curated entirely independently by Juraj Bergman and Claus Vogl (personal communication; supplementary table S2, Supplementary Material online), and suggest that there is a much more pronounced S!W bias than was found in some previous studies (Akashi et al. 2006;Begun et al. 2007;Poh et al. 2012).
The ratio N W !S =N S!W is much closer to unity for SI sites than for 4-fold sites (table 2), which is also in agreement with the previous finding that SI are generally closer to equilibrium than 4-fold sites in both species (Kern and Begun 2005;Singh et al. 2009;Haddrill and Charlesworth 2008;Robinson et al. 2014). The three methods for inferring ancestral states in the ms ancestor consistently suggest an AT substitution bias at SI sites in the D. melanogaster lineage (table 2). The situation is somewhat more complex in D. simulans. For the X, all three methods suggest a mild GC bias, but the ratio based on AWP, which should be the most reliable method of the three (Matsumoto et al. 2015), is not significantly different from 1 ( 2 = 0.286, df = 1, P = 0.59). For the autosomes, parsimony suggests a GC bias ( 2 = 19.7, df = 1, P = 0.01), but both SBR and AWP provide some support for a slight AT bias (SBR: 2 = 3.73, df = 1, P = 0.05; AWP: 2 = 5.55, df = 1, P = 0.019) (table 2). This may reflect the tendency for parsimony to overestimate changes from common to rare basepairs (Collins et al. 1994;Eyre-Walker 1998;Akashi et al. 2007;Matsumoto et al. 2015).

Variation in 4-Fold Site Substitution Patterns across Regions with Different GC Content
Under strict neutrality, the substitution rate per site is equal to the mutation rate per site (Kimura 1983). Thus, if 4-fold degenerate sites have never been affected by selection on CUB and/or gBGC, the two substitution rates per site, r W !S and r S!W , should be uniform across the GC bins, unless there are systematic differences in mutation rates across bins. However, as can be seen from figure 1, in both species, on both the autosomes and the X chromosome, r W !S is positively correlated with GC content (D. simulans, autosomes: Kendall's = 0.45, P = 0.006; D. melanogaster, autosomes: = 0.53, P = 0.001). Here and in what follows, we refrain from conducting formal correlation tests of the X-linked data due to the dearth of data points; in addition, data from the SI bins are not included in correlations. In contrast, r S!W shows a clearly negative relationship with GC content (Kendall's = À0.95, P < 0.001 and = À0.96, P < 0.001 for D. simulans and D. melanogaster autosomes, respectively). These patterns are expected if GC alleles (i.e., preferred codons) were favored over AT alleles (i.e., unpreferred codons) for a substantial amount of time along these two lineages, and the intensity of the GC-favoring force increases with GC content (see the Discussion for an explicit model). Also of note is the marked increase in r S!W relative to r W !S with GC content in the D. melanogaster lineage, which is suggestive of mutations becoming more AT-biased. However, the arguments set out in the Discussion suggest that a change in mutational bias alone is unlikely to explain the data reported here.
As stated before, the N W!S =N S!W ratio at SI sites, particularly in D. simulans, is close to unity, the value expected under equilibrium base composition. An investigation across the 4-fold site GC content bins suggests that all of the bins considered here are experiencing some level of AT fixation bias N W !S =N S!W < 1 ð Þ , and that genomic regions with higher GC contents are evolving towards AT faster than regions with lower GC contents. This is clear from the negative correlations between GC content and the level of substitution bias N W!S =N S!W ð Þ calculated per 4-fold site bin in both species (Kendall's = À0.96, P < 0.001 and = À0.91, P < 0.001 for D. simulans and D. melanogaster autosomes, respectively) ( fig. 2). As explained in the Discussion, this negative correlation can readily be explained by a genome-wide reduction in the intensity of the GC-favoring force.

DAF at 4-Fold Sites Provide Clear Evidence of Ongoing Selection for Preferred Codons
If selection/gBGC favors GC alleles over AT alleles, then the frequencies of derived GC alleles at AT/GC polymorphic sites (DAF W !S ) should on average be higher than the frequencies of derived AT alleles at AT/GC polymorphic sites (DAF S!W ). Furthermore, DAF W !S should increase as the GC-favoring force becomes stronger (i.e., as 4-fold site GC content increases), whereas DAF S!W should decrease with increasing GC content. In addition, we expect DAF neu , the DAF for putatively neutral changes (i.e., segregating sites that had mutated from A to T, or vice versa, and from G to C or vice versa), to lie in a position intermediate between DAF S!W and DAF W!S (i.e., DAF W !S > DAF neu > DAF S!W ). In contrast, in a neutral model with a recent increase in mutational bias towards AT, the higher number of derived AT mutations entering the population, which tend to be young and segregate at low frequencies, will depress DAF S!W , leading to The ancestral state at the melanogaster-simulans node was determined using three methods: parsimony, the SBR under the GTR-NH b model implemented in PAML, and the average weighted by posterior probability (AWP) under the GTR-NH b model implemented in PAML.
DAF W !S > DAF S!W , but DAF neu should be comparable to DAF W !S . Moreover, GC content and DAF W !S should be unrelated under this model. D. simulans fits the expectations of the first model: DAF W !S is greater than DAF S!W in all autosomal and Xlinked 4-fold bins, and DAF neu is always intermediate between DAF W !S and DAF S!W (fig. 3). Autosomal 4-fold site DAF W!S correlates positively with GC content (Kendall's = 0.6, P < 0.001; fig. 3), and autosomal 4-fold site DAF S!W correlates negatively with GC content (Kendall's = À0.85, P < 0.001; fig. 3); data from the X display similar trends. These patterns suggest the action of forces favoring GC over AT alleles in the recent past in this species (a time period of the order of 4N e generations), with higher GC content bins experiencing a higher strength of recent selection favoring GC.
In D. melanogaster, the equivalent results are less clear. Autosomal DAF W!S is higher than autosomal DAF S!W for 19/20 4-fold bins ( fig. 3). As in D. simulans, autosomal 4-fold DAF W!S correlates positively with GC content (Kendall's = 0.41, P = 0.01; fig. 3), and autosomal 4-fold DAF S!W correlates negatively with GC content (Kendall's = À0.47, P = 0.004; fig. 3). DAF neu falls between DAF W!S and DAF S!W in 14/20 autosomal 4-fold site bins, but only 1/4 X-linked 4-fold bins ( fig. 3). Additionally, the difference between DAF W !S and DAF S!W seems less pronounced than in D. simulans, especially on the X chromosome, although on the autosomes the gap between DAF W !S and DAF S!W does tend to increase with GC content and is the largest and most comparable in magnitude to those seen in D. simulans in the bins with the highest GC content. Overall, these data provide some evidence of recent selection for GC at 4-fold sites in D. melanogaster, but its extent seems to be smaller than in D. simulans, and may be restricted to autosomal regions with high GC contents.

Estimating and Other Parameters Using 4-Fold Site Polymorphism Data
To shed further light on the evolutionary dynamics of selection on CUB, we used two different methods for inferring the scaled strength of selection for GC alleles ( ¼ 4N e s) from polymorphism data. First, we applied the method of Glé min et al. (2015), which detects recent selection (timescale~4N e generations). We refer to the different variants of this method using the same notation as Glé min et al. (2015). These are model M0, where ¼ 0 and polarization errors (with respect to inferring ancestral vs. derived alleles) are not taken into account; M1, where 6 ¼ 0 and polarization errors are not taken into account; and M0* and M1*, which are the equivalent models after correcting for polarization errors. Second, we used the method of Zeng and Charlesworth (2009), modified as described by Evans et al. (2014), which provides estimates over a longer period. We used two variants of this method, which are referred to as ZC0 ( ¼ 0) and ZC1 ( 6 ¼ 0). For every D. simulans bin on both the A and X, both ZC1 and M1 fit the data significantly better than the corresponding models with ¼ 0 (i.e., ZC0 and M0; min 2 = 17.84, df = 1, P < 0.001); the only exception is the X-linked SI bin where M1 does not fit the data better than M0 ( 2 = 0.071, df = 1, P = 0.79) ( fig. 4) signed-rank test, P = 0.25). The agreement between the results from the two methods, which are expected to be sensitive to forces favoring GC on different timescales (see Material and Methods), suggests consistent selection over time favoring GC alleles at 4-fold degenerate sites in D. simulans. In addition, GC content correlates positively with on both the autosomes (Kendall's = 0.98, P < 0.001; = 0.88, P < 0.001 for ZC1 and M1, respectively) and the X chromosome. Thus, in agreement with the results obtained from the divergence-and DAF-based analyses, selection for GC is indeed stronger in regions with higher GC content. The patterns obtained from comparing M0* and M1* are qualitatively identical (supplementary fig. S2, Supplementary Material online). In addition, when using the Akaike Information Criterion (AIC) to rank the four Glé min models (this is necessary because, e.g., M0* and M1 are not nested and cannot be compared using the likelihood ratio test), M1 and M1* are always the two best fitting models for all bins across both chromosome sets, except for the SI bin on the X (supplementary table S3, Supplementary Material online).
Similarly to the analysis based on DAFs, the patterns are less clear-cut in D. melanogaster. When M1 and M0 are compared, 13/20 autosomal 4-fold site bins are found to be non-neutrally evolving, including the four highest autosomal GC bins, and none on the X (fig. 4) online). In particular, the fact that none of the high GC bins have a significant test is out of keeping with the observation that these bins have large differences between DAF W !S and DAF S!W . A close inspection suggests that statistical power may be an issue: there are on average four times fewer SNPs in the 4-fold site bins in D. melanogaster, and in the highest 4-fold site bin, there were only 69 W !S SNPs. As described in the Materials and Methods, the Glé min models are parameter rich, especially M0* and M1*. In fact, M1* often came out (e.g., in 10/20 autosomal 4-fold site bins) as the worse fitting one among the four models according to the AIC.
To deal with this issue, we redid the comparison by reducing the number of autosomal 4-fold bins to 10. M1 fits better than M0 in 9/10 bins, while M1* fits better than M0* in 4/10 bins, including two out of the top four GC bins (supplementary fig. S3, Supplementary Material online). According to the AIC, the frequency of M1 being the best fitting model increases to 9/10 bins, whereas the frequency of M1* being the worse fitting model decreases to 2/10 bins (supplementary table S3, Supplementary Material online). The observation that M1* sometimes ranked lower than M1 according to the AIC in both species may also be due to the fact that our method for correcting for nonequilibrium when reconstructing ancestral states has reduced the need to correct for polarization errors. As is apparent from figure 4, M1 also estimates consistently lower absolute values of than ZC1 in D. melanogaster (Wilcoxon paired signed-rank test, P = 1.9 Â 10 À6 ). Given that the ZC method returns long-term average estimates of , these differences clearly indicate a recent decline in the strength of selection on CUB in this species. As with D. simulans, however, autosomal GC content correlates positively with under both models (Kendall's = 0.87, P < 0.001; = 0.48, P = 0.003 for ZC and M1, respectively; fig. 4), which is suggestive of some, if weak, ongoing selection for GC at autosomal 4-fold sites, particularly in GC-rich regions of the genome. The fact that the SFS is more negatively skewed at 4-fold sites in regions of higher GC content in both species, as measured by Á (supplementary fig. S4, Supplementary Material online), is also consistent with selection on these sites.
In addition to , the two methods also produced estimates of other parameters of interest. For instance, both methods can estimate , the mutational bias parameter, defined as u/v where u is the mutation rate from S to W per site per generation, and v is that in the opposite direction. As shown in supplementary fig. S12, Supplementary Material online, in D. simulans, is close to 2 across the 4-fold site bins, similar to previous estimates obtained by different methods (Singh et al. 2005;Keightley et al. 2009;Zeng 2010;Schrider et al. 2013). The fact that is estimated to be similar across the bins suggests that the difference in 4-fold sites' GC content can be attributed to stronger selection, not to differences in mutational bias. In D. melanogaster, the difference in the estimates between the two methods is much more pronounced, with from the Glé min method (short timescale) being consistently higher than those estimated by the ZC method (long timescale), probably reflecting a recent increase in the mutation rate towards A/T nucleotides (see Discussion).
Consistent with the apparently negative Tajima's D values calculated using 4-fold sites in D. simulans (table 1), the ZC method detected clear evidence for recent population expansion in all bins (P < 10 À16 for all bins; supplementary table S4, Supplementary Material online), whereas for D. melanogaster, no clear evidence for recent population expansion was found, which is consistent with the observed data (e.g., Tajima's D is only À0.11 for A in D. melanogaster, but is À1.03 in D. simulans) and our previous analysis based on a different data set (Zeng and Charlesworth 2009). In supplementary text S2, Supplementary Material online (see also supplementary tables S5 and S6, Supplementary Material online), we present a more detailed description of estimation of the demographic parameters in D. melanogaster, and the statistical and computational issues we encountered. We also provide evidence that our conclusion of a continuing decline in selection intensity in D. melanogaster is robust to these potential issues (supplementary fig. S13, Supplementary Material online).

A More Detailed Analysis of the SI
The SI data shown in figures 3 and 4 suggest that GC may be favored over AT in SI. Given the apparent lack of selective constraints on SI sites (Halligan and Keightley 2006;Parsch et al. 2010), this is suggestive of the action of gBGC. In contrast to selection on CUB at 4-fold sites, all alleles have equal fitness under the gBGC model, and the selection-like pattern is created by the preferential transmission of the S allele in SW heterozygotes to the next generation (Duret and Galtier 2009). The S!S and W !W mutations are "neutral" in the sense that they should be unaffected by gBGC. To gain further insights, we carried out additional analyses by binning the SI data according to their GC content, and asked whether gBGC could be responsible for the observed patterns. Constrained by the limited amount of data and the parameter-richness of some of the models, we only carried out these analyses using the autosomal SI data, divided into five bins. These data were then examined in the same way as the 4-fold sites. However, with such a small number of bins, the correlation-based analysis is likely to be prone to statistical noise; the results should thus be treated with caution.
As shown in figure 5A and E, r S!W decreases as GC content increases in both species (Kendall's = À1, P = 0.03), which may reflect an ancestral reduction in the strength of the force favoring G/C nucleotides (see Discussion). However, r W !S is not significantly correlated with GC content in either species (Kendall's = À0.8, P = 0.09, in D. simulans; Kendall's = 0.8, P = 0.09, in D. melanogaster). Comparing N W !S and N S!W across bins using a 2 Â 5 contingency table test suggests that the substitution pattern is heterogeneous across the bins in both species (P < 2.2 Â 10 À16 in D. simulans and P = 2.04 Â 10 À8 in D. melanogaster). The N W !S /N S!W ratio decreases with increasing GC content in D. simulans (Kendall's = À1, P = 0.03; fig. 5B), qualitatively similar to what we reported above for the 4-fold sites in this species ( fig. 2). However, this ratio shows no significant correlation with GC content in D. melanogaster (Kendall's = 0.8, P = 0.09; fig.  5F). These results highlight the difficulty in conducting detailed analyses in the SI regions, due to insufficient data. Nevertheless, they provide evidence for variation between different SI regions.
We did not detect any statistically significant correlation between the three types of DAFs and GC content in D. simulans ( fig. 5C, minimum P = 0.22 for the three tests), although the relationship DAF S!W < DAF neu < DAF W!S holds in all bins. The lack of strong support for a relationship with GC content was also reflected when the Kruskal-Wallis test was used to test for heterogeneity in median DAFs across bins; the p-values for S!W , neutral, and W !S are 0.38, 0.20 and 0.04, respectively. In D. melanogaster ( fig. 5G), DAF S!W is significantly negatively correlated with GC content (Kendall's = À1, P = 0.03), but no relationship was found for the other two DAFs (minimum P = 0.22). In the three bins with higher GC content, we have DAF S!W < DAF neu < DAF W !S . But the order is completely reversed in the lowest GC content bin, although the differences between the DAFs are nonsignificant based on the Glé min model (see below). Consistent with this, the Kruskal-Wallis test detected significant heterogeneity in median DAF across bins in the DAF S!W case (P = 1.40 Â 10 À8 ), but not in the other two cases (P > 0.08).
Finally, we used polymorphism data to estimate the strength of the force favoring GC, as measured by . In line with the DAF-based analysis, in neither D. simulans (Kendall's = 0, P = 1; fig. 5D) nor D. melanogaster (Kendall's = 0.8, P = 0.09; fig. 5H) did we find a significant relationship between GC content and as estimated by the M1 model of Glé min et al. (2015). In D. simulans, M1 fits the data significantly better than M0 in all five bins, whereas in D. melanogaster, the neutral model M0 is sufficient to explain the data for the first two bins, with the M1 model being more adequate for data collected from the more GC-rich bins. Estimates of produced by the ZC1 method are positively correlated with GC content in both species (Kendall's = 1, P = 0.03; fig. 5D and H). Interestingly, ZC1 fits the data significantly better than ZC0 in all cases, even in bins where is fairly close to zero. A close inspection suggests that this is not due to poor convergence in the search algorithm. Furthermore, simulations have shown that the ZC model is very robust to linkage between sites and demographic changes (Zeng and Charlesworth 2010b), suggesting that these results are unlikely to be methodological artefacts, and may reflect long-term dynamics in these regions. Finally, in D. melanogaster, there is no clear evidence that the estimates of long-term derived from ZC1 are higher than estimates of short-term derived from M1 ( fig. 5H).

Evidence for Past Selection on CUB in Both Drosophila Species
The correlations between the substitution rates and GC content at 4-fold sites presented in figure 1 and supplementary fig. S5, Supplementary Material online can be explored using the following modelling framework (Li 1987;Bulmer 1991;McVean and Charlesworth 1999), which assumes a fixed N e and thus a fixed value of for each GC bin. If there are temporal changes along a lineage, we can regard these parameters as long-term averages. Let u be the mutation rate from S!W per site per generation; and v be that in the opposite direction. Define as u=v. The two substitution rates, r S!W and r W !S , are proportional to u= exp ð Þ À 1 ½ and v= 1 À exp À ð Þ ½ , respectively (e.g., Eq. B6.4.2b of Charlesworth and Charlesworth (2010); Eq. 11 of Sawyer and Hartl (1992); Akashi et al. 2007). We can then define Assuming that u and v are constant across the GC bins and over time ( is thus also constant), R is a function of . Taking the derivative with respect to , we have In other words, R ¼ when ¼ 0 (neutrality), and decreases as becomes positive (i.e., when W is selected against). Thus, the decreasing values of R shown in figure  1 and supplementary fig. S5, Supplementary Material online suggest that S is more strongly favoured in high GC bins. For instance, the R values for the lowest and highest autosomal 4-fold site bins in D. simulans are 1.51 and 0.56, respectively. If the SI sites are neutral (see below), can be estimated by the R value from the SI bin, which is 1.93, very close to the value of 2 reported previously (Singh et al. 2005;Keightley et al. 2009;Zeng 2010;Schrider et al. 2013), solving eq. (2) for gives values of 0.25 and 1.24 for the lowest and highest bins, respectively. These rough, long-term estimates are about 2-fold lower than those obtained from the polymorphism data ( fig. 4). It is possible that D. simulans has a larger recent N e (reflected in the polymorphism-based analysis) than the average N e along the entire lineage, which is consistent with the evidence for population expansion from the negative Tajima's D values (table 1). Finally, as detailed in the supplementary text S1, Supplementary Material online, this model can also explain why the slope for r S!W is apparently steeper than that for r W!S ( fig. 1). The above model can also explain why, at 4-fold sites, R N ¼ N W !S =N S!W < 1 and there is a negative relationship between R N and GC content ( fig. 2), where N W !S and N S!W are the numbers of substitutions between the S and W alleles along the lineage of interest. Note first that N S!W and N W!S are, respectively, proportional to Qu= exp ð Þ À 1 ½ and 1 À Q ð Þv= 1 À exp À ð Þ ½ , where Q is the GC content at the ms node (since Q changes very slowly, this should be a reasonable first approximation). At equilibrium, Q ¼ 1= 1 þ exp À ð Þ ½ (Li 1987;Bulmer 1991) and hence N W !S =N S!W ¼ 1. Consider a model where the ancestral species was at equilibrium, but is reduced to p 0 p < 1 ð Þalong a lineage that leads to an extant species, so that N S!W and N W !S become proportional to Qup= exp p ð Þ À 1 ½ and 1 À Q ð Þvp= 1 À exp Àp ð Þ ½ , respectively. Then, R N for the GC content bin in question can be written as Assuming that p is constant across bins (i.e., there has been a genome-wide proportional reduction in ), then R N decreases as increases. This, together with the arguments presented above that the long-term average is higher in high GC bins, eq. (4) implies that the negative relationship between R N and GC content is consistent with a genome-wide reduction in the intensity of selection in both species (see also Akashi et al. 2007).
In contrast, if we assume that ¼ 0 and is constant across the bins (i.e., there has been no selection along both the D. melanogaster and D. simulans lineages), the fact that R ¼ means that a genome-wide increase in (i.e., a more AT-biased mutation pattern) would not cause a negative relationship between R and GC content. If the relationship between R and GC content were entirely mutational in origin, then u must decrease as GC content increases, whereas v changes in the opposite direction ( fig. 1). Such a model is incompatible with the evidence for selection from the two polymorphism-based methods ( fig. 4), and cannot easily explain the well-known positive correlation between GC content of CDSs (or the extent of CUB) and gene expression levels (e.g., Campos et al. 2013), especially when considering the lack of support for transcription-coupled mutational repair in Drosophila (Singh et al. 2005;Keightley et al. 2009).
As shown in supplementary fig. S12, Supplementary Material online, the Glé min method (short timescale) and the ZC method (long timescale) returned estimates that are more comparable in D. simulans than in D. melanogaster; the ZC method produced consistently lower estimates in D. simulans and consistently higher estimates in D. melanogaster (two-sided binomial test, P = 1.91 Â 10 À6 in both cases). Taken at face value, these results suggest that there probably has been relatively little change in the extent of mutational bias in the D. simulans lineage, whereas mutation may have become more AT-biased in D. melanogaster. These results suggest that the patterns shown in figure 2 are probably a result of an ancestral reduction in the efficiency of selection in D. simulans. For D. melanogaster, it is possible that a more ATbiased mutational pattern has also contributed to the evolution of base composition in its genome, as suggested by previous studies (Takano-Shimizu 2001;Kern and Begun 2005;Nielsen et al. 2007;Zeng and Charlesworth 2010a;Clemente and Vogl 2012b).
Overall, the above considerations suggest that the data presented in figures 1 and 2 and supplementary fig. S5, Supplementary Material online cannot be explained by a shift towards a more AT-biased mutational pattern alone. Instead, selection favoring GC over AT basepairs must have acted on both species for a significant amount of time since they last shared a common ancestor, although both lineages are likely to have experienced an ancestral reduction in the efficacy of selection that led to the AT-biased substitution patterns.

Estimating the Intensity of Selection on Preferred Codons over Different Timescales
A novelty of this study is that, by applying two different methods to the same polymorphism data set, we have attempted to understand how the selective pressure on CUB has changed over time by comparing estimates reflective of either a short timescale (for roughly the last 4N e generations; i.e., the Glé min method (Glé min et al. 2015)), or a long timescale (for > 4N e generations; i.e., the ZC method; Zeng and Charlesworth 2009). However, pinpointing the exact timescale for the ZC method is difficult, because it depends on details of past evolutionary dynamics that we know little about (e.g., the timescale can be affected by both the time when the ancestral population size reduction took place and the severity of the reduction; see supplementary figs. S8-S11 in Zeng and Charlesworth 2009). This difference in timescale between the methods is due to the use of the derived SFS under the infinite-sites model (Kimura 1983) in the Glé min method and the use of a reversible mutation model in the ZC method (see Zeng 2012 for a more thorough discussion of the differences between these two models). By the same token, we can classify other polymorphism-based methods into short timescale (Akashi and Schaeffer 1997; Bustamante et al. 2001) and long timescale (Maside et al. 2004;Cutter and Charlesworth 2006;Galtier et al. 2006;Zeng 2010;Clemente and Vogl 2012a;Vogl and Bergman 2015).
Contrasting the results obtained from the ZC method with those from the divergence-based analysis (figs. 1 and 2) and the Glé min method ( fig. 4) is informative. First, consider D. simulans. The fact that values of estimated by both the ZC method and the Glé min method are virtually identical suggests that there have not been significant changes in the intensity of selection over the time period that the ZC method considers. Hence, the reduction in suggested in the previous section, which may have caused N W !S =N S!W < 1 and the negative correlation between N W !S =N S!W and GC content, probably happened so early during the evolution of D. simulans that it did not leave detectable traces in the polymorphism data.
In contrast, in D. melanogaster, both the divergence-based analysis and the comparison between the ZC method and the Glé min method provide evidence for a reduction in , indicating a recent decline in this species. Assuming that SI are neutral, and using autosomal data from the putatively ancestral populations (i.e., MD and RG), table 1 in this study  and table 1 in Jackson et al. (2015) suggest that N e is 2.21-fold higher in D. simulans compared with D. melanogaster, implying more efficient selection in the recent past. In fact, focusing on the 13 autosomal 4-fold site bins in D. melanogaster where M1 fits the data better than M0 (filled squares in fig. 4), the estimates in the corresponding bins in D. simulans are on average 2.93 times higher, comparable to the difference in N e suggested by the SI data. This difference in N e may be due to differences in the two species' demographic history. Previous studies have also suggested that the lower recombination rate in D. melanogaster compared with D. simulans (Comeron et al. 2012;True et al. 1996) may have played a role through stronger Hill-Robertson interference between selected sties (Takano-Shimizu 1999;McVean and Charlesworth 2000;Comeron et al. 2008Comeron et al. , 2012Cutter and Payseur 2013). However, without detailed genetic maps from closely-related outgroup species, it is impossible to ascertain whether the reduced map length in D. melanogaster represents the ancestral or derived state; this is an important area for further research.
Comparison with Previous Studies Poh et al. (2012) suggested that AT-ending codons might be favored in D. melanogaster, based on the observation that, along the D. melanogaster lineage, S!W mutations fixed at a higher rate than W !S changes; also, in their polymorphism data set, the proportion of singletons in the SFS for S!W changes was smaller than in the SFS for W !S changes (23.2% vs. 24.3%). The latter difference is significant under a Mann-Whitney U test, although neither Tajima's D nor Fu and Li's D* were significantly different from zero. Here we have provided evidence that the pattern of r S!W > r W!S can be readily explained by a reduction in selection intensity favoring S basepairs along the D. melanogaster lineage. As for their polymorphism data, Poh et al. (2012) used lines collected from Raleigh, North America. There is clear evidence that this population has experienced bottlenecks in the recent past, as can be seen from the lower level of diversity in this population compared with populations from Africa (genome-wide S = 0.013 vs. 0.019 for the Raleigh and Malawi populations; Langley et al. 2012). Without using model-based methods to correct for the effects of demographic changes, the results of Poh et al. (2012) may be susceptible to complications caused by such complex demography. In addition, their ancestral states were inferred using maximum parsimony, which is prone to error. In supplementary fig. S14, Supplementary Material online, we used parameter values realistic for D. melanogaster to show that, with demography and polarization error, it is possible for the proportion of singletons in the SFS for S!W changes to be lower than that for W !S changes in the presence of weak selection favoring S (see the legend to supplementary fig. S14, Supplementary Material online for further discussion of this issue).
Another possible cause of the Poh et al. (2012) results is admixture with African D. melanogaster during the recovery from the bottleneck (Caracristi and Schlö tterer 2003;Duchen et al. 2013;Bergland et al. 2016). Because the average synonymous site GC content is >60% (Campos et al. 2013) and mutation is AT-biased (supplementary fig. S12, Supplementary Material online), S!W SNPs should be more common overall among the introduced variants than W !S SNPs. Rapid population growth following the bottleneck would make the introduced S!W variants contribute more multiple copies of the derived W alleles than W !S variants, which could create the relative deficit of W !S singletons. Because this effect is expected to be stronger in regions with higher GC content, it could also explain Poh et al.'s (2012) observation that the relative deficit of S!W singletons is more apparent in highly-expressed genes.
A detailed analysis of these demographic factors is beyond the scope of this article, as it would require knowledge of many poorly-known parameters (for example, the time and the extent of the admixture; see Duchen et al. 2013). Overall, notwithstanding the possibility that AT-ending codons may be favored in some genes (DuMont et al. 2004;Nielsen et al. 2007), our data from a nonbottlenecked population that is close to the putative ancestor of D. melanogaster suggest that the genome-wide pattern is compatible with a model in which selection on CUB is reduced in the D. melanogaster lineage and ongoing selection is confined to the most GC-rich parts of the genome.
In addition, Lawrie et al. (2013) suggested that a subset of 4-fold sites may be under strong selective constraints in D. melanogaster. These authors based their conclusions on two main observations that were made from analysing a North American population generated by the Drosophila Genetic Reference Panel (DGRP): a lack of difference in the shape of the SFSs between 4-fold and SI sites and a~22% reduction in diversity level at 4-fold sites relative to SI sites (after correcting for differences in GC content; see their fig. 1). The authors suggested that their findings might represent "a largely orthogonal force to canonical CUB" (p. 12 of Lawrie et al. (2013)). Indeed, by using a sample with 130 alleles, they were able to detect signals of much stronger purifying selection (with estimated to be À283) than is permitted by our sample sizes (21 MD lines from D. simulans and 17 RG lines from D. melanogaster). Additionally, their estimates of the intensity of strong selection appear to be uniform across genes with high and low levels of CUB, in contrast to the pattern we report here.
Obtaining more information about these two seemingly independent forces acting on 4-fold sites (weak selection on CUB and strong purifying selection) is an important area for future investigation. Several factors are of note. As discussed above, admixture is likely to complicate the analysis of the North American population of D. melanogaster. Although Lawrie et al. (2013) used the same method as that of Glé min et al. (2015) to control for demography, this method is nonetheless an approximation and may still lead to biased estimates of under certain conditions, as demonstrated by simulations ). Using nonadmixed populations and explicit demographic models (as in this study) may be preferable. Second, with a larger sample size (as in Lawrie et al. (2013)), it should be possible to jointly model the effects of both weak selection on CUB, which requires distinguishing W !S, S!W , and putatively neutral mutations (i.e., S!S and W !W ) (which were ignored by Lawrie et al. 2013), and strong purifying selection, which primarily leads to an excess of very low frequency variants. By doing so, we should be able to explicitly test the relative importance of these two forces, and gain further insights into the evolution of 4-fold sites in the Drosophila genome.

Complex Evolutionary Patterns in Short Introns
Short introns have been widely used as a neutral reference in Drosophila evolutionary genetic studies (Halligan and Keightley 2009;Parsch et al. 2010), and are thought to be closer to base composition equilibrium than other genomic regions (Kern and Begun 2005;Haddrill and Charlesworth 2008;Singh et al. 2009;Robinson et al. 2014), a pattern we have also observed ( fig. 2). When analyzed as a whole, the data point to the existence of a GC-favoring force in both species (figs. 3 and 4). Given the apparent lack of selective constraints in SI regions, it seems probable that gBGC may have played a significant role in their evolution. Although our detailed analyses were complicated by insufficient data, multiple aspects of the data presented in figure 5 are nonetheless inconsistent with the standard gBGC model, which would predict that the strength of the GC-favoring effect should increase with GC content (Duret and Galtier 2009).
For D. simulans, the substitution patterns across SI bins shown in figure 5 are qualitatively similar to those shown in figures 1 and 2 for the 4-fold sites. This seems to imply that the GC-favoring force acting on short introns may also have experienced a reduction in strength. However, in contrast to the 4-fold sites for which a genome-wide excess of S!W substitutions was observed ( fig. 2), we obtained contrasting patterns in low-GC and high-GC SI bins ( fig. 5B), with the former having a significant bias towards W !S substitutions ( 2 test, P = 2.30 Â 10 À16 ), and the latter a significant bias towards S!W substitutions ( 2 test, P = 2.95 Â 10 À24 ). These contrasting patterns could potentially be explained by an increase in the strength of the GC-favoring force in the low-GC short introns, but a decrease in the high-GC ones. The difference between the values estimated by the Glé min method and the ZC method gives some tantalising indications that this might have happened ( fig. 5D). However, we are unaware of any direct evidence supporting this possibility, and it is also hard to reconcile with what we observed at the 4fold sites, which were extracted from the same set of genes. Furthermore, the Glé min model provides little evidence that S basepairs are more favored in high GC content regions, although this might have been the case in the past according to the ZC model.
In D. melanogaster ( fig. 5F), a bias towards fixing W basepairs was observed in the first four SI bins ( 2 test, maximum P = 5.85 Â 10 À8 ), but not the last bin ( 2 test, P = 0.40). Again this is inconsistent with the genome-wide fixation bias towards W at the 4-fold sites ( fig. 2). Estimates of from the two polymorphism-based methods are closer to each other compared with D. simulans, and both methods seem to suggest that S basepairs are more favored in GC-rich regions ( fig.  5H), but the small number of bins makes it difficult to draw definitive conclusions from correlation-based analyses.
To investigate this further, we calculated the polymorphism-to-divergence ratio for W !S changes, S!W changes, and changes that are supposedly unaffected by gBGC (i.e., W !W and S!S changes), denoted by rpd W !S , rpd S!W , and rpd neu , respectively. If high GC content is driven by gBGC, we expect rpd neu /rpd W!S > 1 (i.e., fixation bias towards S) and rpd neu /rpd S!W < 1 (i.e., fixation bias against W) in high GC bins, but these two ratios should be close to one in low GC bins where gBGC should be weak. In D. melanogaster, the first prediction was met (rpd neu / rpd W !S = 1.60, P = 0.001 and rpd neu /rpd S!W = 0.69, P = 7.2 Â 10 À3 , in the most GC-rich bin). However, we found evidence for the existence of an AT-favoring force in the bin with the lowest GC content (rpd neu /rpd W !S = 0.66, P = 7.60 Â 10 À5 , and rpd neu /rpd S!W = 2.01, P = 3.50 Â 10 À12 ), which is in agreement with estimates produced by the ZC method ( fig. 5H), but inconsistent with the gBGC model. In a similar analysis of the SI bins in D. simulans, none of the polymorphism-to-divergence ratios were found to be significantly different from 1, except in the bin with the lowest GC content where rpd neu /rpd W !S = 1.25 (P = 0.0079). These findings are again inconsistent with the gBGC model.
Overall, the data from both species suggest that there is heterogeneity in evolutionary patterns between short introns residing in different parts of the genome, and that there might be some GC-favoring forces acting on short introns. However, there are substantial uncertainties as to how much of the GCfavoring effect is caused by gBGC. This conclusion is consistent with several previous studies that found little or no evidence for gBGC in D. melanogaster (Clemente and Vogl 2012b;Comeron et al. 2012;Campos et al. 2013;Robinson et al. 2014). Furthermore, in contrast to the 4-fold sites, where a reduction in is clear when estimates from the Glé min model and the ZC model are compared, no clear evidence of such a difference can be seen in the SI data. Regardless, this GC-favoring force acting on short introns is unlikely to be the sole explanation of the results obtained from 4-fold sites, because the estimates obtained from the latter are consistently higher than those from the former ( fig. 4 vs. fig.  5). Given the importance of these putatively neutral sites in short introns, more work is necessary to understand the unique features reported above.

Supplementary Material
Supplementary data are available at Genome Biology and Evolution online.