Abstract

We conducted a genome-wide analysis of the roles of mutation and selection in sculpting intron size in the fungal pathogen Cryptococcus neoformans. We find that deletion rate is positively associated with intron length and that insertion rate exhibits a weak negative association with intron length. These patterns suggest that long introns as well as extremely short introns in this unusually intron-rich fungal genome are in mutation–selection disequilibrium and that the proportion of constrained functional sequence in introns does not scale linearly with size. We find that untranslated region introns are longer than coding-region introns and that first introns are substantially longer than subsequent introns, suggesting heterogeneous distribution of constrained functional sequence and/or selective pressures on intron size within genes. In contrast to Drosophila, we find a positive correlation between dN and first intron or last intron length and a negative correlation between dN and internal intron length. This contrasting pattern may indicate that terminal introns and internal introns are differentially subject to hypothesized selection pressures modulating intron size and provides further evidence of widespread selective constraints on noncoding sequences.

Introduction

Introns are ubiquitous features of eukaryotic genomes. Despite their broad distribution, the forces affecting their evolution are only just now becoming clear. Although introns are subject to less evolutionary constraint than coding sequence, they do not evolve neutrally at the sequence level (Waterston et al. 2002; Hare and Palumbi 2003). Introns have been found to be subject to selective constraint at the level of primary sequence for many reasons, including maintenance of splicing efficiency or alternative splicing (Sorek and Ast 2003; Voelker and Berglund 2007) and conservation of transcription factor binding sites (Majewski and Ott 2002).

Though intron size varies widely within and across genomes of different strains, there is evidence to suggest that intron size is not a neutral feature. First, intron size distributions are not symmetrically distributed. Most eukaryotes exhibit a highly skewed intron size distribution, with many introns at what is presumably the minimum length necessary for efficient splicing and a very long tail of introns exhibiting greater length (Mount et al. 1992; Lim and Burge 2001; Yu et al. 2002). In humans, nematodes, and Drosophila, introns in highly expressed genes are shorter than introns in lowly expressed genes (Castillo-Davis et al. 2002; Urrutia and Hurst 2003; Marais et al. 2005), suggesting that selection may be acting to minimize the energetic costs associated with transcription of nontranslated sequence and/or increase the rate of transcription. In addition, there is an inverse relationship between recombination rate and intron size in Drosophila melanogaster (Carvalho and Clark 1999; Comeron and Kreitman 2000) and in Caenorhabditis elegans (Prachumwat et al. 2004). This suggests that regions of high recombination may be preventing the fixation of insertions that augment intron length beyond the minimum size required for splicing and/or that weak selection may preferentially be fixing insertions in low-recombination regions to reduce Hill–Robertson effects and increase the genetic distance between selected loci in flanking exons (Hill and Robertson 1966). Finally, Presgraves (2006) has found recent evidence in D. melanogaster that small insertions exhibit an elevated probability of fixation relative to other insertions or deletions (indels) and that the indel fixation probabilities may vary according to chromosome.

Eukaryotic genomes generally exhibit a mutation pressure biased toward deletions over mutations (Petrov and Hartl 1997; Petrov 2002), suggesting that most introns in recombining genomic regions should drift to a minimum size required for efficient splicing unless they contain regulatory sites or other functional sequences. Ptak and Petrov (2002) have observed that introns in D. melanogaster exhibit a lower rate of deletion relative to neutrally evolving, “dead-on-arrival” retroelement sequences, suggesting that introns do indeed harbor functional sequences subject to selective constraint.

In the manuscript, we investigate whether introns that are longer than modal size are at selection–mutation equilibrium (presumably because they contain a greater proportion of constrained functional sequences) or alternatively if long introns represent temporary departures from selection–mutation equilibrium and are destined to be trimmed of superfluous nucleotides by deletion-biased mutation pressure. We addressed this question by analyzing the distribution of recent insertion and deletion mutations in 4 recently sequenced genomes from the Cryptococcus neoformans species group. Cryptococcus neoformans is a pathogenic, basidiomycetous yeast exhibiting high intron density relative to other fungi, with an average of 5.3 introns/gene (vs. 1–2 introns/gene for most sequenced ascomycetes; Loftus et al. 2005). In addition to high intron density, C. neoformans offers several other key traits for this analysis, including 4 independent evolutionary lineages of sufficiently close relatedness to permit alignment of noncoding sequences and inference of indels (average dS = 0.37; Neafsey and Galagan 2007) and an intron size distribution that clusters extremely tightly around the modal size of 52 bp (fig. 1).

FIG. 1.—

Intron size distribution in Cryptococcus neoformans, Drosophila melanogaster, and Homo sapiens. Distributions are based on 29,146 C. neoformans introns, 61,031 D. melanogaster introns, and 239,661 H. sapiens introns. Numbers on horizontal axis indicate midpoint of each 10 bp interval. Homo sapiens and D. melanogaster intron data were drawn, respectively, from the March 2006 and April 2006 assemblies available at the UCSC genome browser database (Karolchik et al. 2003).

FIG. 1.—

Intron size distribution in Cryptococcus neoformans, Drosophila melanogaster, and Homo sapiens. Distributions are based on 29,146 C. neoformans introns, 61,031 D. melanogaster introns, and 239,661 H. sapiens introns. Numbers on horizontal axis indicate midpoint of each 10 bp interval. Homo sapiens and D. melanogaster intron data were drawn, respectively, from the March 2006 and April 2006 assemblies available at the UCSC genome browser database (Karolchik et al. 2003).

We find that C. neoformans introns that are shorter than the modal size exhibit an insertion bias and introns longer than the modal size exhibit a deletion bias. Assuming a similar indel mutation profile across all introns, this finding implies that selection is differentially enhancing the fixation probability of insertions in short introns and deletions in long introns and that extremely long or short introns are in mutation–selection disequilibrium. This suggests that long introns do not necessarily contain greater amounts of functional information per capita relative to introns closer to the modal size. We further report a reduced rate of deletion in introns flanked by coding exons relative to introns flanked by noncoding exons in the untranslated region (UTR) or intergenic sequence, perhaps indicating a higher density of constrained functional sequence in coding-region introns relative to other noncoding sequence classes. We find a difference in the evolutionary profile of first-ranked and last-ranked coding-region introns and internal coding-region introns, suggesting different selective forces sculpt intron length variation within genes according to intron position. These findings further demonstrate the pervasive impact of selection on noncoding sequence and better elucidate the relative roles of neutral mutation and selection in defining fundamental genomic features such as intron length distributions.

Materials and Methods

Genome assemblies of C. neoformans strains JEC21, H99, WM276, and R265 were acquired and aligned as previously described (Neafsey and Galagan 2007). We will refer to these genomes as strains in the manuscript, though pairwise synonymous divergence among the strains is generally concordant with species-level divergence. For the purposes of our analysis, therefore, we considered differences observed among strains to be fixed differences rather than polymorphisms. Introns and intergenic regions in the alignment were identified according to the annotation produced by the Institute for Genome Research for strain JEC21, which was based on a library of ∼23,000 full-length cDNA-paired sequencing reads (Loftus et al. 2005). All analyses of mutations were conducted using custom Perl scripts. Only introns exhibiting canonical splice sites (GT/AG) in all 4 strains were analyzed, resulting in the elimination of 3,511 coding-region introns and 349 UTR introns. In total, we analyzed 24,702 coding-region introns, 669 UTR introns, and 867 intergenic regions. Indels where sequence was present in 1 strain and absent in 3 were assumed to be insertions, and indels where sequence was present in 3 strains and absent in 1 were assumed to be deletions. Our assumption that the minor allelic state is the derived state may be subject to error when parallel insertions or deletions occur or when the minor allelic state occurs in the strain closest to the root and is in fact ancestral. We expect parallel indel mutations of the same size and position to be quite rare, however, and phylogenetic analyses with more distantly related species of fungi indicate that these Cryptococcus strains are most likely rooted on the internal branch of their clade rather than any of the terminal branches (results not shown). Indels that were present in 2 strains and absent in 2 strains were not analyzed due to inability to infer their ancestral state. Only those indels occurring in alignable regions were retained for analysis, so as to not bias the size spectrum or ratio of observed deletions and insertions. We required >70% alignment similarity and no gaps in the flanking ±5 bp to consider a region alignable. This filter removed approximately 17% (2,256) of candidate deletions and 15% (1,131) of candidate insertions. Three large insertions exhibiting sequence similarity to transposons were not included in rate analyses because they were extreme outliers with respect to insertion size.

Rates of insertion and deletion were scaled by the number of terminal-branch base pair mutations in the alignments. Counts of terminal-branch mutations were performed using the Baseml program in the PAML v3.14 software package (Yang 1997). An HKY85 substitution model was used, and the following topology was assumed: (JEC21, H99, (WM276, R265)). Rates of insertion were calculated by dividing the total number of inserted base pairs by the number of terminal-branch substitutions. Rates of deletion were calculated in a similar way but with a correction factor to account for the reduced probability or opportunity for fixing large deletions in small introns relative to large introns (Blumenstiel et al. 2002). This “window size” correction for deletion rate (D) was implemented as follows: 

graphic
where ni,j is the number of deletions of size j in fragment i, αi,j is the number of sites available for deletions of size j in fragment i (length − j + 1), MD is the maximum observed deletion size, and ti is the number of terminal substitutions per base pair in fragment i as estimated in PAML.

An additional filter was used to exclude aligned introns and intergenic regions containing misannotated coding sequence. The greater degeneracy of the third position of codons relative to the first 2 positions yields a 3mer-based periodicity to the pattern of divergence between orthologous coding sequences that is not observed between orthologous noncoding sequences (Kellis et al. 2004). Using this principle, those intron and intergenic alignments exhibiting 4 or more consecutive substitutions with 3mer-based periodicity were discarded. This filter disqualified from further analysis 359 out of 28,256 intron alignments and 627 out of 1,629 intergenic alignments. A total of 1,681,403 bp of aligned intronic sequence and 284,267 bp of aligned intergenic sequence were analyzed. The 95% confidence intervals (CIs) for all indel rates were estimated using a nonparametric bootstrap approach, in which introns or intergenic regions were resampled with replacement 1,000 times to generate simulated rate distributions. The significance of observed differences between different indel rates and other pairwise comparisons were evaluated using a bootstrapping approach. The KruskalWallis test was used to test for significance of differences in rates among 3 or more sequence classes or strains. Correlation between pairs of parameters was measured using Spearman's rank correlation coefficient, with significance evaluated by 1,000 random permutations of ranks.

Evolutionary rate analyses of coding sequences were performed using the codeml program in the PAML v3.14 software package, with parameters corresponding to the M0 model of evolution (Yang 1997).

Results and Discussion

Intron Size Disequilibrium

We report a strong relationship between indel rates and coding-region intron size in C. neoformans (fig. 2). When introns are binned into quintiles by length, insertion rate is significantly greater than deletion rate only for the shortest set of introns (10–49 bp; P < 10−4). Insertion rate and deletion rate are not significantly different for introns in the second length quintile, which contains the mode (mode = 52 bp; bin = 50–52 bp; P = 0.110). For all quintiles containing longer introns, deletion rate significantly exceeds insertion rate (P < 10−4). Deletion rate becomes progressively greater with intron size and is significantly different between all consecutive quintile bins (P < 10−4). Insertion rate does not significantly differ between any consecutive quintile bins, but the bin containing the shortest introns (10–49 bp) exhibits an insertion rate significantly greater than introns in the 2 bins, the bins containing the longest introns (bins = 57–63 bp, 64–971 bp; P < 10−4). Assuming that the neutral profile of insertions and deletions is similar across all introns, these patterns implicate the role of natural selection in differentially affecting the fixation probability of insertions and deletions according to intron size. The much greater contrast in deletion rate among intron size categories relative to insertion rate indicates that selective modulation of intron size is primarily achieved through regulation of the fixation of deletions. Intron size is not significantly different among the 4 strains (Kruskal–Wallis test, H = 3.2, P = 0.366), eliminating the possibility that these results could be explained by drift in intron size in one or more strains.

FIG. 2.—

Insertion and deletion rates as a function of intron size in coding-region introns. Error bars indicate 95% CIs. Modal intron size is 52 bp. Bin sizes approximate quintile divisions of introns according to size. From left to right, bins contain 5,468, 4,736, 5,218, 4,409, and 4,872 introns. All deletion rates are significantly different from each other (P < 10−4). The insertion rates of the leftmost and rightmost bins are significantly different (P = 0.001).

FIG. 2.—

Insertion and deletion rates as a function of intron size in coding-region introns. Error bars indicate 95% CIs. Modal intron size is 52 bp. Bin sizes approximate quintile divisions of introns according to size. From left to right, bins contain 5,468, 4,736, 5,218, 4,409, and 4,872 introns. All deletion rates are significantly different from each other (P < 10−4). The insertion rates of the leftmost and rightmost bins are significantly different (P = 0.001).

The increased rate of deletion in longer introns could be due to a greater opportunity for fixing large deletions in such introns. Deletions that straddle the intron/exon boundaries or otherwise disrupt proper splicing would presumably be selectively deleterious, and large deletions might be more likely to disrupt splicing in short introns. However, we consider this explanation for the deletion rate variation to be unlikely, given that the size spectrum of deletions is heavily biased toward very short deletions (95% ≤ 5 bp), and our method for calculating deletion rate takes into account variation in mutational opportunity as a function of intron size. The size spectra of insertions and deletions observed in coding-region introns, UTR introns, and intergenic regions are presented in figure 3.

FIG. 3.—

Indel size spectra in various noncoding sequence classes. (A) coding-region introns, (B) UTR introns, and (C) intergenic sequence. ins, insertion; del, deletion.

FIG. 3.—

Indel size spectra in various noncoding sequence classes. (A) coding-region introns, (B) UTR introns, and (C) intergenic sequence. ins, insertion; del, deletion.

Counts of insertions, deletions, and singletons (mutations present in only one strain) for each noncoding sequence class are presented in table 1. In total, we analyzed 24,703 coding-region introns, 669 UTR introns, and 867 intergenic regions. Overall rates of insertion and deletion in the different sequence classes are presented in figure 4. We observe significant variation in both insertion rate (Kruskal–Wallis test, H = 115.3, P < 10−3) and deletion rate (Kruskal–Wallis test, H = 109.7, P < 10−3) among the noncoding sequence classes. Insertion rates are similar in UTR introns (0.039 bp/substitution [95% CI: 0.030–0.048]) and intergenic sequence (0.039 bp/substitution [95% CI: 0.0235–0.044]) but significantly lower (bootstrapping; P < 10−4) in coding-region introns (0.032 bp/substitution [95% CI: 0.031–0.034]). Deletion rates are also similar in intergenic sequence (0.075 bp/substitution; 95% CI: 0.068–0.083) and UTR introns (0.084 bp/substitution; 95% CI: 0.073–0.096) but significantly lower (bootstrapping; P < 10−4) in coding introns (0.055 bp/substitution; 95% CI: 0.053–0.057). These observations may indicate a higher density of constrained functional sequence in coding introns relative to the other noncoding sequence classes.

Table 1

Substitutions and Indels Observed in Noncoding Cryptococcus Sequence

Sequence Class Number of Regions Subs Subs/bp Number of Del bp Del/Sub (95% CI) Number of Ins bp Ins/Sub (95% CI) Del Biasa 
Coding-region introns 24,703 422,786 0.27 11,129 0.055 (0.053–0.057) 6,400 0.032 (0.031–0.034) 0.023 
UTR introns 669 16,563 0.25 548 0.084 (0.073–0.096) 244 0.039 (0.030–0.048) 0.045 
Intergenic 867 58,675 0.21 1,399 0.075 (0.068–0.083) 737 0.039 (0.035–0.044) 0.036 
Sequence Class Number of Regions Subs Subs/bp Number of Del bp Del/Sub (95% CI) Number of Ins bp Ins/Sub (95% CI) Del Biasa 
Coding-region introns 24,703 422,786 0.27 11,129 0.055 (0.053–0.057) 6,400 0.032 (0.031–0.034) 0.023 
UTR introns 669 16,563 0.25 548 0.084 (0.073–0.096) 244 0.039 (0.030–0.048) 0.045 
Intergenic 867 58,675 0.21 1,399 0.075 (0.068–0.083) 737 0.039 (0.035–0.044) 0.036 

NOTE.—Subs, terminal-branch substitutions; Del, deletions; Ins, insertions.

a

(bp Del/Sub) − (bp Ins/Sub).

FIG. 4.—

Overall insertion and deletion rates in various noncoding sequence classes. Error bars indicate 95% CIs. Insertion rate and deletion rate are significantly reduced in coding-region introns relative to UTR introns and intergenic sequence (bootstrapping; P < 10−4).

FIG. 4.—

Overall insertion and deletion rates in various noncoding sequence classes. Error bars indicate 95% CIs. Insertion rate and deletion rate are significantly reduced in coding-region introns relative to UTR introns and intergenic sequence (bootstrapping; P < 10−4).

Our findings suggest that introns in Cryptococcus that significantly deviate in size from the mode are largely in mutation–selection disequilibrium and that long introns therefore do not necessarily exhibit a greater density of functional sequence than short introns. On the contrary, there is a weak but significant positive correlation between coding-region intron size and nucleotide divergence (RS = 0.032, P < 10−4, n = 24,703), which further suggests that many longer introns may contain a lower density of constrained functional sequence than short introns. This contrasts with the pattern that has been observed in Drosophila, where first intron size is negatively correlated with divergence (RS = −0.29, P = 0.03) and subsequent intron size is not significantly correlated with divergence (RS = −0.14, P > 0.05; Haddrill et al. 2005; Marais et al. 2005).

Parsch (2003) proposed a compensatory model of intron size evolution for Drosophila, which predicts a cyclical pattern of intron size evolution rather than precise maintenance of introns at an optimal size. Under this dynamic equilibrium model, small, slightly deleterious deletions fix frequently in introns close to modal size as a result of deletion-biased mutation pressure and genetic drift and are offset at irregular intervals by rare large insertions, which are fixed by positive selection. The finely graduated deletion rate variation we see among Cryptococcus intron size classes, combined with the near absence of large deletions in our data set, suggests that even small indel events may be subject to selection in Cryptococcus. If a larger effective population size in Cryptococcus explains this highly modulated deletion rate, the question arises of how large, presumably deleterious insertions would have been allowed to fix and create introns significantly longer than the modal size. We observed one instance of an apparent transposon insertion in a UTR intron (size = 554 bp) and 2 transposon insertions in intergenic regions (size = 1,100 and 768 bp) but did not include them in calculations of insertion rate because of their status as rare, extreme outliers. As we observed only a few instances of large insertions in our genome-wide analysis, we speculate that such insertions may fix during sporadic bursts of transposon activity and/or population bottlenecks. Alternatively, if some long introns are favored to reduce Hill–Robertson interference in low-recombination regions, minor disequilibrium may be maintained not by the inadequacy of selection to keep pace with insertions but by shifts in local recombination rate.

As figure 1 illustrates, eukaryotes exhibit varying degrees of constraint on intron size, but mechanistic requirements of splicing likely constrain at least a subset of introns in most organisms. For example, despite the comparatively broad distribution of intron size in Drosophila relative to Cryptococcus, Fox-Walsh et al. (2005) have found that many introns in Drosophila exhibit efficient splicing only at sizes of less than 200–250 bp. Detection of mutation–selection disequilibrium in intron size is facilitated in Cryptococcus by the fact that this group of strains exhibits 2 traits not typically found in association with each other: high intron density and a tight intron size distribution. Many unicellular eukaryotes with lower intron density exhibit an intron size distribution similar to that of Cryptococcus, and their genomes likely contain a similar proportion of introns in mutation–selection disequilibrium.

Intron Size Variation within Genes

In addition to global, genome-wide forces affecting intron size, selection is known to affect intron size variation among introns within genes and may influence the genome-wide intron length distribution. As has been observed before in other organisms (Duret 2001; Marais et al. 2005; Hong et al. 2006), we find that UTR introns (average size = 108 bp, standard deviation [SD] = 102 bp) are significantly longer than coding-region introns (average size = 68 bp, SD = 39 bp) in C. neoformans (Mann–Whitney U test, U = 6,031,095, P < 10−7). This suggests that patterns of intron evolution and functionality in intron-rich fungal lineages may be comparable with those of other eukaryotic lineages that have retained much of their ancestral intron complement (Roy and Gilbert 2005). In Drosophila and vertebrates, first introns are significantly longer than subsequent introns within genes (Duret 2001; Marais et al. 2005). We observe the same pattern in Cryptococcus, where first coding-region introns (average size = 75.9 bp, SD = 61.3 bp) are significantly longer than subsequent coding introns (average size = 64.4 bp, SD = 42.0 bp; Mann–Whitney U test, U = 49,127,030, P < 10−7). Though intron length may vary with rank, the positive association between length and deletion rate we report in figure 2 holds true independently of rank. We find that longer introns (80th length percentile or above) exhibiting ranks 1, 2, 3, 4, and 5 all exhibit a higher deletion rate than shorter introns (20th length percentile or below) of the same rank (P < 10−4).

Marais et al. (2005) detect a significant negative correlation (RS = −0.20, P < 10−4) between dN and intron size for first introns in Drosophila but not for subsequent introns (table 2). Given that regulatory elements are more common in first introns than subsequent introns in mammals and Drosophila (Duret 2001; Majewski and Ott 2002; Keightley and Gaffney 2003; Chamary and Hurst 2004), Marais et al. (2005) attribute this negative correlation between first intron length and dN to a higher density of regulatory elements in the first introns of slowly evolving genes relative to quickly evolving genes.

Table 2

Summary of Correlates with Intron Length in Cryptococcus and Drosophila

Variable 1 Variable 2 Cryptococcus Correlation Drosophila Correlationa 
Coding intron length Intron divergence Positive Negative 
First coding intron length dN or dN/dS Positive Negative 
Subsequent intron length dN or dN/dS Negative None 
Last coding intron length dN or dN/dS Positive Unknown 
Variable 1 Variable 2 Cryptococcus Correlation Drosophila Correlationa 
Coding intron length Intron divergence Positive Negative 
First coding intron length dN or dN/dS Positive Negative 
Subsequent intron length dN or dN/dS Negative None 
Last coding intron length dN or dN/dS Positive Unknown 
a

Data from (Marais et al. 2005).

We observe a quite different relationship between intron size and dN in Cryptococcus (table 2). First intron length in Cryptococcus exhibits a weak but significant positive correlation with dN (Spearman's RS = 0.0753, P < 10−4), as well as dN/dS (Spearman's RS = 0.0379, P = 0.005), suggesting that slowly evolving genes in this lineage may be less regulated than their counterparts in Drosophila, or may utilize first introns for regulation to a lesser degree. Surprisingly, however, we observe a weak but significant negative correlation between cumulative subsequent intron length and dN (Spearman's RS = −0.024, P = 0.020) or dN/dS (Spearman's RS = −0.0357, P < 10−4) in Cryptococcus. The incongruence of these associations indicates that first introns and subsequent introns are subject to different selective pressures with regard to length. A summary of correlates of intron length in Cryptococcus and Drosophila is presented in table 2.

Testing Hypotheses of Selection on Intron Size

Interpretation of the nature of these selective pressures on intron length within genes is not straightforward. The evolutionary rate metrics dN and dN/dS are affected by many different factors (Plotkin and Fraser 2007). In addition, there is controversy as to whether most amino acid changes are adaptive and driven by positive selection (Sawyer et al. 2007) or slightly deleterious and fixed because of insufficiently strong purifying selection (Shapiro et al. 2007).

None of the existing hypotheses of selection on intron size can fully accommodate the observations in table 2. If most of the amino acid changes in Cryptococcus were fixed by positive selection, the negative correlation between subsequent intron length and dN could be explained by the cost of expression hypothesis (Carvalho and Clark 1999). Under this scenario, the hypothesis predicts that genes that are best able to fix adaptive amino acid substitutions (=high dN) are also best able to fix deletions or prevent fixation of insertions, thereby maximizing expression efficiency and/or rate. This hypothesis neither easily accommodates the positive correlation we observed between first intron length and dN nor very compatible with the observation that the predominant manifestation of selection in Cryptococcus introns is a reduction in the deletion rate, as opposed to a reduction in insertion rate or acceleration of deletion rate (figs. 2 and 4).

Alternatively, if variation in protein evolutionary rate is mainly driven by the fixation of slightly deleterious mutations, the selective interference hypothesis predicts that longer introns would actually be adaptive in low-recombination regions. Under this hypothesis, longer introns could more effectively decrease Hill–Robertson interference between selected sites in adjacent exons than short introns, reducing the rate of fixation of mildly deleterious replacement mutations and decrementing the dN and dN/dS metrics. This could potentially explain the negative correlation we observe between dN and cumulative subsequent intron length. The interference hypothesis may also explain the incongruence in the dN versus length relationship between first introns and subsequent introns, as first introns are less subject to interference selection as a result of their being flanked by at most a single 5′ exon, whereas most subsequent introns on average will be flanked by a greater amount of coding sequence and presumably more selected sites (Comeron and Kreitman 2000).

If this interference selection explanation applies, it would predict that the length of last introns would similarly be positively correlated with the rate of protein evolution, as last introns are flanked by at most a single 3′ exon. In fact, we find a weak but positive correlation between last intron length and dN (Spearman's RS = 0.051, P = 0.0008) as well as last intron length and dN/dS (Spearman's RS = 0.038, P = 0.0088), similar to the finding for first introns. This suggests that part of the observed disparity in evolutionary profile between first introns and subsequent introns (Marais et al. 2005) may not be driven by a distinct regulatory role played by first introns but instead by the differential capacity of first introns to reduce Hill–Robertson selective interference effects relative to internal introns.

Alternatively, it is possible that some unknown selection pressure, perhaps related to transcript folding or splicing efficiency, may regulate intron length within genes as a function of distance from transcript ends. Further analysis of intron size variation within genes across a wider sampling of eukaryotic genomes may help to elucidate any position-dependent selective pressures modulating intron size variation within genes.

Conclusions

We report the first genome-wide analysis of intronic insertion and deletion profiles in a fungus. In Cryptococcus, most introns are very close to the modal size, but natural selection is gradually lengthening short introns and gradually shortening long introns. This suggests that most introns that depart from the modal size are nonoptimal in terms of organismal fitness and may have resulted from chance fixation of a large insertion or deletion mutation sometime in the past. Heterogeneous selective pressures govern intron size variation within genes. Within genes, first introns and subsequent introns differ in length and exhibit opposing correlations with the evolutionary rate of the protein encoded by the parent gene, suggesting that selection to reduce the cost of gene expression, selection to reduce Hill–Robertson interference, or possibly some unknown selective pressure may differentially explain intron size variation according to the position of introns within genes.

S.S.H. was supported by a National Human Genome Research Institute grant to the Broad Institute for summer undergraduate research and the National Institutes of Health-funded RISE program for undergraduate research at Jackson State University. D.E.N. was supported at the Broad Institute by National Institute of Allergy and Infectious Diseases funding. We are grateful to Bruce Birren and Angela Brunache for helping to foster this collaboration. We thank Scott Roy, Justin Blumenstiel, and 2 anonymous reviewers for helpful comments and discussion about the manuscript.

References

Blumenstiel
JP
Hartl
DL
Lozovsky
ER
Patterns of insertion and deletion in contrasting chromatin domains
Mol Biol Evol
 , 
2002
, vol. 
19
 (pg. 
2211
-
2225
)
Carvalho
AB
Clark
AG
Intron size and natural selection
Nature
 , 
1999
, vol. 
401
 pg. 
344
 
Castillo-Davis
CI
Mekhedov
SL
Hartl
DL
Koonin
EV
Kondrashov
FA
Selection for short introns in highly expressed genes
Nat Genet
 , 
2002
, vol. 
31
 (pg. 
415
-
418
)
Chamary
JV
Hurst
LD
Similar rates but different modes of sequence evolution in introns and at exonic silent sites in rodents: evidence for selectively driven codon usage
Mol Biol Evol
 , 
2004
, vol. 
21
 (pg. 
1014
-
1023
)
Comeron
JM
Kreitman
M
The correlation between intron length and recombination in drosophila. Dynamic equilibrium between mutational and selective forces
Genetics
 , 
2000
, vol. 
156
 (pg. 
1175
-
1190
)
Duret
L
Why do genes have introns? Recombination might add a new piece to the puzzle
Trends Genet
 , 
2001
, vol. 
17
 (pg. 
172
-
175
)
Fox-Walsh
KL
Dou
Y
Lam
BJ
Hung
SP
Baldi
PF
Hertel
KJ
The architecture of pre-mRNAs affects mechanisms of splice-site pairing
Proc Natl Acad Sci U S A
 , 
2005
, vol. 
102
 (pg. 
16176
-
16181
)
Haddrill
PR
Charlesworth
B
Halligan
DL
Andolfatto
P
Patterns of intron sequence evolution in Drosophila are dependent upon length and GC content
Genome Biol
 , 
2005
, vol. 
6
 pg. 
R67
 
Hare
MP
Palumbi
SR
High intron sequence conservation across three mammalian orders suggests functional constraints
Mol Biol Evol
 , 
2003
, vol. 
20
 (pg. 
969
-
978
)
Hill
WG
Robertson
A
The effect of linkage on the limits of artificial selection
Genet Res
 , 
1966
, vol. 
8
 (pg. 
269
-
294
)
Hong
X
Scofield
DG
Lynch
M
Intron size, abundance, and distribution within untranslated regions of genes
Mol Biol Evol
 , 
2006
, vol. 
23
 (pg. 
2392
-
2404
)
Karolchik
D
Baertsch
R
Diekhans
M
, et al.  . 
(13 co-authors)
The UCSC genome browser database
Nucleic Acids Res
 , 
2003
, vol. 
31
 (pg. 
51
-
54
)
Keightley
PD
Gaffney
DJ
Functional constraints and frequency of deleterious mutations in noncoding DNA of rodents
Proc Natl Acad Sci U S A
 , 
2003
, vol. 
100
 (pg. 
13402
-
13406
)
Kellis
M
Patterson
N
Birren
B
Berger
B
Lander
ES
Methods in comparative genomics: genome correspondence, gene identification and regulatory motif discovery
J Comput Biol
 , 
2004
, vol. 
11
 (pg. 
319
-
355
)
Lim
LP
Burge
CB
A computational analysis of sequence features involved in recognition of short introns
Proc Natl Acad Sci U S A
 , 
2001
, vol. 
98
 (pg. 
11193
-
11198
)
Loftus
BJ
Fung
E
Roncaglia
P
, et al.  . 
(54 co-authors)
The genome of the basidiomycetous yeast and human pathogen Cryptococcus neoformans
Science
 , 
2005
, vol. 
307
 (pg. 
1321
-
1324
)
Majewski
J
Ott
J
Distribution and characterization of regulatory elements in the human genome
Genome Res
 , 
2002
, vol. 
12
 (pg. 
1827
-
1836
)
Marais
G
Nouvellet
P
Keightley
PD
Charlesworth
B
Intron size and exon evolution in Drosophila
Genetics
 , 
2005
, vol. 
170
 (pg. 
481
-
485
)
Mount
SM
Burks
C
Hertz
G
Stormo
GD
White
O
Fields
C
Splicing signals in Drosophila: intron size, information content, and consensus sequences
Nucleic Acids Res
 , 
1992
, vol. 
20
 (pg. 
4255
-
4262
)
Neafsey
DE
Galagan
JE
Dual modes of natural selection on upstream open reading frames
Mol Biol Evol
 , 
2007
, vol. 
24
 (pg. 
1744
-
1751
)
Parsch
J
Selective constraints on intron evolution in Drosophila
Genetics
 , 
2003
, vol. 
165
 (pg. 
1843
-
1851
)
Petrov
DA
Mutational equilibrium model of genome size evolution
Theor Popul Biol
 , 
2002
, vol. 
61
 (pg. 
531
-
544
)
Petrov
DA
Hartl
DL
Trash DNA is what gets thrown away: high rate of DNA loss in Drosophila
Gene
 , 
1997
, vol. 
205
 (pg. 
279
-
289
)
Plotkin
JB
Fraser
HB
Assessing the determinants of evolutionary rates in the presence of noise
Mol Biol Evol
 , 
2007
, vol. 
24
 (pg. 
1113
-
1121
)
Prachumwat
A
DeVincentis
L
Palopoli
MF
Intron size correlates positively with recombination rate in Caenorhabditis elegans
Genetics
 , 
2004
, vol. 
166
 (pg. 
1585
-
1590
)
Presgraves
DC
Intron length evolution in Drosophila
Mol Biol Evol
 , 
2006
, vol. 
23
 (pg. 
2203
-
2213
)
Ptak
SE
Petrov
DA
How intron splicing affects the deletion and insertion profile in Drosophila melanogaster
Genetics
 , 
2002
, vol. 
162
 (pg. 
1233
-
1244
)
Roy
SW
Gilbert
W
Complex early genes
Proc Natl Acad Sci U S A
 , 
2005
, vol. 
102
 (pg. 
1986
-
1991
)
Sawyer
SA
Parsch
J
Zhang
Z
Hartl
DL
Prevalence of positive selection among nearly neutral amino acid replacements in Drosophila
Proc Natl Acad Sci U S A
 , 
2007
, vol. 
104
 (pg. 
6504
-
6510
)
Shapiro
JA
Huang
W
Zhang
C
, et al.  . 
(12 co-authors)
Adaptive genic evolution in the Drosophila genomes
Proc Natl Acad Sci U S A
 , 
2007
, vol. 
104
 (pg. 
2271
-
2276
)
Sorek
R
Ast
G
Intronic sequences flanking alternatively spliced exons are conserved between human and mouse
Genome Res
 , 
2003
, vol. 
13
 (pg. 
1631
-
1637
)
Urrutia
AO
Hurst
LD
The signature of selection mediated by expression on human genes
Genome Res
 , 
2003
, vol. 
13
 (pg. 
2260
-
2264
)
Voelker
RB
Berglund
JA
A comprehensive computational characterization of conserved mammalian intronic sequences reveals conserved motifs associated with constitutive and alternative splicing
Genome Res
 , 
2007
, vol. 
17
 (pg. 
1023
-
1033
)
Waterston
R
Lindblad-Toh
K
Birney
E
, et al.  . 
(202 co-authors)
Initial sequencing and comparative analysis of the mouse genome
Nature
 , 
2002
, vol. 
420
 (pg. 
520
-
562
)
Yang
Z
PAML: a program package for phylogenetic analysis by maximum likelihood
Comput Appl Biosci
 , 
1997
, vol. 
13
 (pg. 
555
-
556
)
Yu
J
Yang
Z
Kibukawa
M
Paddock
M
Passey
DA
Wong
GK
Minimal introns are not “junk”
Genome Res
 , 
2002
, vol. 
12
 (pg. 
1185
-
1189
)

Author notes

1
Present address: Department of Biology, Jackson State University.
Aoife McLysaght, Associate Editor