Abstract

The residence of spliceosomal introns within protein-coding genes can fluctuate over time, with genes gaining, losing or conserving introns in a complex process that is not entirely understood. One approach for studying intron evolution is to compare introns with respect to position and type within closely related genes. Here, we describe new, freely available software called Common Introns Within Orthologous Genes (CIWOG), available at http://ciwog.gdcb.iastate.edu/, which detects common introns in protein-coding genes based on position and sequence conservation in the corresponding protein alignments. CIWOG provides dynamic web displays that facilitate detailed intron studies within orthologous genes. User-supplied options control how introns are clustered into sets of common introns. CIWOG also identifies special classes of introns, in particular those with GC- or U12-type donor sites, which enables analyses of these introns in relation to their counterparts in the other genes in orthologous groups. The software is demonstrated with application to a comprehensive study of eight plant transcriptomes. Three specific examples are discussed: intron class conversion from GT- to GC-donor-type introns in monocots, plant U12-type intron conservation and a global analysis of intron evolution across the eight plant species.

INTRODUCTION

Spliceosomal introns reside in protein-coding genes and represent significant proportions of transcribed genes in eukaryotes. The average number of introns per gene varies greatly among organisms. For example, Drosophila melanogaster shows an average of three introns per gene, whereas the number in Arabidopsis thaliana is estimated to be greater than six [1]. Among genes in a single organism, intron content also displays large variability, such as in Arabidopsis with a range of 0–78 introns per gene [2]. Because introns represent such a large portion of eukaryotic transcription and vary so widely among organisms and among genes, several evolutionary questions arise, including whether introns confer any selective advantage and how and when they came to be situated within protein-coding genes. Regarding their functional significance, some introns have been shown to create protein diversity through alternative splicing and to regulate gene expression by containing regulatory elements [3]. However, these functions do not cover all introns. Regarding evolutionary questions, many studies have proposed theories concerning the behavior and timing of intron evolution, and these topics remain debated [4–7]. A popular technique for studying introns is the comparison of homologous genes to detect and analyze common introns [8–11]. This technique and the visualization of its output are the focus of this work.

At the most basic level, common intron detection consists of deriving the introns shared between putatively homologous genes and inferring homology, or common ancestry, of the introns in these groups. Merely comparing introns by their number within genes typically will not reliably identify homologous introns, because genes can experience within-gene tandem duplication and other sequence modification events that may change intron order and number and disrupt identification of correct homologous intron relationships. Molecular sequence alignment is a well-established tool that allows inference of homologous gene components such as protein domains and binding sites. However, introns usually share very low sequence similarity relative to exons, and nucleotide alignments that include the introns are not feasible across large evolutionary distances. A more effective procedure is to construct alignments of the encoded protein sequences and then map the introns onto the protein alignment based on the intron positions within the underlying coding sequences [9]. This procedure can easily be followed individually for a small number of sequences [12] or can be automated by computer algorithms for large-scale applications [10, 11, 13]. These analyses process protein sequences pairwise or in multiple alignments, find introns at the exact same position in the alignment and employ some condition about local alignment quality, such as requiring minimum sequence identity. Alignment sequence identity conditions are necessary because alignments merely provide estimates of homologous characters, whose accuracy can vary due to unknown evolutionary sequence constraints. Another recent study proposes a modified alignment algorithm for creating intron-aware protein sequence alignments [14].

One application of common introns is the measurement of intron evolutionary dynamics between different organisms. Through comparing intron presence and absence between related genes, rates of intron gain, loss and conservation can be estimated [1, 9]. On a gene family scale, such analysis is useful for dating intron changes within specific gene families and helping to delineate orthologous and paralogous gene relationships. Also, on a genome-wide scale this type of analysis has shown some clades to have experienced varying degrees of putative intron loss and gain [1]. Because intron fluctuation is thought to be a very rare event, most of these studies have focused on broad taxonomic organism sampling to capture intron change events [1], including a few members of each kingdom limited to those with whole genome sequence availability at the time of the study. Concerning plants, some studies have compared Arabidopsis and rice [15, 16]; Arabidopsis, poplar and rice [17] and Chlamydomonas reinhardtii and Bigelowiella natans [16]. A conclusion from these studies is that introns have been predominantly lost in contemporary species. Although providing important information about plant intron evolution over large time spans, these studies have not utilized the large number of newly sequenced plant genomes. Plants provide an intriguing source of recent genome evolution due to their large genome duplications and may reveal new trends of intron evolution over shorter time spans.

Here, we present the Common Intron Detection Algorithm (CIDA) and the associated Common Introns Within Orthologous Genes (CIWOG) database. The CIDA is an algorithm for processing protein sequence multiple alignments to detect common introns. This algorithm is unique compared to previous algorithms, in particular by offering user-specified options to control common intron detection in regions of poor alignment quality and putative intron sliding, to control the calling of intron absence markers and to facilitate special intron detection. To analyze the CIDA results, CIWOG provides sequence-integrated dynamic web-based visualizations. CIWOG displays common introns overlaid on protein sequence, by graphical summary and by nucleotide sequence. These displays allow users to explore putative cases of intron change and genic evidence to decipher possible annotation inaccuracies. The software is freely available. To illustrate the use of CIDA, we briefly describe its application to orthologous gene clusters of eight plant species resulting in a plant-specific common intron database, ciwogPlants. Features of the software are highlighted through three case studies: conservation of U12-type introns in plants, GC donor site evolution and evolutionary tree-based inference of intron gain and loss.

SOFTWARE INSTALLATION AND USAGE

CIWOG is software that detects common introns in groups of genes and provides dynamic web displays for complete analysis of the common intron (cintron) output, including intron sequences (see Figure 1 for a flow chart of the algorithm). The CIWOG software consists of Perl scripts, JavaScript code and a MySQL database and is freely available. The CIDA component of the software accepts user-supplied options that control how introns are clustered into cintrons and is described in the subsequent section. Freely available external software packages required for CIDA are Perl and, for the web displays, MySQL, Apache web server, and the Perl libraries CGI, GD, and DBI. Installation instructions on acquiring the external components and configuring CIWOG are provided with the software. Here we describe a walk-through of the software usage process for users who wish to detect cintrons in their own sequences. For those interested in analyzing particular plant gene families via the web displays or cintron matrix data, we provide an application of CIWOG to eight plant species.

Figure 1:

Common Introns Within Orthologous Genes (CIWOG) analysis pipeline. This figure illustrates the process of populating a CIWOG database. The entire pipeline is executed with a single script. Web displays provide dynamic tools for analyzing particular gene groups and common introns. Data export utilities export cintron data in the form of intron presence/absence matrices and intron nucleotide sequences for further analyses.

To execute CIWOG, user-supplied gene groups are required. These groups of genes can be putatively orthologous genes, which are often defined as the best reciprocal sequence matches among species. Alternatively, these gene groups can be of other varieties such as large paralogous gene families, a particular gene family of interest or alternatively spliced gene transcripts from a single locus. A user could create orthologous gene groups de novo using external tools such as OrthoMCL [18, 19], could manually construct a single group by hand using BLAST [20] and public databases such as PlantGDB [21] or could use publicly available homologous gene databases such as HomoloGene [22]. After collecting the gene groups, two files are required to be input: a gene-structure information file and a CLUSTAL-formatted [23] protein alignment file.

The gene-structure information file has a simple format and contains the following data for each gene: gene identifier, organism identifier, GenBank-style gene-structure, genomic sequence, genome sequence start and stop, and protein translation start and stop. An example of GenBank-style gene structure for a two-exon gene is: ‘complement (join(exon_1_start..exon_1_stop,exon_2_start..exon_2_stop))’. The genome sequence is a chromosomal nucleotide sequence segment defined by the genome sequence start and stop and encompasses the gene's genomic boundaries. These data are separated by new lines and a pound symbol (see http://ciwog.gdcb.iastate.edu/3688 for an example). This file can easily be prepared by hand using databases such as PlantGDB [21] or in batch by ad hoc scripting. The alignment file can be generated by programs such as CLUSTALW [24] or MUSCLE [25]. The alignment sequence identifiers should be in the form of ‘organism|gene identifier’ with no spaces. Using this input and user-specified options, the CIDA derives common introns and provides SQL output that is loaded into a MySQL database. Further usage details and sample input data are provided with the software.
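As an illustration of the GenBank-style gene-structure string described above, the following Python sketch extracts exon coordinates and derives the intervening intron coordinates. It is not part of the CIWOG distribution (which is written in Perl); the function names `parse_gene_structure` and `intron_coordinates` are hypothetical, and the sketch assumes 1-based inclusive coordinates and simple `start..stop` exon ranges without partial-feature markers.

```python
import re

def parse_gene_structure(location):
    """Parse a GenBank-style location string into (strand, exons).

    Exons are (start, stop) tuples in genomic coordinates, sorted in
    ascending order regardless of strand (an assumption of this sketch).
    """
    text = location.replace(" ", "")
    strand = "-" if text.startswith("complement(") else "+"
    exons = [(int(a), int(b)) for a, b in re.findall(r"(\d+)\.\.(\d+)", text)]
    if not exons:
        raise ValueError(f"no exon coordinates found in {location!r}")
    return strand, sorted(exons)

def intron_coordinates(exons):
    """Introns are the gaps between consecutive exons (1-based, inclusive)."""
    return [(left_stop + 1, right_start - 1)
            for (_, left_stop), (right_start, _) in zip(exons, exons[1:])]

# Hypothetical two-exon gene on the reverse strand.
strand, exons = parse_gene_structure("complement (join(100..199,300..450))")
introns = intron_coordinates(exons)
```

For the hypothetical gene above, the single intron spans the genomic interval between the two exons; for reverse-strand genes the transcript order of exons and introns is the reverse of the genomic order.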

CIDA

The CIDA defines common introns (‘cintrons’) in protein-coding genes as introns occurring in conserved positions of at least two genes being compared. For further studies, CIDA also keeps track of single introns (‘sintrons’) currently observed in one gene only; however, these cases will not be discussed here. Conserved positions are assessed relative to the amino acid alignment of the corresponding gene products (see below). Input for the CIDA consists of a gene-structure information file and a corresponding multiple protein sequence alignment. CIDA begins by pre-processing the gene-structure information file to derive the following about each gene's introns: position, sequence, peptide position and phase. Intron peptide position is defined as the amino acid number of the complete codon immediately upstream of the intron, and phase is either 0, 1 or 2, depending on whether the intron inserts between codons or splits a codon after the first or second nucleotide in the triplet.
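The definitions of intron peptide position and phase given above can be sketched in Python as follows. This is an illustrative sketch rather than the CIDA implementation; the function name and its input (the coding lengths of the exons in transcript order, starting at the translation start) are assumptions made for the example.

```python
def intron_positions_and_phases(cds_exon_lengths):
    """Return a (peptide_position, phase) pair for each intron.

    cds_exon_lengths: number of coding nucleotides contributed by each
    exon, in transcript order (assumed input format for this sketch).
    peptide_position is the number of the last complete codon upstream
    of the intron; phase is the number of nucleotides of a split codon
    that lie upstream of the intron (0 means the intron falls between codons).
    """
    results = []
    coding_nt = 0
    # The last exon is not followed by an intron, so it is skipped.
    for length in cds_exon_lengths[:-1]:
        coding_nt += length
        results.append((coding_nt // 3, coding_nt % 3))
    return results

# Hypothetical three-exon gene with 30, 41 and 19 coding nucleotides.
positions = intron_positions_and_phases([30, 41, 19])
```

In this hypothetical gene, the first intron follows 30 coding nucleotides (ten complete codons, phase 0) and the second follows 71 coding nucleotides (23 complete codons plus two nucleotides of the next codon, phase 2).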

After the pre-processing steps, CIDA evaluates the protein alignment file to detect common introns. First, CIDA assigns alignment positions to introns by searching for the introns’ peptide positions within the alignment. Using single linkage clustering, introns are then grouped into cintrons based on identical alignment positions (‘strict’ criterion) and above-threshold local protein sequence similarity. The minimum local protein sequence similarity is controlled by the CIDA arguments simRegion and minSim: minSim is the required proportion of identical amino acids within the region extending from simRegion amino acids to the left of the alignment position to simRegion amino acids to its right (Figure 2). An optional second grouping step is invoked by options that describe relaxed local alignment quality. maxSlide specifies the number of amino acids to the left and right of a given intron position in the protein sequence alignment that can be considered evolutionarily the same position (after moderate intron sliding, e.g. [26] or alignment ambiguities). maxGap specifies a number of gap symbols that can be used to further extend the region specified by maxSlide, so that a small number of gaps introduced by one sequence are not included in simRegion. These two options give each intron an alignment position range (Figure 2). The second grouping step groups introns based on the intron alignment position ranges and fulfillment of two conditions: the minimum local similarity described above and a maximum of one intron per gene in a given cintron group.
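The ‘strict’ criterion can be sketched for a pair of aligned sequences as follows. This Python sketch is illustrative only: the function names are hypothetical, alignment columns are 0-based, gap columns are counted as non-identical, and treating a proportion exactly equal to minSim as satisfying the threshold is an assumption. How CIDA computes identity across more than two sequences is not covered here.

```python
def local_identity(seq_a, seq_b, position, sim_region):
    """Proportion of identical residues in the window extending
    sim_region columns to either side of a 0-based alignment column."""
    start = max(0, position - sim_region)
    stop = min(len(seq_a), position + sim_region + 1)
    identical = sum(
        1 for i in range(start, stop)
        if seq_a[i] == seq_b[i] and seq_a[i] != "-"
    )
    return identical / (stop - start)

def same_cintron_strict(seq_a, pos_a, seq_b, pos_b, sim_region=4, min_sim=0.5):
    """Strict grouping: identical alignment column and sufficient
    local identity (boundary handling is an assumption of this sketch)."""
    return pos_a == pos_b and local_identity(seq_a, seq_b, pos_a, sim_region) >= min_sim

# Hypothetical aligned protein segments with introns at column 4 ('Q').
row_a = "MKTAQLVRSGE"
row_b = "MKSAQLIRTGE"
identity = local_identity(row_a, row_b, 4, 4)
grouped = same_cintron_strict(row_a, 4, row_b, 4)
```

With simRegion = 4 the window covers nine alignment columns; in the hypothetical rows above, six of the nine residues are identical, so the two introns would be grouped under minSim = 0.5.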

Figure 2:

Hypothetical cintron examples. The above alignment segments represent hypothetical gene groups and cintrons derived by CIDA. The cintrons are indicated by numbers below the alignments. Gray shading represents introns. Open gray rectangles represent absent introns. For (A) and (B), CIDA options are: simRegion = 4, minSim = 0.5, minAbsentSim = 0.5, maxSlide = 0, maxGap = 0. In (A), cintron #1 is conserved because both genes have an intron at the same alignment position and minSim is satisfied (6/9 amino acids are identical in the region from four amino acids upstream to four amino acids downstream of the intron position, which is Q in the first gene). In contrast, the second introns of both genes are not clustered together because minSim is not satisfied (2/9 < 0.5) and two separate entries (#2 and #3) are kept at the same alignment position. In (B), cintron #1 consists of an observed intron from one gene and, because minAbsentSim is satisfied (9/9 identical amino acids), an ‘absent intron’ from the other gene. For cintron #2, the minAbsentSim criterion is not satisfied (0/9 identical amino acids), and thus the observed intron remains as a singlet entry in the CIWOG database. For (C), CIDA options are: simRegion = 4, minSim = 0.5, minAbsentSim = 0.5, maxSlide = 1, maxGap = 1. Here, cintron #1 qualifies as a cintron because the introns are located within maxSlide alignment positions and minSim is satisfied. (The intron alignment position range is the union of 6–8 from the first sequence and 5–7 from the second sequence, i.e. 5–8. There are 3/4 identical amino acids within the intron alignment position range, plus 4/4 identical amino acids to the left and 3/4 identical amino acids to the right for a total of 10/12 identical amino acids, greater than minSim.) The second introns of both genes are not clustered together because they are farther than maxSlide positions apart. For (D), the CIDA options are the same as in (C). Here, the intron alignment region for intron 1 is ARL in the first sequence and VT-A in the second because maxGap = 1. Because these regions overlap and the sequence similarity condition is met, this qualifies as a cintron. Cintrons #2 and #3 are not grouped into a single cintron because the first gene's intron region is VAP and the second gene's is SS-S, which do not overlap.

After cintron definition, the lack of a given cintron in a gene could reflect putative intron loss or gain, or it could be caused by poor-quality protein alignment. To distinguish these possibilities, genes lacking a given cintron are evaluated for local protein sequence similarity at the cintron alignment region. If the gene exceeds the proportion of identical amino acids specified by option minAbsentSim in the simRegion amino acids to the left and to the right, and also has no gaps in this region, then the gene is added to the cintron with a cintron class of ‘absent’. If the cintron either precedes or follows the gene in the alignment, the gene is added with a class of ‘external missing’. Otherwise the cintron is within the gene in the alignment, and the gene is added with a class of ‘internal missing’. The distinction between absent and missing introns is important for data integrity in gene-structure evolution studies. Whereas absent introns denote sites of observed intron loss or gain, missing introns either reflect other evolutionary events that caused divergence in the otherwise similar protein sequences in the alignment (e.g. truncation, loss or divergence of domains) or they reflect erroneous gene-structure annotations. The latter possibility highlights the potential use of CIWOG displays for manual genome annotation, but is not discussed further here.
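The three classes assigned to genes lacking a cintron can be sketched as a small decision procedure. The Python below is a simplified, hypothetical illustration rather than the CIDA code: it compares the gene with a single cintron member sequence, uses 0-based alignment columns, and treats a proportion equal to minAbsentSim as sufficient, all of which are assumptions.

```python
def classify_gene_without_intron(member_row, gene_row, position,
                                 sim_region=4, min_absent_sim=0.5):
    """Classify a gene that has no intron at a cintron's alignment column.

    member_row: aligned sequence of a gene that carries the intron.
    gene_row:   aligned sequence of the gene lacking the intron.
    """
    start = max(0, position - sim_region)
    stop = min(len(gene_row), position + sim_region + 1)
    window = gene_row[start:stop]
    # 'absent': well-aligned, gap-free region without an intron.
    if "-" not in window:
        identical = sum(a == b for a, b in zip(member_row[start:stop], window))
        if identical / len(window) >= min_absent_sim:
            return "absent"
    # Otherwise decide whether the cintron lies outside or inside the
    # aligned span of the gene (first to last non-gap column).
    residues = [i for i, c in enumerate(gene_row) if c != "-"]
    if not residues or position < residues[0] or position > residues[-1]:
        return "external missing"
    return "internal missing"

# Hypothetical aligned rows; the cintron sits at column 4.
member = "MKTAQLVRSGE"
classes = [
    classify_gene_without_intron(member, "MKTAQLVRSGE", 4),
    classify_gene_without_intron(member, "------VRSGE", 4),
    classify_gene_without_intron(member, "MKT---VRSGE", 4),
]
```

Only the first hypothetical gene supports an inference of intron loss or gain; the other two are flagged as missing, so they can be excluded from presence/absence analyses or reviewed for annotation problems.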

After all cintrons have been identified, special types of introns are classified. Introns beginning with a GC dinucleotide are distinct from the majority of introns, which begin with a GT dinucleotide, and potentially have different splicing propensities. Introns with this characteristic are classified as type ‘GC’. U12-type introns are a rare class of introns that have highly conserved donor (consensus [GA]TATCCTT where [GA] denotes G or A) and branch sequences (consensus CCTTAAC) and are recognized by a rare spliceosome [27]. A prior study identified Arabidopsis U12-type introns that are transcript-confirmed, an important requirement for identification because gene-structure annotations for non-canonical introns are particularly error-prone [28]. Using position weight matrices from this study, introns are scored via donor and branch site log-odds ratios (see [28] for further details). Introns having donor and branch site scores greater than pre-specified cutoffs are classified as ‘U12-type’. Introns with a GC donor dinucleotide but qualifying U12-type donor and branch scores are classified as U12-type introns. Other introns are given a ‘U2’ classification. This intron-type classification enables subsequent analysis of the conservation of these special introns. Users wishing to use additional classification rules, different weight matrices or adjust score cutoffs can easily do so using the intron sequences contained in the CIWOG database and updating the intron class field. Intron-type classification is a unique feature of CIWOG compared to other large-scale intron-evolution programs [10, 11, 13].
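The precedence of the classification rules above (U12-type first, then GC, then U2) can be sketched in Python. The position weight matrix, background frequency and cutoffs used here are toy placeholders chosen for illustration; they are not the matrices or cutoffs from [28], and the function names are hypothetical.

```python
import math

def log_odds_score(site, pwm, background=0.25):
    """Sum of per-position log2 odds of a site under a position weight
    matrix (list of {base: probability} dicts) versus a uniform background."""
    return sum(math.log2(column[base] / background)
               for base, column in zip(site, pwm))

def classify_intron(intron_seq, donor_score, branch_score,
                    donor_cutoff, branch_cutoff):
    """Assign an intron class; U12-type takes precedence over GC."""
    if donor_score > donor_cutoff and branch_score > branch_cutoff:
        return "U12-type"
    if intron_seq.upper().startswith("GC"):
        return "GC"
    return "U2"

# Toy two-column matrix (NOT taken from the cited study).
toy_pwm = [
    {"A": 0.5, "C": 0.125, "G": 0.25, "T": 0.125},
    {"A": 0.125, "C": 0.125, "G": 0.25, "T": 0.5},
]
toy_score = log_odds_score("AT", toy_pwm)

# Hypothetical scores and cutoffs, for illustration only.
labels = [
    classify_intron("GCAAGT", donor_score=1.0, branch_score=1.0,
                    donor_cutoff=5.0, branch_cutoff=5.0),
    classify_intron("GCATCC", donor_score=9.0, branch_score=8.0,
                    donor_cutoff=5.0, branch_cutoff=5.0),
    classify_intron("GTAAGT", donor_score=1.0, branch_score=1.0,
                    donor_cutoff=5.0, branch_cutoff=5.0),
]
```

Checking the U12-type condition before the donor dinucleotide reflects the rule that an intron with a GC donor but qualifying U12-type scores is classified as U12-type.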

COMMON INTRON VISUALIZATION

Manual review of cintrons within a collection of orthologous genes is helpful to resolve complex patterns of intron difference caused by evolutionary mechanisms. Also, some instances of intron difference between genes can be the result of inaccurate gene annotations, and manual curation is often necessary to resolve these cases [29]. To enable cintron review and detailed analysis, CIWOG includes web displays. The main web display provides a cintron graphic, cintrons overlaid onto protein sequence alignments and dynamic navigation features (Figure 3). Users can click on introns within the graphic to scroll to the corresponding portion of the multiple sequence alignment. When users mouse over introns in the cintron boxes or in the alignment, the introns are highlighted in yellow (Figure 3E). This feature allows users to quickly review cintrons, introns, conservation quality and alignment quality.

Figure 3:

CIWOG display (GC-intron example). This figure shows the main CIWOG web display (http://ciwog.gdcb.iastate.edu/ciwogPlants_loose-cgi/display.pl?cid=6514) for orthologous genes encoding putative flap endonuclease 1 proteins. ‘A’ marks the cintron graphic which has the following color assignments: black horizontal lines—aligned sequences, gray rectangles—gaps in the alignment, thin vertical bars of a single color—cintrons, thin vertical bars with a circle atop—GC introns. Numbers to the right of the black rectangles indicate a number of introns in the untranslated region of the gene. ‘B’ marks the intron-annotated protein alignment with the following color assignments: orange—GT:AG introns, magenta—GC:AG introns. Sequence identifiers to the left of the alignment are in the form ‘organism abbreviation ∼ gene name’. ‘D’ marks the cintron box that corresponds to the alignment position marked by ‘F’ in the alignment and ‘C’ in the graphic. ‘E’ marks a single intron member of the cintron selected by a user, which causes this row and corresponding intron at ‘F’ to be highlighted in yellow. In a complete screen, there are additional cintron boxes, but here only one box is shown for clarity.

A separate cintron detail page is accessible by clicking on the cintron number within the cintron box (Figure 3D). The cintron detail page presents intron genomic position coordinates and hyperlinks to external genome browsers to facilitate review of gene evidence, such as expressed sequence evidence supporting an intron or gene. To enable users to search CIWOG for particular gene families, several utilities are provided to query by gene name, gene description, sequence similarity via BLAST [20] and cluster identifiers. In addition to these specific searches, CIWOG provides bulk download of intron sequences annotated by cintrons and intron presence/absence matrices.

APPLICATION OF CIWOG: A GENOME-WIDE STUDY OF CINTRONS IN PLANTS

Here we demonstrate the application of CIWOG to construct a plant-specific database, ciwogPlants, to discover orthologous introns and study intron evolution across eight plant species: A. thaliana, Glycine max, Medicago truncatula, Oryza sativa, Physcomitrella patens, Populus trichocarpa, Sorghum bicolor and Vitis vinifera. First, we downloaded 330 860 gene annotations from the official genome annotation providers (http://ciwog.gdcb.iastate.edu/ciwogPlants_source.html). We then composed a set of 309 196 representative genes by selecting the annotation with the longest open reading frame from each locus. Orthologous gene clusters were identified by running OrthoMCL [18, 19] on all-versus-all BLAST [20] results based on protein sequences of the representative genes. Note that these gene clusters may contain several genes for a species when in-paralogs are too similar to determine one-to-one orthologs. For each cluster we prepared a CLUSTAL-formatted protein alignment file, using MUSCLE [25], and a gene-structure information file. Finally, we executed CIWOG with two sets of options: a ‘loose’ option allowing intron slide events and a ‘strict’ option not allowing intron slide events (loose: maxSlide = 1, maxGap = 1, simRegion = 10, minSim = 0.1, minAbsentSim = 0.3; strict: maxSlide = 0, maxGap = 0, simRegion = 10, minSim = 0.1, minAbsentSim = 0.3). After these steps, ciwogPlants comprises 29 967 orthologous gene clusters and 185 561 and 203 587 cintrons for the loose and strict options, respectively. Note that there are fewer cintrons for the loose option, because closely spaced introns that form separate cintrons under the strict option are clustered into a single cintron when their intron position ranges overlap. Both data sets are available for analysis and download. ciwogPlants enables us not only to zero in on a particular orthologous intron (see case studies) but also to gain general insight into plant intron evolution.
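The selection of one representative annotation per locus by longest open reading frame can be sketched as follows. The input format (tuples of locus, gene identifier and ORF length), the example identifiers and the tie-breaking rule (the first annotation encountered is kept) are assumptions made for this illustration; the scripts used to build ciwogPlants are not shown in the text.

```python
def pick_representatives(annotations):
    """Keep, for each locus, the annotation with the longest ORF.

    annotations: iterable of (locus, gene_id, orf_length) tuples
    (hypothetical input format). Ties keep the first annotation seen.
    """
    best = {}
    for locus, gene_id, orf_length in annotations:
        if locus not in best or orf_length > best[locus][1]:
            best[locus] = (gene_id, orf_length)
    return {locus: gene_id for locus, (gene_id, _) in best.items()}

# Hypothetical annotations for two loci.
example = [
    ("locusA", "locusA.1", 900),
    ("locusA", "locusA.2", 1200),
    ("locusB", "locusB.1", 450),
]
representatives = pick_representatives(example)
```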

A CASE STUDY OF U12-TYPE INTRON CONSERVATION

U12-type introns are a rare class of spliceosomal introns (<1%) that are spliced by a distinct spliceosome from the common class of introns, referred to as U2 type [27, 30–35]. A proposed unique functional role for U12-type introns is that they dampen gene expression by reducing mature mRNA production: the U12-type spliceosome is less abundant in the cell than the U2-type spliceosome, so U12-type introns are thought to be spliced at a lower rate than U2-type introns [36]. Other than this role, any selective advantage for possessing a redundant, low-abundance U12-type splicing system over more than one billion years of evolution is unknown. Comparison of orthologous introns [28, 32] has yielded important observations about U12-type intron evolutionary phenomena; for example, one recent study demonstrated that U12-type intron positions are conserved more often than U2-type introns between animals and plants [37].

To explore these introns in CIWOG, we queried ciwogPlants for U12-type introns and discuss one particular example of a conserved intron in genes encoding SEC22 vesicle-trafficking proteins (CIWOG cluster 4948, cintron #4; Figure 4A). Here, a U12-type intron is conserved across nine genes from seven plant species. We compared these genes to animal orthologs and found U2-type, rather than U12-type, introns at this cintron. Alignment of the plant genes with one representative animal gene, mouse NM_011342, shows identical intron position and intron phase, as well as a high degree of conservation in flanking protein sequences, suggesting that this cintron is orthologous rather than one of the annotation artifacts that are common in U12-type intron host genes (Figure 4A). Based on the observed complete U12-type conservation in plants, we infer that this intron was U12-type in the most recent common ancestor of these plants. Regarding the difference in intron type between animals and plants, a parsimonious explanation is that an intron class conversion event has occurred since the divergence of animals and plants. It is currently impossible to determine whether a common U12-type intron changed to a U2-type intron in animals or a common U2-type intron changed to a U12-type intron in plants, or whether the change was selectively driven or arose by chance; further investigation of the biology of the gene may shed some light on the evolution of its introns. A recent study comparing human and Arabidopsis homologous introns found 15 cases of a U12-type intron in humans corresponding to a U2-type intron in Arabidopsis and five cases of the opposite arrangement [37]. Figure 4A is a novel example of the putatively less prevalent arrangement, with a U2-type intron in animals corresponding to a U12-type intron in plants. For the purpose of this discussion, the example is merely meant to demonstrate that CIWOG facilitates investigation of U12-type introns on a global scale and finer analysis of particular introns on a case-by-case basis. A detailed analysis of U12-type introns will be presented elsewhere.

Figure 4:

Cintron examples illustrating conserved U12 introns and intron sliding detected by CIDA. A portion of cluster 4948, containing SEC22 vesicle trafficking proteins, is shown (A) which illustrates U12-type intron conservation among plants and a U2-type intron in animals, represented by the mouse gene NM_011342 (http://ciwog.gdcb.iastate.edu/dist-cgi/display.pl?cid=4948). The asterisk marks cintron #4 shown in the cintron box. A portion of cluster 9438, containing alcohol dehydrogenases, is shown (B), which illustrates possible intron sliding (http://ciwog.gdcb.iastate.edu/ciwogPlants_loose-cgi/display.pl?cid=9438). The asterisk marks cintron #3 shown in the cintron box. Manual review of the expressed sequence supporting this intron-sliding case confirmed the gene annotation accuracy, suggesting this is an authentic case of intron sliding (Arabidopsis AT1G22440.1 and O. sativa LOC_Os07g42924.1 confirmed). Sequence identifiers to the left of the alignment are in the form ‘organism abbreviation ∼ gene name’. Orange highlighting indicates introns. Red highlighting indicates U12-type introns.

A CASE STUDY OF SPLICE SITE CONVERSION

In animals and plants, the majority of introns have GT donor sites (GT introns), while introns with GC donor sites (GC introns) account for roughly 1% of the intron population [38, 39]. The small class of GC introns has been shown, both computationally and experimentally, to play a role in alternative splicing. Computational analyses demonstrated an enrichment of GC donor sites in alternatively spliced introns [40, 41]. Experiments in Caenorhabditis elegans confirmed the importance of the GC donor site in regulating alternative splicing during development [42]. Studies of GC intron evolution have further shown that donor site switching between GT and GC is not uncommon in mammals and chicken [43] and that GT to GC conversions outnumber GC to GT conversions in mammals [44]. Fluctuation between GT and GC donor sites may affect splicing efficiency or regulation and may provide clues to intron and splice site evolution. The design of CIWOG greatly facilitates studies of intron-type switching. For illustration, we retrieved from the ciwogPlants database a clear example of GT/GC donor site changes.

Cluster 6514 of ciwogPlants (Figure 3) has nine genes from eight plant species, including two monocots, five dicots and moss. Cintron #5 (Figure 3F) has GC donor sites in the genes from monocots but GT donor sites in the genes from dicots and moss. By parsimony, with moss as the outgroup, the most recent common ancestor of monocots and dicots should have had a GT donor site at this intron position. This reconstruction of the ancestral splice site allows us to infer a GT to GC donor site conversion in monocots and conservation of the GT donor site in dicots at this cintron position. Our preliminary analysis of GT/GC donor site conversions on all branches of the eight-species tree (Figure 5) showed that the GT to GC conversion is more abundant than the opposite conversion, as was observed in mammals [44]. A comprehensive discussion of intron-type distribution and intron evolution in plants will be presented elsewhere.
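
The parsimony argument for this cintron can be sketched in a few lines of Python. The following is a minimal illustration using Fitch parsimony on a simplified, hypothetical five-species tree (two dicots, two monocots and moss as outgroup); the tree shape, the function names and the leaf assignments are assumptions made for this example and are not part of CIWOG.

```python
def fitch_sets(node, leaf_states, sets):
    """Bottom-up pass: collect the candidate donor types for every node."""
    if isinstance(node, str):                  # a leaf is a species abbreviation
        sets[node] = {leaf_states[node]}
    else:
        left, right = node
        fitch_sets(left, leaf_states, sets)
        fitch_sets(right, leaf_states, sets)
        shared = sets[left] & sets[right]
        sets[node] = shared if shared else sets[left] | sets[right]
    return sets


def assign_states(node, sets, parent_state, states):
    """Top-down pass: keep the parent's state whenever it is a candidate."""
    if parent_state in sets[node]:
        states[node] = parent_state
    else:
        states[node] = sorted(sets[node])[0]   # arbitrary but deterministic
    if not isinstance(node, str):
        for child in node:
            assign_states(child, sets, states[node], states)
    return states


def conversions(node, states, found=None):
    """List (ancestor type, descendant type) for every branch with a change."""
    found = [] if found is None else found
    if not isinstance(node, str):
        for child in node:
            if states[child] != states[node]:
                found.append((states[node], states[child]))
            conversions(child, states, found)
    return found


# Hypothetical, simplified tree: ((dicots, monocots), moss)
dicots = ("AT", "VV")
monocots = ("OS", "SB")
angiosperms = (dicots, monocots)
tree = (angiosperms, "PP")
donor = {"AT": "GT", "VV": "GT", "OS": "GC", "SB": "GC", "PP": "GT"}

sets = fitch_sets(tree, donor, {})
states = assign_states(tree, sets, None, {})
events = conversions(tree, states)
print(states[angiosperms], events)
```

In this toy case the monocot/dicot ancestor is assigned GT because the moss outgroup resolves the tie, and the only inferred change is a GT to GC conversion on the branch leading to the monocots.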

Figure 5:

Plant species tree. The evolutionary relationship between the eight plant species is derived from two published reviews [47, 48]. Branch lengths do not represent time. Species abbreviations are in parentheses. Internal nodes are labeled from 1 to 7.

INTRON EVOLUTION IN PLANTS

Previous studies showed that introns are predominantly lost rather than gained in paralogous genes in rice [15] and in orthologous genes in Arabidopsis, rice and green algae [16]. However, an analysis in Arabidopsis, poplar and rice indicated that genes of chloroplast origin slowly acquire introns [17]. Depending on the number and types of organisms and genes analyzed by CIWOG, users can study the resulting cintrons to address diverse questions about intron evolution. Here we apply the aforementioned technique of ancestral intron reconstruction to all cintrons in ciwogPlants to study intron gain and loss in monocots and dicots. We assume that cintrons are the result of common descent and focus on cintrons that involve at least two species and have one or more of the intron types ‘U2’, ‘U12’, ‘GC’ and ‘absent’. We collected 88 368 such cintrons in the ‘strict’ and 86 807 in the ‘loose’ data set. Given all species in each cintron, a tree was derived according to the topology of the eight-species ciwogPlants tree (Figure 5). As an example application, we used Dollo parsimony [45] to assign the presence or absence of an intron to each internal node of the tree. At a cintron position, Dollo parsimony assumes that an intron is gained at most once and minimizes the number of losses. Alternatively, users can choose maximum likelihood approaches [46], e.g. via the freely available Malin software [10]. Lastly, we inferred intron conservation, gain and loss by comparing the intron status of each leaf or internal node with that of its immediate ancestor. As an example, Figure 6 illustrates the details of inferring intron evolution in a cintron.
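
The Dollo reconstruction and the subsequent event classification can be illustrated with a short Python sketch. It is a simplified illustration only: it assumes a binary tree written as nested tuples, unambiguous presence/absence at every leaf (so it ignores the in-paralog and ambiguity handling used in ciwogPlants), and a hypothetical five-species example; all function names are invented for this example. Under Dollo parsimony the single gain is placed at the last common ancestor of all intron-containing leaves, and every node on a path from that ancestor to such a leaf carries the intron.

```python
from collections import Counter

LABELS = {
    (True, True): "conservation",
    (False, True): "gain",
    (True, False): "loss",
    (False, False): "absence",
}


def count_intron_leaves(node, presence, counts):
    """Count the intron-containing leaves below (and including) each node."""
    if isinstance(node, str):
        counts[node] = 1 if presence[node] else 0
    else:
        counts[node] = sum(count_intron_leaves(c, presence, counts) for c in node)
    return counts[node]


def dollo_states(tree, presence):
    """Assign intron presence (True) or absence (False) to every node."""
    counts = {}
    total = count_intron_leaves(tree, presence, counts)
    states = {}

    def visit(node):
        children = () if isinstance(node, str) else node
        # A node lies above the single gain point if one child subtree
        # already contains every intron-containing leaf.
        above_gain = any(counts[c] == total for c in children)
        states[node] = counts[node] > 0 and not above_gain
        for child in children:
            visit(child)

    visit(tree)
    return states


def classify_events(tree, states):
    """Compare every node/leaf with its immediate ancestor."""
    events = Counter()

    def visit(node, parent):
        if parent is not None:
            events[LABELS[(states[parent], states[node])]] += 1
        if not isinstance(node, str):
            for child in node:
                visit(child, node)

    visit(tree, None)
    return events


# Hypothetical cintron: intron present in AT and VV, absent in PT, OS and SB.
dicots = (("AT", "PT"), "VV")
monocots = ("OS", "SB")
tree = (dicots, monocots)
presence = {"AT": True, "PT": False, "VV": True, "OS": False, "SB": False}

states = dollo_states(tree, presence)
events = classify_events(tree, states)
print(dict(events))
```

For this toy cintron the intron is gained once on the branch leading to the dicot ancestor and lost once in PT; the monocot branches are counted as absences.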

Figure 6:

Inferring intron evolution in cintron #2 of ciwogPlants cluster 9952. A portion of the CIWOG display is shown (A) for cluster 9952, which contains orthologous genes encoding mannitol dehydrogenases (http://ciwog.gdcb.iastate.edu/ciwogPlants_loose-cgi/display.pl?cid=9952). In the cintron, four of the genes have introns, while the other three genes lack introns. The species tree of this cintron (B) is derived by using the available species in the cintron and by maintaining the tree topology in Figure 5. Species abbreviations and node IDs are specified, followed by intron statuses (in parentheses; +: presence; −: absence; +/−: ambiguous, could be presence or absence) that are inferred using Dollo parsimony [45]. In more detail, if a species/leaf has multiple genes (in-paralogs), its intron status is the union of intron presence and absence in all in-paralogs. Given the presence/absence of introns in the leaves, our algorithm traversed the tree from leaves to root to infer the intron status at each node using parsimony. The algorithm then backtracked from root to leaves and finalized the intron status at each node according to the following rules: (i) a node has an intron if both of its children have at least one descendant that has an intron, and all nodes that are descendants of this node and ancestors of the descendants having introns should also have introns; (ii) if the intron status is ambiguous, the node's final intron status should be the intersection of its own status and that of its immediate ancestor (excluding the root). For example, node 1 had ambiguous intron status (+/−) after the first round of inference, but after backtracking its status is changed to intron presence (+). Finally, after all intron statuses are determined, intron evolution events are inferred by comparing each node/leaf with its immediate ancestor (C).

Our results suggest several observations about plant intron evolution. First, there are more losses than gains at each leaf/species of the tree (Table 1 and Figure 7), and the loss/gain ratio is higher in monocot than in dicot species (9.00–11.27 versus 1.79–5.26 under the strict option; 15.10–16.12 versus 2.35–10.94 under the loose option; Table 1). Second, among the dicot nodes (nodes 1–4), the more ancient ancestors (nodes 3 and 4) had roughly equal numbers of gains and losses, whereas the more recent ancestors (nodes 1 and 2) had more losses than gains (Table 1 and Figure 7). This indicates a balance of intron gains and losses in ancient dicots and an increase in the degree of net intron loss in dicots over time.

Table 1:

Intron evolution events by comparing current nodes/leaves (specified) with their immediate ancestors (internal nodes only; not specified)—based on the ‘strict’ and ‘loose’ options

Current node/leaf (a)   Conservation (b)  Gain (c)  Loss (d)  Absence (e)  Unknown (f)  Total (g)  Loss/gain ratio

Classification based on the ‘strict’ option
  Dicot leaves/species
    GM                  56 363            340       610       3214         2069         62 596     1.79
    MT                  36 223            129       675       2545         678          40 250     5.23
    PT                  52 800            137       473       3573         1136         58 119     3.45
    AT                  58 184            206       1084      3758         1981         65 213     5.26
    VV                  49 976            171       322       3259         948          54 676     1.88
  Dicot nodes
    1                   31 796            24        136       2312         62           34 330     5.67
    2                   47 796            19        50        3413         74           51 352     2.63
    3                   55 800            54        51        3728         257          59 890     0.94
    4                   44 653            66        69        3039         606          48 433     1.05
  Monocot leaves/species
    OS                  65 612            26        293       4221         1835         71 987     11.27
    SB                  64 309            54        486       4226         2112         71 187     9.00
  Monocot node
    5                   51 425            54        321       3358         574          55 732     5.94
  Ancestor of dicots and monocots
    6                   33 743            –         –         1468         5782         40 993     –
  Outgroup
    PP                  34 321            –         –         1192         7925         43 438     –

Classification based on the ‘loose’ option
  Dicot leaves/species
    GM                  58 252            78        318       2554         979          62 181     4.08
    MT                  37 011            52        569       1765         503          39 900     10.94
    PT                  54 050            59        353       2492         806          57 760     5.98
    AT                  59 517            167       1029      2460         1724         64 897     6.16
    VV                  51 069            92        216       2211         768          54 356     2.35
  Dicot nodes
    1                   32 539            7         121       1533         29           34 229     17.29
    2                   48 901            12        43        2300         57           51 313     3.58
    3                   57 285            45        45        2490         164          60 029     1.00
    4                   45 911            65        68        2010         521          48 575     1.05
  Monocot leaves/species
    OS                  66 817            17        274       2976         1579         71 663     16.12
    SB                  65 395            30        453       2995         1840         70 713     15.10
  Monocot node
    5                   53 092            39        306       2257         421          56 115     7.85
  Ancestor of dicots and monocots
    6                   35 237            –         –         725          5483         41 445     –
  Outgroup
    PP                  36 016            –         –         504          7330         43 850     –

(a) See Figure 5 for node IDs and species abbreviations.

(b) Conservation: introns are present in both current node/leaf and ancestor.

(c) Gain: intron is present in current node/leaf but absent in ancestor.

(d) Loss: intron is absent in current node/leaf but present in ancestor.

(e) Absence: introns are absent in both current node/leaf and ancestor.

(f) Unknown: evolution events cannot be inferred due to ambiguous intron presence/absence status.

(g) Total = conservation + gain + loss + absence + unknown.
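
As a worked check of how the last two columns relate to the others, the short Python sketch below recomputes the total and the loss/gain ratio for four ‘strict’ rows copied from Table 1. The variable and function names are chosen for this illustration only.

```python
# (conservation, gain, loss, absence, unknown, total, reported loss/gain ratio)
# Values copied from the 'strict' part of Table 1.
table1_strict = {
    "GM": (56363, 340, 610, 3214, 2069, 62596, 1.79),
    "AT": (58184, 206, 1084, 3758, 1981, 65213, 5.26),
    "OS": (65612, 26, 293, 4221, 1835, 71987, 11.27),
    "SB": (64309, 54, 486, 4226, 2112, 71187, 9.00),
}


def check_row(row):
    """Return True if the total and the loss/gain ratio match the other columns."""
    conservation, gain, loss, absence, unknown, total, reported_ratio = row
    computed_total = conservation + gain + loss + absence + unknown
    computed_ratio = round(loss / gain, 2)
    return computed_total == total and computed_ratio == reported_ratio


for species, row in table1_strict.items():
    print(species, check_row(row))
```

For example, for OS the ratio is 293/26 ≈ 11.27, compared with 610/340 ≈ 1.79 for GM, which is the monocot versus dicot contrast discussed in the text.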

Figure 7:

Intron evolution events in the eight-species tree (Figure 5) for the ‘strict’ (A) and ‘loose’ (B) options. Intron evolution is inferred by comparing nodes/leaves with their immediate ancestors in the tree (see Figure 6 for details). Node 7 is compared with the root of the tree, which is assumed to be intronless for this display. The numbers of intron losses (top), gains (middle) and conservations (bottom) are shown on the right side of each branch. A dash indicates that the number of evolution events cannot be inferred. The figure counts evolution events only for branches that are part of the eight-species tree, whereas Table 1 counts evolution events for all branches of all possible trees; the events in this figure are therefore a subset of the corresponding events in Table 1.

In this application, we provided two versions of ciwogPlants: a strict version that clusters only introns at identical alignment positions and a loose version that clusters introns within one aligned amino acid, with the possibility of one gap. The two versions yield different sets of cintrons and are offered as estimates of homologous plant introns. To assess cintron correspondence between the two versions, cintrons were compared by identical alignment positions. The versions share the majority of their cintrons (83% of the loose and 85% of the strict cintrons). However, the loose cintrons overlapped 1.5 strict intron positions on average. Manual inspection of these cases revealed that the unique loose cintrons often offered a better homologous grouping than the strict cintrons, for reasons such as alignment artifacts, in which homologous characters are not aligned or are ambiguously aligned; gene-structure annotation inaccuracies, in which an intron is shifted relative to its ‘true’ position; or authentic intron sliding events, which are the extension or shrinking of intron boundaries with new exonic nucleotides [26] (see Figure 4B for an example). Conversely, the loose version can excessively cluster introns that occur at close positions over many genes but are very different between genes (http://ciwog.gdcb.iastate.edu/ciwogPlants_loose-cgi/display.pl?cid=2386#iSBSb01g021990.1). Unique loose cintrons span 1.8 alignment positions on average, indicating that excessive clustering is not typical. Complex cases such as this are best reviewed manually using the web displays. More introns are conserved in the loose version than in the strict version in all node comparisons (Table 1), which suggests that the loose version is probably a more realistic estimate of homologous introns. It may be worthwhile to extend previous studies of orthologous introns that were based entirely on strictly conserved positions.
The CIDA/CIWOG algorithm and web displays allow users to easily explore a wide range of relevant questions, while allowing flexible definitions of intron type and position.
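
The difference between the strict and loose options can be illustrated with a deliberately simplified Python sketch. It is not the CIDA implementation: it assumes that each intron is reduced to a single alignment position, ignores gaps, intron phase and local alignment quality, and merges neighbouring positions by simple chaining; the function name, the `tolerance` parameter and the example positions are hypothetical.

```python
def cluster_positions(introns, tolerance=0):
    """Group (gene, alignment_position) pairs into candidate common introns.

    tolerance=0 mimics a 'strict' grouping (identical positions only);
    tolerance=1 mimics a 'loose' grouping (positions at most one apart).
    """
    clusters = []
    for gene, position in sorted(introns, key=lambda item: item[1]):
        previous_position = clusters[-1][-1][1] if clusters else None
        if previous_position is not None and position - previous_position <= tolerance:
            clusters[-1].append((gene, position))   # extend the current cluster
        else:
            clusters.append([(gene, position)])     # start a new cluster
    return clusters


# Hypothetical intron positions in a protein alignment
introns = [("AT", 120), ("OS", 121), ("VV", 120), ("SB", 200)]

strict = cluster_positions(introns, tolerance=0)
loose = cluster_positions(introns, tolerance=1)
print(len(strict), len(loose))
```

Because neighbouring positions are chained, a run of introns at positions 120, 121, 122 and so on would all fall into one loose cluster, which is a toy version of the excessive clustering described above and one reason why manual review of complex cases remains useful.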

CONCLUSIONS AND FUTURE DIRECTIONS

CIWOG provides an easy-to-use, flexible pipeline for the task of detecting common introns from protein-sequence alignments. New additions in comparison with previous programs are the intron sliding, local alignment quality and intron typing options. CIWOG also provides dynamic visualization displays for manual review of individual gene families. Manual review enables deconvolution of complex common introns and gene families and can also reveal potential gene annotation inaccuracies. Our analysis of eight plant species is the most comprehensive to date. We show that, based on Dollo parsimony, intron loss is more common than intron gain in recent plant evolution and that the excess of loss over gain is greater in monocots than in dicots. Intron presence and absence matrices are provided for individuals wishing to estimate intron evolution through other models. Although intron evolution is an active field, much of the effort has been focused on evolutionary modeling of common intron data rather than on the derivation of the common introns themselves. We feel that this derivation is an important part of the discussion and that tools such as CIWOG will make important contributions to this field. One assumption in contemporary common intron analysis is that each locus contains exactly one transcript isoform. Future directions of CIWOG will focus on eliminating this assumption by applying CIWOG to all reliable transcript isoforms in orthologous gene clusters.

Key Points

  • CIDA incorporates flexible criteria for detecting common introns in protein-coding genes and enables studies of intron evolution with respect to occurrence, position and type.

  • CIWOG web displays provide unique dynamic tools for analysis of gene-structure prediction and intron evolution in the context of groups of orthologous genes.

  • CIDA and CIWOG are distributed freely as open-source software.

  • CIWOG detects U12-type and GC-donor-type in addition to canonical U2-type introns.

  • Plant introns are evolutionarily lost at a higher rate than they are gained. The loss/gain ratio is higher in the studied monocot species than in the dicots.

  • Ancient dicots may have had similar numbers of gains and losses, and the degree of net intron loss in dicots increases over time.

FUNDING

National Science Foundation Plant Genome Research Program [DBI-0606909 to V.P.B.].

Acknowledgements

We thank Feng Chen for providing a memory-efficient version of OrthoMCL, Nicola L.B. Pohl for help with the figures and group members for helpful discussion. We also thank anonymous reviewers for their thoughtful and useful comments.

References

1. Roy SW, Gilbert W. The evolution of spliceosomal introns: patterns, puzzles and progress. Nat Rev Genet 2006;7:211-21.
2. Swarbreck D, Wilks C, Lamesch P, et al. The Arabidopsis Information Resource (TAIR): gene structure and function annotation. Nucleic Acids Res 2008;36:D1009-14.
3. Fedorova L, Fedorov A. Introns in gene evolution. Genetica 2003;118:123-31.
4. Coulombe-Huntington J, Majewski J. Characterization of intron loss events in mammals. Genome Res 2007;17:23-32.
5. Roy SW, Irimia M. Mystery of intron gain: new data and new models. Trends Genet 2009;25:67-73.
6. Babenko VN, Rogozin IB, Mekhedov SL, et al. Prevalence of intron gain over intron loss in the evolution of paralogous gene families. Nucleic Acids Res 2004;32:3724-33.
7. Roy SW, Penny D. On the incidence of intron loss and gain in paralogous gene families. Mol Biol Evol 2007;24:1579-81.
8. Rogozin IB, Sverdlov AV, Babenko VN, et al. Analysis of evolution of exon-intron structure of eukaryotic genes. Brief Bioinform 2005;6:118-34.
9. Irimia M, Roy SW. Spliceosomal introns as tools for genomic and evolutionary analysis. Nucleic Acids Res 2008;36:1703-12.
10. Csűrös M. Malin: maximum likelihood analysis of intron evolution in eukaryotes. Bioinformatics 2008;24:1538-9.
11. Fedorov A, Merican AF, Gilbert W. Large-scale comparison of intron positions among animal, plant, and fungal genes. Proc Natl Acad Sci USA 2002;99:16128-33.
12. Krauss V, Pecyna M, Kurz K, et al. Phylogenetic mapping of intron positions: a case study of translation initiation factor eIF2gamma. Mol Biol Evol 2005;22:74-84.
13. Rogozin IB, Wolf YI, Sorokin AV, et al. Remarkable interkingdom conservation of intron positions and massive, lineage-specific intron loss and gain in eukaryotic evolution. Curr Biol 2003;13:1512-7.
14. Csűrös M, Holey JA, Rogozin IB. In search of lost introns. Bioinformatics 2007;23:i87-96.
15. Lin H, Zhu W, Silva JC, et al. Intron gain and loss in segmentally duplicated genes in rice. Genome Biol 2006;7:R41.
16. Roy SW, Penny D. Patterns of intron loss and gain in plants: intron loss-dominated evolution and genome-wide comparison of O. sativa and A. thaliana. Mol Biol Evol 2007;24:171-81.
17. Basu MK, Rogozin IB, Deusch O, et al. Evolutionary dynamics of introns in plastid-derived genes in plants: saturation nearly reached but slow intron gain continues. Mol Biol Evol 2008;25:111-9.
18. Chen F, Mackey AJ, Stoeckert CJ Jr, et al. OrthoMCL-DB: querying a comprehensive multi-species collection of ortholog groups. Nucleic Acids Res 2006;34:D363-8.
19. Li L, Stoeckert CJ Jr, Roos DS. OrthoMCL: identification of ortholog groups for eukaryotic genomes. Genome Res 2003;13:2178-89.
20. Altschul SF, Madden TL, Schaffer AA, et al. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res 1997;25:3389-402.
21. Duvick J, Fu A, Muppirala U, et al. PlantGDB: a resource for comparative plant genomics. Nucleic Acids Res 2008;36:D959-65.
22. Sayers EW, Barrett T, Benson DA, et al. Database resources of the National Center for Biotechnology Information. Nucleic Acids Res 2009;37:D5-15.
23. Thompson JD, Higgins DG, Gibson TJ. CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. Nucleic Acids Res 1994;22:4673-80.
24. Larkin MA, Blackshields G, Brown NP, et al. Clustal W and Clustal X version 2.0. Bioinformatics 2007;23:2947-48.
25. Edgar RC. MUSCLE: multiple sequence alignment with high accuracy and high throughput. Nucleic Acids Res 2004;32:1792-7.
26. Rogozin IB, Lyons-Weiler J, Koonin EV. Intron sliding in conserved gene families. Trends Genet 2000;16:430-2.
27. Patel AA, Steitz JA. Splicing double: insights from the second spliceosome. Nat Rev Mol Cell Biol 2003;4:960-70.
28. Zhu W, Brendel V. Identification, characterization and molecular phylogeny of U12-dependent introns in the Arabidopsis thaliana genome. Nucleic Acids Res 2003;31:4561-72.
29. Shepelev V, Fedorov A. Advances in the Exon-Intron Database (EID). Brief Bioinform 2006;7:178-85.
30. Schuler MA. Splice site requirements and switches in plants. Curr Top Microbiol Immunol 2008;326:39-59.
31. Pessa HK, Will CL, Meng X, et al. Minor spliceosome components are predominantly localized in the nucleus. Proc Natl Acad Sci USA 2008;105:8655-0.
32. Burge CB, Padgett RA, Sharp PA. Evolutionary fates and origins of U12-type introns. Mol Cell 1998;2:773-85.
33. Russell AG, Charette JM, Spencer DF, et al. An early evolutionary origin for the minor spliceosome. Nature 2006;443:863-6.
34. Will CL, Luhrmann R. Splicing of a rare class of introns by the U12-dependent spliceosome. Biol Chem 2005;386:713-24.
35. Pessa HK, Ruokolainen A, Frilander MJ. The abundance of the spliceosomal snRNPs is not limiting the splicing of U12-type introns. RNA 2006;12:1883-92.
36. Patel AA, McCarthy M, Steitz JA. The splicing of U12-type introns can be a rate-limiting step in gene expression. EMBO J 2002;21:3804-15.
37. Basu MK, Makalowski W, Rogozin IB, et al. U12 intron positions are more strongly conserved between animals and plants than U2 intron positions. Biol Direct 2008;3:19.
38. Sheth N, Roca X, Hastings ML, et al. Comprehensive splice-site analysis using comparative genomics. Nucleic Acids Res 2006;34:3955-67.
39. Kitamura-Abe S, Itoh H, Washio T, et al. Characterization of the splice sites in GT-AG and GC-AG introns in higher eukaryotes using full-length cDNAs. J Bioinform Comput Biol 2004;2:309-31.
40. Thanaraj TA, Clark F. Human GC-AG alternative intron isoforms with weak donor sites show enhanced consensus at acceptor exon positions. Nucleic Acids Res 2001;29:2581-93.
41. Campbell MA, Haas BJ, Hamilton JP, et al. Comprehensive analysis of alternative splicing in rice and comparative analyses with Arabidopsis. BMC Genomics 2006;7:327.
42. Farrer T, Roller AB, Kent WJ, et al. Analysis of the role of Caenorhabditis elegans GC-AG introns in regulated splicing. Nucleic Acids Res 2002;30:3360-7.
43. Abril JF, Castelo R, Guigo R. Comparison of splice sites in mammals and chicken. Genome Res 2005;15:111-9.
44. Churbanov A, Winters-Hilt S, Koonin EV, et al. Accumulation of GC donor splice signals in mammals. Biol Direct 2008;3:30.
45. Farris JS. Phylogenetic analysis under Dollo's Law. Syst Zool 1977;26:77-88.
46. Roy SW, Gilbert W. Complex early genes. Proc Natl Acad Sci USA 2005;102:1986-91.
47. Barbazuk WB, Fu Y, McGinnis KM. Genome-wide analyses of alternative splicing in plants: opportunities and challenges. Genome Res 2008;18:1381-92.
48. Hedges SB. The origin and evolution of model organisms. Nat Rev Genet 2002;3:838-49.