Phylogenetic Relationships among Group II Intron ORFs

Group II introns are widely believed to have been ancestors of spliceosomal introns, yet little is known about their own evolutionary history. In order to address the evolution of mobile group II introns, we have compiled 71 open reading frames (ORFs) related to group II intron reverse transcriptases and subjected their derived amino acid sequences to phylogenetic analysis. The phylogenetic tree was rooted with reverse transcriptases (RTs) of non-long terminal repeat retroelements, and the inferred phylogeny reveals two major clusters which we term the mitochondrial and chloroplast-like lineages. Bacterial ORFs are mainly positioned at the bases of the two lineages but with weak bootstrap support. The data give an overview of an apparently high degree of horizontal transfer of group II intron ORFs, mostly among related organisms but also between organelles and bacteria. The Zn domain (nuclease) and YADD motif (RT active site) were lost multiple times during evolution. Differences in domain structures suggest that the oldest ORFs were concise, while the ORF in the mitochondrial lineage subsequently expanded in three locations. The data are consistent with a bacterial origin for mobile group II introns.


INTRODUCTION
Group II introns are self-splicing RNAs that are widely believed to have been ancestors of nuclear pre-mRNA introns (1). Some group II introns are also active retroelements due to reverse transcriptases (RTs) encoded within the introns (2). These mobile group II introns are linked mechanistically and phylogenetically to non-long terminal repeat (non-LTR) elements, an abundant class of retroelements in eukaryotes (3). Based on these relationships, it has been proposed that group II introns might be ancestors of non-LTR retroelements as well as spliceosomal introns (3,4). Yet despite the potential importance of group II introns to the evolution of eukaryotic genomes, little information is available about the evolutionary history of group II introns themselves.
Mobile group II introns consist of an ∼600 nt self-splicing RNA structure surrounding an ∼2 kb open reading frame (ORF; Fig. 1). The conserved secondary structure of the intron comprises six domains (5), and the ORF is invariably located in domain IV (Fig. 1A and B). The ORF itself is divided into conserved domains RT, X and Zn. The RT domain forms the bulk of the ORF and consists of subdomains 1-7 [palm and finger regions in the crystal structure of HIV-RT (6)]. Subdomain 0 can be considered an N-terminal extension of the RT domain and is conserved among non-LTR RTs. [Domain 0 was formerly called domain Z but was more recently renamed domain 0 in non-LTR RTs (7).] Domain X (analogous to the thumb structure of HIV-RT) is implicated in the splicing, or maturase, function of the RT protein (8), while the Zn domain contains a potential zinc finger and contributes a nuclease activity. The Zn domain is related to a family of bacterial colicin and pyocin nucleases, as well as to some group I intron ORFs (9,10).
Splicing and mobility of group II introns require catalytic activities of both the intron and intron-encoded protein (see 2 for complete description). In brief, intron splicing occurs in vivo when the RT protein binds to unspliced intron transcript and stimulates the inherently RNA-catalyzed splicing reaction. Mobility of the intron is initiated when the RNP product of splicing (RT bound to lariat intron) encounters a homing site DNA (fused exons lacking intron). The intron reverse splices either partially or completely into the exon sequences of DNA. Then the Zn domain of the RT nicks the antisense strand of the DNA target 9 or 10 bp downstream of the exon junction, and the RT reverse transcribes the intron using the cleaved antisense strand as a primer. Intron insertion is completed by host repair enzymes.
Our understanding of the evolutionary history of group II introns is based primarily on speculation and general observations. It has been proposed that mobile group II introns were created when an RT was inserted into a pre-existing group II intron (11,12), although other possibilities have been considered (13). Such a formative event may have occurred in bacteria, after which introns migrated to mitochondria and chloroplasts. The theory of a bacterial origin was prompted by the discovery of ORF-containing group II introns in bacterial species related to the ancestors of organelles [Calothrix, Azotobacter and Escherichia coli (14,15)]. Group II introns have since been reported in diverse bacterial species including Lactococcus (16), Clostridium (17), Pseudomonas (18), Sinorhizobium (19), Bacillus (20) and Sphingomonas (21), indicating a widespread presence of group II introns in bacteria.
*To whom correspondence should be addressed. Tel: +1 403 220 7933; Fax: +1 403 289 9311; Email: zimmerly@ucalgary.ca
Present addresses: Georg Hausner, Department of Botany, University of Manitoba, Winnipeg, Manitoba R3T 2N2, Canada; Xu-chu Wu, Department of Cell Biology and Anatomy, University of Calgary School of Medicine, Calgary, Alberta T2N 4N1, Canada
Deciphering the history of group II introns is complicated by the fact that group II introns are inherited horizontally as well as vertically. Horizontal transfer is suggested by an idiosyncratic distribution of introns among species and strains, and also by the observation that related introns are sometimes found in seemingly less related host genes or host organisms (11). For example, Kluyveromyces lactis cox1I1 and Saccharomyces cerevisiae cox1I2 are 96% identical in DNA sequence (intron and ORF) and are located at the same site within the cox1 gene, yet the cox1 genes themselves are only 88% identical (22). Another example is the group II intron ORF ltrA of Lactococcus lactis, which is more closely related to ORFs of mitochondrial introns than to other bacterial intron ORFs (16).
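The comparison in the K.lactis/S.cerevisiae example can be sketched in a few lines of Python. This is only an illustration of the reasoning, not the procedure used in the cited work: the function names, the convention of skipping gapped columns, and the toy sequences in the usage note are all assumptions.

```python
def percent_identity(seq_a: str, seq_b: str, gap: str = "-") -> float:
    """Percent identity over aligned columns in which neither sequence has a gap.

    Assumption: both sequences come from the same alignment, so they have
    equal length and column i of one corresponds to column i of the other.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have equal length")
    compared = 0
    matches = 0
    for a, b in zip(seq_a.upper(), seq_b.upper()):
        if a == gap or b == gap:
            continue  # gapped columns are skipped (one of several conventions)
        compared += 1
        if a == b:
            matches += 1
    if compared == 0:
        raise ValueError("no gap-free columns to compare")
    return 100.0 * matches / compared


def intron_more_similar_than_host(intron_identity: float, host_identity: float) -> bool:
    """True when the intron pair is more similar than the host-gene pair.

    This is the pattern the text treats as a hint of horizontal transfer.
    """
    return intron_identity > host_identity
```

With the values quoted in the text, `intron_more_similar_than_host(96.0, 88.0)` returns `True`. The reverse pattern does not by itself prove vertical inheritance.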
Vertical inheritance of group II introns is best exemplified by introns with degenerate ORFs. The most extreme example is the matK family of proteins found in the chloroplast trnK genes of higher plants. MatK proteins do not resemble other group II intron-encoded proteins except for some conservation in RT subdomains 5-7 and domain X. It is believed that matK ORFs lost mobility functions but retained their maturase function, which is associated with the domain X motif still present in these ORFs (8,23).
To help elucidate the evolutionary past of mobile group II introns, we have undertaken phylogenetic analysis of the intron-encoded proteins. Evolution of these proteins is obviously distinct from the evolution of the intron RNA structure itself, although one report indicated coevolution for a limited subset of RT-encoding introns (24). We have compiled group II intron ORFs from the databases, including a number of bacterial ORFs not explicitly reported as group II introns in the literature. Alignment of the ORF sequences and construction of a phylogenetic tree suggest a model for the history of mobile group II introns. This model gives an overview of horizontal versus vertical inheritance and predicts how the ORF structure, and possibly activities, evolved.

MATERIALS AND METHODS

Compilation of sequences
Putative group II intron ORFs were identified by BLAST searches (25) at the National Center for Biotechnology Information (NCBI) web site http://www.ncbi.nlm.nih.gov/Genbank/GenbankSearch.html using a selection of known group II intron ORFs as query sequences. After the phylogenetic tree was constructed, further searches were done based on representatives from each of the major branches to ensure all ORFs related to each subfamily were identified. Sequences were aligned by PILEUP (Wisconsin Package v.8; Genetics Computer Group, Madison, WI) on the ANGIS (Australian National Genomic Information Service) supercomputing server (http://morgan.angis.su.oz.au/) and by CLUSTAL X (26), followed by manual refinement using the editing programs GeneDoc (K.B.Nicholas and H.B.Nicholas Jr, for Windows; http://www.psc.edu/biomed/genedoc/) and SeqApp (D.G.Gilbert, for Macintosh; http://iubio.bio.indiana.edu/soft/molbio/seqpup/). Consensus sequences were calculated using the GCG program PRETTY on the ANGIS server, taking into account conservative substitutions.
Phylogenetic estimates were generated by the programs contained within the PHYLIP package (Version 3.573c; http://evolution.genetics.washington.edu/phylip/getme.html). Neighbor-joining analysis utilized the PROTDIST (setting: Dayhoff PAM250 substitution matrix) and NEIGHBOR programs. Parsimony analysis used PROTPARS (protein parsimony algorithm, version 3.55c). SEQBOOT and CONSENSE programs were used for bootstrap analysis and generation of the majority rule consensus trees (27). Maximum-likelihood trees were inferred with PUZZLE 4.02 (ftp://ebi.ac.uk/pub/software; setting: JTT correction matrix and frequency of amino acid usage estimated from data), with 100 quartet puzzling steps (28).
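The resampling step behind the bootstrap analysis can be illustrated with a short Python sketch. It shows the general idea of column resampling with replacement, not the SEQBOOT implementation; the function names, the dictionary representation of the alignment and the fixed seed are assumptions made for the example.

```python
import random


def bootstrap_alignment(alignment: dict, rng: random.Random) -> dict:
    """Build one pseudo-replicate by sampling alignment columns with replacement.

    `alignment` maps sequence names to aligned strings of equal length.
    The same sampled columns are applied to every sequence, so each
    replicate column is an intact column of the original alignment.
    """
    names = list(alignment)
    if not names:
        raise ValueError("alignment is empty")
    length = len(alignment[names[0]])
    if any(len(alignment[name]) != length for name in names):
        raise ValueError("all aligned sequences must have the same length")
    columns = [rng.randrange(length) for _ in range(length)]
    return {name: "".join(alignment[name][c] for c in columns) for name in names}


def bootstrap_replicates(alignment: dict, n_replicates: int, seed: int = 0) -> list:
    """Return `n_replicates` pseudo-replicate alignments (seeded for repeatability)."""
    rng = random.Random(seed)
    return [bootstrap_alignment(alignment, rng) for _ in range(n_replicates)]
```

Each replicate would then be passed through the same tree-building step as the original alignment, and the resulting trees summarized as a consensus.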
Sequences omitted from phylogenetic analysis were: C.p.psbCI4, P.p.rtl, A.s. (missing domains); P.co.cox1I1 and P.cu.ND5I1 (redundancy); and E.g.psbDI8 (divergence). S.ma. was omitted because it lacked a group II intron structure and its inclusion reduced statistical support for the tree; we conclude that the S.ma. ORF lost its intron structure and has degenerated. The approximate positions of these ORFs on the phylogenetic tree based on BLAST search similarities are: A.s., E.g.psbDI8, chloroplast-like group; M.l., P.p.rtl, algal group; C.p.psbCI4, euglenoid group; S.ma., unclear.

RESULTS

Compilation of ORF sequences
Group II intron ORF sequences were identified by BLAST searches of GenBank to find relatives of known group II intron ORFs (Materials and Methods). The compiled 71 ORFs and 14 ORF fragments are displayed in Tables 1-3 along with their features and accession numbers. The tables include 40 mitochondrial, 11 chloroplast and 20 bacterial ORFs. The 14 bacterial ORF fragments are listed in the footnote to Table 3. In comparison with the previous compilation (8), additions are: 20 mitochondrial ORFs (five related nad1I4 ORFs from higher plants, eight ORFs of green, red and brown algae, seven fungal ORFs), five chloroplast ORFs (four related euglenoid psbCI4 ORFs; one cryptomonad ORF) and 19 bacterial ORFs. In cases where nearly identical intron ORFs were reported in the same species (e.g. E.coli intron ORFs of accession numbers AE000133 and D37918), only one entry was included in the table. Only five matR ORFs (nad1I4 ORFs in higher plants) are included although over 100 are reported (29). We have omitted members of the matK family (trnKI1 ORFs in land plants) because they are too divergent for phylogenetic comparisons, and also because over 1000 matK sequences have been reported due to their use in molecular systematics of higher plants (30). The mat2 ORFs of euglenoids are also too divergent for phylogenetic analysis (31).
The most important ORFs compiled in our search are the bacterial ORFs, only one of which was included in the previous compilation. Because many of the bacterial ORFs were discovered in sequencing projects, their database entries are often poorly annotated and fail to define correct intron and exon boundaries. For the newly identified bacterial ORFs, we have folded the flanking sequences to verify that all except S.ma. are located in group II intron RNA structures (Table 3).

Sequence alignment
ORF sequences were aligned by standard methods, and the alignment has been submitted to the EMBL database (accession number ALIGN_000044). Figure 2 shows a representative alignment of four ORF sequences, each representing a phylogenetic grouping. For convenience we refer to these clusters as the mitochondrial, algal, bacterial and euglenoid groupings, although the bacterial grouping is not a discrete clade, and the mitochondrial and algal groups have mixed compositions (Fig. 3). A 75% consensus sequence is presented for each subgrouping, along with the total consensus sequence for group II intron ORFs. Conserved motifs of group II intron ORFs (subdomains 0-7 of the RT domain, the X domain and the Zn domain) are labeled according to previous studies (7,9,32), taking into account the boundaries of conservation seen in our alignment. [Subdomains 0 and 2A are named in accordance with non-LTR RTs (7); subdomain 0 is conserved only between group II intron and non-LTR RTs; subdomain 2A is weakly conserved among non-LTR, group II intron and retron RTs.] We have also defined two spacer regions which are sites for amino acid insertions in many of the ORFs. These sites are between subdomains 4 and 5 of the RT domain (the 4/5 spacer) and between subdomain 7 and domain X (the 7/X spacer). The 4/5 spacer varies from 1 to 179 amino acids for different RTs, while the 7/X spacer ranges from 0 to 235 amino acids.
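A threshold consensus of the kind described above (positions conserved in at least 75% of group members) can be sketched as follows. This is a simplified illustration rather than the GCG PRETTY procedure: it requires exact residue identity and ignores the conservative substitutions that the original calculation took into account. The function name, the gap symbol and the placeholder character are assumptions.

```python
from collections import Counter


def threshold_consensus(sequences, threshold=0.75, gap="-", unconserved="."):
    """Consensus string for aligned sequences of equal length.

    A column reports its most common residue when that residue occurs in at
    least `threshold` of all sequences (gaps count against conservation);
    otherwise the column is written as `unconserved`.
    """
    if not sequences:
        raise ValueError("need at least one sequence")
    length = len(sequences[0])
    if any(len(seq) != length for seq in sequences):
        raise ValueError("all aligned sequences must have the same length")
    total = len(sequences)
    consensus = []
    for column in zip(*sequences):
        counts = Counter(residue for residue in column if residue != gap)
        if counts:
            residue, count = counts.most_common(1)[0]
            if count / total >= threshold:
                consensus.append(residue)
                continue
        consensus.append(unconserved)
    return "".join(consensus)
```

For example, `threshold_consensus(["YADD", "YADD", "YADD", "FADD"])` keeps the first position because three of four toy sequences agree, whereas a two-to-two split at that position would be reported as unconserved.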
Variations between the consensus sequences in Figure 2 reflect evolutionary divergence of the lineages of ORFs. The most striking differences are in domain X. Domain X is highly conserved in the mitochondrial and euglenoid groups (21 and 38 positions conserved out of 105), less conserved in the algal group (15 positions) and poorly conserved in the bacterial group (five positions). In concluding that domain X is 'poorly conserved' in bacteria, we note that RT subdomains 0-7 are equally conserved within each of the mitochondrial and bacterial groups (90 and 94 residues respectively). Interestingly, the conserved positions in domain X are not shared among the groups, further suggesting that domain X is the most rapidly evolving region of the ORF.

Phylogenetic analysis
Phylogenetic analysis was based on RT subdomains 0-7 and domain X. Although plant nad1I4 ORFs do not contain subdomains 0 and 1, their inclusion improved resolution for the rest of the tree and did not affect placement of nad1I4 ORFs (not shown). All other positions that were not unambiguously alignable among all group II intron ORFs were omitted from analysis, including the domain 4/5 spacer and all idiosyncratic insertions (see Fig. 3 for listing of insertions). In total, the alignable sequence used to construct the tree was 260 amino acids, of which 235 sites were informative. The tree was rooted with RTs of four subclasses of non-LTR elements [D.m.RD2, D.m.jockey, C.e.RT1 and human L1 (7)], using only the alignable amino acids in RT subdomains 0-7 (202 positions). Since domain X sequence is not present in the outgroups, we confirmed that inclusion of domain X in the analysis did not significantly affect the branching pattern. We also confirmed that the internal topology of an unrooted tree was the same as the rooted tree (not shown).
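The text does not state how "informative" sites were defined. A minimal sketch, assuming the usual parsimony criterion (a column is informative when at least two different residues each occur in at least two sequences), is given below; the function names and the gap handling are likewise assumptions.

```python
from collections import Counter


def is_parsimony_informative(column, gap="-"):
    """True when at least two residue types each occur in two or more sequences."""
    counts = Counter(residue for residue in column if residue != gap)
    return sum(1 for n in counts.values() if n >= 2) >= 2


def count_informative_sites(sequences, gap="-"):
    """Count parsimony-informative columns in aligned sequences of equal length."""
    if not sequences:
        raise ValueError("need at least one sequence")
    length = len(sequences[0])
    if any(len(seq) != length for seq in sequences):
        raise ValueError("all aligned sequences must have the same length")
    return sum(is_parsimony_informative(column, gap) for column in zip(*sequences))
```

Applied to the alignment described in the text, a count of this kind would correspond to the 235 informative sites out of 260 aligned positions, provided the same definition was used.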
Phylogenetic trees were derived by neighbor-joining (NJ), maximum parsimony (MP) and maximum likelihood (ML) algorithms (Materials and Methods). The phylogenetic model derived from a neighbor-joining algorithm and rooted with non-LTR RTs is presented in Figure 3 along with bootstrap values for NJ and MP analyses. ML analysis was consistent with NJ and MP analysis but gave lower statistical support. The topology of the tree is divided into two major clusters which we term the mitochondrial and chloroplast-like lineages. Members of the mitochondrial lineage include all known fungal, liverwort and plant mitochondrial group II intron ORFs, as well as several ORFs of bacteria and brown algal mitochondria. The base of the mitochondrial lineage is defined by a node with 96% bootstrap support (NJ). Bootstrap support within the mitochondrial lineage does not support a specific branching order among the subgroupings; however, a cluster of four liverwort intron ORFs (atpAI1, atpAI2, cob1I3 and cox1I1 ORFs) appears to have given rise to the nad1I4 (matR) family of ORFs in higher plants.

Table 1. Mitochondrial group II intron-encoded ORFs and related ORFs
a Species abbreviations used in this manuscript are shown in parentheses.
b The presence of conserved domains: RT, reverse transcriptase domain; X, domain X (putative splicing function); Zn, Zn domain (nuclease activity). In cases where the complete RT domain is not present, subdomains are indicated in parentheses (0-7).
c The presence of a catalytic YxDD motif in subdomain 5 of the RT. Deviations from YxDD are indicated, with FxDD considered to be functional since it is common in other RTs (32).
d Size (amino acids) of the ORF. For ORFs in-frame with the upstream exon, the length was calculated based on the first amino acid fully coded within the intron. However, these sizes are inaccurate since the encoded proteins are processed at their N-termini.
e Fusion: ORF is translated in-frame with the upstream exon. Free: start codon for ORF is contained within the intron.
f Accession numbers are for DNA database entries except where noted.
g No data. The complete flanking sequence was not reported.
h Published size is 681 amino acids; a single frameshift in domain X replaces 34 C-terminal amino acids with 124 amino acids similar to other domain X sequences (8).
i Published size is 887 amino acids; a single termination readthrough adds 27 amino acids with similarity to other Zn domains.
j Published size is 742 amino acids; a single termination readthrough adds 94 amino acids with similarity to the Zn domain (9).
k Published size is 501 amino acids; two frameshifts and a termination readthrough add 72 amino acids with similarity to the Zn domain (9).

The chloroplast-like lineage is defined by a node of 75% bootstrap support (NJ), and is divided into the algal group and the euglenoid group. The algal group is a highly heterogeneous collection of ORFs from algal chloroplasts, algal mitochondria and bacteria. The euglenoid group consists mostly of related psbCI4 ORFs which are found in group III introns, a degenerate form of group II introns (33). Members of the euglenoid group were omitted from the NJ phylogenetic calculation because of their extreme divergence (∼25% identity to algal ORFs; Fig. 2), but we include them in Figure 3 with dotted lines because MP analysis consistently placed these ORFs in the chloroplast-like lineage with substantial bootstrap support (>70%), and also because BLAST searches suggested that their closest relatives are algal ORFs (6/6 of the best matches, data not shown).
Intron ORFs that do not belong to the mitochondrial or chloroplast-like lineages are bacterial, and comprise four well-defined clades, each with 100% bootstrap support. The four bacterial groups are positioned at the base of the two main lineages, but bootstrap support for their positions is quite weak due to low sequence conservation among group II intron ORFs, and also because of low conservation between group II intron ORFs and the outgroup RTs (average of 21% identity). Rooting the tree with retron RTs (NJ) or non-LTR RTs (MP) resulted in similar trees, but with <50% support for all basal nodes. These alternate trees predicted the earliest branching ORFs to be bacterial group C (NJ, retron RT outgroup) and bacterial group D (MP, non-LTR RT outgroup). Thus, it is not possible with this data set to accurately predict the position of the root or the basal branching order.
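Bootstrap support of the kind quoted above is the percentage of replicate trees in which a given grouping appears. The sketch below illustrates that bookkeeping under a simplifying assumption: each replicate tree is represented only as a set of clades, and each clade as a frozenset of taxon names. It is not the CONSENSE algorithm, and the function names and the 50% default cutoff argument are illustrative.

```python
from collections import Counter


def clade_support(replicate_trees):
    """Percentage of replicate trees containing each clade.

    `replicate_trees` is a list in which every tree is given as a collection
    of clades, and every clade is a frozenset of taxon names (an assumed
    representation chosen for this example).
    """
    if not replicate_trees:
        raise ValueError("need at least one replicate tree")
    counts = Counter()
    for clades in replicate_trees:
        counts.update(set(clades))  # count each clade once per tree
    n_trees = len(replicate_trees)
    return {clade: 100.0 * n / n_trees for clade, n in counts.items()}


def retained_clades(support, cutoff=50.0):
    """Keep clades at or above the cutoff; the rest would be collapsed."""
    return {clade: pct for clade, pct in support.items() if pct >= cutoff}
```

A grouping found in three of four toy replicates would receive 75% support, while one found in a single replicate would fall below the cutoff and be collapsed.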

Variation in spacer elements supports the phylogenetic groupings
Spacer elements were omitted from the phylogenetic analyses because they could not be aligned for all sequences; however, they provide additional support for some groupings of the phylogenetic tree. Figure 3 tabulates the lengths of the spacers for all ORFs in the phylogenetic tree. The domain 4/5 spacer is 1-38 amino acids in bacterial and chloroplast-like groups, 42-101 amino acids in the fungal group, and 176-179 amino acids in the matR family. This pattern supports the mitochondrial clade and also suggests, since the outgroups have short spacers (6-25 amino acids), that the most primitive group II intron ORFs also had a short spacer between domains 4 and 5, which was expanded when the ORF migrated to mitochondria and was expanded even further in the transfer to plant mitochondria. A similar scenario is seen for the 7/X spacer. The 7/X spacer is 0-7 amino acids in the bacterial groups A, B, C and D, 0-19 amino acids in the chloroplast-like group, and 19-35 amino acids in the fungal mitochondrial group. Notably, the 7/X spacer is 150-235 amino acids in both the liverwort subcluster and the matR family, suggesting that an expansion of the 7/X spacer occurred in liverwort before the ORF was passed on to higher plants.
Because of the unusually large size of the insertions (235 amino acids = 26 kDa), we examined the position of the spacers in the crystal structure of HIV-RT (6). Insertions in the 4/5 spacer would be predicted to produce an extension of the finger domain and would probably not interfere with the active site in domain 5. Similarly, the insertions in the 7/X spacer would be located in a tether region near the thumb domain, and could reasonably extend away from the active site of the RT.
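The size conversion quoted above (235 amino acids = 26 kDa) agrees with the common rule of thumb of roughly 110 Da per amino acid residue. That average mass is an assumption here, since the text does not say how the figure was obtained. A minimal sketch of the arithmetic:

```python
# Assumption: approximate average mass of one amino acid residue in a protein.
AVERAGE_RESIDUE_MASS_DA = 110.0


def approx_mass_kda(n_residues: int, residue_mass_da: float = AVERAGE_RESIDUE_MASS_DA) -> float:
    """Rough protein mass in kDa from residue count and an average residue mass."""
    if n_residues < 0:
        raise ValueError("residue count cannot be negative")
    return n_residues * residue_mass_da / 1000.0
```

Under this assumption, `approx_mass_kda(235)` gives about 25.85 kDa, which rounds to the 26 kDa quoted in the text.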

Evolution of domain structures and activities of the ORFs
Figure 3 tabulates additional intron properties, including the presence of ORF structural domains, a YADD motif at the polymerase active site, and a group II intron RNA structure. It appears that the Zn domain was lost many times during evolution, since ORFs without the Zn domain are scattered throughout the tree. Lack of the Zn domain is expected to reduce the efficiency of mobility, but not necessarily block mobility in all organisms. For the yeast intron S.c.cox1I2, deletion of the Zn domain reduces mobility to essentially undetectable levels (34), although the bacterial P.a. and S.me. introns are mobile in vivo despite lacking a Zn domain (18,35). We note that the Zn domain is missing in most of the putatively early branching bacterial ORFs, suggesting that bacterial introns in general do not require a Zn domain. Some bacterial ORFs without Zn domains may reflect ancestral ORFs that predate acquisition of the Zn domain.
The YADD motif was also lost several times in evolution. The YADD motif is absent in the euglenoid lineage and in four mitochondrial ORFs. Loss of the YADD motif in yeast introns S.c.cox1I1 and S.c.cox1I2 eliminates RT activity, but only reduces in vivo mobility to 40% of wild-type levels, which has been explained by an alternative mobility mechanism based on double-strand break repair (36). In contrast to organellar intron ORFs, all bacterial ORFs contain the YADD motif, which suggests that intron survival in bacteria requires RT activity.

Table 3 (footnotes)
The presence of an intron structure surrounding the ORF was evaluated by folding the sequence into a consensus group II intron structure (N.Toor and S.Zimmerly, unpublished).
f The intron does not appear to be inserted into an ORF although a very small ORF cannot be ruled out.
g No host gene is annotated in the GenBank entry, but the intron probably interrupts neighboring ORFs.
h Sequence has not been reported for the 5′ end of the intron, including the upstream exon (IS629-like ORF) and 680 bp of the intron. Otherwise, this sequence is virtually identical to the S.f. intron.
i The intron is located between ORFs 7070 and 7073; a 5′ extension of the 7073 ORF could include the intron.

Evolution of the Zn domain
To address the spotty distribution of the Zn domain in Figure 3, we phylogenetically analyzed the Zn domain alone. The data set included 77 amino acid positions with 59 informative sites, and ambiguously aligned residues were not excluded. Outgroups used to root the tree were the E.coli colicin E7 and Pseudomonas aeruginosa pyocin S1, members of the larger nuclease family to which the Zn domain belongs (9). The phylogenetic tree derived from NJ analysis is shown in Figure 4A. Bootstrap support for the branching order is poor due to low sequence conservation and the short sequence analyzed; however, the Zn domain of the mitochondrial lineage is separated from other Zn domains by a node with 79% bootstrap support. Figure 4B shows a sequence alignment of Zn domains for selected ORFs along with 50% consensus sequences for mitochondrial and chloroplast-like lineages. The Zn domains of bacterial group B and chloroplast-like groups are seen to be similar to the nuclease motifs of colicin and pyocins in their lengths, and slightly in their sequences (pink shading). In contrast, the Zn domain of the mitochondrial lineage is significantly expanded, and has additional conserved positions near its C-terminus (NRKQIPLC). These data are consistent with the possibility of acquisition of the Zn domain in bacteria from a bacterial family of nucleases, and subsequent expansion of the domain in mitochondria. Taking into account the low resolution in the phylogenetic analysis, there is little indication for 'swapping' of Zn domains among ORFs, and the spotty distribution of the Zn domain may be due to domain loss alone.

Figure 2. Alignment of RT amino acid sequences. One example sequence is shown for each of the major phylogenetic groupings (mitochondrial, algal, bacterial and euglenoid; see Fig. 3). Below each example sequence is the consensus sequence for that group, defined as the positions conserved in at least 75% of the members. The overall consensus is presented above the alignment, and is defined as residues common to the mitochondrial, algal and bacterial consensus sequences. Domains 0-7, X and Zn are marked as described in the text. The locations of spacers between RT domains 4 and 5 and between domains 7 and X are indicated (see text and Fig. 3). Residues marked in bold are consensus sequence positions specific to one group. In cases of closely related sequences (e.g. plant nad1I4 introns), only one example was included in the consensus sequence calculation. Sequences used in the calculation were: mitochondrial group (Z.m.

Horizontal versus vertical inheritance
The phylogenetic model in Figure 3 predicts both horizontal and vertical inheritance of group II intron ORFs. Vertical inheritance is suggested for the euglenoid psbCI4 and plant nad1I4 families of ORFs, since the introns of each family are confined to the same DNA location. Furthermore, the psbCI4 introns are unlikely to be mobile because the ORFs lack mobility-related motifs (Fig. 2). Mobility competence of nad1I4 ORFs is less clear since the ORFs contain a YADD motif. Still, vertical inheritance seems likely because of large insertions within the ORF, and because of the apparent agreement between ORF phylogeny and species phylogeny, with monocots and legumes each forming subgroupings.
Other than these two intron families, there is little evidence for strict vertical inheritance. The only other ORFs located in identical genomic sites in different species are: K.l.cox1I1 and S.c.cox1I2; A.m.cox1I3 and P.a.cox1I4; and S.f. and E.c.D. As described in the introduction, K.l.cox1I1 and S.c.cox1I2 probably represent a horizontal transfer event. In the case of A.m.cox1I3 and P.a.cox1I4 ORFs, the ORF amino acid sequences are 54% identical while cox1 amino acid sequences are 68% identical. Although this would be consistent with vertical inheritance, vertical inheritance is not certain since the introns have been reported only in these two distantly related fungi (Fig. 5), and not in other sequenced fungal genomes such as Schizosaccharomyces pombe, S.cerevisiae or Pichia canadensis. The introns S.f. and E.c.D probably represent a horizontal transfer, since the introns are 99.6% identical in total DNA sequence (intron and ORF) while their IS629-like exon DNA sequences are 91% identical and, furthermore, E.c.D is present in only a minority of E.coli strains (15).
The predicted ORF phylogeny in Figure 3 suggests multiple horizontal transfers between fungi and liverwort, between fungi and brown algae, among fungi, and among bacteria. Within the fungal and bacterial groups of ORFs, there is little correspondence between ORF phylogeny and species phylogeny (compare Figs 3 and 5). For example, of the three S.pombe ORFs, two are found only in subsets of S.pombe strains (37,38); the S.p.cox2I1 ORF is most closely related to a liverwort ORF, while S.p.cox1I1 and S.p.cob1I1 ORFs are more related to other fungal ORFs than to the other S.pombe ORFs. Of the nine M.p. intron ORFs, four are related and possibly diverged within liverwort, while at least two (M.p.cox1I2 and M.p.SSUI1) are more related to fungal and brown algal ORFs than the other M.p. ORFs, suggesting that they were the result of horizontal transfers. The bacterial intron ORFs B.a.-07 and B.a.-23 are phylogenetically distant but are located 10 kb apart on the same plasmid. The three ORFs in E.coli are also phylogenetically distant. All of these examples are most easily explained by a high frequency of horizontal transfers, although vertical inheritance cannot be ruled out in all cases. Horizontal transfers may be the rule for the timeframe represented in the phylogenetic tree, while long term vertical inheritance might occur only for ORFs that have lost mobility functions but retained splicing function.
Horizontal transfers appear to be relatively infrequent between fungi and bacteria. The only clear example is the set of three bacterial ORFs found in the mitochondrial lineage. Because of low resolution for branching order within the mitochondrial lineage, it is not possible to predict whether these ORFs were the earliest branching in the lineage, or reflect a horizontal transfer from mitochondria to bacteria. In either case, a horizontal transfer is implicated. The exception for long distance horizontal transfers is in the chloroplast-like lineage, which contains representatives from Gram-positive bacteria, Gram-negative bacteria, cyanobacteria, chloroplasts of green algae and euglenoids, and mitochondria of red and brown algae. These ORFs appear to have transferred horizontally at rates exceeding other phylogenetic classes. The extent of horizontal transfers may reflect unique properties of the lineage such as independence from factors of the host organism.

DISCUSSION
In this paper we present a compilation and phylogenetic analysis of group II intron ORFs, and suggest a model for the evolutionary history of mobile group II introns. Our data are consistent with previously published phylogenetic trees, the most detailed of which were reported in references 15 (Ferat et al.; NJ, 26 ORFs), 24 (NJ, 19 ORFs) and 39 (NJ, 14 ORFs). The major differences are that our analysis is expanded to include many more sequences, particularly bacterial sequences, our tree is rooted, and we include a detailed analysis of differences in ORF structure among the phylogenetic groupings.

Group II intron ORFs in bacteria
We have uncovered numerous bacterial ORFs and ORF fragments which had not been specifically reported as group II introns in the literature. The expanded data set confirms the earlier observation that bacterial intron ORFs are mainly found in mobile DNAs (15). In 18/20 ORFs and 5/6 ORF fragments where the locus of the intron is known, the ORF is found in a plasmid or mobile DNA. The location of bacterial introns in mobile DNAs is distinct from introns in mitochondria and chloroplasts, where the introns typically lie in housekeeping genes. Bacterial introns are also distinct because they are sometimes inserted outside of genes (20; Table 3), and the ORFs are frequently truncated, suggesting a higher degree of intron insertions at new locations followed by intron loss.

Are bacterial group II intron ORFs the oldest?
Our study is consistent with the theory that mobile group II introns originated in bacteria, but does not contribute substantial phylogenetic evidence toward it. The earliest branching ORFs in our analyses are bacterial with both NJ and MP algorithms, and with either non-LTR or retron RTs as outgroups, but the bootstrap values in all cases leave the phylogenetic support weak at best. We anticipate that, as more bacterial group II introns are reported, the bacterial cluster at the base of the tree will enlarge and perhaps the branching order will become more defined. Apart from the phylogenetic data, the observation that mitochondrial ORFs are expanded in three locations compared to the outgroup RTs suggests that ORFs of the mitochondrial lineage are not the earliest branching.
Were group II introns introduced to eukaryotes through the original organellar endosymbiont? Our evidence is consistent with this theory, but the inferred degree of horizontal transfer between bacteria and organelles is great enough that it would not have been necessary for the endosymbiont to have introduced a group II intron.

A model for the evolutionary history of group II intron ORFs
Our model for the evolutionary history of group II intron ORFs is shown in Figure 6. The oldest ORFs were probably bacterial and were compact in structure with only an RT and X domain. The Zn domain was acquired subsequently, which may have enhanced mobility. Introduction of mobile group II introns into chloroplasts and mitochondria may or may not have been mediated by the organellar endosymbionts. The ORF of the chloroplast-like lineage was essentially the same structurally as its bacterial ancestor except that the X domain became somewhat more conserved. In euglenoid chloroplasts, the psbCI4 ORFs diverged drastically and lost mobility but probably not splicing function. Intron ORFs in algal chloroplasts were transmitted with high frequency across species and organellar boundaries, possibly because of inherent mobility properties such as independence from host cofactors. It is also possible that the chloroplast-like lineage originated outside chloroplasts and spread horizontally; the data are consistent with horizontal transfers from any source. In contrast to the chloroplast-like lineage, the mitochondrial lineage of ORFs changed significantly in structure from bacterial ORFs. These changes include the creation or expansion of the 4/5 and 7/X spacers, an increase in domain X conservation, and an increase in size and complexity of the Zn domain. Horizontal transfers among fungi, liverwort and brown algae were rampant. Possibly some intron ORFs transferred back to bacteria, although it is also possible that the L.l., S.a.1 and S.a.2 ORFs represent the earliest branches of the mitochondrial lineage. For a subgroup of liverwort ORFs the 7/X spacer expanded further, and one of these ORFs inserted into nad1 of plants with concomitant expansion of the 4/5 spacer and loss of RT subdomains 0, 1 and the Zn domain, which together resulted in vertical inheritance.

Evolution of mobility activities?
Does the described progression in ORF structure correspond to development of mobility activities? We note that all mobile group II introns studied in any detail belong to the mitochondrial lineage, which differs in ORF structure from bacterial and chloroplast-like groups. Therefore, the 'classical' mobility properties associated with group II introns may not apply in all respects to the chloroplast-like and putatively early branching bacterial ORFs. There are several properties of mitochondrial group II intron ORFs which might not be as extensively developed for other families of group II intron ORFs. First, maturase activity may be less developed since domain X is seemingly poorly conserved in bacterial groups A, B, C and D, and somewhat less conserved in the chloroplast-like group. The ltrA protein of L.l. (mitochondrial lineage) binds to its intron very tightly and specifically with a Kd of ∼0.25 pM (12,40). It is plausible that more primitive ORFs may lack such specialized binding properties. An alternative explanation is that the bacterial X domains might have evolved to interact with different intron RNA structures specific to their clade. In fact, the X domains are somewhat more conserved within some of the bacterial clades. Within bacterial group C, there are 21 absolutely conserved residues in domain X versus 106 in the RT domain, which can be compared to 21 versus 90 for mitochondrial ORFs (based on the 75% consensus). On the other hand, ORFs of bacterial group B have a ratio of only 8 versus 131 conserved residues, suggesting that the X domain is poorly conserved within at least one of the bacterial clades. A second potential difference in activities among group II introns is site-specificity, which is very high for mitochondrial introns due to a long recognition sequence [31 bp for S.c.cox1I2, (41); 35 bp for L.l. (42,43)]. In bacteria, less controlled mobility is suggested by the numerous fragmented ORFs and the occasional insertion of introns outside of genes. Finally, the efficiency of mobility is very high in the mitochondrial lineage [∼90% for S.c.cox1I1, S.c.cox1I2 (36); 10-100% for L.l. (43,44)], but this may not be true of all bacterial introns. In the most extreme scenario, primitive group II introns may have less developed mobility functions across the board, including less efficient maturase activity, less site-specificity in insertion and lower mobility frequency. The only experimental evidence addressing this issue comes from the S.me. intron, whose mobility properties are so far mostly consistent with introns of the mitochondrial lineage (35,45). Nevertheless, given these speculations, it is clear that group II introns in all lineages need to be investigated. At this point it is impossible to know to what extent characterized introns of the mitochondrial lineage represent all mobile group II introns.
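The comparison of conserved residues in domain X and the RT domain made earlier in this section can be expressed as a simple ratio per group. The following is a minimal sketch using only the counts quoted in the text; the dictionary layout and the function name are illustrative assumptions, and the ratio is not normalized for domain length.

```python
# Conserved residue counts quoted in the text:
# (conserved in domain X, conserved in the RT domain).
conserved = {
    "bacterial group C": (21, 106),
    "mitochondrial": (21, 90),
    "bacterial group B": (8, 131),
}


def x_to_rt_ratio(x_conserved: int, rt_conserved: int) -> float:
    """Conserved residues in domain X per conserved residue in the RT domain."""
    return x_conserved / rt_conserved


ratios = {group: x_to_rt_ratio(x, rt) for group, (x, rt) in conserved.items()}

# List groups from the highest to the lowest relative conservation of domain X.
for group, ratio in sorted(ratios.items(), key=lambda item: item[1], reverse=True):
    print(f"{group}: {ratio:.3f}")
```

On these numbers, bacterial group C sits close to the mitochondrial ORFs, while bacterial group B is roughly three- to four-fold lower, which is the contrast drawn in the text.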

Figure 1. Structure of a typical mobile group II intron based on S.c.cox1I2 of yeast mitochondria. (A) DNA structure showing the upstream and downstream exons (E2, E3), intron domains I-VI, and the conserved domains of the ORF [RT(0-7), X, Zn] (drawn to scale). (B) An unspliced intron transcript of S.c.cox1I2 showing a simplified secondary structure of the six conserved intron structural domains, with the ORF looped out of domain IV (not drawn to scale).

j No data. Complete flanking sequence was not reported.
k Intron domains 5 and 6 are clearly not present; the intron structure may be degenerated.

Figure 3. Phylogenetic model of group II intron ORF relationships. The phylogenetic estimate was based on RT subdomains 0-7 and domain X, and was calculated by a neighbor-joining algorithm (PHYLIP; see Materials and Methods). The tree was rooted with four RTs of non-LTR retroelements: Caenorhabditis elegans RTE1 (accession number AF025462), Drosophila melanogaster RD2 (X51967), Homo sapiens L1 (U93574) and D.melanogaster jockey (M22874) (8). Bootstrap values are expressed as percentages, and were derived from 1000 (NJ) or 100 (MP) samplings, with MP values shown in italics and parentheses. Nodes with <50% support are collapsed. The predicted approximate location of euglenoid ORFs is shown with dotted lines (see text). Juxtaposed with the inferred phylogenetic relationships are properties of the introns, including protein domains present (subdomains 0-7 of the RT domain, domain X, Zn domain), the size of spacer segments between conserved motifs (see Fig. 2 for spacer definitions), idiosyncratic insertions, the presence of the YADD motif or a functional substitute (see Table 3 footnote), and the presence of a group II intron structure (see Table 3). Euglenoid ORFs are found in group III introns; P.s.cpn60I1 is reported to be a twintron (46), but the published RNA structure is probably incorrect. Abbreviations and color codings are: M (mitochondria; blue), C (chloroplast; green), B (bacteria; yellow), P (higher plant; green outline), L (liverwort; green outline), F (fungus; no outline), BA (brown alga; brown outline), GA (green alga; no outline), RA (red alga; pink outline), E (euglenoid; no outline), Cr (cryptomonad; no outline).

Figure 4. Phylogenetic analysis of the Zn domain. (A) A phylogenetic tree of the Zn domain was derived by neighbor-joining analysis with 1000 bootstrap samplings, or by maximum parsimony with 100 bootstrap samplings (italics and parentheses). The tree was rooted with nuclease domains of E.coli colicin E7 (ECCE7; accession number 144375) and P.aeruginosa pyocin S1 (PEPS1; accession number Q06583). (B) Alignment of the Zn domains of selected group II intron ORFs and the nuclease domains of ECCE7 and PEPS1. A 50% consensus sequence is shown for chloroplast-like and mitochondrial lineages. x represents residues with <50% conservation. Positions marked in bold show group-specific consensus sequences (mitochondrial, chloroplast-like and colicin/pyocin), or show agreement with the consensus sequence of another group (C.d. and B.m.). Pink shading indicates similarities between colicin/pyocin nuclease domain and bacterial or chloroplast-like Zn domains. The consensus sequence for the colicin/pyocin family of nucleases is a 100% consensus sequence according to Gorbalenya (9).

Figure 6. Model for the history of group II intron ORFs. Group II intron ORFs probably originated in bacteria where the introns transferred horizontally with high frequency. The ORFs were introduced to chloroplasts and mitochondria perhaps via the ancestral endosymbionts. In euglenoid chloroplasts, the introns lost mobility functions and were transferred vertically, while in the algal lineage the ORFs spread horizontally at a high rate. Introduction of group II intron ORFs to mitochondria resulted in an expansion of the subdomain 4/5 spacer, 7/X spacer and Zn domain, while domain X became highly conserved. The intron ORFs spread horizontally among mitochondria of fungi, liverwort and brown alga, and possibly transferred back to bacteria. In a subset of liverwort introns, the 4/5 and 7/X spacers were expanded further with concomitant loss of subdomains 0, 1 and the Zn domain, giving rise to the matR family of introns in higher plants. Dotted arrows indicate uncertainty for the source or direction of horizontal transfer events. Colored shading of domains indicates development of group-specific motifs or sequence conservation.

Table 2. Chloroplast group II intron-encoded ORFs and related ORFs a, b
a See notes for Table 1 for a general description of column entries.
b The sequence Z99832 (Cryptoglena pigra) is not shown in the table. Its 102 amino acids are homologous to the C-terminus of other euglenoid psbCI4 intron-encoded proteins but the DNA sequence encoding the N-terminus was not reported.
c Accession number is for the protein database entry. The DNA sequence was reported by Kück (49).
d The published size is 274 amino acids; three frameshifts add 204 amino acids with similarity to group II intron-encoded proteins (8).
e Translation of the DNA database entry was reported by Doetsch et al. (50).

Table 3. Bacterial group II intron-encoded ORFs and related ORFs a, b
a For a general description of column entries see notes to Table 1.
b Sequences omitted from the table were as follows. Highly divergent sequences which may be degenerate remnants were: D90902 [Synechocystis sp.,
c The ORF name listed in the publication or database entry.
d The locus of the ORF, if known.
e