Rickettsiella grylli is an intracellular bacterial pathogen of aquatic and terrestrial arthropods. Previous determination of its 16S rRNA-encoding sequence has led to the taxonomic classification of the genus Rickettsiella in the class Gammaproteobacteria, order Legionellales, family Coxiellaceae, i.e. in close vicinity to vertebrate pathogenic bacteria of the genera Coxiella and Legionella. Here we use the additional information available from the recently published first whole genome sequence from this genus to evaluate critically the taxonomic classification of R. grylli beyond the 16S rRNA gene level. Using phylogenetic reconstruction, together with significance testing on a data basis defined by a core set of 211 previously identified families of protein-encoding genes, together with a reanalysis of 16S rRNA gene data, the present study firmly corroborates the assignment of this species to both the class Gammaproteobacteria and the order Legionellales. However, the results obtained from concatenated and single protein, single protein-encoding gene, and 16S rRNA gene data demonstrate a similar phylogenetic distance of R. grylli to both the Coxiellaceae and the Legionellaceae and are, therefore, inconsistent with its current family-level classification. Consequently, a respective reorganization of the order Legionellales is proposed.
The genus Rickettsiella (Philip) comprises intracellular bacterial pathogens of a wide range of arthropods that typically multiply in vacuolar structures within fat body cells and are frequently associated with protein crystals. The taxonomy of these bacteria is primarily based on the indication of a strain's original host. Moreover, the resulting pathotype designation is superposed by the morpho- and serologically founded distinction of three currently recognized species named according to the pathotype of the respective type strain, i.e. the nomenclatural type species Rickettsiella popilliae (Dutky & Gooden) as well as Rickettsiella grylli (Vago & Martoja) and Rickettsiella chironomi (Weiser). Several pathotypes have been placed in synonymy with each of these species, while others await conclusive species assignment (Garrity et al., 2005a).
Owing to their early perception as ‘rickettsiae of insects’ (Wille & Martignoni, 1952), bacteria of the genus Rickettsiella had originally been assigned to the order Rickettsiales (Weiss et al., 1984) in contrast to an alternative classification in the order Chlamydiales that had been considered as well (Federici, 1980). However, determination of a 16S rRNA-encoding sequence from a R. grylli strain revealed the highest homology to the orthologous gene from Coxiella burnetii (Roux et al., 1997), i.e. a bacterial species that, due to earlier 16S rRNA gene analyses, had been transferred from the alphaproteobacterial order Rickettsiales to the order Legionellales of the distantly related class Gammaproteobacteria (Weisburg et al., 1989). Interestingly, the first identified Rickettsiella pathotype, R. popilliae, had originally been described as ‘Coxiella popilliae’ (Dutky & Gooden, 1952). As a consequence of these findings, the entire genus Rickettsiella has been removed recently from the Rickettsiales and has instead been assigned to the family Coxiellaceae within the Legionellales (Garrity et al., 2005a).
At the order level, this taxonomic reorganization receives support from the determination of 16S rRNA-encoding sequences from further Rickettsiella pathotypes, e.g. from ticks (Kurtti et al., 2002), collembola (Czarnetzki & Tebbe, 2004), scarabaeids (Leclerque & Kleespies, 2008a), dipteran insects (Leclerque & Kleespies, 2008b), and aquatic isopods. Based on an analysis of the latter sequences, monophyly of the genus Rickettsiella has been claimed recently (Cordaux et al., 2007). However, two further arthropod-associated bacteria originally described as likely Rickettsiella pathotypes (Drobne et al., 1999; Radek, 2000) were removed from this taxon and reorganized instead in the candidate genus ‘Rhabdochlamydia’ of the order Chlamydiales after the respective 16S rRNA-encoding sequences had been determined (Kostanjsek et al., 2004; Corsaro et al., 2007). Moreover, the genome of a further strain assigned to the species R. grylli has meanwhile been sequenced. Expectedly, this organism carries two identical gene copies encoding 16S rRNA with the highest homology to orthologs from Rickettsiella-like bacteria assigned to the order Legionellales (Leclerque & Kleespies, 2008a).
Comparison of orthologous gene sequences has been widely used to infer phylogenetic relationships among bacteria, and for good reason 16S rRNA-encoding sequences (rrs genes) have become the standard molecular chronometer in phylogenetics (Woese, 1987). However, lateral gene transfer (LGT) has been recognized as both a major force in the evolution of prokaryotes (Koonin et al., 2001; Brown, 2003) and as a major problem of the reconstruction of bacterial phylogenies from gene trees (Gogarten et al., 2002; Bapteste et al., 2004). It has been argued that LGT has been much less extensive for informational genes encoding components of the complex functional networks driving cellular processes as replication, transcription, and translation than for operational genes encoding, for instance, metabolic enzymes (Rivera et al., 1998; Jain et al., 1999), and that coherent phylogenetic patterns can be extracted from the analysis of core sets of bacterial genes (Daubin et al., 2003; Philippe & Douady, 2003), preferably single-copy or panorthologs (Lerat et al., 2003; Blair et al., 2005). Even if 16S rRNA-encoding sequences might be correctly supposed to be among the genes most refractory to horizontal gene transfer, LGT of rrs genes seems to have occurred in a number of cases (Ueda et al., 1999; Yap et al., 1999). The phylogenetic analysis of 16S rRNA gene sequence data can be further complicated by the frequent presence of multiple rrs gene copies in bacterial genomes that generally, but with exceptions, display a very low degree of intragenomic heterogeneity (Coenye & Vandamme, 2003; Acinas et al., 2004). This effect of interoperonic divergence may be particularly disturbing if whole genome sequences are unavailable and PCR-amplified sequences are analyzed (Cilia et al., 1996), and it might, therefore, appear risky to draw far-reaching conclusions about bacterial phylogeny on the basis of 16S rRNA gene comparisons alone (Clayton et al., 1995; Olivier et al., 2005). Furthermore, the phylogenetic analysis of rrs sequences of organisms insusceptible to multiplication in pure culture, as based on e.g. PCR reactions using universal primers on infected host tissue, bears the additional risk of misidentification by amplification of the corresponding gene from a different specimen associated with the same host.
Recently, a set of 36 universally distributed families of panorthologous genes has been identified and proposed as a basis for the reconstruction of a universal tree of life (Ciccarelli et al., 2006). Furthermore, a wider core set of 205 families of single-copy orthologous genes (SCOGs) present in 13 completely sequenced gammaproteobacterial genomes has been used to reconstruct a highly robust phylogeny for this group of organisms (Lerat et al., 2003). Here we use the R. grylli genome sequence together with these two core sets of genes as a working basis to investigate the phylogeny of this bacterium in a multi-gene approach. Combining the reconstruction of organismal phylogenies with the assessment of the discriminative power of underlying sequence alignments by means of specific significance tests, this study critically evaluates, beyond the 16S rRNA gene level, the taxonomic assignment of R. grylli to the Gamma-class of the phylum Proteobacteria, order Legionellales, family Coxiellaceae.
Materials and methods
In addition to the R. grylli genome sequence (GenBank accession number NZ_AAQJ00000000), the following completely annotated bacterial genomes were used in this study: Aeromonas hydrophila ssp. hydrophila ATCC 7966 (NC_008570), Alkalilimnicola ehrlichei MLHE-1 (NC_008340), Candidatus Protochlamydia amoebophila UWE25 (NC_005861), Chlamydia trachomatis A/HAR-13 (NC_007429), Coxiella burnetii RSA 493 (NC_002971), Coxiella burnetii Dugway 7E9-12 (NC_009727), Dichelobacter nodosus VCS1703A (NC_009446), Escherichia coli K12 (NC_000913), Francisella tularensis ssp. tularensis SCHU S4 (NC_006570), Haemophilus influenzae Rd KW20 (NC_000907), Hahella chejuensis KCTC 2396 (NC_007645), Legionella pneumophila ssp. pneumophila Philadelphia 1 (NC_002942), Legionella pneumophila ssp. fraseri Lens (NC_006369), Methylococcus capsulatus Bath (NC_002977), Pseudomonas aeruginosa PAO1 (NC_002516), Rickettsia conorii Malish 7 (NC_003103), Shewanella putrefaciens CN-32 (NC_009438), Vibrio cholerae O395 (NC_009456/7), Wolbachia endosymbiont of Drosophila melanogaster (NC_002978), and Xanthomonas campestris pv. campestris ATCC 33913 (NC_003902).
In order to identify gene families consisting of SCOGs across diverse subsets produced from this genome space, the criterion of ‘best reciprocal hits’ was implemented at the level of deduced amino acid sequences by a series of blastp searches (Altschul et al., 1997) with the presumed ortholog from each of the genomes in question in turn being used as a query sequence. A gene family was selected for downstream analyses if each blastp run identified the same set of best hits containing exactly one sequence from each genome. As annotation of the R. grylli genome sequence is not yet complete, orthologs identified from R. grylli were used as query sequences in an additional tblastn search on the genome draft to make sure that no unannotated paralogs had escaped our notice.
Nucleotide and amino acid sequence alignments were produced with the clustalw function (Thompson et al., 1994) of the mega 4 program (Tamura et al., 2007) using an IUB DNA and a Gonnet protein weight matrix, respectively, with alignments of protein-encoding DNA sequences being filtered for potentially hypervariable sites allowing for synonymous substitution. The tree-puzzle 5.2 program (Schmidt et al., 2002) was used to estimate data set-specific parameters as amino acid and nucleotide frequencies, percentages of invariable sites, transition/transversion ratios, and α parameters for the Γ-distribution-based correction of rate heterogeneity among sites. Based on these parameters, the Hasegawa–Kishino–Yano (HKY) model (Hasegawa et al., 1985) was chosen along the rationale outlined by Posada & Crandall (1998) as the most appropriate common model of sequence evolution for phylogenetic analyses of our nucleotide data (including 16S rRNA gene sequence alignments). Amino acid sequence comparisons were performed using the Jones–Taylor–Thornton (JTT) model of substitution (Jones et al., 1992). Under these model assumptions, organismal phylogenies were reconstructed from both DNA and protein sequence alignments by three alternative approaches: (1) with the maximum likelihood (ML) method as implemented in the phyml software tool (Guindon & Gascuel, 2003) and in mega 4, with both (2) the minimum evolution (ME) method using the close-neighbor-interchange (CNI) algorithm, and (3) the neighbor joining (NJ) method. In all cases, pair-wise deletion for missing data and a Γ-distribution-based model of rate heterogeneity (Yang, 1993) allowing for eight different rate categories were applied. Tree topology confidence limits were explored in nonparametric bootstrap analyses over 1000 pseudo-replicates using the respective function of the software tools employed for phylogenetic reconstruction. Individual phylogenetic trees were condensed into extended majority rule consensus tree topologies by means of the consense module of the phylip 3.6 software package (Felsenstein, 2004). Candidate topologies for significance testing were generated manually as text files using Newick format.
The presumably most widely used method for likelihood-based tree topology testing in phylogenetics, the two-sided Kishino–Hasegawa or the 2sKH test (Kishino & Hasegawa, 1989), has been shown to be highly inappropriate if sets of tree topologies are permutatively incomplete and in particular if so due to previous tree selection by likelihood-based methods from the sequence data to be used for significance testing (Goldman et al., 2000 and references therein). This problem has been partially solved in two related nonparametric significance tests: the multi-comparison Shimodaira–Hasegawa or the SH test (Shimodaira & Hasegawa, 1999) and its less conservative adaptation to pair-wise testing, the one-sided Kishino–Hasegawa or the 1sKH test (Goldman et al., 2000). From a given set of candidate phylogenetic tree topologies, the three tests firstly determine which topology fits best the phylogenetic information present in an underlying DNA or protein sequence alignment, i.e. select a (relative) ML tree, and secondly positively reject those topologies that are significantly worse interpretations of the sequence data than this best tree with respect to an exogenously established significance threshold. It is, consequently, an important feature of these tests that unrejected candidate topologies are by no means positively accepted. Unless otherwise indicated, significance tests were performed at the 5% significance level as implemented in the tree-puzzle 5.2 software package.
Results and discussion
The universal core set of gene families proposed by Ciccarelli et al. (2006) as a basis for the reconstruction of the tree of life is highly biased (33/36 gene families) towards informational genes encoding ribosomal proteins and aminoacyl-tRNA synthetases. These highly conserved sequences are often found to be only moderately phylogeny informative, and it can be advantageous to rely on a more extensive or a less biased core set in phylogenetic analyses limited to certain branches of the tree of life. In an investigation of 13 gammaproteobacterial genomes, Lerat et al. (2003) have identified a phylogenetic core set of 175 additional families of SCOGs. For the present study, the data sets from both approaches have been combined into a basic core set of 211 gene families (supplementary Table S1) that is largely unbiased with respect to informational vs. operational genes (51.7% and 48.3%, respectively). This basic core set has subsequently been used as a point of departure for the identification of subsets comprising SCOG families across diverse sets of bacterial genomes relevant throughout the different stages of analysis presented in this paper. Even in those parts of the present study that operate at the level of deduced amino acid sequences and hence justify the use of expressions as ‘orthologous protein’ or ‘protein tree’, the groups of sequences making up these subsets of data will be referred to in a ‘gene’-based terminology, e.g. as ‘single-copy orthologous gene’ or ‘SCOG’ families, in reminiscence of this investigation being anchored at the genome level.
Assignment of R. grylli to the taxonomic class Gammaproteobacteria
Based on analyses of 16S rRNA-encoding sequences, the genus Rickettsiella has been classified previously in the order Legionellales (class Gammaproteobacteria) as opposed to alternative assignments to the orders Rickettsiales (Alphaproteobacteria) and Chlamydiales (Chlamydiae). In order to assess whether this classification is corroborated by a genomic approach, single-copy orthologs were identified across the R. grylli genome draft and six further bacterial genomes including two completely annotated sequences spanning the range of diversity present in each of these taxonomic orders, i.e. the genomes of Legionella pneumophila strain Lens and Coxiella burnetii RSA493 (order Legionellales), of Rickettsia conori and a Drosophila endosymbiont of the genus Wolbachia (Rickettsiales), as well as of Chlamydia trachomatis and an Acanthamoeba endosymbiont assigned to the candidate genus Protochlamydia (Chlamydiales). Application at the deduced amino acid sequence level of the criterion of best reciprocal hits to each of the 211 gene families of the basic core set led to the identification of 121 unambiguously delineated gene families, henceforward referred to as ‘SCOG7’ families, made up of exactly one gene copy in each of the seven genomes. Expectedly, the portion of informational genes among the SCOG7 families (70%) is considerably higher than in the basic core set, but the bias is still small compared with the universal core set proposed by Ciccarelli et al. (2006).
Of the 90 non-SCOG7 families, 80 lack an ortholog and 10 comprise additional paralogs in at least one genome (Table S1). As must be expected when initiating the search for orthologs from a basic core set of gene families preselected for gammaproteobacterial genomes, orthologous genes are mostly lacking in the genomes of Rickettsiales (54 gene families) and Chlamydiales (55). In the R. grylli genome draft sequence, there is no identifiable ortholog to 16 gene families mostly encoding gene products functional in tRNA aminoacylation or purine/pyrimidine ribonucleotide biosynthesis. For a single non-SCOG7 family, namely the one encoding glutamyl-tRNA-synthetase enzymes, multiple (two) paralogs were identified for R. grylli. Interestingly, the same is true for Coxiella, Rickettsia, and Wolbachia, organisms known to produce Gln-tRNAGln by transamidation of Glu-tRNAGln, whereas E. coli and a number of other Gammaproteobacteria possess a glutaminyl-tRNA-synthetase enzyme of presumably eukaryotic origin (Brown & Doolittle, 1999).
In order to evaluate the taxonomic classification of R. grylli on the basis of the aligned amino acid sequences as deduced from the identified SCOG7 families, nine unrooted candidate tree topologies representing possible phylogenetic relationships among the seven organisms under study were created (Fig. 1a). As this set of candidate topologies designed to evaluate all possible positions of R. grylli in a fixed backbone structure reflecting the division of the three relevant taxonomic classes is permutatively highly incomplete, the one-sided Kishino–Hasegawa test is the method of choice for significance testing. When the nine topologies are subject to a series of 1sKH tests with respect to each of the 121 protein alignments representing SCOG7 families, 120 of these are best described by one of the topologies (#4–#6) having R. grylli placed in the branch representing the Gammaproteobacteria. Furthermore, the results are considerably biased towards a preferential rejection of those topologies that locate R. grylli in the Alphaproteobacteria or Chlamydiae branch, and the bias becomes still more pronounced after unequivocally phylogeny noninformative gene families have been filtered out. Under the 5% significance criterion applied, one SCOG7 family does not contain sufficient information to discriminate between the alternative assignment of R. grylli to either the Gammaproteobacteria or the Chlamydiae, whereas another five sequence alignments do not allow to differentiate between topologies placing R. grylli in either the Gamma- or the Alphaproteobacteria branch. For a single alignment, the 1sKH test fails to reject any of the candidate topologies. In total, therefore, seven out of 121 SCOG7 families are found to be insufficiently phylogeny-informative to address the problem under study. They all encode informational proteins, and all but one belong to the set of panorthologs identified by Ciccarelli et al. (2006). These findings are in line with expectations from the definition of informational and operational genes (Rivera et al., 1998) and the related ‘complexity hypothesis’ (Jain et al., 1999), and reflect the concerns that motivated the enlargement of the basic data set of this study by operational genes.
The remaining 114 SCOG7 families contain sufficient phylogenetic information to differentiate between the alternative assignments of R. grylli to the three taxonomic classes (Fig. 1b). Without exception, the 114 alignment-specific best trees locate R. grylli among the Gammaproteobacteria. This set of best trees is largely unbiased with respect to the internal organization of the gammaproteobacterial clade, with relative frequencies of the alternative branch structures represented by candidate topologies #4, #5, and #6 being 29.8%, 42.1%, and 28.1%, respectively. Consistently, when marked as only second best, trees from this group of topologies are rarely rejected (3.5–7.8% of 1sKH tests). In contrast, the four candidate topologies having R. grylli placed in an internal position of either the chlamydial (#1, #2) or the alphaproteobacterial (#7, #9) branch are without exception judged significantly worse interpretations of the underlying SCOG7 family sequence data than the corresponding best tree topology locating R. grylli among the Gammaproteobacteria. These results demonstrate on a broad phylogenetic basis that R. grylli is more closely related to members of the gammaproteobacterial order Legionellales than to bacterial specimen belonging to the Alphaproteobacteria or the Chlamydiae and thus lend strong support to its taxonomic organization in the class Gammaproteobacteria.
Assignment of R. grylli to the taxonomic order Legionellales
In order to determine more exactly the position of R. grylli within the class Gammaproteobacteria, the basic core set of 211 gene families was used as the starting point to identify single-copy orthologs across a set of 18 bacterial genomes including one representative from 12 of the 14 currently recognized gammaproteobacterial orders (Garrity et al., 2005b). While a genome sequence is not yet available from the order Acidithiobacillales, the Legionellales are represented in this genome set by the R. grylli draft sequence and four further genomes from the Legionella pneumophila subspecies pneumophila and fraseri as well as from two strains of C. burnetii. Moreover, the genome sequence of the Alphaproteobacterium R. conori was included to deliver outgroup sequences for downstream phylogenetic analyses. When applying the criterion of best reciprocal hits at the deduced amino acid level, a total of 132 out of 211 gene families, henceforward termed ‘SCOG18’ families according to the number of genomes considered, are found to comprise single-copy orthologs across this genome space. This data subset is made up of 88 families of informational and 44 families of operational genes. Moreover, when the 36 universal panorthologs described by Ciccarelli et al. (2006) are investigated across the same 18 genomes, no ortholog of isoleucyl-tRNA synthetase is identified in the Haemophilus influenzae Rd KW20 sequence due to an N-terminal nonsense mutation in the ileS gene and evolution beyond recognition of its downstream sequence (Tatusov et al., 1996). The remaining 35 universal panorthologs are elements of the SCOG18 core.
To reassess the taxonomic position of R. grylli within the Gammaproteobacteria, three subsets of gene families were formed from the SCOG18 core set being made up of the 88 informational, the 44 operational, and the 35 universally panorthologous SCOG18 families, respectively. Metaproteins were generated by concatenation of amino acid sequences deduced from these sets of gene families and concatenated sequences from each set were aligned. The three resulting gene family subset-specific metaprotein alignments that comprise a total of 35 772, 19 685, and 11 831 sites, respectively, were used to reconstruct organismal phylogenies by the ML, ME, and NJ methods, thereby giving rise to nine different phylogenetic trees (supplementary Fig. S1). These firstly coincide in placing the metaproteins generated from genes of R. grylli in a single branch together with the sequences representing the genera Legionella and Coxiella. All trees being optimally bootstrap-supported at its root, this branch consistently represents the presumed Legionellales clade within the Gammaproteobacteria. This result unambiguously corroborates the current order-level classification of R. grylli.
Assignment of R. grylli to the taxonomic family Coxiellaceae
Secondly, the nine metaprotein trees can be subdivided into two groups of topologies according to the internal organization of the Legionellales clade; extended majority rule consensus trees generated from both groups are presented in Fig. 2. Both groups of metaprotein trees coincide in containing unambiguously supported Coxiella and Legionella clades, but differ in the relative position of the R. grylli sequence. More exactly, the latter is located in a sister clade position relative to the Coxiella clade in the six metaprotein trees produced from the informational gene subset of SCOG18 families and the set of 35 universal panorthologs, giving rise to a topological constellation reminiscent of the taxonomic classification of the genus Rickettsiella in the family Coxiellaceae (Fig. 2b). In contrast, according to the metaprotein trees based on the set of 44 operational gene families, R. grylli appears to be more closely associated with the genus Legionella as is consistent with an alternative assignment to the family Legionellaceae (Fig. 2a). Irrespective of the method of reconstruction used, the occurrence of both topological constellations appears, therefore, to be highly dependent on the underlying data set. Moreover, the conflicting nodes receive near-optimal bootstrap support within both groups of metaprotein trees, a finding that might seem counterintuitive for phylogenies reconstructed from rather extensive data sets.
However, firm bootstrap support on a topological substructure does not indicate that a change in this structural element generates a significantly worse overall topology; a similar statement would instead have to rely on significance testing. In the present context, significance testing is of interest in line with the following rationale. The sets of sequence data used should on a priori grounds be sufficiently extensive to reveal the true phylogeny of R. grylli, and an accumulated sequence analysis as the one performed above would be expected to produce unambiguous results as is the case for the order-level classification of Rickettsiella. However, at the family level, best tree topologies obtained from different data sets contradict each other. In this situation, significance testing is the method of choice used to determine for each data set whether an otherwise identical second-best tree representing the alternative family-level classification of R. grylli is or is not significantly worse than the best tree. Phylogenies obtained from a sequence alignment containing sufficient information to positively reject second-best phylogenetic hypotheses would deserve more confidence than those reconstructed from less phylogeny informative data.
In the present context, the apparently straightforward way to perform likelihood-based significance testing would be to compare each of the above metaprotein ML trees with the respective alternative topology generated by a change in the internal structure of its Legionellales clade. However, due to the restrictions ensuing from the theory of significance testing mentioned earlier, a similar approach could not be expected to generate reliable results. Therefore, the problem of the family-level taxonomy of R. grylli was addressed in the present multi-gene approach under reduction of the above genome space to the five organisms making up the Legionellales clade and E. coli to function as an outgroup. Operating in this model system, the concomitant reduction in complexity allows comparing all possible clade topologies and, thus, meeting the prerequisites for a meaningful interpretation of results generated by both the Shimodaira–Hasegawa and the two-sided Kishino–Hasegawa significance tests. Figure 3 gives an overview of the 105 different bifurcative topologies possible for a phylogenetic tree composed of the organisms in question, with E. coli being fixed in the outgroup position (supplementary Table S2).
Two alternative methods of likelihood-based significance testing, the SH test and the less conservative 2sKH test, were performed to evaluate these 105 candidate topologies against metaprotein sequence alignments from the correspondingly reduced sets of 88 informational, 44 operational, and 35 universally panorthologous gene families comprising 32 234, 17 867, and 11 150 sites, respectively. Both tests firstly coincide in rejecting for each alignment all topologies but those presented in Fig. 4, i.e. all topologies that do not organize the Coxiella and Legionella sequences into two clearly delineated clades. Secondly, both tests select the same data set-specific ML topologies, i.e. topology 4a from the concatenation of 44 operational SCOG18 families and topology 4b from the two further metaprotein alignments, in perfect congruence with the topology of the Legionellales clades in the corresponding metaprotein-based gammaproteobacterial phylogenies (Fig. 2). In contrast, both tests differ in that none of the second-best topologies 4a through 4c is rejected for any of the metaprotein alignments by the SH test, whereas all of them are rejected by the 2sKH test. Thus, despite operating at relevantly different levels of sensitivity, both significance tests fail to detect any bias in the distances between data set-dependent pairs of best and second-best topologies of the Legionellales clade and, therefore, do not motivate any inference concerning the family-level classification of R. grylli. These results are remarkable as the SH test, even if designed to be more conservative in rejecting suboptimal topologies than the 2sKH-test, should be expected to discriminate clearly at the 5% significance level among alternative candidate topologies of low complexity when multi-gene data are assessed. The finding that not only topologies 4a and 4b but also the third possible clade structure, 4c, cannot be discriminated by the significance tests applied makes it sufficiently clear that there is no conclusive phylogenetic signal in the accumulated information as contained in the three sets of concatenated amino acid sequences. With respect to these results, none of the three data sets can, therefore, be claimed to be a more reliable phylogenetic basis than the others.
However, a further possibly discriminative feature of the diverse sets of gene families that might be of interest under these circumstances is the distribution of phylogenetic signal across the data. The observed lack of a conclusive overall signal in the concatenated sequences could, for instance, be the result of two different scenarios at the level of the disaggregated elements of these data sets, i.e. (1) of a lack of relevant phylogenetic signal within individual gene families or (2) of a compensation of dissonant phylogenetic signals among the gene families. The second scenario would likely motivate a closer investigation into and possible filtering of strongly dissonant gene families that are a priori suspected to have resulted from lateral gene transfer. The first scenario, however, would make it difficult to support the assignment of R. grylli to one of the two taxonomic families in question and rather point towards a relatively independent phylogeny of the Rickettsiella within the order Legionellales.
In order to assess the distribution of phylogenetic signal across disaggregated sequence data, all those among the 211 gene families from the basic core set that represent single-copy orthologs across the six genomes present in the model tree topologies shown in Fig. 4 were identified as described earlier. A total of 181 gene families are found to fulfill this condition, i.e. 49 in addition to those used to generate metaprotein sequences. Alignments were produced from these ‘SCOG6’ families at both the nucleotide and the deduced amino acid sequence level, and the same 105 candidate tree topologies (Fig. 3, Table S2) were subject to an SH as well as a 2sKH test against both the DNA and the protein alignments. When the results from this series of 724 individual significance tests are analyzed (supplementary Table S1), best trees are found to have, without any exception, one of the three topologies displayed in Fig. 4, i.e. topological constellations other than those already examined at the concatenated sequence level can be left unconsidered. Consistently, candidate topologies other than 4a through 4c are rejected for the majority of SCOG6 families by the SH test and for virtually all of them by the 2sKH test, with corresponding nucleotide and amino acid sequence alignments performing with similar sensitivity. The following analysis, therefore, concentrates on these three relevant topological structures of the Legionellales clade; the respective 2sKH-test results are presented in Table 1.
|Gene family subset||Nucleotide||Amino acid|
|181 SCOG6 families (complete set)||87−71 −23||48 −101−32|
|10 ‘fully resolving’ SCOG6 families||3−6−1||2−7−1|
|88 informational gene families||45−31 −11||23 −46−18|
|44 operational gene families||22−17 −5||15−22−7|
|35 universally panorthologous gene families||13 −19−3||6 −22−7|
|Gene family subset||Nucleotide||Amino acid|
|181 SCOG6 families (complete set)||87−71 −23||48 −101−32|
|10 ‘fully resolving’ SCOG6 families||3−6−1||2−7−1|
|88 informational gene families||45−31 −11||23 −46−18|
|44 operational gene families||22−17 −5||15−22−7|
|35 universally panorthologous gene families||13 −19−3||6 −22−7|
Numbers of gene families giving rise to best trees of one of the three relevant structures are presented in the order of topologies 4a - 4b - 4c. Maximal values are in bold face.
Inclusion of the laterally transferred murG family in the respective subset.
Results from the corresponding SH-test are identical at the protein level and differ at the DNA level by a single gene family, argS, selecting an additional best tree of topology 4a instead of 4b for the data sets comprising 181, 88, and 35 gene families. Moreover, the SH-test does not generate ‘fully resolving’ gene families other than murG.
Using the SH test at both the DNA and the protein level, suboptimal topologies 4a through 4c are not rejected for all but one SCOG6 family. This single exception concerns the family of murG genes encoding the enzyme N-acetylglucosaminyl transferase. The corresponding nucleotide as well as amino acid sequence alignments reject both topologies 4b and 4c. Comparison of MurG protein sequences with database entries reveals evidence for the occurrence of lateral gene transfer in this family, with murG gene products of R. grylli and L. pneumophila appearing most closely related to orthologs from Gram-positive bacteria of the genera Bacillus and Clostridium, whereas the N-acetylglucosaminyl transferase of C. burnetii shows the highest homology to gammaproteobacterial orthologs (data not shown). In contrast to the SH test, the less conservative 2sKH test at both the DNA and the protein level rejects second-best topologies 4a through 4c for families of nine further genes in addition to murG. These two sets of ‘fully resolving’ SCOG6 families are not identical, but have an overlap of four elements. A closer comparison of sequences from these gene families with homologous database entries does not produce further evidence for the occurrence of LGT.
With respect to the alternative ML tree topologies 4a through 4c, nucleotide and amino acid sequence alignments give diverging results for 79 out of 181 SCOG6 families. This finding supports the view that compensation of contradictory phylogenetic signals is already important within and not only among gene families. If significance tests are performed at the protein level, the three sets of gene families used for metaprotein sequence assembly are moderately (50.0–62.9%) biased towards topology 4b representing the taxonomic assignment of R. grylli to the Coxiellaceae (Table 1). At the DNA level, however, there is a comparable bias towards tree topology 4a representing a classification in the taxonomic family Legionellaceae in the gene family sets comprising 44 operational (50.0%) and 88 informational genes (51.1%). In both directions, biases are too small to lend support to a scenario of strong, dissonant phylogenetic signals from few laterally transferred genes overriding the otherwise coherent, but weaker signal from the bulk of data. A similar picture arises if the full set of 181 SCOG6 families is analyzed, with most of the gene family-specific ML trees showing topology 4b at the protein, but topology 4a at the nucleotide sequence level. However, best trees across the small set of ‘fully resolving’ gene families are relatively strongly biased (60–70%) towards topology 4b at both levels. Tree topology 4c is of minor importance irrespective of the data set investigated or the significance test applied.
In summary, significance testing based on both concatenated sequence as well as single gene family data fails to discriminate between the alternative topological structures of the Legionellales clade originally derived from a reconstruction of the phylogeny of the Gammaproteobacteria. Furthermore, extensive significance testing with disaggregated sequence data in a specifically adapted model system shows that this negative result is neither due to the peculiarities of a particular mode of testing nor due to working at the protein or the DNA level, or to data sets being skewed by a dissonant phylogenetic signal acquired by LGT. This outcome of a consistent failure to reject topologies 4a through 4c under rejection of all 102 further candidate topologies as obtained from both aggregated and the vast majority of disaggregated sequence data firstly demonstrates that there is sufficient information in the alignments to discriminate unambiguously between alternative tree topologies. Secondly, this information is obviously not suitable to discriminate between candidate topologies 4a and 4b, due to a compensation of qualitatively contradictory and quantitatively balanced signals not only among but also within gene families in line with the first above scenario for the distribution of phylogenetic signal. Taken together, these findings are consistent with the view of an equally close phylogenetic relationship of R. grylli with both the genera Coxiella and Legionella, and, therefore, fail to corroborate the currently accepted taxonomic classification of the genus Rickettsiella in the family Coxiellaceae.
Re-evaluation of the 16S rRNA data
The taxonomic organization of the genus Rickettsiella in the family Coxiellaceae has been based on a comparison of 16S rRNA-encoding genes from several bacteria including R. grylli (Roux et al., 1997). In order to assess whether 16S rRNA genes would be more phylogeny informative in the present context than the SCOG6 set of protein-encoding genes, the above 105 candidate tree topologies were subject to both an SH and a 2sKH test against a 1363 site comprising alignment of the virtually complete 16S rRNA-encoding sequences from the six genomes under study in the model system. Moreover, ML, ME, and NJ phylogenies were reconstructed from this alignment and confidence limits were explored. Irrespective of the method of phylogenetic reconstruction used, best trees show topology 4b representing the currently valid family-level assignment of R. grylli, but with rather low bootstrap support values ranging from 65% to 79% for the root of the Coxiellaceae clade while all the remaining branches are optimally supported. Consistently, the same topology is marked as the best tree by both likelihood-based significance tests, but neither topology 4a nor 4c is rejected as a significantly worse description of the 16S rRNA gene data, whereas all further 102 candidate topologies are unambiguously rejected by the 2sKH test. The relative logarithmic likelihood values attributed by the SH test to topologies 4a and 4c are 96.3% and 90.2%, respectively, of that of the ML tree topology 4b, i.e. in a 16S rRNA gene-based analysis, too, both second-best topologies are by far not rejected at the 5% significance level. The sensitivity of 16S rRNA gene-based significance testing of the present data is, therefore, in a similar range as for the majority of individual protein-encoding SCOG6 families and, expectedly, considerably lower than for metaprotein sequences, where log-likelihood values attributed to unrejected second-best topologies range from 39% to 78% of those of the respective ML topology (data not shown). The fact that the sensitivity of 16S rRNA gene-based phylogenetic reconstruction as a tool of phylogenetic analysis at the family level is in general beyond doubt emphasizes once more that the statistical insignificance found in the present case is a consequence of the particular phylogenetic relationships under study.
Using phylogenetic reconstruction, together with significance testing, on a data basis defined by a core set of 211 previously identified families of bacterial genes, together with the R. grylli genome sequence, the above analyses corroborate the 16S rRNA gene-based taxonomic assignment of the species R. grylli to both the class Gammaproteobacteria and the order Legionellales, but do not lend support to its currently accepted family-level classification. At neither the concatenated protein nor the SCOG family nucleotide or amino acid sequence, or the 16S rRNA gene level is the assignment of R. grylli to the family Coxiellaceae a significantly better description of the molecular data than an alternative classification to the Legionellaceae. This finding of consistent insignificance is confirmed when using different methods of significance testing and when exploring the phylogeny-relevant information present in the data set separately for subsets comprising families of universally panorthologous, informational, and operational genes. In contrast, there is sufficient phylogeny-relevant information even in single gene family sequences to unambiguously discriminate under identical significance criteria between these and further phylogenetic hypotheses, i.e. there is not a general lack of phylogenetic signal in the data, but the signal is rather balanced between the two phylogenetic hypotheses in question. Consequently, the present study demonstrates that neither the analysis of an elevated number of protein-coding genes nor the reanalysis of 16S rRNA-encoding sequences provides a sufficient reason for the taxonomic assignment of R. grylli to either the Coxiellaceae or the Legionellaceae, but is consistent with R. grylli having a similar phylogenetic distance from the specimen representing both taxa. We conclude that the currently accepted family-level classification of R. grylli has been both premature and not well founded, and therefore propose a respective reorganization of the order Legionellales as would be e.g. the creation of a further, independent taxonomic family comprising R. grylli. As sequence information from the type strain of the genus Rickettsiella, a specimen belonging to the pathotype R. popilliae, is not yet available and, therefore, sequencing of the 16S rRNA-encoding gene from R. grylli has been determinative previously for the taxonomic organization of the entire genus, a corresponding reassessment of the taxonomic position of R. grylli would likely apply at least to those further Rickettsiella pathotypes that are on a firm basis assigned to this taxon.
I am highly indebted to my colleague, Dr Regina G. Kleespies, from the Julius Kühn Institute at Darmstadt for support and critical discussion.
The following supplementary material is available for this article:
This material is available as part of the online article from: http://www.blackwell-synergy.com/doi/abs/10.1111/j.1574-6968.2008.01158.x This link will take you to the article abstract).
Please note: Blackwell Publishing is not responsible for the content or functionality of any supplementary materials supplied by the authors. Any queries (other than missing material) should be directed to the corresponding author for the article.