Bacterial Rho-independent terminators (RITs) are important genomic landmarks involved in gene regulation and terminating gene expression. In this investigation we present RNIE, a probabilistic approach for predicting RITs. The method is based upon covariance models which have been known for many years to be the most accurate computational tools for predicting homology in structural non-coding RNAs. We show that RNIE has superior performance in model species from a spectrum of bacterial phyla. Further analysis of species where a low number of RITs were predicted revealed a highly conserved structural sequence motif enriched near the genic termini of the pathogenic Actinobacteria, Mycobacterium tuberculosis. This motif, together with classical RITs, account for up to 90% of all the significantly structured regions from the termini of M. tuberculosis genic elements. The software, predictions and alignments described below are available from http://github.com/ppgardne/RNIE.
Transcription termination in bacteria is accomplished by two different mechanisms, both dependent upon RNA polymerase (RNAP) pause sites. The first relies upon the interaction of a protein called Rho with RNAP. The second, often called ‘intrinsic termination’, depends on the presence of short genomically encoded motifs called Rho-independent terminators (RITs). These are characterized by short G + C-rich hairpin loops containing ∼30 nt followed by a polyuridine tail of typically three to seven consecutive uridines which precede the stop site (Figure 1). These motifs bind to components of RNAP causing the bacterial polymerase to stall, releasing the nascent transcript resulting in transcription termination (1).
An early bioinformatic analysis of complete genomes implied that there is a broad range in the degree of dependence on intrinsic versus Rho-dependent termination (2). This result is supported by mutagenesis experiments showing that the terminator protein Rho is essential in some organisms such as Salmonella enterica yet non-essential in others such as Bacillus subtilis (3,4). Indeed, Rho itself is the only protein known to depend on Rho-dependent termination in B. subtilis. Consequently, there appears to be competition between the two termination systems across bacterial species, resulting in clade-specific skews in usage for one or the other (5–7). For example, in Eschericha coli just 171 genes have experimentally characterized RITs, whereas in B. subtilis there are 891 confirmed terminations by RITs.
As the number of bacterial genome sequences grows (1194 in the June 2010 release of EMBL version 104), gene annotation is increasingly reliant on automated approaches. This will accelerate as new sequencing technologies are targeted towards sparse and poorly covered corners of the bacterial tree of life (8). Therefore, it is of utmost importance to ensure that this annotation is as accurate as possible. Transcription terminators, along with promoters, Shine–Dalgarno sequences and the start and stop codons form important landmarks in the bacterial genomic landscape. By evaluating each landmark in the context of neighbouring landmarks, we can start to improve the depth of genome annotations and support predictions of biological significance (9). Furthermore, some researchers have successfully inferred and subsequently verified the existence of non-coding RNA (ncRNA) genes using promoter and terminator annotation tools (10,11). Historically, these ncRNAs have been difficult to infer. This approach has now been automated using the sRNApredict and sRNAscanner bioinformatic tools (12,13).
The task of predicting RITs has been tackled previously but no existing method results in highly reliable predictions. Typically, these approaches use a free energy-based approach to infer a secondary structure and either an ad hoc score for the characteristic poly-U tail or a further energy-based approach computing the affinity for the 3′ RIT tail with the template DNA strand (14–16). Some of the methods also give bonuses to, or are biased towards, predictions that occur in the typical genetic context e.g. immediately 3′ to an annotated protein coding sequence (CDS) (16,18).
In this work, we have implemented a covariance model (CM) based approach for annotating RITs (17,20). We show that the CM-based approach is more accurate than existing methods, despite the fact that we are not using additional information such as proximity to the 3′-end of an annotated gene. After noticing a paucity of predicted RITs in the Mycobacterium tuberculosis genome, we investigated the existence of significantly structured regions in the proximity of 3′-ends of annotated genes relative to shuffled controls. We were surprised to discover the presence of a previously unknown abundant hairpin motif with a well-conserved sequence.
MATERIAL AND METHODS
The approach for RIT prediction that we have developed makes use of CMs (19,20). CMs are powerful statistical models for identifying homologues to a family of related RNAs. This is done by comparison to a ‘seed’ alignment of representative RNAs that have been annotated with a consensus secondary structure. In recent years, CM methods have become increasingly practical due to dramatic improvements in the memory requirements (21), computation time (22–24) and accuracy (25). For the following work, we have used the Infernal package, version 1.0.2 (26).
This approach avoids the awkward issue of combining unrelated measures of free energy and sequence composition. Instead, the primary sequence and predicted secondary structure of target sequences are scored within a unified statistical framework. The probability that each subsequence of a target database was generated by the CM is computed and compared with the probability that is generated by a background model of random sequence, resulting in a log-odds score called a ‘bit score’. In general, positive scores indicate a given sequence that looks more like the seed sequences than a random sequence; conversely, negative scores look more like random sequences than seed sequences. More specifically, a bit score of x bits indicates the sequence is 2x times more likely to have been generated from the CM than from the random background model.
Building a RIT alignment
We obtained 171 and 891 experimentally validated terminators from the Gram-negative Proteobacteria E. coli and the Gram-positive Firmicute B. subtilis, respectively (5–7). These were made available by the ECDC (E. coli database collection) and a well annotated set of B. subtilis RITS from the supplementary materials of de Hoon et al. (7). The evidence for the RITs was carefully checked and 981 sequences with experimental evidence for RIT activity were used for further work.
We ran iterative rounds of alignment, structure prediction, refinement and homology search on this dataset to produce a large alignment of RITs. The alignments and structures were inferred and refined using a combination of the computational methods WAR (28), CMfinder (29), MLocarna (30) and Infernal (26) followed by manual refinement using the RALEE alignment editor (31). Searches of the EMBL database were performed using the Rfam annotation pipeline (32). Carefully selected predicted terminators, variant from existing seed sequences, were incorporated into the seed. These selected sequences were required to fulfil the following criteria: (i) the maximum similarity to an existing seed sequence had to be 95% and the minimum 60%, (ii) the minimum fraction of canonical basepairs had to be 75%, (iii) the sequence annotation should not contain terms like contaminant, pseudogene, repeat or transposon and (iv) they must score above a bit score threshold of 20. These criteria have been found in other work to produce candidate sequences with useful levels of variation for extending RNA families (32). Finally, the selected sequences were manually checked for the typical position 3′ to a gene annotation. This resulted in a total of 1117 aligned sequences (the Rfam pipeline produces a multiple alignment). We then split the sequences into two groups based on bit score by building a CM from the alignment and using it to rescore all 1117 sequences. If a sequence scored 14 bits or higher it was placed into group A, else it was placed into group B. Each group's alignment was then iteratively refined by re-aligning the group's sequences to a CM built from the current alignment until no major changes in bit score were observed (using the –refine option in Infernal's cmbuild program). The resulting two alignments were then manually refined and used to build the final two RNIE CMs that were used for all subsequent annotations and benchmarks.
Two major RNIE modes
We have built two major modes into the RNIE program. The default mode, dubbed ‘genome mode’, is optimized for the task of high-throughput genome annotation. This mode employs parameters that ensure a rapid search (∼43 kb/s) with a very low false positive rate (∼1.7 FP/M). The sensitivity, positive predictive value and Matthews' correlation coefficient for this mode is 0.70, 0.79 and 0.74, respecively. The second major mode, dubbed ‘gene mode’, is optimized for the task of individually annotating the downstream regions of genes. Typically, these are smaller datasets and a higher sensitivity (0.83) is desirable, while a slower search is tolerable (∼1 kb/s). The false positive rate, positive predictive value and Matthews' correlation coefficient for this mode is ∼9.6 FP/MB, 0.45 and 0.61, respecively.
The genome and gene modes are launched with the following respective cmsearch parameters:
- cmsearch -T 16 -g --fil-no-qdb
- --fil-T-hmm 2 --cyk --beta 0.05
- CM query_sequence.fasta
- cmsearch -T 14 -g --fil-no-qdb
- --fil-no-hmm --no-qdb --inside
- CM query_sequence.fasta
In order to evaluate the performance of our method relative to comparable tools, we have conducted two independent benchmarks.
First, we discuss some caveats to this benchmark. In any bioinformatic setting, the ideal is to seperate one's training and test data in order to avoid problems due to over-training. However, in this situation we had relatively few examples for training or testing purposes; these were from just two organisms. Furthermore, it was impossible to remove the training data from the alternative methods that we test here TransTermHP (16), RNAmotif (14) and Rnall (33). Therefore, we have had to include a biased test (the alpha benchmark) using the training data for testing. The results of this can be considered an upper bound on the likely true performance of these algorithms. To alleviate the worst of these concerns, we took the two best algorithms from the alpha benchmark and added a ‘beta benchmark’ which is independent of the training data. This test considered the correlation of whole genome annotations with gene ends on both native and shuffled genome sequences. These genome sequences were selected from a broad range of bacteria spanning all the main bacterial phyla and specifically avoided either E. coli or B. subtilis.
The first benchmark used 485 previously established E. coli and B. subtilis terminators. The 144 E. coli RITs were derived from the ECDC (6). These RITs have little associated annotation. Consequently, the provenance of the data is difficult to establish and therefore some of these RITs might not be biological. This contrasts with the 341 B. subtilis RITs that are derived from a screen by de Hoon et al. (7). This data set has an excellent annotation of the evidence for each RIT. We manually selected those with good experimental evidence for function.
These RITs were embedded in 1000 bases of randomly selected and then permuted bacterial genomic sequence (see Table 1 for the sources of genomic sequences). The permuted genomic sequences were shuffled using a dinucleotide frequency preserving procedure that preserves some of the strongest statistical signals in the genome such as CpG content and the stacking signals that are important to control for when investigating RNA secondary structure (34). An additional set of 100 decoy sequences were generated for each known terminator. These were also embedded in 1000 bases of randomly selected permuted genomic sequence. The decoy sequences were generated using a first-order Markov process with nucleotide transition rates estimated from the known terminators. This method was used rather than shuffling since short terminator sequences may have a limited number of permuted conformations with an identical dinucleotide content. TransTermHP (16), the Lesnik et al. RNAmotif descriptor (14) and Rnall (33) and RNIE were used to predict RITs in these datasets. Since the TransTermHP algorithm requires annotated genic features, we artificially generated sets of 2, 4, 9 and 10 features for each sequence. In each set, one of the features had a 3′-end corresponding to the start of a known terminator or a decoy terminator sequence. Each terminator prediction for each algorithm was classified as either a true positive or a false positive depending upon whether the prediction overlapped a known terminator sequence by 1 nucleotide or more. The score (or scores) for each prediction were also stored for each terminator prediction and the receiver operator characteristic (ROC) and sensitivity versus positive predictive value (PPV) plots were generated for each tool and score combination (Figure 2).
|Species||EMBL accession||Phylum||Genome size||Number CDSs||G + C content||Number predictions|
|Species||EMBL accession||Phylum||Genome size||Number CDSs||G + C content||Number predictions|
The sensitivity, PPV and false-positive rate (FPR) metrics that were used to generate Figure 2 are defined as:
For each method, the true-positive (TP) values were computed by counting all the predicted terminators that overlapped a known terminator by at least one nucleotide. All other predicted terminators on the positive strand were counted as false positives (FP). The known terminators that were not detected contributed to the false negative (FN) count. In this context, the true negative (TN) values can be computed in many different ways: one could conservatively count the regions that remain unclassified and one could liberally count every possible substring that remains unclassified. We chose the middle ground by using the number of nucleotides that remained unclassified.
The second benchmark relies upon the correlation of predicted terminators with annotated genic elements on a range of native and shuffled genome sequences. For this test, we were able to exclude all the training data from the test set by taking a selection of 14 representative bacterial genomes that are widely distributed throughout the better characterized portions of bacterial phylogeny (Table 1). The annotations for each genome were extracted from the EMBL nucleotide database (35) and supplemented with ncRNA annotations from Rfam 10.0 (32). We took all the unique genic features for each test genome and computed the minimum distance between these and each terminator prediction on the same strand for each method. The pooled results are shown in Figure 3.
The alpha benchmark illustrates some interesting features of the terminator predictors (Figure 2). Many of the energy-based methods employ scoring schemes based on the free energy of the RIT hairpin and the free energy of the RIT tail disassociation with the template DNA strand (Figure 1). These two score types show characteristic trajectories through the ROC and sensitivity versus PPV plots. We noted in particular that the disassociation energies reported by RNAMotif (dG), RnaII (hbG) and RnaII-Brkr (hbG) show very little potential for discriminating between true and false RITs based on their atypical trajectories through these plots. However, the RIT hairpin energies from RNAMotif (struct), RnaII (dG) and RnaII-Brkr (dG) show some discriminatory potential; in particular, the RNAMotif descriptor by Lesnik et al. (14) performed well. In fact, this was the only method of this class to reach over the y = 1-x threshold on the sensitivity versus PPV plot (see the red dotted line in Figure 2). This line is an indicator whether a method is doing better or worse than a ‘random’ predictor.
For the TransTermHP method, we were forced to provide fictional gene annotations in order to get the method to run. In order to assess dependence on the number of features, we ran four tests using 2, 4, 9 and 10 regularly spaced genes. In each case, one annotation was terminated by either a native or decoy RIT. There was little consistent influence on the performance of TransTermHP based on the number false gene annotations. The maximum Matthews' correlation coefficient was acheived by the run with 10 annotations [max(MCC)=0.50], the minimum was with 4 annotations [max(MCC)=0.44].
The RNIE method we are presenting in this work performed very well compared to the alternative tools on this particular benchmark. The highest maximum Matthews' correlation coefficient was attained by RNIE run in genome mode [max(MCC) = 0.75], followed by the run in gene mode [max(MCC) = 0.74]. We used these results to identify thresholds for each mode that optimally balanced the number of true and false positives. A too high threshold will mean we miss a lot of real terminators: a too low threshold will mean we are swamped in noisy predictions. For both genome and gene modes, we selected thresholds (16.00 and 14.00 bits, respectively) slightly lower than those suggested by the optimal MCC-based threshold (16.45 and 19.09 bits, respectively). The lower thresholds accepted a few more FP but generally researchers are more forgiving of these than FN in genomics work. Furthermore, most FP can be discounted by their genomic context.
The speed of the RNIE algorithm, in genome mode, is comparable to the alternative methods (Figure 2). While CMs have long been known to be computationally intensive, for this work we use an optimized CM approach that employs several methods to increase the computational efficiency (24,26). In genome mode, RNIE can scan >43 kb/s, in gene mode it scans just 1 kb/s. This is comparable to TransTermHP which scans 74–186 kb/s; however, for this tool there is a linear relationship between the number of annotations and speed for this tool i.e. the more annotations the slower it scans. The RNAmotif descriptor scans 602 kb/s and is the fastest tool we encountered. Finally, Rnall scans ∼1 kb/s; however, the Rnall speed had to be estimated by computing a CPU factor as the only version we had access to runs on an outdated computer architecture.
The beta benchmark illustrates that RNIE can accurately detect terminators across a broad range of bacterial genomes outside of E. coli and B. subtilis genomes and without requiring gene annotation information (Figure 3). Figure 3A shows that in genome annotation mode there is an excess of RIT predictions near the 3′-end of CDS and ncRNA annotations. The dotted lines show that RNIE, in genome annotation mode, makes a negligible number of predictions in the permuted genomes, verifying that the FPR for this approach is very low. The results for RNIE in gene annotation mode show similar results, with an expected higher number of predictions in the correct context to gene annotation but with a correspondingly higher FPR. The results for TransTermHP show the worrying result that RIT predictions are enriched in the 3′-ends of genes for both the native and permuted genome sequences. This suggests that a significant fraction of predictions by TransTermHP are false even though they appear in a genomic context associated with genic termini.
The Figure 3B plot shows the fraction of genes with RIT predictions associated with genic termini for all the species in Table 1 for each of the three prediction approaches. Again, these generally illustrate the high sensitivity of RNIE and low FPR relative to TransTermHP. This plot also illustrates the diverse degree of RIT usage across bacterial species which does not follow traditional lines of bacterial classification. For example, the Gram-positive species from the phyla Firmicutes and Actinobacteria have a mixture of exemplars from both ends of the usage spectrum. That is, B. subtilis makes substantial use of RITs, whereas neither of the Actinobacteria M. tuberculosis and Streptomyces griseus make substantial use of RITs. Even within phyla, there can be a lot of variation of RIT usage. For example, the Proteobacteria E. coli and S. enterica clearly employ high levels of RITs for transcriptional termination, whereas Helicobacter pylori does not.
Case study: M. tuberculosis termination
In M. tuberculosis, the number of predicted RITs was very low. However, other researchers have suggested that M. tuberculosis do employ a rho-independent terminator mechanism (2,18,36–38). Therefore, we chose to analyse genic termini in more detail within this species. There is published evidence that M. tuberculosis genes are enriched for stable secondary structures near the coding terminus (2). However, further analysis has shown that these do not fit the canonical terminator model. A method has been developed that attempts to classify predicted secondary structures from coding termini sequence into any of five different classes (18,36–38), with bonuses given for ‘correct’ genomic context. The predominant form of these are ‘i-shaped’ structures (>90%) or a short stem loop, that have <3 U's in the 10 nt stretch following the stem loop.
To illustrate the flexibility of our approach, we took all the annotated coding sequences from the M. tuberculosis CDC1551 complete genome and extracted subsequences from −20 to +80 nt around all annotated gene termini. These were each folded using the RNAfold routine from the Vienna package (39) and then subjected to a permutation test, where native minimum free energies (MFEs) were compared to the pooled distribution of MFEs for 1000 permuted sequences with the same dinucleotide content for each termini (34). The regions that had a P < 0.001 where subsequently fed into the alignment and folding algorithm CMfinder (29). This alignment was manually refined using the RALEE alignment editor (31). A covariance model was built for the resulting alignment (26) and then RNIE was deployed for annotating the entire M. tuberculosis CDC1551 genome with the more specific covariance model.
We were surprised to discover a previously unpublished well-conserved motif (Figure 4) that we have called the ‘Tuberculosis Rho-independent terminator’ or TRIT. The TRITs account for 72% (59/82) of the significantly stable terminator sequences (P < 0.001), the standard models add a further 7%. This ratio increases to 81% (29/36) when we use a more stringent threshold of P < 0.0001 plus a further 8% from the standard models (Figure 4B). Consequently, the RIT and TRIT models account for 80−90% of all the highly structured regions near gene termini in M. tuberculosis. There are 147 copies of TRIT scattered throughout the M. tuberculosis genome (EMBL accession: AE000516). Given the palindromicity of these sequences, the bulk are bidirectional. The TRITs are closely associated with the terminal regions of annotated genic features (Figure 4A).
A unique TRIT feature that we observed is that the distribution of TRIT sequences relative to the nearest annotated stop codon is very narrow. These are largely positioned within the coding sequence, around the −8 position, whereas the RIT sequences are much more broadly distributed and are predominately located further downstream between +10 and +20. Scans of the public sequence databases for other TRITs show that this terminator type is restricted to Mycobacterium. The TRIT utilizing species that we identified include M. abscessus, M. avium, M. bovis, M. gilvum, M. intracellulare, M. kansasii, M. marinum, M. smegmatis, M. ulcerans and M. vanbaalenii.
There are two major modes researchers want from RIT prediction software. The first, that has been targeted by existing methods, is to investigate the possibility of a RIT for a specific gene or cis-regulatory element. For this a high-sensitivity approach is desirable with an acknowledged cost to specificity, where the context of the prediction should add some specificity. The second major mode is to screen entire bacterial genomes for RITs in the hope of identifying short ORFs, sRNAs and cis-regulatory elements such as riboswitches (10,40). These can also be used to validate, provide strand information and otherwise improve the annotation of transcripts from transcriptome data such as RNA-seq (41). The benchmarks have shown that the ‘gene’ and ‘genome’ modes implemented in RNIE both provide complementary features that are both accurate and computationally efficient.
Our further investigation of gene termini in M. tuberculosis identified a terminator motif with both strong sequence and structure conservation. This TRIT motif, together with our RIT predictions, account for 80−90% of all the most highly structured regions near gene termini in M. tuberculosis. The high sequence conservation of the TRITs implies that further cellular machinery may be involved in termination in this organism; possibly a factor that binds the double stranded RNA sequence motif.
We tried the same approach with two other species with a paucity of RIT predictions; these were H. pylori and Fervidobacterium nodosum, but could not identify any obvious terminator motif. Fervidobacterium nodosum did show some enrichment of structured elements; however, the bulk of these fit the traditional RIT motif with a lower G + C content than the other well-characterized examples, where a lower threshold for this species identified the missing RITs.
In conclusion, this work has shown that covariance models can be deployed to predict Rho-independent terminators with an accuracy that has not been available previously. The method we propose is slightly slower than some competing approaches; however, the boost in prediction accuracy is worth the sacrifice.
Wellcome Trust [grant number WT077044/Z/05/Z] (to P.P.G., L.B. and A.B.); Howard Hughes Medical Institute support to Sean R. Eddy (to E.P.N.) and to Ronald R. Breaker (to Z.W.). Funding for open access charge: The Wellcome Trust.
Conflict of interest statement. None declared.
The authors are grateful to Dave Ecker from Ibis Biosciences for providing the Lesnik et al. (14) RNAmotif descriptor that after a decade still performs well.