ANNOgesic: A pipeline to translate bacterial/archaeal RNA-Seq data into high-resolution genome annotations

To understand the gene regulation of an organism of interest, a comprehensive genome annotation is essential. While some features, such as coding sequences, can be computationally predicted with high accuracy based purely on the genomic sequence, others, such as promoter elements or noncoding RNAs are harder to detect. RNA-Seq has proven to be an efficient method to identify these genomic features and to improve genome annotations. However, processing and integrating RNA-Seq data in order to generate high-resolution annotations is challenging, time consuming and requires numerous different steps. We have constructed a powerful and modular pipeline called ANNOgesic that provides the required analyses and simplifies RNA-Seq-based bacterial and archaeal genome annotation. It predicts and annotates numerous features, including small non-coding RNAs, with high precision. The software is available under an open source license (ISCL) at https://pythonhosted.org/ANNOgesic/.


INTRODUCTION
As the number of available genome sequences has rapidly expanded in the data bases, numerous tools have been developed that can detect genomic features of interest based on the genome sequence. Prominent representatives are Glimmer to identify open reading frames (ORFs) [1], tRNAscan-SE [2] to spot tRNAs and RNAmmer to find rRNAs [3]. Pipelines like Prokka [5] or ConsPred [6] combine such tools and are able to search multiple features in bacterial and archaeal genomes. Still, these tools that make their predictions purely based on the genome sequences can predict features like transcriptional start sites and non-coding RNAs, if at all, only with low confidence.
Recent developments in high-throughput sequencing offer solutions to this problem. RNA-Seq has revolutionized how differential gene expression can be measured and is widely used for this purpose [7]. Besides this it has also been applied in numerous cases to improve the genome annotation of bacteria [8][9][10], archaea [11] and eukaryotes [12]. RNA sequencing based protocols like differential RNA-Seq (dRNA-Seq) [13,14], Term-seq [15] and ribosome profiling [16,17] have been applied to globally detect transcriptional start sites (TSSs), small non-coding RNAs (ncR-NAs/sRNAs), terminators, ORFs and sRNA regulatory target but require dedicated data processing. While there are tools that can process RNA-Seq data in order to predict genome-wide features like TSSs based on dRNA-Seq data [18][19][20] or based on conventional RNA-Seq data [21,22], there has been, so far, no solution that combines different predictions of genomic features and compiles them into a consistent annotation.
Here we present ANNOgesic -a modular, command-line tool that can integrate different types of RNA-Seq data like dRNA-Seq as well as RNA-Seq generated after transcript fragmentation (or conventional RNA-Seq) and generate high-quality genome annotations. It can detect several genomic features including genes, CDSs (coding DNA sequence), tRNAs, rRNAs, TSSs, and processing sites (PSs), transcripts, terminators, untranslated regions (UTRs) as well as sRNAs, small open reading frames (sORFs), circular RNAs, CRISPR-related RNAs, riboswitches, and RNA-thermometers. It can also perform RNA-RNA and protein-protein interaction predictions on detected features. Furthermore, it groups genes into operons as well as sub-operons and can generate promoter motifs that are found in front of transcriptional start sites. It can also allocate GO (Gene Ontology) terms and subcellular localizations to genes. Several of ANNOgesic's data processing steps are new implementations, while others are performed by third-party tools after dynamic parameter-optimizations through ANNOgesic itself. Numerous visualizations and statistics help the user to quickly evaluate the feature predictions. The pipeline is modular and was intensively tested with several RNA-Seq data sets from bacterial as well as from archaeal species.

Modules and input data of ANNOgesic
ANNOgesic consists of the following twenty modules -their names indicate their functions: Sequence modification, Annotation transfer, SNP/Mutation, Transcript, TSS, Terminator, UTR, Processing site, Promoter, Operon, sRNA, sRNA target, sORF, GO term, Protein-protein interaction network, Subcellular localization, Riboswitch, RNA thermometer, Circular RNA, and CRISPR. Several potential workflows connecting these modules are displayed in Supplementary Figure 1.
Depending on the task to one wish to perform, ANNOgesic requires a specific set of input information -either as in coverage information in wiggle, or alignments in BAM format. This can be generated by a mapper like BWA [28], STAR [29], segemehl [30], or a full RNA-Seq pipeline like READemption [31]. Certain modules additionally require annotations in GFF3 format. In case a sufficient genome annotation is not available, ANNOgesic can perform an annotation transfer from a closely related strain.

Optimization of the parameter set for TSSpredator
For several parts of ANNOgesic, the selection of parameters has a strong impact on the final results. Especially the TSS prediction -building on TSSpredator [18] -requires a sophisticated fine-tuning of several parameters (namely height, height reduction, factor, factor reduction, enrichment factor, processing site factor and base height). To overcome the hard task of manual parameter selection, ANNOgesic optimized the parameters by applying a genetic algorithm, a machine learning approach, [32] which is trained based on a small user curated set of TSS predictions. This approach has the advantage of being able to find global, not only local, optima. The process of optimization is composed of three parts -random change, large change, and small change ( Figure 1). In this context, a global change means a random allocation of values to all parameters, a large change is a random allocation of values to two parameters, while a small change is adding or subtracting a small fraction to or from one parameter value. The result of each iteration is evaluated by a decision statement (Equation 1).
Equation 1: T P m is the number of manually-detected TSSs. T P c /T P R c represents the true positive/true positive rate of the current parameters. T P b /T P R b represents the true positive/true positive rate of the best parameters. F P c /F P R c represents the false positive/false positive rate of the current parameters. F P b /F P R b represents the false positive/false positive rate of the best parameters. If one of these six situations is true, it will replace the best parameters with current parameters.

Test data sets
In order to test ANNOgesic's performance, we applied it to RNA-Seq data sets originating from Helicobacter pylori 26695 [8,14] and Campylobacter jejuni 81116 [18]. The dRNA-Seq data sets were retrieved from NCBI GEO where they are stored under the accession numbers GSE67564 and GSE38883, respectively. For Helicobacter conventional RNA-Seq data -i.e. without TEX treatment (which degrades transcripts without a 5'-triphosphate) and with fragmentation of the transcript before the library preparation -was also retrieved from NCBI SRA (accession number SRR031126).

Genome sequence improvement and SNP/mutation calling.
Conventionally, differences in the genome sequence of a strain of interest and the reference strain are determined by DNA sequencing. However, RNA-Seq reads can also be re-purposed to detect such SNPs or mutations that occur in transcribed regions which can help to save the resources required for dedicated DNA sequencing or DNA SNP microarray measurements. The two drawbacks of this method are that only locations which are expressed can be analyzed and that, due to RNA editing, changes could be present only in the RNA and are not found in the genome. On the other hand, it has been shown to be a valid approach for eukaryotic species and that the majority of SNPs are found in the expressed transcripts [33,34]. In conclusion, such an analysis could be useful to generate hypotheses that then need to be tested with complementary methods. ANNOgesic offers the user to perform the SNP/mutation calling via SAMtools [35] and BCFtools [35] applying read counting-based filtering. Figure 1: The genetic algorithm that ANNOgesic uses for optimizing the parameters of TSSpredator. It starts from the default parameters. These parameter sets will go through three steps -global change (change every parameter randomly), large change (change two of the parameters randomly), and then small change (adds/subtracts a small fraction to one of the parameters). It will then select the best parameter set for reproduction when one step is done. Usually, ANNOgesic can achieve the optimized parameters within 4000 runs.

Annotation transfer.
ANNOgesic integrates RATT [36], which can detect the shared synteny and mutations between a reference and query genome to transfer annotation (i.e. genes, CDSs, tRNAs, rRNAs) by applying MUMmer [37]. For the chosen strains, H. pylori 26695 and C. jejuni 81116 annotation files in GFF3 format could be obtained from NCBI RefSeq. Due to this there was no need to transfer the annotation from a closely related strain.

Detection of transcript boundaries
Knowing the exact boundaries and sequence of a transcript is crucial for a comprehensive understanding of its behaviour and function. For example, UTRs can be the target of regulation by sRNAs or small molecules (e.g. riboswitches) [38,39] or even sources of sRNAs [40]. Unfortunately, most bacterial annotations only cover the protein coding regions while the information about TSSs, terminators and UTRs is lacking. To address this issue, ANNOgesic combines several feature predictions for the reliable detection of transcript boundaries ( Figure 2).

Transcript detection.
The primary step for the detection of transcript boundaries is transcript detection. For this purpose numerous tools are available (e.g. [41]), but most of them are optimized for the assembly of eukaryotic transcripts. Due to this, we combined several heuristics to perform such predictions based on the nucleotide coverage data, given gene annotations and several parameters that can be set by the user ( Figure  3). By running ANNOgesic's subcommand for transcript prediction, we detected 1715 transcripts in H. pylori 26695 and 1147 transcripts in C. jejuni 81116. These transcripts cover 1520 and 1568 genes which shows that 97% and 93% of the known genes are expressed in at least one condition, respectively.

Optimization of TSS prediction parameters.
For the prediction of TSSs, ANNOgesic builds on TSSpredator [18], which takes dRNA-Seq coverage data as input. The outcome of TSSpredator's predictions depends strongly on the setting of numerous parameters and fine-tuning those can be time consuming. Due to this, a parameter optimization was implemented in  ANNOgesic that builds on a small, manually curated set of TSSs to find optimal values.
In order to test the performance of ANNOgesic, we manually annotated TSSs in the first 200 kb of the genome of H. pylori 26695 and C. jejuni 81116 (Supplementary Table 2 and 3). This set was used to benchmark the prediction of TSSpredator with default settings as well with the parameters optimized by ANNOgesic. For the test set, we manually annotated TSSs from first 200 kb or first 400 kb in the genome of H. pylori 26695 and C. jejuni 81116 (Supplementary Table 2 and 3), respectively. As displayed in Table 2, the optimization had minor sensitivity improvements in H. pylori 26695 (from 96.8% to 99.6%), while strongly increased the sensitivity for the TSS prediction for C. jejuni 81116 (67.1% to 98.7%) at the same level of specificity. To underpin those findings, we looked at the overlap of the predicted TSS and predicted transcripts. This was nearly the same for H. pylori 26695 (82% for default and 83% for optimized setting) but also increased significantly for C. jejuni 81116 from 81% for default parameters to 96% with optimized parameters.
Moreover, TSSs are classified depending on their relative positions to genes by TSSpredator. Based on these classifications, Venn diagrams representing the different TSS classes are automatically generated (Supplementary Figure 2).

Processing sites.
Several transcripts undergo processing, which influences their biological activity [40,42]. In order to detect processing sites based on dRNA-Seq data, ANNOgesic facilitates the same approach as described for TSS detection but searches for the reverse enrichment pattern (i.e. a relative enrichment in the library not treated with  (26) 99.99% (7) C. jejuni 81116 Default 61.3% (19) 99.99% (2) Optimization 93.5% (29) 99.99% (6) The percentages in the table are the sensitivity or specificity. The numbers in brackets are true positive or false positive. TEX). As done for the TSSs, we manually annotated the processing sites in first 200 kb of the genomes and found in H. pylori 26695, 281 and in C. jejuni 81116, 345 processing sites, respectively. Based on these, we performed parameter optimization on the test set (manually-curated from first 200 kb to 400 kb, Supplementary Table 4 and 5, Table 2) and could improve the prediction of processing sites via TSSpredator [18].

ρ-independent terminators.
While the transcriptional start sites are in general clearly defined boarders, the 3'-end of a transcript is often not very sharp. A commonly used tool for the prediction of the 3'-end of a transcript is TransTermHP [43], which detects ρ-independent terminators based on genome sequences. Manual inspection showed us that TransTermHP predictions are not always supported by the RNA-data ( Supplementary Figure 3e and f). This could be due to the lack of expression in the chosen conditions. Additionally, certain locations in 3'-ends that may be ρ-independent were not detected by TransTermHP. Due to this, we extended the prediction by two further approaches based on RNA-Seq coverage and the given genome sequence. At first, terminators predicted by TransTermHP that show a significant decrease of coverage are marked as high-confidence terminators. For this, the drop of coverage inside the predicted terminator region plus 30 nucleotides up and downstream is considered as sufficient if the ratio of the lowest coverage value and the highest coverage value is at a user-defined value (see Supplementary Figure 3). In order to improve the sensitivity, an additional heuristic for the detection of ρ-independent terminators was developed. In this approach, only converging gene pairs (i.e. the 3'-end are facing to each other) are taken into account (Supplementary Figure 4). In case the region between the two genes is A/T-rich and a stem-loop can be predicted in there, the existence of a ρ-independent terminator is assumed. As default, the region should consist of 80 or less nucleotides, the A/T-rich region should be longer than 3 nucleotides, the stem-loop needs to be 4 -20 nucleotides, the length of the loop needs to be between 3 and 10 nucleotides and maximum 25% of the nucleotides in the stem should be unpaired.

UTRs.
Based on the CDS locations and the above described detection of TSSs, terminators and transcripts, 5' UTR and 3' UTR can be annotated by ANNOgesic. Additionally, it visualizes the distribution of UTR lengths in a histogram (as shown in Supplementary Figure 5).

Promoters.
ANNOgesic integrates MEME [44] (which detects ungapped motifs) and GLAM2 [45] (which discovers gapped motifs) for the detection and visualization of promoter motifs. The user can define the number of nucleotides upstream of TSSs that should be screened and the length of potential promoter motifs. The motifs can be generated globally or for the different types of TSSs (example in Supplementary  Figure 6).

Operon.
Based on the TSS and transcript prediction, ANNOgesic can generate statements regarding the organization of genes in operons and suboperons as well as report the number of monocistronic operons and polycistronic operons (Figure 4).

Detection of sRNAs and their targets
The detection of sRNAs based on RNA-Seq data is a non-trivial task. While numerous sRNAs are found in intergenic regions, there are also examples of 3' UTR-derived sRNAs [40,[46][47][48]. ANNOgesic offers the detection of both classes, combined with a detailed characterization of the sRNA candidates.
In order to classify newly detected intergenic transcripts as sRNAs, ANNOgesic tests several of their features ( Figure 5A). If a BLAST+ [49] search of a transcript finds homologous sequences in BSRD [50] -a database that stores experimentally confirmed sRNAs -the transcript gets the status of an sRNA. The user can also choose further databases for searching homologous sequences. In case a search against the NCBI non-redundant protein database leads to a hit it is marked as potentially protein-coding. Otherwise, a transcript must have a predicted TSS, form a stable secondary structure (i.e. the folding energy change calculated with RNAfold from Vienna RNA package [51] must be below user defined value) and their length should be in the range of 30 to 500 nt in order to be tagged as an sRNA. All these requirements are used per default but can be modified or removed The two genes are located in the same operon, but also in different sub-operons (two empty red squares).
via ANNOgesic's command line parameters. ANNOgesic stores the results of all analyses and generates GFF3 files, fasta files, secondary structural figures, dot plots, as well as mountain plots based on those predictions.
For sRNAs that share a transcript with CDSs -5' UTR, inter-CDS, or 3' UTR located sRNAs -we implemented several detection heuristics ( Figure 5B / C). 5' UTR-derived sRNAs must start with a TSS and show a sharp drop of coverage or a PS in the 3'-end. The requirement for the detection of inter-CDS located sRNAs is either a TSS or a PS as well as a coverage drop at the 3'-end or a PS. Small RNAs derived from the 3' UTR are expected to have a TSS or a PS and either end with the transcript or at a PS. After the detection of a bona fide sRNA, the above described quality filters (length range, secondary structure etc.) are applied to judge the potential of a candidate (examples are shown in Supplementary Figure  7, 8). For the validation of sRNA candidates in our test case, the described sRNAs of two publications were chosen. Sharma et al. [8] described 63 sRNAs of which 4 were not expressed in the condition of the test data set (removed from the dataset) (Supplementary Figure 9). Of these 59, 53 were detected by ANNOgesic. In the C. jejuni 81116 set, 31 sRNAs were found by Dugar et al. [18], and ANNOgesic could recover 26 (84%) (Supplementary Figure 10).
In order to deduce potential regulatory functions of newly-predicted sRNAs, ANNOgesic performs prediction of interaction between them and mRNAs using RNAplex [51,52] and RNAup [51,53]. The user can chooose if only interactions supported by both tools are reported.

Detection of sORFs
All newly detected transcripts that do not contain a previously described CDS as well all 5' UTRs and 3' UTRs are scanned for potential sORFs [54] (Figure 6). For this, ANNOgesic searches for start and stop codons (non-canonical start codons are not included, but can be assigned by the user) that constitute potential ORFs of 30 to 150 base-pairs. Furthermore, ribosomal binding sites (based on the Shine-Dalgarno sequence, but different sequences can be assigned as well) between the TSS and 3 to 15 bp upstream of the start codon are required for a bona fide sORF.

Detection of functional related attributes
In order to facilitate a better understanding of the biological function of known and newly detected transcripts, ANNOgesic predicts several attributes for these features.
This includes the allocation of GO as well as GOslim [55] terms to CDSs via searching of protein ids in Uniprot [56]. The occurrence of groups is visualized for expressed and non-expressed CDSs (Supplementary Figure 11). Furthermore the subcellular localization is predicted by PSORTb [57] for the proteins (Supplementary Figure 12). Additionally, the protein entries are enriched by protein-protein interaction information retrieved from STRING [58] and PIE [59] (examples in Supplementary  Figure 13).

Circular RNAs
ANNOgesic integrates the tool "testrealign.x" from the segemehl package for the detection of circular RNAs [60] and adds a filter to reduce the number of false positive. Candidates for circular RNAs must be located in intergenic regions and exceed a given number of reads.

CRISPRs
CRISPR/Cas systems represent a bacterial defence system against phages and consist of repeat units and spacers sequences as well as Cas proteins [61]. The CRISPR Recognition Tool (CRT) [62] is integrated into ANNOgesic and extended by comparison of CRISPR/Cas candidates to other annotations to remove false positive (Supplementary Figure 14).

Riboswitches and RNA thermometers
Riboswitches and RNA thermometers are regulatory sequences that are part of transcripts and influence the translation based on the concentration of selected small molecules and temperature change, respectively. For the prediction of these riboswitches and RNA thermometers, ANNOgesic searches [63] the sequences which are between TSSs (or starting point of a transcript if no TSS was detected) and downstream CDSs, as well as associated with ribosome binding site in the Rfam database using Infernal [64].

DISCUSSION
While RNA-Seq has become a powerful method to annotate genomes, the integration of the data is usually very laborious and time-consuming. It requires bioinformatic expertise and involves the application of different programs to perform the different required steps. Here we presented ANNOgesic, a modular, user-friendly annotation pipeline for the analysis of bacterial RNA-Seq data that integrated several tools, optimizes their parameters, and includes novel prediction methods for several genomic features. With the help of this command-line tool, RNA-Seq data can be efficiently used to generate high-resolution annotations of bacterial genomes with very little manual effort. Besides the annotation files in standard formats, it also returns numerous statistics and visualizations that help the user to explore and to evaluate the results. While it ideally has conventional (fragmentation) RNA-Seq as well as dRNA-Seq as input (see Supplementary Figure 15), it can also perform sufficient predictions with only one class of data for the majority of the genomic features.
The performance of ANNOgesic has been here demonstrated by applying it on two published data sets and comparing the results to manually-conducted annotations. ANNOgesic could detect 90% and 83% of the manually-annotated sRNAs H. pylori 26695 and C. jejuni 81116, respectively. The sRNAs missed by ANNOgesic can be explained by low coverage, not being associated with TSSs, or lack of expression in the assayed conditions (see Supplementary Figure 16 and 17).
Besides the analyses presented as examples in this study (H. pylori 26695 and C. jejuni 81116), ANNOgesic was meanwhile successfully applied for detecting transcripts, sRNAs, and TSSs in additional annotation projects (e.g. Pseudomonas aeruginosa [65] and Rhodobacter sphaeroides [66]. Despite the fact that the program was developed mainly with a focus on bacterial genomes, it has also been used to annotate archaeal genomes (namely Methanosarcina mazei (Lutz et al., unpublished)) and eukaryotic genomes which have no introns (Trypanosoma brucei (Müller et al.,  unpublished)).
ANNOgesic is freely available under an OSI compliant open source license (ISCL) and an extensive documentation has been generated in order to guide the novice and advanced users.

FUNDING
The project was funded by the German Research Foundation (DFG) as part of the Transregio 34 (CRC-TRR34). This publication was funded by the German Research Foundation and the University of Würzburg in the funding program Open Access Publishing.
Diarmaid Tobin and Till Sauerwein for giving feedback regarding code and documentation.