RiboReport - benchmarking tools for ribosome profiling-based identification of open reading frames in bacteria

Abstract Small proteins encoded by short open reading frames (ORFs) with 50 codons or fewer are emerging as an important class of cellular macromolecules in diverse organisms. However, they often evade detection by proteomics or in silico methods. Ribosome profiling (Ribo-seq) has revealed widespread translation in genomic regions previously thought to be non-coding, driving the development of ORF detection tools using Ribo-seq data. However, only a handful of tools have been designed for bacteria, and these have not yet been systematically compared. Here, we aimed to identify tools that use Ribo-seq data to correctly determine the translational status of annotated bacterial ORFs and also discover novel translated regions with high sensitivity. To this end, we generated a large set of annotated ORFs from four diverse bacterial organisms, manually labeled for their translation status based on Ribo-seq data, which are available for future benchmarking studies. This set was used to investigate the predictive performance of seven Ribo-seq-based ORF detection tools (REPARATION_blast, DeepRibo, Ribo-TISH, PRICE, smORFer, ribotricer and SPECtre), as well as IRSOM, which uses coding potential and RNA-seq coverage only. DeepRibo and REPARATION_blast robustly predicted translated ORFs, including sORFs, with no significant difference for ORFs in close proximity to other genes versus stand-alone genes. However, no tool predicted a set of novel, experimentally verified sORFs with high sensitivity. Start codon predictions with smORFer show the value of initiation site profiling data to further improve the sensitivity of ORF prediction tools in bacteria. Overall, we find that bacterial tools perform well for sORF detection, although there is potential for improving their performance, applicability, usability and reproducibility.


A Introduction
This document contains supplemental material for RiboReport - Benchmarking tools for ribosome profiling-based identification of open reading frames in bacteria.

B Validation of labeling method
We validated our labeling approach by comparison to published mass spectrometry (MS) datasets (proteomics) available for the same strains grown under similar conditions. Comparing the translated ORFs identified by our labeling method to those detected in the retrieved proteomics data for each organism showed that the majority of genes labeled as translated based on Ribo-seq were also detected by MS (on average 83.81% across all four organisms; Supplemental Figure S1), thereby validating our labeling procedure.
To further showcase the overlap between manual labeling in the genome browser and MS detection in generating a robust benchmark set, we selected a set of conserved and highly translated ORFs that could be compared across all four organisms: a long ribosomal protein island that is highly conserved in bacteria (genes rpmJ to secY), which also features several sORFs of less than 50 aa. The genome browser tracks show that for E. coli, P. aeruginosa, and L. monocytogenes, both MS and labeling detected all 22 of these ORFs, and that the entire island showed strong Ribo-seq coverage (Supplemental Figures S2A-S2C). In contrast, while all 22 ORFs were labeled as translated in S. Typhimurium, only five were detected by MS, consistent with the generally lower sensitivity of the S. Typhimurium MS dataset (Supplemental Figure S1).
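For illustration, the label-versus-MS comparison in Supplemental Figure S1 reduces to a set intersection. The following minimal Python sketch (not the actual analysis code of the pipeline; the gene identifiers are hypothetical) computes the percentage of MS-detected ORFs that were also labeled as translated:

```python
# Minimal sketch of the label-vs-MS comparison: given the set of ORFs labeled
# as translated from Ribo-seq and the set of ORFs detected by mass
# spectrometry, report the percentage of MS ORFs also labeled as translated.

def percent_ms_recovered(labeled_translated: set[str], ms_detected: set[str]) -> float:
    """Percent of MS-detected ORFs that were also labeled as translated."""
    if not ms_detected:
        return 0.0
    return 100.0 * len(ms_detected & labeled_translated) / len(ms_detected)

# Hypothetical example values for one organism:
labels = {"rpmJ", "rpsM", "secY", "katE"}
ms = {"rpmJ", "rpsM", "secY", "ydjX"}
print(f"{percent_ms_recovered(labels, ms):.1f}% of MS ORFs recovered by labeling")
```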
As an additional check of our labeling procedure, we also labeled a selection of tRNAs and annotated non-coding RNAs based on our de novo E. coli dataset (data not shown). None of the 86 tRNAs examined was called translated based on comparison of Ribo-seq and RNA-seq coverage. Several well-characterized ncRNAs were also labeled as not translated, such as the regulatory small RNAs MicA (Supplemental Figure S3A), Spot 42, SdsR, and GlmY, as well as the housekeeping RNA components of the signal recognition particle (SRP RNA) and RNase P (RnpB) (data not shown). However, some known non-coding RNAs were labeled as translated, such as CsrC (Supplemental Figure S3B) and tmRNA (data not shown). Some of these false positives could be due to resistance to RNase digestion or to association with ribosomes (tmRNA). However, some, such as the annotated non-coding RNA RyeG, were recently shown to encode translated sORFs (Supplemental Figure S3C, yodE, 48 aa) [1]. We also inspected katE, an ORF that was detected by MS but not labeled as translated. This ORF was not labeled as translated because of low coverage, generally below 10 relative reads (Supplemental Figure S3D); a sketch of this coverage-based intuition follows the figure captions below. Its detection by MS suggests either that the conditions for Ribo-seq and MS were slightly different, or that KatE is a relatively stable protein translated at an earlier growth phase from an mRNA that is not highly expressed during exponential growth. Nonetheless, while these examples (Supplemental Figures S3B and S3D) highlight some limitations of our labeling approach, overall the overlap of labels and MS, together with the accurate labeling of known non-coding RNAs and sORFs, demonstrates the validity of our approach.

C Supplementary Figures

Fig. S1: Comparison between ORFs labeled as translated with Ribo-seq data and ORFs detected with proteomics. The intersection between annotated ORFs labeled as translated based on Ribo-seq data (Labels) and those detected by mass spectrometry (MS) for the four benchmark datasets. The majority of the genes detected by MS are also present in the labeled benchmark dataset. The percentages of MS ORFs that were also found by our labeling method are as follows: E. coli, 90%; L. monocytogenes, 90%; P. aeruginosa, 86%; S. Typhimurium, 68%; on average, 83.8%. The large overlap validates the manual labeling strategy we employed to create the benchmark datasets.

Fig. S2: Comparison of manual Ribo-seq labels and proteomics (MS) for conserved ribosomal protein ORFs from S. Typhimurium (A), P. aeruginosa (B), and L. monocytogenes (C). Related to main Figure 3. The conserved region between rpmJ and secY encodes several highly expressed ORFs of diverse length. Corresponding genes between the organisms are labeled with the same colour in the annotation. Genes detected in the publicly available MS dataset and by manual curation of the Ribo-seq data (label) are dark grey. Detection by the indicated tools at a 70% overlap threshold based on Ribo-seq data (or RNA-seq for IRSOM) is indicated in light grey. ORFs that were not detected are white. Transcriptional start sites, if available, are indicated with a bent arrow (+1). Hatched arrows indicate that predictions could not be generated for the genome/dataset.

Fig. S3: Assessing the quality and accuracy of manual curation for well-characterized and/or validated E. coli genes. Examples of a true-negative, false-positive, true-positive, and false-negative manual curation were selected from the E. coli dataset.
(A) The non-coding base-pairing regulatory RNA MicA was correctly labeled as not translated based on Ribo-seq coverage and is a true negative. (B) The non-coding RNA CsrC shows coverage in the Ribo-seq library and was falsely labeled as translated (false positive). (C) The newly validated sORF yodE (48 aa) [1], encoded on the annotated RyeG sRNA, was labeled as translated based on Ribo-seq (true positive). (D) The katE ORF was not labeled as translated and shows low coverage in the RNA-seq and Ribo-seq libraries, but has been detected by MS under similar growth conditions (false negative). Tool predictions and MS data for annotated coding genes are also included for comparison. Detection (grey arrows) or no detection (white arrows) was determined at a 70% overlap threshold.

Fig. S4: The ORFs annotated in the syntenic cydAB/cioAB regions, encoding terminal oxidase complexes, were inspected for MS detection and for Ribo-seq-based detection by either manual labeling or the computational tools. Corresponding genes between the organisms are labeled in the same colour in the annotation track. Genes encoding associated small protein components of the complexes (pink arrow) are so far annotated only in P. aeruginosa (cioZ) and have not yet been detected in the L. monocytogenes cydAB operon [2]. Spaces between cydA and cydB are as follows: S. Typhimurium, 15 nt; P. aeruginosa, 3 nt; L. monocytogenes, 14 nt overlap. For all panels, detection (grey arrows) or no detection (white arrows) was determined at a 70% overlap threshold. Hatched genes indicate that predictions could not be generated for the genome/dataset.

Fig. S5: Examples of manual curation, MS detection, and tool predictions for E. coli. (A) A set of eight lowly transcribed polycistronic ORFs (ydjX-ynjE) was not labeled as translated based on Ribo-seq because of low overall coverage, but protein products of three of these ORFs were detected by MS under similar growth conditions. (B) The overlapping btuB-murI operon. Both ORFs were detected in the MS dataset, labeled as translated based on Ribo-seq, and also identified by DeepRibo, REPARATION_blast, ribotricer, smORFer, and IRSOM. (C) The leaderless ORF rluC, which was detected by MS and labeled based on Ribo-seq, was also detected by the prediction tools DeepRibo, REPARATION_blast, smORFer, ribotricer, and IRSOM, even without a Shine-Dalgarno sequence. (D) The dual-function RNA SgrS acts as a base-pairing repressor and also encodes the small protein SgrT (43 aa) [3]. SgrT was not detected by MS under these conditions but was labeled as translated based on Ribo-seq coverage and was detected by DeepRibo and REPARATION_blast. For all panels, detection (grey arrows) or no detection (white arrows) was determined at a 70% overlap threshold.

Fig. S6: DeepRibo prediction scores. The predictions are split into annotated (orange) and novel (blue) ORF predictions. Novel predictions are those having a score of '-1' for the 'dist' parameter, as described in the DeepRibo documentation. For E. coli, the prediction score of the last predicted annotated ORF is -12.14 and the number of novel predictions after this ORF is 42,879. For L. monocytogenes, the prediction score of the last predicted annotated ORF is -11.13 and the number of novel predictions after this ORF is 45,282. For P. aeruginosa, the prediction score of the last predicted annotated ORF is -12.01 and the number of novel predictions after this ORF is 37,470. For S. Typhimurium, the prediction score of the last predicted annotated ORF is -14.8 and the number of novel predictions after this ORF is 36,272.
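The coverage-based intuition behind the manual labels, referenced above for katE, can be sketched as a simple rule. Note that the benchmark labels were assigned by manual inspection in a genome browser, not by this rule; the function below is only an illustration, and the min_ratio parameter is our assumption:

```python
# Illustrative sketch only: an ORF with sufficient Ribo-seq coverage relative
# to its RNA-seq coverage is a candidate for the "translated" label. The
# min_ribo default of 10 echoes the katE example ("below 10 relative reads");
# min_ratio is an assumed parameter, not part of the published procedure.

def candidate_translated(ribo_cov: float, rna_cov: float,
                         min_ribo: float = 10.0, min_ratio: float = 0.5) -> bool:
    """Return True if coverage supports a 'translated' candidate label."""
    if ribo_cov < min_ribo:          # e.g. katE: coverage below ~10 relative reads
        return False
    if rna_cov > 0 and ribo_cov / rna_cov < min_ratio:
        return False                  # transcribed but poorly occupied by ribosomes
    return True

print(candidate_translated(ribo_cov=120.0, rna_cov=80.0))  # True
print(candidate_translated(ribo_cov=4.0, rna_cov=30.0))    # False (katE-like)
```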

D Supplementary Result Tables
The following supplemental result tables are also available as Google Spreadsheets for exporting and further analysis.

Table S1: Statistical evaluation of tool performance for translatome ORFs for the E. coli benchmark dataset using overlap thresholds of 0.01, 0.7, or 0.9. The overlap threshold is the percentage of the ORF length that must be covered by the prediction; it must be fulfilled in both directions, i.e. by the prediction with respect to the labeled ORF and vice versa.

Table S2: Statistical evaluation of the translatome ORFs for the L. monocytogenes benchmark dataset using overlap thresholds of 0.01, 0.7, or 0.9. A detailed column description can be found in Table S1.

Table S3: Statistical evaluation of the translatome ORFs for the P. aeruginosa benchmark dataset using overlap thresholds of 0.01, 0.7, or 0.9. A detailed column description can be found in Table S1.

Table S4: Statistical evaluation of the translatome ORFs for the S. Typhimurium benchmark dataset using overlap thresholds of 0.01, 0.7, or 0.9. A detailed column description can be found in Table S1.

Table S5: Statistical evaluation of ORFs located in operons for the E. coli benchmark dataset using overlap thresholds of 0.01, 0.7, or 0.9. A detailed column description can be found in Table S1.

Table S6: Statistical evaluation of ORFs located in operons for the L. monocytogenes benchmark dataset using overlap thresholds of 0.01, 0.7, or 0.9. A detailed column description can be found in Table S1.

Table S7: Statistical evaluation of ORFs located in operons for the P. aeruginosa benchmark dataset using overlap thresholds of 0.01, 0.7, or 0.9. A detailed column description can be found in Table S1.

Table S8: Statistical evaluation of ORFs located in operons for the S. Typhimurium benchmark dataset using overlap thresholds of 0.01, 0.7, or 0.9. A detailed column description can be found in Table S1.

Table S9: Statistical evaluation of ORFs located outside operons for the E. coli benchmark dataset using overlap thresholds of 0.01, 0.7, or 0.9. A detailed column description can be found in Table S1.

Table S10: Statistical evaluation of ORFs located outside operons for the L. monocytogenes benchmark dataset using overlap thresholds of 0.01, 0.7, or 0.9. A detailed column description can be found in Table S1.

Table S11: Statistical evaluation of ORFs located outside operons for the P. aeruginosa benchmark dataset using overlap thresholds of 0.01, 0.7, or 0.9. A detailed column description can be found in Table S1.

Table S12: Statistical evaluation of ORFs located outside operons for the S. Typhimurium benchmark dataset using overlap thresholds of 0.01, 0.7, or 0.9. A detailed column description can be found in Table S1.

Table S13: Statistical evaluation of sORFs for the E. coli benchmark dataset using overlap thresholds of 0.01, 0.7, or 0.9. A detailed column description can be found in Table S1.

Table S14: Statistical evaluation of sORFs for the P. aeruginosa benchmark dataset using overlap thresholds of 0.01, 0.7, or 0.9. A detailed column description can be found in Table S1.

Table S15: Statistical evaluation of sORFs for the S. Typhimurium benchmark dataset using overlap thresholds of 0.01, 0.7, or 0.9. A detailed column description can be found in Table S1. No sORF results could be found for the L. monocytogenes benchmark dataset, which is therefore not covered by a sORF table.
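For clarity, the reciprocal overlap criterion used throughout Tables S1-S15 can be expressed as a short function. This is an illustrative sketch, not the evaluation code of the pipeline; the half-open interval convention and the names are our assumptions:

```python
# Sketch of the reciprocal overlap criterion: a prediction matches a labeled
# ORF only if the overlap covers at least `threshold` of the labeled ORF
# *and* at least `threshold` of the prediction. Intervals are half-open
# (start, end) coordinates on the same strand.

def reciprocal_overlap(label: tuple[int, int], pred: tuple[int, int],
                       threshold: float = 0.7) -> bool:
    """True if label and prediction mutually overlap by >= threshold."""
    overlap = min(label[1], pred[1]) - max(label[0], pred[0])
    if overlap <= 0:
        return False
    return (overlap / (label[1] - label[0]) >= threshold and
            overlap / (pred[1] - pred[0]) >= threshold)

print(reciprocal_overlap((100, 400), (110, 400), threshold=0.7))  # True
print(reciprocal_overlap((100, 400), (350, 900), threshold=0.7))  # False
```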

E Evaluation of key results
To follow good benchmarking practice as proposed by Mangul et al. [5], we adapted one of their example summary figures [6] to our benchmark scenario. Here, we propose a straightforward evaluation system to summarize the performance of all evaluated tools. The evaluation results are visualized in Figure 5 of the main text. The evaluation system rates the performance of each tool in several categories as follows: a violet circle for superior performance, a light blue circle for satisfactory performance, and a dark blue circle for unsatisfactory performance. How each tool is rated in each category is described below.

E.3 Prediction of novel sORFs
The predictive power for finding novel ORFs was tested using 33 verified novel ORFs outside of the annotation ([7], see main text). Only DeepRibo was able to find 18/19 of the 33 novel ORFs and therefore has Satisfactory predictive power. REPARATION_blast, which found only two, and Ribo-TISH and IRSOM, which found none of the novel ORFs, show Unsatisfactory performance.

E.4 Runtime
The runtime comparison can be found in the main document (Table 7). We evaluated tools using single- and multithreading. Since not every tool supports multithreading, we took the minimum runtime for each tool from either single- or multithreaded runs. Note that the original REPARATION tool should be faster than REPARATION_blast, since it uses ublast rather than blast.
Superior: test data computed in less than 30 minutes.
Satisfactory: test data computed in less than 2 hours.
Unsatisfactory: test data computed in more than 2 hours.
(A code sketch of these rating rules, together with the memory rules, is given at the end of Section E.5.)

E.5 Memory
The memory comparison can be found in the main document (Table 7). We evaluated tools using single- and multithreading. Since not every tool supports multithreading, as for the runtime evaluation we took the run with the minimum runtime for each tool from either approach.
Superior: 2 GB or less.
Satisfactory: 4 GB or less.
Unsatisfactory: more than 4 GB.
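The runtime and memory rating rules of Sections E.4 and E.5 can be summarized in code. This is a minimal sketch of the stated thresholds, not part of the benchmark pipeline; runtime is in minutes and memory in GB:

```python
# Sketch of the rating rules from Sections E.4 (runtime) and E.5 (memory).

def rate_runtime(minutes: float) -> str:
    if minutes < 30:
        return "superior"
    if minutes < 120:
        return "satisfactory"
    return "unsatisfactory"

def rate_memory(gb: float) -> str:
    if gb <= 2:
        return "superior"
    if gb <= 4:
        return "satisfactory"
    return "unsatisfactory"

print(rate_runtime(25), rate_memory(3.5))  # superior satisfactory
```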

E.6 Applicability
Applicability describes how universally a tool can be applied. Each of the following criteria gives one point if fulfilled:
1. Can use replicates
2. Is deterministic
3. Outputs a standard file format (gff/bed/...)
4. Uses unit testing or some other correctness declaration
5. Stable results across different organisms
Based on the number of points achieved, each tool's applicability is rated as follows (a code sketch of this scoring is given at the end of this section):
Superior: 4 or more points fulfilled.
Satisfactory: 2 or more points fulfilled.
Unsatisfactory: 1 or fewer points fulfilled.
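The point-based applicability rating can likewise be sketched as follows; the example tool profile at the end is hypothetical:

```python
# Sketch of the applicability scoring from Section E.6: one point per
# fulfilled criterion, then the point total is mapped to a rating.

APPLICABILITY_CRITERIA = (
    "can use replicates",
    "is deterministic",
    "outputs a standard file format",
    "uses unit testing or other correctness declaration",
    "stable results across organisms",
)

def rate_applicability(fulfilled: dict[str, bool]) -> str:
    points = sum(fulfilled.get(c, False) for c in APPLICABILITY_CRITERIA)
    if points >= 4:
        return "superior"
    if points >= 2:
        return "satisfactory"
    return "unsatisfactory"

example_tool = {c: True for c in APPLICABILITY_CRITERIA[:3]}  # 3 points (hypothetical)
print(rate_applicability(example_tool))  # satisfactory
```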

E.7 Usability
User friendliness, or usability, is one of the key factors determining how convenient a tool is to apply. As for applicability, each of the following criteria gives one point if fulfilled by the tested tool:
2. Package managers
3. Example data
4. Version control
5. Documentation: parameters
6. Documentation: input
7. Documentation: output
8. Documentation: dependencies
9. Open source

* REPARATION uses a proprietary, closed-source sequence search tool, which we replaced with the free and open-source tool blast to make the software usable without a fee. This version is called REPARATION_blast.