Summary: RNA-Seq has become a potent and widely used method to qualitatively and quantitatively study transcriptomes. To draw biological conclusions based on RNA-Seq data, several steps, some of which are computationally intensive, have to be taken. Our READemption pipeline takes care of these individual tasks and integrates them into an easy-to-use tool with a command line interface. To leverage the full power of modern computers, most subcommands of READemption offer parallel data processing. While READemption was mainly developed for the analysis of bacterial primary transcriptomes, we have successfully applied it to analyze RNA-Seq reads from other sample types, including whole transcriptomes and RNA immunoprecipitated with proteins, not only from bacteria but also from eukaryotes and archaea.
Availability and implementation: READemption is implemented in Python and is published under the ISC open source license. The tool and documentation are hosted at http://pythonhosted.org/READemption (DOI:10.6084/m9.figshare.977849).
RNA-Seq, the examination of cDNA by massively parallel sequencing technologies, is a potent way to perform transcriptome analyses at single-nucleotide resolution and with a high dynamic range (Wang et al., 2009). It has been successfully used to annotate transcript boundaries and to identify novel transcripts such as small regulatory RNAs in both prokaryotes and eukaryotes (Filiatrault, 2011; Ozsolak and Milos, 2011). Most prominently, it can be applied to quantify the expression levels of genes, and it has been shown to be more powerful than microarrays at detecting changes in gene expression (Zhao et al., 2014). It can also be used to study the interaction of proteins and RNAs by performing RNA-Seq of coimmunoprecipitation (coIP) samples (König et al., 2012). Likewise, any other subset of RNA molecules from, for instance, RNA size-fractionation, ribosome profiling, metatranscriptomics or degradome profiling experiments can be sequenced. Owing to the decreasing costs and ever-increasing speed of deep sequencing, bioinformatic analysis has become a bottleneck of RNA-Seq–based projects.
We have created an automated RNA-Seq processing pipeline named READemption, initially to handle differential RNA-Seq (dRNA-Seq) data for the determination of transcriptional start sites in bacteria (Sharma et al., 2010; Sharma and Vogel, 2014). We saw the need for this, as other available RNA-Seq analysis pipelines (e.g. Delhomme et al., 2012; McClure et al., 2013) were not designed for this application. Additionally, while most available RNA-Seq pipelines put priority on fast mapping, we have chosen
READemption integrates the steps that are required to interpret and gain biological knowledge from RNA-Seq experiments in one tool and makes them accessible via a consistent command line interface. Additionally, it conducts parallel data processing to reduce the runtime. The tool performs quality trimming, poly(A) and adapter clipping as well as size filtering of raw cDNA reads from different sequencing platforms, mapping to reference sequences, coverage calculation, gene-based quantification and comparison of expression levels. A summary of the pipeline’s workflow is depicted in the flow chart in Figure 1A. Moreover, it provides several statistics such as read mappability and generates plots and files for the visualization of the results in genome browsers (for examples, see Fig. 1B).
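As an illustration of the read preprocessing described above, the following minimal Python sketch clips a trailing poly(A) stretch and then applies a size filter. It is not READemption's implementation: the function names, the minimum read length of 12 nt and the rule that ten or more trailing A's count as a poly(A) tail are assumptions chosen for this example only.

```python
import re

# Hypothetical values for illustration; not READemption's actual defaults.
MIN_READ_LENGTH = 12
# Assumption: a run of at least 10 A's at the 3' end is treated as a poly(A) tail.
POLY_A_TAIL = re.compile(r"A{10,}$")


def clip_poly_a(seq: str) -> str:
    """Remove a trailing poly(A) stretch from a read sequence."""
    return POLY_A_TAIL.sub("", seq)


def process_reads(reads, min_length=MIN_READ_LENGTH):
    """Yield (read_id, clipped_sequence) for reads that pass the size filter."""
    for read_id, seq in reads:
        clipped = clip_poly_a(seq.upper())
        if len(clipped) >= min_length:
            yield read_id, clipped


# Toy input: (identifier, sequence) pairs standing in for parsed FASTA/FASTQ records.
raw_reads = [
    ("read_1", "ACGTACGTACGTACGT" + "A" * 12),  # 16 nt remain after clipping
    ("read_2", "ACGT" + "A" * 15),              # 4 nt remain, so the read is discarded
    ("read_3", "ACGTACGTACGTACGT"),             # no poly(A) tail, kept unchanged
]
kept = list(process_reads(raw_reads))
```

Clipping before filtering matters here: a read that consists mostly of poly(A) tail becomes too short to map reliably only after the tail is removed.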
READemption was designed as a high-performance application and follows the concept of ‘convention over configuration’. This includes the use of established default parameter values and the approach that files are placed or linked into defined paths and are then treated accordingly. The names of the input read files are used to generate names for the associated output files. Despite this design principle, READemption offers several parameters that enable users to adapt its execution to their specific needs.
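The naming convention can be sketched as follows. This is only an illustration of the principle: the folder names, the `_alignments.bam` suffix and the helper functions are hypothetical and do not necessarily match READemption's actual paths or file names.

```python
from pathlib import Path

# Hypothetical folder layout for illustration; READemption's real paths may differ.
READ_DIR = Path("input/reads")
ALIGNMENT_DIR = Path("output/alignments")


def library_name(read_file: Path) -> str:
    """Derive a library name by stripping compression and FASTA/FASTQ suffixes."""
    name = read_file.name
    for suffix in (".gz", ".bz2"):
        if name.endswith(suffix):
            name = name[: -len(suffix)]
    for suffix in (".fastq", ".fq", ".fasta", ".fa"):
        if name.endswith(suffix):
            name = name[: -len(suffix)]
            break
    return name


def alignment_path(read_file: Path) -> Path:
    """Map an input read file to the output file generated from it."""
    return ALIGNMENT_DIR / f"{library_name(read_file)}_alignments.bam"


example = alignment_path(READ_DIR / "wild_type_rep1.fq.gz")
```

Because every output name is derived from the input name, no per-file configuration is needed; placing a read file in the expected folder is enough to determine where its results will be written.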
READemption provides the subcommands
Read processing and mapping: The fundamental tasks of preprocessing the input reads and aligning them to reference sequences are covered by the subcommand
Coverage calculation: Based on the read alignments provided in the BAM files, cDNA coverage files can be generated using the subcommand
Gene expression quantification: The read alignments can also be further used by the subcommand
Differential gene expression analysis: For pairwise expression comparison, the subcommand
Plotting: The final three subcommands called
We present an open source pipeline for the analysis of RNA-Seq data from all domains of life. READemption generates several output files that can be examined with common office suites, graphics programs and genome browsers. Its features make it a useful tool for anybody who is interested in the computational analysis of RNA-Seq data and has basic command line skills.
The authors thank members of the Sharma and Vogel groups, especially Thorsten Bischler and Lei Li, for testing and constructive feedback.
Funding: Work in the Sharma and Vogel laboratories is supported by the Bavarian Research Network for Molecular Biosystems (BioSysNet). The JV laboratory received financial support from a BMBF eBio grant RNAsys and DFG project VO 875/4-2. The CMS laboratory received financial support from the ZINF Young Investigator program at the Research Center for Infectious Diseases (ZINF, Würzburg, Germany), DFG project Sh580/1-1, and the Young Academy program of the Bavarian Academy of Sciences.
Conflict of interest: none declared.