Oqtans: the RNA-seq workbench in the cloud for complete and reproducible quantitative transcriptome analysis

We present Oqtans, an open-source workbench for quantitative transcriptome analysis that is integrated in Galaxy. Its distinguishing features include customizable computational workflows and a modular pipeline architecture that facilitates comparative assessment of tool and data quality. Oqtans integrates an assortment of machine learning-powered tools into Galaxy that show performance superior or equal to that of state-of-the-art tools. The implemented tools cover a complete transcriptome analysis workflow: short-read alignment, transcript identification/quantification and differential expression analysis. Oqtans and Galaxy facilitate persistent storage, data exchange and documentation of intermediate results and analysis workflows. We illustrate how Oqtans aids the interpretation of data from different experiments in easy-to-understand use cases. Users can easily create their own workflows and extend Oqtans by integrating specific tools. Oqtans is available as (i) a cloud machine image with a demo instance at cloud.oqtans.org, (ii) a public Galaxy instance at galaxy.cbio.mskcc.org, (iii) a git repository containing all installed software (oqtans.org/git), most of which is also available from (iv) the Galaxy Toolshed, and (v) a share string to use along with Galaxy CloudMan.
Contact: vipin@cbio.mskcc.org, ratschg@mskcc.org
Supplementary information: Supplementary data are available at Bioinformatics online.


Introduction
The majority of RNA-seq analyses require four essential steps: sequencing, read mapping, transcript prediction, and quantification. The sheer number of different software programs available for the same task can be overwhelming. For instance, roughly a dozen published tools specifically align RNA-seq reads to a reference genome and take into account or detect novel splicing events (PALMapper (Jean et al., 2010; De Bona et al., 2008), TopHat, MapSplice (Wang et al., 2010), SpliceMap (Au et al., 2010), etc.), and there are likely many more tools for this purpose. It is difficult for researchers to determine which ones are best suited to their experimental setup. The difficulty is twofold: first, to find the most accurate or appropriate program for each task, and second, to combine several programs effortlessly into a complete pipeline.

Availability of the Oqtans-enabled images
We have extended a virtual machine image that can be used with the tools we have created. These tools are released under an open-source license (GPL). The machine image we used is publicly available (as "ami-65376a0c") from Amazon Web Services (AWS) and can be launched directly in an EC2 environment. The following basic steps are required to create a new Oqtans instance in the Amazon EC2 cloud: (a) create an account with AWS (e.g., a free tier account) and obtain security credentials, (b) use the "Request Instances Wizard" and create an instance based on an Oqtans image (i.e., ami-65376a0c and instance type m1.large), (c) enter the security credentials as "User Data" for the new instance, (d) define access rules and allow http access, and finally, (e) launch the instance. The instance will shortly be available with a ready-to-use Galaxy server. Then you can (f) execute the Oqtans setup script. Detailed instructions are available at oqtans.org/instantiate. Cloud service providers offer persistent storage of results, a service invaluable for science: once an analysis for a publication is complete, the entire machine image and all dependent data files can be archived, ready to be run again with all original data and parameter settings in place. We found it easiest to create a fresh instance for each project, which can then be independently archived, removed, or distributed. Persistence, in this case, is limited by the cloud provider's contract with whoever pays for it. To ensure scientific reproducibility, it should be broadly explored how data from publicly funded research can best be made publicly accessible in a sustainable manner.
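Steps (b) to (e) above can also be scripted rather than clicked through in the wizard. The following is a minimal sketch, not the procedure documented at oqtans.org/instantiate. It assumes the boto3 library, an existing security group that already allows http access, and the region us-east-1; the function names and the region are illustrative assumptions. The content of the user data is passed through unchanged, since its exact format is described in the detailed instructions and not here.

```python
def build_launch_request(user_data, security_group_ids,
                         image_id="ami-65376a0c", instance_type="m1.large"):
    """Assemble the keyword arguments for boto3's EC2 run_instances call.

    user_data: the security credentials text entered as "User Data" (step c).
    security_group_ids: groups whose rules allow http access (step d);
        they are assumed to exist already.
    """
    if not user_data.strip():
        raise ValueError("user data with the security credentials must not be empty")
    return {
        "ImageId": image_id,            # Oqtans image named in the text
        "InstanceType": instance_type,  # instance type named in the text
        "MinCount": 1,
        "MaxCount": 1,
        "UserData": user_data,
        "SecurityGroupIds": list(security_group_ids),
    }


def launch_oqtans_instance(request, region_name="us-east-1"):
    """Launch the instance (step e); needs boto3 and valid AWS credentials.

    The region is an assumption: machine images are region-specific.
    """
    import boto3  # imported lazily so the request builder works without boto3

    ec2 = boto3.client("ec2", region_name=region_name)
    response = ec2.run_instances(**request)
    return response["Instances"][0]["InstanceId"]
```

Separating the request builder from the launch call allows the parameters to be inspected or archived with a project before any billable resource is created.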

Installation of Oqtans tools and images
The current version of Oqtans can be downloaded from our public git repository git@github.com:ratschlab/oqtans.git into an existing Galaxy cloud instance or a local Galaxy installation. To enable the tools, users have to include the tool descriptions in the tool_conf.xml file of the running instance (for detailed instructions, see oqtans.org/install).
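As an illustration of what including a tool description in tool_conf.xml involves, the sketch below adds a tool entry to a section of the file using only the Python standard library. It assumes the usual layout of a toolbox root element containing section elements with tool children; the section id, section name and tool file path used in the usage note are hypothetical placeholders, and the authoritative steps remain those at oqtans.org/install.

```python
import xml.etree.ElementTree as ET


def register_tool(tool_conf_path, section_id, section_name, tool_file):
    """Add a <tool file="..."/> entry to a section of tool_conf.xml.

    Returns True if the file was changed, False if the tool was already listed.
    """
    tree = ET.parse(tool_conf_path)
    toolbox = tree.getroot()

    # Reuse the section if it exists, otherwise append a new one.
    section = None
    for candidate in toolbox.findall("section"):
        if candidate.get("id") == section_id:
            section = candidate
            break
    if section is None:
        section = ET.SubElement(
            toolbox, "section", {"id": section_id, "name": section_name}
        )

    # Avoid registering the same tool description twice.
    if any(tool.get("file") == tool_file for tool in section.findall("tool")):
        return False

    ET.SubElement(section, "tool", {"file": tool_file})
    tree.write(tool_conf_path, encoding="utf-8", xml_declaration=True)
    return True
```

A call such as register_tool("tool_conf.xml", "oqtans", "Oqtans", "oqtans_tools/example_tool.xml") would register one hypothetical tool. ElementTree discards XML comments when the file is rewritten, so it is advisable to work on a backup copy, and Galaxy typically needs a restart before the new entry appears.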
The Oqtans tools are also available individually via the Galaxy Toolshed (toolshed.g2.bx.psu.edu). Upcoming versions of Galaxy and the Toolshed will allow fully automated installation and integration of new tools (personal communication, Galaxy Developer Team). Once the Galaxy Toolshed is fully operational, we will provide versions of the tools that install automatically within a running Galaxy instance.

Evaluations in Figure 2
From the Short Read Archive (SRA) we downloaded reads with accessions SRX019652 (three-day-old female adults) and SRX019653 (three-day-old male adults) of a D. melanogaster wild-type strain commonly used in laboratories, called Canton-Special. The data consist of two sets of around 25 million (female) and 15 million (male) 75 bp paired-end reads generated with the Illumina Genome Analyzer II. We used two short-read alignment programs to align the paired-end, spliced reads, namely TopHat version 1.1.4 and PALMapper version 0.4 (Jean et al., 2010). We used the FlyBase annotation together with the evaluation tool described in Jean et al. (2010) to estimate the intron prediction accuracy (sensitivity and specificity were computed and combined into the displayed F-score). A new version of this tool is also available on our Galaxy instance galaxy.cbio.mskcc.org (section "NGS: Evaluation", tool "Compare Spliced Alignment to Annotation").
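The intron-level accuracy measure can be sketched as follows. This is an illustration under stated assumptions, not the evaluation tool itself: it assumes that the F-score is the harmonic mean of sensitivity and specificity, that specificity is meant in the gene-finding sense (the fraction of predicted introns that are annotated), and that an intron counts as correct only if both boundaries match exactly. The tuple layout (chromosome, strand, start, end) and the function name are hypothetical.

```python
def intron_accuracy(predicted, annotated):
    """Return (sensitivity, specificity, F-score) for two collections of introns.

    Each intron is a hashable tuple such as (chromosome, strand, start, end).
    """
    predicted = set(predicted)
    annotated = set(annotated)
    true_positives = len(predicted & annotated)

    # Sensitivity: fraction of annotated introns that were recovered.
    sensitivity = true_positives / len(annotated) if annotated else 0.0
    # Specificity (gene-finding sense): fraction of predictions that are annotated.
    specificity = true_positives / len(predicted) if predicted else 0.0

    if sensitivity + specificity == 0:
        f_score = 0.0
    else:
        # Harmonic mean of the two measures.
        f_score = 2 * sensitivity * specificity / (sensitivity + specificity)
    return sensitivity, specificity, f_score
```

With four annotated introns and three predicted introns, two of which match, sensitivity is 2/4, specificity is 2/3 and the F-score is 4/7.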
To generate Figure 2b, we used a C. elegans dataset and followed the same steps as for Figure 2b of Görnitz et al. (2011a), also comparing Cufflinks and mTIM. A major difference is that in Görnitz et al. (2011a) we used the same alignments for both methods, whereas here we use TopHat alignments for Cufflinks and PALMapper alignments for mTIM.

Supplementary Use Case: Gene family expression in Arabidopsis thaliana
In the second use case, we computed and visualized fractions of unexpressed, expressed, and differentially expressed gene families. Different gene families often behave differently when comparing the expression levels of two natural accessions (strains) from the same species. In this example, we examined two strains of the model plant Arabidopsis thaliana. This example is taken from the study of genomes and transcriptomes of multiple Arabidopsis strains (Gan et al., 2011), which compared the reference sequence Col-0 (Columbia) to the accession known as Can-0 (Canary Islands). The latter accession comes from a population that was isolated for a long time and shows many differences from the reference sequence. Comparing lists of differentially expressed genes among different strains of the same species leads to interesting biological insights. For example, in different Arabidopsis accessions, the genes encoding the plants' "immune system" (pathogen defense and production of glucosinolates to deter herbivores) are the most differentially expressed group. For accessions that are found at different latitudes around the globe, as is the case in our example, genes associated with flowering time show stark contrasts. As mentioned in Gan et al. (2011), we expect striking expression polymorphisms for the type II MADS box transcription factor family, which includes genes specific to flowering, whereas housekeeping genes are much more constant across different accessions.
The entire pipeline for this comparison consists of aligning short reads, quantifying them, testing for differential expression, assigning genes to their families, and visualizing the result (Figure 1). We downloaded the aligned read data for the accessions Col-0 and Can-0 from the resources website of the 19 genomes of Arabidopsis thaliana project by Gan et al. (2011) (bioweb.me/19g). In total, between 1,241,437 and 4,920,935 reads had been aligned with PALMapper (Jean et al., 2010). These alignments were used for in silico quantification.
With DESeq, which we integrated into Oqtans, we counted the number of reads mapping within each unique exonic region of the genes in the TAIR annotation (Lamesch et al., 2012) that mapped to the accessions' genome coordinates. We also used DESeq to test for differential expression of all 65,238 annotated features (i.e., genes, pseudogenes, transposable elements and others) between the two accessions of interest.
Employing the conservative Bonferroni correction for multiple testing, we obtained an adjusted p-value for differential expression of each gene. From the TAIR database (Lamesch et al., 2012), we downloaded information about gene names and their families. Finally, we applied GeneSetter to display the fractions of expressed, differentially expressed and non-expressed genes per family (see Supplementary Figure S2, which is very similar to Figure 4B in Gan et al. (2011)). With our tool GeneSetter (Supplementary Table S1), gene lists with meta information that are proper subsets, differences, and complements of one another can be plotted. The resulting figures are versatile visualizations of the annotation and the corresponding differences between the lists. Examples include the overrepresentation of transcription factor binding sites in regulatory regions of a gene, as used within KIRMES (Schultheiss et al., 2009), or genes that have a certain GO term in common, for instance from the first use case.
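The last two steps, multiple-testing correction and the per-family summary, can be illustrated with a short sketch. It is not the code of DESeq or GeneSetter. It assumes that the Bonferroni-adjusted p-value is the raw p-value multiplied by the number of tests and capped at 1, that a gene with zero reads in both accessions counts as not expressed, and that a significance threshold of 0.05 is used; the threshold, the function names and the input layout are illustrative assumptions. The label "expressed" here means expressed but not differentially expressed.

```python
from collections import Counter, defaultdict


def bonferroni(p_values):
    """Multiply each p-value by the number of tests and cap the result at 1."""
    n_tests = len(p_values)
    return {gene: min(1.0, p * n_tests) for gene, p in p_values.items()}


def family_fractions(families, counts, adjusted_p, alpha=0.05):
    """Per family, return the fraction of genes in each of three classes.

    families: gene -> family name
    counts: gene -> (reads in accession 1, reads in accession 2)
    adjusted_p: gene -> adjusted p-value for differential expression
    alpha: significance threshold (an assumption, not taken from the text)
    """
    tallies = defaultdict(Counter)
    for gene, family in families.items():
        total_reads = sum(counts.get(gene, (0, 0)))
        if total_reads == 0:
            label = "not expressed"
        elif adjusted_p.get(gene, 1.0) < alpha:
            label = "differentially expressed"
        else:
            label = "expressed"
        tallies[family][label] += 1

    # Convert the per-family tallies into fractions that sum to 1.
    return {
        family: {label: n / sum(tally.values()) for label, n in tally.items()}
        for family, tally in tallies.items()
    }
```

The resulting dictionary of fractions per family is the kind of table that a stacked bar chart per gene family could be drawn from.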

Supplementary Figures and Tables
Supplementary Figure