Fernando Mora-Márquez, José Luis Vázquez-Poletti, Unai López de Heredia, NGScloud: RNA-seq analysis of non-model species using cloud computing, Bioinformatics, Volume 34, Issue 19, October 2018, Pages 3405–3407, https://doi.org/10.1093/bioinformatics/bty363
Abstract
RNA-seq analysis usually requires large computing infrastructures. NGScloud is a bioinformatic system developed to analyze RNA-seq data using Amazon’s cloud computing services, which provide access to ad hoc computing infrastructure scaled to the complexity of the experiment, so that costs and run times can be optimized. The application provides a user-friendly front-end to operate Amazon’s hardware resources and to control an RNA-seq analysis workflow oriented to non-model species. It incorporates the cluster concept, which allows common RNA-seq analysis programs to run in parallel on several virtual machines for faster analysis.
NGScloud is freely available at https://github.com/GGFHF/NGScloud/. A manual detailing installation and how-to-use instructions is available with the distribution.
1 Introduction
RNA-seq experiments often yield huge amounts of data, especially when several NGS libraries are involved. The algorithms used in the bioinformatic analyses are very complex, particularly those for read assembly (Miller et al., 2010). Thus, the hardware requirements of RNA-seq analysis are very high in terms of CPUs and GiB of RAM, and computing infrastructure that fulfills such requirements is not always available in small research centers. In such cases, cloud computing is a solution that provides resizable computing capacity and therefore allows the hardware to be fitted to the nature of the experimental data. One of the main cloud computing solutions is the Elastic Compute Cloud (EC2), a service of Amazon Web Services (AWS). EC2 offers a wide range of scalable instances that allow experiment costs to be optimized, because the user pays only for the time the resources are in use. EC2 also provides immediacy, since a virtual machine can be booted in only a few minutes.
We present NGScloud, a bioinformatic system developed to analyze RNA-seq data using the cloud computing offered by EC2. NGScloud is oriented to non-model species whose reference genomes are not available, and it implements parallel runs on several virtual machines for faster analysis. The application aims to make it easier for researchers to use EC2 resources and to perform RNA-seq analysis.
2 Materials and methods
2.1 Software
NGScloud was programmed in Python3, and it runs on any computer whose OS supports Python3: Linux, Microsoft Windows, Mac OS X and other platforms. To work properly, NGScloud requires the following dependencies on the user's local computer: (i) StarCluster, an open source cluster-computing toolkit for EC2 (http://star.mit.edu/cluster/); (ii) Boto3, the AWS SDK for Python (https://boto3.readthedocs.io/); (iii) Paramiko, an implementation of the SSHv2 protocol in Python (http://www.paramiko.org/) and (iv) AWS CLI (https://aws.amazon.com/cli/).
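As a rough illustration of how the dependencies listed above could be checked on a local computer, the following sketch looks for the two Python packages and the two command-line tools. It is not taken from NGScloud. The function name `missing_dependencies` is hypothetical, and the sketch assumes that Boto3 and Paramiko are importable as `boto3` and `paramiko` and that StarCluster and AWS CLI are installed as the `starcluster` and `aws` commands.

```python
import importlib.util
import shutil


def missing_dependencies(modules=("boto3", "paramiko"),
                         executables=("starcluster", "aws")):
    """Return the names of dependencies that cannot be found locally.

    Hypothetical helper: the default names are assumptions about how the
    four dependencies are exposed on the local computer.
    """
    missing = []
    for module in modules:
        # find_spec returns None when a top-level module is not installed.
        if importlib.util.find_spec(module) is None:
            missing.append(module)
    for executable in executables:
        # shutil.which returns None when the command is not on the PATH.
        if shutil.which(executable) is None:
            missing.append(executable)
    return missing


if __name__ == "__main__":
    not_found = missing_dependencies()
    if not_found:
        print("Missing dependencies: " + ", ".join(not_found))
    else:
        print("All dependencies found.")
```

An empty list means that every dependency was located. The check only tests for presence, not for version compatibility.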
NGScloud offers a user-friendly front-end to operate the EC2 resources, to control the implemented RNA-seq workflow and to handle the data. NGScloud runs in graphical mode by default, using a graphical user interface (GUI), but it can also be run in console mode on server machines with no GUI installed.
In addition, several free bioinformatic applications commonly used in RNA-seq workflows can be easily set up from the front-end (Section 2.3).
2.2 Cloud computing
The NGScloud philosophy is based on the cluster concept. A cluster is a set of virtual machines of the same AWS instance type. Each instance type has its own hardware features (machine type, number of CPUs, amount of memory, etc.).
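The idea of matching an instance type to the needs of an experiment can be sketched as follows. This is a minimal illustration, not NGScloud code. The catalogue entries, their names and their hourly prices are made up, and the selection rule (cheapest type that satisfies the CPU and memory requirements) is an assumption.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class InstanceType:
    name: str            # hypothetical label, not a real AWS instance type
    vcpus: int           # number of virtual CPUs
    memory_gib: float    # amount of RAM in GiB
    usd_per_hour: float  # made-up price, for illustration only


# Hypothetical catalogue; real EC2 instance types and prices differ.
CATALOGUE = (
    InstanceType("small", 2, 8.0, 0.10),
    InstanceType("medium", 8, 61.0, 0.50),
    InstanceType("large", 32, 244.0, 2.00),
)


def cheapest_fit(min_vcpus: int, min_memory_gib: float,
                 catalogue: Sequence[InstanceType] = CATALOGUE
                 ) -> Optional[InstanceType]:
    """Return the cheapest type meeting both requirements, or None."""
    candidates = [t for t in catalogue
                  if t.vcpus >= min_vcpus and t.memory_gib >= min_memory_gib]
    return min(candidates, key=lambda t: t.usd_per_hour, default=None)


def cluster_cost(instance_type: InstanceType, nodes: int, hours: float) -> float:
    """Estimate the cost of a cluster: price per hour x nodes x hours."""
    return instance_type.usd_per_hour * nodes * hours


if __name__ == "__main__":
    chosen = cheapest_fit(min_vcpus=4, min_memory_gib=32)
    print(chosen)
    print(cluster_cost(chosen, nodes=3, hours=10))
```

Because all virtual machines of a cluster share one instance type, the estimated cost scales linearly with the number of nodes and the running time.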
Data volumes store data persistently, so the data are kept even when no cluster exists. NGScloud uses Amazon’s EBS volumes to hold applications, read files, references, databases and analysis results.
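For readers unfamiliar with EBS, creating a volume through Boto3 might look like the sketch below. `create_volume` is a method of the Boto3 EC2 client, but whether NGScloud calls it in this way is an assumption. The helper name `create_data_volume`, the availability zone, the size, the volume type and the volume name are all hypothetical.

```python
def create_data_volume(ec2_client, zone, size_gib, name):
    """Create an EBS volume and return its identifier.

    Hypothetical helper: `ec2_client` is expected to behave like the
    client returned by boto3.client("ec2").
    """
    response = ec2_client.create_volume(
        AvailabilityZone=zone,   # the volume must be in the cluster's zone
        Size=size_gib,           # size in GiB
        VolumeType="gp2",        # assumed general-purpose SSD volume
        TagSpecifications=[{
            "ResourceType": "volume",
            "Tags": [{"Key": "Name", "Value": name}],
        }],
    )
    return response["VolumeId"]


# Example usage (requires AWS credentials; all values are illustrative):
#
#   import boto3
#   ec2 = boto3.client("ec2", region_name="us-east-1")
#   volume_id = create_data_volume(ec2, "us-east-1a", 100, "ngscloud-results")
```

Passing the client as an argument keeps the helper free of credentials and makes it possible to try it with a stand-in object.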
Through the NGScloud front-end, the user can easily: (i) create and terminate clusters; (ii) show the cluster composition; (iii) add and remove cluster nodes dynamically; (iv) create, remove, mount and unmount volumes; (v) submit jobs to the RNA-seq workflow and kill them; (vi) show the status of batch jobs; (vii) view the log of every batch job to check that the programs ran correctly and (viii) upload, download, compress, decompress and remove datasets.
When a cluster is created, it has a single virtual machine, named the master node. After the master node is created, subsidiary nodes can be added if necessary to run some processes in parallel. In that case, each new job runs on a node chosen according to the workload.
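The behaviour described in this section can be modelled with a small sketch. It is a toy model under stated assumptions, not the NGScloud implementation: the class and method names are hypothetical, the node naming scheme is invented, and "according to the workload" is interpreted here as "the node with the fewest running jobs".

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    name: str
    running_jobs: List[str] = field(default_factory=list)


class Cluster:
    """Toy model of a cluster that starts with a master node only."""

    def __init__(self, name: str):
        self.name = name
        self.nodes = [Node("master")]
        self._next_id = 1  # counter used to name subsidiary nodes

    def add_node(self) -> Node:
        """Add a subsidiary node (naming scheme is an assumption)."""
        node = Node(f"node{self._next_id:03d}")
        self._next_id += 1
        self.nodes.append(node)
        return node

    def remove_node(self, name: str) -> None:
        """Remove a subsidiary node; the master is kept (an assumption)."""
        if name == "master":
            raise ValueError("the master node cannot be removed")
        self.nodes = [n for n in self.nodes if n.name != name]

    def submit(self, job: str) -> str:
        """Assign the job to the node with the fewest running jobs."""
        target = min(self.nodes, key=lambda n: len(n.running_jobs))
        target.running_jobs.append(job)
        return target.name


if __name__ == "__main__":
    cluster = Cluster("rnaseq")
    print(cluster.submit("fastqc"))       # only the master exists
    cluster.add_node()
    print(cluster.submit("trimmomatic"))  # goes to the idle new node
```

With a single node every job lands on the master. Once nodes are added, new jobs are spread over the less loaded machines.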
2.3 RNA-seq workflow
The implemented RNA-seq workflow has the standard steps of an RNA-seq analysis in non-model species (López de Heredia and Vázquez-Poletti, 2016), including: (i) pre-processing: read quality assessment with FastQC (http://www.bioinformatics.babraham.ac.uk/projects/fastqc.), trimming with Trimmomatic (Bolger, 2014) and the insilico-read-normalization procedure of Trinity; (ii) de novo assembly with SOAPdenovo-Trans (Xie, 2014), Trans-Abyss (Robertson, 2010) and Trinity (Grabherr, 2011; Haas, 2013), and reference-based assembly with STAR (Dobin, 2013); (iii) assessment of the assembly quality and transcript quantification with Transrate (Smith-Unna, 2016), GMAP (Wu and Watanabe, 2005), BUSCO (Waterhouse, 2017), QUAST (Gurevich, 2013), rnaQUAST (Bushmanova, 2016) and RSEM-EVAL, included in the DETONATE package (Li, 2014); (iv) post-filtering with CD-HIT-EST (Li and Godzik, 2006) and transcript-filter, included in the NGShelper package, which also has some other tools to perform the RNA-seq analysis workflow and (v) annotation with transcriptome-blast, which encapsulates blastx runs in several nodes and is also included in NGShelper. The results may be downloaded for downstream analysis (Fig. 1).
The workflow steps are run separately by the user, with each step requiring the setup of the cloud resources. NGScloud is configured to read the output generated in each step as the required input file(s) of subsequent steps, or to download the output to a local machine. For instance, to perform an assembly, reads are uploaded, the assembly program runs in the cluster and the output is either downloaded or submitted to the next step of the RNA-seq workflow. Running bioinformatic applications is easy because the researcher is guided in the choice of input files and parameters, which hides the complexity of the command line. Several application runs can be executed in parallel by creating additional nodes. In addition, the annotation step supports parallelization, so it can use several nodes to increase the run speed.
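One way an annotation step could be spread over several nodes is to split the transcript file into one chunk per node and run a separate search on each chunk. The sketch below illustrates that idea only. The source does not describe how transcriptome-blast partitions its input, so the round-robin scheme and the function names `read_fasta` and `split_round_robin` are assumptions.

```python
from typing import List, Tuple

Record = Tuple[str, str]  # (header without ">", sequence)


def read_fasta(text: str) -> List[Record]:
    """Parse FASTA-formatted text into (header, sequence) records."""
    records: List[Record] = []
    header, parts = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if any.
            if header is not None:
                records.append((header, "".join(parts)))
            header, parts = line[1:], []
        elif header is not None:
            parts.append(line)
    if header is not None:
        records.append((header, "".join(parts)))
    return records


def split_round_robin(records: List[Record], n_nodes: int) -> List[List[Record]]:
    """Distribute records over n_nodes chunks in round-robin order."""
    if n_nodes < 1:
        raise ValueError("n_nodes must be at least 1")
    chunks: List[List[Record]] = [[] for _ in range(n_nodes)]
    for index, record in enumerate(records):
        chunks[index % n_nodes].append(record)
    return chunks


if __name__ == "__main__":
    fasta = ">t1\nACGT\nAC\n>t2\nGG\n>t3\nTT\n"
    for node, chunk in enumerate(split_round_robin(read_fasta(fasta), 2)):
        print(node, [header for header, _ in chunk])
```

Round-robin assignment keeps the number of transcripts per chunk balanced, although it does not balance total sequence length. Each chunk could then be searched independently and the results concatenated.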
3 Conclusions
NGScloud provides a user-friendly front-end to operate EC2 resources and to control, in a modular way, the workflow of RNA-seq experiments oriented to non-model species. The application helps to optimize the cost-efficiency ratio of RNA-seq experiments when appropriate computational facilities are not available.
Funding
This work has been supported by the projects SPIP2014-01093 (Spanish National Parks Agency, Ministry of Agriculture), AGL2015-67495-C2-2-R and FedCloudNet (MINECO TIN2015-65469-P) (Spanish Ministry of Economy and Competitiveness), and by an Amazon Research Grant.
Conflict of Interest: none declared.
References