Alexey Stupnikov, Shailesh Tripathi, Ricardo de Matos Simoes, Darragh McArt, Manuel Salto-Tellez, Galina Glazko, Matthias Dehmer, Frank Emmert-Streib, samExploreR: exploring reproducibility and robustness of RNA-seq results based on SAM files, Bioinformatics, Volume 32, Issue 21, November 2016, Pages 3345–3347, https://doi.org/10.1093/bioinformatics/btw475
Abstract
Motivation: Data from RNA-seq experiments provide us with many new possibilities to gain insights into biological and disease mechanisms of cellular functioning. However, the reproducibility and robustness of RNA-seq data analysis results are often unclear. This is in part attributed to the two counteracting goals of (i) a cost-efficient and (ii) an optimal experimental design, which lead to a compromise, e.g. in the sequencing depth of experiments.
Results: We introduce an R package called samExploreR that allows the subsampling (m out of n bootstrapping) of short reads based on SAM files, facilitating the investigation of sequencing-depth-related questions for the experimental design. Overall, this provides a systematic way of exploring the reproducibility and robustness of general RNA-seq studies. We exemplify the usage of samExploreR by studying the influence of the sequencing depth and the annotation on the identification of differentially expressed genes.
Availability and Implementation: samExploreR is available as an R package from Bioconductor.
Contact: v@bio-complexity.com
Supplementary information: Supplementary data are available at Bioinformatics online.
1 Introduction
RNA-seq data (Mortazavi et al., 2008) generated with next-generation sequencing (NGS) platforms offer many new and exciting opportunities, from basic biology to translational clinical research. However, one important problem that these diverse applications have in common is the question of the reproducibility and robustness of the obtained results (Peng, 2011). Due to the novelty of RNA-seq data, there are so far only a few studies investigating the optimal sequencing depth, either in general (Sims et al., 2014) or for context-specific problems, e.g. for identifying differentially expressed (DE) genes (Liu et al., 2014; Rapaport et al., 2013).
In this paper, we introduce an R package called samExploreR that allows the convenient exploration of reproducibility and robustness of RNA-seq data analysis results. We demonstrate the utility of samExploreR by a case study for identifying DE genes.
2 Methods
The identification of DE genes from RNA-seq data requires the following four analysis steps. Step 1: Alignment of short reads, e.g. with Bowtie2 (Langmead and Salzberg, 2012). Step 2: Annotation-dependent matching and summarization, e.g. with CuffLinks (Trapnell et al., 2010), HTSeq (Anders et al., 2014) or featureCounts (a function available in Rsubread; Liao et al., 2013). Step 3: Normalization (Dillies et al., 2013). Step 4: Statistical analysis including multiple testing correction (Soneson and Delorenzi, 2013).
In principle, the subsampling of reads can be implemented as an additional step at any position before Step 3. In this paper, we introduce the R package samExploreR, our modification of featureCounts, which integrates the matching and summarization of reads (Step 2) with a subsampling step. Thus, the application of samExploreR provides 'a shortcut' that maps SAM files directly to subsampled count vectors (cv) representing the number of reads assigned per gene, see Figure 2. A significant advantage of our package is its ability to perform a direct subsampling analysis for various versions of genomic annotations, because it matches reads to gene loci and resamples reads in a single procedure. This is in contrast to tools like SAMtools (Li et al., 2009), Picard (http://broadinstitute.github.io/picard/) or subSeq (Robinson and Storey, 2014); see Figure 1A for a visualization.
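The combination of subsampling and summarization described above can be sketched in a few lines. This is a minimal Python illustration under stated assumptions, not the samExploreR or featureCounts interface: the function name `subsampled_counts` and the list `read_to_gene` are hypothetical, reads are assumed to be already assigned to genes, and the sketch draws m = round(f · n) reads without replacement, which is only one possible reading of the subsampling scheme.

```python
import random
from collections import Counter

def subsampled_counts(read_to_gene, f, rng):
    """Draw m = round(f * n) reads without replacement and count reads per gene.

    read_to_gene: list with one gene ID per aligned read (None = unassigned).
    f: fraction of reads to keep, 0 < f <= 1.
    rng: a random.Random instance, so that runs are repeatable.
    """
    if not 0 < f <= 1:
        raise ValueError("f must be in (0, 1]")
    m = round(f * len(read_to_gene))
    kept = rng.sample(read_to_gene, m)
    # Unassigned reads are drawn like any other read but are not counted.
    return Counter(g for g in kept if g is not None)

# Toy data (made up): 10 reads, two hypothetical genes, 2 unassigned reads.
reads = ["geneA"] * 6 + ["geneB"] * 2 + [None] * 2
cv = subsampled_counts(reads, f=0.5, rng=random.Random(1))
```

Here `cv` plays the role of one subsampled count vector; with f = 1 the function returns the full counts.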
Figure 1A: Schematic working mechanism of samExploreR with f: fraction of reads; R: number of resamplings. Figure 1B: Results from samExploreR. (A) Number of DE genes for three annotations and (B) their pairwise intersections of common DE genes. The Venn diagram is for intersections at f = 0.7 (vertical dashed line); r is the number of f values used. (Color version of this figure is available at Bioinformatics online.)
3 Results
The parameter f has the interpretation of a 'simulated sequencing depth' of a virtual sequencing experiment. Furthermore, for every value of f, we generate R = 25 replicates (see loop in Fig. 1). Each of these datasets is then analyzed with DESeq2 (Love et al., 2014) to obtain a list of DE genes.
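The resampling loop can be pictured as a grid over f values with R replicates each. The following Python sketch is illustrative only: the function name `resample_grid`, the read count and the f values are assumptions, and the downstream counting and DESeq2 analysis of each replicate are not shown.

```python
import random

def resample_grid(n_reads, fractions, n_replicates, seed=0):
    """Return {f: [replicate, ...]}; each replicate is a sorted list of kept read indices."""
    rng = random.Random(seed)
    grid = {}
    for f in fractions:
        m = round(f * n_reads)  # number of reads kept at simulated depth f
        grid[f] = [sorted(rng.sample(range(n_reads), m))
                   for _ in range(n_replicates)]
    return grid

# Hypothetical setting: 1000 reads, three f values, R = 25 replicates per f.
grid = resample_grid(n_reads=1000, fractions=[0.1, 0.5, 0.9], n_replicates=25)
```

Each entry of `grid[f]` would then be turned into a count vector and analyzed separately, giving 25 DE gene lists per simulated sequencing depth.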
In Figure 1B.A, we show the number of DE genes (y-axis) as a function of the simulated sequencing depth f (x-axis) for three annotations. The three P-values shown correspond to results from a Friedman test for dependent groups, which compares the results from r = 3 user-specified values of f (horizontal slices; see arrows) for a fixed annotation and tests the null hypothesis of equal means. Ideally, a nonsignificant Friedman test corresponds to statistically robust results because a further increase of the (real) sequencing depth is unlikely to change the results. In practice, the Friedman test may give significant results, indicating imperfect robustness of the results w.r.t. variations of the parameter f, as is the case for the examples in Figure 1B.A. Next, we study the reproducibility of different annotations for fixed f values (vertical slices; see arrows). As one can see from Figure 1B.A, the results for different annotations diverge for increasing values of f, making the results less reproducible w.r.t. different genome annotations. The choice of f values is explored in the Supplementary file.
In order to emphasize that 'robustness' as well as 'reproducibility' are always defined w.r.t. a specific metric, we repeat the above analysis for the 'common number of DE genes' between two annotations; the results are shown in Figure 1B.B. Further questions that could be studied with samExploreR in a similar way include the reproducibility of RNA-seq results across different wet labs, sequencing platforms, gene summarizations (including introns), tissue preparations (FFPE versus FF samples) or statistical analysis methods. The robustness of the summarization parameter for multiple gene hits or of the number of mismatches allowed in the alignments could also be explored.
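For readers who want to see the mechanics of a Friedman test on three dependent groups, a small Python sketch follows. It is not the code used for the figure: the function name `friedman_test_3` and the toy numbers are made up, treating each replicate index as a block is an assumption, ties are not corrected for, and the P-value uses the chi-square approximation, whose tail for two degrees of freedom is exp(-Q/2).

```python
import math

def friedman_test_3(groups):
    """Friedman test for exactly three dependent groups.

    groups: three equal-length lists; position i in each list belongs to the
    same block (here: the same replicate index). Returns (Q, p).
    Assumes no ties within a block; no tie correction is applied.
    """
    if len(groups) != 3:
        raise ValueError("this sketch handles exactly three groups")
    k = 3
    n = len(groups[0])
    if any(len(g) != n for g in groups):
        raise ValueError("groups must have equal length")
    rank_sums = [0.0] * k
    for block in zip(*groups):
        # Rank the three values within the block (1 = smallest).
        order = sorted(range(k), key=lambda j: block[j])
        for rank, j in enumerate(order, start=1):
            rank_sums[j] += rank
    q = 12.0 / (n * k * (k + 1)) * sum(r * r for r in rank_sums) - 3.0 * n * (k + 1)
    # Chi-square survival function with k - 1 = 2 degrees of freedom.
    p = math.exp(-q / 2.0)
    return q, p

# Made-up numbers of DE genes for three f values over four replicates.
q, p = friedman_test_3([[10, 12, 11, 13], [20, 22, 21, 23], [30, 32, 31, 33]])
```

In this toy input the third group is largest in every block, so Q reaches its maximum n(k - 1) = 8 and the approximate P-value is exp(-4).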
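The 'common number of DE genes' metric amounts to a set intersection per pair of annotations. A minimal Python sketch, with hypothetical annotation names and gene IDs that do not come from the paper:

```python
from itertools import combinations

def pairwise_common(de_genes):
    """Map each pair of annotation names to the number of shared DE genes."""
    return {
        (a, b): len(de_genes[a] & de_genes[b])
        for a, b in combinations(sorted(de_genes), 2)
    }

# Made-up DE gene sets for three hypothetical annotations at one value of f.
de = {
    "annoX": {"g1", "g2", "g3"},
    "annoY": {"g2", "g3", "g4"},
    "annoZ": {"g3", "g5"},
}
common = pairwise_common(de)
```

Repeating this for every replicate and every f value would give the kind of curves summarized per annotation pair.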
4 Conclusion
A cornerstone of any scientific study is the question regarding the reproducibility and robustness of obtained results. Unfortunately, for high-throughput data from RNA-seq experiments such questions are highly non-trivial to answer, which may be an explanation for the severe underrepresentation of this topic in the literature. samExploreR provides a flexible exploratory tool for investigating general RNA-seq datasets w.r.t. user-defined metrics.
Funding
A.S. is supported by an international studentship by the CCRCB (Belfast, UK). M.D. thanks the Austrian Science Funds for supporting this work (project P26142). F.E.-S. is supported by the Tampere University of Technology (Finland).
Conflict of Interest: none declared.
References
Author notes
Associate Editor: Ivo Hofacker
