Abstract

Motivation: Next-generation sequencing experiments, such as RNA-Seq, play an increasingly important role in biological research. One complication is that the power and accuracy of such experiments depend substantially on the number of reads sequenced, so it is important and challenging to determine the optimal read depth for an experiment or to verify whether one has adequate depth in an existing experiment.

Results: By randomly sampling lower depths from a sequencing experiment and determining where the saturation of power and accuracy occurs, one can determine what the most useful depth should be for future experiments, and furthermore, confirm whether an existing experiment had sufficient depth to justify its conclusions. We introduce the subSeq R package, which uses a novel efficient approach to perform this subsampling and to calculate informative metrics at each depth.

Availability and Implementation: The subSeq R package is available at http://github.com/StoreyLab/subSeq/.

Contact:dgrtwo@princeton.edu or jstorey@princeton.edu

Supplementary information:Supplementary data are available at Bioinformatics online.

1 INTRODUCTION

Many next-generation sequencing technologies have been developed to answer important biological questions. One property these technologies have in common is that they depend on read depth or coverage: increasing the number of reads typically increases the power and accuracy. For instance, in RNA-Seq greater read depth is known to increase the power of differential expression testing and the accuracy of expression estimates (Liu et al., 2013; Tarazona et al., 2011). The advent of multiplexed sequencing means that researchers should consider their read depth as a trade-off against cost and replication when designing experiments (Liu et al., 2014), which means knowing the relationship between read depth and power is essential to designing sequencing experiments. Similarly, many researchers need to demonstrate that they have adequate depth in an existing experiment to support their biological conclusions.

One valuable approach that multiple studies have used is to randomly subsample reads (sometimes called downsampling) and perform an identical analysis on each subsample. This is in contrast to methods that fit a parametric model to calculate power, such as Scotty (Busby et al., 2013). By determining where metrics of power and accuracy ‘saturate’ with increasing depth, one can both determine recommendations for future experiments and demonstrate whether an existing experiment has sufficient depth. Studies have used random subsampling to propose guidelines for future experiments (Black et al., 2014; Liu et al., 2014), to perform a survey of different RNA-Seq analysis methods at varying read depths (Labaj et al., 2011; Liu et al., 2013; Rapaport et al., 2013), or to demonstrate that they had achieved adequate read depth (Daines et al., 2011; Toung et al., 2011; Wang et al., 2011). However, all took the approach of randomly subsampling from either the fastq or alignment file, and then reperforming the analysis, including the computationally intensive step of matching reads to genes, on each file. This process is slow, demanding of disk space, and requires possessing the original reads or mappings, which limits the number of subsamples that can be performed and the ease of performing this analysis on existing experiments.

We introduce the subSeq R package, which instead subsamples sequencing reads with binomial sampling after they have been matched to genes and assembled into a count matrix. Because the step of matching reads to genes is independent and deterministic, this approach is functionally identical to the common approach of subsampling the read alignment files, but requires only the count matrix rather than the read alignment file. It also takes negligible time and computing resources even on large datasets, as the steps downstream of the read subsampling are much faster than the upstream steps. A similar approach is used to generate saturation figures in the NOISeq package (Tarazona et al., 2011), but subSeq is designed to be used with any RNA-Seq analysis method. subSeq could be performed immediately on any experiment in the ReCount resource of analysis-ready datasets (Frazee et al., 2011), and on any RNA-Seq experiment that provides a matrix of read counts per gene. An early version of this software was used in Robinson et al. (2014), on Bar-Seq measurements of the yeast deletion set, to determine the effect of read depth on detection of differential abundance.

subSeq also streamlines the process of performing a differential expression analysis on each subsample, and of calculating relevant biological metrics for each to determine how they vary depending on read depth. In particular, subSeq reports metrics representing (i) the power to detect differential expression or abundance, (ii) the accuracy of effect size estimation and (iii) the estimated rate of false discoveries relative to the full experiment.

2 METHODS

The user provides an unnormalized M × N matrix X of read counts, where each row represents one of M genes, each column represents one of N samples and each value denotes the number of reads aligned to each gene within each sample. The user also specifies a vector of K subsampling proportions p, each in the interval (0, 1], and the number of replications to perform at each proportion. For each pk, a subsampled matrix Y(k) is generated such that Ym,n(k)Binom(Xm,n,pk) for m=1,,M and n=1,,N. This is equivalent to allowing each original mapped read to have probability pk of being included in the new counts, as done, for example, by the Picard DownsampleSam function.

For each subsample, we perform the same analysis that is performed on the full set of reads. Multiple approaches for the determination of RNA-Seq differential expression from a matrix of counts, including edgeR (Robinson et al., 2010) and DESeq2 (Love et al., 2014), are built into subSeq, as is DEXSeq for differential exon usage detection (Anders et al., 2012). The user can also provide a custom method to be applied to each subsample.

Here we use subSeq to examine the effect of depth on the RNA-Seq dataset from Hammer et al. (2010), testing for differential expression between rats with induced chronic neuropathic pain and a control group. The mapped read counts were downloaded from ReCount, only samples from the 2-month time point were used, and genes with fewer than five mapped reads were filtered out. We subsampled 11 proportions on a logarithmic scale from 0.01 to 1, performing five replications at each proportion.

3 RESULTS

As an illustrative example, we show the results of subsampling of an RNA-Seq dataset from Hammer et al. (2010), using edgeR or DESeq2 to normalize and test each subsample for differential expression. To perform these subsamples manually, it would have required downloading 11.4 Gb of reads, mapping them to the mouse genome, downsampling to produce an additional 95Gb of alignments, matching each read to the gene annotations and only then performing the differential expression analysis. Using subSeq, the subsampling requires only the 4.9 Mb matrix from the ReCount database, can be performed entirely in memory in R and takes a negligible amount of time (<1 s to perform the 55 subsamplings, ∼2–8 minutes to perform the analysis at each step, depending on the method chosen).

After constructing subsamples and performing an analysis on each, subSeq calculates and visualizes summary metrics about each sequencing depth (Fig. 1); these plots aid in determining saturation of depth (Supplementary Fig. S1). As the plots show how read depth changes the conclusions of the analysis, the ‘oracle’ is defined as the P-values and estimates at the full depth. To estimate the power, subSeq determines the number of genes found significant at a given false discovery rate. To determine whether the decrease in read depth affects specificity, we also estimate the false discovery proportion (FDP) at each depth. subSeq does this by using the qvalue package to estimate the local false discovery rate for each gene in the oracle, then calculating the average of the oracle local FDR values among the genes found significant at each depth. To determine how depth affects the accuracy of effect size estimation, subSeq compares the log fold-changes estimated at each depth with the oracle estimates, reporting the mean-squared error and the Pearson and Spearman correlations.

Fig. 1.

The default plot generated by subSeq on subsamples of Hammer et al. (2010). This shows the number of significant genes at each depth (top left), the estimated FDP (top right) and the Spearman correlation (bottom left) and mean-squared error (bottom right) comparing the estimates at each depth with the full experiment

Fig. 1.

The default plot generated by subSeq on subsamples of Hammer et al. (2010). This shows the number of significant genes at each depth (top left), the estimated FDP (top right) and the Spearman correlation (bottom left) and mean-squared error (bottom right) comparing the estimates at each depth with the full experiment

subSeq is designed to allow any analysis to be performed on each subsample. While the example demonstrated here used RNA-Seq data, subSeq works equally well on other genomic approaches such as Bar-Seq or Tn-Seq, as demonstrated in Robinson et al. (2014).

ACKNOWLEDGEMENT

The authors thank A.J. Bass for helpful comments on the software and manuscript.

Funding: This work was supported in part by NIH (R01 HG002913).

Conflict of interest: none declared.

REFERENCES

Anders
S
, et al.  . 
Detecting differential usage of exons from RNA-seq data
Genome Res.
 , 
2012
, vol. 
22
 (pg. 
2008
-
2017
)
Black
MB
, et al.  . 
Comparison of microarrays and RNA-seq for gene expression analyses of dose-response experiments
Toxicol. Sci.
 , 
2014
, vol. 
137
 (pg. 
385
-
403
)
Busby
MA
, et al.  . 
Scotty: a web tool for designing RNA-Seq experiments to measure differential gene expression
Bioinformatics
 , 
2013
, vol. 
29
 (pg. 
656
-
657
)
Daines
B
, et al.  . 
The Drosophila melanogaster transcriptome by paired-end RNA sequencing
Genome Res.
 , 
2011
, vol. 
21
 (pg. 
315
-
324
)
Frazee
AC
, et al.  . 
ReCount: a multi-experiment resource of analysis-ready RNA-seq gene count datasets
BMC Bioinformatics
 , 
2011
, vol. 
12
 pg. 
449
 
Hammer
P
, et al.  . 
mRNA-seq with agnostic splice site discovery for nervous system transcriptomics tested in chronic pain
Genome Res.
 , 
2010
, vol. 
20
 (pg. 
847
-
860
)
Labaj
PP
, et al.  . 
Characterization and improvement of RNA-Seq precision in quantitative transcript expression profiling
Bioinformatics
 , 
2011
, vol. 
27
 (pg. 
i383
-
i391
)
Liu
Y
, et al.  . 
Evaluating the impact of sequencing depth on transcriptome profiling in human adipose
PLoS One
 , 
2013
, vol. 
8
 pg. 
e66883
 
Liu
Y
, et al.  . 
RNA-seq differential expression studies: more sequence or more replication?
Bioinformatics
 , 
2014
, vol. 
30
 (pg. 
301
-
304
)
Love
MI
, et al.  . 
Moderated estimation of fold change and dispersion for RNA-Seq data with DESeq2
2014
 
. biorXiv preprint. doi:10.1101/002832
Rapaport
F
, et al.  . 
Comprehensive evaluation of differential gene expression analysis methods for RNA-seq data
Genome Biol.
 , 
2013
, vol. 
14
 pg. 
R95
 
Robinson
DG
, et al.  . 
Design and analysis of Bar-seq experiments
G3 (Bethesda)
 , 
2014
, vol. 
4
 (pg. 
11
-
18
)
Robinson
MD
, et al.  . 
edgeR: a Bioconductor package for differential expression analysis of digital gene expression data
Bioinformatics
 , 
2010
, vol. 
26
 (pg. 
139
-
140
)
Tarazona
S
, et al.  . 
Differential expression in RNA-seq: a matter of depth
Genome Res.
 , 
2011
, vol. 
21
 (pg. 
2213
-
2223
)
Toung
JM
, et al.  . 
RNA-sequence analysis of human B-cells
Genome Res.
 , 
2011
, vol. 
21
 (pg. 
991
-
998
)
Wang
Y
, et al.  . 
Evaluation of the coverage and depth of transcriptome by RNA-Seq in chickens
BMC Bioinformatics
 , 
2011
, vol. 
12
 
Suppl. 10
pg. 
S5
 

Author notes

Associate Editor: Inanc Birol
This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/4.0/), which permits non-commercial re-use, distribution, and reproduction in any medium, provided the original work is properly cited. For commercial re-use, please contact journals.permissions@oup.com

Comments

0 Comments