-
PDF
- Split View
-
Views
-
Cite
Cite
Paul L Auer, R W Doerge, Statistical Design and Analysis of RNA Sequencing Data, Genetics, Volume 185, Issue 2, 1 June 2010, Pages 405–416, https://doi.org/10.1534/genetics.110.114983
Close - Share Icon Share
Abstract
Next-generation sequencing technologies are quickly becoming the preferred approach for characterizing and quantifying entire genomes. Even though data produced from these technologies are proving to be the most informative of any thus far, very little attention has been paid to fundamental design aspects of data collection and analysis, namely sampling, randomization, replication, and blocking. We discuss these concepts in an RNA sequencing framework. Using simulations we demonstrate the benefits of collecting replicated RNA sequencing data according to well known statistical designs that partition the sources of biological and technical variation. Examples of these designs and their corresponding models are presented with the goal of testing differential expression.
NEXT-GENERATION sequencing (NGS) has emerged as a revolutionary tool in genetics, genomics, and epigenomics. By increasing throughput and decreasing cost, compared to other sequencing technologies (Hayden 2009), NGS has enabled genome-wide investigations of various phenomena, including single-nucleotide polymorphisms (Craig et al. 2008), epigenetic events (Park 2009), copy number variants (Alkan et al. 2009), differential expression (Bloom et al. 2009), and alternative splicing (Sultan et al. 2008). One application with demonstrated effectiveness over previous technologies [e.g., microarrays and serial analysis of gene expression (SAGE)] is called RNA sequencing (RNA-Seq) (Cloonan et al. 2009). RNA-Seq uses NGS technology to sequence, map, and quantify a population of transcripts (Mortazavi et al. 2008; Morozova et al. 2009). While RNA-Seq is a relatively new method, it has already provided unprecedented insights into the transcriptional complexities of a variety of organisms, including yeast (Nagalakshmi et al. 2008), mice (Mortazavi et al. 2008), Arabidopsis (Eveland et al. 2008), and humans (Sultan et al. 2008).
At present, there are three widely accepted commercially available NGS devices [Illumina's Genome Analyzer, Applied Biosystems' (Foster City, CA) SOLiD, and the 454 Genome Sequencer FLX] for RNA-Seq (Cloonan et al. 2008; Eveland et al. 2008; Marioni et al. 2008). Across platforms, the RNA-Seq methodology is approximately the same. Briefly, RNA is isolated from cells, fragmented at random positions, and copied into complementary DNA (cDNA). Fragments meeting a certain size specification (e.g., 200–300 bases long) are retained for amplification using polymerase chain reaction (PCR). After amplification, the cDNA is sequenced using NGS; the resulting reads are aligned to a reference genome, and the number of sequencing reads mapped to each gene in the reference is tabulated. These gene counts, or digital gene expression (DGE) measures, can be transformed and used to test differential expression (see Morozova et al. 2009 for a review of these technologies as applied to RNA-Seq).
Although there are many steps in this experimental process that may introduce errors and biases, RNA-Seq has been hailed as the future of transcriptome research (Shendure 2008) because it potentially generates an unlimited dynamic range, provides greater sensitivity than microarrays, is able to discriminate closely homologous regions, and does not require a priori assumptions about regions of expression (Cloonan et al. 2009; Morozova et al. 2009). As research transitions from microarrays to sequencing-based approaches, it is essential that we revisit many of the same concerns that the statistical community had at the beginning of the microarray era (Kerr and Churchill 2001a).
Soon after the introduction of microarrays (Schena et al. 1995), a series of articles was published elucidating the need for proper experimental design (Kerr et al. 2000; Lee et al. 2000; Kerr and Churchill 2001a,b; Churchill 2002). All of these articles rely heavily on the three fundamental aspects of sound experimental design formalized by R. A. Fisher (Fisher 1935a) 70 years ago, namely replication, randomization, and blocking. These concepts can be understood by considering the following controlled experiment that is designed to test the effectiveness of two different diets. A sound experimental design would include many different subjects (i.e., replication) recruited from multiple weight loss centers (i.e., blocking). Each center would randomly assign its subjects to one of the two diets (i.e., randomization).
Although the principles of good design are straightforward, their proper implementation often requires significant planning and statistical expertise. To date, many NGS applications, specifically RNA-Seq, have neglected good design. While a few RNA-Seq studies have reported highly reproducible results with little technical variation (e.g., Marioni et al. 2008; Mortazavi et al. 2008), in the absence of a proper design, it is essentially impossible to partition biological variation from technical variation. When these two sources of variation are confounded, there is no way of knowing which source is driving the observed results. No amount of statistical sophistication can separate confounded factors after data have been collected.
Generally, for differential expression analyses, researchers are interested in comparisons across treatment groups in the form of contrasts or pairwise comparisons, and the designs for these analyses are usually quite simple. The good news for NGS technologies is that certain properties of the platforms can be leveraged to ensure proper design. One such feature, available in all three NGS devices, is the capacity to bar code. Genomic fragments can be labeled or bar coded with sample-specific sequences that in turn allow multiple samples to be included in the same sequencing reaction (i.e., multiplexing) while maintaining, with high fidelity, sample identities downstream (Craig et al. 2008; Hamady et al. 2008; http://www3.appliedbiosystems.com/). To date, bar coding has been appreciated only as a means to increase the number of samples per sequencing run. Yet here we demonstrate how multiplexing can be used as a quality control feature that offers the flexibility to construct balanced and blocked designs for the purpose of testing differential expression.
We anticipate that the progression from the current unreplicated unblocked designs to more complex designs will be swift once the full offerings of NGS technologies are appreciated. Toward this end, we provide a brief review of some powerful statistical techniques for testing differential expression under a variety of designs. Although the designs that are presented are specific to RNA-Seq using the Illumina (Solexa) platform, the same statistical principles are applicable to the other NGS devices, as well as to other types of comparative genetic and “omic” data.
REPLICATION
Unreplicated data:
Observational studies with no biological replication are common in the RNA-Seq literature (e.g., Marioni et al. 2008). In an observational study, as opposed to a controlled experiment, the assignment of subjects to treatment groups is not decided by the investigator. In many cases, the different treatment groups consist of different tissue types. For example, in Marioni et al. (2008) messenger RNA (mRNA) was isolated from liver and kidney tissues, randomly fragmented, and sequenced using the Illumina Genome Analyzer (GA). The Illumina technology (aka “Solexa”) relies on a flow cell with eight lanes, or channels, and massively parallel sequencing by synthesis to simultaneously sequence millions of short DNA fragments in each of the lanes. Typically, independent samples of mRNA are loaded into different lanes of the flow cell such that sequencing reactions occur independently between samples. For illustration purposes, consider an example with seven subjects and seven treatment groups (T1,…, T7), where each subject is randomly assigned to one treatment group, and mRNA from each subject is loaded into a different lane (Figure 1). Note that there is no biological replication because there is only a single subject in each treatment group.
![Hypothetical Illumina GA flow cell with mRNA isolated from subjects within seven different treatment groups $\batchmode \documentclass[fleqn,10pt,legalpaper]{article} \usepackage{amssymb} \usepackage{amsfonts} \usepackage{amsmath} \pagestyle{empty} \begin{document} \((T_{1},{\ldots},{\,}T_{7})\) \end{document}$ and loaded into individual lanes (e.g., the mRNA from the subject within treatment group 1 is sequenced in lane 1). As a control, a $\batchmode \documentclass[fleqn,10pt,legalpaper]{article} \usepackage{amssymb} \usepackage{amsfonts} \usepackage{amsmath} \pagestyle{empty} \begin{document} \({\Phi}X\) \end{document}$ genomic sample is often loaded into lane 5. The bacteriophage $\batchmode \documentclass[fleqn,10pt,legalpaper]{article} \usepackage{amssymb} \usepackage{amsfonts} \usepackage{amsmath} \pagestyle{empty} \begin{document} \({\Phi}X\) \end{document}$ genome is known exactly and can be used to recalibrate the quality scoring of sequencing reads from other lanes (Bentley et al. 2008).](https://oup.silverchair-cdn.com/oup/backfile/Content_public/Journal/genetics/185/2/10.1534_genetics.110.114983/7/m_405fig1.jpeg?Expires=1722424059&Signature=BfmF3LFli-jYXUOs6NO68GOh4jId38Q8oSudE0NjH34LhNMOj~qWUb7MezhR2GARLlakHdD5XL0dptQPHMfUBSGXMvaZT7oBxX-tFiv7qfoEvYljus4~Chm8ZGB-ojGbbBlr-CZUTrblrMkXNRQxQSHEBUVNEXq6-4L7O0Ep81ZQmFOBbTGap6vfzJ-abDnjEynZai6w7EBRxDbSE1Q4K0cIhIVw7xES74GYPIfuiea12e19QpCuX8Cj-hCtYDkSP5coXJk24WtK8ELhtFN5KY1QFiPaPJQDUnuCYPDPUeu0u4E6Dl4lRj~6pnJy1DacUHa43C9BT6Q9Ctyg6e4upA__&Key-Pair-Id=APKAIE5G5CRDK6RD3PGA)
Hypothetical Illumina GA flow cell with mRNA isolated from subjects within seven different treatment groups
To analyze data from unreplicated designs, the sampling hierarchy must be taken into account. Regardless of the design, we can define three levels of sampling at work in RNA-Seq data: subject sampling, RNA sampling, and fragment sampling. Subjects (e.g., organisms or individuals) are ideally drawn from a larger population to which results of the study may be generalized (unreplicated data consist of a single subject within each treatment group). RNA sampling occurs during the experimental procedure when RNA is isolated from the cell(s). Finally, only certain fragmented RNAs that are sampled from the cells(s) are retained for amplification, and since the sequencing reads do not represent 100% of the fragments loaded into a flow cell, fragment-level sampling is also at play.
Unreplicated data consider only a single subject per treatment group. Typically there is either one subject to which every treatment is applied (e.g., in Marioni et al. 2008, liver and kidney samples were extracted from one human cadaver) or one distinct subject within each treatment group (e.g., Figure 1). In either situation, it is not possible to estimate variability within treatment group, and the analysis must proceed without any information regarding within-group biological variation. As such, in the context of RNA-Seq, statistical methods for finding differences between groups are limited to RNA and fragment-level sampling information.
Since the sampling scheme for RNA-Seq is similar to SAGE (Velculescu et al. 1995), and a sizable statistical literature is already established for analyzing differential DGE measures from unreplicated SAGE data, similar methods can be used for unreplicated RNA-Seq data. See Man et al. (2000), Romualdi et al. (2001), and Ruijter et al. (2002) for reviews and comparisons of techniques and Tino (2009) for further discussion. For both RNA-Seq and SAGE data the analysis usually proceeds on a gene-by-gene basis by organizing the data in a 2 × 2 table (Table 1). Perhaps the most natural test for differential expression in the unreplicated case is Fisher's exact test (Fisher 1935b), which fixes the marginal totals of the 2 × 2 table and tests differential expression using the hypotheses
A 2 × 2 contingency table of (unreplicated) digital gene expression (DGE) measures for testing differential expression between Treatment1 and Treatment2 of gene A
. | Treatment1 . | Treatment2 . | Total . |
|---|---|---|---|
| Gene A | \(n_{11}\) | \(n_{12}\) | \(N_{1.}\) |
| Remaining genes | \(n_{21}\) | \(n_{22}\) | \(N_{2.}\) |
| Total | \(N_{.1}\) | \(N_{.2}\) | \(N_{..}\) |
. | Treatment1 . | Treatment2 . | Total . |
|---|---|---|---|
| Gene A | \(n_{11}\) | \(n_{12}\) | \(N_{1.}\) |
| Remaining genes | \(n_{21}\) | \(n_{22}\) | \(N_{2.}\) |
| Total | \(N_{.1}\) | \(N_{.2}\) | \(N_{..}\) |
The cell counts
A 2 × 2 contingency table of (unreplicated) digital gene expression (DGE) measures for testing differential expression between Treatment1 and Treatment2 of gene A
. | Treatment1 . | Treatment2 . | Total . |
|---|---|---|---|
| Gene A | \(n_{11}\) | \(n_{12}\) | \(N_{1.}\) |
| Remaining genes | \(n_{21}\) | \(n_{22}\) | \(N_{2.}\) |
| Total | \(N_{.1}\) | \(N_{.2}\) | \(N_{..}\) |
. | Treatment1 . | Treatment2 . | Total . |
|---|---|---|---|
| Gene A | \(n_{11}\) | \(n_{12}\) | \(N_{1.}\) |
| Remaining genes | \(n_{21}\) | \(n_{22}\) | \(N_{2.}\) |
| Total | \(N_{.1}\) | \(N_{.2}\) | \(N_{..}\) |
The cell counts

The log2 fold change, between Treatment1 and Treatment2, of the normalized gene expression is plotted on the y-axis, and the mean log2 expression is plotted on the x-axis. Gene expression counts were normalized by the column totals of the corresponding 2 × 2 table (e.g., Table 1). Blue dots represent significantly differentially expressed genes as established by Fisher's exact test; gray dots represent genes with similar expression. The red horizontal line at zero provides a visual check for symmetry.
Limitations of unreplicated data:
The fundamental problem with generalizing results gathered from unreplicated data is a complete lack of knowledge about biological variation. As Fisher (1935a) noted, without an estimate of variability (i.e., within treatment group), there is no basis for inference (between treatment groups). Although we can test for differential expression between treatment groups from unreplicated data, the results of the analysis only apply to the specific subjects included in the study (i.e., the results cannot be generalized). To better understand the unrealistic conclusions that can be drawn from unreplicated data, suppose that an alien visits Earth and observes only two people, one male named “John” standing 177 cm and one female named “Jane” standing 180 cm. The same reasoning that results in unrealistic conclusions from testing differential expression between treatment groups on the basis of unreplicated data would compel the alien to believe not only that John is shorter than Jane, but also that women are, on average, taller than men.
Replicated data:
Consider extending the example illustrated in Figure 1 (i.e., seven treatment groups with one subject per treatment) to include two more biological subjects within each treatment group (Figure 3). The biological replicates allow for the estimation of within-treatment group (biological) variability, provide information that is necessary for making inferences between treatment groups, and give rise to conclusions that can be generalized.
![A multiple flow-cell design based on three biological replicates within seven treatment groups. There are three flow cells with eight lanes per flow cell. The control $\batchmode \documentclass[fleqn,10pt,legalpaper]{article} \usepackage{amssymb} \usepackage{amsfonts} \usepackage{amsmath} \pagestyle{empty} \begin{document} \({\Phi}X\) \end{document}$ sample is in lane 5 of each flow cell. Tij refers to the $\batchmode \documentclass[fleqn,10pt,legalpaper]{article} \usepackage{amssymb} \usepackage{amsfonts} \usepackage{amsmath} \pagestyle{empty} \begin{document} \(j\mathrm{th}\) \end{document}$ replicate in the $\batchmode \documentclass[fleqn,10pt,legalpaper]{article} \usepackage{amssymb} \usepackage{amsfonts} \usepackage{amsmath} \pagestyle{empty} \begin{document} \(i\mathrm{th}\) \end{document}$ treatment group $\batchmode \documentclass[fleqn,10pt,legalpaper]{article} \usepackage{amssymb} \usepackage{amsfonts} \usepackage{amsmath} \pagestyle{empty} \begin{document} \((i{=}1,{\ldots},{\,}7;j{=}1,{\ldots},{\,}3)\) \end{document}$.](https://oup.silverchair-cdn.com/oup/backfile/Content_public/Journal/genetics/185/2/10.1534_genetics.110.114983/7/m_405fig3.jpeg?Expires=1722424059&Signature=JPqJoBzEVN4dSgMZHictY6u0fovNfZpZKNmrYpaG~G~f8HkjXb-iPxElXsWhzmLlrOXpP~xw~IyziFvLzJD-jfv~4kv0OyUVJsLeZFNq~ij0nm7I~Fds8Hg1uOsYsCPMeEHCh7pk0K68UibhJUhIN8BQHbgdcjruIzZEUe-ixmE1nsqL7kqSxMy0xM-GVirmSBDXbj4sBuno3YUzcGtXBEf0mzVjgwazDD40Yad2rQlsVql2B3Yts8nbLp6HP-KGdnbt6HirOic5g2Pa8TFmAxY9GrhAcowMughasRsuRli~JcfBZe2Ueth-GCzJg9Yp9QteRWbMCKICL5GaxnnPKg__&Key-Pair-Id=APKAIE5G5CRDK6RD3PGA)
A multiple flow-cell design based on three biological replicates within seven treatment groups. There are three flow cells with eight lanes per flow cell. The control
BALANCED BLOCK DESIGNS
Without careful planning an unblocked design faces a fundamental problem with generalizing the results, namely, the potential for confounding. With respect to an RNA-Seq analysis, if the treatment effects are not separable from possible confounding factors, then for any given gene, there is no way of knowing whether the observed difference in abundance between treatment groups is due to the biology or the technology (e.g., amplification or sequencing bias). For example, in Figure 3 all replicates of Treatment1 are sequenced in lane 1 and all replicates of Treatment2 are sequenced in lane 2. Any differences in expression between Treatment1 and Treatment2 are confounded with possible lane effects that may persist across flow cells. In fact, once the data are collected, there is no way of separating the effects due to lane from the effects due to true treatment differences.
In RNA-Seq data, the design is the same for every gene, even though different genes have different variances and are potentially subject to different errors and biases. Of course, there are sources of variation that affect the majority of genes and these should certainly be integrated into the design. However, to ensure a robust analysis across all genes, sources of variation affecting only a minority of genes should be integrated into the design as well (e.g., a PCR-based GC bias may affect only a small proportion of transcript fragments; therefore if it is possible, PCR batch should be integrated into the design). As such, we examine two main sources of variation (beyond the sampling hierarchy explained previously) that may contribute to confounding of effects in RNA-Seq data, namely “batch effects” and “lane effects.” Batch effects include any errors that occur after random fragmentation of the RNA until it is input to the flow cell (e.g., PCR amplification and reverse transcription artifacts). Lane effects include any errors that occur from the point at which the sample is input to the flow cell until data are output from the sequencing machine (e.g., systematically bad sequencing cycles and errors in base calling).
Batch and lane effects have both been observed in previous studies. PCR amplification and reverse transcription artifacts were found to be nonnegligible in both Balwierz et al. (2009) and Chepelev et al. (2009). Chepelev et al. (2009) also observed systematically bad sequencing cycles, and Rougemont et al. (2008) discusses the presence of base-calling errors in the Solexa platform. Although Marioni et al. (2008) found that variation across lanes generally follows a Poisson sampling process, they did observe considerably more variation for a nonnegligible number of genes (on the order of
Balanced blocks by multiplexing:
To eliminate confounding caused by batch or lane effects, consider the situation in which all samples of RNA are pooled into the same batch and then sequenced in one lane of a flow cell. This would ensure that any batch effects are the same for all samples, and since the sequencing reaction is contained in one lane, all effects due to lane will be the same for all samples. While indeed this is an idealized situation, it can be accomplished by bar coding the RNA immediately after fragmentation. Once bar codes are attached to the random fragments, the samples can be pooled and processed together through the reverse transcription, size selection, and amplification steps. Typically, each lane is dedicated to sequencing one sample, so the number of samples m is equal to the number of lanes
It is worth noting that if L lanes are utilized, there is no loss of sequencing depth compared to running each sample in a lane. Figure 4 shows a comparison of this design to a typical design that confounds sample with batch and lane.
![Comparison of two designs for testing differential expression between treatments A and B. Treatment A is denoted by red tones and treatment B by blue tones. In the ideal balanced block design (left), six samples $\batchmode \documentclass[fleqn,10pt,legalpaper]{article} \usepackage{amssymb} \usepackage{amsfonts} \usepackage{amsmath} \pagestyle{empty} \begin{document} \((m{=}6)\) \end{document}$ are bar coded, pooled, and processed together. The pool is then divided into six equal portions that are input to six lanes $\batchmode \documentclass[fleqn,10pt,legalpaper]{article} \usepackage{amssymb} \usepackage{amsfonts} \usepackage{amsmath} \pagestyle{empty} \begin{document} \((L{=}6)\) \end{document}$ of the flow cell. Bar coding in the balanced block design results in six technical replicates $\batchmode \documentclass[fleqn,10pt,legalpaper]{article} \usepackage{amssymb} \usepackage{amsfonts} \usepackage{amsmath} \pagestyle{empty} \begin{document} \((T{=}6)\) \end{document}$ of each sample, while balancing batch and lane effects and blocking on lane. The balanced block design also allows partitioning of batch and lane effects from the within-group biological variability. The confounded design (right) represents a typical RNA-Seq experiment and consists of the same six samples, with no bar coding, and does not permit partitioning of batch and lane effects from the estimate of within-group biological variability.](https://oup.silverchair-cdn.com/oup/backfile/Content_public/Journal/genetics/185/2/10.1534_genetics.110.114983/7/m_405fig4.jpeg?Expires=1722424059&Signature=dRRwOccmXr1jjU6IaWZhK3nZiRjGtxC5N67~1jVcggNIIZnT4VbF0ECsAvvwjlVgpAVNkvHYIB~L8qhUH0i1or8e4tg9mC0i0aUUKTO27ziRmJu01eAQZhgDnCW76e5FHM-iSi5MDlwBTMY9iFTjXDe38sbrA2L1vOueLBa0j83F5gsG~pRPaaXXv7K5gXRKcvk4JQYNC8BMjDOWB9lWF8O-Y409DKD2Zo5K50uvZibfIiMcw2JjS0au2TXNOqUeSOBiRN7~Nf4o2h97Y5To~PCX8WRqhcpf9yTUBJ65WY5VYkMc1bXFqx2D-1FDkMa43NoKWpHajvfajifL0~wtPw__&Key-Pair-Id=APKAIE5G5CRDK6RD3PGA)
Comparison of two designs for testing differential expression between treatments A and B. Treatment A is denoted by red tones and treatment B by blue tones. In the ideal balanced block design (left), six samples
Balanced incomplete block designs and blocking without multiplexing:
Although the previous balanced block design is convenient for illustration purposes, in reality resources, technical constraints, and the scientific hypotheses under investigation will dictate the number of treatments

A balanced incomplete block design (BIBD) for three treatment groups (T1, T2, T3) with one subject per treatment group (T11, T21, T31) and two technical replicates of each (T111, T112, T211, T212, T311, T312). After fragmentation, each of the three samples is bar coded and divided in two (e.g., T11 would be split into T111 and T112) and then pooled and sequenced as illustrated (e.g., T111 is pooled with T212 as input to lane 1).
Clearly multiplexing is useful for generating technical replicates that are effective in blocking on lane and batch effects to reduce confounding. However, it is important to understand that technical replicates are no substitute for independent biological replication and that for a sufficient number of biological replicates certain designs can accommodate lane and/or flow cell as blocking factors. As an illustration of this flexibility, the design in Figure 3 can be rearranged as a balanced complete block design where the blocks are flow cells and a balanced incomplete block design where the blocks are lanes (Figure 6).
![A design based on three biological replicates within seven treatment groups. For each of the three flow cells there are eight lanes per flow cell and a control ($\batchmode \documentclass[fleqn,10pt,legalpaper]{article} \usepackage{amssymb} \usepackage{amsfonts} \usepackage{amsmath} \pagestyle{empty} \begin{document} \({\Phi}X\) \end{document}$) sample in lane 5. Tij refers to the $\batchmode \documentclass[fleqn,10pt,legalpaper]{article} \usepackage{amssymb} \usepackage{amsfonts} \usepackage{amsmath} \pagestyle{empty} \begin{document} \(j\mathrm{th}\) \end{document}$ replicate in the $\batchmode \documentclass[fleqn,10pt,legalpaper]{article} \usepackage{amssymb} \usepackage{amsfonts} \usepackage{amsmath} \pagestyle{empty} \begin{document} \(i\mathrm{th}\) \end{document}$ treatment group $\batchmode \documentclass[fleqn,10pt,legalpaper]{article} \usepackage{amssymb} \usepackage{amsfonts} \usepackage{amsmath} \pagestyle{empty} \begin{document} \((i{=}1,{\ldots},{\,}7;j{=}1,{\ldots},{\,}3)\) \end{document}$. In this design the flow cells form balanced complete blocks, and the lanes form balanced incomplete blocks.](https://oup.silverchair-cdn.com/oup/backfile/Content_public/Journal/genetics/185/2/10.1534_genetics.110.114983/7/m_405fig6.jpeg?Expires=1722424059&Signature=G1i8GH~raoAYRnfJ~41J7Ca5LvqsBnZmfGifZbVo6eFrT9ZpNwtOUpU09-NnxpXQgL42hCoMlG5jhUUpfmb3C3edCmBE43HJtoCm4QFEb-ymwr3EsTGuXVW1t0oVEh9J3zLYvZDc9PuJE74giUq26-1az2pzV4JH6vhY-EHg8v9jm8G1-NFf4l88nMdmC49OczQs672jR3zqbPxeEsMh1krvdG8TAO2ZBVkJhOvMOwKHV~RCUcVhFqjmHswmSEVhm~ecktShOtPnzpL9svD9ty7Z-2947gHoemDJVx-vIUbGJR9ShIrU~7gCwqz5EPjrmYp~PsdIPsXLMkellpPIDw__&Key-Pair-Id=APKAIE5G5CRDK6RD3PGA)
A design based on three biological replicates within seven treatment groups. For each of the three flow cells there are eight lanes per flow cell and a control (
Analyzing a balanced block design:
Analyzing a balanced block design with technical replicates:
Consider the balanced block design in Figure 4. Let
SIMULATIONS
To evaluate the effectiveness of the proposed multiplexed designs, we compare a biologically unreplicated unblocked design (Figure 7A), a biologically unreplicated balanced block design with technical replicates (i.e., multiplexing; Figure 7B), a biologically replicated (triplicate) unblocked design (Figure 7C), and a biologically replicated (triplicate) balanced block design with technical replicates by multiplexing (Figure 7D) in an experimental setting testing differential expression between Treatment 1
![Four designs (A–D) are compared in the simulation study for treatments $\batchmode \documentclass[fleqn,10pt,legalpaper]{article} \usepackage{amssymb} \usepackage{amsfonts} \usepackage{amsmath} \pagestyle{empty} \begin{document} \(T_{1}\) \end{document}$ and $\batchmode \documentclass[fleqn,10pt,legalpaper]{article} \usepackage{amssymb} \usepackage{amsfonts} \usepackage{amsmath} \pagestyle{empty} \begin{document} \(T_{2}\) \end{document}$. Design A is a biologically unreplicated unblocked design with one subject for treatment group $\batchmode \documentclass[fleqn,10pt,legalpaper]{article} \usepackage{amssymb} \usepackage{amsfonts} \usepackage{amsmath} \pagestyle{empty} \begin{document} \(T_{1}(T_{11})\) \end{document}$ and one subject for treatment group $\batchmode \documentclass[fleqn,10pt,legalpaper]{article} \usepackage{amssymb} \usepackage{amsfonts} \usepackage{amsmath} \pagestyle{empty} \begin{document} \(T_{2}(T_{21})\) \end{document}$. Design B is a biologically unreplicated balanced block design with $\batchmode \documentclass[fleqn,10pt,legalpaper]{article} \usepackage{amssymb} \usepackage{amsfonts} \usepackage{amsmath} \pagestyle{empty} \begin{document} \(T_{11}\) \end{document}$ split (bar coded) into two technical replicates $\batchmode \documentclass[fleqn,10pt,legalpaper]{article} \usepackage{amssymb} \usepackage{amsfonts} \usepackage{amsmath} \pagestyle{empty} \begin{document} \((T_{111},{\,}T_{112})\) \end{document}$ and $\batchmode \documentclass[fleqn,10pt,legalpaper]{article} \usepackage{amssymb} \usepackage{amsfonts} \usepackage{amsmath} \pagestyle{empty} \begin{document} \(T_{21}\) \end{document}$ split into two technical replicates $\batchmode \documentclass[fleqn,10pt,legalpaper]{article} \usepackage{amssymb} \usepackage{amsfonts} \usepackage{amsmath} \pagestyle{empty} \begin{document} \((T_{211},{\,}T_{212})\) \end{document}$ and input to lanes 1 and 2. Design C is a biologically replicated unblocked design with three subjects from treatment group $\batchmode \documentclass[fleqn,10pt,legalpaper]{article} \usepackage{amssymb} \usepackage{amsfonts} \usepackage{amsmath} \pagestyle{empty} \begin{document} \(T_{1}(T_{11},{\,}T_{12},{\,}T_{13})\) \end{document}$ and three subjects from treatment group $\batchmode \documentclass[fleqn,10pt,legalpaper]{article} \usepackage{amssymb} \usepackage{amsfonts} \usepackage{amsmath} \pagestyle{empty} \begin{document} \(T_{2}(T_{21},{\,}T_{22},{\,}T_{23})\) \end{document}$. Design D is a biologically replicated balanced block design with each subject (e.g., $\batchmode \documentclass[fleqn,10pt,legalpaper]{article} \usepackage{amssymb} \usepackage{amsfonts} \usepackage{amsmath} \pagestyle{empty} \begin{document} \(T_{11}\) \end{document}$) split (bar coded) into six technical replicates (e.g., $\batchmode \documentclass[fleqn,10pt,legalpaper]{article} \usepackage{amssymb} \usepackage{amsfonts} \usepackage{amsmath} \pagestyle{empty} \begin{document} \(T_{111},{\ldots},{\,}T_{116}\) \end{document}$) and input to six lanes.](https://oup.silverchair-cdn.com/oup/backfile/Content_public/Journal/genetics/185/2/10.1534_genetics.110.114983/7/m_405fig7.jpeg?Expires=1722424059&Signature=X-U2KCm-~Irwsylzm6i5hg7EJ7YMrYpqaFkQ0AUQJsTjJZy8SOsGEeUT-EKGoTjVlg~rPKX5D~pWdJA19Fk6cazCbWLESRicOuEWD3cBbIHMZ88wjwJDJA-l-g~VlCVVG4c6cTu8xiiYasdIK0JmwM-fLPOvDB8FXarutHLpg6t3d0mGGikdlQD9YvHr8pYJXcmP7Bg~-LOOY6TSA31IkUDhOPk0~GVmYD4PAj7IECg5Dzqb3CLawyNrexSjPoK-Ue5Zf19yvdsyHKmb7Q4YtvlR55PoF8PiqyyqSTDgrpLIW0frFfA5abNoxNMHXfhKtWLII6stfB~Qs7tqGCqofA__&Key-Pair-Id=APKAIE5G5CRDK6RD3PGA)
Four designs (A–D) are compared in the simulation study for treatments
Data simulation:
For design A (Figure 7) we tested for differential expression using Fisher's exact test (1), setting both column totals of the 2 × 2 table to 3,000,000. For design B (Figure 7) we fit a balanced block design with technical replicates model. We set the offset term to c, and since this design does not have any biological replicates (i.e., J = 1), we did not estimate dispersion. The likelihood ratio was compared to a
We ran 10,000 simulations under each setting (S1–S4), varying
Results:
Receiver operating characteristic (ROC) curves offer a useful way of comparing false positive rates with true positive rates. Using ROC curves the true positive rate is typically plotted on the vertical axis, and the false positive rate is plotted on the horizontal axis. The resulting curves can be compared by fixing a false positive rate (type I error rate) and contrasting the corresponding true positive rates (statistical power). If one ROC curve is always above another, this indicates its superiority in classifying genes as differentially expressed. The diagonal identity line indicates the performance of classifying a gene as differentially expressed, using a completely random guess (e.g., guessing differential expression 90% of the time yields a 90% true positive and false positive rate).
The designs featuring independent replication (designs C and D) demonstrate remarkably better performance than the unreplicated designs (designs A and B) whenever there is nonnegligible within-treatment-group biological variability (Figure 8; supporting information, Figure S1 and Figure S2) across simulation settings (S1–S4). Even when there is very small within-treatment-group biological variability (Figure S3), the replicated designs still outperform the unreplicated designs.
![ROC curves for the within-group variability setting $\batchmode \documentclass[fleqn,10pt,legalpaper]{article} \usepackage{amssymb} \usepackage{amsfonts} \usepackage{amsmath} \pagestyle{empty} \begin{document} \(\mathrm{{\psi}}{=}5\) \end{document}$. The x-axis represents the false positive rate and the y-axis represents the true positive rate. The four panels of the graph show results for each of the four simulation settings. The ROC curve for the unblocked unreplicated design (A) is in solid red, for the blocked unreplicated design (B) is in dotted red, for the unblocked replicated design (C) is in solid blue, and for the blocked replicated design (D) is in dotted blue. The replicated designs always outperform the unreplicated designs, and whenever there is a batch effect or a lane effect, the blocked designs outperform their unblocked counterparts.](https://oup.silverchair-cdn.com/oup/backfile/Content_public/Journal/genetics/185/2/10.1534_genetics.110.114983/7/m_405fig8.jpeg?Expires=1722424059&Signature=Stbk-D3yAycY4Q7YcgRIoDsAPgLdCTwKPZCIFA58n2cXP0WatxcktbRCgLsScaaSyyVU2P1TcWlAzptNS3HVgn3td42o~g52G1EdIzL~PmYQfbxa6QkE1ZH4Ef86b201U80keUHH2oT0szltCalGJePUBOxGQ1mfmclRoVJBQKzWpEoyOssIEudX0aK16YL1j2SxiTnLbcqGyOynFtO1PoR458yV2aSyY1llvgxLAA-Jurd-ik9V3KqT6k2oqZEbtMBkamZXSEZM~DaWp5Sa9I0mb9QQw99mhfTMsAWrV6uR21Y-DhoaozDbV1lQooVbbIiDslDezJcgj5UMJV3nVw__&Key-Pair-Id=APKAIE5G5CRDK6RD3PGA)
ROC curves for the within-group variability setting
To evaluate which designs upheld the typical 0.05 false positive rate, for each of the four different sampling rates
The false positive rates (at the 0.05 nominal level) for designs A–D considering four settings of
. | . | Simulation setting . | |||
|---|---|---|---|---|---|
| Sampling rate \((\mathrm{{\lambda}}_{1})\) . | Design . | S1 . | S2 . | S3 . | S4 . |
\(10^{{-}2}\) | A | 0.9655 | 0.9548 | 0.9630 | 0.9508 |
| B | 0.9524 | 0.9521 | 0.9496 | 0.9498 | |
| C | 0.0494 | 0.0499 | 0.0485 | 0.0480 | |
| D | 0.0476 | 0.0463 | 0.0482 | 0.0487 | |
\(10^{{-}3}\) | A | 0.8789 | 0.8595 | 0.8744 | 0.8434 |
| B | 0.8456 | 0.8479 | 0.8431 | 0.8481 | |
| C | 0.0477 | 0.0480 | 0.0532 | 0.0499 | |
| D | 0.0506 | 0.0491 | 0.0472 | 0.0489 | |
\(10^{{-}4}\) | A | 0.6521 | 0.5873 | 0.6325 | 0.5527 |
| B | 0.5551 | 0.5622 | 0.5583 | 0.5677 | |
| C | 0.0522 | 0.0505 | 0.0527 | 0.0516 | |
| D | 0.0538 | 0.0529 | 0.0522 | 0.0532 | |
\(10^{{-}5}\) | A | 0.2662 | 0.2299 | 0.2491 | 0.2111 |
| B | 0.2407 | 0.2452 | 0.2458 | 0.2411 | |
| C | 0.0482 | 0.0524 | 0.0503 | 0.0461 | |
| D | 0.0488 | 0.0460 | 0.0494 | 0.0477 | |
. | . | Simulation setting . | |||
|---|---|---|---|---|---|
| Sampling rate \((\mathrm{{\lambda}}_{1})\) . | Design . | S1 . | S2 . | S3 . | S4 . |
\(10^{{-}2}\) | A | 0.9655 | 0.9548 | 0.9630 | 0.9508 |
| B | 0.9524 | 0.9521 | 0.9496 | 0.9498 | |
| C | 0.0494 | 0.0499 | 0.0485 | 0.0480 | |
| D | 0.0476 | 0.0463 | 0.0482 | 0.0487 | |
\(10^{{-}3}\) | A | 0.8789 | 0.8595 | 0.8744 | 0.8434 |
| B | 0.8456 | 0.8479 | 0.8431 | 0.8481 | |
| C | 0.0477 | 0.0480 | 0.0532 | 0.0499 | |
| D | 0.0506 | 0.0491 | 0.0472 | 0.0489 | |
\(10^{{-}4}\) | A | 0.6521 | 0.5873 | 0.6325 | 0.5527 |
| B | 0.5551 | 0.5622 | 0.5583 | 0.5677 | |
| C | 0.0522 | 0.0505 | 0.0527 | 0.0516 | |
| D | 0.0538 | 0.0529 | 0.0522 | 0.0532 | |
\(10^{{-}5}\) | A | 0.2662 | 0.2299 | 0.2491 | 0.2111 |
| B | 0.2407 | 0.2452 | 0.2458 | 0.2411 | |
| C | 0.0482 | 0.0524 | 0.0503 | 0.0461 | |
| D | 0.0488 | 0.0460 | 0.0494 | 0.0477 | |
The false positive rates (at the 0.05 nominal level) for designs A–D considering four settings of
. | . | Simulation setting . | |||
|---|---|---|---|---|---|
| Sampling rate \((\mathrm{{\lambda}}_{1})\) . | Design . | S1 . | S2 . | S3 . | S4 . |
\(10^{{-}2}\) | A | 0.9655 | 0.9548 | 0.9630 | 0.9508 |
| B | 0.9524 | 0.9521 | 0.9496 | 0.9498 | |
| C | 0.0494 | 0.0499 | 0.0485 | 0.0480 | |
| D | 0.0476 | 0.0463 | 0.0482 | 0.0487 | |
\(10^{{-}3}\) | A | 0.8789 | 0.8595 | 0.8744 | 0.8434 |
| B | 0.8456 | 0.8479 | 0.8431 | 0.8481 | |
| C | 0.0477 | 0.0480 | 0.0532 | 0.0499 | |
| D | 0.0506 | 0.0491 | 0.0472 | 0.0489 | |
\(10^{{-}4}\) | A | 0.6521 | 0.5873 | 0.6325 | 0.5527 |
| B | 0.5551 | 0.5622 | 0.5583 | 0.5677 | |
| C | 0.0522 | 0.0505 | 0.0527 | 0.0516 | |
| D | 0.0538 | 0.0529 | 0.0522 | 0.0532 | |
\(10^{{-}5}\) | A | 0.2662 | 0.2299 | 0.2491 | 0.2111 |
| B | 0.2407 | 0.2452 | 0.2458 | 0.2411 | |
| C | 0.0482 | 0.0524 | 0.0503 | 0.0461 | |
| D | 0.0488 | 0.0460 | 0.0494 | 0.0477 | |
. | . | Simulation setting . | |||
|---|---|---|---|---|---|
| Sampling rate \((\mathrm{{\lambda}}_{1})\) . | Design . | S1 . | S2 . | S3 . | S4 . |
\(10^{{-}2}\) | A | 0.9655 | 0.9548 | 0.9630 | 0.9508 |
| B | 0.9524 | 0.9521 | 0.9496 | 0.9498 | |
| C | 0.0494 | 0.0499 | 0.0485 | 0.0480 | |
| D | 0.0476 | 0.0463 | 0.0482 | 0.0487 | |
\(10^{{-}3}\) | A | 0.8789 | 0.8595 | 0.8744 | 0.8434 |
| B | 0.8456 | 0.8479 | 0.8431 | 0.8481 | |
| C | 0.0477 | 0.0480 | 0.0532 | 0.0499 | |
| D | 0.0506 | 0.0491 | 0.0472 | 0.0489 | |
\(10^{{-}4}\) | A | 0.6521 | 0.5873 | 0.6325 | 0.5527 |
| B | 0.5551 | 0.5622 | 0.5583 | 0.5677 | |
| C | 0.0522 | 0.0505 | 0.0527 | 0.0516 | |
| D | 0.0538 | 0.0529 | 0.0522 | 0.0532 | |
\(10^{{-}5}\) | A | 0.2662 | 0.2299 | 0.2491 | 0.2111 |
| B | 0.2407 | 0.2452 | 0.2458 | 0.2411 | |
| C | 0.0482 | 0.0524 | 0.0503 | 0.0461 | |
| D | 0.0488 | 0.0460 | 0.0494 | 0.0477 | |
Interestingly, the blocked designs did not outperform the unblocked designs in terms of the false positive rates (Table 2). However, across simulation settings, whenever a batch or lane effect is present, the blocked designs demonstrate distinguishably higher true positive rates than the unblocked designs. We speculate that there are two reasons for this result. First, the blocked designs are all included in the same batch such that batch-to-batch variation is removed from residual error. Second, even though we did not use lane as a blocking factor, lane effects were balanced across every biological replicate in the two treatment groups, thereby reducing the chance that a lane effect would overly influence one treatment group and produce a misleading result (i.e., confounding). Partitioning the variation due to lane through a statistical model that included blocks on lanes may further enhance the performance of the blocked designs by reducing the residual error.
DISCUSSION
Fisher (1935a) was right. Replication, randomization, and blocking are essential components of any well planned and properly analyzed design. RNA-Seq designs and analyses are no exception. Luckily, the current format and properties of the NGS platforms lend themselves nicely to the concepts of randomization and blocking. However, the decision to biologically replicate remains in the hands of the scientist.
Our purpose in writing this article is to demonstrate that variability (which is dependent on the organism, laboratory techniques, and the biological factors under investigation) may be larger than expected and, if not estimated properly, will negatively affect the results of any study. This variability is positively correlated with the magnitude of the overall variability between biological subjects and can be dealt with by employing statistical models that not only accommodate estimation of within-treatment-group biological variability, but also remain faithful to a nominal significance level. Indisputably, the best way to ensure reproducibility and accuracy of results is to include independent biological replicates (technical replicates are no substitute) and to acknowledge anticipated nuisance factors (e.g., lane, batch, and flow-cell effects) in the design.
For both replicated and unreplicated scenarios the proposed balanced block designs benefit from both the NGS platform design and multiplexing. These designs are as good as, if not better than, their unblocked counterparts in terms of power and type I error and are considerably better when batch and/or lane effects are present. Realizing of course that it is not possible to determine whether batch and/or lane effects are present a priori, we recommend the use of block designs to protect against observed differences that are attributable to these potential sources of variation. Since we understand both the expense associated with block designs and the concern of multiplexing, we offer some alternatives. Certainly, it is possible to avoid multiplexing if there are enough biological replicates and sequencing lanes that allow for designs that block on lane and/or flow cell (see Figure 6). However, if resources are limited (i.e., one flow cell), multiplexing offers an alternative that at the very least eliminates the potential for confounding of effects. Multiplexing and blocking aside, the bottom line remains the same: results from unreplicated data cannot be generalized beyond the sample tested (here, differential expression).
Even though the benefits of good designs far outweigh any drawbacks, we anticipate objections to the multiplexed designs related to cost, loss of sequencing depth, and bar-code bias. To mitigate these concerns we provide some reassurances. First, although increased cost may be a concern, the added cost and time to multiplex are negligible when compared to the overall time and resources required for RNA sequencing. Second, there may be additional concerns that multiplexing will result in an overall loss of sequencing depth. This will be problematic only if enough bar codes are incorrectly identified such that the read count for each gene is affected. Recently, Philippe et al. (2009) estimated that, on the Illumina platform, the probability that a 20-base transcript tag contains one or more sequencing errors is 0.0048. Using this as an upper bound on the probability that the bar code will contain one or more sequencing errors (the bar code is 6 bases long), in a lane with 10 million usable sequencing reads and 10 different transcripts, miscalled bar codes would result in an average loss of at most 4.8 reads from the read count per transcript. Third, while there may be technical problems with multiplexing, such as uneven coverage of the samples or bias in the base calls adjacent to the bar code, as long as the problem with uneven coverage is consistent within each sample, normalization schemes that take into account lane- and sample-specific coverage can correct for this (e.g., dividing each gene count by the coverage over sample or the coverage over lane). Read bias associated with bar coding is not problematic if it affects all samples the same way. Specifically, a proper normalization scheme will be robust to bias as long as the problem is consistent within sample.
The design principles that are presented here are applicable to a variety of applications involving quantitative comparisons across samples and can be put into practice on every NGS platform as applied to RNA-Seq. However, for other applications (e.g., ChIP-Seq and copy number variant studies, etc.) a clearly defined statistical model that fully accounts for sources of variation must be developed before specific details of the design can be described.
Footnotes
Supporting information is available online at http://www.genetics.org/cgi/content/full/genetics.110.114983/DC1.
Footnotes
Communicating editor: L. M. McIntyre
Acknowledgements
We thank the anonymous reviewers for their thorough and thought-provoking review. We are also grateful to Andrea Rau for helpful comments on the manuscript and to Scott Jackson, Rob Martienssen, and their respective labs for a wealth of sequencing data and biological information. This work is supported by a National Science Foundation Plant Genome grant (DBI-0733857) in part to R.W.D.
References
Alkan, C., J. M. Kidd, T. Marques-Bonet, G. Aksay, F. Antonacci et al.,
Audic, S., and J. Claverie,
Baggerly, K. A., L. Deng, J. S. Morris and C. M. Aldaz,
Balwierz, P. J., P. Carninci, C.O. Daub, J. Kawai, Y. Hayashizaki et al.,
Bentley, D. R., S. Balasubramanian, H. P. Swerdlow, G. P. Smith, J. Milton et al.,
Bloom, J. S., Z. Khan, L. Kruglyak, M. Singh and A. A. Caudy,
Chepelev, I., G. Wei, Q. Tang and K. Zhao,
Churchill, G. A.,
Cloonan, N., A. R. R. Forrest, G. Kolle, B. B. A. Gardiner, G. J. Faulkner et al.,
Cloonan, N., Q. Xu, G. J. Faulkner, D. F. Taylor, T. P. Tang et al.,
Craig, D. W., J. V. Pearson, S. Szelinger, A. Sekar, M. Redman et al.,
Eveland, A. L., D. R. McCarty and K. E. Koch,
Fisher, R. A.,
Fisher, R. A., and F. Yates,
Gentleman, R. C., V. J. Carey, D. M. Bates, B. Bolstad, M. Dettling et al.,
Hamady, M., J. J. Walker, J. K. Harris, N. J. Gold and R. Knight,
Kal, A. J., A. J. van Zonneveld, V. Benes, M. van den Berg, M. G. Koerkamp et al.,
Kerr, M. K., and G. A. Churchill,
Kerr, M. K., and G. A. Churchill,
Kerr, M. K., M. Martin and G. A. Churchill,
Lee, M. T., F. C. Kuo, G. A. Whitmore and J. Sklar,
Lu, J., J. K. Tomfohr and T. B. Kepler,
Man, M. Z., X. Wang and Y. Wang,
Marioni, J. C., C. E. Mason, S. M. Mane, M. Stephens and Y. Gilad,
Morozova, O., M. Hirst and M. A. Marra,
Mortazavi, A., B. A. Williams, K. McCue, L. Schaeffer and B. Wold,
Nagalakshmi, U., Z. Wang, K. Waern, C. Shou, D. Raha et al.,
Park, P. J.,
Philippe, N., A. Boureux, L. Brehelin, J. Tarhio, T. Commes et al.,
Robinson, M. D., and G. K. Smyth,
Robinson, M. D., and G. K. Smyth,
Robinson, M. D., D. J. McCarthy and G. K. Smyth,
Romualdi, C., S. Bortoluzzi and G. Danieli,
Rougemont, J., A. Amzallag, C. Iseli, L. Farinelli, I. Xenarios et al.,
Ruijter, J. M., A. H. C. V. Kampen and F. Baas,
Schena, M., D. Shalon, R. W. Davis and P. O. Brown,
Shendure, J.,
Smyth, G. K.,
Sultan, M., M. H. Schulz, H. Richard, A. Magen, A. Klingenhoff et al.,
Thygesen, H. H., and A. H. Zwinderman,
Tino, P.,
Tjur, T.,
Velculescu, V. E., L. Zhang, B. Vogelstein and K. W. Kinzler,
Vêncio, R. Z., H. Brentani, D. F. Patrão and C. A. Pereira,