How data analysis affects power, reproducibility and biological insight of RNA-seq studies in complex datasets

The sequencing of the full transcriptome (RNA-seq) has become the preferred choice for the measurement of genome-wide gene expression. Despite its widespread use, challenges remain in RNA-seq data analysis. One often-overlooked aspect is normalization. Despite the fact that a variety of factors or ‘batch effects’ can contribute unwanted variation to the data, commonly used RNA-seq normalization methods only correct for sequencing depth. The study of gene expression is particularly problematic when it is influenced simultaneously by a variety of biological factors in addition to the one of interest. Using examples from experimental neuroscience, we show that batch effects can dominate the signal of interest; and that the choice of normalization method affects the power and reproducibility of the results. While commonly used global normalization methods are not able to adequately normalize the data, more recently developed RNA-seq normalization can. We focus on one particular method, RUVSeq and show that it is able to increase power and biological insight of the results. Finally, we provide a tutorial outlining the implementation of RUVSeq normalization that is applicable to a broad range of studies as well as meta-analysis of publicly available data.


Microarray data
The "raw" microarray data are available in GEO with the accession number GSE50423. See the GEO submission for details on the pre-processing.
We mapped the Affymetrix probe IDs to ENSEMBL gene IDs. We retained only the genes with a one-to-one mapping. Peixoto_FC_array_combined.txt contains the combined dataset used for the ranking analysis of Figure 5.

Positive and negative controls
Peixoto_NegativeControls.txt contains a list of negative control genes, i.e., genes that are not influenced by the biological effect of interest (see the main text for details). Peixoto_positive_controls.txt contains a list of positive control genes, i.e., genes known to be differentially expressed with respect to the biological effect of interest (see the main text for details).

Bioconductor packages
Remove Unwanted Variation (RUV) normalization is implemented in the RUVSeq package. Upper-quartile (UQ) normalization is implemented in the EDASeq package. To generate the figures, we have used the plotRLE and plotPCA functions implemented in EDASeq and the CATplot function implemented in the ffpe package. The differential expression (DE) analysis was carried out with edgeR for the RNA-seq data and with limma for the microarray data. See "Session Info" for the package versions used in this document.

Fear Conditioning (FC) data
After reading the FC counts and the positive and negative control genes in R, we filter out the non expressed genes.

UQ normalization
The betweenLaneNormalization function of EDASeq implements UQ normalization. We can then use the plotRLE and plotPCA functions of EDASeq to explore the normalized data.

RUV normalization
Analogously, we can use the RUVs function of RUVSeq to normalize the data with RUV. Note that RUVs uses replicate samples and negative control genes to normalize the data. We use the biological replicates and a list of negative control genes obtained from a recent microarray study. Here, we use RUVs on UQ-normalized data, see (Risso et al. 2014) for additional details.
The information about the replicates is contained in the matrix groups, which has as many rows as the number of replicate groups and as many columns as the number of replicates in each group (see the RUVSeq package manual for additional details).

Differential expression
Next, we evaluate the impact of the normalization on differential expression. Note that it is preferable to model the original read counts, adding the normalization factors to the differential expression model, rather than modeling the normalized counts. See (Risso et al. 2014) for a discussion.
Here, we consider edgeR, but other count-based models, such as DESeq can also be used.

UQ normalization
We model the counts with a generalized linear model (GLM), using the group variable x as covariate. We compute the UQ normalization factors, estimate the dispersion parameters and fit the GLM. We expect the distribution of the p-values to be uniform for the majority of the non DE genes, with a spike at zero corresponding to the DE genes. The histogram of the p-value corresponding to UQ normalization is not ideal.
Additionally, we can look at a "volcano plot," in which we plot the negative log p-values vs. the log-fold-change. We expect to see the positive controls (red circles) as high as possible and the negative controls (green circles) as close to zero as possible. The blue points correspond to genes called DE at an FDR of 0.01.

RUV normalization
In the RUV approach, we consider the factors of unwanted variation as additional covariates in the GLM (see risso2014ruv for details We compute the UQ normalization factors, estimate the dispersion parameters and fit the GLM. y <-DGEList(counts = filtered, group = x) y <-calcNormFactors(y, method = "upperquartile") y <-estimateGLMCommonDisp(y, design) y <-estimateGLMTagwiseDisp(y, design) fit <-glmFit(y, design) FC vs CC. First, we compare the expression levels after training (FC) to those of the controls (CC). The histogram looks like expected and the volcano plot shows that we obtain smaller p-values for the positive controls, while the negative controls are still close to one (zero in the -log scale). We also detect more genes as DE, suggesting more power, although the DE genes might be a mixture of true and false positives.
RT vs. CC. We also compare the expression levels after retrieval of the memory (RT) to those of the controls (CC).

Object Location Memory (OLM) data
We next look at the OLM experiment. After reading the OLM counts into R, we filter out the non expressed genes.

Normalization
As for the FC experiment, we consider UQ and RUVs normalizations.

Combined Analysis
In this section, we consider a combined analysis of the FC and OLM experiments.

Differential expression
Here, we consider the "average effect" of the OLM and FC tasks compared to the controls as differential expression. Note that the combined analysis allows us to consider many interesting comparisons (not shown here and not run in the code below, for ease of reading). In particular, one can consider the individual effects (FC vs. CC, OLM vs. CC) and the difference between the two tasks (FC vs. OLM).

UQ normalization
design <-model.matrix(~x -1) y <-DGEList(counts = filtered, group = x) y <-calcNormFactors(y, method = "upperquartile") EdgeR's results with UQ normalization are not satisfactory: in particular, some of the positive controls are not detected as DE and some of the negative controls show a strong negative log-fold-change.

Comparison with microarray data
Finally, we look at the concordance between DE lists obtained from RNA-seq (either with UQ or RUV normalization) and microarray.
We consider a subset of genes, including only the genes mapped uniquely by the Affymetrix probe-sets and detected by RNA-seq, and only the samples assessed with both technologies.

Tuning parameters
The two main tuning parameters of RUV are the number of factors of unwanted variation, k, and the set of negative controls genes. In this section, we provide guidelines on how to select these parameters and we show that RUVs is robust to the choice of negative controls.

Number of factors of unwanted variation (k)
The choice of the parameter k is not easy and is dataset-dependent. Hence, we recommend to perform extensive exploratory data analysis, comparing different values of k.
For this task, we found very useful to compare the RLE and PCA plots of the normalized data, as well as volcano plot and histogram of the p-values.

RT8
When the value of k is too high, the RLE distributions become uneven again. More importantly, the model over-corrects for unwanted variation and ends up removing (almost) all the biological variability within the conditions. This is shown in the PCA plot, where all the biological replicate samples are collapsed to almost the same coordinates.

Set of negative control genes
The choice of the set of negative control genes is somewhat dependent on the dataset. When results on similar datasets are available (either because of previous experiments or through publicly available data), it is often a good choice to use a context-specific set of negative controls.
In the absence of such a tailored set, alternative choices are a set of housekeeping genes, a set of synthetic controls (spike-ins), or a set of in silico empirical controls (Risso et al. 2014).
A good property of RUVs is its robustness to the choice of negative control genes. To highlight this robustness, we show the performance of RUVs when using all the genes as negative controls (k = 5).

RT7 RT8
The results are very close to those based on the set of negative controls (see above).