RNA-seq is a methodology for RNA profiling based on next-generation sequencing that makes it possible to measure and compare gene expression patterns at unprecedented resolution. Although the appealing features of this technique have promoted its application to a wide panel of transcriptomics studies, the fast-evolving nature of experimental protocols and computational tools challenges the definition of a unified RNA-seq analysis pipeline. In this review, focused on the study of differential gene expression with RNA-seq, we go through the main steps of data processing and discuss open challenges and possible solutions.
In every living organism, DNA encodes the whole information needed to determine all the properties and functions of each single cell. From this blueprint, cells can dynamically access and translate specific instructions through ‘gene expression’, namely, by selectively switching on and off a particular set of genes. The information encoded in the selected genes is transcribed into RNA molecules, which in turn can be translated into proteins or can be directly used to finely control gene expression. Thus, the set of RNAs transcribed in a certain condition and time reflects the current state of a cell and can reveal pathological mechanisms underlying diseases. More interestingly, the study of differential gene expression enables the comparison of gene expression profiles from different tissues and conditions to identify genes that play a major role in the determination of the phenotype. For instance, the comparison of healthy versus diseased tissues can provide new insights into the genetic variables involved in pathology.
In recent years, RNA-seq, a methodology for RNA profiling based on next-generation sequencing (NGS), has been replacing microarrays for the study of gene expression. The sequencing framework of RNA-seq makes it possible to investigate, at high resolution, all the RNAs present in a sample, characterizing their sequences and quantifying their abundances at the same time. In practice, millions of short strings, called ‘reads’, are sequenced from random positions of the input RNAs. These reads can then be computationally mapped onto a reference genome to reveal a ‘transcriptional map’, where the number of reads aligned to each gene gives a measure of its level of expression.
The powerful features of RNA-seq, such as high resolution and broad dynamic range, have boosted an unprecedented progress of transcriptomics research, producing an impressive amount of data worldwide. To support this exponential growth and to deal with the different steps of data analysis, several computational tools have been developed and updated at a fast pace. Nevertheless, the analysis scheme depicted above, which might seem simple at first glance, is in fact far more complex and consists of several processing steps.
In this review, we describe the current RNA-seq analysis framework, focusing on each computational step from read preprocessing to differential expression (DE) analysis. We review some of the most promising methodologies available, along with their underlying algorithmic strategies and mathematical models, and identify the research topics that require further investigation. We believe this work can provide a broad overview of RNA-seq analysis and can guide users to define and implement their own processing pipeline. Moreover, the dissection of the most challenging aspects of data analysis can help users to select the best methods depending on the characteristics of the specific study, considering both the driving biological question and the actual data features.
INVESTIGATING GENE EXPRESSION WITH RNA-SEQ
The transcriptome is the whole set of RNAs transcribed from the genes of a cell. As discussed above, their relative abundances reflect the level of expression of the corresponding genes, for a specific developmental stage or physiological condition. Although RNAs are not the final products of the transcription–translation process, the study of gene expression and differential gene expression can unveil important aspects about the cell states under investigation.
In past years, hybridization-based approaches, such as microarrays, were the most widely used solutions for gene expression profiling and DE analysis, thanks to their high throughput and relatively low costs. These technologies consist of an array of probes, whose sequences represent particular regions of the genes to be monitored. The sample under investigation is washed over the array, and RNAs are free to hybridize to the probes with a complementary sequence. A fluorescent dye is used to label the RNAs, so that image acquisition of the whole array enables the quantification of the expressed genes. Although widely used in quantitative transcriptomics, these techniques have several limitations [3, 4]:
reliance on prior knowledge about the genome for probe design;
possibility to monitor only some portions of the known genes and not the actual sequences of all transcribed RNAs;
high background levels due to cross-hybridization, i.e. imperfect hybridization between quasi-complementary sequences;
limited dynamic range due to background noise and signal saturation;
need for normalization to compare data from different arrays.
The advent of NGS has revolutionized transcriptomics and quickly established RNA-seq as the preferred methodology for the study of gene expression [3, 5]. The standard workflow of an RNA-seq experiment is described in the following. The RNAs in the sample of interest are initially fragmented and reverse-transcribed into complementary DNAs (cDNAs). The obtained cDNAs are then amplified and subjected to NGS. In principle, all NGS technologies can be used for RNA-seq, even though Illumina sequencers (http://www.illumina.com) are now the most commonly used solution. The millions of short reads generated can then be mapped onto a reference genome, and the number of reads aligned to each gene, called ‘counts’, gives a digital measure of gene expression levels in the sample under investigation.
Although RNA-seq is still under active development, it is now widely used in place of microarrays to measure and compare gene transcription levels, as it offers several key advantages over hybridization-based technologies [3–5, 7–9], such as:
reconstruction of known and novel transcripts at single-base level;
broad dynamic range, not limited by signal saturation;
high levels of reproducibility.
The flexibility enabled by single-base resolution probably represents the most powerful feature, as it allows the quantification and sequencing of all the transcripts present in a sample. Compared with microarrays, which can only assay portions of transcripts corresponding to probes, RNA-seq leverages the sequencing framework to go beyond the pure quantification task, enabling new applications, such as transcriptome profiling of non-model organisms [10, 11], discovery of novel transcripts, investigation of RNA editing [13, 14] and quantification of allele-specific gene expression.
Despite all these appealing features and the apparently simple analysis scheme, RNA-seq studies produce large and complex data sets, whose interpretation is not straightforward [16, 17]. Data analysis is further challenged by technical issues inherent to the specific NGS technology, such as sequencing errors in the output reads due to miscalled bases, or by biases introduced by the different steps of the RNA-seq protocol, such as amplification, fragmentation and reverse-transcription [18–20]. In particular, protocol-specific biases may under- or over-represent specific loci, leading to biased results and thus necessitating careful data quality control and normalization. The latter issue is described in detail in the ‘Count bias and normalization’ section. Nevertheless, if a well-annotated reference genome or transcriptome is available and the aim of an RNA-seq study is the detection of DE genes, a basic data processing pipeline consists of the following steps: (i) read mapping, (ii) counts computation, (iii) counts normalization and (iv) detection of differentially expressed genes (Figure 1). More sophisticated pipelines can be tailored to specific needs by adding pre- and post-processing modules before and after read mapping.
ALGORITHMS FOR READ MAPPING
The first computational step of the RNA-seq data analysis pipeline is read mapping: reads are aligned to a reference genome or transcriptome by identifying gene regions that match read sequences. So far, many alignment tools have been proposed [21, 22]. In all cases, the mapping process starts by building an index of either the reference genome or the reads, which is then used to quickly retrieve the set of positions in the reference sequence where the reads are more likely to align. Once this subset of possible mapping locations has been identified, alignment is performed in these candidate regions with slower and more sensitive algorithms [21, 23]. The available mapping tools can be divided into two main categories based on the methodology used to build the index: hash tables or Burrows–Wheeler transform (BWT) (reviewed in ).
The hash table is a common data structure for indexing complex data sets to facilitate rapid string searching. Mapping tools can build hash tables either on the set of input reads or on the reference, considering all subsequences of a certain length (k-mers) contained in the considered sequences. In the hash table, the key of each entry is a k-mer, while the value is the list of all positions in the reference where the k-mer was found. The two solutions have different advantages and drawbacks [21, 23]. For instance, building hash tables of the reference requires constant memory, for a given reference and parameter set, regardless of the size of the input read data. Conversely, building hash tables of reads typically requires a variable but smaller memory footprint, depending on the number and complexity of the reads. However, this latter solution may require longer processing times to scan the entire reference sequence when searching for hits, even if the input read set is small, and is not well suited for parallelization.
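As an illustration, the reference-indexing variant can be sketched in a few lines of Python (a deliberately naive toy, with hypothetical function names; real aligners add spaced seeds, filtering and memory-efficient encodings):

```python
from collections import defaultdict

def build_kmer_index(reference, k):
    """Hash table mapping each k-mer of the reference to its start positions."""
    index = defaultdict(list)
    for i in range(len(reference) - k + 1):
        index[reference[i:i + k]].append(i)
    return index

def candidate_positions(read, index, k):
    """Seed every k-mer of the read in the index and collect the genome
    positions where the full read could start (candidate alignment loci)."""
    candidates = set()
    for offset in range(len(read) - k + 1):
        for pos in index.get(read[offset:offset + k], []):
            if pos - offset >= 0:
                candidates.add(pos - offset)
    return candidates

reference = "ACGTACGTTGCA"
index = build_kmer_index(reference, k=4)
print(candidate_positions("TACGTT", index, k=4))  # prints: {3}
```

The candidate set is typically small, so the slower, more sensitive alignment step mentioned above only needs to inspect a handful of loci instead of the whole reference.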
BWT is a reversible string rearrangement that encodes the genome into a more compact representation, leveraging the redundancy of repeated subsequences. Methods based on BWT create an index of the BWT, called the ‘FM-index’, that can be used to perform fast string searching in a reduced domain of available subsequences, without scanning the whole genome. The combination of BWT and FM-index ensures a limited memory footprint, but requires longer computational times for index construction than hash-based methods. However, since the index has to be constructed only once for a given reference, and precomputed indexes for several model genomes are already available, this aspect has minimal impact on the total computational time. Conversely, the strategy used to extend the first partial high-quality hits, identified thanks to hash- or BWT-based indexes, into full-read alignments has a major impact on algorithm performance. Usually, hash-based algorithms implement a ‘seed-and-extend’ approach leveraging a bounded version of the Smith–Waterman (SW) algorithm. BWT-based solutions sample substrings of the reference using the FM-index and then accommodate inexact matches by tolerating some mismatches, up to a certain threshold. BWT implementations, which were developed for short (<50 nt) read alignment, impose very stringent constraints on inexact matches, which makes them much faster than hash-based approaches, but less sensitive [21, 28]. As NGS technologies produce increasingly longer reads (>100 nt), mapping tools are implementing hybrid solutions, which exploit the efficiency of BWT and the FM-index for seeding and then perform alignment extension with SW-like algorithms [29–33].
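The rotation-sorting construction of the BWT and the FM-index backward search can be sketched as follows (a toy implementation for illustration only: real indexes never materialize all rotations and precompute rank tables instead of rescanning the string):

```python
def bwt(text):
    """Burrows-Wheeler transform via sorted rotations ('$' terminates the text)."""
    text += "$"
    rotations = sorted(text[i:] + text[:i] for i in range(len(text)))
    return "".join(rotation[-1] for rotation in rotations)

def backward_search(bwt_str, pattern):
    """Count pattern occurrences with FM-index backward search.
    rank() and first_occ() are recomputed on the fly here; a real FM-index
    stores them as compressed tables for O(1) lookup."""
    first_col = sorted(bwt_str)

    def rank(ch, i):      # occurrences of ch in bwt_str[:i]
        return bwt_str[:i].count(ch)

    def first_occ(ch):    # first row starting with ch among sorted rotations
        return first_col.index(ch)

    lo, hi = 0, len(bwt_str)          # current interval [lo, hi) of matching rows
    for ch in reversed(pattern):
        if ch not in first_col:
            return 0
        lo = first_occ(ch) + rank(ch, lo)
        hi = first_occ(ch) + rank(ch, hi)
        if lo >= hi:
            return 0
    return hi - lo

print(bwt("ACGTACGT"))                           # prints: TT$AACCGG
print(backward_search(bwt("ACGTACGT"), "ACGT"))  # prints: 2
```

Each character of the pattern, processed right to left, narrows the interval of rows that match a progressively longer suffix, which is why exact search takes time proportional to the pattern length rather than the genome length.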
Besides the specific indexing and string-matching approach implemented, differences in mapping solutions can be due to algorithmic strategies and heuristics specifically implemented to map reads:
having no perfect matches with the reference;
obtained with paired-end sequencing;
generated from exon–exon junctions.
Owing to the presence of sequencing errors in NGS data, mapping algorithms must allow imperfect alignments. By tolerating a certain number of mismatches, they are able to increase the percentage of mapped reads. The available tools implement very different mismatch policies but, in general, allow the user to specify customized parameter settings. However, the mismatch policy strongly impacts both mapping accuracy and computational performance, and defining its best configuration is not trivial [21, 22].
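The effect of a mismatch threshold can be illustrated with a brute-force sketch (illustrative only; real mappers use the indexes discussed above rather than scanning every position, and handle quality scores as well):

```python
def map_with_mismatches(read, reference, max_mismatches):
    """Exhaustive ungapped alignment allowing up to `max_mismatches`
    substitutions; returns all reference start positions within tolerance."""
    hits = []
    for start in range(len(reference) - len(read) + 1):
        window = reference[start:start + len(read)]
        mismatches = sum(a != b for a, b in zip(read, window))
        if mismatches <= max_mismatches:
            hits.append(start)
    return hits

reference = "ACGTACGTACGT"
# A read with one sequencing error (or SNP) in its last base:
print(map_with_mismatches("ACGA", reference, 0))  # prints: []
print(map_with_mismatches("ACGA", reference, 1))  # prints: [0, 4, 8]
```

The toy read is unmappable under an exact-match policy but maps once one mismatch is tolerated, which is precisely the trade-off (sensitivity versus specificity and speed) that the mismatch policy controls.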
Besides systematic errors, the sequenced organism can present true single-nucleotide polymorphisms (SNPs), which result in nucleotidic differences between the reads and the reference. The flexibility/stringency given by the mismatch policy adopted is thus important to correctly map these reads, as reads having one or more SNPs have a lower probability of being mapped .
In addition to SNPs, reads can contain small insertions or deletions (indels). Early NGS read-mapping tools avoided or limited gaps in the alignment because of the computational complexity of choosing an indel location, but more recent software versions accommodate gapped alignment. Algorithms that do not perform gapped alignment have lower mapping accuracy, as they often fail to align reads containing indels, resulting in a significant reduction of the number of correctly mapped reads in the regions surrounding indels.
Another difference comes from the ability to map paired-end reads. Unlike conventional single-end sequencing, which can read only one end of each DNA fragment, paired-end protocols make it possible to sequence both ends, generating two reads per fragment. The information about the expected distance between the reads sequenced from these two ends, estimated from the distribution of fragment lengths, can be exploited to increase mapping or assembly accuracy. Paired-end reads are particularly useful to resolve repeats, as they can cover long genomic regions (up to 20 kb), possibly extending into univocally determined sequences flanking the repeated ones. Moreover, they can be particularly useful for the identification of alternatively spliced isoforms and for the detection of fusion transcripts in cancer samples. However, if an RNA-seq study is more focused on the quantification of (differential) gene expression than on the reconstruction of the exact transcript sequences, particular attention must be paid to ensure that the strategy for paired-end read mapping is not too stringent, as it may result in a reduced number of mapped reads and possibly in biased expression estimates.
Unlike tools for genome-sequencing data mapping, algorithms developed for RNA-seq may have to handle ‘spliced reads’. Splicing is a post-transcriptional modification undergone by most RNAs transcribed in eukaryotic organisms. During splicing, non-coding regions (introns) are removed and coding sequences (exons) are concatenated together. Although the order of exons is always preserved, some exons can be removed along with introns, giving rise to different RNAs. This process, called ‘alternative splicing’, makes it possible to produce different protein isoforms starting from the same gene. Thus, RNAs in eukaryotes can give rise to spliced reads that span exon–exon junctions and that cannot be directly mapped onto the genome, where exons are separated by introns. To map these spliced reads back to the genome, algorithms for RNA-seq data analysis must handle spliced alignment (Figure 2A). Generally, simple gapped alignment is not sufficient to account for introns because they can span a wide range of lengths. To align spliced reads, many tools implement a two-step procedure: first, reads are mapped to the genome and used to identify putative exons; then, candidate exons are used to build all possible exon–exon junctions, which are considered for mapping the spliced reads that failed to map in the first step (e.g. [37, 38]).
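The junction-library step of this two-step procedure can be sketched as follows (coordinates are half-open [start, end) intervals on a toy genome string; function names and flank length are illustrative):

```python
from itertools import combinations

def junction_sequences(genome, exons, flank):
    """Build candidate exon-exon junction sequences from putative exons.

    Each junction concatenates the last `flank` bases of an upstream exon
    with the first `flank` bases of a downstream one; exon order along the
    genome is preserved, as in splicing."""
    junctions = {}
    for (s1, e1), (s2, e2) in combinations(sorted(exons), 2):
        if e1 <= s2:  # non-overlapping, downstream follows upstream
            seq = genome[max(s1, e1 - flank):e1] + genome[s2:min(e2, s2 + flank)]
            junctions[(e1, s2)] = seq
    return junctions

genome = "AAACCCGGGTTTACGTACGT"
exons = [(0, 6), (9, 12), (15, 20)]   # three putative exons
print(junction_sequences(genome, exons, flank=3))
# prints: {(6, 9): 'CCCTTT', (6, 15): 'CCCTAC', (12, 15): 'TTTTAC'}
```

Reads that failed genomic mapping in the first step are then aligned against these synthetic junction sequences, so that a read straddling an exon boundary finds a contiguous target.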
Although attempted in several works, the assessment and comparison of mapping algorithms, especially for RNA-seq reads, is not straightforward [17, 22, 39]. Ideally, the perfect algorithm would find, for each read, its true genomic source. However, the presence of sequencing errors, repeats and genetic variants greatly increases uncertainty in read mapping and even challenges the definition of ‘correct mapping’. Moreover, the different features of the input data and the possibility of greatly changing the parameter settings add further variability to the results. In this scenario, it is impossible to identify the best tool, and the top performers have to be selected with respect to the specific application and input data, depending on the biological question under consideration [16, 21]. For instance, aligners that are well suited for transcript quantification might not be precise enough to study SNPs or RNA editing events.
COUNTS: THE DIGITAL MEASURE OF GENE EXPRESSION
After mapping, the reads aligned to each coding unit, such as an exon, transcript or gene, are used to compute counts, so as to estimate its expression level. The most widely used approach for computing counts considers the total number of reads overlapping the exons of a gene. However, even in well-annotated organisms, a fraction of reads map outside the boundaries of known exons. Thus, an alternative strategy considers the whole length of a gene, also counting reads from introns. Moreover, if correctly handled in the mapping step, spliced reads can be used to model the abundance of different splicing isoforms of a gene [41, 42]. Particular attention should be paid to genes with overlapping sequences. The ‘Union-Intersection gene’ model considers, for each gene, the union of the exonic bases that do not overlap with the exons of other genes. Htseq-count instead implements a more flexible approach, which lets the user select the desired model for read counting in the presence of overlapping features. Unlike the methods described so far, the ‘maxcounts’ approach does not compute the sum of aligned reads, but estimates the expression level of each exon or single-isoform transcript as the maximum read coverage reached along its sequence. This approach can be readily used for RNA-seq studies on prokaryotes, where transcripts are not subjected to splicing, whereas further research is needed to define transcription models in eukaryotes for combining ‘maxcounts’, computed at the exon level, into a measure of gene or transcript expression. Although the counting strategy has the potential to significantly change expression estimates, limited research has been carried out to assess and compare the available approaches.
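The difference between summing overlapping reads and the ‘maxcounts’ idea of taking the maximum per-base coverage can be illustrated with a toy exon (hypothetical coordinates and function names; reads are given as start positions with a fixed length):

```python
def total_count(read_starts, read_len, exon_start, exon_end):
    """Classic counting: number of reads overlapping the exon."""
    return sum(1 for s in read_starts
               if s < exon_end and s + read_len > exon_start)

def maxcounts(read_starts, read_len, exon_start, exon_end):
    """'maxcounts'-style estimate: the maximum per-base read coverage
    reached along the exon, rather than the sum of overlapping reads."""
    coverage = [0] * (exon_end - exon_start)
    for s in read_starts:
        for pos in range(max(s, exon_start), min(s + read_len, exon_end)):
            coverage[pos - exon_start] += 1
    return max(coverage) if coverage else 0

# Three reads of length 4 tiled along a toy exon [100, 110)
reads = [100, 103, 106]
print(total_count(reads, 4, 100, 110), maxcounts(reads, 4, 100, 110))  # prints: 3 2
```

The sum grows with both expression and exon length, whereas the per-base maximum does not, which is the intuition behind the reduced length bias of ‘maxcounts’ discussed in the ‘Count bias and normalization’ section.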
As explained above, quantification of gene expression from RNA-seq data is typically implemented in the analysis pipeline through two computational steps: alignment of reads to a reference genome or transcriptome, and subsequent estimation of gene and isoform abundances based on the aligned reads. Unfortunately, the reads generated by the most widely used RNA-seq technologies are generally much shorter than the transcripts from which they are sampled. As a consequence, in the presence of transcripts with similar sequences, it is not always possible to uniquely assign short reads to a specific gene. In particular, the human genome contains duplicated and paralogous genes with high sequence similarity, and interspersed or tandem repeats that are likely to produce similar or identical short reads [38, 45, 46]. Thus, NGS data arising from repeated regions have to be handled properly in order not to bias the results [46–48]. RNA splicing makes transcriptome reconstruction even more challenging, as it generates alternatively spliced isoforms of the same gene that share a large part of their sequence, so that reads can hardly be assigned to one specific isoform. As a consequence, a non-negligible fraction of RNA-seq reads are ‘multireads’: reads that map with comparable fidelity to multiple positions of the reference. The fraction of multireads over total mapped reads depends on transcriptome complexity and read length, varying from 10 to >50% [38, 45]. When considering reads mapping to multiple isoforms of the same gene, this percentage exceeds 70%.
One of the first strategies proposed for handling gene multireads was simply to discard them, estimating gene expression from uniquely mapping reads only [1, 49]. Owing to the likelihood of assigning multireads to the wrong genomic location, thereby introducing further bias in the results, multiread filtering is a commonly used approach in the analysis of RNA-seq and NGS data in general. However, in RNA-seq studies, where the aim is both the reconstruction of transcript sequences and the quantification of their relative abundances, discarding multireads causes information loss and a systematic underestimation of expression levels in repetitive regions. An alternative strategy reduces data loss by allocating multireads according to the coverage given by uniquely mapping reads, obtaining expression estimates that are in better agreement with microarrays. Ji et al. propose a more sophisticated approach that also takes into account the mismatch profiles between the unique reads and the sequences of the genomic locations they are aligned to. Their method, called BM-Map, calculates the posterior probability of mapping each multiread to a genomic location considering three sources of information: the sequencing error profile, the likelihood of true polymorphisms and the expression level of competing genomic locations. Conversely, the ‘proportional’ method described above only considers the latter. The mismatch profile is also taken into consideration by MMSEQ, which estimates both isoform expression and allelic imbalance, namely expression differences between different alleles of the same gene or isoform. A two-step alignment procedure is used to reduce the uncertainty in read mapping. First, mismatch profiles are used to build a sample-specific transcriptome whose genotype can differ from that of the reference sequence.
Then, once the reference transcriptome is updated to reflect the sample genotype, reads are realigned to estimate isoform expression and allelic imbalance. More recent methods, such as RSEM, define a probabilistic model of RNA-seq data and calculate maximum-likelihood estimates of isoform expression levels using the Expectation-Maximization (EM) algorithm [38, 51]. True mappings are identified by leveraging the information provided by the distribution of sequencing errors, fragment lengths and read coverage across transcripts, modeled as random variables and estimated from the data.
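A deliberately simplified EM sketch (uniform read-generation probabilities, no fragment-length or error model, unlike RSEM proper) shows how ambiguous reads are fractionally reassigned until the abundance estimates stabilize:

```python
def em_abundances(compatibility, n_iter=100):
    """Tiny EM in the spirit of RSEM-like quantifiers (much simplified).

    compatibility[r] = list of transcripts read r maps to.
    Returns the estimated fraction of reads originating from each transcript."""
    transcripts = sorted({t for ts in compatibility for t in ts})
    theta = {t: 1.0 / len(transcripts) for t in transcripts}
    for _ in range(n_iter):
        counts = {t: 0.0 for t in transcripts}
        for ts in compatibility:               # E-step: split each read among
            z = sum(theta[t] for t in ts)      # compatible transcripts in
            for t in ts:                       # proportion to current estimates
                counts[t] += theta[t] / z
        n = len(compatibility)
        theta = {t: counts[t] / n for t in transcripts}  # M-step: re-estimate
    return theta

# 4 reads unique to tx1, 1 unique to tx2, 2 multireads shared by both
reads = [["tx1"], ["tx1"], ["tx1"], ["tx1", "tx2"], ["tx1", "tx2"], ["tx2"]]
print(em_abundances(reads))
```

On this toy input the estimates converge to 0.75 for tx1 and 0.25 for tx2: the two shared reads end up split 3:1, matching the ratio supported by the unique reads.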
COUNT BIAS AND NORMALIZATION
After the first optimistic expectations of a relative ease of analysis of RNA-seq data, many works have highlighted the need for careful normalization of count data before assessing differential gene expression [9, 52–56], in order to correct for different sources of bias.
The first bias to be taken into account is the ‘sequencing depth’ of a sample, defined as the total number of sequenced or mapped reads. Let A and B be two RNA-seq experiments with no differentially expressed genes. If experiment A generates twice as many reads as experiment B, it is likely that the counts from experiment A will be doubled too. Hence, a common practice is to scale counts in each experiment by the sequencing depth estimated for that sample. In early works, the sequencing depth was computed by counting the total number of reads sequenced or mapped in the sample (global scaling) [8, 49]. More recent approaches consider counts as depending on the whole RNA population of the sequenced sample [57–59]. For instance, if there is a set of highly expressed genes in a sample, it will inevitably ‘consume’ a large share of the available reads, so that the expression level of the remaining genes will be underestimated. A similar issue may result from the presence of contaminants. When a restricted set of highly expressed genes accounts for the largest part of total counts, as happens in most RNA-seq assays, global scaling techniques only capture and correct for differences related to these high-count genes (Figure 2B) [9, 59]. Bullard et al. propose a quantile normalization similar to that used for microarray preprocessing, and an alternative global scaling that adjusts count distributions with respect to their third quartile, so as to reduce the effect of high-count genes. More generally, slightly different normalizations can be defined by selecting different count quantiles. Robinson and Oshlack propose the ‘Trimmed Mean of M-values’ (TMM) normalization to account for differences in library composition between samples. To reduce bias due to high-count genes, TMM is computed after removing the 30% of genes characterized by the most extreme ‘M-values’ (i.e. log-fold-changes) between the compared samples. This normalization factor is then used to correct for differences in library sizes.
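To make the contrast concrete, here is a toy comparison (hypothetical counts and function names) between total-count sequencing depth and an upper-quartile scaling factor in the style of Bullard et al.:

```python
import statistics

def scale_by_depth(counts):
    """Global scaling: divide each gene count by the library's total reads."""
    depth = sum(counts)
    return [c / depth for c in counts]

def upper_quartile_factor(counts):
    """Scaling factor computed as the third quartile of non-zero gene counts,
    which is far less sensitive to a handful of extremely expressed genes."""
    nonzero = sorted(c for c in counts if c > 0)
    return statistics.quantiles(nonzero, n=4)[2]

# One extremely expressed gene dominates the total depth but not the quartile
sample = [2, 4, 6, 8, 10, 12, 14, 16, 18, 5000]
print(sum(sample), upper_quartile_factor(sample))  # prints: 5090 16.5
```

Dividing by the total (5090) would be driven almost entirely by the single high-count gene, whereas the third quartile (16.5) reflects the bulk of the count distribution.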
Li et al. propose a novel normalization method that assumes a Poisson model of counts and estimates the sequencing depth on a set of genes that are not differentially expressed. A Poisson goodness-of-fit statistic is used to determine which genes belong to this restricted set. In the R package ‘DESeq’, the ratios between gene-wise counts in each sample and the geometric mean of gene-wise counts across all samples are calculated, and the library size is computed as the median of these ratios across genes. Different studies (e.g. [44, 61]) indicate TMM and ‘DESeq’ as the most effective approaches for library-size normalization. However, if the common assumption that the compared samples contain similar amounts of RNA does not hold, count normalization methods are ineffective, and calibration techniques leveraging spike-in RNA measurements can be used.
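The median-of-ratios computation can be sketched as follows (a minimal re-implementation for illustration, not the ‘DESeq’ code itself; genes with a zero count in any sample are skipped, since their geometric mean is zero):

```python
import math
import statistics

def size_factors(count_matrix):
    """DESeq-style median-of-ratios size factors.

    count_matrix[sample][gene] holds raw counts; each sample's factor is the
    median ratio of its counts to the per-gene geometric means."""
    n_genes = len(count_matrix[0])
    geo_means = []
    for g in range(n_genes):
        col = [sample[g] for sample in count_matrix]
        geo_means.append(math.exp(sum(math.log(c) for c in col) / len(col))
                         if all(c > 0 for c in col) else 0.0)
    factors = []
    for sample in count_matrix:
        ratios = [c / gm for c, gm in zip(sample, geo_means) if gm > 0]
        factors.append(statistics.median(ratios))
    return factors

counts = [[100, 200, 400],    # sample 1
          [200, 400, 800]]    # sample 2, sequenced twice as deeply
print(size_factors(counts))
```

For this toy matrix the factors come out as 1/sqrt(2) and sqrt(2): dividing each sample by its factor equalizes the two libraries, and the median makes the estimate robust to a few genuinely differentially expressed genes.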
RNA-seq counts also show a gene length bias: the expected number of reads mapped to a gene is proportional to both the abundance and length of the isoforms transcribed from that gene. Indeed, longer genes produce more reads than shorter ones (Figure 2C), resulting in higher power for DE detection [9, 16, 64]. To reduce this bias, Mortazavi et al. propose to summarize mapped reads as ‘Reads Per Kilobase of exon model per Million mapped reads’ (RPKM), computed by dividing the number of reads aligned to gene exons by the total number of mapped reads and by the sum of exonic bases. An analogous measure is given by ‘Fragments Per Kilobase of exon per Million fragments mapped’ (FPKM), which also accounts for paired-end data and estimates transcript abundances in terms of the expected number of fragments from which single-end and paired-end reads arise in an RNA-seq experiment. RPKM and FPKM are defined so as to reduce both differences in library size and length bias. Other methods estimate and correct the dependence of counts on gene length and other sequence-specific covariates, such as GC-content and dinucleotide composition, using quantile regression [52, 65] and generalized linear models. In DE analysis, as methods that correct counts for length bias can introduce additional biases [9, 44, 56, 61], normalizations that apply to DE test statistics while leaving gene counts unchanged have been proposed [9, 64]. Differently from all the methods described above, ‘maxcounts’, which does not sum the reads along exons or transcripts but selects the best-represented regions in terms of coverage, strongly reduces length bias before normalization.
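A minimal RPKM computation, using toy figures, shows how the length and depth corrections combine:

```python
def rpkm(gene_counts, gene_lengths, total_mapped):
    """Reads Per Kilobase of exon model per Million mapped reads:
    raw counts divided by exonic length (in kb) and library size (in millions)."""
    return {g: gene_counts[g] / (gene_lengths[g] / 1e3) / (total_mapped / 1e6)
            for g in gene_counts}

# The 'long' gene yields twice the raw counts only because it is twice as long
counts = {"short": 500, "long": 1000}
lengths = {"short": 1000, "long": 2000}   # exonic bases
print(rpkm(counts, lengths, total_mapped=10_000_000))
# prints: {'short': 50.0, 'long': 50.0}
```

Both genes are equally expressed per kilobase, and after the correction their RPKM values coincide even though the raw counts differ two-fold.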
Gene-specific covariates do not suffice to explain count variability, which has been shown to vary greatly along gene sequences. The uneven distribution of reads along gene and transcript sequences is due to mapping errors and, primarily, to experimental biases. For instance, fragmentation methods based on restriction enzymes have sequence-specific efficiency. Moreover, reverse-transcription can either over- or under-represent the 3′ end of transcripts if performed with poly-dT oligomers or random hexamers, respectively [1, 3, 19]. More generally, RNAs and cDNAs can form secondary structures that depend on their primary sequences and that can either hamper or facilitate the binding of reverse-transcription primers and sequencing adapters. Since the first RNA-seq experiments, several changes in library preparation and sequencing protocols have been introduced to reduce bias (e.g. postponing reverse transcription until after fragmentation), but non-uniformity of read coverage remains an issue of state-of-the-art sequencing technologies. Li et al. call this sequence-specific bias ‘sequencing preference’: different regions of the same transcript can generate different amounts of reads depending on their local nucleotidic sequence, which determines their ‘sequenceability’. They model read counts as Poisson variables with variable rates along transcripts and perform an iterative Poisson linear regression to fit the data. They also use multiple additive regression trees (MART) to capture non-linear relationships between counts and local sequences. The models are fitted using the top 100 genes with the highest expression levels and used to predict the sequencing preference of the remaining genes. This approach explains up to ∼50% of the data variance due to coverage non-uniformity and predicts sequencing preferences that can be used in quantitative analysis of RNA-seq data to improve gene expression estimates.
In recent years, fervent research activity has characterized the RNA-seq field, and many different tools for DE detection have been developed [67–69]. At their simplest, methods for DE detection rely on a test statistic used to identify which genes are characterized by a statistically significant change in expression between the compared conditions. In principle, non-parametric methods can be used (e.g. [70, 71]). However, because of the small number of replicates typically available in RNA-seq experiments, non-parametric methods usually do not offer enough detection power, and parametric methods are preferred [67, 72]. Each parametric method assumes a specific model to describe the underlying distribution of count data, and seeks to identify the genes whose differences between the tested conditions exceed the variability predicted by the model. The models considered and implemented in most of the analysis tools are based on the Poisson and Negative Binomial (NB) distributions. In the following, we present a statistical description of the parameterization of RNA-seq count data and a more general summary of state-of-the-art approaches for DE analysis in RNA-seq studies. However, because of the high number of tools available, here we specifically focus on a few interesting and well-characterized data-modeling approaches implemented in recently developed methods.
Models of RNA-seq count data
Let T_s be the set of transcripts in the sample of interest s. For each transcript t in sample s, let l_t be its length and μ_t its expression level. All the positions within t that can give rise to a read, i.e. all possible read starts, are given by l_t·μ_t. Therefore, the probability p_t that a read comes from transcript t in sample s can be computed as

p_t = (l_t·μ_t) / Σ_{u ∈ T_s} (l_u·μ_u)

Read counts can then be modeled as random variables whose variability across replicates has two components:
‘Technical variation’, due to the measurement error introduced by the adopted technology.
‘Biological variation’, representing the variability among samples belonging to the same treatment group or condition.
As a consequence, for biological replicates the variance is larger than the mean, and count data are said to be ‘over-dispersed’ [74–76]. In this case, the Poisson distribution cannot handle this additional variability, and models based on the NB distribution are preferred [62, 74–76]. If the Poisson rate is modeled with a Gamma distribution, the marginal probability distribution of counts is Negative Binomial, with mean and variance that depend on the chosen parametrization of the Gamma distribution.
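The over-dispersion induced by the Gamma-Poisson mixture can be checked with a small simulation (toy samplers and illustrative parameter values; under this parametrization the marginal variance is mean + dispersion·mean²):

```python
import math
import random
import statistics

random.seed(42)

def poisson(lam):
    """Simple Poisson sampler (Knuth's method; fine for moderate rates)."""
    threshold, k, p = math.exp(-lam), 0, 1.0
    while True:
        p *= random.random()
        if p <= threshold:
            return k
        k += 1

def nb_counts(mean, dispersion, n):
    """Negative Binomial counts as a Gamma-Poisson mixture: each replicate's
    Poisson rate is Gamma-distributed around `mean`, so the marginal
    variance is mean + dispersion * mean**2 rather than the Poisson mean."""
    shape = 1.0 / dispersion
    rates = [random.gammavariate(shape, mean / shape) for _ in range(n)]
    return [poisson(r) for r in rates]

counts = nb_counts(mean=100, dispersion=0.2, n=5000)
print(statistics.mean(counts), statistics.variance(counts))
```

With mean 100 and dispersion 0.2 the theoretical variance is 2100, more than 20 times the mean; a pure Poisson model, whose variance equals its mean, would severely understate the biological variability of such counts.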
Tools for DE analysis of RNA-seq data
Given a specific statistical model of RNA-seq count data, all parametric tools for DE analysis consist of two main steps: estimation of model parameters from the data and detection of DE genes with a test statistic. Library normalization can also be considered part of DE analysis, as it is implemented within all DE tools, albeit with different approaches. So far, several studies have focused on the comparison of DE methods [67–69, 72, 76–79], but a consensus on method performance is hindered by the lack of gold-standard measures and by frequent tool updates, with several versions released each year [67, 72]. However, some findings are widely confirmed across different studies, such as the superior performance of NB-based methods over their Poisson-based counterparts [68, 74, 76–79].
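As a concrete example of library normalization, here is a minimal counts-per-million sketch; this is one common, simple scheme, whereas tools such as DESeq and edgeR implement more robust approaches (median-of-ratios and TMM, respectively):

```python
# Counts-per-million (CPM): divide each sample's counts by its library size
# (total mapped reads) so that samples sequenced at different depths
# become comparable.
import numpy as np

def cpm(counts):
    """counts: genes x samples matrix; returns counts per million reads."""
    lib_sizes = counts.sum(axis=0)        # one library size per sample
    return counts / lib_sizes * 1e6

# Same underlying expression, but sample 2 was sequenced twice as deeply:
counts = np.array([[100, 200],
                   [300, 600]], dtype=float)
print(cpm(counts))                        # identical columns after scaling
```

Without this correction, every gene in the deeper-sequenced sample would appear over-expressed purely because of the difference in sequencing depth.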
The higher performance of NB-based tools is mainly due to their ability to capture biological variability. As discussed above, this variability arises from the stochastic nature of gene expression (i.e. some genes have more variable levels of expression than others), and is thus gene-specific and independent of the adopted technology. Owing to the small sample size that generally characterizes RNA-seq data sets, reliable gene-wise estimation of the dispersion cannot be performed, and different strategies are used to fit the data. edgeR and DESeq, which are among the best performers in most of the comparative studies cited above, are both based on the NB model of Equation (4), but implement different strategies for dispersion estimation. The default strategy implemented in edgeR shrinks gene-wise dispersion estimates toward a common value. Alternatively, edgeR can compute a ‘trend’ estimate across genes in place of a single value. DESeq considers the variance to be a smooth function of the mean and uses non-parametric regression to fit the variance as a function of the mean. Another approach, implemented in NBPseq, considers the model with three parameters described by Equation (5); the two dispersion-related parameters are considered constant across genes and estimated jointly. Nevertheless, this approach does not outperform DESeq and edgeR.
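The idea of shrinking gene-wise dispersions toward a common value can be sketched as follows. This is a simplified moment-based illustration under the variance model `var = mu + phi * mu**2`, not edgeR's actual weighted-likelihood procedure, and all names are hypothetical:

```python
# Sketch of dispersion shrinkage: noisy gene-wise dispersion estimates are
# pulled toward a common value shared across genes.
import numpy as np

def shrunk_dispersions(counts, weight=0.7):
    """counts: genes x replicates matrix of normalized counts.
    Moment estimator of the gene-wise dispersion phi_g from
    var = mu + phi*mu^2, shrunk toward the common (average) value."""
    mu = counts.mean(axis=1)
    var = counts.var(axis=1, ddof=1)
    phi_gene = np.maximum((var - mu) / mu**2, 0.0)   # clip negatives at zero
    phi_common = phi_gene.mean()
    # Convex combination pulls unstable gene-wise estimates toward the
    # common value; with few replicates a large weight is typically safer.
    return weight * phi_common + (1 - weight) * phi_gene

# Toy data: 200 genes, 4 replicates, over-dispersed counts.
rng = np.random.default_rng(1)
counts = rng.poisson(rng.gamma(5.0, 20.0, size=(200, 4)))
phi = shrunk_dispersions(counts)
```

The `weight` parameter plays the role of the strength of shrinkage: at 1 all genes share one dispersion, at 0 each gene keeps its own noisy estimate.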
More recently, Law et al. proposed to apply ‘limma’, a method developed for microarrays and based on the normal distribution, to the analysis of RNA-seq data. The underlying idea is that correctly modeling the data mean–variance relationship is more important than exactly specifying the probabilistic count distribution. In their approach, called ‘limma voom’, the mean–variance relationship is estimated from the data through a lowess fit and used to estimate gene-wise variances. For each gene, the inverse of the variance is then used as a weight in the ‘limma’ framework. Applied to RNA-seq data, ‘limma voom’ performs comparably to top-ranking NB-based approaches [67, 78]. Even though further assessments are needed to finally select the best approach for DE analysis of RNA-seq data, the promising results obtained with this strategy may enable the exploitation of a wide panel of methods originally developed for microarrays.
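A stripped-down illustration of the voom idea: estimate the mean–variance trend across genes on log-transformed counts and convert the fitted variances into precision weights. A crude binned trend is used here as a stand-in for the lowess fit, and all function names are hypothetical:

```python
# Sketch of voom-style precision weights: fit a mean-variance trend on
# log-counts, then weight each gene by the inverse of its fitted variance.
import numpy as np

def voom_like_weights(log_cpm, n_bins=20):
    """log_cpm: genes x samples matrix of log-transformed counts.
    Returns per-gene precision weights 1/var from a binned trend
    (a crude stand-in for the lowess fit used by the actual method)."""
    mean = log_cpm.mean(axis=1)
    sd = log_cpm.std(axis=1, ddof=1)
    # Bin genes by mean expression and take the median sd per bin.
    edges = np.quantile(mean, np.linspace(0, 1, n_bins + 1))
    centers, trend = [], []
    for lo, hi in zip(edges[:-1], edges[1:]):
        in_bin = (mean >= lo) & (mean <= hi)
        if in_bin.any():
            centers.append(mean[in_bin].mean())
            trend.append(np.median(sd[in_bin]))
    fitted_sd = np.interp(mean, centers, trend)   # trend evaluated per gene
    return 1.0 / fitted_sd**2                     # precision weights

# Toy data: 500 genes, 6 samples, already log-transformed.
rng = np.random.default_rng(2)
log_cpm = rng.normal(loc=5.0, scale=1.0, size=(500, 6))
w = voom_like_weights(log_cpm)
```

Genes falling in high-variance regions of the trend receive small weights and therefore contribute less to the downstream linear-model fit, which is the key mechanism behind the approach.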
RNA-seq has rapidly become the method of choice for the study of differential gene expression, as it enables the investigation and comparison of gene expression levels at unprecedented resolution. However, turning huge and complex RNA-seq data sets into biologically meaningful findings is not trivial. The interpretation of RNA-seq data requires the definition of a computational pipeline that comprises several steps: read mapping, count computation, normalization and testing for differential gene expression. Here, we reviewed some of the most widely used methodologies and models implementing these processing steps and discussed the main challenges of data analysis. We believe this review can guide users in defining an accurate analysis pipeline.
RNA sequencing is evolving at a fast pace, and emerging ‘Third-Generation’ technologies now enable single-molecule sequencing; computational tools are chasing this development to accommodate changes in data features, and the assessment and comparison of state-of-the-art methods must be performed constantly to maintain an up-to-date RNA-seq computational pipeline. However, some issues raised in this work, such as the impact of read-mapping heuristics and count normalization, are of broad interest and should be carefully taken into consideration in all RNA-seq data analyses, independently of data features.
RNA-seq is a novel methodology based on NGS that enables the investigation of differential gene expression at high resolution. However, data interpretation is not straightforward and requires several analysis steps: read mapping, count computation, count normalization and DE testing.
Tools for read mapping provide different solutions depending on the specific algorithm and heuristics implemented. Particular care must be taken to handle reads mapping to multiple genomic locations, so that gene expression levels are correctly estimated even in the presence of highly similar sequences.
In RNA-seq studies, gene expression levels are measured by counts, i.e. by the number of reads mapped on each gene.
Counts often depend on gene- and sample-specific covariates, such as gene length and library size, respectively. Between-sample differences in library size must be corrected before comparing samples to detect differentially expressed genes. Conversely, correction for gene-specific covariates is not mandatory and must be performed carefully to avoid information loss.
Differential expression can be tested with parametric methods based on the Poisson or NB distribution. NB-based models are preferred, as they capture both technical and biological variability.
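As an illustration of such a parametric test, here is a minimal Negative Binomial likelihood-ratio test for a single gene, with the dispersion fixed for simplicity; real DE tools estimate the dispersion from the data, and the function names here are hypothetical:

```python
# Sketch of an NB likelihood-ratio test for one gene: compare a null model
# with a single common mean against a model with group-specific means.
import numpy as np
from scipy import stats

def nb_loglik(counts, mu, phi):
    """NB log-likelihood with mean mu and dispersion phi (var = mu+phi*mu^2).
    scipy parameterizes NB by (n, p): n = 1/phi, p = n / (n + mu)."""
    n = 1.0 / phi
    p = n / (n + mu)
    return stats.nbinom.logpmf(counts, n, p).sum()

def nb_lrt(group_a, group_b, phi=0.1):
    """Likelihood-ratio test p-value: common mean vs. group-specific means."""
    a, b = np.asarray(group_a), np.asarray(group_b)
    pooled = np.concatenate([a, b])
    ll_null = nb_loglik(pooled, pooled.mean(), phi)
    ll_alt = nb_loglik(a, a.mean(), phi) + nb_loglik(b, b.mean(), phi)
    lr = 2 * (ll_alt - ll_null)
    return stats.chi2.sf(lr, df=1)   # one extra parameter under the alternative

# Counts for one gene in two conditions (three replicates each):
p = nb_lrt([50, 60, 55], [150, 140, 160])
print(p)   # small p-value: the expression difference exceeds NB variability
```

Because the dispersion inflates the variance beyond the Poisson value, moderate count differences that a Poisson test would call significant may correctly be deemed compatible with biological noise under the NB model.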
This work was supported by Fondazione CARIPARO ["RNA sequencing for quantitative transcriptomics" PhD program]; and PRAT 2010 [CPDA101217].