Summary: Emerging digital gene expression (DGE) technologies are expected to overtake microarray technologies in the near future for many functional genomics applications. One of the fundamental data analysis tasks, especially for gene expression studies, is determining whether there is evidence that counts for a transcript or exon differ significantly across experimental conditions.
Availability: The package is freely available under the LGPL licence from the Bioconductor web site (http://bioconductor.org).
Modern molecular biology data present major challenges for the statistical methods that are used to detect differential expression, such as the need for multiple testing procedures and, increasingly, empirical Bayes or similar methods that share information across all observations to improve inference. For microarrays, the abundance of a particular transcript is measured as a fluorescence intensity, effectively a continuous response, whereas for DGE data the abundance is observed as a count. Therefore, procedures that are successful for microarray data are not directly applicable to DGE data.
This note describes the software package
Bioinformatics researchers have learned many things from the analysis of microarray data. For instance, power to detect differential expression can be improved and false discoveries reduced by sharing information across all probes. One such procedure is
We assume the data can be summarized into a table of counts, with rows corresponding to genes (or tags, exons or transcripts) and columns corresponding to samples. For RNA-seq experiments, these may be counts at the exon, transcript or gene level. We model the data as negative binomial (NB) distributed,
The NB distribution reduces to Poisson when ϕg=0. In some DGE applications, technical variation can be treated as Poisson. In general, when the NB variance is a quadratic function of the mean, ϕg represents the squared coefficient of variation of biological variation between the samples, so its square root is the coefficient of variation itself. In this way, our model is able to separate biological from technical variation.
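The separation of technical from biological variation can be illustrated with a short numerical sketch. It is not code from the package. It assumes the common NB parameterization in which a count with mean mu and dispersion phi has variance mu + phi * mu^2; that formula, the function names and the example values are illustrative assumptions and are not taken from the text above.

```python
import math


def nb_variance(mu, phi):
    """Variance of an NB count, assuming variance = mu + phi * mu**2.

    With phi = 0 this equals mu, which is the Poisson variance.
    """
    if mu < 0 or phi < 0:
        raise ValueError("mu and phi must be non-negative")
    return mu + phi * mu ** 2


def squared_cv_components(mu, phi):
    """Split the squared coefficient of variation into two parts.

    Returns (technical, biological): the Poisson part 1/mu, which shrinks
    as the expected count grows, and the dispersion phi, which does not.
    """
    if mu <= 0:
        raise ValueError("mu must be positive")
    technical = 1.0 / mu
    biological = phi
    return technical, biological


# Hypothetical example values: expected count 100, dispersion 0.04.
mu, phi = 100.0, 0.04
variance = nb_variance(mu, phi)
technical, biological = squared_cv_components(mu, phi)
total_cv = math.sqrt(variance) / mu

print(f"variance = {variance:.1f}")
print(f"squared CV: technical = {technical:.3f}, biological = {biological:.3f}")
print(f"total CV = {total_cv:.3f}, biological CV = {math.sqrt(biological):.3f}")
```

Under this assumption the two squared-CV components add up to the total squared CV of the count, so for large expected counts the dispersion term dominates while the Poisson term becomes negligible.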
The required inputs for
For users of
A number of features have been added to the
Many of the early RNA-seq datasets involve sequence reads from technical replicates (e.g. same source of RNA) as opposed to biological replicates (e.g. RNA from different individuals). Technical replicates will generally have lower variability than biological replicates and in our experience, the dispersion parameter (and the moderation procedure in
We have developed a Bioconductor package
Funding: National Health and Medical Research Council Program (Grant 406657 to G.K.S.); NHMRC, Independent Research Institutes Infrastructure Support Scheme (Grant 361646); Victorian State Government OIS grant (awarded to the WEHI); a Melbourne International Research Scholarship (to M.D.R.); Belz, Harris and IBS Honours scholarships (to D.J.M.).
Conflict of Interest: none declared.