Jean-Michel Claverie, Thi Ngan Ta, ACDtool: a web-server for the generic analysis of large data sets of counts, Bioinformatics, Volume 35, Issue 1, January 2019, Pages 170–171, https://doi.org/10.1093/bioinformatics/bty640
Abstract
More than 20 years ago, our laboratory published an original statistical test [referred to as the Audic-Claverie (AC) test in the literature] to identify differentially expressed genes from the pairwise comparison of counts of ‘expressed sequence tags’ determined in different conditions. Despite its age and the publication of more sophisticated packages, this original article has continued to gather more than 200 citations per year, indicating the persistent usefulness of the simple AC test for the community. This prompted us to propose a fully revamped version of the AC test with a user interface adapted to the diverse and much larger datasets produced by contemporary omics techniques.
ACDtool is a freely accessible web-service offering three types of analyses: (i) the pairwise comparison of individual counts, (ii) pairwise comparisons of arbitrarily large lists of counts and (iii) the all-at-once pairwise comparisons of multiple datasets. Statistical computations are implemented using standard R functions and can accommodate all practical ranges of counts generated by modern omics experiments. ACDtool is well suited for large datasets without replicates.
Supplementary data are available at Bioinformatics online.
1 Introduction
Sequence-based approaches started to supersede microarray hybridization-based platforms for the measurement of gene expression following the introduction of the concept of ‘expressed sequence tags’ (Adams et al., 1993). This trend was amplified by the ‘Serial analysis of gene expression’ approach (Velculescu et al., 1995), which provided an increased output at a lower cost. At this point, the nature of the raw gene expression data changed from fluorescence intensities to numbers (i.e. counts) of gene-specific tags. New bioinformatic methods had to be introduced to interpret these new expression profiles. Our laboratory was among the first to propose a statistical framework to identify the genes most likely to be differentially expressed and to study the influence of sampling size on the reliability of these inferences (Audic and Claverie, 1997). As sequence-tag approaches became increasingly popular (becoming known as ‘RNA-seq’ with the advent of next-generation sequencing), more specific bioinformatic packages were developed (reviewed in Huang et al., 2015). Among the most cited are limma (Ritchie et al., 2015), DESeq (Love et al., 2014) and edgeR (Anders et al., 2013). More recently, new packages specifically handling single-cell RNA sequencing data have been proposed (Finak et al., 2015; Kharchenko et al., 2014; Li and Li, 2018; Pflug and von Haeseler, 2018). All the above tools are R/Bioconductor packages whose use requires in-house bioinformatics expertise. Only a few tools are offered as web-services (e.g. Zhu et al., 2017).
Surprisingly, our initial paper (Audic and Claverie, 1997) has continued to be cited over the years, with a large increase since 2012. The persistent usage of this statistical test [referred to as the ‘Audic-Claverie (AC) test’, e.g. Bortoluzzi et al., 2005; Metta et al., 2006; Tino, 2009; Wong et al., 2013] prompted us to revisit its mathematical formulation and adapt it to the larger datasets and count values generated today. We implemented the modernized R-library-based version of the test as a web-service targeted at biologist end users and allowing bulk analyses of multiple datasets. ACDtool can process the very large (albeit often very sparse) count datasets generated by various omics techniques (RNA-seq, metagenomics, barcoding, population genetics, etc.). Given the general mathematical principles on which the AC test is based, ACDtool is not intended to compete with the specialized packages targeted at each of the above techniques. However, ACDtool remains useful to picture the global trends in a given dataset (especially in the absence of replicates) and to decide whether it will benefit from the much larger investment required by specialized bioinformatic approaches.
2 Materials and methods
Under the null hypothesis that the tag counts are generated from Poisson distributions with equal means (or with means proportional to the respective sample sizes), Equation (2) can be used for statistical testing (Tino, 2009). A P-value is computed from the cumulative form of Equation (2) [e.g. summing all the terms from 0 to y if y/N2 < x/N1]. By rewriting Equation (2) as a negative binomial distribution [Supplementary Equation (3)], ACDtool implements a numerical scheme that allows the fast and robust processing of the large range of counts and sparse datasets encountered in modern omics approaches (see Supplementary Material).
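As an illustration of the cumulative computation described above, the following Python sketch assumes that Equation (2) is the usual form of the AC conditional probability, p(y|x) = (N2/N1)^y (x+y)! / [x! y! (1 + N2/N1)^(x+y+1)], where x and y are the two counts and N1 and N2 the two sample sizes. With q = N2/(N1+N2), this expression is a negative binomial probability with x+1 successes and success probability 1-q. ACDtool itself relies on R functions and on the scheme detailed in the Supplementary Material; the function names below (ac_log_pmf, ac_pvalue), the one-sided tail convention and the stopping tolerance are assumptions made for this example only.

```python
import math


def ac_log_pmf(y, x, n1, n2):
    """Log of p(y | x), written as a negative binomial (assumed form of Equation 2)."""
    q = n2 / (n1 + n2)  # equals r / (1 + r) with r = n2 / n1
    return (math.lgamma(x + y + 1) - math.lgamma(x + 1) - math.lgamma(y + 1)
            + y * math.log(q) + (x + 1) * math.log1p(-q))


def ac_pvalue(x, y, n1, n2):
    """One-sided tail probability of y given x (hypothetical helper, assumed convention)."""
    q = n2 / (n1 + n2)
    if y / n2 < x / n1:
        # y is lower than expected: sum the terms from 0 to y.
        lower = sum(math.exp(ac_log_pmf(k, x, n1, n2)) for k in range(y + 1))
        return min(1.0, lower)
    # Otherwise sum the upper tail directly; the terms decrease from k = y onward,
    # so the loop stops once a term no longer changes the running total.
    term = math.exp(ac_log_pmf(y, x, n1, n2))
    total, k = 0.0, y
    while term > 1e-16 * total:
        total += term
        term *= q * (x + k + 1) / (k + 1)  # ratio p(k+1 | x) / p(k | x)
        k += 1
    return min(1.0, total)


# Example call: 100 tags out of 1000 in one sample versus 10 out of 1000 in the other.
p = ac_pvalue(100, 10, 1000, 1000)
```

Working with logarithms of factorials (math.lgamma) avoids overflow for large counts, and summing the upper tail term by term avoids the loss of precision that 1 minus a cumulative sum would cause for very small P-values.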
3 Results
3.1 Tool 1: comparing a pair of counts
Tool 1 requests a pair of counts of a given event and the sizes of the two samples. Each count must be small enough [in proportion to the total count (e.g. <5%)] to justify our assumption of a Poisson distribution. Tool 1 returns the probability that the compared samples contain the same proportion of that event. Tool 1 is also helpful to determine the suitable combination of counts and sample sizes required to diagnose differences reaching a given threshold of statistical significance.
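The use of Tool 1 to explore which counts reach a given significance threshold can be sketched as follows. This is a minimal illustration, not the ACDtool implementation: it assumes the negative binomial form of the AC probability with q = N2/(N1+N2), a one-sided upper tail, and hypothetical helper names (is_rare, upper_tail, min_significant_count). Because the tail is obtained as 1 minus a cumulative sum, the sketch is only adequate for thresholds well above double-precision rounding.

```python
import math


def is_rare(count, sample_size, max_fraction=0.05):
    """Check the rule of thumb that a count is a small fraction of its sample."""
    return count < max_fraction * sample_size


def upper_tail(y, x, n1, n2):
    """P(Y >= y | x) under the assumed AC null, as 1 minus the sum of terms below y."""
    q = n2 / (n1 + n2)
    log_q, log_1mq = math.log(q), math.log1p(-q)
    lower = 0.0
    for k in range(y):
        lower += math.exp(math.lgamma(x + k + 1) - math.lgamma(x + 1)
                          - math.lgamma(k + 1) + k * log_q + (x + 1) * log_1mq)
    return max(0.0, 1.0 - lower)


def min_significant_count(x, n1, n2, alpha=0.01):
    """Smallest count y in sample 2 whose upper-tail probability is at most alpha."""
    y = 0
    while upper_tail(y, x, n1, n2) > alpha:
        y += 1
    return y


# Example: with 10 events in a sample of 1000, how many events in a second
# sample of 1000 are needed to reach the 1% threshold?
needed = min_significant_count(10, 1000, 1000, alpha=0.01)
```

Repeating the search for different sample sizes gives a rough idea of the sequencing depth needed before a given difference becomes detectable.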
3.2 Tool 2: comparing lists of paired counts
Tool 2 compares two lists of counts associated with the same set of events drawn from two samples and determines which events exhibit the most significant differences. An optional normalization procedure is available for overdispersed data. Tool 2 expects a tab-delimited input file such as that produced by Excel (‘save as’ tab-delimited text, .txt). The input screen of Tool 2 requests (i) the count table file name and (ii) the headings of the two columns of counts to be compared. The output is an interactive display of the events ranked by increasing P-value. This output can be saved as a tab-delimited file (.txt).
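The workflow of Tool 2 (read a tab-delimited table, compare two named columns, rank the events by increasing P-value) can be sketched in Python as follows. Several points are assumptions made for this example: the first column is taken to hold the event names, the sample sizes are taken to be the column totals, the P-value is the one-sided tail of the negative binomial form of the AC probability, and the optional normalization for overdispersed data is not reproduced. The function names and the small table are hypothetical.

```python
import csv
import io
import math


def ac_pvalue(x, y, n1, n2):
    """One-sided AC tail probability (assumed convention, moderate counts only)."""
    q = n2 / (n1 + n2)
    log_q, log_1mq = math.log(q), math.log1p(-q)

    def pmf(k):
        return math.exp(math.lgamma(x + k + 1) - math.lgamma(x + 1)
                        - math.lgamma(k + 1) + k * log_q + (x + 1) * log_1mq)

    lower = sum(pmf(k) for k in range(y + 1))  # P(Y <= y | x)
    if y / n2 < x / n1:
        return min(1.0, lower)
    return min(1.0, max(0.0, 1.0 - lower + pmf(y)))  # P(Y >= y | x)


def rank_events(tsv_text, col_a, col_b):
    """Return (P-value, event) pairs sorted by increasing P-value."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    rows = list(reader)
    event_col = reader.fieldnames[0]      # assumption: first column names the events
    n1 = sum(int(r[col_a]) for r in rows)  # assumption: sample size = column total
    n2 = sum(int(r[col_b]) for r in rows)
    scored = [(ac_pvalue(int(r[col_a]), int(r[col_b]), n1, n2), r[event_col])
              for r in rows]
    return sorted(scored)


# Hypothetical count table with two samples.
example = ("event\tsampleA\tsampleB\n"
           "geneA\t120\t30\n"
           "geneB\t50\t55\n"
           "geneC\t5\t40\n"
           "geneD\t825\t875\n")
ranked = rank_events(example, "sampleA", "sampleB")
for p_value, event in ranked:
    print(f"{event}\t{p_value:.3g}")
```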
3.3 Tool 3: pairwise distances of multiple datasets
Tool 3 performs the complete set of pairwise comparisons of multiple lists of counts (associated with the same set of events) all at once, delivering an interactive heat map of their relative distances (Supplementary Material). The associated distance matrix can be saved as a tab-delimited file (.txt) for further analysis (e.g. as input for various clustering algorithms). Tool 3 only requires a count table file name. Tool 3 and Tool 2 are complementary: Tool 3 is used first to reveal the overall similarity or discrepancy between several sampling experiments, and Tool 2 is then used to identify which events are the most discrepant between them.
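The all-against-all organisation of Tool 3 can be sketched as a loop over every pair of count lists that fills a symmetric matrix, which is then written out as tab-delimited text. The distance actually used by ACDtool is defined in the Supplementary Material and is not reproduced here; the total variation distance between normalized count profiles used below is only a stand-in chosen for this example, and all function names and sample data are hypothetical.

```python
import itertools


def total_variation(a, b):
    """Placeholder distance: half the L1 distance between normalized profiles."""
    sum_a, sum_b = sum(a), sum(b)
    return 0.5 * sum(abs(u / sum_a - v / sum_b) for u, v in zip(a, b))


def distance_matrix(columns, pair_distance=total_variation):
    """Compute all pairwise distances between named lists of counts."""
    names = list(columns)
    dist = {n: {m: 0.0 for m in names} for n in names}
    for n, m in itertools.combinations(names, 2):
        d = pair_distance(columns[n], columns[m])
        dist[n][m] = dist[m][n] = d  # the matrix is symmetric
    return names, dist


def to_tsv(names, dist):
    """Format the distance matrix as tab-delimited text."""
    lines = ["\t".join([""] + names)]
    for n in names:
        lines.append("\t".join([n] + [f"{dist[n][m]:.4f}" for m in names]))
    return "\n".join(lines)


# Hypothetical counts for three samples over the same three events.
samples = {"s1": [10, 20, 70], "s2": [10, 20, 70], "s3": [70, 20, 10]}
names, dist = distance_matrix(samples)
table = to_tsv(names, dist)
```

Passing the distance as a parameter keeps the pairwise loop independent of the measure, so another function (for instance one derived from AC P-values) could be substituted without changing the rest.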
Acknowledgement
We thank Dr. Chantal Abergel for suggesting improvements to the user interface.
Funding
Our PACA-Bioinfo platform is supported by France Génomique (ANR-10-INBS0009) and the French Bioinformatics Institute (ANR-11-INSB0013).
Conflict of Interest: none declared.
References