Summary: We have developed a program for microarray data analysis, which features the false discovery rate for testing statistical significance and the principal component analysis using the singular value decomposition method for detecting the global trends of gene-expression patterns. Additional features include analysis of variance with multiple methods for error variance adjustment, correction of cross-channel correlation for two-color microarrays, identification of genes specific to each cluster of tissue samples, biplot of tissues and corresponding tissue-specific genes, clustering of genes that are correlated with each principal component (PC), three-dimensional graphics based on virtual reality modeling language and sharing of PC between different experiments. The software also supports parameter adjustment, gene search and graphical output of results. The software is implemented as a web tool and thus the speed of analysis does not depend on the power of a client computer.
Availability: The tool can be used on-line or downloaded at http://lgsun.grc.nia.nih.gov/ANOVA/
Global gene-expression analysis with microarrays has become a routine procedure in biomedical research. Although many programs have been developed to support the statistical analysis of microarray results (Kim et al., 2001; Theilhaber et al., 2004; TIGR, 2004, http://www.tigr.org/software/tm4/; Tusher et al., 2001), they do not necessarily contain all the advanced analysis methods. To facilitate the use of these relatively new methods, we developed the NIA Array Analysis software. A complete description of the software, as well as a glossary of technical and statistical terms, can be found at http://lgsun.grc.nia.nih.gov/ANOVA/. In this paper, we describe the main features of this software.
The NIA Array Analysis software can be used for both single-color and two-color microarrays, with or without a dye swap. It takes a tab-delimited text file as input and generates output in both graphics and text formats. An additional tool (Arrayjoin) assembles multiple input files from different experiments into one input file. The software can also take an annotation file that hyperlinks each microarray probe to various web resources, including Unigene, TIGR, MGI and NIA Mouse Gene Index. These gene links allow users to incorporate microarray data into other programs, e.g. GenMAPP for Gene Ontology analysis. All results can be saved as a stand-alone web page for sharing or releasing the data.
The software offers an optional adjustment of signal intensities when two-color hybridizations are used. This is based on our observation that signal intensities in one channel (e.g. red) often increase with increasing signal intensities in the other channel (e.g. green), even when the same reference RNA is always used for the red channel. If readings from the two channels are independent, the signal intensities in the red channel should not vary among experiments, and any such variation should be corrected.
We have implemented the single-factor analysis of variance (ANOVA) for testing statistical significance. Testing multiple hypotheses with the ANOVA requires some modifications, such as error variance averaging and the false discovery rate (FDR). The average error variance for genes with similar signal intensities is estimated using a sliding window of adjustable size applied to genes sorted by their average signal intensities. Because some genes (outliers) may have unusually high error variance, genes with the highest variance values (the top 1% by default) are not used for the error variance averaging. To obtain an estimate of the true error variance, the software provides the following five error models as options: (1) actual error variance (this option processes each gene independently), (2) intensity-specific average error variance, (3) Bayesian error model (Baldi and Long, 2001), (4) maximum of the intensity-specific average error variance and the actual error variance and (5) maximum of the intensity-specific average error variance and the Bayesian error variance. Option (4), the most conservative model, is used as the default. However, if the error variance is too high, none of these models is reliable. Thus, we tag and visually examine genes with high error variance (five times greater than the average). Users can also select more stringent criteria for removing outliers (i.e. a lower z-threshold level). The default threshold (z = 8) removes only the most deviating outliers. Estimation of the z-value is based on the ANOVA results; thus, ANOVA is applied iteratively, with outlier removal in each cycle, until no new outliers are detected.
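The intensity-specific averaging and the default error model (4) can be illustrated with a minimal Python sketch. It is not the code used by the software: the function name, the window handling at the ends of the intensity range and the example values are assumptions made for illustration only.

```python
from statistics import mean

def windowed_error_variance(intensity, variance, window=5, trim_fraction=0.01):
    """Sketch of error models (2) and (4); all names here are hypothetical.

    intensity: average signal intensity per gene
    variance:  actual (per-gene) error variance
    Returns (averaged, conservative), both indexed like the inputs.
    """
    n = len(intensity)
    # Genes with the highest error variance are excluded from averaging.
    n_trim = int(round(trim_fraction * n))
    by_variance = sorted(range(n), key=lambda g: variance[g])
    kept = set(by_variance[: n - n_trim])

    # Slide a window over genes sorted by average signal intensity.
    order = sorted(range(n), key=lambda g: intensity[g])
    half = window // 2
    averaged = [0.0] * n
    for pos, gene in enumerate(order):
        lo, hi = max(0, pos - half), min(n, pos + half + 1)
        neighbours = [variance[h] for h in order[lo:hi] if h in kept]
        # Assumption: fall back to the gene's own variance if the window is empty.
        averaged[gene] = mean(neighbours) if neighbours else variance[gene]

    # Model (4): maximum of the averaged and the actual error variance.
    conservative = [max(a, v) for a, v in zip(averaged, variance)]
    return averaged, conservative

averaged, conservative = windowed_error_variance(
    [1, 2, 3, 4, 5], [1.0, 1.0, 100.0, 1.0, 1.0], window=3, trim_fraction=0.2
)
```

In this toy input the third gene has an unusually high variance, so it is left out of the averaging but keeps its own variance under model (4).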
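The FDR estimates the proportion of false positives among significant genes (Benjamini and Hochberg, 1995; Reiner et al., 2003). Traditional p-values, which are designed for testing a single hypothesis, are not suited to the comparison of several thousand genes. The Bonferroni correction is not appropriate either, because it is too stringent and allows no false positives among significant genes. We have implemented the original method (Benjamini and Hochberg, 1995):
$$\mathrm{FDR}_r = \min_{i \ge r} \left( \frac{p_i N}{i} \right), \qquad (1)$$
where r is the rank of a gene when genes are ordered by increasing p-value, p_i is the p-value of the gene with rank i and N is the total number of genes tested.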
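The Benjamini and Hochberg adjustment can be sketched in a few lines of Python. This is a minimal illustration of the standard step-up procedure, not the code of the software; the function name and the example p-values are hypothetical.

```python
def benjamini_hochberg(p_values):
    """Return the Benjamini-Hochberg FDR value for each p-value.

    For the gene of rank r (ordered by increasing p-value), the FDR is the
    minimum of p_i * N / i over all ranks i >= r, where N is the number of
    genes tested.
    """
    n = len(p_values)
    order = sorted(range(n), key=lambda i: p_values[i])
    fdr = [0.0] * n
    running_min = 1.0
    # Walk from the largest p-value down so the running minimum covers i >= r.
    for rank in range(n, 0, -1):
        gene = order[rank - 1]
        running_min = min(running_min, p_values[gene] * n / rank)
        fdr[gene] = running_min
    return fdr

fdr = benjamini_hochberg([0.01, 0.04, 0.03, 0.5])
```

Taking the minimum over higher ranks keeps the FDR values monotonic in the p-values, so a gene never receives a larger FDR than a gene with a larger p-value.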
The software offers two methods of clustering tissue samples and subsequent identification of correlated genes. First, hierarchical clustering of samples (e.g. tissues and cells) is done by using the average distance method. A set of genes unique to each cluster is identified in the following manner. For each gene, g, we first identify a sample T1(g) with the lowest average expression, E[T1(g)], within the cluster and a sample T2(g) with the highest average expression, E[T2(g)], outside the cluster. If K genes satisfy E[T1(g)] > E[T2(g)], these genes always have higher expression in samples within the cluster than in samples outside the cluster. To determine whether the difference E[T1(g)] − E[T2(g)] is statistically significant, we calculate z-values based on the error model and p-values based on the one-tailed normal distribution. Finally, we estimate FDR values using Equation 1, in which N is the minimum of 2K and the total number of genes. The set of K genes represents only one half (the positive part) of the normal distribution, and thus K is doubled for estimating the FDR.
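The selection of cluster-specific genes described above can be sketched as follows. The sketch assumes that a standard error for the difference E[T1(g)] − E[T2(g)] is already available for each gene from the chosen error model (the argument `se`, a hypothetical name); how the software derives it is not shown here. Function and variable names are likewise illustrative.

```python
import math

def cluster_specific_genes(expr, cluster, se):
    """Return {gene index: FDR} for genes higher in every cluster sample.

    expr:    average expression per gene (rows) and sample (columns)
    cluster: indices of the samples that form the cluster
    se:      assumed standard error of the difference for each gene
    """
    n_genes = len(expr)
    outside = [s for s in range(len(expr[0])) if s not in cluster]
    candidates = []
    for g, row in enumerate(expr):
        lowest_inside = min(row[s] for s in cluster)     # E[T1(g)]
        highest_outside = max(row[s] for s in outside)   # E[T2(g)]
        diff = lowest_inside - highest_outside
        if diff > 0:
            z = diff / se[g]
            p = 0.5 * math.erfc(z / math.sqrt(2))        # one-tailed normal p-value
            candidates.append((g, p))

    k = len(candidates)
    n_tests = min(2 * k, n_genes)  # N: minimum of 2K and the number of genes
    candidates.sort(key=lambda item: item[1])
    fdr = {}
    running_min = 1.0
    # FDR of rank r is the minimum of p_i * N / i over ranks i >= r.
    for rank in range(k, 0, -1):
        g, p = candidates[rank - 1]
        running_min = min(running_min, p * n_tests / rank)
        fdr[g] = running_min
    return fdr

expr = [[5.0, 6.0, 1.0, 2.0], [1.0, 2.0, 5.0, 6.0], [3.0, 3.5, 3.0, 2.9]]
fdr = cluster_specific_genes(expr, cluster=[0, 1], se=[1.0, 1.0, 1.0])
```

In this toy example only the first gene is higher in both cluster samples than in every outside sample, so K = 1 and N = 2.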
Second, we have implemented principal component analysis (PCA). One advantage of PCA is that the principal components are always orthogonal (uncorrelated), whereas other methods (e.g. K-means clustering) often produce redundant correlated clusters. We have also implemented the singular value decomposition method, which reduces the dimension in both columns and rows of the data matrix. The method combines samples and genes in a single graph (called a biplot) so that their association can be analyzed visually (Chapman et al., 2002; Gabriel, 1971). The NIA Array Analysis tool generates interactive two-dimensional (2D) and 3D biplots (Fig. 1). Each gene in a biplot is hyperlinked to its annotation and to a histogram showing the expression levels in each sample. We identify two sets of genes that are positively and negatively correlated with each principal component (PC). If the degree of gene-expression change associated with a specific PC exceeds a user-defined threshold, then the gene is considered correlated with the PC.
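A biplot of this kind can be sketched with NumPy's singular value decomposition. The sketch is illustrative only: centring each gene across samples, scaling the gene coordinates by the singular values and using the gene coordinate along a PC as the measure of expression change are assumptions, and the function names are hypothetical.

```python
import numpy as np

def svd_biplot(data, n_components=2):
    """Return gene coordinates, sample coordinates and variance fractions.

    data: expression matrix with genes in rows and samples in columns.
    """
    x = np.asarray(data, dtype=float)
    # Assumption: centre each gene on its mean across samples.
    centred = x - x.mean(axis=1, keepdims=True)
    u, s, vt = np.linalg.svd(centred, full_matrices=False)
    gene_coords = u[:, :n_components] * s[:n_components]  # one row per gene
    sample_coords = vt[:n_components].T                   # one row per sample
    explained = (s ** 2 / np.sum(s ** 2))[:n_components]
    return gene_coords, sample_coords, explained

def genes_correlated_with_pc(gene_coords, pc, threshold):
    """Split genes by the sign of their coordinate along one PC."""
    scores = gene_coords[:, pc]
    positive = [int(g) for g in np.where(scores > threshold)[0]]
    negative = [int(g) for g in np.where(scores < -threshold)[0]]
    return positive, negative

data = [[1, 2, 3], [3, 2, 1], [2, 2, 2]]
gene_coords, sample_coords, explained = svd_biplot(data)
positive, negative = genes_correlated_with_pc(gene_coords, pc=0, threshold=1.0)
```

Because the sign of each singular vector is arbitrary, which of the two gene sets is labeled positive can differ between runs or libraries; only the separation into opposite groups is meaningful.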
The NIA Array Analysis tool has been successfully used for the last two years (Hamatani et al., 2004; Sharov et al., 2003). This open-source non-restricted software will be a valuable resource for the research community.