Novel techniques for analyzing microarray data are constantly being developed. Although many of these methods contribute to biological discoveries, the inability to properly evaluate novel techniques limits their ability to advance science. Because the underlying distribution of microarray data is unknown, novel methods are typically tested against an assumed normal distribution. However, microarray data are not, in fact, normally distributed, and assuming so can have misleading consequences. Using an Affymetrix technical replicate spike-in data set, we show that oligonucleotide expression values are not normally distributed for any of the standard methods for calculating expression values. The resulting data tend to have a large proportion of skewed and heavy-tailed genes. Additionally, we show that standard methods can give unexpected and misleading results when the data are not well approximated by the normal distribution. Robust methods are therefore recommended when analyzing microarray data, and new techniques should be evaluated with skewed and/or heavy-tailed data distributions.
Microarray data may be viewed as having 2 sources of variability: technical and biological. The first is error that would preferably be removed (noise), while the second is the true value that is sought (signal). In order to develop useful statistical methods, it is important to understand the shape of the likely data distribution. The purpose of this paper is to isolate and study the technical variation/error of Affymetrix microarrays. Two conclusions are drawn: (1) it is possible to separate and study the technical variability, and Affymetrix gene expression data deviate substantially from normality by being skewed and heavy tailed; therefore, (2) robust methods appropriate for nonnormality should be used when analyzing Affymetrix, and presumably other, arrays.
Throughout this paper, the word “transformation,” and its cognates, will be used to refer to any published statistical method which calculates a single gene expression value from the entire probe set, for each gene, on an oligonucleotide chip. The idea is that one probe set from an oligonucleotide microarray is “transformed” into one gene expression value.
The study of the distribution of technical error is important. Some argue that technical error is negligible (Klebanov and Yakovlev, 2007), while others conclude that it is not negligible (Mushegian, 2007; Koonin, 2007). The large MicroArray Quality Control report (MAQC Consortium, 2006) found that technical variation can substantially affect results. Whether or not the distribution of technical error is well approximated by a normal distribution is also important, because many methods for analyzing microarray data assume normality. In what follows, we attempt to show that the technical error of transformed Affymetrix data is not well approximated by the normal distribution and that this can have a nontrivial effect on subsequent analysis. The 2 main sections of this paper correspond to the 2 conclusions of the opening paragraph. Section 2 argues that technical error is often far from normal because of skewness and heavy tails. Section 3 shows that some conventional Affymetrix data analysis methods become inconsistent and lose accuracy as a result. Section 4 concludes with a discussion of the results. Detailed results are available in a full manuscript at http://pages.pomona.edu/~jsh04747/Research/Papers.htm/ and www.bubbs.biola.edu/~jason.wilson/Professional_Links.htm or as supplementary material available at Biostatistics online (www.biostatistics.oxfordjournals.org).
DEVIATION FROM NORMALITY
Giles and Kipling (2003), hereafter GK, examined the Affymetrix 59-chip spike-in (SI) data set and concluded that there is “strong support for the normality of the [technical error of the] data produced by 4 different algorithms commonly used for extracting expression values” (pp 2258–2259). We replicated the work of GK and came to the opposite conclusion, primarily because GK did not account for the fact that quantile-quantile plots (QQ-plots) have inherently high positive correlations (see supplementary material available at Biostatistics online). Our work, which includes a rebuttal of GK, is important because, to our knowledge, there is no other published work on the distribution of technical variation for microarray data by probe set. For our analysis, the following 5 transformations were implemented using Bioconductor (www.bioconductor.org): dChip PM-only (Li and Wong, 2001), Affymetrix MicroArray Suite 5.0 (MAS5) (Affymetrix, 2002), robust multi-array average (RMA) (Irizarry and others, 2003), GeneChip robust multi-array average (GCRMA) (Wu and others, 2004), and probe logarithmic intensity error (PLIER) (Affymetrix, 2005). One piece of evidence illustrating the nonnormality of the data is the box plots of the skewness coefficients in Figure 1.
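The point about QQ-plots can be made with a short simulation: even clearly nonnormal samples of size 59 produce QQ-plot correlations near 1, so a high correlation alone is weak evidence of normality. The following is a minimal Python sketch (not the authors' original code; the function name `qq_correlation` is ours):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n = 59  # number of technical replicate arrays in the SI data set

def qq_correlation(sample):
    """Correlation between sorted sample values and standard normal quantiles."""
    probs = (np.arange(1, len(sample) + 1) - 0.5) / len(sample)
    return np.corrcoef(np.sort(sample), stats.norm.ppf(probs))[0, 1]

# Average QQ-plot correlation over 1000 simulated samples per distribution
normal_r = np.mean([qq_correlation(rng.normal(size=n)) for _ in range(1000)])
t5_r = np.mean([qq_correlation(rng.standard_t(5, size=n)) for _ in range(1000)])
chi3_r = np.mean([qq_correlation(rng.chisquare(3, size=n)) for _ in range(1000)])
print(f"normal={normal_r:.3f}  t5={t5_r:.3f}  chi-square(3)={chi3_r:.3f}")
```

All three averages lie close to 1 even though two of the distributions are plainly nonnormal, illustrating why QQ-plot correlation is an insensitive test of normality at this sample size.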
In order to characterize the nonnormality of the data, we explored the skewness, peakedness, and tail heaviness of the data using the skew coefficient, kurtosis coefficient, and Hogg's Q2 (Hogg and others, 1975), respectively. The statistics for each gene were displayed in box plots alongside simulated Normal(0, 1), t₅ (heavy tailed), and χ₃² (skewed) distributions (Figures 1–3 of the supplementary material available at Biostatistics online). Instead of reflecting the Normal(0, 1) box plot, nearly every box plot of the transformed data was more similar to the t₅ or χ₃² plots, if not more extreme. Thus, we inferred that skewness, peakedness, and tail heaviness were likely contributors to nonnormality.
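The three statistics can be computed as in the sketch below. The `hogg_q2` function implements one common form of Hogg's tail-weight measure (mean of the extreme 5% tails relative to the half-means); the exact variant used in the paper may differ in detail, and the simulated samples merely stand in for the transformed probe-set values:

```python
import numpy as np
from scipy import stats

def hogg_q2(x, alpha=0.05):
    """Tail weight: spread of the extreme 5% tails relative to the half-means."""
    x = np.sort(x)
    n = len(x)
    k = max(1, int(alpha * n))  # observations in each extreme tail
    h = n // 2                  # observations in each half
    return (x[-k:].mean() - x[:k].mean()) / (x[-h:].mean() - x[:h].mean())

rng = np.random.default_rng(1)
results = {}
for name, draw in [("normal", lambda: rng.normal(size=59)),
                   ("t5", lambda: rng.standard_t(5, size=59)),
                   ("chi3", lambda: rng.chisquare(3, size=59))]:
    samples = [draw() for _ in range(2000)]
    results[name] = (np.mean([stats.skew(s) for s in samples]),     # skew coefficient
                     np.mean([stats.kurtosis(s) for s in samples]), # excess kurtosis
                     np.mean([hogg_q2(s) for s in samples]))        # tail weight
    print(name, ["%.2f" % v for v in results[name]])
```

Relative to the normal baseline, the t₅ samples show elevated kurtosis and Q2, and the χ₃² samples show strong positive skew, mirroring the patterns seen in the box plots.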
To investigate further, we considered only those genes that failed the Shapiro–Wilk test of normality (the “rejected genes”). We counted the number of skew, kurtosis, and Hogg's Q2 statistics above the 99th percentile for the simulated N(0, 1) data (Table 3 of the supplementary material available at Biostatistics online). Under each of the transformations, at least 67% of the rejected genes had extreme skew deviations (either positive or negative), 34–73% of the rejected genes had extremely high kurtosis, and 19–51% of the rejected genes had heavy tails (peakedness and tail heaviness often coincide, so we subsequently grouped them). See Figure 2 for an example of a gene whose distribution is not consistent with normality under any of the transformations. We conclude that skewness and tail heaviness are major contributors to the nonnormality of the technical error.
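The counting procedure can be sketched as follows, using simulated χ₃² “genes” as a hypothetical stand-in for the real expression matrix (the actual analysis used the SI data set and all three statistics; only skew and kurtosis are counted here):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
n_reps, n_genes = 59, 500

# 99th percentiles of |skew| and kurtosis under the normal null
null = rng.normal(size=(20000, n_reps))
skew_99 = np.quantile(np.abs(stats.skew(null, axis=1)), 0.99)
kurt_99 = np.quantile(stats.kurtosis(null, axis=1), 0.99)

# Hypothetical expression matrix: skewed genes stand in for the real data
genes = rng.chisquare(3, size=(n_genes, n_reps))

# Keep only genes rejected by Shapiro-Wilk, then count extreme deviations
rejected = [g for g in genes if stats.shapiro(g).pvalue < 0.05]
n_skew = sum(abs(stats.skew(g)) > skew_99 for g in rejected)
n_kurt = sum(stats.kurtosis(g) > kurt_99 for g in rejected)
print(f"rejected {len(rejected)}/{n_genes}; "
      f"extreme skew in {n_skew}, extreme kurtosis in {n_kurt}")
```

With skewed input, most rejected genes also exceed the normal-based 99th-percentile skew cutoff, which is how the table classifies the sources of nonnormality.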
EFFECTS OF NONNORMALITY
Having characterized the nonnormality of the genes as driven partly by skewness and partly by tail heaviness, it becomes important to understand the effects of these attributes on methods typically applied to microarray data. The effect of nonnormality on novel techniques is unclear; the typical practitioner, however, is probably most interested in its effect on basic differential expression gene ranking and gene clustering procedures.
To investigate the consistency of techniques that identify differentially expressed genes, we compared 2 standard methods: the t-test and the Wilcoxon rank sum (WRS) test. Both methods test for a shift in location between 2 samples. As expected, for simulated normal data, the t-test was more powerful than WRS. For both skewed and heavy-tailed simulated data, the WRS test was more powerful than the t-test, which always identified fewer differentially expressed genes (see Figure 4 and Tables 4 and 5 of the supplementary material available at Biostatistics online). Additionally, the WRS test identified practically all the genes that the t-test identified. Because the t-test's gain in power was small for normal data, WRS may still be preferable since the normality assumption cannot be verified. The conclusion of Section 2 is supported by the fact that, when testing for differential expression, the transformed data behave more consistently with the heavy-tailed (t₅) and skewed (χ₃²) simulations than with normally distributed data (Table 5 of the supplementary material available at Biostatistics online).
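The power comparison can be reproduced in miniature with a simulation. This is a sketch rather than the paper's exact design; the per-group sample size of 30 and the shift of 0.5 are illustrative assumptions:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)

def power(draw, shift=0.5, n=30, reps=3000, alpha=0.05):
    """Empirical power of the t-test and WRS test for a location shift."""
    t_hits = w_hits = 0
    for _ in range(reps):
        x, y = draw(n), draw(n) + shift
        t_hits += stats.ttest_ind(x, y).pvalue < alpha
        w_hits += stats.ranksums(x, y).pvalue < alpha
    return t_hits / reps, w_hits / reps

p_norm = power(lambda n: rng.normal(size=n))
p_t5 = power(lambda n: rng.standard_t(5, size=n))
print(f"normal: t={p_norm[0]:.2f}  WRS={p_norm[1]:.2f}")
print(f"t5:     t={p_t5[0]:.2f}  WRS={p_t5[1]:.2f}")
```

Under normality the t-test edges out WRS by only a few percentage points, while under heavy tails the ordering reverses, which is the trade-off motivating the recommendation of WRS.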
Our simulations showed strong consistency between Pearson correlation and Spearman correlation for normally distributed data. However, when the data were not normally distributed, the 2 methods became more discrepant. For both heavy-tailed and skewed data, the 2 measures of similarity agreed with each other less than they did for normal data. Particularly with the heavy-tailed data, sometimes one measure identified a pair as similar while the other did not (see Figure 5 of the supplementary material available at Biostatistics online). As an additional note, Pearson correlation is highly affected by outliers (typically prevalent in microarray data; Wang and others, 2007), so the Pearson correlation estimate of similarity is likely to be inaccurate for heavy-tailed data or any data with outliers. For the SI data, the range of the differences between Spearman and Pearson correlations is closer to that of the heavy-tailed and skewed data than to that of the normal data (Figures 5 and 6 of the supplementary material available at Biostatistics online).
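The sensitivity of Pearson correlation to a single outlying array can be shown directly. The data below are hypothetical: two unrelated expression profiles that share one extreme value, standing in for a heavy-tailed gene pair:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)
n = 59  # matches the number of technical replicates in the SI data set

# Two unrelated expression profiles sharing one extreme value
x = rng.normal(size=n)
y = rng.normal(size=n)
x[0], y[0] = 15.0, 15.0  # a single outlying array

r_pearson = stats.pearsonr(x, y)[0]
r_spearman = stats.spearmanr(x, y)[0]
print(f"Pearson={r_pearson:.2f}  Spearman={r_spearman:.2f}")
```

One shared outlier drives the Pearson correlation well above 0.5 even though the profiles are otherwise independent, while the rank-based Spearman correlation stays near 0, which is the disagreement pattern seen in the heavy-tailed simulations.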
For both testing differential expression and measuring similarity between genes, we recommend robust statistical measures when working with microarray data. A robust measure of correlation that is consistent for the population correlation, such as the translated biweight correlation (Hardin and others, 2007), may give optimal results.
We have demonstrated that transformed microarray data are not, as a rule, well approximated by the normal distribution. Using the 59 technical replicates in the Affymetrix SI data set, we showed that none of the standard transformation techniques result in universally normal data. Given the nature of microarray technologies, we presume that our conclusions will hold for other kinds of microarrays, although further study would be required for confirmation. In practice, our work on consulting projects using other microarray data sets is consistent with our conclusion that microarray data are not, in general, approximately normally distributed.
Finally, nonnormal data can yield misleading results for both standard (as discussed above) and novel methods. We advocate that, when testing a novel method, heavier-tailed and/or skewed distributions be used regularly in simulations. It is unclear how newer techniques designed or modified specifically for microarrays (e.g. bagging, boosting, PCA, PLS) will be affected by distributional assumptions, but we hope our results will encourage future researchers to be realistic in simulating data to test novel methods.
The authors would like to thank John Kloke for his helpful comments and ideas. The authors would also like to thank Jason's advisor, Xinping Cui, for her support on this project while he was completing his dissertation and Barry Arnold for useful discussions on this topic. Conflict of Interest: None declared.