Abstract

Novel techniques for analyzing microarray data are constantly being developed. Though many of these methods contribute to biological discoveries, the inability to evaluate novel techniques properly limits their capacity to advance science. Because the underlying distribution of microarray data is unknown, novel methods are typically tested against an assumed normal distribution. However, microarray data are not, in fact, normally distributed, and assuming so can have misleading consequences. Using an Affymetrix technical replicate spike-in data set, we show that oligonucleotide expression values are not normally distributed under any of the standard methods for calculating expression values. The resulting data tend to contain a large proportion of skewed and heavy-tailed genes. Additionally, we show that standard methods can give unexpected and misleading results when the data are not well approximated by the normal distribution. Robust methods are therefore recommended when analyzing microarray data. Additionally, new techniques should be evaluated with skewed and/or heavy-tailed data distributions.

INTRODUCTION

Microarray data may be viewed as having 2 sources of variability: technological and biological. The first is error which would preferably be removed (noise), while the second is the true value which is sought (signal). In order to develop useful statistical methods, it is important to understand the shape of the likely data distribution. The purpose of this paper is to isolate and study the technical variation/error of Affymetrix microarrays. Two conclusions are drawn: (1) it is possible to separate and study the technical variability, and Affymetrix gene expression data deviate substantially from normality by being skewed and heavy tailed; therefore, (2) robust methods appropriate for nonnormality should be used when analyzing Affymetrix, and presumably other, arrays.

Throughout this paper, the word “transformation,” and its cognates, will be used to refer to any published statistical method which calculates a single gene expression value from the entire probe set, for each gene, on an oligonucleotide chip. The idea is that one probe set from an oligonucleotide microarray is “transformed” into one gene expression value.

The study of the distribution of technical error is important. Some argue that technical error is negligible (Klebanov and Yakovlev, 2007), while others conclude that it is not negligible (Mushegian, 2007; Koonin, 2007). The large MicroArray Quality Control report (MAQC Consortium, 2006) found that technical variation can substantially affect results. Whether or not the distribution of technical error is well approximated by a normal distribution is also important, because many methods for analyzing microarray data assume normality. In what follows, we attempt to show that the technical error of transformed Affymetrix data is not well approximated by the normal distribution and that this can have a nontrivial effect on subsequent analysis. The 2 main sections of this paper correspond to the 2 conclusions of the opening paragraph. Section 2 argues that technical error is often far from normal because of skewness and heavy tails. Section 3 shows that some conventional Affymetrix data analysis methods become inconsistent and lose accuracy as a result. Section 4 concludes with a discussion of the results. Detailed results are available in a full manuscript at http://pages.pomona.edu/~jsh04747/Research/Papers.htm/ and www.bubbs.biola.edu/~jason.wilson/Professional_Links.htm or as supplementary material available at Biostatistics online (www.biostatistics.oxfordjournals.org).

DEVIATION FROM NORMALITY

Giles and Kipling (2003), hereafter GK, examined the Affymetrix 59-chip spike-in (SI) data set and concluded that there is “strong support for the normality of the [technical error of the] data produced by 4 different algorithms commonly used for extracting expression values” (pp 2258–2259). We replicated the work of GK and came to the opposite conclusion, primarily because GK did not account for the fact that quantile-quantile plots (QQ-plots) have inherently high positive correlations (see supplementary material available at Biostatistics online). Our work, which includes a rebuttal of GK, is important because, to our knowledge, there is no other published work on the distribution of technical variation for microarray data by probe set. For our analysis, the following 5 transformations were implemented using Bioconductor (www.bioconductor.org): dChip PM-only (Li and Wong, 2001), Affymetrix MicroArray Suite 5.0 (MAS5) (Affymetrix, 2002), robust multi-array average (RMA) (Irizarry and others, 2003), GeneChip robust multi-array average (GCRMA) (Wu and others, 2004), and probe logarithmic intensity error (PLIER) (Affymetrix, 2005). One piece of evidence illustrating the nonnormality of the data is the box plots of the skewness coefficients in Figure 1.
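The QQ-plot point can be made concrete with a short simulation (our own illustrative sketch, not code from the paper; the sample size of 59 echoes the SI replicates, while the seed and replication count are arbitrary choices): even samples drawn from a clearly heavy-tailed t₅ distribution yield QQ-plot correlations near 1, so a high correlation alone is weak evidence of normality.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n = 59  # matches the number of technical replicate chips in the SI set

def qq_corr(sample):
    """Correlation between the sorted sample and standard normal quantiles."""
    probs = (np.arange(1, len(sample) + 1) - 0.5) / len(sample)
    return np.corrcoef(np.sort(sample), stats.norm.ppf(probs))[0, 1]

normal_corrs = [qq_corr(rng.standard_normal(n)) for _ in range(1000)]
t5_corrs = [qq_corr(rng.standard_t(5, n)) for _ in range(1000)]

print(f"median QQ-plot correlation, N(0,1) samples: {np.median(normal_corrs):.3f}")
print(f"median QQ-plot correlation, t_5 samples:    {np.median(t5_corrs):.3f}")
```

In runs of this sketch, both medians land close to 1, which is why a formal test such as Shapiro–Wilk is needed rather than a visual correlation criterion.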

Fig. 1.

Skewness box plots. The percentage of genes greater than (smaller than) the 99th (1st) percentile of the empirical normal distribution is written above (below) each nonnormal box plot. PLIER produces more genes that are skewed left, while the other transformations tend to produce more genes that are skewed right.


In order to characterize the nonnormality of the data, we explored the skewness, peakedness, and tail heaviness of the data using the skew coefficient, the kurtosis coefficient, and Hogg's Q2 (Hogg and others, 1975), respectively. The statistics for each gene were displayed in box plots alongside simulated Normal(0, 1), t₅ (heavy-tailed), and χ₃² (skewed) distributions (Figures 1–3 of the supplementary material available at Biostatistics online). Instead of reflecting the Normal(0, 1) box plot, nearly every box plot of the transformed data was more similar to the t₅ or χ₃² plots, if not more extreme than them. Thus, we inferred that skewness, peakedness, and tail heaviness were likely contributors to nonnormality.
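For concreteness, the sketch below (our own illustration, not code from the paper; the Q2 implementation follows the trimmed-mean ratio of Hogg and others, 1975, and all simulation settings are arbitrary) computes these statistics on large simulated samples and recovers the expected ordering: Q2 is larger for t₅ data than for Normal(0, 1) data, and the skew coefficient flags the χ₃² shape.

```python
import numpy as np
from scipy import stats

def hogg_q2(x):
    """Hogg's Q2 tail-weight statistic: (mean of top 5% - mean of bottom 5%)
    divided by (mean of upper half - mean of lower half) of the order
    statistics; larger values indicate heavier tails."""
    x = np.sort(np.asarray(x, dtype=float))
    n = len(x)
    k = max(1, int(np.ceil(0.05 * n)))  # outer 5% of order statistics
    h = max(1, int(np.ceil(0.50 * n)))  # upper/lower halves
    return (x[-k:].mean() - x[:k].mean()) / (x[-h:].mean() - x[:h].mean())

rng = np.random.default_rng(1)
big = 100_000
q2_norm = hogg_q2(rng.standard_normal(big))   # theory: about 2.59 for N(0,1)
q2_t5 = hogg_q2(rng.standard_t(5, big))       # noticeably larger for t_5
skew_chi = stats.skew(rng.chisquare(3, big))  # theory: sqrt(8/3), about 1.63

print(f"Q2 normal = {q2_norm:.2f}, Q2 t_5 = {q2_t5:.2f}, "
      f"skew chi2_3 = {skew_chi:.2f}")
```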

To investigate further, we considered only those genes that failed the Shapiro–Wilk test of normality (the “rejected genes”). We counted the number of skew, kurtosis, and Hogg's Q2 statistics above the 99th percentile of the simulated N(0, 1) data (Table 3 of the supplementary material available at Biostatistics online). Under each of the transformations, at least 67% of the rejected genes had extreme skew deviations (either positive or negative), 34–73% had extreme high kurtosis deviations, and 19–51% had heavy tails (peakedness and tail heaviness are often commensurate, and we subsequently grouped them). See Figure 2 for an example of a gene whose distribution is not consistent with normality under any of the transformations. We conclude that skewness and tail heaviness are major contributors to the nonnormality of the technical error.

Fig. 2.

Nonnormal gene. As an example of a gene that is not consistent with normality under any of the 5 transformations, we give the empirical distributions for the same gene under each of the transformations.


EFFECTS OF NONNORMALITY

Given that the nonnormality of genes is driven partly by skewness and partly by tail heaviness, it becomes important to understand the effects of these attributes on methods typically applied to microarray data. The effects that nonnormality will have on novel techniques are unclear. However, the typical practitioner is probably most interested in the effects of nonnormality on basic differential expression gene-ranking and gene-clustering procedures.

Differential expression

To investigate the consistency of techniques for discovering differentially expressed genes, we compared 2 standard methods: the t-test and the Wilcoxon rank sum (WRS) test. Both methods test for a difference in shift across 2 samples. As expected, for simulated normal data, the t-test was more powerful than the WRS test. For both skewed and heavy-tailed simulated data, the WRS test was more powerful than the t-test, which always identified fewer differentially expressed genes (see Figure 4 and Tables 4 and 5 of the supplementary material available at Biostatistics online). Additionally, the WRS test identified practically all the genes that the t-test identified. Because the t-test's gain in power was small for normal data, the WRS test may still be preferable since the normality assumption cannot be verified. The conclusion of Section 2 is supported by the fact that, when testing for differential expression, the transformed data behave more consistently with the heavy-tailed (t₅) and skewed (χ₃²) data than with normally distributed data (Table 5 of the supplementary material available at Biostatistics online).

Gene similarity

Our simulations showed strong consistency between the Pearson and Spearman correlations for normally distributed data. However, when the data were not normally distributed, the methods became more discrepant. For both heavy-tailed and skewed data, the 2 measures of similarity agreed less closely than they did for normal data. Particularly with heavy-tailed data, one measure sometimes identified a pair as similar while the other did not (see Figure 5 of the supplementary material available at Biostatistics online). Additionally, the Pearson correlation is highly affected by outliers (typically prevalent in microarray data; Wang and others, 2007), so its estimate of similarity is likely to be inaccurate with heavy-tailed data or any data with outliers. For the SI data, the range of the differences between the Spearman and Pearson correlations is closer to that of the heavy-tailed and skewed data than to that of the normal data (Figures 5 and 6 of the supplementary material available at Biostatistics online).
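The Pearson/Spearman discrepancy is easy to reproduce in simulation (again a sketch with assumed settings; the pair construction y = 0.7x + 0.7e and all constants are our own choices, not the paper's): heavy-tailed noise pulls the two similarity measures apart.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
n, reps = 59, 1000  # 59 chips, echoing the SI data set

def mean_discrepancy(noise):
    """Mean |Pearson - Spearman| over simulated correlated gene pairs."""
    diffs = []
    for _ in range(reps):
        x = rng.standard_normal(n)
        y = 0.7 * x + 0.7 * noise(n)  # a correlated "partner" gene
        diffs.append(abs(stats.pearsonr(x, y)[0] - stats.spearmanr(x, y)[0]))
    return float(np.mean(diffs))

d_normal = mean_discrepancy(rng.standard_normal)
d_heavy = mean_discrepancy(lambda size: rng.standard_t(3, size))

print(f"mean |Pearson - Spearman|: normal noise {d_normal:.3f}, "
      f"t_3 noise {d_heavy:.3f}")
```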

For both tests of differential expression and measuring similarity between genes, with microarray data we recommend using robust statistical measures. A robust measure of correlation that is consistent for the population correlation, like the translated biweight correlation (Hardin and others, 2007), may give optimal results.

DISCUSSION

We have demonstrated that transformed microarray data are not, as a rule, well approximated by the normal distribution. Using the 59 technical replicates in the Affymetrix SI data set, we showed that none of the standard transformation techniques result in universally normal data. Given the nature of microarray technologies, we presume that our conclusions will hold for other kinds of microarrays, although further study would be required for confirmation. In practice, our work on consulting projects using other microarray data sets is consistent with our conclusion that microarray data are not, in general, approximately normally distributed.

Finally, not having normal data can yield misleading results for both standard (as mentioned) and novel methods. We advocate that when testing a novel method, heavier-tailed and/or skewed distributions should be used regularly in simulations. It is unclear how newer techniques designed or modified specifically for microarrays (e.g. bagging, boosting, PCA, PLS) will be affected by distributional assumptions, but we hope our results will encourage future researchers to be realistic in simulating data to test novel methods.

SUPPLEMENTARY MATERIAL

Supplementary Material is available at http://www.biostatistics.oxfordjournals.org.

The authors would like to thank John Kloke for his helpful comments and ideas. The authors would also like to thank Jason's advisor, Xinping Cui, for her support on this project while he was completing his dissertation and Barry Arnold for useful discussions on this topic. Conflict of Interest: None declared.

References

Affymetrix (2002). Statistical Algorithms Description Document. Santa Clara, CA: Affymetrix, Inc.

Affymetrix (2005). Guide to Probe Logarithmic Intensity Error (PLIER) Estimation. Technical Report. Santa Clara, CA: Affymetrix, Inc.

Giles, P. and Kipling, D. (2003). Normality of oligonucleotide microarray data and implications for parametric statistical analyses. Bioinformatics 19, 2254–2262.

Hardin, J., Mitani, A., Hicks, L. and VanKoten, B. (2007). A robust measure of correlation between two genes on a microarray. BMC Bioinformatics 8, 220.

Hogg, R., Fisher, D. and Randles, R. (1975). A two-sample adaptive distribution-free test. Journal of the American Statistical Association 70, 656–661.

Irizarry, R., Hobbs, B., Collin, F., Beazer-Barclay, Y., Antonellis, K., Scherf, U. and Speed, T. (2003). Exploration, normalization, and summaries of high density oligonucleotide array probe level data. Biostatistics 4, 249–264.

Klebanov, L. and Yakovlev, A. (2007). How high is the level of technical noise in microarray data? Biology Direct 2, 1–6.

Koonin, E. V. (2007). Review 3: how high is the level of technical noise in microarray data? Biology Direct 2, 8–9.

Li, C. and Wong, W. (2001). Model-based analysis of oligonucleotide arrays: model validation, design issues and standard error application. Genome Biology 2, research0032.

MAQC Consortium (2006). The MicroArray Quality Control (MAQC) project shows inter- and intraplatform reproducibility of gene expression measurements. Nature Biotechnology 24, 1151–1161.

Mushegian, A. (2007). Review 1: how high is the level of technical noise in microarray data? Biology Direct 2, 6–7.

Wang, Y., Miao, Z., Pommier, Y., Kawasaki, E. and Player, A. (2007). Characterization of mismatch and high-signal intensity probes associated with Affymetrix GeneChips. Bioinformatics 23, 2088–2095.

Wu, Z., Irizarry, R., Gentleman, R., Martinez-Murillo, F. and Spencer, F. (2004). A model-based background adjustment for oligonucleotide expression arrays. Journal of the American Statistical Association 99, 909–917.
The data set was selected because removing the 16 SI genes leaves enough technical replicates to test distributional assumptions; it is available at the website for R's Bioconductor software at http://bioconductor.org/packages/2.0/data/experiment/html/SpikeIn.html.