Motivation: The introduction of oligonucleotide DNA arrays has resulted in much debate concerning appropriate models for the measurement of gene expression. By contrast, little account has been taken of the possibility of identifying the physical imperfections in the raw data.
Results: This paper demonstrates that, with the use of replicates and an awareness of the spatial structure, deficiencies in the data can be identified, the possibility of their correction can be ascertained and correction can be effected (by use of local scaling) where possible. The procedures were motivated by data from replicates of Arabidopsis thaliana using the GeneChip® ATH1-121501 microarray. Similar problems are illustrated for GeneChip® Human Genome U133 arrays and for the newer and larger GeneChip® Wheat Genome microarray.
Availability: R code is freely available on request.
This paper begins with a brief description of Affymetrix arrays. The key to their design is that information about each gene is collected from widely separated locations in the microarray. The design of the microarray recognizes that there is likely to be uncontrolled variation across the microarray, so that at each location there are measures of what might be termed ‘signal’ and ‘background noise’.
In the next section we briefly examine the best known models. These do not take account of the spatial arrangement of the information. We show how to exploit replicates of the existing microarrays so as to uncover spatial flaws in those replicates. Some of these flaws can be compensated for (the compensation procedure is explained), but others can provide no useful information. We demonstrate how to distinguish between these two types of flaw.
2 THE DESIGN OF AFFYMETRIX ARRAYS
Oligonucleotide expression microarray technology (Lockhart et al., 1996) is now well established as a method for the determination of gene expression. The data and analysis described here refer to the Affymetrix GeneChip® ATH1-121501 for use with Arabidopsis thaliana. This chip consists of a 720 × 720 array providing information on 22 810 genes.
As described by Lipshutz et al. (1999) and Warrington et al. (2000) pairs of 25-base oligonucleotides are used to probe genes. In each pair, the first probe, termed the perfect match (PM) probe, is intended to match a target sequence exactly; the second probe, termed the mismatch (MM) probe, is identical to the first except for its central (13th) base. For each gene there are multiple probe pairs forming a ‘probe set’. In the ATH1-121501 microarray most sets consist of 11 pairs, with a few having 20 pairs. Each probe pair occupies adjacent locations in the array, with the pairs in a probe set being at widely separated locations.
The idea behind this arrangement is that the MM intensity provides an estimate of the background ‘noise’, so that the (PM–MM) differences give estimates of the gene expression. By placing each pair of PM and MM probes at adjacent locations in the microarray, spatial variations across the microarray can be ignored, since the difference between neighbours will be negligible.
3 COMMONLY USED MODELS
The most commonly used model is probably that introduced by Li and Wong (2001). The model states that in the i-th replicate the MM and PM values for the j-th probe pair for a specific gene, denoted by MMij and PMij, are given by:
For an adjacent pair of probes it is reasonable to assume that any additive location effects incorporated into Equations (1) and (2) would cancel out. Naef et al. (2002) suggest that when |Dij| is near zero then the background intensity will be well estimated by 1/2 (PMij + MMij).
The problem with this model (and related models) is that it is not uncommon to find negative values for the (PM–MM) differences. These are a consequence of cross-hybridization of the probes with genes other than the target gene. In our study we found that about one-fourth of the PM–MM differences were negative; in a study of GeneChip® HG-U95A arrays Irizarry et al. (2003a) noted that about one-third of the differences were negative.
The model defined by Equations (1) and (2) has both additive and multiplicative components. In contrast, the mixed-effect model suggested by Chu et al. (2002) is a simple linear model with cell line, treatment, probe and microarray main effects and various interactions. As with the Li–Wong model, the Chu model is affected by cross-hybridization.
To avoid the problems caused by cross-hybridization, several authors (Naef et al., 2002; Šášik et al., 2002; Irizarry et al., 2003a,b) have recently proposed models that consider only the PM values. The general form of these models is
Apart from the random errors, and the variations in b from one microarray to another, Equation (5) ignores variations within a microarray (as do the other models). With spatial variations across the microarray, a more appropriate model would be
The purpose of the following sections is to demonstrate the existence of the spatial components and then to remove them. The model given by Equation (4) or (5) assumes that variations from one replicate to another are constant across the replicates. Using the uncorrected PMij values would ignore the spatial variation across the microarray that had motivated Affymetrix's original use of PM/MM pairs. Later in this paper, we will demonstrate that there are indeed spatial variations and show how these may be reduced.
4 THE USE OF REPLICATES
The production of a microarray is a complex business and it is no surprise that there are occasional imperfections in the results. An overview of the problems of production is provided by Model et al. (2002) and Hartemink et al. (2001). Model et al. (2002) suggest a method for diagnosing faulty arrays, so that these are omitted in the subsequent analysis. In our analysis, however, while we find flaws for most arrays, we recommend discarding only a tiny proportion of the total data. The possibility of errors varying with location on the microarray is noted in the context of cDNA arrays by Yang et al. (2002).
One key to unmasking imperfections in microarray data is the comparison of replicate measurements. The importance of replication is sometimes underestimated, and while microarrays remain expensive commodities it is easy to use cost as a reason for not replicating. However, Lee et al. (2000) conclude ‘The main lesson learned from the study is that replication in microarray studies is not equivalent to duplication and hence is not a waste of scientific resources’, while Quackenbush (2002) states ‘Replication is essential for identifying and reducing the variation in any experimental assay, and microarrays are no exception’. According to Faller et al. (2003), experimenters in ‘high-impact journals therefore use three to four microarray repeats per experimental condition’.
4.1 Replicates used
A total of six ATH1-121501 arrays were used. One set of three arrays was probed with biological replicate samples from wild-type A. thaliana (ecotype Wassilewskija) and the other set of three arrays was probed with biological replicate samples from the pho3 mutant (in the same genetic background). The seeds were germinated in MS medium containing 1% Suc (w/v). Ten-day-old seedlings were transferred into compost (Fisons Levington) in a controlled environment cabinet (22°C; 140 μmol m−1 s−1 light; 16 h day length). Rosette leaves were harvested at developmental stage 6.00 = first flower open (Boyes et al., 2001). Three independently grown samples (each containing leaves pooled from 10 to 15 plants) were harvested as biological replicates of wild type and pho3 tissue. Replicate samples of total RNA extracted from this tissue were sent to the Nottingham Arabidopsis Stock Centre for probe preparation and Arabidopsis Ath1 GeneChip analysis (Affymetrix Inc., Santa Clara, CA) at the Nottingham Arabidopsis Stock Centre (NASC) as described in Lloyd and Zakhleniuk (2004). The microarray data from this experiment are available on public access at the Nottingham Arabidopsis Stock Centre website ().
4.2 Replicate screening
If genetically identical material is exposed in each of R arrays, then, at any specified microarray location, the value reported for each microarray should be the same, except for independent random errors. This assumption underlies all the models previously described; none of these models allows for the spatially correlated errors that would arise from cases where an entire region of a microarray has enhanced or reduced values.
Li and Wong (2001) reveal microarray imperfections by fitting their model to the entire microarray and highlighting probe pairs that differ by more than three standard deviations from the fitted value. This procedure, which is implemented in the dChip suite, leads to the identification of scratches, possibly contaminated arrays and misaligned arrays. However, their procedure is dependent upon the appropriateness of their model, and can only be implemented following careful work in robustly estimating the model parameters. We will present a simple alternative based directly on the data from a set of replicate arrays and using information from local regions, rather than from the entire chip.
Our single assumption is that all R arrays have been presented with the same genetic material in the same fashion. If this is the case, then there should be little variation between the overall values obtained from the replicates.
The data that motivated this study were described in the previous section. Set 1 (replicates 1–3) refers to the pho3 mutant strain and Set 2 (Replicates 4–6) to the wild type. Table 1 shows that while there is little variation between most replicates, Replicate 5 displays markedly lower values. The first step is therefore to use a multiplicative scaling of the sets of intensities so that the microarray totals are equal.
The second replicate in the second set exhibits a markedly lower total intensity.
After scaling we then identify which of the R arrays has provided the largest value at each location. Ideally these would be independent choices from the R possible replicates in a set. If the same replicate provides the maximum value for many neighbouring locations, then this is an indication of a local bias in that replicate. A similar logic applies for minima.
To investigate whether such biases exist, we divide the 712 × 712 ATH1-121501 microarray into non-overlapping 3 × 3 sub-arrays. Since, in the ATH1-121501 microarray, the PM and MM pairs are column neighbours, we avoid possible complications with related cells, by focusing on the centre cell and the four corner cells of each 3 × 3 sub-array (so that the selected cells refer to different genes). For each sub-array we determine two quantities: Thus C is a value in (1, R), while N is a value in (0,4). The example shown in Figure 1a indicates that Replicate 3 had the largest value in three cells (including the centre cell). Figure 1b, centred on the same location, records the fact that two corner cells matched the centre cell.
C The identifier of the microarray that provides the largest of the R values of the centre cell.
N The number of corner cells whose maximum value emanates from microarray C.
Since the separate microarrays have been scaled to have the same mean values as one another, if there were no spatially correlated errors, the probability of C taking any particular value would be 1/R for each of the possible values, and the value taken would be independent of the values taken at other locations. As a consequence, Pn, the probability of N being equal to n, would be binomial:
|Probability or proportion||Number of neighbours|
|Theory: Equation (5)||0.198||0.395||0.296||0.099||0.012|
|Probability or proportion||Number of neighbours|
|Theory: Equation (5)||0.198||0.395||0.296||0.099||0.012|
Also given are the observed proportions using indicator arrays of maxima and of minima. The values given combine information from the two sets of three ATH1-121501 microarrays.
Table 2 shows clear evidence of spatial correlation. The proportion of cases for which the component 3 × 3 sub-arrays have all four corners having the same C value as the central cell is about five times that expected in the absence of spatial autocorrelation.
In this study there are only three replicates in each group, so that the figures for the maxima and minima indices are closely related. If there were more replicates then the two sets of figures would provide a clearer distinction between the effects of regions of high and low values.
5 VISUALIZING IMPERFECTIONS
Many examples of the imperfections seen in arrays appear in the literature (Li and Wong, 2001; Schadt et al., 2000, 2001; Workman et al., 2002; Gautier et al., 2004). The dChip and RMAExpress packages provide visualizations of the residuals from fitted models which thereby highlight spatial imperfections. In this section we present simpler, replicate-based methods that require no model assumptions.
5.1 Maxima and minima
Consider the (C, N) indexes introduced in the previous section. Our aim is to identify compact regions where individual arrays present either unusually high values, or unusually low values. The appropriate procedure will depend on R, since we need to prescribe circumstances that have a low probability of occurrence under the hypothesis of spatial independence (in order not to be overwhelmed with cases) while ensuring that the probability is not so small that we find no cases at all. As an example of the approach, we describe a procedure that works well in the case of three replicates.
A sub-array will be described as suggestive if it has a value for N that is 3 or 4. For homogeneous replicates the probability of a sub-array being suggestive is therefore one-ninth (Table 2). Each interior 3 × 3 sub-array has eight neighbouring 3 × 3 sub-arrays, as illustrated in Figure 1b. The probability that, by chance, a neighbour will be suggestive with the same value for C is 1/27, since there are three possible values for C. Under the hypothesis of spatially independent errors, the proportion of suggestive sub-arrays having two or more suggestive neighbours with the same central value is of the order of 10−3. However, as Table 2 suggested, the observed proportion is much greater.
Figure 2 shows the results of the visualization procedure for the two sets of arrays. The diagrams only show suggestive sub-arrays that have at least one suggestive neighbour with the same bias. There are several interesting features:
Blobs. Replicate 1 of the maxima shows a very extensive triangular region in the top left of the microarray. Replicate 3 shows a smaller blob and small blobs are also visible towards the bottom of Replicate 6.
Lines. Several left–right lines are visible. The most marked are of high values in Replicate 3, with less marked lines in Replicate 2. The similar lines that appear faintly in the minima diagram for Replicate 1 are a consequence of the maxima in the other replicates.
Rectangular enhancement. Replicate 5 shows an obviously enhanced region.
Rings. The minima diagrams reveal sections of circles reminiscent of the rings left on tables by wet-bottomed coffee mugs. There are partial rings at the bottom right of Replicates 5 and 6 with a similar ring at the left-hand side of Replicate 3.
Evidently there are compact regions of relatively high and relatively low values. Affymetrix's initial decision to place pairs of probes side-by-side at widely separated locations in the microarray was designed to compensate for this variability; an unusually low PM value would be compensated for by an equivalently low neighbouring MM value. In contrast, the precision of estimates based on unadjusted PM values would be compromised.
5.2 Local correlation between replicates
If all the cells in a local region of one replicate are slightly increased (or decreased) in value (either additively or multiplicatively), then the variation in their relative magnitudes would continue to match that in the equivalent region of other replicates.
To test whether this is the case, we calculate the correlations in matching 5 × 5 sub-arrays. The microarray size remains small to obtain a sharp spatial focus, but is increased from the 3 × 3 arrays used previously to ensure that the correlation value obtained is reasonably reliable.
Consider the 5 × 5 sub-array centred on cell (i, j) of replicate r of a set. We calculate the correlation between these 25 values and the corresponding sets in the remaining (R − 1) replicates. Let ρr(i, j) be the maximum of these (R − 1) correlations. With a reliable production process, ρr(i, j) will be close to 1. However, if there is a flaw affecting this region of replicate r, then ρr(i, j) will be low. For the two sets of three replicates studied previously, the results are illustrated in Figure 3, in which sub-arrays with maximum correlations <0.9 are identified by a dot (at the centre of the sub-microarray). Over the six replicates an average of 2.5% of sub-arrays are so identified.
There are three interesting features in Figure 3:
The blobs in Replicates 1 and 3. These are again visible. They are actually regions where the PM values are as much as 40 times greater than those in the other replicates, with the PM values and MM values being comparable in size. In the central (saturated) regions of these blobs the correlation with the values in the other replicates is often negative. The implication is that no adjustments will be able to recover these values.
Lines, rectangular areas and rings. None of these are apparent. This suggests that these effects can be effectively removed by simple scaling as described later.
Unreliable arrays. A feature apparent in Figure 3, but not in Figure 2, is that one replicate is a relatively poor match to the other two replicates in both sets. The average values of ρr(i, j) for the three arrays of Set 1 are 0.988, 0.990 and 0.970, with the corresponding values for Set 2 being 0.961, 0.986 and 0.987. Although the reduction in correlation is only slight, Figure 3 shows that it is widespread.
6 APPLICATION TO OTHER DATASETS
In order to illustrate the general nature of these spatial effects we have also examined some publicly available Affymetrix datasets.
Affymetrix have made available the spiked arrays that formed part of their Latin Square experiment utilizing the GeneChip® Human Genome U133 arrays (). These consist of three technical replicates of 14 separate hybridizations of 42 spiked transcripts. These arrays are the same size (712 × 712) as the ATH1-121501 arrays. Figure 4 shows the locations of clusters of maxima and minima for the three replicates of the fourth of the fourteen hybridizations.
Many of the Latin Square triples show spatial features, but this fourth set appears to be of particular interest since each replicate appears to show a similar flaw. In each case it appears that the material in the centre of the left-hand edge has been displaced, resulting in a local minimum adjacent to a local maximum. Other flaws are also visible.
As a final example we have examined the new 1164 × 1164 Wheat Genome microarray. The data examined are available at . It is taken from the BarleyBase Accession No. TA2 provided in February 2005 by Timothy Close at the University of California Riverside. There are again three replicates and the results are shown in Figure 5, with maxima on the top row and minima on the bottom row. The third replicate is seen to have a near circular maximum and an apparently unrelated minimum.
7.1 Existing procedures
The simplest scaling is one in which every value in a microarray is multiplied by a constant chosen so that each microarray has the same total. Schadt et al. (2001) note that if the extent of cross-hybridization varies between arrays, simple scaling will be inadequate. Naef et al. (2002) suggest that the presence of a blob of saturated values would distort this scaling and suggest that these should be discarded before employing a scaling based on comparing corresponding values in replicates. Hubbell et al. (2002) utilize the differences between a replicate pair to identify local variations.
Affymetrix include in their arrays a set of ‘housekeeping genes’ for quality control purposes. Although their use has been advocated by some authors as providing a basis for scaling (Hill et al., 2001; Ermolaeva et al., 2001; Lemon et al., 2002), others (Schadt et al., 2000, 2001) find that the variation in the expression levels of these genes does not accurately reflect the variation in the genes of interest. Instead therefore, Schadt et al. (2001) advocate a method based on matching ordered sets of values of Dj. However, their method is demanding, since it relies on identifying a set of Dj values which occur in the same order of magnitude in each replicate. None of these methods takes account of spatial features.
7.2 Local scaling
There is no possible treatment for the cells identified in Figure 3 as being poorly correlated with their counterparts in other replicates. These cells are therefore not used in what follows.
The original analysis proposed by Affymetrix and implicit in Equation (3) was based on the subtraction of unwanted effects so as to focus on the effects of interest. This approach proved ineffective because of the frequent inflation of the MM values by cross-hybridization, leading to many negative values for Dij. We now propose a slight variant on the original procedure, which cannot lead to negative values, but will certainly remove the majority of any local non-biological effects.
Let ykl be the observed value in the cell in row k and column l, and let be the smallest usable value in the immediate region surrounding cell (k, l). By ‘immediate region’ we mean the (2m + 1) × (2m + 1) array centred on cell (k, l), suitably reduced for cells near the edge of the array. We propose that the corrected value, be given by
7.2.1 The choice of the values for m and φ
We cannot know the ‘correct’ value for any location on an array, so any judgement of the effectiveness of the procedure must be indirect. A natural indicator is provided by examining the effect on the observed distribution of N. Table 3 presents the results obtained with changes in m and φ. The results shown are averaged over four sets of outcomes (maxima and minima for two sets of replicates). The χ2 statistic referred to in the table is the Pearson goodness-of-fit statistic, comparing the observed numbers of neighbours (0–4) with the numbers to be expected under the homogeneity model.
|Number of neighbours||χ2|
|Number of neighbours||χ2|
Before computing the value of N, each set of adjusted values has been multiplicatively scaled so as to have the same overall total. The best combination (in terms of minimizing χ2) is m = 1 and φ = 1 (the results for this case are actually those obtained with φ = 0.9999 rather than φ = 1, in order to avoid problems with tied values resulting in no identifiable value for C). As the other results indicate, small changes in φ lead to little difference in the outcomes. The critical choice is the value of m.
The choice m = 1 and φ = 1 can be simply expressed as follows: The ATH1-121501 array consists of alternating rows of PM cells and MM cells. Ignoring the edges, each PM cell has six MM neighbours, whereas each MM cell has two MM neighbours. This is reflected in the proportions of cases where the values are reduced to zero (4% for PM cells and 16% for MM cells) indicating that the central square is the smallest in the immediate area. The lowest values are found in PM cells on ∼30% of occasions.
For each cell, subtract the smallest neighbouring value. If the result is negative, then replace with a zero value.
It is not the purpose of this paper to consider the subsequent analysis, but it is worth observing that any analysis based on the PM–MM differences will scarcely be affected, since in most cases the PM and MM values will have been reduced by the same amount. The correction proposed here is similar in spirit to that originally proposed by Affymetrix.
8.1 The value of these techniques
Our techniques are intended as a precursor to the biological analysis, and not as a substitute. However, this simple preprocessing of the data will have two beneficial effects on the subsequent analysis:
it will reduce the noise in the system, by largely eliminating spatial imperfections,
it will identify outlier values that cannot provide biological information.
The latter will usually be detectable by other means as having unreasonably small or large values—but not all the cells in the low correlation regions will be unusual in size.
Studying the correlations between replicates can provide insights not otherwise obvious. Visualization of the spatial flaws gives information concerning the likely validity of individual arrays.
8.2 Handling single arrays
It is emphatically not the case that these techniques can be used to classify incoming new arrays. Such arrays will refer to data collected under other circumstances, and we have in any case observed that genetic variation can be considerable and misleading—recall that Replicate 3 in Set 1 bore a greater resemblance (in terms of average correlation) to Replicate 4 in Set 2 than to either of the replicates of its own strain.
However, the visualization technique presented here could be used to provide an imperfect visualization of the flaws in a single array by comparison with a previously obtained apparently flawless array. Any genetic variations would not be confined to local regions, so one could expect random ‘spots’ in the picture—but serious spatial flaws would still be easily identified.
The local scaling method does not require replicates and could be used on any single array without reference to other arrays.
We thank the reviewers and subeditor for very helpful comments on previous versions of this paper.
Conflict of Interest: none declared.