## Abstract

A two-channel microarray measures the relative expression levels of thousands of genes from a pair of biological samples. In order to reliably compare gene expression levels between and within arrays, it is necessary to remove systematic errors that distort the biological signal of interest. The standard for accomplishing this is smoothing “MA-plots” to remove intensity-dependent dye bias and array-specific effects. However, MA methods require strong assumptions, which limit their general applicability. We review these assumptions and derive several practical scenarios in which they fail. The “dye-swap” normalization method has been much less frequently used because it requires two arrays per pair of samples. We show that a dye-swap is accurate under general assumptions, even under intensity-dependent dye bias, and that a dye-swap removes dye bias from a single pair of samples in general. Based on a flexible model of the relationship between mRNA amount and single-channel fluorescence intensity, we demonstrate the general applicability of a dye-swap approach. We then propose a common array dye-swap (CADS) method for the normalization of two-channel microarrays. We show that CADS removes both dye bias and array-specific effects, and preserves the true differential expression signal for every gene under the assumptions of the model.

## INTRODUCTION

A two-channel microarray simultaneously measures the expression levels of thousands of genes from a pair of biological samples (Schena *and others*, 1995; Hughes *and others*, 2001). This system has existed in the form of spotted cDNA arrays since the beginning of high-throughput gene expression technology (Schena *and others*, 1995). With the advent of oligo-based two-channel microarrays using new ink-jet printing technologies (e.g. Agilent Laboratories, Palo Alto, CA), the two-channel platform continues to be of much interest.

In a two-channel microarray, each of two matched samples is labeled with a distinct dye, Cy3 (green) or Cy5 (red), and both are competitively hybridized to the same array. The relative dye intensity for each spot is used to quantify expression for its corresponding gene. Due to various sources of systemic effects, the relative expression levels need to be adjusted to accurately reflect the underlying amount of mRNA from each sample. For example, the two dyes incorporate into samples at different rates, creating “dye bias” (Tseng *and others*, 2001; Yang *and others*, 2001, 2002). In addition to dye bias, there may be systematic differences between arrays. It is therefore necessary to remove dye bias and any other systematic effects from two-channel microarray measurements.

To address this problem, we investigate the dye-swap design, showing that a dye-swap removes dye bias in general from a single pair of samples, an observation related to others previously made (Kerr *and others*, 2000; Kerr and Churchill, 2001; Fang *and others*, 2003). We show that while the original justification of the dye-swap (Kerr *and others*, 2000) assumed a constant dye effect, the dye-swap is accurate more generally, even in the case of intensity-dependent dye bias. We develop a model that treats dye and array effects as flexible, nonlinear functions of the amount of mRNA present in each sample. Within this framework, we show that a simple dye-swap average removes dye bias without affecting biological signal and preserves the ordering of true expression means. We then propose the “common array dye-swap" (CADS) method, a two-stage multi-array procedure for removing dye bias and array-specific effects. The first step is the usual dye-swap average. The second step centers each array around a “common array,” i.e. an approximation of the relationship between mRNA amount and fluorescence intensity when all array effects have been removed.

The most important goal in preprocessing microarray data is to maintain the true type of differential expression within and between arrays. We show that CADS preserves differential expression relationships in general. Specifically, CADS preserves whether there is no differential expression, overexpression, or underexpression among the “true" population-level mRNA levels. Furthermore, CADS achieves this accuracy without requiring any of the strict assumptions of “MA-plot” (Tseng *and others*, 2001; Yang *and others*, 2002) or global (Chen *and others*, 1997) normalization methods. We highlight the general applicability of CADS through simple examples in which the assumptions behind MA and global normalization methods are violated. Importantly, in each of these examples, MA and global methods may cause every null, nondifferentially expressed gene to become spuriously differentially expressed. At the same time, the truly differentially expressed genes may have the degree and direction of their differential expression altered. Furthermore, without performing a dye-swap, we show it is not possible to verify that the assumptions behind MA and global methods hold.

## A NEW MODEL FOR NORMALIZATION

Existing normalization methods based on global adjustments or smoothing “MA-plots” require a number of assumptions that limit their general applicability, as we detail in Section 4. We propose a flexible model that allows for normalization under more general assumptions. The model we develop expresses fluorescence intensities in terms of functions of the true amount of mRNA in each biological sample.

### Dye-swap removes dye bias in general

The ultimate goal is to use the proposed model to adjust the observed fluorescence intensities so that they accurately reflect the underlying levels of mRNA. As a first step toward this end, it is necessary to determine the type of observations that are needed to unequivocally parse the different sources of signal. For each gene, there is the possibility for at least three signals to affect the relative expression values: biological signal, dye bias, and array-specific effects. With only a single observation of red divided by green for each gene, it is impossible to separate these. Figure 1 shows a picture of these three sources of signal as a function of underlying mRNA amount. The method we propose (called CADS) estimates each one of these signals and preserves only the biological signal, removing both dye bias and array effects. Dye bias is estimated from the information given by the dye-swap design, and the array effects are estimated by borrowing strength across arrays.

It can be shown that a dye-swap separates dye bias from biological signal in a single pair of samples (Supplementary Material). Consider the observed expression intensities for a single gene. Ignoring random measurement error and assuming that any bias is due exclusively to the dyes used, there are four unknown quantities involved in these two observations: the true target and control mRNA amounts and two functions that translate the mRNA amounts into measurements on the red and green dyes. In other words, we observe two unknown functions evaluated at two unknown arguments, making it impossible to make any reliable statement about the relative relationship between the two samples. On the other hand, by adding just two more observations, we can compare target to control as measured by a common function. Specifically, we also label target green and control red, performing a dye-swap. Averaging the two target observations and two control observations separately, we have target and control measurements labeled with an “average dye.” The rationale is that, in principal, the mRNA amount for each respective sample remains constant across both dye assignments. Assuming that each dye (with all sources of variation and systematic bias removed) provides a useful measure of expression for a given underlying mRNA amount, the “average dye” should work equally well. Array effects represent systematic differences between fluorescence intensities on different arrays. These might arise, for example, when one array is inherently brighter than another. Simple averaging of two dye-swap arrays can not remove array effects in general. However, as we show below, array effects can be removed in dye-swap experiments by borrowing strength across multiple dye-swap array pairs.

### Models of fluorescence intensity in terms of mRNA amount

As a second step, we formulate models of fluorescence intensity in terms of the true underlying mRNA amounts present in each sample. Let $yTRij$ denote an observed fluorescence intensity on the $log2$ scale for gene *i* on target sample *j*, using the red dye. Similarly, $yCGij$ corresponds to fluorescence intensity obtained from the control sample with green dye (on the same array as $yTRij$), and $yTGij$, $yCRij$ are the dye-swap analogs. Let $xTij$ and $xCij$ be the true mRNA amounts for gene *i* in target and control samples *j* (say, in units that count the number of mRNA molecules). Similarly, let $\mu Ti$ and $\mu Ci$ be the population average mRNA amounts for gene *i* in the comparison groups. Note that we use “target” and “control” only as arbitrary names to distinguish the two populations from which the biological samples come.

We assume that the observed intensities $yij$ are noisy versions of the sum of a true underlying dye-specific fluorescence function ($dG$ or $dR$) and an array-specific function ($a1$ or $a2$), with all functions applied to the $xTij$ and $xCij$:

Because we model the underlying mRNA amounts for each individual separately, the ϵ-variables do not include biological variability and hence are comparable across genes. The $xij$ are random variables in that they represent biological variability in gene expression from sample to sample. We assume that $E(d(xij))=d(\mu i)$ for both dye functions and $E(aj(xij))=aj(\mu i)$ for all array functions. This will be approximately true if $xij$ equals $\mu i$ plus random error and the functions *d* and $aj$ are locally linear. This allows us to connect the above models across different pairs of samples.

## PROPOSED METHODS

The fact that we do not know the dye functions, array functions, or the true underlying mRNA amounts makes it difficult to produce expression measures that do not include systematic biases. However, we now propose a method based on (2.1) that accomplishes exactly this in expectation.

### Simple dye-swap average

A simple way to remove the dye-specific effect would be to replace the individual dye functions with their average. A dye-swap does just that. The dye-swap average intensities can be written in terms of the models given in (2.1):

*d*is $[(dR+dG)/2]$, $aj$ is $[(a1j+a2j)/2]$, and the ϵ-variables are the sums of the corresponding error terms in (2.1) divided by two. The $y\u02dc$-variables are the data resulting from a simple dye-swap average. Assuming there are no array effects ($aj=0$), the average of the red and green channels from separate arrays can be seen as a single function

*d*applied to the separate mean expression levels. We refer to

*d*as the “average dye" function. Since $dR$ and $dG$ are increasing,

*d*is also increasing. Therefore, the true ordering of the $xij$ will be preserved, in expectation, by simple averaging of dye-swap arrays. More generally, when array effects are present, they are not removed by simple dye-swap averaging.

Note that the dye-swap method has previously only been justified by assuming that any dye effect is constant (i.e. is not intensity dependent). However, it can be seen from the above calculation that the dye-swap does not make this assumption. In fact, it seems to offer the most general assumption about dye bias. By its very design, the underlying mRNA amounts are kept constant in each dye configuration, thereby automatically incorporating any intensity-specific dye bias.

### Common array dye-swap

Array effects represent the variability from array to array that is not due to biological sources of variation. It is important to remove random array effects because they increase the variability and uncertainty of the measurements, and they can create dependence between genes that have adverse effects on significance calculations (Qiu *and others*, 2005).

We propose a modification of the simple dye-swap average that removes array effects. The method is described here for a direct comparison between “target” and “control”; more general experimental designs are considered in a subsequent manuscript (Dabney and Storey, 2006). Our approach can be motivated under the assumption that each array function comes from a random distribution of curves that have expected value equal to the zero line. We show, however, that the method is equally effective under more general assumptions, which basically require that the array functions do not confound the relevant signal from the dye functions. Suppose that we form *n* dye-swap array pairs and subtract the dye-swap averages from each other to form $M\u02dcij=y\u02dcTij\u2212y\u02dcCij$, $i=1,2,\u2026,m$ and $j=1,2,\u2026,n$. Averaging the $M\u02dcij$ and taking expectations with respect to arrays gives

This CADS method can be summarized by the following algorithm:

Perform simple averaging of dye-swap arrays to form $M\u02dcij=y\u02dcTij\u2212y\u02dcCij$, $i=1,2,\u2026,m$, $j=1,2,\u2026,n$. These quantities are random observations of target minus control, having labeled both with the same dye on

*n*different pairs of arrays.Average over the $M\u02dcij$ to form $\Delta \u02dci=1n\u2211j=1nM\u02dcij$. This is an estimate of the expected difference between target and control under the average dye function. From our assumptions, it follows that $E[\Delta \u02dci]=d(\mu Ti)\u2212d(\mu Ci)$, $i=1,2,\u2026,m$.

Fit a smoother

*f*to the scatterplot with the $\Delta \u02dci$ on the*x*-axis and $M\u02dcij\u2212\Delta \u02dci$ on the*y*-axis to form $a\u02dcij=f(\Delta \u02dci)$, $i=1,2,\u2026,m$, $j=1,2,\u2026,n$. Since we interpret any systematic difference between an array profile and the average dye function as an array effect, these are estimates of the array functions $aj$, $j=1,2,\u2026,n$.Form $M^ij$ by subtracting off the estimated array functions: $M^ij=M\u02dcij\u2212a\u02dcij$, $i=1,2,\u2026,m$, $j=1,2,\u2026,n$. These quantities are random observations of target minus control, having labeled both with the same dye, and with all

*n*comparisons being made on an estimated common array. Note that, after Step 2, there are*n*dye-swap profiles that have had dye bias removed but not array effects. In Step 3, each of these profiles is compared individually to the “common array” to remove the remaining array effects. In terms of Figure 1, CADS compares each averaged dye-swap profile to an estimate of the average dye function*d*. A systematic difference between the two is regarded as an array effect*a*and is subtracted off. Figure 6 in the Supplementary Material illustrates the estimation of array effects in Step 3 of the CADS algorithm and corresponds to one of the simulations described in Section 5. Note that the array effect is especially straightforward to estimate here because of our assumption about a direct comparison between target and control. For more complicated experimental designs (e.g. multiple treatments, batch effects), the common array effect must be estimated taking all relevant variables into account (Dabney and Storey, 2006).

The CADS model (2.1) assumes an additive relationship between bias terms on the $log2$ scale. This assumption is analogous to accepted analysis of variance (ANOVA) -based models (Kerr *and others*, 2000; Kerr and Churchill, 2001) for two-channel microarrays, as well as to models for the single-channel platform (Irizarry *and others*, 2003). We also assume the same array function applies to both the red and green channels. The calibration experiments of Tseng *and others* (2001) allowed for a comparison of the array effects on both channels and give support to this assumption (data not shown). There are connections between the multi-array nature of CADS and normalization methods for single-channel microarrays. For example, Li and Wong (2001) center each array around the “median array,” the array that corresponds to the median overall intensity.

### CADS preserves differential expression relationships

CADS produces normalized data that preserve differential expression relationships in expectation. This means that null genes are null, overexpressed genes are overexpressed, and underexpressed genes are underexpressed. Specifically, under the CADS model, we show that the CADS estimator is unbiased for the parameters of interest $d(\mu Ti)\u2212d(\mu Ci)$, $i=1,2,\u2026,m$. Since differential expression methods are almost always based on sample averages (Cui and Churchill, 2003), it is sufficient to show that the sample average of the CADS-normalized arrays is unbiased. It can be shown that

*d*already distorts the true mRNA count.

## SMOOTHING MA PLOTS

We now compare and contrast our proposed model and normalization method with an existing standard. The vast majority of normalization methods used today build on two highly influential papers (Tseng *and others*, 2001; Yang *and others*, 2002), where “MA-plots” were used to demonstrate that dye bias depends on intensity in an array-specific manner. An MA-plot is a scatterplot with log intensities *A* (log of red times green, divided by two) on the *x*-axis and log ratios *M* (log of red divided by green) on the *y*-axis. The left plot in Figure 2 is a “self–self” MA-plot from calibration experiments (Tseng *and others*, 2001), where the pair of samples hybridized to each array are biologically equivalent. Since the intensities of the two dyes should be equal, the trends away from the zero line are evidence of dye bias. We refer to any normalization method that is based on removing trends from MA-plots as “MA methods.”

MA methods attempt to remove bias by smoothing away MA trends. A smooth curve is fit through the MA-plot, and its predicted values are subtracted off, recentering the plot along the zero line. The smoother curve can be estimated using all the data or some subset of control genes. While Figure 2 gives empirical support to this idea, no theoretical justification has been given for its general use. Note that in “self–self” experiments, there is no biological signal of interest present because each gene is equivalently expressed. True biological differences in expression between the two samples can also produce trends in the MA-plots. Therefore, regressing away trends in MA-plots may change the biological signal of interest. Our analysis below indicates that this procedure can create artificial differential expression, destroy true differential expression, or both.

### The assumptions behind MA methods

MA methods were justified by Yang *and others* (2002) when “there are good reasons to expect that (i) only a relatively small proportion of the genes will vary significantly in expression between the two cohybridized mRNA samples or (ii) there is symmetry in the expression levels of the up/downregulated genes.” We show that the fundamental assumption necessary to justify MA methods is that all systematic MA trends are due exclusively to bias (Supplementary Material). Necessary conditions to satisfy this assumption are (Supplementary Material)While the symmetry assumption was discussed in the original MA papers, Assumptions 2 and 3 were not. As the original authors correctly stated, MA methods are only guaranteed to work properly when their assumptions are met. Applying MA methods when their assumptions are violated can create or destroy differential expression signal. The key observation here is that assumptions must be made about the true “biological signal" before applying MA methods. The motivating data sets for the MA-plot methods were specialized in that there was little to no differential expression. In such a case, the assumptions for MA methods may be met. However, “the expression profiles in biological samples are (frequently) more divergent in nature than in the examples investigated (here)” (Yang *and others*, 2002).

Differential expression between the two samples is symmetric about zero.

There is no relationship between differential expression and expression abundance.

The variation of expression across genes is the same in each sample.

### Asymmetric differential expression

Suppose that two groups are compared where there is more differential expression in one direction than in the other. For example, targets may be overexpressed more often than they are underexpressed relative to the controls. In this case, target measurements will tend to be greater than their control counterparts, and an MA-plot will be centered above the zero line, even in the absence of dye bias. Smoothing the MA curve pushes the cloud of data down to the zero line. Overexpressed genes will appear less extreme, underexpressed genes will appear more extreme, and all equivalently expressed genes will now be artificially differentially expressed. Smoothing the MA-plot reduces signal in some genes and creates bias in others. Note that asymmetry is common (Tusher *and others*, 2001; Hedenfalk *and others*, 2001); in fact, only in specialized circumstances would it be guaranteed that differential expression between two biological groups of interest be perfectly symmetric.

### Intensity-dependent differential expression

Suppose that differential expression is related to expression abundance; for example, at low-abundance genes differential expression may tend to be in the direction of controls, and at high-abundance genes in the direction of targets. Differential expression is often related to the level of activity and importance of a gene in a particular cell type. For example, comparing cancer tumor to healthy cells, the genes that are related to the increased replication rate may be most abundant and most different in expression. If nothing else, differential expression cannot occur among genes that are not active in a cell, which forms a relationship between differential expression and abundance. This scenario produces an MA trend increasing from low- to high-abundance genes, even in the absence of dye bias or other systematic effects. Smoothing the MA-plot will again destroy signal in some genes and create bias in others. In particular, it will again create an apparent signal in all the null, equivalently expressed genes wherever the MA smoother does not pass through the zero line.

### Unequal variation in expression means

Unequal variances between comparison groups occur, for example, when comparing immune challenged to normal cells (Storey *and others*, 2005). Resources in the cell are reallocated so that a large proportion of genes have decreased expression, while immune response genes have increased expression, creating a difference in the variation of expression across genes between the two immune challenged and normal cells. As another example, due to the amount of genetic mutation and cell cycle activities associated with the cancer process, cancer expression measurements would easily have much more variation than control measurements. This scenario also produces an MA trend increasing from low- to high-abundance genes, even in the absence of bias. Smoothing the MA-plot again destroys biological signal.

### Real examples where assumptions are violated

Figure 3 shows a real microarray experiment in which all three MA assumptions are violated; a reference design was used in this example, so we know any MA trends are not due to dye bias. Arrays were obtained from sporadic, BRCA1 mutation-positive, and BRCA2 mutation-positive breast cancer tumors (Hedenfalk *and others*, 2001). The left plot shows the log$2$ fold-change among 3220 genes between BRCA1 and BRCA2 arrays, where there is clear asymmetric differential expression, violating Assumption 1. The center plot shows there is a relationship between overall expression abundance and differential expression, violating Assumption 2. The right plot shows boxplots of the expression data among the sporadic, BRCA1, and BRCA2 groups. It can be seen that the variation of expression across genes increases from sporadic to BRCA1 to BRCA2, violating Assumption 3. The increase in standard deviation across genes is about 11% moving from one group to the next.

### Other normalization methods

In addition to MA methods, alternative normalization methods are surveyed in Quackenbush (2002). Global normalization methods subtract some constant from all log expression ratios on an array (Kerr *and others*, 2000). These are rarely used since systematic effects tend to be more complicated than constant shifts at the array level. Also, since a global procedure is equivalent to smoothing an MA plot with a flat curve, global methods suffer from the same limitations of MA methods discussed above. Others have proposed procedures based on observed deviations in “housekeeping” genes (genes whose expression levels are known to remain constant) (Chen *and others*, 1997). Still others attempt to either create exogenous housekeeping genes (Yang *and others*, 2002; Benes and Muckenthaler, 2003) or identify and/or synthesize genes on an existing array that behave like housekeeping genes (Zien *and others*, 2001; Tseng *and others*, 2001). The former approach is dependent on one's ability to identify a set of control genes for each individual experiment. The latter approach necessarily involves a subjective choice of the rule for calling a gene unchanged.

### The difference between CADS and MA methods

MA methods assume that every MA-plot should be centered on the zero line. The idea behind CADS is to estimate the true target-to-control difference by averaging out array effects and centering each array around the result. The most important difference between CADS and MA methods is that CADS removes bias due to dye and array effects under general assumptions, preserving differential expression relationships in expectation, while MA methods do not. There are also distinct operational differences beween CADS and MA methods. In a dye-swap experiment, MA methods are typically applied by first smoothing the MA-plot for each array, then averaging the dye-swaps. The first step of CADS is to average the dye-swaps, removing dye bias. The second step uses all arrays to form the “common array,” making CADS a multi-array method. Because CADS is a multi-array method, persistent trends that are not due to bias can be identified and retained.

## SIMULATED EXAMPLES

We illustrate CADS on three simulated examples. Simulation 1 has asymmetric differential expression. Simulation 2 has intensity-dependent differential expression, where differential expression increases with increasing abundance. Simulation 3 has unequal variances in expression measurements between comparison groups. In each of 30 simulations, we generated expression measurements for 3000 genes in five target and five control samples. For each target/control sample pair, a dye-swap was carried out. Of the 3000 genes, 35% are differentially expressed. Nonlinear dye functions were incorporated, as were random array functions generated from a distribution with mean value equal to the zero line. All summaries reported below were averaged over the 30 simulations. For full details, see the (Supplementary Material). We note that in subsequent work (Dabney and Storey, 2006), we develop CADS further and use a real data set on prostate development, provided by the Pete Nelson laboratory at the Fred Hutchinson Cancer Research Center.

To compare MA-smoothing and CADS in the simulations, we formed *t*-statistics for each gene that test for equality of expression between target and control. In order to compare power, plots of the number of genes called significant versus the number of false discoveries are shown in Figure 4. Due to the MA assumptions being violated in these examples, MA methods do not perform well, with fewer genes called significant for any given number of false positives. CADS increases power over simple dye-swap averaging. This is expected, since array effects essentially represent extra variation. By removing them, we decrease the variation of each gene's observations. We also compared the number of expected false positives under the correct null distribution versus the number of observed false positives. The MA method produces more observed false positives than expected, implying that significance would be artificially inflated when performing significance tests on MA-normalized data when, say, estimating a *p*-value or false discovery rate (Storey and Tibshirani, 2003).

## DISCUSSION

When the assumptions behind MA methods are not expected to be valid, some sort of control must be used. The “microarray sample pool" (MSP) is designed for this purpose (Yang *and others*, 2002), although (based on a PubMed search) the MSP does not appear to have been used much in practice. The “rank-invariant" MA method (Tseng *and others*, 2001) is similarly motivated, attempting to use only genes whose expression is equal between comparison groups to form the MA smoother curve. Genes are included in the smoother fit if their ranks in the targets are approximately equal to their ranks in the controls. This assumes that equality in rank corresponds to equality in expression. In Simulations 1 and 2 of Section 5, however, the distribution of target mRNA is shifted relative to the distribution of control mRNA. As a result, the rank-invariant method will include differentially expressed genes in the smoother.

CADS can easily be applied when more than two groups are compared. A loop design with more than two nodes is required, but the basic approach behind the method remains unchanged (Supplementary Material). Also, CADS can be more generally formulated as an ANOVA model in which factor terms have been replaced with functions of mRNA amount. Framed as an ANOVA model, it is simple to incorporate additional sources of bias or covariates. Furthermore, the CADS “functional” ANOVA model supports a more efficient dye-swap design in which only a single array is used for each sample pair. These extensions are developed in subsequent work (Dabney and Storey, 2006). These results together highlight important limitations in existing normalization methods and provide alternatives that are simple, flexible, and accurate under general assumptions.

This research was supported in part by the Cancer-Epidemiology and Biostatistics Training Grant 5T32CA009168-29 and National Institutes of Health grant R01 HG002913. *Conflict of Interest:* None declared.