## Abstract

Motivation: Human clinical projects typically require a priori statistical power analyses. Towards this end, we sought to build a flexible and interactive power analysis tool for microarray studies integrated into our public domain HCE 3.5 software package. We then sought to determine if probe set algorithms or organism type strongly influenced power analysis results.

Results: The HCE 3.5 power analysis tool was designed to import any pre-existing Affymetrix microarray project, and interactively test the effects of user-defined definitions of α (significance), β (1 − power), sample size and effect size. The tool generates a filter for all probe sets or more focused ontology-based subsets, with or without noise filters that can be used to limit analyses of a future project to appropriately powered probe sets. We studied projects from three organisms (Arabidopsis, rat, human), and three probe set algorithms (MAS5.0, RMA, dChip PM/MM). We found large differences in power results based on probe set algorithm selection and noise filters. RMA provided high sensitivity for low numbers of arrays, but this came at a cost of high false positive results (24% false positive in the human project studied). Our data suggest that a priori power calculations are important for both experimental design in hypothesis testing and hypothesis generation, as well as for the selection of optimized data analysis parameters.

Availability: The Hierarchical Clustering Explorer 3.5 with the interactive power analysis functions is available at or .

Contact: jseo@cnmcresearch.org

## 1 INTRODUCTION

Microarrays of oligonucleotide DNA probes are an increasingly popular tool for biological research. The most commonly used microarrays (Affymetrix) query the mRNA expression of genes in an organism, providing relative quantitation of each mRNA expressed by a cell or tissue sample. Current microarrays carry up to 1 million oligonucleotide probes, with most designed with 11 perfect match 25mers and 11 mutated mismatch 25mers against each mRNA transcript.

An important aspect of experimental design is the choice of number of samples per group (replicates). To date, the large majority of published microarray projects have chosen replicates on empirical grounds (cost, feasibility). However, projects involving human subjects typically require a priori power analyses, and power analyses are increasingly required for laboratory animal studies as well. The rationale for requiring power analyses is ethical. An underpowered study exposes subjects to risks, yet yields data that are inadequate to support any conclusions. An overpowered study uses more human or animal subjects than is necessary to answer the experimental question, thus inappropriately posing risks to excess subjects. An equally important consideration is that valuable research money should not be wasted by using more subjects than necessary.

The statistical power of a study is the probability of detecting a biologically meaningful difference (effect size) if there is indeed such a difference in the original population. There are four variables in power calculations: statistical significance (α), type II error rate (β), sample size (N) (number of replicates per group) and effect size (ES). The statistical power is defined as 1 − β.

To conduct an a priori power analysis, one uses pre-existing data from a project resembling the proposed project (similar subjects and variables), then uses the power calculation formula by setting three of these variables to a constant, while adjusting the fourth. This allows the investigator to balance the number of subjects per group with the anticipated outcome.

For microarray experiments, the calculation of power is considerably more complex. There are tens of thousands of probe sets, each providing a result for a specific gene mRNA transcript. The signal from each probe set can vary considerably with regard to intensity, variance and signal/noise ratios, and each of these can change when different probe set algorithms are used (Seo et al., 2004). Appropriately powering all probe sets in a microarray is expected to be impractical, as the number of subjects and the cost of microarrays become prohibitive.

There have been many proposed methods of sample size determination for microarray projects (Tsai et al., 2005). Most of them focus on how to deal with multiple testing problems and the calculation of significance level either by the family-wise error (FWE) rate or by the false discovery rate (FDR). Wang and Chen (2004) proposed a method to calculate the number of arrays required to detect at least a fraction of the truly altered genes under the model of an equally standardized ES for all altered genes with a FWE-controlled approach. Yang and Speed (2003) suggested that the well-established and generally applicable FWE-based methods such as Bonferroni's and Holm's do not seem to be helpful in the microarray context. Pawitan et al. (2005) explained how the FDR, instead of FWE-based methods, can be used to statistically assess microarray data via sample size control at the design stage. Among various R packages for FDR estimation, SAM (Tusher et al., 2001) provides users with estimates of FDR, FNR, type I error and power for different sample sizes. To use FDR-based methods, researchers must estimate the FDR by assuming the number of true null hypotheses, since the exact proportion of false positives amongst the rejected hypotheses is unknown. The number of true null hypotheses can be estimated by various methods based on mixture models.

Microarray research has become more mature in terms of data generation and volume of publicly accessible compiled datasets corresponding to a large number of projects. We reasoned that pre-existing data could be drawn upon to appropriately power a proposed microarray experiment. We felt that an interactive power calculation tool would permit a practical approach to microarray project design. The tool should be able to utilize different probe set algorithms, noise filters and gene ontologies, and then output a filter so that only appropriately powered probe sets are subsequently studied. Importantly, we show that different methods of data normalization have a strong effect on power calculations, with one popular method likely leading to high false positive rates.

## 2 METHODS AND SYSTEMS

### 2.1 Statistical method and implementation

The following two formulas were implemented in the current version of our power analysis tool. The sample size formula for one-sample t-test is

$N=\left\{\frac{\sigma\,(z_{1-\alpha}+z_{1-\beta})}{ES}\right\}^{2},$
where $z_{p}$ denotes the $p$-th quantile of the standard normal distribution, σ is the standard deviation of the given sample group, ES is the effect size and N is the number of replicates per sample (or group).

The sample size formula for two-sample t-test is

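$N=2\left\{\frac{s\,(z_{1-\alpha}+z_{1-\beta})}{ES}\right\}^{2},$
where s is a pooled standard deviation defined as
$s=\sqrt{\frac{(n_{1}-1)\sigma_{1}^{2}+(n_{2}-1)\sigma_{2}^{2}}{n_{1}+n_{2}-2}},$
where $n_{1}$ is the size of group 1, $n_{2}$ is the size of group 2, $\sigma_{1}^{2}$ is the variance of group 1 and $\sigma_{2}^{2}$ is the variance of group 2.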
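As an illustration only, the two sample size formulas can be written as a minimal Python sketch. The function names and default values (α = 0.05, β = 0.2) are assumptions made for this example and are not taken from the HCE source code, which was written in Visual C++.

```python
import math
from statistics import NormalDist


def z(p):
    """p-th quantile of the standard normal distribution."""
    return NormalDist().inv_cdf(p)


def n_one_sample(sigma, es, alpha=0.05, beta=0.2):
    """Replicates needed for the one-sample formula."""
    return (sigma * (z(1 - alpha) + z(1 - beta)) / es) ** 2


def pooled_sd(n1, var1, n2, var2):
    """Pooled standard deviation from two group sizes and variances."""
    return math.sqrt(((n1 - 1) * var1 + (n2 - 1) * var2) / (n1 + n2 - 2))


def n_two_sample(s, es, alpha=0.05, beta=0.2):
    """Replicates per group needed for the two-sample formula."""
    return 2 * (s * (z(1 - alpha) + z(1 - beta)) / es) ** 2


# Hypothetical probe set: two groups of 8 arrays with unit variance.
s = pooled_sd(8, 1.0, 8, 1.0)
n_per_group = math.ceil(n_two_sample(s, es=1.0))
```

The formulas return a real number; rounding up with `math.ceil` is one reasonable way to turn it into a whole number of arrays.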

The three parameters (ES, α, N) and the power of the study (1−β) constitute a closed-form equation, where fixing any three will determine the remaining. For example, if α, β and the ES are specified, the equation returns the sample size. In our approach, we estimate the standard deviations from pre-existing projects and ask users to enter three of the four parameters and then calculate the remaining parameter for each probe set.
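The "fix three, solve for the fourth" idea can be sketched by rearranging the sample size formula algebraically. The code below is an assumed rearrangement for illustration, not the tool's actual implementation; `solve_missing` and the factor `k` (1 for the one-sample formula, 2 for the two-sample formula) are hypothetical names introduced here.

```python
import math
from statistics import NormalDist

_STD = NormalDist()  # standard normal distribution


def solve_missing(sd, k, *, alpha=None, power=None, n=None, es=None):
    """Return whichever of alpha, power, n, es was left as None.

    Based on N = k * (sd * (z_{1-alpha} + z_{1-beta}) / ES)^2,
    with power = 1 - beta.
    """
    given = {"alpha": alpha, "power": power, "n": n, "es": es}
    if sum(v is None for v in given.values()) != 1:
        raise ValueError("exactly one parameter must be left unspecified")
    if n is None:
        z_sum = _STD.inv_cdf(1 - alpha) + _STD.inv_cdf(power)
        return k * (sd * z_sum / es) ** 2
    if es is None:
        z_sum = _STD.inv_cdf(1 - alpha) + _STD.inv_cdf(power)
        return sd * z_sum * math.sqrt(k / n)
    # Both remaining cases share the term ES * sqrt(N / k) / sd.
    shift = es * math.sqrt(n / k) / sd
    if power is None:
        return _STD.cdf(shift - _STD.inv_cdf(1 - alpha))
    return 1 - _STD.cdf(shift - _STD.inv_cdf(power))


# Hypothetical probe set with pooled SD 0.5, two-sample model.
n_needed = solve_missing(0.5, 2, alpha=0.05, power=0.8, es=1.0)
power_back = solve_missing(0.5, 2, alpha=0.05, n=n_needed, es=1.0)
```

Solving for the sample size and then feeding it back to solve for power should recover the original power, which is a convenient sanity check on the rearrangement.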

We built an automated power analysis tool for microarrays on the Hierarchical Clustering Explorer (HCE) backbone. HCE is a public domain, interactive visual analysis tool for multidimensional datasets (Seo and Shneiderman, 2005). Users can explore their dataset in an orderly manner by ranking low-dimensional projections and can effectively visualize the ranking results. Knowledgeable and motivated users in diverse fields provided multiple perspectives that refined our understanding of the strengths and weaknesses of HCE through case studies and a user survey (Seo and Shneiderman, 2006). HCE has been utilized for many microarray studies, including a muscle regeneration study (Zhao et al., 2003) and signal-noise ratio optimization in expression profiling (Seo et al., 2004). In this paper, we used Microsoft Visual C++ 6.0 to implement a power analysis tool and integrated the tool as a modal dialog box. Interactions between the power analysis tool and other HCE components are based on the document/view model in the Microsoft Foundation Class library.

The pseudo code for the interactive power calculation is shown in Figure 1. Users select one of the four parameters to become the dependent parameter. For example, the sample size is the dependent parameter in Figure 2. After the remaining three parameters are adjusted, the first ‘for’ loop is executed and the dependent parameter value for each probe set is saved in the data structure for that probe set. As users change the lower or upper bound of the dependent parameter using the double-sided slider, the next ‘for’ loop is executed. Probe sets whose values of the dependent parameter fall within the specified range are shown in the probe sets list control, and the proportion of those probe sets is visualized in the progress control.
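The two loops described above might look roughly like the following Python sketch, with the sample size as the dependent parameter and the two-sample formula. This is a hedged reconstruction from the prose, not a transcription of Figure 1; the probe set identifiers, standard deviations and function names are hypothetical.

```python
from statistics import NormalDist


def required_n(sd, es, alpha, beta, k=2):
    """Sample size from the two-sample formula (k=2) for one probe set."""
    z = NormalDist().inv_cdf
    return k * (sd * (z(1 - alpha) + z(1 - beta)) / es) ** 2


def compute_dependent(probe_sd, es, alpha, beta):
    """First loop: store the dependent parameter for every probe set."""
    return {probe: required_n(sd, es, alpha, beta)
            for probe, sd in probe_sd.items()}


def dynamic_query(values, lower, upper):
    """Second loop: keep probe sets inside the slider range.

    Returns the matching probe sets and their proportion of the whole
    list (the quantity shown in the progress control).
    """
    hits = [probe for probe, v in values.items() if lower <= v <= upper]
    return hits, len(hits) / len(values)


# Hypothetical per-probe-set pooled standard deviations.
probe_sd = {"probe_a": 0.2, "probe_b": 0.5, "probe_c": 1.2}
values = compute_dependent(probe_sd, es=1.0, alpha=0.05, beta=0.2)
hits, proportion = dynamic_query(values, lower=0, upper=4)
```

Separating the expensive per-probe-set calculation from the cheap range query is what allows the list to update while the slider is being dragged: only `dynamic_query` needs to rerun.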

Fig. 1

Pseudo code for interactive power calculation.

Fig. 2

Interface design for interactive power analysis.

### 2.2 User interface design

We designed a user interface for interactive power analysis (Fig. 2), where users can assign samples to groups, apply noise filtering, select a statistical model, select a dependent parameter, set values of the independent parameters, run dynamic queries over probe sets using the selected dependent parameter and finally export or highlight the dynamic query result. Before conducting probe set based power analysis, users can apply a noise filter (present call filter; see Section 3.1 for detail). Users can perform a power analysis with selected probe sets or with all probe sets.

Users can assign samples to one or two groups using a list control for original samples, a tree control for groups and buttons to adjust the group assignment. After assigning samples to appropriate groups, users can choose a statistical model to perform the power calculation. Users set one parameter as the dependent parameter using radio button controls (round-shaped buttons for exclusive single selection). Once users specify values for independent parameters using slider controls, they can click the ‘Calculate’ button to evaluate the dependent parameter. Then, a double-sided slider shows the range of the dependent parameter values. Users can generate and modify dynamic queries using this slider to obtain instantaneous updates on the probe sets list, which shows the resulting probe sets that are appropriately powered.

The proportion of the filtered probe sets with respect to the whole list is visualized using a progress bar control above the probe sets list control. The update of the list showing the resulting probe sets is instantaneous with most Affymetrix GeneChips, so that users do not feel that their tasks are interrupted and subsequent query refinements are naturally encouraged to lead users to satisfying results.

The resulting list can be exported as a text file so that it can be imported into another analysis tool such as GeneSpring. The resulting text file can also be imported into HCE again later to examine concordance of various results from different knowledge sources or analysis methods by using set operation between the current selection and an imported list of interest.

### 2.3 Implementation of data filters

We enabled data filters and visualization tools previously established in HCE for the power analysis tool. Users can select a subset of probe sets in visualization tools and perform the interactive power analysis only with that selected subset. In this way, users can narrow down the candidate gene list to get a more relevant and compact list, and determine if their proposed experiment is appropriately powered for this candidate list. For example, users interested in inflammatory gene products may select ‘inflammatory genes’ from HCE's gene ontology view and select the genes associated with that GO term. Users can also perform a series of dynamic queries in the profile search view to select a subset of probe sets when they know the approximate patterns of the desired ones, and then perform power analysis on the selected probe sets. Alternatively, users can conduct an interactive power analysis to select appropriately powered probe sets, and then highlight the selected ones in other visualization tools of HCE, e.g. dendrogram view, gene ontology view and profile search view.

Users can generate an even smaller but more focused list of probe sets by combining various analysis results of the same project. In HCE, users can import a custom list of items to perform a set operation (e.g. union, intersection and difference) with the currently selected list of items. For example, users can select properly powered probe sets using MAS 5.0 and the power analysis tool in HCE, and import a different list that was generated by the power analysis tool using dChip or RMA software. Users are provided with an interface component to select a set operation during import. Then, they can examine the concordance of two results by checking the intersection or difference of the two sets. In doing so, users can generate a compact list of probe sets that are not only appropriately powered but also concordant among probe set signal algorithms.
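The set operations applied at import time can be illustrated with a short Python sketch. The probe set identifiers and the `combine` helper are hypothetical and only show the idea; they do not reflect HCE's file format or internal API.

```python
def combine(current, imported, operation):
    """Apply a named set operation to two probe set lists."""
    operations = {
        "union": set.union,
        "intersection": set.intersection,
        "difference": set.difference,
    }
    return operations[operation](set(current), set(imported))


# Hypothetical lists of appropriately powered probe sets.
powered_mas5 = ["ps_1", "ps_2", "ps_3"]
powered_rma = ["ps_2", "ps_3", "ps_4", "ps_5"]

concordant = combine(powered_mas5, powered_rma, "intersection")
mas5_only = combine(powered_mas5, powered_rma, "difference")
```

Here `concordant` holds probe sets that are powered under both algorithms, and `mas5_only` holds those powered under MAS 5.0 but not under RMA.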

### 2.4 Datasets

Three microarray datasets were used to test the effect of signal/noise ratios on power calculations. An Arabidopsis thaliana (plant) disease resistance project was used. This included a powdery mildew resistant 4 (pmr4) group (8 samples) and a mutant lacking pathogen-induced callose group (8 samples). Infected and uninfected pmr4-1 and wild-type plants were examined (Affymetrix GeneChip Arabidopsis ATH1 Genome Array utilized). The signal-to-noise ratio of this dataset, or mean/(standard deviation), was ∼2.89.

The second dataset was a rat spinal cord injury project, with 12 severe injury samples and 13 control samples on RG-U34A microarrays. The original project was a temporal analysis of molecular mechanisms of spinal cord degeneration and repair to analyze spinal cord at thoracic vertebrae T9 at various time points up to 28 days post injury. There are three injury types, but we only used severe injury samples. The signal-to-noise ratio of this dataset was ∼2.71.

The third dataset was a human muscle biopsy project, with 26 muscle biopsies used individually on U133A microarrays, in two biological (diagnostic) groups. The two groups studied were normal skeletal muscles from volunteers in exercise studies (n = 16) (Chen et al., 2003) and Duchenne muscular dystrophy (n = 10) (dystrophin mutations; Chen et al., 2000). The signal-to-noise ratio of this dataset was ∼2.64.

The human and rat projects were done using standard operating procedures and quality control metrics, as we have previously described (Tumor analysis best practices working group, 2004).

### 2.5 Probe set algorithms

There are various probe set algorithms to estimate expression values from Affymetrix GeneChip arrays. They vary in how they perform background adjustment, normalization and summarization. These three steps can be performed in a different order and each step can be implemented differently in each probe set algorithm. The final signal values therefore typically differ for each algorithm, which affects analysis results. In this paper we tested three popular probe set algorithms: MAS 5.0, dChip and RMA, with regard to the proportion of probe sets that were adequately powered for any number of replicates. A comparison of these algorithms is given in the Results section.

MAS5.0 takes a robust average of log(PM − MM) using a one-step Tukey's biweight estimate, where outliers are penalized with low weights. Then, it normalizes arrays by scaling each array so that all arrays have the same mean.
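To make the "outliers are penalized with low weights" step concrete, here is a minimal sketch of a one-step Tukey biweight estimate. The tuning constant `c = 5` and the small `epsilon = 0.0001` are assumed here as the commonly cited MAS 5.0 defaults; they are not stated in this paper, and the function is an illustration rather than Affymetrix's implementation.

```python
from statistics import median


def tukey_biweight(values, c=5.0, epsilon=1e-4):
    """One-step Tukey biweight: a weighted mean centred on the median.

    Values far from the median (relative to the median absolute
    deviation) get small weights; beyond c * MAD they get weight zero.
    """
    m = median(values)
    mad = median([abs(v - m) for v in values])
    total_weight = 0.0
    weighted_sum = 0.0
    for v in values:
        u = (v - m) / (c * mad + epsilon)  # scaled distance from the median
        w = (1 - u * u) ** 2 if abs(u) < 1 else 0.0
        total_weight += w
        weighted_sum += w * v
    return weighted_sum / total_weight


# Hypothetical log-scale probe pair values with one gross outlier.
robust_average = tukey_biweight([1.0, 1.1, 0.9, 1.0, 10.0])
```

In this example the value 10.0 lies far outside the scaled range and receives zero weight, so the estimate stays at the centre of the remaining four values.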

dChip chooses an array (the default is the one with median overall intensity) to normalize other arrays against the array at the probe intensity level. Normalization is done by determining the normalization curve with a subset of probes, or invariant probe sets. The resulting signals, or Model-based expression indexes, are either the weighted average of PM/MM differences (PM/MM model) or background-adjusted PM values (PM-only model) of selected probes estimated using a multiplicative model (Li and Wong, 2001). The model-fitting and outlier-detection are iterated until the set of array, probe and single outliers is stabilized, then those outliers can be excluded or imputed.

RMA (Robust Multichip Average) is another probe set algorithm (or summary measure) that takes a robust multi-array average of background adjusted, normalized (quantile normalization) and log transformed PM only values. A robust procedure called ‘median polish’ (Holder et al., 2001) is used to estimate parameters of an additive model (Irizarry et al., 2003). We used RMAExpress, which is a standalone GUI program for Windows (and Linux) to compute gene expression summary values using RMA.
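The quantile normalization step can be sketched in a few lines of plain Python. This is a simplified illustration (ties are broken by position, and no background adjustment or median polish is performed), not the RMAExpress implementation; the function name and input layout are assumptions.

```python
def quantile_normalize(arrays):
    """Force every array to share the same intensity distribution.

    `arrays` is a list of equal-length lists, one list of probe
    intensities per microarray. Each value is replaced by the mean,
    across arrays, of the values holding the same rank.
    """
    n_probes = len(arrays[0])
    sorted_arrays = [sorted(a) for a in arrays]
    # Reference distribution: mean of the i-th smallest values.
    reference = [sum(s[i] for s in sorted_arrays) / len(arrays)
                 for i in range(n_probes)]
    normalized = []
    for a in arrays:
        order = sorted(range(n_probes), key=lambda i: a[i])
        out = [0.0] * n_probes
        for rank, probe_index in enumerate(order):
            out[probe_index] = reference[rank]
        normalized.append(out)
    return normalized


# Two hypothetical arrays with three probes each.
result = quantile_normalize([[5.0, 2.0, 3.0], [4.0, 1.0, 6.0]])
```

After normalization both arrays contain exactly the same set of values, only in different probe order, which is how this step removes array-to-array differences in overall intensity distribution.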

## 3 RESULTS

### 3.1 A power analysis tool for microarrays

We designed and implemented an interactive power analysis method, using the design strategy shown in Figure 3. Researchers first identify a pre-existing project that best matches their proposed project using any one of the existing data repositories (e.g. GEO, ArrayExpress, PEPR). The project must use the same microarray and the same tissue/cell type as in the proposed (future) experiment. The power analysis tool in HCE can use either a one-sample t-test (one group of microarrays corresponding to replicates with a single variable) or a two-sample t-test (two groups of microarrays differing by one variable).

Fig. 3

Interactive power analysis framework.

There are four parameters in power calculations: statistical significance (α), type II error (β), sample size (N) (number of replicates per group) and ES. Our tool enables users to perform a power calculation for each gene (probe set) based on the selected statistical model (one group = one-sample t-test, or two groups = two-sample t-test). The interactive power analysis tool allows users to fix three of these parameters then visualize the effect of changing the fourth parameter for all probe sets on the microarray. Dynamic query sliders are available for each of the four parameters. Researchers can interactively balance the number of arrays they can afford to run, with the genes they want to look at in the end. The output of the tool is ‘probe sets that are sufficiently powered’ versus ‘probe sets that are not sufficiently powered’, given any setting of the four parameters of the power calculation formula. The tool then creates a data mask that permits analysis of only sufficiently powered probe sets in the subsequent experiment.

There is considerable debate concerning signal/noise levels in microarrays (see Seo et al., 2004). Many probe sets may be at or under the noise threshold and may be particularly susceptible to false positive results. We therefore implemented a noise filter in our interactive power calculation tool. One commonly used noise filter is the ‘present call’ determination. This is an assessment of the difference between the perfect match probes (signal) and mismatch probes (noise) in a single probe set (11 pairs of PM/MM probes). The Affymetrix MAS 5.0 algorithm uses a one-sided Wilcoxon's signed rank test on the differences between discrimination scores [(PM − MM)/(PM + MM)] and a small positive number (default = 0.015) to generate the detection P-value. Small detection P-values (less than α1) are assigned a ‘present’ call, while large detection P-values (at or above α2) are assigned an ‘absent’ call, and values in between are called ‘marginal’ (default α1 = 0.04 and α2 = 0.06). We thus implemented the ‘% present call filtering’ slider as a noise filter that can be applied to any probe set algorithm (Fig. 2). Our tool queries a probe set within the project, and determines the percentage of microarrays in the project that show a ‘present call’ for that particular probe set. Setting this filter at ‘0’ will allow all probe sets on the microarray to enter the power calculations, whereas setting the filter at 50% will require a probe set to show a ‘present call’ in half the microarrays in the project in order to qualify for power analysis. As probe sets below the noise threshold will typically show a high proportion of absent calls, this tool provides a noise filter of user-defined stringency.
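A minimal sketch of the ‘% present call’ filter is given below. It assumes the detection calls have already been made and are available as the letters "P" (present), "M" (marginal) and "A" (absent) per array; that encoding, the function names and the choice to count marginal calls as not present are assumptions for this example.

```python
def percent_present(calls):
    """Percentage of arrays in which a probe set is called present."""
    return 100.0 * sum(1 for call in calls if call == "P") / len(calls)


def present_call_filter(calls_by_probe, min_percent):
    """Keep probe sets whose present-call percentage meets the threshold."""
    return [probe for probe, calls in calls_by_probe.items()
            if percent_present(calls) >= min_percent]


# Hypothetical detection calls for three probe sets across four arrays.
calls_by_probe = {
    "probe_a": ["P", "P", "P", "A"],  # 75% present
    "probe_b": ["A", "M", "P", "A"],  # 25% present
    "probe_c": ["A", "A", "A", "A"],  # 0% present
}

no_filter = present_call_filter(calls_by_probe, 0)
half_present = present_call_filter(calls_by_probe, 50)
```

With the threshold at 0 every probe set passes, and with the threshold at 50 only `probe_a` remains, mirroring the two slider settings described above.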

### 3.2 Effect of biological noise and probe set algorithms on power analysis results

We used three different pre-existing Affymetrix microarray projects to test our power calculation tool, one from a plant (Arabidopsis), one from a rat spinal cord damage project and one from a human muscular dystrophy patient muscle biopsy project. We chose these three projects to test the effects of two variables, biological noise and probe set algorithms, on the resulting power calculations. The Arabidopsis project showed the best signal/noise ratio, the human project the worst signal/noise, and rat project an intermediate level. We tested two probe set algorithms that subtract the mismatch signal from the perfect match (MAS5.0, chip-based normalization; dChip PM/MM, project-based normalization) and a third algorithm that uses only perfect match probes, with a project-based normalization (RMA). For each of the resulting project-based expression values files, we used both one-sample t-test model and two-sample t-test model for power calculations. We varied the sample size parameter from 2 to 10 with other parameters fixed (α = 0.05, β = 0.2, ES = 1.5-fold change). The null hypothesis being tested in two-sample t-test in Figure 4 is that there is no change in the expression level between control group and experimental group. In one-sample t-test, the null hypothesis is that the average expression level of the experimental group is the same as the average expression level calculated from the control group.
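The sweep over sample sizes can be sketched as follows. The example assumes expression values on a log2 scale, so that a 1.5-fold change corresponds to ES = log2(1.5); that scale, the standard deviations and the function name are assumptions made for illustration and are not data from the three projects.

```python
import math
from statistics import NormalDist


def fraction_powered(probe_sds, n, es, alpha=0.05, beta=0.2, k=2):
    """Fraction of probe sets whose required sample size is at most n."""
    z = NormalDist().inv_cdf
    z_sum = z(1 - alpha) + z(1 - beta)
    powered = sum(1 for sd in probe_sds if k * (sd * z_sum / es) ** 2 <= n)
    return powered / len(probe_sds)


# Assumed log2 scale: a 1.5-fold change is log2(1.5), about 0.585.
effect_size = math.log2(1.5)

# Hypothetical pooled standard deviations for four probe sets.
probe_sds = [0.1, 0.2, 0.4, 0.8]

# One point per sample size from 2 to 10, as in the sweep described above.
curve = {n: fraction_powered(probe_sds, n, effect_size) for n in range(2, 11)}
```

Each entry of `curve` corresponds to one point on a plot of the proportion of sufficiently powered probe sets against the number of replicates per group.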

Fig. 4

Power analysis results show differing effects of sample size, depending on probe set algorithm, noise filtering and species. Shown are organisms in rows and analysis models in columns. The x-axis of each graph is the number of microarrays, and the y-axis of each graph is the percentage of appropriately powered probe sets. Diamonds represent the RMA algorithm, squares represent dChip and triangles represent MAS5. Formulas are as in Methods.

The one-sample t-test has only one measure of variance, while the two-sample t-test calculates the variance for both groups (samples). As expected, a higher proportion of probe sets were sufficiently powered by the one-sample t-test, compared with the two-sample (Fig. 4). This was true for all projects, and all settings of parameters. All projects also showed an increased proportion of sufficiently powered probe sets as the number of replicates per group increased (Fig. 4).

As we hypothesized, there was a clear correlation between the amount of biological noise and the resulting power calculations. The Arabidopsis project, with the lowest amount of confounding biological noise, consistently showed a higher proportion of sufficiently powered probe sets relative to either the rat or human project (Fig. 4). For example, considering the two sample t-test calculations, at n = 3 replicates/group with the dChip probe set algorithm, the human project showed 20% of probe sets to be sufficiently powered, rat project 50% and Arabidopsis 55%. The human project was shown to have a particularly poor signal/noise ratio by our analysis. This was anticipated, given that the individual samples within each group have different ethnic backgrounds, different stages of disease progression and include issues of tissue heterogeneity.

The very different power results from the three probe set algorithms were unexpected (Fig. 4). In all three projects, using both one-sample and two-sample t-test models, the probe set algorithms showed consistent and relatively dramatic differences with regard to the proportion of probe sets that were adequately powered for any number of replicates. In each case, RMA required fewer replicates per group to achieve sufficient power for the majority of probe sets. MAS5.0 showed the lowest proportion of adequately powered probe sets at each data point, while dChip PM/MM showed intermediate values. As the number of samples per group increased, the three probe set algorithms began to converge (Fig. 4).

Both RMA and dChip PM/MM algorithms use a project-based normalization, while MAS5.0 uses intra-array normalization. Our power calculation studies were consistent with the project-based normalizations being more effective at reducing variance than the chip-based normalization. These findings are consistent with previous publications showing that the RMA probe set algorithm provides signals with the least variance, particularly at low signal strengths (Irizarry, 2003; Bolstad, 2003). This is because the quantile normalization and data polishing reduce variance. This power analysis showed that >90% of probe sets were sufficiently powered with only two replicates in either Arabidopsis or rat, when using RMA with a one-sample t-test (Fig. 4).

### 3.3 Effect of noise filters on power calculations

Not all genes are expressed into mRNA in each cell or tissue type. Those probe sets detecting mRNAs that are not expressed, or expressed at very low levels, are expected to result in signals that are at or near background (noise) levels. We therefore tested the effects of a ‘present call’ noise filter on the resulting power calculations; we expected that the ‘performance’ (e.g. proportion of sufficiently powered probe sets for any given number of arrays) would improve with this noise filter.

In the example shown, we applied a fairly stringent noise filter (50% present calls) and repeated the same power analysis for the three datasets (Fig. 4). Both dChip PM/MM and MAS5.0 showed improved performance in all organisms and at all numbers of microarrays, as expected, except for a slight degradation of dChip with the rat data. Surprisingly, the RMA algorithm showed a consistent degradation of performance in each organism, with the noise filter leading to a decrease in the proportion of sufficiently powered probe sets. For example, in the human muscle data two-sample t-test for RMA, 58% of probe sets were sufficiently powered with two microarrays, while this decreased to 32% with the noise filter. A less stringent noise filter lowered the degree of degradation. For example, 40% of probe sets were sufficiently powered with two microarrays after a 10% present call filter for the same case. However, RMA remained the only algorithm that showed degradation after noise filtering with the Arabidopsis and human data.

The degradation of the performance of RMA with noise filters, as shown by our power analysis tool, is likely because of the normalization methods employed by RMA. The dChip PM-only model showed similar degradation with the rat and human data (but not with Arabidopsis data), but the degree of degradation was less than with RMA and the difference was statistically significant (P < 0.0003). A similar degradation of RMA was seen with spike-in control probe sets, filtered in a similar manner. This was further examined by testing the concordance of appropriately powered probe sets below.

### 3.4 Effect of noise filter on concordance of power analysis results by probe set algorithms

Given the strong effects of the probe set algorithm choice on the proportion of sufficiently powered probe sets (Fig. 4), we then tested the intersection of the appropriately powered probe sets. For this test, we selected a gene ontology group, inflammatory response genes, where we expected many of the probe sets to show relatively low signals. We used the two-sample t-test, and studied both the rat and human data. It should be noted that both of these projects are known to show increased inflammatory gene expression in one of the two groups (severe damage rat group; Duchenne muscular dystrophy human group). We also studied the intersection with and without a 50% present call noise filter.

For the rat project, there were 110 probe sets included within the ‘inflammatory response’ group. Without the noise filter and n = 3 per group, there was relatively good concordance between RMA and dChip, with ∼64% (70/110) of probe sets showing sufficient powering, and about half of these concordant between the two algorithms (Fig. 5). MAS5.0 showed poor sensitivity for these same settings, with only 7% (5/70) of probe sets showing sufficient powering (Fig. 5). Use of the noise filter resulted in loss of 77% of the appropriately powered probe sets in the intersection between RMA and dChip (35 to 8) (Fig. 5). This analysis suggested that, for inflammatory genes in this example (rat project, two group, n = 3), a good data mask would be the intersection of RMA and dChip with no noise filter.
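Concordance counts of this kind can be computed from three lists of sufficiently powered probe sets. The sketch below uses hypothetical probe set identifiers and a hypothetical `venn_counts` helper; it illustrates how the regions of a three-way overlap might be counted and does not reproduce the numbers reported for the rat project.

```python
def venn_counts(rma, dchip, mas5):
    """Count probe sets in each region of a three-way overlap."""
    rma, dchip, mas5 = set(rma), set(dchip), set(mas5)
    return {
        "rma_only": len(rma - dchip - mas5),
        "dchip_only": len(dchip - rma - mas5),
        "mas5_only": len(mas5 - rma - dchip),
        "rma_dchip": len((rma & dchip) - mas5),
        "rma_mas5": len((rma & mas5) - dchip),
        "dchip_mas5": len((dchip & mas5) - rma),
        "all_three": len(rma & dchip & mas5),
    }


# Hypothetical sufficiently powered probe sets per algorithm.
counts = venn_counts(
    rma=["ps_1", "ps_2", "ps_3", "ps_4"],
    dchip=["ps_2", "ps_3", "ps_5"],
    mas5=["ps_3"],
)
```

A data mask such as "the intersection of RMA and dChip" corresponds to the regions `rma_dchip` plus `all_three` in this breakdown.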

Fig. 5

Concordance of sufficiently powered probe sets using a defined functional group (inflammatory response genes; total 110 probe sets), in the rat spinal cord dataset. Each number represents the number of sufficiently powered probe sets. The proportion of probe sets that are concordant among the three probe set algorithms increases with ‘present call’ filtering, although there is also a severe penalty in the number of probe sets, owing to the relatively low expression level of many of the inflammatory cytokines.

We then turned to the human project (Fig. 6). Use of the same parameters as in the previous project showed considerably less concordance between probe set algorithms. Without a noise filter, only 1% of appropriately powered probe sets using the RMA algorithm were concordant with dChip, and none were concordant with MAS5 (Fig. 6). Application of the noise filter significantly reduced the number of genes entered into the power calculation, but also reduced the concordance.

Fig. 6

Concordance of sufficiently powered probe sets corresponding to human inflammatory response genes (total 263 probe sets) in the human muscle dataset. Each number represents the number of sufficiently powered probe sets. The proportion of concordant probe sets that are sufficiently powered when n = 3 is very low, both with and without a noise filter (a and b). When increasing the replicates to n = 10, the concordance is seen to increase for the probe set algorithms (c and d). RMA continues to show many discordant probe sets when used without the noise filter (d).

This analysis of the human data suggested that the level of confounding biological noise was too high for the relatively low number of replicates (n = 3) in this example. To determine whether concordance improved with more replicates, the same power calculations were run with n = 10. At n = 3 without noise filtering, RMA shows 210 sufficiently powered inflammatory gene probe sets, of which only two are shared with the dChip algorithm, and none with the MAS5 algorithm. This could reflect greater sensitivity of RMA, more false positives, or both. Increasing n to 10 results in a small increase in RMA probe sets (210 to 238), with 74% (175/238) of probe sets now concordant with dChip. We can assume that those probe sets detected solely by RMA at n = 3 that become concordant with other probe set algorithms at n = 10 reflect the greater sensitivity of RMA (74%). On the other hand, those probe sets seen solely by RMA at n = 3 that remain discordant with other algorithms at n = 10 most likely represent false positives (24% = 56/238). Thus, our power analysis shows that RMA is indeed a much more sensitive probe set algorithm, but that this sensitivity comes at the cost of a relatively high false positive rate, which for this noisy human project is ∼24% of probe sets deemed ‘significant’ by RMA. This analysis also shows that probe set algorithm concordance is a good method of distinguishing true positives from false positives at n = 10, but a very poor method at n = 3.
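The concordance arithmetic above can be illustrated with a minimal Python sketch. The function names and the toy probe set identifiers are hypothetical and are not part of HCE 3.5; only the counts 238, 175 and 56 are taken from the text.

```python
from typing import Dict, Set


def concordance_with(reference: Set[str], other: Set[str]) -> float:
    """Fraction of probe sets in `reference` that are also powered under `other`."""
    return len(reference & other) / len(reference) if reference else 0.0


def solely_detected(powered: Dict[str, Set[str]], algorithm: str) -> Set[str]:
    """Probe sets powered under `algorithm` and under no other algorithm."""
    others = set().union(*(s for name, s in powered.items() if name != algorithm))
    return powered[algorithm] - others


# Toy example with made-up probe set identifiers (illustration only).
toy = {
    "RMA": {"a", "b", "c", "d"},
    "dChip": {"a", "b"},
    "MAS5": {"b", "e"},
}
toy_concordance = concordance_with(toy["RMA"], toy["dChip"])  # 2 of 4 shared
toy_rma_only = solely_detected(toy, "RMA")                    # {"c", "d"}

# Counts reported in the text for the human project at n = 10.
rma_powered = 238
concordant_with_dchip = 175
remaining_discordant = 56
pct_concordant = round(100 * concordant_with_dchip / rma_powered)  # 74
pct_discordant = round(100 * remaining_discordant / rma_powered)   # 24
```

Working with sets of probe set identifiers, rather than counts alone, makes it possible to track which individual probe sets move from RMA-only at n = 3 to concordant at n = 10.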

## 4 DISCUSSION

Microarrays are used for both hypothesis testing and hypothesis generation. Hypothesis testing requires that a smaller number of probe sets (e.g. a functional group of genes) be tested and that the experiment be appropriately powered to answer the hypothesis (the functional group shows altered expression, or not). On the other hand, with hypothesis generation it is best to include the largest possible number of tests (probe sets), with the goals of maximizing sensitivity and minimizing false positives. False positives are particularly deleterious in hypothesis generation, both because of the multiple testing problem and because of the possibility of generating the wrong hypothesis.

The power analysis tool described here shows utility for both hypothesis testing and hypothesis generation. The tool allows the user to select any subset of genes and probe sets, and to easily query the number of microarrays needed for any specified ES, α and β. It also enables user-specified noise filters, and selection of gene ontologies or other user-specific gene groups. In the example provided of inflammatory genes, we show that the species studied (rat versus human), probe set algorithm selection and noise filters all affect the resulting appropriately powered probe sets, and the concordance of power analysis results between probe set algorithms. Importantly, once the user has selected optimized experimental parameters, a data filter can be produced that limits future data analyses to only those appropriately powered probe sets. This tool will likely prove particularly important for human clinical trials, where hypothesis testing is typically a key part of clinical trial design, and is carefully studied by review boards for statistical robustness. Our data on the human muscle project demonstrate that there are likely large amounts of uncontrolled biological noise intrinsic to most if not all human projects, and that this has a dramatic effect on power analysis relative to less complex and more easily controlled organisms (rats and plants). Specifically, we found that much higher numbers of microarrays are needed to obtain a large number of appropriately powered probe sets. Furthermore, we found that the RMA algorithm, while providing high sensitivity for lower replicates, also resulted in ∼24% false positives. Such false positives could be damaging to hypothesis testing in a human clinical trial.
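To make the relationship between ES, α, β and the number of microarrays concrete, the following Python sketch uses the textbook normal approximation for a two-sided, two-group comparison with equal group sizes. It is an illustration under stated assumptions, not the calculation implemented in HCE 3.5: the function name `arrays_per_group` is hypothetical, and the effect size is assumed to be a difference in mean signal expressed in the same units as the per-probe-set standard deviation.

```python
import math
from statistics import NormalDist


def arrays_per_group(effect_size: float, sd: float,
                     alpha: float = 0.05, beta: float = 0.2) -> int:
    """Approximate arrays needed per group for one probe set.

    Normal approximation: n = 2 * ((z_{1-alpha/2} + z_{1-beta}) * sd / ES)^2.
    This is an assumed textbook formula, not necessarily the one used by HCE.
    """
    if effect_size <= 0 or sd < 0:
        raise ValueError("effect_size must be > 0 and sd must be >= 0")
    z = NormalDist()  # standard normal distribution
    z_alpha = z.inv_cdf(1 - alpha / 2)  # two-sided significance threshold
    z_beta = z.inv_cdf(1 - beta)        # quantile for the desired power
    n = 2 * ((z_alpha + z_beta) * sd / effect_size) ** 2
    return max(2, math.ceil(n))  # at least two replicates per group


# A noisy probe set (sd equal to the effect size) versus a quieter one.
n_noisy = arrays_per_group(effect_size=1.0, sd=1.0)
n_quiet = arrays_per_group(effect_size=1.0, sd=0.5)
```

Under this approximation the required number of arrays grows with the square of the noise-to-effect ratio, which is consistent with the observation that the noisier human project needs many more replicates than the rat project.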

Many microarray projects, particularly in experimental organisms, are done for the purpose of hypothesis generation. Our power analysis tool allows the user to maximize the number of appropriately powered probe sets over an entire microarray, thus balancing sensitivity and cost. Interestingly, while noise filters appeared beneficial in the human project, they seemed to be deleterious in the rat project. Thus, projects in experimental organisms with less confounding noise should use noise filters judiciously, as they severely penalize sensitivity. Since our strategy outputs a data filter that limits future analyses to only those appropriately powered probe sets, it enables researchers to focus on a smaller number of candidate genes, thus alleviating some of the multiple-testing problems in microarray projects without using unhelpful classical P-value correction methods.
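The idea of a data filter restricted to appropriately powered probe sets, with an optional noise filter, can be sketched as follows. This is a simplified illustration and not the HCE 3.5 implementation: the record fields (`id`, `sd`, `present_fraction`), the function names and the normal-approximation power formula are all assumptions made for the example.

```python
from statistics import NormalDist
from typing import Dict, List, Optional


def approx_power(effect_size: float, sd: float, n_per_group: int,
                 alpha: float = 0.05) -> float:
    """Approximate power of a two-sided, two-group comparison (normal approximation)."""
    if sd == 0:
        return 1.0  # no variability: any non-zero effect is detected
    z = NormalDist()
    se = sd * (2 / n_per_group) ** 0.5  # standard error of the mean difference
    return z.cdf(effect_size / se - z.inv_cdf(1 - alpha / 2))


def powered_filter(probe_sets: List[Dict], effect_size: float, n_per_group: int,
                   alpha: float = 0.05, beta: float = 0.2,
                   min_present: Optional[float] = None) -> List[str]:
    """Return IDs of probe sets whose approximate power is at least 1 - beta.

    If `min_present` is given, probe sets with a lower fraction of
    'present' calls are dropped first (a hypothetical noise filter).
    """
    keep = []
    for ps in probe_sets:
        if min_present is not None and ps["present_fraction"] < min_present:
            continue
        if approx_power(effect_size, ps["sd"], n_per_group, alpha) >= 1 - beta:
            keep.append(ps["id"])
    return keep


# Made-up probe sets for illustration only.
probes = [
    {"id": "ps_low_noise", "sd": 0.2, "present_fraction": 1.0},
    {"id": "ps_noisy", "sd": 1.5, "present_fraction": 1.0},
    {"id": "ps_absent", "sd": 0.2, "present_fraction": 0.1},
]
unfiltered = powered_filter(probes, effect_size=1.0, n_per_group=3)
filtered = powered_filter(probes, effect_size=1.0, n_per_group=3, min_present=0.5)
```

In this toy case the noise filter removes a low-variance probe set that is rarely called present. This mirrors the penalty in sensitivity described above for weakly expressed genes.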

Using our power analysis tool on a series of projects, and with three different probe set algorithms, we obtained some insights regarding the performance of specific probe set algorithms. First, we conclude that different projects in different species have different ‘optimal’ analysis methods. RMA is best for projects with less confounding noise, where it provides high sensitivity at low replicates. On the other hand, dChip appears to perform best in noisier projects, with fewer false positives. Particularly interesting was the degradation of performance of RMA with the application of a noise filter. This was unexpected, as one would assume that any reduction in noise should improve signal/noise ratios, and thus the performance of any signal generation algorithm. We feel that the reason for this degradation of performance lies in the quantile normalization and median polish of RMA. Probe sets with very high absolute intensity levels (e.g. highly expressed genes) lie high in the dynamic range, and the RMA algorithm apparently leads to greater variance in this subset of probe sets.

## 5 CONCLUSION

To our knowledge, we present the first method to provide researchers with a practical solution for a priori power calculations with microarrays which is useful for both hypothesis testing and generation.

This work was supported by Department of Defense W81XWH-04-01-0081 and NIH 1P30HD40677-01 (MRDDRC Genetics Core).

Conflict of Interest: none declared.

## Author notes

Associate Editor: John Quackenbush