CressExpress: A tool for large-scale mining of expression data from Arabidopsis thaliana

CressExpress is a user-friendly


Introduction
Availability of abundant, high-quality data sets from microarray expression experiments has stimulated rapid progress in gene networks analysis for a variety of plant and animal species (Stuart et al., 2003; Craigon et al., 2004; Wille et al., 2004; Wei et al., 2006; Zhong and Sternberg, 2006). These data are making it possible to explore correlated expression patterns for the entire genome, as well as answer focused questions regarding specific pathways and processes. By examining correlated expression patterns between genes, investigators can infer new functions for previously uncharacterized genes or identify potential causal relationships between regulators and their targets. Although the details of individual analyses and applications vary, most are based on the idea that correlated expression, or co-expression, implies biologically-relevant relationships between gene products.
Many applications of this idea utilize variations of Pearson's correlation coefficient and linear regression to quantify co-expression relationships. Figure 1 presents an example scatter plot that illustrates the idea. Each point on the plot represents data from one array; x and y coordinates represent expression values for genes indicated on the horizontal and vertical axes, respectively. In this case, there is a strong positive relationship between the two genes' expression values; when one gene's expression is high, the other gene's expression is also high.
Computing a linear regression between expression values for the two genes quantifies the strength of this relationship. This yields an r-squared (r²) value, equivalent to the square of Pearson's correlation coefficient, and a p-value that expresses the probability of obtaining the observed r² value (or larger) assuming a random relationship between the two variables. The regression also yields a slope, which indicates the direction of the co-expression relationship.
Larger r² and smaller p-values signal higher confidence in the co-expression relationship.
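As a rough illustration of the quantities described above, the following Python sketch fits a least-squares line to two vectors of expression values and reports the slope, Pearson's r, and r². It is not the CressExpress implementation; the function names and the example values are hypothetical. The p-value is left out because converting the t statistic to a probability requires Student's t distribution, which is normally taken from a statistics package.

```python
import math

def linear_regression(x, y):
    """Least-squares fit of y on x; returns (slope, intercept, r, r_squared)."""
    n = len(x)
    if n != len(y) or n < 3:
        raise ValueError("need two equal-length vectors with at least 3 arrays")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    if sxx == 0 or syy == 0:
        raise ValueError("expression values must vary across arrays")
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r = sxy / math.sqrt(sxx * syy)  # Pearson's correlation coefficient
    return slope, intercept, r, r * r

def t_statistic(r, n):
    """t statistic (n - 2 degrees of freedom) for the null of no linear relationship."""
    if r * r >= 1.0:
        return math.copysign(math.inf, r)
    return r * math.sqrt((n - 2) / (1.0 - r * r))

# Hypothetical expression values for two genes measured on the same five arrays.
gene_a = [1.0, 2.0, 3.0, 4.0, 5.0]
gene_b = [2.1, 3.9, 6.2, 8.1, 9.8]
slope, intercept, r, r2 = linear_regression(gene_a, gene_b)
print(f"slope={slope:.2f} r={r:.3f} r2={r2:.3f} t={t_statistic(r, len(gene_a)):.1f}")
```

The sign of the slope (and of r) gives the direction of the relationship, while r² alone does not; this is why a cutoff such as r² of 0.64 admits correlations above 0.8 as well as below -0.8.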
Altogether, these numbers summarize how closely two genes are co-expressed across multiple array experiments and provide a way for experimenters to identify and quantify relationships between genes. When these numbers are available for a large number of genes, they can be used to build co-expression graphs, or networks, in which highly-correlated genes (nodes) are linked, and less well-correlated genes are not. Studies analyzing co-expression networks have demonstrated that genes connected in the co-expression network often perform related functions, thus demonstrating biological relevance of the approach (reviewed in Aoki et al., 2007 and Saito et al., 2008).
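The construction of a co-expression network from pairwise statistics can be sketched as follows. This is a minimal illustration, assuming expression values are held in a dictionary keyed by gene name; the gene names, the values, and the r² cutoff of 0.64 used as a default are placeholders rather than CressExpress settings.

```python
import math
from itertools import combinations

def pearson_r(x, y):
    """Pearson's correlation coefficient between two equal-length vectors."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0  # a gene with constant expression is treated as uncorrelated
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    return sxy / math.sqrt(sxx * syy)

def build_network(expression, r2_cutoff=0.64):
    """Return edges (gene_a, gene_b, r2) for every pair at or above the cutoff."""
    edges = []
    for a, b in combinations(sorted(expression), 2):
        r = pearson_r(expression[a], expression[b])
        if r * r >= r2_cutoff:
            edges.append((a, b, r * r))
    return edges

# Hypothetical genes measured on six arrays.
expression = {
    "geneA": [1, 2, 3, 4, 5, 6],
    "geneB": [2, 4, 5, 8, 10, 13],  # tracks geneA closely
    "geneC": [5, 1, 4, 2, 6, 3],    # unrelated to the other two
}
for a, b, r2 in build_network(expression):
    print(f"{a} -- {b} (r2={r2:.2f})")
```

Only the geneA and geneB pair is linked in this toy example; geneC remains an unconnected node.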
A number of groups have established web-based interfaces for mining pre-computed co-expression results and co-expression networks in Arabidopsis. To our knowledge, none offers a way to re-compute the co-expression networks using subsets of experiments and arrays, perhaps because of the technical challenges involved. Re-computing correlation between a query gene and all the genes in the genome is computationally intensive and cannot easily be done in real time during a user's visit to a Web site. However, the co-expression networks arising from different inputs may vary greatly, depending on sample or tissue type. We address this problem by developing an easy-to-use Web tool (CressExpress) that allows users to select distinct tissue types and experiments to include in an analysis, and that then executes the calculations off-line.
When the analysis finishes, users receive an email linking to a zip file on the CressExpress site that contains a complete package of results, along with a record of all parameters and samples used as inputs to the experiment. By providing a complete report of results and inputs, CressExpress makes it convenient for users to integrate co-expression analysis into their research workflow.

Results
CressExpress is an easy-to-use Web-based tool that allows researchers to set up and run co-expression analysis experiments using a variety of different data sets and sample types. To set up and run an experiment, users enter query identifiers and analysis parameters on the CressExpress Web site located at http://www.cressexpress.org. To begin an analysis, users click the "Run the Tool" link and then proceed through a series of screens (Figure 2) that offer the opportunity to vary quality control parameters, specify data release and array platforms, and select subsets of sample types to include in the analysis. This latter feature can be particularly important for query genes that exert their effects in a tissue- or developmentally-restricted fashion. At each step, a "help" icon links to a page describing the various options and how they would likely affect the analysis results. CressExpress also provides reasonable default choices so that users can easily perform pilot studies and quickly learn how the tool operates.
In step one, users choose a data release of expression values that will be used in the analysis. Currently, there are four data release options, each one representing different array collections and array processing methods (Table I). Each release contains expression data harvested from the Nottingham Arabidopsis Stock Center (NASC) AffyWatch subscription service (Craigon et al., 2004) and includes samples from two Affymetrix Arabidopsis array designs: the ATH1 array (22,810 probe sets) and the AG array (around 8,000 probe sets).
Release 2.0 provides the same data used in a previously-published analysis of metabolic pathways; we provide this data set as a courtesy for users interested in investigating the prior study's results (Wei et al., 2006). Releases 3.0, 3.1, and 3.2 are the same set of arrays, but the expression values in each were generated using different processing methods. We provide data from different array processing methods mainly for users who want to compare results with other on-line tools or investigate how these methods may affect downstream co-expression analysis results. However, we generally recommend using Releases 2.0 or 3.0, which were generated using the RMA algorithm (Irizarry et al., 2003). We recommend these releases mainly because we have observed good separation between correlation coefficients for probe sets expected to be correlated (e.g., redundant probe set pairs (Cui and Loraine, 2006)) versus probe set pairs selected at random from the genome (Loraine, unpublished data).
Step one also features an option to configure a quality control (QC) setting for individual arrays. This QC setting is based on a Kolmogorov-Smirnov (KS) test of deleted residuals, which is described in detail elsewhere (Persson et al., 2005; Trivedi et al., 2005), but we also describe it briefly here. The Kolmogorov-Smirnov D (KS-D) test statistic ranges from 0 to 1 and quantifies how much a given array's expression values deviate from other arrays in the same group. Arrays sharing the same NASC experiment identifier are considered as belonging to a single group.
Arrays with larger KS-D test statistics are of lower quality, at least with respect to how well they resemble other arrays in the same experiment. Decreasing the KS-D threshold excludes outlier arrays that are more variable with respect to the other arrays in the same experimental grouping.
Eliminating these lower-quality, outlier arrays can therefore increase observed correlation between genes that are co-expressed. We recommend that when running the tool with a relatively small number of arrays (e.g., fewer than fifty), users should utilize the default KS-D value of 0.15, since computing co-expression with a smaller number of arrays makes the regression more vulnerable to outliers that can skew results. Co-expression analysis experiments involving more arrays will be less vulnerable to outlier arrays, and eliminating the outlier arrays will have less effect. Figure 3 presents the distribution of KS-D statistics computed for each data release. Note that the Release 2.0 arrays were pre-screened to exclude arrays with KS-D above 0.15 as described in (Wei et al., 2006).
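To give a feel for how a KS-D threshold excludes outlier arrays, the sketch below computes a plain two-sample Kolmogorov-Smirnov D statistic between each array and the pooled values of the other arrays in its group, then keeps arrays at or below the threshold. This is a deliberately simplified stand-in: it compares raw expression values, not the deleted residuals used in the procedure cited above, and all array names and values are hypothetical.

```python
from bisect import bisect_right

def ks_d(sample_a, sample_b):
    """Two-sample KS statistic: largest gap between the two empirical CDFs (0 to 1)."""
    a = sorted(sample_a)
    b = sorted(sample_b)
    d = 0.0
    for value in a + b:  # the largest gap always occurs at an observed value
        cdf_a = bisect_right(a, value) / len(a)
        cdf_b = bisect_right(b, value) / len(b)
        d = max(d, abs(cdf_a - cdf_b))
    return d

def filter_arrays(group, threshold=0.15):
    """Keep arrays whose D against the pooled remaining arrays is <= threshold."""
    kept = {}
    for name, values in group.items():
        others = [v for other, vals in group.items() if other != name for v in vals]
        if ks_d(values, others) <= threshold:
            kept[name] = values
    return kept

# One hypothetical experiment group: four similar arrays and one shifted outlier.
group = {f"array{i}": [float(v) for v in range(20)] for i in range(1, 5)}
group["array5"] = [float(v) for v in range(10, 30)]
print(sorted(filter_arrays(group)))
```

Lowering the threshold in this sketch removes more arrays, which mirrors the behavior described for the CressExpress QC setting.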
In step two, users enter a list of up to fifty queries, using AGI gene names or probe set names to identify genes. To map AGI gene names onto probe set names, CressExpress uses the probe set-to-gene id annotations provided by The Arabidopsis Information Resource. However, because these mappings are problematic in some cases (Cui and Loraine, 2006), CressExpress always reports results using both probe set identifiers and AGI codes. Users also use this screen to specify the array type; options include the ATH1 (22,810 probe sets) (Redman et al., 2004) or the older AG (~8,000 probe sets) array, both from Affymetrix. By default, the ATH1 array is selected, since more recent and greater amounts of data are available for the ATH1 relative to the AG array. For each query gene, the CressExpress server will perform a large-scale linear regression experiment, comparing each query to all genes represented on the selected array, using all or just some expression data stored in the CressExpress database, depending on the sample and experiments selected in subsequent steps.
In steps three and four, users choose sample types (step three) and experiments (step four) to include in the regression analysis for their query genes. In step three, CressExpress builds a list of sample types for arrays that met the quality control and array type criteria specified in previous steps. Users can select all sample types (the default) or subsets of sample types from a menu listing, where the wording for each item on the list comes from the text provided by NASC. Because NASC obtains the textual description of sample types from experimenters who generated and submitted the original data, the sample type descriptions may use a variety of different terms meaning the same thing. Therefore, we advise users to read the entire list when attempting to limit their analyses to specific tissue types, since different text may have similar meanings. For example, users wishing to examine co-expression networks of flowering organs might select sample types labeled "flowers" and "flower buds" as well as "inflorescence." The default option for steps three and four is to use all available sample types. This default setting is mainly for the convenience of users wishing to run quick pilot experiments and become familiar with the tool and how it operates. To take full advantage of the CressExpress tool, we recommend that users choose sample types where the query gene products are expected to be expressed or active. For example, users interested in investigating the co-expression network surrounding genes involved in flowering should choose sample types that include flowers. Similarly, users interested in investigating pathways involved in photosynthesis should choose sample types derived from green shoots and leaves.
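Because free-text sample descriptions vary, a simple keyword screen can help shortlist candidate sample types before reading the full list. The sketch below is only an illustration of that idea: apart from the three flower-related labels quoted above, the descriptions are invented, and the helper name is hypothetical. Substring matching will also return false positives, so the shortlist still needs to be checked by eye.

```python
def select_sample_types(descriptions, keywords):
    """Return descriptions containing any keyword (case-insensitive substring match)."""
    wanted = [k.lower() for k in keywords]
    return [d for d in descriptions if any(k in d.lower() for k in wanted)]

# Hypothetical menu entries; real entries use the wording supplied to NASC.
sample_types = ["flowers", "Flower buds", "inflorescence", "rosette leaves", "roots", "pollen"]
flower_related = select_sample_types(sample_types, ["flower", "inflorescence", "pollen"])
print(flower_related)
```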
To demonstrate the effects of sample type filtering, Figure 4 presents a diagram showing output from a co-expression experiment in which 185 flowering-related genes were compared to each other. Each connection in the network represents an r² value of 0.64 or above, such that each pair of connected genes exhibits expression correlation above 0.8 or below -0.8. Figure 4A shows the network as computed using all 1,771 arrays from CressExpress Data Release 3.0, while Figure 4B shows the network computed using a subset of 129 arrays that were labeled as being from flower- or pollen-related samples. The flower-based network shown in 4B contains many more connections than in 4A, demonstrating that, in this case, restricting inputs to biologically-relevant sample types yields a richer, more informative network. For example, one of the best-connected genes in 4B is AT3G18990 (VRN1, REDUCED VERNALIZATION RESPONSE 1), which encodes a DNA-binding protein involved in vernalization, the process by which exposure to cold temperatures helps to trigger flowering in Arabidopsis (Chandler et al., 1996; Levy et al., 2002). This gene is absent from the network generated from all available samples. This means that if one were entirely unaware that VRN1 plays a role in flowering, an analysis of the flower-based co-expression network would reveal that role thanks to the many connections VRN1 shares with flowering-related genes, but an analysis of the network generated from all available samples would not. However, we have also observed that in many instances, reducing the number of samples can inflate the number of connections, even among groups of genes that one would not normally consider to be co-expressed. To control for this, we re-computed the flowering network one hundred times, using different randomly-selected subsets of 129 arrays.
Not once did we observe a network as richly-connected as the one computed from flowering samples alone, suggesting that, in this case, using flowering-related samples exposes co-expression relationships that would otherwise be masked when all available data are used.
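The random-subset control described above can be sketched in a few lines. The code below is an illustrative reconstruction under stated assumptions, not the analysis code used for Figure 4: the three genes and forty arrays are synthetic, the function names are hypothetical, and the random seed is fixed only so that the run is repeatable.

```python
import random
from itertools import combinations

def r_squared(x, y):
    """Squared Pearson correlation; 0.0 if either vector is constant."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    return sxy * sxy / (sxx * syy)

def count_edges(expression, array_idx, cutoff=0.64):
    """Number of gene pairs with r2 >= cutoff, using only the listed arrays."""
    edges = 0
    for a, b in combinations(sorted(expression), 2):
        x = [expression[a][i] for i in array_idx]
        y = [expression[b][i] for i in array_idx]
        if r_squared(x, y) >= cutoff:
            edges += 1
    return edges

def random_subset_control(expression, n_arrays, subset_size, observed, trials=100, seed=0):
    """Count random array subsets whose network has at least `observed` edges."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        idx = rng.sample(range(n_arrays), subset_size)
        if count_edges(expression, idx) >= observed:
            hits += 1
    return hits

# Synthetic data: arrays 0-9 stand in for flower samples, arrays 10-39 for other tissues.
noise = random.Random(1)
flower_idx = list(range(10))
expression = {
    "g1": [float(i) for i in range(10)] + [noise.uniform(0, 10) for _ in range(30)],
    "g2": [2.0 * i + 1.0 for i in range(10)] + [noise.uniform(0, 10) for _ in range(30)],
    "g3": [float(i * i) for i in range(10)] + [noise.uniform(0, 10) for _ in range(30)],
}
observed = count_edges(expression, flower_idx)
as_rich = random_subset_control(expression, 40, len(flower_idx), observed)
print(observed, "edges in the flower subset;", as_rich, "of 100 random subsets were as rich")
```

If random subsets of the same size rarely or never match the edge count of the tissue-specific subset, the extra connections are unlikely to be an artifact of the smaller sample size alone.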
In step five, users may configure a pathway-level co-expression (PLC) analysis, which identifies genes that are co-expressed with multiple query genes. Seen from the point of view of the larger co-expression network, PLC analysis identifies query genes' common neighbors, such that each neighbor is connected to two or more members of the query group. We previously used PLC analysis to identify candidate genes involved in metabolic pathways and cell wall biosynthesis, and, in general, we have found that genes identified as co-expressed not just with a single gene but with ensembles of genes acting together are often the best and most successful candidates to test (Persson et al., 2005; Wei et al., 2006). Running a PLC analysis requires the user to have entered two or more query genes in step two, and also requires the user to specify a linear regression r² cutoff parameter for designating co-expression between gene pairs, where the r² value is taken from the linear regression performed during the initial analysis. A default value of 0.36, equivalent to a correlation coefficient (Pearson's r) of 0.6, is provided for convenience.
Interpreting and using the PLC functionality and its results requires understanding how the PLC algorithm operates, and so we describe it in detail here. The PLC method as implemented in CressExpress operates as described previously, with some differences in how results are ranked (Wei et al., 2006). The PLC algorithm examines the co-expression results for each of the user-supplied query genes and then builds a list of candidate genes that are co-expressed with two or more members of the query group, where co-expression is defined as an r² regression result greater than the user-specified threshold. Genes co-expressed with two or more members of the query gene group are ranked first according to the number of genes within the query group with which they are co-expressed, and second by the average r² value. Thus, genes that are co-expressed with many members of the query group have a higher rank than genes co-expressed with fewer members of the query group.
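The ranking rule just described can be written compactly. The following sketch follows the description in the text (count of co-expressed query genes first, average r² second) but is not the CressExpress source code; the input layout, the function name, and the example values are assumptions made for illustration.

```python
def plc_rank(results, r2_threshold=0.36, min_queries=2):
    """Rank candidates co-expressed with at least `min_queries` query genes.

    `results` maps each query gene to {target gene: r2} from its regression run.
    Returns (target, number of co-expressed queries, mean r2), best first.
    """
    hits = {}
    for query, targets in results.items():
        for target, r2 in targets.items():
            # A gene is not counted as its own neighbor; the text defines
            # co-expression as r2 greater than the threshold.
            if target != query and r2 > r2_threshold:
                hits.setdefault(target, []).append(r2)
    ranked = [
        (target, len(values), sum(values) / len(values))
        for target, values in hits.items()
        if len(values) >= min_queries
    ]
    # Primary key: number of query genes; secondary key: average r2.
    ranked.sort(key=lambda row: (-row[1], -row[2], row[0]))
    return ranked

# Hypothetical regression results for three query genes (Q1-Q3) and three targets.
results = {
    "Q1": {"X": 0.8, "Y": 0.5, "Z": 0.2},
    "Q2": {"X": 0.6, "Y": 0.9, "Z": 0.7},
    "Q3": {"X": 0.7, "Y": 0.1, "Z": 0.5},
}
for target, n_queries, mean_r2 in plc_rank(results):
    print(target, n_queries, round(mean_r2, 2))
```

In this toy input, X ranks first because it passes the threshold for all three queries, while Y and Z each pass for two queries and are ordered by their average r².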
The PLC method is most useful and relevant when at least two of the query genes are co-expressed with each other at the given PLC r² threshold. One way to determine the best cutoff for determining co-expression in PLC is to examine the correlations between the query genes that are expected ahead of time to be co-expressed and then choose an r² threshold that is lower than the smallest pairwise r² between them. If this is done, then the CressExpress PLC will recover all genes that are co-expressed with the queries at least as well as any two query genes are co-expressed with each other. To find out the threshold r² between queries, users can run the tool once, examine the list of co-expressed genes for each query individually to find out the cross-query r² values, and then re-run the tool with a new PLC r² threshold that is smaller than the lowest cross-query pair. Another method for finding the correlations between queries would be to use the CressExpress expression data direct access method in conjunction with a statistical analysis environment like R (R Development Core Team, 2008) or TableView (Johnson et al., 2003), described in detail below. Regardless of the precise r² threshold chosen, the PLC analysis has the most meaning when two or more of the query gene products are expected to be co-expressed because they act in concert, either in the same pathway or as constituents of a protein complex.
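The threshold-selection heuristic above amounts to taking the minimum over all cross-query pairs and stepping slightly below it. A minimal sketch follows; the pairwise values, the gene labels, and the margin of 0.02 are hypothetical choices, and the text itself only requires that the threshold be smaller than the lowest cross-query r².

```python
from itertools import combinations

def lowest_cross_query_r2(pairwise_r2, queries):
    """Smallest r2 among all pairs of query genes.

    `pairwise_r2` maps frozenset({gene_a, gene_b}) to the r2 for that pair.
    """
    return min(pairwise_r2[frozenset(pair)] for pair in combinations(queries, 2))

def suggest_threshold(lowest, margin=0.02):
    """Pick a PLC cutoff just below the lowest cross-query r2 (margin is arbitrary)."""
    return round(lowest - margin, 2)

# Hypothetical cross-query r2 values read from a first pilot run.
queries = ["Q1", "Q2", "Q3"]
pairwise_r2 = {
    frozenset({"Q1", "Q2"}): 0.62,
    frozenset({"Q1", "Q3"}): 0.37,
    frozenset({"Q2", "Q3"}): 0.48,
}
lowest = lowest_cross_query_r2(pairwise_r2, queries)
print("lowest cross-query r2:", lowest, "-> suggested PLC threshold:", suggest_threshold(lowest))
```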
The final step (step six) asks the user to enter a comma-separated list of one or more email addresses that will receive emails reporting on the status of the analysis. Upon successful completion of the analysis, each address receives a "Job Completion" message containing a link to a compressed archive folder (a zip file) stored temporarily on the CressExpress server. The zip file contains the full complement of analysis results as well as a record of all parameters used in the experiment (Table II). CressExpress generates results files (in csv format) for each query gene, in which regression results comparing the query gene to all probe sets on the selected array are reported. The spreadsheet files are named after the query gene's matching probe set and can be loaded directly into Excel or any other program capable of reading comma-separated format files. Each row of data represents the results from a single linear regression comparing the query gene against another gene on the designated array and includes the linear regression p value, r² value, and slope, along with the brief description of the target gene. These descriptions come from TAIR and are provided for the convenience of users as they scan results searching for interesting patterns in the types of genes that are most highly co-expressed with their queries.
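Because the per-query results are plain comma-separated files, they are easy to post-process outside a spreadsheet. The sketch below reads such a file and lists the best-correlated targets. The column names (probe_set, r2, slope, p_value, description) and the example rows are assumptions made for illustration; the actual headers in a CressExpress results file may differ and should be checked before use.

```python
import csv
import io

def top_hits(csv_text, min_r2=0.36, limit=10):
    """Return up to `limit` rows with r2 >= min_r2, sorted by decreasing r2."""
    rows = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        r2 = float(row["r2"])
        if r2 >= min_r2:
            rows.append((row["probe_set"], r2, float(row["slope"]), float(row["p_value"])))
    rows.sort(key=lambda item: item[1], reverse=True)
    return rows[:limit]

# Made-up file contents with hypothetical column names and probe set labels.
example = """probe_set,r2,slope,p_value,description
probe_A,0.81,1.10,1e-40,hypothetical target A
probe_B,0.12,-0.30,0.02,hypothetical target B
probe_C,0.55,0.80,1e-15,hypothetical target C
"""
for hit in top_hits(example):
    print(hit)
```

To read a file from an unpacked results archive instead of a string, the same `csv.DictReader` call can be given an open file handle.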
The PLC results files include csv files reporting co-expressed neighbor genes and corresponding Web pages with hyperlinks to The Arabidopsis Information Resource, allowing for rapid review of results. In addition, the PLC analysis generates a network file (coexp.sif) together with companion node and edge attribute files suitable for loading and visualization in Cytoscape, a popular network analysis and visualization tool (Shannon et al., 2003). Users can configure Cytoscape to utilize a custom styles file packaged with the results files. Directions on how to view the network file using Cytoscape appear in the FAQ section of the CressExpress Web site. Figure 5 presents an example of a Cytoscape visualization showing PLC results for six CESA (cellulose synthase) genes from Arabidopsis. This type of visualization is particularly useful when query genes resolve into separate networks, as with the six CESA genes presented in the figure. As can be seen from the figure, CESA1, 3, and 6, which predominate in primary cell wall formation, are connected to a different set of genes than are CESA4, 7, and 8, which predominate in secondary cell wall biosynthesis. In this case, the co-expression connections for the six genes appear to mirror the distinct functions of the secondary and primary cell wall genes encoding cellulose synthase subunits.
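For readers unfamiliar with Cytoscape's simple interaction format (SIF), each line names a source node, an interaction type, and a target node. The sketch below writes an edge list in that layout. It illustrates the file format only: the interaction label "coexp" and the example edges are assumptions, and the contents of the coexp.sif file that CressExpress actually produces may differ.

```python
def to_sif(edges, interaction="coexp"):
    """Format undirected edges as tab-delimited SIF lines, skipping duplicates."""
    lines = []
    seen = set()
    for a, b in edges:
        key = tuple(sorted((a, b)))
        if a == b or key in seen:
            continue  # drop self-loops and repeated pairs
        seen.add(key)
        lines.append(f"{key[0]}\t{interaction}\t{key[1]}")
    return "".join(line + "\n" for line in lines)

# Hypothetical edges among secondary cell wall CESA genes.
edges = [("CESA7", "CESA4"), ("CESA4", "CESA7"), ("CESA4", "CESA8")]
print(to_sif(edges), end="")
```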
Because CressExpress distributes data in bulk as one "zip" file, users typically find it relatively easy to track and store results from CressExpress analysis runs using their preferred electronic data archiving scheme. For example, users who incorporate results from CressExpress in published work might prefer to distribute the original "zip" file as part of Supplementary Data Files on journal or lab Web sites. The CressExpress design philosophy is that each run of the CressExpress tool is a self-contained experiment and should not only be easy to repeat but also should be easy to record and incorporate into users' research workflow. By providing complete records of results and experimental parameters, CressExpress aims to make it easier for users to manage and track their in silico co-expression experiments.

CressExpress Direct Access to Expression Values
CressExpress offers direct access to pre-computed expression data via a simple URL-based method in which users access expression values for specific probe sets by encoding the requests as Web addresses. Using the direct access approach, users can retrieve expression values for individual genes and arrays, save the values to local files, or import them directly into web-enabled desktop programs like R or TableView (Johnson et al., 2003). This feature of CressExpress is useful for advanced users who wish to perform their own custom analyses and therefore need direct access to the underlying expression data used by the CressExpress tool in the large-scale co-expression analysis. On the CressExpress Web site, the tabbed panel labeled "visualization" links to a tutorial explaining how to use the direct access method to import expression data into the TableView program, a user-friendly, freely-available desktop visualization tool implemented in Java. The tutorial describes how users can identify "outlier" arrays that signal potentially-meaningful deviations from the co-expression patterns between genes, identify sets of samples that yield unusually high or low expression values for a given gene, or compare the relative variability of similar sample types from different experiments. For more advanced users, the "web services" link provides an example of a short R script showing how to import the data into the R statistical programming environment and compute Pearson's correlation coefficient between query genes from the glucosinolates biosynthesis pathway. We also provide several BioMoby services for programmers to incorporate CressExpress functionality into their applications (Wilkinson et al., 2005). We do not describe these here, but instead refer readers to the CressExpress "Web Services" section, which documents the BioMoby services and their uses.

Example analysis: Cellulose synthase enzymes and cell wall biosynthesis
The Arabidopsis genome contains several CESA genes encoding putative and known components of multi-subunit cellulose synthase complexes responsible for primary and secondary wall formation. Previously, we described a co-expression-based analysis that identified genes that were co-expressed with all or some of the CESA genes, and subsequent analysis of mutant phenotypes for some of these genes confirmed their role in cell wall formation and/or stability (Persson et al., 2005). Here, we demonstrate how researchers can use the co-expression tool to recapitulate and extend the analysis, using new releases of expression data featured as part of the CressExpress tool.
Following the procedures described above, we ran a CressExpress experiment using the six primary and secondary cell wall genes (Table III) as queries and a PLC r² cutoff of 0.36. Figure 5 shows a screen capture from the Cytoscape network visualization tool depicting the six query genes and their PLC-identified neighbors. We find that the secondary and primary cell wall CESA genes are co-expressed with different, non-overlapping groups of genes. Of the genes linked with secondary cell wall CESA genes (Figure 5B), at least seventeen have been investigated experimentally and found to exhibit secondary cell wall-related phenotypes and functions (Table IV), while many more are annotated as having predicted functions related to carbohydrate synthesis or cell wall functions (Supplemental File 1). This example illustrates how one might use CressExpress to tease apart the different functions of genes that share considerable similarity at the sequence level but that may play distinct roles in the plant body.
In this case, it was already known that CESA4, 7, and 8 perform a different role from CESA1, 3, and 6, and we found that the co-expression analysis tends to confirm this view, since the two groups are co-expressed with non-overlapping groups of genes, as determined by the PLC analysis. The same argument may be made for other closely-related genes, potentially yielding new hypotheses regarding gene function even for members of closely-related gene families.

Example analysis: Glucosinolate biosynthesis from tryptophan
Glucosinolates are nitrogen- and sulfur-containing secondary metabolites that are derived from several different amino acids in plants, including Arabidopsis and many agriculturally-important Brassicaceae species (Grubb and Abel, 2006; Halkier and Gershenzon, 2006). Glucosinolates undergo conversion to toxic or otherwise bio-active breakdown products through the action of β-thioglucosidase enzymes called myrosinases that only come into contact with their glucosinolate substrates when cells are damaged and cellular contents mix. This mechanism is termed the "mustard oil bomb" and contributes to the plant's ability to resist pathogen attack and herbivory.
Glucosinolate breakdown products provide the characteristically pungent tastes of horseradish and wasabi, and at least one has been shown to have anti-cancer properties. Still others are toxic to animals in high doses. Because of the clear importance of glucosinolates in plant defense as well as nutrition and human health, glucosinolate biosynthesis has been studied intensively in Arabidopsis and related species, and many of the genes required in glucosinolate synthesis have been identified.
Hirai and colleagues used a co-expression-based approach to identify Myb-family transcription factors required for biosynthesis of cysteine-derived glucosinolates (Hirai et al., 2007). Similarly, Gachon and co-workers used hierarchical clustering of expression profiles to expose patterns of correlated expression among genes involved in synthesis of glucosinolates from tryptophan (Gachon et al., 2005). Co-expression analysis of glucosinolate biosynthesis in Arabidopsis therefore provides an excellent example application for the CressExpress tool. We used the AraCyc database hosted at The Arabidopsis Information Resource Web site to look up AGI codes for genes associated with the indolic glucosinolates pathway (Zhang et al., 2005). The pathway as annotated in AraCyc includes five reactions catalyzed by six gene products. We used CressExpress to determine the degree to which these genes are co-expressed with each other. Using CressExpress tool default parameters, we performed a co-expression analysis of all six genes, comparing them both to each other as well as to all other genes represented on the ATH1 array. This initial pilot study revealed that all six genes are highly co-expressed with each other. The lowest r² we obtained for any pair of genes in the pathway was 0.37, obtained for SUR1 compared with CYP79B2. We therefore ran CressExpress a second time, using identical parameters but with a PLC r² threshold of 0.35, in order to capture both the co-expression network involving the six biosynthetic genes as well as any other genes surrounding them that are co-expressed to the same or a slightly lesser degree than the query genes.
The PLC analysis revealed 155 different genes that are co-expressed (at r² equal to 0.35 or better) with two or more of the glucosinolate pathway genes; of these, seven are co-expressed with all six. We then used the BioMoby literature aggregator tool (LitRep, http://mips.gsf.de/proj/planet/araws/litRepSearch.html) to search for articles referencing these top seven co-expressed genes (Table V). So far, none appears to have been tested for glucosinolate biosynthesis-related phenotypes, but their annotated functions (sulfate assimilation, sulfotransferase activity, protein-methionine-S-oxide reductase activity, tryptophan biosynthesis) suggest they are reasonable candidates for involvement in this pathway. Testing the role of these genes in glucosinolate biosynthesis or plant defense is beyond the scope of this article, but their strong co-expression with each annotated gene in the pathway makes them good candidates for experimental investigation.

Discussion
We developed CressExpress, an easy-to-use, freely-available Web-based tool that allows users to run co-expression analysis experiments using biologically-relevant subsets of expression data harvested from public domain Affymetrix Arabidopsis expression microarrays. For each experiment, the tool re-computes co-expression relationships by performing a linear regression comparing each user-entered query gene to all other genes represented in the expression data, using a user-selected subset of experiments in the database. Computing the linear regressions typically requires several minutes, and so individual analyses are performed off-line as distinct jobs. All analysis code executes on a server remote from the user's desktop, and users set up experiments using their Web browser by proceeding through a series of steps in which they enter query genes, choose quality control parameters, and specify sample types to include in the analysis. At each point where a choice must be made, CressExpress provides links to help pages describing how the different parameters may affect results and also supplies reasonable defaults for users wanting to run preliminary pilot experiments. Thus, to operate the tool, users need only be able to operate a Web browser and have access to an email account. When the experiment "job" completes, users receive an email containing a link to a compressed "zip" file on the server containing results and output files.
Several groups have developed on-line, Web-based analysis tools that offer a variety of methods and approaches for analyzing and visualizing publicly-available Arabidopsis expression data (Zimmermann et al., 2004; Manfield et al., 2006; Obayashi et al., 2007). However, these tools are based on pre-computed correlations stored in a database and do not allow users to specify experiments or sample types to include in the calculations. By contrast, CressExpress re-computes correlations using the experiments and sample types that users select, which makes it possible to compare how the observed co-expression network changes when different arrays are included in an analysis. In addition, CressExpress explicitly tracks data releases, allowing users to experiment with different parameter settings, such as different array types, without concern that the underlying database has changed between experiments. To our knowledge, none of the other co-expression data-mining systems provides either the ability to specify sample types to include in analysis or detailed information about analysis inputs and parameters.
Another distinctive feature of CressExpress is that it provides comprehensive analysis results in formats aimed at facilitating downstream data-mining and analysis. After a run of the tool, users receive a simple, comma-separated plain text file for each query gene. Each of these files lists the r², slope, and p values for each linear regression between the query and all possible target genes, along with a short textual description of each target to aid users as they explore the results. In general, CressExpress aims to provide data in ways that allow researchers to visualize and explore results using desktop visualization programs and analysis tools, such as Excel, TableView (Johnson et al., 2003), and Cytoscape (Shannon et al., 2003). CressExpress also provides a direct access method allowing users to retrieve expression data directly from the CressExpress database using a simple, URL-based scheme. Using the direct access method, users can build spreadsheets with expression values for every array in a designated data release, or import expression values directly into desktop visualization and statistical analysis programs like R and TableView. By providing this relatively straightforward method of accessing data along with tutorials explaining how to use it with R and TableView, CressExpress aims to give users of varying computational experience the opportunity to experiment with computational methods in their research and extract new value from published, publicly-available expression microarray data.
In the future, we plan to add several new features to CressExpress, focusing on three major goals: facilitating comparative co-expression analysis across species, making sample selection easier, and re-packaging the software to allow easy deployment at other sites. To streamline array selection (step three), we plan to add a feature that will let users select samples based on their Plant Ontology term annotations as they become available (Avraham et al., 2008). We would also like to support comparisons and candidate gene prediction across species and array types.
Toward this end, we developed a prototype tool (see www.ssg.uab.edu/ccpmt) that matches probes and probe sets from different arrays based on target gene similarity. Currently, the tool supports six Arabidopsis arrays and a poplar genome array from Affymetrix. We would also like to make it easy for other groups to mirror the CressExpress software on their own sites, thus allowing them to re-use the software with their own custom data sets. In hopes of recruiting community interest in this latter effort, we created a developer's source code release of the software, available on the CressExpress Web site under the "Data Mining Code" link. However, it is important to note that CressExpress uses a commercial Java statistics library from Visual Numerics, Inc., which charges a licensing fee. Although we feel that the convenience of having a robust library of statistical routines is well worth the price, we may ultimately replace this non-free component with a version from the open source community. Another modification we are considering, depending on community interest, is increasing the current limit of fifty queries per CressExpress analysis. Readers who would like to suggest other new features or changes to the tool's behavior are welcome to contribute ideas and requests.

Array informatics
CD media containing expression data were obtained from the Nottingham Arabidopsis Stock Center AffyWatch subscription service. Each CD contained CEL files with "raw" expression data, grouped into folders named for the investigator who contributed the data.
Upon receipt of each CD, XML files describing each experiment were harvested from the NASC site, using the "passthru" parameter as described on the Web page located at http://arabidopsis.info/bioinformatics/narraysxml/index.html. Slide names associated with each experiment id were harvested from the XML files by extracting the content of "NASC:Name" tags. For about half the CEL files, we were able to use CEL file names and slide names to connect slides with their corresponding CEL files, and thus capture in our database the experimental group affiliation for each CEL file. For the remainder, NASC supplied (via email) Excel spreadsheets that report the correspondence between CEL file names and NASC slide names. The mapping between slide names and CEL file names was also manually checked; cases Stock Center for their generosity and help with matching slide and CEL file names. Lastly, we wish to thank the anonymous reviewers for their excellent comments on the manuscript.

Figure legends

Figure 1. Two highly co-expressed genes from Arabidopsis. RMA-normalized expression values for AtST5a (At1g74100) (x axis) and CYP83B1 (y axis) are plotted relative to each other. Each point indicates expression values from a single array. Regressing expression values for CYP83B1 (y axis) against expression values for AtST5a (x axis) yields an r-squared value of approximately 0.7, equivalent to a Pearson's correlation coefficient of approximately 0.8. The regression line appears as a dashed line on the plot.

Figure 2. CressExpress operation. Screen captures showing step-by-step operation of the co-expression tool are shown.

Figure 4. Co-expression network for flowering-related genes (A) without and (B) with sample-type filtering. In (B), only samples derived from flowers were included in the analysis.

Figure 5. Cytoscape visualization depicting PLC results for CESA genes involved in (A) primary and (B) secondary cell wall biosynthesis.

Table II. Data output files. When an analysis completes, users receive an email message reporting a URL where they can download a "zipped" folder containing several files with results and descriptions of the analysis. Several of these are listed.

Table III. Probe set-to-target gene mappings for genes encoding cellulose synthase enzymes involved in secondary and primary cell wall formation. Probe set-to-gene mappings are from The Arabidopsis Information Resource.
2 indicates the mean of the r² values obtained from regressing the gene in column 1 against each of the three CESA genes. Column 5 (phenotype) indicates whether the studies referenced in column 6 (refs) determined that the gene in column 1 possessed a secondary cell wall-related phenotype.

Table V. Genes involved in glucosinolate biosynthesis from tryptophan as reported in AraCyc version 3.5.
Values in the column labeled "Other names" are from TAIR. Names marked with * are from a review of glucosinolate biosynthesis (Grubb and Abel, 2006). Probe set-to-gene mappings are from The Arabidopsis Information Resource.

Table VI. Results from PLC analysis of genes involved in biosynthesis of glucosinolates from tryptophan.
Several genes with unknown functions or functions related to sulfur metabolism and tryptophan biosynthesis exhibit strong co-expression with all six genes from the glucosinolate biosynthesis pathway, as annotated in AraCyc 3.5.The full list of genes identified via PLC appears in Supplemental File 2. References cited were collected using LitRep, a BioMoby literature aggregator search tool that queries the Aramemnon, ATIDB, TAIR, and MPIMP databases (http://mips.gsf.de/proj/planet/araws/litRepSearch.html).