Abstract
Microarray technology has become a standard molecular biology tool. Experimental data have been generated on a huge number of organisms, tissue types, treatment conditions and disease states. The Gene Expression Omnibus (GEO) (Barrett et al., 2005), developed by the National Center for Biotechnology Information (NCBI) at the National Institutes of Health, is a repository of nearly 140 000 gene expression experiments. The BioConductor project (Gentleman et al., 2004) is an open-source and open-development software project built in the R statistical programming environment (R Development Core Team, 2005) for the analysis and comprehension of genomic data. The tools contained in the BioConductor project represent many state-of-the-art methods for the analysis of microarray and genomics data. We have developed a software tool that allows access to the wealth of information within GEO directly from BioConductor, eliminating many of the formatting and parsing problems that have made such analyses labor-intensive in the past. The software, called GEOquery, effectively establishes a bridge between GEO and BioConductor. Easy access to GEO data from BioConductor will likely lead to new analyses of GEO data using novel and rigorous statistical and bioinformatic tools. Facilitating analyses and meta-analyses of microarray data will increase the efficiency with which biologically important conclusions can be drawn from published genomic data.
Availability: GEOquery is available as part of the BioConductor project.
Contact: sdavis2@mail.nih.gov
1 OVERVIEW OF GEO AND GEOQUERY
The NCBI Gene Expression Omnibus (GEO) serves as a public repository for a wide range of high-throughput experimental data. These data include single and dual channel microarray-based experiments measuring mRNA, genomic DNA and protein abundance, as well as non-array techniques such as serial analysis of gene expression (SAGE) and mass spectrometry proteomic data. Currently, nearly 140 000 samples and over 3000 different microarray platforms are represented in GEO. At its most basic level of organization, GEO has four entity types. The first three (Sample, Platform and Series) are supplied by users; the fourth, the dataset, is compiled and curated by GEO staff from the user-submitted data. GEO platform files (accessions like GPLxxxx, where each ‘x’ is a digit) describe an array layout and content, while GEO samples (accessions like GSMxxxx) describe the actual results of a single hybridization. An entire experiment, including information about all hybridizations and their associated platform descriptions, can be accessed as a GEO series (GSExxxx). Finally, GEO staff have selectively curated GEO series into a more compact format, the GEO dataset (GDSxxxx), which includes a single spreadsheet of ‘final’ values and accompanying rich sample annotation. The data can be accessed from GEO in a proprietary format that GEO terms SOFT. Each of these entity types has a corresponding class representation defined by GEOquery. GEOquery also provides methods for accessing the various pieces of each GEO entity, so that access to the data is quick and easy once the data from GEO have been parsed into the GEOquery data structures.
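The accession naming scheme described above can be sketched in a few lines of base R. This is only an illustration: the helper geo_entity_type is hypothetical and not part of GEOquery, and the accessions passed to it are arbitrary examples chosen for their prefixes.

```r
# Hypothetical helper (not part of GEOquery): classify GEO accessions
# by their three-letter prefix, following the scheme in the text.
geo_entity_type <- function(accession) {
  prefixes <- c(GPL = "Platform", GSM = "Sample",
                GSE = "Series",   GDS = "DataSet")
  acc <- toupper(trimws(accession))
  # A valid accession is a known prefix followed by digits only.
  ok <- grepl("^(GPL|GSM|GSE|GDS)[0-9]+$", acc)
  out <- rep(NA_character_, length(acc))
  out[ok] <- unname(prefixes[substr(acc[ok], 1, 3)])
  out
}

# Illustrative accessions (assumptions, not taken from the text above).
geo_entity_type(c("GPL96", "GSM11805", "GSE781", "GDS507", "XYZ1"))
# returns "Platform" "Sample" "Series" "DataSet" NA
```

Unrecognized strings map to NA rather than raising an error, which keeps the helper usable on mixed vectors of identifiers.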
2 USING GEOQUERY
The primary goal of GEOquery is to download and parse the SOFT format files from GEO, maintaining all of the information contained in the GEO records. The design of GEOquery makes accessing data from GEO very simple. There is only one command that is needed,
This command loads the GEOquery library into R.
Now, the R variable
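A minimal sketch of a typical session follows, since the original listings are not reproduced here. It assumes the getGEO function from the GEOquery package and the example file GDS507.soft.gz that, to our knowledge, ships in the package's extdata directory; the variable names and the accession in the commented line are illustrative only.

```r
# Assumes GEOquery is installed, e.g. via BiocManager::install("GEOquery").
library(GEOquery)

# Parse a SOFT file bundled with the package (assumed to be present),
# so that no network access is needed for this example.
gds_file <- system.file("extdata/GDS507.soft.gz", package = "GEOquery")
gds <- getGEO(filename = gds_file)

# The result is an object of the GEOquery class matching the entity type.
class(gds)

# Passing an accession instead downloads the record from GEO
# (requires network access; the accession is an illustrative assumption).
# gsm <- getGEO("GSM11805")
```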
2.1 GEOquery data structures
Two of the most powerful features of GEOquery are the custom data structures and associated methods that organize the data from a GEO download. The GEOquery data structures fall into two broad groups that are analogous to the structure used by GEO to deliver the data. The first group comprises the GDS, GPL and GSM classes, which share a similar GEO SOFT format structure, so the associated GEOquery methods have similar effects on each. Each of these classes consists of a metadata header, taken nearly verbatim from the SOFT format header and containing information pertaining to the entire GEO object, and a GEODataTable. The GEODataTable has two simple parts: a Columns part, which describes the column headers, and a table part, which typically contains a 2D table of information. Here is the truncated output of some of the accessors applied to the GEO sample that was stored in the previous section.
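As a hedged illustration of this layout, the sketch below applies three accessors to a GEO sample. It assumes the GEOquery accessors Meta, Columns and Table, and the example file GSM11805.txt.gz that we believe is bundled with the package; output is not shown because it depends on the installed version.

```r
library(GEOquery)

# Load a bundled example sample (assumed to ship with GEOquery).
gsm_file <- system.file("extdata/GSM11805.txt.gz", package = "GEOquery")
gsm <- getGEO(filename = gsm_file)

# Metadata header: a named list taken from the SOFT header.
names(Meta(gsm))

# Columns part of the GEODataTable: one row per column description.
Columns(gsm)

# Table part of the GEODataTable: the 2D data, here only the first rows.
head(Table(gsm))
```

The same three calls should apply unchanged to GPL and GDS objects, since all three classes share the header-plus-GEODataTable layout.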
The GPL behaves exactly as the GSM class, although the
The GSE is a composite class. A GSE entry from GEO can represent an arbitrary number of samples hybridized on an arbitrary number of platforms. The GSE has a metadata section, just like the other classes. However, it does not have a GEODataTable. Instead, it contains two lists, accessible using
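The composite structure can be explored as in the sketch below. It assumes the accessor names GSMList and GPLList from the GEOquery package and the bundled example file GSE781_family.soft.gz; both are taken from our reading of the package documentation rather than from the text above.

```r
library(GEOquery)

# Load a bundled example series (assumed to ship with GEOquery).
gse_file <- system.file("extdata/GSE781_family.soft.gz", package = "GEOquery")
gse <- getGEO(filename = gse_file)

# Series-level metadata, as for the other classes.
names(Meta(gse))

# Two lists: one of GSM objects and one of GPL objects,
# each named by accession.
names(GSMList(gse))
names(GPLList(gse))

# Each list element is a full GSM (or GPL) object with its own
# metadata header and GEODataTable.
first_sample <- GSMList(gse)[[1]]
head(Table(first_sample))
```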
2.2 Converting to BioConductor data structures
GEO datasets are quite similar to the limma package (Smyth, 2005) data structure MAList and to the Biobase package (Gentleman et al., 2004) data structure ExpressionSet. Therefore, there are two functions,
As a simple example, converting
Now,
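A conversion of this kind might look like the following sketch. It assumes the GEOquery functions GDS2eSet and GDS2MA and the bundled example file GDS507.soft.gz; the do.log2 argument and the variable names are our assumptions, and both conversions may contact GEO to fetch platform annotation.

```r
library(GEOquery)

# Load a bundled example dataset (assumed to ship with GEOquery).
gds_file <- system.file("extdata/GDS507.soft.gz", package = "GEOquery")
gds <- getGEO(filename = gds_file)

# Convert to a Biobase ExpressionSet, log2-transforming the values.
# May need network access to retrieve the platform annotation.
eset <- GDS2eSet(gds, do.log2 = TRUE)

# Convert to a limma MAList.
MA <- GDS2MA(gds)

# The sample annotation from the GDS travels with the expression values.
dim(Biobase::exprs(eset))
head(Biobase::pData(eset))
```

Keeping the sample annotation attached to the expression matrix is what allows downstream BioConductor tools to use the GEO sample groupings directly.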
3 CONCLUSION
The GEOquery package provides a bridge between the BioConductor analysis tools and the vast public data resources contained in the NCBI GEO repositories. Because GEOquery maintains the full richness of the GEO data rather than extracting only the ‘numbers’, GEO data can be integrated into current BioConductor data structures and analyzed quickly and easily. The data can also be exported into any number of formats for use by other tools or for local storage and data mining. GEOquery will hopefully open GEO data more fully to the bioinformatics community at large.
ACKNOWLEDGEMENTS
We would like to thank the staff at NCBI GEO for valuable input and support during the development process.
Conflict of Interest: none declared.
