The Gene Expression Database (GXD) provides the scientific community with an extensive and easily searchable database of gene expression information about the mouse. Its primary emphasis is on developmental studies. By integrating different types of expression data, GXD aims to provide comprehensive information about expression patterns of transcripts and proteins in wild-type and mutant mice. Integration with the other Mouse Genome Informatics (MGI) databases places the gene expression information in the context of genetic, sequence, functional and phenotypic information, enabling valuable insights into the molecular biology that underlies developmental and disease processes. In recent years the utility of GXD has been greatly enhanced by a large increase in data content, obtained from the literature and provided by researchers doing large-scale in situ and cDNA screens. In addition, we have continued to refine our query and display features to make it easier for users to interrogate the data. GXD is available through the MGI web site at or directly at .
The laboratory mouse serves as a premier animal model in studying the complex molecular networks that underlie the processes of human development, differentiation and disease. To gain insights into these networks, it is essential to know where, when and in what amounts transcripts and proteins are expressed, and how their expression varies in different mouse strains and mutants. The Gene Expression Database (GXD) addresses this objective in a uniquely comprehensive way. GXD is the only resource that acquires mouse expression data from the literature in a systematic manner, as well as acquiring data directly from conventional and large-scale providers via electronic data submission and bulk data downloads. GXD integrates various types of mRNA and protein expression information, collects data from all tissues and developmental stages and includes data from many different mouse strains and mutants. Annotations in GXD make extensive use of controlled vocabularies and ontologies to provide the standardization of data that enables complex queries. In addition, GXD is fully integrated with the other databases of the Mouse Genome Informatics (MGI) resource, including the Mouse Genome Database (MGD) (1,2) and the MGI part of the Gene Ontology Project (GO) (3). MGI also maintains comprehensive links to external resources such as sequence databases, Entrez Gene, UniProt, InterPro, Online Mendelian Inheritance in Man (OMIM), PubMed and other mammalian databases (4–15). This robust integration puts the expression data annotated in GXD into a much larger biological and analytical context. Thus, users are able to query using extensive genetic, sequence, functional, expression and phenotypic information.
Other public and laboratory databases have been developed in recent years to store mouse expression data (16–26). They store data from one or two specific assay types and/or focus on specific tissues/developmental stages; they are often dedicated to specific data generation projects. These databases are complementary to the GXD effort. Due to its broad scope, its thorough approach and its data integration and querying capabilities, though, GXD provides a unique resource to the biomedical research community. New data are entered and made publicly available on a daily basis. GXD and its query interfaces have been described previously (27–30). Here we focus on recent progress in terms of data acquisition and querying capabilities.
The Gene Expression Literature Index
GXD curators survey journals to find all published papers that describe endogenous gene expression and knock-in reporter studies done in the embryonic mouse. In a first annotation step, the curators record the genes and ages analyzed and the expression assay types used in these publications. GXD combines these data with information obtained from PubMed and makes them available for searching via the Gene Expression Literature Index. Therefore, users can query for specific types of expression information in combination with bibliographic information (author, journal, year) or specific words in the title or abstract of publications. The Literature Index is comprehensive and up-to-date; it contains all pertinent journal articles from 1993 to the present and articles from major developmental journals from 1990 to the present. Currently, the index contains >56 500 entries covering nearly 12 300 references analyzing nearly 8700 genes. Thus, it provides a powerful tool to quickly locate expression information in the literature.
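The kind of combined query described above can be illustrated with a minimal sketch. It assumes a simplified, hypothetical record layout (the `IndexEntry` class, its field names and the sample records are invented for illustration and are not GXD's actual schema or data); it only shows how gene, age and assay annotations can be filtered together with bibliographic fields.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class IndexEntry:
    """One hypothetical Literature Index record: a gene studied in one reference."""
    reference_id: str
    gene: str
    ages: tuple      # embryonic ages analyzed, e.g. ("E9.5", "E10.5")
    assays: tuple    # assay types used, e.g. ("RNA in situ",)
    authors: tuple
    journal: str
    year: int
    title: str


def search_index(entries, gene=None, age=None, assay=None, author=None, title_word=None):
    """Return the entries that match every criterion that was supplied."""
    hits = []
    for e in entries:
        if gene is not None and e.gene != gene:
            continue
        if age is not None and age not in e.ages:
            continue
        if assay is not None and assay not in e.assays:
            continue
        if author is not None and author not in e.authors:
            continue
        if title_word is not None and title_word.lower() not in e.title.lower().split():
            continue
        hits.append(e)
    return hits


# Invented sample records, for illustration only.
ENTRIES = [
    IndexEntry("REF-1", "GeneA", ("E9.5", "E10.5"), ("RNA in situ",),
               ("Author A", "Author B"), "Journal X", 1998,
               "Expression of GeneA in the developing limb"),
    IndexEntry("REF-2", "GeneA", ("E12.5",), ("northern blot",),
               ("Author C",), "Journal Y", 2001,
               "GeneA transcripts in embryonic brain"),
    IndexEntry("REF-2", "GeneB", ("E12.5",), ("northern blot",),
               ("Author C",), "Journal Y", 2001,
               "GeneA transcripts in embryonic brain"),
]

in_situ_hits = search_index(ENTRIES, gene="GeneA", assay="RNA in situ")
author_hits = search_index(ENTRIES, author="Author C", title_word="brain")
```

Each criterion is optional, so the same function covers purely bibliographic searches, purely expression-based searches and combinations of the two.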
Gene expression data
GXD currently collects detailed expression data from the following assay types: RNA in situ hybridization, immunohistochemistry, in situ reporter (knock in), northern blot, RT–PCR, western blot, RNase protection and nuclease S1 protection studies. Work is underway to incorporate microarray data as well. As illustrated in Figure 1, expression records in GXD are detailed. Each entry contains a description of the assay type and the molecular probe used in the assay, the genetic origin of the sample and the experimental conditions used. The time and tissue of expression, the authors' description of pattern and strength of expression, the number and sizes of detected bands and sequence information are also recorded. Expression patterns are described using an extensive dictionary of standardized anatomical terms that lists the anatomical structures for each developmental stage in a hierarchical fashion, thus enabling the recording of expression results from assays with different spatial resolution in a consistent manner. The embryonic part of the anatomical dictionary was developed by our collaborators from the Edinburgh Mouse Atlas and Gene Expression Database (EMAGE) project (31); the adult part was developed by the GXD project (32). As well as enabling complex querying capabilities, these detailed annotations make it easier to interpret and compare expression data.
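To illustrate why a hierarchical dictionary lets results of different spatial resolution be handled consistently, the following sketch uses a small, hypothetical fragment of an anatomy hierarchy. The term names, the child-to-parent table and the sample results are assumptions made for illustration; the real dictionary is stage-specific and far larger. A query for a structure returns results annotated to that structure or to any of its substructures.

```python
# Hypothetical fragment of an anatomy hierarchy, stored as child -> parent.
PARENT = {
    "brain": "central nervous system",
    "spinal cord": "central nervous system",
    "forebrain": "brain",
    "midbrain": "brain",
    "telencephalon": "forebrain",
}


def ancestors(term):
    """Return all structures that contain `term`, nearest first."""
    chain = []
    while term in PARENT:
        term = PARENT[term]
        chain.append(term)
    return chain


def is_within(term, structure):
    """True if `term` is `structure` itself or one of its substructures."""
    return term == structure or structure in ancestors(term)


# Invented results: (gene, annotated structure, expression detected?).
RESULTS = [
    ("GeneA", "telencephalon", True),   # high-resolution annotation
    ("GeneB", "brain", True),           # low-resolution annotation
    ("GeneC", "spinal cord", True),
    ("GeneD", "midbrain", False),       # assayed, but not detected
]


def genes_expressed_in(structure, results):
    """Genes with a positive result in `structure` or any of its parts."""
    return sorted({gene for gene, term, detected in results
                   if detected and is_within(term, structure)})


brain_genes = genes_expressed_in("brain", RESULTS)
cns_genes = genes_expressed_in("central nervous system", RESULTS)
```

In this sketch a result recorded at the level of the telencephalon and one recorded only at the level of the brain are both retrieved by a query for brain, which is the behavior the hierarchical organization is meant to support.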
GXD's data content has increased significantly in recent years (Figure 2). Currently, GXD contains data from >24 600 assays that provide >260 000 detailed expression results for nearly 7700 genes, including expression data from almost 1000 different mouse mutants. Two-thirds of these data are linked to images of the primary expression data; GXD currently contains >43 000 images of expression data. This rapid growth in data content was made possible by the daily annotation of expression data from the literature and through the incorporation of large sets of expression data from large-scale RNA in situ hybridization and RT–PCR screens. Recently acquired large data sets include: RNA in situ and RT–PCR studies of mouse genes homologous to human chromosome 21 genes (33,34); RNA in situ studies that analyzed expression patterns of >1300 transcription factors in the developing mouse brain (35); RNA in situ studies examining the expression of >1000 genes during retinal development (36); and RNA in situ and RT–PCR studies of >300 RNA-binding protein encoding genes in the developing mouse brain (37). In these instances, curators worked with the researchers to fully integrate into GXD the large amounts of supplemental material that accompanied these publications. Curators worked with the laboratories to bring the data into standardized formats and to resolve issues pertinent to nomenclature and referential integrity; the data were reviewed both computationally and manually. The integration of these data into GXD greatly expands the research community's ability to query these data, increasing their utility.
GXD also stores expression data on the tissue or cell-line source of mouse cDNA clones, data mainly acquired via large data downloads. We have made strong progress in this area. In collaboration with the other members of the FANTOM consortium, including our colleagues from the MGD, we incorporated all cDNA data derived from the FANTOM project into GXD (38–40). We have also incorporated all other publicly available mouse cDNA and EST data, including cDNA data from the I.M.A.G.E. Consortium, the sets from the National Institute on Aging (NIA) (41,42) and the Mammalian Gene Collection (MGC) (43,44). We have loaded each clone's IDs and sequence information, including ESTs and any longer sequences that are known. Clone records have been associated with genes via manual curation and computational analysis of sequence associations. Likewise, using coordinated automatic and manual processes, we have mapped the source information for each cDNA library (such as information about strain, tissue, cell line and sex) to our controlled vocabularies. Thus, all the cDNA source data are now recorded in standardized form. The annotation and integration of these data allow users to do comprehensive, expression-related queries based on cDNA source information. Currently GXD contains data for >1.7 million clones for nearly 28 000 genetic markers.
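The combination of automatic and manual mapping of library source information can be pictured with a minimal sketch. It is not a description of the actual GXD load procedure: the synonym table, the normalization rule and the sample values below are assumptions chosen for illustration. Values that match a known synonym are mapped automatically to a controlled-vocabulary term, and everything else is set aside for manual review.

```python
# Hypothetical synonym table: normalized free text -> controlled-vocabulary term.
TISSUE_SYNONYMS = {
    "brain": "brain",
    "whole brain": "brain",
    "testis": "testis",
    "testes": "testis",
}


def normalize(raw):
    """Lower-case the text and collapse runs of whitespace."""
    return " ".join(raw.lower().split())


def map_sources(raw_values, synonyms):
    """Map free-text source values to controlled terms.

    Returns (mapped, needs_review): `mapped` holds the automatic assignments,
    `needs_review` lists values left for a curator to resolve by hand.
    """
    mapped = {}
    needs_review = []
    for raw in raw_values:
        key = normalize(raw)
        if key in synonyms:
            mapped[raw] = synonyms[key]
        else:
            needs_review.append(raw)
    return mapped, needs_review


# Invented library annotations, for illustration only.
mapped, needs_review = map_sources(
    ["Whole  Brain", "Testes", "mixed tissue pool"], TISSUE_SYNONYMS
)
```

Keeping the unresolved values in a separate list, rather than guessing a term for them, mirrors the division of labor between the computational and the manual steps.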
The Gene Expression Notebook
GXD's usefulness is directly proportional to the amount of data it contains, but extracting data from the literature is a time-intensive process. Therefore, to facilitate direct submissions of expression data from the research community, we have developed the Gene Expression Notebook (GEN). GEN is an Excel-based application designed to function as an electronic notebook to store and organize expression data and images in the laboratory. Researchers can then, with minimal additional effort, select data stored in the GEN and submit them to GXD. Data submissions are reviewed by GXD curators and receive accession numbers that can be cited in publications. GEN has been described in detail previously (45). Since then we have developed new versions of the GEN to be compatible with recent versions of Excel: 2000, 2003 and X. GEN is freely available for download at: .
ACCESSING GXD'S EXPRESSION DATA
Users primarily access GXD using web-based query forms. A complete listing of our query forms is provided in the supplementary material. The standard (Figure 3) and expanded Gene Expression Data query forms provide access to the detailed expression data. Both forms provide users with a wide variety of query fields so that they can extract the data of most interest to them. The expanded GXD query form is designed to allow users to query for genes expressed in some anatomical structures and/or developmental stages but not others. The functionality of the map position section on these forms (as well as on the cDNA clone query form) has been expanded to allow searching by genome coordinates and by a range of markers. Combined with the existing ability to search by centiMorgan position and cytogenetic band, this gives users a powerful tool to limit their expression searches to specified chromosomal regions, which is useful in positional cloning studies when hunting for candidate genes. All queries in GXD, including those from the standard and expanded query forms, return data summaries; an example is pictured in Figure 3. Links in the Results Detail column of these summary pages provide access to our detailed data entries; an example of a detailed entry is provided in Figure 1.
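The logic behind an "expressed in some structures but not others" query, restricted to a chromosomal region, can be sketched with simple set operations. The data layout, gene names and coordinates below are invented for illustration and do not reflect GXD's schema. The sketch also makes one explicit assumption: "not expressed" is taken to mean that no positive result is recorded for the structure, which is only one of several possible interpretations.

```python
# Invented expression results: (gene, structure, expression detected?).
RESULTS = [
    ("GeneA", "heart", True),
    ("GeneA", "liver", False),
    ("GeneB", "heart", True),
    ("GeneB", "liver", True),
    ("GeneC", "liver", True),
]

# Invented genome coordinates: gene -> (chromosome, start, end).
COORDS = {
    "GeneA": ("11", 1_000, 5_000),
    "GeneB": ("11", 90_000, 95_000),
    "GeneC": ("2", 2_000, 4_000),
}


def genes_detected(results, structure):
    """Genes with at least one positive result in `structure`."""
    return {gene for gene, s, detected in results if s == structure and detected}


def expressed_in_but_not(results, present_in, absent_from):
    """Genes detected in `present_in` with no positive result in `absent_from`."""
    return genes_detected(results, present_in) - genes_detected(results, absent_from)


def in_region(coords, chromosome, start, end):
    """Genes whose coordinates overlap the interval [start, end] on `chromosome`."""
    return {gene for gene, (c, s, e) in coords.items()
            if c == chromosome and s <= end and e >= start}


heart_not_liver = expressed_in_but_not(RESULTS, "heart", "liver")
candidates = heart_not_liver & in_region(COORDS, "11", 0, 10_000)
```

Intersecting the expression-based set with the region-based set corresponds to the candidate-gene use case: only genes that satisfy both the expression pattern and the map position constraint remain.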
GXD is implemented in the Sybase relational database management system. For those users who wish to perform custom queries or analyses not possible through the web-based query forms, direct SQL access or custom database reports can be requested from User Support (contact details supplied below).
MOUSE GENE EXPRESSION INFORMATION RESOURCE
GXD has a longstanding collaboration with the EMAGE project (16). EMAGE provides for 3D graphical storage of mouse developmental in situ expression patterns (from wild-type mice). GXD makes all pertinent text-annotated in situ data, including primary image data, available to EMAGE so that it can be spatially mapped. EMAGE already provides links from its Anatomical Section Browser () to the appropriate GXD data summaries. GXD is in the process of adding links to EMAGE. The ultimate goal of the collaboration between our two databases is the creation of the Mouse Gene Expression Information Resource (MGEIR) that will fully combine standardized text-based and graphical means for storing and querying expression data (46).
USER SUPPORT
GXD provides support to its users through detailed on-line documentation and a dedicated User Support staff. The on-line documentation can be accessed by clicking on the question mark found in the upper left-hand corner of most web pages. Our User Support personnel can be contacted via email at -firstname.lastname@example.org or by clicking User Support on our web pages. They can also be reached by phone at 1-207-288-6445.
CITING GXD
The following citation format is suggested when referring to data from GXD: These data were retrieved from the Gene Expression Database (GXD), Mouse Genome Informatics (MGI), The Jackson Laboratory, Bar Harbor, Maine, USA (URL: ). [Type in date (month, year) when you retrieved the data cited.] To reference the database itself, please cite this article.
Supplementary Data are available at NAR Online.
We would like to thank our colleagues from the other MGI projects for their contributions to the GXD project and to the larger, integrated MGI resource. We thank Drs. Carol Bult and Gregory Cox for critically reading the manuscript. GXD is available to the public for free due to funding from NIH grant HD033745. Funding to pay the Open Access publication charges for this article was provided by NIH grant HD033745.
Conflict of interest statement. None declared.