WormBase (http://wormbase.org), the public database for genomics and biology of Caenorhabditis elegans, has been restructured for stronger performance and expanded for richer biological content. Performance was improved by accelerating the loading of central data pages such as the omnibus Gene page, by rationalizing internal data structures and software for greater portability, and by making the Genome Browser highly customizable in how it views and exports genomic subsequences. Arbitrarily complex, user-specified queries are now possible through Textpresso (for all available literature) and through WormMart (for most genomic data). Biological content was enriched by reconciling all available cDNA and expressed sequence tag data with gene predictions, clarifying single nucleotide polymorphism and RNAi sites, and summarizing known functions for most genes studied in this organism.
WormBase is the central public database for Caenorhabditis elegans biology. It began as a web interface for its predecessor, the genomic database ACeDB (1,2). During the last half decade, it has been expanded to cover classical genetics and cell biology (3), functional genomics (4) and the Caenorhabditis briggsae genomic sequence (5,6). New releases of WormBase are built every three weeks by amalgamating physical and genome sequence data from the C.elegans Sequencing Consortium (Sanger Institute and Washington University, St Louis), genetic map data curated by the Caenorhabditis Genetics Center (University of Minnesota and Oxford University) and diverse biological data curated by the WormBase Consortium. Every 10th release is maintained as a permanently available, stable data source (‘freeze’) for reproducible bioinformatics. In the last year, the WormBase Consortium has worked to make WormBase more reliably useful and stable, while continuing to add new biology and preparing to handle an expected five nematode genomic sequences in 2006.
Much of WormBase is organized around two key data hubs, the Gene page and the Genome Browser. Both of these can summarize large amounts of data in a single view. However, as the contents of WormBase grew, the Gene page became increasingly slow to load in users' web browsers. We revised our software so that Gene pages are pre-built and stored ready for use; as a result, most Gene pages now load in <10 s. We also redesigned WormBase so that its software and data releases are packaged for easy uploading and updating. This allowed us to construct and maintain several mirror sites at the Institute of Molecular Biology and Biotechnology (Crete, Greece), the California Institute of Technology, and the Center for Computational Biology and Bioinformatics (Daejeon, South Korea). It also allows us to run WormBase on laptop computers for network-independent use and efficient software development. These improvements were made possible by clarifying internal data structures that are invisible to the user but critical for effective database management. For example, classical loci and coding sequences were consolidated into a single Gene data object that can stably represent genes regardless of fluctuations in their classical or molecular names.
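The consolidation of classical loci and coding sequences into one Gene object can be illustrated with a minimal sketch. The class names, fields, identifier and gene names below are hypothetical and do not reflect the actual WormBase schema; the sketch only shows how a lookup by any current name can resolve to a single stable record, so that a change of name does not change the identity of the gene.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional, Set


@dataclass
class Gene:
    """One record per gene, keyed by an ID that never changes (hypothetical)."""
    stable_id: str
    names: Set[str] = field(default_factory=set)  # classical and molecular names


class GeneIndex:
    """Resolves any current name to the same stable Gene record."""

    def __init__(self) -> None:
        self._by_id: Dict[str, Gene] = {}
        self._by_name: Dict[str, str] = {}

    def add(self, stable_id: str, *names: str) -> Gene:
        gene = Gene(stable_id, set(names))
        self._by_id[stable_id] = gene
        for name in names:
            self._by_name[name] = stable_id
        return gene

    def rename(self, old: str, new: str) -> None:
        # The name changes; the stable ID and the record itself do not.
        stable_id = self._by_name.pop(old)
        gene = self._by_id[stable_id]
        gene.names.discard(old)
        gene.names.add(new)
        self._by_name[new] = stable_id

    def lookup(self, name: str) -> Optional[Gene]:
        stable_id = self._by_name.get(name)
        return self._by_id[stable_id] if stable_id is not None else None


index = GeneIndex()
index.add("GENE:0001", "abc-1", "X01Y2.3")  # made-up ID and names
index.rename("X01Y2.3", "X01Y2.4")          # the molecular name changes
```

After the rename, `index.lookup("abc-1")` and `index.lookup("X01Y2.4")` return the same object, while the retired name no longer resolves.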
We revised the Genome Browser display (7) so that different subsets of genomic data (‘tracks’), as well as different sections of the Browser's display framework, can be shown or hidden at the user's option. This allows a user to construct economical and individualized views of any section of the genome, ranging from a few nucleotides to 1 Mb in length. These views can be bookmarked as stable URLs, or exported as publication-quality scalable-vector graphics images. Protein motifs (8–11) now have their own data track, showing the domain organization of proteins in the context of intron/exon structure, interspecies conservation, single nucleotide polymorphisms (SNPs), PCR reagents, RNAi results and other genomic features. Moreover, users can import and display their own data tracks seamlessly beside the core WormBase ones, either by uploading their own annotations from a local text file or by invoking a remote URL; using remote URLs enables collaborative genomic analyses by multiple users sharing a common data repository.
We developed Textpresso, a tool for searching the full content of C.elegans articles for meaningful word relationships, and incorporated it into WormBase (12). We recently expanded the Textpresso ontology with four new categories: ‘reporter gene’, ‘restriction enzyme’, ‘second messenger’ and ‘vector’. We also added new terms to the ‘drugs and small molecules’ and ‘organism’ categories. The literature searchable by Textpresso within WormBase contains 6259 full-text articles, including 5571 from the core C.elegans literature; this body of literature is automatically updated and expanded every week. Textpresso also contains 18 642 abstracts, including 8450 from international and regional C.elegans meetings. While Textpresso was first designed for use by WormBase, it has proven useful to several other model organism databases (e.g. http://www.yeastgenome.org/textpresso and http://www.ciliate.org/textpresso) and is being extended to non-genomic disciplines (such as neuroscience; http://www.textpresso.org/neuro). Textpresso has been made available to the Generic Model Organism Database software project (http://www.gmod.org) as open source code.
WormMart is a data warehousing system (13) that allows users to construct complex queries on WormBase and obtain results in HTML or tab-delimited text format. WormMart supersedes the ‘Batch Sequences’ and ‘Batch Genes’ reports, and facilitates arbitrarily complex queries such as ‘Find all genes in C.elegans that have an orthologue in C.briggsae, are located in chromosome III, have reduced fertility in an RNAi screen, and have annotated untranslated regions (UTRs)’. In addition to gene-centric queries, WormMart supports queries over expression patterns, RNAi phenotypes, mutant phenotypes, variations (alleles) and literature citations. WormMart is based on BioMart (http://www.ebi.ac.uk/biomart), the core software driving the EnsMart query engine at EnsEMBL (13).
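The example query above is a conjunction of independent filters, which can be sketched in a few lines. This is not the WormMart or BioMart interface; the records, identifiers and field names are invented for illustration, and the sketch only shows how such filters compose and how the result can be written as tab-delimited text.

```python
from typing import Any, Callable, Dict, List

GeneRecord = Dict[str, Any]
Filter = Callable[[GeneRecord], bool]

# Toy records with made-up identifiers and hypothetical field names.
GENES: List[GeneRecord] = [
    {"id": "gene-A", "chromosome": "III", "briggsae_orthologue": True,
     "rnai_phenotypes": {"reduced fertility"}, "has_utr": True},
    {"id": "gene-B", "chromosome": "III", "briggsae_orthologue": False,
     "rnai_phenotypes": {"reduced fertility"}, "has_utr": True},
    {"id": "gene-C", "chromosome": "X", "briggsae_orthologue": True,
     "rnai_phenotypes": set(), "has_utr": False},
]

# One filter per clause of the example query.
FILTERS: List[Filter] = [
    lambda g: g["briggsae_orthologue"],
    lambda g: g["chromosome"] == "III",
    lambda g: "reduced fertility" in g["rnai_phenotypes"],
    lambda g: g["has_utr"],
]


def run_query(genes: List[GeneRecord], filters: List[Filter]) -> List[GeneRecord]:
    """Keep only the records that satisfy every filter."""
    return [g for g in genes if all(f(g) for f in filters)]


def to_tsv(rows: List[GeneRecord], columns: List[str]) -> str:
    """Render the selected columns as tab-delimited text with a header row."""
    lines = ["\t".join(columns)]
    lines += ["\t".join(str(row[c]) for c in columns) for row in rows]
    return "\n".join(lines)


hits = run_query(GENES, FILTERS)
report = to_tsv(hits, ["id", "chromosome"])
```

Because each clause is a separate filter, a query can be made arbitrarily complex by appending further filters to the list without changing `run_query`.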
Even for C.elegans, with a relatively compact and well-determined genomic sequence, it is a continuing challenge to detect the existence and correct structures of ∼21 600 genes (14). In the past year, 2405 gene structures have been revised or newly identified. Approximately 5000 cDNAs have been connected to protein-coding sequences, resulting in 948 more protein-coding sequences becoming completely confirmed by cDNA data (a 17.2% increase). Essentially all C.elegans expressed sequence tag (EST) and cDNA sequences in public databases have been incorporated into WormBase gene structures. The number of introns identifiable from cDNA sequences but absent from existing gene structures was lowered considerably (from 746 to 121). Many other data besides cDNA sequences were also used to identify correct gene structures: detailed studies from individual research papers, personal communications to WormBase staff, Twinscan predictions (15), SL1/2 (16) and TEC-RED sequences (17), multiple alignments of protein families (18), and C.briggsae homologies (5). 5′- and 3′-UTRs in WormBase are now automatically generated as part of full-length coding transcripts, taking into account additional data such as trans-spliced 5′ leader sequences (16) and polyadenylation sites (19); 1800 new instances of trans-splicing have been identified.
SNPs (20,21) have been systematically overhauled. As originally published, C.elegans SNPs have often been inconsistent or incomplete: clone positions have changed over the years as sequence changes have been made, and published flanking sequences were often too short to uniquely map them to either clones or chromosomes. We thus went back through the original data and generated new flanking sequences that are unique in the genome. Similarly, we remapped all RNAi experiments, while adding two new large-scale datasets from an ORFeome library-based RNAi screen (22) and a full-genome RNAi profiling of early embryogenesis (23). This brought the total number of large-scale RNAi data points in WormBase from 27 112 to 58 778, and the number of distinct RNAi phenotypes from 78 to 119. We also continued adding microarray data to WormBase. The WS145 database release contained 2 984 398 microarray data points from 19 papers, describing 234 independent experiments, compared with 1 690 379 data points from 15 papers and 113 experiments from a year earlier.
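The idea of regenerating SNP flanking sequences until they are unique in the genome can be sketched as follows. This is an illustrative simplification rather than the procedure actually used for WormBase: it treats the genome as a single string, ignores the reverse strand, and uses hypothetical function names and length limits.

```python
from typing import Optional, Tuple


def count_occurrences(genome: str, query: str) -> int:
    """Count occurrences of query in genome, including overlapping ones."""
    count = 0
    start = genome.find(query)
    while start != -1:
        count += 1
        start = genome.find(query, start + 1)
    return count


def unique_flanks(genome: str, pos: int, min_len: int = 3,
                  max_len: int = 50) -> Optional[Tuple[str, str]]:
    """Return the shortest equal-length (left, right) flanks around the
    0-based position pos that each occur exactly once in genome.
    Returns None if no such flanks exist within the sequence or max_len."""
    for k in range(min_len, max_len + 1):
        if pos - k < 0 or pos + 1 + k > len(genome):
            return None  # the flank would run off the end of the sequence
        left = genome[pos - k:pos]
        right = genome[pos + 1:pos + 1 + k]
        if (count_occurrences(genome, left) == 1
                and count_occurrences(genome, right) == 1):
            return left, right
    return None


# Toy sequence: the 2-base left flank "AC" occurs twice, so 3 bases are needed.
flanks = unique_flanks("ACGACGTTAGCCATGGA", 5, min_len=2)
```

Growing the flanks one base at a time yields the shortest flanks that map unambiguously; a production version would also need to check the reverse complement and handle whole chromosomes efficiently, for example with an index rather than repeated scans.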
Functional genomics is a growing part of WormBase, with the incorporation of protein–protein interaction and isolated promoter data: these currently include 5534 yeast two-hybrid interactions covering 15% of the C.elegans proteome (24) and 6538 promoter sequences cloned in the MultiSite Gateway system (25).
CELLULAR AND ORGANISMAL BIOLOGY
Molecular data become more useful when accompanied by human-readable, concise descriptions of gene function (26). In WormBase, 3064/7864 (39%) of genes that have been named (i.e., that are not simply anonymous, little-studied gene predictions) now have such descriptions. For genes with at least one reference, 58% have concise descriptions (2421/4133); for genes with five or more references, 74% have concise descriptions (925/1248); for those with >10 references, 76% have concise descriptions (635/839); with >100 references, 86% have concise descriptions (85/99); and with >200 references, 88% have concise descriptions (29/33). Thus, for those genes that are information-rich, we have ∼75% coverage with our concise descriptions. We are also annotating gene functions with structured, computationally tractable gene ontology (GO) terms (27). So far, 5806 gene–GO term linkages have been identified from data in 718 references. Meanwhile, the entire genome has been scanned with automatic mappings to GO terms from RNAi phenotypes and from Interpro domains (28), yielding a total of 23 688 annotations.
There is far more important biology of C.elegans than WormBase can expect to describe in a reasonable time by traditional approaches. We thus developed a new semi-automated annotation strategy and tested it by mass-extracting genetic interactions from the primary literature. Extraction began with a Textpresso advanced query (12) for sentences containing ≥2 ‘gene’, ≥1 ‘association’ and ≥1 ‘regulation’ categories. A curator then read the individual sentences and identified individual gene–gene interactions. In this way, ∼26 000 sentences were retrieved by Textpresso from ∼4400 papers. From these, ∼10 000 interactions or possible interactions were identified, including: 5439 genetic interactions (54%); 1820 non-genetic interactions (18%); and 2739 possible interactions (27%). These represented ∼2000 unique, previously unannotated gene pairs.
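The first, automated step of this strategy is a category-count filter on sentences, which can be sketched as below. The lexicons, gene names and sentences are tiny invented stand-ins, not the Textpresso ontology or its query interface, and counting matching tokens per category is an assumption about how the thresholds are applied. The curation step that follows remains manual.

```python
import re
from collections import Counter
from typing import Dict, List, Set

# Tiny stand-in lexicons with placeholder gene names (hypothetical).
LEXICON: Dict[str, Set[str]] = {
    "gene": {"abc-1", "xyz-2"},
    "association": {"interacts", "binds"},
    "regulation": {"regulates", "represses"},
}

# Minimum number of matching tokens required per category.
THRESHOLDS: Dict[str, int] = {"gene": 2, "association": 1, "regulation": 1}


def category_counts(sentence: str) -> Counter:
    """Count how many tokens of the sentence fall into each category."""
    tokens = re.findall(r"[a-z0-9-]+", sentence.lower())
    counts: Counter = Counter()
    for token in tokens:
        for category, terms in LEXICON.items():
            if token in terms:
                counts[category] += 1
    return counts


def matches(sentence: str) -> bool:
    """True if the sentence meets every category threshold."""
    counts = category_counts(sentence)
    return all(counts[cat] >= n for cat, n in THRESHOLDS.items())


SENTENCES: List[str] = [
    "abc-1 interacts with xyz-2 and regulates its expression.",
    "abc-1 regulates vulval development.",
    "abc-1 and xyz-2 are expressed in the gonad.",
]

# Only these sentences would be passed on to a curator for reading.
candidates = [s for s in SENTENCES if matches(s)]
```

Only the first sentence passes all three thresholds; the other two are discarded before a curator sees them, which is where the saving in curation effort comes from.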
Two encyclopedic volumes describing C.elegans biology were published in 1988 and 1997 (29,30); while still invaluable, both predate the last decade of research and the rise of functional genomics with web-based bioinformatics. WormBook is a new, online collection of original reviews on topics related to all aspects of C.elegans biology, as well as a repository for experimental protocols used by C.elegans researchers. WormBook is freely available as HTML or PDF documents (www.wormbook.org). WormBook provides a text companion to WormBase with contributions by >100 expert biologists reviewing and synthesizing the facts presented in WormBase. When complete, WormBook will have hypertext links for genes, alleles, proteins and literature citations to WormBase and PubMed. Conversely, researchers using these linked primary databases will have reciprocal access to WormBook, facilitating the exchange of ideas and promoting further research. Over 20 completed WormBook chapters and >60 preprints of WormBook chapters had been released by September 2005.
We anticipate that WormBase will be called upon to manage genomic sequences from multiple Caenorhabditis species (31,32). Work on this began in 2005 with a gene set prediction for Caenorhabditis remanei, in which several different gene prediction sets were generated, tested against C.elegans genomic and C.remanei EST sequences, combined and hierarchically selected for the best possible automatic prediction. This yielded a total of 26 253 predicted C.remanei genes. We also intend to expand the classical biological content of WormBase by systematically annotating mutant alleles with an extensive phenotype ontology (33) adapted for nematodes to allow better searches of gene function. We plan to make cells, cell groups and biological processes more significant entry points into the content of WormBase.
P.W.S. is an investigator with the Howard Hughes Medical Institute. WormBase is supported by grant P41-HG02223 from the US National Human Genome Research Institute, and by the British Medical Research Council. Funding to pay the Open Access publication charges for this article was provided by grant P41-HG02223 from the US National Human Genome Research Institute.
Conflict of interest statement. None declared.