PlantGDB ( http://www.plantgdb.org/ ) is a genomics database encompassing sequence data for green plants (Viridiplantae). PlantGDB provides annotated transcript assemblies for > 100 plant species, with transcripts mapped to their cognate genomic context where available, integrated with a variety of sequence analysis tools and web services. For 14 plant species with emerging or complete genome sequence, PlantGDB's genome browsers (xGDB) serve as a graphical interface for viewing, evaluating and annotating transcript and protein alignments to chromosome or bacterial artificial chromosome (BAC)-based genome assemblies. Annotation is facilitated by the integrated yrGATE module for community curation of gene models. Novel web services at PlantGDB include Tracembler, an iterative alignment tool that generates contigs from GenBank trace file data and BioExtract Server, a web-based server for executing custom sequence analysis workflows. PlantGDB also hosts a plant genomics research outreach portal (PGROP) that facilitates access to a large number of resources for research and training.
PlantGDB serves the plant research community by providing access to plant sequence data as well as a variety of sequence and genome analysis tools in a single online resource [( 1 , 2 ); Table 1 ]. This update outlines recent developments at PlantGDB that have expanded its usefulness as a tool for comparative genomics. Key features include: expanded EST assemblies; new genome browsers for a larger number of species; overnight annotation of emerging genome sequences; and novel tools for sequence retrieval and analysis, including an innovative system for the creation and management of workflows that integrates database queries, linked web services, and local tools.
In this table, sequence resources are divided into four categories: Uploaded Sequence, Assembled Sequence, Genome Browsers, and Other Tools. For each resource in column 1, the species available, source/version, current sequence count, update frequency, tool/services, download options, and alignment to genome are shown in adjacent cells. Web links for both external and internal data/tool sources are indicated with superscript numbers and are listed at the end of the table under Web Resources. Sequence counts displayed here are as of 30 October 2007.
DATABASE FEATURES AND ADDITIONS
Plant sequence data and transcript assemblies
PlantGDB periodically uploads and parses all Viridiplantae sequences from GenBank ( 3 ) and UniProt ( 4 ) into ∼70 000 individual data sets according to species or subspecies origin ( Figure 1 ). PlantGDB's sequence data are refreshed approximately every 4 months, coinciding with alternate bimonthly GenBank version releases. GenBank and UniProt sequences are uploaded, parsed by (sub)species, indexed for BLAST ( 5 ) and GeneSeqer ( 6 ) analysis, and loaded into MySQL tables. For all species with >10 000 published transcripts, a non-redundant set of PlantGDB-generated Unique Transcripts (PUTs) is produced using a custom assembly pipeline ( http://www.plantgdb.org/prj/ESTCluster/PUT_procedure.php ). PUTs are aligned to UniProt entries using BLAST, and the best matches (if any) and UniProt-associated Gene Ontology (GO) annotations ( 7 ) are stored. Currently, 116 species have PUT assemblies at PlantGDB, spanning diverse taxonomic groups ( Figure 2 ). Users can track assembly progress in PlantGDB at http://www.plantgdb.org/prj/ESTCluster/progress.php . PlantGDB also provides genome survey sequence (GSS) assemblies for maize and sorghum ( http://www.plantgdb.org/prj/GSSAssembly/ ). All PlantGDB sequence data (raw and processed) are available for download in a variety of file formats at http://www.plantgdb.org/download/download.php . In addition, all Zea mays sequence data at PlantGDB are uploaded monthly to MaizeGDB, the central repository for maize genetic information ( http://www.maizegdb.org ) ( 8 ).
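The species-level selection rule described above (a PUT assembly for every species with more than 10 000 published transcripts) can be illustrated with a minimal sketch. This is not the PlantGDB pipeline; the `(accession, species)` record layout and the function name `species_needing_puts` are assumptions made only for illustration.

```python
from collections import Counter

# Threshold taken from the text: species with >10 000 published transcripts.
PUT_THRESHOLD = 10_000


def species_needing_puts(records, threshold=PUT_THRESHOLD):
    """Return the species whose transcript count exceeds the threshold.

    `records` is an iterable of (accession, species) pairs; this layout is
    hypothetical and stands in for the parsed GenBank data.
    """
    counts = Counter(species for _accession, species in records)
    # Strictly greater than the threshold, matching the ">10 000" wording.
    return sorted(sp for sp, n in counts.items() if n > threshold)
```

With a lowered threshold for demonstration, `species_needing_puts(records, threshold=2)` returns only those species that contribute at least three records.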
Query and analysis tools
Data housed at PlantGDB are stored in MySQL tables and can be queried by accession number, GI number or text search. TableMaker, a search module for querying and retrieving PlantGDB's GenBank data in tabular format, described previously ( 1 ), has been expanded to include a new query wizard to simplify the search process for users not familiar with GenBank data models ( http://www.bioextract.org/genbank/home/index.jsp ).
For sequence similarity searching, a batch NCBI-BLAST tool accepts up to 100 query sequences at a time and searches them against any combination of plant species data sets ( http://www.plantgdb.org/PlantGDB-cgi/blast/PlantGDBblast ). For specialized queries, PatternSearch ( http://www.plantgdb.org/PlantGDB-cgi/vmatch/patternsearch.pl ) interrogates the database for relatively short matches possibly interspersed with mismatches and indels, and ProbeMatch ( http://www.plantgdb.org/PlantGDB-cgi/prj/PLEXdb/ProbeMatch.pl ) allows users to match sequences to array probes and link to array probe databases.
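Because the batch BLAST tool takes at most 100 query sequences per submission, a larger query set has to be split beforehand. The following is a small client-side sketch of that step, assuming the queries are held in a FASTA-formatted string; `parse_fasta` and `batches` are illustrative helper names, not part of any PlantGDB interface.

```python
def parse_fasta(text):
    """Parse FASTA text into a list of (header, sequence) tuples."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def batches(records, size=100):
    """Yield consecutive slices of at most `size` records (100 per the text)."""
    for start in range(0, len(records), size):
        yield records[start:start + size]
```

A set of 250 queries would therefore be submitted as three batches of 100, 100 and 50 sequences.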
PlantGDB also provides online access to GeneSeqer alignment software, allowing the user to calculate spliced alignments of expressed transcripts to a target genomic sequence, as described previously ( 1 ) ( http://www.plantgdb.org/PlantGDB-cgi/GeneSeqer/PlantGDBgs.cgi ). Currently, transcript data sets (EST, cDNA, PUT) from 50 species are preprocessed at PlantGDB to allow rapid online GeneSeqer analysis, and a range of splice site models and alignment parameters can be specified at runtime.
PlantGDB provides sequence analysis tools that automate processes that normally require iterative searching or tedious parsing of information. Tracembler ( http://www.plantgdb.org/tool/tracembler/ ) allows the user to perform chromosome walks with pre-assembly trace data by running an iterated search of NCBI's trace archives with a seed sequence ( 9 ). Tracembler invokes CAP3 ( 10 ) to assemble a contig sequence from one or more automated rounds of BLAST analysis and displays a pairwise alignment between contig and seed sequence. MuSeqBox ( http://www.plantgdb.org/MuSeqBox/MuSeqBox.html ) is an online tool for generating tabular output from multiple BLAST queries, based on user-specified thresholds ( 11 ). The tool also contains algorithms for detecting potential alternative splicing and full-length transcripts. MuSeqBox provides pre-computed UniProt BLASTx data sets for maize, sorghum, barley, Arabidopsis and rice PUTs, allowing the user to generate filtered, tabulated output. Alternatively, the user can upload a custom BLAST output file.
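The idea of threshold-based filtering of BLAST results can be sketched as follows. This is not MuSeqBox's algorithm or input format; it assumes the standard 12-column tab-delimited BLAST output, and the threshold defaults and the name `filter_hits` are chosen only for illustration.

```python
import csv
import io

# Column names of the standard 12-column tabular BLAST output.
FIELDS = [
    "qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
    "qstart", "qend", "sstart", "send", "evalue", "bitscore",
]


def filter_hits(tabular_text, max_evalue=1e-10, min_identity=50.0):
    """Keep hits that pass user-specified E-value and identity thresholds."""
    hits = []
    for row in csv.reader(io.StringIO(tabular_text), delimiter="\t"):
        if not row or row[0].startswith("#"):
            continue  # skip blank lines and comment lines
        hit = dict(zip(FIELDS, row))
        if (float(hit["evalue"]) <= max_evalue
                and float(hit["pident"]) >= min_identity):
            hits.append(hit)
    return hits
```

Each retained hit is a dictionary keyed by column name, so the result can be written out as a table with whichever columns the user needs.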
Bioinformatics workflow tools
A bioinformatics research project may utilize a variety of query and computational tools, some web-based and others local to the user, which may be carried out in serial fashion along with parsing and formatting for input or display. There is a growing need for systems that can integrate disparate tools and workflows in a way that automates the process of input/output, computation and documentation. The PlantGDB-associated BioExtract Server ( http://www.bioextract.org/login/Login.html ) addresses this need by providing researchers with a web interface that allows them to query sequence databases, analyze data with web-based as well as local bioinformatics tools, save results and create and manage workflows using a directed acyclic graph (DAG) model (Lushbough,C., Bergman,M.K., Lawrence,C.J., Jennewein,D. and Brendel,V. BioExtract server – an integrated system to access and analyze heterogeneous, distributed biomolecular data. Submitted for publication.). As a simple example, a user could develop a workflow that performs a BLAST search, retrieves peptide sequences from query results, eliminates redundant sequences and produces a multiple sequence alignment output. BioExtract workflows can be paused, modified, saved, shared with an online workgroup or the world, and documented electronically for future reference.
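To make the directed acyclic graph model concrete, the example workflow above can be written as a dependency map and ordered with a topological sort. This sketch uses Python's standard `graphlib` module (Python 3.9 or later) and does not reflect the BioExtract Server's internal implementation; the step names are hypothetical labels for the four stages mentioned in the text.

```python
from graphlib import TopologicalSorter

# Each step maps to the set of steps it depends on (its predecessors).
WORKFLOW = {
    "blast_search": set(),
    "retrieve_peptides": {"blast_search"},
    "remove_redundant": {"retrieve_peptides"},
    "multiple_alignment": {"remove_redundant"},
}


def execution_order(workflow):
    """Return the steps in an order that respects every dependency.

    Raises graphlib.CycleError if the graph is not acyclic.
    """
    return list(TopologicalSorter(workflow).static_order())
```

Representing the workflow as a DAG means that a cyclic dependency is rejected before anything runs, and that independent branches, if added, could be scheduled in any order consistent with their predecessors.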
PlantGDB provides genome browsers (xGDB) for 14 plant species whose genomes have been completely or partially sequenced ( 12 ). AtGDB, OsGDB, MtGDB, SbGDB, VvGDB and PtGDB are chromosome-based genome browsers for Arabidopsis thaliana (thale cress), Oryza sativa (rice), Medicago truncatula (barrel medic), Sorghum bicolor (sorghum), Vitis vinifera (grapevine) and Populus trichocarpa (western balsam poplar), respectively, while ZmGDB, GmGDB, HvGDB, SlGDB, GhGDB, TaGDB, BrGDB and LjGDB are BAC-based browsers for Zea mays (corn or maize), Glycine max (soybean), Hordeum vulgare (barley), Solanum lycopersicum (tomato), Gossypium hirsutum (cotton), Triticum aestivum (bread wheat), Brassica rapa (field mustard) and Lotus japonicus, respectively ( http://www.plantgdb.org/prj/Genome_browser.php ).
The xGDB browser Context View ( Figure 3 B) displays current gene model annotation together with high quality, cognate and non-cognate GeneSeqer alignments of ESTs, cDNAs, and PUTs to genomic sequence. Similarly, Oryza sativa predicted polypeptides from TIGR ( http://rice.tigr.org/tdb/e2k1/osa1/ ) ( 13 ) and/or Arabidopsis thaliana predicted polypeptides from TAIR ( http://www.arabidopsis.org/ ) ( 14 ) are splice-aligned to genomic sequence using GenomeThreader ( 15 ) and displayed in the same window. For species with microarray probe sequence, these are downloaded from the microarray database at PLEXdb ( 16 ), aligned relative to PUT assemblies and displayed. Significantly, users can view spliced alignments in a genomic region at the nucleotide level and also retrieve quality scores and provenance information for any spliced alignment displayed at xGDB. Additional sequence alignments, including GSS contigs and repeat masked regions, are displayed for some genomes. A subset of xGDB annotation data are accessible through the Distributed Annotation System (DAS) ( http://www.biodas.org/ ). These data can be downloaded for further analysis, or alternatively imported into another genome browser capable of importing DAS formatted data.
Key design features of the xGDB browsers are modularity and expandability, such that a browser for a newly emerging genome can be rapidly deployed and populated with computed alignments. For the user, xGDB provides a clean, easy to navigate interface with links to internal and external data sources, zoomable and customizable views of transcript alignments, BLAST, search tools and the ability to evaluate alignments and contribute new annotations online (see next section). The browsers are designed with Perl/PHP and MySQL database management, and complete code is available from SourceForge ( http://xgdb.sourceforge.net/ ) for standalone installations.
Although excellent tools are available for defining genic regions and variant transcript forms from evidence-based data as well as ab initio prediction, models can often be improved further by human-curated annotation. PlantGDB's yrGATE ( http://www.plantgdb.org/prj/yrGATE/ ) is a recently developed tool for community annotation of gene models that is integrated with PlantGDB's xGDB genome browsers ( 17 ). From a single browser window the user can rapidly evaluate a selected region for intron/exon structures based on any combination of EST/cDNA evidence and ab initio prediction, compare the model with known proteins via GenBank BLAST, and submit the annotation for review and publication on the genome browser. To assist in the identification of gene models in need of annotation, the Genome Annotation Evaluation (GAEVAL) module generates quality scores for gene structure predictions and classifies cases of incongruence of the annotation with experimental evidence ( http://www.plantgdb.org/AtGDB-chtml/gaeval/ ). The yrGATE tool is available for both BAC and chromosome-based xGDB browsers and is being used to communicate evidence-based gene models to the A. thaliana genome database, TAIR ( 18 ). Figure 3 shows an example of how yrGATE can be used, together with xGDB's annotation tables and genome browser, to identify and annotate potential splicing variants for a gene of interest in maize.
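One simple form of incongruence between an annotation and its experimental evidence is an annotated intron that no spliced alignment confirms. The sketch below illustrates that check only; it is not the GAEVAL scoring scheme, and the exon representation (inclusive `(start, end)` coordinate pairs on one strand) and function names are assumptions.

```python
def introns_from_exons(exons):
    """Derive intron (start, end) pairs from inclusive exon coordinates."""
    exons = sorted(exons)
    return [
        (left_end + 1, right_start - 1)
        for (_, left_end), (right_start, _) in zip(exons, exons[1:])
    ]


def intron_support(annotated_exons, evidence_alignments):
    """Split annotated introns into those confirmed by evidence and the rest.

    `evidence_alignments` is a list of exon lists, one per spliced
    transcript alignment (hypothetical layout).
    """
    evidence = {
        intron
        for exons in evidence_alignments
        for intron in introns_from_exons(exons)
    }
    annotated = introns_from_exons(annotated_exons)
    confirmed = [i for i in annotated if i in evidence]
    unsupported = [i for i in annotated if i not in evidence]
    return confirmed, unsupported
```

Gene models with unsupported introns would be natural candidates for manual review, although a real evaluation would also weigh alignment quality and coverage.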
Pipelines for genome annotation
Genome browsers at PlantGDB are refreshed on a timetable that depends on the pace of accumulation of new genomic or transcript sequence data or assemblies for the respective species. New spliced alignments are calculated for ESTs, cDNAs and PUTs as well as for other sequence types (where available) and data are uploaded. To match the rapid pace with which some genomes are being sequenced, PlantGDB staff have developed and implemented an automated genome data pipeline for species with rapidly expanding sequence data, using Zea mays as an initial example. In 2007, new maize BAC sequences began to be deposited in GenBank at the rate of over 60 BACs or ∼10 Mb of sequence per day ( http://www.maizesequence.org ). In addition, there is a growing catalog of transposable element-tagged maize genomic sequence in GenBank, facilitating reverse genetics in maize ( http://www.plantgdb.org/prj/AcDsTagging/ ) ( 19 ) as well as a large repository of EST sequence-derived PUTs. PlantGDB's daily Z. mays pipeline downloads and processes all new maize BACs with transcript, protein, microarray probe, transposon insertion tag and other genomic alignments, and displays the cumulative output for all BACs in ZmGDB (the xGDB browser for Z. mays ; http://www.plantgdb.org/ZmGDB/ ) within 12 h. The pipeline also updates BLAST and sequence download resources daily. Significantly, the pipeline also generates a browsable, searchable, tabular output of rice gene models and putatively transposon-tagged genes for the entire BAC data set ( http://www.plantgdb.org/ZmGDB/DisplayGeneAnn.php ), providing a powerful and timely gene discovery tool for researchers ( Figure 3 ). This effort represents an early implementation of a real-time, high-throughput, discovery-oriented annotation process using automated workflows.
The Plant Genome Research Outreach Portal (PGROP) at PlantGDB provides a portal for plant genomics resources online as well as a repository of outreach content, serving the needs of a wide-ranging audience from high school through postgraduate ( 20 ). Users can query for resources or add a resource using simple online tools, and query results are ranked via algorithms that highlight the most popular resources.
PlantGDB will continue to provide comprehensive and up-to-date plant sequence information online and available for download. As additional genomes become available, xGDB browsers will be expanded, with additional annotations contemplated for certain species [e.g. tracks for transcription factor binding sites and conserved non-coding sequences ( 21 )]. Also planned are additional comparative genomics tools such as SynBrowse ( 22 ), expanded DAS import and export, and the development of quantitative (e.g. quality score) and qualitative (e.g. library) filters for spliced alignments. Expanded help and tutorial sections are also under development.
PlantGDB has expanded greatly in scope since 2004, providing today a wide range of data sets, query methods and analysis tools for researchers interested in comparative plant genomics or gene discovery research. The site aims to complement other, more specialized plant genome sites by providing comprehensive plant sequence data as well as a suite of tools and genome browsers that emphasize spliced alignment of cognate and non-cognate transcripts and similar protein sequences. PlantGDB also addresses the need for timely access to, and processing of, high-volume informatics data through use of automated daily data pipelines (e.g. maize BAC pipeline) and online workflow tools (e.g. BioExtract Server and Tracembler). With the yrGATE community annotation tool, PlantGDB facilitates the sharing of user-generated gene annotation information across the entire plant research community.
The authors would like to thank Dr Q. Dong, Dr S.D. Schlueter and Dr B.-B. Wang for their many contributions to PlantGDB over the years, M. Brekke for system support, and the many PlantGDB users who have helped us to improve this resource by providing feedback and suggestions. This work is supported in part by a grant from the National Science Foundation Plant Genome Research Program to V.B., C.J.L. and C.L. (DBI-0606909). Funding to pay the Open Access publication charges for this article was provided by the cited NSF grant.
Conflict of interest statement . None declared.