Bread wheat (Triticum aestivum) is one of the most important crop plants, globally providing staple food for a large proportion of the human population. However, improvement of this crop has been limited due to its large and complex genome. Advances in genomics are supporting wheat crop improvement. We provide a variety of web-based systems hosting wheat genome and genomic data to support wheat research and crop improvement. WheatGenome.info is an integrated database resource which includes multiple web-based applications. These include a GBrowse2-based wheat genome viewer with BLAST search portal, TAGdb for searching wheat second-generation genome sequence data, wheat autoSNPdb, links to wheat genetic maps using CMap and CMap3D, and a wheat genome Wiki to allow interaction between diverse wheat genome sequencing activities. This system includes links to a variety of wheat genome resources hosted at other research organizations. This integrated database aims to accelerate wheat genome research and is freely accessible via the web interface at http://www.wheatgenome.info/.
Bread wheat (Triticum aestivum) is one of the most important crop plants worldwide. It occupies 17% (about one-sixth) of the world's crop acreage (Gupta et al. 2008), is a staple food for 35% of the world's population, and provides more calories and protein in the global diet than any other crop (www.idrc.ca/en/ev-31631-201-1-DO_TOPIC.html). Annual global wheat consumption has exceeded 600 Mt (http://www.fao.org), equivalent to about 100 kg per capita (Safar et al. 2010), and the demand for wheat is predicted to grow by >40% by 2020 (Bhalla 2006).
While wheat is of great social and economic importance, it possesses a large and complex genome due to hexaploidy and a high proportion of repetitive DNA (Chantret et al. 2005, Paux et al. 2008, Wanjugi et al. 2009), making genomic analysis a significant challenge. The wheat genome sequence is currently unknown, limiting genomic-based crop improvement; however, efforts are underway throughout the world to sequence this genome. The International Wheat Genome Sequencing Consortium (http://www.wheatgenome.org/), which was established following a wheat genome sequencing workshop in November 2003 (Gill et al. 2004), is taking a BAC (bacterial artificial chromosome) by BAC approach which aims to deliver a complete high quality genome sequence by 2015. Hexaploid bread wheat was selected rather than the individual, ancestral diploid genomes because it is the species grown in 95% of the wheat-growing areas, and the ABD genomes of bread wheat do not correspond physically and functionally to the sum of the ancestral A (Triticum urartu), B (unknown species that are likely to be related to Aegilops speltoides) and D (Aegilops tauschii) genomes (Feuillet et al. 2011). The consortium has adopted a chromosome-based strategy to construct physical BAC clone maps and subsequently to sequence each of the 21 individual chromosomes (Dolezel et al. 2007). The first physical map of the largest wheat chromosome 3B was produced (Paux et al. 2008) and its sequencing is ongoing (http://urgi.versailles.inra.fr/index.php/urgi/Projects/3BSeq).
As an alternative to the BAC by BAC approach, other groups are applying second-generation sequencing technologies to gain insights into this complex genome. A consortium from the UK produced 5× coverage of the bread wheat genome using Roche 454 technology (http://www.cerealsdb.uk.net/). While this is insufficient to produce a finished genome assembly, the data are a valuable resource for gene discovery and genetic variation analysis (Imelfort and Edwards 2009). A draft wheat genome assembly has also been produced from the donor species of the wheat D genome, A. tauschii (http://www.cshl.edu/genome/wheat) and for individual flow-sorted bread wheat chromosome arms (Berkman et al. 2011, Wicker et al. 2011). With the increasing volume of wheat genome data becoming available through such efforts, it is essential to provide resources that can integrate wheat-specific sequence information in a manner accessible to crop improvement researchers (Edwards and Batley 2010).
In recent years, the growth in genome information has led to a challenge for bioinformatics researchers to transform the vast quantities of data being produced into collective knowledge. As sequence availability has increased, data access, representation, analysis and visualization present significant challenges (Batley and Edwards 2009, Ning and Montgomery 2010). In this context, online databases for genome and genomic data are very much in demand.
This paper describes an integrated wheat genome data resource, WheatGenome.info, which provides a variety of web-based systems for access to wheat genome and genomic data to support applied crop research and crop improvement. Moreover, this interface includes links to wheat-related web-based data hosted at other research organizations. WheatGenome.info is available at http://www.wheatgenome.info/.
The WheatGenome.info database integrates several main web-based systems. These include an annotated wheat genome viewer based on GBrowse2, searchable by keyword, by genome location or by sequence similarity using the BLAST portal; a CMap genetic and physical mapping database; TAGdb for searching wheat short read sequences; an annotated wheat expressed sequence tag (EST)-single nucleotide polymorphism (SNP) database; and a wheat genome Wiki.
A wheat genome viewer for annotated chromosome arm assemblies
The application of second-generation sequencing technology and advanced bioinformatics tools has enabled the assembly and annotation of the genes and low copy regions of isolated wheat chromosome arms, producing syntenic builds containing the majority of wheat genes (Berkman et al. 2011). Assemblies and syntenic builds for each of the group 7 chromosome arms have been produced (Table 1) and are hosted in a GBrowse2 database at wheatgenome.info for public access prior to publication. GBrowse2 is a user-friendly generic genome browser for genome sequence data and annotation (Donlin 2007, Arnaoudova et al. 2009). Each wheat chromosome arm has been annotated with predicted genes, Uniref90 gene similarities as well as homoeologous SNPs.
|Chromosome arm||Volume of data on TAGdb (Gbp)||Coverage||Syntenic build version||Syntenic build size (Mbp)|
As well as annotation keyword searches, a BLAST portal enables sequence similarity searches of assembled wheat chromosome arm data, with results displayed in the GBrowse2 viewer. A DNA or protein query sequence can be uploaded or pasted into the web-based form in FASTA format. The results are displayed in three sliding windows: the Overview window, Region window and Details window. The reference view can be dragged and zoomed. Several annotation tracks are available, including Uniref90, Genes, Contigs, SNPs and Exons. Each track can be expanded by clicking the associated plus button, and each feature links to a page showing the feature details (Fig. 1).
This GBrowse database enables the rapid dissemination of wheat chromosome arm sequence information prior to publication. In the absence of a finished wheat reference genome upon which to base crop improvement efforts, this tool represents the first opportunity for wheat researchers to interact with chromosome-scale gene-based sequence scaffolds in an intuitive and user-friendly manner. It allows for a more rigorous interrogation of genes surrounding a locus of interest than was previously possible in wheat, to assist the identification of the genomic basis of important traits. With the expansion of wheat genome sequencing activities by several groups internationally, this resource will increasingly provide access to wheat genome information for crop improvement research.
Wheat chromosome arm sequence data on TAGdb
TAGdb is an online database system designed to identify and visualize next-generation paired sequence tags that share identity with a submitted query sequence (Marshall et al. 2010). The TAGdb interface requests a FASTA format query sequence of up to 5,000 bp as well as a contact E-mail address, so users can retrieve previous query searches. Users can then select a variety of wheat short read data libraries. After starting the process, TAGdb sends an E-mail to the user stating that the job has started successfully and provides a link to the results web page. Once the search is complete, TAGdb sends a second E-mail to confirm completion, together with a link to the results. Two windows display an overview and zoomed region of the read alignments (Fig. 2); paired reads are connected by a line, with a blue rectangle confirming that the result conforms to the expected orientation and paired read distance. Matching reads, together with their matching or non-matching read pairs, are viewed as a table or can be downloaded as a multi-FASTA format file for further analysis.
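The pair check described above, in which a read pair is flagged as conforming to the expected orientation and paired read distance, can be illustrated with a minimal sketch. This is not the TAGdb implementation: the `ReadAlignment` type, the `is_concordant` function, the inward-facing (forward then reverse) orientation and the span limits are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ReadAlignment:
    """Hypothetical record for one read aligned to the query sequence."""
    start: int   # 1-based leftmost aligned position on the query
    end: int     # 1-based rightmost aligned position (inclusive)
    strand: str  # "+" or "-"


def is_concordant(a, b, min_span, max_span):
    """Return True if the pair faces inward and spans an allowed distance.

    Assumption: the library is expected to be in forward/reverse
    orientation; other library types would need a different rule.
    """
    left, right = (a, b) if a.start <= b.start else (b, a)
    inward = left.strand == "+" and right.strand == "-"
    span = max(a.end, b.end) - left.start + 1  # outer distance of the pair
    return inward and min_span <= span <= max_span


# Illustrative values only: 76 bp reads and an allowed span of 300-500 bp.
pair_ok = is_concordant(ReadAlignment(100, 175, "+"),
                        ReadAlignment(380, 455, "-"), 300, 500)
pair_wrong_orientation = is_concordant(ReadAlignment(100, 175, "-"),
                                       ReadAlignment(380, 455, "+"), 300, 500)
```

The span limits are passed in as parameters because the hosted libraries differ in insert size, so a single fixed threshold would not suit every library.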
The key value of this tool is that it provides researchers with rapid yet simple access to the wheat genome sequence data being produced by new sequencing technologies. The identification of a large number of matching reads may enable the local assembly of the wheat genomic region. Where few reads are identified, read pairs may be used to PCR amplify and sequence the gene as well as genomic sequence flanking the matching query. Wheat TAGdb currently hosts whole-genome paired read libraries of wheat cultivar Chinese Spring, including specific data for the long and short arms of isolated chromosomes. Read lengths vary between 35 and 100 bp, with a range of insert sizes from 300 to 3,700 bp. Additional short read data for different wheat varieties will be hosted on TAGdb as they become publicly available.
Comparative wheat genome and genetic maps on CMap and CMap3D
CMap is a generic, extensible web-based comparative map viewer for displaying and comparing genetic and physical maps from any species (Youens-Clark et al. 2009). There are two main CMap databases of interest to wheat researchers. The most comprehensive is hosted within GrainGenes (Matthews et al. 2003, Carollo et al. 2005) and is linked from the wheatgenome.info front page. The wheatgenome.info installation of the CMap system aims specifically to link the assembled wheat chromosome arm information with the sequenced genomes of Brachypodium distachyon and rice, as well as with a genetic map of the D genome donor of hexaploid wheat, A. tauschii. Bread wheat genome data include syntenic builds for chromosome arms 7DS and 7BS, with other chromosomes being added as an ongoing process.
A CMap summary interface provides links to CMap viewer, administration, tutorial document, map search and feature search functionalities. When a main reference sequence is selected, users can add a physical sequence map or a genetic map as a second map. As genetic and physical maps become more abundant, their effective visualization becomes a challenge. CMap3D is a tool based on CMap for the visualization and comparison of multiple genome or genetic maps. This software is a stand-alone client and is available for Windows, OSX and Linux (Duran et al. 2010a). The comparative maps present each corresponding marker and the links between maps as a three-dimensional view (Fig. 3). CMap3D overcomes the limitation of comparing only adjacent aligned maps and provides a more user-friendly comparison of multiple genomes or genetic maps in three-dimensional space.
Annotated wheat EST single nucleotide polymorphisms within autoSNPdb
Advances in second-generation sequencing technologies have greatly increased the scale and scope to interrogate genomes and uncover genetic variation. However, differentiating between sequence errors and real SNPs remains a challenge, particularly for large and complex genomes such as wheat (Duran et al. 2009c, Imelfort et al. 2009). An approach to improve polymorphism prediction accuracy includes deep sequencing and multiple measures of prediction confidence.
AutoSNPdb (Duran et al. 2009a, Duran et al. 2009b) is the latest version of SNP discovery software which started with autoSNP (Barker et al. 2003, Batley et al. 2003) and includes SNPServer (Savage et al. 2005). It provides an extensible and user-friendly graphical interface facilitating a variety of queries to identify SNPs related to specific genes or traits. This application processes multiple consensus sequences from multiple EST reads and identifies candidate SNPs using a series of Perl scripts.
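The idea of separating candidate SNPs from likely sequence errors in aligned EST reads can be sketched with a simple redundancy filter. This is a simplified illustration in Python and does not reproduce the autoSNPdb Perl scripts: the `candidate_snps` function, the gapped-string input format and the `min_minor_count` threshold are assumptions. A column is reported only when at least two different bases are each supported by a minimum number of reads.

```python
from collections import Counter

BASES = frozenset("ACGT")


def candidate_snps(aligned_reads, min_minor_count=2):
    """Return (column, allele_counts) for columns with >= 2 supported alleles.

    aligned_reads: equal-frame gapped strings ("-" marks a gap), a
    hypothetical input format chosen for this sketch.
    min_minor_count: assumed redundancy threshold; a base seen in fewer
    reads is treated as a possible sequencing error and ignored.
    """
    if not aligned_reads:
        return []
    results = []
    length = max(len(read) for read in aligned_reads)
    for col in range(length):
        counts = Counter(
            read[col].upper()
            for read in aligned_reads
            if col < len(read) and read[col].upper() in BASES
        )
        supported = {base: n for base, n in counts.items()
                     if n >= min_minor_count}
        if len(supported) >= 2:
            results.append((col, supported))
    return results


# Toy alignment: column 3 differs in two reads, column 6 in only one read.
reads = [
    "ACGTACGT",
    "ACGTACGT",
    "ACGAACGT",
    "ACGAACGT",
    "ACGTACCT",
]
snps = candidate_snps(reads)
```

In this toy alignment only column 3 is reported, because the single mismatch in column 6 falls below the redundancy threshold. Raising `min_minor_count` trades sensitivity for fewer false positives.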
The current autoSNPdb application hosts data for important crops including rice, barley, Brassica and wheat (Duran et al. 2010b). Within wheat autoSNPdb, the accuracy of polymorphism detection has been improved by adopting the strategy of deep coverage sequencing of specific wheat cultivars. Wheat ESTs generated by Roche 454 second-generation sequencing have been assembled using MIRA, with the resulting assembly processed using autoSNPdb Perl scripts to identify SNPs. Wheat autoSNPdb provides a valuable resource of annotated genetic markers of wheat, which can be used for genetic diversity analysis, cultivar identification and high-resolution genetic map construction.
Wheat autoSNPdb can be searched using keywords, by similarity to a query sequence, or by selecting SNPs which differentiate between varieties. A list of consensus contigs is displayed which includes the consensus sequence with aligned reads and highlighted SNPs (Fig. 4). Full annotation of potential gene function is also displayed, and SNPs can also be searched based on homologous locations in the rice genome. AutoSNPdb is best viewed in Mozilla Firefox, as Internet Explorer may not provide full functionality.
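Selecting SNPs that differentiate between varieties can be illustrated with a small heuristic. The sketch below is an assumption made for illustration, not the rule used by autoSNPdb: a polymorphic position is labelled putatively varietal when every cultivar shows a single allele and the cultivars differ, and putatively homoeologous when more than one allele is seen within a single cultivar. The `classify_snp` function and the cultivar name "CultivarX" are hypothetical.

```python
def classify_snp(alleles_by_cultivar):
    """Classify one alignment column from the bases observed per cultivar.

    alleles_by_cultivar: mapping of cultivar name to the bases observed
    at this position in reads from that cultivar (string or list).
    """
    observed = {name: set(bases)
                for name, bases in alleles_by_cultivar.items() if bases}
    # More than one allele inside a single cultivar suggests that the
    # difference lies between homoeologous gene copies.
    if any(len(alleles) > 1 for alleles in observed.values()):
        return "putative homoeologous"
    # One allele per cultivar, but the cultivars disagree.
    if len(set().union(*observed.values())) > 1:
        return "putative varietal"
    return "monomorphic"


varietal = classify_snp({"Chinese Spring": "AAA", "CultivarX": "GG"})
homoeologous = classify_snp({"Chinese Spring": "AAG", "CultivarX": "AG"})
```

A heuristic of this kind depends on read depth: with few reads per cultivar, a homoeologous difference can appear varietal simply because only one gene copy was sampled.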
Wheat genome Wiki
The Wiki hosted at wheatgenome.info aims to assist communication between international groups undertaking diverse wheat sequencing activities. The Wiki is based on MediaWiki (http://www.mediawiki.org), the popular free web-based Wiki software that is also used by Wikipedia. This Wiki provides an economical and efficient way to communicate and collaborate. Any research group undertaking wheat genome sequencing is welcome to describe its activities on the Wiki, with secure access provided on request.
Conclusions and future direction
The wheatgenome.info system hosts a range of wheat genome information with unrestricted public access. Wheat genome sequencing is still in its infancy, and a complete high quality genome sequence is not expected until 2015 at the earliest. Meanwhile, the number and quality of draft genome assemblies are likely to increase, together with the amount of genome information relating to different wheat cultivars and wild relatives. The wheatgenome.info resource provides researchers with early access to these genetic and genomic data, allowing them to compare query sequences with genomic data, identify genes at loci of interest, extract new genetic marker information, distinguish between homoeologous and varietal SNP markers, and access a hub for discussion on wheat genome sequencing activities beyond the current scope of the international consortium. The collation of this information in one place, together with links to external wheat genome resources, greatly assists researchers who wish to use this information to improve this valuable crop.
This work was supported by the Australian Research Council [Projects LP0882095, LP0883462 and DP0985953].
Support from the Australian Genome Research Facility (AGRF), the Queensland Cyber Infrastructure Foundation (QCIF), the Australian Partnership for Advanced Computing (APAC) and Queensland Facility for Advanced Bioinformatics (QFAB) is gratefully acknowledged.
autoSNPdb: automatic annotated single nucleotide polymorphism database
BAC: bacterial artificial chromosome
BLAST: Basic Local Alignment Search Tool
EST: expressed sequence tag
GBrowse2: Generic Genome Browser version 2
SNP: single nucleotide polymorphism