BLAST (Basic Local Alignment Search Tool) searches against DNA and protein sequence databases have become an indispensable tool for biomedical research. The proliferation of the genome sequencing projects is steadily increasing the fraction of genome-derived sequences in the public databases and their importance as a public resource. We report here the availability of Genomic BLAST, a novel graphical tool for simplifying BLAST searches against complete and unfinished genome sequences. This tool allows the user to compare the query sequence against a virtual database of DNA and/or protein sequences from a selected group of organisms with finished or unfinished genomes. The organisms for such a database can be selected using either a graphic taxonomy-based tree or an alphabetical list of organism-specific sequences. The first option is designed to help explore the evolutionary relationships among organisms within a certain taxonomy group when performing BLAST searches. The use of an alphabetical list allows the user to perform a more elaborate set of selections, assembling any given number of organism-specific databases from unfinished or complete genomes. This tool, available at the NCBI web site http://www.ncbi.nlm.nih.gov/cgi-bin/Entrez/genom_table_cgi, currently provides access to over 170 bacterial and archaeal genomes and over 40 eukaryotic genomes.
The large number of genome sequencing programs worldwide has created a formidable wealth of genomic information, which continues to grow at an extraordinary rate, encompassing organisms of the three superkingdoms (Archaea, Eubacteria and Eukaryota). The large number of complete genomes and the high complexity of the genomes being sequenced place increasing demands on the development of bioinformatics tools to facilitate and improve the use of genomic information in general and the functional annotation of genomes in particular. Functional annotation includes the identification of genes and the assignment of their functions, based in part on finding significant sequence similarity matches to previously characterized genes or proteins by using BLAST and similar tools [1,2]. With the increasing number of genomes being sequenced, the output of a high-throughput BLAST search can be very complex and time-consuming to interpret, with many redundant results. We have developed a graphic tool that allows the user to customize BLAST (Basic Local Alignment Search Tool) searches by creating a virtual database of target organisms. Ideally, this database would include the largest possible number of organisms, with complete and unfinished genomes. The general features of this tool and its applications are described here.
There are two independent methods to create a virtual database for BLAST searches: the first is based on a graphic tree of taxonomic groups and the second is based on a text tree of Linnaean names. The organism-specific databases represent the collection of nucleotide and protein sequences corresponding to the three superkingdoms: Archaea, Eubacteria and Eukaryota. The ability to select a phylum-, order-, genus-, or species-specific database for BLAST searches relies on a retrieval system based upon a unique taxonomic identifier.
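The idea of retrieving every sequence that falls under a chosen taxonomic node can be sketched as a walk up a parent-pointer tree. This is only an illustration, not the NCBI retrieval system: the identifiers, the `PARENT` table, the `SEQUENCES` list and both function names are hypothetical, and the numbers are not real taxonomy identifiers.

```python
# Toy parent-pointer taxonomy. All identifiers are made up for
# illustration; they are NOT real NCBI taxonomy identifiers.
PARENT = {
    1: None,   # root of the toy tree
    10: 1,     # hypothetical superkingdom node
    20: 10,    # hypothetical order node
    21: 10,    # another hypothetical order node
    100: 20,   # hypothetical species A
    101: 20,   # hypothetical species B
    102: 21,   # hypothetical species C
}

# Hypothetical sequence records, each tagged with the identifier
# of the organism it came from.
SEQUENCES = [
    ("seq1", 100),
    ("seq2", 101),
    ("seq3", 102),
]


def is_descendant(tax_id, ancestor_id, parent=PARENT):
    """Return True if tax_id is ancestor_id or lies below it in the tree."""
    node = tax_id
    while node is not None:
        if node == ancestor_id:
            return True
        node = parent[node]  # step one level up towards the root
    return False


def sequences_under(ancestor_id, sequences=SEQUENCES):
    """Collect the names of all sequences whose organism falls under ancestor_id."""
    return [name for name, tax_id in sequences
            if is_descendant(tax_id, ancestor_id)]
```

With this toy data, `sequences_under(20)` gathers the records of species A and B, while `sequences_under(10)` gathers all three. The same lookup works at any rank because every rank is just a node in the tree.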
BLAST using the taxonomy-based tree
The taxonomy tree-based method has the advantage of displaying the phylogenetic relationship of the organism-specific databases to be selected. This is useful in evolutionary comparisons between genes or gene families of closely or distantly related groups. Taxonomy-based trees have been produced for the superkingdoms Eubacteria, Archaea (Archaebacteria) and Eukaryota and include those taxonomic groups for which representatives have been sequenced and data are available. Fig. 1 shows the taxonomy tree for Eubacteria. Each phylum/order link opens a pop-up box that contains all organisms within that taxonomy group. If selected, the option ‘BLAST all’ will subsequently compare the desired query sequence against the nucleotide or protein sequences that are available for the selected group. The link ‘About the Databases’ (http://www.ncbi.nlm.nih.gov/PMGifs/Genomes/prokdata.html) provides an alphabetical list of all organisms for which nucleotide and/or protein sequences are available for that particular superkingdom, their genome sizes, contributors of genomic information and the cumulative size of all DNA sequences in base pairs (which reflects the fraction of the genome that has been sequenced to that date). The box above the taxonomy tree allows the user to copy and paste the query sequence (or to enter an accession number or a gi number). Other boxes allow the user to set different BLAST parameters. From left to right, the following selections are available: (1) a nucleotide or protein query can be selected in the first box; (2) inclusion of unfinished or complete genomes (or both; all genomes are color-coded: the complete genomes are represented in yellow and the unfinished in pale green); (3) selection of the BLAST search type and parameters.
Detailed information about BLAST parameters is available under the link ‘Help’ (http://www.ncbi.nlm.nih.gov/PMGifs/Genomes/prokblasthelp.html) and answers to commonly asked questions are available elsewhere on the NCBI web page (http://www.ncbi.nlm.nih.gov/blast/blast_FAQs.html).
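The way the query type and database type narrow down the BLAST search type can be sketched as a small lookup. The program names and their query/database pairings are the standard BLAST ones; the function name, its arguments and the `translate_both` flag are hypothetical and do not describe how the Genomic BLAST form is implemented.

```python
def choose_blast_program(query_type, db_type, translate_both=False):
    """Pick a standard BLAST program for a query/database combination.

    query_type and db_type are "nucleotide" or "protein".
    translate_both only matters for nucleotide-vs-nucleotide searches:
    if True, both query and database are compared as translations (tblastx).
    """
    key = (query_type, db_type)
    if key == ("protein", "protein"):
        return "blastp"    # protein query vs protein database
    if key == ("nucleotide", "protein"):
        return "blastx"    # translated nucleotide query vs protein database
    if key == ("protein", "nucleotide"):
        return "tblastn"   # protein query vs translated nucleotide database
    if key == ("nucleotide", "nucleotide"):
        # direct nucleotide comparison, or translated-vs-translated
        return "tblastx" if translate_both else "blastn"
    raise ValueError(f"unsupported combination: {key!r}")
```

For instance, a protein query searched against the DNA contigs of an unfinished genome corresponds to `choose_blast_program("protein", "nucleotide")`, which selects `tblastn`.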
BLAST using an alphabetical list of organisms
Alternatively, a virtual database can be created by selecting multiple organism-specific databases from an alphabetical list of all organisms (available under the link ‘text table’). The options available here are more advanced and allow the user to perform a complex set of include/exclude operations. Fig. 2 shows how to combine and exclude a particular group or multiple groups of organisms, for example, selecting all the Archaea (by clicking on the ‘+’ sign) but excluding Halobacteriales, Archaeoglobales and Methanopyrales (by clicking on the ‘−’ sign next to the lineage). The number of selected genomes is automatically displayed at the top of the page (in this case, 14 genomes). The button ‘show’ brings up an alternative alphabetical menu, which displays all organisms within this superkingdom, with the selected organism databases in red font. Additional genomes can be excluded or included in this window. The button ‘hide’ returns to the alphabetical list display. An option on the upper left restricts the virtual database to complete genomes only (which automatically deselects all previously chosen organisms whose genomes are unfinished). To illustrate the use of the Genomic BLAST tool, we performed a BLAST search using as query the polypeptide sequence of a Methanopyrus kandleri tRNA/rRNA cytosine-C5-methylase (gi20095066). When the closely related lineages Halobacteriales, Archaeoglobales and Methanopyrales are excluded from the search (by deselecting the boxes next to their names), the nine paralogs in M. kandleri are not shown; also not shown are the ortholog from Halobacterium sp. and three homologs in Archaeoglobus fulgidus. This shows that the exclusion feature allows simpler identification of conserved proteins in more distantly related groups. Fig. 3 shows an example of the BLAST result page (panel A). The top of the page displays a list of complete genomes to which significant matches are found.
For example, the link ‘Completed Pyrococcus furiosus DSM 3638’ opens a new page, which shows the circular chromosome and the ordered best hits outside the circle (panel B). Further information is accessible through other links from this page, such as ‘score’ or ‘protein GI’. The former link points to a page where a detailed graphic and text pair-wise alignment of the query/match is available (panel C) and the latter points to the GenBank protein record (not shown).
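The include/exclude selection described above amounts to a set difference over lineages. The sketch below is not the tool's implementation: the `LINEAGES` table is a deliberately simplified toy (one superkingdom and one order per organism, five organisms only), and `build_virtual_database` is a hypothetical name.

```python
# Simplified toy lineages (superkingdom, order); a real lineage has
# more ranks and the real tool covers many more organisms.
LINEAGES = {
    "Methanopyrus kandleri": ("Archaea", "Methanopyrales"),
    "Archaeoglobus fulgidus": ("Archaea", "Archaeoglobales"),
    "Halobacterium sp.": ("Archaea", "Halobacteriales"),
    "Pyrococcus furiosus": ("Archaea", "Thermococcales"),
    "Escherichia coli": ("Eubacteria", "Enterobacteriales"),
}


def build_virtual_database(lineages, include, exclude=()):
    """Return organisms in any included group and in no excluded group."""
    include, exclude = set(include), set(exclude)
    # Organisms whose lineage touches at least one '+' group.
    selected = {org for org, lineage in lineages.items()
                if include & set(lineage)}
    # Organisms whose lineage touches at least one '-' group.
    removed = {org for org, lineage in lineages.items()
               if exclude & set(lineage)}
    return sorted(selected - removed)


# Select all Archaea but exclude the three lineages used in the example.
targets = build_virtual_database(
    LINEAGES,
    include={"Archaea"},
    exclude={"Halobacteriales", "Archaeoglobales", "Methanopyrales"},
)
```

With this toy table only Pyrococcus furiosus remains in `targets`. The 14 genomes reported in the text reflect the full set of archaeal genomes available to the tool, not this five-organism example.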
Genomic BLAST for lower eukaryotes
Besides bacterial and archaeal virtual databases, users can also define eukaryote-specific virtual databases at the Genomic BLAST page. Taxonomy-based tree and alphabetical list interfaces have been created for 42 eukaryotic genomes (five of them complete). DNA and/or protein sequences are obtained from the sequencing centers as individual reads or DNA contigs of varying length. The link ‘About the databases’ opens a page that provides updated information about the list of organisms, sequencing centers, genome size, and size of nucleotide and protein files that are available (http://www.ncbi.nlm.nih.gov/PMGifs/Genomes/eukdata.html). From this page, each organism name is linked to its project page, which provides detailed information about the genome size and organization, relevant biological features, sequencing centers, sequencing strategy, etc. A list of all genomic data submitted to GenBank for each organism is available under ‘GenBank sequences’. Data usually include genomic DNA, protein sequences, ESTs, STSs, GSSs and HTGs.
Genomic information, data release policies and submission to GenBank
With the increasing number of genomes being sequenced, many projects will exist in different stages of completion, with varying genome coverage until gap closure is complete and the DNA annotation is finalized. Because gap closure is often troublesome and DNA annotation in eukaryotes is not fully automated, many projects with complete shotgun sequencing will remain unfinished for a long period of time. Examples can be found among the eukaryotic Plasmodium falciparum, Trypanosoma cruzi and Leishmania major sequencing programs (see, for example, the web sites of the Sanger Institute, http://www.sanger.ac.uk/Projects/Protozoa, and the Institute for Genomic Research, http://www.tigr.org/tdb/edb2/pfa1/htmls). Furthermore, many microbial genomes will remain unfinished as a result of the sequencing strategy. This is illustrated by the genomic survey sequencing used in many of the DOE-Joint Genome Institute programs (http://www.jgi.doe.gov). The sequencing centers typically submit nucleotide and protein sequences to GenBank and the partner databases EMBL and DDBJ only after all gaps are closed and the DNA annotation is complete. While complete genome sequences are important to the progress of biomedical research, it is crucial to provide investigators with ready access to unfinished data. To achieve this goal, the National Center for Biotechnology Information (NCBI) encourages the sequencing centers to submit their data as soon as they become available. DNA contigs (with or without preliminary annotation) can be submitted as whole genome shotgun sequences (WGS) and assigned accession numbers. These sequences are then made publicly available. To comply with more restrictive data release policies, investigators can also submit their genomic data through a private ftp account. Using this route, DNA sequences become available only for BLAST searches (and are not downloadable). Approximately 1 kb of flanking sequence is provided per BLAST hit.
Further details regarding genome submissions are available from the Genomic BLAST main web page.
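Returning a limited amount of flanking sequence around a hit can be pictured as clamping a window to the contig boundaries. The text does not say whether the roughly 1 kb is counted on each side of the hit or in total; the sketch below assumes 1 kb on each side and 1-based inclusive coordinates. The function name and its parameters are hypothetical.

```python
def flank_window(hit_start, hit_end, contig_length, flank=1000):
    """Return the (start, end) of a hit extended by `flank` bases per side.

    Coordinates are 1-based and inclusive. The window is clamped so that
    it never runs off either end of the contig.
    Assumption: the flank is applied on each side of the hit.
    """
    if not (1 <= hit_start <= hit_end <= contig_length):
        raise ValueError("hit coordinates must lie within the contig")
    start = max(1, hit_start - flank)            # clamp at the contig start
    end = min(contig_length, hit_end + flank)    # clamp at the contig end
    return start, end
```

A hit at positions 2000 to 2100 of a 5000-base contig yields the window (1000, 3100), whereas a hit close to either end of the contig is simply truncated at that end.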
BLAST specialized databases
Other organism-specific databases for large-scale genome sequencing projects have been created for BLAST searches and include the human, mouse, rat, zebrafish, fugu, Anopheles gambiae, Arabidopsis thaliana, and Oryza sativa genomes (http://www.ncbi.nlm.nih.gov/BLAST). The scope of the data depends on the status and nature of the project and varies from thousands of DNA contigs or assembled scaffolds (for example, those available for the O. sativa and A. gambiae genomes) to datasets that include completely annotated chromosomes (i.e., fully annotated assemblies, cDNA and protein sequences), as available for the A. thaliana genome. The latter dataset also includes high-throughput sequences (HTGs), raw sequence reads (traces), ESTs, and BAC ends. The human, mouse, rat and zebrafish genomes have a more complete dataset, and a visualization tool called MapViewer has been implemented for these genomes and for the complete vertebrate genomes. This tool allows the user to localize the best BLAST hits in the context of the genome and to navigate genomic regions in great detail (for example, zooming in and out of regions, retrieving information about all associated features such as genes, cDNAs, markers, etc.).
One of the goals of comparative genomics is the identification of nucleotide and protein evolutionary patterns and the elucidation of unique features of the biology of a given organism. This type of analysis requires the use of genomic information from closely and distantly related species, and therefore knowledge of the phylogenetic relationships between organisms can greatly facilitate the selection of target databases to be used in BLAST searches. The ability to construct databases containing nucleotides and/or proteins from a particular set of organisms can be useful in many instances: for example, the identification of unusually high similarity among genes found in otherwise unrelated organisms, an observation that suggests lateral gene transfer [3,4]; the characterization of genus-specific sets of conserved genes in hyperthermophilic Archaea; and the characterization of a minimal gene set by performing genome comparisons between the smallest known genome, that of Mycoplasma genitalium, and that of Mycoplasma pneumoniae. Limiting the search to the relevant species simplifies the analysis of the sometimes very complex BLAST outputs (a problem that grows with the ever-expanding number of organisms being sequenced) and requires less computational capacity. The Genomic BLAST graphical tool makes possible the automated preparation of sequence evaluations based on BLAST runs. In the near future, we expect to see a number of applications that use the network BLAST interface to improve the ability to perform searches against databases that are increasing not just in size but also in biological complexity.
The construction of the blastable databases of unfinished genomes has been made possible by the generous submission of preliminary data to the public databases by many research centers. We are grateful for the technical assistance provided by Andrei Kochergin, Pavel Bolotov, and Chris Musial.