The Gene Context Tool (GeConT) allows users to visualize the genomic context of a gene or a group of genes and their orthologous relationships within fully sequenced bacterial genomes. The new version of the server incorporates information from the COG, Pfam and KEGG databases, allowing users to have an integrated graphical representation of the function of genes at multiple levels, their phylogenetic distribution and their genomic context. The sequence of any of the genes can be easily retrieved, as well as the 5′ or 3′ regulatory regions, greatly facilitating further types of analysis. GeConT 2 is available at: http://bioinfo.ibt.unam.mx/gecont.
With more than 660 prokaryotic genomes in the current RefSeq database ( 1 ), tools that allow biologists to visualize the genomic context of their genes of interest have become crucial. Although gene order is poorly conserved overall in prokaryotes ( 2 , 3 ), groups of genes participating in particular functions tend to remain close together across different lineages, either as part of operons ( 4 , 5 ) or as functional neighbourhoods comprising several transcription units ( 6 , 7 ). Genomic context is not limited to genes close to each other. Overall, functional associations can be inferred from genomic context using the following kinds of evidence: (i) gene fusions ( 8 , 9 ), whereby separate genes are assumed to work together if they are found as a single, fused gene in another organism; (ii) conservation of gene order ( 3 , 10 ), where conservation across evolutionarily distant organisms is taken as evidence of functional association; and (iii) similarity of phylogenetic profiles ( 9 , 11 , 12 ), whereby two genes are assumed to work together if their orthologs tend to co-occur, appearing and disappearing in concert across different organisms, on the premise that either gene would be useless without the other. A fourth kind of evidence of functional interactions is provided by the study of the rearrangement of operons across lineages ( 13–15 ). The idea here is that the reorganization of transcription units across genomes might be conservative, in the sense that newly formed operons bring together genes with related functions, thus revealing a functional association that would not be apparent in a single organism.
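As an illustration of the phylogenetic profile idea, the following minimal sketch scores two genes by the overlap of the genomes in which their orthologs occur. The Jaccard index used here is only one simple choice and is an assumption of this example; published profile methods rely on more elaborate statistics. The gene and genome names are hypothetical placeholders.

```python
def profile_similarity(profile_a, profile_b):
    """Jaccard similarity between two presence/absence profiles.

    Each profile is the set of genomes in which an ortholog of the gene is found.
    Returns a value between 0.0 (never co-occur) and 1.0 (identical distribution).
    """
    a, b = set(profile_a), set(profile_b)
    union = a | b
    if not union:
        return 0.0
    return len(a & b) / len(union)


# Hypothetical profiles: genome labels g1..g5 are placeholders, not real data.
profiles = {
    "geneA": {"g1", "g2", "g3", "g5"},
    "geneB": {"g1", "g2", "g3", "g5"},
    "geneC": {"g4"},
}

same = profile_similarity(profiles["geneA"], profiles["geneB"])
different = profile_similarity(profiles["geneA"], profiles["geneC"])
```

Genes with identical distributions (geneA and geneB) receive the maximum score, whereas genes that never co-occur (geneA and geneC) receive the minimum.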
Biologists interested in particular groups of genes or functional modules can find other features by visually inspecting the genomic context or neighbourhood of their genes of interest. Such experts might be able to interpret these neighbourhoods and find examples of non-orthologous gene displacement ( 2 ) or horizontal gene transfer that might affect the functional module in particular organisms. Further tests of the validity of their findings would be greatly facilitated if the tool used to visualize the gene modules across several genomes also allowed the download of meaningful sequences, such as the protein sequence, the DNA coding for the gene, or the DNA sequences upstream or downstream of the gene. There are excellent tools that allow the retrieval of functional predictions based on genomic context, such as STRING ( 16 ) and PROLINKS ( 17 ), but such tools restrict the visualization of genomic context to the particular predictions associated with a gene or genes of interest, rather than to any physical neighbours. Also, while protein sequences of predicted interactors can be retrieved from these servers, nucleotide sequences of genes or intergenic regions are not available. In other instances, such as GECO ( 18 ), the navigation interface for retrieving this type of information is not simple and the orthology definition differs from the most commonly accepted standards. The SEED ( 19 ) is a fully automated web resource that analyses the genome context of bacterial and archaeal organisms; however, it is oriented mainly towards genome annotation rather than towards genome context exploration by a user who wants to examine the neighbourhoods of his/her genes of interest. Another useful web server is the Comprehensive Microbial Resource ( http://cmr.jcvi.org/tigr-scripts/CMR/CmrHomePage.cgi ). Although this web server offers a wide variety of tools and resources to highlight differences and similarities between prokaryotic genomes, its comprehensiveness hinders straightforward navigation, further justifying the development of simpler gene context analysis tools.
Here we present Gene Context Tool (GeConT) 2, the second version of GeConT ( 20 ), a web-based tool that allows users to visualize their genes of interest, and their genomic context, across all available fully sequenced bacterial genomes. Orthologous domains are highlighted using shared colours. This makes it easy to navigate across the functional neighbourhood of any particular gene and its orthologs.
GeConT 2 extends the previous version in many ways. We have increased the query options to allow one or more of the following: (i) gene ids, which can be given as common names, GI numbers as defined in GenBank ( 21 ) or SwissProt identifiers ( 22 ); (ii) orthologous groups as defined in the COG database ( 23 ); (iii) metabolic pathways as described in the KEGG database ( 24 ); (iv) protein domains taken from the Pfam database ( 25 ); (v) a protein or DNA sequence, from which similarities will be identified using the integrated BLAST ( 26 ) search; (vi) complex phrases, using Boolean operators, to allow flexible searches against all the descriptions of the included databases. Since many queries will likely result in hundreds of matches (many of which are likely to be redundant), we implemented a filter that can reduce the display to a user-specified number of non-redundant genomes. This option uses distances calculated from 16S rRNA alignments to select a set of representative genomes specific for each query. Additionally, the user can restrict the search to particular phylogenetic groups of interest. The genome context can be displayed either with a user-defined number of flanking genes or according to the predicted operon structure ( 27–29 ).
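The selection procedure used by the server is not detailed here. As a rough sketch of how a non-redundant subset could be drawn from pairwise 16S rRNA distances, the following example applies a greedy farthest-first selection, which is an assumption of this illustration and not necessarily the method implemented in GeConT 2. The function name, genome labels and distance values are hypothetical.

```python
def pick_representatives(names, dist, k):
    """Greedily pick up to k genomes that are far apart from each other.

    names: list of genome labels (the first one seeds the selection).
    dist:  dict mapping (a, b) pairs to a symmetric pairwise distance.
    """
    if k <= 0 or not names:
        return []

    def d(a, b):
        # Accept either orientation of the pair.
        return dist[(a, b)] if (a, b) in dist else dist[(b, a)]

    chosen = [names[0]]
    while len(chosen) < min(k, len(names)):
        remaining = [n for n in names if n not in chosen]
        # Take the genome whose nearest already-chosen genome is farthest away.
        best = max(remaining, key=lambda n: min(d(n, c) for c in chosen))
        chosen.append(best)
    return chosen


# Hypothetical distances: two nearly identical strains and one distant genome.
names = ["strain_A1", "strain_A2", "species_B"]
dist = {
    ("strain_A1", "strain_A2"): 0.001,
    ("strain_A1", "species_B"): 0.25,
    ("strain_A2", "species_B"): 0.25,
}
representatives = pick_representatives(names, dist, 2)
```

With these values the second strain is skipped in favour of the distant genome, which is the behaviour expected of a redundancy filter.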
In keeping with the increased input flexibility, the output allows the genes to be visualized colour-coded according to their COG, Pfam or KEGG assignments. Also new to this version, multiple domains can be visualized as distinct coloured regions within a gene. Domains, genes and intergenic regions are drawn to scale; overlapping genes and non-coding RNA genes are now included. The user can click on any cistron to display relevant information, including descriptions from COG, KEGG and Pfam, as well as the amino acid and nucleotide sequences and the upstream and downstream intergenic flanking regions. GeConT 2 also allows the user to retrieve all the sequence data for the set of genes matched by the input query, facilitating further analysis.
WEB SERVER DESCRIPTION
EXAMPLES AND DISCUSSION
With GeConT 2 users will be able to perform fast, integrated and intuitive analyses in fully sequenced genomes. In this section we discuss several examples that help illustrate the functionality of the webserver.
Identifying conserved elements involved in regulating a given pathway
An important feature of GeConT 2 is its potential for comparative genome analysis of related genes in search of conserved regulatory motifs. The relationship between genes can be established from their orthology or biochemical pathway associations as defined in the COG, KEGG and Pfam databases. Since regulatory elements are commonly more conserved in closely related organisms, users can restrict their searches to a particular phylogenetic group. For example, in order to identify likely regulatory elements in methionine metabolism, represented by the KEGG pathway 00271, in Firmicutes and Proteobacteria, the user can perform the corresponding searches in these groups by using the ‘Specific taxonomy’ option. The output for two representative organisms of these groups is shown in Figure 1 of the Supplementary Material . Using the operon clustering option, and colouring by COG attributes, there are 17 different operons with enzymes related to methionine metabolism in the Firmicute Bacillus halodurans , while there are 11 operons in the Proteobacterium Caulobacter crescentus . The user can take advantage of the sequence retrieval options in GeConT 2 and get all the 5′ upstream regions for these operons. Using these sequences as the input of motif discovery programs such as MEME ( 32 ), the user can verify that the operons involved in this pathway are regulated by the SAM-I and S(MK) riboswitches in Firmicutes and by SAM-II in alpha-Proteobacteria ( 33–35 ). It is important to note that redundant information coming from different strains of the same organism might generate over-representation of particular sequences in the data set. To overcome this problem, the user can reduce the number of organisms returned by using the ‘maximum genomes to display’ option. Previously we have shown the power of this kind of approach for identifying riboswitches starting from the regulatory regions of genes belonging to the same COG ( 36 ). It is now possible, using only web-based tools such as GeConT 2, to perform similar searches using any group of genes or pathways that a user might be interested in.
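For readers who wish to reproduce this kind of upstream-region retrieval offline, the following minimal sketch extracts the 5′ region of a gene from a genome sequence given its coordinates and strand. It is not the server's code: the function names, the 0-based end-exclusive coordinate convention and the toy sequence are assumptions of this example, and circular chromosomes are not handled.

```python
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def upstream_region(genome, start, end, strand, length):
    """Return up to `length` bases located 5' of a gene.

    start, end: 0-based, end-exclusive coordinates on the forward strand.
    strand:     "+" or "-".
    The region is truncated at the sequence ends (no wrap-around).
    """
    if strand == "+":
        return genome[max(0, start - length):start]
    if strand == "-":
        # For a reverse-strand gene the 5' region lies after `end`
        # on the forward strand and must be reverse-complemented.
        return reverse_complement(genome[end:end + length])
    raise ValueError("strand must be '+' or '-'")


# Toy sequence (hypothetical): a gene occupies positions 6..15.
genome = "AAACCCATGGGGTAATTT"
plus_upstream = upstream_region(genome, 6, 15, "+", 3)
minus_upstream = upstream_region(genome, 6, 15, "-", 3)
```

The resulting sequences can be written in FASTA format and submitted to a motif discovery program.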
Functional insights for genes of unknown function
Most genes have little or no functional annotation. Even for the most studied bacterium, Escherichia coli , the fraction of genes for which detailed knowledge is available is still low [54% in the latest survey ( 37 )]. Genomic context can give valuable insights into the functional relationships between neighbouring genes, for reasons discussed in the introduction. There are many cases of conserved proteins for which no functional assignment is available in the public databases. Homology searches are of no use, since all the hits also lack functional annotation. Context analysis can help solve some of these cases. One such example is annotated in Pfam as Duf150 (Domain of Unknown Function 150) and in the COG database as COG0779 (‘Uncharacterized protein conserved in bacteria’). Figure 1 shows a section of the results when searching for Duf150 in GeConT 2. It is easy to see that the context of this protein is well conserved, and the mouse-over function allows a quick view of the functional assignments of the neighbouring genes, all of which seem to be involved in transcriptional elongation or translation initiation. It is thus quite likely that Duf150/COG0779 members are functionally related to these processes, and this can be considered a first general function prediction for these previously uncharacterized proteins.
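The visual assessment of context conservation described above can also be expressed as a simple tally. The sketch below counts, for a query gene, the fraction of genomes in which each functional label appears among its neighbours; the function name, the threshold and the placeholder labels are assumptions of this example and do not correspond to GeConT 2 output.

```python
from collections import Counter


def conserved_neighbours(neighbourhoods, min_fraction=0.5):
    """Report neighbour labels found next to the query gene in many genomes.

    neighbourhoods: dict mapping genome -> iterable of functional labels
                    (for instance COG identifiers) of the flanking genes.
    Returns {label: fraction of genomes} for labels reaching min_fraction.
    """
    n = len(neighbourhoods)
    if n == 0:
        return {}
    counts = Counter()
    for labels in neighbourhoods.values():
        counts.update(set(labels))  # count each label once per genome
    return {
        label: count / n
        for label, count in counts.items()
        if count / n >= min_fraction
    }


# Hypothetical neighbourhoods: X, Y and Z are placeholder labels.
neighbourhoods = {
    "genome1": ["X", "Y"],
    "genome2": ["X", "Z"],
    "genome3": ["X", "Y", "Y"],
}
conserved = conserved_neighbours(neighbourhoods, min_fraction=0.6)
```

Labels that recur across most genomes (here X and Y) are the candidates from which a first functional hypothesis for the query gene can be drawn.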
Using context to discover the correct function for paralogs
When multiple copies of a gene arise by duplication (paralogs), it can become particularly difficult to assign the correct function, at least by sequence alone. For example, the enzymes TrpE (anthranilate synthase component I) and PabB (para-aminobenzoate synthase component I) have great sequence similarity and perform very similar reactions. These enzymes use chorismate as a common substrate, although they participate in different pathways, tryptophan and folate biosynthesis, respectively. Based on genome context, the trpE and pabB genes can easily be distinguished even in un-annotated genomes. With GeConT 2 we can analyse the neighbourhood, searching for the other genes of the corresponding metabolic pathways ( Figure 2 ). Another good example of paralogous domains is Palp (Pyridoxal-phosphate-dependent enzyme). Enzymes with this Pfam domain are highly versatile, participating in the biosynthesis of different amino acids such as tryptophan, cysteine, serine and threonine. Again, the context as well as the COG annotations allow us to easily distinguish between the different pathways and correctly identify the specific function of each gene ( Figure 3 ).
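The reasoning used to tell such paralogs apart can be sketched as a vote among neighbouring genes. In the example below, a candidate gene is assigned to the pathway whose marker genes are most frequent in its neighbourhood. The function name is hypothetical, and the marker lists are illustrative subsets chosen for this sketch, not sets taken from GeConT 2 or KEGG.

```python
def assign_by_context(neighbour_genes, pathway_markers):
    """Assign a gene to the pathway best represented among its neighbours.

    neighbour_genes: iterable of gene names flanking the candidate gene.
    pathway_markers: dict mapping pathway name -> set of marker gene names.
    Returns the best-scoring pathway, or None if no marker gene is nearby.
    Ties are resolved in favour of the pathway listed first.
    """
    neighbours = set(neighbour_genes)
    scores = {
        pathway: len(neighbours & set(markers))
        for pathway, markers in pathway_markers.items()
    }
    if not scores:
        return None
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None


# Illustrative marker sets (assumed for this example only).
pathway_markers = {
    "tryptophan": {"trpD", "trpC", "trpB", "trpA"},
    "folate": {"folP", "folK", "pabA", "pabC"},
}

# Hypothetical neighbourhoods of two similar, unannotated genes.
call_1 = assign_by_context(["trpD", "trpC", "orf1"], pathway_markers)
call_2 = assign_by_context(["orf2", "pabA", "folP"], pathway_markers)
```

A gene flanked by tryptophan biosynthesis genes is thus called as the TrpE-like copy, and one flanked by folate-related genes as the PabB-like copy; a gene with no informative neighbours is left unassigned.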
Supplementary Data are available at NAR Online.
We wish to thank Shirley Ainsworth for bibliographical assistance and Abel Linares, Arturo Ocadiz, Juan Manuel Hurtado, Alma Martinez and Nancy Mena for computer support. G.M-H. acknowledges computer facilities of the Shared Hierarchical Academic Research Computing Network (SHARCNET). Funding was provided by Natural Sciences and Engineering Research Council of Canada (NSERC) to G.M-H. Sanger Institute Postdoctoral Fellowship to C.A-G. Consejo Nacional de Ciencia y Tecnología (CONACyT) [60127-Q] and PAPIIT IN212708 grants to E.M. Macroproyecto de Tecnologías de la información y la computación-UNAM to E.M. Funding to pay the Open Access publication charges for this article was provided by CONACYT [60127-Q].
Conflict of interest statement . None declared.