We present YOGY, a web-based resource for orthologous proteins from nine eukaryotic organisms: Homo sapiens, Mus musculus, Rattus norvegicus, Arabidopsis thaliana, Drosophila melanogaster, Caenorhabditis elegans, Plasmodium falciparum, Schizosaccharomyces pombe and Saccharomyces cerevisiae. Using a gene name from any of these organisms as a query, this database provides comprehensive, combined information on orthologs in other species using data from five independent resources: KOGs, Inparanoid, HomoloGene, OrthoMCL and a table of curated fission and budding yeast orthologs. Associated Gene Ontology (GO) terms of orthologs can also be retrieved for functional inference. Integrating these different and complementary datasets provides a straightforward tool to identify known and predicted orthologs of proteins from a variety of species. This resource should be useful for bench scientists looking for functional clues for their genes of interest, as well as for curators looking for information that can be transferred based on orthology and for a rapid way to identify the relevant GO terms as an aid to literature curation. YOGY is accessible online at http://www.sanger.ac.uk/PostGenomics/S_pombe/YOGY/.
It is common practice to obtain useful clues about the function and evolution of a protein of interest by identifying homologous proteins in other organisms ( 1 – 3 ). There are three types of homology with biological relevance ( 4 ). Orthology is most useful for insight into related gene functions as it arises from a common protein in an ancestral organism rather than from gene duplication (paralogy) or horizontal transfer of genes (xenology).
Several methods are available to identify orthologous proteins in different organisms. KOGs [euKaryotic Orthologous Groups; Ref. ( 5 )] is a homology database derived from seven eukaryotic genomes, which uses the principle of BLAST best hits between three proteins from different organisms ( 6 , 7 ); since many eukaryotic proteins contain multiple domains, some common modules are masked ( 5 ). Inparanoid contains 26 datasets from 23 eukaryotic organisms; it can distinguish true homologs (orthologs and in-paralogs) from out-paralogs that arose from gene duplications prior to the divergence of two species ( 8 – 10 ). HomoloGene is a system for automated detection of homologs among the annotated proteins of 18 eukaryotic genomes; it is integrated with other databases at the NCBI including PubMed, Entrez and GEO ( 11 ). A recent addition to orthology resources is OrthoMCL, which can group orthologs from multiple genomes into a single cluster [currently 55 organisms; Refs ( 12 , 13 )]. Finally, a curated list of orthologs between Schizosaccharomyces pombe (fission yeast) and Saccharomyces cerevisiae (budding yeast) is also available. This dataset has been compiled by inspecting multiple alignments and clusters of protein families on a protein-by-protein basis, taking into account experimental evidence, domain organization, protein length and species distribution ( 14 ).
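The idea of best hits between three proteins from different organisms can be illustrated with a minimal sketch. It assumes that best hits have already been computed (for example, from BLAST output) and, as a simplification, requires every pair in a triangle to be mutual best hits; the published KOG procedure involves further steps that are not shown. All protein identifiers and the function names `is_mutual` and `find_triangles` are hypothetical.

```python
from itertools import combinations

def is_mutual(a, b, best_hits, genome_of):
    """True if a and b are each other's best hit in the other's genome."""
    return (best_hits.get(a, {}).get(genome_of[b]) == b
            and best_hits.get(b, {}).get(genome_of[a]) == a)

def find_triangles(best_hits, genome_of):
    """Return trios of proteins from three genomes linked by mutual best hits."""
    triangles = []
    for trio in combinations(sorted(genome_of), 3):
        # A triangle needs one protein from each of three different genomes.
        if len({genome_of[p] for p in trio}) < 3:
            continue
        if all(is_mutual(x, y, best_hits, genome_of)
               for x, y in combinations(trio, 2)):
            triangles.append(trio)
    return triangles

# Toy data (invented): protein -> genome, and protein -> {genome: best hit}.
genome_of = {
    "pombe_A": "S.pombe",
    "cerevisiae_A": "S.cerevisiae",
    "human_A1": "H.sapiens",
    "human_A2": "H.sapiens",
}
best_hits = {
    "pombe_A": {"S.cerevisiae": "cerevisiae_A", "H.sapiens": "human_A1"},
    "cerevisiae_A": {"S.pombe": "pombe_A", "H.sapiens": "human_A1"},
    "human_A1": {"S.pombe": "pombe_A", "S.cerevisiae": "cerevisiae_A"},
    "human_A2": {"S.pombe": "pombe_A", "S.cerevisiae": "cerevisiae_A"},
}

triangles = find_triangles(best_hits, genome_of)
print(triangles)
```

In this toy example, `human_A2` is not part of a triangle because neither yeast protein has it as its best human hit.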
These various homology resources have different advantages and complement each other. For example, no method is optimal for both specificity and coverage; assessing the results from multiple resources can thus increase confidence in orthology calls. Ortholog identification and subsequent extraction of relevant functional data on a gene by gene basis can be time consuming and confusing, owing to a lack of integration of the various resources. We have designed a web server called YOGY (eukarYotic OrtholoGY) that integrates results from the homology databases described above. Information from all these data sources is stored in a combined database to ease the search for and interpretation of orthologs. Gene Ontology [GO; Ref. ( 15 )] annotations supported by manual evidence codes are included to provide functional insight into uncharacterized proteins. All of this information can be searched with a web interface in a single step.
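As an illustration of how assessing several resources can increase confidence in an orthology call, the following sketch counts how many resources report each candidate ortholog for a query. It is not the YOGY implementation; the function name `combine_calls` and all protein identifiers are hypothetical, and the calls are invented toy data.

```python
from collections import defaultdict

def combine_calls(calls_by_resource):
    """Return (ortholog, supporting resources) pairs, best supported first."""
    support = defaultdict(set)
    for resource, orthologs in calls_by_resource.items():
        for ortholog in orthologs:
            support[ortholog].add(resource)
    return sorted(
        ((ortholog, sorted(resources)) for ortholog, resources in support.items()),
        key=lambda item: (-len(item[1]), item[0]),
    )

# Toy orthology calls for one query protein (invented for illustration).
calls = {
    "KOGs": {"cerevisiae_A", "human_A1", "human_A2"},
    "Inparanoid": {"cerevisiae_A", "human_A1"},
    "HomoloGene": {"human_A1"},
    "OrthoMCL": {"cerevisiae_A", "human_A1"},
    "curated yeast table": {"cerevisiae_A"},
}

ranked = combine_calls(calls)
for ortholog, resources in ranked:
    print(f"{ortholog}: {len(resources)} resource(s): {', '.join(resources)}")
```

A candidate reported by a single resource, such as `human_A2` here, would merit closer manual inspection than one reported by several.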
YOGY is implemented in a MySQL relational database running on a UNIX server. Data for the external resources have been downloaded from the associated FTP and websites for import into our database (Supplementary Data). The data model has been validated to identify and remove potential problems such as many-to-many relationships. Perl scripts together with the Perl DBI module are used for file import. For queries, we have designed a web interface using the CGI module of Perl, hosted on an Apache server. The Perl GD graphics module is used for bar charts.
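One common way to remove a many-to-many relationship from a relational model is a junction table, sketched below. This is an assumption about the general technique, not the actual YOGY schema: the table and column names are hypothetical, the data are invented, and Python with an in-memory SQLite database stands in for Perl DBI and MySQL so that the example is self-contained.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE protein (
    protein_id INTEGER PRIMARY KEY,
    identifier TEXT UNIQUE NOT NULL,
    organism   TEXT NOT NULL
);
CREATE TABLE cluster (
    cluster_id INTEGER PRIMARY KEY,
    resource   TEXT NOT NULL,
    name       TEXT NOT NULL,
    UNIQUE (resource, name)
);
-- Junction table: a protein can belong to many clusters and vice versa.
CREATE TABLE cluster_member (
    cluster_id INTEGER NOT NULL REFERENCES cluster(cluster_id),
    protein_id INTEGER NOT NULL REFERENCES protein(protein_id),
    PRIMARY KEY (cluster_id, protein_id)
);
""")

conn.executemany("INSERT INTO protein VALUES (?, ?, ?)", [
    (1, "pombe_A", "S.pombe"),
    (2, "cerevisiae_A", "S.cerevisiae"),
    (3, "human_A1", "H.sapiens"),
])
conn.executemany("INSERT INTO cluster VALUES (?, ?, ?)", [
    (1, "KOGs", "cluster_0001"),
    (2, "OrthoMCL", "cluster_0002"),
])
conn.executemany("INSERT INTO cluster_member VALUES (?, ?)", [
    (1, 1), (1, 2), (1, 3), (2, 1), (2, 2),
])

# All other members of every cluster that contains the query protein.
rows = conn.execute("""
    SELECT c.resource, c.name, p2.identifier, p2.organism
    FROM protein p1
    JOIN cluster_member m1 ON m1.protein_id = p1.protein_id
    JOIN cluster c         ON c.cluster_id  = m1.cluster_id
    JOIN cluster_member m2 ON m2.cluster_id = c.cluster_id
    JOIN protein p2        ON p2.protein_id = m2.protein_id
    WHERE p1.identifier = ? AND p2.protein_id != p1.protein_id
    ORDER BY c.resource, p2.organism, p2.identifier
""", ("pombe_A",)).fetchall()

for row in rows:
    print(row)
```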
Genes and proteins from the following nine organisms can be searched using gene names or systematic identifiers from the corresponding Model Organism Database (MOD), Ensembl ( 16 ), NCBI ( 11 ) or UniProt ( 17 ): Homo sapiens, Mus musculus ( 18 ), Rattus norvegicus ( 19 ), Arabidopsis thaliana ( 20 ), Drosophila melanogaster ( 21 ), Caenorhabditis elegans ( 22 ), Plasmodium falciparum ( 23 ), S.pombe ( 23 ), and S.cerevisiae ( 24 ). Where possible, identifiers from the appropriate MOD are shown throughout the output so that proteins from the five data sources can be evaluated for consistency; this is useful as the different homology resources use identifiers from a variety of databases. Because of the ambiguity of many identifiers, legacy naming systems and revisions to gene structures and gene complements, it is not always possible to be certain whether some apparent differences in orthology calls are, in fact, equivalent proteins. Whilst we have made every effort to map these identifiers automatically using resources from the MOD, the International Protein Index ( 25 ), UniProt ( 17 ) and the NCBI Entrez Gene database ( 11 ), any discrepancies should be checked manually by the user. It is possible to use incomplete names with a wild-card option, providing a list of genes and one-line descriptions for further search.
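A wild-card search of this kind can be sketched as follows. The sketch assumes that `*` is the wild-card character and that matching is case-insensitive; neither detail is specified above. The gene names, descriptions and the function name `wildcard_search` are hypothetical.

```python
from fnmatch import fnmatchcase

def wildcard_search(pattern, genes):
    """Return (name, description) pairs whose name matches the pattern."""
    lowered = pattern.lower()
    return sorted(
        (name, description)
        for name, description in genes.items()
        if fnmatchcase(name.lower(), lowered)
    )

# Toy catalogue of gene names with one-line descriptions (invented).
genes = {
    "abc1": "hypothetical protein kinase",
    "abc12": "hypothetical transporter",
    "xyz1": "hypothetical transcription factor",
}

matches = wildcard_search("ABC*", genes)
for name, description in matches:
    print(f"{name}\t{description}")
```

Note that `fnmatchcase` also treats `?` and `[...]` as special characters, so a production search would need to escape or restrict them.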
GO terms annotated to the identified orthologs can also be retrieved. Only associations using experimental and curator validated evidence codes are included. The option to show GO terms is switched off by default due to the increased time required to download GO data. Options are provided to display GO terms in separate tables at the end of each resource, or in a single table at the end of the output.
The output is provided in a tabulated HTML format ( Figure 1 ). The first table contains general information for the protein of interest including description and links to the corresponding MOD and the UniProt database, if this accession number is available. For S.pombe, links to gene expression profiles during the cell cycle [C; Ref. ( 26 )], meiotic differentiation [M; Ref. ( 27 )] and stress conditions [S; Ref. ( 28 )] are also provided. The data sources which provide positive orthology results for the gene of interest are then specified with links to the corresponding outputs.
The orthology results are presented in a standard output format for each dataset. At the top is information about the query protein cluster(s), followed by a list of available orthologs ordered by organism together with links to the ortholog resource ( Figure 1B–E ). Links to UniProt are also provided if the accession number is available. Below, each data source is described in the order given on the output page.
For KOGs, the summary table starts with the unique KOG name together with a link to the website. The next column displays a bar chart of the ortholog numbers for each organism, revealing the phylogenetic pattern for the KOGs ( Figure 1B ). This chart also provides a link to a list of other KOGs that share the same phylogenetic pattern, which provides insight into gene preservation and loss in different lineages. The summary table also indicates the functional classification, with a link to other KOGs in this classification, and a one-line description for the KOG. The orthologs are displayed in a list below the summary table, together with links for each protein or domain to the corresponding KOG cluster alignments and to the relevant protein page at NCBI ( Figure 1B ).
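The per-organism ortholog counts behind such a bar chart, and the grouping of clusters that share a phylogenetic pattern, can be sketched as below. This is an illustrative sketch only: the organism list is shortened to three species, the cluster names and members are invented, and the function names are hypothetical.

```python
from collections import Counter, defaultdict

# Fixed organism order, so that patterns are comparable between clusters.
ORGANISMS = ["S.pombe", "S.cerevisiae", "H.sapiens"]

def ortholog_counts(members):
    """Number of cluster members per organism, in ORGANISMS order."""
    counts = Counter(organism for _protein, organism in members)
    return [counts.get(organism, 0) for organism in ORGANISMS]

def phyletic_pattern(members):
    """Presence (+) or absence (-) of the cluster in each organism."""
    return "".join("+" if n else "-" for n in ortholog_counts(members))

def clusters_by_pattern(clusters):
    """Group cluster names by their shared presence/absence pattern."""
    groups = defaultdict(list)
    for name, members in clusters.items():
        groups[phyletic_pattern(members)].append(name)
    return dict(groups)

# Toy clusters (invented): name -> list of (protein, organism).
clusters = {
    "cluster_0001": [("pombe_A", "S.pombe"), ("cerevisiae_A", "S.cerevisiae"),
                     ("human_A1", "H.sapiens"), ("human_A2", "H.sapiens")],
    "cluster_0002": [("pombe_B", "S.pombe"), ("human_B", "H.sapiens")],
    "cluster_0003": [("pombe_C", "S.pombe"), ("human_C", "H.sapiens")],
}

counts = ortholog_counts(clusters["cluster_0001"])
groups = clusters_by_pattern(clusters)
print(counts)
print(groups)
```

Here `cluster_0002` and `cluster_0003` share a pattern in which the budding yeast member is absent, the kind of observation that can hint at lineage-specific gene loss.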
For Inparanoid, we have excluded orthologs from largely unannotated organisms, which are not in the other homology resources; this reduces the output page to 18 organisms (20 databases, as both mouse and rat include two datasets). The bar chart on top shows the phylogenetic pattern for the orthologs ( Figure 1C ). The list underneath shows the orthologs for the query protein, links to the Inparanoid protein clusters for each organism, the Inparanoid score and a link to the protein page in the corresponding MOD ( Figure 1C ). Inparanoid uses a sophisticated methodology to distinguish between in- and out-paralogs ( 8 ); we have downloaded the tables from the Inparanoid website and present these pre-calculated datasets on the YOGY website.
For HomoloGene, the summary at the top provides a link to the query protein cluster at NCBI and a phylogenetic bar chart. Each ortholog is then presented by organism with links to the relevant NCBI pages.
For OrthoMCL, we have again excluded orthologs from largely unannotated organisms and prokaryotes (except Escherichia coli, which is also included in Inparanoid), reducing the output to 24 organisms. The summary table includes a link to the OrthoMCL cluster and a phylogenetic bar chart ( Figure 1D ). This table is followed by a list of orthologs in the cluster with a link to the original protein sequence used for clustering and a link to the relevant MOD ( Figure 1D ). For some of the less well-characterized yeasts, which have no MOD, a link is provided to either the ‘Yeast Gene Order Browser’ or Génolevures, both of which provide graphical representations of conserved genome location ( 29 , 30 ).
For the curated yeast ortholog dataset, only fission and budding yeast proteins are included. The output provides the lists of orthologs together with links to the S.pombe GeneDB ( 23 ) and SGD ( 24 ) databases ( Figure 1E ).
If selected, either multiple tables or one table at the end provide a summary for all GO terms found for the query protein and its orthologs. This includes the term name, the aspect (P: Biological Process; C: Cellular Component and F: Molecular Function), the evidence codes, and the corresponding organisms together with the accession numbers of orthologs containing the GO term for each organism ( Figure 1F ). GO terms with the evidence code ‘Inferred from Electronic Annotation’ (IEA) are not included, as these have not been assessed by an annotator and tend to be to higher-level terms.
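The construction of such a GO summary, with IEA annotations excluded, can be sketched as follows. This is a minimal illustration rather than the YOGY code: the function name `summarise_go`, the input tuple layout and the protein identifiers are hypothetical, and the annotations are invented toy data.

```python
from collections import defaultdict

ASPECT_NAMES = {
    "P": "Biological Process",
    "C": "Cellular Component",
    "F": "Molecular Function",
}

def summarise_go(annotations, excluded_codes=("IEA",)):
    """Group annotations by GO term, skipping excluded evidence codes."""
    summary = {}
    for protein, organism, term, aspect, evidence in annotations:
        if evidence in excluded_codes:
            continue  # electronic annotations are left out of the summary
        entry = summary.setdefault(
            term,
            {"aspect": aspect, "evidence": set(), "proteins": defaultdict(set)},
        )
        entry["evidence"].add(evidence)
        entry["proteins"][organism].add(protein)
    return summary

# Toy annotations (invented): protein, organism, term, aspect, evidence code.
annotations = [
    ("pombe_A", "S.pombe", "cell cycle", "P", "IMP"),
    ("cerevisiae_A", "S.cerevisiae", "cell cycle", "P", "IDA"),
    ("human_A1", "H.sapiens", "cell cycle", "P", "IEA"),
    ("human_A1", "H.sapiens", "nucleus", "C", "IDA"),
    ("human_A2", "H.sapiens", "protein kinase activity", "F", "IEA"),
]

summary = summarise_go(annotations)
for term in sorted(summary):
    entry = summary[term]
    organisms = {org: sorted(ids) for org, ids in entry["proteins"].items()}
    print(term, ASPECT_NAMES[entry["aspect"]], sorted(entry["evidence"]), organisms)
```

A term supported only by IEA annotations, such as the molecular function term in this toy input, does not appear in the summary at all.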
In the future, we plan to make further changes to improve the display and integration of the different orthology resources. In the longer term, GO annotations will be represented on the GO tree structure, which will allow for the rapid identification of redundant and non-overlapping annotations from the various model organisms.
The described integrated database together with the accompanying search site provides a straightforward resource to identify orthologs from all specialized databases that are currently most useful; these ortholog databases have been built using different methods that complement each other, and the integrated results give a rich picture of orthology based on combined evidence from the independent resources. The GO annotations of orthologs can provide additional evidence on orthology and help to infer functional information for genes with limited annotation. This resource will be regularly updated to include the latest information from the independent data sources.
Supplementary Data are available at NAR Online.
The authors thank Matloob Qureshi and members of the Bähler laboratory and the Pathogen Sequencing Unit for discussions and help with programming. The work in the group is funded by a Cancer Research UK [CUK] Grant No. C9546/A6517 and by DIAMONDS, an EC FP6 Lifescihealth STREP (LSHB-CT-2004-512143). Funding to pay the Open Access publication charges for this article was provided by Cancer Research UK.
Conflict of interest statement . None declared.