Abstract

Motivation: The presence or absence of metabolic pathways and structures provides a context that makes protein annotation far more reliable. Compiling such information across microbial genomes improves the functional classification of proteins and provides a valuable resource for comparative genomics.

Results: We have created a Genome Properties system to present key aspects of prokaryotic biology using standardized computational methods and controlled vocabularies. Properties reflect gene content, phenotype, phylogeny and computational analyses. The results of searches using hidden Markov models allow many properties to be deduced automatically, especially for families of proteins (equivalogs) conserved in function since their last common ancestor. Additional properties are derived from curation, published reports and other forms of evidence. The Genome Properties system was applied to 156 complete prokaryotic genomes and is easily mined to find differences between species, correlations between metabolic features and families of uncharacterized proteins, or relationships among properties.

Availability: Genome Properties can be found at http://www.tigr.org/Genome_Properties

Contact: selengut@tigr.org

Supplementary information: http://www.tigr.org/tigr-scripts/CMR2/genome_properties_references.spl

INTRODUCTION

Assigning names to all predicted proteins in a complete genome, even when carried out with exquisite accuracy, provides only an initial layer of understanding of the activities in a cell. For example, competence genes (comA-G) participate in cellular competency, a cell’s ability to take up extracellular DNA and incorporate it into its own chromosome. However, many com genes are also found in organisms such as Escherichia coli that do not exhibit natural competency. It has been suggested that the com operon is part of a mechanism for using extracellular DNA as a nutrient source, which is evident in E.coli and in many other species (Finkel and Kolter, 2001). The cellular role of the com genes therefore cannot be determined in isolation; genome annotation is made more complete when individual genes are placed in the context of metabolic pathways, coordinated cellular activities or cellular structures. This secondary layer of biological description provides a more complete and contextually rich picture of biological processes.

We call objects in this secondary layer ‘genome properties’. A genome property is a single assertion (a numerical value, a truth state such as ‘Yes’ or ‘No’, or a controlled vocabulary term such as ‘facultative anaerobic’) for some attribute, applied to a single completely sequenced genome. The attribute may be a metabolic capability such as tryptophan biosynthesis from chorismate, a physical feature such as outer membrane, a taxonomic classification or a calculated value such as GC content (Table 1).

In this paper, we describe a computational analysis system for the investigation of genome properties for completely sequenced prokaryotes. The goals of the project are 5-fold. First, to create a repository of property assertions for each species, whether taken from the scientific literature or produced during genome annotation. Second, to increase the accuracy and richness of genome annotation. Third, to provide concise species summaries with controlled vocabularies suitable for comparative analyses. Fourth, to create a research tool for hypothesis generation by any of several techniques, including phylogenetic profiling (Pellegrini et al., 1999). Fifth, to create a compendium of knowledge concerning microbial genome properties by linking the underlying data with brief scholarly descriptions, primary and secondary literature references and relevant websites.

Assertions in Genome Properties can be made either manually, through the intervention of a human curator, or automatically, using the results of precomputed analyses, such as searches with hidden Markov models (HMMs) from the Pfam (Bateman et al., 2004) and TIGRFAMs (Haft et al., 2003) databases. Many TIGRFAM HMMs are built to segregate larger protein families into smaller subfamilies, termed ‘equivalogs’ (Haft et al., 2001), in which all members share the same specific function. Combinations of HMMs can be used as evidence to show that a genome contains the complete set of enzymes from a pathway or subunits from a protein complex. HMM evidence and other sequence analyses allow many properties to be set automatically by rules encoded within the Genome Properties system; these rules may be applied even to unannotated genomes, as soon as HMM search results are available.

Although it is not restricted to the generation of metabolic pathways, Genome Properties joins a number of other projects that combine metabolic reconstruction and comparative genomics. Other examples of metabolic reconstruction methods include EcoCyc, which provides a metabolic map of E.coli based on exhaustive literature searching and expert curation (Karp et al., 2002a). EcoCyc’s pathway definitions can be used to find corresponding pathways, based on matching protein annotations, in other species (Karp et al., 2002c). Its companion database, MetaCyc (Karp et al., 2002b), confines itself to experimentally verified instances of pathways in other species but expands the set of model pathways considerably. Given annotation of sufficient quality, a fairly robust metabolic reconstruction can be performed. The KEGG database (Kanehisa et al., 2004) presents detailed representations of sets of interconnected pathways, containing multiple species with different subsets of each pathway. The KEGG system uses bi-directional best hits across genomes to identify probable orthologs and functionally equivalent proteins, with manual curation, which has led to fairly extensive prediction of the presence or absence of pathways and annotation of the indicated enzyme components. WIT (Overbeek et al., 2000), which is no longer publicly available, used a stringent protein clustering scheme related to the bi-directional best hit heuristic in COGs (Tatusov et al., 2003), combined with human curation, to detect the presence of pathways and their variants across a wide collection of genomes. ERGO (Overbeek et al., 2003) is a commercial successor to WIT. All these efforts improve on the picture of metabolism that might be understood from the annotated gene list alone.
Other websites provide analyses such as TransportDB, which contains data on transporters (Ren et al., 2004); SACSO, which contains ‘comparative analysis of completely sequenced organisms including base composition, amino acid composition, ancestral duplication and ancestral conservation’ (Tekaia et al., 2002); and Comparative Genometrics, which contains nucleotide composition data (Roten et al., 2002).

Genome Properties currently contains >17 000 property assertions (Table 2). The interface created for Genome Properties allows facile searching of these data and affords the ability to compare multiple properties over multiple genomes. The inclusion of taxonomic properties allows searches and comparisons to be easily restricted to certain phylogenetic clades. Genes mapped to specific properties are linked directly to the Comprehensive Microbial Resource (CMR) (Peterson et al., 2001), which provides an additional set of analyses of whole genomes and their encoded proteins for comparison, summarization and investigation.

The computational system for Genome Properties will be released in the near future as open source.

SYSTEMS AND METHODS

Genomic source data

The input to the Genome Properties system is data from the CMR (Peterson et al., 2001). The CMR uses a relational database called the Omniome to store all data associated with the completed prokaryotic genomes. The Omniome provides primary sequence data and annotations that mirror those contained in GenBank. It also provides a second set of protein predictions carried out using GLIMMER (Delcher et al., 1999) with automated annotation. For genomes sequenced at The Institute for Genomic Research (TIGR), only the primary annotation is used. This set of protein predictions is made initially using GLIMMER but is then curated extensively by human annotators who adjust start sites, remove spurious gene calls and add missing genes using homology evidence. A gene is considered present in a genome if it appears in either the primary or the secondary gene lists. The CMR is updated continuously with new genomes as they become available.

Data model

Storage for the Genome Properties system is implemented using a relational database. Entities represented in separate tables in the database include definitions of the properties, relationships between properties, links to external sources of information related to the properties and the assertions made for those properties. For assertions that can be made by rules, Genome Properties tables represent the components (metabolic steps, structural subunits or other hallmarks) that signify the property and the evidence for identifying those components. Forms of evidence used in the current release include HMM scores, manual assignments of proteins to HMM families or to specific enzymatic functions, tRNA predictions by tRNA-scanSE (Lowe and Eddy, 1997), DNA features and coding region attributes such as selenocysteine codons (Stadtman, 1987) and programmed frameshifts (Farabaugh, 2000).
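
The entities listed above can be pictured with a small relational sketch. The following is a minimal illustration using an in-memory SQLite database; every table name, column name and inserted value is a hypothetical choice made for this example and is not the schema of the actual Genome Properties database.

```python
import sqlite3

# Hypothetical schema: all table and column names are illustrative assumptions.
SCHEMA = """
CREATE TABLE property (
    prop_id   INTEGER PRIMARY KEY,
    name      TEXT NOT NULL,
    prop_type TEXT NOT NULL            -- e.g. 'pathway', 'system', 'phenotype'
);
CREATE TABLE property_link (           -- relationships between properties
    parent_id INTEGER REFERENCES property(prop_id),
    child_id  INTEGER REFERENCES property(prop_id)
);
CREATE TABLE component (               -- a step, subunit or other hallmark
    comp_id   INTEGER PRIMARY KEY,
    prop_id   INTEGER REFERENCES property(prop_id),
    name      TEXT NOT NULL,
    required  INTEGER NOT NULL DEFAULT 1
);
CREATE TABLE evidence (                -- how a component may be identified
    comp_id   INTEGER REFERENCES component(comp_id),
    method    TEXT NOT NULL,           -- e.g. 'HMM', 'tRNA', 'manual'
    query     TEXT NOT NULL            -- e.g. an HMM accession
);
CREATE TABLE assertion (               -- one value per property per genome
    prop_id   INTEGER REFERENCES property(prop_id),
    genome    TEXT NOT NULL,
    value     TEXT NOT NULL,
    PRIMARY KEY (prop_id, genome)
);
"""

conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)

# Placeholder rows: 'genome_A' and its state are invented for the example.
conn.execute("INSERT INTO property VALUES (1, 'glutathione biosynthesis', 'pathway')")
conn.execute("INSERT INTO assertion VALUES (1, 'genome_A', 'Yes')")

rows = conn.execute(
    "SELECT p.name, a.genome, a.value "
    "FROM assertion a JOIN property p USING (prop_id)"
).fetchall()
print(rows)
```

Keeping assertions in their own table, keyed by property and genome, is what makes it cheap to slice the data by genome, by property or by state.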

Property types

Properties are divided conceptually into a number of types.

Taxonomic properties reflect seven hierarchical levels of phylogenetic classification (superkingdom, phylum, class, order, family, genus and species) derived from the NCBI Taxonomy database (Wheeler et al., 2004). Also included in this type is the property TaxID, which stores the unique id that the NCBI has associated with each sequenced strain and allows us to maintain consistency between Genome Properties and NCBI.

Phenotypic properties reflect directly observable data (e.g. ‘oxygen requirement’, ‘human pathogen’ or ‘optimal growth temperature’), and are not necessarily derivable in any simple way from the content of the genome. Phenotypic assertions are set manually. The concepts associated with phenotypic assignments vary widely and are not easily subjected to a unified classification scheme; we have used a controlled vocabulary for phenotypes whenever possible. Unlike properties whose states or values are set by automatic processes, not every genome may have a curated state for every phenotypic property.

Calculated properties are produced by computational analysis on the DNA sequence or the set of predicted proteins, and may include numerical or string values depending on the nature of the metric (e.g. ‘GC content: 50%’).
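
As an illustration of a calculated property, GC content can be computed directly from the DNA sequence. The sketch below is not the implementation used by Genome Properties; the function name is hypothetical, and the decision to ignore ambiguous bases such as N is an assumption made for this example.

```python
def gc_content(sequence: str) -> float:
    """Return the percentage of G and C among the unambiguous A/C/G/T bases."""
    seq = sequence.upper()
    gc = seq.count("G") + seq.count("C")
    total = gc + seq.count("A") + seq.count("T")
    if total == 0:
        raise ValueError("sequence contains no A, C, G or T bases")
    # Assumption: ambiguous bases (e.g. N) are excluded from the denominator.
    return 100.0 * gc / total


print(f"GC content: {gc_content('ATGCGCNNTA'):.1f}%")
```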

Pathway and system properties represent groups of genes that work together in some way. A pathway is composed of proteins that are able to perform a set of consecutive enzymatic steps to generate one or more products from some set of starting materials (e.g. ‘glutathione biosynthesis’). A system is analogous but broader, describing elements that work together but are not necessarily in a metabolic pathway (e.g. ‘PTS transport system’ and ‘nucleotide excision repair’). In some cases these assertions are made manually, in others they are made as the result of rules (see below). Rules are used to automatically detect the components based on specified evidence stored in database tables.

Category properties organize other genome properties in a hierarchy (e.g. ‘Sulfur metabolism’ and ‘Biological niche’).

Summary properties serve an organizational role in the same manner as category properties, but also hold a value for each genome that represents a summary, consensus or average of those properties that are below it in the hierarchy. These values are assigned either by the Genome Properties rules interpreter (see below) or by the separate algorithms written for each summary property (e.g. ‘IPP biosynthesis’ summarizes ‘IPP biosynthesis via mevalonate’ and ‘IPP biosynthesis via deoxyxylulose’).
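
One simple way to summarize alternative routes to the same product is to report the best-supported state found among the child properties. The sketch below assumes this 'strongest state wins' policy and an ordering of states chosen for illustration; the text above states only that summary values come from the rules interpreter or from separate per-property algorithms, so neither the policy nor the child states in the example are taken from the system itself.

```python
# Assumed ordering of states, weakest to strongest support (illustrative only).
STATE_RANK = ["No", "none found", "not supported", "some evidence", "Yes"]


def summarize_alternatives(child_states):
    """Return the best-supported state among alternative child properties."""
    if not child_states:
        raise ValueError("a summary property needs at least one child state")
    return max(child_states.values(), key=STATE_RANK.index)


# Hypothetical child states for one genome.
children = {
    "IPP biosynthesis via mevalonate": "none found",
    "IPP biosynthesis via deoxyxylulose": "Yes",
}
print("IPP biosynthesis:", summarize_alternatives(children))
```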

The current set of Genome Properties as evaluated for Haemophilus influenzae KW20 Rd (Fleischmann et al., 1995) is shown in Table 1, which contains examples of the property types described above.

Evidence—the input data to Genome Properties

More than 30 types of sequence analysis data that have been generated from a computational pipeline are stored in the Omniome, including results of topological and structural prediction programs and various homology-based search results. Evidence for non-protein features such as tRNA molecules, programmed (i.e. genuine) frameshifts or repeat regions also serves as input to the Genome Properties system. The primary source of information for Genome Properties is the search results of TIGRFAM (Haft et al., 2003) and Pfam (Bateman et al., 2004) HMMs. The utility of the TIGRFAMs database for identifying protein functions has been described previously (Haft et al., 2003). Each of its models is labeled according to the functional diversity of the family of proteins it describes. Out of over 2000 models in TIGRFAMs, more than half are of type ‘equivalog’, meaning that all proteins found by the model are presumed to share the same primary function with each other and with their last common ancestor. The definition of equivalog differs from the term ortholog (Fitch, 1970) in two ways. First, it requires conserved function explicitly; functionally distinct offshoots of the protein family are excluded. Second, it allows the inclusion of laterally transferred genes, in contrast to the formal requirement of orthology that sequences are derived only by speciation. We have found that some Pfam models describe families all of whose members are functionally identical. These Pfam models may be used in the same way as TIGRFAMs equivalog models in the building of rules. Any score above the trusted cutoff of an HMM assigns the target sequence to that protein family and shows the family to be represented in the corresponding genome. Other protein classification systems such as PROSITE (Hulo et al., 2004) could also serve as sources of evidence but have not been used to date.
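
The trusted-cutoff criterion can be sketched as a simple filter over precomputed search results. In the example below the record layout, the function name, the accessions (HMM_A, HMM_B) and all scores are hypothetical. The comparison is strict ('above' the cutoff), following the wording of the text; treating a score equal to the cutoff as a hit would be an equally plausible reading.

```python
from typing import NamedTuple


class Hit(NamedTuple):
    protein: str   # identifier of the target protein
    hmm: str       # accession of the HMM that produced the score
    score: float   # HMM score for this protein


def families_present(hits, trusted_cutoffs):
    """Map each HMM family to the proteins scoring above its trusted cutoff."""
    present = {}
    for hit in hits:
        cutoff = trusted_cutoffs.get(hit.hmm)
        # Hits for models without a known cutoff are ignored in this sketch.
        if cutoff is not None and hit.score > cutoff:
            present.setdefault(hit.hmm, []).append(hit.protein)
    return present


# Invented example data.
hits = [
    Hit("protein_1", "HMM_A", 120.5),
    Hit("protein_2", "HMM_A", 30.0),
    Hit("protein_3", "HMM_B", 75.0),
]
cutoffs = {"HMM_A": 100.0, "HMM_B": 80.0}
print(families_present(hits, cutoffs))
```

A family that appears as a key in the result is considered represented in the genome; the listed proteins are the features that would be linked to it as evidence.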

Definition and implementation of rules for system and pathway properties

We derive our system and pathway descriptions from compilations of biochemical pathways (Michal, 1999; Kanehisa et al., 2004), descriptions in entries from protein annotation databases (Boeckmann et al., 2003; McGarvey et al., 2000) and directly from the scientific literature. System and pathway properties are represented in our database as a list of components (enzymes, RNA or DNA elements, etc.) and, for each of these components, a list of the types of evidence (HMMs, EC numbers, etc.) that may be used to identify it.

Although we strive to adhere to definitions that are consistent with those norms accepted by the scientific community, our primary goal is to identify properties that result in unambiguous assertions for the greatest number of genomes possible. Consequently, subsets of canonical pathways/systems may be defined as separate Genome Properties when a significant number of species contain those subsets in isolation.

For some cases, pathways and systems are described in our data model as alternative subsets of components with logical OR relationships. This is because, for example, certain organisms utilize acetylated intermediates while others utilize succinylated intermediates, and still others act without esterification. Different enzymes (and even different numbers of enzymes) catalyze these different steps. We also encounter situations where a component of a system or pathway may not be detected in a genome of an organism in which that property is present. In some of these cases, a component may play a non-essential role in the process and occur in a subset of lineages. In other cases, a particular step may be known to be essential but the identity of the component that fulfills that role may not yet have been discovered in all or a subset of species. In such cases, our representation allows for components of a Genome Property to be specified as optional.

Together, the lists of components, evidence types, required/optional flags, Boolean relationships and threshold values represent rules, which are used to evaluate individual properties and make assertions about their presence or absence in a particular genome. A stand-alone program evaluates these rules in the following five steps: (1) the evidence describing the property is read, (2) the genome is scanned for features that contain these types of evidence, (3) the features are recorded in the database, (4) the list of identified features and the required components are evaluated in accordance with the logical structure of the rule and (5) the proper assertion is written to the Genome Properties table.

System and pathway property assertions are stored as a controlled vocabulary in the Genome Property table as follows. ‘Yes’ indicates that all components have been identified, or the existence of the property has been experimentally determined or asserted by expert curation. ‘some evidence’ indicates that a number of components have been identified, but one or more are not in evidence so that the presence of the property is possible. ‘not supported’ indicates that some components are found but they are insufficient in number to argue that the property is present. The boundary between ‘some evidence’ and ‘not supported’ is determined separately for each property by assigning a threshold value. ‘none found’ indicates that no components were identified, and ‘No’ indicates that either a curator has evaluated a ‘none found’ or a ‘not supported’ assertion and concurred, or that the absence of a functional version of the system or pathway was verified experimentally.
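
The evaluation step of a rule can be sketched as follows. This is an illustrative reading of the description above, not the stand-alone program itself: the class and function names are hypothetical, OR relationships are modeled as alternative evidence for one component, and the per-property threshold is interpreted as the maximum number of required components that may be missing while still reporting ‘some evidence’ (the parameter max_missing is an assumption). The curated states ‘No’ and experimentally determined ‘Yes’ are outside the scope of this sketch.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Component:
    name: str
    evidence: tuple          # alternative evidence IDs; any one suffices (logical OR)
    required: bool = True    # optional components never block a 'Yes'


def evaluate_rule(components, found_evidence, max_missing):
    """Return the automatic assertion for one property in one genome."""
    found = [c for c in components
             if any(e in found_evidence for e in c.evidence)]
    if not found:
        return "none found"
    missing_required = [c for c in components
                        if c.required and c not in found]
    if not missing_required:
        return "Yes"
    if len(missing_required) <= max_missing:
        return "some evidence"
    return "not supported"


# Hypothetical three-step rule; HMM identifiers are placeholders.
rule = [
    Component("step 1", ("HMM_1",)),
    Component("step 2", ("HMM_2a", "HMM_2b")),        # two alternative families
    Component("step 3", ("HMM_3",), required=False),  # optional step
]
for evidence in ({"HMM_1", "HMM_2b"}, {"HMM_1"}, {"HMM_3"}, set()):
    print(sorted(evidence), "->", evaluate_rule(rule, evidence, max_missing=1))
```

Marking a step as optional, as in step 3 here, lets a pathway be asserted as complete even when one component cannot yet be detected in every lineage.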

Genome Properties curation

For many genomes, assertions are assigned unambiguously, reflecting that all components or no components are found. For other genomes, the distribution of evidence is examined in a manual process. Typically, it is the two assertions ‘not supported’ and ‘some evidence’ that are adjusted during curation. Promotions of the states ‘none found’ and ‘not supported’ to ‘No’ are performed only when some other line of reasoning exists, such as experimental evidence, disrupted genes or alternative pathways. Genes that are expected to appear in otherwise complete pathway reconstructions or systems may be undetectable because of highly divergent sequences, missed annotation of the correct open reading frame, or truncated, interrupted or inactivated genes. In such cases we manually evaluate homology criteria, molecular phylogeny, metabolic context and gene clustering. Often, this results in the identification of the missing gene. We may also alter system and pathway rules in response to manual review, typically when spurious hits occur for HMMs that perform imperfectly, in which case the existing HMMs are improved. Alternatively, a protein family for some component may appear only in certain lineages, in which case new models are created to provide alternative evidence for that component across some species range.

Genome Properties interface

Our interface consists of a series of web pages within the CMR. These include a home page (http://www.tigr.org/Genome_Properties), a search page, property definition pages and data display pages for comparison among genomes and properties as well as detailed displays for individual properties in single genomes. The search page allows a user to extract slices of the Genome Properties data based on one or more selected genomes, properties and/or property states including the states of taxonomic properties (Fig. 1A). Property definition pages include summaries, lists of components, the methods used to identify them, links to relevant literature through PubMed and external databases such as EcoCyc or KEGG (Fig. 1B). Comparison display pages show property assertions displayed as a table across the genomes selected by the user. Data displays for individual properties as evaluated for particular genomes include lists of all the genes mapped to each of the components of the property, the evidence by which each gene was identified as a component and the regional genomic context of each gene (Fig. 1C).

RESULTS AND DISCUSSION

TIGRFAM accuracy and overall genome coverage

Much of our metabolic reconstruction system is based on searches of TIGRFAM HMMs against prokaryotic proteins. We consider searches using TIGRFAMs to be highly accurate: in our experience, annotators who manually evaluate TIGRFAM search results during bacterial genome annotation rarely detect false positives. In a controlled experiment to evaluate TIGRFAM accuracy, 50 equivalog HMMs were tested against the current release of Swiss-Prot (Boeckmann et al., 2003). The Swiss-Prot database has been subjected to years of manual curation by trained experts and is used by academic institutions throughout the world. For the purpose of this comparison, it was treated as a dataset containing an independently derived standard of truth. TIGRFAM HMM scores above the curated trusted cutoff were collected. A total of 1016 proteins were identified, and the assertions of function by TIGRFAMs and by the Swiss-Prot annotation were compared by manual evaluation. Only two assignments of function conflicted, a difference traceable to a TIGRFAM annotation that used an outdated publication.

The overall coverage of our dataset was estimated as follows: the results of searching TIGRFAM HMMs against the genes in 144 completed bacterial genomes indicate that ∼13–28% of all genes in a typical bacterial genome receive an automatic functional assignment from a TIGRFAM equivalog HMM. This corresponds to roughly half of all the genes in a typical genome that can receive a specific functional assignment. These data are displayed for a representative selection of genomes in Figure 2.

Genome Property assertions

We have defined 172 properties (Table 1) resulting in over 17 500 property assertions after their application to 145 completed prokaryotic genomes (Table 1). Nearly 4000 property assertions result from direct computation such as DNA GC content and count of predicted proteins. Over 2000 property assertions have been performed manually using information derived from external sources for properties such as optimal pH, chemotaxis and phylum. We encourage the submission of further literature-supported assertions of phenotypic data through our website where such data are currently absent.

The remaining assertions are for pathway properties such as tryptophan biosynthesis from chorismate, and system properties such as Tat (Sec-independent) protein export, which utilize autonomous ‘rules’ that evaluate HMM search results and other stored evidence (see Systems and methods section). Properties like these may weigh evidence from >30 HMMs, but most rules built so far weigh evidence for between three and eight components. Rules may assign the state ‘Yes’ when all required components are identified, ‘none found’ in the absence of any components, ‘some evidence’ when no more than a specified number of components are missing and ‘not supported’ when fewer than this number are identified.

Properties assigned to the extreme states ‘Yes’, ‘not supported’, ‘none found’ and ‘No’ outnumber those in the intermediate state ‘some evidence’ by greater than 10:1. ‘Yes’ assignments for properties map genes to specific biological processes in a way that context-independent functional identification of the protein alone may not (see below). Currently, over 40 000 entries from the CMR are linked as evidence for entries in Genome Properties. Over 34 000 of these contribute to ‘Yes’ states for their respective properties.

Table 3 shows the components of the rule for the selenocysteine incorporation property and the application of that rule to the genome of H. influenzae KW20 Rd (Fleischmann et al., 1995). This example illustrates the diversity of both protein function and genomic evidence that may contribute to a property. Three types of genomic features are required: genes encoding two enzymes and a translation factor, the selenocysteine tRNA and an example of a protein that incorporates selenocysteine at a UGA codon. Evidence is provided by HMM search results, tRNA detection (Lowe and Eddy, 1997) and annotation of a selenoprotein translation exception. A single protein (HI0200) fills two requirements of the rule: a selenoprotein example and the enzyme selenophosphate synthase.

Validation of evidence-based Genome Properties assertions

Potentially, the accuracy of the Genome Properties system may be very high in that it uses HMM-based rules that were developed by expert curation. We evaluated the accuracy of this system by several approaches. First, we applied the system to the bacterium Corynebacterium glutamicum ATCC 13032. This organism was chosen because it was sequenced recently (Kalinowski et al., 2003) and therefore does not appear in many of the seed alignments of TIGRFAM HMMs. The bacterium is an industrially important source of lysine and glutamate and has been characterized extensively in the experimental literature. Application of Genome Properties to the C.glutamicum genome sequence resulted in 36 Yes assertions, 29 of which were supported in published reports on this organism’s metabolism (see Supplementary Table S1). No literature reference could be identified that contradicted one of our assertions. Literature sources were also used to identify organisms like Corynebacterium that grow in the absence of supplemented amino acids, proteins or peptides. These organisms are presumed to have functional pathways for the biosynthesis of all amino acids.

Similarly, organisms that lack amino acid biosynthesis pathways should be restricted to environments rich in amino acids and peptides (e.g. obligate intracellular pathogens such as Chlamydia trachomatis). Literature-based support of amino acid biosynthesis Genome Properties for 77 genera is summarized in Supplementary Table S2. Of those pathways and organisms listed in Table S2, 615 positive assertions are expected. Genome Properties asserts ‘Yes’ in 583 of these cases and ‘some evidence’ in an additional 26 cases (99% overall success). The remaining 6 cases involve proline biosynthesis, mainly in Archaea, where an alternative pathway has been proposed but not yet characterized (Graupner and White, 2001).
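
The success rate quoted above follows directly from the three counts given in the text, as this short check shows (variable names are chosen for the example).

```python
expected = 615        # positive assertions expected from the literature
yes = 583             # cases asserted as 'Yes'
some_evidence = 26    # cases asserted as 'some evidence'

supported = yes + some_evidence
unresolved = expected - supported
print(f"{supported}/{expected} = {supported / expected:.1%}; unresolved: {unresolved}")
```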

Occasionally, literature scans identify cases where Genome Properties has asserted the presence of a property, yet specific laboratory tests have failed to observe the associated phenotype. For instance, Lactococcus lactis appears to have complete pathways for the biosynthesis of several amino acids that are nonetheless required to be present in the media for cell growth. In this case, the sequenced L.lactis is an industrial strain used for the production of cheese and may have recently (over the course of laboratory isolation) ceased to express these enzymes (Bolotin et al., 2001). This type of ‘false positive’ assertion, when identified, is flagged by changing the state of the property from ‘Yes’ to ‘Cryptic’. Cryptic states alert users to conflicts between genomic content (about which Genome Properties makes assertions) and expressed phenotypes, which may depend on the nature of the experimental system and factors outside the scope of the Genome Property.

In certain cases, an essential metabolic function can be fulfilled by two or more independent pathways or systems. Genome Properties can be self-validated in these cases by observing that at least one of these properties should be present (complementarity). For example, proton-gradient energized ATPases are essential for cellular life and come in two types, the F1-F0 ATPase and the V-type ATPase. Every genome should have at least one of these systems, and in fact, Genome Properties finds this to be true in all cases, with only five examples of genomes containing both systems. These and similar results for IPP and lysine biosynthesis are presented in Table 4.
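
The complementarity test can be sketched as a scan over the stored assertions. The function name, the data layout and the three example genomes below are hypothetical; only the two property names come from the text.

```python
def check_complementarity(assertions, alternatives):
    """Split genomes into those with none and those with several of the alternatives.

    assertions:   {genome: {property name: state}}
    alternatives: property names of which at least one is expected to be 'Yes'
    """
    none_present, several_present = [], []
    for genome, states in sorted(assertions.items()):
        present = [p for p in alternatives if states.get(p) == "Yes"]
        if not present:
            none_present.append(genome)       # violates complementarity
        elif len(present) > 1:
            several_present.append(genome)    # redundant systems
    return none_present, several_present


# Invented assertions for three placeholder genomes.
assertions = {
    "genome_A": {"F1-F0 ATPase": "Yes", "V-type ATPase": "none found"},
    "genome_B": {"F1-F0 ATPase": "Yes", "V-type ATPase": "Yes"},
    "genome_C": {"F1-F0 ATPase": "none found", "V-type ATPase": "not supported"},
}
missing, both = check_complementarity(assertions, ["F1-F0 ATPase", "V-type ATPase"])
print("no system found:", missing)
print("more than one system:", both)
```

Genomes reported in the first list are the ones worth examining manually, since an essential function appears to be unaccounted for.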

Genome Properties was benchmarked against KEGG, which employs an independent methodology for identifying the components of pathways and systems. KEGG does not explicitly assert the presence or absence of a complete pathway or system. However, where all the steps in a pathway were present in the KEGG database for an organism, we inferred that KEGG was asserting the presence of that pathway. Table 5 indicates that four pathways shared by KEGG and Genome Properties were in agreement for 95% of the organisms tested. The only differences occurred when KEGG did not identify a component found by Genome Properties. In each of these 17 cases, the assignment made by Genome Properties was supported by multiple lines of evidence such as HMM scores, multiple sequence alignments and co-localization with functionally related genes. In a number of cases, it appeared that KEGG’s system did not identify a complete pathway because of a missed gene call in the original annotation. Such genes are often identified during Genome Properties curation by searches of genes against the genomic DNA rather than relying on the primary set of predicted proteins. In no instance did KEGG assert the presence of a component that was not also identified by Genome Properties.

Comparative genomics with Genome Properties

An example of Genome Properties used as a comparative tool for chorismate-associated biosynthetic pathways across many species is shown in Table 6. These pathways tend to be conserved for members of any given genus but exceptions to such phylogenetic patterns often prove interesting. Staphylococcus aureus has the Tat (Sec-independent) protein export system while Staphylococcus epidermidis lacks it. Examination showed that the Tat translocases in S.aureus are encoded adjacent to their lone target, suggesting lateral gene transfer of a cassette composed of a Tat translocase together with its target, a gene containing an N-terminal Tat signal sequence.

‘Missing’ components of genome properties

The property ‘histidine biosynthesis from PRPP’ consists of 10 enzymatic steps. Currently, all 10 enzymes have been identified in 43 published genomes. In an additional 56 genomes only the ninth step, histidinol-phosphate phosphatase (HisB), is not found. In the Genome Property rule for this pathway, the ninth step is treated as an optional element (although the activity is surely required for the pathway) due to our current inability to detect it in many species. The lack of universal detection of this step does not change the overall quality of the assertion that the pathway is complete. Most probably, this step is carried out by enzymes from a number of non-orthologous gene families (Koonin et al., 1996), only two of which have been characterized and modeled by HMMs. The list of organisms carrying out histidine biosynthesis but lacking an identified hisB gene may be a useful starting point for investigations aimed at identifying novel hisB gene families.

One method of identifying such non-orthologous families (Osterman and Overbeek, 2003) involves looking for candidate genes that are nearby along the chromosome (Overbeek et al., 1999). In the case of the histidine biosynthesis property, a gene annotated as ‘Inositol monophosphatase-like protein’ (due to its membership in a Pfam family—PF00459) is adjacent to the gene encoding the identified Step 8 of the pathway, histidinol phosphate aminotransferase (hisC) in Synechocystis species PCC6803 (loci NTL01SS01282 and NTL01SS01283, respectively). The Genome Properties interface allows one to view such information easily. The branch of the PF00459 family containing this gene includes genes from 18 other published bacterial genomes (including Actinobacteria, Alphaproteobacteria, Pirellula sp. strain 1 and Pseudomonas putida), all of which contain every step of the histidine biosynthesis pathway except HisB. Although no other published genome shows gene clustering of this phosphatase with histidine biosynthesis genes, it is observed in two unpublished genomes being finished at TIGR, Myxococcus xanthus DK 1622 and Fibrobacter succinogenes S85 (data not shown). It seems that this family is a strong candidate for the HisB enzyme in these genomes and warrants experimental characterization. This family of putative HisB enzymes has been modeled by a TIGRFAMs HMM (TIGR02067).

Mapping of process information onto protein annotations

The Gene Ontology (GO) database (Harris et al., 2004) has proven to be a versatile system for categorizing information pertaining to the functions, physical localizations and processes of genes. TIGR annotators have recently begun adding GO terms to protein annotations where possible. From this experience, we have learned that assigning GO function terms is analogous to gene name annotation and relatively straightforward. Associating GO process terms during gene-by-gene annotation is more labor intensive, however, because a protein may serve different processes in different species. To address this issue, Genome Properties maps GO terms to each of the components of rules-based properties. When a property state is set to ‘Yes’, the genes corresponding to components of the system are assigned GO process terms automatically. In the case of metabolic pathways that terminate in the production of branch-point metabolites, the presence of downstream pathways with ‘Yes’ states results in the transitive application of GO process terms to the components of the upstream pathways. For instance, most of the genomes listed in Table 1 contain all the components of the chorismate biosynthesis pathway, but chorismate is used for a variety of different purposes depending on the genomic context. Genes of the chorismate pathway will receive GO-IDs corresponding to those processes active in that particular organism.
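The transitive assignment of process terms can be illustrated with a small sketch. The data layout, the function name and the gene labels are assumptions made for this example, and descriptive strings stand in for real GO identifiers; it is not the production mapping used by Genome Properties.

```python
def go_terms_for_genes(properties, states):
    """Map each gene to the process terms of its own property and of
    every downstream property whose state is 'Yes'."""

    def active_terms(name, seen):
        # Stop at properties that are not 'Yes' or were already visited.
        if name in seen or states.get(name) != "Yes":
            return set()
        seen.add(name)
        terms = {properties[name]["go"]}
        for downstream in properties[name]["downstream"]:
            terms |= active_terms(downstream, seen)
        return terms

    gene_terms = {}
    for name, prop in properties.items():
        if states.get(name) != "Yes":
            continue
        terms = active_terms(name, set())
        for gene in prop["genes"]:
            gene_terms.setdefault(gene, set()).update(terms)
    return gene_terms


# Placeholder labels stand in for GO identifiers; gene names are illustrative.
properties = {
    "chorismate": {"genes": ["aroA"], "go": "process: chorismate biosynthesis",
                   "downstream": ["tryptophan", "menaquinone"]},
    "tryptophan": {"genes": ["trpA"], "go": "process: tryptophan biosynthesis",
                   "downstream": []},
    "menaquinone": {"genes": ["menA"], "go": "process: menaquinone biosynthesis",
                    "downstream": []},
}
states = {"chorismate": "Yes", "tryptophan": "Yes", "menaquinone": "none found"}
gene_terms = go_terms_for_genes(properties, states)
```

In this toy genome the chorismate pathway gene receives the tryptophan process term but not the menaquinone term, because only the former downstream pathway is asserted as present.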

Phylogenetic profiling with genome properties

Genome Properties assertions can be converted into phylogenetic profiles (Pellegrini et al., 1999) simply by encoding each ‘Yes’ assertion as 1 and ‘No’, ‘none found’, and ‘not supported’ assertions as 0. The ambiguous assertion ‘some evidence’ may be treated as missing data and ignored, or may be lumped with ‘Yes’ under the hypothesis that missing components are likely present but not recognized. The resulting pattern of 1’s and 0’s for the presence or absence of a protein (as originally formulated) or of a genome property across many genomes can carry a significant amount of information, enough to suggest functional relationships between pairs of proteins or between proteins and properties.
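The encoding described above can be written down directly. The sketch below is illustrative: the function names, the `some_evidence` switch and the simple agreement score are assumptions made for this example, and the two lists of assertions are toy data rather than values from the Genome Properties database.

```python
ENCODING = {"Yes": 1, "No": 0, "none found": 0, "not supported": 0}


def to_profile(assertions, some_evidence="missing"):
    """Encode assertions as 1/0; 'some evidence' becomes None (missing)
    or is lumped with 'Yes' when some_evidence == "yes"."""
    profile = []
    for state in assertions:
        if state == "some evidence":
            profile.append(None if some_evidence == "missing" else 1)
        else:
            profile.append(ENCODING[state])
    return profile


def agreement(p, q):
    """Fraction of genomes, excluding missing data, where profiles match."""
    pairs = [(a, b) for a, b in zip(p, q) if a is not None and b is not None]
    if not pairs:
        return None
    return sum(a == b for a, b in pairs) / len(pairs)


# Toy assertions for two properties across the same five genomes.
states_a = ["Yes", "No", "some evidence", "none found", "Yes"]
states_b = ["Yes", "not supported", "Yes", "Yes", "Yes"]

profile_b = to_profile(states_b)
score_missing = agreement(to_profile(states_a, "missing"), profile_b)
score_lumped = agreement(to_profile(states_a, "yes"), profile_b)
```

Comparing the two scores shows how the choice of treatment for ‘some evidence’ changes the apparent similarity of two profiles.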

In principle, analysis in terms of genome properties should make phylogenetic profiling more robust because the signal represented by a property is the aggregate of all of its components and therefore provides a less noisy profile. The phylogenetic profile of an individual component, say an enzyme carrying out a particular step in a pathway, may be noisy due to several issues. For instance, the component may be involved in more than one process, each of which has a separate and distinct phylogenetic profile. A component may exist as two or more functionally equivalent but non-orthologous families (Koonin et al., 1996), and the profiles of these families individually will represent only a subset of the whole phylogenetic range of the underlying biological process.
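The effect of non-orthologous families on a profile can be seen in a toy example. In this sketch, which uses invented profiles and a hypothetical helper name, each family alone covers only part of the phylogenetic range, whereas the component-level profile is the position-wise union of the family profiles.

```python
def combine_families(*profiles):
    """Position-wise OR of the 1/0 profiles of alternative families."""
    return [int(any(column)) for column in zip(*profiles)]


# Invented profiles for two families carrying out the same step
# across the same six genomes.
family_1 = [1, 1, 0, 0, 0, 1]
family_2 = [0, 0, 1, 1, 0, 0]
component = combine_families(family_1, family_2)
```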

Relationships between genome properties

Many pairs of genome properties are strongly correlated. For example, both histidine and tryptophan must be synthesized if they cannot be imported, and environments typically allow import of both or neither. Among genomes analyzed to date, only six genera (of 73 total) contain species that break this rule. Plant and animal pathogens typically exploit rich environments that make de novo biosynthesis of both histidine and tryptophan unnecessary. A second correlation involves genome size: organisms with larger genomes are generally expected to have more genes in most functional categories, with the number of genes scaling differently with genome size according to the category (van Nimwegen, 2003). Secondary and redundant capabilities, including biosynthetic, catabolic and transport systems, would be expected to be present more often in larger genomes, and in general this correlation is observed in the Genome Properties dataset.

We find a strong positive correlation of many genome properties with DNA GC content. Some differences in GC content follow major phylogenetic divisions, such as between the Actinobacteria (high-GC) and the Firmicutes (low-GC). However, large differences in GC content also occur within the various lineages, such as within the Gammaproteobacteria, the Actinobacteria, the Euryarchaeota or the Spirochaetes. In several lineages, and for all prokaryotic genomes taken together, the smallest genomes tend to be AT-rich and the largest genomes GC-rich. Figure 3 shows the relationship among DNA size (megabases), DNA GC content and histidine biosynthesis from PRPP. None of the genera that are above the median for both size and GC content lacks the ability to make histidine, while a majority of species that are below both medians are unable to do so. The species that synthesize histidine and those that do not are both phylogenetically diverse. These trends seem consistent with the relationships among GC-to-AT transition bias for point mutations, low GC content, gene loss and small genome size found in a study of Rickettsia (Andersson and Andersson, 1999). Biosynthetic pathways for various essential amino acids (including histidine) and enzyme cofactors are nevertheless found in three Buchnera aphidicola genomes, despite their extremely low GC content and small genome size. In this case, it appears that these aphid endosymbionts retain these abilities to benefit their insect hosts (Clark et al., 1998) while having shed much of the rest of the genetic capacity of their free-living bacterial ancestors.
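The quadrant analysis behind Figure 3 can be sketched as follows. The function name and the data points are invented for illustration and do not reproduce the values plotted in the figure; points that fall exactly on a median are counted here as not above it, which is an arbitrary choice.

```python
from statistics import median


def quadrant_fractions(genomes):
    """Fraction of histidine biosynthesizers in each quadrant.

    genomes: list of (size_mb, gc_percent, makes_histidine).
    Keys of the result are (size above median, GC above median).
    """
    size_median = median(g[0] for g in genomes)
    gc_median = median(g[1] for g in genomes)
    counts = {}
    for size, gc, makes_his in genomes:
        key = (size > size_median, gc > gc_median)
        yes, total = counts.get(key, (0, 0))
        counts[key] = (yes + int(makes_his), total + 1)
    return {key: yes / total for key, (yes, total) in counts.items()}


# Invented points: (genome size in Mb, GC content in %, makes histidine).
toy_genomes = [
    (1.0, 30, False), (1.2, 35, False), (2.0, 40, True),
    (4.0, 60, True), (5.0, 65, True), (6.0, 70, True),
]
fractions = quadrant_fractions(toy_genomes)
```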

CONCLUSION

The Genome Properties system contains a rich and varied collection of biological characterizations for completely sequenced prokaryotic genomes. We present a paradigm in which standard methods of sequence analysis, including but not limited to TIGRFAMs and Pfam HMM scoring, produce evidence that is stored in relational database tables. Rules weigh the evidence automatically and detect pathways and other features accordingly. Curation includes manual finishing of property assignments where rules cannot capture all the particulars for a species. The curation process generates feedback that leads to the improvement of the protein identification models on which the rules are based, as well as improvements in annotation accuracy, completeness and information content. The inclusion of metrics such as GC content, phylogenetic and other non-metabolic properties expands the value of the data for biological studies of individual prokaryotes as well as for comparative genomics.

We encourage members of the scientific community to contact us through our website (http://www.tigr.org/Genome_Properties) to suggest new Genome Properties that may be of particular interest to their research or to add manually curated data to existing properties.

Table 1

A subset of the Genome Properties evaluated for the genome of H.influenzae KW20 Rd

Property Typea Assertion 
Biological niche  
    Oxygen requirement Facultative anaerobic 
    Optimal growth temperature (°C) 37 
    Temperature environment Mesophilic 
Cell surface component  
    Capsule No 
    Outer membrane Yes 
    Peptidoglycan (murein) Yes 
Metabolism  
    Biosynthesis  
        Amino acid biosynthesis  
            Arginine via ornithine Yes 
            Glycine cleavage system none found 
            Histidine from ribose-5-phosphate Yes 
            Isoleucine from threonine and pyruvate Yes 
            Leucine from pyruvate and acetyl-CoA Yes 
            Lysine Yes 
                Lysine via alpha-aminoadipate (AAA pathway) none found 
                Lysine from aspartate semialdehyde, acylated branch Yes 
            Phenylalanine from chorismate Yes 
            Proline from glutamate Yes 
            Threonine from homoserine Yes 
            Tryptophan from chorismate and PRPP Yes 
            Tyrosine from chorismate Yes 
            Valine from pyruvate Yes 
        Chorismate via shikimate Yes 
        Cofactor biosynthesis  
            Biotin Yes 
            Coenzyme A from pantothenate Yes 
            Coenzyme PQQ none found 
            Glutathione none found 
            Menaquinone Yes 
            NAD(P) from L-aspartate and DHAP none found 
            Pantothenate from aspartate and 2-oxoisovalerate No 
            Tetrahydrofolate from GTP and PABA Yes 
            Ubiquinone from chorismate, aerobic none found 
        Glycine betaine from choline none found 
        Isopentenyl pyrophosphate (IPP) Yes 
            IPP via mevalonate No 
            IPP via deoxyxylulose Yes 
        Nucleotide biosynthesis  
            Purine (inosine-5′-P from ribose-5-phosphate) Yes 
            Pyrimidine (uridine-5′-P de novo) not supported 
        Protein biosynthesis  
            Glu-tRNA(Gln) amidation none found 
            Ribosome, bacterial, large subunit Yes 
            Ribosome, bacterial, small subunit Yes 
            Selenocysteine Yes 
        Energy metabolism  
            Glyoxalate shunt No 
            Electron transport  
                NADH dehydrogenase I No 
                Nickel-dependent hydrogenase none found 
                Rnf-type electron transport complex Yes 
            F1/F0 ATPase Yes 
            Methanogenesis No 
            Pentose phosphate cycle Yes 
            Photosynthesis No 
            TCA cycle Partialb 
Quantitative content  
    Amino acid abundance (order)c LAIVGEKSTDNQRFPYMHWC 
    Percent A (alanine)d 8.21 
    Count of DNA molecules 
    Count of predicted proteins 1739 
    Count of tRNAs 58 
    DNA GC content (%) 38.15 
    DNA size (megabases) 1.83 
    Protein average length 300 
Transport  
    Protein transport  
        Type I secretion none found 
        Type II secretion none found 
        Type III secretion No 
        Type IV secretion nd 
        Tat (Sec-independent) protein export Yes 
    Small molecule transport  
        K+-transporting ATPase KdpFABC none found 
        Na+-translocating NADH-quinone reductase Yes 
        Phosphate ABC transporter (pstSCAB-phoU) Yes 
        PTS transport system Yes 
        Sulfate/thiosulfate ABC transporter none found 
        TRAP-T (tripartite ATP-independent periplasmic transporters) Yes 
            count of TRAP transporter clusters 
Virulence  
    Animal pathogen nd 
    Human pathogen attenuatede 
    Plant pathogen No 
    Spore formation No 

Concise summaries of the properties of prokaryotes such as these allow Genome Properties to be utilized as a kind of prokaryotic encyclopedia.

nd, not determined.

aProperty types: C, category; P, phenotypic; and R, rules-based.

bThe state ‘Partial’ for the TCA cycle indicates the presence of a subset of the entire cycle consisting of a single linear pathway.

cThe property amino acid abundance (order) presents a 20-character string representing the 20 amino acids in descending order of abundance.

dGenome Properties contains properties for all 20 amino acids; only one is included here for clarity.

eH.influenzae KW20 Rd is derived from a virulent strain that has lost certain virulence factors. Phenotypic Genome Properties refer narrowly to the sequenced strain and do not necessarily reflect general properties of a species.

Table 2

Genome Properties content data

Genomes (distinct genera) 145 (79) 
Properties 172 
Properties assigned by rules 77 
Property assertions 17 534 
Rules-based assertions and calculated values 15 305 
Distinct HMMs used as evidence for rules 656 
‘Yes’ assertions 6 306 (42%) 
‘No’, ‘none found’ or ‘not supported’ assertions 7 168 (48%) 
‘some evidence’ assertions 1 532 (10%) 
ORFs and other genomic features linked to Genome Properties 40 206 
Average number of features linked per genome analyzed 277 

Fig. 1

Examples of Genome Properties interface views. (A) The Genome Properties query page. (B) A section of the definition display page for the ‘sulfate reduction to sulfide, assimilatory’ property. Hyperlinks are provided to PubMed, EcoCyc, KEGG, GO and other Genome Properties definition display pages. Not shown are sections detailing the components of the property and the evidence used to detect them. (C) A section of a detailed results page for the ‘chorismate biosynthesis via shikimate’ property evaluated for a strain of E.coli. Hyperlinks are provided to the CMR genome page, Genome Properties definition display page, CMR gene pages and CMR HMM profile pages. Not shown are sections displaying the genomic regions of each gene or cluster of genes.

Fig. 2

HMM protein identification statistics. This figure shows TIGRFAMs protein identification statistics for complete bacterial genomes that have recently been annotated and published at TIGR. The first (cross-hatched) bar of the histogram reflects the percentage of genes for each genome that receive a specific function by manual curation. The second (gray) bar reflects the percentage of ORFs that have equivalog HMM scores greater than the trusted cutoff for that HMM and that represent a correct functional assignment.

Table 3

Selenocysteine property evaluated for the H.influenzae genome (state = Yes)

Component Evidence type Evidence Locus Annotation 
Seryl-tRNA(sec) selenium transferase HMM TIGR00474 (equivalog) HI0708 l-seryl-tRNA selenium transferase 
Selenocysteine-specific elongation factor HMM TIGR00475 (equivalog) HI0709 Selenocysteine-specific elongation factor 
Selenocysteine-specific tRNA (tRNA-SeC) tRNAscan-SE Anticodon for UGA tRNA-SeC(p) tRNA-SeC 
Selenium donor protein HMM TIGR00476 (equivalog) HI0200 Selenophosphate synthetase 
SeCys-containing example Gene attribute Manual assignment of SeCys (UGA) codon HI0006 Formate dehydrogenase, alpha subunit 
   HI0200 Selenophosphate synthetase 

Table 4

Validation of Genome Properties from complementary pathways

Properties Count of distinct genera 
 Complementary Both Neither 
ATPase (F0/F1 versus V-type) 74 
IPP (deoxyxylulose versus mevalonate pathways) 74 5a 
Lysine (DAP versus AAA pathways) 71 7b 

aBlochmannia and Buchnera (insect symbiont), Bifidobacterium (gastrointestinal tract commensal), Mycoplasma and Rickettsia (obligate intracellular).

bBorrelia, Mycoplasma, Treponema and Tropheryma (obligate intracellular), Streptococcus agalactiae, mitis, pyogenes, Fusobacterium (oral commensal), Halobacterium (see Supplementary Table S2).

Table 5

Comparison of Genome Properties with KEGG

Property Genome Properties and KEGG would assert the same statea (of 74 genera) Genes apparently missingb 
  Both KEGG only Genome Properties only 
Purine biosynthesis (IMP from ribose-5-phosphate) 66 (89%) 10 
Histidine biosynthesis 71 (96%) 
Selenocysteine incorporation systemc 72 (97%) 
F1/F0 ATPase 73 (99%) 

aStates are not asserted by KEGG. Here, we apply the same method as used by Genome Properties to the set of genes found by KEGG. Components flagged as ‘Not required’ by Genome Properties (i.e. genes whose detection by both KEGG and Genome Properties is significantly impaired) were not included in these comparisons. In every case of discrepancy listed here Genome Properties would assert ‘Yes’ and KEGG, missing the identification of one or more components, would assert ‘some evidence’.

bThese are unidentified components in which the majority of the components of the property appear to be present in at least one of the two systems.

cKEGG does not include SelB, SelC or an example of a selenocysteine-containing protein in its description of this system, so only the common components, SelA and SelD were compared.

Table 6

A comparison of chorismate-associated properties across genomes

 Species and taxa showing indicated pattern of property states 
Properties Most γ-proteobacteria Bacilli, Actinobacteria, Bacteroides, Chlorobium, Haemophilus α-, β-Proteobacteria Pseudomonas Bifidobacterium, Clostridium, Streptococcus Archaea, α-, ε-proteobacteria, Deinococcus, B.halodurans, Thermoanaerobacter, Aquifex, Pirellula, Streptomyces, Thermotoga, Xanthomonadales Aeropyrum pernix, Pyrococcus abyssi Chlamydia Spirochaetes, Mycoplasma, Rickettsia, Wigglesworthia, Pyrococcus horikoshii 
Chorismate biosynthesis Yes Yes Yes Yes Yes Yes Yes noa 
Tryptophan biosynthesis Yes Yes Yes Yes Yes Yes no no 
Phenylalanine biosynthesis Yes Yes Yes Yes Yes no no no 
Tyrosine biosynthesis Yes Yes Yes Yes Yes no no no 
Menaquinone biosynthesis Yes Yes no no no no no no 
Ubiquinone biosynthesisb Yes no Yes no some evidence some evidence some evidence no 

a‘no’ indicates any of the ‘No’, ‘none found’ or ‘not supported’ states.

b‘some evidence’ states are prominent for this property because the pathway has not been well-characterized outside of the proteobacteria.

Fig. 3

A plot of GC content versus genome size showing the distribution of species possessing the ability to make histidine de novo. Each point represents average values for a single genus except where members of the genus vary with respect to the histidine biosynthesis property. A single exception is the two species of Treponema (pallidum and denticola) that have extremely different values for both GC content and genome size. Vertical and horizontal dashed lines represent median values for all the points, and the percentage of histidine biosynthesizers is recorded for the four quadrants thus produced.

The authors wish it to be known that, in their opinion, the first two authors should be regarded as joint First Authors.

We would like to thank Tanja Davidsen for her support in integrating Genome Properties into the Comprehensive Microbial Resource. This work was supported in part by NSF grant DBI-0110270 and DOE grant DE-FG02-01ER63203.

REFERENCES

Andersson, J.O. and Andersson, S.G. (1999) Genome degradation is an ongoing process in Rickettsia. Mol. Biol. Evol., 16, 1178–1191.
Bateman, A., Coin, L., Durbin, R., Finn, R.D., Hollich, V., Griffiths-Jones, S., Khanna, A., Marshall, M., Moxon, S., Sonnhammer, E.L. (2004) The Pfam protein families database. Nucleic Acids Res., 32, D138–D141.
Boeckmann, B., Bairoch, A., Apweiler, R., Blatter, M.C., Estreicher, A., Gasteiger, E., Martin, M.J., Michoud, K., O'Donovan, C., Phan, I., Pilbout, S., Schneider, M. (2003) The SWISS-PROT protein knowledgebase and its supplement TrEMBL in 2003. Nucleic Acids Res., 31, 365–370.
Bolotin, A., Wincker, P., Mauger, S., Jaillon, O., Malarme, K., Weissenbach, J., Ehrlich, S.D., Sorokin, A. (2001) The complete genome sequence of the lactic acid bacterium Lactococcus lactis ssp. lactis IL1403. Genome Res., 11, 731–753.
Clark, M.A., Baumann, L., Baumann, P. (1998) Buchnera aphidicola (Aphid endosymbiont) contains genes encoding enzymes of histidine biosynthesis. Curr. Microbiol., 37, 356–358.
Delcher, A.L., Harmon, D., Kasif, S., White, O., Salzberg, S.L. (1999) Improved microbial gene identification with GLIMMER. Nucleic Acids Res., 27, 4636–4641.
Farabaugh, P.J. (2000) Translational frameshifting: implications for the mechanism of translational frame maintenance. Prog. Nucleic Acid Res. Mol. Biol., 64, 131–170.
Finkel, S.E. and Kolter, R. (2001) DNA as a nutrient: novel role for bacterial competence gene homologs. J. Bacteriol., 183, 6288–6293.
Fitch, W.M. (1970) Distinguishing homologous from analogous proteins. Syst. Zool., 19, 99–113.
Fleischmann, R.D., Adams, M.D., White, O., Clayton, R.A., Kirkness, E.F., Kerlavage, A.R., Bult, C.J., Tomb, J.F., Dougherty, B.A., Merrick, J.M., et al. (1995) Whole-genome random sequencing and assembly of Haemophilus influenzae Rd. Science, 269, 496–512.
Fleischmann, R.D., Alland, D., Eisen, J.A., Carpenter, L., White, O., Peterson, J., DeBoy, R., Dodson, R., Gwinn, M., Haft, D., et al. (2002) Whole-genome comparison of Mycobacterium tuberculosis clinical and laboratory strains. J. Bacteriol., 184, 5479–5490.
Graupner, M. and White, R.H. (2001) Methanococcus jannaschii generates l-proline by cyclization of l-ornithine. J. Bacteriol., 183, 5203–5205.
Haft, D.H., Loftus, B.J., Richardson, D.L., Yang, F., Eisen, J.A., Paulsen, I.T., White, O. (2001) TIGRFAMs: a protein family resource for the functional identification of proteins. Nucleic Acids Res., 29, 41–43.
Haft, D.H., Selengut, J.D., White, O. (2003) The TIGRFAMs database of protein families. Nucleic Acids Res., 31, 371–373.
Harris, M.A., Clark, J., Ireland, A., Lomax, J., Ashburner, M., Foulger, R., Eilbeck, K., Lewis, S., Marshall, B., Mungall, C. (2004) The Gene Ontology (GO) database and informatics resource. Nucleic Acids Res., 32, D258–D261.
Hulo, N., Sigrist, C.J., Le Saux, V., Langendijk-Genevaux, P.S., Bordoli, L., Gattiker, A., De Castro, E., Bucher, P., Bairoch, A. (2004) Recent improvements to the PROSITE database. Nucleic Acids Res., 32, D134–D137.
Kalinowski, J., Bathe, B., Bartels, D., Bischoff, N., Bott, M., Burkovski, A., Dusch, N., Eggeling, L., Eikmanns, B.J., Gaigalat, L. (2003) The complete Corynebacterium glutamicum ATCC 13032 genome sequence and its impact on the production of l-aspartate-derived amino acids and vitamins. J. Biotechnol., 104, 5–25.
Kanehisa, M., Goto, S., Kawashima, S., Okuno, Y., Hattori, M. (2004) The KEGG resource for deciphering the genome. Nucleic Acids Res., 32, D277–D280.
Karp, P.D., Paley, S., Romero, P. (2002) The Pathway Tools software. Bioinformatics, 18 (Suppl. 1), S225–S232.
Karp, P.D., Riley, M., Paley, S.M., Pellegrini-Toole, A. (2002) The MetaCyc Database. Nucleic Acids Res., 30, 59–61.
Karp, P.D., Riley, M., Saier, M., Paulsen, I.T., Collado-Vides, J., Paley, S.M., Pellegrini-Toole, A., Bonavides, C., Gama-Castro, S. (2002) The EcoCyc Database. Nucleic Acids Res., 30, 56–58.
Koonin, E.V., Mushegian, A.R., Bork, P. (1996) Non-orthologous gene displacement. Trends Genet., 12, 334–336.
Lowe, T.M. and Eddy, S.R. (1997) tRNAscan-SE: a program for improved detection of transfer RNA genes in genomic sequence. Nucleic Acids Res., 25, 955–964.
(Ed.) (1999) Biochemical Pathways: An Atlas of Biochemistry and Molecular Biology. John Wiley and Sons, NY.
McGarvey, P.B., Huang, H., Barker, W.C., Orcutt, B.C., Garavelli, J.S., Srinivasarao, G.Y., Yeh, L.S., Xiao, C., Wu, C.H. (2000) PIR: a new resource for bioinformatics. Bioinformatics, 16, 290–291.
Osterman, A. and Overbeek, R. (2003) Missing genes in metabolic pathways: a comparative genomics approach. Curr. Opin. Chem. Biol., 7, 238–251.
Overbeek, R., Fonstein, M., D'Souza, M., Pusch, G.D., Maltsev, N. (1999) The use of gene clusters to infer functional coupling. Proc. Natl. Acad. Sci. USA, 96, 2896–2901.
Overbeek, R., Larsen, N., Pusch, G.D., D'Souza, M., Selkov, E., Jr, Kyrpides, N., Fonstein, M., Maltsev, N., Selkov, E. (2000) WIT: integrated system for high-throughput genome sequence analysis and metabolic reconstruction. Nucleic Acids Res., 28, 123–125.
Overbeek, R., Larsen, N., Walunas, T., D'Souza, M., Pusch, G., Selkov, E., Jr, Liolios, K., Joukov, V., Kaznadzey, D., Anderson, I. (2003) The ERGO genome analysis and discovery system. Nucleic Acids Res., 31, 164–171.
Pellegrini, M., Marcotte, E.M., Thompson, M.J., Eisenberg, D., Yeates, T.O. (1999) Assigning protein functions by comparative genome analysis: protein phylogenetic profiles. Proc. Natl. Acad. Sci. USA, 96, 4285–4288.
Peterson, J.D., Umayam, L.A., Dickinson, T., Hickey, E.K., White, O. (2001) The Comprehensive Microbial Resource. Nucleic Acids Res., 29, 123–125.
Ren, Q., Kang, K.H., Paulsen, I.T. (2004) TransportDB: a relational database of cellular membrane transport systems. Nucleic Acids Res., 32, D284–D288.
Roten, C.A., Gamba, P., Barblan, J.L., Karamata, D. (2002) Comparative Genometrics (CG): a database dedicated to biometric comparisons of whole genomes. Nucleic Acids Res., 30, 142–144.
Stadtman, T.C. (1987) Specific occurrence of selenium in enzymes and amino acid tRNAs. FASEB J., 1, 375–379.
Tatusov, R.L., Fedorova, N.D., Jackson, J.D., Jacobs, A.R., Kiryutin, B., Koonin, E.V., Krylov, D.M., Mazumder, R., Mekhedov, S.L., Nikolskaya, A.N. (2003) The COG database: an updated version includes eukaryotes. BMC Bioinformatics, 4, 41.
Tekaia, F., Yeramian, E., Dujon, B. (2002) Amino acid composition of genomes, lifestyles of organisms, and evolutionary trends: a global picture with correspondence analysis. Gene, 297, 51–60.
van Nimwegen, E. (2003) Scaling laws in the functional content of genomes. Trends Genet., 19, 479–484.
Wheeler, D.L., Church, D.M., Edgar, R., Federhen, S., Helmberg, W., Madden, T.L., Pontius, J.U., Schuler, G.D., Schriml, L.M., Sequeira, E. (2004) Database resources of the National Center for Biotechnology Information: update. Nucleic Acids Res., 32, D35–D40.
