FunTree is a new resource that brings together sequence, structure, phylogenetic, chemical and mechanistic information for structurally defined enzyme superfamilies. Gathering this range of data into a single resource allows the investigation of how novel enzyme functions have evolved within a structurally defined superfamily, as well as providing a means to analyse trends across many superfamilies. This is done not only within the context of the enzymes' sequences and structures but also the relationships between their reactions. Developed in tandem with the CATH database, it currently comprises 276 superfamilies covering ∼1800 (70%) of sequence assigned enzyme reactions. Central to the resource are phylogenetic trees generated from structurally informed multiple sequence alignments using both domain structural alignments supplemented with domain sequences and whole sequence alignments based on commonality of multi-domain architectures. These trees are decorated with functional annotations such as metabolite similarity as well as annotations from manually curated resources such as the Catalytic Site Atlas and MACiE for enzyme mechanisms. The resource is freely available through a web interface: www.ebi.ac.uk/thornton-srv/databases/FunTree.
The majority of chemical reactions known to occur in biology appear to have been created by the modulation of an existing reaction through the evolution of the enzyme responsible. To begin to understand in detail how enzymes have evolved new functions requires the combination of protein 3D structure, sequence, phylogenetic, chemical and mechanistic data. This combination of information is crucial given the continual flood of data from structural genomic projects, since insights into the evolution of enzyme function provide one of the best routes for predicting functions of uncharacterized enzymes (1). Current resources either provide details on just a subsection of this combination of data or advance extensive detailed analysis on a relatively small number of enzyme superfamilies (2–5).
In order to address this challenge, we have developed a resource that brings together manually curated data from the CATH (6) classification of domains from protein structures, sequences from UniProtKB (7) and CATH-Gene3D (8), as well as functional and chemical information from a variety of sources including the manually curated MACiE (9) and Catalytic Site Atlas (CSA) (10) databases. The data are presented through phylogenetic analysis and are combined with the examination of relationships between metabolites obtained by exploiting tools for comparing small molecules.
THE FUNTREE PIPELINE
Protein domains, structurally defined by CATH, that are identified as having an enzyme function are selected using the MACiE database. This identifies, through careful manual annotation, the location of the residues involved in the enzyme mechanism. FunTree processes the superfamilies of domains that have the active site residues located within the domain. The workflow by which data are collected, processed and presented is shown in Figure 1. Recent studies have highlighted the problems of relying on functional annotations, especially those generated by automated methods (11,12), thus FunTree only uses sequences with functional annotations from the reviewed section of UniProtKB or where a functional annotation is made on deposition of structural data.
Changes to an enzyme's function can arise from modifications of a single domain or from a change to the combination of domains making up the complete protein sequence. To capture both factors, we generate two types of cluster based on either the superfamily domain or the complete protein sequence:
Structurally similar groups
Protein domain superfamilies can show considerable sequence and structural diversity outside of the common structural core. This makes it difficult to effectively superimpose all domains within some superfamilies. Thus, we grouped non-redundant domains with <35% sequence identity to all other members of the cluster, whose structures could be aligned by CORA (13) and superimposed using the McLachlan algorithm (14) as implemented in the program Profit (Martin,A.C.R. and Porter,C.T.; http://www.bioinf.org.uk/software/profit/) with a root mean squared deviation of <9 Å, to generate multiple structure alignments. These clusters are described as structurally similar groups (SSGs) and are subsequently populated with sequence relatives. These are collected from CATH-Gene3D (a resource which contains sequences for all known and predicted domains in 1867 genomes) and are assigned to one of the SSGs using BLASTp (15) to scan against the sequences of known structural domains. BLASTp is used because it is very fast and can scan through the vast numbers of sequences in CATH-Gene3D. Once assigned, the sequence is aligned to the profile of the structurally informed sequence alignment using FUGUALI [part of the FUGUE (16) software]. The resulting robust structurally informed sequence alignments are used to undertake the phylogenetic analysis.
A protein can be made up of one or more domains that may contribute to its overall function (17). For each sequence in FunTree, the multi-domain architecture (MDA) is assigned by considering the order of known or predicted structural domains mapped to the sequence. Domain structure assignments are taken from CATH-Gene3D by initially scanning the sequence against Markov models built from CATH domains. Regions of sequences that are unassigned and are large enough to be considered as a domain are checked against the PFam database (18) and, if a non-overlapping PFam domain is found, it is included in the MDA. Subsequently, we group together proteins within a superfamily that share the same domains in the same order along the sequence, i.e. the same MDA. Grouping of MDAs is carried out by ArchSchema (19), which also visualizes the relationships between MDAs as a directed graph. For each superfamily, entire protein sequences which share the same MDA are aligned using MAFFT (20). The alignments generated are used to perform the phylogenetic analysis of the MDA clusters.
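As a minimal sketch of the grouping criterion described above (same domains in the same order), the following Python groups proteins by their ordered domain list. It is not the ArchSchema implementation; the function name `group_by_mda`, the protein identifiers and the domain assignments are hypothetical and chosen only for illustration.

```python
from collections import defaultdict


def group_by_mda(domain_assignments):
    """Group protein identifiers by multi-domain architecture (MDA).

    domain_assignments maps a protein identifier to its list of domain
    identifiers ordered from N- to C-terminus. Two proteins share an MDA
    only if they carry the same domains in the same order.
    """
    groups = defaultdict(list)
    for protein_id, domains in domain_assignments.items():
        # A tuple is hashable and preserves domain order.
        groups[tuple(domains)].append(protein_id)
    return dict(groups)


# Hypothetical assignments (identifiers are illustrative only).
example = {
    "protA": ["3.40.50.300", "1.10.8.60"],
    "protB": ["3.40.50.300", "1.10.8.60"],
    "protC": ["1.10.8.60", "3.40.50.300"],  # same domains, different order
    "protD": ["3.40.50.300"],
}
mda_groups = group_by_mda(example)
```

Here protA and protB fall into one group, whereas protC forms its own group because the domain order differs.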
We perform phylogenetic analysis on both the SSG and the MDA alignments. However, some enzyme superfamilies can be very large, with thousands to tens of thousands of sequences, which makes both aligning all the sequences and conducting the phylogenetic analysis difficult. In order to overcome this, sequences are first filtered by taxonomic lineage and uniqueness of function. This removes sequences that share the same genus and the same function (i.e. E.C. number), keeping, for simplicity, the first occurrence as the single representative. If there are still many thousands of sequences left, a stricter filter is applied at the kingdom level. In both cases, however, if a sequence has a function annotated that has not been previously seen for that taxonomic rank, then the sequence is included.
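The filtering rule can be sketched as follows. This is an illustrative reading of the description above, not the FunTree code: the record layout (dicts with `id`, `ec`, `genus` and `kingdom` keys), the function name `filter_representatives` and the example records are assumptions, and each sequence is assumed to carry a single E.C. number.

```python
def filter_representatives(sequences, rank="genus"):
    """Keep the first sequence seen for each (taxon, E.C. number) pair.

    sequences is an iterable of dicts with the keys 'id' and 'ec' plus a
    lineage entry for the chosen rank (e.g. 'genus' or 'kingdom').
    A sequence whose function is new for its taxon is always kept.
    """
    seen = set()
    kept = []
    for seq in sequences:
        key = (seq[rank], seq["ec"])
        if key not in seen:
            seen.add(key)
            kept.append(seq["id"])
    return kept


# Hypothetical records for illustration.
records = [
    {"id": "s1", "genus": "Escherichia", "kingdom": "Bacteria", "ec": "3.6.1.3"},
    {"id": "s2", "genus": "Escherichia", "kingdom": "Bacteria", "ec": "3.6.1.3"},
    {"id": "s3", "genus": "Escherichia", "kingdom": "Bacteria", "ec": "2.7.4.3"},
    {"id": "s4", "genus": "Bacillus", "kingdom": "Bacteria", "ec": "3.6.1.3"},
]
by_genus = filter_representatives(records, rank="genus")
by_kingdom = filter_representatives(records, rank="kingdom")
```

With these records the genus-level filter drops only s2, while the stricter kingdom-level filter also drops s4.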
For both the SSG and MDA alignments, phylogenetic trees are generated using the TreeBest software [as described in the methods for compiling the TreeFam database (21)]. The method uses species relationships to guide the tree building, thus a taxonomic tree for those sequences in the alignment is generated using the species relationships as defined by the NCBI taxonomic database (22).
By systematically traversing the tree, it can be simplified by collapsing nodes whose branches have a commonality in their annotation. For the purposes of this study, we define commonality at the subsubclass (third level) of the four-level E.C. classification, which broadly can act as a proxy for a change in general chemistry. Thus, nodes in the pruned tree correspond to different reaction chemistries. This collapsed version of the tree is also generated and presented.
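One way to picture the collapsing step is a recursive traversal that merges any subtree whose leaves all share the same third-level E.C. class. The sketch below assumes a simple nested-dict tree representation and fully specified four-field E.C. numbers; the function names and the example tree are hypothetical and do not reproduce the in-house FunTree software.

```python
def ec_level3(ec):
    """Return the first three fields of a four-field E.C. number."""
    return ".".join(ec.split(".")[:3])


def collapse(node):
    """Collapse subtrees whose leaves all share one third-level E.C. class.

    A leaf is a dict with an 'ec' key; an internal node is a dict with a
    'children' list. Returns a new tree; the input is left unchanged.
    """
    if "children" not in node:
        return {"label": ec_level3(node["ec"]), "ecs": [node["ec"]]}
    children = [collapse(child) for child in node["children"]]
    # Uncollapsed internal children have no 'label' and contribute None.
    labels = {child.get("label") for child in children}
    if len(labels) == 1 and None not in labels:
        # Every child collapsed to the same class: merge into one node.
        ecs = sorted({ec for child in children for ec in child["ecs"]})
        return {"label": labels.pop(), "ecs": ecs}
    return {"children": children}


# Hypothetical tree for illustration.
tree = {"children": [
    {"children": [{"ec": "1.1.1.1"}, {"ec": "1.1.1.2"}]},
    {"children": [{"ec": "1.1.1.1"}, {"ec": "2.7.1.1"}]},
]}
collapsed = collapse(tree)
```

The first clade collapses to a single node labelled 1.1.1 that lists both full E.C. numbers; the second clade keeps two branches because it spans two different third-level classes.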
Functional data, in the form of E.C. classifications (23), are collected from either annotations from the reviewed section of UniProtKB or, if present, from the annotations made in the deposition of the protein structure. These identifiers are then used, via KEGG (24), to collect the reaction performed and the small molecules used by the enzyme. All the small molecules within the superfamily and within each SSG/MDA group are compared with each other using the Small Molecule Subgraph Detector (SMSD) toolkit (25) to generate an all-by-all comparison matrix for each case. The metabolites are clustered using PVCLUST (26), implemented in the R statistical package, and the results are rendered as a similarity tree using software developed in-house.
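The shape of an all-by-all comparison matrix can be illustrated with a short sketch. SMSD compares molecular graphs; here a Tanimoto coefficient on made-up feature sets is swapped in purely as a stand-in similarity, and the function names and metabolite features are assumptions for illustration.

```python
def tanimoto(a, b):
    """Tanimoto coefficient between two feature sets (1.0 = identical)."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def all_by_all(molecules, similarity=tanimoto):
    """Return (names, matrix) with matrix[i][j] = similarity of i and j."""
    names = sorted(molecules)
    matrix = [[similarity(molecules[x], molecules[y]) for y in names]
              for x in names]
    return names, matrix


# Hypothetical feature sets; real input would be molecular graphs.
metabolites = {
    "ATP": {"adenine", "ribose", "phosphate", "triphosphate"},
    "ADP": {"adenine", "ribose", "phosphate", "diphosphate"},
    "glucose": {"hexose", "hydroxyl"},
}
names, sim = all_by_all(metabolites)
```

A symmetric matrix of this kind (converted to distances if required) is the sort of input a hierarchical clustering step would consume.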
Data presentation and navigation
FunTree data are presented through a publicly accessible website (http://www.ebi.ac.uk/thornton-srv/databases/FunTree) at the level of the superfamily, the structurally similar groups (SSGs) and the MDA groups. The website can be searched by superfamily or small molecule names and synonyms as well as specific superfamily, sequence, structure, E.C. or small molecule identifiers. In addition, the data can be browsed by superfamily, E.C. code, structure or metabolites. SSG and MDA groups provide different views of the superfamily data, i.e. in terms of structural similarity and similarity of domain composition, respectively. An SSG may contain domain relatives in different MDAs, and conversely a given MDA may be present in one or more SSGs. To navigate between the various groups and show how they relate, FunTree displays a simple bifurcated graph with the two branches representing the division between SSGs and MDAs. Clicking on a particular SSG branch highlights the MDA branches that contain members of the SSG; conversely, clicking on an MDA branch highlights the SSG branches that have members belonging to this MDA (Figure 2).
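The many-to-many relationship between SSGs and MDAs that drives this highlighting can be sketched as a pair of lookup tables. The record layout, the function name `build_links` and the group labels below are hypothetical; this is not the website's own code.

```python
from collections import defaultdict


def build_links(memberships):
    """Index SSG<->MDA relationships from (domain, ssg, mda) records."""
    ssg_to_mda = defaultdict(set)
    mda_to_ssg = defaultdict(set)
    for _domain, ssg, mda in memberships:
        ssg_to_mda[ssg].add(mda)
        mda_to_ssg[mda].add(ssg)
    return ssg_to_mda, mda_to_ssg


# Hypothetical memberships: each domain belongs to one SSG and one MDA.
memberships = [
    ("dom1", "SSG1", "MDA1"),
    ("dom2", "SSG1", "MDA2"),
    ("dom3", "SSG2", "MDA2"),
]
ssg_to_mda, mda_to_ssg = build_links(memberships)
```

Selecting SSG1 would then highlight MDA1 and MDA2, and selecting MDA2 would highlight SSG1 and SSG2.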
On the top level page describing the superfamily, the following data are presented: a summary of statistics such as the sequence diversity as measured by ScoreCons (27) and the average SSAP (28) scores of the domain structures; a similarity tree of the small molecules; an ArchSchema graph of the MDAs; and a representation of the E.C. hierarchy (an ‘E.C. wheel’) showing which E.C. numbers are present in the superfamily. At the SSG or MDA level, the page shows the ‘grouping-specific’ general statistics, a similarity tree of the small molecules, an E.C. wheel, a phylogenetic tree, the annotated alignment used to build the phylogenetic tree and a collapsed version of the phylogenetic tree. The collapsed tree shows nodes with changes at the third level of the E.C. classification, which, as mentioned above, broadly act as a proxy for a change in general chemistry. Each leaf of the tree also lists the full E.C. numbers the collapsed node represents.
As the phylogenetic trees can be large and contain many annotations, the tree is rendered as a series of images at various zoom levels that can be navigated using the GoogleMaps API. The GMMap Image Cutter (R. Milton 2008, http://www.casa.ucl.ac.uk/software/googlemapimagecutter.asp) is used to generate the image tiles that are used by the GoogleMaps API to display the tree. Embedded in a web page, the tree can be navigated using the controls familiar to anyone who has used Google Maps. Thus, the tree can be dragged, panned and zoomed using the navigation tools or click-and-drag mouse motions, and overlays show hyperlinks and additional notes when the mouse hovers over a specific part of the map. Also provided is an in-page thumbnail overview, which tracks the movements in the main image. This aids navigation when the image is zoomed in.
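To give a feel for the tiling, the sketch below estimates how many tiles each zoom level needs for a tree image of a given size. It assumes the common web-map convention of 256-pixel square tiles with the image halved in resolution at each step down from the deepest zoom level; the function names and the example dimensions are hypothetical, and the actual image cutter may lay out tiles differently.

```python
import math

TILE_SIZE = 256  # pixel edge of a standard web-map tile (assumed)


def max_zoom(width, height, tile_size=TILE_SIZE):
    """Smallest zoom level whose tile grid covers the full-size image."""
    longest = max(width, height)
    return max(0, math.ceil(math.log2(longest / tile_size)))


def tiles_per_level(width, height, tile_size=TILE_SIZE):
    """Map zoom level -> (columns, rows) of tiles that contain image data."""
    top = max_zoom(width, height, tile_size)
    levels = {}
    for zoom in range(top + 1):
        scale = 2 ** (top - zoom)  # downsampling factor at this level
        cols = math.ceil(width / scale / tile_size)
        rows = math.ceil(height / scale / tile_size)
        levels[zoom] = (cols, rows)
    return levels


# Hypothetical tree image of 4000 x 12000 pixels.
levels = tiles_per_level(4000, 12000)
```

For this example the full-resolution level needs a 16 by 47 grid of tiles, whereas the most zoomed-out level fits in a single tile.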
Each leaf of the tree is annotated with links to sequence, structure and mechanism data if known. In addition, the E.C. numbers are annotated and coloured according to their similarity at the third level of the E.C. hierarchy. The small molecules involved in each E.C. reaction are represented by coloured boxes, where the colour shows the similarity relationship based on the SMSD scores. The more similar the molecules, the closer their colours are according to the colours of the rainbow. The complete reaction is also annotated as an image, appearing when the mouse is hovered over the annotation. Finally, the domain architecture of the complete sequence is depicted as a series of coloured bars, with each unique domain in the MDA given a unique colour. At the nodes in the tree, the bootstrap values are displayed together with a link to a JMol (29) view of the superimposition of any structures present in the clade rooted at the node. The structures are shown as protein cartoons, coloured based on the colours assigned to the E.C. code in the tree, and the active site residues are highlighted as space filled atoms coloured red. The active site information is derived from the CSA (Figure 3).
In the collapsed phylogenetic tree, the third level E.C. code representing the branch is highlighted and all the full E.C. numbers are listed. In addition, the sequence alignment the phylogenetic tree is based on is shown in Jalview (30). If any sequence has a known structure, the secondary structure assignments provided by PDBsum (31) are annotated along with catalytic site residues as defined by the CSA. Distinction is made between the catalytic residues identified in the CSA by curation from the literature, and those inferred on the basis of sequence comparison.
All types of tree images are processed and rendered, along with data collection, processing and integration, using software developed in-house for FunTree. All software is written in Python making particular use of the BioPython (32), Pycluster and PIL libraries. Associated data relating to the trees and superfamilies are stored in a MySQL database.
Overview of 276 superfamilies
The FunTree pipeline has been applied to 276 CATH superfamilies. These superfamilies represent over 2 million sequences from UniProtKB and nearly 3 million domain sequences (32% of CATH-Gene3D sequences) as defined by CATH-Gene3D. All four CATH classes and 60% of all CATH architectures are present. Though these 276 superfamilies represent only 11% of CATH homologous superfamilies they include some of the largest superfamilies, so that 48% of structurally characterized domains classified by CATH are present. In total, FunTree captures 2167 E.C. numbers (71% of E.C. numbers assigned to sequences) of which 1817 are fully classified and 1360 represent chemically balanced reactions with 1589 unique metabolites.
The largest number of SSGs and MDAs are found in the P-loop containing nucleotide triphosphate hydrolases, with 27 SSGs and 687 unique MDAs. The top 10% of superfamilies in FunTree, ordered by either the number of SSGs or MDAs, account for ∼50% of all sequences represented by the 276 superfamilies. This top 10% has a mean of 5 SSGs and 113 MDAs. The rest of the superfamilies have an average of only one structurally similar group and 7 different domain architectures (Figure 4). This accords with previous observations (6).
The purpose of this resource is to explore the evolution of functional catalytic diversity. The distribution of the number of associated functions for each superfamily, as defined by the E.C., shows that some exceptional superfamilies have many different enzyme functions [the NAD(P)-binding Rossmann-like domain has the most with 223 unique E.C. numbers], while 49 others have only one. The top 10% of superfamilies by number of sequences in FunTree account for 849 unique functions as defined by E.C. number, with an average of 35 E.C. numbers per superfamily. The rest have on average only 6 E.C. numbers per superfamily.
Of the 276 superfamilies, about two-thirds (177) show some or all of their functional diversity at the fourth serial number level of the E.C. classification, which indicates changes in substrate specificity. The promiscuity of a superfamily can be gauged by analysing the diversity shown by multiple differences in the serial number (note that in some reactions, as defined by the E.C. number, the substrates include an ‘R’ group which indicates a variable moiety and provides another level of substrate diversity). Of these 177 superfamilies, 150 have more than 50% of their E.C. diversity coming from changes at this level, the rest coming from changes at the higher levels. However, a nearly equal number of superfamilies (176) include at least one member where the diversity occurs at the third level of the E.C. or above, which can act as a proxy for a change in chemistry. In 39 of these superfamilies, all the diversity occurs through changes at the third level or above and none at the serial number (fourth level) of the E.C. In our data set, there are 67 superfamilies (∼25%) where the single domain carries out 80% or more of the enzyme functions found in the superfamily.
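The level at which functional diversity occurs can be illustrated by comparing E.C. numbers field by field. The sketch below counts, over all pairs of distinct E.C. numbers in a set, the first level at which each pair differs. Pairwise counting is an assumption made here for illustration and is not necessarily how the figures above were derived; the function names and the example E.C. list are hypothetical, and fully specified four-field E.C. numbers are assumed.

```python
from itertools import combinations


def first_difference(ec_a, ec_b):
    """Return the 1-based E.C. level at which two numbers first differ.

    Returns None if the two E.C. numbers are identical.
    """
    fields = zip(ec_a.split("."), ec_b.split("."))
    for level, (a, b) in enumerate(fields, 1):
        if a != b:
            return level
    return None


def diversity_levels(ec_numbers):
    """Count pairwise differences by the level at which they first occur."""
    counts = {1: 0, 2: 0, 3: 0, 4: 0}
    for ec_a, ec_b in combinations(sorted(set(ec_numbers)), 2):
        level = first_difference(ec_a, ec_b)
        if level is not None:
            counts[level] += 1
    return counts


# Hypothetical set of E.C. numbers for one superfamily.
counts = diversity_levels(["1.1.1.1", "1.1.1.2", "1.1.3.4", "2.7.1.1"])
```

In this example one pair differs only in the serial number (fourth level), two pairs first differ at the third level and three pairs differ at the class level.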
It is difficult to combine structural, sequence, phylogenetic, functional and chemical data together effectively for a large number of superfamilies, thus we had to develop a complex pipeline. Bringing together this range of data into a single resource allows the investigation of the evolution of novel enzyme functions within structurally defined superfamilies. It has permitted not only the exploration of specific enzyme superfamilies but provides a means to analyse trends across many superfamilies. Exploring individual families has reinforced the observation that enzyme evolution is incredibly complex, with many different routes being taken to obtain different reactions, mechanisms and specificities within a superfamily.
In practice, the FunTree resource allows a number of questions to be addressed. For example, for a given superfamily the catalytic diversity (by E.C. number) can be gauged as well as the range and diversity of known substrates and products. Furthermore, FunTree can show the evolutionary progression of the superfamily in terms of function. In the future, we envisage that it will be possible to place new sequences into FunTree, allowing a user to see where they are positioned in ‘functional’ space. FunTree also allows other more general questions to be addressed for all superfamilies, such as which E.C. numbers are ‘related’ in terms of evolution and what are the common structural paradigms of enzyme evolution that underlie functional evolution.
We will continue to update the resource in parallel with the CATH/CATH-Gene3D update process to include new sequences, structures and functions as they become available. As new tools become available to analyse similarities between enzyme reactions based on metabolite substructure similarity and bond order changes, these will be introduced as another similarity measure appended to the branch annotation. Using such tools should remove some of the problems in comparing E.C. codes.
By beginning to gather, catalogue and classify the emergence of catalytic reactions, FunTree allows users to analyse shifts in functionality across and within enzyme superfamilies, and may help in designing new enzymes as well as aiding function prediction.
Wellcome Trust (Grant No. 081989/Z/07/A to N.F. and I.S.); Biotechnology and Biological Sciences Research Council (to A.L.C.); European Molecular Biology Laboratory (to G.L.H. and S.A.R.); and in part by US Department of Energy Contract (DE-AC02-06CH11357 to R.A.L.) as part of the Midwest Center for Structural Genomics. Funding for open access charge: EMBL and Wellcome Trust Grant.
Conflict of interest statement. None declared.