Abstract

The latest version of CATH (class, architecture, topology, homology) (version 3.2), released in July 2008 (http://www.cathdb.info), contains 1 14 215 domains, 2178 Homologous superfamilies and 1110 fold groups. We have assigned 20 330 new domains, 87 new homologous superfamilies and 26 new folds since CATH release version 3.1. A total of 28 064 new domains have been assigned since our NAR 2007 database publication (CATH version 3.0). The CATH website has been completely redesigned and includes more comprehensive documentation. We have revisited the CATH architecture level as part of the development of a ‘Protein Chart’ and present information on the population of each architecture. The CATHEDRAL structure comparison algorithm has been improved and used to characterize structural diversity in CATH superfamilies and structural overlaps between superfamilies. Although the majority of superfamilies in CATH are not structurally diverse and do not overlap significantly with other superfamilies, ∼4% of superfamilies are very diverse and these are the superfamilies that are most highly populated in both the PDB and in the genomes. Information on the degree of structural diversity in each superfamily and structural overlaps between superfamilies can now be downloaded from the CATH website.

CURRENT POPULATION OF THE CATH HIERARCHY

CATH (class, architecture, topology, homology) is a hierarchical protein domain classification (1) where domains are classified manually by curators, guided by prediction algorithms (such as structure comparison). Each protein structure is decomposed into one or more chains which in turn are split into one or more domains before being classified into homologous superfamilies according to both structure and function. At the Class, or C-level, the domains are classified simply on the basis of their secondary structure content [whether they are mostly α-helical (Class 1) or β-sheet (Class 2), contain a significant percentage of both secondary structure elements (Class 3) or contain very little secondary structure (Class 4)]. The domains within each class are then sorted according to their architecture—that is similarities in the arrangements of secondary structures in 3D space. Each architecture (A-level) is further broken down into one or more topology, or fold, groups (T-level), where the connectivity between these secondary structures are taken into account. The domains are then classified into their respective homologous superfamilies (H-level) according to similarities in sequence, structure and/or function. Clustering performed at the H-level (>35% sequence identity and above) then produces one or more sequence families for each of the homologous superfamilies (S-level). Table 1 below shows the current population of different levels in the CATH hierarchy.

Table 1.

Release statistics for CATH version 3.2

Class Architecture Topology Homologous superfamily S35 family 
310 682 2078 
20 196 438 2062 
14 512 956 4558 
92 102 173 
Total 40 1110 2178 8871 
Class Architecture Topology Homologous superfamily S35 family 
310 682 2078 
20 196 438 2062 
14 512 956 4558 
92 102 173 
Total 40 1110 2178 8871 

A PERIODIC TABLE OF CATH ARCHITECTURES

A visual snapshot of the domain architectures in the CATH database is now captured in a new ‘Protein Chart’ (2). This chart, inspired by Taylor's ‘Periodic Table’ of protein structures devised in 2002 (3), shows fold representatives of all the most regular domain architectures currently classified in the CATH database. It is organized so that the smallest representative for any given architecture is at the top of the chart and the largest at the bottom, giving a guide to the variation in size and structure that can occur. Functional information and population statistics for each architecture are provided in a table accessible from the CATH web site (http://www.cathdb.info/download#version_v3.1).

Using the chart, we have identified nine new architectures classified since CATH architectures were first presented in 1997 (1) (Figure 1). These new architectures are not highly populated accounting for only ∼4% of predicted CATH domain sequences in the genomes.

Figure 1.

Some of the architectures new to CATH since 1997.

Figure 1.

Some of the architectures new to CATH since 1997.

In the mainly-α class, the α-solenoid architecture (1.40) contains only one superfamily. Domains provide an α-helical scaffold for a central hydrophobic cavity, which contains light harvesting molecules (4). The αα-barrel (1.50) contains 2 α-helical layers, with long loops that create a tunnel. They are typically glycosyl hydrolases (5). The α-horseshoe (1.25) is a super helical structure made up of a number of 3 α-helical orthogonal bundle repeats.

In the mainly-β class, we identify a new β-propellor—the 5-bladed propeller (2.115). A new sandwich architecture, the 3-layer βββ-sandwich is made up of three anti-parallel forumla layered into three adjacent stacks with an immunoglobulin-like sub-domain. Most are rieske iron–sulphur proteins (7)

Four new architectures are classified in the α-β class. Super-rolls are made from twisted anti-parallel β-strands capped by 2 α-helices. All classified domains bind to and neutralize lipopolysaccharides in the outer-membrane of gram-negative bacteria (8). The 3-layer (βαβ) sandwich architecture contains 10 domains in three different folds. The most highly populated fold is largely comprised of bacterial heat shock proteins. The αβ-prism is made up from a repeating folding unit composed of two parallel α-helices and a β-sheet. Domains with this fold are commonly found in 5-enolpyruvylshikimate-3-phosphate synthase and UDP-N-acetylglucosamine enolpyruvyl transferase (9). 5-Stranded αβ-propellers are composed of ββαβ repeats arranged in a circular fashion surrounding a channel in the centre of the structure (10).

INCREASING THE PROPORTION OF NOVEL STRUCTURES CLASSIFIED IN CATH

Recent analyses of CATH domain annotations in Gene3D (11) showed that between 80–90% of domain sequences in completely sequenced genomes can be assigned to a structural family in CATH. This suggests that the CATH database now provides a reasonably comprehensive structural view of the protein universe. Those protein families that have yet to be represented structurally are likely to be transmembrane or disordered proteins.

The fact that most major folds are represented in CATH is reflected in the continual decrease in the proportion of non-redundant structures found to adopt novel folds. Table 2 gives the number of new folds identified over the last 10 years and the percentage of non-redundant structures deposited adopting a novel fold.

Table 2.

Numbers of structures classified in CATH and the proportion of novel folds per year

Year of PDB release Number PDB structures classified in CATH Number novel folds Novel folds (%) 
1997 1584 92 5.81 
1998 1876 87 4.64 
1999 2226 104 4.67 
2000 2549 90 3.53 
2001 2766 91 3.29 
2002 2821 73 2.59 
2003 3668 34 0.93 
2004 3711 61 1.64 
2005 3198 0.19 
2006 3163 18 0.57 
2007 2802 11 0.39 
Year of PDB release Number PDB structures classified in CATH Number novel folds Novel folds (%) 
1997 1584 92 5.81 
1998 1876 87 4.64 
1999 2226 104 4.67 
2000 2549 90 3.53 
2001 2766 91 3.29 
2002 2821 73 2.59 
2003 3668 34 0.93 
2004 3711 61 1.64 
2005 3198 0.19 
2006 3163 18 0.57 
2007 2802 11 0.39 

IMPROVEMENTS TO THE STRUCTURAL COMPARISON METHOD CATHEDRAL

We have further improved our domain boundary prediction and fold assignment algorithm, CATHEDRAL (12), which is used to guide curators in the manual classification process. Whole PDB structures are scanned against a library of representatives from the CATH database to recognize constituent domains. CATHEDRAL initially performs rapid secondary structure comparison against the library to identify putative fold matches, which are then more accurately aligned at the residue level using dynamic programming. A support vector machine (SVM) is used to combine different measures of structural similarity and rank hits to the query structure. All domains predicted to be genuine hits by the SVM are assigned in an iterative fashion to identify constituent folds and domain boundaries from the residue-based structural alignments. Hits are allowed to overlap by up to 30 residues and conflicts are resolved by a new algorithm that moves along the overlapping region and assigns each residue to the closest domain.

STRUCTURAL DIVERSITY AND THE VALIDITY OF THE CATH HIERARCHY

There has been much debate on the existence of a protein fold continuum and the validity of a hierarchical protein classification system (13–16). Greene et al. (17) previously explored the concept of ‘lateral links’ across the CATH hierarchy as a way of capturing structural relationships between superfamilies. More recent in-house analyses have shown that, within some of the most highly populated superfamilies, significant structural changes have occurred. Typically, the domains within a given superfamily possess a ‘common structural core’ comprising 40–50% of the residues in the structure, but there can be considerable structural embellishments to this core and some domains can be up to three times larger than the typical representative of the family (18). In some cases, the embellishments are so considerable that the domain in question can be considered to exhibit a different fold to the other domains in the family.

Due to the improvements made to CATHEDRAL (12), we have been able to perform a database-wide analysis of the similarities between all protein structures in the CATH database. This has been used to examine the extent to which superfamilies diverge structurally and determine which superfamilies overlap with one another. Domains in each superfamily were first assigned to ‘structurally similar groups’ (SSGs), whereby a domain is assigned to a particular SSG if they exhibit significant structural similarity with other domains in that group (Cuff,A.L. et al., submitted for publication). That is, if they share a normalized RMSD (SiMAX) structure comparison score of <5 Å (Cuff,A.L. et al., submitted for publication). Superfamilies with five or more SSGs were deemed to be structurally diverse.

The majority of homologous superfamilies (∼96%) in the database are structurally conserved and structurally coherent, that is, they contain less than five SSGs and do not overlap with any other superfamily. However, the ∼4% of CATH superfamilies that do show considerable structural diversity, are those which are the most highly populated in CATH, accounting for 40% of domain sequences in the genomes (Figure 2) (Cuff,A.L. et al., submitted for publication).

Figure 2.

Relationship between the degree of structural diversity (measured by the number of SSGs) and population of the superfamilies in the genomes (number of sequences).

Figure 2.

Relationship between the degree of structural diversity (measured by the number of SSGs) and population of the superfamilies in the genomes (number of sequences).

If we consider the different SSGs to represent distinct ‘folds’ within these superfamilies, then instead of the 1110 ‘fold groups’ (defined by the Topology level in CATH version 3.2) there would be 3118 ‘fold groups’ and some superfamilies would have multiple ‘folds’. However, although examples of dramatic fold changes are known (19), they are rare and the majority of gross structural changes that occur within a superfamily result from extensive structural embellishments to the common core rather than a dramatic change within the core. Therefore, the CATH hierarchical classification is not challenged if we consider a more appropriate definition of the T-level or topology level in CATH to be a grouping of structures sharing a common fold in the core of the domain. A file containing the number of SSGs contained in each superfamily in CATH can be downloaded from (http://www.cathdb.info/download#version_v3.1)

We also investigated whether structures in different superfamilies were structurally similar (i.e. SiMAX <5Å). We observed relatively little overlap between different superfamilies and fold groups for a SiMAX threshold of <5Å. As the threshold is increased, however, more overlaps do occur between some architectures, such as the α up-down bundle, α-orthogonal bundle, β-sandwiches and αβ-sandwiches (Figure 3). This is largely due to the presence of small common super-secondary motifs, such as the α-hairpin, β-hairpin and αβ-motif. Superfamilies that exhibit no structural overlaps at all tend to have very distinctive folds, such as the β-trefoil fold, with unusual motifs or unusual combinations of common motifs.

Figure 3.

Plot showing the percentage of superfamilies that overlap (red) and show structural diversity, or drift (>5 SSG's) (blue) for different SiMAX cuff-offs.

Figure 3.

Plot showing the percentage of superfamilies that overlap (red) and show structural diversity, or drift (>5 SSG's) (blue) for different SiMAX cuff-offs.

A structural overlap matrix of SiMAX scores created via the all-against-all CATHEDRAL analysis is downloadable from the CATH website (see http://www.cathdb.info/download#version_v3.1) so that users can perform their own analyses on a CATH-based protein structure universe.

REDESIGNED WEBSITE

The CATH database can be accessed at http://www.cathdb.info. The web interface has been completely redesigned since version 3.1. Documentation, such as an FAQ, tutorials, a glossary, downloadable data files and staff webpages have also been created and are being maintained through a open source wiki software package. This will be frequently updated.

SUMMARY

In the light of our analyses on structural diversity in CATH, it is clear that the T-level provides a clustering of domain structures having similar folds in their domain cores. For each superfamily, information on the variety of different decorations to this common structural core is provided as distinct SSGs within the superfamily. Multiple structural alignments will shortly be provided for each SSG in order to highlight common secondary structures in the domain core and embellishments to this core.

FUNDING

Funding for open access charge: BBSRC.

Conflict of interest statement. None declared.

REFERENCES

1
Orengo
CA
Michie
AD
Jones
DT
Swindells
MB
Thornton
JM
CATH: A Hierarchic Classification of Protein Domain Structures
Structure
 , 
1997
, vol. 
5
 (pg. 
1093
-
1108
)
2
Garratt
RC
Orengo
CA
The Protein Chart
 , 
2007
Germany
Wiley-VCH
 
ISBN: 978-3-527-31963-3
3
Taylor
WR
A ‘periodic table’ for protein structures
Nature
 , 
2002
, vol. 
416
 (pg. 
657
-
660
)
4
Hofmann
E
Wrench
PM
Sharples
FP
Hiller
RG
Welte
W
Diederichs
K
Structural basis of light harvesting by carotenoids: peridinin-chlorophyll-protein from Amphidinium carterae
Science
 , 
1996
, vol. 
272
 (pg. 
1788
-
1791
)
5
Rojas
AL
Nagem
RA
Neustroev
KN
Arand
M
Adamska
M
Eneyskaya
EV
Kulminskaya
AA
Garratt
RC
Golubev
AM
Polikarpov
I
Crystal structures of beta-galactosidase from Penicillium sp. and its complex with galactose
J. Mol. Biol.
 , 
2004
, vol. 
343
 (pg. 
1281
-
1292
)
6
Fülöp
V
Jones
DT
Beta propellers: structural rigidity and functional diversity
Curr. Opin. Struct. Biol.
 , 
1999
, vol. 
9
 (pg. 
715
-
721
)
7
Kolling
DJ
Brunzelle
JS
Lhee
S
Crofts
AR
Nair
SK
Atomic resolution structures of rieske iron-sulfur protein: role of hydrogen bonds in tuning the redox potential of iron-sulfur clusters
Structure
 , 
2007
, vol. 
15
 (pg. 
29
-
38
)
8
Beamer
LJ
Carroll
SF
Eisenberg
D
Crystal structure of human BPI and two bound phospholipids at 2.4 angstrom resolution
Science
 , 
1997
, vol. 
276
 (pg. 
1861
-
1864
)
9
Palm
GJ
Billy
E
Filipowicz
W
Wlodawer
A
Crystal structure of RNA 3′-terminal phosphate cyclase, a ubiquitous enzyme with unusual topology
Structure
 , 
2000
, vol. 
8
 (pg. 
13
-
23
)
10
Humm
A
Fritsche
E
Steinbacher
S
Huber
R
Crystal structure and machanism of human L-arginine: glycine amidinotransferase: a mitochrondrial enzyme involved in creatine biosynthesis
EMBO J.
 , 
1997
, vol. 
16
 (pg. 
3373
-
3385
)
11
Yeats
C
Lees
J
Reid
A
Kellam
P
Martin
N
Liu
X
Orengo
C
Gene3D: comprehensive structural and functional annotation of genomes
Nucleic Acids Res.
 , 
2008
, vol. 
36
 (pg. 
D414
-
D418
)
12
Redfern
OC
Harrison
A
Dallman
T
Pearl
FM
Orengo
CA
CATHEDRAL: a fast and effective algorithm to predict folds and domain boundaries from multidomain protein structures
PLoS Comput. Biol.
 , 
2007
, vol. 
3
 pg. 
e232
 
13
Grishin
NV
Fold change in evolution of protein structures
J. Struct. Biol.
 , 
2001
, vol. 
134
 (pg. 
167
-
185
)
14
Krishna
SS
Grishin
NV
Structural drift: a possible path to protein fold change
Bioinformatics
 , 
2005
, vol. 
21
 (pg. 
1308
-
1310
)
15
Kolodny
R
Petrey
D
Honig
B
Protein structure comparison: implications for the nature of ‘fold space’, and structure and function prediction
Curr. Opin. Struct. Biol.
 , 
2006
, vol. 
16
 (pg. 
393
-
398
)
16
Sippl
MJ
Suhrer
SJ
Gruber
M
Wiederstein
M
A discrete view on fold space
Bioinformatics
 , 
2008
, vol. 
24
 (pg. 
870
-
871
)
17
Greene
LH
Lewis
TE
Addou
S
Cuff
A
Dallman
T
Dibley
M
Redfern
O
Pearl
F
Nambudiry
R
Reid
A
, et al.  . 
The CATH domain structure database: new protocols and classification levels give a more comprehensive resource for exploring evolution
Nucleic Acids Res.
 , 
2007
, vol. 
35
 (pg. 
D291
-
D297
)
18
Reeves
G
Dallman
T
Redfern
O
Akpor
A
Orengo
CA
Structural diversity of domain superfamilies in the CATH database
J. Mol. Biol.
 , 
2006
, vol. 
360
 (pg. 
725
-
741
)
19
Davidson
AR
A folding space odyssey
PNAS
 , 
2008
, vol. 
105
 (pg. 
2759
-
2760
)
This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/2.0/uk/) which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.

Comments

0 Comments