Proteins can be formed by single or multiple domains. The process of recombination at the molecular level has generated a wide variety of multi‐domain proteins with specific domain organization to cater to the functional requirements of an organism. The functional and structural costs of inserting one domain into another mean that multi‐domain proteins are usually formed by covalently linking the N‐terminus of one domain to the C‐terminus of the preceding domain. While this is true in a large proportion of multi‐domain proteins, we find a significant fraction of proteins that are the result of domain insertion. The inserted domain breaks the sequence contiguity of the domain into which it is inserted, leading to a novel domain organization. This web resource aims to document domain insertions in known protein structures that are classified in the SCOP database. The web server can be accessed from http://stash.mrc-lmb.cam.ac.uk/DomIns/.
Received August 4, 2003; Revised and Accepted September 22, 2003
Domains constitute the basic structural, functional and evolutionary units of proteins ( 1 – 3 ). Proteins can be built from a single domain or from a combination of domains using a limited repertoire of domain families to form multi‐domain proteins with widely diversified domain architecture and functions ( 4 ). Although most multi‐domain proteins have a contiguous arrangement of domains, like beads on a string with one domain following the next in a sequential order, there are exceptions to this common pattern ( 5 ). Insertions are one form of non‐contiguous domain organization, where one domain (insert) is inserted into another (parent), thus breaking the sequence contiguity of the parent domain. Although a few examples of domain insertions have been observed previously ( 6 ), the availability of an accurate and well curated domain classification in SCOP ( 7 ) and the exponential increase in the number of deposited structures in the PDB ( 8 ) now provide a platform to look into this intriguing structural organization of multi‐domain proteins. Through this resource we provide, for the first time, a definitive set of domain insertions in known protein structures.
In order to identify insertions, we used the SCOP ( 1 ) protein domain definitions and the SCOP parseable file, dir.cla.scop.txt_1.61, available here: http://scop.mrc-lmb.cam.ac.uk/scop/parse/index.html. Although there are several schemes for protein structure classification, we chose SCOP as it is a manually curated classification of proteins of known structures based on their structural and evolutionary relatedness. In SCOP, a protein domain is a unit of evolution if it occurs independently or in combination with other domains based on evidence from proteins of known structure. SCOP has a hierarchical classification scheme with the principal levels being family, superfamily, fold and class. Domains clustered together into families are clearly evolutionarily related, usually detectable at the sequence level. Families brought together into superfamilies may have low sequence identity, but their structural and functional features suggest that they have a common ancestry. Superfamilies with similar topology, but without evidence for evolutionary relatedness, are grouped under a fold. Folds are classified into classes based on the secondary structure elements present. We only considered the five major classes (all‐α, all‐β, α/β, α+β and Small proteins), at the fold and superfamily levels of the SCOP hierarchy, in determining insertions. Considering only multi‐domain proteins, we define a domain insertion as a case in which a domain occurs within a different domain of the same chain (Fig. 1). When more than one domain is inserted in a parent, we categorize them as multiple insertions. The domains involved in insertions can come from the same or different SCOP superfamilies.
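The definition above can be illustrated with a minimal sketch. It is not the procedure used to build the resource; it assumes that each domain is described by a region string of the kind found in the SCOP parseable files (for example A:1-90,A:200-310 for a domain made of two segments of chain A), and it adopts one plausible reading of "occurs within": the insert lies entirely in the sequence gap between two consecutive segments of the parent on the same chain. The domain identifiers and residue ranges are hypothetical, and PDB insertion codes are simply dropped.

```python
import re
from collections import defaultdict

# One segment of a region string, e.g. "A:1-90" or "1-90" (no chain ID).
# Residue numbers may be negative; a trailing insertion code is ignored.
SEGMENT = re.compile(
    r"^(?:(?P<chain>[^:]+):)?(?P<start>-?\d+)[A-Za-z]?-(?P<end>-?\d+)[A-Za-z]?$"
)

def parse_region(region):
    """Turn 'A:1-90,A:200-310' into [('A', 1, 90), ('A', 200, 310)].

    Regions without residue ranges (a whole chain, e.g. 'A:' or '-')
    yield an empty list, since they cannot take part in an insertion here.
    """
    segments = []
    for part in region.split(","):
        m = SEGMENT.match(part.strip())
        if m is None:
            return []
        segments.append((m.group("chain") or "_", int(m.group("start")), int(m.group("end"))))
    return segments

def find_insertions(domains):
    """domains: {domain_id: region string}, all from one PDB entry.

    Returns sorted (chain, parent_id, insert_id) triples. Assumed criterion:
    every segment of the insert lies in a single gap between two consecutive
    segments of the parent on the same chain.
    """
    by_chain = defaultdict(dict)  # chain -> {domain_id: [(start, end), ...]}
    for sid, region in domains.items():
        for chain, start, end in parse_region(region):
            by_chain[chain].setdefault(sid, []).append((start, end))

    hits = []
    for chain, doms in by_chain.items():
        for parent, psegs in doms.items():
            psegs = sorted(psegs)
            # Gaps between consecutive parent segments: (end of one, start of next).
            gaps = [(a[1], b[0]) for a, b in zip(psegs, psegs[1:])]
            for insert, isegs in doms.items():
                if insert == parent:
                    continue
                lo = min(s for s, _ in isegs)
                hi = max(e for _, e in isegs)
                if any(g0 < lo and hi < g1 for g0, g1 in gaps):
                    hits.append((chain, parent, insert))
    return sorted(hits)

# Hypothetical entry: the second domain sits inside the first on chain A.
example = {
    "d9xxxa1": "A:1-90,A:200-310",
    "d9xxxa2": "A:91-199",
    "d9xxxb_": "B:",
}
print(find_insertions(example))  # [('A', 'd9xxxa1', 'd9xxxa2')]
```

Two domains that merely follow one another on a chain (for example A:1-100 and A:101-200) produce no hit, because the first domain has a single segment and therefore no internal gap.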
The server can be accessed from the URL: http://stash.mrc-lmb.cam.ac.uk/DomIns/. There are various ways of retrieving information from the server.
(i) Search by PDB or SCOP identifier: a simple search identifies insertions given a PDB code, with or without chain information, or a valid SCOP domain identifier. A query may return no result for one of the following reasons: (a) there is no known insertion, (b) there is no SCOP domain definition available for the PDB code or (c) the structure does not belong to any of the five major SCOP classes considered for identifying insertions.
(ii) Keyword search: this option allows the user to specify keywords (for example, d‐amino acid oxidase) and retrieve a list of PDB entries with insertions that match the keyword(s).
(iii) Browsing all insertions: users can browse all known insertions one by one or choose to browse insertions from a non‐redundant list of PDB chains. In order to have a representative sample of structures from PDB, we used a pre‐computed list of non‐redundant chains provided by PDB_Select (April 2002 release, available from ftp://ftp.embl-heidelberg.de/pub/databases/protein_extras/pdb_select), with a sequence identity threshold of 90%. The procedure to extract such representative chains is explained in ( 9 ).
(iv) Search by insertion type: we categorized known insertions as single or multiple depending on the number of insert domains identified in a given chain. In single insertions, a domain belonging to a particular superfamily is inserted into another domain of the same or a different superfamily. In multiple insertions, more than one insert of the same or different superfamily is inserted into a parent domain. This search feature permits the user to display entries belonging to either or both of these categories.
(v) Search by SCOP class combination: we provide a search facility to retrieve entries with insertions based on the combination of SCOP classes. This facility can also retrieve all insertions for a given parent or insert SCOP class. Figure 2 provides a screenshot of the results for the PDB code 1mla ( 10 ).
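Search options (iv) and (v) can be sketched in a few lines. The following is only an illustration under stated assumptions, not the server's implementation: the record layout (the keys chain, parent, insert, parent_sccs and insert_sccs) and all identifiers are hypothetical, insertions are counted per parent domain within a chain, and the SCOP class is read from the first letter of the sccs string (a, b, c, d and g for all‐α, all‐β, α/β, α+β and Small proteins, respectively).

```python
from collections import defaultdict

# The five SCOP classes considered for identifying insertions.
MAJOR_CLASSES = {
    "a": "all-alpha",
    "b": "all-beta",
    "c": "alpha/beta",
    "d": "alpha+beta",
    "g": "small proteins",
}

def categorize(records):
    """Label each (chain, parent) pair 'single' or 'multiple'
    according to the number of distinct insert domains it carries."""
    inserts = defaultdict(set)
    for r in records:
        inserts[(r["chain"], r["parent"])].add(r["insert"])
    return {key: ("single" if len(found) == 1 else "multiple")
            for key, found in inserts.items()}

def by_class(records, parent_class=None, insert_class=None):
    """Keep records whose parent and insert classes match the request.
    A class left as None matches any of the five major classes."""
    selected = []
    for r in records:
        p = r["parent_sccs"].split(".")[0]
        i = r["insert_sccs"].split(".")[0]
        if p not in MAJOR_CLASSES or i not in MAJOR_CLASSES:
            continue
        if parent_class in (None, p) and insert_class in (None, i):
            selected.append(r)
    return selected

# Hypothetical records: one single insertion and one parent with two inserts.
records = [
    {"chain": "9aaa A", "parent": "p1", "insert": "i1",
     "parent_sccs": "c.1.1.1", "insert_sccs": "d.2.1.1"},
    {"chain": "9bbb A", "parent": "p2", "insert": "i2",
     "parent_sccs": "a.4.5.1", "insert_sccs": "b.1.1.1"},
    {"chain": "9bbb A", "parent": "p2", "insert": "i3",
     "parent_sccs": "a.4.5.1", "insert_sccs": "d.2.1.1"},
]

labels = categorize(records)
alpha_parents = by_class(records, parent_class="a")
d_inserts = by_class(records, insert_class="d")
```

Here labels marks the first chain as a single insertion and the second as a multiple insertion, alpha_parents holds the two records with an all‐α parent, and d_inserts holds the two records whose insert belongs to the α+β class.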
We used MySQL and Java Server Pages (JSP) to create this resource.
SUMMARY OF KNOWN INSERTIONS
As of SCOP Release 1.61, there are 1332 PDB chains that have at least a single insertion, out of which 1143 chains have just a single insertion and 189 chains have more than one insertion. However, in the non‐redundant list of chains, there is a total of 149 insertions, with 131 single insertions and 18 multiple insertions.
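The counts quoted above can be checked for internal consistency with a short calculation. The percentages are derived here from the quoted figures and are not stated in the original text; the variable names are ours.

```python
# Counts quoted for SCOP Release 1.61.
all_chains = {"single": 1143, "multiple": 189, "total": 1332}
non_redundant = {"single": 131, "multiple": 18, "total": 149}

# Single and multiple insertions should add up to the stated totals.
for counts in (all_chains, non_redundant):
    assert counts["single"] + counts["multiple"] == counts["total"]

def multiple_share(counts):
    """Percentage of entries with more than one insertion, to one decimal."""
    return round(100 * counts["multiple"] / counts["total"], 1)

print(multiple_share(all_chains))     # 14.2
print(multiple_share(non_redundant))  # 12.1
```

Multiple insertions thus account for roughly one in seven entries in the full set and about one in eight in the non‐redundant set.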
IMPLICATIONS AND FUTURE WORK
While artificial bifunctional and multifunctional proteins have been created by engineering end‐to‐end gene fusions, there are only a handful of examples where it has been possible to create multifunctional proteins by inserting whole domains into pre‐existing ones ( 16 ). We believe that the list of naturally occurring domain insertions provided by this resource is a valuable tool for studying the effect of domain insertions on protein folding and for expanding the repertoire of multifunctional hybrid proteins.
In addition to providing regular updates in conjunction with SCOP releases, we intend to annotate known insertions at different levels. We are currently working on a graphical representation of the organization of the domains in a given chain. We are also working on a wire plot of the PDB chain that shows secondary structure information, with unique colour coding for individual domains, in order to provide a more detailed view of the structural features of insertions.
We thank Emma Hill and Madan Babu Mohan for several useful suggestions. R.A.S. and R.S. acknowledge financial support from Cambridge Commonwealth Trust and the Medical Research Council, UK.