The K homology (KH) module is a widespread RNA-binding motif that has been detected by sequence similarity searches in such proteins as heterogeneous nuclear ribonucleoprotein K (hnRNP K) and ribosomal protein S3. Analysis of spatial structures of KH domains in hnRNP K and S3 reveals that they are topologically dissimilar and thus belong to different protein folds. Thus KH motif proteins provide a rare example of protein domains that share significant sequence similarity in the motif regions but possess globally distinct structures. The two distinct topologies might have arisen from an ancestral KH motif protein by N- and C-terminal extensions, or one of the existing topologies may have evolved from the other by extension, displacement and deletion. C-terminal extension (deletion) requires β-sheet rearrangement through the insertion (removal) of a β-strand in a manner similar to that observed in the serine protease inhibitors (serpins). The current analysis offers a new look at how proteins can change fold in the course of evolution.
Received October 16, 2000; Revised and Accepted December 1, 2000.
Since the emergence of the first three-dimensional protein structures, it has been widely accepted that spatial structure is more conserved than protein sequence (1–6). Many examples of very close structural resemblance in the absence of detectable sequence similarity have been catalogued (7–12). The opposite situation remains obscure. We know very few proteins with statistically supported sequence similarity that fold into radically different structures (13–16). These rare cases are of exceptional interest since they have a profound impact on our understanding of the protein world. Practically, their existence indicates difficulties for homology modeling techniques that rely heavily on the assumption ‘similar sequences, similar structures’ and brings inconsistencies between sequence- and structure-based protein classification schemes. The most fundamental questions, however, concern evolution of protein structure, its relation to evolution of sequence and function, and mechanisms by which protein folds can change. These mechanisms remain largely unexplored both experimentally and theoretically.
Presented here is an example of proteins that share clear sequence similarity yet adopt considerably different folds; it appears to be by far the most striking case of this kind. The sequence similarity between the two proteins described below was detected and widely known before the structures were solved, but the protein folds turned out to be topologically different.
The K homology (KH) motif was first biochemically characterized in the major pre-mRNA-binding protein K (heterogeneous nuclear ribonucleoprotein K, hnRNP K) and described as a 45-amino acid repeat detected by sequence similarity in a number of RNA-binding proteins (17). Siomi et al. (17) noted that the similarity was particularly strong with ribosomal protein S3. The first KH domain of human hnRNP K and the KH domain of Halobacterium halobium S3 display 36% identity (54% similarity, z-score of 12.5, calculated over the entire alignment length of 39 residues), which is higher than that between the first and the second KH domains of hnRNP K (31% identity) (17). KH motifs can occur in multiple copies (15 in chicken vigilin) (18). The most conserved sequence, with the consensus VIGXXGXXI, maps to the middle of the motif (17,19,20). A single amino acid substitution (I304 to N) in this consensus sequence of the FMR1 protein (21) affects its RNA-binding properties (22) and causes fragile X mental-retardation syndrome (23). There has been no question that significant sequence similarity in the KH motif reflects descent from a common ancestor (17,19,20). The conservation of the KH motif in organisms as diverse as Bacteria, Archaea and Eukaryotes suggests that KH arose early in evolution.
MATERIALS AND METHODS
Sequence similarity searches against the non-redundant protein database (nr) maintained at the National Center for Biotechnology Information (NCBI; Bethesda, MD) were performed using the PSI-BLAST program (24,25). The BLOSUM62 matrix (26) was used for scoring, and 0.01 or 0.001 were used as E-value thresholds for inclusion in the profile calculation. Sequence analysis protocols were carried out using SEALS (27). Structure similarity searches against the protein data bank (PDB) (28,29) maintained at the Research Collaboratory for Structural Bioinformatics (RCSB) were performed using DALI (30–32), VAST (33,34) and CE (35) programs with default parameters. The Structural Classification of Proteins (SCOP) database (release 1.53, 11 410 PDB entries, July 1, 2000) (11,12) was used as a source of protein classification. Protein structures were visualized and superimposed using InsightII package (MSI) and the multiple structure-based alignment was built on the basis of the superpositions made in InsightII. Structure diagrams were rendered using Bobscript (36), a modified version of Molscript (37).
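As a rough illustration of the search settings described above, the following Python sketch assembles a command line for the `psiblast` executable of the current NCBI BLAST+ suite. This is an assumption made for illustration only: the searches reported here were run with the PSI-BLAST release available in 2000, whose invocation differed. The query file name, the iteration count and the helper name `build_psiblast_command` are hypothetical, and the command is only constructed, not executed.

```python
def build_psiblast_command(query_fasta, database="nr",
                           inclusion_evalue=0.001, iterations=5):
    """Return an argument list for a BLAST+ psiblast run (nothing is executed).

    The file name and iteration count are illustrative assumptions; only the
    database (nr), the BLOSUM62 matrix and the inclusion thresholds come from
    the text.
    """
    if inclusion_evalue <= 0:
        raise ValueError("E-value inclusion threshold must be positive")
    return [
        "psiblast",
        "-query", query_fasta,
        "-db", database,
        "-matrix", "BLOSUM62",
        "-inclusion_ethresh", str(inclusion_evalue),
        "-num_iterations", str(iterations),
    ]


# The two profile-inclusion thresholds quoted in the text.
for threshold in (0.01, 0.001):
    command = build_psiblast_command("query.fasta", inclusion_evalue=threshold)
    print(" ".join(command))
```

Returning an argument list rather than a single string keeps the sketch safe to pass to a process launcher without shell quoting, should one wish to run it.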
RESULTS AND DISCUSSION
Sequence similarity between the KH domains of hnRNP K and ribosomal protein S3, described in 1993 (17), can be detected by PSI-BLAST. Even the gapped BLAST program finds S3 when hnRNP K is used as a query and vice versa. For example, gapped BLAST aligns the first KH domain of human hnRNP K [NCBI database gene identification number (GI) 585911, residues 35–95] taken as a query with the ribosomal protein S3 from Deinococcus radiodurans (GI:7473848) detected in the nr protein sequence database (November 2000, 582 290 sequences; 183 345 511 letters). The alignment spans 64 residues, which constitutes virtually the entire KH motif, and displays 32% identity (47% similarity, score 31.4 bits, E-value 1.3). The BLAST alignment of hnRNP K (GI:585911 residues 35–95) and H.halobium S3 (GI:133930) spans 36 residues, giving 38% identity (63% similarity, no gaps, score 31 bits, E-value 1.6). In this alignment, a nine-residue segment in the KH signature region is invariant between the two sequences: VIGKGGKNI (GI:585911 residues 57–65, GI:133930 residues 54–62). Conversely, when D.radiodurans S3 (GI:7473848 residues 63–126) is taken as a gapped BLAST query, the first KH domain of human hnRNP K (GI:585911) is found with a score of 34 bits (E-value 0.19). Additionally, the KH domain of GTPase ERA from Mycobacterium leprae is found with a score of 34.3 bits (E-value 0.16), with 31% identity (50% similarity) in a 47-residue alignment.
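The two quantities used in this comparison, a match to the KH consensus and the percent identity of an alignment, can be sketched in a few lines of Python. This is a minimal illustration, not the BLAST procedure itself: the helper names are hypothetical, the gapped toy alignment is invented for demonstration, and identity is computed here over ungapped columns only, which is one of several conventions in use. The only sequence taken from the text is the invariant segment VIGKGGKNI.

```python
import re

# Consensus from the text; X stands for any residue.
KH_CONSENSUS = "VIGXXGXXI"


def consensus_to_regex(consensus):
    """Turn a consensus string with X wildcards into a compiled regex."""
    pattern = "".join("." if ch == "X" else re.escape(ch) for ch in consensus)
    return re.compile(pattern)


def percent_identity(aligned_a, aligned_b):
    """Percent identity over columns where neither sequence has a gap ('-')."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    columns = [(x, y) for x, y in zip(aligned_a, aligned_b)
               if x != "-" and y != "-"]
    if not columns:
        return 0.0
    matches = sum(x == y for x, y in columns)
    return 100.0 * matches / len(columns)


# The nine-residue segment reported as invariant between hnRNP K and S3.
segment = "VIGKGGKNI"
print(consensus_to_regex(KH_CONSENSUS).fullmatch(segment) is not None)
print(percent_identity(segment, segment))

# Invented toy alignment with one gap and one mismatch (not real KH data).
print(percent_identity("ACDE-F", "ACEEGF"))
```

Because the denominator convention changes the reported percentage, any comparison of identity values across papers or programs should first confirm how gapped columns were counted.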
When the KH motif was first described (17), no spatial structure of a KH-containing protein had been determined. By now, we have several KH domain structures in hand (18,38–44), including those of the proteins detected by sequence similarity in the original paper that identified the motif (17): hnRNP K and ribosomal protein S3. The structure of the C-terminal KH domain of human hnRNP K has been determined by NMR spectroscopy (Fig. 1B) (40) and the coordinates for S3 became available recently following the solution of the X-ray structure of the entire 30S ribosome subunit from Thermus thermophilus (Fig. 1E) (44). Was the prediction of structural similarity between hnRNP K and S3 based on sequence similarity in the KH motif region fulfilled? Yes and no. The conformations of residues in and around the KH consensus VIGXXGXXI are indeed very similar between the two structures (Figs 1B and E and 2A). Near the consensus, the protein chain is folded as two α-helices, A and B (Fig. 1), arranged at an angle of 100–120° to each other. A two-residue protruding turn connects the α-helices A and B (Figs 1B and E and 2A). The two largely invariant glycines separated by two variable residues in the turn (GXXG) serve as C- and N-caps of the two α-helices A and B. The side chains of residues around the consensus are conformationally similar (Fig. 1B and E) and are likely to play the same functional role. The KH consensus sequence has been implicated in direct contact with nucleic acids (17,19,21,22,45) and the recent crystal structure of the nova-2 KH domain bound to a 20mer RNA hairpin (43) confirmed this hypothesis (Fig. 2B). The α-helix A, the following turn and the β-strand b (Fig. 1) are involved in extensive contacts with RNA.
Thus the local motif identified by the statistically supported sequence similarity is folded the same way in hnRNP K and S3 structures, and is likely to bind nucleic acids by the same mechanism. But are the global folds of the two proteins similar? The first spatial structure of a KH motif protein, the sixth KH domain of vigilin (Fig. 1A), revealed the presence of a compact domain. In addition to the motif sequence covering the βααβ unit (Fig. 1, a, A, B and b), the KH domain included a βα unit at the C-terminus that is inherently important for its structural integrity (18). Indeed, the β-strand c is the central element of the three-stranded anti-parallel β-sheet (Fig. 1A and B). The α-helix C (Fig. 1A and B) completes the hydrophobic core of the protein and the KH domain is unable to fold when this α-helix is deleted (18). The vigilin KH domain can be described as an α+β two-layer sandwich with α-β plate topology (9,10). This topology is also known as the ‘ferredoxin-like’ protein fold (11,12) (the last strand of the ferredoxin common fold is missing in the KH domain). An example of a protein with α-β plate topology that does not share sequence similarity with the KH domain, namely the C‐terminal domain of the Escherichia coli arginine repressor (46), is illustrated in Figure 1C.
The structure of the vigilin domain led to a redefinition of the KH motif boundaries to cover the helix C (18), making the domain length approximately 70 residues. However, several KH sequences lack the C helix. These include ribosomal protein S3, amongst others. The shorter KH sequences that match the original definition of the KH motif (17) were termed ‘mini-KH’, in contrast to typical ‘vigilin-like’ ‘maxi-KH’ domains (18). Surprisingly, the structure of the ribosomal protein S3 N-terminal domain (44) revealed that the β-sheet topology of the mini-KH domain is drastically different from the one established for maxi-KH (Fig. 1E). Indeed, not only the α-helix C, but also the central β-strand c, which seemed to be crucially important for the fold, is lacking in the S3 structure (Fig. 1E). Instead, another β-strand (a′) and α-helix (A′) donated by the N-terminal part of the domain complete the hydrophobic core of the mini-KH. Such an arrangement results in architectural similarity between maxi- and mini-KH: both domains are built from a three-stranded β-sheet with three α-helices packed on one side of it (Figs 1 and 2A). The difference is topological: while in maxi-KH the β-sheet is anti-parallel, in mini-KH it is mixed. Parallel β-strands a and b that were included in the original definition of the KH motif (17,19) form hydrogen bonds with each other in the S3 structure (Fig. 1E), but are separated by the β-strand c in maxi-KH (Fig. 1B). Another structure of a mini-KH domain-containing protein, GTPase ERA (41), displays significant topological similarity to S3 (Fig. 1D) and thus confirms that the structure of S3 is not an exception, but a template for mini-KH domains. Structures topologically similar to the mini-KH domain are also known among proteins that do not contain the KH motif. For example, the C-terminal domain of E.coli GMP synthetase (47) is shown in Figure 1F.
Global structure similarity search programs such as DALI (30–32), VAST (33,34) and CE (35) find significant similarity within the mini- and maxi-KH classes, but concur on the global structural differences between the two classes. For example, DALI finds the structures of two mini-KH domains similar: the KH domains of S3 (PDB entry 1FJF chain C) and GTPase ERA (1EGA chain A, C-terminal domain) are aligned with a z-score of 4.1, a root mean-squared deviation (RMSD) of 4.3 Å and 7% sequence identity in an alignment of 89 residues. DALI does not report similarity of these proteins to any of the maxi-KH domains, implying that the corresponding z-scores are <2.0.
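For readers less familiar with the RMSD value quoted above, the quantity can be sketched as follows. This is a minimal illustration under stated assumptions: the two coordinate sets are taken to be already superimposed and paired residue by residue, the function name `rmsd` is hypothetical, and the toy coordinates are invented. It does not reproduce the alignment or superposition procedure that DALI itself performs.

```python
import math


def rmsd(coords_a, coords_b):
    """Root mean-squared deviation between two paired, superimposed
    lists of (x, y, z) coordinates, in the same units as the input."""
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("coordinate lists must be non-empty and equal in length")
    total = 0.0
    for (xa, ya, za), (xb, yb, zb) in zip(coords_a, coords_b):
        total += (xa - xb) ** 2 + (ya - yb) ** 2 + (za - zb) ** 2
    return math.sqrt(total / len(coords_a))


# Invented two-atom example: one atom coincides, the other is displaced by 2 units.
set_a = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0)]
set_b = [(0.0, 0.0, 0.0), (3.8, 0.0, 2.0)]
print(rmsd(set_a, set_b))  # square root of (0 + 4) / 2
```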
The analysis presented forces us to return to the original definition of the KH motif boundaries that include only the βααβ unit shared between maxi- and mini-KH domains (Fig. 1G). In addition to this shared KH motif element, maxi- and mini-KH domains contain C- and N-terminal extensions, respectively. Therefore in terms of the overall domain size, the mini-KH domain is not smaller than the maxi-KH domain: both comprise approximately 70 residues. The mini-/maxi-KH terminology was originally meaningful. The mini-KH domain does not contain the C-terminal β-strand and α-helix (Fig. 1A and B, c and C) of maxi-KH that were included in the modified KH domain definition (18). Prior to mini-KH structure determination it was not known that sequence segments upstream of the N-terminal boundary set by maxi-KH would be part of the hydrophobic core of the mini-KH domain, thus the mini-KH domain appeared to be shorter than the maxi-KH domain. However, due to the lack of chain length differences between the two domains, as revealed by their crystal structures, mini/maxi terminology loses its meaning. We suggest naming the two topologically different KH domains KH type I for the KH domain with the C-terminal βα extension [maxi-KH, its structure was determined first (18)], and KH type II for the KH domain with N-terminal αβ extension (mini-KH).
It is clear that the type I and II KH domains belong to different protein folds (Fig. 1A, B, D and E). It is also clear that they share the same KH motif (Fig. 1G). What is the evolutionary connection between the two different KH domains with the same KH motif? The simplest, and well-documented, mechanism of topological changes in protein evolution (48–50), circular permutation, is not possible in this case since the order of secondary structural elements differs: a βα unit is present at the C-terminus of the type I KH, but an αβ unit starts type II KH. It is therefore likely that type I and II KH domains are not homologous throughout their entire length. Theoretically, four evolutionary scenarios are possible. First, local sequence, structural and functional similarities in the KH motif region were acquired independently by type I and II KH domains and thus are convergent. Second, the element of the local sequence similarity (minimally, sequence segment around the turn between the α-helices A and B; Fig. 1) was inserted in two different structural templates: type I and II KH domains. Third, the homology region covers the entire βααβ unit, which represents a ‘primordial’ KH domain. This domain was expanded by the C-terminal extension to form a type I KH domain fold or by the N-terminal extension to form a type II KH domain fold. Fourth, one of the two types represents the ancestral form and the other type evolved through N- or C-terminal extension, and displacement and deletion at the other end.
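The argument against circular permutation can be checked at the level of secondary structure element order. The sketch below encodes each domain as a string of α and β symbols following the description in the text (KH motif βααβ plus a C-terminal βα unit for type I, an N-terminal αβ unit plus the KH motif for type II) and tests whether one string is a rotation of the other. This is a coarse simplification: it ignores strand pairing and element lengths, and the function name is hypothetical.

```python
# Element order along the chain, as described in the text.
TYPE_I = "βααββα"   # KH motif βααβ followed by the C-terminal βα extension
TYPE_II = "αββααβ"  # N-terminal αβ extension followed by the KH motif βααβ


def is_circular_permutation(order_a, order_b):
    """True if order_b can be obtained by rotating order_a.

    Every rotation of a string occurs as a substring of the string
    concatenated with itself, which gives a one-line test.
    """
    return len(order_a) == len(order_b) and order_b in order_a + order_a


print(is_circular_permutation(TYPE_I, TYPE_II))     # the two KH types
print(is_circular_permutation(TYPE_I, "ααββαβ"))    # a genuine rotation of type I
```

The first call returns False and the second True, consistent with the statement that the order of secondary structural elements in the two KH types cannot be reconciled by joining the termini and cutting the chain elsewhere.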
It appears that the third and fourth scenarios offer the simplest explanation of the available data. Indeed, insertions, deletions and terminal extensions are very common events in protein evolution (51,52). Also, it has been argued, and largely accepted, that statistically significant similarity detected from the sequence alone (without consideration of spatial structure) reflects descent from a common ancestor, i.e. homology (16,53,54). Programs that are routinely used for sequence similarity searches, such as PSI-BLAST (24,25), are based on amino acid similarity matrices that are derived under evolutionary models (55) or computed from aligned homologous sequences (26) and thus are intended to find homologs. Therefore, a convergent origin of KH domains appears unlikely given their highly significant sequence similarity (17–19). At present, it is hard to discriminate between the third and fourth scenarios. The third scenario might seem unrealistic, since it assumes the existence of a putative primordial βααβ domain, which might not be stable in the absence of the N- or C-terminal α-helix to pack against the β-sheet. However, it is likely that primordial proteins existed in tight contact with RNA and might not have been foldable in the absence of RNA molecules. It is also reasonable to assume that primordial proteins were significantly shorter than average present-day domains. The fourth scenario offers a physically realistic model that might pass through an intermediate protein containing both N- and C-terminal extensions before one of the extensions was eliminated. There is a chance that such a ‘hybrid’ protein still exists in nature. Thus the discovery of a KH motif-containing protein with the topology αββααββα (a combination of both type I and II domains, with a four-stranded β-sheet and four α-helices on one side; Fig. 1A, B, D and E) would be an argument favoring the fourth scenario.
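The relationship between the KH motif, the two extensions and the hypothetical hybrid topology can be made explicit with a small sketch. It composes element-order strings from the βααβ motif and the αβ and βα extensions named in the text, and labels a given string by which extensions flank the motif. The function name `classify_kh_topology` and its labels are illustrative assumptions; real classification would require structural data, not a string of symbols.

```python
KH_MOTIF = "βααβ"   # shared motif core
N_EXT = "αβ"        # N-terminal extension of type II
C_EXT = "βα"        # C-terminal extension of type I

TYPE_I = KH_MOTIF + C_EXT
TYPE_II = N_EXT + KH_MOTIF
HYBRID = N_EXT + KH_MOTIF + C_EXT


def classify_kh_topology(topology):
    """Label an element-order string by the extensions flanking the KH motif."""
    has_n_ext = topology.startswith(N_EXT + KH_MOTIF)
    has_c_ext = topology.endswith(KH_MOTIF + C_EXT)
    if has_n_ext and has_c_ext:
        return "hybrid"
    if has_c_ext:
        return "type I"
    if has_n_ext:
        return "type II"
    if topology == KH_MOTIF:
        return "motif only"
    return "unclassified"


for name, topology in [("type I", TYPE_I), ("type II", TYPE_II), ("hybrid", HYBRID)]:
    print(name, topology, classify_kh_topology(topology))
```

The composed hybrid string is αββααββα, with four β and four α elements, matching the topology described above as the signature of a possible intermediate in the fourth scenario.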
Interestingly, the C-terminal extension cC (Fig. 1A and B) in the type I KH domains required rearrangement of the β-sheet: hydrogen bonding between β-strands a and b of the putative ‘primordial’ KH domain should have been broken to accommodate the central β-strand c. Typically, terminal extensions do not disrupt the β-sheet topology, but add to the existing structural core, like the N-terminal extension a′A′ (Fig. 1D and E) in the type II KH domain. However, the KH domain is not the first example for which rearrangement of β-sheet topology has been suggested. Serine protease inhibitors, serpins, are known to undergo a conformational change during which one of the β-strands is inserted between two hydrogen-bonded parallel β-strands (56). P-loop ATPases that display statistically significant sequence similarity in the Walker A and B motifs (57) are known to possess several distinct topologies that can be transformed into each other through β-sheet rearrangement (58,59). β-Sheet rearrangement was also postulated for triabin, which shares sequence similarity with lipocalins but possesses a distinct topology (14).
In summary, analysis of the available spatial structures revealed that there are two different KH domains that belong to different protein folds, but share a single KH motif. The KH motif is folded into a βααβ unit. In addition to the motif core, type II KH domains (e.g. ribosomal protein S3) include an N-terminal αβ extension and type I KH domains (e.g. hnRNP K) contain a C-terminal βα extension. The β-strand of this extension in type I KH is inserted into the β-sheet formed by the KH motif βααβ unit, offering a clear example of a rare structural rearrangement. KH domains demonstrate how proteins can change fold in the course of evolution.
The author is grateful to Hong Zhang and Sara Cheek for critical reading of the manuscript and the two anonymous reviewers for constructive suggestions.
Tel: +1 214 648 3386; Fax: +1 214 648 9099; Email: email@example.com