Motivation: Existing comparisons of protein structures are not able to describe structural divergence and flexibility in the structures being compared because they focus on identifying a common invariant core and ignore parts of the structures outside this core. Understanding the structural divergence and flexibility is critical for studying the evolution of functions and specificities of proteins.
Results: A new method of multiple protein structure alignment, POSA (Partial Order Structure Alignment), was developed using a partial order graph representation of multiple alignments. POSA has two unique features: (1) identifies and classifies regions that are conserved only in a subset of input structures and (2) allows internal rearrangements in protein structures. POSA outperforms other programs in the cases where structural flexibilities exist and provides new insights by visualizing the mosaic nature of multiple structural alignments. POSA is an ideal tool for studying the variation of protein structures within diverse structural families.
Availability: POSA is freely available for academic users on a Web server at http://fatcat.burnham.org/POSA
Protein structure comparison has a long history in computational biology and it is almost as old as the better-known sequence alignment problem (Philips, 1970). These two problems share many similarities and the former can be viewed as a variant of the latter using a specific scoring function. Comparison of structures is more difficult than sequence comparison because of the non-local similarity score (Kolodny and Linial, 2004); it is not even clear if the optimal solution of structure comparison exists (Godzik, 1996). Both structure and sequence comparison can be pairwise (with two input proteins) or multiple (with more than two input proteins). The multiple comparison problem is more difficult than pairwise comparison even for sequences and it requires using heuristics to find an approximate solution in polynomial time (Gotoh, 1999). Compounded difficulties at both levels (i.e. structure versus sequence and multiple versus pairwise) resulted in only a handful of multiple protein structure alignment algorithms being developed so far. Existing methods differ in both the heuristics used to find the approximate solution and the type of scoring function used, including STAMP (uses a preliminary multiple sequence alignment followed by the optimization of alignment using structure information) (Russell and Barton, 1992), MALECON (searches a library of pairwise alignments for all three-protein alignments, which are progressively expanded to include additional proteins and more spatially equivalent residues) (Ochagavia and Wodak, 2004), CE–MC (builds a progressive alignment by sequentially aligning structures followed by refinement using the Monte Carlo algorithm) (Guda et al., 2001), MUSTA (Leibowitz et al., 2001b), MASS (Dror et al., 2003) and MultiProt (Shatsky et al., 2004) (the last three methods use the geometric hashing technique to detect small structurally similar motifs for subsets of input structures).
In contrast to the great variety of methods of building the multiple alignments (both sequence and structure), their presentation is almost always the same: the tabular row–column. This format has many limitations; most importantly it provides very limited information about similarities present only in a subset of proteins being aligned. This limitation is especially severe in structural alignment and it is the main reason why in most multiple structure comparisons the end result is a very small conserved protein core (Matsuo and Bryant, 1999; Leibowitz et al., 2001a). On the other hand, to understand the entire picture of protein structure evolution, we may want to analyze not only the core regions but also partially conserved and divergent regions. For this purpose, we need to adopt a representation that can describe nonlinear structure of the multiple alignment by capturing information on partially conserved and unique regions.
Directed graphs (Minieka, 1978), including acyclic and cyclic graphs, have been adopted to describe nonlinear multiple sequence alignments (Lee et al., 2002; Raphael et al., 2004) and to describe the nonlinear relationship of structure segments in a pairwise sequence alignment method using predicted local structure information (Ye et al., 2003). Lee et al. (2002) applied directed acyclic graphs (DAGs), known also as partial order graphs (POGs), to develop a sequence multiple alignment method called partial order alignment (POA). Raphael et al. (2004) took Lee's approach a step further and developed A-Bruijn alignment to align sequences with repeated and/or shuffled domains using directed graph possibly containing cycles. In this paper we propose to adopt a directed acyclic graph representation of multiple alignment for structural alignments. We formulate multiple protein structure alignment as a process of iterative pairwise alignment of two multiple structure alignments, each represented as a directed acyclic graph (Fig. 1). Constraints of consecutiveness must be obeyed in aligning two POGs, i.e. two nodes (residues) that have no order relationship or have wrong order can never be found in an alignment, as shown in Figure 2b. In this sense, the directed acyclic graph is partially ordered. To keep with the naming convention introduced by Lee et al., we call our new method partial order structure alignment (POSA).
Protein structure alignment has another complication. Most existing approaches (we call them rigid-body alignments) ask a question: ‘What is the largest similar part of two proteins?’ Recently, several protein structure alignments reformulated the central question of protein structure comparison to: ‘How can one of the structures be rearranged to make it more similar to the other one?’ To answer this question, such programs incorporated conformational flexibility in structure comparison (Boutonnet et al., 1995; Wriggers and Schulten, 1997; Shatsky et al., 2002; Ye and Godzik, 2003). All existing multiple structure comparison programs are based on the standard, rigid-body alignment. In contrast, the method we introduce here (POSA) combines the POA approach and the flexible structure alignment FATCAT (Ye and Godzik, 2003), thus becoming the first algorithm to perform and visualize multiple alignments of protein structures, while accounting for their conformational flexibility. Aided by the visualization tools that we have developed for displaying POA of structures, POSA is able to provide a global picture of the similarity of multiple structure alignment, including the partially conserved regions.
2.1 Overview of the POSA method
POSA adopts a progressive strategy to build a multiple structure alignment given a set of input structures in the order provided by a guide tree (Grasso and Lee, 2004) (an arbitrary order can be treated as a special tree). A multiple structure alignment is represented as a POG and a single structure can be treated as a special POG containing only one structure (which in this case is completely ordered). POSA iteratively performs pairwise alignment of POGs following the guide tree until all the input structures are aligned together, as shown in Figure 1.
2.2 Partial order representation of structure alignment
We keep information about similarities and differences of the aligned structures by using POG representation of the alignment. A POG (or DAG) is a directed graph that contains no cycles (Minieka, 1978; Lee et al., 2002). A multiple alignment can be represented by a POG, in which nodes are the residues (aligned residues are merged) and the sequential order of the residues defines the edges between nodes. In the partial order format, we keep the information of the nonaligned residues (Fig. 1), which is essentially lost in the standard tabular row–column multiple alignment format. For clarity, Figure 2a demonstrates a simple case with only two structures aligned. Aligned positions are merged while positions that are not aligned form the branches in the graph (Fig. 2b). Figure 4 illustrates an actual example of POG generated by POSA.
2.3 Flexible structure comparison
We divide the multiple flexible structure alignment into iterative pairwise alignments of POGs or single structures (which can be treated as a special type of a POG). In POSA we use the same formalism of the flexible structure alignment as in FATCAT, a pairwise flexible structure comparison that we developed recently (Ye and Godzik, 2003). The only, but fundamental, difference is that in FATCAT we align two protein structures, whereas in POSA we also align two structure alignments.
2.3.1 Pairwise flexible structure alignment
The pairwise flexible structure alignment program FATCAT starts by identifying a list of aligned fragment pairs (AFPs) (Shindyalov and Bourne, 1998), superpositions of two continuous fragments, in the two proteins to be compared (Fig. 3a). Two fragments of length 8 form an AFP if the root mean square deviation (RMSD) of their Cα atoms is less than a certain threshold (3.0 Å in the version available on the Web site) (Ye and Godzik, 2003). The FATCAT structure alignment is formulated as an AFP chaining process (Gusfield, 1999) with connections provided by extensions, gaps or twists (Fig. 3b). In contrast to a standard structural alignment, a rotation/translation (twist) can be introduced between two consecutive AFPs if it results in a substantially better superposition of the structures. FATCAT integrates simple extensions, gaps and twists into a unified scoring function and performs the alignment and hinge detection simultaneously by using dynamic programming. If we denote S(k) as the best score ending at AFP k, it can be calculated from the best ending at previous AFPs that can be connected with AFP k subject to the consecutive constraints (Fig. 3b),
2.3.2 Multiple pairwise flexible structure alignment
POSA is based on iterative pairwise alignment of POGs created at the previous stages. We use the same formalism in aligning two POGs representing alignments of structures as in chaining AFPs in FATCAT. As shown in Figure 3d, in POSA, AFPs and their connections must obey sequential orders of the residues, i.e. two residues that have no sequential order (Fig 2b) can never be found in an AFP or in the chain of AFPs on the higher level. Owing to the presence of branches, the alignment graph of POGs is no longer planar; instead, it is high-dimensional, as shown in Figure 3d. In the example shown in Figure 3d, the searching space of POA of protein C and POG of the alignment of proteins A and B is three-dimensional. Still the best path of AFPs is the optimal alignment of two POGs subject to the constraints discussed above, and finding the optimal alignment can be solved by using dynamic programming (Equation 1). The computation complexity of POSA is O(L2N3), where L is the average length of the input structures and N is the number of input structures.
2.3.3 Build a POG of multiple structure alignments considering flexibility
As mentioned above we can generate POGs of structure alignments by merging equivalent positions of POGs from an earlier generation, which can represent single structures or previously aligned groups of structures. In the case of structure alignments without twists, all the equivalent positions will be merged; otherwise, we merge only the equivalent positions in the largest block (alignments are divided into n + 1 blocks by n twists) and record the information of equivalent positions from the remaining blocks. The reason we do not incorporate the flexibility in constructing POA is that we want to keep the difference of structures as long as possible during the comparison process (one of the major reasons to adopt POG representation of multiple alignments (Lee et al., 2002)). However, we have this information on record, and it is used at the end of a POSA process to recover the final alignment with flexibility.
2.4 POSA outputs
POSA reports a POG of the final multiple alignment with all input structures and the superposition of all the structures according to the alignment. In the case when twists are detected while comparing structures, POSA can also report a superposition of all the input structures after changing their coordinates according to the flexible alignment. To compare the performance of POSA with the existing programs (which always focus on the common core), we also report the length of common core and average RMSD (here, defined as the average root mean square of all the pairs of each position for all positions in the common core).
2.5 Guide tree
A small yet important issue that we have not addressed so far is the guide tree that we use to guide the process of multiple structure alignment. The problem of building the phylogenetic tree of proteins based purely on structural information is still open, mostly because despite several efforts (Hall and Barlow, 2003; O'Donoghue and Luthey-Schulten, 2003) no model has yet been proposed for the evolution of protein structures. In this paper, we do not aim to attack this problem; instead, we only want to provide guide trees for practical reasons. Guide trees can be generated in various ways using sequence and/or structure similarity. For the proteins that are very similar to each other in terms of both structure and sequence, trees based on sequence identity are preferable because we have a better understanding of sequence evolution. For distantly homologous structures or nonhomologous but similar structures, building trees of structures based on their structural similarity becomes the only choice because no sequence similarity can be used. Since we are performing multiple structure alignment for proteins with structural similarity, we decided to adopt the latter approach. We first calculate a distance matrix, using the P-value of the similarity of two structures from FATCAT as a measure of distance (Ye and Godzik, 2004), of all pairs for a group of structures to be aligned and then perform a single linkage clustering (Lance and Williams, 1967) on the distance matrix to generate a guide tree (program FATCAT2Tree was implemented for this purpose). The results reported in the Results section all used guide trees built using this approach.
We implemented the POSA algorithm in C++ on a Linux platform with 1.8 GHz CPU. The running time of a POSA process varies from seconds to hours, depending on the number of input structures and their sizes. For example, the running time for comparing three calmodulin structures (details in Results section) is 13 s. We also implemented a public Web server for POSA at http://fatcat.burnham.org/POSA. The POG generated by POSA can be visualized using the dot function from the graphviz package (http://www.graphviz.org) and the superimposed structures are visualized online using the Chime plug-in. The graphs of protein structure superposition in this work were prepared using the molecular graphics software Pymol (http://pymol.sourceforge.net/).
We applied POSA to several textbook examples of multiple structural alignments and several structural families that have conformational flexibilities to show the advantages of considering flexibility in multiple structure alignment. Performance of POSA was compared with several alignment programs including MultiProt, CE–MC and MALCON using results from the original papers if available or using the corresponding Websites to calculate the alignments in other cases. The results, described in detail below, show that POSA achieved comparable performances in the examples in which no flexibilities are observed (the first example below compares globin structures). More importantly, POSA achieved better performance in the cases with proteins having conformational flexibility. In all cases, the POSA visualization approach provides a better view of the relationship of structures with high structural divergence. In the following subsections we discuss the details of four examples—comparison of globins, calmodulin-like proteins, tRNA-synthetases and Rossmann fold structures—focusing on the demonstration of the importance of using flexibilities in the structure comparison and the advantages of using POG representation for revealing the mosaic nature of protein structures. Finally, we ran POSA on 399 structure families and compared the results with HOMSTRAD alignments. The comparison shows that POSA-generated alignments are compatible with the HOMSTRAD alignments for the majority of the families, and more importantly POSA-detected flexibility in 27 families in which the flexibility introduced into the alignment is important for improving the alignment quality and for better understanding of the structures.
3.1 Globin family
The globin family has been extensively studied in the literature and is considered to be a relatively easy case for multiple structure comparison. We ran POSA on the example of 15 globin structures as described in Ochagavia and Wodak (2004) and compared the POSA result with the alignments produced by several other methods. POSA detected a common core of 71 positions with an average RMSD of 2.29 Å without introducing any structural flexibility into the alignment. For comparison, MALECON's alignment has 59 residues with an RMSD of 1.73 Å, MALECON+'s alignment has 55 residues with an RMSD of 1.30 Å, the largest alignment reported by MultiProt that includes all 15 input structures has 50 aligned positions with an RMSD of 1.55 Å (note that MultiProt reports all local similarities of different lengths and different input structures), STAMP's alignment has 55 residues with an RMSD of 1.30 Å and CE–MC's alignment has 95 aligned positions with an RMSD of 2.37 Å. In short, as expected, different programs generated different alignments with various lengths and RMSDs, and the results could be summarized by the general rule of ‘the longer alignment, the higher the RMSD’. Nevertheless, superpositions of the globin structures according to the alignments from different programs are very similar, showing that the performances of these programs are comparable for this case.
3.2 Calmodulin-like proteins
The advantages of POSA are more visible on the example of comparing calmodulin-like proteins (Crivici and Ikura, 1995). Calmodulin is a Ca2+-binding protein that is a key component of the Ca2+ second-messenger system and is involved in the regulation of many biochemical and physiological processes. It has two domains with each domain having a pair of EF-hands (helix–loop–helix Ca2+-binding motif). Each of the two domains may exist in a calcium-free or calcium-bound state, and in addition, domain movements result in an open or closed conformation of the whole protein. Some proteins from this family adopt permanently closed or open conformations. The calmodulin family exemplifies the function of domain movement and the flexibility of structure in the regulation of cell processes.
At first we consider only three calmodulin-like proteins with different conformational states for comparison. These three proteins are troponin C from chicken [an open, classical ‘dumbbell’ conformation of two pairs of EF-hand (PDB code 1ncx, Fig. 4a)], sarcoplasmic calcium-binding protein from Branchiostoma lanceolatum (PDB code 2sas, Fig. 4b; closed conformation in which both domains are pressed to each other and form a compact, globular structure) and EHCABP from Entamoeba histolytica (PDB code 1jfj, chain A, Fig. 4c; intermediate, partly open structure). When these three structures are compared by the standard, rigid-body multiple structure alignment programs, only one domain of the structures can be aligned. For instance, the largest alignment detected by MultiProt has 46 positions with an RMSD of 1.70 Å; the core detected by CE–MC has 62 positions with an RMSD of 6.79 Å. In contrast, POSA can get a much longer alignment of these three structures spanning their whole length (alignment length of 132 positions with an RMSD of 2.80 Å) when the flexibilities detected by POSA are incorporated into the superposition (Fig. 4e and h).
Another interesting observation from this POSA comparison is that the sarcoplasmic calcium-binding protein is significantly longer than the other two proteins (Fig. 4h) and the difference is mostly located in one of the EF-hand half domains (highlighted in Fig. 4e), which is 15 residues longer (indicated by an arrow in Fig. 4h). No data on the analysis of this long helix–loop–helix have ever been reported, but this difference is so interesting that it may deserve further exploration.
We further extended the list of input calmodulin-like proteins to include all the domains found in the calmodulin-like fold in the SCOP classification (1.65 release) (Andreeva et al., 2004), except for proteins with only one pair of EF-hands. Including one representative from each family resulted in 16 proteins. The interesting result from POSA is that some of these proteins are more similar in the N-terminal domain while others are more similar in the C-terminal domain such that at the end no common core in the sense of rigid-body comparison can be detected. However, after incorporating the flexibilities detected during the POSA comparison into the POG construction and superposition of the structures accordingly, we can recover a common core of 16 input calmodulin-like proteins that includes 70 positions with the aligned residues spanning both N-terminal and C-terminal domains with an average RMSD of 3.27 Å.
3.3 tRNA synthetases
The aminoacyl-tRNA synthetases play a crucial role in protein synthesis by covalently linking an amino acid to the tRNA that bears the corresponding triplet anticodon. These synthetases use induced fit mechanism to achieve high specificity required for accurate protein synthesis; at the same time, the binding of the enzyme with their cognate amino is not very tight for efficient release of product (Torres-Larios et al., 2003). Four tRNA-synthetases were studied here including histidyl-tRNA synthetase (PDB code 1adj), prolyl-tRNA synthetase (PDB code 1hc7), threonyl-tRNA synthetase (PDB code 1qf6) and glycyl-tRNA synthetase (PDB code 1ati). All are multiple domain proteins and share two common domains, a catalytic domain and an anticodon-binding domain (Fig. 5a). This case was studied by MultiProt to show that it can detect the two common domains in the four input structures (Shatsky et al., 2004). MultiProt reports two superpositions of these four structures, focusing on the catalytic domain and the anticodon-binding domain, respectively. POSA also detected the similarities in both domains. The important difference, however, is that POSA aligns both domains in the same picture, allowing us to clearly see the relationship of these two domains and simultaneously see how different the histidyl-tRNA synthetase (1adj) structure is in comparison with the other three. The simultaneous alignment of all four proteins is possible only because of the detection of the conformational change in 1adj at a hinge, and as a result we obtain an alignment spanning the two common domains with a common core of 278 aligned positions and an average RMSD of 2.92 Å (Fig. 5c).
3.4 NAD(P)-binding Rossmann fold
The typical feature of the Rossmann fold is a three-layer sandwich structure with a β-sheet between two layers of α-helix (Rossmann et al., 1975) and sheet order 321456. According to SCOP (1.65 release), this fold consists of one superfamily, which in turn has 10 families. We selected 10 representative structures by using one structure from each family to run POSA analysis. Only a small common core of 48 residues with an average RMSD of 2.92 Å could be detected. The small size of the common core in this fold is also confirmed by MultiProt (Shatsky et al., 2004), which reported an even smaller common core of 32 residues but with a smaller RMSD of 1.30 Å. The POSA result shows that only three β-strands and two α-helices from the three layers are relatively conserved among all the Rossmann fold proteins (Fig. 6c), which is a small part of their overall structures (Fig. 6a and b; in each protein, the regions corresponding to the common core are shown in red while the remaining regions are shown in green). However, when we visualize the superposition of the input structures showing not only the common core (shared by all the input structures), but also partly conserved regions, we can recover the three-layer overall structure much better, as shown in Figure 6d. Comparing this picture with the overall structures of the proteins in this fold (Fig. 6e and two individual structures shown in Fig. 6a and b), the mosaic nature of the proteins in this family is quite clear, suggesting that structural divergence has evolved in this family with the inclusion of additional structural elements to the common core.
3.5 POSA performance on 399 structure families
We collected 399 homologous structure families, each with more than two structures based on HOMSTRAD (Dec 15, 2004, version, including 1032 families), a database of protein structure alignments for homologous families (Mizuguchi et al., 1998). HOMSTRAD alignments were prepared using the structure alignment programs MNYFIT, STAMP and COMPARER followed by manual examination of individual cases. We ran POSA on this large dataset to test its general performance. The details of POSA results with side-by-side comparison with HOMSTRAD alignments are available at http://fatcat.burnham.org/POSA/POSAvsHOM.html. Overall, POSA-generated alignments are slightly shorter than the HOMSTRAD alignments but result in slightly better superimposition of input structures with smaller RMSDs. The (COREPOSA −COREHOM)/COREHOM and (RMSDPOSA − RMSDHOM) averaged over 399 families are −4% and −0.797 Å, respectively, where COREPOSA and RMSDPOSA (COREHOM and RMSDHOM) are the size of the common core and the corresponding RMSD of the POSA (HOMSTRAD) alignments. These two values are −3.8% and −0.528 Å, respectively, when excluding 27 structure families with flexibilities detected by POSA. We can conclude that even though POSA was developed aiming to address the flexibility and divergence of protein structures neglected in all previous multiple structure alignment programs, it generally performed well in comparison with HOMSTRAD alignments.
POSA revealed conformational flexibility, an important piece of information for understanding structures, in 27 structure families. Such information is not at all present in HOMSTRAD. In many of these families, POSA generated much-improved superimposition of structures with smaller RMSD by transforming some structures according to the flexibility it detected. For instance, POSA generated an alignment for family ‘rhv’ of virus coat proteins (including Theiler virus coat protein 1tme, Mengo virus coat protein 2mev, rhinovirus serotype 1 coat protein 1r1a, rhinovirus 14 coat protein 4rhv, poliovirus coat protein 2plv and foot-and-mouth disease virus coat protein 1bbt) with a common core of 586 positions and an RMSD of 2.12 Å when incorporating flexibility; the size of the common core dropped to 372 (RMSD 2.45 Å) without incorporating flexibility. In contrast, HOMSTRAD alignment has a common core of 597 but with an extremely large RMSD (21.93 Å). We can expect that real advantages of POSA would be seen in aligning structures not present even in HOMSTRAD or similar databases, because of their inability to detect even a common core for the reason of extreme divergence of structures.
POSA was designed to describe flexibilities of protein structures and to consider both common and divergent regions in multiple structure comparison. By providing a visualization and classification of differences among structures, POSA provides an ideal tool for studying the structural divergence of protein structure families. As we have seen from the examples presented above, it is important to study not only the similarities but also the differences of structures for a better understanding of protein structures in terms of structure–function relationship and the evolution of structures. There is no doubt that new tools, such as POSA, are increasingly important in the age of a rapid growth of the protein structure databases, with most of the growth coming in from solving new structures in large protein families. With the anticipated goal of identifying most new folds by the structural genomics initiative, the questions of analysis of structural divergence in large protein families will surely come to the forefront of structural biology.
This work was supported by NIH grant GM63208 and NSF grant DBI-0349600.