Abstract

Accurate sequence alignments are crucial for modelling and for providing an evolutionary picture of related proteins. It is well known that reliable alignments are hard to obtain for distantly related proteins. Three thousand and fifty-two alignments of 218 pairs of protein domain structural entries with <40% sequence identity, belonging to different structural classes and covering diverse domain sizes in both length-rigid and length-variable domain pairs, were performed using 12 programs. Structural parameters such as root mean square deviation, secondary-structural content and equivalences were considered for critical assessment. Methods that compare fragments and permit twists and translations align well for distant relationships and in the presence of length variations.

INTRODUCTION

The interpretation of three-dimensional (3D) structural similarity presents a difficult scientific challenge, for it is hard to distinguish between similarities that merely reflect the physical constraints on protein folding [1] and similarities that arise from evolutionary relationships. With increasing evolutionary distance, sequences progressively diverge, even within a protein family. These alterations are not uniformly distributed throughout the structure. The common folding pattern is conserved and, usually, a large central hydrophobic core of the structure remains similar [2, 3], whereas other parts of the structure, such as loop regions, change conformation more radically.

Nonetheless, careful comparisons of 3D structures of proteins have often revealed distant evolutionary relationships [4]. This is promising because protein structures evolve less rapidly than sequences: folds are highly conserved, even though the sequences that encode them may not be recognizably similar at the superfamily level. Structure comparison/alignment methods are usually applied in order to establish structural, evolutionary and functional relationships between proteins [5]. The structure alignment of functionally related proteins provides insight into functional mechanisms and has been effectively applied in the functional annotation of proteins whose structures have been determined [6].

Detailed structural comparisons of proteins are often used to reveal (i) intricate differences between two independent structure determinations of the same protein with/without ligands, (ii) differences between NMR and crystal structure determinations or between sets of closely related proteins or (iii) unexpected structural similarities in whole genes or domains, suggesting remotely similar biological functions and therefore superfamily-level relationships. Particularly in the era of structural genomics initiatives, the third application, detecting superfamily relationships by structure comparison, is attractive since it may provide important clues about the function of a whole gene or of particular domains.

Currently, a considerable number of structural comparison tools are available to the structural biologist [5, 7]. In general, these methods are based on different algorithms and have been designed for various applications. In particular, SSAP [8] and LSQMAN [9] mainly focus on the matches of secondary-structure elements. Methods like MUSTANG [10], DaliLite [11], MINRMS [12] and MATT [13] search for compatible pairs of fragments with similar intermolecular Cα distances, and the fragments are combined into a final alignment using different strategies. MATT [13] permits structural allowances such as twists, where local flexibility is introduced between fragments in intermediate steps; small translations and rotations are temporarily allowed to bring sets of aligned fragments closer. Methods like FATCAT [14] are able to align subdomains in different relative orientations, resulting from protein flexibility or from evolutionary divergence, based on the String model [15]. Another strategy is to consider not only the backbone geometry but also the physicochemical environment of each residue in order to align the two structures. Some tools match secondary-structure elements to obtain a first alignment efficiently, which is later refined. MATRAS [16], in particular, matches secondary-structure elements in the first stage of alignment; environmental properties and Cα distances are then applied to obtain the final alignment. MATRAS [16] applies a Markov transition model of evolution to derive different types of scoring functions. Other structure comparison methods, such as COMPARER [17], recruit structural features such as secondary structure, solvent burial and hydrogen-bonding patterns to recognize the structural core and variable regions, to guide the placement of gaps and to obtain reliable alignments.

Several methods are relatively good at detecting structural similarity, yet comparatively poor in terms of the accuracy of the structural alignment they generate [11]. Some methods work well for certain SCOP [18] structural classes (α-rich, β-rich, α/β folds and α + β folds), but not so well for others [11]. There is also a need to assess the accuracy of these structure-based alignments regarding the correct identification of equivalent residues in terms of structure, evolution or function. Until now, structure comparison methods have been mostly evaluated in terms of their capability to identify proteins with similar folds or to recognize homologous proteins [7, 19, 20, 21]. They have also been evaluated relative to the level of structural similarity detected, where better performance corresponds to longer alignments and better rigid-body superposition, or to a better score according to other geometric measures [20, 22].

Structure-based sequence alignments created by different programs can be dissimilar even when the structures are related [23, 24]. In addition, a few tools perform very fast comparisons between a given query protein and a structural database and provide structural similarity scores for each comparison, but no alignment [25–27]. Structural alignment of distantly related proteins in a superfamily, which have <40% sequence identity and greater length variation, still remains a challenging task.

We analysed and compared structural and sequence features following structure-based alignment for 218 pairs, produced by 12 methods based on different algorithms. DALI [11], SSAP [8], LSQMAN [9], MINRMS [12], MATRAS [16], MATT [13], MUSTANG [10] and FATCAT [14] (for structure-based alignment) and MALIGN [28], CLUSTALW [29] and T-COFFEE [30], each followed by the COMPARER [17] programme (for sequence followed by structure-based alignment), were applied to a representative set of distantly related proteins from the ASTRAL database [31] (see ‘Methods’ section for choice of data set). The methods were compared in terms of the extent of structural similarity detected according to the resulting alignments and in terms of alignment consistency, without a gold-standard alignment. The functional relevance of such alignments is illustrated with examples. Finally, the different types of structure comparison challenges faced when aligning distantly related proteins are described, and the results for the pairs were analysed by considering three parameters: normalized root mean square deviation (RMSDnl), normalized matches (Mnl) and normalized secondary-structure equivalences (SSTnl) (for more detail, see ‘Methods’ section).

In general, these methods provide a measure of structural similarity between proteins, which is used to identify similar folds and evolutionarily related proteins. Since protein structure is more conserved in evolution than sequence, structure alignments and sequence followed by structure-based alignments of remotely homologous proteins are considered more reliable than sequence-based alignments for identifying structurally equivalent residues. Sequence-based programs were purposefully included to note the extent to which an alignment improves when structural data are used to refine it.

METHODS

Data set

Protein domain pairs were selected that belong to the same superfamily, where no two protein domains have >40% sequence identity. Structural coordinates were downloaded from the ASTRAL compendium and processed further. The superfamilies were sorted into the five major structural classes defined by SCOP [18] (α, β, α+β, α/β and small domains) and into five bins by domain size (1–50, 51–100, 101–200, 201–400 and >400 residues), as shown in the flowchart (Figure 1).

Figure 1:

Flowchart for comparison of structure alignment programs on distantly related proteins. The flowchart describes data selection for the length-rigid and length-varying sets, the programs compared and the assessment of alignments.

Data preparation based on length variation

The length difference between a pair has been found using  

formula
where Ld is the percentage of length difference between domains; Ds1 and Ds2 are the domain sizes and Lmax is the length of the largest domain:  
formula
The Ld value was calculated for each pair on an all-against-all basis within a superfamily in order to group the pairs into length-rigid domain pairs (Ld ≤ 25%) and variable-length domain pairs (Ld > 25%), referred to as the length-rigid and length-varying data sets. Where possible, six pairs were selected from each of the five structural classes and each of the five domain-size bins on the basis of the Ld value (separately for the length-rigid and length-varying data sets). We obtained 129 and 89 pairs for the length-rigid and length-varying sets, respectively, as opposed to the expected 150 in each set, because too few pairs satisfied the Ld criteria.
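The grouping by length difference can be illustrated with a minimal Python sketch. It assumes that Ld is the absolute difference between the two domain sizes expressed as a percentage of the larger domain, which is our reading of the definitions of Ds1, Ds2 and Lmax; the function names are hypothetical and are not part of any program used in this work.

```python
def length_difference_percent(ds1: int, ds2: int) -> float:
    """Percentage length difference Ld between two domains.

    ASSUMPTION: Ld = |Ds1 - Ds2| / Lmax * 100, with Lmax = max(Ds1, Ds2).
    """
    if ds1 <= 0 or ds2 <= 0:
        raise ValueError("domain sizes must be positive")
    l_max = max(ds1, ds2)  # Lmax: length of the larger domain
    return abs(ds1 - ds2) / l_max * 100.0


def classify_pair(ds1: int, ds2: int, threshold: float = 25.0) -> str:
    """Label a pair 'length-rigid' (Ld <= threshold) or 'length-varying'."""
    ld = length_difference_percent(ds1, ds2)
    return "length-rigid" if ld <= threshold else "length-varying"


if __name__ == "__main__":
    # Illustrative domain sizes only, not taken from the data set.
    for pair in [(100, 75), (200, 100), (320, 310)]:
        print(pair, round(length_difference_percent(*pair), 1), classify_pair(*pair))
```

Under this assumption a pair of 100 and 75 residues sits exactly on the 25% boundary and is therefore treated as length-rigid.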

Comparison of alignments

Sequence followed by structure-based alignment

We employed the standalone tools MALIGN [28], T-COFFEE [30] and CLUSTALW [29], which are purely sequence-based methods, to align the protein pairs in both the length-rigid and the length-varying sets. These alignments were then re-aligned using COMPARER [17], a structure-based sequence alignment program, to obtain the final structural alignment.

Structure-based alignment

The structure-based alignment programs DALI [11], SSAP [8], LSQMAN [9], MINRMS [12], MUSTANG [10], FATCAT [14], MATT [13] and MATRAS [16] were used to align the protein pairs of both the length-rigid and the length-varying sets. SSAP [8] was run through its online server. Results from FATCAT [14] and MATRAS [16] were obtained by direct correspondence with, and by the kind help of, the authors. As far as possible, all the programs were automated using Perl and shell scripts to obtain the alignments, except SSAP [8] (run manually on the online server).

Annotation and validation of the alignments using Nett normalization score

The JOY program [32] was used to annotate the alignments from the programs. Three parameters were taken from the alignment output files to normalize and validate the alignment: RMSD, number of fitted points and secondary-structure equivalences. A secondary structure is considered equivalent or conserved at an alignment position if the same secondary structure is adopted by 75% or more of the members of a superfamily. One of the JOY output files (.tem) was used to calculate the secondary-structure equivalences.

formula
where Wnl is the nett normalized value and

  • RMSDnl = 1 − (RMSD/10) (normalized RMSD)

  • Mnl = (No. of matches/average length of respective domain pair)

  • SSTnl = (conserved secondary-structural positions/average domain length)

RMSDnl, Mnl and SSTnl can each range from 0 to 1 and should be close to 1 for a significant alignment. Results obtained from the nett normalized values were divided into four broad ranges:

  • Best = Wnl ≥0.75 and <1.0

  • Good = Wnl ≥0.5 and <0.75

  • Average = Wnl ≥0.25 and <0.5

  • Poor = Wnl ≥0.0 and <0.25.
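The scoring scheme above can be sketched in Python as follows. The three normalized terms follow the definitions given in the list; how they are combined into Wnl is an assumption here (an unweighted mean, which keeps the result between 0 and 1), as is clamping RMSDnl to zero for RMSD values above 10 Å. All function names are hypothetical.

```python
def rmsd_nl(rmsd: float) -> float:
    """RMSDnl = 1 - (RMSD / 10); clamped to [0, 1] (clamping is an assumption)."""
    return min(1.0, max(0.0, 1.0 - rmsd / 10.0))


def m_nl(matches: int, len1: int, len2: int) -> float:
    """Mnl = number of matches / average length of the domain pair."""
    return matches / ((len1 + len2) / 2.0)


def sst_nl(conserved: int, len1: int, len2: int) -> float:
    """SSTnl = conserved secondary-structural positions / average domain length."""
    return conserved / ((len1 + len2) / 2.0)


def nett_score(rmsd: float, matches: int, conserved: int, len1: int, len2: int) -> float:
    """Nett normalized value Wnl.

    ASSUMPTION: Wnl is the unweighted mean of RMSDnl, Mnl and SSTnl.
    """
    terms = (rmsd_nl(rmsd), m_nl(matches, len1, len2), sst_nl(conserved, len1, len2))
    return sum(terms) / len(terms)


def category(wnl: float) -> str:
    """Map Wnl to the four ranges; a score of exactly 1.0 is counted as 'Best'."""
    if not 0.0 <= wnl <= 1.0:
        raise ValueError("Wnl must lie between 0 and 1")
    if wnl >= 0.75:
        return "Best"
    if wnl >= 0.5:
        return "Good"
    if wnl >= 0.25:
        return "Average"
    return "Poor"


if __name__ == "__main__":
    # Illustrative values only: RMSD 2.0, 80 matches, 60 conserved positions,
    # two domains of 100 residues each.
    w = nett_score(2.0, 80, 60, 100, 100)
    print(round(w, 3), category(w))
```

With these illustrative inputs the three terms are 0.8, 0.8 and 0.6, so the sketch places the alignment in the ‘Good’ range.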

3D-plots and histograms were used to perform detailed analysis on the basis of the distribution of points for the length-rigid data set and length-varying data set.

Cumulative nett normalized scores

Further, the average performance of each of the programs was cumulatively normalized over all domain sizes (bins) and all structural classes and expressed in the same range from 0 to 1.
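A minimal sketch of this cumulative step is given below, assuming that the cumulative score of a program is the plain mean of its nett scores over all bins and classes; a mean of values between 0 and 1 stays in the same range. The record layout and the scores in the example are made up for illustration.

```python
from collections import defaultdict
from statistics import mean
from typing import Dict, Iterable, List, Tuple

# One record per alignment: (program, size bin, structural class, nett score).
Record = Tuple[str, str, str, float]


def cumulative_scores(records: Iterable[Record]) -> Dict[str, float]:
    """Average the nett scores of each program over all bins and classes.

    ASSUMPTION: the cumulative score is an unweighted mean per program.
    """
    by_program: Dict[str, List[float]] = defaultdict(list)
    for program, _size_bin, _sclass, score in records:
        by_program[program].append(score)
    return {program: mean(scores) for program, scores in by_program.items()}


if __name__ == "__main__":
    # Made-up scores, for illustration only.
    toy = [
        ("ProgramA", "1-50", "alpha", 0.5),
        ("ProgramA", "51-100", "beta", 0.7),
        ("ProgramB", "1-50", "alpha", 0.4),
    ]
    print(cumulative_scores(toy))
```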

Box-and-whiskers analysis

A box-and-whisker plot [33] is a convenient way of graphically representing a group of numerical data through its five-number summary: the smallest observation (sample minimum), lower quartile (Q1), median (Q2), upper quartile (Q3) and largest observation (sample maximum). These plots may also indicate the presence of outliers, if any. This representation was employed to project and analyse the cumulative nett normalized scores.
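The five-number summary underlying such a plot can be computed with the Python standard library, as in the sketch below. The quartile convention (the "inclusive" method of `statistics.quantiles`) is an assumption; plotting packages may use slightly different conventions, and the helper name is hypothetical.

```python
from statistics import quantiles
from typing import Sequence, Tuple


def five_number_summary(values: Sequence[float]) -> Tuple[float, float, float, float, float]:
    """Return (minimum, Q1, median, Q3, maximum) for a sample.

    Quartiles use the 'inclusive' method, which treats the data as the
    whole population of interest (an assumption of this sketch).
    """
    data = sorted(values)
    if len(data) < 2:
        raise ValueError("at least two observations are required")
    q1, q2, q3 = quantiles(data, n=4, method="inclusive")
    return data[0], q1, q2, q3, data[-1]


if __name__ == "__main__":
    # Illustrative sample only.
    print(five_number_summary([5, 1, 3, 2, 4]))
```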

RESULTS AND DISCUSSION

There are ∼70 structure comparison programs and 30 sequence comparison programs available as servers or standalone versions. Although these numbers keep growing, a varied set of programs was selected for this analysis (see Table 1 for a list of programs used). This structure comparison analysis is unique in that domain length, in different ranges and distributed across the five main structural classes, has been considered for both length-rigid and length-varying distantly related domain pairs. Two hundred and eighteen domain pairs, as summarized in Tables 2 and 3, were selected based on the Ld value as detailed in the ‘Methods’ section. Three thousand and fifty-two alignments were analysed based on the nett normalized values derived from RMSD, number of fitted points and secondary-structure (SST) equivalences.

Table 1:

Programs used for structure comparison work

NAME | Description | Class | Type | Availability | Reference
Sequence-based method
    CLUSTALW | Sequence-based alignment | Progressive alignment | Multi | Download | (29)
    MALIGN | Multiple Sequence Alignment Programme | Progressive alignment | Multi | Download | (28)
    T-COFFEE | Sequence-based alignment | Progressive alignment | Multi | Download | (30)
Pairwise structure-based sequence alignment
    SSAP | Sequential Structure Alignment Programme | SSE | Pair | Server | (8)
    DaliLite | Distance Matrix Alignment | C-Map | Pair | Standalone | (11)
    LSQMAN | Superposition by Least-mean square deviation | SSE | Pair | Standalone | (9)
    MATRAS | MArkovian TRAnsition of protein Structure | Cα and SSE | Pair | Author request | (16)
    Minrms | Determining Protein Similarity by minimal root-mean-squared-distance | Cα | Pair | Standalone | (12)
    TOPS++FATCAT | Flexible Structure AlignmenT by Chaining Aligned Fragment Pairs Allowing Twists derived from TOPS+ String Model | Cα | Pair | Server/author request | (15)
Multiple structure-based sequence alignment
    COMPARER | Comparison and alignment based on structural features | SST, H-bonds, gaps | Multi | Author request | (17)
    MUSTANG | Multiple Structural AligNment AlGorithm | Cα and C-Map | Multi | Standalone | (10)
    Matt | Multiple Alignment with Translations and Twists | Cα | Multi | Download | (13)
Table 2:

Statistics of pairs: number of superfamilies used for domains of similar length (Ld ≤ 25%)

 <50 50–100 100–200 200–400 >400 
α 
β 
α/β 
α + β 
Small domain 

Total = 129 pairs.

Table 3:

Statistics of pairs: number of superfamilies used for domains of varying length (Ld > 25%)

 <50 50–100 100–200 200–400 >400 
α 
β 
α/β 
α + β 
Small domain 

Total pairs = 89.

3D-plot analysis

The 3D-plots were generated with RMSDnl on the x-axis, the normalized number of fitted points on the y-axis and the normalized secondary-structure equivalences on the z-axis. A program is considered to perform well if its points occupy the top-left corner of the 3D-plot. Likewise, a program with a cumulative nett normalized score close to 1.0 is considered to provide good alignments. RMSD is a simple measure of the ‘goodness’ of a structural alignment and is shown for different programs in Figure 2 for one pair of domains. RMSD values and SST equivalences for both the sequence-based methods and the sequence followed by structure-based alignment methods are shown in Figure 3. The performance of the programs was analysed for different length-rigid and length-varying superfamilies arranged according to domain size (bins) and structural class (classes) and is presented as 3D-graphs for convenient analysis.

Figure 2:

An example of the influence of RMSD, a simple measure of the ‘goodness’ of a structural alignment. The pair is from the Rudiment single hybrid motif superfamily (SCOP superfamily 51246), shown with matches, RMSD and percentage of average secondary-structural equivalences obtained by various programs. ‘clustalw_Bc’ is the CLUSTALW alignment before it is passed to a structure-based program like COMPARER; ‘clustal_Ac’ is the CLUSTALW alignment after COMPARER, and so on.

Figure 3:

Example of the improvement in alignment quality from a purely sequence-based method to sequence followed by structure-based alignment. A pair from the Chaperone J-domain superfamily (SCOP code: 46565) was aligned, the secondary structure (α) was annotated by the JOY program [32] and the relevant assessment values are given.

To begin with, let us examine superfamilies in the α-class (domains rich in helices) of different bin sizes from the length-rigid and length-varying data sets (Figure 4). The length-rigid pairs were relatively easy to align using any of the programs when compared to the length-varying set. The 3D-plots give a clear picture of the differences in the performance of the programs. Most of the programs perform well for the alignment of length-rigid small domains from different structural classes (Supplementary Data; Figure 4; Supplementary Figures S1–S4). The secondary-structural equivalence, however, tends to be low when aligning large domains. For the length-varying domain pairs of different structural classes, none of the programs achieves high scores. Algorithms, like MATT, that permit twists and turns in protein domains perform better. The nett score was better (Figure 4) for FATCAT (nett score 0.549) [14], MINRMS (nett score 0.553) [12], SSAP (nett score 0.572) [8] and MATT (nett score 0.531) [13], whereas the nett score was lower, because of either a low normalized SST or RMSD value, for MATRAS (nett score 0.501) [16], MUSTANG (nett score 0.418) [10] and LSQMAN (nett score 0.456) [9]. The same trend was also observed for the large domains. The 3D-plots depicting the nett scores of individual programs for different size bins and structural classes, for both length-rigid and length-varying domain pairs, can be obtained from the URL: http://caps.ncbs.res.in/download/strc_compn.

Figure 4:

The 3D-plot analysis of the α-class with different bin sizes from the length-rigid data set (first row of 3D plots) and the length-varying data set (second row of 3D plots). No all-α pair of <50 residues met the criteria for the length-varying data set. Symbols denote the nett scores corresponding to the alignments of the following programs: black filled circle: CLUSTALW; open circle: CLUSTALW + COMPARER; inverted red filled triangle: MALIGN; inverted green filled triangle: MALIGN + COMPARER; yellow filled square: T-COFFEE; black filled square: T-COFFEE + COMPARER; red filled rhombus: DALI; blue filled rhombus: LSQMAN; open triangle: SSAP; pink filled triangle: MINRMS; green filled circle: MUSTANG; yellow filled circle: MATRAS; blue filled circle: FATCAT and light pink filled circle: MATT.

In general, purely sequence-based alignment methods, such as CLUSTALW [29], MALIGN [28] and T-COFFEE [30], rarely achieve high nett scores at distant relationships. However, sequence followed by structure-based alignment improves the results considerably and is nearly competitive with purely structure-based alignment methods. Where initial equivalences need to be seeded into structure-based sequence alignment methods like COMPARER, sequence followed by structural alignment is convenient. A set of length bin-wise and structural class-wise 3D-plot analyses is provided in Supplementary Data.

Box-and-whisker plot analysis

The cumulative values of the three normalized scores were averaged and ranked for both the class-wise and the bin-wise data sets. In most cases, FATCAT [14], MINRMS [12], MATT [13], DALI [11] and MUSTANG [10] fell into the first five ranks. A box-and-whisker plot of the distribution was drawn for each of the 14 methods. The plots for different bin sizes, for both length-rigid and length-varying superfamilies, are provided in Figure 5 and Supplementary Figure S5.

Figure 5:

(A) Box-and-whisker plot for length-rigid domains, including all classes. The upper extreme is marked above the upper quartile, the median is marked at the junction of the two quartile boxes and the lower quartile lies below it. (B) Same as (A), but for length-varying domains, including all classes.

Our analysis reveals that most programs attain better scores when the domain size is small, in both the length-rigid and length-varying data sets, which indicates that the placement of gaps is still a bottleneck for alignments of large domains.

Cumulative scores for alignments of different domain size (bin-wise) across all structural classes

Length-varying proteins

None of the programs scored ‘best’ (scores ≥0.75) when the bins were examined individually (Figure 5A; Supplementary Figure S5). Very few programs scored ‘good’ (scores between 0.5 and 0.75) in the largest bin (domain size >400). More programs attained ‘good’ scores when domain pairs of smaller bin size were aligned. DALI, LSQMAN, MINRMS and MATT consistently obtained ‘good’ scores. MATRAS scored between the ‘average’ and ‘good’ regions when aligning length-varying domain pairs of different sizes. All the sequence-based programs (CLUSTALW, MALIGN and T-COFFEE) were always found in the ‘average’ scoring region in all the bins of different domain sizes, except in the first bin (<50 residues). Both positive and negative skewness were found, but without any discernible pattern, since they varied from program to program and from bin to bin.

Length-rigid proteins

A few programs, such as DALI and MINRMS, achieved ‘best’ scores when aligning domain pairs of bin size 201–400, and they were also found in the ‘near-best’ scoring region of the longest bin (>400). Along with DALI and MINRMS, a few more programs, such as LSQMAN and FATCAT, were found in the best scoring region of bin size 101–200. However, none of the programs was found in the best scoring region in the two smallest bins (51–100 and <50 residues), suggesting that, even where indels do not play a decisive role, it is challenging to obtain high-quality alignments for domain pairs of small size. None of the programs obtained ‘average’ or ‘poor’ scores.

Cumulative scores for alignments of different structural classes (class-wise) across all domain bin sizes

Length-varying proteins

None of the programs scored ‘best’ when examined for individual structural classes (Figure 5B; Supplementary Figure S5). Except MUSTANG, all the structure-based programs (DALI, SSAP, LSQMAN, MINRMS, FATCAT and MATT) and the sequence followed by structure-based programs (CLUSTALW, MALIGN and T-COFFEE, each along with COMPARER) scored ‘good’ in the all-α class. All the structure-based programs (DALI, SSAP, LSQMAN, MINRMS, MUSTANG, FATCAT and MATT) and the sequence followed by structure-based programs (CLUSTALW, MALIGN and T-COFFEE, each along with COMPARER) obtained ‘good’ scores in almost all the classes, albeit with negative skewness. The performance of the purely sequence-based comparison programs was poor, and they retained ‘average’ scores for the domain pairs of all structural classes.

Length-rigid proteins

Programs such as DALI, LSQMAN, MINRMS, MUSTANG, FATCAT and MATT acquired ‘best’ scores (cumulative score of 0.75 or above), and the other programs considered in our analysis acquired ‘good’ scores, for length-rigid domain pairs of the α-class. All the programs scored ‘good’ for domain pairs of the β-class and the α + β class. Except SSAP and MATRAS, which acquired cumulative scores of 0.55 and 0.56, respectively, all the other programs scored ‘good’ for the α/β class of length-rigid domain pairs considered in our analysis.

In general, algorithms that work on the basis of determining protein similarity by minimal RMSD and spatial proximity (LSQMAN and MINRMS) and methods that employ aligned fragment pairs allowing for translations and twists (MATT and FATCAT) work better in all classes, in both the length-rigid and length-variable categories, and even in small and large bin sizes. On the whole, differences in the performance of the programs are significant when plotted bin-wise and class-wise, which provides some insight (please see Supplementary Data for individual plots and for the calculation of Z-scores).

Alignment of functionally important residues

Case 1: serine protease

The serine proteases are β-rich folds consisting of two similar β-barrel domains, each of six anti-parallel strands. A number of mammalian and microbial structures have been determined by X-ray crystallography, and they adopt similar 3D structures although there is <21% sequence identity between the mammalian and microbial serine proteases [34, 35]. Functionally important residues are of particular interest where the sequence similarity between domains of the same superfamily is very low. We therefore evaluated the alignments further by locating the functionally important residues and examining the conservation of SST and of the catalytic domain in the alignment.

Two trypsin-like eukaryotic serine protease domains from the serine protease superfamily (SCOP code: 50494), human matriptase MTSP1 (d1eaxa-) and the catalytic domain of human complement C1S protease (d1elva1), were aligned using various programs. The alignment of the catalytic triad residues His 57, Ser 195 and Asp 102 [36] by the various programs was examined (shown in Figure 6 for one method and for two others in Supplementary Figure S6). The structure-based alignment programs MATT [13], MINRMS [12], DALI [11] and FATCAT [14], which report high nett normalized scores, successfully aligned the conserved residues and the catalytic triad residues.
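Whether functionally important residues are placed in equivalent positions can be checked directly on a pairwise alignment. The sketch below is one minimal way to do this, assuming the alignment is available as two gapped strings of equal length and that residues are identified by their 0-based index in the ungapped sequence; mapping PDB residue numbers such as His 57 to these indices is left to the user. The function names and the toy alignment are hypothetical.

```python
from typing import Dict, List, Tuple


def residue_columns(aligned_seq: str, gap: str = "-") -> Dict[int, int]:
    """Map each ungapped residue index (0-based) to its alignment column."""
    mapping: Dict[int, int] = {}
    index = 0
    for column, symbol in enumerate(aligned_seq):
        if symbol != gap:
            mapping[index] = column
            index += 1
    return mapping


def residues_aligned(aln1: str, aln2: str, pairs: List[Tuple[int, int]]) -> List[bool]:
    """For each (index in sequence 1, index in sequence 2) pair, report
    whether the two residues occupy the same alignment column."""
    if len(aln1) != len(aln2):
        raise ValueError("aligned sequences must have equal length")
    cols1, cols2 = residue_columns(aln1), residue_columns(aln2)
    return [cols1[i] == cols2[j] for i, j in pairs]


if __name__ == "__main__":
    # Toy alignment, for illustration only (not from the serine protease pair).
    a = "AH-DS"
    b = "-HGDS"
    # H, D and S of sequence 1 (indices 1, 2, 3) versus H, D and S of
    # sequence 2 (indices 0, 2, 3).
    print(residues_aligned(a, b, [(1, 0), (2, 2), (3, 3)]))
```

An alignment in which every functional pair shares a column would, in the sense used here, preserve the equivalence of the catalytic residues.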

Figure 6:

A domain pair from the serine protease superfamily (SCOP code: 50494) aligned well using the MINRMS [12] programme. The goodness of alignments can be further analysed through the equivalence of functionally important residues. The catalytic triad residues His 57, Ser 195 and Asp 102 are marked both in the alignment and in the superposition. The number of matches, RMSD and percentage secondary-structure equivalence have been tabulated.

Case 2: metalloprotease

Metalloproteases (or metalloproteinases) constitute a family of enzymes from the group of proteases, classified by the nature of the most prominent functional group in their active site. These are proteolytic enzymes whose catalytic mechanism involves a metal. Most metalloproteases are zinc-dependent and some use cobalt. The zinc cofactor-containing domains from the metalloprotease superfamily (SCOP code: 55486), d1e1h-1 (botulinum neurotoxin type A light chain) and d1hs6a3 (leukotriene A-4 hydrolase), were aligned using various programs. The catalytically important Zn2+ ion is bound by three histidine residues, which lie in the conserved sequence HexxHxxGxxH [37]. FATCAT (nett score 0.537) [14] and MINRMS (nett score 0.502) [12] performed equally well, with the catalytic triad aligned (as shown in Figure 7), whereas the alignment from MATRAS (nett score 0.380) [16] has equivalent secondary structures but not an aligned catalytic triad (Supplementary Figure S7); this superfamily gained a cumulative value of 0.447 from the performance of all the other programs used. It is gratifying to note that programs that acquire high nett normalized scores in our analysis also preserve the equivalence of functionally important residues.
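Locating such a conserved motif in each sequence is a simple first step before checking whether the motif residues are aligned. The following sketch converts a motif written in the notation used above, where ‘x’ stands for any residue, into a regular expression and reports the 0-based start positions of all matches. The helper names and the test sequence are made up for illustration; the sequence is not taken from either domain.

```python
import re
from typing import List


def motif_to_regex(motif: str) -> str:
    """Turn a motif such as 'HexxHxxGxxH' into a regular expression.

    'x' or 'X' is treated as a wildcard for any residue; all other
    letters are matched literally in upper case.
    """
    return "".join("." if ch in "xX" else re.escape(ch.upper()) for ch in motif)


def motif_positions(sequence: str, motif: str) -> List[int]:
    """Return the 0-based start positions of all (possibly overlapping) matches."""
    # A zero-width lookahead lets overlapping occurrences be reported.
    pattern = re.compile("(?=" + motif_to_regex(motif) + ")")
    return [match.start() for match in pattern.finditer(sequence.upper())]


if __name__ == "__main__":
    # Made-up sequence containing one occurrence of the motif.
    print(motif_positions("AAHEMGHSAGLKHTT", "HExxHxxGxxH"))
```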

Figure 7:

A pair from the metalloprotease superfamily (SCOP code: 55486), aligned well using the FATCAT [14] method, has been examined for the equivalence of functionally important residues. The conserved residues HEXXHXXH have been marked with an ellipse and highlighted for the two domains under superposition. The number of matches, RMSD and percentage secondary-structure equivalences have been tabulated.
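
The check for equivalence of functionally important residues described above can be sketched in a few lines. This is a minimal illustration only, assuming the pairwise alignment is available as two gapped strings of equal length; the function names, the toy sequences and the simplified motif pattern `HE..H` are hypothetical and are not output from any of the programs assessed here.

```python
import re


def motif_columns(aligned_seq, motif_regex):
    """Return the alignment columns covered by the first motif match.

    The motif is searched in the ungapped sequence, and each matched
    residue is then mapped back to its column in the gapped alignment.
    """
    # Column index of every non-gap residue, in sequence order.
    col_of_residue = [col for col, ch in enumerate(aligned_seq) if ch != "-"]
    ungapped = aligned_seq.replace("-", "")
    match = re.search(motif_regex, ungapped)
    if match is None:
        return []
    return [col_of_residue[i] for i in range(match.start(), match.end())]


def motif_is_equivalenced(aligned_a, aligned_b, motif_regex, key_offsets):
    """True if the key motif residues of both domains share alignment columns."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    cols_a = motif_columns(aligned_a, motif_regex)
    cols_b = motif_columns(aligned_b, motif_regex)
    if not cols_a or not cols_b:
        return False
    return all(cols_a[k] == cols_b[k] for k in key_offsets)


# Toy (hypothetical) alignments: offsets 0 and 4 are the two His of HE..H.
good = ("AKHELTHGS", "SRHEMAHAT")
shifted = ("AKHELTHGS--", "--SRHEMAHAT")
print(motif_is_equivalenced(*good, "HE..H", [0, 4]))     # True
print(motif_is_equivalenced(*shifted, "HE..H", [0, 4]))  # False
```

An alignment can match secondary structures well and still fail this test, which is why the functional residues are checked separately from the RMSD and the secondary-structure equivalence.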

CONCLUSION

We have performed a critical analysis of different sequence alignment programs (both structure-based sequence alignment and sequence alignment followed by structure-based alignment) for aligning domain pairs that have <40% identity. Data sets of pairwise alignments were grouped according to structural class (class) and domain size (bin), separately for domains of similar length (length-rigid) and of highly variable length (length-varying), giving rise to over 3000 alignments. The main intention of this work was to help users select appropriate software for the type of data they need to compare or align, and to understand the strengths and limitations of popular programs when applied to distantly related domains, irrespective of structural class and bin size.

Our choice of domain pairs was made to retain poor sequence similarity (<40% identity), and the pairs were grouped into different categories such as domain size and structural class. However, we did not bias our choice of domain pairs in any manner, for instance towards enzymes or a particular biological function. Further, the number of domain pairs chosen is, we hope, statistically significant for analysis and general derivations. Sequence-based methods driven by structure-based alignment have improved the alignment quality, but they do not reach the top-scoring level. Some changes in performance are observed among the various programs, but these are not significant, since the cumulative nett scores reach the same level over the whole data set when examined on the basis of class and bin type. However, the FATCAT [14], MATT [13], DALI [11], MINRMS [12] and LSQMAN [9] programs perform equally well in many cases when compared with the other programs on the basis of the scoring scheme. LSQMAN [9] attains high values in most classes and bins (Figure 4).

Even though LSQMAN [9] fares well in aligning distantly related proteins, it produces fewer fitted points, albeit with low RMSD. Moreover, in many cases, it fails to improve the percentage of secondary-structure equivalences. The RMSD values reported by LSQMAN [9] are good when compared with other programs; as a result, it consistently obtains a better nett and cumulative score. Examples in which LSQMAN [9] fails to align the pairs are shown in Figure 8 and in the Supplementary Data: two from our data set and two from a random sample of the ASTRAL <40% identity data set corresponding to SCOP 1.73 [18] (Figure 8; Supplementary Figures S8–S11).

Figure 8:

Domain pairs from within SCOP superfamily 64288, aligned using LSQMAN [9]. This example illustrates that alignment and superposition can be very hard, even for well-performing methods like LSQMAN [9]. Alignments obtained by two other programs are also displayed for comparison. The light yellow ellipse on the LSQMAN [9] superposition marks regions of poor superimposition, such as the central beta-sheet. The alignment is annotated with the positions of secondary structures, using different grey-shaded boxes.
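
The trade-off between the number of fitted points and the RMSD can be made concrete with a small numerical sketch. It assumes the coordinates are already superposed and uses a hypothetical 3.5 Å distance cutoff to decide which pairs count as fitted; neither the cutoff nor the function names reflect how LSQMAN or any other program discussed here is actually implemented.

```python
import math


def rmsd(pairs):
    """Root mean square deviation over already-superposed coordinate pairs."""
    if not pairs:
        raise ValueError("need at least one coordinate pair")
    total = sum(math.dist(p, q) ** 2 for p, q in pairs)
    return math.sqrt(total / len(pairs))


def fitted_subset(pairs, cutoff):
    """Keep only the pairs whose distance is within the cutoff (in angstroms)."""
    return [(p, q) for p, q in pairs if math.dist(p, q) <= cutoff]


# Toy (hypothetical) equivalences: two close pairs and one poorly fitting pair.
pairs = [
    ((0.0, 0.0, 0.0), (1.0, 0.0, 0.0)),
    ((0.0, 0.0, 0.0), (0.0, 1.0, 0.0)),
    ((0.0, 0.0, 0.0), (0.0, 0.0, 5.0)),
]
kept = fitted_subset(pairs, cutoff=3.5)
print(len(pairs), rmsd(pairs))  # 3 pairs, RMSD 3.0
print(len(kept), rmsd(kept))    # 2 pairs, RMSD 1.0
```

Discarding the poorly fitting pair lowers the RMSD while reducing the number of equivalences, so a low RMSD on its own can overstate the quality of an alignment.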

During assessment of alignment methods, the goodness of an alignment is most often measured by the extent to which aligned positions match those of a previously existing and accepted reliable alignment (a ‘gold standard’; see, for example, Figures 2 and 3). Benchmarking in such instances becomes easier, but the bottleneck remains the question of which alignment program is accepted as reliable. In this article, we have performed a critical assessment of various structure-based sequence alignment programs in a manner that does not rely on a ‘gold-standard’ alignment. Further, we have chosen more than 200 domain pairs that are poor in sequence identity, some of which contain length variations, and that represent various structural classes. Some groups have reviewed performance on a large number of distantly related domain pairs [11], and others have compared the performance of different alignment programs in ways that depend on the availability of a gold standard. BAliBASE [38] has examined length-varying domain pairs, but these are not necessarily distantly related. Here, for the first time, we have combined two difficult parameters (distant relationships and length variation) to assess state-of-the-art structure comparison programs. We report that most methods still need to improve to achieve optimal performance for length-varying domains. This analysis should also indicate which method could work best for different structural classes and different bin sizes, for both length-rigid and length-varying domain pairs.
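
For contrast with the gold-standard-free assessment used here, the conventional measure mentioned at the start of the paragraph above (the fraction of reference aligned positions that a test alignment reproduces) can be sketched as follows. This is an illustrative assumption about how such a comparison is typically computed, with hypothetical function names and toy sequences; it is not the scoring scheme of this study.

```python
def aligned_pairs(aligned_a, aligned_b):
    """Return the set of (i, j) residue-index pairs placed in the same column."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    pairs = set()
    i = j = 0  # running residue indices in the ungapped sequences
    for ca, cb in zip(aligned_a, aligned_b):
        if ca != "-" and cb != "-":
            pairs.add((i, j))
        if ca != "-":
            i += 1
        if cb != "-":
            j += 1
    return pairs


def reference_agreement(test, reference):
    """Fraction of reference residue pairs that the test alignment reproduces."""
    ref_pairs = aligned_pairs(*reference)
    if not ref_pairs:
        return 0.0
    return len(ref_pairs & aligned_pairs(*test)) / len(ref_pairs)


# Toy (hypothetical) alignments of the same two four-residue sequences.
reference = ("ACDE", "ACDE")
print(reference_agreement(("ACDE", "ACDE"), reference))    # 1.0
print(reference_agreement(("AC-DE", "ACD-E"), reference))  # 0.75
print(reference_agreement(("ACDE-", "-ACDE"), reference))  # 0.0
```

The score depends entirely on the choice of reference, which is the bottleneck noted above.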

The understanding and establishment of protein structural similarities can be an essential step in relating protein families through evolution. Such similarities are harder to establish under high evolutionary divergence, which is reflected in poor sequence identity, unequal lengths of protein domains or suboptimal structural similarity. Alignment of proteins related at the superfamily level is especially challenging, even with the availability of structural information. However, the alignment of protein domain superfamilies using multiple methods has provided important insights into the advantages and limitations of several protein structure comparison methods.

SUPPLEMENTARY DATA

Supplementary data are available online at http://bib.oxfordjournals.org/.

Key Points

  • There are several structure comparison programs, and most of them work well in detecting overall structural similarity between two proteins, but their alignments are seldom subjected to critical assessment. We subjected 12 popular structure comparison methods to critical analysis.

  • We chose protein domain pairs of low sequence identity (<40%), of different domain sizes, of unequal lengths and representing diverse structural classes. In the process, more than 200 domain pairs were considered, giving rise to a statistically significant number of comparisons.

  • A novel assessment scoring technique, built around parameters that do not rely on the availability of a gold-standard alignment, is proposed.

  • Structure comparison methods that permit twists are relatively better at aligning protein domains of low sequence identity.

  • At distant similarities, proteins of similar lengths are easier to align by most structure comparison methods, suggesting that placement of insertions and deletions using appropriate gap penalties continues to be a difficult problem.

FUNDING

S.K. was supported by a Career Development Award made to R.S. by the Department of Biotechnology, India. We thank NCBS (TIFR) for financial support.

Acknowledgements

We thank Mr Nagarajan for analysing MINRMS results scripts and also for discussions. K.K. thanks University Grants Commission for his studentship. The authors thank NCBS for infrastructural and financial support.

References

1. Gibrat JF, Madej T, Bryant SH. Surprising similarities in structure comparison. Curr Opin Struct Biol, 1996, vol. 6 (pg. 377-85)
2. Grishin N. Fold change in evolution of protein structures. J Struct Biol, 2001, vol. 134 (pg. 167-85)
3. Murzin A. How far divergent evolution goes in proteins. Curr Opin Struct Biol, 1998, vol. 8 (pg. 380-7)
4. Johnson MS. Comparison of protein three-dimensional structures. Curr Opin Struct Biol, 1991, vol. 1 (pg. 334-44)
5. Sierk ML, Kleywegt GJ. Deja vu all over again: finding and analyzing protein structure similarities. Structure, 2004, vol. 12 (pg. 2103-11)
6. Yakunin AF, Yee AA, Savchenko A, et al. Structural proteomics: a tool for genome annotation. Curr Opin Chem Biol, 2004, vol. 8 (pg. 42-8)
7. Novotny M, Madsen D, Kleywegt GJ. Evaluation of protein fold comparison servers. Proteins, 2004, vol. 54 (pg. 260-70)
8. Taylor W, Orengo C. Protein structure alignment. J Mol Biol, 1989, vol. 208 (pg. 1-22)
9. Kleywegt G. Use of non-crystallographic symmetry in protein structure refinement. Acta Crystallogr D Biol Crystallogr, 1996, vol. 52 (pg. 842-57)
10. Konagurthu AS, Whisstock JC, Stuckey PJ, et al. MUSTANG: a multiple structural alignment algorithm. Proteins, 2006, vol. 64 (pg. 559-74)
11. Holm L, Sander C. Protein structure comparison by alignment of distance matrices. J Mol Biol, 1993, vol. 233 (pg. 123-38)
12. Jewett AI, Huang CC, Ferrin TE. MINRMS: an efficient algorithm for determining protein structure similarity using root-mean-squared-distance. Bioinformatics, 2003, vol. 19 (pg. 625-34)
13. Menke M, Berger B, Cowen L. Matt: local flexibility aids protein multiple structure alignment. PLoS Comput Biol, 2008, vol. 4 pg. e10
14. Ye Y, Godzik A. Flexible structure alignment by chaining aligned fragment pairs allowing twists. Bioinformatics, 2003, vol. 19 Suppl. 2 (pg. ii246-55)
15. Veeramalai M, Ye Y, Godzik A. TOPS++FATCAT: fast flexible structural alignment using constraints derived from TOPS+ Strings Model. BMC Bioinformatics, 2008, vol. 9 pg. 358
16. Kawabata T. MATRAS: a program for protein 3D structure comparison. Nucleic Acids Res, 2003, vol. 31 (pg. 3367-9)
17. Sali A, Blundell TL. Definition of general topological equivalence in protein structures. A procedure involving comparison of properties and relationships through simulated annealing and dynamic programming. J Mol Biol, 1990, vol. 212 pg. 403
18. Andreeva A, Howorth D, Chandonia JM, et al. Data growth and its impact on the SCOP database: new developments. Nucleic Acids Res, 2008, vol. 36 (pg. D419-25)
19. Kim C, Lee B. Accuracy of structure-based sequence alignment of automatic methods. BMC Bioinformatics, 2007, vol. 8 pg. 355
20. Sierk ML, Pearson WR. Sensitivity and selectivity in protein structure comparison. Protein Sci, 2004, vol. 13 (pg. 773-85)
21. Kim C, Tai C-H, Lee B. Iterative refinement of structure-based sequence alignments by seed extension. BMC Struct Biol, 2009, vol. 10 pg. 210
22. Kolodny R, Koehl P, Levitt M. Comprehensive evaluation of protein structure alignment methods: scoring by geometric measures. J Mol Biol, 2005, vol. 346 (pg. 1173-88)
23. Standley DM, Toh H, Nakamura H. Detecting local structural similarity in proteins by maximizing number of equivalent residues. Proteins, 2004, vol. 57 (pg. 381-91)
24. Godzik A. The structural alignment between two proteins: is there a unique answer? Protein Sci, 1996, vol. 5 (pg. 1325-38)
25. Mayr G, Domingues FS, Lackner P. Comparative analysis of protein structure alignments. BMC Struct Biol, 2007, vol. 7 pg. 50
26. Carugo O, Pongor S. Protein fold similarity estimated by a probabilistic approach based on Cα-Cα distance comparison. J Mol Biol, 2002, vol. 315 (pg. 887-98)
27. Rogen P, Fain B. Automatic classification of protein structure by using Gauss integrals. Proc Natl Acad Sci USA, 2003, vol. 100 (pg. 119-24)
28. Wheeler WC. MALIGN: a multiple sequence alignment program. J Heredity, 1994, vol. 85 (pg. 419-20)
29. Pearson WR, Lipman DJ. Improved tools for biological sequence comparison. Proc Natl Acad Sci USA, 1988, vol. 85 (pg. 2444-8)
30. Poirot O, O’Toole E, Notredame C. Tcoffee@igs: a web server for computing, evaluating and combining multiple sequence alignments. Nucleic Acids Res, 2003, vol. 31 (pg. 3503-6)
31. Chandonia JM, Hon G, Walker NS, et al. The ASTRAL compendium in 2004. Nucleic Acids Res, 2004, vol. 32 (pg. D189-92)
32. Mizuguchi K, Deane CM, Blundell TL, et al. JOY: protein sequence-structure representation and analysis. Bioinformatics, 1998, vol. 14 (pg. 617-23)
33. Tukey JW. Box-and-Whisker Plots. §2C in Exploratory Data Analysis. Reading, MA: Addison-Wesley, 1977, 39–43
34. James MNG, Delbaere LTJ, Brayer GD. Amino acid sequence alignment of bacterial and mammalian pancreatic serine proteases based on topological equivalences. Can J Biochem, 1978, vol. 56 (pg. 396-402)
35. Read RJ, Brayer GD, Jurasek L, et al. Critical evaluation of comparative model building of Streptomyces griseus trypsin. Biochemistry, 1984, vol. 23 (pg. 6570-5)
36. Hedstrom L. Serine protease mechanism and specificity. Chem Rev, 2002, vol. 102 (pg. 4501-24)
37. Trexler M, Briknarová K, Gehrmann M, et al. Peptide ligands for the fibronectin type II modules of matrix metalloproteinase 2 (MMP-2). J Biol Chem, 2003, vol. 278 (pg. 12241-6)
38. Thompson JD, Plewniak F, Poch O. BaliBASE: a benchmark alignment database for the evaluation of multiple sequence alignment programs. Bioinformatics, 1999, vol. 1 (pg. 87-8)
