Mathilde Carpentier, Jacques Chomilier, Protein multiple alignments: sequence-based versus structure-based programs, Bioinformatics, Volume 35, Issue 20, October 2019, Pages 3970–3980, https://doi.org/10.1093/bioinformatics/btz236
Abstract
Multiple sequence alignment programs have proved to be very useful and have already been evaluated in the literature, but alignment programs based on structure, or on both sequence and structure, have not. In the present article, we evaluate the added value provided by considering structures.
We compared the multiple alignments resulting from 25 programs, based on sequence, structure or both, to reference alignments deposited in five databases (BALIBASE 2 and 3, HOMSTRAD, OXBENCH and SISYPHUS). On the whole, the structure-based methods compute more reliable alignments than the sequence-based ones, and even than the sequence+structure-based programs, regardless of the database. Two programs lead, MAMMOTH and MATRAS; nevertheless, the performance of MUSTANG, MATT, 3DCOMB, TCOFFEE+TM_ALIGN and TCOFFEE+SAP is better for some alignments. The advantage of structure-based methods increases at low levels of sequence identity, and for residues that are buried or in regular secondary structures. Concerning gap management, sequence-based programs introduce fewer gaps than structure-based programs. Concerning the databases, the alignments of the manually built databases are more challenging for the programs.
All data and results presented in this study are available at: http://wwwabi.snv.jussieu.fr/people/mathilde/download/AliMulComp/.
Supplementary data are available at Bioinformatics online.
1 Introduction
Multiple alignments of protein sequences are an essential tool for exploring the evolution, diversity, conservation and function of proteins (Feng and Doolittle, 1987; Lecompte et al., 2001; Levasseur et al., 2008; Wong et al., 2008). Despite the impressive and increasing number of available structures, most of these alignments are still computed by software that relies only on sequence information. Protein structures are mostly used as a second step in order to manually refine the alignment (Lemey et al., 2009) or to guide a particularly difficult alignment of very divergent proteins (Jean et al., 1997). Since it is generally accepted that structures are more conserved than sequences (Illergård et al., 2009), it is somewhat surprising that multiple protein structure alignment methods, or methods combining sequence and structure, are not more widespread.
The goal of protein sequence alignments is to align homologous amino acids that derive from an ancestral sequence by substitutions. In structural alignments, the aligned positions are similar from the point of view of local and/or global conformations, and this structural similarity does not always imply homology (Godzik, 1996). Indeed, similar sub-domain fragments can be found in many different folds, with unrelated functions or various origins (Alva et al., 2015; Lamarine et al., 2001; Nepomnyachiy et al., 2017). The conceptual model behind sequence alignment explicitly considers three events for evolution: insertion, deletion and mutation. The model behind structure alignment is not so clear, partly because the impact of those three events on the folding step of protein structures is not well understood. The design of such a model is one of the greatest challenges of our decade for structural biology (Liberles et al., 2012).
Homology is difficult to assess, especially when the proteins show a low level of similarity or if the homology of the whole genes is questionable. Given all the considerations above, it is difficult to claim that structure alignments provide a gold standard for evaluating the quality of sequence alignments. However, as structures are better conserved, alignments should be more reliable when information from sequences and structures is combined. In this article, we compared the alignments computed from structure, or from both structure and sequence, with those computed from sequence only.
Multiple sequence alignment methods have been compared in many articles and with several types of benchmarks, reviewed in Iantorno et al. (2014). The most widely used benchmarks are composed of a collection of reference alignments considered as the gold standard. The reference alignments are constructed mainly by using sequence and structural information, but also according to other information such as function (Thompson et al., 2011). Other types of benchmarks rely on simulated sequences (Nuin et al., 2006), on direct comparison of all computed alignments, without any reference alignment (Landan and Graur, 2007; Lassmann and Sonnhammer, 2005) or on the validity of phylogenetic trees computed from the alignments (Dessimoz and Gil, 2010).
For structure-based alignment methods, fewer comparative studies have been conducted, and most of them compare pairwise structural alignment programs (Feng and Sippl, 1996; Gerstein and Levitt, 1998; Godzik, 1996; Kim and Lee, 2007; Mayr et al., 2007; Sauder et al., 2000; Slater et al., 2013). Multiple structural alignment programs are compared in the study of Berbalk et al. (2009). The authors noticed that structure-based alignment programs were generally very difficult to use and that there is room for improvement concerning use and applicability. They concluded that combining different alignment approaches into a single program relying on an automated scoring could improve the alignment quality but that, until such a method is implemented, it seems important for a user to apply different tools and to manually compare their results.
We have conducted here a thorough comparative study of the performance of sequence-based and structure-based programs in order to address the following questions: are structure-based methods really superior for retrieving homologous residues? Or are the sequence+structure-based ones? In which cases should we use structure-based methods, sequence+structure-based methods or sequence-based methods?
2 Materials and methods
2.1 Databases
In this study, we used reference multiple alignments built from sequences, structures and function information, and considered them as the gold standard. We did not use the three other types of benchmarks mentioned above because: (i) the use of simulated sequences is not possible in our case because there is no associated structure; (ii) it is possible to compare all alignments without a reference but, as programs may be consistent with each other but all wrong, we decided to avoid this approach in this article; (iii) the phylogeny-based approach would be very interesting but it requires a database of validated trees which is beyond the scope of the article.
We have selected 847 alignments, each containing at least three protein chains or domains, from five reference multiple alignment databases: BALIBASE 2 (Thompson et al., 1999b), BALIBASE 3 (Thompson et al., 2005), HOMSTRAD (Mizuguchi et al., 1998a), OXBENCH (Raghava et al., 2003) and SISYPHUS (Andreeva et al., 2007). We restricted the databases to proteins present in the Protein Data Bank that represent only the structured domains of protein sequences, thus discarding intrinsically disordered proteins. This restriction is necessary when using structure alignment methods. Some regions may be disordered in resolved protein structures but their proportion is low (1% of the residues in human protein-coding genes), whereas the proportion of these regions predicted in proteins of unknown structure is 20% (van der Lee et al., 2014). Some other alignments have been discarded: those with two or more proteins with identical amino acid sequences, those with missing residues in structures or with various inconsistencies. We did not consider the alignments of other well-known databases listed in Blackshields et al. (2006) for various reasons: PREFAB (Edgar, 2004) because it is composed of pairwise alignments; IRMbase (Subramanian et al., 2005) because there is no structure associated with the simulated fragments; and SABMARK (Van Walle et al., 2005) because of some inconsistencies in the multiple alignments, which are built from pairwise structural alignments, pointed out by the author and in Edgar (2010). We also had difficulties accessing PALI (Balaji et al., 2001) and could not download the database. For all the databases, we only consider the core of the alignments, but its definition depends on the database.
We have selected 29 families from BALIBASE 2 (BB2) and 38 from BALIBASE 3 (BB3), manually curated by checking the alignments of functional and other conserved residues. In each family, all proteins share the same structural fold, so the core can be reliably defined, excluding ambiguous or non-superimposable regions, unrelated secondary structure borders or some loop regions. BB2 and BB3 were both kept, even though they come from the same source, because their protein families are different. HOMSTRAD, from which we selected 357 families, is exclusively based on proteins with known structures, and each family is aligned with the programs MNYFIT (Sutcliffe et al., 1987), STAMP (Russell and Barton, 1992) and COMPARER (Sali and Blundell, 1990). These structure-based alignments are annotated with JOY (Mizuguchi et al., 1998b) and individually examined and modified if necessary. JOY produces core block annotations defined as the regular secondary structure elements (SSEs). We retrieved from OXBENCH 330 alignments with three or more proteins in each alignment (subset ‘multi’), not split into domains (full-length sequences). These multiple alignments are computed by STAMP (Russell and Barton, 1992). All the aligned positions were taken as the core blocks. The last database, SISYPHUS, is based on the families of domains from the structural classification SCOP (Murzin et al., 1995) with non-trivial structural relationships. Multiple alignments are manually constructed for structural regions that range from oligomeric biological units or individual domains to fragments of different sizes, and are manually curated. SISYPHUS annotates the structurally equivalent residues in the alignments and we consider them as the core blocks.
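Two of the selection criteria above (at least three chains, no two proteins with identical sequences) can be stated mechanically. The following is a minimal sketch of such a filter, not the authors' code: the function names, the representation of an alignment as a list of gapped strings and the gap symbols `-` and `.` are assumptions made for illustration. The checks for missing residues and other inconsistencies require the structure files and are not covered here.

```python
GAP_CHARS = "-."  # assumed gap symbols; adapt to the alignment format in use


def ungap(aligned_sequence):
    """Return the residues of one aligned sequence, without gap symbols."""
    return "".join(ch for ch in aligned_sequence if ch not in GAP_CHARS).upper()


def keep_alignment(aligned_sequences, min_sequences=3):
    """Apply the two mechanical criteria to one reference alignment.

    An alignment is kept only if it contains at least `min_sequences`
    chains and no two chains share the same amino acid sequence.
    """
    sequences = [ungap(s) for s in aligned_sequences]
    if len(sequences) < min_sequences:
        return False
    # A set collapses duplicates, so equal sizes mean all sequences differ.
    return len(set(sequences)) == len(sequences)


if __name__ == "__main__":
    print(keep_alignment(["AC-D", "ACED", "A-ED"]))  # three distinct chains
    print(keep_alignment(["ACD", "ACD-", "AED"]))    # two identical chains
```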
Many structure-based programs do not output all the residues of input protein structures (some residues are removed or ignored) or change the name of the sequences. We have developed two programs for solving this issue: the first matches the protein names in the reference alignments and the protein names in the program-calculated alignments and the second makes each sequence of a program-calculated alignment identical to the sequence in the reference alignment. The residues removed by some structure-based programs are inserted in the alignment and the rest of the column is filled with gaps.
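The second step described above (re-inserting the residues dropped by a program, each in a column otherwise filled with gaps) can be sketched as follows. This is an illustrative reconstruction under stated assumptions, not the authors' program: the function name `restore_missing_residues`, the dictionary-based alignment representation and the greedy left-to-right matching of aligned residues against the full sequence are all assumptions. When a removed residue is identical to its neighbour, the greedy matching may choose a different but equally valid position.

```python
GAP = "-"


def restore_missing_residues(alignment, full_sequences):
    """Re-insert residues that a program dropped from its output.

    alignment      -- dict: sequence name -> gapped string (equal lengths)
    full_sequences -- dict: sequence name -> complete ungapped sequence
    Each missing residue gets its own new column, filled with gaps in
    every other sequence.
    """
    names = list(alignment)
    ncols = len(alignment[names[0]])
    # inserts[name][col] = residues of `name` to place just before column col
    # (col == ncols means "after the last column").
    inserts = {name: {} for name in names}
    for name in names:
        full = full_sequences[name]
        pos = 0  # next unmatched position in the full sequence
        for col, ch in enumerate(alignment[name]):
            if ch == GAP:
                continue
            start = pos
            while pos < len(full) and full[pos] != ch:
                pos += 1  # residues skipped here were removed by the program
            if pos == len(full):
                raise ValueError(f"{name}: aligned residues do not match the full sequence")
            if pos > start:
                inserts[name][col] = full[start:pos]
            pos += 1
        if pos < len(full):
            inserts[name][ncols] = full[pos:]

    rebuilt = {name: [] for name in names}
    for col in range(ncols + 1):
        for owner in names:
            for residue in inserts[owner].get(col, ""):
                for name in names:
                    rebuilt[name].append(residue if name == owner else GAP)
        if col < ncols:
            for name in names:
                rebuilt[name].append(alignment[name][col])
    return {name: "".join(chars) for name, chars in rebuilt.items()}


if __name__ == "__main__":
    computed = {"a": "AC-EF", "b": "ACDEF"}   # residue G of "a" was dropped
    complete = {"a": "ACGEF", "b": "ACDEF"}
    print(restore_missing_residues(computed, complete))
```

After this step, removing the gaps from each rebuilt sequence gives back the complete sequence, so the computed and reference alignments can be compared residue by residue.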
2.2 Alignment quality evaluation
The alignments produced by each program are evaluated by comparison with the reference alignments through two scores, following Thompson et al. (1999a): (i) the fraction of pairs of residues in the reference alignment correctly identified by a given method, known as the sum-of-pairs (SP) score; (ii) the column score (CS), which describes the fraction of reference columns identified. As is usually done in alignment method comparisons (Do et al., 2005; Golubchik et al., 2007; Thompson et al., 1999a), Friedman tests (Friedman, 1937) were performed. This test is more conservative than the Wilcoxon test, which assumes symmetrical differences, an assumption that does not always hold. All tests, plots and heatmaps have been done with R (R Core Team, 2017). The average multiple root mean square (RMS) has been computed with THESEUS (Theobald and Wuttke, 2006), which has been applied to all alignments, whether reference ones or computed by the tested programs. We have counted the number of gaps in all columns between the first and last core elements. In the article, we present only the proportion of columns containing one or more gap openings. Accessible surface area (ASA) is calculated with NACCESS for all the proteins, in order to split the amino acids into two classes: either buried (relative ASA <25%) or exposed (Petersen et al., 2009). Secondary structure assignments have been performed with STRIDE (Frishman and Argos, 1995). The six classes given in the output of STRIDE are recoded into three classes: helices, strands and coils. All analyses have been carried out according to the following characteristics: the residues have been assigned either as buried or accessible, and either in helix, strand or other (loop).
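To make the two scores concrete, the sketch below computes SP and CS for a computed alignment against a reference. It is a simplified illustration, not the evaluation code used in the study. The function names and the list-of-strings representation (same sequence order in both alignments) are assumptions. The optional `core_columns` argument, also an assumption, restricts scoring to chosen reference columns. A reference column counts as identified only if the computed alignment contains a column with exactly the same residues and gaps, which is one possible convention among several.

```python
from itertools import combinations

GAP = "-"


def residue_columns(alignment):
    """Describe each column by the residue index of every sequence (None = gap)."""
    counters = [0] * len(alignment)
    columns = []
    for column in zip(*alignment):
        entry = []
        for i, ch in enumerate(column):
            if ch == GAP:
                entry.append(None)
            else:
                entry.append(counters[i])
                counters[i] += 1
        columns.append(tuple(entry))
    return columns


def aligned_pairs(columns):
    """Set of residue pairs placed in the same column."""
    pairs = set()
    for column in columns:
        present = [(i, r) for i, r in enumerate(column) if r is not None]
        pairs.update(combinations(present, 2))
    return pairs


def sp_and_cs(reference, computed, core_columns=None):
    """Return (SP, CS) of `computed` with respect to `reference`."""
    ref_cols = residue_columns(reference)
    if core_columns is not None:
        ref_cols = [ref_cols[c] for c in core_columns]
    ref_pairs = aligned_pairs(ref_cols)
    if not ref_cols or not ref_pairs:
        raise ValueError("reference contains no aligned residue pairs")
    test_cols = residue_columns(computed)
    test_pairs = aligned_pairs(test_cols)
    sp = len(ref_pairs & test_pairs) / len(ref_pairs)
    test_col_set = set(test_cols)
    cs = sum(col in test_col_set for col in ref_cols) / len(ref_cols)
    return sp, cs


if __name__ == "__main__":
    reference = ["ACDE", "AC-E", "ACDE"]
    computed = ["ACDE", "ACE-", "ACDE"]
    print(sp_and_cs(reference, computed))
```

In this toy case the second sequence has its last residue shifted by one column: 8 of the 10 reference pairs are recovered, and 2 of the 4 reference columns are reproduced.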
2.3 Programs
We have three categories of multiple alignment programs: sequence-based, sequence+structure-based and structure-based. To be included in this study, a program must: (i) be available for download, (ii) output a file containing the sequence alignment, (iii) run without error. Each multiple alignment had to be computed in <2 h, otherwise the job was canceled. The execution time has been measured for the alignments of the SISYPHUS database on a standard desktop computer with an i7 processor (Table 2). Some programs failed to produce enough alignments to allow a significant analysis of their performance, and they were excluded if they produced an alignment for <70% of the dataset. As we mainly aim at addressing the performance of structure-based or sequence+structure-based alignment methods, we tried to be as exhaustive as possible for them. We searched or tested more than 40 programs but many were unavailable or did not conform to our criteria. We were also surprised by the low number of sequence+structure-based alignment methods. We did not include methods improving alignments afterwards, like STACCATO (Shatsky et al., 2005). There is a great number of sequence-based programs and we only tested the most popular ones according to the latest studies (Le et al., 2017; Thompson et al., 2011). All the programs included in our study are listed with a short description in Table 1. We have selected 9 sequence-based programs, 5 sequence+structure-based programs (TCOFFEE/3DCOFFEE is run either with SAP or with TM-ALIGN) and 11 structure-based programs.
Table 1. Programs included in this study
Type | Name | Description | Rigid superimposition | Version | References | Year |
---|---|---|---|---|---|---|
SEQ | CLUSTALO | Seeded guide trees and HMM profile–profile | NA | 1.2.0 | (Goujon et al., 2010; Sievers et al., 2011) | 2010 |
SEQ | CLUSTALW | Classical progressive aligner | NA | 2.1 | (Larkin et al., 2007; Thompson et al., 1994) | 1994 |
SEQ | DIALIGN | Greedy and progressive approaches for segment-based multiple alignment | NA | TX, 1.0.2 | (Al Ait et al., 2013; Morgenstern, 1999; Morgenstern et al., 1998) | 1998 |
SEQ | KALIGN2 | Wu–Manber string-matching algorithm, improving both accuracy and speed | NA | 2.04 | (Lassmann et al., 2009; Lassmann and Sonnhammer, 2005) | 2005 |
SEQ | MAFFT_linsi | Fast progressive aligner with iteration and refinement using consistency score | NA | 7.215 | (Katoh et al., 2002; Katoh and Standley, 2013) | 2002 |
SEQ | MUSCLE | Fast progressive aligner with iteration and refinement | NA | 3.8.31 | (Edgar, 2004, 2004) | 2004 |
SEQ | PRANK | Phylogeny-aware progressive aligner; correcting treatment of insertions | NA | v.100701 | (Löytynoja and Goldman, 2005) | 2005 |
SEQ | PROBCONS | Probabilistic variant of the consistency algorithm | NA | 1.12 | (Do et al., 2005) | 2005 |
SEQ | TCOFFEE_SEQ | Consistency-based progressive aligner | NA | 11.00.8cbe486 | (Notredame et al., 2000) | 2000 |
SEQ/STRUCT | PROMALS3D | Derives constraints through structure-based alignments; combines them with sequence constraints when constructing consistency-based multiple sequence alignments | No | NA | (Pei et al., 2008; Pei and Grishin, 2007) | 2008 |
SEQ/STRUCT | TCOFFEE_SAP | TCOFFEE + pairwise structure alignments by SAP | Yes | 11.00.8cbe486 | (O’Sullivan et al., 2004; Orengo and Taylor, 1996) | 2004 |
SEQ/STRUCT | TCOFFEE_TM | TCOFFEE + pairwise structure alignments by TM-ALIGN | Yes | 11.00.8cbe486 | (O’Sullivan et al., 2004; Zhang and Skolnick, 2005) | 2004 |
SEQ/STRUCT | SALIGN | DP with a score that is a sum of an affine gap penalty and of terms depending on various sequence and structure features | Yes | Modeler version: 9.18 | (Madhusudhan et al., 2009) | 2007 |
SEQ/STRUCT | FORMATT | MATT with sequence information | No | 1.02 | (Daniels et al., 2012) | 2005 |
STRUCT | 3DCOMB | Identifies structurally similar pairwise fragments and assemblies according to pivot structures. Score: TM-score (Zhang and Skolnick, 2004) | Yes | 1.06 | (Wang et al., 2011) | 2011 |
STRUCT | GESAMT | Clustering of small structurally similar pairwise fragments. Score: Q-score (Krissinel and Henrick, 2004) | Yes | 7.0 | (Krissinel, 2012; Winn et al., 2011) | 2012 |
STRUCT | KPAX | DP + alignment optimization. Score: Gaussian structural similarity score | Yes | 5.0.5 | (Ritchie et al., 2012) | 2005 |
STRUCT | MAMMOTH | AFPs alignment by DP. Progressive multiple alignment with a guide tree. Score: probability of residue random match of two different folds (Ortiz et al., 2002) | No | NA | (Lupyan et al., 2005) | 2005 |
STRUCT | MATRAS | Progressive multiple alignment (guide tree) by DP. Score: PAM-like matrices computed on SSE conservation or Cα internal distances | No | 1.2 | (Kawabata, 2003; Kawabata and Nishikawa, 2000) | 2000 |
STRUCT | MATT | AFPs chaining by DP. Score: based on RMS for AFPs and on geometrical transformations allowing flexibility for chaining | Yes | 1.0 | (Menke et al., 2008) | 2008 |
STRUCT | MISTRAL | Superposition by minimizing interaction energy and residue one-to-one correspondence afterwards. Score: interaction energy and RMS | Yes | 3.6 | (Micheletti and Orland, 2009) | 2009 |
STRUCT | MTMALIGN | Progressive multiple alignment (guide tree) by DP. Score: TM-score | Yes | 20171124 | (Dong et al., 2018) | 2017 |
STRUCT | MULTIPROT | With each structure as a pivot, detection of all AFPs, assembling to build the longest consistent alignment. Score: alignment length, consistency and RMS | Yes | 1.93 | (Shatsky et al., 2004) | 2004 |
STRUCT | MUSTANG | AFP and progressive multiple alignment with a tree. Score: Cα internal distance [DALI-like, (Holm and Sander, 1993)] | No | 3.2.3 | (Konagurthu et al., 2006) | 2005 |
STRUCT | STAMP | Iterative superposition and alignment of Cα by DP with a guide tree. Score: Cα distances and conformational similarity | Yes | 4.4 | (Russell and Barton, 1992) | 1992 |
Note: Categories of programs: SEQ is a sequence-based alignment method; STRUCT is a structure-based alignment method; SEQ/STRUCT is a sequence+structure-based program. DP, dynamic programming; AFP, aligned fragment pairs.
Program | Alignments (number) | Alignments (%) | Average time |
---|---|---|---|
MATRAS | 847 | 100.0% | <10 s |
TCOFFEE_TM | 847 | 100.0% | <1 min |
KPAX | 846 | 99.9% | <10 s |
PROMALS3D | 846 | 99.9% | <10 min |
TCOFFEE_SAP | 845 | 99.8% | <10 s |
MTMALIGN | 845 | 99.8% | <10 s |
FORMATT | 844 | 99.6% | <1 min |
GESAMT | 841 | 99.3% | <1 s |
MUSTANG | 840 | 99.2% | <10 min |
MISTRAL | 828 | 97.8% | <10 min |
STAMP | 826 | 97.5% | <1 s |
MATT | 824 | 97.3% | <10 min |
3DCOMB | 822 | 97.0% | <10 s |
SALIGN | 796 | 94.0% | <10 min |
MULTIPROT | 766 | 90.4% | <1 min |
MAMMOTH | 622 | 73.4% | <10 s |
#Alignments | 847 |
Note: The average computation time has been measured for the 42 SISYPHUS families that all programs successfully aligned. All sequence-based methods compute the alignments in less than a second on average except PRANK (time < 1 min). KALIGN2, CLUSTALO, CLUSTALW and MUSCLE are the fastest (<0.1 s).
3 Results
3.1 Number of computed alignments
All programs were run on the 847 alignments. All sequence-based programs, as well as MATRAS and TCOFFEE_TM, successfully computed all 847 alignments, but the other structure-based and sequence+structure-based programs failed for some of them (Table 2). Some failures were due to the time limit, but most were due to errors returned by the programs. MAMMOTH encountered the most failures; it apparently has a limit of 25 proteins per alignment. To improve the robustness of our analysis, we restricted it to the alignments computed by all programs, resulting in 535 alignments: 24 from BB2, 24 from BB3, 287 from HOMSTRAD, 158 from OXBENCH and 42 from SISYPHUS. These 535 alignments involve more than 2000 different protein chains.
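The restriction to alignments computed by every program amounts to a set intersection. The following is a minimal sketch, not the authors' pipeline; the function name `common_alignments`, the dictionary layout and the family identifiers are hypothetical and only serve as illustration.

```python
def common_alignments(computed):
    """Return the alignment IDs that every program managed to compute.

    `computed` maps a program name to the set of alignment IDs it produced
    (hypothetical layout, chosen for this illustration only).
    """
    id_sets = [set(ids) for ids in computed.values()]
    return set.intersection(*id_sets) if id_sets else set()


# Toy input: three programs, four hypothetical families.
computed = {
    "MATRAS": {"fam1", "fam2", "fam3", "fam4"},
    "MAMMOTH": {"fam1", "fam3"},
    "STAMP": {"fam1", "fam2", "fam3"},
}
core_set = common_alignments(computed)
print(sorted(core_set))
```

With such a filter, a single program that fails often (as MAMMOTH does here) determines most of the reduction of the common set.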
3.2 Databases
The distribution of mean pairwise sequence identity among the 535 core alignments is given in Figure 1. The BB2, BB3 and SISYPHUS databases are more focused on low identity, while HOMSTRAD and OXBENCH contain alignments with high levels of identity. The proportion of amino acids included in regular secondary structures in the complete dataset is 60%; restricted to the core alignments, it increases to 79%. We checked the redundancy of the databases. The number and proportion of chains included in two databases are listed in Supplementary Table S1. There is some overlap between BB2 and BB3: 48 chains are present in both. However, the protein families are all different between BB2 and BB3, so we decided to keep them all. The overlaps are very weak for the other databases.
3.3 Global analysis of alignment scores
The distributions of the SP and CS scores of each program run on the 535 alignments are presented as boxplots in Figure 2. The exact median values are reported in Supplementary Table S2. Globally, the results are impressively good: the SP score medians range from 0.86 to 0.97, meaning that, for every method, more than 86% of the residue pairs are correctly aligned in half of the alignments. Similarly, in half of the alignments, more than 81% of the alignment columns are correct. Scores vary with the programs, and structure-based programs give better results on the whole, except for MULTIPROT and MISTRAL. The sorting is the same for SP and CS scores except for FORMATT, MULTIPROT and MISTRAL, which rank better with the CS score, and STAMP and KPAX, which swap ranks. STAMP shows the greatest variability in its results, and it is not the best although it has been used for building the alignments of two databases (HOMSTRAD and OXBENCH). FORMATT, a modified version of MATT that includes sequence information, is worse than MATT. This highlights the difficulty of combining sequence and structure information, which is nevertheless possible: TCOFFEE_TM is the best sequence+structure-based program, and performs better than TCOFFEE_SEQ. However, sequence+structure-based methods do not perform better than structure-only methods, despite the use of both sequence and structure information.
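To make the two measures concrete, here is a minimal sketch of an SP score (fraction of reference residue pairs recovered) and a CS score (fraction of reference columns reproduced exactly). It is an illustration under stated assumptions, not the scoring code used in the study: alignments are lists of gapped strings with the sequences in the same order, the gap character is `-`, and gap-free reference columns stand in for the core positions defined by the databases.

```python
from itertools import combinations


def column_indices(alignment):
    """For each column, give the residue index of every sequence (None for a gap)."""
    counters = [0] * len(alignment)
    columns = []
    for col in zip(*alignment):
        entry = []
        for i, ch in enumerate(col):
            if ch == "-":
                entry.append(None)
            else:
                entry.append(counters[i])
                counters[i] += 1
        columns.append(tuple(entry))
    return columns


def aligned_pairs(columns):
    """Set of (seq_i, residue_i, seq_j, residue_j) aligned in the same column."""
    pairs = set()
    for col in columns:
        for i, j in combinations(range(len(col)), 2):
            if col[i] is not None and col[j] is not None:
                pairs.add((i, col[i], j, col[j]))
    return pairs


def sp_score(test, reference):
    """Fraction of the reference residue pairs that the test alignment recovers."""
    ref_pairs = aligned_pairs(column_indices(reference))
    test_pairs = aligned_pairs(column_indices(test))
    return len(ref_pairs & test_pairs) / len(ref_pairs) if ref_pairs else 0.0


def cs_score(test, reference):
    """Fraction of gap-free reference columns found identically in the test."""
    ref_cols = [c for c in column_indices(reference) if None not in c]
    test_cols = set(column_indices(test))
    return sum(c in test_cols for c in ref_cols) / len(ref_cols) if ref_cols else 0.0


reference = ["ACDE-", "A-DEF", "ACDEF"]
test = ["ACDE-", "-ADEF", "ACDEF"]
print(sp_score(test, reference), cs_score(test, reference))
```

A single misplaced residue costs one whole column for CS but only the pairs involving that residue for SP, which is why CS medians are expected to be lower than SP medians.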
For each pair of programs, the significance of their differences has been evaluated by a Friedman rank test on their scores calculated for all 535 alignments (Section 2). In Figure 3, the programs are ranked according to their median CS score, and six groups of programs without significant differences within a group appear (black squares). The differences are significant between the programs outside the groups in most cases. The first group contains MAMMOTH and MATRAS, which are the two best performing programs according to our study. The second group gathers MATT, TCOFFEE_TM, 3DCOMB, MUSTANG and TCOFFEE_SAP, and their results are close to those of the first two programs. MUSTANG and TCOFFEE_TM are not significantly different from MAMMOTH and MATRAS despite their lower ranking. The last three groups contain all sequence-based programs and also FORMATT, MULTIPROT and MISTRAL. TCOFFEE_SEQ and PROBCONS are the two best sequence-based programs. The performances of STAMP, FORMATT, MULTIPROT and MISTRAL are not significantly different from those of programs with a lower ranking.
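As an illustration of a pairwise Friedman rank test, the sketch below computes the Friedman chi-square statistic for two programs, treating each alignment as a block. It is not the authors' implementation, whose details are given in Section 2; the function name `friedman_pair` is hypothetical, tied blocks simply receive the average rank 1.5, and no tie correction is applied. With two programs the statistic has one degree of freedom, so the p-value can be obtained from `math.erfc`.

```python
import math


def friedman_pair(scores_a, scores_b):
    """Friedman chi-square statistic and p-value for two programs.

    Each position in the two lists is one alignment (one block).
    Simplification: ties get rank 1.5 each and no tie correction is applied.
    """
    n = len(scores_a)
    if n == 0 or n != len(scores_b):
        raise ValueError("need two non-empty score lists of equal length")
    rank_sum_a = rank_sum_b = 0.0
    for a, b in zip(scores_a, scores_b):
        if a > b:
            rank_a, rank_b = 2.0, 1.0
        elif a < b:
            rank_a, rank_b = 1.0, 2.0
        else:
            rank_a = rank_b = 1.5
        rank_sum_a += rank_a
        rank_sum_b += rank_b
    k = 2  # number of programs compared
    q = 12.0 / (n * k * (k + 1)) * (rank_sum_a**2 + rank_sum_b**2) - 3.0 * n * (k + 1)
    # Survival function of a chi-square distribution with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(max(q, 0.0) / 2.0))
    return q, p_value


# Toy CS scores of two programs on four alignments.
q, p_value = friedman_pair([0.90, 0.80, 0.95, 0.70], [0.80, 0.70, 0.90, 0.60])
print(q, p_value)
```

Because the test only uses ranks within each alignment, a program that is slightly but consistently better can be declared significantly different even when the medians are close.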
We also performed a hierarchical clustering on the basis of the scores of the various programs and the various alignments. A heatmap of this clustering is presented in Figure 4 for CS scores and in Supplementary Figure S1 for SP scores. The results are extremely similar regardless of the score (CS or SP). Considering program clustering (left tree, Fig. 4), all sequence+structure-based programs and structure-based programs except STAMP are in the same sub-tree. All sequence-based ones are also pooled together. We have three groups of programs in the upper sub-tree (see the pink dashed line). TCOFFEE_SAP is alone on its branch; its score profile is different from the others: it sometimes fails when others succeed (see the red scores at the right extremity of its profile). The second group is composed of MUSTANG, MAMMOTH, MATT, FORMATT and MATRAS, whose performances are almost indistinguishable according to the Friedman tests. In the third group, 3DCOMB, MTMALIGN, KPAX and GESAMT have very similar profiles; they are also pooled with MULTIPROT and MISTRAL, which are designed to find conserved structural blocks rather than to align whole proteins. This splitting of the structure-based programs into two clusters is consistent with whether a rigid superimposition is performed at some step in the methods; MATT is an exception, but it explicitly compensates for the rigidity by introducing flexibility in its score. The profiles of sequence-based programs are very similar to each other.
We analyzed the most difficult alignments. Nine alignments show CS scores below 0.5 for all the programs (Supplementary Table S3 and Figure S7). The sequence identities of the core reference alignments are low (21% on average). The structural challenges of these alignments are: large insertions or deletions for some proteins of the families (five alignments), structural repetitions (one alignment) or large alignments with strong structural variations (three alignments). We also examined the difficult alignments for structure-based programs. There are 78 alignments where structure-based programs do not have the highest CS score. For 66 of them, the difference between the maximum CS score of all programs and that of structure-based programs is <0.1. The remaining 12 alignments are listed in Supplementary Table S4 and Figure S8. The sequence identity is globally higher (31% on average) and the RMS calculated from the reference alignment is high (>4 Å) for all but two families that include structural repetition (leucine-rich repeats and many beta strands).
3.4 The effect of sequence identity
We have investigated the effect of sequence conservation on the quality of the alignments computed by the different programs. The results are presented in Figure 5 for CS scores and in Supplementary Figure S2 for SP scores. As expected, the differences between structure-based and sequence-based methods are stronger for alignments of very divergent proteins. For alignments above 50% of sequence identity, sequence-based programs have similar or even better performances than structure-based programs. We also checked the effect of the number of proteins to align. The effect is very weak in the case of SP scores for all programs except MULTIPROT (Supplementary Figure S3) but it is noticeable on the CS scores (Supplementary Figure S4).
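The sketch below shows one way to compute the mean pairwise sequence identity of an alignment and to assign it to an identity class. It is an assumption-laden illustration: identity is taken here as the fraction of identical residues over the columns where both sequences have a residue (other denominators are in common use), and the class labels and the function names are hypothetical. Only the 30% and 50% thresholds come from the text.

```python
from itertools import combinations


def pairwise_identity(row_a, row_b, gap="-"):
    """Identical residues divided by columns where both rows have a residue."""
    both = [(a, b) for a, b in zip(row_a, row_b) if a != gap and b != gap]
    if not both:
        return 0.0
    return sum(a == b for a, b in both) / len(both)


def mean_pairwise_identity(alignment):
    """Average of pairwise_identity over all pairs of rows."""
    pairs = list(combinations(alignment, 2))
    if not pairs:
        return 0.0
    return sum(pairwise_identity(a, b) for a, b in pairs) / len(pairs)


def identity_class(identity):
    """Hypothetical labels built on the 30% and 50% thresholds used in the text."""
    if identity < 0.30:
        return "low"
    if identity <= 0.50:
        return "medium"
    return "high"


alignment = ["ACDEF", "ACD-F", "GCDEF"]
identity = mean_pairwise_identity(alignment)
print(identity, identity_class(identity))
```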
3.5 The effect of structural variations
We have measured the structural divergence by computing the RMS from a superimposition built according to the reference alignments. The performances of the programs as a function of these RMS values are presented in Figure 6 for the CS scores and in Supplementary Figure S5 for SP scores. We have split our dataset into alignments below 30% of sequence identity (left, 199 alignments) and above (right, 335 alignments). There is no alignment below 30% with an average RMS below 1 Å. For the alignments below 30% of sequence identity, the scores of all programs globally decrease while RMS increases. This is understandable for structure-based and sequence+structure-based programs, but it is less obvious for sequence-based programs. The average sequence identity of these alignments is almost constant whatever the RMS (between 21 and 25%). The decrease of CS scores for sequence-based programs may be associated with the increase of the number of gaps (from eight indels on average for alignments below 1 Å of RMS to 22 indels for all alignments above 3 Å) and with the increase of the number of proteins to align (from 3.5 proteins on average to 5.7). For the alignments above 30% of sequence identity, the CS scores decrease for alignments below 3 Å; this decrease is associated with a decrease of sequence identity (68% of sequence identity for alignments in the interval [0 Å, 1 Å], 52% for ]1 Å, 2 Å] and 40% for ]2 Å, 3 Å]). The variations are non-significant afterward (43% for ]4 Å, 5 Å] and 41% above 5 Å), which is consistent with the stability of the sequence-based CS scores. When the RMS is high (>6 Å) and the sequence identity not too low (>30%), several sequence-based programs perform better than the structure-based programs, and the best program is a sequence+structure-based program. The structural variability may be due to unstructured regions of the proteins, which may be seen in some structures of the difficult cases presented in Supplementary Figures S7 and S8.
We have also computed the RMS on the basis of the alignments resulting from the programs. The results are presented in Supplementary Figure S6. The multiple RMS among proteins of the families are smaller for structure-based methods than for sequence-based methods as expected because structure-based methods align proteins while optimizing the structural similarities. The RMS computed according to the reference alignments (black line) are in between the two.
3.6 SSE and burying effect
We also investigated whether structure-based methods are strongly dependent on secondary structures and solvent exposure. We computed the SP and CS scores independently for core residues in helices, strands or loops; the same procedure was applied for exposed or buried residues. The results are presented in Figure 7 (CS scores only). CS scores fall sharply for loops and a little for helices for structure-based and sequence+structure-based methods compared with sequence-based methods. The structure-based methods are sensitive to regular secondary structures. The scores decrease for exposed residues for all types of methods. The structural variability of exposed regions explains the difficulties of structure-based and sequence+structure-based programs. For the sequence-based programs, the decrease is probably due to the decrease of sequence identity (46% versus 33%).
3.7 Database effect
We wondered whether the success rate of the programs was dependent on the database. The composition of the various databases is different in terms of sequence identity and core definition. We tried to eliminate these biases by selecting alignments between 10 and 40% of sequence identity, since all databases are present in this range. In addition, only core positions in conserved regular secondary structures were selected. In Figure 8, it is clear that the CS scores fluctuate depending on the origin of the reference alignment. The median scores are globally higher and less variable for HOMSTRAD and OXBENCH, which contain more alignments and whose generation procedure is automatic. For BB2, BB3 and SISYPHUS, the dispersion of the scores is larger. One may consider that a bias in favor of the structure-based methods is present in HOMSTRAD and OXBENCH. Yet, the ranking of the programs is similar: the same structure-based or sequence+structure-based programs are the best, even though their order varies slightly. The most affected program is STAMP, whose performances are poorer with BB2, BB3 and SISYPHUS. It is used in the building procedure of HOMSTRAD and OXBENCH; nevertheless, its performances for those two databases are not the best. The best program in this 10–40% sequence identity subset is MATT, followed by MATRAS and FORMATT. Therefore, these programs have good results with divergent proteins.
3.8 Gaps
The proportion of gap openings is clearly different in sequence-based and structure-based programs (Fig. 9). The structure-based programs, except MAMMOTH, tend to over-estimate the number of indels and the sequence-based ones tend to under-estimate the number of gaps. MAMMOTH has a linear gap penalty function that seems to be quite efficient. PROMALS3D also has a linear gap penalty function and tends to put fewer gaps than in the reference alignments. PRANK, which has been designed for placing indels correctly, is the closest to the reference. As most of the structure-based methods work with small structural blocks, they do not have a gap penalty function, which may explain this possible over-estimation of gaps. We believe that some improvement in the gap treatment for structure-based and sequence+structure-based methods should improve their performance.
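A gap opening can be counted as the start of each run of gap characters in an alignment row. The sketch below compares the number of gap openings in a program alignment with that of the reference; it is only an illustration, since the exact definition behind Figure 9 is not given in this section. In particular, counting leading and trailing gaps as openings and the function names are assumptions.

```python
def count_gap_openings(row, gap="-"):
    """Count the runs of gap characters in one alignment row.

    Assumption: leading and trailing gap runs are counted like internal ones.
    """
    openings = 0
    previous_is_gap = False
    for ch in row:
        is_gap = ch == gap
        if is_gap and not previous_is_gap:
            openings += 1
        previous_is_gap = is_gap
    return openings


def gap_opening_ratio(program_alignment, reference_alignment):
    """Ratio > 1 means the program opens more gaps than the reference."""
    program_total = sum(count_gap_openings(r) for r in program_alignment)
    reference_total = sum(count_gap_openings(r) for r in reference_alignment)
    if reference_total == 0:
        raise ValueError("reference alignment contains no gap")
    return program_total / reference_total


reference = ["ACDE--FG", "AC--DEFG"]
program = ["AC-DE-FG", "A-C-DEFG"]
ratio = gap_opening_ratio(program, reference)
print(ratio)
```

In this toy case the program splits each reference indel into two shorter ones, the kind of fragmentation that a method working on small structural blocks without a gap penalty could produce.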
4 Discussion
In this article, we have compared the ability of sequence-based, structure-based and sequence+structure-based alignment programs to retrieve supposed homologous positions defined in reference alignments from five well-known databases. The structure-based programs have globally better performances than the sequence-based ones, but also better than most of the sequence+structure-based programs. A first group of two structure-based programs (MAMMOTH and MATRAS) scores significantly better than the others. A second group is close: MATT, MUSTANG and 3DCOMB (structure-based), TCOFFEE_TM and TCOFFEE_SAP (sequence+structure). All seven programs build the alignments from pairwise aligned fragments of a few residues. The program performances are different according to the hierarchical clustering of their results: they do not all cluster together, meaning that their success or failure varies with the alignments. A consensus method may achieve better results if it can identify the cases where each method succeeds, as also suggested by Berbalk et al. (2009). In TCOFFEE_TM and TCOFFEE_SAP, adding structure information clearly improves the alignment achieved by TCOFFEE_SEQ, but adding sequence information to MATT, as in FORMATT, does not improve it. The consistency-based programs (TCOFFEE, PROBCONS) are also the best ranked sequence-based programs in Pais et al. (2014) and Thompson et al. (2011), except for MAFFT. The performance differences between sequence-based and structure-based programs are stronger for low identity alignments, as highlighted by Kim and Lee (2007). The sequence-based program performances fall sharply for low sequence identity alignments but their performances are similar to those of structure-based programs above 50% of sequence identity. When the structural variations are large, structure-based program results may be worse than those of sequence-based programs.
Other difficult cases for structure-based programs are the loops, the proteins with structural repetitions such as leucine-rich repeats, and proteins with large insertions or deletions. We can conclude that, when aligning proteins for the identification of homologous positions, if all their structures are known, it is better to align the proteins only according to their structures. Today we can consider that the number of protein families with an available structure amounts to 8700 out of a total of 17 425 families or domains in the PFAM database (El-Gebali et al., 2019). In the cases where not all structures are known, it should be better to use a sequence+structure-based method such as TCOFFEE_TM, but this particular case has not been addressed in this study.
As the five reference alignment databases are built from protein structural information, we wondered whether this would favor structure-based methods. For the two automatically generated databases, the alignments are computed by structure-based programs, which should favor homologous or even non-homologous but structurally similar positions that are more easily retrieved with structure than with sequence only. The case is different with manually curated alignments because no structure-based method has been used to build them and all kinds of information have been used: sequence, function and structure. The scores of all programs and their dispersions are similar if the two automatic databases, HOMSTRAD and OXBENCH, are used, and they are globally lower and more variable using the three other databases. Moreover, whatever the database used, the first ranked program is always a structure-based program. In addition, structure-based and sequence+structure-based programs have better scores than sequence-based programs. It would be interesting to compare program alignments altogether without a reference in order to check their consistency, or to compute phylogenetic trees from the program alignments and derive a score from the accuracy of the trees. Finally, some improvements concerning the usability and applicability of structure-based programs would be worthwhile, and structure-based programs could improve the gap placement in the alignments.
5 Conclusions
For identifying homology in proteins, we can conclude that it is better to use structure information than sequence information only, yet the difficulty of combining sequence and structure information is obvious: the sequence+structure-based methods are not better than the structure-based methods. Several programs are globally equivalent in performance but their behavior varies for each alignment, and a consensus method might achieve better results. A real model of sequence and structure protein evolution would greatly improve the methods, but such a model is quite difficult to design, mainly because of the folding process that may drastically change the structure even if the sequence difference is not that strong. There is also still room for improvement in terms of software ergonomics and gap treatment. This study showed that, if several structures of a family are known, the most reliable alignment is the structural one. However, far more sequences than structures of a family are usually available, so the use of sequence+structure-based methods with all sequences and known structures would gather all available information and may produce the best alignment. Computing several kinds of alignments using tools like STRAP (Gille and Frömmel, 2001) that allow combining alignments would be the most advisable approach.
Acknowledgements
Our thanks to the authors of the programs used in this study for the informal discussion, Joël Pothier for discussion and critical reading of the manuscript and to Therese Pothier for the English proof reading. We also thank the referees for their pertinent remarks that permitted us to make noticeable improvements in the manuscript.
Funding
This study has been supported by regular supplies provided by both involved laboratories.
Conflict of Interest: none declared.
References
R Core Team (