## Abstract

**Motivation:** The steady growth of the number of available protein structures has constantly motivated the development of new algorithms for detecting structural correspondences in proteins. Detecting structural equivalences in two or more proteins is computationally demanding as it typically entails the exploration of the combinatorial space of all possible amino acid pairings in the parent proteins. The search is often aided by the introduction of various constraints such as considering protein fragments, rather than single amino acids, and/or seeking only sequential correspondences in the given proteins. An additional challenge is represented by the difficulty of associating to a given alignment, a reliable a priori measure of its statistical significance.

**Results:** Here, we present and discuss MISTRAL (Multiple STRuctural ALignment), a novel strategy for multiple protein alignment based on the minimization of an energy function over the low-dimensional space of the relative rotations and translations of the molecules. The energy minimization avoids combinatorial searches and returns pairwise alignment scores for which a reliable a priori statistical significance can be given.

**Availability:** MISTRAL is freely available for academic users as a standalone program and as a web service at http://ipht.cea.fr/protein.php.

**Contact:**michelet@sissa.it

**Supplementary information:**Supplementary data are available at *Bioinformatics* online.

## 1 INTRODUCTION

One of the ubiquitous bioinformatics tasks is the alignment of protein sequences and structures. The development of these computational tools has a long tradition and was especially motivated by the interest in tracing the evolutionary relationships of various protein families (Needleman and Wunsch, 1970; Phillips, 1970). Sequence alignment methods are particularly useful for this purpose when the sequence identity of the involved proteins is larger than ∼30%. Below this threshold there is no stringent relationship between sequence and structural similarity (Chothia, 1984; Chothia and Lesk, 1986; Gan *et al.*, 2002; Lesk, 2004; Wood and Pearson, 1999) and the detection of evolutionary and/or functional relatedness is appropriately complemented by structural alignment techniques (Koehl, 2001; Lichtarge and Sowa, 2002). Structural alignment, which is the focus of the present study, is applied in several other contexts including the grouping and classification of known protein folds, the identification of conserved structural fragments for molecular replacement in crystallography and homology modelling (Holm *et al.*, 1992; Orengo *et al.*, 1997; Schwarzenbacher *et al.*, 2008).

The result of a pairwise alignment procedure consists of a one-to-one correspondence list between amino acids of the two proteins. In rigid structural alignments, the two sets of corresponding *C*_{α} (or mainchain) atoms can be put in good spatial proximity after an optimal structural superposition which minimizes the root mean square distance (RMSD). Indeed, within these approaches the best alignment is obtained by minimizing a suitable measure of geometric compatibility, such as the similarity of distance matrices in DALI, over the possible amino acid pairings (Holm and Sander, 1993) or weighted sums of distances of equivalent *C*α's in CE (Shindyalov and Bourne, 1998) or in MAMMOTH (Ortiz *et al.*, 2002). Amino acid correspondences can also be established without resorting to criteria of rigid spatial superposability, as in ‘flexible’ alignment methods such as RAPIDO, FlexProt, FATCAT and POSA (Mosca and Schneider, 2008; Shatsky *et al.*, 2004b; Ye and Godzik, 2003, 2005).

As the combinatorial space of possible amino acid correspondences is enormous, the search space for the optimal alignment is often limited by the introduction of various constraints. For instance, pairings can be sought not between single amino acids but among fragments or secondary structure elements of suitable minimal length (Camproux *et al.*, 2004; Kolodny *et al.*, 2002; Micheletti *et al.*, 2000) such as in DALI, CE, MAMMOTH, GANGSTA (Kolbeck *et al.*, 2006) or CLEMAPS (Liu *et al.*, 2008).

A second commonly employed constraint is the requirement of sequentiality, that is subsequent amino acids in one protein must correspond to subsequent amino acids in the partner protein. This restriction, while computationally convenient, can impair the capability to detect evolutionary relationships (Xie and Bourne, 2008) as new protein structures can arise from the combination and permutation of subdomains (Bashton and Chothia, 2007; Fong *et al.*, 2007). The number of structural alignment methods that are non-sequential is still limited and include CA, MULTIPROT, MASS, TOPOFIT, SCALI and GANGSTA+(Bachar *et al.*, 1993; Dror *et al.*, 2003; Guerler and Knapp, 2008; Ilyin *et al.*, 2004; Shatsky *et al.*, 2004a; Yuan and Bystroff, 2005).

Besides the list of equivalent amino acids in two or more proteins, a highly desirable output of an alignment procedure is the indication of the statistical significance of the returned structural correspondence.

In a comprehensive study, Sierk and Pearson (2004) have applied several tests, based on the capability to correctly identify known homologous pairs in a reference set of protein domains, to validate the statistical significance returned by several structural alignment methods. Their analysis pointed out that the provided nominal statistical significance of alignment scores largely exceeded the true one, and hence posed the necessity to develop a better control of the scoring statistics. This task appears particularly challenging for alignment methods based on the combination of several criteria rather than on the optimization of a single scoring function.

Here, we discuss a novel scheme named MISTRAL (after MultIple STRuctural ALignment), which establishes structural correpondences between two or more proteins without exploring the combinatorial space of amino acid pairings and allows for a good control of the statistical significance of the returned alignments.

The optimal superposition of a given set of proteins is obtained by minimizing a suitable protein–protein interaction energy. The minimization of the energy function is carried out in the low-dimensional space corresponding to the orientations (translations and rotations) of the molecules in a given Cartesian reference frame. Once the molecules are optimally oriented, the residue–residue correspondences are established by using simple criteria based on spatial proximity. Such correspondences are therefore not sought by assuming any a priori sequence directionality among or within blocks of aligned residues in the two proteins which, in fact, can even be multimeric. The method is here applied and tested in several cases that aptly illustrate its non-sequential character and applicability in multiple alignment contexts. Finally, a detailed discussion of the returned nominal statistical significance, based on the database of Sierk and Pearson (2004) is presented.

MISTRAL is freely available for academic users as a web server at http://ipht.cea.fr/protein.php where multiple alignments of up to 20 proteins (of chain length up to 500 amino acids) can be submitted. To align more or longer proteins users can obtain the MISTRAL program from the authors.

## 2 METHODS

The optimal structural superposition of two proteins A and B, is identified by minimizing the following interaction energy term over their relative translations and orientations:

where the indexes*i*and

*j*run over all amino acids in proteins A and B, respectively, and

*f*is a weight function rewarding short spatial separations of amino acid pairs. Specifically, we considered the following piecewise-linear sigmoidal weight, where

*d*

_{ij}is the separation of the

*C*

_{α}atoms of amino acids

*i*and

*j*, and Δ is the interaction range which controls the spatial tolerance of the resulting superposition. By default, Δ is set to 8 Å. The product of the

*f*'s in Equation (1) embodies a multi-body interaction between pairs of amino acids that are nearby in sequence and consequently rewards the superposition of small protein fragments.

The parameters over which the minimization is done are the six degrees of freedom necessary to specify the displacement and orientation of protein B relative to A, which is held fixed in space. The extensive search over the relative orientations was employed before in alignment contexts; in STRUCTAL (Gerstein and Levitt, 1998) the spatial positioning of a protein pair is iteratively guided by finding equivalent pairs of amino acids. Here, the exploration of the relative orientations is, instead, guided by the minimization of the energy function in Equation (1). The optimization is carried out within a simulated annealing scheme starting from a tentative orientation of the proteins obtained by optimally superimposing fragments of 10–20 amino acids (using longer fragments typically adds additional computational overhead, without altering the result). Notice that, unlike the case of effective pairwise amino acid interactions used, for example, in models for protein docking or protein folding, the potential energy *f*, does not introduce any excluded-volume penalty at small separations. This allows protein chains to ‘pass-through’ each other as the minimum of Equation (2) is sought.

We point out that the method is seamlessly applicable if one or both proteins to be aligned are multimeric (though it should be noted that ‘flexible’ alignments scheme are better suited to deal with arbitrary displacements of separate domains). In this case, those amino acids *i* and *j* for which a chain interruption occurs in the intervals [*i*, *i*+2] or [*j*, *j*+2] are excluded from the sum in Equation (1). The same summation restriction is applied when dealing with proteins with missing structural informations about *C*_{α} atoms.

The energy-based approach is readily generalizable to perform multiple alignments where all proteins are treated in a symmetric, equal way. For example, in the case of alignment of three proteins A, B and C, the interaction energy to minimize is the sum of all pairwise interaction energies (possibly length-normalized):

The dimensionality of the parameter space entailed by the symmetric multiple alignment of*n*proteins is 6(

*n*−1) as one protein, the largest for computational convenience, is held fixed in space while all possible rotations and translations are considered for the others.

However, for the simplicity of the formulation, we shall not resort to the minimization of Equation (3), but will consider separately the possible *n*(*n*−1)/2 alignments and combine them *a posteriori* to identify the alignment core, as explained in a following section.

Finally, we observe that unless small values of Δ are used, the ‘cooperative’ terms in Equation (1), can assign a favorable energy not only to superposable segments having the same sequence directionality in the two proteins but also (albeit to a lesser extent) to segments with opposite directionality. This property, which is illustrated in Section 3, further highlights that the energy-based alignment can be employed without any a priori assumptions on the sequentiality and sequence directionality of the blocks to be aligned. The energy function in Equation (1) is, however, generalizable to strongly promote the matches of segments with opposite directionality by substituting (or complementing) the term in Equation (1) with

### 2.1 Statistical significance of pairwise alignments

The optimized alignment energies of 10^{4} pairs of protein chains were analyzed to characterize their statistical distribution. These reference pairwise alignments were produced by randomly pairing the members of a comprehensive set of about 2000 protein chains comprising between 19 and 500 amino acids (the average length being 230) with limited mutual sequence identity. The average pairwise sequence identity (the number of identically aligned amino acids divided by the length of the shortest protein (May, 2004)) returned by Clustalw (Chenna *et al.*, 2003) over the 10^{4} pairs was 25%. No more than 80 pairs out of 10^{4} overcome the 40% threshold. In view of the limited sequence relatedness of the set, the 10^{4} pairings are expected to involve mostly structurally unrelated proteins chains (Chothia, 1984; Chothia and Lesk, 1986; Gan *et al.*, 2002; Levitt and Gerstein, 1998; Wood and Pearson, 1999) and are thus taken as a random background reference to calibrate the statistical significance of structural alignments.

As a first step, the 10^{4} minimized pairwise energies were divided by the length of the longest of the two proteins, *l*_{max}. This energy rescaling favors the detection of matches that involve a substantial fraction of the longest protein. This is desirable as short proteins can be fitted against much longer ones which may have a different overall topological organization. The rescaled energies were found to have a residual dependence on *l*_{max}. In particular, the energies were shifted toward smaller absolute values for increasing *l*_{max}, as visible in the histograms of Figure 1a and b. This is consistent with the intuitive expectation that it is easier to find random alignments involving, e.g. 30% of a 100-residue protein, than 30% of a 300-residue one.

The length dependence of the rescaled alignment energies was characterized by fitting separately the energy distributions involving various intervals of *l*_{max}. Among the commonly employed statistical distributions it was found that the Gumbel extreme-value statistics, which often applies to alignment scores (Karlin and Altschul, 1990), provided a χ^{2} of order unity for the considered intervals of *l*_{max}. The quality of the Gumbel fits is visible in Figure 1, where it can be compared against the Gaussian distribution which yields χ^{2} values larger than a factor of about 4. The Gumbel distribution:

*E*is the energy of the alignment and

*a*

_{1}and

*a*

_{2}are the so-called location and scale parameters, was therefore taken as a viable statistical model for modeling the statistics of rescaled alignment energies. The analysis of the Gumbel fits for various length ranges (see e.g. Fig. 1a and b) indicated that the parametric dependence of the

*a*

_{1}and

*a*

_{2}parameters on

*l*

_{max}can be approximated as Based on the above considerations, the statistical significance of the alignment of two proteins is equivalently conveyed by the Gumbel

*z*-score: where γ≈0.5772, or the associated

*P*-value: where

*a*

_{1}and

*a*

_{2}have the aforementioned dependence on the length of the longest of the two proteins. The

*z*-score measures the separation (in terms of SDs) of the observed alignment energy from the random average. The

*P*-value, instead, provides the expected probability to have a random alignment with interaction energy equal or better (i.e. more negative) than the observed one.

The viability and consistency of the above analysis to model the statistics of random alignment energies is verified *a posteriori* over the reference dataset by comparing the expected and observed *P*-values, as a function of the *z*-score threshold. The observed *P*-value is straightforwardly calculated as the fraction of alignments whose energy has a *z*-score larger than the assigned threshold; the 80 pairs (out of 10^{4}) having sequence identity >40% were excluded from the analysis. The result is shown in Figure 1c and indicates that the expected and observed *P*-values are in very close agreement up to *z*-scores of about 3. Beyond this range, the differences are nevertheless limited, as the observed and expected number of alignments are still of the same order of magnitude even for *z*-score of 4.5 (the relative difference being a factor of 3). The fact that, at these values of *z*-score, the small number of observed alignments overcomes the expected one is also consistent with the fact that, despite the limited mutual sequence identity of members of the set, a limited number of genuine pairwise structural correspondences is still expected to occur in the dataset.

The above considerations indicate that the parameterized Gumbel statistics provides a consistent and viable model for estimating the significance of the structural alignments. An independent assessment of the viability of the statistical significance is presented in Section 3.

As the characterization of the statistics is particularly time consuming (both in terms of computation and data analysis), it was carried out only for pairwise alignments obtained with the default interaction range Δ =8 Å. Consequently, while further extensions of the statistical analysis may be undertaken in the future, presently the *z*-scores and *P*-values are provided by MISTRAL only for pairwise alignments carried out with the default interaction range.

### 2.2 One-to-one amino acid correspondences

The direct output of the minimization of Equation (1) or Equation (3) is the best spatial positioning of two or more proteins. The output is further processed to return one-to-one correspondences among amino acids in different proteins. The post-processing consists of an iterative scheme which attempts to match segments of length *l* in the two proteins. For two such segments, (*i*, *i*^{′}) and (*j*, *j*^{′}), with |*i*−*i*^{′}|=|*j*−*j*^{′}|=*l*, amino acids are tentatively matched in the scheme *i*:*j*; *i*+1:*j*+1;… or *i*:*j*^{′}; *i*+1:*j*^{′}−1;…. The latter type of matching can be disallowed in MISTRAL if reversal of sequence directionality in matching segments is not considered appropriate. The segments represent a putative match if they contain at least two corresponding amino acids at distance below the ‘seed’ threshold of 3.5 Å, and no pair at distance beyond the ‘alignment tolerance’ threshold of 4.0 Å. Among all viable segment pairs of length *l*, (*i*, *i*+*l*) (*j*, *j*±*l*), the one with the lowest RMSD is picked and assigned. Once this assignment is made, another search is performed for other viable segments of the same length. If no other viable segment is found, *l* is decreased by one unit and the search is repeated. The search is run starting from the maximum possible value of *l*=min(*l*_{A}, *l*_{B}) and stopping at *l*=4 which is the minimum considered length for aligned segments. The relative position and orientation of the two proteins obtained by the minimization of the interaction energy is finally adjusted by optimally superimposing the corresponding amino acids.

In the present online and distributed version of MISTRAL, multiple alignments are constructed by computing separately all one-to-one correspondences in the possible pairwise alignments of the set. Each protein in the set is next considered in turn as a tentative pivot for the multiple alignment and the size of the alignment core is computed. The latter is given by the set of amino acids that have a matching amino acid in *all* pairwise alignments with the other proteins in the set. The protein with the largest core is picked as pivot for the multiple alignment and relative position of the proteins is finally adjusted by optimally superimposing the core-corresponding region of each non-pivotal protein over the core of the pivot one. Besides the size of the core, the quality of the resulting alignment is conveyed by calculating the RMSD of all pairs of corresponding amino acids; e.g. if *n* proteins share a core of *N* residues, the number of spatial distances considered for the average is *Nn*(*n*−1)/2.

## 3 RESULTS

The MISTRAL alignment strategy complements traditional approaches for structural alignments in two main respects.

On the one hand structural alignment between two or more proteins is identified by minimizing an energy function in a very low-dimensional parameter space (e.g. six parameters per pairwise alignment) rather than exploring the highly dimensional combinatorial space of possible amino acid correspondences. This aspect entails not only a simpler formulation and implementation of the alignment algorithm, but also makes the algorithm free of constraints on the sequentiality of the matched blocks, and also their sequence directionality, that are employed in several alignment methods.

A further point is the fact that the distribution of MISTRAL alignment scores (energies) of random pairs of proteins chain is well modeled by the extreme value (Gumbel) statistics, implying a good control of the statistical significance of the found alignments. This property represents an important aspect to the method as appreciable discrepancies in the expected and observed significance of various alignment schemes were reported.

In the following, we shall illustrate and discuss the generality and performance of the MISTRAL algorithm by considering various cases for pairwise and multiple structural alignments.

### 3.1 Validation against curated pairwise alignments

We first discuss the performance of MISTRAL in a context of curated pairwise alignments taken from the SISYPHUS database (Andreeva *et al.*, 2007).

Specifically, the reference set consists of 69 protein pairs recently used by Mayr *et al.* (2007) in a comparative evaluation of six alignment methods: CE, DALI, FATCAT, CA, MATRAS (Kawabata, 2003) and SHEBA (Jung and Lee, 2000). The accuracy of each alignment is measured by computing the fraction of reference amino acid pairings that are correctly aligned by MISTRAL. The overall performance of the method is conveyed by the average and median accuracy calculated over the 69 queries. The average accuracy was found to be 66% and the median one was 82%. These values were obtained for the default parameters (seed tolerance and alignment tolerance equal to 3.5 Å and 4.0 Å, respectively). A detailed profiling of the performance in a wide range of parameters are provided in the Supplementary Material. The accuracy of MISTRAL compares well against the performance of the six alignments methods considered by (Mayr *et al.*, 2007) for which the typical average accuracy is in the range of 51–76% and the median one is in the range of ∼ 72–91%. In particular, for both the average and median accuracy, MISTRAL is behind only DALI (average: 76%, median: 91%) and MATRAS (average: 67%, median: 88%).

An important additional element to take into account in the above evaluation is the average length of the alignments, as the latter impacts on the method sensitivity. In this respect, it is noted that the average length of MISTRAL alignment is 100, which is appreciably closer to the average reference length (77) than, for example, DALI which has an average alignment length larger than 115.

Finally, we applied the method of (Kim and Lee, 2007) and computed how the fraction of correctly aligned residues, *F*_{car} changes if a maximum shift error (mismatch along the sequence), Δ, is allowed. The average *F*_{car} was computed for the range 0≤Δ≤10, see Supplementary Material. The performance improvement of MISTRAL for increasing Δ is in line with that of most alignment methods considered by (Kim and Lee, 2007). In particular, increasing Δ from 0 (no mismatch allowed) to 4 takes the average *F*_{car} from 66% to 75%, while the median *F*_{car} increases from 82% to 90%.

### 3.2 Non-sequentiality of MISTRAL alignments

We now discuss the capability of MISTRAL to deal with non-sequential alignments. The method can, in fact, detect correspondences between segments with arbitrary order in the matching proteins and is tolerant to their sequence directionality. In addition, it is seamlessly applicable to multimeric proteins.

These features are illustrated and discussed in the context of pairwise alignments between two representatives of the OB-fold family, which gathers proteins that bind nucleic acids. Canonical OB-fold members can exhibit considerable variability for both structure and nucleic acids binding modes (Theobald *et al.*, 2003), yet they share a common spatial organization of several β-strands, which is highlighted in Figure 2a for protein RPA70. By means of NMR techniques, it has recently been possible to resolve the structure of nucleic acids binding proteins that adopt a non-canonical OB-fold (de Chiara *et al.*, 2005). An example is provided by ATX:HBP1 (PDB: 1v06) which is represented in Figure 2b. From Figure 2a, it can be perceived that the sequence order of the β-strands and also their directionality is different with respect to the canonical OB-fold. The visual inspection of Figure 2a and b, however, suggest that a general alignment algorithm, insensitive to the succession and directionality of the matching regions ought to detect the superposability of the two structures (Zen *et al.*, 2009).

Set | PDB IDs | Avg. length | Core size | RMSD |
---|---|---|---|---|

1 | 1hhoA, 1hhoB, 1mbd | 145 | 136 | 1.4 Å |

2dhbA, 2dhbB | ||||

2 | 1dlw, 1dly, 1eco | 138 | 72 | 2.1 Å |

1hhoA, 1hhoB, 1idrA | ||||

1mbd, 2dhbA, 2dhbB | ||||

2lh7, 4vhbA | ||||

3 | 1a75A, 1omd, 1pal, 1pvaA | 108 | 100 | 0.7 Å |

1rtp1, 5cpv, 5pal | ||||

4 | 1bmv1, 1cwpA, 1smvA | 200 | 54 | 2.0 |

2bukA, 2tbvA, 4sbvA |

Set | PDB IDs | Avg. length | Core size | RMSD |
---|---|---|---|---|

1 | 1hhoA, 1hhoB, 1mbd | 145 | 136 | 1.4 Å |

2dhbA, 2dhbB | ||||

2 | 1dlw, 1dly, 1eco | 138 | 72 | 2.1 Å |

1hhoA, 1hhoB, 1idrA | ||||

1mbd, 2dhbA, 2dhbB | ||||

2lh7, 4vhbA | ||||

3 | 1a75A, 1omd, 1pal, 1pvaA | 108 | 100 | 0.7 Å |

1rtp1, 5cpv, 5pal | ||||

4 | 1bmv1, 1cwpA, 1smvA | 200 | 54 | 2.0 |

2bukA, 2tbvA, 4sbvA |

The length and RMSD of aligned core are provided.

This expectation is indeed confirmed by their MISTRAL alignment which yielded a set of 59 corresponding amino acids (with an RMSD of 2.0 Å) covering intervals with different sequence order and directionality in the two proteins (Fig. 2c and d).

### 3.3 Multiple alignments

As anticipated in Section 2, the algorithm can be used for multiple alignments either by combining information from all pairwise alignments among the provided proteins or through the minimization of the energy function in Equation 3 that treats simultaneously and on equal footing, all possible protein pairings. In both the cases, it is necessary to evaluate *n*(*n*−1)/2 energy terms, *n* being the number of proteins to align. The difference of the two approaches is that, in the first case, the minimization is carried over a 6D space for each of the *n*(*n*−1)/2 alignments, while in the second case, the minimization is over a 6(*n*−1)-dimensional parameter space. By comparison against available multiple structural alignment algorithms, we found that the first strategy, based on the prior computation of pairwise alignments, was adequate to detect known common alignment cores, and therefore we adopted it for its simplicity.

We considered four groups of proteins previously used as test cases for multiple alignments. In particular, we considered two sets of globins, which had been originally the object of manually curated structural superpositions (Lesk and Chothia, 1980), and two sets of the Homstrad database (Mizuguchi *et al.*, 1998).

The list of proteins defining the four groups are provided in Table 1, along with quantitative indicators of their multiple alignment. The resulting multiple structures superpositions are shown Figure 2e–h.

We first discuss the two sets of globins, previously considered by Konagurthu *et al.* (2006) when evaluating the performance of the Mustang method. The group of proteins in Set 1 are highly homologous and hence admit an almost complete structural superposition. Their alignment core (Fig. 2e), consists of 136 amino acids, with an average RMSD of 1.4 Å. The second set of globins, which comprises those in Set 1, is more heterogeneous, a fact reflected in a smaller core. The latter in fact, consists of 72 amino acids with an average RMSD of 2.1 Å (Fig. 2f). The running time over the two sets on a present-day personal computer was 16 s for set 1 and 80 s for set 2. Over the same sets Mustang, POSA and CE-MC yielded a core of 138–139 amino acids at an average RMSD of 1.4–1.5 Å for Set1 and a core of 90–92 amino acids at an average RMSD of 2.2–2.5 Å for Set 2. As already remarked in the context of the MISTRAL selectivity about SISYPHUS reference alignments, the method returns somewhat smaller cores than other alignment methods, but the associated RMSD is also smaller. Indeed, over the sets in Table 1, the values of RMSD_{100} [which provides a length-rescaled RMSD (Carugo and Pongor, 2001)] of MISTRAL are smaller than those of the mentioned alignment methods.

The results refer to the default parameters for interaction range, Δ = 8 Å, seed tolerance of 3.5 Å and alignment tolerance of 4.0 Å, which can be adjusted for specific needs (see Supplementary Material for an illustration of the dependence of alignment length on the parameters). For example, setting the tolerance to 5 Å returns a core of 138 residues with 1.7 Å RMSD for Set 1 and a core of 85 residues with 2.5 Å RMSD for Set 2.

Finally, we consider the multiple alignment of two groups of proteins from the Homstrad database. The first set (Set 3 in Table 1) comprises seven proteins, while the second, Set 4, gathers six proteins. Their multiple alignment returns a core of 100 residues with an average RMSD of 0.7 Å for the first set (Fig. 2g), and 54 residues with average RMSD of 2.0 Å for the second one (Fig. 2h). In both cases, the results compare well with the results of MULTIPROT which yield cores of 101–107 residues for Set 3 and of about 33–76 residues for Set 4 upon varying the algorithm alignment tolerance threshold from 3 Å to 4 Å.

### 3.4 Selectivity and sensitivity of the method

As discussed in Section 2, the distribution of alignment energy of random pairs of protein chains is well described by the extreme value (Gumbel) statistics. This fact indicates that the alignment energy of MISTRAL can adequately capture the statistical significance of structural alignments. The need to develop reliable statistical scores for structural alignments emerged some years ago from the work of Sierk and Pearson (2004) who showed that structural alignment methods were largely overconfident in the expected statistical significance of the alignments (which was shown to exceed even by orders of magnitude the observed one). As a further test of the viability of the statistical significance provided by MISTRAL for pairwise alignments we have carried out statistical tests on the same dataset considered by Sierk and Pearson (2004). The use of the same dataset, allows a straightforward comparison of the performance of MISTRAL relative to other methods and of its statistical significance.

The dataset of Sierk and Pearson (2004), consists of 86 protein domains used as queries and of a database of 2771 domains (including the queries themselves). Each query was run against all members of the database, thus returning 86×2771 pairwise alignments which include 1111 pairings of domains sharing the same CATH code [following Sierk and Pearson (2004) these pairings are referred to as homologs]. The CATH subdivision in groups of homologs is taken as the ‘gold standard’ for defining the ‘true’ structural correspondences present in the set. While this is a natural criterion, it should be borne in mind that it is not free of limitations (Kolodny *et al.*, 2005; Sippl and Wiederstein, 2008).

We first used the dataset to analyze the overall sensitivity and selectivity of MISTRAL. For a given level of statistical significance, a more sensitive method will identify a larger number of correct correspondences (homologs) among a given set of pairwise alignments. A more selective method will instead yield a smaller number of false positives (non-homologs). As customary, both features are assessed by plotting the errors per query (EPQ) versus coverage (Brenner *et al.*, 1998). Specifically, the set of alignments was ordered by decreasing *z*-score and the number of correct homolog pairs *n*_{c} among the top *n* alignments was computed. For a given level of significance (*z*-score), the EPQ are given by (*n*−*nc*)/86. The coverage, which is the detected fraction of correct alignments, is instead given by *nc*/1111.

The plot is shown in Figure 3. The resulting performance is consistent with the average one exhibited by the seven structural alignment methods considered by Sierk and Pearson (which included methods that were sequential and not applicable to multimeric proteins). In particular, the coverage for the threshold values of 0.1, 1 and 10 EPQ correspond to a coverage of 21%, 32% and 51%, respectively. The average coverage ranges found among the seven methods analyzed in Sierk and Pearson (2004) for the same EPQ thresholds are approximately: 15%, 30% and 50%, respectively.

As for several of the methods considered in the mentioned study, most of the top false positives of MISTRAL correspond to topologs, that is pairs of domains that share the same CATH topology. In fact, among the top 100 non-homolog pairs, there are 71 topologs. As a complement of the sensitivity/selectivity analysis, we examined the reliability of the statistical significance of the alignments. This represents an important complementary aspect of the above analysis as it aims at clarifying if the incidence of false positives can be kept under a priori control. In this respect, all structural alignment methods considered in Sierk and Pearson (2004) proved to be ‘overconfident’, in that the number of encountered top false positives was found to exceed by orders of magnitude the one expected on the basis of the score statistics. We have undertaken this analysis by following the method of Brenner *et al.* (1998) which, at variance with the approach of Sierk and Pearson (2004), is not limited to the first false positive of each query. The method we used consists of comparing the number of observed and expected EPQ (non-homologs) as a function of the *z*-score threshold. The number of expected EPQ is obtained by multiplying the *P*-value by the size of the database. The resulting curve is shown in Figure 3. It is seen that the number of expected false positives (or equivalently of EPQ) follows the observed one within about a factor of 3. In agreement with what we exposed in Section 2, the results of Figure 3 provide a clear indication that also on this comprehensive set of 86×2771 alignments (which is unrelated to the one used for modeling and calibrating the alignment score), the returned *z*-score and *P*-values provide an adequate estimate of the statistical significance of the alignments.

## 4 CONCLUSIONS

We presented and discussed MISTRAL, a novel multiple structural alignment method based on the minimization of a scoring function representing a suitable pairwise interaction energy between the given proteins.

The parameters over which the energy function is minimized are the relative rotations and translations of the molecules. The dimensionality of the phase space to be explored during the minimization is consequently drastically limited (e.g. six parameters per pairwise alignment) compared with the highly dimensional combinatorial space of possible amino acid correspondences.

After the suitable normalization depending on the proteins' length, the optimized energies returned by MISTRAL for alignment of unrelated proteins are found to follow closely the Gumbel extremal statistics distribution. This property reflects in a good a priori control of the statistical significance of the method, as confirmed by running MISTRAL over a standard reference database of protein domains (Sierk and Pearson, 2004).

## ACKNOWLEDGEMENTS

We thank A. Capdepon for setting up the MISTRAL web server and R. Potestio and A. Zen for useful discussions.

*Funding*: The Italian Ministry of Education, grant PRIN.

*Conflict of Interest*: none declared.

## Comments