Abstract
Motivation: Searching for structural similarity is a key issue of protein functional annotation. The maximum contact map overlap (CMO) is one of the possible measures of protein structure similarity. Exact and approximate methods known to optimize the CMO are computationally expensive and this hampers their applicability to largescale comparison of protein structures.
Results: In this article, we describe a heuristic algorithm (AlEigen) for finding a solution to the CMO problem. Our approach relies on the approximation of contact maps by eigendecomposition. We obtain good overlaps of two contact maps by computing the optimal global alignment of few principal eigenvectors. Our algorithm is simple, fast and its running time is independent of the amount of contacts in the map. Experimental testing indicates that the algorithm is comparable to exact CMO methods in terms of the overlap quality, to structural alignment methods in terms of structure similarity detection and it is fast enough to be suited for largescale comparison of protein structures. Furthermore, our preliminary tests indicates that it is quite robust to noise, which makes it suitable for structural similarity detection also for noisy and incomplete contact maps.
Availability: Available at http://bioinformatics.cs.unibo.it/AlEigen
Contact:dilena@cs.unibo.it
Supplementary information:Supplementary data are available at Bioinformatics online.
1 INTRODUCTION
Measuring the similarity of protein structures is a necessary step in several problems of bioinformatics, including the determination of structure conservation through evolution across species and/or the relation among structure and function. Several similarity measures of protein structures are available albeit no general agreement on the best one has been achieved (Godzik, 1996; Oakley et al., 2008; Sadreyev et al., 2009). Routinely, each similarity measure relies on the choice of a scoring function and on the assumption that its optimum corresponds to the best possible match between two protein structures.
The most widely adopted scoring measures are based on the root mean square deviation (RMSD; Kabasch, 1976), distance map similarity (Holm and Sander, 1993), contact map overlap (CMO; Godzik et al., 1992) and, recently, universal similarity metric (UMS; Krasnogor and Pelta, 2004; Rahmati and Glasgow, 2009). The maximum CMO measure quantifies the level of similarity between two protein structures by measuring the maximum overlap of their contact maps and it is obtained by computing the sequence alignment that maximizes the number of corresponding contacts between pairs of aligned residues. The maximum CMO is one of the few measures for which exact algorithms are known. On the other hand, the maximum CMO problem is known to be NP hard (Goldman et al., 1999).
Although not explicitly stated in previous papers dealing with CMO algorithms, the goal of the these tools is to address the problem of protein structure prediction. In practice, given a set of predicted contacts the basic idea is to look for the best possible physical contact map among the available ones by means of a CMO program: methods for contact prediction are presently not accurate enough (Ezkurdia et al., 2009) to allow protein structure reconstruction (Vassura et al., 2008). The main contribution of CMO algorithms is therefore to provide tools for template recognition when the native protein structure is unknown and not to substitute structural alignment methods (i.e. methods based on RMSD and distance map similarity measures), which perform quite well.
The first exact algorithm for the maximum CMO problem, based on integer programming (IP), was developed in Lancia et al. (2001) and improved in Caprara and Lancia (2002). Later, several other methods based on the same approach were proposed (Andonov et al., 2008; Strickland et al., 2005; Xie and Sahinidis, 2007). The IP approach consists in formulating the CMO as the maximization of some integer linear function and solving it with Lagrangian Relaxation (LR) and/or Branch and Bound reduction techniques. The disadvantage of IPbased methods is that, due to the intractability of the problem, they are exponential in the worst case. For a practical usage, the running time of these algorithms is bounded and the best solution within the time limit is returned. The counter part is that the IPbased methods provide upper and lower bounds to the optimal solution and this makes it possible to evaluate the quality of the partial solution from the distance between the upper and lower bound. The best possible overlap is found when the upper and lower bounds coincide. Recently, two polynomialtime approximation schemes for the protein structure alignment problem (in particular, contact map alignment) have been developed (Agarwal et al., 2007; Xu et al., 2007). The approximation algorithm described in Xu et al. (2007) is polynomial in the protein length but it is exponential with respect to some constant parameters and its running time increases with decrease of an approximation factor. The method developed in Agarwal et al. (2007) is based on a decomposition procedure on the input graphs. It is a sixapproximation algorithm, i.e. it returns a solution that is at least 1/6 distant from the optimal one, and it has a polynomial running time.
Despite the strength of the underlying formalization, the CMObased algorithms are scarcely used to compare protein structures. Indeed, the algorithmic implementations of the exact and approximation methods are on the average too slow to be used for widescale comparison. Furthermore and most importantly, there is no agreement on the most suitable value of contact threshold to represent a protein structure (Caprara et al., 2004; Duarte et al., 2010; Vassura et al., 2008). Highthreshold contact maps (≥10 Å) are more informative; however, their comparison requires an even higher computational time and this makes the adoption of exact/approximation methods not feasible.
To our knowledge, only three heuristic methods, SADP (Jain and Lappe, 2007), MSVNS (Pelta et al., 2008) and BIMAL (Jain and Obermayer, 2009), have been proposed for the CMO problem. Even if not optimal, SADP, MSVNS and BIMAL can produce acceptable solutions in a reasonable time compared with exact methods. The main limitation of available heuristic CMO methods is that their running time depends on the number of contacts making them extremely slow when the contact maps contain a huge number of contacts.
In this article, we describe a new heuristic algorithm for the CMO problem. Our approach is based on the property that a contact map can be wellapproximated by few of its eigenvectors. Thus, an acceptable overlapping of two contact maps can be heuristically obtained by performing a global alignment of few eigenvectors. The exploitation of the eigenvalues and eigenvector properties of the graph adjacency matrices have been extensively investigated in the more general context of inexact graph matching. In the pioneering work by Umeyama (1988), lately improved in Zhao et al. (2007), the eigendecomposition is introduced to detect a nearoptimal permutation matrix that maximizes the similarity between two labeled graphs of the same size. The limitation on the graph dimensions is overcome in more recent approaches, such as Luo and Hancock (2001) and Singh et al. (2007), where the eigendecomposition is used to detect local similarities between the neighborhood topologies of the nodes in the two graphs. A different approach that exploits eigendecomposition is based on the conversion of a graph into a string by using the ordering of the nodes as defined by the principal eigenvector of its adjacency matrix (RoblesKelly and Hancock, 2002). By this, two graphs can be matched by computing an alignment that minimizes the edit distance between the corresponding strings. However, these approaches, differently from the CMO methods, do not require to preserve the ordering of the nodes in the adjacency matrices. Differently, in our method the eigendecomposition is used to transform the twodimensional alignment problem between contact maps into a onedimensional alignment problem between eigenvectors. Therefore, in our case, the ordering between the nodes in the maps is naturally imposed by the onedimensional alignment between their eigenvectors.
Our algorithm is easily implementable and fast. Noticeably, by design, its running time does not depend on the number of contacts contained in the map and thus on the contact threshold. Experimental results show that it can compute good overlaps compared with exact CMO methods and that its performances in terms of protein structure recognition/classification are comparable with those of structural alignment methods. The running time of our implementation is comparable with the fastest structural alignment algorithms and heuristic CMO methods. Our tests confirm that contact maps computed at thresholds values ≥10 Å are more informative of the protein structures than those at lower thresholds and this makes our algorithm more suitable than other available CMO methods for protein structure comparison on large scale. Furthermore, in order to evaluate the robustness our algorithm, we test the effect of random noise on the accuracy of structural similarity recognition performance. To the best of our knowledge, this is the first time that a CMO algorithm is evaluated in this way. The experimental tests indicate that our method is quite robust and can tolerate high amount of noise without dramatically affecting its recognition capabilities.
2 BACKGROUND
2.1 Contact maps
A protein contact map is a twodimensional approximation of the protein threedimensional structure. For a given protein P, its contact map of threshold τ is a square binary symmetric matrix defined by
There are several definitions of distance between residues in literature. The particular choice of a distance is not critical since the CMO problem is independent of the distance used to represent contacts between residues.Following the CMO literature, we consider here the C_{α}distance, which defines the distance between residues i,j as the Euclidean distance between the coordinates of their respective C_{α} atoms. Typical threshold values for C_{α} contact maps vary from 6 Å to 16 Å. For this range of thresholds, consecutive residues are always in contact (consecutive residues share a peptide bond and the distance between their respective C_{α} atoms is about 3.7 Å). For lowthreshold values, typically 6–9 Å, the number of contacts (i.e. 1s) observed in the map is sparse compared with the number of noncontacts (i.e. 0s). Moreover, these threshold values are the ones which minimize the distance between C_{α} contact maps and physical contact maps (Bartoli et al., 2007). On the contrary, highthreshold contact maps (10–16 Å) have a higher number of contacts and are more informative about the protein structure. At lowthreshold values, several different threedimensional structures can be consistent with the same contact map: the ambiguity can be minimized by increasing the threshold of the contact map (Duarte et al., 2010; Vassura et al., 2008). The threshold problem was also noticed in Caprara et al. (2004), the authors report that a threshold value smaller than 7 Å is not suitable to represent the protein structures in their benchmark set.
2.2 The maximum CMO problem
Given two proteins P_{1}, P_{2}, whose (ordered) sets of residues are denoted, respectively, by R_{1} = {1,…, n} and R_{2} = {1,…, m}, an alignment between P_{1} and P_{2} is a mapping f : R_{1} → R_{2} that respects the following two conditions:
f is an injective partial function; and
for each pair of residues i,j ∈ R_{1} in the domain of f (i.e. f(i) ≠ ∅ ≠ f(j)) we have that
Condition 1 imposes that a residue in the first/second protein can be aligned at most with one (possibly none) residue in the second/first protein. The nonaligned residues are assumed to be matched with gaps. Biologically, the introduction of a gap reflects an insertion/deletion event during the evolution of protein sequences. The number of gaps in an alignment f is defined by
Condition 2 imposes the ordering of the residues to be preserved in the alignment.The maximum CMO for proteins P_{1}, P_{2}, is an alignment that maximizes the overlap between their respective contact maps M^{P1}, M^{P2}. More formally, the maximum CMO problem is defined as the problem of computing the alignment f that maximizes the quantity
Noticeably, since contacts between consecutive amino acids are always present, they are not counted in (1). Moreover, a match between a contact and a noncontact is not penalized in (1).The CMO can be used as a measure of the similarity between two proteins structures: the higher the overlap between two contact maps, the higher the probability that the two related protein structures are similar. The CMO measure is quite robust to perturbations and does not greatly penalize the insertion of gaps and deletions. The CMO as similarity measure was introduced in Godzik et al. (1992). The problem of computing the maximum CMO was proven to be NP hard in Goldman et al. (1999). To quantify the level of similarity of two overlapped contact maps, we use the most widely adopted scoring function, originally proposed in Xie and Sahinidis (2007):
where C(M) = ∑_{j>i+1}M_{ij} denotes the number of contacts in the contact map M.3 MATERIALS AND METHODS
3.1 Datasets
In order to compare our results with those published before, we use the protein sets previously described. In practice, we use five different datasets (see Table 1 and the Supplementary Material) for three different tests: performance comparison with exact and heuristic CMO methods (Section 4.1), recognition and classification performance comparison with structural alignment methods (Section 4.2) and error tolerance test (Section 4.3).
Dataset  Proteins/Domains  Residues  

Min  Avg ± SD  Max  
Sokol  9  12  39±14  51 
Lancia  269  44  57±4  68 
Skolnick  40  97  160±61  256 
Fischer  68  62  183±105  534 
Proteus300  300  64  193±97  465 
Dataset  Proteins/Domains  Residues  

Min  Avg ± SD  Max  
Sokol  9  12  39±14  51 
Lancia  269  44  57±4  68 
Skolnick  40  97  160±61  256 
Fischer  68  62  183±105  534 
Proteus300  300  64  193±97  465 
For the comparison with exact CMO methods, we refer to the results published in Andonov et al. (2008) on the Skolnick dataset. The contact maps used in Andonov et al. (2008) have been computed at 7.5 Å threshold and are available as Supplementary Material.
For the comparison with heuristic CMO methods, we refer to the results published in Jain and Obermayer (2009) on the Sokol, Lancia, Skolnick and Fischer datasets. These datasets are usually adopted as standard benchmark to evaluate overlap quality of CMO algorithms. The maps for the Skolnick, Lancia and Sokol datasets used in Jain and Obermayer (2009) are available as Supplementary Material of Xie and Sahinidis (2007). The contact threshold used to compute these maps is not reported but it seems to be lower than 7.5 Å. The maps for the Fischer dataset are not publicly available; thus, we computed our maps at threshold 7.5 Å for this test.
For the comparison with structural alignment methods and for testing the error tolerance of our method, we use the Fischer and Proteus300 datasets, which contain medium–large protein domains and are more varying with respect to the SCOP (Andreeva et al., 2008) classification than the other benchmarks considered. The 68 domains in the Fischer dataset are distributed in 56 distinct SCOP families, 44 distinct super families and 40 distinct folds. Among them, 22 domains belong to a nonunique family (i.e. for each one of them there is in the dataset at least one representative in the same family class), five have one representative in the same fold class but not in the same family/s.family classes and 22 domains belong to a unique fold class. The 300 domains in the Proteus300 dataset are distributed in 30 distinct SCOP families (10 domains per family), 27 distinct super families and 24 folds. For the recognition/classification experiments, we test several different threshold values. For all the other experiments, we use contact maps at threshold 7.5 Å (or lower).
3.2 Aleigen implementation
The maximum CMO problem involves the alignment of twodimensional objects. While there are optimal polynomialtime algorithms for the alignment in onedimensional space, the problem is not tractable in two dimensions. Here, we describe a heuristic method to obtain a twodimensional alignment of contact maps by means of a onedimensional alignment of their eigenvectors. Our approach uses standard techniques such as the canonical eigendecomposition of symmetric matrices (Strang, 2003) and the Needleman–Wunsch (NW) alignment algorithm (Needleman and Wunsch, 1970).
The spectral theory provides conditions under which a matrix can be decomposed into a canonical form in terms of eigenvalues and eigenvectors. This canonical decomposition is usually called eigendecomposition or spectral decomposition. By the spectral theorem, every real n × n symmetric matrix M can be eigendecomposed as
where λ_{i} represents the ith eigenvalue, v_{i} the corresponding eigenvector and ⊗ denotes the outer product between vectors. The ordering of the eigenvalues is not important provided that the eigenvectors are permuted accordingly; thus, we can always assume that the eigenvalues are sorted in decreasing order, i.e. λ_{1} ≥ λ_{2}≥ ··· ≥ λ_{n}. Equation (3) defines matrix M as the sum of n × n matrices v_{i} ⊗ v_{i}, weighted by the corresponding eigenvalues λ_{i}.The eigendecomposition allows the detection of the most relevant information contained in a contact map. For example, in Porto et al. (2004), the authors show that a contact map can be perfectly reconstructed starting from the only knowledge of its principal eigenvector. In practice, a contact map can be approximated by considering only few of its eigenvectors/eigenvalues. For instance, for 1 ≤ t ≤ n, the approximation of order t of M can be defined as
This way of approximating a contact map is effective because the smaller is eigenvalue λ_{i} the smaller is the contribution of matrix v_{i} ⊗ v_{i} in Equation (3). This is actually one of the approaches used for image data compression (Andrews and Patterson, 1976).Consider now two proteins P_{1}, P_{2} with contact maps M^{P1} ∈ {0, 1}^{n×n}, M^{P2} ∈ {0, 1}^{m×m}, respectively. For some given 1≤t≤min{n, m}, we can heuristically compute an overlap between M^{P1} and M^{P2} by computing an alignment that maximizes an opportune scoring function defined on their respective t eigenvectors u_{1},…, u_{t} and v_{1},…, v_{t}. Since the scoring function (1) does not penalize the eventual match of a contact with a noncontact, a global alignment of eigenvectors is preferred to a local alignment.
The NW algorithm computes in polynomialtime the optimal global alignment of two sequences with respect to some scoring matrix S ∈ ℝ^{n×m} and (constant) gap penalty G ∈ ℝ. The entry S_{ij} of the scoring matrix denotes the level of similarity between the ith residue of P_{1} and the jth residue of P_{2}. The constant value G defines the cost for the introduction of a gap in the alignment. The NW algorithm can be easily modified in order to encode nonconstant gap penalties. In this work, we consider only constant gap penalties. Formally, the NW algorithm computes the alignment f : R_{1} → R_{2} that maximizes the objective function
In the following, we describe the scoring function (a) and the constant gap penalty (b) of our NW implementation.
By Equation (3), for residues i,j of protein P_{1} the quantity
will tend to 1 with increasing of t if i,j are in contact and to 0 otherwise. In the quantity (5), the contribution of each product (v_{k})_{i}(v_{k})_{j} is weighted by the corresponding eigenvalue λ_{k}. Moreover, when λ_{k} is positive, such a contribute is positive if and only if (v_{k})_{i} and (v_{k})_{j} agree in sign. We describe the ith residue of P_{1} by the ith entries of eigenvectors v_{1},…, v_{t} weighted by the square root of the corresponding eigenvalue λ_{k}: According to this representation, the ith residue of P_{1}, described by [(u′_{1})_{i},…, (u′_{t})_{i}], should be matched with the jth residue of P_{2}, described by [(v′_{1})_{i},…, (v′_{t})_{i}], when the pairwise entries of the these vectors highly agree both in sign and relative magnitude. The vectors do not need to be equal to obtain a high score; for this reason, a scoring scheme based on the Euclidean distance is not appropriate. Experimentally, we found that the scoring function that provides the best performances is The scoring function (7) assigns high scores when the corresponding entries of the vectors describing residues i and j have the same sign. Moreover, in S_{ij} the contribution of each product (u′_{k})_{i}(v′_{k})_{j} is weighted by the square root of the corresponding eigenvalues. 
The gap penalty was evaluated experimentally. We found that in nearly all cases a good choice is
Note that the sign of the eigenvectors has no influence in Equation (3), i.e. v_{i} ⊗ v_{i} = −v_{i} ⊗ −v_{i}. In fact, there is no way to standardize the sign of the eigenvectors in the eigendecomposition. This implies that, when aligning two sets of t eigenvectors, we are forced to try all possible combinations of their signs. By definition of (7), to consider all possible combinations, it is sufficient to try all possible sign combinations for just one set of t eigenvectors, thus producing 2^{t} different alignments. Moreover, in some cases, we found that, with increasing number of eigenvectors, the quality of the alignment (in terms of the number of overlapping contacts recovered) slightly decreases. Thus, when aligning t eigenvectors, our algorithm proceeds as follows. First, it computes all alignments with one eigenvector for each pair, then with two and so on up to t. This procedure evaluates a total of ∑^{t}_{k=1} 2^{k} = 2^{t+1} − 2 alignments. The best alignment in terms of overlapping contacts is chosen. One execution of the NW algorithm costs Θ(nm), so our algorithm costs Θ(2^{t+1}·nm), i.e. it is exponential in the order of approximation t. Anyway, the running time is not dependent on the contact map threshold and for values of t up to 7 the running time is small enough to assure a fast computation, as shown below.
3.3 Evaluation of the running time
The experiments were run on an Intel Pentium machine with a 2.80 GHz CPU and with 1 GB RAM. In our case, the time necessary to perform the eigendecomposition is not included for the comparison with other methods, since the eigenvectors can be computed (and stored) in a preprocessing phase. In general, the eigendecomposition time depends on the map sizes but the it is very fast compared with the time needed to perform the alignments. Nonetheless, for sake of clarity, we include here the time necessary to perform the eigendecomposition of the maps in our datasets described in Section 3.1. The eigendecomposition of the Sokol dataset requires <1 ms in total. The Lancia, Skolnick, Fischer and Proteus300 datasets require approximatively 3, 4, 14 and 55 s, respectively. The eigendecomposition of the most large map in our datasets (534 residues) requires 2.3 s.
4 RESULTS AND DISCUSSION
4.1 Comparison with exact/heuristic CMO methods
In the first test, the AlEigen algorithm is compared on the Skolnick set with two exact IPbased methods, LAGR (Caprara and Lancia, 2002) and A_Purva (Andonov et al., 2008), and with the heuristic MSVNS (Pelta et al., 2008). In Table 2, the total number of recovered overlapping contacts (Total overlap) and the required computational time (Time) are listed. For the sake of comparison, we use MSVNS version 3 (as suggested by the authors) with number of restarts equal to 5, 10, 30, 50, 70. All the runs are on the same contact maps with threshold value set to 7.5 Å.
Method^{a}  Total overlap^{b}  % wrt A_purva^{c}  Time^{d} 

A_purva  216 372  100%  7 day 
LAGR  210 395  97.2%  13 day 
MSVNS v3 r70  199 270  92.1%  15 h 
MSVNS v3 r50  197 777  91.4%  10 h 
MSVNS v3 r30  195 007  90.1%  6 h 
MSVNS v3 r10  186 776  86.3%  2 h 
MSVNS v3 r5  178 757  82.6%  1 h 
AlEigen_{14}  198 124  91.6%  18 h 
AlEigen_{13}  197 386  91.2%  9 h 
AlEigen_{12}  196 512  90.8%  4 h 30 min 
AlEigen_{11}  195 640  90.4%  2 h 
AlEigen_{10}  194 654  90.0%  1 h 
AlEigen_{9}  193 571  89.5%  25 min 
AlEigen_{8}  192 177  88.8%  12 min 
AlEigen_{7}  190 923  88.2%  6 min 
Method^{a}  Total overlap^{b}  % wrt A_purva^{c}  Time^{d} 

A_purva  216 372  100%  7 day 
LAGR  210 395  97.2%  13 day 
MSVNS v3 r70  199 270  92.1%  15 h 
MSVNS v3 r50  197 777  91.4%  10 h 
MSVNS v3 r30  195 007  90.1%  6 h 
MSVNS v3 r10  186 776  86.3%  2 h 
MSVNS v3 r5  178 757  82.6%  1 h 
AlEigen_{14}  198 124  91.6%  18 h 
AlEigen_{13}  197 386  91.2%  9 h 
AlEigen_{12}  196 512  90.8%  4 h 30 min 
AlEigen_{11}  195 640  90.4%  2 h 
AlEigen_{10}  194 654  90.0%  1 h 
AlEigen_{9}  193 571  89.5%  25 min 
AlEigen_{8}  192 177  88.8%  12 min 
AlEigen_{7}  190 923  88.2%  6 min 
^{b}Total number of recovered overlapping contacts.
^{c}Percentage of total overlapping contacts recovered with respect to the best performing method (A_purva).
^{d}Running time needed to compute the entire set of 780 pairwise alignments in the Skolnick set.
The implementations of LAGR and A_purva are not publicly available and we use the data previously published as Supplementary Material (Andonov et al., 2008). Another exact method, CMOS (Xie and Sahinidis, 2007), available online as a web server, has limitations on the size of the submitted problems and it was not included in our benchmark.
As previously described (Andonov et al., 2008), the computational time of the exact algorithms LAGR and A_purva has been limited to a maximum of 30 min per pair of contact maps. Even under this constraint, the two exact methods require a large computational time (from 7 to 13 days) to compute all the 780 alignments of the Skolnick set. It has been reported (Andonov et al., 2008) that A_purva in this test returns an upper bound to the total overlap equal to 218 316 contacts and a total lower bound of 216 372 contacts (listed in Table 2). This indicates that the quality of the overlaps returned by A_purva within the time limitation is very good, since it is quite close to the optimum solution (the distance between the total upper bound and the total lower bound of A_purva is very short). With A_purva, the solution is in fact <1% distant from the optimal one. The other exact algorithm LAGR similarly returns a solution <4% distant from the optimal one. The total overlap returned by A_purva can, therefore, be adopted as the best approximation of the optimal solution for the Skolnick set. Therefore, all the other solutions listed in Table 1 are normalized to that of A_purva (%wrt A_purva).
In Table 3, the performance of AlEigen is compared with the two heuristic methods MSVNS and BIMAL Jain and Obermayer (2009), which is not publicly available, with respect to the results published in Jain and Obermayer (2009). The maps for Lancia and Sokol datasets include also some contacts between contiguous residues which are not taken into account by AlEigen (thus, its performance is slightly lowered). It was not possible to compare the performance of AlEigen with BIMAL on the Fischer dataset, since the original maps used in Jain and Obermayer (2009) are not available. In Table 4, the performance on the Fischer dataset (7.5 Å contact maps) is compared with MSVNS only. The running times of BIMAL and MSVNS in Table 3 are reduced with respect to that published in Jain and Obermayer (2009) in order to take into account the different clock frequencies: 1.75 GHz in Jain and Obermayer (2009) and 2.8 GHz for our machine, respectively. In detail, the running times of BIMAL and MSVNS have been scaled down by a factor equal to 0.8, which is the minimum ratio we obtained by comparing the running time of MSVN on our machine (on the same maps used for the tests in Table 3) and the running time reported in Jain and Obermayer (2009).
Method^{a}  Skolnick^{b}  Time^{c}  Lancia^{b}  Time^{c}  Sokol^{b}  Time^{c} 

MSVNS v1  155 308  40 min  945 813  2 h 30 min  833  5 s 
MSVNS v2  179 382  50 min  1 010 559  3 h  885  4 s 
MSVNS v3  185 753  1 h  1 038 186  3 h 20 min  862  5 s 
BIMAL 1  187 049  1 min  818 357  4 min  740  0.1 s 
BIMAL 2  189 498  2 min  867 691  8 min  793  0.1 s 
BIMAL r1  192 961  20 min  936 996  1 h  814  1 s 
BIMAL r2  194 499  45 min  984 481  2 h  828  3 s 
AlEigen_{7}  187 003  6 min  731 865  40 min  679  2 s 
AlEigen_{8}  188 246  12 min  737 718  1 h  694  5 s 
AlEigen_{9}  189 308  25 min  744 095  2 h  702  7 s 
AlEigen_{10}  190 324  1 h  750 888  4 h  706  13 s 
Method^{a}  Skolnick^{b}  Time^{c}  Lancia^{b}  Time^{c}  Sokol^{b}  Time^{c} 

MSVNS v1  155 308  40 min  945 813  2 h 30 min  833  5 s 
MSVNS v2  179 382  50 min  1 010 559  3 h  885  4 s 
MSVNS v3  185 753  1 h  1 038 186  3 h 20 min  862  5 s 
BIMAL 1  187 049  1 min  818 357  4 min  740  0.1 s 
BIMAL 2  189 498  2 min  867 691  8 min  793  0.1 s 
BIMAL r1  192 961  20 min  936 996  1 h  814  1 s 
BIMAL r2  194 499  45 min  984 481  2 h  828  3 s 
AlEigen_{7}  187 003  6 min  731 865  40 min  679  2 s 
AlEigen_{8}  188 246  12 min  737 718  1 h  694  5 s 
AlEigen_{9}  189 308  25 min  744 095  2 h  702  7 s 
AlEigen_{10}  190 324  1 h  750 888  4 h  706  13 s 
^{a}The results for MSVNS and BIMAL have been taken from Jain and Obermayer (2009). MSVN has been run with 10 restarts.
^{b}Total number of recovered overlapping contacts. The total overlap includes also the overlap of a map with itself.
^{c}Running time needed to compute the entire set of pairwise alignments.
Method  Fischer^{a}  Time^{b} 

MSVNS v1 r10  445 887  4 h 
MSVNS v2 r10  465 892  4 h 
MSVNS v3 r10  476 469  6 h 
AlEigen_{7}  476 249  20 min 
AlEigen_{8}  479 745  40 min 
AlEigen_{9}  482 898  1 h 20 min 
AlEigen_{10}  485 388  3 h 
Method  Fischer^{a}  Time^{b} 

MSVNS v1 r10  445 887  4 h 
MSVNS v2 r10  465 892  4 h 
MSVNS v3 r10  476 469  6 h 
AlEigen_{7}  476 249  20 min 
AlEigen_{8}  479 745  40 min 
AlEigen_{9}  482 898  1 h 20 min 
AlEigen_{10}  485 388  3 h 
^{a}Total number of recovered overlapping contacts on the Fischer dataset. The total overlap includes also the overlap of a map with itself.
^{b}Running time needed to compute the entire set of 2346 pairwise alignments.
Summing up, heuristic algorithms, when compared with exact algorithms, can provide good solutions in a much smaller computational time. Interestingly enough, when the elapsed time is 6 min, AlEigen_{7} returns 88.2% of the best overlap (Table 2). The comparison in Tables 3 and 4 shows that AlEigen has worst performance than other heuristic methods, MSVNS and BIMAL, on maps containing a small number of contacts. On the other hand, on maps containing a higher number of contacts the performance of AlEigen improves both in terms of quality of the overlap (compare the results on the Lancia/Sokol datasets with those on the Skolnick dataset) and computational time (compare the computational times of MSVNS and AlEigen on the Skolinck dataset in Tables 2 and 3).
4.2 Comparison with structural alignment methods
In this section, we evaluate the accuracy of our algorithm as a classifier (the ability to recognize protein structural similarities at the SCOP family/s.family/fold level) on the Proteus300 and Fischer dataset. These two datasets are more computationally demanding than the Skolnick set (they require 44 850 and 2278 alignments, respectively) and contain nontrivial superfamilies and folds.
We compare the effectiveness of contact map alignment methods, including MSVNS, for the structural classification task with three structural alignment methods such as CE (Shindyalov and Bourne, 1998), TMalign (Zhang and Skolnick, 2005) and DaliLite (Holm and Park, 2000). On the Proteus300 dataset, we consider also the performance of A_purva+sse, a variant of A_purva that encodes secondary structure constraints (two residues can be aligned only if they belong to the same secondary structure class). The results shown for A_purva+sse have been taken from the Supplementary Material of Andonov et al. (2008). The scoring function for the CMO methods is provided by Equation (2). CE, DaliLite and TMalign have their own scoring functions: Zscore (for CE and DaliLite) and TMscore (for TMalign). The only pure CMO methods considered for these tests are MSVNS and our algorithm. We use different contact map thresholds to test the performance of our method. Due to the demanding computational time required by MSVNS on highthreshold contact maps, it is not possible to test it for thresholds >7.5 Å. In order to highlight the level of difficulty of our benchmarks, we also include the performance of BLAST (Altschul et al., 1997), so that it is easier to detect the amount of information contained in the dataset at the protein sequence level. The performance of BLAST is evaluated with respect to its bitscore. In detail, we perform the following two tests.
4.2.1 Structural class recognition
We check the ability to detect the correct protein family/super family/fold with a leaveoneout test (Tables 5 and 6). For every queryprotein, we select the modelprotein in the set that obtains the best similarity score and we measure the fraction of queryproteins for which the chosen modelprotein belongs to the same family/s.family/fold. The results on the Proteus300 and Fischer datasets are shown in Tables 5 and 6, respectively. On the Fischer dataset we test the performance only at the fold level since almost 1/3 of the proteins in the dataset (22) belong to a unique fold. In particular, the fold recognition test is performed only on the 46 proteins that belong to a nonunique fold in the dataset.
Method^{a}  Contact th.^{b}  Fam. rec.^{c}  S.Fam. rec.^{c}  Fold rec.^{c}  AUC Fam.^{d}  AUC S.Fam.^{d}  AUC Fold^{d}  Time^{e} 

BLAST  N/A  274/300  274/300  275/300  0.82  0.78  0.71  2 min 
CE  N/A  297/300  299/300  300/300  0.99  0.98  0.97  40 h 
DaliLite  N/A  299/300  300/300  300/300  0.99  0.97  0.97  9 h 30 min 
TMalign  N/A  300/300  300/300  300/300  0.99  0.98  0.98  5 h 
A_purva+sse  7.5 Å  300/300  300/300  300/300  0.99  0.97  0.97  23 h 
MSVNS.v3 r10  7.5 Å  297/300  298/300  299/300  0.97  0.93  0.93  94 h 
AlEigen_{7}  7.5 Å  294/300  294/300  296/300  0.98  0.94  0.94  6 h 
AlEigen_{7}  8 Å  297/300  297/300  298/300  0.98  0.94  0.94  6 h 
AlEigen_{7}  9 Å  299/300  299/300  300/300  0.98  0.94  0.94  6 h 
AlEigen_{7}  10 Å  300/300  300/300  300/300  0.99  0.94  0.95  6 h 
AlEigen_{7}  11 Å  300/300  300/300  300/300  0.99  0.95  0.96  6 h 
AlEigen_{7}  12 Å  300/300  300/300  300/300  0.99  0.95  0.96  6 h 
AlEigen_{7}  13 Å  300/300  300/300  300/300  0.99  0.95  0.96  6 h 
Method^{a}  Contact th.^{b}  Fam. rec.^{c}  S.Fam. rec.^{c}  Fold rec.^{c}  AUC Fam.^{d}  AUC S.Fam.^{d}  AUC Fold^{d}  Time^{e} 

BLAST  N/A  274/300  274/300  275/300  0.82  0.78  0.71  2 min 
CE  N/A  297/300  299/300  300/300  0.99  0.98  0.97  40 h 
DaliLite  N/A  299/300  300/300  300/300  0.99  0.97  0.97  9 h 30 min 
TMalign  N/A  300/300  300/300  300/300  0.99  0.98  0.98  5 h 
A_purva+sse  7.5 Å  300/300  300/300  300/300  0.99  0.97  0.97  23 h 
MSVNS.v3 r10  7.5 Å  297/300  298/300  299/300  0.97  0.93  0.93  94 h 
AlEigen_{7}  7.5 Å  294/300  294/300  296/300  0.98  0.94  0.94  6 h 
AlEigen_{7}  8 Å  297/300  297/300  298/300  0.98  0.94  0.94  6 h 
AlEigen_{7}  9 Å  299/300  299/300  300/300  0.98  0.94  0.94  6 h 
AlEigen_{7}  10 Å  300/300  300/300  300/300  0.99  0.94  0.95  6 h 
AlEigen_{7}  11 Å  300/300  300/300  300/300  0.99  0.95  0.96  6 h 
AlEigen_{7}  12 Å  300/300  300/300  300/300  0.99  0.95  0.96  6 h 
AlEigen_{7}  13 Å  300/300  300/300  300/300  0.99  0.95  0.96  6 h 
^{a}BLAST (Altschul et al., 1997) is a local sequence alignment tool. CE (Shindyalov and Bourne, 1998), DaliLite (Holm and Park, 2000) and TMalign (Zhang and Skolnick, 2005) are structural alignment methods. A_purva+sse (Andonov et al., 2008) is an exact CMO method that uses secondary structure information. MSVNS (Pelta et al., 2008) and AlEigen_{7} (our method) are the only pure CMO methods.
^{b}Contact map threshold used for CMO methods. N/A: not applicable.
^{c}Family/S.Family/Fold recognition results on the 300 proteins in the Proteus300 dataset. When CE returns more than one alignment for a pair, the one with the best Zscore is chosen. The TMscores are normalized with respect to the shortest sequence. The best results are highlighted with bold fonts.
^{d}AUC values of the ROC curves at the SCOP Family/S.Family/Fold level.
^{e}Running time needed to compute the entire set of 44 850 pairwise alignments in the Proteus300 dataset.
Method^{a}  Contact th^{b}  Fold rec^{c}  AUC Fold^{d}  Time^{e} 

BLAST  N/A  19/46  0.67  6 s 
CE  N/A  40/46  0.98  1 h 20 min 
DaliLite  N/A  45/46  0.99  30 min 
TMalign  N/A  39/46  0.99  10 min 
MSVNS v3 r10  7.5 Å  23/46  0.90  6 h 
AlEigen_{7}  7.5 Å  24/46  0.88  20 min 
AlEigen_{7}  8 Å  23/46  0.90  20 min 
AlEigen_{7}  9 Å  22/46  0.89  20 min 
AlEigen_{7}  10 Å  29/46  0.91  20 min 
AlEigen_{7}  11 Å  30/46  0.93  20 min 
AlEigen_{7}  12 Å  33/46  0.94  20 min 
AlEigen_{7}  13 Å  32/46  0.93  20 min 
Method^{a}  Contact th^{b}  Fold rec^{c}  AUC Fold^{d}  Time^{e} 

BLAST  N/A  19/46  0.67  6 s 
CE  N/A  40/46  0.98  1 h 20 min 
DaliLite  N/A  45/46  0.99  30 min 
TMalign  N/A  39/46  0.99  10 min 
MSVNS v3 r10  7.5 Å  23/46  0.90  6 h 
AlEigen_{7}  7.5 Å  24/46  0.88  20 min 
AlEigen_{7}  8 Å  23/46  0.90  20 min 
AlEigen_{7}  9 Å  22/46  0.89  20 min 
AlEigen_{7}  10 Å  29/46  0.91  20 min 
AlEigen_{7}  11 Å  30/46  0.93  20 min 
AlEigen_{7}  12 Å  33/46  0.94  20 min 
AlEigen_{7}  13 Å  32/46  0.93  20 min 
^{b}Contact map threshold used for CMO methods. N/A: not applicable
^{c}Fold recognition results on Fischer dataset. Among the 68 proteins in the dataset, 22 belong to a unique fold. Thus, the fold recognition performance are evaluated only on the remaining 46 proteins. When CE returns more than one alignment for a pair, the one with the best Zscore is chosen. The TMscores are normalized with respect to the shortest sequence. The best results are highlighted with bold fonts.
^{d}AUC values of the ROC curves at the SCOP Fold level.
^{e}Running time needed to compute the entire set of 2278 pairwise alignments in the Fischer dataset.
On the Proteus300 dataset, all the methods have similar performance on the task of recognition at the family/s.family/fold level. Notably, most of the errors reported in Table 5 are related to queryproteins that belong to nontrivial folds or nontrivial super families (see the Supplementary Material for detailed results). For example, as can we notice in Table 5, CE and DaliLite for 3 and 1 queries, respectively, can detect the correct superfamily but not the correct family. MSVNS and AlEigen (on 7.5 Å and 8 Å maps) in few cases fail to detect also the correct fold. It is worth noticing how the family recognition performance of our method varies at increasing values of the contact map threshold: all queries are correctly recognized only at thresholds ≥ 10 Å. The good performance of the CMO method A_purva+sse on 7.5 Å contact maps can be related to the use of secondary structure constraints, which can be of help in discriminating structure similarity, at least at the SCOP class level.
As shown in Table 6, on the Fischer dataset, structural alignment methods have better performance than CMO methods. In this case none of the methods is able to obtain full accuracy. The differences with respect to the results obtained on the Proteus300 dataset are justified by the low level of similarity between the targets in the Fischer dataset, as highlighted also by the performance of BLAST. Most of the errors reported in Table 6 are related to queries that have a representative only in the same superfamily but not in the same family (Supplementary Material). In particular, the only query misclassified by DaliLite has only one representative in the same fold. Except for this case, the other four queries with representatives only in the same fold class are classified correctly by all the other methods (except BLAST). It is worth noticing that, also in this case, the performance in fold recognition of our method increases with increasing values of the contact threshold.
In terms of computational time, in both tests, TMalign and AlEigen are the fastest.
4.2.2 Binary classification
We measure the area under curve (AUC) values of the receiver operating characteristic (ROC) curves, when the alignment scores are used as a binary classifier system at the SCOP family/s.family/fold level. As in the previous test, the performance on the Fischer dataset are evaluated only at the fold level (Tables 5 and 6).
The comparison based on AUC values shows that all the methods perform quite well as binary classifiers on both Proteus300 and Fischer datasets. In particular, the performance gap between structural alignment and CMO methods on the Fischer dataset is much smaller in terms of AUC values. This result indicates that, even if almost half of the queries are misclassified by CMO comparison, the most similar structures in the dataset are topranked. This test also confirms that highthreshold contact maps are more suitable for structure similarity detection.
4.3 Error tolerance test
In this section, we evaluate the error tolerance of AlEigen and MSVNS on the Proteus300 and Fischer datasets.
In this preliminary test, we adopt two different random error models. In particular, we apply some random perturbation to the original contact maps and evaluate the performance in terms of structural class recognition accuracy and AUC values, as performed in Section 4.2. In this case, we compare a perturbed map with all the other nonperturbed maps in the dataset, except itself. For the sake of comparison, we use only 7.5 Å contact maps. This threshold is also very close to the standard distance (8 Å) adopted for contact prediction (Ezkurdia et al., 2009). In detail, we consider the following two models for random error. Error model 1 preserves the total number of contacts in the map. On the contrary, model 2 can eventually introduce more false positive contacts than the original number of contacts in the map. We test error model 1 with percentage of errors equal to 70, 80 and 90% and error model 2 with 10, 20 and 30%. In the latter case, the perturbed contact maps contain on the average 4, 8 and 11 times more contacts than the original ones, respectively. The results obtained on the Proteus 300 and Fischer datasets are shown in Tables 7 and 8, respectively.

Model 1: an amount of %x errors means that exactly n random contacts are set to 0 and exactly n random noncontacts are set to 1, where n=round, and C is the number of contacts in the map. Only contacts between nonconsecutive residues are considered.

Model 2: an amount of %x errors means that exactly n random entries of the map are flipped, where n=round and L is the length of the protein. Also in this case only the entries of the map related to nonconsecutive residues are considered.
Method  Error model^{a}  %Err^{b}  Fam. rec.^{c}  S.Fam. rec^{c}  Fold rec.^{c}  AUC Fam.^{d}  AUC S.Fam.^{d}  AUC Fold^{d} 

MSVNS.v3 r10  1  70  114/300  119/300  138/300  0.73  0.71  0.73 
AlEigen_{7}  283/300  285/300  289/300  0.91  0.86  0.86  
MSVNS.v3 r10  80  22/300  27/300  38/300  0.66  0.65  0.68  
AlEigen_{7}  253/300  256/300  268/300  0.85  0.81  0.82  
MSVNS.v3 r10  90  8/300  9/300  22/300  0.60  0.60  0.64  
AlEigen_{7}  53/300  54/300  64/300  0.71  0.69  0.72  
MSVNS.v3 r10  2  10  229/300  231/300  240/300  0.68  0.67  0.70 
AlEigen_{7}  291/300  292/300  293/300  0.80  0.77  0.79  
MSVNS.v3 r10  20  181/300  185/300  192/300  0.63  0.62  0.66  
AlEigen_{7}  273/300  277/300  277/300  0.73  0.70  0.73  
MSVNS.v3 r10  30  86/300  92/300  107/300  0.57  0.57  0.61  
AlEigen_{7}  241/300  243/300  245/300  0.67  0.65  0.69 
Method  Error model^{a}  %Err^{b}  Fam. rec.^{c}  S.Fam. rec^{c}  Fold rec.^{c}  AUC Fam.^{d}  AUC S.Fam.^{d}  AUC Fold^{d} 

MSVNS.v3 r10  1  70  114/300  119/300  138/300  0.73  0.71  0.73 
AlEigen_{7}  283/300  285/300  289/300  0.91  0.86  0.86  
MSVNS.v3 r10  80  22/300  27/300  38/300  0.66  0.65  0.68  
AlEigen_{7}  253/300  256/300  268/300  0.85  0.81  0.82  
MSVNS.v3 r10  90  8/300  9/300  22/300  0.60  0.60  0.64  
AlEigen_{7}  53/300  54/300  64/300  0.71  0.69  0.72  
MSVNS.v3 r10  2  10  229/300  231/300  240/300  0.68  0.67  0.70 
AlEigen_{7}  291/300  292/300  293/300  0.80  0.77  0.79  
MSVNS.v3 r10  20  181/300  185/300  192/300  0.63  0.62  0.66  
AlEigen_{7}  273/300  277/300  277/300  0.73  0.70  0.73  
MSVNS.v3 r10  30  86/300  92/300  107/300  0.57  0.57  0.61  
AlEigen_{7}  241/300  243/300  245/300  0.67  0.65  0.69 
^{a}Error models described in Section 4.3.
^{b}Percentage of random error introduced in the maps.
^{c} Family/S.Family/Fold recognition results on the 300 proteins in the Proteus300 dataset. Every perturbed map is compared against every nonperturbed map in the dataset except itself.
^{d}AUC values of the ROC curves at the SCOP Family/S.Family/Fold level.
Method  Error model^{a}  %Err^{b}  Fold rec.^{c}  AUC Fold^{d} 

MSVNS.v3 r10  1  70  6/46  0.74 
AlEigen_{7}  16/46  0.81  
MSVNS.v3 r10  80  3/46  0.74  
AlEigen_{7}  11/46  0.78  
MSVNS.v3 r10  90  2/46  0.71  
AlEigen_{7}  6/46  0.73  
MSVNS.v3 r10  2  10  16/46  0.78 
AlEigen_{7}  19/46  0.81  
MSVNS.v3 r10  20  15/46  0.72  
AlEigen_{7}  19/46  0.76  
MSVNS.v3 r10  30  7/46  0.67  
AlEigen_{7}  12/46  0.73 
Method  Error model^{a}  %Err^{b}  Fold rec.^{c}  AUC Fold^{d} 

MSVNS.v3 r10  1  70  6/46  0.74 
AlEigen_{7}  16/46  0.81  
MSVNS.v3 r10  80  3/46  0.74  
AlEigen_{7}  11/46  0.78  
MSVNS.v3 r10  90  2/46  0.71  
AlEigen_{7}  6/46  0.73  
MSVNS.v3 r10  2  10  16/46  0.78 
AlEigen_{7}  19/46  0.81  
MSVNS.v3 r10  20  15/46  0.72  
AlEigen_{7}  19/46  0.76  
MSVNS.v3 r10  30  7/46  0.67  
AlEigen_{7}  12/46  0.73 
^{a}Error models described in Section 4.3.
^{b}Percentage of random error introduced in the maps.
^{c} Fold recognition results on Fischer dataset. Among the 68 proteins in the dataset, 22 belong to a unique fold. Thus, the fold recognition performance are evaluated only on the remaining 46 proteins. Every perturbed map is compared against every nonperturbed map in the dataset except itself.
^{d}AUC values of the ROC curves at the SCOP Fold level.
The AUC values in Tables 7 and 8 show that the CMO comparison approach has some discriminative power in distinguishing similar structures also in presence of highly noisy contact maps. In particular, the approach seems to be more robust when the total number of contacts in noisy maps does not differ too much from the native amount of contacts. In terms of recognition accuracy, by comparing the results obtained on the two different datasets it is evident that, in presence of noise (not surprisingly) the approach is much robust in detecting similarities at the family level more than at the fold level. In particular, on the Proteus300 dataset, the performance of AlEigen are almost unaffected also for 70% errors of type 1 (compare Tables 5 and 7).
The better performance of AlEigen with respect to those of MSVNS can be related to a higher tolerance of noise due to the eigendecomposition. In particular, we tested (data not shown) how the two algorithms align a perturbed map with its native version. In almost all cases the alignment computed by AlEigen is the identity, i.e. the ith residue in the perturbed map is aligned with the i residue in the original map. In detail, the alignments of AlEigen, on the average, have > 94% of identity (the identity abruptly drops from 94% at 90% error to 3% at 100% error, with respect to model 1). On the contrary, the same test on MSVN reports an average identity <10%.
5 CONCLUSIONS
In this article, we described a heuristic algorithm suited to address the CMO problem. Our algorithm computes an overlap of two contact maps by performing a global alignment based on few eigenvectors. The approach is effective since contact maps can be well approximated by just a fraction of their eigenvectors. Our algorithm is reasonably simple and, by design, its computing time does not depend on the number of contacts in the map and then on the contact threshold.
There are two main differences between our approach and the other methods developed so far for the CMO problem: (i) our algorithm is completely heuristic and it has no way to detect if the overlap found in some point of the computation is the best possible. In contrast, exact IPbased methods can stop the computation when the lower and upper bounds to the optimal solution coincide. (ii) The computing time of our algorithm depends uniquely on the protein length and on the number of eigenvectors considered and it is not affected by the threshold value of the contact map. Differently, for all the other methods developed so far (exact, approximate and heuristic), the computing time increases with increasing number of contacts in the map (namely, with increasing threshold value of the contact map): more contacts result into more constraints to be taken into account and in a longer running time.
We experimentally validated the performance of our algorithm by comparison with exact CMO methods (LAGR and A_purva), with heuristic CMO approaches (MSVNS and BIMAL) and with three structural alignment methods (CE, TMalign and DaliLite). Our algorithm is fast, it has good performance in terms of quality of the overlap when compared with exact/heuristic CMO methods and it is also competitive with widely used structural alignment methods in the task of protein structure comparison. In our tests, we found that contact maps computed at thresholds values ≥10 Å are more informative of the protein structures than those at lower thresholds. The computational time of our method is independent of the threshold value of the contact map and this makes it more suitable than other CMO methods for largescale detection of protein structure similarities.
Finally, in this article for the first time, to the best of our knowledge, we tested the effect of random noise on CMO alignments. According to these tests our algorithm is quite robust to noise and performs well in the task of structural similarity detection also with highly perturbed contact maps, at least at the SCOP family level.
Conflict of Interest: none declared.
Comments