Abstract

Motivation

Protein structure alignment is one of the fundamental problems in computational structure biology. A variety of algorithms have been developed to address this important issue in the past decade. However, due to their heuristic nature, current structure alignment methods may suffer from suboptimal alignment and/or over-fragmentation and thus lead to a biologically wrong alignment in some cases. To overcome these limitations, we have developed an accurate topology-independent and global structure alignment method through an FFT-based exhaustive search algorithm, which is referred to as FTAlign.

Results

Our FTAlign algorithm was extensively tested on six commonly used datasets and compared with seven state-of-the-art structure alignment approaches, TMalign, DeepAlign, Kpax, 3DCOMB, MICAN, SPalignNS and CLICK. It was shown that FTAlign outperformed the other methods in reproducing manually curated alignments and obtained a high success rate of 96.7 and 90.0% on two gold-standard benchmarks, MALIDUP and MALISAM, respectively. Moreover, FTAlign also achieved the overall best performance in terms of biologically meaningful structure overlap (SO) and TMscore on both the sequential alignment test sets including MALIDUP, MALISAM and 64 difficult cases from HOMSTRAD, and the non-sequential sets including MALIDUP-NS, MALISAM-NS, 199 topology-different cases, where FTAlign especially showed more advantage for non-sequential alignment. Despite its global search feature, FTAlign is also computationally efficient and can normally complete a pairwise alignment within one second.

Availability and implementation

http://huanglab.phys.hust.edu.cn/ftalign/.

1 Introduction

Protein structure comparison is one of the fundamental bioinformatics tasks in computational structure biology (Hasegawa and Holm, 2009; Ma and Wang, 2014). As similarity of structures often implies similarity of their functions, protein structure alignment has been a valuable computational method for inferring evolutionary relationships of various protein structures (Koehl, 2001; Lichtarge and Sowa, 2002), protein classification (Koehl, 2006; Murzin et al., 1995; Orengo et al., 1997; Yan and Huang, 2019a, b), protein function prediction (Brylinski and Skolnick, 2008; Roy et al., 2012; Zhang et al., 2017), protein structure prediction (Mirabello and Wallner, 2018; Scheeff and Bourne, 2006) and drug discovery (Huang, 2014; Hwang et al., 2017; Litfin et al., 2017; Wu et al., 2018; Yang et al., 2013), when the sequence identity of proteins is below a cutoff of 20–30% (Chothia and Lesk, 1986; Gan et al., 2002; Wood and Pearson, 1999). One primary task of structure alignment algorithms is to identify the residue equivalences between two proteins to be compared by superimposing them together. For years, a variety of protein structure alignment methods with different computational efficiencies have been developed by using different protein representations, structure comparison strategies and scoring schemes (Hasegawa and Holm, 2009). Nevertheless, there is still room in the improvement of protein structure alignment, especially when compared with human-curated results (Ma and Wang, 2014; Mayr et al., 2007).

Protein structure alignment is to find a biologically meaningful superimposition and/or those geometrically important matches between two protein structures, though these two criteria may be consistent with each other in many cases (Hasegawa and Holm, 2009). Mathematically, structure alignment is a combinatorial optimization problem (Ma and Wang, 2014). Therefore, assuming the scoring scheme for residue equivalences is ideal, one should explore all possible matches between two protein structures in order to obtain the most reasonable alignment of two proteins. As such, the computational cost would be the order of O(NM) for aligning two proteins of N and M residues in real space. This would be computationally too expensive for large-volume alignments with current computing power. Fortunately, protein structure alignment is not a purely mathematical issue, but also a biologically relevant problem. Therefore, biological constraints or strategies can be applied to reduce the search space in structure alignment and thus increase the computational speed (Ma and Wang, 2014).

One commonly used constraint is the requirement to follow a simple sequential rule, referred to as sequential alignment, in which the sequence order of the amino acids from two proteins should be preserved in the alignment (Holm and Sander, 1993; Madhusudhan et al., 2009; Minami et al., 2018; Orengo and Taylor, 1996; Pandit and Skolnick, 2008; Ritchie et al., 2012; Shindyalov and Bourne, 1998; Ye and Godzik, 2004; Zhang and Skolnick, 2005; Zhu and Weng, 2004; Yang et al., 2012). Although such sequential constraint is very successful in most of cases (Ma and Wang, 2014), it might miss the detection of some biologically evolutionary relationships (Micheletti and Orland, 2009; Xie and Bourne, 2008), as proteins can fold into topologically different structures while maintaining evolutionary relationship through the combination and permutation of subdomains (Bashton and Chothia, 2007; Lindqvist and Schneider, 1997). Accordingly, many algorithms have been developed for non-sequential structure alignment, where the ordering constraint is released (Alexandrov, 1996; Bachar et al., 1993; Brown et al., 2016; Dong et al., 2018; Dror et al., 2003; Holm and Sander, 1993; Kolbeck et al., 2006; Konagurthu et al., 2006; Minami et al., 2013; Nguyen and Madhusudhan, 2011; Salem et al., 2009, 2010). However, because current non-sequential algorithms often try to maximize the number of matched Cα atoms/local structure units and minimize their root mean square deviation (RMSD), they may experience the problems of over-fragmentation or noisy alignments and thus lead to a biologically trivial alignment (Hasegawa and Holm, 2009).

Another common constraint is to adopt a heuristic technique during alignment search (Ma and Wang, 2014), in which the protein is broken into local structural units like fragments, high-order structural alphabets, or secondary structure elements of suitable length (Camproux et al., 2004; Holm and Sander, 1993; Kolbeck et al., 2006; Kolodny et al., 2002; Lupyan et al., 2005; Micheletti et al., 2000; Minami et al., 2013; Shindyalov and Bourne, 1998; Tyagi et al., 2008; Wang and Zheng, 2008). Thus, the structure alignment can be performed by following a two-step strategy. Namely, local similarity scores are first obtained by comparing all possible pairs of local structural units from two proteins. Then, those local structure pairs with high similarity can be subsequently re-assembled (Budowski-Tal et al., 2010; Pandit and Skolnick, 2008), or regarded as seed fragments to obtain an initial superimposition which can be then optimized using a dynamics programming (DP) method (Jung and Lee, 2000; Lackner et al., 2000; Zhang and Skolnick, 2005). Although such heuristic strategy can dramatically boost the computational speed and has achieved great successes in many cases (Ma and Wang, 2014), the final overall alignment of two structures may not be a globally optimal or biologically meaningful one because of its reduced search space (Hasegawa and Holm, 2009; Ma and Wang, 2014).

To overcome these limitations in current structure alignment approaches, we have developed a truly topology-independent and global structure alignment approach based on a fast Fourier transform (FFT)-based search algorithm, which is referred to as FTAlign, in which a protein is represented by its Cα atoms. Due to its exhaustive search in the complete six-dimensional (three translational + three rotational) space, FTAlign can always sample the globally optimal alignment between two protein structures no matter whether the alignment is sequential or non-sequential. When compared with seven state-of-the-art structure alignment methods, FTAlign achieved a significant improvement in both manually curated benchmarks and reference-free test sets.

2 Materials and methods

2.1 Global structure superimposition

FTAlign exhaustively explores all six degrees of freedom of one protein structure relative to the other, so as to find the globally optimal superimposition/match of the two structures. During the search, proteins are represented by their Cα atoms, each of which has one of three secondary structure attributes: coil, α-helix and β-strand (Kabsch and Sander, 1983). The global search process is accelerated through an FFT-based algorithm (Katchalski-Katzir et al., 1992) that has been successfully used in global protein–protein docking (Chen and Weng, 2003; Huang, 2014; Yan et al., 2017, 2018; Yan and Huang, 2019a, b).

2.1.1 FFT-based search in 3D translational space

To perform an FFT-based search, both the first and second protein structures are first mapped onto a three-dimensional (3D) grid of N×N×N grid points (Katchalski-Katzir et al., 1992), where the grid spacing is empirically set to 2.0 Å (Chen and Weng, 2003; Yan et al., 2017). Then, each grid point within 1.8 Å of Cα atoms for the first (say A) and second (say B) protein structures is assigned a secondary structure-dependent value as:
A(l,m,n)={1.0forcoil1.5forhelix2.0forstrand0.0otherwise
(1)
and
B(l,m,n)={1.0forcoil1.5forhelix2.0forstrand0.0otherwise
(2)
where l, m and n are the indices of the grid point, and 1.8 Å is approximately the van der Waals (VDW) radius of Cα atoms.
With the above protein mapping on grids, the grid-based match score C for a superimposition between two structures can be generally expressed by the following formula (Katchalski-Katzir et al., 1992)
C(o,p,q)=l=1Nm=1Nn=1NA(l,m,n)×B(l+o,m+p,n+q)
(3)
where o, p and q are the numbers of grid points by which the first protein (A) is shifted with respect to the second protein (B) in three translational dimensions, respectively. Here, a more negative correlation score means a better superimposition between two proteins for a relative translation of (o, p, q). All the N3 translations, i.e. {o[1,N],p[1,N],q[1,N]}, in 3D translational space for Eq. (3) can be completed in one round of computation through an FFT-based computation (Katchalski-Katzir et al., 1992).

2.1.2 Exhaustive search in 3D rotational space

In addition to the global search in three-dimensional translational space, one also needs to search the whole rotational space so as to achieve the globally optimal superimposition between two structures in six degrees of freedom. This process can be conducted by exploring all the angle sets of (θ,ϕ,ψ) that are evenly distributed in 3D Euler space. In this study, an angle interval of 18∘ was used to evenly discretize the Euler space, resulting in a total of 2540 evenly distributed rotations in the rotational space (Yan and Huang, 2018). Thus, for each rotation of the second protein in the Euler space, an exhaustive search in translational space is conducted through an FFT-based algorithm. Repeating this process for all the rotations results in a truly global search of superimposition in six degrees of freedom.

2.1.3 Final superimpositions

Similar to that adopted in protein–protein docking (Chen and Weng, 2003; Yan and Huang, 2018), only one match with the best score, which corresponds to the best structural superimposition in an FFT-based translational search, was retained for each rotation of the secondary structure, yielding a total of 2540 structure superimpositions for a global alignment in six-dimensional space.

2.2 Atom-based refinement of superimpositions

To further improve the matching accuracy between two proteins at atomic level, we have performed an optimization for each of the 2540 grid-based superimpositions by using a SIMPLEX method, where a Gaussian-like pairwise potential is used to evaluate the matching quality between two Cα atoms ij as
Eij=wij·exp[(rijr0)2]
(4)
where wij is a secondary structure-dependent weighting coefficient (Table 1) and r0 is set to 3.0 Å so as to ensure that the potentials will not overlap significantly for two adjacent Cα atoms. Finally, the top 20 superimpositions were retained for sequent residue equivalence search according to the Cα-based scoring function of Eq. (4).
Table 1.

The weighting coefficient wij of the energy score between Cα atoms of secondary-structure elements (SSE) for match optimization in Eq. (4)

SSECoilα-helixβ-strand
Coil–1.0–0.5–1.0
α-Helix–0.5–2.01.0
β-Strand–1.01.0–3.0
SSECoilα-helixβ-strand
Coil–1.0–0.5–1.0
α-Helix–0.5–2.01.0
β-Strand–1.01.0–3.0
Table 1.

The weighting coefficient wij of the energy score between Cα atoms of secondary-structure elements (SSE) for match optimization in Eq. (4)

SSECoilα-helixβ-strand
Coil–1.0–0.5–1.0
α-Helix–0.5–2.01.0
β-Strand–1.01.0–3.0
SSECoilα-helixβ-strand
Coil–1.0–0.5–1.0
α-Helix–0.5–2.01.0
β-Strand–1.01.0–3.0

2.3 One-to-one residue alignment

For each of obtained superimpositions, we obtained its one-to-one residue correspondence of two proteins by identifying those continuous segment pairs with high alignment scores through a stepwise way (Minami et al., 2013). The scoring matrix for residue alignment is defined as follows
M(dij)={δij1+(dijd0)2ifdijdcut0otherwise
(5)
where dij is the distance between two aligned Cα atoms of ij, δij is a secondary structure-dependent coefficient (Table 2), d0 is a normalized distance and dcut is a distance cutoff. Here, d0 is set to be 1.24L1531.8, where L is the residue number of the smaller protein of the pair, such that the similarity score can be comparable for protein structures of different sizes (Xu and Zhang, 2010).
Table 2.

The weighting coefficients δij for the one-to-one alignment score between two Cα atoms of secondary-structure elements (SSE) in Eq. (5)

SSECoilα-Helixβ-Strand
Coil1.00.00.5
α-Helix0.02.0–0.5
β-Strand0.5–0.52.0
SSECoilα-Helixβ-Strand
Coil1.00.00.5
α-Helix0.02.0–0.5
β-Strand0.5–0.52.0
Table 2.

The weighting coefficients δij for the one-to-one alignment score between two Cα atoms of secondary-structure elements (SSE) in Eq. (5)

SSECoilα-Helixβ-Strand
Coil1.00.00.5
α-Helix0.02.0–0.5
β-Strand0.5–0.52.0
SSECoilα-Helixβ-Strand
Coil1.00.00.5
α-Helix0.02.0–0.5
β-Strand0.5–0.52.0

During the search of residue alignment, the distance cutoff dcut was stepwisely increased from 3.2 Å, 4.8 Å to 8.0 Å. Specifically, dcut was first set to 3.2 Å. Then, we identified all those continuous segment matches from [α,β] to [α+(l1),β+(l1)], where α and β are the starting residue positions in two proteins, and l is the length of an aligned segment. To avoid spurious matches, the length of continuous segments must be at least two residues (Minami et al., 2013). Among all the segments, the one with the highest match score will be recorded. Then, a similar search is conducted within the rest residues after excluding the previously identified segment(s). The first stage of search for residue alignment continues until no continuous segment has a minimum length of two. After that, we entered the next stage of search by increasing the distance cutoff to 4.8 Å, and then to 8.0 Å. The search procedure for continuous segments is similar at each stage except that the identified residue alignment from a previous stage should be retained at current stage. Finally, we get a set of one-to-one residue correspondence and the total match score by combining all those continuous segments for each superimposition.

The 20 sets of residue alignments for the top 20 superimpositions are ranked according to their total alignment scores. The top one with the highest alignment score is selected as the predicted structure alignment of two proteins by default, though our FTAlign can generate up to ten top alignments for users because the alternative alignments are useful for investigating the conformational changes of multi-domain proteins (Nguyen and Madhusudhan, 2011).

2.4 Test datasets

2.4.1 Two manually curated benchmarks

Manually curated benchmarks are the most important datasets for the performance evaluation of a structure alignment method, as the structure pairs in those benchmarks have been carefully aligned by considering not only geometric similarity but also evolutionary and functional relationship. Here, we selected two such benchmarks, MALIDUP (Cheng et al., 2007a) and MALISAM (Cheng et al., 2007b), which have been widely used in the community of structure alignment. MALIDUP contains 241 pairwise structure alignments for remotely related homologous domains originated from internal duplication within a protein chain. MALISAM consists of 130 cases, in which the two proteins in a pair are structural analogs with different SCOP folds (Murzin et al., 1995). Therefore, MALISAM is expected to be more challenging than MALIDUP, as the protein pairs in MALISAM are not homologs (Cheng et al., 2008).

2.4.2 Four reference-free datasets

In addition to two gold-standard benchmarks, we have also tested our FTAlign approach on four other reference-free test sets that are not manually curated, which have been widely used to evaluate the performance of many existing structure alignment methods. These datasets include MALIDUP-NS (Minami et al., 2013), MALISAM-NS (Minami et al., 2013), 64 difficult pairwise alignments from the HOMSTRAD dataset with low structure similarity (Stebbings and Mizuguchi, 2004), and 199 pairwise alignments that have similar structures but different topology (Nguyen and Madhusudhan, 2011). MALIDUP-NS and MALISAM-NS are two artificial non-sequential test sets that have been constructed based on the sets of MALIDUP and MALISAM by a multiple segment permutation technique (Minami et al., 2013). The 64 difficult protein pairs have a SO between 30 and 70% SO and a root mean square deviation (RMSD) above 2.5 Å. The 199 topology-different pairwise alignments are formed by 91 protein structures, containing 5 cases with circular permutation, 60 cases with non-topological similarity, 24 cases with swapped domains and 110 alignments from different protein families (Nguyen and Madhusudhan, 2011).

2.4.3 Alignment sequentiality of test sets

In terms of alignment order, the six test sets may also be divided into two groups: sequential test sets that include MALIDUP, MALISAM and 64 difficult cases, and non-sequential test sets that contain MALIDUP-NS, MALISAM-NS and 199 topology-different cases.

2.5 Programs to be compared

To validate our FTAlign, we have compared our method with seven state-of-the-art structure alignment algorithms, TMalign (Zhang and Skolnick, 2005), DeepAlign (Wang et al., 2013), Kpax (Ritchie et al., 2012; Ritchie, 2016), 3DCOMB (Wang et al., 2011), CLICK (Nguyen and Madhusudhan, 2011), SPalignNS (Brown et al., 2016) and MICAN (Minami et al., 2013), in which TMalign, DeepAlign, Kpax and 3DCOMB are developed for sequential alignment and the other three approaches are designed for non-sequential alignment. For these seven programs, we all downloaded their latest versions and ran them locally with their default parameters.

2.6 Evaluation criteria

According to whether the test sets are manually curated or not, we have used different criteria to evaluate the performance of a structure alignment approach. For a fair comparison, only the top alignment was used for all the approaches. To remove the effect of noisy or spurious alignments, only those aligned segments with at least two residues were used in the evaluation (Hasegawa and Holm, 2009; Minami et al., 2013).

For two manually curated benchmarks, MALIDUP and MALISAM, we have used the success rate in reproducing manually curated alignments to measure the performance of a structure alignment method. Specifically, for each case in the benchmark, we first superimposed the predicted alignment onto the manually curated alignment according to the first protein by using TMalign (Zhang and Skolnick, 2005). The superimposition can also be done based on the RMSD of the first protein. The two ways will give the same results because the first proteins from predicted and manually curated alignments are the same and thus can perfectly overlap during the superimposition. Then, we calculated the RMSD between the second proteins of the manual and predicted alignments based on their Cα atoms. If the RMSD is less than 5.0 Å, the predicted alignment is thought to be close enough to the manual one (Janin et al., 2003) and thus defined as a successful alignment. The success rate is defined as the percentage of the cases with successfully predicted alignments compared to the total number of pairwise alignments in the benchmark. It should be noted that here we used the RMSD of the second protein instead of the number of correctly aligned residues to measure the quality of a structure alignment because different similarity scoring schemes may lead to different sets of optimal residue equivalences and thus different number of aligned residues for the same alignment (Hasegawa and Holm, 2009). However, the RMSD will be less affected by similarity scoring schemes and therefore is a more objective criterion for the quality of a superimposition, compared with the number of correctly aligned residues. Compared to the residue equivalence, the RMSD from the reference alignment is also more robust for some tasks of structure alignment like functional or binding site detection (Estrin and Wolfson, 2017; Huang and Zou, 2006; Roy et al., 2012; Yang et al., 2013; Zhou et al., 2018).

In addition to the success criterion for manually curated benchmarks, we have also adopted those reference-free criteria to evaluate the performance of our FTAlign on all the six test sets, which include four parameters: number of aligned residues (Nali), RMSD of aligned residues (RMSDali), TMscore of aligned residues and structure overlap (SO). Here, SO is defined as the percentage of those aligned residues within 3.5 Å of each other compared to the number of residues in the smaller protein. As Nali is counter-correlated with RMSDali, it is difficult to optimize these two parameters simultaneously (Zemla, 2003). Therefore, the SO is normally used to rank different structure alignment methods during the performance comparison (Brown et al., 2016).

3 Results and discussion

3.1 Performance on manually curated benchmarks

We first tested our FTAlign on the two manually curated benchmarks, MALIDUP and MALISAM. It was found that FTAlign performed better than the other seven structure alignment methods in both success rate and robustness in reproducing manually curated alignments.

Figure 1 shows the success rates of our structure alignment method FTAlign in reproducing manually curated alignments on the two gold-standard benchmarks of MALIDUP and MALISAM. For comparison, the figure also lists the corresponding results of the other seven approaches including TMAlign, DeepAlign, Kpax, 3DCOMB, MICAN, SPalignNS and CLICK. It can be seen from Figure 1 that FTAlign performed the best among the eight structure alignment methods on the benchmark of MALIDUP, and obtained a high success rate of 96.7%, which is slightly better than 96.3% for 3DCOMB, 95.4% for DeepAlign, 94.2% for TMAlign, 94.2% for Kpax and 93.4% for MICAN and significantly higher than 84.7% for SPalignNS and 83.4% for CLICK. Similar trends can also be observed on the benchmark of MALISAM. That is, FTAlign achieved the best performance on the benchmark of MALISAM and gave a success rate of 90.0%, compared to 87.7% for 3DCOMB, 81.5% for TMAlign, 80.0% for DeepAlign, 76.9% for MICAN, 75.4% for Kpax, 50.8% for SPalignNS and 50% for CLICK.

Fig. 1.

Success rates of FTAlign and seven other state-of-the-art structure alignment methods in reproducing manually curated structure alignments on two gold-standard benchmarks, MALIDUP and MALISAM

Comparing the results of MALIDUP and MALISAM also reveals that the success rates on MALIDUP are always higher than those on MALISAM for all the tested methods (Fig. 1). This is consistent with the experimental findings that MALISAM is more challenging than MALIDUP because MALIDUP consists of distant homologs while MALISAM is made up of structural analogs (Cheng et al., 2008). Another notable feature in Figure 1 is that FTAlign only lost 7.0% in success rate (i.e. 89.2% versus 96.3%) for MALISAM versus MALISDUP, compared to 8.6% for 3DCOMB (87.7% versus 96.3%), 12.7% for TMAlign (81.5% versus 94.2%), 15.4% for DeepAlign (80.0% versus 95.4%), 16.4% for MICAN (76.9% versus 93.4%), 18.8% for Kpax (75.4% versus 94.2%), 33.9% for SPalignNS (50.8% versus 84.6%) and 33.4% for CLICK (50.0% versus 83.4%). These results suggest the robustness of FTAlign in structure alignment for both distant homologs and structure analogs.

3.2 Performance on general datasets

We further evaluated the performance of FTAlign on the two manually curated benchmarks, MALIDUP and MALISAM, and four other reference-free datasets, MALIDUP-NS, MALISAM-NS, 64 difficult cases and 199 topology-different alignments by using the reference-independent criteria, Nali, RMSDali, TMscore and SO. The results with these criteria again confirmed the superior performance of FTAlign to the other seven structure alignment algorithms, which are detailed as follows.

3.2.1 Sequential datasets

Table 3a–c lists the average number of aligned residues (Nali), RMSD of aligned residues (RMSDali), TMscore of aligned residues and structure overlap (SO) for FTAlign on three sequential test sets, MALIDUP, MALISAM and 64 difficult cases, of which the SO values are also shown in Figure 2. For comparison, the table and figure also gives the corresponding results of the other seven structure alignment methods, TMalign, DeepAlign, Kpax, 3DCOMB, MICAN, SPalignNS and CLICK. It can be seen from Table 3 and Figure 2 that FTAlign achieved the highest SO on all three sequential test sets, followed by TMalign, 3DCOMB, Kpax and DeepAlign, while MICAN, SPalignNS and CLICK yielded a relatively worse performance. In addition, FTAlign obtained a higher TMscore than five of the other seven methods except TMalign and 3DCOMB. This can be understood because the two approaches TMalign and 3DCOMB have been designed to maximize the geometric similarity score (i.e. TMscore) between two protein structures during the alignment (Wang et al., 2011; Zhang and Skolnick, 2005). Given the lower success rates of TMalign and 3DCOMB than FTAlign in reproducing the manually curated alignments of MALIDDUP and MALISAM (Fig. 1), TMalign and 3DCOMB may align many more evolutionarily unrelated but geometrically similar residues than FTAlign, as also pointed out in the DeepAlign study (Wang et al., 2013). From Table 3, one can also observe that FTAlign, TMalign, 3DCOMB, Kpax, DeepAlign and MICAN obtained a significantly higher number of aligned residues than SPalignNS and CLICK, though they gave a worse average RMSDali. This can be understood because the RMSDali is anti-correlated with the number of aligned residues. The overall worse performances of three non-sequential alignment approaches, MICAN, SPalignNS and CLICK, than the other methods on these three sequential datasets suggest that it is challenging to develop a general approach for both sequential and non-sequential alignment, as it is unknown whether a structure alignment is sequential or not before it is manually examined. Therefore, it is encouraging that our topology-independent approach FTAlign performed the best on all the three sequential datasets.

Fig. 2.

Structure overlap (SO) of FTAlign and seven other structure alignment methods on six commonly used datasets of pairwise alignments, where MALIDUP, MALISAM and 64 difficult cases are sequential sets, and MALIDUP-NS, MALISAM-NS and 199 topology-different cases are non-sequential sets

Table 3.

Performance comparison between FTAlign and seven other state-of-the-art methods on six test sets of pairwise structure alignments, in which MALIDUP, MALISAM and 64 difficult cases are sequential sets, and MALIDUP-NS, MALISAM-NS and 199 topology-different cases are non-sequential sets

(a) MALIDUP

MethodNaliRMSDali (Å)TMscoreSO (%)

FTAlign87.12.680.61073.0
Kpax80.12.170.60970.6
TMalign86.72.690.63170.4
3DCOMB84.92.600.62369.8
DeepAlign83.82.680.61167.8
MICAN84.92.580.59367.2
SPalignNS62.71.610.52062.3
CLICK61.01.810.49659.1

(b) MALISAM

MethodNaliRMSDali (Å)TMscoreSO (%)

FTAlign63.63.120.51167.0
TMalign62.53.150.52263.0
3DCOMB60.82.970.52062.3
Kpax55.92.570.49361.8
DeepAlign58.93.040.50059.1
MICAN61.82.990.41450.5
SPalignNS35.11.820.35646.4
CLICK34.81.950.34644.0

(c) 64 difficult cases

MethodNaliRMSDali (Å)TMscoreSO (%)

FTAlign82.62.980.53169.1
TMalign82.32.970.55866.5
Kpax76.12.430.52665.8
3DCOMB80.92.900.54064.9
DeepAlign80.83.030.53362.7
MICAN79.92.900.49259.8
SPalignNS57.01.780.43756.9
CLICK51.01.940.38048.5

(d) MALIDUP-NS

MethodNaliRMSDali (Å)TMscoreSO (%)

FTAlign86.02.730.59771.4
MICAN83.72.550.58766.6
CLICK60.41.810.49358.6
SPalignNS57.81.690.47757.7
3DCOMB58.52.740.43047.8
TMalign61.53.010.43347.5
Kpax52.32.130.40747.0
DeepAlign54.12.570.40945.3

(e) MALISAM-NS

MethodNaliRMSDali (Å)TMscoreSO (%)

FTAlign61.83.130.49865.0
MICAN61.12.960.41250.3
TMalign48.33.440.39345.7
3DCOMB46.13.270.38445.1
Kpax40.42.590.36244.1
CLICK33.81.940.33742.9
DeepAlign40.72.930.36141.7
SPalignNS31.41.820.31440.9
(f) 199 topology-different cases

MethodNaliRMSDali (Å)TMscoreSO (%)

FTAlign80.43.400.51859.7
TMalign69.43.440.46650.2
3DCOMB67.33.290.46250.1
DeepAlign61.63.030.43347.0
Kpax57.12.620.42347.0
MICAN76.83.200.41844.5
CLICK46.61.990.36743.7
SPalignNS42.82.000.33741.1
(a) MALIDUP

MethodNaliRMSDali (Å)TMscoreSO (%)

FTAlign87.12.680.61073.0
Kpax80.12.170.60970.6
TMalign86.72.690.63170.4
3DCOMB84.92.600.62369.8
DeepAlign83.82.680.61167.8
MICAN84.92.580.59367.2
SPalignNS62.71.610.52062.3
CLICK61.01.810.49659.1

(b) MALISAM

MethodNaliRMSDali (Å)TMscoreSO (%)

FTAlign63.63.120.51167.0
TMalign62.53.150.52263.0
3DCOMB60.82.970.52062.3
Kpax55.92.570.49361.8
DeepAlign58.93.040.50059.1
MICAN61.82.990.41450.5
SPalignNS35.11.820.35646.4
CLICK34.81.950.34644.0

(c) 64 difficult cases

MethodNaliRMSDali (Å)TMscoreSO (%)

FTAlign82.62.980.53169.1
TMalign82.32.970.55866.5
Kpax76.12.430.52665.8
3DCOMB80.92.900.54064.9
DeepAlign80.83.030.53362.7
MICAN79.92.900.49259.8
SPalignNS57.01.780.43756.9
CLICK51.01.940.38048.5

(d) MALIDUP-NS

MethodNaliRMSDali (Å)TMscoreSO (%)

FTAlign86.02.730.59771.4
MICAN83.72.550.58766.6
CLICK60.41.810.49358.6
SPalignNS57.81.690.47757.7
3DCOMB58.52.740.43047.8
TMalign61.53.010.43347.5
Kpax52.32.130.40747.0
DeepAlign54.12.570.40945.3

(e) MALISAM-NS

MethodNaliRMSDali (Å)TMscoreSO (%)

FTAlign61.83.130.49865.0
MICAN61.12.960.41250.3
TMalign48.33.440.39345.7
3DCOMB46.13.270.38445.1
Kpax40.42.590.36244.1
CLICK33.81.940.33742.9
DeepAlign40.72.930.36141.7
SPalignNS31.41.820.31440.9
(f) 199 topology-different cases

MethodNaliRMSDali (Å)TMscoreSO (%)

FTAlign80.43.400.51859.7
TMalign69.43.440.46650.2
3DCOMB67.33.290.46250.1
DeepAlign61.63.030.43347.0
Kpax57.12.620.42347.0
MICAN76.83.200.41844.5
CLICK46.61.990.36743.7
SPalignNS42.82.000.33741.1

Note: The methods are ranked according to their SO.

Table 3.

Performance comparison between FTAlign and seven other state-of-the-art methods on six test sets of pairwise structure alignments, in which MALIDUP, MALISAM and 64 difficult cases are sequential sets, and MALIDUP-NS, MALISAM-NS and 199 topology-different cases are non-sequential sets

(a) MALIDUP

MethodNaliRMSDali (Å)TMscoreSO (%)

FTAlign87.12.680.61073.0
Kpax80.12.170.60970.6
TMalign86.72.690.63170.4
3DCOMB84.92.600.62369.8
DeepAlign83.82.680.61167.8
MICAN84.92.580.59367.2
SPalignNS62.71.610.52062.3
CLICK61.01.810.49659.1

(b) MALISAM

MethodNaliRMSDali (Å)TMscoreSO (%)

FTAlign63.63.120.51167.0
TMalign62.53.150.52263.0
3DCOMB60.82.970.52062.3
Kpax55.92.570.49361.8
DeepAlign58.93.040.50059.1
MICAN61.82.990.41450.5
SPalignNS35.11.820.35646.4
CLICK34.81.950.34644.0

(c) 64 difficult cases

MethodNaliRMSDali (Å)TMscoreSO (%)

FTAlign82.62.980.53169.1
TMalign82.32.970.55866.5
Kpax76.12.430.52665.8
3DCOMB80.92.900.54064.9
DeepAlign80.83.030.53362.7
MICAN79.92.900.49259.8
SPalignNS57.01.780.43756.9
CLICK51.01.940.38048.5

(d) MALIDUP-NS

MethodNaliRMSDali (Å)TMscoreSO (%)

FTAlign86.02.730.59771.4
MICAN83.72.550.58766.6
CLICK60.41.810.49358.6
SPalignNS57.81.690.47757.7
3DCOMB58.52.740.43047.8
TMalign61.53.010.43347.5
Kpax52.32.130.40747.0
DeepAlign54.12.570.40945.3

(e) MALISAM-NS

MethodNaliRMSDali (Å)TMscoreSO (%)

FTAlign61.83.130.49865.0
MICAN61.12.960.41250.3
TMalign48.33.440.39345.7
3DCOMB46.13.270.38445.1
Kpax40.42.590.36244.1
CLICK33.81.940.33742.9
DeepAlign40.72.930.36141.7
SPalignNS31.41.820.31440.9
(f) 199 topology-different cases

MethodNaliRMSDali (Å)TMscoreSO (%)

FTAlign80.43.400.51859.7
TMalign69.43.440.46650.2
3DCOMB67.33.290.46250.1
DeepAlign61.63.030.43347.0
Kpax57.12.620.42347.0
MICAN76.83.200.41844.5
CLICK46.61.990.36743.7
SPalignNS42.82.000.33741.1
(a) MALIDUP

MethodNaliRMSDali (Å)TMscoreSO (%)

FTAlign87.12.680.61073.0
Kpax80.12.170.60970.6
TMalign86.72.690.63170.4
3DCOMB84.92.600.62369.8
DeepAlign83.82.680.61167.8
MICAN84.92.580.59367.2
SPalignNS62.71.610.52062.3
CLICK61.01.810.49659.1

(b) MALISAM

MethodNaliRMSDali (Å)TMscoreSO (%)

FTAlign63.63.120.51167.0
TMalign62.53.150.52263.0
3DCOMB60.82.970.52062.3
Kpax55.92.570.49361.8
DeepAlign58.93.040.50059.1
MICAN61.82.990.41450.5
SPalignNS35.11.820.35646.4
CLICK34.81.950.34644.0

(c) 64 difficult cases

MethodNaliRMSDali (Å)TMscoreSO (%)

FTAlign82.62.980.53169.1
TMalign82.32.970.55866.5
Kpax76.12.430.52665.8
3DCOMB80.92.900.54064.9
DeepAlign80.83.030.53362.7
MICAN79.92.900.49259.8
SPalignNS57.01.780.43756.9
CLICK51.01.940.38048.5

(d) MALIDUP-NS

MethodNaliRMSDali (Å)TMscoreSO (%)

FTAlign86.02.730.59771.4
MICAN83.72.550.58766.6
CLICK60.41.810.49358.6
SPalignNS57.81.690.47757.7
3DCOMB58.52.740.43047.8
TMalign61.53.010.43347.5
Kpax52.32.130.40747.0
DeepAlign54.12.570.40945.3

(e) MALISAM-NS

MethodNaliRMSDali (Å)TMscoreSO (%)

FTAlign61.83.130.49865.0
MICAN61.12.960.41250.3
TMalign48.33.440.39345.7
3DCOMB46.13.270.38445.1
Kpax40.42.590.36244.1
CLICK33.81.940.33742.9
DeepAlign40.72.930.36141.7
SPalignNS31.41.820.31440.9
(f) 199 topology-different cases

MethodNaliRMSDali (Å)TMscoreSO (%)

FTAlign80.43.400.51859.7
TMalign69.43.440.46650.2
3DCOMB67.33.290.46250.1
DeepAlign61.63.030.43347.0
Kpax57.12.620.42347.0
MICAN76.83.200.41844.5
CLICK46.61.990.36743.7
SPalignNS42.82.000.33741.1

Note: The methods are ranked according to their SO.

3.2.2 Non-sequential datasets

Table 3d–f lists the average number of aligned residues (Nali), RMSD of aligned residues (RMSDali), TMscore of aligned residues and structure overlap (SO) for FTAlign on three non-sequential test sets, MALIDUP-NS, MALISAM-NS and 199 topology-different cases, of which the SO valuses are also shown in Figure 2. As a reference, the table and figure also gives the corresponding results of the other seven structure alignment methods, TMalign, 3DCOMB, Kpax, DeepAlign, MICAN, SPalignNS and CLICK. It can be seen from Table 3 and Figure 2 that FTAlign again obtained a better performance than the other seven structure alignment methods in terms of both SO and TMscore on the three non-sequential test sets. Several other features can also be found by comparing the results of all six test sets. First, TMalign, 3DCOMB, Kpax and DeepAlign have a significantly lower SO on the non-sequential sets than on the sequential sets. This can be understood because these four methods are designed for sequential structure alignment (Wang et al., 2013; Zhang and Skolnick, 2005). Second, MICAN maintained a comparable performance, as its algorithm is able to handle both sequential and non-sequential alignments. The SO values of SPalignNS and CLICK are relatively low on both sequential and non-sequential test sets, which may be due to their point representation of protein structures, which would lead to over-fragmentation under the pressure of minimizing the RMSDali and/or maximizing the SO. On the contrary, through a global superimposition of two protein structures, FTAlign would circumvent the above limitation and thus achieved a consistently better performance than the other methods on both sequential and non-sequential test sets.

3.3 Case study

To further illustrate the advantage of a global search in structure alignment, we have investigated an example pairwise alignment, d2bs2_C1 and d2bs2_C2 from MALIDUP, on which only three methods, FTAlign, 3DCOMB and Kpax, successfully reproduced the manually curated structure alignment and gave an RMSD of <5.0 Å, while the other five methods failed to predict the manually curated alignment and led to a large RMSD of more than 20 Å (Fig. 3).

Fig. 3.

Structural comparison between the manually curated alignment (red and blue) and computationally predicted alignment (green and blue) for FTAlign (A) and seven other state-of-the-art methods (BH) on the case of d2bs2_C1/d2bs2_C2 from MALIDUP, where the manual and predicted alignments are superimposed based on the first protein (blue) of the pair. (Color version of this figure is available at Bioinformatics online.)

The two proteins of this case are two domains of a fumarate reductase from wolinella succinogenes (PDB ID: 2BS2) (Burley et al., 2019). Further examination of all the alignment plots revealed that the failure of the other five approaches may be due to the heuristic nature of these structure alignment algorithms (Fig. 4). That is, heuristic methods always try to identify the current continuous segment(s) with the highest match score through a local greedy-like search (Ma and Wang, 2014). Therefore, these methods may get trapped in a local minimum during the search of continuous alignments. Although such a heuristic strategy can dramatically reduce the search space and speed up the alignment process, the final structure alignment will be impacted by the first identified segments. Therefore, in such situation, if the first continuous segment is a biologically wrong alignment, the final alignment will also be a biologically wrong one. This is just the case on the alignments between d2bs2_C1 and d2bs2_C2 for the five failed approaches. As shown in Figure 4, the manually curated alignment consists of two continuous segments with a medium match score, which are reproduced by FTAlign, 3DCOMB and Kpax (Fig. 4A–C). On the contrary, the five failed approaches except MICAN all contain one locally optimal continuous segment that has a higher match score than either of the two continuous segments in the manually curated alignment (Fig. 4D–H). This kind of locally optimal alignments can also be observed in the structural comparison between the manually curated and computationally predicted alignments for the failed methods (Fig. 3). As for MICAN, its failure is due to its non-sequential alignment for this sequential case, although it does not show the overfitting problem of local segments between two structures (Fig. 3F). This example suggests the necessity of developing a global structure alignment algorithm like FTAlign.

Fig. 4.

Comparison of the residue alignments between the computationally predicted (red) and manually curated (black) alignments for FTAlign (A) and seven other state-of-the-art methods (BH) on the case of d2bs2_C1/d2bs2_C2, where the sizes of symbols are linearly scaled according to the distance between two aligned Cα atoms. Namely, a smaller symbol means a geometrically better match between two residues. (Color version of this figure is available at Bioinformatics online.)

3.4 Computational efficiency

Despite the global search process, our structure alignment method can be greatly accelerated through an FFT-based algorithm. As such, FTAlign is computationally efficient and can normally finish a pairwise alignment within seconds. Table 4 gives the total running times and average time per case of FTAlign on a single Intel(R) Xeon(R) CPU E5-2690 v4 @ 2.60 GHz core over the six test sets. For comparison, the table also lists the corresponding times of the other seven methods. It can be seen from the table that 3DCOMB is the fastest method and only consumes an average of 0.009 s for aligning a pair of structures, followed by 0.034 s for TMAlign and 0.065 s for MICAN. DeepAlign, Kpax and SPalignNS are moderately fast and can normally finish a structure alignment within a half second. FTAlign and CLICK are relatively slow and have an average running time of 2.895 and 3.906 s per case, respectively. The much higher (∼100 times) running time of FTAlign than other sequential alignment methods like TMAlign and 3DCOMB can be understood because FTAlign explores dramatically more orientational space [O(N6)] due to its exhaustive global search for an accurate topology-independent alignment. Therefore, taking its huge search space into account, FTAlign is relatively fast in some sense, though its running time is much longer than those of other sequential alignment approaches. Moreover, the FFT-based algorithm can be further speeded up through a graphic process unit (GPU). With the GPU version, a pairwise alignment can be normally completed within one second, which is fast enough for high-throughput structure alignment.

Table 4.

Total running times and average time per case of FTAlign and seven other methods on a single Intel(R) Xeon(R) CPU E5-2690 v4 @ 2.60 GHz core over the six test sets of pairwise structure alignments

MethodRunning time (s)
CLICKFTAlignSPalignNSKpaxDeepAlignMICANTMalign3DCOMB
MALIDUP1177.27694.58102.3971.8052.8217.618.782.20
MALISAM207.08340.1619.2435.2819.155.833.451.15
64 cases265.67176.8228.6518.7116.734.562.710.56
MALIDUP-NS1140.49693.4695.9872.3249.8317.218.362.20
MALISAM-NS200.09353.7519.1236.2218.645.753.261.16
199 cases942.87647.8579.8160.2650.2514.218.161.82
Average/case3.9062.8950.3440.2970.2080.0650.0340.009
MethodRunning time (s)
CLICKFTAlignSPalignNSKpaxDeepAlignMICANTMalign3DCOMB
MALIDUP1177.27694.58102.3971.8052.8217.618.782.20
MALISAM207.08340.1619.2435.2819.155.833.451.15
64 cases265.67176.8228.6518.7116.734.562.710.56
MALIDUP-NS1140.49693.4695.9872.3249.8317.218.362.20
MALISAM-NS200.09353.7519.1236.2218.645.753.261.16
199 cases942.87647.8579.8160.2650.2514.218.161.82
Average/case3.9062.8950.3440.2970.2080.0650.0340.009
Table 4.

Total running times and average time per case of FTAlign and seven other methods on a single Intel(R) Xeon(R) CPU E5-2690 v4 @ 2.60 GHz core over the six test sets of pairwise structure alignments

MethodRunning time (s)
CLICKFTAlignSPalignNSKpaxDeepAlignMICANTMalign3DCOMB
MALIDUP1177.27694.58102.3971.8052.8217.618.782.20
MALISAM207.08340.1619.2435.2819.155.833.451.15
64 cases265.67176.8228.6518.7116.734.562.710.56
MALIDUP-NS1140.49693.4695.9872.3249.8317.218.362.20
MALISAM-NS200.09353.7519.1236.2218.645.753.261.16
199 cases942.87647.8579.8160.2650.2514.218.161.82
Average/case3.9062.8950.3440.2970.2080.0650.0340.009
MethodRunning time (s)
CLICKFTAlignSPalignNSKpaxDeepAlignMICANTMalign3DCOMB
MALIDUP1177.27694.58102.3971.8052.8217.618.782.20
MALISAM207.08340.1619.2435.2819.155.833.451.15
64 cases265.67176.8228.6518.7116.734.562.710.56
MALIDUP-NS1140.49693.4695.9872.3249.8317.218.362.20
MALISAM-NS200.09353.7519.1236.2218.645.753.261.16
199 cases942.87647.8579.8160.2650.2514.218.161.82
Average/case3.9062.8950.3440.2970.2080.0650.0340.009

To choose an appropriate method for real applications, both computational efficiency and alignment accuracy need to be considered. For sequential alignment, programs like TMAlign and 3DCOMB would be the ideal choice, owing to their high speed. However, for topology-independent alignment, FTAlign would be an optimal one because of its significant accuracy advantage and also acceptable running time.

3.5 Impact of alignment parameters

During our FFT-based global search for optimal alignment between two protein structures, there are two basic parameters, grid spacing for 3D translational search and angle step for 3D rotational search. To investigate the impact of these two parameters on our alignment, we have conducted an extensive evaluation of FTAlign on six test sets under the combination of four grid spacings (1.0, 1.2, 1.5 and 2.0 Å) and four angle steps (10∘, 15∘, 18∘ and 20∘). Table 5 lists the SO of FTAlign with different sets of parameters on the six test sets. It can be seen from the table that the changes of SO are very small and within 1.0% for different sets of grid spacing and angle step, suggesting the robustness of FTAlign in FFT-based global search for geometric match. This may be understood because proteins are represented by their Cα atoms in FTAlign. As the average distance between two Cα atoms is around 3.8 Å, FTAlign is expected to perform well as long as the grid spacing is significantly smaller than 3.8 Å, although we have taken the grid spacing of 2.0 Å and angle step of 18∘ as the default parameters of FTAlign for the sake of both accuracy and speed in the present study.

Table 5.

Structure overlaps (SO) of FTAlign with different grid spacings and angle steps on six test sets

Grid spacingAngle stepStructure overlap (%)
MALIDUPMALISAM64 casesMALIDUP-NSMALISAM-NS199 cases
1.0 Å10∘73.167.368.971.465.159.7
15∘72.967.468.971.464.759.4
18∘72.867.268.771.564.759.8
20∘73.067.068.771.664.459.9
1.2 Å10∘73.267.369.171.464.759.6
15∘73.067.069.071.264.859.7
18∘72.966.968.971.464.559.5
20∘72.867.168.471.464.859.5
1.5 Å10∘72.967.268.971.265.159.6
15∘72.867.369.371.464.959.5
18∘73.067.168.871.364.759.8
20∘72.966.868.971.664.759.6
2.0 Å10∘73.167.269.071.364.859.5
15∘73.166.968.971.564.759.8
18∘73.067.069.171.465.059.7
20∘72.866.868.771.364.659.8
Grid spacingAngle stepStructure overlap (%)
MALIDUPMALISAM64 casesMALIDUP-NSMALISAM-NS199 cases
1.0 Å10∘73.167.368.971.465.159.7
15∘72.967.468.971.464.759.4
18∘72.867.268.771.564.759.8
20∘73.067.068.771.664.459.9
1.2 Å10∘73.267.369.171.464.759.6
15∘73.067.069.071.264.859.7
18∘72.966.968.971.464.559.5
20∘72.867.168.471.464.859.5
1.5 Å10∘72.967.268.971.265.159.6
15∘72.867.369.371.464.959.5
18∘73.067.168.871.364.759.8
20∘72.966.868.971.664.759.6
2.0 Å10∘73.167.269.071.364.859.5
15∘73.166.968.971.564.759.8
18∘73.067.069.171.465.059.7
20∘72.866.868.771.364.659.8
Table 5.

Structure overlaps (SO) of FTAlign with different grid spacings and angle steps on six test sets

Grid spacingAngle stepStructure overlap (%)
MALIDUPMALISAM64 casesMALIDUP-NSMALISAM-NS199 cases
1.0 Å10∘73.167.368.971.465.159.7
15∘72.967.468.971.464.759.4
18∘72.867.268.771.564.759.8
20∘73.067.068.771.664.459.9
1.2 Å10∘73.267.369.171.464.759.6
15∘73.067.069.071.264.859.7
18∘72.966.968.971.464.559.5
20∘72.867.168.471.464.859.5
1.5 Å10∘72.967.268.971.265.159.6
15∘72.867.369.371.464.959.5
18∘73.067.168.871.364.759.8
20∘72.966.868.971.664.759.6
2.0 Å10∘73.167.269.071.364.859.5
15∘73.166.968.971.564.759.8
18∘73.067.069.171.465.059.7
20∘72.866.868.771.364.659.8
Grid spacingAngle stepStructure overlap (%)
MALIDUPMALISAM64 casesMALIDUP-NSMALISAM-NS199 cases
1.0 Å10∘73.167.368.971.465.159.7
15∘72.967.468.971.464.759.4
18∘72.867.268.771.564.759.8
20∘73.067.068.771.664.459.9
1.2 Å10∘73.267.369.171.464.759.6
15∘73.067.069.071.264.859.7
18∘72.966.968.971.464.559.5
20∘72.867.168.471.464.859.5
1.5 Å10∘72.967.268.971.265.159.6
15∘72.867.369.371.464.959.5
18∘73.067.168.871.364.759.8
20∘72.966.868.971.664.759.6
2.0 Å10∘73.167.269.071.364.859.5
15∘73.166.968.971.564.759.8
18∘73.067.069.171.465.059.7
20∘72.866.868.771.364.659.8

4 Conclusions

We have developed an accurate topology-independent and global structure alignment method based on an FFT-based search algorithm, which is referred to as FTAlign. FTAlign was extensively evaluated on six commonly used test sets including two manually curated gold-standard benchmarks, MALIDUP and MALISAM and four reference-free test sets, MALIDUP-NS, MALISAM-NS, 64 difficult cases from HOMSTRAD and 199 topology-different pairwise alignments, in which MALIDUP, MALISAM and 64 difficult cases are sequential sets and MALIDUP-NS, MALISAM-NS and 199 topology-different cases are non-sequential sets. It was shown that FTAlign not only obtained a better success rate in reproducing manually curated structure alignments on MALIDUP and MALISAM, but also achieved a higher biologically meaningful structure overlap (SO) and an overall higher TMscore on the six test sets than seven other state-of-the-art structure alignment methods, TMAlign, 3DCOMB, Kpax, DeepAlign, MICAN, SPalignNS and CLICK. A case study further confirmed the advantage of a global search like FTAlign in structure alignment. Despite its global search feature, FTAlign is also computational efficient and its GPU version can normally finish a structure alignment within one second. FTAlign provides a general method for both sequential and non-sequential alignments between two protein structures.

Funding

This work was supported by the National Natural Science Foundation of China (grant No. 31670724), the National Key R&D Program of China (grant Nos. 2016YFC1305800 and 2016YFC1305805), the National 1000 Young Thousand Talents of China and the startup grant of Huazhong University of Science and Technology.

Conflict of Interest: none declared.

References

Alexandrov
 
N.N.
(
1996
)
SARFing the PDB
.
Protein Eng
.,
9
,
727
732
.

Bachar
 
O.
 et al.  (
1993
)
A computer vision based technique for 3-D sequence-independent structural comparison of proteins
.
Protein Eng
.,
6
,
279
288
.

Bashton
 
M.
,
Chothia
C.
(
2007
)
The generation of new protein functions by the combination of domains
.
Structure
,
15
,
85
99
.

Brown
 
P.
 et al.  (
2016
)
Fast and accurate non-sequential protein structure alignment using a new asymmetric linear sum assignment heuristic
.
Bioinformatics
,
32
,
370
377
.

Brylinski
 
M.
,
Skolnick
J.
(
2008
)
A threading-based method (FINDSITE) for ligand binding site prediction and functional annotation
.
Proc. Natl. Acad. Sci. USA
,
105
,
129
134
.

Budowski-Tal
 
I.
 et al.  (
2010
)
FragBag, an accurate representation of protein structure, retrieves structural neighbors from the entire PDB quickly and accurately
.
Proc. Natl. Acad. Sci. USA
,
107
,
3481
3486
.

Burley
 
S.K.
 et al.  (
2019
)
RCSB Protein Data Bank: biological macromolecular structures enabling research and education in fundamental biology, biomedicine, biotechnology and energy
.
Nucleic Acids Res
.,
47
,
D464
D474
.

Camproux
 
A.C.
 et al.  (
2004
)
A hidden Markov model derived structural alphabet for proteins
.
J. Mol. Biol
.,
339
,
591
605
.

Chen
 
R.
,
Weng
Z.
(
2003
)
A novel shape complementarity scoring function for protein–protein docking
.
Proteins
,
51
,
397
408
.

Cheng
 
H.
 et al.  (
2007a
)
MALIDUP: a database of manually constructed structure alignments for duplicated domain pairs
.
Proteins
,
70
,
1162
1166
.

Cheng
 
H.
 et al.  (
2007b
)
MALISAM: a database of structurally analogous motifs in proteins
.
Nucleic Acids Res
.,
36
,
D211
D217
.

Cheng
 
H.
 et al.  (
2008
)
Discrimination between distant homologs and structural analogs: lessons from manually constructed, reliable data sets
.
J. Mol. Biol
.,
377
,
1265
1278
.

Chothia
 
C.
,
Lesk
A.M.
(
1986
)
The relation between the divergence of sequence and structure in proteins
.
EMBO J
.,
5
,
823
826
.

Dong
 
R.
 et al.  (
2018
)
mTM-align: an algorithm for fast and accurate multiple protein structure alignment
.
Bioinformatics
,
34
,
1719
1725
.

Dror
 
O.
 et al.  (
2003
)
MASS: multiple structural alignment by secondary structures
.
Bioinformatics
,
19
,
i95
i104
.

Estrin
 
M.
,
Wolfson
H.J.
(
2017
)
SnapDock-template-based docking by Geometric Hashing
.
Bioinformatics
,
33
,
i30
i36
.

Gan
 
H.H.
 et al.  (
2002
)
Analysis of protein sequence/structure similarity relationships
.
Biophys. J
.,
83
,
2781
2791
.

Hasegawa
 
H.
,
Holm
L.
(
2009
)
Advances and pitfalls of protein structural alignment
.
Curr. Opin. Struct. Biol
.,
19
,
341
348
.

Huang
 
S.Y.
(
2014
)
Search strategies and evaluation in protein–protein docking: principles, advances and challenges
.
Drug Discov. Today
,
19
,
1081
1096
.

Huang
 
S.-Y.
,
Zou
X.
(
2006
)
Ensemble docking of multiple protein structures: considering protein structural variations in molecular docking
.
Proteins
,
66
,
399
421
.

Holm
 
L.
,
Sander
C.
(
1993
)
Protein structure comparison by alignment of distance matrices
.
J. Mol. Biol
.,
233
,
123
138
.

Hwang
 
H.
 et al.  (
2017
)
Structure-based prediction of ligand-protein interactions on a genome-wide scale
.
Proc. Natl. Acad. Sci. USA
,
114
,
13685
13690
.

Janin
 
J.
 et al.  (
2003
)
Critical Assessment of PRedicted Interactions. CAPRI: a Critical Assessment of PRedicted Interactions
.
Proteins
,
52
,
2
9
.

Jung
 
J.
,
Lee
B.
(
2000
)
Protein structure alignment using environmental profiles
.
Protein Eng
.,
13
,
535
543
.

Kabsch
 
W.
,
Sander
C.
(
1983
)
Dictionary of protein secondary structure: pattern recognition of hydrogen-bonded and geometrical features
.
Biopolymers
,
22
,
2577
2637
.

Katchalski-Katzir
 
E.
 et al.  (
1992
)
Molecular surface recognition: determination of geometric fit between proteins and their ligands by correlation techniques
.
Proc. Natl. Acad. Sci. USA
,
89
,
2195
2199
.

Koehl
 
P.
(
2001
)
Protein structure similarities
.
Curr. Opin. Struct. Biol
.,
11
,
348
353
.

Koehl
 
P.
(
2006
)
Protein structure classification
.
Rev. Comput. Chem
.,
22
,
1
.

Kolbeck
 
B.
 et al.  (
2006
)
Connectivity independent protein-structure alignment: a hierarchical approach
.
BMC Bioinformatics
,
7
,
510.

Kolodny
 
R.
 et al.  (
2002
)
Small libraries of protein fragments model native protein structures accurately
.
J. Mol. Biol
.,
323
,
297
307
.

Konagurthu
 
A.S.
 et al.  (
2006
)
MUSTANG: a multiple structural alignment algorithm
.
Proteins
,
64
,
559
574
.

Lackner
 
P.
 et al.  (
2000
)
ProSup: a refined tool for protein structure alignment
.
Protein Eng
.,
13
,
745
752
.

Lichtarge
 
O.
,
Sowa
M.E.
(
2002
)
Evolutionary predictions of binding surfaces and interactions
.
Curr. Opin. Struct. Biol
.,
12
,
21
27
.

Lindqvist
 
Y.
,
Schneider
G.
(
1997
)
Circular permutations of natural protein sequences: structural evidence
.
Curr. Opin. Struct. Biol
.,
7
,
422
427
.

Litfin
 
T.
 et al.  (
2017
)
SPOT-ligand 2: improving structure-based virtual screening by binding-homology search on an expanded structural template library
.
Bioinformatics
,
33
,
1238
1240
.

Lupyan
 
D.
 et al.  (
2005
)
A new progressive-iterative algorithm for multiple structure alignment
.
Bioinformatics
,
21
,
3255
3263
.

Ma
 
J.
,
Wang
S.
(
2014
)
Algorithms, applications, and challenges of protein structure alignment
.
Adv. Prot. Chem. Struct. Biol
.,
94
,
121
175
.

Madhusudhan
 
M.S.
 et al.  (
2009
)
Alignment of multiple protein structures based on sequence and structure features
.
Protein Eng. Des. Sel
.,
22
,
569
574
.

Mayr
 
G.
 et al.  (
2007
)
Comparative analysis of protein structure alignments
.
BMC Struct. Biol
.,
7
,
50.

Micheletti
 
C.
 et al.  (
2000
)
Recurrent oligomers in proteins: an optimal scheme reconciling accurate and concise backbone representations in automated folding and design studies
.
Proteins
,
40
,
662
674
.

Micheletti
 
C.
,
Orland
H.
(
2009
)
MISTRAL: a tool for energy-based multiple structural alignment of proteins
.
Bioinformatics
,
25
,
2663
2669
.

Minami
 
S.
 et al.  (
2013
)
MICAN: a protein structure alignment algorithm that can handle multiple-chains, Inverse alignments, Cα only models, alternative alignments, and non-sequential alignments
.
BMC Bioinformatics
,
14
,
24.

Minami
 
S.
 et al.  (
2018
)
MICAN-SQ: a sequential protein structure alignment program that is applicable to monomers and all types of oligomers
.
Bioinformatics
,
34
,
3324
3331
.

Mirabello
 
C.
,
Wallner
B.
(
2018
)
Topology independent structural matching discovers novel templates for protein interfaces
.
Bioinformatics
,
34
,
i787
i794
.

Murzin
 
A.G.
 et al.  (
1995
)
SCOP: a structural classification of proteins database for the investigation of sequences and structures
.
J. Mol. Biol
.,
247
,
536
540
.

Nguyen
 
M.N.
,
Madhusudhan
M.S.
(
2011
)
Biological insights from topology independent comparison of protein 3D structures
.
Nucleic Acids Res
.,
39
,
e94.

Orengo
 
C.A.
 et al.  (
1997
)
CATH–a hierarchic classification of protein domain structures
.
Structure
,
5
,
1093
1108
.

Orengo
 
C.A.
,
Taylor
W.R.
(
1996
)
SSAP: sequential structure alignment program for protein structure comparison
.
Methods Enzymol
.,
266
,
617
635
.

Pandit
 
S.B.
,
Skolnick
J.
(
2008
)
Fr-TM-align: a new protein structural alignment method based on fragment alignments and the TM-score
.
BMC Bioinformatics
,
9
,
531.

Ritchie
 
D.W.
 et al.  (
2012
)
Fast protein structure alignment using Gaussian overlap scoring of backbone peptide fragment similarity
.
Bioinformatics
,
28
,
3274
3281
.

Ritchie
 
D.W.
(
2016
)
Calculating and scoring high quality multiple flexible protein structure alignments
.
Bioinformatics
,
32
,
2650
2658
.

Roy
 
A.
 et al.  (
2012
)
COFACTOR: an accurate comparative algorithm for structure-based protein function annotation
.
Nucleic Acids Res
.,
40
,
W471
W477
.

Salem
 
S.
 et al.  (
2009
)
Iterative non-sequential protein structural alignment
.
J. Bioinform. Comput. Biol
.,
07
,
571
596
.

Salem
 
S.
 et al.  (
2010
)
FlexSnap: flexible non-sequential protein structure alignment
.
Algorithms Mol. Biol
.,
5
,
12
.

Scheeff
 
E.D.
,
Bourne
P.E.
(
2006
)
Application of protein structure alignments to iterated hidden Markov model protocols for structure prediction
.
BMC Bioinformatics
,
7
,
410.

Shindyalov
 
I.N.
,
Bourne
P.E.
(
1998
)
Protein structure alignment by incremental combinatorial extension (CE) of the optimal path
.
Protein Eng
.,
11
,
739
747
.

Stebbings
 
L.A.
,
Mizuguchi
K.
(
2004
)
HOMSTRAD: recent developments of the Homologous Protein Structure Alignment Database
.
Nucleic Acids Res
.,
32
,
D203
D207
.

Tyagi
 
M.
 et al.  (
2008
)
Protein structure mining using a structural alphabet
.
Proteins
,
71
,
920
937
.

Wang
 
S.
 et al.  (
2011
)
Alignment of distantly related protein structures: algorithm, bound and implications to homology modeling
.
Bioinformatics
,
27
,
2537
2545
.

Wang
 
S.
 et al.  (
2013
)
Protein structure alignment beyond spatial proximity
.
Sci. Rep
.,
3
,
1448
.

Wang
 
S.
,
Zheng
W.M.
(
2008
)
CLePAPS: fast pair alignment of protein structures based on conformational letters
.
J. Bioinform. Comput. Biol
.,
06
,
347
366
.

Wood
 
T.C.
,
Pearson
W.R.
(
1999
)
Evolution of protein sequences and structures
.
J. Mol. Biol
.,
291
,
977
995
.

Wu
 
Q.
 et al.  (
2018
)
COACH-D: improved protein-ligand binding sites prediction with refined ligand-binding poses through molecular docking
.
Nucleic Acids Res
.,
46
,
W438
W442
.

Xie
 
L.
,
Bourne
P.E.
(
2008
)
Detecting evolutionary relationships across existing fold space
.
Proc. Natl. Acad. Sci. USA
,
105
,
5441
5446
.

Xu
 
J.
,
Zhang
Y.
(
2010
)
How significant is a protein structure similarity with TM-score=0.5?
.
Bioinformatics
,
26
,
889
895
.

Yan
 
Y.
 et al.  (
2017
)
HDOCK: a web server for protein–protein and protein–DNA/RNA docking based on a hybrid strategy
.
Nucleic Acids Res
.,
45
,
W365
W373
.

Yan
 
Y.
,
Huang
S.-Y.
(
2018
)
Protein–protein docking with improved shape complementarity
.
Lect. Notes Comput. Sci
.,
10954
,
600
605
.

Yan
 
Y.
 et al.  (
2018
)
HSYMDOCK: a docking web server for predicting the structure of protein homo-oligomers with Cn or Dn symmetry
.
Nucleic Acids Res
.,
46
,
W423
W431
.

Yan
 
Y.
,
Huang
S.-Y.
(
2019a
)
CHDOCK: a hierarchical docking approach for modeling Cn symmetric homo-oligomeric complexes
.
Biophys. Rep
.,
5
,
65
72
.

Yan
 
Y.
,
Huang
S.-Y.
(
2019b
)
A non-redundant benchmark for symmetric protein docking
.
Big Data Min. Anal
.,
2
,
92
99
.

Yang
 
Y.
 et al.  (
2012
)
A new size-independent score for pairwise protein structure alignment and its application to structure classification and nucleic-acid binding prediction
.
Proteins
,
80
,
2080
2088
.

Yang
 
J.
 et al.  (
2013
)
Protein-ligand binding site recognition using complementary binding-specific substructure comparison and sequence profile alignment
.
Bioinformatics
,
29
,
2588
2595
.

Ye
 
Y.
,
Godzik
A.
(
2004
)
FATCAT: a web server for flexible structure comparison and structure similarity searching
.
Nucleic Acids Res
.,
32
,
W582
W585
.

Zemla
 
A.
(
2003
)
LGA: a method for finding 3D similarities in protein structures
.
Nucleic Acids Res
.,
31
,
3370
3374
.

Zhang
 
C.
 et al.  (
2017
)
COFACTOR: improved protein function prediction by combining structure, sequence and protein–protein interaction information
.
Nucleic Acids Res
.,
45
,
W291
W299
.

Zhang
 
Y.
,
Skolnick
J.
(
2005
)
TM-align: a protein structure alignment algorithm based on the TM-score
.
Nucleic Acids Res
.,
33
,
2302
2309
.

Zhu
 
J.
,
Weng
Z.
(
2004
)
FAST: a novel protein structure alignment algorithm
.
Proteins
,
58
,
618
627
.

Zhou
 
P.
 et al.  (
2018
)
HPEPDOCK: a web server for blind peptide-protein docking based on a hierarchical algorithm
.
Nucleic Acids Res
.,
46
,
W443
W450
.

This article is published and distributed under the terms of the Oxford University Press, Standard Journals Publication Model (https://academic.oup.com/journals/pages/open_access/funder_policies/chorus/standard_publication_model)
Associate Editor: Yann Ponty
Yann Ponty
Associate Editor
Search for other works by this author on: