Abstract

Motivation

Protein structure comparison plays a fundamental role in understanding the evolutionary relationships between proteins. Here, we release a new version of the DaliLite standalone software. The novelties are hierarchical search of the structure database organized into sequence based clusters, and remote access to our knowledge base of structural neighbors. The detection of fold, superfamily and family level similarities by DaliLite and state-of-the-art competitors was benchmarked against a manually curated structural classification.

Results

Database search strategies were evaluated using Fmax with query-specific thresholds. DaliLite and DeepAlign outperformed TM-score based methods at all levels of the benchmark, and DaliLite outperformed DeepAlign at fold level. Hierarchical and knowledge-based searches got close to the performance of systematic pairwise comparison. The knowledge-based search was four times as efficient as the hierarchical search. The knowledge-based search dynamically adjusts the depth of the search, enabling a trade-off between speed and recall.

Supplementary information

Supplementary data are available at Bioinformatics online.

1 Introduction

Protein folds display a natural clustering due to physical convergence and evolutionary descent from common ancestors. Newly solved protein structures are routinely searched against the database of known structures in the hope of discovering hidden evolutionary relationships (e.g. Valleau et al., 2018) or defining a new fold. DaliLite (Holm and Park, 2000; Holm et al., 2008) is a standalone implementation of the Dali methodology (Holm and Sander, 1993). Here, we explore the trade-off between speed and classification performance using different database search strategies.

The hierarchical strategy systematically tests the query structure against a representative subset of the Protein Data Bank (PDB), which is followed by expansion to sequence neighbors of the highest scoring representatives. The knowledge-based strategy first has to connect the query structure to a sparse graph of pre-computed structural similarities between PDB structures. The initial step uses fast approximate comparison methods with moderate recall. If no structural similarity is found, the strategy falls back to a systematic comparison against a representative subset of the PDB. Once the query has been connected to the graph, the depth of the search is controlled by a priority queue and dynamic adjustment of the structural similarity threshold for following transitive links.

The performance of the different database search strategies is evaluated on a benchmark that uses a manual structural classification, SCOP (Fox et al., 2014), as ground truth. For a given query domain, the evaluated structure comparison programs return a result list of PDB structures (‘hits’) ordered by decreasing structural alignment score. A hit is incorrect, if the structure does not have any domain from the same SCOP fold class as the query. All hits which share a domain from the same SCOP fold class are counted as correct. Correct hits are stratified by difficulty to fold, superfamily and family level targets (Supplementary Fig. S1).

Classification performance is evaluated using Fmax, which balances recall and precision. Calculating Fmax involves scanning an ordered list of hits for the optimal threshold that maximizes the harmonic mean of precision and recall. We compute four variants of Fmax values: using the full PDB or a non-redundant subset (PDB70), and allowing query-specific thresholds or enforcing uniform thresholds in the calculations. Query-specific thresholds suit Dali, because it quantifies structural similarity using a sum-of-pairs score, which has a strong dependence on the size of the query structure. We compute Fmax separately for each query and report the average query-wise Fmax over the benchmark as the main metric. Sometimes the classification task is formulated so as to require a uniform threshold applied to all queries, such as a probability of same-fold membership. We call this version the pooled Fmax. DaliLite v.5 is compared with state-of-the-art competitors (Dong et al., 2018; Wang et al., 2013; Zhang and Skolnick, 2005). The results (Supplementary Fig. S3) show large differences between evaluation metrics and between different measures of structural similarity.

2 Materials and methods

The DaliLite v.5 standalone software supports data import, pairwise alignment, hierarchical clustering (structural dendrograms) and database searches by hierarchical and knowledge-based strategies. The software runs on Linux and requires gfortran, optionally Open MPI, and Perl. Database searches require BLAST (Camacho et al., 2009). Users must download their own copy of the PDB. The knowledge base contains links between similar structures in the PDB. It is maintained by our group and is accessed remotely. Benchmark results (Table 1) used default parameters of DaliLite described in Supplementary Methods.

Table 1.

Benchmark results of average query-wise Fmax evaluation

ProgramStrategyFoldSuperfamilyFamily
DaliLiteaSystematic0.38 (0.36)0.83 (0.81)0.96 (0.97)
DaliLiteaHierarchical0.36 (0.31)0.82 (0.77)0.94 (0.92)
DaliLiteaKnowledge-based0.36 (0.28)0.79 (0.73)0.96 (0.96)
DeepAlignbSystematic0.28 (0.25)0.78 (0.76)0.97 (0.97)
mTMaligncKnowledge-based0.13 (0.13)0.55 (0.55)0.91 (0.93)
TMaligndSystematic0.12 (0.14)0.39 (0.42)0.85 (0.87)
ProgramStrategyFoldSuperfamilyFamily
DaliLiteaSystematic0.38 (0.36)0.83 (0.81)0.96 (0.97)
DaliLiteaHierarchical0.36 (0.31)0.82 (0.77)0.94 (0.92)
DaliLiteaKnowledge-based0.36 (0.28)0.79 (0.73)0.96 (0.96)
DeepAlignbSystematic0.28 (0.25)0.78 (0.76)0.97 (0.97)
mTMaligncKnowledge-based0.13 (0.13)0.55 (0.55)0.91 (0.93)
TMaligndSystematic0.12 (0.14)0.39 (0.42)0.85 (0.87)

Note: Fmax values for PDB70 (PDB).

a

This work.

Table 1.

Benchmark results of average query-wise Fmax evaluation

ProgramStrategyFoldSuperfamilyFamily
DaliLiteaSystematic0.38 (0.36)0.83 (0.81)0.96 (0.97)
DaliLiteaHierarchical0.36 (0.31)0.82 (0.77)0.94 (0.92)
DaliLiteaKnowledge-based0.36 (0.28)0.79 (0.73)0.96 (0.96)
DeepAlignbSystematic0.28 (0.25)0.78 (0.76)0.97 (0.97)
mTMaligncKnowledge-based0.13 (0.13)0.55 (0.55)0.91 (0.93)
TMaligndSystematic0.12 (0.14)0.39 (0.42)0.85 (0.87)
ProgramStrategyFoldSuperfamilyFamily
DaliLiteaSystematic0.38 (0.36)0.83 (0.81)0.96 (0.97)
DaliLiteaHierarchical0.36 (0.31)0.82 (0.77)0.94 (0.92)
DaliLiteaKnowledge-based0.36 (0.28)0.79 (0.73)0.96 (0.96)
DeepAlignbSystematic0.28 (0.25)0.78 (0.76)0.97 (0.97)
mTMaligncKnowledge-based0.13 (0.13)0.55 (0.55)0.91 (0.93)
TMaligndSystematic0.12 (0.14)0.39 (0.42)0.85 (0.87)

Note: Fmax values for PDB70 (PDB).

a

This work.

The benchmark consists of 140 query domains representing SCOP’s major structural classes (Supplementary Table S5). Evaluation considered only those structures that are both present in PDB (frozen in May 2018) and classified in SCOPe 2.07 (Supplementary Table S4). The complete benchmark data sets are available in the scope-140 benchmark distribution package http://ekhidna2.biocenter.helsinki.fi/dali/README.benchmark. The evaluation metric was
(1)
where n is the rank of a (query, hit) pair in the ordered list of results, p(n)=TP(n)/n is precision, r(n)=TP(n)/T is recall, TP(n) is the number of true positives (correct pairs) among the first n pairs in the ordered list of results and T is the number of structures in the fold class.

3 Results

The hierarchical and knowledge-based database searches are much faster than systematic pairwise comparison with small losses in classification performance (Table 1). The performance of hierarchical search and knowledge-based search drops by only 0.01–0.08 at all levels relative to the baseline given by systematic pairwise comparison. The hierarchical and knowledge-based strategies searched the PDB in 2–3 min per query domain (using 40 parallel processes), yielding an estimated throughput of 500 query domains per day. In terms of CPU time, the knowledge-based search was four times as efficient as the hierarchical search (Supplementary Table S3). The hierarchical search always scales linearly with database size. On the other hand, the knowledge-based walk continues until transitive closure under a Z-score threshold, and it can be made arbitrarily fast by setting a more stringent threshold.

The reference classification, SCOP, is based on visual inspection and manual curation. Thus, the benchmark evaluates to what extent computational methods are able to capture ‘biologically significant’ similarities. DaliLite and DeepAlign outperform TM-score based methods in query-wise Fmax evaluation (Table 1) at all levels of the benchmark. mTMalign has excellent precision at the cost of lower recall (Supplementary Table S2). DaliLite performs better than the other methods at fold level. Pooled Fmax evaluation (Supplementary Table S1) shows lower performance overall with smaller differences between methods (Supplementary Fig. S3). In contrast to TM-score based methods, DaliLite and DeepAlign showed a large difference between query-wise and pooled Fmax evaluation (Supplementary Fig. S3). This means that they recognize structural similarities in agreement with SCOP, but class boundaries occur at different Z-scores (DaliLite) or bitscores (DeepAlign) for different queries. The margin was largest at the superfamily level, the level where the discovery of remote homology can lead to biological insights.

Funding

This work was supported by HiLife, University of Helsinki.

Conflict of Interest: none declared.

References

Camacho
 
C.
 et al. (
2009
)
BLAST+: architecture and applications
.
BMC Bioinformatics
,
10
,
421.

Dong
 
R.
 et al. (
2018
)
mTM-align: a server for fast protein structure database search and multiple protein structure alignment
.
Nucleic Acids Res
.,
46
,
W380
W386
.

Fox
 
N.K.
 et al. (
2014
)
SCOPe: structural classification of proteins—extended, integrating SCOP and ASTRAL data and classification of new structures
.
Nucleic Acids Res
.,
42
,
D304
D309
.

Holm
 
L.
 et al. (
2008
)
Searching protein structure databases with DaliLite v.3
.
Bioinformatics
,
24
,
2780
2781
.

Holm
 
L.
,
Park
J.
(
2000
)
DaliLite workbench for protein structure comparison
.
Bioinformatics
,
16
,
566
567
.

Holm
 
L.
,
Sander
C.
(
1993
)
Protein structure comparison by alignment of distance matrices
.
J. Mol. Biol
.,
233
,
123
138
.

Valleau
 
D.
 et al. (
2018
)
Discovery of ubiquitin Deamidases in the pathogenic arsenal of Legionella pneumophila
.
Cell Rep
.,
23
,
568
583
.

Wang
 
S.
 et al. (
2013
)
Protein structure alignment beyond spatial proximity
.
Sci. Rep
.,
3
,
1448.

Zhang
 
Y.
,
Skolnick
J.
(
2005
)
TM-align: a protein structure alignment algorithm based on TM-score
.
Nucleic Acids Res
.,
33
,
2302
2309
.

This article is published and distributed under the terms of the Oxford University Press, Standard Journals Publication Model (https://academic.oup.com/journals/pages/open_access/funder_policies/chorus/standard_publication_model)
Associate Editor: Arne Elofsson
Arne Elofsson
Associate Editor
Search for other works by this author on:

Supplementary data