Abstract

We present a fragment-search based method for predicting loop conformations in protein models. A hierarchical and multidimensional database has been set up that currently classifies 105 950 loop fragments and loop flanking secondary structures. Besides the length of the loops and the types of bracing secondary structures, the database is organized along four internal coordinates, a distance and three types of angles characterizing the geometry of stem regions. Candidate fragments are selected from this library by matching the length and the types of bracing secondary structures of the query and by satisfying the geometrical restraints of the stems. They are subsequently inserted in the query protein framework, where their fit is assessed by the root mean square deviation (r.m.s.d.) of stem regions and by the number of rigid body clashes with the environment. In the final step, the remaining candidate loops are ranked by a Z -score that combines information on sequence similarity and fit of predicted and observed ϕ/ψ main chain dihedral angle propensities. Confidence Z -score cut-offs were determined for each loop length that identify those predicted fragments that outperform a competitive ab initio method. A web server implements the method, regularly updates the fragment library and performs prediction. Predicted segments are returned or, optionally, completed with side chain reconstruction and subsequently annealed in the environment of the query protein by conjugate gradient minimization. The prediction method was tested on artificially prepared search datasets where all trivial sequence similarities on the SCOP superfamily level were removed. Under these conditions it is possible to predict loops of length 4, 8 and 12 with coverage of 98, 78 and 28% and an r.m.s.d. accuracy of at least 0.22, 1.38 and 2.47 Å, respectively.
In a head-to-head comparison on loops extracted from freshly deposited new protein folds, the current method outperformed an earlier database search method in a ∼5:1 ratio.

INTRODUCTION

Computational analysis of protein sequences, such as the identification of conserved motifs, is often informative about the possible function of a protein ( 1 , 2 ). However, a detailed functional characterization frequently requires the study of 3D structures and complexes of proteins ( 3 , 4 ). Despite recent improvements in techniques of structure determination by X-ray crystallography or NMR spectroscopy, a quick inspection of biological databases reveals a two order of magnitude difference between the number of known protein sequences [∼3 million; UniProt database release 5.2 ( 5 )] and that of protein structures [∼35 000; Protein Data Bank (PDB) database ( 6 )]. In the absence of an experimentally described structure, computational methods, such as comparative modeling [e.g. Sali et al . ( 7 )], threading [e.g. Domingues et al . ( 8 )] or ab initio methods [e.g. Simons et al . ( 9 )] can be used to provide a useful 3D model and fill the gap between the number of sequences and structures [reviewed in ( 10–13 )].

Comparative modeling is currently the most accurate computational approach to protein structure prediction, but it is applicable only if a suitable template is found with a detectable sequence similarity over the entire length of the target protein ( 10 , 14 ). The applicability of comparative modeling is steadily increasing because of the worldwide efforts in Structural Genomics that aim at experimentally solving ∼5000–10 000 representative protein structures within the next few years ( 15 ). In the context of comparative modeling, the most difficult problems are the calculation of an accurate alignment between the target sequence and template structure and the prediction of insertions, i.e. loop structures ( 10 , 14 ). Even above 40% sequence identity, where the core of the fold is well preserved and can be aligned accurately, the surface exposed variable loops can vary substantially among the homologs ( 14 ). Recent improvements that were observed in the performance of fold prediction and homology modeling methods throughout successive CASP experiments ( 16 ) did not extend to the performance of loop modeling techniques.

Loops often represent an important part of the protein structure. Functional differences between the members of the same protein family are usually a consequence of structural differences on the protein surface, which frequently correspond to exposed loop regions ( 17 ). Loops often determine the functional specificity of a given protein framework, contributing to active and binding sites, such as antibody complementary determining regions ( 18 ), ligand binding sites [ATP ( 19 ), calcium binding sites ( 20 ), NAD(P) ( 21 )], DNA binding ( 22 ) or enzyme active sites [e.g. Ser-Thr kinases ( 23 ) or serine proteases ( 24 )]. Therefore the accuracy of loop conformations often determines the usefulness of computational or experimental models.

Loop prediction can be seen as a mini protein-folding problem. The correct conformation of a given segment of a polypeptide chain has to be calculated from the sequence of the segment influenced by flanking regions that span the loop and by the structure of the rest of the protein that cradles the loop. Many loop-modeling procedures have been described in recent years. Similarly to the prediction of protein structures there are ab initio (conformational search) methods ( 25–27 ), and database search (or knowledge-based) methods ( 28–30 ). There are also procedures that combine the two ( 31 , 32 ). An extensive overview of published methods in loop prediction until year 2000 can be found in Fiser et al . ( 33 ).

In ab initio prediction a conformational search or enumeration of conformations is conducted in a given environment, guided by a scoring or energy function ( 26 , 27 ). There are many such methods, exploiting different protein representations, sampling methods, energy function terms and optimization or enumeration algorithms. Recent works include: ModLoop, a method that combines a pseudo-energy scoring function with molecular dynamics and simulated annealing ( 33 ); a new energy function, ‘colony energy’ ( 34 ), that combines a force-field energy and a root mean square deviation (r.m.s.d.)-dependent term to improve ranking of loop conformations; a divide-and-conquer approach that recursively decomposes a target loop until the conformations of the resulting segments can be compiled analytically ( 35 ); a method that combines a fine-grained sampling of ϕ/ψ states and the AMBER/GBSA force field for ranking ( 36 ); a low-barrier molecular dynamics simulation to improve conformational sampling and a ‘soft-core’ potential energy function to allow extensive rearrangement of loop conformations ( 37 ); and a hierarchical approach, in which a large number of conformations is first generated, followed by iterative cycles of clustering, side-chain optimization and energy minimization of selected conformations using all-atom empirical potentials ( 38 ). DFIRE ( 39 ) and ROSETTA ( 40 ) are among other methods that were recently used to calculate loop conformations.

Candidate loop structures (up to 12 residues) whose conformations are similar to the native can be found if the number of loops generated is large enough ( 41 ). However, scoring functions are often not accurate enough to score the native conformation of a loop with the lowest energy ( 42 , 43 ). Therefore, there are two bottlenecks in conformational search approaches: (i) sampling a near native loop conformation; and (ii) constructing a scoring function that properly ranks a set of near native conformations.

Knowledge-based methods ( 44 ), also known as database search approaches, work by finding a segment that fits two stem regions of the target loop. The stems are defined as the main chain atoms that precede and follow the loop, but are not part of it. The search is performed through a database of many known protein structures, not only homologs of the modeled protein. Usually, many different alternative segments that fit the stem residues are obtained, and possibly sorted according to geometric criteria or sequence similarity between the template and target loop sequences. The selected segments are then superposed and annealed on the stem regions. Lessel and Schomburg pointed out the importance of the correct positioning of stem regions for knowledge-based loop prediction methods ( 45 ).

It has been shown by various groups that loops follow certain conformational patterns and are not random structures ( 46–49 ). Knowledge-based prediction of loop structures benefits from the classification of loop conformations ( 32 , 46 , 50 ). A recent work ( 51 ) described the advantage of using HMM sequence profiles in classifying and predicting loops that are derived from the ArchDB database ( 49 ). The good performance of database search methods is well established for cases when canonical loop conformations exist, as in the case of CDR loop predictions ( 29 , 52 , 53 ), but their performance is limited by the exponential increase in the number of possible conformations as a function of loop length. Although in the mid and late 90s it was argued that only segments of <7 or even only 4 residues long had most of their conceivable conformations present in structure databases ( 30 , 54 ), a recent update suggested sufficient coverage to model even a novel fold using fragments from the PDB, as the current database of known structures has increased enormously in the last few years ( 55 ). Subsequently a recent work that used a compilation of fragments extracted from PDB reported good results in prediction of long loops ( 56 ). Our most current survey indicates that loops up to length 8 are essentially fully covered by known conformational segments and the structure database is rapidly saturating for longer segments as well (N. Fernandez-Fuentes and A. Fiser, manuscript submitted).

Combined methods use both database search and ab initio methods. The underlying idea is the use of database search methods to find candidate loops for a given target loop and subsequently evaluate and re-optimize it in the target protein. An example of a combined algorithm is that of Martin et al . ( 57 ), in which antibody hypervariable loops were predicted using a database search followed by ab initio reconstruction of sections of the predicted loops and side chains using the CONGEN conformational search algorithm ( 27 ). This idea has been generalized: loops were selected from a fragment databank, optimized and ranked using the CHARMM energy function ( 58 ). Deane and Blundell presented CODA ( 32 ), a combination of two algorithms: FREAD, a knowledge-based method, and PETRA, an ab initio method ( 59 ).

Here we present the construction of a loop database (Search Space) that is an exhaustive compilation of all possible loop conformations braced in between two regular secondary structures (α-helices or β-strands) in all protein structures, and a novel database search algorithm to identify loop conformations for a given sequence segment. The prediction algorithm selects a set of candidate loops from the Search Space, then filters and ranks them by various criteria. First, the Search Space is queried by the length of the loop, the type of secondary structures that span the query loop and the geometry of the stems, described by a distance and three angles ( 48 ). Second, loops are discarded if the r.m.s.d. of their stem residues is too high or if the interactions between the fragment and the rest of the protein environment are unfavorable (steric clashes). Third, in the ranking step, the remaining candidate loops are sorted by a composite Z -score. The Z -score combines a sequence score, as obtained from a conformational similarity weight matrix (K3) ( 60 ), and a ϕ/ψ main chain dihedral angle propensities score ( 61 ).

MATERIALS AND METHODS

Construction and organization of Search Space—an exhaustive database of loop fragments

A representative set of 6578 protein structures was selected from the February 2004 release of PDB ( 6 ). The selected proteins share <95% sequence identity and were determined by X-ray crystallography at a resolution of 2.5 Å or better. The DSSP program ( 62 ) located loop segments defined as fragments that connect two regular secondary structures. The initial dataset of loops was further filtered by various quality rules to obtain a high-quality loop library by discarding incomplete or poorly defined segments: (i) loops with missing residues and/or main chain atoms (including C β , except for Gly) and (ii) loops with high crystallographic B -factors were discarded. For the latter, a B -factor Z -score was calculated for each residue by averaging the B -factors of all atoms in the residue and comparing this average with the mean and SD of the B -factors of all residues in the protein. Loops in which >50% of the residues had a B -factor Z -score higher than 1.0 were discarded. The final set contains 105 950 protein loops together with their flanking secondary structures. These loops were organized into a hierarchical and multidimensional database that we refer to as Search Space.
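The B -factor filter described above can be illustrated with a short Python sketch. It is not the original implementation: the function names, the choice of the population standard deviation and the toy data are assumptions made only for illustration.

```python
from statistics import mean, pstdev


def residue_bfactor_zscores(atom_bfactors_per_residue):
    """Return one B-factor Z-score per residue of a protein.

    atom_bfactors_per_residue holds one list of atomic B-factors per residue.
    """
    # Average the atomic B-factors within each residue.
    residue_means = [mean(atoms) for atoms in atom_bfactors_per_residue]
    mu = mean(residue_means)
    sigma = pstdev(residue_means)  # assumption: population SD over all residues
    if sigma == 0:
        return [0.0] * len(residue_means)
    return [(b - mu) / sigma for b in residue_means]


def loop_is_discarded(zscores, loop_residue_indices, z_cutoff=1.0, max_fraction=0.5):
    """True if more than max_fraction of the loop residues have Z > z_cutoff."""
    flagged = sum(1 for i in loop_residue_indices if zscores[i] > z_cutoff)
    return flagged / len(loop_residue_indices) > max_fraction


# Toy protein (hypothetical values): eight rigid residues and two mobile ones.
protein = [[10.0, 10.0]] * 8 + [[40.0, 60.0]] * 2
zscores = residue_bfactor_zscores(protein)
print(loop_is_discarded(zscores, [8, 9]))  # both loop residues are mobile
print(loop_is_discarded(zscores, [7, 8]))  # exactly half are mobile
```

With the strict ">50%" rule, a loop in which exactly half of the residues exceed the cut-off is kept.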

The Search Space represents all possible loop conformations and is organized by the type of bracing secondary structures, loop length and loop geometry. It is organized in a three level hierarchy: (i) at the top of the classification, loops are identified according to the type of the bracing secondary structures: αα, αβ, βα and ββ loops; (ii) at the second level, loops are grouped according to their length as defined by the DSSP program ( 62 ); and (iii) at the third level, loops are grouped according to the geometry of the bracing secondary structures as defined by a distance, D , between the anchor points and three angles: a hoist (δ), a packing (θ) and a meridian (ρ) ( 48 ) ( Figure 2 ).

The distance is classified over the interval from 0 to 40 Å, partitioned into 2 Å intervals; the hoist and packing angles span from 0 to 180° and are partitioned into 30° intervals; the meridian angle spans from 0 to 360° and is partitioned into 45° intervals. This partition classifies each loop in a 4D geometrical space. The partitioning of the Search Space is optimized: a very narrow, fine-grained partitioning would result in numerous empty or poorly populated cells in the multidimensional Search Space, whereas wide bins could join highly dissimilar geometries (see below, Calibration Test-Sets and Supplementary Data).
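As an illustration of this partitioning, the following Python sketch maps a stem geometry to its cell in the 4D grid. The function and constant names are hypothetical, and the treatment of values lying exactly on the upper bound (assigned to the last bin) is an assumption, since the text does not specify it.

```python
# (upper bound, bin width) for distance D (Å), hoist δ, packing θ and meridian ρ (degrees).
AXES = ((40.0, 2.0), (180.0, 30.0), (180.0, 30.0), (360.0, 45.0))


def geometry_cell(distance, hoist, packing, meridian):
    """Return the 4D cell index (one integer per axis) of a stem geometry."""
    cell = []
    for value, (upper, width) in zip((distance, hoist, packing, meridian), AXES):
        if not 0.0 <= value <= upper:
            raise ValueError(f"value {value} outside the range 0..{upper}")
        n_bins = int(upper / width)
        # Assumption: a value exactly on the upper bound goes into the last bin.
        cell.append(min(int(value // width), n_bins - 1))
    return tuple(cell)


def number_of_cells():
    """Total number of cells in the grid: the product of the bins per axis."""
    total = 1
    for upper, width in AXES:
        total *= int(upper / width)
    return total


print(geometry_cell(7.3, 95.0, 30.0, 200.0))
print(number_of_cells())
```

The product of the bins per axis, 20 × 6 × 6 × 8, gives the 5760 cells of the grid.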

Selecting candidate loops

For a given query segment, candidate loop conformations are selected from the Search Space by matching bracing secondary structures, length ±1 residue and geometrical criteria. A tolerance in loop length of ±1 residue is permitted to compensate for possible uncertainties in assigning end points to secondary structures ( 63 , 64 ). We refer to these selected loops as ‘candidate loops’ ( Figure 1 ). Two loops, loop A with geometry \(G_A = (D_A, \delta_A, \theta_A, \rho_A)\) and loop B with geometry \(G_B = (D_B, \delta_B, \theta_B, \rho_B)\), share the same geometry if \(G_{A-B} = [\,|D_A - D_B|,\ |\delta_A - \delta_B|,\ |\theta_A - \theta_B|,\ |\rho_A - \rho_B|\,]\) belongs to the 4D semi-open interval \(I = [(0,0,0,0),\ (2,30,30,45))\), i.e. the distance difference between the anchor points should be <2 Å and the differences between the three dihedral angles, δ, θ and ρ, should be <30°, 30° and 45°, respectively.
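The geometry criterion can be written compactly in code. The sketch below is a minimal Python illustration with hypothetical names; it uses plain absolute differences as in the definition above and makes no assumption about periodic wrapping of the meridian angle.

```python
from typing import NamedTuple


class StemGeometry(NamedTuple):
    distance: float  # D, in Å
    hoist: float     # δ, in degrees
    packing: float   # θ, in degrees
    meridian: float  # ρ, in degrees


# Upper (exclusive) limits of the semi-open interval I.
TOLERANCE = StemGeometry(2.0, 30.0, 30.0, 45.0)


def same_geometry(a, b, tolerance=TOLERANCE):
    """True if every coordinate difference is strictly below its limit."""
    return all(abs(x - y) < t for x, y, t in zip(a, b, tolerance))


query = StemGeometry(10.0, 90.0, 60.0, 100.0)
print(same_geometry(query, StemGeometry(11.5, 110.0, 85.0, 140.0)))
print(same_geometry(query, StemGeometry(12.0, 90.0, 60.0, 100.0)))
```

Because the interval is semi-open, a distance difference of exactly 2 Å does not count as a match.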

The use of geometry as a descriptor for loop selection implies that the flanking secondary structures are well described (at least five residues for α-helices and two residues for β-strands). However, the current method is prepared to handle cases where secondary structures are not known and/or not well defined. In this case, Search Space is queried using only the distance of end points. The approach predicts loops between two defined regions, therefore it is not suitable for prediction of terminal fragments.

Filtering candidate loops

The filtering step in the algorithm discards clearly unfavorable candidates based on structural superposition of stem residues and steric violations after fitting the loop in the protein framework ( Figure 1 ). All candidate loops are superimposed on their stem positions using the main chain atoms of two stem residues at each flanking secondary structure. Candidate segments with an r.m.s.d. of stems higher than 1.0, 1.5 and 1.75 Å for loops with 4–7, 8–12, and 13 or more residues, respectively, are discarded. This range of cut-off values was determined via an iterative optimization (see below, Calibration Test-Sets). The remaining candidate loops are further filtered by their conformational fit in the new protein environment, which is assessed in terms of steric clashes among main chain atoms (N, C, Cα and O). Two atoms are in steric clash if their distance is smaller than 70% of the sum of their respective van der Waals radii. Van der Waals radii were taken from Tsai et al . ( 65 ).
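The two filters can be sketched in Python as follows. This is an illustration under stated assumptions, not the original code: the function names are hypothetical, loops shorter than four residues are rejected because the text gives no cut-off for them, and the radii used in the example are placeholders rather than the values of Tsai et al .

```python
import math


def stem_rmsd_cutoff(loop_length):
    """Length-dependent r.m.s.d. cut-off (Å) for the superposed stem residues."""
    if loop_length < 4:
        raise ValueError("no cut-off is given for loops shorter than 4 residues")
    if loop_length <= 7:
        return 1.0
    if loop_length <= 12:
        return 1.5
    return 1.75


def in_steric_clash(xyz_a, radius_a, xyz_b, radius_b, scale=0.70):
    """True if two atoms are closer than 70% of the sum of their radii."""
    return math.dist(xyz_a, xyz_b) < scale * (radius_a + radius_b)


# Placeholder radius of 1.7 Å for both atoms (illustrative value only).
print(stem_rmsd_cutoff(8))
print(in_steric_clash((0.0, 0.0, 0.0), 1.7, (2.0, 0.0, 0.0), 1.7))
print(in_steric_clash((0.0, 0.0, 0.0), 1.7, (3.0, 0.0, 0.0), 1.7))
```

With the placeholder radii the clash threshold is 0.7 × 3.4 = 2.38 Å, so atoms 2.0 Å apart clash and atoms 3.0 Å apart do not.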

Ranking of candidate loops

The final set of candidate loops is ranked by two measures: (i) a sequence similarity score between the query and candidate loops; and (ii) ϕ/ψ main chain dihedral angle propensities. The sequence similarity score for a loop sequence, \(S_{\mathrm{sequence}}\), is defined as:

\[S_{\mathrm{sequence}} = \sum_{i=1}^{L} \left(C_i \to Q_i\right), \tag{1}\]
where L is the number of aligned positions between a candidate loop and the query loop, and \(C_i \to Q_i\) is the value of the substitution of amino acid C (candidate) by amino acid Q (query) in position i . A number of substitution tables were tested (Calibration Test-Sets) and the conformation similarity weight matrix (K3 matrix) of Kolaskar and Kulkarni-Kale ( 60 ) performed best.
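A minimal Python sketch of the sequence score is given below. The substitution values are invented toy numbers, not entries of the K3 matrix, and the sketch assumes that candidate and query have the same length, since the alignment of loops differing by one residue is not described here.

```python
def sequence_score(candidate, query, matrix):
    """Sum the substitution values C_i -> Q_i over all aligned positions."""
    if len(candidate) != len(query):
        raise ValueError("this sketch assumes loops of equal length")
    return sum(matrix[(c, q)] for c, q in zip(candidate, query))


# Toy substitution table (hypothetical values, NOT the K3 matrix),
# keyed by (candidate residue, query residue).
TOY_MATRIX = {
    ("G", "G"): 5, ("A", "A"): 4, ("S", "S"): 4,
    ("A", "G"): 1, ("G", "A"): 1,
    ("A", "S"): 2, ("S", "A"): 2,
    ("G", "S"): 0, ("S", "G"): 0,
}

print(sequence_score("GAS", "GAS", TOY_MATRIX))
print(sequence_score("GAS", "AGS", TOY_MATRIX))
```

Keying the table by ordered pairs allows an asymmetric matrix, in which substituting C by Q need not score the same as Q by C.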

The dihedral angle propensity score measures the compatibility of observed and expected dihedral angles of each residue of the candidate loop in the corresponding position of the query. Main chain conformation definitions and propensities are taken from the p15 propensities table of Shortle ( 61 ). Similarly to the sequence score, the propensity score, \(S_{\mathrm{propensity}}\), of the query loop is obtained as it is threaded in the main chain conformation of the candidate loop:

\[S_{\mathrm{propensity}} = \sum_{i=1}^{L} \log\left(C_i \to Q_i\right), \tag{2}\]
where L is the number of aligned positions and \(C_i \to Q_i\) is the propensity of the residue C when it adopts the main chain conformation of the residue Q in position i . The two components of the scoring scheme, sequence and propensity, are combined into a composite score. First, the individual scores ( S ) were transformed into Z -scores using the mean (µ) and SD (σ) of scores from random predictions:
\[Z\text{-score} = \frac{S - \mu}{\sigma}. \tag{3}\]
The randomized datasets for the Z -score calculation were generated differently for the sequence score and for the propensity score, so each has its own µ and σ values. For sequence scores, 5000 random sequences were generated taking into account the natural occurrence of amino-acid types in known proteins. For propensity score, 5000 strings of ϕ/ψ regions [as defined in the p15 table ( 61 )] were generated randomly.

The composite Z -score is a sum of Z -score sequence and Z -score propensity given that they are both larger than zero.
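The scoring scheme can be summarized in a few lines of Python. The sketch is illustrative only: the function names are hypothetical, the population SD is an assumption, and returning `None` when either Z -score is not positive is our own convention, since the text does not say how such candidates are handled.

```python
import math
from statistics import mean, pstdev


def propensity_score(propensities):
    """Sum of the log propensities over the aligned positions."""
    return sum(math.log(p) for p in propensities)


def z_score(score, random_scores):
    """Z-score of a raw score against a reference set of random scores."""
    mu = mean(random_scores)
    sigma = pstdev(random_scores)  # assumption: population SD
    return (score - mu) / sigma


def composite_z(z_sequence, z_propensity):
    """Sum of both Z-scores, defined only when both are larger than zero."""
    if z_sequence > 0 and z_propensity > 0:
        return z_sequence + z_propensity
    return None  # assumption: no composite score otherwise


# Toy reference distribution (hypothetical values) with mean 5 and SD 2.
reference = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
print(z_score(9.0, reference))
print(composite_z(2.0, 1.5))
```

In practice each score type would use its own reference set, as described above for the 5000 random sequences and the 5000 random strings of ϕ/ψ regions.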

Benchmarking the quality of prediction

Twelve test sets, each containing 50 loops randomly selected from the Search Space, with loop lengths between 4 and 14, were used to test the performance of the prediction method. In order to remove biases because of loop homologues in the Search Space, a specific Search Space was built for each prediction by removing, in each round, proteins that share the same SCOP superfamily as the protein containing the query loop. The accuracy of loop prediction is evaluated by comparing the selected/predicted and the experimental conformation. Two types of r.m.s.d. values were calculated: (i) the global r.m.s.d. (r.m.s.d. global ), which is the r.m.s.d. of the loop main chain atoms (N, Cα, C and O) after superposition of the main chain atoms of the stem residues on each flanking secondary structure (two residues on each side); and (ii) the local r.m.s.d. (r.m.s.d. local ), which is calculated for the main chain atoms after the superposition of the main chain loop atoms.
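Both measures share the same final calculation and differ only in which atoms are superposed beforehand. The Python sketch below covers that final step only; the optimal superposition itself (stem atoms for r.m.s.d. global , loop atoms for r.m.s.d. local ) is assumed to have been applied already and is not shown. The function name is hypothetical.

```python
import math


def rmsd(coords_a, coords_b):
    """Root mean square deviation between two equally ordered atom lists.

    Both coordinate sets are assumed to be already superposed.
    """
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("coordinate lists must be non-empty and of equal length")
    squared = sum(math.dist(a, b) ** 2 for a, b in zip(coords_a, coords_b))
    return math.sqrt(squared / len(coords_a))


# Two atoms, each displaced by 1 Å along z (toy coordinates).
predicted = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
observed = [(0.0, 0.0, 1.0), (1.0, 0.0, 1.0)]
print(rmsd(predicted, observed))
```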

Calibration test-sets

The prediction algorithm includes a number of steps where parameters have to be optimized, such as the cutoff value for r.m.s.d. of the stems, the choice of sequence substitution matrix and the bin size of the Search Space. All calibrations were carried out on three sets (different from the test sets above) of loops 4, 8 and 12 residues long (to cover short, medium and long loops), each containing 100 randomly selected fragments. The calibrations were performed by iterative optimization.

To identify the optimal binning of the Search Space we explored the conformational variations of structures in terms of r.m.s.d. local . If the binning is too wide, dissimilar conformations are merged, resulting in high r.m.s.d. local , whereas a finer grid results in poor coverage of predicted loops. To identify an optimal r.m.s.d. stem threshold, the correlation between the r.m.s.d. of stems and the r.m.s.d. local of loops was studied together with the coverage of prediction at different r.m.s.d. stem cutoffs. For the sequence similarity scores, different residue replacement scoring schemes were explored: Luthy et al . ( 66 ), BLOSUM62 ( 67 ), Topham et al . ( 68 ), Azarya-Sprinzak et al . ( 69 ), H3P2 ( 70 ), FUGUE ( 71 ), Blake and Cohen ( 72 ), plus two types of ‘home-made’ log-odds matrices resulting from pair-wise comparison of loop structures. All data can be consulted in the Supplementary Data.

RESULTS AND DISCUSSION

We present a novel fragment-search based loop conformation prediction method. The approach has two parts: (i) the classification of loop fragments into an extensive library (‘Search Space’) and (ii) a three step search algorithm to Select, Filter and Rank candidate loops for a given query sequence. Five different measures are used during the prediction process. Three of the measures (motif geometry, r.m.s.d. of stems and steric clashes) are used as qualitative descriptors only, to accept or reject candidate loops through the Selection and Filtering steps. Sequence similarity and amino acid ϕ/ψ dihedral angle propensities are used to quantitatively rank the final set of candidate loops ( Figure 1 ).

Search Space

The Search Space currently classifies 105 950 high quality loop structures, and it is regularly updated. Search Space is organized in a three level hierarchy: loops are identified and grouped according to (i) the type of bracing secondary structures; (ii) their length; and (iii) four internal coordinates of the bracing secondary structures as defined by a distance vector between the anchor points and three angles: hoist (δ), packing (θ) and meridian (ρ) ( Figure 2 ) ( 48 ). The third level of hierarchy is the geometrical binning of loops. It bins loops into 20 × 6 × 6 × 8 = 5760 possible cells or geometrical combinations, obtained from the number of possible bins for the distance vector and the δ, θ and ρ angles, respectively. Not all cells are equally populated: short loops cannot have large values of vector distance, and β−β hairpin loops have a restricted geometry in terms of possible angle combinations due to strict hydrogen bond requirements ( 73 ). For instance, for loops of length 4 the number of sampled cells is 614, of which 225 cells have <5 loops, and the most populated cell contains 681 loops. For loops of length 8 and 12 the numbers of sampled cells are 669 and 861, of which 304 and 416 have <5 loops, and the most populated cells contain 110 and 93 loops, respectively. Even at longer loop lengths there are preferred geometries, in agreement with earlier observations ( 74 ).

Selection of candidate loops from the Search Space

Prediction of loops requires an efficient (fast and scalable) and accurate algorithm. We group our algorithm into three steps: selecting, filtering and ranking of the suitable segments from the Search Space. During selection, loops in the Search Space are queried in a stepwise manner. First, loops with similar bracing secondary structures are identified, and those having a similar length (+/−1 residue) to the query loop are selected. The last selection step in the lookup process involves comparing one distance and three angle values, which serve as internal coordinates to describe the geometry of the stems.
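The stepwise lookup can be pictured as a keyed index. The following Python sketch is a simplified illustration with hypothetical class and method names: it stores loops under the key (bracing type, length, geometry cell) and, as a simplifying assumption, retrieves only loops from the same geometry cell as the query rather than applying the pairwise tolerance test described in Materials and Methods.

```python
from collections import defaultdict


class SearchSpace:
    """Toy three-level index: bracing type -> loop length -> geometry cell."""

    def __init__(self):
        self._index = defaultdict(list)

    def add(self, loop_id, bracing, length, cell):
        """Register a loop under its bracing type, length and 4D cell."""
        self._index[(bracing, length, cell)].append(loop_id)

    def select(self, bracing, length, cell):
        """Return loops with the same bracing and cell and a length within ±1."""
        hits = []
        for candidate_length in (length - 1, length, length + 1):
            hits.extend(self._index.get((bracing, candidate_length, cell), []))
        return hits


space = SearchSpace()
space.add("loop_a", "alpha-beta", 8, (3, 3, 1, 4))
space.add("loop_b", "alpha-beta", 9, (3, 3, 1, 4))
space.add("loop_c", "alpha-beta", 10, (3, 3, 1, 4))
space.add("loop_d", "beta-beta", 8, (3, 3, 1, 4))
print(space.select("alpha-beta", 8, (3, 3, 1, 4)))
```

Each selection then costs a few dictionary lookups, independent of the total number of stored loops, which is one way to obtain the fast and scalable behavior the text asks for.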

Selecting loops by geometry is a quick but coarse filtering step. It is more powerful than selecting fragments from loop databases based only on end point distances ( 56 , 75 ) because the orientation of the stems is considered in addition to a distance. On the other hand, it is faster than selecting fragments through superimposition and r.m.s.d. calculation of stem residues. The r.m.s.d. calculation is computationally more demanding than a simple string comparison. The initial selection of candidate loops by simple geometrical requirements quickly narrows the space to be explored by subsequent, more elaborate structural comparison. For instance, for loops of lengths 4, 8 and 12, the average number of loops selected by comparison of stem residue distances on 50 randomly chosen examples (with a tolerance of 1 Å) is 1534, 683 and 430, while the number of loops selected after geometrical comparison is only 181, 85 and 25, respectively. This strict filtering step does not mean that good candidate loops are rejected. Comparing the average r.m.s.d. local of the best fragment between loops that are selected by end point distances and loops selected by geometry, the differences are <0.05, 0.09 and 0.11 Å for the calibration test sets (Materials and Methods) of 4, 8 and 12 residue long loops, respectively. This suggests that the comparison of stem geometries is a robust measure for loop selection.

Filtering and ranking candidate loops

Two qualitative descriptors are used for filtering: the fit of stem residues, measured by the r.m.s.d. after superposition of main chain atoms, and the evaluation of steric clashes between the loop and the rest of the protein environment. r.m.s.d. cutoffs for superposed stem residues have been applied before in loop structure prediction methods, either for ranking ( 56 ) or filtering ( 75 ). The r.m.s.d. fit of stem residues correlates strongly with the accuracy of prediction of short loops, but this correlation becomes less pronounced for longer loops (Supplementary Data; correlation between r.m.s.d. stem versus r.m.s.d. local of loops). The reason is that the conformations that a fragment can adopt are less restricted by the stem residues in the case of medium and long loops (8–14 residues) than for short loops (1–7 residues). Therefore we applied a range of r.m.s.d. stem cutoff values as a function of loop length. After this filtering step the average number of candidate loops for a given random query dropped from 181, 85 and 25 to 96, 36 and 11 for loops of length 4, 8 and 12, respectively.

The second qualitative descriptor for filtering loops explores the conformational fit of candidate loops in the new protein environment. Each candidate loop is placed in the protein environment of the query and checked for steric clashes between the loop and its surroundings; candidates with steric clashes are removed from the candidate list. After these steps the average number of loop candidates decreased to 81, 35 and 5 for loops of length 4, 8 and 12, respectively.

Ranking candidate loops by sequence and main chain dihedral angle propensity comparisons

The remaining candidate loops are ranked according to sequence similarity and amino acid ϕ/ψ dihedral angle propensities. Sequence and propensity scores each have their own range and correlation with prediction accuracy; therefore these scores were converted into Z -scores in order to express both on a comparable, dimensionless scale.

Sequence Z -score gauges the similarity between the sequence of the query and candidate loops and compares it to a reference distribution of randomly selected pairs of loops with similar lengths. A number of different substitution matrices were tested to score sequence similarity and the K3 weight matrix proved to be the most efficient ( 60 ) as it was derived from comparisons of Ramachandran maps and was developed to select protein fragments with similar conformations.

The second quantitative measure to rank the set of candidate loops is the propensity of amino-acids to adopt a specific ϕ/ψ main chain dihedral angle conformation. Propensity is defined as the likelihood that an amino acid residue is found in a specific environment. The environment is defined by the backbone dihedral angles ϕ and ψ. The expected propensity values were obtained from a table that divides the Ramachandran plot into 15 different regions (‘p15 propensity’ table) ( 61 ). The logarithm of the propensity approximates the free energy of a specific residue conformation. The free energy for each position is assumed to be additive, so the score for a sequence fragment is the sum of the log of the propensities at each position ( 61 ). The composite Z -score is defined as the sum of the two types of Z -scores.

There is a (negative) correlation between the composite Z -score and the r.m.s.d. local for all three calibration test sets ( Figure 3A–C ). The distribution of sequence Z -score versus propensity Z -score ( Figure 3D–F ) for all candidate loops in the calibration test shows that in most cases the signs and magnitudes of the two Z -scores are related. For instance, if a candidate loop has a high sequence Z -score, the propensity Z -score is also high and vice versa. Also, candidates with good r.m.s.d. local have both Z -scores positive and large ( Figure 3D–F ).

Performance of loop prediction

Benchmarking loop prediction approaches that use database methods is not straightforward. Some sort of artificially filtered input fragment dataset needs to be prepared to avoid trivial hits and the consequent overestimation of performance. However, if one is overly ambitious in removing all segments from a database that show any level and type of similarity to a query, one may end up seriously underestimating method performance. We compare our results with: (i) results obtained using a variety of pre-filtered Search Spaces; (ii) the performance of a competitive and freely available ab initio prediction method, ModLoop ( 76 ); (iii) the theoretical minimum r.m.s.d., which depends on the database applied (Search Space) and thus informs on the practical limits of the method; (iv) the expected r.m.s.d. of a prediction made by random selection of loop segments ( Figure 4 ); and (v) an earlier developed, publicly available fragment search based method ( 32 ).

The minimum values of r.m.s.d. local that can be obtained with loops available in the Search Space (i.e. the loop with the smallest r.m.s.d.) are on average 0.25, 0.5 and 1 Å more accurate (for 4, 8 and 12 residue long loops, respectively) than the best results obtained by ModLoop ( Figure 4 ). This indicates that there are candidate loops at all loop lengths that outperform the accuracy of the ab initio approach. This supports the conclusion of Du et al . ( 55 ) who found that even for long loops (up to 15 residues) there is a 90% probability that a non-homologous structure within 2 Å r.m.s.d. exists. Therefore the bottleneck in fragment search based loop modeling does not appear to be the sampling (completeness of database segments), but the search algorithm and scoring function to locate these segments.

ModLoop on average outperforms the current prediction method at all loop lengths if we force the current search algorithm to return a segment for every possible query, even when the available candidates are not very good (Figure 4). However, the averages of both methods fall within the boundaries of 1 Å SD. The accuracy obtained with the current method is much higher than the accuracy obtained with a random prediction (Figure 4).

Differences between the current method and ModLoop are smaller when the comparison is based on r.m.s.d. global (Figure 4). Global r.m.s.d. measures the accuracy of the orientation of the loop together with its local conformation. The better relative performance in global r.m.s.d. than in local r.m.s.d. implies that candidate loops are selected with proper orientation. This is probably due to the filter that is applied on steric clashes.

Coverage versus accuracy: identifying confidence Z-score thresholds

We explored the performance of the method as a function of Z-score cut-offs. As the composite Z-score cut-off increases, the r.m.s.d. values decrease, i.e. the accuracy of predictions increases. Meanwhile, the corresponding coverage of the prediction decreases (Figure 5).

It is important to assign confidence values to a prediction. Table 1 lists r.m.s.d. local and coverage results versus Z-score cutoffs. Z-score cutoffs were defined in such a way that fragments selected with more significant Z-scores have equal or better accuracy than the average accuracy of fragments obtained by ModLoop. For instance, for loops of 4–7 residues a Z-score of 1.0 gives an equal or better performance than ModLoop with a corresponding coverage of 90% (Figure 5). For loops between 8 and 11 residues a Z-score larger than 2–3 is required and the average coverage is around 50–60%. For longer loops, beyond 12 residues, the coverage drops rapidly (Table 1).
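The calibration idea described above can be sketched as a scan over candidate cutoffs: keep only predictions at or above a given Z-score, measure coverage and average r.m.s.d., and report the smallest cutoff at which the retained predictions are on average at least as accurate as the reference method. All numbers, the cutoff grid and the function names below are hypothetical; this is a minimal illustration, not the calibration procedure actually used.

```python
from statistics import mean

# Hypothetical predictions for one loop length: (composite Z-score, r.m.s.d. local in Å).
predictions = [(4.2, 0.6), (3.5, 0.9), (2.8, 1.1), (1.9, 1.8), (1.1, 2.6), (0.4, 3.4)]
reference_rmsd = 1.5  # hypothetical average r.m.s.d. of the ab initio reference method

def evaluate(preds, cutoff):
    """Coverage (%) and mean r.m.s.d. of the predictions with Z-score >= cutoff."""
    kept = [rmsd for z, rmsd in preds if z >= cutoff]
    if not kept:
        return 0.0, None
    return 100.0 * len(kept) / len(preds), mean(kept)

def confidence_cutoff(preds, reference, cutoffs=(0, 1, 2, 3, 4)):
    """Smallest cutoff whose retained predictions match or beat the reference on average."""
    for cutoff in cutoffs:
        coverage, avg = evaluate(preds, cutoff)
        if avg is not None and avg <= reference:
            return cutoff, coverage, avg
    return None  # no cutoff on the grid reaches the reference accuracy

result = confidence_cutoff(predictions, reference_rmsd)
print(result)
```

Choosing the smallest qualifying cutoff keeps coverage as high as possible while still meeting the accuracy target, which mirrors the trade-off shown in Figure 5.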

Completeness of fragment database

Knowledge based approaches are limited by the completeness of the database they are based on. In our benchmarking process we artificially impoverished our Search Space by discarding loops that belong to the same SCOP superfamily as the query. This simulates a situation where no structures similar to the query protein are available at the SCOP superfamily level when attempting to predict its loop conformations. We also explored the performance of the prediction algorithm over a wider range of pre-filtering of the Search Space. We studied three additional scenarios by removing loop fragments that shared >75, 50 or 25% sequence identity with the query and re-ran the prediction (Figure 6). As the sequence identity filter becomes less restrictive, better candidates can be selected and better accuracy is achieved. While sequence filtering at 25% resembles the performance of our default filtering approach at the SCOP superfamily level, at 50% the performance of the current approach becomes competitive with the ModLoop method (33), and at 75% it exceeds its performance. The coverage of loop fragments in the PDB has been analyzed in a separate work (N. Fernandez-Fuentes and A. Fiser, manuscript submitted), which found that the current PDB supplies loop fragments up to 14 residues long that are on average 40–60% identical to any observed fragment in the sequence databases. This suggests that the 50% filtering of the Search Space might actually be the one that resembles true application scenarios. The performance in all benchmarks discussed so far was assessed in the context of the sequence signal only. Using the full power of the current approach, the accuracy of prediction can be significantly increased at each loop length (Figure 7). These improvements become more significant as we apply the prediction to more strictly filtered Search Spaces, where the sequence signal has less influence.
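A sequence identity pre-filter of the kind described above can be sketched as follows. It assumes that identity is counted position by position between the query loop and a candidate of the same length, which the text does not specify; the query sequence is taken from Table 2, while the candidate fragments and function names are invented for illustration.

```python
def percent_identity(a, b):
    """Percentage of identical residues between two equal-length loop sequences."""
    if len(a) != len(b):
        raise ValueError("loop sequences must have the same length")
    matches = sum(x == y for x, y in zip(a, b))
    return 100.0 * matches / len(a)

def filter_search_space(query, fragments, max_identity):
    """Keep only fragments whose identity to the query does not exceed max_identity (%)."""
    return [f for f in fragments if percent_identity(query, f) <= max_identity]

query = "GWDD"                                # loop sequence listed in Table 2
fragments = ["GWDD", "GWDE", "GADE", "SAKE"]  # hypothetical Search Space entries

for threshold in (75, 50, 25):
    print(threshold, filter_search_space(query, fragments, threshold))
```

Lowering the threshold removes progressively more of the close matches, so the remaining candidates carry less sequence signal.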

Comparing performance to FREAD database search method

We performed a head-to-head comparison of the performance of the current ArchPRED method and the FREAD method (32). To avoid a trivial exercise we used only new structural releases from the PDB (6), which could not yet have entered the classification schemes of either method, and we tracked these new PDB structures for two weeks. Among the new structures we identified new folds by removing all proteins with sequence similarity (>40% sequence identity) and structural similarity [DALI (77) Z-score >3] to any known PDB structure. From the remaining six novel fold structures we located 35 loop regions and submitted the sequences of these fragments to our method and to the FREAD server. The predicted loops were superposed on the experimental structure and r.m.s.d. values were obtained (Table 2). The current method, ArchPRED, not only provides a higher coverage (it predicted all segments, while FREAD did not return an answer in four cases) but also returned the more accurate prediction in 23 out of 28 cases; in three further cases the two methods returned identical solutions (Table 2).
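The tally quoted above can be re-derived from Table 2. The short script below transcribes the two r.m.s.d. columns of the table (None stands for a "No hit" entry) and counts the outcomes; the variable names are illustrative only.

```python
# (PDB code, loop start, ArchPRED r.m.s.d. local, FREAD r.m.s.d. local) from Table 2.
results = [
    ("2fdo", 17, 4.06, 3.27), ("2fdo", 54, 2.23, 2.80), ("2fef", 94, 0.15, 0.22),
    ("2fef", 22, 2.30, 3.29), ("2fef", 172, 1.62, 1.79), ("2fef", 268, 0.40, 0.91),
    ("2fef", 60, 1.41, 1.24), ("2fef", 235, 0.06, None), ("2fef", 205, 0.11, 0.11),
    ("2fef", 276, 0.27, 2.02), ("2d13", 211, 0.22, 0.26), ("2d13", 194, 0.36, 1.32),
    ("2d13", 11, 0.10, 1.15), ("2d13", 96, 0.93, 1.79), ("2d13", 121, 0.82, None),
    ("2d13", 138, 0.12, 0.10), ("2d13", 27, 0.12, 0.12), ("2d13", 88, 0.21, 0.86),
    ("2d13", 63, 0.16, 0.16), ("2d13", 177, 2.69, 3.84), ("2fgg", 44, 2.30, 2.96),
    ("2fgg", 20, 0.94, 1.00), ("2fi9", 48, 3.32, None), ("2fi9", 76, 1.79, 3.23),
    ("2fi9", 65, 0.30, 1.92), ("2fi9", 94, 0.10, 0.07), ("2fiy", 41, 0.35, 0.55),
    ("2fiy", 97, 0.18, 0.60), ("2fiy", 109, 1.33, 2.25), ("2fiy", 142, 1.22, None),
    ("2fiy", 83, 0.52, 1.88), ("2fiy", 212, 0.20, 0.15), ("2fiy", 257, 0.07, 0.15),
    ("2fiy", 268, 1.98, 2.48), ("2fiy", 293, 0.08, 0.17),
]

# Loops for which FREAD returned no prediction.
no_hit = sum(1 for row in results if row[3] is None)
# Loops predicted by both methods.
compared = [(a, f) for _, _, a, f in results if f is not None]
identical = sum(1 for a, f in compared if a == f)
archpred_better = sum(1 for a, f in compared if a < f)
fread_better = sum(1 for a, f in compared if a > f)

print(len(results), no_hit, identical, archpred_better, fread_better)
```

With ties set aside, 28 loops remain in which one method is strictly better, and ArchPRED has the lower r.m.s.d. in 23 of them.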

Examples of predicted loops

We present three examples as illustrations: the prediction of a short, a medium and a long loop. For the short loop we predicted a loop with a length of four residues (extracted from structure 1g29 chain 1, between residues 37 and 40). The loop connects two β-strands forming a β–β hairpin motif. The top three fragment candidates and the experimental solution structure are shown in Figure 8A. All of the candidates fit with a similar r.m.s.d. of stems and without clashes in the new protein framework. If ranking were based only on sequence signals, the candidate loop in green would be the top choice. However, the red candidate has the highest composite Z-score (Z = 2.85 versus 1.45 and 1.01) and is the most accurate fragment (r.m.s.d. local = 0.2 Å versus 0.4 and 0.6 Å).

Figure 8B shows an example of predicting a medium-size loop of 8 residues, between positions 107 and 114 in the 1srp structure. This example illustrates the usefulness of filtering by steric clashes. All three candidate loops, shown in red, green and blue, have approximately the same r.m.s.d. for stem residues, around 1.1 Å. The loop in green has the highest Z-score for sequence signal (3.2, 2.8 and 1.9 for the green, red and blue loops, respectively). However, the green loop bumps against a neighboring β-strand in the new protein environment and is therefore removed from the list of putative candidates by the prediction method. The remaining candidates, red and blue, both fit without steric clashes, but the composite Z-score of the red loop is higher than that of the blue loop (4.7 versus 3.6), in good agreement with its superior r.m.s.d. local (1.3 Å versus 1.9 Å).
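The selection logic in this example can be written down in a few lines: discard candidates that clash with the framework, then take the one with the highest composite Z-score. The sketch below uses the Z-scores quoted for the 1srp loop; the class and function names are hypothetical, the clash test is reduced to a boolean flag (the method itself counts rigid body clashes), and the composite Z-score of the green loop is left as None because the text does not report it.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Candidate:
    label: str
    sequence_z: float
    composite_z: Optional[float]  # None where the text does not report a value
    clashes: bool                 # simplification: True if the loop bumps into the framework

candidates = [
    Candidate("green", 3.2, None, True),
    Candidate("red", 2.8, 4.7, False),
    Candidate("blue", 1.9, 3.6, False),
]

def select_loop(cands):
    """Drop clashing candidates, then return the one with the highest composite Z-score."""
    clash_free = [c for c in cands if not c.clashes and c.composite_z is not None]
    return max(clash_free, key=lambda c: c.composite_z, default=None)

by_sequence_only = max(candidates, key=lambda c: c.sequence_z)
chosen = select_loop(candidates)
print(by_sequence_only.label, chosen.label)
```

Ranking by sequence signal alone would pick the green loop, whereas applying the clash filter before the composite ranking selects the red one.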

The third example is the prediction of a loop of 12 residues extracted from structure 1j85 chain A, between residues 121 and 132 (Figure 8C). Of the three candidate loops, the one in blue is discarded because of steric conflicts with the protein framework. The loops in red and green have comparable sequence Z-scores (2.3 and 2.5), but the main chain dihedral angle propensity Z-score is more favorable for the red loop (2.2 versus 1.4), resulting in an overall higher composite Z-score, in agreement with its better accuracy, i.e. r.m.s.d. local (1.6 versus 2.7 Å).

Loop prediction web server

The Search Space and the prediction method described here are implemented in a web server. The user provides the query structure in PDB format that contains the missing loop(s) and defines its sequential location. The interface of the web server provides all the controls of the method: searching loops by end-point distance only, or by geometry; if by geometry, then by two types of bracing regular secondary structural elements; if these elements are beta strands, then further distinguished by hairpin or link types. Once the prediction is completed, results are sent by email in the form of a link pointing to temporary web pages. Optionally, the best loop fragment located is built into the query structure and a conjugate gradient minimization is applied to smoothly anneal the stems in the protein framework. The server is accessible at http://www.fiserlab.org/servers/archpred .

CONCLUSION

The number of experimentally solved structures has grown dramatically in the last few years. More importantly, due to the ongoing Structural Genomics efforts an increasing number of new folds or remotely related proteins are being solved, enlarging the library of conformational segments (15). Studies in 1994 and 1997 concluded that database search methods were limited to predicting loops up to four residues long (seven considering three stem residues) (30, 54). However, our analysis (N. Fernandez-Fuentes and A. Fiser, submitted), in agreement with other recent reports (55), suggests that there is sufficient sampling of short segments in the PDB to efficiently use database search methods to predict loops currently up to 9–12 residues. If a good fragment is found in the database it can be used directly or as a starting conformation for subsequent optimization. In both approaches, the presence of a suitable segment permits one to avoid computationally more demanding and riskier ab initio approaches. To assess the usefulness of a given predicted segment it is necessary to define confidence values for fragment based approaches. We tackled the problem of defining confidence values in the current method by calibrating Z-score cutoff values that ensure a superior solution to a competitive ab initio approach.

The accuracy and coverage of the current method imply that database search approaches are rapidly gaining importance in loop prediction and that the bottleneck in these approaches does not appear to be the sampling (database completeness of segments), but the search algorithm and scoring function used to locate these segments. With the advance of structural genomics efforts (15) we expect that this trend will be further accentuated in the coming years.

SUPPLEMENTARY DATA

Supplementary Data are available at NAR Online.

Figure 1

Flowchart of the methodology for building the Search Space and performing the prediction.

Figure 2

Definition of loop geometry. The axis for an α or β secondary structure is defined as the shortest of the principal moments of inertia of that structure. M1 and M2 are the axis vectors of the secondary structure. The geometry of each motif is defined by four internal co-ordinates: (i) D, a distance between ending points; (ii) Hoist angle, δ, the angle between axis M1 and vector of D; (iii) Packing angle, θ, the angle between M1 and M2; and (iv) Meridian angle, ρ, the angle between M2 and the plane that contains the vector M1.

Figure 3

Correlation of composite Z-score and accuracy of prediction (average r.m.s.d. local) for calibration test-sets of (A) 4, (B) 8 and (C) 12 residues long loops, respectively. SDs are shown around the averages. Distribution of Z-scores (propensity Z-score versus sequence Z-score) for all candidates (x) and for the candidates with the best r.m.s.d. local among the top 10 hits (closed circle) for (D) 4, (E) 8 and (F) 12 residues long loops, respectively.

Figure 4

Average (A) r.m.s.d. local and (B) r.m.s.d. global as a function of loop length. (closed diamond) indicates the practical limit of the prediction; (closed circle) shows the average r.m.s.d. of ModLoop predictions; (closed triangle) shows the average r.m.s.d. of candidates with the highest composite Z-score; and (closed square) shows the average r.m.s.d. for random prediction.

Figure 5

Average r.m.s.d. local (A) and coverage (B) as a function of Z-score threshold for all loop lengths.

Table 1

Accuracy and coverage of prediction for different loop lengths and Z-score thresholds

Loop length  Confidence Z-score a  Average r.m.s.d. local b (Å)  Coverage c (%)
4            ≥1                    0.22                          98
5            ≥1                    0.15                          96
6            ≥1                    0.34                          98
7            ≥1                    0.93                          94
8            ≥2                    1.38                          78
9            ≥3                    1.93                          60
10           ≥3                    2.11                          46
11           ≥3                    2.30                          44
12           ≥4                    2.47                          28
13           ≥4                    2.85
14           ≥4                    2.88

a Z-score thresholds were defined to guarantee that the selected segments are at least as accurate on average as the corresponding prediction of ModLoop.

b Average local r.m.s.d. for a given Z -score threshold.

c Average local r.m.s.d. for a given coverage.

Figure 6

Average r.m.s.d. local as a function of loop length under different conditions of pre-filtering the Search Space: selecting from candidates that belong to a different SCOP superfamily than the query loop (closed triangle); using ModLoop (closed circle); selecting among candidates that have <75% (closed diamond), <50% (open square) and <25% (closed square) sequence identity with the query loop.

Figure 7

Average r.m.s.d. local versus loop length under different conditions of selection: using only sequence identity information or using the full algorithm for prediction.

Table 2

Comparing prediction performances of ArchPRED and FREAD programs

PDB  Chain  Start  Sequence  ArchPRED r.m.s.d. local (Å)  FREAD r.m.s.d. local (Å)
2fdo 17 GHLEDDVVVVVSSD 4.06 3.27 
2fdo 54 AFPADIFDAD 2.23 2.80 
2fef 94 GLA 0.15 0.22 
2fef 22 LARTDRAPRRNID 2.30 3.29 
2fef 172 VLQFDTD 1.62 1.79 
2fef 268 GWDD 0.40 0.91 
2fef 60 QPFAAQ 1.41 1.24 
2fef 235 QLS 0.06 No hit 
2fef 205 GQL 0.11 0.11 
2fef 276 ASPHYL 0.27 2.02 
2d13 211 DGLS 0.22 0.26 
2d13 194 MPFFK 0.36 1.32 
2d13 11 YSGG 0.10 1.15 
2d13 96 AGALAS 0.93 1.79 
2d13 121 TPAWEKD 0.82 No hit 
2d13 138 LGF 0.12 0.10 
2d13 27 SGL 0.12 0.12 
2d13 88 GLKVD 0.21 0.86 
2d13 63 GIP 0.16 0.16 
2d13 177 GIHIAGEGGE 2.69 3.84 
2fgg 44 RLDPADEPVA 2.30 2.96 
2fgg 20 HDER 0.94 1.00 
2fi9 48 DMTGPVPTQEDI 3.32 No hit 
2fi9 76 TGVELLRLP 1.79 3.23 
2fi9 65 ESDQIE 0.30 1.92 
2fi9 94 KRI 0.10 0.07 
2fiy 41 EGHPM 0.35 0.55 
2fiy 97 GAW 0.18 0.60 
2fiy 109 GYPAPAN 1.33 2.25 
2fiy 142 SGQFDLLPAAL 1.22 No hit 
2fiy 83 HGMPPLA 0.52 1.88 
2fiy 212 SLCAC 0.20 0.15 
2fiy 257 PSCQ 0.07 0.15 
2fiy 268 LEFDRHAD 1.98 2.48 
2fiy 293 DGY 0.08 0.17 

Thirty-five loops were collected from six recently deposited novel fold structures and used as test cases. The table shows the PDB code, chain identifier, loop starting position, loop sequence and the backbone r.m.s.d. local values calculated after superposition of the predicted and experimental structures.

Figure 8

Cartoon representations for three examples of predicted loops. N- and C-termini of loops are marked. The experimental structure is in gray while the candidate loops are in red, blue and yellow. The red candidate loops are those with the highest composite Z-score. Gray shaded spheres indicate regions with repulsive contacts (clashes) between the candidate loop and the protein framework. (A) Prediction of a loop of length 4, PDB 1g29, chain 1, residues 37–40. (B) Prediction of a loop of length 8, PDB 1srp, residues 107–114. (C) Prediction of a loop of length 12, PDB 1j85, chain A, residues 121–132. All figures were generated using Pymol (http://pymol.sourceforge.net).

The authors acknowledge all Fiser lab members for their insightful comments on the work, especially Dr D. Rykunov. N.F.F. was partially supported by a Boehringer fellowship. Financial support was provided by NIH GM62519-04 and the Seaver Foundation. Funding to pay the Open Access publication charges for this article was provided by NIH GM62519-04 and MEC BI02005-00533.

Conflict of interest statement . None declared.

REFERENCES

1. Hulo, N., Sigrist, C.J., Le Saux, V., Langendijk-Genevaux, P.S., Bordoli, L., Gattiker, A., De Castro, E., Bucher, P., Bairoch, A. (2004) Recent improvements to the PROSITE database. Nucleic Acids Res., 32, D134–D137.
2. Bateman, A., Birney, E., Cerruti, L., Durbin, R., Etwiller, L., Eddy, S.R., Griffiths-Jones, S., Howe, K.L., Marshall, M., Sonnhammer, E.L. (2002) The Pfam protein families database. Nucleic Acids Res., 30, 276–280.
3. Rost, B. (2002) Enzyme function less conserved than anticipated. J. Mol. Biol., 318, 595.
4. Todd, A.E., Orengo, C.A., Thornton, J.M. (2001) Evolution of function in protein superfamilies, from a structural perspective. J. Mol. Biol., 307, 1113.
5. Bairoch, A., Apweiler, R., Wu, C.H., Barker, W.C., Boeckmann, B., Ferro, S., Gasteiger, E., Huang, H., Lopez, R., Magrane, M., et al. (2005) The Universal Protein Resource (UniProt). Nucleic Acids Res., 33, D154–D159.
6. Berman, H.M., Battistuz, T., Bhat, T.N., Bluhm, W.F., Bourne, P.E., Burkhardt, K., Feng, Z., Gilliland, G.L., Iype, L., Jain, S., et al. (2002) The Protein Data Bank. Acta Crystallogr. D Biol. Crystallogr., 58, 899–907.
7. Sali, A. and Blundell, T.L. (1993) Comparative protein modeling by satisfaction of spatial restraints. J. Mol. Biol., 234, 779–815.
8. Domingues, F.S., Lackner, P., Andreeva, A., Sippl, M.J. (2000) Structure-based evaluation of sequence comparison and fold recognition alignment accuracy. J. Mol. Biol., 297, 1003.
9. Simons, K.T., Kooperberg, C., Huang, E., Baker, D. (1997) Assembly of protein tertiary structures from fragments with similar local sequences using simulated annealing and Bayesian scoring functions. J. Mol. Biol., 268, 209–225.
10. Baker, D. and Sali, A. (2001) Protein structure prediction and structural genomics. Science, 294, 93–96.
11. Schonbrun, J., Wedemeyer, W.J., Baker, D. (2002) Protein structure prediction in 2002. Curr. Opin. Struct. Biol., 12, 348–354.
12. Kretsinger, R.H., Ison, R.E., Hovmoller, S. (2004) Prediction of protein structure. Methods Enzymol., 383, 1–27.
13. Fiser, A. (2004) Protein structure modeling in the proteomics era. Expert Rev. Proteomics, 1, 97–110.
14. Fiser, A., Feig, M., Brooks, C.L., III, Sali, A. (2002) Evolution and physics in comparative protein structure modeling. Acc. Chem. Res., 35, 413–421.
15. Chance, M.R., Fiser, A., Sali, A., Pieper, U., Eswar, N., Xu, G., Fajardo, J.E., Radhakannan, T., Marinkovic, N. (2004) High-throughput computational and experimental techniques in structural genomics. Genome Res., 14, 2145.
16. Venclovas, C., Zemla, A., Fidelis, K., Moult, J. (2003) Assessment of progress over the CASP experiments. Proteins, 53, 585.
17. Blouin, C., Butt, D., Roger, A.J. (2004) Rapid evolution in conformational space: a study of loop regions in a ubiquitous GTP binding domain. Protein Sci., 13, 608–616.
18. Kim, S.T., Shirai, H., Nakajima, N., Higo, J., Nakamura, H. (1999) Enhanced conformational diversity search of CDR-H3 in antibodies: role of the first CDR-H3 residue. Proteins, 37, 683–696.
19. Saraste, M., Sibbald, P.R., Wittinghofer, A. (1990) The P-loop – a common motif in ATP- and GTP-binding proteins. Trends Biochem. Sci., 15, 430–434.
20. Kawasaki, H. and Kretsinger, R.H. (1995) Calcium-binding proteins 1: EF-hands. Protein Profile, 2, 297–490.
21. Wierenga, R.K., Terpstra, P., Hol, W.G. (1986) Prediction of the occurrence of the ADP-binding beta alpha beta-fold in proteins, using an amino acid sequence fingerprint. J. Mol. Biol., 187, 101–107.
22. Tainer, J.A., Thayer, M.M., Cunningham, R.P. (1995) DNA repair proteins. Curr. Opin. Struct. Biol., 5, 20–26.
23. Johnson, L.N., Lowe, E.D., Noble, M.E., Owen, D.J. (1998) The Eleventh Datta Lecture. The structural basis for substrate recognition and control by protein kinases. FEBS Lett., 430, 1–11.
24. Wlodawer, A., Miller, M., Jaskolski, M., Sathyanarayana, B.K., Baldwin, E., Weber, I.T., Selk, L.M., Clawson, L., Schneider, J., Kent, S.B., et al. (1989) Conserved folding in retroviral proteases: crystal structure of a synthetic HIV-1 protease. Science, 245, 616–621.
25. Fine, R.M., Wang, H., Shenkin, P.S., Yarmush, D.L., Levinthal, C. (1986) Predicting antibody hypervariable loop conformations. II: minimization and molecular dynamics studies of MCPC603 from many randomly generated loop conformations. Proteins, 1, 342.
26. Moult, J. and James, M.N. (1986) An algorithm for determining the conformation of polypeptide segments in proteins by systematic search. Proteins, 1, 146.
27. Bruccoleri, R.E. and Karplus, M. (1987) Prediction of the folding of short polypeptide segments by uniform conformational sampling. Biopolymers, 26, 137.
28. Jones, T.A. and Thirup, S. (1986) Using known substructures in protein model building and crystallography. EMBO J., 5, 819.
29. Chothia, C. and Lesk, A.M. (1987) Canonical structures for the hypervariable regions of immunoglobulins. J. Mol. Biol., 196, 901.
30. Fidelis, K., Stern, P.S., Bacon, D., Moult, J. (1994) Comparison of systematic search and database methods for constructing segments of protein structure. Protein Eng., 7, 953.
31. Wintjens, R.T., Rooman, M.J., Wodak, S.J. (1996) Automatic classification and analysis of alpha alpha-turn motifs in proteins. J. Mol. Biol., 255, 235–253.
32. Deane, C.M. and Blundell, T.L. (2001) CODA: a combined algorithm for predicting the structurally variable regions of protein models. Protein Sci., 10, 599.
33. Fiser, A., Do, R.K., Sali, A. (2000) Modeling of loops in protein structures. Protein Sci., 9, 1753.
34. Xiang, Z., Soto, C.S., Honig, B. (2002) Evaluating conformational free energies: The colony energy and its application to the problem of loop prediction. Proc. Natl Acad. Sci. USA, 99, 7432–7437.
35. Tosatto, S.C., Bindewald, E., Hesser, J., Manner, R. (2002) A divide and conquer approach to fast loop modeling. Protein Eng., 15, 279–286.
36. de Bakker, P.I., DePristo, M.A., Burke, D.F., Blundell, T.L. (2003) Ab initio construction of polypeptide fragments: accuracy of loop decoy discrimination by an all-atom statistical potential and the AMBER force field with the Generalized Born solvation model. Proteins, 51, 21.
37. Hornak, V. and Simmerling, C. (2003) Generation of accurate protein loop conformations through low-barrier molecular dynamics. Proteins, 51, 577–590.
38. Jacobson, M.P., Pincus, D.L., Rapp, C.S., Day, T.J., Honig, B., Shaw, D.E., Friesner, R.A. (2004) A hierarchical approach to all-atom protein loop prediction. Proteins, 55, 351.
39. Zhang, C., Liu, S., Zhou, Y. (2004) Accurate and efficient loop selections by the DFIRE-based all-atom statistical potential. Protein Sci., 13, 391–399.
40. Rohl, C.A., Strauss, C.E., Chivian, D., Baker, D. (2004) Modeling structurally variable regions in homologous proteins with rosetta. Proteins, 55, 656–677.
41. Rapp, C.S. and Friesner, R.A. (1999) Prediction of loop geometries using a generalized born model of solvation effects. Proteins, 35, 173.
42. Smith, K.C. and Honig, B. (1994) Evaluation of the conformational free energies of loops in proteins. Proteins, 18, 119.
43. Pellequer, J.L. and Chen, S.W. (1997) Does conformational free energy distinguish loop conformations in proteins? Biophys. J., 73, 2359–2375.
44. Greer, J. (1981) Comparative model-building of the mammalian serine proteases. J. Mol. Biol., 153, 1027.
45. Lessel, U. and Schomburg, D. (1999) Importance of anchor group positioning in protein loop prediction. Proteins, 37, 56.
46. Donate, L.E., Rufino, S.D., Canard, L.H., Blundell, T.L. (1996) Conformational analysis and clustering of short and medium size loops connecting regular secondary structures: a database for modeling and prediction. Protein Sci., 5, 2600.
47. Rufino, S.D., Donate, L.E., Canard, L., Blundell, T.L. (1996) Analysis, clustering and prediction of the conformation of short and medium size loops connecting regular secondary structures. Pac. Symp. Biocomput., 570–589.
48. Oliva, B., Bates, P.A., Querol, E., Aviles, F.X., Sternberg, M.J. (1997) An automated classification of the structure of protein loops. J. Mol. Biol., 266, 814.
49. Espadaler, J., Fernandez-Fuentes, N., Hermoso, A., Querol, E., Aviles, F.X., Sternberg, M.J., Oliva, B. (2004) ArchDB: automated protein loop classification as a tool for structural genomics. Nucleic Acids Res., 32, D185.
50. Burke, D.F. and Deane, C.M. (2001) Improved protein loop prediction from sequence alone. Protein Eng., 14, 473–478.
51. Fernandez-Fuentes, N., Querol, E., Aviles, F.X., Sternberg, M.J., Oliva, B. (2005) Prediction of the conformation and geometry of loops in globular proteins: testing ArchDB, a structural classification of loops. Proteins, 60, 746–757.
52. Martin, A.C. and Thornton, J.M. (1996) Structural families in loops of homologous proteins: automatic classification, modelling and application to antibodies. J. Mol. Biol., 263, 800.
53. Oliva, B., Bates, P.A., Querol, E., Aviles, F.X., Sternberg, M.J. (1998) Automated classification of antibody complementarity determining region 3 of the heavy chain (H3) loops into canonical forms and its application to protein structure prediction. J. Mol. Biol., 279, 1193.
54. Lessel, U. and Schomburg, D. (1997) Creation and characterization of a new, non-redundant fragment data bank. Protein Eng., 10, 659.
55. Du, P., Andrec, M., Levy, R.M. (2003) Have we seen all structures corresponding to short protein fragments in the Protein Data Bank? An update. Protein Eng., 16, 407.
56. Michalsky, E., Goede, A., Preissner, R. (2003) Loops in proteins (LIP) – a comprehensive loop database for homology modelling. Protein Eng., 16, 979.
57. Martin, A.C., Cheetham, J.C., Rees, A.L. (1989) Modeling antibody hypervariable loops: a combined algorithm. Proc. Natl Acad. Sci. USA, 86, 9268–9272.
58. Brooks, C.L., III, Bruccoleri, R.E., Olafson, B.D., States, D.J., Swaminathan, S., Karplus, M. (1983) CHARMM: a program for macromolecular energy minimization and dynamics calculations. J. Comput. Chem., 4, 187.
59. Deane, C.M. and Blundell, T.L. (2000) A novel exhaustive search algorithm for predicting the conformation of polypeptide segments in proteins. Proteins, 40, 135–144.
60. Kolaskar, A.S. and Kulkarni-Kale, U. (1992) Sequence alignment approach to pick up conformationally similar protein fragments. J. Mol. Biol., 223, 1053–1061.
61. Shortle, D. (2002) Composites of local structure propensities: evidence for local encoding of long-range structure. Protein Sci., 11, 18–26.
62. Kabsch, W. and Sander, C. (1983) Dictionary of protein secondary structure: pattern recognition of hydrogen-bonded and geometrical features. Biopolymers, 22, 2577–2637.
63. Colloc'h, N., Etchebest, C., Thoreau, E., Henrissat, B., Mornon, J.P. (1993) Comparison of three algorithms for the assignment of secondary structure in proteins: the advantages of a consensus assignment. Protein Eng., 6, 377–382.
64. Carter, P., Andersen, C.A., Rost, B. (2003) DSSPcont: continuous secondary structure assignments for proteins. Nucleic Acids Res., 31, 3293–3295.
65. Tsai, J., Taylor, R., Chothia, C., Gerstein, M. (1999) The packing density in proteins: standard radii and volumes. J. Mol. Biol., 290, 253.
66. Luthy, R., McLachlan, A.D., Eisenberg, D. (1991) Secondary structure-based profiles: use of structure-conserving scoring tables in searching protein sequence databases for structural similarities. Proteins, 10, 229–239.
67. Henikoff, S. and Henikoff, J.G. (1992) Amino acid substitution matrices from protein blocks. Proc. Natl Acad. Sci. USA, 89, 10915–10919.
68. Topham, C.M., McLeod, A., Eisenmenger, F., Overington, J.P., Johnson, M.S., Blundell, T.L. (1993) Fragment ranking in modelling of protein structure. Conformationally constrained environmental amino acid substitution tables. J. Mol. Biol., 229, 194.
69. Azarya-Sprinzak, E., Naor, D., Wolfson, H.J., Nussinov, R. (1997) Interchanges of spatially neighbouring residues in structurally conserved environments. Protein Eng., 10, 1109–1122.
70. Rice, D.W. and Eisenberg, D. (1997) A 3D–1D substitution matrix for protein fold recognition that includes predicted secondary structure of the sequence. J. Mol. Biol., 267, 1026.
71. Shi, J., Blundell, T.L., Mizuguchi, K. (2001) FUGUE: sequence-structure homology recognition using environment-specific substitution tables and structure-dependent gap penalties. J. Mol. Biol., 310, 243.
72. Blake, J.D. and Cohen, F.E. (2001) Pairwise sequence alignment below the twilight zone. J. Mol. Biol., 307, 721.
73. Gunasekaran, K., Ramakrishnan, C., Balaram, P. (1997) Beta-hairpins in proteins revisited: lessons for de novo design. Protein Eng., 10, 1131–1141.
74. Marti-Renom, M.A., Mas, J.M., Aloy, P., Querol, E., Aviles, F.X., Oliva, B. (1998) Statistical analysis of the loop-geometry on a non-redundant database of protein. J. Mol. Model., 4, 347–354.
75. Heuser, P., Wohlfahrt, G., Schomburg, D. (2004) Efficient methods for filtering and ranking fragments for the prediction of structurally variable regions in proteins. Proteins, 54, 583–595.
76. Fiser, A. and Sali, A. (2003) ModLoop: automated modeling of loops in protein structures. Bioinformatics, 19, 2500.
77. Holm, L. and Sander, C. (1993) Protein structure comparison by alignment of distance matrices. J. Mol. Biol., 233, 123.
