ClusPro PeptiDock: efficient global docking of peptide recognition motifs using FFT

Abstract Summary We present an approach for the efficient docking of peptide motifs to their free receptor structures. Using a motif based search, we can retrieve structural fragments from the Protein Data Bank (PDB) that are very similar to the peptide’s final, bound conformation. We use a Fast Fourier Transform (FFT) based docking method to quickly perform global rigid body docking of these fragments to the receptor. According to CAPRI peptide docking criteria, an acceptable conformation can often be found among the top-ranking predictions. Availability and Implementation The method is available as part of the protein-protein docking server ClusPro at https://peptidock.cluspro.org/nousername.php. Supplementary information Supplementary data are available at Bioinformatics online.

. Modeled protein-peptide complexes. Blue is the crystal, native pose, yellow is the acceptable accuracy model. Pink shows the closest non-acceptable accuracy model produced by the approach. Green depicts the acceptable accuracy model for a case (PDB ID 2VJ0) in which only the electrostatics coefficient set gave a strong result. Figure S3. Crystall symmetry interface of the alpha-adaptin WVTFE interaction. Two crystal symmetry mates of the bound protein peptide complex (PDB 2VJ0) are shown in blue and green, and the peptide chain is colored in yellow. The view is focused on tight packing of Val 6 of the peptide with methionine 914 of the crystal symmetry partner, which further suggests that the bound peptide pose is most probably strongly affected by the crystal conditions, and therefore not expected to be reproduced by predictions that do not take crystal symmetry into account.

SUPPLEMENTARY METHODS
Supplementary Overview of the algorithm. Figure S1 demonstrates the steps of our protocol on an example application, the interaction between cyclin (structure of free cyclin, PDB ID 1H1R (Davies, et al., 2002)) and a peptide derived from CDC6 (HTLKGRRLVFDN) (Cheng, et al., 2006) (structure of the complex, PDB ID 2CCH (Cheng, et al., 2006)). The RxL peptide motif (Arginine, followed by any amino acid, and Leucine) was defined based on a literature search (Cheng, et al., 2006). We use a set of rules (defined below in the 'Buildup of motif' section) to extend and refine this initial motif, until a search in the PDB results in a comprehensive set of fragment hits (note that for protocol validation homologs of complex structures are, of course, excluded). In this example (see Figure S1A), the peptide sequence covering the initial motif (i.e. RRL) was first extended in the N-terminal direction towards a polar residue, to yield a pentamer sequence KGRRL (applying rule Sp). Since this motif is found only nine times in the PDB, we made it more general by introducing a wild-card at the smallest residue, G, to obtain KXRRL (rule E). This motif is found frequently enough to proceed to the next step (in 472 PDB structures, homologs of the solved CDC6-cyclin structure (PDB ID 2CCH) excluded). Once the docking motif is defined, we extract the matching fragments from the PDB (1051 fragments) and cluster them with a 0.5Å RMSD cluster threshold (resulting in 40 clusters for this example, see Figure S1B and Table S6A). Representatives from the top 25 largest clusters are then each docked to the receptor structure ( Figure S1C). All solutions are pooled and clustered with a 3.5Å RMSD threshold, and representatives of each cluster are further minimized to produce the final 100 models ( Figure S1D and Table S6B). In the cyclin-CDC6 peptide example, the third ranked model lies within 1.9Å RMSD ( Figure 1E). This can be seen more clearly in Figure S1.

Supplementary details on the components of the protocol.
Buildup of motif: Successful definition of a good motif for peptide fragment selection is the critical step of our protocol: A general, non-biased protocol should define a motif that is both loose enough to provide good coverage, and informative enough to enrich for relevant conformations. We start from the peptide sequence of interest and a known motif, and apply the following rules ( Figure S1A; see Table  S5 for application of these rules in the predictions reported here): (1) Start (S): Start with a peptide sequence of minimal length of 5 residues (to allow for a motif of 4 and more residues and one or more wild cards if necessary). This sequence should cover the initial motif, and if needed be extended by including additional positions in the peptide sequence. The preferential direction of extension is defined based on the type of residues, according to the following priority: (Sp) Polar, (Sa) Aromatic, and (So) other residues. Small amino acids (GSTA) are not considered for extension, except as a bridge to the next extended residue (e.g. extension of PXQ motif to PQQATD for the peptide PQQATDD, leading to a 6 residue long starting motif), or if this is the only possible option to extend the motif to the minimal length. This initial sequence will usually result in very few fragment hits in the PDB, and we therefore expand the motif in the following step(s). (2) Expand (E): Insert wildcards back (X, or redundant positions of the motif), starting with the smallest residues. Refrain from introducing adjacent wild cards if possible, and do not introduce X at the termini of the peptide. (3) Large (L): If more than 1000 hits to PDB structures are found, introduce specific residues back into the motif, starting with the largest residues. If this does not help, try to extend, if possible. (4) Stop when there are between 100 and 1000 hits to PDB structures (or more if further extension of motif is not possible). (5) Complement F/Y (F): F & Y show very similar conformational preferences in the backbone dependent rotamer libraries (Ting, et al., 2010). Once the motif has been designed and the set of fragments has been extracted, the amino acid sequence is changed back to the actual peptide sequence (using a backbone-dependent rotamer library (Dunbrack and Karplus, 1993)).

Docking: Global sampling by fast Fourier transform (FFT) correlation
Docking was performed as detailed before (Kozakov, et al., 2013). In short, to fully explore the conformational space of rigid body orientations between a given peptide conformer and the receptor, we perform exhaustive evaluation of an energy function in the discretized space of mutual orientations of the protein and peptide using fast Fourier transform (FFT) correlation approach (Kozakov, et al., 2006). The center of mass of the receptor protein is fixed at the origin of the coordinate system, whereas the peptide conformer, defined as the ligand, is rotated and translated. The translational space is represented as a grid of 1.0 Å displacements of the ligand center of mass, and the rotational space is sampled using 70,000 rotations based on a deterministic layered Sukharev grid sequence, which quasi-uniformly covers the space. The energy expression used for the FFT based sampling includes a simplified van der Waals energy E vdw with attractive (E attr ) and repulsive (E rep ) contributions, the electrostatic interaction energy E elec , and a statistical pairwise potential E pair , representing other solvation effects (Chuang, et al., 2008): where r is the distance between atoms i and j, D is an atom-type independent approximation of the generalized Born radius, and ε ij is a pairwise interaction potential between atoms i and j. Two options of weights sets are used: The original set (w 1 =1.3, w 2 =160 and w 3 =2.6) and a set of weights that was recently shown to improve performance for polar-dominated interactions (w 1 =4, w 2 =600, w 3 =0: the pairwise potential is omitted, and consequently the relative electrostatic contribution is increased (Kozakov, et al., 2013)). For consistency, we use here the original set of weights, and report on improvement for polar interactions in the text and Table S3. All energy expressions are defined on the grid. In order to evaluate the energy function E by FFT, it must be written as a sum of correlation functions. The first two terms, E vdw and E elec , satisfy this condition, whereas E pair is written as a sum of a few correlation functions, using an eigenvalue-eigenvector decomposition (Kozakov, et al., 2006). For each rotation, this expression can be efficiently calculated using P forward and one inverse Fast Fourier transforms. The calculations are performed for each of the 70,000 rotations, and one lowest-energy translation for each rotation is retained.

Clustering algorithm
For k structures, we calculate the k × k matrix of pairwise backbone RMSDs. We count the number of neighbors each structure has within a defined cluster radius. The members of the largest cluster are removed from the pool of structures, and the procedure is repeated until no structures remain, resulting in clusters ranked according to their size (Kozakov, et al., 2005). The cluster radius cutoff is set to 0.5Å for the clustering of backbone fragments (a 2.0Å cutoff was found to result in a too small number of clusters), and to 3.5Å for the clustering of docking solutions. The latter represents the assumed radius of attraction for peptide-protein docking.

Scoring according to cluster size
In biophysical terms, clusters represent isolated, highly populated low energy basins of the energy landscape, and the large clusters are thus more likely to include native structures 8 . The globally sampled conformational space can be considered as a canonical ensemble with the partition function Z = Σ j exp(-E j /RT), where we sum the associated energy E j over all poses j. For the k th cluster, the partition function is given by Z k = Σ j exp(E j /RT), where the sum is restricted to poses within the cluster. Based on these values, the probability of the k th cluster is given by P k = Z k /Z. Since the low energy structures are selected from a relatively narrow energy range, and the energy values are calculated with considerable error, it is reasonable to assume that these energies do not differ, i.e., E j =E for all j in the low energy clusters. This simplification implies that P k =exp(-E/RT)×N k /Z, and thus the probability P k is proportional to N k , where N k is the number of structures in the k th cluster.

Minimization of final structures
For minimization we use the polar hydrogen PARAM19 like forcefield with CHARMM (Brooks, et al., 2009). The protocol consists of 500 steps of unconstrained Adapted Basic Newton-Raphson (ABNR) minimization, where both protein and peptide are free to move, followed by the restoration of crystal protein coordinates, and 1000 steps of ABNR minimization of the peptide with the fixed protein.  a Predictions within 4.0Å RMSD are highlighted in bold. Note that the same input was provided, and that we use the unbound receptor structure (even though the bound pdb id is listed here; see Table S1)  Table S4. Set of peptide-protein complexes used in this study. We model a diverse set of 16 domain-motif interactions from the PeptiDB v2 set. The docking protocol was validated on a set of 5 motif-domain complexes recently published in the PDB. For each complex, a bound and free receptor structure is available in the PDB, and an interaction motif has been reported.  (Poy, et al., 1999) lsb3 sla1 (SH3) 1SSHA 1OOTA GPPPAMPARPT PXXPX[R/K] (Hou, et al., 2012) erbB2 (PDZ) 1MFGA 2H3LA EYLGLDVPV VXV' (Jaulin-Bastard, et al., 2001) wdr5 (WD40) 2H9MA 2H14A ARTKQT gδRg (Schuetz, et al., 2006) usp7 (MATH) 2FOJA 2F1WA GARAHSS [PA]XXS (Sheng, et al., 2006) p97 N-glycanase (PUB) 2HPLA 2HPJA DDLYG ϕYX' (Smith, et al., 2007) traf2 (TRAF) 1CZYA 1CA4A ace-PQQATDD PxQ (Devergne, et al., 1996) i-ap1 (BIR) 1JD5A 1JD4B AIAYFIPD