Motivation: Structural knowledge, extracted from the Protein Data Bank (PDB), underlies numerous potential functions and prediction methods. The PDB, however, is highly biased: many proteins have more than one entry, while entire protein families are represented by a single structure, or even not at all. The standard solution to this problem is to limit the studies to non-redundant subsets of the PDB. While alleviating biases, this solution hides the many-to-many relations between sequences and structures. That is, non-redundant datasets conceal the diversity of sequences that share the same fold and the existence of multiple conformations for the same protein. A particularly disturbing aspect of non-redundant subsets is that they hardly benefit from the rapid pace of protein structure determination, as most newly solved structures fall within existing families.
Results: In this study we explore the concept of redundancy-weighted datasets, originally suggested by Miyazawa and Jernigan. Redundancy-weighted datasets include all available structures and associate them (or features thereof) with weights that are inversely proportional to the number of their homologs. Here, we provide the first systematic comparison of redundancy-weighted datasets with non-redundant ones. We test three weighting schemes and show that the distributions of structural features that they produce are smoother (having higher entropy) compared with the distributions inferred from non-redundant datasets. We further show that these smoothed distributions are both more robust and more correct than their non-redundant counterparts.
We suggest that the better distributions, inferred using redundancy-weighting, may improve the accuracy of knowledge-based potentials and increase the power of protein structure prediction methods. Consequently, they may enhance model-driven molecular biology.
Mining the riches of experimentally determined data in the Protein Data Bank (PDB) (Berman et al., 2013a, b; Bernstein et al., 1977) has been a major source of structural knowledge over the past four decades. In a reverse engineering-like fashion it allows the derivation of rules and methods for the prediction of secondary structure (Chou and Fasman, 1974; Garnier et al., 1978; McGuffin et al., 2000; Rost, 1996), solvent accessibility (Karplus, 2009; Rost, 1996) and trans-membrane regions (McGuffin et al., 2000; Rost, 1996), as well as knowledge-based potentials (Eisenberg et al., 1997; Lüthy et al., 1992; Samudrala and Moult, 1998; Sippl, 1993; Summa and Levitt, 2007; Tanaka and Scheraga, 1976). These methods and potentials have had a considerable impact on our understanding of the protein universe and accelerated progress in biology, chemistry and medicine. This study aims to enhance these data mining efforts by attacking their major obstacle, data redundancy.
The underlying premise behind data mining of protein structures is that recurring patterns may result from physical aspects of protein folding and stability (e.g. the hydrophobic effect) (Miyazawa and Jernigan, 1985). However, the data, which are available for mining, are not uniform samples of sequence and structure spaces. Certain folds (e.g. TIM barrels and G-protein-coupled receptors) are far more abundant than others in genomes [see Goldstein (2008) for some suggestions of why it is so]. The PDB content is further skewed by research interests of the contributing experimentalists and by methodological constraints. This bias, often referred to as the ‘PDB redundancy’, may amplify or diminish the signal of recurring patterns. For example, the stabilizing effect of the beta-alpha-beta supersecondary structure may be overestimated because of the abundance of folds that include it (e.g. TIM barrels).
The common solution to data redundancy is to use a non-redundant subset of the PDB, composed of family representatives. Kabsch and Sander (1983) pioneered this solution using a 62-membered subset of the 75 high-quality entries of the PDB, leaving out homologs with sequence identity of 50% or higher (a rather promiscuous threshold by current standards). This approach has been adopted by numerous studies, and became the field’s norm, with publicly-available and standardized tools for data culling (Hobohm and Sander, 1994; Wang and Dunbrack, 2003, 2005). Yet, notwithstanding the evident utility of non-redundant PDB subsets, they have inherent limitations in scalability and descriptive power. First and foremost, they do not benefit from the rapid growth of the PDB because almost all new entries are mapped to already known folds [Fig. 1; and see also Levitt (2009)]. More importantly, non-redundant datasets conceal much of the complexity of protein universe. Specifically, they hide the compatibility of diverse sequences with the same fold and the flexibility of protein structures (Kosloff and Kolodny, 2008). Our working hypothesis is that this oversimplified perspective manifests itself in artificially ‘bumpy’ distributions of the measurable features.
An important alternative was presented more than a decade ago by Miyazawa and Jernigan (1996, 1999), who weighted structures in their knowledge-based potentials. In those studies, they considered all the PDB structures (of sufficient quality and length), yet assigned non-uniform weights to protein chains, so that the weights of chains with many homologs are scaled down. Thus, all the data are considered, yet biases are alleviated. Somewhat surprisingly, we are unaware of any recent studies that use this approach. Further, to the best of our knowledge, no study has yet compared its performance with the standard, representative subset, approach.
Interestingly, the redundancy, which is removed from structural datasets, is valuable when investigating families of homologous sequences. There, evolutionary conserved patterns characterize a family and deviations from these patterns shed light on the uniqueness of specific members. The distribution of sequence similarities across a family is typically uneven; that is, there may be subsets of sequences that are more closely related. This might bias the analysis toward patterns that appear in such highly similar subsets. In this context, however, a non-redundant subset (i.e. one without recognizable similarities) would consist of a single sequence, and be devoid of any information. Instead, scholars have successfully used various weighting schemes that downscale contributions from large subsets of sequences [see Altschul et al. (1997), and references therein], most notably for multiple sequence alignment (Thompson et al., 1994) and sequence search (Altschul et al., 1997). Structural data mining is a different computational task: most importantly, it does not focus on a single protein family but rather must deal with multiple families and account for their varying sizes. Nonetheless, the success of including more, and even all, available proteins in the context of search and alignment tasks, suggests that similar approaches may be valuable in structural data mining as well.
Our study revisits the redundancy-weighting approach and provides the first systematic assessment of its correctness and robustness. To this end, we compare interatomic distance distributions, a central component in knowledge-based potentials, sampled from either non-redundant or redundancy-weighted datasets; for the sake of completeness, we also consider distributions that were sampled from an unweighted, redundant dataset. We estimate the complexity of these distributions by their entropy and observe that the redundancy-weighted datasets have higher entropy than their non-redundant counterparts. We further demonstrate that the higher entropy distributions are more correct and robust. Our observations suggest that structure prediction methods could benefit considerably from training on redundancy-weighted datasets rather than on non-redundant ones. This in turn, can improve our understanding of the forces that shape the protein structure universe.
2.1 Structure dataset
Our dataset includes 7307 structures of polypeptide chains (in 6956 PDB files), at least 40 residues long, solved at 1.5 Å resolution or better. To allow a straightforward definition of local structural features, we broke each chain into (non-overlapping) segments of peptide-bonded amino acids, where two consecutive residues are considered ‘bonded’ if the distance between their Cα atoms is within 10% of the canonical peptide bond distance (2.95 or 3.8 Å for cis and trans peptide bond, respectively) and none is listed as ‘missing’ (in PDB Remark 465, missing residues, or Remark 470, missing atoms). The resulting set consists of 8876 segments longer than 20 residues.
These data are split into training and test subsets. The training subset consists of 6299 structures (7635 segments) released by the end of 2011, whereas the remaining 1008 chain structures (1241 segments), released from January 2012 to April 2013, constitute the test set. We compare the feature distributions obtained from the larger training set with those inferred using the smaller test set.
This study focuses on Cα distance distributions of the 441 possible type pairs (all combinations over 20 amino acids + the wild-card amino acid X) of residues separated by predefined distances along the sequence (e.g. 20 residues apart). We compare distributions that were inferred using various weighting schemes and over different sets of protein structures. In general, the reliability of distributions improves as the number of instances from which they were derived, grows. Figure 4D depicts the number of residue pairs (20 residues apart) in the training set. Note the two orders of magnitude difference between the rarest pair (two Tryptophan residues, 304 instances) and the most common pair (two Leucine residues, 10 163).
2.2 Weighting schemes
For each chain, we maintain a list of homologs and their aligned regions, as found by the FASTA (Pearson and Lipman, 1988) sequence alignment tool (with e-value ≤ 10−4). The sets of chains and homology assignments constitute the vertices and edges, respectively, of a neighbors graph (Fig. 3). We consider connected components of this graph as homology subsets; the size distributions of these subsets in the training and test sets are shown in Figure 2. We follow a common practice [see, for example, Bull et al. (2013); Li and Godzik (2006)] and generate a non-redundant dataset by picking a random representative from each homology subset. In contrast, redundant and redundancy-weighting datasets use all available structures.
In this study, we consider five schemes (Fig. 3): non-redundant, redundant and three ways for redundancy-weighting. The standard, non-redundant scheme effectively assigns a weight of 1 to each feature instance (here, a distance between two Cα atoms separated by a certain number of residues) from a representative subset of structures. The redundant scheme, too, effectively associates a weight of 1 to each feature, but includes all structures in the set.
The redundancy-weighting schemes use all structures available in the dataset, while balancing the contributions of protein families with many PDB entries and ones that have fewer. To this end, these schemes define the weight of a feature instance to be inversely proportional to the number of its homologs.
We implement three redundancy-weighting schemes. The first, per-sequence weighting scheme (RWs) associates each structure s with a set of close neighbors, ones with sequence similarity ≥40% and coverage ≥50%. The structure’s weight, which is assigned to all its features, is 1/(N(s) + 1), where N(s) is the number of these neighbors. The second, per-feature weighting scheme (RWf) assigns the weights individually to each feature instance. The weight of a structural feature, in this scheme, is inversely proportional to the number of homologous chains that align with all of its positions, without gaps. The last scheme, MJ, follows Miyazawa and Jernigan (1996, 1999) sample weighting. Briefly, sequence identity between each pair of structures is computed as the fraction of identical residues in their alignment (Needleman and Wunsch, 1970); eigenvalue decomposition of the sequence identity matrix defines per-sequence weights, which are approximately equal to the inverse of the number of similar sequences. This weight is assigned to all the features of the structure.
2.3 Performance assessment
Similarly, the correctness of a weighting scheme, for a given feature distribution, is the complement of the divergence between the scheme’s test distribution and the standard, non-redundant feature distribution, inferred over the training set:
Distributions of interatomic distances lie at the heart of knowledge-based potentials and as such are among the most studied features in PDB data mining. Here, we compare data-mining of such distances from three types of datasets: redundant, non-redundant and redundancy-weighted. The redundant training and test datasets include all proteins that were solved before and after the end of 2011, respectively, at 1.5 Å resolution or better. The training and test non-redundant datasets are subsets of the redundant ones, containing a single representative from each set of homologous structures. Finally, the three pairs (training and test) of redundancy-weighted datasets include the same structures as their redundant counterparts but the contributions of these structures, or features thereof, are weighted. The three redundancy-weighting schemes (see Section 2 and Fig. 3) are: (i) a scheme that assigns a single weight to each polypeptide chain—and all the feature instances computed from its structure—based on the number of its homologs (RWs); (ii) a scheme that computes a per-feature weight based only on homologs that cover all the positions in a feature (RWf); and (iii) the algebraic sample weighting scheme of Miyazawa and Jernigan (1996; 1999).
We focus on distributions of Cα–Cα distances, as a proof of concept. Figure 4, upper panel, shows training and test distributions of Cα distances between three types of amino acid pairs located 20 positions apart along the sequence: Tryptophan (W) and Histidine (H, left), Tryptophan and Tyrosine (Y, center) and Tryptophan and Leucine (L, right). We compute the distributions using the standard, non-redundant scheme (NR) or a redundancy-weighting scheme, RWf. The insets in Figure 4A–C depict the similarity of these distributions. We assess the performance of a weighting scheme by two quantities: correctness, which measures how similar is ’s test distribution to the non-redundant distribution over the training set (inset’s top row); and robustness, which measures how similar are ’s training and test distributions (inset’s diagonal; see Section 2 for formal definitions).
Figures 4E and F shows the correctness of Cα–Cα distance distributions over all pairs of amino acids when using a non-redundant dataset and a redundancy-weighted one. Predictably, correctness for both schemes is lower for amino acid pairs with a relatively small number (i.e. hundreds) of samples (top left corner of both heat-maps) compared with pairs with many more (i.e. tens and hundreds of thousands) samples (see Fig. 4D for sample counts of each amino acid pair); the Spearman’s rank correlation between the number of samples and the correctness of NR and RWf is 0.96, P < 10−100. Evidently, the distance distributions derived using the redundancy-weighted scheme are, for the most part, more correct than those inferred using the non-redundant scheme (Fig. 4G); Of 441 distributions, one for each amino acid pair, 422 (95.7%) redundancy-weighted distributions are more correct. Moreover, the improvement in correctness, or the correctness gain, is greater for amino acid pairs with lower number of samples (Spearman’s rank correlation = −0.7, P < 10−65) for which, as we indicated above, the correctness is poorer.
A more complete picture is presented in Figure 5, which compares the standard non-redundant scheme with the four non-standard ones, over three performance criteria: correctness, entropy and robustness. As expected, the non-redundant Cαi –Cαi+20 distributions are significantly more correct (one-tailed, paired-sample Wilcoxon signed rank test P < 10−6) and associated with greater entropy (P < 10−16) than the corresponding redundant distributions. This agrees with the common practice of preferring non-redundant datasets to redundant ones. Yet, importantly, all three redundancy-weighting schemes perform significantly better than the non-redundant scheme in terms of correctness and robustness, and obtain higher entropy scores; for example: the median improvement in correctness using RWs compared with NR is 0.89%, P < 10−68.
Finally, Figure 6 compares the performance of all weighting schemes over Cα distance distributions of residues 5, 10, 20, 50 and 100 amino acids apart. In all these cases, and consistent with the results for the Cαi –Cαi+20 distributions (Fig. 5), all three redundancy-weighting schemes perform better in terms of their correctness, entropy and robustness, than the standard non-redundant and redundant alternatives. Among the redundancy-weighting schemes, RWf is the most correct in 4 out of 5 amino acid distances and MJ significantly more correct than RWs for Cα atoms 50 and 100 amino acids apart; RWf’s entropy values are the greatest for the longer amino acid ranges and MJ’s for the shorter ones; and, consistently, MJ is the most robust, followed by RWf. Comparing the non-redundant and redundant distributions reveals that the latter is always more robust, whereas the former is more correct in all but the longest range (where redundant is more correct, P = 0.0002); and obtains greater entropy values in all but the shortest range. These results show that the redundancy-weighting performs better than the widely-used non-redundant datasets, in all these measures.
The dominant approach for data mining of the PDB is to extract the knowledge from a relatively small subset of non-homologous structures and consider their homologs (all other structures) as redundant and, thus, non-informative. Although this approach circumvents much of the inherent biases in the PDB, it also artificially reduces the variability of the structural landscape. An alternative approach, which we term here redundancy-weighting, considers all available structures but assigns them (or features thereof) lower weights proportionally to the number of homologs they have in the dataset.
Redundancy-weighting, originally proposed by Miyazawa and Jernigan almost two decades ago (Miyazawa and Jernigan, 1996), has been rarely used since and, to the best of our knowledge, never been systematically benchmarked against the standard approach. Here, we compare the correctness, complexity and robustness of feature distributions, which were inferred using non-redundant and redundant datasets, and three redundancy-weighting schemes: the algebraic sample weighting of Miyazawa and Jernigan (MJ) (Miyazawa and Jernigan, 1996), as well as our novel RWs and RWf weightings. To this end, we quantify and compare the correctness, complexity and robustness of distance distributions, which were derived from training and test datasets according to these schemes.
Our most significant contribution is demonstrating that the three redundancy-weighting schemes outperform the ‘standard’ non-redundant approach in all tested metrics. The MJ scheme, which is the most robust, is probably not scalable enough for practical use. This scheme requires an eigenvalue decomposition of an all-against-all similarity matrix. As the PDB is quickly approaching the 100 000 structures milestone, such decomposition becomes challenging both in terms of the computational requirements (space and time) and the numerical stability (Heath, 2002). RWf, which is the most correct scheme, has a milder limitation: it requires recomputing of weights for each structural feature. The RWs scheme is less accurate and robust than the other two, but does not suffer from these limitations. Thus, it may serve as a simple, off-the-shelf, general weighting scheme, which performs better than non-redundant datasets.
Here, we focused on a relatively small set (only a few thousands) of the PDB’s most accurately solved structures (with resolution better than 1.5 Å). Focusing on this set had several advantages: most importantly, the computational cost of exploring alternative weighting schemes is tractable and, in particular, we could easily calculate the eigenvalue decomposition needed for the MJ weighting scheme. Notice that homology is not too common within this set, which includes >50% singletons (see Fig. 2). Thus, one could expect that refining the weighting scheme will not have any impact on the accuracy and robustness when inferring structural features from the set. However, it does. Thus, we believe that such weighting schemes can similarly benefit others, even more so when considering larger subsets of the PDB, culled using more lax experimental quality thresholds.
While most residue–residue contact potentials [e.g. Miyazawa and Jernigan (1996)] consider ‘un-directional’ residue pairs (thus combining pairs of two same residues with different orders to a single unique pair), we chose to focus on directional pairs to increase the dynamic range of the explored features (Fig. 4D). Notably, the trends shown in Figure 6 are preserved when un-directional pairs are considered (data not shown).
Importantly, our study shows that the gain in accuracy and robustness is proportional to the rarity of the studied feature (Figs. 4 and 5). Current applications of PDB data mining (e.g. derivation of pairwise potentials and secondary structure prediction) are dominated by highly prevalent features (e.g. contacts between specific residues and the three major secondary structure elements). However, in more subtle applications [e.g. multi-body potentials (Gniewek et al., 2011) and fragment prediction (Gront et al., 2011; Kalev and Habeck, 2011; Shen et al., 2013)] the number of relevant features grows while their prevalence in the non-redundant dataset drops. We speculate that shifting to the redundancy-weighting paradigm may be essential for the advancement of computational structural biology beyond pairwise potentials and prediction of simple features.
Although two of the redundancy-weighting schemes presented here are already useful, this study is mostly a proof-of-concept. A promising route for improvement is to take advantage of rapid structural search methods (Budowski-Tal et al., 2010) and replace the currently used sequence alignments by more reliable and sensitive structural alignments. Another possible direction is to focus on domains rather than peptide chains, which often harbor several, repeating domains (e.g. Zn fingers) and are, thus redundant by nature. Finally, the weighting scheme may apply some evolutionary model to gain better estimate of the interdependencies of homologous proteins (Thompson et al., 1994).
In a wider context, redundant datasets are abundant in other domains of knowledge, e.g. genomics, natural language processing and computer vision. While the current work focuses on protein structure datasets, we hope insights obtained in this domain can have impact on data mining in other, unrelated ones.
C.Y. and C.K. are grateful to Eitan Bachmat for generous support. R.K. and C.Y. would like to thank Yaron Rothman and Udi Feldman for introducing them to one another.
Funding: C.K. and M.L. are supported by BSF (grant number 2009432).
Conflicts of Interest: none declared.