Motivation: Structural knowledge, extracted from the Protein Data Bank (PDB), underlies numerous potential functions and prediction methods. The PDB, however, is highly biased: many proteins have more than one entry, while entire protein families are represented by a single structure, or even not at all. The standard solution to this problem is to limit the studies to non-redundant subsets of the PDB. While alleviating biases, this solution hides the many-to-many relations between sequences and structures. That is, non-redundant datasets conceal the diversity of sequences that share the same fold and the existence of multiple conformations for the same protein. A particularly disturbing aspect of non-redundant subsets is that they hardly benefit from the rapid pace of protein structure determination, as most newly solved structures fall within existing families.

Results: In this study we explore the concept of redundancy-weighted datasets, originally suggested by Miyazawa and Jernigan. Redundancy-weighted datasets include all available structures and associate them (or features thereof) with weights that are inversely proportional to the number of their homologs. Here, we provide the first systematic comparison of redundancy-weighted datasets with non-redundant ones. We test three weighting schemes and show that the distributions of structural features that they produce are smoother (having higher entropy) compared with the distributions inferred from non-redundant datasets. We further show that these smoothed distributions are both more robust and more correct than their non-redundant counterparts.

We suggest that the better distributions, inferred using redundancy-weighting, may improve the accuracy of knowledge-based potentials and increase the power of protein structure prediction methods. Consequently, they may enhance model-driven molecular biology.

Contact:cheny@il.ibm.com or chen.keasar@gmail.com

## 1 INTRODUCTION

Mining the riches of experimentally determined data in the Protein Data Bank (PDB) (Berman et al., 2013a, b; Bernstein et al., 1977) has been a major source of structural knowledge over the past four decades. In a reverse engineering-like fashion it allows the derivation of rules and methods for the prediction of secondary structure (Chou and Fasman, 1974; Garnier et al., 1978; McGuffin et al., 2000; Rost, 1996), solvent accessibility (Karplus, 2009; Rost, 1996) and trans-membrane regions (McGuffin et al., 2000; Rost, 1996), as well as knowledge-based potentials (Eisenberg et al., 1997; Lüthy et al., 1992; Samudrala and Moult, 1998; Sippl, 1993; Summa and Levitt, 2007; Tanaka and Scheraga, 1976). These methods and potentials have had a considerable impact on our understanding of the protein universe and accelerated progress in biology, chemistry and medicine. This study aims to enhance these data mining efforts by attacking their major obstacle, data redundancy.

The underlying premise behind data mining of protein structures is that recurring patterns may result from physical aspects of protein folding and stability (e.g. the hydrophobic effect) (Miyazawa and Jernigan, 1985). However, the data, which are available for mining, are not uniform samples of sequence and structure spaces. Certain folds (e.g. TIM barrels and G-protein-coupled receptors) are far more abundant than others in genomes [see Goldstein (2008) for some suggestions of why it is so]. The PDB content is further skewed by research interests of the contributing experimentalists and by methodological constraints. This bias, often referred to as the ‘PDB redundancy’, may amplify or diminish the signal of recurring patterns. For example, the stabilizing effect of the beta-alpha-beta supersecondary structure may be overestimated because of the abundance of folds that include it (e.g. TIM barrels).

The common solution to data redundancy is to use a non-redundant subset of the PDB, composed of family representatives. Kabsch and Sander (1983) pioneered this solution using a 62-membered subset of the 75 high-quality entries of the PDB, leaving out homologs with sequence identity of 50% or higher (a rather promiscuous threshold by current standards). This approach has been adopted by numerous studies, and became the field’s norm, with publicly-available and standardized tools for data culling (Hobohm and Sander, 1994; Wang and Dunbrack, 2003, 2005). Yet, notwithstanding the evident utility of non-redundant PDB subsets, they have inherent limitations in scalability and descriptive power. First and foremost, they do not benefit from the rapid growth of the PDB because almost all new entries are mapped to already known folds [Fig. 1; and see also Levitt (2009)]. More importantly, non-redundant datasets conceal much of the complexity of protein universe. Specifically, they hide the compatibility of diverse sequences with the same fold and the flexibility of protein structures (Kosloff and Kolodny, 2008). Our working hypothesis is that this oversimplified perspective manifests itself in artificially ‘bumpy’ distributions of the measurable features.

Fig. 1.

The rapid pace of protein structure determination hardly affects the size of non-redundant databases. The PDB growth rate (gray) has accelerated in recent years but the rate of new Structural Classification of Proteins (SCOP) superfamily reports (black) dwindled in the last three releases of the database (versions 1.71—1.75)

Fig. 1.

The rapid pace of protein structure determination hardly affects the size of non-redundant databases. The PDB growth rate (gray) has accelerated in recent years but the rate of new Structural Classification of Proteins (SCOP) superfamily reports (black) dwindled in the last three releases of the database (versions 1.71—1.75)

An important alternative was presented more than a decade ago by Miyazawa and Jernigan (1996, 1999), who weighted structures in their knowledge-based potentials. In those studies, they considered all the PDB structures (of sufficient quality and length), yet assigned non-uniform weights to protein chains, so that the weights of chains with many homologs are scaled down. Thus, all the data are considered, yet biases are alleviated. Somewhat surprisingly, we are unaware of any recent studies that use this approach. Further, to the best of our knowledge, no study has yet compared its performance with the standard, representative subset, approach.

Interestingly, the redundancy, which is removed from structural datasets, is valuable when investigating families of homologous sequences. There, evolutionary conserved patterns characterize a family and deviations from these patterns shed light on the uniqueness of specific members. The distribution of sequence similarities across a family is typically uneven; that is, there may be subsets of sequences that are more closely related. This might bias the analysis toward patterns that appear in such highly similar subsets. In this context, however, a non-redundant subset (i.e. one without recognizable similarities) would consist of a single sequence, and be devoid of any information. Instead, scholars have successfully used various weighting schemes that downscale contributions from large subsets of sequences [see Altschul et al. (1997), and references therein], most notably for multiple sequence alignment (Thompson et al., 1994) and sequence search (Altschul et al., 1997). Structural data mining is a different computational task: most importantly, it does not focus on a single protein family but rather must deal with multiple families and account for their varying sizes. Nonetheless, the success of including more, and even all, available proteins in the context of search and alignment tasks, suggests that similar approaches may be valuable in structural data mining as well.

Our study revisits the redundancy-weighting approach and provides the first systematic assessment of its correctness and robustness. To this end, we compare interatomic distance distributions, a central component in knowledge-based potentials, sampled from either non-redundant or redundancy-weighted datasets; for the sake of completeness, we also consider distributions that were sampled from an unweighted, redundant dataset. We estimate the complexity of these distributions by their entropy and observe that the redundancy-weighted datasets have higher entropy than their non-redundant counterparts. We further demonstrate that the higher entropy distributions are more correct and robust. Our observations suggest that structure prediction methods could benefit considerably from training on redundancy-weighted datasets rather than on non-redundant ones. This in turn, can improve our understanding of the forces that shape the protein structure universe.

## 2 METHODS

### 2.1 Structure dataset

Our dataset includes 7307 structures of polypeptide chains (in 6956 PDB files), at least 40 residues long, solved at 1.5 Å resolution or better. To allow a straightforward definition of local structural features, we broke each chain into (non-overlapping) segments of peptide-bonded amino acids, where two consecutive residues are considered ‘bonded’ if the distance between their Cα atoms is within 10% of the canonical peptide bond distance (2.95 or 3.8 Å for cis and trans peptide bond, respectively) and none is listed as ‘missing’ (in PDB Remark 465, missing residues, or Remark 470, missing atoms). The resulting set consists of 8876 segments longer than 20 residues.

These data are split into training and test subsets. The training subset consists of 6299 structures (7635 segments) released by the end of 2011, whereas the remaining 1008 chain structures (1241 segments), released from January 2012 to April 2013, constitute the test set. We compare the feature distributions obtained from the larger training set with those inferred using the smaller test set.

This study focuses on Cα distance distributions of the 441 possible type pairs (all combinations over 20 amino acids + the wild-card amino acid X) of residues separated by predefined distances along the sequence (e.g. 20 residues apart). We compare distributions that were inferred using various weighting schemes and over different sets of protein structures. In general, the reliability of distributions improves as the number of instances from which they were derived, grows. Figure 4D depicts the number of residue pairs (20 residues apart) in the training set. Note the two orders of magnitude difference between the rarest pair (two Tryptophan residues, 304 instances) and the most common pair (two Leucine residues, 10 163).

### 2.2 Weighting schemes

For each chain, we maintain a list of homologs and their aligned regions, as found by the FASTA (Pearson and Lipman, 1988) sequence alignment tool (with e-value ≤ 104). The sets of chains and homology assignments constitute the vertices and edges, respectively, of a neighbors graph (Fig. 3). We consider connected components of this graph as homology subsets; the size distributions of these subsets in the training and test sets are shown in Figure 2. We follow a common practice [see, for example, Bull et al. (2013); Li and Godzik (2006)] and generate a non-redundant dataset by picking a random representative from each homology subset. In contrast, redundant and redundancy-weighting datasets use all available structures.

Fig. 2.

Homology subset sizes for the polypeptide structures in the training (black) and test (white) sets

Fig. 2.

Homology subset sizes for the polypeptide structures in the training (black) and test (white) sets

Fig. 3.

Weighting schemes. A homology graph represents the similarity relations (edges) between chain sequences (nodes); each homology set, i.e. a connected component in this graph, is depicted in a different shade. The non-redundant and redundant schemes use uniform weights (shown inside the circles), assigned to a set of representatives and to all available structures, respectively. The two redundancy-weighting schemes, RWs and RWf, associate each feature (e.g. a specific pair of Cα atoms, 20 residues apart), with a weight, which is inversely proportional to its redundancy. RWs considers redundancy at the chain level, that is, all the features of a given chain have the same weight, which is inversely proportional to the number of chains with which it has considerable sequence similarity and coverage (schematically illustrated as thicker lines). In contrast, RWf considers redundancy at the feature level. Two features are considered redundant if they are part of an alignment of high identity level and are both complete and continuous. For example, the query feature instance shown in the illustration is given a weight of 0.5 as its only counterpart is Hit 1

Fig. 3.

Weighting schemes. A homology graph represents the similarity relations (edges) between chain sequences (nodes); each homology set, i.e. a connected component in this graph, is depicted in a different shade. The non-redundant and redundant schemes use uniform weights (shown inside the circles), assigned to a set of representatives and to all available structures, respectively. The two redundancy-weighting schemes, RWs and RWf, associate each feature (e.g. a specific pair of Cα atoms, 20 residues apart), with a weight, which is inversely proportional to its redundancy. RWs considers redundancy at the chain level, that is, all the features of a given chain have the same weight, which is inversely proportional to the number of chains with which it has considerable sequence similarity and coverage (schematically illustrated as thicker lines). In contrast, RWf considers redundancy at the feature level. Two features are considered redundant if they are part of an alignment of high identity level and are both complete and continuous. For example, the query feature instance shown in the illustration is given a weight of 0.5 as its only counterpart is Hit 1

In this study, we consider five schemes (Fig. 3): non-redundant, redundant and three ways for redundancy-weighting. The standard, non-redundant scheme effectively assigns a weight of 1 to each feature instance (here, a distance between two Cα atoms separated by a certain number of residues) from a representative subset of structures. The redundant scheme, too, effectively associates a weight of 1 to each feature, but includes all structures in the set.

The redundancy-weighting schemes use all structures available in the dataset, while balancing the contributions of protein families with many PDB entries and ones that have fewer. To this end, these schemes define the weight of a feature instance to be inversely proportional to the number of its homologs.

We implement three redundancy-weighting schemes. The first, per-sequence weighting scheme (RWs) associates each structure s with a set of close neighbors, ones with sequence similarity ≥40% and coverage ≥50%. The structure’s weight, which is assigned to all its features, is 1/(N(s) + 1), where N(s) is the number of these neighbors. The second, per-feature weighting scheme (RWf) assigns the weights individually to each feature instance. The weight of a structural feature, in this scheme, is inversely proportional to the number of homologous chains that align with all of its positions, without gaps. The last scheme, MJ, follows Miyazawa and Jernigan (1996, 1999) sample weighting. Briefly, sequence identity between each pair of structures is computed as the fraction of identical residues in their alignment (Needleman and Wunsch, 1970); eigenvalue decomposition of the sequence identity matrix defines per-sequence weights, which are approximately equal to the inverse of the number of similar sequences. This weight is assigned to all the features of the structure.

### 2.3 Performance assessment

We quantify the similarity between two distance distributions, P and Q, by their Jensen–Shannon divergence (JSD) defined as:

(1)
$JSD(P;Q)=12(dKL(P,M)+dKL(Q,M))$
where $M=12(P+Q)$ and dKL is the Kullback–Leibler divergence (Lin, 1991, Equation (4.1) and setting P and Q’s weighs to 0.5). For each feature distribution (e.g. Euclidean distances between Tryptophan Cα at position i and Histidine Cα at position i + 20; see Fig. 4A–C), we define the robustness of a weighting scheme, $W$, as the complement of the JSD between $W$-induced feature distributions in the training ($PW(Train)$) and test ($PW(Test)$) sets:
(2)
$Robustness(W)=1−JSD(PW(Train);PW(Test))$

Similarly, the correctness of a weighting scheme, for a given feature distribution, is the complement of the divergence between the scheme’s test distribution and the standard, non-redundant feature distribution, inferred over the training set:

(3)
$Correctness(W)=1−JSD(PNR(Train);PW(Test))$
We also compute the Shannon entropy of $W$-induced feature distribution, $PW$:
(4)
$Entropy(PW)=−∑jpj·log2(pj)$
where j ranges over all distribution bins.

Finally, we define the gain in either correctness, robustness or entropy, as the increase in that measure obtained using weighting scheme $W1$ compared with $W2$; for example:

(5)
$Correctness gain(W1,W2)=Correctness(W1)−Correctness(W2)$

## 3 RESULTS

Distributions of interatomic distances lie at the heart of knowledge-based potentials and as such are among the most studied features in PDB data mining. Here, we compare data-mining of such distances from three types of datasets: redundant, non-redundant and redundancy-weighted. The redundant training and test datasets include all proteins that were solved before and after the end of 2011, respectively, at 1.5 Å resolution or better. The training and test non-redundant datasets are subsets of the redundant ones, containing a single representative from each set of homologous structures. Finally, the three pairs (training and test) of redundancy-weighted datasets include the same structures as their redundant counterparts but the contributions of these structures, or features thereof, are weighted. The three redundancy-weighting schemes (see Section 2 and Fig. 3) are: (i) a scheme that assigns a single weight to each polypeptide chain—and all the feature instances computed from its structure—based on the number of its homologs (RWs); (ii) a scheme that computes a per-feature weight based only on homologs that cover all the positions in a feature (RWf); and (iii) the algebraic sample weighting scheme of Miyazawa and Jernigan (1996; 1999).

We focus on distributions of Cα–Cα distances, as a proof of concept. Figure 4, upper panel, shows training and test distributions of Cα distances between three types of amino acid pairs located 20 positions apart along the sequence: Tryptophan (W) and Histidine (H, left), Tryptophan and Tyrosine (Y, center) and Tryptophan and Leucine (L, right). We compute the distributions using the standard, non-redundant scheme (NR) or a redundancy-weighting scheme, RWf. The insets in Figure 4A–C depict the similarity of these distributions. We assess the performance of a weighting scheme $W$ by two quantities: correctness, which measures how similar is $W$’s test distribution to the non-redundant distribution over the training set (inset’s top row); and robustness, which measures how similar are $W$’s training and test distributions (inset’s diagonal; see Section 2 for formal definitions).

Figures 4E and F shows the correctness of Cα–Cα distance distributions over all pairs of amino acids when using a non-redundant dataset and a redundancy-weighted one. Predictably, correctness for both schemes is lower for amino acid pairs with a relatively small number (i.e. hundreds) of samples (top left corner of both heat-maps) compared with pairs with many more (i.e. tens and hundreds of thousands) samples (see Fig. 4D for sample counts of each amino acid pair); the Spearman’s rank correlation between the number of samples and the correctness of NR and RWf is 0.96, P < 10100. Evidently, the distance distributions derived using the redundancy-weighted scheme are, for the most part, more correct than those inferred using the non-redundant scheme (Fig. 4G); Of 441 distributions, one for each amino acid pair, 422 (95.7%) redundancy-weighted distributions are more correct. Moreover, the improvement in correctness, or the correctness gain, is greater for amino acid pairs with lower number of samples (Spearman’s rank correlation = −0.7, P < 1065) for which, as we indicated above, the correctness is poorer.

Fig. 4.

Upper panel: Comparing feature distributions. Cα–Cα distance distributions between Tryptophan $(WiCα)$ and Histidine $(Hi+20Cα$, A), Tryptophan and Tyrosine $(Yi+20Cα$, B) or Tryptophan and Leucine $(Li+20Cα$, C) 20 positions apart calculated using non-redundant (red) and redundancy-weighted (RWf, blue) schemes over the training (solid) and test (dashed) sets; amino acid pairs are ordered, from left-to-right, by the number of their instances in these sets (as indicated above each plot). The insets show the similarity of the specified distributions, where hotter colors indicate greater divergences. Weighting scheme performance is described in terms of correctness and robustness. Correctness here is the similarity of a weighting scheme-induced test distributions (dashed) to the one inferred using the non-redundant training dataset (solid, red); these values are shown in the top row of each inset. Robustness indicates the similarity of weighting scheme-induced distributions over the training and test sets (plots with the same color); these values are shown along the inset diagonals. Note the improvement (higher similarity) in both measures as the number of instances increases. Lower panel: (D) The number of Cα pairs with 20 residues separation in the training set (note that the color scheme is exponential). The amino acid types are ordered by prevalence in the Swiss-Prot data bank (from left to right and top to bottom); X denotes all residue types. (E–G) Correctness of NR (E) and RWf (F) weighting schemes for all pairs of amino acids is shown as a heat-map; entries corresponding to distributions shown in the upper panel are highlighted (black boxes). Correctness gain, the difference between the correctness of RWf and NR, is positive (G, blue shades) for the vast majority of amino acid pairs, indicating an improvement obtained using the former scheme; the one-tailed, paired-sample Wilcoxon signed rank test P-value is shown on the top

Fig. 4.

Upper panel: Comparing feature distributions. Cα–Cα distance distributions between Tryptophan $(WiCα)$ and Histidine $(Hi+20Cα$, A), Tryptophan and Tyrosine $(Yi+20Cα$, B) or Tryptophan and Leucine $(Li+20Cα$, C) 20 positions apart calculated using non-redundant (red) and redundancy-weighted (RWf, blue) schemes over the training (solid) and test (dashed) sets; amino acid pairs are ordered, from left-to-right, by the number of their instances in these sets (as indicated above each plot). The insets show the similarity of the specified distributions, where hotter colors indicate greater divergences. Weighting scheme performance is described in terms of correctness and robustness. Correctness here is the similarity of a weighting scheme-induced test distributions (dashed) to the one inferred using the non-redundant training dataset (solid, red); these values are shown in the top row of each inset. Robustness indicates the similarity of weighting scheme-induced distributions over the training and test sets (plots with the same color); these values are shown along the inset diagonals. Note the improvement (higher similarity) in both measures as the number of instances increases. Lower panel: (D) The number of Cα pairs with 20 residues separation in the training set (note that the color scheme is exponential). The amino acid types are ordered by prevalence in the Swiss-Prot data bank (from left to right and top to bottom); X denotes all residue types. (E–G) Correctness of NR (E) and RWf (F) weighting schemes for all pairs of amino acids is shown as a heat-map; entries corresponding to distributions shown in the upper panel are highlighted (black boxes). Correctness gain, the difference between the correctness of RWf and NR, is positive (G, blue shades) for the vast majority of amino acid pairs, indicating an improvement obtained using the former scheme; the one-tailed, paired-sample Wilcoxon signed rank test P-value is shown on the top

A more complete picture is presented in Figure 5, which compares the standard non-redundant scheme with the four non-standard ones, over three performance criteria: correctness, entropy and robustness. As expected, the non-redundant Cαi –Cαi+20 distributions are significantly more correct (one-tailed, paired-sample Wilcoxon signed rank test P < 106) and associated with greater entropy (P < 1016) than the corresponding redundant distributions. This agrees with the common practice of preferring non-redundant datasets to redundant ones. Yet, importantly, all three redundancy-weighting schemes perform significantly better than the non-redundant scheme in terms of correctness and robustness, and obtain higher entropy scores; for example: the median improvement in correctness using RWs compared with NR is 0.89%, P < 1068.

Fig. 5.

Weighting scheme performance. Correctness (top row), entropy (middle) and robustness (bottom) gain of various weighting schemes compared with the standard, non-redundant scheme. One-tailed, paired-sample Wilcoxon signed rank test P-values are shown above each heat-map; see Figure 2 for more details

Fig. 5.

Weighting scheme performance. Correctness (top row), entropy (middle) and robustness (bottom) gain of various weighting schemes compared with the standard, non-redundant scheme. One-tailed, paired-sample Wilcoxon signed rank test P-values are shown above each heat-map; see Figure 2 for more details

Finally, Figure 6 compares the performance of all weighting schemes over Cα distance distributions of residues 5, 10, 20, 50 and 100 amino acids apart. In all these cases, and consistent with the results for the Cαi –Cαi+20 distributions (Fig. 5), all three redundancy-weighting schemes perform better in terms of their correctness, entropy and robustness, than the standard non-redundant and redundant alternatives. Among the redundancy-weighting schemes, RWf is the most correct in 4 out of 5 amino acid distances and MJ significantly more correct than RWs for Cα atoms 50 and 100 amino acids apart; RWf’s entropy values are the greatest for the longer amino acid ranges and MJ’s for the shorter ones; and, consistently, MJ is the most robust, followed by RWf. Comparing the non-redundant and redundant distributions reveals that the latter is always more robust, whereas the former is more correct in all but the longest range (where redundant is more correct, P = 0.0002); and obtains greater entropy values in all but the shortest range. These results show that the redundancy-weighting performs better than the widely-used non-redundant datasets, in all these measures.

Fig. 6.

Performance summary. Significance of correctness (top), entropy (middle) and robustness (bottom) gain of various weighting schemes, all-against-all (x-axis scheme over y-axis one), for distance distributions of Cα atoms 5 (left column) to 100 (right column) amino acids apart. The base-10 logarithm of one-tailed, paired-sample Wilcoxon signed rank test P-value is shown, with stronger colors indicating more significance gains. NR: non-redundant; R: redundant; MJ: Miyazawa and Jernigan (1996, 1999) sample weighting; RWs: per-sequence redundancy-weighting; RWf: per-feature redundancy-weighting

Fig. 6.

Performance summary. Significance of correctness (top), entropy (middle) and robustness (bottom) gain of various weighting schemes, all-against-all (x-axis scheme over y-axis one), for distance distributions of Cα atoms 5 (left column) to 100 (right column) amino acids apart. The base-10 logarithm of one-tailed, paired-sample Wilcoxon signed rank test P-value is shown, with stronger colors indicating more significance gains. NR: non-redundant; R: redundant; MJ: Miyazawa and Jernigan (1996, 1999) sample weighting; RWs: per-sequence redundancy-weighting; RWf: per-feature redundancy-weighting

## 4 DISCUSSION

The dominant approach for data mining of the PDB is to extract the knowledge from a relatively small subset of non-homologous structures and consider their homologs (all other structures) as redundant and, thus, non-informative. Although this approach circumvents much of the inherent biases in the PDB, it also artificially reduces the variability of the structural landscape. An alternative approach, which we term here redundancy-weighting, considers all available structures but assigns them (or features thereof) lower weights proportionally to the number of homologs they have in the dataset.

Redundancy-weighting, originally proposed by Miyazawa and Jernigan almost two decades ago (Miyazawa and Jernigan, 1996), has been rarely used since and, to the best of our knowledge, never been systematically benchmarked against the standard approach. Here, we compare the correctness, complexity and robustness of feature distributions, which were inferred using non-redundant and redundant datasets, and three redundancy-weighting schemes: the algebraic sample weighting of Miyazawa and Jernigan (MJ) (Miyazawa and Jernigan, 1996), as well as our novel RWs and RWf weightings. To this end, we quantify and compare the correctness, complexity and robustness of distance distributions, which were derived from training and test datasets according to these schemes.

Our most significant contribution is demonstrating that the three redundancy-weighting schemes outperform the ‘standard’ non-redundant approach in all tested metrics. The MJ scheme, which is the most robust, is probably not scalable enough for practical use. This scheme requires an eigenvalue decomposition of an all-against-all similarity matrix. As the PDB is quickly approaching the 100 000 structures milestone, such decomposition becomes challenging both in terms of the computational requirements (space and time) and the numerical stability (Heath, 2002). RWf, which is the most correct scheme, has a milder limitation: it requires recomputing of weights for each structural feature. The RWs scheme is less accurate and robust than the other two, but does not suffer from these limitations. Thus, it may serve as a simple, off-the-shelf, general weighting scheme, which performs better than non-redundant datasets.

Here, we focused on a relatively small set (only a few thousands) of the PDB’s most accurately solved structures (with resolution better than 1.5 Å). Focusing on this set had several advantages: most importantly, the computational cost of exploring alternative weighting schemes is tractable and, in particular, we could easily calculate the eigenvalue decomposition needed for the MJ weighting scheme. Notice that homology is not too common within this set, which includes >50% singletons (see Fig. 2). Thus, one could expect that refining the weighting scheme will not have any impact on the accuracy and robustness when inferring structural features from the set. However, it does. Thus, we believe that such weighting schemes can similarly benefit others, even more so when considering larger subsets of the PDB, culled using more lax experimental quality thresholds.

While most residue–residue contact potentials [e.g. Miyazawa and Jernigan (1996)] consider ‘un-directional’ residue pairs (thus combining pairs of two same residues with different orders to a single unique pair), we chose to focus on directional pairs to increase the dynamic range of the explored features (Fig. 4D). Notably, the trends shown in Figure 6 are preserved when un-directional pairs are considered (data not shown).

Importantly, our study shows that the gain in accuracy and robustness is proportional to the rarity of the studied feature (Figs. 4 and 5). Current applications of PDB data mining (e.g. derivation of pairwise potentials and secondary structure prediction) are dominated by highly prevalent features (e.g. contacts between specific residues and the three major secondary structure elements). However, in more subtle applications [e.g. multi-body potentials (Gniewek et al., 2011) and fragment prediction (Gront et al., 2011; Kalev and Habeck, 2011; Shen et al., 2013)] the number of relevant features grows while their prevalence in the non-redundant dataset drops. We speculate that shifting to the redundancy-weighting paradigm may be essential for the advancement of computational structural biology beyond pairwise potentials and prediction of simple features.

Although two of the redundancy-weighting schemes presented here are already useful, this study is mostly a proof-of-concept. A promising route for improvement is to take advantage of rapid structural search methods (Budowski-Tal et al., 2010) and replace the currently used sequence alignments by more reliable and sensitive structural alignments. Another possible direction is to focus on domains rather than peptide chains, which often harbor several, repeating domains (e.g. Zn fingers) and are, thus redundant by nature. Finally, the weighting scheme may apply some evolutionary model to gain better estimate of the interdependencies of homologous proteins (Thompson et al., 1994).

In a wider context, redundant datasets are abundant in other domains of knowledge, e.g. genomics, natural language processing and computer vision. While the current work focuses on protein structure datasets, we hope insights obtained in this domain can have impact on data mining in other, unrelated ones.

## ACKNOWLEDGEMENT

C.Y. and C.K. are grateful to Eitan Bachmat for generous support. R.K. and C.Y. would like to thank Yaron Rothman and Udi Feldman for introducing them to one another.

Funding: C.K. and M.L. are supported by BSF (grant number 2009432).

Conflicts of Interest: none declared.

## REFERENCES

Altschul
SF
et al.
Gapped BLAST and PSI-BLAST: a new generation of protein database search programs
Nucleic Acids Res.

1997
25
3389
3402
Berman
HM
et al.
The future of the protein data bank
Biopolymers

2013a
99
218
222
Berman
HM
et al.
Trendspotting in the protein data bank
FEBS Lett.

2013b
587
1036
1045
Bernstein
FC
et al.
The protein data bank: a computer-based archival file for macromolecular structures
J. Mol. Biol.

1977
112
535
542
Budowski-Tal
I
et al.
FragBag, an accurate representation of protein structure, retrieves structural neighbors from the entire PDB quickly and accurately

2010
107
3481
3486
Bull
SC
et al.
Maximising the size of non-redundant protein datasets using graph theory
PLoS One

2013
8
e55484
Chou
PY
Fasman
GD
Prediction of protein conformation
Biochemistry

1974
13
222
245
Eisenberg
D
et al.
Carter
CW
Jr
Sweet
RM
VERIFY3D: assessment of protein models with three-dimensional profiles
1997
Macromolecular Crystallography Part B, Volume 277 of Methods in Enzymology. Academic Press, pp 396–404
Garnier
J
et al.
Analysis of the accuracy and implications of simple methods for predicting the secondary structure of globular proteins
J. Mol. Biol.

1978
120
97
120
Gniewek
P
et al.
Multibody coarse-grained potentials for native structure recognition and quality assessment of protein models
Proteins

2011
79
1923
1929
Goldstein
RA
The structure of protein evolution and the evolution of protein structure
Curr. Opin. Struct. Biol.

2008
18
170
177
Gront
D
et al.
Generalized fragment picking in Rosetta: design, protocols and applications
PLoS One

2011
6
e23294
Heath
MT
Scientific Computing: An Introductory Survey

2002
2nd edn
New York, NY
The McGraw-Hill Companies Inc
Hobohm
U
Sander
C
Enlarged representative set of protein structures
Protein Sci.

1994
3
522
524
Kabsch
W
Sander
C
Dictionary of protein secondary structure: pattern recognition of hydrogen-bonded and geometrical features
Biopolymers

1983
22
2577
2637
Kalev
I
Habeck
M
HHfrag: HMM-based fragment detection using HHpred
Bioinformatics

2011
27
3110
3116
Karplus
K
SAM-T08, HMM-based protein structure prediction
Nucleic Acids Res.

2009
37
Suppl. 2
W492
W497
Kosloff
M
Kolodny
R
Sequence-similar, structure-dissimilar protein pairs in the PDB
Proteins

2008
71
891
902
Levitt
M
Nature of the protein universe

2009
106
11079
11084
Li
W
Godzik
A
Cd-hit: a fast program for clustering and comparing large sets of protein or nucleotide sequences
Bioinformatics

2006
22
1658
1659
Lin
J
Divergence measures based on the Shannon entropy
IEEE Trans. Inf. Theory

1991
37
145
151
Lüthy
R
et al.
Assessment of protein models with three-dimensional profiles
Nature

1992
356
83
85
McGuffin
LJ
et al.
The PSIPRED protein structure prediction server
Bioinformatics

2000
16
404
405
Miyazawa
S
Jernigan
RL
Estimation of effective interresidue contact energies from protein crystal structures: quasi-chemical approximation
Macromolecules

1985
18
534
552
Miyazawa
S
Jernigan
RL
Residue–residue potentials with a favorable contact pair term and an unfavorable high packing density term, for simulation and threading
J. Mol. Biol.

1996
256
623
644
Miyazawa
S
Jernigan
RL
Self-consistent estimation of inter-residue protein contact energies based on an equilibrium mixture approximation of residues
Proteins

1999
34
49
68
Needleman
SB
Wunsch
CD
A general method applicable to the search for similarities in the amino acid sequence of two proteins
J. Mol. Biol.

1970
48
443
453
Pearson
WR
Lipman
DJ
Improved tools for biological sequence comparison

1988
85
2444
2448
Rost
B
Doolittle
RF
PHD: predicting one-dimensional protein structure by profile-based neural networks
Computer Methods for Macromolecular Sequence Analysis

1996
Volume 266 of Methods in Enzymology. Academic Press, pp. 525–539
Samudrala
R
Moult
J
An all-atom distance-dependent conditional probability discriminatory function for protein structure prediction
J. Mol. Biol.

1998
275
895
916
Shen
Y
et al.
Detecting protein candidate fragments using a structural alphabet profile comparison approach
PLoS One

2013
8
e80493
Sippl
MJ
Recognition of errors in three-dimensional structures of proteins
Proteins

1993
17
355
362
Summa
CM
Levitt
M
Near-native structure refinement using in vacuo energy minimization

2007
104
3177
3182
Tanaka
S
Scheraga
HA
Medium- and long-range interaction parameters between amino acids for predicting three-dimensional structures of proteins
Macromolecules

1976
9
945
950
Thompson
JD
et al.
CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice
Nucleic Acids Res.

1994
22
4673
4680
Wang
G
Dunbrack
RL
PISCES: a protein sequence culling server
Bioinformatics

2003
19
1589
1591
Wang
G
Dunbrack
RL
PISCES: recent improvements to a PDB sequence culling server
Nucleic Acids Res.

2005
33
W94
W98

## Author notes

Associate Editor: Anna Tramontano