BCov: a method for predicting b -sheet topology using sparse inverse covariance estimation and integer programming

Motivation: Prediction of protein residue contacts, even at the coarse-grain level, can help in finding solutions to the protein structure prediction problem. Unlike a -helices that are locally stabilized, b -sheets result from pairwise hydrogen bonding of two or more disjoint regions of the protein backbone. The problem of predicting contacts among b -strands in proteins has been addressed by several supervised com-putational approaches. Recently, prediction of residue contacts based on correlated mutations has been greatly improved and finally allows the prediction of 3D structures of the proteins. Results: In this article, we describe BCov, which is the first unsupervised method to predict the b -sheet topology starting from the protein sequence and its secondary structure. BCov takes advantage of the sparse inverse covariance estimation to define b -strand partner scores. Then an optimization based on integer programming is carried out to predict the b -sheet connectivity. When tested on the prediction of b -strand pairing, BCov scores with average values of Matthews Correlation Coefficient (MCC) and F1 equal to 0.56 and 0.61, respectively, on a non-redundant dataset of 916 protein chains known with atomic resolution. Our approach well compares with the state-of-the-art methods trained so far for this specific task. The new dataset BetaSheet1452 can be downloaded Supplementary information: Supplementary data are available at Bioinformatics online.


INTRODUCTION
b-Sheets are widespread motifs of local structure found in over 80% of the protein structures presently available in the Protein Data Bank (http://www.rcsb.org/pdb/home/home.do). b-Sheets are generated by the pairing of two or more b-strands held together by characteristic patterns of hydrogen bonds running in a parallel or antiparallel fashion (Zhang and Kim, 2000). Prediction of the b-sheet topology from the covalent structure of the protein is useful for predicting its tertiary structure, for designing new proteins and for understanding folding pathways. The first method to predict b-strand pairing was based on a statistical potential approach (Hubbard, 1994). Prediction of b-residue contacts was addressed by Baldi et al. (2000) using an elaborate method based on neural networks. Steward and Thornton (2002) adopted an information theoretic approach to predict b-residue pairwise interaction. Cheng and Baldi (2005) pioneered the idea of predicting b-sheet topologies when the protein secondary structure is known and set the standard for this type of task. Their method BetaPro is based on a 2D-recursive neural network (Baldi and Pollastri, 2003) trained to predict pairing probabilities of interstrand b-residue pairs. Then an algorithm finds alignments between all pairs of b-strands, and a weighted-graph matching algorithm predicts the b-sheet topologies. Lippi and Frasconi (2009) introduced an alternative approach based on Markov logic networks (MLNs). This approach exploits b-sheet structural constraints defined as logical formulas whose weights can be learned from examples (Lippi and Frasconi, 2009). The prediction of b-residue contacts was also applied to the problem of predicting the 3D structure of proteins using integer linear optimization (Rajgaria et al., 2010). Aydin et al. (2011) combined BetaPro outputs with a Bayesian approach and tested it on subsets of the Cheng and Baldi benchmark set (Aydin et al., 2011). Recently, a new method to predict protein b-sheet contacts using a maximum entropy-based correlated mutation measure (CMM) has been introduced and compared with the state-of-the-art methods (Burkoff et al., 2013). CMM achieves performances similar to BetaPro and MLNs (Burkoff et al., 2013).
In recent years, the main breakthroughs in residue contact prediction have concerned improvements in the exploitation of information from multiple sequence alignments (MSA). The degree of coevolution of pair of sites in the MSA can be used to infer the closeness of the corresponding residues in the 3D structure. Different approaches have been described to elucidate the coevolution of columns in an MSA, including standard correlated mutations analysis (Olmea and Valencia, 1997) and information theoretic measures (Dunn et al., 2008). Recently some powerful methods based on the extraction of direct coupling information from MSAs have been introduced to predict protein contacts both in globular (Cocco et al, 2013;Ekeberg et al. 2013;Jones et al., 2012;Marks et al., 2011;Morcos et al., 2001;Weigt et al. 2009) and membrane proteins (Hopf et al., 2012;Nugent and Jones 2012). These new contact prediction methods are not only improved with respect the previous approaches but also finally allow to predict 3D structures of the proteins (for review see de Juan et al., 2013;Marks et al., 2012;Taylor et al., 2013). *To whom correspondence should be addressed.
In this article, we describe BCov, a new approach for b-sheet topology prediction based on sparse inverse covariance estimation and integer programming. BCov is the first unsupervised method that addresses this task. We use the residue-contact propensities provided by the Protein Sparse Inverse COVariance (PSICOV) method  to compute the scores of the b-strand pairings. We selected PSICOV because it is freely available and does not require proprietary software to run. Then we apply an integer programming optimization to assign the bsheet topology. BCov well compares with the performances of the state-of-the-art algorithms (BetaPro, CMMs and MLNs) that integrate supervised techniques to address the same task.

The BetaSheet916 dataset
The BetaSheet916 dataset was first introduced by Cheng and Baldi (2005) to evaluate their BetaPro method for b-sheet prediction and then adopted by other authors (Burkoff et al., 2013;Lippi and Frasconi, 2009). The set is routinely adopted as benchmark set for b-sheet prediction methods and for sake of comparison we also use it in this article. The atomic coordinates of 916 protein chains (with X-ray diffraction at a resolution 2.5 Å ) were extracted from the Protein Data Bank (PDB) as to May, 2004 (Cheng andBaldi, 2005). Secondary structure assignments were computed by means of the Define Secondary Structure of Proteins (DSSP) program (Kabsch and Sander, 1983). Both extended b-strands (marked by E in the DSSP output) and isolated b-bridges (B in the DSSP output) were considered b-residues. The b-contact maps were defined using information about b-partnership available from the DSSP output. Statistics of the dataset are shown in Table 1.

The BetaSheet1452 dataset
To complement the BetaSheet916 dataset, we constructed a new dataset of more recently deposited high-resolution proteins. We extracted from the PDB a set of protein chains whose structures were obtained by X-ray diffraction with a resolution 2.5 Å . We restricted our search to PDB entries deposited after May, 2004 to exclude chains already present in the BetaSheet916 dataset. We filtered out protein sequences at 20% of sequence identity level to obtain a non-redundant dataset. More importantly, we also removed sequences with identity 420% with any of the proteins contained in the BetaSheet916 set. We used DSSP to assign secondary structures, and we filtered out sequences having 53 distinct extended b-strands to exclude trivial cases. We also discharged protein chains shorter than 50 residues or having backbone interruptions or nonstandard amino acids. The final dataset of proteins, referred to as BetaSheet1452, contained 1452 protein chains. Statistics are shown in Table 1.

CASP 2010 dataset
For sake of comparison, we also considered protein chains from the Critical Assessment of protein Structure Prediction (CASP) 2010 experiment. The original set comprised 116 targets. We used the same procedure described in Burkoff et al. (2013) to filter out sequences with a number of b-residues 10. The final set consisted of 92 protein chains. Secondary structures have been assigned using DSSP.

MSA construction
For each sequence in the datasets described earlier in the text, we obtained an MSA using the jackhmmer program that is part of HMMER 3.0 package (http://hmmer.org). Given a target protein sequence, homologous sequences were found by running three iterations of jackhmmer against the UNIREF90 database (Magrane and the Uniprot Consortium, 2011) setting the E-value threshold to 1e-3. The corresponding MSA has been obtained from jackhmmer output.

BCov general description
BCov consists of three main steps: (i) compute the residue contact propensity with PSICOV; (ii) compute the score of each possible b-strand pairing; (iii) compute the b-sheet topology using an integer programming optimization to find the best solution according to the b-pairing scores and constraints. Below we report the details of these three major steps.

Computing the residue contact propensity with PSICOV
For sake of clarity, in this section we provide a brief description of the PSICOV method. We refer to Jones et al. (2012) for further details about the method. Starting from an MSA with m columns, PSICOV first computes a sample 21 m-by-21 m (also gaps are considered) covariance matrix C using observed single and pair amino acid occurrence frequencies: where f i,j (a,b), f i (a) and f j (b) are the sample relative frequency of amino acid pair ab at sites ij, the relative frequency of amino acid a at column i and the frequency of amino acid b at column j, respectively.
The inverse of the sample covariance matrix is computed with the graphical lasso method (Banerjee et al., 2008;Friedman et al., 2008). This algorithm allows estimating a sparse inverse covariance matrix Â by minimizing the following objective function: where C is a d-by-d covariance matrix, Â is the inverse covariance matrix and the last term is a regularization term (the '1-norm of the inverse matrix) that governs the sparsity of the solutions. is the sparsity hyper-parameter: the greater is the sparser is the solution.
The sparse inverse covariance matrix Â is used, in turn, to derive a contact score between residues at positions i and j by computing the '1-norm of the 20-by-20 sub-matrix of Â corresponding to all possible pairs of amino acid at position i and j: The raw contact score computed as in Equation 3 is finally corrected as follows: where " B i, À is the mean raw contact score between the i-th position and all other positions (analogously " B À, j for the j-th position) and " B is the overall mean contact score. This correction, referred to as average product correction, allows reducing both entropic and phylogenetic biases that are major source of noise in MSAs (Dunn et al., 2008).
For each possible pairs of residues belonging to two different b-sheets, we then adopt the score computed in Equation 4 as a measure of the b-residue contact strength.

b-contact and b-sheet topology prediction
BCov consists of an integrated approach to predict both b-residue contacts and b-sheet topology i.e. the specification of the b-strand pairings, including the directions of interactions (parallel, antiparallel or isolated b-bridge). The method takes advantage of b-contact scores as computed by PSICOV and integer programming for b-sheet topology prediction.
Consider a protein sequence with n distinct b-strands and m b-residues. As mentioned earlier in the text, PSICOV is used to compute contact scores between all residue pairs in the protein sequence (even for non-b residues). Then we extract the submatrix obtained by considering only b-residue pairs. After this step, we end up with an m-by-m symmetric matrix B whose B i,j components can be interpreted as a propensity value for b-residues i and j to form a b-contact.
Our algorithm proceeds by computing an n-by-n matrix S whose entries represent interaction scores between pairs of strands. The matrix S is defined as follows: where, for strands s i and s j with i 5 j, S ij contains the best alignment score between the two strands in the parallel direction, whereas S ji stores the best alignment score in the antiparallel direction. Here, without loss of generality, we use the upper diagonal part of S for parallel scores and the lower diagonal part for antiparallel scores. To compute alignments, we use b-contact propensities B i,j obtained from PSICOV as local residue match scores. We restrict the search space for optimal alignments by considering only solutions that can be obtained by sliding one strand along the other without gaps and with at most one unmatched residue at the shortest segment ends. With these constraints, for two strands of length r and t, respectively, assuming r4t, there are 2 Â (r À t þ 3) possible alignments (see Fig. 1 for an example). Strand interaction scores stored in the matrix S are then used to assign the b-sheet topology. Because a naı¨f approach that enumerates all possible b -sheet pairings is infeasible because of the combinatorial nature of the problem (Zhang and Kim, 2000), several alternative solutions were adopted such as graph matching or maximum spanning-tree algorithms (Cheng and Baldi, 2005).
In this article, we tackle this problem using integer programming. In particular, the optimal b-strand pairing pattern can be obtained by solving the following integer program: The solution X is an n-by-n integer matrix that maximizes the overall b-sheet sum of scores. The matrix X is binary (c1 constraints) and its non-zero entries identify interacting b-strands. c2 constraints ensure the consistency of the solution by enforcing the assignment of either parallel or antiparallel pairing direction for each strand pair (i, j). Furthermore, a given b-strand can pair with at most two distinct b-strands (c3 and c4 constraints). Figure 2 shows an example of the overall procedure for a protein with five b-strands. We also evaluate a version of BCov that takes into consideration the geometric constraint of generating parallel b-sheets at short sequence separation. In protein structures, the formation of parallel b-sheets with few interleaving residues between two strands is difficult and it requires a strong unfavorable free energy contribution. For instance, in the Cheng and Baldi (2005) dataset almost all pairing  b-strands separated by 56 residues are antiparallel. For this reason, we introduce a modified version of our algorithm, referred to as BCov6, which always assigns antiparallel directions to pairs of b-strands whose sequence separation is 56 residues.

Method implementation
BCov has been developed in the C programming language (available at http://biocomp.unibo.it/savojard/bcov/bcov-1.0.tar.gz). The source code is released under the terms of the GNU General Public License (GPL) version 3. BCov is linked with the PSICOV source code (available at http://bioinfadmin.cs.ucl.ac.uk/downloads/PSICOV/ and also released under GPL terms) and the graphical lasso FORTRAN package for sparse inverse covariance estimation (Friedman et al., 2008). For integer linear optimization program, we used the C application program interface of the GNU Linear Programming Kit (http://www.gnu.org/software/ glpk/).

Measures of performance
We evaluated the performance of the method at both b-residue contact level (i.e. contact maps) and b-strand pairing level (coarse contact maps).
In either case, we computed the canonical scoring measures, which include: Precision: Recall: F1-score: Matthews Correlation Coefficient: where TP, TN, FP and FN are, respectively, true positives, true negatives, false positives and false negatives. We also evaluate the methods using qualitative measures introduced by Lippi end Frasconi (2009

RESULTS AND DISCUSSION
3.1 PSICOV performance on b-residue contacts PSICOV has been described as one of the most promising methods to address the prediction of residue contact in proteins . The accuracy of PSICOV depends on both the number and the quality of the aligned sequences . In the case of BetaSheet916 dataset, the number of sequences in each MSA ranged between 2 and 190576 (see Fig. 3).
Here, we address a slightly different goal, namely we want to evaluate the predictions on the subsets of the contact map identified by the b-strand regions. In Table 2, we report the PSICOV accuracy on the whole BetaSheet916 dataset and also on two disjoint subsets: a subset of proteins whose number of aligned sequences are !1000 and the subset of proteins whose number of aligned sequences are 51000. Results listed in Table 2 indicate that a low number of aligned sequences can affect the accuracy. The performances of PSICOV on this task are lower than those obtained by the state-of-the-art methods (see Table 3, rows 1-4). However, it must be noticed that PSICOV is an unsupervised method (differently from the others) and that here we evaluate it only on the portion of b-residue contacts.

BCov performance on b-residue contacts
As described in Section 2, BCov optimizes the PSICOV outputs using integer programming approaches. BCov assigns the b-sheet topology and also all segment pairings including b-sheet directions. For each pair of b-strand partners the best score obtained by the segment pairing (see Section 2.4) can be used to assign b-residue contacts. With this procedure, the performances are significantly higher than those obtained using PSICOV alone (compare Table 2 with Table 3). When the predicted b-sheet contacts are used to reconstruct the corresponding 3D structures with FT-COMAR (Vassura et al., 2008), the improvement of BCov over PSICOV leads to a better reconstruction of the protein fold (see Supplementary Fig. S1 of the Supplementary Materials).
In Table 3, we report the performances of BCov and BCov6 with respect to the state-of-the-art methods on the BetaSheet916 dataset. It is evident that when we do not allow parallel pairing of b-strand segments whose sequence separation is 56 residues

BCov performance at strand level
The evaluation of the methods at strand level (detecting the correct b-sheet topology and b-strand pairings) is the most relevant benchmark, as this is the main goal of all the approaches developed so far (BetaPro, NLN, CMM). In Table 4, we report the comparison of BCov with the state-of-the-art methods on the BetaSheet916 dataset. It is worth noticing that in this task BCov (which is not trained on the dataset) compares well with the other methods achieving a correlation coefficient value of 0.56 (MCC). This indicates that the information extracted by PSICOV is relevant also for the problem at hand, and if coupled with an optimization procedure the overall accuracy can improve significantly. Furthermore, it is evident that when the number of aligned sequences is sufficiently high (Bcov ! 1000) the performances increases by 5 percentage points and with an MCC of 7 percentage points higher than the best state-of-the-art method. As a consequence, we may expect that BCov performance will improve as new sequences will be made available by mass sequencing efforts.
In Table 5, we show scoring of the different methods obtained with the qualitative measures introduced by Lippi and Frasconi (2009) (see Section 2.6). It is interesting to see that BCov outperforms all other methods with respect to all the reported indices with the exception of the prediction of the correct directions of the correctly predicted b-sheets. This indicates that with respect to the other methods, BCov is better at guessing the segment pairings and b-sheet topology but less effective at detecting the contacts at residue level.

Performance on the CASP 2010 dataset
To evaluate the BCov performances at the strand level on another benchmark, we used the CASP 2010 dataset. With this choice we can also compare the BCov performance at the strand level with other two state-of-the-art predictors (BetaPro and CMM). We downloaded the protein structures provided during the CASP 2010 experiment, and we followed the same procedure described in Burkoff et al. (2013) to filter out sequences with a number of b-residues 10. The final set consisted of 92 protein chains. The results reported in Table 6 are really encouraging, as it appears that BCov6 (an unsupervised method) compares well with the state-of-the-art methods BetaPro and CMM [as reported from the data taken from the supplementary material of Burkoff et al. (2013)]. It is also worth noticing that BCov6 MCC and F1 performances are almost unaffected by the dataset change (compare Table 6 with indexes reported in Tables 4 and 5).

Performance on the new BetaSheet1452 dataset
To evaluate the BCov performance on a larger and newer dataset, in Table 7 we report the performances both at residue and at strand level. It is encouraging that the performances are consistent and comparable with those obtained using the classical BetaSheet916 dataset (compare with Tables 3-5) and also the CASP 2010 dataset (compare with Table 6). Because BetaSheet1452 is larger than the previous dataset (far larger than any CASP datasets, at least so far), the fact that the  The score is evaluated only on the subsets of the proteins whose number of aligned sequences (NAS) is !1000.
c The score is evaluated only on the subsets of the proteins whose NAS is 51000. Note: P(L*x) ¼ precision evaluated selecting the L*x highest scoring b-residue pairs. performances do not degrade indicates that BCov is a robust predictor of b-sheet topologies. In Table 7 we also report the performance of the most recently introduced CMM method ( Burkoff et al., 2013) that was trained on the BetaSheet916 dataset. It is worth noticing that BCov6 performs similarly at residue level, but generally better at strand level indicating that BCov more efficient than CMM at locating the b-sheet segment pairing (see MCC and F1 measure in Table 7).

CONCLUSIONS
In this article, we presented BCov, a new unsupervised method which integrates correlated mutation analysis and integer programming for b-sheet topology prediction. BCov well compares with the performances of the state-of-the-art methods, such as BetaPro, CMMs and MLNs. The main advantage of BCov is in the identification of the b-strand pairing and the prediction of the b-sheet topology. This is also confirmed when BCov is compared with the most recently introduced method to address this task (MCC by Burkoff et al., 2013) on the new BetaSheet1452 dataset (Table 7). It is also worth noticing that BCov might be coupled or incorporated into more complex machine-learning frameworks such as recurrent neural networks or MLNs, as all the information used by BCov (or its components) is obtained at the sequence level without training. Finally, all the experiments carried out in this article showed once more the strength of the correlated mutation analysis for residue contact prediction, especially when this is performed using a sparse inverse covariance approach as the one implemented by PSICOV  or related approaches (Cocco et al., 2013;Ekeberg et al. 2013;Marks et al., 2011Marks et al., , 2012Morcos et al., 2001;Weigt et al. 2009). Note: For legend see Table 3.  Note: Results for CMM and BetaPro taken from Burkoff et al. (2013). For the indices see Section 3.2.