Motivation: Protein fold recognition is an important approach to structure discovery that does not rely on sequence similarity. We study this approach with new multi-class classification methods and examine many issues important for a practical recognition system.

Results: Most current discriminative methods for protein fold prediction use the one-against-others method, which has the well-known ‘False Positives’ problem. We investigated two new methods: the unique one-against-others and the all-against-all methods. Both improve prediction accuracy by 14–110% on a dataset containing 27 SCOP folds. We used the Support Vector Machine (SVM) and the Neural Network (NN) learning methods as base classifiers. SVMs converge fast and lead to high accuracy. When scores of multiple parameter datasets are combined, majority voting reduces noise and increases recognition accuracy. We examined many issues that arise with a large number of classes, including the dependence of prediction accuracy on the number of folds and on the number of representatives in a fold. Overall, the recognition systems achieve 56% fold prediction accuracy on a protein test dataset in which most of the proteins have below 25% sequence identity with the proteins used in training.
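As an illustration of the all-against-all idea, the following minimal sketch trains one two-class classifier for every pair of folds and lets each classifier cast one vote; the fold with the most votes is predicted. It is not the implementation used in this work: the nearest-centroid base classifier stands in for the SVM and NN classifiers, and the function names, fold labels and toy feature vectors are all hypothetical.

```python
from collections import Counter
from itertools import combinations


def centroid(points):
    """Component-wise mean of a list of equal-length feature vectors."""
    n = len(points)
    return [sum(col) / n for col in zip(*points)]


def sq_dist(a, b):
    """Squared Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def train_all_against_all(samples):
    """samples maps a fold label to its training feature vectors.

    Returns one two-class model per unordered pair of folds, i.e.
    K * (K - 1) / 2 models for K folds. Each model here is just the
    pair of class centroids (an assumed stand-in for an SVM or NN).
    """
    cents = {label: centroid(pts) for label, pts in samples.items()}
    return {(a, b): (cents[a], cents[b])
            for a, b in combinations(sorted(cents), 2)}


def predict(models, x):
    """Each pairwise model votes for one of its two folds."""
    votes = Counter()
    for (a, b), (cent_a, cent_b) in models.items():
        winner = a if sq_dist(x, cent_a) <= sq_dist(x, cent_b) else b
        votes[winner] += 1
    # Ties are broken by label order, an arbitrary choice for this sketch.
    return max(sorted(votes), key=lambda label: votes[label])


# Hypothetical toy data: three folds, two-dimensional feature vectors.
toy = {
    "fold_a": [[0.0, 0.0], [0.2, 0.1]],
    "fold_b": [[1.0, 1.0], [0.9, 1.2]],
    "fold_c": [[2.0, 0.0], [2.1, 0.2]],
}
models = train_all_against_all(toy)
print(len(models))                   # number of pairwise classifiers
print(predict(models, [1.9, 0.1]))   # predicted fold for a query vector
```

With 27 folds this scheme would require 27 × 26 / 2 = 351 pairwise classifiers, compared with 27 classifiers for one-against-others.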

Supplementary information: The protein parameter datasets used in this paper are available online (http://www.nersc.gov/~cding/protein).

Contact: chqding@lbl.gov; ildubchak@lbl.gov

