Matthew N. Davies, Andrew Secker, Alex A. Freitas, Miguel Mendao, Jon Timmis, Darren R. Flower, On the hierarchical classification of G protein-coupled receptors, Bioinformatics, Volume 23, Issue 23, December 2007, Pages 3113–3118, https://doi.org/10.1093/bioinformatics/btm506
Abstract
Motivation: G protein-coupled receptors (GPCRs) play an important role in many physiological systems by transducing an extracellular signal into an intracellular response. Over 50% of all marketed drugs are targeted towards a GPCR. There is considerable interest in developing an algorithm that could effectively predict the function of a GPCR from its primary sequence. Such an algorithm is useful not only in identifying novel GPCR sequences but in characterizing the interrelationships between known GPCRs.
Results: An alignment-free approach to GPCR classification has been developed using techniques drawn from data mining and proteochemometrics. A dataset of over 8000 sequences was constructed to train the algorithm. This represents one of the largest GPCR datasets currently available. A predictive algorithm was developed based upon the simplest reasonable numerical representation of the protein's physicochemical properties. A selective top-down approach was developed, which used a hierarchical classifier to assign sequences to subdivisions within the GPCR hierarchy. The predictive performance of the algorithm was assessed against several standard data mining classifiers and further validated against Support Vector Machine-based GPCR prediction servers. The selective top-down approach achieves significantly higher accuracy than standard data mining methods in almost all cases.
Contact: [email protected]
1 INTRODUCTION
1.1 The G protein-coupled receptors
The G protein-coupled receptors (GPCRs) are composed of a diverse range of integral membrane proteins that regulate many important physiological functions (Bissantz et al., 2003; Christopoulos and Kenakin, 2002; Gether et al., 2002). GPCRs control and/or affect processes as diverse as neurotransmission, cellular metabolism, secretion, cellular differentiation and inflammatory responses (Hebert and Bouvier, 1998). The binding of a ligand on the cell surface causes the GPCR to become active, and subsequently bind and activate ubiquitous guanine nucleotide-binding regulatory (G) proteins within the cytosol. An extremely heterogeneous set of molecules including ions, hormones, neurotransmitters, peptides and proteins can act as ligands to a GPCR. The GPCRs are a common target for therapeutic drugs and ∼50% of all marketed drugs target a GPCR (Flower, 1999; Klabunde and Hessler, 2002). In spite of their functional and sequence diversity, there are certain structural features common to all GPCRs. All GPCRs contain seven highly conserved transmembrane segments. Their sequences also contain three extracellular loops (EL1-3), three intracellular loops (IL1-3) as well as the protein N and C termini. The transmembrane segments form seven α-helices in a flattened two-layer structure known as the transmembrane bundle, a structure seen in all GPCRs (Milligan, 2006). The GPCRs show far greater conservation of three-dimensional structure than of primary sequence.
The diversity of the GPCRs means it is difficult to develop a comprehensive classification system for all of the GPCR subtypes (Davies et al., 2007). One of the first GPCR classification systems was introduced by Kolakowski for the now defunct GCRDb database (Kolakowski, 1994). GPCRs were divided into seven groups, designated A–F and O, derived from original standard similarity searches. This system was further developed for the GPCRDB database (Horn et al., 2003), which divides the GPCRs into six classes. These are the Class A Rhodopsin-like, which account for over 80% of all GPCRs, Class B Secretin-like, Class C Metabotropic glutamate receptors, Class D Pheromone receptors, Class E cAMP receptors and the Class F Frizzled/smoothened family. Class A is the largest of the human GPCR subtypes. There are at least 286 human non-olfactory Class A receptors, the majority of which bind peptides, biogenic amines or lipid-like substances (Fridmanis et al., 2006). Class B receptors bind large peptides such as secretin, parathyroid hormone, glucagon, calcitonin, vasoactive intestinal peptide, growth hormone releasing hormone and pituitary adenylyl cyclase activating protein (Cardoso et al., 2006). Class C Metabotropic glutamate receptors (mGluRs) are a type of glutamate receptor that are activated through an indirect metabotropic process (Das and Banker, 2006). Like all glutamate receptors, mGluRs bind to glutamate, an amino acid that functions as an excitatory neurotransmitter. There are two further GPCR families that are considerably smaller. Class D is composed of pheromone receptors, which are used by organisms for chemical communication (Nakagawa et al., 2005), while Class E, the cAMP receptors, forms part of the chemotactic signalling system of slime molds (Prabhu and Eichinger, 2006). There is also an additional minor class, the Frizzled/Smoothened receptors, which are necessary for Wnt binding and the mediation of hedgehog signalling, a key regulator of animal development (Foord et al., 2002). The six different classes can be further divided into sub-families and sub-subfamilies based upon the function of the GPCR protein and the specific ligand to which it binds. In this article, the following terminology is used to describe the classification of GPCR sequences: the six major GPCR families are termed ‘Classes’, the secondary level of classification is termed ‘Sub-families’ and the third level of classification is termed ‘Sub-subfamilies’. Not all human GPCRs can be effectively classified using this system: there are approximately 60 ‘orphan’ GPCR proteins that show the sequence properties of Class A Rhodopsin-like receptors but for which there are no defined ligands or functions (Gloriam et al., 2005). It is possible that many of these orphan receptors have ligand-independent properties, specifically the regulation of ligand-binding GPCRs on the cell surface.
1.2 GPCR prediction servers
Previous attempts at predicting the function of a GPCR from its primary sequence, and therefore its position within a given hierarchical system, have included motif-based classification tools (Attwood, 2001; Attwood et al., 2002; Flower and Attwood, 2004) and machine-learning methods such as Hidden Markov Models (Wistrand et al., 2006). These approaches have applications not only in discovering and characterizing novel protein sequences but also in better understanding relationships between known GPCRs. The majority of predictive techniques, however, have used Support Vector Machines (SVMs) (Karchin et al., 2002), machine-learning algorithms based on statistical learning theory. In two-class problems, a SVM maps the input vectors (data points representing protein descriptions) into a higher dimensional feature space and then constructs the optimal hyperplane to separate the classes, while avoiding overfitting. This is a powerful form of classification because, although it is linear in the higher dimensional feature space, it is non-linear in the original attribute space of the input vectors. The optimal hyperplane is the one with the maximum distance to the closest data point from each of the two classes. This distance is called the margin, and the optimal hyperplane is called the maximal margin hyperplane. The input vectors closest to the optimal hyperplane are called the support vectors. Although SVMs are more commonly used to solve two-class problems, the technique can be applied to the classification of GPCR data by repeatedly classifying one class against all of the others (Karchin et al., 2002).
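As an illustration of this one-versus-rest scheme, the following minimal Python sketch (scikit-learn assumed; the feature matrix `X` and labels `y` are placeholders, not the data of any of the servers discussed here) trains one binary SVM per GPCR class:

```python
# Minimal one-versus-rest SVM sketch (assumes scikit-learn).
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import SVC

def train_one_vs_rest(X, y):
    # One binary SVM is fitted per class, each separating that class
    # from the union of all the others; prediction picks the class
    # whose SVM returns the largest decision value.
    model = OneVsRestClassifier(SVC(kernel="rbf"))
    return model.fit(X, y)
```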
Several publicly available SVM-based GPCR classifiers exist. PRED-GPCR (http://athina.biol.uoa.gr/bioinformatics/PRED-GPCR/) (Guo et al., 2005; Papasaikas et al., 2004) combines the fast Fourier transform with SVMs, working from the hydrophobicity of the amino acid sequence. Quantitative descriptions of the proteins relating to hydrophobicity, bulk and electronic properties were derived from the hydrophobicity model, the composition-polarity-volume (c-p-v) model and the electron–ion interaction potential (EIIP) model. Three different hydrophobicity scales were used: the Kyte-Doolittle Hydrophobicity (KDHΦ), Mandell Hydrophobicity (MHΦ) and Fauchère Hydrophobicity (FHΦ). The sequences are transformed, first, into numerical representations based upon the EIIP values and, second, into the frequency domain using the discrete Fourier transform, a method by which sequences of different lengths can be normalized. The output of these transformations is used as the input for the SVM. In the case of an n-class classification problem where n > 2, as is the case for the GPCR families, one SVM is trained for each class i, i = 1, …, n. When using the FHΦ hydrophobicity scale, the technique achieved a reported accuracy of 93.3% and a Matthews correlation coefficient of 0.95. However, the accuracies across the sub-families varied between 66.7% and 100% (Papasaikas et al., 2004).
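The general idea of such a spectral encoding can be sketched as follows; this is illustrative only, since neither the EIIP/hydrophobicity scale values nor the exact normalization used by PRED-GPCR are reproduced here (the `scale` mapping is a stand-in):

```python
import numpy as np

def spectral_features(sequence, scale, n_points=512):
    """Map residues to numbers via `scale` (e.g. an EIIP or
    hydrophobicity table; values not reproduced here), zero-pad to a
    common length and return the magnitude spectrum, yielding a
    fixed-length vector regardless of sequence length."""
    signal = np.array([scale[aa] for aa in sequence if aa in scale],
                      dtype=float)
    padded = np.zeros(n_points)
    padded[:min(len(signal), n_points)] = signal[:n_points]
    return np.abs(np.fft.rfft(padded))
```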
GPCRpred is another SVM-based classifier; it determines whether a sequence is or is not a GPCR, if it is a GPCR, to which class it belongs, and then, if it is a Class A protein, to which sub-family it belongs (Bhasin and Raghava, 2004). The vectors are based upon the dipeptide composition, whereby each of the 400 possible pairs of amino acids is associated with a vector component representing the percentage of the primary sequence consisting of that pair. Again, a one-versus-rest SVM is used to characterize each Class and sub-family. The program was reported as having a 99.5% predictive accuracy at the GPCR versus non-GPCR level, 97.3% accuracy at the Class level and 85% accuracy at the sub-family level. A third server, GPCRsclass (Bhasin and Raghava, 2005), concentrates on the Class A aminergic receptor sub-family. In the first round of analysis, a SVM is generated to distinguish amines from all other GPCRs. Then multiclass SVMs are set up to classify amines into the acetylcholine, adrenoreceptor, dopamine and serotonin subgroups. The SVM requires patterns of fixed length for training and testing; the sequences are transformed to a fixed-length format by measuring the amino acid and dipeptide compositions, giving vectors of 20 and 400 dimensions, respectively. The dipeptide composition proved far more reliable than the amino acid composition, scoring 99.7% accuracy at discriminating amines from non-GPCRs and 92% at discriminating between the four sub-subfamilies. A similar method involving amino acid, dipeptide and tripeptide compositions (Guo et al., 2006) claimed a 98% accuracy at the Class level; GPCRsclass gave 94% accuracy at the class level when tested with the same dataset.
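The dipeptide composition described above is straightforward to compute. A minimal sketch (function and variable names are illustrative, not taken from either server):

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
DIPEPTIDES = ["".join(p) for p in product(AMINO_ACIDS, repeat=2)]

def dipeptide_composition(sequence):
    """Return the 400-dimensional dipeptide composition: the percentage
    of all overlapping residue pairs accounted for by each pair."""
    counts = {dp: 0 for dp in DIPEPTIDES}
    total = 0
    for i in range(len(sequence) - 1):
        pair = sequence[i:i + 2]
        if pair in counts:          # skip pairs with non-standard codes
            counts[pair] += 1
            total += 1
    return [100.0 * counts[dp] / total if total else 0.0
            for dp in DIPEPTIDES]
```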
Here we present a new selective top-down approach to GPCR prediction using a hierarchical classifier. The technique was validated, first, against standard data mining classifiers and, second, against several SVM-based GPCR predictive servers.
2 METHODS
2.1 GPCR dataset (GDS)
In order to develop an effective algorithm for the classification of GPCR sequences, it was necessary to build as large and comprehensive a dataset of GPCR sequences as possible with which to train and test the classifier. Protein sequences for the dataset were identified using the Entrez search and retrieval system (Wheeler et al., 2007). The system searches protein databases such as SwissProt, PIR, PRF and PDB as well as translations from annotated coding regions in DNA databases such as GenBank and RefSeq. Text-based searching was used to identify all sequences within each sub-subfamily of the hierarchy. These composite groups were then used to build each GPCR sub-family and Class-level dataset. The datasets contain only human protein sequences, with the exception of Class D proteins, which are found only in fungi, and Class E proteins, which are found in Dictyostelium. All proteins shorter than 280 amino acids in length were removed in order to eliminate incomplete protein sequences, and all identical sequences within the dataset were removed to avoid redundancy. This left 8354 protein sequences in 5 classes at the family level (A–E), 40 classes at the sub-family level and 108 classes at the sub-subfamily level. Class F was not considered as it contains too few sequences from which to develop an accurate classification algorithm (this class has also been excluded from the PRED-GPCR and GPCRpred predictive programs). For the sake of convenience, this dataset will be referred to as the GDS (GPCR Dataset).
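The two filtering steps reduce to a short routine; a minimal sketch, assuming records arrive as (identifier, sequence) pairs (the record format and function name are assumptions):

```python
def clean_dataset(records):
    """Drop sequences shorter than 280 residues (likely fragments) and
    exact duplicates, mirroring the GDS preparation described above."""
    seen = set()
    kept = []
    for ident, seq in records:
        if len(seq) < 280 or seq in seen:
            continue
        seen.add(seq)
        kept.append((ident, seq))
    return kept
```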
2.2 Sequence representation
Rather than use the primary sequence to perform the classification, the system uses an alternative form of protein data representation. Alignment-independent classification systems use the physicochemical properties of amino acids to determine differences between protein sequences. Proteochemometrics is a technique whereby 5 ‘z-values’ (z1–z5) are derived from 26 real physicochemical properties through the application of principal component analysis (Lapinsh et al., 2002; Sandberg et al., 1998). The z1 value accounts for the amino acid's lipophilicity, the z2 value accounts for steric properties such as bulk and polarisability, and the z3 value describes the polarity of the amino acid. The electronic effects of the amino acids are described by the z4 and z5 values. These five values are calculated for each amino acid in the sequence, generating a matrix that provides a purely numerical description of the protein's character. Several sequences in the GDS contained non-standard amino acid codes not present in the table of z-values. In such cases, the following substitutions were made. Where the sequence contained a ‘B’ (either an asparagine or an aspartic acid), the residue was assigned as an asparagine ‘N’. Where the sequence contained a ‘Z’ (either a glutamine or a glutamic acid), the residue was assigned as a glutamine ‘Q’. Where the sequence contained a ‘U’, indicating selenocysteine, the residue was changed to a cysteine ‘C’. All unknown residues ‘X’ were given as alanines ‘A’.
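These substitution rules amount to a small lookup applied before the z-values are read; a minimal sketch:

```python
# Non-standard residue substitutions as described above, applied
# before z-value lookup.
SUBSTITUTIONS = {"B": "N", "Z": "Q", "U": "C", "X": "A"}

def normalise_sequence(sequence):
    return "".join(SUBSTITUTIONS.get(aa, aa) for aa in sequence)
```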
To obtain a fixed-length description, each z-value is then averaged over the residues of the sequence:

$$\bar{z}_j = \frac{1}{n}\sum_{i=1}^{n} z_j(a_i)$$

where $z_j(a_i)$ is z-value $j$ of the $i$th amino acid $a_i$ in a sequence of length $n$.
The equation above is therefore applied five times, once for each attribute, where each attribute corresponds to a z-value. In this investigation, we use an augmented version of this attribute creation method in which 15 attributes are used to describe each protein. Five are created as described above; in addition, five more are created from the N-terminus of the protein and a further five using the C-terminus. The termini of a GPCR protein can be powerful predictors of function, since the ends of the GPCR will be involved in either intracellular or extracellular binding. Therefore, in the case of the N-terminus, the means for each of the five z-values are computed over only the first 150 amino acids, while in the case of the C-terminus, the means are computed over the last 150 amino acids. In reality, the actual lengths of the N- and C-termini will vary between GPCRs; the value of 150 amino acids was found, in controlled experiments, to give the largest improvement in predictive accuracy (data not shown).
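A minimal sketch of this 15-attribute representation, assuming a `z_table` mapping each standard amino acid to its (z1, …, z5) tuple (the table itself, from Sandberg et al., is not reproduced here):

```python
import numpy as np

def attribute_vector(sequence, z_table, term_len=150):
    """Build the 15-attribute description: mean z1-z5 over the whole
    sequence, over the first `term_len` residues (N-terminus) and over
    the last `term_len` residues (C-terminus). Sequences are assumed
    already normalised to standard residues (see above)."""
    z = np.array([z_table[aa] for aa in sequence])  # shape (n, 5)
    return np.concatenate([
        z.mean(axis=0),               # whole sequence
        z[:term_len].mean(axis=0),    # N-terminal region
        z[-term_len:].mean(axis=0),   # C-terminal region
    ])
```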
2.3 Classification algorithm
In order for the algorithm to be effective, it must be able to predict protein function based on an established classification system for the GPCRs. The GPCRDB database suggests a natural hierarchy for GPCR sequences and so it is the one used for the GDS [although alternative hierarchies exist, such as the GRAFS classification system (Schioth and Fredriksson, 2005)]. In the data mining literature, there exists a range of strategies for predicting hierarchical classes (Freitas and de Carvalho, 2007). The simplest is to flatten the dataset to the lowest level of the hierarchy, then use one of the plethora of standard classification algorithms to predict to which class each sequence belongs. However, this strategy does not take advantage of the information implicit in the class structure. This potentially results in a loss of accuracy that can be further compounded by the huge number of classes at the tree's most specific level. An alternative is the ‘big bang’ approach, which uses a single, and typically complex, hierarchical classification algorithm. This method implicitly takes into account the class hierarchy during training. In the test phase, each example may be assigned to one class at each level of the hierarchy by a single application of the learned model. Perhaps due to its complexity, implementations of such an approach are scarce, although such a model has been used to predict gene function in Saccharomyces cerevisiae (Clare and King, 2003).
A middle ground between these two strategies is the top-down approach, where the hierarchical classification process is converted into a number of flat classification problems that may be solved independently by running a standard classifier for each (Costa et al., 2007; Freitas and de Carvalho, 2007). The advantage of this strategy over the others is that, as is the case with flat classification, no special classifier must be written to perform the task (other than the scaffolding required to support a classifier tree). The structure of the tree aids the classifier and reduces the number of classes that must be considered at the most specific level (see Fig. 1).

Fig. 1. Example of a hierarchical dataset (A) and how that hierarchy may be reflected in a tree of classifiers (B), ready for a top-down approach to classification.
The standard top-down approach proceeds as follows. Given, for example, the class tree in Figure 1A, a tree of classifiers is built to reflect the structure of the classes, as shown in Figure 1B. Thus, a tree of classifiers is generated such that the output of one classifier constitutes the input for another. The number of layers of classifiers will be equal to the number of levels represented by the class attribute. To train the classifiers in the hierarchy, all data in the training set are used to train the root classifier, while only the relevant subsets of data are used to train at the levels of the sub-family and the sub-subfamily. When an unknown sequence is presented to the classifier tree, the root-level classifier will assign it a class and then pass it down to the appropriate classifier at the next level, until the sequence is assigned to a sub-family and a sub-subfamily.
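The scheme can be expressed as a small recursive structure; the following is an illustrative Python sketch (scikit-learn's fit/predict convention assumed; this is not the authors' WEKA code):

```python
import numpy as np

class ClassifierNode:
    """One node of the classifier tree in Figure 1B. `clf` is any flat
    classifier following scikit-learn's fit/predict convention;
    `children` maps each label this node can predict to the node
    handling the next level down."""
    def __init__(self, clf, children=None):
        self.clf = clf
        self.children = children or {}

    def fit(self, X, Y):
        # Y holds one column per remaining level; column 0 is this
        # node's target. Each child sees only its relevant subset.
        self.clf.fit(X, Y[:, 0])
        for label, child in self.children.items():
            mask = Y[:, 0] == label
            child.fit(X[mask], Y[mask, 1:])

    def predict(self, x):
        # Descend from this node, collecting one label per level.
        label = self.clf.predict([x])[0]
        child = self.children.get(label)
        return [label] + (child.predict(x) if child else [])
```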
A novel, modified version of the top-down approach was developed and used as the chosen strategy for classifying the GDS. The top-down approach takes advantage of the hypothesis that some characteristics may be important for discriminating between two protein subsets at one classification level while being less important at another. The top-down approach exploits this, as any classifier in the tree is trained using only data instances of the classes it is required to discriminate between. In the standard top-down approach, the same classification algorithm is used in each node of the class tree. It is, however, possible that different classifiers may be more suited to different nodes in the class tree, and that classification accuracy may therefore be increased by using different algorithms across the classifier hierarchy. Importantly, these classifiers are selected in a data-driven manner using training data. This is referred to as the selective top-down approach.
The selective top-down approach generates a tree of classifiers in a similar manner to the standard top-down approach, but with some additional stages during training. At each node, the training data for that node is split into sub-training and validation sets, with data instances being assigned randomly. A number of different classifiers are then trained using this sub-training data and tested using the validation set. The classifier that yields the highest classification accuracy on the validation set is selected as the classifier for this node in the class tree. The sub-training and validation sets are then merged to reproduce the original training set (for that node), and the selected classifier is re-trained. Ten standard classification techniques were used (Witten and Frank, 2005): the Naïve Bayes method, a Bayesian network, a SVM (Keerthi et al., 2001), nearest neighbour (using Euclidean distance), a decision list (Frank and Witten, 1998), J48 (a decision tree algorithm much like C4.5), a Naïve Bayes tree (a decision tree with a naïve Bayes classifier at each node), a multi-layer neural network with back propagation, AIRS2 [a classifier based on the Artificial Immune System paradigm (Watkins et al., 2004)] and a conjunctive rule learner. This list of classifiers was chosen to cover a wide range of paradigms. All code was written using the WEKA data mining package (Brownlee et al., 2007), with the default parameters used for each algorithm.
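At each node, the selective step therefore reduces to: split, train every candidate, keep the best, re-train on the merged data. A minimal sketch (scikit-learn assumed; as noted above, the authors worked in WEKA):

```python
from sklearn.base import clone
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

def select_classifier(candidates, X, y, val_fraction=0.2, seed=0):
    """Split a node's data into sub-training and validation sets, train
    every candidate (a list of unfitted classifiers), keep the most
    accurate on the validation set and re-train it on all node data."""
    X_sub, X_val, y_sub, y_val = train_test_split(
        X, y, test_size=val_fraction, random_state=seed)
    best, best_acc = None, -1.0
    for clf in candidates:
        fitted = clone(clf).fit(X_sub, y_sub)
        acc = accuracy_score(y_val, fitted.predict(X_val))
        if acc > best_acc:
            best, best_acc = clf, acc
    # Merge the splits back and re-train the winner on the full data.
    return clone(best).fit(X, y)
```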
3 RESULTS
Two separate studies were undertaken to assess the quality of the selective top-down technique. The first compared the effectiveness of the approach against the standard top-down technique. The second compared the accuracy of the algorithm with three publicly available GPCR classifiers, testing it against the datasets that were used to train those servers. Where accuracies are reported for each level, the accuracy is computed as the percentage of correct classifications at that level.
3.1 Cross-validation experiments
In order to assess the quality of the selective top-down classifier, it was tested on the prepared GDS dataset. All experiments were carried out using a 10-fold cross-validation approach. As data instances are added randomly to each fold, the test was repeated 30 times and the mean values are reported. Whilst data instances were randomly assigned to folds, care was taken to ensure that at least one instance of each class was present in each fold. For this reason, any class containing fewer than 10 examples was discarded for this test. This left 87 classes at the sub-subfamily level, 38 at the sub-family level and 5 classes at the family level. In total, 8222 proteins remained in the dataset. When training the selective top-down classifier, each of the 10 classifiers was trained using 80% of the training data (sub-training set) available to that node, and evaluated using the remaining 20% (validation set).
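This evaluation protocol amounts to repeated stratified 10-fold cross-validation; a sketch of the loop (scikit-learn assumed; `build_model` is a hypothetical factory returning a fresh, unfitted classifier):

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

def repeated_cv_accuracy(build_model, X, y, folds=10, repeats=30, seed=0):
    """10-fold cross-validation with random fold assignment, repeated
    30 times; stratification keeps every class represented per fold.
    Returns the mean accuracy over all folds and repeats."""
    accs = []
    for r in range(repeats):
        skf = StratifiedKFold(n_splits=folds, shuffle=True,
                              random_state=seed + r)
        for train_idx, test_idx in skf.split(X, y):
            model = build_model()
            model.fit(X[train_idx], y[train_idx])
            accs.append(np.mean(model.predict(X[test_idx]) == y[test_idx]))
    return float(np.mean(accs))
```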
To validate the algorithm, results for a standard top-down approach are shown for a selection of different classifiers (see Table 1). Extended results for this test have been previously reported (Secker et al., 2007). A value denoting the significance of the difference between the accuracy of the selective approach and each particular algorithm was computed using the corrected resampled t-test (Witten and Frank, 2005). This test attempts to eliminate the issues encountered when a standard t-test is used over multiple runs of a cross-validation procedure. Due to space constraints the figures are not reported; instead, a shaded cell indicates that the corresponding accuracy value of the selective top-down classifier is significantly greater than the shaded value. The significance threshold was set at 1% and a 2-tailed test was used. The results show that the selective top-down approach performed significantly better than the standard classifiers in almost all cases. The 3-nearest neighbours classifier was predominantly chosen by the selective approach at the top level, and as such it is no surprise that there is no statistically significant difference between it and the selective approach at this level. One disadvantage of both the selective and standard top-down approaches is that any example misclassified at one level has no possibility of being correctly classified at a deeper level; misclassifications therefore accrue as the level depth increases.
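For reference, the correction in question (due to Nadeau and Bengio, and the basis of WEKA's corrected paired tester) inflates the variance of the mean paired difference to account for the overlap between training sets across runs. With $k$ paired accuracy differences $d_1, \dots, d_k$, mean $\bar{d}$ and sample variance $\hat{\sigma}_d^2$, and $n_1$ training and $n_2$ test instances per run, the statistic is:

$$t = \frac{\bar{d}}{\sqrt{\left(\frac{1}{k} + \frac{n_2}{n_1}\right)\hat{\sigma}_d^{2}}}$$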
Table 1. Predictive accuracy (%) of the selective top-down technique at each level compared against several standard classifiers

| Classifier | Family (%) | Sub-family (%) | Sub-subfamily (%) |
| --- | --- | --- | --- |
| Selective top-down | 95.87 | 80.77 | 69.98 |
| Naïve Bayes | 77.29 | 52.60 | 36.66 |
| Bayesian network | 85.54 | 64.27 | 50.69 |
| SMO | 80.21 | 56.67 | 35.96 |
| Nearest neighbour | 95.87 | 78.68 | 69.40 |
| PART | 93.27 | 78.73 | 65.68 |
| J48 | 92.93 | 77.49 | 64.30 |
| Naïve Bayes tree | 93.07 | 76.92 | 64.78 |
| AIRS2 | 91.98 | 74.58 | 62.68 |
| Conjunctive rules | 76.19 | 49.93 | 16.49 |

A shaded cell indicates that the corresponding accuracy value of the selective top-down classifier is significantly greater than the shaded value.
3.2 Empirical comparison with GPCR classification servers
While there is evidence that the novel selective top-down approach provides better classification accuracy than standard classifiers, it is important to validate the novel approach by testing with other datasets and against other classifiers specific for GPCR prediction. The PRED-GPCR, GPCRpred and GPCRsclass servers, all three of which are publicly available, were selected for this purpose. Additionally, the datasets that were used to train and test the three servers were kindly supplied by their authors. The GPCRpred dataset is composed of 1008 Class A sequences, 56 Class B, 16 Class C, 11 Class D and 3 Class E, making a total of 1096 sequences. The PRED-GPCR program was trained using 403 sequences from 17 sub-families from GPCR Classes B, C, D and F. The GPCRsclass dataset is composed of 4 amine sub-subfamilies: 31 acetylcholine sequences, 44 adrenoreceptor, 38 dopamine and 54 serotonin, making a total of 167 sequences. For a full assessment of the technique, it was necessary to run all of the datasets against the developed algorithm and the three servers.
For this test, the selective top-down classifier was trained using the GDS dataset (8354 protein sequences) and then tested using each of the GPCR server datasets as test data. This simulates the situation in which the selective top-down approach, trained with the GDS dataset, could be deployed as a public server. The predictive accuracy at each level of the hierarchy is shown (see Table 2). Results are grouped by test dataset so that the quality of the classification can be directly compared between servers. In the experiments, every classification method has been tested using every dataset and the resultant classification accuracies are presented. For the sake of completeness, the table includes the instances where a method has been tested using its own dataset, although it is acknowledged that these values are of limited use, as the method would have been trained and tested using the same data. Rows in the table where this occurs have been italicized, as the figures contained in these rows represent results heavily biased in favour of that particular classifier.
Table 2. Benchmark results of the GPCR datasets comparing the GPCR servers against the selective top-down approach

| Test dataset | Server | Class | Sub-family | Sub-subfamily |
| --- | --- | --- | --- | --- |
| GDS | Selective top-down | 99.6% | 91.8% | 87.0% |
| GDS | PRED-GPCR | 73.2% | 72.2% | 67.6% |
| GDS | GPCRpred | 64.7% | 46.1% | – |
| GDS | GPCRsclass | – | – | 94.0% |
| PRED-GPCR | Selective top-down | 96.3% | 85.7% | 76.6% |
| PRED-GPCR | *PRED-GPCR* | *95.1%* | *95.1%* | *94.5%* |
| PRED-GPCR | GPCRpred | 70.1% | 55.6% | – |
| PRED-GPCR | GPCRsclass | – | – | 83.0% |
| GPCRpred | Selective top-down | 92.1% | 76.2% | 57.8% |
| GPCRpred | PRED-GPCR | 80.7% | 73.8% | 59.9% |
| GPCRpred | *GPCRpred* | *87.2%* | *67.1%* | – |
| GPCRpred | GPCRsclass | – | – | 100.0% |
| GPCRsclass | Selective top-down | 100.0% | 82.3% | 78.1% |
| GPCRsclass | PRED-GPCR | 100.0% | 100.0% | 92.8% |
| GPCRsclass | GPCRpred | 65.2% | 59.7% | – |
| GPCRsclass | *GPCRsclass* | – | – | *82%* |

Italicized rows show a method tested on its own training dataset; these figures are heavily biased in favour of that classifier.
The accuracy of the selective top-down approach generally exceeds that of PRED-GPCR and GPCRpred at the Class level and is comparable at the sub-family level. Both the selective top-down approach and PRED-GPCR are shown to be notably better than GPCRpred at all levels of the hierarchy. GPCRsclass was the most successful classifier at the most specific level, but this is likely because it can only be applied at the sub-subfamily level and is therefore highly specialized. The other classifiers, however, have to classify at all three levels, and in the case of the selective top-down classifier, accuracy at the sub-subfamily level will suffer from misclassification at the Class and sub-family stages.
4 CONCLUSION
The classification of GPCR sequences has proven resistant to conventional bioinformatics classification approaches such as sequence similarity or the identification of specific motifs. However, the structural and functional consistency of GPCR proteins suggests that there is an overall conservation of certain key properties that are necessary to maintain the transmembrane bundle that characterizes the group. The effectiveness of proteochemometrics for this type of analysis has already been demonstrated by previous research; however, this is the first time an alignment-free approach has been used on a dataset of this size. A straightforward representation was used that was a development of previously published work. While it worked well in this instance, we expect that other, more complex representations will be necessary as we extend this work to other problems in bioinformatics. The advantages of the selective top-down approach over standard (‘flat’ classification) data mining techniques and the current GPCR servers are clearly demonstrated by the accuracies achieved. This demonstrates that each stage of the classification problem depends on distinct criteria.
Any supervised learning (classification) algorithm has intrinsic limitations. For example, a classification model constructed from a training set can only have a chance of good predictive accuracy on a test set that is derived from the same (or at least a similar) probability distribution as the training set. If an unusual class distribution in the training set was used to build a classification model, it is unlikely that the classification model would have a very high predictive accuracy if applied to a large set of GPCR sequences with a more usual class distribution. Both PRED-GPCR and GPCRpred struggled to accommodate the full diversity of the GDS, while the selective top-down approach proved to be adaptable to both a generalized dataset (PRED-GPCR) and a specialized one (GPCRsclass).
ACKNOWLEDGEMENTS
The authors would like to gratefully acknowledge funding under EPSRC grant EP/D501377/1. The authors also wish to extend their thanks to G.P.S. Raghava and P.K. Papasaikas for providing their datasets for the purposes of comparison. We are also deeply indebted to Prof. Teresa K. Attwood for her detailed critique of the manuscript.
Conflict of Interest: none declared.
Author notes
Associate Editor: John Quackenbush