Abstract

Motivation

Proteins often include multiple conserved domains. Various evolutionary events including duplication and loss of domains, domain shuffling, as well as sequence divergence contribute to generating complexities in protein structures, and consequently, in their functions. The evolutionary history of proteins is hence best modeled through networks that incorporate information both from the sequence divergence and the domain content. Here, a game-theoretic approach proposed for protein network construction is adapted into the framework of multi-objective optimization, and extended to incorporate clustering refinement procedure.

Results

The new method, MOCASSIN-prot, was applied to cluster multi-domain proteins from ten genomes. The performance of MOCASSIN-prot was compared against two protein clustering methods, Markov clustering (TRIBE-MCL) and spectral clustering (SCPS). We showed that compared to these two methods, MOCASSIN-prot, which uses both domain composition and quantitative sequence similarity information, generates fewer false positives. It achieves more functionally coherent protein clusters and better differentiates protein families.

Availability and implementation

MOCASSIN-prot, implemented in Perl and Matlab, is freely available at http://bioinfolab.unl.edu/emlab/MOCASSINprot.

Supplementary information

Supplementary data are available at Bioinformatics online.

1 Introduction

Proteins often contain multiple domains. In eukaryotes, 70% or more are multi-domain proteins (Chothia and Gough, 2009; Levitt, 2009). These highly conserved protein regions are often associated with discrete functions and structures. The function of a multi-domain protein is thus the composite of the functions of its constituent domains, making domain-based evolutionary processes, such as domain shuffling and domain accretion, important mechanisms for functional innovation (Koonin et al., 2000; Graur, 2016). Protein families, therefore, can be defined based on the domain composition and sometimes the specific arrangements of domains (Chothia and Gough, 2009; Vogel et al., 2004).

Due to the importance in elucidating protein functions, diverse approaches have been proposed to classify protein sequences into families. Integration of evolutionary contexts through phylogenetic analysis can improve the accuracy of functional classification (Sjölander, 2004). Phylogenetic analysis is usually performed based on a multiple alignment of protein sequences. Several issues exist in this approach when it is applied especially at the proteome level. First, the sequences are not alignable in their entirety when the proteins have multiple domains in varied composition and arrangement, limiting the application to proteins that share at least one alignable domain. Each protein subgroup needs to be independently analyzed based on different sets of domains. Another drawback is that traditional phylogenetic analysis assumes evolution to be a bifurcating process. However, reticulate evolutionary events, such as domain shuffling, lead to evolutionary histories that are more accurately represented by networks. Furthermore, many multiple sequence alignment and phylogenetic reconstruction methods simply do not scale at the genomic level.

In the late 1990s, network-based approaches for protein clustering began to emerge (Enright et al., 1999; Enright and Ouzounis, 2000; Marcotte, 1999; Pellegrini et al., 1999; Tatusov, 1997). Most of these methods did not scale well to increasingly large datasets, which led to the development of additional clustering methodologies (Atkinson et al., 2009; Enright, 2002; Miele et al., 2012; Pipenbacher et al., 2002; Paccanaro et al., 2006; Wittkop et al., 2010). Although many of these methods, including the Markov clustering algorithm (Enright, 2002; Van Dongen, 2000) and spectral clustering (Nepusz et al., 2010; Paccanaro et al., 2006), have demonstrated the ability to cluster proteins on a large scale, they utilize similarity information from only one local region between each pair of proteins and do not consider domain architectures.

Phylogenetic profile methods (Bhardwaj et al., 2012; Chang et al., 2008; Pellegrini et al., 1999) and gene fusion methods (Enright et al., 1999; Marcotte, 1999) were some of the first attempts at incorporating domain architecture information into protein family detection. Domain co-occurrence networks (Wang et al., 2011; Wuchty and Almaas, 2005) and other related graph-theoretic approaches (Cohen-Gihon et al., 2007; Kummerfeld and Teichmann, 2009; Nacher et al., 2009; Przytycka et al., 2006; Xie et al., 2011) also incorporate domain composition (and sometimes domain order). However, these methods are employed to depict relationships among the domains, and the relationships between the proteins are usually not considered.

To understand functional construction of proteins in full, domain network relationships need to be integrated into protein network construction. Protein evolution is primarily governed by selective constraints on their sequences to maintain their functions and also by modularity of domains that allows functional innovation. With this assumption, we have previously introduced a game-theoretic method for constructing protein networks (Deng et al., 2013). In this study we adapt our approach into the framework of multi-objective optimization and extend our method by incorporating a clustering refinement procedure. MOCASSIN-prot (Multi-Objective Clustering Algorithm for Sequence Similarity Networks for proteins) not only provides protein classifications using network clustering, but also gives us a better interpretation of the relationships between proteins and domains. We illustrate the practical usage of MOCASSIN-prot, and show its better performance compared to two other methods, TRIBE-MCL and SCPS.

2 Materials and methods

A protein space is defined to be a set of proteins, each of which is in turn defined by a set of domains. Given a protein space and the corresponding domain space, we can construct a directed similarity network that gives us a set of protein clusters, where a protein cluster is defined to be a weakly connected component in the similarity network, i.e. a connected component in the graph where edge directionality is ignored (e.g. C1–C7 in Fig. 1). A directed protein similarity graph, G = (V, E), is a directed graph such that each vertex in the set V = {P1, P2,…, Pn} uniquely corresponds to one protein and all edges have nonzero weights with the incoming edges to any given vertex summing to 1.

Workflow for constructing a directed protein similarity network using MOCASSIN-prot. The domain architecture of each protein in the input sequence set can be identified by a profile hidden Markov model method such as HMMER3. For each of the proteins a similarity matrix is constructed using a log-transformed BLAST E-value as the similarity score. This matrix serves as the input to a multi-objective optimization problem. Each of these optimization problems is solved, and the solutions are used to construct a graph adjacency matrix for the protein network. This example shows the similarity matrix for protein P1. The ‘n/a’ entry in this matrix is used to show that the domain (d3) does not exist in this protein. In the final protein similarity network, protein clusters are labeled as C1-C7
Fig. 1.

Workflow for constructing a directed protein similarity network using MOCASSIN-prot. The domain architecture of each protein in the input sequence set can be identified by a profile hidden Markov model method such as HMMER3. For each of the proteins a similarity matrix is constructed using a log-transformed BLAST E-value as the similarity score. This matrix serves as the input to a multi-objective optimization problem. Each of these optimization problems is solved, and the solutions are used to construct a graph adjacency matrix for the protein network. This example shows the similarity matrix for protein P1. The ‘n/a’ entry in this matrix is used to show that the domain (d3) does not exist in this protein. In the final protein similarity network, protein clusters are labeled as C1-C7

The complete workflow of the algorithm to obtain the clusters for a protein set is shown in Figure 1. In the following sections, we briefly explain our methods using a small example, a subset of the Regulator of G-protein signaling (RGS) family proteins from the Mus musculus genome.

2.1 The MOCASSIN-prot algorithm

2.1.1 Similarity matrix construction

The network graph is constructed on a protein-by-protein basis. First, the domain architectures are identified for each of the proteins (Fig. 2). In this study, we used a profile hidden Markov model (HMM) method, the hmmscan program of HMMER3 (Eddy, 2011), to search against the Pfam protein families database (Finn et al., 2014) to identify non-overlapping domains (E-value threshold 1.0) on our proteins.

Domain architectures of the fourteen RGS proteins. The domains were identified using HMMER3 against the Pfam database. The RGS domain (PF00615.14) is shared among all RGS-family proteins. The domain structure images were generated using the domain graphic generator on the Pfam website (http://pfam.xfam.org/generate_graphic)
Fig. 2.

Domain architectures of the fourteen RGS proteins. The domains were identified using HMMER3 against the Pfam database. The RGS domain (PF00615.14) is shared among all RGS-family proteins. The domain structure images were generated using the domain graphic generator on the Pfam website (http://pfam.xfam.org/generate_graphic)

Next, we construct a set of similarity matrices using each protein as the reference. The amino acid sequences of all domains identified in the reference protein are compared to all other protein sequences using blastp with the default parameters (Camacho et al., 2009). The similarity matrix for the reference protein Pi, Ai = ai(s, j), is an m × n matrix, where ai(s, j) is the similarity score of domain s from protein Pi to protein Pj, m is the total number of domains identified on protein Pi, and n is the number of proteins in the protein space. For each entry in the similarity matrix we used a log-transform of the BLAST E-value, ai(s, j) = −log(αsj), where αsj is the E-value obtained for the query domain ds of protein Pi against the subject protein Pj. If a domain, dk, was not found on the subject protein Pj based on the given BLAST E-value threshold, then the similarity score is assigned to be an arbitrarily chosen very low value, ai(k, j) = −log(2870). It should be noted that in our current implementation, each domain is represented only once in the similarity matrix. If a protein includes repetitive domains, such information is not included.

2.1.2 Protein similarity network construction

The construction of the edge weights is based on the assumption that protein sequence evolution is primarily governed by selective constraints on their sequences to maintain their functions along with modularity of domains that allows functional innovation. This assumption leads to maximizing the sequence similarity between proteins along shared domains. The outcome is a set of protein clusters with similar domain architectures in the protein space.

For each protein Pi, we search for the other proteins in the protein space to which it is most uniquely similar (in terms of domain content). This is done using the multi-objective optimization method. It starts by defining Si to be the index of integers {1, 2,…, m} such that sSi if and only if ds is a domain of protein Pi. For ease of notation, let yj = wij for all j = 1, 2,…, n denote the weight of the edge from protein Pj to Pi. Thus we have yi = 0, jyj=1, and yj ≥ 0. The incoming edge weight vector y = (y1, y2,…, yn) to protein Pi is then chosen to simultaneously ‘maximize’ the set of mean similarity scores; i.e. y solves
(1)
In addition, we solve for the ‘maximal’ divergence distribution of protein Pi with respect to the domain space. That is, for each protein Pj its least similar (i.e. most divergent) domain to the reference protein Pi will be a domain, ds, having the smallest similarity score ai(s, j). Since this may happen on different domains for different proteins, the task is to find a domain distribution vector x to simultaneously ‘minimize’ all of the mean similarity scores; i.e. x solves
(2)
In the final step, we use the solutions of (1) and (2) for each of the proteins in the protein space to construct both a directed protein similarity network and a domain diversity profile. The edge weight wij from node Pj to Pi is assigned to be yj from the solution to (1) for node Pi, where a higher edge weight indicates that protein Pi is more similar to protein Pj than the other proteins in the protein space.

The solution vector x from (2) is referred to as the domain diversity vector for protein Pi. A high diversity weight, xs, indicates that the reference protein Pi is more dissimilar or unique to the other proteins with respect to domain ds. Arranging the x solution vectors for the proteins in the protein space into rows for their corresponding proteins yields the so-called domain diversity profile matrix.

2.1.3 The RGS protein similarity network and domain diversity profile

Figure 3 shows the similarity network for the RGS proteins. The five clusters found (C1–C5) are labeled according to their average optimal objective values [obtained from averaging the optimal objective values from (1) for the proteins in the cluster], in descending order. The optimal objective value for each protein yields important information about the intra-cluster similarities in the network, measuring how tight the connection of protein Pi is to the other proteins. The higher the value, the higher the mean similarity score fs for all sSi, which can be interpreted to mean that protein Pi is more conserved with respect to the proteins in the protein space. Moreover, for two topologically identical clusters, it is their average optimal objective values that set them apart. The cluster with a higher average optimal objective value is a ‘tighter’ or more similar subnetwork than the other. For example, C2 and C4 each include three proteins and are topologically identical. However, their average optimal objective values are 146.7074 and 119.9047, respectively, indicating that C2 is a ‘tighter’ or more conserved subnetwork than C4. Even though all fourtenn proteins in this dataset contain the RGS domain, the proteins in each of the network clusters differ from those in the other clusters with respect to their domain composition and varying levels of similarity between their domain sequences. This is clearly shown in the diversity profile across the RGS proteins exhibited in Figure 4, where the proteins are grouped according to the clusters (C1-C5).

MOCASSIN-prot network constructed from the fourteen RGS-family proteins. Five clusters were identified and labeled in descending order (from C1 to C5) according to their average optimal objective values. The nodes represent distinct proteins. The edges are directed so that the incoming edge weights of each node sum to 1. The edge color indicates the edge weight, from lighter (gray) to darker (black) color corresponding from lower to higher weights (edge weights are normalized to each reference protein). The optimal objective value for each protein is represented by the length of its incoming edges, with longer edges corresponding to smaller objective values (all incoming edges to a given node have the same edge length). The network was visualized using the prefuse force-directed layout in Cytoscape (Smoot et al., 2011). Cytoscape uses the Barnes-Hut approximation algorithm to produce the ‘best’ node layout with specified edge lengths; in some cases the incoming edge lengths are adjusted and not exactly the same
Fig. 3.

MOCASSIN-prot network constructed from the fourteen RGS-family proteins. Five clusters were identified and labeled in descending order (from C1 to C5) according to their average optimal objective values. The nodes represent distinct proteins. The edges are directed so that the incoming edge weights of each node sum to 1. The edge color indicates the edge weight, from lighter (gray) to darker (black) color corresponding from lower to higher weights (edge weights are normalized to each reference protein). The optimal objective value for each protein is represented by the length of its incoming edges, with longer edges corresponding to smaller objective values (all incoming edges to a given node have the same edge length). The network was visualized using the prefuse force-directed layout in Cytoscape (Smoot et al., 2011). Cytoscape uses the Barnes-Hut approximation algorithm to produce the ‘best’ node layout with specified edge lengths; in some cases the incoming edge lengths are adjusted and not exactly the same

Diversity profile of the RGS-family proteins generated by MOCASSIN-prot. Each row represents one of the fourteen RGS proteins, while each column represents one of the eight domains (the RGS domain is in bold). Clusters are separated by thick horizontal lines. Each cell is color-coded based on the diversity weight x: from checkered shading for xi = 0 (but domain exists) to black for xi = 1. Blank cells indicate domain absence. Proteins in each cluster (P1-P14) are arranged according to their optimal objective value, with the largest appearing as the lowest row in the cluster. The objective value for each protein is shown in parentheses next to the protein name
Fig. 4.

Diversity profile of the RGS-family proteins generated by MOCASSIN-prot. Each row represents one of the fourteen RGS proteins, while each column represents one of the eight domains (the RGS domain is in bold). Clusters are separated by thick horizontal lines. Each cell is color-coded based on the diversity weight x: from checkered shading for xi = 0 (but domain exists) to black for xi = 1. Blank cells indicate domain absence. Proteins in each cluster (P1-P14) are arranged according to their optimal objective value, with the largest appearing as the lowest row in the cluster. The objective value for each protein is shown in parentheses next to the protein name

For most of the clusters, specifically C1, C2 and C5, the proteins within the cluster all contain the same set of domains, and the weights placed on these domains are the same within each cluster. Sequence divergence for the same domain type (e.g. RGS domain for RGS 17 versus RGS 19/20 proteins) is recognized in separating C1 and C2. The clusters identified by MOCASSIN-prot for the RGS-family proteins were largely consistent with the clustering patterns found in a maximum-likelihood phylogeny (Supplementary Fig. S1), indicating the advantage of MOCASSIN-prot incorporating both sequence similarity and domain divergence information. For example, although two domains are missing from one of the beta-adrenergic receptor kinase 2 protein isoforms (P12 and P14), MOCASSIN-prot correctly clustered the two isoforms as well as another related protein (P13). Note also that proteins connected by bidirectional edges in the MOCASSIN-prot network share a same set of highly conserved domains. In fact, four of the five such pairs of RGS are two different isoforms of the same gene.

By our framework the edge weights of the protein similarity network and the corresponding domain diversity profile are the result of the search for minimally shared domains (in x) of maximal similarity (in y) for each protein in relation to all other proteins in the protein space. This is highlighted in the domain diversity profile (Fig. 4). For example, the GGL domain (PF00631.17) is present only in proteins P9, P10 and P11, and these proteins exhibited high similarity in the GGL domain. This is indicated by the high diversity weight (shown as black cells) for this domain in P9, P10 and P11 in Figure 4.

2.1.4 Network refinement

The MOCASSIN-prot method described above gives rise to clustered protein networks on minimally shared domains of maximal similarity. We can extend the method to obtain secondary clusters from a primary network cluster. This is done by removing the least conserved domain for each protein in the protein space. That is, for the domain dk having the largest diversity weight for protein Pi, we purposely remove it from the analysis, i.e. we remove xk in (2) and dk (correspondingly fk) in (1). We then find the solution to the new corresponding optimization problem. This secondary clustering structure will capture the next minimally shared region of maximal similarity for the proteins inside the primary cluster, increasing each protein’s optimal objective value and identifying further subgroupings of similar proteins within the primary cluster. This zoom-in procedure can continue to reveal the tertiary, the quaternary and so on, similarity relationships of the proteins. This iterative clustering process allows us to define a more refined network clustering structure from the original network without the use of arbitrary thresholds for pruning.

The secondary network for the fourteen RGS-family proteins is shown in Supplementary Figure S2. Proteins P1–P5 and P14 originally had only one domain, which means that for the second iteration of MOCASSIN-prot there was no optimization problem for those proteins. Hence, they have no incoming edges in the network. Clusters C3, C4 and C5 from the primary MOCASSIN-prot clustering are retained in clusters C1, C2 and C3 of the secondary network, respectively. Although P14 had only one domain, its similarity to the RGS domain of protein P12 led to it being retained in the cluster with P12 and P13.

In the secondary clustering, the edge structure of each of the clusters changed. For example, in cluster C2 of the secondary network although proteins P6 and P7 are still reciprocally most similar to each other, the edges between them are shorter than in the primary network, indicating that their optimal objective values were higher in the secondary clustering. This increase in similarity between the two proteins results from the removal of the domain that had the highest diversity weight in (2), i.e. the domain that was most dissimilar to the other proteins in the protein space. Additionally, the weak similarity between P6 and P8 in the primary clustering is no longer present after the second iteration of MOCASSIN-prot, where the uniquely shared AXIN domain (PF08833.5) was removed.

Edge structure differences between the primary and secondary networks were also observed in the cluster containing proteins P9, P10 and P11 (C4 in Fig. 3 and C3 in Supplementary Fig. S2). In the primary network, these three proteins were clustered together based on their similarity in the GGL domain (PF00631.17), which was present only in these proteins. The presence of this uniquely similar domain resulted in the three proteins being clustered together in C3 even though one of them (P11) included an extra domain. Proteins P9 and P10, the two isoforms of RGS7, were reciprocally most similar to each other, and P11 exhibited high similarity to P10 in the GGL domain. Figure 4 shows that these three proteins had not one, but two domains that were unique to the cluster, the GGL domain (PF00631.17) and the DEP domain (PF00610.16). In the secondary network, the GGL domain was removed from all three proteins. As mentioned before, the edge weights in the network are the result of the search for minimally shared domains of maximal similarity for each protein in relation to all other proteins in the protein space. Hence, the edges in the secondary network for proteins P9, P10 and P11 were based on their similarities in the DEP domain. P9 and P10 were once again identified to be reciprocally most similar, but in this iteration, without the GGL domain, P11 appears to be equally similar to P9 and P10.

2.2 Datasets used in this study

Evaluation of protein clustering algorithms requires a trusted benchmark set of known protein families. Some databases used as such benchmarks classify proteins based on the sequence as well as structural similarities [e.g. SCOP (Murzin et al., 1995), CATH (Pearl, 2003) and COG (Tatusov et al., 2003)]. However, these benchmarks are not suited for our analysis since they do not contain many multi-domain proteins.

In this study, we instead developed benchmark datasets based on the protein family classification assigned in the UniProt Knowledgebase (UniProtKB, release 2014_08) (The UniProt Consortium, 2017). Benchmark protein sets were constructed from ten genomes including seven bacteria and three eukaryotes (Supplementary Table S1). For each genome, we downloaded the entire set of proteins from the Swiss-Prot section of UniProtKB, which includes only manually annotated and reviewed protein sequences. Proteins with no identifiable domains (determined using HMMER3 search against the Pfam v27.0 database with E-value threshold 1.0) were excluded. Proteins that did not have a UniProtKB protein family assignment were further filtered out. Finally, the proteins in each benchmark set were clustered following the UniProtKB protein family assignments. The UniProt protein family information used in this study is available on the MOCASSIN-prot website.

UniProt assigns proteins to protein families based on information from a variety of sources including sequence signatures from various domain/family databases (e.g. InterPro), scientific literature, as well as sequence similarity. Because our aim is to compare the performance of the methods that classify proteins based on the information derived from the domains and sequence similarity, even though proteins in UniProt/Swiss-Prot go through integrated manual annotation, it can be argued that a domain-centric method such as MOCASSIN-prot would be favored when the protein classification benchmark is derived from the UniProt protein family information. Therefore, we also prepared datasets where benchmark classification can be done based on Gene Ontology (GO) molecular function annotations in UniProt-GOA (Huntley et al., 2015). From the ten genomes, proteins were selected if they satisfy the following three conditions: (i) annotated in UniProt/Swiss-Prot, (ii) one or more Pfam domains were found by HMMER3 search and (iii) GO annotations contained experimental evidence codes (EXP, IMP, IGI, IPI, IDA, or IEP). A sufficiently large number of proteins that sufficed these criteria were identified only from the mouse dataset (10 895 proteins). This dataset was used for the performance analysis.

2.3 Methods used for comparison

The performance of MOCASSIN-prot was compared against two other methods, TRIBE-MCL and Spectral Clustering of Protein Sequences.

2.3.1 TRIBE-MCL

TRIBE-MCL, a variant of the Markov clustering algorithm, clusters proteins by modeling a random walk on a similarity graph (Enright, 2002). It clusters proteins as follows: (i) an all-versus-all blastp search is done with default parameters, (ii) a probability matrix is generated by converting the BLAST hit table to an all-versus-all similarity matrix using a negative log-transform of the E-value using the mcxload program and (iii) the mcl program generates the protein clusters by employing the Markov clustering algorithm. The MCL package (v12-068) was downloaded from www.micans.org/mcl.

2.3.2 Spectral clustering of protein sequences (SCPS)

The SCPS algorithm clusters proteins by exploiting the eigenvalues and eigenvectors obtained from a similarity matrix to partition the proteins into disjoint clusters (Nepusz et al., 2010). The similarity file used as input to SCPS was generated using the Blast2Sim plug-in in Cytoscape 2.8.0 using an initial BLAST E-value threshold of 0.01, a coverage factor of 15 and the ‘best hit’ similarity function. The all-versus-all BLAST hit table supplied to Blast2Sim was the same one used for TRIBE-MCL. The pairwise similarity score reported by Blast2Sim is a negative log-transform of the BLAST E-value of the top hit between two proteins that also takes sequence coverage into account. We chose to use this pairwise similarity score as input to SCPS because it is a similar scoring scheme to the one that was used in MOCASSIN-prot and TRIBE-MCL. The SCPS software (v0.9.8) was downloaded from www.paccanarolab.org/scps.

2.3.3 Parameter optimization

A range of parameters were tested for each algorithm to determine the optimum parameter value. For MOCASSIN-prot, we varied the number of refinement iterations from 1 to 4. After the first clustering, subsequent iterations involve removing the least conserved domain from each protein and re-clustering the proteins to generate a more refined clustering. For TRIBE-MCL, we varied the inflation parameter, I, from 1.1 to 6.0 in increments of 0.1. For SCPS, the first eigenratio threshold, epsilon, was varied from 1.02 to 1.05 in increments of 0.01. For both I and epsilon, increasing these values has the effect of increasing the granularity of clusters. SCPS decides the number of clusters by computing the top k eigenvalues of the similarity matrix, ordering them in decreasing order, and then taking the ratio of successive eigenvalues. The number of clusters is then selected by looking for the first eigenratio that is larger than the threshold epsilon.

For each algorithm, the optimum value of the parameter was determined for each dataset based on the overall F-measure (described below). The results from all experiments (in total 580 performed) are shown in Supplementary Figures S3–S5. While increasing the parameter value generally resulted in an increase in performance for MOCASSIN-prot and TRIBE-MCL, performance stayed the same or in a few cases decreased for SCPS. The optimal parameter for each dataset and method is shown in Supplementary Table S2.

2.4 Performance measures

To compare the set of clusters generated by a method to a reference cluster set, we employed the three commonly used evaluation measures: the weighted mean precision, weighted mean recall and overall F-measure (Larsen and Aone, 1999). For the total number of proteins in the dataset n, let ni* denote the number of proteins in the ith reference cluster, n*j the number of proteins in the jth cluster generated by the clustering method, and nij the number of proteins that are in both reference cluster i and method cluster j. The precision (or positive predictive value) of method cluster j with respect to reference cluster i is the fraction of proteins in method cluster j that are also in reference cluster i: pij=nij/n*j. The weighted mean precision of a clustering generated by a method compared to the reference cluster set is then calculated as a weighted average of precisions calculated from each protein family:
(3)
The recall (or sensitivity) of method cluster j with respect to reference cluster i is the fraction of proteins in reference cluster i that are also in method cluster j: rij=nij/ni*. The weighted mean recall for a method’s clustering compared to the reference set of clusters is given by:
(4)
The F-measure (or F1 score) of method cluster j with respect to the reference cluster i is the harmonic mean of precision and recall. Thus, the overall F-measure of the clustering generated by a method compared to the reference cluster set is calculated as:
(5)
All these three scores have values ranging from 0 to 1, with 1 being the best score. The recall, R, quantifies the extent to which a method’s clustering captures the reference clusters, while the precision, P, measures the accuracy of the method’s clusters. The overall F-measure, F, weights recall and precision equally. An ideal clustering algorithm should have both high precision and high recall. Thus, in the overall F-measure moderately good performance in both precision and recall will be favored over extremely good performance in one and poor performance in the other.

Clustering performance against the GO-annotation based benchmark was done using the method described by Nepusz et al. (2010). Briefly, for each cluster with at least three proteins, hypergeometric tests (one for each GO term) were performed to assess the statistical significance of the occurrence of the GO terms in the cluster at a significance level (0.01 or 0.0001). Correction for multiple hypothesis testing was done by controlling the false discovery rate (Benjamini and Hochberg, 1995). A cluster was considered significant if it had at least one significant over-represented GO term. Clustering performance was assessed by the proportion of the number of significant clusters to the total number of clusters, as well as the proportion of the cumulative size of significant clusters to the total cumulative size of clusters.

3 Results and discussion

3.1 Clustering performance evaluation using the UniProt protein family benchmark

The performance of protein classification by each method using its optimal parameter is presented in Figure 5 (also see Supplementary Table S2). In general, MOCASSIN-prot dominates the other two methods in the overall F-measure (Fig. 5A) and weighted mean precision (Fig. 5B), indicating its lower false positive rates across a range of organisms. TRIBE-MCL had higher recall values especially for larger genomes (Fig. 5C, also see Supplementary Table S1); its lower precision (Fig. 5B), however, indicates higher false positive rates. SCPS showed much lower precision and recall, often below 50%. MOCASSIN-prot maintained higher accuracy compared to other two methods across organisms except for M.musculus where all three methods performed similarly. Interestingly TRIBE-MCL and SCPS performed poorly against especially smaller prokaryotic datasets (Supplementary Table S1). In contrast, MOCASSIN-prot was more robust and performed consistently accurately across the datasets.

Clustering performance of MOCASSIN-prot (MOC), TRIBE-MCL (MCL) and SCPS for the ten reference protein sets based on overall F-measure (A), weighted mean precision (B) and weighted mean recall (C)
Fig. 5.

Clustering performance of MOCASSIN-prot (MOC), TRIBE-MCL (MCL) and SCPS for the ten reference protein sets based on overall F-measure (A), weighted mean precision (B) and weighted mean recall (C)

The largest benefit of MOCASSIN-prot over other clustering methods was its ability to identify protein families where the proteins have divergent domain architectures. For example, the peptidase M24B family from the S.cerevisiae dataset consists of three proteins (Q07825, P40051 and P43590) with varying domain architectures (Supplementary Fig. S6). MOCASSIN-prot was able to perfectly cluster these three proteins together. TRIBE-MCL (with I = 5) placed P40051 and P43590 into singleton clusters, while Q07825 was clustered together with 368 other proteins. On the other hand, SCPS (with epsilon = 1.02) clustered P40051 and P43590 together and placed Q07825 into a singleton cluster. The first iteration of MOCASSIN-prot placed all three of these proteins into a large cluster composed of 3223 proteins. Supplementary Figure S7A shows a close-up of the direct similarity relationships for these proteins within this large cluster. Proteins P40051 and P43590 are reciprocally most similar (due to their similarity with respect to domain PF05195.11), and protein Q07825 exhibits similarity to both P40051 and P27801 (a vacuolar membrane protein) with the highest diversity weight being placed on the PF01321.13 domain. During the second iteration of MOCASSIN-prot (Supplementary Fig. S7B), where the domain with the highest diversity weight is removed for each protein, P40051 and P43590 are again found to be reciprocally most similar (this time in terms of domain PF00557.19). These two proteins being identified as highly similar is consistent with the results from SCPS. However, in contrast to SCPS, protein Q07825 is retained in the cluster due to its similarity to P43590 (with respect to domain PF00557.19). As we described before, similar observations have been made when we compared the primary and secondary MOCASSIN-prot networks constructed from a small number of RGS-family proteins. Hence, despite the variation in domain architectures among its member proteins, the incorporation of domain architecture information along with sequence similarity by MOCASSIN-prot allowed us to identify the peptidase M24B family.

3.2 Clustering performance evaluation using the GO-based benchmark

When the performance of the three methods was tested using the GO-based M.musculus benchmark, all three methods achieved proportions of significant clusters higher than 0.977 (both based on the cluster numbers and sizes) regardless of the parameters used (Supplementary Tables S3 and S4). MOCASSIN-prot exhibited the highest proportion of significant cluster numbers (0.994 with iteration 1, at the significance level 0.0001). The highest proportion for TRIBE-MCL was 0.993 (I = 1.1, at the significance level 0.0001) and SCPS had the lowest (0.991, at the significance level 0.0001). Although the difference in the proportions was small (at the significance level 0.01, all three achieved the proportion of 1.0), MOCASSIN-prot identified 270 more significant clusters compared to TRIBE-MCL (19 more than SCPS). When the cluster size was considered, MOCASSIN-prot identified 3169 more proteins in significant clusters compared to SCPS, while another 1007 proteins were identified in significant clusters by TRIBE-MCL compared to MOCASSIN-prot. Similar results were obtained when the significant level of 0.01 was used to identify significant clusters. These results support the clusters generated by MOCASSIN-prot to be more consistent with GO annotation than those generated by the other two methods.

4 Conclusion

Improving the accuracy and resolution in protein-domain classification will contribute to better understanding of evolutionary and functional relationships within the protein universe. However, classifying protein sequences that include multiple domains in varying composition, especially at the whole proteome level, is a challenging problem. Many approaches to this problem, including TRIBE-MCL and SCPS, use similarity information from only a single region shared between proteins. In this study, we presented MOCASSIN-prot, a novel approach to protein classification based on multi-objective optimization. This method utilizes quantitative sequence similarity information from all domains on the proteins and builds a network that houses clusters of proteins that have similar sequences as well as domain contents. MOCASSIN-prot is not only capable of producing clusters consistent to phylogeny-based clustering, which solely relies on sequence similarities, but it can also be applied to highly heterogeneous proteome-level classification where phylogenetic analysis is not applicable. The method is scalable to the complete proteome level, and exhibited consistently higher accuracy with lower false positives compared to TRIBE-MCL and SCPS.

We note that the performance of MOCASSIN-prot depends critically on domain identification and input similarity scores. Different domain discovery methods and scoring schemes need to be explored in the future. We also note the importance of good benchmarks to evaluate the clustering performance of various methods and needs to develop a protein evolution simulation algorithm that can accurately simulate protein families using the domain, rather the amino acid, as a unit of evolution.

Acknowledgements

The authors would like to thank Drs. Steven Dunbar and Stephen Hartke for their valuable comments, ideas and assistance in the undertaking of the research described here.

Disclaimer

Mention of trade names or commercial products in this publication is solely for the purpose of providing specific information and does not imply recommendation or endorsement by the U.S. Department of Agriculture.

Funding

The authors were supported by the University of Nebraska-Lincoln Life Science Competitive Grants Program.

Conflict of Interest: none declared.

References

Atkinson
 
H.J.
 et al. (
2009
)
Using sequence similarity networks for visualization of relationships across diverse protein superfamilies
.
PLoS One
,
4
,
e4345
.

Benjamini
 
Y.
,
Hochberg
Y.
(
1995
)
Controlling the false discovery rate: a practical and powerful approach to multiple testing
.
J. R. Stat. Soc. Ser. B (Methodological)
,
57
,
289
300
.

Bhardwaj
 
G.
 et al. (
2012
)
PHYRN: A robust method for phylogenetic analysis of highly divergent sequences
.
PLoS One
,
7
,
e34261.

Camacho
 
C.
 et al. (
2009
)
BLAST+: architecture and applications
.
BMC Bioinformatics
,
10
,
421.

Chang
 
G.S.
 et al. (
2008
)
Phylogenetic profiles reveal evolutionary relationships within the ‘twilight zone’ of sequence similarity
.
Proc. Natl. Acad. Sci. USA
,
105
,
13474
13479
.

Chothia
 
C.
,
Gough
J.
(
2009
)
Genomic and structural aspects of protein evolution
.
Biochem. J
.,
419
,
15
28
.

Cohen-Gihon
 
I.
 et al. (
2007
)
Comprehensive analysis of co-occurring domain sets in yeast proteins
.
BMC Genomics
,
8
,
161.

Deng
 
B.
 et al. (
2013
)
Bioinformatic game theory and its application to biological affinity networks
.
Appl. Math
.,
04
,
92.

Eddy
 
S.R.
(
2011
)
Accelerated profile HMM searches
.
PLoS Comput. Biol
.,
7
,
e1002195.

Enright
 
A.J.
 et al. (
1999
)
Protein interaction maps for complete genomes based on gene fusion events
.
Nature
,
402
,
86
90
.

Enright
 
A.J.
,
Ouzounis
C.A.
(
2000
)
GeneRAGE: a robust algorithm for sequence clustering and domain detection
.
Bioinformatics
,
16
,
451
457
.

Enright
 
A.J.
(
2002
)
An efficient algorithm for large-scale detection of protein families
.
Nucleic Acids Res
.,
30
,
1575
1584
.

Finn
 
R.D.
 et al. (
2014
)
Pfam: the protein families database
.
Nucleic Acids Res
.,
42
,
D222
D230
.

Graur
 
D.
(
2016
)
Molecular and Genome Evolution
.
Sinauer Associates, Inc
.,
Sunderland, MA
.

Huntley
 
R.P.
 et al. (
2015
)
The GOA database: Gene Ontology annotation updates for 2015
.
Nucleic Acids Res
.,
43
,
D1057
D1063
.

Koonin
 
E.V.
 et al. (
2000
)
The impact of comparative genomics on our understanding of evolution
.
Cell
,
101
,
573
576
.

Kummerfeld
 
S.K.
,
Teichmann
S.A.
(
2009
)
Protein domain organisation: adding order
.
BMC Bioinformatics
,
10
,
39.

Larsen
 
B.
,
Aone
C.
(
1999
) Fast and effective text mining using linear-time document clustering. In: Proceedings of the Fifth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 16–22. ACM.

Levitt
 
M.
(
2009
)
Nature of the protein universe
.
Proc. Natl. Acad. Sci. USA
,
106
,
11079
11084
.

Marcotte
 
E.M.
(
1999
)
Detecting protein function and protein-protein interactions from genome sequences
.
Science
,
285
,
751
753
.

Miele
 
V.
 et al. (
2012
)
High-quality sequence clustering guided by network topology and multiple alignment likelihood
.
Bioinformatics
,
28
,
1078
1085
.

Murzin
 
A.
 et al. (
1995
)
SCOP: a structural classification of proteins database for the investigation of sequences and structures
.
J. Mol. Biol
.,
247
,
536
540
.

Nacher
 
J.C.
 et al. (
2009
) A bipartite graph based model of protein domain networks. In:
Zhou
J.
(ed.)
Complex Sciences, Vol. 4. Lecture Notes of the Institute for Computer Sciences, Social Informatics and Telecommunications Engineering
.
Springer, Berlin, Heidelberg
, pp.
525
535
.

Nepusz
 
T.
 et al. (
2010
)
SCPS: a fast implementation of a spectral method for detecting protein families on a genome-wide scale
.
BMC Bioinformatics
,
11
,
120
.

Paccanaro
 
A.
 et al. (
2006
)
Spectral clustering of protein sequences
.
Nucleic Acids Res
.,
34
,
1571
1580
.

Pearl
 
F.
(
2003
)
The CATH database: an extended protein family resource for structural and functional genomics
.
Nucleic Acids Res
.,
31
,
452
455
.

Pellegrini
 
M.
 et al. (
1999
)
Assigning protein functions by comparative genome analysis: protein phylogenetic profiles
.
Proc. Natl. Acad. Sci. USA
,
96
,
4285
4288
.

Pipenbacher
 
P.
 et al. (
2002
)
ProClust: Improved clustering of protein sequences with an extended graph-based approach
.
Bioinformatics
,
18
,
S182
S191
.

Przytycka
 
T.
 et al. (
2006
)
Graph theoretical insights into evolution of multidomain proteins
.
J. Comput. Biol
.,
13
,
351
363
.

Sjölander
 
K.
(
2004
)
Phylogenomic inference of protein molecular function: advances and challenges
.
Bioinformatics
,
20
,
170
179
.

Smoot
 
M.E.
 et al. (
2011
)
Cytoscape 2.8: new features for data integration and network visualization
.
Bioinformatics
,
27
,
431
432
.

Tatusov
 
R.L.
(
1997
)
A genomic perspective on protein families
.
Science
,
278
,
631
637
.

Tatusov
 
R.
 et al. (
2003
)
The COG database: an updated version includes eukaryotes
.
BMC Bioinformatics
,
4
,
41.

The UniProt Consortium
(
2017
)
UniProt: the Universal Protein knowledgebase
.
Nucleic Acids Res
.,
45
,
D158
D169
.

Van Dongen
 
S.
(
2000
) Graph clustering by flow simulation. PhD thesis. University of Utrecht, The Netherlands.

Vogel
 
C.
 et al. (
2004
)
Supra-domains: evolutionary units larger than single protein domains
.
J. Mol. Biol
.,
336
,
809
823
.

Wang
 
Z.
 et al. (
2011
)
A protein domain co-occurrence network approach for predicting protein function and inferring species phylogeny
.
PLoS One
,
6
,
e17906.

Wittkop
 
T.
 et al. (
2010
)
Partitioning biological data with transitivity clustering
.
Nat. Methods
,
7
,
419
420
.

Wuchty
 
S.
,
Almaas
E.
(
2005
)
Evolutionary cores of domain co-occurrence networks
.
BMC Evol. Biol
.,
5
,
24.

Xie
 
X.
 et al. (
2011
)
Evolutionary versatility of eukaryotic protein domains revealed by their bigram networks
.
BMC Evol. Biol
.,
11
,
242.

Author notes

The U.S. Department of Agriculture (USDA) prohibits discrimination in all its programs and activities on the basis of race, color, national origin, age, disability, and where applicable, sex, marital status, familial status, parental status, religion, sexual orientation, genetic information, political beliefs, reprisal, or because all or part of an individual’s income is derived from any public assistance program. (Not all prohibited bases apply to all programs.) Persons with disabilities who require alternative means for communication of program information (Braille, large print, audiotape, etc.) should contact USDA’s TARGET Center at (202) 720-2600 (voice and TDD). To file a complaint of discrimination, write to USDA, Director, Office of Civil Rights, 1400 Independence Avenue, S.W., Washington, D.C. 20250-9410, or call (800) 795-3272 (voice) or (202) 720-6382 (TDD). USDA is an equal opportunity provider and employer.

This article is published and distributed under the terms of the Oxford University Press, Standard Journals Publication Model (https://academic.oup.com/journals/pages/about_us/legal/notices)
Associate Editor: John Hancock
John Hancock
Associate Editor
Search for other works by this author on:

Supplementary data