Abstract

Residue coevolution has recently emerged as an important concept, especially in the context of protein structures. While a multitude of different functions for quantifying it have been proposed, not much is known about their relative strengths and weaknesses. Also, subtle algorithmic details have discouraged implementing and comparing them. We addressed this issue by developing an integrated online system that enables comparative analyses with a comprehensive set of commonly used scoring functions, including Statistical Coupling Analysis (SCA), Explicit Likelihood of Subset Variation (ELSC), mutual information and correlation-based methods. A set of data preprocessing options are provided for improving the sensitivity and specificity of coevolution signal detection, including sequence weighting, residue grouping and the filtering of sequences, sites and site pairs. A total of more than 100 scoring variations are available. The system also provides facilities for studying the relationship between coevolution scores and inter-residue distances from a crystal structure if provided, which may help in understanding protein structures.

Availability: The system is available at http://coevolution.gersteinlab.org. The source code and JavaDoc API can also be downloaded from the web site.

Contact:mark.gerstein@yale.edu

Supplementary information: Additional materials can be found at http://coevolution.gersteinlab.org/coevolution/supp.jsp

1 INTRODUCTION

Coevolution (covariation/correlated mutation) is the change of a biological object triggered by the change of a related object. For example, the coding genes of some interacting proteins are preserved or eliminated together in new species (Pellegrini et al., 1999), or have similar phylogenetic trees (Goh et al., 2000). At the amino acid level, some residues under physical or functional constraints exhibit correlated mutations (Gloor et al., 2005; Socolich et al., 2005; Suel et al., 2003). Coevolving residues in a protein are detected in a two-step process: (1) the multiple sequence alignment (MSA) of the protein and its homologs is constructed or obtained; (2) a coevolution score is calculated for each pair of sites in the MSA. There are two main difficulties in this process. First, a large number of scoring functions have been proposed in the literature (Halperin et al., 2006). It can be difficult to choose from them, as they exhibit subtle yet significant differences, and it is likely that different applications would require different functions. Second, coevolution analyses could be confounded by uneven sequence representations, insufficient evolutionary divergence and the presence of gaps in the MSA. A successful coevolution study has to take all these details into account.

To address this need, we have developed an integrated system that provides a simple interface for preprocessing data, computing coevolution scores and analyzing the results. It offers a great variety of scoring variations (over 100) for studying different types of proteins and testing different hypotheses. The workflow of the system is shown in Figure 1. More details on the scoring functions, preprocessing options and result analysis are provided below.

Fig. 1.

The workflow of the system (a larger version can be found at the supplementary web site).

Fig. 1.

The workflow of the system (a larger version can be found at the supplementary web site).

2 SCORING FUNCTIONS

2.1 Correlation-based functions

For a pair of sites i and j in an MSA, the correlation score (Gobel et al., 1994; Halperin et al., 2006) is computed as follows:  

formula

where sikl is the score for substituting the i-th residue of sequence k by that of sequence l,

si
and σi are the mean and SD of substitution scores at site i, N is the number of sequences in the MSA and wkl is the weight for the sequence pair k, l. If the two sites are coevolving in that radical substitutions at the first site are accompanied by radical substitutions at the second site, the correlation will be high. Our system provides the classical McLachlan matrix (McLachlan, 1971) that scores substitutions based on the physiochemical properties of the residues, as well as matrices based on residue volume, pI and hydropathy index, for studying the properties individually. Two variations are provided for each of them: the ‘absolute value version’ considers only the magnitude, while the ‘raw version’ also considers the direction of change, for detecting compensatory mutations. The correlation can be computed from raw values (Pearson correlation) or from value ranks (Spearman correlation, Pazos et al., 1997). Several schemes are provided for the weights wkl, preventing false coevolution signals due to uneven sequence representation or site conservation.

2.2 Perturbation-based functions

The idea of perturbation-based functions is to perform a ‘perturbation’ at a first site, and observe its effect on a second site. The Statistical Coupling Analysis (SCA) method (Lockless and Ranganathan, 1999) defines a statistical energy term for a site, and computes the energy change at a second site when the first site is perturbed by retaining only the sequences with a certain residue.1 The Explicit Likelihood of Subset Variation (ELSC) method (Dekker et al., 2004) is based on the same idea, but has the energy computations replaced by probabilities according to hypergeometric distributions. The mutual information (MI) method (Gloor et al., 2005) can be viewed as a generalized perturbation method that considers the subsetting of all twenty kinds of residues, and combines them by a weighted average according to their frequencies. To deal with finite sample size effects and phylogenetic influence, the normalization options in Martin et al. (2005) are also provided.

2.3 Independence tests

The chi-square test (c.f. the OMES method, Larson et al., 2000) and the quartets method (Galitsky, 2003) both identify site pairs that are unlikely to be independent. The former computes the P-value under the null hypothesis of independent sites. The latter counts the number of quartets in the 2D histogram of residue frequencies that deviate considerably from the expectation.

3 PREPROCESSING OPTIONS

To improve the sensitivity and specificity of the functions, options are provided for preprocessing sequences, sites and site pairs.

3.1 Sequence filtering and weighting

Sequences that contain too many gapped positions or are too similar to others in the MSA (that might cause sites to appear coevolving) can be removed by specifying the gap and similarity thresholds respectively. A minimum number of sequences can also be specified to avoid small sample size effects.

A sequence weighting scheme based on the topology of the phylogenetic tree (Gerstein et al., 1994) and one based on Markov random walk are provided. Both schemes down-weigh sequences that are very similar to others in the MSA.

3.2 Site filtering

After sequence filtering, sites that contain too many gaps or are too conserved can be discarded. The former is likely non-informative, while the latter may artificially inflate some coevolution scores.

3.3 Site pair filtering

Sites that are close in the primary sequence may produce trivial coevolution signals that hide other more unexpected coevolution events. Such site pairs can be filtered by specifying the minimum sequence separation. It has also been observed that insertions/deletions of multiple residues may create artificial coevolution signals (Patel et al., unpublished data). An option is provided for filtering site pairs that participate in the same gaps in too many sequences.

3.4 Other options

Grouping similar residues into a smaller alphabet may increase the sensitivity (Pollock et al., 1999). Our system provides two residue groupings proposed in the literature (Elcock and McCammon, 2001; Guharoy and Chakrabarti, 2005). It has also been observed that gaps might give important coevolution signals (Patel et al., unpublished data). An option is provided for treating gaps as noise or as the 21st residue when computing coevolution scores.

4 SCORES ANALYSIS

In some proteins coevolving residues tend to be close to each other in the 3D structure (Dekker et al., 2004; Gloor et al., 2005). This suggests that the instability created by the mutation of a residue may be (partially) compensated for by a corresponding mutation of a close residue. Coevolution signals may thus convey some information about the protein structure. For instance it is interesting to study how well the coevolution scores predict the residue contact map (Halperin et al., 2006). Our system provides functions for plotting and analyzing the coevolution scores against inter-residue distances and standard machine-learning techniques (e.g. ROC curve) for evaluating the effectiveness of the various coevolution functions in predicting interacting residues. A shuffling scheme for evaluating the significance of the scores is also provided in the program package for running locally.

5 EXAMPLE

We provide a worked example of our system in operation on the web site that illustrates coevolution in the transmembrane protein bacteriorhodopsin due to physically constrained residues not adjacent in the primary sequence. The example can be easily loaded by clicking the corresponding link on the main page. Running the example will compute the coevolution scores between site pairs separated by at least 3 residues. The scatterplot for coevolution scores against inter-residue distances generated using a known PDB structure (Fig. 1) shows that residue pairs receiving high scores do tend to be closer in the crystal structure.

Due to the intensive computation involved in the score calculations, currently only one scoring function is allowed to be used each time. Anyone interested in performing large-scale comparisons can download the Java programs from the web site and run locally on most platforms (Windows, Macintosh, Linux, UNIX, etc.). Detailed installation instructions are provided on the web site.

6 DISCUSSION

Although the scatterplot in Figure 1, and other studies in the literature, have suggested some relationships between coevolution and physical constraints, to what extent could coevolution scores help understand physical structures remains unclear. We hope the current application could serve as a neutral tool for further exploration in this area.

The current system focuses on functions that do not assume any mutation models. Other functions, such as the likelihood method in Pollock et al. (1999) and the Bayesian mutational mapping method (Dimmic et al., 2005) may be added in a later version.

Coevolution signals have been used in recent studies to predict sequence regions involved in protein–protein interactions with different levels of success (Halperin et al., 2006; Pazos and Valencia, 2002). We plan on extending the system to include inter-protein residue coevolution in the next phase of development.

ACKNOWLEDGMENT

We would like to thank Rama Ranganathan and William Russ for helpful discussions and the anonymous reviewers for their valuable comments.

Conflict of Interest: none declared.

1
Our implementation provides an asymmetric SCA score matrix, as well as extra summarizing statistics. Details can be found at the supplementary web site.

REFERENCES

Dekker
JP
, et al.  . 
A perturbation-based method for calculating explicit likelihood of evolutionary co-variance in multiple sequence alignments
Bioinformatics
 , 
2004
, vol. 
20
 (pg. 
1565
-
1572
)
Dimmic
MW
, et al.  . 
Detecting coevolving amino acid sites using Bayesian mutational mapping
Bioinformatics
 , 
2005
, vol. 
21
 
Suppl. 1
(pg. 
i126
-
i135
)
Elcock
AH
McCammon
JA
Identification of protein oligomerization states by analysis of interface conservation
Proc. Natl Acad. Sci. USA
 , 
2001
, vol. 
98
 (pg. 
2990
-
2994
)
Galitsky
B
Revealing the set of mutually correlated positions for the protein families of immunoglobulin fold
In Silico Biol.
 , 
2003
, vol. 
3
 pg. 
0022
 
Gerstein
M
, et al.  . 
Volume changes in protein evolution
J. Mol. Biol.
 , 
1994
, vol. 
236
 (pg. 
1067
-
1078
)
Gloor
GB
, et al.  . 
Mutual information in protein multiple sequence alignments reveals two classes of coevolving positions
Biochemistry
 , 
2005
, vol. 
44
 (pg. 
7156
-
7165
)
Gobel
U
, et al.  . 
Correlated mutations and residue contacts in proteins
Proteins: Struct. Funct.Genet.
 , 
1994
, vol. 
18
 (pg. 
309
-
317
)
Goh
CS
, et al.  . 
Co-evolution of proteins with their interaction partners
J. Mol. Biol.
 , 
2000
, vol. 
299
 (pg. 
283
-
293
)
Guharoy
M
Chakrabarti
P
Conservation and relative importance of residues across protein-protein interfaces
Proc. Natl. Acad. Sci. USA
 , 
2005
, vol. 
102
 (pg. 
15447
-
15452
)
Halperin
I
, et al.  . 
Correlated mutations: advances and limitations. A study on fusion proteins and on the Cohesin-Dockerin families
Proteins Struct. Funct. Bioinfo.
 , 
2006
, vol. 
63
 (pg. 
832
-
845
)
Larson
SM
, et al.  . 
Analysis of covariation in an SH3 domain sequence alignment: applications in tertiary contact prediction and the design of compensating hydrophobic core substitutions
J. Mol. Biol.
 , 
2000
, vol. 
303
 (pg. 
433
-
446
)
Lockless
SW
Ranganathan
R
Evolutionarily conserved pathways of energetic connectivity in protein families
Science
 , 
1999
, vol. 
286
 (pg. 
295
-
299
)
McLachlan
AD
Tests for comparing related amino-acid sequences cytochrome c and cytochrome c551
J. Mol. Biol.
 , 
1971
, vol. 
61
 (pg. 
409
-
424
)
Martin
LC
, et al.  . 
Using information theory to search for co-evolving residues in proteins
Bioinformatics
 , 
2005
, vol. 
21
 (pg. 
4116
-
4124
)
Pazos
F
, et al.  . 
Correlated mutations contain information about protein-protein interaction
J. Mol. Biol.
 , 
1997
, vol. 
271
 (pg. 
511
-
523
)
Pazos
F
Valencia
A
In silico two-hybrid system for the selection of physically interacting protein pairs
Proteins: Struct. Funct. Genet.
 , 
2002
, vol. 
47
 (pg. 
219
-
227
)
Pellegrini
M
, et al.  . 
Assigning protein functions by comparative genome analysis: protein phylogenetic profiles
Proc. Natl Acad. Sci. USA
 , 
1999
, vol. 
96
 (pg. 
4285
-
4288
)
Pollock
DD
, et al.  . 
Coevolving protein residues: maximum likelihood identification and relationship to structure
J. Mol. Biol.
 , 
1999
, vol. 
287
 (pg. 
187
-
198
)
Socolich
M
, et al.  . 
Evolutionary information for specifying a protein fold
Nature
 , 
2005
, vol. 
437
 (pg. 
512
-
518
)
Suel
GM
, et al.  . 
Evolutionarily conserved networks of residues mediate allosteric communication in proteins
Nat. Struct. Biol.
 , 
2003
, vol. 
10
 (pg. 
59
-
69
)

Author notes

The authors wish it to be known that, in their opinion, the first two authors should be regarded as joint First authors.
Associate Editor: Olga Troyanskaya
This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/2.0/uk/) which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.

Comments

0 Comments