It is informative to detect highly conserved positions in proteins and nucleic acid sequence/structure since they are often indicative of structural and/or functional importance. ConSurf (http://consurf.tau.ac.il) and ConSeq (http://conseq.tau.ac.il) are two well-established web servers for calculating the evolutionary conservation of amino acid positions in proteins using an empirical Bayesian inference, starting from protein structure and sequence, respectively. Here, we present the new version of the ConSurf web server that combines the two independent servers, providing an easier and more intuitive step-by-step interface, while offering the user more flexibility during the process. In addition, the new version of ConSurf calculates the evolutionary rates for nucleic acid sequences. The new version is freely available at: http://consurf.tau.ac.il/.
The degree to which an amino (or nucleic) acid position is evolutionarily conserved is strongly dependent on its structural and functional importance. Thus, conservation analysis of positions among members of the same family can often reveal the importance of each position for the structure or function of the protein (or nucleic acid). ConSurf (1,2) and ConSeq (3) are web servers for calculating the evolutionary rate of each position of the protein and for identifying structurally and functionally important regions within proteins. The degree of conservation of each position is the inverse of the site’s evolutionary rate; rapidly evolving positions are variable while slowly evolving positions are conserved. In ConSurf, the evolutionary rate is estimated based on the evolutionary relatedness between the protein and its homologues, taking into account the similarity between amino acids as reflected in the substitution matrix (4,5). One of the advantages of ConSurf in comparison to other methods is the accurate computation of the evolutionary rate by using either an empirical Bayesian method or a maximum likelihood (ML) method (5). The differences between the two methods are explained in detail in reference (4). The strength of these methods is that they explicitly account for the stochastic process underlying the evolution of the analyzed sequences, and that they rely on the phylogeny of the sequences. Thus, they can correctly discriminate between conservation due to short evolutionary time and genuine sequence conservation. In addition, the Bayesian-based method provides reliability estimates for the site-specific conservation scores.
A short description of the methodology is provided below. More detailed description is available at http://consurf.tau.ac.il/, under ‘OVERVIEW’, ‘QUICK HELP’ and ‘FAQ’.
A flowchart of the ConSurf web server is shown in Figure 1 and detailed below.
The sequence is extracted from the 3D structure (if given).
Homologous sequences are collected using a BLAST (or PSI-BLAST) (6,7) search against a selected database. The user may specify criteria for defining homologues. The user can also manually select the desired sequences from the BLAST results.
The sequences are clustered and highly similar sequences are removed using CD-HIT (8).
A multiple sequence alignment (MSA) of the homologous sequences is constructed using MAFFT, PRANK, T-COFFEE, MUSCLE or CLUSTALW.
The continuous conservation scores are divided into a discrete scale of nine grades for visualization, from the most variable positions (grade 1) colored turquoise, through intermediately conserved positions (grade 5) colored white, to the most conserved positions (grade 9) colored maroon.
The conservation scores are projected onto the protein/nucleotide sequence and on the MSA.
If a protein 3D structure is provided:
The nine-color conservation scores are projected onto the 3D structure of the query protein and the colored protein structure is shown by FirstGlance in Jmol (http://firstglance.jmol.org).
For all cases, ConSurf creates the following outputs:
The sequence and MSA colored by ConSurf conservation scores.
A text file that summarizes for each position the normalized score calculated, the assigned color, the reliability estimation (for the Bayesian method) and the amino acids/nucleotides observed in the respective MSA column.
The sequences selected for the MSA and the MSA constructed (unless those files were uploaded by the user).
A file with the frequency of each amino acid/nucleotide observed in each column of the MSA.
The evolutionary tree, which was calculated by the server or uploaded by the user, is shown using an interactive Java applet written for that purpose.
For proteins in which the 3D structure was not provided by the user, an up-to-date version of the Protein Data Bank (13) is searched for relevant homologues. If a structure of at least one homologous protein is available, the user may map the conservation scores on the structure. This option should ease the procedure for the non-expert users, who may be unfamiliar with the 3D structure homologue. This option can also be useful for analyzing proteins that share the same sequence but differ in their 3D structure (for example, two structures solved in different conformations or with different ligands).
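The division of continuous conservation scores into nine grades, described in the methodology steps, can be illustrated with a minimal Python sketch. The binning rule used here (equal-width intervals between the lowest and highest score) is an assumption made for illustration only; the text does not specify how the server places its grade boundaries, and the function name `conservation_grades` is hypothetical. The sketch also assumes that a lower score means a slower evolutionary rate, so the lowest scores receive grade 9.

```python
def conservation_grades(scores, n_grades=9):
    """Map continuous rate scores to integer grades 1..n_grades.

    Assumptions (not taken from the ConSurf implementation):
    - lower score = slower evolution = more conserved, so the lowest
      score receives the highest grade;
    - grade boundaries are equal-width intervals over the score range.
    """
    lo, hi = min(scores), max(scores)
    if hi == lo:
        # All positions have the same score: use the middle grade.
        return [(n_grades + 1) // 2] * len(scores)
    width = (hi - lo) / n_grades
    grades = []
    for s in scores:
        # Bin index 0 holds the lowest (most conserved) scores.
        idx = min(int((s - lo) / width), n_grades - 1)
        grades.append(n_grades - idx)
    return grades
```

With this rule, the most variable position in the input always receives grade 1 (turquoise) and the most conserved position grade 9 (maroon), matching the color scale described above.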
As an example, we provide the main output of a ConSurf run for the N-terminal region of the GAL4 transcription factor in yeast (PDB ID: 3COQ, chains A and B) in complex with its DNA recognition site (Figure 2). The analysis revealed, as expected, that the functional regions of this protein are highly conserved. For example, all the cysteines that form the Zn(2)-C6 DNA binding domain (CYS11, CYS14, CYS21, CYS28, CYS31, CYS38; 14) were assigned the highest conservation scores. Likewise, PRO26, which is known to be central for DNA binding (15), is also highly conserved according to our analysis. In addition, other amino acid residues that are in contact with the DNA (i.e. GLN9, LYS17, LYS18, LYS20, ARG15, LYS23; 16) are relatively conserved.
ConSurf was also applied to nucleic acid sequences from yeast that contain the known binding sites of GAL4 and their adjacent neighborhood (Figure 2). As anticipated, the analysis revealed that the consensus pattern CGG-N11-CCG, typical of GAL4 binding sites, is highly conserved. An extended full ConSurf analysis of this example is available in the ‘GALLERY’ section on the ConSurf web site.
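For readers who wish to locate candidate sites matching the CGG-N11-CCG consensus in their own sequences, the pattern can be expressed as a regular expression. The following Python sketch is only an illustration and is not part of the ConSurf server; the function name `find_gal4_sites` is hypothetical, and N is assumed to stand for any of the four unambiguous bases.

```python
import re

# CGG, then any 11 bases, then CCG (the CGG-N11-CCG pattern from the text).
# The lookahead lets overlapping candidate sites be reported as well.
GAL4_SITE = re.compile(r"(?=(CGG[ACGT]{11}CCG))")


def find_gal4_sites(dna):
    """Return (start, site) pairs (0-based start) for every match in dna."""
    dna = dna.upper()
    return [(m.start(), m.group(1)) for m in GAL4_SITE.finditer(dna)]
```

Because the reverse complement of CGG-N11-CCG has the same form, scanning one strand is sufficient to find sites on both strands.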
NEW ADDITIONS AND IMPROVEMENTS IN ConSurf 2010
Analyzing nucleic acid sequences
Despite increasing interest in the non-coding fraction of transcriptomes, the number, the level of conservation, and functions, if any, of many non-protein-coding transcripts remain to be discovered. However, it has already been shown that many of the non-coding sequences are connected to regulatory processes. The new version of ConSurf offers estimations of the evolutionary rate for each position of nucleic acid sequences in the same manner used for amino acid residues. For that purpose, four evolutionary models were implemented in the Rate4Site program: (i) the Jukes and Cantor 69 model (JC69), which assumes equal base frequencies and equal substitution rates (17); (ii) the Tamura 92 model, which uses only one parameter that captures variation in G-C content (18); (iii) the HKY85 model, which distinguishes between transitions and transversions and allows unequal base frequencies (19); and (iv) the General Time Reversible (GTR) model, which is the most general time-reversible model. The GTR parameters consist of an equilibrium base frequency vector, giving the frequency at which each base occurs at each site, and the rate matrix (20). When enough data (i.e. sequences) are available, the GTR model is superior to the simpler Tamura 92 model. However, the Tamura 92 model is recommended in cases in which the data are not sufficient for reliable estimation of the model parameters, and thus it is the default option for analyzing nucleic acid sequences in ConSurf.
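To make the relationship between two of these models concrete, the following Python sketch builds an HKY85 rate matrix from base frequencies and a transition/transversion ratio, and obtains JC69 as the special case of equal frequencies and equal rates. It is an illustration of the standard textbook parameterization, not code from Rate4Site; the function names and the parameter name `kappa` are our own, and the normalization to one expected substitution per unit time is a common convention assumed here.

```python
BASES = "ACGT"
PURINES = {"A", "G"}


def is_transition(a, b):
    """A<->G and C<->T are transitions; all other changes are transversions."""
    return (a in PURINES) == (b in PURINES)


def hky85_rate_matrix(freqs, kappa):
    """Build a normalized HKY85 rate matrix as a dict of dicts.

    freqs: mapping base -> equilibrium frequency (assumed to sum to 1).
    kappa: transition/transversion rate ratio.
    """
    q = {a: {} for a in BASES}
    for a in BASES:
        for b in BASES:
            if a != b:
                q[a][b] = freqs[b] * (kappa if is_transition(a, b) else 1.0)
        # Each row of a rate matrix sums to zero.
        q[a][a] = -sum(q[a].values())
    # Scale so that the expected number of substitutions per unit time is 1.
    mu = -sum(freqs[a] * q[a][a] for a in BASES)
    return {a: {b: q[a][b] / mu for b in BASES} for a in BASES}


def jc69_rate_matrix():
    """JC69 as the special case: equal base frequencies and kappa = 1."""
    return hky85_rate_matrix({b: 0.25 for b in BASES}, kappa=1.0)
```

The sketch shows why the simpler models require less data: JC69 has no free parameters in the matrix, whereas HKY85 adds the base frequencies and `kappa`, and GTR adds further exchange rates that must all be estimated from the alignment.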
Improved substitution matrix for protein sequences
The LG substitution matrix, which incorporates variability of evolutionary rates across sites in the matrix estimation, was shown to outperform other substitution matrices for proteins (21). The LG matrix was added to Rate4Site and is offered in the new version of ConSurf in addition to the previous substitution models: JTT (22), Dayhoff (23), WAG (24), mtREV (25) and cpREV (26).
Improved selection of homologous proteins
The accuracy of conservation scores is directly influenced by the amount and quality of sequence data available in the MSA and by the relatedness between the homologous sequences themselves and the sequence of interest. For example, using homologous sequences with different functions might blur the signal. One of the important changes in the new version of ConSurf is the addition of a clear and intuitive interface that helps the user control which sequences are included in the analysis. These improvements include:
A variety of sequence databases. The server offers the user the option to search for relevant sequences in several automatically updated sequence databases, including: (i) SWISS-PROT (default) (27); (ii) a filtered version of the UniProt database (28); (iii) UniProt (29); (iv) UniRef90, in which redundant sequences were removed at the level of 90% identity (30); (v) the NCBI non-redundant (nr) database.
Manual selection of sequences for the analysis. After searching for homologous sequences, the user can manually select the relevant sequences to be included in the analysis using a simple form that provides all the relevant data for the sequences found and links to external web resources.
Removing redundant sequences. The user can specify the level of redundant sequences for removal. The sequences found are clustered by their level of identity using CD-HIT (8) and the cutoff specified by the user (default level is 95% identity). Only one sequence (the longest) from each cluster is used for the analysis.
Automatic removal of remote homologues. The user can control the level of sequence identity at which a hit sequence is still considered a homologue. Filtering by the sequence identity between each sequence found and the sequence of interest enables the user to exclude sequences that align significantly with the protein of interest but might have a different function or structure. The default level is set to 35% identity, which is the upper bound of the ‘twilight zone’ for protein structures (31).
Better alignments. The user can choose to align the sequences using one of the following leading alignment algorithms: MAFFT (32), T-COFFEE (EXPRESSO mode) (33), PRANK (34), MUSCLE (35) and CLUSTALW (36). The EXPRESSO mode of T-COFFEE uses structural information (if available) and structural alignment methods to construct a structure-based MSA. MAFFT and PRANK were shown to be among the leading sequence alignment algorithms (34,37). MAFFT-LINSi is much faster than PRANK and thus was chosen to be the default alignment algorithm in ConSurf.
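The two identity-based filters described above (removal of remote homologues below 35% identity to the query, and removal of redundant sequences at 95% identity while keeping the longest member of each cluster) can be sketched as follows. This is a simplified illustration, not the server's implementation: the server uses the BLAST results and CD-HIT, whereas this sketch assumes for simplicity that the sequences are already rows of one alignment, defines percent identity over the columns in which at least one of the two rows is not a gap, and uses a greedy longest-first clustering. All function names are hypothetical.

```python
def percent_identity(a, b):
    """Percent identity between two rows of the same alignment.

    Assumption: columns where both rows have a gap are ignored, and the
    denominator is the number of remaining columns.
    """
    if len(a) != len(b):
        raise ValueError("aligned sequences must have equal length")
    pairs = [(x, y) for x, y in zip(a, b) if not (x == "-" and y == "-")]
    if not pairs:
        return 0.0
    matches = sum(1 for x, y in pairs if x == y and x != "-")
    return 100.0 * matches / len(pairs)


def ungapped_length(seq):
    """Number of residues in an aligned row, not counting gaps."""
    return len(seq.replace("-", ""))


def select_sequences(query, hits, min_identity=35.0, max_redundancy=95.0):
    """Return the names of the hits kept after both filters.

    hits: mapping name -> aligned sequence (same length as query).
    """
    # Step 1: drop remote homologues (identity to the query below the cutoff).
    kept = {n: s for n, s in hits.items()
            if percent_identity(query, s) >= min_identity}
    # Step 2: greedy clustering, longest first, so that each cluster is
    # represented by its longest member.
    reps = []
    for name in sorted(kept, key=lambda n: ungapped_length(kept[n]), reverse=True):
        if all(percent_identity(kept[name], kept[r]) < max_redundancy for r in reps):
            reps.append(name)
    return reps
```

The default thresholds in the sketch mirror the defaults stated in the text (35% and 95%); both are exposed as parameters because the user can change them on the server.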
Improved user interface
In this new version of ConSurf, we put great emphasis on the user interface. ConSurf now presents an easier and more intuitive step-by-step interface, while still offering the user great flexibility during the process as described above. Each step is accompanied by built-in detailed help.
The new version of the ConSurf web server runs on a Linux cluster of 2.6 GHz AMD Opteron processors, equipped with 4 GB RAM per quad-core node. The server runs with up-to-date versions of the supported MSA programs and regularly updated databases. Running time depends on the dataset size (number and length of sequences) and the server load. The ConSurf server is implemented in PHP and Perl using the support of BioPerl modules (38). Rate4Site is implemented in C++ (4). For proteins with an available 3D structure, the conservation scores are projected on the structure and visualized using version 1.44 of FirstGlance in Jmol.
ConSurf and ConSeq have an established reputation in the identification of functional regions in proteins using evolutionary information. In addition, these methods are a focal point that facilitates the development of more useful tools in our group and in other groups. For example, they are the basis for the development of the PatchFinder tool for the automatic detection of clusters of highly conserved amino acids (39), and for the detection of DNA-binding proteins (40). Given the massive growth of sequence and structure databases, we believe that this new version of the ConSurf server will be highly useful to a growing number of molecular biology researchers and will allow them to perform complex analyses using sophisticated algorithms accurately, easily and comprehensively.
BLOOMNET ERA-PG; Israeli Science Foundation (878/09 to T.P.). Funding for open access charge: BLOOMNET ERA-PG.
Conflict of interest statement. None declared.
The authors are grateful to Nimrod Rubinstein, Adi Doron-Faigenboim, Eyal Privman, Itay Mayrose, Fabian Glaser, Maya Schushan, Guy Nimrod, Ofir Goldenberg, Yana Gofman, Uri Zonens, Gilad Wainreb and Matan Kalman for technical help, useful comments and helpful discussions.