Abstract

Motivation: Annotation tools help scientists to traverse the gap between characterized and uncharacterized proteins. Tools for the prediction of protein function include those which predict the function of entire proteins or complexes, those annotating functional domains and those which predict specific residues within the domain. We have developed WSsas, a web service focused on the annotation of essential functional residues. WSsas uses similarity searches and pairwise alignments to transfer functional information about binding, catalytic and protein–protein interaction residues from solved structures to query sequences. In addition, WSsas can supply information about the relevant functional atoms. The web service definition (WSDL) file and a Perl client are freely available at http://www.ebi.ac.uk/thornton-srv/databases/WSsas/.

Contact: talavera@ebi.ac.uk

1 INTRODUCTION

Over the last decade, the large number of sequencing projects has produced a huge amount of biological data. However, the string of nucleotides or amino acids is frequently the only knowledge scientists have about these sequences, and this lack of functional information is an important constraint on the interpretation of results. Consequently, enriching the raw data with functional annotation has become an important task for the scientific community. These annotations are important not only as an aid to experimental researchers, but also for computational biology studies. Several projects and databases use human curators to increase our knowledge in the life sciences (Holliday et al., 2007; UniProt Consortium, 2008); e.g. UniProt entries contain feature fields annotating properties such as transit signals, ion-binding regions or alternative splicing changes. However, this careful annotation process is extremely time consuming. Thus, many computational tools that aim to annotate genes and proteins automatically have been developed, and this has become one of the most successful fields in bioinformatics and theoretical biology (Flicek et al., 2008; Mulder et al., 2007). Although automatic annotations are not free from errors and cannot be as accurate as manually curated ones, their availability can be a good starting point for further experiments [for a recent review on annotation, see Reeves et al. (2009)].

There are two common ways to predict function automatically: (i) by transferring features from one homologous protein (usually found by similarity searches) to another and (ii) by ab initio prediction. The majority of existing automatic annotation tools focus on predicting the function of whole domains or elements, such as kinase domains or transmembrane segments, e.g. InterPro (Mulder et al., 2007) or Gene3D (Yeats et al., 2008), or even entire proteins, e.g. ProFunc (Laskowski et al., 2005b). However, annotation of specific functional sites is important for improving our understanding of the essential residues involved in function and the way they act, e.g. SAS (Milburn et al., 1998). Furthermore, the growing amount of available data on variation among individuals highlights the importance of functional sites and their neighbourhood. Some tools focus on this area: ScanProsite identifies functional residues that form an identifiable pattern in the sequence (de Castro et al., 2006); MSD-motif performs motif- or ligand-oriented web searches (Golovin and Henrick, 2008); and SAS uses sequence homologues containing structural information to predict functional and structural features and provides a graphical view of the results. However, the majority of these tools do not offer programmatic access for performing new predictions [ScanProsite can be accessed via the REST protocol (Fielding, 2000) and MSD data can be retrieved via an API]. Aiming to fill this gap, we have developed WSsas, a new annotation web service based on similarity searches and pairwise alignments.

2 DESCRIPTION

WSsas uses the same annotation algorithm developed for the SAS tool (http://www.ebi.ac.uk/thornton-srv/databases/sas/). Briefly, it performs a FASTA search (Pearson and Lipman, 1988) of a given protein sequence against all protein sequences in the Protein Data Bank (Berman et al., 2000). Then, the residues aligned by the Smith–Waterman algorithm (Smith and Waterman, 1981) are used to transfer functional information from solved structures to the query sequence. WSsas provides annotation for several types of functional residues: ligand-, metal- and nucleic acid-binding sites; catalytic residues; and amino acids involved in protein–protein interactions. Annotations are based on data extracted from PDBsum (Laskowski et al., 2005a) and the Catalytic Site Atlas (CSA) (Porter et al., 2004) (Fig. 1A). PDBsum is a pictorial database that provides an overview of the contents of each 3D structure deposited in the PDB, including schematic diagrams from additional analyses, such as LIGPLOT (Wallace et al., 1995), NUCPLOT (Luscombe et al., 1997) and PROCHECK (Laskowski et al., 1993). PDBsum is used to extract information on binding residues and protein–protein interaction interfaces. The CSA is a database containing a set of well-curated catalytic sites and a set of homologous catalytic sites found by sequence and structural similarity. Both databases are based on three-dimensional information, which means that the functional information used in predictions is derived from experimentally solved structures. It is worth mentioning that, while PDBsum and CSA data are used by WSsas, these databases are not themselves annotated using WSsas, as this would lead to circular logic and would affect the reliability of both the databases and the annotation tool.
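The transfer step described above can be illustrated with a minimal Python sketch. It is not the WSsas implementation: the function name, its arguments and the toy sequences are assumptions made for illustration. It takes the two gapped strings of a pairwise alignment and copies each annotation from a hit residue to the query residue in the same alignment column, skipping columns where either sequence has a gap.

```python
def transfer_annotations(query_aln, hit_aln, hit_sites, query_start=1, hit_start=1):
    """Map annotated hit residues onto query residues through a pairwise alignment.

    query_aln, hit_aln: gapped aligned strings of equal length ('-' marks a gap).
    hit_sites: dict mapping 1-based hit residue number -> annotation label.
    query_start, hit_start: residue numbers of the first aligned residues
        (a local alignment need not start at residue 1).
    Returns a dict: query residue number -> (query residue, hit residue, label).
    """
    if len(query_aln) != len(hit_aln):
        raise ValueError("aligned strings must have equal length")

    transferred = {}
    q_pos, h_pos = query_start - 1, hit_start - 1
    for q_res, h_res in zip(query_aln, hit_aln):
        # Advance each counter only when that sequence has a residue in this column.
        if q_res != "-":
            q_pos += 1
        if h_res != "-":
            h_pos += 1
        # Transfer only when both sequences have a residue and the hit one is annotated.
        if q_res != "-" and h_res != "-" and h_pos in hit_sites:
            transferred[q_pos] = (q_res, h_res, hit_sites[h_pos])
    return transferred


# Toy example (hypothetical sequences and annotations).
result = transfer_annotations(
    "MKT-AYSH",
    "MRTGA-SH",
    {3: "ligand-binding", 4: "metal-binding", 6: "catalytic"},
)
print(result)  # {3: ('T', 'T', 'ligand-binding'), 6: ('S', 'S', 'catalytic')}
```

In this toy case the annotation on hit residue 4 is dropped because it falls opposite a gap in the query, so there is no query residue to receive it.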

Fig. 1.

(A) Design of WSsas. An input protein sequence is searched against all protein sequences in the PDB using FASTA. Residue information is retrieved from the PDBsum and CSA databases for all the FASTA hits. (B) Summary of the inputs and output of WSsas. The output is a file in XML format.

SAS and WSsas are very similar tools; however, the way to use the annotation protocol is different. SAS is useful for researchers interested in a single protein sequence, or a small number of sequences, whereas the WSsas web service allows multiple searches or inclusion in workflows, since it is based on ‘computer-to-computer’ communication and its results are easy to retrieve via a script (Kappler, 2008; Labarga et al., 2007). At the moment, there is no restriction on the number of searches, but we may need to change this according to a fair-use policy if the tool is frequently used.

The query input for WSsas is a protein sequence. The output contains functional information (predicted functional residues and additional information about bound compounds), transference information (structures and residues used to transfer information) and alignment information [identity percentage, number of aligned residues and expectation (E) value]. The parameters that can be defined by the user are (Fig. 1B): the maximal E-value for selecting FASTA hits (0.001 by default); the minimum number of aligned residues between query and hits (80 residues by default); and the minimum identity percentage of the aligned stretches (30% by default). By changing the alignment thresholds, users can perform a more or less stringent annotation transfer depending on their interests: increasing the percentage of identity or the number of aligned residues will reduce the number of annotated residues but increase their reliability; conversely, significantly decreasing the thresholds will increase the number of annotated residues but risks entering the twilight zone of sequence similarity. There is also another way of increasing the stringency of prediction: when the stringency flag is activated, the algorithm looks at the atomic contacts involved in the compound-binding or protein–protein interaction and discards an annotation if the query residue does not contain the same functional atom type. Nevertheless, the accuracy of predictions depends entirely on the availability and quality of homologues with solved structures. Finally, an additional verbose flag lists all the relevant atomic interactions in the retrieved hits.
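The effect of these parameters can be sketched in Python as follows. This is only an illustration under stated assumptions: the `Hit` class, the function names and the hit labels are hypothetical, and only the three default values (0.001, 80 and 30%) come from the text. The stringency check shows one possible reading of the rule, in which an annotation is kept only if the query residue possesses every atom that makes the contact in the hit; the atom table is a small illustrative subset using standard PDB atom names.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    label: str             # hypothetical identifier of the matched structure
    evalue: float          # FASTA expectation value
    aligned_residues: int  # number of aligned residues
    identity_pct: float    # percentage identity of the aligned stretch


def passes_thresholds(hit, max_evalue=0.001, min_aligned=80, min_identity=30.0):
    """Return True if a hit satisfies all three alignment thresholds."""
    return (
        hit.evalue <= max_evalue
        and hit.aligned_residues >= min_aligned
        and hit.identity_pct >= min_identity
    )


# Backbone atoms are shared by all amino acids; side-chain atoms differ.
BACKBONE_ATOMS = {"N", "CA", "C", "O"}
SIDE_CHAIN_ATOMS = {  # illustrative subset only
    "G": set(),
    "A": {"CB"},
    "S": {"CB", "OG"},
    "C": {"CB", "SG"},
    "T": {"CB", "OG1", "CG2"},
    "D": {"CB", "CG", "OD1", "OD2"},
}


def keeps_annotation(query_residue, contact_atoms):
    """Stringent mode (assumed reading): keep the annotation only if the query
    residue has every atom that makes the contact in the hit structure.
    Residues missing from the table are treated as backbone only."""
    available = BACKBONE_ATOMS | SIDE_CHAIN_ATOMS.get(query_residue, set())
    return set(contact_atoms) <= available


hits = [
    Hit("hit1", 1e-20, 150, 45.0),  # passes all defaults
    Hit("hit2", 0.01, 200, 60.0),   # E-value too high
    Hit("hit3", 1e-5, 60, 90.0),    # too few aligned residues
    Hit("hit4", 1e-8, 120, 25.0),   # identity too low
]
print([h.label for h in hits if passes_thresholds(h)])  # ['hit1']
print(keeps_annotation("S", ["OG"]))  # True
print(keeps_annotation("A", ["OG"]))  # False
```

Under this reading, a contact made through a backbone atom is always transferable, whereas a side-chain contact is transferred only to residues carrying the same atom.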

From the webpage http://www.ebi.ac.uk/thornton-srv/databases/WSsas/, it is possible to download an XML schema definition (XSD), a web service definition (WSDL) file and a Perl client for using the web service. The basic input for the Perl client is a file containing a query sequence, and the response received by the user is an easy-to-parse XML file. WSsas uses the SOAP protocol for information transfer.
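As a rough idea of what parsing such a response might look like, the following Python sketch reads predicted residues from an XML document using only the standard library. The element and attribute names (`annotations`, `hit`, `residue`, `queryPosition` and so on) and the sample values are placeholders invented for this example; the real names are defined by the XSD distributed with the service and should be substituted.

```python
import xml.etree.ElementTree as ET

# Hypothetical response: all element names, attribute names and values below
# are placeholders, not the actual WSsas schema.
SAMPLE_RESPONSE = """<annotations>
  <hit structure="hypothetical-structure-1" identity="42.5" aligned="120" evalue="1e-30">
    <residue queryPosition="102" queryResidue="D" type="ligand-binding"/>
    <residue queryPosition="57" queryResidue="H" type="catalytic"/>
  </hit>
</annotations>"""


def extract_residues(xml_text):
    """Return (query position, query residue, annotation type, source structure)
    tuples, sorted by query position."""
    root = ET.fromstring(xml_text)
    rows = []
    for hit in root.iter("hit"):
        for res in hit.iter("residue"):
            rows.append(
                (
                    int(res.get("queryPosition")),
                    res.get("queryResidue"),
                    res.get("type"),
                    hit.get("structure"),
                )
            )
    return sorted(rows)


for row in extract_residues(SAMPLE_RESPONSE):
    print(row)
```

Keeping the source structure alongside each residue preserves the transference information, so that a prediction can be traced back to the structure it came from.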

3 CONCLUDING REMARKS

WSsas complements other annotation tools in our group, such as CSA, ProFunc and SAS. It predicts functional residues and their interaction partners on the basis of similarity to solved 3D structures while tolerating local variation. Further features may be added in the future.

ACKNOWLEDGEMENTS

The authors wish to thank Martin Eklund, Hamish McWilliam, Florian Reisinger, Ola Spjuth and one anonymous reviewer for helpful comments and suggestions on the web service and Daniela Wieser for a critical reading of the manuscript. This work was completed as part of the BioSapiens Network of Excellence.

Funding: European Commission within its FP6 Programme, under the thematic area ‘Life sciences, genomics and biotechnology for health’ (contract number LSHG-CT-2003-503265).

Conflict of Interest: none declared.

REFERENCES

Berman HM, et al. The Protein Data Bank, Nucleic Acids Res., 2000, vol. 28 (pg. 235-242)
de Castro E, et al. ScanProsite: detection of PROSITE signature matches and ProRule-associated functional and structural residues in proteins, Nucleic Acids Res., 2006, vol. 34 (pg. W362-W365)
Fielding RT. Architectural styles and the design of network-based software architectures, PhD Thesis, 2000, Irvine, University of California
Flicek P, et al. Ensembl 2008, Nucleic Acids Res., 2008, vol. 36 (pg. D707-D714)
Golovin A, Henrick K. MSDmotif: exploring protein sites and motifs, BMC Bioinformatics, 2008, vol. 9 (pg. 312)
Holliday GL, et al. MACiE (Mechanism, Annotation and Classification in Enzymes): novel tools for searching catalytic mechanisms, Nucleic Acids Res., 2007, vol. 35 (pg. D515-D520)
Kappler MA. Software for rapid prototyping in the pharmaceutical and biotechnology industries, Curr. Opin. Drug Discov. Dev., 2008, vol. 11 (pg. 389-392)
Labarga A, et al. Web services at the European Bioinformatics Institute, Nucleic Acids Res., 2007, vol. 35 (pg. W6-W11)
Laskowski RA, et al. Procheck - a program to check the stereochemical quality of protein structures, J. Appl. Crystallogr., 1993, vol. 26 (pg. 283-291)
Laskowski RA, et al. PDBsum more: new summaries and analyses of the known 3D structures of proteins and nucleic acids, Nucleic Acids Res., 2005a, vol. 33 (pg. D266-D268)
Laskowski RA, et al. ProFunc: a server for predicting protein function from 3D structure, Nucleic Acids Res., 2005b, vol. 33 (pg. W89-W93)
Luscombe NM, et al. NUCPLOT: a program to generate schematic diagrams of protein-nucleic acid interactions, Nucleic Acids Res., 1997, vol. 25 (pg. 4940-4945)
Milburn D, et al. Sequences annotated by structure: a tool to facilitate the use of structural information in sequence analysis, Protein Eng., 1998, vol. 11 (pg. 855-859)
Mulder NJ, et al. New developments in the InterPro database, Nucleic Acids Res., 2007, vol. 35 (pg. D224-D228)
Pearson WR, Lipman DJ. Improved tools for biological sequence comparison, Proc. Natl Acad. Sci. USA, 1988, vol. 85 (pg. 2444-2448)
Porter CT, et al. The Catalytic Site Atlas: a resource of catalytic sites and residues identified in enzymes using structural data, Nucleic Acids Res., 2004, vol. 32 (pg. D129-D133)
Reeves GA, et al. Genome and proteome annotation: organization, interpretation and integration, J. R. Soc. Interface, 2009, vol. 6 (pg. 129-147)
Smith TF, Waterman MS. Identification of common molecular subsequences, J. Mol. Biol., 1981, vol. 147 (pg. 195-197)
UniProt Consortium. The universal protein resource (UniProt), Nucleic Acids Res., 2008, vol. 36 (pg. D190-D195)
Wallace AC, et al. LIGPLOT: a program to generate schematic diagrams of protein-ligand interactions, Protein Eng., 1995, vol. 8 (pg. 127-134)
Yeats C, et al. Gene3D: comprehensive structural and functional annotation of genomes, Nucleic Acids Res., 2008, vol. 36 (pg. D414-D418)

Author notes

Present address: Faculty of Life Sciences, University of Manchester, Manchester M13 9PT, UK.
Associate Editor: Burkhard Rost
