HAMAP (High-quality Automated and Manual Annotation of Proteins—available at http://hamap.expasy.org/) is a system for the automatic classification and annotation of protein sequences. HAMAP provides annotation of the same quality and detail as UniProtKB/Swiss-Prot, using manually curated profiles for protein sequence family classification and expert curated rules for functional annotation of family members. HAMAP data and tools are made available through our website and as part of the UniRule pipeline of UniProt, providing annotation for millions of unreviewed sequences of UniProtKB/TrEMBL. Here we report on the growth of HAMAP and updates to the HAMAP system since our last report in the NAR Database Issue of 2013. We continue to augment HAMAP with new family profiles and annotation rules as new protein families are characterized and annotated in UniProtKB/Swiss-Prot; the latest version of HAMAP (as of 3 September 2014) contains 1983 family classification profiles and 1998 annotation rules (up from 1780 and 1720). We demonstrate how the complex logic of HAMAP rules allows for precise annotation of individual functional variants within large homologous protein families. We also describe improvements to our web-based tool HAMAP-Scan which simplify the classification and annotation of sequences, and the incorporation of an improved sequence-profile search algorithm.
Falling costs and continuing technological advances in DNA sequencing have led to an explosion in the number of available whole genome sequences from all branches of the tree of life, opening up exciting new possibilities for research into the evolution and function of biological systems. However, as the number of protein-coding gene sequences continues to grow exponentially, the tiny fraction of experimentally characterized sequences continues to shrink, despite the best efforts of groups such as the Enzyme Function Initiative (1) and COMBREX (2) to accelerate the rate of functional characterization through combined computational and experimental approaches. This growing gap highlights a need for automated systems that can effectively leverage the available experimental information to provide precise functional annotation for the tens of millions of predicted protein sequences that will probably never be characterized (3).
One such system is HAMAP (High-quality Automated and Manual Annotation of Proteins), which provides automatic classification and functional annotation of protein sequences based on their homology to characterized templates (4). HAMAP is based on a collection of expert curated protein family profiles, which are used to determine family membership of protein sequences, and annotation rules, which specify the appropriate annotation for family members. HAMAP rules permit the annotation of protein sequences to the same level of detail and quality as manually curated UniProtKB/Swiss-Prot records, annotating protein and gene names, function, catalytic activity, cofactors, subcellular location, protein–protein interactions, as well as sequence features such as the presence of specific domains, motifs and functionally important sites (such as ion-, substrate- and cofactor-binding sites, catalytic residues and post-translational modifications). Annotations are provided in the form of the human-readable UniProtKB text format and using UniProt controlled vocabularies and terms from the Gene Ontology (GO) (5). As well as the annotations themselves, HAMAP rules also specify the conditions under which these annotations may be applied, such as a requirement for key functional residues (identified by structural or other experimental studies). Such conditions can reduce the incidence of erroneous annotation, particularly in large, functionally diverse families—errors that tend to persist in public sequence databases (6–8).
HAMAP forms one component of the UniProt UniRule system that provides annotation for the unreviewed component of the UniProt Knowledgebase UniProtKB/TrEMBL (9). HAMAP family profiles and annotation rules are created (and updated) concurrently with the curation of experimentally characterized templates into UniProtKB/Swiss-Prot, by the same expert curators. This ensures that the family profiles accurately reflect the properties of trusted protein family members, that target sequences are annotated to the quality standards of UniProtKB/Swiss-Prot, and that updates to UniProtKB/Swiss-Prot records are subsequently recorded in HAMAP rules (and propagated to homologous UniProtKB/TrEMBL records). In addition to UniProtKB, HAMAP also provides protein family annotation for Ensembl Genomes (10) as well as a number of other genome annotation pipelines (11,12).
In the remainder of this article we describe developments in HAMAP since our last report in the Database Issue of Nucleic Acids Research. We also provide examples of how the careful manual curation of HAMAP profiles and associated rules can generate precise functional annotation for individual members of large and functionally diverse protein families.
ANNOTATION AND CONTENT
Refining HAMAP family profiles for increased specificity of functional annotation
HAMAP defines family membership of protein sequences using generalized profiles derived from manually curated multiple sequence alignments (MSAs) of trusted members (4,13). Precise functional annotation requires the careful definition of isofunctional protein families and functionally important residues, excluding other functional categories and closely related families curated in UniProtKB/Swiss-Prot. During curation of the MSA, erroneous sequences and misaligned positions are corrected where necessary (described in (4); the complete workflow, ftp://ftp.expasy.org/databases/hamap/SOP_HAMAP_profile_creation.pdf, is included as supplementary file S1). Profiles are generated using the pftools package (available at http://web.expasy.org/pftools/) as described in (14,15). The specificity of the resulting profile may be modulated through the use of different pseudocounts, which assign scores to amino acid residues that have not been observed in the sequence alignments used to construct the profile (16). The values of these scores are derived from the PAM (Point Accepted Mutation) (17) and BLOSUM (BLOcks SUbstitution Matrix) (18) amino acid scoring matrices, which cover a wide range of evolutionary distances. Matrices tailored to shorter evolutionary distances will more strongly penalize substitutions that have not been observed, producing profiles that more faithfully reflect the observed diversity in the alignment and that may better separate closely related subfamilies. There are of course limitations to this approach, and it is not always possible to generate HAMAP profiles that discriminate between very closely related sequences; one example, concerning certain subfamilies of sirtuins, is described below.
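The effect of matrix choice on pseudocounts can be illustrated with a minimal Python sketch that scores a single alignment column. This is not the pftools algorithm: the three-letter alphabet, the uniform background, the two toy substitution tables (`CLOSE` and `DISTANT`) and the `weight` parameter are all assumptions made for illustration.

```python
import math

# Toy alphabet and uniform background (assumed; real profiles use 20 residues).
ALPHABET = "DEG"
BACKGROUND = {"D": 1 / 3, "E": 1 / 3, "G": 1 / 3}

# Hypothetical substitution probabilities P(b | a); each row sums to 1.
# CLOSE mimics a matrix for short evolutionary distances (substitutions rare).
CLOSE = {
    "D": {"D": 0.90, "E": 0.08, "G": 0.02},
    "E": {"D": 0.08, "E": 0.90, "G": 0.02},
    "G": {"D": 0.02, "E": 0.02, "G": 0.96},
}
# DISTANT mimics a matrix for long evolutionary distances (substitutions common).
DISTANT = {
    "D": {"D": 0.50, "E": 0.35, "G": 0.15},
    "E": {"D": 0.35, "E": 0.50, "G": 0.15},
    "G": {"D": 0.15, "E": 0.15, "G": 0.70},
}


def column_scores(column, matrix, weight=1.0):
    """Return log2-odds scores for every residue at one alignment column.

    The pseudocount for residue b is the matrix-weighted average over the
    residues actually observed, so unobserved residues still get a score.
    """
    n = len(column)
    counts = {a: column.count(a) for a in ALPHABET}
    scores = {}
    for b in ALPHABET:
        pseudo = sum(counts[a] / n * matrix[a][b] for a in ALPHABET)
        freq = (counts[b] + weight * pseudo) / (n + weight)
        scores[b] = math.log2(freq / BACKGROUND[b])
    return scores


column = "DDDD"  # only aspartate is observed at this position
close = column_scores(column, CLOSE)
distant = column_scores(column, DISTANT)
```

With these toy values, the unobserved residues E and G receive lower scores under `CLOSE` than under `DISTANT`, in line with the statement that matrices tailored to shorter evolutionary distances penalize unobserved substitutions more strongly.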
The process of HAMAP family profile generation is iterative, and curators may modify the seed alignment, the profile construction parameters, and the threshold score for trusted family members until a profile with satisfactory specificity and sensitivity is achieved—based on the annotation of the matching UniProtKB/Swiss-Prot records. The parameters used for final profile generation are stored together with the seed alignment, so that profiles can be regenerated as needed.
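One step of this iterative assessment can be sketched in Python as follows. The sketch assumes that each reviewed sequence has already been scored against a candidate profile and labelled as a family member or not from its UniProtKB/Swiss-Prot annotation; the function names and the example scores are hypothetical and do not describe the curators' actual tooling.

```python
def evaluate_cutoff(hits, cutoff):
    """Return (sensitivity, specificity) for a list of (score, is_member) pairs."""
    tp = sum(1 for s, m in hits if s >= cutoff and m)
    fp = sum(1 for s, m in hits if s >= cutoff and not m)
    fn = sum(1 for s, m in hits if s < cutoff and m)
    tn = sum(1 for s, m in hits if s < cutoff and not m)
    sensitivity = tp / (tp + fn) if tp + fn else 1.0
    specificity = tn / (tn + fp) if tn + fp else 1.0
    return sensitivity, specificity


def lowest_clean_cutoff(hits):
    """Return the lowest observed score at or above which no non-member matches."""
    for cutoff in sorted({s for s, _ in hits}):
        if not any(s >= cutoff and not m for s, m in hits):
            return cutoff
    return None


# Hypothetical scores for five reviewed sequences.
hits = [(30.0, True), (25.0, True), (18.0, False), (12.0, True), (9.0, False)]
cutoff = lowest_clean_cutoff(hits)
sensitivity, specificity = evaluate_cutoff(hits, cutoff)
```

In this toy data set one true member scores below a non-member, so no cutoff gives both perfect sensitivity and perfect specificity; this is the kind of situation in which a curator might revise the seed alignment or the profile construction parameters rather than only the threshold.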
HAMAP is continually updated, and HAMAP profiles and families may be modified, extended, or split as results from new phylogenetic analyses and experimental characterization data become available. A case in point is provided by the sirtuin family of proteins, whose members were thought to act exclusively as protein deacetylases (19,20). Phylogenetic analyses (using methods described in 21–25) suggest five families of sirtuins—classes I, II, III, IV and U (17) (see Figure 1). Class III sirtuins, including the human SIR5 protein (UniProtKB/Swiss-Prot record Q9NXA8), were recently found to exhibit both protein demalonylase and protein desuccinylase activity (26,27). The class III sirtuin of Escherichia coli (CobB, P75960) also functions as a protein desuccinylase (28), while that of Plasmodium falciparum (Sir2A, Q8IE47) hydrolyses medium and long chain fatty acyl groups from lysine residues (29), suggesting an ancient divergence of function in evolution. Specificity for these relatively bulky substrates may be conferred by a larger hydrophobic pocket and substrate-binding residues (Tyr-102 and Arg-105 in human SIR5) common to all class III sirtuins from all kingdoms of life (20,30). As part of the normal HAMAP workflow, all characterized sirtuin protein records in UniProtKB/Swiss-Prot were first updated (31). The existing HAMAP family profile for bacterial sirtuins (profile MF_01121) was modified to specifically match only the class III sirtuins, and new family profiles were created for classes II and U (profiles MF_01967 and MF_01968 respectively). HAMAP annotation rules for class III sirtuins were created that allow specific annotation of protein function and sequence features for both prokaryotic and eukaryotic sequences (rules MF_01121 and MF_03160 respectively). 
Class I and IV subfamilies are not currently treated by HAMAP, as these are further divided into subclasses (Ia, Ib, Ic and IVa, IVb, respectively), where each subclass contains multiple paralogs per species. Such complex duplications may be better addressed using methods that explicitly consider evolutionary history in the form of a phylogenetic tree. Other resources such as Pfam provide broad coverage of sirtuin family proteins (with a single signature PF02146) while a more restricted PIRSF signature (PIRSF037938) currently covers only the sirtuin subclass Ib members.
HAMAP allows specific functional annotation within homologous protein families
The rule syntax used by HAMAP (described in http://hamap.expasy.org/unirule/unirule.html) allows for control statements that specify conditions, such as the occurrence of specific residues or motifs, for the application of annotation. These control statements provide a flexible means of fine-tuning the annotation of individual members of protein families, illustrated here using the 6-phosphofructokinase (PFK) family. PFK is a key regulatory enzyme of glycolysis that is present in all three domains of life. Despite this high level of conservation the enzyme has a remarkable evolutionary history, featuring a high rate of horizontal gene transfer and substitution in its active site (32). These substitutions have a profound impact on enzyme function; PFK family members with a glycine (G) at the active site catalyze the phosphorylation of D-fructose 6-phosphate to fructose 1,6-bisphosphate using adenosine triphosphate (ATP) (in the first committed step of glycolysis), while those with aspartate (D) use inorganic pyrophosphate (PPi) as the phosphoryl donor in a reversible reaction that occurs in both glycolysis and gluconeogenesis (32–34).
HAMAP defines eight PFK families in line with the currently accepted classification of PFKs (32,35) (Table 1). Several of these families include both PPi-dependent and ATP-dependent members, suggesting that phosphoryl-donor specificity may have changed multiple times during the evolution of the PFK superfamily. Figure 2 illustrates how this functional variation within families is treated by HAMAP using annotation rule MF_01976, which describes members of the mixed substrate PFK group III subfamily. The precise annotation that is applied to members of this family depends on the nature of the active site residue (D104 in the experimentally characterized template of Amycolatopsis methanolica, UniProtKB/Swiss-Prot record Q59126). Case statements within the rule specify the correct protein name, catalytic activity (including EC number), function, keywords, GO terms and other annotations for family members bearing either D or G at their active site. Sequences having neither of these residues are annotated as generic 6-phosphofructokinases of unknown substrate-specificity. The example of PFK illustrates how a single residue may determine substrate specificity and enzyme function, but HAMAP rule syntax also allows conditional annotation based on the combination of multiple residues or sequence motifs. The methylthioadenosine (MTA) phosphorylases are one example, where conserved amino acid substitutions in the substrate binding pocket convert the substrate specificity of this enzyme from 6-aminopurine (EC 2.4.2.28) to 6-oxopurine nucleosides (EC 2.4.2.1 and EC 2.4.2.44) (described in MF_01963).
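The conditional logic described above for the PFK active site can be sketched in Python as follows. This is not HAMAP rule syntax and not the content of rule MF_01976: the `Annotation` class, the `annotate_pfk` function and the annotation strings are hypothetical, and the sketch assumes that the position aligned to the template active-site residue has already been located in the target sequence.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Annotation:
    protein_name: str
    phosphoryl_donor: str


def annotate_pfk(sequence: str, active_site_index: int) -> Annotation:
    """Choose an annotation from the residue at the (0-based) active-site position."""
    in_range = 0 <= active_site_index < len(sequence)
    residue = sequence[active_site_index] if in_range else ""
    if residue == "G":
        # Glycine at the active site: ATP is the phosphoryl donor.
        return Annotation("ATP-dependent 6-phosphofructokinase", "ATP")
    if residue == "D":
        # Aspartate at the active site: PPi is the phosphoryl donor.
        return Annotation("PPi-dependent 6-phosphofructokinase", "PPi")
    # Neither residue: fall back to a generic name with unknown specificity.
    return Annotation("6-phosphofructokinase", "unknown")


# Hypothetical five-residue sequences; index 2 stands in for the active site.
atp_type = annotate_pfk("MAGKD", 2)
ppi_type = annotate_pfk("MADKD", 2)
generic = annotate_pfk("MAAKD", 2)
```

The explicit fallback branch reflects the behaviour described in the text, where sequences carrying neither residue are still assigned to the family but receive only the generic annotation.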
Table 1. The PFK family of proteins in HAMAP
Since our last publication in the NAR Database Issue 2013, we have added 203 new family profiles and 278 new annotation rules to HAMAP. As of 3 September 2014, HAMAP contains 1983 family classification profiles and 1998 annotation rules (a single HAMAP family profile may be associated with multiple HAMAP annotation rules, where each rule applies to a distinct taxonomic group). Through the UniRule pipeline, HAMAP provides annotations for 10,874,356 UniProtKB/TrEMBL sequence records (release 2014_08), which is around 13% of all sequence records in UniProtKB/TrEMBL, and 16% of the sequence records of each prokaryotic complete proteome. HAMAP provides 48% of all annotations and 90% of all sequence-specific feature annotations for the UniRule automatic annotation pipeline of UniProt. One of the strengths of HAMAP lies in the granularity and the comprehensiveness of its annotations, with each HAMAP rule providing over 16 annotations per UniProtKB/TrEMBL record on average.
Improvements to the web interface for HAMAP-Scan
Protein sequences can be classified and annotated using HAMAP through our HAMAP-Scan web service (http://hamap.expasy.org/hamap_scan.html). We provide a single-page, 3-step, dynamic submission form where required fields are clearly marked, and every field is accompanied by a short explanatory text. Each user choice dynamically updates the submission form, such that only necessary fields are displayed. The form allows submission of user sequences (FASTA) and UniProt sequence record identifiers or sequence accessions; users may submit individual sequences or whole proteome sequences. All submitted sequences are returned to the user in UniProtKB format in the order of submission, while protein sequences that have a trusted match to a HAMAP family profile are also annotated by the associated HAMAP rule. All result entries (including entries that are not annotated) contain an additional section with information on matches to HAMAP family profiles, including the profile accession number and identifier, the match quality (trusted or weak), and the match score (with the score difference to the trusted cut-off score of the profile in parentheses) (Figure 3). HAMAP profiles are also available through InterProScan (36) provided by the InterPro Consortium (37), of which HAMAP is a member.
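The match information reported for each result entry can be illustrated with a short Python sketch. The function name, the existence of a separate lower cut-off for weak matches and the output layout are assumptions made for illustration; they do not reproduce the actual HAMAP-Scan output format.

```python
def describe_match(accession, score, trusted_cutoff, weak_cutoff):
    """Summarize a profile match, or return None if the score is too low.

    The difference to the trusted cut-off is shown in parentheses with an
    explicit sign, so a weak match shows how far it falls short.
    """
    if score >= trusted_cutoff:
        quality = "trusted"
    elif score >= weak_cutoff:
        quality = "weak"
    else:
        return None
    delta = score - trusted_cutoff
    return f"{accession} {quality} {score:.1f} ({delta:+.1f})"


# Hypothetical scores and cut-offs for one profile.
trusted = describe_match("MF_01976", 32.5, trusted_cutoff=30.0, weak_cutoff=20.0)
weak = describe_match("MF_01976", 25.0, trusted_cutoff=30.0, weak_cutoff=20.0)
missed = describe_match("MF_01976", 10.0, trusted_cutoff=30.0, weak_cutoff=20.0)
```

Only the trusted case would lead to annotation by the associated rule; the weak case would be reported in the match section without annotation.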
Accelerated HAMAP-Scan with pfsearchV3
To facilitate the use of HAMAP-Scan for the classification and annotation of large datasets such as whole proteome sequences, we have implemented the improved version of the PROSITE search tool pfsearchV3 (38) for HAMAP. pfsearchV3 uses modern CPU instructions to exploit the capabilities of multicore processors and a new heuristic filter to rapidly score and select possible candidate matches, achieving speeds up to two orders of magnitude faster than the previous version of this algorithm. We plan to make the heuristic score thresholds for HAMAP profiles available to our users in the near future.
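The general idea of a heuristic prefilter followed by full scoring can be sketched in Python as follows. This is a generic filter-then-score pattern, not the pfsearchV3 implementation: both scoring functions are toy stand-ins, and the cut-off values are arbitrary assumptions.

```python
def two_stage_search(sequences, heuristic_score, full_score,
                     heuristic_cutoff, full_cutoff):
    """Score sequences in two stages and count the expensive evaluations."""
    hits = []
    full_evaluations = 0
    for seq in sequences:
        # Stage 1: a cheap score rejects most sequences early.
        if heuristic_score(seq) < heuristic_cutoff:
            continue
        # Stage 2: the expensive score is computed only for survivors.
        full_evaluations += 1
        score = full_score(seq)
        if score >= full_cutoff:
            hits.append((seq, score))
    return hits, full_evaluations


def toy_heuristic(seq):
    """Stand-in for a fast filter: count glycine residues."""
    return seq.count("G")


def toy_full(seq):
    """Stand-in for a full profile alignment score."""
    return 2.0 * seq.count("GG")


sequences = ["AAAA", "GAGA", "GGAG", "AGGG"]
hits, full_evaluations = two_stage_search(
    sequences, toy_heuristic, toy_full, heuristic_cutoff=2, full_cutoff=2.0
)
```

In a scheme of this kind, a heuristic cut-off that is set too high can discard sequences that the full score would have accepted, so the threshold has to be chosen per profile with some care.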
HAMAP provides accurate and detailed functional annotation for the exponentially growing population of uncharacterized protein sequences in public databases such as UniProtKB/TrEMBL, as well as tools and services for external users. HAMAP profiles allow the definition of isofunctional protein families of any size and scope according to current knowledge. HAMAP annotation rules provide fine-grained annotations for family members, based on the presence of specific functional residues (as illustrated here for the PFK families). The creation of family profiles and annotation rules in HAMAP is a manual effort performed by expert curators. Manual curation of the experimental literature in UniProtKB/Swiss-Prot is highly accurate (6), and expert curation of HAMAP profiles and rules is specifically designed to avoid over-annotation through the careful definition of isofunctional protein families and functionally important residues. HAMAP annotations can be accessed via UniProtKB, or generated by users for their own protein or proteome sequences via the HAMAP-Scan service on the HAMAP website.
Supplementary Data are available at NAR Online.
We thank Anne Morgat and Marco Pagni for insightful comments and discussions on the scope and direction of HAMAP. We also thank Brigitte Boeckmann for critical reading of the manuscript and for help with the phylogenetic analysis of the sirtuin protein family.
Funding. Swiss Federal Government through the State Secretariat for Education, Research and Innovation; National Institutes of Health [U41HG006104]; Swiss National Science Foundation [JRP09 and JRP13]. Funding for open access charge: Swiss Federal Government through the State Secretariat for Education, Research and Innovation.
Conflict of interest statement. None declared.