Profile–profile methods are well suited to detecting remote evolutionary relationships between protein families. Profile Comparer (PRC) is an existing stand-alone program for scoring and aligning hidden Markov models (HMMs), which are based on multiple sequence alignments. Since PRC compares profile HMMs instead of sequences, it can be used to find distant homologues. For this purpose, PRC is used by, for example, the CATH and Pfam domain databases. As PRC is a profile comparer, it only reports profile HMM alignments and does not produce multiple sequence alignments. We have developed the webPRC server, which makes it straightforward to search for distant homologues or similar alignments in a number of domain databases. In addition, it provides the results both as multiple sequence alignments and as aligned HMMs. Furthermore, the user can view the domain annotation, evaluate the PRC hits with the Jalview multiple alignment editor and generate logos from the aligned HMMs or the aligned multiple alignments. Thus, this server assists in detecting distant homologues with PRC as well as in evaluating and using the results. The webPRC interface is available at http://www.ibi.vu.nl/programs/prcwww/.
Sequence-alignment techniques are essential in providing predictions of protein function and evolution. The introduction of sequence–profile methods, such as hmmpfam, hmmsearch (1) and PSI-BLAST (2,3), increased the detection of homologous sequences considerably compared to sequence–sequence methods [e.g. (4)], such as BLAST (3). A profile numerically encodes a multiple sequence alignment and its amino acid diversity by counting the amino acids in each column. Profile hidden Markov models (HMMs), or (profile) HMMs, are statistically more advanced than numerical profiles and allow for variable gap penalties (1). Clearly, profiles, being based on an alignment, contain more information than a single sequence. Indeed, including distant but true homologues in the alignment further increases the chance of detecting similar families (5). We here use the word ‘profiles’ to refer to both numerical profiles and profile HMMs.
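The column-counting idea behind a numerical profile can be illustrated with a minimal sketch. The function names and the toy alignment below are hypothetical and chosen for illustration only; real profile methods additionally apply sequence weighting and pseudocounts, which are omitted here.

```python
from collections import Counter


def count_profile(alignment, gap_chars="-."):
    """Return one Counter of residue counts per alignment column.

    `alignment` is a list of equal-length aligned sequences (strings).
    Gap characters are ignored when counting.
    """
    if not alignment:
        return []
    width = len(alignment[0])
    if any(len(row) != width for row in alignment):
        raise ValueError("all aligned sequences must have the same length")
    profile = []
    for col in range(width):
        counts = Counter(
            row[col].upper() for row in alignment if row[col] not in gap_chars
        )
        profile.append(counts)
    return profile


def column_frequencies(counts):
    """Convert raw counts for one column into relative frequencies."""
    total = sum(counts.values())
    return {aa: n / total for aa, n in counts.items()} if total else {}


# Hypothetical toy alignment of three sequences and five columns.
toy_alignment = ["MKV-L", "MRVAL", "MKI-L"]
profile = count_profile(toy_alignment)
```

Each entry of `profile` describes the amino acid diversity of one column; for example, the second column of the toy alignment holds two lysines and one arginine.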
Over the last decade, sequence–profile methods have been extended to profile–profile methods. Profile–profile methods provide a more sensitive (6–9) way to find distant homologies between proteins. Using profiles for both query and subject (domain database) has been shown to lead to more sensitive detection of evolutionarily remote relationships [e.g. (9,10)]. Different profile–profile methods have been developed, including prof_sim (9) and FFAS (11). We here focus on three widely used state-of-the-art profile–profile programs: Profile Comparer [first released in 2002 (12)], COMPASS [COmparison of Multiple Protein sequence Alignments with assessment of Statistical Significance (6,13)] and HHsearch (7,14). Profile Comparer [PRC, (12)] is a stand-alone program for scoring and aligning HMMs and is routinely used by, for example, the CATH (15) and Pfam (16,17) domain databases. The CATH pipeline uses PRC to detect extremely remote homologues and group them in superfamilies [http://www.cathdb.info/wiki/doku.php?id=about:intro, (15)]. Initially, Pfam used only PRC to detect similar domains (16), but now also uses HHsearch (14) [and SCOOP (18)] to establish Pfam clans (17). In addition, internal links from one Pfam family to another are generated with PRC and SCOOP.
In contrast to HHsearch and COMPASS (7,13), PRC did not yet have a web interface. We have therefore implemented webPRC, a server for searching several public domain databases that offers functionality beyond stand-alone PRC, including HMM-to-alignment translation.
Several major domain databases are provided: Pfam (17), NCBI's Conserved Domain Database (19), KOG (20), TIGRFAMs (21), CATH (22) and SUPERFAMILY (23). We briefly indicate how the profile HMMs and their seed alignments were obtained.
Pfam-A: The Pfam-A (16,17) profile HMMs have been rebuilt locally using the seed alignments downloaded from the Pfam FTP site (http://pfam.sanger.ac.uk) and the hmmbuild options provided therein. When building the HMMs, the starting alignment (also for CDD/KOG and TIGRFAMs) was re-saved by hmmbuild (HMMER v2.3.2; http://hmmer.janelia.org/). This re-saved alignment includes an ‘RF’ line that indicates which alignment columns are absent from the HMM. This line is used to translate the HMM coordinates of the PRC results back to the alignment coordinates.
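The coordinate translation via the ‘RF’ line can be sketched as follows. This is an illustration under stated assumptions, not the webPRC implementation: it assumes that columns represented in the HMM are marked with ‘x’ in the ‘RF’ line and all other columns with another character, that HMM positions are 1-based and that alignment columns are 0-based. The function names are hypothetical.

```python
def match_columns(rf_line, match_char="x"):
    """List the 0-based alignment columns that are represented in the HMM.

    Assumption: `match_char` marks columns present in the HMM.
    """
    return [col for col, char in enumerate(rf_line) if char == match_char]


def hmm_range_to_alignment_range(rf_line, hmm_start, hmm_end, match_char="x"):
    """Translate a 1-based, inclusive HMM range to 0-based alignment columns."""
    cols = match_columns(rf_line, match_char)
    if not 1 <= hmm_start <= hmm_end <= len(cols):
        raise ValueError("HMM range lies outside the model")
    return cols[hmm_start - 1], cols[hmm_end - 1]


# Hypothetical RF line: nine alignment columns, six of them in the HMM.
rf = "xx..xxx.x"
start_col, end_col = hmm_range_to_alignment_range(rf, 2, 5)
```

In this toy case, HMM positions 2 to 5 span alignment columns 1 to 6, because two columns in between are absent from the model.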
CDD/KOG: NCBI's Conserved Domain Database [CDD (19)] and KOG (20) HMMs have been built from the seed alignments downloaded via the CDD site (http://www.ncbi.nlm.nih.gov/Structure/cdd/cdd.shtml). As there can be multiple identical sequence identifiers in CDD alignments, the sequence identifiers in the re-saved alignments were made unique by prepending a number to the entire identifier for reoccurring identifiers only.
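One way to make identifiers unique by numbering only the recurring ones is sketched below. The details are assumptions, since the text does not specify them: the first occurrence is left unchanged, later occurrences receive their occurrence number as a prefix, and an underscore is used as separator. The function name is hypothetical.

```python
def make_unique(identifiers, sep="_"):
    """Return identifiers in the same order, with recurring ones made unique.

    Assumptions: the first occurrence is kept as-is; a later occurrence gets
    its occurrence number prepended (e.g. '2_P12345'). If that candidate is
    already taken, the number is increased until the result is unused.
    """
    seen = set()
    occurrences = {}
    result = []
    for ident in identifiers:
        occurrences[ident] = occurrences.get(ident, 0) + 1
        number = occurrences[ident]
        candidate = ident
        while candidate in seen:
            candidate = f"{number}{sep}{ident}"
            number += 1
        seen.add(candidate)
        result.append(candidate)
    return result


# Hypothetical identifiers with one recurring entry.
unique_ids = make_unique(["P1", "P2", "P1", "P1"])
```

Identifiers that occur only once are returned untouched, so links from sequence labels to external databases remain valid for them.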
TIGRFAMs: The TIGRFAMs (21) HMMs have been rebuilt locally from the seed alignments and the hmmbuild options provided in the TIGRFAMs HMM files.
CATH: The CATH (15,22) HMMs have been obtained from the CATH web site (http://www.cathdb.info). These models are not based on Pfam-like seed alignments, but are produced iteratively starting from a single sequence (24). This can result in huge alignments with high gap content (up to about 80 000 sequences, >50 000 columns, or 680 Mb for a single alignment). For this reason, the CATH models are used directly. Their underlying alignments have been processed to include an ‘RF’ line and a maximum of the first 200 sequences are included in the alignment output.
The user can provide a single protein sequence or multiple sequence alignment via the paste or upload field. A variety of alignment formats is accepted (ClustalW, FASTA, GCG MSF, Stockholm and SELEX). The user may configure the following search parameters: the domain database, PSI-BLAST options, several PRC options, the number of unique hits to be visualized in the hit graphic, and the use of the hmmbuild ‘--hand’ option. This option can be used to mark regions of the alignment that should be absent from the HMM produced by hmmbuild, which is useful for searching with discontinuous domains. The ‘RF’ annotation line, required for the optional ‘--hand’ option, is supported for the SELEX (#= RF) and Stockholm (#=GC RF) formats. Finally, the user may choose to generate logos from the HMM alignments or from the aligned multiple sequence alignments [with LogoMat-P (25) and Two Sample Logo (26), respectively] to visualize the alignments. Example input and output are provided, including the possibility to regenerate the example output (‘rerun the example’).
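To check whether a Stockholm-formatted input carries the ‘RF’ annotation line, the line can be extracted with a few lines of code. The sketch below is illustrative only: the function name and the example alignment are hypothetical, and it assumes that the annotation string itself contains no whitespace. Interleaved blocks are handled by concatenating the ‘RF’ segments.

```python
def stockholm_rf(text):
    """Return the concatenated '#=GC RF' annotation, or None if absent."""
    parts = []
    for line in text.splitlines():
        fields = line.split()
        # A per-column annotation line has the form: '#=GC RF <annotation>'.
        if len(fields) == 3 and fields[0] == "#=GC" and fields[1] == "RF":
            parts.append(fields[2])
    return "".join(parts) or None


# Hypothetical minimal Stockholm alignment with an RF line.
example = """# STOCKHOLM 1.0
seq1      MKV-LA
seq2      MRVALA
#=GC RF   xxx.xx
//
"""
rf_annotation = stockholm_rf(example)
```

Here the fourth column is the only one not marked in the annotation.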
The logos are generated with local installations of LogoMat-P (25) and Two Sample Logo (26). LogoMat-P was adapted such that the generated logos correspond exactly to the HMM alignments reported by PRC. Thus, instead of executing a new pair-wise PRC search to find an HMM alignment between the query and the single subject HMM, the adapted LogoMat-P directly uses the alignment produced by the PRC library run against the domain databases.
RESULTS AND DISCUSSION
The webPRC server facilitates the use of PRC for finding domains related to a query alignment. Besides the possibility to run PRC against different domain databases, webPRC offers additional functionality not available with a PRC stand-alone run.
After completion of a PRC search, the raw PRC output is reformatted into a BLAST-like report, which includes a domain hit distribution graphic and a hit table (Figure 1). This makes interpreting PRC output as straightforward as reading a BLAST report. The reformatted PRC alignments now include the match, insert, and delete percentages (Figure 2). In addition, several other features aiding the evaluation of the hits are included in the report: hits in the table are linked to the source domain database and include a description from the selected domain database. The alignments section contains links to the optionally produced logos. These logos are graphical representations of the aligned HMMs or the aligned alignments and can help in the evaluation of the found domains. LogoMat-P (25) produces pair-wise HMM logos based on the reported PRC alignment. These HMM logos are related to the HMM logos (29) used to visualize the HMMs of protein families in Pfam (17). In addition, Two Sample Logos are produced. These logos are based on two multiple sequence alignments and show the positions that are significantly different between the alignments (26). Furthermore, the alignments section contains an ‘aligned alignments’ presentation. Specifically, this translation of ‘raw’ PRC results to query and hit alignments facilitates the identification of conserved residues. The combined multiple sequence alignments can be viewed in Jalview (28). The sequence labels in the Jalview applet are linked to several sequence databases, including UniProt and Entrez Protein, to facilitate the retrieval of sequence annotations. Finally, the alignments can be downloaded for additional analyses. For example, Sequence Harmony can be used to predict specificity-determining residues from these alignments (30).
The translation from HMM alignments to sequence alignments is provided for most databases. However, the sequence alignments resulting from searches against CATH generally include a large number of gaps (indicated with ‘:’ in the web output). Many alignment columns are indeed absent from their corresponding HMMs due to the high gap content of the seed alignments: for the entire CATH database only 15% of all alignment columns are represented in the HMMs as opposed to 91% for Pfam-A.
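A database-wide fraction of this kind can be computed from the ‘RF’ lines of the alignments. The sketch below assumes, for illustration, that columns represented in the HMM are marked with ‘x’; the function name and the toy input are hypothetical, and the code is not claimed to reproduce the percentages quoted above.

```python
def represented_fraction(rf_lines, match_char="x"):
    """Fraction of all alignment columns that are represented in the HMMs.

    `rf_lines` holds one RF annotation string per alignment.
    Assumption: `match_char` marks columns present in the HMM.
    """
    total = sum(len(rf) for rf in rf_lines)
    if total == 0:
        raise ValueError("no alignment columns given")
    matched = sum(rf.count(match_char) for rf in rf_lines)
    return matched / total


# Hypothetical database of two small alignments.
fraction = represented_fraction(["xx..", "x..."])
```

For this toy input, three of the eight columns are represented, giving a fraction of 0.375.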
Figures 1 and 2 illustrate the webPRC output of a search with ADP-ribosylation factor-binding protein GGA1 (UniProt: GGA1_HUMAN) against Pfam and explain the aligned alignments view. A search with the single sequence indeed finds the known domains: VHS, GAT and GAE (cf. UniProt). PSI-BLAST was then run on this sequence to build an alignment (three iterations, E-value 0.0005, NCBI's NR database). With this alignment as query, not only the VHS domain but also the ENTH and ANTH domains are detected, while the GAE domain is no longer detected. Indeed, the VHS, ENTH and ANTH domains are related, although in general a marginal E-value such as that of the ANTH match (0.007) would require further evidence before a homologous relationship can be claimed. In addition to further profile–profile based searching, it is worthwhile to check the Pfam and CDD databases for information on the retrieved hits: CDD contains superfamilies, and Pfam groups related families into clans and also provides ‘internal database links’. Pfam and CDD provide information on this VHS/ENTH/ANTH cluster. Hence, webPRC can be used to easily find such clusters and links for any query alignment.
E-values can be used to judge the significance of the hits returned by PRC. However, they are accurate only if the library contains more than 1000 profile HMMs (12). The author of PRC indicated that ‘for libraries of sufficient size, E < 0.003 can be taken as indicative of homology and E < 10^-5 as a strong match’ (12). For profile–profile comparisons, Pfam uses an E < 0.001 as an indication of a significant match and E-values between 0.1 and 0.001 as an indication of a true relationship (16).
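The PRC author's rule of thumb quoted above can be written down as a small helper. This is a minimal sketch: the function name and the category labels are hypothetical, and only the thresholds (0.003 and 10^-5) and the library-size condition (more than 1000 profile HMMs) are taken from the text.

```python
def interpret_prc_evalue(evalue, library_size):
    """Label a PRC E-value using the thresholds quoted from (12)."""
    if library_size <= 1000:
        # E-values are stated to be accurate only for larger libraries.
        return "E-value unreliable (library too small)"
    if evalue < 1e-5:
        return "strong match"
    if evalue < 0.003:
        return "indicative of homology"
    return "not significant by these thresholds"


# Hypothetical library size; 0.007 is the ANTH E-value from the example search.
label = interpret_prc_evalue(0.007, library_size=5000)
```

Under these thresholds an E-value of 0.007 is not taken as indicative of homology, which is why additional evidence is advisable for such hits.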
We here describe our PRC web interface and refrain from including another PRC validation. We refer the reader to several benchmarking studies that report on the performance of PRC [(8,12,14,18), http://toolkit.tuebingen.mpg.de/hhpred/help_ov]. Reid et al. (24) benchmarked profile–profile and profile–sequence methods, including PRC, COMPASS and HHsearch, and concluded that PRC is the best method for distinguishing homologous from non-homologous domains. Depending on the specific benchmarking study, PRC performs better or worse than HHsearch, but generally better than COMPASS. We encourage prospective webPRC users to consult these benchmarking studies as well as the COMPASS (13) and HHsearch (7) web servers.
The webPRC server provides a web-based front end to PRC, one of the state-of-the-art methods for detecting remote homology, to carry out similarity searches against well-established domain databases. Since the input is a single sequence or an alignment, users need not build an HMM themselves. In addition to the domain hit distribution graphic and logo visualizations, webPRC features the translation of the PRC HMM alignments to multiple sequence alignments. This supports evaluation of a hit based on multiple sequence alignments. To this end, the Jalview applet has been integrated. Furthermore, the hit, query and combined alignments can be downloaded for additional analyses.
ENFIN, a Network of Excellence funded by the European Commission within its FP6 Programme, under the thematic area ‘Life sciences, genomics and biotechnology for health’, contract number LSHG-CT-2005-518254. Funding for open access charge: ENFIN.
Conflict of interest statement. None declared.
We would like to thank Dr James Procter for extending the Jalview alignment editor with regular expression based link parsing.