Motivation: With the continuing rapid growth of protein sequence data, protein sequence comparison methods have become the most widely used tools of bioinformatics. Among these methods are those that use position-specific scoring matrices (PSSMs) to describe protein families. PSSMs can capture information about conserved patterns within families, which can be used to increase the sensitivity of searches for related sequences. Certain types of structural information, however, are not generally captured by PSSM search methods. Here we introduce a program, Structure-based ALignment TOol (SALTO), that aligns protein query sequences to PSSMs using rules for placing and scoring gaps that are consistent with the conserved regions of domain alignments from NCBI's Conserved Domain Database.
Results: In most cases, the alignment scores obtained using the local alignment version follow an extreme value distribution. SALTO's performance in finding related sequences and producing accurate alignments is similar to or better than that of IMPALA; one advantage of SALTO is that it imposes an explicit gapping model on each protein family.
Availability: A stand-alone version of the program that can generate global or local alignments is available by ftp distribution (ftp://ftp.ncbi.nih.gov/pub/SALTO/), and has been incorporated to Cn3D structure/alignment viewer.
The ongoing completion of genome projects presents scientists with the regular and increasing challenge of processing and understanding proteins using only their amino acid sequences. This has spurred the creation of protein family databases, as well as the design of programs for protein classification and annotation and the recognition of distant protein relationships. Software tools such as SSEARCH (Smith and Waterman, 1981; Pearson, 1991), FASTA (Pearson and Lipman, 1988), BLAST (Altschul et al., 1990), PSI-BLAST (Altschul et al., 1997), IMPALA (Schäffer et al., 1999), SAM (Karplus et al., 1998) and HMMER (Eddy, 1998), among others, have become popular due to their considerable speed and sensitivity in retrieving homologous sequences. Specifically, hidden Markov models (HMMs) and the related position-specific scoring matrices (PSSMs), derived from multiple alignments of related sequences, have been shown generally to increase the sensitivity of such searches (Gribskov et al., 1987; Baldi et al., 1994; Krogh et al., 1994; Altschul et al., 1997; Henikoff and Henikoff, 1997; Durbin et al., 1998; Eddy, 1998; Karplus et al., 1998; Park et al., 1998; Schäffer et al., 1999).
Explicitly, including structural information in these models presents a challenge. This can be addressed to some extent by modeling the existence and lengths of gaps which correspond to the loop regions between secondary structure elements in various positions and by requiring the alignment of certain key residues or motifs. In some sequence comparison programs, however, gap penalties are chosen ad hoc and key residues are aligned only if this will increase the alignment score. Here, we introduce a method, Structure-based ALignment TOol (SALTO), that can be used to generate global or local alignments between a query sequence and a PSSM derived from multiple alignments from the Conserved Domain Database (CDD) (Marchler-Bauer et al., 2003). SALTO forces the alignments it produces to incorporate structural features of the CDD alignments. Because the statistics of global alignment scores are not well understood, this paper will focus only on SALTO in its local alignment mode (SALTO_LOCAL).
CDD multiple alignments are curated manually to be consistent with structure–structure alignments. Each CDD alignment consists of a set of conserved core structures, or blocks. It has been shown that alignments based upon only these blocks can predict experimentally determined functional sites (Panchenko et al., 2004). SALTO constructs position-specific scoring matrices (PSSMs) from multiple alignments in a manner similar to IMPALA and its heuristic cousin RPS-BLAST, but it makes use of CDD's block structure by aligning only PSSM positions that derive from these blocks; PSSM positions outside the blocks, possibly corresponding to structural loops, are not aligned. The alignments SALTO produces are thus expected to include those residues that are key to protein structure and function. The program takes advantage of CDDs, explicit gap model by allowing gaps (at no alignment cost) only between blocks and therefore presumably within structural loops.
We have found that when SALTO is used in its local alignment mode, for most PSSMs the alignment scores it produces for sets of random sequences fit an extreme value distribution. Even though SALTO constrains its alignments to agree with structural evidence, its performance in a search for distant biological relationships is very similar to that of IMPALA. SALTO is now included in the Cn3D (Wang et al., 2000) protein structure viewer (Cn3D is available at http://www.ncbi.nlm.nih.gov/Structure/CN3D/cn3d.shtml) as an option for aligning new sequences and is also available by ftp distribution (ftp://ftp.ncbi.nih.gov/pub/SALTO/).
MATERIALS AND METHODS
Database of Conserved Domains
SALTO uses the PSSMs and the definition of blocks or conserved regions from CDD (Marchler-Bauer et al., 2003). CDD contains curated domain alignments, many originally imported from Smart (Letunic et al., 2004) and Pfam (Bateman et al., 2002). CDD alignments are refined using known three-dimensional structural information and structure–structure alignments and conserved regions are identified that are likely to be present in all family members.
For this study, we began with a set of 271 alignments from CDD version 1.60, the current version of which is available at http://www.ncbi.nlm.nih.gov/Structure/cdd/cdd.shtml (Marchler-Bauer et al., 2003). We excluded from this test set 73 alignments containing repetitive regions. A PSSM was calculated for each of the remaining 198 alignments, using the procedure employed by PSI-BLAST, IMPALA and RPS_BLAST (Altschul et al., 1997). To compare alignments involving different PSSMs with one another, reasonably accurate E-values are required, so it is important to be able to approximate the random score distribution of any PSSMs included in a search database. As described below, for the parameters chosen for this study, 170 of the 198 PSSMs we constructed are well described by an extreme value distribution, hence we confine database searching using SALTO to this subset of PSSMs.
We outline the program SALTO, which uses dynamic programming to find an alignment with maximum score between a query protein sequence and a PSSM, subject to certain constraints. Controlled by a command line option, SALTO can be used in either global or local mode.
In its local mode, SALTO is similar to NCBI's IMPALA and RPS_BLAST, but uses a different approach to aligning a sequence with a PSSM and to introducing and scoring gaps. IMPALA and RPS_BLAST allow gaps anywhere in the alignments they produce and charge each gap a penalty that is an affine function of its length. In contrast, SALTO constrains the location and length of gaps using structure-based information from the CDD database, but does not penalize for their existence.
CDD partitions its multiple alignments into conserved blocks, which do not contain gaps, separated by regions of variable length within the sequences represented. SALTO considers this block structure to be inherited by the PSSMs constructed from the CDD alignments and uses it when aligning a sequence to a PSSM. SALTO aligns a new sequence only to the conserved blocks of a PSSM and no gaps are allowed within each block-alignment. An alignment must involve complete blocks and if a pair of blocks is contained in an alignment, then all intermediate blocks must be included as well. In SALTO's global mode, all blocks must be aligned, but in its local mode, only a contiguous subset (Fig. 1).
Sections of a sequence between two regions aligned to blocks are considered ‘loops’ and a two-parameter constraint is placed on the maximum allowed length of each loop. Specifically, the loop between the matching sequence pieces aligning to blocks i and i + 1 must be no larger than a constant (a) plus a chosen percentile (p) of the loop lengths between blocks i and i + 1 in the CDD multiple alignment. This captures the observed variability of loop lengths in the CDD database, but generally allows for further variation. SALTO uses dynamic programming to find an alignment with maximum score, subject to these structure-based constraints.
To evaluate the sequence similarities returned by a database search method, it is important to be able to estimate the level of similarity one may expect to arise by chance. Analytic results (Karlin and Altschul, 1990; Dembo et al., 1994) show that, using standard substitution matrices, the scores of optimal ungapped alignments of random sequence pairs approach a Gumbel or extreme value distribution (EVD) (Gumbel, 1958). Under certain conditions, the same holds true empirically when gaps are allowed (Smith et al., 1985; Pearson, 1998; Mott, 2000; Altschul et al., 2001) or when sequences are aligned with PSSMs (Altschul et al., 1997). Briefly, the probability that a random score S ≥ x is given by
Although one may expect that random scores for the type of alignment constructed by SALTO_LOCAL will also be well approximated by an extreme value distribution, random simulations are needed to determine whether this is the case and, if so, to estimate the statistical parameters K and λ for any particular PSSM.
To generate a distribution of random scores, we modeled a protein as a sequence of independently selected amino acids, chosen using standard ‘background’ frequencies (Robinson and Robinson, 1991). Each PSSM was aligned, for several gap-length parameters p and a (described in the previous section), to 6000 random sequences of length 1000, and an EVD was fitted to the resulting collection of optimal scores using the maximum-likelihood method (Lawless, 1982). In each instance, a χ2 goodness-of-fit test was performed between the EVD and the data from which it was estimated. For the great majority of PSSMs, there was a range of gap-length parameters for which the EVD could not be rejected, i.e. it had χ2P ≥0.005 (see Results section). Because reasonably accurate statistics are needed to compare alignments involving different PSSMs, these parameter ranges are to be preferred. P-values can then be reported using Equation (1) or E-values using the equation
When the χ2P-value is <0.005, such EVD P-value or E-value estimates for SALTO_LOCAL alignment scores, based upon a questionable model of the random distribution, may be inaccurate. Note that bias in the amino acid composition of protein families is another potential source of inaccuracy in estimating E-values (Schäffer et al., 1999, 2001). This can be mitigated by estimating the EVD parameters for a given PSSM using, in place of standard amino acid frequencies, those that characterize the protein family used to generate the PSSM.
A benchmark for comparing search methods
In this paper, we study the ability of a search of a PSSM database to find domains related to a query protein sequence. For this purpose, we will not record a high-scoring alignment between a sequence and a related PSSM as a true positive unless the alignment involves the correct region of the query sequence. Therefore, to evaluate the sensitivity of the search method, we require not only a test set of sequences and PSSMs and a list of ‘true positive’ relationships between them, but also a record of the sequence regions in which the relationships are valid.
We have described above the collection of 170 PSSMs used, each corresponding to a domain from CDD, with an associated multiple alignment and structures. A set of 6480 ‘non-redundant’ query sequences was employed. This set was constructed by single-linkage clustering, based on BLAST E-values of 10− 80 or less, of the entries in the MMDB structure database (Chen et al., 2003), which is available at http://www.ncbi.nlm.nih.gov/Entrez/. We recorded a sequence–PSSM pair as a true positive if it satisfied one of two criteria described below. For each criterion, we describe as well the region of the sequence in which the relationship is considered ‘valid’. The test sets, annotated list of true positives and valid sequence regions are available from the authors.
First, each CDD family contains a segment or segments from at least one sequence with known structure. We defined the region from the first sequence residue represented by the CDD family to the last as the CDD family footprint. A CDD family may contain several sequences with structures and each then has a family footprint. Using the VAST structure alignment algorithm (Gibrat et al., 1996), all structures represented by a CDD family are compared to all protein structures in db6480; complete protein chains or MMDB domains are employed for this comparison, not just those regions aligned within the CDD family. The list of structurally similar VAST neighbors is available at http://www.ncbi.nlm.nih.gov/Structure/VAST/vast.shtml. The region between the first and last residue aligned by VAST defines a VAST footprint in the CDD family sequences. If the overlap of the CDD family footprint and the VAST footprint comprises at least 80% of each of their lengths, then the relationship is taken to be a true one. The footprint of the VAST alignment within the db6480 sequence is recorded as the valid region. If more than one structure from a CDD family implies a true relationship to a db6480 sequence S, then the entire region between the first and last residues in S contained in some valid region is considered valid. Note that VAST has been compared to other methods for structural alignment and classification and has been shown to have comparable sensitivity (Marchler-Bauer et al., 1997; Matsuo and Bryant, 1999; Shapiro and Brutlag, 2004; Sierk and Pearson, 2004).
Second, using BLAST for each sequence segment from one of the CDD families represented and PSI-BLAST for each corresponding multiple alignment, a search of db6480 was performed. Whenever a hit with E-value 10−4 or less was returned, a true relationship between the sequence involved and the corresponding PSSM was recorded. The valid region within the db6480 sequence was determined as above, using BLAST and PSI-BLAST footprints in place of VAST footprints.
Our list of true relationships is a challenging one for database search programs. Each of the methods described above for inferring a relationship is triggered by a VAST, BLAST or PSI-BLAST alignment. A histogram of the percent identity implied by these alignments is shown in Figure 2. For PSI-BLAST alignments, percent identity is calculated using a consensus sequence for the multiple alignment used to construct the PSSM. As can be seen, most relationships are inferred from alignments with <25% sequence identity. The great majority of these alignments derive from VAST structure relationships, with only ∼10% of the relationships inferred using only the second criterion above.
There are various ways to quantify the performance of a search method on a given test set, but recently receiver operating characteristic (ROC) analysis has been gaining widespread acceptance (Bamber, 1975; Swets, 1988). Assume that the results of a search are returned in a specific order and that they may be classified as either true (related) or false (unrelated). Descending this list, one may plot the ROC curve—the aggregate numbers of true versus false positives returned. A ROCn curve is produced if one truncates the plot after the first n false results (McClish, 1989; Gribskov and Robinson, 1996). An ideal retrieval method will return all the true results before any false ones. The area under a ROC curve is normalized so that ideal retrieval receives a score of 1.0 and this is called a ROC score.
Here, we are interested not only in whether a search program identifies a related PSSM for a given query, but also in whether the returned alignment involves the correct region of the PSSM. A variation on ROC and ROCn curves appropriate to this situation is the ‘localization-response operating characteristic’ (LROC) curve (Chakraborty, 1989; Swensson, 1996; Edwards et al., 2002). Applied to our case, the number of related PSSMs returned with an alignment in the correct region is plotted against the number of unrelated PSSMs returned. Correct PSSMs returned with an alignment in the wrong position are ignored; this is equivalent to a ROC curve in which such correct PSSMs with incorrect alignments are deemed to have been returned after all other PSSMs. With certain reasonable assumptions, it can be shown that, stochastically, LROC scores are linearly related to ROC scores (Swensson, 1996).
In this paper, when using SALTO or IMPALA to compare a sequence to a CDD-derived PSSM, we record a result as a true positive only if it involves a CDD determined to be related to the query sequence in question and the alignment produced overlaps at least 1% of the ‘valid’ sequence region.
We compare the performance of SALTO_LOCAL and IMPALA on the benchmark test set described above using LROCn scores. E-values can be used to combine the results of multiple searches into a single retrieval list, and a single LROCn score may be calculated. Alternatively, LROCn scores can be calculated for individual queries and an average score over a query set may then be used to compare the performance of different search methods. LROCn errors were estimated by a bootstrap procedure (Schäffer et al., 2001).
Loop length parameters
Just as each PSSM is tailored to a particular CDD family, the loop length parameters a and p used by SALTO are similarly tailored. In general, if these parameters are too loose or too restrictive, i.e. if they allow the introduction of long or only short loops in the alignment, respectively, the resulting random score distribution will not follow an EVD and it will not be appropriate to use the PSSM in a search database.
Among the 198 PSSMs studied, we found 170 for which the range of loop length parameters tested yielded random scores well approximated by an EVD. The remaining 28 families failed the test at χ2P-value ≥0.005; we further analyzed why they failed the test. We found that from those 28 entries, 10 CDD entries that were originally hand-curated to have three conserved regions or blocks, contain one block that dominates in size. Another nine CDD entries contain only one or two long blocks. For these entries, the random alignments produced by SALTO are virtually always to the same, single block. In these cases a good fit to the EVD cannot be expected (Gumbel, 1958). This particular problem is easy to avoid; newer versions of CDD may be changed so that long blocks are split and users may also edit these family models themselves for in-house searches.
Given loop length parameters from which to choose, we preferred the ‘loosest’ ones, allowing the introduction of large gaps between blocks. To illustrate the fit of an EVD to a random distribution of SALTO_LOCAL scores, we use the glycoprotein hormone β-chain homologue CDD family (GHB). Figure 3 shows a histogram of random alignment scores when the loop length parameters p = 0.99 and a = 6 are used; the χ2 goodness-of-fit test, to an EVD with parameters λ = 0.153 and K = 0.0026, has P-value 0.88. For this CDD family, goodness-of-fit P-values >0.005 (corresponding to a good fit to an EVD) were obtained with a wide range of loop length parameters: values of p ranging between 0.20 and 0.99, each with several values of a. The parameters used for Figure 3 are the largest that yielded a good fit to the EVD and accordingly are those we associate with this CDD family in our database. Loop length parameters chosen for each of the 170 database CDD families are available by ftp (ftp://ftp.ncbi.nih.gov/pub/SALTO/).
Using query sequences from db6480 and the database of 170 CDD families described above, the relative accuracy of the IMPALA and SALTO_LOCAL search methods, as measured by pooled LROCn scores, are compared in Table 1. LROCn scores are reported for n = 10 000, 25 000 and 100 000 which correspond approximately to 1, 4 and 15 false positives per query. By these measures of retrieval accuracy, SALTO_LOCAL performs as well as IMPALA for low false positive values and outperforms IMPALA as one moves further into the ‘twilight zone’; and the number of false positives increases (Figure 4a and b).
DISCUSSION AND CONCLUSION
Traditionally, to score insertions or deletions in sequence alignments, affine gap penalties have been used despite the fact that this simple model does not adequately describe the evolution of indels (Qian and Goldstein, 2001; Wrabl and Grishin, 2004). It is useful to have a PSSM-alignment search method that can incorporate models of loop location and length. In contrast to uniform affine gap costs, the model described here accommodates large loops in specified positions without extra cost, where curated protein-family multiple alignments suggest they may occur. This can improve the recognition of members of highly divergent protein families, with long insertions (e.g. SALTO outperforms IMPALA in detecting sequences related to the AAA-ATPase families that contain big domain insertions; data not shown). To illustrate the advantage of using SALTO_LOCAL, the alignment between 1B9R_A, the representative protein sequence of the 2Fe–2S iron–sulfur cluster binding domain (fer2) CDD family and 1JRO_A (a protein query that also contains the 2Fe–2S iron–sulfur domain) is shown in Figures 5 and 6. It can be seen that while SALTO_LOCAL aligns the four cysteins that constitute the iron-binding site (depicted in Figs 5 and 6), IMPALA aligns only three of them. In this example, SALTO_LOCAL has the advantage of assessing no penalty for the inclusion of a big loop before the penultion block, which allows it to incorporate the complete iron-binding site; the alignment score produced by IMPALA decreases considerably if that insertion is included, so IMPALA does not identify the functional site as a whole.
SALTO implements one approach to capturing the structural information encoded in the pattern of ungapped blocks of CDD protein families. One problem any such specialized alignment model presents is the characterization of its statistics. Despite the requirement that specified conserved regions be aligned completely or not at all, for most protein families considered loop length parameters may be found that yield random score distributions well characterized by the EVD. Furthermore, as CDD increases in size, it is expected that models of conserved regions and the lengths of loops that separate them will become more accurate. This may improve the power of any tool, like SALTO, that employs such models in its search algorithm. Thus, in the future, it might prove possible to devise more sophisticated gap penalty functions and use them to improve SALTO's performance. Our observation here that good statistical power is obtained with structure-based alignment suggests that further investigation might be worthwhile. In the context of NCBI's CDD project, we have already found that the methods in SALTO, incorporated into Cn3D, lead to better semi-automatic curation of proteins alignments. The final alignments are easier to interpret in the context of the structure and function of the protein aligned.
In summary, our approach provides a reasonable and fast alternative method for using structural and functional knowledge from protein family models when aligning new sequences to those models. Our results show that, despite the constraints SALTO's model imposes on the alignments it generates, its retrieval accuracy, when used to search a database of protein models, is in general better than IMPALA's.
|ROC10 000||ROC25 000||ROC100 000|
|SALTO_LOCAL||0.185 ± 0.005||0.213 ± 0.005||0.300 ± 0.005|
|IMPALA||0.187 ± 0.005||0.205 ± 0.005||0.258 ± 0.005|
|ROC10 000||ROC25 000||ROC100 000|
|SALTO_LOCAL||0.185 ± 0.005||0.213 ± 0.005||0.300 ± 0.005|
|IMPALA||0.187 ± 0.005||0.205 ± 0.005||0.258 ± 0.005|
All search results of the benchmark query test sets, ranked by E-value, are pooled.
We thank Eva Czabarka and Sergey Sheetlin for calculating errors for the LROC measures, and John Spouge and Aron Marchler-Bauer for helpful discussions and comments on the manuscript. Support for this work was provided by the intramural research program of the National Institutes of Health.