Summary: DBToolkit is a user-friendly, easily extensible tool that allows the processing of protein sequence databases to peptide-centric sequence databases. This processing is primarily aimed at enhancing the useful information content of these databases for use as optimized search spaces for efficient identification of peptide fragmentation spectra obtained by mass spectrometry. In addition, DBToolkit can be used to reliably solve a range of other typical tasks in processing sequence databases.
Availability: DBToolkit is open source under the GNU GPL license. The source code, full user and developer documentation and cross-platform binaries are freely downloadable from the project website at http://genesis.UGent.be/dbtoolkit/
As the tool of choice in present-day high-throughput proteomics, mass spectrometry has evolved substantially in recent years. The classical approach of two-dimensional polyacrylamide gel electrophoresis (2D-PAGE) (O'Farrell, 1975) requires merely a mass measurement of the peptides generated from an enzymatic digest of an isolated protein (Cottrell, 1994). A refinement of this approach uses fragmentation spectra of a few peptides as additional information for the identification of the original protein. In the most recent, so-called gel-free techniques (e.g. as reviewed by Zhang et al., 2004 and Gevaert et al., 2005), however, mass spectrometers must be able to generate high-quality fragmentation spectra from extremely complex peptide mixtures, obtained following proteolytic digestion of an unfractionated proteome of a cell or tissue. These peptide-centric methods were primarily developed to deal with the inherent shortcomings of classical 2D-PAGE techniques and allow for a greater coverage of the proteome while simultaneously increasing the sensitivity of the analysis (Aebersold and Mann, 2003). The peptide-centric technologies have driven the exchange of the protein for the peptide as the basic unit in proteomics research (Aebersold and Mann, 2003; Kearney and Thibault, 2003).
Sequence databases like SWISS-PROT, IPI and the NCBI non-redundant database remain protein-based, however. Since this discrepancy between the respective fundamental units can lead to a loss of highly interesting identifications, we developed the DBToolkit suite of software tools to allow the conversion of protein sequence databases into peptide sequence databases.
The software can recognize FASTA and EMBL formatted databases out of the box, with UniProt and IPI the most prominent examples of the latter. It is also extremely easy for developers to include automatic recognition of different database formats as detailed below.
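As an illustration of what format recognition involves, the following minimal Python sketch guesses the format from the first non-blank line and reads FASTA entries. It is not DBToolkit code (DBToolkit is written in Java); the function names `sniff_format` and `read_fasta` are hypothetical, and the sketch assumes only that FASTA headers start with `>` and that EMBL-style flat-file entries start with an `ID` line.

```python
def sniff_format(text):
    """Guess the database format from the first non-blank line (illustrative only)."""
    for line in text.splitlines():
        if not line.strip():
            continue
        if line.startswith(">"):
            return "fasta"
        if line.startswith("ID "):
            return "embl"
        return "unknown"
    return "unknown"


def read_fasta(text):
    """Return a list of (header, sequence) pairs parsed from FASTA text."""
    entries = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous entry, if there was one.
            if header is not None:
                entries.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        entries.append((header, "".join(chunks)))
    return entries
```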
DBToolkit can perform various types of processing on sequence databases. Simple in silico enzymatic digests using a variety of predefined or user-added enzymes are possible, as are database concatenation and FASTA output of differently formatted databases. The enzymatic digest even allows for ‘dual specificity’ enzymes that generate peptides for which the amino terminus (N-terminus) is the result of a different cleavage pattern than the carboxy terminus (C-terminus). In addition, it is possible to filter databases (the exact filtering options depend on the database format loaded) and to limit output to sequences in a certain mass range. Additional filters by other developers are also readily included in the software (see below). The three most powerful functions of DBToolkit, however, are sequence-based filtering through a simple query language, N-terminal or C-terminal ragging (optionally truncating sequences in the process) and sequence-based redundancy clearing. The ragging process creates a series of subsequences for each ‘mother sequence’, where in each n-th subsequence the first n − 1 residues have been removed from the N-terminal or C-terminal side, respectively.
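The ragging and redundancy-clearing steps described above can be sketched in a few lines of Python. This is an illustrative sketch, not DBToolkit's own (Java) implementation: the names `rag` and `clear_redundancy` are hypothetical, and the `min_length` stopping rule is an assumption added for the example rather than a documented DBToolkit option.

```python
def rag(sequence, terminus="N", min_length=1):
    """Return the ragged series for one mother sequence.

    The n-th entry (counting from 1) has n - 1 residues removed from the
    chosen terminus. `min_length` is an assumed, illustrative stopping rule.
    """
    if terminus not in ("N", "C"):
        raise ValueError("terminus must be 'N' or 'C'")
    series = []
    for removed in range(len(sequence) - min_length + 1):
        if terminus == "N":
            series.append(sequence[removed:])
        else:
            series.append(sequence[:len(sequence) - removed])
    return series


def clear_redundancy(sequences):
    """Drop repeated sequences, keeping the first occurrence of each."""
    seen = set()
    unique = []
    for seq in sequences:
        if seq not in seen:
            seen.add(seq)
            unique.append(seq)
    return unique
```

For the mother sequence `PEPTK`, N-terminal ragging yields `PEPTK`, `EPTK`, `PTK`, `TK` and `K`. Ragging several related sequences produces many shared subsequences, which is why redundancy clearing is a natural follow-up step.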
These functions are readily applied serially to achieve compound results such as a non-redundant, N-terminally ragged subset of a trypsin digest of the Homo sapiens entries in the UniProt database, all of which have a mass between 600 and 4000 Da.
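A serial pipeline of this kind might look as follows in Python. The sketch rests on several assumptions that are not taken from DBToolkit itself: entries are given as hypothetical `(species, sequence)` pairs, trypsin is modelled with the common rule of cleaving C-terminal to K or R except before P with no missed cleavages, masses are neutral monoisotopic masses, and peptides containing non-standard residues are skipped. All function names are illustrative.

```python
# Monoisotopic residue masses in Da (standard 20 amino acids).
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056  # mass of H2O added to the residue sum


def tryptic_digest(protein):
    """Cleave C-terminal to K or R, except before P; no missed cleavages."""
    peptides, start = [], 0
    for i, residue in enumerate(protein):
        next_residue = protein[i + 1] if i + 1 < len(protein) else ""
        if residue in "KR" and next_residue != "P":
            peptides.append(protein[start:i + 1])
            start = i + 1
    if start < len(protein):
        peptides.append(protein[start:])
    return peptides


def monoisotopic_mass(peptide):
    """Neutral monoisotopic mass of a peptide in Da."""
    return sum(RESIDUE_MASS[r] for r in peptide) + WATER


def process(entries, species="Homo sapiens", low=600.0, high=4000.0):
    """Species filter -> tryptic digest -> N-terminal ragging -> mass filter,
    keeping only the first occurrence of each resulting sequence."""
    seen, result = set(), []
    for entry_species, protein in entries:
        if entry_species != species:
            continue
        for peptide in tryptic_digest(protein):
            for removed in range(len(peptide)):
                ragged = peptide[removed:]
                if ragged in seen:
                    continue
                if not set(ragged) <= RESIDUE_MASS.keys():
                    continue  # skip residues without a defined mass (e.g. X)
                if low <= monoisotopic_mass(ragged) <= high:
                    seen.add(ragged)
                    result.append(ragged)
    return result
```

Applying the steps in a single pass keeps memory use proportional to the number of retained peptides rather than to the full ragged digest, which matters for databases of realistic size.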
Several applications for these processed databases are outlined below.
DBToolkit is completely written in the Java programming language and its only requirement is a Java runtime environment 1.3 or above. The suite consists of both an intuitive graphical user interface presenting the user with interactive controls to all processing steps, and an equivalent set of command-line tools for straightforward automation of the processing steps through simple scripting. This latter functionality has allowed us to tie different processing steps in with the automatic database updating of Mascot (http://www.matrixscience.com) for the most popular sequence databases, creating multiple derived databases overnight.
DBToolkit was designed from the start to be easily extensible. Its framework-based design allows novel database loaders or filters to be added without recompiling the software.
Full user and developer documentation for the suite is available from the project website, along with the cross-platform binaries and CVS repository coordinates.
We have applied DBToolkit in the lab for numerous purposes, most notably the generation of specialized databases for use as searchbases for protein identification in Mascot. One approach used ragged, non-redundant peptide databases to increase the number of identified spectra in an N-terminal COFRADIC experiment by ∼40% (Gevaert et al., 2003). Interestingly, most of the peptides identified only in the ragged databases corresponded to the novel N-termini of their progenitor proteins after in vivo processing (e.g. the N-termini of nuclear-encoded proteins that are imported into mitochondria and have lost their transit peptide). Since these processing sites typically did not conform to standard tryptic sites, they were absent from searches performed solely in the original sequence databases. Another application has been found in picking up peptides from apoptosis substrates, yielding the exact cleavage location in those proteins. For this we created non-redundant, enzymatically digested peptide databases using a bifunctional enzyme that created peptides with an N-terminus derived from caspase activity (i.e. consensus cleavage C-terminal to aspartic acid) and a C-terminus derived from trypsin activity. In this way, a large number of caspase cleavage sites have been confirmed and many tentative new sites have been found that would otherwise have eluded identification (unpublished data). A third application centers on the a priori calculation of the potential success a certain COFRADIC procedure could have by rapidly creating non-redundant, comprehensive lists of all detectable peptides containing a specified amino acid. Note that this functionality can be applied to any peptide-centric proteomics approach that can select for sequences by their amino acid content (see Zhang et al., 2004 and Gevaert et al., 2005 for an overview of these techniques).
DBToolkit has proven to be a highly versatile yet very simple tool for routine tasks in sequence database processing. Furthermore, as the applicability and popularity of peptide-centric proteomics experiments expand further, DBToolkit can perform the essential task of complementing proven, probabilistic protein identification software like Mascot with peptide-centric search databases, optimized for the specific conditions and requirements of the research.
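The bifunctional (caspase/trypsin) digest can be illustrated with a short Python sketch. It shows one plausible reading of the description above, not DBToolkit's actual algorithm: each peptide starts directly after an aspartic acid and ends at the nearest downstream lysine or arginine. The function name `dual_digest` and its parameters are hypothetical. For simplicity the sketch ignores the proline rule for trypsin and does not treat the protein C-terminus as a valid peptide end.

```python
def dual_digest(protein, n_sites="D", c_sites="KR"):
    """Peptides whose N-terminus follows an `n_sites` residue and whose
    C-terminus is the nearest downstream `c_sites` residue (assumed rule)."""
    peptides = []
    for i, residue in enumerate(protein[:-1]):
        if residue not in n_sites:
            continue
        start = i + 1  # first residue after the caspase-like cleavage
        for j in range(start, len(protein)):
            if protein[j] in c_sites:
                peptides.append(protein[start:j + 1])
                break
    return peptides
```

A database built this way contains only peptides that could arise from a caspase cleavage followed by tryptic digestion, so a match in such a database points directly at a candidate cleavage site.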
L.M. would like to thank An Staes, Evy Timmerman, Petra Van Damme, Grégoire Thomas and Luc Krols for their useful suggestions and comments on the DBToolkit software during its development phase. K.G. is a Postdoctoral Fellow and L.M. a Research Assistant of the Fund for Scientific Research, Flanders (Belgium) (FWO, Vlaanderen). The project was supported by research grants from the Fund for Scientific Research, Flanders (Belgium) (project number G.0008.03), the Inter University Attraction Poles (IUAP, project number P5/05), the GBOU-research initiative (project number 20204) of the Flanders Institute of Science and Technology (IWT) and the European Union Interaction Proteome (6th Framework Program).
Conflict of Interest: none declared.