Abstract

Motivation: At present, mapping of sequence identifiers across databases is a daunting, time-consuming and computationally expensive process, usually achieved by sequence similarity searches with strict threshold values.

Summary: We present a rapid and efficient method to map sequence identifiers across databases. The method uses the MD5 checksum algorithm for message integrity to generate sequence fingerprints and uses these fingerprints as hash strings to map sequences across databases. The program, called MagicMatch, is able to cross-link any of the major sequence databases within a few seconds on a modest desktop computer.

Availability: MagicMatch is available at the following URL (http://cgg.ebi.ac.uk/services/magicmatch/), including an interactive service for major databases and binary downloads for widely used platforms.

Contact:ouzounis@ebi.ac.uk

INTRODUCTION

One of the most labour-intensive yet inevitable tasks in bioinformatics is mapping sequence identifiers across various databases. Numerous databases were designed for specific user requirements and are based on diverse schemas and entry accession keys (usually in the form of identifiers). Moreover, sequence identifiers may be altered between release versions of the same database and often the previous identifiers cannot be easily traced. A few resources—either public, such as NCBI's Entrez (Wheeler et al., 2005), or commercial, such as EMBL's SRS (Etzold and Argos, 1993), were developed to facilitate linking operations across databases. All these involve heavy and sustained resource utilization, extensive computation and do not readily allow rapid mapping of user-specified databases.

At present, the linking of user-specified databases is usually achieved by performing sequence similarity searches that identify which sequence entries are identical or highly similar to each other. Each sequence in a ‘query’ database is searched against other ‘reference’ databases using sequence database search methods such as FASTA (Pearson, 1990) or BLAST (Altschul et al., 1997), which can detect proteins of identical length and primary structure. The sequence identifiers of the query and reference entries are subsequently linked. This method is particularly inefficient owing to high computational cost yet is indispensable and, therefore, utilized by many groups working in bioinformatics.

Here we report an extremely rapid mechanism to map accessions among any sequence databases, which is based on the MD5 checksum algorithm for message integrity (Rivest, 1992). The MD5 algorithm takes as input a message of arbitrary length and produces a 128-bit fingerprint of the message, usually encoded as a 32-character hexadecimal number. The basic premise is that it is virtually infeasible to produce two messages having the same fingerprint or, conversely, produce a message that can potentially generate an prespecified fingerprint (Abzug, 1992http://userpages.umbc.edu/~mabzug1/cs/md5/md5.html). MD5 has been extensively used in data integrity checking to ensure error-free transmission across information channels, and is more reliable than other similar methods (Rivest, 1992).

In the context of bioinformatics, the identity of the sequence (and thus its resulting fingerprint) can be entirely encoded by its primary structure, i.e. the sequence of amino acid residues. Two identical sequences will then result in identical MD5 checksum fingerprints. In fact, checksums (e.g. the 64-bit CRC checksum) have been used in bioinformatics to ensure that the sequence information in a database is not modified. Also, 32-bit checksums have been used to generate non-redundant databases (Holm and Sander, 1998), yet these programs are not widely available—one such example being the nrdb program (W. Gish, unpublished). Other clustering programs have been used to identify redundancies, including CD-HIT (Li et al., 2001) and BlastClust (unpublished, available at http://www.ncbi.nlm.nih.gov/BLAST/docs/blastclust.html)— available from the current BLAST distribution (Altschul et al., 1997) (see below, for a comparison of performance). These, however, have not been developed with cross-referencing in mind: they are able to identify highly similar sequences which might be either excluded from a database or included in groupings for further analysis. The advantage of checksum algorithms, based on raw sequence information, is that they are able to generate unique identifiers based on the checksum fingerprints. Further, despite the fact that the possibility of a duplicate fingerprint occurring randomly cannot be excluded, the probability of this happening is miniscule. It is not possible to estimate these probabilities readily (http://www.woodmann.com/crackz/Tutorials/Md5info.htm).

Using MD5 checksums, speed is achieved by avoiding any pairwise search procedure, which is replaced by a hashing mechanism. The hashing procedure is realized by the representation of protein sequences as a 128-bit MD5 fingerprint, obtained from the MD5 algorithm (Rivest, 1992).

IMPLEMENTATION

We have implemented the procedure of sequence fingerprinting, hashing and comparison of databases in a C++ program, named MagicMatch. MagicMatch receives as an input a FASTA-formatted sequence database and creates identifier-to-fingerprint tables, and/or uses these hash tables to cross-link sequence identifiers across databases. The execution of MagicMatch is extremely rapid, and in our tests on a Intel Xeon 2.8 GHz computer with 2 GB of memory under the Linux operating system, calculating cross-referencing the entire COGENT database (Janssen et al., 2003), containing 684 736 protein sequence entries from 178 complete genomes, to the complete Uniprot database (Bairoch et al., 2005), with 1.52 million sequences, takes 41 s, including the computation of fingerprints and their subsequent matching.

To compare performance with two other (clustering) programs, namely CD-HIT (Li et al., 2001) and BlastClust (Altschul et al., 1997), we obtained two identical sets of 1000 randomly selected sequences and identified duplicates through the three programs: BlastClust takes 68.75 s, CD-HIT takes 2.44 s, while MagicMatch takes 0.04 s (>1000-fold and 61-fold speeded-up, respectively).

LIMITATIONS

MagicMatch is not a sequence comparison procedure and is designed solely to recognize identical sequences. It links all identical sequences across databases irrespective of sequence origin or species/strain information. It will not match sequences that are at least one character different, and thus it is unsuitable for linking sequence fragments or processed and truncated proteins to their full-length precursors. However, since these are not identical sequences they should not be marked as ‘identical’ across databases. Other procedures based on exact matching fragments by overlapping sequences are available.

DISCUSSION

We encourage all public sequence databases to use MagicMatch to calculate MD5 checksums for all their entries. This practice will allow an instant and consistent linking between all publicly available resources and totally eliminate the problem of database cross-referencing based on sequence content. Also, allowing a database search with sequence fingerprints can be a faster way of accessing database entries when the identifier of a protein sequence is unknown. The method is widely applicable to other database entries used in bioinformatics research, including DNA sequences and protein structures, given the general nature of the MD5 checksum algorithm for message integrity (Rivest, 1992).

AVAILABILITY

MagicMatch binary downloads for Windows, Linux, Apple OS X and Solaris are available from http://cgg.ebi.ac.uk/services/magicmatch/

Present address: Mullard Space Science Laboratory, Holmbury St Mary, Dorking, Surrey RH5 6NT, UK
Present address: Microbial Ecology Program, DoE Joint Genome Institute, 2800 Mitchell Drive, Building 400-404, Walnut Creek, CA 94598, USA
§Present address: Wellcome Trust Sanger Institute, Cambridge CB10 1SA, UK

We thank members of the Computational Genomics Group for discussions and the anonymous reviewers for comments.

Conflict of Interest: none declared.

REFERENCES

Abzug, M.T.
1992
MD5 Homepage (unofficial)
Altschul, S.F., et al.
1997
Gapped BLAST and PSI-BLAST: a new generation of protein database search programs.
Nucleic Acids Res.
 
25
3389
–3402
Bairoch, A., et al.
2005
The Universal Protein Resource (UniProt).
Nucleic Acids Res.
 
33
D154
–D159
Etzold, T. and Argos, P.
1993
SRS—an indexing and retrieval tool for flat file data libraries.
Comput. Appl. Biosci.
 
9
49
–57
Holm, L. and Sander, C.
1998
Removing near-neighbour redundancy from large protein sequence collections.
Bioinformatics
 
14
423
–429
Janssen, P., et al.
2003
COmplete GENome Tracking (COGENT): a flexible data environment for computational genomics.
Bioinformatics
 
19
1451
–1452
Li, W., et al.
2001
Clustering of highly homologous sequences to reduce the size of large protein databases.
Bioinformatics
 
17
282
–283
Pearson, W.R.
1990
Rapid and sensitive sequence comparison with FASTP and FASTA.
Methods Enzymol.
 
183
63
–98
Rivest, R.
1992
RFC 1321: the MD5 message-digest algorithm. Internet Engineering Task Force
Wheeler, D.L., et al.
2005
Database resources of the National Center for Biotechnology Information.
Nucleic Acids Res.
 
33
D39
–D45

Comments

0 Comments