Summary: NCBI completed the transition of its main genome annotation database from Locuslink to Entrez Gene in Spring 2005. However, to date few parsers exist for the Entrez Gene annotation file. Owing to the widespread use of Locuslink and the popularity of the Perl programming language in bioinformatics, a publicly available, high-performance Entrez Gene parser in Perl is urgently needed. We present four such parsers that were developed using several parsing approaches (Parse::RecDescent, Parse::Yapp, Perl-byacc and Perl 5 regular expressions) and provide the first in-depth comparison of these sophisticated Perl tools. Our fastest parser processes the entire human Entrez Gene annotation file in under 12 min on one Intel Xeon 2.4 GHz CPU and can be of help to the bioinformatics community during and after the transition from Locuslink to Entrez Gene.
Availability: Source code is available under the Perl and GNU public license at http://sourceforge.net/projects/egparser/
The National Center for Biotechnology Information (NCBI) completed the transition from Locuslink (Pruitt and Maglott, 2001) to Entrez Gene (http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=gene) (Maglott et al., 2005) in Spring 2005. Thus, there is an urgent need for parsers for the ASN.1-formatted (http://www.ncbi.nlm.nih.gov/Sitemap/Summary/asn1.html) Entrez Gene annotation file. However, despite the immense popularity of the Perl programming language among bioinformatics researchers, there is currently no publicly available Perl parser for either the Entrez Gene annotation file or ASN.1 text files in general. The NCBI Entrez Gene parser has a rather steep learning curve and is available only in the C/C++-based toolbox (http://ncbi.nih.gov/IEB/ToolBox/index.cgi/). The latest gene2xml tool from NCBI gives Perl users an indirect way to process Entrez Gene data, because it can convert the binary ASN.1-formatted Entrez Gene files to XML. However, the Entrez Gene XML format (http://www.ncbi.nlm.nih.gov/dtd/) is rather complex and difficult to use, and storing and processing it consume significant computational resources. In contrast, an object-oriented Perl parser for the ASN.1-formatted Entrez Gene files would be efficient and easy to use, and could interface well with the vast number of public bioinformatics tools, e.g. Bioperl (http://bioperl.org/) (Stajich et al., 2002) and EnsEMBL (http://www.ensembl.org) (Birney et al., 2004), as well as any in-house tools developed in Perl. Compared with C/C++-based tools, pure Perl parsers run on every operating system for which a Perl interpreter is available and require no porting effort. We also placed a very strong emphasis on performance optimization when creating our Perl parsers using four different approaches, which resulted in significantly better performance than XML parsers achieve on XML-formatted Entrez Gene files. We describe and compare the characteristics and performance of our parsers here.
To retrieve information from Entrez Gene, we need parsers that can build an easy-to-use data structure from an Entrez Gene record, from which a calling program can retrieve the specific data item(s) of interest. Tools that can parse text using a context-free grammar, or simulate that process, are well suited to this task. We considered four Perl tools that provide powerful text-processing capabilities, ranging from parsing complex and arbitrary data files to performing natural language processing. We present here not only the outlines of our parsers using these tools, but also a comparison of the suitability of each of these tools for practical bioinformatics projects.
Usage and availability of our parsers
Three of the four Entrez Gene parsers we created utilize context-free grammars. We used an LL-grammar with Parse::RecDescent (http://search.cpan.org/dist/Parse-RecDescent/), and specified the same LR-grammar and very similar lexer functions for Parse::Yapp (http://search.cpan.org/~fdesar/Parse-Yapp-1.05/) and Perl-byacc (http://www.cpan.org/src/misc/). The regular expression (regex, http://www.perl.com/doc/manual/html/pod/perlre.html) based parser was implemented using recursive function calls. All four parsers abort immediately when they encounter an offending element, so malformed input never produces a silently incorrect result. The regex-based parser also provides validation and error reporting capabilities. All four parsers are object-oriented Perl modules with an instantiation function and a parse function that has an option to trim the generated data structure. Programs that use any of our parsers simply need to include the module, instantiate a parser object and pass an Entrez Gene record to the object's parse function, which returns the generated data structure. The details of the grammars we used, the parsers and a sample Perl program testing them are available from the sourceforge web site (http://sourceforge.net/projects/egparser/).
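The regex-with-recursion idea can be illustrated with a minimal sketch. It is written in Python rather than Perl and is not the code distributed with the parsers: the class name RecordParser, the ParseError exception, the sample record and the simplified grammar (named fields, anonymous list elements, quoted strings and bare atoms only) are all assumptions made for illustration. Like the parsers described above, it aborts on the first offending element.

```python
import re

# Hypothetical, simplified patterns for an ASN.1-like text record.
_WS = re.compile(r"\s*")
_HEADER = re.compile(r"([A-Za-z][\w-]*)\s*::=\s*")
_NAME = re.compile(r"[A-Za-z][\w-]*")
_STRING = re.compile(r'"((?:[^"]|"")*)"')  # assumes "" escapes a quote
_ATOM = re.compile(r'[^\s,{}"]+')


class ParseError(ValueError):
    """Raised as soon as an offending element is met."""


class RecordParser:
    def parse(self, text):
        """Return a nested dict/list structure for one record."""
        self.text, self.pos = text, 0
        header = self._match(_HEADER)
        if header is None:
            raise self._error("expected 'Name ::='")
        value = self._value()
        if self._peek() != "":
            raise self._error("trailing input")
        return {header.group(1): value}

    def _peek(self):
        # Skip whitespace, then look at (but do not consume) one character.
        self.pos = _WS.match(self.text, self.pos).end()
        return self.text[self.pos:self.pos + 1]

    def _match(self, pattern):
        self._peek()
        m = pattern.match(self.text, self.pos)
        if m is not None:
            self.pos = m.end()
        return m

    def _error(self, msg):
        return ParseError(f"{msg} at offset {self.pos}")

    def _value(self):
        if self._peek() == "{":
            self.pos += 1
            return self._block()  # recursion handles nested blocks
        m = self._match(_STRING)
        if m is not None:
            return m.group(1).replace('""', '"')
        m = self._match(_ATOM)
        if m is None:
            raise self._error("expected a value")
        tok = m.group(0)
        return int(tok) if re.fullmatch(r"-?\d+", tok) else tok

    def _item(self):
        # An item is either "name value" or an anonymous value.
        start = self.pos
        name = self._match(_NAME)
        if name is not None and self._peek() not in (",", "}", ""):
            return name.group(0), self._value()
        self.pos = start
        return None, self._value()

    def _block(self):
        if self._peek() == "}":
            self.pos += 1
            return {}
        items = []
        while True:
            items.append(self._item())
            ch = self._peek()
            if ch not in (",", "}"):
                raise self._error("expected ',' or '}'")
            self.pos += 1
            if ch == "}":
                break
        if all(name is not None for name, _ in items):
            return dict(items)
        if all(name is None for name, _ in items):
            return [value for _, value in items]
        raise self._error("mixed named and anonymous items")


# Made-up record fragment, far smaller than a real Entrez Gene entry.
SAMPLE = '''
Entrezgene ::= {
  track-info { geneid 1234, status live },
  gene { locus "ABC1", syn { "ABC", "alpha ""one""" } }
}
'''
```

A caller would follow the same three steps described above: load the module, instantiate RecordParser, and pass a record to parse, then index into the returned structure, for example record["Entrezgene"]["gene"]["locus"].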
The speeds of the four parsers are all acceptable when parsing small- to moderate-sized Entrez Gene records. The parsers created with Parse::Yapp, Perl-byacc and regex exhibited O(N) behavior, where N is the record size, while the Parse::RecDescent-based parser was about O(N^3) in time based on curve fitting. In fact, it takes the Parse::RecDescent-based parser nearly 20 min to parse a long Entrez Gene record (Entrez Gene ID 4539, 846 KB) on one Intel Xeon 2.4 GHz CPU. In sharp contrast, the same entry takes only 0.51 s to be parsed using our regex-based parser, and about 2.8 and 5.2 s using the Parse::Yapp and Perl-byacc based parsers, respectively. In all, it took only 11.5 min for our regex-based parser to process the entire human Entrez Gene file (145 466 records) on one Intel Xeon 2.4 GHz CPU. The mouse and rat genomes took 9 and 3.5 min, respectively.
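One common way to estimate such a scaling exponent is a least-squares line through the timings on a log-log scale, since t = c·N^k implies log t = log c + k·log N. The sketch below assumes this method (the fitting procedure actually used is not specified here), is written in Python, and runs on synthetic timings generated for the example, not on the measurements reported above. The function name fit_power_law is hypothetical.

```python
import math


def fit_power_law(sizes, times):
    """Fit t = c * N**k by least squares on log-log data; return (k, c)."""
    xs = [math.log(n) for n in sizes]
    ys = [math.log(t) for t in times]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    k = sxy / sxx                       # slope = estimated exponent
    c = math.exp(mean_y - k * mean_x)   # intercept = estimated constant
    return k, c


# Synthetic record sizes (bytes) and timings (seconds), for illustration only.
sizes = [10_000, 50_000, 200_000, 800_000]
linear_times = [2e-6 * n for n in sizes]       # idealized O(N) parser
cubic_times = [1e-15 * n ** 3 for n in sizes]  # idealized O(N^3) parser

k_linear, _ = fit_power_law(sizes, linear_times)
k_cubic, _ = fit_power_law(sizes, cubic_times)
```

With real timings the fitted exponent will not be an exact integer, so it is best read as an approximate growth rate over the range of record sizes measured.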
Feature comparison of the Perl tools
With advances in modern hardware and clusters, software performance is not necessarily the primary concern for researchers. Frequently the evaluation of software tools is influenced heavily by their ease of use, flexibility and debugging capability, among other aspects. With the experience gained from using these modules, we provide a short evaluation of each tool below.
Parse::RecDescent is the most convenient of these modules to use: there is no need to supply a lexer function, and it is very easy to debug and optimize. It also provides superior flexibility (it allows parameter passing among rules, regex terminals, changing the grammar at runtime and context-sensitive grammars, to name just a few features). However, it demands more optimization of the grammar and, even so, still performs very poorly when parsing large input strings.
Parse::Yapp and Perl-byacc
These two tools are based on YACC (Yet Another Compiler-Compiler, http://dinosaur.compilertools.net/yacc/). Both require users to provide a lexer function (or tokenizer) in addition to a grammar, and to run a command-line tool to generate the parser. The two tools need much better documentation and are not nearly as flexible or as easily debugged as Parse::RecDescent. However, the performance of the parser generated by Parse::Yapp is far better than that of the Parse::RecDescent-based parser, and even more so in the case of Perl-byacc.
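To make the role of the lexer function concrete, the following sketch shows the general shape of a tokenizer that a YACC-style parser calls repeatedly, each call returning the next (token type, token text) pair. It is written in Python, not Perl, and does not reproduce the lexer functions used with Parse::Yapp or Perl-byacc; the token names, the make_lexer helper and the "EOF" end marker are assumptions for illustration.

```python
import re

# Hypothetical token set for a simplified ASN.1-like text format.
TOKEN_RE = re.compile(r"""
    (?P<WS>\s+)
  | (?P<ASSIGN>::=)
  | (?P<LBRACE>\{)
  | (?P<RBRACE>\})
  | (?P<COMMA>,)
  | (?P<STRING>"(?:[^"]|"")*")
  | (?P<NUMBER>-?\d+(?![\w-]))
  | (?P<NAME>[A-Za-z][\w-]*)
""", re.VERBOSE)


def make_lexer(text):
    """Return a function that yields one (type, text) token per call."""
    pos = 0

    def next_token():
        nonlocal pos
        while pos < len(text):
            m = TOKEN_RE.match(text, pos)
            if m is None:
                raise ValueError(
                    f"unexpected character {text[pos]!r} at offset {pos}")
            pos = m.end()
            if m.lastgroup != "WS":      # whitespace is skipped, not returned
                return m.lastgroup, m.group()
        return "EOF", None               # end-of-input marker (assumed name)

    return next_token


def tokenize(text):
    """Convenience wrapper: collect all tokens up to and including EOF."""
    lexer = make_lexer(text)
    tokens = []
    while True:
        tok = lexer()
        tokens.append(tok)
        if tok[0] == "EOF":
            return tokens
```

Keeping tokenization separate from the grammar is what lets a generated table-driven parser stay generic, at the cost of the extra function that the user must write and maintain.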
When the grammar describing the data is sufficiently simple, the performance of regex parsers provides an overwhelming advantage. Regex parsers are also easy to debug and to extend with features such as validation and error reporting. However, users essentially have to code the entire lexer and parser logic themselves when using this approach. Therefore, the regex approach is poorly suited to complex grammars.
In conclusion, our regex-based parser for Entrez Gene provides excellent performance and its object-oriented interface is fairly easy to use. Compared with the XML-based approach, our parser consumes less than half the memory and runs 5–100-fold faster (depending on the XML parser used), and the resulting data structure is much easier to use. In fact, while this paper was under review, a Bioperl Entrez Gene parser (Bio::SeqIO::entrezgene) was built on top of our parser.
For bioinformatics researchers considering the tools we discussed for a text-processing project, we recommend using regex and Perl-byacc for data described by simple grammars. For data with complex grammars, Perl-byacc should be chosen. However, if strong flexibility and/or context-sensitive grammar is needed, Parse::RecDescent is the best choice in Perl. Yet if input data records are very large, one should consider using other tools such as ANTLR (http://www.antlr.org/) or SLK, Strong LL(k) Parser Generator (http://home.earthlink.net/~slkpg/) instead.
We want to thank Shahid Imran (GPC Biotech), Zhongwu Lai and Anthony Caruso (both of Altana Research Institute) for insightful discussions.