Summary: Identifying mentions of named entities, such as genes or diseases, and normalizing them to database identifiers has become an important step in many text and data mining pipelines. Despite this need, very few entity normalization systems are publicly available as source code or web services for biomedical text mining. Here we present the Gnat Java library for text retrieval, named entity recognition, and normalization of gene and protein mentions in biomedical text. The library can be used as a component to be integrated with other text-mining systems, as a framework to add user-specific extensions, and as an efficient stand-alone application for the identification of gene and protein names for data analysis. On the BioCreative III test data, the current version of Gnat achieves a TAP-20 score of 0.1987.
Availability: The library and web services are implemented in Java and the sources are available from http://gnat.sourceforge.net.
The extremely rapid growth of published literature in the biological sciences necessitates the constant improvement of automated text-mining tools to extract relevant information and convert it into structured formats. Terms for the same entities used in biomedical articles can vary widely between authors and across time (Tamames and Valencia, 2006). Thus, two key tasks in biomedical text analysis are named entity recognition (NER; finding names of genes, cell lines, drugs, etc.) and entity mention normalization (EMN; mapping a recognized entity to a repository, such as Entrez Gene or PubChem). Both tasks enable indexing, retrieval and integration of literature with other resources. Gene and protein names in particular represent central entities that are discussed in biomedical texts. While a growing number of tools for gene NER are freely available (e.g. Leaman and Gonzalez, 2008), only a limited number of tools provide gene normalization capabilities that can be used off-the-shelf by end users (e.g. Huang et al., 2011).
In this article, we present a new version of the Gnat system for gene mention recognition and normalization (Hakenberg et al., 2008) and make it available as an open-source Java library and as a remote web service. Gnat now relies on a modular architecture that allows new components to be integrated by implementing relatively simple HTTP interfaces and allows its components to be distributed across servers (local or remote; public or private). The framework allows end users to send PubMed or PubMed Central document identifiers as well as free text to our server, which returns lists of gene mentions with Entrez Gene IDs. Text mining application developers can make use of the same service by using Gnat as a component in their own processing pipelines or by customizing Gnat toward their requirements.
Here we present the major components in Gnat, demonstrate how they interact and how they can be exchanged and extended by developers. We present an overview of the web services provided, which can be used remotely from our server or set up by users at their local sites. Finally, we discuss the performance of gene mention normalization provided by Gnat.
Gnat consists of a set of modules to handle all steps required in a text processing pipeline, from document retrieval to named entity normalization. A general Gnat processing pipeline (Fig. 1) consists of modules that perform the following steps:
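(1) Retrieve each text (from PubMed or PubMed Central when such identifiers are given),
(2) Pre-process each text,
(3) Perform named entity recognition for genes and species,
(4) Remove likely false positive gene mentions,
(5) Assign candidate identifiers to genes,
(6) Validate identifiers, and
(7) Rank candidate gene identifiers.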
Steps (1) and (2) comprise essential text retrieval and pre-processing tasks. Document retrieval uses NCBI e-utils to obtain records from PubMed and PubMed Central when such identifiers are given. Pre-processing of texts includes, for instance, a name range expansion that replaces mentions such as ‘freac1-3’ with ‘freac1, freac2, and freac3’, to aid subsequent gene NER.
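The name range expansion described above can be illustrated with a minimal sketch. It is not Gnat's actual implementation: the class name `NameRangeExpander`, the regular expression and the `MAX_SPAN` safety limit are assumptions made for illustration, and the sketch only handles a letters-only stem followed by a short numeric range.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper illustrating name range expansion; not part of Gnat's API. */
class NameRangeExpander {

    // Letters-only stem, then "<start>-<end>" with at most three digits each.
    private static final Pattern RANGE =
            Pattern.compile("\\b([A-Za-z]+)(\\d{1,3})-(\\d{1,3})\\b");

    // Assumed safety limit so that very wide ranges are left untouched.
    private static final int MAX_SPAN = 20;

    /** Replaces mentions such as "freac1-3" with an explicit list of names. */
    static String expand(String text) {
        Matcher m = RANGE.matcher(text);
        StringBuilder out = new StringBuilder();
        while (m.find()) {
            String stem = m.group(1);
            int start = Integer.parseInt(m.group(2));
            int end = Integer.parseInt(m.group(3));
            String replacement = m.group(); // default: keep the match unchanged
            if (start < end && end - start <= MAX_SPAN) {
                List<String> names = new ArrayList<>();
                for (int i = start; i <= end; i++) {
                    names.add(stem + i);
                }
                replacement = join(names);
            }
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out);
        return out.toString();
    }

    // "a and b" for two names; "a, b, and c" for three or more.
    private static String join(List<String> names) {
        if (names.size() == 2) {
            return names.get(0) + " and " + names.get(1);
        }
        String head = String.join(", ", names.subList(0, names.size() - 1));
        return head + ", and " + names.get(names.size() - 1);
    }

    public static void main(String[] args) {
        System.out.println(expand("Expression of freac1-3 was measured."));
    }
}
```

Descending or very wide ranges are deliberately left unchanged, since they are more likely to be something other than an enumeration of gene names.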
In step (3), Gnat recognizes names referring to both genes and species using a dictionary-based approach. A set of candidate Entrez Gene identifiers is assigned to each gene mention in this step as well, comprising all potential matches based on the gene's name alone. The NER modules available in the current default version of Gnat include the species-dependent dictionary lookups present in previous versions (Hakenberg et al., 2008) for 20 common model organisms including human, mouse, rat and Escherichia coli. In addition to the dictionary-based gene NER taggers, we now provide an interface to Banner (Leaman and Gonzalez, 2008), which uses conditional random fields to recognize candidate gene names. Users can select either of these NER modules, use the joint results of both methods or implement their own NER component (3b in Fig. 1). To identify species names, we incorporated Linnaeus (Gerner et al., 2010) (3a in Fig. 1), whose output determines which dictionary-based gene taggers to run and is used later in the pipeline to narrow down identifiers for ambiguous gene names [step (6)].
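As a rough illustration of how species recognition can steer dictionary-based gene NER, the following sketch looks up tokens only in the dictionaries of species found in the text and collects all candidate identifiers per mention. Everything here is hypothetical: the class `SpeciesDictionaryTagger`, its methods and the placeholder identifiers are not taken from Gnat, and real dictionaries would also need to match multi-word names and spelling variants.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical sketch of species-dependent dictionary lookup; not Gnat's API. */
class SpeciesDictionaryTagger {

    /** A recognized name with its character offsets and candidate identifiers. */
    static final class Mention {
        final String name;
        final int start;
        final int end;
        final Set<String> candidateIds;

        Mention(String name, int start, int end, Set<String> candidateIds) {
            this.name = name;
            this.start = start;
            this.end = end;
            this.candidateIds = candidateIds;
        }
    }

    // Simplification: only single alphanumeric tokens are looked up.
    private static final Pattern TOKEN = Pattern.compile("[A-Za-z0-9]+");

    // species -> (lower-cased gene name -> candidate gene identifiers)
    private final Map<String, Map<String, Set<String>>> dictionaries = new HashMap<>();

    void addEntry(String species, String geneName, String geneId) {
        dictionaries
                .computeIfAbsent(species, s -> new HashMap<>())
                .computeIfAbsent(geneName.toLowerCase(Locale.ROOT), n -> new TreeSet<>())
                .add(geneId);
    }

    /** Tags gene names using only the dictionaries of the species seen in the text. */
    List<Mention> tag(String text, Set<String> speciesInText) {
        List<Mention> mentions = new ArrayList<>();
        Matcher m = TOKEN.matcher(text);
        while (m.find()) {
            String key = m.group().toLowerCase(Locale.ROOT);
            Set<String> ids = new TreeSet<>();
            for (String species : speciesInText) {
                Map<String, Set<String>> dictionary = dictionaries.get(species);
                if (dictionary != null && dictionary.containsKey(key)) {
                    ids.addAll(dictionary.get(key));
                }
            }
            if (!ids.isEmpty()) {
                mentions.add(new Mention(m.group(), m.start(), m.end(), ids));
            }
        }
        return mentions;
    }

    public static void main(String[] args) {
        SpeciesDictionaryTagger tagger = new SpeciesDictionaryTagger();
        tagger.addEntry("human", "p53", "HUMAN_ID_1"); // placeholder identifiers
        tagger.addEntry("mouse", "p53", "MOUSE_ID_1");
        for (Mention mention : tagger.tag("Human p53 is mutated", Set.of("human"))) {
            System.out.println(mention.name + " -> " + mention.candidateIds);
        }
    }
}
```

A name shared by several species keeps all of its candidate identifiers at this stage; narrowing them down is left to the later filtering and ranking steps.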
Steps (4) to (7) comprise the actual gene mention normalization task, for which we have implemented a range of filters to remove likely false positive gene mentions as well as candidate IDs. Removal of false positives (FPs) uses information in the gene name itself, the surrounding text, as well as entire paragraphs or full text to ensure that a found name refers to a specific gene rather than to a non-gene term. Likely FPs are further removed if not also recognized by Banner. Note that in contrast to most gene name identification tools, mentions that refer to gene families are considered FPs in the current version of Gnat, since the aim is to find gene mentions that can be mapped to a specific entry in Entrez Gene. Thus, one of the filters removes mentions such as ‘G proteins’, although this step can be tailored to specific needs.
Candidate identifiers can then be further filtered or validated, for example, by removing genes from species not mentioned in the text, or by other user-defined methods [step (6)]. In step (7), the remaining ambiguous cases (gene mentions with more than one potential Entrez Gene ID) are ranked by comparing contextual information found in the text surrounding the mention with knowledge about each gene. For example, known Gene Ontology annotations for a gene increase its rank when that GO term is found in the nearby text, and similar methods are used for chromosomal locations, associated diseases, enzymatic activity, tissue specificity, etc. More details on the individual components, especially for disambiguation and normalization, can be found in Hakenberg et al. (2008) and Solt et al. (2010), which discuss specific implementations for BioCreative II and III, respectively (see also Section 5).
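The idea behind steps (6) and (7) can be sketched as follows. This is a deliberately simplified stand-in, not the scoring used in Gnat: the class `CandidateRanker`, its method names and the plain substring matching are assumptions, and each candidate's score is just the number of its known annotation terms that occur in the context.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

/** Hypothetical sketch of candidate filtering and ranking; not Gnat's API. */
class CandidateRanker {

    /** Step (6): keep only candidates whose species is mentioned in the text. */
    static List<String> filterBySpecies(Map<String, String> speciesByCandidate,
                                        Set<String> speciesInText) {
        List<String> kept = new ArrayList<>();
        for (Map.Entry<String, String> entry : speciesByCandidate.entrySet()) {
            if (speciesInText.contains(entry.getValue())) {
                kept.add(entry.getKey());
            }
        }
        Collections.sort(kept); // deterministic order for the caller
        return kept;
    }

    /** Counts how many annotation terms occur (case-insensitively) in the context. */
    static int score(String context, Collection<String> annotations) {
        String lower = context.toLowerCase(Locale.ROOT);
        int hits = 0;
        for (String annotation : annotations) {
            if (lower.contains(annotation.toLowerCase(Locale.ROOT))) {
                hits++;
            }
        }
        return hits;
    }

    /** Step (7): order candidates by descending score; ties are broken by identifier. */
    static List<String> rank(List<String> candidates,
                             Map<String, List<String>> annotationsByCandidate,
                             String context) {
        List<String> ranked = new ArrayList<>(candidates);
        ranked.sort(Comparator
                .comparingInt((String id) ->
                        -score(context, annotationsByCandidate.getOrDefault(id, List.of())))
                .thenComparing(Comparator.<String>naturalOrder()));
        return ranked;
    }

    public static void main(String[] args) {
        Map<String, List<String>> annotations = Map.of(
                "ID_A", List.of("membrane"),                    // placeholder data
                "ID_B", List.of("kinase activity", "nucleus"));
        System.out.println(rank(List.of("ID_A", "ID_B"), annotations,
                "Kinase activity was observed in the nucleus."));
    }
}
```

In practice the annotation terms would come from resources such as Gene Ontology, and different evidence types (chromosomal location, disease, tissue) could be weighted separately rather than counted equally.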
3 USING THE GNAT JAVA LIBRARY
For each of the aforementioned steps, we provide implementations as Gnat components that can be used as they are, extended or replaced by developers within their own pipelines. Most components can run either locally within the client (for instance, during development) or as remote services (with public or restricted access). For example, users might want to implement different NER strategies or supply custom dictionaries for species currently not provided in the default version. Users might alternatively want to include non-specific gene mentions that could be mapped to structured vocabularies such as MeSH that include gene families, or to include information from DNA or protein sequences in the text to improve gene mention normalization (Haeussler et al., 2011). Likewise, the final ranking methods can be adapted, or different input/output formats could be defined.
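To make the idea of exchangeable components concrete, the sketch below shows one way a pipeline of pluggable steps could be organized. All names (`SketchPipeline`, `Step`, `StopListStep`) are hypothetical and chosen for illustration; they are not the interfaces shipped with Gnat, for which the library's documentation should be consulted.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Set;

/** Hypothetical pipeline of exchangeable steps; not the interfaces shipped with Gnat. */
class SketchPipeline {

    /** One processing step: receives the text and current mentions, returns the mentions to pass on. */
    interface Step {
        List<String> apply(String text, List<String> mentions);
    }

    /** Example user extension: drops mentions that appear on a user-supplied stop list. */
    static final class StopListStep implements Step {
        private final Set<String> stopList; // expected to hold lower-cased entries

        StopListStep(Set<String> stopList) {
            this.stopList = stopList;
        }

        @Override
        public List<String> apply(String text, List<String> mentions) {
            List<String> kept = new ArrayList<>();
            for (String mention : mentions) {
                if (!stopList.contains(mention.toLowerCase(Locale.ROOT))) {
                    kept.add(mention);
                }
            }
            return kept;
        }
    }

    private final List<Step> steps = new ArrayList<>();

    /** Appends a step; steps run in the order in which they were added. */
    SketchPipeline add(Step step) {
        steps.add(step);
        return this;
    }

    List<String> run(String text, List<String> initialMentions) {
        List<String> mentions = initialMentions;
        for (Step step : steps) {
            mentions = step.apply(text, mentions);
        }
        return mentions;
    }

    public static void main(String[] args) {
        SketchPipeline pipeline = new SketchPipeline()
                .add(new StopListStep(Set.of("g proteins")))
                .add((text, mentions) -> mentions); // a no-op step written as a lambda
        System.out.println(pipeline.run("G proteins and BRCA1 were studied.",
                List.of("G proteins", "BRCA1")));
    }
}
```

Because each step only depends on the shared interface, a developer could swap the stop-list filter for a different strategy, or wrap a call to a remote service in a step, without touching the rest of the pipeline.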
4 USING THE GNAT WEB SERVICE
The Gnat system also implements web services using HTTP POST and GET requests that can be used by both end users and developers. To submit a text for annotation, the following three URL parameters can be used:
Users can also modify the default behavior of the web service to specify the particular tasks to perform with the parameter
Large-scale community challenges to assess the status and compare methods for gene mention normalization have been ongoing since 2003 (see the overview of BioCreative I, Task 1B, in Hirschman et al., 2005). Gnat has been evaluated on three BioCreative datasets: BioCreative I is composed of abstracts from papers on mouse, fruit fly, and yeast genes, BioCreative II is composed of abstracts from papers on only human genes, and BioCreative III is composed of full-text articles with no restriction on species. For human genes only, an earlier version of Gnat was ranked first among the participants in BioCreative II (Morgan et al., 2008), achieving a precision and recall of 82.1 and 81.6%, respectively, on a test set of 262 abstracts. Subsequent modifications to Gnat improved precision to 90.1%, with recall at 81.1% (Hakenberg et al., 2008). On a dataset derived from BioCreative I+II, covering genes from 13 species in 100 abstracts (Hakenberg et al., 2008), the provided default processing pipeline achieves 79% precision at 65% recall. For BioCreative III, performance was evaluated using the TAP-k metric (Lu et al., 2010), which is based on a ranked list of predictions (Carroll et al., 2010). The 50 manually annotated full-text articles chosen for maximal variability among submissions served as the gold standard for BioCreative III, on which Gnat achieves a TAP-20 score of 0.1987, with the highest-ranking method yielding only 0.3466 (Lu et al., 2010). Due to the difference in the scoring metrics, results are not easy to compare directly between BioCreative challenges; our own experiments show precision and recall values for the current system of 53.6 and 47.4%, respectively, on the manually curated training data (see Lu et al., 2010).
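For readers less familiar with the measures quoted above, the following sketch computes precision and recall from a set of predicted and a set of gold-standard identifiers. It assumes the usual set-based definitions and is not the official BioCreative scoring script; the class `NormalizationScores` and the placeholder identifiers are hypothetical. The TAP-k metric is not covered here.

```java
import java.util.HashSet;
import java.util.Set;

/** Hypothetical helper for set-based precision and recall; not an official scorer. */
class NormalizationScores {

    /** Fraction of predicted identifiers that are in the gold standard. */
    static double precision(Set<String> predicted, Set<String> gold) {
        if (predicted.isEmpty()) {
            return 0.0; // convention chosen here for the empty case
        }
        return (double) overlap(predicted, gold) / predicted.size();
    }

    /** Fraction of gold-standard identifiers that were predicted. */
    static double recall(Set<String> predicted, Set<String> gold) {
        if (gold.isEmpty()) {
            return 0.0; // convention chosen here for the empty case
        }
        return (double) overlap(predicted, gold) / gold.size();
    }

    // Number of identifiers present in both sets (true positives).
    private static int overlap(Set<String> a, Set<String> b) {
        Set<String> common = new HashSet<>(a);
        common.retainAll(b);
        return common.size();
    }

    public static void main(String[] args) {
        Set<String> predicted = Set.of("ID_A", "ID_B", "ID_C", "ID_D"); // placeholder data
        Set<String> gold = Set.of("ID_A", "ID_B", "ID_E");
        System.out.println("precision = " + precision(predicted, gold));
        System.out.println("recall    = " + recall(predicted, gold));
    }
}
```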
One drawback of the current default processing pipeline of Gnat with respect to the BioCreative III test set comes from limiting our predictions to genes from 20 model organisms. The manually curated gold standard for 50 full-text documents includes an unusual composition of species compared with the training set: for example, 23% of all genes in the gold standard belong to Enterobacter sp. 638. This species and three more that contribute an additional 15% to the gold standard genes are not currently supported by the default dictionary-based NER in Gnat, but user-specific dictionaries could be added quickly when new species are anticipated, a procedure for which we provide detailed instructions in the documentation. Future extensions of the Banner library within Gnat to map any recognized gene name to species and candidate IDs should also help to make up for the low recall introduced by the current limitation in the species supported.
The current version of Gnat implements a client–server architecture, where individual modules can be set up to run within the client or as servers. Typically, a module would run as a server if it performs a memory-intensive processing step, requires a certain time for startup or is a finalized component; modules that run within the client are those tailored to individual users' needs or those undergoing development. Using the default pipeline, it takes an average of 5 s to annotate a PubMed abstract; however, this number clearly depends on the underlying hardware and usage of remote services and can thus serve only as a rough estimate. Given the modular architecture, Gnat's modules can be easily tailored or replaced. For example, Gnat currently relies on Linnaeus for species NER and provides an interface to Banner for gene NER, demonstrating the ability to easily incorporate external tools, especially if they provide a Java API.
Here we presented the Gnat library for gene name recognition and normalization in biomedical text, now freely available from SourceForge at http://gnat.sourceforge.net. Gnat is written in a modular fashion to allow end users to annotate their textual data using the public web services, as well as text-mining developers to customize Gnat and host their own remote services, either public or private. Gnat provides many individual components of a typical text processing and gene name normalization pipeline, which can be extended or swapped by developers where necessary. As such, Gnat adds to the set of open-source tools now available for researchers to use for large-scale gene name normalization studies, providing a variety of access points to different users, from end users submitting text to a web service and treating Gnat's processing pipeline as a ‘black box’, to developers who use only some of Gnat's modules and replace others.
Precision and recall of Gnat can range from 54/47% on full-text documents that do not match main model organisms, to 82/82% on abstracts that reflect the species composition of the majority of PubMed. For the latter, consisting of an ensemble of ten common model organisms, we host web services that accept PubMed and PubMed Central IDs and free text as input, and return mentions and/or Entrez Gene identifiers, which we hope will provide an opportunity to enhance research across many domains of bioinformatics.
Funding: Biotechnology and Biological Sciences Research Council (CASE studentship to M.G., grant BB/G000093/1 to C.M.B., G.N.); the European Commission (grant HEALTH-F4-2008-223210 to C.M.B.); German Academic Exchange Service (DAAD) to I.S.
Conflict of Interest: none declared.