Finding abbreviations in biomedical literature: three BioC-compatible modules and four BioC-formatted corpora

BioC is a recently created XML format to share text data and annotations, and an accompanying input/output library to promote interoperability of data and tools for natural language processing of biomedical text. This article reports the use of BioC to address a common challenge in processing biomedical text information—that of frequent entity name abbreviation. We selected three different abbreviation definition identification modules, and used the publicly available BioC code to convert these independent modules into BioC-compatible components that interact seamlessly with BioC-formatted data, and other BioC-compatible modules. In addition, we consider four manually annotated corpora of abbreviations in biomedical text: the Ab3P corpus of 1250 PubMed abstracts, the BIOADI corpus of 1201 PubMed abstracts, the old MEDSTRACT corpus of 199 PubMed® citations and the Schwartz and Hearst corpus of 1000 PubMed abstracts. Annotations in these corpora have been re-evaluated by four annotators and their consistency and quality levels have been improved. We converted them to BioC-format and described the representation of the annotations. These corpora are used to measure the three abbreviation-finding algorithms and the results are given. The BioC-compatible modules, when compared with their original form, have no difference in their efficiency, running time or any other comparable aspects. They can be conveniently used as a common pre-processing step for larger multi-layered text-mining endeavors. Database URL: Code and data are available for download at the BioC site: http://bioc.sourceforge.net.


Introduction
The BioCreative (http://www.biocreative.org/) challenge evaluations-since their inception in 2003-have been a community-wide effort for evaluating text-mining information extraction systems applied to the biomedical domain. Given the emphasis on promoting scientific progress, BioCreative meetings have consistently sought to make available both suitable information extraction systems that handle life science literature and suitable 'gold standard' data for training and testing these systems (1)(2)(3)(4). The BioCreative IV Interoperability track (http://www.bio creative.org/tasks/biocreative-iv/track-1-interoperability/) follows the guidelines established in previous BioCreative meetings, and specifically addresses the goal of interoperability-a major barrier for wide-scale adoption of the developed text-mining tools. As a solution, BioC (5) is an interchange format for tools for biomedical natural language processing. BioC is a simple XML format, specified by DTD (Document Type Definition), to share text documents and annotations. The BioC annotation approach allows many different annotations to be represented, including sentences, tokens, parts of speech and named entities such as genes or diseases. The Interoperability Track at BioCreative IV led to the creation of a number of tools and corpora to encourage even broader use and reuse of BioC data and tools (6).
This article reports on the contributions of our team to the BioC repository in the form of BioC-compliant modules that address the abbreviation definition detection task in biomedical text. These modules can be seamlessly coupled with other BioC code and used with any BioCformatted corpora. We also present several BioC-formatted corpora to test the abbreviation definition detection task. These corpora can serve as gold standard data for developing better machine-learning methods for abbreviation definition recognition. They could even be used as a starting point for adding further annotations of other important biomedical entities such as genes, diseases, etc.

Abbreviation Identification in The Biomedical Domain
The past 20 years have seen a dramatic increase in the interest for automatic extraction of biological information from scientific text, and particularly from MEDLINE V R abstracts. One characteristic of these documents is the frequent use of abbreviated terms. Abbreviated terms appear not only in the scientific text, but they are also frequent in user queries requesting the retrieval of those documents (7). Two related specific issues are-(i) the high rate at which new abbreviations are introduced in biomedical texts, and (ii) the ambiguity of those abbreviations. Existing databases, ontologies and dictionaries must be continually updated with new abbreviations and their definitions. To help resolve this problem, several techniques have been introduced to automatically extract abbreviations and their definitions from MEDLINE abstracts (8)(9)(10)(11). Abbreviation identification is the task of processing text to extract explicit occurrences of abbreviation-definition pairs. For example, in the sentence 'Body mass index (BMI) of the dyslipidemic diabetic patients was significantly higher for females.' 'BMI' is the abbreviation, and 'Body mass index' is its definition. This task requires both the identification of sentences that contain <Abbreviation,Definition> candidate pairs from text, and identification of exact string boundaries. An abbreviation-a ShortForm-is a shorter term that represents a longer word or phrase, which often refers to an important biomedical entity. The definition-the LongForm-is searched for in the same sentence as the ShortForm. An important clue that is shared by abbreviation-detection methods is the presence of parenthetical text and the assumption that parenthetical text signals the presence of an abbreviation definition. Two cases are distinguished: i. LongForm (ShortForm), where the sentence mentions the long form first, following with the abbreviation in parenthesis, and ii. ShortForm (LongForm), where the sentence introduces the abbreviation first, following with the definition in parenthesis.
Of these, the first alternative is observed much more frequently in practice, and in that case, the search for the LongForm is performed on the text string between the beginning of the sentence and the open parenthesis.

Abbreviation Definition Finding Algorithms
In this study, we focus on three abbreviation definition finding algorithms and modify them to work with the BioC format. All three algorithms described here approach the abbreviation identification problem as described above, but have several differences and unique characteristics which we describe as follows:

The Schwartz and Hearst algorithm
The Schwartz and Hearst algorithm (8) identifies <ShortForm, LongForm> candidates using this rule: if the expression within parentheses contains three or more tokens then case (ii) is assumed, otherwise case (i) is assumed. The LongForm is always longer than the ShortForm. A ShortForm is verified to start with an alphanumeric character, to contain at least one letter character and to contain a reasonable number of characters. A LongForm candidate then is extracted from the string so that it contains at most min (jSFjþ5, jSFj Â 2) tokens, where jSFj is the number of characters in the ShortForm. Next, the algorithm traverses both strings right to left, matching the characters in the ShortForm to find the shortest LongForm string. With the exception of the first ShortForm character that has to match a character at the beginning of a word in the LongForm candidate, the rest of the characters can match anywhere in the LongForm string, as long as they are in sequential order. The algorithm is simple, fast, and produces good results (http://bio text.berkeley.edu/software.html).

The Ab3P algorithm
The Ab3P algorithm, developed by Sohn et al. (9) is another pattern-matching approach to abbreviation definition detection. This algorithm defines specific patternmatching rules which the authors called strategies. Depending on the matching strategy and the length of the short form, they estimate an accuracy measure called pseudo-precision. Pseudo-precision provides a computed reliability estimate for an identified <ShortForm, LongForm> pair, without any human judgment. This algorithm is also very fast and provides high-precision results.

The NatLAb algorithm
Rule-based methods, such as the Schwartz and Hearst algorithm and Ab3P, are successful at identifying abbreviation definition pairs with high precision. However, it is challenging for such approaches to identify non-typical pairs, such as <3-D, three dimensional >, or out-of-order matches, such as <T(m), melting temperature>. Machinelearning methods have the potential of recognizing such non-trivial or irregular pairs and improving the recall, if enough training data are provided. NatLAb (Natural Learning for Abbreviations), developed by Yeganova et al. (11) is an example of such an algorithm.
NatLAb is a supervised learning approach whose features, inspired by the basic rules defined in Sohn et al., describe a mapping between a character in a potential ShortForm and a character in a potential LongForm. However, in contrast to Ab3P, Yeganova et al. do not combine these features into hand-crafted strategies. They provide the learner with all these features and feature pairs and let the training process weight them. Feature weights are then used to identify abbreviation definitions.

Converting Into BioC
A BioC-compliant module is a software module that is able to process input data in BioC format, and produce results in BioC format.

BioC-compliant abbreviation identification modules
We converted the original software tools for abbreviation definition recognition into BioC-compliant tools. Each algorithm was modified to accept data input in the BioC format and to produce results in the BioC format. The original Schwartz and Hearst software is written in Java, the original Ab3P software is written in Cþþ and the original NatLAb software is written in Perl. As a result, each implementation used a different BioC library; the necessary links were established so the Schwartz and Hearst algorithm could flow seamlessly with the rest of the BioC-Java code, the Ab3P algorithm with the BioC-Cþþ code and NatLAb with a Perl-BioC implementation (12). The BioCcompliant modules differ from the originals in these main points:  Table 3 in the Results section, where we compare the results of these algorithms on four independent biomedical abbreviation corpora. Next we describe four abbreviation definition corpora and the representation of abbreviations in BioC format.

BioC-formatted corpora
BioC Representation of Abbreviations. The first step in converting a given corpus into BioC format is deciding how to represent the information present in the corpus.
This representation is what is contained in the keyfile that must accompany a BioC corpus. Figures 1 and 2 illustrate the BioC format representation for abbreviation annotations that we used in the corpora used for this study.
The original abbreviation annotations in these corpora consisted of annotations in the form of <ShortForm, LongForm> pairs. To capture this, and to make the corpora versatile for other possible biomedical information retrieval studies, we use this markup: -For annotation elements, the infon key¼type, with value ABBR, semantically identifies the annotations as abbreviations. This allows the possibility of having multiple layers of annotations on the same textual data, even including annotations on other entity types that may also overlap. All such annotations can be added without confusion. -Each part of an abbreviation is identified by an additional infon element, key¼ABBR, with value ShortForm or LongForm, thus preserving the original text's role in an abbreviation definition <ShortForm, LongForm> pair. -Finally, a relation element reflects the pairing between a ShortForm and a corresponding LongForm, as indicated by the role attribute. The same infon, key¼type, with value ABBR, is repeated for the relation to make it easier at a semantic level to distinguish what is being annotated and how they relate together.
Documents may contain multiple abbreviation definitions for the same short form, as we also discovered during our annotation effort. Moreover, an annotated mention may be composed of several non-consecutive strings. One such example is shown in Figure 2. However, these issues are easily addressed in BioC, via the location element. The location element links the annotation definition to the exact textual coordinates where it is mentioned, and also allows for defining mentions composed of multiple non-consecutive substrings. Note that, the location information was not present in the original annotations of these corpora, so this is an important enrichment over the original versions.
Abbreviation definition Corpora in BioC Format. We converted four abbreviation definition recognition corpora to BioC format. These corpora are-the Ab3P corpus (9), the BIOADI corpus (10), an earlier version of the MEDSTRACT corpus (13) and the Schwartz and Hearst corpus (8). Statistics on abbreviation occurrences in the different corpora are contained in Table 1.
The original versions of these corpora resembled each other in that they all consisted of text files where documents were separated by blank lines. For each document we were given a passage of text. In the Ab3P corpus, the text for each document is divided into PubMed title and PubMed abstract lines, and in the BIOADI corpus, the title and abstract lines are concatenated together. In the Schwartz and Hearst corpus we were additionally given author list, affiliation field as well as publication type and venue. Finally, for the MEDSTRACT corpus (The MEDSTRACT corpus used for this study was downloaded from www.medstract.org in 2008 for an earlier study on abbreviations.), we were given bibliographic information in addition to the title and abstract of the journal articles, but PubMed document IDs (PMIDs) were missing. We used the journal citation information and author list to find the corresponding PMIDs for the MEDSTRACT corpus documents.
The original versions of Ab3P, BIOADI and MEDSTRACT contained each document followed by its list of <ShortForm, LongForm> pairs of abbreviations mentioned in the title or abstract, while the Schwartz and Hearst corpus contained an XML-like format where each definition was notated with in-text tags that were connected via an identification number, for example <Long id ¼N>, <Short id¼N>. This format caused some difficulties. For example, corrections were required for: definitions with tags containing mismatched identifiers, malformed tags (for both long and short forms), repeated use of the same identifier for more than one abbreviation definition and incorrect tags.
During the BioC conversion process, we reviewed all the original annotations; and when searching for their offset locations in the text, new potential abbreviation pairs were found. All potential abbreviation cases and borderline cases were considered in detail. These were discussed by all the authors for inclusion or exclusion. The review process included searches on PubMed to find other publications that used the same abbreviation definition. This detailed analysis resulted in a handful of abbreviation pairs which were excluded from the corpora, and a few valid additional abbreviations which were added to the data.
The final numbers for each corpus are shown in Table 1. As a result of these changes, the final numbers reflect a difference in the total number of annotations per corpus, when compared to the original publications. Problems of consistency are not surprising in corpora annotated by a single annotator. Different people reading the same document raise different points and catch missed definitions, thus ensuring a better quality for the final product.
Finally, we compared the four corpora for overlapping documents. We found that there was minimal to no overlap between the four corpora, as shown in Table 2. Continuing on this line of comparison, we extracted all MeSH terms assigned to all documents in each corpus, and performed a comparison between them. Figure 2 shows  the relative importance of these terms for each corpus. The bare minimum overlap in document count, as shown in Table 2, as well as the collective MeSH terms of these corpora, as illustrated in Figure 2, demonstrates that these corpora can safely be combined into a larger abbreviation definition resource. This may help build more sophisticated abbreviation definition recognition models ( Figure 3).

Results
We tested the three abbreviation identifying modules on the Ab3P, BIOADI, MEDSTRACT and Schwartz and Hearst corpora as shown in Table 3. Results are based on the new gold standard annotations in the four abbreviation corpora. When compared to the outputs of the algorithms' original versions, the BioC-compliant modules produced the same results. The BioC versions, however, have the advantage of being easily combined with other BioCcompliant tools to produce a more complex biomedical text processing system.

Conclusions
We present three easy-to-use, BioC-compatible abbreviation definition recognizing modules for biomedical text. The original tools corresponding to Ab3P, Schwartz and Hearst and NatLAb algorithms, have only been altered to read and produce the enriched BioC format. The new BioCcompatible modules faithfully preserve their original efficiency, running-time power and other complexity-related aspects, so they can be confidently used as a common preprocessing step for larger multi-layered text-mining endeavors.
We also present four BioC-formatted abbreviation definition recognition corpora that can be used to test the above modules, as well as to study new natural language processing tools. The new versions of the modules, as well as the accompanying corpora, are freely available to the community, through the BioC website: http://bioc.sourceforge.net.