## Abstract

Objective: The aim of this study was to develop and evaluate a method of extracting noun phrases with full phrase structures from a set of clinical radiology reports using natural language processing (NLP) and to investigate the effects of using the UMLS® Specialist Lexicon to improve noun phrase identification within clinical radiology documents.

Design: The noun phrase identification (NPI) module is composed of a sentence boundary detector, a statistical natural language parser trained on a nonmedical domain, and a noun phrase (NP) tagger. The NPI module processed a set of 100 XML-represented clinical radiology reports in Health Level 7 (HL7)® Clinical Document Architecture (CDA)–compatible format. Computed output was compared with manual markups made by four physicians and one author for maximal (longest) NP and those made by one author for base (simple) NP, respectively. An extended lexicon of biomedical terms was created from the UMLS Specialist Lexicon and used to improve NPI performance.

Results: The test set was 50 randomly selected reports. The sentence boundary detector achieved 99.0% precision and 98.6% recall. The overall maximal NPI precision and recall were 78.9% and 81.5% before using the UMLS Specialist Lexicon and 82.1% and 84.6% after. The overall base NPI precision and recall were 88.2% and 86.8% before using the UMLS Specialist Lexicon and 93.1% and 92.6% after, reducing false-positives by 31.1% and false-negatives by 34.3%.

Conclusion: The sentence boundary detector performs excellently. After the adaptation using the UMLS Specialist Lexicon, the statistical parser's NPI performance on radiology reports increased to levels comparable to the parser's native performance in its newswire training domain and to that reported by other researchers in the general nonmedical domain.

The medical record in the United States is still largely paper based. However, there is increasing interest in the creation of a national model for a ubiquitous electronic health record (EHR).1 To ensure wide adoption and interoperability, the EHR must be based on standards. In particular, the use of standard terminologies for data representation will be critical. Many clinical information systems enforce standard semantics by mandating structured data entry. While this approach can be successfully applied for a subset of core clinical data elements, it has limited use when dealing with the myriad of narrative clinical documents that make up the majority of the patient record. These documents are often dictated and then transcribed into electronic format, with little or no attempt to standardize content representation.

The Health Level 7 (HL7)® Clinical Document Architecture (CDA)2 offers a standard for the representation and communication of clinical documents but currently leaves the methodology for representing document content to the system implementer. We have been interested in this problem for some time, in part because of the importance of automatically linking imaging data to clinical imaging reports using standardized terminology in Multimedia Electronic Medical Record Systems (MEMRS).3 We are developing a system called ChartIndex that transforms electronic clinical documents into an XML-based, CDA-compliant format and then automatically identifies and represents important biomedical concepts within the transformed documents using the National Library of Medicine's (NLM) Unified Medical Language System (UMLS)®.4,5

Many researchers have worked on the problem of automated biomedical concept recognition. The SAPHIRE system designed by Hersh et al.6,7 automatically encodes UMLS concepts using lexical mapping. The lexical approach is computationally fast and useful for real-time applications. However, this approach when used alone may not provide optimal results. More recently, Zou et al.8 developed IndexFinder, which adds syntactic and semantic filtering on top of lexical mapping to improve performance. Other researchers use more advanced Natural Language Processing (NLP) techniques, such as part of speech (POS) tagging and phrase identification, together with lexical techniques to facilitate concept indexing in either clinical documents9,10,11,12,13 or biomedical literature retrieval.14,15,16,17,18,19,20 MedLEE11 is a system developed by Friedman et al. to encode free clinical text into structured format. Along with a few other systems,21,22,23 MedLEE encodes modifiers together with core concepts in noun phrases (NPs). It has been applied to a number of different types of clinical documents and achieved encouraging results.24,25,26 Cooper and Miller12 carried out an evaluation of MeSH® encoding using lexical, statistical, and hybrid approaches. Nadkarni et al.13 conducted a study of concept matching in discharge summaries and surgical notes. Purcell and Shortliffe15 used concepts encoded in document headings to improve indexing of free text. Berrios16 exploited the above technique together with a vector space model and a statistical method to match text content to a set of query types. Berrios et al.27 also reported on building a semi-automatic system called Internet-based Semi-Automated Indexing of Documents (ISAID) to aid the manual indexing of a textbook.
MetaMap17,18 identifies UMLS concepts in text and returns them in a ranked list in a five-step process, identifying simple NPs, generating variants of each phrase, finding matched phrases, assigning scores to matched phrases by comparing them with the input and composing mappings. MetaMap has been used in a number of applications.28,29

In almost all of the above systems, phrase identification is an important step. Phrase identification, especially noun phrase identification (NPI), has been investigated by researchers in both general domains and the biomedical domain. In general domains, Bourigault30 reported using maximal NPs to extract terminologic NPs. However, no performance value was reported. Ramshaw and Marcus31 and Cardie and Pierce32 reported precision and recall of 91% to 93.5% on extracting base NPs from the Penn Treebank Wall Street Journal (WSJ) corpus. As mentioned by Cardie and Pierce, the above work is difficult to compare directly because each approach used a slightly different definition of NPs (even for base NPs).

In the biomedical domain, Bennett et al.33 reported the precision and recall of several NPI systems on Medline abstracts. The performance was measured on four fields: title, author, abstract, and MeSH® terms. It may not be comparable to NPI on free text alone, as only the abstract field among the above contains free text in complete sentences, whereas the text in the other three fields is most likely already in NPs. The above two systems were not augmented with the UMLS Specialist Lexicon.34 Spackman and Hersh35 reported NPI performance on discharge summaries. The recall and precision on NPs were 77% and 68.7%, respectively, using two different systems. The results were better if measured on partial NPs. Berrios et al.36 reported in their experiment that 66% of nouns and noun modifiers and 81% of NPs were correctly matched to UMLS concepts. However, the experiment focused on the evaluation of a concept matching algorithm, and no comparable performance numbers on NPI were reported.

Our prior work37,38 has been affected by poor precision. We have previously described39 a new approach, called contextual indexing, that can partially ameliorate this problem. However, this approach depends on implementing a successful NPI engine. In this report we describe a strategy, based on machine learning, statistical natural language processing, and the UMLS Specialist Lexicon, that succeeds at both sentence boundary detection and biomedical NPI within clinical radiology reports. Rather than develop a system that results in a single NP representation, we have instead adopted a deep parsing approach that captures the entire parse tree for each target NP within a clinical document. We believe that this approach offers maximum flexibility at the time of indexing. To evaluate this approach, we performed two experiments, described in this report, which used the maximal NP and the base NP within each tree to calculate the boundary precision and recall characteristics of the model.

## Methods

Our approach to encoding biomedical concepts in clinical documents makes the following assumptions: most biomedical concepts are represented as NPs; most biomedical concept NPs do not span across sentences; NPs serve as a good starting point for UMLS indexing. With the above assumptions, we set out to develop and evaluate robust automated mechanisms for delimiting sentences and identifying NPs in clinical radiology reports.

### Setting and Selection of Documents

The document collection used in this study was 100 radiology reports, randomly selected from a larger collection of 1,000 de-identified radiology reports from Stanford Hospital & Clinics. The reports covered the most common imaging modalities (25 computed tomography [CT] scans, five mammograms [MAMMO], 25 magnetic resonance imaging [MR/MRI] reports, 10 radiology procedures [PROC], 30 radiographs [RAD], and five ultrasounds [US]). There were 16,298 words, 3,043 maximal NPs, 4,755 base NPs, and 1,506 sentences in this document set. The sentences were mostly well formed, and the partial sentences were almost all NPs. This set of 100 documents was split into two halves, serving as a training set and a test set.

### Data Collection and Processing

As reported in our prior work,39 we have developed a software module to reliably convert semistructured free-text radiology reports into segmented HL7 CDA-compatible XML documents.2 In this experiment, using these XML documents we implemented a three-step process to identify NPs: sentence boundary detection (SBD), full NLP parsing, and NP tagging.

#### Sentence Boundary Detection

While the parser we used did provide a crude sentence-breaking mechanism, its accuracy was unsatisfactorily low. Therefore, we developed a sentence boundary detector to pre-delimit sentences within documents before sending them to the parser. Among the many potential machine-learning methods for SBD, such as Naïve Bayes, decision trees, neural networks, Hidden Markov Models (HMM), and Maximum Entropy Modeling (MEM), we chose MEM as our main method for solving this problem. This decision was based on a number of factors: first, MEM has a solid mathematical foundation; second, MEM offers considerable flexibility and power, given that it can draw from a variety of information sources; third, Reynar and Ratnaparkhi40 have shown that even a relatively simple MEM model can achieve very good performance in SBD. The following is a brief introduction to MEM as an approach to solving the problem of SBD. General readers can skip the technical details in this section, if desired.

Information entropy is an essential concept in the field of information theory41 and can be defined as follows for a single random variable X with probability mass function p(x)

$$H(X) = -\sum_{x} p(x) \log_2 p(x) \qquad (1)$$

This equation measures the average uncertainty of random variable X. We can view text as a token stream and SBD as a random process to delimit text into sentences, with the output defined as a random variable Y with value y. We can also define part of the token stream as a context stream x, which determines the value of Y. In the domain of all possible values of x, we can similarly define a random variable X. Thus, to solve the SBD problem, all we need is a conditional probability model p(y|x), predicting the sentence boundaries y given a context stream x, to simulate the random process of text. Such a model can be constructed to yield the maximum likelihood value of a training set, with a joint empirical distribution over (x, y) as the following,

$$\tilde{p}(x, y) = \frac{1}{N} \times \text{number of times } (x, y) \text{ occurs in the training set} \qquad (2)$$

where N is the size of the training set.

However, there are many such models. Thus, we may further define Boolean features f(x, y), as shown below, to place constraints on the models generated, where $\tilde{p}(x)$ is the empirical distribution of x in the training set: the expected value of each feature under the model must equal its expected value under the empirical distribution.

$$\sum_{x, y} \tilde{p}(x)\, p(y \mid x)\, f(x, y) = \sum_{x, y} \tilde{p}(x, y)\, f(x, y) \qquad (3)$$

A feature usually captures information helpful in solving the SBD problem. For example, we observe that if the current token in the stream is a period and the next token is a capitalized word, then the current token is a sentence boundary with high probability. Thus, we can have the following feature:

$$f(x, y) = \begin{cases} 1 & \text{if the current token in } x \text{ is a period, the next token is capitalized, and } y = \text{sentence boundary} \\ 0 & \text{otherwise} \end{cases} \qquad (4)$$

The best model p*, among all the models satisfying the above constraints, is the one maximizing the conditional entropy H(p) of p(y|x), because it corresponds to the model with the most uniform probability distribution over contexts x unseen in the training set, thus introducing the least bias for unseen contexts.

$$p^{*} = \operatorname*{arg\,max}_{p \in C} H(p) \qquad (5)$$

where C denotes the set of conditional models satisfying the feature constraints.

$$H(p) = -\sum_{x, y} \tilde{p}(x)\, p(y \mid x) \log p(y \mid x) \qquad (6)$$

This is a typical constrained optimization problem, which can be solved using the Lagrange multiplier method.42 For details, readers may refer to a brief online tutorial by Berger.43
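To make the constraint concrete, the empirical expectation of a feature is simply its average value over the labeled training data; the model's expectation must match it. The following toy sketch (with hypothetical tokens and labels, not the study's actual training data) computes the right-hand side of Equation 3 for the feature in Equation 4:

```python
# Toy illustration (hypothetical data): computing the empirical expectation
# of a Boolean feature f(x, y) over a labeled training set, i.e. the
# right-hand side of the maximum entropy constraint E_p[f] = E_p~[f].

def f_period_then_capital(context, label):
    """Feature (4): current token is '.', next token starts uppercase,
    and the label marks a sentence boundary."""
    current, nxt = context
    return 1.0 if current == "." and nxt[:1].isupper() and label == "boundary" else 0.0

# (context, label) pairs; each context is (current_token, next_token).
training = [
    ((".", "Extracranially"), "boundary"),
    ((".", "the"), "not_boundary"),
    ((".", "Smith"), "not_boundary"),      # e.g. "Dr . Smith"
    (("craniectomy", "."), "not_boundary"),
]

def empirical_expectation(feature, samples):
    # E_p~[f] = (1/N) * sum over observed (x, y) of f(x, y)
    return sum(feature(x, y) for x, y in samples) / len(samples)

print(empirical_expectation(f_period_then_capital, training))  # 0.25
```

Training then finds the feature weights (Lagrange multipliers) that make the model's expectations agree with these empirical averages.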

Other approaches to SBD, using bi-grams or tri-grams as features, require a very large training set, and the sparseness of text often causes problems. In MEM, lexical information, syntactic information, and bi-grams can all be modeled easily as features in one integrated model. Also, features no longer have to be limited to local context. For example, the following text:

his compromise bill.” A committee staffer

is highly ambiguous. We cannot tell from this local context whether the period or the double quote should be the sentence boundary. However, in MEM we can easily add the parity of double quotes (i.e., whether a quote is an opening or closing quote) as a feature, which is highly effective but cannot be derived from the local context in the example above.

MEM automatically calculates the weights of features during training and handles overlapping features very well. Thus, in MEM implementations, it is sometimes more advantageous to use a complex feature together with its simple component features than to use the simple features alone. For example, if we have two simple features in a model, (A) if a period is preceded by a lowercased word, the period is probably a sentence boundary, and (B) if a period is followed by an uppercased word, the period is probably a sentence boundary, then the model will perform better with the addition of a complex feature C combining A and B: (C) if a period is preceded by a lowercased word and followed by an uppercased word, the period is probably a sentence boundary.

Another potential problem when using MEM for SBD is that some useful features may be rarely seen in a particular training set and thus may not be estimated reliably. In such cases, we have tried to group similar situations into one single feature, which then has more instances in the training set. For example, instead of using a separate feature to model that a period is unlikely to be a sentence boundary if the next token is “}”, we modeled “}” together with several other punctuation marks in the following feature:

if the next token is ‘.’, ‘?’, ‘,’, ‘"’, ‘)’, or ‘}’, then the current token is not a sentence boundary.

In summary, a set of 15 manually derived lexical features (six of which are shown in Table 1) was automatically weighted during training on a set of 50 reports covering six radiology modalities. The model was then tested on the other 50 reports.

Table 1

Sample Features Used in Maximum Entropy Modeling (MEM) Sentence Boundary Detection

| Feature Used by Maximum Entropy Modeling Sentence Boundary (SB) Detection | Left | Candidate | Right |
|---|---|---|---|
| Left is a lowercased word, Candidate is a SB | craniectomy | . | Extracranially |
| Right is a lowercased word, Candidate is not a SB | e | . | the |
| Right is ‘.’, ‘?’, ‘,’, ‘"’, ‘)’, ‘}’, or ‘-’, Candidate is not a SB | curvilinear | . | ) |
| Left is an honorific, Candidate is ‘.’, Candidate is not a SB | Dr | . | Smith |
| Candidate is ‘"’, an odd-numbered quote, Candidate is not a SB | . | " | A |
| Left is a single uppercase letter and Candidate is ‘.’, Candidate is not a SB | S | . | levels |

Sequence of tokens (token1 token2 token3) is represented as (Left Candidate Right).
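The features in Table 1 can be sketched as Boolean predicates over a (Left, Candidate, Right) token window. The following is a minimal illustration under stated assumptions (the honorific list, feature names, and the quote counter are illustrative, not the authors' implementation):

```python
# A sketch of Table 1-style Boolean SBD features over a (Left, Candidate,
# Right) window. HONORIFICS and the feature names are illustrative
# assumptions; quote_count supports the non-local quote-parity feature.

HONORIFICS = {"Dr", "Mr", "Mrs", "Ms", "Prof"}
NON_BOUNDARY_RIGHT = {".", "?", ",", '"', ")", "}", "-"}

def features(left, candidate, right, quote_count):
    """Return the active Boolean features for one candidate boundary.
    quote_count is the number of double quotes seen so far in the document."""
    f = {}
    f["left_lower_sb"] = left.islower()                    # Left lowercased -> SB
    f["right_lower_not_sb"] = right.islower()              # Right lowercased -> not SB
    f["right_punct_not_sb"] = right in NON_BOUNDARY_RIGHT  # Right in punctuation group -> not SB
    f["honorific_not_sb"] = left in HONORIFICS and candidate == "."
    f["odd_quote_not_sb"] = candidate == '"' and quote_count % 2 == 1
    f["single_upper_not_sb"] = len(left) == 1 and left.isupper() and candidate == "."
    return f

# "craniectomy . Extracranially": the lowercase-left feature fires.
print(features("craniectomy", ".", "Extracranially", 0)["left_lower_sb"])  # True
```

In the actual model, each active feature contributes its trained weight to the boundary/non-boundary decision rather than acting as a hard rule.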

#### NLP Parsing

Figure 1

Sample parse tree of sentence “Left cerebellar hemisphere appears to demonstrate areas of decreased attenuation.” using Penn Treebank Part of Speech (POS) tags. S, sentence; NP, noun phrase; VP, verb phrase; JJ, adjective; NN, noun, singular; VBZ, verb, present 3SG –s form; TO, infinitive marker; VB, verb, infinitive; NNS, noun, plural; IN, preposition; VBN, verb, past/passive participle.


The Stanford Parser was trained on a nonmedical document collection (the Penn Treebank, WSJ section); thus, many biomedical terms found in clinical documents were rarely encountered by the parser in training. To improve the performance of the parser, we added the following preprocessing steps (beyond sentence delimiting). First, some tokenization improvements were implemented based on the sentence delimitation results. Second, the text in some sections of the radiology reports, such as the Impression section, was all in uppercase, which had a very negative impact on the parser's performance, since the parser makes extensive use of lexical features based on letter case. Thus, a preprocessing module converted all-uppercase text to proper-cased text. In this process, our program also detected abbreviations, which were not converted to lowercase, using a list of more than 4,000 abbreviations derived from the UMLS Specialist Lexicon.47 Third, we customized the frequencies of a few commonly used words, because the statistics learned from the parser's training set differed greatly from those of the clinical document set. Last, we constructed an extended lexicon of biomedical terms by mapping the POS tags in the lexical entries of the UMLS Specialist Lexicon to the relevant standard Penn Treebank POS tags. This extended lexicon was then used with the parser to improve the performance of NPI in clinical documents.
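The case-restoration step might be sketched as follows. This is a minimal illustration under stated assumptions: `ABBREVIATIONS` stands in for the 4,000-plus entry list derived from the UMLS Specialist Lexicon, and only sentence-initial capitalization is handled.

```python
# A sketch (an assumption, not the authors' code) of the case-restoration
# preprocessing: all-uppercase section text is converted to proper case,
# while known abbreviations keep their original casing.

ABBREVIATIONS = {"CT", "MRI", "IV", "PA"}  # illustrative subset of the real list

def proper_case(uppercase_text):
    out = []
    for i, token in enumerate(uppercase_text.split()):
        if token in ABBREVIATIONS:
            out.append(token)               # abbreviations are not lowercased
        elif i == 0:
            out.append(token.capitalize())  # keep the sentence-initial capital
        else:
            out.append(token.lower())
    return " ".join(out)

print(proper_case("CT SCAN OF THE CHEST SHOWS NO ACUTE DISEASE"))
# CT scan of the chest shows no acute disease
```

A fuller version would also re-capitalize the first word after each boundary found by the SBD module.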

The UMLS Specialist Lexicon uses its own syntactic categories to categorize words, which cannot be used by the Stanford parser directly. To convert the UMLS Specialist Lexicon entries into new entries with Penn Treebank tags, we used the mapping shown in Table 2. Some categories were not mapped because they are closed categories: they contain a small, fixed set of words that are already well represented in the parser's lexicon. Of note, we limited the conversion to unambiguous entries; in other words, within the extended lexicon, we included only those words with a single allowable POS tag under the Penn Treebank tag set. There were three reasons for this decision. First, for words with more than one allowable syntactic category (POS tag), the Stanford statistical parser requires the relative frequency of each tag, which is absent from the UMLS Specialist Lexicon. Second, some mapping conversions were inherently ambiguous.48 Third, despite being trained on newswire text, the parser has a robust mechanism for handling unknown words that uses frequency information on their lexical features; for ambiguous words, relying on this mechanism was preferable to supplying a partial domain lexicon lacking frequencies. After conversion, there were 262,704 entries in the extended lexicon, drawn from the UMLS Specialist Lexicon (2004 version), while the original Penn Treebank lexicon contained fewer than 100,000 entries. Thus, the use of the UMLS Specialist Lexicon more than doubled the size of the original lexicon.

Table 2

Syntactical Category Mappings from the UMLS Specialist Lexicon to Penn Treebank POS Tags

| Syntactic Category in Specialist Lexicon | Penn Treebank POS Tag |
|---|---|
| Cap starting + thr_sing | NNP |
| Cap starting + thr_plur | NNPS |
| thr_sing | NN |
| thr_plural | NNS |
| verb + past | VBD |
| verb + pres | VBP |
| verb + past_part | VBN |
| verb + pres_part | VBG |
| verb + infinitive | VB |
| adj + positive | JJ |
| adj + comparative | JJR |
| adj + superlative | JJS |
| prep (except “to”) | IN |
| adv + positive | RB |
| adv + comparative | RBR |
| adv + superlative | RBS |
| pron | not mapped |
| det | not mapped |
| conj | not mapped |
| aux | not mapped |
| modal | not mapped |
| compl | not mapped |
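The conversion described above, including the restriction to unambiguous entries, can be sketched as follows. The category strings and entry tuples are illustrative assumptions (not actual Specialist Lexicon records), and only a subset of the Table 2 mapping is shown:

```python
# A sketch of building the extended lexicon: map Specialist Lexicon
# categories to Penn Treebank tags (a subset of the Table 2 mapping),
# then keep only words with exactly one resulting tag.

SPECIALIST_TO_PENN = {
    "thr_sing": "NN", "thr_plural": "NNS",
    "verb+past": "VBD", "verb+pres": "VBP", "verb+past_part": "VBN",
    "verb+pres_part": "VBG", "verb+infinitive": "VB",
    "adj+positive": "JJ", "adv+positive": "RB",
}

def build_extended_lexicon(entries):
    """entries: iterable of (word, specialist_category) pairs. Returns
    {word: penn_tag} restricted to unambiguous words."""
    tags = {}
    for word, category in entries:
        penn = SPECIALIST_TO_PENN.get(category)  # unmapped (closed) categories -> None
        if penn is not None:
            tags.setdefault(word, set()).add(penn)
    return {w: t.pop() for w, t in tags.items() if len(t) == 1}

entries = [
    ("hemisphere", "thr_sing"),
    ("attenuations", "thr_plural"),
    ("left", "verb+past"),     # ambiguous word:
    ("left", "adj+positive"),  # two allowable tags -> excluded
]
print(build_extended_lexicon(entries))
# {'hemisphere': 'NN', 'attenuations': 'NNS'}
```

Ambiguous words such as “left” are deliberately excluded and left to the parser's unknown-word handling, as described above.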

#### Noun Phrase Identification

Within a sentence, there are usually a number of ways of marking up NPs. For example, in the sentence “**The left cerebellar hemisphere** appears to demonstrate **areas of decreased attenuation**.” there are two NPs of maximum length and complexity, “the left cerebellar hemisphere” and “areas of decreased attenuation” (in bold), which we refer to as maximal NPs. These maximal NPs can be very complex in structure, with multiple prepositional phrases and relative clauses attached. On the other hand, within many sentences we can also identify smaller, less complex, NPs, such as “cerebellar hemisphere,” “hemisphere,” “areas,” “decreased attenuation,” and “attenuation.” The least complex NPs are referred to as base NPs. Base NPs have been defined as “simple, non-recursive NPs—NPs that do not contain other NP descendants.”32 We adopted the above definition of base NP and used parses of phrases in Penn Treebank style. In the above example, “the left cerebellar hemisphere,” “areas,” and “decreased attenuation” are three base NPs. However, there is no universal level of complexity at which NPs are optimal for UMLS indexing. Most of the time, medical concepts in the UMLS are expressed in simple NPs, for example, as “cerebellar hemisphere” (C0228465). At other times, the most specific UMLS concepts are expressed in longer, more complex, NPs, e.g., “insertion of graft of great vessels with cardiopulmonary bypass” (C0189681). In a full parse tree of a sentence, maximal NPs are more complex in structure and sit closer to the root of the parse tree. Simpler NPs and base NPs are usually nested in the maximal NPs and sit closer to the leaves of the parse tree.
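The distinction between maximal and base NPs can be sketched over a Penn Treebank–style parse tree, represented here as nested lists. This is an illustrative sketch of the two definitions, not the authors' NP tagger, and the tree is a simplified parse of the Figure 1 sentence:

```python
# A sketch of extracting maximal NPs (NP nodes with no NP ancestor) and
# base NPs (NP nodes with no NP descendant) from a parse tree represented
# as nested lists of the form [label, child1, child2, ...].

def leaves(tree):
    """Collect the words under a node, left to right."""
    if isinstance(tree, str):
        return [tree]
    return [word for child in tree[1:] for word in leaves(child)]

def has_np(tree):
    """True if the node is, or dominates, an NP."""
    if isinstance(tree, str):
        return False
    return tree[0] == "NP" or any(has_np(child) for child in tree[1:])

def maximal_nps(tree):
    """NP nodes closest to the root: stop recursing at the first NP."""
    if isinstance(tree, str):
        return []
    if tree[0] == "NP":
        return [" ".join(leaves(tree))]
    return [np for child in tree[1:] for np in maximal_nps(child)]

def base_nps(tree):
    """Simple, non-recursive NPs: NP nodes with no NP descendant."""
    if isinstance(tree, str):
        return []
    if tree[0] == "NP" and not any(has_np(child) for child in tree[1:]):
        return [" ".join(leaves(tree))]
    return [np for child in tree[1:] for np in base_nps(child)]

# Simplified parse of the Figure 1 sentence.
tree = ["S",
        ["NP", ["JJ", "Left"], ["JJ", "cerebellar"], ["NN", "hemisphere"]],
        ["VP", ["VBZ", "appears"],
         ["S", ["VP", ["TO", "to"],
                ["VP", ["VB", "demonstrate"],
                 ["NP",
                  ["NP", ["NNS", "areas"]],
                  ["PP", ["IN", "of"],
                   ["NP", ["VBN", "decreased"], ["NN", "attenuation"]]]]]]]]]

print(maximal_nps(tree))  # ['Left cerebellar hemisphere', 'areas of decreased attenuation']
print(base_nps(tree))     # ['Left cerebellar hemisphere', 'areas', 'decreased attenuation']
```

Note that an NP with no nested NP, such as “Left cerebellar hemisphere” here, is simultaneously a maximal NP and a base NP.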

### Outcome Measures

Precision, recall, and the F1 measure49 were used to evaluate the results of this study. Precision is the fraction of proposed NPs that are present in the gold standard. Recall is the fraction of gold-standard NPs that are proposed by our system. We compared performance for statistical significance by calculating 95% confidence intervals for recall and precision using the method provided by Wilson.50 F1 is a combined measure, defined as the harmonic mean of these two quantities and computed as 2PR/(R+P), where P is precision and R is recall. The F1 measure gives us a single numeric measure of overall performance combining both precision and recall.
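These outcome measures can be computed directly from true-positive, false-positive, and false-negative counts. The sketch below applies them to the overall sentence boundary counts reported in the Results (Table 3); the Wilson interval formula is the standard score-interval construction, not code from the study:

```python
# A sketch of the outcome measures: precision, recall, F1, and the Wilson
# score confidence interval for a binomial proportion (z = 1.96 for 95%).
from math import sqrt

def precision_recall_f1(tp, fp, fn):
    precision = tp / (tp + fp)  # fraction of proposed items in the gold standard
    recall = tp / (tp + fn)     # fraction of gold-standard items proposed
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

def wilson_ci(successes, n, z=1.96):
    """Wilson score confidence interval for the proportion successes/n."""
    phat = successes / n
    denom = 1 + z * z / n
    center = (phat + z * z / (2 * n)) / denom
    half = z * sqrt(phat * (1 - phat) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half

# Overall sentence boundary detection counts (Table 3): TP 812, FP 6, FN 14.
p, r, f1 = precision_recall_f1(812, 6, 14)
print(round(p, 3), round(r, 3), round(f1, 3))  # 0.993 0.983 0.988
low, high = wilson_ci(812, 812 + 6)            # CI on precision
print(round(low, 3), round(high, 3))           # 0.984 0.997
```

The interval printed here reproduces the 98.4%-99.7% precision CI reported for SBD in the Results.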

### Analysis

We set out to test three hypotheses:

1. Heuristics could be developed to reliably identify sentences in clinical radiology reports.

2. Using a statistical NLP parser we can accurately identify most NPs in clinical radiology reports.

3. Integrating the UMLS Specialist Lexicon (SL) into an NLP parser trained on nonmedical documents can lead to substantial improvements to NPI in clinical radiology reports.

#### Hypothesis 1—Sentence Boundary Detection

To generate the SBD gold standard, we first wrote a simple rule-based application to pre-markup the sentences in all 100 radiology reports. One of the authors then went through the reports and generated the gold standard of sentence markups by correcting and adding sentence markups. The MEM sentence boundary detector was trained on a training set of 50 of these reports and then tested on the other 50 reports using this gold standard.

#### Hypothesis 2—Noun Phrase Identification

Taking the parse trees output by the Stanford parser trained on the Penn Treebank Wall Street Journal (WSJ) newswire corpus, we developed an NP tagger to identify NPs. We performed experiments using both base and maximal NPs, since there is no universal level of complexity at which NPs are optimal for UMLS indexing. One experiment identified and used all maximal NPs in the documents, whereas the other identified and used all base NPs. By looking at both the top-level and the bottom-level NPs in parse trees, we hoped to derive a more reliable evaluation of NLP parsing performance, which we believe to be critical to UMLS indexing.

These two experiments were similar except for the preparation of the gold standard and small differences in the programs used to tag NPs. To identify maximal NPs correctly, domain knowledge is needed to resolve prepositional phrase attachment and other structural ambiguities; thus, four physicians helped us create the gold standard (the process is explained below). On the other hand, the identification of base NPs is usually much more straightforward for humans; thus, one of the authors was able to create the gold standard for marking up base NPs.

The NPI system processed the 100 reports by first delimiting and parsing all sentences, and then marking up the NPs within each sentence. In the first experiment of identifying maximal NPs, the marked-up reports generated by this process were split into four sets of 25 reports and reviewed by four physicians to identify false-positive and false-negative NP markups. One author then went through all 100 reports and decided the final markups through discussions with the physicians. Based on this expert review, corrections were then made to the markups, and the resulting 100 documents were used as the gold standard of the final NPI results.

In the second experiment of identifying simpler base NPs, one author reviewed the marked-up reports, made corrections to the markups, and the resulting 100 documents were used as the gold standard. We used the same set of 50 reports as in the SBD experiment as the test set in NPI evaluation.

#### Hypothesis 3—Integrating the UMLS Specialist Lexicon

The computed NP markups with and without the extended lexicon were then compared against the gold standard mentioned in the previous section. Precision, recall, and F1 measure were calculated for the two versions of the NPI module.

Because there were published results on base NPI in the general domain,31,32 it was possible to compare the base NPI performance of our system with that of systems working in the nonbiomedical domain. Thus, in addition to comparing NPI performance with and without the extended lexicon, we also derived a customized grammar by making a few changes to the grammar learned while training the parser on the Penn Treebank WSJ corpus. Statistical parsers usually use a lexicon with relative frequencies of the different syntactic categories for each token. These frequencies are used by the parser to generate the most likely parse for each sentence. However, they are not present in the Specialist Lexicon, and some commonly used words have very different relative frequencies in a specific domain than in the general domain. For example, “left” can be the past tense or past/passive participle of the verb “leave,” or simply an adjective that indicates laterality. In the Penn Treebank corpora, “left” is mostly used as a verb. However, in clinical documents such as radiology reports, “left” is almost always used as an adjective to indicate laterality. In base NPI, we manually added new frequencies for a few words such as “left” and “patent” in the parser grammar. The changes were derived by reviewing reports in the training set only.
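The frequency customization for domain-skewed words like “left” can be illustrated with a small sketch; the dictionary format and the numbers are illustrative assumptions, not the Stanford parser's internal grammar representation:

```python
# A sketch of overriding tag frequencies for domain-skewed words. The
# relative frequencies below are illustrative, not measured values.

# Relative tag frequencies as a newswire-trained lexicon might hold them:
newswire = {"left": {"VBD": 0.6, "VBN": 0.3, "JJ": 0.1}}

# Overrides derived from reviewing the radiology training set, where
# "left" almost always marks laterality:
radiology_overrides = {"left": {"VBD": 0.02, "VBN": 0.03, "JJ": 0.95}}

def most_likely_tag(word, lexicon, overrides):
    """Prefer domain overrides; fall back to the general-domain lexicon."""
    freqs = overrides.get(word, lexicon.get(word, {}))
    return max(freqs, key=freqs.get)

print(most_likely_tag("left", newswire, {}))                   # VBD
print(most_likely_tag("left", newswire, radiology_overrides))  # JJ
```

In the real parser these frequencies influence the probability of competing parses rather than forcing a single tag, but the effect on phrases like “left cerebellar hemisphere” is the same: the adjectival reading wins.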

## Results

### Hypothesis 1—Sentence Boundary Detection

The results of sentence boundary detection on the test set of 50 reports are shown in Table 3. The precision was generally excellent (>99%) except for radiology procedure reports (97%). The recall was also very good (>98%) except for radiographs (96%). Overall, our SBD module achieved 99.3% precision (95% confidence interval [CI], 98.4%-99.7%), 98.3% recall (95% CI, 97.2%-99.0%), and 98.8% F1.

Table 3

Results of Sentence Boundary Detection

| Report Type | Human | Computer | True-Positive | False-Positive | False-Negative | Precision | Recall | F1 |
|---|---|---|---|---|---|---|---|---|
| CT | 267 | 267 | 265 | 2 | 2 | 0.993 | 0.993 | 0.993 |
| MAMMO | 13 | 13 | 13 | 0 | 0 | 1.000 | 1.000 | 1.000 |
| MR | 260 | 255 | 255 | 0 | 5 | 1.000 | 0.981 | 0.990 |
| PROC | 131 | 133 | 129 | 4 | 2 | 0.970 | 0.985 | 0.977 |
| RAD | 136 | 131 | 131 | 0 | 5 | 1.000 | 0.963 | 0.981 |
| US | 19 | 19 | 19 | 0 | 0 | 1.000 | 1.000 | 1.000 |
| All | 826 | 818 | 812 | 6 | 14 | 0.993 | 0.983 | 0.988 |

### Hypothesis 2—Noun Phrase Identification

The NP identification module used an unlexicalized version of the Stanford parser to parse reports and identified NPs from parse trees generated by the parser. The developers of the NPI module used the training set to evaluate the performance and improve the module. The NPI module was then run against the test set. The results of the identification of maximal and base NPs using the Stanford parser without any help from a biomedical lexicon are shown in Table 4.

Table 4

Results of Noun Phrase Identification Using the Stanford Parser

| Report Type | Type of NP | True-Positive | False-Positive | False-Negative | Precision | Recall | F1 |
|---|---|---|---|---|---|---|---|
| CT | Maximal | 347 | 111 | 90 | 0.758 | 0.794 | 0.775 |
| CT | Base | 612 | 111 | 98 | 0.846 | 0.862 | 0.854 |
| MAMMO | Maximal | 19 | 3 | 3 | 0.864 | 0.864 | 0.864 |
| MAMMO | Base | 26 | 7 | 7 | 0.788 | 0.788 | 0.788 |
| MR | Maximal | 387 | 108 | 96 | 0.782 | 0.801 | 0.791 |
| MR | Base | 745 | 100 | 105 | 0.882 | 0.876 | 0.879 |
| PROC | Maximal | 251 | 43 | 36 | 0.854 | 0.875 | 0.864 |
| PROC | Base | 359 | 39 | 31 | 0.902 | 0.921 | 0.911 |
| RAD | Maximal | 175 | 48 | 39 | 0.785 | 0.818 | 0.801 |
| RAD | Base | 275 | 47 | 60 | 0.854 | 0.821 | 0.837 |
| US | Maximal | 21 | 8 | 8 | 0.724 | 0.724 | 0.724 |
| US | Base | 32 | 11 | 12 | 0.744 | 0.727 | 0.736 |
| All | Maximal | 1,200 | 321 | 272 | 0.789 | 0.815 | 0.802 |
| All | Base | 2,049 | 315 | 313 | 0.867 | 0.867 | 0.867 |

### Hypothesis 3—Integrating the UMLS Specialist Lexicon

In the third part of the experiment, we measured the performance of maximal and base NPI with the extended lexicon generated from the UMLS Specialist Lexicon. The results are shown in Tables 5–8.

Table 5

Results of Maximal Noun Phrase Identification

| Report Type | NPI Version | True-Positive | False-Positive | False-Negative | Precision | Recall | F1 |
|---|---|---|---|---|---|---|---|
| CT | Baseline | 347 | 111 | 90 | 0.758 | 0.794 | 0.775 |
| CT | SL | 362 | 101 | 75 | 0.782 | 0.828 | 0.804 |
| MAMMO | Baseline | 19 | 3 | 3 | 0.864 | 0.864 | 0.864 |
| MAMMO | SL | 19 | 3 | 3 | 0.864 | 0.864 | 0.864 |
| MR | Baseline | 387 | 108 | 96 | 0.782 | 0.801 | 0.791 |
| MR | SL | 402 | 92 | 81 | 0.814 | 0.832 | 0.823 |
| PROC | Baseline | 251 | 43 | 36 | 0.854 | 0.875 | 0.864 |
| PROC | SL | 251 | 40 | 36 | 0.863 | 0.875 | 0.869 |
| RAD | Baseline | 175 | 48 | 39 | 0.785 | 0.818 | 0.801 |
| RAD | SL | 184 | 35 | 30 | 0.840 | 0.860 | 0.850 |
| US | Baseline | 21 | 8 | 8 | 0.724 | 0.724 | 0.724 |
| US | SL | 27 | 1 | 2 | 0.964 | 0.931 | 0.947 |
| All | Baseline | 1,200 | 321 | 272 | 0.789 | 0.815 | 0.802 |
| All | SL | 1,245 | 272 | 227 | 0.821 | 0.846 | 0.833 |
 Report Type NPI Version True-Positive False-Positive False-Negative Precision Recall F1 CT Baseline 347 111 90 0.758 0.794 0.775 SL 362 101 75 0.782 0.828 0.804 MAMMO Baseline 19 3 3 0.864 0.864 0.864 SL 19 3 3 0.864 0.864 0.864 MR Baseline 387 108 96 0.782 0.801 0.791 SL 402 92 81 0.814 0.832 0.823 PROC Baseline 251 43 36 0.854 0.875 0.864 SL 251 40 36 0.863 0.875 0.869 RAD Baseline 175 48 39 0.785 0.818 0.801 SL 184 35 30 0.84 0.86 0.85 US Baseline 21 8 8 0.724 0.724 0.724 SL 27 1 2 0.964 0.931 0.947 All Baseline 1,200 321 272 0.789 0.815 0.802 SL 1,245 272 227 0.821 0.846 0.833
Table 6. Comparison of Noun Phrase Identification Results with and without an Extended Lexicon Generated from the UMLS Specialist Lexicon

| Report Type | Change in True-Positives | Change in False-Positives | Change in False-Negatives | Change in Precision | Change in Recall | Change in F1 |
| --- | --- | --- | --- | --- | --- | --- |
| CT | 4.3% | −9.0% | −16.7% | 3.2% | 4.3% | 3.7% |
| MAMMO | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% |
| MR | 3.9% | −14.8% | −15.6% | 4.1% | 3.9% | 4.0% |
| PROC | 0.0% | −7.0% | 0.0% | 1.0% | 0.0% | 0.5% |
| RAD | 5.1% | −27.1% | −23.1% | 7.1% | 5.1% | 6.1% |
| US | 28.6% | −87.5% | −75.0% | 33.2% | 28.6% | 30.8% |
| All | 3.8% | −15.3% | −16.5% | 4.0% | 3.8% | 3.9% |
Table 7. Results of Base Noun Phrase Identification

| Report Type | NPI Version | True-Positive | False-Positive | False-Negative | Precision | Recall | F1 |
| --- | --- | --- | --- | --- | --- | --- | --- |
| CT | Baseline | 612 | 111 | 98 | 0.846 | 0.862 | 0.854 |
| CT | SL | 649 | 78 | 61 | 0.893 | 0.914 | 0.903 |
| CT | SL+GR | 663 | 49 | 47 | 0.931 | 0.934 | 0.932 |
| MAMMO | Baseline | 26 | 7 | 7 | 0.788 | 0.788 | 0.788 |
| MAMMO | SL | 29 | 2 | 4 | 0.935 | 0.879 | 0.906 |
| MAMMO | SL+GR | 29 | 2 | 4 | 0.935 | 0.879 | 0.906 |
| MR | Baseline | 745 | 100 | 105 | 0.882 | 0.876 | 0.879 |
| MR | SL | 774 | 70 | 76 | 0.917 | 0.911 | 0.914 |
| MR | SL+GR | 783 | 57 | 67 | 0.932 | 0.921 | 0.927 |
| PROC | Baseline | 359 | 39 | 31 | 0.902 | 0.921 | 0.911 |
| PROC | SL | 368 | 26 | 22 | 0.934 | 0.944 | 0.939 |
| PROC | SL+GR | 370 | 21 | 20 | 0.946 | 0.949 | 0.948 |
| RAD | Baseline | 275 | 47 | 60 | 0.854 | 0.821 | 0.837 |
| RAD | SL | 299 | 35 | 36 | 0.895 | 0.893 | 0.894 |
| RAD | SL+GR | 301 | 32 | 34 | 0.904 | 0.899 | 0.901 |
| US | Baseline | 32 | 11 | 12 | 0.744 | 0.727 | 0.736 |
| US | SL | 38 | 6 | 6 | 0.864 | 0.864 | 0.864 |
| US | SL+GR | 42 | 2 | 2 | 0.955 | 0.955 | 0.955 |
| All | Baseline | 2,049 | 315 | 313 | 0.867 | 0.867 | 0.867 |
| All | SL | 2,157 | 217 | 205 | 0.909 | 0.913 | 0.911 |
| All | SL+GR | 2,188 | 163 | 174 | 0.931 | 0.926 | 0.928 |
Table 8. Performance Changes from Baseline Using the Extended Lexicon (SL) and Using Both the Extended Lexicon and a Customized Grammar (SL+GR)

| Report Type | NPI Version | Change in True-Positives | Change in False-Positives | Change in False-Negatives | Change in Precision | Change in Recall | Change in F1 |
| --- | --- | --- | --- | --- | --- | --- |--- |
| CT | SL | 6.0% | −29.7% | −37.8% | 5.5% | 6.0% | 5.8% |
| CT | SL+GR | 8.3% | −55.9% | −52.0% | 10.0% | 8.3% | 9.2% |
| MAMMO | SL | 11.5% | −71.4% | −42.9% | 18.7% | 11.5% | 15.0% |
| MAMMO | SL+GR | 11.5% | −71.4% | −42.9% | 18.7% | 11.5% | 15.0% |
| MR | SL | 3.9% | −30.0% | −26.9% | 4.0% | 3.8% | 3.9% |
| MR | SL+GR | 5.1% | −43.0% | −35.6% | 5.7% | 5.0% | 5.3% |
| PROC | SL | 2.5% | −33.3% | −29.0% | 3.5% | 2.5% | 3.0% |
| PROC | SL+GR | 3.1% | −70.0% | −68.3% | 9.4% | 8.2% | 8.8% |
| RAD | SL | 8.7% | −25.5% | −40.0% | 4.8% | 8.7% | 6.8% |
| RAD | SL+GR | 9.5% | −31.9% | −43.3% | 5.8% | 9.5% | 7.7% |
| US | SL | 18.8% | −45.5% | −50.0% | 16.1% | 18.8% | 17.4% |
| US | SL+GR | 31.3% | −81.8% | −83.3% | 28.3% | 31.3% | 29.8% |
| All | SL | 5.3% | −31.1% | −34.3% | 4.8% | 5.2% | 5.0% |
| All | SL+GR | 6.8% | −48.3% | −44.2% | 7.4% | 6.7% | 7.1% |

The results of maximal NP identification are shown in Table 5. Without help from the UMLS Specialist Lexicon (baseline), the NPI module achieved 78.9% (95% CI, 76.8% to 80.9%) precision, 81.5% (79.5% to 83.4%) recall, and an 80.2% F1 measure overall. Using the extended lexicon (SL), it achieved 82.1% (80.1% to 83.9%) precision, 84.6% (82.6% to 86.3%) recall, and an 83.3% F1 measure. The contribution of the terms from the UMLS Specialist Lexicon can be seen more readily in Table 6. The improvements in precision and recall were statistically significant at the 95% confidence level because the SL point estimates for precision and recall lay above the upper bounds of the corresponding baseline 95% confidence intervals. Overall, the extended lexicon reduced false-positives by 15.3% and false-negatives by 16.5%.

Similarly, the results of base NP identification are shown in Table 7. In addition to comparing NPI performance with (SL in the table) and without (baseline in the table) the extended lexicon, we used a customized grammar (SL+GR) to further improve base NPI. Compared with the baseline performance of 86.7% (85.4% to 88.1%) precision, 86.7% (85.4% to 88.1%) recall, and 86.7% F1, the extended lexicon constructed from the UMLS Specialist Lexicon improved results to 90.9% (89.6% to 92.0%) precision, 91.3% (90.1% to 92.4%) recall, and 91.1% F1. These improvements were statistically significant at the 95% confidence level. The final version of the system, which also used the customized grammar, further improved performance to 93.1% (92.0% to 94.0%) precision, 92.6% (91.5% to 93.6%) recall, and 92.8% F1; these improvements in precision and recall were also statistically significant.

Table 8 shows the performance changes from baseline when using the extended lexicon alone (SL) and when using both the extended lexicon and the customized grammar (SL+GR). Overall, the extended lexicon improved the F1 measure by 5.0% and reduced false-positives and false-negatives by 31.1% and 34.3%, respectively. The final NPI module, with both the extended lexicon and a few changes in the grammar, improved the F1 measure by 7.1% and reduced false-positives and false-negatives by 48.3% and 44.2%, respectively.
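The metrics and confidence bounds reported above can be recomputed directly from the table counts. The sketch below is a minimal illustration (not the authors' evaluation code), using the Wilson score interval for the 95% confidence bounds and the overall baseline base-NP counts from Table 7; the reported intervals may differ slightly depending on the exact CI method used.

```python
import math

def wilson_ci(successes: int, n: int, z: float = 1.96):
    """95% Wilson score interval for a binomial proportion."""
    p = successes / n
    denom = 1 + z ** 2 / n
    center = (p + z ** 2 / (2 * n)) / denom
    margin = z * math.sqrt(p * (1 - p) / n + z ** 2 / (4 * n ** 2)) / denom
    return center - margin, center + margin

def prf(tp: int, fp: int, fn: int):
    """Precision, recall, and F1 from raw true/false positive/negative counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Overall baseline base-NP counts (Table 7): TP=2,049, FP=315, FN=313
p, r, f1 = prf(2049, 315, 313)        # each rounds to 0.867
lo, hi = wilson_ci(2049, 2049 + 315)  # interval on precision
```

Under this scheme, an improvement is called significant when the new point estimate exceeds the upper bound of the baseline interval.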

## Discussion

The ultimate goal of the ChartIndex project is to create a CDA-compliant model of clinical document representation at both the structural and semantic levels. The semantic model requires an indexing engine that can automatically identify and then represent important biomedical concepts in clinical documents as UMLS concept descriptors. Prior work in this area has found that achieving good indexing precision is a major challenge. Our current work uses a variety of approaches to address this issue. In this report we show that using a combination of machine learning and NLP can aid in the automated identification of sentence boundaries and NPs in clinical radiology reports.

Most existing NLP systems identify phrases using shallow parsing or text-chunking methods, partly because chunking systems are faster than full parsers and partly because full parsers are perceived as error prone. In the past, some researchers who attempted to use full parsing in their information systems did not see improvements in accuracy.51,52 However, with recent advances in statistical parsing methods, we believe that full parsers are now better able to resolve important ambiguities within a reasonable time. Our parser takes 1 to 2 seconds to parse a sentence of average length (25 words) on a Pentium 4 2.8 GHz computer with 1 GB SDRAM, which is sufficiently fast for our current applications. The performance of the parser was evaluated on its native training domain in a previous study,45 although its accuracy on medical texts has not been explicitly tested. A full parser also offers the ability to predict larger NPs that most text chunkers do not attempt to predict. Moreover, full parsers make more detailed assertions about relational syntactic structures, which can reasonably be expected to be useful for future indexing work.

Another concern when applying a statistical parser to a domain other than its native training domain is performance degradation.53,54 In this study, we applied the Stanford parser, trained on the Penn Treebank WSJ corpus, to clinical radiology documents in the medical domain. As Gildea54 reported, word-to-word dependencies are corpus specific; the unlexicalized Stanford parser therefore gave us a compact and fast statistical parser that we believe is less dependent on its training corpus. Furthermore, we augmented the Stanford parser with a biomedical lexicon derived from the UMLS Specialist Lexicon. The performance of the parser on our document collection, although not evaluated on full parse trees, was evaluated at the levels of both maximal and base NPs. The finding by Hwa53 that higher-level constituents are the most informative linguistic units in grammar induction suggests that maximal NPI may be a better indicator of overall parsing accuracy than base phrases. As shown above, the results of maximal NPI are acceptable, and those of base NPI are comparable to published performance on newswire text, the parser's native training domain.

In addition, we believe that the representation of NPs within a parse tree provides considerable potential flexibility at the time of indexing. Compared with the flat output structure produced by text chunking, a parse tree captures more structural information revealing the semantics of the sentence, which may be very helpful in identifying negated concepts. This approach can support heuristics that select the optimal NP node for indexing by traversing paths between the maximal NPs and base NPs within the parse tree.
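The distinction between the two NP levels can be made concrete with a small tree traversal (a simplified sketch, not the ChartIndex selection heuristics): a maximal NP is an NP node with no NP ancestor, and a base NP is an NP node with no NP descendant.

```python
# A parse tree is (label, children); leaves are plain word strings.

def leaves(node):
    """Collect the words under a node, left to right."""
    if isinstance(node, str):
        return [node]
    return [w for child in node[1] for w in leaves(child)]

def contains_np(node):
    """True if the subtree rooted here contains any NP node."""
    if isinstance(node, str):
        return False
    label, children = node
    return label == "NP" or any(contains_np(c) for c in children)

def maximal_nps(node, under_np=False):
    """NP nodes with no NP ancestor (longest phrases)."""
    if isinstance(node, str):
        return []
    label, children = node
    found = [" ".join(leaves(node))] if label == "NP" and not under_np else []
    under = under_np or label == "NP"
    for c in children:
        found += maximal_nps(c, under)
    return found

def base_nps(node):
    """NP nodes with no NP descendant (simple phrases)."""
    if isinstance(node, str):
        return []
    label, children = node
    found = []
    for c in children:
        found += base_nps(c)
    if label == "NP" and not any(contains_np(c) for c in children):
        found.append(" ".join(leaves(node)))
    return found

# Hypothetical parse of "a small nodule in the right lower lobe is seen"
tree = ("S", [
    ("NP", [("NP", ["a", "small", "nodule"]),
            ("PP", ["in", ("NP", ["the", "right", "lower", "lobe"])])]),
    ("VP", ["is", "seen"]),
])
```

Here `maximal_nps(tree)` yields the single attachment-resolved phrase "a small nodule in the right lower lobe", while `base_nps(tree)` yields its two simple constituents.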

Noun phrase identification is a critical step in the ChartIndex model: most important biomedical concepts in clinical documents are NPs, and most UMLS concept descriptors are NPs. The identification of NPs in ChartIndex relies on a high-performance statistical parser, and such parsers are usually trained on corpora from a general domain. To apply them to clinical documents effectively, we need to supply the parser with biomedical terms, and we have shown the UMLS Specialist Lexicon to be an effective resource for this purpose. As mentioned above, there are two general problems in integrating SL with these parsers. The first is the mismatch of syntactic categories, which causes ambiguities in mapping entries from SL syntactic categories to Penn Treebank syntactic categories. To address this issue, we chose to map conservatively, including only unambiguous terms. The second problem is that statistical parsers usually use a lexicon with relative frequencies of the different syntactic categories for each token, and those relative frequencies for some common words may be very different in the biomedical domain than in general domains. In the base NPI experiment, we manually changed the relative frequencies for a few words, as described in the Methods section; those results are marked as "SL+GR" in the data tables. Table 7 shows an improvement from 91.1% F1 to 92.8% F1 for all reports. While this tuning is extremely specific and might raise concerns of overfitting, certain words are so common, and so different in distribution between domains, that even a few such modifications (for five words in this case) can be widely applicable and generally useful.
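The conservative mapping strategy can be sketched as follows. The category names, tag mapping, and sample terms below are illustrative assumptions rather than the actual SL-to-Treebank tables used in the study; the point is that a term enters the extended lexicon only when its SL categories resolve to a single Penn Treebank tag.

```python
# Simplified, assumed mapping from SL syntactic categories to Penn Treebank tags.
SL_TO_PTB = {"noun": "NN", "adj": "JJ", "adv": "RB", "verb": "VB"}

def build_extended_lexicon(sl_entries):
    """Keep only terms whose SL categories map to exactly one Penn tag.

    sl_entries: dict mapping a term to its set of SL syntactic categories.
    """
    lexicon = {}
    for term, categories in sl_entries.items():
        tags = {SL_TO_PTB[c] for c in categories if c in SL_TO_PTB}
        if len(tags) == 1:          # unambiguous: safe to pretag
            lexicon[term] = tags.pop()
    return lexicon

# Hypothetical entries: "left" is ambiguous and is deliberately excluded.
sample = {
    "pneumothorax": {"noun"},
    "hilar": {"adj"},
    "effusion": {"noun"},
    "left": {"noun", "adj", "verb"},
}
extended = build_extended_lexicon(sample)
```

Excluding ambiguous terms trades coverage for safety: a wrong pretag would force a parse error, while an omitted term merely falls back to the parser's own statistics.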

The experiment on the identification of maximal NPs showed consistently lower performance, with an F1 of 80.2% without terms from the UMLS Specialist Lexicon and 83.3% with them. One reason, as mentioned above, is that identifying maximal NPs is a harder problem because the parser must resolve attachment ambiguities. Another reason relates to how errors are counted. For example, "occluded left FEM to distal bypass graft" is a maximal NP in one sentence. The parser mistakenly tagged "left" as a verb, which led to three false-positives ("occluded," "FEM," and "distal bypass graft") and one false-negative in maximal NPI. In base NPI, the same parse error led to only two false-positives, "occluded" and "FEM," since "distal bypass graft" is a correct base NP. Additionally, because the total number of maximal NPs is smaller than the total number of base NPs, each failure carries more weight in the maximal NPI results.
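The error counting in this example can be made precise with a simple exact-match scorer (a sketch; the gold base-NP spans below are assumed for illustration): a predicted NP counts as a true-positive only if its span exactly matches a gold NP.

```python
def score(gold, predicted):
    """Exact-match NP scoring: TP/FP/FN from span-set overlap."""
    gold, predicted = set(gold), set(predicted)
    tp = len(gold & predicted)
    fp = len(predicted - gold)
    fn = len(gold - predicted)
    return tp, fp, fn

# With "left" mistagged as a verb, the parser predicts three fragments
# instead of the single gold maximal NP:
pred = {"occluded", "FEM", "distal bypass graft"}
max_result = score({"occluded left FEM to distal bypass graft"}, pred)   # (0, 3, 1)

# In base NPI the same parse error is cheaper, because "distal bypass
# graft" is itself a correct base NP (gold base spans assumed here):
base_result = score({"occluded left FEM", "distal bypass graft"}, pred)  # (1, 2, 1)
```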

Table 5 shows that, after applying the extended lexicon, the performance of maximal NPI varied less across modalities, except for ultrasound (US), which may reflect its small data set. Table 6 shows that adding terms from SL improved performance consistently, except in the smaller mammogram (MAMMO) data set, in which there was no change. Across the whole test set, the Specialist Lexicon reduced false-positives by 15.3% and false-negatives by 16.5%.

Table 7 shows the same trend in base NPI. The baseline F1 measure ranges from 73.6% in ultrasound (US) to 91.1% in radiology procedure (PROC). With SL terms, the F1 measure improved to between 86.4% (US) and 93.9% (PROC). The final version, using both SL terms and the slightly customized grammar, further improved F1 to between 90.1% in radiograph (RAD) and 95.5% in US. Because of the factors mentioned above, the performance improvements are more substantial in base NPI, as can be seen by comparing the data in Table 8 with those in Table 6.

This document set presented some challenges. As mentioned previously, some sentences were not well formed, and most of the ill-formed sentences were NPs. The parser handled most of them correctly but had problems with some long, complex NPs in the Impression section. Those NPs were sometimes parsed as a full sentence, either with an NP and a verb phrase or with a verb phrase only. In both cases, it was usually because some words in the text had more than one POS tag and were not pretagged using the extended lexicon; these were also rare words (seen by the parser fewer than 20 times during training) that the parser tagged as a verb heading a verb phrase. A second type of error originated from commonly used words in capitalized form, such as "Right": the parser currently treats "Right" as a separate entry from "right," since "Right" may serve as part of a proper noun. There were also a few parsing errors involving punctuation marks such as parentheses and the pound sign "#". These errors arose because such marks are used differently in radiology reports; they indicate syntactic adaptations that still remain to be done and could not necessarily be addressed by simple lexicon adaptation.

Statistical learning techniques have been widely adopted in text processing and text mining applications and have been shown to be robust and to perform well. One potential hurdle with this approach is the need for large labeled training sets for supervised machine learning. This is especially true in the biomedical domain, which has few publicly available large labeled corpora of clinical documents comparable to the Penn Treebank corpora in the general domain. We have extended such a parser using a domain-specific lexicon. Some issues remain with this approach, such as ambiguous tag mappings in the conversion between lexicons that use different POS tag sets, and the fact that the probabilities associated with the POS tags of a given term can be very different in clinical documents than in a general-domain training collection. However, our work has shown that, with the help of lexical entries from a domain-specific lexicon (the UMLS Specialist Lexicon), a statistical natural language parser trained on a general-domain training set can achieve significantly improved performance on NP identification within clinical radiology reports.

There are limitations to our analysis. First, the method that we used to create a gold standard was not optimal.55 Ideally, we would have asked each expert physician to review all 100 radiology reports; however, this was not possible given the experts' time constraints, and we instead asked each physician to review 25 of the 100 documents. Had each physician reviewed all 100 documents, we could have evaluated inter-rater reliability and intra-rater variability. Second, the use of computer pre-markups may bias the human experts' judgments. Third, we did not evaluate the entire parse tree for each sentence, although the evaluations of base and maximal NPI gave us a good estimate of the parser's overall performance.

## Conclusion

The performance of sentence boundary detection is excellent in this system. Extraction of NPs in clinical radiology reports, using statistical natural language processing, can achieve performance comparable to that seen in the general, nonmedical, domain. The adaptation using the UMLS Specialist Lexicon significantly improved both precision and recall in NPI on clinical radiology reports to levels comparable to the parser's native performance in its nonbiomedical training domain (newswire). Future work will include the development of a system that will take NPs in parse tree format and map them into corresponding UMLS concepts.

## References

1. United States Department of Health & Human Services. News release [cited 2005 March 9]. Available at: http://www.hhs.gov/news/press/2004pres/20040721a.html.
2. Dolin RH, Alschuler L, Beebe C, et al. The HL7 clinical document architecture. J Am Med Inform Assoc 2001;8:552–69.
3. Lowe HJ. Multimedia electronic medical record systems. 1999;74:146–52.
4. Lindberg DA, Humphreys BL, McCray AT. The unified medical language system. Methods Inf Med 1993;32:281–91.
5. Humphreys BL, Lindberg DA, Schoolman HM, et al. The Unified Medical Language System: an informatics research collaboration. J Am Med Inform Assoc 1998;5:1–11.
6. Hersh WR, Greenes RA. SAPHIRE—an information retrieval system featuring concept matching, automatic indexing, probabilistic retrieval, and hierarchical relationships. Comput Biomed Res 1990;23:410–25.
7. Hersh WR, Hickam D. Information retrieval in medicine: the SAPHIRE experience. Medinfo 1995;8 Pt 2:1433–7.
8. Zou Q, Chu WW, Morioka C, et al. IndexFinder: a method of extracting key concepts from clinical texts for indexing. Proc AMIA Symp 2003;763–7.
9. Sager N, Lyman M, Nhan NT, et al. Automatic encoding into SNOMED III: a preliminary investigation. Proc Annu Symp Comput Appl Med Care 1994;230–4.
10. Lowe HJ. Image Engine: an object-oriented multimedia database for storing, retrieving and sharing medical images and text. Proc Annu Symp Comput Appl Med Care 1993;839–43.
11. Friedman C, Alderson PO, Austin JH, et al. A general natural-language text processor for clinical radiology. J Am Med Inform Assoc 1994;1:161–74.
12. Cooper GF, Miller RA. An experiment comparing lexical and statistical methods for extracting MeSH terms from clinical free text. J Am Med Inform Assoc 1998;5:62–75.
13. Nadkarni P, Chen R, Brandt C. UMLS concept indexing for production databases: a feasibility study. J Am Med Inform Assoc 2001;8:80–91.
14. Pietrzyk PM. Free text analysis. Int J Biomed Comput 1995;39:139–44.
15. Purcell GP, Shortliffe EH. Contextual models of clinical publications for enhancing retrieval from full-text databases. Proc Annu Symp Comput Appl Med Care 1995;851–7.
16. Berrios DC. Automated indexing for full text information retrieval. Proc AMIA Symp 2000;71–5.
17. Aronson AR, Bodenreider O, Chang HF, et al. The NLM Indexing Initiative. Proc AMIA Symp 2000;17–21.
18. Aronson AR. Effective mapping of biomedical text to the UMLS Metathesaurus: the MetaMap program. Proc AMIA Symp 2001;17–21.
19. Liu H, Friedman C. Mining terminological knowledge in large biomedical corpora. Pac Symp Biocomput 2003;415–26.
20. Baud R, Ruch P. The future of natural language processing for biomedical applications. Int J Med Inf 2002;67:1–5.
21. Taira RK, Soderland SG. A statistical natural language processor for medical reports. Proc AMIA Symp 1999;970–4.
22. Taira RK, Soderland SG, Jakobovits RM. Automatic structuring of radiology free-text reports. 2001;21:237–45.
23. Christensen L, Haug PJ, Fiszman M. MPLUS: a probabilistic medical language understanding system. Proc Workshop on Natural Language Processing in the Biomedical Domain 2002;29–36.
24. Hripcsak G, Kuperman GJ, Friedman C. Extracting findings from narrative reports: software transferability and sources of physician disagreement. Methods Inf Med 1998;37:1–7.
25. Hripcsak G, Austin JH, Alderson PO, Friedman C. Use of natural language processing to translate clinical information from a database of 889,921 chest radiographic reports. 2002;224:157–63.
26. Friedman C. A broad-coverage natural language processing system. Proc AMIA Symp 2000;270–4.
27. Berrios DC, Cucina RJ, Fagan LM. Methods for semi-automated indexing for high precision information retrieval. J Am Med Inform Assoc 2002;9:637–52.
28. Srinivasan S, Rindflesch TC, Hole WT, Aronson AR, Mork JG. Finding UMLS Metathesaurus concepts in MEDLINE. Proc AMIA Symp 2002;727–31.
29. Brennan PF, Aronson AR. Towards linking patients and clinical information: detecting UMLS concepts in e-mail. J Biomed Inform 2003;36:334–41.
30. Bourigault D. Surface grammatical analysis for the extraction of terminological noun phrases. In: The 14th International Conference on Computational Linguistics; Nantes, France, 1992.
31. Ramshaw LA, Marcus MP. Text chunking using transformation-based learning. In: ACL Third Workshop on Very Large Corpora. Cambridge, MA: Association for Computational Linguistics, 1995.
32. Cardie C, Pierce D. Error-driven pruning of Treebank grammars for base noun phrase identification. In: The Thirty-Sixth Annual Meeting of the Association for Computational Linguistics and Seventeenth International Conference on Computational Linguistics. Association for Computational Linguistics, 1998.
33. Bennett NA, He Q, Powell K, et al. Extracting noun phrases for all of MEDLINE. Proc AMIA Symp 1999;671–5.
34. McCray AT, Srinivasan S, Browne AC. Lexical methods for managing variation in biomedical terminologies. Proc Annu Symp Comput Appl Med Care 1994;235–9.
35. Spackman KA, Hersh WR. Recognizing noun phrases in medical discharge summaries: an evaluation of two natural language parsers. Proc AMIA Annu Fall Symp 1996;155–8.
36. Berrios DC, Kehler A, Fagan LM. Knowledge requirements for automated inference of medical textbook markup. Proc AMIA Symp 1999;676–80.
37. Lowe HJ, Antipov I, Hersh W, et al. Automated semantic indexing of imaging reports to support retrieval of medical images in the multimedia electronic medical record. Methods Inf Med 1999;38:303–7.
38. Hersh WR, Mailhot M, Arnott-Smith C, et al. Selective automated indexing of findings and diagnoses in radiology reports. J Biomed Inform 2001;34:262–73.
39. Huang Y, Lowe HJ, Hersh WR. A pilot study of contextual UMLS indexing to improve the precision of concept-based representation in XML-structured clinical radiology reports. J Am Med Inform Assoc 2003;10:580–7.
40. Reynar JC, Ratnaparkhi A. A maximum entropy approach to identifying sentence boundaries. Proc of the ANLP97. Washington, DC, 1997;16–9.
41. Shannon CE. A mathematical theory of communication. Bell System Technical Journal 1948;27:379–423, 623–56.
42. Bertsekas DP. Constrained Optimization and Lagrange Multiplier Methods. Burlington, MA, 1982.
43. Berger A. A Brief Maxent Tutorial. 1996 [cited 2005 March 9]. Available at: http://www-2.cs.cmu.edu/afs/cs/user/aberger/www/html/tutorial/tutorial.html.
44. Klein D, Manning CD. Fast exact inference with a factored model for natural language parsing. Advances in Neural Information Processing Systems 15 (NIPS 2002), December 2002.
45. Klein D, Manning CD. Accurate unlexicalized parsing. Proc of the 41st Meeting of the Association for Computational Linguistics 2003;423–30.
46. Marcus MP, Santorini B, Marcinkiewicz MA. Building a large annotated corpus of English: The Penn Treebank. Computational Linguistics 1993;19:313–30.
47. McCray AT, Aronson AR, Browne AC, et al. UMLS knowledge for biomedical language processing. Bull Med Libr Assoc 1993;81:184–94.
48. Szolovits P. Adding a medical lexicon to an English parser. Proc AMIA Symp 2003;639–43.
49. Manning C, Schütze H. Foundations of Statistical Natural Language Processing. Cambridge, MA: MIT Press, 1999, p 269.
50. Wilson EB. Probable inference, the law of succession, and statistical inference. J Am Stat Assoc 1927;22:209–12.
51. Hobbs JR. SRI International's TACITUS System: MUC-3 Test Results and Analysis. Proc of the Third Message Understanding Conference (MUC-3). San Diego, CA, 1991;105–7.
52. Grishman R. The NYU System for MUC-6 or Where's the Syntax. Proc of the 6th Message Understanding Conference (MUC-6). Columbia, MD, 1995;105–7.
53. Hwa R. Supervised grammar induction using training data with limited constituent information. Proceedings of the 37th Annual Meeting of the ACL. College Park, MD, 1999;73–9.
54. Gildea D. Corpus Variation and Parser Performance. In: Conference on Empirical Methods in Natural Language Processing (EMNLP) 2001. Pittsburgh, PA, 2001:167–202.
55. Hripcsak G, Wilcox A. Reference standards, judges, and comparison subjects: roles for experts in evaluating system performance. J Am Med Inform Assoc 2002;9:1–15.
The authors thank Albert Chan, MD, and Todd Ferris, MD, for their assistance in the evaluation component of this study. The authors also thank Dr. Robert Newcombe for providing a method to calculate confidence intervals and Haoyi Wang for helpful discussions on sentence boundary detection.