A Rule-based Approach for Identifying Obesity and Its Comorbidities in Medical Discharge Summaries

Table 5

F-measure Results for Classification of the Test Set, Ranked by Macro F-measure. CDC Entries Shown within the Context of i2b2 Best Results and Overall Average Results

System	Micro F-Measure	Macro F-Measure
i2b2 best—textual task	0.9723	0.8052
CDC—positive-weighted tie-breaker	0.9704	0.7718
CDC—negative-weighted tie-breaker	0.9685	0.7391
CDC—questionable-weighted tie-breaker	0.9685	0.7383
i2b2 average—textual task	0.91	0.56

CDC = Centers for Disease Control.

Detailed results for each morbidity are shown in Table 6. Of the 16 morbidities in this year's challenge, our rule-based approach achieved macro F-measure scores above 0.8 for 12 (75%) and above 0.9 for five (31%). Our worst results came from classifying the discharge summaries for obesity, the primary morbidity of interest. Our system misclassified every document that should have been classified as “N” or “Q” for that morbidity (three documents for each of those judgments).

Initially, we suspected that our customized NegEx dictionary lacked any keywords occurring in the discharge summaries for these judgments. However, post-challenge analysis by a medical reviewer indicated that several of the documents in the “Q” category contained multiple terms that in combination could indicate a possibility of obesity (e.g., dyslipidemia, NIDDM, hypertension, herniated disk). At least one of the documents in the “N” category included insulin-dependent diabetes as a morbidity, which could indicate that the patient is less likely to be obese than a patient whose diabetes is non-insulin-dependent. For the “Q” judgments, our system could not assess that some terms, either individually or in combination, might indicate that obesity was only possible rather than almost certain. Similarly, for the “N” judgments, our system had no concept that some terms could be contra-indicators for a morbidity; our custom dictionary consisted only of clinical terms that were strong indicators of a particular morbidity.

Because macro-averaging weights each judgment category equally regardless of how few documents it contains, missing all the “N” and “Q” judgments had a substantial effect on our results for obesity. Based on the results of the challenge, poor performance on these two judgment categories for obesity appeared to be a common problem for other teams as well: eight of the top ten entries had macro F-measures for obesity below 0.50 [11].
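The effect of macro-averaging described above can be illustrated with a small sketch. It assumes one common definition of the two averages (micro F-measure from pooled counts, macro F-measure as the unweighted mean of per-class F-measures); the challenge's exact formula may differ. The function name `f_measures` and the document counts for “Y” and “U” are hypothetical; only the three missed “N” and three missed “Q” documents mirror the situation reported here.

```python
from collections import Counter


def f_measures(gold, pred, classes):
    """Return (micro_f1, macro_f1, per_class_f1) for single-label predictions."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, pred):
        if g == p:
            tp[g] += 1
        else:
            fp[p] += 1  # the predicted class gains a false positive
            fn[g] += 1  # the gold class gains a false negative

    def f1(true_pos, false_pos, false_neg):
        denom = 2 * true_pos + false_pos + false_neg
        return 2 * true_pos / denom if denom else 0.0

    per_class = {c: f1(tp[c], fp[c], fn[c]) for c in classes}
    # Macro: every class counts equally, however few documents it has.
    macro = sum(per_class.values()) / len(classes)
    # Micro: pool the counts, so large classes dominate.
    micro = f1(sum(tp.values()), sum(fp.values()), sum(fn.values()))
    return micro, macro, per_class


# Hypothetical data set: 40 "Y" and 54 "U" documents (invented counts),
# plus 3 "N" and 3 "Q" documents that are all misclassified as "U".
gold = ["Y"] * 40 + ["U"] * 54 + ["N"] * 3 + ["Q"] * 3
pred = ["Y"] * 40 + ["U"] * 60

micro, macro, per_class = f_measures(gold, pred, ["Y", "N", "Q", "U"])
print(f"micro F = {micro:.4f}, macro F = {macro:.4f}")
```

With these invented counts, six errors out of 100 documents leave the micro F-measure at 0.94, while the two classes with an F-measure of zero pull the macro F-measure below 0.5.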

Table 6

F-measure Results for Each Morbidity, Ranked by Macro F-measure (Positive-Weighted Entry)

Morbidity	Micro F-Measure	Macro F-Measure
Obesity	0.9757	0.4917
GERD	0.9841	0.6524
OSA	0.9901	0.6564
CHF	0.9173	0.7645
Hypertension	0.9501	0.8089
Hypertriglyceridemia	0.9901	0.8308
CAD	0.9014	0.8357
Hypercholesterolemia	0.9721	0.8607
Gallstones	0.9803	0.8684
Venous insufficiency	0.9862	0.8811
Diabetes	0.9722	0.8886
PVD	0.9724	0.9380
Asthma	0.9921	0.9434
Depression	0.9763	0.9566
OA	0.9781	0.9635
Gout	0.9861	0.9678

GERD = gastroesophageal reflux disease; OSA = obstructive sleep apnea; CHF = congestive heart failure; CAD = coronary artery disease; PVD = peripheral vascular disease; OA = osteoarthritis.

During the tuning process, the factor that increased the performance of this approach most substantially was customization of the morbidity keyword list in the NegEx dictionary, followed by customization of the lists of terms used for identifying questionable and negated assertions. More minor improvements were contributed by the text preprocessing steps.

There are several possibilities for future improvement of our rule-based approach. Based on the error analysis of the missed “N” and “Q” judgments for obesity, a prime area for improvement could be adding probabilities to the clinical terms in the dictionary to indicate whether they are strong indicators or contra-indicators for a morbidity, or indicators of some questionable possibility. The post-processing scoring rules would also need to be updated to handle this new information. However, this would apply only to individual terms. Assessing multiple clinical terms in combination would require more complicated rules, possibly expert-crafted, and more substantial changes to our system.

Rather than repurposing the conditional possibility code in NegEx to identify keywords related to other family members, an alternative approach would use the ConText [12] algorithm, which includes the negation-detection features of NegEx but also allows the identification of other contextual features, such as to whom a keyword applies (i.e., the patient or someone else).

The preprocessing step that changes question marks to the word “questionable” should be updated to use a text pattern that excludes question marks occurring at sentence boundaries. This would reduce the chance that two sentences are concatenated, and with it the chance that negation and questionable-assertion terms cross a sentence boundary and affect keywords to which they should not apply. However, the documents in this particular classification task appeared to have few question marks at sentence boundaries, so this change may yield little performance improvement on this set of documents.
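A minimal sketch of such a text pattern follows. It is not the preprocessing code used in our system. It assumes, as a heuristic, that a question mark sits at a sentence boundary when it is followed by the end of the text or by whitespace and an uppercase letter. The function name `mark_questionable` and the pattern itself are hypothetical.

```python
import re

# Match "?" unless it is followed by end-of-text or by whitespace plus an
# uppercase letter (the assumed signature of a sentence boundary).
_NON_BOUNDARY_QMARK = re.compile(r"\?(?!\s*$|\s+[A-Z])")


def mark_questionable(text):
    """Replace non-boundary question marks with the word 'questionable'."""
    replaced = _NON_BOUNDARY_QMARK.sub(" questionable ", text)
    # Collapse the extra spaces introduced by the substitution.
    return re.sub(r"[ \t]+", " ", replaced).strip()


print(mark_questionable("History of ?gout."))
print(mark_questionable("Any chest pain? He denies it."))
```

The first call rewrites the uncertainty marker, while the second leaves the genuine question untouched. The heuristic has an obvious limitation: a marker followed by a space and a capitalized abbreviation (for example "? CHF") would be treated as a sentence boundary and left unchanged.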

A practical consideration is that the customized lists of morbidity keywords and terms for identifying assertion types were manually created. This can involve a substantial amount of manual effort. Future research could investigate automated feature selection techniques to augment the creation of these keyword lists to reduce the level of human effort.

In the system we developed, expert knowledge is represented primarily in the lists of clinical terms and the various types of assertion modifiers (e.g., negation terms and pseudo-negation terms) stored in our customized NegEx dictionary. Domain-specific knowledge does not exist as expert-crafted rules in our system. Rules in the preprocessing step are tied to domain-specific considerations at a superficial level, but they do not represent deep expert knowledge. The addition of domain-specific rules built on expert medical knowledge has the potential to enhance the performance of this approach, particularly in situations that require assessing multiple clinical terms to arrive at a judgment.
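As an illustration only, weighted indicator terms and a simple multi-term rule might be sketched as follows. Every weight, threshold, and name here (`OBESITY_TERM_WEIGHTS`, `judge`) is a hypothetical placeholder rather than a value from our system or from a medical expert; the terms are taken from the error analysis above. Negation handling, which NegEx provides in the actual system, is omitted.

```python
import re

# Hypothetical weights: 1.0 = strong indicator, small positive = weak
# indicator, negative = contra-indicator. Values are invented.
OBESITY_TERM_WEIGHTS = {
    "obese": 1.0,
    "obesity": 1.0,
    "dyslipidemia": 0.2,
    "niddm": 0.2,
    "hypertension": 0.2,
    "herniated disk": 0.2,
    "insulin-dependent diabetes": -0.6,
}


def _contains(term, text):
    # Whole-term match; the hyphen guard keeps "non-insulin-dependent
    # diabetes" from matching "insulin-dependent diabetes".
    pattern = r"(?<![\w-])" + re.escape(term) + r"(?![\w-])"
    return re.search(pattern, text) is not None


def judge(text, weights, strong=1.0, weak=0.5):
    """Return 'Y', 'Q', 'N', or 'U' from the weighted terms found in text."""
    lowered = text.lower()
    found = [w for term, w in weights.items() if _contains(term, lowered)]
    if any(w >= strong for w in found):
        return "Y"  # a single strong indicator is enough
    total = sum(found)
    if total >= weak:
        return "Q"  # several weak indicators together: questionable
    if total <= -weak:
        return "N"  # contra-indicators dominate
    return "U"      # too little evidence either way


print(judge("History of dyslipidemia, NIDDM, and hypertension.", OBESITY_TERM_WEIGHTS))
```

In this sketch no single weak term reaches the “Q” threshold, but three together do, which is the kind of combined assessment that keyword matching on strong indicators alone cannot express.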

Conclusions

We applied a relatively simple rule-based approach in our entry to the 2008 i2b2 Obesity Challenge. The classification strategy involved looking for morbidity keywords and the types of assertions in which they occurred, and then classifying the documents based on scores assigned to the various morbidity/judgment combinations. As indicated by its fifth-place ranking and its performance relative to the averages for the textual classification task, this strategy performed reasonably well on the i2b2 Obesity Challenge data, in which the number of documents varied widely across morbidity/judgment categories and some judgment categories contained few documents. The overall results also indicate that this relatively simple approach held up well in comparison to more complicated strategies applied in other entries to the challenge. The approach relies substantially on the NegEx negation detection algorithm and on tailoring keyword lists to this particular task. Keyword list creation and customization rely on manual inspection of training data to identify terms to be used to classify documents. These keyword lists may need to be customized or recreated to use this strategy on different classification tasks. However, this strategy also has the potential to perform reasonably well when limited example data are available for training a machine learning algorithm. The approach could be enhanced further by the addition of domain-specific rules designed to model the reasoning of a medical expert.

The authors thank i2b2 for organizing the 2008 i2b2 Obesity Challenge. The authors thank Chapman et al for making NegEx available for non-commercial use, and thank Wendy Chapman, PhD, for suggesting ConText as an enhancement to our current approach.

References

1. Wilcox AB, Hripcsak G. The role of domain knowledge in automating medical text report classification. J Am Med Inform Assoc 2003;10:330–338.

2. Chapman WW, Cooper GF, Hanbury P, et al. Creating a text classifier to detect radiology reports describing mediastinal findings associated with inhalational anthrax and other disorders. J Am Med Inform Assoc 2003;10(5):494–503.

3. Olszewski RT. Bayesian classification of triage diagnoses for the early detection of epidemics. Proc of the 16th Int FLAIRS Conference, pp 412–416, 2003.

4. Clark C, Good K, Jezierny L, et al. Identifying smokers with a medical extraction system. J Am Med Inform Assoc 2008;15:36–39.

5. Fiszman M, Chapman WW, Aronsky D, Evans RS, Haug PJ. Automatic detection of acute bacterial pneumonia from chest x-ray reports. J Am Med Inform Assoc 2000;7:593–604.

6. Chapman WW, Christensen LM, Wagner MM, et al. Classifying free-text triage chief complaints into syndromic categories with natural language processing. Artif Intell Med 2005;33(1):31–40.

7. Zeng QT, Goryachev S, Weiss S, et al. Extracting principal diagnosis, co-morbidity and smoking status for asthma research: Evaluation of a natural language processing system. BMC Med Inform Decis Mak 2006;6:30.

8. Dumais ST, Platt J, Heckerman D, Sahami M. Inductive learning algorithms and representations for text categorization. In: Proceedings of the 7th ACM International Conference on Information and Knowledge Management. New York: ACM Press; 1998. p. 148–155.

9. Uzuner O, Szolovits P, Kohane I. Second i2b2 shared-task and workshop [internet]. i2b2: Informatics for integrating biology and the bedside, 2008. Available at: https://www.i2b2.org/NLP/. Accessed: Aug 15, 2008.

10. Chapman WW, Bridewell W, Hanbury P, Cooper GF, Buchanan BG. A simple algorithm for identifying negated findings and diseases in discharge summaries. J Biomed Inform 2001;34:301–310.

11. Uzuner O. Recognizing obesity and co-morbidities in sparse data. J Am Med Inform Assoc 2009;16:560–570.

12. Chapman WW, Chu D, Dowling JN. ConText: An algorithm for identifying contextual features from clinical text. In: Proceedings of the 2007 ACL Workshop on Biological, Translational, and Clinical Language Processing (BioNLP); 2007 Jun 29; Prague, Czech Republic. Madison, WI: Omnipress; 2007, pp 81–88.