Validation of a Natural Language Processing Protocol for Detecting Heart Failure Signs and Symptoms in Electronic Health Record Text Notes
Roy J. Byrd2, Steven R. Steinhubl1, Jimeng Sun2, Shahram Ebadollahi2, Zahra Daar1, Walter F. Stewart1
1Geisinger Medical Center, Center for Health Research, Danville, PA
2IBM T.J. Watson Research Center, Hawthorne, NY
Outline
• Background and objectives
• Datasets
• Tools & methods
• Results
• Discussion
– Challenges
– Opportunities
• Summary
• (Iterative annotation refinement)
Background and Objectives
• Background
– Framingham criteria for HF published in 1971
– Geisinger/IBM "PredMED" project on predictive modeling for early detection of HF, using longitudinal EHRs
• Overall project objective: better understand the presentation of HF in the primary care setting, in order to facilitate its more rapid identification and treatment
• Objective of this paper: build and validate NLP extractors for Framingham criteria (signs and symptoms) from EHR clinical notes, so that they may be suitable for downstream diagnostic applications
Framingham HF Diagnostic Criteria

Major symptoms:
1. Paroxysmal Nocturnal Dyspnea (PND) or Orthopnea
2. Neck Vein Distension (JVD)
3. Rales
4. Radiographic Cardiomegaly
5. Acute Pulmonary Edema
6. S3 Gallop
7. Increased Central Venous Pressure (>16 cm H2O at RA)
8. Circulation Time of 25 seconds**
9. Hepatojugular Reflux (HJR)
10. Weight loss of 4.5 kg in 5 days in response to treatment

Minor symptoms:
1. Bilateral Ankle Edema
2. Nocturnal Cough
3. Dyspnea on ordinary exertion
4. Hepatomegaly
5. Pleural effusion
6. A decrease in vital capacity by 1/3 of the maximal value recorded**
7. Tachycardia (>120 BPM)

** Not extracted, since these criteria are not documented in routine clinical practice.

N Engl J Med. 1971;285:1441-1446.
(Sample downstream analysis)
[Figure: bar chart of the percent of patients with documented criteria (PND, Rales, JVD, Pulmonary Edema, Cardiomegaly, Ankle Edema, DOE) for cases (N=4,644) vs. controls (N=45,981)]
Reports of Framingham HF criteria in the year prior to diagnosis
Datasets
• Clinical notes from longitudinal (2001-2010) EHR encounters for
– 6,355 case patients, who meet operational criteria for HF**
– 26,052 control patients, clinic-, gender-, and age-matched to cases
– The case-control distinction is exploited in downstream applications; it is not relevant for criteria extraction.
• Development dataset
– 65 encounter notes, selected for density of Framingham criteria, annotated by a clinical expert
• Validation dataset
– 400 encounter notes (200 cases & 200 controls), randomly selected, annotated by consensus of 4 trained coders
– N = 1,492 criteria

**Operational HF criteria: HF diagnosis on problem list; HF diagnosis in EHR for two outpatient encounters; two or more medications with ICD-9 code for HF; or one HF diagnosis and one medication with ICD-9 code for HF.
Tools
• LRW1 – LanguageWare Resource Workbench: basic text processing, dictionaries, grammars
• UIMA2 – Unstructured Information Management Architecture: execution pipeline (including I/O management), text analysis engines
• TextSTAT3 – simple text analysis tool: concordance program, used for linguistic analysis
UIMA Collection Processing Engine

Encounter Documents → Basic Processing (paragraphs, sentences, tokenization, etc.) → Dictionaries and Grammars (recognize criteria candidates) → Text Analysis Engines (apply constraints and annotate criteria) → Extracted Criteria

1http://www.alphaworks.ibm.com/tech/lrw  2http://uima.apache.org  3http://neon.niederlandistik.fu-berlin.de/en/textstat
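The pipeline's data flow can be sketched in miniature. The real system is a Java/UIMA pipeline built with LanguageWare resources; the Python below is only an illustration of how the stages hand off data, and every function name, the toy vocabulary, and the naive sentence splitter are hypothetical.

```python
# Toy sketch of the pipeline's data flow (the real system is a Java/UIMA
# pipeline; all names and the tiny vocabulary here are illustrative).

def basic_processing(text):
    """Stage 1: split an encounter note into sentence units with tokens."""
    sentences = [s.strip() for s in text.split(".") if s.strip()]
    return [{"sentence": s, "tokens": s.lower().split()} for s in sentences]

def dictionary_lookup(units, vocabulary):
    """Stage 2: mark candidate criteria wherever a vocabulary phrase occurs."""
    for unit in units:
        sent = unit["sentence"].lower()
        unit["candidates"] = [code for phrase, code in vocabulary.items()
                              if phrase in sent]
    return units

def apply_constraints(units):
    """Stage 3: stand-in for the TAE rules -- criteria in sentences that
    contain a single-token negator are Denied, the rest Affirmed."""
    negators = {"no", "denies", "without"}
    extracted = []
    for unit in units:
        status = "Denied" if negators & set(unit["tokens"]) else "Affirmed"
        extracted.extend((code, status) for code in unit["candidates"])
    return extracted

vocabulary = {"shortness of breath": "DOE", "rales": "RALE"}
note = "Patient denies shortness of breath. Rales noted at both bases."
extract = apply_constraints(dictionary_lookup(basic_processing(note), vocabulary))
# extract == [("DOE", "Denied"), ("RALE", "Affirmed")]
```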
Criteria Extraction Methods: Dictionaries
• Framingham criteria vocabulary
– Words and phrases used to mention the 15 Framingham criteria
– e.g., edema, leg edema, oedema; shortness of breath, SOB
– Size: ~75 "lemma forms" (main entries) and hundreds of variant forms
• Segment header words and phrases
– Patient History, Examination, Plan, Instruction
• Negating words, used to deny criteria
– no, free of, ruled out
• Counterfactual triggers, indicating the criteria may not have occurred
– if, should, as needed for
• Miscellaneous classes
– Weight loss phrases: lose weight, diurese
– Time value words: day, week, month
– Weight units: pound, kilogram
– Diuretics: Bumex, furosemide
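One way to organize such a vocabulary is a two-level table: lemma forms map to criterion codes, and variant forms normalize to their lemma first. A minimal sketch, where the entries and codes are a tiny illustrative subset, not the actual ~75-lemma dictionary:

```python
# Illustrative subset of a criteria vocabulary: lemmas map to criterion
# codes; variant forms are normalized to a lemma before lookup.
LEMMAS = {
    "edema": "ANKED",
    "shortness of breath": "DOE",
    "paroxysmal nocturnal dyspnea": "PND",
}
VARIANTS = {
    "oedema": "edema",
    "leg edema": "edema",
    "sob": "shortness of breath",
    "pnd": "paroxysmal nocturnal dyspnea",
}

def criterion_for(phrase):
    """Return the criterion code for a phrase, or None if unknown."""
    key = phrase.lower()
    lemma = VARIANTS.get(key, key)
    return LEMMAS.get(lemma)

criterion_for("SOB")     # "DOE"
criterion_for("oedema")  # "ANKED"
```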
Criteria Extraction Methods: Grammars
• Shallow English syntax
– Noun phrases: some moderate DOE
– Compound noun phrases: chest pain, DOE, or night cough
– Prepositional phrases
• No full-sentential parses
– Not needed for simple HF criteria
– Unreliable sentence boundaries and syntax in clinical notes
• Negated scope: regular rate and rhythm without murmurs, clicks, gallops, or rubs
• Counterfactual scope: Patient should call if she experiences shortness of breath
• Weight loss: 20 pound weight loss in a week with diuretics
• Tachycardia: tachy at 120 (to 130); HR: 135
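The tachycardia and weight-loss patterns lend themselves to simple surface expressions. The regexes below are only rough approximations for illustration; the actual system uses LanguageWare grammars, not regexes.

```python
import re

# Approximate surface patterns for the tachycardia and weight-loss
# examples above (illustrative only; the real system uses LanguageWare
# grammars rather than regular expressions).
TACHY = re.compile(r"\b(?:tachy\w*\s+(?:at|@)|hr:?)\s*(\d{2,3})", re.IGNORECASE)
WEIGHT_LOSS = re.compile(
    r"(\d+)\s*(pounds?|lbs?|kg|kilograms?)\s+weight\s+loss\s+in\s+"
    r"(?:a\s+|\d+\s+)?(days?|weeks?|months?)",
    re.IGNORECASE,
)

TACHY.search("tachy at 120 to 130").group(1)   # "120"
TACHY.search("HR: 135").group(1)               # "135"
WEIGHT_LOSS.search("20 pound weight loss in a week with diuretics")  # matches
```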
Criteria Extraction Methods: Text Analysis Engines (TAEs)
• Rules to filter candidate criteria created from dictionaries and grammars
• Deny criteria mentioned in negated contexts
– regular rate and rhythm without murmurs, clicks, gallops, or rubs → S3Neg
• Ignore criteria in counterfactual contexts
– Patient should call if she experiences shortness of breath
• Co-occurrence constraints
– exercise HR: 135 doesn't affirm Tachycardia
• Disambiguation
– edema is recognized as APEdema if near cxr, or in a "Radiology" note, or in a "Chest X-Ray" segment
• Numeric constraints
– she lost 5 pounds over a month doesn't affirm WeightLoss
– tachy @ 115 doesn't affirm Tachycardia
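The numeric constraints reduce to threshold checks against the Framingham definitions (tachycardia above 120 BPM; weight loss of at least 4.5 kg within 5 days). A sketch with hypothetical function names and unit-conversion tables:

```python
# Threshold checks behind the two numeric constraints above; the unit
# tables and function names are illustrative, not the actual TAE rules.
KG_PER_UNIT = {"kg": 1.0, "kilogram": 1.0, "pound": 0.4536, "lb": 0.4536}
DAYS_PER_UNIT = {"day": 1, "week": 7, "month": 30}

def affirms_tachycardia(bpm):
    """Framingham minor criterion: heart rate above 120 BPM."""
    return bpm > 120

def affirms_weight_loss(amount, unit, duration, time_unit):
    """Framingham major criterion: at least 4.5 kg lost within 5 days."""
    kg = amount * KG_PER_UNIT[unit]
    days = duration * DAYS_PER_UNIT[time_unit]
    return kg >= 4.5 and days <= 5

affirms_tachycardia(115)                     # False: "tachy @ 115"
affirms_weight_loss(5, "pound", 1, "month")  # False: too little, too slowly
affirms_weight_loss(10, "pound", 3, "day")   # True: ~4.54 kg in 3 days
```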
Encounter Labeling Methods
• We can label an encounter note with labels showing the criteria that the note mentions.
– The labels can be used by downstream analyses to gather information such as: "This patient exhibited those symptoms on that date."
• Two methods:
– Machine learning: using candidate criteria and scope annotations as features, use a CHAID decision tree classifier to assign criteria as labels.
– Rule-based: run the full extractor pipeline, then assign labels consisting of all unique criteria that survive filtering.
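The rule-based method's final step amounts to collecting the unique surviving criteria. A sketch, assuming extraction output of (criterion, status) pairs and using, as a purely hypothetical tie-break, that an affirmed mention of a criterion outweighs denied mentions of the same criterion:

```python
def label_encounter(extractions):
    """Collapse filtered (criterion, status) mentions into unique labels."""
    statuses = {}
    for criterion, status in extractions:
        # Hypothetical tie-break: an affirmed mention of a criterion
        # outweighs denied mentions of the same criterion.
        if statuses.get(criterion) != "Affirmed":
            statuses[criterion] = status
    return sorted(c if s == "Affirmed" else c + "Neg"
                  for c, s in statuses.items())

label_encounter([("DOE", "Denied"), ("RALE", "Affirmed"), ("DOE", "Denied")])
# ["DOENeg", "RALE"]
```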
Results
Evaluation Flow

Encounter Documents → Lexical Look-up & Scope → Lexical & Scope Annotations. The annotations feed three paths:
– Machine Learning → Encounter Labels → Encounter Label Evaluation
– Rules → Encounter Labels → Encounter Label Evaluation
– Criteria Mentions → Criteria Extraction Evaluation

Metrics:
– Precision (Positive Predictive Value): #TruePositive / (#TruePositive + #FalsePositive)
– Recall (Sensitivity): #TruePositive / (#TruePositive + #FalseNegative)
– F-Score (the harmonic mean of Precision and Recall): (2 × Precision × Recall) / (Precision + Recall)
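In code, the three metrics follow directly from the raw counts. A straightforward sketch (note it assumes nonzero denominators):

```python
def precision_recall_f(tp, fp, fn):
    """Compute precision, recall, and F-score from raw counts
    (assumes at least one predicted and one gold-standard positive)."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f_score = (2 * precision * recall) / (precision + recall)
    return precision, recall, f_score

# e.g., 90 true positives, 10 false positives, 10 false negatives:
precision_recall_f(90, 10, 10)  # precision = recall = F-score = 0.9
```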
Encounter Labeling Performance

                  Machine-learning method            Rule-based method
                  Recall   Precision  F-score        Recall   Precision  F-score
Affirmed          0.675    0.754      0.712          0.739    0.899      0.811
Denied            0.946    0.905      0.925          0.988    0.932      0.959
Overall           0.896    0.881      0.889          0.938    0.927      0.933
Overall 99% CI                        (0.848-0.929)                      (0.900-0.964)

Conclusion: Machine-learning labeling does not significantly underperform rule-based labeling.
Performance of Framingham Diagnostic Criteria Extraction

                   Precision  Recall   F-score   99% Conf. Int. (F-score)
Overall (exact)    0.925      0.897    0.911     (0.891 - 0.929)
Overall (relaxed)  0.948      0.919    0.933     (0.916 - 0.950)
Affirmed           0.748      0.789    0.768     (0.711 - 0.824)
Denied             0.983      0.928    0.955     (0.938 - 0.970)

Note: Performance on affirmed criteria is worse, possibly because of their greater syntactic diversity. For example, we don't find:
– PleuralEffusion: blunting of the right costophrenic angle
– DOExertion: she felt like she couldn't get enough air in
Precision and Recall for Individual Criteria

Analysis of 1,492 extracted criteria: PredMED extractions vs. gold standard annotations.
[Table: confusion matrix of PredMED extractions (rows) against gold-standard annotations (columns) for each criterion and its negation (ANKED, APED, DOE, HEP, HJR, JVD, NC, PLE, PND, RALE, RC, S3G, TACH, WTL), with per-criterion false-positive and false-negative counts]
Discussion
• Challenges
– Data quality: EHR text data is messy. Over 10% of the errors (26/237) are caused by misspellings and bad sentence boundaries.
– Human anatomy: we need a better solution than word co-occurrence constraints.
– Syntactic diversity of affirmed criteria: we need deeper syntactic and semantic analysis.
– Contradictions and redundancy: an issue for downstream analysis.
• Opportunities
– We can apply similar techniques to other collections of criteria: NY Heart Association, European Society of Cardiology, MedicalCriteria.com.
– Many specific criteria extractors can be re-used in other settings.
– For downstream applications, see posters and presentations from our project at this conference.
Summary
• Extractors can identify affirmations and denials of Framingham HF criteria in EHR clinical notes with an overall F-Score of 0.91.
• Classifiers can label EHR encounters with the Framingham criteria they mention with an F-Score of 0.93.
• Information about HF criteria mentioned in EHR notes appears to be useful for downstream applications that seek to achieve early detection of HF.
Backup: Iterative Annotation Refinement

Iterative Annotation Refinement
• What are the problems solved?
– Annotations are required for training and evaluating criteria extractors.
– Human annotators without guidelines have high precision but lower recall.
– Domain experts' intuitions (about the language for expressing criteria) are initially imprecise.
• What is produced?
– An annotated dataset
– Annotation guidelines that are consistent
– Criteria extractors
The Development Process: Iterative Annotation Refinement

• Initialization: the expert and the linguist discuss the language of HF criteria, write initial guidelines, and build initial extractors.
• Iteration: annotate encounter texts with the current extractors, perform error analysis, then update the extractors and update the annotations and the guidelines.
• Results: expert annotations, annotation guidelines, and criteria extractors.
[Figure: user interface for the annotation tool, which was used to manage annotations during refinement]
Performance improvement during development
Iterative methods for creating annotations, guidelines, and extractors

• Iterative Annotation Refinement
– Extraction target: Framingham HF criteria
– Result of using the method: annotations, guidelines, extractor
– Sources of annotations compared in each iteration: expert and extractor
– Arbiter for disagreements at each iteration: expert
– Objective (and metric) for each iteration: improve extractor performance (F-score)
• Annotation Induction (Chapman et al., J Biomed Inform 2006)
– Extraction target: clinical conditions
– Result: guidelines (in the form of an annotation schema)
– Sources compared: expert and linguist
– Arbiter: consensus
– Objective: improve inter-annotator agreement (F-score)
• CDKRM (Coden et al., J Biomed Inform 2009)
– Extraction target: classes in the cancer disease model
– Result: annotations, guidelines
– Sources compared: 2 experts
– Arbiter: consensus
– Objective: improve inter-annotator agreement (agreement %)
• TALLAL (Carrell et al., GHRI-IT poster, 2010)
– Extraction target: PHI (protected health information) classes
– Result: annotations, extractor
– Sources compared: expert and extractor
– Arbiter: expert
– Objective: annotate full dataset (to the expert's satisfaction)