Validation of a Natural Language Processing Protocol for Detecting Heart Failure Signs and Symptoms in Electronic Health Record Text Notes
Roy J. Byrd2, Steven R. Steinhubl1, Jimeng Sun2, Shahram Ebadollahi2, Zahra Daar1, Walter F. Stewart1
1Geisinger Medical Center, Center for Health Research, Danville, PA
2IBM T.J. Watson Research Center, Hawthorne, NY
Outline
• Background and objectives
• Datasets
• Tools & methods
• Results
• Discussion
– Challenges
– Opportunities
• Summary
• (Iterative annotation refinement)
Background and Objectives
• Background
– Framingham criteria for HF published in 1971
– Geisinger/IBM "PredMED" project on predictive modeling for early detection of HF, using longitudinal EHRs
• Overall project objective: better understand the presentation of HF in the primary care setting, in order to facilitate its more rapid identification and treatment
• Objective of this paper: build and validate NLP extractors for Framingham criteria (signs and symptoms) from EHR clinical notes, so that they may be suitable for downstream diagnostic applications
Framingham HF Diagnostic Criteria

Major symptoms:
1. Paroxysmal Nocturnal Dyspnea (PND) or Orthopnea
2. Neck Vein Distension (JVD)
3. Rales
4. Radiographic Cardiomegaly
5. Acute Pulmonary Edema
6. S3 Gallop
7. Increased Central Venous Pressure (>16 cm H2O at RA)
8. Circulation Time of 25 seconds**
9. Hepatojugular Reflux (HJR)
10. Weight loss of 4.5 kg in 5 days in response to treatment

Minor symptoms:
1. Bilateral Ankle Edema
2. Nocturnal Cough
3. Dyspnea on ordinary exertion
4. Hepatomegaly
5. Pleural effusion
6. A decrease in vital capacity by 1/3 of the maximal value recorded**
7. Tachycardia (>120 BPM)

** Not extracted, since these criteria are not documented in routine clinical practice.

N Engl J Med. 1971;285:1441-1446.
(Sample downstream analysis)
[Figure: bar chart of the percent of patients with documented criteria (PND, Rales, JVD, Pulmonary Edema, Cardiomegaly, Ankle Edema, DOE) for cases (N=4,644) vs. controls (N=45,981)]
Reports of Framingham HF criteria in the year prior to diagnosis
Datasets
• Clinical notes from longitudinal (2001-2010) EHR encounters for
– 6,355 case patients, who meet operational criteria for HF**
– 26,052 control patients, clinic-, gender-, and age-matched to cases
– The case-control distinction is exploited in downstream applications; it is not relevant for criteria extraction.
• Development dataset
– 65 encounter notes, selected for density of Framingham criteria, annotated by a clinical expert
• Validation dataset
– 400 encounter notes (200 cases & 200 controls), randomly selected, annotated by consensus of 4 trained coders
– N = 1,492 criteria

**Operational HF criteria: HF diagnosis on problem list; HF diagnosis in EHR for two outpatient encounters; two or more medications with ICD-9 code for HF; or one HF diagnosis and one medication with ICD-9 code for HF.
Tools
• LRW1 – LanguageWare Resource Workbench: basic text processing, dictionaries, grammars
• UIMA2 – Unstructured Information Management Architecture: execution pipeline (including I/O management), text analysis engines
• TextSTAT3 – simple text analysis tool: concordance program, used for linguistic analysis
UIMA Collection Processing Engine

Encounter Documents → Basic Processing (paragraphs, sentences, tokenization, etc.) → Dictionaries and Grammars (recognize criteria candidates) → Text Analysis Engines (apply constraints and annotate criteria) → Extracted Criteria

1http://www.alphaworks.ibm.com/tech/lrw  2http://uima.apache.org  3http://neon.niederlandistik.fu-berlin.de/en/textstat
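The pipeline's data flow can be sketched in miniature. The real system is a Java/UIMA pipeline built with LanguageWare resources; the Python below is only an illustration of how the stages hand off data, and every function name, the toy vocabulary, and the naive sentence splitter are hypothetical.

```python
# Toy sketch of the pipeline's data flow (the real system is a Java/UIMA
# pipeline; all names and the tiny vocabulary here are illustrative).

def basic_processing(text):
    """Stage 1: split an encounter note into sentence units with tokens."""
    sentences = [s.strip() for s in text.split(".") if s.strip()]
    return [{"sentence": s, "tokens": s.lower().split()} for s in sentences]

def dictionary_lookup(units, vocabulary):
    """Stage 2: mark candidate criteria wherever a vocabulary phrase occurs."""
    for unit in units:
        sent = unit["sentence"].lower()
        unit["candidates"] = [code for phrase, code in vocabulary.items()
                              if phrase in sent]
    return units

def apply_constraints(units):
    """Stage 3: stand-in for the TAE rules -- criteria in sentences that
    contain a single-token negator are Denied, the rest Affirmed."""
    negators = {"no", "denies", "without"}
    extracted = []
    for unit in units:
        status = "Denied" if negators & set(unit["tokens"]) else "Affirmed"
        extracted.extend((code, status) for code in unit["candidates"])
    return extracted

vocabulary = {"shortness of breath": "DOE", "rales": "RALE"}
note = "Patient denies shortness of breath. Rales noted at both bases."
extract = apply_constraints(dictionary_lookup(basic_processing(note), vocabulary))
# extract == [("DOE", "Denied"), ("RALE", "Affirmed")]
```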
Criteria Extraction Methods: Dictionaries
• Framingham criteria vocabulary
– Words and phrases used to mention the 15 Framingham criteria
– e.g., edema, leg edema, oedema; shortness of breath, SOB
– Size: ~75 "lemma forms" (main entries) and hundreds of variant forms
• Segment header words and phrases
– Patient History, Examination, Plan, Instruction
• Negating words, used to deny criteria
– no, free of, ruled out
• Counterfactual triggers, indicating the criteria may not have occurred
– if, should, as needed for
• Miscellaneous classes
– Weight loss phrases: lose weight, diurese
– Time value words: day, week, month
– Weight units: pound, kilogram
– Diuretics: Bumex, furosemide
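One way to organize such a vocabulary is a two-level table: lemma forms map to criterion codes, and variant forms normalize to their lemma first. A minimal sketch, where the entries and codes are a tiny illustrative subset, not the actual ~75-lemma dictionary:

```python
# Illustrative subset of a criteria vocabulary: lemmas map to criterion
# codes; variant forms are normalized to a lemma before lookup.
LEMMAS = {
    "edema": "ANKED",
    "shortness of breath": "DOE",
    "paroxysmal nocturnal dyspnea": "PND",
}
VARIANTS = {
    "oedema": "edema",
    "leg edema": "edema",
    "sob": "shortness of breath",
    "pnd": "paroxysmal nocturnal dyspnea",
}

def criterion_for(phrase):
    """Return the criterion code for a phrase, or None if unknown."""
    key = phrase.lower()
    lemma = VARIANTS.get(key, key)
    return LEMMAS.get(lemma)

criterion_for("SOB")     # "DOE"
criterion_for("oedema")  # "ANKED"
```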
Criteria Extraction Methods: Grammars
• Shallow English syntax
– Noun phrases: some moderate DOE
– Compound noun phrases: chest pain, DOE, or night cough
– Prepositional phrases
• No full-sentential parses
– Not needed for simple HF criteria
– Unreliable sentence boundaries and syntax in clinical notes
• Negated scope: regular rate and rhythm without murmurs, clicks, gallops, or rubs
• Counterfactual scope: Patient should call if she experiences shortness of breath
• Weight loss: 20 pound weight loss in a week with diuretics
• Tachycardia: tachy at 120 (to 130); HR: 135
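The tachycardia and weight-loss patterns lend themselves to simple surface expressions. The regexes below are only rough approximations for illustration; the actual system uses LanguageWare grammars, not regexes.

```python
import re

# Approximate surface patterns for the tachycardia and weight-loss
# examples above (illustrative only; the real system uses LanguageWare
# grammars rather than regular expressions).
TACHY = re.compile(r"\b(?:tachy\w*\s+(?:at|@)|hr:?)\s*(\d{2,3})", re.IGNORECASE)
WEIGHT_LOSS = re.compile(
    r"(\d+)\s*(pounds?|lbs?|kg|kilograms?)\s+weight\s+loss\s+in\s+"
    r"(?:a\s+|\d+\s+)?(days?|weeks?|months?)",
    re.IGNORECASE,
)

TACHY.search("tachy at 120 to 130").group(1)   # "120"
TACHY.search("HR: 135").group(1)               # "135"
WEIGHT_LOSS.search("20 pound weight loss in a week with diuretics")  # matches
```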
Criteria Extraction Methods: Text Analysis Engines (TAEs)
• Rules to filter candidate criteria created from dictionaries and grammars
• Deny criteria mentioned in negated contexts
– regular rate and rhythm without murmurs, clicks, gallops, or rubs → S3Neg
• Ignore criteria in counterfactual contexts
– Patient should call if she experiences shortness of breath
• Co-occurrence constraints
– exercise HR: 135 doesn't affirm Tachycardia
• Disambiguation
– edema is recognized as APEdema if near cxr, or in a "Radiology" note, or in a "Chest X-Ray" segment
• Numeric constraints
– she lost 5 pounds over a month doesn't affirm WeightLoss
– tachy @ 115 doesn't affirm Tachycardia
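The numeric constraints reduce to threshold checks against the Framingham definitions (tachycardia above 120 BPM; weight loss of at least 4.5 kg within 5 days). A sketch with hypothetical function names and unit-conversion tables:

```python
# Threshold checks behind the two numeric constraints above; the unit
# tables and function names are illustrative, not the actual TAE rules.
KG_PER_UNIT = {"kg": 1.0, "kilogram": 1.0, "pound": 0.4536, "lb": 0.4536}
DAYS_PER_UNIT = {"day": 1, "week": 7, "month": 30}

def affirms_tachycardia(bpm):
    """Framingham minor criterion: heart rate above 120 BPM."""
    return bpm > 120

def affirms_weight_loss(amount, unit, duration, time_unit):
    """Framingham major criterion: at least 4.5 kg lost within 5 days."""
    kg = amount * KG_PER_UNIT[unit]
    days = duration * DAYS_PER_UNIT[time_unit]
    return kg >= 4.5 and days <= 5

affirms_tachycardia(115)                     # False: "tachy @ 115"
affirms_weight_loss(5, "pound", 1, "month")  # False: too little, too slowly
affirms_weight_loss(10, "pound", 3, "day")   # True: ~4.54 kg in 3 days
```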
Encounter Labeling Methods
• We can label an encounter note with labels showing the criteria that the note mentions.
– The labels can be used by downstream analyses to gather information such as: "This patient exhibited those symptoms on that date."
• Two methods:
– Machine learning: using candidate criteria and scope annotations as features, use a CHAID decision tree classifier to assign criteria as labels.
– Rule-based: run the full extractor pipeline, then assign labels consisting of all unique criteria that survive filtering.
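The rule-based method's final step amounts to collecting the unique surviving criteria. A sketch, assuming extraction output of (criterion, status) pairs and using, as a purely hypothetical tie-break, that an affirmed mention of a criterion outweighs denied mentions of the same criterion:

```python
def label_encounter(extractions):
    """Collapse filtered (criterion, status) mentions into unique labels."""
    statuses = {}
    for criterion, status in extractions:
        # Hypothetical tie-break: an affirmed mention of a criterion
        # outweighs denied mentions of the same criterion.
        if statuses.get(criterion) != "Affirmed":
            statuses[criterion] = status
    return sorted(c if s == "Affirmed" else c + "Neg"
                  for c, s in statuses.items())

label_encounter([("DOE", "Denied"), ("RALE", "Affirmed"), ("DOE", "Denied")])
# ["DOENeg", "RALE"]
```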
Results
Evaluation Flow

Encounter Documents → Lexical Look-up & Scope → Lexical & Scope Annotations. The annotations feed three paths:
– Machine Learning → Encounter Labels → Encounter Label Evaluation
– Rules → Encounter Labels → Encounter Label Evaluation
– Criteria Mentions → Criteria Extraction Evaluation

Metrics:
– Precision (Positive Predictive Value): #TruePositive / (#TruePositive + #FalsePositive)
– Recall (Sensitivity): #TruePositive / (#TruePositive + #FalseNegative)
– F-Score (the harmonic mean of Precision and Recall): (2 × Precision × Recall) / (Precision + Recall)
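In code, the three metrics follow directly from the raw counts. A straightforward sketch (note it assumes nonzero denominators):

```python
def precision_recall_f(tp, fp, fn):
    """Compute precision, recall, and F-score from raw counts
    (assumes at least one predicted and one gold-standard positive)."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f_score = (2 * precision * recall) / (precision + recall)
    return precision, recall, f_score

# e.g., 90 true positives, 10 false positives, 10 false negatives:
precision_recall_f(90, 10, 10)  # precision = recall = F-score = 0.9
```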
Encounter Labeling Performance

                  Machine-learning method            Rule-based method
                  Recall   Precision  F-score        Recall   Precision  F-score
Affirmed          0.675    0.754      0.712          0.739    0.899      0.811
Denied            0.946    0.905      0.925          0.988    0.932      0.959
Overall           0.896    0.881      0.889          0.938    0.927      0.933
Overall 99% CI                        (0.848-0.929)                      (0.900-0.964)

Conclusion: Machine-learning labeling does not significantly underperform rule-based labeling.
Performance of Framingham Diagnostic Criteria Extraction

                   Precision  Recall   F-score   99% Conf. Int. (F-score)
Overall (exact)    0.925      0.897    0.911     (0.891 - 0.929)
Overall (relaxed)  0.948      0.919    0.933     (0.916 - 0.950)
Affirmed           0.748      0.789    0.768     (0.711 - 0.824)
Denied             0.983      0.928    0.955     (0.938 - 0.970)

Note: Performance on affirmed criteria is worse, possibly because of their greater syntactic diversity. For example, we don't find:
– PleuralEffusion: blunting of the right costophrenic angle
– DOExertion: she felt like she couldn't get enough air in
Precision and Recall for Individual Criteria

Analysis of 1,492 extracted criteria: PredMED extractions vs. gold standard annotations.
[Table: confusion matrix of PredMED extractions (rows) against gold-standard annotations (columns) for each criterion and its negation (ANKED, APED, DOE, HEP, HJR, JVD, NC, PLE, PND, RALE, RC, S3G, TACH, WTL), with per-criterion false-positive and false-negative counts]
Discussion
• Challenges
– Data quality: EHR text data is messy. Over 10% of the errors (26/237) are caused by misspellings and bad sentence boundaries.
– Human anatomy: we need a better solution than word co-occurrence constraints.
– Syntactic diversity of affirmed criteria: we need deeper syntactic and semantic analysis.
– Contradictions and redundancy: an issue for downstream analysis.
• Opportunities
– We can apply similar techniques to other collections of criteria: NY Heart Association, European Society of Cardiology, MedicalCriteria.com.
– Many specific criteria extractors can be re-used in other settings.
– For downstream applications, see posters and presentations from our project at this conference.
Summary
• Extractors can identify affirmations and denials of Framingham HF criteria in EHR clinical notes with an overall F-Score of 0.91.
• Classifiers can label EHR encounters with the Framingham criteria they mention with an F-Score of 0.93.
• Information about HF criteria mentioned in EHR notes appears to be useful for downstream applications that seek to achieve early detection of HF.
Backup: Iterative Annotation Refinement

Iterative Annotation Refinement
• What are the problems solved?
– Annotations are required for training and evaluating criteria extractors.
– Human annotators without guidelines have high precision but lower recall.
– Domain experts' intuitions (about the language for expressing criteria) are initially imprecise.
• What is produced?
– An annotated dataset
– Annotation guidelines that are consistent
– Criteria extractors
The Development Process: Iterative Annotation Refinement

• Initialization: the expert and the linguist discuss the language of HF criteria, write initial guidelines, and build initial extractors.
• Iteration: annotate encounter texts with the current extractors, perform error analysis, then update the extractors and update the annotations and the guidelines.
• Results: expert annotations, annotation guidelines, and criteria extractors.
[Figure: user interface for the annotation tool, which was used to manage annotations during refinement]
Performance improvement during development
Iterative methods for creating annotations, guidelines, and extractors

• Iterative Annotation Refinement
– Extraction target: Framingham HF criteria
– Result of using the method: annotations, guidelines, extractor
– Sources of annotations compared in each iteration: expert and extractor
– Arbiter for disagreements at each iteration: expert
– Objective (and metric) for each iteration: improve extractor performance (F-score)
• Annotation Induction (Chapman et al., J Biomed Inform 2006)
– Extraction target: clinical conditions
– Result: guidelines (in the form of an annotation schema)
– Sources compared: expert and linguist
– Arbiter: consensus
– Objective: improve inter-annotator agreement (F-score)
• CDKRM (Coden et al., J Biomed Inform 2009)
– Extraction target: classes in the cancer disease model
– Result: annotations, guidelines
– Sources compared: 2 experts
– Arbiter: consensus
– Objective: improve inter-annotator agreement (agreement %)
• TALLAL (Carrell et al., GHRI-IT poster, 2010)
– Extraction target: PHI (protected health information) classes
– Result: annotations, extractor
– Sources compared: expert and extractor
– Arbiter: expert
– Objective: annotate full dataset (to the expert's satisfaction)