A Review of Natural Language Processing for Biosurveillance


Wendy W. Chapman, PhD

University of Pittsburgh

Dept of Biomedical Informatics

Biomedical Language Understanding

Current Surveillance

New strain of H5N1

Avian influenza

Cough / 咳嗽 (cough) → Respiratory

Recent travel

Exposure to others

Positive CXR


Patient + CXR Expos Travel Severe

Pat 1 X X X X

Pat 2

Pat 3 X X

Pat 4 X

[Figure: Respiratory Patients; series: cough, SOB]

Leverage More Data for Surveillance

Biosurveillance

Bioterrorist Threats

Detect attacks

Natural Disease Outbreaks

Detect outbreaks

Disaster Management

Understand situation

Much of the useful data is in textual format

Need natural language processing

Textual Data Sources

Non-clinical

Clinical

Internet Mapping of Outbreaks

• HealthMap

• Global Health Monitor

• Global Public Health Intelligence Network (GPHIN)—Public Health Agency of Canada

Textual clinical data

Text processor

Clinical Data for Biosurveillance

Pneumonia   Yes   History
Cough       Yes   3 days
Fever       Yes   3 days

What types of data are available?

How do we transform the data?

How well can we process the data?

Clinical Data for Biosurveillance

Trade-off

What types of data are available?

Chief Complaints

Content
• Patient's reason for seeking care
• 1-2 symptoms

“Cough/headache” “n/v/d” “Motor vehicle accident”

Registration, Physician Exam, Discharge, Home (X marks data availability)

Timeliness

Useful for early detection of larger outbreaks

Ambulatory & Inpatient Notes

Content

Clinical
• Symptoms
• Findings
• Medications
• Allergies
• Diagnoses
• Chronic conditions

Epidemiological
• Risk factors
• Travel history
• Homelessness
• Duration of illness
• Exposure to contacts

Timeliness

Useful for targeted case detection, disease surveillance, and situational awareness

Discharge Reports

Content

Clinical
• Reason for hospitalization
• Summary of care
• Findings
• Procedures performed
• Plan for follow-up

Death
• Cause
• Time of death

Timeliness

Most detailed but least timely; potentially useful for situational awareness

Text Processing

How do we transform the data?

Text processor

Classify (Chief Complaints)
"cough/sob" → Syndrome Category: Respiratory

Extract (Textual Notes)
"No past history of pneumonia; presents with two-day history of cough." → Clinical Conditions
• Pneumonia: historical, absent
• Cough: recent, present

Three Methods for Interpreting Text

• Keyword-based
  – NYC Syndromic Macros
  – If "cough*" or "wheez*" → Respiratory
• Symbolic
  – Semantics, syntax, discourse
  – "stomach cramp" is a type of abdominal pain
• Statistical
  – P(localized infiltrate | anatomic location = lower lobe, finding = hazy opacity) = 0.96
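The keyword-based method above can be sketched as a small pattern table. This is a minimal illustration, not the NYC Syndromic Macros themselves: only the "cough*"/"wheez*" rule comes from the slide, and the remaining patterns, the table name SYNDROME_PATTERNS, and the function name classify_chief_complaint are assumptions made for the example.

```python
import re

# Hypothetical keyword table. Only "cough*"/"wheez*" -> Respiratory is from
# the slide; every other pattern here is an illustrative assumption.
SYNDROME_PATTERNS = {
    "Respiratory": [r"\bcough\w*", r"\bwheez\w*", r"\bsob\b"],
    "Gastrointestinal": [r"\bn/v(/d)?\b", r"\bvomit\w*", r"\bdiarrhea\b"],
}


def classify_chief_complaint(text):
    """Return the sorted list of syndromes whose keywords match the text."""
    lowered = text.lower()
    return sorted(
        syndrome
        for syndrome, patterns in SYNDROME_PATTERNS.items()
        if any(re.search(pattern, lowered) for pattern in patterns)
    )


if __name__ == "__main__":
    for complaint in ["Cough/headache", "n/v/d", "Motor vehicle accident"]:
        print(complaint, "->", classify_chief_complaint(complaint))
```

A complaint that matches no pattern gets an empty list, which is how a keyword approach silently misses anything outside its vocabulary.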

Processing Chief Complaints: Challenges

• Synonyms
  – Short of breath → dyspnea
  – Coughing → cough
  – Coughs → cough
• Abbreviations
  – ha → headache
  – abd → abdominal
  – gx → ground transportation
• Acronyms
  – n/v → nausea/vomiting
  – sob → shortness of breath
• Truncations
  – diar → diarrhea
  – poss → possible
• Concatenations
  – blurredvision → blurred vision
  – flus sxs → flu symptoms
• Misspellings & typographic errors
  – nausa → nausea
  – diahrea → diarrhea

Substantial word variation
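One possible way to handle part of this variation is a lookup table for abbreviations plus approximate string matching for misspellings. The sketch below takes its expansions from the examples on this slide; the small VOCABULARY list, the 0.75 similarity cutoff, and the function names are assumptions for illustration, not a description of any deployed chief complaint preprocessor.

```python
import difflib
import re

# Expansions taken from the slide's examples.
EXPANSIONS = {
    "ha": "headache",
    "abd": "abdominal",
    "gx": "ground transportation",
    "n/v": "nausea/vomiting",
    "sob": "shortness of breath",
    "diar": "diarrhea",
    "poss": "possible",
}

# Small illustrative vocabulary used for spelling repair (an assumption).
VOCABULARY = ["nausea", "diarrhea", "cough", "headache", "vision", "blurred"]


def normalize_token(token):
    """Expand a known abbreviation or repair a near-miss spelling."""
    if token in EXPANSIONS:
        return EXPANSIONS[token]
    if token in VOCABULARY:
        return token
    # Closest vocabulary word above the similarity cutoff, else unchanged.
    close = difflib.get_close_matches(token, VOCABULARY, n=1, cutoff=0.75)
    return close[0] if close else token


def normalize_chief_complaint(text):
    """Lowercase, split on whitespace/commas/semicolons, normalize tokens."""
    tokens = re.split(r"[\s,;]+", text.lower().strip())
    return " ".join(normalize_token(token) for token in tokens if token)
```

The sketch deliberately does not split on "/", so "n/v" survives as one token; the cost is that a slash-joined complaint such as "cough/sob" is left unexpanded, and concatenations such as "blurredvision" would need a separate word-splitting step.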

Processing Notes: Challenges

Contain linguistically complex narrations
• Linguistic variation
• Polysemy
• Negation
• Contextual information
• Implication
• Coreference

Negation

Approximately half of all clinical concepts in dictated reports are negated

• Explicit absence: "The mediastinum is not widened"
  – Mediastinal widening: absent
• Implied absence: "Lungs are clear upon auscultation"
  – Rales/crackles: absent
  – Rhonchi: absent
  – Wheezing: absent
• Uncertainty
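Explicit negation is often handled with trigger phrases near the concept, in the spirit of NegEx-style approaches. The following is a simplified sketch under stated assumptions, not the NegEx implementation: the trigger lists are a small illustrative subset, the five-token WINDOW is an arbitrary choice, and is_negated is a hypothetical helper name.

```python
import re

# Illustrative trigger subsets chosen for this sketch (assumptions, not the
# published NegEx trigger lists).
PRE_TRIGGERS = ["no", "not", "denies", "without"]
POST_TRIGGERS = ["was ruled out", "is absent", "are absent"]
WINDOW = 5  # max tokens between a trigger and the concept (assumption)


def is_negated(sentence, concept):
    """Return True if `concept` occurs within WINDOW tokens of a trigger."""
    # Replace punctuation with spaces so triggers and concepts tokenize cleanly.
    text = re.sub(r"[^\w\s]", " ", sentence.lower())
    tokens = text.split()
    concept_tokens = concept.lower().split()
    n = len(concept_tokens)
    for i in range(len(tokens) - n + 1):
        if tokens[i:i + n] != concept_tokens:
            continue
        before = " " + " ".join(tokens[max(0, i - WINDOW):i]) + " "
        after = " " + " ".join(tokens[i + n:i + n + WINDOW]) + " "
        if any(f" {trigger} " in before for trigger in PRE_TRIGGERS):
            return True
        if any(f" {trigger} " in after for trigger in POST_TRIGGERS):
            return True
    return False
```

Implied absence ("Lungs are clear upon auscultation") is out of reach for this kind of matcher, since the absent findings are never mentioned in the sentence; that case needs domain knowledge rather than trigger phrases.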

Contextual Information

• Temporality
  – Three-day history of cough
  – Past history of pneumonia
• Finding Validation
  – She received her influenza vaccine
  – His temperature was taken in the ED
• Hypothetical conditions
  – He should return for fever
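The temporality and hypothetical examples above could be approximated with cue patterns, as in the sketch below. The cue lists, the label names, and the default of treating unmarked sentences as "recent" are all assumptions made for illustration; this is not a validated lexicon or a published algorithm.

```python
import re

# Illustrative cue patterns (assumptions for this sketch).
HYPOTHETICAL_CUES = [r"\bshould return for\b", r"\breturn if\b"]
RECENT_CUES = [r"\b\w+[- ]day history of\b"]
HISTORICAL_CUES = [r"\bpast history of\b", r"\bprevious\b"]


def tag_context(sentence):
    """Label a sentence 'hypothetical', 'recent', or 'historical'."""
    text = sentence.lower()
    if any(re.search(cue, text) for cue in HYPOTHETICAL_CUES):
        return "hypothetical"
    if any(re.search(cue, text) for cue in RECENT_CUES):
        return "recent"
    if any(re.search(cue, text) for cue in HISTORICAL_CUES):
        return "historical"
    # Assumption: a mention with no cue is treated as a current finding.
    return "recent"
```

Finding validation is not covered: "She received her influenza vaccine" mentions influenza without asserting the disease, and separating such mentions from true findings needs more than sentence-level cues.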

Chief Complaints

Identifying Syndromic Cases

Performance using this data

• Seven studies
  – One on pediatric population
  – Beitel, Chapman, Espino, Gesteland, Ivanov
• Reference standards
  – ICD-9 discharge diagnoses
  – Physician review of ED reports
• Eight syndromic definitions
  – Five febrile syndromic definitions

[Figure: Percent of Cases Identified (axis 0 to 100). Extracted values: 34, 77, 22, 74, 31, 72, 31, 60, 39, 75, 10, 30, 27, 46]

Febrile Syndromes

Syndrome only: 23%
Fever only: 19%
Neither: 53%
Both: 5%

Sensitivity 0% – 12%

Chapman and Dowling, J ISDS, 2007

Ambulatory Notes

Triage Notes
NC-Detect
– EMT-P + NegEx
– Performs well at identifying clinical conditions

ED Reports
Better case detection than chief complaints
– Topaz (Chapman)
– MCVS (Elkin)
– MedLEE (Friedman, South)

Inpatient Notes
Chest radiograph reports

Pneumonia: > 90% sens and spec
– SymText (Fiszman and Chapman)
– MedLEE (Friedman and Hripcsak)
– MCVS (Elkin)

Widened mediastinum
– IPS System (Chapman)

Tuberculosis (Hripcsak)
– MedLEE (Hripcsak)

Identifying Syndromic Cases from Textual Notes

CC vs full text record for Influenza-like Illness

Data source       Sensitivity   Positive Predictive Value
Chief Complaint   13%           47%
ED Notes          51%           37%
All notes         88%           23%

South et al.
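For reference, the two measures in this table follow from the counts of true positives, false negatives, and false positives. The sketch below shows the arithmetic; the counts in the usage example are invented for illustration and are not taken from South et al.

```python
def sensitivity(true_pos, false_neg):
    """Fraction of real cases that the classifier detected."""
    total = true_pos + false_neg
    return true_pos / total if total else 0.0


def positive_predictive_value(true_pos, false_pos):
    """Fraction of flagged patients that were real cases."""
    total = true_pos + false_pos
    return true_pos / total if total else 0.0


if __name__ == "__main__":
    # Hypothetical counts: 50 real cases, 40 detected, plus 60 false alarms.
    print(sensitivity(true_pos=40, false_neg=10))
    print(positive_predictive_value(true_pos=40, false_pos=60))
```

The table shows the usual trade-off between the two: moving from chief complaints to all notes finds more of the real cases but also flags more patients who are not cases.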

Identifying Epidemiological Factors from Clinical Notes

Gundlapalli et al.

Factor               Structured   MedLEE clinical notes
ETOH abuse           2.9          3.7
Drug abuse           4.1          29.1
Smoking              1.3          45
Homelessness         0            10.5
Illness duration     0            2.7
History of illness   0            22.3

Chief complaints
• Moderate performance at identifying syndromic cases
• Poor performance at identifying specific syndromes

Textual Notes
• Good performance at identifying syndromic cases
• Ability to identify specific conditions
• Ability to identify epidemiological factors

Where do we go from here?

Identifying cases
• Most work on chief complaints
• Current emphasis on reports
  – Need better algorithms and more research
• Temporality and other contextual information

Conveying information
• Little if any applied work on characterizing outbreaks and conveying information to public health

Conclusion

• Data in clinical texts are useful for biosurveillance
• Chief complaints most frequently used data source
  – Poor to moderate performance
• Clinical notes promise better performance
  – More complicated text
  – Timeliness dependent on institution
  – Early stages of development and evaluation
• Need to develop more applications applying NLP to characterization

Thank You

Wendy W. Chapman: [email protected]

Biomedical Language Understanding Lab

www.dbmi.pitt.edu/blulab

Chapter on NLP for Biosurveillance to appear in Infectious Disease Informatics and Biosurveillance: Research, Systems, and Case Studies
