Background: Syndromic surveillance refers to reporting and tracking of reportable and unusual diseases to public health officials. Conventional surveillance strategies are often manual, or depend on confirmatory laboratory testing after a disease diagnosis. These traditional strategies often result in relatively late detection of an outbreak or public health emergency. Strategies for reliably accelerating surveillance are under active research. The aim of our work is detection of specific syndromes in individual patient triage records in the hospital Emergency Department (ED). We focus on analysing the free text clinical notes written by a triage nurse during a brief pre-diagnostic assessment of a patient upon arrival in the ED. The system can detect patients that appear to have a disease of interest.

Methods: We work with a set of over 310,000 records collected in two Victorian EDs over a several-year period. Each patient triage record in our data includes (1) a free text note and (2) a diagnostic code from the International Classification of Disease (ICD-10) that was assigned after the fact. This data was used for training and testing of various classifiers, in a cross-validation scenario. We experimented with a range of different set-ups, including attempting direct prediction of ICD-10 codes for a given triage note, as well as prediction of “syndromes” defined by a specific set of ICD-10 codes. We also experimented with several different feature representations and machine learning models.

Results: In general, the performance of the models for syndromes was better than for direct ICD-10 category classification, suggesting that the syndrome definitions are clinically coherent. We observed substantial variation in performance across the various syndromes; several syndromes had too few examples in the dataset to build an effective classifier. The best performance on these tasks used a machine learning model that incorporates pre-processing of the texts to identify direct mentions of ICD-10 and SNOMED CT terms.

Conclusion: We have demonstrated that it is possible to build an effective syndrome detection tool for ED triage notes, where there is adequate and reliable training data available for a given syndrome of interest. We have shown that semantic abstraction of the text into “medical concept space” is of benefit for this task.
Syndromic Surveillance from Emergency Department triage notes
Karin M. Verspoor, The University of Melbourne
Antonio Jimeno Yepes, The University of Melbourne
Bahadorreza Ofoghi, The University of Melbourne
Geoffrey White, DSTO
26 September 2014 - MQClinicalNLP workshop
SynSurv
• SynSurv
  – Victorian Department of Health pilot syndromic surveillance program
  – Detection of outbreaks based on ICD-10 diagnostic codes and presenting complaints as captured in free text notes
Our focus: Extracting information from unstructured free text to enable “early warning” monitoring
Objectives of our project
• Exploration of the application of natural language processing techniques to triage notes for syndromic surveillance
  – To enable surveillance directly from notes; integration into natural workflow of ED
  – To support higher sensitivity and higher precision than keyword-based methods
Emergency Department triage notes
• Free text notes
  – written by triage nurse upon assessment in the Emergency Department
  – captures presenting symptoms and complaints of a patient
CENTRAL CHEST DISCOMFORT WHILE EATING, RADIATING TO ARMS. PPM INSERTED 2/52 AGO. PAIN FREE O/A. HR72, BP160
FEBRILE ILLNESS FLU LIKE SYMPTOMS NAUSEA
L BASAL GANGLIAN BLEED POST COLLAPSE, NON VERBAL, EYES SPON OPENED, HYPERTENSIVE, P 70REG, PEARL, PMX CEREBRAL BLEED
SynSurv data characteristics
• 918,330 records
• 730,054 records with ICD-10 diagnosis
• 456,213 records with note text
• 316,362 records with ICD-10 diagnosis and note text
Two sets of Experiments
• Given a free text note,
  – Predict the ICD-10 code(s) for the note
  – Predict a syndromic group, based on pre-defined sets of ICD-10 codes of interest
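The second task depends on a mapping from ICD-10 codes to syndromic groups. A minimal sketch of such a mapping follows. The group names and code prefixes are illustrative assumptions, not the SynSurv definitions, and `syndromes_for` is a hypothetical helper.

```python
# Hypothetical syndrome definitions: these prefixes are illustrative only
# and are NOT the SynSurv syndromic group definitions.
SYNDROME_PREFIXES = {
    "flu_like": ("J09", "J10", "J11"),
    "gastro": ("A08", "A09"),
}


def normalise(code):
    """Upper-case an ICD-10 code and drop the dot, e.g. 'j10.1' -> 'J101'."""
    return code.strip().upper().replace(".", "")


def syndromes_for(codes):
    """Return the set of syndromic groups matched by any of the given codes."""
    matched = set()
    for code in codes:
        norm = normalise(code)
        for group, prefixes in SYNDROME_PREFIXES.items():
            # str.startswith accepts a tuple of candidate prefixes.
            if norm.startswith(prefixes):
                matched.add(group)
    return matched
```

With labels derived this way, each record's ICD-10 diagnosis yields a yes/no target per syndromic group, which is what a per-syndrome classifier would be trained on.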
Machine learning for text analysis
[Diagram, training phase: Training set (notes + labels for classes of interest, e.g. ICD-10 codes) → Language processing, drawing on biomedical knowledge sources (UMLS: SNOMED CT, ICD) → Features (words, phrases, linguistic categories; names of entities; domain concepts; document features) → Machine learning algorithm → Model relating features of the text to classes of interest]
Machine learning for text analysis
[Diagram, classification phase: New notes to be classified → Language processing, drawing on biomedical knowledge sources (UMLS: SNOMED CT, ICD) → Features (words, phrases, linguistic categories; names of entities; domain concepts; document features) → Model → Predicted classification (label)]
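The training and classification phases can be illustrated with a very small bag-of-words classifier. This is only a sketch: the slides do not name the model used, so a multinomial Naive Bayes with add-one smoothing is swapped in here, and the training notes and labels are invented toy data.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(note):
    """Lower-case a note and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", note.lower())


def train(notes, labels):
    """Fit a multinomial Naive Bayes model (assumed model, not the authors')."""
    word_counts = defaultdict(Counter)  # label -> word -> count
    label_counts = Counter(labels)
    vocab = set()
    for note, label in zip(notes, labels):
        tokens = tokenize(note)
        word_counts[label].update(tokens)
        vocab.update(tokens)
    return word_counts, label_counts, vocab


def predict(model, note):
    """Return the label with the highest log-probability for a new note."""
    word_counts, label_counts, vocab = model
    total = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label, n in label_counts.items():
        score = math.log(n / total)
        denom = sum(word_counts[label].values()) + len(vocab)
        for tok in tokenize(note):
            if tok in vocab:  # words never seen in training are ignored
                score += math.log((word_counts[label][tok] + 1) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Toy training data: invented notes and labels, for illustration only.
notes = [
    "FEBRILE ILLNESS FLU LIKE SYMPTOMS NAUSEA",
    "FEVER COUGH FLU LIKE SYMPTOMS",
    "CENTRAL CHEST DISCOMFORT RADIATING TO ARMS",
    "CHEST PAIN SHORT OF BREATH",
]
labels = ["flu_like", "flu_like", "other", "other"]
model = train(notes, labels)
```

Calling `predict(model, "FLU SYMPTOMS AND FEVER")` then assigns a label to an unseen note. Richer features (phrases, entity names, domain concepts) would replace `tokenize` in a fuller system.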
Abstracting linguistic variation
• Terminology mapping tools generalise language variation
• e.g. UMLS Concept C0027497
  – nausea
  – nauseated
  – feels sick
  – feeling sick
  – queasy
  – felt sick
  – nauseous
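The effect of this kind of abstraction can be sketched with a hand-written lookup table built from the surface forms listed above. A real system would rely on a terminology mapping tool rather than a fixed dictionary; `to_concepts` is a hypothetical helper for illustration.

```python
import re

# Surface forms from the slide, all mapped to UMLS concept C0027497.
# The lookup table stands in for a real terminology mapping tool.
SURFACE_TO_CUI = {
    "nausea": "C0027497",
    "nauseated": "C0027497",
    "feels sick": "C0027497",
    "feeling sick": "C0027497",
    "queasy": "C0027497",
    "felt sick": "C0027497",
    "nauseous": "C0027497",
}

# Longest phrases first, so longer surface forms win over shorter ones.
_ALTERNATIVES = sorted(map(re.escape, SURFACE_TO_CUI), key=len, reverse=True)
_PATTERN = re.compile(r"\b(" + "|".join(_ALTERNATIVES) + r")\b", re.IGNORECASE)


def to_concepts(note):
    """Replace known surface forms in a note with their concept identifier."""
    return _PATTERN.sub(lambda m: SURFACE_TO_CUI[m.group(1).lower()], note)
```

After this step, notes that say "FEELS SICK" and notes that say "NAUSEATED" share the same feature, so a classifier no longer has to learn each variant separately.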
Predicting ICD-10 codes (Results)
• Direct term matching strategy outperformed by machine learning
  – Performance difference between micro-average and macro-average indicates that some ICD-10 codes are underrepresented in the data, and cannot be modeled well
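The gap between the two averages can be reproduced on a toy example. The sketch below assumes F1 as the underlying metric (the slide does not say which score was averaged) and uses invented labels in which a frequent code is predicted well and a rare code is always missed.

```python
from collections import Counter


def f1(tp, fp, fn):
    """F1 from raw counts; defined as 0.0 when there are no true positives."""
    return 0.0 if tp == 0 else 2 * tp / (2 * tp + fp + fn)


def micro_macro_f1(gold, pred):
    """Return (micro_f1, macro_f1) for single-label predictions."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, pred):
        if g == p:
            tp[g] += 1
        else:
            fp[p] += 1  # the predicted label gains a false positive
            fn[g] += 1  # the gold label gains a false negative
    labels = set(gold) | set(pred)
    # Micro-average: pool the counts over all labels, then score once.
    micro = f1(sum(tp.values()), sum(fp.values()), sum(fn.values()))
    # Macro-average: score each label separately, then take the plain mean.
    macro = sum(f1(tp[l], fp[l], fn[l]) for l in labels) / len(labels)
    return micro, macro


# Invented toy labels: the rare code T18 is never predicted.
gold = ["J11"] * 8 + ["T18"] * 2
pred = ["J11"] * 10
micro, macro = micro_macro_f1(gold, pred)
```

Here the micro-average is dominated by the frequent code, while the macro-average is pulled down by the rare code that scores zero, which is the pattern the slide describes.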
Predicting Syndromic Groups
• Task
  – Syndromic groups are defined by sets of ICD-10 codes, e.g. Flu like group
Predicting Syndromic Groups (Detailed Results)
Issues for low performance
• Inconsistency in ICD-10 annotation
  – ? FISH BONE IN THROAT J03
  – ? FISH BONE IN THROAT T18
  – ? FISH BONE IN THROAT T18
  – ? FISH BONE IN THROAT S10.9
  – ? FISH BONE IN THROAT J02.0
• Notes not related to the patient's visit
  – DIRECT ADMISSION FROM BAIRNSDALE TO 3S BED 25
• Typos in the notes text
  – ? FIH BONE IN THROAT
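One way to surface the first issue is to group records by note text and list the notes that received more than one distinct code. The sketch below uses the five note/code pairs from the slide; `inconsistent_notes` is a hypothetical helper, not part of the original project.

```python
from collections import Counter, defaultdict

# (note text, assigned ICD-10 code) pairs taken from the slide.
records = [
    ("? FISH BONE IN THROAT", "J03"),
    ("? FISH BONE IN THROAT", "T18"),
    ("? FISH BONE IN THROAT", "T18"),
    ("? FISH BONE IN THROAT", "S10.9"),
    ("? FISH BONE IN THROAT", "J02.0"),
]


def codes_by_note(records):
    """Group assigned codes by note text (case and whitespace normalised)."""
    codes = defaultdict(Counter)
    for note, code in records:
        key = " ".join(note.upper().split())
        codes[key][code] += 1
    return codes


def inconsistent_notes(records):
    """Return the notes that were assigned more than one distinct code."""
    return {n: c for n, c in codes_by_note(records).items() if len(c) > 1}


conflicts = inconsistent_notes(records)
```

Exact-text grouping only catches identical notes. Misspelled variants such as "FIH" would need approximate matching, which this sketch does not attempt.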
Integrating with DSTO’s BioSurv system
• Input to the DSTO BioSurv system
  – Trained machine learning models used as input to BioSurv (e.g., C2 algorithm)
  – Prediction probability > 0.5
[Diagram: Model → Predicted classification (label). Yes (flu-like illness): BioSurv count +1. No: no count added.]
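The counting step can be sketched as follows: each note whose predicted probability exceeds 0.5 adds one to that day's count. The BioSurv input format is not described in the slides, so the `(date, probability)` pairs and the `daily_counts` helper are assumptions for illustration.

```python
from collections import Counter

THRESHOLD = 0.5  # from the slide: prediction probability > 0.5


def daily_counts(predictions, threshold=THRESHOLD):
    """Count notes per day whose predicted probability exceeds the threshold."""
    counts = Counter()
    for day, prob in predictions:
        if prob > threshold:  # strictly greater, as stated on the slide
            counts[day] += 1
    return counts


# Invented (date, probability) pairs for one syndrome, for illustration only.
predictions = [
    ("2014-07-01", 0.91),
    ("2014-07-01", 0.35),
    ("2014-07-01", 0.62),
    ("2014-07-02", 0.50),
    ("2014-07-02", 0.77),
]
counts = daily_counts(predictions)
```

The resulting per-day series is the kind of input an outbreak detection algorithm would consume in place of counts derived from ICD-10 codes.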
Example: Flu like syndrome NLP notes annotation
• Records with no ICD-10 codes in the database are now available to BioSurv
• Only 730,054 of the 918,330 records have ICD-10 codes
C2 algorithm: ICD-10 vs NLP
• Earlier alert time using NLP methods
[Figure panels: ICD-10, NLP]
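For readers unfamiliar with this kind of detector, here is a minimal sketch, assuming that "C2" refers to the EARS C2 method: each day's count is compared with the mean and standard deviation of a 7-day baseline that ends 2 days earlier, and an alert is raised when the standardised difference exceeds 3. The slides do not describe BioSurv's implementation, so the parameters, the zero-variance handling, and the counts below are all assumptions.

```python
from statistics import mean, stdev


def c2_alerts(counts, baseline=7, lag=2, threshold=3.0):
    """Return indices of days flagged by a C2-style detector.

    Day t is compared with the mean and sample standard deviation of the
    `baseline` days that end `lag` days before it.
    """
    alerts = []
    for t in range(baseline + lag, len(counts)):
        window = counts[t - lag - baseline : t - lag]
        mu, sd = mean(window), stdev(window)
        if sd == 0:
            sd = 1.0  # assumption: avoid division by zero on flat baselines
        if (counts[t] - mu) / sd > threshold:
            alerts.append(t)
    return alerts


# Invented daily counts with a spike on day index 9.
counts = [3, 4, 2, 3, 5, 4, 3, 4, 3, 12, 4]
alerts = c2_alerts(counts)
```

Running the same detector on two daily series, one built from ICD-10 codes and one from NLP predictions, and comparing the first alert index in each is one way to compare alert times.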
Conclusions
• NLP methods can be used to support the BioSurv tool
• Machine learning methods perform better than dictionary-based methods
• Expansion of original syndromic groups improves machine learning performance
• Evaluation is a challenge
  – Noisy training data
  – What’s a “gold standard” alert?
Acknowledgements
• Victorian Department of Health (for SynSurv data)
• Defence Science and Technology Organisation (DSTO) (BioSurv system)
(funding and collaboration)
© Copyright The University of Melbourne 2011