These are the slides from my talk at the Department of Computer Science at Sheffield University. The talk covers broad ground from my experience of applying natural language processing to knowledge discovery from various media, including social media, news and the scientific literature.
Exploiting NLP for Digital Disease Informatics
Nigel Collier, Marie Curie Research Fellow
EMBL-EBI, Wellcome Trust Genome Campus, Cambridge, UK
NLP for Digital Disease Informatics?
Accurate and timely collection of facts from a range of text sources is crucial for supporting the work of experts in detecting and understanding highly complex diseases.
NLP support
Research case studies from
personal research
(1) Infectious disease alerting from news
(2) Phenotype entity extraction
(3) Tracking health rumours in Twitter
Typical workflow from text to knowledge
Input: raw text document
Stages: sentence segmentation, tokenization, lexical featurisation, entity recognition, trigger detection, relation extraction, event extraction, entity grounding, syntactic parsing
Output: knowledge objects
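The first stages of the workflow above can be sketched in a few lines of Python. This is a minimal illustration, not the pipeline used in the talk: the regular expressions, the function names and the three-word `DISEASE_LEXICON` are all assumptions made for the example, and real systems use trained models for each stage.

```python
import re

# Hypothetical toy lexicon; a real system would use a trained recogniser.
DISEASE_LEXICON = {"influenza", "cholera", "ebola"}


def segment_sentences(text):
    """Naive sentence segmentation: split after ., ! or ? followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def tokenize(sentence):
    """Split a sentence into word tokens and single punctuation tokens."""
    return re.findall(r"\w+|[^\w\s]", sentence)


def recognise_entities(tokens):
    """Dictionary lookup standing in for entity recognition."""
    return [(tok, "DISEASE") for tok in tokens if tok.lower() in DISEASE_LEXICON]


def pipeline(text):
    """Run segmentation, tokenization and entity lookup over raw text."""
    results = []
    for sentence in segment_sentences(text):
        tokens = tokenize(sentence)
        results.append({
            "sentence": sentence,
            "tokens": tokens,
            "entities": recognise_entities(tokens),
        })
    return results


if __name__ == "__main__":
    for item in pipeline("Cholera was reported in Iraq. Officials are monitoring."):
        print(item["sentence"], item["entities"])
```

Later stages (trigger detection, relation and event extraction, grounding) would consume the token and entity annotations produced here.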
Broad Research Objectives
• Extrinsic: Robust data collection from across health-related text types: literature, patient records, news, social media (public health alerts, developing disease profiles, etc.)
• Intrinsic: Understand how NLP/ML/Ontology techniques perform and can be improved in operational settings
INFECTIOUS DISEASE ALERTING
Infectious diseases spread rapidly
“We live in a world where threats to health arise from the speed and volume of air travel, the way we produce and trade food, the way we use and misuse antibiotics, and the way we manage the environment…” Dr. Margaret Chan, DG WHO
SARS, 2003: HK, world
H5N1 flu, 2003-: PRC, Thailand, ROC, Vietnam
Foot & mouth, 2001: United Kingdom
Ebola, 2014-: Guinea, Liberia, Sierra Leone, Nigeria
Source: World Health Organization, Timeline of Influenza A(H1N1), 2009, © WHO
Epidemic intelligence: fact and fiction
Trend graphs
Event summaries
Event alerts
Ontology browsing
Email/GeoRSS alerting
Watchboard, etc.
Real time Twitter analysis
Up to date news in 12 languages
Event database search
GHSI partners: US, UK, FR, DE, WHO, IT, JP, CA
Digital epidemic surveillance with BioCaster
Technical challenges
X0,000 news providers
REAL TIME SCALING 30,000-40,000 news items/day
900 on topic/day
200 events/day
4 alerts/day
Technical challenges
X0,000 news providers
MULTILINGUALITY
[Chart: Percentage of News by Language. Categories: English, Chinese, German, Russian, Korean, French, Vietnamese, Portuguese, Other]
Avian Flu
Influenza aviaire
鳥インフルエンザ
조류인플루엔자
โรคไข้หวัดนก
Cúm gia cầm
REAL TIME SCALING
Increased sensitivity and timeliness from multilingual news
News event counts for porcine foot-and-mouth outbreak in South Korea 2010-2011
Technical challenges
X0,000 news providers
MULTILINGUALITY
REAL TIME SCALING
AMBIGUITY
“Obama fever builds as Americans await a new era”
Equine influenza in Camden
Camden (UK) Camden (AU) Camden (CA) + 19 others
Entity identification
Toponym grounding
Tajoura Tajura Tajoora…
Variant transliterations
Coreference
“Two British holidaymakers fell ill…”
“Two male pensioners died…”
2 or 4 victims?
Temporal identification
“The Spanish flu outbreak…”
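The variant-transliteration problem above (Tajoura, Tajura, Tajoora) can be illustrated with plain string similarity. The sketch below uses `difflib.SequenceMatcher` from the Python standard library and groups names greedily; the `0.8` threshold and the function names are assumptions for the example, and this is not the grounding method used in BioCaster. String similarity alone cannot separate the Camden (UK) / Camden (AU) cases, which need contextual evidence.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Case-insensitive similarity ratio in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def cluster_variants(names, threshold=0.8):
    """Greedy clustering: attach each name to the first cluster whose
    representative (first member) is similar enough, else start a new one.
    The threshold is an illustrative assumption, not a tuned value."""
    clusters = []
    for name in names:
        for cluster in clusters:
            if similarity(name, cluster[0]) >= threshold:
                cluster.append(name)
                break
        else:
            clusters.append([name])
    return clusters


if __name__ == "__main__":
    print(cluster_variants(["Tajoura", "Tajura", "Tajoora", "Camden"]))
```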
Semantic pipeline
Looking for bursts of activity
Source: BioCaster
Outbreak characteristics: Early surge vs multi-modal transmission
News event frequency over time
Source: GENI-DB
[Figure: news event counts (axis 0 to 200) over time. Legend: Ct, μ, μ+3σ, Gold]
Alerts with the C2 test statistic: $S_t = \max\left(0, \frac{C_t - (\mu_t + 3\sigma_t)}{\sigma_t}\right)$
First English-language reports (MMWR + AP)
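As a rough illustration of the C2 test statistic on this slide, the sketch below computes S_t from a list of daily counts. The slide gives only the formula; the 7-day history window, the 2-day guard band between history and test day, and the `min_sigma` floor that avoids division by zero are assumptions of this example rather than the settings used in the study.

```python
from statistics import mean, stdev


def c2_statistic(counts, t, window=7, guard=2, k=3.0, min_sigma=0.5):
    """S_t = max(0, (C_t - (mu_t + k * sigma_t)) / sigma_t).

    mu_t and sigma_t come from `window` days of history ending `guard`
    days before day t. window, guard and min_sigma are assumed values.
    """
    end = t - guard
    start = end - window
    if start < 0:
        raise ValueError("not enough history before day t")
    history = counts[start:end]
    mu = mean(history)
    sigma = max(stdev(history), min_sigma)  # floor avoids division by zero
    return max(0.0, (counts[t] - (mu + k * sigma)) / sigma)


if __name__ == "__main__":
    # Made-up daily news counts with a spike on the last day.
    daily = [2, 3, 2, 4, 3, 2, 3, 3, 2, 3, 20]
    for day in (9, 10):
        print(day, round(c2_statistic(daily, day), 2))
```

A positive value means the day's count lies more than k standard deviations above the history mean.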
Understanding norms and their violations
5 detection algorithms
1. Early aberration reporting system (EARS) C2 algorithm
– captures the number of standard deviations by which the current count exceeds the history mean;
– $S_t = \max\left(0, \frac{C_t - (\mu_t + k\sigma_t)}{\sigma_t}\right)$
2. EARS C3 algorithm
– similar to C2 except that C3 uses a weighted sum of the previous 3 days for the current period;
3. W2 algorithm
– a modified version of C2 which ignores history counts on Saturdays and Sundays to compensate for day-of-week effects;
4. F statistic
– compares the variance in the history window to the variance in the current window;
– $S_t = \sigma_t^2 + \sigma_b^2$
5. Exponential Weighted Moving Average (EWMA)
– provides less weight to days in the history that are further from the test day.
– $S_t = \frac{Y_t - \mu_t}{\sigma_t \left(\lambda / (2 - \lambda)\right)^{1/2}}$, where $Y_1 = C_1$ and $Y_t = \lambda C_t + (1 - \lambda) Y_{t-1}$
Model parameters were estimated based on an additional 5 epidemic data sets from ProMED-mail (data not shown)
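The EWMA statistic in item 5 can be sketched directly from its definition. The value of λ, the 7-day history window, the 2-day guard band and the `min_sigma` floor below are assumptions for illustration; the slides say only that model parameters were estimated from additional ProMED-mail data sets.

```python
from math import sqrt
from statistics import mean, stdev


def ewma_smooth(counts, lam):
    """Y_1 = C_1 and Y_t = lam * C_t + (1 - lam) * Y_{t-1}."""
    smoothed = [float(counts[0])]
    for c in counts[1:]:
        smoothed.append(lam * c + (1 - lam) * smoothed[-1])
    return smoothed


def ewma_statistic(counts, t, lam=0.4, window=7, guard=2, min_sigma=0.5):
    """S_t = (Y_t - mu_t) / (sigma_t * sqrt(lam / (2 - lam))).

    lam, window, guard and min_sigma are assumed values for this sketch.
    """
    start = t - guard - window
    if start < 0:
        raise ValueError("not enough history before day t")
    history = counts[start:t - guard]
    mu = mean(history)
    sigma = max(stdev(history), min_sigma)  # floor avoids division by zero
    y_t = ewma_smooth(counts[: t + 1], lam)[-1]
    return (y_t - mu) / (sigma * sqrt(lam / (2 - lam)))


if __name__ == "__main__":
    # Made-up daily news counts with a spike on the last day.
    daily = [2, 3, 2, 4, 3, 2, 3, 3, 2, 3, 20]
    print(round(ewma_statistic(daily, 10), 2))
```

Because recent days dominate the smoothed value Y_t, a sustained rise pushes the statistic up faster than a single noisy day in the distant history would.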
[1] Burkom, H. S. (2005), “Accessible Alerting Algorithms for Biosurveillance”, National Syndromic Surveillance Conference.
[2] Jackson, M. L. et al. (2007), “A simulation study comparing aberration detection algorithms for syndromic surveillance”, BMC Medical Informatics and Decision Making, 7(6), DOI: 10.1186/1472-6947-7-6.
[3] Madoff, L. (2004), “ProMED-mail: An early warning system for emerging diseases”, Clin Infect Dis, 39(2): 227–232.
Creating a benchmark data set
#    Disease              Country       ProMED-alerts
1    Hand, foot, mouth    PR China      9
2    Ebola                Congo         17
3    Yellow fever         Brazil        28
4    Influenza            USA           21
5    Cholera              Iraq          5
6    Chikungunya          Singapore     8
7    Anthrax              USA           15
8    Yellow fever         Argentina     5
9    Ebola Reston         Philippines   15
10   Influenza            Egypt         49
11   Plague               USA           8
12   Dengue               Brazil        27
13   Dengue               Indonesia     14
14   Measles              UK            13
15   Chikungunya          Malaysia      15
16   Yellow fever         Senegal       0
17   Influenza            Indonesia     35
18   Influenza            Bangladesh    3
• 14 countries and 11 infectious disease types
• 366 days of news data were collected from BioCaster for each disease and country
• The study period is 17th June 2008 to 17th June 2009
Comparison of 5 aberration detection algorithms
                  C3            C2            W2            F-statistic   EWMA
Sensitivity       0.74          0.66          0.66          0.78          0.73
                  (0.69-0.78)   (0.61-0.72)   (0.60-0.71)   (0.74-0.82)   (0.68-0.78)
Specificity       0.96          0.98          0.98          0.92          0.95
                  (0.95-0.96)   (0.98-0.98)   (0.98-0.99)   (0.91-0.92)   (0.94-0.96)
PPV               0.55          0.64          0.65          0.46          0.47
                  (0.98-0.99)   (0.98-0.99)   (0.98-0.99)   (0.98-0.99)   (0.98-0.99)
NPV               0.98          0.98          0.98          0.98          0.98
                  (0.98-0.99)   (0.98-0.99)   (0.98-0.99)   (0.98-0.98)   (0.98-0.99)
Alarms/100 days   6.48          4.52          4.17          12.34         7.85
F-measure         0.63          0.65          0.66          0.58          0.58
Results in parentheses show 95% confidence intervals
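The F-measure row can be related to the PPV and sensitivity rows with a short calculation. The slides do not state the formula; the sketch below assumes the balanced F-measure, i.e. the harmonic mean of PPV (precision) and sensitivity (recall), and the dictionary simply copies the values from the table above.

```python
def f_measure(ppv, sensitivity):
    """Balanced F-measure: harmonic mean of PPV and sensitivity (assumed formula)."""
    return 2 * ppv * sensitivity / (ppv + sensitivity)


# (PPV, sensitivity) pairs copied from the comparison table.
REPORTED = {
    "C3": (0.55, 0.74),
    "C2": (0.64, 0.66),
    "W2": (0.65, 0.66),
    "F-statistic": (0.46, 0.78),
    "EWMA": (0.47, 0.73),
}

if __name__ == "__main__":
    for name, (ppv, sens) in REPORTED.items():
        print(name, round(f_measure(ppv, sens), 3))
```

Under this assumption the recomputed values agree with the reported F-measure row to within about 0.01, which is what one would expect when the inputs are themselves rounded to two decimals.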
[4] Collier, N. (2009), “What’s unusual in online disease outbreak news?”, in BMC Biomedical Semantics, 1(2).
G7+WHO: EAR Project
• (2006-2012) Global Health Security Initiative
– a unique initiative by G7+WHO+EC to bring together end-users, system providers and stakeholders to test the feasibility of open source public health intelligence systems.
[5] Barboza, P., Vaillant, L., Le Strat, Y., Hartley, D. M., Nelson, N. P., Mawudeku, A., Madoff, L. C., Linge, J. P., Collier, N., Brownstein, J. S. and Astagneau, P. (2014), “Factors Influencing Performance of Internet-Based Biosurveillance Systems Used in Epidemic Intelligence for Early Detection of Infectious Diseases Outbreaks”, PLoS One, 9(3):e90536.
[6] Barboza, P., Vaillant, L., Mawudeku, A., Nelson, N., Hartley, D., Madoff, L., Linge, J., Collier, N., Brownstein, J., Yangarber, R. and Astagneau, P. (2013), “Evaluation of epidemic intelligence systems integrated in the Early Alerting and Reporting project for the detection of A/H5N1 Influenza events”, PLoS One, 8(3):e57252.
Qualitative comparison of 7 EAR systems by experts
Major findings for A/H5N1:
- Detection rates for individual systems from 31% to 38%
- Rising to 72% for the combined system
- PPV ranged from 3% to 24%
- F1 ranged from 6% to 27%
- Sensitivity ranged from 38% to 72%
- Average improvement in alerting over WHO or OIE was 10.2 days
Lesson learnt … toponym resolution is key
Equine flu in Camden (UK)? vs Equine flu in Camden (Australia)?
Lesson learnt … limitations on coverage
Heat map showing lowest ranked countries by number of reports per ‘000 population gathered by BioCaster
PHENOTYPE NAMED ENTITY ANALYSIS
Small changes in genotypes can cause large changes in phenotypes
Image courtesy of Washington, Haendel, Mungall, Ashburner, Westfield and Lewis (2009), “Linking human diseases to animal models using ontology-based phenotype annotation”, PLoS Biology, 7(11):e1000247.
From personal terminology to community
concepts
“… patients were selected for FOXP2 screening only if
they fulfilled the following criteria: presence of
speech articulation problems diagnosed by a clinician …”
HPO: 0009088 Speech articulation difficulties
Image courtesy of Damian Smedley, Wellcome Trust Sanger Institute, Hinxton and Tudor Groza, University of Queensland, Brisbane
SVM learn-to-rank (pairwise)
Maximum entropy
Priority list heuristic
Creating a benchmark data set
• Data from OMIM cited autoimmune literature (112 abstracts, 472 phenotypes, 1611 gene/gene products).
F-scores computed using ablation on various domain ontologies
F-scores using 3 hypothesis resolution strategies
[7] Collier, N., Tran, M., Le, H., Ha, Q., Oellrich, A. and Rebholz-Schuhmann, D. (2013), “Learning to recognize phenotype candidates in the auto-immune literature using SVM re-ranking”, PLoS One, 8(10):e72965.
Lesson learnt … disjointness matters
Named entity
Event
Type
Implicit participants
Explicit participants
TypeRole
Simple, efficient algorithms, limited coverage, potentially idiosyncratic annotation
Lesson learnt … sampling matters
Resource         Size (records)
PubMed           23,765,575
GENIA            2,000
PennBioIE        1,414
FSU-PRGE         3,236
Arizona corpus   2,775 sentences
I2B2/VA 2010     826
Lesson learnt … domain adaptation matters
[8] Collier, N., Paster, F., Campus, H., & Tran, A. M. V. (2014), “The impact of near domain transfer on biomedical named entity recognition”, Proc. 5th International Workshop on Health Text Mining and Information Analysis (LOUHI) at the Conference of the European Chapter of the Association for Computational Linguistics (EACL), Gothenburg, Sweden, pp. 11-20.
TRACKING HEALTH RUMOURS IN TWITTER
Seasonal flu and influenza-like illness
Influenza-like Illness (ILI) = fever (> 100°F)* AND cough and/or sore throat (in the absence of a known cause other than influenza)
*Temperature can be measured in the office or at home
Epidemics of seasonal influenza result in about three to five million cases of severe illness and 250,000 to 500,000 deaths worldwide each year (WHO, 2009)
Case definition from CDC
Calculating the ILI rate is key for seasonal influenza surveillance
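The case definition on this slide is a simple Boolean rule, which the sketch below writes out. It follows the threshold exactly as printed on the slide (strictly above 100°F); the function and parameter names are assumptions for illustration.

```python
def is_ili(temp_f, cough, sore_throat, other_known_cause=False):
    """ILI per the slide: fever (> 100 F) AND (cough and/or sore throat),
    in the absence of a known cause other than influenza."""
    has_fever = temp_f > 100.0
    return bool(has_fever and (cough or sore_throat) and not other_known_cause)


if __name__ == "__main__":
    # Symptoms as described in one of the sample tweets: cough with a 100.9 fever.
    print(is_ili(100.9, cough=True, sore_throat=False))
```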
What do people talk about?
Types and tweet samples
• Influenza confirmation: I got flu n coughed a lot. Now my voice is like monster’s voice. Rrr
• Influenza symptoms: My day: flu-like symptoms (headache, body aches, cough, chills, 100.9 fever). Swine flu not ruled out. #H1N1
• Flu shots: I’m still getting flu shots, nothing is worth flu turning into bronchitis into pneumonia
• Self protection: Cover your mouth if coughing, use a tissue, wash your hands often & get a flu shot - protect and defend your community from #H1N1
• Medication: Wondering why I didn’t take the flu shot, laying in bed with cough drops, medicine, and the remote
Classification scheme
• Disease spread can be strongly influenced by behavioural changes [9]
• After surveying Twitter messages we conflated Jones and Salathe’s groupings into three plus two new categories:
– (A) Avoiding behaviour
• Avoid people who cough/sneeze, Avoid large gatherings of people, Avoid public transportation, Avoid travel to infected areas
– (I) Increased sanitation
• Wash hands more often, use disinfectant
– (W) Wearing a mask
– (P) Pharmaceutical intervention
• Seeking clinical advice or using medicines or vaccines to prevent disease
– (S) Self reported diagnosis
• User reports that they have the flu
[9] Jones, J. and Salathe, M. (2009), “Early assessment of anxiety and behavioral response to novel swine-origin influenza A(H1N1)”, PLoS One, 4(12):e8032.
[10] Collier, N. (2009), “OMG U got flu? Analysis of shared health messages for bio-surveillance”, in Proc. 4th Symposium on Semantic Mining in Biomedicine (SMBM’10).
Gold standard data
• 7412 tweets were selected that matched at least one of the keywords (flu, influenza, H1N1, H5N1, swine flu, pandemic, bird flu) balanced over the 5 classes
• Kappa for IAA was 0.86 on a sample of 2116 messages
                    A       I       P       W       S
Positive            251     37      499     32      741
Negative            632     43      974     230     1873
Total               883     80      1443    262     2614
Mean length         109.2   118.8   107.0   117.3   100.9
Sd. length          28.9    21.9    30.6    27.7    33.4
Mean length (+ve)   100.2   119.7   101.3   110.1   92.6
Mean length (-ve)   112.8   118.0   110.1   119.3   104.2
Message frequency in the training/testing corpus for self-protection classes
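The inter-annotator agreement figure quoted above (kappa of 0.86) is a chance-corrected agreement score. The slides say only "Kappa"; the sketch below assumes Cohen's kappa for two annotators, and the labels in the usage example are made up.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa: (observed - expected) / (1 - expected) agreement."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    # Agreement expected by chance from each annotator's label distribution.
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    return (observed - expected) / (1 - expected)


if __name__ == "__main__":
    # Made-up annotations for four messages.
    annotator_1 = ["pos", "pos", "neg", "neg"]
    annotator_2 = ["pos", "neg", "neg", "neg"]
    print(cohen_kappa(annotator_1, annotator_2))
```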
Naïve Bayes classification
Class   Features      P      R      F1
A       UNI           0.73   0.76   0.74
A       UNI+SRL       0.74   0.76   0.75
A       UNI+BI        0.73   0.77   0.75
A       UNI+BI+SRL    0.73   0.77   0.74
I       UNI           0.56   0.55   0.55
I       UNI+BI        0.49   0.49   0.49
P       UNI           0.74   0.76   0.75
P       UNI+SRL       0.75   0.78   0.76
P       UNI+BI        0.75   0.78   0.76
P       UNI+BI+SRL    0.76   0.79   0.77
W       UNI           0.59   0.68   0.63
W       UNI+SRL       0.63   0.76   0.69
W       UNI+BI        0.60   0.71   0.65
W       UNI+BI+SRL    0.60   0.71   0.65
S       UNI           0.70   0.73   0.71
S       UNI+SRL       0.74   0.77   0.75
S       UNI+BI        0.72   0.76   0.74
S       UNI+BI+SRL    0.74   0.77   0.75
F1 results for tweet classification using Naïve Bayes. UNI = unigram, BI = bigram, SRL = Simple Rule Language regular expression
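To make the UNI and UNI+BI feature sets concrete, here is a minimal multinomial Naïve Bayes classifier in pure Python. It is a sketch under stated assumptions, not the system evaluated above: whitespace tokenization, add-one smoothing, the class name `NaiveBayes` and the four training messages are all invented for the example, and the SRL regular-expression features are not modelled.

```python
import math
from collections import Counter, defaultdict


def features(text, use_bigrams=False):
    """Unigram features, optionally extended with adjacent-word bigrams."""
    tokens = text.lower().split()
    feats = list(tokens)
    if use_bigrams:
        feats += [a + "_" + b for a, b in zip(tokens, tokens[1:])]
    return feats


class NaiveBayes:
    """Multinomial Naive Bayes with add-alpha smoothing (illustrative only)."""

    def __init__(self, use_bigrams=False, alpha=1.0):
        self.use_bigrams = use_bigrams
        self.alpha = alpha

    def fit(self, texts, labels):
        self.class_counts = Counter(labels)
        self.feat_counts = defaultdict(Counter)
        for text, label in zip(texts, labels):
            self.feat_counts[label].update(features(text, self.use_bigrams))
        self.vocab = {f for c in self.feat_counts.values() for f in c}
        return self

    def predict(self, text):
        total = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, n in self.class_counts.items():
            counts = self.feat_counts[label]
            denom = sum(counts.values()) + self.alpha * len(self.vocab)
            score = math.log(n / total)  # log prior
            for f in features(text, self.use_bigrams):
                if f in self.vocab:  # unseen features are skipped
                    score += math.log((counts[f] + self.alpha) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Made-up training messages for a binary "self-reported flu" decision.
TRAIN = [
    ("i got flu and coughed a lot", True),
    ("i have the flu and a fever", True),
    ("flu shots available at the pharmacy", False),
    ("news report about swine flu pandemic", False),
]

if __name__ == "__main__":
    model = NaiveBayes().fit([t for t, _ in TRAIN], [y for _, y in TRAIN])
    print(model.predict("i got the flu"))
```

Passing `use_bigrams=True` switches the same classifier from the UNI to the UNI+BI feature set.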
Anxiety indicators have moderate-strong correlation with CDC A(H1N1) lab data
Category   Spearman’s Rho   P-value
A          0.66             0.020
S          0.66             0.021
I          0.58             0.048
P          0.67             0.017
A+I+P      0.68             0.008
A+I+P+S    0.67             0.017
[Figure: series CDC, A, S, I, P, A+I+P and A+I+P+S; x-axis ticks 46 to 52 and 1 to 5; y-axes 0 to 450 and 0 to 3000]
Data source: CDC (2009-2010 flu season)
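Spearman's rho, the statistic reported in the correlation table, is the Pearson correlation of the ranks of two series. The pure-Python sketch below shows the calculation with average ranks for ties; it does not compute p-values, and the two short series in the usage example are made up rather than taken from the Twitter or CDC data.

```python
from math import sqrt


def average_ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def spearman_rho(x, y):
    """Spearman's rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


if __name__ == "__main__":
    # Made-up weekly counts for a tweet category and a reference series.
    tweets = [12, 30, 25, 60, 41]
    reference = [100, 180, 210, 400, 390]
    print(round(spearman_rho(tweets, reference), 3))
```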
Frustratingly simple models work well
Classifying respiratory syndrome: Turning 225,000 Tweets into a high correlation influenza tracker
[11] Doan, S., Ohno-Machado, L. and Collier, N. (2012), "Enhancing Twitter data analysis with simple semantic filtering: example in tracking Influenza-Like Illnesses", in the 2nd IEEE Conference on Healthcare Informatics, Imaging and Systems Biology: Analyzing Big Data for Healthcare and Biomedical Sciences, California, USA, September 27-28.
Lessons learnt
– Metaphoric symptoms: Cabin fever setting in right now.
– Interrogative sentences: wonder how long u get off work with swine flu?
– Hypothetical sentences: I can ignore this sore throat no longer. And, um, maybe I should have gotten that H1N1 vaccine.
– Others: Too much lemonade. My throat is burning.
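The error categories listed above suggest the kind of simple rule that can flag misleading messages before they are counted. The patterns below are hypothetical and written only for this illustration; they are not the rules from the cited work, and they cover just the metaphoric, interrogative and hypothetical cases.

```python
import re

# Hypothetical patterns, chosen to match the example messages on the slides.
METAPHORIC = re.compile(r"\b(?:cabin|obama)\s+fever\b", re.IGNORECASE)
HYPOTHETICAL = re.compile(r"\b(?:maybe|should have|would have|if i)\b", re.IGNORECASE)


def flag_message(text):
    """Return an error-category label for a message, or None if no rule fires."""
    if METAPHORIC.search(text):
        return "metaphoric"
    if "?" in text:
        return "interrogative"
    if HYPOTHETICAL.search(text):
        return "hypothetical"
    return None


if __name__ == "__main__":
    samples = [
        "Cabin fever setting in right now.",
        "wonder how long u get off work with swine flu?",
        "I got flu n coughed a lot.",
    ]
    for message in samples:
        print(flag_message(message), "|", message)
```

Rules like these are cheap to apply, but each pattern needs checking against held-out data, since a message can mention a real symptom and still match a filter.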
Conclusions
• Epidemic intelligence is a highly skilled human task made easier by text/data mining from open sources. Internet-based systems are playing a key role in the detection of emerging diseases such as pandemic influenza.
• The broad trend in the use of social-networking sites by clinicians, patients and the public holds potential for harnessing the experience of the masses, both for disease detection and for understanding the patient experience.
• NLP holds tremendous promise for digital disease informatics but requires careful evaluation in collaboration with biomedical and healthcare professionals.
Special thanks
• Funding
– Phenominer: EC Marie Curie International Incoming Fellowship
– BioCaster: Japan Science and Technology Agency’s SAKIGAKE fund
• Postdoctoral students
– Son Doan, PhD. (now at UCSD), Mike Conway, PhD. (now at Utah University), Reiko Goodwin, PhD. (Fordham U.), Ai Kawazoe, PhD. (now at NII)
• Ph.D. students
– John McCrae, PhD. (now at Bielefeld U.), Hutchatai Chanlekha, PhD. (now at Kasetsart U.)
• Intern students
– Wita Ratsameetip (Chulalongkorn University, Thailand), Nguyen Trurong Son (Vietnam National University, Ho Chi Minh City, Vietnam), Nguyen Thi Ngoc Mai (Vietnam National University, Ho Chi Minh City, Vietnam), Aurelie Chabord (ENSIMAG-Grenoble INP, France), Therawat Tooumnauy (Kasetsart University, Thailand), Nam Xuan Cao (Vietnam National University, Ho Chi Minh City, Vietnam), Hoang Cong Duy Vu (Vietnam National University, Ho Chi Minh City, Vietnam), Nghiem Quoc Minh (Vietnam National University, Ho Chi Minh City, Vietnam), Van Chi Nam (Vietnam National University, Ho Chi Minh City, Vietnam), Nguyen Thi Hong Nhung (Vietnam National University, Ho Chi Minh City, Vietnam), Pham Thao Thi Xuan (Vietnam National University, Ho Chi Minh City, Vietnam), Ngo Quoc Hung (Vietnam National University, Ho Chi Minh City, Vietnam), Tran Tri Quoc (Vietnam National University, Ho Chi Minh City, Vietnam), Mai Vu Tran (Vietnam National University, Hanoi), Hoang Quynh Le (Vietnam National University, Hanoi)