1
Identifying Adverse Drug Reactions by Analyzing
Twitter Messages
Presented by - Parinda Rajapaksha
15th International Conference on Advances in ICT for Emerging Regions ICTer2015
Authors - Parinda Rajapaksha, Ruvwan Weerasinghe
2
“ The person who takes medicine must recover twice, once from the disease and once from the medicine ”
- William Osler, M.D.
3
• Introduction, Motivation & Related works
• Proposed solution, Research Question & limitations
• Design, Implementation & Evaluation
• Discussion
• Future works
ROAD MAP
4
• What is an Adverse Drug Reaction (ADR)?
- Harm associated with normal dosage during normal use
- Unintended, harmful reaction
- Examples: nausea, insomnia, hallucination, headache, depression
• Becoming a dire global problem
– Over 770 000 people are injured or die each year
– Prescription drugs have become the 4th leading medical cause of death in Canada and the US
INTRODUCTION
5
• Some regulatory bodies have begun programs
- Surveillance systems
- Reporting systems
- Clinical trials
• BUT
– Reporting systems are voluntary in most countries
– Spontaneous self-reports do not uncover all aspects of drug safety
– Clinical trials are very cumbersome
INTRODUCTION Traditional Solutions
6
• The recent explosion of social media platforms presents a
valuable information source
• People share personal medical experiences with each other
through online communities
MOTIVATION
7
• Gurulingappa et al. used MEDLINE case reports
- 5 000 drugs were extracted from nearly 3 000 case reports
- Ontology driven methodology
• Eiji et al. extracted clinical information from electronic health records
– 3 000 discharge summaries accumulated in one month at Tokyo hospitals
RELATED WORKS Medical Case Reports
8
• Robert et al. collected comments from health related web sites
- DailyStrength web site
- Limited to North America
- Demographics not considered
- Beneficial and adverse effects are unclear
• Brant et al. investigated ‘withdrawn’ and ‘watchlist’ drugs
– Yahoo! Groups
– The number of messages for each drug was not evenly distributed
– Did not have adequate data to support the analysis
RELATED WORKS Online Health Forums
9
• Jiang et al. analyzed textual and semantic features of Twitter
- 2 billion Tweets, 5 cancer drugs
- Used a topic modeling approach
- Performance was limited by data sparseness and a high level of noise
• Clark et al. extracted 7 million Tweets in digital drug safety surveillance research
– Data sample tends to be noisy
– Differences between internet speech and writing patterns and standardized clinical data
RELATED WORKS Twitter Related
11
• Analyze user experiences through Twitter messages
• Twitter as a micro blogging platform
• WHY Twitter?
- Public availability, update frequency and message volume
OUR PROPOSED SOLUTION
Statistic Brain -2014/01/01 (http://www.statisticbrain.com/twitter-statistics/)
Total number of active Twitter users – 645 million
Average number of tweets per day - 58 million
No. of tweets that happen in every second - 9,100
12
RESEARCH QUESTIONS
“How can drug related Tweets be identified by removing
noise from Twitter messages, and automatically classified
into adverse effects and other effects?”
13
• Limited to one pharmaceutical name
- The number of drugs in the world is very large and keeps growing
• Only works for Twitter messages written in English
– Language processing becomes very hard without knowledge of the other languages
SCOPE AND LIMITATION
14
DESIGN
1. Data Acquisition & Filtering
2. Tweet Preprocessing
3. Text Processing
4. Feature Extraction
5. Manual Annotation
6. Classification (Adverse Effects / Other Effects)
15
IMPLEMENTATION Data Acquisition
• Ethical concerns?
- In accordance with the Terms & Conditions of the Twitter API
- NOT from private accounts
• Xanax as the test case
- Used for panic disorders and anxiety disorders
16
• Why a filtering method?
- Capture more useful data while downloading
IMPLEMENTATION Data Filtering
[Diagram: Twitter Data Stream, filtered by Generic Name, Trade Names and Misspelled Drug Names (from the Misspelled Words Dictionary), yielding Filtered Drug Related Tweets]
17
• Assumption 1
- These are the only categories that can produce plausible misspellings
- "Xxnaa" for Xanax is NOT plausible; the name is hardly ever misspelled that way
IMPLEMENTATION Misspelled Word Dictionary
Reason for Misspelling   Examples
Skip Letters             Xnax, Xanx
Double Letters           Xaanax, Xannax
Reversed Letters         Xnaax, Xaanx
Missed Key               Xabax, Xahax, Xajax
Inserted Key             Xabnax, Xamnax
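The five misspelling categories listed above can be generated mechanically. The sketch below is a minimal illustration only, not the authors' tool (the slides mention an online keyword typo generator). The `NEIGHBOURS` keyboard map and the function name `misspellings` are assumptions introduced here.

```python
# Hypothetical QWERTY neighbour map; only the letters of "xanax" are covered.
NEIGHBOURS = {"n": "bhjm", "a": "qwsz", "x": "zsdc"}

def misspellings(name):
    """Generate variants for the five categories listed on the slide."""
    name = name.lower()
    variants = set()
    for i, ch in enumerate(name):
        variants.add(name[:i] + name[i + 1:])                 # skipped letter
        variants.add(name[:i] + ch + name[i:])                # doubled letter
        if i + 1 < len(name):                                 # reversed adjacent letters
            variants.add(name[:i] + name[i + 1] + ch + name[i + 2:])
        for near in NEIGHBOURS.get(ch, ""):
            variants.add(name[:i] + near + name[i + 1:])      # missed key (neighbour typed instead)
            variants.add(name[:i] + near + name[i:])          # inserted key (neighbour typed as well)
    variants.discard(name)                                    # the correct spelling is not a misspelling
    return variants

dictionary = misspellings("Xanax")
print(sorted(dictionary))
```

Under these assumptions every example in the table is produced, while an arbitrary scramble such as "xxnaa" is not, which matches Assumption 1.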
18
IMPLEMENTATION Data Collection
Generic Name       1 829 (3%)
Trade Names        51 467 (94%)
Misspelled Names   1 477 (3%)
1 477 Twitter messages with misspelled drug names were captured
54 774 messages within 7 weeks
(14 Aug 2014 - 1 Oct 2014 )
19
• Identifying Twitter specific noisy information
- Retweets (RT)
- User mentions (@)
- Hash tags (#)
IMPLEMENTATION Pre-processing
20
IMPLEMENTATION Pre-processing
doc said being of the xanax was giving my
heart major issues and causing problems that
weren't even there in the 1st place
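The removal of Twitter specific noise (retweet markers, user mentions, hash tags) could be sketched as follows. The slides do not say whether the whole hash tag or only the "#" symbol is dropped; this sketch assumes the latter. The function name `preprocess` and the sample tweet are hypothetical.

```python
import re

def preprocess(tweet):
    """Strip retweet markers and user mentions, and drop the '#' of hash tags."""
    text = re.sub(r"\bRT\b:?", " ", tweet)     # retweet marker
    text = re.sub(r"@\w+:?", " ", text)        # user mentions such as @name
    text = text.replace("#", "")               # assumption: keep the tag word, drop the symbol
    return re.sub(r"\s+", " ", text).strip()   # collapse leftover whitespace

print(preprocess("RT @user1: doc said #xanax was giving my heart major issues"))
```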
21
IMPLEMENTATION Text Processing
• Remove advertisements, news and forum posts
• Assumption 2
– The possibility of a legitimate Twitter message containing a link is considerably low
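Under Assumption 2, advertisements, news and forum posts can be approximated as "messages that contain a link". A minimal sketch, assuming a simple URL pattern (the pattern, the function names and the sample messages are illustrative, not taken from the slides):

```python
import re

# Assumed pattern: anything starting with http://, https:// or www.
URL_PATTERN = re.compile(r"https?://\S+|www\.\S+", re.IGNORECASE)

def has_link(tweet):
    """True when the message contains something that looks like a URL."""
    return URL_PATTERN.search(tweet) is not None

def drop_linked(tweets):
    """Keep only messages without links, treating linked ones as ads, news or forum posts."""
    return [t for t in tweets if not has_link(t)]

sample = ["Buy cheap xanax at http://example.com", "xanax makes me so sleepy"]
print(drop_linked(sample))
```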
22
IMPLEMENTATION Text Processing
• Replace slang words, emoticons and abbreviations
Slang Word   Intended Meaning
abt          about
w/o          without
smh          shaking my head
idk          I don’t know
n2g          not too good
lol          laugh out loud
…            …
Emoticon     Intended Meaning
:-D          Big grin
:(((         Sad
:’(          Crying
:-@          Screaming
O.o          Confused
B-)          Cool
…            …
Slang word dictionary (5 242 entries), Emoticons dictionary (80 entries)
23
IMPLEMENTATION Text Processing
being doing good on my diet giving up soda and
iced coffee I do not think so laugh out loud
i would need xanax smile
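The processed message above can be reproduced with a dictionary lookup per token. This is a minimal sketch, assuming tiny stand-in dictionaries (the slides report 5 242 slang entries and 80 emoticons) and assuming that contractions such as "don't" are handled by the same lookup. `SLANG`, `EMOTICONS` and `normalize` are names introduced here.

```python
# Tiny illustrative dictionaries; the real ones are far larger.
SLANG = {"abt": "about", "w/o": "without", "idk": "i do not know",
         "lol": "laugh out loud", "don't": "do not"}
EMOTICONS = {":)": "smile", ":(": "sad", "O.o": "confused", ":-D": "big grin"}

def normalize(text):
    """Replace emoticons and slang, lowercase, and strip edge punctuation."""
    tokens = []
    for token in text.split():
        if token in EMOTICONS:                  # match emoticons before punctuation is stripped
            tokens.append(EMOTICONS[token])
            continue
        word = token.lower().strip(".,!?;:\"")
        tokens.append(SLANG.get(word, word))
    return " ".join(t for t in tokens if t)

raw = ("Being doing good on my diet, giving up soda and iced coffee "
       "I don't think so lol I would need Xanax :)")
print(normalize(raw))
```

Emoticons are looked up first because stripping punctuation beforehand would destroy them.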
24
IMPLEMENTATION Text Processing
• Some Twitter messages are NOT related to the context
- “ I need some Xanax”
- “ Xanax is expensive but I'm worth it”
- “@yung i need to buy xanax but the site won't let me ship to
canada?”
25
IMPLEMENTATION Medical Terminologies
• Use of medical terminologies
- MedDRA (Medical Dictionary for Regulatory Activities)
- SIDER (Side Effect Resource)
- Collected 15 205 medical terms
• Data set reduced to 3 334 messages after checking 54 774 messages
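Keeping only the messages that mention at least one collected medical term could look like the sketch below. The three terms are illustrative placeholders, not entries taken from MedDRA or SIDER, and `build_matcher` and `keep_medical` are hypothetical helper names.

```python
import re

def build_matcher(terms):
    """Compile one whole-word pattern covering every term (longest first)."""
    ordered = sorted((t.lower() for t in terms), key=len, reverse=True)
    return re.compile(r"\b(?:" + "|".join(re.escape(t) for t in ordered) + r")\b")

def keep_medical(messages, terms):
    """Return the messages that mention at least one medical term."""
    matcher = build_matcher(terms)
    return [m for m in messages if matcher.search(m.lower())]

terms = {"insomnia", "headache", "suicidal thoughts"}   # placeholder term list
messages = [
    "I need some Xanax",
    "This xanax medicine causes suicidal thoughts when first taking it",
    "xanax gave me a headache",
]
print(keep_medical(messages, terms))
```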
26
IMPLEMENTATION Feature Extraction
• Used Bag-of-words model
- Only considers occurrence (presence or absence)
• Stop words NOT removed
- A Twitter message can include very few words
- Character limitation (<140)
• Stemming
– Ex: Takes, Taking → Take
– Used the Porter stemmer available in Python NLTK
– 4 572 features
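A presence/absence bag-of-words representation can be sketched in plain Python. Stemming is left as a pluggable `stem` argument that defaults to the identity function; NLTK's `PorterStemmer().stem` could be passed in its place. The function names and sample messages are assumptions made for illustration.

```python
def build_vocabulary(messages, stem=lambda w: w):
    """Collect the sorted set of (stemmed) words seen in all messages."""
    return sorted({stem(w) for m in messages for w in m.lower().split()})

def to_binary_vector(message, vocab, stem=lambda w: w):
    """1 if the vocabulary term occurs in the message, else 0 (counts are ignored)."""
    present = {stem(w) for w in message.lower().split()}
    return [1 if term in present else 0 for term in vocab]

corpus = ["xanax makes me sleepy", "i need xanax"]
vocab = build_vocabulary(corpus)
print(vocab)
print(to_binary_vector("xanax xanax need", vocab))
```

Because only presence is recorded, the repeated word in the last message still yields a single 1.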
27
IMPLEMENTATION Manual Annotation
• Condition 1:
– To label a message as Adverse, a person or a group of people involved with the drug must be present
• Condition 2:
– Beneficial effects, conditions or indications are labelled as Other
• Condition 3:
– The sentence should be affirmative (not interrogative, not subjunctive)
28
IMPLEMENTATION Annotated Messages
• Adverse Effects
- “ People take xanax so lightly like it's nothing. people get addicted and die from that shit. i blacked out driving while high on it once ”
- “This xanax medicine causes suicidal thoughts when first taking it. what the f*** ”
- “ I should stop taking all the drugs man they are obviously ruining your brain and making you bipolar. xanax is the main contributor ”
29
IMPLEMENTATION Annotated Messages
• Other Effects
- “ I am going to sleep now because this xanax got me feeling good ”
- “ I need a prescription to xanax or valium or anything that will help me chill out and sleep for once ”
- “ I am not sure if it is the xanax or lack of sleep but f*** i do not feel real ”
Beneficial Effect
Suspicious feeling
30
IMPLEMENTATION Data Distribution
• Data distribution highly unbalanced
• 93% of the data comes from the Other category
[Bar chart: No. of observations by class label. Adverse Effects: 221; Other Effects: 3 113]
31
IMPLEMENTATION Sampling
• Why Undersampling ??
- Reduce observations from Other category
- Amount of Adverse effects will not change
- It will not add synthetic observations to the Adverse effect class
[Figure: sizes of the A (Adverse) and O (Other) classes under Initial Behavior, Undersampling and Oversampling]
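Random undersampling as described above can be sketched in a few lines. The slides attribute sampling to the WEKA tool box, so this is a stand-in under stated assumptions: `undersample`, `label_of` and the fixed `seed` are names and choices introduced here.

```python
import random

def undersample(rows, label_of, seed=0):
    """Randomly shrink every class to the size of the smallest class."""
    by_class = {}
    for row in rows:
        by_class.setdefault(label_of(row), []).append(row)
    target = min(len(group) for group in by_class.values())
    rng = random.Random(seed)                        # fixed seed keeps the sketch reproducible
    balanced = []
    for group in by_class.values():
        balanced.extend(rng.sample(group, target))   # the minority class is kept whole
    rng.shuffle(balanced)
    return balanced

# Class sizes reported on the slides: 221 Adverse, 3 113 Other.
rows = [("A", i) for i in range(221)] + [("O", i) for i in range(3113)]
balanced = undersample(rows, label_of=lambda r: r[0])
print(len(balanced))
```

No synthetic observations are created; the Adverse class is untouched and only Other observations are discarded.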
32
IMPLEMENTATION Classification
• Used the Naïve Bayes algorithm
- A simple (naïve) algorithm reported in the literature as among the best for text classification
- Achieved the highest classifier performance in related works
• Compared with the Decision Tree algorithm
• 10-fold cross validation
• Used the WEKA tool box
– It supports data pre-processing, regression, classification, clustering
and data visualization
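To make the classifier concrete, here is a minimal Bernoulli Naïve Bayes over presence/absence features with Laplace smoothing. It is a sketch only: the slides used WEKA, whose implementation differs, and cross validation is omitted. The function names and the toy data are assumptions.

```python
import math

def train_bernoulli_nb(vectors, labels):
    """Estimate class priors and smoothed per-feature presence probabilities."""
    model = {}
    n_features = len(vectors[0])
    for cls in set(labels):
        rows = [v for v, y in zip(vectors, labels) if y == cls]
        prior = len(rows) / len(vectors)
        # Laplace smoothing keeps every probability strictly between 0 and 1.
        present = [(sum(r[j] for r in rows) + 1) / (len(rows) + 2)
                   for j in range(n_features)]
        model[cls] = (prior, present)
    return model

def predict(model, vector):
    """Return the class with the highest log posterior for a binary vector."""
    def log_posterior(cls):
        prior, present = model[cls]
        score = math.log(prior)
        for x, p in zip(vector, present):
            score += math.log(p if x else 1 - p)
        return score
    return max(model, key=log_posterior)

# Toy data: feature 0 marks Adverse messages, feature 1 marks Other messages.
model = train_bernoulli_nb([[1, 0], [1, 0], [0, 1], [0, 1]], ["A", "A", "O", "O"])
print(predict(model, [1, 0]), predict(model, [0, 1]))
```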
33
• The purpose is to identify as many Adverse effects as possible
EVALUATION Balanced vs. Unbalanced Data Set
Balanced Data Set (A-221, O-221)
                 Precision  Recall  F-Measure  AUC
Adverse Effect   0.67       0.71    0.69       0.71
Other            0.69       0.65    0.67       0.71
Unbalanced Data Set (A-221, O-3113)
                 Precision  Recall  F-Measure  AUC
Adverse Effect   0.18       0.19    0.18       0.71
Other            0.94       0.93    0.94       0.71
34
• Proposed solution (balanced data set) performs really well
EVALUATION Balanced vs. Unbalanced Data Set
                      Unbalanced  Balanced
No. of Observations   3 334       442
Training Time (sec)   20.8        1.6
Accuracy              89%         68%
Callouts: accuracy on the unbalanced set is contributed mostly from the Other category; the balanced set is a reduced data set
35
EVALUATION NB vs. DT
Naïve Bayes (68% accuracy); rows = true class, columns = classified as
       AE    O
AE     157   64
O      78    143
Decision Tree (62% accuracy); rows = true class, columns = classified as
       AE    O
AE     137   84
O      83    138
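The reported scores can be recomputed from the Naïve Bayes confusion matrix. The sketch reads each row as a true class, since each row sums to the 221 observations per class. The `metrics` helper and the dictionary layout are assumptions made for illustration.

```python
def metrics(matrix, positive):
    """Precision, recall, F-measure for one class, plus overall accuracy.

    matrix[true_class][predicted_class] holds the observation count.
    """
    tp = matrix[positive][positive]
    fn = sum(c for pred, c in matrix[positive].items() if pred != positive)
    fp = sum(row[positive] for true, row in matrix.items() if true != positive)
    total = sum(sum(row.values()) for row in matrix.values())
    correct = sum(matrix[c][c] for c in matrix)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure, correct / total

naive_bayes = {"AE": {"AE": 157, "O": 64}, "O": {"AE": 78, "O": 143}}
print([round(v, 2) for v in metrics(naive_bayes, "AE")])  # [0.67, 0.71, 0.69, 0.68]
```

The rounded values agree with the precision, recall, F-measure and accuracy reported for Naïve Bayes.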
36
EVALUATION NB vs. DT
Naïve Bayes
                 Precision  Recall  F-Measure  AUC
Adverse Effect   0.67       0.71    0.69       0.71
Other            0.69       0.65    0.67       0.71
Decision Tree
                 Precision  Recall  F-Measure  AUC
Adverse Effect   0.62       0.62    0.62       0.72
Other            0.62       0.62    0.62       0.72
• Proposed solution (Naïve Bayes) performs really well
37
EVALUATION ROC Curve
[ROC curve: TP Rate against FP Rate]
AUC = 0.7
Better than a random guess
38
EVALUATION Error Analysis
• Curse of dimensionality
– The performance of the classifier decreases when the dimensionality of
the problem becomes too large
4 572 features
39
CONCLUSION
• Proposed a method to identify ADRs from Twitter data
• The proposed filtering method captured 1 477 (3%) additional messages
• All the performance measurements lie around 70%; training time is 1.6 seconds
• The ‘curse of dimensionality’ reduced the performance of the classifier
• Results suggest the potential for extracting ADR related information from Twitter
40
FUTURE WORKS
1) Degree or level of ADR
– Effects can be categorized as high, normal or low
– Useful for prioritizing the effects in pharmacovigilance
EX:
– “ This medicine causes suicidal thoughts when first taking it ” - Extremely negative
– “ I'm stressed I can't even sleep after using this pills ” – High
– “ Its 2:20 a.m. and I am yawning and shaking my head vamplife ” – Less harm
41
FUTURE WORKS
2) Identifying the geographical distribution of drug users
– Twitter provides geo locations of Twitter messages
– Weather conditions and habits in each country may affect drugs and their effects
EX:
- “ My mom asks me to get beer while picking up a Xanax prescription.
So that looks good ” Beer + Xanax = ??
42
Thank You
NLP could save a Life!
43
PILOT STUDY
2014/04/08 8.30 AM - 11.30 AM
44
EVALUATION
                      Naïve Bayes (NB)  Decision Tree (DT)
No. of Observations   442               442
Training Time (sec)   1.6               14.7
Accuracy              68%               62%
45
SIGNIFICANCE OF THIS RESEARCH
• Filtering method
- Identified misspelled drug related messages
- Captured more useful data
• Removing advertisements, forum posts and news related Twitter messages
• Building a medical corpus to remove unwanted Twitter messages
- SIDER
- MedDRA
46
TOOLS & TECHNOLOGIES
• Data acquisition & filtering
- Twitter streaming API
- Tweepy library in Python
- Key word typo generator online tool
• Text processing
- Python NLTK (Natural Language Tool Kit)
- Porter stemmer
• Classification, sub sampling, ensemble learning and data visualization
- WEKA tool box
47
DATA DISTRIBUTION
25 289 (46%)
RT
29 485 (54%)
Pre-Processed Messages Retweets
Ads, News, Forum posts
2 113
23 176
Duplicate Messages
48
CLASSIFICATION PROCESS
49
TEXT PROCESSING
Raw Message: @utemim @goddess1207 @LadyZ_712 @misscolor63 @Cozyrosy1 There isn't enough Xanax to make me spend an hour w/ a room full of 5yr olds!
After Text Processing: there is not enough xanax to make me spend an hour with a room full of 5 year olds
Raw Message: Being doing good on my diet, giving up soda and iced coffee I don't think so lol I would need Xanax :)
After Text Processing: being doing good on my diet giving up soda and iced coffee i do not think so laugh out loud i would need xanax smile
Raw Message: @OMGImBoss i WANT MY XANAX BITCH :( AND im asking u with who you or they want papakush there?? O.o
After Text Processing: i want my xanax bitch sad and i am asking you with who you or they want papakush there confused
50
JUST ASSUME…
The Ebola virus has affected millions of people around the world
Doctors found a cure
Tested using monkeys