Towards Detecting Influenza Epidemics by Analyzing Twitter Messages
Aron Culotta
Jedsada Chartree
Introduction
• Growing interest in monitoring disease outbreaks.
• Growing number of Twitter users
  - February 2010: 50 million tweets/day
  - June 2010: 65 million tweets/day (750 tweets/s)
  - 190 million users
Source: http://en.wikipedia.org/wiki/Twitter
Introduction
• Twitter is a website that offers a social networking and micro-blogging service.
  - Users send and read messages called “tweets” (up to 140 characters).
Introduction
• Advantages of Twitter for this research
  - Full messages provide more information than a search query.
  - Twitter profiles contain more detail to analyze (city, state, gender, age).
  - Diversity of Twitter users.
Methodology
• Data
  - Collected 574,643 messages over 10 weeks (February 12, 2010 to April 24, 2010).
  - The US Centers for Disease Control and Prevention (CDC) publishes influenza-like illness (ILI) statistics through the US Outpatient Influenza-like Illness Surveillance Network (ILINet).
Methodology
Ground-truth ILI rates obtained from the CDC statistics
Methodology
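• Regression Models
  1. Simple linear regression

  $$\operatorname{logit}(P) = \beta_1 \operatorname{logit}(Q(W, D)) + \beta_2 + \varepsilon$$

  where
  - $P$ = the proportion of the population exhibiting ILI symptoms
  - $\beta_1, \beta_2$ = the coefficients
  - $\varepsilon$ = the error term
  - $Q(W, D) = \frac{|D_W|}{|D|}$ = the fraction of documents in $D$ that match $W$
  - $D$ = a document collection
  - $D_W$ = the documents in $D$ that match word $W$ ($|D_W|$ is the document frequency for $W$)
  - $\operatorname{logit}(x) = \ln\left(\frac{x}{1 - x}\right)$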
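The simple regression model above can be sketched in a few lines of plain Python. This is only an illustrative least-squares fit under stated assumptions, not the authors' implementation: the function names (`fit_simple`, `predict`, `inv_logit`) are hypothetical, and the weekly values in the usage part are made up.

```python
import math

def logit(x):
    """Log-odds ln(x / (1 - x)) for 0 < x < 1."""
    return math.log(x / (1.0 - x))

def inv_logit(z):
    """Inverse of logit: maps a real number back into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def fit_simple(q_values, p_values):
    """Ordinary least-squares fit of logit(P) = b1 * logit(Q) + b2.

    q_values: weekly fractions of documents matching the keyword set W.
    p_values: weekly ILI proportions (the ground truth).
    """
    xs = [logit(q) for q in q_values]
    ys = [logit(p) for p in p_values]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b1 = sxy / sxx
    b2 = mean_y - b1 * mean_x
    return b1, b2

def predict(q, b1, b2):
    """Predicted ILI proportion for a keyword fraction q."""
    return inv_logit(b1 * logit(q) + b2)

# Made-up weekly data, for illustration only.
weekly_q = [0.010, 0.012, 0.015, 0.011]
weekly_p = [0.020, 0.023, 0.028, 0.021]
b1, b2 = fit_simple(weekly_q, weekly_p)
estimate = predict(0.013, b1, b2)
```

Fitting on the logit scale keeps every prediction strictly between 0 and 1 once it is mapped back with `inv_logit`.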
Methodology
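• Regression Models
  2. Multiple linear regression

  $$\operatorname{logit}(P) = \beta_1 \operatorname{logit}(Q(\{W_1\}, D)) + \dots + \beta_k \operatorname{logit}(Q(\{W_k\}, D)) + \beta_{k+1} + \varepsilon$$

  where
  - $P$ = the proportion of the population exhibiting ILI symptoms
  - $\beta_1, \dots, \beta_{k+1}$ = the coefficients
  - $\varepsilon$ = the error term
  - $Q(\{W_i\}, D) = \frac{|D_{W_i}|}{|D|}$ = the fraction of documents in $D$ that match $W_i$
  - $D$ = a document collection
  - $D_{W_i}$ = the documents in $D$ that match word $W_i$ ($|D_{W_i}|$ is the document frequency for $W_i$)
  - $\operatorname{logit}(x) = \ln\left(\frac{x}{1 - x}\right)$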
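As a rough illustration of the multiple regression model, the sketch below fits one coefficient per keyword plus an intercept by solving the normal equations. It is a minimal pure-Python version under stated assumptions: the helper names (`solve`, `fit_multiple`) are hypothetical, and a numerical library would normally replace the hand-written solver.

```python
import math

def logit(x):
    """Log-odds ln(x / (1 - x)) for 0 < x < 1."""
    return math.log(x / (1.0 - x))

def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):  # back substitution
        s = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - s) / m[r][r]
    return x

def fit_multiple(q_rows, p_values):
    """Least-squares fit of logit(P) on one logit(Q({W_i}, D)) term per keyword.

    q_rows[t][i] is the fraction of week-t documents matching keyword W_i.
    Returns [b_1, ..., b_k, b_{k+1}], where b_{k+1} is the intercept.
    """
    design = [[logit(q) for q in row] + [1.0] for row in q_rows]
    target = [logit(p) for p in p_values]
    size = len(design[0])
    xtx = [[sum(r[i] * r[j] for r in design) for j in range(size)]
           for i in range(size)]
    xty = [sum(r[i] * y for r, y in zip(design, target)) for i in range(size)]
    return solve(xtx, xty)
```

The fit needs more weeks of data than coefficients, and the keyword columns must not be collinear, otherwise the normal equations have no unique solution.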
Methodology
• Keyword Selection
  1. Correlation Coefficient
     - Simple linear regression model evaluation
  2. Residual Sum of Squares (RSS)
     - Measures the discrepancy between the data and an estimation model

  $$\mathrm{RSS}(P, \hat{P}) = \sum_i (p_i - \hat{p}_i)^2$$
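Both selection criteria are simple to compute. The sketch below is a plain-Python illustration, assuming the correlation coefficient is Pearson's r between predicted and true ILI rates; the function names are hypothetical.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

def rss(actual, predicted):
    """Residual sum of squares: sum over i of (p_i - p_hat_i)^2."""
    return sum((p - p_hat) ** 2 for p, p_hat in zip(actual, predicted))
```

A higher correlation or a lower RSS indicates a keyword whose fitted model tracks the ground truth more closely, so candidate keywords can be ranked by either score.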
Methodology
• Keyword Generation
  1. Hand-chosen keywords (flu, cough, sore throat, headache)
  2. Most frequent keywords
     - Search for all documents containing any of the hand-chosen keywords.
     - Find the top 5,000 most frequently occurring words.
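The most-frequent-keyword step can be sketched as follows. This is an illustrative version only: it assumes lowercase substring matching against the seed keywords and whitespace tokenization, neither of which is specified in the slides, and `top_words` is a hypothetical name.

```python
from collections import Counter

def top_words(messages, seeds, n=5000):
    """Return the n most frequent words among messages containing any seed keyword."""
    seeds = [s.lower() for s in seeds]
    counts = Counter()
    for message in messages:
        text = message.lower()
        # Substring matching lets multi-word seeds such as "sore throat" match.
        if any(seed in text for seed in seeds):
            counts.update(text.split())
    return [word for word, _ in counts.most_common(n)]

# Made-up messages, for illustration only.
seeds = ["flu", "cough", "sore throat", "headache"]
messages = [
    "I have the flu and a cough",
    "flu season is here",
    "nice weather today",
]
candidates = top_words(messages, seeds)
```

Each returned word can then be scored as a candidate keyword with the correlation or RSS criteria.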
Methodology
• Document Filtering
  - Apply logistic regression to predict whether a Twitter message is reporting an ILI symptom.

  $$p(y_i = 1 \mid x_i; \theta) = \frac{1}{1 + e^{-x_i \cdot \theta}}$$

  - $y_i$ = a binary random variable (1 if document $D_i$ is positive, 0 otherwise)
  - $x_i = \{x_{ij}\}$, where $x_{ij}$ = number of times word $j$ appears in document $i$
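To make the filtering model concrete, here is a minimal sketch of the logistic probability together with a simple stochastic gradient ascent trainer. The training procedure, learning rate, and two-word vocabulary are assumptions chosen for illustration; the slides do not say how the parameters were actually estimated.

```python
import math

def predict_proba(x, theta):
    """p(y = 1 | x; theta) = 1 / (1 + exp(-x . theta))."""
    z = sum(xj * tj for xj, tj in zip(x, theta))
    return 1.0 / (1.0 + math.exp(-z))

def train(features, labels, lr=0.1, epochs=500):
    """Fit theta by stochastic gradient ascent on the log-likelihood."""
    theta = [0.0] * len(features[0])
    for _ in range(epochs):
        for x, y in zip(features, labels):
            error = y - predict_proba(x, theta)
            theta = [t + lr * error * xj for t, xj in zip(theta, x)]
    return theta

# Hypothetical vocabulary ["flu", "movie"]; each row holds the word counts
# for one message, and the label is 1 if the message reports an ILI symptom.
features = [[1, 0], [2, 0], [0, 1], [0, 2]]
labels = [1, 1, 0, 0]
theta = train(features, labels)
```

A message would then be kept as a positive when `predict_proba` exceeds a chosen threshold such as 0.5.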
Methodology
• Classification evaluation
  - Accuracy
  - Precision
  - Recall
  - F-measure

  $$F = \frac{2 \cdot \mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}$$
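The four evaluation measures can be computed from the counts of true and false positives and negatives. The sketch below assumes binary labels encoded as 0 and 1; the function name and the choice to return 0.0 when a denominator is zero are illustrative conventions, not taken from the slides.

```python
def classification_metrics(y_true, y_pred):
    """Accuracy, precision, recall, and F-measure for binary labels (0 or 1)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall:
        f_measure = 2 * precision * recall / (precision + recall)
    else:
        f_measure = 0.0
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f_measure": f_measure,
    }
```

Because the F-measure is the harmonic mean of precision and recall, it is high only when both are high.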
Results
• Document Filtering
Evaluation of message classification, with standard error in parentheses
Results
• Regression
The 10 different systems evaluated
Results
• Regression
The regression coefficient (r), residual sum of squares (RSS), and standard error of each system
Results
Results for multi-hand-rss(2)
Results for classification-hand
Results
Results for multi-freq-rss(3)
Results for simple-hand-rss(1)
Results
Correlation results for simple-hand-rss and multi-hand-rss
Correlation results for simple-hand-corr and multi-hand-corr
Results
Correlation results for simple-freq-rss and multi-freq-rss
Correlation results for simple-freq-corr and multi-freq-corr
Conclusion
• Explored several methods for identifying influenza-related messages.
• Compared a number of regression models to correlate the messages with CDC statistics.
• The best model achieves a correlation of 0.78.