Towards Detecting Influenza Epidemics by Analyzing Twitter Messages Aron Culotta Jedsada Chartree

Towards Detecting Influenza Epidemics by Analyzing Twitter Messages

Aron Culotta

Jedsada Chartree

Introduction

• Growing interest in monitoring disease outbreaks.
• Growing number of Twitter users
- February 2010: 50 million tweets/day
- June 2010: 65 million tweets/day (750 tweets/s)
- 190 million users

Source: http://en.wikipedia.org/wiki/Twitter

Introduction

• Twitter is a website that offers a social networking and micro-blogging service.
- Users send and read messages called “tweets” (up to 140 characters).

Introduction

• Advantages of Twitter for this research
- Full messages provide more information than search queries.
- Twitter profiles contain more detail to analyze (city, state, gender, age).
- Twitter users are diverse.

Methodology

• Data
- Collected 574,643 messages over 10 weeks (February 12, 2010 to April 24, 2010).
- The US Centers for Disease Control and Prevention (CDC) publishes ILI statistics from the US Outpatient Influenza-like Illness Surveillance Network (ILINet).

Methodology

Ground-truth ILI rates obtained from the CDC statistics

Methodology

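• Regression Models
1. Simple linear regression

$$\operatorname{logit}(P) = \beta_1 \operatorname{logit}(Q(W, D)) + \beta_2 + \epsilon$$

where
- $P$ = the proportion of the population exhibiting ILI symptoms
- $\beta_1, \beta_2$ = the coefficients
- $\epsilon$ = the error term
- $Q(W, D) = \frac{|D_W|}{|D|}$ = the fraction of documents in $D$ that match $W$
- $D$ = a document collection
- $D_W$ = the documents in $D$ that contain word $W$ (its size is the document frequency for $W$)
- $\operatorname{logit}(x) = \ln\left(\frac{x}{1 - x}\right)$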
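As a rough illustration of the simple model, the sketch below computes Q(W, D) by whitespace tokenization and fits the two coefficients by ordinary least squares in logit space. The function names, the tokenization, and the use of plain least squares are assumptions made for this example; the slides do not say how the fit was implemented.

```python
import math


def logit(x):
    """Log-odds of a proportion x in (0, 1)."""
    return math.log(x / (1.0 - x))


def inv_logit(z):
    """Inverse of logit: maps a real number back to a proportion."""
    return 1.0 / (1.0 + math.exp(-z))


def keyword_fraction(documents, keywords):
    """Q(W, D): fraction of documents containing at least one word in W.

    Assumption: documents are split on whitespace, so multi-word
    keywords such as "sore throat" would need different matching.
    """
    keywords = {k.lower() for k in keywords}
    matches = sum(1 for doc in documents if keywords & set(doc.lower().split()))
    return matches / len(documents)


def fit_simple(q_values, p_values):
    """Least-squares fit of logit(P) = b1 * logit(Q) + b2."""
    xs = [logit(q) for q in q_values]
    ys = [logit(p) for p in p_values]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b1 = sxy / sxx
    b2 = mean_y - b1 * mean_x
    return b1, b2


def predict(q, b1, b2):
    """Predicted ILI proportion for a keyword fraction q."""
    return inv_logit(b1 * logit(q) + b2)
```

Each element of `q_values` and `p_values` would correspond to one week: the keyword fraction among that week's messages and the CDC ILI rate for the same week.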

Methodology

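• Regression Models
2. Multiple linear regression

$$\operatorname{logit}(P) = \beta_1 \operatorname{logit}(Q(\{W_1\}, D)) + \dots + \beta_k \operatorname{logit}(Q(\{W_k\}, D)) + \beta_{k+1} + \epsilon$$

where
- $P$ = the proportion of the population exhibiting ILI symptoms
- $\beta_1, \dots, \beta_{k+1}$ = the coefficients
- $\epsilon$ = the error term
- $Q(\{W_i\}, D) = \frac{|D_{W_i}|}{|D|}$ = the fraction of documents in $D$ that match $W_i$
- $D$ = a document collection
- $D_{W_i}$ = the documents in $D$ that contain word $W_i$ (its size is the document frequency for $W_i$)
- $\operatorname{logit}(x) = \ln\left(\frac{x}{1 - x}\right)$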
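The multiple-regression variant can be sketched the same way, with one logit-transformed feature per keyword plus a constant column for the intercept. This example solves the normal equations with a small hand-written Gaussian elimination so that it needs only the standard library; that solver, the function names, and the data layout (`q_rows[t][i]` is the fraction for keyword i in week t) are assumptions of the sketch, not something stated in the slides.

```python
import math


def logit(x):
    """Log-odds of a proportion x in (0, 1)."""
    return math.log(x / (1.0 - x))


def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):  # back substitution
        s = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - s) / m[r][r]
    return x


def fit_multiple(q_rows, p_values):
    """Least-squares fit of logit(P) on logit(Q({W_i}, D)) for i = 1..k.

    Returns [b1, ..., bk, b_{k+1}], where the last entry is the intercept.
    """
    design = [[logit(q) for q in row] + [1.0] for row in q_rows]
    targets = [logit(p) for p in p_values]
    size = len(design[0])
    xtx = [[sum(r[i] * r[j] for r in design) for j in range(size)]
           for i in range(size)]
    xty = [sum(r[i] * y for r, y in zip(design, targets)) for i in range(size)]
    return solve_linear_system(xtx, xty)
```

With only 10 weekly observations, as in the data set described earlier, the number of keywords k has to stay small for the system to be well determined.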

Methodology

• Keyword Selection
1. Correlation coefficient
- Simple linear regression model evaluation
2. Residual sum of squares (RSS)
- Measures the discrepancy between the data and an estimation model:

$$\operatorname{RSS}(P, \hat{P}) = \sum_i (p_i - \hat{p}_i)^2$$
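Both selection criteria are short to compute. The sketch below shows RSS, Pearson's correlation coefficient, and a helper that orders keywords by either score. The helper `rank_keywords` and its `higher_is_better` flag are hypothetical names introduced for illustration.

```python
import math


def rss(observed, predicted):
    """Residual sum of squares between observed and predicted values."""
    return sum((p - p_hat) ** 2 for p, p_hat in zip(observed, predicted))


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


def rank_keywords(scores, higher_is_better):
    """Order keywords by score: descending for r, ascending for RSS."""
    return sorted(scores, key=scores.get, reverse=higher_is_better)
```

A keyword with a high correlation or a low RSS under its own simple regression model would be ranked first, so `higher_is_better` is `True` for correlation and `False` for RSS.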

Methodology

• Keyword Generation
1. Hand-chosen keywords (flu, cough, sore throat, headache)
2. Most frequent keywords
- Search for all documents containing any of the hand-chosen keywords.
- Find the top 5,000 most frequently occurring words.
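The frequency-based generation step might look like the following sketch, which keeps only messages mentioning a hand-chosen keyword and then counts words across them. Substring matching and whitespace tokenization are simplifying assumptions (for example, "flu" would also match "influence"), and `candidate_keywords` is a name made up for this example.

```python
from collections import Counter

# Hand-chosen seed keywords listed on the slide.
HAND_CHOSEN = ("flu", "cough", "sore throat", "headache")


def candidate_keywords(documents, seeds=HAND_CHOSEN, top_n=5000):
    """Return the top_n most frequent words among documents matching a seed."""
    counts = Counter()
    for doc in documents:
        text = doc.lower()
        # Assumption: a document matches if any seed occurs as a substring.
        if any(seed in text for seed in seeds):
            counts.update(text.split())
    return [word for word, _ in counts.most_common(top_n)]
```

The resulting candidates would then be scored with the correlation or RSS criteria from the keyword selection step.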

Methodology

• Document Filtering
- Apply logistic regression to predict whether a Twitter message is reporting an ILI symptom.

$$p(y_i = 1 \mid x_i; \theta) = \frac{1}{1 + e^{-x_i \cdot \theta}}$$

where
- $y_i$ = a binary random variable (1 if document $D_i$ is positive, 0 otherwise)
- $x_i = \{x_{ij}\}$, where $x_{ij}$ = the number of times word $j$ appears in document $i$
- $\theta$ = the vector of model parameters
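A minimal bag-of-words version of this classifier is sketched below. It trains the parameters with plain stochastic gradient ascent on the log-likelihood, which is an assumption: the slides give the model but not the training procedure. The vocabulary argument, the appended constant feature (acting as an intercept), and the hyperparameter values are likewise illustrative choices.

```python
import math
from collections import Counter


def featurize(text, vocabulary):
    """Word-count vector x_i over a fixed vocabulary, plus a constant 1.0.

    Assumption: the constant feature acts as an intercept; the slide's
    formula does not show one explicitly.
    """
    counts = Counter(text.lower().split())
    return [float(counts[word]) for word in vocabulary] + [1.0]


def predict_proba(x, theta):
    """p(y = 1 | x; theta) = 1 / (1 + exp(-x . theta))."""
    z = sum(xi * ti for xi, ti in zip(x, theta))
    return 1.0 / (1.0 + math.exp(-z))


def train(xs, ys, learning_rate=0.5, epochs=500):
    """Fit theta by stochastic gradient ascent on the log-likelihood."""
    theta = [0.0] * len(xs[0])
    for _ in range(epochs):
        for x, y in zip(xs, ys):
            error = y - predict_proba(x, theta)
            theta = [t + learning_rate * error * xi for t, xi in zip(theta, x)]
    return theta
```

Messages whose predicted probability falls below a chosen threshold (0.5 is a common default) would be dropped before computing the keyword fractions used in the regression models.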

Methodology

• Classification evaluation
- Accuracy
- Precision
- Recall
- F-measure

$$F = \frac{2 \cdot \mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}$$
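The four measures follow directly from the confusion-matrix counts, as in the sketch below. Returning 0.0 when a denominator is zero is a convention chosen for this example, and the function name is hypothetical.

```python
def classification_metrics(y_true, y_pred):
    """Accuracy, precision, recall, and F-measure for binary labels (0 or 1)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives

    accuracy = (tp + tn) / len(pairs)
    # Assumption: undefined ratios are reported as 0.0.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall:
        f_measure = 2 * precision * recall / (precision + recall)
    else:
        f_measure = 0.0
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f_measure": f_measure,
    }
```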

Results

• Document Filtering

Evaluation of message classification, with standard error in parentheses

Results

• Regression

The 10 different systems evaluated

Results

• Regression

The correlation coefficient (r), residual sum of squares (RSS), and standard error of each system

Results

Results for multi-hand-rss(2)
Results for classification-hand

Results

Results for multi-freq-rss(3)
Results for simple-hand-rss(1)

Results

Correlation results for simple-hand-rss and multi-hand-rss

Correlation results for simple-hand-corr and multi-hand-corr

Results

Correlation results for simple-freq-rss and multi-freq-rss

Correlation results for simple-freq-corr and multi-freq-corr

Conclusion

• Several methods were explored to identify influenza-related messages.
• A number of regression models were compared to correlate the messages with CDC statistics.
• The best model achieves a correlation of .78.