After the Boom No One Tweets: Microblog-based Influenza Detection Incorporating Indirect Information Shoko Wakamiya 1 , Yukiko Kawai 2 , Eiji Aramaki 1 1 Nara Institute of Science and Technology, Japan 2 Kyoto Sangyo University, Japan Oct. 18, 2016 Twitter

Page 1: After the Boom No One Tweets: Microblog-based Influenza Detection Incorporating Indirect Information

Shoko Wakamiya (1), Yukiko Kawai (2), Eiji Aramaki (1)
(1) Nara Institute of Science and Technology, Japan
(2) Kyoto Sangyo University, Japan

Oct. 18, 2016

Page 2

Exploiting Tweeting Users as Social Sensors [Sakaki 2010, Lee 2011, Aramaki 2011]

• Various real-world phenomena can be observed
  Ex) Disasters, local events, infectious diseases, etc.
• Social sensing is expected to outperform traditional means of medical reporting

Sakaki et al.: Earthquake Shakes Twitter Users. WWW (2010)
Lee, Wakamiya, Sumiya: Discovery of Unusual Regional Social Activities using Geo-tagged Microblogs. World Wide Web, Special Issue on Mobile Services on the Web (2011)
Aramaki et al.: Twitter Catches The Flu: Detecting Influenza Epidemics using Twitter. EMNLP (2011)

[Figure: a target event observed by physical sensors and by social sensors. Columns: physical sensor-based; social sensor-based (previous, proposed). Labels: direct information, indirect information.]

Page 3

Related Work on Twitter-based Influenza Surveillance

Study            Target (# of areas)   Data size (million tweets)
Aramaki [16]     Japan (1 area)        300
Achrekar [27]    US (10 areas)         1.9*
Culotta [28]     US (1 area)           0.5
Kanouch [29]     Japan (1 area)        300
DeQuincy [30]    Europe (1 area)       0.14
Doan [31]        US (1 area)           24*
Szomszor [32]    Europe (1 area)       3

• Many Twitter-based disease detection/prediction systems have been developed
• Most of these systems performed low-resolution (country-level) geographic analysis

Page 4

Problem (1): Imbalance of Social Sensor Distribution

• Most of the social sensors are in urban cities (Tokyo, Osaka, etc.)
• Other cities are affected by a shortage of data

[Figure: geographic distribution of influenza-related tweets in Japan; labeled locations: Sapporo (Hokkaido) and Tokyo.]

Page 5

Problem (2): Gap between Social Sensors and Patients

Relation between the numbers of influenza-related tweets and patients in each prefecture
• Except for a few high-population cities, most areas have fewer tweets
• Some such areas have numerous influenza patients

[Figure: # of patients and # of tweets per prefecture (area), with prefectures listed from TOKYO (AREA13), OSAKA (AREA27), and KANAGAWA (AREA14) through to SHIMANE (AREA32). Axes: # of patients, # of tweets; axis ticks run 0-4000 and 0-300000.]


Page 7

Exploiting Indirect Info.

Pro) Covering wider areas
Con)
• Unreliability (too noisy or too old)
  (1) "My grandma in Kyoto is in bed with flu"
  (2) "NEWS: classes in Osaka have been closed because of the flu"
  Annotations on the examples: "When?", "Already spread"
• Complex pattern

[Figure: a target event observed by physical sensors and by social sensors (existing vs. proposed). Labels: direct information, indirect information. Plotted quantities: the amount of tweets containing direct info., the amount of tweets containing indirect info., the amount of patients.]

Page 8

Our Goal & Approach

To estimate the number of patients in each area based on the relation between human motivation to tweet and information propagation

• h1) People prefer reporting new info. and are insensitive to already-propagated info.
• h2) The degree of propagation (popularity) is correlated with the amount of indirect info.

[Figure: the amount of tweets containing direct info., the amount of tweets containing indirect info., and the amount of patients.]

[Figure: social sensors (a) before epidemics and (b) after epidemics. Legend: positive, negative, trapped sensor. Labels: direct information, indirect information, trapped sensors.]

Page 9

Outline

• Background
• Goal and approach
• Construction of Twitter-based Influenza Surveillance System
• Experimental evaluation
• Discussion
• Conclusions

Page 10

Twitter-based Influenza Surveillance

1. NLP-based Classification: patient (positive) or not (negative)
2. Location Detection: direct info. or indirect info.
3. Data Aggregation: LINEAR model or TRAP model

[Figure: system pipeline from tweets to # of flu patients. Modules: NLP MODULE (P/N classifier; positive, negative), LOCATION DETECTION MODULE (GPS info., profile info., indirect info.; availability checks with "No" branches; trash), AGGREGATION MODULE (LINEAR model, TRAP model). Labels: direct information, indirect information.]

Page 11

1. NLP-based Classification

To judge whether a given tweet is written by a patient or not
• Building the training set
  A human annotator assigned one of two labels (positive/negative) to 1,000 influenza-related tweets
  Ex) "My mother got flu today" → positive
      "I got influenza shot today" → negative
• Classifying the test set
  • SVM-based classifier
  • Bag-of-words representation
  • Polynomial kernel (d = 2)
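The classifier settings listed on this slide (bag-of-words features, polynomial kernel with d = 2) can be illustrated with a minimal sketch. It only shows how the kernel value between two tweets would be computed; SVM training itself is not shown. The whitespace tokenizer and the kernel offset `coef0 = 1` are assumptions, not details given in the slides, and Japanese tweets would need a proper word segmenter.

```python
from collections import Counter


def bag_of_words(text):
    """Map a tweet to word counts (whitespace tokenization is an assumption)."""
    return Counter(text.lower().split())


def poly_kernel(x, y, degree=2, coef0=1.0):
    """Polynomial kernel (x . y + coef0) ** degree on sparse count vectors.

    degree=2 follows the slide; coef0 is a hypothetical choice.
    """
    dot = sum(count * y[word] for word, count in x.items())
    return (dot + coef0) ** degree


# The two labeled examples from the slide, plus a made-up query tweet.
positive = bag_of_words("My mother got flu today")
negative = bag_of_words("I got influenza shot today")
query = bag_of_words("My friend got flu")

sim_pos = poly_kernel(query, positive)  # shares "my", "got", "flu"
sim_neg = poly_kernel(query, negative)  # shares only "got"
```

With a degree-2 kernel, pairs of co-occurring words contribute to the similarity, which is one plausible reason for preferring it over a linear kernel on short texts.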

Page 12

2. Location Detection

To cover wider areas by extracting indirect info. as well as direct info.
• Direct info.
  • GPS info. (GPS)
  • Profile info. (PROF)
• Indirect info. (IND)
  Location names in tweets' contents, extracted using a list of prefecture names and famous landmarks
  Ex) "My friend in Osaka caught flu"

Percentage of tweets with direct/indirect info. (7,666,201 tweets):
GPS (0.5%), PROF (26.2%), IND (4.7%), the rest: no location info.
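The location detection step can be sketched as a simple fallback chain. This is an illustration under stated assumptions: the tweet fields (`gps_area`, `profile_area`, `text`), the tiny `GAZETTEER`, and the GPS, then PROF, then IND priority order are all hypothetical; the slides only state that a list of prefecture names and famous landmarks was used.

```python
# Hypothetical gazetteer: location name (lowercase) -> prefecture.
GAZETTEER = {
    "osaka": "Osaka",
    "kyoto": "Kyoto",
    "tokyo": "Tokyo",
    "tokyo tower": "Tokyo",  # example of a landmark entry
}


def detect_location(tweet):
    """Return (source, area); source is 'GPS', 'PROF', 'IND', or None.

    The priority order GPS > PROF > IND is an assumption of this sketch.
    """
    if tweet.get("gps_area"):
        return ("GPS", tweet["gps_area"])
    if tweet.get("profile_area"):
        return ("PROF", tweet["profile_area"])
    text = tweet.get("text", "").lower()
    for name, area in GAZETTEER.items():
        if name in text:  # naive substring match on the tweet body
            return ("IND", area)
    return (None, None)  # tweet carries no usable location info.


example = detect_location({"text": "My friend in Osaka caught flu"})
```

Substring matching is deliberately naive here; a real gazetteer lookup would have to handle ambiguous names and Japanese script.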

Page 13

3. Data Aggregation

To estimate the number of patients using different types of info.: direct info. and indirect info.

i) LINEAR model
A simple model that sums up direct info. and indirect info.

The number of patients I_linear(a, t) in area a at day t:

$$I_{\mathrm{linear}}(a, t) = w_{\mathrm{gps}} \cdot GPS(a, t) + w_{\mathrm{prof}} \cdot PROF(a, t) + w_{\mathrm{ind}} \sum_{b \in a} IND(a, b, t)$$

ii) TRAP model
A model based on the LINEAR model and the human nature to tweet

The number of patients I_trap(a, t) in area a at day t:

$$I_{\mathrm{trap}}(a, t) = \frac{I_{\mathrm{linear}}(a, t)}{w_{\mathrm{users}} \cdot N_a - w_{\mathrm{trap}} \cdot \log(pop(a, t) + 1)}, \qquad pop(a, t) = \sum_{c=1}^{t} IND(a, c)$$

Page 14

Concept of TRAP Model

1. People prefer a new event and are insensitive to an already propagated event
2. The degree of propagation (popularity) is correlated with the amount of indirect info.

[Figure: social sensors (a) before epidemics and (b) after epidemics. Legend: positive, negative, trapped sensor. Labels: direct information, indirect information.]

(a) Before epidemics: people actively report the flu
(b) After epidemics: most people lose interest in sharing direct info.

Page 15

3. Data Aggregation (terms of the TRAP model)

The number of patients I_trap(a, t) in area a at day t:

$$I_{\mathrm{trap}}(a, t) = \frac{I_{\mathrm{linear}}(a, t)}{w_{\mathrm{users}} \cdot N_a - w_{\mathrm{trap}} \cdot \log(pop(a, t) + 1)}, \qquad pop(a, t) = \sum_{c=1}^{t} IND(a, c)$$

• pop(a, t): the degree of information propagation in area a during t days
• w_trap · log(pop(a, t) + 1): the amount of trapped sensors
• w_users · N_a: the amount of social sensors in area a
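A minimal sketch of the two aggregation models may help make the formulas concrete. The function names, the natural logarithm, and the guard against a non-positive denominator are assumptions of this sketch; the default weights (unit weights for LINEAR, w_users = 0.05 and w_trap = 0.2 for TRAP) are the values used in the experiments slide. The counts in the example are made up.

```python
import math


def linear_estimate(gps, prof, ind_counts, w_gps=1.0, w_prof=1.0, w_ind=1.0):
    """I_linear(a, t): weighted sum of direct and indirect tweet counts.

    ind_counts holds the IND(a, b, t) values for every b in the sum.
    """
    return w_gps * gps + w_prof * prof + w_ind * sum(ind_counts)


def popularity(ind_history):
    """pop(a, t): indirect mentions of area a accumulated over days 1..t."""
    return sum(ind_history)


def trap_estimate(i_linear, n_sensors, pop, w_users=0.05, w_trap=0.2):
    """I_trap(a, t): I_linear divided by the sensors that are not trapped."""
    active = w_users * n_sensors - w_trap * math.log(pop + 1)
    if active <= 0:
        # Not covered by the slides: this sketch simply refuses to divide.
        raise ValueError("no active sensors left for this area")
    return i_linear / active


# Illustrative numbers only.
linear = linear_estimate(gps=2, prof=30, ind_counts=[5, 3])
quiet = trap_estimate(linear, n_sensors=1000, pop=popularity([0, 0, 0]))
boom = trap_estimate(linear, n_sensors=1000, pop=popularity([40, 35, 24]))
```

For the same tweet volume, the estimate is larger once the topic has already propagated (`boom` exceeds `quiet`), which is the correction the TRAP model is meant to provide.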

Page 16

Experimental Datasets

• Tweet data
  • A collection of tweets containing the keyword "I-N-FU-RU"
• Gold standard data
  • The number of patients per week for every prefecture (47 areas)
  • The data is available from the Infectious Disease Surveillance Center (IDSC)

Dataset      Duration                 # of tweets (Size)
ALL          2012/08/02-2016/01/03    7,666,201 (2.275 GB)
SEASON2012   2012/11/01-2013/05/31    1,959,610 (729.4 MB)
SEASON2013   2013/11/01-2014/05/31    501,542 (143.7 MB)*
SEASON2014   2014/11/01-2015/05/31    2,736,685 (808.2 MB)

A sample of the weekly report from IDSC: http://www.nih.go.jp/niid/ja/diseases/a/flu.html

Page 17

Experiments

• Methods
  BASELINE, BASELINE+PROF, LINEAR, TRAP
• Evaluation metric
  Pearson correlation coefficient (high: |r| > 0.7, medium: 0.4 < |r| ≤ 0.7, low: |r| ≤ 0.4)

                Method                          NLP  GPS  PROF  IND
TRAP            TRAP+NLP                        ✓    ✓    ✓     ✓
                TRAP                            -    ✓    ✓     ✓
LINEAR          LINEAR+NLP                      ✓    ✓    ✓     ✓
                LINEAR                          -    ✓    ✓     ✓
BASELINE+PROF   BASELINE+PROF+NLP (EMNLP2011)   ✓    ✓    ✓     -
                BASELINE+PROF                   -    ✓    ✓     -
BASELINE        BASELINE+NLP                    ✓    ✓    -     -
                BASELINE                        -    ✓    -     -

$$I_{\mathrm{base}}(a, t) = GPS(a, t)$$

$$I_{\mathrm{base+prof}}(a, t) = GPS(a, t) + PROF(a, t)$$

$$I_{\mathrm{linear}}(a, t) = GPS(a, t) + PROF(a, t) + \sum_{b \in a} IND(a, b, t)$$

$$I_{\mathrm{trap}}(a, t) = \frac{I_{\mathrm{linear}}(a, t)}{0.05 \cdot N_a - 0.2 \cdot \log(pop(a, t) + 1)}$$
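The evaluation metric can be sketched in a few lines. `pearson_r` is a plain implementation of the Pearson correlation coefficient and `strength` applies the high/medium/low bands quoted on this slide; both function names and the sample series are illustrative, not taken from the original system.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length series."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    dev_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    dev_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    if dev_x == 0 or dev_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / (dev_x * dev_y)


def strength(r):
    """Bands from the slide: high |r| > 0.7, medium 0.4 < |r| <= 0.7, low otherwise."""
    magnitude = abs(r)
    if magnitude > 0.7:
        return "high"
    if magnitude > 0.4:
        return "medium"
    return "low"


# Made-up weekly series: model estimates vs. reported patients for one area.
estimates = [12.0, 30.0, 55.0, 41.0, 18.0]
patients = [100, 260, 500, 390, 150]
band = strength(pearson_r(estimates, patients))
```

Under these bands a coefficient of exactly 0.70 counts as medium, since the high band requires |r| strictly greater than 0.7.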

Page 18

Results (1/3)

Contribution of NLP-based Classification
• TRAP+NLP (r = 0.70) is higher than TRAP (r = 0.64)
• NLP classification in this domain (flu) is not hard

(a) With NLP-based classification

Target                           Method         SEASON2012  SEASON2013  SEASON2014  SEASON TOTAL
All areas                        TRAP+NLP       0.76        0.70        0.69        0.70
                                 LINEAR+NLP     0.70        0.55        0.53        0.50
                                 EMNLP2011      0.74        0.68        0.67        0.69
                                 BASELINE+NLP   0.33        0.37        0.48        0.36
High population areas (Top 10)   TRAP+NLP       0.80        0.77        0.72        0.75
                                 LINEAR+NLP     0.78        0.65        0.64        0.64
                                 EMNLP2011      0.80        0.77        0.71        0.75
                                 BASELINE+NLP   0.55        0.60        0.63        0.53
Low population areas (Top 10)    TRAP+NLP       0.75        0.66        0.71        0.69
                                 LINEAR+NLP     0.62        0.46        0.48        0.43
                                 EMNLP2011      0.70        0.61        0.65        0.64
                                 BASELINE+NLP   0.21        0.26        0.35        0.25

(b) Without NLP-based classification

Target                           Method          SEASON2012  SEASON2013  SEASON2014  SEASON TOTAL
All areas                        TRAP            0.72        0.63        0.64        0.64
                                 LINEAR          0.65        0.48        0.53        0.48
                                 BASELINE+PROF   0.69        0.59        0.66        0.64
                                 BASELINE        0.29        0.34        0.48        0.35
High population areas (Top 10)   TRAP            0.75        0.69        0.70        0.70
                                 LINEAR          0.72        0.60        0.63        0.61
                                 BASELINE+PROF   0.75        0.69        0.70        0.70
                                 BASELINE        0.44        0.56        0.63        0.50
Low population areas (Top 10)    TRAP            0.71        0.61        0.53        0.57
                                 LINEAR          0.58        0.41        0.46        0.40
                                 BASELINE+PROF   0.65        0.52        0.65        0.59
                                 BASELINE        0.20        0.23        0.35        0.25

Page 19

Results (2/3)

Contribution of Indirect Info. in LINEAR Model
• LINEAR+NLP (r = 0.50) is lower than BASELINE+PROF+NLP (r = 0.69)
• It is difficult to detect influenza epidemics by adding indirect info. in a naïve manner

Results table: (a) With NLP-based classification (same table as on Page 18)

Page 20

Results (3/3)

Contribution of Indirect Info. in TRAP Model
• TRAP+NLP achieved the best performance (r = 0.70)
• The TRAP model effectively contributes to the exploitation of both direct and indirect info.

Results table: (a) With NLP-based classification (same table as on Page 18)

Page 21

Discussion: Relation between Volume of Tweets and Performance (1/2)

High population areas
• TRAP+NLP was higher than EMNLP2011
• Top 17 high population areas exhibited high correlation (r > 0.7)

[Figure: # of tweets (axis 0-4000) and correlation coefficient (axis 0.6-0.8) of TRAP+NLP and EMNLP2011 per prefecture (AREAs); highlighted: TOKYO (AREA13), OSAKA (AREA27).]

Page 22

Discussion: Relation between Volume of Tweets and Performance (2/2)

Other areas
• There is a large variance in performance
• TRAP+NLP mostly outperforms EMNLP2011

[Figure: # of tweets (axis 0-4000) and correlation coefficient (axis 0.6-0.8) of TRAP+NLP and EMNLP2011 per prefecture (AREAs); highlighted: FUKUI (AREA20), AOMORI (AREA2).]

Page 23

Discussion: After the Boom No One Tweets

• The TRAP model outperformed the LINEAR model
  If influenza becomes a hot topic, people do not talk about it
• Similar phenomena have so far been proposed from a psychological viewpoint
  Most studies showed rapid propagation of rumors (especially bad news) and their short life
• This study attempts to handle human nature using a statistical model
  This model has sufficient room for application to additional studies

Page 24

Conclusions

• Twitter-based influenza surveillance
  • Utilized indirect info. that mentions other places to cover wider areas
  • Developed the TRAP model based on information propagation and people's motivation to tweet
• Future work
  • To examine worldwide influenza surveillance
  • To establish a novel method by integrating various models for more accurate prediction
  • To consider various effects related to geographic relations among areas