A Descriptive Analysis of Twitter Activity Around Boston Terror Attacks
Álvaro Cuesta, David F. Barrero, María D. R-Moreno
Computer Engineering Department, Universidad de Alcalá, Spain
ICCCI 2013, Craiova, Romania, September 11, 2013



Page 2: Presentacion

Summary

1 Introduction
  - Motivation
  - Objectives
  - Case studies
2 Framework
  - Framework overview
  - Framework messaging
  - Framework components
3 Sentiment analysis
  - Overview
  - Classifier
4 Case studies
  - Boston Terror Attack
  - Political analysis
5 Conclusions and future work

Page 3: Presentacion

Introduction
Motivation

- Great expansion of social networks in recent years
- One of the most successful ones is Twitter
  - Microblogging platform
  - Short messages known as tweets
  - Open nature
- Twitter offers great research opportunities
  - Open nature
  - Distributed human sensor network
  - Easy data extraction, difficult data processing
- Twitter + sentiment analysis
  - Lack of tools for sentiment analysis in Spanish

Page 4: Presentacion

Introduction
Objectives

Twitter offers an excellent API; however, some infrastructure is still needed (mainly storage and reporting).

Objectives
1 Develop a framework for Twitter data extraction and analysis
2 Provide reporting tools
3 Lay a foundation for sentiment analysis in Spanish

Page 5: Presentacion

Introduction
Case studies

To assess the framework, we have included two case studies:

- Event driven: Boston terror attack
- Regular usage: political activity on Twitter in Spanish

Page 6: Presentacion

Framework architecture
Overview

- Requirements
  - Easy to use, extensible, massive data processing
- Design decisions
  - Modular design: collection of independent scripts
  - Focus on open data formats
  - Built around the database: MongoDB
- Set of independent scripts interchanging data

Page 7: Presentacion

Framework architecture
Framework messaging

[Figure: framework messaging diagram; its content did not survive text extraction]

Page 8: Presentacion

Framework architecture
Framework components: Miner

- Miner
  - Extracts and stores tweets
  - Stream API
  - Several filters
  - Written in Python
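
The two filtering modes used later in the case studies (a string filter and a user filter) can be illustrated offline. This is a minimal sketch, not the framework's code: it assumes tweets arrive as dictionaries shaped like Twitter API JSON (`text`, `user.screen_name`, `entities.user_mentions`), uses a plain substring test as a simplification of real stream filtering, and all function names are hypothetical.

```python
def matches_track(tweet, phrase):
    """String filter: case-insensitive substring match on the tweet text."""
    return phrase.lower() in tweet.get("text", "").lower()


def matches_accounts(tweet, accounts):
    """User filter: true if a tracked account wrote the tweet or is mentioned in it."""
    wanted = {name.lstrip("@").lower() for name in accounts}
    author = tweet.get("user", {}).get("screen_name", "").lower()
    mentions = {
        m["screen_name"].lower()
        for m in tweet.get("entities", {}).get("user_mentions", [])
    }
    return author in wanted or bool(mentions & wanted)


def mine(stream, predicate, store):
    """Append every tweet accepted by predicate to store; return how many were kept."""
    kept = 0
    for tweet in stream:
        if predicate(tweet):
            store.append(tweet)
            kept += 1
    return kept


# Invented sample tweets, for illustration only.
sample = [
    {"text": "Explosiones en la Maratón de Boston",
     "user": {"screen_name": "someone"},
     "entities": {"user_mentions": []}},
    {"text": "Hola @PSOE",
     "user": {"screen_name": "other"},
     "entities": {"user_mentions": [{"screen_name": "PSOE"}]}},
]

boston_store, politics_store = [], []
boston_kept = mine(sample, lambda t: matches_track(t, "maratón de boston"), boston_store)
politics_kept = mine(sample, lambda t: matches_accounts(t, ["@PSOE", "@UPyD"]), politics_store)
```

In a real deployment the `store` list would be replaced by the database component, and `stream` by an iterator over the Stream API.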

Page 9: Presentacion

Framework architecture
Framework components: Database

- Database
  - Storage for further processing
  - MongoDB
  - NoSQL database
  - High performance

Page 10: Presentacion

Framework architecture
Framework components: Reporting

- Reporting
  - CSV export for further processing
  - R processing
  - Extensibility
  - Powerful libraries
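
As a rough illustration of the CSV export step, the sketch below flattens tweet documents into rows using only the Python standard library. The column selection (`FIELDS`) and the function names are assumptions made for this example; the slides do not say which fields the framework exports.

```python
import csv
import io

# Hypothetical column choice; the actual export format is not given in the slides.
FIELDS = ["id", "created_at", "user", "is_retweet", "text"]


def tweet_to_row(tweet):
    """Flatten one tweet document (Twitter-API-style dict) into a flat CSV row."""
    return {
        "id": tweet.get("id_str", ""),
        "created_at": tweet.get("created_at", ""),
        "user": tweet.get("user", {}).get("screen_name", ""),
        # Retweets carry the original tweet under "retweeted_status".
        "is_retweet": int("retweeted_status" in tweet),
        "text": tweet.get("text", "").replace("\n", " "),
    }


def export_csv(tweets, handle):
    """Write a header line plus one row per tweet to an open text handle."""
    writer = csv.DictWriter(handle, fieldnames=FIELDS)
    writer.writeheader()
    for tweet in tweets:
        writer.writerow(tweet_to_row(tweet))


# Invented sample documents, for illustration only.
sample = [
    {"id_str": "1", "created_at": "Tue Apr 16 00:50:00 +0000 2013",
     "user": {"screen_name": "alice"}, "text": "hola\nmundo"},
    {"id_str": "2", "created_at": "Tue Apr 16 01:05:00 +0000 2013",
     "user": {"screen_name": "bob"}, "text": "RT hola mundo",
     "retweeted_status": {"id_str": "1"}},
]

buffer = io.StringIO()
export_csv(sample, buffer)
```

A file written this way can then be loaded on the R side with `read.csv`.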

Page 11: Presentacion

Framework architecture
Framework components: Sentiment analysis

- Sentiment analysis
  - Supervised learning
  - Need for labeling
  - Tools for labeling
  - Classifier building
  - Classifier testing

Page 12: Presentacion

Sentiment analysis
Overview

- Supervised learning with the Natural Language Toolkit (NLTK)
  - Three classes: "positive", "negative" and "neutral"
- A labeled corpus is needed
  - Several exist in English ...
  - ... none in Spanish
- Thousands of manually classified tweets are needed
  - Collaborative labeling
  - Web application to label tweets

Page 13: Presentacion

Sentiment analysis
Classifier

- Naïve Bayes classifier
  - Stop words removed
  - Some parameters to set
  - The optimal parameter setting depends on the dataset
- The classifier needs to be evaluated
  - Tester
  - Cross-validation
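
To make the classifier idea concrete, here is a dependency-free sketch of a multinomial Naïve Bayes text classifier with add-one smoothing and stop-word removal. It is a stand-in written for illustration, not NLTK's implementation and not the authors' code; the stop-word list is a tiny hypothetical sample and the training sentences are invented.

```python
import math
from collections import Counter, defaultdict

# Tiny illustrative stop-word list (an assumption, not the list used by the authors).
STOP_WORDS = {"el", "la", "de", "que", "y", "es", "un", "una"}


def tokenize(text):
    """Lowercase, split on whitespace and drop stop words."""
    return [w for w in text.lower().split() if w not in STOP_WORDS]


class NaiveBayes:
    def fit(self, texts, labels):
        """Count documents per class and word occurrences per class."""
        self.class_counts = Counter(labels)
        self.word_counts = defaultdict(Counter)
        for text, label in zip(texts, labels):
            self.word_counts[label].update(tokenize(text))
        self.vocab = {w for counts in self.word_counts.values() for w in counts}
        return self

    def predict(self, text):
        """Return the class with the highest log posterior (add-one smoothing)."""
        total_docs = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, n_docs in self.class_counts.items():
            counts = self.word_counts[label]
            denom = sum(counts.values()) + len(self.vocab)
            score = math.log(n_docs / total_docs)
            for word in tokenize(text):
                if word in self.vocab:  # words never seen in training are ignored
                    score += math.log((counts[word] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Invented toy corpus, for illustration only.
train_texts = ["me encanta la propuesta", "gran trabajo del equipo",
               "es una vergüenza", "qué desastre de gestión"]
train_labels = ["positive", "positive", "negative", "negative"]
model = NaiveBayes().fit(train_texts, train_labels)
```

A corpus this small only shows the mechanics; the slides stress that a usable classifier needs thousands of labeled tweets and a proper evaluation.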

Page 14: Presentacion

Case study
Boston Terror Attack

- Main objective
  - Evaluate the platform
- Secondary objective
  - Describe activity around an event
  - Stream by string filter
- The event
  - Terror attack on 15 Apr 2013 14:49 (GMT-4) in Boston
  - Internet witch-hunt motivated by the release of some photos
  - Shooting and manhunt
- Data acquisition
  - Begin: Tue, 16 Apr 2013 00:43 (GMT)
  - End: Tue, 23 Apr 2013 00:43 (GMT)
  - Filter: "Maratón de Boston" (Boston Marathon in Spanish)

Page 15: Presentacion

Case study
Boston Terror Attack: Dataset description

                Value        Relative   Average
Tweets          28,892                  1.16/user
Non-retweets    16,029       55.48%
Retweets        12,863       44.52%
Geolocalized    255          0.88%
Users           24,989
Mentions        18,937       65.54%
Replies         849          2.94%
Non-replies     18,088       62.61%
Size            96.39 MB                3.38 KB/tweet
Index size      0.91 MB
Disk            132.99 MB
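
The "Relative" and "Average" columns follow directly from the raw counts. As a quick check, the sketch below recomputes several of them from the values in the table; the helper name `share` is hypothetical.

```python
def share(part, whole):
    """Percentage of `whole` represented by `part`, rounded to two decimals."""
    return round(100.0 * part / whole, 2)


# Raw counts copied from the Boston dataset table.
tweets, users = 28892, 24989
counts = {
    "non_retweets": 16029,
    "retweets": 12863,
    "geolocalized": 255,
    "mentions": 18937,
    "replies": 849,
}

relative = {name: share(value, tweets) for name, value in counts.items()}
tweets_per_user = round(tweets / users, 2)

# Retweets and non-retweets partition the dataset.
partition_ok = counts["non_retweets"] + counts["retweets"] == tweets
```

The same arithmetic applies to any other dataset summarized with these columns.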

Page 16: Presentacion

Case study
Boston Terror Attack: Activity

[Figure: three time-series panels covering Apr 17 to Apr 23, titled "Tweets", "Tweets (excluding RTs)" and "Retweets"; x-axis: Time; y-axes: Tweets, Non-retweets, Retweets]

Legend: dashed line: bombing; dotted line: photo release; solid line: shooting; gray background: manhunt.
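
Activity curves like these come from counting tweets per time interval, once for all tweets and once each for non-retweets and retweets. The slides do not state the bin width, so the sketch below assumes hourly bins; it also assumes the `created_at` string format of Twitter API JSON, and the function names and sample timestamps are invented for illustration.

```python
from collections import Counter
from datetime import datetime

# Timestamp layout used in Twitter API JSON, e.g. "Tue Apr 16 00:50:00 +0000 2013".
TWITTER_TIME = "%a %b %d %H:%M:%S %z %Y"


def hourly_counts(tweets):
    """Return a time-ordered list of (hour, number of tweets in that hour)."""
    counts = Counter()
    for tweet in tweets:
        when = datetime.strptime(tweet["created_at"], TWITTER_TIME)
        counts[when.replace(minute=0, second=0, microsecond=0)] += 1
    return sorted(counts.items())


def split_retweets(tweets):
    """Separate original tweets from retweets (marked by "retweeted_status")."""
    originals = [t for t in tweets if "retweeted_status" not in t]
    retweets = [t for t in tweets if "retweeted_status" in t]
    return originals, retweets


# Invented sample, for illustration only.
sample = [
    {"created_at": "Tue Apr 16 00:50:00 +0000 2013"},
    {"created_at": "Tue Apr 16 00:59:30 +0000 2013", "retweeted_status": {}},
    {"created_at": "Tue Apr 16 01:05:00 +0000 2013"},
]

originals, retweets = split_retweets(sample)
all_series = hourly_counts(sample)
original_series = hourly_counts(originals)
retweet_series = hourly_counts(retweets)
```

Each resulting series corresponds to one panel: all tweets, tweets excluding retweets, and retweets only.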

Page 17: Presentacion

Case study
Boston Terror Attack: Activity (detail)

[Figure: three time-series panels covering Thu 23:00 to Sat 00:00, titled "Tweets", "Tweets (excluding RTs)" and "Retweets"; x-axis: Time; y-axes: Tweets, Non-retweets, Retweets]

Legend: dotted line: photo release; solid line: shooting; gray background: manhunt.

Page 18: Presentacion

Case study
Political analysis: Overview

- Main objective
  - Evaluate sentiment analysis
- Secondary objective
  - Describe regular Twitter activity
  - Stream by user filter
- Selection of Spanish political actors
  - Selected by activity and controversy

Account owner          Accounts
Political party        @PPopular, @PSOE, @iunida, @UPyD
Politician             @agarzon, @EduMadina, @ToniCanto1, @RevillaMiguelA, @ccifuentes, @_Rubalcaba_
Journalist             @jordievole, @iescolar
Activist organization  @LA_PAH

- Data acquisition
  - Begin: Tue, 16 Apr 2013 00:00 (GMT)
  - End: 18 Apr 2013 04:00 (GMT)
  - Filter: account name ("@account")

Page 19: Presentacion

Case study
Political analysis: Dataset description

                Value        Relative   Average
Tweets          65,043                  1.9/user
Non-retweets    28,175       43.32%
Retweets        36,868       56.68%
Geolocalized    528          0.81%
Users           34,195
Mentions        56,713       87.19%
Non-replies     46,981       72.23%
Replies         9,732        14.96%
Size            227.51 MB               3.58 KB/tweet
Index size      2.05 MB
Disk            237.95 MB

Page 20: Presentacion

Case study
Political analysis: Activity

[Figure: three time-series panels covering Tue to Thu, titled "Tweets", "Tweets (excluding RTs)" and "Retweets"; x-axis: Time; y-axes: Tweets, Non-retweets, Retweets]

Page 21: Presentacion

Case study
Political analysis: Sentiment analysis

- 9,884 tweets were manually classified in a collaborative way
  - 4,739 non-neutral tweets
  - 1,062 positives, 3,677 negatives
- Unbalanced dataset
- We tried several parameters for the Naïve Bayes classifier
  - N-grams: {1}, {2}, {3}, {1, 2}, {1, 3} and {2, 3}
  - Minimum score: 0, 1, 2, 3, 4, 5, 6 and 10
- 10-fold cross-validation
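
The parameters above can be sketched as a feature-selection step followed by a cross-validation split. Two caveats: the slides do not define "minimum score", so the code below assumes it is a minimum occurrence count for an n-gram to be kept, which may differ from the authors' definition; and the interleaved fold assignment is just one simple way to build k folds (it presumes the data was shuffled beforehand). All function names are hypothetical.

```python
from collections import Counter


def ngrams(tokens, sizes):
    """All n-grams of the requested sizes; sizes={1, 2} gives unigrams and bigrams."""
    return [
        " ".join(tokens[i:i + n])
        for n in sorted(sizes)
        for i in range(len(tokens) - n + 1)
    ]


def select_features(token_lists, sizes, min_count):
    """Keep n-grams seen at least min_count times (assumed meaning of 'minimum score')."""
    counts = Counter(g for tokens in token_lists for g in ngrams(tokens, sizes))
    return {g for g, c in counts.items() if c >= min_count}


def k_fold_indices(n_items, k=10):
    """Yield (train, test) index lists; every item is held out exactly once."""
    for fold in range(k):
        test = list(range(fold, n_items, k))
        held_out = set(test)
        train = [i for i in range(n_items) if i not in held_out]
        yield train, test


# 4,739 is the number of non-neutral labeled tweets reported on the slide.
folds = list(k_fold_indices(4739, k=10))
```

With 1,062 positive and 3,677 negative tweets, a classifier that always answers "negative" would already be right on about 77.6% of the non-neutral tweets, which is the baseline to keep in mind when reading accuracy figures on this unbalanced dataset.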

Page 22: Presentacion

Case study
Political analysis: Sentiment analysis

Classifier             Accuracy
NaiveBayes-1_2-min3    0.8543
NaiveBayes-1-min3      0.8510
NaiveBayes-1_3-min3    0.8507
NaiveBayes-1-min4      0.8476
NaiveBayes-1_3-min5    0.8474
NaiveBayes-1_2-min4    0.8469
NaiveBayes-1_3-min4    0.8467
NaiveBayes-1_3-min1    0.8459
NaiveBayes-1-min6      0.8452
NaiveBayes-1-min1      0.8448
NaiveBayes-1_2-min5    0.8446
NaiveBayes-1_3-min6    0.8438
NaiveBayes-1_2-min6    0.8436
NaiveBayes-1-min5      0.8406
NaiveBayes-1_2-min1    0.8389
NaiveBayes-2_3-min6    0.8385
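
The classifier labels in the table appear to encode the n-gram set and the minimum score (for example, `NaiveBayes-1_2-min3` for n-grams {1, 2} with minimum score 3). Assuming that reading of the naming pattern, which is inferred from the labels rather than stated in the slides, the full parameter grid can be enumerated as follows:

```python
from itertools import product

# Parameter values listed on the previous slide.
NGRAM_SETS = [(1,), (2,), (3,), (1, 2), (1, 3), (2, 3)]
MIN_SCORES = [0, 1, 2, 3, 4, 5, 6, 10]


def config_name(sizes, min_score):
    """Build a label such as 'NaiveBayes-1_2-min3' (naming pattern inferred from the table)."""
    return "NaiveBayes-{}-min{}".format("_".join(map(str, sizes)), min_score)


grid = [config_name(sizes, score) for sizes, score in product(NGRAM_SETS, MIN_SCORES)]
```

This gives 6 x 8 = 48 configurations, of which the table shows 16.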

Page 23: Presentacion

Case study
Political analysis: Normalized sentiment

[Figure: normalized sentiment over time, Tue to Thu; x-axis: Time; y-axis: Positive, ranging from 0.0 to 1.0]

Page 24: Presentacion

Conclusions and future work

- We developed a framework that eases data extraction and analysis on Twitter
  - Ready for production
  - It will be released soon under a free licence
- We briefly described two case studies
  - Event-driven activity: Boston terror attacks
  - Regular activity: political activity
- Sentiment analysis is intrinsically difficult
- Future work
  - Lemmatization
  - Natural language processing
  - Time series analysis

Page 25: Presentacion

Thanks for your attention!

David F. Barrero
[email protected]

@dfbarrero