24
Introduction Development of the classification system Application and experimental results Conclusions Sentiment Analysis for italian language microblogging: development and valutation of an automatic system Corrado Monti 22-04-2013 Corrado Monti Sentiment Analysis for italian microblogging 1/23

Sentiment Analysis and Political Disaffection in Italy

Embed Size (px)

DESCRIPTION

Slideshow for my master thesis. I built a classification system for politic-related concepts on Italian microblogging, and applied it to political disaffection, measuring correlation between Twitter and h polls.

Citation preview

Page 1: Sentiment Analysis and Political Disaffection in Italy

IntroductionDevelopment of the classification system

Application and experimental resultsConclusions

Sentiment Analysis for italian languagemicroblogging: development and valutation of an

automatic system

Corrado Monti

22-04-2013

Corrado Monti Sentiment Analysis for italian microblogging 1/23

Page 2: Sentiment Analysis and Political Disaffection in Italy

IntroductionDevelopment of the classification system

Application and experimental resultsConclusions

Twitter

I Main microblogging platformI instant pubblication of short textual contents

I Tweet → 140 charactersI Widespread in Italy too

Italian internet users

Hearsay knowledge of

Twitter

WeeklyTwitter users

DaylyTwitter users

28.6 M people100%

25.3 M people88.6% of internet users

1.24 M people4.4% of internet users

4.7 M people16.5% of internet users

december 2012

Corrado Monti Sentiment Analysis for italian microblogging 2/23

Page 3: Sentiment Analysis and Political Disaffection in Italy

IntroductionDevelopment of the classification system

Application and experimental resultsConclusions

Textual classification

I We would like to classify millions of textsI Rule-based approach: definitions of rules (keywords, regular

expressions) together with field experts

I Inaccurate, hard to define

I Supervised machine learning

I They need a training set to learn how to classify new textsI Every text is transformed in a sequence of numerical features;

the algorithm learns what these numbers mean

I Recent problem: Sentiment Analysis → recognize thesentiment expressed by the author of the text

I Many applications, both industrial and academic

Corrado Monti Sentiment Analysis for italian microblogging 3/23

Page 4: Sentiment Analysis and Political Disaffection in Italy

IntroductionDevelopment of the classification system

Application and experimental resultsConclusions

Measuring collective feelings

I One of the first works is O’Connor et al. (2010), whomeasured economic trust through Sentiment Analysis onTwitter

Sent

imen

t Rat

io

1.5

2.0

2.5

3.0

3.5

4.0

Gal

lup

Econ

omic

Con

fiden

ce

−60

−50

−40

−30

−20

2008−01

2008−02

2008−03

2008−04

2008−05

2008−06

2008−07

2008−08

2008−09

2008−10

2008−11

2008−12

2009−01

2009−02

2009−03

2009−04

2009−05

2009−06

2009−07

2009−08

2009−09

2009−10

2009−11Dates Dates

I For which phenomena is this possible?

I How much are these measures valid? Do they work in Italytoo?

Corrado Monti Sentiment Analysis for italian microblogging 4/23

Page 5: Sentiment Analysis and Political Disaffection in Italy

IntroductionDevelopment of the classification system

Application and experimental resultsConclusions

Goals of this work

I Hypothesis: can political disaffection be measured throughmassive tweet classification?

I It is a relevant phenomenon, especially in ItalyI Lot of interest, academic (sociology) and not academic

I We’d also like to build reusable classifiers

Corrado Monti Sentiment Analysis for italian microblogging 5/23

Page 6: Sentiment Analysis and Political Disaffection in Italy

IntroductionDevelopment of the classification system

Application and experimental resultsConclusions

Political disaffection

I How to define a “disaffected” tweet?

I According to domain experts, it must

1. have a politic-related topic → topic detection2. have negative sentiment → sentiment analysis3. be directed to all politicians

Tweet Is political?

Discarded

No

YesIs negative?

Discarded

No

YesIs general?

Discarded

No

YesPolitically

DisaffectedTweet

Corrado Monti Sentiment Analysis for italian microblogging 6/23

Page 7: Sentiment Analysis and Political Disaffection in Italy

IntroductionDevelopment of the classification system

Application and experimental resultsConclusions

Training Set

I 28′340 tweet labelled by 40 political science studentsI Between april and june 2012I 3 labellers for every text

I We keep in the dataset only tweet with unanimous “politic”label

I For other labels we measure agreement with Krippendorff α:I “negative” → 0.78 → reliable labels XI “generic” → 0.41 → much noise on labels ×

negative non negative

politic 7′965 4′544

not politic 15′831

Corrado Monti Sentiment Analysis for italian microblogging 7/23

Page 8: Sentiment Analysis and Political Disaffection in Italy

IntroductionDevelopment of the classification system

Application and experimental resultsConclusions

1 – Topic Detection: politics

I Time-robust classifier

I Training Set extension: 17′388 news titles (January-October2012)

I Best feature extraction:I 5-grams of characters → uniformity from different spacingsI tf-idfI Discard terms with less than 4 occourences

Corrado Monti Sentiment Analysis for italian microblogging 8/23

Page 9: Sentiment Analysis and Political Disaffection in Italy

IntroductionDevelopment of the classification system

Application and experimental resultsConclusions

I 45′728 points with 78′642 featuresI SMO SVM-solver, k-Nearest Neighbor, Random Forest, kernel are too

resource-hungry

I We preferred online algorithmsI ALMA, Passive-Aggressive, PEGASOS, OIPCAC gave good results

Classifier Accuracy F-Measure Global time

ALMA 0.88 ± 0.01 0.87 ± 0.01 13.5 ± 1PA 0.89 ± 0.01 0.89 ± 0.01 10.6 ± 0.1

PEGASOS 0.88 ± 0.01 0.88 ± 0.01 1103 ± 10OIPCAC 0.89 ± 0.001 0.89 ± 0.01 5911 ± 52

I Other algorithms were tested, but with worst results (i.e. NaıveBayes)

I We selected Passive-Aggressive: good results, low costs

Corrado Monti Sentiment Analysis for italian microblogging 9/23

Page 10: Sentiment Analysis and Political Disaffection in Italy

IntroductionDevelopment of the classification system

Application and experimental resultsConclusions

2 – Sentiment Analysis: feature extraction

I Different ways were tested:I n-grams of characters, words, n-grams of wordsI boolean presence, term frequency, tf-idf, fuzzy match between

n-grams

I Best: term frequency of single wordsI Adjustments:

I Fraction of uppercase words as a featureI Features of synonyms were joined together

I Other adjustments did not gave better results:I Consider as different words in beginning or in the end of the

tweet

Corrado Monti Sentiment Analysis for italian microblogging 10/23

Page 11: Sentiment Analysis and Political Disaffection in Italy

IntroductionDevelopment of the classification system

Application and experimental resultsConclusions

2 – Removal of the target of sentiment

I We do not want that the sentiment could be guessed basingon specific entities (PD, Berlusconi. . . ). It should be based onthe surrounding words.

I Training Set prejudices ⇒ OverfittingI Problem sometimes ignored, especially in Italian language Sentiment

Analysis

I How can we decide which words must be removed?I Proposed solution: ontology-based filter

DBpedia

querySPARQL

fi rstobjects

list

text fromonline

newspapers

list ofobjects

to removeappeared?

Discarded

No

Yes

Corrado Monti Sentiment Analysis for italian microblogging 11/23

Page 12: Sentiment Analysis and Political Disaffection in Italy

IntroductionDevelopment of the classification system

Application and experimental resultsConclusions

2 – Sentiment Analysis: classification

I 12′476 points with 2′383 features

Classifier Accuracy F-Measure Global time

ALMA 0.70 ± 0.03 0.75 ± 0.03 0.82 ± 0.3PA 0.67 ± 0.06 0.71 ± 0.12 0.9 ± 10−3

PEGASOS 0.69 ± 0.03 0.73 ± 0.05 76 ± 0.1

OIPCAC 0.71 ± 0.03 0.75 ± 0.02 121 ± 25

RF 0.72 ± 0.03 0.78 ± 0.03 2173 ± 48

I We select ALMA: good results, low costs

Corrado Monti Sentiment Analysis for italian microblogging 12/23

Page 13: Sentiment Analysis and Political Disaffection in Italy

IntroductionDevelopment of the classification system

Application and experimental resultsConclusions

3 – Genericity: rule-based approach

I Problem: select tweets directed to all politicians

I Unhelpful training set labels → we simplify the problem anddecide to use a rule-based approach

I Regole usate:

Presence of keywordsdefined with field experts

(based on the most “secure”part of the training set)

∨Entities from at least three

different political areas(identified through

DBpedia ontologies)

Corrado Monti Sentiment Analysis for italian microblogging 13/23

Page 14: Sentiment Analysis and Political Disaffection in Italy

IntroductionDevelopment of the classification system

Application and experimental resultsConclusions

Applicazione del sistema di classificazione

Hypothesis:

Applying this classification system to tweet of April-October2012 we can obtain a good measure of the diffusion ofpolitical disaffection in society?

Corrado Monti Sentiment Analysis for italian microblogging 14/23

Page 15: Sentiment Analysis and Political Disaffection in Italy

IntroductionDevelopment of the classification system

Application and experimental resultsConclusions

Surveys

I Surveys are a curtesy of IPSOSI We compute two indexes:

1. Inefficacy: fraction of italians that for everyparty say that their propensity to vote themis 1 in a 1 to 10 scale

I Sociologists say that this capture thesentiment of inefficacy perceived by citizens,symptom of political disaffection

2. Non-vote: fraction of italians that will notvote

I It measures a behaviourI Influenced by other factors: election proximity,

“moral” perception of abstention

...

...

I We have a survey every ∼ 10 days in April-October

Corrado Monti Sentiment Analysis for italian microblogging 15/23

Page 16: Sentiment Analysis and Political Disaffection in Italy

IntroductionDevelopment of the classification system

Application and experimental resultsConclusions

Tweet sample

I We gather a set of user active in October and wesample also their followers

I We obtain 261′313 users active in October

I ∼ 5% of italian users and ∼ 10% of active italianusers in October 2012

I We select only those active in April → 167′557utenti

I Scraping of their tweets in this time period:35′882′423 tweet

Lorem ipsum dolor sit amet, consectetur adipiscing elit. In

rhoncus diam a urna pulvinar convallis.

Lorem ipsum dolor sit amet, consectetur adipiscing elit. In

rhoncus diam a urna pulvinar convallis.

Lorem ipsum dolor sit amet, consectetur adipiscing elit. In

rhoncus diam a urna pulvinar convallis.

Lorem ipsum dolor sit amet, consectetur adipiscing elit. In

rhoncus diam a urna pulvinar convallis.

Lorem ipsum dolor sit amet, consectetur adipiscing elit. In

rhoncus diam a urna pulvinar convallis.

Lorem ipsum dolor sit amet, consectetur adipiscing elit. In

rhoncus diam a urna pulvinar convallis.

Corrado Monti Sentiment Analysis for italian microblogging 16/23

Page 17: Sentiment Analysis and Political Disaffection in Italy

IntroductionDevelopment of the classification system

Application and experimental resultsConclusions

Comparison

I For every survey date, we compute the il ratio betweendisaffected tweet volume and political tweet volume in acertain time window. We consider:

∆141 ∆14

7 ∆71

Giugno 2012L M M G V S D

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30

Giugno 2012L M M G V S D

1 2 3 4 5 6 7 8 9

10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30

Giugno 2012L M M G V S D

1 2 3 4 5 6 7 8 9

10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30

Last two weeks Between 14 and 7 days before Last week

I We compare these ratios with polls through Pearsoncorrelation index

Corrado Monti Sentiment Analysis for italian microblogging 17/23

Page 18: Sentiment Analysis and Political Disaffection in Italy

IntroductionDevelopment of the classification system

Application and experimental resultsConclusions

Inefficacy:

∆maxmin ρ 95%Confidenceinterval P-Value for ρ > 0

∆141 0.7860 0.476-0.922 0.031%

∆147 0.7749 0.454-0.917 0.042%

∆71 0.6880 0.310-0.878 0.226%

Non-voto:

∆maxmin ρ 95%Confidenceinterval P-Value for ρ > 0

∆141 0.5579 0.190-0.788 0.567%

∆147 0.5920 0.248-0.803 0.231%

∆71 0.4433 0.049-0.718 3.00%

Corrado Monti Sentiment Analysis for italian microblogging 18/23

Page 19: Sentiment Analysis and Political Disaffection in Italy

IntroductionDevelopment of the classification system

Application and experimental resultsConclusions

æ

æ

æ

æ

æ

æ

ææ

ææ

æ

æ

æ

æ

æ

æ

æ

à

à

à

à

à

à

à

ààà

à à

à à

à

à

à

Apr 15 May 01 May 15 Jun 01 Jun 15 Jul 01 Jul 15 Aug 01 Aug 15 Sep 01 Sep 15 Oct 010.00

0.05

0.10

0.15

0.20

0.25

0.004

0.005

0.006

0.007

0.008

0.009

0.01

0.011

Time

Inef

fica

cyin

dica

tor

Tw

itter

disa

ffec

tion

ratio

Twitter disaffection ratioInefficacy indicator

Corrado Monti Sentiment Analysis for italian microblogging 19/23

Page 20: Sentiment Analysis and Political Disaffection in Italy

IntroductionDevelopment of the classification system

Application and experimental resultsConclusions

Interpretation

I Data seem to indicate a quite strong correlation betweendisaffected tweet and diffusion of the phenomena in society

I This does not mean that Twitter is a representative sample

I We can guess that the quantity of discussion about thispheonomenon is connected with how much it will spread

Corrado Monti Sentiment Analysis for italian microblogging 20/23

Page 21: Sentiment Analysis and Political Disaffection in Italy

IntroductionDevelopment of the classification system

Application and experimental resultsConclusions

Further investigation

I With this big tweet sample, we can study causes ofoscillations with text mining techinques

I We compute the ratio day by day and we found out when wehave disaffection peaks

I For every peak

1. we take news of that day2. we compare them with the mass of tweet of the peak3. which is the most similar news?

Corrado Monti Sentiment Analysis for italian microblogging 21/23

Page 22: Sentiment Analysis and Political Disaffection in Italy

IntroductionDevelopment of the classification system

Application and experimental resultsConclusions

Apr 01 Apr 15 May 01 May 15 Jun 01 Jun 15 Jul 01 Jul 15 Aug 01 Aug 15 Sep 01 Sep 15 Oct 010.000

0.005

0.010

0.015

0.020

0.

0.025

0.05

0.075

0.1

0.125

0.15

0.175

0.2

Time

Twitterdisaffectionratio

Inefficacyindicator

Twitter disaffection ratioInefficacy indicatorRapporto di tweet Sondaggi (Inefficacia)

Tempo

Sondaggi

Sondaggi

Tweet

(Teorie del complotto su stragismo di Stato)

Scandalo Lega

Amministrative

Attentato di Brindisi

Scandalo Fiorito

Corrado Monti Sentiment Analysis for italian microblogging 22/23

Page 23: Sentiment Analysis and Political Disaffection in Italy

IntroductionDevelopment of the classification system

Application and experimental resultsConclusions

Conclusions

I We build classifiers for political messages

I Topic and sentiment classifers are reusable

I We showed a correlation between quantity of discussion onTwitter and diffusion of a phenomenon

I Future works

I Classifying users as nodes of a social networkI Confirm or denial of models that could explain our data

Corrado Monti Sentiment Analysis for italian microblogging 23/23

Page 24: Sentiment Analysis and Political Disaffection in Italy

IntroductionDevelopment of the classification system

Application and experimental resultsConclusions

Thanks

Corrado Monti Sentiment Analysis for italian microblogging 24/23