Upload
others
View
0
Download
0
Embed Size (px)
Citation preview
Analiza wydźwięku w krótkich wypowiedziachużytkowników1
Mateusz Lango
21 czerwca 2016
1Lango M., Brzeziński D., Stefanowski J.: PUT at SemEval-2016 Task 4: The ABC of Twitter Sentiment
Analysis, Proceedings of the 10th International Workshop on Semantic Evaluation (SemEval 2016),NAACL HLT 2016, San Diego, US
Motivation Feature engineering Feature selection Classification techniques Results (SemEval 2016) Open challenges
1 Motivation
2 Feature engineering
3 Feature selection
4 Classification techniques
5 Results (SemEval 2016)
6 Open challenges
Motivation Feature engineering Feature selection Classification techniques Results (SemEval 2016) Open challenges
Why Sentiment Analysis?
Weight 110 g 133 gResolution 480 x 640 320 x 480RAM 256 128HSDPA [Mbit/s] 7.2 3.6Video call Yes NoVideo recording Yes NoVoice commands Yes NoVoice recording Yes NoMMS Yes NoMemory cards Yes No
Motivation Feature engineering Feature selection Classification techniques Results (SemEval 2016) Open challenges
Why Sentiment Analysis?
HTC Touch Diamond Apple iPhone 3GWeight 110 g 133 gResolution 480 x 640 320 x 480RAM 256 128HSDPA [Mbit/s] 7.2 3.6Video call Yes NoVideo recording Yes NoVoice commands Yes NoVoice recording Yes NoMMS Yes NoMemory cards Yes No
Motivation Feature engineering Feature selection Classification techniques Results (SemEval 2016) Open challenges
Why Sentiment Analysis?
Motivation Feature engineering Feature selection Classification techniques Results (SemEval 2016) Open challenges
“Energy of the Nation” project
Motivation Feature engineering Feature selection Classification techniques Results (SemEval 2016) Open challenges
Why Sentiment Analysis?
Decision Support
Product Design
Market Research
Social Science
Machine Learning/Text Mining
...
Motivation Feature engineering Feature selection Classification techniques Results (SemEval 2016) Open challenges
Sentiment Analysis
Document Sentiment Classification
Sentence Subjectivity and Sentiment Classification
Aspect-based Sentiment Analysis
Motivation Feature engineering Feature selection Classification techniques Results (SemEval 2016) Open challenges
Example of classical unsupervised approach
Pointwise mutual information
PMI (term1, term2) = log2P(term1, term2)
P(term1)P(term2)
Sentiment orientation
SO(phrase) = PMI (phrase, excellent)− PMI (phrase, poor)
Motivation Feature engineering Feature selection Classification techniques Results (SemEval 2016) Open challenges
Sentiment Classification: Task Definition
Input: An opinionated text object
Output: A sentiment tag/label
Motivation Feature engineering Feature selection Classification techniques Results (SemEval 2016) Open challenges
Text preprocessing
tokenization (!)
lemmatization (!)
stop-words removal (!)
grouping rare, special tokens (urls, hashtags, numbers,percentages, prices, dates, hours)
removing rare tokens (< 5)
Motivation Feature engineering Feature selection Classification techniques Results (SemEval 2016) Open challenges
N-grams
word n-grams
character k-grams
POS n-grams
elongated words
emoticons
punctuation
all-caps
Motivation Feature engineering Feature selection Classification techniques Results (SemEval 2016) Open challenges
Negation problem
Review of ”1Q84” by Haruki Murakami
Perhaps one of the most important works of science fiction of theyear ... 1Q84 does not disappoint ... [It] envelops the reader in ashifting world of strange cults and peculiar characters that issurreal and entrancing. –Matt Staggs, Suvudu.com
Motivation Feature engineering Feature selection Classification techniques Results (SemEval 2016) Open challenges
Negation problem
Review of ”1Q84” by Haruki Murakami
Perhaps one of the most important works of science fiction of theyear ... 1Q84 does not disappoint ... [It] envelops the reader ina shifting world of strange cults and peculiar characters that issurreal and entrancing. –Matt Staggs, Suvudu.com
Motivation Feature engineering Feature selection Classification techniques Results (SemEval 2016) Open challenges
Negation n-grams
negation list: not, never, none, nobody, nowhere, neither
negation context from the word following the negation worduntil the next punctuation mark
The voice quality of this phone is not good, but the battery life islong
The room was very nicely appointed and the bed was sooocomfortable. Even though the bathroom door did not close all theway, it was still pretty private.
Motivation Feature engineering Feature selection Classification techniques Results (SemEval 2016) Open challenges
Sentiment Lexicons
SentiWordNet
Opinion Lexicon
Multi-perspective Question Answering (4 categories)
NRC (8 emotions)
...
Motivation Feature engineering Feature selection Classification techniques Results (SemEval 2016) Open challenges
How to annotate?
MaxDiff methodology
great good bad interesting
Motivation Feature engineering Feature selection Classification techniques Results (SemEval 2016) Open challenges
How to annotate?
MaxDiff methodology
great good bad interesting
Motivation Feature engineering Feature selection Classification techniques Results (SemEval 2016) Open challenges
How to annotate?
MaxDiff methodology
great good bad interesting
great good bad interestinggreat > > >goodbad < < <
interesting
Motivation Feature engineering Feature selection Classification techniques Results (SemEval 2016) Open challenges
How to annotate?
MaxDiff methodology
great good bad interesting
great good bad interestinggreat − > > >good < − >bad < < − <
interesting < > −
Motivation Feature engineering Feature selection Classification techniques Results (SemEval 2016) Open challenges
Hashtag Sentiment Lexicon
assume that tweets with specific hashtags have knownsentiment (e.g. #joy, #sad, #angry, #surprised)
crawl tweets during 8 months
filter very short&misspelled tweets
use PMI
investigate influence of negation
great [highly positive] → not great [mildly negative]terrible [strong negative] → not terrible [mildly negative]
Motivation Feature engineering Feature selection Classification techniques Results (SemEval 2016) Open challenges
Hashtag Sentiment Lexicon
assume that tweets with specific hashtags have knownsentiment (e.g. #joy, #sad, #angry, #surprised)
crawl tweets during 8 months
filter very short&misspelled tweets
use PMI
investigate influence of negation
great [highly positive] → not great [mildly negative]terrible [strong negative] → not terrible [mildly negative]
Motivation Feature engineering Feature selection Classification techniques Results (SemEval 2016) Open challenges
Affirmative Context and Negated Context Lexicons
Motivation Feature engineering Feature selection Classification techniques Results (SemEval 2016) Open challenges
Affirmative Context and Negated Context Lexicons
Motivation Feature engineering Feature selection Classification techniques Results (SemEval 2016) Open challenges
One-hot representation
Motivation Feature engineering Feature selection Classification techniques Results (SemEval 2016) Open challenges
Many-hot representation
Motivation Feature engineering Feature selection Classification techniques Results (SemEval 2016) Open challenges
Towards dense word representation
The intuition : similar words appear in similar contexts
The cat purrsThis cat hunts mice
The kitty purrsThis kitty hunts mice
The tiger purrsThis tiger hunts mice (?)
Motivation Feature engineering Feature selection Classification techniques Results (SemEval 2016) Open challenges
Towards dense word representation
The intuition : similar words appear in similar contexts
The cat purrsThis cat hunts mice
The kitty purrsThis kitty hunts mice
The tiger purrsThis tiger hunts mice (?)
Motivation Feature engineering Feature selection Classification techniques Results (SemEval 2016) Open challenges
Towards dense word representation
The intuition : similar words appear in similar contexts
The cat purrsThis cat hunts mice
The kitty purrsThis kitty hunts mice
The tiger purrsThis tiger hunts mice (?)
Motivation Feature engineering Feature selection Classification techniques Results (SemEval 2016) Open challenges
Brown Clustering
P(corpus|C ) =n∏
i=1
e (wi |C (wi )) t (C (wi )|C (wi−1))
1 Take the top k most frequent words, put each into its owncluster
2 For the rest of words:Create a new cluster for the i th most frequent wordChoose two clusters to be merged: pick the merge that gives amaximum value for quality
Motivation Feature engineering Feature selection Classification techniques Results (SemEval 2016) Open challenges
Brown Clustering
P(corpus|C ) =n∏
i=1
e (wi |C (wi )) t (C (wi )|C (wi−1))
1 Take the top k most frequent words, put each into its owncluster
2 For the rest of words:Create a new cluster for the i th most frequent wordChoose two clusters to be merged: pick the merge that gives amaximum value for quality
Motivation Feature engineering Feature selection Classification techniques Results (SemEval 2016) Open challenges
Brown Clustering
Motivation Feature engineering Feature selection Classification techniques Results (SemEval 2016) Open challenges
Word embeddings
Word2Vec
GloVe
Motivation Feature engineering Feature selection Classification techniques Results (SemEval 2016) Open challenges
Word2Vec
Motivation Feature engineering Feature selection Classification techniques Results (SemEval 2016) Open challenges
Word2Vec
Motivation Feature engineering Feature selection Classification techniques Results (SemEval 2016) Open challenges
Word2Vec
Motivation Feature engineering Feature selection Classification techniques Results (SemEval 2016) Open challenges
Properties of Word Embeddings
NN searchJapan - Korea, Chinatea - coffee, lemon, sugar
semantic analogypuppy - dog ≈ kitten - cat
syntactic analogytaller - tall ≈ smaller - small
”words arithmetic”king - man + woman = ?Paris - France + Germany = ?Tadeusza - Tadeusz + Marek = ?Shakespeare - English + Polish = ?0.5 (first + fifth) = ?
Motivation Feature engineering Feature selection Classification techniques Results (SemEval 2016) Open challenges
... embeddings
Word/Lexical embeddings
Part-of-speech embeddings
Document embeddings
Sentiment embeddings
Motivation Feature engineering Feature selection Classification techniques Results (SemEval 2016) Open challenges
Polarity embeddings
Motivation Feature engineering Feature selection Classification techniques Results (SemEval 2016) Open challenges
Classic methods
Information Gain
χ2
F statistics
Motivation Feature engineering Feature selection Classification techniques Results (SemEval 2016) Open challenges
Intelligent Feature Selection
assign initial weight for ax = (ax1, ax2, ...)
w(ax) = wt(ax) + ws(ax)
wt(ax) = maxv ,w
P(ax |v) logP(ax |v)P(ax |w)
ws(ax) =1d
d∑i=1
1k
k∑j=1
spositive(axi ,j) + snegative(axi ,j)
explore subsumption and parallel relations
Motivation Feature engineering Feature selection Classification techniques Results (SemEval 2016) Open challenges
Intelligent Feature Selection
Motivation Feature engineering Feature selection Classification techniques Results (SemEval 2016) Open challenges
Popular algorithms
Support Vector Machines
Random Forests
Deep Learning...
Motivation Feature engineering Feature selection Classification techniques Results (SemEval 2016) Open challenges
Convolutional Neural Networks
Motivation Feature engineering Feature selection Classification techniques Results (SemEval 2016) Open challenges
SemEval 2016 Task 4
International Workshop on Semantic Evaluation 2016collocated with the 15th Annual Conference of the NorthAmerican Chapter of the Association for ComputationalLinguistics: Human Language Technologies (NAACL HLT)
10th edition
14 different taskTask 4: Sentiment Analysis in Twitter
4th editionthe highest number of participant43 teams, 25 countries
Motivation Feature engineering Feature selection Classification techniques Results (SemEval 2016) Open challenges
Our system
n-grams, k-grams, negation n-grams, POS-grams
lexicons: the NRC emotion lexicon , Hu and Liu Opinionlexicon , the Multi-perspective Question Answering corpus ,and SentiWordNet
Hashtag Lexicon
Brown Clustering
Gradient Boosting Trees with weights
SVM, RF added for robustness
Motivation Feature engineering Feature selection Classification techniques Results (SemEval 2016) Open challenges
Results 4A: Message polarity classification
FPN1
1 ETH Zurich Switzerland 0.6332 Aix-Marseille University France 0.6303 University of Melbourne Australia 0.6174 Universidade de Lisboa Portgual 0.6105 Athens University of Economics and
BusinessGreece 0.605
6 Aix-Marseille University France 0.5987 Nanyang Technological University Singapore 0.596. . .14 Poznan University of Technology Poland 0.574. . .
Motivation Feature engineering Feature selection Classification techniques Results (SemEval 2016) Open challenges
Winning algorithm
90M tweets (approx. 7K)
testing set from previous edition2×CNN + RF:
Word2Vec (d = 52, skip-gram 5, 200M tweets)GloVe (d = 50, 90M tweets)
Motivation Feature engineering Feature selection Classification techniques Results (SemEval 2016) Open challenges
Our system
Algorithm 1 Roughly Balanced BaggingInput: D = Dmin ∪ Dmaj : original training set of examples of size N,k : number of bootstrap samples, LA: learning algorithm;Output: C∗ bagging ensemble with k component classi-fiers1: for i = 1→ k do2: Nmin
i ← |Nmin|3: Nmaj
i ← following negative binomial distribution with n = Nmini and
p = q = 0, 54: Smin
i ← Nmini -element sample drawn with replacement from Dmin
5: Smaji ← Nmaj
i -element sample drawn with replacement from Dmaj
6: Ci ← LA(Smini ∪ Smaj
i )7: end for
C∗(x) = arg maxy
k∑i=1
pCi (y |x)
Motivation Feature engineering Feature selection Classification techniques Results (SemEval 2016) Open challenges
Results 4B: classification according to a two-point scale
recallmacro
1 National Technical University ofAthens, University of Athens et al.
Greece 0,797
2 Universidade da Coruna & Universi-dade de Vigo
Spain 0,791
3 Amazon.in India 0,7844 East China Normal University China 0,7685 INSIGHT Research Centre, National
University of IrelandIreland 0,767
6 Poznan University of Technology Poland 0,7637 University of Melbourne Australia 0,758. . .14 ETH Zurich Switzerland 0,648. . .
Motivation Feature engineering Feature selection Classification techniques Results (SemEval 2016) Open challenges
Our system
”Simple Ordinal” ensemble
SVM + GBT
Motivation Feature engineering Feature selection Classification techniques Results (SemEval 2016) Open challenges
Results 4C: classification according to a five-point scale
MAEmacro
1 University of Grenoble-Alpe France 0,7192 East China Normal University China 0,8063 Poznan University of Technology Poland 0,8604 Universidade da Coruna & Universi-
dade de VigoSpain 0,864
5 Saints Cyril and Methodius University,Skopje
Macedonia 0,869
6 INSIGHT Research Centre, NationalUniversity of Ireland
Ireland 1,006
7 Istituto di Scienza e Tecnologiedell’Informazione
Italy 1,074
Motivation Feature engineering Feature selection Classification techniques Results (SemEval 2016) Open challenges
Feature name Rel. impor. [%]NRC Hashtag Lexicon: mean 0.79Brown cluster: 01110110 0.73SentiWordNet: sum of negative 0.635 k-gram: “d &am” 0.55Brown cluster: 1110011001111 0.49NRC Hashtag Lexicon: max 0.48Opinion Lexicon: negative 0.47Brown cluster: 111101011101 0.423 k-gram “ok ” 0.414 k-gram “ nor” 0.40Brown cluster: 0100100 0.383 k-gram “ NY” 0.352 n-gram: not against 0.35Brown cluster: 111101111100100 0.345 k-gram “ Anth” 0.34
Tablica: Relative feature importances (%) of top 15 features.
Motivation Feature engineering Feature selection Classification techniques Results (SemEval 2016) Open challenges
Feature group Rel. impor. [%]5 character-gram 26.034 character-gram 21.753 character-gram 21.74Brown clusters 6.92Negated 1-gram 6.621-gram + POS 4.24Negated + 2-gram 3.481-gram 2.692-gram 1.87NRC Hashtag Lexicon 1.49SentiWordNet 1.00NRC Lexicon 0.93Opinion Lexicon 0.623-gram 0.34MPQA corpus 0.254-gram 0.03
Tablica: Relative feature importances (%) for features groups.
Motivation Feature engineering Feature selection Classification techniques Results (SemEval 2016) Open challenges
Open challenges
blending theories of emotions with the practical engineering
multimodal data
ordinal classification
quantification
Motivation Feature engineering Feature selection Classification techniques Results (SemEval 2016) Open challenges
Quantification
perfect classifier = perfect quantifierT F
T 80 0F 0 20
better classifier ? better quantifier
T FT 70 10F 10 10
T FT 75 5F 0 20
Motivation Feature engineering Feature selection Classification techniques Results (SemEval 2016) Open challenges
Thank you!