Upload
others
View
0
Download
0
Embed Size (px)
Citation preview
NLP for Social Media: POS Tagging, Sentiment Analysis
Pawan Goyal
CSE, IITKGP
August 05, 2016
Pawan Goyal (IIT Kharagpur) NLP for Social Media: POS Tagging, Sentiment Analysis August 05, 2016 1 / 23
Part-of-Speech (POS) Tagging
Pawan Goyal (IIT Kharagpur) NLP for Social Media: POS Tagging, Sentiment Analysis August 05, 2016 2 / 23
Penn Treebank POS Tags
Pawan Goyal (IIT Kharagpur) NLP for Social Media: POS Tagging, Sentiment Analysis August 05, 2016 3 / 23
Part-of-Speech (POS) Tagging
Words often have more than one POS
POS tagging problem is to determine the POS tag for a particular instance of aword.
Pawan Goyal (IIT Kharagpur) NLP for Social Media: POS Tagging, Sentiment Analysis August 05, 2016 4 / 23
Part-of-Speech (POS) Tagging
Words often have more than one POS
POS tagging problem is to determine the POS tag for a particular instance of aword.
Pawan Goyal (IIT Kharagpur) NLP for Social Media: POS Tagging, Sentiment Analysis August 05, 2016 4 / 23
Relevant papers and tool
Kevin Gimpel, Nathan Schneider, Brendan O’Connor, Dipanjan Das,Daniel Mills, Jacob Eisenstein, Michael Heilman, Dani Yogatama, JeffreyFlanigan, and Noah A. Smith. Part-of-Speech Tagging for Twitter:Annotation, Features, and Experiments. In Proceedings of ACL 2011.
Olutobi Owoputi, Brendan O’Connor, Chris Dyer, Kevin Gimpel, NathanSchneider and Noah A. Smith. Improved Part-of-Speech Tagging forOnline Conversational Text with Word Clusters. In Proceedings of NAACL2013.
Parts-of-Speech tagget for twitter -http://www.cs.cmu.edu/~ark/TweetNLP/
Pawan Goyal (IIT Kharagpur) NLP for Social Media: POS Tagging, Sentiment Analysis August 05, 2016 5 / 23
POS Tagset for Twitter
Pawan Goyal (IIT Kharagpur) NLP for Social Media: POS Tagging, Sentiment Analysis August 05, 2016 6 / 23
Twitter-specific Tags
Pawan Goyal (IIT Kharagpur) NLP for Social Media: POS Tagging, Sentiment Analysis August 05, 2016 7 / 23
The case of Hashtags
Pawan Goyal (IIT Kharagpur) NLP for Social Media: POS Tagging, Sentiment Analysis August 05, 2016 8 / 23
Combined forms
Pawan Goyal (IIT Kharagpur) NLP for Social Media: POS Tagging, Sentiment Analysis August 05, 2016 9 / 23
Tagging Scheme
Pawan Goyal (IIT Kharagpur) NLP for Social Media: POS Tagging, Sentiment Analysis August 05, 2016 10 / 23
Tagging Scheme
Pawan Goyal (IIT Kharagpur) NLP for Social Media: POS Tagging, Sentiment Analysis August 05, 2016 11 / 23
Tagging Scheme
Pawan Goyal (IIT Kharagpur) NLP for Social Media: POS Tagging, Sentiment Analysis August 05, 2016 12 / 23
Building the POS tagger
CRF model was used. Some insighful features:
Twitter orthography: Features for several regular expression-style rulesthat detect at-mentions, hashtags, URLs etc.
Frequently-capitalized tokens: Compiled gazetteers of tokens whichare frequently capitalized. Features would be - membership in the top Nfrequently capitalized items.
Traditional tag dictionary: Features for all coarse-grained tags thateach word occurs with in the Penn Treebank.
Phonetic normalization: Metaphone algorithm to create a coarsephonetic normalization of words to simpler keys.
Distributional similarity: Distributional features from the successor andpredecessor probabilities for the 10,000 most common terms.
Pawan Goyal (IIT Kharagpur) NLP for Social Media: POS Tagging, Sentiment Analysis August 05, 2016 13 / 23
Building the POS tagger
CRF model was used. Some insighful features:
Twitter orthography: Features for several regular expression-style rulesthat detect at-mentions, hashtags, URLs etc.
Frequently-capitalized tokens: Compiled gazetteers of tokens whichare frequently capitalized. Features would be - membership in the top Nfrequently capitalized items.
Traditional tag dictionary: Features for all coarse-grained tags thateach word occurs with in the Penn Treebank.
Phonetic normalization: Metaphone algorithm to create a coarsephonetic normalization of words to simpler keys.
Distributional similarity: Distributional features from the successor andpredecessor probabilities for the 10,000 most common terms.
Pawan Goyal (IIT Kharagpur) NLP for Social Media: POS Tagging, Sentiment Analysis August 05, 2016 13 / 23
Building the POS tagger
CRF model was used. Some insighful features:
Twitter orthography: Features for several regular expression-style rulesthat detect at-mentions, hashtags, URLs etc.
Frequently-capitalized tokens: Compiled gazetteers of tokens whichare frequently capitalized. Features would be - membership in the top Nfrequently capitalized items.
Traditional tag dictionary: Features for all coarse-grained tags thateach word occurs with in the Penn Treebank.
Phonetic normalization: Metaphone algorithm to create a coarsephonetic normalization of words to simpler keys.
Distributional similarity: Distributional features from the successor andpredecessor probabilities for the 10,000 most common terms.
Pawan Goyal (IIT Kharagpur) NLP for Social Media: POS Tagging, Sentiment Analysis August 05, 2016 13 / 23
Building the POS tagger
CRF model was used. Some insighful features:
Twitter orthography: Features for several regular expression-style rulesthat detect at-mentions, hashtags, URLs etc.
Frequently-capitalized tokens: Compiled gazetteers of tokens whichare frequently capitalized. Features would be - membership in the top Nfrequently capitalized items.
Traditional tag dictionary: Features for all coarse-grained tags thateach word occurs with in the Penn Treebank.
Phonetic normalization: Metaphone algorithm to create a coarsephonetic normalization of words to simpler keys.
Distributional similarity: Distributional features from the successor andpredecessor probabilities for the 10,000 most common terms.
Pawan Goyal (IIT Kharagpur) NLP for Social Media: POS Tagging, Sentiment Analysis August 05, 2016 13 / 23
Building the POS tagger
CRF model was used. Some insighful features:
Twitter orthography: Features for several regular expression-style rulesthat detect at-mentions, hashtags, URLs etc.
Frequently-capitalized tokens: Compiled gazetteers of tokens whichare frequently capitalized. Features would be - membership in the top Nfrequently capitalized items.
Traditional tag dictionary: Features for all coarse-grained tags thateach word occurs with in the Penn Treebank.
Phonetic normalization: Metaphone algorithm to create a coarsephonetic normalization of words to simpler keys.
Distributional similarity: Distributional features from the successor andpredecessor probabilities for the 10,000 most common terms.
Pawan Goyal (IIT Kharagpur) NLP for Social Media: POS Tagging, Sentiment Analysis August 05, 2016 13 / 23
Related Problems
Entity RecognitionNamed Entity: Names of people, places, organization
Date and time
Can you model it as a sequence labeling problem?
Pawan Goyal (IIT Kharagpur) NLP for Social Media: POS Tagging, Sentiment Analysis August 05, 2016 14 / 23
Related Problems
Entity RecognitionNamed Entity: Names of people, places, organization
Date and time
Can you model it as a sequence labeling problem?
Pawan Goyal (IIT Kharagpur) NLP for Social Media: POS Tagging, Sentiment Analysis August 05, 2016 14 / 23
Sentiment Analysis
Pawan Goyal (IIT Kharagpur) NLP for Social Media: POS Tagging, Sentiment Analysis August 05, 2016 15 / 23
Sentiment Analysis: Relevant Paper
Pak, Alexander, and Patrick Paroubek. “Twitter as a Corpus for SentimentAnalysis and Opinion Mining.” LREC. Vol. 10. 2010.
Pawan Goyal (IIT Kharagpur) NLP for Social Media: POS Tagging, Sentiment Analysis August 05, 2016 16 / 23
Sentiment Analysis: Examples
Pawan Goyal (IIT Kharagpur) NLP for Social Media: POS Tagging, Sentiment Analysis August 05, 2016 17 / 23
Data Creation
How to do that without manual labeling?
Pawan Goyal (IIT Kharagpur) NLP for Social Media: POS Tagging, Sentiment Analysis August 05, 2016 18 / 23
Data Creation
How to do that without manual labeling?
Pawan Goyal (IIT Kharagpur) NLP for Social Media: POS Tagging, Sentiment Analysis August 05, 2016 18 / 23
Data Creation
How to do that without manual labeling?
Pawan Goyal (IIT Kharagpur) NLP for Social Media: POS Tagging, Sentiment Analysis August 05, 2016 18 / 23
POS tag Distribution: Subjective vs. Objective
Pawan Goyal (IIT Kharagpur) NLP for Social Media: POS Tagging, Sentiment Analysis August 05, 2016 19 / 23
POS tag Distribution: Positive vs. Negative
Pawan Goyal (IIT Kharagpur) NLP for Social Media: POS Tagging, Sentiment Analysis August 05, 2016 20 / 23
Classification Model
Naïve Bayes Model: Main featuresPOS tags, Word n-grams
Using negations in Word n-grams
Pawan Goyal (IIT Kharagpur) NLP for Social Media: POS Tagging, Sentiment Analysis August 05, 2016 21 / 23
Classification Model
Naïve Bayes Model: Main featuresPOS tags, Word n-grams
Using negations in Word n-grams
Pawan Goyal (IIT Kharagpur) NLP for Social Media: POS Tagging, Sentiment Analysis August 05, 2016 21 / 23
Disregarding common n-grams
Using entropy and salience
Pawan Goyal (IIT Kharagpur) NLP for Social Media: POS Tagging, Sentiment Analysis August 05, 2016 22 / 23
Disregarding common n-grams
Using entropy and salience
Pawan Goyal (IIT Kharagpur) NLP for Social Media: POS Tagging, Sentiment Analysis August 05, 2016 22 / 23
Disregarding common n-grams
Using entropy and salience
Pawan Goyal (IIT Kharagpur) NLP for Social Media: POS Tagging, Sentiment Analysis August 05, 2016 22 / 23
N-grams with high salience and low entropy
Pawan Goyal (IIT Kharagpur) NLP for Social Media: POS Tagging, Sentiment Analysis August 05, 2016 23 / 23