Thumbs up? Sentiment Classification using Machine Learning Techniques
- Bo Pang and Lillian Lee
- Shivakumar Vaithyanathan
What is it?
• Input: raw text on some topic
• Output: opinion (+ve, -ve or neutral)
• It is hard. Why?
- The opinion must be determined from the text as a whole, not just from the topic it discusses
-- let's understand the problem
We know …
• Web – enormous amount of data
• Topical categorization – active research
Rise of blogs, forums …
• Web 2.0 is commonly associated with web applications that facilitate interactive information sharing, interoperability, user-centered design, and collaboration on the World Wide Web (source: Wikipedia)
Why is it interesting?
• Represents the voice of a broad audience on a particular topic
• Example : product reviews, movie reviews, book reviews
• Important to business intelligence applications
- What do people (dis)like about the Nikon D40?
What this paper does
• Examines the effectiveness of applying machine learning techniques to the sentiment classification problem
• Challenging: while topics are often identifiable by keywords alone, sentiment can be expressed in a more subtle manner.
Dataset : Movie-Review Domain
Reason :
– Large online collection for reviews
– Reviewers often summarize their overall sentiment with a machine-extractable rating indicator (e.g. a number of stars), so there is no need to hand-label data for supervised learning
Corpus of 752 -ve and 1301 +ve reviews, with a total of 144 reviewers represented
Naïve approach
• Idea: people tend to use certain words to express strong sentiments; produce a list of such words and rely on it to classify text
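A minimal sketch of this word-list idea, for illustration only. The two word lists and the function name `classify_by_word_list` are assumptions made up for this example; they are not the lists evaluated in the paper.

```python
# Hypothetical word lists, chosen only to illustrate the approach.
POSITIVE_WORDS = {"brilliant", "dazzling", "excellent", "fantastic"}
NEGATIVE_WORDS = {"awful", "terrible", "hideous", "unwatchable"}


def classify_by_word_list(text):
    """Count list words in the text and return the majority sentiment."""
    # Lowercase, split on whitespace, and trim surrounding punctuation.
    tokens = [t.strip(".,!?\"'") for t in text.lower().split()]
    positive = sum(1 for t in tokens if t in POSITIVE_WORDS)
    negative = sum(1 for t in tokens if t in NEGATIVE_WORDS)
    if positive > negative:
        return "+ve"
    if negative > positive:
        return "-ve"
    return "tie"  # no list word found, or equal counts


print(classify_by_word_list("A brilliant, dazzling film!"))
```

Ties are reported separately here because a short list often matches nothing in a given review.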
Machine Learning methods
• Let {f1, f2, …, fm} be a predefined set of m features that can appear in a document. Example: the word “still” or the bigram “really stinks”
• ni(d) – number of times fi occurs in document d
• Document vector(d) = (n1(d), n2(d), …, nm(d))
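The document vector defined above can be sketched as follows. The helper name `document_vector`, the whitespace tokenization, and the sample sentence are assumptions for illustration.

```python
from collections import Counter


def document_vector(tokens, features):
    """Return (n1(d), ..., nm(d)): how often each feature occurs in d."""
    unigrams = Counter(tokens)
    # Adjacent token pairs joined by a space serve as bigram features.
    bigrams = Counter(" ".join(pair) for pair in zip(tokens, tokens[1:]))
    counts = unigrams + bigrams
    # A Counter returns 0 for features that never occur.
    return [counts[f] for f in features]


features = ["still", "really stinks"]
tokens = "it still really stinks and it still drags".split()
vector = document_vector(tokens, features)
print(vector)  # [2, 1]
```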
Naïve Bayes
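Assign to a given document d the class $c^* = \arg\max_c P(c \mid d)$
Naïve Bayes rule: $P_{NB}(c \mid d) = \dfrac{P(c) \prod_{i=1}^{m} P(f_i \mid c)^{n_i(d)}}{P(d)}$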
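A rough sketch of a Naïve Bayes classifier over count vectors (n1(d), …, nm(d)). It picks the class c that maximizes P(c) times the product of P(fi | c)^ni(d); P(d) is the same for every class, so it is dropped. The function names, the toy training data, and the use of add-one smoothing and log probabilities are assumptions of this example.

```python
import math


def train_naive_bayes(vectors, labels):
    """Estimate log P(c) and log P(f_i | c) from count vectors."""
    m = len(vectors[0])
    log_prior, log_likelihood = {}, {}
    for c in sorted(set(labels)):
        rows = [v for v, y in zip(vectors, labels) if y == c]
        log_prior[c] = math.log(len(rows) / len(vectors))
        totals = [sum(column) for column in zip(*rows)]
        denominator = sum(totals) + m  # add-one smoothing (an assumption)
        log_likelihood[c] = [math.log((t + 1) / denominator) for t in totals]
    return log_prior, log_likelihood


def classify_nb(vector, log_prior, log_likelihood):
    """Return the class with the highest log P(c) + sum_i n_i(d) log P(f_i | c)."""
    def score(c):
        return log_prior[c] + sum(
            n * lp for n, lp in zip(vector, log_likelihood[c])
        )
    return max(log_prior, key=score)


# Toy data: feature 1 occurs only in +ve documents, feature 2 only in -ve.
train_vectors = [[2, 0], [1, 0], [0, 2], [0, 1]]
train_labels = ["+ve", "+ve", "-ve", "-ve"]
prior, likelihood = train_naive_bayes(train_vectors, train_labels)
print(classify_nb([1, 0], prior, likelihood))
```

Working in log space avoids numerical underflow when documents contain many features.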
Maximum Entropy
• Idea: make the fewest assumptions about the data while still being consistent with it
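A maximum entropy classifier is commonly written in exponential form, with P(c | d) proportional to exp(Σi λi,c · ni(d)). The sketch below only evaluates that form for given weights; it does not train them. The function name, the weight values, and the choice of ni(d) as the feature value are assumptions for illustration.

```python
import math


def maxent_probabilities(vector, weights):
    """Return P(c | d) for each class under an exponential (log-linear) model.

    weights maps each class to one weight per feature (hypothetical values).
    """
    scores = {
        c: sum(lam * n for lam, n in zip(lams, vector))
        for c, lams in weights.items()
    }
    top = max(scores.values())  # subtract the maximum for numerical stability
    exp_scores = {c: math.exp(s - top) for c, s in scores.items()}
    z = sum(exp_scores.values())  # normalization term Z(d)
    return {c: e / z for c, e in exp_scores.items()}


# With all weights at zero the model assumes nothing and stays uniform.
uniform = maxent_probabilities([2, 0], {"+ve": [0.0, 0.0], "-ve": [0.0, 0.0]})
# Nonzero weights shift probability toward the class the features support.
shifted = maxent_probabilities([2, 0], {"+ve": [1.0, 0.0], "-ve": [0.0, 1.0]})
print(uniform, shifted)
```

The all-zero case shows the "fewest assumptions" idea: without any evidence encoded in the weights, every class is equally likely.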
Support Vector Machines (SVM)
• Are large-margin, non-probabilistic classifiers in contrast to Naïve Bayes and Maximum Entropy
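• Letting $c_j \in \{1, -1\}$ (corresponding to +ve, -ve) be the correct class of document $d_j$, the separating hyperplane can be written as $\vec{w} = \sum_j \alpha_j c_j \vec{d_j}$, with $\alpha_j \ge 0$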
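In a linear SVM, the hyperplane vector is a weighted sum of training document vectors, w = Σj αj cj dj with αj ≥ 0 and cj in {1, -1}, and a new document is classified by which side of the hyperplane it falls on. The sketch below only assembles w from given values. The αj here are made-up numbers; in practice they come from solving the SVM optimization problem, which is not shown. Function names and the zero bias are also assumptions.

```python
def hyperplane(alphas, classes, docs):
    """Compute w = sum_j alpha_j * c_j * d_j from document vectors."""
    w = [0.0] * len(docs[0])
    for alpha, c, d in zip(alphas, classes, docs):
        for i, value in enumerate(d):
            w[i] += alpha * c * value
    return w


def classify_svm(w, d, bias=0.0):
    """Return 1 (+ve) or -1 (-ve) depending on the side of the hyperplane."""
    score = sum(wi * di for wi, di in zip(w, d)) + bias
    return 1 if score >= 0 else -1


docs = [[2, 0], [0, 2]]   # toy document vectors
classes = [1, -1]         # +ve, -ve
alphas = [0.5, 0.5]       # hypothetical values, not learned here
w = hyperplane(alphas, classes, docs)
print(w, classify_svm(w, [3, 1]))
```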
Evaluations
• Randomly selected 700 positive, 700 negative sentiment documents
• Automatically removed rating indicators, extracted textual information from original HTML
• Added the tag NOT_ to every word between a negation word (“not”, “isn’t”) and the first punctuation mark that follows it.
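The negation tagging step can be sketched as below. The tokenizer regex, the function name `add_negation_tags`, and the exact negation word set (the slide names only “not” and “isn’t”) are assumptions for illustration.

```python
import re

NEGATION_WORDS = {"not", "isn't", "didn't"}  # illustrative, not exhaustive
PUNCTUATION = set(".,!?;:")


def add_negation_tags(text):
    """Prefix NOT_ to each word between a negation word and the next punctuation."""
    # Normalize curly apostrophes so "isn’t" matches the entry "isn't".
    normalized = text.lower().replace("\u2019", "'")
    tokens = re.findall(r"[\w']+|[.,!?;:]", normalized)
    tagged, negating = [], False
    for token in tokens:
        if token in PUNCTUATION:
            negating = False  # punctuation ends the negation scope
            tagged.append(token)
        elif negating:
            tagged.append("NOT_" + token)
        else:
            tagged.append(token)
            if token in NEGATION_WORDS:
                negating = True
    return tagged


print(add_negation_tags("I did not like this movie, but it is short."))
```

After tagging, "like" and "NOT_like" become distinct features, so a classifier can weight them differently.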
Results
Conclusion
• Unigram presence information turned out to be most effective
• The superiority of presence information in comparison to feature frequency indicates a difference between sentiment and topic categorization.
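To make the distinction concrete: a frequency vector records how often each feature occurs, while a presence vector records only whether it occurs at all. A small sketch, with the function name `to_presence` assumed for illustration:

```python
def to_presence(count_vector):
    """Convert feature counts n_i(d) into binary presence indicators."""
    return [1 if n > 0 else 0 for n in count_vector]


frequency = [3, 0, 1]            # toy feature counts for one document
presence = to_presence(frequency)
print(presence)  # [1, 0, 1]
```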