Sentiment Analysis

Thumbs up? Sentiment Classification using Machine Learning Techniques

- Bo Pang and Lillian Lee

- Shivakumar Vaithyanathan

What is it?

• Input – raw text over some topic

• Output – opinion (+ve, -ve or neutral)

• It is hard – why?

- The task is to determine the opinion expressed by the text as a whole, not just its topic

- Let's understand the problem

We know …

• Web – enormous amount of data

• Topical categorization – active research

Rise of blogs, forums …

• Web 2.0 is commonly associated with web applications that facilitate interactive information sharing, interoperability, user-centered design, and collaboration on the World Wide Web (source: Wikipedia)

Why is it interesting?

• Represents the voice of a broader audience on a particular topic

• Example : product reviews, movie reviews, book reviews

• Important to business intelligence applications

- What do people (dis)like about the Nikon D40?

What this paper does

• Examines the effectiveness of applying machine learning techniques to the sentiment classification problem

• Challenging – while topics are often identifiable by keywords alone, sentiment can be expressed in a more subtle manner.

Dataset : Movie-Review Domain

Reasons:

– Large online collections of reviews

– Reviewers often summarize their overall sentiment with a machine-extractable rating indicator (such as a number of stars), so there is no need to hand-label data for supervised learning

Corpus of 752 -ve and 1301 +ve reviews, with a total of 144 reviewers represented

Naïve approach

• Idea: people tend to use certain words to express strong sentiments; produce a list of such words and rely on it to classify text
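The word-list idea can be sketched in a few lines of Python. This is a minimal illustration only: the two word lists and the function name `classify_by_word_list` are hypothetical, and the tie handling is an assumption rather than something stated on the slide.

```python
import re

# Hypothetical hand-picked word lists, for illustration only.
POSITIVE_WORDS = {"dazzling", "brilliant", "phenomenal", "excellent", "fantastic"}
NEGATIVE_WORDS = {"suck", "terrible", "awful", "unwatchable", "hideous"}


def classify_by_word_list(text):
    """Return '+ve', '-ve' or 'tie' by comparing counts of listed words."""
    tokens = re.findall(r"[a-z']+", text.lower())
    pos = sum(token in POSITIVE_WORDS for token in tokens)
    neg = sum(token in NEGATIVE_WORDS for token in tokens)
    if pos > neg:
        return "+ve"
    if neg > pos:
        return "-ve"
    return "tie"  # assumption: equal counts are reported as a tie


print(classify_by_word_list("A brilliant, dazzling film"))
```

A list like this is easy to build, but it misses any review that expresses sentiment without using one of the listed words, which is part of the motivation for the learned classifiers that follow.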

Machine Learning methods

• Let {f1, f2, …, fm} be a predefined set of m features that can appear in a document. Example: the word “still” or the bigram “really stinks”

• ni(d) – number of times fi occurs in document d

• Document vector(d) = (n1(d), n2(d), …, nm(d))
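As a rough sketch of the representation above, the following Python builds the count vector (n1(d), …, nm(d)) for a tiny feature set. The feature list, the simple regex tokenizer, and the name `document_vector` are assumptions made for illustration.

```python
import re

# Hypothetical feature set: one unigram and one bigram, as in the example above.
FEATURES = ["still", "really stinks"]


def document_vector(text, features=FEATURES):
    """Return [n_1(d), ..., n_m(d)]: how often each feature occurs in the text."""
    tokens = re.findall(r"[a-z']+", text.lower())
    # Adjacent token pairs joined by a space, so bigram features can be counted too.
    bigrams = [" ".join(pair) for pair in zip(tokens, tokens[1:])]
    units = tokens + bigrams
    return [units.count(feature) for feature in features]


print(document_vector("Still, this movie really stinks. It still really stinks!"))
```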

Naïve Bayes

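Assign to a given document d the class $c^* = \arg\max_c P(c \mid d)$

Naïve Bayes rule: $P_{NB}(c \mid d) = \dfrac{P(c) \prod_{i=1}^{m} P(f_i \mid c)^{n_i(d)}}{P(d)}$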
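A Naïve Bayes classifier over the count vectors can be sketched as below. It works in log space and drops P(d), which is the same for every class and so does not affect the argmax. Add-one smoothing of P(fi | c) is assumed here, and the function names are hypothetical.

```python
import math


def train_naive_bayes(vectors, labels):
    """Estimate log P(c) and log P(f_i | c), assuming add-one smoothing."""
    m = len(vectors[0])
    log_prior, log_likelihood = {}, {}
    for c in sorted(set(labels)):
        rows = [v for v, y in zip(vectors, labels) if y == c]
        log_prior[c] = math.log(len(rows) / len(vectors))
        # Total occurrences of each feature across the documents of class c.
        totals = [sum(row[i] for row in rows) for i in range(m)]
        denom = sum(totals) + m
        log_likelihood[c] = [math.log((t + 1) / denom) for t in totals]
    return log_prior, log_likelihood


def classify(vector, log_prior, log_likelihood):
    """Return argmax_c of log P(c) + sum_i n_i(d) * log P(f_i | c)."""
    def score(c):
        return log_prior[c] + sum(
            n * lp for n, lp in zip(vector, log_likelihood[c])
        )
    return max(log_prior, key=score)


# Toy data with two hypothetical features, e.g. counts of "good" and "bad".
vectors = [[2, 0], [1, 0], [0, 2], [0, 1]]
labels = ["+ve", "+ve", "-ve", "-ve"]
prior, likelihood = train_naive_bayes(vectors, labels)
print(classify([1, 0], prior, likelihood))
```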

Maximum Entropy

• Idea is to make fewest assumptions about the data while still being consistent with it
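For reference, the maximum entropy estimate of P(c | d) is usually written in the exponential form below. The symbols beyond those already defined (the weights λ, the normalizer Z(d), and the feature/class functions F) follow the common presentation of the model and should be read as notation introduced here, not as content recovered from the slide.

```latex
P_{ME}(c \mid d) = \frac{1}{Z(d)} \exp\Big( \sum_{i} \lambda_{i,c} \, F_{i,c}(d, c) \Big),
\qquad
F_{i,c}(d, c') =
\begin{cases}
1 & \text{if } n_i(d) > 0 \text{ and } c' = c \\
0 & \text{otherwise}
\end{cases}
```

Here Z(d) normalizes the distribution over classes, and a large λ_{i,c} means feature fi is treated as a strong indicator of class c.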

Support Vector Machines(SVM)

• Large-margin classifiers that, in contrast to Naïve Bayes and Maximum Entropy, are non-probabilistic

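• Letting $c_j \in \{1, -1\}$ (corresponding to +ve, -ve) be the correct class of document $d_j$, the solution can be written as $\vec{w} = \sum_j \alpha_j c_j \vec{d}_j$, with $\alpha_j \ge 0$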
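The sketch below shows only how a trained linear SVM would be used, not how it is trained. It assumes the weight vector has the form w = Σj αj cj dj with cj in {+1, -1} and αj ≥ 0; the α values are hypothetical stand-ins for what a solver would return, and the bias term `b` is an added assumption.

```python
def weight_vector(alphas, classes, documents):
    """Compute w = sum_j alpha_j * c_j * d_j from (hypothetical) solver output."""
    m = len(documents[0])
    w = [0.0] * m
    for alpha, c, d in zip(alphas, classes, documents):
        for i in range(m):
            w[i] += alpha * c * d[i]
    return w


def classify(w, d, b=0.0):
    """Return +1 or -1 depending on which side of the hyperplane d falls."""
    return 1 if sum(wi * di for wi, di in zip(w, d)) + b >= 0 else -1


# Two toy training documents; only documents with alpha_j > 0 contribute to w.
documents = [[1, 0], [0, 1]]
classes = [1, -1]
alphas = [0.5, 0.5]  # hypothetical values, not the result of real training
w = weight_vector(alphas, classes, documents)
print(w, classify(w, [2, 0]), classify(w, [0, 3]))
```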

Evaluations

• Randomly selected 700 positive-sentiment and 700 negative-sentiment documents

• Automatically removed rating indicators, extracted textual information from original HTML

• Added the tag NOT_ to every word between a negation word (e.g. “not”, “isn’t”) and the first punctuation mark that follows it.
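The negation tagging step can be sketched as follows. The negation word list, the punctuation set, and the tokenizer are assumptions made for illustration; the exact lists used in the experiments are not given on the slide.

```python
import re

# Illustrative lists; contractions ending in n't are handled in the loop below.
NEGATION_WORDS = {"not", "no", "never"}
PUNCTUATION = {".", ",", ";", ":", "!", "?"}


def add_negation_tags(text):
    """Prefix NOT_ to every token between a negation word and the next punctuation."""
    tokens = re.findall(r"[\w'’]+|[.,;:!?]", text.lower())
    tagged, negating = [], False
    for token in tokens:
        if token in PUNCTUATION:
            negating = False  # punctuation ends the negated span
            tagged.append(token)
        elif negating:
            tagged.append("NOT_" + token)
        else:
            tagged.append(token)
            if token in NEGATION_WORDS or token.endswith(("n't", "n’t")):
                negating = True
    return tagged


print(add_negation_tags("I didn't like this movie, but the cast is good."))
```

With this tagging, “like” and “NOT_like” become distinct features, so a unigram model can tell a negated word from a plain one.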

Results

Conclusion

• Unigram presence information turned out to be the most effective

• The superiority of presence information in comparison to feature frequency indicates a difference between sentiment and topic categorization.
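The difference between frequency and presence features is small in code: presence keeps only whether a feature occurs at all. A minimal sketch, with the hypothetical helper name `to_presence`:

```python
def to_presence(count_vector):
    """Map counts n_i(d) to binary presence: 1 if the feature occurs, else 0."""
    return [1 if n > 0 else 0 for n in count_vector]


# A document where the first feature occurs three times and the third once.
print(to_presence([3, 0, 1]))
```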
