1
Mining the peanut gallery:
Opinion extraction and semantic classification of product reviews

Kushal Dave (IBM), Steve Lawrence (Google), David M. Pennock (Overture)
Work done at NEC Laboratories
2
Problem
• Many reviews spread out across Web
  – Product-specific sites
  – Editorial reviews at C|net, magazines, etc.
  – User reviews at C|net, Amazon, Epinions, etc.
  – Reviews in blogs, author sites
  – Reviews in Google Groups (Usenet)
• No easy aggregation, but would like to get numerous diverse opinions
• Even at one site, synthesis difficult
3
Solution!
• Find reviews on the web and mine them…
– Filtering (find the reviews)
– Classification (positive or negative)
– Separation (identify and rate specific attributes)
4
Existing work
Classification varies in granularity/purpose:
• Objectivity – Does this text express fact or opinion?
• Words – What connotations does this word have?
• Sentiments – Is this text expressing a certain type of opinion?
5
Objectivity classification
• Best features: relative frequencies of parts of speech (Finn 02)
• Subjectivity is…subjective (Wiebe 01)
6
Word classification
• Polarity vs. intensity
• Conjunctions – (Hatzivassiloglou and McKeown, 97)
• Collocations – (Turney and Littman 02)
• Linguistic collocations – (Lin 98), (Pereira 93)
7
Sentiment classification
• Manual lexicons
  – Fuzzy logic affect sets (Subasic and Huettner 01)
  – Directionality – block/enable (Hearst 92)
  – Common Sense and emotions (Liu et al 03)
• Recommendations
  – Stocks in Yahoo (Das and Chen 01)
  – URLs in Usenet (Terveen 97)
  – Movie reviews in IMDb (Pang et al 02)
8
Applied tools
• AvaQuest’s GoogleMovies
  – http://24.60.188.10:8080/demos/GoogleMovies/GoogleMovies.cgi
• Clipping services
  – PlanetFeedback, Webclipping, eWatch, TracerLock, etc.
  – Commonly use templates or simple searches
• Feedback analysis
  – NEC’s Survey Analyzer
  – IBM’s eClassifier
• Targeted aggregation
  – BlogStreet, onfocus, AllConsuming, etc.
  – Rotten Tomatoes
9
Our approach
10
Train/test
• Existing tagged corpora!
• C|net – two cases
  – even, mixed: 10-fold test, 4,480 reviews, 4 categories
  – messy, by category: 7-fold test, 32,970 reviews
11
Other corpora
• Preliminary work with Amazon
  – (Potentially) easier problem
• Pang et al IMDb movie reviews corpus
  – Our simple algorithm is only insignificantly worse: 80.6 vs. 82.9 (unigram SVM)
  – Different sort of problem
12
Domain obstacles
• Typos, user error, inconsistencies, repeated reviews
• Skewed distributions
  – 5x as many positive reviews
  – 13,000 reviews of MP3 players, 350 of networking kits
  – Fewer than 10 reviews for ½ of products
• Variety of language, sparse data
  – 1/5 of reviews have fewer than 10 tokens
  – More than 2/3 of terms occur in fewer than 3 documents
• Misleading passages (ambivalence/comparison)
  – Battery life is good, but… Good idea, but…
(Bar chart: review counts by product category – Network, TV, Laser, Laptop, PDA, MP3, Camera – on a scale of 0 to 16,000.)
13
Base Features
• Unigrams
  – great, camera, support, easy, poor, back, excellent
• Bigrams
  – can't beat, simply the, very disappointed, quit working
• Trigrams, substrings
  – highly recommend this, i am very happy, not worth it, i sent it back, it stopped working
• Near
  – greater would, find negative, refund this, automatic poor, company totally
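As a concrete illustration, here is a minimal Python sketch (not the paper's code) of collecting unigram-through-trigram features from a tokenized review; the whitespace tokenizer is an assumption:

```python
from collections import Counter

def ngram_features(tokens, n_max=3):
    """Count all unigram, bigram, and trigram features in one review."""
    feats = Counter()
    for n in range(1, n_max + 1):
        for i in range(len(tokens) - n + 1):
            feats[" ".join(tokens[i:i + n])] += 1
    return feats

# Made-up example review:
print(ngram_features("i am very happy with this camera".split()))
```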
14
Generalization
• Replacing product names (metadata)
  – the iPod is an excellent ==> the _productname is an excellent
• Domain-specific words (statistical)
  – excellent [picture|sound] quality ==> excellent _producttypeword quality
• Rare words (statistical)
  – overshadowed by ==> _unique by
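A rough sketch of these substitutions; the name list, type-word list, and document-frequency table are assumed to be supplied externally (all helper inputs here are hypothetical):

```python
def generalize(tokens, product_names, type_words, doc_freq, rare_cutoff=3):
    """Apply the three substitutions above: metadata-derived product names,
    statistically derived product-type words, and rare words."""
    out = []
    for t in tokens:
        if t.lower() in product_names:
            out.append("_productname")
        elif t.lower() in type_words:
            out.append("_producttypeword")
        elif doc_freq.get(t.lower(), 0) < rare_cutoff:
            out.append("_unique")
        else:
            out.append(t)
    return out

tokens = "the iPod is an excellent player".split()
print(generalize(tokens, {"ipod"}, {"player"},
                 {"the": 99, "is": 99, "an": 99, "excellent": 50}))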
15
Generalization (2)
• Finding synsets in WordNet
  – anxious ==> 2377770
  – nervous ==> 2377770
• Stemming
  – was ==> be
  – compromised ==> compromis
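One way to reproduce these mappings with NLTK's WordNet interface: the offsets differ across WordNet versions, collapsing to the first synset is a simplification of the slide's synset replacement, and the two stemming examples correspond to WordNet's morphy (for irregular forms like was ==> be) and the Porter stemmer respectively.

```python
# pip install nltk; then nltk.download('wordnet')
from nltk.corpus import wordnet as wn
from nltk.stem import PorterStemmer

def synset_id(word):
    """Replace a word by the offset of its first WordNet synset,
    so near-synonyms can collapse to one feature."""
    synsets = wn.synsets(word)
    return synsets[0].offset() if synsets else word

stemmer = PorterStemmer()
print(synset_id("anxious"), synset_id("nervous"))
print(wn.morphy("was"))             # 'be' (irregular form via morphy)
print(stemmer.stem("compromised"))  # 'compromis' (Porter stemming)
```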
16
Qualification
• Parsing for collocation
  – this stupid piece of garbage ==> (piece(N):mod:stupid(A))…
  – this piece is stupid and ugly ==> (piece(N):mod:stupid(A))…
• Negation
  – not good or useful ==> NOTgood NOTor NOTuseful
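A minimal sketch of the negation transform; the punctuation-bounded scope and the choice to drop the negator itself are assumptions made to match the slide's example:

```python
NEGATORS = {"not", "no", "never", "n't"}
STOPPERS = {".", ",", ";", "!", "?"}

def mark_negation(tokens):
    """Fold a negator into the words that follow it, up to punctuation:
    not good or useful ==> NOTgood NOTor NOTuseful"""
    out, negate = [], False
    for t in tokens:
        if t in NEGATORS:
            negate = True            # consume the negator itself
        elif t in STOPPERS:
            negate = False
            out.append(t)
        else:
            out.append("NOT" + t if negate else t)
    return out

print(" ".join(mark_negation("not good or useful".split())))
```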
17
Optimal features
• Confidence/precision tradeoff
• Try to find best-length substrings
  – How to find features?
    • Marginal improvement when traversing suffix tree
    • Information gain worked best
  – How to score?
    • [p(C|w) – p(C’|w)] * df during construction
    • Dynamic programming or simple matching during use
  – Still usually short (n < 5)
    • Although… “i have had no problems with”
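A brute-force stand-in for the suffix-tree construction (enumerating token substrings directly) that scores each candidate with the construction-time metric [p(C|w) – p(C'|w)] * df; estimating p(C|w) as the labeled fraction of documents containing w is an assumption:

```python
from collections import Counter

def substring_scores(docs, labels, max_len=5):
    """docs: list of token lists; labels: 'pos' or 'neg' per doc.
    Returns [p(C|w) - p(C'|w)] * df for every token substring."""
    df_pos, df_all = Counter(), Counter()
    for tokens, label in zip(docs, labels):
        seen = set()
        for n in range(1, max_len + 1):
            for i in range(len(tokens) - n + 1):
                seen.add(" ".join(tokens[i:i + n]))
        for w in seen:
            df_all[w] += 1
            if label == "pos":
                df_pos[w] += 1
    scores = {}
    for w, df in df_all.items():
        p_pos = df_pos[w] / df            # estimate of p(C|w)
        scores[w] = (p_pos - (1 - p_pos)) * df
    return scores
```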
18
Clean up
• Thresholding
  – Reduce space with minimal impact
  – But not too much! (sparse data)
  – Keep only features occurring in > 3 docs
• Smoothing: allocate weight to unseen features, regularize other values
  – Tried Laplace, Simple Good-Turing, Witten-Bell
  – Laplace helps naïve Bayes
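For reference, a minimal Laplace (add-one) estimator of p(f|C), the kind of smoothing that helped naïve Bayes here; the alpha parameter name is ours:

```python
def laplace(count, total, vocab_size, alpha=1.0):
    """Add-one smoothing: unseen features get nonzero probability and
    observed estimates are pulled slightly toward uniform."""
    return (count + alpha) / (total + alpha * vocab_size)

# An unseen feature (count 0) over 10,000 feature tokens, 5,000 types:
print(laplace(0, 10000, 5000))  # small but nonzero
```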
19
Scoring
• SVMlight, naïve Bayes
  – Neither does better in both tests
  – Our scoring is simpler, less reliant on confidence, document length, corpus regularity
• Give each word a score ranging from –1 to 1
  – Our metric: score(fi) = [p(fi|C) – p(fi|C’)] / [p(fi|C) + p(fi|C’)]
  – Tried Fisher discriminant, information gain, odds ratio
• If sum > 0, it's a positive review
• Other probability models – presence
• Bootstrapping barely helps
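Put together, the scoring scheme fits in a few lines; in this sketch the probability estimates are assumed to come from the smoothed counts above:

```python
def feature_score(p_c, p_cprime):
    """score(fi) = [p(fi|C) - p(fi|C')] / [p(fi|C) + p(fi|C')], in [-1, 1]."""
    return (p_c - p_cprime) / (p_c + p_cprime)

def classify(features, scores):
    """Sum the per-feature scores; a positive sum means a positive review."""
    return "positive" if sum(scores.get(f, 0.0) for f in features) > 0 else "negative"
```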
20
Reweighting
• Document frequency
  – df, idf, normalized df, ridf
  – Some improvement from log df and Gaussian (arbitrary)
• Analogues for term frequency, product frequency, product type frequency
• Tried bootstrapping, different ways of assigning scores
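The slide does not give the exact reweighting formulas; as one hedged reading, the log-df variant might look like:

```python
import math

def logdf_weight(score, df):
    """Damp a feature score by the log of its document frequency --
    a guess at the 'log df' reweighting, not the paper's exact formula."""
    return score * math.log(1 + df)
```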
21
Test 1 (clean) Best Results

Unigrams:
  Baseline                 85.0 %
  Naïve Bayes + Laplace    87.0 %
  Weighting (log df)       85.7 %
  Weighting (Gaussian tf)  85.7 %

Trigrams:
  Baseline                 88.7 %
  + _productname           88.9 %
22
Test 2 (messy) Best Results

Unigrams:
  Baseline                 82.2 %
  Prob. after threshold    83.3 %
  Presence prob. model     83.1 %
  + Odds ratio             83.3 %

Bigrams:
  Baseline                 84.6 %
  + Odds ratio             85.4 %
  + SVM                    85.8 %

Variable:
  Baseline                 85.1 %
  + _productname           85.3 %
23
Extraction
Can we use the techniques from classification to mine reviews?
24
Obstacles
• Words have different implications in general usage
  – “main characters” is negatively biased in reviews, but has innocuous uses elsewhere
• Much non-review subjective content
  – previews, sales sites, technical support, passing mentions and comparisons, lists
25
Results
• Manual test set + web interface
• Substrings better… bigrams too general?
• Initial filtering is bad
  – Make “review” part of the actual search
  – Threshold on sentence length and score
  – Still get non-opinions and ambiguous opinions
• Test corpus, with red herrings removed: 76% accuracy in top confidence tercile
28
Attribute heuristics
• Look for _producttypewords
• Or, just look for things starting with “the”
  – Of course, some thresholding
  – And stopwords
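A sketch of this “dumb heuristic”: count bigrams that start with the, filter a stopword list, and keep the frequent ones (the stopword set and threshold here are illustrative, not the paper's):

```python
from collections import Counter

STOP_NEXT = {"same", "only", "first", "most", "way"}  # illustrative stopwords

def the_attributes(reviews, min_count=5):
    """Collect frequent 'the X' bigrams as candidate product attributes."""
    counts = Counter()
    for tokens in reviews:
        for i in range(len(tokens) - 1):
            if tokens[i] == "the" and tokens[i + 1] not in STOP_NEXT:
                counts["the " + tokens[i + 1]] += 1
    return [p for p, c in counts.most_common() if c >= min_count]
```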
29
Results
• No way of judging accuracy; arbitrary
• For example, Amstel Light
  – Ideally: taste, price, image…
  – Statistical method: beer, bud, taste, adjunct, other, brew, lager, golden, imported…
  – Dumb heuristic: the taste, the flavor, the calories, the best…
32
Future work
• Methodology
  – More precise corpus
  – More tests, more precise tests
• Algorithm
  – New combinations of techniques
  – Decrease overfitting to improve web search
  – Computational complexity
  – Ambiguity detection?
• New pieces
  – Separate review/non-review classifier
  – Review context (source, time)
  – Segment pages