1
Mining the peanut gallery:
Opinion extraction and semantic classification of product reviews

Kushal Dave (IBM), Steve Lawrence (Google), David M. Pennock (Overture)
Work done at NEC Laboratories
2
Problem
• Many reviews spread out across Web
  – Product-specific sites
  – Editorial reviews at C|net, magazines, etc.
  – User reviews at C|net, Amazon, Epinions, etc.
  – Reviews in blogs, author sites
  – Reviews in Google Groups (Usenet)
• No easy aggregation, but would like to get numerous diverse opinions
• Even at one site, synthesis difficult
3
Solution!
• Find reviews on the web and mine them…
– Filtering (find the reviews)
– Classification (positive or negative)
– Separation (identify and rate specific attributes)
4
Existing work
Classification varies in granularity/purpose:
• Objectivity – Does this text express fact or opinion?
• Words – What connotations does this word have?
• Sentiments – Is this text expressing a certain type of opinion?
5
Objectivity classification
• Best features: relative frequencies of parts of speech (Finn 02)
• Subjectivity is…subjective (Wiebe 01)
6
Word classification
• Polarity vs. intensity
• Conjunctions – (Hatzivassiloglou and McKeown, 97)
• Collocations – (Turney and Littman 02)
• Linguistic collocations – (Lin 98), (Pereira 93)
7
Sentiment classification
• Manual lexicons
  – Fuzzy logic affect sets (Subasic and Huettner 01)
  – Directionality – block/enable (Hearst 92)
  – Common Sense and emotions (Liu et al 03)
• Recommendations
  – Stocks in Yahoo (Das and Chen 01)
  – URLs in Usenet (Terveen 97)
  – Movie reviews in IMDb (Pang et al 02)
8
Applied tools
• AvaQuest’s GoogleMovies
  – http://24.60.188.10:8080/demos/GoogleMovies/GoogleMovies.cgi
• Clipping services
  – PlanetFeedback, Webclipping, eWatch, TracerLock, etc.
  – Commonly use templates or simple searches
• Feedback analysis
  – NEC’s Survey Analyzer
  – IBM’s eClassifier
• Targeted aggregation
  – BlogStreet, onfocus, AllConsuming, etc.
  – Rotten Tomatoes
9
Our approach
10
Train/test
• Existing tagged corpora!
• C|net – two cases
  – even, mixed: 10-fold test, 4,480 reviews, 4 categories
  – messy, by category: 7-fold test, 32,970 reviews
11
Other corpora
• Preliminary work with Amazon
  – (Potentially) easier problem
• Pang et al IMDb movie reviews corpus
  – Our simple algorithm is only insignificantly worse: 80.6 vs. 82.9 (unigram SVM)
  – Different sort of problem
12
Domain obstacles
• Typos, user error, inconsistencies, repeated reviews
• Skewed distributions
  – 5x as many positive reviews
  – 13,000 reviews of MP3 players, 350 of networking kits
  – Fewer than 10 reviews for ½ of products
• Variety of language, sparse data
  – 1/5 of reviews have fewer than 10 tokens
  – More than 2/3 of terms occur in fewer than 3 documents
• Misleading passages (ambivalence/comparison)
  – Battery life is good, but… Good idea, but…
(Bar chart: review counts by product category – Network, TV, Laser, Laptop, PDA, MP3, Camera – on a scale of 0 to 16,000.)
13
Base Features
• Unigrams
  – great, camera, support, easy, poor, back, excellent
• Bigrams
  – can't beat, simply the, very disappointed, quit working
• Trigrams, substrings
  – highly recommend this, i am very happy, not worth it, i sent it back, it stopped working
• Near
  – greater would, find negative, refund this, automatic poor, company totally
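As a concrete illustration, here is a minimal Python sketch (not the paper's code) of collecting unigram-through-trigram features from a tokenized review; the whitespace tokenizer is an assumption:

```python
from collections import Counter

def ngram_features(tokens, n_max=3):
    """Count all unigram, bigram, and trigram features in one review."""
    feats = Counter()
    for n in range(1, n_max + 1):
        for i in range(len(tokens) - n + 1):
            feats[" ".join(tokens[i:i + n])] += 1
    return feats

# Made-up example review:
print(ngram_features("i am very happy with this camera".split()))
```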
14
Generalization
• Replacing product names (metadata)
  – the iPod is an excellent ==> the _productname is an excellent
• Domain-specific words (statistical)
  – excellent [picture|sound] quality ==> excellent _producttypeword quality
• Rare words (statistical)
  – overshadowed by ==> _unique by
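A rough sketch of these substitutions; the name list, type-word list, and document-frequency table are assumed to be supplied externally (all helper inputs here are hypothetical):

```python
def generalize(tokens, product_names, type_words, doc_freq, rare_cutoff=3):
    """Apply the three substitutions above: metadata-derived product names,
    statistically derived product-type words, and rare words."""
    out = []
    for t in tokens:
        if t.lower() in product_names:
            out.append("_productname")
        elif t.lower() in type_words:
            out.append("_producttypeword")
        elif doc_freq.get(t.lower(), 0) < rare_cutoff:
            out.append("_unique")
        else:
            out.append(t)
    return out

tokens = "the iPod is an excellent player".split()
print(generalize(tokens, {"ipod"}, {"player"},
                 {"the": 99, "is": 99, "an": 99, "excellent": 50}))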
15
Generalization (2)
• Finding synsets in WordNet
  – anxious ==> 2377770
  – nervous ==> 2377770
• Stemming
  – was ==> be
  – compromised ==> compromis
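One way to reproduce these mappings with NLTK's WordNet interface: the offsets differ across WordNet versions, collapsing to the first synset is a simplification of the slide's synset replacement, and the two stemming examples correspond to WordNet's morphy (for irregular forms like was ==> be) and the Porter stemmer respectively.

```python
# pip install nltk; then nltk.download('wordnet')
from nltk.corpus import wordnet as wn
from nltk.stem import PorterStemmer

def synset_id(word):
    """Replace a word by the offset of its first WordNet synset,
    so near-synonyms can collapse to one feature."""
    synsets = wn.synsets(word)
    return synsets[0].offset() if synsets else word

stemmer = PorterStemmer()
print(synset_id("anxious"), synset_id("nervous"))
print(wn.morphy("was"))             # 'be' (irregular form via morphy)
print(stemmer.stem("compromised"))  # 'compromis' (Porter stemming)
```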
16
Qualification
• Parsing for collocation
  – this stupid piece of garbage ==> (piece(N):mod:stupid(A))…
  – this piece is stupid and ugly ==> (piece(N):mod:stupid(A))…
• Negation
  – not good or useful ==> NOTgood NOTor NOTuseful
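A minimal sketch of the negation transform; the punctuation-bounded scope and the choice to drop the negator itself are assumptions made to match the slide's example:

```python
NEGATORS = {"not", "no", "never", "n't"}
STOPPERS = {".", ",", ";", "!", "?"}

def mark_negation(tokens):
    """Fold a negator into the words that follow it, up to punctuation:
    not good or useful ==> NOTgood NOTor NOTuseful"""
    out, negate = [], False
    for t in tokens:
        if t in NEGATORS:
            negate = True            # consume the negator itself
        elif t in STOPPERS:
            negate = False
            out.append(t)
        else:
            out.append("NOT" + t if negate else t)
    return out

print(" ".join(mark_negation("not good or useful".split())))
```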
17
Optimal features
• Confidence/precision tradeoff
• Try to find best-length substrings
  – How to find features?
    • Marginal improvement when traversing suffix tree
    • Information gain worked best
  – How to score?
    • [p(C|w) – p(C’|w)] * df during construction
    • Dynamic programming or simple matching during use
  – Still usually short (n < 5)
    • Although… “i have had no problems with”
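A brute-force stand-in for the suffix-tree construction (enumerating token substrings directly) that scores each candidate with the construction-time metric [p(C|w) – p(C'|w)] * df; estimating p(C|w) as the labeled fraction of documents containing w is an assumption:

```python
from collections import Counter

def substring_scores(docs, labels, max_len=5):
    """docs: list of token lists; labels: 'pos' or 'neg' per doc.
    Returns [p(C|w) - p(C'|w)] * df for every token substring."""
    df_pos, df_all = Counter(), Counter()
    for tokens, label in zip(docs, labels):
        seen = set()
        for n in range(1, max_len + 1):
            for i in range(len(tokens) - n + 1):
                seen.add(" ".join(tokens[i:i + n]))
        for w in seen:
            df_all[w] += 1
            if label == "pos":
                df_pos[w] += 1
    scores = {}
    for w, df in df_all.items():
        p_pos = df_pos[w] / df            # estimate of p(C|w)
        scores[w] = (p_pos - (1 - p_pos)) * df
    return scores
```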
18
Clean up
• Thresholding
  – Reduce space with minimal impact
  – But not too much! (sparse data)
  – Keep only features occurring in > 3 docs
• Smoothing: allocate weight to unseen features, regularize other values
  – Tried Laplace, Simple Good-Turing, Witten-Bell
  – Laplace helps naïve Bayes
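For reference, a minimal Laplace (add-one) estimator of p(f|C), the kind of smoothing that helped naïve Bayes here; the alpha parameter name is ours:

```python
def laplace(count, total, vocab_size, alpha=1.0):
    """Add-one smoothing: unseen features get nonzero probability and
    observed estimates are pulled slightly toward uniform."""
    return (count + alpha) / (total + alpha * vocab_size)

# An unseen feature (count 0) over 10,000 feature tokens, 5,000 types:
print(laplace(0, 10000, 5000))  # small but nonzero
```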
19
Scoring
• SVMlight, naïve Bayes
  – Neither does better in both tests
  – Our scoring is simpler, less reliant on confidence, document length, corpus regularity
• Give each word a score ranging from –1 to 1
  – Our metric: score(fi) = [p(fi|C) – p(fi|C’)] / [p(fi|C) + p(fi|C’)]
  – Tried Fisher discriminant, information gain, odds ratio
• If sum > 0, it's a positive review
• Other probability models – presence
• Bootstrapping barely helps
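Put together, the scoring scheme fits in a few lines; in this sketch the probability estimates are assumed to come from the smoothed counts above:

```python
def feature_score(p_c, p_cprime):
    """score(fi) = [p(fi|C) - p(fi|C')] / [p(fi|C) + p(fi|C')], in [-1, 1]."""
    return (p_c - p_cprime) / (p_c + p_cprime)

def classify(features, scores):
    """Sum the per-feature scores; a positive sum means a positive review."""
    return "positive" if sum(scores.get(f, 0.0) for f in features) > 0 else "negative"
```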
20
Reweighting
• Document frequency
  – df, idf, normalized df, ridf
  – Some improvement from log df and Gaussian (arbitrary)
• Analogues for term frequency, product frequency, product type frequency
• Tried bootstrapping, different ways of assigning scores
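The slide does not give the exact reweighting formulas; as one hedged reading, the log-df variant might look like:

```python
import math

def logdf_weight(score, df):
    """Damp a feature score by the log of its document frequency --
    a guess at the 'log df' reweighting, not the paper's exact formula."""
    return score * math.log(1 + df)
```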
21
Test 1 (clean) Best Results

Unigrams:
  Baseline                 85.0 %
  Naïve Bayes + Laplace    87.0 %
  Weighting (log df)       85.7 %
  Weighting (Gaussian tf)  85.7 %

Trigrams:
  Baseline                 88.7 %
  + _productname           88.9 %
22
Test 2 (messy) Best Results

Unigrams:
  Baseline                 82.2 %
  Prob. after threshold    83.3 %
  Presence prob. model     83.1 %
  + Odds ratio             83.3 %

Bigrams:
  Baseline                 84.6 %
  + Odds ratio             85.4 %
  + SVM                    85.8 %

Variable:
  Baseline                 85.1 %
  + _productname           85.3 %
23
Extraction
Can we use the techniques from classification to mine reviews?
24
Obstacles
• Words have different implications in general usage
  – “main characters” is negatively biased in reviews, but has innocuous uses elsewhere
• Much non-review subjective content
  – previews, sales sites, technical support, passing mentions and comparisons, lists
25
Results
• Manual test set + web interface
• Substrings better… bigrams too general?
• Initial filtering is bad
  – Make “review” part of the actual search
  – Threshold on sentence length and score
  – Still get non-opinions and ambiguous opinions
• Test corpus, with red herrings removed: 76% accuracy in top confidence tercile
28
Attribute heuristics
• Look for _producttypewords
• Or, just look for things starting with “the”
  – Of course, some thresholding
  – And stopwords
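A sketch of this “dumb heuristic”: count bigrams that start with the, filter a stopword list, and keep the frequent ones (the stopword set and threshold here are illustrative, not the paper's):

```python
from collections import Counter

STOP_NEXT = {"same", "only", "first", "most", "way"}  # illustrative stopwords

def the_attributes(reviews, min_count=5):
    """Collect frequent 'the X' bigrams as candidate product attributes."""
    counts = Counter()
    for tokens in reviews:
        for i in range(len(tokens) - 1):
            if tokens[i] == "the" and tokens[i + 1] not in STOP_NEXT:
                counts["the " + tokens[i + 1]] += 1
    return [p for p, c in counts.most_common() if c >= min_count]
```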
29
Results
• No way of judging accuracy; arbitrary
• For example, Amstel Light
  – Ideally: taste, price, image…
  – Statistical method: beer, bud, taste, adjunct, other, brew, lager, golden, imported…
  – Dumb heuristic: the taste, the flavor, the calories, the best…
32
Future work
• Methodology
  – More precise corpus
  – More tests, more precise tests
• Algorithm
  – New combinations of techniques
  – Decrease overfitting to improve web search
  – Computational complexity
  – Ambiguity detection?
• New pieces
  – Separate review/non-review classifier
  – Review context (source, time)
  – Segment pages