Collective Opinion Spam Detection using Active Inference
Rayana & Akoglu
How do consumers learn about product quality?
- Advertisements
- Consumer review websites (Yelp, TripAdvisor, etc.)

What is the impact of consumer reviews on sales?
- A one-star increase in rating raises revenue by 5-9% (Harvard study by M. Luca, "Reviews, Reputation, and Revenue: The Case of Yelp.com")
Paid or biased reviewers write fake reviews to unjustly promote or demote products and businesses.

Why is this a hard problem? Humans are only slightly better than chance at spotting fake reviews ("Finding Deceptive Opinion Spam by Any Stretch of the Imagination", Ott et al. 2011).
Online Review System
- Review network + metadata
- Spammers write fake reviews for target products
- An Oracle (e.g., a human annotator) provides labels within a budget
Related work, by the signals each method uses:

Method          | Review Network | Review Text | Review Behavior | Active Inference
----------------|----------------|-------------|-----------------|-----------------
Ott'2011        |                | ✓           |                 |
Mukherjee'2013  |                | ✓           | ✓               |
Jindal'2008     |                | ✓           |                 |
Wang'2011       | ✓              |             | ✓               |
FraudEagle      | ✓              |             |                 |
SpEagle         | ✓              | ✓           | ✓               |
SpEagle+        | ✓              | ✓           | ✓               |
SpEagle + EUCR  | ✓              | ✓           | ✓               | ✓
A network classification problem

Given:
- a user-review-product network (tri-partite): U --writes--> R --belongs to--> P
- features extracted from metadata (i.e., text, behavior) for users, reviews, and products
- an Oracle (e.g., a human annotator, Yelp.com)
- a budget B

Goal: select queries and classify all network objects into type-specific classes:
- Users: 'benign' vs. 'spammer'
- Products: 'non-target' vs. 'target'
- Reviews: 'genuine' vs. 'fake'
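The tri-partite input above can be represented as a small typed edge store; all names in this sketch are illustrative, not from the paper's code.

```python
from dataclasses import dataclass, field

@dataclass
class ReviewNetwork:
    """Tri-partite user-review-product network: U --writes--> R --belongs to--> P."""
    writes: dict = field(default_factory=dict)    # review id -> user id
    belongs: dict = field(default_factory=dict)   # review id -> product id

    def add_review(self, review, user, product):
        self.writes[review] = user
        self.belongs[review] = product

    def reviews_of_user(self, user):
        # all reviews written by a given user (insertion order preserved)
        return [r for r, u in self.writes.items() if u == user]

g = ReviewNetwork()
g.add_review("r1", "u1", "p1")
g.add_review("r2", "u1", "p2")
g.add_review("r3", "u2", "p1")
```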
Objective
- Wisely select "valuable" nodes to query
- Find a metric that quantifies the "value" of a node
- Achieve improved performance over random selection
SpEagle: a collective classification approach (unsupervised)
- The objective function is a pairwise Markov Random Field; exact inference is NP-hard
- Loopy Belief Propagation (LBP) is used for approximate inference:

1) Repeat for each node i, passing a message to each neighbor j:
   m_{i→j}(y_j) ∝ Σ_{y_i} φ_i(y_i) ψ_{ij}(y_i, y_j) Π_{k∈N(i)\{j}} m_{k→i}(y_i)
   with prior φ_i and compatibility potential ψ_{ij} for the given edge type.

2) At convergence, the belief of node i is:
   b_i(y_i) ∝ φ_i(y_i) Π_{j∈N(i)} m_{j→i}(y_i)
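The LBP updates above can be sketched as follows. The toy graph, shared potential, and priors are illustrative assumptions, not the paper's actual network.

```python
import numpy as np

def loopy_bp(edges, priors, psi, n_iters=50):
    """Loopy belief propagation with binary classes.

    edges:  list of (i, j) undirected edges
    priors: dict node -> length-2 prior phi_i
    psi:    2x2 compatibility potential (shared by all edges in this sketch)
    """
    # m[(i, j)] is the message from i to j, initialized uniform
    msgs = {}
    for i, j in edges:
        msgs[(i, j)] = np.array([0.5, 0.5])
        msgs[(j, i)] = np.array([0.5, 0.5])
    nbrs = {n: set() for n in priors}
    for i, j in edges:
        nbrs[i].add(j); nbrs[j].add(i)

    for _ in range(n_iters):
        new = {}
        for (i, j) in msgs:
            # product of prior and incoming messages to i, excluding the one from j
            prod = priors[i].copy()
            for k in nbrs[i] - {j}:
                prod *= msgs[(k, i)]
            m = psi.T @ prod           # sum over y_i of psi(y_i, y_j) * prod(y_i)
            new[(i, j)] = m / m.sum()  # normalize for numerical stability
        msgs = new

    beliefs = {}
    for n in priors:
        b = priors[n].copy()
        for k in nbrs[n]:
            b *= msgs[(k, n)]
        beliefs[n] = b / b.sum()
    return beliefs

# toy 3-node chain 0 - 1 - 2; class 0 = spam, node 0's prior leans spam
priors = {0: np.array([0.9, 0.1]), 1: np.array([0.5, 0.5]), 2: np.array([0.5, 0.5])}
psi = np.array([[0.8, 0.2], [0.2, 0.8]])  # homophily: like connects to like
beliefs = loopy_bp([(0, 1), (1, 2)], priors, psi)
```

Homophily in the potential pulls node 1 toward node 0's class, even though node 1's own prior is uniform.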
Priors and compatibility potential [Rayana et al., KDD 2015]
- Metadata features, (i) review text and (ii) behavioral (timestamp, rating), are converted into a spam score that serves as each node's prior
- Classes: Users: 'benign' vs. 'spammer'; Products: 'non-target' vs. 'target'; Reviews: 'genuine' vs. 'fake'
- The compatibility potential encodes how likely the classes of neighboring nodes are to co-occur
SpEagle can incorporate labels seamlessly (SpEagle+); it can use user, review, and/or product labels.

For labeled nodes, priors are set to:
- φ ← {ϵ, 1 − ϵ} for the spam category (i.e., fake, spammer, or target)
- φ ← {1 − ϵ, ϵ} for the non-spam category (i.e., genuine, benign, or non-target)

In SpEagle+, nodes are chosen randomly for labeling.
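The labeled-prior rule can be sketched as follows; the epsilon value and the class ordering (non-spam component first) are assumptions of this sketch.

```python
EPS = 0.01  # illustrative choice of epsilon

def labeled_prior(label):
    """Prior phi over (non-spam, spam) for a labeled node.
    'spam' covers fake / spammer / target; anything else is treated as
    the non-spam category (genuine / benign / non-target)."""
    if label == "spam":
        return (EPS, 1 - EPS)   # phi <- {eps, 1 - eps}
    return (1 - EPS, EPS)       # phi <- {1 - eps, eps}
```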
Active inference

Settings:
- The existing inference model poses queries to select "valuable" nodes
- Selected nodes are labeled by an Oracle (e.g., a human)
- Labels are utilized at inference time

Objective:
- Minimize labeling cost, maximize classification performance
- Achieve higher accuracy within a budget
Objective: find "islands" of uncertainty. EUCR incorporates three important characteristics of a valuable node:
i. the node's own uncertainty (self-uncertainty)
ii. the density of the region it belongs to
iii. its proximity to other uncertain nodes
Step 1 (i + ii): Calculate the Weighted UnCertainty score,

  WUC_x = w_x · H(b_x) = −w_x Σ_i b_x(y_i) log b_x(y_i)

where b_x(y_i) is the belief of node x belonging to class y_i, and the weight is

  w_x = (UD_x − minUD) / (maxUD − minUD)

with:
- UD_x = user degree of review node x (the degree of the user who wrote x)
- minUD = minimum degree of a user node
- maxUD = maximum degree of a user node

Step 2: Find the set S of the top k nodes with the highest WUC scores.
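A minimal sketch of Steps 1-2, assuming the weight is the min-max-normalized degree of the review's author (the paper's exact weighting may differ):

```python
import math

def wuc_top_k(beliefs, user_degree, k):
    """Weighted UnCertainty: entropy of a review's belief, weighted by the
    (min-max normalized) degree of the user who wrote it. Returns top-k ids."""
    degs = user_degree.values()
    lo, hi = min(degs), max(degs)
    scores = {}
    for x, b in beliefs.items():
        entropy = -sum(p * math.log(p) for p in b if p > 0)
        weight = (user_degree[x] - lo) / (hi - lo) if hi > lo else 1.0
        scores[x] = weight * entropy
    # Step 2: set S of top-k nodes by WUC score
    return sorted(scores, key=scores.get, reverse=True)[:k]

beliefs = {"r1": (0.5, 0.5), "r2": (0.9, 0.1), "r3": (0.6, 0.4)}
user_degree = {"r1": 10, "r2": 10, "r3": 2}   # degree of each review's author
top = wuc_top_k(beliefs, user_degree, k=2)
```

r1 wins: it is both maximally uncertain and written by a high-degree user; r3 is fairly uncertain but its author's low degree zeroes out its weight.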
Step 3 (iii): The proximity of node x ∈ S to every node j ∈ R is calculated using the Random Walk with Restart (RWR) probability,

  r_x = c W r_x + (1 − c) e_x

where c = 0.85, W is the column-normalized adjacency matrix of G_R (the review-review network), and e_x is the restart vector with e_x(x) = 1 and 0 otherwise.

Solving for the steady state, the proximity can be expressed as

  r_x = (1 − c) (I − c W)^{−1} e_x

Step 4: Select the most valuable node x* from S as the one whose uncertainty "reach" is largest, i.e., the node whose RWR proximities cover the most uncertainty mass in the network.
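Steps 3-4 can be sketched with RWR computed by power iteration. The toy network, the WUC values, and the particular combination of proximity with uncertainty (a dot product) are illustrative assumptions of this sketch.

```python
import numpy as np

def rwr(W, x, c=0.85, n_iters=100):
    """Random Walk with Restart from node x.
    W: column-normalized adjacency matrix; returns proximity vector r_x."""
    n = W.shape[0]
    e = np.zeros(n); e[x] = 1.0
    r = e.copy()
    for _ in range(n_iters):
        r = c * (W @ r) + (1 - c) * e   # r_x = c W r_x + (1 - c) e_x
    return r

def select_most_valuable(W, candidates, wuc):
    """Step 4 (sketch): pick the candidate whose RWR proximity
    reaches the most uncertainty mass."""
    scores = {x: float(rwr(W, x) @ wuc) for x in candidates}
    return max(scores, key=scores.get)

# toy review-review network: path 0 - 1 - 2, column-normalized adjacency
A = np.array([[0, 1, 0], [1, 0, 1], [0, 1, 0]], dtype=float)
W = A / A.sum(axis=0, keepdims=True)
wuc = np.array([0.1, 0.7, 0.6])       # hypothetical WUC scores per node
best = select_most_valuable(W, [0, 1], wuc)
```

Node 1 wins: it is both uncertain itself and adjacent to the other uncertain node, so its "reach" covers more uncertainty than node 0's.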
Uncertainty Sampling (US):
- Valuable node: the data instance with the highest uncertainty
- Metric: entropy of the final beliefs, H(b_x) = −Σ_i b_x(y_i) log b_x(y_i)

Query-by-Committee (QBC):
- The committee consists of multiple members
- Valuable node: the instance on which the committee members disagree most
- Metric: average Kullback-Leibler (KL) divergence (soft voting, QBC-SV)
QBC committee construction [McCallum+, 98]:
- Build a committee C = {θ^(1), …, θ^(|C|)} with review-feature bagging: 4 out of 16 review features are randomly selected without replacement for each member
- Calculate disagreement using the average KL divergence,

  QBC-SV(x) = (1/|C|) Σ_c KL( b_x^(c) ‖ b̄_x ),  where b̄_x = (1/|C|) Σ_c b_x^(c)

i.e., each member's belief is compared against the committee's mean (consensus) belief.
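The soft-vote disagreement can be sketched as follows, assuming each member outputs a belief distribution and the consensus is the committee mean:

```python
import math

def kl(p, q):
    """KL divergence KL(p || q) between discrete distributions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def qbc_sv(member_beliefs):
    """Soft-vote disagreement: mean KL of each committee member's belief
    from the committee's consensus (mean) belief."""
    n = len(member_beliefs)
    consensus = [sum(b[i] for b in member_beliefs) / n
                 for i in range(len(member_beliefs[0]))]
    return sum(kl(b, consensus) for b in member_beliefs) / n

agree    = qbc_sv([(0.9, 0.1), (0.9, 0.1)])   # members agree -> ~0
disagree = qbc_sv([(0.9, 0.1), (0.1, 0.9)])   # members conflict -> large
```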
Two flavors of QBC disagreement [Sharma+, 13]:
(i) Most-sure disagreement (QBC-MS): strong and conflicting evidence
(ii) Least-sure disagreement (QBC-LS): no conclusive evidence

Objectives to optimize: the committee members should disagree on node x, and the overall evidence for x, aggregated over members voting positive (+ve evidence) and members voting negative (−ve evidence), should be large for most-sure disagreement and small for least-sure disagreement.
Valuable node selection:
- Find the set S of the top k nodes with the highest QBC-SV scores
- QBC-MS selects the node with the maximum evidence from S
- QBC-LS selects the node with the minimum evidence from S
ALFNET (adapted) requires two classifiers:
1. a content-only classifier (CO): logistic regression on features from metadata, and
2. a collective classifier (CC): SpEagle (metadata + review network)

Valuable node: the instance on which the decisions of CO and CC differ most.

Constraints on CO: it requires enough labels for training, it is susceptible to class imbalance, and it must be re-trained at each iteration.
Two Yelp datasets¹: recommended vs. non-recommended (filtered) reviews
- YelpChi: hotel and restaurant reviews from Chicago
- YelpNYC: restaurant reviews from New York City

Settings: pool-based active inference
Query selection: review nodes only
Oracle: Yelp.com

¹ The datasets are made available to the community.
² A spammer has at least one filtered review.
EUCR is superior to random selection and to the adapted existing approaches.

Figure: Average Precision (AP) at different budgets, for YelpChi and YelpNYC.
Figure: NDCG@100 (left) and NDCG@1000 (right) vs. budget, for YelpChi and YelpNYC.
- EUCR provides correct labels to almost as many users as the budget size, and enough fake reviews get labeled
- RS & ALFNET: imbalanced fake vs. genuine reviews; ALFNET labels multiple reviews of the same user
- US & QBC: select the (uncertain or most-disagreed-upon) node selfishly, without considering its neighborhood
With a budget as small as 300, EUCR outperforms random selection and the other adapted baselines.
Main contributions:
- Adapted existing label acquisition approaches to the network inference setting
- Defined the characteristics of valuable nodes
- Proposed a new metric, Expected UnCertainty Reach (EUCR), to quantify node value
- Achieved improved performance
- Evaluated on two large real-world datasets from Yelp.com
Email [email protected] for code and data.
For the datasets, email [email protected]
http://www.cs.stonybrook.edu/~datalab/