Online feedback correlation using clustering


My presentation for an Internet search class. I theorized that you could automatically determine how good a product is from the different types of negative reviews it receives.


Online Feedback Correlation using Clustering

Research Work Done for CS 651: Internet Algorithms

Dedicated to Tibor Horvath

Whose endless pursuit of a PhD (imagine that) kept him from researching this topic.

Problem Statement

- Millions of reviews available
- Consumers read only a small number of reviews
- Reviewer content is not always trustworthy

Problem Statement (continued)

- What information from reviews is important?
- What can we extract efficiently from the overall set of reviews to provide more utility to consumers than is already provided?

Motivation

People are increasingly relying on online feedback mechanisms in making choices [Guernsey 2000]

- Online feedback mechanisms draw consumers
- Competitive edge
- Quality currently bad

Current Solutions

- “Good” review placement
- Show a small number of reviews
- … more trustworthy?

Amazon Example

Observations

- Consumers look at a product based on its overall rating
- Consumers read the “editorial review” for content
- Reviews can indicate common issues

… Can we correlate these reviews in some meaningful way?

Observations Lead to Hypotheses!

Hypothesis: Products with numerous similar negative reviews will often not be purchased, regardless of their positive reviews. Furthermore, the number of negative reviews is a strong indicator of the likelihood of certain flaws in a product.

Definitions

Semantic Orientation: polar classification of whether something is positive or negative

Natural Language Processing: deciphering parts of speech from free text

Feature: quality of a product that customers care about

Feature Vector: vector representing a review in a d-dimensional space where each dimension represents a feature.

Overview of Project

- Obtain a large repository of customer reviews
- Extract features from customer reviews and orient them
- Create feature vectors, e.g. [1, 0, -1, 1, 1, -1, …], from reviews and features (a sketch follows below)
- Cluster feature vectors to find large negative clusters
- Analyze clusters and compare to the hypothesis
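As a concrete illustration of the feature-vector step, here is a minimal Java sketch. The fixed feature list, the example sentences, and the per-sentence orientation lookup are all hypothetical stand-ins; the actual modules derived orientation from NLProcessor output and a WordNet-expanded adjective seed set.

```java
import java.util.*;

// Minimal sketch of the feature-vector step (hypothetical names throughout).
// Each dimension holds +1 / -1 / 0: the orientation of the sentence(s)
// mentioning that feature, or 0 if the feature is never mentioned.
public class FeatureVectorSketch {

    static final String[] FEATURES = {"battery", "screen", "sound", "size", "software", "price"};

    // sentenceOrientation: +1 for a positive sentence, -1 for a negative one.
    static int[] toFeatureVector(List<String> sentences, Map<String, Integer> sentenceOrientation) {
        int[] vector = new int[FEATURES.length];
        for (String sentence : sentences) {
            Integer orientation = sentenceOrientation.get(sentence);
            if (orientation == null) continue;   // sentence had no orientable adjective
            for (int d = 0; d < FEATURES.length; d++) {
                if (sentence.toLowerCase().contains(FEATURES[d])) {
                    vector[d] = orientation;     // features in the same sentence share its orientation
                }
            }
        }
        return vector;                           // e.g. [1, 0, -1, 1, 1, -1]
    }

    public static void main(String[] args) {
        List<String> review = List.of(
                "The battery life is great.",
                "The screen scratches terribly.");
        Map<String, Integer> orientation = Map.of(
                "The battery life is great.", 1,
                "The screen scratches terribly.", -1);
        System.out.println(Arrays.toString(toFeatureVector(review, orientation)));
        // -> [1, -1, 0, 0, 0, 0]
    }
}
```

Each review thus becomes one point in a d-dimensional space (d = 6 here), ready for clustering.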

Related Work

Related work falls into one of three disparate camps:

1. Classification: classifying reviews as negative or positive
2. Domain Specificity: the overall effect of reviews within a domain
3. Summarization: feature extraction to summarize reviews

Limitations of Related Work

- Classification: overly summarizing
- Domain Specificity: hard to generalize given domain information
- Summarization: no overall knowledge of the collection

Close to Summarization?

Most closely related to the summarization work of Hu and Liu:

- Summarization with dynamic feature extraction and orientation per review

Data for Project

Data from Amazon.com customer reviews:

- Available through the Amazon E-Commerce Service (ECS)
- Four thousand products related to MP3 players
- Over twenty thousand customer reviews

Technologies Used

- Java to program the modules
- Amazon ECS
- NLProcessor (trial version) from Infogistics
- Princeton’s WordNet as a thesaurus
- KMLocal from David Mount’s group at the University of Maryland for clustering (a generic sketch of the clustering step follows below)
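KMLocal is distributed as C++ code, so rather than guess at its interface, the sketch below is a generic Lloyd's-iteration k-means in Java over the ±1/0 feature vectors. It illustrates what the clustering step computes, not KMLocal's actual API.

```java
import java.util.*;

// Generic Lloyd's k-means over review feature vectors; a stand-in for the
// KMLocal clustering step, not its real interface.
public class KMeansSketch {

    static double dist2(int[] p, double[] c) {
        double s = 0;
        for (int i = 0; i < p.length; i++) { double d = p[i] - c[i]; s += d * d; }
        return s;
    }

    static int[] cluster(int[][] points, int k, int iterations) {
        int d = points[0].length;
        double[][] centers = new double[k][d];
        Random rng = new Random(0);
        for (int j = 0; j < k; j++) centers[j] = toDouble(points[rng.nextInt(points.length)]);

        int[] assign = new int[points.length];
        for (int it = 0; it < iterations; it++) {
            // Assignment step: each review vector joins its nearest center.
            for (int i = 0; i < points.length; i++) {
                int best = 0;
                for (int j = 1; j < k; j++)
                    if (dist2(points[i], centers[j]) < dist2(points[i], centers[best])) best = j;
                assign[i] = best;
            }
            // Update step: recompute each center as the mean of its members.
            double[][] sums = new double[k][d];
            int[] counts = new int[k];
            for (int i = 0; i < points.length; i++) {
                counts[assign[i]]++;
                for (int x = 0; x < d; x++) sums[assign[i]][x] += points[i][x];
            }
            for (int j = 0; j < k; j++)
                if (counts[j] > 0)
                    for (int x = 0; x < d; x++) centers[j][x] = sums[j][x] / counts[j];
        }
        return assign;
    }

    static double[] toDouble(int[] p) {
        double[] r = new double[p.length];
        for (int i = 0; i < p.length; i++) r[i] = p[i];
        return r;
    }

    public static void main(String[] args) {
        int[][] vectors = { {1, -1, 0}, {1, -1, -1}, {-1, 1, 0}, {-1, 1, 1} };
        System.out.println(Arrays.toString(cluster(vectors, 2, 10))); // cluster index per review
    }
}
```

A cluster whose members average below the -0.1 threshold (see the Analysis slide) would then count as a negative cluster.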

Project Structure

Simplifications Made

- Limited data set
- Feature list created a priori
- Features from the same sentence given the same orientation
- Sentences without features neglected
- Number of clusters chosen only to see correlations in the biggest cluster
- Small adjective seed set (a sketch of seed-set orientation follows below)
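To make the small adjective seed set concrete, here is a minimal Java sketch of seed-based orientation, assuming one-hop synonym expansion. The synonym map is a hardcoded stand-in for the WordNet thesaurus lookups, and all names here are hypothetical.

```java
import java.util.*;

// Sketch of orienting adjectives from a small seed set (hypothetical names).
// The synonym map stands in for WordNet, which the real modules queried
// as a thesaurus to expand the seeds.
public class OrientationSketch {

    public static void main(String[] args) {
        Map<String, Integer> orientation = new HashMap<>();
        // Small hand-labeled seed set.
        orientation.put("good", 1);
        orientation.put("great", 1);
        orientation.put("bad", -1);
        orientation.put("terrible", -1);

        // Stand-in for WordNet synonym lookups.
        Map<String, List<String>> synonyms = Map.of(
                "good", List.of("fine", "solid"),
                "bad", List.of("poor", "awful"));

        // Propagate each seed's polarity to its synonyms (one hop).
        for (Map.Entry<String, List<String>> e : synonyms.entrySet()) {
            Integer polarity = orientation.get(e.getKey());
            if (polarity == null) continue;
            for (String syn : e.getValue()) orientation.putIfAbsent(syn, polarity);
        }
        System.out.println(orientation); // each adjective now carries +1 / -1
    }
}
```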

Analysis

- Associated clusters with products
- Found negative clusters using a threshold (-0.1)
- Eliminated non-negative clusters
- Sorted the product list twice:
  - Products by sales rank (given by Amazon)
  - Products sorted by the hypothesis, with a tweak
- Tweak: relative size * distortion
- Computed Spearman’s distance (see the sketch below)
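The sketch below reads "Spearman's distance" as the Spearman footrule, i.e. the sum of absolute rank differences between the two product orderings; the slides do not say which variant was used, so that reading is an assumption.

```java
import java.util.*;

// Spearman footrule distance between two rankings of the same products:
// the sum of absolute rank differences. Assumes both lists rank the
// identical set of products.
public class SpearmanSketch {

    static int footrule(List<String> rankingA, List<String> rankingB) {
        Map<String, Integer> posB = new HashMap<>();
        for (int i = 0; i < rankingB.size(); i++) posB.put(rankingB.get(i), i);
        int distance = 0;
        for (int i = 0; i < rankingA.size(); i++)
            distance += Math.abs(i - posB.get(rankingA.get(i)));
        return distance; // 0 = identical orderings
    }

    public static void main(String[] args) {
        List<String> bySalesRank = List.of("playerA", "playerB", "playerC");
        List<String> byHypothesis = List.of("playerB", "playerA", "playerC");
        System.out.println(footrule(bySalesRank, byHypothesis)); // -> 2
    }
}
```

A distance of 0 would mean the hypothesis ordering exactly matches Amazon's sales rank.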

Results

- The hypothesis holds with 82% accuracy!
- But most of the four thousand products were pruned due to poor orientation

Conclusion

- Consumers are affected by negative reviews that correlate to show similar flaws
- Affected regardless of the positive reviews

Future Work

- Larger seed set for adjectives
- Use more complicated NLP techniques
- Experiment with the size of clusters
- Dynamically determine features using summarization techniques
- Use different data sets
- Use a different distance measure in clustering

Questions
