YELP DATASET CHALLENGE CAMPUS ARC II, 16 APRIL 2015 Mehdy Davary, Computer science department (IIUN)



ABOUT THE CHALLENGE DATASET

• 1.2M reviews

• 400K tips by 250K users for 42K businesses

• 400K business attributes, e.g., hours, parking availability, ambience

• Social network of 250K users for a total of 1.9M social edges.

• Aggregated check-ins over time for each of the 42K businesses

CITIES

• U.K.: Edinburgh

• U.S.: Pittsburgh, Charlotte, Urbana-Champaign, Phoenix, Las Vegas, Madison

PLATFORM

• The Hortonworks Sandbox is a single node implementation of the Hortonworks Data Platform (HDP). It is a personal, portable Hadoop environment.

• H2O on the Hortonworks Data Platform is a fully open-source predictive analytics platform.

• Neo4j is a graph database: it stores data as nodes connected by relationships. Neo4j uses Cypher queries to work with graph data.

[Diagram: The Hortonworks Sandbox, H2O, Sentiment Analysis, Neo4j]

THE HORTONWORKS SANDBOX

So far we have loaded all five Yelp JSON data files into Hadoop as tables that can be sorted and searched. We mainly use HCatalog, Pig, Python and Hive to load and process the data.

H2O

H2O is a statistical analysis engine that uses the Hadoop Distributed File System (HDFS) as its storage platform and provides a user-friendly interface for easy querying.

NEO4J

The real power of Neo4j is in connected data. To associate any two nodes, we add a relationship that describes how the records are related.
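As a minimal sketch of how the friend lists in the user file could become relationships, the Python below turns user records into undirected edge pairs and pairs each edge with a parameterised Cypher statement. The `User` label, the `FRIEND` relationship type and the helper names are assumptions made for illustration and do not come from the talk. The `friends` field is assumed to be already parsed into a list of user IDs, and the `$a`/`$b` parameter syntax is that of current Neo4j versions.

```python
# Hypothetical Cypher statement: the User label and FRIEND type are assumed names.
FRIEND_QUERY = (
    "MATCH (a:User {user_id: $a}), (b:User {user_id: $b}) "
    "MERGE (a)-[:FRIEND]-(b)"
)


def friendship_edges(users):
    """Return each undirected friendship exactly once, as a sorted (id, id) pair."""
    edges = set()
    for user in users:
        for friend_id in user.get("friends", []):
            if friend_id != user["user_id"]:  # ignore self-references
                edges.add(tuple(sorted((user["user_id"], friend_id))))
    return sorted(edges)


def cypher_statements(users):
    """Pair the query text with the parameters for every edge."""
    return [(FRIEND_QUERY, {"a": a, "b": b}) for a, b in friendship_edges(users)]


# Toy records, not taken from the dataset.
users = [
    {"user_id": "u1", "friends": ["u2", "u3"]},
    {"user_id": "u2", "friends": ["u1"]},
    {"user_id": "u3", "friends": []},
]

for query, params in cypher_statements(users):
    print(params)
```

Deduplicating on the sorted pair means a friendship listed by both users produces a single relationship rather than two.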

TO ANALYZE HORTONWORKS SANDBOX DATA WITH EXCEL 2013

• Hortonworks ODBC driver (64-bit) installed and configured.

• Microsoft Excel 2013 Professional Plus 64-bit.

• Use the Microsoft Query feature to access Hortonworks sandbox data.

• Use the Excel Power View feature to analyze the data.

ABOUT REVIEWS ON “RESTAURANTS”

5 IMPORTANT DIMENSIONS

• Food

• Service

• Ambience

• Deals/Discounts

• Quality-Price Ratio

RAW DATA

• yelp_academic_dataset_review.json

• yelp_academic_dataset_business.json

A review can be associated with multiple dimensions (categories) at the same time.
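To illustrate what "multiple dimensions at the same time" means in practice, here is a small Python sketch that returns every dimension a review touches. The slides do not say how reviews were assigned to dimensions. The keyword matching and the keyword lists below are purely hypothetical stand-ins, not the method used in the talk.

```python
# Hypothetical keyword lists, one per dimension named on the previous slide.
DIMENSION_KEYWORDS = {
    "Food": {"food", "dish", "taste", "menu"},
    "Service": {"service", "staff", "waiter", "waitress"},
    "Ambience": {"ambience", "atmosphere", "decor", "music"},
    "Deals/Discounts": {"deal", "deals", "discount", "coupon"},
    "Quality-Price Ratio": {"price", "value", "cheap", "expensive"},
}


def dimensions_of(review_text):
    """Return all dimensions whose keywords occur in the review (possibly none)."""
    words = {w.strip(".,!?;:()\"'").lower() for w in review_text.split()}
    return [name for name, keywords in DIMENSION_KEYWORDS.items() if words & keywords]


print(dimensions_of("Great food and friendly staff, but the price is high."))
```

The result is a list rather than a single label, so one review can count towards several dimensions.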

DATA PREPARATION FOR DATA MINING

• All reviews

• Total reviews on “Restaurants”

• Reduced number of reviews on “Restaurants”, obtained by filtering with (review.useful > 3 AND review.cool > 2 AND review.stars > 3 AND business.review_count > 5)

            All businesses    All restaurants    Restaurants (r.useful > 3, r.cool > 2, r.stars > 3, b.review_count > 5)

Review      1’127’525         706’290            22’584

Business    42’153

User        252’898

Tip         403’210

Checkin     31’617
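The filtering step above can be sketched in Python as follows. This is only an illustration of the stated condition, not the Pig or Hive job used in the talk. It assumes the records are dictionaries with flat `useful`, `cool` and `stars` fields (as in the table schema of this deck) and that a restaurant is any business whose `categories` value contains "Restaurants". The function names are hypothetical.

```python
def is_restaurant(business):
    # categories may be a list or its string form; "in" works for both.
    return "Restaurants" in business["categories"]


def select_reviews(reviews, businesses):
    """Keep reviews of restaurants that satisfy all four filter conditions."""
    eligible = {
        b["business_id"]
        for b in businesses
        if is_restaurant(b) and b["review_count"] > 5
    }
    return [
        r for r in reviews
        if r["business_id"] in eligible
        and r["useful"] > 3 and r["cool"] > 2 and r["stars"] > 3
    ]


# Toy records, not taken from the dataset.
businesses = [
    {"business_id": "b1", "categories": ["Restaurants", "Pizza"], "review_count": 12},
    {"business_id": "b2", "categories": ["Doctors"], "review_count": 40},
    {"business_id": "b3", "categories": ["Restaurants"], "review_count": 3},
]
reviews = [
    {"review_id": "r1", "business_id": "b1", "useful": 5, "cool": 3, "stars": 4},
    {"review_id": "r2", "business_id": "b1", "useful": 3, "cool": 3, "stars": 5},
    {"review_id": "r3", "business_id": "b2", "useful": 9, "cool": 9, "stars": 5},
    {"review_id": "r4", "business_id": "b3", "useful": 9, "cool": 9, "stars": 5},
]

print([r["review_id"] for r in select_reviews(reviews, businesses)])
```

Building the set of eligible business IDs first keeps the review pass to a single lookup per review, which matters when the review file is much larger than the business file.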

review
---------------------------------
funny: int, useful: int, cool: int, user_id: string, review_id: string, stars: int, text: string, date: string, type: string, business_id: string

business
---------------------------------
attributes: string, business_id: string, full_address: string, open: boolean, hours: string, categories: string, city: string, review_count: int, name: string, neighborhoods: string, longitude: float, state: string, stars: float, latitude: float, type: string

user
---------------------------------
yelping_since: string, votes: string (e.g. {funny: 1, useful: 5, cool: 0}), name: string, review_count: int, user_id: string, friends: string, fans: int, average_stars: float, type: string, compliments: string, elite: string

JAVA IMPLEMENTATION OF THE NLTK IN HADOOP

THE STANFORD NLP GROUP

Retrieving the parts of speech (verbs, nouns, adjectives, etc.) from each sentence using the Stanford NLP parser.

Unfortunately, the frustration of being Dr. Goldberg's patient is a repeat of the experience I've had with so many other doctors in NYC -- good doctor, terrible staff. It seems that his staff simply never answers the phone. It usually takes 2 hours of repeated calling to get an answer. Who has time for that or wants to deal with it? I have run into this problem with many other doctors and I just don't get it. You have office workers, you have patients with medical needs, why isn't anyone answering the phone? It's incomprehensible and not work the aggravation. It's with regret that I feel that I have to give Dr. Goldberg 2 stars.

Unfortunately, frustration Dr. Goldberg's patient repeat experience I've doctors NYC -- good doctor, terrible staff. It staff simply answers phone. It takes 2 hours repeated calling answer. Who time deal it? run problem doctors it. You office workers, patients medical needs, answering phone? It's incomprehensible work aggravation. It's regret feel give Dr. Goldberg 2 stars.

((Unfortunately,RB),(frustration,NN),(being,VB),(Goldberg,NNP),(patient,NN),(repeat,NN),(experience,NN),('ve,VB),(had,VB),(so,RB),(many,JJ),(other,JJ),(doctors,NN),(NYC,NNP),(good,JJ),(doctor,NN),(terrible,JJ),(staff,NN),(seems,VB),(staff,NN),(simply,RB),(never,RB),(answers,VB),(phone,NN),(usually,RB),(takes,VB),(hours,NN),(repeated,VB),(calling,VB),(get,VB),(answer,NN),(time,NN),(wants,VB),(deal,VB),(have,VB),(run,VB),(problem,NN),(many,JJ),(other,JJ),(doctors,NN),(just,RB),(do,VB),(n't,RB),(get,VB),(have,VB),(office,NN),(workers,NN),(have,VB),(patients,NN),(medical,JJ),(needs,NN),(n't,RB),(anyone,NN),(answering,VB),(phone,NN),('s,VB),(incomprehensible,NN),(not,RB),(work,VB),(aggravation,NN),('s,VB),(regret,NN),(feel,VB),(have,VB),(give,VB),(Goldberg,NNP),(stars,NN))

{(Unfortunately, the frustration of being Dr. Goldberg's patient is a repeat of the experience I've had with so many other doctors in NYC -- good doctor, terrible staff.),(It seems that his staff simply never answers the phone.),(It usually takes 2 hours of repeated calling to get an answer.),(Who has time for that or wants to deal with it?),(I have run into this problem with many other doctors and I just don't get it.),(You have office workers, you have patients with medical needs, why isn't anyone answering the phone?),(It's incomprehensible and not work the aggravation.),(It's with regret that I feel that I have to give Dr. Goldberg 2 stars.)}
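Two of the preprocessing steps shown above, sentence splitting and stop-word removal, can be approximated with the pure-Python sketch below. It does not use the Stanford NLP tools from the talk and does not reproduce the part-of-speech tagging. The stop-word list and the abbreviation list are deliberately tiny, hypothetical examples, so the output will not match the slide exactly.

```python
# Hypothetical, deliberately small lists; a real run would use fuller ones.
STOP_WORDS = {"a", "an", "and", "for", "his", "in", "is", "of", "or",
              "that", "the", "to", "with"}
ABBREVIATIONS = {"dr.", "mr.", "mrs.", "ms."}


def split_sentences(text):
    """Split on ., ? or ! at the end of a token, except after known abbreviations."""
    sentences, current = [], []
    for token in text.split():
        current.append(token)
        if token[-1] in ".?!" and token.lower() not in ABBREVIATIONS:
            sentences.append(" ".join(current))
            current = []
    if current:  # trailing text without closing punctuation
        sentences.append(" ".join(current))
    return sentences


def remove_stop_words(sentence):
    """Drop tokens whose punctuation-stripped, lower-cased form is a stop word."""
    kept = [tok for tok in sentence.split()
            if tok.strip(".,?!").lower() not in STOP_WORDS]
    return " ".join(kept)


text = ("It seems that his staff simply never answers the phone. "
        "Who has time for that? I have to give Dr. Goldberg 2 stars.")

for sentence in split_sentences(text):
    print(remove_stop_words(sentence))
```

The abbreviation check is what keeps "Dr. Goldberg" inside one sentence; a naive split on every full stop would break the review at that point.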

SENTIWORDNET

• Retrieving the parts of speech (verbs, nouns, adjectives, etc.) from each sentence using the Stanford NLP parser.

• Using SentiWordNet to look up the positive and negative values of each word, given its part of speech.

• Summing the positive and negative values obtained to calculate a net positive and a net negative value for each sentence.

A lexical resource for opinion mining
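The lookup-and-sum steps listed above can be sketched in Python as follows. This is a minimal illustration under stated assumptions: the lexicon is a toy dictionary with made-up scores (not values from SentiWordNet), it is keyed by word and coarse part-of-speech tag as in the tagged output shown earlier, and the function name is hypothetical.

```python
# Made-up scores for illustration only; these are NOT values from SentiWordNet.
# Keys are (word, coarse POS tag); values are (positive, negative).
TOY_LEXICON = {
    ("good", "JJ"): (0.75, 0.0),
    ("terrible", "JJ"): (0.0, 0.625),
    ("frustration", "NN"): (0.0, 0.5),
    ("regret", "NN"): (0.125, 0.5),
}


def sentence_scores(tagged_tokens, lexicon=TOY_LEXICON):
    """Sum positive and negative values over one POS-tagged sentence."""
    net_pos = net_neg = 0.0
    for word, tag in tagged_tokens:
        # Words missing from the lexicon contribute nothing to either total.
        pos, neg = lexicon.get((word.lower(), tag), (0.0, 0.0))
        net_pos += pos
        net_neg += neg
    return net_pos, net_neg


tagged = [("good", "JJ"), ("doctor", "NN"), ("terrible", "JJ"), ("staff", "NN")]
print(sentence_scores(tagged))
```

Keeping the positive and negative totals separate, rather than collapsing them into one number, preserves mixed sentences such as "good doctor, terrible staff".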