Capstone Project Slides- Yelper

YelperA Collaborative Filtering Based Recommendation System

Team DeepBlueChuan Sun

Capstone Project @ NYC Data Science Academy9/20/2016

Agenda

Why?Why we need recommendation?

What?What are the components for Yelper?

How?How to build Yelper?

DemoUser-business network; Yelper main page; Simulation of users' recommendation requests handling

Summary & Future worksWhat we learn and what's next?

Information overload is a real phenomenon which prevents us from taking decisions or actions

Image link

http://blog.at-edge.com/tag/digital-compositing/

http://blog.at-edge.com/tag/digital-compositing/

But life is short. We only have 24 hours per day

Recommendation is becoming extremely common in recent years

Real-time recommendation system may become a new normal in data era

● Gain real-time insight● Enable rapid development● Perform real-time analytics

Agenda






Yelp Challenge 2016 Dataset is a good sandbox to build recommendation engine

Yelper consists of 5 major components

User requests

PiperData

Recommendation Engine

Streaming Processor

Web server

Visualization

Data sourceProvide raw or processed data as the fuel for recommendation engine

Pipe user requests into streamingHandle real-time data feeds as message broker

Process user requestsDigest continuous data from Kafka in a scalable and fault-tolerable way

Predict ratingsAdopt collaborative filtering or content-based filtering

Visual insightDistill connected components or in/out degree for users and businesses

User frontendAllow customer to get customized recommendations

Agenda






Yelper is built on top of Spark platform

Business

User

ReviewTip Checkin

Data source(1) Use business and user data in Yelp dataset(2) Map raw IDs to integers to make later stage a lot easier(3) Splitting business data by city allows fine-tuned models

Pipe user requests into streaming(1) Pipe user requests into streaming (2) Create json-based topics to allow consumer-provider communication

Process user requests in stream(1) Use Spark SQL to query data frames(2) Support parallelism(3) Fault-tolerant

Predict ratings(1) Use Spark MLlib to train Collaborative Filtering models per city(2) Tune the best rank(3) Filter user-input keywords to prune user's preferences

Visual insight(1) Use Spark GraphX to extract connected components(2) Use graph-tool library to draw large graphs(3) Use D3.js to build dynamic graphs

Client frontend(1) Use Cherrypy to allow Flask talk to Spark context(2) Use Google Map API to show recommended results

Agenda






The city-wise user-business network in Yelp tells a lot about a city● Motivation

○ User-business interaction may reflect the economy trend in a city

● Each city is distinct, so does its user-business network

● Nodes: u1, u2, … b1, b2, ...● Edges: u1 → b1, u1 → b2, ...● The network is a bipartite● If Yelp provides the timestamp of rating

data, we may see the evolution of economy development

Demo 1: User-business network visualization in the city of Madison, US

Charlotte, USRandomly selected 3312 edges

(1% of total ratings)

Las Vegas, USRandomly selected 11547 edges


Las Vegas, USRandomly selected 23095 edges


Las Vegas, USSequentially selected 23095 edges


Phoenix, USRandomly selected 20582 edges


Pittsburgh, USRandomly selected 2230 edges


Demo 2: Yelper recommendation page

Demo 3: Simulation of user requests handling for recommendation using Spark Streaming and Kafka

Agenda






Summary & future works● Summary

○ Preprocessing by dividing business data by cities to allow fine tuned and customized recommendations

○ Collaborative filtering based recommendation using Spark MLlib

○ User-business graph visualization using D3 and graph-tool library

○ User-business graph analysis using Spark GraphX in Scala

○ Real-time user request handling simulation using Spark Streaming and Apache Kafka

○ Google Map view to recommend high rated businesses for users

● What we learn○ Streaming data analysis and mining

could be the new norm in the future and we as Data Scientist should be prepared for it

● Future works○ More graph analysis

■ Graph pagerank analysis using GraphX

■ Community discovery (similar to Facebook social network)

○ Improve recommendation■ Content-based recommendation■ Clustering all businesses■ Extract object from business

photos using Convolutional Neural Network

Thanks!

Documents

Capstone Project Slides- Yelper