Patrick zheng week4_demo3

Page 1: Patrick zheng week4_demo3

moviEharmony

Patrick Zheng

Finding harmonic movies for 2

Page 2: Patrick zheng week4_demo3

Motivation:

Page 3: Patrick zheng week4_demo3

Demo: http://www.movieharmony.com

Page 4: Patrick zheng week4_demo3

Data Source

● Amazon Reviews Dataset
○ Number of users: 2.1 million
○ Number of movies: 200k
○ Timespan: May 1996 - July 2014

Page 5: Patrick zheng week4_demo3

Data Pipeline:

Page 6: Patrick zheng week4_demo3

Data Ingestion:

root
 |-- asin: string (nullable = true)
 |-- helpful: array (nullable = true)
 |    |-- element: long (containsNull = true)
 |-- overall: double (nullable = true)
 |-- reviewText: string (nullable = true)
 |-- reviewTime: string (nullable = true)
 |-- reviewerID: string (nullable = true)
 |-- reviewerName: string (nullable = true)
 |-- summary: string (nullable = true)
 |-- unixReviewTime: long (nullable = true)

{asin:u'0790734680',
 helpful:[0, 0],
 overall:4.0,
 reviewText:u'Kevin Spacey always gives a credible performance. Curious to know how close story in book was followed.Overall I liked it.',
 reviewTime:u'06 28, 2014',
 reviewerID:u'A3MG14J7MXE9CC',
 reviewerName:u'George McGarrity',
 summary:u'Serious murder mixed with comedy',
 unixReviewTime:1403913600}
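As a rough illustration of what ingestion has to extract from each review, the sketch below reduces one record to the (user, movie, rating) triple that a collaborative filtering model consumes. The field names come from the schema above; the `Rating` type and the `to_rating` helper are hypothetical names introduced only for this example, not part of the original pipeline.

```python
from typing import NamedTuple


class Rating(NamedTuple):
    """Hypothetical container for the fields the recommender needs."""
    user: str       # reviewerID
    movie: str      # asin
    score: float    # overall (star rating)
    timestamp: int  # unixReviewTime, seconds since the epoch


def to_rating(record: dict) -> Rating:
    """Keep only the fields needed for collaborative filtering."""
    return Rating(
        user=record["reviewerID"],
        movie=record["asin"],
        score=float(record["overall"]),
        timestamp=int(record["unixReviewTime"]),
    )


# A trimmed copy of the sample record shown above.
sample = {
    "asin": "0790734680",
    "helpful": [0, 0],
    "overall": 4.0,
    "reviewerID": "A3MG14J7MXE9CC",
    "unixReviewTime": 1403913600,
}
rating = to_rating(sample)
print(rating)
```

MLlib's RDD-based ALS expects integer user and product IDs, so string IDs such as these would still need to be mapped to integers before training.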

Page 7: Patrick zheng week4_demo3

Batch Layer:

Collaborative Filtering Model

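● MLlib currently supports model-based collaborative filtering, in which users and movies are described by a small set of latent factors.

● The latent factors are used to predict missing entries of the user-movie rating matrix.

● MLlib uses the alternating least squares (ALS) algorithm to learn these latent factors.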
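To make the idea of alternating least squares concrete, here is a minimal sketch in plain Python. It is not MLlib's implementation: it assumes a single latent factor per user and per movie (rank 1) so that each least-squares step reduces to one division, and the function name `als_rank1`, the regularization value, and the toy ratings are all illustrative assumptions.

```python
def als_rank1(ratings, n_users, n_items, reg=0.001, iterations=50):
    """Rank-1 ALS sketch. ratings maps (user, item) -> observed rating."""
    x = [1.0] * n_users  # one latent factor per user
    y = [1.0] * n_items  # one latent factor per item
    for _ in range(iterations):
        # Hold item factors fixed; solve regularized least squares per user.
        for u in range(n_users):
            obs = [(i, r) for (uu, i), r in ratings.items() if uu == u]
            x[u] = sum(r * y[i] for i, r in obs) / (
                reg + sum(y[i] ** 2 for i, _ in obs))
        # Hold user factors fixed; solve regularized least squares per item.
        for i in range(n_items):
            obs = [(u, r) for (u, ii), r in ratings.items() if ii == i]
            y[i] = sum(r * x[u] for u, r in obs) / (
                reg + sum(x[u] ** 2 for u, _ in obs))
    return x, y


# Toy data: 3 users x 3 movies, with 3 of the 9 ratings missing.
observed = {
    (0, 0): 1.0, (0, 1): 2.0,
    (1, 1): 4.0, (1, 2): 4.0,
    (2, 0): 2.0, (2, 2): 4.0,
}
user_f, item_f = als_rank1(observed, n_users=3, n_items=3)


def predict(u, i):
    """Predicted rating is the product of the two latent factors."""
    return user_f[u] * item_f[i]


# Fill in the missing entries of the rating matrix.
for u, i in [(0, 2), (1, 0), (2, 1)]:
    print(u, i, round(predict(u, i), 2))
```

A real model would use more than one latent factor, in which case each step solves a small linear system instead of a single division.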

Page 8: Patrick zheng week4_demo3

Web framework:

Page 9: Patrick zheng week4_demo3

Schema:

Page 10: Patrick zheng week4_demo3

Clusters:

4 m4.large
4 m4.large
4 m4.large
8 m4.large

$0.126 * 20 instances = $2.52 per hour

Page 11: Patrick zheng week4_demo3

Challenges:

● Spark MLlib: training the collaborative filtering (ALS) model and tuning its parameters

● Scaling recommendations for 2 people:
○ Batch: 2.1m users * 200k movies = 420 billion combinations
○ Streaming with caching: 2 users * 200k movies = 400k combinations

● Implementing Power bar®
○ Normalization
○ Consensus function

Powerbar®
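The scaling trade-off above can be sketched as follows: instead of precomputing every user-movie score in batch, scores are computed on request for just the two users involved and cached for reuse. This is only an illustration under stated assumptions: the factor tables, user names, movie IDs, and function names are hypothetical, and the real system's caching layer is not described in the slides.

```python
from functools import lru_cache

# Hypothetical latent factors, as a trained ALS model might produce them.
USER_FACTORS = {"alice": (0.9, 0.1), "bob": (0.2, 0.8)}
MOVIE_FACTORS = {"m1": (4.0, 1.0), "m2": (1.0, 4.0), "m3": (3.0, 3.0)}

# Batch would score every pair: 2.1m users * 200k movies.
assert 2_100_000 * 200_000 == 420_000_000_000


@lru_cache(maxsize=10_000)
def scores_for(user):
    """Predicted rating of every movie for one user (dot product of factors)."""
    u = USER_FACTORS[user]
    return tuple(
        (movie, sum(a * b for a, b in zip(u, f)))
        for movie, f in MOVIE_FACTORS.items()
    )


def scores_for_pair(user_a, user_b):
    """Per request, only 2 * len(MOVIE_FACTORS) predictions are needed."""
    return dict(scores_for(user_a)), dict(scores_for(user_b))


first = scores_for_pair("alice", "bob")   # computed
second = scores_for_pair("alice", "bob")  # served from the cache
print(scores_for.cache_info())
```

With 200k movies, each request touches 2 * 200k = 400k scores, and a repeat request for the same user skips the computation entirely.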

Page 12: Patrick zheng week4_demo3

About me

● Patrick Zheng
● MS Computer Science

● Alternative Drug Recommendation System
● Retrospective Drug Utilization Review System
● Drug Adherence Predictive Modeling

● Movie
● Basketball
● Hearthstone

Page 13: Patrick zheng week4_demo3

Low-latency, real-time computation based on the users' input:

Movie group relevance:

Movie group disagreement:

Consensus function:
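The formulas for these three quantities are not reproduced in this text, so the sketch below assumes one common formulation from group recommendation: relevance as the average of the two users' normalized ratings, disagreement as their absolute difference, and consensus as a weighted sum of relevance and agreement. The weight `w`, the 1-5 rating range, and all function names are assumptions made for illustration, not necessarily what moviEharmony uses.

```python
def normalize(rating, low=1.0, high=5.0):
    """Map a rating on [low, high] to [0, 1], clamping out-of-range values."""
    return min(1.0, max(0.0, (rating - low) / (high - low)))


def group_relevance(a, b):
    """Assumed definition: average of the two normalized ratings."""
    return (a + b) / 2.0


def group_disagreement(a, b):
    """Assumed definition: absolute difference of the normalized ratings."""
    return abs(a - b)


def consensus(rating_a, rating_b, w=0.5):
    """Weighted sum of relevance and agreement (1 - disagreement)."""
    a, b = normalize(rating_a), normalize(rating_b)
    return w * group_relevance(a, b) + (1.0 - w) * (1.0 - group_disagreement(a, b))


def rank_movies(scores_a, scores_b, w=0.5):
    """Order the movies both users have scores for, best consensus first."""
    common = scores_a.keys() & scores_b.keys()
    return sorted(
        common,
        key=lambda m: consensus(scores_a[m], scores_b[m], w),
        reverse=True,
    )


ranking = rank_movies(
    {"m1": 5.0, "m2": 3.0, "m3": 5.0},
    {"m1": 1.0, "m2": 3.0, "m3": 4.0},
)
print(ranking)
```

Under these assumptions a movie both users rate moderately (m2) outranks one that a single user loves and the other dislikes (m1), which is the behavior a consensus function is meant to capture.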