moviEharmony
Patrick Zheng
Finding harmonic movies for 2
Motivation:
Data Source:
● Amazon Reviews Dataset
  ○ Number of users: 2.1 million
  ○ Number of movies: 200k
  ○ Timespan: May 1996 - July 2014
Data Pipeline:
Data Ingestion:
root
 |-- asin: string (nullable = true)
 |-- helpful: array (nullable = true)
 |    |-- element: long (containsNull = true)
 |-- overall: double (nullable = true)
 |-- reviewText: string (nullable = true)
 |-- reviewTime: string (nullable = true)
 |-- reviewerID: string (nullable = true)
 |-- reviewerName: string (nullable = true)
 |-- summary: string (nullable = true)
 |-- unixReviewTime: long (nullable = true)
{asin: u'0790734680',
 helpful: [0, 0],
 overall: 4.0,
 reviewText: u'Kevin Spacey always gives a credible performance. Curious to know how close story in book was followed. Overall I liked it.',
 reviewTime: u'06 28, 2014',
 reviewerID: u'A3MG14J7MXE9CC',
 reviewerName: u'George McGarrity',
 summary: u'Serious murder mixed with comedy',
 unixReviewTime: 1403913600}
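As a rough illustration of the ingestion step, the sketch below turns one review record into the (user, movie, rating, date) tuple that a collaborative filtering model needs. It assumes the reviews are stored as one JSON object per line; the function name `to_rating` and the trimmed sample record are hypothetical and not taken from the project.

```python
import json
from datetime import datetime, timezone


def to_rating(line):
    """Parse one JSON review line into (reviewerID, asin, overall, review date)."""
    record = json.loads(line)
    # unixReviewTime is seconds since the epoch; interpret it as UTC.
    review_date = datetime.fromtimestamp(
        record["unixReviewTime"], tz=timezone.utc
    ).date()
    return record["reviewerID"], record["asin"], float(record["overall"]), review_date


# Trimmed copy of the sample record above (assumed JSON-lines form).
sample = (
    '{"asin": "0790734680", "overall": 4.0, '
    '"reviewerID": "A3MG14J7MXE9CC", "unixReviewTime": 1403913600}'
)
rating = to_rating(sample)
```

Only `reviewerID`, `asin`, and `overall` feed the model; the date is kept here in case reviews need to be filtered by time.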
Batch Layer:
Collaborative Filtering Model
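● MLlib currently supports model-based collaborative filtering.
● Users and movies are described by a small set of latent factors that can be used to predict missing entries in the rating matrix.
● MLlib uses the alternating least squares (ALS) algorithm to learn these latent factors.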
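To make the alternating idea concrete, here is a minimal pure-Python sketch of ALS with a single latent factor per user and per movie. It is not MLlib's implementation (MLlib works with higher ranks, distributed blocks, and its own regularization scaling); the function names, the toy ratings, and the values of `reg` and `iters` are assumptions chosen for illustration.

```python
def als_rank1(ratings, n_users, n_items, reg=0.1, iters=20):
    """Rank-1 ALS. ratings maps (user, item) -> rating; returns (x, y)."""
    x = [1.0] * n_users  # one latent factor per user
    y = [1.0] * n_items  # one latent factor per item
    for _step in range(iters):
        # Fix item factors; each user factor has a closed-form ridge solution.
        for u in range(n_users):
            obs = [(i, r) for (uu, i), r in ratings.items() if uu == u]
            x[u] = sum(r * y[i] for i, r in obs) / (reg + sum(y[i] ** 2 for i, _r in obs))
        # Fix user factors; solve each item factor the same way.
        for i in range(n_items):
            obs = [(u, r) for (u, ii), r in ratings.items() if ii == i]
            y[i] = sum(r * x[u] for u, r in obs) / (reg + sum(x[u] ** 2 for u, _r in obs))
    return x, y


def rmse(ratings, x, y):
    """Root mean squared error over the observed ratings."""
    sq = sum((r - x[u] * y[i]) ** 2 for (u, i), r in ratings.items())
    return (sq / len(ratings)) ** 0.5


# Toy data: 3 users, 3 movies, 6 observed ratings; (0, 2) is a missing entry.
toy = {(0, 0): 5.0, (0, 1): 3.0, (1, 0): 4.0, (1, 2): 2.0, (2, 1): 3.0, (2, 2): 2.0}
rmse_before = rmse(toy, [1.0] * 3, [1.0] * 3)
user_f, item_f = als_rank1(toy, n_users=3, n_items=3)
rmse_after = rmse(toy, user_f, item_f)
predicted_missing = user_f[0] * item_f[2]  # estimate for the unseen pair (0, 2)
```

Each half-step solves a small regularized least-squares problem exactly, so the regularized objective never increases from one step to the next.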
Web framework:
Schema:
Clusters: 4 m4.large, 4 m4.large, 4 m4.large, 8 m4.large (20 instances in total)
Cost: $0.126 per instance-hour * 20 instances = $2.52 per hour
Challenges:
● Spark MLlib: training the collaborative filtering ALS model and tuning its parameters
● Scaling recommendations for 2 people:
  ○ Batch: 2.1m users * 200k movies = 420 billion combinations
  ○ Streaming with caching: 2 users * 200k movies = 400k
● Implementing Powerbar®
  ○ Normalization
  ○ Consensus function
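The scaling trade-off above can be checked with a few lines of arithmetic, followed by a sketch of the caching idea: score movies only for the two users in a request and memoize each user's result. The factor tables, user and movie IDs, and the function `scores_for` are hypothetical stand-ins for the trained model, and `functools.lru_cache` stands in for whatever cache the project actually used.

```python
from functools import lru_cache

N_USERS = 2_100_000
N_MOVIES = 200_000

batch_scores = N_USERS * N_MOVIES  # every (user, movie) pair precomputed
on_demand_scores = 2 * N_MOVIES    # only the two users in the request

# Hypothetical latent factors standing in for the trained ALS model.
USER_FACTORS = {"alice": (1.0, 0.5), "bob": (0.25, 1.5)}
MOVIE_FACTORS = {"m1": (4.0, 1.0), "m2": (1.0, 3.0), "m3": (2.0, 2.0)}


@lru_cache(maxsize=1024)
def scores_for(user_id):
    """Predicted score of every movie for one user, cached per user."""
    u = USER_FACTORS[user_id]
    return tuple(
        (movie, sum(a * b for a, b in zip(u, f)))
        for movie, f in MOVIE_FACTORS.items()
    )
```

A repeated request for the same user is then served from the cache instead of recomputing 200k dot products.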
Powerbar®
About me
● Patrick Zheng
● MS Computer Science
● Alternative Drug Recommendation System
● Retrospective Drug Utilization Review System
● Drug Adherence Predictive Modeling
● Movie
● Basketball
● Hearthstone
Low-latency real-time computation based on the user's input:
Movie group relevance:
Movie group disagreement:
Consensus function:
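The formulas for these three quantities are not preserved in this text. The sketch below shows one common way such a score is built for a pair of users, purely as an assumption: relevance is the mean of the two normalized predicted ratings, disagreement is their absolute difference, and the consensus is a weighted sum that rewards relevance and penalizes disagreement. The 1 to 5 rating range, the weights `w_rel` and `w_dis`, and all function names are hypothetical.

```python
def normalize(rating, lo=1.0, hi=5.0):
    """Clamp a predicted rating to [lo, hi] and rescale it to [0, 1]."""
    return (min(max(rating, lo), hi) - lo) / (hi - lo)


def group_relevance(a, b):
    """Assumed form: mean of the two users' normalized scores."""
    return (a + b) / 2


def group_disagreement(a, b):
    """Assumed form: absolute difference of the two normalized scores."""
    return abs(a - b)


def consensus(a, b, w_rel=0.8, w_dis=0.2):
    """Assumed form: weighted relevance plus weighted agreement."""
    return w_rel * group_relevance(a, b) + w_dis * (1 - group_disagreement(a, b))


# Two movies with the same mean score but different agreement between users.
both_like_it = consensus(normalize(3.0), normalize(3.0))
one_loves_one_hates = consensus(normalize(5.0), normalize(1.0))
```

With these assumed weights, a movie both users rate moderately scores higher than one that splits them, which is the behaviour a score for two viewers would be expected to show.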