Spark Meetup @ Netflix, 05/19/2015

Spark and GraphX in the Netflix Recommender System

Ehtsham Elahi (@EhtshamElahi) and Yves Raimond (@moustaki)

Algorithms Engineering, Netflix

Machine Learning @ Netflix

Recommendations @ Netflix

● Goal: Help members find content that they’ll enjoy to maximize satisfaction and retention
● Core part of product
○ Every impression is a recommendation

Models & Algorithms

▪ Regression (Linear, logistic, elastic net)
▪ SVD and other Matrix Factorizations
▪ Factorization Machines
▪ Restricted Boltzmann Machines
▪ Deep Neural Networks
▪ Markov Models and Graph Algorithms
▪ Clustering
▪ Latent Dirichlet Allocation
▪ Gradient Boosted Decision Trees/Random Forests
▪ Gaussian Processes
▪ …

Main Challenge - Scale

● Algorithms @ Netflix Scale
○ > 62 M Members
○ > 50 Countries
○ > 1000 device types
○ > 100M Hours / day
● Can distributed Machine Learning algorithms help with Scale?

Spark and GraphX

● Spark - Distributed in-memory computational engine using Resilient Distributed Datasets (RDDs)
● GraphX - extends RDDs to Multigraphs and provides graph analytics
● Convenient and fast, all the way from prototyping (spark-notebook, iSpark, Zeppelin) to production

Two Machine Learning Problems

● Generate ranking of items with respect to a given item from an interaction graph
○ Graph Diffusion algorithms (e.g. Topic Sensitive Pagerank)
● Find Clusters of related items using co-occurrence data
○ Probabilistic Graphical Models (Latent Dirichlet Allocation)

Iterative Algorithms in GraphX

[Figure: example graph with vertices v1, v2, v3, v4, v6 and v7; vertices carry a Vertex Attribute and edges carry an Edge Attribute]

● GraphX represents the graph as RDDs, e.g. VertexRDD, EdgeRDD
● GraphX provides APIs to propagate and update attributes
● Iterative Algorithm proceeds by creating updated graphs
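
The pattern described on these slides (attributes on vertices and edges, messages propagated along edges, each iteration producing a new graph) can be illustrated with a small single-machine sketch. This is plain Python, not the GraphX API; the `Graph` class, the `step` function, and the toy attribute values are all hypothetical names and numbers chosen for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Graph:
    """Immutable toy graph: vertex attributes plus attributed directed edges."""
    vertices: dict  # vertex id -> vertex attribute (a float in this sketch)
    edges: tuple    # (src, dst, edge attribute) triples


def step(graph: Graph) -> Graph:
    """One iteration: send messages along edges, merge them, return a new graph."""
    inbox = {v: 0.0 for v in graph.vertices}
    for src, dst, weight in graph.edges:
        # Message = source vertex attribute scaled by the edge attribute.
        inbox[dst] += graph.vertices[src] * weight
    # The input graph is left untouched; the update is a new Graph object.
    return Graph(vertices=inbox, edges=graph.edges)


g0 = Graph(
    vertices={"v1": 1.0, "v2": 0.0, "v3": 0.0},
    edges=(("v1", "v2", 0.5), ("v1", "v3", 0.5), ("v2", "v3", 1.0)),
)
g1 = step(g0)
g2 = step(g1)
print(g1.vertices)
print(g2.vertices)
```

Returning a fresh graph from every step mirrors the "creating updated graphs" idea: earlier graphs stay available, which also makes it natural to save intermediate state.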

Graph Diffusion algorithms

Topic Sensitive Pagerank @ Netflix

● Popular graph diffusion algorithm
● Captures vertex importance with regard to a particular vertex
● e.g. for the topic “Seattle”

Iteration 0

We start by activating a single node: “Seattle”

[Figure: graph around the “Seattle” node, with edges labeled “related to”, “shot in”, “featured in” and “cast”]

Iteration 1

With some probability, we follow outbound edges, otherwise we go back to the origin.

Iteration 2

Vertex accumulates higher mass

Iteration 2

And again, until convergence
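
The iterations above can be sketched for a single starting node as follows. This is a minimal single-machine illustration, not the Netflix implementation: the function name, the follow probability of 0.85, the convergence tolerance, and the tiny graph with its node names (`title_a`, `title_b`, `actor_x`) are all assumptions made for the example.

```python
def topic_sensitive_pagerank(out_edges, origin, follow_prob=0.85,
                             tol=1e-10, max_iter=1000):
    """Diffuse activation mass from one origin node until it stops changing."""
    nodes = list(out_edges)
    mass = {n: 0.0 for n in nodes}
    mass[origin] = 1.0  # Iteration 0: only the origin is active.
    for _ in range(max_iter):
        new_mass = {n: 0.0 for n in nodes}
        for node, targets in out_edges.items():
            if targets:
                # With probability follow_prob, follow an outbound edge...
                share = follow_prob * mass[node] / len(targets)
                for t in targets:
                    new_mass[t] += share
                # ...otherwise go back to the origin.
                new_mass[origin] += (1.0 - follow_prob) * mass[node]
            else:
                # Dead end: all mass returns to the origin.
                new_mass[origin] += mass[node]
        delta = sum(abs(new_mass[n] - mass[n]) for n in nodes)
        mass = new_mass
        if delta < tol:
            break
    return mass


# Hypothetical interaction graph; every node lists its outbound neighbours.
graph = {
    "Seattle": ["title_a", "title_b"],
    "title_a": ["Seattle", "actor_x"],
    "title_b": ["Seattle"],
    "actor_x": ["title_a"],
}
scores = topic_sensitive_pagerank(graph, "Seattle")
for node, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(f"{node}: {score:.3f}")
```

Sorting the nodes by their final mass gives a ranking of items with respect to the starting node.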

GraphX implementation

● Running one propagation for each possible starting node would be slow
● Keep a vector of activation probabilities at each vertex
● Use GraphX to run all propagations in parallel

Topic Sensitive Pagerank in GraphX

Activation probabilities as vertex attributes

[Figure: each vertex holds a vector whose entries are the activation probability starting from vertex 1, from vertex 2, from vertex 3, ...]
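
A rough single-machine sketch of the "vector of activation probabilities at each vertex" idea follows: one pass over the edges advances the propagations for every starting vertex at once. The function name, the follow probability, the fixed iteration count, and the small example graph are assumptions for illustration and do not come from the Netflix code.

```python
def all_sources_pagerank(out_edges, follow_prob=0.85, num_iter=200):
    """Run one propagation per starting vertex, all in the same edge sweep."""
    nodes = list(out_edges)
    k = len(nodes)
    # Vertex attribute: attr[v][i] is the activation probability of v for the
    # propagation that started at nodes[i]. Each propagation starts at its origin.
    attr = {v: [1.0 if nodes[i] == v else 0.0 for i in range(k)] for v in nodes}
    for _ in range(num_iter):
        new_attr = {v: [0.0] * k for v in nodes}
        for v, targets in out_edges.items():
            for i, origin in enumerate(nodes):
                m = attr[v][i]
                if m == 0.0:
                    continue
                if targets:
                    for t in targets:
                        new_attr[t][i] += follow_prob * m / len(targets)
                    new_attr[origin][i] += (1.0 - follow_prob) * m
                else:
                    new_attr[origin][i] += m
        attr = new_attr
    return nodes, attr


# Hypothetical interaction graph; every node lists its outbound neighbours.
graph = {
    "Seattle": ["title_a", "title_b"],
    "title_a": ["Seattle", "actor_x"],
    "title_b": ["Seattle"],
    "actor_x": ["title_a"],
}
nodes, attr = all_sources_pagerank(graph)
seattle = nodes.index("Seattle")
ranking = sorted(nodes, key=lambda v: attr[v][seattle], reverse=True)
print(ranking)
```

The vectors grow with the number of starting vertices, so the amount of data moved per iteration grows with it too.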

Example graph diffusion results

“Matrix”

“Zombies”

“Seattle”

Distributed Clustering algorithms

LDA @ Netflix

● A popular clustering/latent factors model
● Discovers clusters/topics of related videos from Netflix data
● e.g. a topic of Animal Documentaries

LDA - Graphical Model

[Figure: graphical model annotated with the per-topic word distributions, the per-document topic distributions, and the topic label for document d and word w]

Question: How to parallelize inference?
Answer: Read conditional independencies in the model

Gibbs Sampler 1 (Semi Collapsed)

Sample Topic Labels in a given document Sequentially
Sample Topic Labels in different documents In parallel

Gibbs Sampler 2 (UnCollapsed)

Sample Topic Labels in a given document In parallel
Sample Topic Labels in different documents In parallel
Suitable For GraphX

Distributed Gibbs Sampler

[Figure: graph with word vertices w1, w2, w3 and document vertices d1, d2; every vertex carries a vector of 3 values]

A distributed parameterized graph for LDA with 3 Topics

● Vertices: document (d1, d2) and word (w1, w2, w3)
● Edge: if word appeared in the document
● Document vertex attribute: Per-document topic distribution
● Word vertex attribute: Per-topic word distributions
● (vertex, edge, vertex) = triplet

Distributed Gibbs Sampler, one iteration

● Categorical distribution for the triplet using vertex attributes
● Categorical distributions for all triplets
● Sample Topics for all edges
● Neighborhood aggregation for topic histograms
● Realize samples from Dirichlet to update the graph
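
The steps listed for the distributed sampler can be sketched on a single machine as below. This is an illustrative uncollapsed Gibbs iteration in plain Python, not the GraphX code from the talk. The edge list, the mapping of the slide's numbers onto vertices, the prior values `ALPHA` and `BETA`, and all function names are assumptions; each edge is treated as a single word occurrence.

```python
import random

NUM_TOPICS = 3
ALPHA = 0.5  # assumed Dirichlet prior for per-document topic distributions
BETA = 0.5   # assumed Dirichlet prior for per-topic word distributions


def dirichlet(params, rng):
    """Draw from a Dirichlet by normalising independent Gamma draws."""
    draws = [rng.gammavariate(p, 1.0) for p in params]
    total = sum(draws)
    return [d / total for d in draws]


def gibbs_iteration(edges, doc_topics, word_probs, rng):
    words = sorted(word_probs)
    # Steps 1-2: categorical distribution per triplet, then one topic per edge.
    sampled = {}
    for word, doc in edges:
        weights = [doc_topics[doc][k] * word_probs[word][k]
                   for k in range(NUM_TOPICS)]
        sampled[(word, doc)] = rng.choices(range(NUM_TOPICS), weights=weights)[0]
    # Step 3: neighborhood aggregation into topic histograms on both vertex types.
    doc_hist = {d: [0] * NUM_TOPICS for d in doc_topics}
    word_hist = {w: [0] * NUM_TOPICS for w in words}
    for (word, doc), k in sampled.items():
        doc_hist[doc][k] += 1
        word_hist[word][k] += 1
    # Step 4: realize Dirichlet samples to get the updated vertex attributes.
    new_doc_topics = {d: dirichlet([ALPHA + c for c in h], rng)
                      for d, h in doc_hist.items()}
    new_word_probs = {w: [0.0] * NUM_TOPICS for w in words}
    for k in range(NUM_TOPICS):
        column = dirichlet([BETA + word_hist[w][k] for w in words], rng)
        for w, p in zip(words, column):
            new_word_probs[w][k] = p
    return sampled, new_doc_topics, new_word_probs


# Assumed layout: word vertices hold one probability per topic, document
# vertices hold a topic distribution; an edge means the word is in the document.
word_probs = {"w1": [0.3, 0.4, 0.1], "w2": [0.3, 0.2, 0.8], "w3": [0.4, 0.4, 0.1]}
doc_topics = {"d1": [0.3, 0.6, 0.1], "d2": [0.2, 0.5, 0.3]}
edges = [("w1", "d1"), ("w2", "d1"), ("w2", "d2"), ("w3", "d2")]

rng = random.Random(0)
sampled, new_docs, new_words = gibbs_iteration(edges, doc_topics, word_probs, rng)
print(sampled)
print(new_docs)
print(new_words)
```

Because every edge is sampled from the current vertex attributes only, the sampling loop has no dependency between edges, which is what makes the uncollapsed variant easy to run in parallel.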

Example LDA Results

Cluster of Bollywood Movies

Cluster of Kids shows

Cluster of Western movies

GraphX performance comparison

Algorithm Implementations

● Topic Sensitive Pagerank
○ Distributed GraphX implementation
○ Alternative Implementation: Broadcast graph adjacency matrix, Scala/Breeze code, triggered by Spark
● LDA
○ Distributed GraphX implementation
○ Alternative Implementation: Single machine, Multi-threaded Java code
● All implementations are Netflix internal code

Performance Comparison

● Open Source DBPedia dataset
● Sublinear rise in time with GraphX vs. linear rise in the Alternative
● Doubling the size of cluster: 2.0 speedup in the Alternative Impl vs. 1.2 in GraphX
● Large number of vertices propagated in parallel leads to large shuffle data, causing failures in GraphX for small clusters

Performance Comparison

● Netflix dataset, Number of Topics = 100
● GraphX setup: 8x the resources of the Multi-Core setup
● Wikipedia dataset, 100 Topic LDA, Cluster: (16 x r3.2xl) (source: Databricks)
● GraphX for very large datasets outperforms the multi-core unCollapsed Impl

Lessons Learned

What we learned so far...

● Where is the cross-over point for your iterative ML algorithm?
○ GraphX brings performance benefits if you’re on the right side of that point
○ GraphX lets you easily throw more hardware at a problem
● GraphX very useful (and fast) for other graph processing tasks
○ Data pre-processing
○ Efficient joins

What we learned so far ...

● Regularly save the state
○ With a 99.9% success rate per iteration, what’s the probability of successfully running 1,000 iterations?
○ ~37% (0.999^1000 ≈ 0.368)
● Multi-Core Machine learning (r3.8xl, 32 threads, 220 GB) is very efficient
○ if your data fits in the memory of a single machine!
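
The arithmetic behind "regularly save the state" can be checked in a few lines. The sketch assumes that iterations fail independently with the same probability. The 100-iteration checkpoint interval is a hypothetical value chosen for illustration, not one given in the talk.

```python
def success_probability(per_iteration_success, iterations):
    """Probability that a run of independent iterations completes without failure."""
    return per_iteration_success ** iterations


# Whole job without saved state: all 1,000 iterations must succeed in a row.
p_all = success_probability(0.999, 1000)
print(f"1,000 iterations, no saved state: {p_all:.3f}")

# With state saved every 100 iterations (hypothetical interval), only the
# current 100-iteration segment has to be redone after a failure.
p_segment = success_probability(0.999, 100)
print(f"one 100-iteration segment:        {p_segment:.3f}")
```

The first value comes out at roughly 0.368 and the second at roughly 0.905, so a failure with saved state usually costs one short segment rather than the whole run.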

We’re hiring! (come talk to us)

https://jobs.netflix.com/