Netflix Recommendations Using Spark + Cassandra (Prasanna Padmanabhan & Roopa Tangirala,...

Netflix Recommendations using Spark + Cassandra

Prasanna Padmanabhan, Roopa Tangirala

Turn on Netflix and the absolute best content for you would automatically start playing

Netflix Recommendations

Everything is a Recommendation

Rows

Ranking

Over 80% of what members watch comes from our recommendations

Recommendations are driven by Machine Learning Algorithms

Data Driven

Offline Experiment using Historical Data → Success → Online A/B Testing → Success → Rollout Feature to ALL members

Otherwise: Fail

Algorithmic Page Generation

Trending Now

Offline Experimentation

Algorithmic Page Generation

Personalizing the ordering of rows on the homepage

Without Algorithmic Page Generation vs. With Algorithmic Page Generation

Drawbacks

• Diversity of the Page
• Affinity for specific rows

Production, Variant 1, Variant 2

Row Distribution, TV/Movie Ratio

Evaluate best variant based on the plays

Actual Plays:
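The slides do not say how plays are turned into a score, so the following is only a minimal sketch under a stated assumption: each variant's page is scored by the mean reciprocal rank of the videos the member actually played, reading the page row by row. The names `Page`, `mean_reciprocal_rank`, and `best_variant` are hypothetical and are not taken from the talk.

```python
from typing import Dict, List, Sequence

# A page is a list of rows, top row first; each row is a list of video ids.
Page = List[List[str]]


def flatten(page: Page) -> List[str]:
    """Return video ids in row-major order, keeping only the first occurrence."""
    seen = set()
    ordered = []
    for row in page:
        for video in row:
            if video not in seen:
                seen.add(video)
                ordered.append(video)
    return ordered


def mean_reciprocal_rank(page: Page, plays: Sequence[str]) -> float:
    """Assumed metric: average of 1/position for each played video (0 if absent)."""
    if not plays:
        return 0.0
    position = {video: i + 1 for i, video in enumerate(flatten(page))}
    total = sum(1.0 / position[v] if v in position else 0.0 for v in plays)
    return total / len(plays)


def best_variant(pages: Dict[str, Page], plays: Sequence[str]) -> str:
    """Pick the variant whose page placed the actual plays highest."""
    return max(pages, key=lambda name: mean_reciprocal_rank(pages[name], plays))


pages = {
    "production": [["a", "b"], ["c", "d"]],
    "variant_1": [["c", "a"], ["d", "b"]],
    "variant_2": [["d", "c"], ["b", "a"]],
}
actual_plays = ["c", "d"]
winner = best_variant(pages, actual_plays)
```

With this toy data the played videos sit at the top of the `variant_2` page, so that variant wins. A real evaluation would aggregate over many members and would likely use a different metric.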

Offline Experiment Architecture

Runs once a day

• Member Selection
• Ratings Service, Viewing History Service, MyList Service, …
• Snapshot
• S3
• Snapshot Forklift
• Snapshot Store
• Data Snapshots
• Generate Pages
• Evaluate Metrics
• A/B Test

Data Model - Requirements

• Need for historical service data

• Optimize for Batch Writes and Point Reads

Data Model

COLUMN FAMILY: MYLIST

ROWS (DATE_MEMBER_ID)    COLUMN
20161009_1001            MyList = BLOB
20161009_1002            MyList = BLOB
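As a rough illustration of the DATE_MEMBER_ID row key and the "batch writes, point reads" requirement, here is a minimal sketch. It uses an in-memory dictionary as a stand-in for the column family, so `SnapshotTable`, `batch_write`, and `point_read` are hypothetical helpers and not part of any Cassandra client.

```python
from datetime import date
from typing import Dict, Iterable, Optional, Tuple


def row_key(snapshot_date: date, member_id: int) -> str:
    """Build a DATE_MEMBER_ID key such as '20161009_1001'."""
    return f"{snapshot_date:%Y%m%d}_{member_id}"


class SnapshotTable:
    """In-memory stand-in for one column family (assumption, for illustration)."""

    def __init__(self, column: str) -> None:
        self.column = column
        self._rows: Dict[str, Dict[str, bytes]] = {}

    def batch_write(self, snapshot_date: date,
                    blobs: Iterable[Tuple[int, bytes]]) -> int:
        """Write one blob per member for a given snapshot date."""
        count = 0
        for member_id, blob in blobs:
            self._rows[row_key(snapshot_date, member_id)] = {self.column: blob}
            count += 1
        return count

    def point_read(self, snapshot_date: date, member_id: int) -> Optional[bytes]:
        """Fetch a single member's blob for a single date, or None if absent."""
        row = self._rows.get(row_key(snapshot_date, member_id))
        return None if row is None else row[self.column]


mylist = SnapshotTable("MyList")
written = mylist.batch_write(date(2016, 10, 9), [(1001, b"a"), (1002, b"b")])
```

Putting the date first in the key means each daily snapshot is addressed independently, which matches the need for historical service data.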

Data Model

COLUMN FAMILY: VIEWING-HISTORY

ROWS (DATE_MEMBER_ID)    COLUMN
20161009_1001            ViewingData = BLOB
20161009_1002            ViewingData = BLOB

Data Model

COLUMN FAMILY: VIEWING-HISTORY

ROWS (DATE_MEMBERID_IDX)    COLUMN
20161009_1001_0             ViewingData = BLOB
20161009_1001_1             ViewingData = BLOB
20161009_1001_2             ViewingData = BLOB
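The DATE_MEMBERID_IDX key suggests that one member's viewing data is split across several rows, presumably to keep each blob small. The slides do not give the chunking rule, so this sketch assumes fixed-size chunks; `chunk_rows`, `reassemble`, and the chunk size are hypothetical.

```python
from typing import Dict

CHUNK_BYTES = 4  # illustrative only; a real chunk size would be far larger


def chunk_rows(date_str: str, member_id: int, blob: bytes,
               chunk_bytes: int = CHUNK_BYTES) -> Dict[str, bytes]:
    """Split a blob into rows keyed DATE_MEMBERID_0, DATE_MEMBERID_1, ..."""
    if chunk_bytes <= 0:
        raise ValueError("chunk_bytes must be positive")
    prefix = f"{date_str}_{member_id}"
    pieces = [blob[i:i + chunk_bytes] for i in range(0, len(blob), chunk_bytes)]
    if not pieces:
        pieces = [b""]  # keep one row even for an empty blob
    return {f"{prefix}_{idx}": piece for idx, piece in enumerate(pieces)}


def reassemble(rows: Dict[str, bytes], date_str: str, member_id: int) -> bytes:
    """Read chunks 0, 1, 2, ... until a key is missing, then join them."""
    prefix = f"{date_str}_{member_id}_"
    parts = []
    idx = 0
    while f"{prefix}{idx}" in rows:
        parts.append(rows[f"{prefix}{idx}"])
        idx += 1
    return b"".join(parts)


rows = chunk_rows("20161009", 1001, b"0123456789")
restored = reassemble(rows, "20161009", 1001)
```

Reading until the first missing index costs one extra lookup per member. Storing the chunk count alongside chunk 0 would avoid that, at the price of a slightly richer row.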

Online A/B Testing

Trending Now

Videos that are Trending and Personalized for you

It’s 7 PM on a Monday

It’s 10 PM on a Saturday

Pokémon

Fast Feedback Loop

UI → Data Systems → Streaming Apps → Rec Systems → UI

Trending Now - Data Infrastructure

• UI
• Impression Service: captures videos shown in view port
• Viewing History Service: captures videos played by members
• Compute Trends
• Trends Store
• Online Services
• Model Training
• Publish Models
• Viewing History Service, Ratings, …

State Management in Cassandra

Video                      Number of Plays
Stranger Things            100
Narcos                     200
Orange Is the New Black    300

Trends Store, Compute Trends

• Read Events
• State Present?
    Yes: Load State
    No: Init State from Cassandra
• Update State
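The state-handling flow above can be sketched roughly as follows. This is an illustration under assumptions, not the talk's streaming code: a plain dictionary stands in for the Trends Store in Cassandra, "state present" is taken to mean "already held in memory by this process", and `TrendState` is a hypothetical name.

```python
from collections import Counter
from typing import Dict, Iterable, Optional


class TrendState:
    """Play counts per video, initialised lazily from a backing store."""

    def __init__(self, store: Dict[str, int]) -> None:
        self.store = store                      # stand-in for the Trends Store
        self.state: Optional[Counter] = None    # no state present yet

    def load(self) -> Counter:
        """Load State if present; otherwise Init State from the store."""
        if self.state is None:
            self.state = Counter(self.store)
        return self.state

    def update(self, events: Iterable[str]) -> Counter:
        """Read Events (one video title per play) and Update State."""
        state = self.load()
        for video in events:
            state[video] += 1
        self.store.update(state)                # write the new counts back
        return state


store = {"Stranger Things": 100, "Narcos": 200}
trends = TrendState(store)
state = trends.update(["Narcos", "Narcos", "Stranger Things"])
```

Initialising from the store only when no state is present means a restarted job resumes from the last written counts instead of starting at zero.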

Data Model - Requirements

• Trending data is for a specific interval of time

• Optimize for Batch Writes and Batch Reads

Data Model

COLUMN FAMILY: Interval 1, Interval 2 … Interval N

ROWS (VIDEOID_METADATA)    COLUMNS
101_METADATA               Plays = BLOB, Impressions = BLOB
102_METADATA               Plays = BLOB, Impressions = BLOB
103_METADATA               Plays = BLOB, Impressions = BLOB
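To make the per-interval layout concrete, here is a small sketch that buckets events into intervals and keeps plays and impressions per VIDEOID_METADATA row. Several things are assumptions: the interval length, the literal `METADATA` suffix, and the use of simple counters where the slides show BLOB columns. `aggregate` and `interval_index` are hypothetical names.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

INTERVAL_SECONDS = 3600  # assumed interval length; the slides do not give one

Event = Tuple[int, int, str]  # (epoch seconds, video id, "play" or "impression")


def interval_index(timestamp: int) -> int:
    """Map a timestamp to the interval (column family) it belongs to."""
    return timestamp // INTERVAL_SECONDS


def row_key(video_id: int, metadata: str = "METADATA") -> str:
    """Build a VIDEOID_METADATA key such as '101_METADATA'."""
    return f"{video_id}_{metadata}"


def aggregate(events: Iterable[Event]) -> Dict[int, Dict[str, Dict[str, int]]]:
    """Group events as interval -> row key -> {plays, impressions}."""
    tables = defaultdict(lambda: defaultdict(lambda: {"plays": 0, "impressions": 0}))
    for timestamp, video_id, kind in events:
        if kind not in ("play", "impression"):
            raise ValueError(f"unknown event kind: {kind}")
        column = "plays" if kind == "play" else "impressions"
        tables[interval_index(timestamp)][row_key(video_id)][column] += 1
    return tables


events = [
    (0, 101, "impression"),
    (10, 101, "play"),
    (3600, 101, "impression"),
    (3700, 102, "impression"),
]
tables = aggregate(events)
```

Keeping one table per interval lets a whole interval be written and read in batch, and lets old intervals be dropped as a unit.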

Roopa Tangirala
Engineering Manager @ Netflix
Twitter - @roopatangirala

FORKLIFTER

ARCHITECTURE

SOURCE TARGET

USE CASES

APACHE THRIFT CQL

DEMO

WHY NOT DSE SPARK?

• SCALABILITY
• COST EFFECTIVENESS

LESSONS LEARNT

TTL HANDLING

• TTL Reading and Writing is Asymmetric - CASSANDRA-12216
• Thrift Column TTL vs CQL Row TTL

PARTITION DIFFERENCES

[Figure: diagrams labeled 1 to 6, 100000 to 600000 in steps of 100000, and 25k to 600k in steps of 25k]

TUNING

• spark.cassandra.connection.keep_alive_ms
• spark.cassandra.connection.timeout_ms
• spark.driver.maxResultSize

OOM EXCEPTIONS

• spark.executor.memory
• spark.cassandra.input.split.size_in_mb

WRITE SPEED

• spark.cassandra.output.batch.size.bytes
• spark.cassandra.output.batch.size.rows
• spark.cassandra.output.concurrent.writes
• spark.cassandra.output.throughput_mb_per_sec

Write Timeouts

• spark.cassandra.output.throughput_mb_per_sec
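For reference, the properties listed above can be collected in one place and turned into `spark-submit --conf` arguments. The property names below are the ones from the slides, written with the `spark.` prefix; every value is a placeholder chosen for illustration, not a recommendation from the talk, and the names may differ in other versions of the Spark Cassandra Connector. `to_submit_args` is a hypothetical helper.

```python
from typing import Dict, List

# Placeholder values only: tune against your own cluster and workload.
SETTINGS: Dict[str, str] = {
    # Connection tuning
    "spark.cassandra.connection.keep_alive_ms": "60000",
    "spark.cassandra.connection.timeout_ms": "10000",
    "spark.driver.maxResultSize": "2g",
    # OOM exceptions
    "spark.executor.memory": "8g",
    "spark.cassandra.input.split.size_in_mb": "64",
    # Write speed and write timeouts
    "spark.cassandra.output.batch.size.bytes": "1024",
    "spark.cassandra.output.batch.size.rows": "auto",
    "spark.cassandra.output.concurrent.writes": "5",
    "spark.cassandra.output.throughput_mb_per_sec": "5",
}


def to_submit_args(settings: Dict[str, str]) -> List[str]:
    """Render settings as a flat list of '--conf key=value' arguments."""
    args: List[str] = []
    for key in sorted(settings):
        args += ["--conf", f"{key}={settings[key]}"]
    return args


submit_args = to_submit_args(SETTINGS)
```

The resulting list can be appended to a `spark-submit` command line. Throttling `throughput_mb_per_sec` trades write speed for fewer write timeouts, which is why the slides list it under both headings.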

QUESTIONS?
