Nov / 14 / 16 Building a Scalable Recommender System with Apache Spark, Apache Kafka and Elasticsearch Nick Pentreath

Building a Scalable Recommendation Engine with Apache Spark

Nov / 14 / 16

Building a Scalable Recommender System with Apache Spark, Apache Kafka and Elasticsearch

Nick Pentreath

About

§ @MLnick
§ Principal Engineer, IBM
§ Apache Spark PMC
§ Focused on machine learning
§ Author of Machine Learning with Spark

Agenda

§ Recommender systems & the machine learning workflow
§ Data modelling for recommender systems
§ Why Spark, Kafka & Elasticsearch?
§ Kafka & Spark Streaming
§ Spark ML for collaborative filtering
§ Deploying & scoring recommender models with Elasticsearch
§ Monitoring, feedback & re-training
§ Scaling model serving
§ Demo

Recommender Systems & the ML Workflow

Recommender Systems

Overview

The Machine Learning Workflow

Perception

Data → ??? → Machine Learning → ??? → $$$

The Machine Learning Workflow

Reality

Data

• Historical
• Streaming

Ingest

Data Processing

• Feature transformation & engineering

Model Training

• Model selection & evaluation

Deploy

• Pipelines, not just models

• Versioning

Live System

• Predict given new data

• Monitoring & live evaluation

Feedback Loop

Spark DataFrames

Spark ML

Various ???

Stream (Kafka)

Missing piece!

The Machine Learning Workflow

Recommender Version

Data

Ingest

Data Processing

• Aggregation
• Handle implicit data

Model Training

• ALS
• Ranking-style evaluation

Deploy

• Model size & complexity

Live System

• User & item recommendations
• Monitoring, filters

Feedback => another Event Type

Spark DataFrames

Spark ML

Elasticsearch

• User & Item Metadata

• Events

Elasticsearch

Stream (Kafka)

Data Modeling for Recommender Systems

Data model

User and Item Metadata

System Requirements

User and Item Metadata

Filtering & Grouping

Business Rules

User Interactions

Anatomy of a User Event

Implicit preference data

• Page view
• eCommerce - cart, purchase
• Media – preview, watch, listen

Explicit preference data

• Rating
• Review

Intent data

• Search query

Social network interactions

• Like
• Share
• Follow

Data model

Anatomy of a User Event

How to handle implicit feedback?

Anatomy of a User Event

Why Kafka, Spark & Elasticsearch?

Why Kafka?

Scalability

§ De facto standard for a centralized enterprise message / event queue

Integration

§ Integrates with just about every storage & processing system
§ Good Spark Streaming integration – 1st class citizen
§ Including for Structured Streaming (but still very new & rough!)

Why Spark?

DataFrames

§ Events & metadata are “lightly structured” data
§ Suited to DataFrames
§ Pluggable external data source support

Spark ML

§ Spark ML pipelines – including scalable ALS model for collaborative filtering
§ Implicit feedback & NMF in ALS
§ Cross-validation
§ Custom transformers & algorithms

Why Elasticsearch?

Storage

§ Native JSON
§ Scalable
§ Good support for time-series / event data
§ Kibana for data visualisation
§ Integration with Spark DataFrames

Scoring

§ Full-text search
§ Filtering
§ Aggregations (grouping)
§ Search ~== recommendation (more later)

Kafka for Recommender Systems

Event Data Pipeline

Kafka → Spark Streaming →

• Item analytics & aggregation
• User analytics & aggregation
• Event store
• Dashboards

Write to Event Store

Spark Streaming → Event store

Kibana Dashboards

Spark Streaming → Dashboards

Item Metadata Analytics

Spark Streaming → Item analytics & aggregation

Aggregated activity metrics

User Metadata Analytics

Spark Streaming → User analytics & aggregation

Aggregated activity metrics & item exclusions

Structured Streaming

Status

§ Still early days
§ Initial Kafka support in Spark 2.0.2
§ No ES support yet – not clear if it will be a full-blown datasource or ForeachWriter
§ For now, you can create a custom ForeachWriter for your needs

Spark ML for Collaborative Filtering

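Matrix Factorization

Collaborative Filtering

[Figure: a sparse ratings matrix (surviving residue: “315 2”, “12 1”; entry positions lost in extraction) factorized into the two factor matrices below.]

$$
\begin{pmatrix}
-1.1 & 3.2 & 4.3 \\
0.2 & 1.4 & 3.1 \\
2.5 & 0.3 & 2.3 \\
4.3 & -2.4 & 0.5 \\
3.6 & 0.3 & 1.2
\end{pmatrix}
\times
\begin{pmatrix}
0.2 & 1.7 & 2.3 & 0.1 \\
1.9 & 0.4 & 0.8 & -0.3 \\
1.5 & -1.2 & 0.3 & 1.2
\end{pmatrix}
$$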

Prediction

Collaborative Filtering

[Figure: the ratings matrix and the two factor matrices from the Matrix Factorization slide, repeated.]
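As a rough illustration of prediction with a factorized model, the sketch below computes a predicted preference as the dot product of one row of the first factor matrix and one column of the second, using the values shown on the slide. Treating the 5×3 matrix as user factors and the 3×4 matrix as item factors is an assumption; the extracted slide does not label them. The function names `predict` and `top_items` are hypothetical.

```python
# Factor values copied from the slide. Which matrix holds users and which
# holds items is an assumption made for this sketch.
USER_FACTORS = [
    [-1.1, 3.2, 4.3],
    [0.2, 1.4, 3.1],
    [2.5, 0.3, 2.3],
    [4.3, -2.4, 0.5],
    [3.6, 0.3, 1.2],
]
ITEM_FACTORS = [  # one column per item
    [0.2, 1.7, 2.3, 0.1],
    [1.9, 0.4, 0.8, -0.3],
    [1.5, -1.2, 0.3, 1.2],
]


def predict(user, item):
    """Predicted preference: dot product of the user row and the item column."""
    return sum(
        USER_FACTORS[user][f] * ITEM_FACTORS[f][item]
        for f in range(len(ITEM_FACTORS))
    )


def top_items(user, n=2):
    """Rank all items for a user by predicted preference, best first."""
    scores = [(predict(user, i), i) for i in range(len(ITEM_FACTORS[0]))]
    return [i for _, i in sorted(scores, reverse=True)[:n]]
```

Recommending for a user then amounts to scoring every item this way and sorting, which is the brute-force step that the later scoring slides try to make fast.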

Loading Data in Spark ML

Collaborative Filtering

Implicit Preference Data

Alternating Least Squares
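The slide's own code did not survive extraction. As a hedged sketch of one common way to handle implicit data, the example below aggregates raw events into a per-(user, item) strength and then derives the binary preference and confidence weight used in the implicit-feedback ALS formulation (confidence = 1 + alpha × strength). The event weights, the `alpha` default and all function names are illustrative assumptions, not values from the talk.

```python
from collections import defaultdict

# Hypothetical event weights: the deck does not say how event types are weighted.
EVENT_WEIGHTS = {"view": 1.0, "cart": 3.0, "purchase": 5.0}


def aggregate(events):
    """Sum weighted events into one raw strength r_ui per (user, item) pair."""
    strength = defaultdict(float)
    for user, item, event_type in events:
        strength[(user, item)] += EVENT_WEIGHTS[event_type]
    return dict(strength)


def preference_and_confidence(r_ui, alpha=40.0):
    """Binary preference p_ui and confidence c_ui = 1 + alpha * r_ui.

    alpha is a tuning parameter; 40.0 is only an illustrative default.
    """
    p_ui = 1.0 if r_ui > 0 else 0.0
    c_ui = 1.0 + alpha * r_ui
    return p_ui, c_ui
```

In this formulation an unobserved pair is not a negative rating; it is a zero preference held with low confidence, which is why aggregation of event counts matters before training.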

Deploying & Scoring Recommendation Models

Full-text Search & Similarity

Prelude: Search

[Figure: the query “cat videos” passes through Analysis to become a term vector (0 1 ⋯ 1 0 ⋯); Scoring computes its Similarity against the document term vectors; Ranking sorts the results.]

Document term vectors (the labelled term columns are “cat” and “videos”):

0 0 ⋯ 0 1 ⋯
0 1 ⋯ 1 1 ⋯
1 1 ⋯ 0 0 ⋯
1 0 ⋯ 0 1 ⋯

Can we use the same machinery?

Recommendation

[Figure: the search pipeline (Analysis, Term vectors, Scoring, Ranking) with the query replaced by a user (or item) vector (1.2 ⋯ −0.2 0.3); Similarity is computed against the indexed vectors and the results are sorted.]

Dot product & cosine similarity… the same as we need for recommendations!

Delimited Payload Filter

Elasticsearch Term Vectors

Raw vector

1.2 ⋯ −0.2 0.3

Term vector with payloads

0|1.2 ⋯ 3|-0.2 4|0.3

Custom analyzer
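To make the raw-vector to payload mapping concrete, here is a minimal sketch that turns a factor vector into the `index|value` token string shown above, and parses it back. It only illustrates the text format; the helper names are hypothetical, and the analyzer configuration on the Elasticsearch side is not shown.

```python
def to_payload_string(vector):
    """Encode a dense vector as space-separated 'index|value' tokens."""
    return " ".join(f"{i}|{v}" for i, v in enumerate(vector))


def from_payload_string(text):
    """Decode 'index|value' tokens back into an {index: value} mapping."""
    vector = {}
    for token in text.split():
        index, value = token.split("|")
        vector[int(index)] = float(value)
    return vector
```

Each vector index becomes a "term" and the factor value rides along as that term's payload, which is what lets a scoring function look values up by index at query time.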

Custom scoring function

Elasticsearch Scoring

• Native script (Java), compiled for speed
• Scoring function computes dot product by:
  § For each document vector index (“term”), retrieve payload
  § score += payload * query(i)
• Normalizes with query vector norm and document vector norm for cosine similarity
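The arithmetic described above can be sketched in a few lines. This is a plain Python illustration of the scoring logic over an `index|value` token string, not the Java native script itself; the function name `cosine_score` and the zero-norm handling are assumptions.

```python
import math


def cosine_score(doc_terms, query):
    """Cosine similarity between a payload-encoded document vector and a query.

    doc_terms: space-separated 'index|value' tokens for the document vector.
    query: dense list of floats, indexed by the same vector positions.
    """
    dot = 0.0
    doc_norm_sq = 0.0
    for token in doc_terms.split():
        index, payload = token.split("|")
        value = float(payload)
        dot += value * query[int(index)]  # score += payload * query(i)
        doc_norm_sq += value * value
    query_norm = math.sqrt(sum(q * q for q in query))
    if doc_norm_sq == 0.0 or query_norm == 0.0:
        return 0.0  # assumption: treat zero vectors as having no similarity
    return dot / (math.sqrt(doc_norm_sq) * query_norm)
```

Skipping the final normalization leaves the raw dot product, which is the quantity a factorization model uses for predicted preference; the normalized form is what item-to-item similarity typically needs.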

Can we use the same machinery?

Recommendation

[Figure: the recommendation pipeline annotated with its Elasticsearch components. Analysis and Term vectors use the delimited payload filter; Scoring uses the custom scoring function with the user (or item) vector (1.2 ⋯ −0.2 0.3) as the query; Ranking sorts the results.]

Indexed factor vectors:

−1.1 1.3 ⋯ 0.4
1.2 −0.2 ⋯ 0.3
0.5 0.7 ⋯ −1.3
0.9 1.4 ⋯ −0.8

We get search engine functionality for free!

Elasticsearch Scoring

Deploying to Elasticsearch

Alternating Least Squares

Monitoring & Feedback

Logging Recommendations Served

System Events

Logging Recommendation Actions

System Events

Tracking Performance

Kafka → Spark Streaming →

• Impression capping / fatigue
• Performance monitoring & alerts
• Event store
• Dashboards

Scaling Model Scoring

Scoring Performance

[Figure: Scoring time per query, by factor dimension & number of items. x-axis: Size of item set (100,000; 1,000,000); y-axis: Time (ms), 0 to 600; series: k=20, k=50, k=100.]

*3x nodes, 30x shards

Scoring Performance

[Figure: Scoring time per query, by number of shards & number of items. x-axis: Size of item set (100,000; 1,000,000); y-axis: Time (ms), 0 to 500; series: 10 shards, 30 shards, 60 shards, 90 shards; annotation: Increasing number of shards.]

*3x nodes, k=50

Scoring Performance

Locality Sensitive Hashing

• LSH hashes each input vector into L “hash tables”. Each table contains a “hash signature” created by applying k hash functions.
• Standard for cosine similarity is Sign Random Projections
• At indexing time, create a “bucket” by combining hash table id and hash signature
• Store buckets as part of item model metadata
• At scoring time, filter candidate set using term filter on buckets of query item
• Tune LSH parameters to trade off speed / accuracy
• LSH coming soon to Spark ML – SPARK-5992
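The bucket construction described above can be sketched as follows, assuming sign random projections with Gaussian hyperplanes. The function names, the fixed seed and the `tableId_signature` bucket format are illustrative choices, not details taken from the talk or the plugin.

```python
import random


def make_hyperplanes(dim, num_tables, num_hashes, seed=42):
    """Draw num_tables sets of num_hashes random hyperplanes (L tables, k hashes)."""
    rng = random.Random(seed)
    return [
        [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(num_hashes)]
        for _ in range(num_tables)
    ]


def buckets(vector, hyperplanes):
    """Return one bucket string per table: '<table id>_<sign-bit signature>'."""
    result = []
    for table_id, table in enumerate(hyperplanes):
        # Each hash function contributes one bit: the sign of a random projection.
        bits = "".join(
            "1" if sum(h * x for h, x in zip(plane, vector)) >= 0 else "0"
            for plane in table
        )
        result.append(f"{table_id}_{bits}")
    return result


def is_candidate(query_buckets, item_buckets):
    """An item survives the filter if it shares at least one bucket with the query."""
    return bool(set(query_buckets) & set(item_buckets))
```

Because only the sign of each projection is kept, vectors pointing in the same direction land in the same buckets regardless of length, which suits cosine similarity. Raising k makes buckets more selective (faster, lower recall), while raising L adds more chances to match (slower, higher recall).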

Scoring Performance

Locality Sensitive Hashing

[Figure: Scoring time per query - brute force vs LSH. y-axis: Time (ms), 0 to 250; bars: Brute force, LSH.]

*3x nodes, 30x shards, k=50, 1,000,000 items

Scoring Performance

Comparison to “score then search”

[Figure: Scoring time per query – LSH vs score-then-search. y-axis: Time (ms), 0 to 250; bars: Brute force, LSH, Score-then-search; components: Score, Sort, Search.]

*3x nodes, 30x shards, k=50, 1,000,000 items

Demo

Future Work

• Apache Solr version of scoring plugin (any takers?)
• Investigate ways to improve Elasticsearch scoring performance
  § Performance for LSH-filtered scoring should be better!
  § Can we dig deep into ES scoring internals to combine efficiency of matrix-vector math with ES search & filter capabilities?
• Investigate more complex models
  § Factorization machines & other contextual recommender models
  § Scoring performance
• Spark Structured Streaming with Kafka, Elasticsearch & Kibana
  § Continuous recommender application including data, model training, analytics & monitoring

References

• Elasticsearch

• Elasticsearch Spark Integration

• Spark ML ALS for Collaborative Filtering

• Collaborative Filtering for Implicit Feedback Datasets

• Factorization Machines

• Elasticsearch Term Vectors & Payloads

• Delimited Payload Filter

• Vector Scoring Plugin

• Kafka & Spark Streaming

• Kibana

Thanks!

https://github.com/MLnick/elasticsearch-vector-scoring