From stream to recommendation using apache beam with cloud pubsub and cloud dataflow

Page 1: From stream to recommendation using apache beam with cloud pubsub and cloud dataflow

Speakers: Igor Maravić & Neville Li, Spotify

From stream to recommendation with Cloud Pub/Sub and Cloud Dataflow

DATA & ANALYTICS

Page 2: Current Event Delivery System

Page 3: Current event delivery system

[Architecture diagram. Labelled components: Client (x4), Gateway, Syslog, Syslog Producer, Any Data Centre, Service Discovery, Groupers, Realtime Brokers, ETL job, Checkpoint Monitor, Hadoop, Hadoop Data Center, ACK Brokers, Syslog Consumer, Liveness Monitor, Brokers.]

Page 4: Complex

[Same architecture diagram as Page 3.]

Page 5: Stateless

[Same architecture diagram as Page 3.]

Page 6: Delivered data growth

[Chart: delivered data by year, 2007 to 2015.]

Page 7: Redesigning Event Delivery

Page 8: Redesigning event delivery

[Architecture diagram. Labelled components: Client (x4), Gateway, Syslog, File Tailer, Event Delivery Service, Any data centre, Reliable Persistent Queue, ETL, Hadoop.]

Page 9: Same API

[Same architecture diagram as Page 8.]

Page 10: Persistence

[Same architecture diagram as Page 8.]

Page 11: Keep it simple

[Same architecture diagram as Page 8.]

Page 12: Build it!

Page 13: Choosing a reliable persistent queue

Page 14: Kafka 0.8

Page 15: Proven technology

Page 16: Strong community

Page 17: Reliable persistent queue

Page 18: Event delivery with Kafka 0.8

[Architecture diagram. Labelled components: Client (x4), Gateway, Syslog, File Tailer, Event Delivery Service, Any data centre, Brokers, MirrorMakers, Hadoop data centre, Brokers, Camus (ETL), Hadoop.]

Page 19: Event delivery with Kafka 0.8

[Same architecture diagram as Page 18.]

Page 20: Cloud Pub/Sub

Page 21: Retains undelivered data

Page 22: At-least-once delivery

Page 23: Globally available

Page 24: Simple REST API

Page 25: No operational responsibility*

Page 26: SHUT UP AND TAKE MY MONEY!

Page 27: Caution advised!

Page 28: Building up trust in Cloud Pub/Sub

Page 29: Delivered data growth

[Chart: delivered data by year, 2007 to 2015.]

Page 30: Demo time!

Page 31: 2M events per second.

Page 32: Cloud Pub/Sub, Spotify chooses You!

Page 33: Event delivery with Cloud Pub/Sub

[Architecture diagram. Labelled components: Client (x4), Gateway, Syslog, File Tailer, Event Delivery Service, Any data centre, Cloud Pub/Sub, Dataflow (ETL using Cloud Dataflow), Cloud Storage, Hadoop.]

Page 34: Streaming ETL job with Cloud Dataflow

Page 35: Dataflow SDK is a framework

Page 36: Cloud Dataflow is a managed service

Page 37: ETL job

Page 38: Single Cloud Pub/Sub subscription

[Dataflow job step: Consume (Running).]

Page 39: GCS and HDFS in parallel.

Page 40: Event time based hourly buckets

[Figure: hourly buckets labelled 2016-03-21 23H, 2016-03-22 00H, 2016-03-22 01H, 2016-03-22 02H, 2016-03-22 03H, 2016-03-22 04H.]
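A minimal sketch of what bucketing by event time could look like, written in plain Java rather than the Dataflow SDK. The class and method names are hypothetical, and using UTC for the bucket labels is an assumption; the slide only shows labels of the form "2016-03-22 03H".

```java
import java.time.Instant;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;
import java.time.temporal.ChronoUnit;

/** Hypothetical helper: maps an event's own timestamp to an hourly bucket. */
final class HourlyBuckets {
    // Label format mirrors the slide ("2016-03-22 03H"); UTC is an assumption.
    private static final DateTimeFormatter LABEL =
        DateTimeFormatter.ofPattern("yyyy-MM-dd HH'H'").withZone(ZoneOffset.UTC);

    /** Start of the hour that contains the event time. */
    static Instant bucketStart(Instant eventTime) {
        return eventTime.truncatedTo(ChronoUnit.HOURS);
    }

    /** Human-readable label of the bucket that contains the event time. */
    static String bucketLabel(Instant eventTime) {
        return LABEL.format(bucketStart(eventTime));
    }

    public static void main(String[] args) {
        // The bucket depends on when the event happened, not on when it arrived.
        System.out.println(bucketLabel(Instant.parse("2016-03-22T03:59:59Z"))); // 2016-03-22 03H
        System.out.println(bucketLabel(Instant.parse("2016-03-22T04:00:00Z"))); // 2016-03-22 04H
    }
}
```

An event stamped 03:59:59 that reaches the pipeline hours later would still be assigned to the 03H bucket under this scheme.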

Page 41: Incremental bucket fill

[Figure: hourly buckets 2016-03-21 23H to 2016-03-22 04H.]

Page 42: Bucket completeness

[Figure: hourly buckets 2016-03-21 23H to 2016-03-22 04H.]

Page 43: Late data handling

[Figure: hourly buckets 2016-03-21 23H to 2016-03-22 04H, with buckets 2016-03-21 23H to 2016-03-22 02H shown a second time.]

Page 44

Event time based hourly buckets

Incremental bucket fill

Bucket completeness

Late data handling

Page 45: Windowing

[Dataflow job graph. Steps: Consume (Running), Window (4,061 elements/s), Shard (4,061 elements/s), Write to HDFS (Running), Write to GCS (Running).]

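Page 46: Windowing

    @Override
    public PCollection<KV<String, Iterable<EventMessage>>> apply(
        final PCollection<KV<String, EventMessage>> shardedEvents) {
      return shardedEvents
          .apply("Assign Hourly Windows",
              // Type argument restored from the input element type; the slide showed a folded "<~>".
              Window.<KV<String, EventMessage>>into(
                      FixedWindows.of(ONE_HOUR))
                  // Accept late data for each hourly window for up to one day.
                  .withAllowedLateness(ONE_DAY)
                  .triggering(
                      AfterWatermark.pastEndOfWindow()
                          // Before the watermark passes: fire once a file's worth of events is buffered.
                          .withEarlyFirings(AfterPane.elementCountAtLeast(maxEventsInFile))
                          // After the watermark passes: fire on a full file, or ten seconds
                          // after the first late element, whichever comes first.
                          .withLateFirings(AfterFirst.of(
                              AfterPane.elementCountAtLeast(maxEventsInFile),
                              AfterProcessingTime.pastFirstElementInPane()
                                  .plusDelayOf(TEN_SECONDS))))
                  // Each pane carries only the elements not emitted by earlier panes.
                  .discardingFiredPanes())
          .apply("Aggregate Events", GroupByKey.create());
    }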
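One way to read the windowing configuration above is as a three-way decision per element, based on where the watermark stands relative to the element's hourly window. The sketch below is a simplified model in plain Java, not the Dataflow SDK's implementation: the class, enum, and method names are hypothetical, windows are assumed to be aligned to the hour, and the exact boundary handling in the SDK may differ.

```java
import java.time.Duration;
import java.time.Instant;
import java.time.temporal.ChronoUnit;

/** Simplified model of how an element relates to its hourly window (hypothetical names). */
final class LatenessModel {
    enum Outcome { IN_TIME, LATE, DROPPED }

    static final Duration WINDOW = Duration.ofHours(1);          // ONE_HOUR on the slide
    static final Duration ALLOWED_LATENESS = Duration.ofDays(1); // ONE_DAY on the slide

    static Outcome classify(Instant eventTime, Instant watermark) {
        // End of the hourly window that contains the event time.
        Instant windowEnd = eventTime.truncatedTo(ChronoUnit.HOURS).plus(WINDOW);
        if (watermark.isBefore(windowEnd)) {
            // Watermark has not passed the window: element lands in an early or on-time pane.
            return Outcome.IN_TIME;
        }
        if (watermark.isBefore(windowEnd.plus(ALLOWED_LATENESS))) {
            // Window is past the watermark but still within allowed lateness: late pane.
            return Outcome.LATE;
        }
        // Beyond allowed lateness: the element is no longer accepted into the window.
        return Outcome.DROPPED;
    }

    public static void main(String[] args) {
        Instant event = Instant.parse("2016-03-22T03:10:00Z");
        System.out.println(classify(event, Instant.parse("2016-03-22T03:30:00Z"))); // IN_TIME
        System.out.println(classify(event, Instant.parse("2016-03-22T05:00:00Z"))); // LATE
        System.out.println(classify(event, Instant.parse("2016-03-23T04:00:00Z"))); // DROPPED
    }
}
```

In this reading, "bucket completeness" corresponds to the watermark passing the end of the window, and "late data handling" to the period covered by the allowed lateness.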

Page 47: Streaming

Page 48: Where are we right now?

Page 49: Preliminary results

[Chart: Watermark Lag, in minutes.]

Page 50: Scio

Scala API for Google Cloud Dataflow

Page 51: Origin story

Scalding and Spark popular for ML, recommendations, analytics @ Spotify

50+ users, 400+ unique jobs

Early 2015 - Dataflow Scala hack project

Page 52: Why not Scalding on GCE

Pros

● Big community - Twitter, eBay, Etsy, Stripe, LinkedIn, SoundCloud

● Stable and proven

Cons

● Hadoop cluster operations

● Multi-tenancy, resource contention and utilization

● No streaming mode

Page 53: Why not Spark on GCE

Pros

● Batch, streaming, interactive and SQL

● MLlib, GraphX

● Scala, Python, and R support

Cons

● Hard to tune and scale

● Cluster lifecycle management

Page 54: Why Dataflow with Scala

Dataflow

● Hosted solution, no operations

● Ecosystem: GCS, BigQuery, Pub/Sub, Datastore, Bigtable

● Simple unified model for batch and streaming

Scala

● High level DSL, easy transition for developers

● Reusable and composable code via functional programming

● Numerical libraries: Breeze, Algebird

Page 55

[Stack diagram. Labels: Cloud Storage, Pub/Sub, Datastore, Bigtable, BigQuery; Batch, Streaming, Interactive REPL; Scio Scala API; Dataflow Java SDK, Scala Libraries; Extra features.]

Page 56: Scio

Ecclesiastical Latin IPA: /ˈʃi.o/, [ˈʃiː.o], [ˈʃi.i̯o]

Verb: I can, know, understand, have knowledge.

Core API similar to spark-core, some ideas from scalding

github.com/spotify/scio

Page 57: WordCount

Almost identical to Spark version

    import com.spotify.scio._  // provides ScioContext

    val sc = ScioContext()
    sc.textFile("shakespeare.txt")
      .flatMap(_.split("[^a-zA-Z']+").filter(_.nonEmpty))  // tokenize, drop empty tokens
      .countByValue()
      .saveAsTextFile("wordcount.txt")

Page 58: PageRank in 13 lines

    def pageRank(in: SCollection[(String, String)]) = {
      val links = in.groupByKey()
      var ranks = links.mapValues(_ => 1.0)
      for (i <- 1 to 10) {
        val contribs = links.join(ranks).values
          .flatMap { case (urls, rank) =>
            val size = urls.size
            urls.map((_, rank / size))
          }
        ranks = contribs.sumByKey.mapValues((1 - 0.85) + 0.85 * _)
      }
      ranks
    }

Page 59: SQL and Big Data Pipelines

SQL is easier to write than data pipelines, but

Hive with TSV or Avro

● Row based storage, inefficient full scan

● No integration with other frameworks

Parquet

● Inspired by Google Dremel which powers BigQuery

● Immature Hive integration, hard to scale with Spark SQL

● Poor impedance matching with Scalding, Avro, etc.

Page 60: BigQuery and Scio

BigQuery

● Slicing and dicing, aggregation, etc.

● Scaling independently

● Web UI, Tableau, QlikView etc.

Scio

● Custom logic hard to express in SQL

● Seamless integration with BigQuery IO

● Scala macros for type safety

Page 61: JSON vs Type Safe BigQuery

JSON approach, a.k.a. everything is Object

    sc.bigQuerySelect("...").map { r =>
      (r.get("track").asInstanceOf[TableRow]
         .get("name").asInstanceOf[String],
       r.get("audio").asInstanceOf[TableRow]
         .get("tempo").toString.toInt)
    }

Compile → Run job → Wait → NullPointerException or ClassCastException → Repeat

Type safe approach

    @BigQueryType.fromQuery("...")
    class TrackTempo

    sc.typedBigQuery[TrackTempo]().map { t =>
      (t.track.name, t.audio.tempo.getOrElse(-1))
    }

Compile → Run → Profit

Page 62: Spotify Running

60 million tracks

30 million users * 10 tempo buckets * 25 personalized tracks

Audio: tempo, energy, time signature ...

Metadata: genres, categories

Latent vectors from collaborative filtering
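To get a feel for the scale, the product on the slide (30 million users * 10 tempo buckets * 25 personalized tracks) can be multiplied out. The sketch below treats that product as the number of (user, bucket, track) entries, which is one reading of the slide rather than a stated figure; the class and method names are hypothetical.

```java
/** Hypothetical helper that multiplies out the sizes quoted on the slide. */
final class RunningScale {
    /** Number of (user, tempo bucket, track) entries; fails loudly on overflow. */
    static long entries(long users, long tempoBuckets, long tracksPerBucket) {
        return Math.multiplyExact(Math.multiplyExact(users, tempoBuckets), tracksPerBucket);
    }

    public static void main(String[] args) {
        // 30,000,000 * 10 * 25 = 7,500,000,000, which no longer fits in a 32-bit int.
        System.out.println(entries(30_000_000L, 10, 25));
    }
}
```

Under that reading the job works with about 7.5 billion entries, which is why the arithmetic uses long rather than int.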

Page 63: Rapid prototyping with BigQuery

Page 64: Spotify Running

[Pipeline diagram. Inputs: SELECT user_id, vector FROM UserEntity WHERE ... and SELECT track_id, audio.tempo ... FROM TrackEntity WHERE ... Stages: most popular per recording; top N tracks per artist; bucket by tempo; vector LSH per bucket; GBK (x3); RBK; top tracks per user + bucket; side input. Output: Cloud Datastore.]

Page 65

[Screenshot of a Dataflow job graph: typedBigQuery steps and other steps marked Succeeded or Running, 4,788 elements/s.]

Page 66

Page 67: What’s the catch?

Early stage, some rough edges

No interactive mode → Scio REPL (WIP), BigQuery + Datalab

No machine learning → TensorFlow

Licensed under Apache 2, contributions welcome!

Page 68: Learnings?

Page 70: Thank You

Igor Maravić

Neville Li