Leveraging Kafka for Big Data in Real Time Bidding, Analytics, ML & Campaign Management for Globally Distributed Data Flows

Page 1: Leveraging Kafka for Big Data in Real Time Bidding, Analytics, ML & Campaign Management for Globally Distributed Data Flows

@helenaedelson #kafkasummit

Leveraging Kafka for Big Data in Real Time Bidding, Analytics, ML & Campaign Management for Globally Distributed Data Flows

Helena Edelson @helenaedelson Kafka Summit 2016

Page 2

VP of Engineering, Tuplejump

Previously: Sr Cloud / Big Data / Analytics Engineer: DataStax, CrowdStrike, VMware, SpringSource...

Event-Driven systems, Analytics, Machine Learning, Scala

Committer: Kafka Connect Cassandra, Spark Cassandra Connector

Contributor: Akka, previously: Spring Integration

Speaker: Kafka Summit, Spark Summit, Strata, QCon, Scala Days, Scala World, Philly ETE

twitter.com/helenaedelson

github.com/helena

slideshare.net/helenaedelson

Page 3

The Real Topic

http://www.slideshare.net/palvaro/ricon-keynote-outwards-from-the-middle-of-the-maze/42

Page 4

Chaos Of Distribution

One of the more fascinating problems is that of solving the chaos of distributed systems, regardless of the domain.

Page 5

Approaching this within the use case of:

High-Level Landscape

Platform & Infrastructure

Strategies and Patterns

Four-Letter Acronyms

Can't Touch This

Architecture

Page 6

The Landscape

Page 7

The Digital Ad Industry

Page 8

An RTB Drive-By

Real-time auction for ad spaces, all devices

High throughput, low latency (similar to FinTech, but not quite)

OpenRTB API spec, but not everyone uses it: an open protocol for automated trading of digital media across platforms, devices, and advertising solutions

Page 9

In A Nutshell

[Diagram: auction flow]

User hits a Publisher's page

Advertisers send Bid Requests

Highest Bid Accepted

Ad Delivered to User

Page 10

RTB Auction for Impressions

[Diagram: participants and message flow]

User Devices

Site: ad-supported content

Publisher: seller, has ad spaces to sell to the highest bidders

Real Time Exchange & Auction (SSP): OpenRTB server used to bid

Bidder Service (DSP): OpenRTB client

Advertiser: buyer, wants ad impressions; uses bidders to bid on its behalf

Messages: ad request, bid request, bid response, win notice & settlement price, insert orders, winning ad

Page 11

Time Is Money

RTB: Maximum response latency of 100 ms

Page 12

Time Is Money

Assume some network latency!
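The arithmetic behind these two slides can be sketched in a few lines of Scala. Only the 100 ms ceiling comes from the deck; the `LatencyBudget` object and the round-trip figure passed to it are hypothetical names and values chosen for illustration.

```scala
object LatencyBudget {
  // Maximum RTB response latency stated on the slide.
  val MaxResponseMs: Int = 100

  /** Milliseconds left for computing a bid once the network round trip
    * (an assumed, measured input) is subtracted; never negative. */
  def computeBudgetMs(networkRoundTripMs: Int): Int =
    math.max(0, MaxResponseMs - networkRoundTripMs)
}
```

With an assumed 40 ms round trip, `LatencyBudget.computeBudgetMs(40)` leaves 60 ms for the bidder's own work.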

Page 13

Sampling of RTB Events

Ad Request

Bid Request - JSON 100 bytes

Compute optimal bid for advertiser

Bid Response - JSON 1000 bytes (may include ad metadata)

Win Notification (may or may not exist) with settlement price

Ad Impression - when the ad is viewed

Ad Click

Ad Conversion

Page 14

Event Streams

Auctions: auction data + bid requests

Ad Impressions: which ad ids were shown

Ad Clicks: which auction ids resulted in a click

Ad Conversions: streams joined on auction id

Analytics Aggregations & ML to derive hundreds of metrics and dimensions
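Joining streams on auction id can be illustrated with plain Scala collections standing in for the Kafka streams. This is a minimal sketch, not the platform's implementation: the `Impression` and `Click` shapes and their field names are assumptions.

```scala
// Hypothetical event shapes; field names are assumptions for illustration.
final case class Impression(auctionId: String, adId: String)
final case class Click(auctionId: String)

object AuctionJoin {
  /** Inner join of impressions and clicks on auction id:
    * returns (auctionId, adId) for every impression whose auction was clicked. */
  def clickedAds(impressions: Seq[Impression], clicks: Seq[Click]): Seq[(String, String)] = {
    val clickedAuctions: Set[String] = clicks.map(_.auctionId).toSet
    impressions.collect {
      case Impression(auctionId, adId) if clickedAuctions(auctionId) => (auctionId, adId)
    }
  }
}
```

In a streaming setting the same join would need a bounded time window, since a click can arrive well after its impression.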

Page 15

Real Time: Just means event-driven, or processing events as they arrive. It does not automatically equal sub-second latency requirements.

Seen / Ingestion Time: When an event is ingested into the system.

Event Time: When an event is created, e.g. on a device.
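To make the difference between event time and ingestion time concrete, here is a minimal Scala sketch. The `Event` class, its field names, and the hourly window are assumptions for illustration and are not taken from the talk.

```scala
// Hypothetical event carrying both timestamps, in epoch milliseconds.
final case class Event(id: String, eventTimeMs: Long, seenTimeMs: Long)

object EventTimes {
  val HourMs: Long = 60L * 60L * 1000L

  /** Delay between creation on the device and ingestion into the system. */
  def ingestionLagMs(e: Event): Long = e.seenTimeMs - e.eventTimeMs

  /** Start of the hourly window the event belongs to, by event time
    * (assumes non-negative timestamps). */
  def eventTimeHour(e: Event): Long = e.eventTimeMs - (e.eventTimeMs % HourMs)

  /** Groups events by event-time hour, regardless of arrival order. */
  def byEventHour(events: Seq[Event]): Map[Long, Seq[Event]] =
    events.groupBy(eventTimeHour)
}
```

An event created just before an hour boundary but ingested just after it still lands in the earlier window when grouped by event time.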

Page 16

The Platform

Page 17

Platform Requirements

24 / 7 Uptime

Brokerage model: DSPs only make $ on successful ad deliveries, so uptime is critical

Security

Enable service across the globe

Handle thousands of concurrent requests per second

Scale to traffic of 700TB per day

Manage 700TB per day of data

Derive Metrics

Page 18

Business Requirements

Support SLAs for bid transactions

Legal constraints - user data crossing borders

The critical path must be fast to win

No data loss on ingestion path

Bid & Campaign Optimization

Frequency Capping

Management UI for Publishers & Advertisers

Page 19

Questions To Answer

% Writes on ingestion, analytics pre-aggregation, etc.

% Reads of raw data by analytics, aggregated views by customer management UI

How much in memory on RTB app nodes?

Dimensions of data in analytics queries

Optimization Algos

What needs real-time feedback loops, and what does not

Which data flows are low-latency / high-frequency, and which are not

Where are the potential bottlenecks

Page 20

Constraints

Resources - I need to build highly functioning teams that are psyched about the work and working together

Budget

Cloud Resources

JDK Version (What?!)

Existing infrastructure & technologies that will be replaced later but you have to deal with now :(

Pro Tip: Pay well, allow people to grow & be creative

Page 21

Strategies To Avoid

Page 22

Beware of the C word

Consistency?

Convergence?

Page 23

http://www.slideshare.net/palvaro/ricon-keynote-outwards-from-the-middle-of-the-maze/39

he went there

@palvaro

Page 24

Complexity

Can't Ops your way out of that

Page 25

Occam's razor: Simpler theories are preferable to more complex ones

Page 26

Strategies

Page 27

Approaches

Eventual/Tunable consistency

Time & Clocks in globally-distributed systems

Location Transparency

Asynchrony

Pub-Sub

Design for Scale

Design for Failure

Page 28

Kafka as Platform Fabric

Page 29

From MVP to Scalable with Kafka

Scalpel... Separate The Monolith

Microservices

Does One Thing, Knows One Thing

Separate low-latency hot path

Separate deploy artifacts

Separate data mgmt clusters by concern: analytics, timeseries, etc.

CQRS: Separate Read and Write paths

Page 30

Services communicate indirectly via Kafka

Immutable events stream to Kafka, partitioned by event type, time, etc.

Subscribers & Publishers

RTB microservices - receives raw, receives

Analytics cluster - receives raw, publishes aggregates

Management / Reporting nodes

Page 31

CQRS: Command Query Responsibility Segregation

Decouple Write streams from Read streams

Different schemas / data structures

Writers (Publishers) publish without any awareness of who needs to receive the data or how to reach them (location, protocol...)

Readers (Subscribers) should be able to subscribe to topics of interest and receive from them asynchronously
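The separation of write and read paths can be sketched in memory, with an append-only log standing in for a Kafka topic. Everything named here (`BidWon`, `EventLog`, `SpendByCampaign`) is hypothetical; it only illustrates the shape of the pattern under those assumptions.

```scala
// Hypothetical write-side event.
final case class BidWon(campaignId: String, priceMicros: Long)

/** Write side: an append-only log. Writers know nothing about readers. */
final class EventLog {
  private var events: Vector[BidWon] = Vector.empty
  def append(e: BidWon): Unit = synchronized { events = events :+ e }
  def readFrom(offset: Int): Vector[BidWon] = synchronized { events.drop(offset) }
}

/** Read side: a query-optimized view that consumes the log at its own pace
  * and keeps its own offset, much as a subscriber would. */
final class SpendByCampaign(log: EventLog) {
  private var offset: Int = 0
  private var totals: Map[String, Long] = Map.empty

  /** Pulls events appended since the last call and folds them into the view. */
  def catchUp(): Unit = {
    val fresh = log.readFrom(offset)
    fresh.foreach { e =>
      totals = totals.updated(e.campaignId, totals.getOrElse(e.campaignId, 0L) + e.priceMicros)
    }
    offset += fresh.size
  }

  def spend(campaignId: String): Long = totals.getOrElse(campaignId, 0L)
}
```

The view is eventually consistent with the log: queries reflect only what the reader has consumed so far, and calling `catchUp()` again without new events changes nothing.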

Page 32

Eventually Consistent Across DCs

[Architecture diagram. Labels: regions US-East-1 and EU-west-1; Kafka Cluster Per Region, each with ZK, connected by MirrorMaker; RTB microservices and Mgmt microservices as Publishers and Subscribers; Query Layer; Compute Layer with an Analytics & ML Cluster and a Timeseries Cluster, each showing Spark Streaming & ML with Cassandra; Cassandra Cross DC Replication, Topology Aware.]

Page 33

Eventually Consistent Across DCs

[Architecture diagram. Labels: regions US-East-1 and EU-west-1; Kafka Cluster Per Region, connected by MirrorMaker; RTB microservices and Mgmt microservices as Publishers and Subscribers; C* in each region; Query Layer; Compute Layer with an Analytics & ML Cluster and a Timeseries Cluster, each showing Spark Streaming & ML with Cassandra; Cassandra Cross DC Replication, Topology Aware.]

Page 34

Kafka Cross Datacenter Mirroring

Publish messages from various datacenters around the world

bin/kafka-run-class.sh kafka.tools.MirrorMaker \
  --consumer.config config/consumer_source_cluster.properties \
  --producer.config config/producer_target_cluster.properties \
  --whitelist bidrequests \
  --num.producers 2 \
  --num.streams 4

Page 35

Cassandra Cross DC Replication

It's out of the box. Multi-region live backups for free: NetworkTopologyStrategy

Users in the US and UK connect to DCs in their geo region for lower latency

Both DCs are part of the same cluster for X-DC Replication

Configure LB policies to prefer the local DC

LOCAL_QUORUM reads

Data is available cluster-wide for backup, analytics, and to account for user travel across regions

Page 36

Cassandra Cross DC Replication

Keep EU User Data in the EU

CREATE KEYSPACE rtb WITH REPLICATION = {
  'class': 'NetworkTopologyStrategy',
  'eu-east-dc': '3',
  'eu-west-dc': '3'
};

Page 37

Cassandra Time Windowed Buckets with TTL

CREATE TABLE rtb.fu_events (
  id int,
  seen_time timeuuid,
  event_time timestamp,
  -- event_time must be a clustering column for the CLUSTERING ORDER below
  PRIMARY KEY (id, event_time)
) WITH CLUSTERING ORDER BY (event_time DESC)
  AND compaction = {
    'class': 'com.jeffjirsa.cassandra.db.compaction.TimeWindowCompactionStrategy',
    'compaction_window_unit': 'DAYS',
    'compaction_window_size': '3'
  }
  AND compression = {
    'crc_check_chance': '0.5',
    'sstable_compression': 'org.apache.cassandra.io.compress.LZ4Compressor'
  }
  AND bloom_filter_fp_chance = 0.01
  AND caching = '{"keys":"ALL", "rows_per_partition":"100"}'
  AND dclocal_read_repair_chance = 0.0
  AND default_time_to_live = 60
  AND gc_grace_seconds = 0
  AND max_index_interval = 2048
  AND memtable_flush_period_in_ms = 0
  AND min_index_interval = 128
  AND read_repair_chance = 0.0
  AND speculative_retry = '99.0PERCENTILE';

3 DAY buckets - larger SSTables on disk minimize bootstrapping issues when adding nodes to a cluster

3 MINUTE buckets, 1 HOUR buckets, 1 DAY buckets

MICROSECOND resolution:

Page 38

My Nerdy Chart v2.0

Want | Can Or Currently Use | Status | But
Kafka Security | Kafka Security: TLS, Kerberos, SASL, Auth, Encryption, Authentication | v0.9.0 Thanks Jun! |
Integrated Streaming | Kafka Streams: processing inside Kafka, no alternate cluster setup or ops | v0.10 Thanks Guozhang! | It's java :(
Cassandra CDC | Cassandra CDC. Triggers? Triggers are a pre-commit hook :( | The Epic JIRA: https://issues.apache.org/jira/browse/CASSANDRA-8844 | no comment
And... | Kafka Streams & Kafka Connect Integration | ..wait for it.. | no comment
Always on, X-DC Replication, Flexible Topologies | Kafka, Cassandra | OOTB |
Fault Tolerance | Kafka, Spark, Mesos, Cassandra, Akka | Baked In |
Location Transparency | Kafka, Cassandra, Akka | Check! |
Asynchrony | Kafka, Cassandra, Akka | Check! |
Decoupling | Kafka, Akka | Check! |
Pub-Sub | Kafka, Cassandra, Akka | Check! |
Immutability | Kafka, Akka, Scala | Check! |

Page 39

Kafka Streams in v 0.10

val builder = new KStreamBuilder()

// ser / des stand for the serializers and deserializers in use
val stream: KStream[K, V] = builder.stream(des, des, "raw.data.topic")
  .flatMapValues(value => Arrays.asList(value.toLowerCase.split(" "): _*))
  .map((k, v) => new KeyValue(k, v))
  .countByKey(ser, ser, des, des, "kTable")
  .toStream

stream.to("results.topic", ...)

val streams = new KafkaStreams(builder, props)
streams.start()

Page 40

Kafka Streams & Kafka Connect?

val builder = new KStreamBuilder()

val stream: KStream[K, V] = builder.stream(new CassandraConnect(configs))
  .flatMapValues(..)
  .map((k, v) => new KeyValue(k, v))
  .countByKey(ser, ser, des, des, "kTable")
  .toStream

stream.to("results.topic", ...)

val streams = new KafkaStreams(builder, props)
streams.start()

YES

Page 41

/** Writes records from Kafka to Cassandra asynchronously and non-blocking. */
override def put(records: JCollection[SinkRecord]): Unit

/** Returns a list of records when available by polling for new records. */
override def poll: JList[SourceRecord]

https://github.com/tuplejump/kafka-connect-cassandra

Page 42

Frequency Capping

Use Case:

1. Count the number of times user X has seen ad Y from Advertiser A's Campaign C

2. Limit the max number of impressions of an ad within T1...T2

Continuously count impressions grouped by campaign across DCs

Low-latency reads & writes

Must scale

Cross DC Counters

Translation: Distributed Counters
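The two numbered steps can be sketched as a single-node, in-memory cap in Scala. It deliberately ignores the cross-DC part of the problem, and `CapKey`, `FrequencyCap`, and the sliding-window interpretation of T1...T2 are all assumptions made for illustration.

```scala
// Hypothetical key: user X, ad Y, campaign C (the advertiser is implied by the campaign).
final case class CapKey(userId: String, adId: String, campaignId: String)

/** Single-node sketch; a real deployment needs distributed counters. */
final class FrequencyCap(maxImpressions: Int, windowMs: Long) {
  private var seen: Map[CapKey, Vector[Long]] = Map.empty

  /** Step 1: impressions recorded for `key` in the window (nowMs - windowMs, nowMs]. */
  def count(key: CapKey, nowMs: Long): Int =
    seen.getOrElse(key, Vector.empty).count(t => t > nowMs - windowMs && t <= nowMs)

  /** Step 2: records the impression and returns true if the cap allows it,
    * otherwise records nothing and returns false. */
  def tryServe(key: CapKey, nowMs: Long): Boolean =
    if (count(key, nowMs) >= maxImpressions) false
    else {
      // Drop timestamps that have fallen out of the window before appending.
      val recent = seen.getOrElse(key, Vector.empty).filter(_ > nowMs - windowMs)
      seen = seen.updated(key, recent :+ nowMs)
      true
    }
}
```

Keeping raw timestamps per key is simple but memory-hungry; a production counter would more likely keep bucketed counts with expiry.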

Page 43

Frequency Capping

Redis? Broke under the load

Aerospike? Great candidate

Eventuate? Interesting, much lighter

Kafka Streams when it's out? Interesting, already in the infra

Flink? Very interesting but...

Cassandra Counters - not applicable for this

Page 44

Aerospike

As a distributed counting microservice

As a key-value store for in-memory caching

Fast reads - very read heavy

99% of reads are < 1 ms latency (sweet)

30,000 writes per second

350,000 reads per second on 7 nodes

Replication factor 2

Cross datacenter replication (XDC), SSD-backed

A few excellent posts by Dag, Tapad's CTO, on in-memory infrastructure + Ad Tech (see resources slide)

Page 45

CRDT: Conflict-free Replicated Data Type

State-based: objects require only eventual communication between pairs of replicas

Operation-based: replication requires reliable broadcast communication with delivery in a well-defined order

Both are guaranteed to converge towards a common, correct state

Keeping replicas available for writes during a network partition requires resolving conflicting writes when the partition heals
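As an illustration of the state-based flavour, here is a minimal grow-only counter (G-Counter) in Scala. It is a textbook sketch rather than Eventuate's implementation, which is operation-based; the `GCounter` name and replica ids are assumptions.

```scala
/** State-based grow-only counter: one non-negative slot per replica. */
final case class GCounter(slots: Map[String, Long] = Map.empty) {

  /** A replica only ever increments its own slot. */
  def increment(replicaId: String, delta: Long = 1L): GCounter = {
    require(delta >= 0, "grow-only: delta must be non-negative")
    copy(slots = slots.updated(replicaId, slots.getOrElse(replicaId, 0L) + delta))
  }

  /** The counter value is the sum over all replica slots. */
  def value: Long = slots.values.sum

  /** Merge takes the per-replica maximum, which makes it commutative,
    * associative and idempotent, so replicas converge in any exchange order. */
  def merge(that: GCounter): GCounter =
    GCounter((slots.keySet ++ that.slots.keySet).map { k =>
      k -> math.max(slots.getOrElse(k, 0L), that.slots.getOrElse(k, 0L))
    }.toMap)
}
```

Two replicas that counted independently during a partition can exchange state in either order afterwards and arrive at the same total.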

Page 46

Eventuate

A toolkit for building distributed, HA & partition-tolerant event-sourced applications. Developed by Martin Krasser (@mrt1nz) for Red Bull Media (open source)

Interactive, automated conflict resolution (via op-based CRDTs)

Separates command side of an app from its query side (CQRS)

Primary Goals: preserving causality, idempotency & event ordering guarantees even under chaotic conditions

AP of CAP - conflicts cannot be prevented & must be resolved

Causality - tracked with Vector Clocks

Adapters provide connectivity to other stream processing solutions

Can currently choose Cassandra if desired

Kafka coming soon!

Page 47

Eventuate as Distributed CRDT Microservice

Replication of application state through async event replication across locations

Locations consume replicated events to re-construct application state locally

Multiple locations concurrently update as multi-master

Page 48

Applications can continue writing to a local replica during a network partition

Pass To Pipeline:

-> To Cassandra

-> To Kafka (soon)

Page 49

import scala.concurrent.Future
import akka.actor.{ActorRef, ActorSystem}
import com.rbmhtechnology.eventuate.crdt.{CRDTServiceOps, Counter, CounterService}

class CappingService(val id: String, override val log: ActorRef)
    (implicit val system: ActorSystem,
     val integral: Integral[Int],
     override val ops: CRDTServiceOps[Counter[Int], Int])
  extends CounterService[Int](id, log) {

  /** Increment only op: adds `delta` to the counter identified by `id`
    * and returns the updated counter value. */
  def increment(id: String, delta: Int): Future[Int] =
    value(id) flatMap {
      case v if v >= 0 && (delta > 0 || delta > v) => update(id, delta)
      case v => Future.successful(v)
    }

  start()
}

import scala.concurrent.Future
import akka.actor.ActorSystem

val a = new CappingService(id1, eventLog)
a.increment(id1, 3)  // Future(3) 3 impressions
a.value(id1)         // Future(3) 3 impressions
a.increment(id1, -2) // increments only, idempotent.

val b = new CappingService(id2, eventLog)
b.value(id1)         // Future(a.value(id1))

Knows the same count over n-instances, all geo-locations, for the same id

class CounterService[A : Integral](val replicaId: String, val log: ActorRef) {
  def value(id: String): Future[A] = { ... }
  def update(id: String, delta: A): Future[A] = { ... }
}

Page 50

Eventuate

Page 51

Eventuate Takeaway

It's just a jar!

OOTB async internal component messaging and fault tolerance

Integrate with relevant microservices

No store/cache cluster to deploy, just keep monitoring your apps

Written in Scala

Built on Akka - a toolkit for building highly concurrent, distributed, and resilient event-driven applications on the JVM

Page 52

Analytics & ML

Page 53

Refresher: Sampling of RTB Events

Ad Request

Bid Request - JSON 100 bytes

Compute optimal bid for advertiser

Bid Response - JSON 1000 bytes (may include ad metadata)

Win Notification (may or may not exist) with settlement price

Ad Impression - when the ad is viewed

Ad Click

Ad Conversion

Page 54

OpenRTB: objects in the Bid Request model

Page 55

Streaming Analytics

Top-K highest-performing campaigns

Number of views served in the last 7 days, by country, by city

What determined successful ad conversions

Age distribution per campaign
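A top-K query of this kind can be sketched over a batch of events with plain Scala collections. Treating "performance" as conversion count is an assumption, as are the `Conversion` and `TopK` names; a streaming job would maintain the counts incrementally instead.

```scala
// Hypothetical conversion event; only the campaign id matters here.
final case class Conversion(campaignId: String)

object TopK {
  /** The k campaigns with the most conversions, highest first;
    * ties are broken by campaign id so the result is deterministic. */
  def topCampaigns(events: Seq[Conversion], k: Int): Seq[(String, Int)] =
    events
      .groupBy(_.campaignId)
      .map { case (id, es) => (id, es.size) }
      .toSeq
      .sortBy { case (id, n) => (-n, id) }
      .take(k)
}
```

Sorting every campaign is fine for a sketch; with many campaigns a bounded heap of size k avoids the full sort.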

Page 56

Spark Streaming Kafka

Can write to Cassandra, FiloDB...

class KafkaStreamingActor(ssc: StreamingContext) extends MyAggregationActor {

  val stream = KafkaUtils.createDirectStream(...).map(RawData(_))

  stream.foreachRDD(_.toDF.write.format("filodb.spark")
    .option("dataset", "rawdata")
    .save())

  /* Pre-aggregate data in the stream for fast querying and aggregation later */
  stream.map(hour =>
    (hour.wsid, hour.year, hour.month, hour.day, hour.oneHourPrecip)
  ).saveToCassandra(timeseriesKeyspace, dailyPrecipTable)
}

Page 57

Machine Learning

Train on 1+ week of data for

Recommendations

Bid Optimization

Campaign Optimization

Consumer Profiling

...and much more

Page 58

Machine Learning

The probability that an ad from a specific ISP, OS, website, demographic, etc. results in a conversion

Which attributes of impressions are good predictors of better ad performance?

Page 59

Bid Optimization & Predictive Models

Which impressions should an Advertiser bid for?

Per campaign, per country it may run in..?

What is the best bid for each impression?

Page 60

Machine Learning

[Diagram: bid optimization flow]

Train the model: train on every bid request attribute

Score bid requests: determine value of bid request

Compute optimal bid price: based on Campaign Objectives, against Budget

Send bid decision to bidder

Page 61

Spark Streaming, MLlib & FiloDB

val ssc = new StreamingContext(sparkConf, Seconds(5))

val kafkaStream = KafkaUtils.createDirectStream[..](..)
  .map(transformFunc)
  .map(LabeledPoint.parse)

kafkaStream.foreachRDD(_.toDF.write.format("filodb.spark")
  .option("dataset", "training")
  .save())

val model = new StreamingLinearRegressionWithSGD()
  .setInitialWeights(Vectors.dense(weights))

// trainOn returns Unit, so it is called on the model instead of being chained into the val
model.trainOn(kafkaStream.join(historicalEvents))

model.predictOnValues(kafkaStream.map(lp => (lp.label, lp.features)))
  .insertIntoFilo("predictions")

Page 62

700 Queries Per Second: Spark Streaming & FiloDB

Even for datasets with 15 million rows!

Using FiloDB's InMemoryColumnStore

Single host / MBP

5GB RAM

SQL to DataFrame caching

https://github.com/tuplejump/FiloDB

Evan Chan's (@velvia) blog post: NoLambda: A new architecture combining streaming, ad hoc, machine-learning, and batch analytics

Page 63

Eventually Consistent Across DCs

[Architecture diagram. Labels: regions US-East-1 and EU-west-1; Kafka Cluster Per Region, each with ZK, connected by MirrorMaker; RTB microservices and Mgmt microservices as Publishers and Subscribers; Query Layer; Compute Layer with an Analytics & ML Cluster and a Timeseries Cluster, each showing Spark Streaming & ML with Cassandra; Cassandra Cross DC Replication, Topology Aware.]

Page 64

Self-Healing Systems

Massive event spikes & bursty traffic

Fast producers / slow consumers

Network partitioning & out of sync systems

DC down

Not DDOS'ing ourselves from fast streams

No data loss when auto-scaling down

Page 65

Byzantine Fault Tolerance?

Looks like I'll miss standup

Page 66

Everything fails, all the time

Monitor Everything

Page 67

Resources

Non-Monotonic Snapshot Isolation: scalable and strong consistency for geo-replicated transactional systems

Conflict-free Replicated Data Types

Implementing operation-based CRDTs

http://codebetter.com/gregyoung/2010/02/16/cqrs-task-based-uis-event-sourcing-agh

http://martinfowler.com/bliki/CQRS.html

http://github.com/openrtb/OpenRTB

http://akka.io

http://rbmhtechnology.github.io/eventuate

https://github.com/RBMHTechnology/eventuate

http://rbmhtechnology.github.io/eventuate/user-guide.html#commutative-replicated-data-types

http://www.planetcassandra.org/data-replication-in-nosql-databases-explained

http://wikibon.org/wiki/v/Optimizing_Infrastructure_for_Analytics-Driven_Real-Time_Decision_Making

Page 68

Thanks!

@helenaedelson #kafkasummit

twitter.com/helenaedelson

github.com/helena

slideshare.net/helenaedelson