Transcript
Page 1

Spark Streaming for Realtime Auctions @russellcardullo

Sharethrough

Page 2

Agenda

• Sharethrough?
• Streaming use cases
• How we use Spark
• Next steps

Page 3

Sharethrough

[Visual: a "vs" comparison, presumably native ads vs. traditional display advertising]

Page 4

The Sharethrough Native Exchange

Page 5

How can we use streaming data?

Page 6

Use Cases

• Creative Optimization
• Spend Tracking
• Operational Monitoring

µ* = max_k { µ_k }
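
The slide gives no derivation, but the formula reads as the standard multi-armed bandit objective behind creative optimization: each variant k has an unknown expected reward µ_k (for example, its click-through rate), and the optimizer tries to identify and serve the variant attaining

\[
  \mu^{*} = \max_{k} \mu_{k}
\]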

Page 7

Creative Optimization

• Choose best performing variant

• Short feedback cycle required

impression: {
  device: "iOS",
  geocode: "C967",
  pkey: "p193",
  …
}

[Diagram: an impression matched against three creative variants (Variant 1, Variant 2, Variant 3, with headlines like "Click Here", "Content", "Hello Friend"), each with unknown performance]
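
To make "choose best performing variant" with a short feedback cycle concrete, here is a minimal epsilon-greedy sketch. The talk does not specify the selection algorithm, so the strategy, names, and epsilon value here are all assumptions:

import scala.util.Random

object VariantChooser {
  // Fraction of traffic used for exploration (assumed value).
  val epsilon = 0.1

  // ctrByVariant holds the observed click-through rate per variant,
  // i.e. the running estimates of each µ_k.
  def choose(ctrByVariant: Map[String, Double]): String = {
    val variants = ctrByVariant.keys.toVector
    if (Random.nextDouble() < epsilon)
      variants(Random.nextInt(variants.size))  // explore a random variant
    else
      ctrByVariant.maxBy(_._2)._1              // exploit: argmax_k µ_k
  }
}

For example, VariantChooser.choose(Map("v1" -> 0.012, "v2" -> 0.019, "v3" -> 0.007)) would return "v2" on roughly 90% of calls.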

Page 8

Spend Tracking

• Spend on visible impressions and clicks

• Actual spend happens asynchronously

• Want to correct prediction for optimal serving

[Chart: predicted spend vs. actual spend over time]
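
A hypothetical sketch of the "correct prediction" idea: scale future spend predictions by the observed actual-to-predicted ratio. The names and the correction rule are assumptions; the talk does not show its model:

// One accounting window of predicted vs. asynchronously arriving actual spend.
case class SpendWindow(predicted: Double, actual: Double) {
  // How far off the prediction was in this window.
  def correctionFactor: Double =
    if (predicted > 0) actual / predicted else 1.0

  // Apply the learned factor to the next window's raw prediction.
  def correct(nextPrediction: Double): Double =
    nextPrediction * correctionFactor
}

For example, SpendWindow(predicted = 100.0, actual = 80.0).correct(120.0) yields 96.0: later predictions are scaled down because actual spend is coming in below the prediction.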

Page 9

Operational Monitoring

• Detect issues with content served on third-party sites

• Use same logs as reporting

Placement Click Rates

[Chart: click rates for placements P1–P8]

Page 10

We can directly measure the business impact of using this data sooner

Page 11

Why use Spark to build these features?

Page 12

Why Spark?

• Scala API
• Supports batch and streaming
• Active community support
• Easily integrates into the existing Hadoop ecosystem
• But it doesn't require Hadoop in order to run

Page 13

How we’ve integrated Spark

Page 14

Existing Data Pipeline

[Diagram: web servers → log files → Flume → HDFS → analytics web service → database]

Page 15

Pipeline with Streaming

[Diagram: the same pipeline (web servers → log files → Flume → HDFS → analytics web service → database), now with a streaming path consuming the logs alongside the batch flow]

Page 16

Batch vs. Streaming

Streaming:
• "Real-Time" reporting
• Low latency to use data
• Only as reliable as the source
• Low latency > correctness

Batch:
• Daily reporting
• Billing / earnings
• Anything with a strict SLA
• Correctness > low latency

Page 17

Spark Job Abstractions

Page 18

Job Organization

[Diagram: a Job composed of three stages: Source → Transform → Sink]

Page 19

Sources

case class BeaconLogLine(
  timestamp: String,
  uri: String,
  beaconType: String,
  pkey: String,
  ckey: String
)

object BeaconLogLine {

  def newDStream(ssc: StreamingContext, inputPath: String): DStream[BeaconLogLine] = {
    ssc.textFileStream(inputPath).map { parseRawBeacon(_) }
  }

  def parseRawBeacon(b: String): BeaconLogLine = {
    ...
  }
}

• case class for pattern matching
• generate DStream
• encapsulate common operations
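
The "case class for pattern matching" callout presumably refers to destructuring BeaconLogLine in downstream code; a minimal sketch of what that enables (the describe function is illustrative, not from the talk):

// Matching on the case-class extractor instead of positional tuple fields.
def describe(beacon: BeaconLogLine): String = beacon match {
  case BeaconLogLine(_, "/strbeacon", "visible", pkey, _) =>
    s"visible impression for placement $pkey"
  case BeaconLogLine(_, uri, beaconType, _, _) =>
    s"$beaconType beacon at $uri"
}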

Page 20

Transformations

def visibleByPlacement(source: DStream[BeaconLogLine]): DStream[(String, Long)] = {
  source.
    filter(data => {
      data.uri == "/strbeacon" && data.beaconType == "visible"
    }).
    map(data => (data.pkey, 1L)).
    reduceByKey(_ + _)
}

• type safety from the case class

Page 21

Sinks

class RedisSink @Inject()(store: RedisStore) {

  def sink(result: DStream[(String, Long)]) = {
    result.foreachRDD { rdd =>
      rdd.foreach { element =>
        val (key, value) = element
        store.merge(key, value)
      }
    }
  }
}

• custom sinks for new stores
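
One caveat the slide doesn't mention: the closure passed to rdd.foreach runs on the executors, so store must be serializable and safe to use remotely. A common alternative, assumed here rather than shown in the talk, opens one connection per partition:

def sink(result: DStream[(String, Long)]) = {
  result.foreachRDD { rdd =>
    rdd.foreachPartition { elements =>
      // RedisStore.connect()/close() are hypothetical; the point is one
      // connection per partition on the executor, not one per element.
      val store = RedisStore.connect()
      elements.foreach { case (key, value) => store.merge(key, value) }
      store.close()
    }
  }
}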

Page 22

Jobs

object ImpressionsForPlacements {

  def run(config: Config, inputPath: String) {
    val conf = new SparkConf().
      setMaster(config.getString("master")).
      setAppName("Impressions for Placement")

    val sc = new SparkContext(conf)
    val ssc = new StreamingContext(sc, Seconds(5))

    val source = BeaconLogLine.newDStream(ssc, inputPath)
    val visible = visibleByPlacement(source)
    sink(visible)

    ssc.start()
    ssc.awaitTermination()
  }
}

source

transform

sink
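
The config.getString call suggests Typesafe Config, so a launcher for this job might look like the following sketch; the Main object and the use of args(0) as the input path are assumptions, not from the slides:

import com.typesafe.config.ConfigFactory

object Main {
  def main(args: Array[String]): Unit = {
    val config = ConfigFactory.load()              // reads application.conf
    ImpressionsForPlacements.run(config, args(0))  // args(0): input path
  }
}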

Page 23

Advantages?

Page 24

Code Reuse

object PlacementVisibles {
  …
  val source = BeaconLogLine.newDStream(ssc, inputPath)
  val visible = visibleByPlacement(source)
  sink(visible)
  …
}

…

object PlacementEngagements {
  …
  val source = BeaconLogLine.newDStream(ssc, inputPath)
  val engagements = engagementsByPlacement(source)
  sink(engagements)
  …
}

• composable jobs
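
engagementsByPlacement is referenced above but never shown in the deck; by analogy with visibleByPlacement it would presumably look like this (the "engagement" beacon type is a guess):

def engagementsByPlacement(source: DStream[BeaconLogLine]): DStream[(String, Long)] = {
  source.
    filter(data => data.uri == "/strbeacon" && data.beaconType == "engagement").
    map(data => (data.pkey, 1L)).
    reduceByKey(_ + _)
}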

Page 25

Readability

ssc.textFileStream(inputPath).
  map { parseRawBeacon(_) }.
  filter(data => {
    data._2 == "/strbeacon" && data._3 == "visible"
  }).
  map(data => (data._4, 1L)).
  reduceByKey(_ + _).
  foreachRDD { rdd =>
    rdd.foreach { element =>
      store.merge(element._1, element._2)
    }
  }

?

Page 26

Readability

val source = BeaconLogLine.newDStream(ssc, inputPath)
val visible = visibleByPlacement(source)
redis.sink(visible)

Page 27

Testing

def assertTransformation[T: Manifest, U: Manifest](
  transformation: DStream[T] => DStream[U],
  input: Seq[T],
  expectedOutput: Seq[U]
): Unit = {
  val jobCompletionWaitTimeMillis = 2000L
  val ssc = new StreamingContext("local[1]", "Testing", Seconds(1))
  val rddQueue = new SynchronizedQueue[RDD[T]]()
  val source = ssc.queueStream(rddQueue)
  val results = transformation(source)

  var output = Array[U]()
  results.foreachRDD { rdd => output = output ++ rdd.collect() }
  ssc.start()
  rddQueue += ssc.sparkContext.makeRDD(input, 2)
  Thread.sleep(jobCompletionWaitTimeMillis)
  ssc.stop(true)

  assert(output.toSet === expectedOutput.toSet)
}

function, input, expectation

test

Page 28

Testing

test("#visibleByPlacement") {

  val input = Seq(
    "pkey=abcd, …",
    "pkey=abcd, …",
    "pkey=wxyz, …"
  )

  val expectedOutput = Seq( ("abcd", 2L), ("wxyz", 1L) )

  assertTransformation(visibleByPlacement, input, expectedOutput)
}

use our test helper

Page 29

Other Learnings

Page 30

Other Learnings

• Keeping your driver program healthy is crucial
  • 24/7 operation and monitoring
  • Spark on Mesos? Use Marathon.
• Pay attention to settings for spark.cores.max
  • Monitor data rate and increase as needed
• Serialization on classes
  • Java
  • Kryo
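
For the serialization bullet, switching from the default Java serializer to Kryo is a SparkConf setting; a minimal sketch, where spark.serializer and spark.kryo.registrator are standard Spark settings but the registrator class name is hypothetical:

val conf = new SparkConf().
  setAppName("Impressions for Placement").
  set("spark.serializer", "org.apache.spark.serializer.KryoSerializer").
  set("spark.kryo.registrator", "com.example.BeaconKryoRegistrator")  // hypothetical registrator for BeaconLogLine etc.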

Page 31

What’s next?

Page 32

Twitter Summingbird

• Write-once, run anywhere
• Supports:
  • Hadoop MapReduce
  • Storm
  • Spark (maybe?)

Page 33

Amazon Kinesis

[Diagram: web servers, mobile devices, and app logs feeding Amazon Kinesis, which fans out to other consuming applications]

Page 34

Thanks!

