
Productionalizing Spark Streaming


DESCRIPTION

Spark Summit 2013 Talk: At Sharethrough we have deployed Spark to our production environment to support several user-facing product features. While building these features, we uncovered a consistent set of challenges across multiple streaming jobs. By addressing these challenges you can speed up development of future streaming jobs. In this talk we will discuss the three major challenges we encountered while developing production streaming jobs and how we overcame them. First, we will look at how to write jobs to ensure fault tolerance, since streaming jobs need to run 24/7 even under failure conditions. Second, we will look at the programming abstractions we created using functional programming and existing libraries. Finally, we will look at the way we test all the pieces of a job, from manipulating data through writing to external databases, to give us confidence in our code before we deploy to production.


Page 1: Productionalizing Spark Streaming


Productionalizing Spark Streaming

Spark Summit 2013
Ryan Weald (@rweald)

Page 2: Productionalizing Spark Streaming

What We’re Going to Cover

• What we do and why we chose Spark
• Fault tolerance for long-lived streaming jobs
• Common patterns and functional abstractions
• Testing before we “do it live”

Page 3: Productionalizing Spark Streaming

Special focus on common patterns and their solutions

Page 4: Productionalizing Spark Streaming

What is Sharethrough?

Advertising for the Modern Internet

Function + Form

Page 5: Productionalizing Spark Streaming

What is Sharethrough?

Page 6: Productionalizing Spark Streaming

Why Spark Streaming?

Page 7: Productionalizing Spark Streaming

Why Spark Streaming

• Liked the theoretical foundation of mini-batch processing
• Scala codebase + functional API
• Young project with opportunities to contribute
• Batch model suits iterative ML algorithms

Page 8: Productionalizing Spark Streaming

Great... now productionalize it

Page 9: Productionalizing Spark Streaming

Fault Tolerance

Page 10: Productionalizing Spark Streaming

Keys to Fault Tolerance

1. Receiver fault tolerance
2. Monitoring job progress

Page 11: Productionalizing Spark Streaming

Receiver Fault Tolerance

• Use Actors with supervisors (see the supervision sketch below)
• Use self-healing connection pools
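
A minimal supervision sketch (illustrative, not the talk's actual code): an Akka supervisor restarts a crashed receiver child so a dropped connection does not silently kill the stream. SomeReceiverActor is a hypothetical child that would own the connection.

import akka.actor.{Actor, OneForOneStrategy, Props}
import akka.actor.SupervisorStrategy.Restart
import scala.concurrent.duration._

class ReceiverSupervisor extends Actor {
  // Restart the receiver on any exception, up to 10 times per minute
  override val supervisorStrategy =
    OneForOneStrategy(maxNrOfRetries = 10, withinTimeRange = 1.minute) {
      case _: Exception => Restart
    }

  // SomeReceiverActor is hypothetical; it would hold the RabbitMQ connection
  val receiver = context.actorOf(Props[SomeReceiverActor], "receiver")

  def receive = {
    case msg => receiver forward msg
  }
}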

Page 12: Productionalizing Spark Streaming

Use Actors

class RabbitMQStreamReceiver(uri: String, exchangeName: String, routingKey: String)
  extends Actor with Receiver with Logging {

  implicit val system = ActorSystem()

  override def preStart() = {
    // Your code to set up connections and actors
    // Include inner class to process messages
  }

  def receive: Receive = {
    case _ => logInfo("unknown message")
  }
}
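
Wiring the receiver into a job looked roughly like this with the actorStream API Spark Streaming shipped at the time (a sketch; ssc and the constructor arguments are assumed):

import akka.actor.Props

val stream = ssc.actorStream[String](
  Props(new RabbitMQStreamReceiver(uri, exchangeName, routingKey)),
  "rabbitmq-receiver")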

Page 13: Productionalizing Spark Streaming

Track All Outputs

• Low watermarks (Google MillWheel)
• Database updated_at timestamps (see the freshness-check sketch below)
• Expected output file size alerting
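
As one hedged example of the updated_at idea (assuming a PostgreSQL results table with an updated_at column; all names here are illustrative): alert when the newest output row is older than the expected batch interval.

import java.sql.DriverManager

def outputIsFresh(jdbcUrl: String, maxLagSeconds: Double): Boolean = {
  val conn = DriverManager.getConnection(jdbcUrl)
  try {
    // Seconds since the job last wrote a row (PostgreSQL syntax)
    val rs = conn.createStatement().executeQuery(
      "SELECT EXTRACT(EPOCH FROM (now() - MAX(updated_at))) FROM results")
    rs.next() && rs.getDouble(1) <= maxLagSeconds
  } finally {
    conn.close()
  }
}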

Page 14: Productionalizing Spark Streaming

Common Patterns & Functional Programming

Page 15: Productionalizing Spark Streaming

Common Job Pattern

Map -> Aggregate -> Store

Page 16: Productionalizing Spark Streaming

Mapping Data

inputData.map { rawRequest =>
  val params = QueryParams.parse(rawRequest)
  // e.g. a request with ?beaconType=click becomes ("click", 1L)
  (params.getOrElse("beaconType", "unknown"), 1L)
}

Page 17: Productionalizing Spark Streaming

Aggregation

Page 18: Productionalizing Spark Streaming

Basic Aggregation

// beacons is a DStream[(String, Long)]
// e.g. Seq(("click", 1L), ("click", 1L))
val sum: (Long, Long) => Long = _ + _
beacons.reduceByKey(sum) // yields ("click", 2L)

Page 19: Productionalizing Spark Streaming

What happens when we want to sum multiple things?

Page 20: Productionalizing Spark Streaming

Long Basic Aggregation

val inputData = Seq(
  ("user_1", (1L, 1L, 1L)),
  ("user_1", (2L, 2L, 2L)))

def sum(l: (Long, Long, Long), r: (Long, Long, Long)) = {
  (l._1 + r._1, l._2 + r._2, l._3 + r._3)
}

inputData.reduceByKey(sum)

Page 21: Productionalizing Spark Streaming

Now Sum 4 Ints instead

(ノಥ益ಥ)ノ ┻━┻

Page 22: Productionalizing Spark Streaming

Monoids to the Rescue

Page 23: Productionalizing Spark Streaming

WTF is a Monoid?

trait Monoid[T] {
  def zero: T
  def plus(r: T, l: T): T
}

* Just need to make sure plus is associative, e.g. (1 + 5) + 2 == 1 + (5 + 2), and that zero is an identity: plus(zero, x) == x

Page 24: Productionalizing Spark Streaming

Monoid Based Aggregation

object LongMonoid extends Monoid[(Long, Long, Long)] {
  def zero = (0L, 0L, 0L)
  def plus(r: (Long, Long, Long), l: (Long, Long, Long)) = {
    (l._1 + r._1, l._2 + r._2, l._3 + r._3)
  }
}

inputData.reduceByKey(LongMonoid.plus(_, _))

Page 25: Productionalizing Spark Streaming

Twitter Algebird

http://github.com/twitter/algebird

Page 26: Productionalizing Spark Streaming

Algebird Based Aggregation

import com.twitter.algebird._

// Algebird derives Monoid[(Long, Long, Long)] from the component Long
// monoids, so summing a 4-tuple (or any N-tuple) needs no new code
val aggregator = implicitly[Monoid[(Long, Long, Long)]]

inputData.reduceByKey(aggregator.plus(_, _))

Page 27: Productionalizing Spark Streaming

How many unique users per publisher?

Page 28: Productionalizing Spark Streaming

Too big for a naive in-memory Map

Page 29: Productionalizing Spark Streaming

HyperLogLog FTW

Page 30: Productionalizing Spark Streaming

HLL Aggregation

import com.twitter.algebird._

val aggregator = new HyperLogLogMonoid(12)
inputData.reduceByKey(aggregator.plus(_, _))
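
To make the elided steps concrete, a hedged end-to-end sketch (pageViews and the value names are assumptions; create builds a single-element sketch in Algebird): hash each user id into an HLL, then merge sketches per publisher. With 12 bits you get 2^12 registers, roughly 1.6% standard error, in a few KB per key instead of a giant Set of user ids.

import com.twitter.algebird._

val hll = new HyperLogLogMonoid(12)

// pageViews: DStream[(String, String)] of (publisher, userId) -- assumed input
val sketches = pageViews.map { case (publisher, userId) =>
  (publisher, hll.create(userId.getBytes("UTF-8")))
}

val uniques = sketches.reduceByKey(hll.plus(_, _))
// Read an approximate count off a merged sketch:
// uniques.map { case (pub, h) => (pub, h.approximateSize.estimate) }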

Page 31: Productionalizing Spark Streaming

Monoids == Reusable Aggregation

Page 32: Productionalizing Spark Streaming

Common Job Pattern

Map -> Aggregate -> Store

Page 33: Productionalizing Spark Streaming

Store

Page 34: Productionalizing Spark Streaming

How do we store the results?

Page 35: Productionalizing Spark Streaming

Storage API Requirements

• Incremental updates (preferably associative)
• Pluggable to support “big data” stores
• Allow for testing jobs

Page 36: Productionalizing Spark Streaming

Storage API

trait MergeableStore[K, V] {
  def get(key: K): V
  def put(kv: (K, V)): V

  /*
   * Should follow the same associative property
   * as our Monoid from earlier
   */
  def merge(kv: (K, V)): V
}
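
A minimal in-memory sketch of this contract (illustrative, not Storehaus's actual implementation), reusing the Monoid trait from earlier; a store like this is also handy for the testing approach later in the talk.

class InMemoryMergeableStore[K, V](monoid: Monoid[V]) extends MergeableStore[K, V] {
  private val data = scala.collection.mutable.Map.empty[K, V]

  def get(key: K): V = data.getOrElse(key, monoid.zero)

  def put(kv: (K, V)): V = { data(kv._1) = kv._2; kv._2 }

  // merge is just the monoid's plus applied to whatever is already stored
  def merge(kv: (K, V)): V = {
    val updated = monoid.plus(get(kv._1), kv._2)
    data(kv._1) = updated
    updated
  }
}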

Page 37: Productionalizing Spark Streaming

Twitter Storehaus

http://github.com/twitter/storehaus

Page 38: Productionalizing Spark Streaming

Storing Spark Results

def saveResults(result: DStream[(String, Long)],
                store: RedisStore[String, Long]) = {
  result.foreach { rdd =>
    rdd.foreach { element =>
      // this closure runs on the workers for each batch
      val (key, value) = element
      store.merge((key, value))
    }
  }
}

Page 39: Productionalizing Spark Streaming

Everyone can benefit

Page 40: Productionalizing Spark Streaming

Potential API additions?

class PairDStreamFunctions[K, V] {
  def aggregateByKey(aggregator: Monoid[V])
  def store(store: MergeableStore[K, V])
}
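
With those two additions the whole job pattern could collapse to something like this (hypothetical, since the methods were only proposed; parseBeacon, monoid, and redisStore are placeholder names):

inputData
  .map(parseBeacon)        // Map
  .aggregateByKey(monoid)  // Aggregate
  .store(redisStore)       // Store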

Page 41: Productionalizing Spark Streaming

Twitter Summingbird

http://github.com/twitter/summingbird

*https://github.com/twitter/summingbird/issues/387

Page 42: Productionalizing Spark Streaming

Testing Your Jobs

Page 43: Productionalizing Spark Streaming

Testing Best Practices

• Try to avoid full integration tests
• Use in-memory stores for testing
• Keep logic outside of Spark (see the sketch below)
• Use Summingbird’s in-memory platform???
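
A minimal ScalaTest sketch of the "keep logic outside of Spark" point (names are illustrative, not from the talk): because the key-extraction logic lives in a plain function, it can be exercised without a SparkContext.

import org.scalatest.FunSuite

object BeaconMapper {
  // Same logic as the Mapping Data slide, pulled out of the DStream closure
  def toPair(rawRequest: String): (String, Long) = {
    val params = QueryParams.parse(rawRequest) // helper assumed from the talk
    (params.getOrElse("beaconType", "unknown"), 1L)
  }
}

class BeaconMapperTest extends FunSuite {
  test("unknown beacon types fall back to unknown") {
    assert(BeaconMapper.toPair("no-beacon-param")._1 == "unknown")
  }
}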

Page 44: Productionalizing Spark Streaming

Ryan Weald
@rweald

Thank You
