Spark Summit 2013 talk: At Sharethrough we have deployed Spark to our production environment to support several user-facing product features. While building these features we uncovered a consistent set of challenges across multiple streaming jobs; addressing them speeds up development of future streaming jobs. In this talk we discuss the three major challenges we encountered while developing production streaming jobs and how we overcame them. First, we look at how to write jobs to ensure fault tolerance, since streaming jobs need to run 24/7 even under failure conditions. Second, we look at the programming abstractions we created using functional programming and existing libraries. Finally, we look at the way we test all the pieces of a job, from manipulating data through writing to external databases, to give us confidence in our code before we deploy to production.
Productionalizing Spark Streaming
Spark Summit 2013
Ryan Weald (@rweald)
What We’re Going to Cover
•What we do and why we chose Spark
•Fault tolerance for long lived streaming jobs
•Common patterns and functional abstractions
•Testing before we “do it live”
Special focus on common patterns and their solutions
What is Sharethrough?
Advertising for the Modern Internet
Form + Function
Why Spark Streaming?
Why Spark Streaming
•Liked the theoretical foundation of mini-batch
•Scala codebase + functional API
•Young project with opportunities to contribute
•Batch model for iterative ML algorithms
Great...Now productionalize it
Fault Tolerance
Keys to Fault Tolerance
1. Receiver fault tolerance
2. Monitoring job progress
Receiver Fault Tolerance
•Use Actors with supervisors
•Use self healing connection pools
Use Actors
class RabbitMQStreamReceiver(uri: String, exchangeName: String, routingKey: String)
  extends Actor with Receiver with Logging {

  implicit val system = ActorSystem()

  override def preStart() = {
    //Your code to setup connections and actors
    //Include inner class to process messages
  }

  def receive: Receive = {
    case _ => logInfo("unknown message")
  }
}
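The "supervisors" half of the earlier bullet can be sketched with plain Akka. This is a minimal sketch, not the talk's actual code: the supervisor restarts the receiver actor when its connection code throws, and the retry budget, URI, and actor names are assumptions.

import akka.actor.{Actor, ActorRef, OneForOneStrategy, Props}
import akka.actor.SupervisorStrategy.Restart
import scala.concurrent.duration._

//Hypothetical supervisor: restart the receiver on any exception,
//up to 10 times per minute, so transient broker outages self-heal.
class ReceiverSupervisor extends Actor {
  override val supervisorStrategy =
    OneForOneStrategy(maxNrOfRetries = 10, withinTimeRange = 1.minute) {
      case _: Exception => Restart
    }

  val receiver: ActorRef = context.actorOf(
    Props(new RabbitMQStreamReceiver("amqp://localhost", "beacons", "#")),
    "rabbit-receiver")

  def receive = {
    case msg => receiver forward msg
  }
}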
Track All Outputs
•Low watermarks (Google MillWheel)
•Database updated_at (sketched below)
•Expected output file size alerting
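To make the updated_at idea concrete, here is a minimal staleness check, assuming lastUpdatedAtMillis is read from your output database and alert is a placeholder for whatever paging or monitoring hook you use:

//If the newest updated_at in the output store lags "now" by more than
//a few batch intervals, the job has probably stalled and should page.
def checkOutputFreshness(lastUpdatedAtMillis: Long,
                         batchIntervalMillis: Long,
                         alert: String => Unit,
                         allowedLagBatches: Long = 3): Unit = {
  val lagMillis = System.currentTimeMillis() - lastUpdatedAtMillis
  if (lagMillis > batchIntervalMillis * allowedLagBatches)
    alert(s"Streaming output is ${lagMillis / 1000}s stale; check the job")
}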
Common Patterns & Functional Programming
Map -> Aggregate -> Store
Common Job Pattern
Mapping Data
inputData.map { rawRequest =>
  val params = QueryParams.parse(rawRequest)
  (params.getOrElse("beaconType", "unknown"), 1L)
}
Aggregation
Basic Aggregation
//beacons is a DStream[(String, Long)]
//example: Seq(("click", 1L), ("click", 1L))
val sum: (Long, Long) => Long = _ + _
beacons.reduceByKey(sum)
What happens when we want to sum multiple things?
Long Basic Aggregation
val inputData = Seq(
  ("user_1", (1L, 1L, 1L)),
  ("user_1", (2L, 2L, 2L)))

def sum(l: (Long, Long, Long), r: (Long, Long, Long)) =
  (l._1 + r._1, l._2 + r._2, l._3 + r._3)

inputData.reduceByKey(sum)
Now Sum 4 Ints instead
(ノಥ益ಥ)ノ ┻━┻
Monoids to the Rescue
WTF is a Monoid?
trait Monoid[T] {
  def zero: T
  def plus(l: T, r: T): T
}

* Just need to make sure plus is associative:
(1 + 5) + 2 == 1 + (5 + 2)
Monoid Based Aggregation
object LongMonoid extends Monoid[(Long, Long, Long)] {
  def zero = (0L, 0L, 0L)
  def plus(l: (Long, Long, Long), r: (Long, Long, Long)) =
    (l._1 + r._1, l._2 + r._2, l._3 + r._3)
}

inputData.reduceByKey(LongMonoid.plus(_, _))
Twitter Algebird
http://github.com/twitter/algebird
Algebird Based Aggregation
import com.twitter.algebird._

val aggregator = implicitly[Monoid[(Long, Long, Long)]]
inputData.reduceByKey(aggregator.plus(_, _))
How many unique users per publisher?
Too big for a naive in-memory Map
HyperLogLog FTW
HLL Aggregation
import com.twitter.algebird._

val aggregator = new HyperLogLogMonoid(12)
inputData.reduceByKey(aggregator.plus(_, _))
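For that reduce to type-check, the values must already be HLL sketches. A minimal end-to-end sketch, assuming the input pairs publisher IDs with user IDs, and assuming Algebird's create/approximateSize API (older Algebird versions spelled create as apply):

import com.twitter.algebird._

val hll = new HyperLogLogMonoid(12) //2^12 registers, roughly 1.6% error

//Map each (publisher, userId) pair to a (publisher, HLL sketch) pair,
//then merge the sketches per publisher with the monoid's plus.
val uniquesPerPublisher = inputData
  .map { case (publisher, userId) =>
    (publisher, hll.create(userId.getBytes("UTF-8")))
  }
  .reduceByKey(hll.plus(_, _))

//Read off the approximate unique-user count for each publisher.
uniquesPerPublisher.map { case (publisher, sketch) =>
  (publisher, sketch.approximateSize.estimate)
}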
Monoids == Reusable Aggregation
Common Job Pattern
Map -> Aggregate -> Store
Store
How do we store the results?
Storage API Requirements
•Incremental updates (preferably associative)
•Pluggable to support “big data” stores
•Allow for testing jobs
Storage API
trait MergeableStore[K, V] {
  def get(key: K): V
  def put(kv: (K, V)): V

  /*
   * Should follow the same associative property
   * as our Monoid from earlier
   */
  def merge(kv: (K, V)): V
}
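A toy in-memory implementation of this trait (an illustration of the contract, not Storehaus's actual class) also doubles as a test store for the testing section later:

import scala.collection.mutable

//Toy in-memory MergeableStore: merge() folds the incoming value into
//whatever is already stored, using a Monoid to keep merges associative.
class InMemoryStore[K, V](monoid: Monoid[V]) extends MergeableStore[K, V] {
  private val data = mutable.Map.empty[K, V]

  def get(key: K): V = data.getOrElse(key, monoid.zero)

  def put(kv: (K, V)): V = { data(kv._1) = kv._2; kv._2 }

  def merge(kv: (K, V)): V = {
    val merged = monoid.plus(get(kv._1), kv._2)
    data(kv._1) = merged
    merged
  }
}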
Twitter Storehaus
http://github.com/twitter/storehaus
Storing Spark Results
def saveResults(result: DStream[(String, Long)],
                store: RedisStore[String, Long]) = {
  result.foreach { rdd =>
    rdd.foreach { element =>
      val (key, value) = element
      store.merge((key, value))
    }
  }
}
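One element at a time is the simplest thing that works, but every merge pays the full cost of talking to the store. A common refinement, not shown in the talk, is to work per partition so each partition builds its store (and connection) once. A sketch with the same hypothetical RedisStore:

def saveResultsPerPartition(result: DStream[(String, Long)],
                            makeStore: () => RedisStore[String, Long]) = {
  result.foreach { rdd =>
    rdd.foreachPartition { elements =>
      //One store per partition, created on the worker,
      //instead of per-element connection overhead.
      val store = makeStore()
      elements.foreach { case (key, value) => store.merge((key, value)) }
    }
  }
}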
Everyone can benefit
Potential API additions?
class PairDStreamFunctions[K, V] {
  def aggregateByKey(aggregator: Monoid[V])
  def store(store: MergeableStore[K, V])
}
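If additions like these ever landed, the whole map -> aggregate -> store pattern could collapse to a few lines. Purely speculative usage, with parseBeacon, aggregator, and redisStore standing in for your own pieces:

//Hypothetical end-to-end job if the proposed methods existed
inputData
  .map(parseBeacon)           //Map
  .aggregateByKey(aggregator) //Aggregate: any Monoid
  .store(redisStore)          //Store: any MergeableStore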
Twitter Summingbird
http://github.com/twitter/summingbird
*https://github.com/twitter/summingbird/issues/387
Testing Your Jobs
Testing Best Practices
•Try to avoid full integration tests
•Use in-memory stores for testing
•Keep logic outside of Spark (see the sketch below)
•Use Summingbird's in-memory platform???
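Keeping logic outside of Spark means the interesting functions are plain Scala, testable without a SparkContext. A minimal sketch that checks the LongMonoid from earlier with plain assertions (the bare-bones harness is an assumption; any test framework works):

//No SparkContext needed: the logic under test is the Monoid itself.
object LongMonoidSpec extends App {
  import LongMonoid.plus

  val a = (1L, 2L, 3L)
  val b = (4L, 5L, 6L)
  val c = (7L, 8L, 9L)

  //plus combines element-wise
  assert(plus(a, b) == (5L, 7L, 9L))

  //monoid laws: associativity and identity
  assert(plus(plus(a, b), c) == plus(a, plus(b, c)))
  assert(plus(LongMonoid.zero, a) == a)

  println("LongMonoid laws hold on the sample values")
}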
Ryan Weald (@rweald)
Thank You