SPARK STREAMING: PUSHING THE THROUGHPUT LIMITS, THE REACTIVE WAY
François Garillot, Gerard Maas
Who Are We?
Gerard Maas, Data Processing Team Lead
François Garillot
@maasg @huitseeker
Spark Streaming (Refresher)
[Diagram: a DStream[T] is a sequence of RDD[T]s, one per batch interval: t0, t1, t2, t3, …, ti, ti+1.]
[Diagram: transformations on the DStream[T] apply to each underlying RDD[T], producing a matching RDD[U] per interval; actions then run on each resulting RDD[U].]
Spark Streaming (Refresher)
Spark API for Streams
Fault-tolerant
High Throughput
Scalable
Scheduling
[Diagram, built up over three slides, with a Streaming lane (Consumer) above a Spark lane, over intervals t0, t1, t2:]
[1. The consumer collects batch #0 during the first interval.]
[2. At t1, Spark schedules a job for batch #0 while the consumer collects #1: healthy when Process Time < Batch Interval.]
[3. When processing lags behind the batch interval, batches #2, #3 queue up behind #0 and #1, and the Scheduling Delay grows.]
From Streams to μbatches
[Diagram: the Consumer fills one block every blockInterval; each batchInterval, Spark Streaming hands the collected blocks to Spark as one μbatch (#0, #1, …).]
#partitions = receivers x batchInterval / blockInterval
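A quick numeric check of the formula above, with assumed, illustrative values:

```scala
// Assumed values: 2 receivers, 2 s batches, the default 200 ms block interval.
val receivers = 2
val batchIntervalMs = 2000L
val blockIntervalMs = 200L // spark.streaming.blockInterval (default 200 ms)

val partitions = receivers * batchIntervalMs / blockIntervalMs
// 2 * 2000 / 200 = 20 partitions per batch
```

Each block becomes one partition of the batch's RDD, so a larger block interval means fewer, bigger tasks.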
From Streams to μbatches
[Diagram, animated over four slides: batch #0 becomes an RDD whose partitions, one per block, are spread across the Spark executors for parallel processing.]
From Streams to μbatches
[Same Consumer/blockInterval/batchInterval diagram; now solving for the block interval, given a target number of partitions per core (partitionFactor):]
spark.streaming.blockInterval = batchInterval x receivers / (partitionFactor x sparkCores)
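The same relation worked through with assumed, illustrative values; partitionFactor is how many partitions you want per core per batch:

```scala
val receivers = 2
val batchIntervalMs = 2000L
val sparkCores = 4
val partitionFactor = 2 // target partitions per core per batch

val blockIntervalMs = batchIntervalMs * receivers / (partitionFactor * sparkCores)
// 2000 * 2 / (2 * 4) = 500, i.e. set spark.streaming.blockInterval to "500ms"
```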
The Importance of Caching

dstream.foreachRDD { rdd =>
  rdd.cache() // cache the RDD once before iterating over the keys!
  keys.foreach { k =>
    // extractKey is a stand-in for however an element's key is obtained
    rdd.filter(elem => extractKey(elem) == k).saveAsFooBar(...)
  }
  rdd.unpersist() // release the cached blocks once all outputs are written
}
Intervals
[Chart: growth curves labeled O(n²) and O(n).]
(Read TD's Adaptive Stream Processing using Dynamic Batch Sizing before drawing any conclusions!)
The Receiver model
Rate limiting: spark.streaming.receiver.maxRate
Fault tolerance? The Write-Ahead Log (WAL)
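A minimal sketch of the two knobs on this slide, with assumed values; the WAL additionally requires checkpointing to be enabled on the StreamingContext:

```scala
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.streaming.receiver.maxRate", "10000") // max records/sec per receiver
  .set("spark.streaming.receiver.writeAheadLog.enable", "true") // replay received data after driver failure
```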
Direct Kafka Stream
Kafka: the Receiver-less model. The driver computes the offsets for each batch (compute(offsets)); executors then read those offset ranges directly from Kafka.
- Simplified Parallelism
- Efficiency
- Exactly-once semantics
- Fewer degrees of freedom

val directKafkaStream = KafkaUtils.createDirectStream[
    [key class], [value class], [key decoder class], [value decoder class]](
  streamingContext,
  [map of Kafka parameters],
  [set of topics to consume])
spark.streaming.kafka.maxRatePerPartition
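A concrete instantiation of the template above, against the Kafka 0.8 direct API; the broker address and topic name are placeholders:

```scala
import kafka.serializer.StringDecoder
import org.apache.spark.streaming.kafka.KafkaUtils

val kafkaParams = Map("metadata.broker.list" -> "broker1:9092")
val topics = Set("events")

// One RDD partition per Kafka partition: the "Simplified Parallelism" point.
val directKafkaStream = KafkaUtils.createDirectStream[
    String, String, StringDecoder, StringDecoder](
  streamingContext, kafkaParams, topics)
```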
Demo
Reactive Principles
Reactive Streams: composable back-pressure
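The back-pressure work presented here shipped in Spark 1.5 behind a single flag (a sketch, assuming Spark 1.5+): when enabled, a rate estimator feeds batch scheduling delay and processing time back into each receiver's ingestion rate.

```scala
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.streaming.backpressure.enabled", "true")
  // Optional: receiver rate used before the estimator produces its first value.
  .set("spark.streaming.backpressure.initialRate", "1000")
```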
Spark Streaming Made Reactive
Demo
Putting it together
Pain point: Data Locality
- Where is your job getting executed?
- Tune spark.locality.wait together with spark.streaming.blockInterval
- On Mesos, it’s worse (SPARK-4940)
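A sketch of the usual workaround, with an assumed value: against 200 ms blocks, the default 3 s spark.locality.wait can leave cores idle waiting for a node-local slot, so it is often lowered for streaming jobs.

```scala
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.locality.wait", "100ms") // default 3s: long compared to a 200 ms block
```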
Resources
Backpressure in Spark Streaming: http://blog.garillot.net/post/121183250481/a-quick-update-on-spark-streaming-work-since-i
Virdata's Spark Streaming tuning guide: http://www.virdata.com/tuning-spark/
TD's paper on dynamic batch sizing: http://dl.acm.org/citation.cfm?id=2670995
Diving into Spark Streaming's Execution Model: https://databricks.com/blog/2015/07/30/diving-into-spark-streamings-execution-model.html
Spark Streaming / Storm Trident comparison (with numbers): https://www.cs.utoronto.ca/~patricio/docs/Analysis_of_Real_Time_Stream_Processing_Systems_Considering_Latency.pdf
Kafka direct approach: https://github.com/koeninger/kafka-exactly-once/blob/master/blogpost.md
Thanks!
Gerard Maas (@maasg)
François Garillot (@huitseeker)