SPARK STREAMING: PUSHING THE THROUGHPUT LIMITS, THE REACTIVE WAY François Garillot, Gerard Maas


Page 1: Spark Streaming: Pushing the throughput limits by Francois Garillot and Gerard Maas

SPARK STREAMING:PUSHING THE THROUGHPUT LIMITS,THE REACTIVE WAY

François Garillot, Gerard Maas

Page 2:

Who Are We?

Gerard Maas, Data Processing Team Lead

François Garillot, work done at

Spark Streaming at

Page 3:

@maasg @huitseeker

Spark Streaming (Refresher)

Page 4:

Spark Streaming (Refresher)

DStream[T]

RDD[T] RDD[T] RDD[T] RDD[T] RDD[T]

t0 t1 t2 t3 ti ti+1

Page 5:

Spark Streaming (Refresher)

DStream[T]

RDD[T] RDD[T] RDD[T] RDD[T] RDD[T]

t0 t1 t2 t3 ti ti+1

RDD[U] RDD[U] RDD[U] RDD[U] RDD[U]

Transformations

Page 6:

Spark Streaming (Refresher)

DStream[T]

RDD[T] RDD[T] RDD[T] RDD[T] RDD[T]

t0 t1 t2 t3 ti ti+1

RDD[U] RDD[U] RDD[U] RDD[U] RDD[U]

Actions

Transformations

Page 7:

Spark Streaming (Refresher)

Spark API for Streams

Fault-tolerant

High Throughput

Scalable

Page 8:

Streaming

Spark

t0 t1 t2

#0

Consumer  Consumer  Consumer

Scheduling

Page 9:

Streaming

Spark

t0 t1 t2

#1

Consumer  Consumer  Consumer

#0

Scheduling

Process Time < Batch Interval

Page 10:

Streaming

Spark

t0 t1 t2

#2

Consumer  Consumer  Consumer

#0 #1

#3

Scheduling

Scheduling Delay

Page 11:

From Streams to μbatches

Consumer  #0  #1

batchInterval

blockInterval

Spark Streaming / Spark

#partitions = receivers x batchInterval / blockInterval
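The partition formula above can be sketched in a few lines of Scala (the function name and the example figures are illustrative, not from the deck): each receiver cuts its stream into one block per block interval, and each block becomes one RDD partition of the micro-batch.

```scala
// Illustrative sketch of the slide's formula:
//   #partitions = receivers x batchInterval / blockInterval
def numPartitions(receivers: Int, batchIntervalMs: Long, blockIntervalMs: Long): Long =
  receivers * batchIntervalMs / blockIntervalMs

// e.g. 2 receivers, 2 s batches, the default 200 ms blocks -> 20 partitions per batch
println(numPartitions(2, 2000, 200))
```

Too few partitions leaves cores idle; too many means tiny tasks dominated by scheduling overhead, which is what the block-interval tuning later in the deck addresses.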

Page 12:

From Streams to μbatches

#0

RDD

Partitions

Spark

Spark Executors

Spark Streaming

Page 13:

From Streams to μbatches

#0

RDD

Spark

Spark Executors

Spark Streaming

Page 14:

Page 15:

Page 16:

Page 17:

Page 18:

From Streams to μbatches

Consumer  #0  #1

batchInterval

blockInterval

Spark Streaming / Spark

#partitions = receivers x batchInterval / blockInterval

Page 19:

From Streams to μbatches

Consumer  #0  #1

batchInterval

blockInterval

Spark Streaming / Spark

spark.streaming.blockInterval = batchInterval x receivers / (partitionFactor x sparkCores)
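Rearranged for tuning, the formula picks the block interval that yields a chosen number of partitions per core; a minimal sketch (function name, `partitionFactor`, and the example numbers are assumptions for illustration):

```scala
// Illustrative sketch of the slide's tuning rule:
//   blockInterval = batchInterval x receivers / (partitionFactor x sparkCores)
// partitionFactor = how many partitions you want per core and per batch.
def blockIntervalMs(batchIntervalMs: Long, receivers: Int,
                    partitionFactor: Int, sparkCores: Int): Long =
  batchIntervalMs * receivers / (partitionFactor.toLong * sparkCores)

// e.g. 2 s batches, 2 receivers, 3 partitions per core, 8 cores -> 166 ms
println(blockIntervalMs(2000, 2, 3, 8))
```

The result would then be set (in milliseconds) as spark.streaming.blockInterval.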

Page 20:

The Importance of Caching

dstream.foreachRDD { rdd =>
  rdd.cache()  // cache the RDD before iterating over it several times!
  keys.foreach { key =>
    // keyOf extracts an element's key (named keyOf to avoid shadowing the loop variable)
    rdd.filter(elem => keyOf(elem) == key).saveAsFooBar(...)
  }
  rdd.unpersist()  // release the cached blocks once every key is written out
}

Page 21:

Intervals

(Read TD’s Adaptive Stream Processing using Dynamic Batch Sizing before drawing any conclusions!)

(Figure: batch-interval curves, labeled O(n²) and O(n))

Page 22:

The Receiver model

spark.streaming.receiver.maxRate

Fault tolerance? WAL (write-ahead log)
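The effect of spark.streaming.receiver.maxRate can be sketched with a little arithmetic (function name and figures are illustrative): it caps each receiver at maxRate records per second, so it also bounds how large any one batch can grow.

```scala
// Illustrative: the largest batch a receiver-based setup can ingest when
// spark.streaming.receiver.maxRate caps each receiver's records/second.
def maxRecordsPerBatch(receivers: Int, maxRatePerSec: Long, batchIntervalMs: Long): Long =
  receivers * maxRatePerSec * batchIntervalMs / 1000

// e.g. 2 receivers capped at 10,000 records/s, 2 s batches -> at most 40,000 records/batch
println(maxRecordsPerBatch(2, 10000, 2000))
```

The catch the deck is building toward: this cap is a static guess, fixed at configuration time rather than adapted to what the cluster can actually process.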

Page 23:

Direct Kafka Stream

compute(offsets)

Page 24:

Kafka: The Receiver-less model

Simplified Parallelism

Efficiency

Exactly-once semantics

Fewer degrees of freedom

val directKafkaStream = KafkaUtils.createDirectStream[
    [key class], [value class], [key decoder class], [value decoder class]](
  streamingContext,
  [map of Kafka parameters],
  [set of topics to consume])

spark.streaming.kafka.maxRatePerPartition
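With no receivers, the batch-size bound moves to the Kafka partitions; a sketch analogous to the receiver case (function name and figures are illustrative):

```scala
// Illustrative: in the direct approach, spark.streaming.kafka.maxRatePerPartition
// caps each Kafka partition's records/second, so a batch holds at most
// partitions x maxRate x batchIntervalSec records.
def maxBatchSize(kafkaPartitions: Int, maxRatePerPartition: Long, batchIntervalMs: Long): Long =
  kafkaPartitions * maxRatePerPartition * batchIntervalMs / 1000

// e.g. 8 partitions capped at 5,000 records/s, 2 s batches -> at most 80,000 records/batch
println(maxBatchSize(8, 5000, 2000))
```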

Page 25:

Demo

Page 26:

Reactive Principles

Reactive Streams: composable back-pressure
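The idea behind back-pressure can be sketched as a feedback controller: after each batch, compare the rate we allowed with the rate the cluster actually processed, and correct the ingestion bound. The class, gains, and numbers below are an illustrative sketch, not Spark's actual estimator.

```scala
// Minimal PID-style rate estimator sketch (proportional + integral terms only).
class PidRateEstimator(proportional: Double, integral: Double) {
  private var latestRate = -1.0  // current bound on ingestion, elements/second

  def compute(numElements: Long, processingDelayMs: Long,
              schedulingDelayMs: Long, batchIntervalMs: Long): Double = {
    // rate at which the last batch was actually processed
    val processingRate = numElements.toDouble / processingDelayMs * 1000
    if (latestRate < 0) {
      latestRate = processingRate  // first batch: no previous bound to correct
    } else {
      val error = latestRate - processingRate  // how far over budget we were
      // backlog already queued up, expressed as a rate
      val historicalError = schedulingDelayMs.toDouble * processingRate / batchIntervalMs
      latestRate =
        math.max(latestRate - proportional * error - integral * historicalError, 0.0)
    }
    latestRate
  }
}

val estimator = new PidRateEstimator(proportional = 1.0, integral = 0.2)
println(estimator.compute(1000, 500, 0, 1000))   // 1000 elems in 0.5 s -> bound 2000.0/s
println(estimator.compute(1000, 1000, 0, 1000))  // batch slowed down -> bound corrected to 1000.0/s
```

The bound converges toward what the cluster can sustain, instead of relying on a hand-tuned maxRate.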

Page 27:

Spark Streaming made Reactive

Page 28:

Page 29:

Page 30:

Page 31:

Demo

Page 32:

Putting it together

Page 33:

Pain point: Data Locality

- Where is your job getting executed?

spark.locality.wait & spark.streaming.blockInterval

- On Mesos, it’s worse (SPARK-4940)

Page 34:

Resources

Backpressure in Spark Streaming: http://blog.garillot.net/post/121183250481/a-quick-update-on-spark-streaming-work-since-i

Virdata’s Spark Streaming tuning guide: http://www.virdata.com/tuning-spark/

TD’s paper on dynamic batch sizing: http://dl.acm.org/citation.cfm?id=2670995

Diving into Spark Streaming’s Execution Model: https://databricks.com/blog/2015/07/30/diving-into-spark-streamings-execution-model.html

Spark Streaming / Storm Trident comparison, with numbers: https://www.cs.utoronto.ca/~patricio/docs/Analysis_of_Real_Time_Stream_Processing_Systems_Considering_Latency.pdf

Kafka direct approach: https://github.com/koeninger/kafka-exactly-once/blob/master/blogpost.md

Page 35:

Thanks!

Gerard Maas (@maasg)

François Garillot (@huitseeker)