SPARK STREAMING: PUSHING THE THROUGHPUT LIMITS, THE REACTIVE WAY François Garillot, Gerard Maas


Page 1: Spark Streaming: Pushing the throughput limits by Francois Garillot and Gerard Maas

SPARK STREAMING:PUSHING THE THROUGHPUT LIMITS,THE REACTIVE WAY

François Garillot, Gerard Maas

Page 2:

Who Are We?

Gerard Maas, Data Processing Team Lead

François Garillot, work done at

Spark Streaming at

Page 3:

@maasg @huitseeker

Spark Streaming (Refresher)

Page 4:

Spark Streaming (Refresher)

DStream[T]

RDD[T] RDD[T] RDD[T] RDD[T] RDD[T]

t0 t1 t2 t3 ti ti+1

Page 5:

Spark Streaming (Refresher)

DStream[T]

RDD[T] RDD[T] RDD[T] RDD[T] RDD[T]

t0 t1 t2 t3 ti ti+1

RDD[U] RDD[U] RDD[U] RDD[U] RDD[U]

Transformations

Page 6:

Spark Streaming (Refresher)

DStream[T]

RDD[T] RDD[T] RDD[T] RDD[T] RDD[T]

t0 t1 t2 t3 ti ti+1

RDD[U] RDD[U] RDD[U] RDD[U] RDD[U]

Actions

Transformations

Page 7:

Spark Streaming (Refresher)

Spark API for Streams

Fault-tolerant

High Throughput

Scalable

Page 8:

Streaming

Spark

t0 t1 t2

#0

Consumer  Consumer  Consumer

Scheduling

Page 9:

Streaming

Spark

t0 t1 t2

#1

Consumer  Consumer  Consumer

#0

Scheduling

Process Time < Batch Interval

Page 10:

Streaming

Spark

t0 t1 t2

#2

Consumer  Consumer  Consumer

#0 #1

#3

Scheduling

Scheduling Delay

Page 11:

From Streams to μbatches

Consumer  #0  #1

batchInterval

blockInterval

Spark Streaming / Spark

#partitions = receivers x batchInterval / blockInterval
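The partition formula above can be sketched in a few lines of Scala (the function name and the example figures are illustrative, not from the deck): each receiver cuts its stream into one block per block interval, and each block becomes one RDD partition of the micro-batch.

```scala
// Illustrative sketch of the slide's formula:
//   #partitions = receivers x batchInterval / blockInterval
def numPartitions(receivers: Int, batchIntervalMs: Long, blockIntervalMs: Long): Long =
  receivers * batchIntervalMs / blockIntervalMs

// e.g. 2 receivers, 2 s batches, the default 200 ms blocks -> 20 partitions per batch
println(numPartitions(2, 2000, 200))
```

Too few partitions leaves cores idle; too many means tiny tasks dominated by scheduling overhead, which is what the block-interval tuning later in the deck addresses.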

Page 12:

From Streams to μbatches

#0

RDD

Partitions

Spark

Spark Executors

Spark Streaming

Page 13:

From Streams to μbatches

#0

RDD

Spark

Spark Executors

Spark Streaming

Page 14:

Page 15:

Page 16:

Page 17:

Page 18:

From Streams to μbatches

Consumer  #0  #1

batchInterval

blockInterval

Spark Streaming / Spark

#partitions = receivers x batchInterval / blockInterval

Page 19:

From Streams to μbatches

Consumer  #0  #1

batchInterval

blockInterval

Spark Streaming / Spark

spark.streaming.blockInterval = batchInterval x receivers / (partitionFactor x sparkCores)
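Rearranged for tuning, the formula picks the block interval that yields a chosen number of partitions per core; a minimal sketch (function name, `partitionFactor`, and the example numbers are assumptions for illustration):

```scala
// Illustrative sketch of the slide's tuning rule:
//   blockInterval = batchInterval x receivers / (partitionFactor x sparkCores)
// partitionFactor = how many partitions you want per core and per batch.
def blockIntervalMs(batchIntervalMs: Long, receivers: Int,
                    partitionFactor: Int, sparkCores: Int): Long =
  batchIntervalMs * receivers / (partitionFactor.toLong * sparkCores)

// e.g. 2 s batches, 2 receivers, 3 partitions per core, 8 cores -> 166 ms
println(blockIntervalMs(2000, 2, 3, 8))
```

The result would then be set (in milliseconds) as spark.streaming.blockInterval.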

Page 20:

The Importance of Caching

dstream.foreachRDD { rdd =>
  rdd.cache()  // cache the RDD before iterating over it several times!
  keys.foreach { key =>
    // keyOf extracts an element's key (named keyOf to avoid shadowing the loop variable)
    rdd.filter(elem => keyOf(elem) == key).saveAsFooBar(...)
  }
  rdd.unpersist()  // release the cached blocks once every key is written out
}

Page 21:

Intervals

(Read TD’s Adaptive Stream Processing using Dynamic Batch Sizing before drawing any conclusions!)

(Figure: batch-interval curves, labeled O(n²) and O(n))

Page 22:

The Receiver model

spark.streaming.receiver.maxRate

Fault tolerance? WAL (write-ahead log)
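The effect of spark.streaming.receiver.maxRate can be sketched with a little arithmetic (function name and figures are illustrative): it caps each receiver at maxRate records per second, so it also bounds how large any one batch can grow.

```scala
// Illustrative: the largest batch a receiver-based setup can ingest when
// spark.streaming.receiver.maxRate caps each receiver's records/second.
def maxRecordsPerBatch(receivers: Int, maxRatePerSec: Long, batchIntervalMs: Long): Long =
  receivers * maxRatePerSec * batchIntervalMs / 1000

// e.g. 2 receivers capped at 10,000 records/s, 2 s batches -> at most 40,000 records/batch
println(maxRecordsPerBatch(2, 10000, 2000))
```

The catch the deck is building toward: this cap is a static guess, fixed at configuration time rather than adapted to what the cluster can actually process.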

Page 23:

Direct Kafka Stream

compute(offsets)

Page 24:

Kafka: The Receiver-less model

Simplified Parallelism

Efficiency

Exactly-once semantics

Fewer degrees of freedom

val directKafkaStream = KafkaUtils.createDirectStream[
    [key class], [value class], [key decoder class], [value decoder class]](
  streamingContext,
  [map of Kafka parameters],
  [set of topics to consume])

spark.streaming.kafka.maxRatePerPartition
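With no receivers, the batch-size bound moves to the Kafka partitions; a sketch analogous to the receiver case (function name and figures are illustrative):

```scala
// Illustrative: in the direct approach, spark.streaming.kafka.maxRatePerPartition
// caps each Kafka partition's records/second, so a batch holds at most
// partitions x maxRate x batchIntervalSec records.
def maxBatchSize(kafkaPartitions: Int, maxRatePerPartition: Long, batchIntervalMs: Long): Long =
  kafkaPartitions * maxRatePerPartition * batchIntervalMs / 1000

// e.g. 8 partitions capped at 5,000 records/s, 2 s batches -> at most 80,000 records/batch
println(maxBatchSize(8, 5000, 2000))
```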

Page 25:

Demo

Page 26:

Reactive Principles

Reactive Streams: composable back-pressure
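The idea behind back-pressure can be sketched as a feedback controller: after each batch, compare the rate we allowed with the rate the cluster actually processed, and correct the ingestion bound. The class, gains, and numbers below are an illustrative sketch, not Spark's actual estimator.

```scala
// Minimal PID-style rate estimator sketch (proportional + integral terms only).
class PidRateEstimator(proportional: Double, integral: Double) {
  private var latestRate = -1.0  // current bound on ingestion, elements/second

  def compute(numElements: Long, processingDelayMs: Long,
              schedulingDelayMs: Long, batchIntervalMs: Long): Double = {
    // rate at which the last batch was actually processed
    val processingRate = numElements.toDouble / processingDelayMs * 1000
    if (latestRate < 0) {
      latestRate = processingRate  // first batch: no previous bound to correct
    } else {
      val error = latestRate - processingRate  // how far over budget we were
      // backlog already queued up, expressed as a rate
      val historicalError = schedulingDelayMs.toDouble * processingRate / batchIntervalMs
      latestRate =
        math.max(latestRate - proportional * error - integral * historicalError, 0.0)
    }
    latestRate
  }
}

val estimator = new PidRateEstimator(proportional = 1.0, integral = 0.2)
println(estimator.compute(1000, 500, 0, 1000))   // 1000 elems in 0.5 s -> bound 2000.0/s
println(estimator.compute(1000, 1000, 0, 1000))  // batch slowed down -> bound corrected to 1000.0/s
```

The bound converges toward what the cluster can sustain, instead of relying on a hand-tuned maxRate.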

Page 27:

Spark Streaming made Reactive

Page 28:

Page 29:

Page 30:

Page 31:

Demo

Page 32:

Putting it together

Page 33:

Pain point: Data Locality

- Where is your job getting executed?

spark.locality.wait & spark.streaming.blockInterval

- On Mesos, it’s worse (SPARK-4940)

Page 34:

Resources

Backpressure in Spark Streaming: http://blog.garillot.net/post/121183250481/a-quick-update-on-spark-streaming-work-since-i

Virdata’s Spark Streaming tuning guide: http://www.virdata.com/tuning-spark/

TD’s paper on dynamic batch sizing: http://dl.acm.org/citation.cfm?id=2670995

Diving into Spark Streaming’s Execution Model: https://databricks.com/blog/2015/07/30/diving-into-spark-streamings-execution-model.html

Spark Streaming / Storm Trident comparison, with numbers: https://www.cs.utoronto.ca/~patricio/docs/Analysis_of_Real_Time_Stream_Processing_Systems_Considering_Latency.pdf

Kafka direct approach: https://github.com/koeninger/kafka-exactly-once/blob/master/blogpost.md

Page 35:

Thanks!

Gerard Maas (@maasg)

François Garillot (@huitseeker)