Every ad. Every sales channel. Every screen. One platform.
Spark Streaming with Apache Kafka
Vikas Gite
Principal Software Engineer, Big Data Analytics - PubMatic
Agenda
– Spark streaming 101
  – What is RDD
  – What is DStream
– Spark streaming architecture
– Introduction to Kafka
– Streaming ingestion with Kafka
Spark streaming 101
RDD
– Immutable
– Partitioned
– Fault tolerant
– Lazily evaluated
– Can be persisted
[Figure: Lineage graph. First RDD, Second RDD and Third RDD, linked by Filter and Map transformations.]
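To make the lineage idea concrete, here is a minimal sketch in plain Scala. It is a toy model, not Spark's API: `ToyRdd`, `Source`, `Mapped` and `Filtered` are hypothetical names chosen for illustration. Each transformation only records its parent, and nothing is computed until `compute()` is called, which mirrors the "immutable" and "lazily evaluated" properties listed above.

```scala
object LineageSketch {
  // Toy stand-in for an RDD: every node remembers how it was derived.
  sealed trait ToyRdd[A] {
    def compute(): Vector[A]   // evaluated only when called
    def lineage: List[String]  // steps from this node back to the source
    def map[B](f: A => B): ToyRdd[B] = Mapped(this, f)
    def filter(p: A => Boolean): ToyRdd[A] = Filtered(this, p)
  }

  final case class Source[A](data: Vector[A]) extends ToyRdd[A] {
    def compute(): Vector[A] = data
    def lineage: List[String] = List("Source")
  }

  final case class Mapped[A, B](parent: ToyRdd[A], f: A => B) extends ToyRdd[B] {
    def compute(): Vector[B] = parent.compute().map(f)
    def lineage: List[String] = "Map" :: parent.lineage
  }

  final case class Filtered[A](parent: ToyRdd[A], p: A => Boolean) extends ToyRdd[A] {
    def compute(): Vector[A] = parent.compute().filter(p)
    def lineage: List[String] = "Filter" :: parent.lineage
  }
}
```

Because every node keeps a reference to its parent and the function applied, a lost result can be rebuilt by replaying the lineage from the source, which is the intuition behind "fault tolerant".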
DStream
– Continuous sequence of RDDs
– Designed for stream processing
Spark streaming architecture
Micro batching
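As a rough illustration of micro batching, the sketch below groups timestamped events into fixed-length batch intervals, so that the stream becomes a sequence of small batches (the role RDDs play inside a DStream). It is plain Scala, not Spark code; `Event`, `microBatches` and the millisecond timestamps are assumptions made for the example, and timestamps are assumed to be non-negative.

```scala
object MicroBatchSketch {
  final case class Event(timestampMs: Long, payload: String)

  // Groups events into consecutive batch intervals.
  // Returns (batch start time, payloads in that batch), ordered by time.
  def microBatches(events: Seq[Event], batchIntervalMs: Long): Seq[(Long, Seq[String])] =
    events
      .groupBy(e => e.timestampMs / batchIntervalMs) // index of the interval
      .toSeq
      .sortBy(_._1)
      .map { case (batchIndex, batch) =>
        (batchIndex * batchIntervalMs, batch.sortBy(_.timestampMs).map(_.payload))
      }
}
```

With a 1000 ms interval, events at 100 ms and 900 ms land in the first batch, while an event at 1200 ms starts the second one.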
Spark streaming architecture
Dynamic load balancing
Spark streaming architecture
Failure and recovery
Introduction to Kafka
– Kafka is a message queue (circular buffer)
  – Retention is based on disk space or time
  – Oldest messages are deleted to maintain size
– Split into topics and partitions
– Indexed only by offset
– Delivery semantics are your responsibility
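The "circular buffer indexed by offset" idea can be sketched with a toy single-partition log. This is plain Scala, not Kafka's implementation: `PartitionLog` and its size-based retention limit `maxMessages` are hypothetical, and real retention is configured by bytes or time rather than by message count.

```scala
object PartitionLogSketch {
  // Toy model of one partition: append-only, size-bounded, addressed by offset.
  final class PartitionLog(maxMessages: Int) {
    private var messages = Vector.empty[String]
    private var baseOffset = 0L // offset of the oldest retained message

    // Appends a message and returns the offset assigned to it.
    def append(message: String): Long = {
      messages = messages :+ message
      if (messages.size > maxMessages) { // retention: drop the oldest message
        messages = messages.tail
        baseOffset += 1
      }
      baseOffset + messages.size - 1
    }

    // Reads by offset; None if the message was never written or already deleted.
    def read(offset: Long): Option[String] = {
      val index = offset - baseOffset
      if (index >= 0 && index < messages.size) Some(messages(index.toInt)) else None
    }

    def earliestOffset: Long = baseOffset
  }
}
```

Offsets keep growing even after old messages are deleted, so a consumer that saved an offset older than `earliestOffset` can no longer read from it.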
High level consumer
– Offsets are stored in ZooKeeper
– Offsets are stored per consumer group
Low level consumer
– Offsets can be stored in any store
– Must handle broker leader changes
At most once
– Save offsets
– !!! Possible failure !!!
– Save results
On failure, processing restarts at the saved offset, so messages are lost.
At least once
– Save results
– !!! Possible failure !!!
– Save offsets
On failure, messages are repeated.
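The difference between the two orderings can be simulated in a few lines of plain Scala. This is a sketch under simplifying assumptions: `Store` is a hypothetical durable store, messages are processed one at a time, and `crashAt` injects a failure between the two saves for one message.

```scala
object DeliverySemanticsSketch {
  // Hypothetical durable store holding both the saved offset and the saved results.
  final class Store {
    var offset: Int = 0
    var results: Vector[String] = Vector.empty
  }

  // Processes messages starting at the saved offset.
  // Returns true if the run finished, false if the simulated failure hit.
  def run(messages: Vector[String], store: Store,
          offsetsFirst: Boolean, crashAt: Option[Int]): Boolean = {
    var i = store.offset
    while (i < messages.length) {
      if (offsetsFirst) {
        store.offset = i + 1                          // at most once: save offset first
        if (crashAt.contains(i)) return false         // fail before the result is saved
        store.results = store.results :+ messages(i)
      } else {
        store.results = store.results :+ messages(i)  // at least once: save result first
        if (crashAt.contains(i)) return false         // fail before the offset is saved
        store.offset = i + 1
      }
      i += 1
    }
    true
  }
}
```

Running it on messages `a, b, c` with a failure at `b` and then restarting leaves `a, c` with offsets saved first (`b` is lost) and `a, b, b, c` with results saved first (`b` is repeated).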
Idempotent exactly once
– Save result with natural unique key
– !!! Possible failure !!!
– Save offset
Operation is safe to repeat.
Pros :
– Simple
– Works well with map transformations
Cons :
– Hard for aggregate transformations
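A minimal sketch of the idempotent approach, in plain Scala: results are written as upserts keyed by a natural unique key, so a replay after a failure overwrites the same row instead of adding a duplicate. `Message`, `KeyedStore` and the `id` field acting as the natural key are assumptions made for this example.

```scala
import scala.collection.mutable

object IdempotentSketch {
  final case class Message(id: String, value: Int) // id is the natural unique key

  // Hypothetical store: results keyed by message id, plus the saved offset.
  final class KeyedStore {
    val rows: mutable.Map[String, Int] = mutable.Map.empty
    var offset: Int = 0
  }

  // Same ordering as at least once (result first, offset second),
  // but the write is an upsert, so repeating it changes nothing.
  def run(messages: Vector[Message], store: KeyedStore, crashAt: Option[Int]): Boolean = {
    var i = store.offset
    while (i < messages.length) {
      val m = messages(i)
      store.rows.update(m.id, m.value)      // upsert by natural key
      if (crashAt.contains(i)) return false // simulated failure before the offset is saved
      store.offset = i + 1
      i += 1
    }
    true
  }
}
```

The same trick does not carry over directly to aggregates: if the write were "add `value` to a running total", the replayed message would be counted twice, which is why the slide lists aggregate transformations as the hard case.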
Transactional exactly once
– Begin transaction
– Save results
– Save offset
– Ensure offsets are ok
– Commit transaction
On failure, roll back results and offsets.
Pros :
– Works for any transformation
Cons :
– More complex
– Requires transactional data store
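The transactional steps above can be sketched in plain Scala by keeping results and offset in one state value that is replaced atomically. This is an in-memory illustration only: `TxStore`, `State` and `processBatch` are hypothetical names, and a real system would rely on the transactions of its data store. The aggregate here is a running sum, the case the idempotent approach struggles with.

```scala
import scala.util.control.NonFatal

object TransactionalSketch {
  // Results (a running total) and the offset live in the same transactional state.
  final case class State(total: Long, offset: Int)

  final class TxStore {
    private var committed = State(0L, 0)
    def current: State = committed

    // Applies `update` to the committed state and swaps the result in only if
    // the update completes and the batch starts at the committed offset.
    def transact(fromOffset: Int)(update: State => State): Boolean =
      try {
        require(committed.offset == fromOffset, "batch does not start at the committed offset")
        committed = update(committed)
        true
      } catch {
        case NonFatal(_) => false // roll back: committed state is untouched
      }
  }

  // Sums messages in [from, until) and advances the offset in the same transaction.
  def processBatch(store: TxStore, messages: Vector[Int], from: Int, until: Int,
                   failMidway: Boolean = false): Boolean =
    store.transact(from) { s =>
      val sum = messages.slice(from, until).sum
      if (failMidway) throw new RuntimeException("simulated failure")
      State(s.total + sum, until)
    }
}
```

A failed batch leaves both the total and the offset unchanged, and a replayed batch is rejected by the offset check, so each message contributes to the total exactly once.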
Streaming ingestion with Kafka
Approach 1: Receiver-based Approach
Pros :
– Write-ahead log (WAL) design could work with non-Kafka data stores
Cons :
– Duplication of write operations
– Dependent on HDFS
– Must use the idempotent approach for exactly once
– No access to offsets, so the transactional approach can't be used
Streaming ingestion with Kafka
Approach 2: Direct Approach (No Receivers)
Pros :
– Simplified parallelism
  – One-to-one mapping between Kafka partitions and RDD partitions
– Efficiency
  – Reduced WAL overhead
– Exactly-once semantics
  – Spark checkpoints
  – Atomic transaction
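The one-to-one mapping can be pictured as planning one offset range per Kafka partition for each batch. The sketch below is plain Scala and only illustrates the idea: `ToyOffsetRange` and `planBatch` are hypothetical names, not Spark's classes, and the "latest" offsets are passed in rather than fetched from brokers.

```scala
object OffsetRangeSketch {
  // Hypothetical stand-in for a per-partition offset range.
  final case class ToyOffsetRange(topic: String, partition: Int,
                                  fromOffset: Long, untilOffset: Long) {
    def count: Long = untilOffset - fromOffset
  }

  // One range per Kafka partition: from the last consumed offset (0 if unknown)
  // up to the latest available offset. Each range would back one RDD partition.
  def planBatch(topic: String, consumed: Map[Int, Long],
                latest: Map[Int, Long]): Seq[ToyOffsetRange] =
    latest.toSeq.sortBy(_._1).map { case (partition, until) =>
      ToyOffsetRange(topic, partition, consumed.getOrElse(partition, 0L), until)
    }
}
```

Because every batch is fully described by its offset ranges, the same batch can be re-read from Kafka after a failure, which is what makes the checkpoint and transactional options possible.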
Streaming ingestion with Kafka
Approach 2: Direct Approach (How to use it)
import kafka.serializer.StringDecoder
import org.apache.spark.streaming.kafka.KafkaUtils

// Kafka config params (topics, brokers and ssc are defined elsewhere)
val topicsSet = topics.split(",").toSet
val kafkaParams = Map[String, String](
  "metadata.broker.list" -> brokers,
  "auto.offset.reset" -> "largest")

// DirectStream method call
val messages = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
  ssc, kafkaParams, topicsSet)
Streaming ingestion with Kafka
Where to store offsets
Easy – Spark checkpoints :
– No need to access the offsets; they are used automatically on restart
– Must be idempotent, not transactional
– Checkpoints may not be recoverable
Complex – Your own data store :
– Must access offsets, save them, and provide them on restart
– Idempotent or transactional
– Offsets are just as recoverable as your results
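For the "your own data store" option, the save-and-provide-on-restart cycle might look like the sketch below. It is plain Scala with an in-memory map standing in for the real store; `OffsetStore`, `TopicPartition` and `startingOffsets` are hypothetical names, and in a real job the offsets to save would come from the stream itself (Spark's Kafka integration exposes them through `HasOffsetRanges`).

```scala
import scala.collection.mutable

object OffsetStoreSketch {
  final case class TopicPartition(topic: String, partition: Int)

  // Hypothetical in-memory stand-in for "your own data store".
  final class OffsetStore {
    private val saved = mutable.Map.empty[TopicPartition, Long]

    // Called after a batch's results are stored: remember where each partition ended.
    def save(untilOffsets: Map[TopicPartition, Long]): Unit = saved ++= untilOffsets

    // Called on restart: offsets to resume from.
    // Partitions that were never saved start at `default`.
    def startingOffsets(partitions: Set[TopicPartition],
                        default: Long = 0L): Map[TopicPartition, Long] =
      partitions.map(tp => tp -> saved.getOrElse(tp, default)).toMap
  }
}
```

Keeping this map in the same store as the results, ideally in the same transaction, is what makes the offsets "just as recoverable as your results".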
Our Scale
– 18B+ ad impressions served daily
– 10T bids processed monthly
– 22TB data processed daily
– 5PB data under management
– 6 data centers across geographies
Thank You