
Page 1: Have your cake and eat it too

Have Your Cake and Eat It Too
Architectures for Batch and Stream Processing

Speaker name // Speaker title

Page 2: Have your cake and eat it too


Stuff We’ll Talk About

• Why do we need both streams and batches?
• Why is it a problem?
• Stream-only patterns (i.e., Kappa Architecture)
• Lambda Architecture technologies:
– SummingBird
– Apache Spark
– Apache Flink
– Bring-your-own-framework

Page 3: Have your cake and eat it too


• 15 years of moving data
• Formerly a consultant
• Now a Cloudera engineer:
– Sqoop committer
– Kafka
– Flume

• @gwenshap

About Me

Page 4: Have your cake and eat it too


Why Streaming and Batch


Page 5: Have your cake and eat it too


Batch Processing

• Store data somewhere
• Read large chunks of data
• Do something with the data
• Sometimes store the results

Page 6: Have your cake and eat it too


Batch Examples

• Analytics

• ETL / ELT

• Training machine learning models

• Recommendations

Page 7: Have your cake and eat it too


Stream Processing

• Listen to incoming events
• Do something with each event
• Maybe store events / results

Page 8: Have your cake and eat it too


Stream Processing Examples

• Anomaly detection, alerts

• Monitoring, SLAs

• Operational intelligence

• Analytics, dashboards

• ETL

Page 9: Have your cake and eat it too


Streaming & Batch

[Venn diagram: streaming and batch use cases overlap. Labels: alerts; monitoring, SLAs; operational intelligence; risk analysis; anomaly detection; analytics; ETL.]

Page 10: Have your cake and eat it too


Four Categories

• Streams only
• Batch only
• Can be done in both
• Must be done in both

ETL, some analytics

Page 11: Have your cake and eat it too


ETL

Most stream processing projects I see involve a few simple transformations; a sketch of two of them follows the list.

• Currency conversion
• JSON to Avro
• Field extraction
• Joining a stream to a static data set
• Aggregating on a window
• Identifying a change in trend
• Document indexing
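
Two of these, sketched on plain Scala collections to show the shape (the rate table, record format, and values are hypothetical, purely for illustration):

// Hypothetical static data set: currency -> USD rate
val rates = Map("EUR" -> 1.10, "GBP" -> 1.30)

// Hypothetical stream of (currency, amount) events, here just a Seq
val events = Seq(("EUR", 100.0), ("GBP", 20.0))

// Currency conversion is a map over the stream, joining to the static set
val inUsd = events.map { case (ccy, amount) =>
  ("USD", amount * rates.getOrElse(ccy, 1.0))
}

// Field extraction: pull one field out of each delimited record
val actions = Seq("2015-06-01,click,user42").map(_.split(",")(1))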

Page 12: Have your cake and eat it too


Batch || Streaming

• Efficient:
– Lower CPU utilization
– Better network and disk throughput
– Fewer locks and waits

• Easier administration

• Easier integration with RDBMS

• Existing expertise

• Existing tools

• Real-time information

Page 13: Have your cake and eat it too


The Problem


Page 14: Have your cake and eat it too


We Like

• Efficiency

• Scalability

• Fault Tolerance

• Recovery from errors

• Experimenting with different approaches

• Debuggers

• Cookies

Page 15: Have your cake and eat it too


But… we don’t like maintaining two applications that do the same thing.

Page 16: Have your cake and eat it too


Do we really need to maintain the same app twice?

Yes, because:

• We are not sure about requirements

• We sometimes need to re-process with very high efficiency

Not really:

• Different apps for batch and streaming

• Can re-process with streams

• Can error-correct with streams

• Can maintain one code-base for batches and streams

Page 17: Have your cake and eat it too


Stream-Only Patterns (Kappa Architecture)


Page 18: Have your cake and eat it too


DWH Example

[Diagram: an OLTP DB and sensors/logs feed the warehouse two ways. App 1 (stream processing) populates real-time fact tables; App 2 (occasional load) populates a partitioned DWH fact table. Dimension tables, views, and aggregates hang off both.]

Page 19: Have your cake and eat it too


We need to fix older data

[Diagram: a partitioned fact table (partitions 0–13) alongside a real-time table. Streaming App v1 feeds the real-time table; Streaming App v2 is brought up to recompute a replacement partition.]

Page 20: Have your cake and eat it too


We need to fix older data

[Diagram: same build, next step — Streaming App v2 reprocesses the stream and its output replaces the stale partition in the fact table.]

Page 21: Have your cake and eat it too


We need to fix older data

[Diagram: final state — Streaming App v2 alone feeds the real-time table over partitions 0–13; v1 is retired.]

Page 22: Have your cake and eat it too


Lambda-Architecture Technologies


Page 23: Have your cake and eat it too


WordCount in Scala

source.flatMap(line => line.split(" "))
  .map(word => (word, 1))
  .reduceByKey(_ + _)
  .print()

Page 24: Have your cake and eat it too


SummingBird

Page 25: Have your cake and eat it too


MapReduce was great because…

Very simple abstraction:
– Map
– Shuffle
– Reduce
– Type-safe

And it has simpler abstractions on top.

Page 26: Have your cake and eat it too


SummingBird

• Multi-stage MapReduce
• Runs on Hadoop, Spark, Storm
• Very easy to combine batch and streaming results

Page 27: Have your cake and eat it too


API

• Platform – Storm, Scalding, Spark…
• Producer.source(Platform) – gets the data
• Producer – a collection of events
• Transformations – map, filter, merge, leftJoin (lookup)
• Output – write(sink), sumByKey(store)
• Store – contains the aggregate for each key, and the reduce operation

Page 28: Have your cake and eat it too


Associative Reduce
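
The deck illustrates this visually; in code, the point is that when the reduce operation is associative, partial aggregates computed over any split of the data (partitions, or an old batch plus a new stream) merge into the same total. A minimal sketch in plain Scala:

val counts = Seq(("a", 2L), ("b", 3L), ("a", 5L))

def sumByKey(xs: Seq[(String, Long)]): Map[String, Long] =
  xs.groupBy(_._1).map { case (k, vs) => k -> vs.map(_._2).sum }

// Split the data arbitrarily — two partitions, or batch vs. stream
val (left, right) = counts.splitAt(1)

// Aggregating the pieces, then merging, equals aggregating everything at once
val merged = (sumByKey(left).toSeq ++ sumByKey(right).toSeq)
  .groupBy(_._1).map { case (k, vs) => k -> vs.map(_._2).sum }

assert(merged == sumByKey(counts)) // this is what lets batch and stream sum into one store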

Page 29: Have your cake and eat it too


WordCount SummingBird

def wordCount[P <: Platform[P]]
    (source: Producer[P, String], store: P#Store[String, Long]) =
  source.flatMap { sentence =>
    toWords(sentence).map(_ -> 1L)
  }.sumByKey(store)

val stormTopology = Storm.remote("stormName").plan(wordCount)

val hadoopJob = Scalding("scaldingName").plan(wordCount)

Page 30: Have your cake and eat it too


Spark Streaming

Page 31: Have your cake and eat it too


First, there was the RDD

• Spark is its own execution engine

• With high-level API

• RDDs are sharded collections

• Can be mapped, reduced, grouped, filtered, etc.

Page 32: Have your cake and eat it too


Spark Streaming

[Diagram: a DStream is a sequence of RDDs, one per micro-batch. Before the first batch, a receiver reads the source into an RDD; each batch interval then produces a new RDD that goes through the same single-pass pipeline (filter → count → print).]

Page 33: Have your cake and eat it too

[Diagram: the same micro-batch pipeline, now stateful. Each batch's counts are merged into a stateful RDD — the state from the first batch feeds the second — so the printed result reflects accumulated state, not just the current micro-batch.]
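
The stateful variant isn't shown as code in the deck; here is a minimal sketch using Spark Streaming's updateStateByKey (the socket source, batch interval, and checkpoint path are illustrative choices):

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf().setMaster("local[2]").setAppName("StatefulWordCount")
val ssc = new StreamingContext(conf, Seconds(1))
ssc.checkpoint("/tmp/checkpoints") // stateful operations require a checkpoint directory

val pairs = ssc.socketTextStream("localhost", 9999)
  .flatMap(_.split(" "))
  .map(word => (word, 1))

// Fold each micro-batch's counts into the running state for every key
val totals = pairs.updateStateByKey[Int] { (batch: Seq[Int], state: Option[Int]) =>
  Some(state.getOrElse(0) + batch.sum)
}

totals.print() // prints accumulated counts, not just this batch's
ssc.start()
ssc.awaitTermination()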

Page 34: Have your cake and eat it too


Compared to SummingBird

Differences:

• Micro-batches

• Completely new execution model

• Real joins

• Reduce is not limited to monoids

• Spark Streaming has a richer API

• Summingbird can aggregate batch and stream to one dataset

• Spark Streaming runs in a debugger

Similarities:

• Almost the same code will run in batch and streaming

• Use of Scala

• Use of functional programming concepts

Page 35: Have your cake and eat it too


Spark Example


val conf = new SparkConf().setMaster("local[2]")
val sc = new SparkContext(conf)
val lines = sc.textFile(path, 2)
val words = lines.flatMap(_.split(" "))
val pairs = words.map(word => (word, 1))
val wordCounts = pairs.reduceByKey(_ + _)
wordCounts.collect().foreach(println) // RDDs have no print(); collect and print the results

Page 36: Have your cake and eat it too


Spark Streaming Example


val conf = new SparkConf().setMaster("local[2]")
val ssc = new StreamingContext(conf, Seconds(1))
val lines = ssc.socketTextStream("localhost", 9999)
val words = lines.flatMap(_.split(" "))
val pairs = words.map(word => (word, 1))
val wordCounts = pairs.reduceByKey(_ + _)
wordCounts.print()
ssc.start()
ssc.awaitTermination() // keep the streaming context running

Page 37: Have your cake and eat it too


Apache Flink

Page 38: Have your cake and eat it too


Execution Model

You don’t want to know.

Page 39: Have your cake and eat it too


Flink vs. Spark Streaming

Differences:

• Flink is event-by-event streaming; events go through the pipeline one at a time.

• Spark Streaming has good integration with HBase as a state store

• “checkpoint barriers”

• Optimization based on strong typing

• Flink is newer than Spark Streaming, so there is less production experience

Similarities:

• Very similar APIs

• Built-in stream-specific operators (windows) – a sketch follows this list

• Exactly once guarantees through checkpoints of offsets and state (Flink is limited to small state for now)
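
For instance, a windowed word count in Flink's Scala DataStream API (a sketch; exact operator names vary across Flink versions — this assumes a 1.x release where timeWindow is available):

import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.api.windowing.time.Time

val env = StreamExecutionEnvironment.getExecutionEnvironment
val text = env.socketTextStream("localhost", 9999)

// Count words per 10-second tumbling window instead of over the whole stream
val windowed = text
  .flatMap(_.toLowerCase.split("\\W+").filter(_.nonEmpty))
  .map((_, 1))
  .keyBy(_._1)
  .timeWindow(Time.seconds(10))
  .sum(1)

windowed.print()
env.execute("Windowed WordCount")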

Page 40: Have your cake and eat it too


WordCount Batch

val env = ExecutionEnvironment.getExecutionEnvironment

val text = getTextDataSet(env)

val counts = text.flatMap { _.toLowerCase.split("\\W+") filter { _.nonEmpty } }
  .map { (_, 1) }
  .groupBy(0)
  .sum(1)

counts.print()

env.execute("WordCount Example")

Page 41: Have your cake and eat it too


WordCount Streaming

val env = StreamExecutionEnvironment.getExecutionEnvironment // streaming, not batch, environment

val text = env.socketTextStream(host, port)

val counts = text.flatMap { _.toLowerCase.split("\\W+") filter { _.nonEmpty } }
  .map { (_, 1) }
  .groupBy(0)
  .sum(1)

counts.print()

env.execute("WordCount Example")

Page 42: Have your cake and eat it too


Bring Your Own Framework

Page 43: Have your cake and eat it too


If the requirements are simple…

Page 44: Have your cake and eat it too


How difficult is it to parallelize transformations?

Simple transformations are simple.

Page 45: Have your cake and eat it too


Just add Kafka

Kafka is a reliable data source.
You can read:
• Batches
• Microbatches
• Streams

It also allows for re-partitioning.

Page 46: Have your cake and eat it too


Cluster management

• Managing cluster resources used to be difficult
• Now:
– YARN
– Mesos
– Docker
– Kubernetes

Page 47: Have your cake and eat it too


So your app should…

• Allocate resources and track tasks with YARN / Mesos
• Read from Kafka (however often you want)
• Do simple transformations
• Write to Kafka / HBase

• How difficult can it possibly be?
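
The read-transform-write core really is small. A sketch with the Kafka consumer client (the topic, group id, and transformation are placeholders; offset management, error handling, and the YARN / Mesos integration are where the real effort goes):

import java.util.{Collections, Properties}
import org.apache.kafka.clients.consumer.KafkaConsumer
import scala.collection.JavaConverters._

val props = new Properties()
props.put("bootstrap.servers", "localhost:9092")
props.put("group.id", "my-transformer") // hypothetical consumer group
props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")
props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")

val consumer = new KafkaConsumer[String, String](props)
consumer.subscribe(Collections.singletonList("events")) // hypothetical topic

while (true) {
  val records = consumer.poll(1000) // poll a (micro)batch; the frequency is up to you
  for (record <- records.asScala) {
    val transformed = record.value.toUpperCase // stand-in for a simple transformation
    println(s"${record.key} -> $transformed") // a real app would write to Kafka / HBase
  }
}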

Page 48: Have your cake and eat it too


Parting Thoughts


Page 49: Have your cake and eat it too


Good engineering lessons

• DRY – do you really need the same code twice?
• Error correction is critical
• Reliability guarantees are critical
• Debuggers are really nice
• Latency / throughput trade-offs
• Use existing expertise
• Stream processing is about patterns

Page 50: Have your cake and eat it too

Thank you