Have Your Cake and Eat It Too: Architectures for Batch and Stream Processing


Stuff We’ll Talk About

• Why do we need both streams and batches?
• Why is it a problem?
• Stream-only patterns (i.e. Kappa Architecture)
• Lambda-Architecture technologies:
– SummingBird
– Apache Spark
– Apache Flink
– Bring-your-own-framework

©2014 Cloudera, Inc. All rights reserved.

About Me

• 15 years of moving data
• Formerly consultant
• Now Cloudera Engineer:
– Sqoop Committer
– Kafka
– Flume
• @gwenshap

Why Streaming and Batch


Batch Processing

• Store data somewhere
• Read large chunks of data
• Do something with the data
• Sometimes store results

Batch Examples

• Analytics

• ETL / ELT

• Training machine learning models

• Recommendations


Stream Processing

• Listen to incoming events
• Do something with each event
• Maybe store events / results

Stream Processing Examples

• Anomaly detection, alerts

• Monitoring, SLAs

• Operational intelligence

• Analytics, dashboards

• ETL


Streaming & Batch

• Alerts, monitoring, SLAs
• Operational intelligence
• Risk analysis, anomaly detection
• Analytics
• ETL

Four Categories

• Streams only
• Batch only
• Can be done in both
• Must be done in both

(e.g. ETL, some analytics)

ETL

Most stream processing projects I see involve a few simple transformations:

• Currency conversion
• JSON to Avro
• Field extraction
• Joining a stream to a static data set
• Aggregating on a window
• Identifying a change in trend
• Document indexing

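To show how small these transformations typically are, here is a plain-Scala sketch of two of them — currency conversion and field extraction. The `Order` type, the `user=...` payload format, and the fixed 110/100 EUR→USD rate are all invented for the example.

```scala
// Hypothetical event type; real pipelines would deserialize these from Kafka.
case class Order(id: Int, amountCents: Long, payload: String)

object SimpleEtl {
  // Currency conversion on integer cents, using a made-up fixed rate.
  def eurToUsd(cents: Long): Long = cents * 110 / 100

  // Field extraction: pull the value out of a "key=value" payload.
  def extractValue(payload: String): String = payload.split("=")(1)

  def main(args: Array[String]): Unit = {
    val events = List(Order(1, 1000, "user=alice"), Order(2, 2000, "user=bob"))
    val usd   = events.map(o => o.id -> eurToUsd(o.amountCents)) // List((1,1100), (2,2200))
    val users = events.map(o => extractValue(o.payload))         // List(alice, bob)
    println(usd)
    println(users)
  }
}
```

The same one-liners work whether the `events` collection is a batch file or a stream of incoming records — which is exactly why these jobs end up duplicated across both stacks.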

Batch || Streaming

• Efficient:
– Lower CPU utilization
– Better network and disk throughput
– Fewer locks and waits

• Easier administration

• Easier integration with RDBMS

• Existing expertise

• Existing tools

• Real-time information


The Problem


We Like

• Efficiency

• Scalability

• Fault Tolerance

• Recovery from errors

• Experimenting with different approaches

• Debuggers

• Cookies


But… we don’t like maintaining two applications that do the same thing.

Do we really need to maintain the same app twice?

Yes, because:

• We are not sure about requirements

• We sometimes need to re-process with very high efficiency

Not really:

• Different apps for batch and streaming

• Can re-process with streams

• Can error-correct with streams

• Can maintain one code-base for batches and streams


Stream-Only Patterns (Kappa Architecture)

DWH Example

[Diagram: an OLTP DB and sensors/logs feed two apps. App 1 (stream processing) writes real-time fact tables; App 2 (occasional load) writes the partitioned DWH fact table. Dimensions, views, and aggregates sit alongside the fact tables.]

We need to fix older data

[Diagram: a fact table partitioned 0–13. Streaming App v1 keeps writing the real-time table while Streaming App v2 reprocesses history into a replacement partition, which is swapped into the partitioned fact table.]

We need to fix older data

[Diagram: once the replacement partitions are in place, Streaming App v1 is retired and Streaming App v2 alone writes the real-time table.]
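The reprocessing pattern in these slides can be sketched in plain Scala, with an in-memory event log standing in for Kafka. App v2 replays the whole retained log into a replacement table, which is then swapped in for v1's output; the bug in `appV1` is invented for the example.

```scala
object KappaReprocess {
  type Table = Map[String, Long]

  // v1 has a bug: it drops events tagged "b".
  def appV1(log: Seq[String]): Table =
    log.filterNot(_ == "b").groupBy(identity).map { case (k, vs) => k -> vs.size.toLong }

  // v2 is the corrected logic: count every event.
  def appV2(log: Seq[String]): Table =
    log.groupBy(identity).map { case (k, vs) => k -> vs.size.toLong }

  def main(args: Array[String]): Unit = {
    val log = Seq("a", "b", "a", "b", "c")
    val serving     = appV1(log) // live table, missing the "b" counts
    val replacement = appV2(log) // rebuilt by replaying the retained log
    val swapped     = replacement // "swap": point reads at the new table
    println(swapped)
  }
}
```

The key requirement is the retained log: as long as Kafka keeps enough history, "fixing older data" is just running the new app version from offset zero and swapping tables.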

Lambda-Architecture Technologies

WordCount in Scala

source.flatMap(line => line.split(" "))
  .map(word => (word, 1))
  .reduceByKey(_ + _)
  .print()

SummingBird


MapReduce was great because…

Very simple abstraction:
– Map
– Shuffle
– Reduce
– Type-safe

And simpler abstractions were built on top of it.

SummingBird

• Multi-stage MapReduce
• Runs on Hadoop, Spark, Storm
• Very easy to combine batch and streaming results

API

• Platform – Storm, Scalding, Spark…
• Producer.source(Platform) <- get data
• Producer – collection of events
• Transformations – map, filter, merge, leftJoin (lookup)
• Output – write(sink), sumByKey(store)
• Store – contains an aggregate for each key, and the reduce operation

Associative Reduce
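The reason sumByKey demands an associative reduce can be shown in plain Scala (no SummingBird dependency): partial aggregates computed over different chunks — different shards, or the batch layer and the stream layer — can be merged in any grouping and still give the same answer.

```scala
object AssociativeReduce {
  // One-pass aggregation: count per key.
  def sumByKey(events: Seq[(String, Long)]): Map[String, Long] =
    events.groupBy(_._1).map { case (k, vs) => k -> vs.map(_._2).sum }

  // Associative merge of two partial aggregates.
  def merge(a: Map[String, Long], b: Map[String, Long]): Map[String, Long] =
    (a.keySet ++ b.keySet).map(k => k -> (a.getOrElse(k, 0L) + b.getOrElse(k, 0L))).toMap

  def main(args: Array[String]): Unit = {
    val events = Seq("a" -> 1L, "b" -> 1L, "a" -> 1L, "c" -> 1L)
    val (batch, stream) = events.splitAt(2) // pretend split: old data vs. new data
    val merged = merge(sumByKey(batch), sumByKey(stream))
    assert(merged == sumByKey(events))      // same result as one big pass
    println(merged)
  }
}
```

This is exactly the property that lets SummingBird sum a Hadoop-produced aggregate with a Storm-produced one into a single consistent store.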

WordCount SummingBird

def wordCount[P <: Platform[P]]
    (source: Producer[P, String], store: P#Store[String, Long]) =
  source.flatMap { sentence =>
    toWords(sentence).map(_ -> 1L)
  }.sumByKey(store)

val stormTopology = Storm.remote("stormName").plan(wordCount)

val hadoopJob = Scalding("scaldingName").plan(wordCount)

Spark Streaming

First, there was the RDD

• Spark is its own execution engine
• With a high-level API
• RDDs are sharded collections
• Can be mapped, reduced, grouped, filtered, etc.

Spark Streaming

[Diagram: each DStream is a sequence of RDDs. A receiver pulls events from the source into one RDD per batch interval; each batch then runs a single pass of filter → count → print. Frames show the pre-first, first, and second batches.]
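The micro-batch model in the diagram can be sketched in plain Scala without Spark: the receiver buffers events, and every batch interval the same single pass runs over that interval's chunk. The `err` prefix convention and batch contents are made up for the example.

```scala
object MicroBatch {
  // The "single pass" from the diagram: filter → count.
  def singlePass(batch: Seq[String]): Int =
    batch.count(_.startsWith("err"))

  def main(args: Array[String]): Unit = {
    // Events grouped by arrival interval, standing in for one RDD per batch.
    val batches = Seq(
      Seq("err1", "ok", "err2"), // first batch interval
      Seq("ok", "err3")          // second batch interval
    )
    val counts = batches.map(singlePass) // the same pass per micro-batch
    counts.foreach(println)              // 2, then 1
  }
}
```

The point of the model: the per-batch pass is ordinary batch code, which is what lets nearly the same program run over a file or over a socket.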

[Diagram: the same pipeline with state. Each batch's single pass (filter → count → print) also updates a stateful RDD, which carries aggregates forward from batch to batch.]
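The stateful variant can be sketched the same way: each micro-batch's result is folded into a running state that is carried into the next batch, in the spirit of Spark Streaming's `updateStateByKey` (here reduced to a single running total for brevity).

```scala
object StatefulBatches {
  // Thread a running total (the "stateful RDD") through the per-batch passes.
  def runTotals(batches: Seq[Seq[String]]): Seq[Long] =
    batches.scanLeft(0L) { (state, batch) =>
      state + batch.count(_.startsWith("err"))
    }.tail // drop the initial empty state; keep one total per batch

  def main(args: Array[String]): Unit = {
    val batches = Seq(Seq("err1", "ok"), Seq("err2", "err3"))
    println(runTotals(batches)) // List(1, 3)
  }
}
```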

Compared to SummingBird

Differences:

• Micro-batches

• Completely new execution model

• Real joins

• Reduce is not limited to monoids

• Spark Streaming has a richer API

• SummingBird can aggregate batch and stream into one dataset

• Spark Streaming runs in a debugger

Similarities:

• Almost the same code will run in batch and streams

• Use of Scala

• Use of functional programming concepts

Spark Example

©2014 Cloudera, Inc. All rights reserved.

val conf = new SparkConf().setMaster("local[2]").setAppName("WordCount")
val sc = new SparkContext(conf)
val lines = sc.textFile(path, 2)
val words = lines.flatMap(_.split(" "))
val pairs = words.map(word => (word, 1))
val wordCounts = pairs.reduceByKey(_ + _)
wordCounts.collect().foreach(println)

Spark Streaming Example

val conf = new SparkConf().setMaster("local[2]").setAppName("WordCount")
val ssc = new StreamingContext(conf, Seconds(1))
val lines = ssc.socketTextStream("localhost", 9999)
val words = lines.flatMap(_.split(" "))
val pairs = words.map(word => (word, 1))
val wordCounts = pairs.reduceByKey(_ + _)
wordCounts.print()
ssc.start()
ssc.awaitTermination()

Apache Flink


Execution Model

You don’t want to know.


Flink vs Spark Streaming

Differences:

• Flink is event-by-event streaming; events go through the pipeline as they arrive

• Spark Streaming has good integration with HBase as a state store

• “Checkpoint barriers”

• Optimization based on strong typing

• Flink is newer than Spark Streaming, so there is less production experience

Similarities:

• Very similar APIs

• Built-in stream-specific operators (windows)

• Exactly-once guarantees through checkpoints of offsets and state (Flink is limited to small state for now)
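The checkpointing idea behind those exactly-once guarantees can be sketched in plain Scala: persist the input offset together with the operator state, so that after a failure the job resumes from the checkpoint and each event is counted exactly once. The in-memory `durable` var stands in for real checkpoint storage, and the every-two-events checkpoint interval is arbitrary.

```scala
object CheckpointSketch {
  case class Checkpoint(offset: Int, count: Long)
  var durable = Checkpoint(0, 0L) // stand-in for persistent checkpoint storage

  // Process the log from the last checkpoint; optionally crash at one index.
  def run(log: Vector[String], failAt: Option[Int]): Long = {
    var Checkpoint(offset, count) = durable
    for (i <- offset until log.length) {
      failAt.foreach(f => if (i == f) throw new RuntimeException("crash"))
      count += 1                                       // the "operator state"
      if (i % 2 == 1) durable = Checkpoint(i + 1, count) // checkpoint offset + state together
    }
    count
  }

  def main(args: Array[String]): Unit = {
    val log = Vector("e0", "e1", "e2", "e3", "e4")
    try run(log, failAt = Some(3)) catch { case _: RuntimeException => () }
    val total = run(log, failAt = None) // restart: resume from last checkpoint
    assert(total == 5)                  // every event counted exactly once
    println(total)
  }
}
```

The essential trick is that offset and state are written atomically; checkpointing them separately would reintroduce duplicates or losses on recovery.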

WordCount Batch

val env = ExecutionEnvironment.getExecutionEnvironment
val text = getTextDataSet(env)
val counts = text.flatMap { _.toLowerCase.split("\\W+") filter { _.nonEmpty } }
  .map { (_, 1) }
  .groupBy(0)
  .sum(1)
counts.print()
env.execute("WordCount Example")

WordCount Streaming

val env = StreamExecutionEnvironment.getExecutionEnvironment
val text = env.socketTextStream(host, port)
val counts = text.flatMap { _.toLowerCase.split("\\W+") filter { _.nonEmpty } }
  .map { (_, 1) }
  .groupBy(0)
  .sum(1)
counts.print()
env.execute("WordCount Example")

Bring Your Own Framework


If the requirements are simple…


How difficult is it to parallelize transformations?

Simple transformations are simple.

Just add Kafka

Kafka is a reliable data source. You can read:

• Batches
• Microbatches
• Streams

It also allows for re-partitioning.

Cluster management

• Managing cluster resources used to be difficult
• Now:
– YARN
– Mesos
– Docker
– Kubernetes

So your app should…

• Allocate resources and track tasks with YARN / Mesos
• Read from Kafka (however often you want)
• Do simple transformations
• Write to Kafka / HBase

How difficult can it possibly be?
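The loop above can be sketched in plain Scala, with in-memory queues standing in for Kafka topics. A real version would poll a Kafka consumer and write through a producer; the record format and the 110/100 conversion rate are invented, but the transformation in the middle stays this small.

```scala
import scala.collection.mutable

object OwnFramework {
  val input  = mutable.Queue("1000,EUR", "2000,EUR") // stand-in for the source topic
  val output = mutable.Queue.empty[String]           // stand-in for the sink topic

  // A "simple transformation": currency conversion on "cents,currency" records.
  def transform(record: String): String = {
    val Array(cents, _) = record.split(",")
    s"${cents.toLong * 110 / 100},USD"
  }

  def main(args: Array[String]): Unit = {
    while (input.nonEmpty)                           // the whole processing loop
      output.enqueue(transform(input.dequeue()))
    println(output.toList)
  }
}
```

Everything else on the slide — resource allocation, offsets, retries — is what the frameworks in this talk provide; the point is that when the transformation is this simple, the framework is most of the work.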

Parting Thoughts


Good engineering lessons

• DRY – do you really need the same code twice?
• Error correction is critical
• Reliability guarantees are critical
• Debuggers are really nice
• Latency / throughput trade-offs
• Use existing expertise
• Stream processing is about patterns

Thank you
