Have Your Cake and Eat It Too: Architectures for Batch and Stream Processing


Stuff We’ll Talk About

• Why do we need both streams and batches?
• Why is it a problem?
• Stream-only patterns (i.e. Kappa Architecture)
• Lambda-Architecture technologies:
– SummingBird
– Apache Spark
– Apache Flink
– Bring-your-own-framework

©2014 Cloudera, Inc. All rights reserved.

About Me

• 15 years of moving data
• Formerly consultant
• Now Cloudera Engineer:
– Sqoop Committer
– Kafka
– Flume
• @gwenshap

Why Streaming and Batch


Batch Processing

• Store data somewhere
• Read large chunks of data
• Do something with the data
• Sometimes store results

Batch Examples

• Analytics

• ETL / ELT

• Training machine learning models

• Recommendations


Stream Processing

• Listen to incoming events
• Do something with each event
• Maybe store events / results

Stream Processing Examples

• Anomaly detection, alerts

• Monitoring, SLAs

• Operational intelligence

• Analytics, dashboards

• ETL


Streaming & Batch

• Alerts, monitoring, SLAs
• Operational intelligence
• Risk analysis, anomaly detection
• Analytics
• ETL

Four Categories

• Streams only
• Batch only
• Can be done in both
• Must be done in both

(e.g. ETL, some analytics)

ETL

Most stream processing projects I see involve a few simple transformations:

• Currency conversion
• JSON to Avro
• Field extraction
• Joining a stream to a static data set
• Aggregating on a window
• Identifying a change in trend
• Document indexing

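To show how small these transformations typically are, here is a plain-Scala sketch of two of them — currency conversion and field extraction. The `Order` type, the `user=...` payload format, and the fixed 110/100 EUR→USD rate are all invented for the example.

```scala
// Hypothetical event type; real pipelines would deserialize these from Kafka.
case class Order(id: Int, amountCents: Long, payload: String)

object SimpleEtl {
  // Currency conversion on integer cents, using a made-up fixed rate.
  def eurToUsd(cents: Long): Long = cents * 110 / 100

  // Field extraction: pull the value out of a "key=value" payload.
  def extractValue(payload: String): String = payload.split("=")(1)

  def main(args: Array[String]): Unit = {
    val events = List(Order(1, 1000, "user=alice"), Order(2, 2000, "user=bob"))
    val usd   = events.map(o => o.id -> eurToUsd(o.amountCents)) // List((1,1100), (2,2200))
    val users = events.map(o => extractValue(o.payload))         // List(alice, bob)
    println(usd)
    println(users)
  }
}
```

The same one-liners work whether the `events` collection is a batch file or a stream of incoming records — which is exactly why these jobs end up duplicated across both stacks.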

Batch || Streaming

• Efficient:
– Lower CPU utilization
– Better network and disk throughput
– Fewer locks and waits

• Easier administration

• Easier integration with RDBMS

• Existing expertise

• Existing tools

• Real-time information


The Problem


We Like

• Efficiency

• Scalability

• Fault Tolerance

• Recovery from errors

• Experimenting with different approaches

• Debuggers

• Cookies


But… we don’t like maintaining two applications that do the same thing.

Do we really need to maintain the same app twice?

Yes, because:

• We are not sure about requirements

• We sometimes need to re-process with very high efficiency

Not really:

• Different apps for batch and streaming

• Can re-process with streams

• Can error-correct with streams

• Can maintain one code-base for batches and streams


Stream-Only Patterns (Kappa Architecture)

DWH Example

[Diagram: an OLTP DB and sensors/logs feed two apps. App 1 (stream processing) writes real-time fact tables; App 2 (occasional load) writes the partitioned DWH fact table. Dimensions, views, and aggregates sit alongside the fact tables.]

We need to fix older data

[Diagram: a fact table partitioned 0–13. Streaming App v1 keeps writing the real-time table while Streaming App v2 reprocesses history into a replacement partition, which is swapped into the partitioned fact table.]

We need to fix older data

[Diagram: once the replacement partitions are in place, Streaming App v1 is retired and Streaming App v2 alone writes the real-time table.]
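The reprocessing pattern in these slides can be sketched in plain Scala, with an in-memory event log standing in for Kafka. App v2 replays the whole retained log into a replacement table, which is then swapped in for v1's output; the bug in `appV1` is invented for the example.

```scala
object KappaReprocess {
  type Table = Map[String, Long]

  // v1 has a bug: it drops events tagged "b".
  def appV1(log: Seq[String]): Table =
    log.filterNot(_ == "b").groupBy(identity).map { case (k, vs) => k -> vs.size.toLong }

  // v2 is the corrected logic: count every event.
  def appV2(log: Seq[String]): Table =
    log.groupBy(identity).map { case (k, vs) => k -> vs.size.toLong }

  def main(args: Array[String]): Unit = {
    val log = Seq("a", "b", "a", "b", "c")
    val serving     = appV1(log) // live table, missing the "b" counts
    val replacement = appV2(log) // rebuilt by replaying the retained log
    val swapped     = replacement // "swap": point reads at the new table
    println(swapped)
  }
}
```

The key requirement is the retained log: as long as Kafka keeps enough history, "fixing older data" is just running the new app version from offset zero and swapping tables.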

Lambda-Architecture Technologies

WordCount in Scala

source.flatMap(line => line.split(" "))
  .map(word => (word, 1))
  .reduceByKey(_ + _)
  .print()

SummingBird


MapReduce was great because…

Very simple abstraction:
– Map
– Shuffle
– Reduce
– Type-safe

And simpler abstractions were built on top of it.

SummingBird

• Multi-stage MapReduce
• Runs on Hadoop, Spark, Storm
• Very easy to combine batch and streaming results

API

• Platform – Storm, Scalding, Spark…
• Producer.source(Platform) <- get data
• Producer – collection of events
• Transformations – map, filter, merge, leftJoin (lookup)
• Output – write(sink), sumByKey(store)
• Store – contains an aggregate for each key, and the reduce operation

Associative Reduce
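The reason sumByKey demands an associative reduce can be shown in plain Scala (no SummingBird dependency): partial aggregates computed over different chunks — different shards, or the batch layer and the stream layer — can be merged in any grouping and still give the same answer.

```scala
object AssociativeReduce {
  // One-pass aggregation: count per key.
  def sumByKey(events: Seq[(String, Long)]): Map[String, Long] =
    events.groupBy(_._1).map { case (k, vs) => k -> vs.map(_._2).sum }

  // Associative merge of two partial aggregates.
  def merge(a: Map[String, Long], b: Map[String, Long]): Map[String, Long] =
    (a.keySet ++ b.keySet).map(k => k -> (a.getOrElse(k, 0L) + b.getOrElse(k, 0L))).toMap

  def main(args: Array[String]): Unit = {
    val events = Seq("a" -> 1L, "b" -> 1L, "a" -> 1L, "c" -> 1L)
    val (batch, stream) = events.splitAt(2) // pretend split: old data vs. new data
    val merged = merge(sumByKey(batch), sumByKey(stream))
    assert(merged == sumByKey(events))      // same result as one big pass
    println(merged)
  }
}
```

This is exactly the property that lets SummingBird sum a Hadoop-produced aggregate with a Storm-produced one into a single consistent store.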

WordCount SummingBird

def wordCount[P <: Platform[P]]
    (source: Producer[P, String], store: P#Store[String, Long]) =
  source.flatMap { sentence =>
    toWords(sentence).map(_ -> 1L)
  }.sumByKey(store)

val stormTopology = Storm.remote("stormName").plan(wordCount)

val hadoopJob = Scalding("scaldingName").plan(wordCount)

Spark Streaming

First, there was the RDD

• Spark is its own execution engine
• With a high-level API
• RDDs are sharded collections
• Can be mapped, reduced, grouped, filtered, etc.

Spark Streaming

[Diagram: each DStream is a sequence of RDDs. A receiver pulls events from the source into one RDD per batch interval; each batch then runs a single pass of filter → count → print. Frames show the pre-first, first, and second batches.]
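The micro-batch model in the diagram can be sketched in plain Scala without Spark: the receiver buffers events, and every batch interval the same single pass runs over that interval's chunk. The `err` prefix convention and batch contents are made up for the example.

```scala
object MicroBatch {
  // The "single pass" from the diagram: filter → count.
  def singlePass(batch: Seq[String]): Int =
    batch.count(_.startsWith("err"))

  def main(args: Array[String]): Unit = {
    // Events grouped by arrival interval, standing in for one RDD per batch.
    val batches = Seq(
      Seq("err1", "ok", "err2"), // first batch interval
      Seq("ok", "err3")          // second batch interval
    )
    val counts = batches.map(singlePass) // the same pass per micro-batch
    counts.foreach(println)              // 2, then 1
  }
}
```

The point of the model: the per-batch pass is ordinary batch code, which is what lets nearly the same program run over a file or over a socket.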

[Diagram: the same pipeline with state. Each batch's single pass (filter → count → print) also updates a stateful RDD, which carries aggregates forward from batch to batch.]
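The stateful variant can be sketched the same way: each micro-batch's result is folded into a running state that is carried into the next batch, in the spirit of Spark Streaming's `updateStateByKey` (here reduced to a single running total for brevity).

```scala
object StatefulBatches {
  // Thread a running total (the "stateful RDD") through the per-batch passes.
  def runTotals(batches: Seq[Seq[String]]): Seq[Long] =
    batches.scanLeft(0L) { (state, batch) =>
      state + batch.count(_.startsWith("err"))
    }.tail // drop the initial empty state; keep one total per batch

  def main(args: Array[String]): Unit = {
    val batches = Seq(Seq("err1", "ok"), Seq("err2", "err3"))
    println(runTotals(batches)) // List(1, 3)
  }
}
```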

Compared to SummingBird

Differences:

• Micro-batches

• Completely new execution model

• Real joins

• Reduce is not limited to monoids

• Spark Streaming has a richer API

• SummingBird can aggregate batch and stream into one dataset

• Spark Streaming runs in a debugger

Similarities:

• Almost the same code will run in batch and streams

• Use of Scala

• Use of functional programming concepts

Spark Example

©2014 Cloudera, Inc. All rights reserved.

val conf = new SparkConf().setMaster("local[2]").setAppName("WordCount")
val sc = new SparkContext(conf)
val lines = sc.textFile(path, 2)
val words = lines.flatMap(_.split(" "))
val pairs = words.map(word => (word, 1))
val wordCounts = pairs.reduceByKey(_ + _)
wordCounts.collect().foreach(println)

Spark Streaming Example

val conf = new SparkConf().setMaster("local[2]").setAppName("WordCount")
val ssc = new StreamingContext(conf, Seconds(1))
val lines = ssc.socketTextStream("localhost", 9999)
val words = lines.flatMap(_.split(" "))
val pairs = words.map(word => (word, 1))
val wordCounts = pairs.reduceByKey(_ + _)
wordCounts.print()
ssc.start()
ssc.awaitTermination()

Apache Flink


Execution Model

You don’t want to know.


Flink vs Spark Streaming

Differences:

• Flink is event-by-event streaming; events go through the pipeline as they arrive

• Spark Streaming has good integration with HBase as a state store

• “Checkpoint barriers”

• Optimization based on strong typing

• Flink is newer than Spark Streaming, so there is less production experience

Similarities:

• Very similar APIs

• Built-in stream-specific operators (windows)

• Exactly-once guarantees through checkpoints of offsets and state (Flink is limited to small state for now)
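The checkpointing idea behind those exactly-once guarantees can be sketched in plain Scala: persist the input offset together with the operator state, so that after a failure the job resumes from the checkpoint and each event is counted exactly once. The in-memory `durable` var stands in for real checkpoint storage, and the every-two-events checkpoint interval is arbitrary.

```scala
object CheckpointSketch {
  case class Checkpoint(offset: Int, count: Long)
  var durable = Checkpoint(0, 0L) // stand-in for persistent checkpoint storage

  // Process the log from the last checkpoint; optionally crash at one index.
  def run(log: Vector[String], failAt: Option[Int]): Long = {
    var Checkpoint(offset, count) = durable
    for (i <- offset until log.length) {
      failAt.foreach(f => if (i == f) throw new RuntimeException("crash"))
      count += 1                                       // the "operator state"
      if (i % 2 == 1) durable = Checkpoint(i + 1, count) // checkpoint offset + state together
    }
    count
  }

  def main(args: Array[String]): Unit = {
    val log = Vector("e0", "e1", "e2", "e3", "e4")
    try run(log, failAt = Some(3)) catch { case _: RuntimeException => () }
    val total = run(log, failAt = None) // restart: resume from last checkpoint
    assert(total == 5)                  // every event counted exactly once
    println(total)
  }
}
```

The essential trick is that offset and state are written atomically; checkpointing them separately would reintroduce duplicates or losses on recovery.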

WordCount Batch

val env = ExecutionEnvironment.getExecutionEnvironment
val text = getTextDataSet(env)
val counts = text.flatMap { _.toLowerCase.split("\\W+") filter { _.nonEmpty } }
  .map { (_, 1) }
  .groupBy(0)
  .sum(1)
counts.print()
env.execute("WordCount Example")

WordCount Streaming

val env = StreamExecutionEnvironment.getExecutionEnvironment
val text = env.socketTextStream(host, port)
val counts = text.flatMap { _.toLowerCase.split("\\W+") filter { _.nonEmpty } }
  .map { (_, 1) }
  .groupBy(0)
  .sum(1)
counts.print()
env.execute("WordCount Example")

Bring Your Own Framework


If the requirements are simple…


How difficult is it to parallelize transformations?

Simple transformations are simple.

Just add Kafka

Kafka is a reliable data source. You can read:

• Batches
• Microbatches
• Streams

It also allows for re-partitioning.

Cluster management

• Managing cluster resources used to be difficult
• Now:
– YARN
– Mesos
– Docker
– Kubernetes

So your app should…

• Allocate resources and track tasks with YARN / Mesos
• Read from Kafka (however often you want)
• Do simple transformations
• Write to Kafka / HBase

How difficult can it possibly be?
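The loop above can be sketched in plain Scala, with in-memory queues standing in for Kafka topics. A real version would poll a Kafka consumer and write through a producer; the record format and the 110/100 conversion rate are invented, but the transformation in the middle stays this small.

```scala
import scala.collection.mutable

object OwnFramework {
  val input  = mutable.Queue("1000,EUR", "2000,EUR") // stand-in for the source topic
  val output = mutable.Queue.empty[String]           // stand-in for the sink topic

  // A "simple transformation": currency conversion on "cents,currency" records.
  def transform(record: String): String = {
    val Array(cents, _) = record.split(",")
    s"${cents.toLong * 110 / 100},USD"
  }

  def main(args: Array[String]): Unit = {
    while (input.nonEmpty)                           // the whole processing loop
      output.enqueue(transform(input.dequeue()))
    println(output.toList)
  }
}
```

Everything else on the slide — resource allocation, offsets, retries — is what the frameworks in this talk provide; the point is that when the transformation is this simple, the framework is most of the work.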

Parting Thoughts


Good engineering lessons

• DRY – do you really need the same code twice?
• Error correction is critical
• Reliability guarantees are critical
• Debuggers are really nice
• Latency / throughput trade-offs
• Use existing expertise
• Stream processing is about patterns

Thank you
