
Page 1: Introduction to Spark Streaming

Introduction to Spark Streaming

Real time processing on Apache Spark

Page 2: Introduction to Spark Streaming

● Madhukara Phatak

● Big data consultant and trainer at datamantra.io

● Consult in Hadoop, Spark and Scala

● www.madhukaraphatak.com

Page 3: Introduction to Spark Streaming

Agenda
● Real time analytics in Big data
● Unification
● Spark streaming
● DStream
● DStream and RDD
● Stream processing
● DStream transformation
● Hands on

Page 4: Introduction to Spark Streaming

3 V’s of Big data
● Volume
  ○ TB’s and PB’s of files
  ○ Driving need for batch processing systems
● Velocity
  ○ TB’s of stream data
  ○ Driving need for stream processing systems
● Variety
  ○ Structured, semi structured and unstructured
  ○ Driving need for SQL and graph processing systems

Page 5: Introduction to Spark Streaming

Velocity
● Speed at which we
  ○ Collect the data
  ○ Process it to get insights
● More and more big data analytics is becoming real time
● Primary drivers
  ○ Social media
  ○ IoT
  ○ Mobile applications

Page 6: Introduction to Spark Streaming

Use cases
● Twitter needs to crunch a few billion tweets/s to publish trending topics
● Credit card companies need to crunch millions of transactions/s to identify fraud
● Mobile applications like WhatsApp need to constantly crunch logs for service availability and performance

Page 7: Introduction to Spark Streaming

Real time analytics
● Ability to collect and process TB’s of streaming data to get insights
● Data will be consumed from one or more streams
● Need to combine historical data with real time data
● Ability to stream data to downstream applications

Page 8: Introduction to Spark Streaming

Stream processing using M/R
● Map/Reduce is inherently a batch processing system, which is not suitable for streaming
● Needing the data source on disk puts latencies into the processing
● A stream needs multiple transformations, which cannot be expressed effectively in M/R
● The overhead of launching a new M/R job is too high

Page 9: Introduction to Spark Streaming

Apache Storm
● Apache Storm is a distributed real time stream processing system
● Apache Storm has its own APIs and does not use Map/Reduce
● At its core it processes one message at a time; micro batching (Trident) is built on top of that
● Open sourced by Twitter

Page 10: Introduction to Spark Streaming

Limitations of Streaming on Hadoop
● M/R is not suitable for streaming
● Apache Storm requires learning new APIs and a new paradigm
● No way to combine batch results from M/R with Apache Storm streams
● Maintaining two runtimes is always hard

Page 11: Introduction to Spark Streaming

Unified Platform for Big Data Apps

[Diagram: Apache Spark as a single engine for Batch, Interactive and Streaming workloads, running on top of Hadoop, Mesos and NoSQL]

Page 12: Introduction to Spark Streaming

Spark streaming

Spark Streaming is an extension of the core Spark API that enables scalable, high-throughput, fault-tolerant stream processing of live data streams.

Page 13: Introduction to Spark Streaming

Micro batch
● Spark streaming is a fast batch processing system
● Spark streaming collects stream data into small batches and runs batch processing on them
● A batch can be as small as 1s or as big as multiple hours
● Spark job creation and execution overhead is so low that it can do all of that under a second
● These batches are called DStreams

Page 14: Introduction to Spark Streaming

Discretized streams (DStream)

The input stream is divided into multiple discrete batches. The batch size is configurable.

[Diagram: input stream flowing into Spark Streaming, which splits it into batch @ t1, batch @ t2, batch @ t3]

Page 15: Introduction to Spark Streaming

DStream
● Discretized streams
● Incoming data is converted to small discrete batches
● Batch size can be from 1s to multiple mins
● DStream can be constructed from (see the sketch below)
  ○ Sockets
  ○ Kafka
  ○ HDFS
  ○ Custom receivers
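A minimal sketch (not from the slides) of constructing DStreams from two of these sources with the classic StreamingContext API; the host, port and HDFS path are placeholders, and a Kafka source would additionally need the spark-streaming-kafka artifact.

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object DStreamSources {
  def main(args: Array[String]): Unit = {
    val ssc = new StreamingContext(
      new SparkConf().setAppName("dstream-sources"), Seconds(5))

    // DStream from a TCP socket (placeholder host and port)
    val socketStream = ssc.socketTextStream("localhost", 9999)

    // DStream from new files appearing in an HDFS directory
    val fileStream = ssc.textFileStream("hdfs://namenode:8020/streaming/input")

    socketStream.print()
    fileStream.print()

    ssc.start()
    ssc.awaitTermination()
  }
}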

Page 16: Introduction to Spark Streaming

DStream to RDD

[Diagram: input stream split by Spark Streaming into batch @ t1, batch @ t2, batch @ t3, each backed underneath by RDD @ t1, RDD @ t2, RDD @ t3]

Page 17: Introduction to Spark Streaming

DStream to RDD
● Each batch of a DStream is represented as an RDD underneath
● These RDDs are replicated in the cluster for fault tolerance
● Every DStream operation results in an RDD transformation
● There are APIs to access these RDDs directly (see the sketch below)
● Can combine stream and batch processing
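One such API is foreachRDD, which hands every micro batch to your code as a plain RDD. A minimal sketch, assuming a socket source on a placeholder host and port:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object BatchAccess {
  def main(args: Array[String]): Unit = {
    val ssc = new StreamingContext(
      new SparkConf().setAppName("batch-access"), Seconds(5))
    val lines = ssc.socketTextStream("localhost", 9999)

    // foreachRDD exposes each micro batch as a plain RDD,
    // so the full batch API is available inside the closure
    lines.foreachRDD { rdd =>
      println(s"records in this batch: ${rdd.count()}")
    }

    ssc.start()
    ssc.awaitTermination()
  }
}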

Page 18: Introduction to Spark Streaming

DStream transformation

val ssc = new StreamingContext(args(0), "wordcount", Seconds(5))

val lines = ssc.socketTextStream("localhost",50050)

val words = lines.flatMap(_.split(" "))

[Diagram: the socket stream is split into batch @ t1, batch @ t2, batch @ t3, represented as RDD @ t1, RDD @ t2, RDD @ t3; the flatMap call is applied to each, producing FlatMapRDD @ t1, FlatMapRDD @ t2, FlatMapRDD @ t3]

Page 19: Introduction to Spark Streaming

Socket stream
● Ability to listen to any socket on a remote machine
● Need to configure host and port
● Both raw and text representations of the socket are available
● Built in retry mechanism

Page 20: Introduction to Spark Streaming

Wordcount example
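The slide carries only the title, so here is a minimal sketch consistent with the snippet on the DStream transformation slide (same placeholder host localhost and port 50050):

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object WordCount {
  def main(args: Array[String]): Unit = {
    val ssc = new StreamingContext(
      new SparkConf().setAppName("wordcount"), Seconds(5))

    // Split each line into words and count them per 5s batch
    val lines = ssc.socketTextStream("localhost", 50050)
    val counts = lines.flatMap(_.split(" "))
      .map(word => (word, 1))
      .reduceByKey(_ + _)
    counts.print()

    ssc.start()
    ssc.awaitTermination()
  }
}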

Page 21: Introduction to Spark Streaming

File Stream
● File streams allow tracking new files in a given directory on HDFS
● Whenever a new file appears, Spark streaming will pick it up
● Only works for new files; modifications to existing files will not be considered
● Tracked using file creation time

Page 22: Introduction to Spark Streaming

FileStream example
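The slide carries only the title; a minimal sketch of a file stream word count, where the HDFS directory is a placeholder:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object FileStreamWordCount {
  def main(args: Array[String]): Unit = {
    val ssc = new StreamingContext(
      new SparkConf().setAppName("filestream"), Seconds(30))

    // Watches the directory for newly created files; existing or
    // modified files are ignored, as the previous slide describes
    val lines = ssc.textFileStream("hdfs://namenode:8020/streaming/input")
    val counts = lines.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _)
    counts.print()

    ssc.start()
    ssc.awaitTermination()
  }
}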

Page 23: Introduction to Spark Streaming

Receiver architecture

[Diagram: a Receiver running in the Spark cluster receives the stream and stores each mini batch through the Block Manager as a BlockRDD; the streaming application (driver) holds the DStream transformations and a Job Generator that turns every mini batch into a Spark job]

Page 24: Introduction to Spark Streaming

Stateful operations
● Ability to maintain arbitrary state across multiple batches
● Fault tolerant
● Exactly once semantics
● WAL (Write Ahead Log) for receiver crashes

Page 25: Introduction to Spark Streaming

StatefulWordcount example
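The slide carries only the title; a minimal sketch using updateStateByKey, with placeholder socket and checkpoint locations:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object StatefulWordCount {
  def main(args: Array[String]): Unit = {
    val ssc = new StreamingContext(
      new SparkConf().setAppName("stateful-wordcount"), Seconds(5))

    // Stateful operations need a checkpoint directory for fault tolerance
    ssc.checkpoint("hdfs://namenode:8020/streaming/checkpoint")

    // fn(oldState, newInfo) => newState, as the next slide describes
    val updateFunc = (newCounts: Seq[Int], state: Option[Int]) =>
      Some(state.getOrElse(0) + newCounts.sum)

    val counts = ssc.socketTextStream("localhost", 50050)
      .flatMap(_.split(" "))
      .map((_, 1))
      .updateStateByKey(updateFunc)
    counts.print()

    ssc.start()
    ssc.awaitTermination()
  }
}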

Page 26: Introduction to Spark Streaming

How do stateful operations work?
● Generally state is mutable
● But in functional programming, state is represented with a state machine going from one state to another: fn(oldState, newInfo) => newState
● In Spark, state is represented using RDDs
● A change in the state is represented as a transformation of RDDs
● Fault tolerance of RDDs helps in fault tolerance of state

Page 27: Introduction to Spark Streaming

Transform API
● In stream processing, the ability to combine stream data with batch data is extremely important
● Both the batch API and the stream API share the RDD as an abstraction
● The transform API of DStream allows us to access the underlying RDDs directly

Ex: Combine customer sales data with customer information

Page 28: Introduction to Spark Streaming

CartCustomerJoin example
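The slide carries only the title; a minimal sketch of joining a stream of cart events with static customer data via transform. The file path, socket address and the "customerId,itemId" line format are all assumptions for illustration:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object CartCustomerJoin {
  def main(args: Array[String]): Unit = {
    val ssc = new StreamingContext(
      new SparkConf().setAppName("cart-customer-join"), Seconds(5))

    // Static batch data: customer id -> customer name (placeholder path)
    val customers = ssc.sparkContext
      .textFile("hdfs://namenode:8020/data/customers.csv")
      .map { line =>
        val fields = line.split(",")
        (fields(0), fields(1))
      }

    // Streaming cart events: "customerId,itemId" lines on a socket
    val carts = ssc.socketTextStream("localhost", 50050)
      .map { line =>
        val fields = line.split(",")
        (fields(0), fields(1))
      }

    // transform exposes each batch as an RDD, so it can be joined
    // with the static customers RDD using the plain batch API
    val joined = carts.transform(rdd => rdd.join(customers))
    joined.print()

    ssc.start()
    ssc.awaitTermination()
  }
}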

Page 29: Introduction to Spark Streaming

Window based operations

Page 30: Introduction to Spark Streaming

Window wordcount
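The deck ends with the window word count; a minimal sketch using reduceByKeyAndWindow, again with a placeholder host and port. The window (30s) and slide (10s) durations must be multiples of the batch interval (5s):

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object WindowWordCount {
  def main(args: Array[String]): Unit = {
    val ssc = new StreamingContext(
      new SparkConf().setAppName("window-wordcount"), Seconds(5))

    // Counts words over a sliding 30s window, recomputed every 10s
    val counts = ssc.socketTextStream("localhost", 50050)
      .flatMap(_.split(" "))
      .map((_, 1))
      .reduceByKeyAndWindow((a: Int, b: Int) => a + b, Seconds(30), Seconds(10))
    counts.print()

    ssc.start()
    ssc.awaitTermination()
  }
}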