Getting Started with Spark Streaming

Spark Streaming: short intro

Alex Apollonsky

What is Spark

Open-source distributed cluster computing framework

Written in Scala

Running in JVM

Programs written in Scala, Java, Python, R

Main Concepts:

Driver - the main program that defines and coordinates the job

Executors - worker processes that run the program’s distributed tasks

RDD - resilient distributed dataset

RDD Transformations and Actions
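The split between transformations and actions matters because transformations are lazy: they only describe a computation, and nothing runs until an action asks for a result. The sketch below illustrates that behaviour locally with `java.util.stream`, which is a stand-in and not the Spark API; the class and method names are hypothetical. Stream intermediate operations play the role of transformations and the terminal `collect` plays the role of an action.

```java
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical local illustration; java.util.stream stands in for RDD operations.
class LazyTransformationSketch {

    // Returns how many times the mapping function had run
    // before and after the terminal operation (the "action").
    static int[] callsBeforeAndAfterAction(List<Integer> input) {
        AtomicInteger calls = new AtomicInteger();

        // "Transformation": only declared here, not executed yet.
        Stream<Integer> doubled = input.stream().map(x -> {
            calls.incrementAndGet();
            return x * 2;
        });
        int before = calls.get();

        // "Action": forces the pipeline to run.
        List<Integer> result = doubled.collect(Collectors.toList());
        int after = calls.get();

        System.out.println("result = " + result);
        return new int[] { before, after };
    }

    public static void main(String[] args) {
        int[] calls = callsBeforeAndAfterAction(Arrays.asList(1, 2, 3));
        System.out.println("calls before action = " + calls[0] + ", after = " + calls[1]);
    }
}
```

In Spark itself the same pattern would typically appear as a chain of calls such as `map` or `filter` on an RDD followed by an action such as `collect` or `count`.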

Spark Components

http://spark.apache.org/

What is Spark Streaming

Extends Spark for Big Data stream processing

Can receive data from a variety of sources

Kafka, File System, HDFS, Flume, HTTP, TCP Socket...

Breaks the data stream into a series of N-second batch jobs

Processes data as immutable distributed DStreams (Discretized Streams)

Horizontally Scalable

High Throughput

Can process 60M records/sec (6 GB/sec) on 100 nodes at sub-second latency

Fault Tolerant

Can be seamlessly combined with Machine Learning Algorithms (MLlib)

Exactly-once message processing guarantee
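The idea of breaking a stream into N-second batches can be sketched in plain Java without a cluster. This is only an illustration of the bucketing step, not Spark's implementation: the class name, the method name, and the sample arrival times are all hypothetical, and each record is assumed to carry the time (in milliseconds) at which it arrived.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch of micro-batching: records are bucketed by arrival time.
class MicroBatchSketch {

    // Groups payloads into consecutive batches of batchMillis each.
    // The map key is the start time of the batch the payload falls into.
    static Map<Long, List<String>> toBatches(long[] arrivalMillis, String[] payloads, long batchMillis) {
        Map<Long, List<String>> batches = new TreeMap<>();
        for (int i = 0; i < arrivalMillis.length; i++) {
            long batchStart = (arrivalMillis[i] / batchMillis) * batchMillis;
            batches.computeIfAbsent(batchStart, k -> new ArrayList<>()).add(payloads[i]);
        }
        return batches;
    }

    public static void main(String[] args) {
        long[] arrivals = { 100, 900, 1200, 2500, 2999 };
        String[] payloads = { "a", "b", "c", "d", "e" };
        // 1-second batches: prints {0=[a, b], 1000=[c], 2000=[d, e]}
        System.out.println(toBatches(arrivals, payloads, 1000));
    }
}
```

Each resulting batch is then small enough to be processed as an ordinary batch job, which is what lets the streaming layer reuse the batch engine.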

What is Spark Streaming Cont’d

http://spark.apache.org/docs/latest/streaming-programming-guide.html

When to use Spark Streaming

Processing and Storage Pipeline use case:

Analyze real-time or batch data coming from multiple systems

Store analytical data in the analytical database

Store transactional data in the transactional database

Store original raw data in raw storage

Response use case:

Analyze real-time or batch data coming from multiple systems

Generate near-real-time alerts based on adaptive analysis of the streaming data and statistical algorithms (think MLlib)

Enrichment use case:

Enrich the data coming in with complementary data retrieved from external systems in real time

Continue with the Processing and Storage Pipeline and/or Response use cases from here

Spark Transformation Examples (Java)

map: returns a new distributed dataset by applying a function to each input element

filter: returns a new distributed dataset containing only the input elements that satisfy a predicate
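A minimal local sketch of what map and filter do, written with `java.util.stream` so it runs without a cluster. This is a stand-in for the semantics rather than the Spark API; the class name, method names, and sample log lines are hypothetical.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical local illustration; java.util.stream stands in for a distributed dataset.
class MapFilterSketch {

    // map: convert every input element (here, a line of text to its length).
    static List<Integer> lineLengths(List<String> lines) {
        return lines.stream()
                .map(String::length)
                .collect(Collectors.toList());
    }

    // filter: keep only the elements for which the predicate returns true.
    static List<String> errorsOnly(List<String> lines) {
        return lines.stream()
                .filter(line -> line.contains("ERROR"))
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList("INFO start", "ERROR disk full", "INFO done");
        System.out.println(lineLengths(lines)); // prints [10, 15, 9]
        System.out.println(errorsOnly(lines));  // prints [ERROR disk full]
    }
}
```

With Spark's Java API the same lambdas would typically be passed to `map` and `filter` on a `JavaRDD` or `JavaDStream`, and the work would be spread across executors instead of running in one JVM.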

Spark Transformation Examples (Java) Cont’d

reduceByKey: returns a new distributed dataset by aggregating the values for each key using the provided reduce function
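The effect of reduceByKey can be sketched locally as a word count: every word becomes a (word, 1) pair and the values for equal keys are merged with a reduce function, here `Integer::sum`. The sketch uses `Collectors.toMap` from `java.util.stream` as a stand-in for the distributed operation; the class and method names are hypothetical.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

// Hypothetical local illustration of reduceByKey semantics, not the Spark API.
class ReduceByKeySketch {

    // Pairs each word with 1, then merges values that share a key using Integer::sum.
    static Map<String, Integer> countByKey(List<String> words) {
        return words.stream().collect(Collectors.toMap(
                word -> word,   // key
                word -> 1,      // initial value per element
                Integer::sum,   // reduce function applied to values with the same key
                TreeMap::new)); // sorted map, so the printed order is stable
    }

    public static void main(String[] args) {
        List<String> words = Arrays.asList("spark", "kafka", "spark", "spark");
        System.out.println(countByKey(words)); // prints {kafka=1, spark=3}
    }
}
```

In Spark's Java API the reduce function would typically be passed to `reduceByKey` on a `JavaPairRDD` or `JavaPairDStream`; the function should be associative and commutative because partial results are combined across partitions in no fixed order.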

Spark Streaming Program Flow

To Start (Java, Kafka, Zookeeper)

Download/Install Zookeeper, Kafka, Spark

http://zookeeper.apache.org/releases.html

http://kafka.apache.org/downloads.html

http://spark.apache.org/downloads.html

Start Servers

Zookeeper: ./bin/zkServer.sh start

Kafka: ./bin/kafka-server-start.sh config/server.properties

Spark: ./sbin/start-all.sh

Run Examples Locally or Deploy to Spark Cluster

https://github.com/aapollonsky/kafka-spark-streaming-example

Have Fun!
