alex-apollonsky-csm-pmp
What is Spark
Open Source Distributed Cluster Computing Framework
Written in Scala
Runs on the JVM
Programs can be written in Scala, Java, Python, or R
Main Concepts:
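Driver - the process that runs the program's main function and schedules the work
Executors - worker processes that run the program's distributed tasks
RDD - resilient distributed dataset
RDD Transformations (lazy, return a new RDD) and Actions (trigger computation, return a result)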
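Spark evaluates transformations lazily and only runs them when an action is called. The sketch below illustrates that idea locally with java.util.stream, which is also lazy; it is an analogy under that assumption, not Spark's API, and the class and method names (LazySketch, doubled, collect) are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class LazySketch {
    // Records which elements the mapping function has actually processed.
    static final List<Integer> seen = new ArrayList<>();

    // "Transformation": only describes the work; nothing is executed yet.
    static Stream<Integer> doubled(List<Integer> input) {
        return input.stream().map(x -> {
            seen.add(x);
            return x * 2;
        });
    }

    // "Action": forces evaluation and returns a concrete result.
    static List<Integer> collect(Stream<Integer> pipeline) {
        return pipeline.collect(Collectors.toList());
    }

    public static void main(String[] args) {
        Stream<Integer> pipeline = doubled(List.of(1, 2, 3));
        System.out.println("processed before action: " + seen.size()); // 0
        List<Integer> result = collect(pipeline);
        System.out.println("processed after action: " + seen.size());  // 3
        System.out.println(result);                                    // [2, 4, 6]
    }
}
```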
What is Spark Streaming
Extends Spark for Big Data stream processing
Can receive data from a variety of sources
Kafka, File System, HDFS, Flume, HTTP, TCP Socket...
Breaks the data stream into a series of N-second batch jobs
Processes data as immutable distributed DStreams (Discretized Streams)
Horizontally Scalable
High Throughput
Can process 60M records/sec (6 GB/sec) on 100 nodes at sub-second latency
Fault Tolerant
Can be seamlessly combined with Machine Learning Algorithms (MLlib)
Exactly-once processing semantics (end-to-end guarantees depend on the source and the output operation)
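The idea of breaking a stream into N-second batches can be sketched in plain Java. This is a simplified local model, not Spark Streaming code: the Event type, the toBatches helper, and the use of an event timestamp to pick the batch are assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class MicroBatchSketch {
    // Hypothetical record type: a payload plus its arrival time in milliseconds.
    static final class Event {
        final long timestampMs;
        final String payload;

        Event(long timestampMs, String payload) {
            this.timestampMs = timestampMs;
            this.payload = payload;
        }
    }

    // Assigns each event to the batch whose interval contains its timestamp.
    // The map key is the start time of the batch, in milliseconds.
    static Map<Long, List<String>> toBatches(List<Event> events, long batchIntervalMs) {
        Map<Long, List<String>> batches = new TreeMap<>();
        for (Event e : events) {
            long batchStart = (e.timestampMs / batchIntervalMs) * batchIntervalMs;
            batches.computeIfAbsent(batchStart, k -> new ArrayList<>()).add(e.payload);
        }
        return batches;
    }

    public static void main(String[] args) {
        List<Event> stream = List.of(
                new Event(100, "a"), new Event(900, "b"),
                new Event(1200, "c"), new Event(2500, "d"));
        // With 1-second batches, each map entry would become one small batch job.
        System.out.println(toBatches(stream, 1000)); // {0=[a, b], 1000=[c], 2000=[d]}
    }
}
```

In Spark Streaming itself, each such batch is exposed as one RDD of a DStream, so the usual transformations apply to every batch in turn.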
What is Spark Streaming Cont’d
http://spark.apache.org/docs/latest/streaming-programming-guide.html
When to use Spark Streaming
Processing and Storage Pipeline use case:
Analyze real time or batch data coming from multiple systems
Store analytical data in the analytical database
Store transactional data in the transactional database
Store original raw data in raw storage
Response use case:
Analyze real time or batch data coming from multiple systems
Generate near-real-time alerts based on adaptive analysis of the streaming data and statistical algorithms (think MLlib)
Enrichment use case:
Enrich incoming data in real time with complementary data retrieved from external systems
Then continue with the Processing and Storage Pipeline and/or Response use cases
Spark Transformation Examples (Java)
map: returns a new distributed dataset by applying a function to each input element
filter: returns a new distributed dataset containing only the input elements that satisfy a predicate
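A minimal local sketch of what map and filter do, written with java.util.stream so it runs without a cluster. It is an analogy, not Spark code: in Spark's Java API the corresponding calls are JavaRDD.map and JavaRDD.filter, and the class name, method names, and sample log lines here are invented for illustration.

```java
import java.util.List;
import java.util.stream.Collectors;

public class MapFilterSketch {
    // map: produces exactly one output element per input element
    // (here: each line is converted to its length).
    static List<Integer> lineLengths(List<String> lines) {
        return lines.stream().map(String::length).collect(Collectors.toList());
    }

    // filter: keeps only the elements that satisfy the predicate
    // (here: lines that contain the text "ERROR").
    static List<String> errorsOnly(List<String> lines) {
        return lines.stream().filter(l -> l.contains("ERROR")).collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String> lines = List.of("INFO start", "ERROR disk full", "INFO done");
        System.out.println(lineLengths(lines)); // [10, 15, 9]
        System.out.println(errorsOnly(lines));  // [ERROR disk full]
    }
}
```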
Spark Transformation Examples (Java) Cont’d
reduceByKey: returns a new distributed dataset of (key, value) pairs in which the values for each key are aggregated using the provided reduce function
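The semantics of reduceByKey can be sketched locally in plain Java: values that share a key are combined pairwise with the supplied function. This is a single-machine illustration, not Spark code; in Spark's Java API the corresponding call is JavaPairRDD.reduceByKey, and the helper below is a hypothetical stand-in that merely borrows the name.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.function.BinaryOperator;

public class ReduceByKeySketch {
    // Combines all values that share a key using the given reduce function.
    // A TreeMap is used only to make the printed order deterministic.
    static Map<String, Integer> reduceByKey(List<Map.Entry<String, Integer>> pairs,
                                            BinaryOperator<Integer> reduce) {
        Map<String, Integer> result = new TreeMap<>();
        for (Map.Entry<String, Integer> p : pairs) {
            // merge() stores the value if the key is new, otherwise applies reduce.
            result.merge(p.getKey(), p.getValue(), reduce);
        }
        return result;
    }

    public static void main(String[] args) {
        // Word-count style input: one (word, 1) pair per occurrence.
        List<Map.Entry<String, Integer>> pairs = List.of(
                Map.entry("spark", 1), Map.entry("kafka", 1), Map.entry("spark", 1));
        System.out.println(reduceByKey(pairs, Integer::sum)); // {kafka=1, spark=2}
    }
}
```

On a cluster the reduce function may be applied to partial results in any grouping and order, so it should be associative and commutative, as addition is here.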
Spark Streaming Program Flow
To Start (Java, Kafka, Zookeeper)
Download/Install Zookeeper, Kafka, Spark
http://zookeeper.apache.org/releases.html
http://kafka.apache.org/downloads.html
http://spark.apache.org/downloads.html
Start Servers
Zookeeper: ./bin/zkServer.sh start
Kafka: ./bin/kafka-server-start.sh config/server.properties
Spark: ./sbin/start-all.sh
Run Examples Locally or Deploy to Spark Cluster
https://github.com/aapollonsky/kafka-spark-streaming-example
Have Fun!