Apache Spark
Stream Programming and Distributed Data Processing
Habib Ahmed Bhutto
Senior Software Engineer
iConnect360
Outline
• What’s Spark
• Why Spark
• Fundamental concepts
• Cluster Deployment
• Spark Streaming
• Application Development
• Deployment
• Application Monitoring
• Debugging
What’s Spark
• Fast
• General-purpose engine
• For large-scale data processing
• In-memory processing
• Built at AMPLab, University of California, Berkeley
• Now an Apache project
Why Spark
• Speed
• Ease of use
• Generality
• Runs everywhere (Hadoop, Mesos, standalone or in the cloud)
• Fault tolerance
• Integration
• Deployment
Fundamental Concepts
• What exactly it does
Hadoop execution flow
Spark execution flow
Fundamental Concepts
• How exactly it does it
Fundamental Concepts
• Resilient Distributed Dataset (RDD)
  – Abstraction
  – Immutable
  – Partitioned collection
  – Operated on in parallel
• RDD Operations
  – Actions
  – Transformations
• Spark Context
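The RDD properties above (immutable, partitioned, transformations versus actions) can be sketched in plain Python. This is a toy model, not Spark's API: `MiniRDD` and `parallelize` are hypothetical names invented for illustration, and the partitions are processed one after another here, whereas Spark runs them in parallel on executors. The point it shows is that a transformation only records work and returns a new dataset, while an action triggers the computation.

```python
import functools


class MiniRDD:
    """Toy stand-in for an RDD: an immutable, partitioned collection."""

    def __init__(self, partitions, pipeline=()):
        # Tuples keep the partitions from being modified in place.
        self._partitions = tuple(tuple(p) for p in partitions)
        # Transformations recorded so far; nothing has been executed yet.
        self._pipeline = tuple(pipeline)

    # Transformations: lazy, each one returns a new MiniRDD.
    def map(self, fn):
        return MiniRDD(self._partitions, self._pipeline + (("map", fn),))

    def filter(self, predicate):
        return MiniRDD(self._partitions, self._pipeline + (("filter", predicate),))

    def _compute(self, partition):
        # Run the recorded pipeline over a single partition.
        items = iter(partition)
        for kind, fn in self._pipeline:
            items = map(fn, items) if kind == "map" else filter(fn, items)
        return list(items)

    # Actions: trigger the computation and return plain Python values.
    def collect(self):
        return [x for p in self._partitions for x in self._compute(p)]

    def count(self):
        return sum(len(self._compute(p)) for p in self._partitions)

    def reduce(self, fn):
        return functools.reduce(fn, self.collect())


def parallelize(data, num_partitions):
    """Split a sequence into roughly equal partitions (hypothetical helper)."""
    data = list(data)
    size = max(1, -(-len(data) // num_partitions))  # ceiling division
    return MiniRDD([data[i:i + size] for i in range(0, len(data), size)])
```

For example, `parallelize(range(1, 11), 3).filter(lambda x: x % 2 == 0).map(lambda x: x * x)` builds a new dataset without computing anything; calling `.collect()` on it is what runs the filter and the map.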
Fundamental Concepts
• Driver Program
• Cluster Manager
• Worker Node
• Executor
• Job
• Stage
• Task
• Application Jar
• Deploy Mode
Cluster Deployment
• Standalone
• Amazon EC2
• Apache Mesos
• Hadoop YARN
Cluster Deployment
• Master web UI to monitor your cluster
  – http://<server-url>:8080
Spark Streaming
• How it works
Spark Streaming
• How it works internally
Spark Streaming
• Discretized Streams (DStreams)
  – Abstraction
  – Continuous stream
  – Input data / processed data
  – Series of RDDs
Spark Streaming
• Any operation applied on a DStream translates to operations on the underlying RDDs
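A rough way to picture this, in plain Python rather than Spark's API: the incoming stream is cut into batches by a fixed batch interval, and a stream-level operation is simply the same operation applied to each batch in turn. The function names `discretize` and `dstream_map` are hypothetical, and a list of lists stands in for the series of RDDs.

```python
def discretize(events, batch_interval):
    """Group (timestamp, value) events into consecutive batches.

    Batch k holds the values whose timestamp falls in
    [k * batch_interval, (k + 1) * batch_interval).
    Intervals with no events produce empty batches.
    """
    events = sorted(events, key=lambda event: event[0])
    if not events:
        return []
    num_batches = int(events[-1][0] // batch_interval) + 1
    batches = [[] for _ in range(num_batches)]
    for timestamp, value in events:
        batches[int(timestamp // batch_interval)].append(value)
    return batches


def dstream_map(batches, fn):
    """Apply fn to every record, one batch at a time."""
    return [[fn(record) for record in batch] for batch in batches]
```

With events at 0.2 s, 0.9 s, 1.1 s and 3.5 s and a one-second interval, `discretize` yields four batches, the third of which is empty; `dstream_map` then transforms each batch independently, which is the sense in which a DStream operation becomes operations on the underlying RDDs.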
Spark Streaming
• Window Operations
• Output Operations
• DataFrame and SQL Operations
  – Spark SQL provides the DataFrame abstraction and can act as a distributed SQL query engine
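The window operations listed above can be illustrated with a small plain-Python sketch. It assumes, as a simplification, that the window length and the slide interval are both given as a number of batches; `window` is a hypothetical helper, not a Spark call. Every `slide_interval` batches it emits one window that merges the most recent `window_length` batches.

```python
def window(batches, window_length, slide_interval):
    """Merge batches into sliding windows.

    window_length and slide_interval are counted in batches.
    A window is emitted after every slide_interval batches and
    contains the records of the last window_length batches.
    """
    windows = []
    for end in range(slide_interval, len(batches) + 1, slide_interval):
        start = max(0, end - window_length)
        merged = [record for batch in batches[start:end] for record in batch]
        windows.append(merged)
    return windows
```

Because the window length can exceed the slide interval, consecutive windows may overlap, so a record can contribute to more than one windowed result.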
Application Development
• Spark-Shell
  – Code in Scala with instant execution
Application Development
• Self-Contained Applications
  – Dependencies / linking libraries
Application Development
• Self-Contained Applications
  – A simple app
Application Development
• Self-Contained Applications
  – Packaging
  – Don’t forget app dependencies
Deployment
• That’s how you deploy
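As a hedged illustration of what a deployment command can look like, the sketch below assembles a `spark-submit` argument list in Python. The flags `--class`, `--master` and `--deploy-mode` are standard `spark-submit` options; the class name, jar name and arguments in the usage note are placeholders, and `spark_submit_command` and `as_shell_line` are hypothetical helpers. Older Spark releases also accepted `yarn-client` and `yarn-cluster` as master values.

```python
import shlex


def spark_submit_command(main_class, app_jar, master,
                         deploy_mode="client", app_args=()):
    """Build the argument list for a spark-submit invocation."""
    if deploy_mode not in ("client", "cluster"):
        raise ValueError("deploy_mode must be 'client' or 'cluster'")
    command = [
        "spark-submit",
        "--class", main_class,        # entry point of the application
        "--master", master,           # e.g. local[*], spark://host:7077, yarn
        "--deploy-mode", deploy_mode, # where the driver program runs
        app_jar,                      # application jar with its dependencies
    ]
    command.extend(app_args)          # arguments passed to the application
    return command


def as_shell_line(command):
    """Render the argument list as a single, safely quoted shell line."""
    return " ".join(shlex.quote(part) for part in command)
```

For instance, `spark_submit_command("com.example.StreamApp", "stream-app.jar", "yarn", "cluster")` describes a submission where the driver runs inside the cluster; the resulting list could be handed to `subprocess.run` on a machine where Spark is installed.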
Application Monitoring
• Monitor your app
  – http://<driver-node>:4040
Application Monitoring
• History Server
  – Enable and start the History Server
  – http://<server-url>:18080
Debugging
• Remote debugging
  – Enable remote debugging
  – The application must be running with master local[*]
Running on YARN
• Why run on YARN?
  – Cluster resources
  – Schedulers
  – Security
Running on YARN
• Standalone
Running on YARN
• YARN Architecture
  – Resource Manager
  – Node Manager
  – Application Master
  – Container
Running on YARN
• YARN Client Mode
Running on YARN
• YARN Cluster Mode
Running on YARN
• Standalone vs Spark on YARN
A Big Thank You
Spark it up
You got questions?