Apache Spark: Stream Programming and Distributed Data Processing. Habib Ahmed Bhutto, Senior Software Engineer, iConnect360

Getting started with Apache Spark

Page 1: Getting started with Apache Spark

Apache Spark

Stream Programming and Distributed Data Processing

Habib Ahmed Bhutto, Senior Software Engineer

iConnect360

Page 2: Getting started with Apache Spark

Outline

• What’s Spark
• Why Spark
• Fundamental concepts
• Cluster Deployment
• Spark Streaming
• Application Development
• Deployment
• Application Monitoring
• Debugging

Page 3: Getting started with Apache Spark

What’s Spark

• Fast
• General-purpose engine
• For large-scale data processing
• In-memory processing
• Built at AMPLab, University of California, Berkeley, as a research project that works with Hadoop storage and Yarn (it was never a Hadoop sub-project)
• Now a top-level Apache project

Page 4: Getting started with Apache Spark

Why Spark

• Speed
• Ease of use
• Generality
• Runs everywhere (Hadoop, Mesos, standalone, or in the cloud)
• Fault tolerance
• Integration
• Deployment

Page 5: Getting started with Apache Spark

Fundamental Concepts

• What exactly it does

[Figure: Hadoop execution flow]

[Figure: Spark execution flow]

Page 6: Getting started with Apache Spark

Fundamental Concepts

• How exactly it does it

Page 7: Getting started with Apache Spark

Fundamental Concepts

• Resilient Distributed Dataset (RDD)
  – Abstraction
  – Immutable
  – Partitioned collection
  – Operated on in parallel
• RDD Operations
  – Actions
  – Transformations
• Spark Context
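
The RDD ideas above can be sketched in a few lines of Scala. This is an illustration only, not code from the deck: the object name, the app name, the `local[*]` master and the sample numbers are all assumptions, and it needs the `spark-core` library on the classpath.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object RddBasics {
  def main(args: Array[String]): Unit = {
    // Assumption: local[*] runs Spark inside this JVM on all available cores.
    val conf = new SparkConf().setAppName("rdd-basics").setMaster("local[*]")
    // The Spark Context is the entry point that creates RDDs.
    val sc = new SparkContext(conf)

    // An RDD split into 4 partitions that can be processed in parallel.
    val numbers = sc.parallelize(1 to 10, numSlices = 4)

    // Transformations are lazy: each returns a new immutable RDD
    // and nothing is computed yet.
    val evens = numbers.filter(_ % 2 == 0)
    val squares = evens.map(n => n * n)

    // An action triggers the computation and returns a value to the driver.
    val total = squares.reduce(_ + _)
    println(s"partitions = ${numbers.partitions.length}, sum of even squares = $total")

    sc.stop()
  }
}
```

With these inputs the even numbers are 2, 4, 6, 8 and 10, so the reported sum is 220 and the partition count is 4.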

Page 8: Getting started with Apache Spark

Fundamental Concepts

• Driver Program
• Cluster Manager
• Worker Node
• Executor
• Job
• Stage
• Task
• Application Jar
• Deploy Mode

Page 9: Getting started with Apache Spark

Cluster Deployment

• Standalone
• Amazon EC2
• Apache Mesos
• Hadoop Yarn

Page 10: Getting started with Apache Spark

Cluster Deployment

• Master web page to monitor your cluster
  – http://<server-url>:8080

Page 11: Getting started with Apache Spark

Spark Streaming

• How it works

Page 12: Getting started with Apache Spark

Spark Streaming

• How it works internally

Page 13: Getting started with Apache Spark

Spark Streaming

Page 14: Getting started with Apache Spark

Spark Streaming

• Discretized Streams (DStreams)
  – Abstraction
  – Continuous stream
  – Input data / processed data
  – Series of RDDs

Page 15: Getting started with Apache Spark

Spark Streaming

• Any operation applied on a DStream translates to operations on the underlying RDDs
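
A minimal sketch of that relationship, assuming Spark 1.3 or later with `spark-streaming` on the classpath. The socket source on `localhost:9999`, the 5-second batch interval and all names are hypothetical choices for illustration.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object StreamingWordCount {
  def main(args: Array[String]): Unit = {
    // At least two local threads: one for the receiver, one for processing.
    val conf = new SparkConf().setAppName("streaming-word-count").setMaster("local[2]")
    // Every 5-second batch of input becomes one RDD in the DStream.
    val ssc = new StreamingContext(conf, Seconds(5))

    // Hypothetical source: lines of text arriving on a TCP socket.
    val lines = ssc.socketTextStream("localhost", 9999)

    // These are DStream operations; Spark applies them to each batch RDD.
    val counts = lines
      .flatMap(_.split(" "))
      .map(word => (word, 1))
      .reduceByKey(_ + _)

    // foreachRDD exposes the RDD behind each batch directly.
    counts.foreachRDD { rdd =>
      println(s"distinct words in this batch: ${rdd.count()}")
    }

    ssc.start()            // start receiving and processing
    ssc.awaitTermination() // block until the stream is stopped
  }
}
```

For a quick local trial, a tool such as `nc -lk 9999` can act as the text source.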

Page 16: Getting started with Apache Spark

Spark Streaming

• Window Operations
• Output Operations
• DataFrame and SQL Operations
  – Spark SQL provides the DataFrame abstraction and can also act as a distributed SQL query engine.
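
A window operation followed by an output operation might look like the sketch below. The durations, the socket source and the names are assumptions; the window length and slide interval must both be multiples of the batch interval.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object WindowedWordCount {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("windowed-word-count").setMaster("local[2]")
    // Assumed batch interval of 10 seconds.
    val ssc = new StreamingContext(conf, Seconds(10))

    // Hypothetical source: lines of text arriving on a TCP socket.
    val pairs = ssc.socketTextStream("localhost", 9999)
      .flatMap(_.split(" "))
      .map(word => (word, 1))

    // Window operation: word counts over the last 30 seconds,
    // recomputed every 10 seconds.
    val windowedCounts =
      pairs.reduceByKeyAndWindow((a: Int, b: Int) => a + b, Seconds(30), Seconds(10))

    // Output operation: print the first elements of each result RDD.
    windowedCounts.print()

    ssc.start()
    ssc.awaitTermination()
  }
}
```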

Page 17: Getting started with Apache Spark

Application Development

• Spark shell
  – Code in Scala with instant execution
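
A short session of the kind the shell allows, as an illustration. It assumes the lines are typed at the `scala>` prompt of `spark-shell`, where the shell has already created the `sc` context; the numbers are arbitrary.

```scala
// Typed at the scala> prompt of spark-shell; `sc` is provided by the shell.
val numbers = sc.parallelize(1 to 100)
val multiplesOfSeven = numbers.filter(_ % 7 == 0)

// Each action runs immediately and the shell echoes the result.
multiplesOfSeven.count()   // 14
multiplesOfSeven.take(3)   // Array(7, 14, 21)
```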

Page 18: Getting started with Apache Spark

Application Development

• Self-Contained Applications
  – Dependencies / Linking Libraries
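
With sbt, linking the Spark libraries could look like the fragment below. The project name and every version number here are assumptions, not values from the deck; pick the versions that match your cluster.

```scala
// build.sbt (project name and all version numbers are assumptions)
name := "simple-app"

version := "0.1.0"

scalaVersion := "2.10.5"

// "provided": the cluster supplies Spark at run time,
// so these jars stay out of the packaged application.
libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core"      % "1.5.2" % "provided",
  "org.apache.spark" %% "spark-streaming" % "1.5.2" % "provided"
)
```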

Page 19: Getting started with Apache Spark

Application Development

• Self-Contained Applications
  – A simple app
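
One possible simple app, modelled on the usual line-counting example rather than taken from the deck. The object name and the default input path `README.md` are hypothetical.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object SimpleApp {
  def main(args: Array[String]): Unit = {
    // Hypothetical input: first argument if given, else README.md.
    val path = if (args.nonEmpty) args(0) else "README.md"

    // No setMaster here: the master is supplied when the app is submitted.
    val conf = new SparkConf().setAppName("Simple Application")
    val sc = new SparkContext(conf)

    // cache() keeps the lines in memory because two actions reuse them.
    val lines = sc.textFile(path).cache()
    val withSpark = lines.filter(_.contains("Spark")).count()
    val total = lines.count()
    println(s"$withSpark of $total lines mention Spark")

    sc.stop()
  }
}
```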

Page 20: Getting started with Apache Spark

Application Development

• Self-Contained Applications
  – Packaging
  – Don’t forget app dependencies

Page 21: Getting started with Apache Spark

Deployment

• That’s how you deploy

Page 22: Getting started with Apache Spark

Application Monitoring

• Monitor your app
  – http://<driver-node>:4040

Page 23: Getting started with Apache Spark

Application Monitoring

• History Server
  – Enable and start the History Server
  – http://<server-url>:18080
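
The History Server can only show applications that wrote event logs, so the application side might be enabled as in this sketch. The log directory is an assumption: it must already exist and must match the directory the History Server reads (`spark.history.fs.logDirectory`).

```scala
import org.apache.spark.{SparkConf, SparkContext}

object EventLoggingApp {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("event-logging-app")
      .setMaster("local[*]")
      // Record the application's events so they can be replayed later.
      .set("spark.eventLog.enabled", "true")
      // Hypothetical directory; it has to exist before the app starts.
      .set("spark.eventLog.dir", "file:///tmp/spark-events")

    val sc = new SparkContext(conf)
    sc.parallelize(1 to 1000).map(_ * 2).count()

    // Stopping the context closes the event log for this run.
    sc.stop()
  }
}
```

The same two properties can instead be placed in `conf/spark-defaults.conf` so that every application logs its events.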

Page 25: Getting started with Apache Spark

Debugging

• Remote debugging
  – Enable remote debugging
  – Must be running on local[*]
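
A sketch of why `local[*]` matters: driver and tasks share one JVM, so a debugger attached to that JVM can stop inside the functions passed to Spark. All names are hypothetical. Remote debugging itself is switched on with the standard JVM option `-agentlib:jdwp=transport=dt_socket,server=y,suspend=y,address=5005` (port 5005 is an assumption); how that option reaches the JVM depends on how you launch the app.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object DebuggableApp {
  def main(args: Array[String]): Unit = {
    // local[*]: the driver and all tasks run in this single JVM.
    val conf = new SparkConf().setAppName("debuggable-app").setMaster("local[*]")
    val sc = new SparkContext(conf)

    val lengths = sc.parallelize(Seq("spark", "yarn", "mesos")).map { word =>
      word.length // a breakpoint on this line is reachable in local mode
    }
    println(lengths.collect().mkString(", ")) // prints "5, 4, 5"

    sc.stop()
  }
}
```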

Page 26: Getting started with Apache Spark

Running on Yarn

• Why run on Yarn?
  – Cluster resources
  – Schedulers
  – Security

Page 27: Getting started with Apache Spark

Running on Yarn

• Standalone

Page 28: Getting started with Apache Spark

Running on Yarn

• Yarn Architecture
  – Resource Manager
  – Node Manager
  – Application Master
  – Container

Page 29: Getting started with Apache Spark

Running on Yarn

• Yarn Client Mode

Page 30: Getting started with Apache Spark

Running on Yarn

• Yarn Cluster Mode

Page 31: Getting started with Apache Spark

Running on Yarn

• Standalone vs Spark on Yarn

Page 32: Getting started with Apache Spark

References

[1] Apache Spark official site, http://spark.apache.org/
[2] Introduction to Spark, http://www.slideshare.net/rahuldausa/introduction-to-apache-spark-39638645
[3] Running Spark on Yarn, http://badrit.com/blog/2015/2/29/running-spark-on-yarn#.VnEQub9eeaq
[4] Debugging Apache Spark Jobs, http://danosipov.com/?p=779
[5] Habib’s brain

Page 33: Getting started with Apache Spark

A Big Thank You

Spark it up

You got questions?