
An Introduction to Apache Spark


Slide 1: An Introduction to Apache Spark

By Amir Sedighi

Datis Pars Data Technology

Slides adopted from Databricks (Paco Nathan and Aaron Davidson)

@amirsedighi

http://hexican.com

Slide 2: History

● Developed in 2009 at UC Berkeley AMPLab.

● Open sourced in 2010.

● Spark has become one of the largest big-data projects, with more than 400 contributors from 50+ organizations, such as:

– Databricks, Yahoo!, Intel, Cloudera, IBM, …

Slide 3: What is Spark?

● Fast and general cluster computing system interoperable with Hadoop datasets.

Slide 4: What Are Spark's Improvements?

● Improves efficiency through:

– In-memory computing primitives.

– General computation graphs.

● Improves usability through:

– Rich APIs in Scala, Java, Python

– Interactive shell (Scala/Python)

Slide 5: MapReduce is a DAG in General

[Diagram]

Slide 6: MapReduce

● MapReduce is great for single-pass batch jobs, but many use cases need to run MapReduce in a multi-pass manner...

Slide 7: What Improvements Did Spark Make over MapReduce?

● Spark improves on MapReduce by supporting multi-pass analytics, interactive queries, and near-real-time distributed computation on top of Hadoop.

Note:

– Spark is a Hadoop successor.

Slide 8: How Did Spark Do It?

Smarter data sharing!

Slide 9: Data Sharing in Hadoop MapReduce

[Diagram]

Slide 10: Data Sharing in Spark

[Diagram]

Slide 11: Data Sharing in Spark

Sharing data in memory is 10-100x faster than sharing it over the network or via disk!

Slide 12: Spark Programming Model

● At a high level, every Spark application consists of a driver program that runs the user’s main function.

● It encourages you to write programs in terms of transformations on distributed datasets (a minimal driver skeleton follows).
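A minimal sketch of that structure, assuming the Spark 1.x Scala API; the object and app names are illustrative, not from the slides:

// The main function below runs on the driver; RDD operations defined in it
// are shipped to executors on the cluster.
import org.apache.spark.{SparkConf, SparkContext}

object MyApp {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("my-app")
    val sc = new SparkContext(conf)

    // ... create RDDs and apply transformations/actions here ...

    sc.stop()
  }
}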

Slide 13: Spark Programming Model

● The main abstraction Spark provides is a resilient distributed dataset (RDD); a sketch follows the list.

– A collection of elements partitioned across the cluster (in memory or on disk).

– Can be operated on in parallel (map, filter, ...).

– Automatically rebuilt on failure
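A small sketch of those three properties, assuming an active SparkContext named sc (as in the driver skeleton above); the numbers are illustrative:

// Partitioned: split the collection into 4 partitions across the cluster.
val nums = sc.parallelize(1 to 1000000, 4)

// Operated on in parallel: each partition is filtered independently.
val evens = nums.filter(_ % 2 == 0)

// Cached in memory for reuse; if a partition is lost, Spark rebuilds it
// automatically from the recorded lineage (parallelize + filter).
evens.cache()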

Slide 14: Spark Programming Model

● RDDs support two kinds of operations (a sketch follows the list):

– Transformations: Create a new dataset from an existing one.

● Example: map()

– Actions: Return a value to the driver program after running a computation on the dataset.

● Example: reduce()
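A minimal sketch of the two kinds of operations, assuming the sc context from before; data.txt is a hypothetical input file. Transformations are lazy: nothing is computed until an action runs.

// Transformation: declares a new dataset, but triggers no computation yet.
val lengths = sc.textFile("data.txt").map(line => line.length)

// Action: runs the computation and returns a value to the driver.
val totalChars = lengths.reduce(_ + _)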

Slide 15: Spark Programming Model

[Diagram]

Slide 16: Spark Programming Model

● Another abstraction is shared variables (a sketch follows the list):

– Broadcast variables, which can be used to cache a read-only value in memory on all nodes.

– Accumulators, which tasks can only add to; useful for counters and sums.
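A small sketch of both kinds of shared variables, using the Spark 1.x API; the lookup table and input codes are made up for illustration:

// Broadcast: ship a read-only lookup table to every node once.
val countries = sc.broadcast(Map("IR" -> "Iran", "US" -> "United States"))

// Accumulator: tasks can only add to it; the driver reads the total.
val misses = sc.accumulator(0)

val names = sc.parallelize(Seq("IR", "US", "XX")).map { code =>
  if (!countries.value.contains(code)) misses += 1
  countries.value.getOrElse(code, "unknown")
}
names.collect()          // the action forces the computation
println(misses.value)    // => 1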

Slides 17-19: Spark Programming Model

[Diagrams]

Slide 20: Ease of Use

● Spark offers over 80 high-level operators that make it easy to build parallel apps.

● Interactive Scala and Python shells (see the word-count sketch below).
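For example, word count takes only a handful of those operators; the input path here is hypothetical:

val counts = sc.textFile("input.txt")
  .flatMap(line => line.split(" "))   // split lines into words
  .map(word => (word, 1))             // pair each word with a count of 1
  .reduceByKey(_ + _)                 // sum the counts per word
counts.take(10).foreach(println)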

Slide 21: A General Stack

[Diagram]

Slide 22: Apache Spark Core

[Diagram]

Slide 23: Apache Spark Core

● Spark Core is the general engine for the Spark platform.

– In-memory computing capabilities deliver speed

– A general execution model supports a wide variety of use cases.

– Ease of development: native APIs in Java, Scala, Python (+ SQL, Clojure, R).

Slides 24-28: Spark SQL

[Diagrams and code examples shown as images]
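The Spark SQL slides survive only as images in this transcript. As a rough, hedged illustration of the API style of that era (Spark 1.x), assuming a hypothetical people.json file with one JSON record per line:

import org.apache.spark.sql.SQLContext

val sqlContext = new SQLContext(sc)

// Load a JSON dataset, infer its schema, and register it as a table.
val people = sqlContext.jsonFile("people.json")
people.registerTempTable("people")

// Run SQL over the distributed dataset; results come back as rows.
val adults = sqlContext.sql("SELECT name FROM people WHERE age >= 18")
adults.collect().foreach(println)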

Slide 29: Spark Streaming

[Diagram]

Slide 30: Spark Streaming

● Spark Streaming makes it easy to build scalable, fault-tolerant streaming applications.

Slides 31-34: Spark Streaming

[Diagrams]

Slide 35: Spark Streaming

val tweets = ssc.twitterStream(<Twitter username>, <Twitter password>)

DStream: a sequence of distributed datasets (RDDs) representing a distributed stream of data

Slide 36: Spark Streaming

val tweets = ssc.twitterStream(<Twitter username>, <Twitter password>)

val hashTags = tweets.flatMap (status => getTags(status))

transformation: modify data in one DStream to create another, new DStream

Slide 37: Spark Streaming

val tweets = ssc.twitterStream(<Twitter username>, <Twitter password>)

val hashTags = tweets.flatMap (status => getTags(status))

val tagCounts = hashTags.window(Minutes(1), Seconds(1)).countByValue()

sliding window operation: window length = Minutes(1), sliding interval = Seconds(1)

Slide 38: Spark Streaming

val tagCounts = hashTags.window(Minutes(1), Seconds(1)).countByValue()
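The Twitter source above needs credentials, so here is a self-contained, hedged variant of the same pipeline against a local socket source; the port, master setting, and tag-extraction logic are illustrative:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Minutes, Seconds, StreamingContext}

object HashTagCounts {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("hashtag-counts").setMaster("local[2]")
    val ssc = new StreamingContext(conf, Seconds(1))   // 1-second micro-batches

    // One status text per line; feed it interactively with: nc -lk 9999
    val statuses = ssc.socketTextStream("localhost", 9999)

    // Transformation: extract hashtags, producing a new DStream.
    val hashTags = statuses.flatMap(_.split(" ").filter(_.startsWith("#")))

    // Count tags over a 1-minute window sliding every second.
    val tagCounts = hashTags.window(Minutes(1), Seconds(1)).countByValue()
    tagCounts.print()

    ssc.start()
    ssc.awaitTermination()
  }
}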

Slide 39: MLlib

[Diagram]

Slide 40: MLlib

● MLlib is Spark's scalable machine learning library.

● MLlib works on any Hadoop data source, such as HDFS, HBase, or local files.

Slide 41: MLlib

● Algorithms (a k-means sketch follows the list):

– linear SVM and logistic regression

– classification and regression tree

– k-means clustering

– recommendation via alternating least squares

– singular value decomposition

– linear regression with L1- and L2-regularization

– multinomial naive Bayes

– basic statistics

– feature transformations
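As one hedged example from the list, a minimal k-means run with the MLlib 1.x API; features.txt is a hypothetical file of space-separated numeric vectors, and k = 3 is arbitrary:

import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors

val data = sc.textFile("features.txt")
  .map(line => Vectors.dense(line.split(' ').map(_.toDouble)))
  .cache()                              // k-means is iterative, so caching pays off

val model = KMeans.train(data, 3, 20)   // 3 clusters, up to 20 iterations
model.clusterCenters.foreach(println)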

Slide 42: GraphX

[Diagram]

Slide 43: GraphX

● GraphX is Spark's API for graphs and graph-parallel computation.

● Works with both graphs and collections.

Slide 44: GraphX

● Comparable performance to the fastest specialized graph processing systems

Slide 45: GraphX

● Algorithms (a PageRank sketch follows the list):

– PageRank

– Connected components

– Label propagation

– SVD++

– Strongly connected components

– Triangle count
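As a hedged sketch of the API, a tiny follower graph run through PageRank; the vertex ids, edge attributes, and tolerance are illustrative:

import org.apache.spark.graphx.{Edge, Graph}

val edges = sc.parallelize(Seq(
  Edge(1L, 2L, 1),   // vertex 1 links to vertex 2
  Edge(2L, 3L, 1),
  Edge(3L, 1L, 1),
  Edge(4L, 1L, 1)
))
val graph = Graph.fromEdges(edges, defaultValue = 0)

// Iterate until ranks change by less than the tolerance.
val ranks = graph.pageRank(0.0001).vertices
ranks.collect().sortBy(-_._2).foreach(println)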

Slide 46: Spark Runs Everywhere

● Spark runs on Hadoop, Mesos, standalone, or in the cloud.

● Spark accesses diverse data sources, including HDFS, Cassandra, HBase, and S3.

Slide 47: Resources

● http://spark.apache.org

● Intro to Apache Spark by Paco Nathan

● Building a Unified Data Pipeline in Spark by Aaron Davidson

● http://www.slideshare.net/manishgforce/lightening-fast-big-data-analytics-using-apache-spark

● Deep Dive with Spark Streaming by Tathagata Das (Spark Meetup)

● ZYMR

Slide 48: Thank You!

Questions?