Making Big Data Processing Simple with Spark Matei Zaharia December 17, 2015


What is Apache Spark?

Fast and general cluster computing engine that generalizes the MapReduce model

Makes it easy and fast to process large datasets
• High-level APIs in Java, Scala, Python, R
• Unified engine that can capture many workloads


A Unified Engine

Spark (core engine), with libraries on top:
• Spark Streaming: real-time
• Spark SQL: structured data
• MLlib: machine learning
• GraphX: graph

A Large Community

Most active open source project for big data

[Figure: Contributors / Month to Spark, 2010 to 2015; y-axis: contributors, 0 to 160]


Overview

Why a unified engine?

Spark programming model

Built-in libraries

Applications


History: Cluster Computing

2004: MapReduce, a general engine for batch processing


Beyond MapReduce

MapReduce was great for batch processing, but users quickly needed to do more:
• More complex, multi-pass algorithms
• More interactive ad-hoc queries
• More real-time stream processing

Result: specialized systems for these workloads

Big Data Systems Today

General batch processing: MapReduce

Specialized systems for new workloads: Pregel, Dremel, Presto, Storm, Giraph, Drill, Impala, S4, ...


Problems with Specialized Systems

More systems to manage, tune, deploy

Can’t easily combine processing types
• Even though most applications need to do this!
• E.g. load data with SQL, then run machine learning

In many cases, data transfer between engines is a dominant cost!

Big Data Systems Today

General batch processing: MapReduce

Specialized systems for new workloads: Pregel, Dremel, Presto, Storm, Giraph, Drill, Impala, S4, ...

Unified engine: ?


Background

Recall 3 workloads were issues for MapReduce:
• More complex, multi-pass algorithms
• More interactive ad-hoc queries
• More real-time stream processing

While these look different, all 3 need one thing that MapReduce lacks: efficient data sharing

Data Sharing in MapReduce

Iterative jobs: Input → HDFS read → iter. 1 → HDFS write → HDFS read → iter. 2 → HDFS write → ...

Interactive queries: Input → HDFS read → query 1, query 2, query 3, ... → result 1, result 2, result 3, ...

Slow due to replication and disk I/O

What We’d Like

Iterative jobs: Input → iter. 1 → iter. 2 → ..., with data shared through distributed memory

Interactive queries: Input → one-time processing → distributed memory → query 1, query 2, query 3, ...

10-100x faster than network and disk


Spark Programming Model

Resilient Distributed Datasets (RDDs)
• Collections of objects stored in RAM or disk across cluster
• Built via parallel transformations (map, filter, …)
• Automatically rebuilt on failure


Example: Log Mining

Load error messages from a log into memory, then interactively search for various patterns

lines = spark.textFile("hdfs://...")                    # Base RDD
errors = lines.filter(lambda s: s.startswith("ERROR"))  # Transformed RDD
messages = errors.map(lambda s: s.split('\t')[2])
messages.cache()

messages.filter(lambda s: "MySQL" in s).count()         # Action
messages.filter(lambda s: "Redis" in s).count()
...

[Figure: the Driver sends tasks to three Workers; each Worker reads one block of the file (Block 1 to Block 3), keeps its part of messages in memory (Cache 1 to Cache 3), and sends results back to the Driver]

Example: full-text search of Wikipedia in 0.5 sec (vs 20s for on-disk data)
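The log-mining steps can be tried without a cluster. What follows is a minimal sketch in plain Python, not Spark code: an ordinary list stands in for the cached RDD, and the sample log lines and their tab-separated layout (level, timestamp, message) are assumptions made up for illustration.

```python
# Hypothetical sample log lines; fields are tab-separated: level, timestamp, message.
log_lines = [
    "ERROR\t2015-12-17 10:00:01\tMySQL connection refused",
    "INFO\t2015-12-17 10:00:02\tRequest served",
    "ERROR\t2015-12-17 10:00:03\tRedis timeout",
    "ERROR\t2015-12-17 10:00:04\tMySQL deadlock detected",
]

# Same steps as the slide: keep the ERROR lines, then extract the third field.
errors = [s for s in log_lines if s.startswith("ERROR")]
messages = [s.split("\t")[2] for s in errors]  # this list plays the role of the cached RDD

# Each "query" reuses the in-memory messages instead of re-reading the log.
mysql_count = sum(1 for s in messages if "MySQL" in s)
redis_count = sum(1 for s in messages if "Redis" in s)
print(mysql_count, redis_count)
```

The point of the sketch is the shape of the work: the filtering and field extraction happen once, and every later search only scans the already-extracted messages.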


Fault Tolerance

(file.map(lambda rec: (rec.type, 1))
     .reduceByKey(lambda x, y: x + y)
     .filter(lambda type_count: type_count[1] > 10))  # each element is a (type, count) pair

Lineage: Input file → map → reduce → filter

RDDs track lineage info to rebuild lost data
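To make the lineage idea concrete, here is a toy single-machine sketch. It is not how Spark implements RDDs: the class name `ToyRDD` and its methods are hypothetical, it only covers map and filter (no reduceByKey, no partitions), and "failure" is simulated by discarding the materialized data.

```python
class ToyRDD:
    """Toy, single-machine model of lineage (not Spark's implementation)."""

    def __init__(self, source=None, parent=None, transform=None):
        self.source = source        # input data, only set on the base dataset
        self.parent = parent        # the dataset this one was derived from
        self.transform = transform  # function that rebuilds our data from the parent's
        self.cached = None          # materialized data, which may be lost

    def map(self, f):
        return ToyRDD(parent=self, transform=lambda data: [f(x) for x in data])

    def filter(self, pred):
        return ToyRDD(parent=self, transform=lambda data: [x for x in data if pred(x)])

    def collect(self):
        if self.cached is None:  # missing or lost: recompute by walking the lineage
            if self.parent is None:
                self.cached = list(self.source)
            else:
                self.cached = self.transform(self.parent.collect())
        return self.cached


base = ToyRDD(source=[1, 2, 3, 4, 5, 6])
evens_squared = base.filter(lambda x: x % 2 == 0).map(lambda x: x * x)

first = evens_squared.collect()
evens_squared.cached = None        # simulate losing the in-memory copy
rebuilt = evens_squared.collect()  # rebuilt from the recorded transformations
```

Because each dataset remembers its parent and the function that derived it, nothing has to be replicated up front; lost data is recomputed only when it is needed again.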


Example: Logistic Regression

[Figure: running time (s), 0 to 4000, vs number of iterations (1, 5, 10, 20, 30), Hadoop vs Spark]

Hadoop: 110 s / iteration

Spark: first iteration 80 s, further iterations 1 s
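The per-iteration figures on this slide imply the total running times plotted in the chart. The sketch below works them out, assuming a simple linear cost model (an assumption of this example, not a claim from the talk); the function names are hypothetical.

```python
def hadoop_total_seconds(iterations, per_iteration=110):
    """Every iteration pays the full cost, including re-reading the input."""
    return iterations * per_iteration


def spark_total_seconds(iterations, first=80, later=1):
    """The first iteration loads the data; later ones reuse it from memory."""
    return first + (iterations - 1) * later if iterations > 0 else 0


# Same iteration counts as the chart's x-axis.
for n in (1, 5, 10, 20, 30):
    print(n, hadoop_total_seconds(n), spark_total_seconds(n))
```

Under this model, 30 iterations take 3300 s with the 110 s per-iteration cost and 109 s with the 80 s + 1 s cost, which is why the gap widens as the number of iterations grows.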

On-Disk Performance

Time to sort 100 TB

2013 Record (Hadoop): 2100 machines, 72 minutes

2014 Record (Spark): 207 machines, 23 minutes

Source: Daytona GraySort benchmark, sortbenchmark.org
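The two records can be compared with a little arithmetic. The sketch below uses only the machine counts and times from this slide; "machine-minutes" is a rough measure introduced here for illustration and ignores differences in hardware between the two runs.

```python
# Figures from the slide: time to sort 100 TB.
records = {
    "2013 Hadoop": {"machines": 2100, "minutes": 72},
    "2014 Spark": {"machines": 207, "minutes": 23},
}

# Rough resource usage: machines multiplied by wall-clock minutes.
machine_minutes = {name: r["machines"] * r["minutes"] for name, r in records.items()}

speedup = records["2013 Hadoop"]["minutes"] / records["2014 Spark"]["minutes"]
machine_ratio = records["2013 Hadoop"]["machines"] / records["2014 Spark"]["machines"]
resource_ratio = machine_minutes["2013 Hadoop"] / machine_minutes["2014 Spark"]

print(machine_minutes)
print(round(speedup, 2), round(machine_ratio, 1), round(resource_ratio, 1))
```

By this measure the 2014 run finished about 3.1x faster on about 10x fewer machines, or roughly 32x fewer machine-minutes.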

Libraries Built on Spark

Spark (core engine), with libraries on top:
• Spark Streaming: real-time
• Spark SQL: structured data
• MLlib: machine learning
• GraphX: graph

Combining Processing Types

# Load data using SQL
points = ctx.sql("select latitude, longitude from tweets")

# Train a machine learning model
model = KMeans.train(points, 10)

# Apply it to a stream (slide pseudocode; the stream source is elided)
(sc.twitterStream(...)
   .map(lambda t: (model.predict(t.location), 1))
   .reduceByWindow("5s", lambda a, b: a + b))
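The last step of this slide (apply a model to a stream, then aggregate over 5-second windows) can be illustrated in plain Python. This is one plausible reading of the slide, not Spark Streaming code: the fixed `CENTERS` stand in for a trained k-means model, the event tuples are made-up sample data, and the windows are simple non-overlapping 5-second buckets.

```python
from collections import Counter

# Hypothetical stand-in for a trained model: fixed cluster centers (latitude, longitude).
CENTERS = [(0.0, 0.0), (10.0, 10.0)]


def predict(location):
    """Return the index of the nearest center (squared Euclidean distance)."""
    return min(
        range(len(CENTERS)),
        key=lambda i: (location[0] - CENTERS[i][0]) ** 2
        + (location[1] - CENTERS[i][1]) ** 2,
    )


def counts_by_window(events, window_seconds=5):
    """Count predicted clusters per window; events are (timestamp, location) pairs."""
    windows = {}
    for ts, loc in events:
        window_start = int(ts // window_seconds) * window_seconds
        windows.setdefault(window_start, Counter())[predict(loc)] += 1
    return windows


# Made-up events: (seconds since start, (latitude, longitude)).
events = [(0.5, (1.0, 1.0)), (2.0, (9.0, 9.5)), (4.9, (0.2, 0.1)), (6.0, (11.0, 10.0))]
result = counts_by_window(events)
print(result)
```

Here the model and the windowed counts live in the same program, which is the point of the slide: no hand-off between a SQL system, a training system, and a streaming system.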

Separate systems: HDFS read → ETL → HDFS write → HDFS read → train → HDFS write → HDFS read → query → HDFS write → ...

Spark: HDFS read → ETL → train → query → HDFS write

Performance vs Specialized Systems

[Figure, SQL: response time (sec), 0 to 50, for Hive, Impala (disk), Impala (mem), Spark (disk), Spark (mem)]

[Figure, ML: response time (min), 0 to 60, for Mahout, GraphLab, Spark]

[Figure, Streaming: throughput (MB/s/node), 0 to 35, for Storm, Spark]


Some Recent Additions

DataFrame API (similar to R and Pandas)
• Easy programmatic way to work with structured data

R interface (SparkR)

Machine learning pipelines (like scikit-learn)


Spark Community

Over 1000 deployments, clusters up to 8000 nodes

Many talks online at spark-summit.org


Top Applications

Fraud Detection / Security: 29%
User-Facing Services: 36%
Log Processing: 40%
Recommendation: 44%
Data Warehousing: 52%
Business Intelligence: 68%


Spark Components Used

MLlib + GraphX: 58%
Spark Streaming: 58%
DataFrames: 62%
Spark SQL: 69%

75% of users use more than one component


Learn More

Get started on your laptop: spark.apache.org

Resources and MOOCs: sparkhub.databricks.com

Spark Summit: spark-summit.org


Thank You