
Page 1: Spark

Matei Zaharia, Mosharaf Chowdhury, Tathagata Das, Ankur Dave, Justin Ma, Murphy McCauley, Michael Franklin, Scott Shenker, Ion Stoica

Spark: Fast, Interactive, Language-Integrated Cluster Computing

UC BERKELEY
www.spark-project.org

Page 2: Spark

Project Goals

Extend the MapReduce model to better support two common classes of analytics apps:
» Iterative algorithms (machine learning, graphs)
» Interactive data mining

Enhance programmability:
» Integrate into the Scala programming language
» Allow interactive use from the Scala interpreter

Page 3: Spark

Motivation

Most current cluster programming models are based on acyclic data flow from stable storage to stable storage.

[Diagram: data flows from Input through parallel Map tasks and Reduce tasks to Output]

Page 4: Spark

Motivation

Most current cluster programming models are based on acyclic data flow from stable storage to stable storage.

[Diagram: the same Input → Map → Reduce → Output flow as above]

Benefits of data flow: the runtime can decide where to run tasks and can automatically recover from failures.

Page 5: Spark

Motivation

Acyclic data flow is inefficient for applications that repeatedly reuse a working set of data:
» Iterative algorithms (machine learning, graphs)
» Interactive data mining tools (R, Excel, Python)

With current frameworks, apps reload data from stable storage on each query.

Page 6: Spark

Solution: Resilient Distributed Datasets (RDDs)

Allow apps to keep working sets in memory for efficient reuse

Retain the attractive properties of MapReduce:
» Fault tolerance, data locality, scalability

Support a wide range of applications

Page 7: Spark

Outline

» Spark programming model
» Implementation
» Demo
» User applications

Page 8: Spark

Programming Model

Resilient distributed datasets (RDDs):
» Immutable, partitioned collections of objects
» Created through parallel transformations (map, filter, groupBy, join, …) on data in stable storage
» Can be cached for efficient reuse

Actions on RDDs:
» count, reduce, collect, save, …
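To make the transformation/action split concrete, here is a minimal sketch in Scala (assuming the sc SparkContext that the Spark shell provides; this example is not from the original slides):

// Transformations are lazy: each one only defines a new RDD.
val nums    = sc.parallelize(1 to 1000)   // base RDD
val squares = nums.map(n => n * n)        // transformed RDD
val evens   = squares.filter(_ % 2 == 0)

evens.cache()                             // mark for in-memory reuse

// Actions force evaluation and return a result to the driver.
evens.count()                             // 500
evens.reduce(_ + _)                       // sum of the even squares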

Page 9: Spark

Example: Log Mining

Load error messages from a log into memory, then interactively search for various patterns:

lines = spark.textFile("hdfs://...")              // base RDD
errors = lines.filter(_.startsWith("ERROR"))      // transformed RDD
messages = errors.map(_.split('\t')(2))
cachedMsgs = messages.cache()

cachedMsgs.filter(_.contains("foo")).count        // action
cachedMsgs.filter(_.contains("bar")).count
. . .

[Diagram: the driver ships tasks to three workers; each reads one HDFS block, keeps the filtered messages in its in-memory cache (Cache 1-3), and returns results to the driver]

Result: full-text search of Wikipedia in <1 sec (vs 20 sec for on-disk data)

Result: scaled to 1 TB data in 5-7 sec (vs 170 sec for on-disk data)

Page 10: Spark

RDD Fault Tolerance

RDDs maintain lineage information that can be used to reconstruct lost partitions. Ex:

messages = textFile(...).filter(_.startsWith("ERROR"))
                        .map(_.split('\t')(2))

[Lineage graph: HDFS File → filter (func = _.contains(...)) → Filtered RDD → map (func = _.split(...)) → Mapped RDD]
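As an aside (not on the original slide), later Spark releases expose this lineage directly via toDebugString; a minimal sketch, with the input path assumed:

val messages = sc.textFile("hdfs://...")
  .filter(_.startsWith("ERROR"))
  .map(_.split('\t')(2))

// Prints the chain of parent RDDs; if a cached partition is lost,
// Spark reruns exactly this chain to rebuild that one partition.
println(messages.toDebugString)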

Page 11: Spark

Example: Logistic Regression

Goal: find the best line separating two sets of points.

[Plot: "+" and "-" points in the plane, a random initial line, and the target separator]

Page 12: Spark

Example: Logistic Regression

val data = spark.textFile(...).map(readPoint).cache()

var w = Vector.random(D)

for (i <- 1 to ITERATIONS) {
  val gradient = data.map(p =>
    (1 / (1 + exp(-p.y * (w dot p.x))) - 1) * p.y * p.x
  ).reduce(_ + _)
  w -= gradient
}

println("Final w: " + w)

Page 13: Spark

Logistic Regression Performance

[Chart: running time (s) vs number of iterations, 1-30]

Hadoop: 127 s / iteration
Spark: first iteration 174 s, further iterations 6 s

Page 14: Spark

Spark Applications

» In-memory data mining on Hive data (Conviva)
» Predictive analytics (Quantifind)
» City traffic prediction (Mobile Millennium)
» Twitter spam classification (Monarch)
» Collaborative filtering via matrix factorization
» …

Page 15: Spark

Conviva GeoReport

[Chart: time (hours): Hive 20, Spark 0.5]

Aggregations on many keys with the same WHERE clause. The 40× gain comes from:
» Not re-reading unused columns or filtered records
» Avoiding repeated decompression
» In-memory storage of deserialized objects

Page 16: Spark

Frameworks Built on Spark

Pregel on Spark (Bagel):
» Google's message-passing model for graph computation
» 200 lines of code

Hive on Spark (Shark):
» 3000 lines of code
» Compatible with Apache Hive
» ML operators in Scala

Page 17: Spark

Implementation

Runs on Apache Mesos to share resources with Hadoop & other apps

Can read from any Hadoop input source (e.g. HDFS)

[Diagram: Spark, Hadoop, and MPI sharing cluster nodes on top of Mesos]

No changes to the Scala compiler

Page 18: Spark

Spark Scheduler

» Dryad-like DAGs
» Pipelines functions within a stage
» Cache-aware work reuse & locality
» Partitioning-aware to avoid shuffles (see the sketch below)

[Diagram: a DAG over RDDs A-G with map, union, groupBy, and join operators, cut into Stages 1-3 at shuffle boundaries; shaded boxes mark cached data partitions]
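To make the partitioning-aware point concrete, a minimal sketch of how co-partitioning lets a join avoid a shuffle (the data and partition count are assumptions; the package name is the modern one):

import org.apache.spark.HashPartitioner

// Partition both pair RDDs the same way and cache them.
val part   = new HashPartitioner(8)
val users  = sc.parallelize(Seq((1, "ann"), (2, "bob"))).partitionBy(part).cache()
val visits = sc.parallelize(Seq((1, "/home"), (2, "/faq"))).partitionBy(part).cache()

// The scheduler sees that both inputs share a partitioner, so the
// join stage runs without shuffling either side across the network.
val joined = users.join(visits)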

Page 19: Spark

Interactive Spark

Modified the Scala interpreter to allow Spark to be used interactively from the command line.

Required two changes:
» Modified wrapper code generation so that each line typed has references to objects for its dependencies
» Distribute generated classes over the network
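For flavor, a hypothetical session in the modified interpreter (commands and output are illustrative only):

$ ./spark-shell
scala> val lines = sc.textFile("hdfs://...")
scala> lines.filter(_.contains("ERROR")).count()
res0: Long = 42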

Page 20: Spark

Demo

Page 21: Spark

Conclusion

Spark provides a simple, efficient, and powerful programming model for a wide range of apps.

Download our open source release:
www.spark-project.org

[email protected]

Page 22: Spark

Related Work

DryadLINQ, FlumeJava:
» Similar “distributed collection” API, but cannot reuse datasets efficiently across queries

Relational databases:
» Lineage/provenance, logical logging, materialized views

GraphLab, Piccolo, BigTable, RAMCloud:
» Fine-grained writes similar to distributed shared memory

Iterative MapReduce (e.g. Twister, HaLoop):
» Implicit data sharing for a fixed computation pattern

Caching systems (e.g. Nectar):
» Store data in files, no explicit control over what is cached

Page 23: Spark

Behavior with Not Enough RAM

Iteration time (s) vs % of working set in memory:

  Cache disabled:  68.8
  25% cached:      58.1
  50% cached:      40.7
  75% cached:      29.7
  Fully cached:    11.5

Page 24: Spark

Fault Recovery Results

[Chart: iteration time (s) over iterations 1-10, with a "No Failure" baseline series]

  Iteration:   1   2   3   4   5   6   7   8   9   10
  Time (s):  119  57  56  58  58  81  57  59  57  59

Page 25: Spark

Spark Operations

Transformations (define a new RDD):
  map, filter, sample, groupByKey, reduceByKey, sortByKey,
  flatMap, union, join, cogroup, cross, mapValues

Actions (return a result to the driver program):
  collect, reduce, count, save, lookupKey
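As a closing illustration of these operations working together, a minimal word-count sketch in Scala (the input path is an assumption):

// flatMap, map, and reduceByKey are transformations;
// collect is the action that returns results to the driver.
val counts = sc.textFile("hdfs://...")
  .flatMap(_.split(' '))
  .map(word => (word, 1))
  .reduceByKey(_ + _)

counts.collect().foreach { case (w, n) => println(w + ": " + n) }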