Overview of Spark

7/29/2019 Overview of Spark

1/25

Spark: Fast, Interactive, Language-Integrated Cluster Computing

Matei Zaharia, Mosharaf Chowdhury, Tathagata Das, Ankur Dave, Justin Ma, Murphy McCauley, Michael Franklin, Scott Shenker, Ion Stoica

UC Berkeley
www.spark-project.org


2/25

Project Goals

Extend the MapReduce model to better support two common classes of analytics apps:
- Iterative algorithms (machine learning, graphs)
- Interactive data mining

Enhance programmability:
- Integrate into Scala programming language
- Allow interactive use from Scala interpreter


3/25

Motivation

Most current cluster programming models are based on acyclic data flow from stable storage to stable storage

[Figure: acyclic data flow: Input -> Map (x3) -> Reduce (x2) -> Output]


4/25

Motivation

Benefits of data flow: runtime can decide where to run tasks and can automatically recover from failures


5/25

Motivation

Acyclic data flow is inefficient for applications that repeatedly reuse a working set of data:
- Iterative algorithms (machine learning, graphs)
- Interactive data mining tools (R, Excel, Python)

With current frameworks, apps reload data from stable storage on each query


6/25

Solution: Resilient Distributed Datasets (RDDs)

- Allow apps to keep working sets in memory for efficient reuse
- Retain the attractive properties of MapReduce: fault tolerance, data locality, scalability
- Support a wide range of applications


7/25

Outline

- Spark programming model
- Implementation
- Demo
- User applications


8/25

Programming Model

Resilient distributed datasets (RDDs)
- Immutable, partitioned collections of objects
- Created through parallel transformations (map, filter, groupBy, join, ...) on data in stable storage
- Can be cached for efficient reuse

Actions on RDDs
- Count, reduce, collect, save, ...
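
The split between lazy transformations, actions, and caching can be imitated on one machine with ordinary Scala collection views. The sketch below is only an analogy and does not use Spark: `LazyPipelineSketch`, `parse`, and `run` are hypothetical names, a `view` stands in for an uncached RDD, and `toVector` stands in for caching.

```scala
object LazyPipelineSketch {
  // Counts how many times the "expensive" per-record function runs.
  var parseCalls = 0

  def parse(line: String): String = {
    parseCalls += 1
    line.trim
  }

  // Returns the value of parseCalls observed after each step.
  def run(): Seq[Int] = {
    parseCalls = 0
    val lines = Vector(" a ", " b ", " c ")

    // "Transformation": the view only records the map; nothing runs yet.
    val parsed = lines.view.map(parse)
    val afterDefine = parseCalls

    // "Action": counting forces one pass over the data.
    parsed.count(_.nonEmpty)
    val afterFirstAction = parseCalls

    // A second action recomputes everything, because nothing was kept.
    parsed.count(_ == "a")
    val afterSecondAction = parseCalls

    // "Caching": materialize once, then reuse without recomputation.
    val cached = parsed.toVector
    cached.count(_.nonEmpty)
    cached.count(_ == "a")
    val afterCachedActions = parseCalls

    Seq(afterDefine, afterFirstAction, afterSecondAction, afterCachedActions)
  }

  def main(args: Array[String]): Unit =
    println(run().mkString(", "))
}
```

The counter stays at zero until the first action and grows by one full pass per action until the data is materialized, which is the reuse pattern that caching is meant to make cheap.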


9/25

Example: Log Mining

Load error messages from a log into memory, then interactively search for various patterns

    val lines = spark.textFile("hdfs://...")            // base RDD
    val errors = lines.filter(_.startsWith("ERROR"))    // transformed RDD
    val messages = errors.map(_.split("\t")(2))         // keep the third tab-separated field
    val cachedMsgs = messages.cache()

    cachedMsgs.filter(_.contains("foo")).count          // action
    cachedMsgs.filter(_.contains("bar")).count
    // . . .

[Figure: the driver sends tasks to three workers and receives results; each worker holds one block of the input (Block 1-3) and one cache (Cache 1-3)]

Result: full-text search of Wikipedia in ...


10/25

RDD Fault Tolerance

RDDs maintain lineage information that can be used to reconstruct lost partitions

Ex:

    val messages = textFile(...).filter(_.startsWith("ERROR"))
                                .map(_.split("\t")(2))

Lineage: HDFSFile -> filter (func = _.startsWith(...)) -> FilteredRDD -> map (func = _.split(...)) -> MappedRDD
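
To make the lineage idea concrete, here is a toy single-machine model, not Spark's implementation. The names `Dataset`, `Source`, `Filtered`, `Mapped`, and `LineageSketch` are all hypothetical, and the sample log lines are made up for illustration. Each derived dataset stores only its parent and a function, so a partition that disappears from the cache can be rebuilt by replaying those functions on the source data.

```scala
// Toy model: a dataset knows how to compute any one of its partitions.
sealed trait Dataset[A] {
  def numPartitions: Int
  def compute(partition: Int): Vector[A]
}

// Stands in for a file in stable storage, already split into partitions.
final case class Source[A](partitions: Vector[Vector[A]]) extends Dataset[A] {
  def numPartitions: Int = partitions.length
  def compute(partition: Int): Vector[A] = partitions(partition)
}

// Lineage node: parent plus the predicate applied to it.
final case class Filtered[A](parent: Dataset[A], p: A => Boolean) extends Dataset[A] {
  def numPartitions: Int = parent.numPartitions
  def compute(partition: Int): Vector[A] = parent.compute(partition).filter(p)
}

// Lineage node: parent plus the function applied to each element.
final case class Mapped[A, B](parent: Dataset[A], f: A => B) extends Dataset[B] {
  def numPartitions: Int = parent.numPartitions
  def compute(partition: Int): Vector[B] = parent.compute(partition).map(f)
}

object LineageSketch {
  // Made-up sample data in two partitions.
  val file: Source[String] = Source(Vector(
    Vector("ERROR\tdb\ttimeout", "INFO\tweb\tok"),
    Vector("ERROR\tweb\t500")))

  val messages: Dataset[String] =
    Mapped(
      Filtered(file, (s: String) => s.startsWith("ERROR")),
      (s: String) => s.split("\t")(2))

  // In-memory copies of computed partitions (stands in for a worker's cache).
  val cache = scala.collection.mutable.Map[Int, Vector[String]]()

  // Serve from the cache, recomputing from lineage if the partition is missing.
  def get(partition: Int): Vector[String] =
    cache.getOrElseUpdate(partition, messages.compute(partition))
}
```

Calling `LineageSketch.get(0)`, then `LineageSketch.cache.remove(0)` to simulate a lost partition, then `LineageSketch.get(0)` again yields the same contents both times, and only the missing partition is recomputed.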


11/25

Example: Logistic Regression

Goal: find best line separating two sets of points

[Figure: scatter plot of the points with a random initial line and the target line]


12/25

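Example: Logistic Regression

    // Load the points once and keep them in memory across iterations
    val data = spark.textFile(...).map(readPoint).cache()

    var w = Vector.random(D)

    for (i <- 1 to ITERATIONS) {
      // Each point contributes one term; reduce sums the terms
      val gradient = data.map(p =>
        (1 / (1 + exp(-p.y * (w dot p.x))) - 1) * p.y * p.x
      ).reduce(_ + _)
      w -= gradient
    }

    println("Final w: " + w)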
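
The per-point expression in the loop can be read as one term of the gradient of the logistic loss. The derivation below is a sketch under two assumptions that the slide does not state: labels take values in {-1, +1}, and the update uses an implicit step size of 1. Here x_p and y_p correspond to `p.x` and `p.y`, and the sum over p corresponds to `reduce(_ + _)`.

```latex
\begin{align*}
L(w) &= \sum_{p} \log\!\left(1 + e^{-y_p\,(w \cdot x_p)}\right) \\
\nabla_w L(w) &= \sum_{p} \left(\frac{1}{1 + e^{-y_p\,(w \cdot x_p)}} - 1\right) y_p\, x_p \\
w &\leftarrow w - \nabla_w L(w)
\end{align*}
```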


13/25

Logistic Regression Performance

[Figure: running time (s) versus number of iterations (1, 5, 10, 20, 30) for Hadoop and Spark]

Hadoop: 127 s / iteration
Spark: first iteration 174 s, further iterations 6 s
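
As a rough illustration of what these per-iteration figures imply, the sketch below totals the running time for the iteration counts on the chart's axis. It assumes time grows linearly with the number of iterations, which is a simplification and not a measurement from the slide; `IterationCost` and its members are hypothetical names.

```scala
object IterationCost {
  // Per-iteration figures quoted on the slide, in seconds.
  val hadoopPerIteration = 127
  val sparkFirstIteration = 174
  val sparkLaterIteration = 6

  // Assumption: every Hadoop iteration costs the same.
  def hadoopTotal(n: Int): Int = n * hadoopPerIteration

  // Assumption: Spark pays the load cost once, then the cached cost.
  def sparkTotal(n: Int): Int =
    if (n <= 0) 0 else sparkFirstIteration + (n - 1) * sparkLaterIteration

  def main(args: Array[String]): Unit =
    for (n <- Seq(1, 5, 10, 20, 30))
      println(s"$n iterations: Hadoop ${hadoopTotal(n)} s, Spark ${sparkTotal(n)} s")
}
```

Under this model a single iteration is slower on Spark (174 s versus 127 s), and the advantage appears only once the cached data is reused.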


14/25

Spark Applications

- In-memory data mining on Hive data (Conviva)
- Predictive analytics (Quantifind)
- City traffic prediction (Mobile Millennium)
- Twitter spam classification (Monarch)
- Collaborative filtering via matrix factorization


15/25

Conviva GeoReport

[Figure: bar chart of time (hours): Spark 0.5, Hive 20]

Aggregations on many keys w/ same WHERE clause

40x gain comes from:
- Not re-reading unused columns or filtered records
- Avoiding repeated decompression
- In-memory storage of deserialized objects


16/25

Frameworks Built on Spark

Pregel on Spark (Bagel)
- Google message passing model for graph computation
- 200 lines of code

Hive on Spark (Shark)
- 3000 lines of code
- Compatible with Apache Hive
- ML operators in Scala


17/25

Implementation

- Runs on Apache Mesos to share resources with Hadoop & other apps
- Can read from any Hadoop input source (e.g. HDFS)
- No changes to Scala compiler

[Figure: Spark, Hadoop, and MPI running on top of Mesos across four nodes]


18/25

Spark Scheduler

- Dryad-like DAGs
- Pipelines functions within a stage
- Cache-aware work reuse & locality
- Partitioning-aware to avoid shuffles

[Figure: DAG of RDDs A-G linked by groupBy, map, union, and join, divided into Stage 1, Stage 2, and Stage 3; legend: cached data partition]


19/25

Interactive Spark

Modified Scala interpreter to allow Spark to be used interactively from the command line

Required two changes:
- Modified wrapper code generation so that each line typed has references to objects for its dependencies
- Distribute generated classes over the network


20/25

Demo


21/25

Conclusion

Spark provides a simple, efficient, and powerful programming model for a wide range of apps

Download our open source release:

www.spark-project.org



22/25

Related Work

DryadLINQ, FlumeJava
- Similar distributed collection API, but cannot reuse datasets efficiently across queries

Relational databases
- Lineage/provenance, logical logging, materialized views

GraphLab, Piccolo, BigTable, RAMCloud
- Fine-grained writes similar to distributed shared memory

Iterative MapReduce (e.g. Twister, HaLoop)
- Implicit data sharing for a fixed computation pattern

Caching systems (e.g. Nectar)
- Store data in files, no explicit control over what is cached


23/25

Behavior with Not Enough RAM

% of working set in memory    Iteration time (s)
Cache disabled                68.8
25%                           58.1
50%                           40.7
75%                           29.7
Fully cached                  11.5


24/25

Fault Recovery Results

[Figure: iteration time (s) for iterations 1-10; series: No Failure, Failure in the 6th Iteration]

Data labels by iteration:

Iteration             1    2    3    4    5    6    7    8    9    10
Iteration time (s)    119  57   56   58   58   81   57   59   57   59


25/25

Spark Operations

Transformations (define a new RDD):
map, filter, sample, groupByKey, reduceByKey, sortByKey, flatMap, union, join, cogroup, cross, mapValues

Actions (return a result to driver program):
collect, reduce, count, save, lookupKey
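
The slide only names the key-value operations. As a rough guide to what three of them compute, the sketch below gives local analogues on ordinary Scala collections. It does not use Spark and ignores partitioning and distribution entirely; `PairOpsSketch` and the sample `visits` data are hypothetical, and the function names are borrowed from the list above purely for illustration.

```scala
object PairOpsSketch {
  // Made-up (key, value) records.
  val visits: Seq[(String, Int)] =
    Seq(("index.html", 1), ("about.html", 1), ("index.html", 1))

  // groupByKey analogue: collect all values that share a key.
  def groupByKey[K, V](pairs: Seq[(K, V)]): Map[K, Seq[V]] =
    pairs.groupBy(_._1).map { case (k, kvs) => k -> kvs.map(_._2) }

  // reduceByKey analogue: combine the values of each key with f.
  def reduceByKey[K, V](pairs: Seq[(K, V)])(f: (V, V) => V): Map[K, V] =
    groupByKey(pairs).map { case (k, vs) => k -> vs.reduce(f) }

  // mapValues analogue: transform values and leave keys untouched.
  def mapValues[K, V, W](pairs: Seq[(K, V)])(f: V => W): Seq[(K, W)] =
    pairs.map { case (k, v) => (k, f(v)) }

  def main(args: Array[String]): Unit = {
    println(groupByKey(visits))
    println(reduceByKey(visits)(_ + _))
    println(mapValues(visits)(_ * 10))
  }
}
```

For example, `reduceByKey(visits)(_ + _)` produces a count per page, the kind of per-key aggregation these transformations are typically used for.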
