7/29/2019 Overview of Spark
Matei Zaharia, Mosharaf Chowdhury, Tathagata Das, Ankur Dave, Justin Ma, Murphy McCauley, Michael Franklin, Scott Shenker, Ion Stoica
Spark: Fast, Interactive, Language-Integrated Cluster Computing
UC Berkeley
www.spark-project.org

Project Goals
Extend the MapReduce model to better support two common classes of analytics apps:
- Iterative algorithms (machine learning, graphs)
- Interactive data mining
Enhance programmability:
- Integrate into Scala programming language
- Allow interactive use from Scala interpreter

Motivation
Most current cluster programming models are based on acyclic data flow from stable storage to stable storage
[Figure: Input -> Map (x3) -> Reduce (x2) -> Output]
Benefits of data flow: runtime can decide where to run tasks and can automatically recover from failures

Motivation
Acyclic data flow is inefficient for applications that repeatedly reuse a working set of data:
- Iterative algorithms (machine learning, graphs)
- Interactive data mining tools (R, Excel, Python)
With current frameworks, apps reload data from stable storage on each query

Solution: Resilient Distributed Datasets (RDDs)
- Allow apps to keep working sets in memory for efficient reuse
- Retain the attractive properties of MapReduce
  - Fault tolerance, data locality, scalability
- Support a wide range of applications

Outline
- Spark programming model
- Implementation
- Demo
- User applications

Programming Model
Resilient distributed datasets (RDDs)
- Immutable, partitioned collections of objects
- Created through parallel transformations (map, filter, groupBy, join, ...) on data in stable storage
- Can be cached for efficient reuse
Actions on RDDs
- Count, reduce, collect, save, ...
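
The split between transformations and actions can be illustrated with a small stand-in. The sketch below is not the Spark API: it uses a plain Scala Iterator on a single machine, and the names LazyPipelineSketch, evaluated, and pipeline are hypothetical. It assumes that transformations only describe a dataset and that work happens when an action asks for a result.

```scala
// Stand-in for the transformation/action split using a plain Scala Iterator.
// Not the Spark API; all names here are hypothetical.
object LazyPipelineSketch {
  // Counts how many elements the map step has actually processed.
  var evaluated = 0

  // "Transformations": describe the computation, but run nothing yet.
  def pipeline(data: Seq[Int]): Iterator[Int] =
    data.iterator
      .map { x => evaluated += 1; x * 2 }
      .filter(_ % 3 == 0)

  def main(args: Array[String]): Unit = {
    val p = pipeline(1 to 10)
    println(s"elements processed after defining the pipeline: $evaluated")
    val total = p.sum // "action": forces the computation
    println(s"sum = $total, elements processed after the action: $evaluated")
  }
}
```

Unlike an RDD, an Iterator can be consumed only once, so this models the deferred evaluation but not caching or reuse.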

Example: Log Mining
Load error messages from a log into memory, then interactively search for various patterns

    val lines = spark.textFile("hdfs://...")            // base RDD
    val errors = lines.filter(_.startsWith("ERROR"))    // transformed RDD
    val messages = errors.map(_.split('\t')(2))         // keep the third tab-separated field
    val cachedMsgs = messages.cache()

    cachedMsgs.filter(_.contains("foo")).count          // action
    cachedMsgs.filter(_.contains("bar")).count
    . . .

[Figure: the driver sends tasks to three workers and collects results; each worker reads one block (Block 1-3) and holds one cache partition (Cache 1-3)]
Result: full-text search of Wikipedia in [...]

RDD Fault Tolerance
RDDs maintain lineage information that can be used to reconstruct lost partitions
Ex:

    val messages = textFile(...).filter(_.startsWith("ERROR")).map(_.split('\t')(2))

[Figure: lineage chain HDFS File -> FilteredRDD (filter, func = _.startsWith(...)) -> MappedRDD (map, func = _.split(...))]
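
The lineage idea can be sketched as a chain of dataset descriptions, each of which knows how to rebuild one of its partitions from its parent. This is a toy model, not Spark's implementation: the names Dataset, Source, Filtered, Mapped, computePartition, and exampleMessages are hypothetical, the log lines are made up for illustration, and an in-memory Vector stands in for stable storage.

```scala
// Toy lineage model in plain Scala. Not Spark's implementation;
// every name and every sample log line below is hypothetical.
object LineageSketch {
  sealed trait Dataset[T] {
    def numPartitions: Int
    // Rebuild one partition by replaying the lineage for that partition only.
    def computePartition(i: Int): Seq[T]
  }

  // Stands in for data held in stable storage.
  final case class Source[T](partitions: Vector[Seq[T]]) extends Dataset[T] {
    def numPartitions: Int = partitions.length
    def computePartition(i: Int): Seq[T] = partitions(i)
  }

  final case class Filtered[T](parent: Dataset[T], pred: T => Boolean) extends Dataset[T] {
    def numPartitions: Int = parent.numPartitions
    def computePartition(i: Int): Seq[T] = parent.computePartition(i).filter(pred)
  }

  final case class Mapped[A, B](parent: Dataset[A], f: A => B) extends Dataset[B] {
    def numPartitions: Int = parent.numPartitions
    def computePartition(i: Int): Seq[B] = parent.computePartition(i).map(f)
  }

  // Same shape as the slide's example: source -> filter -> map.
  def exampleMessages: Dataset[String] = {
    val log = Source(Vector(
      Seq("ERROR\tdb\ttimeout", "INFO\tweb\tok"),
      Seq("ERROR\tweb\t500", "ERROR\tdb\tdisk full")))
    val errors = Filtered(log, (line: String) => line.startsWith("ERROR"))
    Mapped(errors, (line: String) => line.split('\t')(2))
  }

  def main(args: Array[String]): Unit = {
    // Pretend the in-memory copy of partition 1 was lost: only that
    // partition is recomputed, and partition 0 is never touched.
    println(exampleMessages.computePartition(1))
  }
}
```

Because each node records only its parent and a function, recovery needs no replicated copy of the derived data, only access to the source partition.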

Example: Logistic Regression
Goal: find best line separating two sets of points
[Figure: two sets of points, with a random initial line and the target line]
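
Example: Logistic Regression

    // readPoint, D, and ITERATIONS are assumed to be defined elsewhere
    val data = spark.textFile(...).map(readPoint).cache()
    var w = Vector.random(D)
    for (i <- 1 to ITERATIONS) {
      // sum the per-point gradient terms over the cached dataset
      val gradient = data.map(p =>
        (1 / (1 + exp(-p.y*(w dot p.x))) - 1) * p.y * p.x
      ).reduce(_ + _)
      w -= gradient
    }
    println("Final w: " + w)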
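
For readers without a cluster, the same update can be tried on a local collection. The following is a minimal sketch in plain Scala with no Spark dependency. It assumes labels in {-1, +1}; the names LocalLogisticRegression, Point, sampleData, train, and classify are hypothetical, and the stepSize parameter is an addition that the slide's code does not have (the slide subtracts the raw gradient).

```scala
import scala.math.exp
import scala.util.Random

// Local, single-machine sketch of the slide's gradient loop.
// All names are hypothetical; stepSize is an added assumption.
object LocalLogisticRegression {
  // A labelled point: features x and label y in {-1.0, +1.0}.
  final case class Point(x: Array[Double], y: Double)

  def dot(a: Array[Double], b: Array[Double]): Double =
    a.zip(b).map { case (u, v) => u * v }.sum

  // Per-point term from the slide: (1 / (1 + exp(-y * (w . x))) - 1) * y * x
  def gradient(data: Seq[Point], w: Array[Double]): Array[Double] =
    data
      .map { p =>
        val scale = (1.0 / (1.0 + exp(-p.y * dot(w, p.x))) - 1.0) * p.y
        p.x.map(_ * scale)
      }
      .reduce((a, b) => a.zip(b).map { case (u, v) => u + v })

  def train(data: Seq[Point], iterations: Int, stepSize: Double, seed: Long): Array[Double] = {
    val rng = new Random(seed)
    var w = Array.fill(data.head.x.length)(rng.nextDouble()) // random initial line
    for (_ <- 1 to iterations) {
      val g = gradient(data, w)
      w = w.zip(g).map { case (wi, gi) => wi - stepSize * gi }
    }
    w
  }

  def classify(w: Array[Double], x: Array[Double]): Double =
    if (dot(w, x) >= 0) 1.0 else -1.0

  // Made-up, linearly separable points used only for illustration.
  val sampleData: Seq[Point] = Seq(
    Point(Array(2.0, 1.0), 1.0), Point(Array(1.0, 2.0), 1.0), Point(Array(3.0, 3.0), 1.0),
    Point(Array(-2.0, -1.0), -1.0), Point(Array(-1.0, -2.0), -1.0), Point(Array(-3.0, -3.0), -1.0))

  def main(args: Array[String]): Unit = {
    val w = train(sampleData, iterations = 20, stepSize = 0.1, seed = 42L)
    println("Final w: " + w.mkString("(", ", ", ")"))
  }
}
```

The loop reads the whole dataset on every iteration, which is the access pattern that makes keeping the data in memory worthwhile.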

Logistic Regression Performance
[Figure: running time (s) vs. number of iterations (1, 5, 10, 20, 30) for Hadoop and Spark]
Hadoop: 127 s / iteration
Spark: first iteration 174 s, further iterations 6 s
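
The per-iteration figures above imply the following totals. This is a back-of-the-envelope sketch that assumes total time is simply the sum of the quoted per-iteration times; the results are extrapolations, not measurements from the slide, and the names IterationCost, hadoopTotal, and sparkTotal are hypothetical.

```scala
// Extrapolates total running time from the per-iteration figures on the slide.
// Assumes time adds up linearly; names are hypothetical.
object IterationCost {
  // Timings quoted on the slide, in seconds.
  val hadoopPerIteration = 127
  val sparkFirstIteration = 174
  val sparkLaterIteration = 6

  def hadoopTotal(n: Int): Int = n * hadoopPerIteration

  def sparkTotal(n: Int): Int =
    if (n <= 0) 0 else sparkFirstIteration + (n - 1) * sparkLaterIteration

  def main(args: Array[String]): Unit =
    for (n <- Seq(1, 5, 10, 20, 30))
      println(s"$n iterations: Hadoop ${hadoopTotal(n)} s, Spark ${sparkTotal(n)} s")
}
```

Under this assumption Spark is slower for a single iteration (174 s against 127 s) and pulls ahead from the second iteration on, reaching 348 s against 3810 s at 30 iterations.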

Spark Applications
- In-memory data mining on Hive data (Conviva)
- Predictive analytics (Quantifind)
- City traffic prediction (Mobile Millennium)
- Twitter spam classification (Monarch)
- Collaborative filtering via matrix factorization

Conviva GeoReport
Aggregations on many keys w/ same WHERE clause
40x gain comes from:
- Not re-reading unused columns or filtered records
- Avoiding repeated decompression
- In-memory storage of deserialized objects
[Figure: bar chart of time (hours): Spark 0.5, Hive 20]

Frameworks Built on Spark
Pregel on Spark (Bagel)
- Google message passing model for graph computation
- 200 lines of code
Hive on Spark (Shark)
- 3000 lines of code
- Compatible with Apache Hive
- ML operators in Scala

Implementation
- Runs on Apache Mesos to share resources with Hadoop & other apps
- Can read from any Hadoop input source (e.g. HDFS)
- No changes to Scala compiler
[Figure: Spark, Hadoop, and MPI running on top of Mesos across four nodes]

Spark Scheduler
- Dryad-like DAGs
- Pipelines functions within a stage
- Cache-aware work reuse & locality
- Partitioning-aware to avoid shuffles
[Figure: DAG of RDDs A-G connected by groupBy, map, union, and join, divided into Stages 1-3; marked boxes = cached data partitions]
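
One way to picture "pipelines functions within a stage" is to cut a chain of operations wherever data has to be shuffled. The sketch below is a toy model under that assumption, not Spark's scheduler: it handles a linear chain rather than a DAG, ignores caching and partitioning, and treats a shuffling operation as the start of a new stage. The names StageSketch, Op, needsShuffle, and splitIntoStages are hypothetical, and the shuffle flags in the example are assumptions.

```scala
// Toy stage splitter for a linear chain of operations.
// Not Spark's scheduler; all names and flags are hypothetical.
object StageSketch {
  final case class Op(name: String, needsShuffle: Boolean)

  // Operations that need no shuffle are pipelined into the current stage;
  // an operation that needs a shuffle starts a new stage.
  def splitIntoStages(ops: Seq[Op]): Seq[Seq[String]] =
    ops.foldLeft(Vector(Vector.empty[String])) { (stages, op) =>
      if (op.needsShuffle && stages.last.nonEmpty) stages :+ Vector(op.name)
      else stages.init :+ (stages.last :+ op.name)
    }

  def main(args: Array[String]): Unit = {
    val chain = Seq(
      Op("map", needsShuffle = false),
      Op("filter", needsShuffle = false),
      Op("groupBy", needsShuffle = true),
      Op("map", needsShuffle = false),
      Op("join", needsShuffle = true))
    splitIntoStages(chain).zipWithIndex.foreach { case (stage, i) =>
      println(s"Stage ${i + 1}: ${stage.mkString(" -> ")}")
    }
  }
}
```

In this model, marking the join as not needing a shuffle (for example because both inputs are already partitioned the same way) would fold it into the preceding stage, which is the effect the partitioning-aware bullet describes.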

Interactive Spark
Modified Scala interpreter to allow Spark to be used interactively from the command line
Required two changes:
- Modified wrapper code generation so that each line typed has references to objects for its dependencies
- Distribute generated classes over the network

Demo

Conclusion
Spark provides a simple, efficient, and powerful programming model for a wide range of apps
Download our open source release: www.spark-project.org

Related Work
DryadLINQ, FlumeJava
- Similar distributed collection API, but cannot reuse datasets efficiently across queries
Relational databases
- Lineage/provenance, logical logging, materialized views
GraphLab, Piccolo, BigTable, RAMCloud
- Fine-grained writes similar to distributed shared memory
Iterative MapReduce (e.g. Twister, HaLoop)
- Implicit data sharing for a fixed computation pattern
Caching systems (e.g. Nectar)
- Store data in files, no explicit control over what is cached

Behavior with Not Enough RAM
Iteration time (s) by % of working set in memory:
  Cache disabled   68.8
  25%              58.1
  50%              40.7
  75%              29.7
  Fully cached     11.5

Fault Recovery Results
[Figure: iteration time (s) for iterations 1-10; series: No Failure, Failure in the 6th Iteration]
Labelled iteration times (s), iterations 1-10: 119, 57, 56, 58, 58, 81, 57, 59, 57, 59

Spark Operations
Transformations (define a new RDD):
  map, filter, sample, groupByKey, reduceByKey, sortByKey, flatMap, union, join, cogroup, cross, mapValues
Actions (return a result to driver program):
  collect, reduce, count, save, lookupKey