Neutrino: Revisiting Memory Caching for Iterative Data Analytics
Erci Xu*, Mohit Saxena, Lawrence Chiu
Ohio State University*, IBM Research Almaden

Background
• Iterative analytics is rapidly gaining popularity
– Data Clustering, Log Mining, Graph Processing, Machine Learning
– Dataset is repeatedly accessed across different iterations
• In-Memory Caching best fits Iterative Analytics
– In-memory caching frameworks avoid frequent I/O with underlying storage systems
– Iterative Data Analytics could have 10x–100x speedup

Spark for In-Memory Iterative Analytics
Ref: wikibon.org

Spark RDD
RDD: Resilient Distributed Datasets
[Figure: an RDD of four partitions (partition 1 to partition 4) held in memory on Worker 1 and Worker 2, backed by four blocks (block 1 to block 4) in HDFS.]
RDD Cache Options
• Deserialized or Serialized
• On Heap or Off Heap
• In Memory or Disk
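
To make the option combinations concrete, here is a minimal Python sketch that maps two of the three axes above (deserialized or serialized, memory or disk) to the names of Spark's predefined storage levels. The lookup table and the `level_name` helper are illustrative assumptions, not part of Spark or Neutrino; the sketch does not call Spark. Off-heap storage is left out because Spark exposes it as a separate named level (`OFF_HEAP`).

```python
# Illustrative lookup only: (deserialized, use_memory, use_disk) -> name of a
# predefined Spark StorageLevel constant. This table is a model, not Spark code.
LEVELS = {
    (True, True, False): "MEMORY_ONLY",
    (False, True, False): "MEMORY_ONLY_SER",
    (True, True, True): "MEMORY_AND_DISK",
    (False, True, True): "MEMORY_AND_DISK_SER",
    (False, False, True): "DISK_ONLY",
}


def level_name(deserialized, use_memory, use_disk):
    """Return the storage level name for one combination of cache options."""
    key = (deserialized, use_memory, use_disk)
    if key not in LEVELS:
        raise ValueError(f"no predefined level for options {key}")
    return LEVELS[key]


if __name__ == "__main__":
    print(level_name(deserialized=True, use_memory=True, use_disk=False))
    print(level_name(deserialized=False, use_memory=True, use_disk=True))
```

Each RDD gets exactly one of these levels, which is the "discrete cache levels" limitation the following slides discuss.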

Memory Caching in Spark
[Figure: Spark stripes cached data across worker nodes; each node is drawn with its stripe, memory (Mem) and disk (Disk).]

Problems: In-memory Caching for Iterative Data Analytics
1. Discrete Cache Levels
2. Manual Programmer Management
3. Not Adaptive to Runtime Changes

Problem 1: Discrete Cache Levels
Serialized Cache saves 56% to 63% of the space but is relatively slower
[Figure: Size in Memory (GB), axis 0 to 140, against Data Size (GB), 10 to 50, for Serialized and Deserialized caches. Data labels: 27.7, 53.1, 79.8, 107.3, 126.1. Annotations: 56%, 63%.]

Problem 1: Discrete Cache Levels
• Deserialized Cache is an order of magnitude faster but becomes very slow once spilled to disk
[Figure: Time (sec), axis 0 to 70, against Data Size (GB), 10 to 100, for Deserialized and Serialized caches. Annotations: 97%, -71%. Cluster Memory: 100 GB.]

Problem 1: Discrete Cache Levels
[Figure: Time (sec), axis 0 to 70, against Data Size (GB), 10 to 100, for Deserialized, Serialized and Optimal. The Optimal curve is labeled "Goal for Neutrino".]

Problem 2: Manual Management
rdd_1 = sc.textFile("HDFS://file1")
rdd_2 = sc.textFile("HDFS://file2")
rdd_1.persist(Cache_Level)
rdd_2.persist(Cache_Level)
rdd_1.transformations().action()
rdd_2.transformations().action()
Questions the programmer has to answer for each RDD: dataset size? de/serialized? access order?
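
The questions on this slide can be read as a decision the programmer makes by hand for every RDD. The sketch below is a hypothetical helper (`pick_cache_level` and the sizes in the usage lines are made up for illustration, not taken from the talk) that shows the simplest version of that decision: prefer the faster deserialized level if it fits in free memory, fall back to serialized, otherwise do not cache.

```python
def pick_cache_level(deser_gb, ser_gb, free_mem_gb):
    """Pick one static cache level for a whole RDD (hypothetical rule of thumb).

    deser_gb:    estimated in-memory size when cached deserialized
    ser_gb:      estimated in-memory size when cached serialized
    free_mem_gb: memory the programmer expects to be free at run time
    """
    if deser_gb <= free_mem_gb:
        return "deserialized"  # fastest to read, largest footprint
    if ser_gb <= free_mem_gb:
        return "serialized"    # smaller, but pays deserialization on access
    return "uncached"          # would spill; leave it in HDFS


if __name__ == "__main__":
    # Illustrative numbers only.
    print(pick_cache_level(deser_gb=50.0, ser_gb=20.0, free_mem_gb=60.0))
    print(pick_cache_level(deser_gb=50.0, ser_gb=20.0, free_mem_gb=30.0))
```

The rule needs size estimates and a guess at free memory before the job runs, and it ignores access order entirely, which is why the slides argue the choice should not be left to the programmer.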

Not Adaptive to Runtime Changes
Cache levels are statically assigned to an RDD, and such programmer decisions may not adapt to:
1. Changing memory utilization on each worker node
2. Different memory requirements for an RDD partition in deserialized/serialized cache levels

Our Solution: Neutrino
• Less Manual Management
• Adaptive to Runtime Changes
• Fine-grained Cache Levels

1. Data Flow Generation
• Goal: To understand RDD access order between jobs
• Solution: Preliminary run on small workloads to extract RDD access order
• Example: K Nearest Neighbors Classification
– Classical ML classification algorithm
– 1 train dataset, 3 test datasets

KNN Example: Job Execution
[Figure: a Spark application runs Job 1, Job 2 and Job 3. In each job the Master schedules tasks and the Executors execute them: Train and Test_i (i = 1, 2, 3) are each Split and Mapped, then combined by KNN join.]

KNN Data Flow Graph
Goal: Understand the RDD access order between jobs.
[Figure: data flow graph shown for Job 1, Job 2 and Job 3. In Job 1, TrainRDD → TrainSplitRDD → TrainMappedRDD and Test_1RDD → TestSplitRDD → TestMappedRDD feed KNN join (action), which produces Result.]
RDD_seq[1] = {TrainRDD, Test_1RDD}
RDD_seq[2] = {TrainRDD, Test_2RDD}
RDD_seq[3] = {TrainRDD, Test_3RDD}
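
As a rough illustration of what the data flow generation step produces, the sketch below builds RDD_seq from a log of (job, RDD) accesses and answers two questions a scheduler would ask: how often is an RDD used, and when is it used next. The log format and the helper names (`build_rdd_seq`, `access_counts`, `next_use`) are assumptions made for this example; the slides do not describe how Neutrino records the preliminary run.

```python
from collections import Counter


def build_rdd_seq(access_log):
    """Group an ordered log of (job_id, rdd_name) pairs into job_id -> [rdd, ...]."""
    rdd_seq = {}
    for job_id, rdd in access_log:
        seen = rdd_seq.setdefault(job_id, [])
        if rdd not in seen:  # keep each RDD once per job, in first-seen order
            seen.append(rdd)
    return rdd_seq


def access_counts(rdd_seq):
    """Count in how many jobs each RDD is accessed."""
    return Counter(rdd for rdds in rdd_seq.values() for rdd in rdds)


def next_use(rdd_seq, rdd, after_job):
    """Return the first job after `after_job` that accesses `rdd`, or None."""
    later = [job for job in sorted(rdd_seq) if job > after_job and rdd in rdd_seq[job]]
    return later[0] if later else None


# Hypothetical log matching the KNN example: one train set, three test sets.
knn_log = [
    (1, "TrainRDD"), (1, "Test_1RDD"),
    (2, "TrainRDD"), (2, "Test_2RDD"),
    (3, "TrainRDD"), (3, "Test_3RDD"),
]
knn_seq = build_rdd_seq(knn_log)

if __name__ == "__main__":
    print(knn_seq)
    print(access_counts(knn_seq))
    print(next_use(knn_seq, "Test_1RDD", after_job=1))
```

In this example TrainRDD is reused by every job while each test RDD is read once, which is exactly the kind of information a cache scheduler needs before deciding what to keep.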

2. Adaptive Caching
• Goal: Fine-grained cache management at RDD partition level
• Solution: New cache level: Adaptive. It can move RDD partitions between cache levels at runtime
• Partition-level Operations: cache, discard and convert
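
Cache Operations in Spark
[Figure: state diagram with states Uncached, Deserialized and Serialized. Cache (Deserialize) moves Uncached to Deserialized, Cache (Serialize) moves Uncached to Serialized, and Discard returns either cached state to Uncached. Caching Granularity: RDD.]

Adaptive Cache Operations in Neutrino
[Figure: the same state diagram under the Adaptive cache level, with an added Convert operation between Deserialized and Serialized. Caching Granularity: Partition.]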
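
The two diagrams can be written down as small transition tables. The sketch below is an illustrative model only: the operation names are taken from the slides, but the table layout, the `apply_op` helper, and the assumption that Convert works in both directions are choices made for this example rather than details confirmed by the talk.

```python
from enum import Enum


class Level(Enum):
    UNCACHED = "uncached"
    SERIALIZED = "serialized"
    DESERIALIZED = "deserialized"


# (operation, current level) -> new level, as drawn for stock Spark.
SPARK_OPS = {
    ("cache_deserialized", Level.UNCACHED): Level.DESERIALIZED,
    ("cache_serialized", Level.UNCACHED): Level.SERIALIZED,
    ("discard", Level.DESERIALIZED): Level.UNCACHED,
    ("discard", Level.SERIALIZED): Level.UNCACHED,
}

# Neutrino adds Convert between the two cached levels
# (assumed bidirectional here).
NEUTRINO_OPS = dict(SPARK_OPS)
NEUTRINO_OPS[("convert", Level.DESERIALIZED)] = Level.SERIALIZED
NEUTRINO_OPS[("convert", Level.SERIALIZED)] = Level.DESERIALIZED


def apply_op(ops, level, op):
    """Apply one cache operation, or raise if the table does not allow it."""
    try:
        return ops[(op, level)]
    except KeyError:
        raise ValueError(f"{op!r} is not allowed from {level.value}") from None


# Partition-level granularity: each (rdd, partition) has its own level.
partitions = {("RDD_0", 1): Level.UNCACHED, ("RDD_0", 2): Level.UNCACHED}
partitions[("RDD_0", 1)] = apply_op(NEUTRINO_OPS, partitions[("RDD_0", 1)], "cache_deserialized")
partitions[("RDD_0", 1)] = apply_op(NEUTRINO_OPS, partitions[("RDD_0", 1)], "convert")

if __name__ == "__main__":
    for key, level in partitions.items():
        print(key, level.value)
```

Keeping the level per partition rather than per RDD is what lets one partition of RDD_0 sit serialized while another stays uncached.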

3. Dynamic Cache Scheduling
• Goal: Adapt to runtime changes for achieving optimal performance
• Solution: Explore cache decisions on all partitions by dynamic programming each time before scheduling
• Dynamic Programming Model
– Inputs: RDD access order, partition status
– Output: Cache decision for each partition in the next job
– Cost Model: Overall execution time
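
Execution of Dynamic Cache Scheduling
[Figure: before scheduling tasks, the Master runs DP Scheduling with the RDD Access Order and the Partition Status as inputs. The Executors execute the caching decisions and the tasks (Split, Map, KNN join), and the result is fed back as an Update.]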

Dynamic Cache Scheduling: Caching Decisions

Partition Status Table
RDD#   Part#   Node      Status
0      1       worker1   uncached
1      1       worker1   uncached
2      1       worker1   uncached
3      1       worker1   uncached

RDD Access Order
Job#   RDDs accessed
1      RDD0, RDD1
2      RDD0, RDD2
3      RDD0, RDD3

Path 1
DP Job 1, Decision_1: RDD_0_Part_1 → Deser_Cache; RDD_1_Part_1 → Deser_Cache
DP Job 2, Decision_2: RDD_0_Part_1 → Do_Nothing; RDD_1_Part_1 → Discard; RDD_2_Part_1 → Deser_Cache
DP Job 3 (Final)

Path 2
DP Job 1, Decision_1: RDD_0_Part_1 → Deser_Cache; RDD_1_Part_1 → Do_Nothing
DP Job 2, Decision_2: RDD_0_Part_1 → Do_Nothing; RDD_2_Part_1 → Do_Nothing
DP Job 3 (Final)

Return Path 2: best caching decisions

Evaluation
• Neutrino Implementation
– Extension to Apache Spark
• Methodology
– 6 nodes of 4 cores, 8 GB memory each
– Iterative machine learning workloads:
  • Classification: KNN, Logistic Regression
  • Clustering: K-Means
  • Inference: LDA
– Systems Compared:
  • Neutrino with Adaptive Caching
  • Spark with Serialized and Deserialized Caching

Scenario 1: Abundant Memory
Deserialized data size < Cluster Memory
[Figure: Relative Job Execution Time, axis 0 to 3.5, for LDA, K-Means, KNN and LogReg under Neutrino, Deserialized and Serialized caching. Annotations: 60%, 45%, -7%.]
• Neutrino deserializes all partitions and makes efficient use of unused memory
• Neutrino has extra computation overhead for dynamic scheduling and additional operations

Scenario 2: Sufficient Memory
Deserialized data size > Cluster Memory
[Figure: Relative Job Execution Time, axis 0 to 3.5, for LDA, K-Means, KNN and LogReg under Neutrino, Deserialized and Serialized caching. Annotations: 66%, 35%, 16%, 5%.]
• Serialized level has extra overhead on deserialization. Neutrino caches partially in deserialized level and partially in serialized level
• Deserialized level starts to hit disk and hence requires re-computation from HDFS
• More frequent cache misses occur for the Deserialized level
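
The "partially deserialized, partially serialized" idea can be illustrated with simple arithmetic. The sketch below is a hypothetical heuristic, not Neutrino's scheduler: it assumes equally sized partitions, first caches as many partitions as fit in serialized form, and then upgrades as many of them as possible to deserialized without exceeding memory. The function name and the numbers in the usage lines are made up for the example.

```python
def mixed_levels(n_parts, deser_gb, ser_gb, mem_gb):
    """Split partitions into (deserialized, serialized, uncached) counts.

    Heuristic: maximize the number of cached partitions first, then upgrade
    cached partitions from serialized to deserialized while memory allows.
    """
    n_cached = min(n_parts, int(mem_gb // ser_gb))
    n_deser = 0
    while (
        n_deser < n_cached
        and (n_deser + 1) * deser_gb + (n_cached - n_deser - 1) * ser_gb <= mem_gb
    ):
        n_deser += 1
    return n_deser, n_cached - n_deser, n_parts - n_cached


if __name__ == "__main__":
    # Illustrative sizes: 10 partitions, 2 GB deserialized or 1 GB serialized each.
    print(mixed_levels(10, deser_gb=2.0, ser_gb=1.0, mem_gb=20.0))
    print(mixed_levels(10, deser_gb=2.0, ser_gb=1.0, mem_gb=15.0))
    print(mixed_levels(10, deser_gb=2.0, ser_gb=1.0, mem_gb=8.0))
```

A single RDD-wide level would force either all-deserialized (spilling once memory runs out) or all-serialized (paying deserialization everywhere); a per-partition mix sits between the two.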

Scenario 3: Just Enough Memory
Serialized data size = Cluster Memory
[Figure: Relative Job Execution Time, axis 0 to 4, for LDA, K-Means, KNN and LogReg under Neutrino, Deserialized and Serialized caching. Annotations: 46%, 70%.]

Conclusions
• Discrete Cache Levels for In-Memory Caching
– Inefficient memory usage → not optimal performance
• Neutrino:
– Partition-level adaptive caching
– Data flow graph generation
– Dynamic cache scheduling
• Neutrino improves average job execution time by up to 70% over native Spark caching

Major Problems
• Cached RDD is deserialized in Spark memory - Checkpointed RDD is serialized and persistent
• Cached RDD is computed from HDFS - Checkpointed RDD is re-computed from cached RDD
• Cached RDD reading is locality aware - Static delay scheduling limits checkpointing locality

Thanks
Q&A