Spark Summit
Data-Aware Spark
Zoltán Zvara
This project has received funding from the European Union's Horizon 2020 research and innovation program under grant agreement No 688191.
Introduction
• Hungarian Academy of Sciences, Institute for Computer Science and Control (MTA SZTAKI)
• Research institute with strong industry ties
• Big Data projects using Spark, Flink, Hadoop, Couchbase, etc.
• Multiple IoT and telecommunication use cases, with challenging data volume, velocity and distribution
Agenda
• Data-skew story
• Problem definitions & goals
• Dynamic Repartitioning
  • Architecture
  • Component breakdown
  • Repartitioning mechanism
  • Benchmark results
• Tracing
• Visualization
• Conclusion
Motivation
• Applications aggregating IoT, social-network and telecommunication data that tested well on toy data
• When deployed against the real data set, the application seemed healthy
• However, it could become surprisingly slow or even crash
• What went wrong?
Our data-skew story
• We have use cases where map-side combine is not an option: groupBy, join
• Power laws (Pareto, Zipfian)
• "80% of the traffic is generated by 20% of the communication towers"
• Most of our data follows the 80–20 rule
The problem
• Using default hashing is not going to distribute the data uniformly
• Some unknown partition(s) will contain a lot of records on the reducer side
• Slow tasks will appear
• The data distribution is not known in advance
• Concept drifts are common
(Figure: even partitioning of skewed data at the stage boundary & shuffle produces slow tasks on the reducer side.)
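To make the failure mode concrete, here is a minimal plain-Python sketch of how default hash partitioning behaves on skewed keys. The tower names and Zipf-like weights are invented for illustration, and `crc32` stands in for Spark's hash function:

```python
import zlib
import random

def partition_sizes(keys, num_partitions):
    """Count how many records each partition receives under plain
    hash partitioning (crc32 stands in for Spark's hash here)."""
    sizes = [0] * num_partitions
    for k in keys:
        sizes[zlib.crc32(k.encode()) % num_partitions] += 1
    return sizes

# Zipf-like workload: tower i produces traffic proportional to 1/(i+1),
# so a handful of towers dominate -- the "80-20" pattern from the talk.
random.seed(42)
towers = [f"tower-{i}" for i in range(1000)]
weights = [1.0 / (i + 1) for i in range(1000)]
records = random.choices(towers, weights=weights, k=100_000)

sizes = partition_sizes(records, 8)
print(f"biggest partition: {max(sizes)}, average: {sum(sizes) // 8}")
```

The partition that receives the heaviest tower ends up well above the average, which is exactly the slow reducer task in the figure above.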
Ultimate goal
Spark to be a data-aware, distributed data-processing framework
F1
Generic, engine-level load balancing & data-skew mitigation
Low-level; works with any API built on top of it
F2
Completely online
Requires no offline statistics. Does not require simulation on a subset of your data.
F3
Distribution agnostic
No need to identify the underlying distribution. Does not require handling special corner cases.
F4
For the batch & the streaming guys, under one hood
Same architecture and same core logic for every workload. Support for unified workloads.
F5
System-aware
Dynamically understand system-specific trade-offs
F6
Does not need any user guidance or configuration
Out-of-the-box solution; the user does not need to play with the configuration:
spark.data-aware = true
Architecture
(Figure: a master and three slaves. 1: each slave builds an approximate local key distribution & statistics; 2: these are collected to the master; 3: the master computes the global key distribution & statistics; 4: the new hash is redistributed to the slaves.)
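The collect-and-merge step of the flow above amounts to summing the per-slave histograms into a global one. A minimal sketch, with invented tower counts:

```python
from collections import Counter

def merge_local_histograms(local_histograms):
    """Sketch of steps 2-3 above: the master sums the slaves'
    approximate local key distributions into a global histogram."""
    global_hist = Counter()
    for hist in local_histograms:
        global_hist.update(hist)
    return global_hist

# Two slaves report their approximate local distributions.
slave_a = Counter({"tower-1": 80, "tower-2": 10})
slave_b = Counter({"tower-1": 70, "tower-3": 15})
merged = merge_local_histograms([slave_a, slave_b])
print(merged)
```

The master then builds the new hash function from this global histogram (step 4).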
Driver perspective

• RepartitioningTrackerMaster is part of SparkEnv
• Listens to job & stage submissions
• Holds a variety of repartitioning strategies for each job & stage
• Decides when & how to (re)partition
• Collects feedback
Executor perspective

• RepartitioningTrackerWorker is part of SparkEnv
• Duties:
  • Stores ScannerStrategies (Scanner included) received from the RTM
  • Instantiates and binds Scanners to TaskMetrics (where data characteristics are collected)
  • Defines an interface for Scanners to send DataCharacteristics back to the RTM
Scalable sampling
• Key distributions are approximated with a strategy that is
  • not sensitive to early or late concept drifts,
  • lightweight and efficient,
  • scalable by using a backoff strategy
(Figure: key-frequency histograms at successive sampling rates s_i, s_i/b and s_{i+1}/b. When the number of counters reaches the threshold T_B, the histogram is truncated and the sampling rate is divided by the backoff factor b; kept counters are increased by s_i.)
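A rough sketch of the backoff idea in plain Python. The parameter names and the keep-the-heaviest-half truncation policy are assumptions for illustration, not the exact Spark implementation:

```python
import random
from collections import Counter

class BackoffSampler:
    """Approximate a key distribution with a bounded number of counters.
    When the counter map outgrows `max_counters`, rare keys are truncated
    and the sampling rate is divided by the backoff factor `b`, keeping
    the sampler lightweight on heavy-tailed streams."""

    def __init__(self, initial_rate=1.0, b=2.0, max_counters=64):
        self.rate = initial_rate
        self.b = b
        self.max_counters = max_counters
        self.counters = Counter()

    def observe(self, key):
        if random.random() < self.rate:
            self.counters[key] += 1
            if len(self.counters) > self.max_counters:
                self._back_off()

    def _back_off(self):
        # Keep only the heaviest keys and sample more sparsely from
        # now on, so the histogram stays cheap on heavy-tailed data.
        keep = self.max_counters // 2
        self.counters = Counter(dict(self.counters.most_common(keep)))
        self.rate /= self.b

random.seed(7)
sampler = BackoffSampler()
keys = [f"k{i}" for i in range(1000)]
weights = [1.0 / (i + 1) for i in range(1000)]
for key in random.choices(keys, weights=weights, k=50_000):
    sampler.observe(key)
print(f"rate: {sampler.rate}, counters kept: {len(sampler.counters)}")
```

Heavy keys survive every truncation, so the dominant part of the distribution is preserved even as the sampling rate backs off.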
Complexity-sensitive sampling
• Reducer run-time can highly correlate with the computational complexity of the values for a given key
• Calculating object size is costly in the JVM and in some cases shows little correlation to computational complexity (the function on the next stage)
• Solution:
  • If the user object is Weightable, use complexity-sensitive sampling
  • When increasing the counter for a specific value, consider its complexity
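A minimal sketch of the idea. `Weightable` is named in the talk; the `TextBlob` example value and its cost model (proportional to text length) are illustrative assumptions:

```python
class Weightable:
    """Marker interface (name from the talk): values expose a complexity
    weight so sampling can count work, not just records."""
    def complexity(self):
        raise NotImplementedError

class TextBlob(Weightable):
    """Illustrative value type whose processing cost we assume to be
    proportional to its length."""
    def __init__(self, text):
        self.text = text
    def complexity(self):
        return len(self.text)

def add_sample(counters, key, value):
    # Complexity-sensitive counting: weight the increment by the value's
    # complexity when it is Weightable, otherwise count the record as 1.
    w = value.complexity() if isinstance(value, Weightable) else 1
    counters[key] = counters.get(key, 0) + w

counters = {}
add_sample(counters, "tag-a", TextBlob("a long lyrics payload"))
add_sample(counters, "tag-a", 42)  # non-Weightable value counts as 1
print(counters)
```

This way a key with few but expensive values can still register as "heavy", which plain record counting would miss.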
Scalable sampling in numbers
• Optimized with micro-benchmarking
• Main factors:
  • Sampling strategy: aggressiveness (initial sampling ratio, back-off factor, etc.)
  • Complexity of the current stage (mapper)
• Current stage's runtime:
  • When used throughout the execution of the whole stage, it adds 5–15% to runtime
  • After repartitioning, we cut out the sampler's code path; in practice, it adds 0.2–1.5% to runtime
Decision to repartition

• The RepartitioningTrackerMaster can use different decision strategies, based on:
  • the number of local histograms needed,
  • the global histogram's distance from the uniform distribution,
  • preferences in the construction of the new hash function.
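Two of the strategies above can be sketched as follows. The distance metric (total variation) and the threshold value are illustrative choices, not the exact implementation:

```python
def distance_from_uniform(histogram):
    """Total-variation distance of the normalized key histogram from
    the uniform distribution over the observed keys."""
    total = sum(histogram.values())
    n = len(histogram)
    return 0.5 * sum(abs(c / total - 1.0 / n) for c in histogram.values())

def should_repartition(histogram, num_local, min_local=4, threshold=0.2):
    # Wait until enough local histograms have arrived, then repartition
    # when the global histogram is far enough from uniform.
    return num_local >= min_local and distance_from_uniform(histogram) > threshold

print(should_repartition({"a": 80, "b": 10, "c": 10}, num_local=4))  # skewed
print(should_repartition({"a": 10, "b": 10, "c": 10}, num_local=4))  # uniform
```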
Construction of the new hash function
(Figure: the new hash function is built from three cuts against a target level l: cut_1, a single-key cut, where the top prominent keys are hashed to dedicated partitions; cut_2, a uniform distribution; and cut_3, a probabilistic cut, where a key is hashed to a partition with probability proportional to the remaining space to reach l.)
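A simplified sketch of the single-key cut. The partitioner name comes from the talk, but the construction here is reduced to its core idea; the probabilistic cut is omitted:

```python
import zlib

class KeyIsolatorPartitioner:
    """Sketch: the top prominent keys get dedicated partitions (the
    single-key cut); everything else is hashed over the remaining
    partition space. The real construction also uses a probabilistic
    cut, which this sketch leaves out."""

    def __init__(self, num_partitions, heavy_keys):
        self.num_partitions = num_partitions
        # Isolate at most num_partitions - 1 heavy keys.
        self.isolated = {k: i for i, k in enumerate(heavy_keys[:num_partitions - 1])}
        self.first_free = len(self.isolated)

    def get_partition(self, key):
        if key in self.isolated:
            return self.isolated[key]
        # Remaining keys share the rest of the partition space.
        span = self.num_partitions - self.first_free
        return self.first_free + zlib.crc32(str(key).encode()) % span

kip = KeyIsolatorPartitioner(8, heavy_keys=["tower-1", "tower-2"])
print(kip.get_partition("tower-1"), kip.get_partition("some-other-key"))
```

Isolating the heavy keys caps the size of the biggest partition at roughly the count of the single heaviest key, which is the best any partitioner can do without splitting a key.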
New hash function in numbers

• More complex than a hashCode
• We need to evaluate it for every record
• Micro-benchmark (for example with String keys):
  • Number of partitions: 512
  • HashPartitioner: average time to hash a record is 90.033 ns
  • KeyIsolatorPartitioner: average time to hash a record is 121.933 ns
• In practice it adds negligible overhead, under 1%
Repartitioning
• Reorganize the previous naive hashing, or buffer the data before hashing
• Usually happens in-memory
• In practice, adds an additional 1–8% overhead (usually at the lower end), based on:
  • the complexity of the mapper,
  • the length of the scanning process.
• For streaming: update the hash in the DStream graph
Batch groupBy
MusicTimeseries: groupBy on tags from a listenings stream
(Figure: size of the biggest partition vs. number of partitions, naive vs. Dynamic Repartitioning. Dynamic Repartitioning achieves a 38% reduction in the size of the biggest partition, and a 64% reduction compared to over-partitioning, which also suffers over-partitioning overhead & blind scheduling; the observed map-side overhead is 3%.)
Batch join

• Joining tracks with tags
• The tracks dataset is skewed
• Number of partitions set to 33
• Naive:
  • size of the biggest partition = 4.14M
  • reducer's stage runtime = 251 seconds
• Dynamic Repartitioning:
  • size of the biggest partition = 2.01M
  • reducer's stage runtime = 124 seconds
  • heavy map, only 0.9% overhead
Tracing
Goal: capture and record a data-point's lifecycle throughout the whole execution
We rewrote Spark's core to handle wrapped data-points.
(Figure: a Wrapper data-point of type T holds a payload : T and an apply method that lifts functions f : T → U, f : T → Boolean, f : T → Iterable[U], … onto the wrapped payload.)
Traceables in Spark
Each record is a Traceable object (a Wrapper implementation).
Apply the function on the payload and report/build the trace.
spark.wrapper.class = com.ericsson.ark.spark.Traceable
(Figure: a Traceable data-point of type T extends the Wrapper with a trace : Trace field; apply runs the lifted functions on the payload while building the trace.)
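The wrapper idea above can be sketched in a few lines. This is a Python stand-in for the Scala implementation; field and method names are illustrative:

```python
class Traceable:
    """Sketch of the wrapped data-point: the payload and its trace
    travel together, and `apply` lifts a user function onto the
    payload while appending the operator it passed through."""

    def __init__(self, payload, trace=()):
        self.payload = payload
        self.trace = trace

    def apply(self, f, op_name):
        # Run the user's function on the payload; record the operator.
        return Traceable(f(self.payload), self.trace + (op_name,))

t = Traceable("some record").apply(len, "map").apply(lambda n: n * 2, "map2")
print(t.payload, t.trace)
```

Because every operator goes through `apply`, the engine can reconstruct each record's path through the dataflow without changing user code.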
Tracing implementation
Capture trace information (deep profiling):
• by reporting to an external service;
• by piggybacking aggregated trace routes on data-points.
(Figure: a trace graph whose vertices are operators T_i and whose edges carry record counts.)
• T_i can be any Spark operator
• we attach useful metrics to edges and vertices (current load, time spent at a task)
F7
Provide data-characteristics to monitor & debug applications
Users can sneak a peek into the dataflow. Users can debug and monitor applications more easily.
• New metrics are available through the REST API
• Added new queries to the REST API, for example: "what happened in the last 3 seconds?"
• BlockFetches are collected into ShuffleReadMetrics
Execution visualization of Spark jobs
Repartitioning in the visualization
(Figure: before repartitioning, heavy keys create heavy partitions and slow tasks; after repartitioning, the heavy key is alone and the size of the biggest partition is minimized.)
Conclusion
• Our Dynamic Repartitioning can handle data skew dynamically, on-the-fly, on any workload and arbitrary key distributions
• With very little overhead, data skew can be handled in a natural & general way
• Tracing can help us improve the co-location of related services
• Visualizations can aid developers in better understanding issues and bottlenecks of certain workloads
• Making Spark data-aware pays off
Thank you for your attention
Zoltán Zvara
[email protected]