Spark Summit
Data-Aware Spark
Zoltán Zvara
This project has received funding from the European Union's Horizon 2020 research and innovation program under grant agreement No 688191.
Introduction
• Hungarian Academy of Sciences, Institute for Computer Science and Control (MTA SZTAKI)
• Research institute with strong industry ties
• Big Data projects using Spark, Flink, Hadoop, Couchbase, etc.
• Multiple IoT and telecommunication use cases, with challenging data volume, velocity and distribution
Agenda
• Data-skew story
• Problem definitions & goals
• Dynamic Repartitioning
  • Architecture
  • Component breakdown
  • Repartitioning mechanism
  • Benchmark results
• Tracing
• Visualization
• Conclusion
Motivation
• Applications aggregating IoT, social-network and telecommunication data that tested well on toy data
• When deployed against the real data set, the application seemed healthy
• However, it could become surprisingly slow or even crash
• What went wrong?
Our data-skew story
• We have use cases where map-side combine is not an option: groupBy, join
• Power laws (Pareto, Zipfian)
• "80% of the traffic is generated by 20% of the communication towers"
• Most of our data follows the 80–20 rule
The problem
• Using default hashing is not going to distribute the data uniformly
• Some unknown partition(s) will contain a lot of records on the reducer side
• Slow tasks will appear
• The data distribution is not known in advance
• Concept drifts are common
(Figure: even partitioning of skewed data at the stage boundary & shuffle produces slow tasks on the reducer side.)
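To make the failure mode concrete, here is a minimal plain-Python sketch of how default hash partitioning behaves on skewed keys. The tower names and Zipf-like weights are invented for illustration, and `crc32` stands in for Spark's hash function:

```python
import zlib
import random

def partition_sizes(keys, num_partitions):
    """Count how many records each partition receives under plain
    hash partitioning (crc32 stands in for Spark's hash here)."""
    sizes = [0] * num_partitions
    for k in keys:
        sizes[zlib.crc32(k.encode()) % num_partitions] += 1
    return sizes

# Zipf-like workload: tower i produces traffic proportional to 1/(i+1),
# so a handful of towers dominate -- the "80-20" pattern from the talk.
random.seed(42)
towers = [f"tower-{i}" for i in range(1000)]
weights = [1.0 / (i + 1) for i in range(1000)]
records = random.choices(towers, weights=weights, k=100_000)

sizes = partition_sizes(records, 8)
print(f"biggest partition: {max(sizes)}, average: {sum(sizes) // 8}")
```

The partition that receives the heaviest tower ends up well above the average, which is exactly the slow reducer task in the figure above.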
Ultimate goal
Spark to be a data-aware, distributed data-processing framework
F1
Generic, engine-level load balancing & data-skew mitigation
Low-level; works with any API built on top of it
F2
Completely online
Requires no offline statistics. Does not require simulation on a subset of your data.
F3
Distribution agnostic
No need to identify the underlying distribution. Does not require handling special corner cases.
F4
For the batch & the streaming guys, under one hood
Same architecture and same core logic for every workload. Support for unified workloads.
F5
System-aware
Dynamically understand system-specific trade-offs
F6
Does not need any user guidance or configuration
Out-of-the-box solution; the user does not need to play with the configuration:
spark.data-aware = true
Architecture
(Figure: a master and three slaves. 1: each slave builds an approximate local key distribution & statistics; 2: these are collected to the master; 3: the master computes the global key distribution & statistics; 4: the new hash is redistributed to the slaves.)
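The collect-and-merge step of the flow above amounts to summing the per-slave histograms into a global one. A minimal sketch, with invented tower counts:

```python
from collections import Counter

def merge_local_histograms(local_histograms):
    """Sketch of steps 2-3 above: the master sums the slaves'
    approximate local key distributions into a global histogram."""
    global_hist = Counter()
    for hist in local_histograms:
        global_hist.update(hist)
    return global_hist

# Two slaves report their approximate local distributions.
slave_a = Counter({"tower-1": 80, "tower-2": 10})
slave_b = Counter({"tower-1": 70, "tower-3": 15})
merged = merge_local_histograms([slave_a, slave_b])
print(merged)
```

The master then builds the new hash function from this global histogram (step 4).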
Driver perspective

• RepartitioningTrackerMaster is part of SparkEnv
• Listens to job & stage submissions
• Holds a variety of repartitioning strategies for each job & stage
• Decides when & how to (re)partition
• Collects feedback
Executor perspective

• RepartitioningTrackerWorker is part of SparkEnv
• Duties:
  • Stores ScannerStrategies (Scanner included) received from the RTM
  • Instantiates and binds Scanners to TaskMetrics (where data characteristics are collected)
  • Defines an interface for Scanners to send DataCharacteristics back to the RTM
Scalable sampling
• Key distributions are approximated with a strategy that is
  • not sensitive to early or late concept drifts,
  • lightweight and efficient,
  • scalable by using a backoff strategy
(Figure: key-frequency histograms at successive sampling rates s_i, s_i/b and s_{i+1}/b. When the number of counters reaches the threshold T_B, the histogram is truncated and the sampling rate is divided by the backoff factor b; kept counters are increased by s_i.)
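A rough sketch of the backoff idea in plain Python. The parameter names and the keep-the-heaviest-half truncation policy are assumptions for illustration, not the exact Spark implementation:

```python
import random
from collections import Counter

class BackoffSampler:
    """Approximate a key distribution with a bounded number of counters.
    When the counter map outgrows `max_counters`, rare keys are truncated
    and the sampling rate is divided by the backoff factor `b`, keeping
    the sampler lightweight on heavy-tailed streams."""

    def __init__(self, initial_rate=1.0, b=2.0, max_counters=64):
        self.rate = initial_rate
        self.b = b
        self.max_counters = max_counters
        self.counters = Counter()

    def observe(self, key):
        if random.random() < self.rate:
            self.counters[key] += 1
            if len(self.counters) > self.max_counters:
                self._back_off()

    def _back_off(self):
        # Keep only the heaviest keys and sample more sparsely from
        # now on, so the histogram stays cheap on heavy-tailed data.
        keep = self.max_counters // 2
        self.counters = Counter(dict(self.counters.most_common(keep)))
        self.rate /= self.b

random.seed(7)
sampler = BackoffSampler()
keys = [f"k{i}" for i in range(1000)]
weights = [1.0 / (i + 1) for i in range(1000)]
for key in random.choices(keys, weights=weights, k=50_000):
    sampler.observe(key)
print(f"rate: {sampler.rate}, counters kept: {len(sampler.counters)}")
```

Heavy keys survive every truncation, so the dominant part of the distribution is preserved even as the sampling rate backs off.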
Complexity-sensitive sampling
• Reducer run-time can highly correlate with the computational complexity of the values for a given key
• Calculating object size is costly in the JVM and in some cases shows little correlation to computational complexity (the function on the next stage)
• Solution:
  • If the user object is Weightable, use complexity-sensitive sampling
  • When increasing the counter for a specific value, consider its complexity
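A minimal sketch of the idea. `Weightable` is named in the talk; the `TextBlob` example value and its cost model (proportional to text length) are illustrative assumptions:

```python
class Weightable:
    """Marker interface (name from the talk): values expose a complexity
    weight so sampling can count work, not just records."""
    def complexity(self):
        raise NotImplementedError

class TextBlob(Weightable):
    """Illustrative value type whose processing cost we assume to be
    proportional to its length."""
    def __init__(self, text):
        self.text = text
    def complexity(self):
        return len(self.text)

def add_sample(counters, key, value):
    # Complexity-sensitive counting: weight the increment by the value's
    # complexity when it is Weightable, otherwise count the record as 1.
    w = value.complexity() if isinstance(value, Weightable) else 1
    counters[key] = counters.get(key, 0) + w

counters = {}
add_sample(counters, "tag-a", TextBlob("a long lyrics payload"))
add_sample(counters, "tag-a", 42)  # non-Weightable value counts as 1
print(counters)
```

This way a key with few but expensive values can still register as "heavy", which plain record counting would miss.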
Scalable sampling in numbers
• Optimized with micro-benchmarking
• Main factors:
  • Sampling strategy: aggressiveness (initial sampling ratio, back-off factor, etc.)
  • Complexity of the current stage (mapper)
• Current stage's runtime:
  • When used throughout the execution of the whole stage, it adds 5–15% to runtime
  • After repartitioning, we cut out the sampler's code path; in practice, it adds 0.2–1.5% to runtime
Decision to repartition

• The RepartitioningTrackerMaster can use different decision strategies, based on:
  • the number of local histograms needed,
  • the global histogram's distance from the uniform distribution,
  • preferences in the construction of the new hash function.
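Two of the strategies above can be sketched as follows. The distance metric (total variation) and the threshold value are illustrative choices, not the exact implementation:

```python
def distance_from_uniform(histogram):
    """Total-variation distance of the normalized key histogram from
    the uniform distribution over the observed keys."""
    total = sum(histogram.values())
    n = len(histogram)
    return 0.5 * sum(abs(c / total - 1.0 / n) for c in histogram.values())

def should_repartition(histogram, num_local, min_local=4, threshold=0.2):
    # Wait until enough local histograms have arrived, then repartition
    # when the global histogram is far enough from uniform.
    return num_local >= min_local and distance_from_uniform(histogram) > threshold

print(should_repartition({"a": 80, "b": 10, "c": 10}, num_local=4))  # skewed
print(should_repartition({"a": 10, "b": 10, "c": 10}, num_local=4))  # uniform
```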
Construction of the new hash function
(Figure: the new hash function is built from three cuts against a target level l: cut_1, a single-key cut, where the top prominent keys are hashed to dedicated partitions; cut_2, a uniform distribution; and cut_3, a probabilistic cut, where a key is hashed to a partition with probability proportional to the remaining space to reach l.)
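A simplified sketch of the single-key cut. The partitioner name comes from the talk, but the construction here is reduced to its core idea; the probabilistic cut is omitted:

```python
import zlib

class KeyIsolatorPartitioner:
    """Sketch: the top prominent keys get dedicated partitions (the
    single-key cut); everything else is hashed over the remaining
    partition space. The real construction also uses a probabilistic
    cut, which this sketch leaves out."""

    def __init__(self, num_partitions, heavy_keys):
        self.num_partitions = num_partitions
        # Isolate at most num_partitions - 1 heavy keys.
        self.isolated = {k: i for i, k in enumerate(heavy_keys[:num_partitions - 1])}
        self.first_free = len(self.isolated)

    def get_partition(self, key):
        if key in self.isolated:
            return self.isolated[key]
        # Remaining keys share the rest of the partition space.
        span = self.num_partitions - self.first_free
        return self.first_free + zlib.crc32(str(key).encode()) % span

kip = KeyIsolatorPartitioner(8, heavy_keys=["tower-1", "tower-2"])
print(kip.get_partition("tower-1"), kip.get_partition("some-other-key"))
```

Isolating the heavy keys caps the size of the biggest partition at roughly the count of the single heaviest key, which is the best any partitioner can do without splitting a key.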
New hash function in numbers

• More complex than a hashCode
• We need to evaluate it for every record
• Micro-benchmark (for example with String keys):
  • Number of partitions: 512
  • HashPartitioner: average time to hash a record is 90.033 ns
  • KeyIsolatorPartitioner: average time to hash a record is 121.933 ns
• In practice it adds negligible overhead, under 1%
Repartitioning
• Reorganize the previous naive hashing, or buffer the data before hashing
• Usually happens in-memory
• In practice, adds an additional 1–8% overhead (usually at the lower end), based on:
  • the complexity of the mapper,
  • the length of the scanning process.
• For streaming: update the hash in the DStream graph
Batch groupBy
MusicTimeseries: groupBy on tags from a listenings stream
(Figure: size of the biggest partition vs. number of partitions, naive vs. Dynamic Repartitioning. Dynamic Repartitioning achieves a 38% reduction in the size of the biggest partition, and a 64% reduction compared to over-partitioning, which also suffers over-partitioning overhead & blind scheduling; the observed map-side overhead is 3%.)
Batch join

• Joining tracks with tags
• The tracks dataset is skewed
• Number of partitions set to 33
• Naive:
  • size of the biggest partition = 4.14M
  • reducer's stage runtime = 251 seconds
• Dynamic Repartitioning:
  • size of the biggest partition = 2.01M
  • reducer's stage runtime = 124 seconds
  • heavy map, only 0.9% overhead
Tracing
Goal: capture and record a data-point's lifecycle throughout the whole execution
We rewrote Spark's core to handle wrapped data-points.
(Figure: a Wrapper data-point of type T holds a payload : T and an apply method that lifts functions f : T → U, f : T → Boolean, f : T → Iterable[U], … onto the wrapped payload.)
Traceables in Spark
Each record is a Traceable object (a Wrapper implementation).
Apply the function on the payload and report/build the trace.
spark.wrapper.class = com.ericsson.ark.spark.Traceable
(Figure: a Traceable data-point of type T extends the Wrapper with a trace : Trace field; apply runs the lifted functions on the payload while building the trace.)
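The wrapper idea above can be sketched in a few lines. This is a Python stand-in for the Scala implementation; field and method names are illustrative:

```python
class Traceable:
    """Sketch of the wrapped data-point: the payload and its trace
    travel together, and `apply` lifts a user function onto the
    payload while appending the operator it passed through."""

    def __init__(self, payload, trace=()):
        self.payload = payload
        self.trace = trace

    def apply(self, f, op_name):
        # Run the user's function on the payload; record the operator.
        return Traceable(f(self.payload), self.trace + (op_name,))

t = Traceable("some record").apply(len, "map").apply(lambda n: n * 2, "map2")
print(t.payload, t.trace)
```

Because every operator goes through `apply`, the engine can reconstruct each record's path through the dataflow without changing user code.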
Tracing implementation
Capture trace information (deep profiling):
• by reporting to an external service;
• by piggybacking aggregated trace routes on data-points.
(Figure: a trace graph whose vertices are operators T_i and whose edges carry record counts.)
• T_i can be any Spark operator
• we attach useful metrics to edges and vertices (current load, time spent at a task)
F7
Provide data-characteristics to monitor & debug applications
Users can sneak a peek into the dataflow. Users can debug and monitor applications more easily.
• New metrics are available through the REST API
• Added new queries to the REST API, for example: "what happened in the last 3 seconds?"
• BlockFetches are collected into ShuffleReadMetrics
Execution visualization of Spark jobs
Repartitioning in the visualization
(Figure: before repartitioning, heavy keys create heavy partitions and slow tasks; after repartitioning, the heavy key is alone and the size of the biggest partition is minimized.)
Conclusion
• Our Dynamic Repartitioning can handle data skew dynamically, on-the-fly, on any workload and arbitrary key distributions
• With very little overhead, data skew can be handled in a natural & general way
• Tracing can help us improve the co-location of related services
• Visualizations can aid developers in better understanding issues and bottlenecks of certain workloads
• Making Spark data-aware pays off
Thank you for your attention
Zoltán Zvara
[email protected]