24
Automated and Interactive Debugging of Big Data Analytics Muhammad Ali Gulzar University of California, Los Angeles 1

Automated and Interactive Debugging of Big Data Analyticsweb.cs.ucla.edu/~gulzar/assets/pdf/SRC-icse18.pdf · Our Insights for Interactive DISC Debugging • We do not pause program

  • Upload
    others

  • View
    7

  • Download
    0

Embed Size (px)

Citation preview

Page 1: Automated and Interactive Debugging of Big Data Analyticsweb.cs.ucla.edu/~gulzar/assets/pdf/SRC-icse18.pdf · Our Insights for Interactive DISC Debugging • We do not pause program

AutomatedandInteractiveDebuggingofBigDataAnalytics

MuhammadAliGulzarUniversityofCalifornia,LosAngeles

1

Page 2: Automated and Interactive Debugging of Big Data Analyticsweb.cs.ucla.edu/~gulzar/assets/pdf/SRC-icse18.pdf · Our Insights for Interactive DISC Debugging • We do not pause program

2

Developlocally Hopeitworks Runincloud Bug!

Guesswork

BigDataDebuggingintheDark

Map Reduce

1 2 3 4

5

Page 3: Automated and Interactive Debugging of Big Data Analyticsweb.cs.ucla.edu/~gulzar/assets/pdf/SRC-icse18.pdf · Our Insights for Interactive DISC Debugging • We do not pause program

3

• InteractiveDebuggingPrimitivesforBigDataProcessinginSparkICSE’16

• AutomatedDebugginginDataIntensiveScalableComputingSystemsSoCC ’17

• White-boxTestingofDataIntensiveScalableComputingApplicationswithuserdefinedfunctionsOngoing

Page 4: Automated and Interactive Debugging of Big Data Analyticsweb.cs.ucla.edu/~gulzar/assets/pdf/SRC-icse18.pdf · Our Insights for Interactive DISC Debugging • We do not pause program

WhyTraditionalInteractiveDebuggingisHardforApacheSpark?

Enablinginteractivedebuggingrequiresusto re-thinkthefeaturesoftraditionaldebuggersuchasGDB

• Pausingtheentirecomputationonthecloudcouldreducethroughput

• Itisclearlyinfeasibleforausertoinspectbillionofrecordsthrougharegularwatchpoint

• EvenlaunchingremoteJVMdebuggerstoindividualworkernodescannotscaleforbigdatacomputing

Page 5: Automated and Interactive Debugging of Big Data Analyticsweb.cs.ucla.edu/~gulzar/assets/pdf/SRC-icse18.pdf · Our Insights for Interactive DISC Debugging • We do not pause program

InteractiveDISCDebugPrimitives[ICSE‘16,FSE’16Demo,SIGMOD’17Demo]

5

4.BackwardandForwardTracing

1.SimulatedBreakpoint 2.OnDemandGuardedWatchpoint

3.CrashCulpritIdentification

Page 6: Automated and Interactive Debugging of Big Data Analyticsweb.cs.ucla.edu/~gulzar/assets/pdf/SRC-icse18.pdf · Our Insights for Interactive DISC Debugging • We do not pause program

OurInsightsforInteractiveDISCDebugging

• Wedonotpauseprogramexecution butsimulateabreakpointthroughon-demandstateregeneration

• Wedeliverselectedprogramstatestoauserinastreamingprocessingfashion.

• Were-architecttheunderlyingbigdatasystemruntimewithnativein-memorydataprovenancesupport

6

Page 7: Automated and Interactive Debugging of Big Data Analyticsweb.cs.ucla.edu/~gulzar/assets/pdf/SRC-icse18.pdf · Our Insights for Interactive DISC Debugging • We do not pause program

Whatistheperformanceoverheadofdebuggingprimitives?

Program Datasetsize(GB)

Max Maxw/oLatencyAlert

Watchpoint CrashCulprit

Tracing

WordCount 0.5 - 1000 2.5X 1.34X 1.09X 1.18X 1.22X

Grep 20- 90 1.76X 1.07X 1.05X 1.04X 1.05X

PigMix-L1 1- 200 1.38X 1.29X 1.03X 1.19X 1.24X

BigDebug posesatmost2.5Xoverheadwiththemaximuminstrumentationsetting.

Max:AllthefeaturesofBigDebug areenabled

7

Page 8: Automated and Interactive Debugging of Big Data Analyticsweb.cs.ucla.edu/~gulzar/assets/pdf/SRC-icse18.pdf · Our Insights for Interactive DISC Debugging • We do not pause program

8

• InteractiveDebuggingPrimitivesforBigDataProcessinginSparkICSE’16

• AutomatedDebugginginDataIntensiveScalableComputingSystemsSoCC ’17

• White-boxTestingofDataIntensiveScalableComputingApplicationswithuserdefinedfunctionsOngoing

Page 9: Automated and Interactive Debugging of Big Data Analyticsweb.cs.ucla.edu/~gulzar/assets/pdf/SRC-icse18.pdf · Our Insights for Interactive DISC Debugging • We do not pause program

MotivatingExample

• AlicewritesaSparkprogramthatidentifies,foreachstateintheUS,thedeltabetweentheminimumandthemaximumsnowfallreadingforeachdayofanyyearandforanyparticularyear.

• Aninputdatarecordthatmeasures1footofsnowfallonJanuary1stofYear1992,inthe99504zipcode(Anchorage,AK)area,appearsas

99504 ,01/01/1992,1ft

Page 10: Automated and Interactive Debugging of Big Data Analyticsweb.cs.ucla.edu/~gulzar/assets/pdf/SRC-icse18.pdf · Our Insights for Interactive DISC Debugging • We do not pause program

ProblemDefinition

10

99504,01/01/1992,1ft99504,03/01/1992,0.1ft99504,01/01/1993, 70in99504,03/01/1993,145mm99504,01/01/1994 ,245mm99504,01/01/1993 ,85mm90031,02/01/1991 ,0mm

AK, 01/01 ,[304.8,21336,245,85]AK, 03/01 ,[30.5,145]AK, 1992 ,[304.8,30.5]AK, 1993 ,[21336,145, 85]AK, 1994 ,[245]CA, 02/01 ,[0]CA, 1991 ,[0]

TextFile FlatMap GroupByKey Map Output

AK ,01/01,304.8AK ,1992 , 304.8AK ,03/01 ,30.5AK ,1992 ,30.5AK ,01/01 ,21336AK ,1993 , 21336AK ,03/01 ,145AK ,1993 ,145AK ,01/01 ,245AK ,1994 ,245

…. ….

AK ,01/01,21251AK ,03/01,114.5AK ,1992 ,274.3AK ,1993 ,21251AK ,1994 ,0CA ,02/01,0CA ,1991 ,0

Givenatestfunction,thegoalistoidentifyaminimumsubsetoftheinputthatisabletoreproducethesametestfailure.

def test(key:String, delta: Float) : Boolean = {delta < 6000

}

• Usingatestfunction,ausercanspecifyincorrectresults

Page 11: Automated and Interactive Debugging of Big Data Analyticsweb.cs.ucla.edu/~gulzar/assets/pdf/SRC-icse18.pdf · Our Insights for Interactive DISC Debugging • We do not pause program

11

99504,01/01/1992,1ft99504,03/01/1992,0.1ft99504,01/01/1993, 70in99504,03/01/1993,145mm99504,01/01/1994 ,245mm99504,01/01/1993 ,85mm90031,02/01/1991 ,0mm

AK, 01/01 ,[304.8,21336,245,85]AK, 03/01 ,[30.5,145]AK, 1992 ,[304.8,30.5]AK, 1993 ,[21336,145, 85]AK, 1994 ,[245]CA, 02/01 ,[0]CA, 1991 ,[0]

TextFile FlatMap GroupByKey Map Output

AK ,01/01,304.8AK ,1992 , 304.8AK ,03/01 ,30.5AK ,1992 ,30.5AK ,01/01 ,21336AK ,1993 , 21336AK ,03/01 ,145AK ,1993 ,145AK ,01/01 ,245AK ,1994 ,245

…. ….

AK ,01/01,21251AK ,03/01,114.5AK ,1992 ,274.3AK ,1993 ,21251AK ,1994 ,0CA ,02/01,0CA ,1991 ,0

ExistingApproach1:DataProvenanceforSpark[VLDB2015]

Itover-approximatesthescopeoffailure-inducinginputsi.e.recordsinthefaultykey-groupareallmarkedasfaulty

Page 12: Automated and Interactive Debugging of Big Data Analyticsweb.cs.ucla.edu/~gulzar/assets/pdf/SRC-icse18.pdf · Our Insights for Interactive DISC Debugging • We do not pause program

ExistingApproach2:DeltaDebugging[Zeller1999]• DeltaDebuggingperformsasystematicbinarysearch-like

procedureontheinputdatasetusingatestoraclefunction

12

99504,01/01/1992,1ft99504,03/01/1992,0.1ft99504,01/01/1993, 70in99504,03/01/1993,145mm99504,01/01/1994,245mm99504,01/01/1993,85mm90031,02/01/1991,0mm

AK,01/01,304.8AK,1992 , 304.8AK,03/01 ,30.5AK,1992 ,30.5AK,01/01 ,21336AK,1993 , 21336AK,03/01 ,145AK,1993 ,145AK,01/01 ,245AK,1994 ,245

…. ….

AK ,01/01,[304.8,21336,245,85]AK ,03/01 ,[30.5,145]AK ,1992 ,[304.8,30.5]AK ,1993 ,[21336,145, 85]AK ,1994 ,[245]CA,02/01 ,[0]CA,1991 ,[0]

AK , 01/01 ,21251AK , 03/01 ,114.5AK , 1992 ,274.3AK , 1993 ,21251AK , 1994 ,0CA, 02/01 ,0CA, 1991 ,0

TextFile FlatMap GroupByKey Map Output

1

2

Itdoesnotpruneinputrecordsknowntobeirrelevantbecauseofthelackofsemanticunderstandingofdata-flowoperators

Page 13: Automated and Interactive Debugging of Big Data Analyticsweb.cs.ucla.edu/~gulzar/assets/pdf/SRC-icse18.pdf · Our Insights for Interactive DISC Debugging • We do not pause program

AutomatedDebugginginDISCwithBigSift[SoCC 2017]

13

Test PredicatePushdown

PrioritizingBackwardTraces

BitmapbasedTest

Memoization

Input:ASparkProgram,ATestFunction Output:MinimumFault-InducingInputRecords

DataProvenance+DeltaDebugging

Page 14: Automated and Interactive Debugging of Big Data Analyticsweb.cs.ucla.edu/~gulzar/assets/pdf/SRC-icse18.pdf · Our Insights for Interactive DISC Debugging • We do not pause program

14

Optimization1: TestPredicatePushdown

Ifapplicable,BigSift pushesdownthetestfunctiontotesttheoutputofcombinersinordertoisolatethefaultypartitions.

• Observation: Duringbackwardtracing,dataprovenancetracesthroughallthepartitionseventhoughonlyafewpartitionsarefaulty

Test

Test

Test

Test

Test

Test

Test

Page 15: Automated and Interactive Debugging of Big Data Analyticsweb.cs.ucla.edu/~gulzar/assets/pdf/SRC-icse18.pdf · Our Insights for Interactive DISC Debugging • We do not pause program

DebuggingTime

1

10

100

1000

10000

100000

1000000

10000000

1000000001E+09

0 2000 4000 6000 8000 10000 12000 14000

#offault-ind

ucinginpu

trecords

FaultLocalizationTime(s)

SequenceCount

DeltaDebugging BigSift

TestDrivenDataProvenance DataProvenance

Onaverage,BigSift takes62%lesstimetodebugasinglefaultyoutput thanthetimetakenforasinglerunontheentiredata.

Page 16: Automated and Interactive Debugging of Big Data Analyticsweb.cs.ucla.edu/~gulzar/assets/pdf/SRC-icse18.pdf · Our Insights for Interactive DISC Debugging • We do not pause program

16

• InteractiveDebuggingPrimitivesforBigDataProcessinginSparkICSE’16

• AutomatedDebugginginDataIntensiveScalableComputingSystemsSoCC ’17

• White-boxTestingofDataIntensiveScalableComputingApplicationswithuserdefinedfunctionsOngoing

Page 17: Automated and Interactive Debugging of Big Data Analyticsweb.cs.ucla.edu/~gulzar/assets/pdf/SRC-icse18.pdf · Our Insights for Interactive DISC Debugging • We do not pause program

TestingChallengesofBigDataAnalytics

• Howcanweselectaminimalsample fromacompletedatasettoperformefficienttestingofDISCapplications?

• HowcanwegenerateadatathatexercisesallexecutionpathsinaDISCapplicationtofacilitatecompletetesting?

• Duetodataflowoperators andcomplexuserdefinedfunctionsinDISCapplication,itisextremelyhardtoanswerthetwomentionedquestions.

sc.textFile(“hdfs://..”).flatMap( s=>s.split(“.”) ).map( s => (s,1) ).reduceByKey( (a,b) => a+b )

DataflowOperators Userdefinedfunctions

Page 18: Automated and Interactive Debugging of Big Data Analyticsweb.cs.ucla.edu/~gulzar/assets/pdf/SRC-icse18.pdf · Our Insights for Interactive DISC Debugging • We do not pause program

Ongoingwork:White-boxTestingofDISCApplications

udf1

udf2

udf3

DISCApplication

JavaPathFinder

map

filter

reduce

Stage1

Stage2

Udf1symPath & Effect

map

filter

reduce

Stage1

Stage2

Udf2symPath & Effect

Udf3symPath & Effect

PathConstraint

Effects

X>5&y= .. z=x*…

X<3 &y>… z=x/…

… ….

ADISCapplicationisdecomposedintoUDFsanddataflowoperators.

1 EachcomplexUDFissymbolicallyexecutedinisolationwithboundedpathexploration.

2 UDFPathconstraintsareintegratedw.r.t thelogicalspecificationsofdataflowoperators

3

PathConstraint

Effects

X>5&y= .. z=x*…

X<3 &y>… z=x/… Z3TestData

PathconstraintsareconvertedintoSMT2whichisusedtogeneratetestdatatheoremsolver

4

Page 19: Automated and Interactive Debugging of Big Data Analyticsweb.cs.ucla.edu/~gulzar/assets/pdf/SRC-icse18.pdf · Our Insights for Interactive DISC Debugging • We do not pause program

Testcoveragefromgeneratedtestingdata

100 100 100 100 100 100 100

16.7

40

14.3 18.225

13.325

66.760

28.6

54.5

75 76.7

100

0

20

40

60

80

100

120

Income Movie Airport Commute PigMix Grade Word

%ofJDU

Paths

JointDataFlowandUDFPathCoverage

BigTest Sedge OrginialDataset

BigTest outperformspreviouslyknowntechniqueandprovides100%JDUpathcoverageisalmosteverybenchmarkprogram

Page 20: Automated and Interactive Debugging of Big Data Analyticsweb.cs.ucla.edu/~gulzar/assets/pdf/SRC-icse18.pdf · Our Insights for Interactive DISC Debugging • We do not pause program

Conclusion• Bysynthesizinginsightsfromsoftwareengineeringand

databasesystems,wecandesignscalable, interactive,andautomateddebuggingalgorithmsforbigdataanalytics.

• Demo:DebuggingBigDataAnalyticsinSparkwithBigDebughttps://www.youtube.com/watch?v=aZ91EyC5-Yc

• Demo:AutomatedDebuggingofBigDataAnalyticswithBigSifthttps://www.youtube.com/watch?v=_HR3VJ2dPbE

Page 21: Automated and Interactive Debugging of Big Data Analyticsweb.cs.ucla.edu/~gulzar/assets/pdf/SRC-icse18.pdf · Our Insights for Interactive DISC Debugging • We do not pause program

21

99504,01/01/1992,1ft99504,03/01/1992,0.1ft99504,01/01/1993, 70in99504,03/01/1993,145mm99504,01/01/1994 ,245mm99504,01/01/1993 ,85mm90031,02/01/1991 ,0mm

AK, 01/01 ,[304.8,21336,245,85]AK, 03/01 ,[30.5,145]AK, 1992 ,[304.8,30.5]AK, 1993 ,[21336,145, 85]AK, 1994 ,[245]CA, 02/01 ,[0]CA, 1991 ,[0]

TextFile FlatMap GroupByKey Map Output

AK ,01/01,304.8AK ,1992 , 304.8AK ,03/01 ,30.5AK ,1992 ,30.5AK ,01/01 ,21336AK ,1993 , 21336AK ,03/01 ,145AK ,1993 ,145AK ,01/01 ,245AK ,1994 ,245

…. ….

AK ,01/01,21251AK ,03/01,114.5AK ,1992 ,274.3AK ,1993 ,21251AK ,1994 ,0CA ,02/01,0CA ,1991 ,0

Optimization2:PrioritizingBackwardTraces

Incaseofmultiplefaultyoutputs,BigSift overlapstwobackwardtracestominimizethescopeoffault-inducinginputrecords

• Observation:Thesamefaultyinputrecordmaycontributetomultipleoutputrecordsfailingthetest.

Page 22: Automated and Interactive Debugging of Big Data Analyticsweb.cs.ucla.edu/~gulzar/assets/pdf/SRC-icse18.pdf · Our Insights for Interactive DISC Debugging • We do not pause program

TestDataSizefromBigTest

6 5 14 11 430

4

4000000000

521344

448000000 32000010040000000 40000000 111359852

1.00E+00

1.00E+02

1.00E+04

1.00E+06

1.00E+08

1.00E+10

Income Movie Airport Commute PigMix Grade Word

#ofinpu

trecords

TestDataGeneratedtoachieve100%JDUPathCoverage

BigTest InputDataset

BigTest generatestestingdateofsizethatisseveralordersofmagnitude(106-1010)smallerthantheoriginalinputdataset

Page 23: Automated and Interactive Debugging of Big Data Analyticsweb.cs.ucla.edu/~gulzar/assets/pdf/SRC-icse18.pdf · Our Insights for Interactive DISC Debugging • We do not pause program

BigDebug:InteractiveDebugger[FSE2016Demo,SIGMOD2017Demo]

• BigDebug ispublicallyavailableathttps://sites.google.com/site/sparkbigdebug/

Page 24: Automated and Interactive Debugging of Big Data Analyticsweb.cs.ucla.edu/~gulzar/assets/pdf/SRC-icse18.pdf · Our Insights for Interactive DISC Debugging • We do not pause program

FaultLocalizabilityoverDataProvenance

143796

6487290

520904

234115800

15003060

2554788

350

2

1350

15 13

1 1 1 1 1 12

1

10

100

1000

10000

100000

1000000

10000000

100000000

MovieHistorgram

InvertedIndex

RatingHistogram

SequenceCount

RatingFrequency

CollegeStudents

WeatherAnalysis

#offault-ind

ucinginpu

trecords

DataProvenance TestDrivenDataProvenance BigSift&DD

BigSift leveragesDDafterDPtocontinuefaultisolation,achievingseveralordersofmagnitude103 to107 betterprecision.