Modeling Soft-Error Propagation in Programs

Guanpeng (Justin) Li, Karthik Pattabiraman, Siva Hari, Michael Sullivan, Timothy Tsai

Motivation: Soft Errors

[Figure: bit-flip illustration with the values 0001 and 0101] [1]

Soft errors becoming more common in processors

[1] http://aviral.lab.asu.edu/soft-error-resilience/

Silent Data Corruption (SDC)

Normal Execution -> Fault -> Error Propagation, with three possible outcomes:

• SDC: Incorrect Output
• Crash: Exceptions, No Output
• Benign: Correct Output

Example: Amazon S3 Incident
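The three outcomes can be told apart by comparing a faulty run against a fault-free ("golden") run. The following is a minimal sketch of that comparison, not the authors' tooling; the function name `classify_outcome` and its arguments are hypothetical.

```python
def classify_outcome(golden_output, faulty_output, crashed):
    """Classify one faulty run against the fault-free (golden) run.

    golden_output -- output of the run without any fault
    faulty_output -- output of the run with a fault (None if nothing was produced)
    crashed       -- True if the faulty run raised an exception or was killed
    """
    if crashed or faulty_output is None:
        return "Crash"    # exceptions or no output
    if faulty_output == golden_output:
        return "Benign"   # the fault was masked, output is still correct
    return "SDC"          # run finished normally but the output is wrong


def sdc_rate(outcomes):
    """Fraction of runs that ended in an SDC."""
    return outcomes.count("SDC") / len(outcomes)


# Hypothetical runs, for illustration only.
runs = [
    classify_outcome("42", "42", crashed=False),
    classify_outcome("42", "43", crashed=False),
    classify_outcome("42", None, crashed=True),
    classify_outcome("42", "42", crashed=False),
]
```

With these made-up runs, one of four is an SDC, so `sdc_rate(runs)` is 0.25.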

Software Solutions

[Figure: a soft error moving up the system stack: Device/Circuit Level, Architectural Level, Operating System Level, Application Level. Labels: Impactful Errors, Protection Overhead, Increasing]

Software protection techniques are more flexible and cost-effective!

Selective Instruction Duplication

[Figure: “The Golden Curve”: SDC Coverage vs. Protection Overhead. Application specific! (*measured in Libquantum, SPEC)]

• Instruction Sequence -> Instruction Duplication
• Each instruction: SDC Rate = X%, Overhead = Y%
• Selected instructions for a given target SDC coverage: a knapsack problem
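To make the knapsack framing concrete, here is a small sketch that picks which instructions to duplicate under an overhead budget. It is a textbook 0/1 knapsack by dynamic programming, not necessarily the selection procedure used in the talk; the function name and all numbers are made up for illustration.

```python
from typing import List, Tuple


def select_instructions(
    sdc: List[float], cost: List[int], budget: int
) -> Tuple[float, List[int]]:
    """0/1 knapsack: maximize covered SDC contribution within an overhead budget.

    sdc[i]  -- hypothetical SDC contribution of instruction i
    cost[i] -- hypothetical duplication overhead of instruction i (integer units)
    budget  -- maximum total overhead allowed (same integer units)
    """
    n = len(sdc)
    # best[i][b] = best coverage using the first i instructions with overhead <= b
    best = [[0.0] * (budget + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for b in range(budget + 1):
            best[i][b] = best[i - 1][b]  # option 1: do not protect instruction i-1
            if cost[i - 1] <= b:
                take = best[i - 1][b - cost[i - 1]] + sdc[i - 1]
                if take > best[i][b]:
                    best[i][b] = take  # option 2: protect it
    # Walk back through the table to recover which instructions were chosen.
    chosen = []
    b = budget
    for i in range(n, 0, -1):
        if best[i][b] != best[i - 1][b]:
            chosen.append(i - 1)
            b -= cost[i - 1]
    chosen.reverse()
    return best[n][budget], chosen


# Made-up per-instruction numbers, purely for illustration.
sdc_rates = [0.10, 0.40, 0.25, 0.05]
overheads = [2, 5, 3, 1]
coverage, picked = select_instructions(sdc_rates, overheads, budget=6)
```

The quality of the selection depends entirely on the per-instruction SDC rates fed in, which is why estimating those rates cheaply matters.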

Developing Fault-Tolerant Applications

[Figure: development loop. Stages: Development of Application, Evaluate Program SDC Rate, Acceptable?, New Release, Measure Instruction SDC Rates, Selective Protection]

1. Thousands of fault injections need to be done
2. Repeat every time code is modified

Estimating SDC Rate

[Figure: prior techniques plotted by Accuracy and Speed, with "Our Goal" marked]

• AVF/PVF/ePVF [MICRO’03, HPCA’10, DSN’16]
• SymPLFIED/Relyzer/GangES [DSN’08, ASPLOS’12, ISCA’14]

No existing technique models error propagation in a way that is both fast and accurate!

Fast prediction of SDC without fault injection!

Challenges

• Tracking SDC propagation is hard
• Over billions of executed instructions
• Every instruction may propagate errors with different probabilities
• Dynamic nature of program execution
• Control-flow divergence

[Figure: a corrupted branch (BR) taking its T or F edge and corrupting subsequent states]

Trident: Key Insight

• Error propagations can be decomposed into modules, which can be abstracted into probabilistic events
  • Decomposition
  • Abstraction

Trident: Workflow

[Figure: workflow with two phases, Profiling and Prediction. Labels: Source Code, Program Input, Output Insn., Insn. for Prediction, Insn. SDC Rates, Overall SDC Rate]

Trident: Our Approach

• Three-level modeling
  • Register-communication (fS)
  • Control-flow (fC)
  • Memory dependency (fM)

Running example (control-flow graph; T1/F1 and T2/F2 label the true and false edges of two branches):

BB4:
  $2 = LOAD 0x04
  $3 = ADD $2, 4
  CMP $4, $3, 4
  BR $4, BB5, BB10
BB5:
  $5 = MUL $6, 16
  … …
BB10:
  … …
BB11:
  STORE …, 0x08
BB12:
  … …
BB102:
  … = LOAD 0x08

Trident: Register Commn.

Propagation probability within BB4?

  $2 = LOAD 0x04      <100%>
  $3 = ADD $2, 4      <100%>
  CMP $4, $3, 4       <25%>
  BR $4, BB5, BB10    <100%>

fS = 100% * 100% * 25% * 100% = 25%
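The register-communication arithmetic for BB4 can be reproduced in a few lines. This is only a sketch of the multiplication shown on the slide, using its four per-instruction percentages; the function name is hypothetical.

```python
from math import prod


def static_propagation(probabilities):
    """Multiply per-instruction propagation probabilities along one
    data-dependent instruction sequence."""
    return prod(probabilities)


# Per-instruction values for BB4 as listed on the slide: 100%, 100%, 25%, 100%.
bb4 = [1.00, 1.00, 0.25, 1.00]
f_s = static_propagation(bb4)
```

Here `f_s` evaluates to 0.25, matching the 25% on the slide.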

Trident: Control-Flow

Corruption probability of STORE (in BB11), given a corrupted BR?

• Branch probabilities: T1 = 80%, F1 = 20%; T2 = 30%, F2 = 70%
• STORE exec. prob.: F1 * T2
• BR dom. prob.: F1
• fC = (STORE exec. prob.) / (BR dom. prob.) = (F1 * T2) / F1 (for non-loop-terminating branches)
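A sketch of the control-flow calculation follows. It rests on two assumptions: that the slide's stacked labels "STORE exec. prob." and "BR dom. prob." form a ratio, and that the edge probabilities read positionally as F1 = 20% and T2 = 30%. The function name is hypothetical, and the sketch covers only the non-loop-terminating case mentioned on the slide.

```python
def store_corruption_probability(exec_prob, dom_prob):
    """Probability that a STORE is corrupted once the branch that dominates it
    has gone the wrong way (non-loop-terminating branches only).

    exec_prob -- profiled probability that the STORE executes
    dom_prob  -- profiled probability of the branch direction that dominates the STORE
    """
    if not 0.0 < dom_prob <= 1.0:
        raise ValueError("dom_prob must be in (0, 1]")
    return exec_prob / dom_prob


# Assumed reading of the slide's edge probabilities.
f1, t2 = 0.20, 0.30
p_exec = f1 * t2  # the STORE is reached via F1, then T2
f_c = store_corruption_probability(p_exec, f1)
```

Under these assumptions the F1 factor cancels and `f_c` comes out at roughly 0.30, that is, the probability of T2.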

Trident: Memory-Dependency

Dependent LOAD & STORE: STORE …, 0x08 (BB11) and … = LOAD 0x08 (BB102) access the same address, so a corrupted STORE can propagate to the LOAD.

P(I_n) = fS(I_n) * fC(I_n2) * fS(I_n3) * fC(I_n4) * … …

* n corresponds to the index of dynamic instructions
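One way to find dependent LOAD and STORE pairs is to scan a profiled memory trace and link every LOAD to the most recent STORE to the same address. The sketch below illustrates that idea under an assumed trace format of `(instr_id, op, address)` tuples; it is not taken from the Trident implementation.

```python
from collections import defaultdict


def memory_dependencies(trace):
    """Map each STORE to the LOADs that read the value it wrote.

    trace -- iterable of (instr_id, op, address) tuples in execution order,
             where op is "LOAD" or "STORE" (hypothetical trace format)
    """
    last_store = {}            # address -> id of the latest STORE to it
    deps = defaultdict(set)    # STORE id -> ids of dependent LOADs
    for instr_id, op, addr in trace:
        if op == "STORE":
            last_store[addr] = instr_id
        elif op == "LOAD" and addr in last_store:
            deps[last_store[addr]].add(instr_id)
    return dict(deps)


# Trace shaped after the running example: the STORE in BB11 and the LOAD in
# BB102 both touch address 0x08, while the LOAD in BB4 reads 0x04.
trace = [
    ("BB4:LOAD", "LOAD", 0x04),
    ("BB11:STORE", "STORE", 0x08),
    ("BB102:LOAD", "LOAD", 0x08),
]
deps = memory_dependencies(trace)
```

The resulting map has a single entry linking the BB11 STORE to the BB102 LOAD, which is the edge along which a corrupted stored value would continue to propagate.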

Experimental Setup

• Fault Model
  • Single bit-flip injections: accurate [DSN’17]
  • Random insn.: one per program execution
• Benchmarks
  • 11 open-source benchmarks from various domains
• Comparison with fault injection
  • Accuracy
  • Speed (wall-clock time)

[Table: Benchmark, Application Domains]
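The single bit-flip fault model can be illustrated with a few lines of Python. This is a toy sketch on a plain integer, assuming a 32-bit value; it does not use or mimic the LLFI tool, and the function name is hypothetical.

```python
import random


def flip_random_bit(value, width=32, rng=random):
    """Flip one uniformly chosen bit of a width-bit unsigned value.

    Returns the corrupted value and the index of the flipped bit.
    """
    bit = rng.randrange(width)
    mask = (1 << width) - 1
    return (value ^ (1 << bit)) & mask, bit


# One injection into one value, mirroring "one per program execution".
rng = random.Random(0)  # fixed seed so the run is repeatable
original = 0b0001
corrupted, bit = flip_random_bit(original, rng=rng)
```

Whatever bit is chosen, `original` and `corrupted` differ in exactly one bit position.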

Experimental Methodology

Reminder: Goal is to predict SDC rate as per fault injection

• Baseline: Fault injection derived by LLFI [1]
• The closer the predicted SDC rate is to fault injection, the better the prediction
• Created two simpler models for comparison: fS and fS+fC
  • Accuracy of each sub-model
  • As proxy to prior work

[1] LLVM Fault Injector [DSN’14]

Evaluation: Accuracy

[Figure: Program SDC Rate; 3,000 Sampled Instructions; Error Bar: +/-0.07% ~ +/-1.76% at 95% Confidence Interval]

3,000 randomly sampled instructions for fault injection and the models

• Mean Absolute Error
  • Trident: 4.75%
  • Simpler Models: 15.13% and 19.13%
• t-Test on Individual Instructions
  • Trident: 8 out of 11 are statistically indistinguishable
  • Simpler Models (fS and fS+fC): Only 2 and 4

Trident is close to fault injection results, and significantly better than the simpler models!
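The size of the error bars can be sanity-checked with the usual normal approximation for a binomial proportion. This is an assumption about how such margins are typically computed, not a statement of the authors' exact method; the function name is hypothetical.

```python
from math import sqrt


def ci_margin(p, n, z=1.96):
    """Half-width of a confidence interval for a proportion p measured over
    n samples, using the normal approximation (z = 1.96 for 95%)."""
    return z * sqrt(p * (1.0 - p) / n)


# Worst case (p = 0.5) at the slide's sample size of 3,000 injections.
margin = ci_margin(0.5, 3000)
```

The worst-case margin at 3,000 samples is about 1.79%, in line with the largest error bar reported on the slide; rarer SDC outcomes give narrower bars.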

Evaluation: Speed

[Figure: Wall-Clock Time of Estimating Program SDC Rate]

• Program’s Overall SDC Rate:
  • 6.7x faster at 3,000 samples
• Per-Instruction SDC Rate:
  • On average, 380x faster at 100 samples per instruction
• Benchmarks: FI takes nearly 100 hours whereas Trident takes < 20 mins

Trident is faster than fault injection by 2 orders of magnitude!

Use Case: Selective Instruction Duplication

Recap: Selective Instruction Duplication and “The Golden Curve”

[Figure: SDC Coverage vs. Protection Overhead (*measured in Libquantum, SPEC). Curves: By Fault Injections, By Trident, By fS+fC, By fS]

Extension

• Understand how error propagation is affected by multiple inputs
• Extension for bounding SDC rate with multiple inputs

“Modeling Input-Dependent Error Propagation in Programs”
Session 6: Modeling and Verification, Wednesday, June 27th

Summary

• Fault injections are too slow to integrate into the software development cycle
• Trident is both accurate and fast in predicting SDC rates
• Can guide selective protection of instructions in programs: comparable to fault injection in accuracy for a fraction of the cost
• Open Source: https://github.com/DependableSystemsLab/Trident

Guanpeng (Justin) Li
University of British Columbia (UBC)
gpli@ece.ubc.ca