Tolerating Hardware Faults in Commodity Software: Problems, Solutions, and a Roadmap
Karthik Pattabiraman (http://blogs.ubc.ca/karthik), Electrical and Computer Engineering, University of British Columbia (UBC)

Slides: blogs.ubc.ca/karthik/files/2017/03/SELSE17-keynote.pdf


Page 1

Tolerating Hardware Faults in Commodity Software: Problems, Solutions, and a Roadmap

Karthik Pattabiraman (http://blogs.ubc.ca/karthik)

Electrical and Computer Engineering, University of British Columbia (UBC)

Page 2

Motivation: Hardware Errors

• Errors are becoming more common in processors
  – Soft errors and device variations (timing errors)
  – Processors experience wear-out and thermal hotspots

Source: Shekhar Borkar (Intel), Stanford talk in 2005

Page 3

Hardware Errors: Traditional Solutions

• Guard-banding: wastes power as the gap between average and worst-case widens due to variations
• Duplication: hardware duplication (DMR) can result in 2X slowdown and/or energy consumption

[Figure: guard-band between the average and worst-case]

Page 4

An alternative approach

[Figure: system stack. The user interacts with the application. Software layers: Application, Operating System. Hardware layers: Architecture, Devices/Circuits]

Allow errors across the hardware-software boundary, but make sure the user does not perceive them

Page 5

Why do software techniques work?

[Figure: Impactful Errors across the layers: Device/Circuit Level, Architectural Level, Operating System Level, Application Level]

Page 6

Software Techniques

Leverage the properties of the application to provide targeted protection, only for the errors that matter to it

[Figure: Application Properties and Targeted protection mechanisms, shown against the layers: Device/Circuit Level, Architectural Level, Operating System Level, Application Level]

Page 7

Outline

• Motivation
• Techniques developed by my group [DSN'13][CASES'14]
• A brief history of software techniques
• Adoption in Industry
• Research opportunities and roadmap

Page 8

EDCs: Soft Computing Applications

• Applications in machine learning, multimedia processing
• Expected to dominate future workloads [Dubey'07]

[Figure: Original image (left) versus faulty image: JPEG decoder]

Page 9

EDCs: Egregious Data Corruptions

• Large or unacceptable deviation in output

[Figure: EDC image (PSNR 11.37) vs. non-EDC image (PSNR 44.79)]
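The PSNR values quoted on this slide suggest a simple way to separate EDC outputs from non-EDC outputs. The sketch below is only illustrative: the function names, the 8-bit peak value of 255, and the 30 dB cutoff are assumptions and are not taken from the talk.

```python
import math

def psnr(reference, observed, peak=255.0):
    """Peak signal-to-noise ratio (dB) between two equal-length pixel sequences."""
    if len(reference) != len(observed):
        raise ValueError("images must have the same number of pixels")
    mse = sum((r - o) ** 2 for r, o in zip(reference, observed)) / len(reference)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(peak ** 2 / mse)

def is_edc(reference, observed, threshold_db=30.0):
    """Flag an output as an EDC when its PSNR falls below the assumed cutoff."""
    return psnr(reference, observed) < threshold_db

golden = [10, 20, 30, 40]
slightly_off = [10, 21, 30, 40]   # one pixel off by 1
badly_off = [200, 20, 230, 40]    # two pixels far from the golden values
```

With these toy inputs, the slightly perturbed output stays well above the cutoff while the badly perturbed one falls far below it.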

Page 10

EDCs: Goal

• Selectively detect EDC-causing faults, but not others

[Figure: application execution with fault outcomes Benign, Non-EDC, and EDC, and a Detector]

Page 11

EDCs: Fault model

• Transient hardware faults
  – Caused by particle strikes, supply noise
• Our Fault Model
  – Assume one fault per application execution
  – Processor registers and execution units
  – Memory and cache protected with ECC
  – Control logic protected with other methods
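The fault model above (one transient fault per execution, in a register or execution unit) is commonly emulated as a single bit flip. The following is a minimal sketch under that assumption; the function names, the 32-bit register width, and the list-of-registers representation are hypothetical and are not how LLFI or the authors' tools are implemented.

```python
import random

def flip_bit(value, bit, width=32):
    """Return `value` with one bit inverted, modelling a transient register fault."""
    if not 0 <= bit < width:
        raise ValueError("bit index out of range")
    mask = (1 << width) - 1
    return (value ^ (1 << bit)) & mask

def inject_one_fault(register_values, rng, width=32):
    """Corrupt exactly one randomly chosen register value (one fault per run)."""
    faulty = list(register_values)
    target = rng.randrange(len(faulty))
    bit = rng.randrange(width)
    faulty[target] = flip_bit(faulty[target], bit, width)
    return faulty, target, bit

rng = random.Random(0)          # fixed seed so the experiment is repeatable
regs = [7, 1024, 0xFFFF]
faulty, target, bit = inject_one_fault(regs, rng)
```

Seeding the generator keeps each injection run reproducible, which makes it easier to compare the faulty output against a golden run.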

Page 12

EDCs: Main Idea

[Figure: Application Data containing a small region of Critical Data, corrupted by hardware faults]

• Our prior work: EDCs are caused by corruption of a small fraction of program data [Flikker-ASPLOS'11]
• This work: Critical data can be identified using static and dynamic analysis, without any programmer annotations

Steps: Initial Study, Heuristic Algorithm

Page 13

EDCs: Initial Study

• Monitor control/pointer data
• Instrument code
• Fault injection
• Correlation between program data use & fault outcome

Performed using the LLFI fault injector [DSN'14], at the LLVM IR code level

Page 14

EDCs: Initial Study

[Figure: chart with the values 6%, 43%, 23%, 28%; category labels not recoverable]

Page 15

EDCs: Example Heuristic

Faults affecting branches with a large amount of data within their bodies have a higher likelihood of resulting in EDC outcomes

    void conv422to444(char *src, char *dst, int height, int width, int offset) {
        for (j = 0; j < height; j++) {
            for (i = 0; i < width; i++) {
                im1 = (i < 1) ? 0 : i - 1;
                ……
            }
            if (j + 1 < offset) {
                src += w;
                dst += width;
            }
        }
    }

[Slide annotations on the code: High EDC Likelihood (fault in offset, branch flip); Low EDC Likelihood]
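The heuristic above can be illustrated with a toy experiment. The sketch below is not the real conv422to444 routine; `convert`, its parameters, and the way faults are injected are all hypothetical. It only contrasts a fault confined to one data value with a fault that flips the branch guarding the pointer update.

```python
def convert(src, height, width, offset, flip_branch_at=None, corrupt_pixel_at=None):
    """Toy row copier loosely modelled on the slide's loop structure."""
    out = []
    row = 0
    for j in range(height):
        for i in range(width):
            value = src[row][i]
            if corrupt_pixel_at == (j, i):
                value ^= 0x40            # fault confined to a single data value
            out.append(value)
        advance = j + 1 < offset
        if flip_branch_at == j:
            advance = not advance        # fault flips the branch outcome
        if advance:
            row += 1                     # branch body: move on to the next source row
    return out

def count_diff(a, b):
    """Number of output positions where the two runs disagree."""
    return sum(x != y for x, y in zip(a, b))

src = [[10 * r + c for c in range(4)] for r in range(4)]
golden = convert(src, 4, 4, 5)
data_fault = convert(src, 4, 4, 5, corrupt_pixel_at=(1, 2))
branch_fault = convert(src, 4, 4, 5, flip_branch_at=1)
```

In this toy setup the data fault changes one output element, while the flipped branch shifts every later row, which is the kind of large deviation the heuristic associates with EDCs.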

Page 16

EDCs: Algorithm

[Figure: workflow with the elements Application Source Code, Compiler, IR, Representative inputs, EDC Ranking Algorithm, Performance Overhead, Selection Algorithm, Data Variables or Locations to Protect, Backward slice replication]
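The last step in the workflow, backward slice replication, needs the set of instructions that a protected value depends on. A minimal sketch of computing such a slice over a dependency graph follows; the graph, the instruction names, and the dictionary representation are hypothetical, and a real compiler pass would work on LLVM IR def-use chains instead.

```python
def backward_slice(deps, target):
    """Collect every instruction that `target` transitively depends on (itself included)."""
    seen = set()
    stack = [target]
    while stack:
        node = stack.pop()
        if node in seen:
            continue
        seen.add(node)
        stack.extend(deps.get(node, ()))
    return seen

# Hypothetical dependency graph: each key depends on the listed producers.
deps = {
    "store_out": ["mul", "addr"],
    "mul": ["load_a", "load_b"],
    "addr": ["base", "idx"],
    "log_msg": ["load_a"],
}
to_replicate = backward_slice(deps, "store_out")
```

Only the instructions in `to_replicate` would be duplicated and compared, so unrelated code such as `log_msg` adds no overhead.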

Page 17

EDCs: Detection Coverage

$$\text{EDC Coverage} = \frac{\text{Number of Detected EDCs}}{\text{Total Number of EDCs}}$$

Average EDC Coverage of 82% at 10% performance overhead

(Higher is better)

Page 18

EDCs: Selectivity

Average Benign and Non-EDC Coverage of 10 to 15% for overheads from 10 to 25%

(Lower is better)
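Coverage and selectivity are both detection rates, computed over different outcome classes. The sketch below shows one way to tabulate them from a fault injection log; the log format and the counts are made up for illustration and are not results from the talk.

```python
from collections import Counter

def detection_rates(runs):
    """runs: iterable of (outcome, detected) pairs; returns the detected fraction per outcome."""
    totals, hits = Counter(), Counter()
    for outcome, detected in runs:
        totals[outcome] += 1
        hits[outcome] += int(detected)
    return {outcome: hits[outcome] / totals[outcome] for outcome in totals}

# Made-up fault injection log, purely for illustration.
runs = (
    [("EDC", True)] * 8 + [("EDC", False)] * 2
    + [("non-EDC", True)] * 1 + [("non-EDC", False)] * 9
    + [("benign", True)] * 3 + [("benign", False)] * 17
)
rates = detection_rates(runs)
```

A selective detector shows a high rate for the "EDC" class and low rates for the "non-EDC" and "benign" classes.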

Page 19

SDCTune: Silent Data Corruptions (SDCs)

[Figure: fault outcome diagram with the labels Fault occurs; Error activated; Error masked (Benign); Crash/Hang; SDC; Program finished; Correct output; SDC output; Results lost]
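The outcome categories in this diagram can be assigned mechanically after each fault injection run by comparing against a fault-free (golden) run. The helper below is a sketch under that assumption; the function name, its parameters, and the category strings are hypothetical.

```python
def classify_run(exit_ok, timed_out, output, golden_output):
    """Map one fault injection run onto the outcome categories from the slide."""
    if timed_out:
        return "hang"
    if not exit_ok:
        return "crash"
    if output == golden_output:
        return "benign"   # fault was masked or never activated
    return "sdc"          # program finished, but with a wrong output

outcomes = [
    classify_run(True, False, "42", "42"),
    classify_run(True, False, "41", "42"),
    classify_run(False, False, "", "42"),
    classify_run(True, True, "", "42"),
]
```

Crashes and hangs are visible to the user or the system, whereas the "sdc" case is the one that needs dedicated detectors.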

Page 20

SDCTune: Goals

• Protecting critical data in soft-computing applications from EDCs
  – Can we extend this to Silent Data Corruptions (SDCs) in general-purpose applications?
• Challenge:
  – Not feasible to identify SDCs based on the amount of data affected by the fault, as was the case with EDCs
  – Need for a comprehensive model for predicting SDCs based on static and dynamic program features

Page 21

SDCTune: Main Idea

• Start from Store and Cmp instructions and go backward through the program's data dependencies
• Use machine learning (CART) to predict the SDC proneness of Store and Cmp instructions
  – Extract the related features by static/dynamic analysis
  – Quantify the effects by classification and regression
  – Estimate SDC rates of different Store and Cmp instructions
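The core step of a CART regression tree is choosing the split that minimizes the squared error of the two resulting groups. The sketch below shows that single step on one feature. It is not the SDCTune model: the feature (number of masking operations on an instruction's dependency chain) and the SDC rates are hypothetical values chosen for illustration.

```python
def best_split(samples):
    """samples: list of (feature_value, sdc_rate). Returns (threshold, left_mean, right_mean)."""
    def sse(values):
        # Sum of squared errors around the group mean.
        if not values:
            return 0.0
        mean = sum(values) / len(values)
        return sum((v - mean) ** 2 for v in values)

    xs = sorted({x for x, _ in samples})
    best = None
    for lo, hi in zip(xs, xs[1:]):
        threshold = (lo + hi) / 2
        left = [y for x, y in samples if x <= threshold]
        right = [y for x, y in samples if x > threshold]
        cost = sse(left) + sse(right)
        if best is None or cost < best[0]:
            best = (cost, threshold, sum(left) / len(left), sum(right) / len(right))
    if best is None:
        raise ValueError("need at least two distinct feature values")
    return best[1:]

# Hypothetical training data: (masking operations on the chain, measured SDC rate).
samples = [(0, 0.60), (0, 0.50), (1, 0.20), (2, 0.10), (3, 0.10)]
threshold, rate_unmasked, rate_masked = best_split(samples)
```

A full tree applies this step recursively to each group, and each leaf then predicts an SDC rate for the instructions that fall into it.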

Page 22

SDCTune: Example Model

[Figure: example model with the labels "Not used in masking operations" and "Linear Regression for SDC-proneness"]

Page 23

SDCTune: Benchmarks

Figure 5 panels: (a) Data dependency of detector-free code; (b) Basic detector instrumented; (c) Concatenated duplicated instructions

Figure 5: The shaded portion of (a) shows the instructions that need protection. (b) shows the duplicated instructions (the shaded nodes) and the detector inserted at the end of the two dependency chains. (c) shows one added instruction to protect (node e') that concatenates the two dependency chains and saves one checker.

(a) Detector-free code

    for (i = 0; ; i++) {
        // loop body
        flag = n < i ? 1 : 0;
        if (flag == 1)
            break;
        // decompose exit predication to simulate instruction-level behaviour
    }

(b) Basic detector instrumented

    i = 0;
    // duplication of i
    dup_i = 0;
    for (;;) {
        // loop body
        flag = n < i ? 1 : 0;
        dup_flag = n < dup_i ? 1 : 0;
        if (flag != dup_flag)
            Assert();   // inconsistent
        if (flag == 1)
            break;
    }

(c) Lazy checking applied

    i = 0;
    // duplication of i
    dup_i = 0;
    for (;;) {
        // loop body
        flag = n < i ? 1 : 0;
        dup_flag = n < dup_i ? 1 : 0;
        if (flag == 1)
            break;
    }
    if (flag != dup_flag)
        Assert();   // inconsistent

Figure 6: (b) shows how the loop index i in the original code (a) is protected, with the bold code as the check. (c) shows how we move the check out of the loop body.

Note: the identifiers in these listings did not survive extraction. The names i, flag, and dup_flag follow the caption and the remaining text; dup_i, n, and the operand order of the comparison are assumed.
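The difference between the basic detector and lazy checking in Figure 6 can be sketched in a few lines. This is an illustration only: `count_up_to`, the simulated fault, and the check counter are hypothetical, and the loop bound and comparison are assumptions because the figure's identifiers are not legible.

```python
class FaultDetected(Exception):
    """Raised when the original and duplicated computations disagree."""

def count_up_to(n, fault_at=None, lazy=True):
    """Toy loop with a duplicated index; `fault_at` corrupts the original index once."""
    i = dup_i = 0
    checks = 0
    while True:
        if fault_at is not None and i == fault_at:
            i += 4               # simulated transient fault in the original copy
            fault_at = None
        flag = 1 if n < i else 0
        dup_flag = 1 if n < dup_i else 0
        if not lazy:
            checks += 1          # basic detector: compare on every iteration
            if flag != dup_flag:
                raise FaultDetected(f"mismatch at dup_i={dup_i}")
        if flag == 1:
            break
        i += 1
        dup_i += 1
    if lazy:
        checks += 1              # lazy checking: compare once, after the loop
        if flag != dup_flag:
            raise FaultDetected(f"mismatch after loop, dup_i={dup_i}")
    return i, checks

def detects_fault(lazy):
    """True if the injected fault is caught by the chosen checking style."""
    try:
        count_up_to(5, fault_at=3, lazy=lazy)
    except FaultDetected:
        return True
    return False
```

In this toy loop both styles catch the injected fault, but the lazy variant performs one comparison instead of one per iteration.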

5. Experimental Setup

In this section, we empirically evaluate SDCTune for configurable SDC protection through fault injection experiments. All the experiments and evaluations are conducted on an Intel i7 4-core machine with 8GB memory running Debian Linux. Section 5.1 presents the details of the benchmarks and Section 5.2 presents our evaluation metrics. Section 5.3 presents our methodology and workflow for performing the experiments.

5.1 Benchmarks

We choose a total of 12 applications from a wide variety of domains for training and testing. They are from the SPEC benchmark suite [12], SPLASH2 benchmark suite [30], NAS parallel benchmark suite [1], PARSEC benchmark suite [2] and Parboil benchmark suite [27]. We divide the 12 applications into two groups of 6 applications each, one for training and the other for testing. The four benchmarks studied in Section 2.3 are incorporated in the training group. The details of these training and testing benchmarks are shown in Table 5 and Table 6 respectively. All the applications are compiled and linked into native executables with -O2 optimization flags and run in single-threaded mode.

Table 5: Training programs

Program     Description                   Benchmark suite  Input      Stores  Comparisons
IS          Integer sorting               NAS              default    21      20
LU          Linear algebra                SPLASH2          test       41      110
Bzip2       Compression                   SPEC             test       681     646
Swaptions   Price portfolio of swaptions  PARSEC           Sim-large  36      101
Water       Molecular dynamics            SPLASH2          test       187     224
CG          Conjugate gradient method     NAS              default    32      97

Table 6: Testing programs

Program     Description                   Benchmark suite  Input      Stores  Comparisons
Lbm         Fluid dynamics                Parboil          short      71      34
Gzip        Compression                   SPEC             test       251     399
Ocean       Large-scale ocean movements   SPLASH           test       322     813
Bfs         Breadth-first search          Parboil          1M         36      57
Mcf         Combinatorial optimization    SPEC             test       87      158
Libquantum  Quantum computing             SPEC             test       39      136

5.2 Evaluation Metrics

To gauge the accuracy of SDCTune, we use it for estimating the overall SDC rate of an application, as well as the SDC coverage for different performance overhead bounds. The former is used for comparing the resilience of different applications, while the latter is used to insert detectors for configurable protection.

Estimation of overall SDC rates: We perform a random fault injection experiment to determine the overall SDC rate of the application. We then compare the SDC rate obtained with SDCTune with that obtained from the fault injection experiment. We also consider the relative SDC rate compared to other applications (i.e., its rank). We use the same experimental setup for fault injection as described in Section 2.3.

SDC coverages for different performance overhead bounds: We use SDCTune to predict the SDC coverage for different instructions to satisfy the performance overhead bounds provided by the user. We start with the most SDC-prone instructions and iteratively expand the set of instructions until the performance overhead bounds are met. We perform fault injection experiments on the program instrumented with our detectors for these instructions, and measure the percentage of SDCs detected. We then compare our results with those of full duplication, i.e., when every instruction is duplicated in the program, and hot-path duplication, i.e., when the top 10% most executed instructions are duplicated in the program.

SDC detection efficiency: Similar to the efficiency defined in prior work [25], we define the SDC detection efficiency as the ratio between SDC coverage and performance overhead for a detection technique. We calculate the efficiency of each benchmark under a given performance overhead bound, and compare it with the efficiencies of full duplication and hot-path duplication. The SDC coverage of full duplication is assumed to be a hundred percent [23].

5.3 Work Flow and Implementation

Figure 7 shows the workflow for estimating the overall SDC rates and providing configurable protection using SDCTune.


Page 24

SDCTune: Evaluation Method

Training phase
• Features extracted based on heuristic knowledge from training programs
• SDC rate for each instruction, P(SDC|I), from training programs
• Training (CART method) produces the P(SDC|I) predictor

Testing and using phase
• Features extracted from testing programs
• Usage phase: estimate the SDC proneness of different program instructions, then find the set of instructions for an overhead bound (∑P(I))
• Evaluation phase: random fault injection results from testing programs give the actual SDC coverage for testing programs
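The usage phase picks instructions to protect until an overhead bound is reached. A greedy version of that selection can be sketched as follows. The candidate list, the use of an execution share (in percent) as a stand-in for each instruction's overhead, and all names are assumptions made for illustration, not the tool's actual cost model.

```python
def select_instructions(candidates, overhead_bound):
    """candidates: list of (name, predicted_sdc_proneness, cost_percent).

    Greedily picks the most SDC-prone instructions whose combined cost
    stays within `overhead_bound` (also in percent).
    """
    chosen, overhead = [], 0
    ranked = sorted(candidates, key=lambda c: c[1], reverse=True)
    for name, _proneness, cost in ranked:
        if overhead + cost > overhead_bound:
            continue   # skip anything that would exceed the budget
        chosen.append(name)
        overhead += cost
    return chosen, overhead

# Hypothetical predictions for four Store/Cmp instructions.
candidates = [
    ("store_a", 0.60, 5),
    ("cmp_b", 0.45, 4),
    ("store_c", 0.30, 8),
    ("cmp_d", 0.10, 1),
]
chosen, overhead = select_instructions(candidates, overhead_bound=10)
```

Integer percentages are used for the costs so that the budget comparison is exact; a real tool would estimate the overhead from profiling data.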

Page 25

SDCTune: Model Validation

                    Training programs  Testing programs
Rank correlation*   0.9714             0.8286
P-value**           0.00694            0.0125

[Figure: scatter plot. x-axis: rank of overall SDC rates by fault injection experiment; y-axis: rank of overall SDC rates by estimation; series: Training programs, Testing programs]
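The slide does not say which rank correlation coefficient was used. Assuming Spearman's coefficient, the comparison between the two rankings can be computed as below. The rankings here are hypothetical; they were chosen only to show that six programs with small rank swaps give a value of the same size as the one reported for the testing programs.

```python
def spearman_rho(ranks_a, ranks_b):
    """Spearman's rank correlation for two tie-free rankings of the same items."""
    n = len(ranks_a)
    if n != len(ranks_b) or n < 2:
        raise ValueError("need two rankings of equal length >= 2")
    d_squared = sum((a - b) ** 2 for a, b in zip(ranks_a, ranks_b))
    return 1 - 6 * d_squared / (n * (n * n - 1))

# Hypothetical ranks of six programs by overall SDC rate.
by_injection = [1, 2, 3, 4, 5, 6]
by_estimate = [2, 1, 4, 3, 6, 5]
rho = spearman_rho(by_injection, by_estimate)
```

A value close to 1 means the model orders the programs by SDC rate almost the same way as fault injection does, even if the absolute rates differ.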

Page 26

SDCTune: SDC Coverage

Overhead  Training programs  Testing programs
10%       44.8%              39%
20%       78.6%              63.7%
30%       86.8%              74.9%

Page 27

SDCTune: Full Duplication and Hot-Path Duplication Overheads

Full duplication overhead: 53.7% to 73.6%
Hot-path duplication overhead: 43.5% to 57.6%

Normalized Detection Efficiency

                   10% overhead  20% overhead  30% overhead
Training programs  2.38          2.09          1.54
Testing programs   2.87          2.34          1.84
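Detection efficiency is the ratio of SDC coverage to performance overhead, and the table reports it relative to a baseline. The sketch below works through that arithmetic for the testing programs. Two assumptions are made here and are not stated on the slide: that the baseline is full duplication with 100% coverage, and that the 73.6% end of the full duplication overhead range applies to the testing programs.

```python
def efficiency(coverage_pct, overhead_pct):
    """SDC detection efficiency: coverage divided by performance overhead."""
    return coverage_pct / overhead_pct

def normalized_efficiency(coverage_pct, overhead_pct,
                          baseline_overhead_pct, baseline_coverage_pct=100.0):
    """Efficiency relative to a baseline technique such as full duplication."""
    baseline = efficiency(baseline_coverage_pct, baseline_overhead_pct)
    return efficiency(coverage_pct, overhead_pct) / baseline

# Testing-program coverage at each overhead bound, from the SDC Coverage slide.
testing_coverage = {10: 39.0, 20: 63.7, 30: 74.9}
normalized = {
    overhead: round(normalized_efficiency(coverage, overhead, 73.6), 2)
    for overhead, coverage in testing_coverage.items()
}
```

Under these assumptions the computed values agree with the testing row of the table, which suggests the normalization is against full duplication; the training row does not match exactly with a single overhead value, so it may have been averaged per benchmark.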

Page 28

EDCs and SDCTune: Summary

• Software-level techniques for tunable and selective protection from EDCs and SDCs [DSN'13][DSN'14][CASES'14][TECS1][TECS2]
• Completely automated: no programmer intervention or annotations are needed
• Significant efficiency gain over full duplication

Page 29

Outline

• Motivation
• Techniques developed by my group [DSN'13][CASES'14]
• A brief history of software techniques
• Adoption in Industry
• Research opportunities and roadmap

Page 30

History of S/W techniques: Pre-2000

• Long history of software techniques for high-reliability systems going back to IBM MVS, Tandem Guardian
  – Relied on architectural support from the hardware
  – Assumed software was written in transactional style
• Algorithm-Based Fault Tolerance, 1984 [Huang and Abraham]: specialized applications in linear algebra
• Many control-flow checking techniques from the 1980s
  – Only protected the program's control-flow instructions

Page 31

History of S/W techniques: 2000-2005

• Soft error problem [Sun Server, Baumann 2000]
• ARGOS project from Stanford (McCluskey, 2001)
  – EDDI: software-based instruction duplication
  – CFCSS: lightweight control-flow checking
• Reliability and Security Engine (RSE) from UIUC (2004)
  – Targeted checking of application properties at runtime
• SWIFT from Princeton (2005)
  – Low-overhead checking through compiler optimizations
• First SELSE workshop launched (2005)
  – Focus on entire system spanning software and hardware

Page 32

SELSE Papers (2009-2017)

[Figure: stacked bar chart, "Papers at SELSE 2009-2017" (source: SELSE website). x-axis: years 2009 to 2017; y-axis: share of papers, 0% to 100%; series: Software, Hardware, Both]

Data unavailable for years 2005 to 2008. Based on title and abstracts only.

Page 33

History of S/W techniques: 2010-today

• Cross-Layer Resilience becomes a buzzword: many groups working on this problem, including ours
• Multiple domains: HPC systems, embedded systems
• Calls from different funding agencies (DoE, NSF, etc.) for Cross-Layer Resilience techniques, plus white papers
• Conjoined twin: Approximate Computing takes off
  – Papers at top PL/architecture conferences

Page 34

Outline

• Motivation
• Techniques developed by my group [DSN'13][CASES'14]
• A brief history of software techniques
• Adoption in Industry
• Research opportunities and roadmap

Page 35

What about Software Researchers?

• Papers in the top software engineering/testing/reliability conferences about hardware faults and errors over the last 10 years (2006 onwards)
  – ICSE: 5 papers (IEEE DL)
  – FSE: 6 papers (ACM DL)
  – ASE: 7 papers (ACM DL)
  – ISSTA: 3 papers (ACM DL)
  – ICST: 2 papers (IEEE DL)
  – ISSRE: 10 papers (IEEE DL)
  – Total: 33 out of over 3000 papers (about 1%)

Page 36

Example conversations with Software Developers in Industry

• Developer 1 (large s/w company you've heard of)
• Me: How do you handle hardware faults?
• D1: Do these even occur in the real world?
• Me: (showing him data gathered by his own company on h/w faults)
• D1: Hmmm… sounds like a problem for QA folks. We don't deal with faults.

• Tester 1 (large s/w-h/w company you've heard of)
• Me: How do you handle hardware faults?
• T1: Our hardware folks put in various mechanisms such as ECC memory to mask these
• H/w guy: Not really, we don't handle everything
• T1: Oh, well, that's not part of our requirements doc. Maybe if we meet our bug targets...

Page 37

Software Developers

• Most software developers (and testers) ignore hardware faults, or assume faults will be handled by hardware (e.g., ECC memory)
• Even if they recognize the importance of the problem, many think it's not their problem
  – QA or testing people should take care of it
  – Not part of requirements/specification document

Page 38

Should we care about developers?

Ultimately, developers are the ones who drive adoption and assimilation within the broader software ecosystem

Page 39

Barriers to Adoption: Possible Reasons

• Reason 1: Software developers don't care about anything to do with hardware
• Reason 2: Too much time and effort, with many other priorities in software development
• Reason 3: Lack of high-level abstractions
• Reason 4: No easy-to-use tools that integrate with the software development workflow

Page 40

Barriers to Adoption: Reason 1

• Software engineers don't care about anything to do with hardware
• Not true. Many counter-examples:
  – Parallelism, both coarse and fine-grained
  – Cache-conscious data structures and algorithms
  – Energy efficiency and energy-aware programming
  – Determinism, memory models, etc.

Page 41

Barriers to Adoption: Reason 2

• Too much time and effort: many other priorities in software development
• Partially true, but not always
  – Many other time-consuming activities are used, e.g., continuous testing, static analysis
  – Techniques for tolerating hardware faults don't need to be time-consuming or effort-intensive

Page 42

Barriers to Adoption: Reason 3

• Lack of high-level abstractions
• My experience: Mostly true
  – Developers want to reason with quantities they're familiar with (e.g., runtime, defect rates, etc.), not necessarily things like FIT rates, or even coverage
  – Need to be able to reason about cost-benefit tradeoffs of different techniques at the abstract level without going into details

Page 43

Barriers to Adoption: Reason 4

• No easy-to-use tools that integrate with the software development workflow
• My experience: One of the main reasons
  – Need to understand software developers' workflow and integrate with it, with no deviation
  – Must be able to handle legacy code, weird setups, and multiple languages and libraries
  – S/W maintenance accounts for 60% of costs

Page 44

What we got right (in my opinion)

• Automated workflow, so little to no effort on the part of the programmer was needed
• Abstraction in terms that ordinary programmers can understand (e.g., performance, coverage)
  – Can do better on this front though
• Use of a popular open-source tool (LLVM), which is easy to integrate with workflow (in theory)

Page 45

What we got wrong (in my opinion)

• Our abstraction was still too low-level
  – Many programmers didn't understand coverage
• Legacy code: LLVM can't compile old code, inline assembly, customized build systems
• Did not have an easy path to QA testing or software maintenance in our long-term plan

Page 46

Outline

• Motivation
• Techniques developed by my group [DSN'13][CASES'14]
• A brief history of software techniques
• Adoption in Industry
• Research opportunities and roadmap

Page 47

Open Challenges

• Need to build software-based techniques that ordinary programmers can reason about
• Need software techniques that can integrate seamlessly with the overall software workflow
  – Should not impede QA and testing process
  – Software maintenance should be considered
  – Legacy code and build systems (if relevant)

Page 48

The Opportunity: Everett Rogers' Model for Disruptive Innovation

We are still in the innovator/early adopter stage: need to cross the chasm and move to "early majority"

Page 49

Potential Research Roadmap

• Understand the issues facing software developers and testers in adopting software techniques for tolerating hardware faults
• Build techniques to address these issues in a common research framework for the community to avoid effort duplications
• Have developers use the framework in actual practice and identify the issues that come up during the use of the framework

Page 50

Conclusions

• Software techniques for tolerating hardware faults in commodity systems have mushroomed
  – Can be tuned based on the needs of the application
  – Can offer significant efficiency over full duplication
• Unfortunately, the techniques have seen limited adoption in industry & by the software community
  – We, in the SELSE community, all need to address this
  – Emphasis should be on complete software life-cycle
  – Take our message to the software engineering venues

Page 51

Acknowledgements

My graduate students (7 current, 10 graduated): Anna Thomas, Qining Lu, Layali Rashid, Majid Dadashi, Bo Fang, Jiesheng Wei, Frolin Ocariza, Kartik Bajaj, Shabnam Mirshokraie, Xin Chen, Farid Tabrizi, Saba Alimadadi, Sheldon Sequira, Nithya Murthy, Abraham Chan, Maryam Raiyat, Justin Li

Collaborators and funding agencies (selected ones below)

http://blogs.ubc.ca/karthik/