Tolerating Hardware Faults In Commodity Software: Problems, Solutions, and a Roadmap


Karthik Pattabiraman (http://blogs.ubc.ca/karthik)
Electrical and Computer Engineering, University of British Columbia (UBC)

Motivation: Hardware Errors
•  Errors are becoming more common in processors
  –  Soft errors and device variations (timing errors)
  –  Processors experience wear-out and thermal hotspots

Source: Shekhar Borkar (Intel), Stanford talk in 2005

Hardware Errors: Traditional Solutions
•  Guard-banding
  –  [Figure: guard-band between the average-case and worst-case operating points]
  –  Guard-banding wastes power as the gap between average and worst case widens due to variations
•  Duplication
  –  Hardware duplication (DMR) can result in 2X slowdown and/or energy consumption

An alternative approach

[Figure: system stack - the user interacts with the application; the application and operating system form the software layers, while the architecture and devices/circuits form the hardware layers]

Allow errors across the hardware-software boundary, but make sure the user does not perceive them

Why do software techniques work?

[Figure: errors arising at the device/circuit level are filtered as they propagate through the architectural, operating-system, and application levels; only a few remain impactful errors]

Software Techniques

[Figure: application properties drive targeted protection mechanisms spanning the device/circuit, architectural, operating-system, and application levels]

Leverage the properties of the application to provide targeted protection, only for the errors that matter to it

Outline
•  Motivation
•  Techniques developed by my group [DSN'13][CASES'14]
•  A brief history of software techniques
•  Adoption in Industry
•  Research opportunities and roadmap

EDCs: Soft Computing Applications
Ø  Applications in machine learning, multimedia processing
Ø  Expected to dominate future workloads [Dubey'07]

Original image (left) versus faulty image: JPEG decoder

EDCs: Egregious Data Corruptions
Ø  Large or unacceptable deviation in output

EDC image (PSNR 11.37) vs. non-EDC image (PSNR 44.79)

EDCs: Goal
Ø  Selectively detect EDC-causing faults, but not others

[Figure: a detector on the application execution flags EDC outcomes while letting benign and non-EDC outcomes pass]

EDCs: Fault model
•  Transient hardware faults
  –  Caused by particle strikes, supply noise
•  Our Fault Model (a minimal sketch follows below)
  –  Assume one fault per application execution
  –  Processor registers and execution units
  –  Memory and cache protected with ECC
  –  Control logic protected with other methods
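The sketch below makes the fault model concrete: a single transient bit flip in a register-sized value, once per execution. It is an illustrative stand-alone C program; inject_bitflip() and the register stand-in are assumed names, not part of any tool described in this talk.

  #include <stdint.h>
  #include <stdio.h>
  #include <stdlib.h>

  /* Flip one randomly chosen bit of a 32-bit value: the "particle strike". */
  static uint32_t inject_bitflip(uint32_t value) {
      int bit = rand() % 32;          /* which bit the transient fault hits */
      return value ^ (1u << bit);
  }

  int main(void) {
      srand(42);                      /* deterministic for reproducibility */
      uint32_t reg = 1000;            /* stands in for a processor register */
      printf("before: %u  after: %u\n", reg, inject_bitflip(reg));
      return 0;
  }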

EDCs: Main Idea

[Figure: only a small subset of the application data - the critical data - leads to EDCs when corrupted by hardware faults]

Our prior work: EDCs are caused by corruption of a small fraction of program data [Flikker - ASPLOS'11]
This work: critical data can be identified using static and dynamic analysis, without any programmer annotations

EDCs: Initial Study
Monitor control/pointer data:
Ø  Instrument code
Ø  Fault injection
Ø  Correlation between program data use & fault outcome

Performed using the LLFI fault injector [DSN'14], at the LLVM IR code level (a toy version of such an experiment is sketched below)
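The sketch below mimics, at a much smaller scale, what such a fault-injection campaign measures: inject one bit flip into one intermediate value per run, compare against a fault-free ("golden") run, and classify each outcome. It is a hedged stand-in, assuming a toy computation(); LLFI itself instruments LLVM IR and also distinguishes crashes and hangs, which this sketch omits.

  #include <stdint.h>
  #include <stdio.h>

  /* Toy computation; flip_bit < 0 means a fault-free run. */
  static uint32_t computation(uint32_t x, int flip_bit) {
      uint32_t t = x * 3 + 7;         /* some intermediate program value */
      if (flip_bit >= 0)
          t ^= (1u << flip_bit);      /* the injected transient fault */
      return t / 2;                   /* later computation may mask the flip */
  }

  int main(void) {
      uint32_t golden = computation(1234, -1);   /* golden run */
      int benign = 0, corrupted = 0;
      for (int bit = 0; bit < 32; bit++) {       /* one fault per run */
          if (computation(1234, bit) == golden)
              benign++;                           /* masked: output unchanged */
          else
              corrupted++;                        /* potential SDC/EDC */
      }
      printf("benign: %d  corrupted: %d\n", benign, corrupted);
      return 0;
  }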

EDCs: Initial Study

[Pie chart of fault-injection outcomes: 43%, 28%, 23%, and 6%]

EDCs: Example Heuristic
Faults affecting branches with large amounts of data within their bodies have a higher likelihood of resulting in EDC outcomes.

  void conv422to444(char *src, char *dst, int height, int width, int offset) {
      for (j = 0; j < height; j++) {
          for (i = 0; i < width; i++) {
              im1 = (i < 1) ? 0 : i - 1;   /* low EDC likelihood */
              ...
          }
          if (j + 1 < offset) {   /* high EDC likelihood: a fault in offset
                                     flips this branch guarding a large body */
              src += w;
              dst += width;
          }
      }
  }

EDCs: Algorithm

[Workflow: the application source code is compiled to IR; an EDC ranking algorithm, driven by representative inputs, ranks the program data; a selection algorithm, given a performance overhead budget, chooses the data variables or locations to protect; protection is inserted via backward slice replication, sketched below]
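As a hedged illustration of backward slice replication, the sketch below duplicates the slice that feeds the critical branch of the earlier conv422to444 example and compares the two copies just before the branch. edc_detected(), offset_dup, and the simplified convert_rows() are illustrative assumptions, not the code the compiler pass actually generates.

  #include <stdio.h>
  #include <stdlib.h>

  /* Hypothetical detection hook; a real system might log and recover. */
  static void edc_detected(void) {
      fprintf(stderr, "EDC detector fired\n");
      exit(1);
  }

  void convert_rows(int height, int offset_in) {
      int offset     = offset_in;   /* original backward slice of offset */
      int offset_dup = offset_in;   /* replicated copy of the same slice */
      for (int j = 0; j < height; j++) {
          int cond     = (j + 1 < offset);      /* critical branch condition */
          int cond_dup = (j + 1 < offset_dup);  /* recomputed from the copy  */
          if (cond != cond_dup)
              edc_detected();       /* a fault corrupted one copy of the slice */
          if (cond) {
              /* large amount of data is updated here, so a flipped
                 branch would cause an egregious output deviation */
          }
      }
  }

  int main(void) {
      convert_rows(480, 16);        /* usage example with arbitrary sizes */
      return 0;
  }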

EDCs: Detection Coverage

EDC Coverage = Number of Detected EDCs / Total Number of EDCs (higher is better)

Average EDC coverage of 82% at 10% performance overhead

EDCs: Selectivity

Average benign and non-EDC coverage of 10 to 15% for overheads from 10 to 25% (lower is better)

SDCTune: Silent Data Corruptions (SDCs)

[Figure: fault-outcome taxonomy - a fault occurs and the error is activated; the error may be masked (benign, correct output), the program may crash or hang (results lost), or the program finishes with SDC output (silent data corruption)]

SDCTune: Goals
•  Protecting critical data in soft-computing applications from EDCs
  –  Can we extend this to Silent Data Corruptions (SDCs) in general-purpose applications?
•  Challenge:
  –  Not feasible to identify SDCs based on the amount of data affected by the fault, as was the case with EDCs
  –  Need a comprehensive model for predicting SDCs based on static and dynamic program features

SDCTune: Main Idea
•  Start from Store and Cmp instructions and go backward through the program's data dependencies
•  Use machine learning (CART) to predict the SDC proneness of Store and Cmp instructions (a sketch of the final selection step follows below)
  –  Extract the related features by static/dynamic analysis
  –  Quantify the effects by classification and regression
  –  Estimate SDC rates of different Store and Cmp instructions
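A minimal sketch of the selection step, assuming the CART model has already produced a predicted SDC proneness for each instruction: greedily protect the most SDC-prone instructions until a user-specified overhead bound is exhausted. The arrays, the cost figures, and the simple greedy policy are illustrative assumptions, not SDCTune's exact algorithm.

  #include <stdio.h>

  #define N_INSTR 5

  int main(void) {
      /* Predicted P(SDC|I), sorted in descending order, and the estimated
         protection (duplication) overhead of each instruction. */
      double p_sdc[N_INSTR] = {0.30, 0.22, 0.15, 0.08, 0.02};
      double cost[N_INSTR]  = {0.04, 0.03, 0.05, 0.02, 0.01};
      double budget = 0.10, spent = 0.0;   /* 10% overhead bound */

      for (int i = 0; i < N_INSTR; i++) {
          if (spent + cost[i] > budget)
              continue;                    /* skip instructions that overshoot */
          spent += cost[i];
          printf("protect instruction %d (P(SDC|I) = %.2f)\n", i, p_sdc[i]);
      }
      printf("estimated overhead used: %.0f%%\n", spent * 100);
      return 0;
  }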

SDCTune: Example Model

[Figure: fragment of the learned model - decision nodes test features such as "not used in masking operations", and leaves apply linear regression for SDC proneness]

SDCTune: Benchmarks

The slide shows an excerpt from the paper:

Figure 5: (a) data dependency of detector-free code; (b) basic detector instrumented; (c) concatenated duplicated instructions. The shaded portion of (a) shows the instructions that need protection. (b) shows the duplicated instructions (the shaded nodes) and the detector inserted at the end of the two dependency chains. (c) shows one added instruction to protect (node e') that concatenates the two dependency chains and saves one checker.

(a) Detector-free code:

  for (i = 0; ; i++) {
      // loop body
      // exit predication decomposed to simulate instruction-level behaviour;
      // "bound" is a stand-in name for the loop-bound operand lost in extraction
      flag = i < bound ? 1 : 0;
      if (flag == 1)
          break;
  }

(b) Basic detector instrumented:

  i = 0;
  dup_i = 0;                 // duplication of i
  for (;;) {
      // loop body (i and dup_i updated here)
      flag = i < bound ? 1 : 0;
      dup_flag = dup_i < bound ? 1 : 0;
      if (flag != dup_flag)
          Assert();          // inconsistent
      if (flag == 1)
          break;
  }

(c) Lazy checking applied:

  i = 0;
  dup_i = 0;                 // duplication of i
  for (;;) {
      // loop body (i and dup_i updated here)
      flag = i < bound ? 1 : 0;
      dup_flag = dup_i < bound ? 1 : 0;
      if (flag == 1)
          break;
  }
  if (flag != dup_flag)
      Assert();              // inconsistent

Figure 6: (b) shows how the loop index i in the original code (a) is protected with the added check (in bold in the paper). (c) shows how we move the check out of the loop body.

5. Experimental Setup

In this section, we empirically evaluate SDCTune for configurable SDC protection through fault-injection experiments. All the experiments and evaluations are conducted on an Intel i7 4-core machine with 8GB memory running Debian Linux. Section 5.1 presents the details of the benchmarks and Section 5.2 presents our evaluation metrics. Section 5.3 presents our methodology and workflow for performing the experiments.

5.1 Benchmarks

We choose a total of 12 applications from a wide variety of domains for training and testing. They are from the SPEC benchmark suite [12], the SPLASH2 benchmark suite [30], the NAS parallel benchmark suite [1], the PARSEC benchmark suite [2], and the Parboil benchmark suite [27]. We divide the 12 applications into two groups of 6 applications each, one for training and the other for testing. The four benchmarks studied in Section 2.3 are incorporated in the training group. The details of these training and testing benchmarks are shown in Table 5 and Table 6 respectively. All the applications are compiled and linked into native executables with -O2 optimization flags and run in single-threaded mode.

Table 5: Training programs

Program    Description                   Benchmark suite  Input      Stores  Comparisons
IS         Integer sorting               NAS              default    21      20
LU         Linear algebra                SPLASH2          test       41      110
Bzip2      Compression                   SPEC             test       681     646
Swaptions  Price portfolio of swaptions  PARSEC           Sim-large  36      101
Water      Molecular dynamics            SPLASH2          test       187     224
CG         Conjugate gradient method     NAS              default    32      97

Table 6: Testing programs

Program     Description                  Benchmark suite  Input  Stores  Comparisons
Lbm         Fluid dynamics               Parboil          short  71      34
Gzip        Compression                  SPEC             test   251     399
Ocean       Large-scale ocean movements  SPLASH2          test   322     813
Bfs         Breadth-first search         Parboil          1M     36      57
Mcf         Combinatorial optimization   SPEC             test   87      158
Libquantum  Quantum computing            SPEC             test   39      136

5.2 Evaluation Metrics

To gauge the accuracy of SDCTune, we use it for estimating the overall SDC rate of an application, as well as the SDC coverage for different performance overhead bounds. The former is used for comparing the resilience of different applications, while the latter is used to insert detectors for configurable protection.

Estimation of overall SDC rates: We perform a random fault-injection experiment to determine the overall SDC rate of the application. We then compare the SDC rate obtained with SDCTune with that obtained from the fault-injection experiment. We also consider the relative SDC rate compared to other applications (i.e., its rank). We use the same experimental setup for fault injection as described in Section 2.3.

SDC coverage for different performance overhead bounds: We use SDCTune to predict the SDC coverage for different instructions to satisfy the performance overhead bounds provided by the user. We start with the most SDC-prone instructions and iteratively expand the set of instructions until the performance overhead bounds are met. We perform fault-injection experiments on the program instrumented with our detectors for these instructions, and measure the percentage of SDCs detected. We then compare our results with those of full duplication, i.e., when every instruction in the program is duplicated, and hot-path duplication, i.e., when the top 10% most-executed instructions are duplicated.

SDC detection efficiency: Similar to the efficiency defined in prior work [25], we define the SDC detection efficiency as the ratio between SDC coverage and performance overhead for a detection technique; for example, 44.8% coverage at a 10% overhead gives an efficiency of 4.48. We calculate the efficiency of each benchmark under a given performance overhead bound, and compare it with the efficiencies of full duplication and hot-path duplication. The SDC coverage of full duplication is assumed to be a hundred percent [23].

5.3 Work Flow and Implementation

Figure 7 shows the workflow for estimating the overall SDC rates and providing configurable protection using SDCTune.


SDCTune: Evaluation Method

[Workflow diagram, summarized:]
–  Training phase: features extracted, based on heuristic knowledge, from the training programs; the SDC rate for each instruction, P(SDC|I), measured on the training programs; training (CART method) yields a P(SDC|I) predictor
–  Testing and usage phase: features extracted from the testing programs; estimate the SDC proneness of different program instructions; find the set of instructions for an overhead bound (∑P(I))
–  Evaluation phase: compare against the actual SDC coverage for the testing programs, obtained from random fault-injection results

SDCTune: Model Validation

                    Training programs   Testing programs
Rank correlation*   0.9714              0.8286
P-value**           0.00694             0.0125

[Scatter plot: rank of overall SDC rates by estimation vs. rank of overall SDC rates by fault-injection experiment, for the training and testing programs]

SDCTune: SDC Coverage

Training programs:
Overhead   Coverage
10%        44.8%
20%        78.6%
30%        86.8%

Testing programs:
Overhead   Coverage
10%        39%
20%        63.7%
30%        74.9%

SDCTune: Full Duplication and Hot-Path Duplication Overheads

Full duplication overhead: 53.7% to 73.6%. Hot-path duplication overhead: 43.5% to 57.6%.

Normalized detection efficiency:
                    10% overhead   20% overhead   30% overhead
Training programs   2.38           2.09           1.54
Testing programs    2.87           2.34           1.84

EDCs and SDCTune: Summary
•  Software-level techniques for tunable and selective protection from EDCs and SDCs [DSN'13][DSN'14][CASES'14][TECS1][TECS2]
•  Completely automated: no programmer intervention or annotations are needed
•  Significant efficiency gain over full duplication

Outline
•  Motivation
•  Techniques developed by my group [DSN'13][CASES'14]
•  A brief history of software techniques
•  Adoption in Industry
•  Research opportunities and roadmap

History of S/W techniques: Pre-2000
•  Long history of software techniques for high-reliability systems, going back to IBM MVS and Tandem Guardian
  –  Relied on architectural support from the hardware
  –  Assumed software was written in a transactional style
•  Algorithm-Based Fault Tolerance, 1984 [Huang and Abraham]: specialized applications in linear algebra
•  Many control-flow checking techniques from the 1980s
  –  Only protected the program's control-flow instructions

History of S/W techniques: 2000-2005
•  Soft error problem [Sun Server - Baumann 2000]
•  ARGOS project from Stanford (McCluskey, 2001)
  –  EDDI: software-based instruction duplication
  –  CFCSS: lightweight control-flow checking
•  Reliability and Security Engine (RSE) from UIUC (2004)
  –  Targeted checking of application properties at runtime
•  SWIFT from Princeton (2005)
  –  Low-overhead checking through compiler optimizations
•  First SELSE workshop launched (2005)
  –  Focus on entire systems spanning software and hardware

SELSE Papers (2009-2017)

[Bar chart: breakdown of papers at SELSE 2009-2017 into Software, Hardware, and Both; source: SELSE website]

Data unavailable for the years 2005 to 2008. Based on titles and abstracts only.

History of S/W techniques: 2010-today
•  Cross-Layer Resilience becomes a buzzword: many groups, including ours, working on this problem
•  Multiple domains: HPC systems, embedded systems
•  Calls from different funding agencies (DoE, NSF, etc.) for Cross-Layer Resilience Techniques - white papers
•  Conjoined twin: Approximate Computing takes off
  –  Papers at top PL/architecture conferences

Outline
•  Motivation
•  Techniques developed by my group [DSN'13][CASES'14]
•  A brief history of software techniques
•  Adoption in Industry
•  Research opportunities and roadmap

What about Software Researchers?
•  Papers in the top software engineering/testing/reliability conferences about hardware faults and errors over the last 10 years (2006 onwards):
  –  ICSE: 5 papers (IEEE DL)
  –  FSE: 6 papers (ACM DL)
  –  ASE: 7 papers (ACM DL)
  –  ISSTA: 3 papers (ACM DL)
  –  ICST: 2 papers (IEEE DL)
  –  ISSRE: 10 papers (IEEE DL)
  –  Total: 33 out of over 3000 papers (about 1%)

Example conversations with Software Developers in Industry
•  Developer 1 (large s/w company you've heard of)
  –  Me: How do you handle hardware faults?
  –  D1: Do these even occur in the real world?
  –  Me: [showing him data gathered by his own company on h/w faults]
  –  D1: Hmmm... sounds like a problem for QA folks. We don't deal with faults.
•  Tester 1 (large s/w-h/w company you've heard of)
  –  Me: How do you handle hardware faults?
  –  T1: Our hardware folks put in various mechanisms such as ECC memory to mask these.
  –  H/w guy: Not really, we don't handle everything.
  –  T1: Oh, well - that's not part of our requirements doc. Maybe if we meet our bug targets...

Software Developers
•  Most software developers (and testers) ignore hardware faults, or assume faults will be handled by hardware (e.g., ECC memory)
•  Even if they recognize the importance of the problem, many think it's not their problem
  –  QA or testing people should take care of it
  –  Not part of the requirements/specification document

Should we care about developers?

Ultimately, developers are the ones who drive adoption and assimilation within the broader software ecosystem

Barriers to Adoption: Possible Reasons
•  Reason 1: Software developers don't care about anything to do with hardware
•  Reason 2: Too much time and effort - many other priorities in software development
•  Reason 3: Lack of high-level abstractions
•  Reason 4: No easy-to-use tools that integrate with the software development workflow

Barriers to Adoption: Reason 1
•  Software engineers don't care about anything to do with hardware
•  Not true. Many counter-examples:
  –  Parallelism, both coarse- and fine-grained
  –  Cache-conscious data structures and algorithms
  –  Energy efficiency and energy-aware programming
  –  Determinism, memory models, etc.

Barriers to Adoption: Reason 2
•  Too much time and effort: many other priorities in software development
•  Partially true, but not always
  –  Many other time-consuming activities are used, e.g., continuous testing, static analysis
  –  Techniques for tolerating hardware faults don't need to be time-consuming or effort-intensive

Barriers to Adoption: Reason 3
•  Lack of high-level abstractions
•  My experience: mostly true
  –  Developers want to reason with quantities they're familiar with (e.g., runtime, defect rates, etc.), not necessarily things like FIT rates, or even coverage
  –  Need to be able to reason about cost-benefit tradeoffs of different techniques at the abstract level, without going into details

Barriers to Adoption: Reason 4
•  No easy-to-use tools that integrate with the software development workflow
•  My experience: one of the main reasons
  –  Need to understand software developers' workflow and integrate with it - no deviation
  –  Must be able to handle legacy code, weird setups, and multiple languages and libraries
  –  S/W maintenance accounts for 60% of costs

What we got right (in my opinion)
•  Automated workflow, so little to no effort was needed on the part of the programmer
•  Abstraction in terms that ordinary programmers can understand (e.g., performance, coverage)
  –  Can do better on this front, though
•  Use of a popular open-source tool (LLVM), which is easy to integrate with workflows (in theory)

What we got wrong (in my opinion)
•  Our abstraction was still too low-level
  –  Many programmers didn't understand coverage
•  Legacy code: LLVM can't compile old code, inline assembly, or customized build systems
•  Did not have an easy path to QA testing or software maintenance in our long-term plan

Outline
•  Motivation
•  Techniques developed by my group [DSN'13][CASES'14]
•  A brief history of software techniques
•  Adoption in Industry
•  Research opportunities and roadmap

Open Challenges
•  Need to build software-based techniques that ordinary programmers can reason about
•  Need software techniques that can integrate seamlessly with the overall software workflow
  –  Should not impede the QA and testing process
  –  Software maintenance should be considered
  –  Legacy code and build systems (if relevant)

The Opportunity: Everett Rogers' Model for Disruptive Innovation

We are still in the innovator/early-adopter stage; we need to cross the chasm and move to the "early majority"

Potential Research Roadmap

Understand the issues facing software developers and testers in adopting software techniques for tolerating hardware faults.

Build techniques to address these issues in a common research framework for the community, to avoid duplication of effort.

Have developers use the framework in actual practice, and identify the issues that come up during the use of the framework.

Conclusions
•  Software techniques for tolerating hardware faults in commodity systems have mushroomed
  –  Can be tuned based on the needs of the application
  –  Can offer significant efficiency over full duplication
•  Unfortunately, the techniques have seen limited adoption in industry & by the software community
  –  We, in the SELSE community, all need to address this
  –  Emphasis should be on the complete software life-cycle
  –  Take our message to the software engineering venues

Acknowledgements

My graduate students (7 current, 10 graduated): Anna Thomas, Qining Lu, Layali Rashid, Majid Dadashi, Bo Fang, Jiesheng Wei, Frolin Ocariza, Kartik Bajaj, Shabnam Mirshokraie, Xin Chen, Farid Tabrizi, Saba Alimadadi, Sheldon Sequira, Nithya Murthy, Abraham Chan, Maryam Raiyat, Justin Li

Collaborators and funding agencies (selected ones below)

http://blogs.ubc.ca/karthik/