
From Detection to Optimization: Impact of

Silent Errors on Scientific Applications

SC’16 Doctoral Showcase

Jon Calhoun, Luke Olson (Advisor), and Marc Snir (Advisor)

University of Illinois at Urbana-Champaign

15 November 2016

Jon Calhoun [email protected] Impact of Silent Errors on Scientific Applications 1/21


Why look at errors in HPC?

Computing systems are vulnerable to failure

• Component cost and power budget dictate how reliable components are

• Error rates may increase as design trade-offs become more mainstream, e.g. low power, cheaper commodity parts

Need efficient error detection and recovery schemes

The slowing of Moore’s Law demands that we get more out of current hardware

• Approximate computation

• Approximate data movement


Silent Error Detection


Error Detection

Resilient applications are critical when operating in fault-prone environments

Solving linear systems is central to many HPC applications

Algebraic Multigrid (AMG) is an efficient solver that can be used as a preconditioner for Conjugate Gradient (CG) and Generalized Minimal Residual (GMRES)

What happens to AMG when it encounters silent data corruption (SDC)?

Possible Bad Outcomes

• Slows down convergence

• Converges to wrong solution

• Program crashes

Possible Good Outcomes

• Nothing

• Converges more quickly


Why should we care about SDC in AMG?

[Figure: relative residual vs. iteration, two panels. Curves are labeled by the flipped bit: sign bit, bit 58, bit 56, bit 54, bit 52, mantissa, and no SDC.]

Impact of a selective single bit-flip injection into a residual calculation of a 2D Poisson problem.
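
The bit labels in the figure map onto the IEEE 754 double-precision layout: bit 63 is the sign, bits 62 to 52 are the exponent, and bits 51 to 0 are the mantissa. The following is a minimal Python sketch of a single bit-flip on a double, meant only to show why exponent bits are so much more damaging than mantissa bits. The helper name `flip_bit` is hypothetical and is not part of FlipIt or any AMG code.

```python
import struct


def flip_bit(value, bit):
    """Return `value` (a Python float / IEEE 754 double) with one bit flipped.

    bit 63 is the sign, bits 62..52 the exponent, bits 51..0 the mantissa.
    Hypothetical helper for illustration only.
    """
    if not 0 <= bit <= 63:
        raise ValueError("bit must be in the range 0..63")
    # Reinterpret the 8 bytes of the double as an unsigned 64-bit integer.
    (as_int,) = struct.unpack("<Q", struct.pack("<d", value))
    # XOR with a one-hot mask toggles exactly one bit.
    as_int ^= 1 << bit
    # Reinterpret the modified integer as a double again.
    (flipped,) = struct.unpack("<d", struct.pack("<Q", as_int))
    return flipped


if __name__ == "__main__":
    residual = 1.0
    for bit in (63, 58, 56, 54, 52, 51, 0):
        print(f"bit {bit:2d}: {residual!r} -> {flip_bit(residual, bit)!r}")
```

Flipping a high exponent bit rescales the value by a large power of two, while flipping a low mantissa bit changes it only in the last digits, which is consistent with the spread between the curves described above.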


AMG Resiliency

Detectors:

• Residual Check - heuristic bound on range of residual

• Energy Check - monitor energetic stability of problem

Recovery:

• Local - idempotent linear algebra operations

• Level - restart from previous level in AMG hierarchy

• Cycle - restart iteration

Configurations:

• Low cost - Residual check + Full recovery (1% at 1K cores)

• All - Full detection + Full recovery (18% at 1K cores)
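
The slides do not state the exact heuristic used by the residual check, so the following Python sketch assumes a simple one: a new residual norm is rejected if it is not finite or if it grows by more than a fixed factor over the previous one. The toy solver, the `growth_bound` parameter, and the function names are all hypothetical; the sketch only illustrates how a detector can be paired with cycle-level recovery (redo the iteration).

```python
import math


def residual_check(prev_norm, new_norm, growth_bound=10.0):
    """Assumed heuristic: accept the new residual norm only if it is finite
    and at most `growth_bound` times larger than the previous norm."""
    if math.isnan(new_norm) or math.isinf(new_norm):
        return False
    return new_norm <= growth_bound * prev_norm


def run_cycles(initial_norm, num_cycles, faulty_cycles):
    """Toy stand-in for a solver: each clean cycle shrinks the residual 10x.

    A cycle listed in `faulty_cycles` is hit once by a transient fault that
    inflates its residual. Rejected cycles are redone (cycle-level recovery).
    Returns the final residual norm and the number of restarts.
    """
    norm = initial_norm
    restarts = 0
    pending_faults = set(faulty_cycles)
    for cycle in range(num_cycles):
        while True:
            candidate = norm * 0.1
            if cycle in pending_faults:
                pending_faults.discard(cycle)  # transient: strikes only once
                candidate *= 1e12
            if residual_check(norm, candidate):
                break
            restarts += 1  # detector fired: recompute this cycle
        norm = candidate
    return norm, restarts


if __name__ == "__main__":
    final_norm, restarts = run_cycles(1.0, 5, faulty_cycles={2})
    print(f"final residual norm: {final_norm:.3e}, restarts: {restarts}")
```

A bound of this kind only catches corruptions that move the residual outside the accepted range; small perturbations pass through, which is one reason to combine it with a second detector such as the energy check.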


Convergence Results Solving 2D Laplacian

[Figure: probability of convergence (0.0 to 1.0) vs. average injection count per trial (0, 1, 14, 37, 96) for three configurations: No Detectors, Low Cost, All.]

Normal AMG (blue) is not resilient to multiple injections

The All configuration suffers more segfaults due to frequent level restarts


Error Propagation


Why look at error propagation?

Large changes in a value or a crash are easy to detect

Latency of detection from a silent error can be high

Understanding how silent errors propagate will allow:

• More resilient code

• Creation of better detectors

• Definition of containment boundaries for local node/process-level recovery


HPCCG - Propagation of All Injection Types

[Figure: "Corruption Due to Injected Error". Percentage of elements corrupted (0 to 10) vs. number of iterations after injection (0 to 50), for the vectors x, r, p, and Ap.]

Average number of elements corrupted on injected rank

Propagation of error due to SpMVs
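
To illustrate why sparse matrix-vector products (SpMVs) spread a single corrupted element, here is a minimal Python sketch that tracks only which vector entries can be affected. It assumes a tridiagonal sparsity pattern and repeatedly feeds the output of one SpMV into the next; both are simplifications chosen for brevity and do not reproduce HPCCG's matrix or the full CG update. The function name `spmv_corruption_spread` is hypothetical.

```python
def spmv_corruption_spread(n, injected_index, num_spmvs):
    """Count possibly corrupted entries after each of `num_spmvs` SpMVs.

    Assumes an n-by-n tridiagonal matrix: row i reads columns i-1, i, i+1.
    An output entry is marked corrupted if any input entry it reads is.
    Returns a list whose k-th item is the count after k SpMVs.
    """
    corrupted = {injected_index}
    history = [len(corrupted)]
    for _ in range(num_spmvs):
        next_corrupted = set()
        for row in range(n):
            # Columns outside 0..n-1 are never in `corrupted`, so the
            # boundary rows need no special handling.
            if any(col in corrupted for col in (row - 1, row, row + 1)):
                next_corrupted.add(row)
        corrupted = next_corrupted
        history.append(len(corrupted))
    return history


if __name__ == "__main__":
    n = 100
    counts = spmv_corruption_spread(n, injected_index=50, num_spmvs=5)
    for step, count in enumerate(counts):
        print(f"after {step} SpMVs: {100.0 * count / n:.1f}% of entries")
```

With this pattern the corrupted region widens by one entry on each side per SpMV; a wider stencil would spread the error faster.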


Lossy Compression


Data Movement Problem

Data movement is becoming a limiter to HPC performance due to:

Bandwidth

Aggregate write bandwidth [Moody et al. 2010]

Power Consumption

Power consumption for CPU operations. [Keckler2011]


Compression

Data compression techniques fall into two categories:

Lossless

• No data loss

• Small compression factors

• Can be costly (time)

• Research on integration in hardware memories [Pekhimenko et al. 2012] [Sardashti, Seznec, and Wood 2014]

Lossy

• Some data perturbation

• Larger compression factors

• Low/moderate cost

• Research on using in HPC checkpointing [Laney et al. 2013] [Ni et al. 2014] [Sasaki et al. 2015]

Solutions to many HPC applications are approximations

Let’s investigate how lossy compression can be effectively used


Lossy Compression

The majority of my remaining PhD research will be devoted to better understanding lossy compression and answering the following questions:

1. How to select a lossy compression error tolerance?

2. Can lossy compression be used for inter-node communication?

3. Can we leverage problem-specific information when selecting a compression tolerance?


Selecting Lossy Compression Error Tolerance

Select a compression tolerance less than the truncation error of the numerical method:

• Use RK4 to solve 1-D PDE with hx = 0.1 (accurate to 1e−4)

• ũh = uh + ε is equivalent to uh if ε < 1e−4 (ũh: the solution perturbed by compression error ε)

• ũh with ε > 1e−4 can be mapped to a related method, e.g., ε = 1e−2 is RK2.

Offline:

• Simulation configuration: numerical method(s), spatial discretization(s) (hx, hy, hz), compression tolerance (ε)

• Guidance: (ε, discretization accuracy) pairs, e.g. (1e−2, RK2), (1e−4, RK4)

Online:

• Run simulation with ε = 1e−4

• Compute: update state variables uh, vh, ph with RK4

• Take checkpoint: lossy checkpoint of uh, vh, ph with ε = 1e−4

• Restart from the lossy checkpoint and continue updating uh, vh, ph with RK4
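
The tolerance rule on this slide can be checked with a few lines of arithmetic. The Python sketch below assumes the truncation error is estimated as h raised to the method order with a constant of 1, which matches the slide's numbers (0.1^4 = 1e−4 for RK4, 0.1^2 = 1e−2 for RK2). The `lossy_round` function is a stand-in uniform quantizer used only to produce an error bounded by ε; it is not SZ or any real compressor, and all function names here are hypothetical.

```python
import math


def truncation_error(h, order):
    """Leading-order truncation error estimate h**order (constant assumed 1)."""
    return h ** order


def effective_order(eps, h):
    """Order p with h**p == eps: the accuracy a perturbation eps 'looks like'."""
    return math.log(eps) / math.log(h)


def lossy_round(value, eps):
    """Stand-in for an error-bounded lossy compressor: snap the value to a
    grid of spacing 2*eps, so the absolute error is at most eps."""
    step = 2.0 * eps
    return round(value / step) * step


if __name__ == "__main__":
    h = 0.1
    eps = truncation_error(h, order=4)  # tolerance matched to RK4
    state = [math.sin(0.37 * i) for i in range(100)]  # made-up state values
    restored = [lossy_round(u, eps) for u in state]
    max_err = max(abs(a - b) for a, b in zip(state, restored))
    print(f"tolerance eps      = {eps:.3e}")
    print(f"max |restored - u| = {max_err:.3e}")
    print(f"eps = 1e-2 behaves like order {effective_order(1e-2, h):.2f}")
```

The design choice being illustrated is that the compressor's error bound, not its compression ratio, is the quantity tied to the numerical method: once ε exceeds the truncation error, the result is only as accurate as a lower-order scheme.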


PlasComCM - Max Error

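[Figure: maximum error between the solution restarted from lossy checkpoints and the reference solution vs. time-step (0 to 70,000), log scale from 1e−7 to 1e−4. Curves: density, x momenta, y momenta, energy, with the truncation error shown for comparison and three restarts marked.]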

• Average compression factor of 7x with SZ-1.3 [Di and Cappello 2016]

• Error accumulates with each checkpoint, but is slowly removed

• Accumulation is still less than truncation error


Tools and Software


Fault Injection - FlipIt

Fault Model:

• Memories protected by error correcting codes (ECC)

• Faults arise in processor computations and manifest as register perturbations (single bit-flip)

FlipIt is an LLVM compiler pass that instruments compiled code to support fault injection at runtime

flipit-cc pipeline: *.c → compile to LLVM IR → *.bc → instrument with FlipIt → *.bc → compile to object code → *.o

FlipIt allows for user-defined injection probabilities to customize fault injection for a given system


Fault Visualization - FaultSight

Many fault injection frameworks report similar data

No common tool for fault injection visualization

FaultSight Goals:

• Easily generate high-level graphs for a fault injection campaign

• Quickly select subsets of fault injection data

• Ability to conduct hypothesis testing


Summary

Future HPC systems will experience more errors due to hardware selection and algorithm choice

To prepare for this future, my thesis focuses on:

• SDC detection and recovery for AMG

• Error propagation study

• How lossy compression can be more effectively used

Tools:

• FlipIt - Fault Injector (https://github.com/aperson40/flipit)

• FaultSight - Fault Visualizer (https://github.com/einarhorn/faultsight)


Acknowledgments

• This work was sponsored by the Air Force Office of Scientific Research under grant FA9550-12-1-0478

• This work is supported by the Office of Science, Office of Advanced Scientific Computing Research, of the U.S. Department of Energy under Contract DEAC0206CH11307

• This research is part of the Blue Waters sustained-petascale computing project, which is supported by the National Science Foundation (awards OCI-0725070 and ACI-1238993) and the state of Illinois. Blue Waters is a joint effort of the University of Illinois at Urbana-Champaign and its National Center for Supercomputing Applications
