
From Detection to Optimization: Impact of Silent Errors on Scientific Applications

SC’16 Doctoral Showcase

Jon Calhoun, Luke Olson (Advisor), and Marc Snir (Advisor)

University of Illinois at Urbana-Champaign

15 November 2016

Jon Calhoun jccalho2@illinois.edu Impact of Silent Errors on Scientific Applications 1/21


Why look at errors in HPC?

Computing systems are vulnerable to failure

• Component cost and power budget dictate how reliable components are

• Rates of errors may increase as design trade-offs become more mainstream, e.g. low power, cheaper commodity parts

Need efficient error detection and recovery schemes

The slowing of Moore’s Law demands that we get more out of current hardware

• Approximate computation

• Approximate data movement


Silent Error Detection


Error Detection

Resilient applications are critical when operating in fault-prone environments

Solving linear systems is central to many HPC applications

Algebraic Multigrid (AMG) is an efficient solver that can be used as a preconditioner for Conjugate Gradient (CG) and Generalized Minimal Residual (GMRES)

What happens to AMG when it encounters silent data corruption (SDC)?

Possible Bad Outcomes

• Slows down convergence

• Converges to the wrong solution

• Program crash

Possible Good Outcomes

• Nothing

• Converges more quickly


Why should we care about SDC in AMG?

[Figure: relative residual versus iteration for single bit-flip injections in the sign bit, bits 58, 56, 54, and 52, and the mantissa, compared with a run without SDC.]

Impact of a selective single bit-flip injection into a residual calculation of a 2D Poisson problem.
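To make the bit positions in the figure concrete, here is a minimal sketch of a single bit-flip in a double-precision value. It assumes the usual IEEE-754 binary64 layout with bit 0 as the least significant mantissa bit, bits 52 to 62 as the exponent, and bit 63 as the sign; the helper name `flip_bit` is hypothetical and is not part of any tool named in these slides.

```python
import struct


def flip_bit(value: float, bit: int) -> float:
    """Return `value` with one bit of its IEEE-754 binary64 encoding flipped.

    Assumed numbering: bit 0 is the least significant mantissa bit,
    bits 52-62 hold the exponent, and bit 63 is the sign.
    """
    if not 0 <= bit <= 63:
        raise ValueError("bit must be in [0, 63]")
    # Reinterpret the double as a 64-bit unsigned integer.
    (as_int,) = struct.unpack("<Q", struct.pack("<d", value))
    # Toggle the requested bit and reinterpret the result as a double.
    (flipped,) = struct.unpack("<d", struct.pack("<Q", as_int ^ (1 << bit)))
    return flipped


if __name__ == "__main__":
    residual = 1.0e-3  # illustrative value, not taken from the experiments
    for bit in (63, 58, 56, 54, 52, 20):
        print(f"bit {bit:2d}: {residual!r} -> {flip_bit(residual, bit)!r}")
```

Flips in the high exponent bits change the magnitude by many orders, while flips in low mantissa bits perturb the value only slightly, which is consistent with the spread of curves described in the figure.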


AMG Resiliency

Detectors:

• Residual Check - heuristic bound on range of residual

• Energy Check - monitor energetic stability of problem

Recovery:

• Local - idempotent linear algebra operations

• Level - restart from previous level in AMG hierarchy

• Cycle - restart iteration

Configurations:

• Low cost - Residual check + Full recovery (1% at 1K cores)

• All - Full detection + Full recovery (18% at 1K cores)
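A residual check combined with a cycle-level restart might look roughly like the sketch below. Everything in it is an assumption made for illustration: the growth factor of 10, the Jacobi iteration on a 2x2 system standing in for an AMG cycle, and all function and class names. The slides do not give the actual bound or implementation.

```python
import math

# Small diagonally dominant test system (hypothetical stand-in for an AMG solve).
A = [[4.0, 1.0], [1.0, 3.0]]
b = [1.0, 2.0]


def residual_norm(x):
    """Max-norm of b - A x."""
    return max(abs(b[i] - sum(A[i][j] * x[j] for j in range(2))) for i in range(2))


def suspicious(prev_norm, new_norm, growth_bound=10.0):
    """Heuristic check: flag non-finite residuals or implausibly large growth."""
    if not math.isfinite(new_norm):
        return True
    return new_norm > growth_bound * prev_norm


class FaultyJacobi:
    """One Jacobi sweep; corrupts its output on a chosen call to mimic SDC."""

    def __init__(self, fault_at_call):
        self.calls = 0
        self.fault_at_call = fault_at_call

    def __call__(self, x):
        self.calls += 1
        new = [
            (b[i] - sum(A[i][j] * x[j] for j in range(2) if j != i)) / A[i][i]
            for i in range(2)
        ]
        if self.calls == self.fault_at_call:
            new[0] *= 2.0 ** 40  # mimics a flip in a high exponent bit
        return new


def solve_with_cycle_restart(iterate, x0, tol=1e-10, max_iters=200):
    """Iterate until converged, redoing any iteration the detector rejects."""
    x = list(x0)
    prev = residual_norm(x)
    restarts = 0
    for _ in range(max_iters):
        if prev <= tol:
            break
        candidate = iterate(x)
        new = residual_norm(candidate)
        if suspicious(prev, new):
            restarts += 1  # discard the candidate and repeat from the saved state
            continue
        x, prev = candidate, new
    return x, restarts


if __name__ == "__main__":
    solution, restarts = solve_with_cycle_restart(FaultyJacobi(fault_at_call=3), [0.0, 0.0])
    print(solution, restarts)
```

The design choice shown is that recovery only needs the last accepted iterate, so a rejected cycle costs one extra iteration rather than a full restart of the solve.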


Convergence Results Solving 2D Laplacian

[Figure: probability of convergence (0.0 to 1.0) versus average injection count per trial (0, 1, 14, 37, 96) for the No Detectors, Low Cost, and All configurations.]

Normal AMG (blue) is not resilient to multiple injections

The All configuration suffers more segfaults due to frequent level restarts


Error Propagation


Why look at error propagation?

Large changes in a value or a crash are easy to detect

Latency of detection from a silent error can be high

Understanding how silent errors propagate will allow:

• More resilient code

• Creation of better detectors

• Definition of containment boundaries for local node/process-level recovery


HPCCG - Propagation of All Injection Types

[Figure: "Corruption Due to Injected Error": percentage of elements corrupted (0 to 10) versus number of iterations after injection (0 to 50) for the vectors x, r, p, and Ap.]

Average number of elements corrupted on injected rank

Propagation of error due to SpMVs
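The way a sparse matrix-vector product (SpMV) spreads a single corrupted entry can be sketched with a toy experiment. The code below is an illustrative stand-in, not the HPCCG study itself: it assumes a 1D Poisson stencil tridiag(-1, 2, -1) instead of HPCCG's 3D operator, applies only repeated SpMVs rather than full CG iterations, and uses hypothetical function names.

```python
def spmv_laplacian_1d(x):
    """Compute y = A x for A = tridiag(-1, 2, -1), the 1D Poisson stencil."""
    n = len(x)
    y = [0.0] * n
    for i in range(n):
        left = x[i - 1] if i > 0 else 0.0
        right = x[i + 1] if i < n - 1 else 0.0
        y[i] = 2.0 * x[i] - left - right
    return y


def corrupted_counts(n, index, steps, delta=1.0):
    """Count entries that differ between a clean and a corrupted run.

    Entry `index` of the input vector is perturbed by `delta`; both vectors
    are then pushed through `steps` SpMVs. Small integer-valued inputs keep
    the arithmetic exact, so `!=` is a reliable corruption test here.
    """
    clean = [float(j % 3) for j in range(n)]
    faulty = list(clean)
    faulty[index] += delta
    counts = [sum(c != f for c, f in zip(clean, faulty))]
    for _ in range(steps):
        clean = spmv_laplacian_1d(clean)
        faulty = spmv_laplacian_1d(faulty)
        counts.append(sum(c != f for c, f in zip(clean, faulty)))
    return counts


if __name__ == "__main__":
    print(corrupted_counts(n=41, index=20, steps=5))
```

In this toy setting each SpMV widens the corrupted region by one stencil neighbor on each side, so corruption spreads gradually. In a full CG iteration, dot products and vector updates would spread the error further than SpMV alone.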


Lossy Compression


Data Movement Problem

Data movement is becoming a limiter to HPC performance due to:

Bandwidth

Aggregate write bandwidth [Moody et al. 2010]

Power Consumption

Power consumption for CPU operations [Keckler 2011]


Compression

Data compression techniques fall into two categories:

Lossless

• No data loss

• Small compression factors

• Can be costly (time)

• Research on integration in hardware memories [Pekhimenko et al. 2012] [Sardashti, Seznec, and Wood 2014]

Lossy

• Some data perturbation

• Larger compression factors

• Low/moderate cost

• Research on using in HPC checkpointing [Laney et al. 2013] [Ni et al. 2014] [Sasaki et al. 2015]

Solutions to many HPC applications are approximations

Let’s investigate how lossy compression can be effectively used


Lossy Compression

The majority of my remaining PhD research will be devoted to better understanding lossy compression and answering the following questions:

1. How to select a lossy compression error tolerance?

2. Can lossy compression be used for inter-node communication?

3. Can we leverage problem-specific information when selecting a compression tolerance?


Selecting Lossy Compression Error Tolerance

Select a compression tolerance less than the truncation error of the numerical method:

• Use RK4 to solve a 1-D PDE with h_x = 0.1 (accurate to 1e−4)

• The perturbed solution ũ_h = u_h + ε is equivalent to u_h if ε < 1e−4.

• ũ_h with ε > 1e−4 can be mapped to a related method, e.g., ε = 1e−2 corresponds to RK2.

[Diagram: lossy checkpointing workflow. Offline: the simulation configuration (numerical method(s), spatial discretization(s), h_x, h_y, h_z, compression tolerance ε) yields guidance pairs (ε, discretization accuracy), e.g. (1e−2, RK2), (1e−4, RK4). Online: run the simulation with ε = 1e−4; compute by updating the state variables u_h, v_h, p_h with RK4; take a lossy checkpoint of u_h, v_h, p_h with ε = 1e−4; on restart, resume updating the state variables with RK4.]
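The tolerance-selection idea can be sketched as an error-bounded round trip. The code below is only an illustration under stated assumptions: it uses plain uniform quantization as a stand-in for a real compressor such as SZ, the `GUIDANCE` table simply copies the two example pairs from the slide, and all function names are hypothetical.

```python
import math

# (tolerance, method) pairs copied from the slide's examples.
GUIDANCE = [(1e-2, "RK2"), (1e-4, "RK4")]


def tolerance_for(method):
    """Look up the compression tolerance recorded for a numerical method."""
    for eps, name in GUIDANCE:
        if name == method:
            return eps
    raise KeyError(method)


def lossy_compress(values, eps):
    """Quantize to integer codes with bin width 2*eps (absolute error bound eps)."""
    step = 2.0 * eps
    return [round(v / step) for v in values]


def lossy_decompress(codes, eps):
    """Reconstruct approximate values from the integer codes."""
    step = 2.0 * eps
    return [c * step for c in codes]


def roundtrip_error(method, n=101):
    """Max absolute error after a compress/decompress cycle on a sample field."""
    eps = tolerance_for(method)
    u = [math.sin(2.0 * math.pi * i / (n - 1)) for i in range(n)]
    u_tilde = lossy_decompress(lossy_compress(u, eps), eps)
    return max(abs(a - b) for a, b in zip(u, u_tilde))


if __name__ == "__main__":
    for _, method in GUIDANCE:
        print(method, tolerance_for(method), roundtrip_error(method))
```

Because each value is rounded to the nearest multiple of 2ε, the reconstruction differs from the original by at most ε up to floating-point rounding. That bounded perturbation is the property needed to keep a restarted state within the truncation error of the method.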


PlasComCM - Max Error

[Figure: error ũ_h − u_h versus time-step (0 to 70000; log-scale error axis from 1e−7 to 1e−4) for density, x momenta, y momenta, and energy, plotted against the truncation error, with three restarts marked.]

• Average compression factor of 7x with SZ-1.3 [Di and Cappello 2016]

• Error accumulates with each checkpoint, but is slowly removed

• Accumulation is still less than truncation error


Tools and Software


Fault Injection - FlipIt

Fault Model:

• Memories protected by error correcting codes (ECC)

• Faults arise in processor computations and manifest as register perturbations (single bit-flip)

FlipIt is an LLVM compiler pass that instruments compiled code to support fault injection at runtime

[Diagram: flipit-cc pipeline: *.c → (compile to LLVM IR) → *.bc → (instrument with FlipIt) → *.bc → (compile to object code) → *.o]

FlipIt allows for user-defined injection probabilities to customize fault injection for a given system


Fault Visualization - FaultSight

Many fault injection frameworks report similar data

No common tool for fault injection visualization

FaultSight Goals:

• Easily generate high-level graphs for a fault injection campaign

• Quickly select subsets of fault injection data

• Ability to conduct hypothesis testing


Summary

Future HPC systems will experience more errors due to hardware selection and algorithm choice

To prepare for this future, my thesis focuses on:

• SDC detection and recovery for AMG

• Error propagation study

• How lossy compression can be more effectively used

Tools:

• FlipIt - Fault Injector (https://github.com/aperson40/flipit)

• FaultSight - Fault Visualizer (https://github.com/einarhorn/faultsight)


Acknowledgments

• This work was sponsored by the Air Force Office of Scientific Research under grant FA9550-12-1-0478

• This work is supported by the Office of Science, Office of Advanced Scientific Computing Research, of the U.S. Department of Energy under Contract DEAC0206CH11307

• This research is part of the Blue Waters sustained-petascale computing project, which is supported by the National Science Foundation (awards OCI-0725070 and ACI-1238993) and the state of Illinois. Blue Waters is a joint effort of the University of Illinois at Urbana-Champaign and its National Center for Supercomputing Applications
