From Detection to Optimization: Impact of
Silent Errors on Scientific Applications
SC’16 Doctoral Showcase
Jon Calhoun, Luke Olson (Advisor), and Marc Snir (Advisor)
University of Illinois at Urbana-Champaign
15 November 2016
Jon Calhoun [email protected] Impact of Silent Errors on Scientific Applications 1/21
Why look at errors in HPC?
Computing systems are vulnerable to failure
• Component cost and power budget dictate how reliable components are
• Rates of errors may increase as design trade-offs become more mainstream, e.g., low power, cheaper commodity parts
Need efficient error detection and recovery schemes
Slowing of Moore’s Law demands that we get more out of current hardware
• Approximate computation
• Approximate data movement
Silent Error Detection
Error Detection
Resilient applications are critical when operating in fault-prone environments
Solving linear systems is central to many HPC applications
Algebraic Multigrid (AMG) is an efficient solver that can be used as a preconditioner for Conjugate Gradient (CG) and Generalized Minimal Residual (GMRES)
What happens to AMG when it encounters silent data corruption (SDC)?
Possible Bad Outcomes
• Slows down convergence
• Converges to wrong solution
• Program crash
Possible Good Outcomes
• Nothing
• Converges more quickly
Why should we care about SDC in AMG?
[Figure: relative residual versus iteration for single bit-flips injected at the sign bit, bits 58, 56, 54, and 52, and the mantissa, compared with a run with no SDC]
Impact of a selective single bit-flip injection into a residual calculation of a 2D Poisson problem.
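The sensitivity to bit position can be reproduced on a single value. The Python sketch below is illustrative only and is not taken from the talk; the helper name flip_bit is hypothetical. It flips one bit of an IEEE 754 double, where bit 63 is the sign, bits 52 to 62 hold the exponent, and bits 0 to 51 hold the mantissa.

```python
import struct


def flip_bit(value: float, bit: int) -> float:
    """Return `value` with one bit of its 64-bit IEEE 754 encoding flipped.

    Hypothetical helper for illustration: bit 0 is the least significant
    mantissa bit and bit 63 is the sign bit.
    """
    if not 0 <= bit < 64:
        raise ValueError("bit must be in the range 0..63")
    # Reinterpret the double as an unsigned 64-bit integer.
    (bits,) = struct.unpack("<Q", struct.pack("<d", value))
    # XOR toggles exactly one bit; then reinterpret back as a double.
    return struct.unpack("<d", struct.pack("<Q", bits ^ (1 << bit)))[0]


if __name__ == "__main__":
    for bit in (63, 58, 56, 52, 0):
        print(f"bit {bit:2d}: 1.0 -> {flip_bit(1.0, bit)!r}")
```

Starting from 1.0, a flip in bit 58 rescales the value by 2^-64 and a flip in bit 56 by 2^-16, while a flip in the lowest mantissa bit changes it by only 2^-52. This is consistent with high exponent bits producing the largest residual jumps.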
AMG Resiliency
Detectors:
• Residual Check - heuristic bound on range of residual
• Energy Check - monitor energetic stability of problem
Recovery:
• Local - idempotent linear algebra operations
• Level - restart from previous level in AMG hierarchy
• Cycle - restart iteration
Configurations:
• Low cost - Residual check + Full recovery (1% at 1K cores)
• All - Full detection + Full recovery (18% at 1K cores)
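A residual check combined with cycle-level recovery can be sketched in a few lines. This is a simplified illustration, not the detector from the talk: it uses a weighted Jacobi iteration on a 1-D Poisson system in place of AMG, and the growth bound of 2.0, the function names, and the 1e30 stand-in for a high-order bit-flip are all assumptions.

```python
import math


def residual_norm(x, b):
    """2-norm of b - A x for the 1-D Poisson matrix A = tridiag(-1, 2, -1)."""
    n = len(x)
    total = 0.0
    for i in range(n):
        left = x[i - 1] if i > 0 else 0.0
        right = x[i + 1] if i < n - 1 else 0.0
        total += (b[i] - (2.0 * x[i] - left - right)) ** 2
    return math.sqrt(total)


def jacobi_sweep(x, b, omega=2.0 / 3.0):
    """One weighted Jacobi sweep; it reads only the old iterate `x`."""
    n = len(x)
    new = x[:]
    for i in range(n):
        left = x[i - 1] if i > 0 else 0.0
        right = x[i + 1] if i < n - 1 else 0.0
        new[i] = (1.0 - omega) * x[i] + omega * (b[i] + left + right) / 2.0
    return new


def looks_corrupted(prev_norm, new_norm, growth_bound=2.0):
    """Heuristic residual check (assumed bound): flag non-finite or fast-growing residuals."""
    return not math.isfinite(new_norm) or new_norm > growth_bound * prev_norm


def solve(b, iters, fault_at=None, detect=True):
    """Iterate, optionally corrupting one sweep, and redo any flagged sweep."""
    x = [0.0] * len(b)
    norm = residual_norm(x, b)
    detections = 0
    for k in range(iters):
        candidate = jacobi_sweep(x, b)
        if k == fault_at:
            candidate[len(b) // 2] = 1e30  # stand-in for a high-order bit-flip
        new_norm = residual_norm(candidate, b)
        if detect and looks_corrupted(norm, new_norm):
            detections += 1
            candidate = jacobi_sweep(x, b)  # cycle-level recovery: redo the iteration
            new_norm = residual_norm(candidate, b)
        x, norm = candidate, new_norm
    return x, norm, detections
```

Because the sweep reads only the previous iterate, redoing it is safe, which mirrors the role idempotent operations play in local recovery. A fixed growth bound cannot catch small perturbations, such as low-order mantissa flips, since they stay within the allowed range.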
Convergence Results Solving 2D Laplacian
[Figure: probability of convergence versus average injection count per trial (0, 1, 14, 37, 96) for the No Detectors, Low Cost, and All configurations]
Normal AMG (blue) not resilient to multiple injections
All suffers more segfaults due to frequent level restarts
Error Propagation
Why look at error propagation?
Large changes in a value or a crash are easy to detect
Latency of detection from a silent error can be high
Understanding how silent errors propagate will allow:
• More resilient code
• Creation of better detectors
• Definition of containment boundaries for local node/process-level recovery
HPCCG - Propagation of All Injection Types
[Figure: Corruption Due to Injected Error. Percentage of elements corrupted versus number of iterations after injection (0 to 50) for the vectors x, r, p, and Ap]
Average number of elements corrupted on injected rank
Propagation of error due to SpMVs
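How a sparse matrix-vector product (SpMV) spreads a single corrupted element can be sketched by tracking which entries are "tainted" through the sparsity pattern. This is an illustrative sketch under stated assumptions: the tridiagonal pattern is a stand-in for the sparse matrix in HPCCG, the function names are hypothetical, and the result is a structural upper bound because it ignores numerical cancellation.

```python
def tridiagonal_pattern(n):
    """Column indices of the nonzeros in each row of an n x n tridiagonal matrix."""
    return [[j for j in (i - 1, i, i + 1) if 0 <= j < n] for i in range(n)]


def propagate(pattern, tainted):
    """One SpMV y = A x: y[i] is tainted if it reads any tainted x[j]."""
    return {i for i, cols in enumerate(pattern) if any(j in tainted for j in cols)}


def corrupted_counts(n, start, spmvs):
    """Number of tainted elements after 0, 1, ..., `spmvs` repeated SpMVs."""
    pattern = tridiagonal_pattern(n)
    tainted = {start}
    counts = [len(tainted)]
    for _ in range(spmvs):
        tainted = propagate(pattern, tainted)
        counts.append(len(tainted))
    return counts


if __name__ == "__main__":
    print(corrupted_counts(n=100, start=50, spmvs=5))
```

With this pattern the tainted region grows by one element on each side per SpMV until it reaches the boundary. A wider stencil would spread the corruption faster.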
Lossy Compression
Data Movement Problem
Data movement is becoming a limiter to HPC performance due to:
Bandwidth
Aggregate write bandwidth [Moody et al. 2010]
Power Consumption
Power consumption for CPU operations. [Keckler2011]
Compression
Data compression techniques fall into two categories:
Lossless
• No data loss
• Small compression factors
• Can be costly (time)
• Research on integration in hardware memories [Pekhimenko et al. 2012] [Sardashti, Seznec, and Wood 2014]
Lossy
• Some data perturbation
• Larger compression factors
• Low/moderate cost
• Research on using in HPC checkpointing [Laney et al. 2013] [Ni et al. 2014] [Sasaki et al. 2015]
Solutions to many HPC applications are approximations
Let’s investigate how lossy compression can be effectively used
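The idea of trading a bounded perturbation for a larger compression factor can be sketched with uniform quantization followed by a lossless pass. This is a minimal illustration under stated assumptions, not the compressor used in this work: the function names are hypothetical, zlib stands in for the lossless stage, and the quantized codes are assumed to fit in 32-bit integers.

```python
import math
import struct
import zlib


def compress(values, eps):
    """Quantize each value to a multiple of 2*eps, then apply lossless zlib.

    Rounding to the nearest multiple of 2*eps keeps the absolute error of
    every value within eps (up to floating-point rounding).
    """
    if eps <= 0.0:
        raise ValueError("eps must be positive")
    codes = [round(v / (2.0 * eps)) for v in values]
    # Assumption: every code fits in a signed 32-bit integer.
    return zlib.compress(struct.pack(f"<{len(codes)}i", *codes))


def decompress(blob, eps):
    """Invert `compress`, returning the perturbed values."""
    raw = zlib.decompress(blob)
    codes = struct.unpack(f"<{len(raw) // 4}i", raw)
    return [c * 2.0 * eps for c in codes]


def sample_field(n):
    """A smooth test signal standing in for simulation state."""
    return [math.sin(2.0 * math.pi * i / n) for i in range(n)]


if __name__ == "__main__":
    data = sample_field(1000)
    blob = compress(data, eps=1e-4)
    restored = decompress(blob, eps=1e-4)
    max_err = max(abs(a - b) for a, b in zip(data, restored))
    print(f"compression factor: {8 * len(data) / len(blob):.1f}")
    print(f"max abs error: {max_err:.2e}")
```

The compression factor depends on the data and on the tolerance. A looser tolerance produces fewer distinct codes and therefore a smaller compressed size.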
Lossy Compression
The majority of my remaining PhD research will be devoted to better understanding lossy compression and answering the following questions:
1. How to select a lossy compression error tolerance?
2. Can lossy compression be used for inter-node communication?
3. Can we leverage problem-specific information when selecting a compression tolerance?
Selecting Lossy Compression Error Tolerance
Select a compression tolerance less than the truncation error of the numerical method:
• Use RK4 to solve a 1-D PDE with hx = 0.1 (accurate to 1e−4)
• The perturbed solution ũh = uh + ε is equivalent to uh if ε < 1e−4.
• ũh with ε > 1e−4 can be mapped to a related method, e.g., ε = 1e−2 is RK2.
Offline
• Inputs: simulation configuration, numerical method(s), spatial discretization(s) hx, hy, hz, and compression tolerance (ε)
• Output: guidance as (ε, discretization accuracy) pairs, e.g. (1e−2, RK2), (1e−4, RK4)
Online
• Run simulation with ε = 1e−4
• Compute: update state variables uh, vh, ph with RK4
• Take checkpoint: lossy checkpoint of uh, vh, ph with ε = 1e−4
• Restart: resume updating state variables uh, vh, ph with RK4
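The mapping from a compression tolerance to an equivalent discretization accuracy can be sketched as a small lookup. This is an illustrative sketch, not the guidance procedure from the talk: it assumes the truncation error of an order-p method is approximately h^p with 0 < h < 1, and the function names and the small floating-point slack are assumptions.

```python
def truncation_error(order, h):
    """Assumed model: an order-p method with step h is accurate to about h**p."""
    return h ** order


def equivalent_method(eps, h, methods):
    """Return the highest-order method whose truncation error hides `eps`.

    `methods` maps a method name to its order. A tolerance at or below a
    method's truncation error is treated as indistinguishable from that
    method's own error. Returns None when eps exceeds every method's error.
    """
    slack = 1.0 + 1e-9  # absorbs rounding in h**order
    hidden = [
        (order, name)
        for name, order in methods.items()
        if eps <= truncation_error(order, h) * slack
    ]
    return max(hidden)[1] if hidden else None


if __name__ == "__main__":
    methods = {"RK2": 2, "RK4": 4}
    for eps in (1e-2, 1e-4):
        print(eps, equivalent_method(eps, h=0.1, methods=methods))
```

With hx = 0.1, this reproduces the two pairs quoted on the slide, (1e−2, RK2) and (1e−4, RK4).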
PlasComCM - Max Error
[Figure: maximum error ‖uh − ũh‖∞ versus timestep (0 to 70000) for density, x momenta, y momenta, and energy, plotted against the truncation error, with three restarts marked]
• Average compression factor of 7x with SZ-1.3 [Di and Cappello 2016]
• Error accumulates with each checkpoint, but is slowly removed
• Accumulation is still less than truncation error
Tools and Software
Fault Injection - FlipIt
Fault Model:
• Memories protected by error correcting codes (ECC)
• Faults arise in processor computations and manifest as register perturbations (single bit-flip)
FlipIt is an LLVM compiler pass that instruments compiled code to support fault injection at runtime
flipit-cc pipeline: *.c → compile to LLVM IR → *.bc → instrument with FlipIt → *.bc → compile to object code → *.o
FlipIt allows for user-defined injection probabilities to customize fault injection for a given system
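The fault model above can be mimicked in a few lines to show what probability-driven injection looks like. This Python sketch is not FlipIt and does not use its API; the class BitFlipInjector, its methods, and the site labels are hypothetical. It flips a random single bit in the result of an arithmetic operation with a user-chosen probability and logs where the flip occurred.

```python
import random
import struct


class BitFlipInjector:
    """Hypothetical injector: flips one random bit of a result with a given probability."""

    def __init__(self, probability, seed=0):
        if not 0.0 <= probability <= 1.0:
            raise ValueError("probability must be in [0, 1]")
        self.probability = probability
        self.rng = random.Random(seed)  # seeded so a campaign is reproducible
        self.log = []  # (site, bit) for every injected fault

    def maybe_corrupt(self, value, site):
        """Return `value`, possibly with a single bit of its double encoding flipped."""
        if self.rng.random() >= self.probability:
            return value
        bit = self.rng.randrange(64)
        (bits,) = struct.unpack("<Q", struct.pack("<d", value))
        self.log.append((site, bit))
        return struct.unpack("<d", struct.pack("<Q", bits ^ (1 << bit)))[0]


def dot(x, y, injector):
    """Dot product whose multiplications are exposed to fault injection."""
    total = 0.0
    for i, (a, b) in enumerate(zip(x, y)):
        total += injector.maybe_corrupt(a * b, site=f"mul[{i}]")
    return total


if __name__ == "__main__":
    injector = BitFlipInjector(probability=0.5, seed=1)
    print(dot([1.0, 2.0, 3.0], [4.0, 5.0, 6.0], injector))
    print(injector.log)
```

Logging the site and bit of each injection is what makes later analysis possible, for example grouping outcomes by bit position.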
Fault Visualization - FaultSight
Many fault injection frameworks report similar data
No common tool for fault injection visualization
FaultSight Goals:
• Easily generate high-level graphs for a fault injection campaign
• Quickly select subsets of fault injection data
• Ability to conduct hypothesis testing
Summary
Future HPC systems will experience more errors due to hardware selection and algorithm choice
To prepare for this future, my thesis focuses on:
• SDC detection and recovery for AMG
• Error propagation study
• How lossy compression can be more effectively used
Tools:
• FlipIt - Fault Injector (https://github.com/aperson40/flipit)
• FaultSight - Fault Visualizer (https://github.com/einarhorn/faultsight)
Acknowledgments
• This work was sponsored by the Air Force Office of Scientific Research under grant FA9550-12-1-0478
• This work is supported by the Office of Science, Office of Advanced Scientific Computing Research, of the U.S. Department of Energy under Contract DEAC0206CH11307
• This research is part of the Blue Waters sustained-petascale computing project, which is supported by the National Science Foundation (awards OCI-0725070 and ACI-1238993) and the state of Illinois. Blue Waters is a joint effort of the University of Illinois at Urbana-Champaign and its National Center for Supercomputing Applications