How to Deal with Radiation: Evaluation and Mitigation of GPUs Soft-Errors
April 6th 2016 – San José, CA
Paolo Rech
Paolo Rech – GTC2016, San José, CA
Motivation: Automotive Applications
Pedestrian Detection System: embedded GPUs increase car safety
Observed error
The insurance does not cover those accidents caused by:
[…]
exposure to ionizing radiation*
*Paolo’s car insurance
Motivation: HPC Industry
Titan (Oak Ridge National Lab): 18,688 GPUs
High probability of having a GPU corrupted
Titan MTBF is ~44h (field data from Tiwari et al., HPCA’15)
Only Crashes/Hangs are considered in the field data (the correct output is unknown)
We perform radiation experiments to measure Silent Data Corruption (SDC) rates
Outline
Radiation Effects Essentials
Evaluation of GPU Radiation Sensitivity
- Experimental Setup
- Parallel Algorithms Error Rates
Hardening Solution Efficiency
Effects of Code Optimizations on HPC Reliability
What’s the Plan?
Terrestrial Radiation Environment
Galactic cosmic rays interact with the atmosphere, generating a shower of energetic particles:
muons, pions, protons, gamma rays, neutrons
Neutron flux at sea level: 13 n/(cm² h)
The neutron flux increases exponentially with altitude
Radiation Effects - Soft Errors
Soft errors: the device is not permanently damaged, but the particle may generate:
• One or more bit-flips (an ionizing particle turns a stored 0 into a 1, or a 1 into a 0)
  Single Event Upset (SEU)
  Multiple Bit Upset (MBU)
• A transient voltage pulse in the logic feeding a flip-flop (FF)
  Single Event Transient (SET)
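The bit-flip model on this slide can be mimicked in software. The sketch below is only an illustration, not the fault model used in the experiments: it assumes the corrupted value is a 32-bit IEEE-754 float, and the helper name `flip_bit` is hypothetical. It shows how much the effect of a single upset depends on which bit is hit.

```python
import struct


def flip_bit(value: float, bit: int) -> float:
    """Return `value` with one bit of its float32 encoding inverted (an SEU)."""
    if not 0 <= bit < 32:
        raise ValueError("bit must be in [0, 31]")
    # Reinterpret the float32 as an unsigned 32-bit integer.
    (raw,) = struct.unpack("<I", struct.pack("<f", value))
    # XOR with a one-hot mask inverts exactly one bit.
    (corrupted,) = struct.unpack("<f", struct.pack("<I", raw ^ (1 << bit)))
    return corrupted


if __name__ == "__main__":
    for bit in (0, 23, 30, 31):
        print(f"bit {bit:2d}: 1.0 -> {flip_bit(1.0, bit)!r}")
```

A flip in the low mantissa bits barely changes the value, while a flip in the exponent or sign changes it drastically. A Multiple Bit Upset could be modeled the same way by XOR-ing a mask with several bits set.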
Radiation Effects on GPUs
[Diagram: a CUDA GPU with DRAM, the Blocks Scheduler and Dispatcher, the L2 Cache, and an array of Streaming Multiprocessors (SMs). Each SM contains an Instruction Cache, Warp Schedulers, Dispatch Units, a Register File, cores, and Shared Memory / L1 Cache. Over the animation steps, X marks accumulate on individual resources, cores, and whole SMs.]
Silent Data Corruption vs Crash & Hang
Silent Data Corruption, errors in:
- data cache
- register files
- logic gates (ALU)
- scheduler
Crash & Hang, errors in:
- instruction cache
- scheduler / dispatcher
- PCI-e bus controller
Radiation Test Facilities
Weapons Neutron Research (WNR)
Neutron Spectrum
Flux @LANSCE: $1.8 \times 10^{9}$ n/(cm² h)
Flux @NYC: 13 n/(cm² h)
Cross section: the probability for 1 neutron to generate an output error
$\sigma\ [\mathrm{cm^2}] = \dfrac{\text{errors/s}}{\text{flux}\ [\mathrm{n/(cm^2\,s)}]}$
Error Rate = cross section × flux (13 n/(cm² h))
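The arithmetic on this slide can be written out as a short sketch. The two flux values come from the slide. The function names and the beam-run numbers (50 errors in 2 hours) are hypothetical and only there to make the example concrete. FIT is taken here as expected failures per 10^9 device-hours.

```python
LANSCE_FLUX = 1.8e9  # n/(cm^2 h), accelerated beam (from the slide)
NYC_FLUX = 13.0      # n/(cm^2 h), sea-level reference (from the slide)


def cross_section(errors: int, fluence: float) -> float:
    """Cross section in cm^2: observed errors per unit fluence (n/cm^2).

    Equivalent to (errors/s) / (n/(cm^2 s)) over the same exposure time.
    """
    return errors / fluence


def fit(sigma_cm2: float, flux: float = NYC_FLUX) -> float:
    """Failures In Time: expected errors per 10^9 hours at the given flux."""
    return sigma_cm2 * flux * 1e9


if __name__ == "__main__":
    beam_hours = 2.0  # hypothetical exposure
    errors = 50       # hypothetical error count
    sigma = cross_section(errors, LANSCE_FLUX * beam_hours)
    print(f"cross section: {sigma:.3e} cm^2")
    print(f"FIT @NYC:      {fit(sigma):.1f}")
    print(f"acceleration:  {LANSCE_FLUX / NYC_FLUX:.2e}x")
```

The last line shows why beam testing is practical: one hour in a beam of this flux corresponds to roughly 10^8 hours of natural exposure at sea level.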
GPU Radiation Test Setup
[Photo labels: microcontrollers, FPGA, SoC, Flash, GPU, APU]
[Photo labels: AMD APU, NVIDIA K20, Intel Xeon-Phi, desktop PCs]
GPU power control circuitry is out of beam
Tested Parallel Codes
- Matrix Multiplication (linear algebra)
- Matrix Transpose (memory)
- FFT (signal processing)
- Needleman–Wunsch (biology)
- lavaMD (physical simulations)
- Hotspot (physical simulations)
- HOG (pedestrian detection)
The selected algorithms are heterogeneous and representative
Experimental Results (ECC OFF)
[Bar chart: Failure In Time (FIT) @NYC, log scale from 1 to 10000; Crashes and SDC for MxM, MTrans, FFT, NW, lavaMD, Hotspot]
SDC rate varies by ~3 orders of magnitude
(details in Oliveira et al., Trans. Comp. 2015)
Chart annotations:
- execution dominated by memory latencies
- codes that heavily employ registers
- higher #instructions
Matrix Multiplication: 6.46102 FIT
1 error every 15 years
Titan: 18,688 errors every 15 years
(1 error every 7.3h)
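The scaling from one GPU to the whole machine can be checked with a few lines. This is a sketch under two assumptions that are not stated on the slide: errors on different GPUs are independent and equally likely, and a year is 8,760 hours. The function names are hypothetical.

```python
HOURS_PER_YEAR = 8760.0


def mtbf_hours_from_fit(fit: float) -> float:
    """Mean time between failures in hours for a rate given in FIT."""
    return 1e9 / fit


def fleet_mtbf_hours(device_mtbf_hours: float, n_devices: int) -> float:
    """MTBF of a fleet of identical, independent devices."""
    return device_mtbf_hours / n_devices


if __name__ == "__main__":
    per_gpu = 15 * HOURS_PER_YEAR  # "1 error every 15 years"
    titan = fleet_mtbf_hours(per_gpu, 18_688)
    print(f"Titan: one error every {titan:.2f} h")
```

With the rounded 15-year figure this gives about 7 hours between errors, in the same range as the 7.3 h quoted on the slide.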
Error Correction Code - SDC
[Bar chart: SDC Failure In Time (FIT) @NYC, log scale from 1 to 10000; ECC OFF vs ECC ON for MxM, FFT, NW, lavaMD, Hotspot]
ECC reduces the SDC FIT by ~1 order of magnitude
(there is almost no code dependence)
Unprotected resources:
- logic gates
- scheduler
- queues
- flip-flops
- …
Error Correction Code - Crash
[Bar chart: Crash Failure In Time (FIT) @NYC, log scale from 1 to 10000; ECC OFF vs ECC ON for MxM, FFT, NW, lavaMD, Hotspot]
ECC increases the Crash FIT by about 50%
(there is almost no code dependence)
Chart annotations:
- Double Bit Errors cause a crash
- scheduler is not protected
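Why a double bit error ends in a crash rather than a correction can be illustrated with a textbook single-error-correct, double-error-detect (SECDED) code. The sketch below uses an extended Hamming(8,4) code on 4 data bits. It is an assumption chosen for brevity, not the ECC scheme implemented in the tested GPUs, and the function names are hypothetical.

```python
DATA_POSITIONS = (3, 5, 6, 7)  # positions 1, 2, 4 hold parity; 0 holds overall parity


def encode(nibble: int) -> list:
    """Encode 4 data bits into an 8-bit extended Hamming codeword."""
    code = [0] * 8
    for i, pos in enumerate(DATA_POSITIONS):
        code[pos] = (nibble >> i) & 1
    for p in (1, 2, 4):
        # Parity bit p covers every position whose index has bit p set.
        code[p] = sum(code[pos] for pos in range(1, 8) if pos & p and pos != p) % 2
    code[0] = sum(code[1:]) % 2  # overall parity over positions 1..7
    return code


def decode(code: list):
    """Return (nibble, status); nibble is None when the error is uncorrectable."""
    code = list(code)
    syndrome = 0
    for pos in range(1, 8):
        if code[pos]:
            syndrome ^= pos
    overall = sum(code) % 2
    if syndrome == 0 and overall == 0:
        status = "ok"
    elif overall == 1:
        code[syndrome] ^= 1  # single error; syndrome 0 means the overall parity bit
        status = "corrected"
    else:
        return None, "double-bit error detected"  # cannot be corrected
    nibble = sum(code[pos] << i for i, pos in enumerate(DATA_POSITIONS))
    return nibble, status


if __name__ == "__main__":
    word = encode(0b1011)
    word[5] ^= 1
    print(decode(word))  # single upset
    word[6] ^= 1
    print(decode(word))  # second upset in the same word
```

A single flipped bit is located by the syndrome and repaired, so the program never sees it. Two flipped bits are only detected, and the system has no correct value to return, which is consistent with the slide's note that such events surface as crashes instead of silent corruptions.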
ECC ON – SDC vs Crashes
[Bar chart: Failure In Time (FIT) @NYC, log scale from 1 to 10000; Crash and SDC with ECC ON for MxM, FFT, NW, lavaMD, Hotspot]
When ECC is ON, Crashes are more likely to occur than SDCs (this is GOOD for HPC centers!)
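Algorithm Based Fault Tolerance
ABFT: technique designed specifically for an algorithm.
ABFT requires: input coding, algorithm modification, and output decoding with error detection/correction
[Diagram: A x B = M. A is extended with a column-checksum row and B with a row-checksum column, so M carries col-check and row-check vectors. Comparing them with the col-sum and row-sum recomputed from M locates the corrupted elements (X).]
References: Freivalds ’79; Huang and Abraham ’84; Rech et al., TNS ’13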
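The checksum scheme in the diagram can be sketched for matrix multiplication in the style of Huang and Abraham. This is a minimal CPU-side illustration in plain Python with integer matrices, not the GPU implementation evaluated in the talk. All function names are hypothetical, and with floating-point data the equality checks would need a tolerance.

```python
def matmul(a, b):
    """Plain matrix product of two lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def add_column_checksum(a):
    """Input coding for A: append a row holding the sum of each column."""
    return a + [[sum(col) for col in zip(*a)]]


def add_row_checksum(b):
    """Input coding for B: append a column holding the sum of each row."""
    return [row + [sum(row)] for row in b]


def check_and_correct(c_full):
    """Output decoding: locate and fix one corrupted element, in place."""
    n, m = len(c_full) - 1, len(c_full[0]) - 1
    bad_rows = [i for i in range(n) if sum(c_full[i][:m]) != c_full[i][m]]
    bad_cols = [j for j in range(m)
                if sum(c_full[i][j] for i in range(n)) != c_full[n][j]]
    if not bad_rows and not bad_cols:
        return "ok"
    if len(bad_rows) == 1 and len(bad_cols) == 1:
        i, j = bad_rows[0], bad_cols[0]
        # The row checksum tells how far the corrupted element is off.
        c_full[i][j] += c_full[i][m] - sum(c_full[i][:m])
        return "corrected"
    return "detected"  # more than one mismatch: flag, do not correct


if __name__ == "__main__":
    a = [[1, 2], [3, 4]]
    b = [[5, 6], [7, 8]]
    c_full = matmul(add_column_checksum(a), add_row_checksum(b))
    c_full[0][1] ^= 1 << 4  # inject a bit-flip into one output element
    print(check_and_correct(c_full), [row[:2] for row in c_full[:2]])
```

The product of the two encoded inputs already contains the checksums of the result, so verification costs only the row and column sums of the output. A mismatch in exactly one row and one column pinpoints the faulty element.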
FFT Hardening Idea
[Diagram: input coding → unhardened FFT → output de-coding → error detection]
References: J.Y. Jou and Abraham ’88; Pilla et al., TNS ’13
ECC vs ABFT
[Bar chart: FIT, log scale from 1 to 10000; SDC and crash for MxM and FFT; Unhardened vs ECC vs ABFT]
ECC reduces the FIT by ~10 times, ABFT by ~56 times!
ECC increases Crashes by 50%, ABFT by 10%!
ECC vs ABFT
[Bar chart: normalized execution time, from 0 to 1.6; MxM and FFT; Unhardened vs ECC vs ABFT]
ECC overhead is 10% for MxM and 50% for FFT!
ABFT overhead is less than 20%
Duplication With Comparison
Spatial: blocks i and i+N are duplicated
SM0: a b c d
SM1: a' b' c' d'
E-O Spatial: blocks i and i+1 are duplicated
SM0: b b' d d'
SM1: a a' c c'
Time: a thread executes the operations twice
SM0: b & b', d & d'
SM1: a & a', c & c'
(in each diagram, time runs from left to right)
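The Time variant is the easiest one to mimic outside of CUDA. The sketch below is a plain-Python stand-in, not the kernels used in the experiments. `dwc`, `sum_of_squares`, and `inject_fault` are hypothetical names, and the injected fault is a single bit-flip in an integer result.

```python
def dwc(kernel, *args):
    """Time DWC: run the computation twice on the same inputs and compare."""
    first = kernel(*args)
    second = kernel(*args)
    if first != second:
        # A mismatch means at least one of the two executions was corrupted.
        raise RuntimeError("DWC mismatch: possible silent data corruption")
    return first


def sum_of_squares(values):
    """Stand-in for the protected computation."""
    return sum(v * v for v in values)


def inject_fault(kernel, faulty_call):
    """Wrap `kernel` so that its `faulty_call`-th invocation returns a flipped bit."""
    calls = 0

    def wrapped(*args):
        nonlocal calls
        calls += 1
        result = kernel(*args)
        return result ^ 1 if calls == faulty_call else result

    return wrapped


if __name__ == "__main__":
    print(dwc(sum_of_squares, [1, 2, 3]))
    try:
        dwc(inject_fault(sum_of_squares, 2), [1, 2, 3])
    except RuntimeError as err:
        print(err)
```

DWC only detects errors. It does not say which copy is correct, and a fault that corrupts both executions identically (for example in shared input data) goes unnoticed.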
Hotspot - DWC results*
[Bar chart: FIT, log scale from 1 to 1000; SDC and crash for Unhardened, ECC, Spatial DWC, E-O Spatial DWC, Time DWC]
Spatial DWC detects all SDCs
E-O Spatial DWC detects 80% of SDCs
Time DWC detects 90% of SDCs
Only Time DWC reduces Crashes (no additional block scheduling required)
DWC is promising: it is generic, easily implemented, and effective…
BUT the execution time overhead is 2.5x for Spatial DWC and E-O Spatial DWC, and 2x for Time DWC (data is not copied)
Duplicate only the code’s critical portions
*details in Oliveira et al., Trans. Nucl. Sci., 2014
Code Optimizations (just baked!)
Novel and incremental algorithm implementations are continuously developed [Rodinia suite].
Do code optimizations impact GPU reliability?
Three case studies (naïve vs optimized), each with different input sizes (on GPUs, optimizations depend on the workload):
- Matrix Multiplication
- FFT
- Needleman–Wunsch
Experimental Results – MxM
[Bar chart: normalized FIT [a.u.], from 1 to 26; Naive-SDC, Naive-Crash, Opt-SDC, Opt-Crash for input sizes 1024, 2048, 4096, 8192]
~20% FIT increase with input size, caused by the additional threads instantiated
Opt-MxM FIT is higher. Errors in obsolete data are NOT critical: higher hit rate in the caches = higher FIT
Mean Workload Between Failures
We need to consider cross section, execution time, and throughput
[Diagram: optimization (Opt.) affects both the cross section and FIT (neutrons hitting the GPU) and the execution time]
[Chart: GFLOPs, from 0 to 600; MxM-naive vs MxM-opt for input sizes 1024, 2048, 4096, 8192]
Mean Workload Between Failures (MWBF): amount of data produced before a failure
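One way to turn this definition into a formula is to multiply the work done per execution by the number of executions that fit into the mean time between failures. The sketch below follows that reading. The formula is an assumption based on the slide's wording, and the function name and all numbers are hypothetical.

```python
def mwbf(workload_per_run: float, exec_time_s: float,
         sigma_cm2: float, flux_per_cm2_s: float) -> float:
    """Mean Workload Between Failures.

    workload_per_run: data produced by one execution (any unit)
    exec_time_s:      duration of one execution in seconds
    sigma_cm2:        cross section of the code in cm^2
    flux_per_cm2_s:   neutron flux in n/(cm^2 s)
    """
    mtbf_s = 1.0 / (sigma_cm2 * flux_per_cm2_s)  # mean time between failures
    return workload_per_run * mtbf_s / exec_time_s


if __name__ == "__main__":
    flux = 13.0 / 3600.0  # 13 n/(cm^2 h) expressed per second
    # Hypothetical case: the optimized code has twice the cross section
    # but finishes four times faster.
    naive = mwbf(1e6, exec_time_s=4.0, sigma_cm2=1e-8, flux_per_cm2_s=flux)
    opt = mwbf(1e6, exec_time_s=1.0, sigma_cm2=2e-8, flux_per_cm2_s=flux)
    print(f"MWBF ratio opt/naive: {opt / naive:.2f}")
```

Under this reading, an optimization pays off whenever it raises throughput by a larger factor than it raises the cross section, even though the FIT alone looks worse.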
MxM - MWBF
[Chart: MWBF [data elaborated], from 1.00E+00 to 4.00E+13; Naive-SDC vs Opt-SDC for input sizes 1024, 2048, 4096, 8192]
Opt-MxM produces more correct data than Naïve-MxM
Opt-MxM efficiency increases with input size!
If the code is optimized, the throughput increases more than the error rate!
What’s The Plan?
Exascale = 55x Titan. Can we afford a 55x error rate? Probably not.
Self-driving cars: reliability is a major concern!
How we can help:
- Understand SDC criticality. Not all errors significantly affect the output: are there “acceptable” SDCs?
- Propose selective-hardening solutions for GPUs (duplicate only what matters, what REALLY matters)
- Understand how algorithm/code/compiler optimizations will impact the error rate of future machines
- Use fault injection to better understand error propagation
Acknowledgments
Caio Lunardi
Caroline Aguiar
Laercio Pilla
Daniel Oliveira
Vinicius Frattin
Philippe Navaux
Luigi Carro
Chris Frost
Nathan DeBardeleben
Sean Blanchard
Heather Quinn
Thomas Fairbanks
Steve Wender
Timothy Tsai
Siva Hari
Steve Keckler
David Kaeli
NUCAR group