
How to Deal with Radiation: Evaluation and Mitigation of GPU Soft Errors

April 6th 2016 – San José, CA

Paolo Rech

Paolo Rech – GTC2016, San José, CA

Motivation: Automotive Applications

Pedestrian Detection System: embedded GPUs increase car safety

Observed error

The insurance does not cover those accidents caused by:

[…]

exposure to ionizing radiation*

*Paolo’s car insurance


Motivation: HPC Industry

Titan (Oak Ridge National Lab): 18,688 GPUs

High probability of having a GPU corrupted

Titan MTBF is ~44h* (*field data from Tiwari et al., HPCA’15)

Only crashes/hangs are considered (the correct output is unknown)

We perform radiation experiments to measure Silent Data Corruption (SDC) rates
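The Titan figures can be turned into a rough per-GPU failure interval. The sketch below is illustrative only: it assumes that all failures come from the GPUs and that devices fail independently at the same constant rate, so that failure rates simply add up. Neither assumption is stated in the slides, and the function names are made up for this example.

```python
HOURS_PER_YEAR = 24 * 365.25  # 8766 hours


def per_device_mtbf_hours(system_mtbf_h: float, n_devices: int) -> float:
    """Per-device MTBF when n identical, independent devices share the failures."""
    return system_mtbf_h * n_devices


def system_mtbf_hours(device_mtbf_h: float, n_devices: int) -> float:
    """System MTBF for n identical, independent devices (rates add up)."""
    return device_mtbf_h / n_devices


titan_gpus = 18_688     # from the slide
titan_mtbf_h = 44.0     # from the slide (crashes/hangs only)

gpu_mtbf_h = per_device_mtbf_hours(titan_mtbf_h, titan_gpus)
print(f"per-GPU MTBF: {gpu_mtbf_h:,.0f} h "
      f"(~{gpu_mtbf_h / HOURS_PER_YEAR:.0f} years)")
```

Under these assumptions a single GPU fails only about once in several decades, yet the machine as a whole is interrupted every couple of days.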


Outline

Radiation Effects Essentials

Evaluation of GPU Radiation Sensitivity

- Experimental Setup

- Parallel Algorithms Error Rates

Hardening Solution Efficiency

Effects of Code Optimizations on HPC Reliability

What’s the Plan?


Terrestrial Radiation Environment

Galactic cosmic rays interact with the atmosphere, producing a shower of energetic particles: muons, pions, protons, gamma rays, neutrons

Neutron flux: 13 n/(cm² h) at sea level

The neutron flux increases exponentially with altitude


Radiation Effects - Soft Errors

Soft Errors: the device is not permanently damaged, but the particle may generate:

• One or more bit-flips: Single Event Upset (SEU), Multiple Bit Upset (MBU)

• Transient voltage pulse: Single Event Transient (SET)

[Figure: an ionizing particle flips stored bits (0 → 1, 1 → 0); an ionizing particle striking logic in front of a flip-flop (FF) produces a transient pulse]
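To see why a single upset matters, the following sketch flips one bit of a single-precision value in software. It is only an illustration of the effect of an SEU on stored data, not a model of the physical mechanism; `flip_bit` is a hypothetical helper written for this example.

```python
import struct


def flip_bit(value: float, bit: int) -> float:
    """Return `value`, stored as IEEE 754 single precision, with one bit inverted.

    Bit 0 is the least significant mantissa bit, bit 31 is the sign bit.
    """
    if not 0 <= bit < 32:
        raise ValueError("bit must be in [0, 31]")
    (word,) = struct.unpack("<I", struct.pack("<f", value))
    (flipped,) = struct.unpack("<f", struct.pack("<I", word ^ (1 << bit)))
    return flipped


# Same upset, very different consequences depending on which bit is hit.
for bit in (0, 23, 30, 31):
    print(f"bit {bit:2d}: 1.0 -> {flip_bit(1.0, bit)!r}")
```

A flip in the low mantissa bits barely changes the value, whereas a flip in the exponent or sign turns 1.0 into 0.5, infinity, or -1.0. This is one reason why not every SDC is equally critical.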


Radiation Effects on GPUs

[Figure: CUDA GPU block diagram. The GPU contains DRAM, an L2 cache, a blocks scheduler and dispatcher, and an array of Streaming Multiprocessors (SMs). Each SM contains an instruction cache, warp schedulers, dispatch units, a register file, cores, and shared memory / L1 cache. Successive builds of the slide mark an increasing number of locations with an X, first inside one SM and finally across the array of SMs.]


Silent Data Corruption vs Crash & Hang

Silent Data Corruption: errors in
- data cache
- register files
- logic gates (ALU)
- scheduler

Crash & Hang: errors in
- instruction cache
- scheduler / dispatcher
- PCI-e bus controller




Radiation Test Facilities

Weapons Neutron Research facility


Neutron Spectrum

@LANSCE: 1.8×10⁹ n/(cm² h)

@NYC: 13 n/(cm² h)

$$\text{cross section}\ [\mathrm{cm^2}] = \frac{\text{errors/s}}{\text{flux}\ [\mathrm{n/(cm^2\,s)}]}$$

The cross section is the probability for 1 neutron to generate an output error.

$$\text{Error Rate} = \text{cross section} \times \text{flux}\ (13\ \mathrm{n/(cm^2\,h)})$$
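The two relations above can be sketched in a few lines. The flux values are the ones quoted on the slide; the error count and beam time in the usage lines are invented purely for illustration, and the function names are hypothetical. FIT is taken in its usual sense of failures per 10⁹ device-hours.

```python
NYC_FLUX = 13.0       # n/(cm^2 h), sea-level flux quoted on the slide
LANSCE_FLUX = 1.8e9   # n/(cm^2 h), beam flux quoted on the slide


def cross_section_cm2(errors: int, flux_n_cm2_h: float, beam_hours: float) -> float:
    """cross section = (errors / time) / flux = errors / fluence."""
    fluence = flux_n_cm2_h * beam_hours  # neutrons per cm^2 received by the device
    return errors / fluence


def fit(sigma_cm2: float, flux_n_cm2_h: float = NYC_FLUX) -> float:
    """Failures In Time: expected failures per 1e9 hours at the given flux."""
    return sigma_cm2 * flux_n_cm2_h * 1e9


# Hypothetical experiment: 100 output errors observed in 10 hours of beam.
sigma = cross_section_cm2(errors=100, flux_n_cm2_h=LANSCE_FLUX, beam_hours=10.0)
print(f"cross section: {sigma:.3e} cm^2")
print(f"FIT at NYC:    {fit(sigma):.1f}")
print(f"acceleration:  {LANSCE_FLUX / NYC_FLUX:.2e}x natural flux")
```

Because the beam flux is roughly eight orders of magnitude higher than the natural flux, a few hours of beam time stand in for many thousands of years of natural exposure.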


GPU Radiation Test Setup

[Photo: devices under test in the beam line, labeled: microcontrollers, FPGA, SoC, FPGA SoC, Flash, GPU, APU]

[Photo labels: AMD APU, NVIDIA K20, Intel Xeon-Phi, desktop PCs]

GPU power control circuitry is out of beam




Tested Parallel Codes

- Matrix Multiplication (linear algebra)
- Matrix Transpose (memory)
- FFT (signal processing)
- Needleman–Wunsch (biology)
- lavaMD (physical simulations)
- Hotspot (physical simulations)
- HOG (pedestrian detection)

The selected algorithms are heterogeneous and representative


Experimental Results (ECC OFF)

[Figure: Failure In Time @NYC (log scale, 1 to 10000) for MxM, MTrans, FFT, NW, lavaMD, Hotspot; series: Crashes, SDC. Annotations on groups of codes: "execution dominated by memory latencies", "codes that heavily employ registers", "higher #instructions"]

SDC rate varies ~3 orders of magnitude (details in Oliveira et al., Trans. Comp. 2015)

Matrix Multiplication: 6.46×10² FIT

1 error every 15 years

Titan: 18,688 errors every 15 years (1 error every 7.3h)


Error Correction Code - SDC

[Figure: SDC Failure In Time @NYC (log scale, 1 to 10000) for MxM, FFT, NW, lavaMD, Hotspot; series: ECC OFF, ECC ON]

ECC reduces the SDC FIT by ~1 order of magnitude (there is almost no code dependence)

Unprotected resources:
- logic gates
- scheduler
- queues
- flip-flops
- …


Error Correction Code - Crash

[Figure: Crash Failure In Time @NYC (log scale, 1 to 10000); series: ECC OFF, ECC ON]

ECC increases the Crash FIT by about 50% (there is almost no code dependence):
- Double Bit Errors cause a crash
- the scheduler is not protected
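The behavior described on the two ECC slides (single-bit errors disappear, double-bit errors turn into crashes) is what a single-error-correcting, double-error-detecting (SECDED) code produces. The slides do not describe the actual ECC implementation, so the sketch below is only a toy: a Hamming(7,4) code extended with an overall parity bit, protecting 4 data bits, with hypothetical `encode` and `decode` helpers.

```python
def encode(nibble):
    """Encode 4 data bits into an 8-bit SECDED codeword (list of 0/1).

    Index 0 holds the overall parity; indices 1..7 form a Hamming(7,4) word
    with parity bits at positions 1, 2 and 4.
    """
    d = [(nibble >> i) & 1 for i in range(4)]
    code = [0] * 8
    code[3], code[5], code[6], code[7] = d
    code[1] = code[3] ^ code[5] ^ code[7]
    code[2] = code[3] ^ code[6] ^ code[7]
    code[4] = code[5] ^ code[6] ^ code[7]
    code[0] = sum(code[1:]) % 2  # makes the parity of the whole word even
    return code


def decode(code):
    """Return (status, nibble); status is 'ok', 'corrected' or 'uncorrectable'."""
    code = list(code)
    syndrome = 0
    for pos in range(1, 8):
        if code[pos]:
            syndrome ^= pos  # XOR of set positions is 0 for a valid word
    odd_parity = sum(code) % 2 == 1
    if syndrome == 0 and not odd_parity:
        status = "ok"
    elif odd_parity:
        code[syndrome] ^= 1  # single flip; syndrome 0 means the parity bit itself
        status = "corrected"
    else:
        return "uncorrectable", None  # even number of flips: detect only
    nibble = code[3] | (code[5] << 1) | (code[6] << 2) | (code[7] << 3)
    return status, nibble


word = encode(0b1011)
single = list(word); single[5] ^= 1
double = list(single); double[2] ^= 1
print(decode(word), decode(single), decode(double))
```

In a real system the "uncorrectable" outcome is typically escalated to the application or driver, which is consistent with the slide's observation that double bit errors show up as crashes rather than as silent corruption.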



ECC ON – SDC vs Crashes

[Figure: Failure In Time @NYC (log scale, 1 to 10000) with ECC ON; series: Crash, SDC]

When ECC is ON, crashes are more likely to occur than SDCs (this is GOOD for HPC centers!)




Algorithm Based Fault Tolerance

ABFT: technique designed specifically for an algorithm. ABFT requires: input coding, algorithm modification, and output decoding with error detection/correction

[Figure: A × B = M, with checksums (∑) appended to A and B. The col-check and row-check carried in M are compared with the col-sum and row-sum recomputed from M; mismatches (X) locate the corrupted element.]

References: Freivalds ’79; Huang and Abraham ’84; Rech et al., TNS ’13
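The checksum scheme in the figure can be sketched as follows, in the spirit of Huang and Abraham's row/column checksums. This is a minimal pure-Python illustration with integer matrices and hypothetical helper names, not the hardened GPU kernel from the cited papers. With floating-point data the equality checks would need a tolerance.

```python
def matmul(a, b):
    """Plain triple-loop matrix product of two lists of lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def with_column_checksum(a):
    """Input coding for A: append a row holding the sum of each column."""
    return a + [[sum(col) for col in zip(*a)]]


def with_row_checksum(b):
    """Input coding for B: append a column holding the sum of each row."""
    return [row + [sum(row)] for row in b]


def check_and_correct(full):
    """Output decoding: compare recomputed sums with the carried checksums."""
    n, m = len(full) - 1, len(full[0]) - 1
    data = [row[:m] for row in full[:n]]
    bad_rows = [i for i in range(n) if sum(data[i]) != full[i][m]]
    bad_cols = [j for j in range(m)
                if sum(data[i][j] for i in range(n)) != full[n][j]]
    if not bad_rows and not bad_cols:
        return "ok", data
    if len(bad_rows) == 1 and len(bad_cols) == 1:
        i, j = bad_rows[0], bad_cols[0]
        data[i][j] += full[i][m] - sum(data[i])  # restore from the row checksum
        return "corrected", data
    return "detected", data


A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]
full = matmul(with_column_checksum(A), with_row_checksum(B))
full[0][1] += 100  # simulate a corrupted output element
print(check_and_correct(full))
```

A single corrupted element breaks exactly one row sum and one column sum, which both locates and repairs it. Anything more complex is reported as detected but not corrected in this sketch.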


FFT Hardening Idea

[Figure: the unhardened FFT is wrapped with input coding, output de-coding, and error detection]

References: J.Y. Jou and Abraham ’88; Pilla et al., TNS ’13


ECC vs ABFT

[Figure: FIT (log scale, 1 to 10000) for MxM and FFT, SDC and crash; series: Unhardened, ECC, ABFT]

ECC reduces FIT by ~10 times, ABFT by ~56 times!

ECC increases crashes by 50%, ABFT by 10%!


ECC vs ABFT

[Figure: normalized execution time (0 to 1.6) for MxM and FFT; series: Unhardened, ECC, ABFT]

ECC overhead for MxM is 10%, for FFT 50%! ABFT overhead is less than 20%


Duplication With Comparison

Spatial: block i and i+N are duplicated

    SM0: a   b   c   d
    SM1: a'  b'  c'  d'
    (time →)

E-O Spatial: block i and i+1 are duplicated

    SM0: b   b'  d   d'
    SM1: a   a'  c   c'
    (time →)

Time: each thread executes the operations twice

    SM0: b & b'   d & d'
    SM1: a & a'   c & c'
    (time →)
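The time-redundant variant is the easiest to sketch in host code: run the same computation twice and compare the results. The snippet below is a language-level illustration only. On a GPU the duplication happens inside each thread or across blocks, and `run_with_time_dwc` and its `fault` hook are hypothetical names introduced here to simulate a soft error.

```python
def run_with_time_dwc(kernel, data, fault=None):
    """Execute `kernel` twice on the same input and compare the two results.

    `fault`, if given, corrupts the first result to emulate a soft error.
    DWC only detects the mismatch; recovery (e.g. re-execution) is up to the caller.
    """
    first = kernel(data)
    if fault is not None:
        first = fault(first)
    second = kernel(data)
    if first != second:
        raise RuntimeError("DWC mismatch: possible silent data corruption")
    return first


def square_all(values):
    return [v * v for v in values]


def corrupt_first_element(result):
    return [result[0] ^ 1] + result[1:]  # flip the lowest bit of one output


print(run_with_time_dwc(square_all, [1, 2, 3]))
try:
    run_with_time_dwc(square_all, [1, 2, 3], fault=corrupt_first_element)
except RuntimeError as exc:
    print("detected:", exc)
```

The cost is visible directly in the structure: every protected computation runs twice, which matches the roughly 2x execution time overhead reported for Time DWC.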


Hotspot - DWC results*

[Figure: FIT (log scale, 1 to 1000), SDC and crash; series: Unhardened, ECC, Spatial DWC, E-O Spatial DWC, Time DWC]

Spatial DWC detects all SDCs

E-O Spatial DWC detects 80% of SDCs

Time DWC detects 90% of SDCs

Only Time DWC reduces crashes (no additional block scheduling required)

DWC is promising: it is generic, easily implemented, and effective…

BUT the execution time overhead for Spatial DWC and E-O Spatial DWC is 2.5x, and for Time DWC it is 2x (data is not copied)

Duplicate only the code’s critical portions

*details in Oliveira et al., Trans. Nucl. Sci., 2014




Code Optimizations (just baked!)

Novel and incremental algorithm implementations are continuously developed [Rodinia suite].

Do code optimizations impact GPU reliability?

Three case studies (naïve vs optimized), each with different input sizes (on GPUs, optimizations depend on the workload):
- Matrix Multiplication
- FFT
- Needleman–Wunsch


Experimental Results – MxM

[Figure: normalized FIT [a.u.] (1 to 26) vs input size (1024, 2048, 4096, 8192); series: Naive-SDC, Naive-Crash, Opt-SDC, Opt-Crash]

~20% FIT increase with input size, caused by the additional threads instantiated

Opt-MxM FIT is higher. Errors in obsolete data are NOT critical: higher hit rate in the caches = higher FIT


Mean Workload Between Failures

We need to consider cross section, execution time, and throughput

[Figure: effect of optimization (Opt.) on the cross section and FIT (neutrons hitting the GPU) and on the execution time]

[Figure: GFLOPs (0 to 600) vs input size (1024, 2048, 4096, 8192); series: MxM-naive, MxM-opt]

Mean WORKLOAD Between Failures: amount of data produced before failure
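One way to combine the three ingredients is sketched below. The formula is an assumption made for illustration (workload per run, times the number of runs that fit in one mean time between failures) and is not taken verbatim from the slides. All numeric inputs are invented.

```python
NYC_FLUX = 13.0  # n/(cm^2 h), sea-level flux quoted earlier in the slides


def mwbf(workload_per_run: float, exec_time_h: float,
         sigma_cm2: float, flux_n_cm2_h: float = NYC_FLUX) -> float:
    """Mean Workload Between Failures (assumed definition, see lead-in).

    error rate = cross section * flux        [errors / h]
    MTBF       = 1 / error rate              [h]
    MWBF       = workload * MTBF / exec time [workload units]
    """
    mtbf_h = 1.0 / (sigma_cm2 * flux_n_cm2_h)
    runs_before_failure = mtbf_h / exec_time_h
    return workload_per_run * runs_before_failure


# Hypothetical comparison: the optimized code has twice the cross section
# (more errors per unit time) but finishes the same work four times faster.
work = 1024 * 1024  # output elements per run
naive = mwbf(work, exec_time_h=4.0 / 3600, sigma_cm2=1e-8)
opt = mwbf(work, exec_time_h=1.0 / 3600, sigma_cm2=2e-8)
print(f"MWBF ratio opt/naive: {opt / naive:.2f}")
```

With these made-up numbers the optimized version still delivers twice as much correct output between failures, which is the kind of trade-off the metric is meant to expose.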


MxM - MWBF

[Figure: MWBF [data elaborated] (up to 4×10¹³) vs input size (1024, 2048, 4096, 8192); series: Naive-SDC, Opt-SDC]

Opt-MxM produces more correct data than Naïve-MxM

Opt-MxM efficiency increases with input size!

If the code is optimized, the throughput increases more than the error rate!




What’s The Plan?

Exascale = 55x Titan. Can we afford a 55x error rate? Probably not.

Self Driving Cars. Reliability is a major concern!

How we can help:
- Understand SDC criticality. Not all errors significantly affect the output: are there “acceptable” SDCs?
- Propose selective-hardening solutions for GPUs (duplicate only what matters, what REALLY matters)
- Understand how algorithm/code/compiler optimizations will impact the error rate of future machines
- Use fault injection to better understand error propagation


Acknowledgments

Caio Lunardi

Caroline Aguiar

Laercio Pilla

Daniel Oliveira

Vinicius Frattin

Philippe Navaux

Luigi Carro

Chris Frost

Nathan DeBardeleben

Sean Blanchard

Heather Quinn

Thomas Fairbanks

Steve Wender

Timothy Tsai

Siva Hari

Steve Keckler

David Kaeli

NUCAR group
