How to Deal with Radiation: Evaluation and Mitigation of GPUs Soft-Errors
April 6th 2015, San José, CA
Paolo Rech


Paolo Rech – GTC2016, San José, CA


Motivation: Automotive Applications

Pedestrian Detection System: embedded GPUs increase car safety

[Figure: pedestrian detection output with an observed error]

The insurance does not cover those accidents caused by:
[…]
exposure to ionizing radiation*

*Paolo’s car insurance


Motivation: HPC Industry

Titan (Oak Ridge National Lab): 18,688 GPUs
High probability of having a GPU corrupted
Titan MTBF is ~44h*
*(field data from Tiwari et al., HPCA’15)

Only Crashes/Hangs considered (correct output is unknown)

We perform radiation experiments to measure Silent Data Corruption (SDC) rates


Outline

Radiation Effects Essentials

Evaluation of GPU Radiation Sensitivity

- Experimental Setup

- Parallel Algorithms Error Rates

Hardening Solution Efficiency

Code Optimization Effects on HPC Reliability

What’s the Plan?


Terrestrial Radiation Environment

Galactic cosmic rays interact with the atmosphere, producing a shower of energetic particles:
Muons, Pions, Protons, Gamma rays, Neutrons

13 n/(cm² h) @sea level

Neutron flux increases exponentially with altitude


Radiation Effects - Soft Errors

Soft Errors: the device is not permanently damaged, but the particle may generate:

• One or more bit-flips
  Single Event Upset (SEU)
  Multiple Bit Upset (MBU)
  [Figure: an ionizing particle strikes a memory cell; the stored 0 becomes 1, or the stored 1 becomes 0]

• Transient voltage pulse
  Single Event Transient (SET)
  [Figure: an ionizing particle strikes logic connected to a flip-flop (FF)]
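To make the bit-flip idea concrete, here is a minimal sketch in Python that inverts one bit of a value stored as an IEEE-754 single-precision float. The helper name `flip_bit` is hypothetical and is not part of the talk; it only illustrates how strongly the effect of a single upset depends on which bit is hit.

```python
import struct


def flip_bit(value: float, bit: int) -> float:
    """Return `value`, stored as an IEEE-754 float32, with one bit inverted."""
    if not 0 <= bit < 32:
        raise ValueError("bit must be in [0, 31]")
    # Reinterpret the float32 bit pattern as an unsigned 32-bit integer.
    (raw,) = struct.unpack("<I", struct.pack("<f", value))
    # XOR with a one-hot mask models a Single Event Upset on that bit.
    (flipped,) = struct.unpack("<f", struct.pack("<I", raw ^ (1 << bit)))
    return flipped


original = 1.0
print("mantissa LSB flipped:", flip_bit(original, 0))   # tiny perturbation
print("exponent MSB flipped:", flip_bit(original, 30))  # huge change
print("sign bit flipped:    ", flip_bit(original, 31))  # sign inversion
```

A flip in a low mantissa bit may be numerically negligible, while a flip in the exponent or sign changes the result completely.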


Radiation Effects on GPUs

[Figure: CUDA GPU block diagram with DRAM, Blocks Scheduler and Dispatcher, L2 Cache, and an array of Streaming Multiprocessors (SM). Each SM contains an Instruction Cache, Warp Schedulers, Dispatch Units, a Register File, cores, and Shared Memory / L1 Cache. X marks indicate corrupted elements, which spread from a single resource to several cores and SMs.]


Silent Data Corruption vs Crash & Hang

Silent Data Corruption
Errors in:
- data cache
- register files
- logic gates (ALU)
- scheduler

Crash & Hang
Errors in:
- instruction cache
- scheduler / dispatcher
- PCI-e bus controller


Radiation Test Facilities

Weapons Neutron Research (WNR)


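Neutron Spectrum

@LANSCE: $1.8 \times 10^{9}$ n/(cm² h)
@NYC: 13 n/(cm² h)

$$\text{cross section}\ [\mathrm{cm}^2] = \frac{\text{errors/s}}{\text{flux}\ [\mathrm{n}/(\mathrm{cm}^2\,\mathrm{s})]}$$

(the cross section is the probability for 1 neutron to generate an output error)

$$\text{cross section} \times \text{flux}\ (13\ \mathrm{n}/(\mathrm{cm}^2\,\mathrm{h})) = \text{Error Rate}$$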
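The two relations above can be sketched in a few lines of Python. The beam-run numbers (100 errors in 2 hours) are hypothetical and only illustrate the arithmetic; the two flux values are the ones quoted on the slide, and FIT is taken in its usual sense of failures per 10^9 device-hours. The sketch works in hours rather than seconds, which leaves the ratio unchanged.

```python
NYC_FLUX = 13.0      # n/(cm^2 h), sea-level reference flux quoted on the slide
LANSCE_FLUX = 1.8e9  # n/(cm^2 h), beam flux quoted on the slide


def cross_section(errors_per_hour: float, flux_per_hour: float) -> float:
    """Cross section in cm^2: observed error rate divided by neutron flux."""
    if flux_per_hour <= 0:
        raise ValueError("flux must be positive")
    return errors_per_hour / flux_per_hour


def fit(sigma_cm2: float, flux_per_hour: float = NYC_FLUX) -> float:
    """Failures In Time: expected failures per 10^9 hours at the given flux."""
    return sigma_cm2 * flux_per_hour * 1e9


# Hypothetical beam run: 100 output errors observed in 2 hours under the beam.
sigma = cross_section(100 / 2, LANSCE_FLUX)
print(f"cross section = {sigma:.3e} cm^2")
print(f"FIT @NYC      = {fit(sigma):.1f}")
print(f"acceleration  = {LANSCE_FLUX / NYC_FLUX:.2e}x natural flux")
```

Because the cross section is a property of the device and the code it runs, the same value measured under the accelerated beam can be rescaled to any natural flux.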

GPU Radiation Test Setup

[Figure: devices under test in the beam line: microcontrollers, FPGAs, SoCs, Flash, GPU, APU]

[Figure: tested boards: AMD APU, NVIDIA K20, Intel Xeon-Phi, driven by desktop PCs]

GPU power control circuitry is out of beam


Tested Parallel Codes

- Matrix Multiplication (linear algebra)
- Matrix Transpose (memory)
- FFT (signal processing)
- Needleman–Wunsch (biology)
- lavaMD (physical simulations)
- Hotspot (physical simulations)
- HOG (pedestrian detection)

The selected algorithms are heterogeneous and representative


Experimental Results (ECC OFF)

[Figure: Failure In Time @NYC (log scale, 1 to 10000) for MxM, MTrans, FFT, NW, lavaMD, Hotspot; series: Crashes, SDC. Annotations: "execution dominated by memory latencies", "codes that heavily employ registers", "higher #instructions"]

SDC rate varies ~3 orders of magnitude
(details in Oliveira et al., Trans. Comp. 2015)

Matrix Multiplication: $6.46 \times 10^{2}$ FIT
1 error every 15 years
Titan: 18,688 errors every 15 years
(1 error every 7.3h)
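The step from a per-device error interval to a machine-wide one is plain division, sketched below. The function name is hypothetical, and the sketch assumes independent, identical devices and 365-day years. With the rounded 15-year figure it gives about 7 hours, in the same range as the 7.3h quoted above, which presumably comes from an unrounded per-device value.

```python
HOURS_PER_YEAR = 24 * 365


def fleet_hours_between_errors(device_years_between_errors: float,
                               devices: int) -> float:
    """Mean hours between errors across `devices` identical, independent devices."""
    if devices <= 0:
        raise ValueError("devices must be positive")
    return device_years_between_errors * HOURS_PER_YEAR / devices


titan_gpus = 18_688  # GPU count quoted for Titan
hours = fleet_hours_between_errors(15, titan_gpus)
print(f"one error roughly every {hours:.2f} h across {titan_gpus} GPUs")
```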

Error Correction Code - SDC

[Figure: SDC Failure In Time @NYC (log scale, 1 to 10000) for MxM, FFT, NW, lavaMD, Hotspot; series: ECC OFF, ECC ON]

ECC reduces the SDC FIT by ~1 order of magnitude
(there is almost no code dependence)

Unprotected resources:
- logic gates
- scheduler
- queues
- flip-flops

Error Correction Code - Crash

[Figure: Crash Failure In Time @NYC (log scale, 1 to 10000) for MxM, FFT, NW, lavaMD, Hotspot; series: ECC OFF, ECC ON]

ECC increases the Crash FIT by about 50%
(there is almost no code dependence)

Double Bit Errors cause a crash
The scheduler is not protected

ECC ON – SDC vs Crashes

[Figure: Failure In Time @NYC (log scale, 1 to 10000) for MxM, FFT, NW, lavaMD, Hotspot; series: Crash, SDC]

When the ECC is ON, Crashes are more likely to occur than SDCs (this is GOOD for HPC centers!)


Algorithm Based Fault Tolerance

ABFT: technique designed specifically for an algorithm.
ABFT requires: input coding, algorithm modification, and output decoding with error detection/correction

[Figure: A (with a column-checksum row) x B (with a row-checksum column) = M (with col-check and row-check). The col-sum and row-sum of M are compared with the checks; mismatches (X) locate the corrupted element.]

Freivalds ’79
Huang and Abraham ’84
Rech et al., TNS ’13
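The checksum scheme in the figure can be sketched in pure Python. This is a minimal illustration in the spirit of the Huang and Abraham row/column checksums, not the implementation evaluated in the talk; all function names are hypothetical. It uses integers so the comparisons are exact, whereas floating-point data would need a tolerance.

```python
def matmul(a, b):
    """Plain matrix product of two lists of lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]


def add_column_checksum(a):
    """Input coding for A: append a row holding the sum of each column."""
    return [row[:] for row in a] + [[sum(col) for col in zip(*a)]]


def add_row_checksum(b):
    """Input coding for B: append to each row the sum of that row."""
    return [row + [sum(row)] for row in b]


def check_and_correct(c_full):
    """Output decoding: verify the checksums and fix one corrupted data element.

    Returns the list of corrected (row, col) positions (empty if none).
    Raises ValueError for any other mismatch pattern, including an error in a
    checksum element itself: detected, but not corrected by this sketch.
    """
    n, m = len(c_full) - 1, len(c_full[0]) - 1
    bad_rows = [i for i in range(n) if sum(c_full[i][:m]) != c_full[i][m]]
    bad_cols = [j for j in range(m)
                if sum(c_full[i][j] for i in range(n)) != c_full[n][j]]
    if not bad_rows and not bad_cols:
        return []
    if len(bad_rows) == 1 and len(bad_cols) == 1:
        i, j = bad_rows[0], bad_cols[0]
        # The row checksum tells us by how much the element is off.
        c_full[i][j] += c_full[i][m] - sum(c_full[i][:m])
        return [(i, j)]
    raise ValueError("uncorrectable error pattern")


a = [[1, 2], [3, 4]]
b = [[5, 6], [7, 8]]
product = matmul(add_column_checksum(a), add_row_checksum(b))
product[1][0] = 999  # simulated silent data corruption
print("corrected positions:", check_and_correct(product))
print("result:", [row[:2] for row in product[:2]])
```

The extra row and column add a small amount of work to the multiplication, and the check itself only needs row and column sums of the output.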

FFT Hardening Idea

[Figure: input coding -> unhardened FFT -> output de-coding -> error detection]

J.Y. Jou and Abraham ’88
Pilla et al., TNS ’13


ECC vs ABFT

[Figure: FIT (log scale, 1 to 10000) for MxM and FFT, SDC and crash; series: Unhardened, ECC, ABFT]

ECC reduces FIT by ~10 times, ABFT by ~56 times!
ECC increases Crashes by 50%, ABFT by 10%!

ECC vs ABFT

[Figure: normalized execution time (0 to 1.6) for MxM and FFT; series: Unhardened, ECC, ABFT]

ECC overhead for MxM is 10%, for FFT 50%!
ABFT overhead is less than 20%


Duplication With Comparison

Spatial: blocks i and i+N are duplicates
  SM0: a  b  c  d
  SM1: a' b' c' d'
  (time ->)

E-O Spatial: blocks i and i+1 are duplicates
  SM0: b  b' d  d'
  SM1: a  a' c  c'
  (time ->)

Time: a thread executes the operations twice
  SM0: b & b'  d & d'
  SM1: a & a'  c & c'
  (time ->)
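The compare step shared by all three variants can be sketched in Python. This models only the "execute twice, then compare" logic of Time DWC; on a GPU the copies would run inside the same thread (Time) or in separate blocks (Spatial, E-O Spatial). `run_with_dwc`, `saxpy`, and `FaultyOnce` are hypothetical names introduced for the illustration, and `FaultyOnce` is a toy fault injector.

```python
def run_with_dwc(compute, *args):
    """Execute `compute` twice on the same inputs and compare the two results."""
    first = compute(*args)
    second = compute(*args)
    if first != second:
        # Detection only: DWC cannot tell which copy is the correct one.
        raise RuntimeError("DWC mismatch: possible silent data corruption")
    return first


def saxpy(alpha, xs, ys):
    """Small example workload: alpha * x + y, element by element."""
    return [alpha * x + y for x, y in zip(xs, ys)]


class FaultyOnce:
    """Toy fault injector: corrupts the output of the first call only."""

    def __init__(self, fn):
        self.fn = fn
        self.calls = 0

    def __call__(self, *args):
        self.calls += 1
        out = self.fn(*args)
        if self.calls == 1:
            out[0] += 1  # simulated transient error in one output element
        return out


print(run_with_dwc(saxpy, 2, [1, 2], [3, 4]))
try:
    run_with_dwc(FaultyOnce(saxpy), 2, [1, 2], [3, 4])
except RuntimeError as err:
    print("detected:", err)
```

The roughly doubled execution time is visible directly in the sketch: every call to `compute` happens twice.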


Hotspot - DWC results*

[Figure: FIT (log scale, 1 to 1000), SDC and crash; series: Unhardened, ECC, Spatial DWC, E-O Spatial DWC, Time DWC]

Spatial DWC detects all SDC
Spatial E-O detects 80% of SDC
Time DWC detects 90% of SDC

Only Time DWC reduces Crashes (no additional Blocks scheduling required)

DWC is promising: it is generic, easily implemented, and effective…
BUT execution time overhead for Spatial DWC and Spatial E-O is 2.5x and for Time DWC is 2x (data is not copied)

Duplicate only the code’s critical portions

*details in Oliveira et al., Trans. Nucl. Sci., 2014


Code Optimizations (just baked!)

Novel and incremental algorithm implementations are continuously developed [Rodinia suite].
Do code optimizations impact GPU reliability?

Three case studies (naïve vs optimized), each with different input sizes (on GPUs, optimizations depend on workload):
- Matrix Multiplication
- FFT
- Needleman–Wunsch


Experimental Results – MxM

[Figure: normalized FIT [a.u.] (1 to 26) vs input size (1024, 2048, 4096, 8192); series: Naive-SDC, Naive-Crash, Opt-SDC, Opt-Crash]

~20% FIT increase with input size caused by additional threads instantiated

Opt-MxM FIT is higher. Errors in obsolete data are NOT critical: higher hit rate in the caches = higher FIT


Mean Workload Between Failures

We need to consider cross section, execution time, and throughput

[Figure: effect of optimization (Opt.) on cross section and FIT, on execution time, and on the neutrons hitting the GPU]

[Figure: GFLOPs (0 to 600) vs input size (1024, 2048, 4096, 8192); series: MxM-naive, MxM-opt]

Mean WORKLOAD Between Failures: amount of data produced before failure
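One way to turn this definition into arithmetic is sketched below. It assumes MWBF is the throughput multiplied by the mean time between failures, with MTBF taken as 1 / (cross section x flux); the talk gives the definition only in words, so this formula and every number in the example are assumptions chosen for illustration.

```python
NYC_FLUX = 13.0  # n/(cm^2 h), sea-level reference flux quoted earlier in the talk


def mtbf_hours(sigma_cm2: float, flux_per_hour: float = NYC_FLUX) -> float:
    """Mean time between failures in hours for a given cross section."""
    return 1.0 / (sigma_cm2 * flux_per_hour)


def mwbf(workload_per_run: float, exec_time_s: float, sigma_cm2: float,
         flux_per_hour: float = NYC_FLUX) -> float:
    """Workload units completed, on average, between two failures."""
    runs_per_hour = 3600.0 / exec_time_s
    return workload_per_run * runs_per_hour * mtbf_hours(sigma_cm2, flux_per_hour)


# Hypothetical case: the optimized code has twice the cross section
# (so twice the FIT) but runs four times faster on the same input.
naive = mwbf(workload_per_run=1.0, exec_time_s=4.0, sigma_cm2=1e-8)
optimized = mwbf(workload_per_run=1.0, exec_time_s=1.0, sigma_cm2=2e-8)
print(f"MWBF ratio (optimized / naive) = {optimized / naive:.2f}")
```

Under these assumptions a code with a higher FIT can still complete more work between failures, as long as its throughput grows faster than its error rate.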


MxM - MWBF

[Figure: MWBF [data elaborated] (up to 4.00E+13) vs input size (1024, 2048, 4096, 8192); series: Naive-SDC, Opt-SDC]

Opt-MxM produces more correct data than Naïve-MxM
Opt-MxM efficiency increases with input size!

If the code is optimized the throughput increases more than the error rate!


What’s The Plan?

Exascale = 55x Titan. Can we afford a 55x error rate? Probably not.
Self Driving Cars. Reliability is a major concern!

How we can help:
- Understand SDC criticality. Not all errors significantly affect output: are there “acceptable” SDC?
- Propose selective-hardening solutions for GPUs (duplicate only what matters, what REALLY matters)
- Understand how algorithm/code/compiler optimizations will impact the error rate of future machines
- Use fault injection to better understand error propagation


Acknowledgments

Caio Lunardi

Caroline Aguiar

Laercio Pilla

Daniel Oliveira

Vinicius Frattin

Philippe Navaux

Luigi Carro

Chris Frost

Nathan DeBardeleben

Sean Blanchard

Heather Quinn

Thomas Fairbanks

Steve Wender

Timothy Tsai

Siva Hari

Steve Keckler

David Kaeli

NUCAR group