
Research Poster 36 x 48 - B · Title: Research Poster 36 x 48 - B Author: Genigraphics 800.790.4001 Created Date: 11/29/2012 4:39:55 PM



For Inter-Warp DMR

• A Replay Checker checks the types of consecutively issued instructions and commands a replay (DMR) in the following cycle if they differ.

• When instructions of the same type are issued consecutively, the ReplayQ buffers the instructions so that they can be verified any time later, whenever the corresponding execution unit becomes available.

– Each entry holds the opcode, operands, and original execution result for 32 threads (around 500 B per entry).

Inter-Warp DMR: Exploiting underutilized resources among heterogeneous units

• In any fully utilized warp, the unused execution units conduct DMR of the previous warp's unverified execution.

• If the stored original execution result and the new result mismatch → ERROR detected!!
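The Replay Checker / ReplayQ interplay described above can be sketched as a minimal Python model. This is an illustration, not the hardware design: the entry layout and the `replay_on` interface are assumptions made for the sketch.

```python
from collections import deque

def replay_check(prev_type, cur_type):
    # Replay Checker sketch: if two consecutively issued instructions
    # differ in type, the previous one can be replayed (DMR) on the
    # now-idle unit in the following cycle.
    return prev_type != cur_type

class ReplayQ:
    # Minimal ReplayQ model: each entry carries the opcode, operands,
    # and original results, as on the poster (~500 B per entry).
    def __init__(self, size=10):
        self.entries = deque(maxlen=size)

    def enqueue(self, opcode, unit, operands, results):
        self.entries.append({"opcode": opcode, "unit": unit,
                             "operands": operands, "results": results})

    def replay_on(self, idle_unit, execute):
        # When a unit becomes idle, search for a matching unverified
        # entry, re-execute it, and compare against the stored results.
        for i, entry in enumerate(self.entries):
            if entry["unit"] == idle_unit:
                del self.entries[i]
                return execute(entry["opcode"], entry["operands"]) == entry["results"]
        return None  # nothing pending for this unit
```

For example, an `add.f32` enqueued while the SPs are busy can be verified later: `q.replay_on("SP", run_alu)` returns `True` on a match and would flag an error on a mismatch.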

Warped-DMR

Light-weight Error detection for GPGPU

Hyeran Jeon and Murali Annavaram

University of Southern California

MOTIVATION

ARCHITECTURAL SUPPORT

WARPED-DMR

ABSTRACT

CONTACT

Hyeran Jeon

Email: [email protected]

Murali Annavaram

Email: [email protected]

For many scientific applications that commonly run on supercomputers, program correctness is as important as performance. A few soft or hard errors can lead to corrupt results and potentially waste days or even months of computing effort. In this research we exploit unique architectural characteristics of GPGPUs to propose a light-weight error detection method, called Warped Dual Modular Redundancy (Warped-DMR). Warped-DMR detects errors in computation by relying on opportunistic spatial and temporal dual-modular execution of code. Warped-DMR is light weight because it exploits the underutilized parallelism in GPGPU computing for error detection. Error detection spans both within a warp and between warps, called intra-warp and inter-warp DMR, respectively. Warped-DMR achieves 96% error coverage while incurring a worst-case 16% performance overhead, without extra execution units or programmer effort.

Intra-Warp DMR: Exploiting underutilized resources among homogeneous units

• For any underutilized warp, the inactive threads within the warp duplicate the active threads' execution.

• The active mask gives a hint for duplication selection.

• If the results of the inactive and active threads mismatch → ERROR detected!!
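The duplication-and-compare step can be sketched in Python. This is a simplified model, assuming a 1:1 pairing of inactive to active lanes; the `faulty_lanes` parameter is a fault-injection hook added purely so the mismatch path is demonstrable, not part of the scheme.

```python
def intra_warp_dmr(active_mask, regs, op, faulty_lanes=frozenset()):
    # Sketch of Intra-Warp DMR for one instruction on a small warp.
    # active_mask: per-lane 1/0 bits; regs: per-lane source operand;
    # op: the instruction (e.g. lambda x: x + 1); faulty_lanes injects
    # a wrong primary result so the error case is visible in the demo.
    active = [i for i, a in enumerate(active_mask) if a]
    inactive = [i for i, a in enumerate(active_mask) if not a]
    errors = []
    # Each inactive lane shadows one active lane (simplified pairing);
    # the RFU would forward the active lane's operands to the shadow.
    for shadow, primary in zip(inactive, active):
        original = op(regs[primary]) + (1 if primary in faulty_lanes else 0)
        duplicate = op(regs[primary])  # shadow lane, same operands
        if original != duplicate:
            errors.append(primary)     # mismatch => ERROR detected!!
    return errors
```

With mask `[1, 1, 0, 0]`, lanes 2 and 3 shadow lanes 0 and 1; injecting a fault on lane 0 makes the comparator report lane 0.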

For Intra-Warp DMR

• A Register Forwarding Unit (RFU) is used so that each pair of active and inactive threads uses the same operands; the RFU forwards the active thread's register value to the inactive thread according to the active mask.

– Overhead: 0.08 ns and 390 µm² @ Synopsys Design Compiler

• Thread-to-core mapping is used to increase error coverage by modifying thread-core affinity in the scheduler.
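The RFU's forwarding step can be sketched as follows; the 1:1 pairing of inactive to active lanes is a simplifying assumption, and the lane ordering (th3 down to th0) follows the poster's figure.

```python
def rfu_forward(active_mask, reg_values):
    # Register Forwarding Unit sketch: copy each active lane's register
    # value into a paired inactive lane, so both lanes of a DMR pair
    # execute with identical operands.
    forwarded = list(reg_values)
    active = [i for i, a in enumerate(active_mask) if a]
    inactive = [i for i, a in enumerate(active_mask) if not a]
    for dst, src in zip(inactive, active):
        forwarded[dst] = reg_values[src]
    return forwarded
```

Replaying the poster's example with active mask 1100: `rfu_forward([1, 1, 0, 0], ["th3.r1", "th2.r1", "th1.r1", "th0.r1"])` yields `["th3.r1", "th2.r1", "th3.r1", "th2.r1"]`, matching the operand row in the RFU figure.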

• Scientific computing is different from multimedia

– Correctness matters

– Some vendors have begun to add memory protection schemes to GPUs

• But what about execution units?

– A larger portion of the die area is assigned to execution units in GPUs

– Vast number of cores → higher probability of computation errors

• Underutilization among Homogeneous Units

– Since threads within a warp share a PC value, in a diverged control flow some threads must execute one path while the others stay idle

• Underutilization among Heterogeneous Units

– The dispatcher issues an instruction to only one of the three execution units at a time

– In the worst case, two of the three execution units are idle


RESULTS

Warped-DMR (Intra-Warp DMR + Inter-Warp DMR) covers 96% of computations with 16% performance overhead, without extra execution units.

BACKGROUND

• Instructions are executed in batches of threads called warps (or wavefronts)

– Threads within a warp run in lock-step by sharing a PC

• Instructions are categorized into 3 types and executed on the corresponding execution units

– Arithmetic operations on the SPs, memory operations on the LD/ST units, transcendental instructions (e.g. sin, cosine) on the SFUs
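The three-way categorization can be sketched with a small classifier over opcode roots. The opcode sets below are illustrative examples, not a complete ISA listing.

```python
# Illustrative classifier for the poster's three instruction classes.
SFU_OPS = {"sin", "cos", "rsqrt", "lg2", "ex2"}
MEM_OPS = {"ld", "st"}

def unit_for(opcode):
    root = opcode.split(".")[0]   # e.g. "ld.shared.f32" -> "ld"
    if root in SFU_OPS:
        return "SFU"
    if root in MEM_OPS:
        return "LD/ST"
    return "SP"                   # arithmetic and everything else
```

Applied to the scheduling example below, `sin.f32` routes to the SFU, `ld.shared.f32` to the LD/ST unit, and `add.f32` to an SP.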

< GPU execution hierarchy: a kernel is split into thread blocks of warps of threads; each SM holds a scheduler/dispatcher, register file, SP / LD/ST / SFU execution units, and local memory, with global memory shared across SMs >

< Example warp schedule: over time, instructions from different warps are dispatched to the SPs, LD/STs, and SFUs, e.g.:

warp4: sin.f32 %f3, %f1
warp1: ld.shared.f32 %f20, [%r99+824]
warp2: add.f32 %f16, %f14, %f15
warp1: ld.shared.f32 %f21, [%r99+956]
warp2: add.f32 %f18, %f12, %f17
warp3: ld.shared.f32 %f2, [%r70+4] >

if (cond) {
    b++;
} else {
    b--;
}
a = b;

< Branch divergence on two SPs: in typical execution, while SP1 runs b++ SP2 idles, and while SP2 runs b-- SP1 idles; with Intra-Warp DMR the idle SP duplicates the active SP's instruction (b++/b++, then b--/b--), a comparator (COMP) checks each pair — same → OK, different → ERROR!!, triggering Flush & Error Handling >
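The divergent-branch scenario above can be modeled as a tiny two-SP simulation; this is a sketch of the comparison logic only, with the error path trivially quiet absent an injected fault.

```python
def diverged_execute(cond, b):
    # Run the if/else above on a two-SP "warp" with Intra-Warp DMR:
    # one SP takes the active path while the other, idle under the
    # divergence, duplicates it; COMP compares the pair of results.
    if cond:
        primary, duplicate = b + 1, b + 1   # SP1 active, SP2 shadows b++
    else:
        primary, duplicate = b - 1, b - 1   # SP2 active, SP1 shadows b--
    error = primary != duplicate            # mismatch => flush & handle
    a = primary                             # a = b after reconvergence
    return a, error
```

So `diverged_execute(True, 5)` gives `(6, False)`: the branch result is verified at no cost in extra hardware, since the duplicating SP would otherwise have idled.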

< With Inter-Warp DMR: over time, execution units (SPs, LD/STs, SFUs) left idle by the current mix of sin/ld/add instructions re-execute unverified instructions from previous warps as DMR >

< Figures compare: Code | Typical GPU execution | With Intra-Warp DMR, and Code | Typical GPU execution | With Inter-Warp DMR >

< Execution time breakdown with respect to the number of active threads >

• Over 30% of BitonicSort's execution time runs with only 16 active threads

• 40% of BFS's execution time runs with a single active thread

2 types of Underutilization in GPGPU computing

WARPED-DMR: EXPLOITING THE UNDERUTILIZATION FOR ERROR DETECTION

Can we use these idle resources?

< Register Forwarding Unit: pipeline stages RF → EXE → WB across four SPs reading the per-thread register file (th0.r0/r1 … th3.r0/r1); with active mask 1100, the RFU forwards the active threads' register values into the inactive lanes, so the operand row th3.r1 th2.r1 th1.r1 th0.r1 becomes th3.r1 th2.r1 th3.r1 th2.r1, and a comparator at writeback flags ERROR!! on mismatch >

< Inter-Warp DMR pipeline: at the DEC stage a CHECKER compares consecutive instruction types; same-type instructions are enqueued into the ReplayQ (enqueue & search), and idle SP / MEM / SFU units later replay them for DMR after RF and EXE >

< Error coverage w.r.t. SIMT cluster organization and thread-to-core mapping: error coverage of 89.60%, 91.91%, and 96.43% across configurations with 4-core and 8-core clusters >

< Normalized kernel simulation cycles w.r.t. ReplayQ size: 1.41, 1.32, 1.24, and 1.16 for ReplayQ sizes 0, 1, 5, and 10, respectively >