Simultaneous Branch and Warp Interweaving for Sustained GPU Performance

ISCA'39, Portland, OR

June 11, 2012

Nicolas Brunie, Kalray and ENS

Sylvain Collange, Univ. Federal de Minas Gerais

Gregory Diamos, NVIDIA


Mitigating GPU branch divergence cost

GPUs rely on SIMD execution

Serialization of divergent branches → resource underutilization

Contribution: a second instruction scheduler to improve utilization

[Figure: a control-flow graph with basic blocks 1 to 6, and its execution schedule under baseline SIMD and under SBI+SWI]


Outline

GPU microarchitecture

SIMT model

The divergence problem

Simultaneous branch interweaving

Simultaneous warp interweaving


Context: GPU microarchitecture

Software: graphics shaders, OpenCL, CUDA...

Hardware: GPU

Architecture: multi-thread SPMD programming model

GPU microarchitecture

Hardware datapaths: SIMD execution units

kernel void scale(float a, float *X) {
    X[tid] = a * X[tid];  // tid: index of the current thread
}

[Figure: 1 program, many threads; each SIMD lane has its own register file (RF) and ALU]


Single-Instruction Multi-Threading (SIMT)

Optimized for regular workloads

Implicit SIMD execution model

Fetch 1 instruction for a warp of lockstepping threads

Execute on SIMD units

[Figure: a warp of threads T0 to T3, all at PC=17. One fetch at address 17 issues a single add, which executes on all four lanes.]

The control divergence problem

Control divergence: conflict for shared fetch unit

Serialize execution paths

Efficiency loss

1: if (!(tid % 2)) {
2:     a + b;
3: } else {
4:     a * b;
5: }

[Figure: fetch @2. T0 and T2 (PC=2) execute add; T1 and T3 are idle (nop).]

[Figure: fetch @4. T1 and T3 (PC=4) execute mul; T0 and T2 are idle (nop).]
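The serialization described above can be sketched as a small simulation. This is only an illustration under simple assumptions: each thread is represented by the single operation of the path it takes, and the helper names `serialized_schedule` and `utilization` are hypothetical, not taken from the talk.

```python
def serialized_schedule(op_of_thread):
    """One issue slot per distinct path; inactive lanes receive a nop."""
    paths = sorted(set(op_of_thread))  # sorted only to make the order deterministic
    return [[op if op == path else "nop" for op in op_of_thread] for path in paths]


def utilization(schedule):
    """Fraction of lane slots that perform useful work."""
    lanes = [op for slot in schedule for op in slot]
    return sum(op != "nop" for op in lanes) / len(lanes)


# Even tids take the add path (line 2), odd tids take the mul path (line 4).
warp = ["add" if tid % 2 == 0 else "mul" for tid in range(4)]
schedule = serialized_schedule(warp)
for slot in schedule:
    print(" ".join(slot))
print(f"lane utilization: {utilization(schedule):.0%}")
```

With a single fetch unit, the two paths occupy two issue slots and half of the lanes stay idle in each slot, which is the efficiency loss the slide refers to.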

Outline

The GPU divergence problem

Simultaneous branch interweaving

Double instruction fetch

Finding branch-level parallelism

Restoring lockstep execution

Implementation

Simultaneous warp interweaving

Simultaneous Branch Interweaving

Add a second fetch unit

Simultaneous execution of divergent branches

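[Figure: one warp with two fetch units. Fetch @2 issues add to T0 and T2 (PC=2) while fetch @4 issues mul to T1 and T3 (PC=4), so all four lanes are busy in the same cycle.]

1: if (!(tid % 2)) {
2:     a + b;
3: } else {
4:     a * b;
5: }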


Standard divergence control: mask stack

Code                   Operation   Mask stack (1 activity bit / thread, tid = 0 1 2 3)
                                   1111
if (tid < 2) {         push        1111 1100
    if (tid == 0) {    push        1111 1100 1000
        x = 2;
    }                  pop         1111 1100
    else {             push        1111 1100 0100
        x = 3;
    }                  pop         1111 1100
}                      pop         1111

Problem: does not expose branch-level parallelism
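The push/pop sequence on this slide can be reproduced with a few lines of code. The sketch below is a minimal model, assuming a 4-thread warp and a hypothetical `MaskStack` class; real hardware encodes this differently.

```python
WARP_SIZE = 4


def mask_str(active):
    return "".join("1" if a else "0" for a in active)


class MaskStack:
    """Stack of activity masks; the top entry selects the running threads."""

    def __init__(self, warp_size):
        self.stack = [[True] * warp_size]

    def push(self, cond):
        # Entering a branch: keep only threads that are active AND satisfy cond.
        top = self.stack[-1]
        self.stack.append([a and c for a, c in zip(top, cond)])

    def pop(self):
        # Leaving a branch: restore the enclosing mask.
        self.stack.pop()

    def __str__(self):
        return " ".join(mask_str(m) for m in self.stack)


outer = [tid < 2 for tid in range(WARP_SIZE)]
inner = [tid == 0 for tid in range(WARP_SIZE)]

s = MaskStack(WARP_SIZE)
trace = [str(s)]
steps = [
    lambda: s.push(outer),                       # if (tid < 2) {
    lambda: s.push(inner),                       #   if (tid == 0) {
    s.pop,                                       #   }
    lambda: s.push([not c for c in inner]),      #   else {
    s.pop,                                       #   }
    s.pop,                                       # }
]
for step in steps:
    step()
    trace.append(str(s))
print("\n".join(trace))
```

Only the top of the stack is ever runnable, so the two inner paths (masks 1000 and 0100) can never be issued together.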


Alternative to stack: 1 PC / thread

Keep 1 program counter (PC) per thread: PC0, PC1, PC2, PC3 for tid = 0, 1, 2, 3

Compare each PC with the Master PC (MPC): match → active, no match → inactive

Policy: MPC = min(PCi)

Earliest reconvergence with code laid out in Thread Frontiers order

Stack-based divergence control implies serialization

G. Diamos et al. SIMD re-convergence at thread frontiers. MICRO 44, 2011.

[Figure: per-thread PCs compared against the Master PC; only tid = 0 matches, giving activity mask 1 0 0 0]

Code

if (tid < 2) {
    if (tid == 0) {
        x = 2;
    }
    else {
        x = 3;
    }
}
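The min-PC policy can be sketched as follows. The PC values are hypothetical line numbers assigned to the code above (3 for `x = 2`, 6 for `x = 3`, 8 for the closing brace), and `next_pc` is an assumed stand-in for instruction execution.

```python
def select_active(pcs):
    """Master PC = smallest per-thread PC; threads whose PC matches are active."""
    mpc = min(pcs)
    return mpc, [pc == mpc for pc in pcs]


def run_until_reconverged(pcs, next_pc):
    """Issue one instruction per step until every thread holds the same PC."""
    pcs = list(pcs)
    issued = []
    while len(set(pcs)) > 1:
        mpc, active = select_active(pcs)
        issued.append((mpc, active))
        # Only active threads advance; the others keep waiting at their PC.
        pcs = [next_pc[pc] if a else pc for pc, a in zip(pcs, active)]
    return pcs, issued


# tid 0 is at x = 2 (PC 3), tid 1 at x = 3 (PC 6), tids 2 and 3 past the if (PC 8).
start = [3, 6, 8, 8]
final, issued = run_until_reconverged(start, next_pc={3: 8, 6: 8})
for mpc, active in issued:
    print(mpc, ["1" if a else "0" for a in active])
print("reconverged at PC", final[0])
```

The first step selects MPC = 3 with mask 1 0 0 0, matching the figure. The two paths are still issued one after the other, since there is a single Master PC.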


Run two branches simultaneously

MPC1 = min(PCi)

MPC2 = min(PCi : PCi ≠ MPC1)

[Figure: per-thread PCs (PC0 to PC3, tid = 0 to 3) compared against Master PC 1 and Master PC 2; each comparison yields its own activity mask]

Code

if (tid < 2) {
    if (tid == 0) {
        x = 2;
    }
    else {
        x = 3;
    }
}
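Selecting two master PCs can be sketched in the same style. As before, the PC values are hypothetical line numbers for the code above, and `select_two` is an illustrative name rather than anything defined in the talk.

```python
def select_two(pcs):
    """Return (MPC1, mask1) and (MPC2, mask2) for one warp.

    MPC1 is the smallest PC; MPC2 is the smallest PC different from MPC1,
    or None when all threads share the same PC.
    """
    mpc1 = min(pcs)
    others = [pc for pc in pcs if pc != mpc1]
    mpc2 = min(others) if others else None
    mask1 = [pc == mpc1 for pc in pcs]
    mask2 = [mpc2 is not None and pc == mpc2 for pc in pcs]
    return (mpc1, mask1), (mpc2, mask2)


primary, secondary = select_two([3, 6, 8, 8])
print("primary  ", primary)
print("secondary", secondary)
print("converged", select_two([8, 8, 8, 8]))
```

The two masks are disjoint by construction, so both instructions can share the lanes of one warp in the same cycle. When the warp is converged there is no secondary path to fetch.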

Restoring lockstep execution

Issue: unbalanced paths break lockstep execution

Power consumption, loss of memory locality

Solution: implicit partial synchronization barrier

[Figure: a control-flow graph and the resulting schedules for threads T0 to T3 under two policies. Greedy scheduling: instructions 6 and 7 are broken down and issued twice. Earliest reconvergence: threads synchronize before instruction 6.]

Enforcing control-flow reconvergence

Wait for any thread of the warp between PCdiv and PCrec

Annotate reconvergence points with pointer to immediate dominator

Example: T0 and T2 (at F) wait for T1 (in D). T3 (in B) can proceed in parallel.
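The waiting rule can be illustrated with the example from the slide. The sketch assumes that blocks A to F sit at addresses 0 to 5, that the reconvergence point F is annotated with PCdiv = C, and that the waiting range is the half-open interval from PCdiv to PCrec; all three are assumptions made for illustration.

```python
A, B, C, D, E, F = range(6)  # hypothetical block addresses in layout order


def must_wait(pcs, pc_div, pc_rec):
    """True while any thread of the warp is still between pc_div and pc_rec."""
    return any(pc_div <= pc < pc_rec for pc in pcs)


def runnable_threads(pcs, pc_div, pc_rec):
    """Threads parked at pc_rec are held back; all other threads may run."""
    blocked = must_wait(pcs, pc_div, pc_rec)
    return [not (pc == pc_rec and blocked) for pc in pcs]


# T0 and T2 are at F, T1 is in D, T3 is in B.
before = runnable_threads([F, D, F, B], pc_div=C, pc_rec=F)
# Once T1 reaches F, no thread is left between C and F.
after = runnable_threads([F, F, F, B], pc_div=C, pc_rec=F)
print(before)
print(after)
```

In the first state T0 and T2 are held at F while T1 and T3 run, which matches the slide. The barrier is partial because T3 is never stalled by it.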



Implementation: context table

Common case: few different PCs

Order stable in time

Keep common PCs (CPCs) + activity masks in sorted heap

Per-thread PCs

Thread   T0  T1  T2  T3  T4  T5  T6  T7
PC       12  17   3  17  17   3   3  17

Sorted context table

Entry   Common PC   Activity mask (T0 to T7)
CPC1     3          0 0 1 0 0 1 1 0
CPC2    12          1 0 0 0 0 0 0 0
CPC3    17          0 1 0 1 1 0 0 1
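The context table for this example can be derived from the per-thread PCs with a short sketch. A sorted Python list stands in for the sorted heap of the slide, and `build_context_table` is a hypothetical helper name.

```python
def build_context_table(pcs):
    """Group threads by PC and return (common PC, activity mask) sorted by PC."""
    table = {}
    for tid, pc in enumerate(pcs):
        table.setdefault(pc, [0] * len(pcs))[tid] = 1
    return sorted(table.items())


per_thread_pcs = [12, 17, 3, 17, 17, 3, 3, 17]  # T0 to T7, as in the figure
context_table = build_context_table(per_thread_pcs)
for pc, mask in context_table:
    print(f"{pc:>2}  {' '.join(map(str, mask))}")
```

Eight per-thread PCs collapse into three entries here, which is the common case the slide relies on: few different PCs per warp.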


Two-level context table

Cache top 2 entries in the Hot Context Table register

Constant-time access to MPCi = CPCi and activity masks

Other entries in the Cold Context Table linked list

Branch → incremental insertion in CCT
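A two-level table along these lines could look like the sketch below. It is a software approximation only: a sorted Python list replaces the linked list of the Cold Context Table, masks are opaque integers, PCs are assumed distinct, and the class and method names are hypothetical.

```python
import bisect

HOT_SIZE = 2  # the slide caches the top 2 entries


class ContextTable:
    def __init__(self, entries):
        ordered = sorted(entries)
        self.hot = ordered[:HOT_SIZE]   # smallest PCs, read directly as MPC1 and MPC2
        self.cold = ordered[HOT_SIZE:]  # remaining entries, kept sorted

    def master_pcs(self):
        return [pc for pc, _mask in self.hot]

    def insert(self, pc, mask):
        """Incremental insertion of a new (PC, mask) entry created by a branch."""
        entry = (pc, mask)
        if len(self.hot) < HOT_SIZE or entry < self.hot[-1]:
            bisect.insort(self.hot, entry)
            if len(self.hot) > HOT_SIZE:
                # Demote the largest hot entry to the cold table.
                bisect.insort(self.cold, self.hot.pop())
        else:
            bisect.insort(self.cold, entry)


table = ContextTable([(3, 0b00100110), (12, 0b10000000), (17, 0b01011001)])
print(table.master_pcs())
table.insert(5, 0b00000110)
print(table.master_pcs())
print([pc for pc, _mask in table.cold])
```

Reading the master PCs never touches the cold table, which is the point of caching the top entries; only insertions after a branch may walk it.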

Outline

The GPU divergence problem

Simultaneous branch interweaving

Simultaneous warp interweaving

Idea

Dealing with lane conflicts

Implementation

Results


Simultaneous Warp Interweaving

SBI limitation: often no secondary path

Single-sided ifs, loops…

SWI: opportunistically schedule instructions from other warps in divergence gaps

“SMT for SIMD”

[Figure: warp 0 (T0 to T3, fetch @17, add) and warp 1 (T4 to T7, fetch @42, mul) issue in the same cycle; the shared lanes execute add, add, mul, nop.]


Using divergence correlations

Issue: unbalanced divergence introduces conflicts

e.g. Parallel reduction

Solution: static lane shuffling

Apply different lane permutation for each warp

Preserves inter-thread memory locality

[Figure: lane activity over time for warps 0 to 3, threads 0 to 3 in each warp.
Without shuffling: warp 0 is never compatible with warp 2: conflict in lane 0.
With shuffling: threads 0 are mapped to different physical lanes: no conflict.]
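The effect of static lane shuffling can be sketched as follows. The permutation used here, XOR of the thread index with the warp id, is an assumption chosen for illustration; the slide only says that each warp gets a different lane permutation.

```python
WARP_SIZE = 4  # power of two, so the XOR below stays within 0..WARP_SIZE-1


def physical_lane(tid_in_warp, warp_id):
    """Hypothetical per-warp permutation: XOR the thread index with the warp id."""
    return tid_in_warp ^ (warp_id % WARP_SIZE)


def occupied_lanes(active_tids, warp_id, shuffle):
    """Set of physical lanes used by the active threads of one warp."""
    if shuffle:
        return {physical_lane(t, warp_id) for t in active_tids}
    return set(active_tids)


def compatible(lanes_a, lanes_b):
    """Two warps can share an issue slot only if they use disjoint lanes."""
    return lanes_a.isdisjoint(lanes_b)


# Late step of a parallel reduction: only thread 0 of each warp is active.
active = [0]
plain = compatible(occupied_lanes(active, 0, False), occupied_lanes(active, 2, False))
shuffled = compatible(occupied_lanes(active, 0, True), occupied_lanes(active, 2, True))
print("identity mapping compatible:", plain)
print("shuffled mapping compatible:", shuffled)
```

With the identity mapping both warps need lane 0 and can never be paired. After shuffling, thread 0 of warp 2 lands on lane 2, so the pair becomes compatible.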


Detecting a compatible secondary warp

Bitset inclusion test: content-addressable memory (CAM)

Treat zeros as don't care bits

Power-hungry!

[Figure: a mask m is compared against the activity masks of all warps W0 to W6; every compatible warp signals a hit.]

Set-associative lookup

Split warps in sets

Restrict lookup to 1 set

More power-efficient

[Figure: the same lookup restricted to the warps of the same set; only those warps can signal a hit.]
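Both lookups can be sketched with integer bitmasks. The sketch assumes that a secondary warp is compatible when its activity mask has no lane in common with the primary warp, and that a warp's set is its id modulo the number of sets; the set mapping and the name `find_secondary` are hypothetical.

```python
def find_secondary(primary_id, masks, num_sets=None):
    """Return the id of the first warp whose active lanes avoid the primary's.

    masks[w] is the activity mask of warp w (1 bit per lane).
    num_sets=None models the fully associative lookup; otherwise only warps
    in the same set as the primary warp are examined.
    """
    primary = masks[primary_id]
    for wid, mask in enumerate(masks):
        if wid == primary_id:
            continue
        if num_sets is not None and wid % num_sets != primary_id % num_sets:
            continue  # outside the set: not looked up at all
        if mask & primary == 0:  # inclusion in the primary's free lanes
            return wid
    return None


masks = [0b1100, 0b0011, 0b1000, 0b0110]  # warps W0 to W3, 4 lanes each
full = find_secondary(0, masks)
restricted = find_secondary(0, masks, num_sets=2)
print("fully associative:", full)
print("2 sets:", restricted)
```

The restricted lookup compares fewer masks per cycle but can miss a compatible warp that sits in another set, which is the trade-off quantified on the next slide.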


Set-associative lookup is good enough

3-way: captures 66% of performance potential

Direct-mapped: 48%

Experimental configuration

Baseline: clustered SIMT architecture (Fermi, Kepler)

Warp size 32; 2 warps / clock, 1 instruction / warp

One warp pool per cluster (warp pool 1 for cluster 1, warp pool 2 for cluster 2)

SBI/SWI: tie both clusters together to form warps twice as large

Direct both instructions to the same execution units

Warp size 64; 1 warp / clock, 2 instructions / warp

Common warp pool

Fetch-unit / execute-unit ratio maintained

Performance results

[Figure: speedup charts for regular applications and for irregular applications]

Speedup    Regular   Irregular
SBI        +15%      +41%
SWI        +25%      +33%
SBI+SWI    +23%      +40%


Perspective: SMT-GPU µarch convergence

Converging trends in SMT and GPU architecture

Closing micro-architectural space between Clustered Multi-Threading and SIMD

Explore new tradeoffs between power efficiency and flexibility?

[Figure: design space spanning SMT, SIMT and SIMD. Axis labels: efficiency on regular MT apps, flexibility.]

Merge instructions from concurrent threads

Loosen constraints of SIMD execution

Design points shown: MMT [Long10], MIS [Dechene10], Fetch-combining [Kumar04], Thread Fusion [González08], DWF [Fung07], DWS [Meng10], BC [Fung11], LW [Narasiman11], CAPRI [Rhu12], TF [Diamos11], iGPU [Menon12], SBI / SWI (this work)


Backups


References

R. Kumar et al. Conjoined-core chip multiprocessing. MICRO 37, 2004.

J. González et al. Thread fusion. ISLPED 13, 2008.

W. W. L. Fung et al. Dynamic warp formation: efficient MIMD control flow on SIMD graphics hardware. TACO, 2009.

G. Long et al. Minimal multi-threading: finding and removing redundant instructions in multithreaded processors. MICRO 43, 2010.

M. Dechene et al. Multi-threaded instruction sharing. Technical report, 2010.

J. Meng et al. Dynamic warp subdivision for integrated branch and memory divergence tolerance. ISCA 37, 2010.

G. Diamos et al. SIMD re-convergence at thread frontiers. MICRO 44, 2011.

W. Fung et al. Thread block compaction for efficient SIMT control flow. HPCA 17, 2011.


SWI implies SMT

Heterogeneous execution units

SWI improves utilization with superscalar execution

[Figure: warp 0 (T0 to T3, fetch @17) issues add to the ALU while warp 1 (T4 to T7, fetch @42) issues load to the LSU in the same cycle.]


SBI vs. DWF

SBI

Dual fetch

Uses branch-level parallelism

Sensitive to branch unbalance

Preserves in-warp locality

[Figure: one warp (T0 to T3); fetch @2 issues add to the threads at PC=2 and fetch @4 issues mul to the threads at PC=4 in the same cycle]

DWF

Single fetch

Uses warp-level parallelism

Sensitive to lane activity unbalance

[Figure: threads at PC=2 taken from several warps (T0 to T15) are regrouped so that a single fetch issues add]


SWI vs. DWF

SWI

Dual fetch

Sensitive to lane conflicts

Preserves in-warp locality

[Figure: instructions from two warps (add and mul) share the lanes in one cycle, with a nop in the remaining lane]

DWF

Single fetch

Sensitive to lane activity unbalance, low occupancy

[Figure: threads at PC=2 taken from several warps (T0 to T15) are regrouped so that a single fetch issues add]


Simulation platform

Barra: functional GPU simulator

modeled after NVIDIA Tesla GPUs

Runs native Tesla SASS binaries

Reproduces SIMT execution

Timing-power model

Cycle-accurate execution pipeline

Constant-latency, bandwidth-bound memory

Calibration from GPU microbenchmarks

http://gpgpu.univ-perp.fr/index.php/Barra

Sylvain Collange, Marc Daumas, David Defour, David Parello. Barra: a parallel functional simulator for GPGPU. MASCOTS 2010.

SBI scoreboarding logic

Keep track of dependencies induced by thread divergence-reconvergence

Transitive closure of dependency graph


Goto considered harmful?

Control instructions in some CPU and GPU instruction sets

MIPS: j, jal, jr, syscall

Intel GMA Gen4 (2006): jmpi, if, iff, else, endif, do, while, break, cont, halt, msave, mrest, push, pop

Intel GMA SB (2011): jmpi, if, else, endif, case, while, break, cont, halt, call, return, fork

AMD R500 (2005): jump, loop, endloop, rep, endrep, breakloop, breakrep, continue

AMD R600 (2007): push, push_else, pop, loop_start, loop_start_no_al, loop_start_dx10, loop_end, loop_continue, loop_break, jump, else, call, call_fs, return, return_fs, alu, alu_push_before, alu_pop_after, alu_pop2_after, alu_continue, alu_break, alu_else_after

AMD Cayman (2011): push, push_else, pop, push_wqm, pop_wqm, else_wqm, jump_any, reactivate, reactivate_wqm, loop_start, loop_start_no_al, loop_start_dx10, loop_end, loop_continue, loop_break, jump, else, call, call_fs, return, return_fs, alu, alu_push_before, alu_pop_after, alu_pop2_after, alu_continue, alu_break, alu_else_after

NVIDIA Tesla (2007): bar, bra, brk, brkpt, cal, cont, kil, pbk, pret, ret, ssy, trap, .s

NVIDIA Fermi (2010): bar, bpt, bra, brk, brx, cal, cont, exit, jcal, jmp, jmx, longjmp, pbk, pcnt, plongjmp, pret, ret, ssy, .s

Control flow structure is explicit

GPU-specific instruction sets

No support for arbitrary control flow


Flynn's taxonomy revisited

Resource type                        Resource count
                                     1       2       M
Instruction fetch (F)                SIMT    DIMT    MIMT
Memory port (address, M)             SAMT    DAMT    MAMT
Computation / registers (data, X)    SDMT    DDMT    MDMT

[Figure: for each cell, threads T0 to T3 shown sharing one, two, or one-per-thread resources of the given type]

A. Glew. Coherent vector lane threading. Berkeley ParLab Seminar, 2009.


Examples: conventional design points

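Design point         Usual name    Classification
Multi-core           MIMD (MAMT)   MI MD MA MT
Short-vector SIMD    SIMD (SAST)   SI MD SA ST
GPU                  SI(MDSA)MT    SI MD SA MT

[Figure: for each design point, the fetch (F), memory (M) and execution (X) resources and the threads (T0 to T2) that share them]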
