Simultaneous Branch and Warp Interweaving for Sustained GPU Performance

ISCA'39, Portland, OR, June 11, 2012

Nicolas Brunie, Kalray and ENS Lyon
Sylvain Collange, Univ. Federal de Minas Gerais
Gregory Diamos, NVIDIA Research


Mitigating GPU branch divergence cost

GPUs rely on SIMD execution

Serialization of divergent branches → resource underutilization

Contribution: a second instruction scheduler to improve utilization

[Figure: control-flow graph with blocks 1 to 6 and its execution schedule under baseline SIMD versus SBI+SWI]


Outline

GPU microarchitecture

SIMT model

The divergence problem

Simultaneous branch interweaving

Simultaneous warp interweaving


Context: GPU microarchitecture

Software: graphics shaders, OpenCL, CUDA...

Hardware: GPU

Architecture: multi-thread SPMD programming model

GPU microarchitecture

Hardware datapaths: SIMD execution units

kernel void scale(float a, float * X) {
    X[tid] = a * X[tid];
}

1 program, many threads

[Figure: four SIMD lanes, each with a register file (RF) and an ALU]


Single-Instruction Multi-Threading (SIMT)

Optimized for regular workloads

Implicit SIMD execution model

Fetch 1 instruction for a warp of lockstepping threads

Execute on SIMD units

[Figure: a warp of four threads T0 to T3, all at PC=17; one fetch at address 17 issues a single add that executes on all four lanes]


The control divergence problem

Control divergence: conflict for shared fetch unit

Serialize execution paths

Efficiency loss

1: if(!(tid%2)) {
2:   a+b;
3: } else {
4:   a*b;
5: }

[Figure, step 1: fetch at address 2; T0 and T2 (PC=2) execute add while T1 and T3 are idle (nop)]

[Figure, step 2: fetch at address 4; T1 and T3 (PC=4) execute mul while T0 and T2 are idle (nop)]
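The efficiency loss can be put into numbers with a small model. The following is a minimal sketch, not the authors' simulator: it assumes one instruction issues per cycle, that each side of the if/else has a fixed length, and that a path is skipped when no thread takes it. The function name `serialize_branch` is hypothetical.

```python
def serialize_branch(taken, then_len, else_len):
    """Cycles and lane utilization for an if/else on a single-fetch SIMD warp.

    taken[i] is True when thread i takes the 'then' path.
    Assumes at least one instruction is executed overall.
    """
    width = len(taken)
    n_then = sum(taken)
    n_else = width - n_then
    # Each path that has at least one thread is issued in turn (serialization).
    cycles = (then_len if n_then else 0) + (else_len if n_else else 0)
    # Lane-cycles that do real work, as opposed to nop slots.
    useful = n_then * then_len + n_else * else_len
    return cycles, useful / (cycles * width)


# The slide's example: even threads run 'a+b', odd threads run 'a*b'.
taken = [tid % 2 == 0 for tid in range(4)]
print(serialize_branch(taken, 1, 1))  # (2, 0.5)
```

With the alternating pattern from the slides, half of the lane-cycles are nops; a warp whose threads all agree on the branch keeps full utilization.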


Outline

The GPU divergence problem

Simultaneous branch interweaving

Double instruction fetch

Finding branch-level parallelism

Restoring lockstep execution

Implementation

Simultaneous warp interweaving


Simultaneous Branch Interweaving

Add a second fetch unit

Simultaneous execution of divergent branches

1: if(!(tid%2)) {
2:   a+b;
3: } else {
4:   a*b;
5: }

[Figure: two fetches, at addresses 2 and 4, feed one warp; T0 and T2 (PC=2) execute add while T1 and T3 (PC=4) execute mul in the same cycle]


Standard divergence control: mask stack

Code              Operation   Mask stack (1 activity bit / thread, tid = 0 to 3)
                              1111
if(tid < 2) {     push        1111 1100
  if(tid == 0) {  push        1111 1100 1000
    x = 2;
  }               pop         1111 1100
  else {          push        1111 1100 0100
    x = 3;
  }               pop         1111 1100
}                 pop         1111

Problem: does not expose branch-level parallelism
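The push/pop sequence above can be replayed in software. This is a minimal sketch under the assumption that each push stores the enclosing mask ANDed with the branch condition; the helper name `mask_stack_trace` is hypothetical, and masks are written as strings with tid = 0 leftmost, as on the slide.

```python
def mask_stack_trace(width=4):
    """Replay the nested if/else and record the mask stack after each step."""
    stack = ["1" * width]          # all threads active on entry
    trace = [list(stack)]

    def push(cond):
        # New mask: threads active in the enclosing mask that also satisfy cond.
        top = stack[-1]
        stack.append("".join(
            "1" if top[tid] == "1" and cond(tid) else "0"
            for tid in range(width)))
        trace.append(list(stack))

    def pop():
        stack.pop()
        trace.append(list(stack))

    push(lambda tid: tid < 2)      # if (tid < 2) {
    push(lambda tid: tid == 0)     #   if (tid == 0) { x = 2;
    pop()                          #   }
    push(lambda tid: tid != 0)     #   else { x = 3;
    pop()                          #   }
    pop()                          # }
    return trace


for state in mask_stack_trace():
    print(" ".join(state))
```

Only the mask on top of the stack is executed at any time, so the two inner paths (1000 and 0100) never run together even though they use disjoint lanes.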


Alternative to stack: 1 PC / thread

Stack-based divergence control implies serialization

Master PC (MPC) compared with the per-thread program counters PC0 to PC3 (tid = 0 to 3)

Match → active

No match → inactive

Policy: MPC = min(PCi)

Earliest reconvergence with code laid out in Thread Frontiers order

if(tid < 2) {
  if(tid == 0) {
    x = 2;
  }
  else {
    x = 3;
  }
}

[Figure: per-thread PCs compared with the master PC; resulting activity bits for tid = 0 to 3: 1 0 0 0]

G. Diamos et al. SIMD re-convergence at thread frontiers. MICRO 44, 2011.
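The MPC = min(PCi) policy can be tried on the nested if/else. The sketch below is an illustration only: instruction addresses are assumed to equal source line numbers (1 to 7), `END = 8` is a hypothetical address just past the code, and `next_pc` and `run_single_mpc` are names made up for this example.

```python
END = 8  # hypothetical address just past the 7-line example


def next_pc(tid, pc):
    """Control flow of the nested if/else, one address per source line:
    1: if(tid < 2) {   2: if(tid == 0) {   3: x = 2;   4: } else {
    5: x = 3;          6: }                7: }
    """
    if pc == 1:
        return 2 if tid < 2 else 7
    if pc == 2:
        return 3 if tid == 0 else 5
    if pc == 3:
        return 6                      # skip the else part
    return pc + 1


def run_single_mpc(num_threads=4):
    pcs = [1] * num_threads
    trace = []
    while min(pcs) < END:
        mpc = min(pcs)                                  # policy: MPC = min(PCi)
        active = [pc == mpc for pc in pcs]              # match -> active
        trace.append((mpc, "".join("1" if a else "0" for a in active)))
        # Only active threads advance; the others keep their PC.
        pcs = [next_pc(t, pc) if a else pc
               for t, (pc, a) in enumerate(zip(pcs, active))]
    return trace


for mpc, mask in run_single_mpc():
    print(mpc, mask)
```

No stack is needed: threads that ran ahead simply wait at a higher PC until the minimum catches up, which is where they reconverge. The mask at address 3 is 1000, matching the activity bits shown on the slide.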


Run two branches simultaneously

MPC1 = min(PCi)

MPC2 = min(PCi : PCi ≠ MPC1)

if(tid < 2) {
  if(tid == 0) {
    x = 2;
  }
  else {
    x = 3;
  }
}

[Figure: per-thread PCs PC0 to PC3 (tid = 0 to 3) compared with Master PC 1 and Master PC 2; activity bits shown: 1 0 0 0]
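To see what the second master PC buys, the five-line if/else from the divergence slides can be scheduled with both MPC1 and MPC2. This is a sketch under assumptions: addresses equal the line numbers used on those slides, `END = 6` is a hypothetical address past the code, and the reconvergence constraints discussed later in the talk are ignored. All names are hypothetical.

```python
END = 6  # hypothetical address just past the 5-line example


def next_pc(tid, pc):
    """1: if(!(tid%2)) {  2: a+b;  3: } else {  4: a*b;  5: }"""
    if pc == 1:
        return 2 if tid % 2 == 0 else 4
    if pc in (2, 4):
        return 5                       # both paths rejoin at line 5
    return pc + 1


def mask_for(pcs, mpc):
    return "".join("1" if pc == mpc else "0" for pc in pcs)


def run_dual_mpc(num_threads=4):
    pcs = [1] * num_threads
    trace = []
    while min(pcs) < END:
        mpc1 = min(pcs)                                   # MPC1 = min(PCi)
        rest = [pc for pc in pcs if pc != mpc1 and pc < END]
        issued = [mpc1] + ([min(rest)] if rest else [])   # MPC2 = min(PCi != MPC1)
        trace.append([(m, mask_for(pcs, m)) for m in issued])
        pcs = [next_pc(t, pc) if pc in issued else pc
               for t, pc in enumerate(pcs)]
    return trace


for cycle in run_dual_mpc():
    print(cycle)
```

In this model the two paths issue in the same cycle with disjoint masks (1010 and 0101), so the example takes three issue cycles instead of the four needed with a single master PC.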


Restoring lockstep execution

Issue: unbalanced paths break lockstep execution

Power consumption, loss of memory locality

Solution: implicit partial synchronization barrier

[Figure: control-flow graph with blocks 1 to 9 and two schedules for threads T0 to T3. Greedy scheduling: instructions 6 and 7 are broken down and issued twice. Earliest reconvergence: synchronize before instruction 6.]


Enforcing control-flow reconvergence

T0 and T2 (at F) wait for T1 (in D).

T3 (in B) can proceed in parallel.

Wait for any thread of the warp between PCdiv and PCrec

Annotate reconvergence points with pointer to immediate dominator
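The waiting rule can be written as a one-line predicate. The sketch below is an assumption-laden illustration: whether the bounds of the interval are inclusive is not stated on the slide (a closed lower bound and open upper bound are assumed here), the block addresses are invented for the example, and `must_wait` is a hypothetical name.

```python
def must_wait(pcs, pc_div, pc_rec):
    """True when threads stopped at the reconvergence point pc_rec must wait.

    pcs holds the PC of every thread in the warp. A thread still located in
    [pc_div, pc_rec) may yet reach pc_rec, so the threads already there wait.
    The interval bounds are an assumption of this sketch.
    """
    return any(pc_div <= pc < pc_rec for pc in pcs)


# Hypothetical addresses: block B = 10, PCdiv = 20, block D = 30, block F = PCrec = 40.
print(must_wait([40, 30, 40, 10], pc_div=20, pc_rec=40))  # True: T1 is still in D
print(must_wait([40, 40, 40, 10], pc_div=20, pc_rec=40))  # False: only T3 is elsewhere
```

T3 sits outside the interval, so it neither blocks the threads at F nor is blocked by them, which is why it can proceed in parallel.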



Implementation: context table

Common case: few different PCs

Order stable in time

Keep common PCs + activity masks in sorted heap

Per-thread PCs (PC0 to PC7, threads T0 to T7):

12 17 3 17 17 3 3 17

Sorted context table (entries CPC1 to CPC3):

Common PC   Activity mask (T0 to T7)
3           0 0 1 0 0 1 1 0
12          1 0 0 0 0 0 0 0
17          0 1 0 1 1 0 0 1
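The mapping from per-thread PCs to the sorted context table can be reproduced in a few lines. This is a software sketch only; the hardware keeps the table incrementally rather than rebuilding it, and the name `build_context_table` is hypothetical.

```python
def build_context_table(pcs):
    """Group threads by PC and return (common PC, activity mask) pairs, sorted by PC."""
    table = {}
    for tid, pc in enumerate(pcs):
        # One mask per distinct PC; set the bit of every thread at that PC.
        table.setdefault(pc, ["0"] * len(pcs))[tid] = "1"
    return [(pc, " ".join(mask)) for pc, mask in sorted(table.items())]


for cpc, mask in build_context_table([12, 17, 3, 17, 17, 3, 3, 17]):
    print(cpc, mask)
```

With the eight PCs from the slide this yields three entries, for PCs 3, 12 and 17, whose masks partition the warp.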


Two-level context table

Cache top 2 entries in the Hot Context Table register

Constant-time access to MPCi = CPCi, activity masks

Other entries in the Cold Context Table linked list

Branch → incremental insertion in CCT
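A rough software model of the two levels is sketched below. It is not the hardware mechanism: for brevity it re-sorts all entries on each new PC, whereas the slide describes incremental insertion into a linked list, and the class and method names are hypothetical. Masks are integers with bit `tid` set for thread `tid`.

```python
class TwoLevelContextTable:
    HOT_SIZE = 2  # the slide caches the top 2 entries

    def __init__(self):
        self.hot = []    # models the Hot Context Table: the two smallest PCs
        self.cold = []   # models the Cold Context Table: remaining entries, sorted

    def insert(self, pc, mask):
        """Add threads (mask) that branched to pc."""
        for entry in self.hot + self.cold:
            if entry[0] == pc:            # PC already tracked: merge the masks
                entry[1] |= mask
                return
        # New PC: place it in sorted order, then split into hot and cold parts.
        entries = sorted(self.hot + self.cold + [[pc, mask]])
        self.hot = entries[:self.HOT_SIZE]
        self.cold = entries[self.HOT_SIZE:]

    def master_pcs(self):
        """MPC1 and MPC2 are read directly from the hot entries."""
        return [pc for pc, _ in self.hot]


table = TwoLevelContextTable()
for tid, pc in enumerate([12, 17, 3, 17, 17, 3, 3, 17]):
    table.insert(pc, 1 << tid)
print(table.master_pcs())  # [3, 12]
```

The point of the split is that the scheduler only ever reads the hot entries, while the less frequent branch events pay for the sorted insertion.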


Outline

The GPU divergence problem

Simultaneous branch interweaving

Simultaneous warp interweaving

Idea

Dealing with lane conflicts

Implementation

Results


Simultaneous Warp Interweaving

SBI limitation: often no secondary path

Single-sided ifs, loops…

SWI: opportunistically schedule instructions from other warps in divergence gaps

“SMT for SIMD”

[Figure: warp 0 (T0 to T3, PC=17) and warp 1 (T4 to T7, PC=42) share the four lanes; two fetches, at addresses 17 and 42, issue add and mul in the same cycle, and one lane stays idle (nop)]


Using divergence correlations

Issue: unbalanced divergence introduces conflicts

e.g. Parallel reduction

Solution: static lane shuffling

Apply different lane permutation for each warp

Preserves inter-thread memory locality

[Figure, same lane mapping for every warp: activity of lanes 0 to 3 over time for warps 0 to 3. Warp 0 is never compatible with warp 2: conflict in lane 0.]

[Figure, shuffled lane mapping: each warp uses a different lane permutation. Threads 0 mapped to different physical lanes: no conflict.]
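The effect of static lane shuffling can be checked on a tiny case. The permutation used below (XOR of the thread index with the warp index) is only one possible choice, picked here for illustration and not taken from the slides; it requires the warp width to be a power of two. The function names are hypothetical.

```python
def to_physical_lanes(thread_mask, warp_id, width=4):
    """Map a per-thread activity mask to physical lanes.

    Uses an XOR permutation keyed by the warp id (an assumption of this
    sketch); width must be a power of two so the result stays in range.
    """
    shuffle = warp_id % width
    lanes = 0
    for tid in range(width):
        if (thread_mask >> tid) & 1:
            lanes |= 1 << (tid ^ shuffle)
    return lanes


def conflict(lanes_a, lanes_b):
    """Two warps conflict when they need the same physical lane."""
    return (lanes_a & lanes_b) != 0


# Late step of a parallel reduction: only thread 0 is active in each warp.
only_t0 = 0b0001
print(conflict(only_t0, only_t0))                      # True with identical mappings
print(conflict(to_physical_lanes(only_t0, 0),
               to_physical_lanes(only_t0, 2)))         # False after shuffling
```

Because the permutation is fixed per warp, neighbouring threads of a warp still map to a fixed set of lanes, which is consistent with the slide's remark that memory locality is preserved.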


Detecting a compatible secondary warp

Bitset inclusion test: Content-Associative Memory

Treat zeros as don't care bits

Power-hungry!

[Figure: the mask m is compared against the activity masks of warps W0 to W6 in parallel; compatible warps signal a hit]

Set-associative lookup

Split warps in sets

Restrict lookup to 1 set

More power-efficient

[Figure: the same lookup restricted to the warps of the same set]
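Both lookups reduce to a bitset inclusion test over a chosen group of warps. The sketch below is a sequential software model of what the hardware does in parallel; how warps are grouped into sets (here, consecutive warp ids) is an assumption, and `find_secondary_warp` is a hypothetical name.

```python
def find_secondary_warp(primary_id, masks, width, ways=None):
    """Return the id of a warp that fits in the primary warp's idle lanes, or None.

    masks[w] is the activity mask of warp w (bit i set means lane i is active).
    ways=None searches every warp, like the fully associative lookup;
    otherwise only the warps in the primary's set of `ways` consecutive ids.
    """
    free = ~masks[primary_id] & ((1 << width) - 1)   # idle lanes of the primary warp
    for wid, mask in enumerate(masks):
        if wid == primary_id or mask == 0:
            continue
        if ways is not None and wid // ways != primary_id // ways:
            continue                                  # outside the primary's set
        if (mask & ~free) == 0:                       # inclusion: mask is a subset of free
            return wid
    return None


masks = [0b0011, 0b0110, 0b1111, 0b1100, 0b0100]
print(find_secondary_warp(0, masks, width=4))          # 3
print(find_secondary_warp(0, masks, width=4, ways=2))  # None
```

Restricting the search to one set can miss a compatible warp, as in the second call; the slides weigh that loss against the reduced power cost.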


Set-associative lookup is good enough

3-way: captures 66% of performance potential

Direct-mapped: 48%


Experimental configuration

Baseline: clustered SIMT architecture (Fermi, Kepler)

Tie both clusters together to form warps twice as large

Direct both instructions to the same execution units

Baseline: warp size 32, 2 warps / clock, 1 instruction / warp

SBI/SWI: warp size 64, 1 warp / clock, 2 instructions / warp

[Figure: the baseline has two clusters, each with its own warp pool and lanes T0 to T7; SBI/SWI ties them together behind a common warp pool]

Fetch-unit / execute-unit ratio maintained


Performance results

[Figure: speedup charts for regular applications and irregular applications]

Speedup    Regular   Irregular
SBI        +15%      +41%
SWI        +25%      +33%
SBI+SWI    +23%      +40%


Perspective: SMT-GPU µarch convergence

Converging trends in SMT and GPU architecture

Closing micro-architectural space between Clustered Multi-Threading and SIMD

Explore new tradeoffs between power efficiency and flexibility?

[Figure: design space between SMT (flexibility) and SIMD (efficiency on regular MT apps), with SIMT in between and an open question mark in the middle]

Merge instructions from concurrent threads: MMT [Long10], MIS [Dechene10], Fetch-combining [Kumar04], Thread Fusion [González08]

Loosen constraints of SIMD execution: DWF [Fung07], DWS [Meng10], BC [Fung11], LW [Narasiman11], TF [Diamos11], CAPRI [Rhu12], iGPU [Menon12], SBI / SWI (this work)


Backups


References

R. Kumar et al. Conjoined-core chip multiprocessing. MICRO 37, 2004.

J. González et al. Thread fusion. ISLPED 13, 2008.

W. W. L. Fung et al. Dynamic warp formation: efficient MIMD control flow on SIMD graphics hardware. TACO, 2009.

G. Long et al. Minimal multi-threading: finding and removing redundant instructions in multithreaded processors. MICRO 43, 2010.

M. Dechene et al. Multi-threaded instruction sharing. Technical report, 2010.

J. Meng et al. Dynamic warp subdivision for integrated branch and memory divergence tolerance. ISCA 37, 2010.

G. Diamos et al. SIMD re-convergence at thread frontiers. MICRO 44, 2011.

W. Fung et al. Thread block compaction for efficient SIMT control flow. HPCA 17, 2011.


SWI implies SMT

Heterogeneous execution units

SWI improves utilization with superscalar execution

[Figure: warp 0 (T0 to T3, PC=17) issues add to the ALU while warp 1 (T4 to T7, PC=42) issues load to the LSU in the same cycle]


SBI vs. DWF

SBI

Dual fetch

Uses branch-level parallelism

Sensitive to branch unbalance

Preserves in-warp locality

[Figure, SBI: two fetches, at addresses 2 and 4, feed one warp; T0 and T2 (PC=2) execute add while T1 and T3 (PC=4) execute mul]

DWF

Single fetch

Uses warp-level parallelism

Sensitive to lane activity unbalance

[Figure, DWF: a single fetch at PC=2 issues add to threads gathered from four warps (T0 to T15)]


SWI vs. DWF

SWI

Dual fetch

Uses warp-level parallelism

Sensitive to lane conflicts

Preserves in-warp locality

[Figure, SWI: instructions from two warps (add and mul) share the lanes in the same cycle; one lane stays idle (nop)]

DWF

Single fetch

Uses warp-level parallelism

Sensitive to lane activity unbalance, low occupancy

[Figure, DWF: a single fetch at PC=2 issues add to threads gathered from four warps (T0 to T15)]


Simulation platform

Barra: functional GPU simulator

modeled after NVIDIA Tesla GPUs

Runs native Tesla SASS binaries

Reproduces SIMT execution

Timing-power model

Cycle-accurate execution pipeline

Constant-latency, bandwidth-bound memory

Calibration from GPU microbenchmarks

http://gpgpu.univ-perp.fr/index.php/Barra

Sylvain Collange, Marc Daumas, David Defour, David Parello. Barra: a parallel functional simulator for GPGPU. MASCOTS 2010.


SBI scoreboarding logic

Keep track of dependencies induced by thread divergence-reconvergence

Transitive closure of dependency graph


Goto considered harmful?

MIPS: j, jal, jr, syscall

Intel GMA Gen4 (2006): jmpi, if, iff, else, endif, do, while, break, cont, halt, msave, mrest, push, pop

Intel GMA SB (2011): jmpi, if, else, endif, case, while, break, cont, halt, call, return, fork

AMD Cayman (2011): push, push_else, pop, push_wqm, pop_wqm, else_wqm, jump_any, reactivate, reactivate_wqm, loop_start, loop_start_no_al, loop_start_dx10, loop_end, loop_continue, loop_break, jump, else, call, call_fs, return, return_fs, alu, alu_push_before, alu_pop_after, alu_pop2_after, alu_continue, alu_break, alu_else_after

AMD R600 (2007): push, push_else, pop, loop_start, loop_start_no_al, loop_start_dx10, loop_end, loop_continue, loop_break, jump, else, call, call_fs, return, return_fs, alu, alu_push_before, alu_pop_after, alu_pop2_after, alu_continue, alu_break, alu_else_after

AMD R500 (2005): jump, loop, endloop, rep, endrep, breakloop, breakrep, continue

NVIDIA Tesla (2007): bar, bra, brk, brkpt, cal, cont, kil, pbk, pret, ret, ssy, trap, .s

NVIDIA Fermi (2010): bar, bpt, bra, brk, brx, cal, cont, exit, jcal, jmp, jmx, longjmp, pbk, pcnt, plongjmp, pret, ret, ssy, .s

Control instructions in some CPU and GPU instruction sets

Control flow structure is explicit

GPU-specific instruction sets

No support for arbitrary control flow


Flynn's taxonomy revisited

Resource type                      Resource count
                                   1       2       M
Instruction fetch (F)              SIMT    DIMT    MIMT
Memory port, address (M)           SAMT    DAMT    MAMT
Computation / registers, data (X)  SDMT    DDMT    MDMT

[Figure: for each cell, threads T0 to T3 sharing 1, 2 or M copies of the resource]

A. Glew. Coherent vector lane threading. Berkeley ParLab Seminar, 2009.


Examples: conventional design points

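Design point        Classification   Expanded
Multi-core          MIMD(MAMT)       MI MD MA MT
GPU                 SI(MDSA)MT       SIMD SA MT
Short-vector SIMD   SIMD(SAST)       SIMD SA ST

[Figure: fetch (F), memory (M) and execute (X) resources of each design point, shared by threads T0 to T2 or used by a single thread T0]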