65
Warp-Aware Trace Scheduling for GPUS James Jablin (Brown) Thomas Jablin (UIUC) Onur Mutlu (CMU) Maurice Herlihy (Brown)

Warp-Aware Trace Scheduling for GPUSomutlu/pub/warp-aware...Origins: Microcode Trace Scheduling...generalizing local and disparate vertical-to-horizontal microcode compaction 3. Local

  • Upload
    others

  • View
    1

  • Download
    0

Embed Size (px)

Citation preview

Page 1: Warp-Aware Trace Scheduling for GPUSomutlu/pub/warp-aware...Origins: Microcode Trace Scheduling...generalizing local and disparate vertical-to-horizontal microcode compaction 3. Local

Warp-Aware TraceScheduling for GPUS

James Jablin (Brown)

Thomas Jablin (UIUC)

Onur Mutlu (CMU)

Maurice Herlihy (Brown)

Page 2: Warp-Aware Trace Scheduling for GPUSomutlu/pub/warp-aware...Origins: Microcode Trace Scheduling...generalizing local and disparate vertical-to-horizontal microcode compaction 3. Local

Historical Trends in GFLOPS:CPUs vs. GPUs

0

250

500

750

1000

1250

1500

1750

2000

2250

2500

2750

3000

3250

2001 2002 2003 2004 2005 2006 2007 2008 2009 2010 2011

Northwood WoodcrestPrescott HarpertownBloomfield

WestmereSandy Bridge

NVIDIA GPU Single-Precision FP

Intel CPU Single-Precision FP

2012

GeForce 5800

GeForce 6800 Ultra

GeForce 7800 GTX

GeForce 8800 GTX

GeForce 280 GTX

GeForce 480 GTX

GeForce 580 GTX

GeForce 680 GTX

Theore

tica

l G

FLO

P/s

reproduced from NVIDIA CUDA C Programming Guide (Version 5.0)

Page 3: Warp-Aware Trace Scheduling for GPUSomutlu/pub/warp-aware...Origins: Microcode Trace Scheduling...generalizing local and disparate vertical-to-horizontal microcode compaction 3. Local

Performance Pitfalls

Control flow cannegatively affect performance.

Page 4: Warp-Aware Trace Scheduling for GPUSomutlu/pub/warp-aware...Origins: Microcode Trace Scheduling...generalizing local and disparate vertical-to-horizontal microcode compaction 3. Local

Pipeline Stall - execution delay in aninstruction pipeline to resolve adependency

Performance Pitfalls

Page 5: Warp-Aware Trace Scheduling for GPUSomutlu/pub/warp-aware...Origins: Microcode Trace Scheduling...generalizing local and disparate vertical-to-horizontal microcode compaction 3. Local

Hardware: CPU versus GPU

ControlALU ALU

ALUALU

Cache

DRAM DRAM

CPU GPUreproduced from NVIDIA CUDA C Programming Guide (Version 5.0)

Page 6: Warp-Aware Trace Scheduling for GPUSomutlu/pub/warp-aware...Origins: Microcode Trace Scheduling...generalizing local and disparate vertical-to-horizontal microcode compaction 3. Local

6Clock Cycle

0 1 2 3 4 87

CompletedInstructions

5

Decode

Execute

Fetch

Write

Pipeline Stages

WaitingInstructions

With Branch Prediction

6Clock Cycle

0 1 2 3 4 875 109

Without Branch Prediction

Page 7: Warp-Aware Trace Scheduling for GPUSomutlu/pub/warp-aware...Origins: Microcode Trace Scheduling...generalizing local and disparate vertical-to-horizontal microcode compaction 3. Local

6Clock Cycle

0 1 2 3 4 87

CompletedInstructions

5

Decode

Execute

Fetch

Write

Pipeline Stages

WaitingInstructions

With Branch Prediction

6Clock Cycle

0 1 2 3 4 875 109

Without Branch Prediction

Page 8: Warp-Aware Trace Scheduling for GPUSomutlu/pub/warp-aware...Origins: Microcode Trace Scheduling...generalizing local and disparate vertical-to-horizontal microcode compaction 3. Local

6Clock Cycle

0 1 2 3 4 87

CompletedInstructions

5

Decode

Execute

Fetch

Write

Pipeline Stages

WaitingInstructions

With Branch Prediction

6Clock Cycle

0 1 2 3 4 875 109

Without Branch Prediction

Page 9: Warp-Aware Trace Scheduling for GPUSomutlu/pub/warp-aware...Origins: Microcode Trace Scheduling...generalizing local and disparate vertical-to-horizontal microcode compaction 3. Local

6Clock Cycle

0 1 2 3 4 87

CompletedInstructions

5

Decode

Execute

Fetch

Write

Pipeline Stages

WaitingInstructions

With Branch Prediction

6Clock Cycle

0 1 2 3 4 875 109

Without Branch Prediction

pipeline stall (bubble)

Page 10: Warp-Aware Trace Scheduling for GPUSomutlu/pub/warp-aware...Origins: Microcode Trace Scheduling...generalizing local and disparate vertical-to-horizontal microcode compaction 3. Local

6Clock Cycle

0 1 2 3 4 87

CompletedInstructions

5

Decode

Execute

Fetch

Write

Pipeline Stages

WaitingInstructions

With Branch Prediction

6Clock Cycle

0 1 2 3 4 875 109

Without Branch Prediction

Page 11: Warp-Aware Trace Scheduling for GPUSomutlu/pub/warp-aware...Origins: Microcode Trace Scheduling...generalizing local and disparate vertical-to-horizontal microcode compaction 3. Local

6Clock Cycle

0 1 2 3 4 87

CompletedInstructions

5

Decode

Execute

Fetch

Write

Pipeline Stages

WaitingInstructions

With Branch Prediction

6Clock Cycle

0 1 2 3 4 875 109

Without Branch Prediction

Page 12: Warp-Aware Trace Scheduling for GPUSomutlu/pub/warp-aware...Origins: Microcode Trace Scheduling...generalizing local and disparate vertical-to-horizontal microcode compaction 3. Local

6Clock Cycle

0 1 2 3 4 87

CompletedInstructions

5

Decode

Execute

Fetch

Write

Pipeline Stages

WaitingInstructions

With Branch Prediction

6Clock Cycle

0 1 2 3 4 875 109

Without Branch Prediction

Page 13: Warp-Aware Trace Scheduling for GPUSomutlu/pub/warp-aware...Origins: Microcode Trace Scheduling...generalizing local and disparate vertical-to-horizontal microcode compaction 3. Local

6Clock Cycle

0 1 2 3 4 87

CompletedInstructions

5

Decode

Execute

Fetch

Write

Pipeline Stages

WaitingInstructions

With Branch Prediction

6Clock Cycle

0 1 2 3 4 875 109

Without Branch Prediction

Page 14: Warp-Aware Trace Scheduling for GPUSomutlu/pub/warp-aware...Origins: Microcode Trace Scheduling...generalizing local and disparate vertical-to-horizontal microcode compaction 3. Local

6Clock Cycle

0 1 2 3 4 87

CompletedInstructions

5

Decode

Execute

Fetch

Write

Pipeline Stages

WaitingInstructions

With Branch Prediction

6Clock Cycle

0 1 2 3 4 875 109

Without Branch Prediction

Page 15: Warp-Aware Trace Scheduling for GPUSomutlu/pub/warp-aware...Origins: Microcode Trace Scheduling...generalizing local and disparate vertical-to-horizontal microcode compaction 3. Local

6Clock Cycle

0 1 2 3 4 87

CompletedInstructions

5

Decode

Execute

Fetch

Write

Pipeline Stages

WaitingInstructions

With Branch Prediction

6Clock Cycle

0 1 2 3 4 875 9 10

Without Branch Prediction

Page 16: Warp-Aware Trace Scheduling for GPUSomutlu/pub/warp-aware...Origins: Microcode Trace Scheduling...generalizing local and disparate vertical-to-horizontal microcode compaction 3. Local

6Clock Cycle

0 1 2 3 4 87

CompletedInstructions

5

Decode

Execute

Fetch

Write

Pipeline Stages

WaitingInstructions

With Branch Prediction

6Clock Cycle

0 1 2 3 4 875 109

Without Branch Prediction

Page 17: Warp-Aware Trace Scheduling for GPUSomutlu/pub/warp-aware...Origins: Microcode Trace Scheduling...generalizing local and disparate vertical-to-horizontal microcode compaction 3. Local

6Clock Cycle

0 1 2 3 4 87

CompletedInstructions

5

Decode

Execute

Fetch

Write

Pipeline Stages

WaitingInstructions

With Branch Prediction

6Clock Cycle

0 1 2 3 4 875 109

Without Branch Prediction

Page 18: Warp-Aware Trace Scheduling for GPUSomutlu/pub/warp-aware...Origins: Microcode Trace Scheduling...generalizing local and disparate vertical-to-horizontal microcode compaction 3. Local

Pipeline Stall - execution delay in aninstruction pipeline to resolve adependency

Performance Pitfalls

Page 19: Warp-Aware Trace Scheduling for GPUSomutlu/pub/warp-aware...Origins: Microcode Trace Scheduling...generalizing local and disparate vertical-to-horizontal microcode compaction 3. Local

Performance Pitfalls

Warp Divergence - threads within awarp take different paths and thedifferent execution paths are serialized

Pipeline Stall - execution delay in aninstruction pipeline to resolve adependency

Page 20: Warp-Aware Trace Scheduling for GPUSomutlu/pub/warp-aware...Origins: Microcode Trace Scheduling...generalizing local and disparate vertical-to-horizontal microcode compaction 3. Local

Warp Divergence Example

B

C

A

D

Page 21: Warp-Aware Trace Scheduling for GPUSomutlu/pub/warp-aware...Origins: Microcode Trace Scheduling...generalizing local and disparate vertical-to-horizontal microcode compaction 3. Local

Warp Divergence Example

B

C

A

D

A

Page 22: Warp-Aware Trace Scheduling for GPUSomutlu/pub/warp-aware...Origins: Microcode Trace Scheduling...generalizing local and disparate vertical-to-horizontal microcode compaction 3. Local

Warp Divergence Example

Warp Divergence!

B

C

A

D

A

Page 23: Warp-Aware Trace Scheduling for GPUSomutlu/pub/warp-aware...Origins: Microcode Trace Scheduling...generalizing local and disparate vertical-to-horizontal microcode compaction 3. Local

Warp Divergence Example

Warp Divergence!

B

C

A

D

A

B

Page 24: Warp-Aware Trace Scheduling for GPUSomutlu/pub/warp-aware...Origins: Microcode Trace Scheduling...generalizing local and disparate vertical-to-horizontal microcode compaction 3. Local

Warp Divergence Example

Warp Divergence!

B

C

A

D

Warp Divergence!

A

B

Page 25: Warp-Aware Trace Scheduling for GPUSomutlu/pub/warp-aware...Origins: Microcode Trace Scheduling...generalizing local and disparate vertical-to-horizontal microcode compaction 3. Local

Warp Divergence Example

Warp Divergence!

B

C

A

D

Warp Divergence!

A

B

Page 26: Warp-Aware Trace Scheduling for GPUSomutlu/pub/warp-aware...Origins: Microcode Trace Scheduling...generalizing local and disparate vertical-to-horizontal microcode compaction 3. Local

Warp Divergence Example

Warp Divergence!

B

C

A

D

Warp Divergence!

Warp Reconverges!

A

B

Page 27: Warp-Aware Trace Scheduling for GPUSomutlu/pub/warp-aware...Origins: Microcode Trace Scheduling...generalizing local and disparate vertical-to-horizontal microcode compaction 3. Local

Warp Divergence Example

Warp Divergence!

B

C

A

D

Warp Divergence!

Warp Reconverges!

A

B

D

Page 28: Warp-Aware Trace Scheduling for GPUSomutlu/pub/warp-aware...Origins: Microcode Trace Scheduling...generalizing local and disparate vertical-to-horizontal microcode compaction 3. Local

Warp-Aware Trace Scheduling

Schedule instructions across basic blockboundaries to expose additional ILP...

Page 29: Warp-Aware Trace Scheduling for GPUSomutlu/pub/warp-aware...Origins: Microcode Trace Scheduling...generalizing local and disparate vertical-to-horizontal microcode compaction 3. Local

Warp-Aware Trace Scheduling

Schedule instructions across basic blockboundaries to expose additional ILP...

while managing andoptimizing warp divergence.

Page 30: Warp-Aware Trace Scheduling for GPUSomutlu/pub/warp-aware...Origins: Microcode Trace Scheduling...generalizing local and disparate vertical-to-horizontal microcode compaction 3. Local

Origins: Microcode Trace Scheduling

...generalizing local and disparate vertical-to-horizontal microcode compaction

Step Description

Page 31: Warp-Aware Trace Scheduling for GPUSomutlu/pub/warp-aware...Origins: Microcode Trace Scheduling...generalizing local and disparate vertical-to-horizontal microcode compaction 3. Local

Origins: Microcode Trace Scheduling

...generalizing local and disparate vertical-to-horizontal microcode compaction

1. Trace Selection

Step Description

Page 32: Warp-Aware Trace Scheduling for GPUSomutlu/pub/warp-aware...Origins: Microcode Trace Scheduling...generalizing local and disparate vertical-to-horizontal microcode compaction 3. Local

Origins: Microcode Trace Scheduling

...generalizing local and disparate vertical-to-horizontal microcode compaction

2. Trace Formation

1. Trace Selection

Step Description

Page 33: Warp-Aware Trace Scheduling for GPUSomutlu/pub/warp-aware...Origins: Microcode Trace Scheduling...generalizing local and disparate vertical-to-horizontal microcode compaction 3. Local

Origins: Microcode Trace Scheduling

...generalizing local and disparate vertical-to-horizontal microcode compaction

3. Local Scheduling

2. Trace Formation

1. Trace Selection

Step Description

Page 34: Warp-Aware Trace Scheduling for GPUSomutlu/pub/warp-aware...Origins: Microcode Trace Scheduling...generalizing local and disparate vertical-to-horizontal microcode compaction 3. Local

Origins: Microcode Trace Scheduling

...generalizing local and disparate vertical-to-horizontal microcode compaction

3. Local Scheduling

schedule instructionswithin each region

2. Trace Formation facilitate local scheduling,potentially adding nodesand edges

1. Trace Selection partition basic blocksinto regions

Step Description

Page 35: Warp-Aware Trace Scheduling for GPUSomutlu/pub/warp-aware...Origins: Microcode Trace Scheduling...generalizing local and disparate vertical-to-horizontal microcode compaction 3. Local

J

L

K

A

B

C

G

H

I

D

F

E

Page 36: Warp-Aware Trace Scheduling for GPUSomutlu/pub/warp-aware...Origins: Microcode Trace Scheduling...generalizing local and disparate vertical-to-horizontal microcode compaction 3. Local

J

L

K

50

100

A

B

C

G

H

I

90

10

99

D

F

E

87

100

13

1

92

100

8

595

50

Annotate CFG - dynamic profiling - static branch prediction

100

100

Page 37: Warp-Aware Trace Scheduling for GPUSomutlu/pub/warp-aware...Origins: Microcode Trace Scheduling...generalizing local and disparate vertical-to-horizontal microcode compaction 3. Local

J

L

K

50

100

A

B

C

G

H

I

90

10

99

D

F

E

87

100

13

1

92

100

8

595

50

Annotate CFG - dynamic profiling - static branch prediction

Find the nextunvisited node,with highest edgeweight

Add node to trace

loop

end loop

while there are unvisited nodes

end while

100

100

Page 38: Warp-Aware Trace Scheduling for GPUSomutlu/pub/warp-aware...Origins: Microcode Trace Scheduling...generalizing local and disparate vertical-to-horizontal microcode compaction 3. Local

J

L

K

50

100

A

B

C

G

H

I

90

10

99

D

F

E

87

100

13

1

92

100

8

595

50

Annotate CFG - dynamic profiling - static branch prediction

Find the nextunvisited node,with highest edgeweight

Add node to trace

loop

end loop

while there are unvisited nodes

end while

100

100

Page 39: Warp-Aware Trace Scheduling for GPUSomutlu/pub/warp-aware...Origins: Microcode Trace Scheduling...generalizing local and disparate vertical-to-horizontal microcode compaction 3. Local

J

L

K

50

100

A

B

C

G

H

I

90

10

99

D

F

E

87

100

13

1

92

100

8

595

50

Annotate CFG - dynamic profiling - static branch prediction

Find the nextunvisited node,with highest edgeweight

Add node to trace

loop

end loop

while there are unvisited nodes

end while

100

100

Page 40: Warp-Aware Trace Scheduling for GPUSomutlu/pub/warp-aware...Origins: Microcode Trace Scheduling...generalizing local and disparate vertical-to-horizontal microcode compaction 3. Local

J

L

K

50

100

A

B

C

G

H

I

90

10

99

D

F

E

87

100

13

1

92

100

8

595

50

Annotate CFG - dynamic profiling - static branch prediction

Find the nextunvisited node,with highest edgeweight

Add node to trace

loop

end loop

while there are unvisited nodes

end while

100

100

Page 41: Warp-Aware Trace Scheduling for GPUSomutlu/pub/warp-aware...Origins: Microcode Trace Scheduling...generalizing local and disparate vertical-to-horizontal microcode compaction 3. Local

J

L

K

50

100

A

B

C

G

H

I

90

10

99

D

F

E

87

100

13

1

92

100

8

Trace # 1

595

50

Annotate CFG - dynamic profiling - static branch prediction

Find the nextunvisited node,with highest edgeweight

Add node to trace

loop

end loop

while there are unvisited nodes

end while

100

100

Page 42: Warp-Aware Trace Scheduling for GPUSomutlu/pub/warp-aware...Origins: Microcode Trace Scheduling...generalizing local and disparate vertical-to-horizontal microcode compaction 3. Local

J

L

K

50

100

A

B

C

G

H

I

90

10

99

D

F

E

87

100

13

1

92

100

8

Trace # 1

595

50

Annotate CFG - dynamic profiling - static branch prediction

Find the nextunvisited node,with highest edgeweight

Add node to trace

loop

end loop

while there are unvisited nodes

end while

100

100

Page 43: Warp-Aware Trace Scheduling for GPUSomutlu/pub/warp-aware...Origins: Microcode Trace Scheduling...generalizing local and disparate vertical-to-horizontal microcode compaction 3. Local

J

L

K

50

100

A

B

C

G

H

I

90

10

99

D

F

E

87

100

13

1

92

100

8

Trace # 1

595

50

Annotate CFG - dynamic profiling - static branch prediction

Find the nextunvisited node,with highest edgeweight

Add node to trace

loop

end loop

while there are unvisited nodes

end while

100

100

Page 44: Warp-Aware Trace Scheduling for GPUSomutlu/pub/warp-aware...Origins: Microcode Trace Scheduling...generalizing local and disparate vertical-to-horizontal microcode compaction 3. Local

J

L

K

50

100

A

B

C

G

H

I

90

10

99

D

F

E

87

100

13

1

92

100

8Trace # 2

Trace # 1

Trace # 3

595

50

Annotate CFG - dynamic profiling - static branch prediction

Find the nextunvisited node,with highest edgeweight

Add node to trace

loop

end loop

while there are unvisited nodes

end while

100

100

Page 45: Warp-Aware Trace Scheduling for GPUSomutlu/pub/warp-aware...Origins: Microcode Trace Scheduling...generalizing local and disparate vertical-to-horizontal microcode compaction 3. Local

BB3BB3:mul.wide.s32 %rd15, %r3, 4;add.s64 %rd12, %rd1, %rd15;shl.b64 %rd16, %rd4, 6;mov.u64 %rd17, __param_0;add.s64 %rd18, %rd17, %rd16;mul.wide.s32 %rd19, %r2, 4;add.s64 %rd2, %rd18, %rd19;ld.global.f32 %f4, [%rd2]; ...

454647484950515253...

...mul.wide.s32 %rd13, %r4, 4;add.s54 %rd14, %rd2, %rd13;cvt.s64.s32 %rd7, %r6;add.s64, %rd8, %rd6, %rd7;bra.uni BB3;

3435363738

...

...mov.u32 %r11, %ctaid.y;add.s32 %r12, %r8, 1;mov.u32 %r1, %tid.y;mov.u32 %r2, %tid.x; ...setp.ne.s32 %p2, %r2, 0;@%p2 bra BB2;

10111213

2122

......

10

100

BB0

BB1394041424344

100

BB2BB2:ld.shared.f32 %f5, [%rd3];ld.shared.f32 %f6, [%rd3];mul.f32 %f7, %f5, %f6;st.shared.f32 [%r2], %f7;bra.uni BB3;

90

Before After

BB3BB3:mul.wide.s32 %rd15, %r3, 4;add.s64 %rd12, %rd1, %rd15;shl.b64 %rd16, %rd4, 6;mov.u64 %rd17, __param_0;add.s64 %rd18, %rd17, %rd16;mul.wide.s32 %rd19, %r2, 4;add.s64 %rd2, %rd18, %rd19;ld.global.f32 %f4, [%rd2]; ...

454647484950515253...

...mul.wide.s32 %rd13, %r4, 4;add.s54 %rd14, %rd2, %rd13;cvt.s64.s32 %rd7, %r6;add.s64, %rd8, %rd6, %rd7;bra.uni BB3;

3435363738

...

...mov.u32 %r11, %ctaid.y;add.s32 %r12, %r8, 1;mov.u32 %r1, %tid.y;mov.u32 %r2, %tid.x; ...setp.ne.s32 %p2, %r2, 0;@%p2 bra BB2;

10111213

2122

......

10

100

BB0

BB1394041424344

100

BB2BB2:ld.shared.f32 %f5, [%rd3];ld.shared.f32 %f6, [%rd3];mul.f32 %f7, %f5, %f6;st.shared.f32 [%r2], %f7;bra.uni BB3;

90

Page 46: Warp-Aware Trace Scheduling for GPUSomutlu/pub/warp-aware...Origins: Microcode Trace Scheduling...generalizing local and disparate vertical-to-horizontal microcode compaction 3. Local

BB3BB3:mul.wide.s32 %rd15, %r3, 4;add.s64 %rd12, %rd1, %rd15;shl.b64 %rd16, %rd4, 6;mov.u64 %rd17, __param_0;add.s64 %rd18, %rd17, %rd16;mul.wide.s32 %rd19, %r2, 4;add.s64 %rd2, %rd18, %rd19;ld.global.f32 %f4, [%rd2]; ...

454647484950515253...

...mul.wide.s32 %rd13, %r4, 4;add.s54 %rd14, %rd2, %rd13;cvt.s64.s32 %rd7, %r6;add.s64, %rd8, %rd6, %rd7;bra.uni BB3;

3435363738

...

...mov.u32 %r11, %ctaid.y;add.s32 %r12, %r8, 1;mov.u32 %r1, %tid.y;mov.u32 %r2, %tid.x; ...setp.ne.s32 %p2, %r2, 0;@%p2 bra BB2;

10111213

2122

......

10

100

BB0

BB1394041424344

100

BB2BB2:ld.shared.f32 %f5, [%rd3];ld.shared.f32 %f6, [%rd3];mul.f32 %f7, %f5, %f6;st.shared.f32 [%r2], %f7;bra.uni BB3;

90

Before After

BB3BB3:mul.wide.s32 %rd15, %r3, 4;add.s64 %rd12, %rd1, %rd15;shl.b64 %rd16, %rd4, 6;mov.u64 %rd17, __param_0;add.s64 %rd18, %rd17, %rd16;mul.wide.s32 %rd19, %r2, 4;add.s64 %rd2, %rd18, %rd19;ld.global.f32 %f4, [%rd2]; ...

454647484950515253...

...mul.wide.s32 %rd13, %r4, 4;add.s54 %rd14, %rd2, %rd13;cvt.s64.s32 %rd7, %r6;add.s64, %rd8, %rd6, %rd7;bra.uni BB3;

3435363738

...

...mov.u32 %r11, %ctaid.y;add.s32 %r12, %r8, 1;mov.u32 %r1, %tid.y;mov.u32 %r2, %tid.x; ...setp.ne.s32 %p2, %r2, 0;@%p2 bra BB2;

10111213

2122

......

10

100

BB0

BB1394041424344

100

BB2BB2:ld.shared.f32 %f5, [%rd3];ld.shared.f32 %f6, [%rd3];mul.f32 %f7, %f5, %f6;st.shared.f32 [%r2], %f7;bra.uni BB3;

90

Page 47: Warp-Aware Trace Scheduling for GPUSomutlu/pub/warp-aware...Origins: Microcode Trace Scheduling...generalizing local and disparate vertical-to-horizontal microcode compaction 3. Local

BB3BB3:mul.wide.s32 %rd15, %r3, 4;add.s64 %rd12, %rd1, %rd15;shl.b64 %rd16, %rd4, 6;mov.u64 %rd17, __param_0;add.s64 %rd18, %rd17, %rd16;mul.wide.s32 %rd19, %r2, 4;add.s64 %rd2, %rd18, %rd19;ld.global.f32 %f4, [%rd2]; ...

454647484950515253...

...mul.wide.s32 %rd13, %r4, 4;add.s54 %rd14, %rd2, %rd13;cvt.s64.s32 %rd7, %r6;add.s64, %rd8, %rd6, %rd7;bra.uni BB3;

3435363738

...

...mov.u32 %r11, %ctaid.y;add.s32 %r12, %r8, 1;mov.u32 %r1, %tid.y;mov.u32 %r2, %tid.x; ...setp.ne.s32 %p2, %r2, 0;@%p2 bra BB2;

10111213

2122

......

10

100

BB0

BB1394041424344

100

BB2BB2:ld.shared.f32 %f5, [%rd3];ld.shared.f32 %f6, [%rd3];mul.f32 %f7, %f5, %f6;st.shared.f32 [%r2], %f7;bra.uni BB3;

90

Before After

BB3BB3:mul.wide.s32 %rd15, %r3, 4;add.s64 %rd12, %rd1, %rd15;shl.b64 %rd16, %rd4, 6;mov.u64 %rd17, __param_0;add.s64 %rd18, %rd17, %rd16;mul.wide.s32 %rd19, %r2, 4;add.s64 %rd2, %rd18, %rd19;ld.global.f32 %f4, [%rd2]; ...

454647484950515253...

...mul.wide.s32 %rd13, %r4, 4;add.s54 %rd14, %rd2, %rd13;cvt.s64.s32 %rd7, %r6;add.s64, %rd8, %rd6, %rd7;bra.uni BB3;

3435363738

...

...mov.u32 %r11, %ctaid.y;add.s32 %r12, %r8, 1;mov.u32 %r1, %tid.y;mov.u32 %r2, %tid.x; ...setp.ne.s32 %p2, %r2, 0;@%p2 bra BB2;

10111213

2122

......

10

100

BB0

BB1394041424344

100

BB2BB2:ld.shared.f32 %f5, [%rd3];ld.shared.f32 %f6, [%rd3];mul.f32 %f7, %f5, %f6;st.shared.f32 [%r2], %f7;bra.uni BB3;

90

Page 48: Warp-Aware Trace Scheduling for GPUSomutlu/pub/warp-aware...Origins: Microcode Trace Scheduling...generalizing local and disparate vertical-to-horizontal microcode compaction 3. Local

BB3BB3:mul.wide.s32 %rd15, %r3, 4;add.s64 %rd12, %rd1, %rd15;shl.b64 %rd16, %rd4, 6;mov.u64 %rd17, __param_0;add.s64 %rd18, %rd17, %rd16;mul.wide.s32 %rd19, %r2, 4;add.s64 %rd2, %rd18, %rd19;ld.global.f32 %f4, [%rd2]; ...

454647484950515253...

...mul.wide.s32 %rd13, %r4, 4;add.s54 %rd14, %rd2, %rd13;cvt.s64.s32 %rd7, %r6;add.s64, %rd8, %rd6, %rd7;bra.uni BB3;

3435363738

...

...mov.u32 %r11, %ctaid.y;add.s32 %r12, %r8, 1;mov.u32 %r1, %tid.y;mov.u32 %r2, %tid.x; ...setp.ne.s32 %p2, %r2, 0;@%p2 bra BB2;

10111213

2122

......

10

100

BB0

BB1394041424344

100

BB2BB2:ld.shared.f32 %f5, [%rd3];ld.shared.f32 %f6, [%rd3];mul.f32 %f7, %f5, %f6;st.shared.f32 [%r2], %f7;bra.uni BB3;

90

Before After

BB3BB3:mul.wide.s32 %rd15, %r3, 4;add.s64 %rd12, %rd1, %rd15;shl.b64 %rd16, %rd4, 6;mov.u64 %rd17, __param_0;add.s64 %rd18, %rd17, %rd16;mul.wide.s32 %rd19, %r2, 4;add.s64 %rd2, %rd18, %rd19;ld.global.f32 %f4, [%rd2]; ...

454647484950515253...

...mul.wide.s32 %rd13, %r4, 4;add.s54 %rd14, %rd2, %rd13;cvt.s64.s32 %rd7, %r6;add.s64, %rd8, %rd6, %rd7;bra.uni BB3;

3435363738

...

...mov.u32 %r11, %ctaid.y;add.s32 %r12, %r8, 1;mov.u32 %r1, %tid.y;mov.u32 %r2, %tid.x; ...setp.ne.s32 %p2, %r2, 0;@%p2 bra BB2;

10111213

2122

......

10

100

BB0

BB1394041424344

100

BB2BB2:ld.shared.f32 %f5, [%rd3];ld.shared.f32 %f6, [%rd3];mul.f32 %f7, %f5, %f6;st.shared.f32 [%r2], %f7;bra.uni BB3;

90

Page 49: Warp-Aware Trace Scheduling for GPUSomutlu/pub/warp-aware...Origins: Microcode Trace Scheduling...generalizing local and disparate vertical-to-horizontal microcode compaction 3. Local

Before After

BB3BB3:mul.wide.s32 %rd15, %r3, 4;add.s64 %rd12, %rd1, %rd15;shl.b64 %rd16, %rd4, 6;mov.u64 %rd17, __param_0;add.s64 %rd18, %rd17, %rd16;mul.wide.s32 %rd19, %r2, 4;add.s64 %rd2, %rd18, %rd19;ld.global.f32 %f4, [%rd2]; ...

454647484950515253...

...mul.wide.s32 %rd13, %r4, 4;add.s54 %rd14, %rd2, %rd13;cvt.s64.s32 %rd7, %r6;add.s64, %rd8, %rd6, %rd7;bra.uni BB3;

3435363738

...

...mov.u32 %r11, %ctaid.y;add.s32 %r12, %r8, 1;mov.u32 %r1, %tid.y;mov.u32 %r2, %tid.x; ...setp.ne.s32 %p2, %r2, 0;@%p2 bra BB2;

10111213

2122

......

10

100

BB0

BB1394041424344

100

BB2BB2:ld.shared.f32 %f5, [%rd3];ld.shared.f32 %f6, [%rd3];mul.f32 %f7, %f5, %f6;st.shared.f32 [%r2], %f7;bra.uni BB3;

90

BB3

...mul.wide.s32 %rd13, %r4, 4;add.s64 %rd14, %rd2, %rd13;cvt.s64.s32 %rd7, %r6;add.s64, %rd8, %rd6, %rd7;bra.uni BB3;

3435363738

...

10

100

BB1

100

90

BB2:ld.shared.f32 %f5, [%rd3];ld.shared.f32 %f6, [%rd3];mul.f32 %f7, %f5, %f6;st.shared.f32 [%r2], %f7;bra.uni BB3;

394041424344

BB2...

BB3:mul.wide.s32 %rd15, %r3, %rd8;add.s64 %rd12, %rd1, %rd15; ...

454647...

... ...mov.u32 %r11, %ctaid.y;add.s32 %r12, %r8, 1;mov.u32 %r1, %tid.y;mov.u32 %r2, %tid.x; ...setp.ne.s32 %p2, %r2, 0;shl.b64 %rd16, %rd4, 6;mov.u64 %rd17, __param_0;add.s64 %rd18, %rd17, %rd16;mul.wide.s32 %rd19, %r2, 4;add.s64 %rd2, %rd18, %rd19;ld.global.f32 %f4, [%rd2];@%p2 bra BB2;

BB0

10111213

2148495051525322

Page 50: Warp-Aware Trace Scheduling for GPUSomutlu/pub/warp-aware...Origins: Microcode Trace Scheduling...generalizing local and disparate vertical-to-horizontal microcode compaction 3. Local

Profiling

0.95×

1.00×

1.05×

1.10×

1.15×

1.20×

1.25×

1.30×

1.35×

1.40×

1.00

1.05

1.10

1.15

1.20

1.25

1.30

1.35

1.40

KernelSpeedup

Instructions

ExecutedperCycle(IPC)

ComparingSpeedup and IPC UsingDynamic

backprop

bfscfd heartwall

hotspot

kmeans

lavaMD

leukocyte

lud mummergpu

nn nw particlefilter(f)

particlefilter(n)

pathfinder

sradstreamcluster

GEOMEAN

HARMEAN

bpnnlayerforw

ardCUDA

Kernel

Kernel2

cudacom

puteflux

kernel

calculatetem

p

invertmapping

kmeansP

oint

kernelgpu

cuda

dilatekernel

IMGVFkernel

luddiagonal

mum

mergpuK

ernel

printKernel

euclidneedle

cudashared1

needlecuda

shared2

findindex

kernellikelihood

kernel

kerneldynproc

kernel

sradcuda

1

sradcuda

2

kernelcom

putecost

Speedup

IPC

Page 51: Warp-Aware Trace Scheduling for GPUSomutlu/pub/warp-aware...Origins: Microcode Trace Scheduling...generalizing local and disparate vertical-to-horizontal microcode compaction 3. Local

Backup Slides

Page 52: Warp-Aware Trace Scheduling for GPUSomutlu/pub/warp-aware...Origins: Microcode Trace Scheduling...generalizing local and disparate vertical-to-horizontal microcode compaction 3. Local

store instructions

shadow store buffer,

Restricted [6] General [6] Boosting [36] Deviant (GP U)

excludes texture, shared

Scheduling Restrictions Legal and Safe Legal noneand constant memory

operations and all

shadow register file,

HardwareSupport nonenon-trapping

noneinstructions and support for

re-executing instructionsException Handling for

prohibited ignored supported absentSpeculative Instructions

Page 53: Warp-Aware Trace Scheduling for GPUSomutlu/pub/warp-aware...Origins: Microcode Trace Scheduling...generalizing local and disparate vertical-to-horizontal microcode compaction 3. Local

GPU Programming Model

CPU GPU

Tim

e

Host Code

Host Code Device CodeCPU

CPU

GPUCyclic

Communication

Page 54: Warp-Aware Trace Scheduling for GPUSomutlu/pub/warp-aware...Origins: Microcode Trace Scheduling...generalizing local and disparate vertical-to-horizontal microcode compaction 3. Local

GPU Programming Model

CPU GPU

Tim

e

Host Code

Host Code Device Code

GridBlock (0,0) Block (1,0)

Block (0,1) Block (1,1)CPU

CPU

GPUCyclic

Communication

Page 55: Warp-Aware Trace Scheduling for GPUSomutlu/pub/warp-aware...Origins: Microcode Trace Scheduling...generalizing local and disparate vertical-to-horizontal microcode compaction 3. Local

GPU Programming Model

CPU GPU

Tim

e

Host Code

Device CodeHost Code

GridBlock (0,0) Block (1,0)

Block (0,1) Block (1,1)

Block (0,1)Thread (0,0) Thread (1,0) Thread (2,0)

Thread (0,1) Thread (1,1) Thread (2,1)

CPU

CPU

GPUCyclic

Communication

Page 56: Warp-Aware Trace Scheduling for GPUSomutlu/pub/warp-aware...Origins: Microcode Trace Scheduling...generalizing local and disparate vertical-to-horizontal microcode compaction 3. Local

Characterizing the Grid...

Grid

gridDim.x

gri

dD

im.y

Page 57: Warp-Aware Trace Scheduling for GPUSomutlu/pub/warp-aware...Origins: Microcode Trace Scheduling...generalizing local and disparate vertical-to-horizontal microcode compaction 3. Local

Characterizing the Grid, Blocks...

Grid

gridDim.x

gri

dD

im.y Block (0,1)

blockDim.x

blo

ckD

im.y

Page 58: Warp-Aware Trace Scheduling for GPUSomutlu/pub/warp-aware...Origins: Microcode Trace Scheduling...generalizing local and disparate vertical-to-horizontal microcode compaction 3. Local

Characterizing the Grid, Blocks...

Grid

gridDim.x

gri

dD

im.y Block (blockIdx.x,blockIdx.y)

blockDim.x

blo

ckD

im.y

Page 59: Warp-Aware Trace Scheduling for GPUSomutlu/pub/warp-aware...Origins: Microcode Trace Scheduling...generalizing local and disparate vertical-to-horizontal microcode compaction 3. Local

Characterizing the Grid, Blocks, andThreads

Grid

gridDim.x

gri

dD

im.y Block (blockIdx.x,blockIdx.y)

Thread (0,1)

blockDim.x

blo

ckD

im.y

Page 60: Warp-Aware Trace Scheduling for GPUSomutlu/pub/warp-aware...Origins: Microcode Trace Scheduling...generalizing local and disparate vertical-to-horizontal microcode compaction 3. Local

Characterizing the Grid, Blocks, andThreads

Grid

gridDim.x

gri

dD

im.y Block (blockIdx.x,blockIdx.y)

Thread (threadIdx.x,threadIdx.y)

blockDim.x

blo

ckD

im.y

Page 61: Warp-Aware Trace Scheduling for GPUSomutlu/pub/warp-aware...Origins: Microcode Trace Scheduling...generalizing local and disparate vertical-to-horizontal microcode compaction 3. Local

Warp Divergence Examples

Assuming one block of 128 threads...

Divergence?Example

if (threadIdx.x < 32) { }

Page 62: Warp-Aware Trace Scheduling for GPUSomutlu/pub/warp-aware...Origins: Microcode Trace Scheduling...generalizing local and disparate vertical-to-horizontal microcode compaction 3. Local

Warp Divergence Examples

Assuming one block of 128 threads...

Divergence?Example

if (threadIdx.x > 15) { }

if (threadIdx.x < 32) { } NO

Page 63: Warp-Aware Trace Scheduling for GPUSomutlu/pub/warp-aware...Origins: Microcode Trace Scheduling...generalizing local and disparate vertical-to-horizontal microcode compaction 3. Local

Warp Divergence Examples

Assuming one block of 128 threads...

Divergence?Example

if (threadIdx.x > 15) { }

if (threadIdx.x < 32) { } NO

if (threadIdx.x > 65) { }

YES

Page 64: Warp-Aware Trace Scheduling for GPUSomutlu/pub/warp-aware...Origins: Microcode Trace Scheduling...generalizing local and disparate vertical-to-horizontal microcode compaction 3. Local

Warp Divergence Examples

Assuming one block of 128 threads...

Divergence?Example

if (threadIdx.x > 15) { }

if (threadIdx.x < 32) { } NO

if (threadIdx.x > 65) { }

if (BlockIdx.x > 1) { }

YES

YES

Page 65: Warp-Aware Trace Scheduling for GPUSomutlu/pub/warp-aware...Origins: Microcode Trace Scheduling...generalizing local and disparate vertical-to-horizontal microcode compaction 3. Local

Warp Divergence Examples

Assuming one block of 128 threads...

Divergence?Example

if (threadIdx.x > 15) { }

if (threadIdx.x < 32) { } NO

if (threadIdx.x > 65) { }

if (blockIdx.x > 1) { }

YES

YES

NO