
CUDA Performance


Page 1: CUDA Performance

CUDA Performance

Patrick Cozzi
University of Pennsylvania
CIS 565 - Fall 2012

1

Page 2: CUDA Performance

Announcements

Project 2 due Friday 10/12
Be ready to present on Monday, 10/15

Move class on Halloween to Tuesday, 10/30?

2

Page 3: CUDA Performance

Acknowledgements

Some slides from Varun Sampath

3

Page 4: CUDA Performance

Agenda

Parallel Reduction Revisited
Warp Partitioning
Memory Coalescing
Bank Conflicts
Dynamic Partitioning of SM Resources
Data Prefetching
Instruction Mix
Loop Unrolling

4

Page 5: CUDA Performance

Efficient data-parallel algorithms
+
Optimizations based on GPU Architecture
=
Maximum Performance

5

Page 6: CUDA Performance

Parallel Reduction

0 1 5 2 3 4 6 7

Recall Parallel Reduction (sum)

6

Page 7: CUDA Performance

Parallel Reduction

0 1 5 2 3 4 6 7

1 5 9 13

7

Page 8: CUDA Performance

Parallel Reduction

0 1 5 2 3 4 6 7

1 5 9 13

6 22

8

Page 9: CUDA Performance

Parallel Reduction

0 1 5 2 3 4 6 7

1 5 9 13

6 22

28

9

Page 10: CUDA Performance

Parallel Reduction

Similar to brackets for a basketball tournament
log(n) passes for n elements
How would you implement this in CUDA?

10

Page 11: CUDA Performance

extern __shared__ float partialSum[];  // extern: size supplied at kernel launch
// ... load into shared memory
unsigned int t = threadIdx.x;
for (unsigned int stride = 1; stride < blockDim.x; stride *= 2)
{
    __syncthreads();
    if (t % (2 * stride) == 0)
        partialSum[t] += partialSum[t + stride];
}

Code from http://courses.engr.illinois.edu/ece498/al/Syllabus.html

11

Page 12: CUDA Performance

extern __shared__ float partialSum[];
// ... load into shared memory
unsigned int t = threadIdx.x;
for (unsigned int stride = 1; stride < blockDim.x; stride *= 2)
{
    __syncthreads();
    if (t % (2 * stride) == 0)
        partialSum[t] += partialSum[t + stride];
}

Code from http://courses.engr.illinois.edu/ece498/al/Syllabus.html

Computing the sum for the elements in shared memory

12

Page 13: CUDA Performance

extern __shared__ float partialSum[];
// ... load into shared memory
unsigned int t = threadIdx.x;
for (unsigned int stride = 1; stride < blockDim.x; stride *= 2)
{
    __syncthreads();
    if (t % (2 * stride) == 0)
        partialSum[t] += partialSum[t + stride];
}

Code from http://courses.engr.illinois.edu/ece498/al/Syllabus.html

Stride: 1, 2, 4, …

13

Page 14: CUDA Performance

extern __shared__ float partialSum[];
// ... load into shared memory
unsigned int t = threadIdx.x;
for (unsigned int stride = 1; stride < blockDim.x; stride *= 2)
{
    __syncthreads();
    if (t % (2 * stride) == 0)
        partialSum[t] += partialSum[t + stride];
}

Code from http://courses.engr.illinois.edu/ece498/al/Syllabus.html

Why?

14

Page 15: CUDA Performance

extern __shared__ float partialSum[];
// ... load into shared memory
unsigned int t = threadIdx.x;
for (unsigned int stride = 1; stride < blockDim.x; stride *= 2)
{
    __syncthreads();
    if (t % (2 * stride) == 0)
        partialSum[t] += partialSum[t + stride];
}

Code from http://courses.engr.illinois.edu/ece498/al/Syllabus.html

• Compute sum in same shared memory
• As stride increases, what do more threads do?

15

Page 16: CUDA Performance

Parallel Reduction

[Diagram: threads 0–7 reduce 0 1 5 2 3 4 6 7 → 1 5 9 13 → 6 22 → 28]

16

Page 17: CUDA Performance

Parallel Reduction

[Diagram: threads 0–7 reduce 0 1 5 2 3 4 6 7 → 1 5 9 13 → 6 22 → 28]

1st pass: threads 1, 3, 5, and 7 don’t do anything
Really only need n/2 threads for n elements

17

Page 18: CUDA Performance

Parallel Reduction

[Diagram: threads 0–7 reduce 0 1 5 2 3 4 6 7 → 1 5 9 13 → 6 22 → 28]

2nd pass: threads 2 and 6 also don’t do anything

18

Page 19: CUDA Performance

Parallel Reduction

[Diagram: threads 0–7 reduce 0 1 5 2 3 4 6 7 → 1 5 9 13 → 6 22 → 28]

3rd pass: thread 4 also doesn’t do anything

19

Page 20: CUDA Performance

Parallel Reduction

[Diagram: threads 0–7 reduce 0 1 5 2 3 4 6 7 → 1 5 9 13 → 6 22 → 28]

In general, number of required threads cuts in half after each pass

20

Page 21: CUDA Performance

Parallel Reduction

What if we tweaked the implementation?

21

Page 22: CUDA Performance

Parallel Reduction

0 1 5 2 3 4 6 7

22

Page 23: CUDA Performance

Parallel Reduction

0 1 5 2 3 4 6 7

4 6 8 10

23

Page 24: CUDA Performance

Parallel Reduction

0 1 5 2 3 4 6 7

4 6 8 10

12 16

24

Page 25: CUDA Performance

Parallel Reduction

0 1 5 2 3 4 6 7

4 6 8 10

12 16

28

25

Page 26: CUDA Performance

Code from http://courses.engr.illinois.edu/ece498/al/Syllabus.html

stride: …, 4, 2, 1

extern __shared__ float partialSum[];
// ... load into shared memory
unsigned int t = threadIdx.x;
for (unsigned int stride = blockDim.x / 2; stride > 0; stride /= 2)
{
    __syncthreads();
    if (t < stride)
        partialSum[t] += partialSum[t + stride];
}

26

Page 27: CUDA Performance

Code from http://courses.engr.illinois.edu/ece498/al/Syllabus.html

extern __shared__ float partialSum[];
// ... load into shared memory
unsigned int t = threadIdx.x;
for (unsigned int stride = blockDim.x / 2; stride > 0; stride /= 2)
{
    __syncthreads();
    if (t < stride)
        partialSum[t] += partialSum[t + stride];
}

27
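As a reference point, the slide’s loop can be dropped into a complete kernel. The sketch below is one possible packaging, not the course’s solution: the kernel name reduceSum, the bounds check, and writing one partial sum per block are assumptions, and blockDim.x is assumed to be a power of two.

// Minimal sketch built around the slide's loop (assumptions noted above)
__global__ void reduceSum(const float* in, float* out, unsigned int n)
{
    extern __shared__ float partialSum[];  // sized at launch: blockDim.x * sizeof(float)

    unsigned int t = threadIdx.x;
    unsigned int i = blockIdx.x * blockDim.x + threadIdx.x;

    // Load into shared memory; pad out-of-range threads with 0
    partialSum[t] = (i < n) ? in[i] : 0.0f;
    __syncthreads();

    // Improved reduction: stride = blockDim.x/2, blockDim.x/4, ..., 1
    for (unsigned int stride = blockDim.x / 2; stride > 0; stride /= 2)
    {
        if (t < stride)
            partialSum[t] += partialSum[t + stride];
        __syncthreads();
    }

    // One partial sum per block; a second launch (or the host) sums these
    if (t == 0)
        out[blockIdx.x] = partialSum[0];
}

// Possible launch:
// reduceSum<<<numBlocks, blockSize, blockSize * sizeof(float)>>>(d_in, d_out, n);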

Page 28: CUDA Performance

Parallel Reduction

[Diagram: threads 0–7 reduce 0 1 5 2 3 4 6 7 → 4 6 8 10 → 12 16 → 28]

28

Page 29: CUDA Performance

Parallel Reduction

[Diagram: threads 0–7 reduce 0 1 5 2 3 4 6 7 → 4 6 8 10 → 12 16 → 28]

1st pass: threads 4, 5, 6, and 7 don’t do anything
Really only need n/2 threads for n elements

29

Page 30: CUDA Performance

Parallel Reduction

[Diagram: threads 0–7 reduce 0 1 5 2 3 4 6 7 → 4 6 8 10 → 12 16 → 28]

2nd pass: threads 2 and 3 also don’t do anything

30

Page 31: CUDA Performance

Parallel Reduction

[Diagram: threads 0–7 reduce 0 1 5 2 3 4 6 7 → 4 6 8 10 → 12 16 → 28]

3rd pass: thread 1 also doesn’t do anything

31

Page 32: CUDA Performance

Parallel Reduction

[Diagram: elements 0–7 reduced by each variant, side by side]

What is the difference?

stride = 1, 2, 4, … stride = 4, 2, 1, …

32

Page 33: CUDA Performance

Parallel Reduction

What is the difference?

if (t < stride) partialSum[t] += partialSum[t + stride];

if (t % (2 * stride) == 0) partialSum[t] += partialSum[t + stride];

stride = 1, 2, 4, … stride = 4, 2, 1, …

33

Page 34: CUDA Performance

Warp Partitioning

Warp Partitioning: how threads from a block are divided into warps

Knowledge of warp partitioning can be used to:
Minimize divergent branches
Retire warps early

34

Page 35: CUDA Performance

Warp Partitioning

Partition based on consecutive increasing threadIdx

35

Page 36: CUDA Performance

Warp Partitioning

1D Block
threadIdx.x between 0 and blockDim.x – 1 (blockDim.x ≤ 512 on G80/GT200)
Warp n starts with thread 32n and ends with thread 32(n + 1) – 1
Last warp is padded if block size is not a multiple of 32

Warp 0: threads 0…31
Warp 1: threads 32…63
Warp 2: threads 64…95
Warp 3: threads 96…127
…

36
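As a small illustration of this partitioning (not code from the slides; warpSize is the CUDA built-in device variable), a thread in a 1D block can compute which warp it belongs to and its position within that warp:

// Warp n covers threads 32n .. 32(n + 1) - 1 of the block
unsigned int warp = threadIdx.x / warpSize;
unsigned int lane = threadIdx.x % warpSize;

// For a 2D block, the partitioning uses the linearized index:
// unsigned int linear = threadIdx.y * blockDim.x + threadIdx.x;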

Page 37: CUDA Performance

Warp Partitioning

2D Block
Increasing threadIdx means increasing threadIdx.x first, starting with the row threadIdx.y == 0

37

Page 38: CUDA Performance

Warp Partitioning

Image from http://courses.engr.illinois.edu/ece498/al/textbook/Chapter5-CudaPerformance.pdf

2D Block

38

Page 39: CUDA Performance

Warp Partitioning

3D Block
Start with threadIdx.z == 0
Partition as a 2D block
Increase threadIdx.z and repeat

39

Page 40: CUDA Performance

Image from: http://bps10.idav.ucdavis.edu/talks/03-fatahalian_gpuArchTeraflop_BPS_SIGGRAPH2010.pdf

Warp Partitioning

Divergent branches are within a warp!

40

Page 41: CUDA Performance

Warp Partitioning

For warpSize == 32, does any warp have a divergent branch with this code:

if (threadIdx.x > 15)
{
    // ...
}

41

Page 42: CUDA Performance

Warp Partitioning

For any warpSize > 1, does any warp have a divergent branch with this code:

if (threadIdx.x > warpSize - 1)
{
    // ...
}

42

Page 43: CUDA Performance

Warp Partitioning

Given knowledge of warp partitioning, which parallel reduction is better?

if (t < stride) partialSum[t] += partialSum[t + stride];

if (t % (2 * stride) == 0) partialSum[t] += partialSum[t + stride];

stride = 1, 2, 4, … stride = 4, 2, 1, …

43

Page 44: CUDA Performance

Warp Partitioning

stride = 1, 2, 4, … stride = 4, 2, 1, …

[Diagram: the 8 elements split into Warp 0–3 for each variant]

Pretend warpSize == 2

44

Page 45: CUDA Performance

Warp Partitioning

1st Pass

[Diagram: Warp 0–3 for each variant]

stride = 1, 2, 4, …: 4 divergent branches
stride = 4, 2, 1, …: 0 divergent branches

45

Page 46: CUDA Performance

Warp Partitioning

2nd Pass

[Diagram: Warp 0–3 for each variant]

stride = 1, 2, 4, …: 2 divergent branches
stride = 4, 2, 1, …: 0 divergent branches

46

Page 47: CUDA Performance

Warp Partitioning

2nd Pass

[Diagram: Warp 0–3 for each variant]

stride = 1, 2, 4, …: 1 divergent branch
stride = 4, 2, 1, …: 1 divergent branch

47

Page 48: CUDA Performance

Warp Partitioning

2nd Pass

[Diagram: Warp 0–3 for each variant]

stride = 1, 2, 4, …: 1 divergent branch
stride = 4, 2, 1, …: 1 divergent branch

Still diverge when number of elements left is <= warpSize

48

Page 49: CUDA Performance

Warp Partitioning

Good partitioning also allows warps to be retired early
Better hardware utilization

if (t < stride) partialSum[t] += partialSum[t + stride];

if (t % (2 * stride) == 0) partialSum[t] += partialSum[t + stride];

stride = 1, 2, 4, … stride = 4, 2, 1, …

49

Page 50: CUDA Performance

Warp Partitioning

stride = 1, 2, 4, … stride = 4, 2, 1, …

[Diagram: Warp 0–3 for each variant of the parallel reduction]

50

Page 51: CUDA Performance

Warp Partitioning

1st Pass

stride = 1, 2, 4, … stride = 4, 2, 1, …

[Diagram: Warp 0–3 for each variant]

stride = 1, 2, 4, …: 0 warps retired
stride = 4, 2, 1, …: 2 warps retired

51

Page 52: CUDA Performance

Warp Partitioning

1st Pass

stride = 1, 2, 4, … stride = 4, 2, 1, …

[Diagram: Warp 0–3 for each variant]

52

Page 53: CUDA Performance

Warp Partitioning

2nd Pass

stride = 1, 2, 4, … stride = 4, 2, 1, …

[Diagram: Warp 0–3 for each variant]

1 warp retired
2 warps retired

53

Page 54: CUDA Performance

Warp Partitioning

2nd Pass

stride = 1, 2, 4, … stride = 4, 2, 1, …

[Diagram: Warp 0–3 for each variant]

54

Page 55: CUDA Performance

Memory Coalescing

Given a matrix stored row-major in global memory, what is a thread’s desirable access pattern?

Image from: http://bps10.idav.ucdavis.edu/talks/03-fatahalian_gpuArchTeraflop_BPS_SIGGRAPH2010.pdf

[Diagram: 4×4 matrix M and its linear layout in memory: M0,0 M1,0 M2,0 M3,0  M0,1 M1,1 M2,1 M3,1  M0,2 M1,2 M2,2 M3,2  M0,3 M1,3 M2,3 M3,3]

55

Page 56: CUDA Performance

Memory Coalescing

Given a matrix stored row-major in global memory, what is a thread’s desirable access pattern?

Image from: http://courses.engr.illinois.edu/ece498/al/textbook/Chapter5-CudaPerformance.pdf

[Diagram: matrices Md and Nd, each WIDTH × WIDTH, with Thread 0 and Thread 1 traversing them in the two patterns below]

a) column after column? b) row after row?

56

Page 57: CUDA Performance

Memory Coalescing

Given a matrix stored row-major in global memory, what is a thread’s desirable access pattern?

a) column after column: individual threads read increasing, consecutive memory addresses
b) row after row: adjacent threads read increasing, consecutive memory addresses

57
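A sketch of pattern (b), assuming a row-major width × height matrix of floats and one thread per column; the kernel name sumColumns and the column-sum payload are illustrative only:

__global__ void sumColumns(const float* matrix, float* result, int width, int height)
{
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (col >= width) return;

    float sum = 0.0f;
    for (int row = 0; row < height; ++row)
    {
        // Adjacent threads read adjacent addresses each iteration -> coalesced
        sum += matrix[row * width + col];
    }
    result[col] = sum;
}

// Pattern (a) for the same thread would be matrix[col * width + k], k = 0..width-1:
// each thread's own reads are consecutive, but adjacent threads are width elements
// apart, so the warp's accesses cannot be coalesced.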

Page 58: CUDA Performance

Memory Coalescing

Image from: http://courses.engr.illinois.edu/ece498/al/textbook/Chapter5-CudaPerformance.pdf

a) column after column

58

Page 59: CUDA Performance

Memory Coalescing

Image from: http://courses.engr.illinois.edu/ece498/al/textbook/Chapter5-CudaPerformance.pdf

b) row after row

59

Page 60: CUDA Performance

Memory Coalescing

Image from: http://courses.engr.illinois.edu/ece498/al/textbook/Chapter5-CudaPerformance.pdf

Recall warp partitioning; if these threads are in the same warp, global memory addresses are increasing and consecutive across the warp.

60

Page 61: CUDA Performance

Memory Coalescing

Global memory bandwidth (DRAM)
G80 – 86.4 GB/s
GT200 – 150 GB/s

Achieve peak bandwidth by requesting large, consecutive locations from DRAM
Accessing random locations results in much lower bandwidth

61

Page 62: CUDA Performance

Memory Coalescing

Memory coalescing – rearrange access patterns to improve performance

Useful today but will be less useful with large on-chip caches

62

Page 63: CUDA Performance

Memory Coalescing

The GPU coalesces consecutive reads by a half-warp into a single read

Strategy: read global memory in a coalesce-able fashion into shared memory
Then access shared memory randomly at maximum bandwidth
Ignoring bank conflicts…

See Appendix G in the NVIDIA CUDA C Programming Guide for coalescing alignment requirements

63
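A sketch of that strategy under assumed names (TILE, data, out, width, with width a multiple of TILE): each thread performs one coalesced global read into shared memory, after which the tile can be read in any order.

#define TILE 16

__global__ void stageTile(const float* data, float* out, int width)
{
    __shared__ float tile[TILE][TILE];

    int x = blockIdx.x * TILE + threadIdx.x;
    int y = blockIdx.y * TILE + threadIdx.y;

    // Coalesced: consecutive threadIdx.x -> consecutive global addresses
    tile[threadIdx.y][threadIdx.x] = data[y * width + x];
    __syncthreads();

    // Shared memory can now be read in any pattern, e.g. with x/y swapped
    // within the tile (ignoring bank conflicts, as the slide notes)
    out[y * width + x] = tile[threadIdx.x][threadIdx.y];
}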

Page 64: CUDA Performance

64

Bank Conflicts

Image from http://courses.engr.illinois.edu/ece498/al/Syllabus.html

Shared Memory
Sometimes called a parallel data cache
Multiple threads can access shared memory at the same time
Memory is divided into banks

[Diagram: Bank 0 – Bank 15]

Page 65: CUDA Performance

65

Bank Conflicts

Image from http://courses.engr.illinois.edu/ece498/al/Syllabus.html

Banks
Each bank can service one address per two cycles
Per-bank bandwidth: 32 bits per two (shader clock) cycles
Successive 32-bit words are assigned to successive banks

[Diagram: Bank 0 – Bank 15]
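As a quick sanity check of that layout (a sketch, not slide code), the bank a 32-bit shared-memory word falls in on a 16-bank part is just its word index modulo 16:

// For __shared__ float s[...], element s[i] lives in bank (i % 16) on G80/GT200
__device__ int bankOf(int wordIndex)
{
    const int numBanks = 16;  // 32 on Fermi
    return wordIndex % numBanks;
}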

Page 66: CUDA Performance

66

Bank Conflicts

Image from http://courses.engr.illinois.edu/ece498/al/Syllabus.html

Bank Conflict: two simultaneous accesses to the same bank, but not the same address
Conflicting accesses are serialized

G80-GT200: 16 banks, with 8 SPs concurrently executing
Fermi: 32 banks, with 16 SPs concurrently executing
What does this mean for conflicts?

[Diagram: Bank 0 – Bank 15]

Page 67: CUDA Performance

67

Bank Conflicts

Image from http://courses.engr.illinois.edu/ece498/al/Syllabus.html

Bank Conflicts? Linear addressing, stride == 1
Bank Conflicts? Random 1:1 permutation

[Diagram: threads 0–15 each mapped to a distinct bank in both cases]

Page 68: CUDA Performance

68

Bank Conflicts

Image from http://courses.engr.illinois.edu/ece498/al/Syllabus.html

Bank Conflicts? Linear addressing, stride == 2
Bank Conflicts? Linear addressing, stride == 8

[Diagram: with stride 2, threads map only to the even banks; with stride 8, threads map only to banks 0 and 8 (x8)]

Page 69: CUDA Performance

69

Bank Conflicts

Image from http://courses.engr.illinois.edu/ece498/al/Syllabus.html

Fast Path 1 (G80)
All threads in a half-warp access different banks

[Diagram: threads 0–15 each mapped to a different bank]

Page 70: CUDA Performance

70

Bank Conflicts

Image from http://courses.engr.illinois.edu/ece498/al/Syllabus.html

Fast Path 2 (G80)
All threads in a half-warp access the same address

[Diagram: threads 0–15 all reading the same address]

Page 71: CUDA Performance

71

Bank Conflicts

Image from http://courses.engr.illinois.edu/ece498/al/Syllabus.html

Slow Path (G80)
Multiple threads in a half-warp access the same bank
Access is serialized
What is the cost?

[Diagram: several threads mapped to the same bank]

Page 72: CUDA Performance

72

Bank Conflicts

__shared__ float shared[256];
// ...
float f = shared[index + s * threadIdx.x];

For what values of s is this conflict-free?
Hint: The G80 has 16 banks

Page 73: CUDA Performance

73

Bank Conflicts

__shared__ float shared[256];
// ...
float f = shared[index + s * threadIdx.x];

[Diagram: with s=1 and with s=3, threads 0–15 each map to a distinct bank]

Image from http://courses.engr.illinois.edu/ece498/al/Syllabus.html

Page 74: CUDA Performance

74

Bank Conflicts

Without using a profiler, how can we tell what kind of speedup we can expect by removing bank conflicts?

What happens if more than one thread in a warp writes to the same shared memory address (non-atomic instruction)?
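One common remedy, not covered on these slides but standard practice, is to pad shared-memory arrays so that column-wise accesses spread across banks. A sketch assuming a 16×16 tile on 16-bank (G80-class) hardware, launched as a single 16×16 block:

#define TILE 16

__global__ void paddedTile(const float* in, float* out)
{
    // Row stride is TILE + 1 = 17 words; 17 mod 16 = 1, so the 16 threads of a
    // half-warp reading tile[threadIdx.x][c] land in 16 different banks.
    // With tile[TILE][TILE] the same read would be a 16-way conflict.
    __shared__ float tile[TILE][TILE + 1];

    tile[threadIdx.y][threadIdx.x] = in[threadIdx.y * TILE + threadIdx.x];
    __syncthreads();

    // Column-wise read: conflict-free thanks to the padding
    out[threadIdx.y * TILE + threadIdx.x] = tile[threadIdx.x][threadIdx.y];
}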

Page 75: CUDA Performance

SM Resource Partitioning

Recall an SM dynamically partitions resources:

Registers

Thread block slots

Thread slots

Shared memory

SM

75

Page 76: CUDA Performance

SM Resource Partitioning

Recall an SM dynamically partitions resources:

G80 Limits (per SM):
Thread block slots: 8
Thread slots: 768
Registers: 8K registers / 32K memory
Shared memory: 16K

76

Page 77: CUDA Performance

SM Resource Partitioning

We can have:
8 blocks of 96 threads
4 blocks of 192 threads
But not 8 blocks of 192 threads

G80 Limits (per SM):
Thread block slots: 8
Thread slots: 768
Registers: 8K registers / 32K memory
Shared memory: 16K

77

Page 78: CUDA Performance

SM Resource Partitioning

We can have (assuming 256-thread blocks):
768 threads (3 blocks) using 10 registers each
512 threads (2 blocks) using 11 registers each

G80 Limits (per SM):
Thread block slots: 8
Thread slots: 768
Registers: 8K registers / 32K memory
Shared memory: 16K

78

Page 79: CUDA Performance

SM Resource Partitioning

We can have (assuming 256-thread blocks):
768 threads (3 blocks) using 10 registers each
512 threads (2 blocks) using 11 registers each

G80 Limits (per SM):
Thread block slots: 8
Thread slots: 768
Registers: 8K registers / 32K memory
Shared memory: 16K

Using more registers decreases thread-level parallelism
Can it ever increase performance?

79

Page 80: CUDA Performance

SM Resource Partitioning

Performance Cliff: increasing resource usage leads to a dramatic reduction in parallelism
For example, increasing the number of registers, unless doing so hides the latency of global memory access

80
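Two knobs that interact with this trade-off (standard CUDA features, not specific to these slides): the __launch_bounds__ qualifier and the nvcc --maxrregcount flag let you cap register usage so that more blocks fit per SM; whether that helps or falls off the performance cliff has to be measured. The kernel name and the numbers below are illustrative.

// Sketch: ask the compiler to limit register use so that at least
// minBlocksPerMultiprocessor blocks of up to maxThreadsPerBlock threads can be resident
__global__ void
__launch_bounds__(256 /* maxThreadsPerBlock */, 3 /* minBlocksPerMultiprocessor */)
myKernel(float* data)
{
    // ... kernel body ...
}

// Alternatively, cap registers for a whole compilation unit:
//   nvcc --maxrregcount=16 kernel.cu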

Page 81: CUDA Performance

SM Resource Partitioning

CUDA Occupancy Calculator
http://developer.download.nvidia.com/compute/cuda/CUDA_Occupancy_calculator.xls

81

Page 82: CUDA Performance

Data Prefetching

Independent instructions between a global memory read and its use can hide memory latency

float m = Md[i];
float f = a * b + c * d;
float f2 = m * f;

82

Page 83: CUDA Performance

Data Prefetching

Independent instructions between a global memory read and its use can hide memory latency

float m = Md[i];
float f = a * b + c * d;
float f2 = m * f;

Read global memory

83

Page 84: CUDA Performance

Data Prefetching

Independent instructions between a global memory read and its use can hide memory latency

float m = Md[i];
float f = a * b + c * d;
float f2 = m * f;

Execute instructions that are not dependent on the memory read

84

Page 85: CUDA Performance

Data Prefetching

Independent instructions between a global memory read and its use can hide memory latency

float m = Md[i];
float f = a * b + c * d;
float f2 = m * f;

Use of the value read from global memory; executed by enough warps, the independent instructions above hide the memory latency

85

Page 86: CUDA Performance

Data Prefetching

Prefetching data from global memory can effectively increase the number of independent instructions between global memory read and use

86

Page 87: CUDA Performance

Data Prefetching

Recall tiled matrix multiply:

for (/* ... */)
{
    // Load current tile into shared memory
    __syncthreads();

    // Accumulate dot product
    __syncthreads();
}

87

Page 88: CUDA Performance

Data Prefetching

Tiled matrix multiply with prefetch:

// Load first tile into registers

for (/* ... */)
{
    // Deposit registers into shared memory
    __syncthreads();

    // Load next tile into registers

    // Accumulate dot product
    __syncthreads();
}

88

Page 89: CUDA Performance

Data Prefetching

Tiled matrix multiply with prefetch:

// Load first tile into registers

for (/* ... */)
{
    // Deposit registers into shared memory
    __syncthreads();

    // Load next tile into registers

    // Accumulate dot product
    __syncthreads();
}

89

Page 90: CUDA Performance

Data Prefetching

Tiled matrix multiply with prefetch:

// Load first tile into registers

for (/* ... */)
{
    // Deposit registers into shared memory
    __syncthreads();

    // Load next tile into registers

    // Accumulate dot product
    __syncthreads();
}

Prefetch for next iteration of the loop

90

Page 91: CUDA Performance

Data Prefetching

Tiled matrix multiply with prefetch:

// Load first tile into registers

for (/* ... */)
{
    // Deposit registers into shared memory
    __syncthreads();

    // Load next tile into registers

    // Accumulate dot product
    __syncthreads();
}

These instructions executed by enough threads will hide the memory latency of the prefetch

91
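To make the pseudocode concrete, here is one possible way the prefetch could look in a tiled matrix multiply. This is a sketch, not the course’s reference solution: TILE, Md, Nd, Pd, and Width are assumed names, Width is assumed to be a multiple of TILE, and only the loop structure matters.

#define TILE 16

__global__ void matMulPrefetch(const float* Md, const float* Nd, float* Pd, int Width)
{
    __shared__ float Ms[TILE][TILE];
    __shared__ float Ns[TILE][TILE];

    int tx = threadIdx.x, ty = threadIdx.y;
    int row = blockIdx.y * TILE + ty;
    int col = blockIdx.x * TILE + tx;

    int numTiles = Width / TILE;

    // Load first tile into registers
    float mReg = Md[row * Width + tx];
    float nReg = Nd[ty * Width + col];

    float Pvalue = 0.0f;

    for (int t = 0; t < numTiles; ++t)
    {
        // Deposit registers into shared memory
        Ms[ty][tx] = mReg;
        Ns[ty][tx] = nReg;
        __syncthreads();

        // Load next tile into registers (prefetch for the next iteration)
        if (t + 1 < numTiles)
        {
            mReg = Md[row * Width + (t + 1) * TILE + tx];
            nReg = Nd[((t + 1) * TILE + ty) * Width + col];
        }

        // Accumulate dot product; these instructions, issued by enough warps,
        // hide the latency of the prefetch loads above
        for (int k = 0; k < TILE; ++k)
            Pvalue += Ms[ty][k] * Ns[k][tx];

        __syncthreads();
    }

    Pd[row * Width + col] = Pvalue;
}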

Page 92: CUDA Performance

92

Instruction Mix

• Special Function Units (SFUs)
• Used to compute __sinf(), __expf()
• Only 4, each can execute 1 instruction per clock

Image: NVIDIA Fermi Whitepaper
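A small sketch of the difference (kernel and array names are assumed). The __sinf()/__expf() intrinsics map to the SFUs and are fast but less accurate; sinf()/expf() are the slower, more accurate versions, and nvcc’s -use_fast_math flag swaps the latter for the former.

__global__ void sfuExample(const float* x, float* fast, float* accurate, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    fast[i]     = __sinf(x[i]) * __expf(x[i]);  // SFU intrinsics: fast, reduced precision
    accurate[i] = sinf(x[i]) * expf(x[i]);      // standard math: slower, more accurate
}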

Page 93: CUDA Performance

93

Loop Unrolling

for (int k = 0; k < BLOCK_SIZE; ++k)
{
    Pvalue += Ms[ty][k] * Ns[k][tx];
}

Instructions per iteration:
One floating-point multiply
One floating-point add
What else?

Page 94: CUDA Performance

94

Loop Unrolling

for (int k = 0; k < BLOCK_SIZE; ++k)
{
    Pvalue += Ms[ty][k] * Ns[k][tx];
}

Other instructions per iteration:
Update loop counter

Page 95: CUDA Performance

95

Loop Unrolling

for (int k = 0; k < BLOCK_SIZE; ++k)
{
    Pvalue += Ms[ty][k] * Ns[k][tx];
}

Other instructions per iteration:
Update loop counter
Branch

Page 96: CUDA Performance

96

Loop Unrolling

for (int k = 0; k < BLOCK_SIZE; ++k)
{
    Pvalue += Ms[ty][k] * Ns[k][tx];
}

Other instructions per iteration:
Update loop counter
Branch
Address arithmetic

Page 97: CUDA Performance

97

Loop Unrolling

Instruction Mix:
2 floating-point arithmetic instructions
1 loop branch instruction
2 address arithmetic instructions
1 loop counter increment instruction

for (int k = 0; k < BLOCK_SIZE; ++k)
{
    Pvalue += Ms[ty][k] * Ns[k][tx];
}

Page 98: CUDA Performance

98

Loop Unrolling

• Only 1/3 are floating-point calculations
• But I want my full theoretical 1 TFLOP (Fermi)
• Consider loop unrolling

Image: NVIDIA Fermi Whitepaper

Page 99: CUDA Performance

99

Loop Unrolling

Pvalue += Ms[ty][0] * Ns[0][tx] +
          Ms[ty][1] * Ns[1][tx] +
          // ...
          Ms[ty][15] * Ns[15][tx];  // BLOCK_SIZE = 16

• No more loop
• No loop count update
• No branch
• Constant indices – no address arithmetic instructions

Page 100: CUDA Performance

100

Loop Unrolling

Automatically:

#pragma unroll BLOCK_SIZE
for (int k = 0; k < BLOCK_SIZE; ++k)
{
    Pvalue += Ms[ty][k] * Ns[k][tx];
}

Disadvantages to unrolling?