CUDA Optimization with NVIDIA® Nsight™ Visual Studio Edition 3.0

Julien Demouth, NVIDIA

What Will You Learn?

An iterative method to optimize your GPU code

A way to conduct that method with Nsight VSE

APOD Method (Assess, Parallelize, Optimize, Deploy), Session S3008, Cliff Woolley

https://developer.nvidia.com/content/assess-parallelize-optimize-deploy


What Does the Application Do?

It does not matter!!!

We care about memory accesses, instructions, latency, …

Companion code (with a different input file)

https://github.com/jdemouth/nsight-gtc2013


Our Method

Trace your application

Identify the hot spot and profile it

Identify the performance limiter

— Memory Bandwidth

— Instruction Throughput

— Latency

Optimize the code

Iterate


Our Environment

We use

— Nvidia Tesla K20c (GK110, SM 3.5), ECC OFF,

— Microsoft Windows 7 x64,

— Microsoft Visual Studio 2010 SP1,

— CUDA 5.0,

— Driver 310.34,

— Nvidia Nsight 3.0.


ITERATION 1


Trace the Application


CUDA Launch Summary

spmv_kernel_v0 is a hot spot, let’s start here!!!

Kernel                      Time        Speedup
Original version            457.1 ms    -

Profile the Most Expensive Kernel


CUDA Launches


Identify the Main Limiter

Is it limited by the memory bandwidth?

Is it limited by the instruction throughput?

Is it limited by latency?

Memory Bandwidth

Utilization of DRAM Bandwidth: 37.67%

We are not limited by the memory bandwidth


Instruction Throughput

Instructions Per Clock (IPC): 0.04

We are not limited by instruction throughput


Latency

First two things to check:

— Occupancy

— Memory accesses (coalesced/uncoalesced accesses)

Other things to check (if needed):

— Control flow efficiency (branching, idle threads)

— Divergence

— Bank conflicts in shared memory


Latency

Occupancy: 47.58% Achieved / 50% Theoretical

Eligible Warps per Active Cycle: >4.7 on average

On GK110, 4 Eligible Warps are enough: Not an issue

Latency

Memory Accesses:

— Load: 22 Transactions per Request

— Store: 8 Transactions per Request

We have too many uncoalesced accesses!!!


Where Do Those Accesses Happen?

CUDA Source Profiler:

— Find where most of the uncoalesced requests happen

Tip: Sort “L2 Global Transactions Executed”


Access Pattern

Double precision numbers: 64-bit

Per Warp:

— Up to 32 L1 Transactions / Ideal case: 2 Transactions

— Up to 32 L2 Transactions / Ideal case: 8 Transactions

[Figure: the per-thread 64-bit loads of a warp (Thread 0, Thread 1, Thread 2, ...) scatter across separate 32B L2 transactions within 128B L1 transactions]
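The deck does not reproduce the kernel source; as a point of reference, here is a minimal, hypothetical sketch of a one-thread-per-row CSR SpMV in double precision that produces this kind of pattern (the names spmv_row_per_thread, row_ptr, cols, vals, x, y are illustrative, not the companion code's). At each step of the loop, the 32 threads of a warp read values from 32 unrelated rows, so one request can need up to 32 transactions.

__global__ void spmv_row_per_thread(int num_rows,
                                    const int    *row_ptr,
                                    const int    *cols,
                                    const double *vals,
                                    const double *x,
                                    double       *y)
{
    // One thread per row of A (CSR storage).
    int row = blockIdx.x * blockDim.x + threadIdx.x;
    if (row >= num_rows)
        return;

    double sum = 0.0;
    for (int k = row_ptr[row]; k < row_ptr[row + 1]; ++k)
        sum += vals[k] * x[cols[k]];   // neighboring threads read far-apart addresses

    y[row] = sum;
}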

Access Pattern

Idea for the next iteration: use the Read-only cache (LDG load)

— On Fermi: use a texture, or configure 48KB of L1


First Modification: Use __ldg

We change the source code:
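The actual change lives in the companion code; a minimal sketch of what it could look like, reusing the illustrative names from the kernel above, is to route the loads of A's values through the read-only data cache with __ldg (SM 3.5):

// Sketch only: A's values go through the read-only (texture) path;
// x and the column indices keep the normal load path.
for (int k = row_ptr[row]; k < row_ptr[row + 1]; ++k)
    sum += __ldg(&vals[k]) * x[cols[k]];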

It is slower: 625.8ms

Kernel                      Time        Speedup
Original version            457.1 ms    -
LDG to load A               625.8 ms    0.73x

First Modification: Use __ldg

Less L2 to SM traffic: 857.1MB transferred (it was 906.2MB)


First Modification: Use __ldg

Why does ~6% less traffic make the kernel 37% slower?

Instruction Efficiency (Eligible Warps per Active Cycle):

The average number of Eligible Warps dropped below 1


First Modification: Use __ldg

There are already “a lot” of Active Warps per Cycle


First Modification: Use __ldg

Warps cannot issue because they have to wait

Warps wait for Texture in 91.1% of the cases


First Modification: Use __ldg

The loads compete for the cache too much

— Low hit rate: 7.7%

Texture requests introduce too much latency

Things to check in those cases:

— Texture Hit Rate: Low means no reuse

— Issue Efficiency and Stall Reasons

It was actually expected: GPU caches are not CPU caches!!!


First Modification: Use __ldg

Other accesses may benefit from LDGs

Memory blocks accessed several times by several threads

How can we detect it?

— Source code analysis

— There is no way to detect it from Nsight


First Modification: Use __ldg

We change the source code

— In y = Ax, we use __ldg when loading x
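As a sketch with the same illustrative names (not the companion code): only the loads of x go through the read-only cache, since each element of x can be reused by every row that references that column.

// Sketch only: x via __ldg; the reuse across rows gives a high hit rate.
for (int k = row_ptr[row]; k < row_ptr[row + 1]; ++k)
    sum += vals[k] * __ldg(&x[cols[k]]);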

It’s faster: 403.4ms

Kernel                      Time        Speedup
Original version            457.1 ms    -
LDG to load A               625.8 ms    0.73x
LDG to load X               403.4 ms    1.13x

First Modification: Use __ldg

Much less L2 to SM traffic: 774MB (it was 906.2MB)

Good hit rate in Texture Cache: 83%


ITERATION 2


CUDA Launch Summary

spmv_kernel_v2 is still a hot spot, so we profile it

Identify the Main Limiter

Is it limited by the memory bandwidth?

Is it limited by the instruction throughput?

Is it limited by latency?

Identify the Main Limiter

We are still limited by latency

— Low DRAM utilization: 36.48%

— Low IPC: 0.06


Identify the Main Limiter

We are not limited by occupancy

— We have > 6 Eligible Warps per Active Cycle

We are limited by uncoalesced accesses: 48.92% of Replays


Second Strategy: Change Memory Accesses

4 consecutive threads load 4 consecutive elements

Per Warp:

— Up to 8 L1 Transactions / Ideal case: 2 Transactions

— Up to 8 L2 Transactions / Ideal case: 8 Transactions

[Figure: threads 0-3 share one 32B L2 transaction, threads 4-7 the next, threads 8-11 the one after; the groups of 4 consecutive loads fall inside 128B L1 transactions]
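A hedged sketch of this strategy (illustrative names and structure, not the companion code): each group of THREADS_PER_ROW consecutive threads cooperates on one row, so consecutive threads load consecutive 64-bit values, and a small shared-memory reduction combines the partial sums. The group size is kept as a template parameter because later iterations in the deck change it.

#define BLOCK_SIZE 128

template <int THREADS_PER_ROW>   // 4 in this iteration
__global__ void spmv_group_per_row(int num_rows,
                                   const int    *row_ptr,
                                   const int    *cols,
                                   const double *vals,
                                   const double *x,
                                   double       *y)
{
    __shared__ double partial[BLOCK_SIZE];

    int lane = threadIdx.x % THREADS_PER_ROW;                             // position in the group
    int row  = (blockIdx.x * blockDim.x + threadIdx.x) / THREADS_PER_ROW; // one row per group

    double sum = 0.0;
    if (row < num_rows)
    {
        // Lanes 0..THREADS_PER_ROW-1 read consecutive elements of the row,
        // so the loads of a warp coalesce into a few 32B L2 transactions.
        for (int k = row_ptr[row] + lane; k < row_ptr[row + 1]; k += THREADS_PER_ROW)
            sum += vals[k] * __ldg(&x[cols[k]]);
    }

    partial[threadIdx.x] = sum;
    __syncthreads();

    // Reduce the THREADS_PER_ROW partial sums of each group.
    for (int offset = THREADS_PER_ROW / 2; offset > 0; offset /= 2)
    {
        if (lane < offset)
            partial[threadIdx.x] += partial[threadIdx.x + offset];
        __syncthreads();
    }

    if (lane == 0 && row < num_rows)
        y[row] = partial[threadIdx.x];
}

A possible launch for this iteration, with one group of 4 threads per row:

int grid = (num_rows * 4 + BLOCK_SIZE - 1) / BLOCK_SIZE;
spmv_group_per_row<4><<<grid, BLOCK_SIZE>>>(num_rows, row_ptr, cols, vals, x, y);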

Second Strategy: Change Memory Accesses

It’s much faster: 161.7ms

Kernel                      Time        Speedup
Original version            457.1 ms    -
LDG to load A               625.8 ms    0.73x
LDG to load X               403.4 ms    1.13x
Coalescing with 4 Threads   161.7 ms    2.83x

Second Strategy: Change Memory Accesses

We have much fewer Transactions per Request: 5.51 (LD)


Second Strategy: Change Memory Accesses

Much less traffic from L2: 230.5MB (it was 774MB)

Much less DRAM traffic: 210.1MB (it was 503.1MB)


ITERATION 3


CUDA Launch Summary

spmv_kernel_v3 is still a hot spot, so we profile it

Identify the Main Limiter

Is it limited by the memory bandwidth?

Is it limited by the instruction throughput?

Is it limited by latency?

Identify the Main Limiter

We are still limited by latency

— Low DRAM utilization: 37.67%

— Low IPC: 0.31


Latency

Occupancy: 57.80% Achieved / 62.50% Theoretical

Eligible Warps per Active Cycle: ~3 on average

We need 4 eligible warps on GK110, so ~3 could be an issue


Latency

Occupancy is limited by the number of registers

We constrain register usage with __launch_bounds__(BLOCK_SIZE, MIN_BLOCKS)

It does not really help
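For reference, this is how the qualifier is placed on a kernel; the values and the kernel signature are illustrative, not the companion code's:

#define BLOCK_SIZE 128   // maximum threads per block
#define MIN_BLOCKS 8     // minimum resident blocks per SM

// The compiler caps register usage so that MIN_BLOCKS blocks of BLOCK_SIZE
// threads can be resident on one multiprocessor at the same time.
__global__ void __launch_bounds__(BLOCK_SIZE, MIN_BLOCKS)
spmv_kernel(int num_rows, const int *row_ptr, const int *cols,
            const double *vals, const double *x, double *y)
{
    /* ... unchanged kernel body ... */
}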

Latency

Memory Accesses:

— Load: 5.51 Transactions per Request

— Store: 2 Transactions per Request

We still have too many uncoalesced accesses


Latency

We still have too many uncoalesced accesses

— Nearly 70% of Instruction Serialization (Replays)

— Stall Reasons: 48.1% due to Data Requests


Where Do Those Accesses Happen?

Same lines of code as before


What Can We Do?

In our kernel: 4 threads per row of the matrix A

New approach: 1 warp of threads per row of the matrix A

[Figure: with 4 threads per row, each group of 4 threads (0-3, 4-7, 8-11, ...) touches its own 32B L2 transaction; with 1 warp per row, threads 0, 1, 2, ..., 31 (some possibly idle) fill consecutive 32B L2 transactions]
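Under the same hypothetical sketch as in iteration 3, this change is only a larger group size: a full warp cooperates on each row.

int grid = (num_rows * 32 + BLOCK_SIZE - 1) / BLOCK_SIZE;
spmv_group_per_row<32><<<grid, BLOCK_SIZE>>>(num_rows, row_ptr, cols, vals, x, y);

With 32 lanes per row, the loads of the row's values and column indices are contiguous across the warp, but lanes beyond the row length sit idle, which is what the control flow numbers on the following slides pick up.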

One Warp Per Row

It’s faster: 140.4ms

Kernel                      Time        Speedup
Original version            457.1 ms    -
LDG to load A               625.8 ms    0.73x
LDG to load X               403.4 ms    1.13x
Coalescing with 4 Threads   161.7 ms    2.83x
1 Warp per Row              140.4 ms    3.26x

One Warp Per Row

Much fewer Transactions Per Request: 1.37 (LD) / 1 (ST)


ITERATION 4


One Warp Per Row

spmv_kernel_v4 is the hot spot


One Warp Per Row

DRAM utilization: 40.64%

IPC: 1.57

We are still limited by latency

One Warp Per Row

Occupancy and memory accesses are OK (not shown)

Control Flow Efficiency: 86.59%

Only 72.5% of the threads are active in the expensive loop
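In the same hypothetical sketch, the fix amounts to giving each row only half a warp, so fewer lanes sit idle on short rows:

int grid = (num_rows * 16 + BLOCK_SIZE - 1) / BLOCK_SIZE;
spmv_group_per_row<16><<<grid, BLOCK_SIZE>>>(num_rows, row_ptr, cols, vals, x, y);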

One Half Warp Per Row

It is faster: 114.7ms

Kernel                      Time        Speedup
Original version            457.1 ms    -
LDG to load A               625.8 ms    0.73x
LDG to load X               403.4 ms    1.13x
Coalescing with 4 Threads   161.7 ms    2.83x
1 Warp per Row              140.4 ms    3.26x
½ Warp per Row              114.7 ms    3.99x

ITERATION 5


One Half Warp Per Row

DRAM utilization: 49.79%

IPC: 1.34

We are still limited by latency

One Half Warp Per Row

Memory accesses are good enough

Occupancy could be an issue: ~3.2 Eligible Warps per Cycle

— Occupancy is limited by registers

But forcing the register count does not improve performance

One Half Warp Per Row

Branch divergence induces latency

We have 23.1% divergent branches


One Half Warp Per Row

We fix branch divergence

It is faster: 91.2ms

Kernel                      Time        Speedup
Original version            457.1 ms    -
LDG to load A               625.8 ms    0.73x
LDG to load X               403.4 ms    1.13x
Coalescing with 4 Threads   161.7 ms    2.83x
1 Warp per Row              140.4 ms    3.26x
½ Warp per Row              114.7 ms    3.99x
No divergence                91.2 ms    5.01x

One Half Warp Per Row

DRAM utilization: 62.57%

IPC: 1.56

We achieve much better bandwidth utilization

So Far

In successive iterations, we have:

— Improved caching using __ldg (use with care)

— Improved coalescing

— Improved control flow efficiency

— Improved branching

Our new kernel is 5x faster than our first implementation

Nsight helped us a lot


ITERATION 6


Next Kernel

We are satisfied with the performance of spmv_kernel

We move to the next kernel: jacobi_smooth


What Have You Seen?

An iterative method to optimize your GPU code

— Trace your application

— Identify the hot spot and profile it

— Identify the performance limiter

— Optimize the code

— Iterate

A way to conduct that method with Nsight VSE

