CUDA Optimization with NVIDIA® Nsight™ Visual Studio Edition 3.0 Julien Demouth, NVIDIA

Slides: on-demand.gputechconf.com/gtc/2013/presentations/S... · During the session we will use Nsight Visual Studio Edition


What Will You Learn?

An iterative method to optimize your GPU code

A way to conduct that method with Nsight VSE

APOD Method, Session S3008, Cliff Woolley

https://developer.nvidia.com/content/assess-parallelize-optimize-deploy


What Does the Application Do?

It does not matter!!!

We care about memory accesses, instructions, latency, …

Companion code (with a different input file)

https://github.com/jdemouth/nsight-gtc2013


Our Method

Trace your application

Identify the hot spot and profile it

Identify the performance limiter

— Memory Bandwidth

— Instruction Throughput

— Latency

Optimize the code

Iterate


Our Environment

We use

— Nvidia Tesla K20c (GK110, SM 3.5), ECC OFF,

— Microsoft Windows 7 x64,

— Microsoft Visual Studio 2010 SP1,

— CUDA 5.0,

— Driver 310.34,

— Nvidia Nsight 3.0.


ITERATION 1


Trace the Application


CUDA Launch Summary

spmv_kernel_v0 is a hot spot, let’s start here!!!

Kernel                     Time      Speedup
Original version           457.1ms

Profile the Most Expensive Kernel


CUDA Launches


Identify the Main Limiter

Is it limited by the memory bandwidth?

Is it limited by the instruction throughput?

Is it limited by latency?

Memory Bandwidth

Utilization of DRAM Bandwidth: 37.67%

We are not limited by the memory bandwidth


Instruction Throughput

Instructions Per Clock (IPC): 0.04

We are not limited by instruction throughput
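The three questions of the method can be condensed into a small decision helper. The sketch below is written in Python rather than CUDA because it only reasons about profiler numbers. The function name and both thresholds are hypothetical: the slides only tell us that 37.67% DRAM utilization and an IPC of 0.04 count as low, so treat the cut-off values as placeholders to tune for your GPU.

```python
def classify_limiter(dram_util_pct, ipc,
                     dram_threshold_pct=60.0, ipc_threshold=4.0):
    """Return a rough guess at the main limiter of a kernel.

    Both thresholds are illustrative assumptions, not values from Nsight.
    """
    if dram_util_pct >= dram_threshold_pct:
        return "memory bandwidth"
    if ipc >= ipc_threshold:
        return "instruction throughput"
    # Neither the memory system nor the issue slots are busy:
    # the kernel spends most of its time waiting.
    return "latency"


if __name__ == "__main__":
    # Numbers reported for the original kernel in this session.
    print(classify_limiter(37.67, 0.04))  # latency
```

With neither limit reached, latency is what remains, which is why the next slides look at occupancy and memory accesses.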


Latency

First two things to check:

— Occupancy

— Memory accesses (coalesced/uncoalesced accesses)

Other things to check (if needed):

— Control flow efficiency (branching, idle threads)

— Divergence

— Bank conflicts in shared memory


Latency

Occupancy: 47.58% Achieved / 50% Theoretical

Eligible Warps per Active Cycle: >4.7 on average

On GK110, 4 Eligible Warps are enough: Not an issue

Latency

Memory Accesses:

— Load: 22 Transactions per Request

— Store: 8 Transactions per Request

We have too many uncoalesced accesses!!!


Where Do Those Accesses Happen?

CUDA Source Profiler:

— Find where most of the uncoalesced requests happen

Tip: Sort “L2 Global Transactions Executed”


Access Pattern

Double precision numbers: 64-bit

Per Warp:

— Up to 32 L1 Transactions / Ideal case: 2 Transactions

— Up to 32 L2 Transactions / Ideal case: 8 Transactions
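The per-warp counts above can be checked with a small model. This is a Python sketch, not kernel code: it takes the byte address read by each of the 32 threads of a warp and counts how many 128B L1 lines and 32B L2 segments are touched. The function name and the 1024-byte row pitch are assumptions made for the illustration.

```python
def transactions_per_request(byte_addresses, access_size=8):
    """Count the 128B L1 lines and 32B L2 segments touched by one warp request."""
    l1_lines = set()
    l2_segments = set()
    for addr in byte_addresses:
        # Cover every byte of the access so that a misaligned access
        # straddling a boundary is counted in both lines/segments.
        first, last = addr, addr + access_size - 1
        l1_lines.update(range(first // 128, last // 128 + 1))
        l2_segments.update(range(first // 32, last // 32 + 1))
    return len(l1_lines), len(l2_segments)


WARP_SIZE = 32
DOUBLE = 8  # bytes per double precision number

# Ideal case: 32 threads read 32 consecutive doubles (256 contiguous bytes).
coalesced = [lane * DOUBLE for lane in range(WARP_SIZE)]

# Worst case: each thread reads from its own row. The 1024-byte row pitch is
# a made-up value; any multiple of 128 bytes gives the same count.
strided = [lane * 1024 for lane in range(WARP_SIZE)]

print(transactions_per_request(coalesced))  # (2, 8)
print(transactions_per_request(strided))    # (32, 32)
```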

[Figure: Thread 0, Thread 1, Thread 2 over L2 Transactions (32B) and an L1 Transaction (128B)]

Access Pattern

Next iteration:

Idea: Use the Read-only cache (LDG load)

— On Fermi: Use a texture or Use 48KB for L1

[Figure: Thread 0, Thread 1, Thread 2 over L2 Transactions (32B) and an L1 Transaction (128B)]

First Modification: Use __ldg

We change the source code: the matrix A is loaded with __ldg

It is slower: 625.8ms

Kernel                     Time      Speedup
Original version           457.1ms
LDG to load A              625.8ms   0.73x

First Modification: Use __ldg

Less L2 to SM traffic: 857.1MB transferred (it was 906.2MB)


First Modification: Use __ldg

Why does about 5% less traffic make the kernel 37% slower?

Instruction Efficiency (Eligible Warps per Active Cycle):

The average number of Eligible Warps dropped below 1
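The two percentages in the question on this slide can be recomputed from figures reported in this session: 906.2MB vs 857.1MB of L2 to SM traffic, and 457.1ms vs 625.8ms of kernel time. The slide quotes them rounded. A short Python check, where the helper name is ours:

```python
def percent_change(before, after):
    """Signed relative change from `before` to `after`, in percent."""
    return (after - before) / before * 100.0


traffic = percent_change(906.2, 857.1)  # L2 to SM traffic, in MB
runtime = percent_change(457.1, 625.8)  # kernel time, in ms

print(f"traffic: {traffic:+.1f}%")   # traffic: -5.4%
print(f"run time: {runtime:+.1f}%")  # run time: +36.9%
```

A small saving in traffic is paid for with a large increase in run time, so the explanation has to come from latency rather than from the amount of data moved.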


First Modification: Use __ldg

There are already “a lot” of Active Warps per Cycle


First Modification: Use __ldg

Warps cannot issue because they have to wait

Warps wait for Texture in 91.1% of the cases


First Modification: Use __ldg

The loads compete for the cache too much

— Low hit rate: 7.7%

Texture requests introduce too much latency

Things to check in those cases:

— Texture Hit Rate: Low means no reuse

— Issue Efficiency and Stall Reasons

It was actually expected: GPU caches are not CPU caches!!!


First Modification: Use __ldg

Other accesses may benefit from LDGs

Memory blocks accessed several times by several threads

How can we detect it?

— Source code analysis

— There is no way to detect it from Nsight


First Modification: Use __ldg

We change the source code

— In y = Ax, we use __ldg when loading x

It’s faster: 403.4ms

Kernel                     Time      Speedup
Original version           457.1ms
LDG to load A              625.8ms   0.73x
LDG to load X              403.4ms   1.13x

First Modification: Use __ldg

Much less L2 to SM traffic: 774MB (it was 906.2MB)

Good hit rate in Texture Cache: 83%


ITERATION 2


CUDA Launch Summary

spmv_kernel_v2 is still a hot spot, so we profile it

Identify the Main Limiter

Is it limited by the memory bandwidth?

Is it limited by the instruction throughput?

Is it limited by latency?

Identify the Main Limiter

We are still limited by latency

— Low DRAM utilization: 36.48%

— Low IPC: 0.06


Identify the Main Limiter

We are not limited by the Occupancy

— We have > 6 Eligible Warps per Active Cycle

We are limited by uncoalesced accesses: 48.92% of Replays


Second Strategy: Change Memory Accesses

4 consecutive threads load 4 consecutive elements

Per Warp:

— Up to 8 L1 Transactions / Ideal case: 2 Transactions

— Up to 8 L2 Transactions / Ideal case: 8 Transactions
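To see where the "up to 8" comes from, we can emulate how the lanes of a warp are mapped to the values of A. The Python sketch below is a model, not the kernel from the companion code: it assumes a CSR-like layout where every row stores the same number of values back to back, which is a simplification, and the function name is hypothetical.

```python
WARP_SIZE = 32
DOUBLE = 8  # bytes per matrix value


def transactions_for_mapping(threads_per_row, nnz_per_row, iteration=0):
    """Count (L1 lines, L2 segments) touched when one warp loads values of A.

    `threads_per_row` consecutive lanes cooperate on one row and read
    consecutive values of that row in each loop iteration.
    """
    l1_lines, l2_segments = set(), set()
    for lane in range(WARP_SIZE):
        row = lane // threads_per_row
        col = lane % threads_per_row + iteration * threads_per_row
        if col >= nnz_per_row:
            continue  # this lane is idle in this iteration
        address = (row * nnz_per_row + col) * DOUBLE
        l1_lines.add(address // 128)
        l2_segments.add(address // 32)
    return len(l1_lines), len(l2_segments)


# 32 values per row is an arbitrary choice for the illustration.
for threads in (1, 4, 32):
    print(threads, transactions_for_mapping(threads, nnz_per_row=32))
# 1 (32, 32)
# 4 (8, 8)
# 32 (2, 8)
```

In this model, 4 threads per row cut the worst case from 32 to 8 transactions. Only a full warp reading one row reaches the ideal 2 L1 transactions.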

[Figure: Threads 0, 1, 2, 3; Threads 4, 5, 6, 7; Threads 8, 9, 10, 11 over L2 Transactions (32B) and an L1 Transaction (128B)]

Second Strategy: Change Memory Accesses

It’s much faster: 161.7ms

Kernel                     Time      Speedup
Original version           457.1ms
LDG to load A              625.8ms   0.73x
LDG to load X              403.4ms   1.13x
Coalescing with 4 Threads  161.7ms   2.83x

Second Strategy: Change Memory Accesses

We have far fewer Transactions per Request: 5.51 (LD)

Second Strategy: Change Memory Accesses

Much less traffic from L2: 230.5MB (it was 774MB)

Much less DRAM traffic: 210.1MB (it was 503.1MB)


ITERATION 3


CUDA Launch Summary

spmv_kernel_v3 is still a hot spot, so we profile it

Identify the Main Limiter

Is it limited by the memory bandwidth?

Is it limited by the instruction throughput?

Is it limited by latency?

Identify the Main Limiter

We are still limited by latency

— Low DRAM utilization: 37.67%

— Low IPC: 0.31


Latency

Occupancy: 57.80% Achieved / 62.50% Theoretical

Eligible Warps per Active Cycle: ~3 on average

We need 4 warps on GK110, so ~3 could be an issue


Latency

Occupancy is limited by the number of registers
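How registers cap occupancy can be sketched with a simplified model. The Python code below uses the GK110 limits of 65536 registers and 64 resident warps per SM. It ignores register allocation granularity, shared memory and the limit on blocks per SM, so it is an approximation of what the occupancy calculator does. The register counts and the block size in the example are hypothetical: the slides do not give them, and we picked values that happen to reproduce the 50% and 62.50% theoretical figures quoted in this session.

```python
REGISTERS_PER_SM = 65536  # GK110 (SM 3.5)
MAX_WARPS_PER_SM = 64     # GK110 (SM 3.5)
WARP_SIZE = 32


def theoretical_occupancy(regs_per_thread, block_size):
    """Register-limited occupancy of one SM (simplified model)."""
    warps_per_block = (block_size + WARP_SIZE - 1) // WARP_SIZE
    regs_per_block = regs_per_thread * WARP_SIZE * warps_per_block
    # Whole blocks only: a block is resident entirely or not at all.
    blocks = REGISTERS_PER_SM // regs_per_block
    warps = min(blocks * warps_per_block, MAX_WARPS_PER_SM)
    return warps / MAX_WARPS_PER_SM


# Hypothetical register counts and block size (not taken from the slides).
print(theoretical_occupancy(64, 128))  # 0.5
print(theoretical_occupancy(48, 128))  # 0.625
print(theoretical_occupancy(32, 128))  # 1.0
```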

We change the number of registers with __launch_bounds__

__launch_bounds__(BLOCK_SIZE, MIN_BLOCKS)

It does not really help

Latency

Memory Accesses:

— Load: 5.51 Transactions per Request

— Store: 2 Transactions per Request

We still have too many uncoalesced accesses


Latency

We still have too many uncoalesced accesses

— Nearly 70% of Instruction Serialization (Replays)

— Stall Reasons: 48.1% due to Data Requests


Where Do Those Accesses Happen?

Same lines of code as before


What Can We Do?

In our kernel: 4 threads per row of the matrix A

New approach: 1 warp of threads per row of the matrix A

[Figure: 4 threads per row: Threads 0, 1, 2, 3; Threads 4, 5, 6, 7; Threads 8, 9, 10, 11 over L2 Transactions (32B) and an L1 Transaction (128B). 1 warp per row: Threads 0, 1, 2, 3, …, 31 (some possibly idle) over L2 Transactions (32B)]

One Warp Per Row

It’s faster: 140.4ms

Kernel                     Time      Speedup
Original version           457.1ms
LDG to load A              625.8ms   0.73x
LDG to load X              403.4ms   1.13x
Coalescing with 4 Threads  161.7ms   2.83x
1 Warp per Row             140.4ms   3.26x

One Warp Per Row

Far fewer Transactions Per Request: 1.37 (LD) / 1 (ST)

ITERATION 4


One Warp Per Row

spmv_kernel_v4 is the hot spot


One Warp Per Row

DRAM utilization: 40.64%

IPC: 1.57

We are still limited by latency

One Warp Per Row

Occupancy and memory accesses are OK (not shown)

Control Flow Efficiency: 86.59%

Only 72.5% threads active in the expensive loop
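Idle threads in the per-row loop come from rows that are shorter than the group of threads assigned to them. The Python sketch below models that effect for a given list of row lengths. The row lengths are made up, since the real matrix is not described in the slides, and the function name is ours.

```python
import math

WARP_SIZE = 32


def active_lane_fraction(row_lengths, threads_per_row):
    """Fraction of lane slots doing useful work in the per-row loop.

    Each warp handles WARP_SIZE // threads_per_row rows and keeps
    iterating until its longest row is finished.
    """
    rows_per_warp = WARP_SIZE // threads_per_row
    useful = 0
    issued = 0
    for start in range(0, len(row_lengths), rows_per_warp):
        rows = row_lengths[start:start + rows_per_warp]
        iterations = max(math.ceil(n / threads_per_row) for n in rows)
        useful += sum(rows)                # one slot per stored value
        issued += iterations * WARP_SIZE   # slots the warp actually spends
    return useful / issued


# Made-up row lengths, all shorter than half a warp.
rows = [12, 16, 10, 14]
print(active_lane_fraction(rows, 32))  # 0.40625
print(active_lane_fraction(rows, 16))  # 0.8125
```

Whether a smaller group helps depends on the distribution of row lengths. With short rows, as in this example, half a warp per row wastes far fewer lanes.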


One Half Warp Per Row

It is faster: 114.7ms

Kernel                     Time      Speedup
Original version           457.1ms
LDG to load A              625.8ms   0.73x
LDG to load X              403.4ms   1.13x
Coalescing with 4 Threads  161.7ms   2.83x
1 Warp per Row             140.4ms   3.26x
½ Warp per Row             114.7ms   3.99x

ITERATION 5


One Half Warp Per Row

DRAM utilization: 49.79%

IPC: 1.34

We are still limited by latency

One Half Warp Per Row

Memory accesses are good enough

Occupancy could be an issue: ~3.2 Eligible Warps per Cycle

— Occupancy is limited by registers

But forcing register count does not improve performance

One Half Warp Per Row

Branch divergence induces latency

We have 23.1% of divergent branches


One Half Warp Per Row

We fix branch divergence

It is faster: 91.2ms

Kernel                     Time      Speedup
Original version           457.1ms
LDG to load A              625.8ms   0.73x
LDG to load X              403.4ms   1.13x
Coalescing with 4 Threads  161.7ms   2.83x
1 Warp per Row             140.4ms   3.26x
½ Warp per Row             114.7ms   3.99x
No divergence              91.2ms    5.01x

One Half Warp Per Row

DRAM utilization: 62.57%

IPC: 1.56

We achieve a much better bandwidth

So Far

We have successively:

— Improved caching using __ldg (use with care)

— Improved coalescing

— Improved control flow efficiency

— Improved branching

Our new kernel is 5x faster than our first implementation
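The speedups quoted throughout the session are the original kernel time divided by the time of each version. A short Python check with the kernel times reported on the slides:

```python
BASELINE_MS = 457.1  # original version

timings_ms = [
    ("LDG to load A", 625.8),
    ("LDG to load X", 403.4),
    ("Coalescing with 4 Threads", 161.7),
    ("1 Warp per Row", 140.4),
    ("1/2 Warp per Row", 114.7),
    ("No divergence", 91.2),
]

# Speedup = baseline time / new time, rounded as on the slides.
speedups = {name: round(BASELINE_MS / ms, 2) for name, ms in timings_ms}

for name, ms in timings_ms:
    print(f"{name:<26} {ms:>6.1f}ms  {speedups[name]:.2f}x")
```

A value below 1, as for "LDG to load A", means that the change made the kernel slower.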

Nsight helped us a lot

ITERATION 6


Next Kernel

We are satisfied with the performance of spmv_kernel

We move to the next kernel: jacobi_smooth


What Have You Seen?

An iterative method to optimize your GPU code

— Trace your application

— Identify the hot spot and profile it

— Identify the performance limiter

— Optimize the code

— Iterate

A way to conduct that method with Nsight VSE
