CUDA Optimization with NVIDIA® Nsight™ Visual Studio Edition 3.0
Julien Demouth, NVIDIA
What Will You Learn?
An iterative method to optimize your GPU code
A way to conduct that method with Nsight VSE
APOD Method, Session S3008, Cliff Woolley
https://developer.nvidia.com/content/assess-parallelize-optimize-deploy
What Does the Application Do?
It does not matter!!!
We care about memory accesses, instructions, latency, …
Companion code (with a different input file)
https://github.com/jdemouth/nsight-gtc2013
Our Method
Trace your application
Identify the hot spot and profile it
Identify the performance limiter
— Memory Bandwidth
— Instruction Throughput
— Latency
Optimize the code
Iterate
Our Environment
We use
— NVIDIA Tesla K20c (GK110, SM 3.5), ECC OFF,
— Microsoft Windows 7 x64,
— Microsoft Visual Studio 2010 SP1,
— CUDA 5.0,
— Driver 310.34,
— NVIDIA Nsight 3.0.
ITERATION 1
Trace the Application
CUDA Launch Summary
spmv_kernel_v0 is a hot spot, let’s start here!!!
Kernel                      Time       Speedup
Original version            457.1ms    1.00x
Profile the Most Expensive Kernel
CUDA Launches
Identify the Main Limiter
Is it limited by the memory bandwidth?
Is it limited by the instruction throughput?
Is it limited by latency?
Memory Bandwidth
Utilization of DRAM Bandwidth: 37.67%
We are not limited by the memory bandwidth
Instruction Throughput
Instructions Per Clock (IPC): 0.04
We are not limited by instruction throughput
Latency
First two things to check:
— Occupancy
— Memory accesses (coalesced vs. uncoalesced)
Other things to check (if needed):
— Control flow efficiency (branching, idle threads)
— Divergence
— Bank conflicts in shared memory
Latency
Occupancy: 47.58% Achieved / 50% Theoretical
Eligible Warps per Active Cycle: >4.7 on average
On GK110, 4 Eligible Warps are enough: Not an issue
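For reference (our arithmetic, not shown in the deck): GK110 supports up to 64 resident warps per SM, so a 50% theoretical occupancy means some resource limit caps the kernel at

    32 resident warps / 64 max warps per SM = 50%

and 47.58% achieved means the SMs actually run close to that cap.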
Latency
Memory Accesses:
— Load: 22 Transactions per Request
— Store: 8 Transactions per Request
We have too many uncoalesced accesses!!!
Where Do Those Accesses Happen?
CUDA Source Profiler:
— Find where most of the uncoalesced requests happen
Tip: sort by “L2 Global Transactions Executed”
Access Pattern
Double precision numbers: 64-bit
Per Warp:
— Up to 32 L1 Transactions / Ideal case: 2 Transactions
— Up to 32 L2 Transactions / Ideal case: 8 Transactions
[Diagram: with one thread per row, Thread 0, Thread 1, Thread 2, ... each touch their own 128B L1 line and 32B L2 transactions, so a warp issues up to one transaction per thread]
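To make the pattern concrete, here is a minimal CSR-style sketch of a one-thread-per-row kernel; the identifiers (spmv_kernel_v0, row_offsets, A_cols, A_values) are illustrative, not necessarily the companion code's actual names. Ideally a warp loading 32 doubles needs only 32 × 8B = 256B, i.e. 2 L1 transactions of 128B or 8 L2 transactions of 32B, but here neighboring threads work on different rows and read addresses that are far apart:

    // Hypothetical baseline: one thread per matrix row (CSR layout).
    // Thread t starts at row_offsets[t], thread t+1 at row_offsets[t+1],
    // so at each loop iteration the warp's loads of A_values are strided.
    __global__ void spmv_kernel_v0(int n_rows, const int *row_offsets,
                                   const int *A_cols, const double *A_values,
                                   const double *x, double *y)
    {
        int row = blockIdx.x * blockDim.x + threadIdx.x;
        if (row < n_rows) {
            double sum = 0.0;
            for (int i = row_offsets[row]; i < row_offsets[row + 1]; ++i)
                sum += A_values[i] * x[A_cols[i]];
            y[row] = sum;
        }
    }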
Access Pattern
For the next iteration:
Idea: Use the Read-only cache (LDG load)
— On Fermi: use a texture instead, or configure 48KB for L1
[Diagram: the same per-thread accesses, now served as 32B transactions through the read-only cache rather than 128B L1 lines]
First Modification: Use __ldg
We change the source code:
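The slide shows the edited source; as a hedged sketch of the same idea, reusing the hypothetical names from the baseline above, only the load of A changes:

    // Route loads of A_values through the read-only (texture) cache.
    // __ldg requires SM 3.5, e.g. the K20c used here; names are hypothetical.
    for (int i = row_offsets[row]; i < row_offsets[row + 1]; ++i)
        sum += __ldg(&A_values[i]) * x[A_cols[i]];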
It is slower: 625.8ms
Kernel                      Time       Speedup
Original version            457.1ms    1.00x
LDG to load A               625.8ms    0.73x
First Modification: Use __ldg
Less L2-to-SM traffic: 857.1MB transferred (it was 906.2MB)
First Modification: Use __ldg
Why does 6% less traffic lead to a 37% performance loss?
Instruction Efficiency (Eligible Warps per Active Cycle):
The average number of Eligible Warps dropped below 1
First Modification: Use __ldg
There are already “a lot” of Active Warps per Cycle
First Modification: Use __ldg
Warps cannot issue because they have to wait
Warps wait for Texture in 91.1% of the cases
First Modification: Use __ldg
The loads compete for the cache too much
— Low hit rate: 7.7%
Texture requests introduce too much latency
Things to check in those cases:
— Texture Hit Rate: Low means no reuse
— Issue Efficiency and Stall Reasons
It was actually expected: GPU caches are not CPU caches!!!
First Modification: Use __ldg
Other accesses may benefit from LDGs
Look for memory blocks that are read several times by several threads
How can we detect it?
— Source code analysis
— There is no way to detect it from Nsight
First Modification: Use __ldg
We change the source code
— In y = Ax, we use __ldg when loading x
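Again a sketch with the same illustrative names; x is the better candidate because many rows read the same x[j], so there is real reuse for the cache to exploit:

    // Only x goes through the read-only cache now; A_values is streamed
    // once and gains nothing from caching (hypothetical names).
    for (int i = row_offsets[row]; i < row_offsets[row + 1]; ++i)
        sum += A_values[i] * __ldg(&x[A_cols[i]]);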
It’s faster: 403.4ms
Kernel                      Time       Speedup
Original version            457.1ms    1.00x
LDG to load A               625.8ms    0.73x
LDG to load X               403.4ms    1.13x
First Modification: Use __ldg
Much less L2-to-SM traffic: 774MB (it was 906.2MB)
Good hit rate in Texture Cache: 83%
ITERATION 2
CUDA Launch Summary
spmv_kernel_v2 is still a hot spot, so we profile it
Identify the Main Limiter
Is it limited by the memory bandwidth?
Is it limited by the instruction throughput?
Is it limited by latency?
Identify the Main Limiter
We are still limited by latency
— Low DRAM utilization: 36.48%
— Low IPC: 0.06
Identify the Main Limiter
We are not limited by Occupancy
— We have > 6 Eligible Warps per Active Cycle
We are limited by uncoalesced accesses: 48.92% of Replays
Second Strategy: Change Memory Accesses
4 consecutive threads load 4 consecutive elements (see the sketch below)
Per Warp:
— Up to 8 L1 Transactions / Ideal case: 2 Transactions
— Up to 8 L2 Transactions / Ideal case: 8 Transactions
[Diagram: threads 0-3, 4-7, 8-11, ... each read 4 consecutive doubles, so a warp's loads fall into a handful of 128B L1 / 32B L2 transactions instead of one per thread]
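A minimal sketch of this variant (same illustrative names). The 4 partial sums are combined with warp shuffles; we show the modern __shfl_xor_sync, since on CUDA 5.0 (used in the deck) shuffling a double requires splitting it into hi/lo words or using shared memory:

    // Hypothetical "4 threads per row" SpMV: each 4-thread group reads
    // 4 consecutive nonzeros per iteration, then reduces its partial sums.
    // Launch with at least n_rows * 4 threads, block size a multiple of 32.
    __global__ void spmv_kernel_v3(int n_rows, const int *row_offsets,
                                   const int *A_cols, const double *A_values,
                                   const double *x, double *y)
    {
        int tid  = blockIdx.x * blockDim.x + threadIdx.x;
        int row  = tid / 4;                  // 4 consecutive threads per row
        int lane = tid % 4;
        double sum = 0.0;
        if (row < n_rows)
            for (int i = row_offsets[row] + lane; i < row_offsets[row + 1]; i += 4)
                sum += A_values[i] * __ldg(&x[A_cols[i]]);
        // Tree reduction of the 4 partial sums (all lanes participate).
        sum += __shfl_xor_sync(0xffffffffu, sum, 2);
        sum += __shfl_xor_sync(0xffffffffu, sum, 1);
        if (lane == 0 && row < n_rows)
            y[row] = sum;
    }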
Second Strategy: Change Memory Accesses
It’s much faster: 161.7ms
Kernel                      Time       Speedup
Original version            457.1ms    1.00x
LDG to load A               625.8ms    0.73x
LDG to load X               403.4ms    1.13x
Coalescing with 4 Threads   161.7ms    2.83x
Second Strategy: Change Memory Accesses
We have far fewer Transactions per Request: 5.51 (LD)
Second Strategy: Change Memory Accesses
Much less traffic from L2: 230.5MB (it was 774MB)
Much less DRAM traffic: 210.1MB (it was 503.1MB)
ITERATION 3
CUDA Launch Summary
spmv_kernel_v3 is still a hot spot, so we profile it
Identify the Main Limiter
Is it limited by the memory bandwidth?
Is it limited by the instruction throughput?
Is it limited by latency?
Identify the Main Limiter
We are still limited by latency
— Low DRAM utilization: 37.67%
— Low IPC: 0.31
Latency
Occupancy: 57.80% Achieved / 62.50% Theoretical
Eligible Warps per Active Cycle: ~3 on average
We need 4 warps on GK110, so ~3 could be an issue
Latency
Occupancy is limited by the number of registers
We change the number of registers with __launch_bounds__
__launch_bounds__(BLOCK_SIZE, MIN_BLOCKS)
It does not really help
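For reference, the qualifier goes on the kernel definition; BLOCK_SIZE and MIN_BLOCKS below are illustrative values, not the deck's:

    #define BLOCK_SIZE 256
    #define MIN_BLOCKS 4   // illustrative values only, not the deck's

    // Asks the compiler to cap register usage so that at least MIN_BLOCKS
    // blocks of BLOCK_SIZE threads can be resident on an SM at once.
    __global__ void __launch_bounds__(BLOCK_SIZE, MIN_BLOCKS)
    spmv_kernel_v3(int n_rows, const int *row_offsets, const int *A_cols,
                   const double *A_values, const double *x, double *y)
    {
        // ... kernel body unchanged ...
    }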
Latency
Memory Accesses:
— Load: 5.51 Transactions per Request
— Store: 2 Transactions per Request
We still have too many uncoalesced accesses
Latency
We still have too many uncoalesced accesses
— Nearly 70% of Instruction Serialization (Replays)
— Stall Reasons: 48.1% due to Data Requests
Where Do Those Accesses Happen?
Same lines of code as before
What Can We Do?
In our kernel: 4 threads per row of the matrix A
New approach: 1 warp of threads per row of the matrix A (see the sketch below)
[Diagram: with 4 threads per row, a warp still spans several rows and issues scattered 32B L2 transactions; with one warp per row (threads 0, 1, ..., 31, some possibly idle), the 32 consecutive loads coalesce into back-to-back 32B L2 transactions]
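A sketch of the warp-per-row indexing, with the same illustrative names (the half-warp variant used later is the same code with a group size of 16):

    // Hypothetical "one warp per row" SpMV: 32 consecutive threads walk one
    // row together, so their loads of A_values fall in consecutive 32B lines.
    // Launch with at least n_rows * 32 threads, block size a multiple of 32.
    __global__ void spmv_kernel_v4(int n_rows, const int *row_offsets,
                                   const int *A_cols, const double *A_values,
                                   const double *x, double *y)
    {
        int tid  = blockIdx.x * blockDim.x + threadIdx.x;
        int row  = tid / 32;
        int lane = tid % 32;
        double sum = 0.0;
        if (row < n_rows)
            for (int i = row_offsets[row] + lane; i < row_offsets[row + 1]; i += 32)
                sum += A_values[i] * __ldg(&x[A_cols[i]]);
        // Full-warp tree reduction (modern __shfl_xor_sync shown; CUDA 5.0
        // would need a hi/lo split of the double or shared memory instead).
        for (int offset = 16; offset > 0; offset /= 2)
            sum += __shfl_xor_sync(0xffffffffu, sum, offset);
        if (lane == 0 && row < n_rows)
            y[row] = sum;
    }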
One Warp Per Row
It’s faster: 140.4ms
Kernel                      Time       Speedup
Original version            457.1ms    1.00x
LDG to load A               625.8ms    0.73x
LDG to load X               403.4ms    1.13x
Coalescing with 4 Threads   161.7ms    2.83x
1 Warp per Row              140.4ms    3.26x
One Warp Per Row
Far fewer Transactions per Request: 1.37 (LD) / 1 (ST)
ITERATION 4
One Warp Per Row
spmv_kernel_v4 is the hot spot
One Warp Per Row
DRAM utilization: 40.64%
IPC: 1.57
We are still limited by latency
One Warp Per Row
Occupancy and memory accesses are OK (not shown)
Control Flow Efficiency: 86.59%
Only 72.5% of the threads are active in the expensive loop
One Half Warp Per Row
Many rows are too short to keep a full warp busy, so we assign half a warp (16 threads) per row
It is faster: 114.7ms
Kernel                      Time       Speedup
Original version            457.1ms    1.00x
LDG to load A               625.8ms    0.73x
LDG to load X               403.4ms    1.13x
Coalescing with 4 Threads   161.7ms    2.83x
1 Warp per Row              140.4ms    3.26x
½ Warp per Row              114.7ms    3.99x
ITERATION 5
One Half Warp Per Row
DRAM utilization: 49.79%
IPC: 1.34
We are still limited by latency
One Half Warp Per Row
Memory accesses are good enough
Occupancy could be an issue: ~3.2 Eligible Warps per Cycle
— Occupancy is limited by registers
But forcing the register count does not improve performance
One Half Warp Per Row
Branch divergence induces latency
We have 23.1% divergent branches
One Half Warp Per Row
We fix branch divergence
It is faster: 91.2ms
Kernel                      Time       Speedup
Original version            457.1ms    1.00x
LDG to load A               625.8ms    0.73x
LDG to load X               403.4ms    1.13x
Coalescing with 4 Threads   161.7ms    2.83x
1 Warp per Row              140.4ms    3.26x
½ Warp per Row              114.7ms    3.99x
No divergence               91.2ms     5.01x
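The deck does not show the exact change, so the following is a generic illustration only: one common way to remove divergence is to replace a data-dependent branch with a select, so all lanes of a warp follow the same instruction stream:

    // Generic illustration, not the deck's actual fix.
    // Divergent: lanes where cond differs serialize the warp.
    __device__ double accumulate_divergent(double sum, bool cond,
                                           double a, double b, double c)
    {
        if (cond) sum += a * b;
        else      sum += c;
        return sum;
    }

    // Branch-free: the compiler can emit a predicated select instead.
    __device__ double accumulate_uniform(double sum, bool cond,
                                         double a, double b, double c)
    {
        return sum + (cond ? a * b : c);
    }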
One Half Warp Per Row
DRAM utilization: 62.57%
IPC: 1.56
We achieve much better bandwidth
So Far
We have successively:
— Improved caching using __ldg (use with care)
— Improved coalescing
— Improved control flow efficiency
— Improved branching
Our new kernel is 5x faster than our first implementation
Nsight helped us a lot
ITERATION 6
Next Kernel
We are satisfied with the performance of spmv_kernel
We move to the next kernel: jacobi_smooth
What Have You Seen?
An iterative method to optimize your GPU code
— Trace your application
— Identify the hot spot and profile it
— Identify the performance limiter
— Optimize the code
— Iterate
A way to conduct that method with Nsight VSE